spark-instrumented-optimizer/mllib
zhengruifeng 5853e8b330 [SPARK-29754][ML] LoR/AFT/LiR/SVC use Summarizer instead of MultivariateOnlineSummarizer
### What changes were proposed in this pull request?
1. Change the scope of `ml.SummarizerBuffer` and add a method `createSummarizerBuffer` for it, so it can be used as an aggregator like `MultivariateOnlineSummarizer`.
2. In LoR/AFT/LiR/SVC, use `Summarizer` instead of `MultivariateOnlineSummarizer`.

### Why are the changes needed?
Computing the summary before the learning iterations is a bottleneck in high-dimensional cases, since `MultivariateOnlineSummarizer` computes much more than is needed.
The [ticket](https://issues.apache.org/jira/browse/SPARK-29754) gives an example: with `--driver-memory=4G`, LoR always fails on the KDDA dataset. If we switch to `ml.Summarizer`, then `--driver-memory=3G` is enough to train a model.
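
To illustrate why computing only the needed statistics helps in high dimensions, here is a minimal, self-contained Python sketch. It is not Spark's implementation: the class `SelectiveSummarizer`, its metric names, and `num_buffers` are hypothetical names invented for this example, and the variance uses a simple sum-of-squares formula rather than whatever update rule Spark uses. The point is only that each requested metric costs one array of length `num_features`, so requesting fewer metrics shrinks the per-aggregator state.

```python
import math
from typing import List, Optional, Sequence

# Hypothetical metric names for this sketch; not Spark's metric list.
SUPPORTED = ("mean", "variance", "min", "max")


class SelectiveSummarizer:
    """Toy weighted online summarizer that allocates only the buffers it needs."""

    def __init__(self, num_features: int, metrics: Sequence[str]) -> None:
        unknown = set(metrics) - set(SUPPORTED)
        if unknown:
            raise ValueError(f"unsupported metrics: {sorted(unknown)}")
        self.metrics = tuple(metrics)
        self.weight_sum = 0.0
        n = num_features
        # Each buffer is one array of length n; unrequested ones stay None.
        need_sum = "mean" in self.metrics or "variance" in self.metrics
        self._sum: Optional[List[float]] = [0.0] * n if need_sum else None
        self._sum_sq: Optional[List[float]] = (
            [0.0] * n if "variance" in self.metrics else None
        )
        self._min: Optional[List[float]] = (
            [math.inf] * n if "min" in self.metrics else None
        )
        self._max: Optional[List[float]] = (
            [-math.inf] * n if "max" in self.metrics else None
        )

    def add(self, features: Sequence[float], weight: float = 1.0) -> "SelectiveSummarizer":
        """Fold one weighted instance into the requested buffers."""
        self.weight_sum += weight
        for i, x in enumerate(features):
            if self._sum is not None:
                self._sum[i] += weight * x
            if self._sum_sq is not None:
                self._sum_sq[i] += weight * x * x
            if self._min is not None:
                self._min[i] = min(self._min[i], x)
            if self._max is not None:
                self._max[i] = max(self._max[i], x)
        return self

    def num_buffers(self) -> int:
        """Number of length-n arrays held; a proxy for memory per aggregator."""
        buffers = (self._sum, self._sum_sq, self._min, self._max)
        return sum(b is not None for b in buffers)

    def mean(self) -> List[float]:
        if "mean" not in self.metrics or self._sum is None:
            raise ValueError("mean was not requested")
        return [s / self.weight_sum for s in self._sum]

    def variance(self) -> List[float]:
        """Weighted population variance (simple formula, not numerically robust)."""
        if self._sum is None or self._sum_sq is None:
            raise ValueError("variance was not requested")
        w = self.weight_sum
        return [sq / w - (s / w) ** 2 for s, sq in zip(self._sum, self._sum_sq)]
```

With this sketch, a summarizer asked only for `"mean"` holds one array per aggregator, while one asked for every supported metric holds four; with millions of features, that difference is multiplied across every partial aggregate that is created and merged.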

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing test suites and a manual test in the REPL.

Closes #26396 from zhengruifeng/using_SummarizerBuffer.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2019-11-06 18:19:39 +08:00
..
benchmarks [SPARK-29297][TESTS] Compare core/mllib module benchmarks in JDK8/11 2019-09-29 21:43:58 -07:00
src [SPARK-29754][ML] LoR/AFT/LiR/SVC use Summarizer instead of MultivariateOnlineSummarizer 2019-11-06 18:19:39 +08:00
pom.xml Revert "Prepare Spark release v3.0.0-preview-rc2" 2019-10-30 17:45:44 -07:00