spark-instrumented-optimizer/mllib
zhengruifeng 7f3c8fa42e [SPARK-31032][ML] GMM compute summary and update distributions in one job
### What changes were proposed in this pull request?
1, compute summary and update distributions in one pass;
2, remove logic related to check `shouldDistributeGaussians`

### Why are the changes needed?
In current impl, GMM need to trigger two jobs at one iteration:
1, one to compute summary;
2, if `shouldDistributeGaussians = ((k - 1.0) / k) * numFeatures > 25.0`, trigger another to update distributions;

`shouldDistributeGaussians` is almost true in practice, since numFeatures is likely to be greater than 25.

We can use only one job to impl above computation, by following the logic in `KMeans`: using `reduceByKey` to compute statistics for each center

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
existing testsuites

Closes #27784 from zhengruifeng/gmm_avoid_distri_gaussian.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-03-12 11:21:30 +08:00
..
benchmarks [SPARK-29297][TESTS] Compare core/mllib module benchmarks in JDK8/11 2019-09-29 21:43:58 -07:00
src [SPARK-31032][ML] GMM compute summary and update distributions in one job 2020-03-12 11:21:30 +08:00
pom.xml [SPARK-30950][BUILD] Setting version to 3.1.0-SNAPSHOT 2020-02-25 19:44:31 -08:00