spark-instrumented-optimizer/mllib
zhengruifeng 0a4080ec3b [SPARK-30736][ML] One-Pass ChiSquareTest
### What changes were proposed in this pull request?
1, distributedly gather matrix `contingency` of each feature
2, distributedly compute the results and then collect them back to the driver

### Why are the changes needed?
existing impl is not efficient:
1, it directly collect matrix `contingency` of partial featues to driver and compute the corresponding result on one pass;
2, a matrix  `contingency` of a featues is of size numDistinctValues X numDistinctLabels, so only 1000 matrices can be collected at a time;

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
existing testsuites

Closes #27461 from zhengruifeng/chisq_opt.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-02-17 09:41:38 -06:00
..
benchmarks [SPARK-29297][TESTS] Compare core/mllib module benchmarks in JDK8/11 2019-09-29 21:43:58 -07:00
src [SPARK-30736][ML] One-Pass ChiSquareTest 2020-02-17 09:41:38 -06:00
pom.xml [INFRA] Reverts commit 56dcd79 and c216ef1 2019-12-16 19:57:44 -07:00