spark-instrumented-optimizer/python/pyspark/ml
zhengruifeng e7fa778dc7 [SPARK-30699][ML][PYSPARK] GMM blockify input vectors
### What changes were proposed in this pull request?
1. Add a new param `blockSize`.
2. If `blockSize == 1`, keep the original behavior (code path `trainOnRows`).
3. If `blockSize > 1`, standardize the input vectors and stack them into blocks (as in ALS/MLP), using code path `trainOnBlocks`.
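
The stacking step in item 3 can be illustrated with a minimal pure-Python sketch. It is not the Spark implementation: `stack_into_blocks`, `Vector`, and the sample `rows` are hypothetical names introduced here, and the standardization step is omitted. It only shows how a stream of input vectors could be grouped into blocks of at most `block_size` rows, with `block_size == 1` degenerating to one row per block.

```python
from itertools import islice
from typing import Iterable, Iterator, List

# Hypothetical stand-in for a dense feature vector.
Vector = List[float]


def stack_into_blocks(rows: Iterable[Vector], block_size: int) -> Iterator[List[Vector]]:
    """Group consecutive rows into blocks of at most `block_size` rows.

    Illustrative only: standardization is not performed here.
    """
    if block_size < 1:
        raise ValueError("block_size must be >= 1")
    it = iter(rows)
    while True:
        # Take the next `block_size` rows; the final block may be shorter.
        block = list(islice(it, block_size))
        if not block:
            return
        yield block


# Made-up sample data: five 2-dimensional vectors.
rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
blocks = list(stack_into_blocks(rows, 2))
single_row_blocks = list(stack_into_blocks(rows, 1))
```

With `block_size=2` the five rows form three blocks, the last holding a single row. Processing a block at a time is what lets a matrix-level BLAS routine replace many per-vector operations.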

### Why are the changes needed?
Performance gains on the dense HIGGS dataset:
1. About 45% less RAM used.
2. 3x faster with OpenBLAS.

### Does this PR introduce any user-facing change?
Yes, it adds a new expert param `blockSize`.

### How was this patch tested?
Added test suites.

Closes #27473 from zhengruifeng/blockify_gmm.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-05-12 12:54:03 +08:00
| Name | Last commit | Date |
| --- | --- | --- |
| linalg | [SPARK-28206][PYTHON] Remove the legacy Epydoc in PySpark API documentation | 2019-07-05 10:08:22 -07:00 |
| param | [SPARK-30930][ML] Remove ML/MLLIB DeveloperApi annotations | 2020-03-16 12:41:22 -05:00 |
| tests | [SPARK-31652][ML][PYSPARK] Add ANOVASelector and FValueSelector to PySpark | 2020-05-08 11:02:24 +08:00 |
| __init__.py | [SPARK-29212][ML][PYSPARK] Add common classes without using JVM backend | 2020-03-04 12:20:02 +08:00 |
| base.py | [SPARK-30930][ML] Remove ML/MLLIB DeveloperApi annotations | 2020-03-16 12:41:22 -05:00 |
| classification.py | [SPARK-30659][ML][PYSPARK] LogisticRegression blockify input vectors | 2020-05-07 10:07:24 +08:00 |
| clustering.py | [SPARK-30699][ML][PYSPARK] GMM blockify input vectors | 2020-05-12 12:54:03 +08:00 |
| common.py | [SPARK-17679] [PYSPARK] remove unnecessary Py4J ListConverter patch | 2016-10-03 14:12:03 -07:00 |
| evaluation.py | [SPARK-31012][ML][PYSPARK][DOCS] Updating ML API docs for 3.0 changes | 2020-03-07 11:42:05 -06:00 |
| feature.py | [SPARK-31652][ML][PYSPARK] Add ANOVASelector and FValueSelector to PySpark | 2020-05-08 11:02:24 +08:00 |
| fpm.py | [SPARK-29867][ML][PYTHON] Add __repr__ in Python ML Models | 2019-11-15 21:44:39 -08:00 |
| functions.py | [SPARK-30859][PYSPARK][DOCS][MINOR] Fixed docstring syntax issues preventing proper compilation of documentation | 2020-02-18 16:46:45 +09:00 |
| image.py | [SPARK-25382][SQL][PYSPARK] Remove ImageSchema.readImages in 3.0 | 2019-07-31 14:26:18 +09:00 |
| pipeline.py | [SPARK-31497][ML][PYSPARK] Fix Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot save and load model | 2020-04-26 21:04:14 -07:00 |
| recommendation.py | [SPARK-30662][ML][PYSPARK] Put back the API changes for HasBlockSize in ALS/MLP | 2020-02-09 13:14:30 +08:00 |
| regression.py | [SPARK-31656][ML][PYSPARK] AFT blockify input vectors | 2020-05-08 14:06:36 +08:00 |
| stat.py | [SPARK-31667][ML][PYSPARK] Python side flatten the result dataframe of ANOVATest/ChisqTest/FValueTest | 2020-05-11 09:09:00 -05:00 |
| tree.py | [SPARK-30543][ML][PYSPARK][R] RandomForest add Param bootstrap to control sampling method | 2020-01-23 16:44:13 +08:00 |
| tuning.py | [SPARK-30498][ML][PYSPARK] Fix some ml parity issues between python and scala | 2020-01-14 17:24:17 +08:00 |
| util.py | [SPARK-30930][ML] Remove ML/MLLIB DeveloperApi annotations | 2020-03-16 12:41:22 -05:00 |
| wrapper.py | [SPARK-29212][ML][PYSPARK] Add common classes without using JVM backend | 2020-03-04 12:20:02 +08:00 |