ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
zhengruifeng	e7fa778dc7	[SPARK-30699][ML][PYSPARK] GMM blockify input vectors ### What changes were proposed in this pull request? 1, add new param blockSize; 2, if blockSize==1, keep original behavior, code path trainOnRows; 3, if blockSize>1, standardize and stack input vectors to blocks (like ALS/MLP), code path trainOnBlocks ### Why are the changes needed? performance gain on dense dataset HIGGS: 1, save about 45% RAM; 2, 3X faster with openBLAS ### Does this PR introduce any user-facing change? add a new expert param `blockSize` ### How was this patch tested? added testsuites Closes #27473 from zhengruifeng/blockify_gmm. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2020-05-12 12:54:03 +08:00
Huaxin Gao	7a670b5a0a	[SPARK-31667][ML][PYSPARK] Python side flatten the result dataframe of ANOVATest/ChisqTest/FValueTest ### What changes were proposed in this pull request? Add Python version of ``` Since("3.1.0") def test( dataset: DataFrame, featuresCol: String, labelCol: String, flatten: Boolean): DataFrame ``` ### Why are the changes needed? parity between scala and python ### Does this PR introduce _any_ user-facing change? yes new method ``` Since("3.1.0") def test( dataset: DataFrame, featuresCol: String, labelCol: String, flatten: Boolean): DataFrame ``` in PySpark ANOVATest/ChisqTest/FValueTest ### How was this patch tested? New doctest Closes #28483 from huaxingao/flatten_py. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-05-11 09:09:00 -05:00
zhengruifeng	bb9b50c217	[SPARK-31656][ML][PYSPARK] AFT blockify input vectors ### What changes were proposed in this pull request? 1, add new param blockSize; 2, add a new class InstanceBlock; 3, if blockSize==1, keep original behavior; if blockSize>1, stack input vectors to blocks (like ALS/MLP); 4, if blockSize>1, standardize the input outside of optimization procedure; ### Why are the changes needed? it will obtain performance gain on dense datasets, such as epsilon: 1, reduce RAM to persist training dataset; (save about 40% RAM) 2, use Level-2 BLAS routines; (~10X speedup) ### Does this PR introduce _any_ user-facing change? Yes, a new param is added ### How was this patch tested? existing and added testsuites Closes #28473 from zhengruifeng/blockify_aft. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2020-05-08 14:06:36 +08:00
Huaxin Gao	18d2ba53e4	[SPARK-31652][ML][PYSPARK] Add ANOVASelector and FValueSelector to PySpark ### What changes were proposed in this pull request? Add ANOVASelector and FValueSelector to PySpark ### Why are the changes needed? ANOVASelector and FValueSelector have been implemented in Scala. We need to implement these in Python as well. ### Does this PR introduce _any_ user-facing change? Yes. Add Python version of ANOVASelector and FValueSelector ### How was this patch tested? new doctest Closes #28464 from huaxingao/selector_py. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2020-05-08 11:02:24 +08:00
zhengruifeng	97332f26bf	[SPARK-30660][ML][PYSPARK] LinearRegression blockify input vectors ### What changes were proposed in this pull request? 1, add new param blockSize; 2, add a new class InstanceBlock; 3, if blockSize==1, keep original behavior; if blockSize>1, stack input vectors to blocks (like ALS/MLP); 4, if blockSize>1, standardize the input outside of optimization procedure; ### Why are the changes needed? it will obtain performance gain on dense datasets, such as `epsilon`: 1, reduce RAM to persist training dataset; (save about 40% RAM) 2, use Level-2 BLAS routines; (up to 6X(squaredError)~12X(huber) speedup) ### Does this PR introduce _any_ user-facing change? Yes, a new param is added ### How was this patch tested? existing and added testsuites Closes #28471 from zhengruifeng/blockify_lir_II. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2020-05-08 10:52:01 +08:00
zhengruifeng	052ff49acd	[SPARK-30659][ML][PYSPARK] LogisticRegression blockify input vectors ### What changes were proposed in this pull request? 1, reorg the `fit` method in LR to several blocks (`createModel`, `createBounds`, `createOptimizer`, `createInitCoefWithInterceptMatrix`); 2, add new param blockSize; 3, if blockSize==1, keep original behavior, code path `trainOnRows`; 4, if blockSize>1, standardize and stack input vectors to blocks (like ALS/MLP), code path `trainOnBlocks` ### Why are the changes needed? On dense dataset `epsilon_normalized.t`: 1, reduce RAM to persist training dataset; (save about 40% RAM) 2, use Level-2 BLAS routines; (4x ~ 5x faster) ### Does this PR introduce _any_ user-facing change? Yes, a new param is added ### How was this patch tested? existing and added testsuites Closes #28458 from zhengruifeng/blockify_lor_II. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2020-05-07 10:07:24 +08:00
Huaxin Gao	09ece50799	[SPARK-31609][ML][PYSPARK] Add VarianceThresholdSelector to PySpark ### What changes were proposed in this pull request? Add VarianceThresholdSelector to PySpark ### Why are the changes needed? parity between Scala and Python ### Does this PR introduce any user-facing change? Yes. VarianceThresholdSelector is added to PySpark ### How was this patch tested? new doctest Closes #28409 from huaxingao/variance_py. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-05-06 09:11:03 -05:00
zhengruifeng	ebdf41dd69	[SPARK-30642][ML][PYSPARK] LinearSVC blockify input vectors ### What changes were proposed in this pull request? 1, add new param `blockSize`; 2, add a new class InstanceBlock; 3, if `blockSize==1`, keep original behavior; if `blockSize>1`, stack input vectors to blocks (like ALS/MLP); 4, if `blockSize>1`, standardize the input outside of optimization procedure; ### Why are the changes needed? 1, reduce RAM to persist training dataset; (save about 40% RAM) 2, use Level-2 BLAS routines; (4x ~ 5x faster on dataset `epsilon`) ### Does this PR introduce any user-facing change? Yes, a new param is added ### How was this patch tested? existing and added testsuites Closes #28349 from zhengruifeng/blockify_svc_II. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2020-05-06 10:06:23 +08:00
Weichen Xu	4a21c4cc92	[SPARK-31497][ML][PYSPARK] Fix Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot save and load model ### What changes were proposed in this pull request? Fix Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot save and load model. Most pyspark estimators/transformers inherit `JavaParams`, but some estimators are special (in order to support pure python implemented nested estimators/transformers): * Pipeline * OneVsRest * CrossValidator * TrainValidationSplit But note that, currently, in pyspark, the model readers/writers of the estimators listed above do NOT support pure python implemented nested estimators/transformers, because they use the java reader/writer wrapper as the python side reader/writer. The Pyspark CrossValidator/TrainValidationSplit model reader/writer requires all estimators to define `_transfer_param_map_to_java` and `_transfer_param_map_from_java` (used in model read/write). The OneVsRest class already defines the two methods, but Pipeline does not, which leads to this bug. In this PR I add `_transfer_param_map_to_java` and `_transfer_param_map_from_java` to the Pipeline class. ### Why are the changes needed? Bug fix. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Unit test. 
Manually test in pyspark shell: 1) CrossValidator with Simple Pipeline estimator ``` from pyspark.ml import Pipeline from pyspark.ml.classification import LogisticRegression from pyspark.ml.evaluation import BinaryClassificationEvaluator from pyspark.ml.feature import HashingTF, Tokenizer from pyspark.ml.tuning import CrossValidator, CrossValidatorModel, ParamGridBuilder training = spark.createDataFrame([ (0, "a b c d e spark", 1.0), (1, "b d", 0.0), (2, "spark f g h", 1.0), (3, "hadoop mapreduce", 0.0), (4, "b spark who", 1.0), (5, "g d a y", 0.0), (6, "spark fly", 1.0), (7, "was mapreduce", 0.0), ], ["id", "text", "label"]) # Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr. tokenizer = Tokenizer(inputCol="text", outputCol="words") hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features") lr = LogisticRegression(maxIter=10) pipeline = Pipeline(stages=[tokenizer, hashingTF, lr]) paramGrid = ParamGridBuilder() \ .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \ .addGrid(lr.regParam, [0.1, 0.01]) \ .build() crossval = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid, evaluator=BinaryClassificationEvaluator(), numFolds=2) # use 3+ folds in practice # Run cross-validation, and choose the best set of parameters. cvModel = crossval.fit(training) cvModel.save('/tmp/cv_model001') CrossValidatorModel.load('/tmp/cv_model001') ``` 2) CrossValidator with a Pipeline estimator that includes a OneVsRest estimator stage, where the OneVsRest estimator nests a LogisticRegression estimator. 
``` from pyspark.ml.linalg import Vectors from pyspark.ml import Estimator, Model from pyspark.ml.classification import LogisticRegression, LogisticRegressionModel, OneVsRest from pyspark.ml.evaluation import BinaryClassificationEvaluator, \ MulticlassClassificationEvaluator, RegressionEvaluator from pyspark.ml.linalg import Vectors from pyspark.ml.param import Param, Params from pyspark.ml.tuning import CrossValidator, CrossValidatorModel, ParamGridBuilder, \ TrainValidationSplit, TrainValidationSplitModel from pyspark.sql.functions import rand from pyspark.testing.mlutils import SparkSessionTestCase dataset = spark.createDataFrame( [(Vectors.dense([0.0]), 0.0), (Vectors.dense([0.4]), 1.0), (Vectors.dense([0.5]), 0.0), (Vectors.dense([0.6]), 1.0), (Vectors.dense([1.0]), 1.0)] * 10, ["features", "label"]) ova = OneVsRest(classifier=LogisticRegression()) lr1 = LogisticRegression().setMaxIter(100) lr2 = LogisticRegression().setMaxIter(150) grid = ParamGridBuilder().addGrid(ova.classifier, [lr1, lr2]).build() evaluator = MulticlassClassificationEvaluator() pipeline = Pipeline(stages=[ova]) cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid, evaluator=evaluator) cvModel = cv.fit(dataset) cvModel.save('/tmp/model002') cvModel2 = CrossValidatorModel.load('/tmp/model002') ``` The TrainValidationSplit testing code is similar, so I do not paste it. Closes #28279 from WeichenXu123/fix_pipeline_tuning. Authored-by: Weichen Xu <weichen.xu@databricks.com> Signed-off-by: Xiangrui Meng <meng@databricks.com>	2020-04-26 21:04:14 -07:00
Huaxin Gao	d279dbf09c	[SPARK-31243][ML][PYSPARK] Add ANOVATest and FValueTest to PySpark ### What changes were proposed in this pull request? Add ANOVATest and FValueTest to PySpark ### Why are the changes needed? Parity between Scala and Python. ### Does this PR introduce any user-facing change? Yes. Python ANOVATest and FValueTest ### How was this patch tested? doctest Closes #28012 from huaxingao/stats-python. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2020-03-27 14:05:49 +08:00
Huaxin Gao	3ce1dff7ba	[SPARK-30930][ML] Remove ML/MLLIB DeveloperApi annotations ### What changes were proposed in this pull request? jira link: https://issues.apache.org/jira/browse/SPARK-30930 Remove ML/MLLIB DeveloperApi annotations. ### Why are the changes needed? The Developer APIs in ML/MLLIB have been there for a long time. They are stable now and are very unlikely to be changed or removed, so I unmark these Developer APIs in this PR. ### Does this PR introduce any user-facing change? Yes. DeveloperApi annotations are removed from docs. ### How was this patch tested? existing tests Closes #27859 from huaxingao/spark-30930. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-03-16 12:41:22 -05:00
Huaxin Gao	4a64901ab7	[SPARK-31012][ML][PYSPARK][DOCS] Updating ML API docs for 3.0 changes ### What changes were proposed in this pull request? Updating ML docs for 3.0 changes ### Why are the changes needed? I am auditing 3.0 ML changes, found some docs are missing or not updated. Need to update these. ### Does this PR introduce any user-facing change? Yes, doc changes ### How was this patch tested? Manually build and check Closes #27762 from huaxingao/spark-doc. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-03-07 11:42:05 -06:00
zero323	e1b3e9a3d2	[SPARK-29212][ML][PYSPARK] Add common classes without using JVM backend ### What changes were proposed in this pull request? Implement common base ML classes (`Predictor`, `PredictionModel`, `Classifier`, `ClassificationModel`, `ProbabilisticClassifier`, `ProbabilisticClassificationModel`, `Regressor`, `RegressionModel`) for non-Java backends. Note - `Predictor` and `JavaClassifier` should be abstract as `_fit` method is not implemented. - `PredictionModel` should be abstract as `_transform` is not implemented. ### Why are the changes needed? To provide extension points for non-JVM algorithms, as well as a public (as opposed to `Java` variants, which are commonly described in docstrings as private) hierarchy which can be used to distinguish between different classes of predictors. For longer discussion see [SPARK-29212](https://issues.apache.org/jira/browse/SPARK-29212) and / or https://github.com/apache/spark/pull/25776. ### Does this PR introduce any user-facing change? It adds new base classes as listed above, but effective interfaces (method resolution order notwithstanding) stay the same. Additionally "private" `Java` classes in `ml.regression` and `ml.classification` have been renamed to follow PEP-8 conventions (added leading underscore). It is for discussion if the same should be done to equivalent classes from `ml.wrapper`. 
If we take `JavaClassifier` as an example, type hierarchy will change from ![old pyspark ml classification JavaClassifier](https://user-images.githubusercontent.com/1554276/72657093-5c0b0c80-39a0-11ea-9069-a897d75de483.png) to ![new pyspark ml classification _JavaClassifier](https://user-images.githubusercontent.com/1554276/72657098-64fbde00-39a0-11ea-8f80-01187a5ea5a6.png) Similarly the old model ![old pyspark ml classification JavaClassificationModel](https://user-images.githubusercontent.com/1554276/72657103-7513bd80-39a0-11ea-9ffc-59eb6ab61fde.png) will become ![new pyspark ml classification _JavaClassificationModel](https://user-images.githubusercontent.com/1554276/72657110-80ff7f80-39a0-11ea-9f5c-fe408664e827.png) ### How was this patch tested? Existing unit tests. Closes #27245 from zero323/SPARK-29212. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2020-03-04 12:20:02 +08:00
zhengruifeng	111e9038d8	[SPARK-30770][ML] avoid vector conversion in GMM.transform ### What changes were proposed in this pull request? Current impl needs to convert ml.Vector to breeze.Vector, which can be skipped. ### Why are the changes needed? avoid unnecessary vector conversions ### Does this PR introduce any user-facing change? No ### How was this patch tested? existing testsuites Closes #27519 from zhengruifeng/gmm_transform_opt. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2020-03-04 11:02:27 +08:00
David Toneian	504b5135d0	[SPARK-30859][PYSPARK][DOCS][MINOR] Fixed docstring syntax issues preventing proper compilation of documentation This commit is published into the public domain. ### What changes were proposed in this pull request? Some syntax issues in docstrings have been fixed. ### Why are the changes needed? In some places, the documentation did not render as intended, e.g. parameter documentations were not formatted as such. ### Does this PR introduce any user-facing change? Slight improvements in documentation. ### How was this patch tested? Manual testing. No new Sphinx warnings arise due to this change. Closes #27613 from DavidToneian/SPARK-30859. Authored-by: David Toneian <david@toneian.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-02-18 16:46:45 +09:00
Liang Zhang	82d0aa37ae	[SPARK-30762] Add dtype=float32 support to vector_to_array UDF ### What changes were proposed in this pull request? In this PR, we add a parameter to the python function vector_to_array(col) that allows converting to a column of arrays of Float (32 bits) in Scala, which would be mapped to a numpy array of dtype=float32. ### Why are the changes needed? In the downstream ML training, using float32 instead of float64 (default) would allow a larger batch size, i.e., allow more data to fit in memory. ### Does this PR introduce any user-facing change? Yes. Old: `vector_to_array()` only takes one param ``` df.select(vector_to_array("colA"), ...) ``` New: `vector_to_array()` can take an additional optional param: `dtype` = "float32" (or "float64") ``` df.select(vector_to_array("colA", "float32"), ...) ``` ### How was this patch tested? Unit test in scala. doctest in python. Closes #27522 from liangz1/udf-float32. Authored-by: Liang Zhang <liang.zhang@databricks.com> Signed-off-by: WeichenXu <weichen.xu@databricks.com>	2020-02-13 23:55:13 +08:00
Huaxin Gao	a7ae77a8d8	[SPARK-30662][ML][PYSPARK] Put back the API changes for HasBlockSize in ALS/MLP ### What changes were proposed in this pull request? Add ```HasBlockSize``` in shared Params in both Scala and Python. Make ALS/MLP extend ```HasBlockSize``` ### Why are the changes needed? Add ```HasBlockSize ``` in ALS, so user can specify the blockSize. Make ```HasBlockSize``` a shared param so both ALS and MLP can use it. ### Does this PR introduce any user-facing change? Yes ```ALS.setBlockSize/getBlockSize``` ```ALSModel.setBlockSize/getBlockSize``` ### How was this patch tested? Manually tested. Also added doctest. Closes #27501 from huaxingao/spark_30662. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2020-02-09 13:14:30 +08:00
zhengruifeng	12e1bbaddb	Revert "[SPARK-30642][SPARK-30659][SPARK-30660][SPARK-30662]" ### What changes were proposed in this pull request? Revert #27360 #27396 #27374 #27389 ### Why are the changes needed? BLAS needs more performance tests, especially on sparse datasets. A performance test of LogisticRegression (https://github.com/apache/spark/pull/27374) on a sparse dataset shows that blockifying vectors into matrices and using BLAS causes a performance regression. LinearSVC and LinearRegression were also updated in the same way as LogisticRegression, so we need to revert them to make sure there is no regression. ### Does this PR introduce any user-facing change? remove newly added param blockSize ### How was this patch tested? reverted testsuites Closes #27487 from zhengruifeng/revert_blockify_ii. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2020-02-08 08:46:16 +08:00
zhengruifeng	d0c3e9f1f7	[SPARK-30660][ML][PYSPARK] LinearRegression blockify input vectors ### What changes were proposed in this pull request? 1, use blocks instead of vectors for performance improvement 2, use Level-2 BLAS 3, move standardization of input vectors outside of gradient computation ### Why are the changes needed? 1, less RAM to persist training data; (save ~40%) 2, faster than existing impl; (30% ~ 102%) ### Does this PR introduce any user-facing change? add a new expert param `blockSize` ### How was this patch tested? updated testsuites Closes #27396 from zhengruifeng/blockify_lireg. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-01-31 21:04:26 -06:00
Huaxin Gao	6fac411076	[SPARK-29093][ML][PYSPARK][FOLLOW-UP] Remove duplicate setter ### What changes were proposed in this pull request? remove duplicate setter in ```BucketedRandomProjectionLSH``` ### Why are the changes needed? Remove the duplicate ```setInputCol/setOutputCol``` in ```BucketedRandomProjectionLSH``` because these two setters are already in the superclass ```LSH``` ### Does this PR introduce any user-facing change? No ### How was this patch tested? Manually checked. Closes #27397 from huaxingao/spark-29093. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-01-30 23:36:39 -08:00
Huaxin Gao	f59685acaa	[SPARK-30662][ML][PYSPARK] ALS/MLP extend HasBlockSize ### What changes were proposed in this pull request? Make ALS/MLP extend ```HasBlockSize``` ### Why are the changes needed? Currently, MLP has its own ```blockSize``` param, we should make MLP extend ```HasBlockSize``` since ```HasBlockSize``` was added in ```sharedParams.scala``` recently. ALS doesn't have ```blockSize``` param now, we can make it extend ```HasBlockSize```, so user can specify the ```blockSize```. ### Does this PR introduce any user-facing change? Yes ```ALS.setBlockSize``` and ```ALS.getBlockSize``` ```ALSModel.setBlockSize``` and ```ALSModel.getBlockSize``` ### How was this patch tested? Manually tested. Also added doctest. Closes #27389 from huaxingao/spark-30662. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-01-30 13:13:10 -06:00
zhengruifeng	073ce12543	[SPARK-30659][ML][PYSPARK] LogisticRegression blockify input vectors ### What changes were proposed in this pull request? 1, use blocks instead of vectors 2, use Level-2 BLAS for binary, use Level-3 BLAS for multinomial ### Why are the changes needed? 1, less RAM to persist training data; (save ~40%) 2, faster than existing impl; (40% ~ 92%) ### Does this PR introduce any user-facing change? add a new expert param `blockSize` ### How was this patch tested? updated testsuites Closes #27374 from zhengruifeng/blockify_lor. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-01-30 10:52:07 -06:00
zhengruifeng	96d27274f5	[SPARK-30642][ML][PYSPARK] LinearSVC blockify input vectors ### What changes were proposed in this pull request? 1, stack input vectors to blocks (like ALS/MLP); 2, add new param `blockSize`; 3, add a new class `InstanceBlock` 4, standardize the input outside of optimization procedure; ### Why are the changes needed? 1, reduce RAM to persist training dataset; (save ~40% in test) 2, use Level-2 BLAS routines; (12% ~ 28% faster, without native BLAS) ### Does this PR introduce any user-facing change? a new param `blockSize` ### How was this patch tested? existing and updated testsuites Closes #27360 from zhengruifeng/blockify_svc. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2020-01-28 20:55:21 +08:00
zhengruifeng	f35f352096	[SPARK-30543][ML][PYSPARK][R] RandomForest add Param bootstrap to control sampling method ### What changes were proposed in this pull request? add a param `bootstrap` to control whether bootstrap samples are used. ### Why are the changes needed? Current RF with numTrees=1 will directly build a tree using the original dataset, while with numTrees>1 it will use bootstrap samples to build trees. This design is for training a DecisionTreeModel by the impl of RandomForest, however, it is somewhat strange. In Scikit-Learn, there is a param [bootstrap](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier) to control whether bootstrap samples are used. ### Does this PR introduce any user-facing change? Yes, new param is added ### How was this patch tested? existing testsuites Closes #27254 from zhengruifeng/add_bootstrap. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2020-01-23 16:44:13 +08:00
zero323	3228732fd5	[SPARK-30533][ML][PYSPARK] Add classes to represent Java Regressors and RegressionModels ### What changes were proposed in this pull request? This PR adds: - `pyspark.ml.regression.JavaRegressor` - `pyspark.ml.regression.JavaRegressionModel` classes and replaces `JavaPredictor` and `JavaPredictionModel` in - `LinearRegression` / `LinearRegressionModel` - `DecisionTreeRegressor` / `DecisionTreeRegressionModel` (just addition as `JavaPredictionModel` hasn't been used) - `RandomForestRegressor` / `RandomForestRegressionModel` (just addition as `JavaPredictionModel` hasn't been used) - `GBTRegressor` / `GBTRegressionModel` (just addition as `JavaPredictionModel` hasn't been used) - `AFTSurvivalRegression` / `AFTSurvivalRegressionModel` - `GeneralizedLinearRegression` / `GeneralizedLinearRegressionModel` - `FMRegressor` / `FMRegressionModel` ### Why are the changes needed? - Internal PySpark consistency. - Feature parity with Scala. - Intermediate step towards implementing [SPARK-29212](https://issues.apache.org/jira/browse/SPARK-29212) ### Does this PR introduce any user-facing change? It adds new base classes, so it will affect `mro`. Otherwise interfaces should stay intact. ### How was this patch tested? Existing tests. Closes #27241 from zero323/SPARK-30533. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-01-17 19:34:30 -06:00
Huaxin Gao	92dd7c9d2a	[MINOR][ML] Change DecisionTreeClassifier to FMClassifier in OneVsRest setWeightCol test ### What changes were proposed in this pull request? Change ```DecisionTreeClassifier``` to ```FMClassifier``` in ```OneVsRest``` setWeightCol test ### Why are the changes needed? In ```OneVsRest```, if the classifier doesn't support instance weight, ```OneVsRest``` weightCol will be ignored, so the unit test covers one classifier (```LogisticRegression```) that supports instance weight and one classifier (```DecisionTreeClassifier```) that doesn't. Since ```DecisionTreeClassifier``` now supports instance weight, we need to change it to a classifier that doesn't have weight support. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing test Closes #27204 from huaxingao/spark-ovr-minor. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2020-01-17 10:04:41 +08:00
Huaxin Gao	1ef1d6caf2	[SPARK-29565][FOLLOWUP] add setInputCol/setOutputCol in OHEModel ### What changes were proposed in this pull request? add setInputCol/setOutputCol in OHEModel ### Why are the changes needed? setInputCol/setOutputCol should be in OHEModel too. ### Does this PR introduce any user-facing change? Yes. ```OHEModel.setInputCol``` ```OHEModel.setOutputCol``` ### How was this patch tested? Manually tested. Closes #27228 from huaxingao/spark-29565. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2020-01-16 19:23:10 +08:00
zero323	990a2be27f	[SPARK-30378][ML][PYSPARK][FOLLOWUP] Remove Param fields provided by _FactorizationMachinesParams ### What changes were proposed in this pull request? Removal of following `Param` fields: - `factorSize` - `fitLinear` - `miniBatchFraction` - `initStd` - `solver` from `FMClassifier` and `FMRegressor` ### Why are the changes needed? These `Param` members are already provided by `_FactorizationMachinesParams` `0f3d744c3f/python/pyspark/ml/regression.py (L2303-L2318)` which is mixed into `FMRegressor`: `0f3d744c3f/python/pyspark/ml/regression.py (L2350)` and `FMClassifier`: `0f3d744c3f/python/pyspark/ml/classification.py (L2793)` ### Does this PR introduce any user-facing change? No ### How was this patch tested? Manual testing. Closes #27205 from zero323/SPARK-30378-FOLLOWUP. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-01-15 08:43:36 -06:00
zero323	525c5695f8	[SPARK-30504][PYTHON][ML] Set weightCol in OneVsRest(Model) _to_java and _from_java ### What changes were proposed in this pull request? This PR adjusts `_to_java` and `_from_java` of `OneVsRest` and `OneVsRestModel` to preserve `weightCol`. ### Why are the changes needed? Currently both `Params` don't preserve `weightCol` `Params` when data is saved / loaded: ```python from pyspark.ml.classification import LogisticRegression, OneVsRest, OneVsRestModel from pyspark.ml.linalg import DenseVector df = spark.createDataFrame([(0, 1, DenseVector([1.0, 0.0])), (0, 1, DenseVector([1.0, 0.0]))], ("label", "w", "features")) ovr = OneVsRest(classifier=LogisticRegression()).setWeightCol("w") ovrm = ovr.fit(df) ovr.getWeightCol() ## 'w' ovrm.getWeightCol() ## 'w' ovr.write().overwrite().save("/tmp/ovr") ovr_ = OneVsRest.load("/tmp/ovr") ovr_.getWeightCol() ## KeyError ## ... ## KeyError: Param(parent='OneVsRest_5145d56b6bd1', name='weightCol', doc='weight column name. ...) ovrm.write().overwrite().save("/tmp/ovrm") ovrm_ = OneVsRestModel.load("/tmp/ovrm") ovrm_ .getWeightCol() ## KeyError ## ... ## KeyError: Param(parent='OneVsRestModel_598c6d900fad', name='weightCol', doc='weight column name ... ``` ### Does this PR introduce any user-facing change? After this PR is merged, loaded objects will have `weightCol` `Param` set. ### How was this patch tested? - Manual testing. - Extension of existing persistence tests. Closes #27190 from zero323/SPARK-30504. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-01-15 08:42:24 -06:00
zero323	3668291e6b	[SPARK-30452][ML][PYSPARK][FOLLOWUP] Change IsotonicRegressionModel.numFeatures to property ### What changes were proposed in this pull request? Change `IsotonicRegressionModel.numFeatures` from plain method to property. ### Why are the changes needed? Consistency. Right now we use `numFeatures` in two other places in `pyspark.ml` `0f3d744c3f/python/pyspark/ml/feature.py (L4289-L4291)` `0f3d744c3f/python/pyspark/ml/wrapper.py (L437-L439)` and one in `pyspark.mllib` `0f3d744c3f/python/pyspark/mllib/classification.py (L177-L179)` each time as a property. Additionally all similar values in `ml` are exposed as properties, for example `0f3d744c3f/python/pyspark/ml/regression.py (L451-L453)` ### Does this PR introduce any user-facing change? Yes, but current API hasn't been released yet. ### How was this patch tested? Existing doctests. Closes #27206 from zero323/SPARK-30452-FOLLOWUP. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2020-01-15 12:29:23 +08:00
zhengruifeng	93200115d7	[SPARK-9478][ML][PYSPARK] Add sample weights to Random Forest ### What changes were proposed in this pull request? 1, change `convertToBaggedRDDSamplingWithReplacement` to attach instance weights 2, make RF support weights ### Why are the changes needed? `weightCol` is already exposed, while RF does not support weights. ### Does this PR introduce any user-facing change? Yes, new setters ### How was this patch tested? added testsuites Closes #27097 from zhengruifeng/rf_support_weight. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-01-14 08:25:51 -06:00
Huaxin Gao	2688faeea5	[SPARK-30498][ML][PYSPARK] Fix some ml parity issues between python and scala ### What changes were proposed in this pull request? There are some parity issues between python and scala ### Why are the changes needed? keep parity between python and scala ### Does this PR introduce any user-facing change? Yes ### How was this patch tested? existing tests Closes #27196 from huaxingao/spark-30498. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2020-01-14 17:24:17 +08:00
Huaxin Gao	f77dcfc55a	[SPARK-30351][ML][PYSPARK] BisectingKMeans support instance weighting ### What changes were proposed in this pull request? add weight support in BisectingKMeans ### Why are the changes needed? BisectingKMeans should support instance weighting ### Does this PR introduce any user-facing change? Yes. BisectingKMeans.setWeight ### How was this patch tested? Unit test Closes #27035 from huaxingao/spark_30351. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-01-13 08:24:49 -06:00
Huaxin Gao	d6e28f2922	[SPARK-30377][ML] Make Regressors extend abstract class Regressor ### What changes were proposed in this pull request? Make Regressors extend abstract class Regressor: ```AFTSurvivalRegression extends Estimator => extends Regressor``` ```DecisionTreeRegressor extends Predictor => extends Regressor``` ```FMRegressor extends Predictor => extends Regressor``` ```GBTRegressor extends Predictor => extends Regressor``` ```RandomForestRegressor extends Predictor => extends Regressor``` We will not make ```IsotonicRegression``` extend ```Regressor``` because it is tricky to handle both DoubleType and VectorType. ### Why are the changes needed? Make class hierarchy consistent for all Regressors ### Does this PR introduce any user-facing change? No ### How was this patch tested? existing tests Closes #27168 from huaxingao/spark-30377. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-01-13 08:22:20 -06:00
zero323	6502c66025	[SPARK-30493][PYTHON][ML] Remove OneVsRestModel setClassifier, setLabelCol and setWeightCol methods ### What changes were proposed in this pull request? Removal of `OneVsRestModel.setClassifier`, `OneVsRestModel.setLabelCol` and `OneVsRestModel.setWeightCol` methods. ### Why are the changes needed? The aforementioned methods shouldn't have been included by [SPARK-29093](https://issues.apache.org/jira/browse/SPARK-29093), as they're not present in Scala `OneVsRestModel` and have no practical application. ### Does this PR introduce any user-facing change? Not beyond the scope of SPARK-29093. ### How was this patch tested? Existing tests. CC huaxingao zhengruifeng Closes #27181 from zero323/SPARK-30493. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2020-01-13 19:03:32 +08:00
Huaxin Gao	c88124a246	[SPARK-30452][ML][PYSPARK] Add predict and numFeatures in Python IsotonicRegressionModel ### What changes were proposed in this pull request? Add ```predict``` and ```numFeatures``` in Python ```IsotonicRegressionModel``` ### Why are the changes needed? ```IsotonicRegressionModel``` doesn't extend ```JavaPredictionModel```, so it doesn't get ```predict``` and ```numFeatures``` from the super class. ### Does this PR introduce any user-facing change? Yes. Python version of ``` IsotonicRegressionModel.predict IsotonicRegressionModel.numFeatures ``` ### How was this patch tested? doctest Closes #27122 from huaxingao/spark-30452. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-01-09 09:23:10 -06:00
WeichenXu	88542bc3d9	[SPARK-30154][ML] PySpark UDF to convert MLlib vectors to dense arrays ### What changes were proposed in this pull request? PySpark UDF to convert MLlib vectors to dense arrays. Example: ``` from pyspark.ml.functions import vector_to_array df.select(vector_to_array(col("features"))) ``` ### Why are the changes needed? If a PySpark user wants to convert MLlib sparse/dense vectors in a DataFrame into dense arrays, an efficient approach is to do that in the JVM. However, it requires the PySpark user to write Scala code and register it as a UDF. Often this is infeasible for a pure Python project. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? UT. Closes #26910 from WeichenXu123/vector_to_array. Authored-by: WeichenXu <weichen.xu@databricks.com> Signed-off-by: Xiangrui Meng <meng@databricks.com>	2020-01-06 16:18:51 -08:00
Huaxin Gao	d32ed25f0d	[SPARK-30144][ML][PYSPARK] Make MultilayerPerceptronClassificationModel extend MultilayerPerceptronParams ### What changes were proposed in this pull request? Make ```MultilayerPerceptronClassificationModel``` extend ```MultilayerPerceptronParams``` ### Why are the changes needed? Make ```MultilayerPerceptronClassificationModel``` extend ```MultilayerPerceptronParams``` to expose the training params, so user can see these params when calling ```extractParamMap``` ### Does this PR introduce any user-facing change? Yes. The ```MultilayerPerceptronParams``` such as ```seed```, ```maxIter``` ... are available in ```MultilayerPerceptronClassificationModel``` now ### How was this patch tested? Manually tested ```MultilayerPerceptronClassificationModel.extractParamMap()``` to verify all the new params are there. Closes #26838 from huaxingao/spark-30144. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-01-03 12:01:11 -06:00
Huaxin Gao	6196c20ee0	[SPARK-30358][ML][PYSPARK][FOLLOWUP] ML expose predictRaw and predictProbability on Python side ### What changes were proposed in this pull request? expose predictRaw and predictProbability on Python side ### Why are the changes needed? to keep parity between scala and python ### Does this PR introduce any user-facing change? Yes. Expose python ```predictRaw``` and ```predictProbability``` ### How was this patch tested? doctest Closes #27082 from huaxingao/spark-30358. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-01-03 11:42:56 -06:00
Huaxin Gao	9ee8da298d	[SPARK-30378][ML][PYSPARK] Add getter/setter in Python FM ### What changes were proposed in this pull request? add getter/setter in Python FM ### Why are the changes needed? to be consistent with other algorithms ### Does this PR introduce any user-facing change? Yes. add getter/setter in Python FMRegressor/FMRegressionModel/FMClassifier/FMClassificationModel ### How was this patch tested? doctest Closes #27044 from huaxingao/spark-30378. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2019-12-31 12:56:19 +08:00
zhengruifeng	9c046dc808	[SPARK-30102][ML][PYSPARK] GMM supports instance weighting ### What changes were proposed in this pull request? supports instance weighting in GMM ### Why are the changes needed? ML should support instance weighting ### Does this PR introduce any user-facing change? yes, a new param `weightCol` is exposed ### How was this patch tested? added testsuites Closes #26735 from zhengruifeng/gmm_support_weight. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2019-12-27 13:32:57 +08:00
Huaxin Gao	a3cf9c564e	[SPARK-30247][PYSPARK][FOLLOWUP] Add Python class MultivariateGaussian ### What changes were proposed in this pull request? add a corresponding class MultivariateGaussian containing a vector and a matrix on the py side, so gaussian can be used on the py side. ### Does this PR introduce any user-facing change? add Python class ```MultivariateGaussian``` ### How was this patch tested? doctest Closes #27020 from huaxingao/spark-30247. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2019-12-27 13:30:18 +08:00
zhanjf	8d3eed33ee	[SPARK-29224][ML] Implement Factorization Machines as a ml-pipeline component ### What changes were proposed in this pull request? Implement Factorization Machines as a ml-pipeline component 1. loss function supports: logloss, mse 2. optimizer: GD, adamW ### Why are the changes needed? Factorization Machines is widely used in advertising and recommendation system to estimate CTR(click-through rate). Advertising and recommendation system usually has a lot of data, so we need Spark to estimate the CTR, and Factorization Machines are common ml model to estimate CTR. References: 1. S. Rendle, “Factorization machines,” in Proceedings of IEEE International Conference on Data Mining (ICDM), pp. 995–1000, 2010. https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf ### Does this PR introduce any user-facing change? No ### How was this patch tested? run unit tests Closes #27000 from mob-ai/ml/fm. Authored-by: zhanjf <zhanjf@mob.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2019-12-26 11:39:53 -06:00
zhengruifeng	8f07839e74	[SPARK-30178][ML] RobustScaler support large numFeatures ### What changes were proposed in this pull request? compute the medians/ranges in a more distributed way ### Why are the changes needed? It is a bottleneck to collect the whole Array[QuantileSummaries] from executors, since a QuantileSummaries is a large object, which maintains arrays of large sizes 10k(`defaultCompressThreshold`)/50k(`defaultHeadSize`). In Spark-Shell with default params, I processed a dataset with numFeatures=69,200, and the existing impl fails due to OOM. After this PR, it will successfully fit the model. ### Does this PR introduce any user-facing change? No ### How was this patch tested? existing testsuites Closes #26803 from zhengruifeng/robust_high_dim. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2019-12-25 09:44:19 +08:00
Wenchen Fan	ba3f6330dd	Revert "[SPARK-29224][ML] Implement Factorization Machines as a ml-pipeline component" This reverts commit `c6ab7165dd`.	2019-12-24 14:01:27 +08:00
zhanjf	c6ab7165dd	[SPARK-29224][ML] Implement Factorization Machines as a ml-pipeline component ### What changes were proposed in this pull request? Implement Factorization Machines as a ml-pipeline component 1. loss function supports: logloss, mse 2. optimizer: GD, adamW ### Why are the changes needed? Factorization Machines is widely used in advertising and recommendation system to estimate CTR(click-through rate). Advertising and recommendation system usually has a lot of data, so we need Spark to estimate the CTR, and Factorization Machines are common ml model to estimate CTR. References: 1. S. Rendle, “Factorization machines,” in Proceedings of IEEE International Conference on Data Mining (ICDM), pp. 995–1000, 2010. https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf ### Does this PR introduce any user-facing change? No ### How was this patch tested? run unit tests Closes #26124 from mob-ai/ml/fm. Authored-by: zhanjf <zhanjf@mob.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2019-12-23 10:11:09 -06:00
Huaxin Gao	5ed72a1940	[SPARK-30247][PYSPARK] GaussianMixtureModel in py side should expose gaussian ### What changes were proposed in this pull request? expose gaussian in PySpark ### Why are the changes needed? A ```GaussianMixtureModel``` contains two parts of coefficients: ```weights``` & ```gaussians```. However, ```gaussians``` is not exposed on Python side. ### Does this PR introduce any user-facing change? Yes. ```GaussianMixtureModel.gaussians``` is exposed in PySpark. ### How was this patch tested? add doctest Closes #26882 from huaxingao/spark-30247. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2019-12-16 18:15:40 -06:00
Huaxin Gao	1cac9b2cc6	[SPARK-29967][ML][PYTHON] KMeans support instance weighting ### What changes were proposed in this pull request? add weight support in KMeans ### Why are the changes needed? KMeans should support weighting ### Does this PR introduce any user-facing change? Yes. ```KMeans.setWeightCol``` ### How was this patch tested? Unit Tests Closes #26739 from huaxingao/spark-29967. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2019-12-10 09:33:06 -06:00
Huaxin Gao	8a9cccf1f3	[SPARK-30146][ML][PYSPARK] Add setWeightCol to GBTs in PySpark ### What changes were proposed in this pull request? add ```setWeightCol``` and ```setMinWeightFractionPerNode``` in Python side of ```GBTClassifier``` and ```GBTRegressor``` ### Why are the changes needed? https://github.com/apache/spark/pull/25926 added ```setWeightCol``` and ```setMinWeightFractionPerNode``` in GBTs on scala side. This PR will add ```setWeightCol``` and ```setMinWeightFractionPerNode``` in GBTs on python side ### Does this PR introduce any user-facing change? Yes ### How was this patch tested? doc test Closes #26774 from huaxingao/spark-30146. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2019-12-09 13:39:33 -06:00
zhengruifeng	4021354b73	[SPARK-30044][ML] MNB/CNB/BNB use empty sigma matrix instead of null ### What changes were proposed in this pull request? MNB/CNB/BNB use empty sigma matrix instead of null ### Why are the changes needed? 1, Using an empty sigma matrix will simplify the impl 2, I am reviewing the FM impl these days; FMModels have an optional bias and linear part. It seems more reasonable to set an optional part to an empty vector/matrix or zero value than to `null` ### Does this PR introduce any user-facing change? yes, sigma changes from `null` to an empty matrix ### How was this patch tested? updated testsuites Closes #26679 from zhengruifeng/nb_use_empty_sigma. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2019-12-03 10:02:23 +08:00

533 commits