ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Sean Owen	eb037a8180	[SPARK-28855][CORE][ML][SQL][STREAMING] Remove outdated usages of Experimental, Evolving annotations ### What changes were proposed in this pull request? The Experimental and Evolving annotations are both (like Unstable) used to express that a an API may change. However there are many things in the code that have been marked that way since even Spark 1.x. Per the dev thread, anything introduced at or before Spark 2.3.0 is pretty much 'stable' in that it would not change without a deprecation cycle. Therefore I'd like to remove most of these annotations. And, remove the `:: Experimental ::` scaladoc tag too. And likewise for Python, R. The changes below can be summarized as: - Generally, anything introduced at or before Spark 2.3.0 has been unmarked as neither Evolving nor Experimental - Obviously experimental items like DSv2, Barrier mode, ExperimentalMethods are untouched - I _did_ unmark a few MLlib classes introduced in 2.4, as I am quite confident they're not going to change (e.g. KolmogorovSmirnovTest, PowerIterationClustering) It's a big change to review, so I'd suggest scanning the list of _files_ changed to see if any area seems like it should remain partly experimental and examine those. ### Why are the changes needed? Many of these annotations are incorrect; the APIs are de facto stable. Leaving them also makes legitimate usages of the annotations less meaningful. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests. Closes #25558 from srowen/SPARK-28855. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-09-01 10:15:00 -05:00
zhengruifeng	573b1cb835	[SPARK-28858][ML][PYSPARK] add tree-based transformation in the py side ### What changes were proposed in this pull request? expose the newly added tree-based transformation in the py side ### Why are the changes needed? function parity ### Does this PR introduce any user-facing change? yes, add `setLeafCol` & `getLeafCol` in the py side ### How was this patch tested? added tests & local tests Closes #25566 from zhengruifeng/py_tree_path. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: Bryan Cutler <cutlerb@gmail.com>	2019-08-23 15:18:35 -07:00
Huaxin Gao	ba5ee27706	[SPARK-28243][PYSPARK][ML][FOLLOW-UP] Move Python DecisionTreeParams to regression.py ## What changes were proposed in this pull request? Leave ```shared.py``` untouched. Move Python ```DecisionTreeParams``` to ```regression.py``` ## How was this patch tested? Use existing tests Closes #25406 from huaxingao/spark-28243. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-08-15 10:21:26 -05:00
Huaxin Gao	72c80ee81c	[SPARK-28243][PYSPARK][ML] Remove setFeatureSubsetStrategy and setSubsamplingRate from Python TreeEnsembleParams ## What changes were proposed in this pull request? Remove deprecated setFeatureSubsetStrategy and setSubsamplingRate from Python TreeEnsembleParams ## How was this patch tested? Use existing tests. Closes #25046 from huaxingao/spark-28243. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-07-20 10:44:33 -05:00
Huaxin Gao	be5d95adc6	[SPARK-27007][PYTHON] add rawPrediction to OneVsRest in PySpark ## What changes were proposed in this pull request? Add RawPrediction to OneVsRest in PySpark to make it consistent with scala implementation ## How was this patch tested? Add doctest Closes #23910 from huaxingao/spark-27007. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-02 09:09:28 -06:00
zhengruifeng	acd086f207	[SPARK-19591][ML][PYSPARK][FOLLOWUP] Add sample weights to decision trees ## What changes were proposed in this pull request? Add sample weights to decision trees ## How was this patch tested? updated testsuites Closes #23818 from zhengruifeng/py_tree_support_sample_weight. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-02-27 21:11:30 -06:00
Huaxin Gao	5bb9647e10	[SPARK-26754][PYTHON] Add hasTrainingSummary to replace duplicate code in PySpark ## What changes were proposed in this pull request? Python version of https://github.com/apache/spark/pull/17654 ## How was this patch tested? Existing Python unit test Closes #23676 from huaxingao/spark26754. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-02-01 17:29:58 -06:00
Sean Owen	c2d0d700b5	[SPARK-26640][CORE][ML][SQL][STREAMING][PYSPARK] Code cleanup from lgtm.com analysis ## What changes were proposed in this pull request? Misc code cleanup from lgtm.com analysis. See comments below for details. ## How was this patch tested? Existing tests. Closes #23571 from srowen/SPARK-26640. Lead-authored-by: Sean Owen <sean.owen@databricks.com> Co-authored-by: Hyukjin Kwon <gurwls223@apache.org> Co-authored-by: Sean Owen <srowen@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-01-17 19:40:39 -06:00
Huaxin Gao	20278e719e	[SPARK-24333][ML][PYTHON] Add fit with validation set to spark.ml GBT: Python API ## What changes were proposed in this pull request? Add validationIndicatorCol and validationTol to GBT Python. ## How was this patch tested? Add test in doctest to test the new API. Closes #21465 from huaxingao/spark-24333. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Bryan Cutler <cutlerb@gmail.com>	2018-12-07 13:53:35 -08:00
Sean Owen	08c76b5d39	[SPARK-25238][PYTHON] lint-python: Fix W605 warnings for pycodestyle 2.4 (This change is a subset of the changes needed for the JIRA; see https://github.com/apache/spark/pull/22231) ## What changes were proposed in this pull request? Use raw strings and simpler regex syntax consistently in Python, which also avoids warnings from pycodestyle about accidentally relying Python's non-escaping of non-reserved chars in normal strings. Also, fix a few long lines. ## How was this patch tested? Existing tests, and some manual double-checking of the behavior of regexes in Python 2/3 to be sure. Closes #22400 from srowen/SPARK-25238.2. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-09-13 11:19:43 +08:00
bravo-zhang	524827f062	[SPARK-14712][ML] LogisticRegressionModel.toString should summarize model ## What changes were proposed in this pull request? [SPARK-14712](https://issues.apache.org/jira/browse/SPARK-14712) spark.mllib LogisticRegressionModel overrides toString to print a little model info. We should do the same in spark.ml and override repr in pyspark. ## How was this patch tested? LogisticRegressionSuite.scala Python doctest in pyspark.ml.classification.py Author: bravo-zhang <mzhang1230@gmail.com> Closes #18826 from bravo-zhang/spark-14712.	2018-06-28 12:40:39 -07:00
Huaxin Gao	ec6f971dc5	[SPARK-23161][PYSPARK][ML] Add missing APIs to Python GBTClassifier ## What changes were proposed in this pull request? Add featureSubsetStrategy in GBTClassifier and GBTRegressor. Also make GBTClassificationModel inherit from JavaClassificationModel instead of prediction model so it will have numClasses. ## How was this patch tested? Add tests in doctest Author: Huaxin Gao <huaxing@us.ibm.com> Closes #21413 from huaxingao/spark-23161.	2018-05-30 11:04:09 -07:00
Lu WANG	6b94420f6c	[SPARK-24231][PYSPARK][ML] Provide Python API for evaluateEachIteration for spark.ml GBTs ## What changes were proposed in this pull request? Add evaluateEachIteration for GBTClassification and GBTRegressionModel ## How was this patch tested? doctest Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Lu WANG <lu.wang@databricks.com> Closes #21335 from ludatabricks/SPARK-14682.	2018-05-15 14:16:31 -07:00
JBauerKogentix	9d960de081	typo rawPredicition changed to rawPrediction MultilayerPerceptronClassifier had 4 occurrences ## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. Author: JBauerKogentix <37910022+JBauerKogentix@users.noreply.github.com> Closes #21030 from JBauerKogentix/patch-1.	2018-04-11 15:52:13 -07:00
Benjamin Peterson	7013eea11c	[SPARK-23522][PYTHON] always use sys.exit over builtin exit The exit() builtin is only for interactive use. applications should use sys.exit(). ## What changes were proposed in this pull request? All usage of the builtin `exit()` function is replaced by `sys.exit()`. ## How was this patch tested? I ran `python/run-tests`. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Benjamin Peterson <benjamin@python.org> Closes #20682 from benjaminp/sys-exit.	2018-03-08 20:38:34 +09:00
Yanbo Liang	c76153cc7d	[SPARK-18608][ML][FOLLOWUP] Fix double caching for PySpark OneVsRest. ## What changes were proposed in this pull request? #19197 fixed double caching for MLlib algorithms, but missed PySpark ```OneVsRest```, this PR fixed it. ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #19220 from yanboliang/SPARK-18608.	2017-09-14 14:09:44 +08:00
Ming Jiang	8d8641f122	[SPARK-21854] Added LogisticRegressionTrainingSummary for MultinomialLogisticRegression in Python API ## What changes were proposed in this pull request? Added LogisticRegressionTrainingSummary for MultinomialLogisticRegression in Python API ## How was this patch tested? Added unit test Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Ming Jiang <mjiang@fanatics.com> Author: Ming Jiang <jmwdpk@gmail.com> Author: jmwdpk <jmwdpk@gmail.com> Closes #19185 from jmwdpk/SPARK-21854.	2017-09-14 13:53:28 +08:00
Ajay Saini	720c94fe77	[SPARK-21027][ML][PYTHON] Added tunable parallelism to one vs. rest in both Scala mllib and Pyspark # What changes were proposed in this pull request? Added tunable parallelism to the pyspark implementation of one vs. rest classification. Added a parallelism parameter to the Scala implementation of one vs. rest along with functionality for using the parameter to tune the level of parallelism. I take this PR #18281 over because the original author is busy but we need merge this PR soon. After this been merged, we can close #18281 . ## How was this patch tested? Test suite added. Author: Ajay Saini <ajays725@gmail.com> Author: WeichenXu <weichen.xu@databricks.com> Closes #19110 from WeichenXu123/spark-21027.	2017-09-12 10:02:27 -07:00
Chunsheng Ji	4bab8f5996	[SPARK-21856] Add probability and rawPrediction to MLPC for Python Probability and rawPrediction has been added to MultilayerPerceptronClassifier for Python Add unit test. Author: Chunsheng Ji <chunsheng.ji@gmail.com> Closes #19172 from chunshengji/SPARK-21856.	2017-09-11 16:52:48 +08:00
Weichen Xu	d6b30edd49	[SPARK-12664][ML] Expose probability in mlp model ## What changes were proposed in this pull request? Modify MLP model to inherit `ProbabilisticClassificationModel` and so that it can expose the probability column when transforming data. ## How was this patch tested? Test added. Author: WeichenXu <WeichenXu123@outlook.com> Closes #17373 from WeichenXu123/expose_probability_in_mlp_model.	2017-08-22 21:16:34 -07:00
Bryan Cutler	41bb1ddc63	[SPARK-10931][ML][PYSPARK] PySpark Models Copy Param Values from Estimator ## What changes were proposed in this pull request? Added call to copy values of Params from Estimator to Model after fit in PySpark ML. This will copy values for any params that are also defined in the Model. Since currently most Models do not define the same params from the Estimator, also added method to create new Params from looking at the Java object if they do not exist in the Python object. This is a temporary fix that can be removed once the PySpark models properly define the params themselves. ## How was this patch tested? Refactored the `check_params` test to optionally check if the model params for Python and Java match and added this check to an existing fitted model that shares params between Estimator and Model. Author: Bryan Cutler <cutlerb@gmail.com> Closes #17849 from BryanCutler/pyspark-models-own-params-SPARK-10931.	2017-08-22 17:40:50 -07:00
zero323	845c039ceb	[SPARK-20601][ML] Python API for Constrained Logistic Regression ## What changes were proposed in this pull request? Python API for Constrained Logistic Regression based on #17922 , thanks for the original contribution from zero323 . ## How was this patch tested? Unit tests. Author: zero323 <zero323@users.noreply.github.com> Author: Yanbo Liang <ybliang8@gmail.com> Closes #18759 from yanboliang/SPARK-20601.	2017-08-02 18:10:26 +08:00
Zheng RuiFeng	253a07e43a	[SPARK-21388][ML][PYSPARK] GBTs inherit from HasStepSize & LInearSVC from HasThreshold ## What changes were proposed in this pull request? GBTs inherit from HasStepSize & LInearSVC/Binarizer from HasThreshold ## How was this patch tested? existing tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Author: Ruifeng Zheng <ruifengz@foxmail.com> Closes #18612 from zhengruifeng/override_HasXXX.	2017-08-01 21:34:26 +08:00
Yan Facai (颜发才)	a5a3189974	[SPARK-21306][ML] OneVsRest should support setWeightCol ## What changes were proposed in this pull request? add `setWeightCol` method for OneVsRest. `weightCol` is ignored if classifier doesn't inherit HasWeightCol trait. ## How was this patch tested? + [x] add an unit test. Author: Yan Facai (颜发才) <facai.yan@gmail.com> Closes #18554 from facaiy/BUG/oneVsRest_missing_weightCol.	2017-07-28 10:10:35 +08:00
Ajay Saini	7047f49f45	[SPARK-21221][ML] CrossValidator and TrainValidationSplit Persist Nested Estimators such as OneVsRest ## What changes were proposed in this pull request? Added functionality for CrossValidator and TrainValidationSplit to persist nested estimators such as OneVsRest. Also added CrossValidator and TrainValidation split persistence to pyspark. ## How was this patch tested? Performed both cross validation and train validation split with a one vs. rest estimator and tested read/write functionality of the estimator parameter maps required by these meta-algorithms. Author: Ajay Saini <ajays725@gmail.com> Closes #18428 from ajaysaini725/MetaAlgorithmPersistNestedEstimators.	2017-07-17 10:07:32 -07:00
Ruifeng Zheng	e0b047eafe	[SPARK-18518][ML] HasSolver supports override ## What changes were proposed in this pull request? 1, make param support non-final with `finalFields` option 2, generate `HasSolver` with `finalFields = false` 3, override `solver` in LiR, GLR, and make MLPC inherit `HasSolver` ## How was this patch tested? existing tests Author: Ruifeng Zheng <ruifengz@foxmail.com> Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #16028 from zhengruifeng/param_non_final.	2017-07-01 15:37:41 +08:00
Joseph K. Bradley	cc67bd5732	[SPARK-20929][ML] LinearSVC should use its own threshold param ## What changes were proposed in this pull request? LinearSVC should use its own threshold param, rather than the shared one, since it applies to rawPrediction instead of probability. This PR changes the param in the Scala, Python and R APIs. ## How was this patch tested? New unit test to make sure the threshold can be set to any Double value. Author: Joseph K. Bradley <joseph@databricks.com> Closes #18151 from jkbradley/ml-2.2-linearsvc-cleanup.	2017-06-19 23:04:17 -07:00
Yanbo Liang	dbe81633a7	[SPARK-20501][ML] ML 2.2 QA: New Scala APIs, docs ## What changes were proposed in this pull request? Review new Scala APIs introduced in 2.2. ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #17934 from yanboliang/spark-20501.	2017-05-15 21:21:54 -07:00
zero323	804949c6bf	[SPARK-20631][PYTHON][ML] LogisticRegression._checkThresholdConsistency should use values not Params ## What changes were proposed in this pull request? - Replace `getParam` calls with `getOrDefault` calls. - Fix exception message to avoid unintended `TypeError`. - Add unit tests ## How was this patch tested? New unit tests. Author: zero323 <zero323@users.noreply.github.com> Closes #17891 from zero323/SPARK-20631.	2017-05-10 16:57:52 +08:00
Yanbo Liang	dbb06c689c	[MINOR][ML] Fix some PySpark & SparkR flaky tests ## What changes were proposed in this pull request? Some PySpark & SparkR tests run with tiny dataset and tiny ```maxIter```, which means they are not converged. I don’t think checking intermediate result during iteration make sense, and these intermediate result may vulnerable and not stable, so we should switch to check the converged result. We hit this issue at #17746 when we upgrade breeze to 0.13.1. ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #17757 from yanboliang/flaky-test.	2017-04-26 21:34:18 +08:00
Yanbo Liang	67eef47acf	[SPARK-20449][ML] Upgrade breeze version to 0.13.1 ## What changes were proposed in this pull request? Upgrade breeze version to 0.13.1, which fixed some critical bugs of L-BFGS-B. ## How was this patch tested? Existing unit tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #17746 from yanboliang/spark-20449.	2017-04-25 17:10:41 +00:00
Bryan Cutler	44281ca81d	[SPARK-19348][PYTHON] PySpark keyword_only decorator is not thread-safe ## What changes were proposed in this pull request? The `keyword_only` decorator in PySpark is not thread-safe. It writes kwargs to a static class variable in the decorator, which is then retrieved later in the class method as `_input_kwargs`. If multiple threads are constructing the same class with different kwargs, it becomes a race condition to read from the static class variable before it's overwritten. See [SPARK-19348](https://issues.apache.org/jira/browse/SPARK-19348) for reproduction code. This change will write the kwargs to a member variable so that multiple threads can operate on separate instances without the race condition. It does not protect against multiple threads operating on a single instance, but that is better left to the user to synchronize. ## How was this patch tested? Added new unit tests for using the keyword_only decorator and a regression test that verifies `_input_kwargs` can be overwritten from different class instances. Author: Bryan Cutler <cutlerb@gmail.com> Closes #16782 from BryanCutler/pyspark-keyword_only-threadsafe-SPARK-19348.	2017-03-03 16:43:45 -08:00
Zheng RuiFeng	317fa75081	[SPARK-19421][ML][PYSPARK] Remove numClasses and numFeatures methods in LinearSVC ## What changes were proposed in this pull request? Methods `numClasses` and `numFeatures` in LinearSVCModel are already usable by inheriting `JavaClassificationModel` we should not explicitly add them. ## How was this patch tested? existing tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #16727 from zhengruifeng/nits_in_linearSVC.	2017-02-05 19:06:51 -08:00
Joseph K. Bradley	1d5d2a9d09	[SPARK-19389][ML][PYTHON][DOC] Minor doc fixes for ML Python Params and LinearSVC ## What changes were proposed in this pull request? * Removed Since tags in Python Params since they are inherited by other classes * Fixed doc links for LinearSVC ## How was this patch tested? * doc tests * generating docs locally and checking manually Author: Joseph K. Bradley <joseph@databricks.com> Closes #16723 from jkbradley/pyparam-fix-doc.	2017-02-02 11:58:46 -08:00
Bryan Cutler	57d70d26c8	[SPARK-17161][PYSPARK][ML] Add PySpark-ML JavaWrapper convenience function to create Py4J JavaArrays ## What changes were proposed in this pull request? Adding convenience function to Python `JavaWrapper` so that it is easy to create a Py4J JavaArray that is compatible with current class constructors that have a Scala `Array` as input so that it is not necessary to have a Java/Python friendly constructor. The function takes a Java class as input that is used by Py4J to create the Java array of the given class. As an example, `OneVsRest` has been updated to use this and the alternate constructor is removed. ## How was this patch tested? Added unit tests for the new convenience function and updated `OneVsRest` doctests which use this to persist the model. Author: Bryan Cutler <cutlerb@gmail.com> Closes #14725 from BryanCutler/pyspark-new_java_array-CountVectorizer-SPARK-17161.	2017-01-31 15:42:36 -08:00
wm624@hotmail.com	bb1a1fe05e	[SPARK-19336][ML][PYSPARK] LinearSVC Python API ## What changes were proposed in this pull request? Add Python API for the newly added LinearSVC algorithm. ## How was this patch tested? Add new doc string test. Author: wm624@hotmail.com <wm624@hotmail.com> Closes #16694 from wangmiao1981/ser.	2017-01-27 16:03:53 -08:00
Yuhao	9b670bcaec	[SPARK-18319][ML][QA2.1] 2.1 QA: API: Experimental, DeveloperApi, final, sealed audit ## What changes were proposed in this pull request? make a pass through the items marked as Experimental or DeveloperApi and see if any are stable enough to be unmarked. Also check for items marked final or sealed to see if they are stable enough to be opened up as APIs. Some discussions in the jira: https://issues.apache.org/jira/browse/SPARK-18319 ## How was this patch tested? existing ut Author: Yuhao <yuhao.yang@intel.com> Author: Yuhao Yang <hhbyyh@gmail.com> Closes #15972 from hhbyyh/experimental21.	2016-11-29 18:46:59 -08:00
hyukjinkwon	933a6548d4	[SPARK-18447][DOCS] Fix the markdown for `Note:`/`NOTE:`/`Note that` across Python API documentation ## What changes were proposed in this pull request? It seems in Python, there are - `Note:` - `NOTE:` - `Note that` - `.. note::` This PR proposes to fix those to `.. note::` to be consistent. Before <img width="567" alt="2016-11-21 1 18 49" src="https://cloud.githubusercontent.com/assets/6477701/20464305/85144c86-af88-11e6-8ee9-90f584dd856c.png"> <img width="617" alt="2016-11-21 12 42 43" src="https://cloud.githubusercontent.com/assets/6477701/20464263/27be5022-af88-11e6-8577-4bbca7cdf36c.png"> After <img width="554" alt="2016-11-21 1 18 42" src="https://cloud.githubusercontent.com/assets/6477701/20464306/8fe48932-af88-11e6-83e1-fc3cbf74407d.png"> <img width="628" alt="2016-11-21 12 42 51" src="https://cloud.githubusercontent.com/assets/6477701/20464264/2d3e156e-af88-11e6-93f3-cab8d8d02983.png"> ## How was this patch tested? The notes were found via ```bash grep -r "Note: " . grep -r "NOTE: " . grep -r "Note that " . ``` And then fixed one by one comparing with API documentation. After that, manually tested via `make html` under `./python/docs`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #15947 from HyukjinKwon/SPARK-18447.	2016-11-22 11:40:18 +00:00
sethah	e811fbf9ed	[SPARK-18282][ML][PYSPARK] Add python clustering summaries for GMM and BKM ## What changes were proposed in this pull request? Add model summary APIs for `GaussianMixtureModel` and `BisectingKMeansModel` in pyspark. ## How was this patch tested? Unit tests. Author: sethah <seth.hendrickson16@gmail.com> Closes #15777 from sethah/pyspark_cluster_summaries.	2016-11-21 05:36:49 -08:00
Zheng RuiFeng	9dc9f9a5dd	[SPARK-18177][ML][PYSPARK] Add missing 'subsamplingRate' of pyspark GBTClassifier ## What changes were proposed in this pull request? Add missing 'subsamplingRate' of pyspark GBTClassifier ## How was this patch tested? existing tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #15692 from zhengruifeng/gbt_subsamplingRate.	2016-11-03 07:45:20 -07:00
Felix Cheung	7c37869292	[SPARK-18110][PYTHON][ML] add missing parameter in Python for RandomForest regression and classification ## What changes were proposed in this pull request? Add subsmaplingRate to randomForestClassifier Add varianceCol to randomForestRegressor In Python ## How was this patch tested? manual tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #15638 from felixcheung/pyrandomforest.	2016-10-30 16:21:37 -07:00
WeichenXu	0d4a695279	[SPARK-17745][ML][PYSPARK] update NB python api - add weight col parameter ## What changes were proposed in this pull request? update python api for NaiveBayes: add weight col parameter. ## How was this patch tested? doctests added. Author: WeichenXu <WeichenXu123@outlook.com> Closes #15406 from WeichenXu123/nb_python_update.	2016-10-12 19:52:57 -07:00
Zheng RuiFeng	c17f971839	[SPARK-17744][ML] Parity check between the ml and mllib test suites for NB ## What changes were proposed in this pull request? 1,parity check and add missing test suites for ml's NB 2,remove some unused imports ## How was this patch tested? manual tests in spark-shell Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #15312 from zhengruifeng/nb_test_parity.	2016-10-04 06:54:48 -07:00
WeichenXu	7f16affa26	[SPARK-17138][ML][MLIB] Add Python API for multinomial logistic regression ## What changes were proposed in this pull request? Add Python API for multinomial logistic regression. - add `family` param in python api. - expose `coefficientMatrix` and `interceptVector` for `LogisticRegressionModel` - add python-side testcase for multinomial logistic regression - update python doc. ## How was this patch tested? existing and added doc tests. Author: WeichenXu <WeichenXu123@outlook.com> Closes #14852 from WeichenXu123/add_MLOR_python.	2016-09-27 00:00:21 -07:00
Yanbo Liang	39d538dddf	[MINOR][ML] Correct weights doc of MultilayerPerceptronClassificationModel. ## What changes were proposed in this pull request? ```weights``` of ```MultilayerPerceptronClassificationModel``` should be the output weights of layers rather than initial weights, this PR correct it. ## How was this patch tested? Doc change. Author: Yanbo Liang <ybliang8@gmail.com> Closes #14967 from yanboliang/mlp-weights.	2016-09-06 03:30:37 -07:00
Yanbo Liang	6b8cb1fe52	[SPARK-17197][ML][PYSPARK] PySpark LiR/LoR supports tree aggregation level configurable. ## What changes were proposed in this pull request? [SPARK-17090](https://issues.apache.org/jira/browse/SPARK-17090) makes tree aggregation level in LiR/LoR configurable, this PR makes PySpark support this function. ## How was this patch tested? Since ```aggregationDepth``` is an expert param, I'm not prefer to test it in doctest which is also used for example. Here is the offline test result: ![image](https://cloud.githubusercontent.com/assets/1962026/17879457/f83d7760-68a6-11e6-9936-d0a884d5d6ec.png) Author: Yanbo Liang <ybliang8@gmail.com> Closes #14766 from yanboliang/spark-17197.	2016-08-25 02:26:33 -07:00
Holden Karau	b264cbb16f	[SPARK-15113][PYSPARK][ML] Add missing num features num classes ## What changes were proposed in this pull request? Add missing `numFeatures` and `numClasses` to the wrapped Java models in PySpark ML pipelines. Also tag `DecisionTreeClassificationModel` as Expiremental to match Scala doc. ## How was this patch tested? Extended doctests Author: Holden Karau <holden@us.ibm.com> Closes #12889 from holdenk/SPARK-15113-add-missing-numFeatures-numClasses.	2016-08-22 12:21:22 +02:00
Yanbo Liang	ccc6dc0f4b	[MINOR][ML] Rename TreeEnsembleModels to TreeEnsembleModel for PySpark ## What changes were proposed in this pull request? Fix the typo of ```TreeEnsembleModels``` for PySpark, it should ```TreeEnsembleModel``` which will be consistent with Scala. What's more, it represents a tree ensemble model, so ```TreeEnsembleModel``` should be more reasonable. This should not be used public, so it will not involve breaking change. ## How was this patch tested? No new tests, should pass existing ones. Author: Yanbo Liang <ybliang8@gmail.com> Closes #14454 from yanboliang/TreeEnsembleModel.	2016-08-11 22:39:19 -07:00
WeichenXu	ad3708e783	[SPARK-16653][ML][OPTIMIZER] update ANN convergence tolerance param default to 1e-6 ## What changes were proposed in this pull request? replace ANN convergence tolerance param default from 1e-4 to 1e-6 so that it will be the same with other algorithms in MLLib which use LBFGS as optimizer. ## How was this patch tested? Existing Test. Author: WeichenXu <WeichenXu123@outlook.com> Closes #14286 from WeichenXu123/update_ann_tol.	2016-07-25 20:00:37 +01:00
Yanbo Liang	670891496a	[SPARK-16494][ML] Upgrade breeze version to 0.12 ## What changes were proposed in this pull request? breeze 0.12 has been released for more than half a year, and it brings lots of new features, performance improvement and bug fixes. One of the biggest features is ```LBFGS-B``` which is an implementation of ```LBFGS``` with box constraints and much faster for some special case. We would like to implement Huber loss function for ```LinearRegression``` ([SPARK-3181](https://issues.apache.org/jira/browse/SPARK-3181)) and it requires ```LBFGS-B``` as the optimization solver. So we should bump up the dependent breeze version to 0.12. For more features, improvements and bug fixes of breeze 0.12, you can refer the following link: https://groups.google.com/forum/#!topic/scala-breeze/nEeRi_DcY5c ## How was this patch tested? No new tests, should pass the existing ones. Author: Yanbo Liang <ybliang8@gmail.com> Closes #14150 from yanboliang/spark-16494.	2016-07-19 12:31:04 +01:00

1 2 3

115 commits