## What changes were proposed in this pull request?
Adding Python API for training summaries of LogisticRegression and LinearRegression in PySpark ML.
## How was this patch tested?
Added unit tests to exercise the api calls for the summary classes. Also, manually verified values are expected and match those from Scala directly.
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#11621 from BryanCutler/pyspark-ml-summary-SPARK-13430.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-13786
Add save/load for Python CrossValidator/Model and TrainValidationSplit/Model.
## How was this patch tested?
Test with Python doctest.
Author: Xusen Yin <yinxusen@gmail.com>
Closes#12020 from yinxusen/SPARK-13786.
## What changes were proposed in this pull request?
PySpark ml.clustering BisectingKMeans support export/import
## How was this patch tested?
doc test.
cc jkbradley
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#12112 from yanboliang/spark-14305.
1.Implement LossFunction trait and implement squared error and cross entropy
loss with it
2.Implement unit test for gradient and loss
3.Implement InPlace trait and in-place layer evaluation
4.Refactor interface for ActivationFunction
5.Update of Layer and LayerModel interfaces
6.Fix random weights assignment
7.Implement memory allocation by MLP model instead of individual layers
These features decreased the memory usage and increased flexibility of
internal API.
Author: Alexander Ulanov <nashb@yandex.ru>
Author: avulanov <avulanov@gmail.com>
Closes#9229 from avulanov/mlp-refactoring.
## What changes were proposed in this pull request?
Feature importances are exposed in the python API for GBTs.
Other changes:
* Update the random forest feature importance documentation to not repeat decision tree docstring and instead place a reference to it.
## How was this patch tested?
Python doc tests were updated to validate GBT feature importance.
Author: sethah <seth.hendrickson16@gmail.com>
Closes#12056 from sethah/Pyspark_GBT_feature_importance.
## What changes were proposed in this pull request?
```MultilayerPerceptronClassifier``` supports save/load for Python API.
## How was this patch tested?
doctest.
cc mengxr jkbradley yinxusen
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#11952 from yanboliang/spark-14152.
Add property to MLWritable.write method, so we can use .write instead of .write()
Add a new test to ml/test.py to check whether the write is a property.
./python/run-tests --python-executables=python2.7 --modules=pyspark-ml
Will test against the following Python executables: ['python2.7']
Will test the following Python modules: ['pyspark-ml']
Finished test(python2.7): pyspark.ml.evaluation (11s)
Finished test(python2.7): pyspark.ml.clustering (16s)
Finished test(python2.7): pyspark.ml.classification (24s)
Finished test(python2.7): pyspark.ml.recommendation (24s)
Finished test(python2.7): pyspark.ml.feature (39s)
Finished test(python2.7): pyspark.ml.regression (26s)
Finished test(python2.7): pyspark.ml.tuning (15s)
Finished test(python2.7): pyspark.ml.tests (30s)
Tests passed in 55 seconds
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#11945 from wangmiao1981/fix_property.
## What changes were proposed in this pull request?
Added MLReadable and MLWritable to Decision Tree Classifier and Regressor. Added doctests.
## How was this patch tested?
Python Unit tests. Tests added to check persistence in DecisionTreeClassifier and DecisionTreeRegressor.
Author: GayathriMurali <gayathri.m.softie@gmail.com>
Closes#11892 from GayathriMurali/SPARK-13949.
## What changes were proposed in this pull request?
GBTs in pyspark previously had seed parameters, but they could not be passed as keyword arguments through the class constructor. This patch adds seed as a keyword argument and also sets default value.
## How was this patch tested?
Doc tests were updated to pass a random seed through the GBTClassifier and GBTRegressor constructors.
Author: sethah <seth.hendrickson16@gmail.com>
Closes#11944 from sethah/SPARK-14107.
Primary change:
* Removed spark.mllib.tree.DecisionTree implementation of tree and forest learning.
* spark.mllib now calls the spark.ml implementation.
* Moved unit tests (of tree learning internals) from spark.mllib to spark.ml as needed.
ml.tree.DecisionTreeModel
* Added toOld and made ```private[spark]```, implemented for Classifier and Regressor in subclasses. These methods now use OldInformationGainStats.invalidInformationGainStats for LeafNodes in order to mimic the spark.mllib implementation.
ml.tree.Node
* Added ```private[tree] def deepCopy```, used by unit tests
Copied developer comments from spark.mllib implementation to spark.ml one.
Moving unit tests
* Tree learning internals were tested by spark.mllib.tree.DecisionTreeSuite, or spark.mllib.tree.RandomForestSuite.
* Those tests were all moved to spark.ml.tree.impl.RandomForestSuite. The order in the file + the test names are the same, so you should be able to compare them by opening them in 2 windows side-by-side.
* I made minimal changes to each test to allow it to run. Each test makes the same checks as before, except for a few removed assertions which were checking irrelevant values.
* No new unit tests were added.
* mllib.tree.DecisionTreeSuite: I removed some checks of splits and bins which were not relevant to the unit tests they were in. Those same split calculations were already being tested in other unit tests, for each dataset type.
**Changes of behavior** (to be noted in SPARK-13448 once this PR is merged)
* spark.ml.tree.impl.RandomForest: Rather than throwing an error when maxMemoryInMB is set to too small a value (to split any node), we now allow 1 node to be split, even if its memory requirements exceed maxMemoryInMB. This involved removing the maxMemoryPerNode check in RandomForest.run, as well as modifying selectNodesToSplit(). Once this PR is merged, I will note the change of behavior on SPARK-13448.
* spark.mllib.tree.DecisionTree: When a tree only has one node (root = leaf node), the "stats" field will now be empty, rather than being set to InformationGainStats.invalidInformationGainStats. This does not remove information from the tree, and it will save a bit of storage.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#11855 from jkbradley/remove-mllib-tree-impl.
## What changes were proposed in this pull request?
This patch adds type conversion functionality for parameters in Pyspark. A `typeConverter` field is added to the constructor of `Param` class. This argument is a function which converts values passed to this param to the appropriate type if possible. This is beneficial so that the params can fail at set time if they are given inappropriate values, but even more so because coherent error messages are now provided when Py4J cannot cast the python type to the appropriate Java type.
This patch also adds a `TypeConverters` class with factory methods for common type conversions. Most of the changes involve adding these factory type converters to existing params. The previous solution to this issue, `expectedType`, is deprecated and can be removed in 2.1.0 as discussed on the Jira.
## How was this patch tested?
Unit tests were added in python/pyspark/ml/tests.py to test parameter type conversion. These tests check that values that should be convertible are converted correctly, and that the appropriate errors are thrown when invalid values are provided.
Author: sethah <seth.hendrickson16@gmail.com>
Closes#11663 from sethah/SPARK-13068-tc.
Adds support for saving and loading nested ML Pipelines from Python. Pipeline and PipelineModel do not extend JavaWrapper, but they are able to utilize the JavaMLWriter, JavaMLReader implementations.
Also:
* Separates out interfaces from Java wrapper implementations for MLWritable, MLReadable, MLWriter, MLReader.
* Moves methods _stages_java2py, _stages_py2java into Pipeline, PipelineModel as _transfer_stage_from_java, _transfer_stage_to_java
Added new unit test for nested Pipelines. Abstracted validity check into a helper method for the 2 unit tests.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#11866 from jkbradley/nested-pipeline-io.
Closes#11835
## What changes were proposed in this pull request?
In PySpark wrapper.py JavaWrapper change _java_obj from an unused static variable to a member variable that is consistent with usage in derived classes.
## How was this patch tested?
Ran python tests for ML and MLlib.
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#11767 from BryanCutler/JavaWrapper-static-_java_obj-SPARK-13937.
## What changes were proposed in this pull request?
Add export/import for all estimators and transformers(which have Scala implementation) under pyspark/ml/classification.py.
## How was this patch tested?
./python/run-tests
./dev/lint-python
Unit tests added to check persistence in Logistic Regression
Author: GayathriMurali <gayathri.m.softie@gmail.com>
Closes#11707 from GayathriMurali/SPARK-13034.
## What changes were proposed in this pull request?
JIRA issue: https://issues.apache.org/jira/browse/SPARK-13038
1. Add load/save to PySpark Pipeline and PipelineModel
2. Add `_transfer_stage_to_java()` and `_transfer_stage_from_java()` for `JavaWrapper`.
## How was this patch tested?
Test with doctest.
Author: Xusen Yin <yinxusen@gmail.com>
Closes#11683 from yinxusen/SPARK-13038-only.
## What changes were proposed in this pull request?
This patch adds a `featureImportance` property to the Pyspark API for `DecisionTreeRegressionModel`, `DecisionTreeClassificationModel`, `RandomForestRegressionModel` and `RandomForestClassificationModel`.
## How was this patch tested?
Python doc tests for the affected classes were updated to check feature importances.
Author: sethah <seth.hendrickson16@gmail.com>
Closes#11622 from sethah/SPARK-13787.
## What changes were proposed in this pull request?
Added a check in pyspark.ml.param.Param.params() to see if an attribute is a property (decorated with `property`) before checking if it is a `Param` instance. This prevents the property from being invoked to 'get' this attribute, which could possibly cause an error.
## How was this patch tested?
Added a test case with a class has a property that will raise an error when invoked and then call`Param.params` to verify that the property is not invoked, but still able to find another property in the class. Also ran pyspark-ml test before fix that will trigger an error, and again after the fix to verify that the error was resolved and the method was working properly.
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#11476 from BryanCutler/pyspark-ml-property-attr-SPARK-13625.
Add save/load for feature.py. Meanwhile, add save/load for `ElementwiseProduct` in Scala side and fix a bug of missing `setDefault` in `VectorSlicer` and `StopWordsRemover`.
In this PR I ignore the `RFormula` and `RFormulaModel` because its Scala implementation is pending in https://github.com/apache/spark/pull/9884. I'll add them in this PR if https://github.com/apache/spark/pull/9884 gets merged first. Or add a follow-up JIRA for `RFormula`.
Author: Xusen Yin <yinxusen@gmail.com>
Closes#11203 from yinxusen/SPARK-13036.
## What changes were proposed in this pull request?
The default value of regularization parameter for `LogisticRegression` algorithm is different in Scala and Python. We should provide the same value.
**Scala**
```
scala> new org.apache.spark.ml.classification.LogisticRegression().getRegParam
res0: Double = 0.0
```
**Python**
```
>>> from pyspark.ml.classification import LogisticRegression
>>> LogisticRegression().getRegParam()
0.1
```
## How was this patch tested?
manual. Check the following in `pyspark`.
```
>>> from pyspark.ml.classification import LogisticRegression
>>> LogisticRegression().getRegParam()
0.0
```
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11519 from dongjoon-hyun/SPARK-13676.
## What changes were proposed in this pull request?
The changes proposed were to add train-validation-split to pyspark.ml.tuning.
## How was the this patch tested?
This patch was tested through unit tests located in pyspark/ml/test.py.
This is my original work and I license it to Spark.
Author: JeremyNixon <jnixon2@gmail.com>
Closes#11335 from JeremyNixon/tvs_pyspark.
This is to fix a long-time annoyance: Whenever we add a new algorithm to pyspark.ml, we have to add it to the ```__all__``` list at the top. Since we keep it alphabetized, it often creates a lot more changes than needed. It is also easy to add the Estimator and forget the Model. I'm going to switch it to have one algorithm per line.
This also alphabetizes a few out-of-place classes in pyspark.ml.feature. No changes have been made to the moved classes.
CC: thunterdb
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#10927 from jkbradley/ml-python-all-list.
## What changes were proposed in this pull request?
After SPARK-13028, we should add Python API for MaxAbsScaler.
## How was this patch tested?
unit test
Author: zlpmichelle <zlpmichelle@gmail.com>
Closes#11393 from zlpmichelle/master.
Add export/import for all estimators and transformers(which have Scala implementation) under pyspark/ml/regression.py.
yanboliang Please help to review.
For doctest, I though it's enough to add one since it's common usage. But I can add to all if we want it.
Author: Tommy YU <tummyyu@163.com>
Closes#11000 from Wenpei/spark-13033-ml.regression-exprot-import and squashes the following commits:
3646b36 [Tommy YU] address review comments
9cddc98 [Tommy YU] change base on review and pr 11197
cc61d9d [Tommy YU] remove default parameter set
19535d4 [Tommy YU] add export/import to regression
44a9dc2 [Tommy YU] add import/export for ml.regression
## What changes were proposed in this pull request?
QuantileDiscretizer in Python should also specify a random seed.
## How was this patch tested?
unit tests
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes#11362 from yu-iskw/SPARK-13292 and squashes the following commits:
02ffa76 [Yu ISHIKAWA] [SPARK-13292][ML][PYTHON] QuantileDiscretizer should take random seed in PySpark
add support of arbitrary length sentence by using the nature representation of sentences in the input.
add new similarity functions and add normalization option for distances in synonym finding
add new accessor for internal structure(the vocabulary and wordindex) for convenience
need instructions about how to set value for the Since annotation for newly added public functions. 1.5.3?
jira link: https://issues.apache.org/jira/browse/SPARK-12153
Author: Yong Gang Cao <ygcao@amazon.com>
Author: Yong-Gang Cao <ygcao@users.noreply.github.com>
Closes#10152 from ygcao/improvementForSentenceBoundary.
Some of the new doctests in ml/clustering.py have a lot of setup code, move the setup code to the general test init to keep the doctest more example-style looking.
In part this is a follow up to https://github.com/apache/spark/pull/10999
Note that the same pattern is followed in regression & recommendation - might as well clean up all three at the same time.
Author: Holden Karau <holden@us.ibm.com>
Closes#11197 from holdenk/SPARK-13302-cleanup-doctests-in-ml-clustering.
Fix this defect by check default value exist or not.
yanboliang Please help to review.
Author: Tommy YU <tummyyu@163.com>
Closes#11043 from Wenpei/spark-13153-handle-param-withnodefaultvalue.
Pyspark Params class has a method `hasParam(paramName)` which returns `True` if the class has a parameter by that name, but throws an `AttributeError` otherwise. There is not currently a way of getting a Boolean to indicate if a class has a parameter. With Spark 2.0 we could modify the existing behavior of `hasParam` or add an additional method with this functionality.
In Python:
```python
from pyspark.ml.classification import NaiveBayes
nb = NaiveBayes()
print nb.hasParam("smoothing")
print nb.hasParam("notAParam")
```
produces:
> True
> AttributeError: 'NaiveBayes' object has no attribute 'notAParam'
However, in Scala:
```scala
import org.apache.spark.ml.classification.NaiveBayes
val nb = new NaiveBayes()
nb.hasParam("smoothing")
nb.hasParam("notAParam")
```
produces:
> true
> false
cc holdenk
Author: sethah <seth.hendrickson16@gmail.com>
Closes#10962 from sethah/SPARK-13047.
* Implement ```MLWriter/MLWritable/MLReader/MLReadable``` for PySpark.
* Making ```LinearRegression``` to support ```save/load``` as example. After this merged, the work for other transformers/estimators will be easy, then we can list and distribute the tasks to the community.
cc mengxr jkbradley
Author: Yanbo Liang <ybliang8@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#10469 from yanboliang/spark-11939.
The current python ml params require cut-and-pasting the param setup and description between the class & ```__init__``` methods. Remove this possible case of errors & simplify use of custom params by adding a ```_copy_new_parent``` method to param so as to avoid cut and pasting (and cut and pasting at different indentation levels urgh).
Author: Holden Karau <holden@us.ibm.com>
Closes#10216 from holdenk/SPARK-10509-excessive-param-boiler-plate-code.
Add Python API for ml.feature.QuantileDiscretizer.
One open question: Do we want to do this stuff to re-use the java model, create a new model, or use a different wrapper around the java model.
cc brkyvz & mengxr
Author: Holden Karau <holden@us.ibm.com>
Closes#10085 from holdenk/SPARK-11937-SPARK-11922-Python-API-for-ml.feature.QuantileDiscretizer.
```PCAModel``` can output ```explainedVariance``` at Python side.
cc mengxr srowen
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#10830 from yanboliang/spark-12905.
This is #9263 from gliptak (improving grouping/display of test case results) with a small fix of bisecting k-means unit test.
Author: Gábor Lipták <gliptak@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes#10850 from mengxr/SPARK-11295.
This PR aims to allow the prediction column of `BinaryClassificationEvaluator` to be of double type.
Author: BenFradet <benjamin.fradet@gmail.com>
Closes#10472 from BenFradet/SPARK-9716.
SPARK-11295 Add packages to JUnit output for Python tests
This improves grouping/display of test case results.
Author: Gábor Lipták <gliptak@gmail.com>
Closes#9263 from gliptak/SPARK-11295.
Add PySpark missing methods and params for ml.feature:
* ```RegexTokenizer``` should support setting ```toLowercase```.
* ```MinMaxScalerModel``` should support output ```originalMin``` and ```originalMax```.
* ```PCAModel``` should support output ```pc```.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#9908 from yanboliang/spark-11925.
PySpark ```DecisionTreeClassifier``` & ```DecisionTreeRegressor``` should support ```setSeed``` like what we do at Scala side.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#9807 from yanboliang/spark-11815.
Add ```computeCost``` to ```KMeansModel``` as evaluator for PySpark spark.ml.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#9931 from yanboliang/SPARK-11945.
From JIRA:
Currently, PySpark wrappers for spark.ml Scala classes are brittle when accepting Param types. E.g., Normalizer's "p" param cannot be set to "2" (an integer); it must be set to "2.0" (a float). Fixing this is not trivial since there does not appear to be a natural place to insert the conversion before Python wrappers call Java's Params setter method.
A possible fix will be to include a method "_checkType" to PySpark's Param class which checks the type, prints an error if needed, and converts types when relevant (e.g., int to float, or scipy matrix to array). The Java wrapper method which copies params to Scala can call this method when available.
This fix instead checks the types at set time since I think failing sooner is better, but I can switch it around to check at copy time if that would be better. So far this only converts int to float and other conversions (like scipymatrix to array) are left for the future.
Author: Holden Karau <holden@us.ibm.com>
Closes#9581 from holdenk/SPARK-7675-PySpark-sparkml-Params-type-conversion.
No jira is created since this is a trivial change.
davies Please help review it
Author: Jeff Zhang <zjffdu@apache.org>
Closes#10143 from zjffdu/pyspark_typo.
Extend CrossValidator with HasSeed in PySpark.
This PR replaces [https://github.com/apache/spark/pull/7997]
CC: yanboliang thunterdb mmenestret Would one of you mind taking a look? Thanks!
Author: Joseph K. Bradley <joseph@databricks.com>
Author: Martin MENESTRET <mmenestret@ippon.fr>
Closes#10268 from jkbradley/pyspark-cv-seed.
Use ```coefficients``` replace ```weights```, I wish they are the last two.
mengxr
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#10065 from yanboliang/coefficients.