## What changes were proposed in this pull request?
Python API for Constrained Logistic Regression based on #17922 , thanks for the original contribution from zero323 .
## How was this patch tested?
Unit tests.
Author: zero323 <zero323@users.noreply.github.com>
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#18759 from yanboliang/SPARK-20601.
## What changes were proposed in this pull request?
GBTs inherit from HasStepSize & LInearSVC/Binarizer from HasThreshold
## How was this patch tested?
existing tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Author: Ruifeng Zheng <ruifengz@foxmail.com>
Closes#18612 from zhengruifeng/override_HasXXX.
## What changes were proposed in this pull request?
add `setWeightCol` method for OneVsRest.
`weightCol` is ignored if classifier doesn't inherit HasWeightCol trait.
## How was this patch tested?
+ [x] add an unit test.
Author: Yan Facai (颜发才) <facai.yan@gmail.com>
Closes#18554 from facaiy/BUG/oneVsRest_missing_weightCol.
## What changes were proposed in this pull request?
Added functionality for CrossValidator and TrainValidationSplit to persist nested estimators such as OneVsRest. Also added CrossValidator and TrainValidation split persistence to pyspark.
## How was this patch tested?
Performed both cross validation and train validation split with a one vs. rest estimator and tested read/write functionality of the estimator parameter maps required by these meta-algorithms.
Author: Ajay Saini <ajays725@gmail.com>
Closes#18428 from ajaysaini725/MetaAlgorithmPersistNestedEstimators.
## What changes were proposed in this pull request?
1, make param support non-final with `finalFields` option
2, generate `HasSolver` with `finalFields = false`
3, override `solver` in LiR, GLR, and make MLPC inherit `HasSolver`
## How was this patch tested?
existing tests
Author: Ruifeng Zheng <ruifengz@foxmail.com>
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#16028 from zhengruifeng/param_non_final.
## What changes were proposed in this pull request?
LinearSVC should use its own threshold param, rather than the shared one, since it applies to rawPrediction instead of probability. This PR changes the param in the Scala, Python and R APIs.
## How was this patch tested?
New unit test to make sure the threshold can be set to any Double value.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#18151 from jkbradley/ml-2.2-linearsvc-cleanup.
## What changes were proposed in this pull request?
Review new Scala APIs introduced in 2.2.
## How was this patch tested?
Existing tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#17934 from yanboliang/spark-20501.
## What changes were proposed in this pull request?
- Replace `getParam` calls with `getOrDefault` calls.
- Fix exception message to avoid unintended `TypeError`.
- Add unit tests
## How was this patch tested?
New unit tests.
Author: zero323 <zero323@users.noreply.github.com>
Closes#17891 from zero323/SPARK-20631.
## What changes were proposed in this pull request?
Some PySpark & SparkR tests run with tiny dataset and tiny ```maxIter```, which means they are not converged. I don’t think checking intermediate result during iteration make sense, and these intermediate result may vulnerable and not stable, so we should switch to check the converged result. We hit this issue at #17746 when we upgrade breeze to 0.13.1.
## How was this patch tested?
Existing tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#17757 from yanboliang/flaky-test.
## What changes were proposed in this pull request?
Upgrade breeze version to 0.13.1, which fixed some critical bugs of L-BFGS-B.
## How was this patch tested?
Existing unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#17746 from yanboliang/spark-20449.
## What changes were proposed in this pull request?
The `keyword_only` decorator in PySpark is not thread-safe. It writes kwargs to a static class variable in the decorator, which is then retrieved later in the class method as `_input_kwargs`. If multiple threads are constructing the same class with different kwargs, it becomes a race condition to read from the static class variable before it's overwritten. See [SPARK-19348](https://issues.apache.org/jira/browse/SPARK-19348) for reproduction code.
This change will write the kwargs to a member variable so that multiple threads can operate on separate instances without the race condition. It does not protect against multiple threads operating on a single instance, but that is better left to the user to synchronize.
## How was this patch tested?
Added new unit tests for using the keyword_only decorator and a regression test that verifies `_input_kwargs` can be overwritten from different class instances.
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#16782 from BryanCutler/pyspark-keyword_only-threadsafe-SPARK-19348.
## What changes were proposed in this pull request?
Methods `numClasses` and `numFeatures` in LinearSVCModel are already usable by inheriting `JavaClassificationModel`
we should not explicitly add them.
## How was this patch tested?
existing tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#16727 from zhengruifeng/nits_in_linearSVC.
## What changes were proposed in this pull request?
* Removed Since tags in Python Params since they are inherited by other classes
* Fixed doc links for LinearSVC
## How was this patch tested?
* doc tests
* generating docs locally and checking manually
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#16723 from jkbradley/pyparam-fix-doc.
## What changes were proposed in this pull request?
Adding convenience function to Python `JavaWrapper` so that it is easy to create a Py4J JavaArray that is compatible with current class constructors that have a Scala `Array` as input so that it is not necessary to have a Java/Python friendly constructor. The function takes a Java class as input that is used by Py4J to create the Java array of the given class. As an example, `OneVsRest` has been updated to use this and the alternate constructor is removed.
## How was this patch tested?
Added unit tests for the new convenience function and updated `OneVsRest` doctests which use this to persist the model.
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#14725 from BryanCutler/pyspark-new_java_array-CountVectorizer-SPARK-17161.
## What changes were proposed in this pull request?
Add Python API for the newly added LinearSVC algorithm.
## How was this patch tested?
Add new doc string test.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#16694 from wangmiao1981/ser.
## What changes were proposed in this pull request?
make a pass through the items marked as Experimental or DeveloperApi and see if any are stable enough to be unmarked. Also check for items marked final or sealed to see if they are stable enough to be opened up as APIs.
Some discussions in the jira: https://issues.apache.org/jira/browse/SPARK-18319
## How was this patch tested?
existing ut
Author: Yuhao <yuhao.yang@intel.com>
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#15972 from hhbyyh/experimental21.
## What changes were proposed in this pull request?
Add model summary APIs for `GaussianMixtureModel` and `BisectingKMeansModel` in pyspark.
## How was this patch tested?
Unit tests.
Author: sethah <seth.hendrickson16@gmail.com>
Closes#15777 from sethah/pyspark_cluster_summaries.
## What changes were proposed in this pull request?
Add missing 'subsamplingRate' of pyspark GBTClassifier
## How was this patch tested?
existing tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#15692 from zhengruifeng/gbt_subsamplingRate.
## What changes were proposed in this pull request?
Add subsmaplingRate to randomForestClassifier
Add varianceCol to randomForestRegressor
In Python
## How was this patch tested?
manual tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#15638 from felixcheung/pyrandomforest.
## What changes were proposed in this pull request?
update python api for NaiveBayes: add weight col parameter.
## How was this patch tested?
doctests added.
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#15406 from WeichenXu123/nb_python_update.
## What changes were proposed in this pull request?
1,parity check and add missing test suites for ml's NB
2,remove some unused imports
## How was this patch tested?
manual tests in spark-shell
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#15312 from zhengruifeng/nb_test_parity.
## What changes were proposed in this pull request?
Add Python API for multinomial logistic regression.
- add `family` param in python api.
- expose `coefficientMatrix` and `interceptVector` for `LogisticRegressionModel`
- add python-side testcase for multinomial logistic regression
- update python doc.
## How was this patch tested?
existing and added doc tests.
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#14852 from WeichenXu123/add_MLOR_python.
## What changes were proposed in this pull request?
```weights``` of ```MultilayerPerceptronClassificationModel``` should be the output weights of layers rather than initial weights, this PR correct it.
## How was this patch tested?
Doc change.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#14967 from yanboliang/mlp-weights.
## What changes were proposed in this pull request?
Add missing `numFeatures` and `numClasses` to the wrapped Java models in PySpark ML pipelines. Also tag `DecisionTreeClassificationModel` as Expiremental to match Scala doc.
## How was this patch tested?
Extended doctests
Author: Holden Karau <holden@us.ibm.com>
Closes#12889 from holdenk/SPARK-15113-add-missing-numFeatures-numClasses.
## What changes were proposed in this pull request?
Fix the typo of ```TreeEnsembleModels``` for PySpark, it should ```TreeEnsembleModel``` which will be consistent with Scala. What's more, it represents a tree ensemble model, so ```TreeEnsembleModel``` should be more reasonable. This should not be used public, so it will not involve breaking change.
## How was this patch tested?
No new tests, should pass existing ones.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#14454 from yanboliang/TreeEnsembleModel.
## What changes were proposed in this pull request?
replace ANN convergence tolerance param default
from 1e-4 to 1e-6
so that it will be the same with other algorithms in MLLib which use LBFGS as optimizer.
## How was this patch tested?
Existing Test.
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#14286 from WeichenXu123/update_ann_tol.
## What changes were proposed in this pull request?
breeze 0.12 has been released for more than half a year, and it brings lots of new features, performance improvement and bug fixes.
One of the biggest features is ```LBFGS-B``` which is an implementation of ```LBFGS``` with box constraints and much faster for some special case.
We would like to implement Huber loss function for ```LinearRegression``` ([SPARK-3181](https://issues.apache.org/jira/browse/SPARK-3181)) and it requires ```LBFGS-B``` as the optimization solver. So we should bump up the dependent breeze version to 0.12.
For more features, improvements and bug fixes of breeze 0.12, you can refer the following link:
https://groups.google.com/forum/#!topic/scala-breeze/nEeRi_DcY5c
## How was this patch tested?
No new tests, should pass the existing ones.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#14150 from yanboliang/spark-16494.
## What changes were proposed in this pull request?
General decisions to follow, except where noted:
* spark.mllib, pyspark.mllib: Remove all Experimental annotations. Leave DeveloperApi annotations alone.
* spark.ml, pyspark.ml
** Annotate Estimator-Model pairs of classes and companion objects the same way.
** For all algorithms marked Experimental with Since tag <= 1.6, remove Experimental annotation.
** For all algorithms marked Experimental with Since tag = 2.0, leave Experimental annotation.
* DeveloperApi annotations are left alone, except where noted.
* No changes to which types are sealed.
Exceptions where I am leaving items Experimental in spark.ml, pyspark.ml, mainly because the items are new:
* Model Summary classes
* MLWriter, MLReader, MLWritable, MLReadable
* Evaluator and subclasses: There is discussion of changes around evaluating multiple metrics at once for efficiency.
* RFormula: Its behavior may need to change slightly to match R in edge cases.
* AFTSurvivalRegression
* MultilayerPerceptronClassifier
DeveloperApi changes:
* ml.tree.Node, ml.tree.Split, and subclasses should no longer be DeveloperApi
## How was this patch tested?
N/A
Note to reviewers:
* spark.ml.clustering.LDA underwent significant changes (additional methods), so let me know if you want me to leave it Experimental.
* Be careful to check for cases where a class should no longer be Experimental but has an Experimental method, val, or other feature. I did not find such cases, but please verify.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#14147 from jkbradley/experimental-audit.
[SPARK-14615](https://issues.apache.org/jira/browse/SPARK-14615) and #12627 changed `spark.ml` pipelines to use the new `ml.linalg` classes for `Vector`/`Matrix`. Some `Since` annotations for public methods/vals have not been updated accordingly to be `2.0.0`. This PR updates them.
## How was this patch tested?
Existing unit tests.
Author: Nick Pentreath <nickp@za.ibm.com>
Closes#13840 from MLnick/SPARK-16127-ml-linalg-since.
## What changes were proposed in this pull request?
Mark ml.classification algorithms as experimental to match Scala algorithms, update PyDoc for for thresholds on `LogisticRegression` to have same level of info as Scala, and enable mathjax for PyDoc.
## How was this patch tested?
Built docs locally & PySpark SQL tests
Author: Holden Karau <holden@us.ibm.com>
Closes#12938 from holdenk/SPARK-15162-SPARK-15164-update-some-pydocs.
## What changes were proposed in this pull request?
Several places set the seed Param default value to None which will translate to a zero value on the Scala side. This is unnecessary because a default fixed value already exists and if a test depends on a zero valued seed, then it should explicitly set it to zero instead of relying on this translation. These cases can be safely removed except for the ALS doc test, which has been changed to set the seed value to zero.
## How was this patch tested?
Ran PySpark tests locally
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#13672 from BryanCutler/pyspark-cleanup-setDefault-seed-SPARK-15741.
## What changes were proposed in this pull request?
Fixed missing import for DecisionTreeRegressionModel used in GBTClassificationModel trees method.
## How was this patch tested?
Local tests
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#13787 from BryanCutler/pyspark-GBTClassificationModel-import-SPARK-16079.
## What changes were proposed in this pull request?
Now we have PySpark picklers for new and old vector/matrix, individually. However, they are all implemented under `PythonMLlibAPI`. To separate spark.mllib from spark.ml, we should implement the picklers of new vector/matrix under `spark.ml.python` instead.
## How was this patch tested?
Existing tests.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Closes#13219 from viirya/pyspark-pickler-ml.
## What changes were proposed in this pull request?
`an -> a`
Use cmds like `find . -name '*.R' | xargs -i sh -c "grep -in ' an [^aeiou]' {} && echo {}"` to generate candidates, and review them one by one.
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#13515 from zhengruifeng/an_a.
## What changes were proposed in this pull request?
MultilayerPerceptronClassifier is missing step size, solver, and weights. Add these params. Also clarify the scaladoc a bit while we are updating these params.
Eventually we should follow up and unify the HasSolver params (filed https://issues.apache.org/jira/browse/SPARK-15169 )
## How was this patch tested?
Doc tests
Author: Holden Karau <holden@us.ibm.com>
Closes#12943 from holdenk/SPARK-15168-add-missing-params-to-MultilayerPerceptronClassifier.
## What changes were proposed in this pull request?
Add `toDebugString` and `totalNumNodes` to `TreeEnsembleModels` and add `toDebugString` to `DecisionTreeModel`
## How was this patch tested?
Extended doc tests.
Author: Holden Karau <holden@us.ibm.com>
Closes#12919 from holdenk/SPARK-15139-pyspark-treeEnsemble-missing-methods.
## What changes were proposed in this pull request?
Replace SQLContext and SparkContext with SparkSession using builder pattern in python test code.
## How was this patch tested?
Existing test.
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#13242 from WeichenXu123/python_doctest_update_sparksession.
## What changes were proposed in this pull request?
Once SPARK-14487 and SPARK-14549 are merged, we will migrate to use the new vector and matrix type in the new ml pipeline based apis.
## How was this patch tested?
Unit tests
Author: DB Tsai <dbt@netflix.com>
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes#12627 from dbtsai/SPARK-14615-NewML.
## What changes were proposed in this pull request?
Add missing thresholds param to NiaveBayes
## How was this patch tested?
doctests
Author: Holden Karau <holden@us.ibm.com>
Closes#12963 from holdenk/SPARK-15188-add-missing-naive-bayes-param.
## What changes were proposed in this pull request?
PyDoc links in ml are in non-standard format. Switch to standard sphinx link format for better formatted documentation. Also add a note about default value in one place. Copy some extended docs from scala for GBT
## How was this patch tested?
Built docs locally.
Author: Holden Karau <holden@us.ibm.com>
Closes#12918 from holdenk/SPARK-15137-linkify-pyspark-ml-classification.
## What changes were proposed in this pull request?
PySpark ML Params setter code clean up.
For examples,
```setInputCol``` can be simplified from
```
self._set(inputCol=value)
return self
```
to:
```
return self._set(inputCol=value)
```
This is a pretty big sweeps, and we cleaned wherever possible.
## How was this patch tested?
Exist unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#12749 from yanboliang/spark-14971.
## What changes were proposed in this pull request?
This PR is an update for [https://github.com/apache/spark/pull/12738] which:
* Adds a generic unit test for JavaParams wrappers in pyspark.ml for checking default Param values vs. the defaults in the Scala side
* Various fixes for bugs found
* This includes changing classes taking weightCol to treat unset and empty String Param values the same way.
Defaults changed:
* Scala
* LogisticRegression: weightCol defaults to not set (instead of empty string)
* StringIndexer: labels default to not set (instead of empty array)
* GeneralizedLinearRegression:
* maxIter always defaults to 25 (simpler than defaulting to 25 for a particular solver)
* weightCol defaults to not set (instead of empty string)
* LinearRegression: weightCol defaults to not set (instead of empty string)
* Python
* MultilayerPerceptron: layers default to not set (instead of [1,1])
* ChiSqSelector: numTopFeatures defaults to 50 (instead of not set)
## How was this patch tested?
Generic unit test. Manually tested that unit test by changing defaults and verifying that broke the test.
Author: Joseph K. Bradley <joseph@databricks.com>
Author: yinxusen <yinxusen@gmail.com>
Closes#12816 from jkbradley/yinxusen-SPARK-14931.
#### What changes were proposed in this pull request?
This PR removes three methods the were deprecated in 1.6.0:
- `PortableDataStream.close()`
- `LinearRegression.weights`
- `LogisticRegression.weights`
The rationale for doing this is that the impact is small and that Spark 2.0 is a major release.
#### How was this patch tested?
Compilation succeded.
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#12732 from hvanhovell/SPARK-14952.
## What changes were proposed in this pull request?
This patch provides a first cut of python APIs for structured streaming. This PR provides the new classes:
- ContinuousQuery
- Trigger
- ProcessingTime
in pyspark under `pyspark.sql.streaming`.
In addition, it contains the new methods added under:
- `DataFrameWriter`
a) `startStream`
b) `trigger`
c) `queryName`
- `DataFrameReader`
a) `stream`
- `DataFrame`
a) `isStreaming`
This PR doesn't contain all methods exposed for `ContinuousQuery`, for example:
- `exception`
- `sourceStatuses`
- `sinkStatus`
They may be added in a follow up.
This PR also contains some very minor doc fixes in the Scala side.
## How was this patch tested?
Python doc tests
TODO:
- [ ] verify Python docs look good
Author: Burak Yavuz <brkyvz@gmail.com>
Author: Burak Yavuz <burak@databricks.com>
Closes#12320 from brkyvz/stream-python.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-14306
Add PySpark OneVsRest save/load supports.
## How was this patch tested?
Test with Python unit test.
Author: Xusen Yin <yinxusen@gmail.com>
Closes#12439 from yinxusen/SPARK-14306-0415.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-7861
Add PySpark OneVsRest. I implement it with Python since it's a meta-pipeline.
## How was this patch tested?
Test with doctest.
Author: Xusen Yin <yinxusen@gmail.com>
Closes#12124 from yinxusen/SPARK-14306-7861.
## What changes were proposed in this pull request?
Param setters in python previously accessed the _paramMap directly to update values. The `_set` method now implements type checking, so it should be used to update all parameters. This PR eliminates all direct accesses to `_paramMap` besides the one in the `_set` method to ensure type checking happens.
Additional changes:
* [SPARK-13068](https://github.com/apache/spark/pull/11663) missed adding type converters in evaluation.py so those are done here
* An incorrect `toBoolean` type converter was used for StringIndexer `handleInvalid` param in previous PR. This is fixed here.
## How was this patch tested?
Existing unit tests verify that parameters are still set properly. No new functionality is actually added in this PR.
Author: sethah <seth.hendrickson16@gmail.com>
Closes#11939 from sethah/SPARK-14104.
## What changes were proposed in this pull request?
PySpark ml GBTClassifier, Regressor support export/import.
## How was this patch tested?
Doc test.
cc jkbradley
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#12383 from yanboliang/spark-14374.