## What changes were proposed in this pull request?
make a pass through the items marked as Experimental or DeveloperApi and see if any are stable enough to be unmarked. Also check for items marked final or sealed to see if they are stable enough to be opened up as APIs.
Some discussions in the jira: https://issues.apache.org/jira/browse/SPARK-18319
## How was this patch tested?
existing ut
Author: Yuhao <yuhao.yang@intel.com>
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#15972 from hhbyyh/experimental21.
## What changes were proposed in this pull request?
Remove deprecated methods for ML.
## How was this patch tested?
Existing tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#15913 from yanboliang/spark-18481.
## What changes were proposed in this pull request?
Add model summary APIs for `GaussianMixtureModel` and `BisectingKMeansModel` in pyspark.
## How was this patch tested?
Unit tests.
Author: sethah <seth.hendrickson16@gmail.com>
Closes#15777 from sethah/pyspark_cluster_summaries.
## What changes were proposed in this pull request?
Gradient Boosted Tree in R.
With a few minor improvements to RandomForest in R.
Since this is relatively isolated I'd like to target this for branch-2.1
## How was this patch tested?
manual tests, unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#15746 from felixcheung/rgbt.
## What changes were proposed in this pull request?
Add missing 'subsamplingRate' of pyspark GBTClassifier
## How was this patch tested?
existing tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#15692 from zhengruifeng/gbt_subsamplingRate.
## What changes were proposed in this pull request?
- Renamed kbest to numTopFeatures
- Renamed alpha to fpr
- Added missing Since annotations
- Doc cleanups
## How was this patch tested?
Added new standardized unit tests for spark.ml.
Improved existing unit test coverage a bit.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#15647 from jkbradley/chisqselector-follow-ups.
## What changes were proposed in this pull request?
Add subsmaplingRate to randomForestClassifier
Add varianceCol to randomForestRegressor
In Python
## How was this patch tested?
manual tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#15638 from felixcheung/pyrandomforest.
## What changes were proposed in this pull request?
This PR is an enhancement of PR with commit ID:57dc326bd00cf0a49da971e9c573c48ae28acaa2.
NaN is a special type of value which is commonly seen as invalid. But We find that there are certain cases where NaN are also valuable, thus need special handling. We provided user when dealing NaN values with 3 options, to either reserve an extra bucket for NaN values, or remove the NaN values, or report an error, by setting handleNaN "keep", "skip", or "error"(default) respectively.
'''Before:
val bucketizer: Bucketizer = new Bucketizer()
.setInputCol("feature")
.setOutputCol("result")
.setSplits(splits)
'''After:
val bucketizer: Bucketizer = new Bucketizer()
.setInputCol("feature")
.setOutputCol("result")
.setSplits(splits)
.setHandleNaN("keep")
## How was this patch tested?
Tests added in QuantileDiscretizerSuite, BucketizerSuite and DataFrameStatSuite
Signed-off-by: VinceShieh <vincent.xieintel.com>
Author: VinceShieh <vincent.xie@intel.com>
Author: Vincent Xie <vincent.xie@intel.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#15428 from VinceShieh/spark-17219_followup.
## What changes were proposed in this pull request?
For feature selection method ChiSquareSelector, it is based on the ChiSquareTestResult.statistic (ChiSqure value) to select the features. It select the features with the largest ChiSqure value. But the Degree of Freedom (df) of ChiSqure value is different in Statistics.chiSqTest(RDD), and for different df, you cannot base on ChiSqure value to select features.
So we change statistic to pValue for SelectKBest and SelectPercentile
## How was this patch tested?
change existing test
Author: Peng <peng.meng@intel.com>
Closes#15444 from mpjlu/chisqure-bug.
## What changes were proposed in this pull request?
Since ```ml.evaluation``` has supported save/load at Scala side, supporting it at Python side is very straightforward and easy.
## How was this patch tested?
Add python doctest.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#13194 from yanboliang/spark-15402.
## What changes were proposed in this pull request?
Follow-up work of #13675, add Python API for ```RFormula forceIndexLabel```.
## How was this patch tested?
Unit test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#15430 from yanboliang/spark-15957-python.
## What changes were proposed in this pull request?
update python api for NaiveBayes: add weight col parameter.
## How was this patch tested?
doctests added.
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#15406 from WeichenXu123/nb_python_update.
## What changes were proposed in this pull request?
1,parity check and add missing test suites for ml's NB
2,remove some unused imports
## How was this patch tested?
manual tests in spark-shell
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#15312 from zhengruifeng/nb_test_parity.
## What changes were proposed in this pull request?
Replaces` ValueError` with `IndexError` when index passed to `ml` / `mllib` `SparseVector.__getitem__` is out of range. This ensures correct iteration behavior.
Replaces `ValueError` with `IndexError` for `DenseMatrix` and `SparkMatrix` in `ml` / `mllib`.
## How was this patch tested?
PySpark `ml` / `mllib` unit tests. Additional unit tests to prove that the problem has been resolved.
Author: zero323 <zero323@users.noreply.github.com>
Closes#15144 from zero323/SPARK-17587.
## What changes were proposed in this pull request?
Partial revert of #15277 to instead sort and store input to model rather than require sorted input
## How was this patch tested?
Existing tests.
Author: Sean Owen <sowen@cloudera.com>
Closes#15299 from srowen/SPARK-17704.2.
## What changes were proposed in this pull request?
Add Python API for multinomial logistic regression.
- add `family` param in python api.
- expose `coefficientMatrix` and `interceptVector` for `LogisticRegressionModel`
- add python-side testcase for multinomial logistic regression
- update python doc.
## How was this patch tested?
existing and added doc tests.
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#14852 from WeichenXu123/add_MLOR_python.
## What changes were proposed in this pull request?
#14597 modified ```ChiSqSelector``` to support ```fpr``` type selector, however, it left some issue need to be addressed:
* We should allow users to set selector type explicitly rather than switching them by using different setting function, since the setting order will involves some unexpected issue. For example, if users both set ```numTopFeatures``` and ```percentile```, it will train ```kbest``` or ```percentile``` model based on the order of setting (the latter setting one will be trained). This make users confused, and we should allow users to set selector type explicitly. We handle similar issues at other place of ML code base such as ```GeneralizedLinearRegression``` and ```LogisticRegression```.
* Meanwhile, if there are more than one parameter except ```alpha``` can be set for ```fpr``` model, we can not handle it elegantly in the existing framework. And similar issues for ```kbest``` and ```percentile``` model. Setting selector type explicitly can solve this issue also.
* If setting selector type explicitly by users is allowed, we should handle param interaction such as if users set ```selectorType = percentile``` and ```alpha = 0.1```, we should notify users the parameter ```alpha``` will take no effect. We should handle complex parameter interaction checks at ```transformSchema```. (FYI #11620)
* We should use lower case of the selector type names to follow MLlib convention.
* Add ML Python API.
## How was this patch tested?
Unit test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#15214 from yanboliang/spark-17017.
## What changes were proposed in this pull request?
Match ProbabilisticClassifer.thresholds requirements to R randomForest cutoff, requiring all > 0
## How was this patch tested?
Jenkins tests plus new test cases
Author: Sean Owen <sowen@cloudera.com>
Closes#15149 from srowen/SPARK-17057.
## What changes were proposed in this pull request?
Add treeAggregateDepth parameter for AFTSurvivalRegression to keep consistent with LiR/LoR.
## How was this patch tested?
Existing tests.
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#14851 from WeichenXu123/add_treeAggregate_param_for_survival_regression.
## What changes were proposed in this pull request?
This PR fixes an issue when Bucketizer is called to handle a dataset containing NaN value.
Sometimes, null value might also be useful to users, so in these cases, Bucketizer should
reserve one extra bucket for NaN values, instead of throwing an illegal exception.
Before:
```
Bucketizer.transform on NaN value threw an illegal exception.
```
After:
```
NaN values will be grouped in an extra bucket.
```
## How was this patch tested?
New test cases added in `BucketizerSuite`.
Signed-off-by: VinceShieh <vincent.xieintel.com>
Author: VinceShieh <vincent.xie@intel.com>
Closes#14858 from VinceShieh/spark-17219.
## What changes were proposed in this pull request?
#14956 reduced default k-means|| init steps to 2 from 5 only for spark.mllib package, we should also do same change for spark.ml and PySpark.
## How was this patch tested?
Existing tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#15050 from yanboliang/spark-17389.
## What changes were proposed in this pull request?
```weights``` of ```MultilayerPerceptronClassificationModel``` should be the output weights of layers rather than initial weights, this PR correct it.
## How was this patch tested?
Doc change.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#14967 from yanboliang/mlp-weights.
## What changes were proposed in this pull request?
Add missing `numFeatures` and `numClasses` to the wrapped Java models in PySpark ML pipelines. Also tag `DecisionTreeClassificationModel` as Expiremental to match Scala doc.
## How was this patch tested?
Extended doctests
Author: Holden Karau <holden@us.ibm.com>
Closes#12889 from holdenk/SPARK-15113-add-missing-numFeatures-numClasses.
## What changes were proposed in this pull request?
When fitting a PySpark Pipeline without the `stages` param set, a confusing NoneType error is raised as attempts to iterate over the pipeline stages. A pipeline with no stages should act as an identity transform, however the `stages` param still needs to be set to an empty list. This change improves the error output when the `stages` param is not set and adds a better description of what the API expects as input. Also minor cleanup of related code.
## How was this patch tested?
Added new unit tests to verify an empty Pipeline acts as an identity transformer
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#12790 from BryanCutler/pipeline-identity-SPARK-15018.
## What changes were proposed in this pull request?
1. In scala, add negative low bound checking and put all the low/upper bound checking in one place
2. In python, add low/upper bound checking of indices.
## How was this patch tested?
unit test added
Author: Jeff Zhang <zjffdu@apache.org>
Closes#14555 from zjffdu/SPARK-16965.
JIRA issue link:
https://issues.apache.org/jira/browse/SPARK-16961
Changed one line of Utils.randomizeInPlace to allow elements to stay in place.
Created a unit test that runs a Pearson's chi squared test to determine whether the output diverges significantly from a uniform distribution.
Author: Nick Lavers <nick.lavers@videoamp.com>
Closes#14551 from nicklavers/SPARK-16961-randomizeInPlace.
## What changes were proposed in this pull request?
Fix the typo of ```TreeEnsembleModels``` for PySpark, it should ```TreeEnsembleModel``` which will be consistent with Scala. What's more, it represents a tree ensemble model, so ```TreeEnsembleModel``` should be more reasonable. This should not be used public, so it will not involve breaking change.
## How was this patch tested?
No new tests, should pass existing ones.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#14454 from yanboliang/TreeEnsembleModel.
## What changes were proposed in this pull request?
avgMetrics was summed, not averaged, across folds
Author: =^_^= <maxmoroz@gmail.com>
Closes#14456 from pkch/pkch-patch-1.
## What changes were proposed in this pull request?
Updated ML pipeline Cross Validation Scaladoc & PyDoc.
## How was this patch tested?
Documentation update
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
Author: krishnakalyan3 <krishnakalyan3@gmail.com>
Closes#13894 from krishnakalyan3/kfold-cv.
## What changes were proposed in this pull request?
replace ANN convergence tolerance param default
from 1e-4 to 1e-6
so that it will be the same with other algorithms in MLLib which use LBFGS as optimizer.
## How was this patch tested?
Existing Test.
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#14286 from WeichenXu123/update_ann_tol.
## What changes were proposed in this pull request?
add `SparseMatrix` class whick support pickler.
## How was this patch tested?
Existing test.
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#14265 from WeichenXu123/picklable_py.
## What changes were proposed in this pull request?
breeze 0.12 has been released for more than half a year, and it brings lots of new features, performance improvement and bug fixes.
One of the biggest features is ```LBFGS-B``` which is an implementation of ```LBFGS``` with box constraints and much faster for some special case.
We would like to implement Huber loss function for ```LinearRegression``` ([SPARK-3181](https://issues.apache.org/jira/browse/SPARK-3181)) and it requires ```LBFGS-B``` as the optimization solver. So we should bump up the dependent breeze version to 0.12.
For more features, improvements and bug fixes of breeze 0.12, you can refer the following link:
https://groups.google.com/forum/#!topic/scala-breeze/nEeRi_DcY5c
## How was this patch tested?
No new tests, should pass the existing ones.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#14150 from yanboliang/spark-16494.
## What changes were proposed in this pull request?
Made DataFrame-based API primary
* Spark doc menu bar and other places now link to ml-guide.html, not mllib-guide.html
* mllib-guide.html keeps RDD-specific list of features, with a link at the top redirecting people to ml-guide.html
* ml-guide.html includes a "maintenance mode" announcement about the RDD-based API
* **Reviewers: please check this carefully**
* (minor) Titles for DF API no longer include "- spark.ml" suffix. Titles for RDD API have "- RDD-based API" suffix
* Moved migration guide to ml-guide from mllib-guide
* Also moved past guides from mllib-migration-guides to ml-migration-guides, with a redirect link on mllib-migration-guides
* **Reviewers**: I did not change any of the content of the migration guides.
Reorganized DataFrame-based guide:
* ml-guide.html mimics the old mllib-guide.html page in terms of content: overview, migration guide, etc.
* Moved Pipeline description into ml-pipeline.html and moved tuning into ml-tuning.html
* **Reviewers**: I did not change the content of these guides, except some intro text.
* Sidebar remains the same, but with pipeline and tuning sections added
Other:
* ml-classification-regression.html: Moved text about linear methods to new section in page
## How was this patch tested?
Generated docs locally
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#14213 from jkbradley/ml-guide-2.0.
## What changes were proposed in this pull request?
General decisions to follow, except where noted:
* spark.mllib, pyspark.mllib: Remove all Experimental annotations. Leave DeveloperApi annotations alone.
* spark.ml, pyspark.ml
** Annotate Estimator-Model pairs of classes and companion objects the same way.
** For all algorithms marked Experimental with Since tag <= 1.6, remove Experimental annotation.
** For all algorithms marked Experimental with Since tag = 2.0, leave Experimental annotation.
* DeveloperApi annotations are left alone, except where noted.
* No changes to which types are sealed.
Exceptions where I am leaving items Experimental in spark.ml, pyspark.ml, mainly because the items are new:
* Model Summary classes
* MLWriter, MLReader, MLWritable, MLReadable
* Evaluator and subclasses: There is discussion of changes around evaluating multiple metrics at once for efficiency.
* RFormula: Its behavior may need to change slightly to match R in edge cases.
* AFTSurvivalRegression
* MultilayerPerceptronClassifier
DeveloperApi changes:
* ml.tree.Node, ml.tree.Split, and subclasses should no longer be DeveloperApi
## How was this patch tested?
N/A
Note to reviewers:
* spark.ml.clustering.LDA underwent significant changes (additional methods), so let me know if you want me to leave it Experimental.
* Be careful to check for cases where a class should no longer be Experimental but has an Experimental method, val, or other feature. I did not find such cases, but please verify.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#14147 from jkbradley/experimental-audit.
## What changes were proposed in this pull request?
Issue: Omitting the full classpath can cause problems when calling JVM methods or classes from pyspark.
This PR: Changed all uses of jvm.X in pyspark.ml and pyspark.mllib to use full classpath for X
## How was this patch tested?
Existing unit tests. Manual testing in an environment where this was an issue.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#14023 from jkbradley/SPARK-16348.
[SPARK-14615](https://issues.apache.org/jira/browse/SPARK-14615) and #12627 changed `spark.ml` pipelines to use the new `ml.linalg` classes for `Vector`/`Matrix`. Some `Since` annotations for public methods/vals have not been updated accordingly to be `2.0.0`. This PR updates them.
## How was this patch tested?
Existing unit tests.
Author: Nick Pentreath <nickp@za.ibm.com>
Closes#13840 from MLnick/SPARK-16127-ml-linalg-since.
## What changes were proposed in this pull request?
Mark ml.classification algorithms as experimental to match Scala algorithms, update PyDoc for for thresholds on `LogisticRegression` to have same level of info as Scala, and enable mathjax for PyDoc.
## How was this patch tested?
Built docs locally & PySpark SQL tests
Author: Holden Karau <holden@us.ibm.com>
Closes#12938 from holdenk/SPARK-15162-SPARK-15164-update-some-pydocs.
## What changes were proposed in this pull request?
Several places set the seed Param default value to None which will translate to a zero value on the Scala side. This is unnecessary because a default fixed value already exists and if a test depends on a zero valued seed, then it should explicitly set it to zero instead of relying on this translation. These cases can be safely removed except for the ALS doc test, which has been changed to set the seed value to zero.
## How was this patch tested?
Ran PySpark tests locally
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#13672 from BryanCutler/pyspark-cleanup-setDefault-seed-SPARK-15741.
This PR adds missing `Since` annotations to `ml.feature` package.
Closes#8505.
## How was this patch tested?
Existing tests.
Author: Nick Pentreath <nickp@za.ibm.com>
Closes#13641 from MLnick/add-since-annotations.
## What changes were proposed in this pull request?
Fixed missing import for DecisionTreeRegressionModel used in GBTClassificationModel trees method.
## How was this patch tested?
Local tests
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#13787 from BryanCutler/pyspark-GBTClassificationModel-import-SPARK-16079.
## What changes were proposed in this pull request?
Now we have PySpark picklers for new and old vector/matrix, individually. However, they are all implemented under `PythonMLlibAPI`. To separate spark.mllib from spark.ml, we should implement the picklers of new vector/matrix under `spark.ml.python` instead.
## How was this patch tested?
Existing tests.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Closes#13219 from viirya/pyspark-pickler-ml.
## What changes were proposed in this pull request?
Adding __str__ to RFormula and model that will show the set formula param and resolved formula. This is currently present in the Scala API, found missing in PySpark during Spark 2.0 coverage review.
## How was this patch tested?
run pyspark-ml tests locally
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#13481 from BryanCutler/pyspark-ml-rformula_str-SPARK-15738.
## What changes were proposed in this pull request?
Word2vec python add maxsentence parameter.
## How was this patch tested?
Existing test.
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#13578 from WeichenXu123/word2vec_python_add_maxsentence.
## What changes were proposed in this pull request?
add method idf to IDF in pyspark
## How was this patch tested?
add unit test
Author: Jeff Zhang <zjffdu@apache.org>
Closes#13540 from zjffdu/SPARK-15788.
## What changes were proposed in this pull request?
Since [SPARK-15617](https://issues.apache.org/jira/browse/SPARK-15617) deprecated ```precision``` in ```MulticlassClassificationEvaluator```, many ML examples broken.
```python
pyspark.sql.utils.IllegalArgumentException: u'MulticlassClassificationEvaluator_4c3bb1d73d8cc0cedae6 parameter metricName given invalid value precision.'
```
We should use ```accuracy``` to replace ```precision``` in these examples.
## How was this patch tested?
Offline tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#13519 from yanboliang/spark-15771.
## What changes were proposed in this pull request?
`an -> a`
Use cmds like `find . -name '*.R' | xargs -i sh -c "grep -in ' an [^aeiou]' {} && echo {}"` to generate candidates, and review them one by one.
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#13515 from zhengruifeng/an_a.
## What changes were proposed in this pull request?
1, del precision,recall in `ml.MulticlassClassificationEvaluator`
2, update user guide for `mlllib.weightedFMeasure`
## How was this patch tested?
local build
Author: Ruifeng Zheng <ruifengz@foxmail.com>
Closes#13390 from zhengruifeng/clarify_f1.