### What changes were proposed in this pull request?
In LogisticRegression and LinearRegression, if maxIter=n is set, model.summary.totalIterations returns n+1 when training does not terminate early. This is because ```objectiveHistory.length``` is used as totalIterations, but ```objectiveHistory``` also contains the initial state, so ```objectiveHistory.length``` is 1 larger than the number of training iterations.
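A minimal plain-Python sketch of the off-by-one, assuming a toy training loop that records the objective once before iterating. The names `train`, `objective_history`, and the two `*_total_iterations` variables are illustrative and are not Spark's internals:

```python
def train(max_iter, initial_loss=1.0):
    """Toy optimizer: records the loss before training and after every iteration."""
    objective_history = [initial_loss]   # entry 0 is the initial state, not an iteration
    loss = initial_loss
    for _ in range(max_iter):
        loss *= 0.5                      # stand-in for one optimization step
        objective_history.append(loss)
    return objective_history


history = train(max_iter=5)
buggy_total_iterations = len(history)        # also counts the initial state
fixed_total_iterations = len(history) - 1    # iterations actually run
```

With `max_iter=5` the history holds six entries, so only the second count matches the number of iterations.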
### Why are the changes needed?
correctness
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
add new tests and also modify existing tests
Closes #28786 from huaxingao/summary_iter.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Add instance weight support in LogisticRegressionSummary
### Why are the changes needed?
LogisticRegression, MulticlassClassificationEvaluator and BinaryClassificationEvaluator support instance weight. We should support instance weight in LogisticRegressionSummary too.
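As a rough illustration of what instance weights mean for a summary metric, here is a plain-Python sketch of weighted accuracy. The function `weighted_accuracy` is hypothetical and only shows the idea; it is not the code in `LogisticRegressionSummary`:

```python
def weighted_accuracy(labels, predictions, weights=None):
    """Fraction of total weight that falls on correctly predicted rows."""
    if weights is None:
        weights = [1.0] * len(labels)    # unweighted case: every row counts once
    correct = sum(w for y, p, w in zip(labels, predictions, weights) if y == p)
    return correct / sum(weights)


labels = [1, 0, 1, 1]
predictions = [1, 0, 0, 1]
unweighted = weighted_accuracy(labels, predictions)
weighted = weighted_accuracy(labels, predictions, [1.0, 1.0, 2.0, 1.0])
```

Giving the misclassified row a weight of 2 lowers the metric, which is the behavior a weight-aware summary is expected to show.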
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Add new tests
Closes #28657 from huaxingao/weighted_summary.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
This commit is published into the public domain.
### What changes were proposed in this pull request?
Some syntax issues in docstrings have been fixed.
### Why are the changes needed?
In some places, the documentation did not render as intended, e.g. parameter documentations were not formatted as such.
### Does this PR introduce any user-facing change?
Slight improvements in documentation.
### How was this patch tested?
Manual testing and `dev/lint-python` run. No new Sphinx warnings arise due to this change.
Closes #28559 from DavidToneian/SPARK-31739.
Authored-by: David Toneian <david@toneian.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Return LogisticRegressionSummary from `evaluate` for multiclass logistic regression in PySpark
### Why are the changes needed?
Currently we have
```
@since("2.0.0")
def evaluate(self, dataset):
    if not isinstance(dataset, DataFrame):
        raise ValueError("dataset must be a DataFrame but got %s." % type(dataset))
    java_blr_summary = self._call_java("evaluate", dataset)
    return BinaryLogisticRegressionSummary(java_blr_summary)
```
We should return LogisticRegressionSummary for multiclass logistic regression.
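One way the intended dispatch could look, sketched with stand-in classes so it runs without Spark. The helper `make_summary` and the rule "two or fewer classes means binary" are assumptions for illustration, not the merged code:

```python
class LogisticRegressionSummary:
    """Stand-in for the general (multiclass-capable) summary."""

    def __init__(self, java_summary):
        self.java_summary = java_summary


class BinaryLogisticRegressionSummary(LogisticRegressionSummary):
    """Stand-in for the binary-only summary."""


def make_summary(num_classes, java_summary):
    # Assumption: a model with at most two classes is treated as binary.
    if num_classes <= 2:
        return BinaryLogisticRegressionSummary(java_summary)
    return LogisticRegressionSummary(java_summary)


binary = make_summary(2, object())
multiclass = make_summary(3, object())
```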
### Does this PR introduce _any_ user-facing change?
Yes
return LogisticRegressionSummary instead of BinaryLogisticRegressionSummary for multiclass logistic regression in Python
### How was this patch tested?
unit test
Closes #28503 from huaxingao/lr_summary.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
1, reorganize the `fit` method in LR into several blocks (`createModel`, `createBounds`, `createOptimizer`, `createInitCoefWithInterceptMatrix`);
2, add a new param `blockSize`;
3, if `blockSize==1`, keep the original behavior, code path `trainOnRows`;
4, if `blockSize>1`, standardize and stack input vectors into blocks (like ALS/MLP), code path `trainOnBlocks`
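The stacking step can be pictured with a small plain-Python sketch. The helpers `stack_into_blocks` and `block_margins` are hypothetical; the real code works on Spark vectors and calls BLAS, while this version only shows how rows are grouped and how one matrix-vector product replaces several dot products:

```python
def stack_into_blocks(rows, block_size):
    """Group consecutive feature vectors into blocks of at most block_size rows."""
    if block_size < 1:
        raise ValueError("block_size must be >= 1")
    return [rows[i:i + block_size] for i in range(0, len(rows), block_size)]


def block_margins(block, coefficients):
    """Matrix-vector product for one block: one margin per stacked row."""
    return [sum(x * c for x, c in zip(row, coefficients)) for row in block]


rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
blocks = stack_into_blocks(rows, block_size=2)
margins = [m for block in blocks for m in block_margins(block, [0.5, -1.0])]
```

The last block may hold fewer than `block_size` rows, as it does here.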
### Why are the changes needed?
On the dense dataset `epsilon_normalized.t`:
1, reduce the RAM needed to persist the training dataset; (save about 40% RAM)
2, use Level-2 BLAS routines; (4x ~ 5x faster)
### Does this PR introduce _any_ user-facing change?
Yes, a new param is added
### How was this patch tested?
existing and added testsuites
Closes #28458 from zhengruifeng/blockify_lor_II.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
1, add new param `blockSize`;
2, add a new class InstanceBlock;
3, **if `blockSize==1`, keep original behavior; if `blockSize>1`, stack input vectors to blocks (like ALS/MLP);**
4, if `blockSize>1`, standardize the input outside of optimization procedure;
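A sketch of what standardizing once, outside the optimization loop, could look like. It assumes scaling each feature by its sample standard deviation without centering, and the helper names `column_std` and `standardize` are hypothetical:

```python
import math


def column_std(rows):
    """Sample standard deviation of every feature column."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    return [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
        for j in range(dims)
    ]


def standardize(rows):
    """Scale each feature by 1/std once, before training starts."""
    stds = column_std(rows)
    # Assumption: constant features (std == 0) are mapped to 0.0.
    return [[x / s if s != 0.0 else 0.0 for x, s in zip(r, stds)] for r in rows]


scaled = standardize([[1.0, 10.0], [3.0, 10.0]])
```

Because the scaled rows are computed up front, the optimizer never has to rescale features inside each iteration.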
### Why are the changes needed?
1, reduce the RAM needed to persist the training dataset; (save about 40% RAM)
2, use Level-2 BLAS routines; (4x ~ 5x faster on dataset `epsilon`)
### Does this PR introduce any user-facing change?
Yes, a new param is added
### How was this patch tested?
existing and added testsuites
Closes #28349 from zhengruifeng/blockify_svc_II.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Implement common base ML classes (`Predictor`, `PredictionModel`, `Classifier`, `ClassificationModel`, `ProbabilisticClassifier`, `ProbabilisticClassificationModel`, `Regressor`, `RegressionModel`) for non-Java backends.
Note
- `Predictor` and `JavaClassifier` should be abstract as `_fit` method is not implemented.
- `PredictionModel` should be abstract as `_transform` is not implemented.
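The abstract-base idea can be sketched with the standard `abc` module. The classes below reuse the names `Predictor` and `PredictionModel` for readability but are simplified stand-ins, and `MeanPredictor` / `MeanModel` are hypothetical pure-Python implementations:

```python
from abc import ABC, abstractmethod


class Predictor(ABC):
    """Illustrative estimator base: subclasses must provide _fit."""

    def fit(self, dataset):
        return self._fit(dataset)

    @abstractmethod
    def _fit(self, dataset):
        ...


class PredictionModel(ABC):
    """Illustrative model base: subclasses must provide _transform."""

    def transform(self, dataset):
        return self._transform(dataset)

    @abstractmethod
    def _transform(self, dataset):
        ...


class MeanModel(PredictionModel):
    def __init__(self, mean):
        self.mean = mean

    def _transform(self, dataset):
        # Pair every input with the constant prediction.
        return [(x, self.mean) for x in dataset]


class MeanPredictor(Predictor):
    """Hypothetical non-JVM predictor that always predicts the training mean."""

    def _fit(self, dataset):
        return MeanModel(sum(dataset) / len(dataset))


model = MeanPredictor().fit([1.0, 2.0, 3.0])
predictions = model.transform([10.0])
```

Instantiating `Predictor` or `PredictionModel` directly raises `TypeError`, which is the practical meaning of "should be abstract" here.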
### Why are the changes needed?
To provide extension points for non-JVM algorithms, as well as a public (as opposed to `Java*` variants, which are commonly described in docstrings as private) hierarchy which can be used to distinguish between different classes of predictors.
For longer discussion see [SPARK-29212](https://issues.apache.org/jira/browse/SPARK-29212) and / or https://github.com/apache/spark/pull/25776.
### Does this PR introduce any user-facing change?
It adds new base classes as listed above, but effective interfaces (method resolution order notwithstanding) stay the same.
Additionally, "private" `Java*` classes in `ml.regression` and `ml.classification` have been renamed to follow PEP-8 conventions (added leading underscore).
Whether the same should be done for the equivalent classes in `ml.wrapper` is open for discussion.
If we take `JavaClassifier` as an example, type hierarchy will change from
![old pyspark ml classification JavaClassifier](https://user-images.githubusercontent.com/1554276/72657093-5c0b0c80-39a0-11ea-9069-a897d75de483.png)
to
![new pyspark ml classification _JavaClassifier](https://user-images.githubusercontent.com/1554276/72657098-64fbde00-39a0-11ea-8f80-01187a5ea5a6.png)
Similarly the old model
![old pyspark ml classification JavaClassificationModel](https://user-images.githubusercontent.com/1554276/72657103-7513bd80-39a0-11ea-9ffc-59eb6ab61fde.png)
will become
![new pyspark ml classification _JavaClassificationModel](https://user-images.githubusercontent.com/1554276/72657110-80ff7f80-39a0-11ea-9f5c-fe408664e827.png)
### How was this patch tested?
Existing unit tests.
Closes #27245 from zero323/SPARK-29212.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Add ```HasBlockSize``` in shared Params in both Scala and Python.
Make ALS/MLP extend ```HasBlockSize```
### Why are the changes needed?
Add ```HasBlockSize``` in ALS, so users can specify the blockSize.
Make ```HasBlockSize``` a shared param so both ALS and MLP can use it.
### Does this PR introduce any user-facing change?
Yes
```ALS.setBlockSize/getBlockSize```
```ALSModel.setBlockSize/getBlockSize```
### How was this patch tested?
Manually tested. Also added doctest.
Closes #27501 from huaxingao/spark_30662.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Revert #27360, #27396, #27374, #27389
### Why are the changes needed?
BLAS needs more performance tests, especially on sparse datasets.
A performance test of LogisticRegression (https://github.com/apache/spark/pull/27374) on a sparse dataset shows that blockifying vectors into matrices and using BLAS causes a performance regression.
LinearSVC and LinearRegression were updated in the same way as LogisticRegression, so they need to be reverted as well to make sure there is no regression.
### Does this PR introduce any user-facing change?
remove newly added param blockSize
### How was this patch tested?
reverted testsuites
Closes #27487 from zhengruifeng/revert_blockify_ii.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Make ALS/MLP extend ```HasBlockSize```
### Why are the changes needed?
Currently, MLP has its own ```blockSize``` param. We should make MLP extend ```HasBlockSize```, since ```HasBlockSize``` was added in ```sharedParams.scala``` recently.
ALS doesn't have a ```blockSize``` param now. We can make it extend ```HasBlockSize```, so users can specify the ```blockSize```.
### Does this PR introduce any user-facing change?
Yes
```ALS.setBlockSize``` and ```ALS.getBlockSize```
```ALSModel.setBlockSize``` and ```ALSModel.getBlockSize```
### How was this patch tested?
Manually tested. Also added doctest.
Closes #27389 from huaxingao/spark-30662.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
1, use blocks instead of vectors
2, use Level-2 BLAS for binary, use Level-3 BLAS for multinomial
### Why are the changes needed?
1, less RAM to persist training data; (save ~40%)
2, faster than existing impl; (40% ~ 92%)
### Does this PR introduce any user-facing change?
add a new expert param `blockSize`
### How was this patch tested?
updated testsuites
Closes #27374 from zhengruifeng/blockify_lor.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
1, stack input vectors to blocks (like ALS/MLP);
2, add new param `blockSize`;
3, add a new class `InstanceBlock`
4, standardize the input outside of optimization procedure;
### Why are the changes needed?
1, reduce the RAM needed to persist the training dataset; (save ~40% in test)
2, use Level-2 BLAS routines; (12% ~ 28% faster, without native BLAS)
### Does this PR introduce any user-facing change?
a new param `blockSize`
### How was this patch tested?
existing and updated testsuites
Closes #27360 from zhengruifeng/blockify_svc.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
add a param `bootstrap` to control whether bootstrap samples are used.
### Why are the changes needed?
Currently, RF with numTrees=1 builds a tree directly from the original dataset,
while with numTrees>1 it uses bootstrap samples to build trees.
This design exists so that a DecisionTreeModel can be trained through the RandomForest impl; however, it is somewhat strange.
In Scikit-Learn, there is a param [bootstrap](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier) to control whether bootstrap samples are used.
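A simplified plain-Python sketch of what such a flag controls. The function `training_sample` is hypothetical, and drawing rows with `random.choice` is a stand-in for whatever sampling scheme the real implementation uses:

```python
import random


def training_sample(dataset, bootstrap, seed=0):
    """Return the rows a single tree would be trained on."""
    if not bootstrap:
        return list(dataset)                        # original dataset, unchanged
    rng = random.Random(seed)
    return [rng.choice(dataset) for _ in dataset]   # sample with replacement


data = [1, 2, 3, 4, 5]
plain = training_sample(data, bootstrap=False)
resampled = training_sample(data, bootstrap=True, seed=42)
```

With an explicit flag, the choice between the two paths no longer depends on the number of trees.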
### Does this PR introduce any user-facing change?
Yes, new param is added
### How was this patch tested?
existing testsuites
Closes #27254 from zhengruifeng/add_bootstrap.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Removal of following `Param` fields:
- `factorSize`
- `fitLinear`
- `miniBatchFraction`
- `initStd`
- `solver`
from `FMClassifier` and `FMRegressor`
### Why are the changes needed?
These `Param` members are already provided by `_FactorizationMachinesParams`
0f3d744c3f/python/pyspark/ml/regression.py (L2303-L2318)
which is mixed into `FMRegressor`:
0f3d744c3f/python/pyspark/ml/regression.py (L2350)
and `FMClassifier`:
0f3d744c3f/python/pyspark/ml/classification.py (L2793)
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Manual testing.
Closes #27205 from zero323/SPARK-30378-FOLLOWUP.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This PR adjusts `_to_java` and `_from_java` of `OneVsRest` and `OneVsRestModel` to preserve `weightCol`.
### Why are the changes needed?
Currently neither class preserves the `weightCol` `Param` when data is saved / loaded:
```python
from pyspark.ml.classification import LogisticRegression, OneVsRest, OneVsRestModel
from pyspark.ml.linalg import DenseVector
df = spark.createDataFrame([(0, 1, DenseVector([1.0, 0.0])), (0, 1, DenseVector([1.0, 0.0]))], ("label", "w", "features"))
ovr = OneVsRest(classifier=LogisticRegression()).setWeightCol("w")
ovrm = ovr.fit(df)
ovr.getWeightCol()
## 'w'
ovrm.getWeightCol()
## 'w'
ovr.write().overwrite().save("/tmp/ovr")
ovr_ = OneVsRest.load("/tmp/ovr")
ovr_.getWeightCol()
## KeyError
## ...
## KeyError: Param(parent='OneVsRest_5145d56b6bd1', name='weightCol', doc='weight column name. ...)
ovrm.write().overwrite().save("/tmp/ovrm")
ovrm_ = OneVsRestModel.load("/tmp/ovrm")
ovrm_.getWeightCol()
## KeyError
## ...
## KeyError: Param(parent='OneVsRestModel_598c6d900fad', name='weightCol', doc='weight column name ...
```
### Does this PR introduce any user-facing change?
After this PR is merged, loaded objects will have `weightCol` `Param` set.
### How was this patch tested?
- Manual testing.
- Extension of existing persistence tests.
Closes #27190 from zero323/SPARK-30504.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
1, change `convertToBaggedRDDSamplingWithReplacement` to attach instance weights
2, make RF support weights
### Why are the changes needed?
`weightCol` is already exposed, but RF does not support weights yet.
### Does this PR introduce any user-facing change?
Yes, new setters
### How was this patch tested?
added testsuites
Closes #27097 from zhengruifeng/rf_support_weight.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
There are some parity issues between python and scala
### Why are the changes needed?
keep parity between python and scala
### Does this PR introduce any user-facing change?
Yes
### How was this patch tested?
existing tests
Closes #27196 from huaxingao/spark-30498.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Removal of `OneVsRestModel.setClassifier`, `OneVsRestModel.setLabelCol` and `OneVsRestModel.setWeightCol` methods.
### Why are the changes needed?
The aforementioned methods shouldn't have been included by [SPARK-29093](https://issues.apache.org/jira/browse/SPARK-29093), as they're not present in Scala `OneVsRestModel` and have no practical application.
### Does this PR introduce any user-facing change?
Not beyond the scope of SPARK-29093.
### How was this patch tested?
Existing tests.
CC huaxingao zhengruifeng
Closes #27181 from zero323/SPARK-30493.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Make ```MultilayerPerceptronClassificationModel``` extend ```MultilayerPerceptronParams```
### Why are the changes needed?
Make ```MultilayerPerceptronClassificationModel``` extend ```MultilayerPerceptronParams``` to expose the training params, so user can see these params when calling ```extractParamMap```
### Does this PR introduce any user-facing change?
Yes. The ```MultilayerPerceptronParams``` such as ```seed```, ```maxIter``` ... are available in ```MultilayerPerceptronClassificationModel``` now
### How was this patch tested?
Manually tested ```MultilayerPerceptronClassificationModel.extractParamMap()``` to verify all the new params are there.
Closes #26838 from huaxingao/spark-30144.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
expose predictRaw and predictProbability on Python side
### Why are the changes needed?
to keep parity between scala and python
### Does this PR introduce any user-facing change?
Yes. Expose python ```predictRaw``` and ```predictProbability```
### How was this patch tested?
doctest
Closes #27082 from huaxingao/spark-30358.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
add getter/setter in Python FM
### Why are the changes needed?
to be consistent with other algorithms
### Does this PR introduce any user-facing change?
Yes.
add getter/setter in Python FMRegressor/FMRegressionModel/FMClassifier/FMClassificationModel
### How was this patch tested?
doctest
Closes #27044 from huaxingao/spark-30378.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Implement Factorization Machines as a ml-pipeline component
1. loss function supports: logloss, mse
2. optimizer: GD, adamW
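For orientation, the second-order Factorization Machine model from the Rendle (2010) paper cited in this message can be sketched in plain Python. The function `fm_predict` and its argument names are illustrative, not the Spark implementation; it uses the paper's linear-time reformulation of the pairwise term:

```python
def fm_predict(x, w0, w, v):
    """Raw FM score: w0 + sum_i w_i*x_i + sum_{i<j} <v_i, v_j>*x_i*x_j."""
    linear = sum(wi * xi for wi, xi in zip(w, x))
    num_factors = len(v[0])
    pairwise = 0.0
    for f in range(num_factors):
        # Identity: sum_{i<j} a_i*a_j = 0.5 * ((sum a_i)^2 - sum a_i^2), with a_i = v[i][f]*x[i]
        s = sum(v[i][f] * x[i] for i in range(len(x)))
        s_sq = sum((v[i][f] * x[i]) ** 2 for i in range(len(x)))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise


x = [1.0, 2.0]                    # feature vector
w0 = 0.5                          # global bias
w = [1.0, -1.0]                   # linear weights
v = [[1.0, 0.0], [2.0, 1.0]]      # one factor vector per feature
prediction = fm_predict(x, w0, w, v)
```

For regression with mse the raw score is used directly; for classification with logloss it would typically be passed through a sigmoid.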
### Why are the changes needed?
Factorization Machines are widely used in advertising and recommendation systems to estimate CTR (click-through rate).
Advertising and recommendation systems usually have a lot of data, so Spark is needed to estimate the CTR, and Factorization Machines are a common ML model for this task.
References:
1. S. Rendle, “Factorization machines,” in Proceedings of IEEE International Conference on Data Mining (ICDM), pp. 995–1000, 2010.
https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
run unit tests
Closes #27000 from mob-ai/ml/fm.
Authored-by: zhanjf <zhanjf@mob.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Implement Factorization Machines as a ml-pipeline component
1. loss function supports: logloss, mse
2. optimizer: GD, adamW
### Why are the changes needed?
Factorization Machines are widely used in advertising and recommendation systems to estimate CTR (click-through rate).
Advertising and recommendation systems usually have a lot of data, so Spark is needed to estimate the CTR, and Factorization Machines are a common ML model for this task.
References:
1. S. Rendle, “Factorization machines,” in Proceedings of IEEE International Conference on Data Mining (ICDM), pp. 995–1000, 2010.
https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
run unit tests
Closes #26124 from mob-ai/ml/fm.
Authored-by: zhanjf <zhanjf@mob.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
add ```setWeightCol``` and ```setMinWeightFractionPerNode``` in Python side of ```GBTClassifier``` and ```GBTRegressor```
### Why are the changes needed?
https://github.com/apache/spark/pull/25926 added ```setWeightCol``` and ```setMinWeightFractionPerNode``` in GBTs on scala side. This PR will add ```setWeightCol``` and ```setMinWeightFractionPerNode``` in GBTs on python side
### Does this PR introduce any user-facing change?
Yes
### How was this patch tested?
doc test
Closes #26774 from huaxingao/spark-30146.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
MNB/CNB/BNB use empty sigma matrix instead of null
### Why are the changes needed?
1, Using an empty sigma matrix will simplify the impl
2, I have been reviewing the FM impl recently; FM models have an optional bias and linear part. It seems more reasonable to set an optional part to an empty vector/matrix or zero value than to `null`
### Does this PR introduce any user-facing change?
yes, sigma from `null` to empty matrix
### How was this patch tested?
updated testsuites
Closes #26679 from zhengruifeng/nb_use_empty_sigma.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Impl Complement Naive Bayes Classifier as a `modelType` option in `NaiveBayes`
### Why are the changes needed?
1, it is a better choice for text classification: it is said in [scikit-learn](https://scikit-learn.org/stable/modules/naive_bayes.html#complement-naive-bayes) that 'CNB regularly outperforms MNB (often by a considerable margin) on text classification tasks.'
2, CNB is highly similar to the existing MNB; only a small part of the existing MNB needs to be changed, so it is an easy win to support CNB.
### Does this PR introduce any user-facing change?
yes, a new `modelType` is supported
### How was this patch tested?
added testsuites
Closes #26575 from zhengruifeng/cnb.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
support `modelType` `gaussian`
### Why are the changes needed?
current modelTypes do not support continuous data
### Does this PR introduce any user-facing change?
yes, add a `modelType` option
### How was this patch tested?
existing testsuites and added ones
Closes #26413 from zhengruifeng/gnb.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Add ```__repr__``` in Python ML Models
### Why are the changes needed?
Some Python ML Models have ```__repr__``` and others don't. In the doctests, when calling Model.setXXX, some Models print xxxModel... correctly, while others can't because they lack the ```__repr__``` method. For example:
```
>>> gm = GaussianMixture(k=3, tol=0.0001, seed=10)
>>> model = gm.fit(df)
>>> model.setPredictionCol("newPrediction")
GaussianMixture...
```
After the change, the above code will become the following:
```
>>> gm = GaussianMixture(k=3, tol=0.0001, seed=10)
>>> model = gm.fit(df)
>>> model.setPredictionCol("newPrediction")
GaussianMixtureModel...
```
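The effect can be reproduced without Spark. The sketch below assumes, as an illustration only, that a model without its own ```__repr__``` falls back to a base-class repr built from a uid inherited from the estimator; all class names here are stand-ins:

```python
class Identifiable:
    """Stand-in base class whose default repr is just the uid."""

    def __init__(self, uid):
        self.uid = uid

    def __repr__(self):
        return self.uid


class ModelWithoutRepr(Identifiable):
    """Inherits the uid-only repr, so it prints like the estimator."""


class ModelWithRepr(Identifiable):
    def __init__(self, uid, k):
        super().__init__(uid)
        self.k = k

    def __repr__(self):
        # The model names itself explicitly instead of echoing the uid.
        return "GaussianMixtureModel: uid=%s, k=%d" % (self.uid, self.k)


before = repr(ModelWithoutRepr("GaussianMixture_1234"))
after = repr(ModelWithRepr("GaussianMixture_1234", k=3))
```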
### Does this PR introduce any user-facing change?
Yes.
### How was this patch tested?
doctest
Closes #26489 from huaxingao/spark-29876.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
1, ML models should override the toString method to expose basic information.
Currently some algs (GBT/RF/LoR) already do this, while others do not yet.
2,add `val numFeatures` in `BisectingKMeansModel`/`GaussianMixtureModel`/`KMeansModel`/`AFTSurvivalRegressionModel`/`IsotonicRegressionModel`
### Why are the changes needed?
ML models should extend toString method to expose basic information.
### Does this PR introduce any user-facing change?
yes
### How was this patch tested?
existing testsuites
Closes #26439 from zhengruifeng/models_toString.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Remove automatically generated param setters in _shared_params_code_gen.py
### Why are the changes needed?
To keep parity between scala and python
### Does this PR introduce any user-facing change?
Yes
Add some setters in Python ML XXXModels
### How was this patch tested?
unit tests
Closes #26232 from huaxingao/spark-29093.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Add private _XXXParams classes for classification & regression
### Why are the changes needed?
To keep parity between scala and python
### Does this PR introduce any user-facing change?
Yes. Add getters/setters for the following Model classes
```
LinearSVCModel:
get/setRegParam
get/setMaxIter
get/setFitIntercept
get/setTol
get/setStandardization
get/setWeightCol
get/setAggregationDepth
get/setThreshold
LogisticRegressionModel:
get/setRegParam
get/setElasticNetParam
get/setMaxIter
get/setFitIntercept
get/setTol
get/setStandardization
get/setWeightCol
get/setAggregationDepth
get/setThreshold
NaiveBayesModel:
get/setWeightCol
LinearRegressionModel:
get/setRegParam
get/setElasticNetParam
get/setMaxIter
get/setTol
get/setFitIntercept
get/setStandardization
get/setWeightCol
get/setSolver
get/setAggregationDepth
get/setLoss
GeneralizedLinearRegressionModel:
get/setFitIntercept
get/setMaxIter
get/setTol
get/setRegParam
get/setWeightCol
get/setSolver
```
### How was this patch tested?
Add a few doctest
Closes #26142 from huaxingao/spark-29381.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
change PySpark ml ```Params._clear``` to ```Params.clear```
### Why are the changes needed?
PySpark ML currently has a private _clear() method that will unset a param. This should be made public to match the Scala API and give users a way to unset a user supplied param.
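A toy model of the intended semantics, assuming params keep user-supplied values separately from defaults. The class and method names below (`set`, `clear`, `is_set`, `get_or_default`) are simplified stand-ins rather than the PySpark API:

```python
class Params:
    """Minimal param holder with defaults and user-supplied overrides."""

    def __init__(self, defaults):
        self._defaults = dict(defaults)
        self._user = {}

    def set(self, name, value):
        self._user[name] = value
        return self

    def clear(self, name):
        # Drop only the user-supplied value; the default stays in place.
        self._user.pop(name, None)

    def is_set(self, name):
        return name in self._user

    def get_or_default(self, name):
        if name in self._user:
            return self._user[name]
        return self._defaults[name]


params = Params({"maxIter": 100}).set("maxIter", 5)
value_before = params.get_or_default("maxIter")
params.clear("maxIter")
value_after = params.get_or_default("maxIter")
```

After `clear`, lookups fall back to the default, which is what "unset a user supplied param" amounts to in this sketch.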
### Does this PR introduce any user-facing change?
Yes. PySpark ml ```Params._clear``` ---> ```Params.clear```
### How was this patch tested?
Add test.
Closes #26130 from huaxingao/spark-29464.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
### What changes were proposed in this pull request?
Add _ before XXXParams classes to indicate internal usage
### Why are the changes needed?
Follow the PEP 8 convention to use _single_leading_underscore to indicate internal use
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
use existing tests
Closes #26103 from huaxingao/spark-29381.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
- Move tree related classes to a separate file ```tree.py```
- add method ```predictLeaf``` in ```DecisionTreeModel```& ```TreeEnsembleModel```
### Why are the changes needed?
- keep parity between scala and python
- easy code maintenance
### Does this PR introduce any user-facing change?
Yes
add method ```predictLeaf``` in ```DecisionTreeModel```& ```TreeEnsembleModel```
add ```setMinWeightFractionPerNode``` in ```DecisionTreeClassifier``` and ```DecisionTreeRegressor```
### How was this patch tested?
add some doc tests
Closes #25929 from huaxingao/spark_29116.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Add some common classes in Python to make it have the same structure as Scala
1. Scala has ClassifierParams/Classifier/ClassificationModel:
```
trait ClassifierParams
    extends PredictorParams with HasRawPredictionCol
abstract class Classifier
    extends Predictor with ClassifierParams {
  def setRawPredictionCol
}
abstract class ClassificationModel
    extends PredictionModel with ClassifierParams {
  def setRawPredictionCol
}
```
This PR makes Python have the following:
```
class JavaClassifierParams(HasRawPredictionCol, JavaPredictorParams):
    pass
class JavaClassifier(JavaPredictor, JavaClassifierParams):
    def setRawPredictionCol
class JavaClassificationModel(JavaPredictionModel, JavaClassifierParams):
    def setRawPredictionCol
```
2. Scala has ProbabilisticClassifierParams/ProbabilisticClassifier/ProbabilisticClassificationModel:
```
trait ProbabilisticClassifierParams
    extends ClassifierParams with HasProbabilityCol with HasThresholds
abstract class ProbabilisticClassifier
    extends Classifier with ProbabilisticClassifierParams {
  def setProbabilityCol
  def setThresholds
}
abstract class ProbabilisticClassificationModel
    extends ClassificationModel with ProbabilisticClassifierParams {
  def setProbabilityCol
  def setThresholds
}
```
This PR makes Python have the following:
```
class JavaProbabilisticClassifierParams(HasProbabilityCol, HasThresholds, JavaClassifierParams):
    pass
class JavaProbabilisticClassifier(JavaClassifier, JavaProbabilisticClassifierParams):
    def setProbabilityCol
    def setThresholds
class JavaProbabilisticClassificationModel(JavaClassificationModel, JavaProbabilisticClassifierParams):
    def setProbabilityCol
    def setThresholds
```
3. Scala has PredictorParams/Predictor/PredictionModel:
```
trait PredictorParams extends Params
    with HasLabelCol with HasFeaturesCol with HasPredictionCol
abstract class Predictor
    extends Estimator with PredictorParams {
  def setLabelCol
  def setFeaturesCol
  def setPredictionCol
}
abstract class PredictionModel
    extends Model with PredictorParams {
  def setFeaturesCol
  def setPredictionCol
  def numFeatures
  def predict
}
```
This PR makes Python have the following:
```
class JavaPredictorParams(HasLabelCol, HasFeaturesCol, HasPredictionCol):
    pass
class JavaPredictor(JavaEstimator, JavaPredictorParams):
    def setLabelCol
    def setFeaturesCol
    def setPredictionCol
class JavaPredictionModel(JavaModel, JavaPredictorParams):
    def setFeaturesCol
    def setPredictionCol
    def numFeatures
    def predict
```
### Why are the changes needed?
Have parity between Python and Scala ML
### Does this PR introduce any user-facing change?
Yes. Add the following changes:
```
LinearSVCModel
- get/setFeaturesCol
- get/setPredictionCol
- get/setLabelCol
- get/setRawPredictionCol
- predict
```
```
LogisticRegressionModel
DecisionTreeClassificationModel
RandomForestClassificationModel
GBTClassificationModel
NaiveBayesModel
MultilayerPerceptronClassificationModel
- get/setFeaturesCol
- get/setPredictionCol
- get/setLabelCol
- get/setRawPredictionCol
- get/setProbabilityCol
- predict
```
```
LinearRegressionModel
IsotonicRegressionModel
DecisionTreeRegressionModel
RandomForestRegressionModel
GBTRegressionModel
AFTSurvivalRegressionModel
GeneralizedLinearRegressionModel
- get/setFeaturesCol
- get/setPredictionCol
- get/setLabelCol
- predict
```
### How was this patch tested?
Add a few doc tests.
Closes #25776 from huaxingao/spark-28985.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
Following the Scala ```OneVsRestParams``` implementation, move ```setClassifier``` from ```OneVsRestParams``` to ```OneVsRest``` in PySpark
### Why are the changes needed?
1. Maintain the parity between scala and python code.
2. ```Classifier``` can only be set in the estimator.
### Does this PR introduce any user-facing change?
Yes.
Previous behavior: ```OneVsRestModel``` has method ```setClassifier```
Current behavior: ```setClassifier``` is removed from ```OneVsRestModel```. ```classifier``` can only be set in ```OneVsRest```.
### How was this patch tested?
Use existing tests
Closes #25715 from huaxingao/spark-28969.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
The Experimental and Evolving annotations are both (like Unstable) used to express that an API may change. However there are many things in the code that have been marked that way since even Spark 1.x. Per the dev thread, anything introduced at or before Spark 2.3.0 is pretty much 'stable' in that it would not change without a deprecation cycle. Therefore I'd like to remove most of these annotations. And, remove the `:: Experimental ::` scaladoc tag too. And likewise for Python, R.
The changes below can be summarized as:
- Generally, anything introduced at or before Spark 2.3.0 is no longer marked as Evolving or Experimental
- Obviously experimental items like DSv2, Barrier mode, ExperimentalMethods are untouched
- I _did_ unmark a few MLlib classes introduced in 2.4, as I am quite confident they're not going to change (e.g. KolmogorovSmirnovTest, PowerIterationClustering)
It's a big change to review, so I'd suggest scanning the list of _files_ changed to see if any area seems like it should remain partly experimental and examine those.
### Why are the changes needed?
Many of these annotations are incorrect; the APIs are de facto stable. Leaving them also makes legitimate usages of the annotations less meaningful.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes #25558 from srowen/SPARK-28855.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
Expose the newly added tree-based transformation on the Python side
### Why are the changes needed?
function parity
### Does this PR introduce any user-facing change?
Yes, adds `setLeafCol` and `getLeafCol` on the Python side
### How was this patch tested?
added tests & local tests
Closes #25566 from zhengruifeng/py_tree_path.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
## What changes were proposed in this pull request?
Leave ```shared.py``` untouched. Move Python ```DecisionTreeParams``` to ```regression.py```
## How was this patch tested?
Use existing tests
Closes #25406 from huaxingao/spark-28243.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Remove deprecated setFeatureSubsetStrategy and setSubsamplingRate from Python TreeEnsembleParams
## How was this patch tested?
Use existing tests.
Closes #25046 from huaxingao/spark-28243.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Add RawPrediction to OneVsRest in PySpark to make it consistent with the Scala implementation
## How was this patch tested?
Add doctest
Closes #23910 from huaxingao/spark-27007.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Add sample weights to decision trees
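As a rough illustration of what sample weights mean for a tree, the sketch below computes a weighted Gini impurity in plain Python. It is not Spark's implementation: the `weighted_gini` helper and the toy data are hypothetical, and it assumes only that each row contributes its weight rather than a count of one.

```python
from collections import defaultdict


def weighted_gini(labels, weights):
    """Gini impurity where each row counts `weight` times (illustrative helper)."""
    totals = defaultdict(float)
    for label, weight in zip(labels, weights):
        totals[label] += weight  # accumulate weight per class, not row count
    total = sum(totals.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((w / total) ** 2 for w in totals.values())


# Uniform weights reproduce the ordinary Gini impurity of a 50/50 split.
unweighted = weighted_gini([0, 0, 1, 1], [1.0, 1.0, 1.0, 1.0])
# Up-weighting class 0 makes the node look purer to the split criterion.
upweighted = weighted_gini([0, 0, 1, 1], [3.0, 3.0, 1.0, 1.0])
```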
## How was this patch tested?
Updated test suites
Closes #23818 from zhengruifeng/py_tree_support_sample_weight.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Python version of https://github.com/apache/spark/pull/17654
## How was this patch tested?
Existing Python unit test
Closes #23676 from huaxingao/spark26754.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Misc code cleanup from lgtm.com analysis. See comments below for details.
## How was this patch tested?
Existing tests.
Closes #23571 from srowen/SPARK-26640.
Lead-authored-by: Sean Owen <sean.owen@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Co-authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Add validationIndicatorCol and validationTol to GBT Python.
## How was this patch tested?
Add test in doctest to test the new API.
Closes #21465 from huaxingao/spark-24333.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
(This change is a subset of the changes needed for the JIRA; see https://github.com/apache/spark/pull/22231)
## What changes were proposed in this pull request?
Use raw strings and simpler regex syntax consistently in Python, which also avoids warnings from pycodestyle about accidentally relying on Python's non-escaping of non-reserved chars in normal strings. Also, fix a few long lines.
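A small stdlib-only example of why raw strings matter for regexes. The pattern and sample text are made up for illustration. In a normal string `\b` is the backspace character, so the pattern silently stops meaning "word boundary":

```python
import re

# In a normal string, "\b" is the single backspace character (0x08).
normal_pattern = "\bcat\b"
# In a raw string, the backslash survives and the regex engine sees \b.
raw_pattern = r"\bcat\b"

text = "a cat sat"
normal_matches = re.findall(normal_pattern, text)  # looks for backspace + "cat" + backspace
raw_matches = re.findall(raw_pattern, text)        # looks for the whole word "cat"
```

Here `raw_matches` contains the word, while `normal_matches` is empty because the text contains no backspace characters.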
## How was this patch tested?
Existing tests, and some manual double-checking of the behavior of regexes in Python 2/3 to be sure.
Closes #22400 from srowen/SPARK-25238.2.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
[SPARK-14712](https://issues.apache.org/jira/browse/SPARK-14712)
spark.mllib LogisticRegressionModel overrides `toString` to print a little model info. We should do the same in spark.ml and override `__repr__` in PySpark.
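The Python half of this idea can be sketched as below. The `ModelSketch` class, its fields, and the output format are hypothetical and chosen only to show the pattern; they are not the actual PySpark implementation.

```python
class ModelSketch:
    """Hypothetical model that reports a little info about itself."""

    def __init__(self, uid, num_classes, num_features):
        self.uid = uid
        self.num_classes = num_classes
        self.num_features = num_features

    def __repr__(self):
        # Summarize the model instead of showing the default object address.
        return (
            f"{type(self).__name__}: uid={self.uid}, "
            f"numClasses={self.num_classes}, numFeatures={self.num_features}"
        )


summary = repr(ModelSketch("logreg_1234", 2, 10))
```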
## How was this patch tested?
LogisticRegressionSuite.scala
Python doctest in pyspark.ml.classification.py
Author: bravo-zhang <mzhang1230@gmail.com>
Closes #18826 from bravo-zhang/spark-14712.
## What changes were proposed in this pull request?
Add featureSubsetStrategy in GBTClassifier and GBTRegressor. Also make GBTClassificationModel inherit from JavaClassificationModel instead of prediction model so it will have numClasses.
## How was this patch tested?
Add tests in doctest
Author: Huaxin Gao <huaxing@us.ibm.com>
Closes #21413 from huaxingao/spark-23161.
## What changes were proposed in this pull request?
Add evaluateEachIteration for GBTClassification and GBTRegressionModel
## How was this patch tested?
doctest
Author: Lu WANG <lu.wang@databricks.com>
Closes #21335 from ludatabricks/SPARK-14682.