### What changes were proposed in this pull request?
This PR adjusts `_to_java` and `_from_java` of `OneVsRest` and `OneVsRestModel` to preserve `weightCol`.
### Why are the changes needed?
Currently both `Params` don't preserve `weightCol` `Params` when data is saved / loaded:
```python
from pyspark.ml.classification import LogisticRegression, OneVsRest, OneVsRestModel
from pyspark.ml.linalg import DenseVector
df = spark.createDataFrame([(0, 1, DenseVector([1.0, 0.0])), (0, 1, DenseVector([1.0, 0.0]))], ("label", "w", "features"))
ovr = OneVsRest(classifier=LogisticRegression()).setWeightCol("w")
ovrm = ovr.fit(df)
ovr.getWeightCol()
## 'w'
ovrm.getWeightCol()
## 'w'
ovr.write().overwrite().save("/tmp/ovr")
ovr_ = OneVsRest.load("/tmp/ovr")
ovr_.getWeightCol()
## KeyError
## ...
## KeyError: Param(parent='OneVsRest_5145d56b6bd1', name='weightCol', doc='weight column name. ...)
ovrm.write().overwrite().save("/tmp/ovrm")
ovrm_ = OneVsRestModel.load("/tmp/ovrm")
ovrm_ .getWeightCol()
## KeyError
## ...
## KeyError: Param(parent='OneVsRestModel_598c6d900fad', name='weightCol', doc='weight column name ...
```
### Does this PR introduce any user-facing change?
After this PR is merged, loaded objects will have `weightCol` `Param` set.
### How was this patch tested?
- Manual testing.
- Extension of existing persistence tests.
Closes#27190 from zero323/SPARK-30504.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
1, change `convertToBaggedRDDSamplingWithReplacement` to attach instance weights
2, make RF supports weights
### Why are the changes needed?
`weightCol` is already exposed, while RF has not support weights.
### Does this PR introduce any user-facing change?
Yes, new setters
### How was this patch tested?
added testsuites
Closes#27097 from zhengruifeng/rf_support_weight.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
There are some parity issues between python and scala
### Why are the changes needed?
keep parity between python and scala
### Does this PR introduce any user-facing change?
Yes
### How was this patch tested?
existing tests
Closes#27196 from huaxingao/spark-30498.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
add weight support in BisectingKMeans
### Why are the changes needed?
BisectingKMeans should support instance weighting
### Does this PR introduce any user-facing change?
Yes. BisectingKMeans.setWeight
### How was this patch tested?
Unit test
Closes#27035 from huaxingao/spark_30351.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Make Regressors extend abstract class Regressor:
```AFTSurvivalRegression extends Estimator => extends Regressor```
```DecisionTreeRegressor extends Predictor => extends Regressor```
```FMRegressor extends Predictor => extends Regressor```
```GBTRegressor extends Predictor => extends Regressor```
```RandomForestRegressor extends Predictor => extends Regressor```
We will not make ```IsotonicRegression``` extend ```Regressor``` because it is tricky to handle both DoubleType and VectorType.
### Why are the changes needed?
Make class hierarchy consistent for all Regressors
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing tests
Closes#27168 from huaxingao/spark-30377.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Removal of `OneVsRestModel.setClassifier`, `OneVsRestModel.setLabelCol` and `OneVsRestModel.setWeightCol` methods.
### Why are the changes needed?
Aforementioned methods shouldn't by included by [SPARK-29093](https://issues.apache.org/jira/browse/SPARK-29093), as they're not present in Scala `OneVsRestModel` and have no practical application.
### Does this PR introduce any user-facing change?
Not beyond scope of SPARK-29093].
### How was this patch tested?
Existing tests.
CC huaxingao zhengruifeng
Closes#27181 from zero323/SPARK-30493.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Add ```predict``` and ```numFeatures``` in Python ```IsotonicRegressionModel```
### Why are the changes needed?
```IsotonicRegressionModel``` doesn't extend ```JavaPredictionModel```, so it doesn't get ```predict``` and ```numFeatures``` from the super class.
### Does this PR introduce any user-facing change?
Yes. Python version of
```
IsotonicRegressionModel.predict
IsotonicRegressionModel.numFeatures
```
### How was this patch tested?
doctest
Closes#27122 from huaxingao/spark-30452.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
PySpark UDF to convert MLlib vectors to dense arrays.
Example:
```
from pyspark.ml.functions import vector_to_array
df.select(vector_to_array(col("features"))
```
### Why are the changes needed?
If a PySpark user wants to convert MLlib sparse/dense vectors in a DataFrame into dense arrays, an efficient approach is to do that in JVM. However, it requires PySpark user to write Scala code and register it as a UDF. Often this is infeasible for a pure python project.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
UT.
Closes#26910 from WeichenXu123/vector_to_array.
Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: Xiangrui Meng <meng@databricks.com>
### What changes were proposed in this pull request?
Make ```MultilayerPerceptronClassificationModel``` extend ```MultilayerPerceptronParams```
### Why are the changes needed?
Make ```MultilayerPerceptronClassificationModel``` extend ```MultilayerPerceptronParams``` to expose the training params, so user can see these params when calling ```extractParamMap```
### Does this PR introduce any user-facing change?
Yes. The ```MultilayerPerceptronParams``` such as ```seed```, ```maxIter``` ... are available in ```MultilayerPerceptronClassificationModel``` now
### How was this patch tested?
Manually tested ```MultilayerPerceptronClassificationModel.extractParamMap()``` to verify all the new params are there.
Closes#26838 from huaxingao/spark-30144.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
expose predictRaw and predictProbability on Python side
### Why are the changes needed?
to keep parity between scala and python
### Does this PR introduce any user-facing change?
Yes. Expose python ```predictRaw``` and ```predictProbability```
### How was this patch tested?
doctest
Closes#27082 from huaxingao/spark-30358.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
add getter/setter in Python FM
### Why are the changes needed?
to be consistent with other algorithms
### Does this PR introduce any user-facing change?
Yes.
add getter/setter in Python FMRegressor/FMRegressionModel/FMClassifier/FMClassificationModel
### How was this patch tested?
doctest
Closes#27044 from huaxingao/spark-30378.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
supports instance weighting in GMM
### Why are the changes needed?
ML should support instance weighting
### Does this PR introduce any user-facing change?
yes, a new param `weightCol` is exposed
### How was this patch tested?
added testsuits
Closes#26735 from zhengruifeng/gmm_support_weight.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
add a corresponding class MultivariateGaussian containing a vector and a matrix on the py side, so gaussian can be used on the py side.
### Does this PR introduce any user-facing change?
add Python class ```MultivariateGaussian```
### How was this patch tested?
doctest
Closes#27020 from huaxingao/spark-30247.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Implement Factorization Machines as a ml-pipeline component
1. loss function supports: logloss, mse
2. optimizer: GD, adamW
### Why are the changes needed?
Factorization Machines is widely used in advertising and recommendation system to estimate CTR(click-through rate).
Advertising and recommendation system usually has a lot of data, so we need Spark to estimate the CTR, and Factorization Machines are common ml model to estimate CTR.
References:
1. S. Rendle, “Factorization machines,” in Proceedings of IEEE International Conference on Data Mining (ICDM), pp. 995–1000, 2010.
https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
run unit tests
Closes#27000 from mob-ai/ml/fm.
Authored-by: zhanjf <zhanjf@mob.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
compute the medians/ranges more distributedly
### Why are the changes needed?
It is a bottleneck to collect the whole Array[QuantileSummaries] from executors,
since a QuantileSummaries is a large object, which maintains arrays of large sizes 10k(`defaultCompressThreshold`)/50k(`defaultHeadSize`).
In Spark-Shell with default params, I processed a dataset with numFeatures=69,200, and existing impl fail due to OOM.
After this PR, it will sucessfuly fit the model.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes#26803 from zhengruifeng/robust_high_dim.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Implement Factorization Machines as a ml-pipeline component
1. loss function supports: logloss, mse
2. optimizer: GD, adamW
### Why are the changes needed?
Factorization Machines is widely used in advertising and recommendation system to estimate CTR(click-through rate).
Advertising and recommendation system usually has a lot of data, so we need Spark to estimate the CTR, and Factorization Machines are common ml model to estimate CTR.
References:
1. S. Rendle, “Factorization machines,” in Proceedings of IEEE International Conference on Data Mining (ICDM), pp. 995–1000, 2010.
https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
run unit tests
Closes#26124 from mob-ai/ml/fm.
Authored-by: zhanjf <zhanjf@mob.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
expose gaussian in PySpark
### Why are the changes needed?
A ```GaussianMixtureModel``` contains two parts of coefficients: ```weights``` & ```gaussians```. However, ```gaussians``` is not exposed on Python side.
### Does this PR introduce any user-facing change?
Yes. ```GaussianMixtureModel.gaussians``` is exposed in PySpark.
### How was this patch tested?
add doctest
Closes#26882 from huaxingao/spark-30247.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
add weight support in KMeans
### Why are the changes needed?
KMeans should support weighting
### Does this PR introduce any user-facing change?
Yes. ```KMeans.setWeightCol```
### How was this patch tested?
Unit Tests
Closes#26739 from huaxingao/spark-29967.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
add ```setWeightCol``` and ```setMinWeightFractionPerNode``` in Python side of ```GBTClassifier``` and ```GBTRegressor```
### Why are the changes needed?
https://github.com/apache/spark/pull/25926 added ```setWeightCol``` and ```setMinWeightFractionPerNode``` in GBTs on scala side. This PR will add ```setWeightCol``` and ```setMinWeightFractionPerNode``` in GBTs on python side
### Does this PR introduce any user-facing change?
Yes
### How was this patch tested?
doc test
Closes#26774 from huaxingao/spark-30146.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
MNB/CNB/BNB use empty sigma matrix instead of null
### Why are the changes needed?
1,Using empty sigma matrix will simplify the impl
2,I am reviewing FM impl these days, FMModels have optional bias and linear part. It seems more reasonable to set optional part an empty vector/matrix or zero value than `null`
### Does this PR introduce any user-facing change?
yes, sigma from `null` to empty matrix
### How was this patch tested?
updated testsuites
Closes#26679 from zhengruifeng/nb_use_empty_sigma.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Summarizer support more metrics: sum, std
### Why are the changes needed?
Those metrics are widely used, it will be convenient to directly obtain them other than a conversion.
in `NaiveBayes`: we want the sum of vectors, mean & weightSum need to computed then multiplied
in `StandardScaler`,`AFTSurvivalRegression`,`LinearRegression`,`LinearSVC`,`LogisticRegression`: we need to obtain `variance` and then sqrt it to get std
### Does this PR introduce any user-facing change?
yes, new metrics are exposed to end users
### How was this patch tested?
added testsuites
Closes#26596 from zhengruifeng/summarizer_add_metrics.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
MulticlassClassificationEvaluator support hammingLoss
### Why are the changes needed?
1, it is an easy to compute hammingLoss based on confusion matrix
2, scikit-learn supports it
### Does this PR introduce any user-facing change?
yes
### How was this patch tested?
added testsuites
Closes#26597 from zhengruifeng/multi_class_hamming_loss.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Impl Complement Naive Bayes Classifier as a `modelType` option in `NaiveBayes`
### Why are the changes needed?
1, it is a better choice for text classification: it is said in [scikit-learn](https://scikit-learn.org/stable/modules/naive_bayes.html#complement-naive-bayes) that 'CNB regularly outperforms MNB (often by a considerable margin) on text classification tasks.'
2, CNB is highly similar to existing MNB, only a small part of existing MNB need to be changed, so it is a easy win to support CNB.
### Does this PR introduce any user-facing change?
yes, a new `modelType` is supported
### How was this patch tested?
added testsuites
Closes#26575 from zhengruifeng/cnb.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
modify Param._copyValues to check valid Param objects supplied as extra
### What changes were proposed in this pull request?
Estimator.fit() and Model.transform() accept a dictionary of extra parameters whose values are used to overwrite those supplied at initialization or by default. Additionally, the ParamGridBuilder.addGrid accepts a parameter and list of values. The keys are presumed to be valid Param objects. This change adds a check that only Param objects are supplied as keys.
### Why are the changes needed?
Param objects are created by and bound to an instance of Params (Estimator, Model, or Transformer). They may be obtained from their parent as attributes, or by name through getParam.
The documentation does not state that keys must be valid Param objects, nor describe how one may be obtained. The current behavior is to silently ignore keys which are not valid Param objects.
### Does this PR introduce any user-facing change?
If the user does not pass in a Param object as required for keys in `extra` for Estimator.fit() and Model.transform(), and `param` for ParamGridBuilder.addGrid, an error will be raised indicating it is an invalid object.
### How was this patch tested?
Added method test_copy_param_extras_check to test_param.py. Tested with Python 3.7
Closes#26527 from JohnHBauer/paramExtra.
Authored-by: John Bauer <john.h.bauer@gmail.com>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
### What changes were proposed in this pull request?
support `modelType` `gaussian`
### Why are the changes needed?
current modelTypes do not support continuous data
### Does this PR introduce any user-facing change?
yes, add a `modelType` option
### How was this patch tested?
existing testsuites and added ones
Closes#26413 from zhengruifeng/gnb.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Add ```__repr__``` in Python ML Models
### Why are the changes needed?
In Python ML Models, some of them have ```__repr__```, others don't. In the doctest, when calling Model.setXXX, some of the Models print out the xxxModel... correctly, some of them can't because of lacking the ```__repr__``` method. For example:
```
>>> gm = GaussianMixture(k=3, tol=0.0001, seed=10)
>>> model = gm.fit(df)
>>> model.setPredictionCol("newPrediction")
GaussianMixture...
```
After the change, the above code will become the following:
```
>>> gm = GaussianMixture(k=3, tol=0.0001, seed=10)
>>> model = gm.fit(df)
>>> model.setPredictionCol("newPrediction")
GaussianMixtureModel...
```
### Does this PR introduce any user-facing change?
Yes.
### How was this patch tested?
doctest
Closes#26489 from huaxingao/spark-29876.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Add multi-cols support in StopWordsRemover
### Why are the changes needed?
As a basic Transformer, StopWordsRemover should support multi-cols.
Param stopWords can be applied across all columns.
### Does this PR introduce any user-facing change?
```StopWordsRemover.setInputCols```
```StopWordsRemover.setOutputCols```
### How was this patch tested?
Unit tests
Closes#26480 from huaxingao/spark-29808.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
1,ML models should extend toString method to expose basic information.
Current some algs (GBT/RF/LoR) had done this, while others not yet.
2,add `val numFeatures` in `BisectingKMeansModel`/`GaussianMixtureModel`/`KMeansModel`/`AFTSurvivalRegressionModel`/`IsotonicRegressionModel`
### Why are the changes needed?
ML models should extend toString method to expose basic information.
### Does this PR introduce any user-facing change?
yes
### How was this patch tested?
existing testsuites
Closes#26439 from zhengruifeng/models_toString.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
This PR proposes to add **Single threading model design (pinned thread model)** mode which is an experimental mode to sync threads on PVM and JVM. See https://www.py4j.org/advanced_topics.html#using-single-threading-model-pinned-thread
### Multi threading model
Currently, PySpark uses this model. Threads on PVM and JVM are independent. For instance, in a different Python thread, callbacks are received and relevant Python codes are executed. JVM threads are reused when possible.
Py4J will create a new thread every time a command is received and there is no thread available. See the current model we're using - https://www.py4j.org/advanced_topics.html#the-multi-threading-model
One problem in this model is that we can't sync threads on PVM and JVM out of the box. This leads to some problems in particular at some codes related to threading in JVM side. See:
7056e004ee/core/src/main/scala/org/apache/spark/SparkContext.scala (L334)
Due to reusing JVM threads, seems the job groups in Python threads cannot be set in each thread as described in the JIRA.
### Single threading model design (pinned thread model)
This mode pins and syncs the threads on PVM and JVM to work around the problem above. For instance, in the same Python thread, callbacks are received and relevant Python codes are executed. See https://www.py4j.org/advanced_topics.html#the-single-threading-model
Even though this mode can sync threads on PVM and JVM for other thread related code paths,
this might cause another problem: seems unable to inherit properties as below (assuming multi-thread mode still creates new threads when existing threads are busy, I suspect this issue already exists when multiple jobs are submitted in multi-thread mode; however, it can be always seen in single threading mode):
```bash
$ PYSPARK_PIN_THREAD=true ./bin/pyspark
```
```python
import threading
spark.sparkContext.setLocalProperty("a", "hi")
def print_prop():
print(spark.sparkContext.getLocalProperty("a"))
threading.Thread(target=print_prop).start()
```
```
None
```
Unlike Scala side:
```scala
spark.sparkContext.setLocalProperty("a", "hi")
new Thread(new Runnable {
def run() = println(spark.sparkContext.getLocalProperty("a"))
}).start()
```
```
hi
```
This behaviour potentially could cause weird issues but this PR currently does not target this fix this for now since this mode is experimental.
### How does this PR fix?
Basically there are two types of Py4J servers `GatewayServer` and `ClientServer`. The former is for multi threading and the latter is for single threading. This PR adds a switch to use the latter.
In Scala side:
The logic to select a server is encapsulated in `Py4JServer` and use `Py4JServer` at `PythonRunner` for Spark summit and `PythonGatewayServer` for Spark shell. Each uses `ClientServer` when `PYSPARK_PIN_THREAD` is `true` and `GatewayServer` otherwise.
In Python side:
Simply do an if-else to switch the server to talk. It uses `ClientServer` when `PYSPARK_PIN_THREAD` is `true` and `GatewayServer` otherwise.
This is disabled by default for now.
## How was this patch tested?
Manually tested. This can be tested via:
```python
PYSPARK_PIN_THREAD=true ./bin/pyspark
```
and/or
```bash
cd python
./run-tests --python-executables=python --testnames "pyspark.tests.test_pin_thread"
```
Also, ran the Jenkins tests with `PYSPARK_PIN_THREAD` enabled.
Closes#24898 from HyukjinKwon/pinned-thread.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
expose expert param `aggregationDepth` in algs: GMM/GLR
### Why are the changes needed?
SVC/LoR/LiR/AFT had exposed expert param aggregationDepth to end users. It should be nice to expose it in similar algs.
### Does this PR introduce any user-facing change?
yes, expose new param
### How was this patch tested?
added pytext tests
Closes#26322 from zhengruifeng/agg_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
1, add shared param `relativeError`
2, `Imputer`/`RobusterScaler`/`QuantileDiscretizer` extend `HasRelativeError`
### Why are the changes needed?
It makes sense to expose RelativeError to end users, since it controls both the precision and memory overhead.
`QuantileDiscretizer` had already added this param, while other algs not yet.
### Does this PR introduce any user-facing change?
yes, new param is added in `Imputer`/`RobusterScaler`
### How was this patch tested?
existing testsutes
Closes#26305 from zhengruifeng/add_relative_err.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
add single-column input/ouput support in OneHotEncoder
### Why are the changes needed?
Currently, OneHotEncoder only has multi columns support. It makes sense to support single column as well.
### Does this PR introduce any user-facing change?
Yes
```OneHotEncoder.setInputCol```
```OneHotEncoder.setOutputCol```
### How was this patch tested?
Unit test
Closes#26265 from huaxingao/spark-29565.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
### What changes were proposed in this pull request?
add single-column input/output support in Imputer
### Why are the changes needed?
Currently, Imputer only has multi-column support. This PR adds single-column input/output support.
### Does this PR introduce any user-facing change?
Yes. add single-column input/output support in Imputer
```Imputer.setInputCol```
```Imputer.setOutputCol```
### How was this patch tested?
add unit tests
Closes#26247 from huaxingao/spark-29566.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Remove automatically generated param setters in _shared_params_code_gen.py
### Why are the changes needed?
To keep parity between scala and python
### Does this PR introduce any user-facing change?
Yes
Add some setters in Python ML XXXModels
### How was this patch tested?
unit tests
Closes#26232 from huaxingao/spark-29093.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Currently pyspark doesn't write/read `avgMetrics` in `CrossValidatorModel`, whereas scala supports it.
### Why are the changes needed?
Test step to reproduce it:
```
dataset = spark.createDataFrame([(Vectors.dense([0.0]), 0.0),
(Vectors.dense([0.4]), 1.0),
(Vectors.dense([0.5]), 0.0),
(Vectors.dense([0.6]), 1.0),
(Vectors.dense([1.0]), 1.0)] * 10,
["features", "label"])
lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator,parallelism=2)
cvModel = cv.fit(dataset)
cvModel.write().save("/tmp/model")
cvModel2 = CrossValidatorModel.read().load("/tmp/model")
print(cvModel.avgMetrics) # prints non empty result as expected
print(cvModel2.avgMetrics) # Bug: prints an empty result.
```
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Manually tested
Before patch:
```
>>> cvModel.write().save("/tmp/model_0")
>>> cvModel2 = CrossValidatorModel.read().load("/tmp/model_0")
>>> print(cvModel2.avgMetrics)
[]
```
After patch:
```
>>> cvModel2 = CrossValidatorModel.read().load("/tmp/model_2")
>>> print(cvModel2.avgMetrics[0])
0.5
```
Closes#26038 from shahidki31/avgMetrics.
Authored-by: shahid <shahidki31@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
`ml.MulticlassClassificationEvaluator` & `mllib.MulticlassMetrics` support log-loss
### Why are the changes needed?
log-loss is an important classification metric and is widely used in practice
### Does this PR introduce any user-facing change?
Yes, add new option ("logloss") and a related param `eps`
### How was this patch tested?
added testsuites & local tests refering to sklearn
Closes#26135 from zhengruifeng/logloss.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Add private _XXXParams classes for classification & regression
### Why are the changes needed?
To keep parity between scala and python
### Does this PR introduce any user-facing change?
Yes. Add gettters/setters for the following Model classes
```
LinearSVCModel:
get/setRegParam
get/setMaxIte
get/setFitIntercept
get/setTol
get/setStandardization
get/setWeightCol
get/setAggregationDepth
get/setThreshold
LogisticRegressionModel:
get/setRegParam
get/setElasticNetParam
get/setMaxIter
get/setFitIntercept
get/setTol
get/setStandardization
get/setWeightCol
get/setAggregationDepth
get/setThreshold
NaiveBayesModel:
get/setWeightCol
LinearRegressionModel:
get/setRegParam
get/setElasticNetParam
get/setMaxIter
get/setTol
get/setFitIntercept
get/setStandardization
get/setWeight
get/setSolver
get/setAggregationDepth
get/setLoss
GeneralizedLinearRegressionModel:
get/setFitIntercept
get/setMaxIter
get/setTol
get/setRegParam
get/setWeightCol
get/setSolver
```
### How was this patch tested?
Add a few doctest
Closes#26142 from huaxingao/spark-29381.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
change PySpark ml ```Params._clear``` to ```Params.clear```
### Why are the changes needed?
PySpark ML currently has a private _clear() method that will unset a param. This should be made public to match the Scala API and give users a way to unset a user supplied param.
### Does this PR introduce any user-facing change?
Yes. PySpark ml ```Params._clear``` ---> ```Params.clear```
### How was this patch tested?
Add test.
Closes#26130 from huaxingao/spark-29464.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
### What changes were proposed in this pull request?
Binarizer support multi-column by extending `HasInputCols`/`HasOutputCols`/`HasThreshold`/`HasThresholds`
### Why are the changes needed?
similar algs in `ml.feature` already support multi-column, like `Bucketizer`/`StringIndexer`/`QuantileDiscretizer`
### Does this PR introduce any user-facing change?
yes, add setter/getter of `thresholds`/`inputCols`/`outputCols`
### How was this patch tested?
added suites
Closes#26064 from zhengruifeng/binarizer_multicols.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Add _ before XXXParams classes to indicate internal usage
### Why are the changes needed?
Follow the PEP 8 convention to use _single_leading_underscore to indicate internal use
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
use existing tests
Closes#26103 from huaxingao/spark-29381.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
Follow Scala ml tuning implementation
- put leading underscore before python ```ValidatorParams``` to indicate private
- add ```_CrossValidatorParams``` and ```_TrainValidationSplitParams```
- separate the getters and setters. Put getters in _XXXParams and setters in the Classes.
### Why are the changes needed?
Keep parity between scala and python
### Does this PR introduce any user-facing change?
add ```CrossValidatorModel.getNumFolds``` and ```TrainValidationSplitModel.getTrainRatio()```
### How was this patch tested?
Add doctest
Closes#26057 from huaxingao/spark-tuning.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
- Move tree related classes to a separate file ```tree.py```
- add method ```predictLeaf``` in ```DecisionTreeModel```& ```TreeEnsembleModel```
### Why are the changes needed?
- keep parity between scala and python
- easy code maintenance
### Does this PR introduce any user-facing change?
Yes
add method ```predictLeaf``` in ```DecisionTreeModel```& ```TreeEnsembleModel```
add ```setMinWeightFractionPerNode``` in ```DecisionTreeClassifier``` and ```DecisionTreeRegressor```
### How was this patch tested?
add some doc tests
Closes#25929 from huaxingao/spark_29116.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Add getters/setters in Pyspark ALSModel.
### Why are the changes needed?
To keep parity between python and scala.
### Does this PR introduce any user-facing change?
Yes.
add the following getters/setters to ALSModel
```
get/setUserCol
get/setItemCol
get/setColdStartStrategy
get/setPredictionCol
```
### How was this patch tested?
add doctest
Closes#25947 from huaxingao/spark-29269.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
add column setters/getters support in Pyspark feature models
### Why are the changes needed?
keep parity between Pyspark and Scala
### Does this PR introduce any user-facing change?
Yes.
After the change, Pyspark feature models have column setters/getters support.
### How was this patch tested?
Add some doctests
Closes#25908 from huaxingao/spark-29143.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
### Why are the changes needed?
Keep parity between Scala and Python
### Does this PR introduce any user-facing change?
add the following getters/setter to FPGrowthModel
```
getMinSupport
getNumPartitions
getMinConfidence
getItemsCol
getPredictionCol
setItemsCol
setMinConfidence
setPredictionCol
```
add following getters/setters to PrefixSpan
```
set/getMinSupport
set/getMaxPatternLength
set/getMaxLocalProjDBSize
set/getSequenceCol
```
### How was this patch tested?
add doctest
Closes#26035 from huaxingao/spark-29360.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
Adds
```python
class _IsotonicRegressionBase(HasFeaturesCol, HasLabelCol, HasPredictionCol, HasWeightCol): ...
```
with related `Params` and uses it to replace `JavaPredictor` and `HasWeightCol` in `IsotonicRegression` base classes and `JavaPredictionModel,` in `IsotonicRegressionModel` base classes.
### Why are the changes needed?
Previous work (#25776) on [SPARK-28985](https://issues.apache.org/jira/browse/SPARK-28985) replaced `JavaEstimator`, `HasFeaturesCol`, `HasLabelCol`, `HasPredictionCol` in `IsotonicRegression` and `JavaModel` in `IsotonicRegressionModel` with newly added `JavaPredictor`:
e97b55d322/python/pyspark/ml/wrapper.py (L377)
and `JavaPredictionModel`
e97b55d322/python/pyspark/ml/wrapper.py (L405)
respectively.
This however is inconsistent with Scala counterpart where both classes extend private `IsotonicRegressionBase`
3cb1b57809/mllib/src/main/scala/org/apache/spark/ml/regression/IsotonicRegression.scala (L42-L43)
This preserves some of the existing inconsistencies (`model` as defined in [the official example](https://github.com/apache/spark/blob/master/examples/src/main/python/ml/isotonic_regression_example.py)), i.e.
```python
from pyspark.ml.regression impor IsotonicRegressionMode
from pyspark.ml.param.shared import HasWeightCol
issubclass(IsotonicRegressionModel, HasWeightCol)
# False
hasattr(model, "weightCol")
# True
```
as well as introduces a bug, by adding unsupported `predict` method:
```python
import inspect
hasattr(model, "predict")
# True
inspect.getfullargspec(IsotonicRegressionModel.predict)
# FullArgSpec(args=['self', 'value'], varargs=None, varkw=None, defaults=None, kwonlyargs=[], kwonlydefaults=None, annotations={})
IsotonicRegressionModel.predict.__doc__
# Predict label for the given features.\n\n .. versionadded:: 3.0.0'
model.predict(dataset.first().features)
# Py4JError: An error occurred while calling o49.predict. Trace:
# py4j.Py4JException: Method predict([class org.apache.spark.ml.linalg.SparseVector]) does not exist
# ...
```
Furthermore existing implementation can cause further problems in the future, if `Predictor` / `PredictionModel` API changes.
### Does this PR introduce any user-facing change?
Yes. It:
- Removes invalid `IsotonicRegressionModel.predict` method.
- Adds `HasWeightColumn` to `IsotonicRegressionModel`.
however the faulty implementation hasn't been released yet, and proposed additions have negligible potential for breaking existing code (and none, compared to changes already made in #25776).
### How was this patch tested?
- Existing unit tests.
- Manual testing.
CC huaxingao, zhengruifeng
Closes#26023 from zero323/SPARK-28985-FOLLOW-UP-isotonic-regression.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
Adds
```python
_AFTSurvivalRegressionParams(HasFeaturesCol, HasLabelCol, HasPredictionCol,
HasMaxIter, HasTol, HasFitIntercept,
HasAggregationDepth): ...
```
with related Params and uses it to replace `HasFitIntercept`, `HasMaxIter`, `HasTol` and `HasAggregationDepth` in `AFTSurvivalRegression` base classes and `JavaPredictionModel,` in `AFTSurvivalRegressionModel` base classes.
### Why are the changes needed?
Previous work (#25776) on [SPARK-28985](https://issues.apache.org/jira/browse/SPARK-28985) replaced `JavaEstimator`, `HasFeaturesCol`, `HasLabelCol`, `HasPredictionCol` in `AFTSurvivalRegression` and `JavaModel` in `AFTSurvivalRegressionModel` with newly added `JavaPredictor`:
e97b55d322/python/pyspark/ml/wrapper.py (L377)
and `JavaPredictionModel`
e97b55d322/python/pyspark/ml/wrapper.py (L405)
respectively.
This however is inconsistent with Scala counterpart where both classes extend private `AFTSurvivalRegressionBase`
eb037a8180/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala (L48-L50)
This preserves some of the existing inconsistencies (variables as defined in [the official example](https://github.com/apache/spark/blob/master/examples/src/main/python/ml/aft_survival_regression.p))
```
from pyspark.ml.regression import AFTSurvivalRegression, AFTSurvivalRegressionModel
from pyspark.ml.param.shared import HasMaxIter, HasTol, HasFitIntercept, HasAggregationDepth
from pyspark.ml.param import Param
issubclass(AFTSurvivalRegressionModel, HasMaxIter)
# False
hasattr(model, "maxIter") and isinstance(model.maxIter, Param)
# True
issubclass(AFTSurvivalRegressionModel, HasTol)
# False
hasattr(model, "tol") and isinstance(model.tol, Param)
# True
```
and can cause problems in the future, if Predictor / PredictionModel API changes (unlike [`IsotonicRegression`](https://github.com/apache/spark/pull/26023), current implementation is technically speaking correct, though incomplete).
### Does this PR introduce any user-facing change?
Yes, it adds a number of base classes to `AFTSurvivalRegressionModel`. These change purely additive and have negligible potential for breaking existing code (and none, compared to changes already made in #25776). Additionally affected API hasn't been released in the current form yet.
### How was this patch tested?
- Existing unit tests.
- Manual testing.
CC huaxingao, zhengruifeng
Closes#26024 from zero323/SPARK-28985-FOLLOW-UP-aftsurival-regression.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
This PR replaces some references with correct ones (`:py:class:`).
### Why are the changes needed?
Newly added mixins from the original PR incorrectly reference classes with `:py:attr:`. While these classes are marked as internal, and not rendered in the standard documentation, it still makes sense to use correct roles.
### Does this PR introduce any user-facing change?
No. The changed part is not a part of generated PySpark documents.
### How was this patch tested?
Since this PR is a kind of typo fix, manually checking the patch.
We can build document for compilation test although there is no UI change.
Closes#26004 from zero323/SPARK-29142-FOLLOWUP.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>