Commit graph

536 commits

Author SHA1 Message Date
Huaxin Gao d4007776f2 [SPARK-31734][ML][PYSPARK] Add weight support in ClusteringEvaluator
### What changes were proposed in this pull request?
Add weight support in ClusteringEvaluator

### Why are the changes needed?
Currently, BinaryClassificationEvaluator, RegressionEvaluator, and MulticlassClassificationEvaluator support instance weight, but ClusteringEvaluator doesn't, so we will add instance weight support in ClusteringEvaluator.

### Does this PR introduce _any_ user-facing change?
Yes.
ClusteringEvaluator.setWeightCol

### How was this patch tested?
add new unit test

Closes #28553 from huaxingao/weight_evaluator.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-05-25 09:18:08 -05:00
David Toneian acab558e55 [SPARK-31739][PYSPARK][DOCS][MINOR] Fix docstring syntax issues and misplaced space characters
This commit is published into the public domain.

### What changes were proposed in this pull request?
Some syntax issues in docstrings have been fixed.

### Why are the changes needed?
In some places, the documentation did not render as intended, e.g. parameter documentations were not formatted as such.

### Does this PR introduce any user-facing change?
Slight improvements in documentation.

### How was this patch tested?
Manual testing and `dev/lint-python` run. No new Sphinx warnings arise due to this change.

Closes #28559 from DavidToneian/SPARK-31739.

Authored-by: David Toneian <david@toneian.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-18 20:25:02 +09:00
Huaxin Gao e10516ae63 [SPARK-31681][ML][PYSPARK] Python multiclass logistic regression evaluate should return LogisticRegressionSummary
### What changes were proposed in this pull request?
Return LogisticRegressionSummary for multiclass logistic regression evaluate in PySpark

### Why are the changes needed?
Currently we have
```
    since("2.0.0")
    def evaluate(self, dataset):
        if not isinstance(dataset, DataFrame):
            raise ValueError("dataset must be a DataFrame but got %s." % type(dataset))
        java_blr_summary = self._call_java("evaluate", dataset)
        return BinaryLogisticRegressionSummary(java_blr_summary)
```
we should return LogisticRegressionSummary for multiclass logistic regression

### Does this PR introduce _any_ user-facing change?
Yes
return LogisticRegressionSummary instead of BinaryLogisticRegressionSummary for multiclass logistic regression in Python

### How was this patch tested?
unit test

Closes #28503 from huaxingao/lr_summary.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-05-14 10:54:35 -05:00
zhengruifeng e7fa778dc7 [SPARK-30699][ML][PYSPARK] GMM blockify input vectors
### What changes were proposed in this pull request?
1, add new param blockSize;
2, if blockSize==1, keep original behavior, code path trainOnRows;
3, if blockSize>1, standardize and stack input vectors to blocks (like ALS/MLP), code path trainOnBlocks

### Why are the changes needed?
performance gain on dense dataset HIGGS:
1, save about 45% RAM;
2, 3X faster with openBLAS

### Does this PR introduce any user-facing change?
add a new expert param `blockSize`

### How was this patch tested?
added testsuites

Closes #27473 from zhengruifeng/blockify_gmm.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-05-12 12:54:03 +08:00
Huaxin Gao 7a670b5a0a [SPARK-31667][ML][PYSPARK] Python side flatten the result dataframe of ANOVATest/ChisqTest/FValueTest
### What changes were proposed in this pull request?
Add Python version of
```
Since("3.1.0")
def test(
    dataset: DataFrame,
    featuresCol: String,
    labelCol: String,
    flatten: Boolean): DataFrame
```

### Why are the changes needed?
parity between scala and python

### Does this PR introduce _any_ user-facing change?
yes
new method
```
Since("3.1.0")
def test(
    dataset: DataFrame,
    featuresCol: String,
    labelCol: String,
    flatten: Boolean): DataFrame
```
in PySpark ANOVATest/ChisqTest/FValueTest

### How was this patch tested?
New doctest

Closes #28483 from huaxingao/flatten_py.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-05-11 09:09:00 -05:00
zhengruifeng bb9b50c217 [SPARK-31656][ML][PYSPARK] AFT blockify input vectors
### What changes were proposed in this pull request?
1, add new param blockSize;
2, add a new class InstanceBlock;
3, if blockSize==1, keep original behavior; if blockSize>1, stack input vectors to blocks (like ALS/MLP);
4, if blockSize>1, standardize the input outside of optimization procedure;

### Why are the changes needed?
it will obtain performance gain on dense datasets, such as epsilon
1, reduce RAM to persist traing dataset; (save about 40% RAM)
2, use Level-2 BLAS routines; (~10X speedup)

### Does this PR introduce _any_ user-facing change?
Yes, a new param is added

### How was this patch tested?
existing and added testsuites

Closes #28473 from zhengruifeng/blockify_aft.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-05-08 14:06:36 +08:00
Huaxin Gao 18d2ba53e4 [SPARK-31652][ML][PYSPARK] Add ANOVASelector and FValueSelector to PySpark
### What changes were proposed in this pull request?
Add ANOVASelector and FValueSelector to PySpark

### Why are the changes needed?
ANOVASelector and FValueSelector have been implemented in Scala. We need to implement these in Python as well.

### Does this PR introduce _any_ user-facing change?
Yes. Add Python version of ANOVASelector and FValueSelector

### How was this patch tested?
new doctest

Closes #28464 from huaxingao/selector_py.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-05-08 11:02:24 +08:00
zhengruifeng 97332f26bf [SPARK-30660][ML][PYSPARK] LinearRegression blockify input vectors
### What changes were proposed in this pull request?
1, add new param blockSize;
2, add a new class InstanceBlock;
3, if blockSize==1, keep original behavior; if blockSize>1, stack input vectors to blocks (like ALS/MLP);
4, if blockSize>1, standardize the input outside of optimization procedure;

### Why are the changes needed?
it will obtain performance gain on dense datasets, such as `epsilon`
1, reduce RAM to persist traing dataset; (save about 40% RAM)
2, use Level-2 BLAS routines;  (up to 6X(squaredError)~12X(huber) speedup)

### Does this PR introduce _any_ user-facing change?
Yes, a new param is added

### How was this patch tested?
existing and added testsuites

Closes #28471 from zhengruifeng/blockify_lir_II.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-05-08 10:52:01 +08:00
zhengruifeng 052ff49acd [SPARK-30659][ML][PYSPARK] LogisticRegression blockify input vectors
### What changes were proposed in this pull request?
1, reorg the `fit` method in LR to several blocks (`createModel`, `createBounds`, `createOptimizer`, `createInitCoefWithInterceptMatrix`);
2, add new param blockSize;
3, if blockSize==1, keep original behavior, code path `trainOnRows`;
4, if blockSize>1, standardize and stack input vectors to blocks (like ALS/MLP), code path `trainOnBlocks`

### Why are the changes needed?
On dense dataset `epsilon_normalized.t`:
1, reduce RAM to persist traing dataset; (save about 40% RAM)
2, use Level-2 BLAS routines; (4x ~ 5x faster)

### Does this PR introduce _any_ user-facing change?
Yes, a new param is added

### How was this patch tested?
existing and added testsuites

Closes #28458 from zhengruifeng/blockify_lor_II.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-05-07 10:07:24 +08:00
Huaxin Gao 09ece50799 [SPARK-31609][ML][PYSPARK] Add VarianceThresholdSelector to PySpark
### What changes were proposed in this pull request?
Add VarianceThresholdSelector to PySpark

### Why are the changes needed?
parity between Scala and Python

### Does this PR introduce any user-facing change?
Yes.
VarianceThresholdSelector is added to PySpark

### How was this patch tested?
new doctest

Closes #28409 from huaxingao/variance_py.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-05-06 09:11:03 -05:00
zhengruifeng ebdf41dd69 [SPARK-30642][ML][PYSPARK] LinearSVC blockify input vectors
### What changes were proposed in this pull request?
1, add new param `blockSize`;
2, add a new class InstanceBlock;
3, **if `blockSize==1`, keep original behavior; if `blockSize>1`, stack input vectors to blocks (like ALS/MLP);**
4, if `blockSize>1`, standardize the input outside of optimization procedure;

### Why are the changes needed?
1, reduce RAM to persist traing dataset; (save about 40% RAM)
2, use Level-2 BLAS routines; (4x ~ 5x faster on dataset `epsilon`)

### Does this PR introduce any user-facing change?
Yes, a new param is added

### How was this patch tested?
existing and added testsuites

Closes #28349 from zhengruifeng/blockify_svc_II.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-05-06 10:06:23 +08:00
Weichen Xu 4a21c4cc92 [SPARK-31497][ML][PYSPARK] Fix Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot save and load model
### What changes were proposed in this pull request?
Fix Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot save and load model.

Most pyspark estimators/transformers inherit `JavaParams`, but some estimators are special (in order to support pure python implemented nested estimators/transformers):
* Pipeline
* OneVsRest
* CrossValidator
* TrainValidationSplit

But note that, currently, in pyspark, estimators listed above, their model reader/writer do NOT support pure python implemented nested estimators/transformers. Because they use java reader/writer wrapper as python side reader/writer.

Pyspark CrossValidator/TrainValidationSplit model reader/writer require all estimators define the `_transfer_param_map_to_java` and `_transfer_param_map_from_java` (used in model read/write).

OneVsRest class already defines the two methods, but Pipeline do not, so it lead to this bug.

In this PR I add `_transfer_param_map_to_java` and `_transfer_param_map_from_java` into Pipeline class.

### Why are the changes needed?
Bug fix.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Unit test.

Manually test in pyspark shell:
1) CrossValidator with Simple Pipeline estimator
```
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import CrossValidator, CrossValidatorModel, ParamGridBuilder

training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0),
    (4, "b spark who", 1.0),
    (5, "g d a y", 0.0),
    (6, "spark fly", 1.0),
    (7, "was mapreduce", 0.0),
], ["id", "text", "label"])

# Configure an ML pipeline, which consists of tree stages: tokenizer, hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .build()
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=2)  # use 3+ folds in practice

# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(training)

cvModel.save('/tmp/cv_model001')
CrossValidatorModel.load('/tmp/cv_model001')
```

2) CrossValidator with Pipeline estimator which include a OneVsRest estimator stage, and OneVsRest estimator nest a LogisticRegression estimator.

```
from pyspark.ml.linalg import Vectors
from pyspark.ml import Estimator, Model
from pyspark.ml.classification import LogisticRegression, LogisticRegressionModel, OneVsRest
from pyspark.ml.evaluation import BinaryClassificationEvaluator, \
    MulticlassClassificationEvaluator, RegressionEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.ml.param import Param, Params
from pyspark.ml.tuning import CrossValidator, CrossValidatorModel, ParamGridBuilder, \
    TrainValidationSplit, TrainValidationSplitModel
from pyspark.sql.functions import rand
from pyspark.testing.mlutils import SparkSessionTestCase

dataset = spark.createDataFrame(
    [(Vectors.dense([0.0]), 0.0),
     (Vectors.dense([0.4]), 1.0),
     (Vectors.dense([0.5]), 0.0),
     (Vectors.dense([0.6]), 1.0),
     (Vectors.dense([1.0]), 1.0)] * 10,
    ["features", "label"])

ova = OneVsRest(classifier=LogisticRegression())
lr1 = LogisticRegression().setMaxIter(100)
lr2 = LogisticRegression().setMaxIter(150)
grid = ParamGridBuilder().addGrid(ova.classifier, [lr1, lr2]).build()
evaluator = MulticlassClassificationEvaluator()

pipeline = Pipeline(stages=[ova])

cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset)
cvModel.save('/tmp/model002')

cvModel2 = CrossValidatorModel.load('/tmp/model002')
```

TrainValidationSplit testing code are similar so I do not paste them.

Closes #28279 from WeichenXu123/fix_pipeline_tuning.

Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Xiangrui Meng <meng@databricks.com>
2020-04-26 21:04:14 -07:00
Huaxin Gao d279dbf09c [SPARK-31243][ML][PYSPARK] Add ANOVATest and FValueTest to PySpark
### What changes were proposed in this pull request?
Add ANOVATest and FValueTest to PySpark

### Why are the changes needed?
Parity between Scala and Python.

### Does this PR introduce any user-facing change?
Yes. Python ANOVATest and FValueTest

### How was this patch tested?
doctest

Closes #28012 from huaxingao/stats-python.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-03-27 14:05:49 +08:00
Huaxin Gao 3ce1dff7ba [SPARK-30930][ML] Remove ML/MLLIB DeveloperApi annotations
### What changes were proposed in this pull request?
jira link: https://issues.apache.org/jira/browse/SPARK-30930

Remove ML/MLLIB DeveloperApi annotations.

### Why are the changes needed?

The Developer APIs in ML/MLLIB have been there for a long time. They are stable now and are very unlikely to be changed or removed, so I unmark these Developer APIs in this PR.

### Does this PR introduce any user-facing change?
Yes. DeveloperApi annotations are removed from docs.

### How was this patch tested?
existing tests

Closes #27859 from huaxingao/spark-30930.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-03-16 12:41:22 -05:00
Huaxin Gao 4a64901ab7 [SPARK-31012][ML][PYSPARK][DOCS] Updating ML API docs for 3.0 changes
### What changes were proposed in this pull request?
Updating ML docs for 3.0 changes

### Why are the changes needed?
I am auditing 3.0 ML changes, found some docs are missing or not updated. Need to update these.

### Does this PR introduce any user-facing change?
Yes, doc changes

### How was this patch tested?
Manually build and check

Closes #27762 from huaxingao/spark-doc.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-03-07 11:42:05 -06:00
zero323 e1b3e9a3d2 [SPARK-29212][ML][PYSPARK] Add common classes without using JVM backend
### What changes were proposed in this pull request?

Implement common base ML classes (`Predictor`, `PredictionModel`, `Classifier`, `ClasssificationModel` `ProbabilisticClassifier`, `ProbabilisticClasssificationModel`, `Regressor`, `RegrssionModel`) for non-Java backends.

Note

- `Predictor` and `JavaClassifier` should be abstract as `_fit` method is not implemented.
- `PredictionModel` should be abstract as `_transform` is not implemented.

### Why are the changes needed?

To provide extensions points for non-JVM algorithms, as well as a public (as opposed to `Java*` variants, which are commonly described in docstrings as private) hierarchy which can be used to distinguish between different classes of predictors.

For longer discussion see [SPARK-29212](https://issues.apache.org/jira/browse/SPARK-29212) and / or https://github.com/apache/spark/pull/25776.

### Does this PR introduce any user-facing change?

It adds new base classes as listed above, but effective interfaces (method resolution order notwithstanding) stay the same.

Additionally "private" `Java*` classes in`ml.regression` and `ml.classification` have been renamed to follow PEP-8 conventions (added leading underscore).

It is for discussion if the same should be done to equivalent classes from `ml.wrapper`.

If we take `JavaClassifier` as an example, type hierarchy will change from

![old pyspark ml classification JavaClassifier](https://user-images.githubusercontent.com/1554276/72657093-5c0b0c80-39a0-11ea-9069-a897d75de483.png)

to

![new pyspark ml classification _JavaClassifier](https://user-images.githubusercontent.com/1554276/72657098-64fbde00-39a0-11ea-8f80-01187a5ea5a6.png)

Similarly the old model

![old pyspark ml classification JavaClassificationModel](https://user-images.githubusercontent.com/1554276/72657103-7513bd80-39a0-11ea-9ffc-59eb6ab61fde.png)

will become

![new pyspark ml classification _JavaClassificationModel](https://user-images.githubusercontent.com/1554276/72657110-80ff7f80-39a0-11ea-9f5c-fe408664e827.png)

### How was this patch tested?

Existing unit tests.

Closes #27245 from zero323/SPARK-29212.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-03-04 12:20:02 +08:00
zhengruifeng 111e9038d8 [SPARK-30770][ML] avoid vector conversion in GMM.transform
### What changes were proposed in this pull request?
Current impl needs to convert ml.Vector to breeze.Vector, which can be skipped.

### Why are the changes needed?
avoid unnecessary vector conversions

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
existing testsuites

Closes #27519 from zhengruifeng/gmm_transform_opt.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-03-04 11:02:27 +08:00
David Toneian 504b5135d0 [SPARK-30859][PYSPARK][DOCS][MINOR] Fixed docstring syntax issues preventing proper compilation of documentation
This commit is published into the public domain.

### What changes were proposed in this pull request?
Some syntax issues in docstrings have been fixed.

### Why are the changes needed?
In some places, the documentation did not render as intended, e.g. parameter documentations were not formatted as such.

### Does this PR introduce any user-facing change?
Slight improvements in documentation.

### How was this patch tested?
Manual testing. No new Sphinx warnings arise due to this change.

Closes #27613 from DavidToneian/SPARK-30859.

Authored-by: David Toneian <david@toneian.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-18 16:46:45 +09:00
Liang Zhang 82d0aa37ae [SPARK-30762] Add dtype=float32 support to vector_to_array UDF
### What changes were proposed in this pull request?
In this PR, we add a parameter in the python function vector_to_array(col) that allows converting to a column of arrays of Float (32bits) in scala, which would be mapped to a numpy array of dtype=float32.

### Why are the changes needed?
In the downstream ML training, using float32 instead of float64 (default) would allow a larger batch size, i.e., allow more data to fit in the memory.

### Does this PR introduce any user-facing change?
Yes.
Old: `vector_to_array()` only take one param
```
df.select(vector_to_array("colA"), ...)
```
New: `vector_to_array()` can take an additional optional param: `dtype` = "float32" (or "float64")
```
df.select(vector_to_array("colA", "float32"), ...)
```

### How was this patch tested?
Unit test in scala.
doctest in python.

Closes #27522 from liangz1/udf-float32.

Authored-by: Liang Zhang <liang.zhang@databricks.com>
Signed-off-by: WeichenXu <weichen.xu@databricks.com>
2020-02-13 23:55:13 +08:00
Huaxin Gao a7ae77a8d8 [SPARK-30662][ML][PYSPARK] Put back the API changes for HasBlockSize in ALS/MLP
### What changes were proposed in this pull request?
Add ```HasBlockSize``` in shared Params in both Scala and Python.
Make ALS/MLP extend ```HasBlockSize```

### Why are the changes needed?
Add ```HasBlockSize ``` in ALS, so user can specify the blockSize.
Make ```HasBlockSize``` a shared param so both ALS and MLP can use it.

### Does this PR introduce any user-facing change?
Yes
```ALS.setBlockSize/getBlockSize```
```ALSModel.setBlockSize/getBlockSize```

### How was this patch tested?
Manually tested. Also added doctest.

Closes #27501 from huaxingao/spark_30662.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-02-09 13:14:30 +08:00
zhengruifeng 12e1bbaddb Revert "[SPARK-30642][SPARK-30659][SPARK-30660][SPARK-30662]"
### What changes were proposed in this pull request?
Revert
#27360
#27396
#27374
#27389

### Why are the changes needed?
BLAS need more performace tests, specially on sparse datasets.
Perfermance test of LogisticRegression (https://github.com/apache/spark/pull/27374) on sparse dataset shows that blockify vectors to matrices and use BLAS will cause performance regression.
LinearSVC and LinearRegression were also updated in the same way as LogisticRegression, so we need to revert them to make sure no regression.

### Does this PR introduce any user-facing change?
remove newly added param blockSize

### How was this patch tested?
reverted testsuites

Closes #27487 from zhengruifeng/revert_blockify_ii.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-02-08 08:46:16 +08:00
zhengruifeng d0c3e9f1f7 [SPARK-30660][ML][PYSPARK] LinearRegression blockify input vectors
### What changes were proposed in this pull request?
1, use blocks instead of vectors for performance improvement
2, use Level-2 BLAS
3, move standardization of input vectors outside of gradient computation

### Why are the changes needed?
1, less RAM to persist training data; (save ~40%)
2, faster than existing impl; (30% ~ 102%)

### Does this PR introduce any user-facing change?
add a new expert param `blockSize`

### How was this patch tested?
updated testsuites

Closes #27396 from zhengruifeng/blockify_lireg.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-01-31 21:04:26 -06:00
Huaxin Gao 6fac411076 [SPARK-29093][ML][PYSPARK][FOLLOW-UP] Remove duplicate setter
### What changes were proposed in this pull request?
remove duplicate setter in ```BucketedRandomProjectionLSH```

### Why are the changes needed?
Remove the duplicate ```setInputCol/setOutputCol``` in ```BucketedRandomProjectionLSH``` because these two setter are already in super class ```LSH```

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Manually checked.

Closes #27397 from huaxingao/spark-29093.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-30 23:36:39 -08:00
Huaxin Gao f59685acaa [SPARK-30662][ML][PYSPARK] ALS/MLP extend HasBlockSize
### What changes were proposed in this pull request?
Make ALS/MLP extend ```HasBlockSize```

### Why are the changes needed?

Currently, MLP has its own ```blockSize``` param, we should make MLP extend ```HasBlockSize``` since ```HasBlockSize``` was added in ```sharedParams.scala``` recently.

ALS doesn't have ```blockSize``` param now, we can make it extend ```HasBlockSize```, so user can specify the ```blockSize```.

### Does this PR introduce any user-facing change?
Yes
```ALS.setBlockSize``` and ```ALS.getBlockSize```
```ALSModel.setBlockSize``` and ```ALSModel.getBlockSize```

### How was this patch tested?
Manually tested. Also added doctest.

Closes #27389 from huaxingao/spark-30662.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-01-30 13:13:10 -06:00
zhengruifeng 073ce12543 [SPARK-30659][ML][PYSPARK] LogisticRegression blockify input vectors
### What changes were proposed in this pull request?
1, use blocks instead of vectors
2, use Level-2 BLAS for binary, use Level-3 BLAS for multinomial

### Why are the changes needed?
1, less RAM to persist training data; (save ~40%)
2, faster than existing impl; (40% ~ 92%)

### Does this PR introduce any user-facing change?
add a new expert param `blockSize`

### How was this patch tested?
updated testsuites

Closes #27374 from zhengruifeng/blockify_lor.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-01-30 10:52:07 -06:00
zhengruifeng 96d27274f5 [SPARK-30642][ML][PYSPARK] LinearSVC blockify input vectors
### What changes were proposed in this pull request?
1, stack input vectors to blocks (like ALS/MLP);
2, add new param `blockSize`;
3, add a new class `InstanceBlock`
4, standardize the input outside of optimization procedure;

### Why are the changes needed?
1, reduce RAM to persist traing dataset; (save ~40% in test)
2, use Level-2 BLAS routines; (12% ~ 28% faster, without native BLAS)

### Does this PR introduce any user-facing change?
a new param `blockSize`

### How was this patch tested?
existing and updated testsuites

Closes #27360 from zhengruifeng/blockify_svc.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-01-28 20:55:21 +08:00
zhengruifeng f35f352096 [SPARK-30543][ML][PYSPARK][R] RandomForest add Param bootstrap to control sampling method
### What changes were proposed in this pull request?
add a param `bootstrap` to control whether bootstrap samples are used.

### Why are the changes needed?
Current RF with numTrees=1 will directly build a tree using the orignial dataset,

while with numTrees>1 it will use bootstrap samples to build trees.

This design is for training a DecisionTreeModel by the impl of RandomForest, however, it is somewhat strange.

In Scikit-Learn, there is a param [bootstrap](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier) to control whether bootstrap samples are used.

### Does this PR introduce any user-facing change?
Yes, new param is added

### How was this patch tested?
existing testsuites

Closes #27254 from zhengruifeng/add_bootstrap.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-01-23 16:44:13 +08:00
zero323 3228732fd5 [SPARK-30533][ML][PYSPARK] Add classes to represent Java Regressors and RegressionModels
### What changes were proposed in this pull request?

This PR adds:

- `pyspark.ml.regression.JavaRegressor`
- `pyspark.ml.regression.JavaRegressionModel`

classes and replaces `JavaPredictor` and `JavaPredictionModel` in

- `LinearRegression` / `LinearRegressionModel`
- `DecisionTreeRegressor` / `DecisionTreeRegressionModel` (just addition as `JavaPredictionModel` hasn't been used)
- `RandomForestRegressor` / `RandomForestRegressionModel`  (just addition as `JavaPredictionModel` hasn't been used)
- `GBTRegressor` / `GBTRegressionModel` (just addition as `JavaPredictionModel` hasn't been used)
- `AFTSurvivalRegression` / `AFTSurvivalRegressionModel`
- `GeneralizedLinearRegression` / `GeneralizedLinearRegressionModel`
- `FMRegressor` / `FMRegressionModel`

### Why are the changes needed?

- Internal PySpark consistency.
- Feature parity with Scala.
- Intermediate step towards implementing [SPARK-29212](https://issues.apache.org/jira/browse/SPARK-29212)

### Does this PR introduce any user-facing change?

It adds new base classes, so it will affect `mro`. Otherwise interfaces should stay intact.

### How was this patch tested?

Existing tests.

Closes #27241 from zero323/SPARK-30533.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-01-17 19:34:30 -06:00
Huaxin Gao 92dd7c9d2a [MINOR][ML] Change DecisionTreeClassifier to FMClassifier in OneVsRest setWeightCol test
### What changes were proposed in this pull request?
Change ```DecisionTreeClassifier``` to ```FMClassifier``` in ```OneVsRest``` setWeightCol test

### Why are the changes needed?
In ```OneVsRest```, if the classifier doesn't support instance weight, ```OneVsRest``` weightCol will be ignored, so unit test has tested one classifier(```LogisticRegression```) that support instance weight, and one classifier (```DecisionTreeClassifier```) that doesn't support instance weight. Since ```DecisionTreeClassifier``` now supports instance weight, we need to change it to the classifier that doesn't have weight support.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing test

Closes #27204 from huaxingao/spark-ovr-minor.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-01-17 10:04:41 +08:00
Huaxin Gao 1ef1d6caf2 [SPARK-29565][FOLLOWUP] add setInputCol/setOutputCol in OHEModel
### What changes were proposed in this pull request?
add setInputCol/setOutputCol in OHEModel

### Why are the changes needed?
setInputCol/setOutputCol should be in OHEModel too.

### Does this PR introduce any user-facing change?
Yes.
```OHEModel.setInputCol```
```OHEModel.setOutputCol```

### How was this patch tested?
Manually tested.

Closes #27228 from huaxingao/spark-29565.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-01-16 19:23:10 +08:00
zero323 990a2be27f [SPARK-30378][ML][PYSPARK][FOLLOWUP] Remove Param fields provided by _FactorizationMachinesParams
### What changes were proposed in this pull request?

Removal of following `Param` fields:

- `factorSize`
- `fitLinear`
- `miniBatchFraction`
- `initStd`
- `solver`

from `FMClassifier` and `FMRegressor`

### Why are the changes needed?

This `Param` members are already provided by `_FactorizationMachinesParams`

0f3d744c3f/python/pyspark/ml/regression.py (L2303-L2318)

which is mixed into `FMRegressor`:

0f3d744c3f/python/pyspark/ml/regression.py (L2350)

and `FMClassifier`:

0f3d744c3f/python/pyspark/ml/classification.py (L2793)

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Manual testing.

Closes #27205 from zero323/SPARK-30378-FOLLOWUP.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-01-15 08:43:36 -06:00
zero323 525c5695f8 [SPARK-30504][PYTHON][ML] Set weightCol in OneVsRest(Model) _to_java and _from_java
### What changes were proposed in this pull request?

This PR adjusts `_to_java` and `_from_java` of `OneVsRest` and `OneVsRestModel` to preserve `weightCol`.

### Why are the changes needed?

Currently both `Params` don't preserve `weightCol` `Params` when data is saved / loaded:

```python
from pyspark.ml.classification import LogisticRegression, OneVsRest, OneVsRestModel
from pyspark.ml.linalg import DenseVector

df = spark.createDataFrame([(0, 1, DenseVector([1.0, 0.0])), (0, 1, DenseVector([1.0, 0.0]))], ("label", "w", "features"))

ovr = OneVsRest(classifier=LogisticRegression()).setWeightCol("w")
ovrm = ovr.fit(df)
ovr.getWeightCol()
## 'w'
ovrm.getWeightCol()
## 'w'

ovr.write().overwrite().save("/tmp/ovr")
ovr_ = OneVsRest.load("/tmp/ovr")
ovr_.getWeightCol()
## KeyError
## ...
## KeyError: Param(parent='OneVsRest_5145d56b6bd1', name='weightCol', doc='weight column name. ...)

ovrm.write().overwrite().save("/tmp/ovrm")
ovrm_ = OneVsRestModel.load("/tmp/ovrm")
ovrm_ .getWeightCol()
## KeyError
## ...
## KeyError: Param(parent='OneVsRestModel_598c6d900fad', name='weightCol', doc='weight column name ...
```

### Does this PR introduce any user-facing change?

After this PR is merged, loaded objects will have `weightCol` `Param` set.

### How was this patch tested?

- Manual testing.
- Extension of existing persistence tests.

Closes #27190 from zero323/SPARK-30504.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-01-15 08:42:24 -06:00
zero323 3668291e6b [SPARK-30452][ML][PYSPARK][FOLLOWUP] Change IsotonicRegressionModel.numFeatures to property
### What changes were proposed in this pull request?

Change `IsotonicRegressionModel.numFeatures` from plain method to property.

### Why are the changes needed?

Consistency. Right now we use `numFeatures` in two other places in `pyspark.ml`

0f3d744c3f/python/pyspark/ml/feature.py (L4289-L4291)
0f3d744c3f/python/pyspark/ml/wrapper.py (L437-L439)

and one in `pyspark,mllib`

0f3d744c3f/python/pyspark/mllib/classification.py (L177-L179)

each time as a property.

Additionally all similar values in `ml` are exposed as properties, for example

0f3d744c3f/python/pyspark/ml/regression.py (L451-L453)

### Does this PR introduce any user-facing change?

Yes, but current API hasn't been released yet.

### How was this patch tested?

Existing doctests.

Closes #27206 from zero323/SPARK-30452-FOLLOWUP.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-01-15 12:29:23 +08:00
zhengruifeng 93200115d7 [SPARK-9478][ML][PYSPARK] Add sample weights to Random Forest
### What changes were proposed in this pull request?
1, change `convertToBaggedRDDSamplingWithReplacement` to attach instance weights
2, make RF supports weights

### Why are the changes needed?
`weightCol` is already exposed, while RF has not support weights.

### Does this PR introduce any user-facing change?
Yes, new setters

### How was this patch tested?
added testsuites

Closes #27097 from zhengruifeng/rf_support_weight.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-01-14 08:25:51 -06:00
Huaxin Gao 2688faeea5 [SPARK-30498][ML][PYSPARK] Fix some ml parity issues between python and scala
### What changes were proposed in this pull request?
There are some parity issues between python and scala

### Why are the changes needed?
keep parity between python and scala

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
existing tests

Closes #27196 from huaxingao/spark-30498.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-01-14 17:24:17 +08:00
Huaxin Gao f77dcfc55a [SPARK-30351][ML][PYSPARK] BisectingKMeans support instance weighting
### What changes were proposed in this pull request?
add weight support in BisectingKMeans

### Why are the changes needed?
BisectingKMeans should support instance weighting

### Does this PR introduce any user-facing change?
Yes. BisectingKMeans.setWeight

### How was this patch tested?
Unit test

Closes #27035 from huaxingao/spark_30351.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-01-13 08:24:49 -06:00
Huaxin Gao d6e28f2922 [SPARK-30377][ML] Make Regressors extend abstract class Regressor
### What changes were proposed in this pull request?
Make Regressors extend abstract class Regressor:

```AFTSurvivalRegression extends Estimator => extends Regressor```
```DecisionTreeRegressor extends Predictor => extends Regressor```
```FMRegressor extends Predictor => extends Regressor```
```GBTRegressor extends Predictor => extends Regressor```
```RandomForestRegressor extends Predictor => extends Regressor```

We will not make ```IsotonicRegression``` extend ```Regressor``` because it is tricky to handle both DoubleType and VectorType.

### Why are the changes needed?
Make class hierarchy consistent for all Regressors

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
existing tests

Closes #27168 from huaxingao/spark-30377.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-01-13 08:22:20 -06:00
zero323 6502c66025 [SPARK-30493][PYTHON][ML] Remove OneVsRestModel setClassifier, setLabelCol and setWeightCol methods
### What changes were proposed in this pull request?

Removal of `OneVsRestModel.setClassifier`, `OneVsRestModel.setLabelCol` and `OneVsRestModel.setWeightCol`  methods.

### Why are the changes needed?

Aforementioned methods shouldn't by included by [SPARK-29093](https://issues.apache.org/jira/browse/SPARK-29093), as they're not present in Scala `OneVsRestModel` and have no practical application.

### Does this PR introduce any user-facing change?

Not beyond scope of SPARK-29093].

### How was this patch tested?

Existing tests.

CC huaxingao zhengruifeng

Closes #27181 from zero323/SPARK-30493.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-01-13 19:03:32 +08:00
Huaxin Gao c88124a246 [SPARK-30452][ML][PYSPARK] Add predict and numFeatures in Python IsotonicRegressionModel
### What changes were proposed in this pull request?
Add ```predict``` and ```numFeatures``` in Python ```IsotonicRegressionModel```

### Why are the changes needed?
```IsotonicRegressionModel``` doesn't extend ```JavaPredictionModel```,  so it doesn't get ```predict``` and ```numFeatures``` from the super class.

### Does this PR introduce any user-facing change?
Yes. Python version of
```
IsotonicRegressionModel.predict
IsotonicRegressionModel.numFeatures
```

### How was this patch tested?
doctest

Closes #27122 from huaxingao/spark-30452.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-01-09 09:23:10 -06:00
WeichenXu 88542bc3d9 [SPARK-30154][ML] PySpark UDF to convert MLlib vectors to dense arrays
### What changes were proposed in this pull request?

PySpark UDF to convert MLlib vectors to dense arrays.
Example:
```
from pyspark.ml.functions import vector_to_array
df.select(vector_to_array(col("features"))
```

### Why are the changes needed?
If a PySpark user wants to convert MLlib sparse/dense vectors in a DataFrame into dense arrays, an efficient approach is to do that in JVM. However, it requires PySpark user to write Scala code and register it as a UDF. Often this is infeasible for a pure python project.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
UT.

Closes #26910 from WeichenXu123/vector_to_array.

Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: Xiangrui Meng <meng@databricks.com>
2020-01-06 16:18:51 -08:00
Huaxin Gao d32ed25f0d [SPARK-30144][ML][PYSPARK] Make MultilayerPerceptronClassificationModel extend MultilayerPerceptronParams
### What changes were proposed in this pull request?
Make ```MultilayerPerceptronClassificationModel``` extend ```MultilayerPerceptronParams```

### Why are the changes needed?
Make ```MultilayerPerceptronClassificationModel``` extend ```MultilayerPerceptronParams``` to expose the training params, so user can see these params when calling ```extractParamMap```

### Does this PR introduce any user-facing change?
Yes. The ```MultilayerPerceptronParams``` such as ```seed```, ```maxIter``` ... are available in ```MultilayerPerceptronClassificationModel``` now

### How was this patch tested?
Manually tested ```MultilayerPerceptronClassificationModel.extractParamMap()``` to verify all the new params are there.

Closes #26838 from huaxingao/spark-30144.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-01-03 12:01:11 -06:00
Huaxin Gao 6196c20ee0 [SPARK-30358][ML][PYSPARK][FOLLOWUP] ML expose predictRaw and predictProbability on Python side
### What changes were proposed in this pull request?
expose predictRaw and predictProbability on Python side

### Why are the changes needed?
to keep parity between scala and python

### Does this PR introduce any user-facing change?
Yes. Expose python ```predictRaw``` and ```predictProbability```

### How was this patch tested?
doctest

Closes #27082 from huaxingao/spark-30358.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-01-03 11:42:56 -06:00
Huaxin Gao 9ee8da298d [SPARK-30378][ML][PYSPARK] Add getter/setter in Python FM
### What changes were proposed in this pull request?
add getter/setter in Python FM

### Why are the changes needed?
to be consistent with other algorithms

### Does this PR introduce any user-facing change?
Yes.
add getter/setter in Python FMRegressor/FMRegressionModel/FMClassifier/FMClassificationModel

### How was this patch tested?
doctest

Closes #27044 from huaxingao/spark-30378.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2019-12-31 12:56:19 +08:00
zhengruifeng 9c046dc808 [SPARK-30102][ML][PYSPARK] GMM supports instance weighting
### What changes were proposed in this pull request?
supports instance weighting in GMM

### Why are the changes needed?
ML should support instance weighting

### Does this PR introduce any user-facing change?
yes, a new param `weightCol` is exposed

### How was this patch tested?
added testsuits

Closes #26735 from zhengruifeng/gmm_support_weight.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2019-12-27 13:32:57 +08:00
Huaxin Gao a3cf9c564e [SPARK-30247][PYSPARK][FOLLOWUP] Add Python class MultivariateGaussian
### What changes were proposed in this pull request?
add a corresponding class MultivariateGaussian containing a vector and a matrix on the py side, so gaussian can be used on the py side.

### Does this PR introduce any user-facing change?
add Python class ```MultivariateGaussian```

### How was this patch tested?
doctest

Closes #27020 from huaxingao/spark-30247.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2019-12-27 13:30:18 +08:00
zhanjf 8d3eed33ee [SPARK-29224][ML] Implement Factorization Machines as a ml-pipeline component
### What changes were proposed in this pull request?

Implement Factorization Machines as a ml-pipeline component

1. loss function supports: logloss, mse
2. optimizer: GD, adamW

### Why are the changes needed?

Factorization Machines is widely used in advertising and recommendation system to estimate CTR(click-through rate).
Advertising and recommendation system usually has a lot of data, so we need Spark to estimate the CTR, and Factorization Machines are common ml model to estimate CTR.
References:

1. S. Rendle, “Factorization machines,” in Proceedings of IEEE International Conference on Data Mining (ICDM), pp. 995–1000, 2010.
https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

run unit tests

Closes #27000 from mob-ai/ml/fm.

Authored-by: zhanjf <zhanjf@mob.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2019-12-26 11:39:53 -06:00
zhengruifeng 8f07839e74 [SPARK-30178][ML] RobustScaler support large numFeatures
### What changes were proposed in this pull request?
compute the medians/ranges more distributedly

### Why are the changes needed?
It is a bottleneck to collect the whole Array[QuantileSummaries] from executors,
since a QuantileSummaries is a large object, which maintains arrays of large sizes 10k(`defaultCompressThreshold`)/50k(`defaultHeadSize`).

In Spark-Shell with default params, I processed a dataset with numFeatures=69,200, and existing impl fail due to OOM.
After this PR, it will sucessfuly fit the model.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
existing testsuites

Closes #26803 from zhengruifeng/robust_high_dim.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2019-12-25 09:44:19 +08:00
Wenchen Fan ba3f6330dd Revert "[SPARK-29224][ML] Implement Factorization Machines as a ml-pipeline component"
This reverts commit c6ab7165dd.
2019-12-24 14:01:27 +08:00
zhanjf c6ab7165dd [SPARK-29224][ML] Implement Factorization Machines as a ml-pipeline component
### What changes were proposed in this pull request?

Implement Factorization Machines as a ml-pipeline component

1. loss function supports: logloss, mse
2. optimizer: GD, adamW

### Why are the changes needed?

Factorization Machines is widely used in advertising and recommendation system to estimate CTR(click-through rate).
Advertising and recommendation system usually has a lot of data, so we need Spark to estimate the CTR, and Factorization Machines are common ml model to estimate CTR.
References:

1. S. Rendle, “Factorization machines,” in Proceedings of IEEE International Conference on Data Mining (ICDM), pp. 995–1000, 2010.
https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

run unit tests

Closes #26124 from mob-ai/ml/fm.

Authored-by: zhanjf <zhanjf@mob.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2019-12-23 10:11:09 -06:00
Huaxin Gao 5ed72a1940 [SPARK-30247][PYSPARK] GaussianMixtureModel in py side should expose gaussian
### What changes were proposed in this pull request?
expose gaussian in PySpark
### Why are the changes needed?
A ```GaussianMixtureModel``` contains two parts of coefficients: ```weights``` & ```gaussians```. However, ```gaussians``` is not exposed on Python side.

### Does this PR introduce any user-facing change?
Yes. ```GaussianMixtureModel.gaussians``` is exposed in PySpark.

### How was this patch tested?
add doctest

Closes #26882 from huaxingao/spark-30247.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2019-12-16 18:15:40 -06:00