spark-instrumented-optimizer/python/pyspark/ml
Weichen Xu 4a21c4cc92 [SPARK-31497][ML][PYSPARK] Fix Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot save and load model
### What changes were proposed in this pull request?
Fix the failure to save and load PySpark CrossValidator/TrainValidationSplit models when the estimator is a Pipeline.

Most PySpark estimators/transformers inherit from `JavaParams`, but some estimators are special (in order to support nested estimators/transformers implemented in pure Python):
* Pipeline
* OneVsRest
* CrossValidator
* TrainValidationSplit

Note, however, that in PySpark the model readers/writers of the estimators listed above currently do NOT support nested estimators/transformers implemented in pure Python, because they use the Java reader/writer wrapper as the Python-side reader/writer.

The PySpark CrossValidator/TrainValidationSplit model reader/writer requires every estimator to define `_transfer_param_map_to_java` and `_transfer_param_map_from_java` (used in model read/write).

The OneVsRest class already defines these two methods, but Pipeline does not, which leads to this bug.

This PR adds `_transfer_param_map_to_java` and `_transfer_param_map_from_java` to the Pipeline class.
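
The requirement on the two methods can be illustrated with a small sketch that does not depend on PySpark. `WrappedEstimator`, `BareEstimator`, and `missing_hooks` are hypothetical stand-ins invented for illustration; they are not the PySpark classes or the code from this PR. The sketch only shows why an estimator that lacks the two methods cannot be handled by a reader/writer that calls them.

```python
# Names of the two methods the tuning model reader/writer expects on every estimator.
REQUIRED_HOOKS = ("_transfer_param_map_to_java", "_transfer_param_map_from_java")


class WrappedEstimator:
    """Hypothetical stand-in for an estimator that defines both methods."""

    def _transfer_param_map_to_java(self, py_param_map):
        # A real implementation would build a JVM ParamMap; here we just copy.
        return dict(py_param_map)

    def _transfer_param_map_from_java(self, java_param_map):
        # A real implementation would convert a JVM ParamMap back to Python.
        return dict(java_param_map)


class BareEstimator:
    """Hypothetical stand-in for an estimator that defines neither method."""


def missing_hooks(estimator):
    """Return the names of required methods that the estimator does not define."""
    return [
        name
        for name in REQUIRED_HOOKS
        if not callable(getattr(estimator, name, None))
    ]


print(missing_hooks(WrappedEstimator()))
print(missing_hooks(BareEstimator()))
```

An estimator in the position of `BareEstimator` fails as soon as the reader/writer tries to call one of the missing methods, which matches the situation described for Pipeline before this change.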

### Why are the changes needed?
Bug fix.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Unit test.

Manual test in the pyspark shell:
1) CrossValidator with a simple Pipeline estimator
```
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import CrossValidator, CrossValidatorModel, ParamGridBuilder

training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0),
    (4, "b spark who", 1.0),
    (5, "g d a y", 0.0),
    (6, "spark fly", 1.0),
    (7, "was mapreduce", 0.0),
], ["id", "text", "label"])

# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .build()
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=2)  # use 3+ folds in practice

# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(training)

cvModel.save('/tmp/cv_model001')
CrossValidatorModel.load('/tmp/cv_model001')
```

2) CrossValidator with a Pipeline estimator that includes a OneVsRest estimator stage, where the OneVsRest estimator nests a LogisticRegression estimator.

```
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression, OneVsRest
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.ml.tuning import CrossValidator, CrossValidatorModel, ParamGridBuilder

dataset = spark.createDataFrame(
    [(Vectors.dense([0.0]), 0.0),
     (Vectors.dense([0.4]), 1.0),
     (Vectors.dense([0.5]), 0.0),
     (Vectors.dense([0.6]), 1.0),
     (Vectors.dense([1.0]), 1.0)] * 10,
    ["features", "label"])

ova = OneVsRest(classifier=LogisticRegression())
lr1 = LogisticRegression().setMaxIter(100)
lr2 = LogisticRegression().setMaxIter(150)
grid = ParamGridBuilder().addGrid(ova.classifier, [lr1, lr2]).build()
evaluator = MulticlassClassificationEvaluator()

pipeline = Pipeline(stages=[ova])

cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset)
cvModel.save('/tmp/model002')

cvModel2 = CrossValidatorModel.load('/tmp/model002')
```

The TrainValidationSplit testing code is similar, so it is not pasted here.
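
For reference, a TrainValidationSplit round trip analogous to case 2 might look like the sketch below. This is a reconstruction by analogy and not the code that was actually run for this PR. The helper name `roundtrip_train_validation_split` and its `spark` and `path` arguments are assumptions made for illustration; the PySpark imports are placed inside the function so that defining it does not require a Spark installation.

```python
def roundtrip_train_validation_split(spark, path):
    """Fit, save, and reload a TrainValidationSplit model whose estimator is a Pipeline.

    `spark` is an active SparkSession; `path` must not exist yet, because
    `save` refuses to overwrite an existing directory.
    """
    # Local imports: only needed when the helper is actually called.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression, OneVsRest
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from pyspark.ml.linalg import Vectors
    from pyspark.ml.tuning import (
        ParamGridBuilder,
        TrainValidationSplit,
        TrainValidationSplitModel,
    )

    # Same toy dataset as the CrossValidator example above.
    dataset = spark.createDataFrame(
        [(Vectors.dense([0.0]), 0.0),
         (Vectors.dense([0.4]), 1.0),
         (Vectors.dense([0.5]), 0.0),
         (Vectors.dense([0.6]), 1.0),
         (Vectors.dense([1.0]), 1.0)] * 10,
        ["features", "label"])

    # OneVsRest nests a LogisticRegression; the grid swaps the nested classifier.
    ova = OneVsRest(classifier=LogisticRegression())
    lr1 = LogisticRegression().setMaxIter(100)
    lr2 = LogisticRegression().setMaxIter(150)
    grid = ParamGridBuilder().addGrid(ova.classifier, [lr1, lr2]).build()

    tvs = TrainValidationSplit(
        estimator=Pipeline(stages=[ova]),
        estimatorParamMaps=grid,
        evaluator=MulticlassClassificationEvaluator())

    tvs_model = tvs.fit(dataset)
    tvs_model.save(path)
    return TrainValidationSplitModel.load(path)
```

In the pyspark shell this could be called as `roundtrip_train_validation_split(spark, '/tmp/tvs_model001')`, where the path is only an example.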

Closes #28279 from WeichenXu123/fix_pipeline_tuning.

Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Xiangrui Meng <meng@databricks.com>
2020-04-26 21:04:14 -07:00
linalg [SPARK-28206][PYTHON] Remove the legacy Epydoc in PySpark API documentation 2019-07-05 10:08:22 -07:00
param [SPARK-30930][ML] Remove ML/MLLIB DeveloperApi annotations 2020-03-16 12:41:22 -05:00
tests [SPARK-31497][ML][PYSPARK] Fix Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot save and load model 2020-04-26 21:04:14 -07:00
__init__.py [SPARK-29212][ML][PYSPARK] Add common classes without using JVM backend 2020-03-04 12:20:02 +08:00
base.py [SPARK-30930][ML] Remove ML/MLLIB DeveloperApi annotations 2020-03-16 12:41:22 -05:00
classification.py [SPARK-29212][ML][PYSPARK] Add common classes without using JVM backend 2020-03-04 12:20:02 +08:00
clustering.py [SPARK-30770][ML] avoid vector conversion in GMM.transform 2020-03-04 11:02:27 +08:00
common.py [SPARK-17679] [PYSPARK] remove unnecessary Py4J ListConverter patch 2016-10-03 14:12:03 -07:00
evaluation.py [SPARK-31012][ML][PYSPARK][DOCS] Updating ML API docs for 3.0 changes 2020-03-07 11:42:05 -06:00
feature.py [SPARK-31012][ML][PYSPARK][DOCS] Updating ML API docs for 3.0 changes 2020-03-07 11:42:05 -06:00
fpm.py [SPARK-29867][ML][PYTHON] Add __repr__ in Python ML Models 2019-11-15 21:44:39 -08:00
functions.py [SPARK-30859][PYSPARK][DOCS][MINOR] Fixed docstring syntax issues preventing proper compilation of documentation 2020-02-18 16:46:45 +09:00
image.py [SPARK-25382][SQL][PYSPARK] Remove ImageSchema.readImages in 3.0 2019-07-31 14:26:18 +09:00
pipeline.py [SPARK-31497][ML][PYSPARK] Fix Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot save and load model 2020-04-26 21:04:14 -07:00
recommendation.py [SPARK-30662][ML][PYSPARK] Put back the API changes for HasBlockSize in ALS/MLP 2020-02-09 13:14:30 +08:00
regression.py [SPARK-29212][ML][PYSPARK] Add common classes without using JVM backend 2020-03-04 12:20:02 +08:00
stat.py [SPARK-31243][ML][PYSPARK] Add ANOVATest and FValueTest to PySpark 2020-03-27 14:05:49 +08:00
tree.py [SPARK-30543][ML][PYSPARK][R] RandomForest add Param bootstrap to control sampling method 2020-01-23 16:44:13 +08:00
tuning.py [SPARK-30498][ML][PYSPARK] Fix some ml parity issues between python and scala 2020-01-14 17:24:17 +08:00
util.py [SPARK-30930][ML] Remove ML/MLLIB DeveloperApi annotations 2020-03-16 12:41:22 -05:00
wrapper.py [SPARK-29212][ML][PYSPARK] Add common classes without using JVM backend 2020-03-04 12:20:02 +08:00