spark-instrumented-optimizer/dev
Weichen Xu 80161238fe [SPARK-33592] Fix: Pyspark ML Validator params in estimatorParamMaps may be lost after saving and reloading
### What changes were proposed in this pull request?
Fix: Pyspark ML Validator params in estimatorParamMaps may be lost after saving and reloading

When saving validator estimatorParamMaps, will check all nested stages in tuned estimator to get correct param parent.

Two typical cases to manually test:
~~~python
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression()
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [10, 100]) \
    .addGrid(lr.maxIter, [100, 200]) \
    .build()
tvs = TrainValidationSplit(estimator=pipeline,
                           estimatorParamMaps=paramGrid,
                           evaluator=MulticlassClassificationEvaluator())

tvs.save(tvsPath)
loadedTvs = TrainValidationSplit.load(tvsPath)

# check `loadedTvs.getEstimatorParamMaps()` restored correctly.
~~~

~~~python
lr = LogisticRegression()
ova = OneVsRest(classifier=lr)
grid = ParamGridBuilder().addGrid(lr.maxIter, [100, 200]).build()
evaluator = MulticlassClassificationEvaluator()
tvs = TrainValidationSplit(estimator=ova, estimatorParamMaps=grid, evaluator=evaluator)

tvs.save(tvsPath)
loadedTvs = TrainValidationSplit.load(tvsPath)

# check `loadedTvs.getEstimatorParamMaps()` restored correctly.
~~~

### Why are the changes needed?
Bug fix.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit test.

Closes #30539 from WeichenXu123/fix_tuning_param_maps_io.

Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
2020-12-01 09:36:42 +08:00
..
create-release Spelling r common dev mlib external project streaming resource managers python 2020-11-27 10:22:45 -06:00
deps [SPARK-33525][SQL] Update hive-service-rpc to 3.1.2 2020-11-25 12:37:59 -08:00
sparktestsupport [SPARK-33592] Fix: Pyspark ML Validator params in estimatorParamMaps may be lost after saving and reloading 2020-12-01 09:36:42 +08:00
tests Spelling r common dev mlib external project streaming resource managers python 2020-11-27 10:22:45 -06:00
.gitignore [SPARK-23174][BUILD][PYTHON][FOLLOWUP] Add pycodestyle*.py to .gitignore file. 2018-01-31 00:51:00 +09:00
.rat-excludes [SPARK-33483][INFRA][TESTS] Fix rat exclusion patterns and add a LICENSE 2020-11-18 23:59:11 -08:00
.scalafmt.conf [SPARK-26177] Config change followup to [] Automated formatting for Scala code 2018-12-03 10:03:51 -06:00
appveyor-guide.md Spelling r common dev mlib external project streaming resource managers python 2020-11-27 10:22:45 -06:00
appveyor-install-dependencies.ps1 [SPARK-33105][INFRA] Change default R arch from i386 to x64 and parametrize BINPREF 2020-10-10 13:48:26 +09:00
change-scala-version.sh [SPARK-30012][CORE][SQL] Change classes extending scala collection classes to work with 2.13 2019-12-03 08:59:43 -08:00
check-license [MINOR][INFRA] Suppress warning in check-license 2020-11-23 10:38:40 +09:00
checkstyle-suppressions.xml [SPARK-29674][CORE] Update dropwizard metrics to 4.1.x for JDK 9+ 2019-11-03 15:13:06 -08:00
checkstyle.xml [MINOR] Fix google style guide address 2019-12-12 11:04:01 -06:00
github_jira_sync.py Spelling r common dev mlib external project streaming resource managers python 2020-11-27 10:22:45 -06:00
lint-java [SPARK-23063][K8S] K8s changes for publishing scripts (and a couple of other misses) 2018-01-13 21:34:28 -08:00
lint-python [SPARK-33243][PYTHON][BUILD] Add numpydoc into documentation dependency 2020-10-27 14:03:57 +09:00
lint-r [SPARK-29932][R][TESTS] lint-r should do non-zero exit in case of errors 2019-11-17 10:09:46 -08:00
lint-r.R [MINOR][R] small tidying of sh scripts for R 2020-04-30 16:58:05 -07:00
lint-scala [SPARK-27158][BUILD] dev/mima and dev/scalastyle support dynamic profiles 2019-03-15 08:20:42 +09:00
make-distribution.sh [SPARK-31041][BUILD] Show Maven errors from within make-distribution.sh 2020-03-11 08:22:02 -05:00
merge_spark_pr.py [MINOR] Fix usage print to guide pip3 to install jira-python library 2020-09-03 01:10:59 +09:00
mima [SPARK-33510][BUILD] Update SBT to 1.4.4 2020-11-22 22:56:59 -08:00
pip-sanity-check.py [SPARK-32319][PYSPARK] Disallow the use of unused imports 2020-08-08 08:51:57 -07:00
README.md Merge pull request #565 from pwendell/dev-scripts. Closes #565. 2014-02-08 23:13:34 -08:00
requirements.txt [SPARK-33243][PYTHON][BUILD] Add numpydoc into documentation dependency 2020-10-27 14:03:57 +09:00
run-pip-tests [SPARK-32419][PYTHON][BUILD] Avoid using subshell for Conda env (de)activation in pip packaging test 2020-07-25 13:09:23 +09:00
run-tests [SPARK-29672][PYSPARK] update spark testing framework to use python3 2019-11-14 10:18:55 -08:00
run-tests-jenkins [SPARK-33535][INFRA][TESTS] Export LANG to en_US.UTF-8 in run-tests-jenkins script 2020-11-24 09:50:10 -08:00
run-tests-jenkins.py Spelling r common dev mlib external project streaming resource managers python 2020-11-27 10:22:45 -06:00
run-tests.py Spelling r common dev mlib external project streaming resource managers python 2020-11-27 10:22:45 -06:00
sbt-checkstyle [SPARK-27158][BUILD] dev/mima and dev/scalastyle support dynamic profiles 2019-03-15 08:20:42 +09:00
scalafmt [SPARK-30570][BUILD] Update scalafmt plugin to 1.0.3 with onlyChangedFiles feature 2020-01-23 12:44:43 -08:00
scalastyle Revert "[SPARK-30534][INFRA] Use mvn in dev/scalastyle" 2020-01-21 18:23:03 +09:00
test-dependencies.sh [SPARK-20202][BUILD][SQL] Remove references to org.spark-project.hive (Hive 1.2.1) 2020-10-05 15:29:56 -07:00
tox.ini [SPARK-32714][PYTHON] Initial pyspark-stubs port 2020-09-24 14:15:36 +09:00

Spark Developer Scripts

This directory contains scripts useful to developers when packaging, testing, or committing to Spark.

Many of these scripts require Apache credentials to work correctly.