## What changes were proposed in this pull request?
The changes proposed were to add train-validation-split to pyspark.ml.tuning.
## How was the this patch tested?
This patch was tested through unit tests located in pyspark/ml/test.py.
This is my original work and I license it to Spark.
Author: JeremyNixon <jnixon2@gmail.com>
Closes#11335 from JeremyNixon/tvs_pyspark.
The current python ml params require cut-and-pasting the param setup and description between the class & ```__init__``` methods. Remove this possible case of errors & simplify use of custom params by adding a ```_copy_new_parent``` method to param so as to avoid cut and pasting (and cut and pasting at different indentation levels urgh).
Author: Holden Karau <holden@us.ibm.com>
Closes#10216 from holdenk/SPARK-10509-excessive-param-boiler-plate-code.
Extend CrossValidator with HasSeed in PySpark.
This PR replaces [https://github.com/apache/spark/pull/7997]
CC: yanboliang thunterdb mmenestret Would one of you mind taking a look? Thanks!
Author: Joseph K. Bradley <joseph@databricks.com>
Author: Martin MENESTRET <mmenestret@ippon.fr>
Closes#10268 from jkbradley/pyspark-cv-seed.
* Added isLargerBetter() method to Pyspark Evaluator to match the Scala version.
* JavaEvaluator delegates isLargerBetter() to underlying Scala object.
* Added check for isLargerBetter() in CrossValidator to determine whether to use argmin or argmax.
* Added test cases for where smaller is better (RMSE) and larger is better (R-Squared).
(This contribution is my original work and that I license the work to the project under Sparks' open source license)
Author: noelsmith <mail@noelsmith.com>
Closes#8399 from noel-smith/pyspark-rmse-xval-fix.
The new test uses CV to compare `maxIter=0` and `maxIter=1`, and validate on the evaluation result. jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#6572 from mengxr/SPARK-7432 and squashes the following commits:
c236bb8 [Xiangrui Meng] fix flacky cv doctest
This PR makes pipeline stages in Python copyable and hence simplifies some implementations. It also includes the following changes:
1. Rename `paramMap` and `defaultParamMap` to `_paramMap` and `_defaultParamMap`, respectively.
2. Accept a list of param maps in `fit`.
3. Use parent uid and name to identify param.
jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#6088 from mengxr/SPARK-7380 and squashes the following commits:
413c463 [Xiangrui Meng] remove unnecessary doc
4159f35 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7380
611c719 [Xiangrui Meng] fix python style
68862b8 [Xiangrui Meng] update _java_obj initialization
927ad19 [Xiangrui Meng] fix ml/tests.py
0138fc3 [Xiangrui Meng] update feature transformers and fix a bug in RegexTokenizer
9ca44fb [Xiangrui Meng] simplify Java wrappers and add tests
c7d84ef [Xiangrui Meng] update ml/tests.py to test copy params
7e0d27f [Xiangrui Meng] merge master
46840fb [Xiangrui Meng] update wrappers
b6db1ed [Xiangrui Meng] update all self.paramMap to self._paramMap
46cb6ed [Xiangrui Meng] merge master
a163413 [Xiangrui Meng] fix style
1042e80 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7380
9630eae [Xiangrui Meng] fix Identifiable._randomUID
13bd70a [Xiangrui Meng] update ml/tests.py
64a536c [Xiangrui Meng] use _fit/_transform/_evaluate to simplify the impl
02abf13 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into copyable-python
66ce18c [Joseph K. Bradley] some cleanups before sending to Xiangrui
7431272 [Joseph K. Bradley] Rebased with master
Fixes bug with PySpark cvModel not having UID
Also made small PySpark fixes: Evaluator should inherit from Params. MockModel should inherit from Model.
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#5968 from jkbradley/pyspark-cv-uid and squashes the following commits:
57f13cd [Joseph K. Bradley] Made CrossValidatorModel call parent init in PySpark
Temporarily disable flaky doctest for CrossValidator. jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#5962 from mengxr/disable-pyspark-cv-test and squashes the following commits:
5db7e5b [Xiangrui Meng] disable cv doctest
Since CrossValidator is a meta algorithm, we copy the implementation in Python. jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#5926 from mengxr/SPARK-6940 and squashes the following commits:
6af181f [Xiangrui Meng] add TODOs
8285134 [Xiangrui Meng] update doc
060f7c3 [Xiangrui Meng] update doctest
acac727 [Xiangrui Meng] add keyword args
cdddecd [Xiangrui Meng] add CrossValidator in Python
as suggested by justinuang on #5601.
Author: Xiangrui Meng <meng@databricks.com>
Closes#5873 from mengxr/SPARK-7329 and squashes the following commits:
d08f9cf [Xiangrui Meng] simplify tests
b7a7b9b [Xiangrui Meng] simplify grid build
Author: Omede Firouz <ofirouz@palantir.com>
Author: Omede <omedefirouz@gmail.com>
Closes#5601 from oefirouz/paramgrid and squashes the following commits:
c9e2481 [Omede Firouz] Make test a doctest
9a8ce22 [Omede] Fix linter issues
8b8a6d2 [Omede Firouz] [SPARK-7022][PySpark][ML] Add ML.Tuning.ParamGridBuilder to PySpark