spark-instrumented-optimizer/python/pyspark/ml/tests
Phillip Henry 397b843890 [SPARK-34415][ML] Randomization in hyperparameter optimization
### What changes were proposed in this pull request?

Code in the PR generates random parameters for hyperparameter tuning. A discussion with Sean Owen can be found on the dev mailing list here:

http://apache-spark-developers-list.1001551.n3.nabble.com/Hyperparameter-Optimization-via-Randomization-td30629.html

All code is entirely my own work and I license the work to the project under the project’s open source license.

### Why are the changes needed?

Randomization can be a more effective techinique than a grid search since min/max points can fall between the grid and never be found. Randomisation is not so restricted although the probability of finding minima/maxima is dependent on the number of attempts.

Alice Zheng has an accessible description on how this technique works at https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html

Although there are Python libraries with more sophisticated techniques, not every Spark developer is using Python.

### Does this PR introduce _any_ user-facing change?

A new class (`ParamRandomBuilder.scala`) and its tests have been created but there is no change to existing code. This class offers an alternative to `ParamGridBuilder` and can be dropped into the code wherever `ParamGridBuilder` appears. Indeed, it extends `ParamGridBuilder` and is completely compatible with  its interface. It merely adds one method that provides a range over which a hyperparameter will be randomly defined.

### How was this patch tested?

Tests `ParamRandomBuilderSuite.scala` and `RandomRangesSuite.scala` were added.

`ParamRandomBuilderSuite` is the analogue of the already existing `ParamGridBuilderSuite` which tests the user-facing interface.

`RandomRangesSuite` uses ScalaCheck to test the random ranges over which hyperparameters are distributed.

Closes #31535 from PhillHenry/ParamRandomBuilder.

Authored-by: Phillip Henry <PhillHenry@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-02-27 08:34:39 -06:00
..
__init__.py [SPARK-26033][PYTHON][TESTS] Break large ml/tests.py file into smaller files 2018-11-18 16:02:15 +08:00
test_algorithms.py Spelling r common dev mlib external project streaming resource managers python 2020-11-27 10:22:45 -06:00
test_base.py [SPARK-32714][PYTHON] Initial pyspark-stubs port 2020-09-24 14:15:36 +09:00
test_evaluation.py [SPARK-32714][PYTHON] Initial pyspark-stubs port 2020-09-24 14:15:36 +09:00
test_feature.py [SPARK-22798][PYTHON][ML][FOLLOWUP] Add labelsArray to PySpark StringIndexer 2020-12-03 10:57:14 +09:00
test_image.py [SPARK-33613][PYTHON][TESTS] Replace deprecated APIs in pyspark tests 2020-12-01 10:34:40 +09:00
test_linalg.py [SPARK-32714][PYTHON] Initial pyspark-stubs port 2020-09-24 14:15:36 +09:00
test_param.py [SPARK-33613][PYTHON][TESTS] Replace deprecated APIs in pyspark tests 2020-12-01 10:34:40 +09:00
test_persistence.py [SPARK-33520][ML][PYSPARK] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/evaluator 2020-12-04 08:35:50 +08:00
test_pipeline.py [SPARK-32714][PYTHON] Initial pyspark-stubs port 2020-09-24 14:15:36 +09:00
test_stat.py [SPARK-32714][PYTHON] Initial pyspark-stubs port 2020-09-24 14:15:36 +09:00
test_training_summary.py [SPARK-32714][PYTHON] Initial pyspark-stubs port 2020-09-24 14:15:36 +09:00
test_tuning.py [SPARK-34415][ML] Randomization in hyperparameter optimization 2021-02-27 08:34:39 -06:00
test_util.py [SPARK-33592] Fix: Pyspark ML Validator params in estimatorParamMaps may be lost after saving and reloading 2020-12-01 09:36:42 +08:00
test_wrapper.py [SPARK-33613][PYTHON][TESTS] Replace deprecated APIs in pyspark tests 2020-12-01 10:34:40 +09:00