spark-instrumented-optimizer/python/pyspark/ml
Phillip Henry 397b843890 [SPARK-34415][ML] Randomization in hyperparameter optimization
### What changes were proposed in this pull request?

Code in the PR generates random parameters for hyperparameter tuning. A discussion with Sean Owen can be found on the dev mailing list here:

http://apache-spark-developers-list.1001551.n3.nabble.com/Hyperparameter-Optimization-via-Randomization-td30629.html

All code is entirely my own work and I license the work to the project under the project’s open source license.

### Why are the changes needed?

Randomization can be a more effective technique than a grid search, since minima and maxima can fall between grid points and never be found. Randomization is not so restricted, although the probability of finding a minimum or maximum depends on the number of attempts.
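
To illustrate the point about optima falling between grid points, here is a minimal pure-Python sketch. It does not use Spark; the `objective` function, its minimum at 0.37, and the helper names are all hypothetical and chosen only for illustration.

```python
import random


def objective(x):
    # Hypothetical loss whose minimum (x = 0.37) lies between two grid points.
    return (x - 0.37) ** 2


def grid_candidates(lo, hi, n):
    # n evenly spaced points from lo to hi inclusive.
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def random_candidates(lo, hi, n, seed):
    # n points drawn uniformly at random from [lo, hi], seeded for repeatability.
    rng = random.Random(seed)
    return [rng.uniform(lo, hi) for _ in range(n)]


def best(candidates):
    # The candidate with the lowest objective value.
    return min(candidates, key=objective)


# Grid points are 0.0, 0.25, 0.5, 0.75, 1.0, so the grid cannot get closer than 0.25.
grid_best = best(grid_candidates(0.0, 1.0, 5))

# More random attempts make it more likely that one lands near the minimum.
random_best = best(random_candidates(0.0, 1.0, 50, seed=42))

print("grid:", grid_best, "random:", random_best)
```

The grid search is stuck with its nearest grid point, while the random search can land anywhere in the range; how close it gets depends on the number of draws.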

Alice Zheng has an accessible description on how this technique works at https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html

Although there are Python libraries with more sophisticated techniques, not every Spark developer is using Python.

### Does this PR introduce _any_ user-facing change?

A new class (`ParamRandomBuilder.scala`) and its tests have been created, but there is no change to existing code. This class offers an alternative to `ParamGridBuilder` and can be dropped into the code wherever `ParamGridBuilder` appears. Indeed, it extends `ParamGridBuilder` and is completely compatible with its interface. It merely adds one method that provides a range over which a hyperparameter will be randomly defined.
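
The "extend the grid builder and add one method" design can be sketched in plain Python as follows. This is not the code from the PR and does not use Spark: `GridBuilder`, `RandomBuilder`, `add_grid`, `add_random`, and the parameter names are hypothetical stand-ins, and the assumption that random values are simply fed into the inherited grid is mine.

```python
import itertools
import random


class GridBuilder:
    """Hypothetical stand-in for a grid builder: candidate values per parameter."""

    def __init__(self):
        self._grid = {}

    def add_grid(self, name, values):
        # Register an explicit list of candidate values for one parameter.
        self._grid[name] = list(values)
        return self

    def build(self):
        # Cartesian product of all registered values, one dict per combination.
        names = list(self._grid)
        combos = itertools.product(*(self._grid[n] for n in names))
        return [dict(zip(names, combo)) for combo in combos]


class RandomBuilder(GridBuilder):
    """Hypothetical subclass: same interface plus one method for random ranges."""

    def __init__(self, seed=None):
        super().__init__()
        self._rng = random.Random(seed)

    def add_random(self, name, lo, hi, n):
        # Draw n uniform values from [lo, hi] and reuse the inherited grid logic.
        return self.add_grid(name, [self._rng.uniform(lo, hi) for _ in range(n)])


# Existing add_grid calls keep working; only add_random is new.
param_maps = (
    RandomBuilder(seed=0)
    .add_grid("maxIter", [10, 100])
    .add_random("regParam", 0.0, 1.0, 3)
    .build()
)
```

Because the subclass only adds a method, any code written against the grid builder's interface accepts it unchanged, which is the drop-in property described above.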

### How was this patch tested?

Tests `ParamRandomBuilderSuite.scala` and `RandomRangesSuite.scala` were added.

`ParamRandomBuilderSuite` is the analogue of the already existing `ParamGridBuilderSuite` which tests the user-facing interface.

`RandomRangesSuite` uses ScalaCheck to test the random ranges over which hyperparameters are distributed.

Closes #31535 from PhillHenry/ParamRandomBuilder.

Authored-by: Phillip Henry <PhillHenry@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-02-27 08:34:39 -06:00
..
linalg [SPARK-33457][PYTHON] Adjust mypy configuration 2020-11-25 09:27:04 +09:00
param [SPARK-33592] Fix: Pyspark ML Validator params in estimatorParamMaps may be lost after saving and reloading 2020-12-01 09:36:42 +08:00
tests [SPARK-34415][ML] Randomization in hyperparameter optimization 2021-02-27 08:34:39 -06:00
__init__.py [SPARK-32319][PYSPARK] Disallow the use of unused imports 2020-08-08 08:51:57 -07:00
_typing.pyi [SPARK-32714][PYTHON] Initial pyspark-stubs port 2020-09-24 14:15:36 +09:00
base.py [SPARK-33251][PYTHON][DOCS] Migration to NumPy documentation style in ML (pyspark.ml.*) 2020-11-10 09:33:48 +09:00
base.pyi [SPARK-33251][PYTHON][DOCS] Migration to NumPy documentation style in ML (pyspark.ml.*) 2020-11-10 09:33:48 +09:00
classification.py [SPARK-33520][ML][PYSPARK] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/evaluator 2020-12-04 08:35:50 +08:00
classification.pyi [SPARK-33520][ML][PYSPARK] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/evaluator 2020-12-04 08:35:50 +08:00
clustering.py [SPARK-33730][PYTHON] Standardize warning types 2021-01-18 09:32:55 +09:00
clustering.pyi [SPARK-32714][PYTHON] Initial pyspark-stubs port 2020-09-24 14:15:36 +09:00
common.py [SPARK-32138] Drop Python 2.7, 3.4 and 3.5 2020-07-14 11:22:44 +09:00
common.pyi [SPARK-33457][PYTHON] Adjust mypy configuration 2020-11-25 09:27:04 +09:00
evaluation.py [SPARK-33251][PYTHON][DOCS] Migration to NumPy documentation style in ML (pyspark.ml.*) 2020-11-10 09:33:48 +09:00
evaluation.pyi [SPARK-33457][PYTHON] Adjust mypy configuration 2020-11-25 09:27:04 +09:00
feature.py [SPARK-34080][ML][PYTHON][FOLLOW-UP] Update score function in UnivariateFeatureSelector document 2021-02-10 09:24:25 +09:00
feature.pyi [SPARK-34080][ML][PYTHON] Add UnivariateFeatureSelector 2021-01-16 11:09:23 +08:00
fpm.py [SPARK-33251][FOLLOWUP][PYTHON][DOCS][MINOR] Adjusts returns PrefixSpan.findFrequentSequentialPatterns 2020-11-10 09:17:00 -08:00
fpm.pyi [SPARK-32714][PYTHON] Initial pyspark-stubs port 2020-09-24 14:15:36 +09:00
functions.py [SPARK-33556][ML] Add array_to_vector function for dataframe column 2020-12-01 09:52:19 +09:00
functions.pyi [SPARK-33556][ML] Add array_to_vector function for dataframe column 2020-12-01 09:52:19 +09:00
image.py [SPARK-33251][PYTHON][DOCS] Migration to NumPy documentation style in ML (pyspark.ml.*) 2020-11-10 09:33:48 +09:00
image.pyi [SPARK-32714][PYTHON] Initial pyspark-stubs port 2020-09-24 14:15:36 +09:00
pipeline.py [SPARK-33592] Fix: Pyspark ML Validator params in estimatorParamMaps may be lost after saving and reloading 2020-12-01 09:36:42 +08:00
pipeline.pyi [SPARK-33457][PYTHON] Adjust mypy configuration 2020-11-25 09:27:04 +09:00
recommendation.py [SPARK-33251][PYTHON][DOCS] Migration to NumPy documentation style in ML (pyspark.ml.*) 2020-11-10 09:33:48 +09:00
recommendation.pyi [SPARK-32714][PYTHON] Initial pyspark-stubs port 2020-09-24 14:15:36 +09:00
regression.py [SPARK-32320][PYSPARK] Remove mutable default arguments 2020-12-08 09:35:36 +08:00
regression.pyi Spelling r common dev mlib external project streaming resource managers python 2020-11-27 10:22:45 -06:00
stat.py [SPARK-34080][ML][PYTHON] Add UnivariateFeatureSelector 2021-01-16 11:09:23 +08:00
stat.pyi [SPARK-34080][ML][PYTHON] Add UnivariateFeatureSelector 2021-01-16 11:09:23 +08:00
tree.py [SPARK-34093][ML] param maxDepth should check upper bound 2021-01-18 11:36:10 -06:00
tree.pyi [SPARK-32714][PYTHON] Initial pyspark-stubs port 2020-09-24 14:15:36 +09:00
tuning.py [SPARK-34415][ML] Randomization in hyperparameter optimization 2021-02-27 08:34:39 -06:00
tuning.pyi [SPARK-34415][ML] Randomization in hyperparameter optimization 2021-02-27 08:34:39 -06:00
util.py [SPARK-33520][ML][PYSPARK] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/evaluator 2020-12-04 08:35:50 +08:00
util.pyi [SPARK-33520][ML][PYSPARK] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/evaluator 2020-12-04 08:35:50 +08:00
wrapper.py [SPARK-33251][PYTHON][DOCS] Migration to NumPy documentation style in ML (pyspark.ml.*) 2020-11-10 09:33:48 +09:00
wrapper.pyi [SPARK-32714][PYTHON] Initial pyspark-stubs port 2020-09-24 14:15:36 +09:00