---
layout: global
title: ML Tuning
displayTitle: "ML Tuning: model selection and hyperparameter tuning"
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements. See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License. You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---
\[ \newcommand{\R}{\mathbb{R}} \newcommand{\E}{\mathbb{E}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\wv}{\mathbf{w}} \newcommand{\av}{\mathbf{\alpha}} \newcommand{\bv}{\mathbf{b}} \newcommand{\N}{\mathbb{N}} \newcommand{\id}{\mathbf{I}} \newcommand{\ind}{\mathbf{1}} \newcommand{\0}{\mathbf{0}} \newcommand{\unit}{\mathbf{e}} \newcommand{\one}{\mathbf{1}} \newcommand{\zero}{\mathbf{0}} \]
This section describes how to use MLlib's tooling for tuning ML algorithms and Pipelines. Built-in Cross-Validation and other tooling allow users to optimize hyperparameters in algorithms and Pipelines.
**Table of contents**

- This will become a table of contents (this text will be scraped).
{:toc}
# Model selection (a.k.a. hyperparameter tuning)
An important task in ML is model selection, or using data to find the best model or parameters for a given task. This is also called tuning.
Tuning may be done for individual `Estimator`s such as `LogisticRegression`, or for entire `Pipeline`s which include multiple algorithms, featurization, and other steps. Users can tune an entire `Pipeline` at once, rather than tuning each element in the `Pipeline` separately.

MLlib supports model selection using tools such as `CrossValidator` and `TrainValidationSplit`.
These tools require the following items:
- `Estimator`: algorithm or `Pipeline` to tune
- Set of `ParamMap`s: parameters to choose from, sometimes called a "parameter grid" to search over
- `Evaluator`: metric to measure how well a fitted `Model` does on held-out test data
At a high level, these model selection tools work as follows:
- They split the input data into separate training and test datasets.
- For each (training, test) pair, they iterate through the set of `ParamMap`s:
  - For each `ParamMap`, they fit the `Estimator` using those parameters, get the fitted `Model`, and evaluate the `Model`'s performance using the `Evaluator`.
- They select the `Model` produced by the best-performing set of parameters.
The `Evaluator` can be a `RegressionEvaluator` for regression problems, a `BinaryClassificationEvaluator` for binary data, a `MulticlassClassificationEvaluator` for multiclass problems, a `MultilabelClassificationEvaluator` for multi-label classifications, or a `RankingEvaluator` for ranking problems. The default metric used to choose the best `ParamMap` can be overridden by the `setMetricName` method in each of these evaluators.
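For instance, the following minimal sketch (an illustration of the evaluator API rather than part of the linked examples) overrides the default metric on two evaluators:

```scala
import org.apache.spark.ml.evaluation.{MulticlassClassificationEvaluator, RegressionEvaluator}

// For a regression problem, report mean absolute error instead of the default RMSE.
val regressionEvaluator = new RegressionEvaluator()
  .setMetricName("mae")

// For a multiclass problem, report accuracy instead of the default F1 score.
val multiclassEvaluator = new MulticlassClassificationEvaluator()
  .setMetricName("accuracy")
```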
To help construct the parameter grid, users can use the `ParamGridBuilder` utility (see the Cross-Validation section below for an example). By default, sets of parameters from the parameter grid are evaluated in serial. Parameter evaluation can be done in parallel by setting `parallelism` to a value of 2 or more (a value of 1 will be serial) before running model selection with `CrossValidator` or `TrainValidationSplit`. The value of `parallelism` should be chosen carefully to maximize parallelism without exceeding cluster resources; larger values may not always lead to improved performance. Generally speaking, a value up to 10 should be sufficient for most clusters.
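As a rough illustration (a minimal sketch with illustrative parameter values, not one of the full examples referenced below), a parameter grid and a `parallelism` setting might be combined as follows:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression()

// Every combination of these values becomes one ParamMap: 2 x 3 = 6 candidate settings.
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
  .build()

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)
  .setParallelism(4)  // evaluate up to 4 parameter settings at a time
```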
Alternatively, users can use the `ParamRandomBuilder` utility. This has the same properties as `ParamGridBuilder` mentioned above, but hyperparameters are chosen at random within a user-defined range. The mathematical principle behind this is that, given enough samples, the probability that every sample misses a region around the optimum tends to zero. Irrespective of the machine learning model, roughly 60 random samples give about a 95% probability that at least one of them falls within 5% of the optimum, since $1 - 0.95^{60} \approx 0.95$. If this 5% volume lies between the points defined in a grid search, it will never be found by `ParamGridBuilder`.
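A minimal sketch of the idea is below. It assumes the `addRandom`/`addLog10Random` methods and the `Limits` range type from the `ParamRandomBuilder` API; see the full examples that follow for definitive usage.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.tuning.ParamRandomBuilder
import org.apache.spark.ml.tuning.RandomRanges._

val lr = new LogisticRegression()

// Draw 5 values for regParam log-uniformly from [1e-6, 1e-1] and 5 values for
// elasticNetParam uniformly from [0.0, 1.0]; build() then combines them into ParamMaps.
val params = new ParamRandomBuilder()
  .addLog10Random(lr.regParam, Limits(1e-6, 1e-1), 5)
  .addRandom(lr.elasticNetParam, Limits(0.0, 1.0), 5)
  .build()
```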
Refer to the `ParamRandomBuilder` Scala docs for details on the API.
{% include_example scala/org/apache/spark/examples/ml/ModelSelectionViaRandomHyperparametersExample.scala %}
Refer to the `ParamRandomBuilder` Java docs for details on the API.
{% include_example java/org/apache/spark/examples/ml/JavaModelSelectionViaRandomHyperparametersExample.java %}
Python users are recommended to look at Python libraries that are specifically designed for hyperparameter tuning, such as Hyperopt.

Refer to the `ParamRandomBuilder` Python docs for details on the API.
{% include_example python/ml/model_selection_random_hyperparameters_example.py %}
# Cross-Validation
`CrossValidator` begins by splitting the dataset into a set of folds which are used as separate training and test datasets. E.g., with $k=3$ folds, `CrossValidator` will generate 3 (training, test) dataset pairs, each of which uses 2/3 of the data for training and 1/3 for testing. To evaluate a particular `ParamMap`, `CrossValidator` computes the average evaluation metric for the 3 `Model`s produced by fitting the `Estimator` on the 3 different (training, test) dataset pairs.

After identifying the best `ParamMap`, `CrossValidator` finally re-fits the `Estimator` using the best `ParamMap` and the entire dataset.
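In code terms (a minimal sketch in the form of a hypothetical helper, with the configured `CrossValidator` and training data supplied by the caller), fitting returns a `CrossValidatorModel` that already wraps the re-fitted best model:

```scala
import org.apache.spark.ml.tuning.{CrossValidator, CrossValidatorModel}
import org.apache.spark.sql.DataFrame

def fitAndInspect(cv: CrossValidator, training: DataFrame): CrossValidatorModel = {
  val cvModel = cv.fit(training)
  // One averaged metric per ParamMap, in the same order as the parameter grid.
  println(cvModel.avgMetrics.mkString(", "))
  // The Estimator re-fitted on the entire dataset with the best-performing ParamMap.
  println(cvModel.bestModel.explainParams())
  cvModel
}
```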
## Examples: model selection via cross-validation
The following example demonstrates using `CrossValidator` to select from a grid of parameters.

Note that cross-validation over a grid of parameters is expensive. E.g., in the example below, the parameter grid has 3 values for `hashingTF.numFeatures` and 2 values for `lr.regParam`, and `CrossValidator` uses 2 folds. This multiplies out to $(3 \times 2) \times 2 = 12$ different models being trained. In realistic settings, it can be common to try many more parameters and use more folds ($k=3$ and $k=10$ are common). In other words, using `CrossValidator` can be very expensive. However, it is also a well-established method for choosing parameters which is more statistically sound than heuristic hand-tuning.
Refer to the `CrossValidator` Scala docs for details on the API.
{% include_example scala/org/apache/spark/examples/ml/ModelSelectionViaCrossValidationExample.scala %}
Refer to the `CrossValidator` Java docs for details on the API.
{% include_example java/org/apache/spark/examples/ml/JavaModelSelectionViaCrossValidationExample.java %}
Refer to the `CrossValidator` Python docs for more details on the API.
{% include_example python/ml/cross_validator.py %}
# Train-Validation Split
In addition to `CrossValidator`, Spark also offers `TrainValidationSplit` for hyper-parameter tuning. `TrainValidationSplit` only evaluates each combination of parameters once, as opposed to k times in the case of `CrossValidator`. It is, therefore, less expensive, but will not produce as reliable results when the training dataset is not sufficiently large.

Unlike `CrossValidator`, `TrainValidationSplit` creates a single (training, test) dataset pair. It splits the dataset into these two parts using the `trainRatio` parameter. For example with $trainRatio=0.75$, `TrainValidationSplit` will generate a training and test dataset pair where 75% of the data is used for training and 25% for validation.

Like `CrossValidator`, `TrainValidationSplit` finally fits the `Estimator` using the best `ParamMap` and the entire dataset.
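A minimal sketch of the split is below; the estimator, evaluator, and grid here are illustrative placeholders rather than those of the linked examples.

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}

val lr = new LinearRegression()

val tvs = new TrainValidationSplit()
  .setEstimator(lr)
  .setEvaluator(new RegressionEvaluator())
  .setEstimatorParamMaps(new ParamGridBuilder()
    .addGrid(lr.regParam, Array(0.1, 0.01))
    .build())
  .setTrainRatio(0.75)  // 75% of the data for training, 25% for validation
```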
## Examples: model selection via train validation split
Refer to the `TrainValidationSplit` Scala docs for details on the API.
{% include_example scala/org/apache/spark/examples/ml/ModelSelectionViaTrainValidationSplitExample.scala %}
Refer to the `TrainValidationSplit` Java docs for details on the API.
{% include_example java/org/apache/spark/examples/ml/JavaModelSelectionViaTrainValidationSplitExample.java %}
Refer to the `TrainValidationSplit` Python docs for more details on the API.
{% include_example python/ml/train_validation_split.py %}