[SPARK-19345][ML][DOC] Add doc for "coldStartStrategy" usage in ALS

[SPARK-14489](https://issues.apache.org/jira/browse/SPARK-14489) added the ability to skip `NaN` predictions during `ALSModel.transform`. This PR adds documentation for the `coldStartStrategy` param to the ALS user guide, and adds code to the examples to illustrate usage.

## How was this patch tested?

Doc and example change only. Built the HTML doc locally and verified that the example code builds and runs in the shell for Scala/Python.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #17102 from MLnick/SPARK-19345-coldstart-doc.

parent 50c08e82f0
commit 9cca3dbf4a

@@ -59,6 +59,34 @@ This approach is named "ALS-WR" and discussed in the paper
It makes `regParam` less dependent on the scale of the dataset, so we can apply the
best parameter learned from a sampled subset to the full dataset and expect similar performance.

### Cold-start strategy

When making predictions using an `ALSModel`, it is common to encounter users and/or items in the
test dataset that were not present when the model was trained. This typically occurs in two
scenarios:

1. In production, for new users or items that have no rating history and on which the model has not
   been trained (this is the "cold start problem").
2. During cross-validation, the data is split between training and evaluation sets. When using
   simple random splits as in Spark's `CrossValidator` or `TrainValidationSplit`, it is actually
   very common to encounter users and/or items in the evaluation set that are not in the training set.

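To make the second scenario concrete, here is a minimal sketch in plain Python (no Spark involved). The helper name `unseen_ids` and the ratings are hypothetical; the only point is that a row-level split can leave a user or item with no rows in the training set.

```python
def unseen_ids(training, evaluation, column):
    """Return ids that appear in `evaluation` but never in `training`.

    `column` is the tuple index to inspect: 0 for users, 1 for items.
    """
    seen = {row[column] for row in training}
    return {row[column] for row in evaluation} - seen


# Hypothetical ratings as (userId, movieId, rating); user 3 has a single rating.
ratings = [
    (1, 10, 4.0), (1, 11, 3.0),
    (2, 10, 5.0), (2, 12, 2.0),
    (3, 11, 1.0),
]

# One possible outcome of a row-level split: user 3's only rating lands in evaluation.
training, evaluation = ratings[:4], ratings[4:]

cold_users = unseen_ids(training, evaluation, column=0)
cold_items = unseen_ids(training, evaluation, column=1)
print(cold_users, cold_items)
```

In this made-up split, user 3 is "cold" at evaluation time even though movie 11 was seen during training, so a model fitted on `training` has no user factor for that row.
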
By default, Spark assigns `NaN` predictions during `ALSModel.transform` when a user and/or item
factor is not present in the model. This can be useful in a production system, since it indicates
a new user or item, and so the system can decide on a fallback to use as the prediction.

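One possible fallback is sketched below in plain Python, assuming the predictions have already been collected into a list of floats. The function name `with_fallback` and the choice of the mean training rating as the default are illustrative assumptions, not something Spark does for you.

```python
import math


def with_fallback(predictions, fallback):
    """Replace NaN predictions (unknown user or item) with `fallback`."""
    return [fallback if math.isnan(p) else p for p in predictions]


# Hypothetical values: the second prediction is for a user the model has never seen.
training_ratings = [4.0, 3.0, 5.0, 2.0]
mean_rating = sum(training_ratings) / len(training_ratings)

predictions = [4.2, float("nan"), 2.9]
filled = with_fallback(predictions, mean_rating)
print(filled)
```

Other fallbacks (for example, a popularity-based default) would slot into the same place; which one is appropriate depends on the application.
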
However, this is undesirable during cross-validation, since any `NaN` predicted values will result
in `NaN` results for the evaluation metric (for example, when using `RegressionEvaluator`).
This makes model selection impossible.

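The effect can be reproduced without Spark. The sketch below uses a hand-written `rmse` function (a stand-in for illustration, not `RegressionEvaluator` itself) and made-up predictions to show that a single `NaN` prediction turns the whole metric into `NaN`, and that `NaN` metrics cannot be ordered.

```python
import math


def rmse(labels, predictions):
    """Root-mean-square error over paired labels and predictions."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


labels = [4.0, 3.0, 5.0]

# Two hypothetical models; each has one NaN prediction for an unseen user.
model_a = rmse(labels, [3.9, float("nan"), 4.8])
model_b = rmse(labels, [1.0, float("nan"), 1.0])

print(math.isnan(model_a), math.isnan(model_b))
# Every ordering comparison with NaN is False, so neither model ranks above the other.
print(model_a < model_b, model_b < model_a)
```

Although the first set of predictions is clearly closer to the labels, both metrics are `NaN`, so a search over parameters has nothing to compare.
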
Spark allows users to set the `coldStartStrategy` parameter
to "drop" in order to drop any rows in the `DataFrame` of predictions that contain `NaN` values.
The evaluation metric will then be computed over the non-`NaN` data and will be valid.
Usage of this parameter is illustrated in the example below.

**Note:** Currently the supported cold start strategies are "nan" (the default behavior mentioned
above) and "drop". Further strategies may be supported in the future.

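Conceptually, the "drop" strategy amounts to filtering out rows with `NaN` predictions before the metric is computed. The following is a plain-Python sketch of that idea on made-up `(label, prediction)` pairs; the helper names are hypothetical and this is not Spark's implementation.

```python
import math


def drop_nan_rows(rows):
    """Keep only (label, prediction) rows whose prediction is not NaN."""
    return [(y, p) for y, p in rows if not math.isnan(p)]


def rmse(rows):
    """Root-mean-square error over (label, prediction) rows."""
    return math.sqrt(sum((y - p) ** 2 for y, p in rows) / len(rows))


# Hypothetical (label, prediction) rows; the middle row is for an unseen item.
rows = [(4.0, 3.0), (3.0, float("nan")), (5.0, 4.0)]

kept = drop_nan_rows(rows)
score = rmse(kept)
print(len(kept), score)
```

Note that the metric is then computed over fewer rows than the evaluation set contains, and in this sketch `rmse` would fail with a division by zero if every row were dropped.
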
**Examples**

<div class="codetabs">

@@ -103,6 +103,8 @@ public class JavaALSExample {
     ALSModel model = als.fit(training);
 
     // Evaluate the model by computing the RMSE on the test data
+    // Note we set cold start strategy to 'drop' to ensure we don't get NaN evaluation metrics
+    model.setColdStartStrategy("drop");
     Dataset<Row> predictions = model.transform(test);
 
     RegressionEvaluator evaluator = new RegressionEvaluator()

@@ -44,7 +44,9 @@ if __name__ == "__main__":
     (training, test) = ratings.randomSplit([0.8, 0.2])
 
     # Build the recommendation model using ALS on the training data
-    als = ALS(maxIter=5, regParam=0.01, userCol="userId", itemCol="movieId", ratingCol="rating")
+    # Note we set cold start strategy to 'drop' to ensure we don't get NaN evaluation metrics
+    als = ALS(maxIter=5, regParam=0.01, userCol="userId", itemCol="movieId", ratingCol="rating",
+              coldStartStrategy="drop")
     model = als.fit(training)
 
     # Evaluate the model by computing the RMSE on the test data

@@ -65,6 +65,8 @@ object ALSExample {
     val model = als.fit(training)
 
     // Evaluate the model by computing the RMSE on the test data
+    // Note we set cold start strategy to 'drop' to ensure we don't get NaN evaluation metrics
+    model.setColdStartStrategy("drop")
     val predictions = model.transform(test)
 
     val evaluator = new RegressionEvaluator()