---
layout: global
title: Collaborative Filtering
displayTitle: Collaborative Filtering
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements. See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---

* Table of contents
{:toc}

## Collaborative filtering

[Collaborative filtering](http://en.wikipedia.org/wiki/Recommender_system#Collaborative_filtering)
is commonly used for recommender systems. These techniques aim to fill in the
missing entries of a user-item association matrix. `spark.ml` currently supports
model-based collaborative filtering, in which users and products are described
by a small set of latent factors that can be used to predict missing entries.
`spark.ml` uses the [alternating least squares
(ALS)](http://dl.acm.org/citation.cfm?id=1608614)
algorithm to learn these latent factors. The implementation in `spark.ml` has the
following parameters:

* *numBlocks* is the number of blocks the users and items will be partitioned into in order to parallelize computation (defaults to 10).
* *rank* is the number of latent factors in the model (defaults to 10).
* *maxIter* is the maximum number of iterations to run (defaults to 10).
* *regParam* specifies the regularization parameter in ALS (defaults to 1.0).
* *implicitPrefs* specifies whether to use the *explicit feedback* ALS variant or one adapted for
  *implicit feedback* data (defaults to `false`, which means using *explicit feedback*).
* *alpha* is a parameter applicable to the implicit feedback variant of ALS that governs the
  *baseline* confidence in preference observations (defaults to 1.0).
* *nonnegative* specifies whether or not to use nonnegative constraints for least squares (defaults to `false`).
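
The core idea behind the latent-factor model can be sketched in a few lines of plain Python (an illustrative toy, not Spark's implementation): a missing entry of the user-item matrix is predicted as the inner product of the user's and the item's factor vectors. The factor values below are hypothetical.

```python
def predict_rating(user_factors, item_factors):
    """Predict a user-item rating as the dot product of their latent factors."""
    return sum(u * i for u, i in zip(user_factors, item_factors))

# Hypothetical rank-3 factors of the kind ALS would learn.
user = [0.5, 1.2, -0.3]
item = [1.0, 0.8, 0.4]
print(predict_rating(user, item))  # 0.5*1.0 + 1.2*0.8 - 0.3*0.4
```

With *rank* factors per user and per item, the full matrix is approximated by the product of a users-by-rank and a rank-by-items matrix, which is what makes predicting the unobserved entries possible.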

**Note:** The DataFrame-based API for ALS currently only supports integers for user and item ids.
Other numeric types are supported for the user and item id columns,
but the ids must be within the integer value range.
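
A quick way to validate this constraint before training is a plain-Python range check (an illustrative helper, not part of the Spark API):

```python
# 32-bit signed integer bounds that user/item ids must fit within.
INT_MIN, INT_MAX = -(2**31), 2**31 - 1

def fits_integer_range(ids):
    """Return True if every id can be safely represented as a 32-bit integer."""
    return all(INT_MIN <= int(i) <= INT_MAX for i in ids)

print(fits_integer_range([1, 42, 2_000_000_000]))  # True
print(fits_integer_range([2**35]))                 # False: overflows Int
```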

### Explicit vs. implicit feedback

The standard approach to matrix factorization-based collaborative filtering treats
the entries in the user-item matrix as *explicit* preferences given by the user to the item,
for example, users giving ratings to movies.

It is common in many real-world use cases to only have access to *implicit feedback* (e.g. views,
clicks, purchases, likes, shares, etc.). The approach used in `spark.ml` to deal with such data is taken
from [Collaborative Filtering for Implicit Feedback Datasets](https://doi.org/10.1109/ICDM.2008.22).
Essentially, instead of trying to model the matrix of ratings directly, this approach treats the data
as numbers representing the *strength* in observations of user actions (such as the number of clicks,
or the cumulative duration someone spent viewing a movie). Those numbers are then related to the level of
confidence in observed user preferences, rather than explicit ratings given to items. The model
then tries to find latent factors that can be used to predict the expected preference of a user for
an item.
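
The transformation described above can be sketched in plain Python (an illustration of the paper's preference/confidence scheme, not Spark's internal code): a raw observation strength `r` is mapped to a binary preference and a confidence that grows with `r` at a rate set by `alpha`.

```python
def implicit_transform(r, alpha=1.0):
    """Map a raw implicit observation (e.g. a click count) to a
    (preference, confidence) pair, per the implicit-feedback ALS variant."""
    p = 1.0 if r > 0 else 0.0  # binary preference: was the item interacted with?
    c = 1.0 + alpha * r        # confidence grows with observation strength
    return p, c

# A user who clicked an item 10 times: preference 1 held with high confidence.
print(implicit_transform(10, alpha=1.0))  # (1.0, 11.0)
# No observed interaction: preference 0 with only baseline confidence.
print(implicit_transform(0))              # (0.0, 1.0)
```

The model then fits latent factors to the binary preferences, weighting each squared error by its confidence, so frequent interactions dominate the fit while unobserved entries contribute only weakly.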

### Scaling of the regularization parameter

We scale the regularization parameter `regParam` in solving each least squares problem by
the number of ratings the user generated in updating user factors,
or the number of ratings the product received in updating product factors.
This approach is named "ALS-WR" and discussed in the paper
"[Large-Scale Parallel Collaborative Filtering for the Netflix Prize](https://doi.org/10.1007/978-3-540-68880-8_32)".
It makes `regParam` less dependent on the scale of the dataset, so we can apply the
best parameter learned from a sampled subset to the full dataset and expect similar performance.
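
For intuition, this scaling can be sketched with a rank-1 ALS update in plain Python (a toy illustration of the ALS-WR scheme, not Spark's implementation): `regParam` enters a user's least-squares problem multiplied by that user's rating count `n_u`.

```python
def update_user_factor(item_factors, ratings, reg_param):
    """Rank-1 ALS-WR update for a single user.

    Minimizes sum_i (r_i - x * y_i)^2 + regParam * n_u * x^2, whose
    closed form is x = sum(y_i * r_i) / (sum(y_i^2) + regParam * n_u).
    """
    n_u = len(ratings)  # number of ratings this user generated
    numerator = sum(y * r for y, r in zip(item_factors, ratings))
    denominator = sum(y * y for y in item_factors) + reg_param * n_u
    return numerator / denominator

# One rating vs. two identical ratings: the scaled penalty keeps the
# solution the same, so regularization strength is per-rating, not global.
x1 = update_user_factor([1.0], [4.0], reg_param=0.5)
x2 = update_user_factor([1.0, 1.0], [4.0, 4.0], reg_param=0.5)
print(x1, x2)  # identical values
```

Because the penalty scales with `n_u`, duplicating a user's ratings leaves the solution unchanged, which is why a `regParam` tuned on a sampled subset tends to transfer to the full dataset.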

### Cold-start strategy

When making predictions using an `ALSModel`, it is common to encounter users and/or items in the
test dataset that were not present when training the model. This typically occurs in two
scenarios:

1. In production, for new users or items that have no rating history and on which the model has not
   been trained (this is the "cold start problem").
2. During cross-validation, the data is split between training and evaluation sets. When using
   simple random splits as in Spark's `CrossValidator` or `TrainValidationSplit`, it is actually
   very common to encounter users and/or items in the evaluation set that are not in the training set.
By default, Spark assigns `NaN` predictions during `ALSModel.transform` when a user and/or item
|
|
|
|
factor is not present in the model. This can be useful in a production system, since it indicates
|
|
|
|
a new user or item, and so the system can make a decision on some fallback to use as the prediction.
|
|
|
|
|
|
|
|
However, this is undesirable during cross-validation, since any `NaN` predicted values will result
|
|
|
|
in `NaN` results for the evaluation metric (for example when using `RegressionEvaluator`).
|
|
|
|
This makes model selection impossible.
|
|
|
|
|
|
|
|
Spark allows users to set the `coldStartStrategy` parameter
|
|
|
|
to "drop" in order to drop any rows in the `DataFrame` of predictions that contain `NaN` values.
|
|
|
|
The evaluation metric will then be computed over the non-`NaN` data and will be valid.
|
|
|
|
Usage of this parameter is illustrated in the example below.
|
|
|
|
|
|
|
|
**Note:** currently the supported cold start strategies are "nan" (the default behavior mentioned
|
|
|
|
above) and "drop". Further strategies may be supported in future.
|
|
|
|
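
The effect of the "drop" strategy can be sketched in plain Python (a simplified stand-in for `ALSModel.transform` followed by `RegressionEvaluator`, not Spark code): cold-start rows surface as `NaN` predictions, and dropping them yields a well-defined RMSE.

```python
import math

def rmse_with_drop(predictions, labels):
    """Compute RMSE over (prediction, label) pairs, discarding NaN
    predictions the way coldStartStrategy="drop" does for unseen ids."""
    pairs = [(p, l) for p, l in zip(predictions, labels) if not math.isnan(p)]
    if not pairs:
        raise ValueError("no non-NaN predictions to evaluate")
    mse = sum((p - l) ** 2 for p, l in pairs) / len(pairs)
    return math.sqrt(mse)

# The NaN produced for an unseen user is excluded; without dropping it,
# the metric itself would be NaN and model selection would be impossible.
preds = [3.0, float("nan"), 5.0]
labels = [3.0, 4.0, 4.0]
print(rmse_with_drop(preds, labels))
```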

**Examples**

<div class="codetabs">

<div data-lang="scala" markdown="1">

In the following example, we load ratings data from the
[MovieLens dataset](http://grouplens.org/datasets/movielens/), each row
consisting of a user, a movie, a rating and a timestamp.
We then train an ALS model which assumes, by default, that the ratings are
explicit (`implicitPrefs` is `false`).
We evaluate the recommendation model by measuring the root-mean-square error of
rating prediction.

Refer to the [`ALS` Scala docs](api/scala/org/apache/spark/ml/recommendation/ALS.html)
for more details on the API.

{% include_example scala/org/apache/spark/examples/ml/ALSExample.scala %}

If the rating matrix is derived from another source of information (i.e. it is
inferred from other signals), you can set `implicitPrefs` to `true` to get
better results:

{% highlight scala %}
val als = new ALS()
  .setMaxIter(5)
  .setRegParam(0.01)
  .setImplicitPrefs(true)
  .setUserCol("userId")
  .setItemCol("movieId")
  .setRatingCol("rating")
{% endhighlight %}

</div>

<div data-lang="java" markdown="1">

In the following example, we load ratings data from the
[MovieLens dataset](http://grouplens.org/datasets/movielens/), each row
consisting of a user, a movie, a rating and a timestamp.
We then train an ALS model which assumes, by default, that the ratings are
explicit (`implicitPrefs` is `false`).
We evaluate the recommendation model by measuring the root-mean-square error of
rating prediction.

Refer to the [`ALS` Java docs](api/java/org/apache/spark/ml/recommendation/ALS.html)
for more details on the API.

{% include_example java/org/apache/spark/examples/ml/JavaALSExample.java %}

If the rating matrix is derived from another source of information (i.e. it is
inferred from other signals), you can set `implicitPrefs` to `true` to get
better results:

{% highlight java %}
ALS als = new ALS()
  .setMaxIter(5)
  .setRegParam(0.01)
  .setImplicitPrefs(true)
  .setUserCol("userId")
  .setItemCol("movieId")
  .setRatingCol("rating");
{% endhighlight %}

</div>

<div data-lang="python" markdown="1">

In the following example, we load ratings data from the
[MovieLens dataset](http://grouplens.org/datasets/movielens/), each row
consisting of a user, a movie, a rating and a timestamp.
We then train an ALS model which assumes, by default, that the ratings are
explicit (`implicitPrefs` is `False`).
We evaluate the recommendation model by measuring the root-mean-square error of
rating prediction.

Refer to the [`ALS` Python docs](api/python/reference/api/pyspark.ml.recommendation.ALS.html)
for more details on the API.

{% include_example python/ml/als_example.py %}

If the rating matrix is derived from another source of information (i.e. it is
inferred from other signals), you can set `implicitPrefs` to `True` to get
better results:

{% highlight python %}
als = ALS(maxIter=5, regParam=0.01, implicitPrefs=True,
          userCol="userId", itemCol="movieId", ratingCol="rating")
{% endhighlight %}

</div>

<div data-lang="r" markdown="1">

Refer to the [R API docs](api/R/spark.als.html) for more details.

{% include_example r/ml/als.R %}
</div>

</div>