---
layout: global
title: Collaborative Filtering - RDD-based API
displayTitle: Collaborative Filtering - RDD-based API
---

* Table of contents
{:toc}
## Collaborative filtering

[Collaborative filtering](http://en.wikipedia.org/wiki/Recommender_system#Collaborative_filtering)
is commonly used for recommender systems. These techniques aim to fill in the
missing entries of a user-item association matrix. `spark.mllib` currently supports
model-based collaborative filtering, in which users and products are described
by a small set of latent factors that can be used to predict missing entries.
`spark.mllib` uses the [alternating least squares
(ALS)](http://dl.acm.org/citation.cfm?id=1608614)
algorithm to learn these latent factors. The implementation in `spark.mllib` has the
following parameters:

* *numBlocks* is the number of blocks used to parallelize computation (set to -1 to auto-configure).
* *rank* is the number of latent factors in the model.
* *iterations* is the number of iterations of ALS to run. ALS typically converges to a reasonable
  solution in 20 iterations or less.
* *lambda* specifies the regularization parameter in ALS.
* *implicitPrefs* specifies whether to use the *explicit feedback* ALS variant or one adapted for
  *implicit feedback* data.
* *alpha* is a parameter applicable to the implicit feedback variant of ALS that governs the
  *baseline* confidence in preference observations.
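
To make the roles of *rank*, *iterations*, and *lambda* concrete, the alternating updates can be sketched in plain NumPy on a tiny dense matrix. This is an illustration only, not the `spark.mllib` implementation; the function name `toy_als` and the small rating matrix are invented for this example.

```python
import numpy as np

def toy_als(R, rank=2, iterations=20, lmbda=0.05, seed=0):
    """Factor R (users x items) as U @ V.T, treating zeros as missing.

    Each half-iteration fixes one factor matrix and solves a small
    ridge-regularized least squares problem per user (or per item).
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = 0.1 * rng.standard_normal((n_users, rank))
    V = 0.1 * rng.standard_normal((n_items, rank))
    observed = R > 0
    reg = lmbda * np.eye(rank)                # lambda controls shrinkage
    for _ in range(iterations):
        for u in range(n_users):              # fix V, update user factors
            Vo = V[observed[u]]
            U[u] = np.linalg.solve(Vo.T @ Vo + reg, Vo.T @ R[u, observed[u]])
        for i in range(n_items):              # fix U, update item factors
            Uo = U[observed[:, i]]
            V[i] = np.linalg.solve(Uo.T @ Uo + reg, Uo.T @ R[observed[:, i], i])
    return U, V

R = np.array([[5.0, 3.0, 0.0],
              [4.0, 0.0, 1.0],
              [1.0, 1.0, 5.0]])
U, V = toy_als(R)
mse = np.mean((U @ V.T - R)[R > 0] ** 2)      # error on observed entries only
```

Because each half-iteration decomposes into independent per-user (or per-item) solves, the computation parallelizes naturally across blocks, which is what *numBlocks* controls in the distributed implementation.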

### Explicit vs. implicit feedback

The standard approach to matrix factorization-based collaborative filtering treats
the entries in the user-item matrix as *explicit* preferences given by the user to the item,
for example, users giving ratings to movies.

It is common in many real-world use cases to only have access to *implicit feedback* (e.g. views,
clicks, purchases, likes, shares etc.). The approach used in `spark.mllib` to deal with such data is taken
from [Collaborative Filtering for Implicit Feedback Datasets](http://dx.doi.org/10.1109/ICDM.2008.22).
Essentially, instead of trying to model the matrix of ratings directly, this approach treats the data
as numbers representing the *strength* of observations of user actions (such as the number of clicks,
or the cumulative duration someone spent viewing a movie). Those numbers are then related to the level of
confidence in observed user preferences, rather than explicit ratings given to items. The model
then tries to find latent factors that can be used to predict the expected preference of a user for
an item.

### Scaling of the regularization parameter

Since v1.1, we scale the regularization parameter `lambda` in solving each least squares problem by
the number of ratings the user generated in updating user factors,
or the number of ratings the product received in updating product factors.
This approach is named "ALS-WR" and discussed in the paper
"[Large-Scale Parallel Collaborative Filtering for the Netflix Prize](http://dx.doi.org/10.1007/978-3-540-68880-8_32)".
It makes `lambda` less dependent on the scale of the dataset, so we can apply the
best parameter learned from a sampled subset to the full dataset and expect similar performance.
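
A minimal sketch of what this scaling means for a single user's least squares solve (an illustration only, assuming NumPy; the helper name and data are invented):

```python
import numpy as np

def user_update_wr(r_u, V, lmbda=0.1):
    """One ALS-WR user-factor solve: the ridge term is scaled by n_u."""
    observed = r_u > 0
    Vo = V[observed]                          # factors of the items this user rated
    n_u = Vo.shape[0]                         # number of ratings the user generated
    reg = lmbda * n_u * np.eye(V.shape[1])    # lambda scaled by n_u (the "WR" part)
    return np.linalg.solve(Vo.T @ Vo + reg, Vo.T @ r_u[observed])

rng = np.random.default_rng(1)
V = rng.standard_normal((5, 3))               # factors for 5 items, rank 3
r_u = np.array([4.0, 0.0, 5.0, 3.0, 0.0])     # this user generated n_u = 3 ratings
x_u = user_update_wr(r_u, V)
```

Because the penalty grows with the number of observations, heavy and light users are shrunk proportionally, which is why a `lambda` tuned on a sample tends to transfer to the full dataset.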

## Examples

<div class="codetabs">

<div data-lang="scala" markdown="1">

In the following example we load rating data. Each row consists of a user, a product and a rating.
We use the default [ALS.train()](api/scala/index.html#org.apache.spark.mllib.recommendation.ALS$)
method which assumes ratings are explicit. We evaluate the
recommendation model by measuring the Mean Squared Error of rating prediction.

Refer to the [`ALS` Scala docs](api/scala/index.html#org.apache.spark.mllib.recommendation.ALS) for more details on the API.

{% include_example scala/org/apache/spark/examples/mllib/RecommendationExample.scala %}

If the rating matrix is derived from another source of information (i.e. it is inferred from
other signals), you can use the `trainImplicit` method to get better results.

{% highlight scala %}
val alpha = 0.01
val lambda = 0.01
val model = ALS.trainImplicit(ratings, rank, numIterations, lambda, alpha)
{% endhighlight %}
</div>

<div data-lang="java" markdown="1">
All of MLlib's methods use Java-friendly types, so you can import and call them there the same
way you do in Scala. The only caveat is that the methods take Scala RDD objects, while the
Spark Java API uses a separate `JavaRDD` class. You can convert a Java RDD to a Scala one by
calling `.rdd()` on your `JavaRDD` object. A self-contained application example
that is equivalent to the provided example in Scala is given below:

Refer to the [`ALS` Java docs](api/java/org/apache/spark/mllib/recommendation/ALS.html) for more details on the API.

{% include_example java/org/apache/spark/examples/mllib/JavaRecommendationExample.java %}
</div>

<div data-lang="python" markdown="1">
In the following example we load rating data. Each row consists of a user, a product and a rating.
We use the default ALS.train() method which assumes ratings are explicit. We evaluate the
recommendation model by measuring the Mean Squared Error of rating prediction.

Refer to the [`ALS` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.recommendation.ALS) for more details on the API.

{% include_example python/mllib/recommendation_example.py %}

If the rating matrix is derived from another source of information (i.e. it is inferred from other
signals), you can use the `trainImplicit` method to get better results.

{% highlight python %}
# Build the recommendation model using Alternating Least Squares based on implicit ratings
model = ALS.trainImplicit(ratings, rank, numIterations, alpha=0.01)
{% endhighlight %}
</div>

</div>

In order to run the above application, follow the instructions
provided in the [Self-Contained Applications](quick-start.html#self-contained-applications)
section of the Spark
Quick Start guide. Be sure to also include *spark-mllib* in your build file as
a dependency.
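
For example, an sbt build definition might declare the dependency roughly as follows (a sketch only; the version string is a placeholder, so match it to your Spark release):

```scala
// In build.sbt -- version numbers are placeholders, align them with your Spark deployment.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "2.0.0" % "provided",
  "org.apache.spark" %% "spark-mllib" % "2.0.0" % "provided"
)
```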

## Tutorial

The [training exercises](https://databricks-training.s3.amazonaws.com/index.html) from the Spark Summit 2014 include a hands-on tutorial for
[personalized movie recommendation with `spark.mllib`](https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html).