---
layout: global
title: Collaborative Filtering - MLlib
displayTitle: <a href="mllib-guide.html">MLlib</a> - Collaborative Filtering
---

* Table of contents
{:toc}

## Collaborative filtering

[Collaborative filtering](http://en.wikipedia.org/wiki/Recommender_system#Collaborative_filtering)
is commonly used for recommender systems. These techniques aim to fill in the
missing entries of a user-item association matrix. MLlib currently supports
model-based collaborative filtering, in which users and products are described
by a small set of latent factors that can be used to predict missing entries.
In particular, we implement the [alternating least squares
(ALS)](http://dl.acm.org/citation.cfm?id=1608614)
algorithm to learn these latent factors. The implementation in MLlib has the
following parameters:

* *numBlocks* is the number of blocks used to parallelize computation (set to -1 to auto-configure).
* *rank* is the number of latent factors in our model.
* *iterations* is the number of iterations to run.
* *lambda* specifies the regularization parameter in ALS.
* *implicitPrefs* specifies whether to use the *explicit feedback* ALS variant or one adapted for
  *implicit feedback* data.
* *alpha* is a parameter applicable to the implicit feedback variant of ALS that governs the
  *baseline* confidence in preference observations.
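
The role of *rank* can be made concrete: the trained model represents each user and each product
by a factor vector of length *rank*, and a predicted rating is simply the dot product of the two
vectors. A minimal pure-Scala sketch (the factor values below are made up for illustration; in a
trained model they would come from the model's user and product features):

{% highlight scala %}
// Hypothetical latent factors for one user and one product, with rank = 3
val userFactors = Array(0.5, 1.0, -0.2)
val productFactors = Array(1.2, 0.3, 0.8)

// The predicted rating is the dot product of the two factor vectors
val predictedRating = userFactors.zip(productFactors).map { case (u, p) => u * p }.sum
{% endhighlight %}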

### Explicit vs. implicit feedback

The standard approach to matrix factorization-based collaborative filtering treats
the entries in the user-item matrix as *explicit* preferences given by the user to the item.

It is common in many real-world use cases to only have access to *implicit feedback* (e.g., views,
clicks, purchases, likes, shares, etc.). The approach used in MLlib to deal with such data is taken
from
[Collaborative Filtering for Implicit Feedback Datasets](http://dx.doi.org/10.1109/ICDM.2008.22).
Essentially, instead of trying to model the matrix of ratings directly, this approach treats the data
as a combination of binary preferences and *confidence values*. The ratings are then related to the
level of confidence in observed user preferences, rather than explicit ratings given to items. The
model then tries to find latent factors that can be used to predict the expected preference of a
user for an item.
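
Concretely, the cited paper maps each raw observation (e.g., a click count) to a binary preference
(did the user interact with the item at all?) and a confidence that grows with the number of
observed interactions. A small pure-Scala sketch of that mapping (the counts used below are
illustrative):

{% highlight scala %}
// Map a raw implicit-feedback count r to the binary preference and
// confidence used by the implicit ALS variant, following
// Hu, Koren & Volinsky (2008): p = [r > 0], c = 1 + alpha * r
val alpha = 0.01

def preference(r: Double): Double = if (r > 0) 1.0 else 0.0
def confidence(r: Double): Double = 1.0 + alpha * r

// A user who clicked an item 40 times: preferred, with higher confidence
// than an item viewed once
val (p, c) = (preference(40.0), confidence(40.0))
{% endhighlight %}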

## Examples

<div class="codetabs">

<div data-lang="scala" markdown="1">
In the following example we load rating data. Each row consists of a user, a product and a rating.
We use the default [ALS.train()](api/scala/index.html#org.apache.spark.mllib.recommendation.ALS$)
method, which assumes ratings are explicit. We evaluate the
recommendation model by measuring the Mean Squared Error of rating prediction.

{% highlight scala %}
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.Rating

// Load and parse the data
val data = sc.textFile("mllib/data/als/test.data")
val ratings = data.map(_.split(',') match { case Array(user, item, rate) =>
  Rating(user.toInt, item.toInt, rate.toDouble)
})

// Build the recommendation model using ALS
val rank = 10
val numIterations = 20
val model = ALS.train(ratings, rank, numIterations, 0.01)

// Evaluate the model on rating data
val usersProducts = ratings.map { case Rating(user, product, rate) =>
  (user, product)
}
val predictions =
  model.predict(usersProducts).map { case Rating(user, product, rate) =>
    ((user, product), rate)
  }
val ratesAndPreds = ratings.map { case Rating(user, product, rate) =>
  ((user, product), rate)
}.join(predictions)
val MSE = ratesAndPreds.map { case ((user, product), (r1, r2)) =>
  val err = (r1 - r2)
  err * err
}.mean()
println("Mean Squared Error = " + MSE)
{% endhighlight %}

If the rating matrix is derived from another source of information (i.e., it is inferred from
other signals), you can use the `trainImplicit` method to get better results.

{% highlight scala %}
val alpha = 0.01
val model = ALS.trainImplicit(ratings, rank, numIterations, alpha)
{% endhighlight %}
</div>

<div data-lang="java" markdown="1">
All of MLlib's methods use Java-friendly types, so you can import and call them from Java the same
way you do in Scala. The only caveat is that the methods take Scala RDD objects, while the
Spark Java API uses a separate `JavaRDD` class. You can convert a Java RDD to a Scala one by
calling `.rdd()` on your `JavaRDD` object.
</div>

<div data-lang="python" markdown="1">
In the following example we load rating data. Each row consists of a user, a product and a rating.
We use the default ALS.train() method, which assumes ratings are explicit. We evaluate the
recommendation model by measuring the Mean Squared Error of rating prediction.

{% highlight python %}
from pyspark.mllib.recommendation import ALS
from numpy import array

# Load and parse the data
data = sc.textFile("mllib/data/als/test.data")
ratings = data.map(lambda line: array([float(x) for x in line.split(',')]))

# Build the recommendation model using Alternating Least Squares
rank = 10
numIterations = 20
model = ALS.train(ratings, rank, numIterations)

# Evaluate the model on training data
testdata = ratings.map(lambda p: (int(p[0]), int(p[1])))
predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).reduce(lambda x, y: x + y) / ratesAndPreds.count()
print("Mean Squared Error = " + str(MSE))
{% endhighlight %}

If the rating matrix is derived from another source of information (i.e., it is inferred from other
signals), you can use the `trainImplicit` method to get better results.

{% highlight python %}
# Build the recommendation model using Alternating Least Squares based on implicit ratings
model = ALS.trainImplicit(ratings, rank, numIterations, alpha=0.01)
{% endhighlight %}
</div>

</div>

## Tutorial

[AMP Camp](http://ampcamp.berkeley.edu/) provides a hands-on tutorial for
[personalized movie recommendation with MLlib](http://ampcamp.berkeley.edu/big-data-mini-course/movie-recommendation-with-mllib.html).