spark-instrumented-optimizer/docs/mllib-frequent-pattern-mining.md

---
layout: global
title: Frequent Pattern Mining - MLlib
displayTitle: <a href="mllib-guide.html">MLlib</a> - Frequent Pattern Mining
---

Mining frequent items, itemsets, subsequences, or other substructures is usually among the
first steps in analyzing a large-scale dataset, and it has been an active research topic in
data mining for years.
We refer users to Wikipedia's [association rule learning](http://en.wikipedia.org/wiki/Association_rule_learning)
for more information.
MLlib provides a parallel implementation of FP-growth,
a popular algorithm for mining frequent itemsets.

## FP-growth

The FP-growth algorithm is described in the paper
[Han et al., Mining frequent patterns without candidate generation](http://dx.doi.org/10.1145/335191.335372),
where "FP" stands for frequent pattern.
Given a dataset of transactions, the first step of FP-growth is to calculate item frequencies and identify frequent items.
Unlike [Apriori-like](http://en.wikipedia.org/wiki/Apriori_algorithm) algorithms designed for the same purpose,
the second step of FP-growth uses a prefix tree (FP-tree) structure to encode transactions without explicitly generating candidate sets,
which are usually expensive to generate.
After the second step, the frequent itemsets can be extracted from the FP-tree.
In MLlib, we implemented a parallel version of FP-growth called PFP,
as described in [Li et al., PFP: Parallel FP-growth for query recommendation](http://dx.doi.org/10.1145/1454008.1454027).
PFP distributes the work of growing FP-trees based on the suffixes of transactions,
and is hence more scalable than a single-machine implementation.
We refer users to the papers for more details.
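
To make the first two steps concrete, here is a minimal single-machine sketch in Java.
It is not MLlib's PFP implementation: the class `FPTreeSketch`, its `Node` type, the helper names,
and the sample transactions are all hypothetical, and breaking ties between equally frequent items
alphabetically is an arbitrary choice.
It counts item frequencies, drops infrequent items, orders each transaction by descending frequency,
and inserts the result into a tree so that transactions starting with the same frequent items share a path.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Illustrative single-machine sketch; all names here are hypothetical, not MLlib API.
class FPTreeSketch {
  // One tree node: an item, how many transactions pass through it, and its children.
  static final class Node {
    final String item;
    long count = 0;
    final Map<String, Node> children = new HashMap<>();
    Node(String item) { this.item = item; }
  }

  // Made-up sample data: five transactions over the items a..e.
  static final List<List<String>> SAMPLE = Arrays.asList(
      Arrays.asList("a", "b", "c"),
      Arrays.asList("a", "b", "d"),
      Arrays.asList("a", "b"),
      Arrays.asList("a", "c"),
      Arrays.asList("b", "e"));

  // Step 1: count the number of transactions in which each item occurs.
  static Map<String, Long> countItems(List<List<String>> transactions) {
    Map<String, Long> counts = new HashMap<>();
    for (List<String> t : transactions) {
      for (String item : new HashSet<>(t)) {  // de-duplicate within a transaction
        counts.merge(item, 1L, Long::sum);
      }
    }
    return counts;
  }

  // Keep only frequent items, ordered by descending count (ties: alphabetical).
  static List<String> orderTransaction(
      List<String> t, Map<String, Long> counts, long minCount) {
    List<String> kept = new ArrayList<>();
    for (String item : new HashSet<>(t)) {
      if (counts.getOrDefault(item, 0L) >= minCount) {
        kept.add(item);
      }
    }
    kept.sort((x, y) -> {
      int byCount = Long.compare(counts.get(y), counts.get(x));
      return byCount != 0 ? byCount : x.compareTo(y);
    });
    return kept;
  }

  // Step 2: insert each ordered transaction as a path starting at the root.
  static Node buildTree(List<List<String>> transactions, long minCount) {
    Map<String, Long> counts = countItems(transactions);
    Node root = new Node(null);
    for (List<String> t : transactions) {
      Node node = root;
      for (String item : orderTransaction(t, counts, minCount)) {
        node = node.children.computeIfAbsent(item, Node::new);
        node.count++;
      }
    }
    return root;
  }

  // Number of nodes below the given node.
  static int size(Node node) {
    int n = 0;
    for (Node child : node.children.values()) {
      n += 1 + size(child);
    }
    return n;
  }

  public static void main(String[] args) {
    Node root = buildTree(SAMPLE, 2);
    System.out.println("nodes in tree: " + size(root));
    System.out.println("count at a: " + root.children.get("a").count);
  }
}
```

On the sample data with a minimum count of 2, the five transactions are stored in a tree of five nodes,
and the node for `a` directly under the root carries a count of 4.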

MLlib's FP-growth implementation takes the following (hyper-)parameters:

* `minSupport`: the minimum support for an itemset to be identified as frequent.
  For example, if an item appears in 3 out of 5 transactions, it has a support of 3/5=0.6.
* `numPartitions`: the number of partitions used to distribute the work.
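
The support calculation above can be sketched in plain Java as follows.
The class `SupportSketch`, its methods, and the sample data are hypothetical and only illustrate the definition.
Converting the relative `minSupport` into an absolute transaction count by rounding up is an assumption
made for this sketch, not a statement about MLlib's internals.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

// Illustrative sketch; the class and method names are hypothetical, not MLlib API.
class SupportSketch {
  // Made-up sample data in which item "a" appears in 3 of 5 transactions.
  static final List<List<String>> SAMPLE = Arrays.asList(
      Arrays.asList("a", "b"),
      Arrays.asList("a", "c"),
      Arrays.asList("a", "b", "c"),
      Arrays.asList("b"),
      Arrays.asList("c"));

  // Support = fraction of transactions that contain every item of the itemset.
  static double support(List<List<String>> transactions, Collection<String> itemset) {
    long hits = 0;
    for (List<String> t : transactions) {
      if (t.containsAll(itemset)) {
        hits++;
      }
    }
    return (double) hits / transactions.size();
  }

  // Assumption: the relative threshold becomes an absolute count by rounding up.
  static long minCount(double minSupport, long numTransactions) {
    return (long) Math.ceil(minSupport * numTransactions);
  }

  public static void main(String[] args) {
    System.out.println("support of {a}: " + support(SAMPLE, Arrays.asList("a")));
    System.out.println("support of {a, b}: " + support(SAMPLE, Arrays.asList("a", "b")));
    System.out.println("min count at minSupport 0.5: " + minCount(0.5, SAMPLE.size()));
  }
}
```

In this sample, the itemset `{a}` occurs in 3 of 5 transactions and `{a, b}` in 2 of 5,
so with a `minSupport` of 0.5 only `{a}` would count as frequent.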

**Examples**

<div class="codetabs">
<div data-lang="scala" markdown="1">

[`FPGrowth`](api/scala/index.html#org.apache.spark.mllib.fpm.FPGrowth) implements the
FP-growth algorithm.
It takes an `RDD` of transactions, where each transaction is an `Array` of items of a generic type.
Calling `FPGrowth.run` with transactions returns an
[`FPGrowthModel`](api/scala/index.html#org.apache.spark.mllib.fpm.FPGrowthModel)
that stores the frequent itemsets with their frequencies.

{% highlight scala %}
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.fpm.{FPGrowth, FPGrowthModel}

val transactions: RDD[Array[String]] = ...

val fpg = new FPGrowth()
  .setMinSupport(0.2)
  .setNumPartitions(10)
val model = fpg.run(transactions)

model.freqItemsets.collect().foreach { case (itemset, freq) =>
  println(itemset.mkString("[", ",", "]") + ", " + freq)
}
{% endhighlight %}

</div>

<div data-lang="java" markdown="1">

[`FPGrowth`](api/java/org/apache/spark/mllib/fpm/FPGrowth.html) implements the
FP-growth algorithm.
It takes a `JavaRDD` of transactions, where each transaction is an `Iterable` of items of a generic type.
Calling `FPGrowth.run` with transactions returns an
[`FPGrowthModel`](api/java/org/apache/spark/mllib/fpm/FPGrowthModel.html)
that stores the frequent itemsets with their frequencies.

{% highlight java %}
import java.util.Arrays;
import java.util.List;

import scala.Tuple2;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.fpm.FPGrowth;
import org.apache.spark.mllib.fpm.FPGrowthModel;

JavaRDD<List<String>> transactions = ...

FPGrowth fpg = new FPGrowth()
  .setMinSupport(0.2)
  .setNumPartitions(10);

FPGrowthModel<String> model = fpg.run(transactions);

for (Tuple2<Object, Long> s: model.javaFreqItemsets().collect()) {
  System.out.println("(" + Arrays.toString((Object[]) s._1()) + "): " + s._2());
}
{% endhighlight %}

</div>
</div>

[SPARK-5519][MLLIB] add user guide with example code for fp-growth

The API is still not very Java-friendly because `Array[Item]` in `freqItemsets` is recognized as `Object` in Java. We might want to define a case class to wrap the return pair to make it Java friendly.

Author: Xiangrui Meng <meng@databricks.com>

Closes #4661 from mengxr/SPARK-5519 and squashes the following commits:

58ccc25 [Xiangrui Meng] add user guide with example code for fp-growth

2015-02-18 13:09:56 -05:00