---
layout: global
title: Naive Bayes - MLlib
displayTitle: <a href="mllib-guide.html">MLlib</a> - Naive Bayes
---

[Naive Bayes](http://en.wikipedia.org/wiki/Naive_Bayes_classifier) is a simple
multiclass classification algorithm that assumes every pair of features is
conditionally independent given the label. Naive Bayes can be trained very efficiently: in a
single pass over the training data, it computes the conditional probability
distribution of each feature given the label. It then applies Bayes' theorem to
compute the conditional probability distribution of the label given an observation
and uses that distribution for prediction.
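
The prediction rule described above can be written out as a short sketch. The symbols are introduced here only for illustration and are not part of the MLlib API: $y$ denotes a label, $x_1, \ldots, x_n$ the feature values of one observation, and $\hat{y}$ the predicted label. Bayes' theorem combined with the independence assumption gives

$$
P(y \mid x_1, \ldots, x_n)
  = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \ldots, x_n)}
  \propto P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y).
$$

The denominator does not depend on $y$, so it can be dropped when comparing labels.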

MLlib supports [multinomial naive
Bayes](http://en.wikipedia.org/wiki/Naive_Bayes_classifier#Multinomial_naive_Bayes)
and [Bernoulli naive Bayes](http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html).
These models are typically used for [document classification](http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html).
Within that context, each observation is a document and each
feature represents a term. The feature value is the frequency of the term (in multinomial naive Bayes) or
a zero or one indicating whether the term was found in the document (in Bernoulli naive Bayes).
Feature values must be nonnegative. The model type is selected with an optional parameter whose value is
"multinomial" or "bernoulli", with "multinomial" as the default.
[Additive smoothing](http://en.wikipedia.org/wiki/Lidstone_smoothing) can be used by
setting the parameter $\lambda$ (default $1.0$). For document classification, the input feature
vectors are usually sparse, and sparse vectors should be supplied as input to take advantage of
sparsity. Since the training data is only used once, it is not necessary to cache it.

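
To make the roles of term frequencies, $\lambda$, and sparse input concrete, here is a minimal pure-Python sketch of multinomial naive Bayes. It does not use Spark and is not MLlib's implementation: the function names `train_multinomial_nb` and `predict` are hypothetical, documents are represented as sparse `{term_index: count}` dictionaries, and smoothing is applied only to the per-term likelihoods (the class prior is left unsmoothed, which is a simplifying assumption of this sketch).

```python
import math
from collections import defaultdict


def train_multinomial_nb(examples, num_features, lam=1.0):
    """Train on (label, {term_index: count}) pairs in a single pass (illustrative only)."""
    doc_counts = defaultdict(int)                          # documents seen per label
    term_counts = defaultdict(lambda: defaultdict(float))  # label -> term index -> summed count
    for label, features in examples:                       # the only pass over the data
        doc_counts[label] += 1
        for index, value in features.items():
            if value < 0:
                raise ValueError("feature values must be nonnegative")
            term_counts[label][index] += value

    total_docs = sum(doc_counts.values())
    model = {}
    for label, n_docs in doc_counts.items():
        log_prior = math.log(n_docs / total_docs)
        # Additive smoothing: add lam to every term count for this label.
        denom = sum(term_counts[label].values()) + lam * num_features
        log_theta = [
            math.log((term_counts[label].get(i, 0.0) + lam) / denom)
            for i in range(num_features)
        ]
        model[label] = (log_prior, log_theta)
    return model


def predict(model, features):
    """Return the label with the highest log posterior score for a sparse document."""
    def score(label):
        log_prior, log_theta = model[label]
        # Only nonzero entries contribute, so sparse input keeps this cheap.
        return log_prior + sum(v * log_theta[i] for i, v in features.items())
    return max(model, key=score)


# Toy corpus: label 0.0 mostly uses term 0, label 1.0 mostly uses term 2.
docs = [
    (0.0, {0: 2.0, 1: 1.0}),
    (0.0, {0: 3.0}),
    (1.0, {2: 2.0, 1: 1.0}),
    (1.0, {2: 4.0}),
]
model = train_multinomial_nb(docs, num_features=3, lam=1.0)
print(predict(model, {0: 1.0}))
print(predict(model, {2: 1.0}))
```

In this toy corpus, term 2 never occurs under label 0.0. With `lam=1.0` it still receives a small nonzero probability; with `lam=0` its estimate would be zero and this sketch would fail when taking the logarithm, which is the situation additive smoothing is meant to avoid.
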
## Examples
<div class="codetabs">
<div data-lang="scala" markdown="1">

[NaiveBayes](api/scala/index.html#org.apache.spark.mllib.classification.NaiveBayes$) implements
multinomial naive Bayes. It takes as input an RDD of
[LabeledPoint](api/scala/index.html#org.apache.spark.mllib.regression.LabeledPoint), an optional
smoothing parameter `lambda`, and an optional model type parameter (default is "multinomial"), and outputs a
[NaiveBayesModel](api/scala/index.html#org.apache.spark.mllib.classification.NaiveBayesModel), which
can be used for evaluation and prediction.

{% highlight scala %}
import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Parse each line: the label comes before the comma, space-separated features after it.
val data = sc.textFile("data/mllib/sample_naive_bayes_data.txt")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}

// Split data into training (60%) and test (40%).
val splits = parsedData.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0)
val test = splits(1)

// Train with additive smoothing; the model type argument is named modelType.
val model = NaiveBayes.train(training, lambda = 1.0, modelType = "multinomial")

// Compare predictions with the true labels on the test set.
val predictionAndLabel = test.map(p => (model.predict(p.features), p.label))
val accuracy = 1.0 * predictionAndLabel.filter(x => x._1 == x._2).count() / test.count()

// Save and load model
model.save(sc, "myModelPath")
val sameModel = NaiveBayesModel.load(sc, "myModelPath")
{% endhighlight %}
</div>

<div data-lang="java" markdown="1">

[NaiveBayes](api/java/org/apache/spark/mllib/classification/NaiveBayes.html) implements
multinomial naive Bayes. It takes a Scala RDD of
[LabeledPoint](api/java/org/apache/spark/mllib/regression/LabeledPoint.html) and an
optional smoothing parameter `lambda` as input, and outputs a
[NaiveBayesModel](api/java/org/apache/spark/mllib/classification/NaiveBayesModel.html), which
can be used for evaluation and prediction.

{% highlight java %}
import scala.Tuple2;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.mllib.classification.NaiveBayes;
import org.apache.spark.mllib.classification.NaiveBayesModel;
import org.apache.spark.mllib.regression.LabeledPoint;

JavaRDD<LabeledPoint> training = ... // training set
JavaRDD<LabeledPoint> test = ... // test set

final NaiveBayesModel model = NaiveBayes.train(training.rdd(), 1.0);

JavaPairRDD<Double, Double> predictionAndLabel =
  test.mapToPair(new PairFunction<LabeledPoint, Double, Double>() {
    @Override public Tuple2<Double, Double> call(LabeledPoint p) {
      return new Tuple2<Double, Double>(model.predict(p.features()), p.label());
    }
  });
double accuracy = predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() {
    @Override public Boolean call(Tuple2<Double, Double> pl) {
      return pl._1().equals(pl._2());
    }
  }).count() / (double) test.count();

// Save and load model
model.save(sc.sc(), "myModelPath");
NaiveBayesModel sameModel = NaiveBayesModel.load(sc.sc(), "myModelPath");
{% endhighlight %}
</div>

<div data-lang="python" markdown="1">

[NaiveBayes](api/python/pyspark.mllib.html#pyspark.mllib.classification.NaiveBayes) implements multinomial
naive Bayes. It takes an RDD of
[LabeledPoint](api/python/pyspark.mllib.html#pyspark.mllib.regression.LabeledPoint) and an optional
smoothing parameter `lambda` as input, and outputs a
[NaiveBayesModel](api/python/pyspark.mllib.html#pyspark.mllib.classification.NaiveBayesModel), which can be
used for evaluation and prediction.

Note that the Python API does not yet support model save/load but will in the future.

{% highlight python %}
from pyspark.mllib.classification import NaiveBayes
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

def parseLine(line):
    # The label comes before the comma, space-separated features after it.
    parts = line.split(',')
    label = float(parts[0])
    features = Vectors.dense([float(x) for x in parts[1].split(' ')])
    return LabeledPoint(label, features)

data = sc.textFile('data/mllib/sample_naive_bayes_data.txt').map(parseLine)

# Split data approximately into training (60%) and test (40%)
training, test = data.randomSplit([0.6, 0.4], seed=0)

# Train a naive Bayes model.
model = NaiveBayes.train(training, 1.0)

# Make prediction and test accuracy.
predictionAndLabel = test.map(lambda p: (model.predict(p.features), p.label))
# Index into the (prediction, label) pair; tuple-unpacking lambdas work only in Python 2.
accuracy = 1.0 * predictionAndLabel.filter(lambda pl: pl[0] == pl[1]).count() / test.count()
{% endhighlight %}
</div>
</div>