[NaiveBayes](api/mllib/index.html#org.apache.spark.mllib.classification.NaiveBayes$) implements
multinomial naive Bayes. It takes an RDD of
[LabeledPoint](api/mllib/index.html#org.apache.spark.mllib.regression.LabeledPoint) and an optional
smoothing parameter `lambda` as input, and outputs a
[NaiveBayesModel](api/mllib/index.html#org.apache.spark.mllib.classification.NaiveBayesModel), which
can be used for evaluation and prediction.
{% highlight scala %}
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
val data = sc.textFile("mllib/data/sample_naive_bayes_data.txt")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}
// Split data into training (60%) and test (40%).
val splits = parsedData.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0)
val test = splits(1)
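// Train a naive Bayes model with smoothing parameter lambda = 1.0.
val model = NaiveBayes.train(training, lambda = 1.0)
// Predict a label for each test point and pair it with the true label.
val prediction = model.predict(test.map(_.features))
val predictionAndLabel = prediction.zip(test.map(_.label))
// Accuracy is the fraction of test points whose prediction equals the label.
val accuracy = 1.0 * predictionAndLabel.filter(x => x._1 == x._2).count() / test.count()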
{% endhighlight %}
[NaiveBayes](api/mllib/index.html#org.apache.spark.mllib.classification.NaiveBayes$) implements
multinomial naive Bayes. It takes a Scala RDD of
[LabeledPoint](api/mllib/index.html#org.apache.spark.mllib.regression.LabeledPoint) and an
optional smoothing parameter `lambda` as input, and outputs a
[NaiveBayesModel](api/mllib/index.html#org.apache.spark.mllib.classification.NaiveBayesModel), which
can be used for evaluation and prediction.
{% highlight java %}
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.classification.NaiveBayes;
import org.apache.spark.mllib.classification.NaiveBayesModel;
import org.apache.spark.mllib.regression.LabeledPoint;
import scala.Tuple2;
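JavaRDD<LabeledPoint> training = ... // training set
JavaRDD<LabeledPoint> test = ... // test set
// Train a naive Bayes model with smoothing parameter lambda = 1.0.
final NaiveBayesModel model = NaiveBayes.train(training.rdd(), 1.0);
// Predict a label for each test point.
JavaRDD<Double> prediction =
  test.map(new Function<LabeledPoint, Double>() {
    @Override public Double call(LabeledPoint p) {
      return model.predict(p.features());
    }
  });
// Pair each prediction with the true label.
JavaPairRDD<Double, Double> predictionAndLabel =
  prediction.zip(test.map(new Function<LabeledPoint, Double>() {
    @Override public Double call(LabeledPoint p) {
      return p.label();
    }
  }));
// Accuracy is the fraction of test points whose prediction equals the label.
double accuracy = 1.0 * predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() {
    @Override public Boolean call(Tuple2<Double, Double> pl) {
      // Compare boxed Doubles by value, not by reference.
      return pl._1().equals(pl._2());
    }
  }).count() / test.count();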
{% endhighlight %}
[NaiveBayes](api/pyspark/pyspark.mllib.classification.NaiveBayes-class.html) implements multinomial
naive Bayes. It takes an RDD of
[LabeledPoint](api/pyspark/pyspark.mllib.regression.LabeledPoint-class.html) and an optional
smoothing parameter `lambda` as input, and outputs a
[NaiveBayesModel](api/pyspark/pyspark.mllib.classification.NaiveBayesModel-class.html), which can be
used for evaluation and prediction.
{% highlight python %}
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import NaiveBayes
# an RDD of LabeledPoint
data = sc.parallelize([
  LabeledPoint(0.0, [0.0, 0.0])
  ... # more labeled points
])
# Train a naive Bayes model.
model = NaiveBayes.train(data, 1.0)
# Predict the label of a single feature vector.
prediction = model.predict([0.0, 0.0])
{% endhighlight %}
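
To make the role of the smoothing parameter `lambda` more concrete, here is a minimal plain-Python sketch of multinomial naive Bayes with additive smoothing. It does not use Spark and is not presented as MLlib's implementation; the helper names `train_multinomial_nb` and `predict`, the argument name `lam`, and the toy data are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def train_multinomial_nb(points, lam=1.0):
    """Fit a toy multinomial naive Bayes model (illustrative helper, not MLlib).

    points: iterable of (label, features) pairs with non-negative feature counts.
    lam: additive smoothing parameter, playing the role of `lambda` above.
    Returns a dict mapping label -> (log_prior, list of log feature probabilities).
    """
    label_counts = defaultdict(int)
    feature_sums = {}
    num_features = None
    for label, features in points:
        if num_features is None:
            num_features = len(features)
        label_counts[label] += 1
        sums = feature_sums.setdefault(label, [0.0] * num_features)
        for j, value in enumerate(features):
            sums[j] += value

    num_points = sum(label_counts.values())
    num_labels = len(label_counts)
    model = {}
    for label, count in label_counts.items():
        # Smoothed class prior.
        log_prior = math.log((count + lam) / (num_points + num_labels * lam))
        # Smoothed per-feature probabilities within this class.
        total = sum(feature_sums[label])
        log_theta = [
            math.log((s + lam) / (total + num_features * lam))
            for s in feature_sums[label]
        ]
        model[label] = (log_prior, log_theta)
    return model


def predict(model, features):
    """Return the label with the highest log posterior score."""
    def score(label):
        log_prior, log_theta = model[label]
        return log_prior + sum(x * t for x, t in zip(features, log_theta))
    return max(model, key=score)


# Toy data (hypothetical): two classes, two count features.
points = [
    (0.0, [3.0, 0.0]),
    (0.0, [2.0, 1.0]),
    (1.0, [0.0, 4.0]),
    (1.0, [1.0, 2.0]),
]
model = train_multinomial_nb(points, lam=1.0)
first = predict(model, [2.0, 0.0])
second = predict(model, [0.0, 3.0])
```

With `lam > 0`, a feature that never occurs in a class still receives a small nonzero probability, so the logarithm stays finite. In this sketch, setting `lam` to zero would make `math.log` fail for any class in which some feature has a zero total count.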