---
layout: global
title: Naive Bayes - RDD-based API
displayTitle: Naive Bayes - RDD-based API
---

[Naive Bayes](http://en.wikipedia.org/wiki/Naive_Bayes_classifier) is a simple
multiclass classification algorithm that assumes independence between
every pair of features given the label. Naive Bayes can be trained very efficiently. In a
single pass over the training data, it computes the conditional probability
distribution of each feature given the label, and then it applies Bayes' theorem to
compute the conditional probability distribution of the label given an observation
and uses it for prediction.
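
The prediction rule described above can be written compactly. In this sketch,
$x = (x_1, \ldots, x_n)$ denotes the feature values of one observation and $y$ a label;
this notation is introduced here for illustration and does not appear in the `spark.mllib` API.

$$
P(y \mid x_1, \ldots, x_n) \propto P(y) \prod_{j=1}^{n} P(x_j \mid y),
\qquad
\hat{y} = \arg\max_{y} \left[ \log P(y) + \sum_{j=1}^{n} \log P(x_j \mid y) \right]
$$

The product form follows from the independence assumption, and working with logarithms
turns it into a sum, which avoids numerical underflow when there are many features.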

`spark.mllib` supports [multinomial naive
Bayes](http://en.wikipedia.org/wiki/Naive_Bayes_classifier#Multinomial_naive_Bayes)
and [Bernoulli naive Bayes](http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html).
These models are typically used for [document classification](http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html).
Within that context, each observation is a document and each
feature represents a term whose value is the frequency of the term (in multinomial naive Bayes) or
a zero or one indicating whether the term was found in the document (in Bernoulli naive Bayes).
Feature values must be nonnegative. The model type is selected with an optional parameter,
either "multinomial" or "bernoulli", with "multinomial" as the default.

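
To make the two feature representations concrete, here is a minimal plain-Python sketch.
It does not use Spark; the helper names `multinomial_features` and `bernoulli_features`,
the vocabulary, and the sample document are hypothetical and chosen only for illustration.

```python
from collections import Counter


def multinomial_features(tokens, vocabulary):
    """Term frequencies: how many times each vocabulary term occurs."""
    counts = Counter(tokens)
    return [float(counts[term]) for term in vocabulary]


def bernoulli_features(tokens, vocabulary):
    """Binary indicators: 1.0 if the term occurs at all, otherwise 0.0."""
    present = set(tokens)
    return [1.0 if term in present else 0.0 for term in vocabulary]


# Hypothetical vocabulary and document, for illustration only.
vocabulary = ["spark", "bayes", "naive", "kafka"]
document = "naive bayes in spark is still naive".split()

tf_vector = multinomial_features(document, vocabulary)    # [1.0, 1.0, 2.0, 0.0]
binary_vector = bernoulli_features(document, vocabulary)  # [1.0, 1.0, 1.0, 0.0]
```

Both vectors are nonnegative, as required. The only difference is that the repeated
term "naive" contributes a count of 2 in the multinomial representation and a 1 in the Bernoulli one.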

[Additive smoothing](http://en.wikipedia.org/wiki/Lidstone_smoothing) can be used by
setting the parameter $\lambda$ (default $1.0$). For document classification, the input feature
vectors are usually sparse, and sparse vectors should be supplied as input to take advantage of
sparsity. Since the training data is only used once, it is not necessary to cache it.
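
The effect of additive smoothing on sparse input can be sketched in plain Python as follows.
This uses the standard additive smoothing estimate for the multinomial model,
(count + λ) / (total + λ · number of features), and is not presented as Spark's actual code.
The function name `smoothed_term_probabilities`, the dict-based sparse representation,
and the sample documents are assumptions made for this example.

```python
def smoothed_term_probabilities(sparse_docs, num_features, lam=1.0):
    """Estimate P(term | class) from the documents of ONE class.

    sparse_docs: list of {feature_index: count} dicts (only nonzero entries).
    lam: additive smoothing parameter (`lambda` is a reserved word in Python).
    """
    totals = {}
    # Only nonzero entries are visited, which is the benefit of sparse input.
    for doc in sparse_docs:
        for index, count in doc.items():
            totals[index] = totals.get(index, 0.0) + count
    denominator = sum(totals.values()) + lam * num_features
    return [(totals.get(j, 0.0) + lam) / denominator for j in range(num_features)]


# Two hypothetical documents of the same class; feature 2 never occurs.
docs = [{0: 2.0, 1: 1.0}, {0: 3.0}]

unsmoothed = smoothed_term_probabilities(docs, num_features=3, lam=0.0)
smoothed = smoothed_term_probabilities(docs, num_features=3, lam=1.0)
```

Without smoothing, the unseen feature gets probability zero, so any document containing it
would be assigned zero probability for this class. With $\lambda = 1.0$ every feature keeps a
small nonzero probability and the estimates still sum to one.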

## Examples

<div class="codetabs">
<div data-lang="scala" markdown="1">

[NaiveBayes](api/scala/index.html#org.apache.spark.mllib.classification.NaiveBayes$) implements
multinomial naive Bayes. It takes an RDD of
[LabeledPoint](api/scala/index.html#org.apache.spark.mllib.regression.LabeledPoint), an optional
smoothing parameter `lambda`, and an optional model type parameter (default is "multinomial") as input, and outputs a
[NaiveBayesModel](api/scala/index.html#org.apache.spark.mllib.classification.NaiveBayesModel), which
can be used for evaluation and prediction.

Refer to the [`NaiveBayes` Scala docs](api/scala/index.html#org.apache.spark.mllib.classification.NaiveBayes) and [`NaiveBayesModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.classification.NaiveBayesModel) for details on the API.

{% include_example scala/org/apache/spark/examples/mllib/NaiveBayesExample.scala %}
</div>

<div data-lang="java" markdown="1">

[NaiveBayes](api/java/org/apache/spark/mllib/classification/NaiveBayes.html) implements
multinomial naive Bayes. It takes a Scala RDD of
[LabeledPoint](api/java/org/apache/spark/mllib/regression/LabeledPoint.html) and an
optional smoothing parameter `lambda` as input, and outputs a
[NaiveBayesModel](api/java/org/apache/spark/mllib/classification/NaiveBayesModel.html), which
can be used for evaluation and prediction.

Refer to the [`NaiveBayes` Java docs](api/java/org/apache/spark/mllib/classification/NaiveBayes.html) and [`NaiveBayesModel` Java docs](api/java/org/apache/spark/mllib/classification/NaiveBayesModel.html) for details on the API.

{% include_example java/org/apache/spark/examples/mllib/JavaNaiveBayesExample.java %}
</div>

<div data-lang="python" markdown="1">

[NaiveBayes](api/python/pyspark.mllib.html#pyspark.mllib.classification.NaiveBayes) implements multinomial
naive Bayes. It takes an RDD of
[LabeledPoint](api/python/pyspark.mllib.html#pyspark.mllib.regression.LabeledPoint) and an optional
smoothing parameter `lambda` as input, and outputs a
[NaiveBayesModel](api/python/pyspark.mllib.html#pyspark.mllib.classification.NaiveBayesModel), which can be
used for evaluation and prediction.

Note that the Python API does not yet support model save/load but will in the future.

Refer to the [`NaiveBayes` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.classification.NaiveBayes) and [`NaiveBayesModel` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.classification.NaiveBayesModel) for more details on the API.

{% include_example python/mllib/naive_bayes_example.py %}
</div>
</div>