---
layout: global
title: Naive Bayes - RDD-based API
displayTitle: Naive Bayes - RDD-based API
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements. See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License. You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---

[Naive Bayes](http://en.wikipedia.org/wiki/Naive_Bayes_classifier) is a simple
multiclass classification algorithm with the assumption of independence between
every pair of features. Naive Bayes can be trained very efficiently. In a
single pass over the training data, it computes the conditional probability
distribution of each feature given the label, and then applies Bayes' theorem to
compute the conditional probability distribution of the label given an observation,
which it uses for prediction.
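
Under the independence assumption, this posterior takes the standard naive Bayes form. As a sketch, with $c$ denoting a label and $x = (x_1, \ldots, x_n)$ an observation's feature vector (symbols introduced here for illustration only):

$$
P(c \mid x) \propto P(c) \prod_{i=1}^{n} P(x_i \mid c)
$$

The predicted label is the $c$ that maximizes the right-hand side.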

`spark.mllib` supports [multinomial naive
Bayes](http://en.wikipedia.org/wiki/Naive_Bayes_classifier#Multinomial_naive_Bayes)
and [Bernoulli naive Bayes](http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html).
These models are typically used for [document classification](http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html).
Within that context, each observation is a document and each
feature represents a term whose value is the frequency of the term (in multinomial naive Bayes) or
a zero or one indicating whether the term was found in the document (in Bernoulli naive Bayes).
Feature values must be nonnegative. The model type is selected with an optional parameter
whose value is either "multinomial" or "bernoulli", with "multinomial" as the default.
[Additive smoothing](http://en.wikipedia.org/wiki/Lidstone_smoothing) can be used by
setting the parameter $\lambda$ (default $1.0$). For document classification, the input feature
vectors are usually sparse, and sparse vectors should be supplied as input to take advantage of
sparsity. Since the training data is only used once, it is not necessary to cache it.

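
To make the single-pass training and the role of $\lambda$ concrete, here is a minimal plain-Python sketch of multinomial naive Bayes with additive smoothing. It is an illustration only, not the `spark.mllib` implementation: the function names `train_multinomial_nb` and `predict`, the dictionary-based sparse vectors, and the choice to smooth the class priors as well are all assumptions made for this example.

```python
import math
from collections import defaultdict


def train_multinomial_nb(data, num_features, lam=1.0):
    """Train on (label, {feature_index: count}) pairs in one pass over the data."""
    doc_counts = defaultdict(int)  # number of documents seen per label
    term_counts = defaultdict(lambda: defaultdict(float))  # summed term counts per label

    # Single pass: only per-label aggregates are kept, so the data is read once.
    for label, features in data:
        doc_counts[label] += 1
        for index, value in features.items():
            if value < 0:
                raise ValueError("feature values must be nonnegative")
            term_counts[label][index] += value

    total_docs = sum(doc_counts.values())
    num_labels = len(doc_counts)
    model = {}
    for label, n_docs in doc_counts.items():
        # Smoothed log prior (smoothing the prior is an assumption of this sketch).
        log_prior = math.log((n_docs + lam) / (total_docs + lam * num_labels))
        # Smoothed term probability: (count + lam) / (total count + lam * num_features).
        denom = sum(term_counts[label].values()) + lam * num_features
        log_theta = [
            math.log((term_counts[label].get(i, 0.0) + lam) / denom)
            for i in range(num_features)
        ]
        model[label] = (log_prior, log_theta)
    return model


def predict(model, features):
    """Return the label with the highest log posterior for a sparse feature dict."""
    def score(item):
        log_prior, log_theta = item[1]
        return log_prior + sum(v * log_theta[i] for i, v in features.items())

    return max(model.items(), key=score)[0]


# Hypothetical toy corpus: label 0.0 favors term 0, label 1.0 favors term 2.
training = [
    (0.0, {0: 2.0, 1: 1.0}),
    (0.0, {0: 3.0}),
    (1.0, {2: 2.0, 1: 1.0}),
    (1.0, {2: 4.0}),
]
model = train_multinomial_nb(training, num_features=3)
print(predict(model, {0: 1.0}))  # 0.0
```

Only the nonzero entries of each vector are touched during training and prediction, which is why supplying sparse input pays off when most terms are absent from a document.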
## Examples

<div class="codetabs">
<div data-lang="scala" markdown="1">

[NaiveBayes](api/scala/org/apache/spark/mllib/classification/NaiveBayes$.html) implements
multinomial naive Bayes. It takes an RDD of
[LabeledPoint](api/scala/org/apache/spark/mllib/regression/LabeledPoint.html), an optional
smoothing parameter `lambda`, and an optional model type parameter (default is "multinomial") as input, and outputs a
[NaiveBayesModel](api/scala/org/apache/spark/mllib/classification/NaiveBayesModel.html), which
can be used for evaluation and prediction.

Refer to the [`NaiveBayes` Scala docs](api/scala/org/apache/spark/mllib/classification/NaiveBayes$.html) and [`NaiveBayesModel` Scala docs](api/scala/org/apache/spark/mllib/classification/NaiveBayesModel.html) for details on the API.

{% include_example scala/org/apache/spark/examples/mllib/NaiveBayesExample.scala %}
</div>

<div data-lang="java" markdown="1">

[NaiveBayes](api/java/org/apache/spark/mllib/classification/NaiveBayes.html) implements
multinomial naive Bayes. It takes a Scala RDD of
[LabeledPoint](api/java/org/apache/spark/mllib/regression/LabeledPoint.html) and an
optional smoothing parameter `lambda` as input, and outputs a
[NaiveBayesModel](api/java/org/apache/spark/mllib/classification/NaiveBayesModel.html), which
can be used for evaluation and prediction.

Refer to the [`NaiveBayes` Java docs](api/java/org/apache/spark/mllib/classification/NaiveBayes.html) and [`NaiveBayesModel` Java docs](api/java/org/apache/spark/mllib/classification/NaiveBayesModel.html) for details on the API.

{% include_example java/org/apache/spark/examples/mllib/JavaNaiveBayesExample.java %}
</div>

<div data-lang="python" markdown="1">

[NaiveBayes](api/python/pyspark.mllib.html#pyspark.mllib.classification.NaiveBayes) implements multinomial
naive Bayes. It takes an RDD of
[LabeledPoint](api/python/pyspark.mllib.html#pyspark.mllib.regression.LabeledPoint) and an optional
smoothing parameter `lambda` as input, and outputs a
[NaiveBayesModel](api/python/pyspark.mllib.html#pyspark.mllib.classification.NaiveBayesModel), which can be
used for evaluation and prediction.

Note that the Python API does not yet support model save/load but will in the future.

Refer to the [`NaiveBayes` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.classification.NaiveBayes) and [`NaiveBayesModel` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.classification.NaiveBayesModel) for more details on the API.

{% include_example python/mllib/naive_bayes_example.py %}
</div>
</div>