In the Markdown docs for the spark.mllib Programming Guide, we have code examples with codetabs for each language. We should link to each language's API docs within the corresponding codetab, but we are inconsistent about this. For an example of what we want to do, see the "ChiSqSelector" section in 64743870f2/docs/mllib-feature-extraction.md
This JIRA is just for spark.mllib, not spark.ml.
Please let me know if more work is needed, thanks a lot.
Author: Xin Ren <iamshrek@126.com>
Closes #8977 from keypointt/SPARK-10669.
layout | title | displayTitle |
---|---|---|
global | Naive Bayes - MLlib | <a href="mllib-guide.html">MLlib</a> - Naive Bayes |
Naive Bayes is a simple multiclass classification algorithm that assumes independence between every pair of features. Naive Bayes can be trained very efficiently. In a single pass over the training data, it computes the conditional probability distribution of each feature given the label; it then applies Bayes' theorem to compute the conditional probability distribution of the label given an observation and uses that distribution for prediction.
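The single-pass training and the Bayes' theorem prediction step can be sketched in plain Python. This is only an illustration for the multinomial case, not MLlib's implementation: the function names `train_naive_bayes` and `predict`, the `smoothing` argument (standing in for additive smoothing), and the unsmoothed class prior are all assumptions made for this sketch.

```python
import math

def train_naive_bayes(samples, smoothing=1.0):
    """One pass over (label, feature_values) pairs.

    Returns (log_prior, log_likelihood), both keyed by label.
    Hypothetical helper for illustration; not part of MLlib.
    """
    label_counts = {}   # number of observations seen per label
    feature_sums = {}   # per-label sum of each feature value
    num_features = None
    for label, features in samples:
        if num_features is None:
            num_features = len(features)
        label_counts[label] = label_counts.get(label, 0) + 1
        sums = feature_sums.setdefault(label, [0.0] * num_features)
        for j, value in enumerate(features):
            sums[j] += value

    total = float(sum(label_counts.values()))
    log_prior = {c: math.log(n / total) for c, n in label_counts.items()}
    log_likelihood = {}
    for c, sums in feature_sums.items():
        # Additive smoothing keeps every probability strictly positive.
        denom = sum(sums) + smoothing * num_features
        log_likelihood[c] = [math.log((s + smoothing) / denom) for s in sums]
    return log_prior, log_likelihood

def predict(model, features):
    """Return the label with the highest (unnormalized) log posterior."""
    log_prior, log_likelihood = model

    def score(c):
        # log P(c) + sum_j x_j * log P(feature j | c), using feature independence
        return log_prior[c] + sum(x * w for x, w in zip(features, log_likelihood[c]))

    return max(log_prior, key=score)
```

Working in log space turns the product of per-feature probabilities into a sum, which avoids numerical underflow when there are many features.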
MLlib supports multinomial naive Bayes and Bernoulli naive Bayes.
These models are typically used for document classification.
Within that context, each observation is a document and each
feature represents a term whose value is the frequency of the term (in multinomial naive Bayes) or
a zero or one indicating whether the term was found in the document (in Bernoulli naive Bayes).
Feature values must be nonnegative. The model type is selected with an optional parameter,
"multinomial" or "bernoulli", with "multinomial" as the default.
Additive smoothing can be used by
setting the parameter $\lambda$
(default $1.0$). For document classification, the input feature
vectors are usually sparse, and sparse vectors should be supplied as input to take advantage of
sparsity. Since the training data is only used once, it is not necessary to cache it.
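To make the two feature encodings and the effect of additive smoothing concrete, here is a small plain-Python sketch. The helper names, the toy vocabulary, and the dict-based sparse representation are assumptions chosen for illustration; they are not MLlib APIs.

```python
from collections import Counter

def multinomial_features(tokens, vocabulary):
    """Sparse term-frequency features as {term_index: count}."""
    counts = Counter(t for t in tokens if t in vocabulary)
    return {vocabulary[t]: float(n) for t, n in counts.items()}

def bernoulli_features(tokens, vocabulary):
    """Sparse presence features: 1.0 for each term that occurs at least once."""
    return {vocabulary[t]: 1.0 for t in set(tokens) if t in vocabulary}

def smoothed_estimate(term_count, total_count, num_terms, smoothing=1.0):
    """Additive-smoothing estimate of P(term | label)."""
    return (term_count + smoothing) / (total_count + smoothing * num_terms)

# Hypothetical three-term vocabulary and document.
vocabulary = {"spark": 0, "mllib": 1, "bayes": 2}
tokens = "spark spark bayes".split()

tf = multinomial_features(tokens, vocabulary)      # term 0 appears twice, term 2 once
presence = bernoulli_features(tokens, vocabulary)  # terms 0 and 2 are present

# A term never seen with a label still gets a nonzero probability.
p_unseen = smoothed_estimate(0, total_count=3, num_terms=3, smoothing=1.0)
```

Only the nonzero entries are stored, which is the point of supplying sparse input. A dict of this shape is one of the forms accepted by `Vectors.sparse(size, ...)` in `pyspark.mllib.linalg`.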
Examples
NaiveBayes implements
multinomial naive Bayes. It takes an RDD of
LabeledPoint, an optional
smoothing parameter lambda,
and an optional model type parameter (default "multinomial") as input, and outputs a
NaiveBayesModel, which
can be used for evaluation and prediction.
Refer to the NaiveBayes
Scala docs and NaiveBayesModel
Scala docs for details on the API.
{% highlight scala %}
import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val data = sc.textFile("data/mllib/sample_naive_bayes_data.txt")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}
// Split data into training (60%) and test (40%).
val splits = parsedData.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0)
val test = splits(1)

val model = NaiveBayes.train(training, lambda = 1.0, modelType = "multinomial")

val predictionAndLabel = test.map(p => (model.predict(p.features), p.label))
val accuracy = 1.0 * predictionAndLabel.filter(x => x._1 == x._2).count() / test.count()

// Save and load model
model.save(sc, "myModelPath")
val sameModel = NaiveBayesModel.load(sc, "myModelPath")
{% endhighlight %}
NaiveBayes implements
multinomial naive Bayes. It takes a Scala RDD of
LabeledPoint and an
optional smoothing parameter lambda
as input, and outputs a
NaiveBayesModel, which
can be used for evaluation and prediction.
Refer to the NaiveBayes
Java docs and NaiveBayesModel
Java docs for details on the API.
{% highlight java %}
import scala.Tuple2;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.mllib.classification.NaiveBayes;
import org.apache.spark.mllib.classification.NaiveBayesModel;
import org.apache.spark.mllib.regression.LabeledPoint;

JavaRDD<LabeledPoint> training = ... // training set
JavaRDD<LabeledPoint> test = ... // test set

final NaiveBayesModel model = NaiveBayes.train(training.rdd(), 1.0);

// Pair each prediction with the true label.
JavaPairRDD<Double, Double> predictionAndLabel =
  test.mapToPair(new PairFunction<LabeledPoint, Double, Double>() {
    @Override
    public Tuple2<Double, Double> call(LabeledPoint p) {
      return new Tuple2<Double, Double>(model.predict(p.features()), p.label());
    }
  });
// Fraction of test points whose prediction matches the label.
double accuracy = predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() {
    @Override
    public Boolean call(Tuple2<Double, Double> pl) {
      return pl._1().equals(pl._2());
    }
  }).count() / (double) test.count();

// Save and load model
model.save(sc.sc(), "myModelPath");
NaiveBayesModel sameModel = NaiveBayesModel.load(sc.sc(), "myModelPath");
{% endhighlight %}
NaiveBayes implements multinomial
naive Bayes. It takes an RDD of
LabeledPoint and an optional
smoothing parameter lambda
as input, and outputs a
NaiveBayesModel, which can be
used for evaluation and prediction.
Note that the Python API does not yet support model save/load but will in the future.
Refer to the NaiveBayes
Python docs and NaiveBayesModel
Python docs for more details on the API.
{% highlight python %}
from pyspark.mllib.classification import NaiveBayes, NaiveBayesModel
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

def parseLine(line):
    parts = line.split(',')
    label = float(parts[0])
    features = Vectors.dense([float(x) for x in parts[1].split(' ')])
    return LabeledPoint(label, features)

data = sc.textFile('data/mllib/sample_naive_bayes_data.txt').map(parseLine)

# Split data approximately into training (60%) and test (40%)
training, test = data.randomSplit([0.6, 0.4], seed=0)

# Train a naive Bayes model.
model = NaiveBayes.train(training, 1.0)

# Make prediction and test accuracy.
predictionAndLabel = test.map(lambda p: (model.predict(p.features), p.label))
# Index into the pair instead of unpacking in the lambda signature,
# which is not valid syntax in Python 3.
accuracy = 1.0 * predictionAndLabel.filter(lambda pl: pl[0] == pl[1]).count() / test.count()

# Save and load model
model.save(sc, "myModelPath")
sameModel = NaiveBayesModel.load(sc, "myModelPath")
{% endhighlight %}
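The parsing and accuracy logic used in the examples above can be checked without a Spark context. The following plain-Python sketch mirrors that logic on ordinary lists; the helper names and the sample line are made up for illustration and are not taken from the data file.

```python
def parse_line(line):
    """Parse a 'label,f1 f2 f3' line into (label, [f1, f2, f3])."""
    label_text, features_text = line.split(',')
    return float(label_text), [float(x) for x in features_text.split(' ')]

def accuracy(prediction_and_label):
    """Fraction of (prediction, label) pairs whose two values are equal."""
    pairs = list(prediction_and_label)
    correct = sum(1 for predicted, actual in pairs if predicted == actual)
    return float(correct) / len(pairs)

# Hypothetical input line in the comma-then-space-separated layout the examples parse.
label, features = parse_line("0,1 0 0")

# Hypothetical predictions paired with true labels: three of four match.
score = accuracy([(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (2.0, 2.0)])
```

In the Spark examples the same two steps run as `map` and `filter(...).count()` over an RDD, so the work is distributed rather than done on a local list.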