(Just made a PR for this, mengxr was the reporter of:) MLlib has sample data under several folders: 1) data/mllib 2) data/ 3) mllib/data/* Per previous discussion with Matei Zaharia, we want to put them under `data/mllib` and clean outdated files. Author: Sean Owen <sowen@cloudera.com> Closes #1394 from srowen/SPARK-2363 and squashes the following commits: 54313dd [Sean Owen] Move ML example data from /mllib/data/ and /data/ into /data/mllib/
layout | title | displayTitle |
---|---|---|
global | Naive Bayes - MLlib | <a href="mllib-guide.html">MLlib</a> - Naive Bayes |
Naive Bayes is a simple multiclass classification algorithm that assumes every pair of features is conditionally independent given the label. Naive Bayes can be trained very efficiently. In a single pass over the training data, it computes the conditional probability distribution of each feature given the label; it then applies Bayes' theorem to compute the conditional probability distribution of the label given an observation and uses that distribution for prediction. For more details, please visit the Wikipedia page Naive Bayes classifier.
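As a brief sketch of the reasoning above (the notation here is an assumption introduced for illustration, not taken from MLlib): for a label $y$ and an observation with features $x_1, \ldots, x_n$, Bayes' theorem combined with the independence assumption gives

\[
P(y \mid x_1, \ldots, x_n)
  = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \ldots, x_n)}
  \propto P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y).
\]

The denominator does not depend on $y$, so prediction only needs the class prior $P(y)$ and the per-feature conditionals $P(x_i \mid y)$, both of which can be estimated from counts gathered in one pass over the training data.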
In MLlib, we implemented multinomial naive Bayes, which is typically used for document
classification. In that context, each observation is a document and each feature represents a term
whose value is the frequency of that term. For its formulation, please visit the Wikipedia page
Multinomial Naive Bayes
or the section
Naive Bayes text classification
from the book Introduction to Information
Retrieval. Additive smoothing can be used by
setting the parameter $\lambda$
(default $1.0$). For document classification, the input feature
vectors are usually sparse. Please supply sparse vectors as input to take advantage of
sparsity. Since the training data is only used once, it is not necessary to cache it.
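To make the counting and smoothing concrete, here is a minimal pure-Python sketch of textbook multinomial naive Bayes with additive smoothing. It is not MLlib's implementation: the function names `train_multinomial_nb` and `predict`, the dict-based sparse representation (`{term_index: count}`), and the choice to leave class priors unsmoothed are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def train_multinomial_nb(data, num_features, lam=1.0):
    """Fit a multinomial naive Bayes model (illustrative sketch only).

    data: list of (label, {term_index: count}) pairs; the dict is a
          hypothetical stand-in for a sparse feature vector.
    lam:  additive smoothing parameter; keep it > 0 so that terms never
          seen with a label do not produce log(0).
    """
    doc_counts = defaultdict(int)                          # documents per label
    term_counts = defaultdict(lambda: defaultdict(float))  # term totals per label

    # Single pass over the training data: only counts are accumulated.
    for label, features in data:
        doc_counts[label] += 1
        for j, value in features.items():
            term_counts[label][j] += value

    total_docs = len(data)
    model = {}
    for label, n_docs in doc_counts.items():
        # Assumption: priors are plain relative frequencies (unsmoothed).
        log_prior = math.log(n_docs / total_docs)
        denom = sum(term_counts[label].values()) + lam * num_features
        log_theta = [
            math.log((term_counts[label].get(j, 0.0) + lam) / denom)
            for j in range(num_features)
        ]
        model[label] = (log_prior, log_theta)
    return model


def predict(model, features):
    """Return the label with the highest log-posterior score."""
    def score(label):
        log_prior, log_theta = model[label]
        # Only non-zero entries contribute, which is why sparse input pays off.
        return log_prior + sum(v * log_theta[j] for j, v in features.items())
    return max(model, key=score)


# Tiny made-up corpus: label 0.0 favors term 0, label 1.0 favors term 2.
docs = [
    (0.0, {0: 2.0, 1: 1.0}),
    (0.0, {0: 3.0}),
    (1.0, {2: 2.0, 1: 1.0}),
    (1.0, {2: 4.0}),
]
model = train_multinomial_nb(docs, num_features=3, lam=1.0)
label_a = predict(model, {0: 1.0})
label_b = predict(model, {2: 2.0})
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and iterating only over the non-zero entries of each observation mirrors the advice to supply sparse vectors.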
Examples
NaiveBayes implements
multinomial naive Bayes. It takes an RDD of
LabeledPoint and an optional
smoothing parameter lambda
as input, and outputs a
NaiveBayesModel, which
can be used for evaluation and prediction.
{% highlight scala %}
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val data = sc.textFile("data/mllib/sample_naive_bayes_data.txt")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}
// Split data into training (60%) and test (40%).
val splits = parsedData.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0)
val test = splits(1)

val model = NaiveBayes.train(training, lambda = 1.0)

val predictionAndLabel = test.map(p => (model.predict(p.features), p.label))
val accuracy = 1.0 * predictionAndLabel.filter(x => x._1 == x._2).count() / test.count()
{% endhighlight %}
NaiveBayes implements
multinomial naive Bayes. It takes a Scala RDD of
LabeledPoint and an
optional smoothing parameter lambda
as input, and outputs a
NaiveBayesModel, which
can be used for evaluation and prediction.
{% highlight java %}
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.mllib.classification.NaiveBayes;
import org.apache.spark.mllib.classification.NaiveBayesModel;
import org.apache.spark.mllib.regression.LabeledPoint;
import scala.Tuple2;

JavaRDD<LabeledPoint> training = ... // training set
JavaRDD<LabeledPoint> test = ... // test set

final NaiveBayesModel model = NaiveBayes.train(training.rdd(), 1.0);

JavaPairRDD<Double, Double> predictionAndLabel =
  test.mapToPair(new PairFunction<LabeledPoint, Double, Double>() {
    @Override public Tuple2<Double, Double> call(LabeledPoint p) {
      return new Tuple2<Double, Double>(model.predict(p.features()), p.label());
    }
  });
double accuracy = 1.0 * predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() {
    @Override public Boolean call(Tuple2<Double, Double> pl) {
      // Compare boxed Doubles by value; == would compare object references.
      return pl._1().equals(pl._2());
    }
  }).count() / test.count();
{% endhighlight %}
NaiveBayes implements multinomial
naive Bayes. It takes an RDD of
LabeledPoint and an optional
smoothing parameter lambda
as input, and outputs a
NaiveBayesModel, which can be
used for evaluation and prediction.
{% highlight python %}
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import NaiveBayes

# an RDD of LabeledPoint
data = sc.parallelize([
  LabeledPoint(0.0, [0.0, 0.0])
  ... # more labeled points
])

# Train a naive Bayes model.
model = NaiveBayes.train(data, 1.0)

# Make prediction.
prediction = model.predict([0.0, 0.0])
{% endhighlight %}