SPARK-1727. Correct small compile errors, typos, and markdown issues in (primarily) MLlib docs

While play-testing the Scala and Java code examples in the MLlib docs, I noticed a number of small compile errors, and some typos. This led to finding and fixing a few similar items in other docs.

Then, while building the site docs to check the result, I came up with a few small suggestions for the build instructions. I also found a few more formatting and markdown issues that surfaced when I accidentally used maruku instead of kramdown.

Author: Sean Owen <sowen@cloudera.com>

Closes #653 from srowen/SPARK-1727 and squashes the following commits:

6e7c38a [Sean Owen] Final doc updates - one more compile error, and use of mean instead of sum and count
8f5e847 [Sean Owen] Fix markdown syntax issues that maruku flags, even though we use kramdown (but only those that do not affect kramdown's output)
99966a9 [Sean Owen] Update issue tracker URL in docs
23c9ac3 [Sean Owen] Add Scala Naive Bayes example, to use existing example data file (whose format needed a tweak)
8c81982 [Sean Owen] Fix small compile errors and typos across MLlib docs
Sean Owen authored on 2014-05-06 20:07:22 -07:00; committed by Patrick Wendell
parent a000b5c3b0
commit 25ad8f9301
17 changed files with 96 additions and 67 deletions

View file

@ -15,8 +15,9 @@ The markdown code can be compiled to HTML using the
To use the `jekyll` command, you will need to have Jekyll installed.
The easiest way to do this is via a Ruby Gem, see the
[jekyll installation instructions](http://jekyllrb.com/docs/installation).
Compiling the site with Jekyll will create a directory called
_site containing index.html as well as the rest of the compiled files.
If not already installed, you need to install `kramdown` with `sudo gem install kramdown`.
Execute `jekyll` from the `docs/` directory. Compiling the site with Jekyll will create a directory called
`_site` containing index.html as well as the rest of the compiled files.
You can modify the default Jekyll build as follows:
@ -44,6 +45,6 @@ You can build just the Spark scaladoc by running `sbt/sbt doc` from the SPARK_PR
Similarly, you can build just the PySpark epydoc by running `epydoc --config epydoc.conf` from the SPARK_PROJECT_ROOT/pyspark directory. Documentation is only generated for classes that are listed as public in `__init__.py`.
When you run `jekyll` in the docs directory, it will also copy over the scaladoc for the various Spark subprojects into the docs directory (and then also into the _site directory). We use a jekyll plugin to run `sbt/sbt doc` before building the site so if you haven't run it (recently) it may take some time as it generates all of the scaladoc. The jekyll plugin also generates the PySpark docs using [epydoc](http://epydoc.sourceforge.net/).
When you run `jekyll` in the `docs` directory, it will also copy over the scaladoc for the various Spark subprojects into the `docs` directory (and then also into the `_site` directory). We use a jekyll plugin to run `sbt/sbt doc` before building the site so if you haven't run it (recently) it may take some time as it generates all of the scaladoc. The jekyll plugin also generates the PySpark docs using [epydoc](http://epydoc.sourceforge.net/).
NOTE: To skip the step of building and copying over the Scala and Python API docs, run `SKIP_API=1 jekyll`.

View file

@ -8,5 +8,5 @@ SPARK_VERSION_SHORT: 1.0.0
SCALA_BINARY_VERSION: "2.10"
SCALA_VERSION: "2.10.4"
MESOS_VERSION: 0.13.0
SPARK_ISSUE_TRACKER_URL: https://spark-project.atlassian.net
SPARK_ISSUE_TRACKER_URL: https://issues.apache.org/jira/browse/SPARK
SPARK_GITHUB_URL: https://github.com/apache/spark

View file

@ -46,7 +46,7 @@ import org.apache.spark.bagel.Bagel._
Next, we load a sample graph from a text file as a distributed dataset and package it into `PRVertex` objects. We also cache the distributed dataset because Bagel will use it multiple times and we'd like to avoid recomputing it.
{% highlight scala %}
val input = sc.textFile("pagerank_data.txt")
val input = sc.textFile("data/pagerank_data.txt")
val numVerts = input.count()

View file

@ -181,7 +181,7 @@ The following table summarizes terms you'll see used to refer to cluster concept
<td>Distinguishes where the driver process runs. In "cluster" mode, the framework launches
the driver inside of the cluster. In "client" mode, the submitter launches the driver
outside of the cluster.</td>
<tr>
</tr>
<tr>
<td>Worker node</td>
<td>Any node that can run application code in the cluster</td>

View file

@ -26,10 +26,10 @@ application name), as well as arbitrary key-value pairs through the `set()` meth
initialize an application as follows:
{% highlight scala %}
val conf = new SparkConf()
.setMaster("local")
.setAppName("My application")
.set("spark.executor.memory", "1g")
val conf = new SparkConf().
setMaster("local").
setAppName("My application").
set("spark.executor.memory", "1g")
val sc = new SparkContext(conf)
{% endhighlight %}
@ -318,7 +318,7 @@ Apart from these, the following properties are also available, and may be useful
When serializing using org.apache.spark.serializer.JavaSerializer, the serializer caches
objects to prevent writing redundant data, however that stops garbage collection of those
objects. By calling 'reset' you flush that info from the serializer, and allow old
objects to be collected. To turn off this periodic reset set it to a value of <= 0.
objects to be collected. To turn off this periodic reset set it to a value &lt;= 0.
By default it will reset the serializer every 10,000 objects.
</td>
</tr>
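For context, here is a minimal sketch of adjusting this setting through `SparkConf`; the property name `spark.serializer.objectStreamReset` is assumed from the description above, since the excerpt does not show it:

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: the property described above is spark.serializer.objectStreamReset.
// A value <= 0 turns the periodic reset off; the default resets every 10,000 objects.
val conf = new SparkConf()
  .setAppName("Serializer reset example")
  .set("spark.serializer.objectStreamReset", "10000")
val sc = new SparkContext(conf)
{% endhighlight %}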

View file

@ -55,7 +55,7 @@ classes. RDD methods like `map` are overloaded by specialized `PairFunction`
and `DoubleFunction` classes, allowing them to return RDDs of the appropriate
types. Common methods like `filter` and `sample` are implemented by
each specialized RDD class, so filtering a `PairRDD` returns a new `PairRDD`,
etc (this acheives the "same-result-type" principle used by the [Scala collections
etc (this achieves the "same-result-type" principle used by the [Scala collections
framework](http://docs.scala-lang.org/overviews/core/architecture-of-scala-collections.html)).
## Function Interfaces
@ -102,7 +102,7 @@ the following changes:
`Function` classes will need to use `implements` rather than `extends`.
* Certain transformation functions now have multiple versions depending
on the return type. In Spark core, the map functions (`map`, `flatMap`, and
`mapPartitons`) have type-specific versions, e.g.
`mapPartitions`) have type-specific versions, e.g.
[`mapToPair`](api/java/org/apache/spark/api/java/JavaRDDLike.html#mapToPair(org.apache.spark.api.java.function.PairFunction))
and [`mapToDouble`](api/java/org/apache/spark/api/java/JavaRDDLike.html#mapToDouble(org.apache.spark.api.java.function.DoubleFunction)).
Spark Streaming also uses the same approach, e.g. [`transformToPair`](api/java/org/apache/spark/streaming/api/java/JavaDStreamLike.html#transformToPair(org.apache.spark.api.java.function.Function)).
@ -115,11 +115,11 @@ As an example, we will implement word count using the Java API.
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.*;
JavaSparkContext sc = new JavaSparkContext(...);
JavaRDD<String> lines = ctx.textFile("hdfs://...");
JavaSparkContext jsc = new JavaSparkContext(...);
JavaRDD<String> lines = jsc.textFile("hdfs://...");
JavaRDD<String> words = lines.flatMap(
new FlatMapFunction<String, String>() {
public Iterable<String> call(String s) {
@Override public Iterable<String> call(String s) {
return Arrays.asList(s.split(" "));
}
}
@ -140,10 +140,10 @@ Here, the `FlatMapFunction` was created inline; another option is to subclass
{% highlight java %}
class Split extends FlatMapFunction<String, String> {
public Iterable<String> call(String s) {
@Override public Iterable<String> call(String s) {
return Arrays.asList(s.split(" "));
}
);
}
JavaRDD<String> words = lines.flatMap(new Split());
{% endhighlight %}
@ -162,8 +162,8 @@ Continuing with the word count example, we map each word to a `(word, 1)` pair:
import scala.Tuple2;
JavaPairRDD<String, Integer> ones = words.mapToPair(
new PairFunction<String, String, Integer>() {
public Tuple2<String, Integer> call(String s) {
return new Tuple2(s, 1);
@Override public Tuple2<String, Integer> call(String s) {
return new Tuple2<String, Integer>(s, 1);
}
}
);
@ -178,7 +178,7 @@ occurrences of each word:
{% highlight java %}
JavaPairRDD<String, Integer> counts = ones.reduceByKey(
new Function2<Integer, Integer, Integer>() {
public Integer call(Integer i1, Integer i2) {
@Override public Integer call(Integer i1, Integer i2) {
return i1 + i2;
}
}

View file

@ -9,7 +9,7 @@ title: <a href="mllib-guide.html">MLlib</a> - Basics
MLlib supports local vectors and matrices stored on a single machine,
as well as distributed matrices backed by one or more RDDs.
In the current implementation, local vectors and matrices are simple data models
to serve public interfaces. The underly linear algebra operations are provided by
to serve public interfaces. The underlying linear algebra operations are provided by
[Breeze](http://www.scalanlp.org/) and [jblas](http://jblas.org/).
A training example used in supervised learning is called "labeled point" in MLlib.
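For illustration, a minimal Scala sketch of constructing such a labeled point with a dense feature vector, using the `Vectors` factory and `LabeledPoint` class covered in this guide:

{% highlight scala %}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// A labeled point pairs a label (here 1.0, e.g. the positive class)
// with a local feature vector.
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
{% endhighlight %}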
@ -205,7 +205,7 @@ import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.util.MLUtils;
import org.apache.spark.rdd.RDDimport;
RDD[LabeledPoint] training = MLUtils.loadLibSVMData(sc, "mllib/data/sample_libsvm_data.txt")
RDD<LabeledPoint> training = MLUtils.loadLibSVMData(jsc, "mllib/data/sample_libsvm_data.txt");
{% endhighlight %}
</div>
</div>
@ -307,6 +307,7 @@ A [`RowMatrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.R
created from a `JavaRDD<Vector>` instance. Then we can compute its column summary statistics.
{% highlight java %}
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.distributed.RowMatrix;
@ -348,10 +349,10 @@ val mat: RowMatrix = ... // a RowMatrix
val summary: MultivariateStatisticalSummary = mat.computeColumnSummaryStatistics()
println(summary.mean) // a dense vector containing the mean value for each column
println(summary.variance) // column-wise variance
println(summary.numNonzers) // number of nonzeros in each column
println(summary.numNonzeros) // number of nonzeros in each column
// Compute the covariance matrix.
val Cov: Matrix = mat.computeCovariance()
val cov: Matrix = mat.computeCovariance()
{% endhighlight %}
</div>
</div>
@ -397,11 +398,12 @@ wrapper over `(long, Vector)`. An `IndexedRowMatrix` can be converted to a `Row
its row indices.
{% highlight java %}
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.distributed.IndexedRow;
import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix;
import org.apache.spark.mllib.linalg.distributed.RowMatrix;
JavaRDD[IndexedRow] rows = ... // a JavaRDD of indexed rows
JavaRDD<IndexedRow> rows = ... // a JavaRDD of indexed rows
// Create an IndexedRowMatrix from a JavaRDD<IndexedRow>.
IndexedRowMatrix mat = new IndexedRowMatrix(rows.rdd());
@ -458,7 +460,9 @@ wrapper over `(long, long, double)`. A `CoordinateMatrix` can be converted to a
with sparse rows by calling `toIndexedRowMatrix`.
{% highlight scala %}
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix;
import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix;
import org.apache.spark.mllib.linalg.distributed.MatrixEntry;
JavaRDD<MatrixEntry> entries = ... // a JavaRDD of matrix entries

View file

@ -18,7 +18,7 @@ models are trained for each cluster).
MLlib supports
[k-means](http://en.wikipedia.org/wiki/K-means_clustering) clustering, one of
the most commonly used clustering algorithms that clusters the data points into
predfined number of clusters. The MLlib implementation includes a parallelized
predefined number of clusters. The MLlib implementation includes a parallelized
variant of the [k-means++](http://en.wikipedia.org/wiki/K-means%2B%2B) method
called [kmeans||](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf).
The implementation in MLlib has the following parameters:
@ -30,7 +30,7 @@ initialization via k-means\|\|.
* *runs* is the number of times to run the k-means algorithm (k-means is not
guaranteed to find a globally optimal solution, and when run multiple times on
a given dataset, the algorithm returns the best clustering result).
* *initializiationSteps* determines the number of steps in the k-means\|\| algorithm.
* *initializationSteps* determines the number of steps in the k-means\|\| algorithm.
* *epsilon* determines the distance threshold within which we consider k-means to have converged.
## Examples
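As a quick illustration of the parameters above, a hedged Scala sketch of training a k-means model (the input path and parameter values are placeholders):

{% highlight scala %}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Placeholder input: one space-separated feature vector per line.
val data = sc.textFile("data/kmeans_data.txt")
val parsedData = data.map(line => Vectors.dense(line.split(' ').map(_.toDouble)))

// k = 2 clusters, at most 20 iterations, 3 independent runs (the best result is kept).
val clusters = KMeans.train(parsedData, 2, 20, 3)

// Within-set sum of squared errors, a common way to evaluate the clustering.
val WSSSE = clusters.computeCost(parsedData)
println("Within Set Sum of Squared Errors = " + WSSSE)
{% endhighlight %}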

View file

@ -77,7 +77,7 @@ val ratesAndPreds = ratings.map{
}.join(predictions)
val MSE = ratesAndPreds.map{
case ((user, product), (r1, r2)) => math.pow((r1- r2), 2)
}.reduce(_ + _)/ratesAndPreds.count
}.mean()
println("Mean Squared Error = " + MSE)
{% endhighlight %}

View file

@ -83,19 +83,19 @@ Section 9.2.4 in
[Elements of Statistical Machine Learning](http://statweb.stanford.edu/~tibs/ElemStatLearn/) for
details). For example, for a binary classification problem with one categorical feature with three
categories A, B and C with corresponding proportion of label 1 as 0.2, 0.6 and 0.4, the categorical
features are orded as A followed by C followed B or A, B, C. The two split candidates are A \| C, B
features are ordered as A followed by C followed B or A, B, C. The two split candidates are A \| C, B
and A , B \| C where \| denotes the split.
### Stopping rule
The recursive tree construction is stopped at a node when one of the two conditions is met:
1. The node depth is equal to the `maxDepth` training parammeter
1. The node depth is equal to the `maxDepth` training parameter
2. No split candidate leads to an information gain at the node.
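To show where `maxDepth` enters in practice, a rough Scala sketch of training a classification tree; the exact `DecisionTree.train` overload used here (algorithm, impurity, maximum depth) is assumed, since it is not shown in this excerpt:

{% highlight scala %}
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.configuration.Algo.Classification
import org.apache.spark.mllib.tree.impurity.Gini

// parsedData is assumed to be an RDD[LabeledPoint] prepared as elsewhere in this guide.
// Growth stops once a node reaches maxDepth or no split improves the information gain.
val maxDepth = 5
val model = DecisionTree.train(parsedData, Classification, Gini, maxDepth)
{% endhighlight %}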
### Practical limitations
1. The tree implementation stores an Array[Double] of size *O(#features \* #splits \* 2^maxDepth)*
1. The tree implementation stores an `Array[Double]` of size *O(#features \* #splits \* 2^maxDepth)*
in memory for aggregating histograms over partitions. The current implementation might not scale
to very deep trees since the memory requirement grows exponentially with tree depth.
2. The implemented algorithm reads both sparse and dense data. However, it is not optimized for
@ -178,7 +178,7 @@ val valuesAndPreds = parsedData.map { point =>
val prediction = model.predict(point.features)
(point.label, prediction)
}
val MSE = valuesAndPreds.map{ case(v, p) => math.pow((v - p), 2)}.reduce(_ + _)/valuesAndPreds.count
val MSE = valuesAndPreds.map{ case(v, p) => math.pow((v - p), 2)}.mean()
println("training Mean Squared Error = " + MSE)
{% endhighlight %}
</div>

View file

@ -44,6 +44,10 @@ say, less than $1000$, but many rows, which we call *tall-and-skinny*.
<div class="codetabs">
<div data-lang="scala" markdown="1">
{% highlight scala %}
import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg.SingularValueDecomposition
val mat: RowMatrix = ...
// Compute the top 20 singular values and corresponding singular vectors.
@ -74,6 +78,9 @@ and use them to project the vectors into a low-dimensional space.
The number of columns should be small, e.g, less than 1000.
{% highlight scala %}
import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val mat: RowMatrix = ...
// Compute the top 10 principal components.

View file

@ -94,7 +94,7 @@ import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
double[] array = ... // a double array
Vector vector = Vectors.dense(array) // a dense vector
Vector vector = Vectors.dense(array); // a dense vector
{% endhighlight %}
[`Vectors`](api/mllib/index.html#org.apache.spark.mllib.linalg.Vectors$) provides factory methods to

View file

@ -63,7 +63,7 @@ methods MLlib supports:
<tbody>
<tr>
<td>hinge loss</td><td>$\max \{0, 1-y \wv^T \x \}, \quad y \in \{-1, +1\}$</td>
<td>$\begin{cases}-y \cdot \x & \text{if $y \wv^T \x <1$}, \\ 0 &
<td>$\begin{cases}-y \cdot \x &amp; \text{if $y \wv^T \x &lt;1$}, \\ 0 &amp;
\text{otherwise}.\end{cases}$</td>
</tr>
<tr>
@ -225,10 +225,11 @@ algorithm for 200 iterations.
import org.apache.spark.mllib.optimization.L1Updater
val svmAlg = new SVMWithSGD()
svmAlg.optimizer.setNumIterations(200)
.setRegParam(0.1)
.setUpdater(new L1Updater)
val modelL1 = svmAlg.run(parsedData)
svmAlg.optimizer.
setNumIterations(200).
setRegParam(0.1).
setUpdater(new L1Updater)
val modelL1 = svmAlg.run(training)
{% endhighlight %}
Similarly, you can use replace `SVMWithSGD` by
@ -322,7 +323,7 @@ val valuesAndPreds = parsedData.map { point =>
val prediction = model.predict(point.features)
(point.label, prediction)
}
val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.reduce(_ + _) / valuesAndPreds.count
val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()
println("training Mean Squared Error = " + MSE)
{% endhighlight %}

View file

@ -7,13 +7,13 @@ Naive Bayes is a simple multiclass classification algorithm with the assumption
between every pair of features. Naive Bayes can be trained very efficiently. Within a single pass to
the training data, it computes the conditional probability distribution of each feature given label,
and then it applies Bayes' theorem to compute the conditional probability distribution of label
given an observation and use it for prediction. For more details, please visit the wikipedia page
given an observation and use it for prediction. For more details, please visit the Wikipedia page
[Naive Bayes classifier](http://en.wikipedia.org/wiki/Naive_Bayes_classifier).
In MLlib, we implemented multinomial naive Bayes, which is typically used for document
classification. Within that context, each observation is a document, each feature represents a term,
whose value is the frequency of the term. For its formulation, please visit the wikipedia page
[Multinomial naive Bayes](http://en.wikipedia.org/wiki/Naive_Bayes_classifier#Multinomial_naive_Bayes)
whose value is the frequency of the term. For its formulation, please visit the Wikipedia page
[Multinomial Naive Bayes](http://en.wikipedia.org/wiki/Naive_Bayes_classifier#Multinomial_naive_Bayes)
or the section
[Naive Bayes text classification](http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html)
from the book Introduction to Information
@ -36,9 +36,18 @@ can be used for evaluation and prediction.
{% highlight scala %}
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
val training: RDD[LabeledPoint] = ... // training set
val test: RDD[LabeledPoint] = ... // test set
val data = sc.textFile("mllib/data/sample_naive_bayes_data.txt")
val parsedData = data.map { line =>
val parts = line.split(',')
LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}
// Split data into training (60%) and test (40%).
val splits = parsedData.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0)
val test = splits(1)
val model = NaiveBayes.train(training, lambda = 1.0)
val prediction = model.predict(test.map(_.features))
@ -58,29 +67,36 @@ optionally smoothing parameter `lambda` as input, and output a
can be used for evaluation and prediction.
{% highlight java %}
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.classification.NaiveBayes;
import org.apache.spark.mllib.classification.NaiveBayesModel;
import org.apache.spark.mllib.regression.LabeledPoint;
import scala.Tuple2;
JavaRDD<LabeledPoint> training = ... // training set
JavaRDD<LabeledPoint> test = ... // test set
NaiveBayesModel model = NaiveBayes.train(training.rdd(), 1.0);
final NaiveBayesModel model = NaiveBayes.train(training.rdd(), 1.0);
JavaRDD<Double> prediction = model.predict(test.map(new Function<LabeledPoint, Vector>() {
public Vector call(LabeledPoint p) {
return p.features();
JavaRDD<Double> prediction =
test.map(new Function<LabeledPoint, Double>() {
@Override public Double call(LabeledPoint p) {
return model.predict(p.features());
}
})
});
JavaPairRDD<Double, Double> predictionAndLabel =
prediction.zip(test.map(new Function<LabeledPoint, Double>() {
public Double call(LabeledPoint p) {
@Override public Double call(LabeledPoint p) {
return p.label();
}
})
}));
double accuracy = 1.0 * predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() {
public Boolean call(Tuple2<Double, Double> pl) {
@Override public Boolean call(Tuple2<Double, Double> pl) {
return pl._1() == pl._2();
}
}).count() / test.count()
}).count() / test.count();
{% endhighlight %}
</div>
@ -93,7 +109,7 @@ smoothing parameter `lambda` as input, and output a
[NaiveBayesModel](api/pyspark/pyspark.mllib.classification.NaiveBayesModel-class.html), which can be
used for evaluation and prediction.
<!--- TODO: Make Python's example consistent with Scala's and Java's. --->
<!-- TODO: Make Python's example consistent with Scala's and Java's. -->
{% highlight python %}
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import NaiveBayes

View file

@ -48,12 +48,12 @@ how to access a cluster. To create a `SparkContext` you first need to build a `S
that contains information about your application.
{% highlight scala %}
val conf = new SparkConf().setAppName(<app name>).setMaster(<master>)
val conf = new SparkConf().setAppName(appName).setMaster(master)
new SparkContext(conf)
{% endhighlight %}
The `<master>` parameter is a string specifying a [Spark, Mesos or YARN cluster URL](#master-urls)
to connect to, or a special "local" string to run in local mode, as described below. `<app name>` is
The `master` parameter is a string specifying a [Spark, Mesos or YARN cluster URL](#master-urls)
to connect to, or a special "local" string to run in local mode, as described below. `appName` is
a name for your application, which will be shown in the cluster web UI. It's also possible to set
these variables [using a configuration file](cluster-overview.html#loading-configurations-from-a-file)
which avoids hard-coding the master name in your application.
@ -81,9 +81,8 @@ The master URL passed to Spark can be in one of the following formats:
<table class="table">
<tr><th>Master URL</th><th>Meaning</th></tr>
<tr><td> local </td><td> Run Spark locally with one worker thread (i.e. no parallelism at all). </td></tr>
<tr><td> local[K] </td><td> Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
<tr><td> local[K] </td><td> Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine). </td></tr>
<tr><td> local[*] </td><td> Run Spark locally with as many worker threads as logical cores on your machine.</td></tr>
</td></tr>
<tr><td> spark://HOST:PORT </td><td> Connect to the given <a href="spark-standalone.html">Spark standalone
cluster</a> master. The port must be whichever one your master is configured to use, which is 7077 by default.
</td></tr>
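As a small illustration (a sketch; the host name and thread count are placeholders), these URLs are simply what gets passed to `setMaster`:

{% highlight scala %}
import org.apache.spark.SparkConf

// Run locally with 4 worker threads...
val localConf = new SparkConf().setAppName("MyApp").setMaster("local[4]")

// ...or connect to a standalone cluster master (placeholder host, default port 7077).
val clusterConf = new SparkConf().setAppName("MyApp").setMaster("spark://host:7077")
{% endhighlight %}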

View file

@ -416,3 +416,4 @@ results = hiveCtx.hql("FROM src SELECT key, value").collect()
{% endhighlight %}
</div>
</div>

View file

@ -1,6 +1,6 @@
0, 1 0 0
0, 2 0 0
1, 0 1 0
1, 0 2 0
2, 0 0 1
2, 0 0 2
0,1 0 0
0,2 0 0
1,0 1 0
1,0 2 0
2,0 0 1
2,0 0 2