SPARK-1727. Correct small compile errors, typos, and markdown issues in (primarily) MLlib docs

While play-testing the Scala and Java code examples in the MLlib docs, I noticed a number of small compile errors, and some typos. This led to finding and fixing a few similar items in other docs.

Then, while building the site docs to check the result, I came up with a few small suggestions for the build instructions. I also found a few more formatting and markdown issues that surfaced when I accidentally used maruku instead of kramdown.

Author: Sean Owen <sowen@cloudera.com>

Closes #653 from srowen/SPARK-1727 and squashes the following commits:

6e7c38a [Sean Owen] Final doc updates - one more compile error, and use of mean instead of sum and count
8f5e847 [Sean Owen] Fix markdown syntax issues that maruku flags, even though we use kramdown (but only those that do not affect kramdown's output)
99966a9 [Sean Owen] Update issue tracker URL in docs
23c9ac3 [Sean Owen] Add Scala Naive Bayes example, to use existing example data file (whose format needed a tweak)
8c81982 [Sean Owen] Fix small compile errors and typos across MLlib docs
Sean Owen authored on 2014-05-06 20:07:22 -07:00; committed by Patrick Wendell
parent a000b5c3b0
commit 25ad8f9301
17 changed files with 96 additions and 67 deletions

View file

@ -15,8 +15,9 @@ The markdown code can be compiled to HTML using the
To use the `jekyll` command, you will need to have Jekyll installed.
The easiest way to do this is via a Ruby Gem, see the
[jekyll installation instructions](http://jekyllrb.com/docs/installation).
Compiling the site with Jekyll will create a directory called
_site containing index.html as well as the rest of the compiled files.
If not already installed, you need to install `kramdown` with `sudo gem install kramdown`.
Execute `jekyll` from the `docs/` directory. Compiling the site with Jekyll will create a directory called
`_site` containing index.html as well as the rest of the compiled files.
You can modify the default Jekyll build as follows:
@ -44,6 +45,6 @@ You can build just the Spark scaladoc by running `sbt/sbt doc` from the SPARK_PR
Similarly, you can build just the PySpark epydoc by running `epydoc --config epydoc.conf` from the SPARK_PROJECT_ROOT/pyspark directory. Documentation is only generated for classes that are listed as public in `__init__.py`.
When you run `jekyll` in the docs directory, it will also copy over the scaladoc for the various Spark subprojects into the docs directory (and then also into the _site directory). We use a jekyll plugin to run `sbt/sbt doc` before building the site so if you haven't run it (recently) it may take some time as it generates all of the scaladoc. The jekyll plugin also generates the PySpark docs using [epydoc](http://epydoc.sourceforge.net/).
When you run `jekyll` in the `docs` directory, it will also copy over the scaladoc for the various Spark subprojects into the `docs` directory (and then also into the `_site` directory). We use a jekyll plugin to run `sbt/sbt doc` before building the site so if you haven't run it (recently) it may take some time as it generates all of the scaladoc. The jekyll plugin also generates the PySpark docs using [epydoc](http://epydoc.sourceforge.net/).
NOTE: To skip the step of building and copying over the Scala and Python API docs, run `SKIP_API=1 jekyll`.

View file

@ -8,5 +8,5 @@ SPARK_VERSION_SHORT: 1.0.0
SCALA_BINARY_VERSION: "2.10"
SCALA_VERSION: "2.10.4"
MESOS_VERSION: 0.13.0
SPARK_ISSUE_TRACKER_URL: https://spark-project.atlassian.net
SPARK_ISSUE_TRACKER_URL: https://issues.apache.org/jira/browse/SPARK
SPARK_GITHUB_URL: https://github.com/apache/spark

View file

@ -46,7 +46,7 @@ import org.apache.spark.bagel.Bagel._
Next, we load a sample graph from a text file as a distributed dataset and package it into `PRVertex` objects. We also cache the distributed dataset because Bagel will use it multiple times and we'd like to avoid recomputing it.
{% highlight scala %}
val input = sc.textFile("pagerank_data.txt")
val input = sc.textFile("data/pagerank_data.txt")
val numVerts = input.count()

View file

@ -181,7 +181,7 @@ The following table summarizes terms you'll see used to refer to cluster concept
<td>Distinguishes where the driver process runs. In "cluster" mode, the framework launches
the driver inside of the cluster. In "client" mode, the submitter launches the driver
outside of the cluster.</td>
<tr>
</tr>
<tr>
<td>Worker node</td>
<td>Any node that can run application code in the cluster</td>

View file

@ -26,10 +26,10 @@ application name), as well as arbitrary key-value pairs through the `set()` meth
initialize an application as follows:
{% highlight scala %}
val conf = new SparkConf()
.setMaster("local")
.setAppName("My application")
.set("spark.executor.memory", "1g")
val conf = new SparkConf().
setMaster("local").
setAppName("My application").
set("spark.executor.memory", "1g")
val sc = new SparkContext(conf)
{% endhighlight %}
@ -318,7 +318,7 @@ Apart from these, the following properties are also available, and may be useful
When serializing using org.apache.spark.serializer.JavaSerializer, the serializer caches
objects to prevent writing redundant data, however that stops garbage collection of those
objects. By calling 'reset' you flush that info from the serializer, and allow old
objects to be collected. To turn off this periodic reset set it to a value of <= 0.
objects to be collected. To turn off this periodic reset set it to a value &lt;= 0.
By default it will reset the serializer every 10,000 objects.
</td>
</tr>
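For context, here is a minimal sketch of adjusting this setting through `SparkConf`; the property name `spark.serializer.objectStreamReset` is assumed from the description above, since the excerpt does not show it:

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: the property described above is spark.serializer.objectStreamReset.
// A value <= 0 turns the periodic reset off; the default resets every 10,000 objects.
val conf = new SparkConf()
  .setAppName("Serializer reset example")
  .set("spark.serializer.objectStreamReset", "10000")
val sc = new SparkContext(conf)
{% endhighlight %}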

View file

@ -55,7 +55,7 @@ classes. RDD methods like `map` are overloaded by specialized `PairFunction`
and `DoubleFunction` classes, allowing them to return RDDs of the appropriate
types. Common methods like `filter` and `sample` are implemented by
each specialized RDD class, so filtering a `PairRDD` returns a new `PairRDD`,
etc (this acheives the "same-result-type" principle used by the [Scala collections
etc (this achieves the "same-result-type" principle used by the [Scala collections
framework](http://docs.scala-lang.org/overviews/core/architecture-of-scala-collections.html)).
## Function Interfaces
@ -102,7 +102,7 @@ the following changes:
`Function` classes will need to use `implements` rather than `extends`.
* Certain transformation functions now have multiple versions depending
on the return type. In Spark core, the map functions (`map`, `flatMap`, and
`mapPartitons`) have type-specific versions, e.g.
`mapPartitions`) have type-specific versions, e.g.
[`mapToPair`](api/java/org/apache/spark/api/java/JavaRDDLike.html#mapToPair(org.apache.spark.api.java.function.PairFunction))
and [`mapToDouble`](api/java/org/apache/spark/api/java/JavaRDDLike.html#mapToDouble(org.apache.spark.api.java.function.DoubleFunction)).
Spark Streaming also uses the same approach, e.g. [`transformToPair`](api/java/org/apache/spark/streaming/api/java/JavaDStreamLike.html#transformToPair(org.apache.spark.api.java.function.Function)).
@ -115,11 +115,11 @@ As an example, we will implement word count using the Java API.
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.*;
JavaSparkContext sc = new JavaSparkContext(...);
JavaRDD<String> lines = ctx.textFile("hdfs://...");
JavaSparkContext jsc = new JavaSparkContext(...);
JavaRDD<String> lines = jsc.textFile("hdfs://...");
JavaRDD<String> words = lines.flatMap(
new FlatMapFunction<String, String>() {
public Iterable<String> call(String s) {
@Override public Iterable<String> call(String s) {
return Arrays.asList(s.split(" "));
}
}
@ -140,10 +140,10 @@ Here, the `FlatMapFunction` was created inline; another option is to subclass
{% highlight java %}
class Split extends FlatMapFunction<String, String> {
public Iterable<String> call(String s) {
@Override public Iterable<String> call(String s) {
return Arrays.asList(s.split(" "));
}
);
}
JavaRDD<String> words = lines.flatMap(new Split());
{% endhighlight %}
@ -162,8 +162,8 @@ Continuing with the word count example, we map each word to a `(word, 1)` pair:
import scala.Tuple2;
JavaPairRDD<String, Integer> ones = words.mapToPair(
new PairFunction<String, String, Integer>() {
public Tuple2<String, Integer> call(String s) {
return new Tuple2(s, 1);
@Override public Tuple2<String, Integer> call(String s) {
return new Tuple2<String, Integer>(s, 1);
}
}
);
@ -178,7 +178,7 @@ occurrences of each word:
{% highlight java %}
JavaPairRDD<String, Integer> counts = ones.reduceByKey(
new Function2<Integer, Integer, Integer>() {
public Integer call(Integer i1, Integer i2) {
@Override public Integer call(Integer i1, Integer i2) {
return i1 + i2;
}
}

View file

@ -9,7 +9,7 @@ title: <a href="mllib-guide.html">MLlib</a> - Basics
MLlib supports local vectors and matrices stored on a single machine,
as well as distributed matrices backed by one or more RDDs.
In the current implementation, local vectors and matrices are simple data models
to serve public interfaces. The underly linear algebra operations are provided by
to serve public interfaces. The underlying linear algebra operations are provided by
[Breeze](http://www.scalanlp.org/) and [jblas](http://jblas.org/).
A training example used in supervised learning is called "labeled point" in MLlib.
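For illustration, a minimal Scala sketch of constructing such a labeled point with a dense feature vector, using the `Vectors` factory and `LabeledPoint` class covered in this guide:

{% highlight scala %}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// A labeled point pairs a label (here 1.0, e.g. the positive class)
// with a local feature vector.
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
{% endhighlight %}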
@ -205,7 +205,7 @@ import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.util.MLUtils;
import org.apache.spark.rdd.RDDimport;
RDD[LabeledPoint] training = MLUtils.loadLibSVMData(sc, "mllib/data/sample_libsvm_data.txt")
RDD<LabeledPoint> training = MLUtils.loadLibSVMData(jsc, "mllib/data/sample_libsvm_data.txt");
{% endhighlight %}
</div>
</div>
@ -307,6 +307,7 @@ A [`RowMatrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.R
created from a `JavaRDD<Vector>` instance. Then we can compute its column summary statistics.
{% highlight java %}
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.distributed.RowMatrix;
@ -348,10 +349,10 @@ val mat: RowMatrix = ... // a RowMatrix
val summary: MultivariateStatisticalSummary = mat.computeColumnSummaryStatistics()
println(summary.mean) // a dense vector containing the mean value for each column
println(summary.variance) // column-wise variance
println(summary.numNonzers) // number of nonzeros in each column
println(summary.numNonzeros) // number of nonzeros in each column
// Compute the covariance matrix.
val Cov: Matrix = mat.computeCovariance()
val cov: Matrix = mat.computeCovariance()
{% endhighlight %}
</div>
</div>
@ -397,11 +398,12 @@ wrapper over `(long, Vector)`. An `IndexedRowMatrix` can be converted to a `Row
its row indices.
{% highlight java %}
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.distributed.IndexedRow;
import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix;
import org.apache.spark.mllib.linalg.distributed.RowMatrix;
JavaRDD[IndexedRow] rows = ... // a JavaRDD of indexed rows
JavaRDD<IndexedRow> rows = ... // a JavaRDD of indexed rows
// Create an IndexedRowMatrix from a JavaRDD<IndexedRow>.
IndexedRowMatrix mat = new IndexedRowMatrix(rows.rdd());
@ -458,7 +460,9 @@ wrapper over `(long, long, double)`. A `CoordinateMatrix` can be converted to a
with sparse rows by calling `toIndexedRowMatrix`.
{% highlight scala %}
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix;
import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix;
import org.apache.spark.mllib.linalg.distributed.MatrixEntry;
JavaRDD<MatrixEntry> entries = ... // a JavaRDD of matrix entries

View file

@ -18,7 +18,7 @@ models are trained for each cluster).
MLlib supports
[k-means](http://en.wikipedia.org/wiki/K-means_clustering) clustering, one of
the most commonly used clustering algorithms that clusters the data points into
predfined number of clusters. The MLlib implementation includes a parallelized
predefined number of clusters. The MLlib implementation includes a parallelized
variant of the [k-means++](http://en.wikipedia.org/wiki/K-means%2B%2B) method
called [kmeans||](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf).
The implementation in MLlib has the following parameters:
@ -30,7 +30,7 @@ initialization via k-means\|\|.
* *runs* is the number of times to run the k-means algorithm (k-means is not
guaranteed to find a globally optimal solution, and when run multiple times on
a given dataset, the algorithm returns the best clustering result).
* *initializiationSteps* determines the number of steps in the k-means\|\| algorithm.
* *initializationSteps* determines the number of steps in the k-means\|\| algorithm.
* *epsilon* determines the distance threshold within which we consider k-means to have converged.
## Examples
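As a quick illustration of the parameters above, a hedged Scala sketch of training a k-means model (the input path and parameter values are placeholders):

{% highlight scala %}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Placeholder input: one space-separated feature vector per line.
val data = sc.textFile("data/kmeans_data.txt")
val parsedData = data.map(line => Vectors.dense(line.split(' ').map(_.toDouble)))

// k = 2 clusters, at most 20 iterations, 3 independent runs (the best result is kept).
val clusters = KMeans.train(parsedData, 2, 20, 3)

// Within-set sum of squared errors, a common way to evaluate the clustering.
val WSSSE = clusters.computeCost(parsedData)
println("Within Set Sum of Squared Errors = " + WSSSE)
{% endhighlight %}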

View file

@ -77,7 +77,7 @@ val ratesAndPreds = ratings.map{
}.join(predictions)
val MSE = ratesAndPreds.map{
case ((user, product), (r1, r2)) => math.pow((r1- r2), 2)
}.reduce(_ + _)/ratesAndPreds.count
}.mean()
println("Mean Squared Error = " + MSE)
{% endhighlight %}

View file

@ -83,19 +83,19 @@ Section 9.2.4 in
[Elements of Statistical Machine Learning](http://statweb.stanford.edu/~tibs/ElemStatLearn/) for
details). For example, for a binary classification problem with one categorical feature with three
categories A, B and C with corresponding proportion of label 1 as 0.2, 0.6 and 0.4, the categorical
features are orded as A followed by C followed B or A, B, C. The two split candidates are A \| C, B
features are ordered as A followed by C followed B or A, B, C. The two split candidates are A \| C, B
and A , B \| C where \| denotes the split.
### Stopping rule
The recursive tree construction is stopped at a node when one of the two conditions is met:
1. The node depth is equal to the `maxDepth` training parammeter
1. The node depth is equal to the `maxDepth` training parameter
2. No split candidate leads to an information gain at the node.
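To show where `maxDepth` enters in practice, a rough Scala sketch of training a classification tree; the exact `DecisionTree.train` overload used here (algorithm, impurity, maximum depth) is assumed, since it is not shown in this excerpt:

{% highlight scala %}
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.configuration.Algo.Classification
import org.apache.spark.mllib.tree.impurity.Gini

// parsedData is assumed to be an RDD[LabeledPoint] prepared as elsewhere in this guide.
// Growth stops once a node reaches maxDepth or no split improves the information gain.
val maxDepth = 5
val model = DecisionTree.train(parsedData, Classification, Gini, maxDepth)
{% endhighlight %}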
### Practical limitations
1. The tree implementation stores an Array[Double] of size *O(#features \* #splits \* 2^maxDepth)*
1. The tree implementation stores an `Array[Double]` of size *O(#features \* #splits \* 2^maxDepth)*
in memory for aggregating histograms over partitions. The current implementation might not scale
to very deep trees since the memory requirement grows exponentially with tree depth.
2. The implemented algorithm reads both sparse and dense data. However, it is not optimized for
@ -178,7 +178,7 @@ val valuesAndPreds = parsedData.map { point =>
val prediction = model.predict(point.features)
(point.label, prediction)
}
val MSE = valuesAndPreds.map{ case(v, p) => math.pow((v - p), 2)}.reduce(_ + _)/valuesAndPreds.count
val MSE = valuesAndPreds.map{ case(v, p) => math.pow((v - p), 2)}.mean()
println("training Mean Squared Error = " + MSE)
{% endhighlight %}
</div>

View file

@ -44,6 +44,10 @@ say, less than $1000$, but many rows, which we call *tall-and-skinny*.
<div class="codetabs">
<div data-lang="scala" markdown="1">
{% highlight scala %}
import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg.SingularValueDecomposition
val mat: RowMatrix = ...
// Compute the top 20 singular values and corresponding singular vectors.
@ -74,6 +78,9 @@ and use them to project the vectors into a low-dimensional space.
The number of columns should be small, e.g, less than 1000.
{% highlight scala %}
import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val mat: RowMatrix = ...
// Compute the top 10 principal components.

View file

@ -94,7 +94,7 @@ import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
double[] array = ... // a double array
Vector vector = Vectors.dense(array) // a dense vector
Vector vector = Vectors.dense(array); // a dense vector
{% endhighlight %}
[`Vectors`](api/mllib/index.html#org.apache.spark.mllib.linalg.Vectors$) provides factory methods to

View file

@ -63,7 +63,7 @@ methods MLlib supports:
<tbody>
<tr>
<td>hinge loss</td><td>$\max \{0, 1-y \wv^T \x \}, \quad y \in \{-1, +1\}$</td>
<td>$\begin{cases}-y \cdot \x & \text{if $y \wv^T \x <1$}, \\ 0 &
<td>$\begin{cases}-y \cdot \x &amp; \text{if $y \wv^T \x &lt;1$}, \\ 0 &amp;
\text{otherwise}.\end{cases}$</td>
</tr>
<tr>
@ -225,10 +225,11 @@ algorithm for 200 iterations.
import org.apache.spark.mllib.optimization.L1Updater
val svmAlg = new SVMWithSGD()
svmAlg.optimizer.setNumIterations(200)
.setRegParam(0.1)
.setUpdater(new L1Updater)
val modelL1 = svmAlg.run(parsedData)
svmAlg.optimizer.
setNumIterations(200).
setRegParam(0.1).
setUpdater(new L1Updater)
val modelL1 = svmAlg.run(training)
{% endhighlight %}
Similarly, you can use replace `SVMWithSGD` by
@ -322,7 +323,7 @@ val valuesAndPreds = parsedData.map { point =>
val prediction = model.predict(point.features)
(point.label, prediction)
}
val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.reduce(_ + _) / valuesAndPreds.count
val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()
println("training Mean Squared Error = " + MSE)
{% endhighlight %}

View file

@ -7,13 +7,13 @@ Naive Bayes is a simple multiclass classification algorithm with the assumption
between every pair of features. Naive Bayes can be trained very efficiently. Within a single pass to
the training data, it computes the conditional probability distribution of each feature given label,
and then it applies Bayes' theorem to compute the conditional probability distribution of label
given an observation and use it for prediction. For more details, please visit the wikipedia page
given an observation and use it for prediction. For more details, please visit the Wikipedia page
[Naive Bayes classifier](http://en.wikipedia.org/wiki/Naive_Bayes_classifier).
In MLlib, we implemented multinomial naive Bayes, which is typically used for document
classification. Within that context, each observation is a document, each feature represents a term,
whose value is the frequency of the term. For its formulation, please visit the wikipedia page
[Multinomial naive Bayes](http://en.wikipedia.org/wiki/Naive_Bayes_classifier#Multinomial_naive_Bayes)
whose value is the frequency of the term. For its formulation, please visit the Wikipedia page
[Multinomial Naive Bayes](http://en.wikipedia.org/wiki/Naive_Bayes_classifier#Multinomial_naive_Bayes)
or the section
[Naive Bayes text classification](http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html)
from the book Introduction to Information
@ -36,9 +36,18 @@ can be used for evaluation and prediction.
{% highlight scala %}
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
val training: RDD[LabeledPoint] = ... // training set
val test: RDD[LabeledPoint] = ... // test set
val data = sc.textFile("mllib/data/sample_naive_bayes_data.txt")
val parsedData = data.map { line =>
val parts = line.split(',')
LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}
// Split data into training (60%) and test (40%).
val splits = parsedData.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0)
val test = splits(1)
val model = NaiveBayes.train(training, lambda = 1.0)
val prediction = model.predict(test.map(_.features))
@ -58,29 +67,36 @@ optionally smoothing parameter `lambda` as input, and output a
can be used for evaluation and prediction.
{% highlight java %}
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.classification.NaiveBayes;
import org.apache.spark.mllib.classification.NaiveBayesModel;
import org.apache.spark.mllib.regression.LabeledPoint;
import scala.Tuple2;
JavaRDD<LabeledPoint> training = ... // training set
JavaRDD<LabeledPoint> test = ... // test set
NaiveBayesModel model = NaiveBayes.train(training.rdd(), 1.0);
final NaiveBayesModel model = NaiveBayes.train(training.rdd(), 1.0);
JavaRDD<Double> prediction = model.predict(test.map(new Function<LabeledPoint, Vector>() {
public Vector call(LabeledPoint p) {
return p.features();
JavaRDD<Double> prediction =
test.map(new Function<LabeledPoint, Double>() {
@Override public Double call(LabeledPoint p) {
return model.predict(p.features());
}
})
});
JavaPairRDD<Double, Double> predictionAndLabel =
prediction.zip(test.map(new Function<LabeledPoint, Double>() {
public Double call(LabeledPoint p) {
@Override public Double call(LabeledPoint p) {
return p.label();
}
})
}));
double accuracy = 1.0 * predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() {
public Boolean call(Tuple2<Double, Double> pl) {
@Override public Boolean call(Tuple2<Double, Double> pl) {
return pl._1() == pl._2();
}
}).count() / test.count()
}).count() / test.count();
{% endhighlight %}
</div>
@ -93,7 +109,7 @@ smoothing parameter `lambda` as input, and output a
[NaiveBayesModel](api/pyspark/pyspark.mllib.classification.NaiveBayesModel-class.html), which can be
used for evaluation and prediction.
<!--- TODO: Make Python's example consistent with Scala's and Java's. --->
<!-- TODO: Make Python's example consistent with Scala's and Java's. -->
{% highlight python %}
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import NaiveBayes

View file

@ -48,12 +48,12 @@ how to access a cluster. To create a `SparkContext` you first need to build a `S
that contains information about your application.
{% highlight scala %}
val conf = new SparkConf().setAppName(<app name>).setMaster(<master>)
val conf = new SparkConf().setAppName(appName).setMaster(master)
new SparkContext(conf)
{% endhighlight %}
The `<master>` parameter is a string specifying a [Spark, Mesos or YARN cluster URL](#master-urls)
to connect to, or a special "local" string to run in local mode, as described below. `<app name>` is
The `master` parameter is a string specifying a [Spark, Mesos or YARN cluster URL](#master-urls)
to connect to, or a special "local" string to run in local mode, as described below. `appName` is
a name for your application, which will be shown in the cluster web UI. It's also possible to set
these variables [using a configuration file](cluster-overview.html#loading-configurations-from-a-file)
which avoids hard-coding the master name in your application.
@ -81,9 +81,8 @@ The master URL passed to Spark can be in one of the following formats:
<table class="table">
<tr><th>Master URL</th><th>Meaning</th></tr>
<tr><td> local </td><td> Run Spark locally with one worker thread (i.e. no parallelism at all). </td></tr>
<tr><td> local[K] </td><td> Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
<tr><td> local[K] </td><td> Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine). </td></tr>
<tr><td> local[*] </td><td> Run Spark locally with as many worker threads as logical cores on your machine.</td></tr>
</td></tr>
<tr><td> spark://HOST:PORT </td><td> Connect to the given <a href="spark-standalone.html">Spark standalone
cluster</a> master. The port must be whichever one your master is configured to use, which is 7077 by default.
</td></tr>
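As a small illustration (a sketch; the host name and thread count are placeholders), these URLs are simply what gets passed to `setMaster`:

{% highlight scala %}
import org.apache.spark.SparkConf

// Run locally with 4 worker threads...
val localConf = new SparkConf().setAppName("MyApp").setMaster("local[4]")

// ...or connect to a standalone cluster master (placeholder host, default port 7077).
val clusterConf = new SparkConf().setAppName("MyApp").setMaster("spark://host:7077")
{% endhighlight %}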

View file

@ -416,3 +416,4 @@ results = hiveCtx.hql("FROM src SELECT key, value").collect()
{% endhighlight %}
</div>
</div>

View file

@ -1,6 +1,6 @@
0, 1 0 0
0, 2 0 0
1, 0 1 0
1, 0 2 0
2, 0 0 1
2, 0 0 2
0,1 0 0
0,2 0 0
1,0 1 0
1,0 2 0
2,0 0 1
2,0 0 2