[SPARK-15883][MLLIB][DOCS] Fix broken links in mllib documents

## What changes were proposed in this pull request?

This PR fixes all broken links in the Spark 2.0 preview MLlib documents. It also contains some editorial changes.

**Fix broken links**
  * mllib-data-types.md
  * mllib-decision-tree.md
  * mllib-ensembles.md
  * mllib-feature-extraction.md
  * mllib-pmml-model-export.md
  * mllib-statistics.md

**Fix a malformed section header and Scala coding style**
  * mllib-linear-methods.md

**Replace indirect forward links with direct ones**
  * ml-classification-regression.md

## How was this patch tested?

Manual tests (with `cd docs; jekyll build`).

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13608 from dongjoon-hyun/SPARK-15883.
Authored by Dongjoon Hyun on 2016-06-11 12:55:38 +01:00, committed by Sean Owen
parent 3761330dd0
commit ad102af169
8 changed files with 25 additions and 29 deletions

docs/ml-classification-regression.md

@@ -815,7 +815,7 @@ The main differences between this API and the [original MLlib ensembles API](mll
## Random Forests
[Random forests](http://en.wikipedia.org/wiki/Random_forest)
-are ensembles of [decision trees](ml-decision-tree.html).
+are ensembles of [decision trees](ml-classification-regression.html#decision-trees).
Random forests combine many decision trees in order to reduce the risk of overfitting.
The `spark.ml` implementation supports random forests for binary and multiclass classification and for regression,
using both continuous and categorical features.
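
For reference, a minimal sketch of the `spark.ml` random forest API that this section documents (the `training` DataFrame with "label" and "features" columns is a hypothetical placeholder):

```scala
import org.apache.spark.ml.classification.RandomForestClassifier

// Hypothetical DataFrame `training` with "label" and "features" columns.
val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setNumTrees(20)   // more trees reduce the variance of the ensemble
  .setMaxDepth(5)    // depth of each individual tree
val rfModel = rf.fit(training)
```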
@@ -896,7 +896,7 @@ All output columns are optional; to exclude an output column, set its correspond
## Gradient-Boosted Trees (GBTs)
[Gradient-Boosted Trees (GBTs)](http://en.wikipedia.org/wiki/Gradient_boosting)
-are ensembles of [decision trees](ml-decision-tree.html).
+are ensembles of [decision trees](ml-classification-regression.html#decision-trees).
GBTs iteratively train decision trees in order to minimize a loss function.
The `spark.ml` implementation supports GBTs for binary classification and for regression,
using both continuous and categorical features.
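
Similarly, a hedged sketch of the `spark.ml` GBT classifier described here (again assuming a hypothetical `training` DataFrame):

```scala
import org.apache.spark.ml.classification.GBTClassifier

// GBTs train trees sequentially, each new tree fitting the loss of the ensemble so far.
val gbt = new GBTClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setMaxIter(10)    // number of boosting iterations (trees)
  .setMaxDepth(5)
val gbtModel = gbt.fit(training)
```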

docs/mllib-data-types.md

@@ -33,7 +33,7 @@ implementations: [`DenseVector`](api/scala/index.html#org.apache.spark.mllib.lin
using the factory methods implemented in
[`Vectors`](api/scala/index.html#org.apache.spark.mllib.linalg.Vectors$) to create local vectors.
-Refer to the [`Vector` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Vector) and [`Vectors` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Vectors) for details on the API.
+Refer to the [`Vector` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Vector) and [`Vectors` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Vectors$) for details on the API.
{% highlight scala %}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
@@ -199,7 +199,7 @@ After loading, the feature indices are converted to zero-based.
[`MLUtils.loadLibSVMFile`](api/scala/index.html#org.apache.spark.mllib.util.MLUtils$) reads training
examples stored in LIBSVM format.
-Refer to the [`MLUtils` Scala docs](api/scala/index.html#org.apache.spark.mllib.util.MLUtils) for details on the API.
+Refer to the [`MLUtils` Scala docs](api/scala/index.html#org.apache.spark.mllib.util.MLUtils$) for details on the API.
{% highlight scala %}
import org.apache.spark.mllib.regression.LabeledPoint
@@ -264,7 +264,7 @@ We recommend using the factory methods implemented
in [`Matrices`](api/scala/index.html#org.apache.spark.mllib.linalg.Matrices$) to create local
matrices. Remember, local matrices in MLlib are stored in column-major order.
-Refer to the [`Matrix` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Matrix) and [`Matrices` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Matrices) for details on the API.
+Refer to the [`Matrix` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Matrix) and [`Matrices` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Matrices$) for details on the API.
{% highlight scala %}
import org.apache.spark.mllib.linalg.{Matrix, Matrices}
@@ -331,7 +331,7 @@ sm = Matrices.sparse(3, 2, [0, 1, 3], [0, 2, 1], [9, 6, 8])
A distributed matrix has long-typed row and column indices and double-typed values, stored
distributively in one or more RDDs. It is very important to choose the right format to store large
and distributed matrices. Converting a distributed matrix to a different format may require a
-global shuffle, which is quite expensive. Three types of distributed matrices have been implemented
+global shuffle, which is quite expensive. Four types of distributed matrices have been implemented
so far.
The basic type is called `RowMatrix`. A `RowMatrix` is a row-oriented distributed
@@ -344,6 +344,8 @@ An `IndexedRowMatrix` is similar to a `RowMatrix` but with row indices,
which can be used for identifying rows and executing joins.
A `CoordinateMatrix` is a distributed matrix stored in [coordinate list (COO)](https://en.wikipedia.org/wiki/Sparse_matrix#Coordinate_list_.28COO.29) format,
backed by an RDD of its entries.
+A `BlockMatrix` is a distributed matrix backed by an RDD of `MatrixBlock`
+which is a tuple of `(Int, Int, Matrix)`.
***Note***
@@ -535,12 +537,6 @@ rowsRDD = mat.rows
-# Convert to a RowMatrix by dropping the row indices.
-rowMat = mat.toRowMatrix()
-# Convert to a CoordinateMatrix.
-coordinateMat = mat.toCoordinateMatrix()
-# Convert to a BlockMatrix.
-blockMat = mat.toBlockMatrix()
{% endhighlight %}
</div>
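
For context on the distributed matrix types listed above, a minimal sketch (assuming an existing `sc: SparkContext`) of building a `CoordinateMatrix` from its entries and converting it to a `BlockMatrix`:

```scala
import org.apache.spark.mllib.linalg.distributed.{BlockMatrix, CoordinateMatrix, MatrixEntry}
import org.apache.spark.rdd.RDD

// A CoordinateMatrix is backed by an RDD of (row index, column index, value) entries (COO format).
val entries: RDD[MatrixEntry] = sc.parallelize(Seq(
  MatrixEntry(0, 0, 1.2), MatrixEntry(1, 0, 2.1), MatrixEntry(6, 1, 3.7)))
val coordMat = new CoordinateMatrix(entries)

// A BlockMatrix is backed by an RDD of ((blockRowIndex, blockColIndex), sub-matrix) blocks.
val blockMat: BlockMatrix = coordMat.toBlockMatrix().cache()
blockMat.validate()  // sanity-check the block structure
```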

docs/mllib-decision-tree.md

@@ -136,7 +136,7 @@ When tuning these parameters, be careful to validate on held-out test data to av
* **`maxDepth`**: Maximum depth of a tree. Deeper trees are more expressive (potentially allowing higher accuracy), but they are also more costly to train and are more likely to overfit.
-* **`minInstancesPerNode`**: For a node to be split further, each of its children must receive at least this number of training instances. This is commonly used with [RandomForest](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest) since those are often trained deeper than individual trees.
+* **`minInstancesPerNode`**: For a node to be split further, each of its children must receive at least this number of training instances. This is commonly used with [RandomForest](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest$) since those are often trained deeper than individual trees.
* **`minInfoGain`**: For a node to be split further, the split must improve at least this much (in terms of information gain).
@@ -152,13 +152,13 @@ These parameters may be tuned. Be careful to validate on held-out test data whe
* The default value is conservatively chosen to be 256 MB to allow the decision algorithm to work in most scenarios. Increasing `maxMemoryInMB` can lead to faster training (if the memory is available) by allowing fewer passes over the data. However, there may be decreasing returns as `maxMemoryInMB` grows since the amount of communication on each iteration can be proportional to `maxMemoryInMB`.
* *Implementation details*: For faster processing, the decision tree algorithm collects statistics about groups of nodes to split (rather than 1 node at a time). The number of nodes which can be handled in one group is determined by the memory requirements (which vary per features). The `maxMemoryInMB` parameter specifies the memory limit in terms of megabytes which each worker can use for these statistics.
-* **`subsamplingRate`**: Fraction of the training data used for learning the decision tree. This parameter is most relevant for training ensembles of trees (using [`RandomForest`](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest) and [`GradientBoostedTrees`](api/scala/index.html#org.apache.spark.mllib.tree.GradientBoostedTrees)), where it can be useful to subsample the original data. For training a single decision tree, this parameter is less useful since the number of training instances is generally not the main constraint.
+* **`subsamplingRate`**: Fraction of the training data used for learning the decision tree. This parameter is most relevant for training ensembles of trees (using [`RandomForest`](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest$) and [`GradientBoostedTrees`](api/scala/index.html#org.apache.spark.mllib.tree.GradientBoostedTrees)), where it can be useful to subsample the original data. For training a single decision tree, this parameter is less useful since the number of training instances is generally not the main constraint.
* **`impurity`**: Impurity measure (discussed above) used to choose between candidate splits. This measure must match the `algo` parameter.
### Caching and checkpointing
-MLlib 1.2 adds several features for scaling up to larger (deeper) trees and tree ensembles. When `maxDepth` is set to be large, it can be useful to turn on node ID caching and checkpointing. These parameters are also useful for [RandomForest](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest) when `numTrees` is set to be large.
+MLlib 1.2 adds several features for scaling up to larger (deeper) trees and tree ensembles. When `maxDepth` is set to be large, it can be useful to turn on node ID caching and checkpointing. These parameters are also useful for [RandomForest](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest$) when `numTrees` is set to be large.
* **`useNodeIdCache`**: If this is set to true, the algorithm will avoid passing the current model (tree or trees) to executors on each iteration.
* This can be useful with deep trees (speeding up computation on workers) and for large Random Forests (reducing communication on each iteration).
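
To ground the parameters above, a hedged sketch of how they are passed to the RDD-based `RandomForest` API (the input path is a placeholder and `sc` is an assumed `SparkContext`):

```scala
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.util.MLUtils

// Hypothetical LIBSVM-formatted input path.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()  // empty map: treat all features as continuous
val numTrees = 10
val featureSubsetStrategy = "auto"             // let the algorithm choose
val impurity = "gini"                          // must match the classification setting of `algo`
val maxDepth = 5                               // large values may call for node ID caching / checkpointing
val maxBins = 32

val model = RandomForest.trainClassifier(data, numClasses, categoricalFeaturesInfo,
  numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
```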

docs/mllib-ensembles.md

@@ -9,7 +9,7 @@ displayTitle: Ensembles - spark.mllib
An [ensemble method](http://en.wikipedia.org/wiki/Ensemble_learning)
is a learning algorithm which creates a model composed of a set of other base models.
-`spark.mllib` supports two major ensemble algorithms: [`GradientBoostedTrees`](api/scala/index.html#org.apache.spark.mllib.tree.GradientBoostedTrees) and [`RandomForest`](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest).
+`spark.mllib` supports two major ensemble algorithms: [`GradientBoostedTrees`](api/scala/index.html#org.apache.spark.mllib.tree.GradientBoostedTrees) and [`RandomForest`](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest$).
Both use [decision trees](mllib-decision-tree.html) as their base models.
## Gradient-Boosted Trees vs. Random Forests
@@ -96,7 +96,7 @@ The test error is calculated to measure the algorithm accuracy.
<div class="codetabs">
<div data-lang="scala" markdown="1">
-Refer to the [`RandomForest` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest) and [`RandomForestModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.model.RandomForestModel) for details on the API.
+Refer to the [`RandomForest` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest$) and [`RandomForestModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.model.RandomForestModel) for details on the API.
{% include_example scala/org/apache/spark/examples/mllib/RandomForestClassificationExample.scala %}
</div>
@@ -127,7 +127,7 @@ The Mean Squared Error (MSE) is computed at the end to evaluate
<div class="codetabs">
<div data-lang="scala" markdown="1">
-Refer to the [`RandomForest` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest) and [`RandomForestModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.model.RandomForestModel) for details on the API.
+Refer to the [`RandomForest` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest$) and [`RandomForestModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.model.RandomForestModel) for details on the API.
{% include_example scala/org/apache/spark/examples/mllib/RandomForestRegressionExample.scala %}
</div>
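
For reference, a minimal sketch of the `GradientBoostedTrees` side of the RDD-based API mentioned above (assuming a hypothetical `trainingData: RDD[LabeledPoint]`):

```scala
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy

// Start from the default boosting parameters for classification and tune a few of them.
val boostingStrategy = BoostingStrategy.defaultParams("Classification")
boostingStrategy.numIterations = 3           // more iterations usually improve accuracy but cost more
boostingStrategy.treeStrategy.maxDepth = 5   // depth of each weak learner

val model = GradientBoostedTrees.train(trainingData, boostingStrategy)
```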

docs/mllib-feature-extraction.md

@@ -333,7 +333,7 @@ Details you can read at [dimensionality reduction](mllib-dimensionality-reductio
The following code demonstrates how to compute principal components on a `Vector`
and use them to project the vectors into a low-dimensional space while keeping associated labels
-for calculation a [Linear Regression]((mllib-linear-methods.html))
+for calculation a [Linear Regression](mllib-linear-methods.html)
<div class="codetabs">
<div data-lang="scala" markdown="1">
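
The full example is elided from this hunk; a minimal sketch of the PCA projection described above, keeping labels attached so the projected features can feed a regression (assuming a hypothetical `data: RDD[LabeledPoint]`):

```scala
import org.apache.spark.mllib.feature.PCA
import org.apache.spark.mllib.regression.LabeledPoint

// Fit PCA on the feature vectors only, then project each point while preserving its label.
val pca = new PCA(10).fit(data.map(_.features))
val projected = data.map(p => p.copy(features = pca.transform(p.features)))
```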

docs/mllib-linear-methods.md

@@ -185,10 +185,10 @@ algorithm for 200 iterations.
import org.apache.spark.mllib.optimization.L1Updater
val svmAlg = new SVMWithSGD()
-svmAlg.optimizer.
-  setNumIterations(200).
-  setRegParam(0.1).
-  setUpdater(new L1Updater)
+svmAlg.optimizer
+  .setNumIterations(200)
+  .setRegParam(0.1)
+  .setUpdater(new L1Updater)
val modelL1 = svmAlg.run(training)
{% endhighlight %}
@@ -395,7 +395,7 @@ section of the Spark
quick-start guide. Be sure to also include *spark-mllib* to your build file as
a dependency.
-###Streaming linear regression
+### Streaming linear regression
When data arrive in a streaming fashion, it is useful to fit regression models online,
updating the parameters of the model as new data arrives. `spark.mllib` currently supports
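
As a hedged illustration of the online fitting described above, a sketch using `StreamingLinearRegressionWithSGD` (the DStreams and `numFeatures` are hypothetical placeholders):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD

// Hypothetical DStream[LabeledPoint] inputs built from a StreamingContext.
val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(numFeatures))

model.trainOn(trainingStream)   // update the model as each batch arrives
model.predictOnValues(testStream.map(lp => (lp.label, lp.features))).print()
```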

docs/mllib-pmml-model-export.md

@@ -47,7 +47,7 @@ To export a supported `model` (see table above) to PMML, simply call `model.toPM
As well as exporting the PMML model to a String (`model.toPMML` as in the example above), you can export the PMML model to other formats.
-Refer to the [`KMeans` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.KMeans) and [`Vectors` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Vectors) for details on the API.
+Refer to the [`KMeans` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.KMeans) and [`Vectors` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Vectors$) for details on the API.
Here a complete example of building a KMeansModel and print it out in PMML format:
{% include_example scala/org/apache/spark/examples/mllib/PMMLModelExportExample.scala %}
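
For context, a minimal sketch of the export paths mentioned above (assuming a hypothetical `data: RDD[Vector]`):

```scala
import org.apache.spark.mllib.clustering.KMeans

// Train a KMeansModel, then export it to PMML as a String or straight to a local file.
val clusters = KMeans.train(data, 2, 20)
println(clusters.toPMML())          // PMML as a String
clusters.toPMML("/tmp/kmeans.xml")  // PMML written to a local path (hypothetical location)
```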

docs/mllib-statistics.md

@@ -80,7 +80,7 @@ correlation methods are currently Pearson's and Spearman's correlation.
calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or
an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
-Refer to the [`Statistics` Scala docs](api/scala/index.html#org.apache.spark.mllib.stat.Statistics) for details on the API.
+Refer to the [`Statistics` Scala docs](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) for details on the API.
{% include_example scala/org/apache/spark/examples/mllib/CorrelationsExample.scala %}
</div>
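
A minimal sketch of the `Statistics.corr` usage described above (assuming an existing `sc: SparkContext`):

```scala
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

// The two series must have the same cardinality.
val seriesX: RDD[Double] = sc.parallelize(Array(1.0, 2.0, 3.0, 5.0))
val seriesY: RDD[Double] = sc.parallelize(Array(11.0, 22.0, 33.0, 55.0))

// "pearson" is the default method; "spearman" is also supported.
val correlation: Double = Statistics.corr(seriesX, seriesY, "pearson")
```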
@@ -210,7 +210,7 @@ message.
run a 1-sample, 2-sided Kolmogorov-Smirnov test. The following example demonstrates how to run
and interpret the hypothesis tests.
-Refer to the [`Statistics` Scala docs](api/scala/index.html#org.apache.spark.mllib.stat.Statistics) for details on the API.
+Refer to the [`Statistics` Scala docs](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) for details on the API.
{% include_example scala/org/apache/spark/examples/mllib/HypothesisTestingKolmogorovSmirnovTestExample.scala %}
</div>
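
Similarly, a sketch of the 1-sample Kolmogorov-Smirnov test mentioned above, comparing a sample against a standard normal CDF (again assuming an existing `sc`):

```scala
import org.apache.spark.mllib.stat.Statistics

val sample = sc.parallelize(Array(0.1, 0.15, 0.2, 0.3, 0.25))
// "norm" selects the normal distribution; the trailing parameters are its mean and standard deviation.
val testResult = Statistics.kolmogorovSmirnovTest(sample, "norm", 0, 1)
println(testResult)  // summary includes the test statistic, p-value, and null hypothesis
```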
@@ -277,12 +277,12 @@ uniform, standard normal, or Poisson.
<div class="codetabs">
<div data-lang="scala" markdown="1">
-[`RandomRDDs`](api/scala/index.html#org.apache.spark.mllib.random.RandomRDDs) provides factory
+[`RandomRDDs`](api/scala/index.html#org.apache.spark.mllib.random.RandomRDDs$) provides factory
methods to generate random double RDDs or vector RDDs.
The following example generates a random double RDD, whose values follows the standard normal
distribution `N(0, 1)`, and then map it to `N(1, 4)`.
-Refer to the [`RandomRDDs` Scala docs](api/scala/index.html#org.apache.spark.mllib.random.RandomRDDs) for details on the API.
+Refer to the [`RandomRDDs` Scala docs](api/scala/index.html#org.apache.spark.mllib.random.RandomRDDs$) for details on the API.
{% highlight scala %}
import org.apache.spark.SparkContext