[SPARK-15883][MLLIB][DOCS] Fix broken links in mllib documents
## What changes were proposed in this pull request?

This issue fixes all broken links in the Spark 2.0 preview MLlib documents. It also contains some editorial changes.

**Fix broken links**

* mllib-data-types.md
* mllib-decision-tree.md
* mllib-ensembles.md
* mllib-feature-extraction.md
* mllib-pmml-model-export.md
* mllib-statistics.md

**Fix malformed section header and Scala coding style**

* mllib-linear-methods.md

**Replace indirect forward links with direct ones**

* ml-classification-regression.md

## How was this patch tested?

Manual tests (with `cd docs; jekyll build`).

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13608 from dongjoon-hyun/SPARK-15883.
parent 3761330dd0
commit ad102af169
```diff
--- a/docs/ml-classification-regression.md
+++ b/docs/ml-classification-regression.md
@@ -815,7 +815,7 @@ The main differences between this API and the [original MLlib ensembles API](mll
 ## Random Forests
 
 [Random forests](http://en.wikipedia.org/wiki/Random_forest)
-are ensembles of [decision trees](ml-decision-tree.html).
+are ensembles of [decision trees](ml-classification-regression.html#decision-trees).
 Random forests combine many decision trees in order to reduce the risk of overfitting.
 The `spark.ml` implementation supports random forests for binary and multiclass classification and for regression,
 using both continuous and categorical features.
@@ -896,7 +896,7 @@ All output columns are optional; to exclude an output column, set its correspond
 ## Gradient-Boosted Trees (GBTs)
 
 [Gradient-Boosted Trees (GBTs)](http://en.wikipedia.org/wiki/Gradient_boosting)
-are ensembles of [decision trees](ml-decision-tree.html).
+are ensembles of [decision trees](ml-classification-regression.html#decision-trees).
 GBTs iteratively train decision trees in order to minimize a loss function.
 The `spark.ml` implementation supports GBTs for binary classification and for regression,
 using both continuous and categorical features.
```
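For context (not part of the commit): the sections touched above document the `spark.ml` tree ensembles. A minimal sketch of their use, assuming a Spark 2.0-era `SparkSession` and the standard `data/mllib/sample_libsvm_data.txt` example file:

```scala
import org.apache.spark.ml.classification.{GBTClassifier, RandomForestClassifier}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("EnsembleSketch").getOrCreate()
// Both learners consume a DataFrame of (label, features) rows.
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

// Random forest: many trees combined to reduce the risk of overfitting.
val rfModel = new RandomForestClassifier().setNumTrees(10).fit(data)

// Gradient-boosted trees: trees trained iteratively to minimize a loss function.
val gbtModel = new GBTClassifier().setMaxIter(10).fit(data)
```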
```diff
--- a/docs/mllib-data-types.md
+++ b/docs/mllib-data-types.md
@@ -33,7 +33,7 @@ implementations: [`DenseVector`](api/scala/index.html#org.apache.spark.mllib.lin
 using the factory methods implemented in
 [`Vectors`](api/scala/index.html#org.apache.spark.mllib.linalg.Vectors$) to create local vectors.
 
-Refer to the [`Vector` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Vector) and [`Vectors` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Vectors) for details on the API.
+Refer to the [`Vector` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Vector) and [`Vectors` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Vectors$) for details on the API.
 
 {% highlight scala %}
 import org.apache.spark.mllib.linalg.{Vector, Vectors}
```
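The highlighted block is truncated in this hunk; the factory methods it introduces look like this (a minimal sketch of the `Vectors` API the corrected link points to):

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Create a dense vector (1.0, 0.0, 3.0).
val dv: Vector = Vectors.dense(1.0, 0.0, 3.0)
// Create a sparse vector (1.0, 0.0, 3.0) by its size, indices, and non-zero values.
val sv: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
```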
```diff
--- a/docs/mllib-data-types.md
+++ b/docs/mllib-data-types.md
@@ -199,7 +199,7 @@ After loading, the feature indices are converted to zero-based.
 [`MLUtils.loadLibSVMFile`](api/scala/index.html#org.apache.spark.mllib.util.MLUtils$) reads training
 examples stored in LIBSVM format.
 
-Refer to the [`MLUtils` Scala docs](api/scala/index.html#org.apache.spark.mllib.util.MLUtils) for details on the API.
+Refer to the [`MLUtils` Scala docs](api/scala/index.html#org.apache.spark.mllib.util.MLUtils$) for details on the API.
 
 {% highlight scala %}
 import org.apache.spark.mllib.regression.LabeledPoint
```
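A minimal sketch of the `MLUtils.loadLibSVMFile` call documented here, assuming an existing `SparkContext` named `sc`:

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

// Each LIBSVM line "label index1:value1 index2:value2 ..." becomes a LabeledPoint;
// one-based feature indices are converted to zero-based on load.
val examples: RDD[LabeledPoint] =
  MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
```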
```diff
--- a/docs/mllib-data-types.md
+++ b/docs/mllib-data-types.md
@@ -264,7 +264,7 @@ We recommend using the factory methods implemented
 in [`Matrices`](api/scala/index.html#org.apache.spark.mllib.linalg.Matrices$) to create local
 matrices. Remember, local matrices in MLlib are stored in column-major order.
 
-Refer to the [`Matrix` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Matrix) and [`Matrices` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Matrices) for details on the API.
+Refer to the [`Matrix` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Matrix) and [`Matrices` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Matrices$) for details on the API.
 
 {% highlight scala %}
 import org.apache.spark.mllib.linalg.{Matrix, Matrices}
```
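A sketch of the `Matrices` factory methods, mirroring the Python call visible in the next hunk's header:

```scala
import org.apache.spark.mllib.linalg.{Matrices, Matrix}

// Dense 3x2 matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0)), column-major storage.
val dm: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))
// Sparse 3x2 matrix in CSC form: values 9, 6, 8 at positions (0,0), (2,1), (1,1).
val sm: Matrix = Matrices.sparse(3, 2, Array(0, 1, 3), Array(0, 2, 1), Array(9, 6, 8))
```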
```diff
--- a/docs/mllib-data-types.md
+++ b/docs/mllib-data-types.md
@@ -331,7 +331,7 @@ sm = Matrices.sparse(3, 2, [0, 1, 3], [0, 2, 1], [9, 6, 8])
 A distributed matrix has long-typed row and column indices and double-typed values, stored
 distributively in one or more RDDs. It is very important to choose the right format to store large
 and distributed matrices. Converting a distributed matrix to a different format may require a
-global shuffle, which is quite expensive. Three types of distributed matrices have been implemented
+global shuffle, which is quite expensive. Four types of distributed matrices have been implemented
 so far.
 
 The basic type is called `RowMatrix`. A `RowMatrix` is a row-oriented distributed
```
```diff
--- a/docs/mllib-data-types.md
+++ b/docs/mllib-data-types.md
@@ -344,6 +344,8 @@ An `IndexedRowMatrix` is similar to a `RowMatrix` but with row indices,
 which can be used for identifying rows and executing joins.
 A `CoordinateMatrix` is a distributed matrix stored in [coordinate list (COO)](https://en.wikipedia.org/wiki/Sparse_matrix#Coordinate_list_.28COO.29) format,
 backed by an RDD of its entries.
+A `BlockMatrix` is a distributed matrix backed by an RDD of `MatrixBlock`
+which is a tuple of `(Int, Int, Matrix)`.
 
 ***Note***
 
```
```diff
--- a/docs/mllib-data-types.md
+++ b/docs/mllib-data-types.md
@@ -535,12 +537,6 @@ rowsRDD = mat.rows
 
 # Convert to a RowMatrix by dropping the row indices.
 rowMat = mat.toRowMatrix()
-
-# Convert to a CoordinateMatrix.
-coordinateMat = mat.toCoordinateMatrix()
-
-# Convert to a BlockMatrix.
-blockMat = mat.toBlockMatrix()
 {% endhighlight %}
 </div>
 
```
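For context (not part of the commit): the four distributed matrix types the corrected prose now counts, in a minimal Scala sketch assuming an existing `SparkContext` named `sc`:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{
  CoordinateMatrix, IndexedRow, IndexedRowMatrix, MatrixEntry, RowMatrix}

// RowMatrix: row-oriented, without meaningful row indices.
val rowMat = new RowMatrix(sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0), Vectors.dense(3.0, 4.0))))

// IndexedRowMatrix: like RowMatrix, but each row carries a long index.
val indexedMat = new IndexedRowMatrix(sc.parallelize(Seq(
  IndexedRow(0L, Vectors.dense(1.0, 2.0)), IndexedRow(1L, Vectors.dense(3.0, 4.0)))))

// CoordinateMatrix: COO format, an RDD of (i, j, value) entries.
val coordMat = new CoordinateMatrix(sc.parallelize(Seq(
  MatrixEntry(0L, 0L, 1.0), MatrixEntry(1L, 1L, 4.0))))

// BlockMatrix: an RDD of ((blockRowIndex, blockColIndex), sub-matrix) blocks.
val blockMat = coordMat.toBlockMatrix()
```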
```diff
--- a/docs/mllib-decision-tree.md
+++ b/docs/mllib-decision-tree.md
@@ -136,7 +136,7 @@ When tuning these parameters, be careful to validate on held-out test data to av
 
 * **`maxDepth`**: Maximum depth of a tree. Deeper trees are more expressive (potentially allowing higher accuracy), but they are also more costly to train and are more likely to overfit.
 
-* **`minInstancesPerNode`**: For a node to be split further, each of its children must receive at least this number of training instances. This is commonly used with [RandomForest](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest) since those are often trained deeper than individual trees.
+* **`minInstancesPerNode`**: For a node to be split further, each of its children must receive at least this number of training instances. This is commonly used with [RandomForest](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest$) since those are often trained deeper than individual trees.
 
 * **`minInfoGain`**: For a node to be split further, the split must improve at least this much (in terms of information gain).
 
@@ -152,13 +152,13 @@ These parameters may be tuned. Be careful to validate on held-out test data whe
 * The default value is conservatively chosen to be 256 MB to allow the decision algorithm to work in most scenarios. Increasing `maxMemoryInMB` can lead to faster training (if the memory is available) by allowing fewer passes over the data. However, there may be decreasing returns as `maxMemoryInMB` grows since the amount of communication on each iteration can be proportional to `maxMemoryInMB`.
 * *Implementation details*: For faster processing, the decision tree algorithm collects statistics about groups of nodes to split (rather than 1 node at a time). The number of nodes which can be handled in one group is determined by the memory requirements (which vary per features). The `maxMemoryInMB` parameter specifies the memory limit in terms of megabytes which each worker can use for these statistics.
 
-* **`subsamplingRate`**: Fraction of the training data used for learning the decision tree. This parameter is most relevant for training ensembles of trees (using [`RandomForest`](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest) and [`GradientBoostedTrees`](api/scala/index.html#org.apache.spark.mllib.tree.GradientBoostedTrees)), where it can be useful to subsample the original data. For training a single decision tree, this parameter is less useful since the number of training instances is generally not the main constraint.
+* **`subsamplingRate`**: Fraction of the training data used for learning the decision tree. This parameter is most relevant for training ensembles of trees (using [`RandomForest`](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest$) and [`GradientBoostedTrees`](api/scala/index.html#org.apache.spark.mllib.tree.GradientBoostedTrees)), where it can be useful to subsample the original data. For training a single decision tree, this parameter is less useful since the number of training instances is generally not the main constraint.
 
 * **`impurity`**: Impurity measure (discussed above) used to choose between candidate splits. This measure must match the `algo` parameter.
 
 ### Caching and checkpointing
 
-MLlib 1.2 adds several features for scaling up to larger (deeper) trees and tree ensembles. When `maxDepth` is set to be large, it can be useful to turn on node ID caching and checkpointing. These parameters are also useful for [RandomForest](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest) when `numTrees` is set to be large.
+MLlib 1.2 adds several features for scaling up to larger (deeper) trees and tree ensembles. When `maxDepth` is set to be large, it can be useful to turn on node ID caching and checkpointing. These parameters are also useful for [RandomForest](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest$) when `numTrees` is set to be large.
 
 * **`useNodeIdCache`**: If this is set to true, the algorithm will avoid passing the current model (tree or trees) to executors on each iteration.
 * This can be useful with deep trees (speeding up computation on workers) and for large Random Forests (reducing communication on each iteration).
```
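For context: a hypothetical training call exercising the tunables above via `Strategy` (field names from the `spark.mllib` tree API; `data` is an assumed `RDD[LabeledPoint]`):

```scala
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.configuration.Strategy

// Start from the default classification strategy, then set the
// parameters discussed above.
val strategy = Strategy.defaultStrategy("Classification")
strategy.maxDepth = 5
strategy.minInstancesPerNode = 2
strategy.minInfoGain = 0.01
strategy.maxMemoryInMB = 256
strategy.subsamplingRate = 1.0   // subsampling matters most for ensembles
strategy.useNodeIdCache = true   // avoids shipping the model each iteration

val model = DecisionTree.train(data, strategy)
```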
```diff
--- a/docs/mllib-ensembles.md
+++ b/docs/mllib-ensembles.md
@@ -9,7 +9,7 @@ displayTitle: Ensembles - spark.mllib
 
 An [ensemble method](http://en.wikipedia.org/wiki/Ensemble_learning)
 is a learning algorithm which creates a model composed of a set of other base models.
-`spark.mllib` supports two major ensemble algorithms: [`GradientBoostedTrees`](api/scala/index.html#org.apache.spark.mllib.tree.GradientBoostedTrees) and [`RandomForest`](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest).
+`spark.mllib` supports two major ensemble algorithms: [`GradientBoostedTrees`](api/scala/index.html#org.apache.spark.mllib.tree.GradientBoostedTrees) and [`RandomForest`](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest$).
 Both use [decision trees](mllib-decision-tree.html) as their base models.
 
 ## Gradient-Boosted Trees vs. Random Forests
@@ -96,7 +96,7 @@ The test error is calculated to measure the algorithm accuracy.
 <div class="codetabs">
 
 <div data-lang="scala" markdown="1">
-Refer to the [`RandomForest` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest) and [`RandomForestModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.model.RandomForestModel) for details on the API.
+Refer to the [`RandomForest` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest$) and [`RandomForestModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.model.RandomForestModel) for details on the API.
 
 {% include_example scala/org/apache/spark/examples/mllib/RandomForestClassificationExample.scala %}
 </div>
@@ -127,7 +127,7 @@ The Mean Squared Error (MSE) is computed at the end to evaluate
 <div class="codetabs">
 
 <div data-lang="scala" markdown="1">
-Refer to the [`RandomForest` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest) and [`RandomForestModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.model.RandomForestModel) for details on the API.
+Refer to the [`RandomForest` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest$) and [`RandomForestModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.model.RandomForestModel) for details on the API.
 
 {% include_example scala/org/apache/spark/examples/mllib/RandomForestRegressionExample.scala %}
 </div>
```
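A minimal sketch of the two ensemble entry points named above (`trainingData` is an assumed `RDD[LabeledPoint]`):

```scala
import org.apache.spark.mllib.tree.{GradientBoostedTrees, RandomForest}
import org.apache.spark.mllib.tree.configuration.BoostingStrategy

// Random forest: trees are trained independently on subsampled data.
val rfModel = RandomForest.trainClassifier(
  trainingData, numClasses = 2, categoricalFeaturesInfo = Map[Int, Int](),
  numTrees = 10, featureSubsetStrategy = "auto",
  impurity = "gini", maxDepth = 4, maxBins = 32)

// Gradient boosting: each new tree corrects the ensemble trained so far.
val boostingStrategy = BoostingStrategy.defaultParams("Classification")
boostingStrategy.numIterations = 10
val gbtModel = GradientBoostedTrees.train(trainingData, boostingStrategy)
```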
```diff
--- a/docs/mllib-feature-extraction.md
+++ b/docs/mllib-feature-extraction.md
@@ -333,7 +333,7 @@ Details you can read at [dimensionality reduction](mllib-dimensionality-reductio
 
 The following code demonstrates how to compute principal components on a `Vector`
 and use them to project the vectors into a low-dimensional space while keeping associated labels
-for calculation a [Linear Regression]((mllib-linear-methods.html))
+for calculation a [Linear Regression](mllib-linear-methods.html)
 
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
```
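The example itself is elided in this hunk; it reduces to roughly this sketch (assuming `data: RDD[LabeledPoint]`): project the features with `PCA`, then fit a linear regression on the projected points.

```scala
import org.apache.spark.mllib.feature.PCA
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

// Fit a PCA model that projects features onto the top 3 principal components.
val pca = new PCA(3).fit(data.map(_.features))
// Keep each label, replacing the features with their projection.
val projected = data.map(p => p.copy(features = pca.transform(p.features)))

// Train a linear regression model on the lower-dimensional data.
val model = LinearRegressionWithSGD.train(projected, 100)
```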
```diff
--- a/docs/mllib-linear-methods.md
+++ b/docs/mllib-linear-methods.md
@@ -185,10 +185,10 @@ algorithm for 200 iterations.
 import org.apache.spark.mllib.optimization.L1Updater
 
 val svmAlg = new SVMWithSGD()
-svmAlg.optimizer.
-  setNumIterations(200).
-  setRegParam(0.1).
-  setUpdater(new L1Updater)
+svmAlg.optimizer
+  .setNumIterations(200)
+  .setRegParam(0.1)
+  .setUpdater(new L1Updater)
 val modelL1 = svmAlg.run(training)
 {% endhighlight %}
 
```
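The rewrite above is purely stylistic: moving the dots to the start of the continuation lines is the multi-line chaining convention used elsewhere in the Spark docs, and it makes it immediately visible that `setNumIterations`, `setRegParam`, and `setUpdater` are all calls on the same `optimizer` object.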
```diff
--- a/docs/mllib-linear-methods.md
+++ b/docs/mllib-linear-methods.md
@@ -395,7 +395,7 @@ section of the Spark
 quick-start guide. Be sure to also include *spark-mllib* to your build file as
 a dependency.
 
-###Streaming linear regression
+### Streaming linear regression
 
 When data arrive in a streaming fashion, it is useful to fit regression models online,
 updating the parameters of the model as new data arrives. `spark.mllib` currently supports
```
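For context, a minimal sketch of the streaming regression API introduced under the fixed header (the DStreams `trainingData` and `testData` of `LabeledPoint` are assumed):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}

// Start from zero weights; the model is updated as each batch arrives.
val numFeatures = 3
val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(numFeatures))

model.trainOn(trainingData)
model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()
```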
```diff
--- a/docs/mllib-pmml-model-export.md
+++ b/docs/mllib-pmml-model-export.md
@@ -47,7 +47,7 @@ To export a supported `model` (see table above) to PMML, simply call `model.toPM
 
 As well as exporting the PMML model to a String (`model.toPMML` as in the example above), you can export the PMML model to other formats.
 
-Refer to the [`KMeans` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.KMeans) and [`Vectors` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Vectors) for details on the API.
+Refer to the [`KMeans` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.KMeans) and [`Vectors` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Vectors$) for details on the API.
 
 Here a complete example of building a KMeansModel and print it out in PMML format:
 {% include_example scala/org/apache/spark/examples/mllib/PMMLModelExportExample.scala %}
```
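A sketch of the export paths mentioned above, assuming an `RDD[Vector]` named `data` and a `SparkContext` named `sc`:

```scala
import org.apache.spark.mllib.clustering.KMeans

// Train a KMeansModel (k = 2, 20 iterations); KMeansModel is PMML-exportable.
val clusters = KMeans.train(data, 2, 20)

println(clusters.toPMML)            // export to a String
clusters.toPMML("/tmp/kmeans.xml")  // export to a local file
clusters.toPMML(sc, "/tmp/kmeans")  // export to a distributed file via SparkContext
```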
```diff
--- a/docs/mllib-statistics.md
+++ b/docs/mllib-statistics.md
@@ -80,7 +80,7 @@ correlation methods are currently Pearson's and Spearman's correlation.
 calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or
 an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
 
-Refer to the [`Statistics` Scala docs](api/scala/index.html#org.apache.spark.mllib.stat.Statistics) for details on the API.
+Refer to the [`Statistics` Scala docs](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) for details on the API.
 
 {% include_example scala/org/apache/spark/examples/mllib/CorrelationsExample.scala %}
 </div>
```
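A minimal correlation sketch (assuming an existing `SparkContext` named `sc`):

```scala
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

val seriesX: RDD[Double] = sc.parallelize(Array(1.0, 2.0, 3.0, 5.0))
val seriesY: RDD[Double] = sc.parallelize(Array(11.0, 22.0, 33.0, 55.0))

// Two RDD[Double]s yield a single coefficient; an RDD[Vector] would yield a Matrix.
val correlation: Double = Statistics.corr(seriesX, seriesY, "pearson")
```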
```diff
--- a/docs/mllib-statistics.md
+++ b/docs/mllib-statistics.md
@@ -210,7 +210,7 @@ message.
 run a 1-sample, 2-sided Kolmogorov-Smirnov test. The following example demonstrates how to run
 and interpret the hypothesis tests.
 
-Refer to the [`Statistics` Scala docs](api/scala/index.html#org.apache.spark.mllib.stat.Statistics) for details on the API.
+Refer to the [`Statistics` Scala docs](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) for details on the API.
 
 {% include_example scala/org/apache/spark/examples/mllib/HypothesisTestingKolmogorovSmirnovTestExample.scala %}
 </div>
```
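A minimal Kolmogorov-Smirnov sketch (again assuming `sc`):

```scala
import org.apache.spark.mllib.stat.Statistics

val data = sc.parallelize(Seq(0.1, 0.15, 0.2, 0.3, 0.25))

// 1-sample, 2-sided KS test against N(0, 1); the trailing params are the
// distribution's mean and standard deviation.
val testResult = Statistics.kolmogorovSmirnovTest(data, "norm", 0, 1)
println(testResult)  // summary includes the statistic, p-value, and conclusion
```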
```diff
--- a/docs/mllib-statistics.md
+++ b/docs/mllib-statistics.md
@@ -277,12 +277,12 @@ uniform, standard normal, or Poisson.
 
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-[`RandomRDDs`](api/scala/index.html#org.apache.spark.mllib.random.RandomRDDs) provides factory
+[`RandomRDDs`](api/scala/index.html#org.apache.spark.mllib.random.RandomRDDs$) provides factory
 methods to generate random double RDDs or vector RDDs.
 The following example generates a random double RDD, whose values follows the standard normal
 distribution `N(0, 1)`, and then map it to `N(1, 4)`.
 
-Refer to the [`RandomRDDs` Scala docs](api/scala/index.html#org.apache.spark.mllib.random.RandomRDDs) for details on the API.
+Refer to the [`RandomRDDs` Scala docs](api/scala/index.html#org.apache.spark.mllib.random.RandomRDDs$) for details on the API.
 
 {% highlight scala %}
 import org.apache.spark.SparkContext
```
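The truncated example continues roughly as follows (a sketch assuming `sc`):

```scala
import org.apache.spark.mllib.random.RandomRDDs._

// 1 million i.i.d. samples from N(0, 1), spread across 10 partitions.
val u = normalRDD(sc, 1000000L, 10)
// Shift and scale to N(1, 4): mean 1, standard deviation 2.
val v = u.map(x => 1.0 + 2.0 * x)
```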