[SPARK-18480][DOCS] Fix wrong links for ML guide docs

## What changes were proposed in this pull request?
1. There are two `[Graph.partitionBy]` link definitions in `graphx-programming-guide.md`; the first one had no effect.
2. `DataFrame`, `Transformer`, `Pipeline` and `Parameter` in `ml-pipeline.md` were linked to `ml-guide.html` by mistake.
3. The link to `PythonMLLibAPI` in `mllib-linear-methods.md` was not accessible, because the class `PythonMLLibAPI` is private.
4. Other link updates.

## How was this patch tested?
Manual tests.

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #15912 from zhengruifeng/md_fix.
Zheng RuiFeng 2016-11-17 13:40:16 +00:00 committed by Sean Owen
parent de77c67750
commit cdaf4ce9fe
8 changed files with 19 additions and 22 deletions


@@ -36,7 +36,6 @@ description: GraphX graph processing library guide for Spark SPARK_VERSION_SHORT
[Graph.fromEdgeTuples]: api/scala/index.html#org.apache.spark.graphx.Graph$@fromEdgeTuples[VD](RDD[(VertexId,VertexId)],VD,Option[PartitionStrategy])(ClassTag[VD]):Graph[VD,Int]
[Graph.fromEdges]: api/scala/index.html#org.apache.spark.graphx.Graph$@fromEdges[VD,ED](RDD[Edge[ED]],VD)(ClassTag[VD],ClassTag[ED]):Graph[VD,ED]
[PartitionStrategy]: api/scala/index.html#org.apache.spark.graphx.PartitionStrategy
-[Graph.partitionBy]: api/scala/index.html#org.apache.spark.graphx.Graph$@partitionBy(partitionStrategy:org.apache.spark.graphx.PartitionStrategy):org.apache.spark.graphx.Graph[VD,ED]
[PageRank]: api/scala/index.html#org.apache.spark.graphx.lib.PageRank$
[ConnectedComponents]: api/scala/index.html#org.apache.spark.graphx.lib.ConnectedComponents$
[TriangleCount]: api/scala/index.html#org.apache.spark.graphx.lib.TriangleCount$


@@ -984,7 +984,7 @@ Random forests combine many decision trees in order to reduce the risk of overfi
The `spark.ml` implementation supports random forests for binary and multiclass classification and for regression,
using both continuous and categorical features.
-For more information on the algorithm itself, please see the [`spark.mllib` documentation on random forests](mllib-ensembles.html).
+For more information on the algorithm itself, please see the [`spark.mllib` documentation on random forests](mllib-ensembles.html#random-forests).
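
As a quick illustration of the `spark.ml` random forest API described above (not taken from this patch; the toy data, column names, and parameter values are assumptions):

```scala
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rf-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Toy training data with a label column and a vector of continuous features.
val training = Seq(
  (0.0, Vectors.dense(0.0, 1.1, 0.1)),
  (1.0, Vectors.dense(2.0, 1.0, -1.0)),
  (0.0, Vectors.dense(2.0, 1.3, 1.0)),
  (1.0, Vectors.dense(0.0, 1.2, -0.5))
).toDF("label", "features")

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setNumTrees(20)   // number of trees in the ensemble
  .setMaxDepth(5)    // maximum depth of each tree

val model = rf.fit(training)   // the Estimator produces a fitted model (a Transformer)
model.transform(training).select("label", "prediction").show()
```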
### Inputs and Outputs
@@ -1065,7 +1065,7 @@ GBTs iteratively train decision trees in order to minimize a loss function.
The `spark.ml` implementation supports GBTs for binary classification and for regression,
using both continuous and categorical features.
-For more information on the algorithm itself, please see the [`spark.mllib` documentation on GBTs](mllib-ensembles.html).
+For more information on the algorithm itself, please see the [`spark.mllib` documentation on GBTs](mllib-ensembles.html#gradient-boosted-trees-gbts).
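
A comparable sketch for the GBT classifier, assuming the same `training` DataFrame ("label" and "features" columns) as in the random forest sketch above:

```scala
import org.apache.spark.ml.classification.GBTClassifier

// Illustrative only: `training` is assumed to be a DataFrame with "label" and "features" columns.
val gbt = new GBTClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setMaxIter(10)    // number of boosting iterations (i.e. trees)
  .setMaxDepth(3)

val gbtModel = gbt.fit(training)
gbtModel.transform(training).select("label", "prediction").show()
```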
### Inputs and Outputs


@@ -710,7 +710,7 @@ for more details on the API.
`VectorIndexer` helps index categorical features in datasets of `Vector`s.
It can both automatically decide which features are categorical and convert original values to category indices. Specifically, it does the following:
-1. Take an input column of type [Vector](api/scala/index.html#org.apache.spark.mllib.linalg.Vector) and a parameter `maxCategories`.
+1. Take an input column of type [Vector](api/scala/index.html#org.apache.spark.ml.linalg.Vector) and a parameter `maxCategories`.
2. Decide which features should be categorical based on the number of distinct values, where features with at most `maxCategories` are declared categorical.
3. Compute 0-based category indices for each categorical feature.
4. Index categorical features and transform original feature values to indices.
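
To make the steps above concrete, a minimal sketch of `VectorIndexer` on toy data (the data and the `maxCategories` value are illustrative assumptions):

```scala
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("vector-indexer-sketch").master("local[*]").getOrCreate()

// Feature 0 has four distinct values (kept continuous); feature 1 has three
// distinct values, so with maxCategories = 3 it is declared categorical.
val data = Seq(
  Vectors.dense(2.5, 1.0),
  Vectors.dense(3.7, 0.0),
  Vectors.dense(1.1, 1.0),
  Vectors.dense(0.4, 2.0)
)
val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")

val indexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(3)

val indexerModel = indexer.fit(df)        // decides which features are categorical
indexerModel.transform(df).show(false)    // categorical values become 0-based indices
```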


@@ -38,26 +38,26 @@ algorithms into a single pipeline, or workflow.
This section covers the key concepts introduced by the Pipelines API, where the pipeline concept is
mostly inspired by the [scikit-learn](http://scikit-learn.org/) project.
-* **[`DataFrame`](ml-guide.html#dataframe)**: This ML API uses `DataFrame` from Spark SQL as an ML
+* **[`DataFrame`](ml-pipeline.html#dataframe)**: This ML API uses `DataFrame` from Spark SQL as an ML
dataset, which can hold a variety of data types.
E.g., a `DataFrame` could have different columns storing text, feature vectors, true labels, and predictions.
-* **[`Transformer`](ml-guide.html#transformers)**: A `Transformer` is an algorithm which can transform one `DataFrame` into another `DataFrame`.
+* **[`Transformer`](ml-pipeline.html#transformers)**: A `Transformer` is an algorithm which can transform one `DataFrame` into another `DataFrame`.
E.g., an ML model is a `Transformer` which transforms a `DataFrame` with features into a `DataFrame` with predictions.
-* **[`Estimator`](ml-guide.html#estimators)**: An `Estimator` is an algorithm which can be fit on a `DataFrame` to produce a `Transformer`.
+* **[`Estimator`](ml-pipeline.html#estimators)**: An `Estimator` is an algorithm which can be fit on a `DataFrame` to produce a `Transformer`.
E.g., a learning algorithm is an `Estimator` which trains on a `DataFrame` and produces a model.
-* **[`Pipeline`](ml-guide.html#pipeline)**: A `Pipeline` chains multiple `Transformer`s and `Estimator`s together to specify an ML workflow.
+* **[`Pipeline`](ml-pipeline.html#pipeline)**: A `Pipeline` chains multiple `Transformer`s and `Estimator`s together to specify an ML workflow.
-* **[`Parameter`](ml-guide.html#parameters)**: All `Transformer`s and `Estimator`s now share a common API for specifying parameters.
+* **[`Parameter`](ml-pipeline.html#parameters)**: All `Transformer`s and `Estimator`s now share a common API for specifying parameters.
## DataFrame
Machine learning can be applied to a wide variety of data types, such as vectors, text, images, and structured data.
This API adopts the `DataFrame` from Spark SQL in order to support a variety of data types.
-`DataFrame` supports many basic and structured types; see the [Spark SQL datatype reference](sql-programming-guide.html#spark-sql-datatype-reference) for a list of supported types.
+`DataFrame` supports many basic and structured types; see the [Spark SQL datatype reference](sql-programming-guide.html#data-types) for a list of supported types.
In addition to the types listed in the Spark SQL guide, `DataFrame` can use ML [`Vector`](mllib-data-types.html#local-vector) types.
A `DataFrame` can be created either implicitly or explicitly from a regular `RDD`. See the code examples below and the [Spark SQL programming guide](sql-programming-guide.html) for examples.
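
By way of illustration, a minimal sketch that ties these concepts together: a `DataFrame` built explicitly from a local `Seq`, two `Transformer`s, an `Estimator`, and a `Pipeline` chaining them (toy data and column names are assumptions, loosely mirroring the examples later in this guide):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("pipeline-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// DataFrame created explicitly from a local Seq; it could equally come from an RDD.
val training = Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0)
).toDF("id", "text", "label")

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")       // Transformer
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")   // Transformer
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)              // Estimator

// Fitting the Pipeline (an Estimator) yields a PipelineModel (a Transformer).
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)
model.transform(training).select("id", "text", "prediction").show()
```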


@@ -139,7 +139,7 @@ and logistic regression.
Linear SVMs supports only binary classification, while logistic regression supports both binary and
multiclass classification problems.
For both methods, `spark.mllib` supports L1 and L2 regularized variants.
-The training data set is represented by an RDD of [LabeledPoint](mllib-data-types.html) in MLlib,
+The training data set is represented by an RDD of [LabeledPoint](mllib-data-types.html#labeled-point) in MLlib,
where labels are class indices starting from zero: $0, 1, 2, \ldots$.
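
As a rough sketch of what that looks like in `spark.mllib` (the toy data is an assumption, and an active `SparkContext` named `sc`, e.g. from `spark-shell`, is assumed):

```scala
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Each training example is a LabeledPoint: a class index (0.0 or 1.0) plus a feature vector.
val training = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(0.0, 1.2)),
  LabeledPoint(1.0, Vectors.dense(3.1, 0.4)),
  LabeledPoint(0.0, Vectors.dense(0.2, 1.8)),
  LabeledPoint(1.0, Vectors.dense(2.9, 0.1))
))

val numIterations = 100
val model = SVMWithSGD.train(training, numIterations)   // linear SVM, L2-regularized by default
println(model.predict(Vectors.dense(1.5, 0.5)))         // predicted class index: 0.0 or 1.0
```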
### Linear Support Vector Machines (SVMs)
@@ -491,5 +491,3 @@ Algorithms are all implemented in Scala:
* [RidgeRegressionWithSGD](api/scala/index.html#org.apache.spark.mllib.regression.RidgeRegressionWithSGD)
* [LassoWithSGD](api/scala/index.html#org.apache.spark.mllib.regression.LassoWithSGD)
-Python calls the Scala implementation via
-[PythonMLLibAPI](api/scala/index.html#org.apache.spark.mllib.api.python.PythonMLLibAPI).


@@ -40,7 +40,7 @@ private[ml] trait LSHParams extends HasInputCol with HasOutputCol {
* @group param
*/
final val outputDim: IntParam = new IntParam(this, "outputDim", "output dimension, where" +
-"increasing dimensionality lowers the false negative rate, and decreasing dimensionality" +
+" increasing dimensionality lowers the false negative rate, and decreasing dimensionality" +
" improves the running performance", ParamValidators.gt(0))
/** @group getParam */


@@ -34,7 +34,7 @@ private[spark] object GradientBoostedTrees extends Logging {
/**
* Method to train a gradient boosting model
-* @param input Training dataset: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]].
+* @param input Training dataset: RDD of [[LabeledPoint]].
* @param seed Random seed.
* @return tuple of ensemble models and weights:
* (array of decision tree models, array of model weights)
@@ -59,7 +59,7 @@ private[spark] object GradientBoostedTrees extends Logging {
/**
* Method to validate a gradient boosting model
-* @param input Training dataset: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]].
+* @param input Training dataset: RDD of [[LabeledPoint]].
* @param validationInput Validation dataset.
* This dataset should be different from the training dataset,
* but it should follow the same distribution.
@@ -162,7 +162,7 @@ private[spark] object GradientBoostedTrees extends Logging {
* Method to calculate error of the base learner for the gradient boosting calculation.
* Note: This method is not used by the gradient boosting algorithm but is useful for debugging
* purposes.
-* @param data Training dataset: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]].
+* @param data Training dataset: RDD of [[LabeledPoint]].
* @param trees Boosted Decision Tree models
* @param treeWeights Learning rates at each boosting iteration.
* @param loss evaluation metric.
@@ -184,7 +184,7 @@ private[spark] object GradientBoostedTrees extends Logging {
/**
* Method to compute error or loss for every iteration of gradient boosting.
*
-* @param data RDD of [[org.apache.spark.mllib.regression.LabeledPoint]]
+* @param data RDD of [[LabeledPoint]]
* @param trees Boosted Decision Tree models
* @param treeWeights Learning rates at each boosting iteration.
* @param loss evaluation metric.


@@ -82,7 +82,7 @@ private[spark] object RandomForest extends Logging {
/**
* Train a random forest.
*
-* @param input Training data: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]]
+* @param input Training data: RDD of [[LabeledPoint]]
* @return an unweighted set of trees
*/
def run(
@@ -343,7 +343,7 @@ private[spark] object RandomForest extends Logging {
/**
* Given a group of nodes, this finds the best split for each node.
*
-* @param input Training data: RDD of [[org.apache.spark.ml.tree.impl.TreePoint]]
+* @param input Training data: RDD of [[TreePoint]]
* @param metadata Learning and dataset metadata
* @param topNodesForGroup For each tree in group, tree index -> root node.
* Used for matching instances with nodes.
@@ -854,10 +854,10 @@ private[spark] object RandomForest extends Logging {
* and for multiclass classification with a high-arity feature,
* there is one bin per category.
*
-* @param input Training data: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]]
+* @param input Training data: RDD of [[LabeledPoint]]
* @param metadata Learning and dataset metadata
* @param seed random seed
-* @return Splits, an Array of [[org.apache.spark.mllib.tree.model.Split]]
+* @return Splits, an Array of [[Split]]
* of size (numFeatures, numSplits)
*/
protected[tree] def findSplits(