[SPARK-15394][ML][DOCS] User guide typos and grammar audit
## What changes were proposed in this pull request?

Correct some typos and incorrectly worded sentences.

## How was this patch tested?

Doc changes only.

Note that many of these changes were identified by whomfire01

Author: sethah <seth.hendrickson16@gmail.com>

Closes #13180 from sethah/ml_guide_audit.
This commit is contained in:
parent 47a2940da9
commit 5e203505f1
@@ -236,9 +236,9 @@ Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.classificat

 Multilayer perceptron classifier (MLPC) is a classifier based on the [feedforward artificial neural network](https://en.wikipedia.org/wiki/Feedforward_neural_network).
 MLPC consists of multiple layers of nodes.
-Each layer is fully connected to the next layer in the network. Nodes in the input layer represent the input data. All other nodes maps inputs to the outputs
-by performing linear combination of the inputs with the node's weights `$\wv$` and bias `$\bv$` and applying an activation function.
-It can be written in matrix form for MLPC with `$K+1$` layers as follows:
+Each layer is fully connected to the next layer in the network. Nodes in the input layer represent the input data. All other nodes map inputs to outputs
+by a linear combination of the inputs with the node's weights `$\wv$` and bias `$\bv$` and applying an activation function.
+This can be written in matrix form for MLPC with `$K+1$` layers as follows:
 `\[
 \mathrm{y}(\x) = \mathrm{f_K}(...\mathrm{f_2}(\wv_2^T\mathrm{f_1}(\wv_1^T \x+b_1)+b_2)...+b_K)
 \]`
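For orientation, a minimal Scala sketch of fitting the model described above; the `train` DataFrame and its `label`/`features` columns are assumed here, not part of this diff:

~~~
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

// Layer sizes: 4 input features, two hidden layers, 3 output classes.
val layers = Array[Int](4, 5, 4, 3)
val trainer = new MultilayerPerceptronClassifier()
  .setLayers(layers)
  .setBlockSize(128)
  .setSeed(1234L)
  .setMaxIter(100)
val model = trainer.fit(train) // `train`: assumed DataFrame with "label"/"features"
~~~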
@@ -252,7 +252,7 @@ Nodes in the output layer use softmax function:
 \]`
 The number of nodes `$N$` in the output layer corresponds to the number of classes.

-MLPC employs backpropagation for learning the model. We use logistic loss function for optimization and L-BFGS as optimization routine.
+MLPC employs backpropagation for learning the model. We use the logistic loss function for optimization and L-BFGS as an optimization routine.

 **Example**

@@ -311,9 +311,9 @@ Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.classificat

 ## Naive Bayes

-[Naive Bayes](http://en.wikipedia.org/wiki/Naive_Bayes_classifier) are a family of simple
+[Naive Bayes classifiers](http://en.wikipedia.org/wiki/Naive_Bayes_classifier) are a family of simple
 probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence
-assumptions between the features. The spark.ml implementation currently supports both [multinomial
+assumptions between the features. The `spark.ml` implementation currently supports both [multinomial
 naive Bayes](http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html)
 and [Bernoulli naive Bayes](http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html).
 More information can be found in the section on [Naive Bayes in MLlib](mllib-naive-bayes.html#naive-bayes-sparkmllib).
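A minimal sketch of the `spark.ml` API described above, assuming a `train` DataFrame with `label` and `features` columns:

~~~
import org.apache.spark.ml.classification.NaiveBayes

val nb = new NaiveBayes()
  .setModelType("multinomial") // or "bernoulli"
  .setSmoothing(1.0)           // additive (Laplace) smoothing
val model = nb.fit(train)      // `train` is an assumed DataFrame
~~~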
@@ -482,11 +482,11 @@ Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.regression.

 In `spark.ml`, we implement the [Accelerated failure time (AFT)](https://en.wikipedia.org/wiki/Accelerated_failure_time_model)
 model which is a parametric survival regression model for censored data.
-It describes a model for the log of survival time, so it's often called
-log-linear model for survival analysis. Different from
+It describes a model for the log of survival time, so it's often called a
+log-linear model for survival analysis. Different from a
 [Proportional hazards](https://en.wikipedia.org/wiki/Proportional_hazards_model) model
-designed for the same purpose, the AFT model is more easily to parallelize
-because each instance contribute to the objective function independently.
+designed for the same purpose, the AFT model is easier to parallelize
+because each instance contributes to the objective function independently.

 Given the values of the covariates $x^{'}$, for random lifetime $t_{i}$ of
 subjects i = 1, ..., n, with possible right-censoring,
@@ -501,10 +501,10 @@ assumes the form:
 \iota(\beta,\sigma)=\sum_{i=1}^{n}[-\delta_{i}\log\sigma+\delta_{i}\log{f_{0}}(\epsilon_{i})+(1-\delta_{i})\log{S_{0}(\epsilon_{i})}]
 \]`
 Where $S_{0}(\epsilon_{i})$ is the baseline survivor function,
-and $f_{0}(\epsilon_{i})$ is corresponding density function.
+and $f_{0}(\epsilon_{i})$ is the corresponding density function.

 The most commonly used AFT model is based on the Weibull distribution of the survival time.
-The Weibull distribution for lifetime corresponding to extreme value distribution for
+The Weibull distribution for lifetime corresponds to the extreme value distribution for the
 log of the lifetime, and the $S_{0}(\epsilon)$ function is:
 `\[
 S_{0}(\epsilon_{i})=\exp(-e^{\epsilon_{i}})
@@ -513,7 +513,7 @@ the $f_{0}(\epsilon_{i})$ function is:
 `\[
 f_{0}(\epsilon_{i})=e^{\epsilon_{i}}\exp(-e^{\epsilon_{i}})
 \]`
-The log-likelihood function for AFT model with Weibull distribution of lifetime is:
+The log-likelihood function for AFT model with a Weibull distribution of lifetime is:
 `\[
 \iota(\beta,\sigma)= -\sum_{i=1}^n[\delta_{i}\log\sigma-\delta_{i}\epsilon_{i}+e^{\epsilon_{i}}]
 \]`
@@ -529,7 +529,7 @@ The gradient functions for $\beta$ and $\log\sigma$ respectively are:

 The AFT model can be formulated as a convex optimization problem,
 i.e. the task of finding a minimizer of a convex function $-\iota(\beta,\sigma)$
-that depends coefficients vector $\beta$ and the log of scale parameter $\log\sigma$.
+that depends on the coefficients vector $\beta$ and the log of scale parameter $\log\sigma$.
 The optimization algorithm underlying the implementation is L-BFGS.
 The implementation matches the result from R's survival function
 [survreg](https://stat.ethz.ch/R-manual/R-devel/library/survival/html/survreg.html)
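To ground the discussion, a rough sketch of fitting the AFT model; the `training` DataFrame with `label`, `censor`, and `features` columns is assumed:

~~~
import org.apache.spark.ml.regression.AFTSurvivalRegression

val aft = new AFTSurvivalRegression()
  .setQuantileProbabilities(Array(0.3, 0.6))
  .setQuantilesCol("quantiles")
// L-BFGS minimizes the convex -iota(beta, sigma) over beta and log(sigma).
val model = aft.fit(training)  // `training` is an assumed DataFrame
println(s"Coefficients: ${model.coefficients} Scale: ${model.scale}")
~~~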
@@ -89,7 +89,7 @@ Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.clustering.
 ## Latent Dirichlet allocation (LDA)

 `LDA` is implemented as an `Estimator` that supports both `EMLDAOptimizer` and `OnlineLDAOptimizer`,
-and generates a `LDAModel` as the base models. Expert users may cast a `LDAModel` generated by
+and generates a `LDAModel` as the base model. Expert users may cast a `LDAModel` generated by
 `EMLDAOptimizer` to a `DistributedLDAModel` if needed.

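A brief sketch of the `Estimator` usage described here; the `dataset` DataFrame with a `features` vector column of token counts is assumed:

~~~
import org.apache.spark.ml.clustering.LDA

val lda = new LDA().setK(10).setMaxIter(10)  // online optimizer by default
val model = lda.fit(dataset)                 // returns an LDAModel
val topics = model.describeTopics(maxTermsPerTopic = 3)
~~~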
 <div class="codetabs">
@@ -60,7 +60,7 @@ best parameter learned from a sampled subset to the full dataset and expect simi
 <div class="codetabs">
 <div data-lang="scala" markdown="1">

-In the following example, we load rating data from the
+In the following example, we load ratings data from the
 [MovieLens dataset](http://grouplens.org/datasets/movielens/), each row
 consisting of a user, a movie, a rating and a timestamp.
 We then train an ALS model which assumes, by default, that the ratings are
@@ -91,7 +91,7 @@ val als = new ALS()

 <div data-lang="java" markdown="1">

-In the following example, we load rating data from the
+In the following example, we load ratings data from the
 [MovieLens dataset](http://grouplens.org/datasets/movielens/), each row
 consisting of a user, a movie, a rating and a timestamp.
 We then train an ALS model which assumes, by default, that the ratings are
@@ -122,7 +122,7 @@ ALS als = new ALS()

 <div data-lang="python" markdown="1">

-In the following example, we load rating data from the
+In the following example, we load ratings data from the
 [MovieLens dataset](http://grouplens.org/datasets/movielens/), each row
 consisting of a user, a movie, a rating and a timestamp.
 We then train an ALS model which assumes, by default, that the ratings are
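For reference across all three language tabs, the Scala version of the training step looks roughly like this; the column names are assumed to match the MovieLens schema above:

~~~
import org.apache.spark.ml.recommendation.ALS

val als = new ALS()
  .setMaxIter(5)
  .setRegParam(0.01)
  .setUserCol("userId")
  .setItemCol("movieId")
  .setRatingCol("rating")
val model = als.fit(training)            // `training` is an assumed DataFrame
val predictions = model.transform(test)  // adds a "prediction" column
~~~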
@@ -26,7 +26,7 @@ to a document in the corpus. Denote a term by `$t$`, a document by `$d$`, and th
 Term frequency `$TF(t, d)$` is the number of times that term `$t$` appears in document `$d$`, while
 document frequency `$DF(t, D)$` is the number of documents that contains term `$t$`. If we only use
 term frequency to measure the importance, it is very easy to over-emphasize terms that appear very
-often but carry little information about the document, e.g., "a", "the", and "of". If a term appears
+often but carry little information about the document, e.g. "a", "the", and "of". If a term appears
 very often across the corpus, it means it doesn't carry special information about a particular document.
 Inverse document frequency is a numerical measure of how much information a term provides:
 `\[
@@ -50,7 +50,7 @@ A raw feature is mapped into an index (term) by applying a hash function. Then t
 are calculated based on the mapped indices. This approach avoids the need to compute a global
 term-to-index map, which can be expensive for a large corpus, but it suffers from potential hash
 collisions, where different raw features may become the same term after hashing. To reduce the
-chance of collision, we can increase the target feature dimension, i.e., the number of buckets
+chance of collision, we can increase the target feature dimension, i.e. the number of buckets
 of the hash table. Since a simple modulo is used to transform the hash function to a column index,
 it is advisable to use a power of two as the feature dimension, otherwise the features will
 not be mapped evenly to the columns. The default feature dimension is `$2^{18} = 262,144$`.
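A compact sketch of the TF-IDF flow just described, assuming a `sentenceData` DataFrame with a `sentence` string column:

~~~
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)
// Power-of-two feature dimension, as advised above.
val hashingTF = new HashingTF()
  .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(262144)
val featurized = hashingTF.transform(wordsData)
val idfModel = new IDF().setInputCol("rawFeatures").setOutputCol("features")
  .fit(featurized)
val rescaled = idfModel.transform(featurized)
~~~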
@@ -104,7 +104,7 @@ the [IDF Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.IDF) for mor
 `Word2Vec` is an `Estimator` which takes sequences of words representing documents and trains a
 `Word2VecModel`. The model maps each word to a unique fixed-size vector. The `Word2VecModel`
 transforms each document into a vector using the average of all words in the document; this vector
-can then be used for as features for prediction, document similarity calculations, etc.
+can then be used as features for prediction, document similarity calculations, etc.
 Please refer to the [MLlib user guide on Word2Vec](mllib-feature-extraction.html#word2vec) for more
 details.

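A minimal sketch, assuming a `documentDF` DataFrame whose `text` column holds `Seq[String]` word sequences:

~~~
import org.apache.spark.ml.feature.Word2Vec

val word2Vec = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("result")
  .setVectorSize(3)
  .setMinCount(0)
val model = word2Vec.fit(documentDF)      // trains a Word2VecModel
val result = model.transform(documentDF)  // averages word vectors per document
~~~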
@@ -140,12 +140,12 @@ for more details on the API.

 `CountVectorizer` and `CountVectorizerModel` aim to help convert a collection of text documents
 to vectors of token counts. When an a-priori dictionary is not available, `CountVectorizer` can
-be used as an `Estimator` to extract the vocabulary and generates a `CountVectorizerModel`. The
+be used as an `Estimator` to extract the vocabulary, and generates a `CountVectorizerModel`. The
 model produces sparse representations for the documents over the vocabulary, which can then be
 passed to other algorithms like LDA.

 During the fitting process, `CountVectorizer` will select the top `vocabSize` words ordered by
-term frequency across the corpus. An optional parameter "minDF" also affect the fitting process
+term frequency across the corpus. An optional parameter "minDF" also affects the fitting process
 by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be
 included in the vocabulary.

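A short sketch of the fitting step, assuming a `df` DataFrame with a `texts` column of `Array[String]`:

~~~
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}

val cvModel: CountVectorizerModel = new CountVectorizer()
  .setInputCol("texts")
  .setOutputCol("vector")
  .setVocabSize(3)   // keep the top 3 terms by corpus frequency
  .setMinDF(2)       // a term must appear in at least 2 documents
  .fit(df)
val vectorized = cvModel.transform(df)
~~~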
@@ -161,8 +161,8 @@ Assume that we have the following DataFrame with columns `id` and `texts`:
 ~~~~

 each row in `texts` is a document of type Array[String].
-Invoking fit of `CountVectorizer` produces a `CountVectorizerModel` with vocabulary (a, b, c),
-then the output column "vector" after transformation contains:
+Invoking fit of `CountVectorizer` produces a `CountVectorizerModel` with vocabulary (a, b, c).
+Then the output column "vector" after transformation contains:

 ~~~~
 id | texts | vector
@@ -171,7 +171,7 @@ then the output column "vector" after transformation contains:
 1 | Array("a", "b", "b", "c", "a") | (3,[0,1,2],[2.0,2.0,1.0])
 ~~~~

-each vector represents the token counts of the document over the vocabulary.
+Each vector represents the token counts of the document over the vocabulary.

 <div class="codetabs">
 <div data-lang="scala" markdown="1">
@@ -477,8 +477,7 @@ for more details on the API.
 ## StringIndexer

 `StringIndexer` encodes a string column of labels to a column of label indices.
-The indices are in `[0, numLabels)`, ordered by label frequencies.
-So the most frequent label gets index `0`.
+The indices are in `[0, numLabels)`, ordered by label frequencies, so the most frequent label gets index `0`.
 If the input column is numeric, we cast it to string and index the string
 values. When downstream pipeline components such as `Estimator` or
 `Transformer` make use of this string-indexed label, you must set the input
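A minimal sketch, assuming a `df` DataFrame with a string `category` column:

~~~
import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
// The most frequent category receives index 0.0.
val indexed = indexer.fit(df).transform(df)
~~~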
@@ -585,7 +584,7 @@ for more details on the API.
 ## IndexToString

 Symmetrically to `StringIndexer`, `IndexToString` maps a column of label indices
-back to a column containing the original labels as strings. The common use case
+back to a column containing the original labels as strings. A common use case
 is to produce indices from labels with `StringIndexer`, train a model with those
 indices and retrieve the original labels from the column of predicted indices
 with `IndexToString`. However, you are free to supply your own labels.
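The round trip sketched, continuing the hypothetical `indexed` DataFrame from the `StringIndexer` sketch above:

~~~
import org.apache.spark.ml.feature.IndexToString

val converter = new IndexToString()
  .setInputCol("categoryIndex")
  .setOutputCol("originalCategory")
// Labels are recovered from the column metadata written by StringIndexer.
val converted = converter.transform(indexed)
~~~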
@@ -652,7 +651,7 @@ for more details on the API.

 ## OneHotEncoder

-[One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features
+[One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features.

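A rough sketch, using `OneHotEncoder` as the transformer it is in this Spark release; the `indexed` DataFrame with a `categoryIndex` column is assumed:

~~~
import org.apache.spark.ml.feature.OneHotEncoder

val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")  // sparse binary vector, e.g. (2,[0],[1.0])
val encoded = encoder.transform(indexed)
~~~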
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
@@ -888,7 +887,7 @@ for more details on the API.

 * `splits`: Parameter for mapping continuous features into buckets. With n+1 splits, there are n buckets. A bucket defined by splits x,y holds values in the range [x,y) except the last bucket, which also includes y. Splits should be strictly increasing. Values at -inf, inf must be explicitly provided to cover all Double values; Otherwise, values outside the splits specified will be treated as errors. Two examples of `splits` are `Array(Double.NegativeInfinity, 0.0, 1.0, Double.PositiveInfinity)` and `Array(0.0, 1.0, 2.0)`.

-Note that if you have no idea of the upper bound and lower bound of the targeted column, you would better add the `Double.NegativeInfinity` and `Double.PositiveInfinity` as the bounds of your splits to prevent a potential out of Bucketizer bounds exception.
+Note that if you have no idea of the upper and lower bounds of the targeted column, you should add `Double.NegativeInfinity` and `Double.PositiveInfinity` as the bounds of your splits to prevent a potential out of Bucketizer bounds exception.

 Note also that the splits that you provided have to be in strictly increasing order, i.e. `s0 < s1 < s2 < ... < sn`.

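A minimal sketch with explicit infinite bounds, as recommended above; the `df` DataFrame with a numeric `features` column is assumed:

~~~
import org.apache.spark.ml.feature.Bucketizer

val splits = Array(Double.NegativeInfinity, -0.5, 0.0, 0.5, Double.PositiveInfinity)
val bucketizer = new Bucketizer()
  .setInputCol("features")
  .setOutputCol("bucketedFeatures")
  .setSplits(splits)  // strictly increasing: 4 buckets from 5 splits
val bucketed = bucketizer.transform(df)
~~~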
@@ -976,7 +975,7 @@ for more details on the API.
 Currently we only support SQL syntax like `"SELECT ... FROM __THIS__ ..."`
 where `"__THIS__"` represents the underlying table of the input dataset.
 The select clause specifies the fields, constants, and expressions to display in
-the output, it can be any select clause that Spark SQL supports. Users can also
+the output, and can be any select clause that Spark SQL supports. Users can also
 use Spark SQL built-in function and UDFs to operate on these selected columns.
 For example, `SQLTransformer` supports statements like:

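For instance, a statement of that shape; the `v1`/`v2` column names here are illustrative:

~~~
import org.apache.spark.ml.feature.SQLTransformer

val sqlTrans = new SQLTransformer().setStatement(
  "SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__")
val transformed = sqlTrans.transform(df)  // `df` is an assumed input DataFrame
~~~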
@@ -1121,7 +1120,7 @@ Assume that we have a DataFrame with the columns `id`, `hour`:
 ~~~

 `hour` is a continuous feature with `Double` type. We want to turn the continuous feature into
-categorical one. Given `numBuckets = 3`, we should get the following DataFrame:
+a categorical one. Given `numBuckets = 3`, we should get the following DataFrame:

 ~~~
 id | hour | result
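A sketch of the step that produces the `result` column; the `df` DataFrame with the `hour` column is assumed:

~~~
import org.apache.spark.ml.feature.QuantileDiscretizer

val discretizer = new QuantileDiscretizer()
  .setInputCol("hour")
  .setOutputCol("result")
  .setNumBuckets(3)
// An Estimator: quantile boundaries are computed from the data at fit time.
val result = discretizer.fit(df).transform(df)
~~~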
@@ -1171,19 +1170,19 @@ for more details on the API.
 `VectorSlicer` is a transformer that takes a feature vector and outputs a new feature vector with a
 sub-array of the original features. It is useful for extracting features from a vector column.

-`VectorSlicer` accepts a vector column with a specified indices, then outputs a new vector column
+`VectorSlicer` accepts a vector column with specified indices, then outputs a new vector column
 whose values are selected via those indices. There are two types of indices,

-1. Integer indices that represents the indices into the vector, `setIndices()`;
+1. Integer indices that represent the indices into the vector, `setIndices()`.

-2. String indices that represents the names of features into the vector, `setNames()`.
+2. String indices that represent the names of features into the vector, `setNames()`.
 *This requires the vector column to have an `AttributeGroup` since the implementation matches on
 the name field of an `Attribute`.*

 Specification by integer and string are both acceptable. Moreover, you can use integer index and
 string name simultaneously. At least one feature must be selected. Duplicate features are not
 allowed, so there can be no overlap between selected indices and names. Note that if names of
-features are selected, an exception will be threw out when encountering with empty input attributes.
+features are selected, an exception will be thrown if empty input attributes are encountered.

 The output vector will order features with the selected indices first (in the order given),
 followed by the selected names (in the order given).
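A sketch of both selection modes; note the actual setters take arrays, and the `df` DataFrame with a `userFeatures` column is assumed:

~~~
import org.apache.spark.ml.feature.VectorSlicer

val slicer = new VectorSlicer()
  .setInputCol("userFeatures")
  .setOutputCol("features")
  .setIndices(Array(1, 2))        // integer indices into the vector
// .setNames(Array("f2", "f3"))   // or names, given an AttributeGroup
val output = slicer.transform(df)
~~~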
@@ -1198,8 +1197,8 @@ Suppose that we have a DataFrame with the column `userFeatures`:
 [0.0, 10.0, 0.5]
 ~~~

-`userFeatures` is a vector column that contains three user features. Assuming that the first column
-of `userFeatures` are all zeros, so we want to remove it and only the last two columns are selected.
+`userFeatures` is a vector column that contains three user features. Assume that the first column
+of `userFeatures` are all zeros, so we want to remove it and select only the last two columns.
 The `VectorSlicer` selects the last two elements with `setIndices(1, 2)` then produces a new vector
 column named `features`:

@@ -1209,7 +1208,7 @@ column named `features`:
 [0.0, 10.0, 0.5] | [10.0, 0.5]
 ~~~

-Suppose also that we have a potential input attributes for the `userFeatures`, i.e.
+Suppose also that we have potential input attributes for the `userFeatures`, i.e.
 `["f1", "f2", "f3"]`, then we can use `setNames("f2", "f3")` to select them.

 ~~~
@@ -1337,8 +1336,8 @@ id | features | clicked
 9 | [1.0, 0.0, 15.0, 0.1] | 0.0
 ~~~

-If we use `ChiSqSelector` with a `numTopFeatures = 1`, then according to our label `clicked` the
-last column in our `features` chosen as the most useful feature:
+If we use `ChiSqSelector` with `numTopFeatures = 1`, then according to our label `clicked` the
+last column in our `features` is chosen as the most useful feature:

 ~~~
 id | features | clicked | selectedFeatures
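The selection step sketched, assuming a `df` DataFrame with `features` and `clicked` columns:

~~~
import org.apache.spark.ml.feature.ChiSqSelector

val selector = new ChiSqSelector()
  .setNumTopFeatures(1)
  .setFeaturesCol("features")
  .setLabelCol("clicked")
  .setOutputCol("selectedFeatures")
val result = selector.fit(df).transform(df)  // chi-squared test against the label
~~~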
@@ -47,7 +47,7 @@ mostly inspired by the [scikit-learn](http://scikit-learn.org/) project.
 E.g., a `DataFrame` could have different columns storing text, feature vectors, true labels, and predictions.

 * **[`Transformer`](ml-guide.html#transformers)**: A `Transformer` is an algorithm which can transform one `DataFrame` into another `DataFrame`.
-E.g., an ML model is a `Transformer` which transforms `DataFrame` with features into a `DataFrame` with predictions.
+E.g., an ML model is a `Transformer` which transforms a `DataFrame` with features into a `DataFrame` with predictions.

 * **[`Estimator`](ml-guide.html#estimators)**: An `Estimator` is an algorithm which can be fit on a `DataFrame` to produce a `Transformer`.
 E.g., a learning algorithm is an `Estimator` which trains on a `DataFrame` and produces a model.
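The `Estimator`/`Transformer` contract in a few lines of Scala; the `training` and `test` DataFrames are assumed:

~~~
import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression().setMaxIter(10)  // an Estimator
val model = lr.fit(training)                      // fit() produces a Transformer
val predictions = model.transform(test)           // DataFrame in, DataFrame out
~~~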
@@ -292,13 +292,13 @@ However, it is also a well-established method for choosing parameters which is m

 ## Example: model selection via train validation split
 In addition to `CrossValidator` Spark also offers `TrainValidationSplit` for hyper-parameter tuning.
-`TrainValidationSplit` only evaluates each combination of parameters once as opposed to k times in
-case of `CrossValidator`. It is therefore less expensive,
+`TrainValidationSplit` only evaluates each combination of parameters once, as opposed to k times in
+the case of `CrossValidator`. It is therefore less expensive,
 but will not produce as reliable results when the training dataset is not sufficiently large.

 `TrainValidationSplit` takes an `Estimator`, a set of `ParamMap`s provided in the `estimatorParamMaps` parameter,
 and an `Evaluator`.
-It begins by splitting the dataset into two parts using `trainRatio` parameter
+It begins by splitting the dataset into two parts using the `trainRatio` parameter
 which are used as separate training and test datasets. For example with `$trainRatio=0.75$` (default),
 `TrainValidationSplit` will generate a training and test dataset pair where 75% of the data is used for training and 25% for validation.
 Similar to `CrossValidator`, `TrainValidationSplit` also iterates through the set of `ParamMap`s.
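A condensed sketch of the setup just described, assuming a `training` DataFrame:

~~~
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}

val lr = new LinearRegression()
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .addGrid(lr.fitIntercept)     // boolean param expands to true/false
  .build()
val tvs = new TrainValidationSplit()
  .setEstimator(lr)
  .setEvaluator(new RegressionEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setTrainRatio(0.75)          // 75% train, 25% validation
val model = tvs.fit(training)   // each ParamMap is evaluated exactly once
~~~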