---
layout: global
title: Isotonic regression - RDD-based API
displayTitle: Regression - RDD-based API
---

## Isotonic regression

[Isotonic regression](http://en.wikipedia.org/wiki/Isotonic_regression)
belongs to the family of regression algorithms. Formally, isotonic regression is a problem where,
given a finite set of real numbers `$Y = {y_1, y_2, ..., y_n}$` representing observed responses
and `$X = {x_1, x_2, ..., x_n}$` the unknown response values to be fitted,
we find a function that minimises

`\begin{equation}
  f(x) = \sum_{i=1}^n w_i (y_i - x_i)^2
\end{equation}`

with respect to a complete order, subject to
`$x_1\le x_2\le ...\le x_n$`, where the `$w_i$` are positive weights.
The resulting function is called isotonic regression and it is unique.
It can be viewed as a least squares problem under an order restriction.
Essentially, isotonic regression is a
[monotonic function](http://en.wikipedia.org/wiki/Monotonic_function)
best fitting the original data points.

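To make the minimisation concrete, here is a single-machine pool-adjacent-violators sketch in plain Python. This is only an illustration of the technique, not `spark.mllib`'s parallel implementation; the helper name `pava` is ours.

```python
def pava(y, w=None):
    """Weighted pool adjacent violators: returns the monotonically
    non-decreasing fit minimising sum_i w_i * (y_i - x_i)^2."""
    if w is None:
        w = [1.0] * len(y)
    # Each block holds [weighted mean, total weight, count of pooled points].
    blocks = []
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # Pool adjacent blocks while they violate the ordering constraint.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / wt, wt, n1 + n2])
    # Expand the pooled blocks back to one fitted value per input point.
    fit = []
    for m, _, n in blocks:
        fit.extend([m] * n)
    return fit

print(pava([1.0, 3.0, 2.0, 4.0]))  # [1.0, 2.5, 2.5, 4.0]
```

The out-of-order pair (3.0, 2.0) is pooled into its weighted mean 2.5, which is what makes the fitted sequence both monotonic and optimal for the weighted squared error.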
`spark.mllib` supports a
[pool adjacent violators algorithm](http://doi.org/10.1198/TECH.2010.10111)
which uses an approach to
[parallelizing isotonic regression](http://doi.org/10.1007/978-3-642-99789-1_10).
The training input is an RDD of tuples of three double values that represent
label, feature and weight, in this order. Additionally, the `IsotonicRegression` algorithm has one
optional parameter called `$isotonic$`, defaulting to true.
This argument specifies whether the isotonic regression is
isotonic (monotonically increasing) or antitonic (monotonically decreasing).

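For illustration, (label, feature, weight) triples can be derived from the label,feature text format used in the examples below, with a default weight of 1.0. The helper name `parse_point` is hypothetical; in Spark, such a function would typically be applied to the input lines via `rdd.map` to build the training RDD.

```python
def parse_point(line):
    """Parse a 'label,feature' line into the (label, feature, weight)
    triple expected as training input, using a default weight of 1.0."""
    label, feature = line.split(",")
    return (float(label), float(feature), 1.0)

points = [parse_point(l) for l in ["4710.28,500.00", "7703.51,800.00"]]
print(points)  # [(4710.28, 500.0, 1.0), (7703.51, 800.0, 1.0)]
```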
Training returns an `IsotonicRegressionModel` that can be used to predict
labels for both known and unknown features. The result of isotonic regression
is treated as a piecewise linear function. The rules for prediction therefore are:

* If the prediction input exactly matches a training feature
  then the associated prediction is returned. In case there are multiple predictions with the same
  feature then one of them is returned. Which one is undefined
  (same as `java.util.Arrays.binarySearch`).
* If the prediction input is lower or higher than all training features
  then the prediction with the lowest or highest feature is returned respectively.
  In case there are multiple predictions with the same feature
  then the lowest or highest is returned respectively.
* If the prediction input falls between two training features then the prediction is treated
  as a piecewise linear function and the interpolated value is calculated from the
  predictions of the two closest features. In case there are multiple values
  with the same feature then the same rules as in the previous point are used.

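The prediction rules above can be sketched in plain Python, given sorted feature boundaries and their fitted predictions. This is an illustrative sketch of the documented rules only, not Spark's implementation, and it assumes the boundaries are strictly increasing (i.e. duplicate features have already been resolved).

```python
import bisect

def predict(boundaries, predictions, feature):
    """Apply the documented prediction rules for a fitted isotonic model:
    exact match, clamping outside the training range, and linear
    interpolation between the two closest features."""
    # Inputs at or beyond the training range get the edge predictions.
    if feature <= boundaries[0]:
        return predictions[0]
    if feature >= boundaries[-1]:
        return predictions[-1]
    i = bisect.bisect_left(boundaries, feature)
    if boundaries[i] == feature:  # exact match with a training feature
        return predictions[i]
    # Otherwise interpolate linearly between the two closest features.
    x0, x1 = boundaries[i - 1], boundaries[i]
    y0, y1 = predictions[i - 1], predictions[i]
    return y0 + (y1 - y0) * (feature - x0) / (x1 - x0)

print(predict([1.0, 3.0, 5.0], [2.0, 2.0, 6.0], 4.0))  # 4.0
```

Here 4.0 lies halfway between the features 3.0 and 5.0, so the prediction is the midpoint of their fitted values 2.0 and 6.0.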
### Examples

<div class="codetabs">
<div data-lang="scala" markdown="1">
Data are read from a file where each line has the format label,feature,
e.g. 4710.28,500.00. The data are split into a training and a testing set.
A model is created using the training set, and a mean squared error is calculated from the predicted
labels and real labels in the test set.

Refer to the [`IsotonicRegression` Scala docs](api/scala/index.html#org.apache.spark.mllib.regression.IsotonicRegression) and [`IsotonicRegressionModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.regression.IsotonicRegressionModel) for details on the API.

{% include_example scala/org/apache/spark/examples/mllib/IsotonicRegressionExample.scala %}
</div>

<div data-lang="java" markdown="1">
Data are read from a file where each line has the format label,feature,
e.g. 4710.28,500.00. The data are split into a training and a testing set.
A model is created using the training set, and a mean squared error is calculated from the predicted
labels and real labels in the test set.

Refer to the [`IsotonicRegression` Java docs](api/java/org/apache/spark/mllib/regression/IsotonicRegression.html) and [`IsotonicRegressionModel` Java docs](api/java/org/apache/spark/mllib/regression/IsotonicRegressionModel.html) for details on the API.

{% include_example java/org/apache/spark/examples/mllib/JavaIsotonicRegressionExample.java %}
</div>

<div data-lang="python" markdown="1">
Data are read from a file where each line has the format label,feature,
e.g. 4710.28,500.00. The data are split into a training and a testing set.
A model is created using the training set, and a mean squared error is calculated from the predicted
labels and real labels in the test set.

Refer to the [`IsotonicRegression` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.regression.IsotonicRegression) and [`IsotonicRegressionModel` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.regression.IsotonicRegressionModel) for more details on the API.

{% include_example python/mllib/isotonic_regression_example.py %}
</div>
</div>