---
layout: global
title: Dimensionality Reduction - RDD-based API
displayTitle: Dimensionality Reduction - RDD-based API
---
* Table of contents
{:toc}
[Dimensionality reduction](http://en.wikipedia.org/wiki/Dimensionality_reduction) is the process
of reducing the number of variables under consideration.
It can be used to extract latent features from raw and noisy features
or compress data while maintaining the structure.
`spark.mllib` provides support for dimensionality reduction on the <a href="mllib-data-types.html#rowmatrix">RowMatrix</a> class.
## Singular value decomposition (SVD)
[Singular value decomposition (SVD)](http://en.wikipedia.org/wiki/Singular_value_decomposition)
factorizes a matrix into three matrices: $U$, $\Sigma$, and $V$ such that
`\[
A = U \Sigma V^T,
\]`
where
* $U$ is an orthonormal matrix, whose columns are called left singular vectors,
* $\Sigma$ is a diagonal matrix with non-negative diagonals in descending order,
whose diagonals are called singular values,
* $V$ is an orthonormal matrix, whose columns are called right singular vectors.
For large matrices, usually we don't need the complete factorization but only the top singular
values and their associated singular vectors. This can save storage, de-noise the data,
and recover the low-rank structure of the matrix.
If we keep the top $k$ singular values, then the dimensions of the resulting low-rank matrices will be:
* `$U$`: `$m \times k$`,
* `$\Sigma$`: `$k \times k$`,
* `$V$`: `$n \times k$`.
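
As a quick illustration, here is a minimal Scala sketch of a truncated SVD on a `RowMatrix` (the data is made up, and a `SparkContext` `sc` is assumed to be available, e.g. in `spark-shell`):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// A small m x n matrix with m = 4 rows and n = 3 columns.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0),
  Vectors.dense(7.0, 8.0, 9.0),
  Vectors.dense(10.0, 11.0, 12.0)))
val mat = new RowMatrix(rows)

// Keep the top k = 2 singular values and their singular vectors.
val svd = mat.computeSVD(2, computeU = true)
val U = svd.U  // RowMatrix of size m x k
val s = svd.s  // Vector of length k, singular values in descending order
val V = svd.V  // Matrix of size n x k
```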
### Performance
We assume $n$ is smaller than $m$. The singular values and the right singular vectors are derived
from the eigenvalues and the eigenvectors of the Gramian matrix $A^T A$. The matrix
storing the left singular vectors, $U$, is computed via the matrix multiplication
$U = A (V S^{-1})$ only if requested by the user via the `computeU` parameter (see the sketch after the list below).
The actual method to use is determined automatically based on the computational cost:
* If $n$ is small ($n < 100$) or $k$ is large compared with $n$ ($k > n / 2$), we compute the Gramian matrix
first and then compute its top eigenvalues and eigenvectors locally on the driver.
This requires a single pass with $O(n^2)$ storage on each executor and on the driver, and
$O(n^2 k)$ time on the driver.
* Otherwise, we compute $(A^T A) v$ in a distributed fashion and send it to
<a href="http://www.caam.rice.edu/software/ARPACK/">ARPACK</a> to
compute $(A^T A)$'s top eigenvalues and eigenvectors on the driver node. This requires $O(k)$
passes, $O(n)$ storage on each executor, and $O(n k)$ storage on the driver.
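
Because $U$ is only formed on request, a minimal sketch (reusing `mat` from the example above) that skips the extra multiplication when only the singular values and $V$ are needed:

```scala
// Only s and V are needed; U is not formed.
// (With computeU = false, the default, svd.U is null.)
val svdNoU = mat.computeSVD(2)
val s = svdNoU.s   // singular values, in descending order
val V = svdNoU.V   // right singular vectors, n x k
```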
### SVD Example
`spark.mllib` provides SVD functionality for row-oriented matrices, exposed through the
<a href="mllib-data-types.html#rowmatrix">RowMatrix</a> class.
<div class="codetabs">
<div data-lang="scala" markdown="1">
Refer to the [`SingularValueDecomposition` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.SingularValueDecomposition) for details on the API.
{% include_example scala/org/apache/spark/examples/mllib/SVDExample.scala %}
The same code applies to `IndexedRowMatrix` if `U` is defined as an
`IndexedRowMatrix`.
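
For instance, a minimal sketch of that variant (the indices and data are made up for illustration):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

// Rows tagged with explicit long indices.
val indexedRows = sc.parallelize(Seq(
  IndexedRow(0L, Vectors.dense(1.0, 2.0, 3.0)),
  IndexedRow(1L, Vectors.dense(4.0, 5.0, 6.0)),
  IndexedRow(2L, Vectors.dense(7.0, 8.0, 9.0))))
val indexedMat = new IndexedRowMatrix(indexedRows)

// Identical call; here U comes back as an IndexedRowMatrix.
val svd = indexedMat.computeSVD(2, computeU = true)
val U: IndexedRowMatrix = svd.U
```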
</div>

<div data-lang="java" markdown="1">
Refer to the [`SingularValueDecomposition` Java docs](api/java/org/apache/spark/mllib/linalg/SingularValueDecomposition.html) for details on the API.
{% include_example java/org/apache/spark/examples/mllib/JavaSVDExample.java %}
The same code applies to `IndexedRowMatrix` if `U` is defined as an
`IndexedRowMatrix`.
</div>

<div data-lang="python" markdown="1">

Refer to the [`SingularValueDecomposition` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.SingularValueDecomposition) for details on the API.
{% include_example python/mllib/svd_example.py %}
The same code applies to `IndexedRowMatrix` if `U` is defined as an
`IndexedRowMatrix`.
</div>
</div>
## Principal component analysis (PCA)
[Principal component analysis (PCA)](http://en.wikipedia.org/wiki/Principal_component_analysis) is a
statistical method to find a rotation such that the first coordinate has the largest variance
possible, and each succeeding coordinate, in turn, has the largest variance possible. The columns of
the rotation matrix are called principal components. PCA is used widely in dimensionality reduction.
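
Concretely, if $A$ is an $m \times n$ data matrix and $P$ is the $n \times k$ matrix whose columns are the top $k$ principal components, the projection into the $k$-dimensional space is
`\[
Y = A P,
\]`
where $Y$ is the $m \times k$ matrix of projected data.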
`spark.mllib` supports PCA for tall-and-skinny matrices stored in row-oriented format and for any Vectors.
2014-04-22 14:20:47 -04:00
<div class="codetabs">
<div data-lang="scala" markdown="1">
The following code demonstrates how to compute principal components on a `RowMatrix`
2014-04-22 14:20:47 -04:00
and use them to project the vectors into a low-dimensional space.
Refer to the [`RowMatrix` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix) for details on the API.
{% include_example scala/org/apache/spark/examples/mllib/PCAOnRowMatrixExample.scala %}
The following code demonstrates how to compute principal components on source vectors
and use them to project the vectors into a low-dimensional space while keeping associated labels:
Refer to the [`PCA` Scala docs](api/scala/index.html#org.apache.spark.mllib.feature.PCA) for details on the API.
{% include_example scala/org/apache/spark/examples/mllib/PCAOnSourceVectorExample.scala %}
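
As a condensed, hedged sketch of that pattern (the `LabeledPoint` data here is made up, and a `SparkContext` `sc` is assumed):

```scala
import org.apache.spark.mllib.feature.PCA
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical labeled data; only the feature vectors feed the PCA fit.
val data = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0)),
  LabeledPoint(0.0, Vectors.dense(2.0, 1.0, 0.0)),
  LabeledPoint(1.0, Vectors.dense(4.0, 1.0, 1.0))))

// Fit a PCA model that keeps the top 2 principal components.
val pca = new PCA(2).fit(data.map(_.features))

// Project each feature vector; the labels ride along unchanged.
val projected = data.map(p => p.copy(features = pca.transform(p.features)))
```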
</div>

<div data-lang="java" markdown="1">
The following code demonstrates how to compute principal components on a `RowMatrix`
and use them to project the vectors into a low-dimensional space.
Refer to the [`RowMatrix` Java docs](api/java/org/apache/spark/mllib/linalg/distributed/RowMatrix.html) for details on the API.
{% include_example java/org/apache/spark/examples/mllib/JavaPCAExample.java %}
</div>

<div data-lang="python" markdown="1">
The following code demonstrates how to compute principal components on a `RowMatrix`
and use them to project the vectors into a low-dimensional space.
Refer to the [`RowMatrix` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.RowMatrix) for details on the API.
{% include_example python/mllib/pca_rowmatrix_example.py %}
</div>
</div>