spark-instrumented-optimizer/docs/mllib-dimensionality-reduction.md
Xiangrui Meng df0aa8353a [WIP][SPARK-1871][MLLIB] Improve MLlib guide for v1.0
Some improvements to MLlib guide:

1. [SPARK-1872] Update API links for unidoc.
2. [SPARK-1783] Added `page.displayTitle` to the global layout. If it is defined, use it instead of `page.title` for title display.
3. Add more Java/Python examples.

Author: Xiangrui Meng <meng@databricks.com>

Closes #816 from mengxr/mllib-doc and squashes the following commits:

ec2e407 [Xiangrui Meng] format scala example for ALS
cd9f40b [Xiangrui Meng] add a paragraph to summarize distributed matrix types
4617f04 [Xiangrui Meng] add python example to loadLibSVMFile and fix Java example
d6509c2 [Xiangrui Meng] [SPARK-1783] update mllib titles
561fdc0 [Xiangrui Meng] add a displayTitle option to global layout
195d06f [Xiangrui Meng] add Java example for summary stats and minor fix
9f1ff89 [Xiangrui Meng] update java api links in mllib-basics
7dad18e [Xiangrui Meng] update java api links in NB
3a0f4a6 [Xiangrui Meng] api/pyspark -> api/python
35bdeb9 [Xiangrui Meng] api/mllib -> api/scala
e4afaa8 [Xiangrui Meng] explicity state what might change
2014-05-18 17:00:57 -07:00

3.7 KiB

layout title displayTitle
global Dimensionality Reduction - MLlib <a href="mllib-guide.html">MLlib</a> - Dimensionality Reduction
  • Table of contents {:toc}

Dimensionality reduction is the process of reducing the number of variables under consideration. It is used to extract latent features from raw and noisy features, or compress data while maintaining the structure. In this release, we provide preliminary support for dimensionality reduction on tall-and-skinny matrices.

Singular value decomposition (SVD)

Singular value decomposition (SVD) factorizes a matrix into three matrices: U, \Sigma, and V such that

\[ A = U \Sigma V^T, \]

where

  • U is an orthonormal matrix, whose columns are called left singular vectors,
  • \Sigma is a diagonal matrix with non-negative diagonals in descending order, whose diagonals are called singular values,
  • V is an orthonormal matrix, whose columns are called right singular vectors.

For large matrices, usually we don't need the complete factorization but only the top singular values and its associated singular vectors. This can save storage, and more importantly, de-noise and recover the low-rank structure of the matrix.

If we keep the top k singular values, then the dimensions of the return will be:

  • $U$: $m \times k$,
  • $\Sigma$: $k \times k$,
  • $V$: $n \times k$.

In this release, we provide SVD computation to row-oriented matrices that have only a few columns, say, less than 1000, but many rows, which we call tall-and-skinny.

{% highlight scala %} import org.apache.spark.mllib.linalg.Matrix import org.apache.spark.mllib.linalg.distributed.RowMatrix import org.apache.spark.mllib.linalg.SingularValueDecomposition

val mat: RowMatrix = ...

// Compute the top 20 singular values and corresponding singular vectors. val svd: SingularValueDecomposition[RowMatrix, Matrix] = mat.computeSVD(20, computeU = true) val U: RowMatrix = svd.U // The U factor is a RowMatrix. val s: Vector = svd.s // The singular values are stored in a local dense vector. val V: Matrix = svd.V // The V factor is a local dense matrix. {% endhighlight %}

Same code applies to `IndexedRowMatrix`. The only difference that the `U` matrix becomes an `IndexedRowMatrix`.

Principal component analysis (PCA)

Principal component analysis (PCA) is a statistical method to find a rotation such that the first coordinate has the largest variance possible, and each succeeding coordinate in turn has the largest variance possible. The columns of the rotation matrix are called principal components. PCA is used widely in dimensionality reduction.

In this release, we implement PCA for tall-and-skinny matrices stored in row-oriented format.

The following code demonstrates how to compute principal components on a tall-and-skinny RowMatrix and use them to project the vectors into a low-dimensional space. The number of columns should be small, e.g, less than 1000.

{% highlight scala %} import org.apache.spark.mllib.linalg.Matrix import org.apache.spark.mllib.linalg.distributed.RowMatrix

val mat: RowMatrix = ...

// Compute the top 10 principal components. val pc: Matrix = mat.computePrincipalComponents(10) // Principal components are stored in a local dense matrix.

// Project the rows to the linear space spanned by the top 10 principal components. val projected: RowMatrix = mat.multiply(pc) {% endhighlight %}