---
layout: global
title: MLlib - Linear Algebra
---

* Table of contents
{:toc}


# Singular Value Decomposition
Singular value decomposition (SVD) for tall-and-skinny matrices.
Given an `$m \times n$` matrix `$A$`, we can compute matrices `$U,S,V$` such that

`\[
 A = U \cdot S \cdot V^T
 \]`

There is no restriction on `$m$`, but `$n^2$` doubles must
fit in memory on a single machine, because the `$n \times n$` matrix `$A^TA$` is decomposed locally.
Further, `$n$` should be less than `$m$`.
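
As a rough sizing illustration, assuming 8-byte doubles and ignoring any JVM or container overhead (the value `$n = 10{,}000$` is only an example, not a documented limit):

`\[
 n = 10{,}000 \;\Rightarrow\; n^2 \times 8 \text{ bytes} = 8 \times 10^{8} \text{ bytes} \approx 800 \text{ MB}
 \]`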

The decomposition is computed by first forming `$A^TA = V S^2 V^T$`
and computing its SVD locally (feasible because the `$n \times n$` matrix is small),
from which we recover `$S$` and `$V$`.
We then obtain `$U$` with a single matrix multiplication,
`$U =  A \cdot V \cdot S^{-1}$`.
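
Both identities follow from substituting `$A = U S V^T$` and using `$U^T U = I$` and `$V^T V = I$`. The derivation below is standard linear algebra rather than anything specific to this implementation, and it assumes the retained singular values are nonzero so that `$S^{-1}$` exists:

`\[
 A^T A = (U S V^T)^T (U S V^T) = V S U^T U S V^T = V S^2 V^T,
 \qquad
 A V S^{-1} = U S V^T V S^{-1} = U
 \]`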

Only the singular vectors associated with the `$k$` largest singular values
are recovered. If there are `$k$`
such values, the returned matrices have the following dimensions:

* `$S$` is `$k \times k$` and diagonal, holding the singular values on the diagonal.
* `$U$` is `$m \times k$` and satisfies `$U^T U = \mathop{eye}(k)$`.
* `$V$` is `$n \times k$` and satisfies `$V^T V = \mathop{eye}(k)$`.
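
For `$k = 1$` these shapes reduce to one singular value, one length-`$m$` vector, and one length-`$n$` vector. The following is a minimal local sketch of the `$A^TA$` approach for that case, in plain Scala with no Spark dependency. It is an illustration only: the names `GramSvdSketch`, `gram`, `topSingular`, and `leftVector` are hypothetical, and it swaps in power iteration for the local SVD step, which is not necessarily what MLlib does internally.

{% highlight scala %}
// Hypothetical, non-distributed illustration; not part of the MLlib API.
object GramSvdSketch {
  type Matrix = Array[Array[Double]]

  // G = A^T A, an n x n matrix accumulated over the m rows of A.
  def gram(a: Matrix): Matrix = {
    val n = a.head.length
    val g = Array.ofDim[Double](n, n)
    for (row <- a; i <- 0 until n; j <- 0 until n) g(i)(j) += row(i) * row(j)
    g
  }

  private def times(g: Matrix, v: Array[Double]): Array[Double] =
    g.map(row => row.zip(v).map { case (x, y) => x * y }.sum)

  // Power iteration on G. Returns (largest singular value of A, right singular vector).
  // Assumes the uniform start vector is not orthogonal to the top eigenvector of G.
  def topSingular(a: Matrix, iterations: Int = 200): (Double, Array[Double]) = {
    val g = gram(a)
    var v = Array.fill(g.length)(1.0 / math.sqrt(g.length))
    for (_ <- 0 until iterations) {
      val w = times(g, v)
      val norm = math.sqrt(w.map(x => x * x).sum)
      v = w.map(_ / norm)
    }
    // The Rayleigh quotient v^T G v approximates the top eigenvalue of G, which is s^2.
    val lambda = times(g, v).zip(v).map { case (x, y) => x * y }.sum
    (math.sqrt(lambda), v)
  }

  // u = A v / s: the matching left singular vector, of length m.
  def leftVector(a: Matrix, v: Array[Double], s: Double): Array[Double] =
    a.map(row => row.zip(v).map { case (x, y) => x * y }.sum / s)
}
{% endhighlight %}

For the `$3 \times 2$` matrix with rows `(3, 0)`, `(0, 4)`, `(0, 0)`, whose singular values are 4 and 3, `GramSvdSketch.topSingular` should return a value close to 4.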

All input and output is expected in sparse matrix format: 0-indexed
entries of the form `((i, j), value)`, all held in
`SparseMatrix` RDDs. Below is example usage.

{% highlight scala %}

import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.SVD
import org.apache.spark.mllib.linalg.SparseMatrix
import org.apache.spark.mllib.linalg.MatrixEntry

// Load and parse the data file (assumes an existing SparkContext named sc)
val data = sc.textFile("mllib/data/als/test.data").map { line =>
  val parts = line.split(',')
  MatrixEntry(parts(0).toInt, parts(1).toInt, parts(2).toDouble)
}
val m = 4
val n = 4
val k = 1

// Recover the top k singular values and vectors (k = 1 here)
val decomposed = SVD.sparseSVD(SparseMatrix(data, m, n), k)
val s = decomposed.S.data

println("singular values = " + s.toArray.mkString)
{% endhighlight %}


# Principal Component Analysis

Computes the top `$k$` principal component coefficients for the `$m \times n$` data matrix `$X$`.
Rows of `$X$` correspond to observations and columns correspond to variables.
The coefficient matrix is `$n \times k$`. Each column of the returned matrix contains the coefficients
for one principal component, and the columns are in descending
order of component variance. This function centers the data and uses the
singular value decomposition (SVD) algorithm.

All input and output is expected in DenseMatrix format. See `SparkPCA.scala`
in the examples directory for example usage.
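
The link between centering, the SVD, and the variance ordering can be written out as follows. This is the textbook formulation rather than a description of the implementation's internals. The symbols `$\mu$` (the vector of column means), `$X_c$` (the centered data), `$v_i$` and `$s_i$` (the `$i$`-th right singular vector and singular value of `$X_c$`), and the `$m - 1$` sample-variance normalization are notation and conventions assumed here:

`\[
 X_c = X - \mathbf{1}\mu^T = U S V^T,
 \qquad
 \text{coefficients} = [v_1, \dots, v_k],
 \qquad
 \mathrm{Var}(X_c v_i) = \frac{s_i^2}{m - 1}
 \]`

Because the singular values are sorted in descending order, the component variances are as well.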