Shahid ce40efa200 [SPARK-25790][MLLIB] PCA: Support more than 65535 column matrix
## What changes were proposed in this pull request?
Spark PCA currently supports matrices with at most ~65,535 columns. This is because it first computes the covariance matrix and then derives the principal components from it, and computing the **covariance matrix** was the main bottleneck. The 65,535 limit comes from the integer size limit: the packed covariance is passed to the Breeze library as an array of n*(n+1)/2 elements, and that size cannot exceed INT_MAX, so the maximum number of columns is 65,535.
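
As a quick sanity check on that bound (a small arithmetic sketch, not code from this patch):

```scala
// The packed upper-triangular covariance buffer holds n * (n + 1) / 2 entries,
// and an array length must fit in a signed 32-bit Int (Int.MaxValue = 2147483647).
val n = 65535L
println(n * (n + 1) / 2)       // 2147450880: still fits in an Int
println((n + 1) * (n + 2) / 2) // 2147516416: overflows for 65,536 columns
```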

Spark's SVD computation has no such limitation, so we can fall back to SVD to compute the PCA when the number of columns exceeds the limit.

PCA can be computed directly from the SVD of the (mean-centered) data matrix, without forming the covariance matrix: if A = U * S * V^T, then the covariance A^T * A / (m - 1) = V * (S^2 / (m - 1)) * V^T, so the right singular vectors V are exactly the principal components. A sketch of this route follows the references below.
The following papers/links are provided for reference:

https://arxiv.org/pdf/1404.1100.pdf
https://en.wikipedia.org/wiki/Principal_component_analysis#Singular_value_decomposition
http://www.ifis.uni-luebeck.de/~moeller/Lectures/WS-16-17/Web-Mining-Agents/PCA-SVD.pdf
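
A minimal sketch of the SVD route using the existing MLlib RowMatrix API (illustrative only, not the code in this patch; `pcaViaSVD`, `rows`, and `k` are hypothetical names, and the input rows are assumed to be mean-centered already):

```scala
import org.apache.spark.mllib.linalg.{Matrix, Vector}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

// rows: RDD[Vector] of mean-centered data; k: number of principal components.
def pcaViaSVD(rows: RDD[Vector], k: Int): Matrix = {
  val mat = new RowMatrix(rows)
  // Top-k SVD of the data matrix itself; no n x n covariance array is
  // materialized, so n is not capped at 65,535 columns.
  val svd = mat.computeSVD(k, computeU = false)
  // For centered A = U * S * V^T, cov(A) = V * (S^2 / (m - 1)) * V^T,
  // so the right singular vectors V are the principal components.
  svd.V
}
```

In the patched fit method the idea is simply to branch on the column count: keep the covariance path for n <= 65,535 and fall back to an SVD path like the one above beyond that limit.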

## How was this patch tested?
Added a unit test. Also verified manually against the existing PCA tests by removing the limit condition in the fit method.

Closes #22784 from shahidki31/PCA.

Authored-by: Shahid <shahidki31@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>