spark-instrumented-optimizer/mllib
shahid 519be238be [SPARK-35423][ML] PCA results should be consistent, If the Matrix contains both Sparse and Dense vectors
### What changes were proposed in this pull request?
If the dataset contains mix of sparse and dense vectors output of PCA seems different. The issue here is we check only the first row's Vector type. If the first row is dense and rest all the row's are sparse, we compute PCA based on dense path. Similarly, if only first row in Sparse and rest all the rows are dense, we compute based on Sparse computation path.

Following datasets will produce different results with PCA, even though the data is same, except first row type is sparse.
```
val data1 = Array(
  Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
)
```

```
+-----------------------------------------------------------+
|pcaFeatures                                                |
+-----------------------------------------------------------+
|[1.6485728230883807,-4.013282700516296,-5.524543751369388] |
|[-4.645104331781534,-1.1167972663619026,-5.524543751369387]|
|[-6.428880535676489,-5.337951427775355,-5.524543751369389] |
+-----------------------------------------------------------+

```
```
val data1 = Array(
  Vectors.dense(0.0, 1.0, 0.0, 7.0, 0.0 ),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
)
```

```
+------------------------------------------------------------+
|pcaFeatures                                                 |
+------------------------------------------------------------+
|[1.6485728230883814,-4.0132827005162985,-1.0091435193998504]|
|[-4.645104331781533,-1.1167972663619048,-1.0091435193998501]|
|[-6.428880535676488,-5.337951427775359,-1.009143519399851]  |
+------------------------------------------------------------+
```

### Why are the changes needed?
To fix inconsistent result if dataset contains both sparse and dense vectors. We need to treat the entire metrics as Sparse ONLY if all the rows are sparse. Otherwise we need to consider the matrix as dense. This PR can be a followup for the PR: https://github.com/apache/spark/pull/23126

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UTs

Closes #32734 from shahidki31/shahid/pca.

Authored-by: shahid <shahidki31@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-06-09 10:23:46 -05:00
..
benchmarks [SPARK-34950][TESTS] Update benchmark results to the ones created by GitHub Actions machines 2021-04-03 23:02:56 +03:00
src [SPARK-35423][ML] PCA results should be consistent, If the Matrix contains both Sparse and Dense vectors 2021-06-09 10:23:46 -05:00
pom.xml [SPARK-35532][TESTS] Ensure mllib and kafka-0-10 module can be maven test independently in Scala 2.13 2021-05-30 16:36:17 -07:00