spark-instrumented-optimizer/mllib-local
zhengruifeng a3bbc371cb [SPARK-28421][ML] SparseVector.apply performance optimization
## What changes were proposed in this pull request?
optimize the `SparseVector.apply` by avoiding internal conversion
Since the speed up is significant (2.5X ~ 5X), and this method is widely used in ml, I suggest back porting.

| size|  nnz | apply(old) | apply2(new impl) | apply3(new impl with extra range check)|
|------|----------|------------|----------|----------|
|10000000|100|75294|12208|18682|
|10000000|10000|75616|23132|32932|
|10000000|1000000|92949|42529|48821|

## How was this patch tested?
existing tests

using following code to test performance (here the new impl is named `apply2`, and another impl with extra range check is named `apply3`):
```
import scala.util.Random
import org.apache.spark.ml.linalg._

val size = 10000000
for (nnz <- Seq(100, 10000, 1000000)) {
	val rng = new Random(123)
	val indices = Array.fill(nnz + nnz)(rng.nextInt.abs % size).distinct.take(nnz).sorted
	val values = Array.fill(nnz)(rng.nextDouble)
	val vec = Vectors.sparse(size, indices, values).toSparse

	val tic1 = System.currentTimeMillis;
	(0 until 100).foreach{ round => var i = 0; var sum = 0.0; while(i < size) {sum+=vec(i); i+=1} };
	val toc1 = System.currentTimeMillis;

	val tic2 = System.currentTimeMillis;
	(0 until 100).foreach{ round => var i = 0; var sum = 0.0; while(i < size) {sum+=vec.apply2(i); i+=1} };
	val toc2 = System.currentTimeMillis;

	val tic3 = System.currentTimeMillis;
	(0 until 100).foreach{ round => var i = 0; var sum = 0.0; while(i < size) {sum+=vec.apply3(i); i+=1} };
	val toc3 = System.currentTimeMillis;

	println((size, nnz, toc1 - tic1, toc2 - tic2, toc3 - tic3))
}
```

Closes #25178 from zhengruifeng/sparse_vec_apply.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-07-23 20:20:22 -05:00
..
src [SPARK-28421][ML] SparseVector.apply performance optimization 2019-07-23 20:20:22 -05:00
pom.xml [SPARK-25956] Make Scala 2.12 as default Scala version in Spark 3.0 2018-11-14 16:22:23 -08:00