a3bbc371cb
## What changes were proposed in this pull request?

Optimize `SparseVector.apply` by avoiding internal conversion.

Since the speedup is significant (2.5x ~ 5x) and this method is widely used in ML, I suggest back-porting.

| size | nnz | apply (old) | apply2 (new impl) | apply3 (new impl with extra range check) |
|----------|---------|-------|-------|-------|
| 10000000 | 100     | 75294 | 12208 | 18682 |
| 10000000 | 10000   | 75616 | 23132 | 32932 |
| 10000000 | 1000000 | 92949 | 42529 | 48821 |

## How was this patch tested?

Existing tests.

Performance was measured with the following code (here the new impl is named `apply2`, and another impl with an extra range check is named `apply3`):

```scala
import scala.util.Random
import org.apache.spark.ml.linalg._

val size = 10000000
for (nnz <- Seq(100, 10000, 1000000)) {
  val rng = new Random(123)
  val indices = Array.fill(nnz + nnz)(rng.nextInt.abs % size).distinct.take(nnz).sorted
  val values = Array.fill(nnz)(rng.nextDouble)
  val vec = Vectors.sparse(size, indices, values).toSparse

  val tic1 = System.currentTimeMillis
  (0 until 100).foreach { round =>
    var i = 0; var sum = 0.0
    while (i < size) { sum += vec(i); i += 1 }
  }
  val toc1 = System.currentTimeMillis

  val tic2 = System.currentTimeMillis
  (0 until 100).foreach { round =>
    var i = 0; var sum = 0.0
    while (i < size) { sum += vec.apply2(i); i += 1 }
  }
  val toc2 = System.currentTimeMillis

  val tic3 = System.currentTimeMillis
  (0 until 100).foreach { round =>
    var i = 0; var sum = 0.0
    while (i < size) { sum += vec.apply3(i); i += 1 }
  }
  val toc3 = System.currentTimeMillis

  println((size, nnz, toc1 - tic1, toc2 - tic2, toc3 - tic3))
}
```

Closes #25178 from zhengruifeng/sparse_vec_apply.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
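The optimization described above avoids converting the sparse vector internally on each lookup and instead reads the element directly from the sorted `indices`/`values` arrays. The following is a minimal standalone sketch of that idea (not the actual patched `SparseVector.apply`), assuming sorted indices as in `SparseVector`'s layout; `sparseApply` is a hypothetical helper name:

```scala
// Standalone sketch of a direct lookup on a sparse vector's backing arrays,
// avoiding any dense or intermediate representation.
object SparseApplySketch {
  // indices must be sorted ascending and parallel to values,
  // mirroring SparseVector's internal layout.
  def sparseApply(size: Int, indices: Array[Int], values: Array[Double], i: Int): Double = {
    // Extra range check (the "apply3" variant in the benchmark above).
    if (i < 0 || i >= size) {
      throw new IndexOutOfBoundsException(s"Index $i out of range [0, $size)")
    }
    // Binary search over the stored indices: O(log nnz) per lookup.
    val j = java.util.Arrays.binarySearch(indices, i)
    if (j >= 0) values(j) else 0.0  // implicit zeros for absent indices
  }

  def main(args: Array[String]): Unit = {
    val indices = Array(1, 4, 7)
    val values = Array(1.0, 2.0, 3.0)
    println(sparseApply(10, indices, values, 4)) // stored element
    println(sparseApply(10, indices, values, 5)) // implicit zero
  }
}
```

The key point is that the lookup touches only the two backing arrays, so the cost per call is a binary search rather than any conversion of the whole vector.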