spark-instrumented-optimizer

History

src	[SPARK-3424][MLLIB] cache point distances during k-means\|\| init	2015-01-21 21:21:07 -08:00
pom.xml	[SPARK-4048] Enhance and extend hadoop-provided profile.	2015-01-08 17:15:13 -08:00

Xiangrui Meng ca7910d6dd [SPARK-3424][MLLIB] cache point distances during k-means|| init

This PR ports the following feature implemented in #2634 by derrickburns:

* During k-means|| initialization, we should cache costs (squared distances) previously computed.

It also contains the following optimizations:

* aggregate sumCosts directly
* run multiple (#runs) k-means++ in parallel
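
The cost-caching idea can be illustrated with a minimal, single-machine Python sketch. This is not the MLlib code from this patch: the function names (`update_costs`, `kmeans_parallel_init`), the `2 * k` oversampling factor, and the fixed number of rounds are assumptions chosen for illustration. The point it shows is that each round compares every point only against the centers added in the previous round and keeps the smaller of the cached and new cost, instead of recomputing distances to all centers.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def update_costs(points, costs, new_centers):
    """Refresh cached costs using only the newly added centers.

    The cached cost already holds the minimum over all older centers,
    so only the new centers need to be examined.
    """
    return [
        min(cost, min((squared_distance(p, c) for c in new_centers),
                      default=float("inf")))
        for p, cost in zip(points, costs)
    ]


def kmeans_parallel_init(points, k, rounds=5, seed=0):
    """Hypothetical k-means|| style candidate selection with cached costs.

    Returns the candidate centers and the cached cost of every point.
    Reducing the candidates to exactly k centers (usually done with a
    weighted k-means++ step) is omitted here.
    """
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    costs = [float("inf")] * len(points)
    newest = list(centers)
    for _ in range(rounds):
        costs = update_costs(points, costs, newest)
        total = sum(costs)  # aggregate the summed cost directly
        if total == 0.0:
            newest = []
            break
        # Assumed oversampling rule: keep a point with probability
        # proportional to its cached cost.
        newest = [p for p, c in zip(points, costs)
                  if rng.random() < 2.0 * k * c / total]
        centers.extend(newest)
    # Account for the centers added in the final round.
    costs = update_costs(points, costs, newest)
    return centers, costs
```

With the cache, the work per round scales with the number of newly sampled centers rather than with the total number of centers chosen so far, which is consistent with the roughly constant per-iteration time reported below.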

I compared the performance locally on mnist-digit. Before this patch:

![before](https://cloud.githubusercontent.com/assets/829644/5845647/93080862-a172-11e4-9a35-044ec711afc4.png)

with this patch:

![after](https://cloud.githubusercontent.com/assets/829644/5845653/a47c29e8-a172-11e4-8e9f-08db57fe3502.png)

It is clear that each k-means|| iteration takes about the same amount of time with this patch.

Authors:
  Derrick Burns <derrickburns@gmail.com>
  Xiangrui Meng <meng@databricks.com>

Closes #4144 from mengxr/SPARK-3424-kmeans-parallel and squashes the following commits:

0a875ec [Xiangrui Meng] address comments
4341bb8 [Xiangrui Meng] do not re-compute point distances during k-means||

2015-01-21 21:21:07 -08:00
