ca7910d6dd
This PR ports the following feature implemented in #2634 by derrickburns:
* During k-means|| initialization, we should cache costs (squared distances) previously computed.
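To illustrate the caching idea, here is a minimal plain-Python sketch, not the Spark code in this patch. It assumes points are tuples of floats and keeps one cached cost per point; each step compares the cached cost only against the centers added in the previous step instead of recomputing distances to every center. The names `sq_dist`, `update_costs`, and `kmeans_parallel_init`, and the oversampling factor `2 * k`, are assumptions made for the example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def update_costs(points, costs, new_centers):
    """Refresh the cached costs using only the newly added centers.

    The cached cost already covers every older center, so the distance
    to those centers is never recomputed.
    """
    if not new_centers:
        return list(costs)
    return [
        min(c, min(sq_dist(p, ctr) for ctr in new_centers))
        for p, c in zip(points, costs)
    ]


def kmeans_parallel_init(points, k, steps=5, seed=0):
    """Sketch of k-means|| seeding with cached costs (hypothetical helper)."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    new_centers = list(centers)
    costs = [float("inf")] * len(points)
    for _ in range(steps):
        costs = update_costs(points, costs, new_centers)
        total = sum(costs)
        if total == 0.0:
            # Every point already coincides with a center.
            new_centers = []
            break
        # Sample each point with probability proportional to its cost
        # (oversampling factor 2 * k is an assumption of this sketch).
        new_centers = [
            p for p, c in zip(points, costs)
            if rng.random() < 2.0 * k * c / total
        ]
        centers.extend(new_centers)
    # Fold in the centers sampled during the last step.
    costs = update_costs(points, costs, new_centers)
    return centers, costs
```

With the cache, the work per step depends on the number of centers added in the previous step rather than on the total number of centers collected so far.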
It also contains the following optimizations:
* aggregate sumCosts directly
* run multiple (#runs) k-means++ in parallel
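As a rough illustration of handling several runs together, the sketch below keeps one row of cached costs per point with one entry per run, and sums the costs of all runs in a single pass over the points. It is plain Python written under assumptions, not the Spark implementation: `update_costs_all_runs` and `sum_costs_per_run` are hypothetical names, and `new_centers[r]` is assumed to hold the centers most recently added to run `r`.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def update_costs_all_runs(points, costs, new_centers):
    """Update cached costs for every run in one pass over the points.

    costs[i][r] is the cached cost of point i in run r;
    new_centers[r] lists the centers just added to run r.
    """
    updated = []
    for p, row in zip(points, costs):
        updated.append([
            min([c] + [sq_dist(p, ctr) for ctr in new_centers[r]])
            for r, c in enumerate(row)
        ])
    return updated


def sum_costs_per_run(costs):
    """Aggregate the total cost of each run directly, in a single pass."""
    totals = [0.0] * len(costs[0])
    for row in costs:
        for r, c in enumerate(row):
            totals[r] += c
    return totals
```

A possible usage, with two runs over two one-dimensional points:

```python
inf = float("inf")
points = [(0.0,), (4.0,)]
costs = update_costs_all_runs(points, [[inf, inf], [inf, inf]], [[(0.0,)], [(4.0,)]])
totals = sum_costs_per_run(costs)  # one total per run
```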
I compared the performance locally on mnist-digit. Before this patch:
![before](https://cloud.githubusercontent.com/assets/829644/5845647/93080862-a172-11e4-9a35-044ec711afc4.png)
with this patch:
![after](https://cloud.githubusercontent.com/assets/829644/5845653/a47c29e8-a172-11e4-8e9f-08db57fe3502.png)
It is clear that each k-means|| iteration takes about the same amount of time with this patch.
Authors:
Derrick Burns <derrickburns@gmail.com>
Xiangrui Meng <meng@databricks.com>
Closes #4144 from mengxr/SPARK-3424-kmeans-parallel and squashes the following commits: