spark-instrumented-optimizer/python/pyspark/mllib
李亮 e503065fd8 [SPARK-25868][MLLIB] One part of Spark MLlib Kmean Logic Performance problem
## What changes were proposed in this pull request?

Fix fastSquaredDistance to calculate dense-dense situation calculation performance problem and meanwhile enhance the calculation accuracy.

## How was this patch tested?
From different point to test after add this patch, the dense-dense calculation situation performance is enhanced and will do influence other calculation situation like (sparse-sparse, sparse-dense)

**For calculation logic test**
There is my test for sparse-sparse, dense-dense, sparse-dense case

There is test result:
First we need define some branch path logic for sparse-sparse and sparse-dense case
if meet precisionBound1, we define it as LOGIC1
if not meet precisionBound1, and not meet precisionBound2, we define it as LOGIC2
if not meet precisionBound1, but meet precisionBound2, we define it as LOGIC3
(There is a trick, you can manually change the precision value to meet above situation)

sparse- sparse case time cost situation (milliseconds)
LOGIC1
Before add patch: 7786, 7970, 8086
After add patch: 7729, 7653, 7903
LOGIC2
Before add patch: 8412, 9029, 8606
After add patch: 8603, 8724, 9024
LOGIC3
Before add patch: 19365, 19146, 19351
After add patch: 18917, 19007, 19074

sparse-dense case time cost situation (milliseconds)
LOGIC1
Before add patch: 4195, 4014, 4409
After add patch: 4081,3971, 4151
LOGIC2
Before add patch: 4968, 5579, 5080
After add patch: 4980, 5472, 5148
LOGIC3
Before add patch: 11848, 12077, 12168
After add patch: 11718, 11874, 11743

And for dense-dense case like we already discussed in comment, only use sqdist to calculate distance

dense-dense case time cost situation (milliseconds)
Before add patch: 7340, 7816, 7672
After add patch: 5752, 5800, 5753

**For real world data test**
There is my test data situation
I use the data
http://archive.ics.uci.edu/ml/datasets/Condition+monitoring+of+hydraulic+systems
extract file (PS1, PS2, PS3, PS4, PS5, PS6) to form the test data

total instances are 13230
the attributes for line are 6000

Result for sparse-sparse situation time cost (milliseconds)
Before Enhance: 7670, 7704, 7652
After Enhance: 7634, 7729, 7645

Closes #22893 from KyleLi1985/updatekmeanpatch.

Authored-by: 李亮 <liang.li.work@outlook.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2018-11-14 07:24:13 -08:00
..
linalg [SPARK-24740][PYTHON][ML] Make PySpark's tests compatible with NumPy 1.14+ 2018-07-07 11:39:29 +08:00
stat Fix typos detected by github.com/client9/misspell 2018-08-11 21:23:36 -05:00
__init__.py [SPARK-14817][ML][MLLIB][DOC] Made DataFrame-based API primary in MLlib guide 2016-07-15 13:38:23 -07:00
classification.py [SPARK-14712][ML] LogisticRegressionModel.toString should summarize model 2018-06-28 12:40:39 -07:00
clustering.py [SPARK-25868][MLLIB] One part of Spark MLlib Kmean Logic Performance problem 2018-11-14 07:24:13 -08:00
common.py [SPARK-17679] [PYSPARK] remove unnecessary Py4J ListConverter patch 2016-10-03 14:12:03 -07:00
evaluation.py [SPARK-25908][CORE][SQL] Remove old deprecated items in Spark 3 2018-11-07 22:48:50 -06:00
feature.py [SPARK-25238][PYTHON] lint-python: Fix W605 warnings for pycodestyle 2.4 2018-09-13 11:19:43 +08:00
fpm.py [SPARK-23522][PYTHON] always use sys.exit over builtin exit 2018-03-08 20:38:34 +09:00
random.py [SPARK-23522][PYTHON] always use sys.exit over builtin exit 2018-03-08 20:38:34 +09:00
recommendation.py [SPARK-23522][PYTHON] always use sys.exit over builtin exit 2018-03-08 20:38:34 +09:00
regression.py [SPARK-23522][PYTHON] always use sys.exit over builtin exit 2018-03-08 20:38:34 +09:00
tests.py [SPARK-15750][MLLIB][PYSPARK] Constructing FPGrowth fails when no numPartitions specified in pyspark 2018-05-07 14:47:58 -07:00
tree.py [SPARK-23522][PYTHON] always use sys.exit over builtin exit 2018-03-08 20:38:34 +09:00
util.py [SPARK-25908][CORE][SQL] Remove old deprecated items in Spark 3 2018-11-07 22:48:50 -06:00