spark-instrumented-optimizer

History

李亮 e503065fd8 [SPARK-25868][MLLIB] One part of Spark MLlib Kmean Logic Performance problem ## What changes were proposed in this pull request? Fix fastSquaredDistance to calculate dense-dense situation calculation performance problem and meanwhile enhance the calculation accuracy. ## How was this patch tested? From different point to test after add this patch, the dense-dense calculation situation performance is enhanced and will do influence other calculation situation like (sparse-sparse, sparse-dense) For calculation logic test There is my test for sparse-sparse, dense-dense, sparse-dense case There is test result: First we need define some branch path logic for sparse-sparse and sparse-dense case if meet precisionBound1, we define it as LOGIC1 if not meet precisionBound1, and not meet precisionBound2, we define it as LOGIC2 if not meet precisionBound1, but meet precisionBound2, we define it as LOGIC3 (There is a trick, you can manually change the precision value to meet above situation) sparse- sparse case time cost situation (milliseconds) LOGIC1 Before add patch: 7786, 7970, 8086 After add patch: 7729, 7653, 7903 LOGIC2 Before add patch: 8412, 9029, 8606 After add patch: 8603, 8724, 9024 LOGIC3 Before add patch: 19365, 19146, 19351 After add patch: 18917, 19007, 19074 sparse-dense case time cost situation (milliseconds) LOGIC1 Before add patch: 4195, 4014, 4409 After add patch: 4081,3971, 4151 LOGIC2 Before add patch: 4968, 5579, 5080 After add patch: 4980, 5472, 5148 LOGIC3 Before add patch: 11848, 12077, 12168 After add patch: 11718, 11874, 11743 And for dense-dense case like we already discussed in comment, only use sqdist to calculate distance dense-dense case time cost situation (milliseconds) Before add patch: 7340, 7816, 7672 After add patch: 5752, 5800, 5753 For real world data test There is my test data situation I use the data http://archive.ics.uci.edu/ml/datasets/Condition+monitoring+of+hydraulic+systems extract file (PS1, PS2, PS3, PS4, PS5, PS6) to form the test data total instances are 13230 the attributes for line are 6000 Result for sparse-sparse situation time cost (milliseconds) Before Enhance: 7670, 7704, 7652 After Enhance: 7634, 7729, 7645 Closes #22893 from KyleLi1985/updatekmeanpatch. Authored-by: 李亮 <liang.li.work@outlook.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>		2018-11-14 07:24:13 -08:00
..
linalg	[SPARK-24740][PYTHON][ML] Make PySpark's tests compatible with NumPy 1.14+	2018-07-07 11:39:29 +08:00
stat	Fix typos detected by github.com/client9/misspell	2018-08-11 21:23:36 -05:00
__init__.py	[SPARK-14817][ML][MLLIB][DOC] Made DataFrame-based API primary in MLlib guide	2016-07-15 13:38:23 -07:00
classification.py	[SPARK-14712][ML] LogisticRegressionModel.toString should summarize model	2018-06-28 12:40:39 -07:00
clustering.py	[SPARK-25868][MLLIB] One part of Spark MLlib Kmean Logic Performance problem	2018-11-14 07:24:13 -08:00
common.py	[SPARK-17679] [PYSPARK] remove unnecessary Py4J ListConverter patch	2016-10-03 14:12:03 -07:00
evaluation.py	[SPARK-25908][CORE][SQL] Remove old deprecated items in Spark 3	2018-11-07 22:48:50 -06:00
feature.py	[SPARK-25238][PYTHON] lint-python: Fix W605 warnings for pycodestyle 2.4	2018-09-13 11:19:43 +08:00
fpm.py	[SPARK-23522][PYTHON] always use sys.exit over builtin exit	2018-03-08 20:38:34 +09:00
random.py	[SPARK-23522][PYTHON] always use sys.exit over builtin exit	2018-03-08 20:38:34 +09:00
recommendation.py	[SPARK-23522][PYTHON] always use sys.exit over builtin exit	2018-03-08 20:38:34 +09:00
regression.py	[SPARK-23522][PYTHON] always use sys.exit over builtin exit	2018-03-08 20:38:34 +09:00
tests.py	[SPARK-15750][MLLIB][PYSPARK] Constructing FPGrowth fails when no numPartitions specified in pyspark	2018-05-07 14:47:58 -07:00
tree.py	[SPARK-23522][PYTHON] always use sys.exit over builtin exit	2018-03-08 20:38:34 +09:00
util.py	[SPARK-25908][CORE][SQL] Remove old deprecated items in Spark 3	2018-11-07 22:48:50 -06:00