spark-instrumented-optimizer/python/pyspark
Yanbo Liang 3ef6d98a80 [SPARK-17847][ML] Reduce shuffled data size of GaussianMixture & copy the implementation from mllib to ml
## What changes were proposed in this pull request?

Copy `GaussianMixture` implementation from mllib to ml, then we can add new features to it.
I left mllib `GaussianMixture` untouched, unlike some other algorithms to wrap the ml implementation. For the following reasons:
- mllib `GaussianMixture` allows k == 1, but ml does not.
- mllib `GaussianMixture` supports setting initial model, but ml does not support currently. (We will definitely add this feature for ml in the future)

We can get around these issues to make mllib as a wrapper calling into ml, but I'd prefer to leave mllib untouched which can make ml clean.

Meanwhile, There is a big performance improvement for `GaussianMixture` in this PR. Since the covariance matrix of multivariate gaussian distribution is symmetric, we can only store the upper triangular part of the matrix and it will greatly reduce the shuffled data size. In my test, this change will reduce shuffled data size by about 50% and accelerate the job execution.

Before this PR:
![image](https://cloud.githubusercontent.com/assets/1962026/19641622/4bb017ac-9996-11e6-8ece-83db184b620a.png)
After this PR:
![image](https://cloud.githubusercontent.com/assets/1962026/19641635/629c21fe-9996-11e6-91e9-83ab74ae0126.png)
## How was this patch tested?

Existing tests and added new tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #15413 from yanboliang/spark-17847.
2017-01-09 21:38:46 -08:00
..
ml [SPARK-17847][ML] Reduce shuffled data size of GaussianMixture & copy the implementation from mllib to ml 2017-01-09 21:38:46 -08:00
mllib [SPARK-17645][MLLIB][ML] add feature selector method based on: False Discovery Rate (FDR) and Family wise error rate (FWE) 2016-12-28 00:49:36 -08:00
sql [SPARK-19126][DOCS] Update Join Documentation Across Languages 2017-01-08 20:37:46 -08:00
streaming [SPARK-18447][DOCS] Fix the markdown for Note:/NOTE:/Note that across Python API documentation 2016-11-22 11:40:18 +00:00
__init__.py [SPARK-18576][PYTHON] Add basic TaskContext information to PySpark 2016-12-20 15:51:21 -08:00
accumulators.py [SPARK-8652] [PYSPARK] Check return value for all uses of doctest.testmod() 2015-06-26 08:12:22 -07:00
broadcast.py [SPARK-17472] [PYSPARK] Better error message for serialization failures of large objects in Python 2016-09-14 13:37:35 -07:00
cloudpickle.py [SPARK-17472] [PYSPARK] Better error message for serialization failures of large objects in Python 2016-09-14 13:37:35 -07:00
conf.py [SPARK-18447][DOCS] Fix the markdown for Note:/NOTE:/Note that across Python API documentation 2016-11-22 11:40:18 +00:00
context.py [SPARK-18523][PYSPARK] Make SparkContext.stop more reliable 2016-11-28 18:28:24 -08:00
daemon.py [SPARK-4897] [PySpark] Python 3 support 2015-04-16 16:20:57 -07:00
files.py [SPARK-3309] [PySpark] Put all public API in __all__ 2014-09-03 11:49:45 -07:00
find_spark_home.py [SPARK-1267][SPARK-18129] Allow PySpark to be pip installed 2016-11-16 14:22:15 -08:00
heapq3.py [SPARK-8652] [PYSPARK] Check return value for all uses of doctest.testmod() 2015-06-26 08:12:22 -07:00
java_gateway.py [SPARK-1267][SPARK-18129] Allow PySpark to be pip installed 2016-11-16 14:22:15 -08:00
join.py [SPARK-14202] [PYTHON] Use generator expression instead of list comp in python_full_outer_jo… 2016-03-28 14:51:36 -07:00
profiler.py [SPARK-8652] [PYSPARK] Check return value for all uses of doctest.testmod() 2015-06-26 08:12:22 -07:00
rdd.py [SPARK-18281] [SQL] [PYSPARK] Remove timeout for reading data through socket for local iterator 2016-12-20 13:12:16 -08:00
rddsampler.py [SPARK-4897] [PySpark] Python 3 support 2015-04-16 16:20:57 -07:00
resultiterable.py [SPARK-3074] [PySpark] support groupByKey() with single huge key 2015-04-09 17:07:23 -07:00
serializers.py [SPARK-16589] [PYTHON] Chained cartesian produces incorrect number of records 2016-12-08 11:08:12 -08:00
shell.py [SPARK-16536][SQL][PYSPARK][MINOR] Expose sql in PySpark Shell 2016-07-13 22:24:26 -07:00
shuffle.py [SPARK-10710] Remove ability to disable spilling in core and SQL 2015-09-19 21:40:21 -07:00
statcounter.py [SPARK-6919] [PYSPARK] Add asDict method to StatCounter 2015-09-29 13:38:15 -07:00
status.py [SPARK-4172] [PySpark] Progress API in Python 2015-02-17 13:36:43 -08:00
storagelevel.py [SPARK-13992][CORE][PYSPARK][FOLLOWUP] Update OFF_HEAP semantics for Java api and Python api 2016-04-12 23:06:55 -07:00
taskcontext.py [SPARK-18576][PYTHON] Add basic TaskContext information to PySpark 2016-12-20 15:51:21 -08:00
tests.py [SPARK-18576][PYTHON] Add basic TaskContext information to PySpark 2016-12-20 15:51:21 -08:00
traceback_utils.py [SPARK-1087] Move python traceback utilities into new traceback_utils.py file. 2014-09-15 19:28:17 -07:00
version.py [SPARK-1267][SPARK-18129] Allow PySpark to be pip installed 2016-11-16 14:22:15 -08:00
worker.py [SPARK-18576][PYTHON] Add basic TaskContext information to PySpark 2016-12-20 15:51:21 -08:00