spark-instrumented-optimizer/python/pyspark
Davies Liu ca23c3b014 [SPARK-8202] [PYSPARK] fix infinite loop during external sort in PySpark
During external sort, the batch size grows up to a maximum of 10000 and then shrinks down to zero, causing an infinite loop.
Assuming that items usually have similar sizes, there is no need to adjust the batch size after the first spill.
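
The failure mode can be illustrated with a simplified model. This is only a sketch, not the code in shuffle.py: the function name `external_sort_sketch`, the use of an item count as a stand-in for measured memory, the halving rule used when shrinking, and the `max_rounds` guard are all assumptions made for illustration. It shows why a batch size of zero makes no progress (an empty read is never "shorter than the batch", so the loop never ends) and how freezing the batch size after the first spill avoids that.

```python
import heapq
import itertools


def external_sort_sketch(items, memory_limit=1000,
                         freeze_after_spill=True, max_rounds=10000):
    """Sort `items` in chunks, spilling a sorted run whenever the
    in-memory chunk holds more than `memory_limit` items.

    Hypothetical model: item count stands in for real memory usage,
    and runs are kept in a list instead of being written to disk.
    """
    batch = 100
    runs, current = [], []
    iterator = iter(items)
    for _ in range(max_rounds):
        chunk = list(itertools.islice(iterator, batch))
        current.extend(chunk)
        if len(chunk) < batch:
            # Input exhausted. Never true when batch == 0, because an
            # empty chunk has length 0, which is not less than 0.
            break
        if len(current) > memory_limit:
            # "Spill": store the sorted run and start a new chunk.
            runs.append(sorted(current))
            current = []
            if not freeze_after_spill:
                batch //= 2  # assumed shrink rule; can reach zero
        elif not runs or not freeze_after_spill:
            # Grow only before the first spill when frozen.
            batch = min(int(batch * 1.5), 10000)
    else:
        # Guard added for this sketch so the demo terminates.
        raise RuntimeError("no progress: batch size is %d" % batch)
    runs.append(sorted(current))
    return list(heapq.merge(*runs))


if __name__ == "__main__":
    data = list(range(1000, 0, -1))
    print(external_sort_sketch(data, memory_limit=300)[:5])
```

With `freeze_after_spill=False` and a very small `memory_limit`, every round spills, the batch size halves down to zero, and the guard raises `RuntimeError` instead of looping forever. With the default `freeze_after_spill=True`, the batch size stays fixed once the first run has been spilled.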

cc JoshRosen rxin angelini

Author: Davies Liu <davies@databricks.com>

Closes #6714 from davies/batch_size and squashes the following commits:

b170dfb [Davies Liu] update test
b9be832 [Davies Liu] Merge branch 'batch_size' of github.com:davies/spark into batch_size
6ade745 [Davies Liu] update test
5c21777 [Davies Liu] Update shuffle.py
e746aec [Davies Liu] fix batch size during sort
2015-06-18 13:49:32 -07:00
ml [SPARK-7432] [MLLIB] fix flaky CrossValidator doctest 2015-06-02 08:51:07 -07:00
mllib [SPARK-7916] [MLLIB] MLlib Python doc parity check for classification and regression 2015-06-16 14:30:42 -07:00
sql [SPARK-8146] DataFrame Python API: Alias replace in df.na 2015-06-07 01:21:08 -07:00
streaming [SPARK-7497] [PYSPARK] [STREAMING] fix streaming flaky tests 2015-06-01 14:40:40 -07:00
__init__.py [SPARK-4172] [PySpark] Progress API in Python 2015-02-17 13:36:43 -08:00
accumulators.py [SPARK-7899] [PYSPARK] Fix Python 3 pyspark/sql/types module conflict 2015-06-01 16:56:04 -07:00
broadcast.py [SPARK-4897] [PySpark] Python 3 support 2015-04-16 16:20:57 -07:00
cloudpickle.py [SPARK-4897] [PySpark] Python 3 support 2015-04-16 16:20:57 -07:00
conf.py [SPARK-4897] [PySpark] Python 3 support 2015-04-16 16:20:57 -07:00
context.py [SPARK-8373] [PYSPARK] Add emptyRDD to pyspark and fix the issue when calling sum on an empty RDD 2015-06-17 13:59:47 -07:00
daemon.py [SPARK-4897] [PySpark] Python 3 support 2015-04-16 16:20:57 -07:00
files.py [SPARK-3309] [PySpark] Put all public API in __all__ 2014-09-03 11:49:45 -07:00
heapq3.py [SPARK-4897] [PySpark] Python 3 support 2015-04-16 16:20:57 -07:00
java_gateway.py [SPARK-6949] [SQL] [PySpark] Support Date/Timestamp in Column expression 2015-04-21 00:08:18 -07:00
join.py [SPARK-4897] [PySpark] Python 3 support 2015-04-16 16:20:57 -07:00
profiler.py [SPARK-4897] [PySpark] Python 3 support 2015-04-16 16:20:57 -07:00
rdd.py [SPARK-8373] [PYSPARK] Add emptyRDD to pyspark and fix the issue when calling sum on an empty RDD 2015-06-17 13:59:47 -07:00
rddsampler.py [SPARK-4897] [PySpark] Python 3 support 2015-04-16 16:20:57 -07:00
resultiterable.py [SPARK-3074] [PySpark] support groupByKey() with single huge key 2015-04-09 17:07:23 -07:00
serializers.py [SPARK-4897] [PySpark] Python 3 support 2015-04-16 16:20:57 -07:00
shell.py [SPARK-4897] [PySpark] Python 3 support 2015-04-16 16:20:57 -07:00
shuffle.py [SPARK-8202] [PYSPARK] fix infinite loop during external sort in PySpark 2015-06-18 13:49:32 -07:00
statcounter.py [SPARK-4897] [PySpark] Python 3 support 2015-04-16 16:20:57 -07:00
status.py [SPARK-4172] [PySpark] Progress API in Python 2015-02-17 13:36:43 -08:00
storagelevel.py [SPARK-3417] Use new-style classes in PySpark 2014-09-08 15:45:36 -07:00
tests.py [SPARK-8202] [PYSPARK] fix infinite loop during external sort in PySpark 2015-06-18 13:49:32 -07:00
traceback_utils.py [SPARK-1087] Move python traceback utilities into new traceback_utils.py file. 2014-09-15 19:28:17 -07:00
worker.py [SPARK-6216] [PYSPARK] check python version of worker with driver 2015-05-18 12:55:37 -07:00