spark-instrumented-optimizer/python/pyspark
Aaron Davidson f46e02fcdb SPARK-2203: PySpark defaults to use same num reduce partitions as map side
For shuffle-based operators such as rdd.groupBy() or rdd.sortByKey(), PySpark always assumes that the default parallelism to use for the reduce side is ctx.defaultParallelism, a constant typically determined by the number of cores in the cluster.

In contrast, Spark's Partitioner#defaultPartitioner uses the same number of reduce partitions as map partitions unless spark.default.parallelism is explicitly set. This tends to be a better default for avoiding OOMs, and should also be the behavior of PySpark.

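A minimal sketch of that behavior in Python, assuming a hypothetical helper named default_reduce_partitions (the actual patch lives in rdd.py; the helper name and call site shown here are illustrative, not quoted from the diff):

    # Sketch only, not the verbatim patch. `default_reduce_partitions`
    # is a hypothetical name for the logic described in the commit body.
    def default_reduce_partitions(rdd):
        """Pick the reduce-side partition count for a shuffle on `rdd`.

        Mirrors Scala's Partitioner#defaultPartitioner: defer to
        ctx.defaultParallelism only when spark.default.parallelism is
        explicitly set; otherwise reuse the map-side partition count.
        """
        if rdd.ctx._conf.contains("spark.default.parallelism"):
            return rdd.ctx.defaultParallelism
        return rdd.getNumPartitions()

    # Shuffle operators such as groupBy()/sortByKey() would then fall back
    # to this value whenever the caller does not pass numPartitions:
    #     numPartitions = numPartitions or default_reduce_partitions(self)
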
JIRA: https://issues.apache.org/jira/browse/SPARK-2203

Author: Aaron Davidson <aaron@databricks.com>

Closes #1138 from aarondav/pyfix and squashes the following commits:

1bd5751 [Aaron Davidson] SPARK-2203: PySpark defaults to use same num reduce partitions as map partitions
2014-06-20 00:06:57 -07:00
mllib [SPARK-2091][MLLIB] use numpy.dot instead of ndarray.dot 2014-06-11 00:22:40 -07:00
__init__.py SPARK-1004. PySpark on YARN 2014-04-29 23:24:34 -07:00
accumulators.py Add custom serializer support to PySpark. 2013-11-10 16:45:38 -08:00
broadcast.py Fix some Python docs and make sure to unset SPARK_TESTING in Python 2013-12-29 20:15:07 -05:00
cloudpickle.py SPARK-1917: fix PySpark import of scipy.special functions 2014-05-31 14:59:09 -07:00
conf.py [FIX] do not load defaults when testing SparkConf in pyspark 2014-05-14 14:57:17 -07:00
context.py SPARK-1416: PySpark support for SequenceFile and Hadoop InputFormats 2014-06-09 22:21:03 -07:00
daemon.py SPARK-1579: Clean up PythonRDD and avoid swallowing IOExceptions 2014-05-07 09:48:31 -07:00
files.py Initial work to rename package to org.apache.spark 2013-09-01 14:13:13 -07:00
java_gateway.py [SPARK-1466] Raise exception if pyspark Gateway process doesn't start. 2014-06-18 13:16:26 -07:00
join.py Spark 1271: Co-Group and Group-By should pass Iterable[X] 2014-04-08 18:15:59 -07:00
rdd.py SPARK-2203: PySpark defaults to use same num reduce partitions as map side 2014-06-20 00:06:57 -07:00
rddsampler.py SPARK-1438 RDD.sample() make seed param optional 2014-04-24 17:27:16 -07:00
resultiterable.py Spark 1271: Co-Group and Group-By should pass Iterable[X] 2014-04-08 18:15:59 -07:00
serializers.py SPARK-1421. Make MLlib work on Python 2.6 2014-04-05 20:52:05 -07:00
shell.py [SPARK-1808] Route bin/pyspark through Spark submit 2014-05-16 22:34:38 -07:00
sql.py [SPARK-2060][SQL] Querying JSON Datasets with SQL and DSL in Spark SQL 2014-06-17 19:14:59 -07:00
statcounter.py Spark 1246 add min max to stat counter 2014-03-18 00:45:47 -07:00
storagelevel.py [SPARK-2130] End-user friendly String repr for StorageLevel in Python 2014-06-16 23:31:31 -07:00
tests.py SPARK-554. Add aggregateByKey. 2014-06-12 08:14:25 -07:00
worker.py Add Python includes to path before depickling broadcast values 2014-05-10 13:02:13 -07:00