spark-instrumented-optimizer/python/pyspark
Bryan Cutler ecaa495b1f [SPARK-25274][PYTHON][SQL] In toPandas with Arrow send un-ordered record batches to improve performance
## What changes were proposed in this pull request?

When executing `toPandas` with Arrow enabled, partitions that arrive in the JVM out of order must be buffered before they can be sent to Python. This uses excess memory in the driver JVM and increases the time it takes to complete, because data must sit in the JVM waiting for preceding partitions to arrive.

This change sends un-ordered partitions to Python as soon as they arrive in the JVM, followed by a list of partition indices so that Python can reassemble the data in the correct order. This way, data is not buffered in the JVM and there is no waiting on particular partitions, which improves performance.
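As an illustration, the reordering on the Python side amounts to the following sketch. The names `batches` and `batch_order` are hypothetical for this example; the actual logic lives in `serializers.py` and `dataframe.py`.

```python
import pyarrow as pa

def to_pandas_ordered(batches, batch_order):
    """Reassemble out-of-order Arrow record batches into a pandas DataFrame.

    `batches` holds the record batches in arrival order; `batch_order` is
    the list of indices sent last by the JVM, giving the position in
    `batches` of each batch in the original partition order.
    """
    ordered = [batches[i] for i in batch_order]
    return pa.Table.from_batches(ordered).to_pandas()
```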

Followup to #21546

## How was this patch tested?

Added a new test with a large number of batches per partition, and a test that forces a small delay in the first partition. These verify that partitions collected out of order are put back in the correct order in Python; a sketch of the delay technique follows.
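The delay can be forced with `mapPartitionsWithIndex`, roughly as in this simplified sketch (an active `spark` session is assumed):

```python
import time

def delay_first_part(partition_index, iterator):
    # stall partition 0 so later partitions reach the driver first
    if partition_index == 0:
        time.sleep(0.1)
    return iterator

df = spark.range(64, numPartitions=8).toDF("a")
delayed = df.rdd.mapPartitionsWithIndex(delay_first_part).toDF()
pdf = delayed.toPandas()
assert list(pdf["a"]) == list(range(64))  # order preserved despite the delay
```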

## Performance Tests - toPandas

Tests were run on a 4-node standalone cluster with 32 cores total, Ubuntu 14.04.1, and OpenJDK 8.
Wall clock time to execute `toPandas()` was measured, taking the average best time over 5 runs of 5 loops each.

Test code
```python
import time
from pyspark.sql.functions import rand

# 2^25 rows across 32 partitions, with four random double columns
df = spark.range(1 << 25, numPartitions=32).toDF("id") \
    .withColumn("x1", rand()).withColumn("x2", rand()) \
    .withColumn("x3", rand()).withColumn("x4", rand())

for i in range(5):
    start = time.time()
    _ = df.toPandas()
    print("run %d: %.6f s" % (i, time.time() - start))
```

Spark config
```
spark.driver.memory 5g
spark.executor.memory 5g
spark.driver.maxResultSize 2g
spark.sql.execution.arrow.enabled true
```
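The Arrow flag can also be set at runtime rather than in `spark-defaults.conf`:

```python
# enable Arrow-based toPandas for the active session
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
```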

Current Master w/ Arrow stream (s) | This PR (s)
---------------------|------------
5.16207 | 4.342533
5.133671 | 4.399408
5.147513 | 4.468471
5.105243 | 4.36524
5.018685 | 4.373791

Avg Master (s) | Avg This PR (s)
------------------|--------------
5.1134364 | 4.3898886

Speedup of **1.16x** (5.1134364 / 4.3898886 ≈ 1.1648)

Closes #22275 from BryanCutler/arrow-toPandas-oo-batches-SPARK-25274.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
2018-12-06 10:07:28 -08:00
ml [SPARK-26133][ML][FOLLOWUP] Fix doc for OneHotEncoder 2018-12-05 19:30:25 +08:00
mllib [SPARK-26275][PYTHON][ML] Increases timeout for StreamingLogisticRegressionWithSGDTests.test_training_and_prediction test 2018-12-06 09:14:46 +08:00
sql [SPARK-25274][PYTHON][SQL] In toPandas with Arrow send un-ordered record batches to improve performance 2018-12-06 10:07:28 -08:00
streaming [SPARK-26035][PYTHON] Break large streaming/tests.py files into smaller files 2018-11-16 07:58:09 +08:00
testing [SPARK-26033][SPARK-26034][PYTHON][FOLLOW-UP] Small cleanup and deduplication in ml/mllib tests 2018-12-03 14:03:10 -08:00
tests [SPARK-26201] Fix python broadcast with encryption 2018-11-30 12:48:56 -06:00
__init__.py [SPARK-25248][.1][PYSPARK] update barrier Python API 2018-08-29 07:22:03 -07:00
_globals.py [SPARK-23328][PYTHON] Disallow default value None in na.replace/replace when 'to_replace' is not a dictionary 2018-02-09 14:21:10 +08:00
accumulators.py [SPARK-25591][PYSPARK][SQL] Avoid overwriting deserialized accumulator 2018-10-08 15:18:08 +08:00
broadcast.py [SPARK-26201] Fix python broadcast with encryption 2018-11-30 12:48:56 -06:00
cloudpickle.py [SPARK-24303][PYTHON] Update cloudpickle to v0.4.4 2018-05-18 09:53:24 -07:00
conf.py [SPARK-23522][PYTHON] always use sys.exit over builtin exit 2018-03-08 20:38:34 +09:00
context.py [SPARK-25737][CORE] Remove JavaSparkContextVarargsWorkaround 2018-10-24 14:43:51 -05:00
daemon.py [PYSPARK] Update py4j to version 0.10.7. 2018-05-09 10:47:35 -07:00
files.py [SPARK-3309] [PySpark] Put all public API in __all__ 2014-09-03 11:49:45 -07:00
find_spark_home.py Fix typos detected by github.com/client9/misspell 2018-08-11 21:23:36 -05:00
heapq3.py Fix typos detected by github.com/client9/misspell 2018-08-11 21:23:36 -05:00
java_gateway.py [SPARK-25253][PYSPARK][FOLLOWUP] Undefined name: from pyspark.util import _exception_message 2018-08-30 08:13:11 +08:00
join.py [SPARK-14202] [PYTHON] Use generator expression instead of list comp in python_full_outer_jo… 2016-03-28 14:51:36 -07:00
profiler.py [SPARK-23522][PYTHON] always use sys.exit over builtin exit 2018-03-08 20:38:34 +09:00
rdd.py [MINOR] Update all DOI links to preferred resolver 2018-11-25 17:43:55 -06:00
rddsampler.py [SPARK-4897] [PySpark] Python 3 support 2015-04-16 16:20:57 -07:00
resultiterable.py [SPARK-3074] [PySpark] support groupByKey() with single huge key 2015-04-09 17:07:23 -07:00
serializers.py [SPARK-25274][PYTHON][SQL] In toPandas with Arrow send un-ordered record batches to improve performance 2018-12-06 10:07:28 -08:00
shell.py [SPARK-25238][PYTHON] lint-python: Fix W605 warnings for pycodestyle 2.4 2018-09-13 11:19:43 +08:00
shuffle.py [SPARK-23754][PYTHON] Re-raising StopIteration in client code 2018-05-30 18:11:33 +08:00
statcounter.py [SPARK-6919] [PYSPARK] Add asDict method to StatCounter 2015-09-29 13:38:15 -07:00
status.py [SPARK-4172] [PySpark] Progress API in Python 2015-02-17 13:36:43 -08:00
storagelevel.py [SPARK-25908][CORE][SQL] Remove old deprecated items in Spark 3 2018-11-07 22:48:50 -06:00
taskcontext.py [SPARK-25921][PYSPARK] Fix barrier task run without BarrierTaskContext while python worker reuse 2018-11-13 17:05:39 +08:00
traceback_utils.py [SPARK-1087] Move python traceback utilities into new traceback_utils.py file. 2014-09-15 19:28:17 -07:00
util.py [SPARK-25238][PYTHON] lint-python: Fix W605 warnings for pycodestyle 2.4 2018-09-13 11:19:43 +08:00
version.py [SPARK-25592] Setting version to 3.0.0-SNAPSHOT 2018-10-02 08:48:24 -07:00
worker.py [SPARK-26080][PYTHON] Skips Python resource limit on Windows in Python worker 2018-12-02 17:41:08 +08:00