spark-instrumented-optimizer

History

Holden Karau 42050c3f4f [SPARK-27659][PYTHON] Allow PySpark to prefetch during toLocalIterator ### What changes were proposed in this pull request? This PR allows Python toLocalIterator to prefetch the next partition while the first partition is being collected. The PR also adds a demo micro bench mark in the examples directory, we may wish to keep this or not. ### Why are the changes needed? In https://issues.apache.org/jira/browse/SPARK-23961 / `5e79ae3b40` we changed PySpark to only pull one partition at a time. This is memory efficient, but if partitions take time to compute this can mean we're spending more time blocking. ### Does this PR introduce any user-facing change? A new param is added to toLocalIterator ### How was this patch tested? New unit test inside of `test_rdd.py` checks the time that the elements are evaluated at. Another test that the results remain the same are added to `test_dataframe.py`. I also ran a micro benchmark in the examples directory `prefetch.py` which shows an improvement of ~40% in this specific use case. > > 19/08/16 17:11:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable > Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). > Running timers: > > [Stage 32:> (0 + 1) / 1] > Results: > > Prefetch time: > > 100.228110831 > > > Regular time: > > 188.341721614 > > > Closes #25515 from holdenk/SPARK-27659-allow-pyspark-tolocalitr-to-prefetch. Authored-by: Holden Karau <hkarau@apple.com> Signed-off-by: Holden Karau <hkarau@apple.com>		2019-09-20 09:59:31 -07:00
..
benchmarks	[SPARK-29141][SQL][TEST] Use SqlBasedBenchmark in SQL benchmarks	2019-09-18 17:52:23 -07:00
src	[SPARK-27659][PYTHON] Allow PySpark to prefetch during toLocalIterator	2019-09-20 09:59:31 -07:00
v1.2.1/src	[SPARK-28744][SQL][TEST] rename SharedSQLContext to SharedSparkSession	2019-08-19 19:01:56 +08:00
v2.3.5/src	[SPARK-28744][SQL][TEST] rename SharedSQLContext to SharedSparkSession	2019-08-19 19:01:56 +08:00
pom.xml	[SPARK-27521][SQL] Move data source v2 to catalyst module	2019-06-05 09:55:55 -07:00