spark-instrumented-optimizer/python/pyspark/sql/tests
Holden Karau 42050c3f4f [SPARK-27659][PYTHON] Allow PySpark to prefetch during toLocalIterator
### What changes were proposed in this pull request?

This PR allows Python toLocalIterator to prefetch the next partition while the first partition is being collected. The PR also adds a demo micro bench mark in the examples directory, we may wish to keep this or not.

### Why are the changes needed?

In https://issues.apache.org/jira/browse/SPARK-23961 / 5e79ae3b40 we changed PySpark to only pull one partition at a time. This is memory efficient, but if partitions take time to compute this can mean we're spending more time blocking.

### Does this PR introduce any user-facing change?

A new param is added to toLocalIterator

### How was this patch tested?

New unit test inside of `test_rdd.py` checks the time that the elements are evaluated at. Another test that the results remain the same are added to `test_dataframe.py`.

I also ran a micro benchmark in the examples directory `prefetch.py` which shows an improvement of ~40% in this specific use case.

>
> 19/08/16 17:11:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
> Running timers:
>
> [Stage 32:>                                                         (0 + 1) / 1]
> Results:
>
> Prefetch time:
>
> 100.228110831
>
>
> Regular time:
>
> 188.341721614
>
>
>

Closes #25515 from holdenk/SPARK-27659-allow-pyspark-tolocalitr-to-prefetch.

Authored-by: Holden Karau <hkarau@apple.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
2019-09-20 09:59:31 -07:00
..
__init__.py [SPARK-26032][PYTHON] Break large sql/tests.py files into smaller files 2018-11-14 14:51:11 +08:00
test_arrow.py [SPARK-28881][PYTHON][TESTS][FOLLOW-UP] Use SparkSession(SparkContext(...)) to prevent for Spark conf to affect other tests 2019-08-28 10:39:21 +09:00
test_catalog.py [SPARK-28130][PYTHON] Print pretty messages for skipped tests when xmlrunner is available in PySpark 2019-06-24 09:58:17 +09:00
test_column.py [SPARK-28130][PYTHON] Print pretty messages for skipped tests when xmlrunner is available in PySpark 2019-06-24 09:58:17 +09:00
test_conf.py [SPARK-28130][PYTHON] Print pretty messages for skipped tests when xmlrunner is available in PySpark 2019-06-24 09:58:17 +09:00
test_context.py [SPARK-28980][CORE][SQL][STREAMING][MLLIB] Remove most items deprecated in Spark 2.2.0 or earlier, for Spark 3 2019-09-09 10:19:40 -05:00
test_dataframe.py [SPARK-27659][PYTHON] Allow PySpark to prefetch during toLocalIterator 2019-09-20 09:59:31 -07:00
test_datasources.py [SPARK-28130][PYTHON] Print pretty messages for skipped tests when xmlrunner is available in PySpark 2019-06-24 09:58:17 +09:00
test_functions.py [SPARK-28153][PYTHON] Use AtomicReference at InputFileBlockHolder (to support input_file_name with Python UDF) 2019-07-31 22:40:01 +08:00
test_group.py [SPARK-28130][PYTHON] Print pretty messages for skipped tests when xmlrunner is available in PySpark 2019-06-24 09:58:17 +09:00
test_pandas_udf.py [SPARK-28130][PYTHON] Print pretty messages for skipped tests when xmlrunner is available in PySpark 2019-06-24 09:58:17 +09:00
test_pandas_udf_cogrouped_map.py [SPARK-27463][PYTHON] Support Dataframe Cogroup via Pandas UDFs 2019-09-17 17:13:50 -07:00
test_pandas_udf_grouped_agg.py [SPARK-28422][SQL][PYTHON] GROUPED_AGG pandas_udf should work without group by clause 2019-08-14 00:32:33 +09:00
test_pandas_udf_grouped_map.py [SPARK-28130][PYTHON] Print pretty messages for skipped tests when xmlrunner is available in PySpark 2019-06-24 09:58:17 +09:00
test_pandas_udf_iter.py [SPARK-28198][PYTHON][FOLLOW-UP] Rename mapPartitionsInPandas to mapInPandas with a separate evaluation type 2019-07-05 09:22:41 +09:00
test_pandas_udf_scalar.py [SPARK-28998][SQL] reorganize the packages of DS v2 interfaces/classes 2019-09-12 19:59:34 +08:00
test_pandas_udf_window.py [SPARK-28130][PYTHON] Print pretty messages for skipped tests when xmlrunner is available in PySpark 2019-06-24 09:58:17 +09:00
test_readwriter.py [SPARK-28411][PYTHON][SQL] InsertInto with overwrite is not honored 2019-07-18 13:37:59 +09:00
test_serde.py [SPARK-29041][PYTHON] Allows createDataFrame to accept bytes as binary type 2019-09-12 08:52:25 +09:00
test_session.py [SPARK-28130][PYTHON] Print pretty messages for skipped tests when xmlrunner is available in PySpark 2019-06-24 09:58:17 +09:00
test_streaming.py [SPARK-28130][PYTHON] Print pretty messages for skipped tests when xmlrunner is available in PySpark 2019-06-24 09:58:17 +09:00
test_types.py [SPARK-28454][PYTHON] Validate LongType in createDataFrame(verifySchema=True) 2019-08-08 11:47:25 +09:00
test_udf.py [SPARK-28998][SQL] reorganize the packages of DS v2 interfaces/classes 2019-09-12 19:59:34 +08:00
test_utils.py [SPARK-19926][PYSPARK] make captured exception from JVM side user friendly 2019-09-18 23:32:10 +09:00