Josh Rosen 6d06ff6f7e [SPARK-17514] df.take(1) and df.limit(1).collect() should perform the same in Python
## What changes were proposed in this pull request?

In PySpark, `df.take(1)` runs a single-stage job which computes only one partition of the DataFrame, while `df.limit(1).collect()` computes all partitions and runs a two-stage job. This difference in performance is confusing.
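
For concreteness, a minimal reproduction of the two calls being compared (the session setup and DataFrame shape here are illustrative, not taken from the patch):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").getOrCreate()
df = spark.range(0, 1000, numPartitions=10)

df.take(1)             # single-stage job; scans only as many partitions as needed
df.limit(1).collect()  # before this patch: a two-stage job touching every partition
```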

The reason `limit(1).collect()` is so much slower is that `collect()` internally maps to `df.rdd.<some-pyspark-conversions>.toLocalIterator`, which causes Spark SQL to build a query where a global limit appears in the middle of the plan. This, in turn, is executed inefficiently, because limits in the middle of plans are now implemented by repartitioning to a single task rather than by running a `take()` job on the driver (this was done in #7334, a patch which was a prerequisite to allowing partition-local limits to be pushed beneath unions, etc.).
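
The difference is visible in the physical plans. The operator trees below are a sketch, reconstructed from how Spark 2.x typically plans limits rather than copied from the patch:

```python
# Plan shapes are approximate, reconstructed from Spark 2.x behavior.
df.limit(1).explain()
# == Physical Plan ==
# CollectLimit 1
# +- ... scan ...

# When another operator sits on top of the limit (as in the old collect()
# path, which first converts the result to an RDD), the planner instead
# shuffles everything to a single partition:
#
# GlobalLimit 1
# +- Exchange SinglePartition
#    +- LocalLimit 1
#       +- ... scan ...
```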

In order to fix this performance problem, I think that we should generalize the fix from SPARK-10731 / #8876 so that `DataFrame.collect()` also delegates to the Scala implementation and shares the same performance properties. This patch modifies `DataFrame.collect()` to first collect all results to the driver and then pass them to Python, allowing the query to be planned using Spark's `CollectLimit` optimizations.
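
As a sketch, the post-patch `collect()` has roughly this shape (simplified; it assumes the JVM-side `Dataset.collectToPython()` helper from the SPARK-10731 fix):

```python
from pyspark.rdd import _load_from_socket
from pyspark.serializers import BatchedSerializer, PickleSerializer
from pyspark.traceback_utils import SCCallSiteSync

def collect(self):
    """Returns all the records as a list of :class:`Row`."""
    with SCCallSiteSync(self._sc):
        # Collect on the Scala side, where the planner sees the limit as the
        # terminal operator and can choose CollectLimit; the pickled rows are
        # then served back to Python over a local socket.
        port = self._jdf.collectToPython()
    return list(_load_from_socket(port, BatchedSerializer(PickleSerializer())))
```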

## How was this patch tested?

Added a regression test in `sql/tests.py` which asserts that both queries run the expected number of jobs, stages, and tasks.
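
A hedged sketch of what such a test can look like, using PySpark's `StatusTracker`; the helper name and job-group labels are illustrative, and the snippet reuses the `spark` session from the first example above (the real test lives in `python/pyspark/sql/tests.py`):

```python
def assert_one_job_one_stage_one_task(sc, group, f):
    """Run f() in its own job group and check it used 1 job / 1 stage / 1 task."""
    sc.setJobGroup(group, group)
    f()
    tracker = sc.statusTracker()  # fed by the listener bus; may need a short wait
    (job_id,) = tracker.getJobIdsForGroup(group)         # exactly one job
    (stage_id,) = tracker.getJobInfo(job_id).stageIds    # exactly one stage
    assert tracker.getStageInfo(stage_id).numTasks == 1  # exactly one task

df = spark.range(0, 1000, numPartitions=10)
sc = spark.sparkContext
assert_one_job_one_stage_one_task(sc, "take_1", lambda: df.take(1))
assert_one_job_one_stage_one_task(sc, "limit_1_collect", lambda: df.limit(1).collect())
```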

Author: Josh Rosen <joshrosen@databricks.com>

Closes #15068 from JoshRosen/pyspark-collect-limit.
2016-09-14 10:10:01 -07:00
__init__.py [SPARK-16772][PYTHON][DOCS] Fix API doc references to UDFRegistration + Update "important classes" 2016-08-06 05:02:59 +01:00
catalog.py [SPARK-16772] Correct API doc references to PySpark classes + formatting fixes 2016-07-28 14:57:15 -07:00
column.py [SPARK-17215][SQL] Method SQLContext.parseDataType(dataTypeString: String) could be removed. 2016-08-24 23:36:04 -07:00
conf.py [SPARK-15464][ML][MLLIB][SQL][TESTS] Replace SQLContext and SparkContext with SparkSession using builder pattern in python test code 2016-05-23 18:14:48 -07:00
context.py [SPARK-16700][PYSPARK][SQL] create DataFrame from dict/Row with schema 2016-08-15 12:41:27 -07:00
dataframe.py [SPARK-17514] df.take(1) and df.limit(1).collect() should perform the same in Python 2016-09-14 10:10:01 -07:00
functions.py [SPARK-17215][SQL] Method SQLContext.parseDataType(dataTypeString: String) could be removed. 2016-08-24 23:36:04 -07:00
group.py [MINOR][PYSPARK][DOC] Fix wrongly formatted examples in PySpark documentation 2016-07-06 10:45:51 -07:00
readwriter.py [SPARK-17215][SQL] Method SQLContext.parseDataType(dataTypeString: String) could be removed. 2016-08-24 23:36:04 -07:00
session.py [SPARK-17261] [PYSPARK] Using HiveContext after re-creating SparkContext in Spark 2.0 throws "java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext" 2016-09-02 10:08:14 -07:00
streaming.py [SPARK-17264][SQL] DataStreamWriter should document that it only supports Parquet for now 2016-08-30 11:19:45 +01:00
tests.py [SPARK-17514] df.take(1) and df.limit(1).collect() should perform the same in Python 2016-09-14 10:10:01 -07:00
types.py [SPARK-17215][SQL] Method SQLContext.parseDataType(dataTypeString: String) could be removed. 2016-08-24 23:36:04 -07:00
utils.py [SPARK-15953][WIP][STREAMING] Renamed ContinuousQuery to StreamingQuery 2016-06-15 10:46:07 -07:00
window.py [SPARK-14058][PYTHON] Incorrect docstring in Window.order 2016-03-21 23:52:33 -07:00