spark-instrumented-optimizer/bin
Bryan Cutler d03aebbe65 [SPARK-13534][PYSPARK] Using Apache Arrow to increase performance of DataFrame.toPandas
## What changes were proposed in this pull request?
Integrate Apache Arrow with Spark to increase performance of `DataFrame.toPandas`.  This is done by using Arrow to convert data partitions on the executor JVM to Arrow payload byte arrays, which are then served to the Python process.  On the Python side, the Arrow payloads are collected, combined, and converted to a Pandas DataFrame.  All data types except complex, date, timestamp, and decimal are currently supported; otherwise an `UnsupportedOperation` exception is thrown.
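Conceptually, each executor turns its partitions into Arrow record batches, and the Python side stitches the batches together into a single pandas DataFrame. A minimal sketch of that combine step using pyarrow's public API (the partition data here is made up for illustration; the actual patch uses the package private `ArrowPayload` path, not this code):

```python
import pandas as pd
import pyarrow as pa

# Stand-ins for the per-partition data produced on the executors.
partitions = [
    pd.DataFrame({"id": [1, 2], "value": [0.1, 0.2]}),
    pd.DataFrame({"id": [3, 4], "value": [0.3, 0.4]}),
]

# Each partition becomes an Arrow record batch (the payload contents).
batches = [pa.RecordBatch.from_pandas(pdf, preserve_index=False)
           for pdf in partitions]

# The batches are combined and converted to pandas in one step,
# avoiding per-row Python object conversion.
table = pa.Table.from_batches(batches)
result = table.to_pandas()
```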

Additions to Spark include a Scala package private method `Dataset.toArrowPayload` that converts data partitions in the executor JVM to `ArrowPayload`s as byte arrays so they can be easily served, and a package private class/object `ArrowConverters` that provides the data type mappings and conversion routines.  In Python, a private method `DataFrame._collectAsArrow` is added to collect Arrow payloads, and the SQLConf `spark.sql.execution.arrow.enable` controls whether `toPandas()` uses Arrow (the old conversion path is used by default).
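From a user's perspective, enabling the new path is a one-line configuration change. A minimal usage sketch (the conf name and its default come from this patch; the example query is arbitrary, and pyarrow must be installed in the Python environment):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arrow-demo").getOrCreate()

# Off by default; toPandas() then falls back to the old row-by-row conversion.
spark.conf.set("spark.sql.execution.arrow.enable", "true")

df = spark.range(1 << 20).selectExpr("id", "id * 2 AS doubled")
pdf = df.toPandas()  # collected via Arrow payloads when the conf is enabled
```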

## How was this patch tested?
Added a new test suite `ArrowConvertersSuite` that runs tests on conversion of Datasets to Arrow payloads for supported types.  The suite generates a Dataset along with matching Arrow JSON data, converts the Dataset to an Arrow payload, and validates the payload against the JSON data.  This ensures that both the schema and the data have been converted correctly.

Added PySpark tests to verify that the `toPandas` method produces equal DataFrames with and without pyarrow.  A roundtrip test ensures that the pandas DataFrame produced by PySpark is equal to one made directly with pandas.
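A minimal sketch of what such an equality check can look like (the actual suite lives in the PySpark tests; `assert_frame_equal` is pandas' own testing helper, not part of this patch, and the DataFrame here is a made-up example):

```python
from pandas.testing import assert_frame_equal
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arrow-equality-test").getOrCreate()
df = spark.range(100).selectExpr("id", "cast(id AS double) AS d")

# Collect the same query with and without the Arrow path.
spark.conf.set("spark.sql.execution.arrow.enable", "false")
expected = df.toPandas()

spark.conf.set("spark.sql.execution.arrow.enable", "true")
actual = df.toPandas()

# Both conversion paths must produce identical pandas DataFrames.
assert_frame_equal(expected, actual)
```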

Author: Bryan Cutler <cutlerb@gmail.com>
Author: Li Jin <ice.xelloss@gmail.com>
Author: Li Jin <li.jin@twosigma.com>
Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes #18459 from BryanCutler/toPandas_with_arrow-SPARK-13534.
2017-07-10 15:21:03 -07:00
| File | Last commit | Date |
| --- | --- | --- |
| beeline | [SPARK-1267][SPARK-18129] Allow PySpark to be pip installed | 2016-11-16 14:22:15 -08:00 |
| beeline.cmd | [SPARK-13673][WINDOWS] Fixed not to pollute environment variables. | 2016-03-04 13:53:53 +00:00 |
| find-spark-home | [SPARK-1267][SPARK-18129] Allow PySpark to be pip installed | 2016-11-16 14:22:15 -08:00 |
| load-spark-env.cmd | [SPARK-11518][DEPLOY, WINDOWS] Handle spaces in Windows command scripts | 2016-02-10 09:54:22 +00:00 |
| load-spark-env.sh | [SPARK-1267][SPARK-18129] Allow PySpark to be pip installed | 2016-11-16 14:22:15 -08:00 |
| pyspark | [SPARK-13534][PYSPARK] Using Apache Arrow to increase performance of DataFrame.toPandas | 2017-07-10 15:21:03 -07:00 |
| pyspark.cmd | [SPARK-11518][DEPLOY, WINDOWS] Handle spaces in Windows command scripts | 2016-02-10 09:54:22 +00:00 |
| pyspark2.cmd | [SPARK-21278][PYSPARK] Upgrade to Py4J 0.10.6 | 2017-07-05 16:33:23 -07:00 |
| run-example | [SPARK-1267][SPARK-18129] Allow PySpark to be pip installed | 2016-11-16 14:22:15 -08:00 |
| run-example.cmd | [SPARK-13576][BUILD] Don't create assembly for examples. | 2016-03-15 09:44:51 -07:00 |
| spark-class | [SPARK-20546][DEPLOY] spark-class gets syntax error in posix mode | 2017-05-05 11:36:51 +01:00 |
| spark-class.cmd | [SPARK-11518][DEPLOY, WINDOWS] Handle spaces in Windows command scripts | 2016-02-10 09:54:22 +00:00 |
| spark-class2.cmd | [SPARK-20613] Remove excess quotes in Windows executable | 2017-05-05 08:30:42 -07:00 |
| spark-shell | [SPARK-1267][SPARK-18129] Allow PySpark to be pip installed | 2016-11-16 14:22:15 -08:00 |
| spark-shell.cmd | [SPARK-11518][DEPLOY, WINDOWS] Handle spaces in Windows command scripts | 2016-02-10 09:54:22 +00:00 |
| spark-shell2.cmd | [SPARK-11518][DEPLOY, WINDOWS] Handle spaces in Windows command scripts | 2016-02-10 09:54:22 +00:00 |
| spark-sql | [SPARK-1267][SPARK-18129] Allow PySpark to be pip installed | 2016-11-16 14:22:15 -08:00 |
| spark-submit | [SPARK-1267][SPARK-18129] Allow PySpark to be pip installed | 2016-11-16 14:22:15 -08:00 |
| spark-submit.cmd | [SPARK-13592][WINDOWS] fix path of spark-submit2.cmd in spark-submit.cmd | 2016-03-01 14:37:36 +00:00 |
| spark-submit2.cmd | [SPARK-11518][DEPLOY, WINDOWS] Handle spaces in Windows command scripts | 2016-02-10 09:54:22 +00:00 |
| sparkR | [SPARK-1267][SPARK-18129] Allow PySpark to be pip installed | 2016-11-16 14:22:15 -08:00 |
| sparkR.cmd | [SPARK-11518][DEPLOY, WINDOWS] Handle spaces in Windows command scripts | 2016-02-10 09:54:22 +00:00 |
| sparkR2.cmd | [SPARK-11518][DEPLOY, WINDOWS] Handle spaces in Windows command scripts | 2016-02-10 09:54:22 +00:00 |