spark-instrumented-optimizer/dev
Bryan Cutler e44697606f [SPARK-13534][PYSPARK] Using Apache Arrow to increase performance of DataFrame.toPandas
## What changes were proposed in this pull request?
Integrate Apache Arrow with Spark to increase performance of `DataFrame.toPandas`.  This has been done by using Arrow to convert data partitions on the executor JVM to Arrow payload byte arrays where they are then served to the Python process.  The Python DataFrame can then collect the Arrow payloads where they are combined and converted to a Pandas DataFrame.  All non-complex data types are currently supported, otherwise an `UnsupportedOperation` exception is thrown.

Additions to Spark include a Scala package private method `Dataset.toArrowPayloadBytes` that will convert data partitions in the executor JVM to `ArrowPayload`s as byte arrays so they can be easily served.  A package private class/object `ArrowConverters` that provide data type mappings and conversion routines.  In Python, a public method `DataFrame.collectAsArrow` is added to collect Arrow payloads and an optional flag in `toPandas(useArrow=False)` to enable using Arrow (uses the old conversion by default).

## How was this patch tested?
Added a new test suite `ArrowConvertersSuite` that will run tests on conversion of Datasets to Arrow payloads for supported types.  The suite will generate a Dataset and matching Arrow JSON data, then the dataset is converted to an Arrow payload and finally validated against the JSON data.  This will ensure that the schema and data has been converted correctly.

Added PySpark tests to verify the `toPandas` method is producing equal DataFrames with and without pyarrow.  A roundtrip test to ensure the pandas DataFrame produced by pyspark is equal to a one made directly with pandas.

Author: Bryan Cutler <cutlerb@gmail.com>
Author: Li Jin <ice.xelloss@gmail.com>
Author: Li Jin <li.jin@twosigma.com>
Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes #15821 from BryanCutler/wip-toPandas_with_arrow-SPARK-13534.
2017-06-23 09:01:13 +08:00
..
create-release [SPARK-20627][PYSPARK] Drop the hadoop distirbution name from the Python version 2017-05-09 11:25:29 -07:00
deps [SPARK-13534][PYSPARK] Using Apache Arrow to increase performance of DataFrame.toPandas 2017-06-23 09:01:13 +08:00
sparktestsupport [SPARK-20974][BUILD] we should run REPL tests if SQL module has code changes 2017-06-02 21:59:52 -07:00
tests [SPARK-10359] Enumerate dependencies in a file and diff against it for new pull requests 2015-12-30 12:47:42 -08:00
.gitignore [SPARK-6219] Reuse pep8.py 2015-04-18 16:46:28 -07:00
.rat-excludes [SPARK-20434][YARN][CORE] Move Hadoop delegation token code from yarn to core 2017-06-15 11:46:00 -07:00
appveyor-guide.md [SPARK-17200][PROJECT INFRA][BUILD][SPARKR] Automate building and testing on Windows (currently SparkR only) 2016-09-08 08:26:59 -07:00
appveyor-install-dependencies.ps1 [SPARK-19550][BUILD][CORE][WIP] Remove Java 7 support 2017-02-16 12:32:45 +00:00
change-scala-version.sh [SPARK-9250] Make change-scala-version more helpful w.r.t. valid Scala versions 2015-07-24 17:09:33 +01:00
change-version-to-2.10.sh [MINOR] Fix some typo of the document 2017-06-19 20:35:58 +01:00
change-version-to-2.11.sh [MINOR] Fix some typo of the document 2017-06-19 20:35:58 +01:00
check-license [SPARK-13596][BUILD] Move misc top-level build files into appropriate subdirs 2016-03-07 14:48:02 -08:00
checkstyle-suppressions.xml [MINOR][BUILD] Fix lint-java breaks. 2017-05-10 13:56:34 +01:00
checkstyle.xml [SPARK-18073][DOCS][WIP] Migrate wiki to spark.apache.org web site 2016-11-23 11:25:47 +00:00
github_jira_sync.py [SPARK-19002][BUILD][PYTHON] Check pep8 against all Python scripts 2017-01-02 15:23:19 +00:00
lint-java [SPARK-16967] move mesos to module 2016-08-26 12:25:22 -07:00
lint-python [MINOR][PYTHON] Ignore pep8 on test scripts generated in tests in work directory 2017-06-02 14:25:38 +01:00
lint-r [SPARK-10328] [SPARKR] Fix generic for na.omit 2015-08-28 00:37:50 -07:00
lint-r.R [SPARK-14074][SPARKR] Specify commit sha1 ID when using install_github to install intr package. 2016-03-23 07:57:03 -07:00
lint-scala [SPARK-2627] [PySpark] have the build enforce PEP 8 automatically 2014-08-06 12:58:24 -07:00
make-distribution.sh [SPARK-20123][BUILD] SPARK_HOME variable might have spaces in it(e.g. $SPARK… 2017-04-02 15:31:13 +01:00
merge_spark_pr.py [SPARK-19002][BUILD][PYTHON] Check pep8 against all Python scripts 2017-01-02 15:23:19 +00:00
mima [SPARK-19550][HOTFIX][BUILD] Use JAVA_HOME/bin/java if JAVA_HOME is set in dev/mima 2017-02-16 18:43:38 +00:00
pip-sanity-check.py [SPARK-19064][PYSPARK] Fix pip installing of sub components 2017-01-25 14:43:39 -08:00
README.md Merge pull request #565 from pwendell/dev-scripts. Closes #565. 2014-02-08 23:13:34 -08:00
requirements.txt [SPARK-19064][PYSPARK] Fix pip installing of sub components 2017-01-25 14:43:39 -08:00
run-pip-tests [SPARK-13534][PYSPARK] Using Apache Arrow to increase performance of DataFrame.toPandas 2017-06-23 09:01:13 +08:00
run-tests [SPARK-5161] Parallelize Python test execution 2015-06-29 21:32:40 -07:00
run-tests-jenkins [SPARK-19955][PYSPARK] Jenkins Python Conda based test. 2017-03-29 11:41:17 -07:00
run-tests-jenkins.py [SPARK-19464][BUILD][HOTFIX] run-tests should use hadoop2.6 2017-02-08 21:28:04 +00:00
run-tests.py [SPARK-20974][BUILD] we should run REPL tests if SQL module has code changes 2017-06-02 21:59:52 -07:00
scalastyle [SPARK-16967] move mesos to module 2016-08-26 12:25:22 -07:00
test-dependencies.sh [SPARK-19550][BUILD][CORE][WIP] Remove Java 7 support 2017-02-16 12:32:45 +00:00
tox.ini [MINOR][PYTHON] Ignore pep8 on test scripts generated in tests in work directory 2017-06-02 14:25:38 +01:00

Spark Developer Scripts

This directory contains scripts useful to developers when packaging, testing, or committing to Spark.

Many of these scripts require Apache credentials to work correctly.