Bryan Cutler e44697606f [SPARK-13534][PYSPARK] Using Apache Arrow to increase performance of DataFrame.toPandas

## What changes were proposed in this pull request?

Integrate Apache Arrow with Spark to increase performance of `DataFrame.toPandas`. This has been done by using Arrow to convert data partitions on the executor JVM to Arrow payload byte arrays, which are then served to the Python process. The Python DataFrame can then collect the Arrow payloads, which are combined and converted to a Pandas DataFrame. All non-complex data types are currently supported; otherwise an `UnsupportedOperation` exception is thrown.

Additions to Spark include a Scala package private method `Dataset.toArrowPayloadBytes` that will convert data partitions in the executor JVM to `ArrowPayload`s as byte arrays so they can be easily served, and a package private class/object `ArrowConverters` that provides data type mappings and conversion routines. In Python, a public method `DataFrame.collectAsArrow` is added to collect Arrow payloads, along with an optional flag in `toPandas(useArrow=False)` to enable using Arrow (the old conversion is used by default).

## How was this patch tested?

Added a new test suite `ArrowConvertersSuite` that will run tests on conversion of Datasets to Arrow payloads for supported types. The suite will generate a Dataset and matching Arrow JSON data, then the Dataset is converted to an Arrow payload and finally validated against the JSON data. This will ensure that the schema and data have been converted correctly.

Added PySpark tests to verify that the `toPandas` method produces equal DataFrames with and without pyarrow, and a roundtrip test to ensure the pandas DataFrame produced by pyspark is equal to one made directly with pandas.

Author: Bryan Cutler <cutlerb@gmail.com>
Author: Li Jin <ice.xelloss@gmail.com>
Author: Li Jin <li.jin@twosigma.com>
Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes #15821 from BryanCutler/wip-toPandas_with_arrow-SPARK-13534.
2017-06-23 09:01:13 +08:00
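The data flow described in the commit message (each partition encoded to a byte payload on the executor side, payloads collected by the Python side, then combined into one columnar result) can be illustrated with a minimal, standard-library-only sketch. This is not the Spark or Arrow implementation: JSON stands in for the Arrow payload format, plain dictionaries stand in for a pandas DataFrame, and the helper names `partition_to_payload` and `combine_payloads` are hypothetical.

```python
import json


def partition_to_payload(rows, columns):
    """Encode one partition column-wise as bytes.

    Assumption: JSON is used here only as a stand-in for the Arrow format.
    """
    columnar = {name: [row[i] for row in rows] for i, name in enumerate(columns)}
    return json.dumps(columnar).encode("utf-8")


def combine_payloads(payloads, columns):
    """Decode each payload and concatenate the column chunks in partition order."""
    combined = {name: [] for name in columns}
    for payload in payloads:
        chunk = json.loads(payload.decode("utf-8"))
        for name in columns:
            combined[name].extend(chunk[name])
    return combined


columns = ["id", "score"]
partitions = [
    [(1, 0.5), (2, 1.5)],  # rows held by partition 0
    [(3, 2.5)],            # rows held by partition 1
]

# "Executor side": one byte payload per partition.
payloads = [partition_to_payload(rows, columns) for rows in partitions]

# "Python driver side": collect the payloads and combine them into one table.
table = combine_payloads(payloads, columns)
print(table)
```

The point of the sketch is only the shape of the pipeline: rows are turned into column-oriented payloads once per partition, so the collecting side concatenates column chunks instead of converting row by row.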
create-release	[SPARK-20627][PYSPARK] Drop the hadoop distribution name from the Python version	2017-05-09 11:25:29 -07:00
deps	[SPARK-13534][PYSPARK] Using Apache Arrow to increase performance of DataFrame.toPandas	2017-06-23 09:01:13 +08:00
sparktestsupport	[SPARK-20974][BUILD] we should run REPL tests if SQL module has code changes	2017-06-02 21:59:52 -07:00
tests	[SPARK-10359] Enumerate dependencies in a file and diff against it for new pull requests	2015-12-30 12:47:42 -08:00
.gitignore	[SPARK-6219] Reuse pep8.py	2015-04-18 16:46:28 -07:00
.rat-excludes	[SPARK-20434][YARN][CORE] Move Hadoop delegation token code from yarn to core	2017-06-15 11:46:00 -07:00
appveyor-guide.md	[SPARK-17200][PROJECT INFRA][BUILD][SPARKR] Automate building and testing on Windows (currently SparkR only)	2016-09-08 08:26:59 -07:00
appveyor-install-dependencies.ps1	[SPARK-19550][BUILD][CORE][WIP] Remove Java 7 support	2017-02-16 12:32:45 +00:00
change-scala-version.sh	[SPARK-9250] Make change-scala-version more helpful w.r.t. valid Scala versions	2015-07-24 17:09:33 +01:00
change-version-to-2.10.sh	[MINOR] Fix some typo of the document	2017-06-19 20:35:58 +01:00
change-version-to-2.11.sh	[MINOR] Fix some typo of the document	2017-06-19 20:35:58 +01:00
check-license	[SPARK-13596][BUILD] Move misc top-level build files into appropriate subdirs	2016-03-07 14:48:02 -08:00
checkstyle-suppressions.xml	[MINOR][BUILD] Fix lint-java breaks.	2017-05-10 13:56:34 +01:00
checkstyle.xml	[SPARK-18073][DOCS][WIP] Migrate wiki to spark.apache.org web site	2016-11-23 11:25:47 +00:00
github_jira_sync.py	[SPARK-19002][BUILD][PYTHON] Check pep8 against all Python scripts	2017-01-02 15:23:19 +00:00
lint-java	[SPARK-16967] move mesos to module	2016-08-26 12:25:22 -07:00
lint-python	[MINOR][PYTHON] Ignore pep8 on test scripts generated in tests in work directory	2017-06-02 14:25:38 +01:00
lint-r	[SPARK-10328] [SPARKR] Fix generic for na.omit	2015-08-28 00:37:50 -07:00
lint-r.R	[SPARK-14074][SPARKR] Specify commit sha1 ID when using install_github to install intr package.	2016-03-23 07:57:03 -07:00
lint-scala	[SPARK-2627] [PySpark] have the build enforce PEP 8 automatically	2014-08-06 12:58:24 -07:00
make-distribution.sh	[SPARK-20123][BUILD] SPARK_HOME variable might have spaces in it(e.g. $SPARK…	2017-04-02 15:31:13 +01:00
merge_spark_pr.py	[SPARK-19002][BUILD][PYTHON] Check pep8 against all Python scripts	2017-01-02 15:23:19 +00:00
mima	[SPARK-19550][HOTFIX][BUILD] Use JAVA_HOME/bin/java if JAVA_HOME is set in dev/mima	2017-02-16 18:43:38 +00:00
pip-sanity-check.py	[SPARK-19064][PYSPARK] Fix pip installing of sub components	2017-01-25 14:43:39 -08:00
README.md	Merge pull request #565 from pwendell/dev-scripts. Closes #565 .	2014-02-08 23:13:34 -08:00
requirements.txt	[SPARK-19064][PYSPARK] Fix pip installing of sub components	2017-01-25 14:43:39 -08:00
run-pip-tests	[SPARK-13534][PYSPARK] Using Apache Arrow to increase performance of DataFrame.toPandas	2017-06-23 09:01:13 +08:00
run-tests	[SPARK-5161] Parallelize Python test execution	2015-06-29 21:32:40 -07:00
run-tests-jenkins	[SPARK-19955][PYSPARK] Jenkins Python Conda based test.	2017-03-29 11:41:17 -07:00
run-tests-jenkins.py	[SPARK-19464][BUILD][HOTFIX] run-tests should use hadoop2.6	2017-02-08 21:28:04 +00:00
run-tests.py	[SPARK-20974][BUILD] we should run REPL tests if SQL module has code changes	2017-06-02 21:59:52 -07:00
scalastyle	[SPARK-16967] move mesos to module	2016-08-26 12:25:22 -07:00
test-dependencies.sh	[SPARK-19550][BUILD][CORE][WIP] Remove Java 7 support	2017-02-16 12:32:45 +00:00
tox.ini	[MINOR][PYTHON] Ignore pep8 on test scripts generated in tests in work directory	2017-06-02 14:25:38 +01:00

README.md

Spark Developer Scripts

This directory contains scripts useful to developers when packaging, testing, or committing to Spark.

Many of these scripts require Apache credentials to work correctly.