spark-instrumented-optimizer/python/pyspark/sql
Bryan Cutler 43d9c7e7e5 [SPARK-30640][PYTHON][SQL] Prevent unnecessary copies of data during Arrow to Pandas conversion
### What changes were proposed in this pull request?

Prevent unnecessary copies of data during conversion from Arrow to Pandas.

### Why are the changes needed?

During conversion of pyarrow data to Pandas, columns are checked for timestamp types and then modified to correct for local timezone. If the data contains no timestamp types, then unnecessary copies of the data can be made. This is most prevalent when checking columns of a pandas DataFrame where each series is assigned back to the DataFrame, regardless if it had timestamps. See https://www.mail-archive.com/devarrow.apache.org/msg17008.html and ARROW-7596 for discussion.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Existing tests

Closes #27358 from BryanCutler/pyspark-pandas-timestamp-copy-fix-SPARK-30640.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
2020-01-26 15:21:06 -08:00
..
avro [SPARK-27506][SQL][FOLLOWUP] Use option avroSchema to specify an evolved schema in from_avro 2019-12-30 18:14:21 +09:00
pandas [SPARK-30640][PYTHON][SQL] Prevent unnecessary copies of data during Arrow to Pandas conversion 2020-01-26 15:21:06 -08:00
tests [SPARK-30607][SQL][PYSPARK][SPARKR] Add overlay wrappers for SparkR and PySpark 2020-01-23 16:16:47 +09:00
__init__.py [SPARK-30434][PYTHON][SQL] Move pandas related functionalities into 'pandas' sub-package 2020-01-09 10:22:50 +09:00
catalog.py [SPARK-28980][CORE][SQL][STREAMING][MLLIB] Remove most items deprecated in Spark 2.2.0 or earlier, for Spark 3 2019-09-09 10:19:40 -05:00
column.py [SPARK-30188][SQL] Resolve the failed unit tests when enable AQE 2020-01-13 22:55:19 +08:00
conf.py [SPARK-23698][PYTHON] Resolve undefined names in Python 3 2018-08-22 10:06:59 -07:00
context.py [MINOR][PYSPARK][DOCS] Fix typo in example documentation 2019-11-01 11:55:29 -07:00
dataframe.py [SPARK-30539][PYTHON][SQL] Add DataFrame.tail in PySpark 2020-01-18 00:18:12 -08:00
functions.py Revert "[SPARK-25496][SQL] Deprecate from_utc_timestamp and to_utc_timestamp" 2020-01-25 21:34:12 -08:00
group.py [SPARK-30434][PYTHON][SQL] Move pandas related functionalities into 'pandas' sub-package 2020-01-09 10:22:50 +09:00
readwriter.py [SPARK-30128][DOCS][PYTHON][SQL] Document/promote 'recursiveFileLookup' and 'pathGlobFilter' in file sources 'mergeSchema' in ORC 2019-12-23 09:57:42 +09:00
session.py [SPARK-30434][FOLLOW-UP][PYTHON][SQL] Make the parameter list consistent in createDataFrame 2020-01-16 12:39:44 +09:00
streaming.py [SPARK-30128][DOCS][PYTHON][SQL] Document/promote 'recursiveFileLookup' and 'pathGlobFilter' in file sources 'mergeSchema' in ORC 2019-12-23 09:57:42 +09:00
types.py [SPARK-30252][SQL] Disallow negative scale of Decimal 2020-01-21 21:09:48 +08:00
udf.py [SPARK-28264][PYTHON][SQL] Support type hints in pandas UDF and rename/move inconsistent pandas UDF types 2020-01-22 15:32:58 +09:00
utils.py [SPARK-30434][PYTHON][SQL] Move pandas related functionalities into 'pandas' sub-package 2020-01-09 10:22:50 +09:00
window.py [SPARK-30188][SQL] Resolve the failed unit tests when enable AQE 2020-01-13 22:55:19 +08:00