spark-instrumented-optimizer/python/pyspark
Bryan Cutler 43d9c7e7e5 [SPARK-30640][PYTHON][SQL] Prevent unnecessary copies of data during Arrow to Pandas conversion
### What changes were proposed in this pull request?

Prevent unnecessary copies of data during conversion from Arrow to Pandas.

### Why are the changes needed?

During conversion of pyarrow data to Pandas, columns are checked for timestamp types and then modified to correct for local timezone. If the data contains no timestamp types, then unnecessary copies of the data can be made. This is most prevalent when checking columns of a pandas DataFrame where each series is assigned back to the DataFrame, regardless if it had timestamps. See https://www.mail-archive.com/devarrow.apache.org/msg17008.html and ARROW-7596 for discussion.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Existing tests

Closes #27358 from BryanCutler/pyspark-pandas-timestamp-copy-fix-SPARK-30640.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
2020-01-26 15:21:06 -08:00
..
ml [SPARK-30543][ML][PYSPARK][R] RandomForest add Param bootstrap to control sampling method 2020-01-23 16:44:13 +08:00
mllib [SPARK-22340][PYTHON] Add a mode to pin Python thread into JVM's 2019-11-08 06:44:58 +09:00
sql [SPARK-30640][PYTHON][SQL] Prevent unnecessary copies of data during Arrow to Pandas conversion 2020-01-26 15:21:06 -08:00
streaming [SPARK-28980][CORE][SQL][STREAMING][MLLIB] Remove most items deprecated in Spark 2.2.0 or earlier, for Spark 3 2019-09-09 10:19:40 -05:00
testing [SPARK-30434][PYTHON][SQL] Move pandas related functionalities into 'pandas' sub-package 2020-01-09 10:22:50 +09:00
tests [SPARK-30480][PYTHON][TESTS] Increases the memory limit being tested in 'WorkerMemoryTest.test_memory_limit' 2020-01-13 18:47:15 +09:00
__init__.py [SPARK-28980][CORE][SQL][STREAMING][MLLIB] Remove most items deprecated in Spark 2.2.0 or earlier, for Spark 3 2019-09-09 10:19:40 -05:00
_globals.py [SPARK-23328][PYTHON] Disallow default value None in na.replace/replace when 'to_replace' is not a dictionary 2018-02-09 14:21:10 +08:00
accumulators.py [SPARK-28206][PYTHON] Remove the legacy Epydoc in PySpark API documentation 2019-07-05 10:08:22 -07:00
broadcast.py [SPARK-29341][PYTHON] Upgrade cloudpickle to 1.0.0 2019-10-03 19:20:51 +09:00
cloudpickle.py [SPARK-29536][PYTHON] Upgrade cloudpickle to 1.1.1 to support Python 3.8 2019-10-22 16:18:34 +09:00
conf.py [SPARK-28206][PYTHON] Remove the legacy Epydoc in PySpark API documentation 2019-07-05 10:08:22 -07:00
context.py [SPARK-22340][PYTHON][FOLLOW-UP] Add a better message and improve documentation for pinned thread mode 2019-11-21 10:54:01 +09:00
daemon.py [SPARK-26175][PYTHON] Redirect the standard input of the forked child to devnull in daemon 2019-07-31 09:10:24 +09:00
files.py [SPARK-28206][PYTHON] Remove the legacy Epydoc in PySpark API documentation 2019-07-05 10:08:22 -07:00
find_spark_home.py Fix typos detected by github.com/client9/misspell 2018-08-11 21:23:36 -05:00
heapq3.py Fix typos detected by github.com/client9/misspell 2018-08-11 21:23:36 -05:00
java_gateway.py [SPARK-22340][PYTHON] Add a mode to pin Python thread into JVM's 2019-11-08 06:44:58 +09:00
join.py [SPARK-14202] [PYTHON] Use generator expression instead of list comp in python_full_outer_jo… 2016-03-28 14:51:36 -07:00
profiler.py [SPARK-26640][CORE][ML][SQL][STREAMING][PYSPARK] Code cleanup from lgtm.com analysis 2019-01-17 19:40:39 -06:00
rdd.py [SPARK-29499][CORE][PYSPARK] Add mapPartitionsWithIndex for RDDBarrier 2019-10-23 13:46:09 +02:00
rddsampler.py [SPARK-4897] [PySpark] Python 3 support 2015-04-16 16:20:57 -07:00
resourceinformation.py [SPARK-28234][CORE][PYTHON] Add python and JavaSparkContext support to get resources 2019-07-11 09:32:58 +09:00
resultiterable.py [SPARK-30205][PYSPARK] Import ABCs from collections.abc to remove deprecation warnings 2019-12-10 11:08:13 -08:00
serializers.py [SPARK-30434][PYTHON][SQL] Move pandas related functionalities into 'pandas' sub-package 2020-01-09 10:22:50 +09:00
shell.py [SPARK-25238][PYTHON] lint-python: Fix W605 warnings for pycodestyle 2.4 2018-09-13 11:19:43 +08:00
shuffle.py [SPARK-25696] The storage memory displayed on spark Application UI is… 2018-12-10 18:27:01 -06:00
statcounter.py [SPARK-6919] [PYSPARK] Add asDict method to StatCounter 2015-09-29 13:38:15 -07:00
status.py [SPARK-4172] [PySpark] Progress API in Python 2015-02-17 13:36:43 -08:00
storagelevel.py [SPARK-25908][CORE][SQL] Remove old deprecated items in Spark 3 2018-11-07 22:48:50 -06:00
taskcontext.py [SPARK-29582][PYSPARK] Support TaskContext.get() in a barrier task from Python side 2019-10-31 13:10:44 +09:00
traceback_utils.py [SPARK-1087] Move python traceback utilities into new traceback_utils.py file. 2014-09-15 19:28:17 -07:00
util.py [SPARK-22340][PYTHON][FOLLOW-UP] Add a better message and improve documentation for pinned thread mode 2019-11-21 10:54:01 +09:00
version.py [INFRA] Reverts commit 56dcd79 and c216ef1 2019-12-16 19:57:44 -07:00
worker.py [SPARK-30434][PYTHON][SQL] Move pandas related functionalities into 'pandas' sub-package 2020-01-09 10:22:50 +09:00