spark-instrumented-optimizer

HyukjinKwon 1217996f15 [SPARK-27995][PYTHON] Note the difference between str of Python 2 and 3 at Arrow optimized	2019-06-11 18:43:59 +09:00

## What changes were proposed in this pull request?

When Arrow optimization is enabled in Python 2.7,

```python
import pandas
pdf = pandas.DataFrame(["test1", "test2"])
df = spark.createDataFrame(pdf)
df.show()
```

I got the following output:

```
+----------------+
|               0|
+----------------+
|[74 65 73 74 31]|
|[74 65 73 74 32]|
+----------------+
```

This appears to be because Python 2's `str` and `bytes` are the same type. It does look right:

```python
>>> str == bytes
True
>>> isinstance("a", bytes)
True
```

To cut it short:

1. Python 2 treats `str` as `bytes`.
2. PySpark added some special code and hacks to recognize `str` as a string type.
3. PyArrow / Pandas followed the Python 2 behaviour.

To fix, we have two options:

1. Fix it to match PySpark's behaviour.
2. Note the difference, since Python 2 is deprecated anyway.

I think it's better to just note it and go for option 2.

## How was this patch tested?

Manually tested. Doc was checked too:

![Screen Shot 2019-06-11 at 6 40 07 PM](https://user-images.githubusercontent.com/6477701/59261402-59ad3b00-8c78-11e9-94a6-3236a2c338d4.png)

Closes #24838 from HyukjinKwon/SPARK-27995.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
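
The bracketed hex values in the commit message above are just the bytes of the original strings. The following is a minimal Python 3 sketch of that relationship, not PySpark code: `binary_cell_repr` is a hypothetical helper that only imitates the bracketed, space-separated hex layout shown in the commit message, and the UTF-8 encoding is an assumption made for illustration.

```python
def binary_cell_repr(value: bytes) -> str:
    """Hypothetical helper: format bytes as space-separated hex inside brackets,
    imitating the binary cells shown in the commit message."""
    return "[" + " ".join(format(b, "02X") for b in value) + "]"


# In Python 3, str and bytes are distinct types, unlike Python 2.
assert str is not bytes
assert not isinstance("a", bytes)

# Encoding the sample strings (UTF-8 assumed) gives the bytes that were
# displayed instead of the text.
rendered = [binary_cell_repr(s.encode("utf-8")) for s in ["test1", "test2"]]
print(rendered)
```

Decoding works the other way round: `bytes.fromhex("74 65 73 74 31").decode("utf-8")` recovers the original text from one of those cells.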
..
ml	[SPARK-18570][ML][R] RFormula support * and ^ operators	2019-06-04 08:59:30 -05:00
mllib	[SPARK-27540][MLLIB] Add 'meanAveragePrecision_at_k' metric to RankingMetrics	2019-05-09 08:47:05 -05:00
sql	[SPARK-27995][PYTHON] Note the difference between str of Python 2 and 3 at Arrow optimized	2019-06-11 18:43:59 +09:00
streaming	[SPARK-26856][PYSPARK] Python support for from_avro and to_avro APIs	2019-03-11 10:15:07 +09:00
testing	Revert "[SPARK-27870][SQL][PYSPARK] Flush batch timely for pandas UDF (for improving pandas UDFs pipeline)"	2019-06-09 08:28:31 -07:00
tests	Revert "[SPARK-27870][SQL][PYSPARK] Flush batch timely for pandas UDF (for improving pandas UDFs pipeline)"	2019-06-09 08:28:31 -07:00
__init__.py	[SPARK-25248][.1][PYSPARK] update barrier Python API	2018-08-29 07:22:03 -07:00
_globals.py	[SPARK-23328][PYTHON] Disallow default value None in na.replace/replace when 'to_replace' is not a dictionary	2018-02-09 14:21:10 +08:00
accumulators.py	[SPARK-25591][PYSPARK][SQL] Avoid overwriting deserialized accumulator	2018-10-08 15:18:08 +08:00
broadcast.py	[SPARK-18161][PYTHON] Update cloudpickle to v0.6.1	2019-02-02 10:49:45 +08:00
cloudpickle.py	[SPARK-27000][PYTHON] Upgrades cloudpickle to v0.8.0	2019-02-28 02:33:10 +09:00
conf.py	[SPARK-23522][PYTHON] always use sys.exit over builtin exit	2018-03-08 20:38:34 +09:00
context.py	[SPARK-27887][PYTHON] Add deprecation warning for Python 2	2019-06-04 15:36:52 +09:00
daemon.py	[PYSPARK] Update py4j to version 0.10.7.	2018-05-09 10:47:35 -07:00
files.py	[SPARK-3309] [PySpark] Put all public API in __all__	2014-09-03 11:49:45 -07:00
find_spark_home.py	Fix typos detected by github.com/client9/misspell	2018-08-11 21:23:36 -05:00
heapq3.py	Fix typos detected by github.com/client9/misspell	2018-08-11 21:23:36 -05:00
java_gateway.py	[SPARK-21094][PYTHON] Add popen_kwargs to launch_gateway	2019-02-15 18:08:06 -08:00
join.py	[SPARK-14202] [PYTHON] Use generator expression instead of list comp in python_full_outer_jo…	2016-03-28 14:51:36 -07:00
profiler.py	[SPARK-26640][CORE][ML][SQL][STREAMING][PYSPARK] Code cleanup from lgtm.com analysis	2019-01-17 19:40:39 -06:00
rdd.py	[SPARK-23961][SPARK-27548][PYTHON] Fix error when toLocalIterator goes out of scope and properly raise errors from worker	2019-05-07 14:47:39 -07:00
rddsampler.py	[SPARK-4897] [PySpark] Python 3 support	2015-04-16 16:20:57 -07:00
resultiterable.py	[SPARK-3074] [PySpark] support groupByKey() with single huge key	2015-04-09 17:07:23 -07:00
serializers.py	Revert "[SPARK-27870][SQL][PYSPARK] Flush batch timely for pandas UDF (for improving pandas UDFs pipeline)"	2019-06-09 08:28:31 -07:00
shell.py	[SPARK-25238][PYTHON] lint-python: Fix W605 warnings for pycodestyle 2.4	2018-09-13 11:19:43 +08:00
shuffle.py	[SPARK-25696] The storage memory displayed on spark Application UI is…	2018-12-10 18:27:01 -06:00
statcounter.py	[SPARK-6919] [PYSPARK] Add asDict method to StatCounter	2015-09-29 13:38:15 -07:00
status.py	[SPARK-4172] [PySpark] Progress API in Python	2015-02-17 13:36:43 -08:00
storagelevel.py	[SPARK-25908][CORE][SQL] Remove old deprecated items in Spark 3	2018-11-07 22:48:50 -06:00
taskcontext.py	[SPARK-26640][CORE][ML][SQL][STREAMING][PYSPARK] Code cleanup from lgtm.com analysis	2019-01-17 19:40:39 -06:00
traceback_utils.py	[SPARK-1087] Move python traceback utilities into new traceback_utils.py file.	2014-09-15 19:28:17 -07:00
util.py	[SPARK-26856][PYSPARK] Python support for from_avro and to_avro APIs	2019-03-11 10:15:07 +09:00
version.py	[SPARK-25592] Setting version to 3.0.0-SNAPSHOT	2018-10-02 08:48:24 -07:00
worker.py	[SPARK-27240][PYTHON] Use pandas DataFrame for struct type argument in Scalar Pandas UDF.	2019-03-25 11:26:09 -07:00