spark-instrumented-optimizer/python/pyspark/sql
Peter Toth ab8a9a0ceb [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite
### What changes were proposed in this pull request?

Pyrolite 4.21 introduced value comparison during object memoization and serialization and enabled it by default (`valueCompare=true`): https://github.com/irmen/Pyrolite/blob/pyrolite-4.21/java/src/main/java/net/razorvine/pickle/Pickler.java#L112-L122
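For context, this behavior is controlled when the pickler is constructed. A minimal sketch of creating picklers with and without value comparison, assuming the two-argument `Pickler(useMemo, valueCompare)` constructor that 4.21 added (not necessarily the exact change made in this PR):
```
import net.razorvine.pickle.Pickler

// Pyrolite 4.21+ default: useMemo = true, valueCompare = true, so two distinct
// objects that are equals() get collapsed into a single memo reference.
val valueComparingPickler = new Pickler()

// Keep memoization but compare memoized objects by reference instead of by
// value, so equal-but-differently-typed rows are serialized separately.
val referenceComparingPickler = new Pickler(/* useMemo = */ true, /* valueCompare = */ false)
```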
Pyrolite's new default has an undesired effect when we serialize a row (actually a `GenericRowWithSchema`) to be passed to Python: https://github.com/apache/spark/blob/branch-3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala#L60. A simple example is that
```
new GenericRowWithSchema(Array(1.0, 1.0), StructType(Seq(StructField("_1", DoubleType), StructField("_2", DoubleType))))
```
and
```
new GenericRowWithSchema(Array(1, 1), StructType(Seq(StructField("_1", IntegerType), StructField("_2", IntegerType))))
```
are currently considered equal, so during serialization the second instance is not written out but replaced with a short memo reference to the first one.
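The two rows compare as equal because `Row.equals` compares only the field values (not the schema), and Scala's boxed numeric equality treats `1.0` and `1` as equal. A quick illustration in `spark-shell` (a sketch of the equality semantics, not code from this PR):
```
import org.apache.spark.sql.Row

(1.0: Any) == (1: Any)      // true: boxed numerics are compared by numeric value
Row(1.0, 1.0) == Row(1, 1)  // true: Row.equals compares values only, the schema is ignored
```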

### Why are the changes needed?
The above can cause nasty issues like the one described in https://issues.apache.org/jira/browse/SPARK-34545:

```
>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import *
>>>
>>> def udf1(data_type):
...     def u1(e):
...         return e[0]
...     return udf(u1, data_type)
...
>>>
>>> df = spark.createDataFrame([((1.0, 1.0), (1, 1))], ['c1', 'c2'])
>>>
>>> df = df.withColumn("c3", udf1(DoubleType())("c1"))
>>> df = df.withColumn("c4", udf1(IntegerType())("c2"))
>>>
>>> df.select("c3").show()
+---+
| c3|
+---+
|1.0|
+---+

>>> df.select("c4").show()
+---+
| c4|
+---+
|  1|
+---+

>>> df.select("c3", "c4").show()
+---+----+
| c3|  c4|
+---+----+
|1.0|null|
+---+----+
```
This is because, during serialization from the JVM to Python, `GenericRowWithSchema(1.0, 1.0)` (`c1`) is memoized first, and when `GenericRowWithSchema(1, 1)` (`c2`) comes next it is replaced with a short memo reference to `c1` (instead of being serialized out), since the two rows are `equals()`. The Python function then runs, but the return type of `c4` is expected to be `IntegerType`, and if a different type (`DoubleType`) comes back from Python it is discarded: https://github.com/apache/spark/blob/branch-3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala#L108-L113
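The discard-on-type-mismatch part can be reproduced in isolation, independently of the memoization bug. A minimal sketch in the same `pyspark` shell as above, using a hypothetical `wrong_type` UDF that is declared `IntegerType` but returns a Python `float`:
```
>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import IntegerType
>>>
>>> # Declared return type is IntegerType, but the UDF returns a float, so the
>>> # value fails EvaluatePython's type check and the column comes back as null.
>>> wrong_type = udf(lambda x: float(x), IntegerType())
>>> spark.createDataFrame([(1,)], ["a"]).select(wrong_type("a").alias("b")).show()
+----+
|   b|
+----+
|null|
+----+
```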

After this PR:
```
>>> df.select("c3", "c4").show()
+---+---+
| c3| c4|
+---+---+
|1.0|  1|
+---+---+
```

### Does this PR introduce _any_ user-facing change?
Yes, fixes a correctness issue.

### How was this patch tested?
Added new UT + manual tests.

Closes #31682 from peter-toth/SPARK-34545-fix-row-comparison.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-03-07 19:12:42 -06:00
avro [SPARK-34300][PYSPARK][DOCS][MINOR] Fix some typos and syntax issues in docstrings and output of dev/lint-python 2021-02-02 09:30:50 +09:00
pandas [SPARK-32953][PYTHON][SQL] Add Arrow self_destruct support to toPandas 2021-02-10 09:58:46 -08:00
tests [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite 2021-03-07 19:12:42 -06:00
__init__.py [SPARK-32138] Drop Python 2.7, 3.4 and 3.5 2020-07-14 11:22:44 +09:00
__init__.pyi [SPARK-32714][PYTHON] Initial pyspark-stubs port 2020-09-24 14:15:36 +09:00
_typing.pyi [SPARK-32714][PYTHON] Initial pyspark-stubs port 2020-09-24 14:15:36 +09:00
catalog.py [SPARK-33730][PYTHON] Standardize warning types 2021-01-18 09:32:55 +09:00
catalog.pyi [SPARK-32714][PYTHON] Initial pyspark-stubs port 2020-09-24 14:15:36 +09:00
column.py [SPARK-33730][PYTHON] Standardize warning types 2021-01-18 09:32:55 +09:00
column.pyi [SPARK-33457][PYTHON] Adjust mypy configuration 2020-11-25 09:27:04 +09:00
conf.py [SPARK-32138] Drop Python 2.7, 3.4 and 3.5 2020-07-14 11:22:44 +09:00
conf.pyi [SPARK-32714][PYTHON] Initial pyspark-stubs port 2020-09-24 14:15:36 +09:00
context.py [SPARK-34157][SQL] Unify output of SHOW TABLES and pass output attributes properly 2021-02-08 08:39:58 +00:00
context.pyi [SPARK-33457][PYTHON] Adjust mypy configuration 2020-11-25 09:27:04 +09:00
dataframe.py [PYTHON][MINOR] Fix docstring of DataFrame.join 2021-02-06 09:08:49 -06:00
dataframe.pyi [SPARK-33250][PYTHON][DOCS] Migration to NumPy documentation style in SQL (pyspark.sql.*) 2020-11-03 10:00:49 +09:00
functions.py [SPARK-33678][SQL] Product aggregation function 2021-03-02 16:51:07 +09:00
functions.pyi [SPARK-33678][SQL] Product aggregation function 2021-03-02 16:51:07 +09:00
group.py [SPARK-33250][PYTHON][DOCS] Migration to NumPy documentation style in SQL (pyspark.sql.*) 2020-11-03 10:00:49 +09:00
group.pyi [SPARK-32714][PYTHON] Initial pyspark-stubs port 2020-09-24 14:15:36 +09:00
readwriter.py [SPARK-34451][SQL] Add alternatives for datetime rebasing SQL configs and deprecate legacy configs 2021-02-17 14:04:47 +00:00
readwriter.pyi [SPARK-33566][CORE][SQL][SS][PYTHON] Make unescapedQuoteHandling option configurable when read CSV 2020-11-27 15:47:39 +09:00
session.py [SPARK-33434][PYTHON][DOCS] Added RuntimeConfig to PySpark docs 2021-02-13 09:32:55 -06:00
session.pyi [SPARK-33457][PYTHON] Adjust mypy configuration 2020-11-25 09:27:04 +09:00
streaming.py [SPARK-34451][SQL] Add alternatives for datetime rebasing SQL configs and deprecate legacy configs 2021-02-17 14:04:47 +00:00
streaming.pyi [SPARK-33836][SS][PYTHON] Expose DataStreamReader.table and DataStreamWriter.toTable 2020-12-21 19:42:59 +09:00
types.py [SPARK-33250][PYTHON][DOCS] Migration to NumPy documentation style in SQL (pyspark.sql.*) 2020-11-03 10:00:49 +09:00
types.pyi [SPARK-33457][PYTHON] Adjust mypy configuration 2020-11-25 09:27:04 +09:00
udf.py [SPARK-34408][PYTHON] Refactor spark.udf.register to share the same path to generate UDF instance 2021-02-11 10:57:02 +09:00
udf.pyi [SPARK-33457][PYTHON] Adjust mypy configuration 2020-11-25 09:27:04 +09:00
utils.py Spelling r common dev mlib external project streaming resource managers python 2020-11-27 10:22:45 -06:00
window.py [SPARK-33250][PYTHON][DOCS] Migration to NumPy documentation style in SQL (pyspark.sql.*) 2020-11-03 10:00:49 +09:00
window.pyi [SPARK-33250][PYTHON][DOCS] Migration to NumPy documentation style in SQL (pyspark.sql.*) 2020-11-03 10:00:49 +09:00