spark-instrumented-optimizer/python/pyspark/sql/tests
Peter Toth ab8a9a0ceb [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite
### What changes were proposed in this pull request?

Pyrolite 4.21 introduced value comparison (`valueCompare=true`) and enabled it by default during object memoization and serialization: https://github.com/irmen/Pyrolite/blob/pyrolite-4.21/java/src/main/java/net/razorvine/pickle/Pickler.java#L112-L122
This change has an undesired effect when we serialize a row (actually a `GenericRowWithSchema`) to be passed to Python: https://github.com/apache/spark/blob/branch-3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala#L60. A simple example is that
```
new GenericRowWithSchema(Array(1.0, 1.0), StructType(Seq(StructField("_1", DoubleType), StructField("_2", DoubleType))))
```
and
```
new GenericRowWithSchema(Array(1, 1), StructType(Seq(StructField("_1", IntegerType), StructField("_2", IntegerType))))
```
are currently considered equal, so during serialization the second instance is replaced with a short memo reference to the first one instead of being written out.
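
For reference, a minimal sketch (plain Scala, no Spark required) of why these two rows compare as equal: Scala's cooperative numeric equality treats a boxed `Double` 1.0 and a boxed `Integer` 1 as equal, and row equality compares the rows field by field, not by schema.
```scala
// Minimal sketch: field-wise comparison of the two value arrays, analogous
// to how GenericRowWithSchema equality compares rows element by element.
object RowEqualitySketch extends App {
  val doubles: Array[Any] = Array(1.0, 1.0) // values of the DoubleType row
  val ints: Array[Any]    = Array(1, 1)     // values of the IntegerType row

  // Scala's `==` on boxed numbers uses cooperative equality, so 1.0 == 1.
  val fieldWiseEqual = doubles.zip(ints).forall { case (a, b) => a == b }
  println(fieldWiseEqual) // true -- the schema plays no part in the comparison
}
```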

### Why are the changes needed?
The above can cause nasty issues like the one described in https://issues.apache.org/jira/browse/SPARK-34545:

```
>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import *
>>>
>>> def udf1(data_type):
...     def u1(e):
...         return e[0]
...     return udf(u1, data_type)
>>>
>>> df = spark.createDataFrame([((1.0, 1.0), (1, 1))], ['c1', 'c2'])
>>>
>>> df = df.withColumn("c3", udf1(DoubleType())("c1"))
>>> df = df.withColumn("c4", udf1(IntegerType())("c2"))
>>>
>>> df.select("c3").show()
+---+
| c3|
+---+
|1.0|
+---+

>>> df.select("c4").show()
+---+
| c4|
+---+
|  1|
+---+

>>> df.select("c3", "c4").show()
+---+----+
| c3|  c4|
+---+----+
|1.0|null|
+---+----+
```
This is because, during serialization from the JVM to Python, `GenericRowWithSchema(1.0, 1.0)` (`c1`) is memoized first, and when `GenericRowWithSchema(1, 1)` (`c2`) comes next, it is replaced with a short memo reference to `c1` (instead of serializing `c2` out) because the two rows are `equal()`. The Python function then runs, but the return type of `c4` is expected to be `IntegerType`, and if a value of a different type (`DoubleType`) comes back from Python it is discarded: https://github.com/apache/spark/blob/branch-3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala#L108-L113
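
A hedged sketch (not the actual `EvaluatePython` code) of the kind of type-guarded conversion referenced above: a value returned from Python is only kept if it matches the expected Catalyst type, otherwise it is dropped and surfaces as null.
```scala
// Hypothetical illustration of a type-guarded converter for an IntegerType
// column: anything that is not an Int is discarded (becomes null).
object TypeGuardSketch extends App {
  def toCatalystInt(fromPython: Any): Any = fromPython match {
    case i: Int => i    // expected IntegerType value: keep it
    case _      => null // e.g. a Double coming back is silently dropped
  }

  println(toCatalystInt(1))   // 1
  println(toCatalystInt(1.0)) // null -- the null observed in column c4
}
```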

After this PR:
```
>>> df.select("c3", "c4").show()
+---+---+
| c3| c4|
+---+---+
|1.0|  1|
+---+---+
```

### Does this PR introduce _any_ user-facing change?
Yes, fixes a correctness issue.

### How was this patch tested?
Added a new unit test, plus manual tests.

Closes #31682 from peter-toth/SPARK-34545-fix-row-comparison.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-03-07 19:12:42 -06:00
__init__.py [SPARK-26032][PYTHON] Break large sql/tests.py files into smaller files 2018-11-14 14:51:11 +08:00
test_arrow.py [SPARK-32953][PYTHON][SQL] Add Arrow self_destruct support to toPandas 2021-02-10 09:58:46 -08:00
test_catalog.py [SPARK-33613][PYTHON][TESTS] Replace deprecated APIs in pyspark tests 2020-12-01 10:34:40 +09:00
test_column.py [SPARK-34306][SQL][PYTHON][R] Use Snake naming rule across the function APIs 2021-02-02 09:29:40 +09:00
test_conf.py [SPARK-33613][PYTHON][TESTS] Replace deprecated APIs in pyspark tests 2020-12-01 10:34:40 +09:00
test_context.py [SPARK-33021][PYTHON][TESTS] Move functions related test cases into test_functions.py 2020-09-28 21:54:00 -07:00
test_dataframe.py [SPARK-33613][PYTHON][TESTS] Replace deprecated APIs in pyspark tests 2020-12-01 10:34:40 +09:00
test_datasources.py [SPARK-33613][PYTHON][TESTS] Replace deprecated APIs in pyspark tests 2020-12-01 10:34:40 +09:00
test_functions.py [SPARK-34306][SQL][PYTHON][R] Use Snake naming rule across the function APIs 2021-02-02 09:29:40 +09:00
test_group.py [SPARK-34306][SQL][PYTHON][R] Use Snake naming rule across the function APIs 2021-02-02 09:29:40 +09:00
test_pandas_cogrouped_map.py [SPARK-34319][SQL] Resolve duplicate attributes for FlatMapCoGroupsInPandas/MapInPandas 2021-02-02 16:25:32 +09:00
test_pandas_grouped_map.py [SPARK-33489][PYSPARK] Add NullType support for Arrow executions 2021-01-25 11:34:47 +09:00
test_pandas_map.py [SPARK-34319][SQL] Resolve duplicate attributes for FlatMapCoGroupsInPandas/MapInPandas 2021-02-02 16:25:32 +09:00
test_pandas_udf.py [SPARK-33613][PYTHON][TESTS] Replace deprecated APIs in pyspark tests 2020-12-01 10:34:40 +09:00
test_pandas_udf_grouped_agg.py [SPARK-34610][PYTHON][TEST] Fix Python UDF used in GroupedAggPandasUDFTests 2021-03-04 10:03:54 +09:00
test_pandas_udf_scalar.py [SPARK-33613][PYTHON][TESTS] Replace deprecated APIs in pyspark tests 2020-12-01 10:34:40 +09:00
test_pandas_udf_typehints.py [SPARK-33613][PYTHON][TESTS] Replace deprecated APIs in pyspark tests 2020-12-01 10:34:40 +09:00
test_pandas_udf_window.py [SPARK-34610][PYTHON][TEST] Fix Python UDF used in GroupedAggPandasUDFTests 2021-03-04 10:03:54 +09:00
test_readwriter.py [SPARK-32714][PYTHON] Initial pyspark-stubs port 2020-09-24 14:15:36 +09:00
test_serde.py [SPARK-32714][PYTHON] Initial pyspark-stubs port 2020-09-24 14:15:36 +09:00
test_session.py [SPARK-32714][PYTHON] Initial pyspark-stubs port 2020-09-24 14:15:36 +09:00
test_streaming.py [SPARK-33836][SS][PYTHON][FOLLOW-UP] Use test utils and clean up doctests in table and toTable 2020-12-22 06:27:27 +09:00
test_types.py [SPARK-33613][PYTHON][TESTS] Replace deprecated APIs in pyspark tests 2020-12-01 10:34:40 +09:00
test_udf.py [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite 2021-03-07 19:12:42 -06:00
test_utils.py [SPARK-33613][PYTHON][TESTS] Replace deprecated APIs in pyspark tests 2020-12-01 10:34:40 +09:00