spark-instrumented-optimizer

History

yi.wu e9362c2571 [SPARK-34319][SQL] Resolve duplicate attributes for FlatMapCoGroupsInPandas/MapInPandas	2021-02-02 16:25:32 +09:00

### What changes were proposed in this pull request?

Resolve duplicate attributes for `FlatMapCoGroupsInPandas`.

### Why are the changes needed?

When performing a self-join on top of `FlatMapCoGroupsInPandas`, analysis can fail because of conflicting attributes. For example,

```python
df = spark.createDataFrame([(1, 1)], ("column", "value"))
row = df.groupby("ColUmn").cogroup(
    df.groupby("COLUMN")
).applyInPandas(lambda r, l: r + l, "column long, value long")
row.join(row).show()
```

error:

```
...
Conflicting attributes: column#163321L,value#163322L
;;
'Join Inner
:- FlatMapCoGroupsInPandas [ColUmn#163312L], [COLUMN#163312L], <lambda>(column#163312L, value#163313L, column#163312L, value#163313L), [column#163321L, value#163322L]
:  :- Project [ColUmn#163312L, column#163312L, value#163313L]
:  :  +- LogicalRDD [column#163312L, value#163313L], false
:  +- Project [COLUMN#163312L, column#163312L, value#163313L]
:     +- LogicalRDD [column#163312L, value#163313L], false
+- FlatMapCoGroupsInPandas [ColUmn#163312L], [COLUMN#163312L], <lambda>(column#163312L, value#163313L, column#163312L, value#163313L), [column#163321L, value#163322L]
   :- Project [ColUmn#163312L, column#163312L, value#163313L]
   :  +- LogicalRDD [column#163312L, value#163313L], false
   +- Project [COLUMN#163312L, column#163312L, value#163313L]
      +- LogicalRDD [column#163312L, value#163313L], false
...
```

### Does this PR introduce _any_ user-facing change?

Yes, queries like the example above no longer fail.

### How was this patch tested?

Added unit tests.

Closes #31429 from Ngone51/fix-conflcting-attrs-of-FlatMapCoGroupsInPandas.

Lead-authored-by: yi.wu <yi.wu@databricks.com>
Co-authored-by: wuyi <yi.wu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
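
The "Conflicting attributes" error in the commit message above arises because both sides of the self-join expose output attributes with the same expression IDs. A toy model in plain Python may make that concrete. It is only a sketch: `Attr`, `fresh`, `conflicting`, and `dedup_right` are hypothetical names invented for illustration, not Spark APIs, and the approach shown (giving the right side fresh IDs) is an assumption about the general idea, not a copy of the patch.

```python
from dataclasses import dataclass
from itertools import count

# Hypothetical stand-ins for analyzer concepts; none of these are Spark APIs.
_next_id = count(1)


@dataclass(frozen=True)
class Attr:
    name: str
    expr_id: int


def fresh(name):
    """Create an attribute with a new, unused expression ID."""
    return Attr(name, next(_next_id))


def conflicting(left, right):
    """Return left-side attributes whose expr_id also appears on the right."""
    right_ids = {a.expr_id for a in right}
    return [a for a in left if a.expr_id in right_ids]


def dedup_right(left, right):
    """Give every conflicting right-side attribute a fresh expr_id."""
    left_ids = {a.expr_id for a in left}
    return [fresh(a.name) if a.expr_id in left_ids else a for a in right]


# A self-join reuses the same output attributes on both sides.
output = [fresh("column"), fresh("value")]
left, right = output, list(output)
assert conflicting(left, right) == output  # both attributes clash

right = dedup_right(left, right)
assert conflicting(left, right) == []  # clash resolved
assert [a.name for a in right] == ["column", "value"]  # names unchanged
```

In this model the join becomes analyzable once the two sides no longer share IDs, while the column names seen by the user stay the same.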
..
avro	[SPARK-34300][PYSPARK][DOCS][MINOR] Fix some typos and syntax issues in docstrings and output of `dev/lint-python`	2021-02-02 09:30:50 +09:00
pandas	[SPARK-33489][PYSPARK] Add NullType support for Arrow executions	2021-01-25 11:34:47 +09:00
tests	[SPARK-34319][SQL] Resolve duplicate attributes for FlatMapCoGroupsInPandas/MapInPandas	2021-02-02 16:25:32 +09:00
__init__.py	[SPARK-32138] Drop Python 2.7, 3.4 and 3.5	2020-07-14 11:22:44 +09:00
__init__.pyi	[SPARK-32714][PYTHON] Initial pyspark-stubs port	2020-09-24 14:15:36 +09:00
_typing.pyi	[SPARK-32714][PYTHON] Initial pyspark-stubs port	2020-09-24 14:15:36 +09:00
catalog.py	[SPARK-33730][PYTHON] Standardize warning types	2021-01-18 09:32:55 +09:00
catalog.pyi	[SPARK-32714][PYTHON] Initial pyspark-stubs port	2020-09-24 14:15:36 +09:00
column.py	[SPARK-33730][PYTHON] Standardize warning types	2021-01-18 09:32:55 +09:00
column.pyi	[SPARK-33457][PYTHON] Adjust mypy configuration	2020-11-25 09:27:04 +09:00
conf.py	[SPARK-32138] Drop Python 2.7, 3.4 and 3.5	2020-07-14 11:22:44 +09:00
conf.pyi	[SPARK-32714][PYTHON] Initial pyspark-stubs port	2020-09-24 14:15:36 +09:00
context.py	[SPARK-33730][PYTHON] Standardize warning types	2021-01-18 09:32:55 +09:00
context.pyi	[SPARK-33457][PYTHON] Adjust mypy configuration	2020-11-25 09:27:04 +09:00
dataframe.py	[SPARK-33730][PYTHON] Standardize warning types	2021-01-18 09:32:55 +09:00
dataframe.pyi	[SPARK-33250][PYTHON][DOCS] Migration to NumPy documentation style in SQL (pyspark.sql.*)	2020-11-03 10:00:49 +09:00
functions.py	[SPARK-34300][PYSPARK][DOCS][MINOR] Fix some typos and syntax issues in docstrings and output of `dev/lint-python`	2021-02-02 09:30:50 +09:00
functions.pyi	[SPARK-34306][SQL][PYTHON][R] Use Snake naming rule across the function APIs	2021-02-02 09:29:40 +09:00
group.py	[SPARK-33250][PYTHON][DOCS] Migration to NumPy documentation style in SQL (pyspark.sql.*)	2020-11-03 10:00:49 +09:00
group.pyi	[SPARK-32714][PYTHON] Initial pyspark-stubs port	2020-09-24 14:15:36 +09:00
readwriter.py	[SPARK-33978][SQL] Support ZSTD compression in ORC data source	2021-01-04 00:54:47 -08:00
readwriter.pyi	[SPARK-33566][CORE][SQL][SS][PYTHON] Make unescapedQuoteHandling option configurable when read CSV	2020-11-27 15:47:39 +09:00
session.py	[SPARK-33989][SQL] Strip auto-generated cast when using Cast.sql	2021-01-14 15:27:14 +00:00
session.pyi	[SPARK-33457][PYTHON] Adjust mypy configuration	2020-11-25 09:27:04 +09:00
streaming.py	[SPARK-33659][SS] Document the current behavior for DataStreamWriter.toTable API	2020-12-24 12:44:37 +09:00
streaming.pyi	[SPARK-33836][SS][PYTHON] Expose DataStreamReader.table and DataStreamWriter.toTable	2020-12-21 19:42:59 +09:00
types.py	[SPARK-33250][PYTHON][DOCS] Migration to NumPy documentation style in SQL (pyspark.sql.*)	2020-11-03 10:00:49 +09:00
types.pyi	[SPARK-33457][PYTHON] Adjust mypy configuration	2020-11-25 09:27:04 +09:00
udf.py	[SPARK-33250][PYTHON][DOCS] Migration to NumPy documentation style in SQL (pyspark.sql.*)	2020-11-03 10:00:49 +09:00
udf.pyi	[SPARK-33457][PYTHON] Adjust mypy configuration	2020-11-25 09:27:04 +09:00
utils.py	Spelling r common dev mlib external project streaming resource managers python	2020-11-27 10:22:45 -06:00
window.py	[SPARK-33250][PYTHON][DOCS] Migration to NumPy documentation style in SQL (pyspark.sql.*)	2020-11-03 10:00:49 +09:00
window.pyi	[SPARK-33250][PYTHON][DOCS] Migration to NumPy documentation style in SQL (pyspark.sql.*)	2020-11-03 10:00:49 +09:00