spark-instrumented-optimizer

Kousuke Saruta 8ffe00e745 [SPARK-36874][SQL] DeduplicateRelations should copy dataset_id tag to avoid ambiguous self join

### What changes were proposed in this pull request?

This PR fixes an issue where an ambiguous self join can't be detected if the left and right DataFrames are swapped. For example:

```
val df1 = Seq((1, 2, "A1"),(2, 1, "A2")).toDF("key1", "key2", "value")
val df2 = df1.filter($"value" === "A2")
df1.join(df2, df1("key1") === df2("key2")) // Ambiguous self join is detected and AnalysisException is thrown.
df2.join(df1, df1("key1") === df2("key2")) // Ambiguous self join is not detected.
```

The root cause seems to be that the inner function `collectConflictPlans` in `DeduplicateRelations` doesn't copy the `dataset_id` tag when it copies a `LogicalPlan`.

### Why are the changes needed?

Bug fix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New tests.

Closes #34172 from sarutak/fix-deduplication-issue.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit `fa1805db48`)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

2021-10-05 11:17:12 +08:00
..
benchmarks	[SPARK-34981][SQL][FOLLOWUP] Use SpecificInternalRow in ApplyFunctionExpression	2021-05-24 17:25:24 +09:00
src	[SPARK-36874][SQL] DeduplicateRelations should copy dataset_id tag to avoid ambiguous self join	2021-10-05 11:17:12 +08:00
pom.xml	Preparing development version 3.2.1-SNAPSHOT	2021-09-28 10:53:42 +00:00