spark-instrumented-optimizer

Terry Kim f09c1a36c4 [SPARK-29890][SQL] DataFrameNaFunctions.fill should handle duplicate columns

### What changes were proposed in this pull request?

`DataFrameNaFunctions.fill` doesn't handle duplicate columns even when column names are not specified.

```Scala
val left = Seq(("1", null), ("3", "4")).toDF("col1", "col2")
val right = Seq(("1", "2"), ("3", null)).toDF("col1", "col2")
val df = left.join(right, Seq("col1"))
df.printSchema
df.na.fill("hello").show
```

produces

```
root
 |-- col1: string (nullable = true)
 |-- col2: string (nullable = true)
 |-- col2: string (nullable = true)

org.apache.spark.sql.AnalysisException: Reference 'col2' is ambiguous, could be: col2, col2.;
  at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:259)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:121)
  at org.apache.spark.sql.Dataset.resolve(Dataset.scala:221)
  at org.apache.spark.sql.Dataset.col(Dataset.scala:1268)
```

The failure occurs because columns are looked up with `Dataset.col()`, which resolves a column by name and fails with an ambiguity error when multiple columns share that name.

This PR updates `DataFrameNaFunctions.fill` so that, if the columns to fill are not specified, it resolves the ambiguity gracefully by applying `fill` to all the eligible columns. (Note that if the user specifies the columns, it will still fail due to ambiguity.)

### Why are the changes needed?

If column names are not specified, `fill` should not fail due to ambiguity, since it should still be able to apply `fill` to the eligible columns.

### Does this PR introduce any user-facing change?

Yes, the above example now displays the following:

```
+----+-----+-----+
|col1| col2| col2|
+----+-----+-----+
|   1|hello|    2|
|   3|    4|hello|
+----+-----+-----+
```

### How was this patch tested?

Added new unit tests.

Closes #26593 from imback82/na_fill.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

2019-11-26 00:06:19 +08:00
..
benchmarks	[SPARK-29822][SQL] Fix cast error when there are white spaces between signs and values	2019-11-11 21:53:33 +08:00
src	[SPARK-29890][SQL] DataFrameNaFunctions.fill should handle duplicate columns	2019-11-26 00:06:19 +08:00
v1.2/src	[SPARK-29981][BUILD][FOLLOWUP] Change hive.version.short	2019-11-23 12:50:50 -08:00
v2.3/src	[SPARK-29981][BUILD][FOLLOWUP] Change hive.version.short	2019-11-23 12:50:50 -08:00
pom.xml	[SPARK-29923][SQL][TESTS] Set io.netty.tryReflectionSetAccessible for Arrow on JDK9+	2019-11-15 23:58:15 -08:00