spark-instrumented-optimizer/sql/core
Terry Kim f09c1a36c4 [SPARK-29890][SQL] DataFrameNaFunctions.fill should handle duplicate columns
### What changes were proposed in this pull request?

`DataFrameNaFunctions.fill` doesn't handle duplicate columns even when column names are not specified.

```Scala
val left = Seq(("1", null), ("3", "4")).toDF("col1", "col2")
val right = Seq(("1", "2"), ("3", null)).toDF("col1", "col2")
val df = left.join(right, Seq("col1"))
df.printSchema
df.na.fill("hello").show
```
produces
```
root
 |-- col1: string (nullable = true)
 |-- col2: string (nullable = true)
 |-- col2: string (nullable = true)

org.apache.spark.sql.AnalysisException: Reference 'col2' is ambiguous, could be: col2, col2.;
  at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:259)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:121)
  at org.apache.spark.sql.Dataset.resolve(Dataset.scala:221)
  at org.apache.spark.sql.Dataset.col(Dataset.scala:1268)
```
The reason for the above failure is that columns are looked up with `DataSet.col()` which tries to resolve a column by name and if there are multiple columns with the same name, it will fail due to ambiguity.

This PR updates `DataFrameNaFunctions.fill` such that if the columns to fill are not specified, it will resolve ambiguity gracefully by applying `fill` to all the eligible columns. (Note that if the user specifies the columns, it will still continue to fail due to ambiguity).

### Why are the changes needed?

If column names are not specified, `fill` should not fail due to ambiguity since it should still be able to apply `fill` to the eligible columns.

### Does this PR introduce any user-facing change?

Yes, now the above example displays the following:
```
+----+-----+-----+
|col1| col2| col2|
+----+-----+-----+
|   1|hello|    2|
|   3|    4|hello|
+----+-----+-----+

```

### How was this patch tested?

Added new unit tests.

Closes #26593 from imback82/na_fill.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-26 00:06:19 +08:00
..
benchmarks [SPARK-29822][SQL] Fix cast error when there are white spaces between signs and values 2019-11-11 21:53:33 +08:00
src [SPARK-29890][SQL] DataFrameNaFunctions.fill should handle duplicate columns 2019-11-26 00:06:19 +08:00
v1.2/src [SPARK-29981][BUILD][FOLLOWUP] Change hive.version.short 2019-11-23 12:50:50 -08:00
v2.3/src [SPARK-29981][BUILD][FOLLOWUP] Change hive.version.short 2019-11-23 12:50:50 -08:00
pom.xml [SPARK-29923][SQL][TESTS] Set io.netty.tryReflectionSetAccessible for Arrow on JDK9+ 2019-11-15 23:58:15 -08:00