abc88deeed
### What changes were proposed in this pull request? Modify the approach in `DataFrameNaFunctions.fillValue`, the new one uses `df.withColumns` which only address the columns need to be filled. After this change, there are no more ambiguous fileds detected for joined dataframe. ### Why are the changes needed? Before this change, when you have a joined table that has the same field name from both original table, fillna will fail even if you specify a subset that does not include the 'ambiguous' fields. ``` scala> val df1 = Seq(("f1-1", "f2", null), ("f1-2", null, null), ("f1-3", "f2", "f3-1"), ("f1-4", "f2", "f3-1")).toDF("f1", "f2", "f3") scala> val df2 = Seq(("f1-1", null, null), ("f1-2", "f2", null), ("f1-3", "f2", "f4-1")).toDF("f1", "f2", "f4") scala> val df_join = df1.alias("df1").join(df2.alias("df2"), Seq("f1"), joinType="left_outer") scala> df_join.na.fill("", cols=Seq("f4")) org.apache.spark.sql.AnalysisException: Reference 'f2' is ambiguous, could be: df1.f2, df2.f2.; ``` ### Does this PR introduce any user-facing change? Yes, fillna operation will pass and give the right answer for a joined table. ### How was this patch tested? Local test and newly added UT. Closes #25768 from xuanyuanking/SPARK-29063. Lead-authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Co-authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> |
||
---|---|---|
.. | ||
benchmarks | ||
src | ||
v1.2.1/src | ||
v2.3.5/src | ||
pom.xml |