d7499aed9c
### What changes were proposed in this pull request?

#26700 removed the ability to drop a row whose nested column value is null. For example, for the following `df`:
```
val schema = new StructType()
  .add("c1", new StructType()
    .add("c1-1", StringType)
    .add("c1-2", StringType))
val data = Seq(Row(Row(null, "a2")), Row(Row("b1", "b2")), Row(null))
val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
df.show
+--------+
|      c1|
+--------+
|  [, a2]|
|[b1, b2]|
|    null|
+--------+
```
In Spark 2.4.4:
```
df.na.drop("any", Seq("c1.c1-1")).show
+--------+
|      c1|
+--------+
|[b1, b2]|
+--------+
```
In Spark 2.4.5 and Spark 3.0.0-preview2, if nested columns are specified, they are ignored:
```
df.na.drop("any", Seq("c1.c1-1")).show
+--------+
|      c1|
+--------+
|  [, a2]|
|[b1, b2]|
|    null|
+--------+
```

### Why are the changes needed?

This is a regression from Spark 2.4.4.

### Does this PR introduce any user-facing change?

Yes. Nested columns can now be specified:
```
df.na.drop("any", Seq("c1.c1-1")).show
+--------+
|      c1|
+--------+
|[b1, b2]|
+--------+
```
Also, if `*` is specified as a column, an `AnalysisException` is thrown stating that `*` cannot be resolved, which was the behavior in 2.4.4. Currently in master, it has no effect.

### How was this patch tested?

Updated existing tests.

Closes #28266 from imback82/SPARK-31256.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>