spark-instrumented-optimizer

Gengliang Wang f5b9370da2 [SPARK-26709][SQL] OptimizeMetadataOnlyQuery does not handle empty records correctly

## What changes were proposed in this pull request?

When reading from empty tables, the optimization `OptimizeMetadataOnlyQuery` may return wrong results:
```
sql("CREATE TABLE t (col1 INT, p1 INT) USING PARQUET PARTITIONED BY (p1)")
sql("INSERT INTO TABLE t PARTITION (p1 = 5) SELECT ID FROM range(1, 1)")
sql("SELECT MAX(p1) FROM t")
```
The result is supposed to be `null`. However, with the optimization the result is `5`.

The rule was originally ported from https://issues.apache.org/jira/browse/HIVE-1003 in #13494. In Hive, the rule was disabled by default in a later release (https://issues.apache.org/jira/browse/HIVE-15397) due to the same problem.

It is hard to completely avoid the correctness issue, because data sources like Parquet can be metadata-only: Spark can't tell whether the table is empty without actually reading it. This PR disables the optimization by default.

## How was this patch tested?

Unit test

Closes #23635 from gengliangwang/optimizeMetadata.

Lead-authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Co-authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>

2019-01-24 18:24:49 -08:00
benchmarks	[SPARK-26584][SQL] Remove `spark.sql.orc.copyBatchToSpark` internal conf	2019-01-10 08:42:23 -08:00
compatibility/src/test/scala/org/apache/spark/sql/hive/execution	Revert [SPARK-19355][SPARK-25352]	2018-09-20 20:18:31 +08:00
src	[SPARK-26709][SQL] OptimizeMetadataOnlyQuery does not handle empty records correctly	2019-01-24 18:24:49 -08:00
pom.xml	[SPARK-26306][TEST][BUILD] More memory to de-flake SorterSuite	2019-01-04 15:35:23 -06:00