spark-instrumented-optimizer

Dongjoon Hyun 257391497b [SPARK-26975][SQL] Support nested-column pruning over limit/sample/repartition

## What changes were proposed in this pull request?

As the [SPARK-26958](https://github.com/apache/spark/pull/23862/files) benchmark shows, nested-column pruning has limitations. This PR aims to remove the limitations on `limit/repartition/sample`. Here, repartition means `Repartition`, not `RepartitionByExpression`.

PREPARATION
```scala
scala> spark.range(100).map(x => (x, (x, s"$x" * 100))).toDF("col1", "col2").write.mode("overwrite").save("/tmp/p")
scala> sql("set spark.sql.optimizer.nestedSchemaPruning.enabled=true")
scala> spark.read.parquet("/tmp/p").createOrReplaceTempView("t")
```

BEFORE
```scala
scala> sql("SELECT col2._1 FROM (SELECT col2 FROM t LIMIT 1000000)").explain
== Physical Plan ==
CollectLimit 1000000
+- *(1) Project [col2#22._1 AS _1#28L]
   +- *(1) FileScan parquet [col2#22] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/p], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<col2:struct<_1:bigint>>

scala> sql("SELECT col2._1 FROM (SELECT /*+ REPARTITION(1) */ col2 FROM t)").explain
== Physical Plan ==
*(2) Project [col2#22._1 AS _1#33L]
+- Exchange RoundRobinPartitioning(1)
   +- *(1) Project [col2#22]
      +- *(1) FileScan parquet [col2#22] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/p], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<col2:struct<_1:bigint,_2:string>>
```

AFTER
```scala
scala> sql("SELECT col2._1 FROM (SELECT /*+ REPARTITION(1) */ col2 FROM t)").explain
== Physical Plan ==
Exchange RoundRobinPartitioning(1)
+- *(1) Project [col2#5._1 AS _1#11L]
   +- *(1) FileScan parquet [col2#5] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/p], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<col2:struct<_1:bigint>>
```

This supersedes https://github.com/apache/spark/pull/23542 and https://github.com/apache/spark/pull/23873.

## How was this patch tested?

Pass the Jenkins with a newly added test suite.

Closes #23964 from dongjoon-hyun/SPARK-26975-ALIAS.

Lead-authored-by: Dongjoon Hyun <dhyun@apple.com>
Co-authored-by: DB Tsai <d_tsai@apple.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

2019-03-19 20:24:22 -07:00
AggregateBenchmark-results.txt	[SPARK-25476][SPARK-25510][TEST] Refactor AggregateBenchmark and add a new trait to better support Dataset and DataFrame API	2018-10-01 07:32:40 -07:00
BloomFilterBenchmark-results.txt	[SPARK-25589][SQL][TEST] Add BloomFilterBenchmark	2018-10-03 04:14:07 -07:00
BuiltInDataSourceWriteBenchmark-results.txt	[SPARK-25663][SPARK-25661][SQL][TEST] Refactor BuiltInDataSourceWriteBenchmark, DataSourceWriteBenchmark and AvroWriteBenchmark to use main method	2018-10-31 03:03:42 -07:00
ColumnarBatchBenchmark-results.txt	[SPARK-25481][SQL][TEST] Refactor ColumnarBatchBenchmark to use main method	2018-09-26 20:40:10 -07:00
CompressionSchemeBenchmark-results.txt	[SPARK-25478][SQL][TEST] Refactor CompressionSchemeBenchmark to use main method	2018-09-23 20:46:40 -07:00
CSVBenchmark-results.txt	[SPARK-26378][SQL] Restore performance of queries against wide CSV/JSON tables	2019-01-30 15:15:29 +08:00
DatasetBenchmark-results.txt	[SPARK-25479][TEST] Refactor DatasetBenchmark to use main method	2018-10-04 11:58:16 -07:00
DataSourceReadBenchmark-results.txt	[SPARK-26584][SQL] Remove `spark.sql.orc.copyBatchToSpark` internal conf	2019-01-10 08:42:23 -08:00
DateTimeBenchmark-results.txt	[SPARK-26903][SQL] Remove the TimeZone cache	2019-02-23 09:44:22 -06:00
ExternalAppendOnlyUnsafeRowArrayBenchmark-results.txt	[SPARK-25484][SQL][TEST] Refactor ExternalAppendOnlyUnsafeRowArrayBenchmark	2019-01-09 09:54:21 -08:00
FilterPushdownBenchmark-results.txt	[SPARK-25438][SQL][TEST] Fix FilterPushdownBenchmark to use the same memory assumption	2018-09-15 17:48:39 -07:00
HashedRelationMetricsBenchmark-results.txt	[SPARK-26337][SQL][TEST] Add benchmark for LongToUnsafeRowMap	2018-12-14 10:50:48 +08:00
InExpressionBenchmark-results.txt	[SPARK-26205][SQL] Optimize InSet Expression for bytes, shorts, ints, dates	2019-03-04 15:40:04 -08:00
JoinBenchmark-results.txt	[SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use main method	2018-10-12 16:08:12 -07:00
JSONBenchmark-results.txt	[SPARK-26745][SQL] Revert count optimization in JSON datasource by SPARK-24959	2019-01-31 14:32:31 +08:00
MiscBenchmark-results.txt	[SPARK-25488][SQL][TEST] Refactor MiscBenchmark to use main method	2018-10-06 08:47:43 -07:00
OrcNestedSchemaPruningBenchmark-results.txt	[SPARK-26975][SQL] Support nested-column pruning over limit/sample/repartition	2019-03-19 20:24:22 -07:00
OrcV2NestedSchemaPruningBenchmark-results.txt	[SPARK-26975][SQL] Support nested-column pruning over limit/sample/repartition	2019-03-19 20:24:22 -07:00
ParquetNestedSchemaPruningBenchmark-results.txt	[SPARK-26975][SQL] Support nested-column pruning over limit/sample/repartition	2019-03-19 20:24:22 -07:00
PrimitiveArrayBenchmark-results.txt	[SPARK-25487][SQL][TEST] Refactor PrimitiveArrayBenchmark	2018-09-21 15:04:47 +09:00
RangeBenchmark-results.txt	[SPARK-25710][SQL] range should report metrics correctly	2018-10-13 13:55:28 +08:00
SortBenchmark-results.txt	[SPARK-25486][TEST] Refactor SortBenchmark to use main method	2018-09-25 11:13:05 -07:00
UnsafeArrayDataBenchmark-results.txt	[SPARK-25483][TEST] Refactor UnsafeArrayDataBenchmark to use main method	2018-10-03 04:20:02 -07:00
WideSchemaBenchmark-results.txt	[SPARK-25492][TEST] Refactor WideSchemaBenchmark to use main method	2018-10-20 17:31:13 -07:00
WideTableBenchmark-results.txt	[SPARK-25676][SQL][FOLLOWUP] Use 'foreach(_ => ())'	2018-11-08 23:37:14 +08:00