257391497b
## What changes were proposed in this pull request? As [SPARK-26958](https://github.com/apache/spark/pull/23862/files) benchmark shows, nested-column pruning has limitations. This PR aims to remove the limitations on `limit/repartition/sample`. Here, repartition means `Repartition`, not `RepartitionByExpression`. **PREPARATION** ```scala scala> spark.range(100).map(x => (x, (x, s"$x" * 100))).toDF("col1", "col2").write.mode("overwrite").save("/tmp/p") scala> sql("set spark.sql.optimizer.nestedSchemaPruning.enabled=true") scala> spark.read.parquet("/tmp/p").createOrReplaceTempView("t") ``` **BEFORE** ```scala scala> sql("SELECT col2._1 FROM (SELECT col2 FROM t LIMIT 1000000)").explain == Physical Plan == CollectLimit 1000000 +- *(1) Project [col2#22._1 AS _1#28L] +- *(1) FileScan parquet [col2#22] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/p], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<col2:struct<_1:bigint>> scala> sql("SELECT col2._1 FROM (SELECT /*+ REPARTITION(1) */ col2 FROM t)").explain == Physical Plan == *(2) Project [col2#22._1 AS _1#33L] +- Exchange RoundRobinPartitioning(1) +- *(1) Project [col2#22] +- *(1) FileScan parquet [col2#22] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/p], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<col2:struct<_1:bigint,_2:string>> ``` **AFTER** ```scala scala> sql("SELECT col2._1 FROM (SELECT /*+ REPARTITION(1) */ col2 FROM t)").explain == Physical Plan == Exchange RoundRobinPartitioning(1) +- *(1) Project [col2#5._1 AS _1#11L] +- *(1) FileScan parquet [col2#5] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/p], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<col2:struct<_1:bigint>> ``` This supercedes https://github.com/apache/spark/pull/23542 and https://github.com/apache/spark/pull/23873 . ## How was this patch tested? Pass the Jenkins with a newly added test suite. Closes #23964 from dongjoon-hyun/SPARK-26975-ALIAS. Lead-authored-by: Dongjoon Hyun <dhyun@apple.com> Co-authored-by: DB Tsai <d_tsai@apple.com> Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com> Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org> Co-authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> |
||
---|---|---|
.. | ||
AggregateBenchmark-results.txt | ||
BloomFilterBenchmark-results.txt | ||
BuiltInDataSourceWriteBenchmark-results.txt | ||
ColumnarBatchBenchmark-results.txt | ||
CompressionSchemeBenchmark-results.txt | ||
CSVBenchmark-results.txt | ||
DatasetBenchmark-results.txt | ||
DataSourceReadBenchmark-results.txt | ||
DateTimeBenchmark-results.txt | ||
ExternalAppendOnlyUnsafeRowArrayBenchmark-results.txt | ||
FilterPushdownBenchmark-results.txt | ||
HashedRelationMetricsBenchmark-results.txt | ||
InExpressionBenchmark-results.txt | ||
JoinBenchmark-results.txt | ||
JSONBenchmark-results.txt | ||
MiscBenchmark-results.txt | ||
OrcNestedSchemaPruningBenchmark-results.txt | ||
OrcV2NestedSchemaPruningBenchmark-results.txt | ||
ParquetNestedSchemaPruningBenchmark-results.txt | ||
PrimitiveArrayBenchmark-results.txt | ||
RangeBenchmark-results.txt | ||
SortBenchmark-results.txt | ||
UnsafeArrayDataBenchmark-results.txt | ||
WideSchemaBenchmark-results.txt | ||
WideTableBenchmark-results.txt |