2019-03-05 14:12:57 -05:00
|
|
|
================================================================================================
|
|
|
|
Nested Schema Pruning Benchmark For ORC v2
|
|
|
|
================================================================================================
|
|
|
|
|
2019-10-03 11:58:25 -04:00
|
|
|
OpenJDK 64-Bit Server VM 1.8.0_222-b10 on Linux 3.10.0-862.3.2.el7.x86_64
|
2019-03-13 16:27:10 -04:00
|
|
|
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
|
2019-03-05 14:12:57 -05:00
|
|
|
Selection: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
|
|
|
|
------------------------------------------------------------------------------------------------------------------------
|
2019-10-03 11:58:25 -04:00
|
|
|
Top-level column 141 161 22 7.1 140.6 1.0X
|
|
|
|
Nested column 1425 1455 26 0.7 1424.7 0.1X
|
|
|
|
Nested column in array 5248 5300 46 0.2 5247.5 0.0X
|
2019-03-05 14:12:57 -05:00
|
|
|
|
2019-10-03 11:58:25 -04:00
|
|
|
OpenJDK 64-Bit Server VM 1.8.0_222-b10 on Linux 3.10.0-862.3.2.el7.x86_64
|
2019-03-13 16:27:10 -04:00
|
|
|
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
|
2019-03-05 14:12:57 -05:00
|
|
|
Limiting: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
|
|
|
|
------------------------------------------------------------------------------------------------------------------------
|
2019-10-03 11:58:25 -04:00
|
|
|
Top-level column 133 163 22 7.5 132.8 1.0X
|
|
|
|
Nested column 1254 1308 40 0.8 1254.0 0.1X
|
|
|
|
Nested column in array 5303 5418 81 0.2 5303.3 0.0X
|
2019-03-05 14:12:57 -05:00
|
|
|
|
2019-10-03 11:58:25 -04:00
|
|
|
OpenJDK 64-Bit Server VM 1.8.0_222-b10 on Linux 3.10.0-862.3.2.el7.x86_64
|
2019-03-13 16:27:10 -04:00
|
|
|
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
|
2019-03-05 14:12:57 -05:00
|
|
|
Repartitioning: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
|
|
|
|
------------------------------------------------------------------------------------------------------------------------
|
2019-10-03 11:58:25 -04:00
|
|
|
Top-level column 377 401 19 2.7 376.7 1.0X
|
|
|
|
Nested column 1676 1722 21 0.6 1676.1 0.2X
|
|
|
|
Nested column in array 6019 6127 109 0.2 6018.7 0.1X
|
2019-03-05 14:12:57 -05:00
|
|
|
|
2019-10-03 11:58:25 -04:00
|
|
|
OpenJDK 64-Bit Server VM 1.8.0_222-b10 on Linux 3.10.0-862.3.2.el7.x86_64
|
2019-03-13 16:27:10 -04:00
|
|
|
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
|
2019-03-05 14:12:57 -05:00
|
|
|
Repartitioning by exprs: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
|
|
|
|
------------------------------------------------------------------------------------------------------------------------
|
2019-10-03 11:58:25 -04:00
|
|
|
Top-level column 390 447 151 2.6 390.1 1.0X
|
|
|
|
Nested column 4300 4364 60 0.2 4299.7 0.1X
|
|
|
|
Nested column in array 8832 9030 114 0.1 8832.4 0.0X
|
[SPARK-26975][SQL] Support nested-column pruning over limit/sample/repartition
## What changes were proposed in this pull request?
As [SPARK-26958](https://github.com/apache/spark/pull/23862/files) benchmark shows, nested-column pruning has limitations. This PR aims to remove the limitations on `limit/repartition/sample`. Here, repartition means `Repartition`, not `RepartitionByExpression`.
**PREPARATION**
```scala
scala> spark.range(100).map(x => (x, (x, s"$x" * 100))).toDF("col1", "col2").write.mode("overwrite").save("/tmp/p")
scala> sql("set spark.sql.optimizer.nestedSchemaPruning.enabled=true")
scala> spark.read.parquet("/tmp/p").createOrReplaceTempView("t")
```
**BEFORE**
```scala
scala> sql("SELECT col2._1 FROM (SELECT col2 FROM t LIMIT 1000000)").explain
== Physical Plan ==
CollectLimit 1000000
+- *(1) Project [col2#22._1 AS _1#28L]
+- *(1) FileScan parquet [col2#22] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/p], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<col2:struct<_1:bigint>>
scala> sql("SELECT col2._1 FROM (SELECT /*+ REPARTITION(1) */ col2 FROM t)").explain
== Physical Plan ==
*(2) Project [col2#22._1 AS _1#33L]
+- Exchange RoundRobinPartitioning(1)
+- *(1) Project [col2#22]
+- *(1) FileScan parquet [col2#22] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/p], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<col2:struct<_1:bigint,_2:string>>
```
**AFTER**
```scala
scala> sql("SELECT col2._1 FROM (SELECT /*+ REPARTITION(1) */ col2 FROM t)").explain
== Physical Plan ==
Exchange RoundRobinPartitioning(1)
+- *(1) Project [col2#5._1 AS _1#11L]
+- *(1) FileScan parquet [col2#5] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/p], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<col2:struct<_1:bigint>>
```
This supercedes https://github.com/apache/spark/pull/23542 and https://github.com/apache/spark/pull/23873 .
## How was this patch tested?
Pass the Jenkins with a newly added test suite.
Closes #23964 from dongjoon-hyun/SPARK-26975-ALIAS.
Lead-authored-by: Dongjoon Hyun <dhyun@apple.com>
Co-authored-by: DB Tsai <d_tsai@apple.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-03-19 23:24:22 -04:00
|
|
|
|
2019-10-03 11:58:25 -04:00
|
|
|
OpenJDK 64-Bit Server VM 1.8.0_222-b10 on Linux 3.10.0-862.3.2.el7.x86_64
|
[SPARK-26975][SQL] Support nested-column pruning over limit/sample/repartition
## What changes were proposed in this pull request?
As [SPARK-26958](https://github.com/apache/spark/pull/23862/files) benchmark shows, nested-column pruning has limitations. This PR aims to remove the limitations on `limit/repartition/sample`. Here, repartition means `Repartition`, not `RepartitionByExpression`.
**PREPARATION**
```scala
scala> spark.range(100).map(x => (x, (x, s"$x" * 100))).toDF("col1", "col2").write.mode("overwrite").save("/tmp/p")
scala> sql("set spark.sql.optimizer.nestedSchemaPruning.enabled=true")
scala> spark.read.parquet("/tmp/p").createOrReplaceTempView("t")
```
**BEFORE**
```scala
scala> sql("SELECT col2._1 FROM (SELECT col2 FROM t LIMIT 1000000)").explain
== Physical Plan ==
CollectLimit 1000000
+- *(1) Project [col2#22._1 AS _1#28L]
+- *(1) FileScan parquet [col2#22] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/p], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<col2:struct<_1:bigint>>
scala> sql("SELECT col2._1 FROM (SELECT /*+ REPARTITION(1) */ col2 FROM t)").explain
== Physical Plan ==
*(2) Project [col2#22._1 AS _1#33L]
+- Exchange RoundRobinPartitioning(1)
+- *(1) Project [col2#22]
+- *(1) FileScan parquet [col2#22] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/p], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<col2:struct<_1:bigint,_2:string>>
```
**AFTER**
```scala
scala> sql("SELECT col2._1 FROM (SELECT /*+ REPARTITION(1) */ col2 FROM t)").explain
== Physical Plan ==
Exchange RoundRobinPartitioning(1)
+- *(1) Project [col2#5._1 AS _1#11L]
+- *(1) FileScan parquet [col2#5] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/p], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<col2:struct<_1:bigint>>
```
This supercedes https://github.com/apache/spark/pull/23542 and https://github.com/apache/spark/pull/23873 .
## How was this patch tested?
Pass the Jenkins with a newly added test suite.
Closes #23964 from dongjoon-hyun/SPARK-26975-ALIAS.
Lead-authored-by: Dongjoon Hyun <dhyun@apple.com>
Co-authored-by: DB Tsai <d_tsai@apple.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-03-19 23:24:22 -04:00
|
|
|
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
|
|
|
|
Sample: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
|
|
|
|
------------------------------------------------------------------------------------------------------------------------
|
2019-10-03 11:58:25 -04:00
|
|
|
Top-level column 132 143 7 7.6 131.6 1.0X
|
|
|
|
Nested column 1260 1303 20 0.8 1260.2 0.1X
|
|
|
|
Nested column in array 5359 5453 74 0.2 5359.1 0.0X
|
2019-03-05 14:12:57 -05:00
|
|
|
|
2019-10-03 11:58:25 -04:00
|
|
|
OpenJDK 64-Bit Server VM 1.8.0_222-b10 on Linux 3.10.0-862.3.2.el7.x86_64
|
2019-03-13 16:27:10 -04:00
|
|
|
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
|
2019-03-05 14:12:57 -05:00
|
|
|
Sorting: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
|
|
|
|
------------------------------------------------------------------------------------------------------------------------
|
2019-10-03 11:58:25 -04:00
|
|
|
Top-level column 288 302 20 3.5 287.6 1.0X
|
|
|
|
Nested column 3169 3242 53 0.3 3168.7 0.1X
|
|
|
|
Nested column in array 8151 8301 123 0.1 8151.3 0.0X
|
2019-03-05 14:12:57 -05:00
|
|
|
|
|
|
|
|