spark-instrumented-optimizer/sql/core/benchmarks/ParquetNestedSchemaPruningBenchmark-results.txt


================================================================================================
Nested Schema Pruning Benchmark For Parquet
================================================================================================
OpenJDK 64-Bit Server VM 1.8.0_212-b04 on Linux 3.10.0-862.3.2.el7.x86_64
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Selection:                                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Top-level column                                    151            174          16          6.6         151.3       1.0X
Nested column                                       316            375          88          3.2         315.7       0.5X
Nested column in array                             1277           1292          11          0.8        1277.0       0.1X
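
The derived columns in these tables follow from the best time alone. The sketch below is an illustration, not code taken from the benchmark: the function name `derived_columns` is hypothetical, and the row count of 1,000,000 is an assumption inferred from the table (151.3 ns per row at a best time of about 151 ms), not something this file states. Because the tables print best time rounded to whole milliseconds, the recomputed per-row figure can differ from the printed one in the last digit.

```python
def derived_columns(best_ms, baseline_best_ms, num_rows=1_000_000):
    """Recompute Rate(M/s), Per Row(ns) and Relative from a best time.

    num_rows is an assumed value inferred from the table, not read from it.
    """
    # Per Row(ns): total best time in nanoseconds divided by rows processed.
    per_row_ns = best_ms * 1e6 / num_rows
    # Rate(M/s): rows per second, expressed in millions.
    rate_m_per_s = num_rows / (best_ms / 1e3) / 1e6
    # Relative: speed compared with the first (baseline) row of the table.
    relative = baseline_best_ms / best_ms
    return rate_m_per_s, per_row_ns, relative


# "Nested column" row of the Selection table, baseline "Top-level column".
rate, per_row, rel = derived_columns(best_ms=316, baseline_best_ms=151)
print(f"{rate:.1f} M/s, {per_row:.1f} ns/row, {rel:.1f}X")
```

Read this way, a Relative value of 0.5X means the case takes roughly twice as long as the top-level column case in the same table.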
OpenJDK 64-Bit Server VM 1.8.0_212-b04 on Linux 3.10.0-862.3.2.el7.x86_64
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Limiting:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Top-level column                                    452            501          67          2.2         451.9       1.0X
Nested column                                       664            722          77          1.5         664.4       0.7X
Nested column in array                             1906           1997          91          0.5        1905.6       0.2X
OpenJDK 64-Bit Server VM 1.8.0_212-b04 on Linux 3.10.0-862.3.2.el7.x86_64
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Repartitioning:                           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Top-level column                                    385            410          39          2.6         385.5       1.0X
Nested column                                       612            620          10          1.6         611.9       0.6X
Nested column in array                             1790           1845          80          0.6        1789.5       0.2X
OpenJDK 64-Bit Server VM 1.8.0_212-b04 on Linux 3.10.0-862.3.2.el7.x86_64
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Repartitioning by exprs:                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Top-level column                                    386            402          26          2.6         386.0       1.0X
Nested column                                      2982           3057          64          0.3        2982.0       0.1X
Nested column in array                             3504           3690         248          0.3        3503.7       0.1X
[SPARK-26975][SQL] Support nested-column pruning over limit/sample/repartition

## What changes were proposed in this pull request?

As the [SPARK-26958](https://github.com/apache/spark/pull/23862/files) benchmark shows, nested-column pruning has limitations. This PR aims to remove the limitations on `limit/repartition/sample`. Here, repartition means `Repartition`, not `RepartitionByExpression`.

**PREPARATION**
```scala
scala> spark.range(100).map(x => (x, (x, s"$x" * 100))).toDF("col1", "col2").write.mode("overwrite").save("/tmp/p")
scala> sql("set spark.sql.optimizer.nestedSchemaPruning.enabled=true")
scala> spark.read.parquet("/tmp/p").createOrReplaceTempView("t")
```

**BEFORE**
```scala
scala> sql("SELECT col2._1 FROM (SELECT col2 FROM t LIMIT 1000000)").explain
== Physical Plan ==
CollectLimit 1000000
+- *(1) Project [col2#22._1 AS _1#28L]
   +- *(1) FileScan parquet [col2#22] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/p], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<col2:struct<_1:bigint>>

scala> sql("SELECT col2._1 FROM (SELECT /*+ REPARTITION(1) */ col2 FROM t)").explain
== Physical Plan ==
*(2) Project [col2#22._1 AS _1#33L]
+- Exchange RoundRobinPartitioning(1)
   +- *(1) Project [col2#22]
      +- *(1) FileScan parquet [col2#22] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/p], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<col2:struct<_1:bigint,_2:string>>
```

**AFTER**
```scala
scala> sql("SELECT col2._1 FROM (SELECT /*+ REPARTITION(1) */ col2 FROM t)").explain
== Physical Plan ==
Exchange RoundRobinPartitioning(1)
+- *(1) Project [col2#5._1 AS _1#11L]
   +- *(1) FileScan parquet [col2#5] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/p], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<col2:struct<_1:bigint>>
```

This supersedes https://github.com/apache/spark/pull/23542 and https://github.com/apache/spark/pull/23873 .

## How was this patch tested?

Pass the Jenkins with a newly added test suite.

Closes #23964 from dongjoon-hyun/SPARK-26975-ALIAS.

Lead-authored-by: Dongjoon Hyun <dhyun@apple.com>
Co-authored-by: DB Tsai <d_tsai@apple.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-03-19 23:24:22 -04:00
OpenJDK 64-Bit Server VM 1.8.0_212-b04 on Linux 3.10.0-862.3.2.el7.x86_64
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Sample:                                   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Top-level column                                    138            152          10          7.2         138.3       1.0X
Nested column                                       345            369          16          2.9         344.8       0.4X
Nested column in array                             1358           1405          50          0.7        1358.5       0.1X
OpenJDK 64-Bit Server VM 1.8.0_212-b04 on Linux 3.10.0-862.3.2.el7.x86_64
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Sorting:                                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Top-level column                                    606            632          45          1.6         606.3       1.0X
Nested column                                      3586           3679         107          0.3        3585.8       0.2X
Nested column in array                             4452           4831         244          0.2        4451.8       0.1X