spark-instrumented-optimizer/sql/core/benchmarks/ExtractBenchmark-results.txt

Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.3
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Invoke extract for timestamp:             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
cast to timestamp                                   287            308          19         34.8          28.7       1.0X
MILLENNIUM of timestamp                             896            918          20         11.2          89.6       0.3X
CENTURY of timestamp                                849            856           7         11.8          84.9       0.3X
DECADE of timestamp                                 765            777          17         13.1          76.5       0.4X
YEAR of timestamp                                   754            756           2         13.3          75.4       0.4X
ISOYEAR of timestamp                                843            849           5         11.9          84.3       0.3X
QUARTER of timestamp                                867            873           9         11.5          86.7       0.3X
MONTH of timestamp                                  758            762           4         13.2          75.8       0.4X
WEEK of timestamp                                  1049           1054           6          9.5         104.9       0.3X
DAY of timestamp                                    750            763          11         13.3          75.0       0.4X
DAYOFWEEK of timestamp                              890            918          25         11.2          89.0       0.3X
DOW of timestamp                                    879            887           8         11.4          87.9       0.3X
ISODOW of timestamp                                 862            869          11         11.6          86.2       0.3X
DOY of timestamp                                    811            868          55         12.3          81.1       0.4X
HOUR of timestamp                                   627            638          11         16.0          62.7       0.5X
MINUTE of timestamp                                 600            606           6         16.7          60.0       0.5X
SECOND of timestamp                                 743            799          51         13.5          74.3       0.4X
MILLISECONDS of timestamp                           723            737          22         13.8          72.3       0.4X
MICROSECONDS of timestamp                           648            653           5         15.4          64.8       0.4X
EPOCH of timestamp                                  780            800          17         12.8          78.0       0.4X
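
Note: within each table, Relative is computed against the first row (the bare
cast baseline: e.g. 287 ms / 896 ms rounds to 0.3X), and Per Row(ns) is
1000 / Rate(M/s). Per SPARK-30409/SPARK-29173, these benchmarks sink their
output into the NoOp datasource instead of calling collect(), so the timings
exclude driver-side row conversion. Below is a minimal Scala sketch of the
kind of query one row above measures; the 10M-row cardinality, single
partition, and exact expression are assumptions, not the benchmark's verbatim
driver code (see org.apache.spark.sql.execution.benchmark.ExtractBenchmark):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().master("local[*]").getOrCreate()
  // A bigint cast to TIMESTAMP is interpreted as seconds since the epoch.
  spark.range(0, 10000000L, 1, 1)
    .selectExpr("EXTRACT(YEAR FROM CAST(id AS TIMESTAMP))")
    .write.format("noop").mode("overwrite").save()  // full evaluation, no rows returned
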
Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.3
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Invoke date_part for timestamp:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
cast to timestamp                                   238            248          12         42.0          23.8       1.0X
MILLENNIUM of timestamp                             862            875          12         11.6          86.2       0.3X
CENTURY of timestamp                                833            847          22         12.0          83.3       0.3X
DECADE of timestamp                                 759            765           7         13.2          75.9       0.3X
YEAR of timestamp                                   744            755          15         13.4          74.4       0.3X
ISOYEAR of timestamp                                937           1019          73         10.7          93.7       0.3X
QUARTER of timestamp                               1011           1091          69          9.9         101.1       0.2X
MONTH of timestamp                                  846            888          40         11.8          84.6       0.3X
WEEK of timestamp                                  1210           1239          41          8.3         121.0       0.2X
DAY of timestamp                                    932            979          41         10.7          93.2       0.3X
DAYOFWEEK of timestamp                             1174           1200          42          8.5         117.4       0.2X
DOW of timestamp                                   1131           1172          37          8.8         113.1       0.2X
ISODOW of timestamp                                 896            903           7         11.2          89.6       0.3X
DOY of timestamp                                    805            818          12         12.4          80.5       0.3X
HOUR of timestamp                                   596            597           2         16.8          59.6       0.4X
MINUTE of timestamp                                 582            597          17         17.2          58.2       0.4X
SECOND of timestamp                                 697            709          16         14.4          69.7       0.3X
MILLISECONDS of timestamp                           700            710          10         14.3          70.0       0.3X
MICROSECONDS of timestamp                           612            631          21         16.3          61.2       0.4X
EPOCH of timestamp                                  755            760           7         13.2          75.5       0.3X
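
DATE_PART('field', source) is the function-call form of EXTRACT(field FROM
source); Spark resolves both to the same underlying expression, which is why
this table closely tracks the extract table above. A hedged variant of the
earlier sketch (same assumed setup):

  spark.range(0, 10000000L, 1, 1)
    .selectExpr("DATE_PART('YEAR', CAST(id AS TIMESTAMP))")
    .write.format("noop").mode("overwrite").save()
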
Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.3
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Invoke extract for date:                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
cast to date                                        633            641           7         15.8          63.3       1.0X
MILLENNIUM of date                                  824            845          26         12.1          82.4       0.8X
CENTURY of date                                     864            878          13         11.6          86.4       0.7X
DECADE of date                                      746            763          17         13.4          74.6       0.8X
YEAR of date                                        752            785          39         13.3          75.2       0.8X
ISOYEAR of date                                     900            905           5         11.1          90.0       0.7X
QUARTER of date                                     906            930          23         11.0          90.6       0.7X
MONTH of date                                       747            752           6         13.4          74.7       0.8X
WEEK of date                                       1089           1100          17          9.2         108.9       0.6X
DAY of date                                         795            803          13         12.6          79.5       0.8X
DAYOFWEEK of date                                   937            944          10         10.7          93.7       0.7X
DOW of date                                         927           1003          66         10.8          92.7       0.7X
ISODOW of date                                      977            980           7         10.2          97.7       0.6X
DOY of date                                         827            877          45         12.1          82.7       0.8X
HOUR of date                                       1525           1547          34          6.6         152.5       0.4X
MINUTE of date                                     1473           1499          23          6.8         147.3       0.4X
SECOND of date                                     1600           1615          15          6.2         160.0       0.4X
MILLISECONDS of date                               1666           1765         156          6.0         166.6       0.4X
MICROSECONDS of date                               1554           1627         127          6.4         155.4       0.4X
EPOCH of date                                      1615           1646          27          6.2         161.5       0.4X
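
The time fields (HOUR through EPOCH) run roughly twice as slow for dates as
for timestamps, which plausibly reflects the implicit promotion of DATE to
TIMESTAMP before a time field can be extracted; treat that as an inference
from the numbers above, not a profiled fact. A sketch of such a row, where
the inner double cast that manufactures the DATE column is an assumption:

  spark.range(0, 10000000L, 1, 1)
    .selectExpr("EXTRACT(HOUR FROM CAST(CAST(id AS TIMESTAMP) AS DATE))")
    .write.format("noop").mode("overwrite").save()
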
Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.3
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Invoke date_part for date:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
cast to date                                        676            690          20         14.8          67.6       1.0X
MILLENNIUM of date                                  868            882          16         11.5          86.8       0.8X
CENTURY of date                                     875            880           8         11.4          87.5       0.8X
DECADE of date                                      762            776          16         13.1          76.2       0.9X
YEAR of date                                        788            800          11         12.7          78.8       0.9X
ISOYEAR of date                                     903            917          12         11.1          90.3       0.7X
QUARTER of date                                     983           1018          40         10.2          98.3       0.7X
MONTH of date                                       836            857          19         12.0          83.6       0.8X
WEEK of date                                       1137           1168          28          8.8         113.7       0.6X
DAY of date                                         768            817          82         13.0          76.8       0.9X
DAYOFWEEK of date                                   890            926          36         11.2          89.0       0.8X
DOW of date                                        1007           1033          39          9.9         100.7       0.7X
ISODOW of date                                      962            969           7         10.4          96.2       0.7X
DOY of date                                         797            882          80         12.6          79.7       0.8X
HOUR of date                                       1449           1482          29          6.9         144.9       0.5X
MINUTE of date                                     1536           1610          69          6.5         153.6       0.4X
SECOND of date                                     1675           1823         128          6.0         167.5       0.4X
MILLISECONDS of date                               1605           1622          18          6.2         160.5       0.4X
MICROSECONDS of date                               1481           1504          32          6.8         148.1       0.5X
EPOCH of date                                      1656           1843         296          6.0         165.6       0.4X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.3
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Invoke date_part for interval:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
cast to interval                                    919            959          45         10.9          91.9       1.0X
MILLENNIUM of interval                              959           1001          36         10.4          95.9       1.0X
CENTURY of interval                                 974            998          21         10.3          97.4       0.9X
DECADE of interval                                  975            989          13         10.3          97.5       0.9X
YEAR of interval                                    987           1012          24         10.1          98.7       0.9X
QUARTER of interval                                1014           1031          25          9.9         101.4       0.9X
MONTH of interval                                   978           1018          45         10.2          97.8       0.9X
DAY of interval                                    1012           1095          92          9.9         101.2       0.9X
HOUR of interval                                   1129           1143          14          8.9         112.9       0.8X
MINUTE of interval                                 1038           1060          25          9.6         103.8       0.9X
SECOND of interval                                 1062           1289         356          9.4         106.2       0.9X
MILLISECONDS of interval                           1142           1226          83          8.8         114.2       0.8X
MICROSECONDS of interval                           1038           1129          79          9.6         103.8       0.9X
EPOCH of interval                                  1090           1127          34          9.2         109.0       0.8X
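
For the interval rows, one plausible way to manufacture an interval column is
timestamp subtraction (in Spark 3.0 the difference of two timestamps is a
CalendarInterval); the expression below is illustrative only, not the
benchmark's own:

  spark.range(0, 10000000L, 1, 1)
    .selectExpr("DATE_PART('DAY', CAST(id AS TIMESTAMP) - TIMESTAMP '2020-01-01 00:00:00')")
    .write.format("noop").mode("overwrite").save()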