spark-instrumented-optimizer/sql/core/benchmarks/ExtractBenchmark-jdk11-results.txt
Maxim Gekk f5118f81e3 [SPARK-30409][SPARK-29173][SQL][TESTS] Use NoOp datasource in SQL benchmarks
### What changes were proposed in this pull request?
In the PR, I propose to replace `.collect()`, `.count()` and `.foreach(_ => ())` in SQL benchmarks and use the `NoOp` datasource. I added an implicit class to `SqlBasedBenchmark` with the `.noop()` method. It can be used in benchmark like: `ds.noop()`. The last one is unfolded to `ds.write.format("noop").mode(Overwrite).save()`.

### Why are the changes needed?
To avoid additional overhead that `collect()` (and other actions) has. For example, `.collect()` has to convert values according to external types and pull data to the driver. This can hide actual performance regressions or improvements of benchmarked operations.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Re-run all modified benchmarks using Amazon EC2.

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge (spot instance) |
| AMI | ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) |
| Java | OpenJDK8/10 |

- Run `TPCDSQueryBenchmark` using instructions from the PR #26049
```
# `spark-tpcds-datagen` needs this. (JDK8)
$ git clone https://github.com/apache/spark.git -b branch-2.4 --depth 1 spark-2.4
$ export SPARK_HOME=$PWD
$ ./build/mvn clean package -DskipTests

# Generate data. (JDK8)
$ git clone gitgithub.com:maropu/spark-tpcds-datagen.git
$ cd spark-tpcds-datagen/
$ build/mvn clean package
$ mkdir -p /data/tpcds
$ ./bin/dsdgen --output-location /data/tpcds/s1  // This need `Spark 2.4`
```
- Other benchmarks ran by the script:
```
#!/usr/bin/env python3

import os
from sparktestsupport.shellutils import run_cmd

benchmarks = [
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.AggregateBenchmark'],
    ['avro/test', 'org.apache.spark.sql.execution.benchmark.AvroReadBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.BloomFilterBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.DataSourceReadBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.DateTimeBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.ExtractBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.FilterPushdownBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.InExpressionBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.IntervalBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.JoinBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.MakeDateTimeBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.MiscBenchmark'],
    ['hive/test', 'org.apache.spark.sql.execution.benchmark.ObjectHashAggregateExecBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.OrcNestedSchemaPruningBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.OrcV2NestedSchemaPruningBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.ParquetNestedSchemaPruningBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.RangeBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.UDFBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.WideSchemaBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.WideTableBenchmark'],
    ['hive/test', 'org.apache.spark.sql.hive.orc.OrcReadBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.datasources.csv.CSVBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.datasources.json.JsonBenchmark']
]

print('Set SPARK_GENERATE_BENCHMARK_FILES=1')
os.environ['SPARK_GENERATE_BENCHMARK_FILES'] = '1'

for b in benchmarks:
    print("Run benchmark: %s" % b[1])
    run_cmd(['build/sbt', '%s:runMain %s' % (b[0], b[1])])
```

Closes #27078 from MaxGekk/noop-in-benchmarks.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-12 13:18:19 -08:00

120 lines
13 KiB
Plaintext

OpenJDK 64-Bit Server VM 11.0.5+10-post-Ubuntu-0ubuntu1.118.04 on Linux 4.15.0-1044-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Invoke extract for timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
cast to timestamp 380 456 67 26.3 38.0 1.0X
MILLENNIUM of timestamp 1274 1361 138 7.9 127.4 0.3X
CENTURY of timestamp 1119 1132 19 8.9 111.9 0.3X
DECADE of timestamp 1076 1083 6 9.3 107.6 0.4X
YEAR of timestamp 1066 1098 43 9.4 106.6 0.4X
ISOYEAR of timestamp 1190 1194 4 8.4 119.0 0.3X
QUARTER of timestamp 1269 1273 4 7.9 126.9 0.3X
MONTH of timestamp 1060 1075 22 9.4 106.0 0.4X
WEEK of timestamp 1560 1565 8 6.4 156.0 0.2X
DAY of timestamp 1039 1046 8 9.6 103.9 0.4X
DAYOFWEEK of timestamp 1248 1274 24 8.0 124.8 0.3X
DOW of timestamp 1252 1273 25 8.0 125.2 0.3X
ISODOW of timestamp 1195 1204 9 8.4 119.5 0.3X
DOY of timestamp 1081 1086 6 9.3 108.1 0.4X
HOUR of timestamp 778 781 5 12.9 77.8 0.5X
MINUTE of timestamp 779 780 1 12.8 77.9 0.5X
SECOND of timestamp 597 611 20 16.7 59.7 0.6X
MILLISECONDS of timestamp 636 642 6 15.7 63.6 0.6X
MICROSECONDS of timestamp 498 504 5 20.1 49.8 0.8X
EPOCH of timestamp 946 956 9 10.6 94.6 0.4X
OpenJDK 64-Bit Server VM 11.0.5+10-post-Ubuntu-0ubuntu1.118.04 on Linux 4.15.0-1044-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Invoke date_part for timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
cast to timestamp 356 362 9 28.1 35.6 1.0X
MILLENNIUM of timestamp 1260 1297 41 7.9 126.0 0.3X
CENTURY of timestamp 1082 1085 3 9.2 108.2 0.3X
DECADE of timestamp 1056 1068 19 9.5 105.6 0.3X
YEAR of timestamp 1045 1053 13 9.6 104.5 0.3X
ISOYEAR of timestamp 1300 1316 25 7.7 130.0 0.3X
QUARTER of timestamp 1279 1280 2 7.8 127.9 0.3X
MONTH of timestamp 1037 1046 11 9.6 103.7 0.3X
WEEK of timestamp 1539 1557 28 6.5 153.9 0.2X
DAY of timestamp 1032 1038 6 9.7 103.2 0.3X
DAYOFWEEK of timestamp 1241 1244 4 8.1 124.1 0.3X
DOW of timestamp 1237 1241 7 8.1 123.7 0.3X
ISODOW of timestamp 1155 1158 3 8.7 115.5 0.3X
DOY of timestamp 1075 1080 4 9.3 107.5 0.3X
HOUR of timestamp 766 770 5 13.1 76.6 0.5X
MINUTE of timestamp 764 769 4 13.1 76.4 0.5X
SECOND of timestamp 590 592 2 16.9 59.0 0.6X
MILLISECONDS of timestamp 627 636 10 16.0 62.7 0.6X
MICROSECONDS of timestamp 493 505 15 20.3 49.3 0.7X
EPOCH of timestamp 962 966 4 10.4 96.2 0.4X
OpenJDK 64-Bit Server VM 11.0.5+10-post-Ubuntu-0ubuntu1.118.04 on Linux 4.15.0-1044-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Invoke extract for date: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
cast to date 886 905 17 11.3 88.6 1.0X
MILLENNIUM of date 1253 1261 7 8.0 125.3 0.7X
CENTURY of date 1068 1079 10 9.4 106.8 0.8X
DECADE of date 1040 1068 25 9.6 104.0 0.9X
YEAR of date 1032 1043 11 9.7 103.2 0.9X
ISOYEAR of date 1304 1313 12 7.7 130.4 0.7X
QUARTER of date 1284 1301 16 7.8 128.4 0.7X
MONTH of date 1033 1036 4 9.7 103.3 0.9X
WEEK of date 1535 1545 8 6.5 153.5 0.6X
DAY of date 1023 1033 11 9.8 102.3 0.9X
DAYOFWEEK of date 1230 1236 6 8.1 123.0 0.7X
DOW of date 1238 1247 9 8.1 123.8 0.7X
ISODOW of date 1159 1169 17 8.6 115.9 0.8X
DOY of date 1082 1084 3 9.2 108.2 0.8X
HOUR of date 1879 1891 11 5.3 187.9 0.5X
MINUTE of date 1881 1905 21 5.3 188.1 0.5X
SECOND of date 1718 1724 5 5.8 171.8 0.5X
MILLISECONDS of date 1733 1737 6 5.8 173.3 0.5X
MICROSECONDS of date 1629 1644 23 6.1 162.9 0.5X
EPOCH of date 2085 2090 5 4.8 208.5 0.4X
OpenJDK 64-Bit Server VM 11.0.5+10-post-Ubuntu-0ubuntu1.118.04 on Linux 4.15.0-1044-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Invoke date_part for date: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
cast to date 888 891 6 11.3 88.8 1.0X
MILLENNIUM of date 1250 1260 17 8.0 125.0 0.7X
CENTURY of date 1070 1076 6 9.3 107.0 0.8X
DECADE of date 1036 1041 5 9.6 103.6 0.9X
YEAR of date 1037 1038 1 9.6 103.7 0.9X
ISOYEAR of date 1300 1307 9 7.7 130.0 0.7X
QUARTER of date 1267 1277 9 7.9 126.7 0.7X
MONTH of date 1034 1037 4 9.7 103.4 0.9X
WEEK of date 1543 1554 10 6.5 154.3 0.6X
DAY of date 1022 1030 12 9.8 102.2 0.9X
DAYOFWEEK of date 1230 1232 4 8.1 123.0 0.7X
DOW of date 1227 1242 15 8.1 122.7 0.7X
ISODOW of date 1157 1173 20 8.6 115.7 0.8X
DOY of date 1073 1083 18 9.3 107.3 0.8X
HOUR of date 1873 1878 7 5.3 187.3 0.5X
MINUTE of date 1861 1876 14 5.4 186.1 0.5X
SECOND of date 1717 1724 6 5.8 171.7 0.5X
MILLISECONDS of date 1729 1736 7 5.8 172.9 0.5X
MICROSECONDS of date 1622 1627 5 6.2 162.2 0.5X
EPOCH of date 2066 2079 19 4.8 206.6 0.4X
OpenJDK 64-Bit Server VM 11.0.5+10-post-Ubuntu-0ubuntu1.118.04 on Linux 4.15.0-1044-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Invoke date_part for interval: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
cast to interval 1273 1279 6 7.9 127.3 1.0X
MILLENNIUM of interval 1323 1352 36 7.6 132.3 1.0X
CENTURY of interval 1353 1356 4 7.4 135.3 0.9X
DECADE of interval 1326 1339 11 7.5 132.6 1.0X
YEAR of interval 1341 1345 3 7.5 134.1 0.9X
QUARTER of interval 1368 1372 4 7.3 136.8 0.9X
MONTH of interval 1320 1326 6 7.6 132.0 1.0X
DAY of interval 1306 1310 4 7.7 130.6 1.0X
HOUR of interval 1341 1347 8 7.5 134.1 0.9X
MINUTE of interval 1337 1349 11 7.5 133.7 1.0X
SECOND of interval 1450 1451 1 6.9 145.0 0.9X
MILLISECONDS of interval 1476 1490 23 6.8 147.6 0.9X
MICROSECONDS of interval 1316 1331 25 7.6 131.6 1.0X
EPOCH of interval 1461 1462 1 6.8 146.1 0.9X