spark-instrumented-optimizer/sql/core/benchmarks/ExtractBenchmark-results.txt
Maxim Gekk f5118f81e3 [SPARK-30409][SPARK-29173][SQL][TESTS] Use NoOp datasource in SQL benchmarks
### What changes were proposed in this pull request?
In the PR, I propose to replace `.collect()`, `.count()` and `.foreach(_ => ())` in SQL benchmarks and use the `NoOp` datasource. I added an implicit class to `SqlBasedBenchmark` with the `.noop()` method. It can be used in benchmark like: `ds.noop()`. The last one is unfolded to `ds.write.format("noop").mode(Overwrite).save()`.

### Why are the changes needed?
To avoid additional overhead that `collect()` (and other actions) has. For example, `.collect()` has to convert values according to external types and pull data to the driver. This can hide actual performance regressions or improvements of benchmarked operations.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Re-run all modified benchmarks using Amazon EC2.

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge (spot instance) |
| AMI | ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) |
| Java | OpenJDK8/10 |

- Run `TPCDSQueryBenchmark` using instructions from the PR #26049
```
# `spark-tpcds-datagen` needs this. (JDK8)
$ git clone https://github.com/apache/spark.git -b branch-2.4 --depth 1 spark-2.4
$ export SPARK_HOME=$PWD
$ ./build/mvn clean package -DskipTests

# Generate data. (JDK8)
$ git clone gitgithub.com:maropu/spark-tpcds-datagen.git
$ cd spark-tpcds-datagen/
$ build/mvn clean package
$ mkdir -p /data/tpcds
$ ./bin/dsdgen --output-location /data/tpcds/s1  // This need `Spark 2.4`
```
- Other benchmarks ran by the script:
```
#!/usr/bin/env python3

import os
from sparktestsupport.shellutils import run_cmd

benchmarks = [
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.AggregateBenchmark'],
    ['avro/test', 'org.apache.spark.sql.execution.benchmark.AvroReadBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.BloomFilterBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.DataSourceReadBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.DateTimeBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.ExtractBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.FilterPushdownBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.InExpressionBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.IntervalBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.JoinBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.MakeDateTimeBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.MiscBenchmark'],
    ['hive/test', 'org.apache.spark.sql.execution.benchmark.ObjectHashAggregateExecBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.OrcNestedSchemaPruningBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.OrcV2NestedSchemaPruningBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.ParquetNestedSchemaPruningBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.RangeBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.UDFBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.WideSchemaBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.WideTableBenchmark'],
    ['hive/test', 'org.apache.spark.sql.hive.orc.OrcReadBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.datasources.csv.CSVBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.datasources.json.JsonBenchmark']
]

print('Set SPARK_GENERATE_BENCHMARK_FILES=1')
os.environ['SPARK_GENERATE_BENCHMARK_FILES'] = '1'

for b in benchmarks:
    print("Run benchmark: %s" % b[1])
    run_cmd(['build/sbt', '%s:runMain %s' % (b[0], b[1])])
```

Closes #27078 from MaxGekk/noop-in-benchmarks.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-12 13:18:19 -08:00

120 lines
13 KiB
Plaintext

OpenJDK 64-Bit Server VM 1.8.0_232-8u232-b09-0ubuntu1~18.04.1-b09 on Linux 4.15.0-1044-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Invoke extract for timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
cast to timestamp 409 457 45 24.5 40.9 1.0X
MILLENNIUM of timestamp 1240 1295 57 8.1 124.0 0.3X
CENTURY of timestamp 1186 1240 49 8.4 118.6 0.3X
DECADE of timestamp 1083 1104 20 9.2 108.3 0.4X
YEAR of timestamp 1061 1073 15 9.4 106.1 0.4X
ISOYEAR of timestamp 1198 1213 25 8.3 119.8 0.3X
QUARTER of timestamp 1304 1322 23 7.7 130.4 0.3X
MONTH of timestamp 1052 1067 19 9.5 105.2 0.4X
WEEK of timestamp 1534 1558 25 6.5 153.4 0.3X
DAY of timestamp 1038 1057 26 9.6 103.8 0.4X
DAYOFWEEK of timestamp 1226 1239 22 8.2 122.6 0.3X
DOW of timestamp 1212 1224 13 8.3 121.2 0.3X
ISODOW of timestamp 1148 1165 24 8.7 114.8 0.4X
DOY of timestamp 1066 1075 14 9.4 106.6 0.4X
HOUR of timestamp 358 362 6 27.9 35.8 1.1X
MINUTE of timestamp 364 369 4 27.4 36.4 1.1X
SECOND of timestamp 453 471 26 22.1 45.3 0.9X
MILLISECONDS of timestamp 497 500 2 20.1 49.7 0.8X
MICROSECONDS of timestamp 360 363 4 27.8 36.0 1.1X
EPOCH of timestamp 1100 1104 5 9.1 110.0 0.4X
OpenJDK 64-Bit Server VM 1.8.0_232-8u232-b09-0ubuntu1~18.04.1-b09 on Linux 4.15.0-1044-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Invoke date_part for timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
cast to timestamp 315 317 2 31.8 31.5 1.0X
MILLENNIUM of timestamp 1155 1162 8 8.7 115.5 0.3X
CENTURY of timestamp 1146 1152 5 8.7 114.6 0.3X
DECADE of timestamp 1031 1043 11 9.7 103.1 0.3X
YEAR of timestamp 1033 1041 10 9.7 103.3 0.3X
ISOYEAR of timestamp 1274 1278 5 7.8 127.4 0.2X
QUARTER of timestamp 1326 1346 30 7.5 132.6 0.2X
MONTH of timestamp 1027 1031 7 9.7 102.7 0.3X
WEEK of timestamp 1529 1535 6 6.5 152.9 0.2X
DAY of timestamp 1024 1031 9 9.8 102.4 0.3X
DAYOFWEEK of timestamp 1197 1201 5 8.4 119.7 0.3X
DOW of timestamp 1201 1218 15 8.3 120.1 0.3X
ISODOW of timestamp 1143 1149 6 8.8 114.3 0.3X
DOY of timestamp 1074 1085 9 9.3 107.4 0.3X
HOUR of timestamp 354 354 0 28.3 35.4 0.9X
MINUTE of timestamp 358 361 5 27.9 35.8 0.9X
SECOND of timestamp 445 456 17 22.5 44.5 0.7X
MILLISECONDS of timestamp 504 514 12 19.8 50.4 0.6X
MICROSECONDS of timestamp 361 370 8 27.7 36.1 0.9X
EPOCH of timestamp 1111 1117 9 9.0 111.1 0.3X
OpenJDK 64-Bit Server VM 1.8.0_232-8u232-b09-0ubuntu1~18.04.1-b09 on Linux 4.15.0-1044-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Invoke extract for date: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
cast to date 849 851 3 11.8 84.9 1.0X
MILLENNIUM of date 1129 1139 11 8.9 112.9 0.8X
CENTURY of date 1136 1143 7 8.8 113.6 0.7X
DECADE of date 1039 1043 5 9.6 103.9 0.8X
YEAR of date 1030 1037 10 9.7 103.0 0.8X
ISOYEAR of date 1269 1278 9 7.9 126.9 0.7X
QUARTER of date 1323 1330 6 7.6 132.3 0.6X
MONTH of date 1021 1023 2 9.8 102.1 0.8X
WEEK of date 1541 1549 8 6.5 154.1 0.6X
DAY of date 1021 1033 13 9.8 102.1 0.8X
DAYOFWEEK of date 1196 1209 11 8.4 119.6 0.7X
DOW of date 1214 1229 13 8.2 121.4 0.7X
ISODOW of date 1148 1153 7 8.7 114.8 0.7X
DOY of date 1073 1079 5 9.3 107.3 0.8X
HOUR of date 1311 1314 4 7.6 131.1 0.6X
MINUTE of date 1311 1311 1 7.6 131.1 0.6X
SECOND of date 1420 1434 13 7.0 142.0 0.6X
MILLISECONDS of date 1426 1442 14 7.0 142.6 0.6X
MICROSECONDS of date 1312 1318 6 7.6 131.2 0.6X
EPOCH of date 2034 2050 16 4.9 203.4 0.4X
OpenJDK 64-Bit Server VM 1.8.0_232-8u232-b09-0ubuntu1~18.04.1-b09 on Linux 4.15.0-1044-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Invoke date_part for date: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
cast to date 852 879 42 11.7 85.2 1.0X
MILLENNIUM of date 1131 1136 7 8.8 113.1 0.8X
CENTURY of date 1138 1145 6 8.8 113.8 0.7X
DECADE of date 1030 1043 13 9.7 103.0 0.8X
YEAR of date 1022 1028 8 9.8 102.2 0.8X
ISOYEAR of date 1260 1265 6 7.9 126.0 0.7X
QUARTER of date 1326 1330 7 7.5 132.6 0.6X
MONTH of date 1014 1034 26 9.9 101.4 0.8X
WEEK of date 1523 1526 5 6.6 152.3 0.6X
DAY of date 1022 1023 2 9.8 102.2 0.8X
DAYOFWEEK of date 1197 1203 9 8.4 119.7 0.7X
DOW of date 1188 1198 16 8.4 118.8 0.7X
ISODOW of date 1143 1153 9 8.8 114.3 0.7X
DOY of date 1052 1058 7 9.5 105.2 0.8X
HOUR of date 1309 1311 4 7.6 130.9 0.7X
MINUTE of date 1302 1305 6 7.7 130.2 0.7X
SECOND of date 1414 1432 16 7.1 141.4 0.6X
MILLISECONDS of date 1441 1450 11 6.9 144.1 0.6X
MICROSECONDS of date 1292 1301 8 7.7 129.2 0.7X
EPOCH of date 2030 2036 8 4.9 203.0 0.4X
OpenJDK 64-Bit Server VM 1.8.0_232-8u232-b09-0ubuntu1~18.04.1-b09 on Linux 4.15.0-1044-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Invoke date_part for interval: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
cast to interval 1249 1254 6 8.0 124.9 1.0X
MILLENNIUM of interval 1310 1316 9 7.6 131.0 1.0X
CENTURY of interval 1304 1315 10 7.7 130.4 1.0X
DECADE of interval 1306 1313 7 7.7 130.6 1.0X
YEAR of interval 1304 1313 11 7.7 130.4 1.0X
QUARTER of interval 1310 1317 7 7.6 131.0 1.0X
MONTH of interval 1311 1319 12 7.6 131.1 1.0X
DAY of interval 1295 1304 13 7.7 129.5 1.0X
HOUR of interval 1301 1306 8 7.7 130.1 1.0X
MINUTE of interval 1316 1319 3 7.6 131.6 0.9X
SECOND of interval 1437 1440 3 7.0 143.7 0.9X
MILLISECONDS of interval 1435 1449 16 7.0 143.5 0.9X
MICROSECONDS of interval 1304 1314 9 7.7 130.4 1.0X
EPOCH of interval 1440 1453 19 6.9 144.0 0.9X