2018-10-20 20:31:13 -04:00
|
|
|
================================================================================================
|
|
|
|
parsing large select expressions
|
|
|
|
================================================================================================
|
2016-09-03 13:02:20 -04:00
|
|
|
|
2021-04-03 16:02:56 -04:00
|
|
|
OpenJDK 64-Bit Server VM 1.8.0_282-b08 on Linux 5.4.0-1043-azure
|
|
|
|
Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
|
[SPARK-30409][SPARK-29173][SQL][TESTS] Use `NoOp` datasource in SQL benchmarks
### What changes were proposed in this pull request?
In the PR, I propose to replace `.collect()`, `.count()` and `.foreach(_ => ())` in SQL benchmarks and use the `NoOp` datasource. I added an implicit class to `SqlBasedBenchmark` with the `.noop()` method. It can be used in benchmark like: `ds.noop()`. The last one is unfolded to `ds.write.format("noop").mode(Overwrite).save()`.
### Why are the changes needed?
To avoid additional overhead that `collect()` (and other actions) has. For example, `.collect()` has to convert values according to external types and pull data to the driver. This can hide actual performance regressions or improvements of benchmarked operations.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Re-run all modified benchmarks using Amazon EC2.
| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge (spot instance) |
| AMI | ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) |
| Java | OpenJDK8/10 |
- Run `TPCDSQueryBenchmark` using instructions from the PR #26049
```
# `spark-tpcds-datagen` needs this. (JDK8)
$ git clone https://github.com/apache/spark.git -b branch-2.4 --depth 1 spark-2.4
$ export SPARK_HOME=$PWD
$ ./build/mvn clean package -DskipTests
# Generate data. (JDK8)
$ git clone gitgithub.com:maropu/spark-tpcds-datagen.git
$ cd spark-tpcds-datagen/
$ build/mvn clean package
$ mkdir -p /data/tpcds
$ ./bin/dsdgen --output-location /data/tpcds/s1 // This need `Spark 2.4`
```
- Other benchmarks ran by the script:
```
#!/usr/bin/env python3
import os
from sparktestsupport.shellutils import run_cmd
benchmarks = [
['sql/test', 'org.apache.spark.sql.execution.benchmark.AggregateBenchmark'],
['avro/test', 'org.apache.spark.sql.execution.benchmark.AvroReadBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.BloomFilterBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.DataSourceReadBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.DateTimeBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.ExtractBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.FilterPushdownBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.InExpressionBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.IntervalBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.JoinBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.MakeDateTimeBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.MiscBenchmark'],
['hive/test', 'org.apache.spark.sql.execution.benchmark.ObjectHashAggregateExecBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.OrcNestedSchemaPruningBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.OrcV2NestedSchemaPruningBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.ParquetNestedSchemaPruningBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.RangeBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.UDFBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.WideSchemaBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.WideTableBenchmark'],
['hive/test', 'org.apache.spark.sql.hive.orc.OrcReadBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.datasources.csv.CSVBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.datasources.json.JsonBenchmark']
]
print('Set SPARK_GENERATE_BENCHMARK_FILES=1')
os.environ['SPARK_GENERATE_BENCHMARK_FILES'] = '1'
for b in benchmarks:
print("Run benchmark: %s" % b[1])
run_cmd(['build/sbt', '%s:runMain %s' % (b[0], b[1])])
```
Closes #27078 from MaxGekk/noop-in-benchmarks.
Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-12 16:18:19 -05:00
|
|
|
parsing large select: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
|
|
|
|
------------------------------------------------------------------------------------------------------------------------
|
2021-04-03 16:02:56 -04:00
|
|
|
1 select expressions 1 2 0 0.0 1296117.0 1.0X
|
|
|
|
100 select expressions 9 11 1 0.0 8808690.0 0.1X
|
|
|
|
2500 select expressions 422 426 5 0.0 421632363.0 0.0X
|
2016-09-03 13:02:20 -04:00
|
|
|
|
2016-06-11 18:26:08 -04:00
|
|
|
|
2018-10-20 20:31:13 -04:00
|
|
|
================================================================================================
|
|
|
|
many column field read and write
|
|
|
|
================================================================================================
|
|
|
|
|
2021-04-03 16:02:56 -04:00
|
|
|
OpenJDK 64-Bit Server VM 1.8.0_282-b08 on Linux 5.4.0-1043-azure
|
|
|
|
Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
|
[SPARK-30409][SPARK-29173][SQL][TESTS] Use `NoOp` datasource in SQL benchmarks
### What changes were proposed in this pull request?
In the PR, I propose to replace `.collect()`, `.count()` and `.foreach(_ => ())` in SQL benchmarks and use the `NoOp` datasource. I added an implicit class to `SqlBasedBenchmark` with the `.noop()` method. It can be used in benchmark like: `ds.noop()`. The last one is unfolded to `ds.write.format("noop").mode(Overwrite).save()`.
### Why are the changes needed?
To avoid additional overhead that `collect()` (and other actions) has. For example, `.collect()` has to convert values according to external types and pull data to the driver. This can hide actual performance regressions or improvements of benchmarked operations.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Re-run all modified benchmarks using Amazon EC2.
| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge (spot instance) |
| AMI | ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) |
| Java | OpenJDK8/10 |
- Run `TPCDSQueryBenchmark` using instructions from the PR #26049
```
# `spark-tpcds-datagen` needs this. (JDK8)
$ git clone https://github.com/apache/spark.git -b branch-2.4 --depth 1 spark-2.4
$ export SPARK_HOME=$PWD
$ ./build/mvn clean package -DskipTests
# Generate data. (JDK8)
$ git clone gitgithub.com:maropu/spark-tpcds-datagen.git
$ cd spark-tpcds-datagen/
$ build/mvn clean package
$ mkdir -p /data/tpcds
$ ./bin/dsdgen --output-location /data/tpcds/s1 // This need `Spark 2.4`
```
- Other benchmarks ran by the script:
```
#!/usr/bin/env python3
import os
from sparktestsupport.shellutils import run_cmd
benchmarks = [
['sql/test', 'org.apache.spark.sql.execution.benchmark.AggregateBenchmark'],
['avro/test', 'org.apache.spark.sql.execution.benchmark.AvroReadBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.BloomFilterBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.DataSourceReadBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.DateTimeBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.ExtractBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.FilterPushdownBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.InExpressionBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.IntervalBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.JoinBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.MakeDateTimeBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.MiscBenchmark'],
['hive/test', 'org.apache.spark.sql.execution.benchmark.ObjectHashAggregateExecBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.OrcNestedSchemaPruningBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.OrcV2NestedSchemaPruningBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.ParquetNestedSchemaPruningBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.RangeBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.UDFBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.WideSchemaBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.WideTableBenchmark'],
['hive/test', 'org.apache.spark.sql.hive.orc.OrcReadBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.datasources.csv.CSVBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.datasources.json.JsonBenchmark']
]
print('Set SPARK_GENERATE_BENCHMARK_FILES=1')
os.environ['SPARK_GENERATE_BENCHMARK_FILES'] = '1'
for b in benchmarks:
print("Run benchmark: %s" % b[1])
run_cmd(['build/sbt', '%s:runMain %s' % (b[0], b[1])])
```
Closes #27078 from MaxGekk/noop-in-benchmarks.
Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-12 16:18:19 -05:00
|
|
|
many column field r/w: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
|
|
|
|
------------------------------------------------------------------------------------------------------------------------
|
2021-04-03 16:02:56 -04:00
|
|
|
1 cols x 100000 rows (read in-mem) 15 18 3 6.5 154.0 1.0X
|
|
|
|
1 cols x 100000 rows (exec in-mem) 20 23 4 4.9 202.1 0.8X
|
|
|
|
1 cols x 100000 rows (read parquet) 34 41 12 3.0 338.8 0.5X
|
|
|
|
1 cols x 100000 rows (write parquet) 123 132 6 0.8 1226.7 0.1X
|
|
|
|
100 cols x 1000 rows (read in-mem) 22 25 4 4.5 221.5 0.7X
|
|
|
|
100 cols x 1000 rows (exec in-mem) 36 40 5 2.8 357.2 0.4X
|
|
|
|
100 cols x 1000 rows (read parquet) 33 41 15 3.0 331.2 0.5X
|
|
|
|
100 cols x 1000 rows (write parquet) 120 133 10 0.8 1199.6 0.1X
|
|
|
|
2500 cols x 40 rows (read in-mem) 165 179 9 0.6 1645.1 0.1X
|
|
|
|
2500 cols x 40 rows (exec in-mem) 734 784 51 0.1 7338.1 0.0X
|
|
|
|
2500 cols x 40 rows (read parquet) 621 634 18 0.2 6210.9 0.0X
|
|
|
|
2500 cols x 40 rows (write parquet) 268 283 18 0.4 2675.3 0.1X
|
2018-10-20 20:31:13 -04:00
|
|
|
|
|
|
|
|
|
|
|
================================================================================================
|
|
|
|
wide shallowly nested struct field read and write
|
|
|
|
================================================================================================
|
2016-06-11 18:26:08 -04:00
|
|
|
|
2021-04-03 16:02:56 -04:00
|
|
|
OpenJDK 64-Bit Server VM 1.8.0_282-b08 on Linux 5.4.0-1043-azure
|
|
|
|
Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
|
[SPARK-30409][SPARK-29173][SQL][TESTS] Use `NoOp` datasource in SQL benchmarks
### What changes were proposed in this pull request?
In the PR, I propose to replace `.collect()`, `.count()` and `.foreach(_ => ())` in SQL benchmarks and use the `NoOp` datasource. I added an implicit class to `SqlBasedBenchmark` with the `.noop()` method. It can be used in benchmark like: `ds.noop()`. The last one is unfolded to `ds.write.format("noop").mode(Overwrite).save()`.
### Why are the changes needed?
To avoid additional overhead that `collect()` (and other actions) has. For example, `.collect()` has to convert values according to external types and pull data to the driver. This can hide actual performance regressions or improvements of benchmarked operations.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Re-run all modified benchmarks using Amazon EC2.
| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge (spot instance) |
| AMI | ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) |
| Java | OpenJDK8/10 |
- Run `TPCDSQueryBenchmark` using instructions from the PR #26049
```
# `spark-tpcds-datagen` needs this. (JDK8)
$ git clone https://github.com/apache/spark.git -b branch-2.4 --depth 1 spark-2.4
$ export SPARK_HOME=$PWD
$ ./build/mvn clean package -DskipTests
# Generate data. (JDK8)
$ git clone gitgithub.com:maropu/spark-tpcds-datagen.git
$ cd spark-tpcds-datagen/
$ build/mvn clean package
$ mkdir -p /data/tpcds
$ ./bin/dsdgen --output-location /data/tpcds/s1 // This need `Spark 2.4`
```
- Other benchmarks ran by the script:
```
#!/usr/bin/env python3
import os
from sparktestsupport.shellutils import run_cmd
benchmarks = [
['sql/test', 'org.apache.spark.sql.execution.benchmark.AggregateBenchmark'],
['avro/test', 'org.apache.spark.sql.execution.benchmark.AvroReadBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.BloomFilterBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.DataSourceReadBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.DateTimeBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.ExtractBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.FilterPushdownBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.InExpressionBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.IntervalBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.JoinBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.MakeDateTimeBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.MiscBenchmark'],
['hive/test', 'org.apache.spark.sql.execution.benchmark.ObjectHashAggregateExecBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.OrcNestedSchemaPruningBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.OrcV2NestedSchemaPruningBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.ParquetNestedSchemaPruningBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.RangeBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.UDFBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.WideSchemaBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.WideTableBenchmark'],
['hive/test', 'org.apache.spark.sql.hive.orc.OrcReadBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.datasources.csv.CSVBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.datasources.json.JsonBenchmark']
]
print('Set SPARK_GENERATE_BENCHMARK_FILES=1')
os.environ['SPARK_GENERATE_BENCHMARK_FILES'] = '1'
for b in benchmarks:
print("Run benchmark: %s" % b[1])
run_cmd(['build/sbt', '%s:runMain %s' % (b[0], b[1])])
```
Closes #27078 from MaxGekk/noop-in-benchmarks.
Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-12 16:18:19 -05:00
|
|
|
wide shallowly nested struct field r/w: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
|
|
|
|
------------------------------------------------------------------------------------------------------------------------
|
2021-04-03 16:02:56 -04:00
|
|
|
1 wide x 100000 rows (read in-mem) 27 30 5 3.8 265.0 1.0X
|
|
|
|
1 wide x 100000 rows (exec in-mem) 33 37 5 3.0 331.8 0.8X
|
|
|
|
1 wide x 100000 rows (read parquet) 54 60 9 1.9 536.1 0.5X
|
|
|
|
1 wide x 100000 rows (write parquet) 129 140 6 0.8 1293.8 0.2X
|
|
|
|
100 wide x 1000 rows (read in-mem) 21 26 6 4.8 210.5 1.3X
|
|
|
|
100 wide x 1000 rows (exec in-mem) 43 51 6 2.3 434.4 0.6X
|
|
|
|
100 wide x 1000 rows (read parquet) 34 39 8 2.9 340.2 0.8X
|
|
|
|
100 wide x 1000 rows (write parquet) 119 128 4 0.8 1188.9 0.2X
|
|
|
|
2500 wide x 40 rows (read in-mem) 31 37 7 3.2 310.6 0.9X
|
|
|
|
2500 wide x 40 rows (exec in-mem) 524 559 37 0.2 5242.4 0.1X
|
|
|
|
2500 wide x 40 rows (read parquet) 139 152 6 0.7 1393.4 0.2X
|
|
|
|
2500 wide x 40 rows (write parquet) 130 151 42 0.8 1300.7 0.2X
|
2016-06-11 18:26:08 -04:00
|
|
|
|
2018-10-20 20:31:13 -04:00
|
|
|
|
|
|
|
================================================================================================
|
|
|
|
deeply nested struct field read and write
|
|
|
|
================================================================================================
|
|
|
|
|
2021-04-03 16:02:56 -04:00
|
|
|
OpenJDK 64-Bit Server VM 1.8.0_282-b08 on Linux 5.4.0-1043-azure
|
|
|
|
Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
|
[SPARK-30409][SPARK-29173][SQL][TESTS] Use `NoOp` datasource in SQL benchmarks
### What changes were proposed in this pull request?
In the PR, I propose to replace `.collect()`, `.count()` and `.foreach(_ => ())` in SQL benchmarks and use the `NoOp` datasource. I added an implicit class to `SqlBasedBenchmark` with the `.noop()` method. It can be used in benchmark like: `ds.noop()`. The last one is unfolded to `ds.write.format("noop").mode(Overwrite).save()`.
### Why are the changes needed?
To avoid additional overhead that `collect()` (and other actions) has. For example, `.collect()` has to convert values according to external types and pull data to the driver. This can hide actual performance regressions or improvements of benchmarked operations.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Re-run all modified benchmarks using Amazon EC2.
| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge (spot instance) |
| AMI | ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) |
| Java | OpenJDK8/10 |
- Run `TPCDSQueryBenchmark` using instructions from the PR #26049
```
# `spark-tpcds-datagen` needs this. (JDK8)
$ git clone https://github.com/apache/spark.git -b branch-2.4 --depth 1 spark-2.4
$ export SPARK_HOME=$PWD
$ ./build/mvn clean package -DskipTests
# Generate data. (JDK8)
$ git clone gitgithub.com:maropu/spark-tpcds-datagen.git
$ cd spark-tpcds-datagen/
$ build/mvn clean package
$ mkdir -p /data/tpcds
$ ./bin/dsdgen --output-location /data/tpcds/s1 // This need `Spark 2.4`
```
- Other benchmarks ran by the script:
```
#!/usr/bin/env python3
import os
from sparktestsupport.shellutils import run_cmd
benchmarks = [
['sql/test', 'org.apache.spark.sql.execution.benchmark.AggregateBenchmark'],
['avro/test', 'org.apache.spark.sql.execution.benchmark.AvroReadBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.BloomFilterBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.DataSourceReadBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.DateTimeBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.ExtractBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.FilterPushdownBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.InExpressionBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.IntervalBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.JoinBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.MakeDateTimeBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.MiscBenchmark'],
['hive/test', 'org.apache.spark.sql.execution.benchmark.ObjectHashAggregateExecBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.OrcNestedSchemaPruningBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.OrcV2NestedSchemaPruningBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.ParquetNestedSchemaPruningBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.RangeBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.UDFBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.WideSchemaBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.WideTableBenchmark'],
['hive/test', 'org.apache.spark.sql.hive.orc.OrcReadBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.datasources.csv.CSVBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.datasources.json.JsonBenchmark']
]
print('Set SPARK_GENERATE_BENCHMARK_FILES=1')
os.environ['SPARK_GENERATE_BENCHMARK_FILES'] = '1'
for b in benchmarks:
print("Run benchmark: %s" % b[1])
run_cmd(['build/sbt', '%s:runMain %s' % (b[0], b[1])])
```
Closes #27078 from MaxGekk/noop-in-benchmarks.
Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-12 16:18:19 -05:00
|
|
|
deeply nested struct field r/w: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
|
|
|
|
------------------------------------------------------------------------------------------------------------------------
|
2021-04-03 16:02:56 -04:00
|
|
|
1 deep x 100000 rows (read in-mem) 21 25 4 4.7 214.6 1.0X
|
|
|
|
1 deep x 100000 rows (exec in-mem) 27 31 6 3.7 266.7 0.8X
|
|
|
|
1 deep x 100000 rows (read parquet) 43 51 10 2.3 431.2 0.5X
|
|
|
|
1 deep x 100000 rows (write parquet) 121 125 4 0.8 1205.3 0.2X
|
|
|
|
100 deep x 1000 rows (read in-mem) 71 77 4 1.4 707.6 0.3X
|
|
|
|
100 deep x 1000 rows (exec in-mem) 675 685 13 0.1 6746.2 0.0X
|
|
|
|
100 deep x 1000 rows (read parquet) 583 588 9 0.2 5832.6 0.0X
|
|
|
|
100 deep x 1000 rows (write parquet) 169 183 15 0.6 1687.8 0.1X
|
|
|
|
250 deep x 400 rows (read in-mem) 264 276 11 0.4 2642.9 0.1X
|
|
|
|
250 deep x 400 rows (exec in-mem) 4254 4289 50 0.0 42538.9 0.0X
|
|
|
|
250 deep x 400 rows (read parquet) 3280 3348 96 0.0 32796.7 0.0X
|
|
|
|
250 deep x 400 rows (write parquet) 392 406 14 0.3 3915.9 0.1X
|
2018-10-20 20:31:13 -04:00
|
|
|
|
2016-06-11 18:26:08 -04:00
|
|
|
|
2018-10-20 20:31:13 -04:00
|
|
|
================================================================================================
|
|
|
|
bushy struct field read and write
|
|
|
|
================================================================================================
|
|
|
|
|
2021-04-03 16:02:56 -04:00
|
|
|
OpenJDK 64-Bit Server VM 1.8.0_282-b08 on Linux 5.4.0-1043-azure
|
|
|
|
Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
|
|
|
|
bushy struct field r/w: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
|
|
|
|
-------------------------------------------------------------------------------------------------------------------------
|
|
|
|
1 x 1 deep x 100000 rows (read in-mem) 19 22 5 5.3 189.0 1.0X
|
|
|
|
1 x 1 deep x 100000 rows (exec in-mem) 23 29 5 4.3 231.4 0.8X
|
|
|
|
1 x 1 deep x 100000 rows (read parquet) 26 31 7 3.8 264.7 0.7X
|
|
|
|
1 x 1 deep x 100000 rows (write parquet) 119 136 44 0.8 1189.3 0.2X
|
|
|
|
128 x 8 deep x 1000 rows (read in-mem) 21 24 4 4.8 208.8 0.9X
|
|
|
|
128 x 8 deep x 1000 rows (exec in-mem) 56 62 4 1.8 561.8 0.3X
|
|
|
|
128 x 8 deep x 1000 rows (read parquet) 35 40 7 2.8 353.0 0.5X
|
|
|
|
128 x 8 deep x 1000 rows (write parquet) 119 128 6 0.8 1193.9 0.2X
|
|
|
|
1024 x 11 deep x 100 rows (read in-mem) 27 31 4 3.7 269.0 0.7X
|
|
|
|
1024 x 11 deep x 100 rows (exec in-mem) 366 377 9 0.3 3661.4 0.1X
|
|
|
|
1024 x 11 deep x 100 rows (read parquet) 70 74 6 1.4 698.0 0.3X
|
|
|
|
1024 x 11 deep x 100 rows (write parquet) 131 141 6 0.8 1314.8 0.1X
|
2018-10-20 20:31:13 -04:00
|
|
|
|
|
|
|
|
|
|
|
================================================================================================
|
|
|
|
wide array field read and write
|
|
|
|
================================================================================================
|
2016-06-11 18:26:08 -04:00
|
|
|
|
2021-04-03 16:02:56 -04:00
|
|
|
OpenJDK 64-Bit Server VM 1.8.0_282-b08 on Linux 5.4.0-1043-azure
|
|
|
|
Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
|
[SPARK-30409][SPARK-29173][SQL][TESTS] Use `NoOp` datasource in SQL benchmarks
### What changes were proposed in this pull request?
In the PR, I propose to replace `.collect()`, `.count()` and `.foreach(_ => ())` in SQL benchmarks and use the `NoOp` datasource. I added an implicit class to `SqlBasedBenchmark` with the `.noop()` method. It can be used in benchmark like: `ds.noop()`. The last one is unfolded to `ds.write.format("noop").mode(Overwrite).save()`.
### Why are the changes needed?
To avoid additional overhead that `collect()` (and other actions) has. For example, `.collect()` has to convert values according to external types and pull data to the driver. This can hide actual performance regressions or improvements of benchmarked operations.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Re-run all modified benchmarks using Amazon EC2.
| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge (spot instance) |
| AMI | ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) |
| Java | OpenJDK8/10 |
- Run `TPCDSQueryBenchmark` using instructions from the PR #26049
```
# `spark-tpcds-datagen` needs this. (JDK8)
$ git clone https://github.com/apache/spark.git -b branch-2.4 --depth 1 spark-2.4
$ export SPARK_HOME=$PWD
$ ./build/mvn clean package -DskipTests
# Generate data. (JDK8)
$ git clone gitgithub.com:maropu/spark-tpcds-datagen.git
$ cd spark-tpcds-datagen/
$ build/mvn clean package
$ mkdir -p /data/tpcds
$ ./bin/dsdgen --output-location /data/tpcds/s1 // This need `Spark 2.4`
```
- Other benchmarks ran by the script:
```
#!/usr/bin/env python3
import os
from sparktestsupport.shellutils import run_cmd
benchmarks = [
['sql/test', 'org.apache.spark.sql.execution.benchmark.AggregateBenchmark'],
['avro/test', 'org.apache.spark.sql.execution.benchmark.AvroReadBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.BloomFilterBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.DataSourceReadBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.DateTimeBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.ExtractBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.FilterPushdownBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.InExpressionBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.IntervalBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.JoinBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.MakeDateTimeBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.MiscBenchmark'],
['hive/test', 'org.apache.spark.sql.execution.benchmark.ObjectHashAggregateExecBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.OrcNestedSchemaPruningBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.OrcV2NestedSchemaPruningBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.ParquetNestedSchemaPruningBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.RangeBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.UDFBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.WideSchemaBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.WideTableBenchmark'],
['hive/test', 'org.apache.spark.sql.hive.orc.OrcReadBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.datasources.csv.CSVBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.datasources.json.JsonBenchmark']
]
print('Set SPARK_GENERATE_BENCHMARK_FILES=1')
os.environ['SPARK_GENERATE_BENCHMARK_FILES'] = '1'
for b in benchmarks:
print("Run benchmark: %s" % b[1])
run_cmd(['build/sbt', '%s:runMain %s' % (b[0], b[1])])
```
Closes #27078 from MaxGekk/noop-in-benchmarks.
Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-12 16:18:19 -05:00
|
|
|
wide array field r/w: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
|
|
|
|
------------------------------------------------------------------------------------------------------------------------
|
2021-04-03 16:02:56 -04:00
|
|
|
1 wide x 100000 rows (read in-mem) 20 24 4 4.9 204.1 1.0X
|
|
|
|
1 wide x 100000 rows (exec in-mem) 26 31 4 3.8 263.6 0.8X
|
|
|
|
1 wide x 100000 rows (read parquet) 57 64 8 1.8 565.8 0.4X
|
|
|
|
1 wide x 100000 rows (write parquet) 115 129 6 0.9 1148.2 0.2X
|
|
|
|
100 wide x 1000 rows (read in-mem) 17 20 6 5.9 170.0 1.2X
|
|
|
|
100 wide x 1000 rows (exec in-mem) 21 24 4 4.7 214.3 1.0X
|
|
|
|
100 wide x 1000 rows (read parquet) 30 36 7 3.3 303.9 0.7X
|
|
|
|
100 wide x 1000 rows (write parquet) 113 120 4 0.9 1127.2 0.2X
|
|
|
|
2500 wide x 40 rows (read in-mem) 17 21 4 5.9 169.9 1.2X
|
|
|
|
2500 wide x 40 rows (exec in-mem) 21 26 5 4.7 214.3 1.0X
|
|
|
|
2500 wide x 40 rows (read parquet) 30 35 7 3.3 300.8 0.7X
|
|
|
|
2500 wide x 40 rows (write parquet) 116 122 4 0.9 1161.4 0.2X
|
2016-09-03 13:02:20 -04:00
|
|
|
|
2018-10-20 20:31:13 -04:00
|
|
|
|
|
|
|
================================================================================================
|
|
|
|
wide map field read and write
|
|
|
|
================================================================================================
|
|
|
|
|
2021-04-03 16:02:56 -04:00
|
|
|
OpenJDK 64-Bit Server VM 1.8.0_282-b08 on Linux 5.4.0-1043-azure
|
|
|
|
Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
|
[SPARK-30409][SPARK-29173][SQL][TESTS] Use `NoOp` datasource in SQL benchmarks
### What changes were proposed in this pull request?
In the PR, I propose to replace `.collect()`, `.count()` and `.foreach(_ => ())` in SQL benchmarks and use the `NoOp` datasource. I added an implicit class to `SqlBasedBenchmark` with the `.noop()` method. It can be used in benchmark like: `ds.noop()`. The last one is unfolded to `ds.write.format("noop").mode(Overwrite).save()`.
### Why are the changes needed?
To avoid additional overhead that `collect()` (and other actions) has. For example, `.collect()` has to convert values according to external types and pull data to the driver. This can hide actual performance regressions or improvements of benchmarked operations.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Re-run all modified benchmarks using Amazon EC2.
| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge (spot instance) |
| AMI | ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) |
| Java | OpenJDK8/10 |
- Run `TPCDSQueryBenchmark` using instructions from the PR #26049
```
# `spark-tpcds-datagen` needs this. (JDK8)
$ git clone https://github.com/apache/spark.git -b branch-2.4 --depth 1 spark-2.4
$ export SPARK_HOME=$PWD
$ ./build/mvn clean package -DskipTests
# Generate data. (JDK8)
$ git clone gitgithub.com:maropu/spark-tpcds-datagen.git
$ cd spark-tpcds-datagen/
$ build/mvn clean package
$ mkdir -p /data/tpcds
$ ./bin/dsdgen --output-location /data/tpcds/s1 // This need `Spark 2.4`
```
- Other benchmarks ran by the script:
```
#!/usr/bin/env python3
import os
from sparktestsupport.shellutils import run_cmd
benchmarks = [
['sql/test', 'org.apache.spark.sql.execution.benchmark.AggregateBenchmark'],
['avro/test', 'org.apache.spark.sql.execution.benchmark.AvroReadBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.BloomFilterBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.DataSourceReadBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.DateTimeBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.ExtractBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.FilterPushdownBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.InExpressionBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.IntervalBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.JoinBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.MakeDateTimeBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.MiscBenchmark'],
['hive/test', 'org.apache.spark.sql.execution.benchmark.ObjectHashAggregateExecBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.OrcNestedSchemaPruningBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.OrcV2NestedSchemaPruningBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.ParquetNestedSchemaPruningBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.RangeBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.UDFBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.WideSchemaBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.benchmark.WideTableBenchmark'],
['hive/test', 'org.apache.spark.sql.hive.orc.OrcReadBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.datasources.csv.CSVBenchmark'],
['sql/test', 'org.apache.spark.sql.execution.datasources.json.JsonBenchmark']
]
print('Set SPARK_GENERATE_BENCHMARK_FILES=1')
os.environ['SPARK_GENERATE_BENCHMARK_FILES'] = '1'
for b in benchmarks:
print("Run benchmark: %s" % b[1])
run_cmd(['build/sbt', '%s:runMain %s' % (b[0], b[1])])
```
Closes #27078 from MaxGekk/noop-in-benchmarks.
Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-12 16:18:19 -05:00
|
|
|
wide map field r/w: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
|
|
|
|
------------------------------------------------------------------------------------------------------------------------
|
2021-04-03 16:02:56 -04:00
|
|
|
1 wide x 100000 rows (read in-mem) 19 23 3 5.1 194.5 1.0X
|
|
|
|
1 wide x 100000 rows (exec in-mem) 26 29 2 3.8 260.1 0.7X
|
|
|
|
1 wide x 100000 rows (read parquet) 79 88 7 1.3 785.1 0.2X
|
|
|
|
1 wide x 100000 rows (write parquet) 120 125 5 0.8 1195.5 0.2X
|
|
|
|
100 wide x 1000 rows (read in-mem) 13 15 3 7.7 130.1 1.5X
|
|
|
|
100 wide x 1000 rows (exec in-mem) 18 20 2 5.6 177.7 1.1X
|
|
|
|
100 wide x 1000 rows (read parquet) 38 43 7 2.6 383.8 0.5X
|
|
|
|
100 wide x 1000 rows (write parquet) 113 121 6 0.9 1130.0 0.2X
|
|
|
|
2500 wide x 40 rows (read in-mem) 15 17 4 6.9 145.0 1.3X
|
|
|
|
2500 wide x 40 rows (exec in-mem) 20 22 3 5.1 195.5 1.0X
|
|
|
|
2500 wide x 40 rows (read parquet) 38 44 8 2.6 383.8 0.5X
|
|
|
|
2500 wide x 40 rows (write parquet) 114 120 4 0.9 1137.3 0.2X
|
2018-10-20 20:31:13 -04:00
|
|
|
|
2016-06-11 18:26:08 -04:00
|
|
|
|