spark-instrumented-optimizer/sql/core/benchmarks/MiscBenchmark-results.txt


================================================================================================
filter & aggregate without group
================================================================================================
[SPARK-30409][SPARK-29173][SQL][TESTS] Use `NoOp` datasource in SQL benchmarks ### What changes were proposed in this pull request? In the PR, I propose to replace `.collect()`, `.count()` and `.foreach(_ => ())` in SQL benchmarks and use the `NoOp` datasource. I added an implicit class to `SqlBasedBenchmark` with the `.noop()` method. It can be used in benchmark like: `ds.noop()`. The last one is unfolded to `ds.write.format("noop").mode(Overwrite).save()`. ### Why are the changes needed? To avoid additional overhead that `collect()` (and other actions) has. For example, `.collect()` has to convert values according to external types and pull data to the driver. This can hide actual performance regressions or improvements of benchmarked operations. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Re-run all modified benchmarks using Amazon EC2. | Item | Description | | ---- | ----| | Region | us-west-2 (Oregon) | | Instance | r3.xlarge (spot instance) | | AMI | ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) | | Java | OpenJDK8/10 | - Run `TPCDSQueryBenchmark` using instructions from the PR #26049 ``` # `spark-tpcds-datagen` needs this. (JDK8) $ git clone https://github.com/apache/spark.git -b branch-2.4 --depth 1 spark-2.4 $ export SPARK_HOME=$PWD $ ./build/mvn clean package -DskipTests # Generate data. 
(JDK8) $ git clone gitgithub.com:maropu/spark-tpcds-datagen.git $ cd spark-tpcds-datagen/ $ build/mvn clean package $ mkdir -p /data/tpcds $ ./bin/dsdgen --output-location /data/tpcds/s1 // This need `Spark 2.4` ``` - Other benchmarks ran by the script: ``` #!/usr/bin/env python3 import os from sparktestsupport.shellutils import run_cmd benchmarks = [ ['sql/test', 'org.apache.spark.sql.execution.benchmark.AggregateBenchmark'], ['avro/test', 'org.apache.spark.sql.execution.benchmark.AvroReadBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.BloomFilterBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.DataSourceReadBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.DateTimeBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.ExtractBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.FilterPushdownBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.InExpressionBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.IntervalBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.JoinBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.MakeDateTimeBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.MiscBenchmark'], ['hive/test', 'org.apache.spark.sql.execution.benchmark.ObjectHashAggregateExecBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.OrcNestedSchemaPruningBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.OrcV2NestedSchemaPruningBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.ParquetNestedSchemaPruningBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.RangeBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.UDFBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.WideSchemaBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.WideTableBenchmark'], ['hive/test', 
'org.apache.spark.sql.hive.orc.OrcReadBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.datasources.csv.CSVBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.datasources.json.JsonBenchmark'] ] print('Set SPARK_GENERATE_BENCHMARK_FILES=1') os.environ['SPARK_GENERATE_BENCHMARK_FILES'] = '1' for b in benchmarks: print("Run benchmark: %s" % b[1]) run_cmd(['build/sbt', '%s:runMain %s' % (b[0], b[1])]) ``` Closes #27078 from MaxGekk/noop-in-benchmarks. Lead-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com> Co-authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-12 16:18:19 -05:00
OpenJDK 64-Bit Server VM 1.8.0_232-8u232-b09-0ubuntu1~18.04.1-b09 on Linux 4.15.0-1044-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
[SPARK-29320][TESTS] Compare `sql/core` module in JDK8/11 (Part 1) ### What changes were proposed in this pull request? This PR regenerates the `sql/core` benchmarks in JDK8/11 to compare the result. In general, we compare the ratio instead of the time. However, in this PR, the average time is compared. This PR should be considered as a rough comparison. **A. EXPECTED CASES(JDK11 is faster in general)** - [x] BloomFilterBenchmark (JDK11 is faster except one case) - [x] BuiltInDataSourceWriteBenchmark (JDK11 is faster at CSV/ORC) - [x] CSVBenchmark (JDK11 is faster except five cases) - [x] ColumnarBatchBenchmark (JDK11 is faster at `boolean`/`string` and some cases in `int`/`array`) - [x] DatasetBenchmark (JDK11 is faster with `string`, but is slower for `long` type) - [x] ExternalAppendOnlyUnsafeRowArrayBenchmark (JDK11 is faster except two cases) - [x] ExtractBenchmark (JDK11 is faster except HOUR/MINUTE/SECOND/MILLISECONDS/MICROSECONDS) - [x] HashedRelationMetricsBenchmark (JDK11 is faster) - [x] JSONBenchmark (JDK11 is much faster except eight cases) - [x] JoinBenchmark (JDK11 is faster except five cases) - [x] OrcNestedSchemaPruningBenchmark (JDK11 is faster in nine cases) - [x] PrimitiveArrayBenchmark (JDK11 is faster) - [x] SortBenchmark (JDK11 is faster except `Arrays.sort` case) - [x] UDFBenchmark (N/A, values are too small) - [x] UnsafeArrayDataBenchmark (JDK11 is faster except one case) - [x] WideTableBenchmark (JDK11 is faster except two cases) **B. 
CASES WE NEED TO INVESTIGATE MORE LATER** - [x] AggregateBenchmark (JDK11 is slower in general) - [x] CompressionSchemeBenchmark (JDK11 is slower in general except `string`) - [x] DataSourceReadBenchmark (JDK11 is slower in general) - [x] DateTimeBenchmark (JDK11 is slightly slower in general except `parsing`) - [x] MakeDateTimeBenchmark (JDK11 is slower except two cases) - [x] MiscBenchmark (JDK11 is slower except ten cases) - [x] OrcV2NestedSchemaPruningBenchmark (JDK11 is slower) - [x] ParquetNestedSchemaPruningBenchmark (JDK11 is slower except six cases) - [x] RangeBenchmark (JDK11 is slower except one case) `FilterPushdownBenchmark/InExpressionBenchmark/WideSchemaBenchmark` will be compared later because it took long timer. ### Why are the changes needed? According to the result, there are some difference between JDK8/JDK11. This will be a baseline for the future improvement and comparison. Also, as a reproducible environment, the following environment is used. - Instance: `r3.xlarge` - OS: `CentOS Linux release 7.5.1804 (Core)` - JDK: - `OpenJDK Runtime Environment (build 1.8.0_222-b10)` - `OpenJDK Runtime Environment 18.9 (build 11.0.4+11-LTS)` ### Does this PR introduce any user-facing change? No. ### How was this patch tested? This is a test-only PR. We need to run benchmark. Closes #26003 from dongjoon-hyun/SPARK-29320. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-03 11:58:25 -04:00
range/filter/sum:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
range/filter/sum wholestage off                   46812          48110        1836         44.8          22.3       1.0X
range/filter/sum wholestage on                     3116           3656         309        673.1           1.5      15.0X
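
The derived columns in the table above can be cross-checked from the printed numbers alone. The sketch below is an inference from those numbers, not taken from the benchmark harness: it assumes that Relative is the first row's Best Time divided by each row's Best Time, and that Per Row(ns) is 1000 divided by Rate(M/s). The function names are hypothetical.

```python
# Values copied from the range/filter/sum table above:
# (case, best_ms, rate_m_per_s, per_row_ns, relative)
ROWS = [
    ("wholestage off", 46812, 44.8, 22.3, 1.0),
    ("wholestage on", 3116, 673.1, 1.5, 15.0),
]


def relative(baseline_best_ms, best_ms):
    """Speedup over the first (baseline) row, assumed to use Best Time."""
    return round(baseline_best_ms / best_ms, 1)


def per_row_ns(rate_m_per_s):
    """Nanoseconds per row implied by a rate in millions of rows per second."""
    return round(1000.0 / rate_m_per_s, 1)


baseline_best = ROWS[0][1]
for case, best_ms, rate, reported_ns, reported_rel in ROWS:
    # Both checks hold for the two rows above under the stated assumptions.
    assert relative(baseline_best, best_ms) == reported_rel
    assert per_row_ns(rate) == reported_ns
    print(case, relative(baseline_best, best_ms), per_row_ns(rate))
```

Using Avg Time instead of Best Time would give about 13.2X for the second row, which does not match the reported 15.0X, so Best Time appears to be the basis for the Relative column.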
================================================================================================
range/limit/sum
================================================================================================
OpenJDK 64-Bit Server VM 1.8.0_232-8u232-b09-0ubuntu1~18.04.1-b09 on Linux 4.15.0-1044-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
range/limit/sum:                          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
range/limit/sum wholestage off                      200            219          27       2621.6           0.4       1.0X
range/limit/sum wholestage on                       117            125           7       4477.8           0.2       1.7X
================================================================================================
sample
================================================================================================
OpenJDK 64-Bit Server VM 1.8.0_232-8u232-b09-0ubuntu1~18.04.1-b09 on Linux 4.15.0-1044-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
sample with replacement:                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
sample with replacement wholestage off 12963 13506 768 10.1 98.9 1.0X
sample with replacement wholestage on 7397 7742 300 17.7 56.4 1.8X
OpenJDK 64-Bit Server VM 1.8.0_232-8u232-b09-0ubuntu1~18.04.1-b09 on Linux 4.15.0-1044-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
sample without replacement: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
sample without replacement wholestage off 2977 2977 0 44.0 22.7 1.0X
sample without replacement wholestage on 1087 1099 13 120.6 8.3 2.7X
================================================================================================
collect
================================================================================================
OpenJDK 64-Bit Server VM 1.8.0_232-8u232-b09-0ubuntu1~18.04.1-b09 on Linux 4.15.0-1044-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
collect: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
collect 1 million 281 315 34 3.7 268.2 1.0X
collect 2 millions 531 564 28 2.0 506.6 0.5X
collect 4 millions 1179 1970 1118 0.9 1124.5 0.2X
================================================================================================
collect limit
================================================================================================
OpenJDK 64-Bit Server VM 1.8.0_232-8u232-b09-0ubuntu1~18.04.1-b09 on Linux 4.15.0-1044-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
[SPARK-29320][TESTS] Compare `sql/core` module in JDK8/11 (Part 1) ### What changes were proposed in this pull request? This PR regenerates the `sql/core` benchmarks in JDK8/11 to compare the result. In general, we compare the ratio instead of the time. However, in this PR, the average time is compared. This PR should be considered as a rough comparison. **A. EXPECTED CASES(JDK11 is faster in general)** - [x] BloomFilterBenchmark (JDK11 is faster except one case) - [x] BuiltInDataSourceWriteBenchmark (JDK11 is faster at CSV/ORC) - [x] CSVBenchmark (JDK11 is faster except five cases) - [x] ColumnarBatchBenchmark (JDK11 is faster at `boolean`/`string` and some cases in `int`/`array`) - [x] DatasetBenchmark (JDK11 is faster with `string`, but is slower for `long` type) - [x] ExternalAppendOnlyUnsafeRowArrayBenchmark (JDK11 is faster except two cases) - [x] ExtractBenchmark (JDK11 is faster except HOUR/MINUTE/SECOND/MILLISECONDS/MICROSECONDS) - [x] HashedRelationMetricsBenchmark (JDK11 is faster) - [x] JSONBenchmark (JDK11 is much faster except eight cases) - [x] JoinBenchmark (JDK11 is faster except five cases) - [x] OrcNestedSchemaPruningBenchmark (JDK11 is faster in nine cases) - [x] PrimitiveArrayBenchmark (JDK11 is faster) - [x] SortBenchmark (JDK11 is faster except `Arrays.sort` case) - [x] UDFBenchmark (N/A, values are too small) - [x] UnsafeArrayDataBenchmark (JDK11 is faster except one case) - [x] WideTableBenchmark (JDK11 is faster except two cases) **B. 
CASES WE NEED TO INVESTIGATE MORE LATER** - [x] AggregateBenchmark (JDK11 is slower in general) - [x] CompressionSchemeBenchmark (JDK11 is slower in general except `string`) - [x] DataSourceReadBenchmark (JDK11 is slower in general) - [x] DateTimeBenchmark (JDK11 is slightly slower in general except `parsing`) - [x] MakeDateTimeBenchmark (JDK11 is slower except two cases) - [x] MiscBenchmark (JDK11 is slower except ten cases) - [x] OrcV2NestedSchemaPruningBenchmark (JDK11 is slower) - [x] ParquetNestedSchemaPruningBenchmark (JDK11 is slower except six cases) - [x] RangeBenchmark (JDK11 is slower except one case) `FilterPushdownBenchmark/InExpressionBenchmark/WideSchemaBenchmark` will be compared later because they take a long time to run. ### Why are the changes needed? According to the results, there are some differences between JDK8 and JDK11. This will be a baseline for future improvement and comparison. Also, as a reproducible environment, the following environment is used. - Instance: `r3.xlarge` - OS: `CentOS Linux release 7.5.1804 (Core)` - JDK: - `OpenJDK Runtime Environment (build 1.8.0_222-b10)` - `OpenJDK Runtime Environment 18.9 (build 11.0.4+11-LTS)` ### Does this PR introduce any user-facing change? No. ### How was this patch tested? This is a test-only PR. We need to run the benchmarks. Closes #26003 from dongjoon-hyun/SPARK-29320. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-03 11:58:25 -04:00
collect limit:                            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
collect limit 1 million                             344            352           5          3.0         328.0       1.0X
collect limit 2 millions                            656            660           6          1.6         625.2       0.5X
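
The computed columns in a table like the one above can be cross-checked by hand. The sketch below is not part of Spark; `derived_columns` is a hypothetical helper, and the row count of `1 << 20` is an assumption inferred from the Per Row figure (344 ms / 328.0 ns per row), not something this file states. It assumes Rate and Per Row are derived from the best time and that Relative compares each case's best time with the first case's.

```python
# Hypothetical helper (not a Spark API): re-derives Rate(M/s), Per Row(ns)
# and Relative from a best time, assuming all three are based on Best Time(ms).

def derived_columns(best_ms, baseline_best_ms, num_rows):
    # rows per second, expressed in millions
    rate_m_per_s = num_rows / (best_ms * 1000.0)
    # nanoseconds spent per row
    per_row_ns = best_ms * 1e6 / num_rows
    # speed relative to the first (baseline) case
    relative = baseline_best_ms / best_ms
    return rate_m_per_s, per_row_ns, relative


# Assumed row count; inferred from the table, not stated in the file.
ASSUMED_ROWS = 1 << 20

first = derived_columns(344, 344, ASSUMED_ROWS)
second = derived_columns(656, 344, ASSUMED_ROWS)
print("1 million :", [round(v, 1) for v in first])
print("2 millions:", [round(v, 1) for v in second])
```

Because the table prints rounded millisecond values, figures recomputed this way may differ from the published ones in the last digit.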
================================================================================================
generate explode
================================================================================================
OpenJDK 64-Bit Server VM 1.8.0_232-8u232-b09-0ubuntu1~18.04.1-b09 on Linux 4.15.0-1044-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
generate explode array:                   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
generate explode array wholestage off             14664          14780         164          1.1         874.0       1.0X
generate explode array wholestage on              14789          14886          87          1.1         881.5       1.0X
OpenJDK 64-Bit Server VM 1.8.0_232-8u232-b09-0ubuntu1~18.04.1-b09 on Linux 4.15.0-1044-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
generate explode map:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
generate explode map wholestage off               50441          51014         811          0.3        3006.5       1.0X
generate explode map wholestage on                49164          49915         892          0.3        2930.4       1.0X
OpenJDK 64-Bit Server VM 1.8.0_232-8u232-b09-0ubuntu1~18.04.1-b09 on Linux 4.15.0-1044-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
generate posexplode array:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
generate posexplode array wholestage off          17101          17130          40          1.0        1019.3       1.0X
generate posexplode array wholestage on           15625          15675          46          1.1         931.3       1.1X

OpenJDK 64-Bit Server VM 1.8.0_232-8u232-b09-0ubuntu1~18.04.1-b09 on Linux 4.15.0-1044-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
generate inline array:                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
generate inline array wholestage off              13923          14720        1127          1.2         829.9       1.0X
generate inline array wholestage on               12246          12591         695          1.4         729.9       1.1X

OpenJDK 64-Bit Server VM 1.8.0_232-8u232-b09-0ubuntu1~18.04.1-b09 on Linux 4.15.0-1044-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
generate big struct array:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
generate big struct array wholestage off            442            456          20          0.1        7368.2       1.0X
generate big struct array wholestage on             409            429          22          0.1        6823.9       1.1X

OpenJDK 64-Bit Server VM 1.8.0_232-8u232-b09-0ubuntu1~18.04.1-b09 on Linux 4.15.0-1044-aws
[SPARK-27707][SQL] Prune unnecessary nested fields from Generate

## What changes were proposed in this pull request?

A performance issue with `explode` was found: when a complex field contains a huge array, the whole field gets duplicated once per exploded array element. Given example:

```scala
val df = spark.sparkContext.parallelize(Seq(("1",
  Array.fill(M)({
    val i = math.random
    (i.toString, (i + 1).toString, (i + 2).toString, (i + 3).toString)
  })))).toDF("col", "arr")
  .selectExpr("col", "struct(col, arr) as st")
  .selectExpr("col", "st.col as col1", "explode(st.arr) as arr_col")
```

The explode causes `st` to be duplicated as many times as there are exploded elements. Benchmark results before the change:

```
[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4
[info] Intel(R) Core(TM) i7-8750H CPU 2.20GHz
[info] generate big nested struct array:                Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] generate big nested struct array wholestage off          52668          53162         699         0.0      877803.4       1.0X
[info] generate big nested struct array wholestage on           47261          49093        1125         0.0      787690.2       1.1X
[info]
```

The query plan:

```
== Physical Plan ==
Project [col#508, st#512.col AS col1#515, arr_col#519]
+- Generate explode(st#512.arr), [col#508, st#512], false, [arr_col#519]
   +- Project [_1#503 AS col#508, named_struct(col, _1#503, arr, _2#504) AS st#512]
      +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1, true, false) AS _1#503, mapobjects(MapObjects_loopValue84, MapObjects_loopIsNull84, ObjectType(class scala.Tuple4), if (isnull(lambdavariable(MapObjects_loopValue84, MapObjects_loopIsNull84, ObjectType(class scala.Tuple4), true))) null else named_struct(_1, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(lambdavariable(MapObjects_loopValue84, MapObjects_loopIsNull84, ObjectType(class scala.Tuple4), true))._1, true, false), _2, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(lambdavariable(MapObjects_loopValue84, MapObjects_loopIsNull84, ObjectType(class scala.Tuple4), true))._2, true, false), _3, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(lambdavariable(MapObjects_loopValue84, MapObjects_loopIsNull84, ObjectType(class scala.Tuple4), true))._3, true, false), _4, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(lambdavariable(MapObjects_loopValue84, MapObjects_loopIsNull84, ObjectType(class scala.Tuple4), true))._4, true, false)), knownnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2, None) AS _2#504]
         +- Scan[obj#534]
```

This patch takes a nested column pruning approach to prune unnecessary nested fields. It adds a projection of the needed nested fields as aliases on the child of `Generate`, and substitutes them with alias attributes in the projection on top of `Generate`.

Benchmark results after the change:

```
[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4
[info] Intel(R) Core(TM) i7-8750H CPU 2.20GHz
[info] generate big nested struct array:                Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] generate big nested struct array wholestage off            311            331          28         0.2        5188.6       1.0X
[info] generate big nested struct array wholestage on             297            312          15         0.2        4947.3       1.0X
[info]
```

The query plan:

```
== Physical Plan ==
Project [col#592, _gen_alias_608#608 AS col1#599, arr_col#603]
+- Generate explode(st#596.arr), [col#592, _gen_alias_608#608], false, [arr_col#603]
   +- Project [_1#587 AS col#592, named_struct(col, _1#587, arr, _2#588) AS st#596, _1#587 AS _gen_alias_608#608]
      +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1, true, false) AS _1#587, mapobjects(MapObjects_loopValue102, MapObjects_loopIsNull102, ObjectType(class scala.Tuple4), if (isnull(lambdavariable(MapObjects_loopValue102, MapObjects_loopIsNull102, ObjectType(class scala.Tuple4), true))) null else named_struct(_1, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(lambdavariable(MapObjects_loopValue102, MapObjects_loopIsNull102, ObjectType(class scala.Tuple4), true))._1, true, false), _2, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(lambdavariable(MapObjects_loopValue102, MapObjects_loopIsNull102, ObjectType(class scala.Tuple4), true))._2, true, false), _3, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(lambdavariable(MapObjects_loopValue102, MapObjects_loopIsNull102, ObjectType(class scala.Tuple4), true))._3, true, false), _4, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(lambdavariable(MapObjects_loopValue102, MapObjects_loopIsNull102, ObjectType(class scala.Tuple4), true))._4, true, false)), knownnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2, None) AS _2#588]
         +- Scan[obj#586]
```

This behavior is controlled by the SQL config `spark.sql.optimizer.expression.nestedPruning.enabled`.

## How was this patch tested?

Added benchmark.

Closes #24637 from viirya/SPARK-27707.

Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-07-19 02:32:07 -04:00
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
generate big nested struct array: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
generate big nested struct array wholestage off 409 415 8 0.1 6822.4 1.0X
generate big nested struct array wholestage on 401 414 23 0.1 6687.0 1.0X
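The SPARK-27707 description attributes the earlier cost to the whole struct `st` being carried along with every exploded element, and the fix to projecting only the needed field below `Generate`. The following is a toy model of that idea in plain Python, not Spark's implementation: the function names and the `carried` counter are hypothetical and only illustrate why the carried data shrinks from roughly M*(M+1) values to M values for an M-element array.

```python
def struct_size(st):
    # Leaf values in the struct: one for `col` plus every array element.
    return 1 + len(st["arr"])


def explode_unpruned(rows):
    """Explode st.arr while carrying the whole struct on each output row."""
    out, carried = [], 0
    for row in rows:
        for element in row["st"]["arr"]:
            # Hypothetical cost model: the full struct rides along per element.
            carried += struct_size(row["st"])
            out.append({"col": row["col"],
                        "col1": row["st"]["col"],
                        "arr_col": element})
    return out, carried


def explode_pruned(rows):
    """Project st.col to an alias first, then carry only that alias."""
    out, carried = [], 0
    for row in rows:
        gen_alias = row["st"]["col"]  # plays the role of the _gen_alias column
        for element in row["st"]["arr"]:
            carried += 1  # only the aliased field rides along
            out.append({"col": row["col"],
                        "col1": gen_alias,
                        "arr_col": element})
    return out, carried


rows = [{"col": "1", "st": {"col": "1", "arr": ["a", "b", "c", "d"]}}]
unpruned_out, unpruned_cost = explode_unpruned(rows)
pruned_out, pruned_cost = explode_pruned(rows)
print(unpruned_out == pruned_out, unpruned_cost, pruned_cost)
```

Both variants produce the same rows; only the amount of data carried per output row differs, which is the effect the before/after numbers in the commit message point to.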
================================================================================================
generate regular generator
================================================================================================
OpenJDK 64-Bit Server VM 1.8.0_232-8u232-b09-0ubuntu1~18.04.1-b09 on Linux 4.15.0-1044-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
[SPARK-29320][TESTS] Compare `sql/core` module in JDK8/11 (Part 1)

### What changes were proposed in this pull request?

This PR regenerates the `sql/core` benchmarks in JDK8/11 to compare the results. In general, we compare the ratio instead of the time. However, in this PR, the average time is compared. This PR should be considered a rough comparison.

**A. EXPECTED CASES (JDK11 is faster in general)**
- [x] BloomFilterBenchmark (JDK11 is faster except one case)
- [x] BuiltInDataSourceWriteBenchmark (JDK11 is faster at CSV/ORC)
- [x] CSVBenchmark (JDK11 is faster except five cases)
- [x] ColumnarBatchBenchmark (JDK11 is faster at `boolean`/`string` and some cases in `int`/`array`)
- [x] DatasetBenchmark (JDK11 is faster with `string`, but is slower for `long` type)
- [x] ExternalAppendOnlyUnsafeRowArrayBenchmark (JDK11 is faster except two cases)
- [x] ExtractBenchmark (JDK11 is faster except HOUR/MINUTE/SECOND/MILLISECONDS/MICROSECONDS)
- [x] HashedRelationMetricsBenchmark (JDK11 is faster)
- [x] JSONBenchmark (JDK11 is much faster except eight cases)
- [x] JoinBenchmark (JDK11 is faster except five cases)
- [x] OrcNestedSchemaPruningBenchmark (JDK11 is faster in nine cases)
- [x] PrimitiveArrayBenchmark (JDK11 is faster)
- [x] SortBenchmark (JDK11 is faster except `Arrays.sort` case)
- [x] UDFBenchmark (N/A, values are too small)
- [x] UnsafeArrayDataBenchmark (JDK11 is faster except one case)
- [x] WideTableBenchmark (JDK11 is faster except two cases)

**B. CASES WE NEED TO INVESTIGATE MORE LATER**
- [x] AggregateBenchmark (JDK11 is slower in general)
- [x] CompressionSchemeBenchmark (JDK11 is slower in general except `string`)
- [x] DataSourceReadBenchmark (JDK11 is slower in general)
- [x] DateTimeBenchmark (JDK11 is slightly slower in general except `parsing`)
- [x] MakeDateTimeBenchmark (JDK11 is slower except two cases)
- [x] MiscBenchmark (JDK11 is slower except ten cases)
- [x] OrcV2NestedSchemaPruningBenchmark (JDK11 is slower)
- [x] ParquetNestedSchemaPruningBenchmark (JDK11 is slower except six cases)
- [x] RangeBenchmark (JDK11 is slower except one case)

`FilterPushdownBenchmark/InExpressionBenchmark/WideSchemaBenchmark` will be compared later because they take a long time to run.

### Why are the changes needed?

According to the results, there are some differences between JDK8 and JDK11. This will be a baseline for future improvement and comparison. Also, as a reproducible environment, the following environment is used.
- Instance: `r3.xlarge`
- OS: `CentOS Linux release 7.5.1804 (Core)`
- JDK:
  - `OpenJDK Runtime Environment (build 1.8.0_222-b10)`
  - `OpenJDK Runtime Environment 18.9 (build 11.0.4+11-LTS)`

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

This is a test-only PR. We need to run the benchmarks.

Closes #26003 from dongjoon-hyun/SPARK-29320.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-03 11:58:25 -04:00
generate stack: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
generate stack wholestage off 27043 27118 106 0.6 1611.9 1.0X
generate stack wholestage on 24002 24036 34 0.7 1430.6 1.1X
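The columns in these tables are related by simple arithmetic, which the short Python sketch below reproduces for the `generate stack` rows. The row count `1 << 24` is an assumption inferred from the printed values (27043 ms divided by 1611.9 ns per row) and is not stated in this file. The helper name `derive_columns` is hypothetical.

```python
def derive_columns(best_ms, baseline_best_ms, num_rows):
    """Recompute Rate(M/s), Per Row(ns) and Relative from a best time in ms."""
    rate_m_per_s = num_rows / (best_ms / 1000.0) / 1e6  # million rows per second
    per_row_ns = best_ms * 1e6 / num_rows               # nanoseconds per row
    relative = baseline_best_ms / best_ms               # speedup vs. the first row
    return (f"{rate_m_per_s:.1f}", f"{per_row_ns:.1f}", f"{relative:.1f}X")


NUM_ROWS = 1 << 24  # assumed row count, inferred from the table, not from the source

# Best Time(ms) values copied from the two `generate stack` rows above.
print(derive_columns(27043, 27043, NUM_ROWS))  # wholestage off
print(derive_columns(24002, 27043, NUM_ROWS))  # wholestage on
```

Under that assumed row count the recomputed values agree with the printed columns, so `Relative` reads as the baseline's best time divided by the row's own best time rather than a ratio of averages.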