spark-instrumented-optimizer/sql/core/benchmarks/UDFBenchmark-results.txt

================================================================================================
UDF with mixed input types
================================================================================================

OpenJDK 64-Bit Server VM 1.8.0_232-8u232-b09-0ubuntu1~18.04.1-b09 on Linux 4.15.0-1044-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
long/nullable int/string to string:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------------------------------
long/nullable int/string to string wholestage off             250            327         108          0.4        2500.6       1.0X
long/nullable int/string to string wholestage on              142            157          16          0.7        1421.2       1.8X

OpenJDK 64-Bit Server VM 1.8.0_232-8u232-b09-0ubuntu1~18.04.1-b09 on Linux 4.15.0-1044-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
long/nullable int/string to option:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------------------------------
long/nullable int/string to option wholestage off             124            125           2          0.8        1237.8       1.0X
long/nullable int/string to option wholestage on               73             93          27          1.4         730.1       1.7X

OpenJDK 64-Bit Server VM 1.8.0_232-8u232-b09-0ubuntu1~18.04.1-b09 on Linux 4.15.0-1044-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
long/nullable int/string to primitive:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
long/nullable int/string to primitive wholestage off             66             69           4          1.5         658.8       1.0X
long/nullable int/string to primitive wholestage on              61             67          11          1.6         611.7       1.1X
================================================================================================
UDF with primitive types
================================================================================================
OpenJDK 64-Bit Server VM 1.8.0_232-8u232-b09-0ubuntu1~18.04.1-b09 on Linux 4.15.0-1044-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
long/nullable int to string:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
long/nullable int to string wholestage off             66             67           0          1.5         663.9       1.0X
long/nullable int to string wholestage on              66             68           2          1.5         664.6       1.0X
OpenJDK 64-Bit Server VM 1.8.0_232-8u232-b09-0ubuntu1~18.04.1-b09 on Linux 4.15.0-1044-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
long/nullable int to option:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
long/nullable int to option wholestage off             40             42           3          2.5         402.6       1.0X
long/nullable int to option wholestage on              40             42           2          2.5         401.3       1.0X
OpenJDK 64-Bit Server VM 1.8.0_232-8u232-b09-0ubuntu1~18.04.1-b09 on Linux 4.15.0-1044-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
long/nullable int to primitive:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
long/nullable int to primitive wholestage off          38             39           0          2.6         384.8       1.0X
long/nullable int to primitive wholestage on           39             45           5          2.5         392.6       1.0X

OpenJDK 64-Bit Server VM 1.8.0_232-8u232-b09-0ubuntu1~18.04.1-b09 on Linux 4.15.0-1044-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
UDF identity overhead:                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Baseline                                               32             33           1          3.1         320.8       1.0X
With identity UDF                                      37             40           6          2.7         369.1       0.9X
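
For reference, a minimal sketch of how the "UDF identity overhead" comparison above
can be reproduced. This is not the benchmark's actual source; the object name and
row count are illustrative assumptions. Results are written to the NoOp sink so
that driver-side collection does not distort the timing.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object UdfIdentityOverheadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("udf-identity-overhead")
      .getOrCreate()

    // 100M rows; the exact count used by the benchmark is an assumption here.
    val df = spark.range(100L * 1000 * 1000)

    // Baseline: write the column unchanged to the no-op sink.
    df.select(col("id")).write.format("noop").mode("overwrite").save()

    // With identity UDF: the same values routed through a Scala UDF that
    // returns its input, isolating pure UDF invocation/conversion overhead.
    val identityUdf = udf((x: Long) => x)
    df.select(identityUdf(col("id"))).write.format("noop").mode("overwrite").save()

    spark.stop()
  }
}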