ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
HyukjinKwon	ebf01ec3c1	[SPARK-34950][TESTS] Update benchmark results to the ones created by GitHub Actions machines ### What changes were proposed in this pull request? https://github.com/apache/spark/pull/32015 added a way to run benchmarks much more easily in the same GitHub Actions build. This PR updates the benchmark results by using the way. NOTE that looks like GitHub Actions use four types of CPU given my observations: - Intel(R) Xeon(R) Platinum 8171M CPU 2.60GHz - Intel(R) Xeon(R) CPU E5-2673 v4 2.30GHz - Intel(R) Xeon(R) CPU E5-2673 v3 2.40GHz - Intel(R) Xeon(R) Platinum 8272CL CPU 2.60GHz Given my quick research, seems like they perform roughly similarly: ![Screen Shot 2021-04-03 at 9 31 23 PM](https://user-images.githubusercontent.com/6477701/113478478-f4b57b80-94c3-11eb-9047-f81ca8c59672.png) I couldn't find enough information about Intel(R) Xeon(R) Platinum 8272CL CPU 2.60GHz but the performance seems roughly similar given the numbers. So shouldn't be a big deal especially given that this way is much easier, encourages contributors to run more and guarantee the same number of cores and same memory with the same softwares. ### Why are the changes needed? To have a base line of the benchmarks accordingly. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? It was generated from: - [Run benchmarks: * (JDK 11)](https://github.com/HyukjinKwon/spark/actions/runs/713575465) - [Run benchmarks: * (JDK 8)](https://github.com/HyukjinKwon/spark/actions/runs/713154337) Closes #32044 from HyukjinKwon/SPARK-34950. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-04-03 23:02:56 +03:00
Max Gekk	8c44d74463	[SPARK-32071][SQL][TESTS] Add `make_interval` benchmark ### What changes were proposed in this pull request? Add benchmarks for interval constructor `make_interval` and measure perf of 4 cases: 1. Constant (year, month) 2. Constant (week, day) 3. Constant (hour, minute, second, second fraction) 4. All fields are NOT constant. The benchmark results are generated in the environment: \| Item \| Description \| \| ---- \| ----\| \| Region \| us-west-2 (Oregon) \| \| Instance \| r3.xlarge \| \| AMI \| ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) \| \| Java \| OpenJDK 64-Bit Server VM 1.8.0_252 and OpenJDK 64-Bit Server VM 11.0.7+10 \| ### Why are the changes needed? To have a base line for future perf improvements of `make_interval`, and to prevent perf regressions in the future. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running `IntervalBenchmark` via: ``` $ SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.IntervalBenchmark" ``` Closes #28905 from MaxGekk/benchmark-make_interval. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-06-27 17:54:06 -07:00
Kent Yao	fbc9dc7e9d	[SPARK-31129][SQL][TESTS] Fix IntervalBenchmark and DateTimeBenchmark ### What changes were proposed in this pull request? This PR aims to recover `IntervalBenchmark` and `DataTimeBenchmark` due to banning intervals as output. ### Why are the changes needed? This PR recovers the benchmark suite. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manually, re-run the benchmark. Closes #27885 from yaooqinn/SPARK-31111-2. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-03-12 12:59:29 -07:00
Maxim Gekk	f5118f81e3	[SPARK-30409][SPARK-29173][SQL][TESTS] Use `NoOp` datasource in SQL benchmarks ### What changes were proposed in this pull request? In the PR, I propose to replace `.collect()`, `.count()` and `.foreach(_ => ())` in SQL benchmarks and use the `NoOp` datasource. I added an implicit class to `SqlBasedBenchmark` with the `.noop()` method. It can be used in benchmark like: `ds.noop()`. The last one is unfolded to `ds.write.format("noop").mode(Overwrite).save()`. ### Why are the changes needed? To avoid additional overhead that `collect()` (and other actions) has. For example, `.collect()` has to convert values according to external types and pull data to the driver. This can hide actual performance regressions or improvements of benchmarked operations. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Re-run all modified benchmarks using Amazon EC2. \| Item \| Description \| \| ---- \| ----\| \| Region \| us-west-2 (Oregon) \| \| Instance \| r3.xlarge (spot instance) \| \| AMI \| ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) \| \| Java \| OpenJDK8/10 \| - Run `TPCDSQueryBenchmark` using instructions from the PR #26049 ``` # `spark-tpcds-datagen` needs this. (JDK8) $ git clone https://github.com/apache/spark.git -b branch-2.4 --depth 1 spark-2.4 $ export SPARK_HOME=$PWD $ ./build/mvn clean package -DskipTests # Generate data. (JDK8) $ git clone gitgithub.com:maropu/spark-tpcds-datagen.git $ cd spark-tpcds-datagen/ $ build/mvn clean package $ mkdir -p /data/tpcds $ ./bin/dsdgen --output-location /data/tpcds/s1 // This need `Spark 2.4` ``` - Other benchmarks ran by the script: ``` #!/usr/bin/env python3 import os from sparktestsupport.shellutils import run_cmd benchmarks = [ ['sql/test', 'org.apache.spark.sql.execution.benchmark.AggregateBenchmark'], ['avro/test', 'org.apache.spark.sql.execution.benchmark.AvroReadBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.BloomFilterBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.DataSourceReadBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.DateTimeBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.ExtractBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.FilterPushdownBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.InExpressionBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.IntervalBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.JoinBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.MakeDateTimeBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.MiscBenchmark'], ['hive/test', 'org.apache.spark.sql.execution.benchmark.ObjectHashAggregateExecBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.OrcNestedSchemaPruningBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.OrcV2NestedSchemaPruningBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.ParquetNestedSchemaPruningBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.RangeBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.UDFBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.WideSchemaBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.WideTableBenchmark'], ['hive/test', 'org.apache.spark.sql.hive.orc.OrcReadBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.datasources.csv.CSVBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.datasources.json.JsonBenchmark'] ] print('Set SPARK_GENERATE_BENCHMARK_FILES=1') os.environ['SPARK_GENERATE_BENCHMARK_FILES'] = '1' for b in benchmarks: print("Run benchmark: %s" % b[1]) run_cmd(['build/sbt', '%s:runMain %s' % (b[0], b[1])]) ``` Closes #27078 from MaxGekk/noop-in-benchmarks. Lead-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com> Co-authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-01-12 13:18:19 -08:00
Kent Yao	ed0c33fdd4	[SPARK-30026][SQL] Whitespaces can be identified as delimiters in interval string ### What changes were proposed in this pull request? We are now able to handle whitespaces for integral and fractional types, and the leading or trailing whitespaces for interval, date, and timestamps. But the current interval parser is not able to identify whitespaces as separates as PostgreSQL can do. This PR makes the whitespaces handling be consistent for nterval values. Typed interval literal, multi-unit representation, and casting from strings are all supported. ```sql postgres=# select interval E'1 \t day'; interval ---------- 1 day (1 row) postgres=# select interval E'1\t' day; interval ---------- 1 day (1 row) ``` ### Why are the changes needed? Whitespace handling should be consistent for interval value, and across different types in Spark. PostgreSQL feature parity. ### Does this PR introduce any user-facing change? Yes, the interval string of multi-units values which separated by whitespaces can be valid now. ### How was this patch tested? add ut. Closes #26662 from yaooqinn/SPARK-30026. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-27 01:20:38 +08:00
Kent Yao	d06a9cc4bd	[SPARK-29822][SQL] Fix cast error when there are white spaces between signs and values ### What changes were proposed in this pull request? With the latest string to literal optimization https://github.com/apache/spark/pull/26256, some interval strings can not be cast when there are some spaces between signs and unit values. After state `PARSE_SIGN`, it directly goes to `PARSE_UNIT_VALUE` when takes a space character as the end. So when there are some white spaces come before the real unit value, it fails to parse, we should add a new state like `TRIM_VALUE` to trim all these spaces. How to re-produce, which aim the revisions since https://github.com/apache/spark/pull/26256 is merged ```sql select cast(v as interval) from values ('+ 1 second') t(v); select cast(v as interval) from values ('- 1 second') t(v); ``` ### Why are the changes needed? bug fix ### Does this PR introduce any user-facing change? no ### How was this patch tested? 1. ut 2. new benchmark test Closes #26449 from yaooqinn/SPARK-29605. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-11 21:53:33 +08:00
Maxim Gekk	29dc59ac29	[SPARK-29605][SQL] Optimize string to interval casting ### What changes were proposed in this pull request? In the PR, I propose new function `stringToInterval()` in `IntervalUtils` for converting `UTF8String` to `CalendarInterval`. The function is used in casting a `STRING` column to an `INTERVAL` column. ### Why are the changes needed? The proposed implementation is ~10 times faster. For example, parsing 9 interval units on JDK 8: Before: ``` 9 units w/ interval 14004 14125 116 0.1 14003.6 0.0X 9 units w/o interval 13785 14056 290 0.1 13784.9 0.0X ``` After: ``` 9 units w/ interval 1343 1344 1 0.7 1343.0 0.3X 9 units w/o interval 1345 1349 8 0.7 1344.6 0.3X ``` ### Does this PR introduce any user-facing change? No ### How was this patch tested? - By new tests for `stringToInterval` in `IntervalUtilsSuite` - By existing tests Closes #26256 from MaxGekk/string-to-interval. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-07 12:39:52 +08:00
Wenchen Fan	cdea520ff8	[SPARK-29532][SQL] Simplify interval string parsing ### What changes were proposed in this pull request? Only use antlr4 to parse the interval string, and remove the duplicated parsing logic from `CalendarInterval`. ### Why are the changes needed? Simplify the code and fix inconsistent behaviors. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Pass the Jenkins with the updated test cases. Closes #26190 from cloud-fan/parser. Lead-authored-by: Wenchen Fan <wenchen@databricks.com> Co-authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-24 09:15:59 -07:00
Dongjoon Hyun	b91356e4c2	[SPARK-29533][SQL][TESTS][FOLLOWUP] Regenerate the result on EC2 ### What changes were proposed in this pull request? This is a follow-up of https://github.com/apache/spark/pull/26189 to regenerate the result on EC2. ### Why are the changes needed? This will be used for the other PR reviews. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? N/A. Closes #26233 from dongjoon-hyun/SPARK-29533. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: DB Tsai <d_tsai@apple.com>	2019-10-23 21:41:05 +00:00
Maxim Gekk	6ffec5e6a6	[SPARK-29533][SQL][TEST] Benchmark casting strings to intervals ### What changes were proposed in this pull request? Added new benchmark `IntervalBenchmark` to measure performance of interval related functions. In the PR, I added benchmarks for casting strings to interval. In particular, interval strings with `interval` prefix and without it because there is special code for this `da576a737c/common/unsafe/src/main/java/org/apache/spark/unsafe/types/CalendarInterval.java (L100-L103)` . And also I added benchmarks for different number of units in interval strings, for example 1 unit is `interval 10 years`, 2 units w/o interval is `10 years 5 months`, and etc. ### Why are the changes needed? - To find out current performance issues in casting to intervals - The benchmark can be used while refactoring/re-implementing `CalendarInterval.fromString()` or `CalendarInterval.fromCaseInsensitiveString()`. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By running the benchmark via the command: ```shell SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.IntervalBenchmark" ``` Closes #26189 from MaxGekk/interval-from-string-benchmark. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-10-22 10:47:04 +09:00

10 commits