spark-instrumented-optimizer/sql/core/benchmarks/JsonBenchmark-jdk11-results.txt
Maxim Gekk c1f160e097 [SPARK-30648][SQL] Support filters pushdown in JSON datasource
### What changes were proposed in this pull request?
In the PR, I propose to support pushed down filters in JSON datasource. The reason of pushing a filter up to `JacksonParser` is to apply the filter as soon as all its attributes become available i.e. converted from JSON field values to desired values according to the schema. This allows to skip parsing of the rest of JSON record and conversions of other values if the filter returns `false`. This can improve performance when pushed filters are highly selective and conversion of JSON string fields to desired values are comparably expensive ( for example, the conversion to `TIMESTAMP` values).

The main idea behind of `JsonFilters` is to group pushdown filters by their references, convert the grouped filters to expressions, and then compile to predicates. The predicates are indexed by schema field positions. Each predicate has a state with reference counter to non-set row fields. As soon as the counter reaches `0`, it can be applied to the row because all its dependencies has been set. Before processing new row, predicate's reference counter is reset to total number of predicate references (dependencies in a row).

The common code shared between `CSVFilters` and `JsonFilters` is moved to the `StructFilters` class and its companion object.

### Why are the changes needed?
The changes improve performance on synthetic benchmarks up to **27 times** on JDK 8 and **25** times on JDK 11:
```
OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1044-aws
Intel(R) Xeon(R) CPU E5-2670 v2  2.50GHz
Filters pushdown:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
w/o filters                                       25230          25255          22          0.0      252299.6       1.0X
pushdown disabled                                 25248          25282          33          0.0      252475.6       1.0X
w/ filters                                          905            911           8          0.1        9047.9      27.9X
```

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
- Added new test suites `JsonFiltersSuite` and `JacksonParserSuite`.
- By new end-to-end and case sensitivity tests in `JsonSuite`.
- By `CSVFiltersSuite`, `UnivocityParserSuite` and `CSVSuite`.
- Re-running `CSVBenchmark` and `JsonBenchmark` using Amazon EC2:

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge (spot instance) |
| AMI | ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) |
| Java | OpenJDK8/11 installed by`sudo add-apt-repository ppa:openjdk-r/ppa` & `sudo apt install openjdk-11-jdk`|

and `./dev/run-benchmarks`:
```python
#!/usr/bin/env python3

import os
from sparktestsupport.shellutils import run_cmd

benchmarks = [
    ['sql/test', 'org.apache.spark.sql.execution.datasources.csv.CSVBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.datasources.json.JsonBenchmark']
]

print('Set SPARK_GENERATE_BENCHMARK_FILES=1')
os.environ['SPARK_GENERATE_BENCHMARK_FILES'] = '1'

for b in benchmarks:
    print("Run benchmark: %s" % b[1])
    run_cmd(['build/sbt', '%s:runMain %s' % (b[0], b[1])])
```

Closes #27366 from MaxGekk/json-filters-pushdown.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-17 00:01:13 +09:00

121 lines
10 KiB
Plaintext

================================================================================================
Benchmark for performance of JSON parsing
================================================================================================
Preparing data for benchmarking ...
OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
JSON schema inferring: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
No encoding 70753 71127 471 1.4 707.5 1.0X
UTF-8 is set 128105 129183 1165 0.8 1281.1 0.6X
Preparing data for benchmarking ...
OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
count a short column: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
No encoding 59588 59643 73 1.7 595.9 1.0X
UTF-8 is set 97081 97122 62 1.0 970.8 0.6X
Preparing data for benchmarking ...
OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
count a wide column: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
No encoding 58835 59259 659 0.2 5883.5 1.0X
UTF-8 is set 103117 103218 88 0.1 10311.7 0.6X
Preparing data for benchmarking ...
OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
select wide row: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
No encoding 142993 143485 436 0.0 285985.3 1.0X
UTF-8 is set 165446 165496 60 0.0 330892.4 0.9X
Preparing data for benchmarking ...
OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Select a subset of 10 columns: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Select 10 columns 21557 21593 61 0.5 2155.7 1.0X
Select 1 column 24197 24236 35 0.4 2419.7 0.9X
Preparing data for benchmarking ...
OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
creation of JSON parser per line: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Short column without encoding 9795 9820 29 1.0 979.5 1.0X
Short column with UTF-8 16442 16536 146 0.6 1644.2 0.6X
Wide column without encoding 99134 99475 300 0.1 9913.4 0.1X
Wide column with UTF-8 155913 156369 692 0.1 15591.3 0.1X
Preparing data for benchmarking ...
OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
JSON functions: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Text read 671 679 7 14.9 67.1 1.0X
from_json 25356 25432 79 0.4 2535.6 0.0X
json_tuple 29464 29927 672 0.3 2946.4 0.0X
get_json_object 21841 21877 32 0.5 2184.1 0.0X
Preparing data for benchmarking ...
OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Dataset of json strings: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Text read 3109 3116 12 16.1 62.2 1.0X
schema inferring 28751 28765 15 1.7 575.0 0.1X
parsing 34923 35030 151 1.4 698.5 0.1X
Preparing data for benchmarking ...
OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Json files in the per-line mode: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Text read 10787 10818 32 4.6 215.7 1.0X
Schema inferring 49577 49775 184 1.0 991.5 0.2X
Parsing without charset 35343 35433 87 1.4 706.9 0.3X
Parsing with UTF-8 60253 60290 35 0.8 1205.1 0.2X
OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Write dates and timestamps: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Create a dataset of timestamps 2200 2209 8 4.5 220.0 1.0X
to_json(timestamp) 18410 18602 264 0.5 1841.0 0.1X
write timestamps to files 11841 12032 305 0.8 1184.1 0.2X
Create a dataset of dates 2353 2363 9 4.3 235.3 0.9X
to_json(date) 12135 12182 72 0.8 1213.5 0.2X
write dates to files 6776 6801 33 1.5 677.6 0.3X
OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Read dates and timestamps: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
read timestamp text from files 2563 2580 20 3.9 256.3 1.0X
read timestamps from files 41261 41360 97 0.2 4126.1 0.1X
infer timestamps from files 92292 92517 243 0.1 9229.2 0.0X
read date text from files 2332 2340 11 4.3 233.2 1.1X
read date from files 18753 18768 13 0.5 1875.3 0.1X
timestamp strings 3108 3123 13 3.2 310.8 0.8X
parse timestamps from Dataset[String] 51078 51448 323 0.2 5107.8 0.1X
infer timestamps from Dataset[String] 101373 101429 65 0.1 10137.3 0.0X
date strings 4126 4138 15 2.4 412.6 0.6X
parse dates from Dataset[String] 29365 29398 30 0.3 2936.5 0.1X
from_json(timestamp) 67033 67098 63 0.1 6703.3 0.0X
from_json(date) 44495 44581 125 0.2 4449.5 0.1X
OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Filters pushdown: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
w/o filters 30167 30223 48 0.0 301674.9 1.0X
pushdown disabled 30291 30311 30 0.0 302914.8 1.0X
w/ filters 901 915 14 0.1 9012.4 33.5X