================================================================================================
Benchmark to measure CSV read/write performance
================================================================================================

OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Parsing quoted values:                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
One quoted string                                 45457          45731         344          0.0      909136.8       1.0X

OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Wide rows with 1000 columns:              Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Select 1000 columns                              129646         130527        1412          0.0      129646.3       1.0X
Select 100 columns                                42444          42551         119          0.0       42444.0       3.1X
Select one column                                 35415          35428          20          0.0       35414.6       3.7X
count()                                           11114          11128          16          0.1       11113.6      11.7X
Select 100 columns, one bad input field           93353          93670         275          0.0       93352.6       1.4X
Select 100 columns, corrupt record field         113569         113952         373          0.0      113568.8       1.1X

OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Count a dataset with 10 columns:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Select 10 columns + count()                       18498          18589          87          0.5        1849.8       1.0X
Select 1 column + count()                         11078          11095          27          0.9        1107.8       1.7X
count()                                            3928           3950          22          2.5         392.8       4.7X

OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Write dates and timestamps:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Create a dataset of timestamps                     1933           1940          11          5.2         193.3       1.0X
to_csv(timestamp)                                 18078          18243         255          0.6        1807.8       0.1X
write timestamps to files                         12668          12786         134          0.8        1266.8       0.2X
Create a dataset of dates                          2196           2201           5          4.6         219.6       0.9X
to_csv(date)                                       9583           9597          21          1.0         958.3       0.2X
write dates to files                               7091           7110          20          1.4         709.1       0.3X

OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Read dates and timestamps:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read timestamp text from files                     2166           2177          10          4.6         216.6       1.0X
read timestamps from files                        53212          53402         281          0.2        5321.2       0.0X
infer timestamps from files                      109788         110372         570          0.1       10978.8       0.0X
read date text from files                          1921           1929           8          5.2         192.1       1.1X
read date from files                              25470          25499          25          0.4        2547.0       0.1X
infer date from files                             27201          27342         134          0.4        2720.1       0.1X
timestamp strings                                  3638           3653          19          2.7         363.8       0.6X
parse timestamps from Dataset[String]             61894          62532         555          0.2        6189.4       0.0X
infer timestamps from Dataset[String]            125171         125430         236          0.1       12517.1       0.0X
date strings                                       3736           3749          14          2.7         373.6       0.6X
parse dates from Dataset[String]                  30787          30829          43          0.3        3078.7       0.1X
from_csv(timestamp)                               60842          61035         209          0.2        6084.2       0.0X
from_csv(date)                                    30123          30196          95          0.3        3012.3       0.1X

[SPARK-26378][SQL] Restore performance of queries against wide CSV/JSON tables
## What changes were proposed in this pull request?
After [recent changes](https://github.com/apache/spark/commit/11e5f1bcd49eec8ab4225d6e68a051b5c6a21cb2) to CSV parsing to return partial results for bad CSV records, queries of wide CSV tables slowed considerably. That change caused every row to be recreated, even when the associated input record had no parsing issues and the user specified no corrupt record field in their schema.
The change to FailureSafeParser.scala also impacted queries against wide JSON tables.
In this PR, I propose that a row should be recreated only if columns need to be shifted due to the existence of a corrupt column field in the user-supplied schema. Otherwise, the code should use the row as-is. (For CSV input, it will have values for the columns that could be converted and null values for the columns that could not.)
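The rule proposed above can be sketched outside Spark. This is a minimal Python illustration of the idea only, not the code in FailureSafeParser.scala; `to_result_row` and all of its parameters are hypothetical names chosen for this example.

```python
from typing import Any, List, Optional, Sequence


def to_result_row(
    parsed: List[Any],
    schema_fields: Sequence[str],
    corrupt_field: Optional[str],
    bad_record: Optional[str],
) -> List[Any]:
    """Return the output row for one parsed record (hypothetical helper).

    `parsed` holds one value per data column, with None where a value
    could not be converted. `bad_record` is the raw input text when the
    record had a parsing problem, else None.
    """
    if corrupt_field is None or corrupt_field not in schema_fields:
        # Fast path: no columns need shifting, so reuse the row as-is.
        return parsed
    # Slow path: rebuild the row so that the corrupt-record column
    # lands at its position in the user-supplied schema.
    idx = list(schema_fields).index(corrupt_field)
    return parsed[:idx] + [bad_record] + parsed[idx:]


row = ["there", None, "field"]
reused = to_result_row(row, ["a", "b", "c"], None, None)
shifted = to_result_row(
    row,
    ["badRecord", "a", "b", "c"],
    "badRecord",
    '"there","bad date","field"',
)
```

In the sketch, only the second call allocates a new row; the first returns the same object it was given, which is the saving the proposal is after for wide tables.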
See benchmarks below. The CSV benchmark for 1000 columns went from 120144 ms to 89069 ms, a savings of about 26%. This only brings the cost back down to baseline levels; again, see benchmarks below.
Similarly, the JSON benchmark for 1000 columns (added in this PR) went from 109621 ms to 80871 ms, also a savings of about 26%.
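As a quick check, the savings can be recomputed from the raw timings quoted above. `savings_percent` is a hypothetical helper written for this illustration; the four timings are the only values taken from the text.

```python
def savings_percent(before_ms: float, after_ms: float) -> float:
    """Relative time saved, as a percentage of the original time."""
    return (before_ms - after_ms) / before_ms * 100.0


# Timings for the 1000-column benchmarks, in milliseconds.
csv_savings = savings_percent(120144, 89069)
json_savings = savings_percent(109621, 80871)

print(f"CSV savings:  {csv_savings:.1f}%")
print(f"JSON savings: {json_savings:.1f}%")
```

Both values come out at roughly a quarter of the original run time.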
Still, partial results functionality is preserved:
<pre>
bash-3.2$ cat test2.csv
"hello",1999-08-01,"last"
"there","bad date","field"
"again","2017-11-22","in file"
bash-3.2$ bin/spark-shell
...etc...
scala> val df = spark.read.schema("a string, b date, c string").csv("test2.csv")
df: org.apache.spark.sql.DataFrame = [a: string, b: date ... 1 more field]
scala> df.show
+-----+----------+-------+
| a| b| c|
+-----+----------+-------+
|hello|1999-08-01| last|
|there| null| field|
|again|2017-11-22|in file|
+-----+----------+-------+
scala> val df = spark.read.schema("badRecord string, a string, b date, c string").
| option("columnNameOfCorruptRecord", "badRecord").
| csv("test2.csv")
df: org.apache.spark.sql.DataFrame = [badRecord: string, a: string ... 2 more fields]
scala> df.show
+--------------------+-----+----------+-------+
| badRecord| a| b| c|
+--------------------+-----+----------+-------+
| null|hello|1999-08-01| last|
|"there","bad date...|there| null| field|
| null|again|2017-11-22|in file|
+--------------------+-----+----------+-------+
scala>
</pre>
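The partial-results behaviour in the session above can be imitated with the Python standard library alone. This is a sketch under stated assumptions, not Spark's parser: `parse_date` and `read_rows` are hypothetical helpers, the schema is fixed to `a string, b date, c string`, and a date that fails to parse is the only kind of bad field handled.

```python
import csv
from datetime import date, datetime
from typing import Any, List, Optional

# Same three records as test2.csv in the session above.
DATA = '''"hello",1999-08-01,"last"
"there","bad date","field"
"again","2017-11-22","in file"
'''


def parse_date(text: str) -> Optional[date]:
    """Return a date, or None when the text is not in yyyy-MM-dd form."""
    try:
        return datetime.strptime(text, "%Y-%m-%d").date()
    except ValueError:
        return None


def read_rows(text: str, with_corrupt_column: bool = False) -> List[List[Any]]:
    """Parse each record, keeping the fields that converted cleanly."""
    rows = []
    for line in text.splitlines():
        a, b, c = next(csv.reader([line]))
        parsed = [a, parse_date(b), c]
        if with_corrupt_column:
            # Keep the raw record only when a field failed to convert.
            bad = line if parsed[1] is None else None
            parsed = [bad] + parsed
        rows.append(parsed)
    return rows


plain = read_rows(DATA)
with_bad = read_rows(DATA, with_corrupt_column=True)
```

Here `plain` corresponds to the first `df.show` output and `with_bad` to the second, with `None` standing in for `null`.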
### CSVBenchmark Benchmarks:
baseline = commit before partial results change
PR = this PR
master = master branch
[baseline_CSVBenchmark-results.txt](https://github.com/apache/spark/files/2697109/baseline_CSVBenchmark-results.txt)
[pr_CSVBenchmark-results.txt](https://github.com/apache/spark/files/2697110/pr_CSVBenchmark-results.txt)
[master_CSVBenchmark-results.txt](https://github.com/apache/spark/files/2697111/master_CSVBenchmark-results.txt)
### JSONBenchmark Benchmarks:
baseline = commit before partial results change
PR = this PR
master = master branch
[baseline_JSONBenchmark-results.txt](https://github.com/apache/spark/files/2711040/baseline_JSONBenchmark-results.txt)
[pr_JSONBenchmark-results.txt](https://github.com/apache/spark/files/2711041/pr_JSONBenchmark-results.txt)
[master_JSONBenchmark-results.txt](https://github.com/apache/spark/files/2711042/master_JSONBenchmark-results.txt)
## How was this patch tested?
- All SQL unit tests.
- Added 2 CSV benchmarks
- Python core and SQL tests
Closes #23336 from bersprockets/csv-wide-row-opt2.
Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-01-30 02:15:29 -05:00

OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Filters pushdown:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
w/o filters                                       28985          29042          80          0.0      289852.9       1.0X
pushdown disabled                                 29080          29146          58          0.0      290799.4       1.0X
w/ filters                                         2072           2084          17          0.0       20722.3      14.0X