================================================================================================
Benchmark to measure CSV read/write performance
================================================================================================

OpenJDK 64-Bit Server VM 1.8.0_222-b10 on Linux 3.10.0-862.3.2.el7.x86_64
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Parsing quoted values:                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
One quoted string                                 62603          62755         133          0.0     1252055.6       1.0X

OpenJDK 64-Bit Server VM 1.8.0_222-b10 on Linux 3.10.0-862.3.2.el7.x86_64
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Wide rows with 1000 columns:              Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Select 1000 columns                              225032         225919         782          0.0      225031.7       1.0X
Select 100 columns                                51982          52290         286          0.0       51982.1       4.3X
Select one column                                 40167          40283         133          0.0       40167.4       5.6X
count()                                           11435          11593         176          0.1       11435.1      19.7X
Select 100 columns, one bad input field           66864          66968         174          0.0       66864.1       3.4X
Select 100 columns, corrupt record field          79570          80418        1080          0.0       79569.5       2.8X

OpenJDK 64-Bit Server VM 1.8.0_222-b10 on Linux 3.10.0-862.3.2.el7.x86_64
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Count a dataset with 10 columns:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Select 10 columns + count()                       23271          23389         103          0.4        2327.1       1.0X
Select 1 column + count()                         18206          19772         NaN          0.5        1820.6       1.3X
count()                                            8500           8521          18          1.2         850.0       2.7X

OpenJDK 64-Bit Server VM 1.8.0_222-b10 on Linux 3.10.0-862.3.2.el7.x86_64
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Write dates and timestamps:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Create a dataset of timestamps                     2025           2068          66          4.9         202.5       1.0X
to_csv(timestamp)                                 22192          22983         879          0.5        2219.2       0.1X
write timestamps to files                         15949          16030          72          0.6        1594.9       0.1X
Create a dataset of dates                          2200           2234          32          4.5         220.0       0.9X
to_csv(date)                                      18268          18341          73          0.5        1826.8       0.1X
write dates to files                              10495          10722         214          1.0        1049.5       0.2X

OpenJDK 64-Bit Server VM 1.8.0_222-b10 on Linux 3.10.0-862.3.2.el7.x86_64
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Read dates and timestamps:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read timestamp text from files                     6491           6503          18          1.5         649.1       1.0X
read timestamps from files                        56069          56795         874          0.2        5606.9       0.1X
infer timestamps from files                      113383         114203         825          0.1       11338.3       0.1X
read date text from files                          6411           6419          10          1.6         641.1       1.0X
read date from files                              46245          46371         138          0.2        4624.5       0.1X
infer date from files                             43623          43906         291          0.2        4362.3       0.1X
timestamp strings                                  4951           4959           7          2.0         495.1       1.3X
parse timestamps from Dataset[String]             65786          66309         663          0.2        6578.6       0.1X
infer timestamps from Dataset[String]            130891         133861        1928          0.1       13089.1       0.0X
date strings                                       3814           3895          84          2.6         381.4       1.7X
parse dates from Dataset[String]                  52259          52960         614          0.2        5225.9       0.1X
from_csv(timestamp)                               63013          63306         291          0.2        6301.3       0.1X
from_csv(date)                                    49840          52352         NaN          0.2        4984.0       0.1X

[SPARK-26378][SQL] Restore performance of queries against wide CSV/JSON tables
## What changes were proposed in this pull request?
After a [recent change](https://github.com/apache/spark/commit/11e5f1bcd49eec8ab4225d6e68a051b5c6a21cb2) to CSV parsing that returns partial results for bad CSV records, queries against wide CSV tables slowed considerably. That change caused every row to be recreated, even when the associated input record had no parsing issues and the user-supplied schema specified no corrupt record field.
The change to FailureSafeParser.scala impacted queries against wide JSON tables as well.
In this PR, I propose that a row be recreated only when columns need to be shifted because the user-supplied schema contains a corrupt record field. Otherwise, the code uses the row as-is (for CSV input, the row holds values for the columns that could be converted and nulls for the columns that could not).
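Roughly, the new logic in FailureSafeParser looks like this (a condensed sketch using the member names from FailureSafeParser.scala, not the verbatim code):
<pre>
private val toResultRow: (Option[InternalRow], () => UTF8String) => InternalRow = {
  if (corruptFieldIndex.isDefined) {
    // A corrupt record column exists in the user-supplied schema: copy each
    // converted value into its shifted position in a fresh result row and
    // store the raw record text in the corrupt record column.
    (row, badRecord) => {
      var i = 0
      while (i < actualSchema.length) {
        val from = actualSchema(i)
        resultRow(schema.fieldIndex(from.name)) = row.map(_.get(i, from.dataType)).orNull
        i += 1
      }
      resultRow(corruptFieldIndex.get) = badRecord()
      resultRow
    }
  } else {
    // No corrupt record column: return the parsed row as-is. Fields that
    // could not be converted are already null, so the per-row copy is
    // avoided entirely.
    (row, _) => row.getOrElse(nullResult)
  }
}
</pre>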
See the benchmarks below: the CSV benchmark for 1000 columns went from 120144 ms to 89069 ms, a savings of about 26% (note that this only brings the cost back down to baseline levels).
Similarly, the JSON benchmark for 1000 columns (added in this PR) went from 109621 ms to 80871 ms, also a savings of about 26%.
Partial results functionality is still preserved:
<pre>
bash-3.2$ cat test2.csv
"hello",1999-08-01,"last"
"there","bad date","field"
"again","2017-11-22","in file"
bash-3.2$ bin/spark-shell
...etc...
scala> val df = spark.read.schema("a string, b date, c string").csv("test2.csv")
df: org.apache.spark.sql.DataFrame = [a: string, b: date ... 1 more field]
scala> df.show
+-----+----------+-------+
| a| b| c|
+-----+----------+-------+
|hello|1999-08-01| last|
|there| null| field|
|again|2017-11-22|in file|
+-----+----------+-------+
scala> val df = spark.read.schema("badRecord string, a string, b date, c string").
| option("columnNameOfCorruptRecord", "badRecord").
| csv("test2.csv")
df: org.apache.spark.sql.DataFrame = [badRecord: string, a: string ... 2 more fields]
scala> df.show
+--------------------+-----+----------+-------+
| badRecord| a| b| c|
+--------------------+-----+----------+-------+
| null|hello|1999-08-01| last|
|"there","bad date...|there| null| field|
| null|again|2017-11-22|in file|
+--------------------+-----+----------+-------+
scala>
</pre>
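The same FailureSafeParser path serves the JSON reader, so the corrupt record behavior above carries over to JSON. A sketch (test2.json is hypothetical here, mirroring test2.csv above):
<pre>
scala> // Hypothetical JSON analogue of the CSV example above; the corrupt
scala> // record column must be declared in the user-supplied schema.
scala> val jdf = spark.read.schema("badRecord string, a string, b date, c string").
     | option("columnNameOfCorruptRecord", "badRecord").
     | json("test2.json")
</pre>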
### CSVBenchmark results:
baseline = commit before partial results change
PR = this PR
master = master branch
[baseline_CSVBenchmark-results.txt](https://github.com/apache/spark/files/2697109/baseline_CSVBenchmark-results.txt)
[pr_CSVBenchmark-results.txt](https://github.com/apache/spark/files/2697110/pr_CSVBenchmark-results.txt)
[master_CSVBenchmark-results.txt](https://github.com/apache/spark/files/2697111/master_CSVBenchmark-results.txt)
### JSONBenchmark results:
baseline = commit before partial results change
PR = this PR
master = master branch
[baseline_JSONBenchmark-results.txt](https://github.com/apache/spark/files/2711040/baseline_JSONBenchmark-results.txt)
[pr_JSONBenchmark-results.txt](https://github.com/apache/spark/files/2711041/pr_JSONBenchmark-results.txt)
[master_JSONBenchmark-results.txt](https://github.com/apache/spark/files/2711042/master_JSONBenchmark-results.txt)
## How was this patch tested?
- All SQL unit tests.
- Two CSV benchmarks added in this PR (a sketch of one case follows below).
- Python core and SQL tests.
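For reference, a condensed sketch of what one of the added benchmark cases looks like (illustrative only; rowsNum, schema, path, and output are assumed to be in scope, as in CSVBenchmark.scala):
<pre>
import org.apache.spark.benchmark.Benchmark
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col

// Measure selecting a 100-column subset of a 1000-column CSV dataset.
val benchmark = new Benchmark("Wide rows with 1000 columns", rowsNum, output = output)
benchmark.addCase("Select 100 columns", numIters = 3) { _ =>
  spark.read.schema(schema).csv(path)
    .select((0 until 100).map(i => col(s"col$i")): _*)
    .filter((_: Row) => true)   // force a full scan without collecting rows
    .count()
}
benchmark.run()
</pre>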
Closes #23336 from bersprockets/csv-wide-row-opt2.
Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>