spark-instrumented-optimizer/sql/core/benchmarks/CSVBenchmark-results.txt

================================================================================================
Benchmark to measure CSV read/write performance
================================================================================================
Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4
Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
Parsing quoted values:                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
One quoted string                                 36998          37134         120          0.0      739953.1       1.0X
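
The row above times reading a file whose every line is one long quoted field. A minimal sketch of that kind of read (not the benchmark's own code; the path and column name are hypothetical), runnable in a spark-shell session where spark and spark.implicits._ are already in scope:

  // Each input line is a single long quoted string field.
  val df = spark.read
    .schema("value STRING")              // explicit schema, so no inference pass
    .csv("/tmp/quoted-values.csv")       // hypothetical input path
  df.filter($"value".isNotNull).count()  // forces every row through the parser
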
Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4
Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
Wide rows with 1000 columns:              Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Select 1000 columns                              140620         141162         737          0.0      140620.5       1.0X
Select 100 columns                                35170          35287         183          0.0       35170.0       4.0X
Select one column                                 27711          27927         187          0.0       27710.9       5.1X
count()                                            7707           7804          84          0.1        7707.4      18.2X
Select 100 columns, one bad input field           41762          41851         117          0.0       41761.8       3.4X
Select 100 columns, corrupt record field          48717          48761          44          0.0       48717.4       2.9X
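
A hedged sketch of the access patterns behind the rows above, assuming a spark-shell session; the paths, column names, and INT types are illustrative, not the benchmark's actual setup:

  import org.apache.spark.sql.functions.col

  // Wide table: columns col0 ... col999.
  val fields = (0 until 1000).map(i => s"col$i INT").mkString(", ")
  val wide = spark.read.schema(fields).csv("/tmp/wide-1000-cols.csv")   // hypothetical path

  wide.select((0 until 100).map(i => col(s"col$i")): _*).count()   // "Select 100 columns"
  wide.select(col("col0")).count()                                 // "Select one column"
  wide.count()                                                     // "count()"

  // "corrupt record field": ask the reader to capture malformed lines in an extra column.
  val withCorrupt = spark.read
    .schema(s"_corrupt_record STRING, $fields")
    .option("columnNameOfCorruptRecord", "_corrupt_record")
    .csv("/tmp/wide-1000-cols.csv")
  val projected = (0 until 100).map(i => col(s"col$i")) :+ col("_corrupt_record")
  withCorrupt.select(projected: _*).count()
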
Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4
Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
Count a dataset with 10 columns:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Select 10 columns + count()                       16001          16053          53          0.6        1600.1       1.0X
Select 1 column + count()                         11571          11614          58          0.9        1157.1       1.4X
count()                                            4752           4766          18          2.1         475.2       3.4X
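
For the 10-column case, the three rows above roughly correspond to the following spark-shell sketch (path and schema are illustrative):

  val schema10 = (0 until 10).map(i => s"col$i INT").mkString(", ")
  val ds = spark.read.schema(schema10).csv("/tmp/ten-cols.csv")   // hypothetical path
  ds.select("*").count()      // "Select 10 columns + count()"
  ds.select("col0").count()   // "Select 1 column + count()"
  ds.count()                  // "count()": no column values are needed, hence the speed-up
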
Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4
Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
Write dates and timestamps:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Create a dataset of timestamps                     1070           1072           2          9.3         107.0       1.0X
to_csv(timestamp)                                 10446          10746         344          1.0        1044.6       0.1X
write timestamps to files                          9573           9659         101          1.0         957.3       0.1X
Create a dataset of dates                          1245           1260          17          8.0         124.5       0.9X
to_csv(date)                                       7157           7167          11          1.4         715.7       0.1X
write dates to files                               5415           5450          57          1.8         541.5       0.2X
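
A sketch of the write-side operations these rows time (not the benchmark source; paths and row counts are illustrative), again assuming a spark-shell session:

  import org.apache.spark.sql.functions.{struct, to_csv}

  // "Create a dataset of timestamps": derive a timestamp column from a range of longs.
  val ts = spark.range(0, 10 * 1000 * 1000).select($"id".cast("timestamp").as("ts"))
  ts.count()

  // "to_csv(timestamp)": format rows as CSV strings without touching the filesystem.
  ts.select(to_csv(struct($"ts"))).count()

  // "write timestamps to files": serialize the same data through the CSV data source.
  ts.write.mode("overwrite").csv("/tmp/csv-ts-out")

  // The date rows do the same after casting to DATE.
  val dates = ts.select($"ts".cast("date").as("d"))
  dates.select(to_csv(struct($"d"))).count()
  dates.write.mode("overwrite").csv("/tmp/csv-date-out")
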
Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4
Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
Read dates and timestamps:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read timestamp text from files                     1880           1887           8          5.3         188.0       1.0X
read timestamps from files                        27135          27180          43          0.4        2713.5       0.1X
infer timestamps from files                       51426          51534          97          0.2        5142.6       0.0X
read date text from files                          1618           1622           4          6.2         161.8       1.2X
read date from files                              20207          20218          13          0.5        2020.7       0.1X
infer date from files                             19418          19479          94          0.5        1941.8       0.1X
timestamp strings                                  2289           2300          13          4.4         228.9       0.8X
parse timestamps from Dataset[String]             29367          29391          24          0.3        2936.7       0.1X
infer timestamps from Dataset[String]             54782          54902         126          0.2        5478.2       0.0X
date strings                                       2508           2524          16          4.0         250.8       0.7X
parse dates from Dataset[String]                  21884          21902          19          0.5        2188.4       0.1X
from_csv(timestamp)                               27188          27723         477          0.4        2718.8       0.1X
from_csv(date)                                    21137          21191          84          0.5        2113.7       0.1X
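
A hedged sketch of the read-side variants in this table, assuming a spark-shell session; paths and schemas are illustrative (for example, the files written in the sketch above):

  import org.apache.spark.sql.functions.from_csv
  import org.apache.spark.sql.types.{StructType, TimestampType}

  // "read timestamps from files" vs. "infer timestamps from files"
  spark.read.schema("ts TIMESTAMP").csv("/tmp/csv-ts-out").count()
  spark.read.option("inferSchema", true).csv("/tmp/csv-ts-out").count()

  // "parse timestamps from Dataset[String]": parse in-memory strings instead of files.
  val lines = spark.read.text("/tmp/csv-ts-out").as[String]
  spark.read.schema("ts TIMESTAMP").csv(lines).count()

  // "from_csv(timestamp)": per-column parsing with the from_csv expression.
  val tsSchema = new StructType().add("ts", TimestampType)
  lines.select(from_csv($"value", tsSchema, Map.empty[String, String])).count()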

[SPARK-26378][SQL] Restore performance of queries against wide CSV/JSON tables

## What changes were proposed in this pull request?

After [recent changes](https://github.com/apache/spark/commit/11e5f1bcd49eec8ab4225d6e68a051b5c6a21cb2) to CSV parsing to return partial results for bad CSV records, queries of wide CSV tables slowed considerably. That recent change resulted in every row being recreated, even when the associated input record had no parsing issues and the user specified no corrupt record field in his/her schema. The change to FailureSafeParser.scala also impacted queries against wide JSON tables as well.

In this PR, I propose that a row should be recreated only if columns need to be shifted due to the existence of a corrupt column field in the user-supplied schema. Otherwise, the code should use the row as-is (For CSV input, it will have values for the columns that could be converted, and also null values for columns that could not be converted). See benchmarks below.

The CSV benchmark for 1000 columns went from 120144 ms to 89069 ms, a savings of 25% (this only brings the cost down to baseline levels. Again, see benchmarks below).

Similarly, the JSON benchmark for 1000 columns (added in this PR) went from 109621 ms to 80871 ms, also a savings of 25%.

Still, partial results functionality is preserved:

<pre>
bash-3.2$ cat test2.csv
"hello",1999-08-01,"last"
"there","bad date","field"
"again","2017-11-22","in file"
bash-3.2$ bin/spark-shell
...etc...
scala> val df = spark.read.schema("a string, b date, c string").csv("test2.csv")
df: org.apache.spark.sql.DataFrame = [a: string, b: date ... 1 more field]
scala> df.show
+-----+----------+-------+
|    a|         b|      c|
+-----+----------+-------+
|hello|1999-08-01|   last|
|there|      null|  field|
|again|2017-11-22|in file|
+-----+----------+-------+

scala> val df = spark.read.schema("badRecord string, a string, b date, c string").
     | option("columnNameOfCorruptRecord", "badRecord").
     | csv("test2.csv")
df: org.apache.spark.sql.DataFrame = [badRecord: string, a: string ... 2 more fields]
scala> df.show
+--------------------+-----+----------+-------+
|           badRecord|    a|         b|      c|
+--------------------+-----+----------+-------+
|                null|hello|1999-08-01|   last|
|"there","bad date...|there|      null|  field|
|                null|again|2017-11-22|in file|
+--------------------+-----+----------+-------+

scala>
</pre>

### CSVBenchmark

Benchmarks:
baseline = commit before partial results change
PR = this PR
master = master branch

[baseline_CSVBenchmark-results.txt](https://github.com/apache/spark/files/2697109/baseline_CSVBenchmark-results.txt)
[pr_CSVBenchmark-results.txt](https://github.com/apache/spark/files/2697110/pr_CSVBenchmark-results.txt)
[master_CSVBenchmark-results.txt](https://github.com/apache/spark/files/2697111/master_CSVBenchmark-results.txt)

### JSONBenchmark

Benchmarks:
baseline = commit before partial results change
PR = this PR
master = master branch

[baseline_JSONBenchmark-results.txt](https://github.com/apache/spark/files/2711040/baseline_JSONBenchmark-results.txt)
[pr_JSONBenchmark-results.txt](https://github.com/apache/spark/files/2711041/pr_JSONBenchmark-results.txt)
[master_JSONBenchmark-results.txt](https://github.com/apache/spark/files/2711042/master_JSONBenchmark-results.txt)

## How was this patch tested?

- All SQL unit tests.
- Added 2 CSV benchmarks
- Python core and SQL tests

Closes #23336 from bersprockets/csv-wide-row-opt2.

Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-01-30 02:15:29 -05:00
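
The row-recreation rule described in the commit message above, reduced to a simplified, self-contained Scala illustration; this is not the actual FailureSafeParser code, and the names and types are invented for the sketch:

  // Decide once, from the user-supplied schema, whether a corrupt-record column exists.
  // Only then do parsed values need to be copied into a wider, shifted row.
  case class ParsedRow(values: Array[Any], badRecord: Option[String])

  def toOutputRow(parsed: ParsedRow, corruptFieldIndex: Option[Int]): Array[Any] =
    corruptFieldIndex match {
      case None =>
        // No corrupt-record column requested: reuse the parsed row as-is
        // (columns that failed to convert are already null), avoiding a per-row copy.
        parsed.values
      case Some(idx) =>
        // Corrupt-record column present: build a wider row and shift the data columns
        // so the raw record text can occupy the corrupt column's slot.
        val out = new Array[Any](parsed.values.length + 1)
        var i = 0
        var j = 0
        while (j < out.length) {
          if (j == idx) out(j) = parsed.badRecord.orNull
          else { out(j) = parsed.values(i); i += 1 }
          j += 1
        }
        out
    }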