spark-instrumented-optimizer/sql/core/benchmarks/CSVBenchmark-results.txt

================================================================================================
Benchmark to measure CSV read/write performance
================================================================================================
Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4
Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
Parsing quoted values:                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
One quoted string                                 36998          37134         120          0.0      739953.1       1.0X
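
The row above times reading a file whose every line is one long quoted field. A minimal sketch of that kind of read (not the benchmark's own code; the path and column name are hypothetical), runnable in a spark-shell session where spark and spark.implicits._ are already in scope:

  // Each input line is a single long quoted string field.
  val df = spark.read
    .schema("value STRING")              // explicit schema, so no inference pass
    .csv("/tmp/quoted-values.csv")       // hypothetical input path
  df.filter($"value".isNotNull).count()  // forces every row through the parser
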
Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4
Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
Wide rows with 1000 columns:              Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Select 1000 columns                              140620         141162         737          0.0      140620.5       1.0X
Select 100 columns                                35170          35287         183          0.0       35170.0       4.0X
Select one column                                 27711          27927         187          0.0       27710.9       5.1X
count()                                            7707           7804          84          0.1        7707.4      18.2X
Select 100 columns, one bad input field           41762          41851         117          0.0       41761.8       3.4X
Select 100 columns, corrupt record field          48717          48761          44          0.0       48717.4       2.9X
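
A hedged sketch of the access patterns behind the rows above, assuming a spark-shell session; the paths, column names, and INT types are illustrative, not the benchmark's actual setup:

  import org.apache.spark.sql.functions.col

  // Wide table: columns col0 ... col999.
  val fields = (0 until 1000).map(i => s"col$i INT").mkString(", ")
  val wide = spark.read.schema(fields).csv("/tmp/wide-1000-cols.csv")   // hypothetical path

  wide.select((0 until 100).map(i => col(s"col$i")): _*).count()   // "Select 100 columns"
  wide.select(col("col0")).count()                                 // "Select one column"
  wide.count()                                                     // "count()"

  // "corrupt record field": ask the reader to capture malformed lines in an extra column.
  val withCorrupt = spark.read
    .schema(s"_corrupt_record STRING, $fields")
    .option("columnNameOfCorruptRecord", "_corrupt_record")
    .csv("/tmp/wide-1000-cols.csv")
  val projected = (0 until 100).map(i => col(s"col$i")) :+ col("_corrupt_record")
  withCorrupt.select(projected: _*).count()
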
Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4
Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
Count a dataset with 10 columns:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Select 10 columns + count()                       16001          16053          53          0.6        1600.1       1.0X
Select 1 column + count()                         11571          11614          58          0.9        1157.1       1.4X
count()                                            4752           4766          18          2.1         475.2       3.4X
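
For the 10-column case, the three rows above roughly correspond to the following spark-shell sketch (path and schema are illustrative):

  val schema10 = (0 until 10).map(i => s"col$i INT").mkString(", ")
  val ds = spark.read.schema(schema10).csv("/tmp/ten-cols.csv")   // hypothetical path
  ds.select("*").count()      // "Select 10 columns + count()"
  ds.select("col0").count()   // "Select 1 column + count()"
  ds.count()                  // "count()": no column values are needed, hence the speed-up
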
Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4
Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
Write dates and timestamps:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Create a dataset of timestamps                     1070           1072           2          9.3         107.0       1.0X
to_csv(timestamp)                                 10446          10746         344          1.0        1044.6       0.1X
write timestamps to files                          9573           9659         101          1.0         957.3       0.1X
Create a dataset of dates                          1245           1260          17          8.0         124.5       0.9X
to_csv(date)                                       7157           7167          11          1.4         715.7       0.1X
write dates to files                               5415           5450          57          1.8         541.5       0.2X
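
A sketch of the write-side operations these rows time (not the benchmark source; paths and row counts are illustrative), again assuming a spark-shell session:

  import org.apache.spark.sql.functions.{struct, to_csv}

  // "Create a dataset of timestamps": derive a timestamp column from a range of longs.
  val ts = spark.range(0, 10 * 1000 * 1000).select($"id".cast("timestamp").as("ts"))
  ts.count()

  // "to_csv(timestamp)": format rows as CSV strings without touching the filesystem.
  ts.select(to_csv(struct($"ts"))).count()

  // "write timestamps to files": serialize the same data through the CSV data source.
  ts.write.mode("overwrite").csv("/tmp/csv-ts-out")

  // The date rows do the same after casting to DATE.
  val dates = ts.select($"ts".cast("date").as("d"))
  dates.select(to_csv(struct($"d"))).count()
  dates.write.mode("overwrite").csv("/tmp/csv-date-out")
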
Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4
Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
Read dates and timestamps:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read timestamp text from files                     1880           1887           8          5.3         188.0       1.0X
read timestamps from files                        27135          27180          43          0.4        2713.5       0.1X
infer timestamps from files                       51426          51534          97          0.2        5142.6       0.0X
read date text from files                          1618           1622           4          6.2         161.8       1.2X
read date from files                              20207          20218          13          0.5        2020.7       0.1X
infer date from files                             19418          19479          94          0.5        1941.8       0.1X
timestamp strings                                  2289           2300          13          4.4         228.9       0.8X
parse timestamps from Dataset[String]             29367          29391          24          0.3        2936.7       0.1X
infer timestamps from Dataset[String]             54782          54902         126          0.2        5478.2       0.0X
date strings                                       2508           2524          16          4.0         250.8       0.7X
parse dates from Dataset[String]                  21884          21902          19          0.5        2188.4       0.1X
from_csv(timestamp)                               27188          27723         477          0.4        2718.8       0.1X
from_csv(date)                                    21137          21191          84          0.5        2113.7       0.1X
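
A hedged sketch of the read-side variants in this table, assuming a spark-shell session; paths and schemas are illustrative (for example, the files written in the sketch above):

  import org.apache.spark.sql.functions.from_csv
  import org.apache.spark.sql.types.{StructType, TimestampType}

  // "read timestamps from files" vs. "infer timestamps from files"
  spark.read.schema("ts TIMESTAMP").csv("/tmp/csv-ts-out").count()
  spark.read.option("inferSchema", true).csv("/tmp/csv-ts-out").count()

  // "parse timestamps from Dataset[String]": parse in-memory strings instead of files.
  val lines = spark.read.text("/tmp/csv-ts-out").as[String]
  spark.read.schema("ts TIMESTAMP").csv(lines).count()

  // "from_csv(timestamp)": per-column parsing with the from_csv expression.
  val tsSchema = new StructType().add("ts", TimestampType)
  lines.select(from_csv($"value", tsSchema, Map.empty[String, String])).count()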

[SPARK-26378][SQL] Restore performance of queries against wide CSV/JSON tables

## What changes were proposed in this pull request?

After [recent changes](https://github.com/apache/spark/commit/11e5f1bcd49eec8ab4225d6e68a051b5c6a21cb2) to CSV parsing to return partial results for bad CSV records, queries of wide CSV tables slowed considerably. That recent change resulted in every row being recreated, even when the associated input record had no parsing issues and the user specified no corrupt record field in his/her schema. The change to FailureSafeParser.scala also impacted queries against wide JSON tables as well.

In this PR, I propose that a row should be recreated only if columns need to be shifted due to the existence of a corrupt column field in the user-supplied schema. Otherwise, the code should use the row as-is (For CSV input, it will have values for the columns that could be converted, and also null values for columns that could not be converted). See benchmarks below.

The CSV benchmark for 1000 columns went from 120144 ms to 89069 ms, a savings of 25% (this only brings the cost down to baseline levels. Again, see benchmarks below).

Similarly, the JSON benchmark for 1000 columns (added in this PR) went from 109621 ms to 80871 ms, also a savings of 25%.

Still, partial results functionality is preserved:

<pre>
bash-3.2$ cat test2.csv
"hello",1999-08-01,"last"
"there","bad date","field"
"again","2017-11-22","in file"
bash-3.2$ bin/spark-shell
...etc...
scala> val df = spark.read.schema("a string, b date, c string").csv("test2.csv")
df: org.apache.spark.sql.DataFrame = [a: string, b: date ... 1 more field]
scala> df.show
+-----+----------+-------+
|    a|         b|      c|
+-----+----------+-------+
|hello|1999-08-01|   last|
|there|      null|  field|
|again|2017-11-22|in file|
+-----+----------+-------+

scala> val df = spark.read.schema("badRecord string, a string, b date, c string").
     | option("columnNameOfCorruptRecord", "badRecord").
     | csv("test2.csv")
df: org.apache.spark.sql.DataFrame = [badRecord: string, a: string ... 2 more fields]
scala> df.show
+--------------------+-----+----------+-------+
|           badRecord|    a|         b|      c|
+--------------------+-----+----------+-------+
|                null|hello|1999-08-01|   last|
|"there","bad date...|there|      null|  field|
|                null|again|2017-11-22|in file|
+--------------------+-----+----------+-------+

scala>
</pre>

### CSVBenchmark

Benchmarks:
baseline = commit before partial results change
PR = this PR
master = master branch

[baseline_CSVBenchmark-results.txt](https://github.com/apache/spark/files/2697109/baseline_CSVBenchmark-results.txt)
[pr_CSVBenchmark-results.txt](https://github.com/apache/spark/files/2697110/pr_CSVBenchmark-results.txt)
[master_CSVBenchmark-results.txt](https://github.com/apache/spark/files/2697111/master_CSVBenchmark-results.txt)

### JSONBenchmark

Benchmarks:
baseline = commit before partial results change
PR = this PR
master = master branch

[baseline_JSONBenchmark-results.txt](https://github.com/apache/spark/files/2711040/baseline_JSONBenchmark-results.txt)
[pr_JSONBenchmark-results.txt](https://github.com/apache/spark/files/2711041/pr_JSONBenchmark-results.txt)
[master_JSONBenchmark-results.txt](https://github.com/apache/spark/files/2711042/master_JSONBenchmark-results.txt)

## How was this patch tested?

- All SQL unit tests.
- Added 2 CSV benchmarks
- Python core and SQL tests

Closes #23336 from bersprockets/csv-wide-row-opt2.

Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-01-30 02:15:29 -05:00
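
The row-recreation rule described in the commit message above, reduced to a simplified, self-contained Scala illustration; this is not the actual FailureSafeParser code, and the names and types are invented for the sketch:

  // Decide once, from the user-supplied schema, whether a corrupt-record column exists.
  // Only then do parsed values need to be copied into a wider, shifted row.
  case class ParsedRow(values: Array[Any], badRecord: Option[String])

  def toOutputRow(parsed: ParsedRow, corruptFieldIndex: Option[Int]): Array[Any] =
    corruptFieldIndex match {
      case None =>
        // No corrupt-record column requested: reuse the parsed row as-is
        // (columns that failed to convert are already null), avoiding a per-row copy.
        parsed.values
      case Some(idx) =>
        // Corrupt-record column present: build a wider row and shift the data columns
        // so the raw record text can occupy the corrupt column's slot.
        val out = new Array[Any](parsed.values.length + 1)
        var i = 0
        var j = 0
        while (j < out.length) {
          if (j == idx) out(j) = parsed.badRecord.orNull
          else { out(j) = parsed.values(i); i += 1 }
          j += 1
        }
        out
    }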