[SPARK-34815][SQL] Update CSVBenchmark

### What changes were proposed in this pull request?

This PR updates the CSVBenchmark results, particularly because a fix such as https://github.com/apache/spark/pull/31858 could potentially improve performance.

### Why are the changes needed?

To have the updated benchmark results.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually ran the benchmark.

Closes #31917 from HyukjinKwon/SPARK-34815.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
HyukjinKwon 2021-03-22 10:49:53 +03:00 committed by Max Gekk
parent 121883b1a5
commit ec70467d4d
2 changed files with 88 additions and 88 deletions

@@ -2,66 +2,66 @@
 Benchmark to measure CSV read/write performance
 ================================================================================================
-OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 11.0.3+12-LTS on Mac OS X 10.15.7
+Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
 Parsing quoted values: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 ------------------------------------------------------------------------------------------------------------------------
-One quoted string 53332 53484 194 0.0 1066633.5 1.0X
+One quoted string 21212 21537 327 0.0 424244.5 1.0X
-OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 11.0.3+12-LTS on Mac OS X 10.15.7
+Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
 Wide rows with 1000 columns: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 ------------------------------------------------------------------------------------------------------------------------
-Select 1000 columns 127472 128337 1185 0.0 127472.4 1.0X
-Select 100 columns 43731 43856 130 0.0 43730.7 2.9X
-Select one column 37347 37401 47 0.0 37347.4 3.4X
-count() 8014 8028 25 0.1 8013.8 15.9X
-Select 100 columns, one bad input field 95603 95726 106 0.0 95603.0 1.3X
-Select 100 columns, corrupt record field 111851 111969 171 0.0 111851.4 1.1X
+Select 1000 columns 73744 74898 1930 0.0 73743.8 1.0X
+Select 100 columns 22704 22860 236 0.0 22704.4 3.2X
+Select one column 17837 17977 121 0.1 17837.2 4.1X
+count() 4304 4320 27 0.2 4304.0 17.1X
+Select 100 columns, one bad input field 42060 42280 378 0.0 42059.8 1.8X
+Select 100 columns, corrupt record field 46633 47520 773 0.0 46632.5 1.6X
-OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 11.0.3+12-LTS on Mac OS X 10.15.7
+Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
 Count a dataset with 10 columns: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 ------------------------------------------------------------------------------------------------------------------------
-Select 10 columns + count() 20364 20481 110 0.5 2036.4 1.0X
-Select 1 column + count() 14706 14803 153 0.7 1470.6 1.4X
-count() 3855 3880 32 2.6 385.5 5.3X
+Select 10 columns + count() 9906 10132 246 1.0 990.6 1.0X
+Select 1 column + count() 6497 6616 104 1.5 649.7 1.5X
+count() 2285 2322 32 4.4 228.5 4.3X
-OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 11.0.3+12-LTS on Mac OS X 10.15.7
+Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
 Write dates and timestamps: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 ------------------------------------------------------------------------------------------------------------------------
-Create a dataset of timestamps 2191 2205 14 4.6 219.1 1.0X
-to_csv(timestamp) 13209 13253 43 0.8 1320.9 0.2X
-write timestamps to files 12300 12380 71 0.8 1230.0 0.2X
-Create a dataset of dates 2254 2269 14 4.4 225.4 1.0X
-to_csv(date) 7980 8006 22 1.3 798.0 0.3X
-write dates to files 7076 7100 26 1.4 707.6 0.3X
+Create a dataset of timestamps 902 932 30 11.1 90.2 1.0X
+to_csv(timestamp) 8537 8851 274 1.2 853.7 0.1X
+write timestamps to files 7810 8000 238 1.3 781.0 0.1X
+Create a dataset of dates 929 931 2 10.8 92.9 1.0X
+to_csv(date) 5170 5237 62 1.9 517.0 0.2X
+write dates to files 4163 4220 49 2.4 416.3 0.2X
-OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 11.0.3+12-LTS on Mac OS X 10.15.7
+Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
 Read dates and timestamps: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 ------------------------------------------------------------------------------------------------------------------------
-read timestamp text from files 2405 2408 5 4.2 240.5 1.0X
-read timestamps from files 54502 54624 207 0.2 5450.2 0.0X
-infer timestamps from files 112896 113040 135 0.1 11289.6 0.0X
-read date text from files 2127 2141 23 4.7 212.7 1.1X
-read date from files 30229 30257 29 0.3 3022.9 0.1X
-infer date from files 28156 28621 409 0.4 2815.6 0.1X
-timestamp strings 3096 3097 1 3.2 309.6 0.8X
-parse timestamps from Dataset[String] 63096 63751 571 0.2 6309.6 0.0X
-infer timestamps from Dataset[String] 120916 121262 556 0.1 12091.6 0.0X
-date strings 3445 3457 13 2.9 344.5 0.7X
-parse dates from Dataset[String] 37481 37585 91 0.3 3748.1 0.1X
-from_csv(timestamp) 57933 57996 69 0.2 5793.3 0.0X
-from_csv(date) 35312 35469 164 0.3 3531.2 0.1X
+read timestamp text from files 1475 1497 33 6.8 147.5 1.0X
+read timestamps from files 18596 18811 343 0.5 1859.6 0.1X
+infer timestamps from files 37182 37511 342 0.3 3718.2 0.0X
+read date text from files 1183 1210 31 8.5 118.3 1.2X
+read date from files 8797 9099 283 1.1 879.7 0.2X
+infer date from files 11296 11427 218 0.9 1129.6 0.1X
+timestamp strings 1379 1382 4 7.3 137.9 1.1X
+parse timestamps from Dataset[String] 18243 19000 721 0.5 1824.3 0.1X
+infer timestamps from Dataset[String] 38253 39096 731 0.3 3825.3 0.0X
+date strings 1686 1721 35 5.9 168.6 0.9X
+parse dates from Dataset[String] 10474 10680 184 1.0 1047.4 0.1X
+from_csv(timestamp) 18643 18965 350 0.5 1864.3 0.1X
+from_csv(date) 9814 10018 188 1.0 981.4 0.2X
-OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 11.0.3+12-LTS on Mac OS X 10.15.7
+Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
 Filters pushdown: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 ------------------------------------------------------------------------------------------------------------------------
-w/o filters 24751 24829 67 0.0 247510.6 1.0X
-pushdown disabled 24856 24889 29 0.0 248558.7 1.0X
-w/ filters 1881 1892 11 0.1 18814.4 13.2X
+w/o filters 11243 11535 287 0.0 112433.9 1.0X
+pushdown disabled 11093 11117 34 0.0 110931.9 1.0X
+w/ filters 794 800 5 0.1 7942.1 14.2X

@@ -2,66 +2,66 @@
 Benchmark to measure CSV read/write performance
 ================================================================================================
-OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.15.7
+Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
 Parsing quoted values: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 ------------------------------------------------------------------------------------------------------------------------
-One quoted string 47588 47831 244 0.0 951755.4 1.0X
+One quoted string 24185 24195 10 0.0 483694.2 1.0X
-OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.15.7
+Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
 Wide rows with 1000 columns: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 ------------------------------------------------------------------------------------------------------------------------
-Select 1000 columns 129509 130323 1388 0.0 129509.4 1.0X
-Select 100 columns 42474 42572 108 0.0 42473.6 3.0X
-Select one column 35479 35586 93 0.0 35479.1 3.7X
-count() 11021 11071 47 0.1 11021.3 11.8X
-Select 100 columns, one bad input field 94652 94795 134 0.0 94652.0 1.4X
-Select 100 columns, corrupt record field 115336 115542 350 0.0 115336.0 1.1X
+Select 1000 columns 61793 62388 532 0.0 61793.4 1.0X
+Select 100 columns 21958 21993 34 0.0 21957.9 2.8X
+Select one column 18215 18515 505 0.1 18215.0 3.4X
+count() 5865 6168 296 0.2 5865.1 10.5X
+Select 100 columns, one bad input field 39638 39739 124 0.0 39637.5 1.6X
+Select 100 columns, corrupt record field 47290 48133 741 0.0 47290.0 1.3X
-OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.15.7
+Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
 Count a dataset with 10 columns: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 ------------------------------------------------------------------------------------------------------------------------
-Select 10 columns + count() 19959 20022 76 0.5 1995.9 1.0X
-Select 1 column + count() 13920 13968 54 0.7 1392.0 1.4X
-count() 3928 3938 11 2.5 392.8 5.1X
+Select 10 columns + count() 9935 10460 461 1.0 993.5 1.0X
+Select 1 column + count() 6786 7179 342 1.5 678.6 1.5X
+count() 2281 2458 165 4.4 228.1 4.4X
-OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.15.7
+Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
 Write dates and timestamps: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 ------------------------------------------------------------------------------------------------------------------------
-Create a dataset of timestamps 1940 1977 56 5.2 194.0 1.0X
-to_csv(timestamp) 15398 15669 458 0.6 1539.8 0.1X
-write timestamps to files 12438 12454 19 0.8 1243.8 0.2X
-Create a dataset of dates 2157 2171 18 4.6 215.7 0.9X
-to_csv(date) 11764 11839 95 0.9 1176.4 0.2X
-write dates to files 8893 8907 12 1.1 889.3 0.2X
+Create a dataset of timestamps 812 826 14 12.3 81.2 1.0X
+to_csv(timestamp) 7548 7764 192 1.3 754.8 0.1X
+write timestamps to files 7052 7193 141 1.4 705.2 0.1X
+Create a dataset of dates 897 909 13 11.1 89.7 0.9X
+to_csv(date) 4778 4787 10 2.1 477.8 0.2X
+write dates to files 3853 3891 33 2.6 385.3 0.2X
-OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.15.7
+Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
 Read dates and timestamps: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 ------------------------------------------------------------------------------------------------------------------------
-read timestamp text from files 2219 2230 11 4.5 221.9 1.0X
-read timestamps from files 51519 51725 192 0.2 5151.9 0.0X
-infer timestamps from files 104744 104885 124 0.1 10474.4 0.0X
-read date text from files 1940 1943 4 5.2 194.0 1.1X
-read date from files 27099 27118 33 0.4 2709.9 0.1X
-infer date from files 27662 27703 61 0.4 2766.2 0.1X
-timestamp strings 4225 4242 15 2.4 422.5 0.5X
-parse timestamps from Dataset[String] 56090 56479 376 0.2 5609.0 0.0X
-infer timestamps from Dataset[String] 115629 116245 1049 0.1 11562.9 0.0X
-date strings 4337 4344 10 2.3 433.7 0.5X
-parse dates from Dataset[String] 32373 32476 120 0.3 3237.3 0.1X
-from_csv(timestamp) 54952 55157 300 0.2 5495.2 0.0X
-from_csv(date) 30924 30985 66 0.3 3092.4 0.1X
+read timestamp text from files 1259 1262 4 7.9 125.9 1.0X
+read timestamps from files 20030 20105 80 0.5 2003.0 0.1X
+infer timestamps from files 39621 39674 61 0.3 3962.1 0.0X
+read date text from files 1039 1068 40 9.6 103.9 1.2X
+read date from files 9352 9363 10 1.1 935.2 0.1X
+infer date from files 11465 11485 23 0.9 1146.5 0.1X
+timestamp strings 1759 1812 59 5.7 175.9 0.7X
+parse timestamps from Dataset[String] 20806 20858 75 0.5 2080.6 0.1X
+infer timestamps from Dataset[String] 40537 40821 258 0.2 4053.7 0.0X
+date strings 1808 1816 12 5.5 180.8 0.7X
+parse dates from Dataset[String] 12080 12311 245 0.8 1208.0 0.1X
+from_csv(timestamp) 20120 21503 1224 0.5 2012.0 0.1X
+from_csv(date) 10607 10768 246 0.9 1060.7 0.1X
-OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.15.7
+Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
 Filters pushdown: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 ------------------------------------------------------------------------------------------------------------------------
-w/o filters 25630 25636 8 0.0 256301.4 1.0X
-pushdown disabled 25673 25681 9 0.0 256734.0 1.0X
-w/ filters 1873 1886 15 0.1 18733.1 13.7X
+w/o filters 13109 13249 151 0.0 131086.4 1.0X
+pushdown disabled 12951 12994 63 0.0 129509.7 1.0X
+w/ filters 1095 1113 15 0.1 10953.7 12.0X
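
For readers of the tables above, the derived columns appear to follow from Best Time(ms) alone. The sketch below is inferred from the reported numbers, not taken from Spark's benchmark source: `derive_columns` is a hypothetical helper, and the row count of 10,000,000 is an assumption back-computed from 9935 ms at 993.5 ns per row. It takes Rate(M/s) as rows per microsecond, Per Row(ns) as best time divided by rows, and Relative as the first case's best time divided by the current case's best time.

```python
def derive_columns(best_ms, num_rows, baseline_best_ms):
    """Derive (Rate(M/s), Per Row(ns), Relative) from a best time in milliseconds."""
    rate_m_per_s = num_rows / (best_ms * 1000.0)    # rows per microsecond == million rows/s
    per_row_ns = best_ms * 1_000_000.0 / num_rows   # ms -> ns, spread over all rows
    relative = baseline_best_ms / best_ms           # > 1 means faster than the first case
    return round(rate_m_per_s, 1), round(per_row_ns, 1), round(relative, 1)


# Best times copied from the updated JDK 8 "Count a dataset with 10 columns" rows.
NUM_ROWS = 10_000_000   # assumption: inferred from 9935 ms and 993.5 ns per row
BASELINE_MS = 9935      # "Select 10 columns + count()", the 1.0X case

for name, best_ms in [("Select 1 column + count()", 6786), ("count()", 2281)]:
    rate, per_row, relative = derive_columns(best_ms, NUM_ROWS, BASELINE_MS)
    print(f"{name}: {rate} M/s, {per_row} ns/row, {relative}X")
```

Under these assumptions the two rows reproduce the reported 1.5 / 678.6 / 1.5X and 4.4 / 228.1 / 4.4X. Some other rows can differ in the last digit, since the tables print best times rounded to whole milliseconds.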