From ec70467d4d69b024958b04a7f2e3735506ee8a07 Mon Sep 17 00:00:00 2001
From: HyukjinKwon
Date: Mon, 22 Mar 2021 10:49:53 +0300
Subject: [PATCH] [SPARK-34815][SQL] Update CSVBenchmark

### What changes were proposed in this pull request?

This PR updates the CSVBenchmark results, especially since a fix such as https://github.com/apache/spark/pull/31858 could potentially improve the performance.

### Why are the changes needed?

To have the updated benchmark results.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually ran the benchmark.

Closes #31917 from HyukjinKwon/SPARK-34815.

Authored-by: HyukjinKwon
Signed-off-by: Max Gekk
---
 .../benchmarks/CSVBenchmark-jdk11-results.txt | 88 +++++++++----------
 sql/core/benchmarks/CSVBenchmark-results.txt  | 88 +++++++++----------
 2 files changed, 88 insertions(+), 88 deletions(-)

diff --git a/sql/core/benchmarks/CSVBenchmark-jdk11-results.txt b/sql/core/benchmarks/CSVBenchmark-jdk11-results.txt
index 03c51ddad1..c8db7859d2 100644
--- a/sql/core/benchmarks/CSVBenchmark-jdk11-results.txt
+++ b/sql/core/benchmarks/CSVBenchmark-jdk11-results.txt
@@ -2,66 +2,66 @@
 Benchmark to measure CSV read/write performance
 ================================================================================================
 
-OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 11.0.3+12-LTS on Mac OS X 10.15.7
+Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
 Parsing quoted values:                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-One quoted string                                 53332          53484         194          0.0    1066633.5       1.0X
+One quoted string                                 21212          21537         327          0.0     424244.5       1.0X
 
-OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 11.0.3+12-LTS on Mac OS X 10.15.7
+Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
 Wide rows with 1000 columns:              Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Select 1000 columns                              127472         128337        1185          0.0     127472.4       1.0X
-Select 100 columns                                43731          43856         130          0.0      43730.7       2.9X
-Select one column                                 37347          37401          47          0.0      37347.4       3.4X
-count()                                            8014           8028          25          0.1       8013.8      15.9X
-Select 100 columns, one bad input field           95603          95726         106          0.0      95603.0       1.3X
-Select 100 columns, corrupt record field         111851         111969         171          0.0     111851.4       1.1X
+Select 1000 columns                               73744          74898        1930          0.0      73743.8       1.0X
+Select 100 columns                                22704          22860         236          0.0      22704.4       3.2X
+Select one column                                 17837          17977         121          0.1      17837.2       4.1X
+count()                                            4304           4320          27          0.2       4304.0      17.1X
+Select 100 columns, one bad input field           42060          42280         378          0.0      42059.8       1.8X
+Select 100 columns, corrupt record field          46633          47520         773          0.0      46632.5       1.6X
 
-OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 11.0.3+12-LTS on Mac OS X 10.15.7
+Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
 Count a dataset with 10 columns:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Select 10 columns + count()                       20364          20481         110          0.5       2036.4       1.0X
-Select 1 column + count()                         14706          14803         153          0.7       1470.6       1.4X
-count()                                            3855           3880          32          2.6        385.5       5.3X
+Select 10 columns + count()                        9906          10132         246          1.0        990.6       1.0X
+Select 1 column + count()                          6497           6616         104          1.5        649.7       1.5X
+count()                                            2285           2322          32          4.4        228.5       4.3X
 
-OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 11.0.3+12-LTS on Mac OS X 10.15.7
+Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
 Write dates and timestamps:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Create a dataset of timestamps                     2191           2205          14          4.6        219.1       1.0X
-to_csv(timestamp)                                 13209          13253          43          0.8       1320.9       0.2X
-write timestamps to files                         12300          12380          71          0.8       1230.0       0.2X
-Create a dataset of dates                          2254           2269          14          4.4        225.4       1.0X
-to_csv(date)                                       7980           8006          22          1.3        798.0       0.3X
-write dates to files                               7076           7100          26          1.4        707.6       0.3X
+Create a dataset of timestamps                      902            932          30         11.1         90.2       1.0X
+to_csv(timestamp)                                  8537           8851         274          1.2        853.7       0.1X
+write timestamps to files                          7810           8000         238          1.3        781.0       0.1X
+Create a dataset of dates                           929            931           2         10.8         92.9       1.0X
+to_csv(date)                                       5170           5237          62          1.9        517.0       0.2X
+write dates to files                               4163           4220          49          2.4        416.3       0.2X
 
-OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 11.0.3+12-LTS on Mac OS X 10.15.7
+Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
 Read dates and timestamps:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-read timestamp text from files                     2405           2408           5          4.2        240.5       1.0X
-read timestamps from files                        54502          54624         207          0.2       5450.2       0.0X
-infer timestamps from files                      112896         113040         135          0.1      11289.6       0.0X
-read date text from files                          2127           2141          23          4.7        212.7       1.1X
-read date from files                              30229          30257          29          0.3       3022.9       0.1X
-infer date from files                             28156          28621         409          0.4       2815.6       0.1X
-timestamp strings                                  3096           3097           1          3.2        309.6       0.8X
-parse timestamps from Dataset[String]             63096          63751         571          0.2       6309.6       0.0X
-infer timestamps from Dataset[String]            120916         121262         556          0.1      12091.6       0.0X
-date strings                                       3445           3457          13          2.9        344.5       0.7X
-parse dates from Dataset[String]                  37481          37585          91          0.3       3748.1       0.1X
-from_csv(timestamp)                               57933          57996          69          0.2       5793.3       0.0X
-from_csv(date)                                    35312          35469         164          0.3       3531.2       0.1X
+read timestamp text from files                     1475           1497          33          6.8        147.5       1.0X
+read timestamps from files                        18596          18811         343          0.5       1859.6       0.1X
+infer timestamps from files                       37182          37511         342          0.3       3718.2       0.0X
+read date text from files                          1183           1210          31          8.5        118.3       1.2X
+read date from files                               8797           9099         283          1.1        879.7       0.2X
+infer date from files                             11296          11427         218          0.9       1129.6       0.1X
+timestamp strings                                  1379           1382           4          7.3        137.9       1.1X
+parse timestamps from Dataset[String]             18243          19000         721          0.5       1824.3       0.1X
+infer timestamps from Dataset[String]             38253          39096         731          0.3       3825.3       0.0X
+date strings                                       1686           1721          35          5.9        168.6       0.9X
+parse dates from Dataset[String]                  10474          10680         184          1.0       1047.4       0.1X
+from_csv(timestamp)                               18643          18965         350          0.5       1864.3       0.1X
+from_csv(date)                                     9814          10018         188          1.0        981.4       0.2X
 
-OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 11.0.3+12-LTS on Mac OS X 10.15.7
+Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
 Filters pushdown:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-w/o filters                                       24751          24829          67          0.0     247510.6       1.0X
-pushdown disabled                                 24856          24889          29          0.0     248558.7       1.0X
-w/ filters                                         1881           1892          11          0.1      18814.4      13.2X
+w/o filters                                       11243          11535         287          0.0     112433.9       1.0X
+pushdown disabled                                 11093          11117          34          0.0     110931.9       1.0X
+w/ filters                                          794            800           5          0.1       7942.1      14.2X
 
 
diff --git a/sql/core/benchmarks/CSVBenchmark-results.txt b/sql/core/benchmarks/CSVBenchmark-results.txt
index a0d8c0c6fd..15f901e8a7 100644
--- a/sql/core/benchmarks/CSVBenchmark-results.txt
+++ b/sql/core/benchmarks/CSVBenchmark-results.txt
@@ -2,66 +2,66 @@
 Benchmark to measure CSV read/write performance
 ================================================================================================
 
-OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.15.7
+Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
 Parsing quoted values:                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-One quoted string                                 47588          47831         244          0.0     951755.4       1.0X
+One quoted string                                 24185          24195          10          0.0     483694.2       1.0X
 
-OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.15.7
+Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
 Wide rows with 1000 columns:              Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Select 1000 columns                              129509         130323        1388          0.0     129509.4       1.0X
-Select 100 columns                                42474          42572         108          0.0      42473.6       3.0X
-Select one column                                 35479          35586          93          0.0      35479.1       3.7X
-count()                                           11021          11071          47          0.1      11021.3      11.8X
-Select 100 columns, one bad input field           94652          94795         134          0.0      94652.0       1.4X
-Select 100 columns, corrupt record field         115336         115542         350          0.0     115336.0       1.1X
+Select 1000 columns                               61793          62388         532          0.0      61793.4       1.0X
+Select 100 columns                                21958          21993          34          0.0      21957.9       2.8X
+Select one column                                 18215          18515         505          0.1      18215.0       3.4X
+count()                                            5865           6168         296          0.2       5865.1      10.5X
+Select 100 columns, one bad input field           39638          39739         124          0.0      39637.5       1.6X
+Select 100 columns, corrupt record field          47290          48133         741          0.0      47290.0       1.3X
 
-OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.15.7
+Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
 Count a dataset with 10 columns:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Select 10 columns + count()                       19959          20022          76          0.5       1995.9       1.0X
-Select 1 column + count()                         13920          13968          54          0.7       1392.0       1.4X
-count()                                            3928           3938          11          2.5        392.8       5.1X
+Select 10 columns + count()                        9935          10460         461          1.0        993.5       1.0X
+Select 1 column + count()                          6786           7179         342          1.5        678.6       1.5X
+count()                                            2281           2458         165          4.4        228.1       4.4X
 
-OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.15.7
+Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
 Write dates and timestamps:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Create a dataset of timestamps                     1940           1977          56          5.2        194.0       1.0X
-to_csv(timestamp)                                 15398          15669         458          0.6       1539.8       0.1X
-write timestamps to files                         12438          12454          19          0.8       1243.8       0.2X
-Create a dataset of dates                          2157           2171          18          4.6        215.7       0.9X
-to_csv(date)                                      11764          11839          95          0.9       1176.4       0.2X
-write dates to files                               8893           8907          12          1.1        889.3       0.2X
+Create a dataset of timestamps                      812            826          14         12.3         81.2       1.0X
+to_csv(timestamp)                                  7548           7764         192          1.3        754.8       0.1X
+write timestamps to files                          7052           7193         141          1.4        705.2       0.1X
+Create a dataset of dates                           897            909          13         11.1         89.7       0.9X
+to_csv(date)                                       4778           4787          10          2.1        477.8       0.2X
+write dates to files                               3853           3891          33          2.6        385.3       0.2X
 
-OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.15.7
+Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
 Read dates and timestamps:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-read timestamp text from files                     2219           2230          11          4.5        221.9       1.0X
-read timestamps from files                        51519          51725         192          0.2       5151.9       0.0X
-infer timestamps from files                      104744         104885         124          0.1      10474.4       0.0X
-read date text from files                          1940           1943           4          5.2        194.0       1.1X
-read date from files                              27099          27118          33          0.4       2709.9       0.1X
-infer date from files                             27662          27703          61          0.4       2766.2       0.1X
-timestamp strings                                  4225           4242          15          2.4        422.5       0.5X
-parse timestamps from Dataset[String]             56090          56479         376          0.2       5609.0       0.0X
-infer timestamps from Dataset[String]            115629         116245        1049          0.1      11562.9       0.0X
-date strings                                       4337           4344          10          2.3        433.7       0.5X
-parse dates from Dataset[String]                  32373          32476         120          0.3       3237.3       0.1X
-from_csv(timestamp)                               54952          55157         300          0.2       5495.2       0.0X
-from_csv(date)                                    30924          30985          66          0.3       3092.4       0.1X
+read timestamp text from files                     1259           1262           4          7.9        125.9       1.0X
+read timestamps from files                        20030          20105          80          0.5       2003.0       0.1X
+infer timestamps from files                       39621          39674          61          0.3       3962.1       0.0X
+read date text from files                          1039           1068          40          9.6        103.9       1.2X
+read date from files                               9352           9363          10          1.1        935.2       0.1X
+infer date from files                             11465          11485          23          0.9       1146.5       0.1X
+timestamp strings                                  1759           1812          59          5.7        175.9       0.7X
+parse timestamps from Dataset[String]             20806          20858          75          0.5       2080.6       0.1X
+infer timestamps from Dataset[String]             40537          40821         258          0.2       4053.7       0.0X
+date strings                                       1808           1816          12          5.5        180.8       0.7X
+parse dates from Dataset[String]                  12080          12311         245          0.8       1208.0       0.1X
+from_csv(timestamp)                               20120          21503        1224          0.5       2012.0       0.1X
+from_csv(date)                                    10607          10768         246          0.9       1060.7       0.1X
 
-OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.15.7
+Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
 Filters pushdown:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-w/o filters                                       25630          25636           8          0.0     256301.4       1.0X
-pushdown disabled                                 25673          25681           9          0.0     256734.0       1.0X
-w/ filters                                         1873           1886          15          0.1      18733.1      13.7X
+w/o filters                                       13109          13249         151          0.0     131086.4       1.0X
+pushdown disabled                                 12951          12994          63          0.0     129509.7       1.0X
+w/ filters                                         1095           1113          15          0.1      10953.7      12.0X
 
 