ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
HyukjinKwon	ec70467d4d	[SPARK-34815][SQL] Update CSVBenchmark ### What changes were proposed in this pull request? This PR updates CSVBenchmark especially we have a fix like https://github.com/apache/spark/pull/31858 that could potentially improve the performance. ### Why are the changes needed? To have the updated benchmark results. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually ran the benchmark Closes #31917 from HyukjinKwon/SPARK-34815. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-03-22 10:49:53 +03:00
Karuppayya Rajendran	0a58029d52	[SPARK-31897][SQL] Enable codegen for GenerateExec ### What changes were proposed in this pull request? Enabling codegen for GenerateExec ### Why are the changes needed? To leverage code generation for Generators ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - UT tests added ### Benchmark ``` case class Data(value1: Float, value2: Map[String, String], value3: String) val path = "<path>" val numRecords = Seq(10000000, 100000000) numRecords.map { recordCount => import java.util.concurrent.TimeUnit.NANOSECONDS val srcDF = spark.range(recordCount).map { x => Data(x.toFloat, Map(x.toString -> x.toString ), s"value3$x") }.select($"value1", explode($"value2"), $"value3") val start = System.nanoTime() srcDF .write .mode("overwrite") .parquet(s"$path/$recordCount") val end = System.nanoTime() val diff = end - start (recordCount, NANOSECONDS.toMillis(diff)) } ``` With codegen: ``` res0: Seq[(Int, Long)] = List((10000000,13989), (100000000,129625)) ``` Without codegen: ``` res0: Seq[(Int, Long)] = List((10000000,15736), (100000000,150399)) ``` Closes #28715 from karuppayya/SPARK-31897. Lead-authored-by: Karuppayya Rajendran <karuppayya1990@gmail.com> Co-authored-by: Karuppayya Rajendran <karuppayya.rajendran@apple.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2021-03-18 20:50:28 -07:00
Cheng Su	b5b198516c	[SPARK-34620][SQL] Code-gen broadcast nested loop join (inner/cross) ### What changes were proposed in this pull request? `BroadcastNestedLoopJoinExec` does not have code-gen, and we can potentially boost the CPU performance for this operator if we add code-gen for it. https://databricks.com/blog/2017/02/16/processing-trillion-rows-per-second-single-machine-can-nested-loop-joins-fast.html also showed the evidence in one fork. The codegen for `BroadcastNestedLoopJoinExec` shared some code with `HashJoin`, and the interface `JoinCodegenSupport` is created to hold those common logic. This PR is only supporting inner and cross join. Other join types will be added later in followup PRs. Example query and generated code: ``` val df1 = spark.range(4).select($"id".as("k1")) val df2 = spark.range(3).select($"id".as("k2")) df1.join(df2, $"k1" + 1 =!= $"k2").explain("codegen") ``` ``` == Subtree 2 / 2 (maxMethodCodeSize:282; maxConstantPoolSize:203(0.31% used); numInnerClasses:0) == (2) BroadcastNestedLoopJoin BuildRight, Inner, NOT ((k1#2L + 1) = k2#6L) :- (2) Project [id#0L AS k1#2L] : +- (2) Range (0, 4, step=1, splits=2) +- BroadcastExchange IdentityBroadcastMode, [id=#22] +- (1) Project [id#4L AS k2#6L] +- (1) Range (0, 3, step=1, splits=2) Generated code: / 001 / public Object generate(Object[] references) { / 002 / return new GeneratedIteratorForCodegenStage2(references); / 003 / } / 004 / / 005 / // codegenStageId=2 / 006 / final class GeneratedIteratorForCodegenStage2 extends org.apache.spark.sql.execution.BufferedRowIterator { / 007 / private Object[] references; / 008 / private scala.collection.Iterator[] inputs; / 009 / private boolean range_initRange_0; / 010 / private long range_nextIndex_0; / 011 / private TaskContext range_taskContext_0; / 012 / private InputMetrics range_inputMetrics_0; / 013 / private long range_batchEnd_0; / 014 / private long range_numElementsTodo_0; / 015 / private InternalRow[] bnlj_buildRowArray_0; / 016 / private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] range_mutableStateArray_0 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[4]; / 017 / / 018 / public GeneratedIteratorForCodegenStage2(Object[] references) { / 019 / this.references = references; / 020 / } / 021 / / 022 / public void init(int index, scala.collection.Iterator[] inputs) { / 023 / partitionIndex = index; / 024 / this.inputs = inputs; / 025 / / 026 / range_taskContext_0 = TaskContext.get(); / 027 / range_inputMetrics_0 = range_taskContext_0.taskMetrics().inputMetrics(); / 028 / range_mutableStateArray_0[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0); / 029 / range_mutableStateArray_0[1] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0); / 030 / range_mutableStateArray_0[2] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0); / 031 / bnlj_buildRowArray_0 = (InternalRow[]) ((org.apache.spark.broadcast.TorrentBroadcast) references[1] / broadcastTerm /).value(); / 032 / range_mutableStateArray_0[3] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(2, 0); / 033 / / 034 / } / 035 / / 036 / private void bnlj_doConsume_0(long bnlj_expr_0_0) throws java.io.IOException { / 037 / for (int bnlj_arrayIndex_0 = 0; bnlj_arrayIndex_0 < bnlj_buildRowArray_0.length; bnlj_arrayIndex_0++) { / 038 / UnsafeRow bnlj_buildRow_0 = (UnsafeRow) bnlj_buildRowArray_0[bnlj_arrayIndex_0]; / 039 / / 040 / long bnlj_value_1 = bnlj_buildRow_0.getLong(0); / 041 / / 042 / long bnlj_value_4 = -1L; / 043 / / 044 / bnlj_value_4 = bnlj_expr_0_0 + 1L; / 045 / / 046 / boolean bnlj_value_3 = false; / 047 / bnlj_value_3 = bnlj_value_4 == bnlj_value_1; / 048 / boolean bnlj_value_2 = false; / 049 / bnlj_value_2 = !(bnlj_value_3); / 050 / if (!(false \|\| !bnlj_value_2)) / 051 / { / 052 / ((org.apache.spark.sql.execution.metric.SQLMetric) references[2] / numOutputRows /).add(1); / 053 / / 054 / range_mutableStateArray_0[3].reset(); / 055 / / 056 / range_mutableStateArray_0[3].write(0, bnlj_expr_0_0); / 057 / / 058 / range_mutableStateArray_0[3].write(1, bnlj_value_1); / 059 / append((range_mutableStateArray_0[3].getRow()).copy()); / 060 / / 061 / } / 062 / } / 063 / / 064 / } / 065 / / 066 / private void initRange(int idx) { / 067 / java.math.BigInteger index = java.math.BigInteger.valueOf(idx); / 068 / java.math.BigInteger numSlice = java.math.BigInteger.valueOf(2L); / 069 / java.math.BigInteger numElement = java.math.BigInteger.valueOf(4L); / 070 / java.math.BigInteger step = java.math.BigInteger.valueOf(1L); / 071 / java.math.BigInteger start = java.math.BigInteger.valueOf(0L); / 072 / long partitionEnd; / 073 / / 074 / java.math.BigInteger st = index.multiply(numElement).divide(numSlice).multiply(step).add(start); / 075 / if (st.compareTo(java.math.BigInteger.valueOf(Long.MAX_VALUE)) > 0) { / 076 / range_nextIndex_0 = Long.MAX_VALUE; / 077 / } else if (st.compareTo(java.math.BigInteger.valueOf(Long.MIN_VALUE)) < 0) { / 078 / range_nextIndex_0 = Long.MIN_VALUE; / 079 / } else { / 080 / range_nextIndex_0 = st.longValue(); / 081 / } / 082 / range_batchEnd_0 = range_nextIndex_0; / 083 / / 084 / java.math.BigInteger end = index.add(java.math.BigInteger.ONE).multiply(numElement).divide(numSlice) / 085 / .multiply(step).add(start); / 086 / if (end.compareTo(java.math.BigInteger.valueOf(Long.MAX_VALUE)) > 0) { / 087 / partitionEnd = Long.MAX_VALUE; / 088 / } else if (end.compareTo(java.math.BigInteger.valueOf(Long.MIN_VALUE)) < 0) { / 089 / partitionEnd = Long.MIN_VALUE; / 090 / } else { / 091 / partitionEnd = end.longValue(); / 092 / } / 093 / / 094 / java.math.BigInteger startToEnd = java.math.BigInteger.valueOf(partitionEnd).subtract( / 095 / java.math.BigInteger.valueOf(range_nextIndex_0)); / 096 / range_numElementsTodo_0 = startToEnd.divide(step).longValue(); / 097 / if (range_numElementsTodo_0 < 0) { / 098 / range_numElementsTodo_0 = 0; / 099 / } else if (startToEnd.remainder(step).compareTo(java.math.BigInteger.valueOf(0L)) != 0) { / 100 / range_numElementsTodo_0++; / 101 / } / 102 / } / 103 / / 104 / protected void processNext() throws java.io.IOException { / 105 / // initialize Range / 106 / if (!range_initRange_0) { / 107 / range_initRange_0 = true; / 108 / initRange(partitionIndex); / 109 / } / 110 / / 111 / while (true) { / 112 / if (range_nextIndex_0 == range_batchEnd_0) { / 113 / long range_nextBatchTodo_0; / 114 / if (range_numElementsTodo_0 > 1000L) { / 115 / range_nextBatchTodo_0 = 1000L; / 116 / range_numElementsTodo_0 -= 1000L; / 117 / } else { / 118 / range_nextBatchTodo_0 = range_numElementsTodo_0; / 119 / range_numElementsTodo_0 = 0; / 120 / if (range_nextBatchTodo_0 == 0) break; / 121 / } / 122 / range_batchEnd_0 += range_nextBatchTodo_0 1L; /* 123 / } / 124 / / 125 / int range_localEnd_0 = (int)((range_batchEnd_0 - range_nextIndex_0) / 1L); / 126 / for (int range_localIdx_0 = 0; range_localIdx_0 < range_localEnd_0; range_localIdx_0++) { / 127 / long range_value_0 = ((long)range_localIdx_0 1L) + range_nextIndex_0; /* 128 / / 129 / // common sub-expressions / 130 / / 131 / bnlj_doConsume_0(range_value_0); / 132 / / 133 / if (shouldStop()) { / 134 / range_nextIndex_0 = range_value_0 + 1L; / 135 / ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] / numOutputRows /).add(range_localIdx_0 + 1); / 136 / range_inputMetrics_0.incRecordsRead(range_localIdx_0 + 1); / 137 / return; / 138 / } / 139 / / 140 / } / 141 / range_nextIndex_0 = range_batchEnd_0; / 142 / ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] / numOutputRows /).add(range_localEnd_0); / 143 / range_inputMetrics_0.incRecordsRead(range_localEnd_0); / 144 / range_taskContext_0.killTaskIfInterrupted(); / 145 / } / 146 / } / 147 / / 148 / } ``` ### Why are the changes needed? Improve query CPU performance. Added a micro benchmark query in `JoinBenchmark.scala`. Saw 1x of run time improvement: ``` OpenJDK 64-Bit Server VM 11.0.9+11-LTS on Linux 4.14.219-161.340.amzn2.x86_64 Intel(R) Xeon(R) CPU E5-2670 v2 2.50GHz broadcast nested loop join: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------- broadcast nested loop join wholestage off 62922 63052 184 0.3 3000.3 1.0X broadcast nested loop join wholestage on 30946 30972 26 0.7 1475.6 2.0X ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added unit test in `WholeStageCodegenSuite.scala`, and existing unit tests for `BroadcastNestedLoopJoinExec`. * Updated golden files for several TCPDS query plans, as whole stage code-gen for `BroadcastNestedLoopJoinExec` is triggered. * Updated `JoinBenchmark-jdk11-results.txt ` and `JoinBenchmark-results.txt` with new benchmark result. Followed previous benchmark PRs - https://github.com/apache/spark/pull/27078 and https://github.com/apache/spark/pull/26003 to use same type of machine: ``` Amazon AWS EC2 type: r3.xlarge region: us-west-2 (Oregon) OS: Linux ``` Closes #31736 from c21/nested-join-exec. Authored-by: Cheng Su <chengsu@fb.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-03-09 11:45:43 +00:00
Kent Yao	d1177b5230	[SPARK-34192][SQL] Move char padding to write side and remove length check on read side too ### What changes were proposed in this pull request? On the read-side, the char length check and padding bring issues to CBO and predicate pushdown and other issues to the catalyst. This PR reverts `6da5cdf1db` that added read side length check) so that we only do length check for the write side, and data sources/vendors are responsible to enforce the char/varchar constraints for data import operations like ADD PARTITION. It doesn't make sense for Spark to report errors on the read-side if the data is already dirty. This PR also moves the char padding to the write-side, so that it 1) avoids read side issues like CBO and filter pushdown. 2) the data source can preserve char type semantic better even if it's read by systems other than Spark. ### Why are the changes needed? fix perf regression when tables have char/varchar type columns closes #31278 ### Does this PR introduce _any_ user-facing change? yes, spark will not raise error for oversized char/varchar values in read side ### How was this patch tested? modified ut the dropped read side benchmark ``` ================================================================================================ Char Varchar Read Side Perf w/o Tailing Spaces ================================================================================================ Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz Read with length 20: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ read string with length 20 1564 1573 9 63.9 15.6 1.0X read char with length 20 1532 1551 18 65.3 15.3 1.0X read varchar with length 20 1520 1531 13 65.8 15.2 1.0X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz Read with length 40: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ read string with length 40 1573 1613 41 63.6 15.7 1.0X read char with length 40 1575 1577 2 63.5 15.7 1.0X read varchar with length 40 1568 1576 11 63.8 15.7 1.0X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz Read with length 60: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ read string with length 60 1526 1540 23 65.5 15.3 1.0X read char with length 60 1514 1539 23 66.0 15.1 1.0X read varchar with length 60 1486 1497 10 67.3 14.9 1.0X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz Read with length 80: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ read string with length 80 1531 1542 19 65.3 15.3 1.0X read char with length 80 1514 1529 15 66.0 15.1 1.0X read varchar with length 80 1524 1565 42 65.6 15.2 1.0X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz Read with length 100: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ read string with length 100 1597 1623 25 62.6 16.0 1.0X read char with length 100 1499 1512 16 66.7 15.0 1.1X read varchar with length 100 1517 1524 8 65.9 15.2 1.1X ================================================================================================ Char Varchar Read Side Perf w/ Tailing Spaces ================================================================================================ Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz Read with length 20: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ read string with length 20 1524 1526 1 65.6 15.2 1.0X read char with length 20 1532 1537 9 65.3 15.3 1.0X read varchar with length 20 1520 1532 15 65.8 15.2 1.0X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz Read with length 40: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ read string with length 40 1556 1580 32 64.3 15.6 1.0X read char with length 40 1600 1611 17 62.5 16.0 1.0X read varchar with length 40 1648 1716 88 60.7 16.5 0.9X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz Read with length 60: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ read string with length 60 1504 1524 20 66.5 15.0 1.0X read char with length 60 1509 1512 3 66.2 15.1 1.0X read varchar with length 60 1519 1535 21 65.8 15.2 1.0X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz Read with length 80: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ read string with length 80 1640 1652 17 61.0 16.4 1.0X read char with length 80 1625 1666 35 61.5 16.3 1.0X read varchar with length 80 1590 1605 13 62.9 15.9 1.0X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz Read with length 100: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ read string with length 100 1622 1628 5 61.6 16.2 1.0X read char with length 100 1614 1646 30 62.0 16.1 1.0X read varchar with length 100 1594 1606 11 62.7 15.9 1.0X ``` Closes #31281 from yaooqinn/SPARK-34192. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-26 02:08:35 +08:00
Kent Yao	d640631e36	[SPARK-34164][SQL] Improve write side varchar check to visit only last few tailing spaces ### What changes were proposed in this pull request? For varchar(N), we currently trim all spaces first to check whether the remained length exceeds, it not necessary to visit them all but at most to those after N. ### Why are the changes needed? improve varchar performance for write side ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? benchmark and existing ut Closes #31253 from yaooqinn/SPARK-34164. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-21 05:30:57 +00:00
Kent Yao	6fa2fb9eb5	[SPARK-34130][SQL] Impove preformace for char varchar padding and length check with StaticInvoke ### What changes were proposed in this pull request? This could reduce the `generate.java` size to prevent codegen fallback which causes performance regression. here is a case from tpcds that could be fixed by this improvement https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133964/testReport/org.apache.spark.sql.execution/LogicalPlanTagInSparkPlanSuite/q41/ The original case generate 20K bytes, we are trying to reduce it to less than 8k ### Why are the changes needed? performance improvement as in the PR benchmark test, the performance w/ codegen is 2~3x better than w/o codegen. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? yes, it's a code reflect so the existing ut should be enough cross-check with https://github.com/apache/spark/pull/31012 where the tpcds shall all pass benchmark compared with master ```logtalk ================================================================================================ Char Varchar Read Side Perf ================================================================================================ Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz Read with length 20, hasSpaces: false: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ read string with length 20 1571 1667 83 63.6 15.7 1.0X read char with length 20 1710 1764 58 58.5 17.1 0.9X read varchar with length 20 1774 1792 16 56.4 17.7 0.9X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz Read with length 40, hasSpaces: false: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ read string with length 40 1824 1927 91 54.8 18.2 1.0X read char with length 40 1788 1928 137 55.9 17.9 1.0X read varchar with length 40 1676 1700 40 59.7 16.8 1.1X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz Read with length 60, hasSpaces: false: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ read string with length 60 1727 1762 30 57.9 17.3 1.0X read char with length 60 1628 1674 43 61.4 16.3 1.1X read varchar with length 60 1651 1665 13 60.6 16.5 1.0X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz Read with length 80, hasSpaces: true: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ read string with length 80 1748 1778 28 57.2 17.5 1.0X read char with length 80 1673 1678 9 59.8 16.7 1.0X read varchar with length 80 1667 1684 27 60.0 16.7 1.0X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz Read with length 100, hasSpaces: true: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ read string with length 100 1709 1743 48 58.5 17.1 1.0X read char with length 100 1610 1664 67 62.1 16.1 1.1X read varchar with length 100 1614 1673 53 61.9 16.1 1.1X ================================================================================================ Char Varchar Write Side Perf ================================================================================================ Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz Write with length 20, hasSpaces: false: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ write string with length 20 2277 2327 67 4.4 227.7 1.0X write char with length 20 2421 2443 19 4.1 242.1 0.9X write varchar with length 20 2393 2419 27 4.2 239.3 1.0X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz Write with length 40, hasSpaces: false: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ write string with length 40 2249 2290 38 4.4 224.9 1.0X write char with length 40 2386 2444 57 4.2 238.6 0.9X write varchar with length 40 2397 2405 12 4.2 239.7 0.9X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz Write with length 60, hasSpaces: false: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ write string with length 60 2326 2367 41 4.3 232.6 1.0X write char with length 60 2478 2501 37 4.0 247.8 0.9X write varchar with length 60 2475 2503 24 4.0 247.5 0.9X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz Write with length 80, hasSpaces: true: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ write string with length 80 9367 9773 354 1.1 936.7 1.0X write char with length 80 10454 10621 238 1.0 1045.4 0.9X write varchar with length 80 18943 19503 571 0.5 1894.3 0.5X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz Write with length 100, hasSpaces: true: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ write string with length 100 11055 11104 59 0.9 1105.5 1.0X write char with length 100 12204 12275 63 0.8 1220.4 0.9X write varchar with length 100 21737 22275 574 0.5 2173.7 0.5X ``` Closes #31199 from yaooqinn/SPARK-34130. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-19 09:03:06 +00:00
Liang-Chi Hsieh	fb7b870214	[SPARK-33523][SQL][TEST][FOLLOWUP] Fix benchmark case name in SubExprEliminationBenchmark ### What changes were proposed in this pull request? Fix the wrong benchmark case name. ### Why are the changes needed? The last commit to refactor the benchmark code missed a change of case name. ### Does this PR introduce _any_ user-facing change? No, dev only. ### How was this patch tested? Unit test. Closes #30505 from viirya/SPARK-33523-followup. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-25 15:22:47 -08:00
Liang-Chi Hsieh	9643eab53e	[SPARK-33540][SQL] Subexpression elimination for interpreted predicate ### What changes were proposed in this pull request? This patch proposes to support subexpression elimination for interpreted predicate. ### Why are the changes needed? Similar to interpreted projection, there are use cases when codegen predicate is not able to work, e.g. too complex schema, non-codegen expression, etc. When there are frequently occurring expressions (subexpressions) among predicate expression, the performance is quite bad as we need to re-compute same expressions. We should be able to support subexpression elimination for interpreted predicate like interpreted projection. ### Does this PR introduce _any_ user-facing change? No, this doesn't change user behavior. ### How was this patch tested? Unit test and benchmark. Closes #30497 from viirya/SPARK-33540. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-25 08:55:39 -08:00
Liang-Chi Hsieh	f35e28fea5	[SPARK-33523][SQL][TEST] Add predicate related benchmark to SubExprEliminationBenchmark ### What changes were proposed in this pull request? This patch adds predicate related benchmark to `SubExprEliminationBenchmark`. ### Why are the changes needed? We should have a benchmark for subexpression elimination of predicate. ### Does this PR introduce _any_ user-facing change? No, dev only. ### How was this patch tested? Run benchmark locally. Closes #30476 from viirya/SPARK-33523. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-24 13:30:06 +09:00
Liang-Chi Hsieh	928348408e	[SPARK-33427][SQL] Add subexpression elimination for interpreted expression evaluation ### What changes were proposed in this pull request? This patch proposes to add subexpression elimination for interpreted expression evaluation. Interpreted expression evaluation is used when codegen was not able to work, for example complex schema. ### Why are the changes needed? Currently we only do subexpression elimination for codegen. For some reasons, we may need to run interpreted expression evaluation. For example, codegen fails to compile and fallbacks to interpreted mode, or complex input/output schema of expressions. It is commonly seen for complex schema from expressions that is possibly caused by the query optimizer too, e.g. SPARK-32945. We should also support subexpression elimination for interpreted evaluation. That could reduce performance difference when Spark fallbacks from codegen to interpreted expression evaluation, and improve Spark usability. #### Benchmark Update `SubExprEliminationBenchmark`: Before: ``` OpenJDK 64-Bit Server VM 1.8.0_265-b01 on Mac OS X 10.15.6 Intel(R) Core(TM) i7-9750H CPU 2.60GHz from_json as subExpr: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------- subexpressionElimination on, codegen off 24707 25688 903 0.0 247068775.9 1.0X ``` After: ``` OpenJDK 64-Bit Server VM 1.8.0_265-b01 on Mac OS X 10.15.6 Intel(R) Core(TM) i7-9750H CPU 2.60GHz from_json as subExpr: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------- subexpressionElimination on, codegen off 2360 2435 87 0.0 23604320.7 11.2X ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test. Benchmark manually. Closes #30341 from viirya/SPARK-33427. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-17 14:29:37 +00:00
Liang-Chi Hsieh	eea846b895	[SPARK-33455][SQL][TEST] Add SubExprEliminationBenchmark for benchmarking subexpression elimination ### What changes were proposed in this pull request? This patch adds a benchmark `SubExprEliminationBenchmark` for benchmarking subexpression elimination feature. ### Why are the changes needed? We need a benchmark for subexpression elimination feature for change such as #30341. ### Does this PR introduce _any_ user-facing change? No, dev only. ### How was this patch tested? Unit test. Closes #30379 from viirya/SPARK-33455. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-14 19:02:36 -08:00
Max Gekk	7e867298fe	[SPARK-33404][SQL][FOLLOWUP] Update benchmark results for `date_trunc` ### What changes were proposed in this pull request? Updated results of `DateTimeBenchmark` in the environment: \| Item \| Description \| \| ---- \| ----\| \| Region \| us-west-2 (Oregon) \| \| Instance \| r3.xlarge (spot instance) \| \| AMI \| ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) \| \| Java \| OpenJDK8/11 installed by`sudo add-apt-repository ppa:openjdk-r/ppa` & `sudo apt install openjdk-11-jdk`\| ### Why are the changes needed? The fix https://github.com/apache/spark/pull/30303 slowed down `date_trunc`. This PR updates benchmark results to have actual info about performance of `date_trunc`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By regenerating benchmark results: ``` $ SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.DateTimeBenchmark" ``` Closes #30338 from MaxGekk/fix-trunc_date-benchmark. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-11 08:50:43 -08:00
yangjie01	b38f3a5557	[SPARK-32978][SQL] Make sure the number of dynamic part metric is correct ### What changes were proposed in this pull request? The purpose of this pr is to resolve SPARK-32978. The main reason of bad case describe in SPARK-32978 is the `BasicWriteTaskStatsTracker` directly reports the new added partition number of each task, which makes it impossible to remove duplicate data in driver side. The main of this pr is change to report partitionValues to driver and remove duplicate data at driver side to make sure the number of dynamic part metric is correct. ### Why are the changes needed? The the number of dynamic part metric we display on the UI should be correct. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Add a new test case refer to described in SPARK-32978 Closes #30026 from LuciferYang/SPARK-32978. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-10-22 14:01:07 +00:00
Max Gekk	bbf2d6f6df	[SPARK-33160][SQL][FOLLOWUP] Update benchmarks of INT96 type rebasing ### What changes were proposed in this pull request? 1. Turn off/on the SQL config `spark.sql.legacy.parquet.int96RebaseModeInWrite` which was added by https://github.com/apache/spark/pull/30056 in `DateTimeRebaseBenchmark`. The parquet readers should infer correct rebasing mode automatically from metadata. 2. Regenerate benchmark results of `DateTimeRebaseBenchmark` in the environment: \| Item \| Description \| \| ---- \| ----\| \| Region \| us-west-2 (Oregon) \| \| Instance \| r3.xlarge (spot instance) \| \| AMI \| ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) \| \| Java \| OpenJDK8/11 installed by`sudo add-apt-repository ppa:openjdk-r/ppa` & `sudo apt install openjdk-11-jdk`\| ### Why are the changes needed? To have up-to-date info about INT96 performance which is the default type for Catalyst's timestamp type. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By updating benchmark results: ``` $ SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.DateTimeRebaseBenchmark" ``` Closes #30118 from MaxGekk/int96-rebase-benchmark. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-22 10:03:41 +09:00
fqaiser94@gmail.com	2793347972	[SPARK-32511][SQL] Add dropFields method to Column class ### What changes were proposed in this pull request? 1. Refactored `WithFields` Expression to make it more extensible (now `UpdateFields`). 2. Added a new `dropFields` method to the `Column` class. This method should allow users to drop a `StructField` in a `StructType` column (with similar semantics to the `drop` method on `Dataset`). ### Why are the changes needed? Often Spark users have to work with deeply nested data e.g. to fix a data quality issue with an existing `StructField`. To do this with the existing Spark APIs, users have to rebuild the entire struct column. For example, let's say you have the following deeply nested data structure which has a data quality issue (`5` is missing): ``` import org.apache.spark.sql._ import org.apache.spark.sql.functions._ import org.apache.spark.sql.types._ val data = spark.createDataFrame(sc.parallelize( Seq(Row(Row(Row(1, 2, 3), Row(Row(4, null, 6), Row(7, 8, 9), Row(10, 11, 12)), Row(13, 14, 15))))), StructType(Seq( StructField("a", StructType(Seq( StructField("a", StructType(Seq( StructField("a", IntegerType), StructField("b", IntegerType), StructField("c", IntegerType)))), StructField("b", StructType(Seq( StructField("a", StructType(Seq( StructField("a", IntegerType), StructField("b", IntegerType), StructField("c", IntegerType)))), StructField("b", StructType(Seq( StructField("a", IntegerType), StructField("b", IntegerType), StructField("c", IntegerType)))), StructField("c", StructType(Seq( StructField("a", IntegerType), StructField("b", IntegerType), StructField("c", IntegerType)))) ))), StructField("c", StructType(Seq( StructField("a", IntegerType), StructField("b", IntegerType), StructField("c", IntegerType)))) )))))).cache data.show(false) +---------------------------------+ \|a \| +---------------------------------+ \|[[1, 2, 3], [[4,, 6], [7, 8, 9]]]\| +---------------------------------+ ``` Currently, to drop the missing value users would have to do something like this: ``` val result = data.withColumn("a", struct( $"a.a", struct( struct( $"a.b.a.a", $"a.b.a.c" ).as("a"), $"a.b.b", $"a.b.c" ).as("b"), $"a.c" )) result.show(false) +---------------------------------------------------------------+ \|a \| +---------------------------------------------------------------+ \|[[1, 2, 3], [[4, 6], [7, 8, 9], [10, 11, 12]], [13, 14, 15]]\| +---------------------------------------------------------------+ ``` As you can see above, with the existing methods users must call the `struct` function and list all fields, including fields they don't want to change. This is not ideal as: >this leads to complex, fragile code that cannot survive schema evolution. [SPARK-16483](https://issues.apache.org/jira/browse/SPARK-16483) In contrast, with the method added in this PR, a user could simply do something like this to get the same result: ``` val result = data.withColumn("a", 'a.dropFields("b.a.b")) result.show(false) +---------------------------------------------------------------+ \|a \| +---------------------------------------------------------------+ \|[[1, 2, 3], [[4, 6], [7, 8, 9], [10, 11, 12]], [13, 14, 15]]\| +---------------------------------------------------------------+ ``` This is the second of maybe 3 methods that could be added to the `Column` class to make it easier to manipulate nested data. Other methods under discussion in [SPARK-22231](https://issues.apache.org/jira/browse/SPARK-22231) include `withFieldRenamed`. However, this should be added in a separate PR. ### Does this PR introduce _any_ user-facing change? The documentation for `Column.withField` method has changed to include an additional note about how to write optimized queries when adding multiple nested Column directly. ### How was this patch tested? New unit tests were added. Jenkins must pass them. ### Related JIRAs: More discussion on this topic can be found here: - https://issues.apache.org/jira/browse/SPARK-22231 - https://issues.apache.org/jira/browse/SPARK-16483 Closes #29795 from fqaiser94/SPARK-32511-dropFields-second-try. Authored-by: fqaiser94@gmail.com <fqaiser94@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-10-06 08:53:30 +00:00
Chao Sun	a6d6ea3efe	[SPARK-32802][SQL] Avoid using SpecificInternalRow in RunLengthEncoding#Encoder ### What changes were proposed in this pull request? Currently `RunLengthEncoding#Encoder` uses `SpecificInternalRow` as a holder for the current value when calculating compression stats and doing the actual compression. It calls `ColumnType.copyField` and `ColumnType.getField` on the internal row which incurs extra cost comparing to directly operating on the internal type. This proposes to replace the `SpecificInternalRow` with `T#InternalType` to avoid the extra cost. ### Why are the changes needed? Operating on `SpecificInternalRow` carries certain cost and negatively impact performance when using `RunLengthEncoding` for compression. With the change I see some improvements through `CompressionSchemeBenchmark`: ```diff Intel(R) Core(TM) i9-9880H CPU 2.30GHz BOOLEAN Encode: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -PassThrough(1.000) 1 1 0 51957.0 0.0 1.0X -RunLengthEncoding(2.502) 549 555 9 122.2 8.2 0.0X -BooleanBitSet(0.125) 296 301 3 226.6 4.4 0.0X +PassThrough(1.000) 2 2 0 42985.4 0.0 1.0X +RunLengthEncoding(2.517) 487 500 10 137.7 7.3 0.0X +BooleanBitSet(0.125) 348 353 4 192.8 5.2 0.0X OpenJDK 64-Bit Server VM 11.0.8+10-LTS on Mac OS X 10.15.5 Intel(R) Core(TM) i9-9880H CPU 2.30GHz SHORT Encode (Lower Skew): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -PassThrough(1.000) 3 3 0 22779.9 0.0 1.0X -RunLengthEncoding(1.520) 1186 1192 9 56.6 17.7 0.0X +PassThrough(1.000) 3 4 0 21216.6 0.0 1.0X +RunLengthEncoding(1.493) 882 931 50 76.1 13.1 0.0X OpenJDK 64-Bit Server VM 11.0.8+10-LTS on Mac OS X 10.15.5 Intel(R) Core(TM) i9-9880H CPU 2.30GHz SHORT Encode (Higher Skew): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -PassThrough(1.000) 3 4 0 21352.2 0.0 1.0X -RunLengthEncoding(2.009) 1173 1175 3 57.2 17.5 0.0X +PassThrough(1.000) 3 3 0 22388.6 0.0 1.0X +RunLengthEncoding(2.015) 924 941 23 72.6 13.8 0.0X OpenJDK 64-Bit Server VM 11.0.8+10-LTS on Mac OS X 10.15.5 Intel(R) Core(TM) i9-9880H CPU 2.30GHz INT Encode (Lower Skew): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -PassThrough(1.000) 9 10 1 7410.1 0.1 1.0X -RunLengthEncoding(1.000) 1499 1502 4 44.8 22.3 0.0X -DictionaryEncoding(0.500) 621 630 11 108.0 9.3 0.0X -IntDelta(0.250) 134 149 10 502.0 2.0 0.1X +PassThrough(1.000) 9 10 1 7575.9 0.1 1.0X +RunLengthEncoding(1.002) 952 966 12 70.5 14.2 0.0X +DictionaryEncoding(0.500) 561 567 6 119.7 8.4 0.0X +IntDelta(0.250) 129 134 3 521.9 1.9 0.1X OpenJDK 64-Bit Server VM 11.0.8+10-LTS on Mac OS X 10.15.5 Intel(R) Core(TM) i9-9880H CPU 2.30GHz INT Encode (Higher Skew): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -PassThrough(1.000) 9 10 1 7668.3 0.1 1.0X -RunLengthEncoding(1.332) 1561 1685 175 43.0 23.3 0.0X -DictionaryEncoding(0.501) 616 642 21 108.9 9.2 0.0X -IntDelta(0.250) 126 131 2 533.4 1.9 0.1X +PassThrough(1.000) 9 10 1 7494.1 0.1 1.0X +RunLengthEncoding(1.336) 974 987 13 68.9 14.5 0.0X +DictionaryEncoding(0.501) 709 719 10 94.6 10.6 0.0X +IntDelta(0.250) 127 132 4 528.4 1.9 0.1X OpenJDK 64-Bit Server VM 11.0.8+10-LTS on Mac OS X 10.15.5 Intel(R) Core(TM) i9-9880H CPU 2.30GHz LONG Encode (Lower Skew): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -PassThrough(1.000) 18 19 1 3803.0 0.3 1.0X -RunLengthEncoding(0.754) 1526 1540 20 44.0 22.7 0.0X -DictionaryEncoding(0.250) 735 759 33 91.3 11.0 0.0X -LongDelta(0.125) 126 129 2 530.8 1.9 0.1X +PassThrough(1.000) 19 21 1 3543.5 0.3 1.0X +RunLengthEncoding(0.747) 1049 1058 12 63.9 15.6 0.0X +DictionaryEncoding(0.250) 620 634 17 108.2 9.2 0.0X +LongDelta(0.125) 129 132 2 520.1 1.9 0.1X OpenJDK 64-Bit Server VM 11.0.8+10-LTS on Mac OS X 10.15.5 Intel(R) Core(TM) i9-9880H CPU 2.30GHz LONG Encode (Higher Skew): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -PassThrough(1.000) 18 20 1 3705.4 0.3 1.0X -RunLengthEncoding(1.002) 1665 1669 6 40.3 24.8 0.0X -DictionaryEncoding(0.251) 890 901 11 75.4 13.3 0.0X -LongDelta(0.125) 125 130 3 537.2 1.9 0.1X +PassThrough(1.000) 18 20 2 3726.8 0.3 1.0X +RunLengthEncoding(0.999) 1076 1077 2 62.4 16.0 0.0X +DictionaryEncoding(0.251) 904 919 19 74.3 13.5 0.0X +LongDelta(0.125) 125 131 4 536.5 1.9 0.1X OpenJDK 64-Bit Server VM 11.0.8+10-LTS on Mac OS X 10.15.5 Intel(R) Core(TM) i9-9880H CPU 2.30GHz STRING Encode: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -PassThrough(1.000) 27 30 2 2497.1 0.4 1.0X -RunLengthEncoding(0.892) 3443 3587 204 19.5 51.3 0.0X -DictionaryEncoding(0.167) 2286 2290 6 29.4 34.1 0.0X +PassThrough(1.000) 28 31 2 2430.2 0.4 1.0X +RunLengthEncoding(0.889) 1798 1800 3 37.3 26.8 0.0X +DictionaryEncoding(0.167) 1956 1959 4 34.3 29.1 0.0X ``` In the above diff, new results are with changes in this PR. It can be seen that encoding performance has improved quite a lot especially for string type. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Relies on existing unit tests. Closes #29654 from sunchao/SPARK-32802. Authored-by: Chao Sun <sunchao@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-09-12 22:19:30 -07:00
Maxim Gekk	c1f160e097	[SPARK-30648][SQL] Support filters pushdown in JSON datasource ### What changes were proposed in this pull request? In the PR, I propose to support pushed down filters in JSON datasource. The reason of pushing a filter up to `JacksonParser` is to apply the filter as soon as all its attributes become available i.e. converted from JSON field values to desired values according to the schema. This allows to skip parsing of the rest of JSON record and conversions of other values if the filter returns `false`. This can improve performance when pushed filters are highly selective and conversion of JSON string fields to desired values are comparably expensive ( for example, the conversion to `TIMESTAMP` values). The main idea behind of `JsonFilters` is to group pushdown filters by their references, convert the grouped filters to expressions, and then compile to predicates. The predicates are indexed by schema field positions. Each predicate has a state with reference counter to non-set row fields. As soon as the counter reaches `0`, it can be applied to the row because all its dependencies has been set. Before processing new row, predicate's reference counter is reset to total number of predicate references (dependencies in a row). The common code shared between `CSVFilters` and `JsonFilters` is moved to the `StructFilters` class and its companion object. ### Why are the changes needed? The changes improve performance on synthetic benchmarks up to 27 times on JDK 8 and 25 times on JDK 11: ``` OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1044-aws Intel(R) Xeon(R) CPU E5-2670 v2 2.50GHz Filters pushdown: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ w/o filters 25230 25255 22 0.0 252299.6 1.0X pushdown disabled 25248 25282 33 0.0 252475.6 1.0X w/ filters 905 911 8 0.1 9047.9 27.9X ``` ### Does this PR introduce any user-facing change? No ### How was this patch tested? - Added new test suites `JsonFiltersSuite` and `JacksonParserSuite`. - By new end-to-end and case sensitivity tests in `JsonSuite`. - By `CSVFiltersSuite`, `UnivocityParserSuite` and `CSVSuite`. - Re-running `CSVBenchmark` and `JsonBenchmark` using Amazon EC2: \| Item \| Description \| \| ---- \| ----\| \| Region \| us-west-2 (Oregon) \| \| Instance \| r3.xlarge (spot instance) \| \| AMI \| ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) \| \| Java \| OpenJDK8/11 installed by`sudo add-apt-repository ppa:openjdk-r/ppa` & `sudo apt install openjdk-11-jdk`\| and `./dev/run-benchmarks`: ```python #!/usr/bin/env python3 import os from sparktestsupport.shellutils import run_cmd benchmarks = [ ['sql/test', 'org.apache.spark.sql.execution.datasources.csv.CSVBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.datasources.json.JsonBenchmark'] ] print('Set SPARK_GENERATE_BENCHMARK_FILES=1') os.environ['SPARK_GENERATE_BENCHMARK_FILES'] = '1' for b in benchmarks: print("Run benchmark: %s" % b[1]) run_cmd(['build/sbt', '%s:runMain %s' % (b[0], b[1])]) ``` Closes #27366 from MaxGekk/json-filters-pushdown. Lead-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-07-17 00:01:13 +09:00
Max Gekk	42f01e314b	[SPARK-32130][SQL][FOLLOWUP] Enable timestamps inference in JsonBenchmark ### What changes were proposed in this pull request? Set the JSON option `inferTimestamp` to `true` for the cases that measure perf of timestamp inference. ### Why are the changes needed? The PR https://github.com/apache/spark/pull/28966 disabled timestamp inference by default. As a consequence, some benchmarks don't measure perf of timestamp inference from JSON fields. This PR explicitly enable such inference. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By re-generating results of `JsonBenchmark`. Closes #28981 from MaxGekk/json-inferTimestamps-disable-by-default-followup. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-07-02 13:26:57 -07:00
Max Gekk	bcf23307f4	[SPARK-32130][SQL] Disable the JSON option `inferTimestamp` by default ### What changes were proposed in this pull request? Set the JSON option `inferTimestamp` to `false` if an user don't pass it as datasource option. ### Why are the changes needed? To prevent perf regression while inferring schemas from JSON with potential timestamps fields. ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? - Modified existing tests in `JsonSuite` and `JsonInferSchemaSuite`. - Regenerated results of `JsonBenchmark` in the environment: \| Item \| Description \| \| ---- \| ----\| \| Region \| us-west-2 (Oregon) \| \| Instance \| r3.xlarge \| \| AMI \| ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) \| \| Java \| OpenJDK 64-Bit Server VM 1.8.0_252 and OpenJDK 64-Bit Server VM 11.0.7+10 \| Closes #28966 from MaxGekk/json-inferTimestamps-disable-by-default. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-07-01 15:45:39 -07:00
Max Gekk	8c44d74463	[SPARK-32071][SQL][TESTS] Add `make_interval` benchmark ### What changes were proposed in this pull request? Add benchmarks for interval constructor `make_interval` and measure perf of 4 cases: 1. Constant (year, month) 2. Constant (week, day) 3. Constant (hour, minute, second, second fraction) 4. All fields are NOT constant. The benchmark results are generated in the environment: \| Item \| Description \| \| ---- \| ----\| \| Region \| us-west-2 (Oregon) \| \| Instance \| r3.xlarge \| \| AMI \| ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) \| \| Java \| OpenJDK 64-Bit Server VM 1.8.0_252 and OpenJDK 64-Bit Server VM 11.0.7+10 \| ### Why are the changes needed? To have a base line for future perf improvements of `make_interval`, and to prevent perf regressions in the future. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running `IntervalBenchmark` via: ``` $ SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.IntervalBenchmark" ``` Closes #28905 from MaxGekk/benchmark-make_interval. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-06-27 17:54:06 -07:00
Max Gekk	045106e29d	[SPARK-32072][CORE][TESTS] Fix table formatting with benchmark results ### What changes were proposed in this pull request? Set column width w/ benchmark names to maximum of either 1. 40 (before this PR) or 2. The length of benchmark name or 3. Maximum length of cases names ### Why are the changes needed? To improve readability of benchmark results. For example, `MakeDateTimeBenchmark`. Before: ``` make_timestamp(): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ prepare make_timestamp() 3636 3673 38 0.3 3635.7 1.0X make_timestamp(2019, 1, 2, 3, 4, 50.123456) 94 99 4 10.7 93.8 38.8X make_timestamp(2019, 1, 2, 3, 4, 60.000000) 68 80 13 14.6 68.3 53.2X make_timestamp(2019, 12, 31, 23, 59, 60.00) 65 79 19 15.3 65.3 55.7X make_timestamp(, , , 3, 4, 50.123456) 271 280 14 3.7 270.7 13.4X ``` After: ``` make_timestamp(): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative --------------------------------------------------------------------------------------------------------------------------- prepare make_timestamp() 3694 3745 82 0.3 3694.0 1.0X make_timestamp(2019, 1, 2, 3, 4, 50.123456) 82 90 9 12.2 82.3 44.9X make_timestamp(2019, 1, 2, 3, 4, 60.000000) 72 77 5 13.9 71.9 51.4X make_timestamp(2019, 12, 31, 23, 59, 60.00) 67 71 5 15.0 66.8 55.3X make_timestamp(, , , 3, 4, 50.123456) 273 289 14 3.7 273.2 13.5X ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By re-generating benchmark results for `MakeDateTimeBenchmark`: ``` $ SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.MakeDateTimeBenchmark" ``` in the environment: \| Item \| Description \| \| ---- \| ----\| \| Region \| us-west-2 (Oregon) \| \| Instance \| r3.xlarge \| \| AMI \| ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) \| \| Java \| OpenJDK 64-Bit Server VM 1.8.0_252 and OpenJDK 64-Bit Server VM 11.0.7+10 \| Closes #28906 from MaxGekk/benchmark-table-formatting. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-06-24 04:43:53 +00:00
Max Gekk	e00f43cb86	[SPARK-32043][SQL] Replace Decimal by Int op in `make_interval` and `make_timestamp` ### What changes were proposed in this pull request? Replace Decimal by Int op in the `MakeInterval` & `MakeTimestamp` expression. For instance, `(secs * Decimal(MICROS_PER_SECOND)).toLong` can be replaced by the unscaled long because the former one already contains microseconds. ### Why are the changes needed? To improve performance. Before: ``` make_timestamp(): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ ... make_timestamp(2019, 1, 2, 3, 4, 50.123456) 94 99 4 10.7 93.8 38.8X ``` After: ``` make_timestamp(2019, 1, 2, 3, 4, 50.123456) 76 92 15 13.1 76.5 48.1X ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - By existing test suites `IntervalExpressionsSuite`, `DateExpressionsSuite` and etc. - Re-generate results of `MakeDateTimeBenchmark` in the environment: \| Item \| Description \| \| ---- \| ----\| \| Region \| us-west-2 (Oregon) \| \| Instance \| r3.xlarge \| \| AMI \| ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) \| \| Java \| OpenJDK 64-Bit Server VM 1.8.0_252 and OpenJDK 64-Bit Server VM 11.0.7+10 \| Closes #28886 from MaxGekk/make_interval-opt-decimal. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-06-23 11:45:12 +00:00
Max Gekk	350aa859fe	[SPARK-32006][SQL] Create date/timestamp formatters once before collect in `hiveResultString()` ### What changes were proposed in this pull request? 1. Add method `getTimeFormatters` to `HiveResult` which creates timestamp and date formatters. 2. Move creation of `dateFormatter` and `timestampFormatter` from the constructor of the `HiveResult` object to `HiveResult. hiveResultString()` via `getTimeFormatters`. This allows to resolve time zone ID from Spark's session time zone `spark.sql.session.timeZone` and create date/timestamp formatters only once before collecting `java.sql.Timestamp`/`java.sql.Date` values. 3. Create date/timestamp formatters once in SparkExecuteStatementOperation. ### Why are the changes needed? To fix perf regression comparing to Spark 2.4 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - By existing test suite `HiveResultSuite` and etc. - Re-generate benchmarks results of `DateTimeBenchmark` in the environment: \| Item \| Description \| \| ---- \| ----\| \| Region \| us-west-2 (Oregon) \| \| Instance \| r3.xlarge \| \| AMI \| ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) \| \| Java \| OpenJDK 64-Bit Server VM 1.8.0_252 and OpenJDK 64-Bit Server VM 11.0.7+10 \| Closes #28842 from MaxGekk/opt-toHiveString-oss-master. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-06-17 06:28:47 +00:00
Max Gekk	9d95f1b010	[SPARK-31992][SQL] Benchmark the EXCEPTION rebase mode ### What changes were proposed in this pull request? - Modify `DateTimeRebaseBenchmark` to benchmark the default date-time rebasing mode - `EXCEPTION` for saving/loading dates/timestamps from/to parquet files. The mode is benchmarked for modern timestamps after 1900-01-01 00:00:00Z and dates after 1582-10-15. - Regenerate benchmark results in the environment: \| Item \| Description \| \| ---- \| ----\| \| Region \| us-west-2 (Oregon) \| \| Instance \| r3.xlarge \| \| AMI \| ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) \| \| Java \| OpenJDK 64-Bit Server VM 1.8.0_252 and OpenJDK 64-Bit Server VM 11.0.7+10 \| ### Why are the changes needed? The `EXCEPTION` rebasing mode is the default mode of the SQL configs `spark.sql.legacy.parquet.datetimeRebaseModeInRead` and `spark.sql.legacy.parquet.datetimeRebaseModeInWrite`. The changes are needed to improve benchmark coverage for default settings. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the benchmark and check results manually. Closes #28829 from MaxGekk/benchmark-exception-mode. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-06-15 07:25:56 +00:00
Max Gekk	ddd8d5f5a0	[SPARK-31932][SQL][TESTS] Add date/timestamp benchmarks for `HiveResult.hiveResultString()` ### What changes were proposed in this pull request? Add benchmarks for `HiveResult.hiveResultString()/toHiveString()` to measure throughput of `toHiveString` for the date/timestamp types: - java.sql.Date/Timestamp - java.time.Instant - java.time.LocalDate Benchmark results were generated in the environment: \| Item \| Description \| \| ---- \| ----\| \| Region \| us-west-2 (Oregon) \| \| Instance \| r3.xlarge \| \| AMI \| ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) \| \| Java \| OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 \| ### Why are the changes needed? To detect perf regressions of `toHiveString` in the future. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running `DateTimeBenchmark` and check dataset content. Closes #28757 from MaxGekk/benchmark-toHiveString. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-06-09 04:59:41 +00:00
Max Gekk	92685c0148	[SPARK-31755][SQL][FOLLOWUP] Update date-time, CSV and JSON benchmark results ### What changes were proposed in this pull request? Re-generate results of: - DateTimeBenchmark - CSVBenchmark - JsonBenchmark in the environment: \| Item \| Description \| \| ---- \| ----\| \| Region \| us-west-2 (Oregon) \| \| Instance \| r3.xlarge \| \| AMI \| ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) \| \| Java \| OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 \| ### Why are the changes needed? 1. The PR https://github.com/apache/spark/pull/28576 changed date-time parser. The `DateTimeBenchmark` should confirm that the PR didn't slow down date/timestamp parsing. 2. CSV/JSON datasources are affected by the above PR too. This PR updates the benchmark results in the same environment as other benchmarks to have a base line for future optimizations. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running benchmarks via the script: ```python #!/usr/bin/env python3 import os from sparktestsupport.shellutils import run_cmd benchmarks = [ ['sql/test', 'org.apache.spark.sql.execution.benchmark.DateTimeBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.datasources.csv.CSVBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.datasources.json.JsonBenchmark'] ] print('Set SPARK_GENERATE_BENCHMARK_FILES=1') os.environ['SPARK_GENERATE_BENCHMARK_FILES'] = '1' for b in benchmarks: print("Run benchmark: %s" % b[1]) run_cmd(['build/sbt', '%s:runMain %s' % (b[0], b[1])]) ``` Closes #28613 from MaxGekk/missing-hour-year-benchmarks. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-05-25 15:00:11 +00:00
Max Gekk	bef5828e12	[SPARK-31630][SQL] Fix perf regression by skipping timestamps rebasing after some threshold ### What changes were proposed in this pull request? Skip timestamps rebasing after a global threshold when there is no difference between Julian and Gregorian calendars. This allows to avoid checking hash maps of switch points, and fixes perf regressions in `toJavaTimestamp()` and `fromJavaTimestamp()`. ### Why are the changes needed? The changes fix perf regressions of conversions to/from external type `java.sql.Timestamp`. Before (see the PR's results https://github.com/apache/spark/pull/28440): ``` ================================================================================================ Conversion from/to external types ================================================================================================ OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 2.50GHz To/from Java's date-time: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ From java.sql.Timestamp 376 388 10 13.3 75.2 1.1X Collect java.sql.Timestamp 1878 1937 64 2.7 375.6 0.2X ``` After: ``` ================================================================================================ Conversion from/to external types ================================================================================================ OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 2.50GHz To/from Java's date-time: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ From java.sql.Timestamp 249 264 24 20.1 49.8 1.7X Collect java.sql.Timestamp 1503 1523 24 3.3 300.5 0.3X ``` Perf improvements in average of: 1. From java.sql.Timestamp is ~ 34% 2. To java.sql.Timestamps is ~16% ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By existing test suites `DateTimeUtilsSuite` and `RebaseDateTimeSuite`. Closes #28441 from MaxGekk/opt-rebase-common-threshold. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-05-05 14:11:53 +00:00
Max Gekk	735771e7b4	[SPARK-31623][SQL][TESTS] Benchmark rebasing of INT96 and TIMESTAMP_MILLIS timestamps in read/write ### What changes were proposed in this pull request? Add new benchmarks to `DateTimeRebaseBenchmark` for reading/writing timestamps of INT96 and TIMESTAMP_MICROS column types. Here are benchmark results for reading timestamps after 1582 year with default settings (rebasing is off for TIMESTAMP_MICROS/TIMESTAMP_MILLIS, and rebasing on for INT96): timestamp type \| vectorized off (ns/row) \| vectorized on (ns/row) --\|--\|-- TIMESTAMP_MICROS\| 160.1 \| 50.2 INT96 \| 215.6 \| 117.8 TIMESTAMP_MILLIS \| 159.9 \| 60.6 ### Why are the changes needed? To compare default timestamp type `TIMESTAMP_MICROS` with other types in the case if an user decides to switch on them. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the benchmarks via: ``` SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.DateTimeRebaseBenchmark" ``` in the environment: \| Item \| Description \| \| ---- \| ----\| \| Region \| us-west-2 (Oregon) \| \| Instance \| r3.xlarge \| \| AMI \| ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) \| \| Java \| OpenJDK 64-Bit Server VM 1.8.0_252-8u252 and OpenJDK 64-Bit Server VM 11.0.7+10 \| Closes #28431 from MaxGekk/parquet-timestamps-DateTimeRebaseBenchmark. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-05-05 05:40:15 +00:00
Wenchen Fan	f72220b8ab	[SPARK-31606][SQL] Reduce the perf regression of vectorized parquet reader caused by datetime rebase ### What changes were proposed in this pull request? Push the rebase logic to the lower level of the parquet vectorized reader, to make the final code more vectorization-friendly. ### Why are the changes needed? Parquet vectorized reader is carefully implemented, to make it more likely to be vectorized by the JVM. However, the newly added datetime rebase degrade the performance a lot, as it breaks vectorization, even if the datetime values don't need to rebase (this is very likely as dates before 1582 is rare). ### Does this PR introduce any user-facing change? no ### How was this patch tested? Run part of the `DateTimeRebaseBenchmark` locally. The results: before this patch ``` [info] Load dates from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] after 1582, vec on, rebase off 2677 2838 142 37.4 26.8 1.0X [info] after 1582, vec on, rebase on 3828 4331 805 26.1 38.3 0.7X [info] before 1582, vec on, rebase off 2903 2926 34 34.4 29.0 0.9X [info] before 1582, vec on, rebase on 4163 4197 38 24.0 41.6 0.6X [info] Load timestamps from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] after 1900, vec on, rebase off 3537 3627 104 28.3 35.4 1.0X [info] after 1900, vec on, rebase on 6891 7010 105 14.5 68.9 0.5X [info] before 1900, vec on, rebase off 3692 3770 72 27.1 36.9 1.0X [info] before 1900, vec on, rebase on 7588 7610 30 13.2 75.9 0.5X ``` After this patch ``` [info] Load dates from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] after 1582, vec on, rebase off 2758 2944 197 36.3 27.6 1.0X [info] after 1582, vec on, rebase on 2908 2966 51 34.4 29.1 0.9X [info] before 1582, vec on, rebase off 2840 2878 37 35.2 28.4 1.0X [info] before 1582, vec on, rebase on 3407 3433 24 29.4 34.1 0.8X [info] Load timestamps from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] after 1900, vec on, rebase off 3861 4003 139 25.9 38.6 1.0X [info] after 1900, vec on, rebase on 4194 4283 77 23.8 41.9 0.9X [info] before 1900, vec on, rebase off 3849 3937 79 26.0 38.5 1.0X [info] before 1900, vec on, rebase on 7512 7546 55 13.3 75.1 0.5X ``` Date type is 30% faster if the values don't need to rebase, 20% faster if need to rebase. Timestamp type is 60% faster if the values don't need to rebase, no difference if need to rebase. Closes #28406 from cloud-fan/perf. Lead-authored-by: Wenchen Fan <wenchen@databricks.com> Co-authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-05-04 15:30:10 +09:00
Max Gekk	2fb85f6b68	[SPARK-31527][SQL][TESTS][FOLLOWUP] Fix the number of rows in `DateTimeBenchmark` ### What changes were proposed in this pull request? - Changed to the number of rows in benchmark cases from 3 to the actual number `N`. - Regenerated benchmark results in the environment: \| Item \| Description \| \| ---- \| ----\| \| Region \| us-west-2 (Oregon) \| \| Instance \| r3.xlarge \| \| AMI \| ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) \| \| Java \| OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 \| ### Why are the changes needed? The changes are needed to have: - Correct benchmark results - Base line for other perf improvements that can be checked in the same environment. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the benchmark and checking its output. Closes #28440 from MaxGekk/SPARK-31527-DateTimeBenchmark-followup. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-05-04 09:39:50 +09:00
Kent Yao	54996be4d2	[SPARK-31527][SQL][TESTS][FOLLOWUP] Add a benchmark test for datetime add/subtract interval operations ### What changes were proposed in this pull request? With https://github.com/apache/spark/pull/28310, the operation of date +/- interval(m, d, 0) has been improved a lot. According to the benchmark results, about 75% time cost is reduced because of no casting date to timestamp back and forth. In this PR, we add a benchmark for these operations, and timestamp +/- interval operations as accessories. ### Why are the changes needed? Performance test coverage, since these operations are missing in the DateTimeBenchmark. ### Does this PR introduce any user-facing change? No, just test ### How was this patch tested? regenerated benchmark results Closes #28369 from yaooqinn/SPARK-31527-F. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-28 15:39:28 +00:00
Jian Tang	6a576161ae	[SPARK-31364][SQL][TESTS] Benchmark Parquet Nested Field Predicate Pushdown ### What changes were proposed in this pull request? This PR aims to add a benchmark suite for nested predicate pushdown with parquet file: Performance comparison: Nested predicate pushdown disabled vs enabled, with the following queries scenarios: 1. When predicate pushed down, parquet reader are able to filter out all the row groups without loading them. 2. When predicate pushed down, parquet reader only loads one of the row groups. 3. When predicate pushed down, parquet reader can't filter out any row group in order to see if we introduce too much overhead or not when enabling nested predicate push down. ### Why are the changes needed? No benchmark exists today for nested fields predicate pushdown performance evaluation. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Benchmark runs and reporting result. Closes #28319 from JiJiTang/SPARK-31364. Authored-by: Jian Tang <jian_tang@apple.com> Signed-off-by: DB Tsai <d_tsai@apple.com>	2020-04-24 22:10:58 +00:00
Kent Yao	37d2e037ed	[SPARK-31507][SQL] Remove uncommon fields support and update some fields with meaningful names for extract function ### What changes were proposed in this pull request? Extracting millennium, century, decade, millisecond, microsecond and epoch from datetime is neither ANSI standard nor quite common in modern SQL platforms. Most of the systems listing below does not support these except PostgreSQL and redshift. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions050.htm https://prestodb.io/docs/current/functions/datetime.html https://docs.cloudera.com/documentation/enterprise/5-8-x/topics/impala_datetime_functions.html https://docs.snowflake.com/en/sql-reference/functions-date-time.html#label-supported-date-time-parts https://www.postgresql.org/docs/9.1/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT This PR removes these extract fields support from extract function for date and timestamp values `isoyear` is PostgreSQL specific but `yearofweek` is more commonly used across platforms `isodow` is PostgreSQL specific but `iso` as a suffix is more commonly used across platforms so, `dow_iso` and `dayofweek_iso` is used to replace it. For historical reasons, we have [`dayofweek`, `dow`] implemented for representing a non-ISO day-of-week and a newly added `isodow` from PostgreSQL for ISO day-of-week. Many other systems only have one week-numbering system support and use either full names or abbreviations. Things in spark become a little bit complicated. 1. because of the existence of `isodow`, so we need to add iso-prefix to `dayofweek` to make a pair for it too. [`dayofweek`, `isodayofweek`, `dow` and `isodow`] 2. because there are rare `iso`-prefixed systems and more systems choose `iso`-suffixed way, so we may result in [`dayofweek`, `dayofweekiso`, `dow`, `dowiso`] 3. `dayofweekiso` looks nice and has use cases in the platforms listed above, e.g. snowflake, but `dowiso` looks weird and no use cases found. 4. with a discussion the community，we have agreed with an underscore before `iso` may look much better because `isodow` is new and there is no standard for `iso` kind of things, so this may be good for us to make it simple and clear for end-users if they are well documented too. Thus, we finally result in [`dayofweek`, `dow`] for Non-ISO day-of-week system and [`dayofweek_iso`, `dow_iso`] for ISO system ### Why are the changes needed? Remove some nonstandard and uncommon features as we can add them back if necessary ### Does this PR introduce any user-facing change? NO, we should target this to 3.0.0 and these are added during 3.0.0 ### How was this patch tested? Remove unused tests Closes #28284 from yaooqinn/SPARK-31507. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-22 10:24:49 +00:00
Max Gekk	f1fde0cc22	[SPARK-31490][SQL][TESTS] Benchmark conversions to/from Java 8 datetime types ### What changes were proposed in this pull request? - Add benchmark cases for parallelizing `java.time.LocalDate` and `java.time.Instant` column values. - Add benchmark cases for collecting `java.time.LocalDate` and `java.time.Instant` column values. ### Why are the changes needed? - To detect perf regression in the future - To compare parallelization/collection of Java 8 date-time types with Java 7 date-time types `java.sql.Date` & `java.sql.Timestamp`. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By running the modified benchmarks in the environment: \| Item \| Description \| \| ---- \| ----\| \| Region \| us-west-2 (Oregon) \| \| Instance \| r3.xlarge \| \| AMI \| ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) \| \| Java \| OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 \| Closes #28263 from MaxGekk/java8-datetime-collect-benchmark. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-20 07:26:38 +00:00
Kent Yao	77cb7cde0d	[SPARK-31469][SQL][TESTS][FOLLOWUP] Remove unsupported fields from ExtractBenchmark ### What changes were proposed in this pull request? In `697083c051`, we remove "MILLENNIUM", "CENTURY", "DECADE", "QUARTER", "MILLISECONDS", "MICROSECONDS", "EPOCH" field for date_part and extract expression, this PR fix the related Benchmark. ### Why are the changes needed? test fix. ### Does this PR introduce any user-facing change? no ### How was this patch tested? passing Jenkins Closes #28249 from yaooqinn/SPARK-31469-F. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-04-18 00:32:42 -07:00
Max Gekk	744c2480b5	[SPARK-31443][SQL] Fix perf regression of toJavaDate ### What changes were proposed in this pull request? Optimise the `toJavaDate()` method of `DateTimeUtils` by: 1. Re-using `rebaseGregorianToJulianDays` optimised by #28067 2. Creating `java.sql.Date` instances from milliseconds in UTC since the epoch instead of date-time fields. This allows to avoid "normalization" inside of `java.sql.Date`. Also new benchmark for collecting dates is added to `DateTimeBenchmark`. ### Why are the changes needed? The changes fix the performance regression of collecting `DATE` values comparing to Spark 2.4 (see `DateTimeBenchmark` in https://github.com/MaxGekk/spark/pull/27): Spark 2.4.6-SNAPSHOT: ``` To/from Java's date-time: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ From java.sql.Date 559 603 38 8.9 111.8 1.0X Collect dates 2306 3221 1558 2.2 461.1 0.2X ``` Before the changes: ``` To/from Java's date-time: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ From java.sql.Date 1052 1130 73 4.8 210.3 1.0X Collect dates 3251 4943 1624 1.5 650.2 0.3X ``` After: ``` To/from Java's date-time: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ From java.sql.Date 416 419 3 12.0 83.2 1.0X Collect dates 1928 2759 1180 2.6 385.6 0.2X ``` ### Does this PR introduce any user-facing change? No ### How was this patch tested? - By existing tests suites, in particular, `DateTimeUtilsSuite`, `RebaseDateTimeSuite`, `DateFunctionsSuite`, `DateExpressionsSuite`. - Re-run `DateTimeBenchmark` in the environment: \| Item \| Description \| \| ---- \| ----\| \| Region \| us-west-2 (Oregon) \| \| Instance \| r3.xlarge \| \| AMI \| ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) \| \| Java \| OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 \| Closes #28212 from MaxGekk/optimize-toJavaDate. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-15 06:19:12 +00:00
Max Gekk	2c5d489679	[SPARK-31439][SQL] Fix perf regression of fromJavaDate ### What changes were proposed in this pull request? In the PR, I propose to re-use optimized implementation of days rebase function `rebaseJulianToGregorianDays()` introduced by the PR #28067 in conversion of `java.sql.Date` values to Catalyst's `DATE` values. The function `fromJavaDate` in `DateTimeUtils` was re-written by taking the implementation from Spark 2.4, and by rebasing the final results via `rebaseJulianToGregorianDays()`. Also I updated `DateTimeBenchmark`, and added a benchmark for conversion from `java.sql.Date`. ### Why are the changes needed? The PR fixes the regression of parallelizing a collection of `java.sql.Date` values, and improves performance of converting external values to Catalyst's `DATE` values: - x4 on the master branch - 30% against Spark 2.4.6-SNAPSHOT Spark 2.4.6-SNAPSHOT: ``` To/from java.sql.Timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ From java.sql.Date 614 655 43 8.1 122.8 1.0X ``` Before the changes: ``` To/from java.sql.Timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ From java.sql.Date 1154 1206 46 4.3 230.9 1.0X ``` After: ``` To/from java.sql.Timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ From java.sql.Date 427 434 7 11.7 85.3 1.0X ``` ### Does this PR introduce any user-facing change? No ### How was this patch tested? - By existing tests suites, in particular, `DateTimeUtilsSuite`, `RebaseDateTimeSuite`, `DateFunctionsSuite`, `DateExpressionsSuite`. - Re-run `DateTimeBenchmark` in the environment: \| Item \| Description \| \| ---- \| ----\| \| Region \| us-west-2 (Oregon) \| \| Instance \| r3.xlarge \| \| AMI \| ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) \| \| Java \| OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 \| Closes #28205 from MaxGekk/optimize-fromJavaDate. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-14 14:44:00 +00:00
Max Gekk	a0f8cc08a3	[SPARK-31426][SQL] Fix perf regressions of toJavaTimestamp/fromJavaTimestamp ### What changes were proposed in this pull request? Reuse the `rebaseGregorianToJulianMicros()` and `rebaseJulianToGregorianMicros()` functions introduced by the PR #28119 in `DateTimeUtils`.`toJavaTimestamp()` and `fromJavaTimestamp()`. Actually, new implementation is derived from Spark 2.4 + rebasing via pre-calculated rebasing maps. ### Why are the changes needed? The changes speed up conversions to/from java.sql.Timestamp, and as a consequence the PR improve performance of ORC datasource in loading/saving timestamps: - Saving ~ x2.8 faster in master, and -11% against Spark 2.4.6 - Loading - x3.2-4.5 faster in master, -5% against Spark 2.4.6 Before: ``` Save timestamps to ORC: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ after 1582 59877 59877 0 1.7 598.8 0.0X before 1582 61361 61361 0 1.6 613.6 0.0X Load timestamps from ORC: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ after 1582, vec off 48197 48288 118 2.1 482.0 1.0X after 1582, vec on 38247 38351 128 2.6 382.5 1.3X before 1582, vec off 53179 53359 249 1.9 531.8 0.9X before 1582, vec on 44076 44268 269 2.3 440.8 1.1X ``` After: ``` Save timestamps to ORC: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ after 1582 21250 21250 0 4.7 212.5 0.1X before 1582 22105 22105 0 4.5 221.0 0.1X Load timestamps from ORC: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ after 1582, vec off 14903 14933 40 6.7 149.0 1.0X after 1582, vec on 8342 8426 73 12.0 83.4 1.8X before 1582, vec off 15528 15575 76 6.4 155.3 1.0X before 1582, vec on 9025 9075 61 11.1 90.2 1.7X ``` Spark 2.4.6-SNAPSHOT: ``` Save timestamps to ORC: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ after 1582 18858 18858 0 5.3 188.6 1.0X before 1582 18508 18508 0 5.4 185.1 1.0X Load timestamps from ORC: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ after 1582, vec off 14063 14177 143 7.1 140.6 1.0X after 1582, vec on 5955 6029 100 16.8 59.5 2.4X before 1582, vec off 14119 14126 7 7.1 141.2 1.0X before 1582, vec on 5991 6007 25 16.7 59.9 2.3X ``` ### Does this PR introduce any user-facing change? Yes, the `to_utc_timestamp` function returns the later local timestamp in the case of overlapping local timestamps at daylight saving time. it's changed back to the 2.4 behavior. ### How was this patch tested? - By existing test suite `DateTimeUtilsSuite`, `RebaseDateTimeSuite`, `DateFunctionsSuite`, `DateExpressionsSuites`, `ParquetIOSuite`, `OrcHadoopFsRelationSuite`. - Re-generating results of the benchmarks `DateTimeBenchmark` and `DateTimeRebaseBenchmark` in the environment: \| Item \| Description \| \| ---- \| ----\| \| Region \| us-west-2 (Oregon) \| \| Instance \| r3.xlarge \| \| AMI \| ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) \| \| Java \| OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 \| Closes #28189 from MaxGekk/optimize-to-from-java-timestamp. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-14 04:50:20 +00:00
Max Gekk	cac8d1b352	[SPARK-31398][SQL] Fix perf regression of loading dates before 1582 year by non-vectorized ORC reader ### What changes were proposed in this pull request? In regular ORC reader when `spark.sql.orc.enableVectorizedReader` is set to `false`, I propose to use `DaysWritable` in reading DATE values from ORC files. Currently, days from ORC files are converted to java.sql.Date, and then to days in Proleptic Gregorian calendar. So, the conversion to Java type can be eliminated. ### Why are the changes needed? - The PR fixes regressions in loading dates before the 1582 year from ORC files by when vectorised ORC reader is off. - The changes improve performance of regular ORC reader for DATE columns. - x3.6 faster comparing to the current master - x1.9-x4.3 faster against Spark 2.4.6 Before (on JDK 8): ``` Load dates from ORC: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ after 1582, vec off 39651 39686 31 2.5 396.5 1.0X after 1582, vec on 3647 3660 13 27.4 36.5 10.9X before 1582, vec off 38155 38219 61 2.6 381.6 1.0X before 1582, vec on 4041 4046 6 24.7 40.4 9.8X ``` After (on JDK 8): ``` Load dates from ORC: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ after 1582, vec off 10947 10971 28 9.1 109.5 1.0X after 1582, vec on 3677 3702 36 27.2 36.8 3.0X before 1582, vec off 11456 11472 21 8.7 114.6 1.0X before 1582, vec on 4079 4103 21 24.5 40.8 2.7X ``` Spark 2.4.6: ``` Load dates from ORC: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ after 1582, vec off 48169 48276 96 2.1 481.7 1.0X after 1582, vec on 5375 5410 41 18.6 53.7 9.0X before 1582, vec off 22353 22482 198 4.5 223.5 2.2X before 1582, vec on 5474 5475 1 18.3 54.7 8.8X ``` ### Does this PR introduce any user-facing change? No ### How was this patch tested? - By existing tests suites like `DateTimeUtilsSuite` - Checked for `hive-1.2` by: ``` ./build/sbt -Phive-1.2 "test:testOnly *OrcHadoopFsRelationSuite" ``` - Re-run `DateTimeRebaseBenchmark` in the environment: \| Item \| Description \| \| ---- \| ----\| \| Region \| us-west-2 (Oregon) \| \| Instance \| r3.xlarge \| \| AMI \| ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) \| \| Java \| OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 \| Closes #28169 from MaxGekk/orc-optimize-dates. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-13 05:29:54 +00:00
Kent Yao	d65f534c5a	[SPARK-31414][SQL] Fix performance regression with new TimestampFormatter for json and csv time parsing ### What changes were proposed in this pull request? With benchmark original, where the timestamp values are valid to the new parser the result is ```scala [info] Running benchmark: Read dates and timestamps [info] Running case: timestamp strings [info] Stopped after 3 iterations, 5781 ms [info] Running case: parse timestamps from Dataset[String] [info] Stopped after 3 iterations, 44764 ms [info] Running case: infer timestamps from Dataset[String] [info] Stopped after 3 iterations, 93764 ms [info] Running case: from_json(timestamp) [info] Stopped after 3 iterations, 59021 ms ``` When we modify the benchmark to ```scala def timestampStr: Dataset[String] = { spark.range(0, rowsNum, 1, 1).mapPartitions { iter => iter.map(i => s"""{"timestamp":"1970-01-01T01:02:03.${i % 100}"}""") }.select($"value".as("timestamp")).as[String] } readBench.addCase("timestamp strings", numIters) { _ => timestampStr.noop() } readBench.addCase("parse timestamps from Dataset[String]", numIters) { _ => spark.read.schema(tsSchema).json(timestampStr).noop() } readBench.addCase("infer timestamps from Dataset[String]", numIters) { _ => spark.read.json(timestampStr).noop() } ``` where the timestamp values are invalid for the new parser which causes a fallback to legacy parser(2.4). the result is ```scala [info] Running benchmark: Read dates and timestamps [info] Running case: timestamp strings [info] Stopped after 3 iterations, 5623 ms [info] Running case: parse timestamps from Dataset[String] [info] Stopped after 3 iterations, 506637 ms [info] Running case: infer timestamps from Dataset[String] [info] Stopped after 3 iterations, 509076 ms ``` About 10x perf-regression BUT if we modify the timestamp pattern to `....HH:mm:ss[.SSS][XXX]` which make all timestamp values valid for the new parser to prohibit fallback, the result is ```scala [info] Running benchmark: Read dates and timestamps [info] Running case: timestamp strings [info] Stopped after 3 iterations, 5623 ms [info] Running case: parse timestamps from Dataset[String] [info] Stopped after 3 iterations, 506637 ms [info] Running case: infer timestamps from Dataset[String] [info] Stopped after 3 iterations, 509076 ms ``` ### Why are the changes needed? Fix performance regression. ### Does this PR introduce any user-facing change? NO ### How was this patch tested? new tests added. Closes #28181 from yaooqinn/SPARK-31414. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-13 03:11:28 +00:00
Max Gekk	e2d9399602	[SPARK-31359][SQL] Speed up timestamps rebasing ### What changes were proposed in this pull request? In the PR, I propose to optimise the `DateTimeUtils`.`rebaseJulianToGregorianMicros()` and `rebaseGregorianToJulianMicros()` functions, and make them faster by using pre-calculated rebasing tables. This approach allows to avoid expensive conversions via local timestamps. For example, the `America/Los_Angeles` time zone has just a few time points when difference between Proleptic Gregorian calendar and the hybrid calendar (Julian + Gregorian since 1582-10-15) is changed in the time interval 0001-01-01 .. 2100-01-01: \| i \| local timestamp \| Proleptic Greg. seconds \| Hybrid (Julian+Greg) seconds \| difference in minutes\| \| -- \| ------- \|----\|----\| ---- \| \|0\|0001-01-01 00:00\|-62135568422\|-62135740800\|-2872\| \|1\|0100-03-01 00:00\|-59006333222\|-59006419200\|-1432\| \|...\|...\|...\|...\|...\| \|13\|1582-10-15 00:00\|-12219264422\|-12219264000\|7\| \|14\|1883-11-18 12:00\|-2717640000\|-2717640000\|0\| The difference in microseconds between Proleptic and hybrid calendars for any local timestamp in time intervals `[local timestamp(i), local timestamp(i+1))`, and for any microseconds in the time interval `[Gregorian micros(i), Gregorian micros(i+1))` is the same. In this way, we can rebase an input micros by following the steps: 1. Look at the table, and find the time interval where the micros falls to 2. Take the difference between 2 calendars for this time interval 3. Add the difference to the input micros. The result is rebased microseconds that has the same local timestamp representation. Here are details of the implementation: - Pre-calculated tables are stored to JSON files `gregorian-julian-rebase-micros.json` and `julian-gregorian-rebase-micros.json` in the resource folder of `sql/catalyst`. The diffs and switch time points are stored as seconds, for example: ```json [ { "tz" : "America/Los_Angeles", "switches" : [ -62135740800, -59006419200, ... , -2717640000 ], "diffs" : [ 172378, 85978, ..., 0 ] } ] ``` The JSON files are generated by 2 tests in `RebaseDateTimeSuite` - `generate 'gregorian-julian-rebase-micros.json'` and `generate 'julian-gregorian-rebase-micros.json'`. Both tests are disabled by default. The `switches` time points are ordered from old to recent timestamps. This condition is checked by the test `validate rebase records in JSON files` in `RebaseDateTimeSuite`. Also sizes of the `switches` and `diffs` arrays are the same (this is checked by the same test). - The _Asia/Tehran, Iran, Africa/Casablanca and Africa/El_Aaiun_ time zones weren't added to the JSON files, see [SPARK-31385](https://issues.apache.org/jira/browse/SPARK-31385) - The rebase info from the JSON files is placed to hash tables - `gregJulianRebaseMap` and `julianGregRebaseMap`. I use `AnyRefMap` because it is almost 2 times faster than Scala's immutable Map. Also I tried `java.util.HashMap` but it has worse lookup time than `AnyRefMap` in our case. The hash maps store the switch time points and diffs in microseconds precision to avoid conversions from microseconds to seconds in the runtime. - I moved the code related to days and microseconds rebasing to the separate object `RebaseDateTime` to do not pollute `DateTimeUtils`. Tests related to date-time rebasing are moved to `RebaseDateTimeSuite` for the same reason. - I placed rebasing via local timestamp to separate methods that require zone id as the first parameter assuming that the caller has zone id already. This allows to void unnecessary retrieving the default time zone. The methods are marked as `private[sql]` because they are used in `RebaseDateTimeSuite` as reference implementation. - Modified the `rebaseGregorianToJulianMicros()` and `rebaseJulianToGregorianMicros()` methods in `RebaseDateTime` to look up the rebase tables first of all. If hash maps don't contain rebasing info for the given time zone id, the methods falls back to the implementation via local timestamps. This allows to support time zones specified as zone offsets like '-08:00'. ### Why are the changes needed? To make timestamps rebasing faster: - Saving timestamps to parquet files is ~ x3.8 faster - Loading timestamps from parquet files is ~x2.8 faster. - Loading timestamps by Vectorized reader ~x4.6 faster. ### Does this PR introduce any user-facing change? No ### How was this patch tested? - Added the test `validate rebase records in JSON files` to `RebaseDateTimeSuite`. The test validates 2 json files from the resource folder - `gregorian-julian-rebase-micros.json` and `julian-gregorian-rebase-micros.json`, and it checks per each time zone records that - the number of switch points is equal to the number of diffs between calendars. If the numbers are different, this will violate the assumption made in `RebaseDateTime.rebaseMicros`. - swith points are ordered from old to recent timestamps. This pre-condition is required for linear search in the `rebaseMicros` function. - Added the test `optimization of micros rebasing - Gregorian to Julian` to `RebaseDateTimeSuite` which iterates over timestamps from 0001-01-01 to 2100-01-01 with the steps 1 ± 0.5 months, and checks that optimised function `RebaseDateTime`.`rebaseGregorianToJulianMicros()` returns the same result as non-optimised one. The check is performed for the UTC, PST, CET, Africa/Dakar, America/Los_Angeles, Antarctica/Vostok, Asia/Hong_Kong, Europe/Amsterdam time zones. - Added the test `optimization of micros rebasing - Julian to Gregorian` to `RebaseDateTimeSuite` which does similar checks as the test above but for rebasing from the hybrid calendar (Julian + Gregorian) to Proleptic Gregorian calendar. - The tests for days rebasing are moved from `DateTimeUtilsSuite` to `RebaseDateTimeSuite` because the rebasing related code is moved from `DateTimeUtils` to the separate object `RebaseDateTime`. - Re-run `DateTimeRebaseBenchmark` at the America/Los_Angeles time zone (it is set explicitly in the PR #28127): \| Item \| Description \| \| ---- \| ----\| \| Region \| us-west-2 (Oregon) \| \| Instance \| r3.xlarge \| \| AMI \| ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) \| \| Java \| OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 \| Closes #28119 from MaxGekk/optimize-rebase-micros. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-09 05:23:52 +00:00
Max Gekk	35e6a9deee	[SPARK-31353][SQL] Set a time zone in DateTimeBenchmark and DateTimeRebaseBenchmark ### What changes were proposed in this pull request? In the PR, I propose to set the `America/Los_Angeles` time zone in the date-time benchmarks `DateTimeBenchmark` and `DateTimeRebaseBenchmark` via `withDefaultTimeZone(LA)` and `withSQLConf(SQLConf.SESSION_LOCAL_TIMEZONE.key -> LA.getId)`. The results of affected benchmarks was given on an Amazon EC2 instance w/ the configuration: \| Item \| Description \| \| ---- \| ----\| \| Region \| us-west-2 (Oregon) \| \| Instance \| r3.xlarge \| \| AMI \| ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) \| \| Java \| OpenJDK8/11 \| ### Why are the changes needed? Performance of date-time functions can depend on the system JVM time zone or SQL config `spark.sql.session.timeZone`. The changes allow to avoid any fluctuations of benchmarks results related to time zones, and set a reliable baseline for future optimization. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By regenerating results of DateTimeBenchmark and DateTimeRebaseBenchmark. Closes #28127 from MaxGekk/set-timezone-in-benchmarks. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-06 05:21:04 +00:00
Maxim Gekk	820bb9985a	[SPARK-31328][SQL] Fix rebasing of overlapped local timestamps during daylight saving time ### What changes were proposed in this pull request? 1. Fix the `rebaseGregorianToJulianMicros()` function in `DateTimeUtils` by passing the daylight saving offset associated with the input `micros` to the constructed instance of `GregorianCalendar`. The problem is in `cal.getTimeInMillis` which returns earliest instant in the case of local date-time overlaps, see https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/master/jdk/src/share/classes/java/util/GregorianCalendar.java#L2783-L2786 . I fixed the issue by keeping the standard zone offset as is, and set the DST offset only. I don't set `ZONE_OFFSET` because time zone resolution works differently in Java 8 and Java 7 time APIs. So, if I would set the standard zone offsets too, this could change the behavior, and rebasing won't give the same result as Spark 2.4. 2. Fix `rebaseJulianToGregorianMicros()` by changing resulted zoned date-time if `DST_OFFSET` is zero which means the input date-time has passed an autumn daylight savings cutover. So, I take the latest local timestamp out of 2 overlapped timestamps. Otherwise I return a zoned date-time w/o any modification because it is equal to calling the `withEarlierOffsetAtOverlap()` method, so, we can optimize the case. ### Why are the changes needed? This fixes the bug of loosing of DST offset info in rebasing timestamps via local date-time. For example, there are 2 different timestamps in the `America/Los_Angeles` time zone: `2019-11-03T01:00:00-07:00` and `2019-11-03T01:00:00-08:00`, though they are mapped to the same local date-time `2019-11-03T01:00`, see <img width="456" alt="Screen Shot 2020-04-02 at 10 19 24" src="https://user-images.githubusercontent.com/1580697/78245697-95a7da00-74f0-11ea-9eba-c08138851cb3.png"> Currently, the UTC timestamp `2019-11-03T09:00:00Z` is converted to `2019-11-03T01:00:00-08:00`, and then to `2019-11-03T01:00:00` (in the original calendar, for instance Proleptic Gregorian calendar) and back to the UTC timestamp `2019-11-03T08:00:00Z` (in the hybrid calendar - Gregorian for the timestamp). That's wrong because the local timestamp must be converted to the original timestamp `2019-11-03T09:00:00Z`. ### Does this PR introduce any user-facing change? Yes ### How was this patch tested? - Added a test to `DateTimeUtilsSuite` which checks that rebased micros are the same as the input during DST. The result must be the same if Java 8 and 7 time API functions return the same time zone offsets. - Run the following code to check that there is no difference between rebased and original micros for modern timestamps: ```scala test("rebasing differences") { withDefaultTimeZone(getZoneId("America/Los_Angeles")) { val start = instantToMicros(LocalDateTime.of(1, 1, 1, 0, 0, 0) .atZone(getZoneId("America/Los_Angeles")) .toInstant) val end = instantToMicros(LocalDateTime.of(2030, 1, 1, 0, 0, 0) .atZone(getZoneId("America/Los_Angeles")) .toInstant) var micros = start var diff = Long.MaxValue var counter = 0 while (micros < end) { val rebased = rebaseGregorianToJulianMicros(micros) val curDiff = rebased - micros if (curDiff != diff) { counter += 1 diff = curDiff val ldt = microsToInstant(micros).atZone(getZoneId("America/Los_Angeles")).toLocalDateTime println(s"local date-time = $ldt diff = ${diff / MICROS_PER_MINUTE} minutes") } micros += 30 * MICROS_PER_MINUTE } println(s"counter = $counter") } } ``` ``` local date-time = 0001-01-01T00:00 diff = -2872 minutes local date-time = 0100-03-01T00:00 diff = -1432 minutes local date-time = 0200-03-01T00:00 diff = 7 minutes local date-time = 0300-03-01T00:00 diff = 1447 minutes local date-time = 0500-03-01T00:00 diff = 2887 minutes local date-time = 0600-03-01T00:00 diff = 4327 minutes local date-time = 0700-03-01T00:00 diff = 5767 minutes local date-time = 0900-03-01T00:00 diff = 7207 minutes local date-time = 1000-03-01T00:00 diff = 8647 minutes local date-time = 1100-03-01T00:00 diff = 10087 minutes local date-time = 1300-03-01T00:00 diff = 11527 minutes local date-time = 1400-03-01T00:00 diff = 12967 minutes local date-time = 1500-03-01T00:00 diff = 14407 minutes local date-time = 1582-10-15T00:00 diff = 7 minutes local date-time = 1883-11-18T12:22:58 diff = 0 minutes counter = 15 ``` The code is not added to `DateTimeUtilsSuite` because it takes > 30 seconds. - By running the updated benchmark `DateTimeRebaseBenchmark` via the command: ``` SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.DateTimeRebaseBenchmark" ``` in the environment: \| Item \| Description \| \| ---- \| ----\| \| Region \| us-west-2 (Oregon) \| \| Instance \| r3.xlarge \| \| AMI \| ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) \| \| Java \| OpenJDK 1.8.0_242-8u242/11.0.6+10 \| Closes #28101 from MaxGekk/fix-local-date-overlap. Lead-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-03 04:35:31 +00:00
Max Gekk	91af87d34e	[SPARK-31311][SQL][TESTS] Benchmark date-time rebasing in ORC datasource ### What changes were proposed in this pull request? In the PR, I propose to add new benchmarks to `DateTimeRebaseBenchmark` for saving and loading dates/timestamps to/from ORC files. I extracted common code from the benchmark for Parquet datasource and place it to the methods `caseName()` and `getPath()`. Added benchmarks for ORC save/load dates before and after 1582-10-15 because an implementation may have different performance for dates before the Julian calendar cutover day, see #28067 as an example. ### Why are the changes needed? To have the base line for future optimizations of `fromJavaDate()`/`toJavaDate()` and `toJavaTimestamp()`/`fromJavaTimestamp()` in `DateTimeUtils`. The methods are used while saving/loading dates/timestamps by ORC datasource. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By running the updated benchmark `DateTimeRebaseBenchmark` via the command: ``` SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.DateTimeRebaseBenchmark" ``` in the environment: \| Item \| Description \| \| ---- \| ----\| \| Region \| us-west-2 (Oregon) \| \| Instance \| r3.xlarge \| \| AMI \| ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) \| \| Java \| OpenJDK 1.8.0_242-8u242/11.0.6+10 \| Closes #28076 from MaxGekk/rebase-benchmark-orc. Lead-authored-by: Max Gekk <max.gekk@gmail.com> Co-authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-01 07:02:26 +00:00
Maxim Gekk	bb0b416f0b	[SPARK-31297][SQL] Speed up dates rebasing ### What changes were proposed in this pull request? In the PR, I propose to replace current implementation of the `rebaseGregorianToJulianDays()` and `rebaseJulianToGregorianDays()` functions in `DateTimeUtils` by new one which is based on the fact that difference between Proleptic Gregorian and the hybrid (Julian+Gregorian) calendars was changed only 14 times for entire supported range of valid dates `[0001-01-01, 9999-12-31]`: \| date \| Proleptic Greg. days \| Hybrid (Julian+Greg) days \| diff\| \| ---- \| ----\|----\|----\| \|0001-01-01\|-719162\|-719164\|-2\| \|0100-03-01\|-682944\|-682945\|-1\| \|0200-03-01\|-646420\|-646420\|0\| \|0300-03-01\|-609896\|-609895\|1\| \|0500-03-01\|-536847\|-536845\|2\| \|0600-03-01\|-500323\|-500320\|3\| \|0700-03-01\|-463799\|-463795\|4\| \|0900-03-01\|-390750\|-390745\|5\| \|1000-03-01\|-354226\|-354220\|6\| \|1100-03-01\|-317702\|-317695\|7\| \|1300-03-01\|-244653\|-244645\|8\| \|1400-03-01\|-208129\|-208120\|9\| \|1500-03-01\|-171605\|-171595\|10\| \|1582-10-15\|-141427\|-141427\|0\| For the given days since the epoch, the proposed implementation finds the range of days which the input days belongs to, and adds the diff in days between calendars to the input. The result is rebased days since the epoch in the target calendar. For example, if need to rebase -650000 days from Proleptic Gregorian calendar to the hybrid calendar. In that case, the input falls to the bucket [-682944, -646420), the diff associated with the range is -1. To get the rebased days in Julian calendar, we should add -1 to -650000, and the result is -650001. ### Why are the changes needed? To make dates rebasing faster. ### Does this PR introduce any user-facing change? No, the results should be the same for valid range of the `DATE` type `[0001-01-01, 9999-12-31]`. ### How was this patch tested? - Added 2 tests to `DateTimeUtilsSuite` for the `rebaseGregorianToJulianDays()` and `rebaseJulianToGregorianDays()` functions. The tests check that results of old and new implementation (optimized version) are the same for all supported dates. - Re-run `DateTimeRebaseBenchmark` on: \| Item \| Description \| \| ---- \| ----\| \| Region \| us-west-2 (Oregon) \| \| Instance \| r3.xlarge \| \| AMI \| ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) \| \| Java \| OpenJDK8/11 \| Closes #28067 from MaxGekk/optimize-rebasing. Lead-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-31 17:38:47 +08:00
Maxim Gekk	a1dbcd13a3	[SPARK-31296][SQL][TESTS] Benchmark date-time rebasing in Parquet datasource ### What changes were proposed in this pull request? In the PR, I propose to add new benchmark `DateTimeRebaseBenchmark` which should measure the performance of rebasing of dates/timestamps from/to to the hybrid calendar (Julian+Gregorian) to/from Proleptic Gregorian calendar: 1. In write, it saves separately dates and timestamps before and after 1582 year w/ and w/o rebasing. 2. In read, it loads previously saved parquet files by vectorized reader and by regular reader. Here is the summary of benchmarking: - Saving timestamps is ~6 times slower - Loading timestamps w/ vectorized off is ~4 times slower - Loading timestamps w/ vectorized on is ~10 times slower ### Why are the changes needed? To know the impact of date-time rebasing introduced by #27915, #27953, #27807. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Run the `DateTimeRebaseBenchmark` benchmark using Amazon EC2: \| Item \| Description \| \| ---- \| ----\| \| Region \| us-west-2 (Oregon) \| \| Instance \| r3.xlarge \| \| AMI \| ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) \| \| Java \| OpenJDK8/11 \| Closes #28057 from MaxGekk/rebase-bechmark. Lead-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-30 16:46:31 +08:00
Kent Yao	f1d27cdd91	[SPARK-31119][SQL] Add interval value support for extract expression as extract source ### What changes were proposed in this pull request? ``` <extract expression> ::= EXTRACT <left paren> <extract field> FROM <extract source> <right paren> <extract source> ::= <datetime value expression> \| <interval value expression> ``` We now only support datetime values as extract source for `extract` expression but it's alternative function `date_part` supports both datetime and interval. This pr adds interval value support for `extract` expression as extract source ### Why are the changes needed? For ANSI compliance and the semantic consistency between extract and `date_part`, we support intervals for extract expressions. ### Does this PR introduce any user-facing change? yes, in the `extract(abc from xyz)` expression, the `xyz` can be intervals ### How was this patch tested? add unit tests Closes #27876 from yaooqinn/SPARK-31119. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-18 12:29:39 +08:00
Kent Yao	0946a9514f	[SPARK-31150][SQL] Parsing seconds fraction with variable length for timestamp ### What changes were proposed in this pull request? This PR is to support parsing timestamp values with variable length second fraction parts. e.g. 'yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]' can parse timestamp with 0~6 digit-length second fraction but fail >=7 ```sql select to_timestamp(v, 'yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]') from values ('2019-10-06 10:11:12.'), ('2019-10-06 10:11:12.0'), ('2019-10-06 10:11:12.1'), ('2019-10-06 10:11:12.12'), ('2019-10-06 10:11:12.123UTC'), ('2019-10-06 10:11:12.1234'), ('2019-10-06 10:11:12.12345CST'), ('2019-10-06 10:11:12.123456PST') t(v) 2019-10-06 03:11:12.123 2019-10-06 08:11:12.12345 2019-10-06 10:11:12 2019-10-06 10:11:12 2019-10-06 10:11:12.1 2019-10-06 10:11:12.12 2019-10-06 10:11:12.1234 2019-10-06 10:11:12.123456 select to_timestamp('2019-10-06 10:11:12.1234567PST', 'yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]') NULL ``` Since 3.0, we use java 8 time API to parse and format timestamp values. when we create the `DateTimeFormatter`, we use `appendPattern` to create the build first, where the 'S..S' part will be parsed to a fixed-length(= `'S..S'.length`). This fits the formatting part but too strict for the parsing part because the trailing zeros are very likely to be truncated. ### Why are the changes needed? improve timestamp parsing and more compatible with 2.4.x ### Does this PR introduce any user-facing change? no, the related changes are newly added ### How was this patch tested? add uts Closes #27906 from yaooqinn/SPARK-31150. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-17 21:53:46 +08:00
Kent Yao	fbc9dc7e9d	[SPARK-31129][SQL][TESTS] Fix IntervalBenchmark and DateTimeBenchmark ### What changes were proposed in this pull request? This PR aims to recover `IntervalBenchmark` and `DataTimeBenchmark` due to banning intervals as output. ### Why are the changes needed? This PR recovers the benchmark suite. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manually, re-run the benchmark. Closes #27885 from yaooqinn/SPARK-31111-2. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-03-12 12:59:29 -07:00
Kent Yao	2b46662bd0	[SPARK-31111][SQL][TESTS] Fix interval output issue in ExtractBenchmark ### What changes were proposed in this pull request? fix the error caused by interval output in ExtractBenchmark ### Why are the changes needed? fix a bug in the test ```scala [info] Running case: cast to interval [error] Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot use interval type in the table schema.;; [error] OverwriteByExpression RelationV2[] noop-table, true, true [error] +- Project [(subtractdates(cast(cast(id#0L as timestamp) as date), -719162) + subtracttimestamps(cast(id#0L as timestamp), -30610249419876544)) AS ((CAST(CAST(id AS TIMESTAMP) AS DATE) - DATE '0001-01-01') + (CAST(id AS TIMESTAMP) - TIMESTAMP '1000-01-01 01:02:03.123456'))#2] [error] +- Range (1262304000, 1272304000, step=1, splits=Some(1)) [error] [error] at org.apache.spark.sql.catalyst.util.TypeUtils$.failWithIntervalType(TypeUtils.scala:106) [error] at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$25(CheckAnalysis.scala:389) [error] at org.a ``` ### Does this PR introduce any user-facing change? no ### How was this patch tested? re-run benchmark Closes #27867 from yaooqinn/SPARK-31111. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-11 20:13:59 +08:00

1 2 3

127 commits