ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
fhygh	3a3f8ca6f4	[SPARK-35359][SQL] Insert data with char/varchar datatype will fail when data length exceed length limitation ### What changes were proposed in this pull request? This PR is used to fix this bug: ``` set spark.sql.legacy.charVarcharAsString=true; create table chartb01(a char(3)); insert into chartb01 select 'aaaaa'; ``` here we expect the data of table chartb01 is 'aaa', but it runs failed. ### Why are the changes needed? Improve backward compatibility ``` spark-sql> > create table tchar01(col char(2)) using parquet; Time taken: 0.767 seconds spark-sql> > insert into tchar01 select 'aaa'; ERROR \| Executor task launch worker for task 0.0 in stage 0.0 (TID 0) \| Aborting task \| org.apache.spark.util.Utils.logError(Logging.scala:94) java.lang.RuntimeException: Exceeds char/varchar type length limitation: 2 at org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils.trimTrailingSpaces(CharVarcharCodegenUtils.java:31) at org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils.charTypeWriteSideCheck(CharVarcharCodegenUtils.java:44) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.project_doConsume_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:279) at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1500) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:288) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:212) at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:131) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1466) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) ``` ### Does this PR introduce _any_ user-facing change? No (the legacy config is false by default). ### How was this patch tested? Added unit tests. Closes #32501 from fhygh/master. Authored-by: fhygh <283452027@qq.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-05-18 00:13:40 +08:00
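The stack trace above passes through `charTypeWriteSideCheck` and `trimTrailingSpaces`. As a rough illustration of what a write-side length check of this kind does, here is a minimal sketch. It is not Spark's implementation: it works on `java.lang.String` instead of `UTF8String`, the class name `CharWriteCheckSketch` is hypothetical, and the pad/trim semantics are an assumption inferred from the method names and the error message.

```java
// Hypothetical, simplified stand-in for a char(n) write-side check.
// Assumed semantics: pad short values, trim trailing spaces from long ones,
// and fail when non-space characters would have to be dropped.
final class CharWriteCheckSketch {

    static String charTypeWriteSideCheck(String input, int limit) {
        int numChars = input.codePointCount(0, input.length());
        if (numChars == limit) {
            return input;
        }
        if (numChars < limit) {
            // Right-pad with spaces up to the declared length.
            StringBuilder padded = new StringBuilder(input);
            for (int i = numChars; i < limit; i++) {
                padded.append(' ');
            }
            return padded.toString();
        }
        // Too long: only trailing spaces may be removed.
        int end = input.length();
        int remaining = numChars;
        while (remaining > limit && input.charAt(end - 1) == ' ') {
            end--;
            remaining--;
        }
        if (remaining > limit) {
            throw new RuntimeException(
                "Exceeds char/varchar type length limitation: " + limit);
        }
        return input.substring(0, end);
    }

    public static void main(String[] args) {
        System.out.println("[" + charTypeWriteSideCheck("a", 3) + "]");
        System.out.println("[" + charTypeWriteSideCheck("aaa  ", 3) + "]");
        try {
            charTypeWriteSideCheck("aaaaa", 3);
        } catch (RuntimeException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Under these assumptions, a value such as `'aaaaa'` written into `char(3)` raises the error seen in the log, which is the path the legacy flag is meant to bypass.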
Wenchen Fan	3b63f32601	[SPARK-35400][SQL] Simplify getOuterReferences and improve error message for correlated subquery ### What changes were proposed in this pull request? Spark doesn't support aggregate functions with mixed outer and local references. This PR applies this check earlier to fail with a clear error message instead of some weird ones, and simplifies the related code in `SubExprUtils.getOuterReferences`. This PR also refines the error message a bit. ### Why are the changes needed? better error message ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? updated tests Closes #32503 from cloud-fan/try. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-05-17 14:13:44 +00:00
Jungtaek Lim	7c13636be3	[SPARK-34888][SS] Introduce UpdatingSessionIterator adjusting session window on elements Introduction: this PR is a part of SPARK-10816 (`EventTime based sessionization (session window)`). Please refer to #31937 to see the overall view of the code change. (Note that the code diff may have diverged a bit.) ### What changes were proposed in this pull request? This PR introduces UpdatingSessionsIterator, which analyzes neighboring elements and adjusts the session information on elements. UpdatingSessionsIterator calculates and updates the session window for each element in the given iterator, so that elements in the same session window have the same session spec. Downstream can apply aggregation to finally merge the elements bound to the same session window. UpdatingSessionsIterator works on the precondition that the given iterator is sorted by "group keys + start time of session window", and the output iterator retains that sort order. UpdatingSessionsIterator copies the elements so that each one can be updated safely, and buffers elements that are bound to the same session window. Due to these overheads, MergingSessionsIterator, which will be introduced via SPARK-34889, should be used whenever possible. This PR also introduces UpdatingSessionsExec, the physical node that leverages UpdatingSessionsIterator to sort the input rows and update their session information. ### Why are the changes needed? This is one of the parts required to implement SPARK-10816. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New test suite added. Closes #31986 from HeartSaVioR/SPARK-34888-SPARK-10816-PR-31570-part-1. Lead-authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com> Co-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>	2021-05-17 21:05:49 +09:00
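The buffering-and-updating idea described in this commit can be sketched outside Spark. The code below is an illustration only, not the actual `UpdatingSessionsIterator`: the `Row` class, the method names, and the choice to treat a row whose start equals the current session end as part of the same session are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: widen each row's window to the session it belongs to.
final class SessionUpdateSketch {

    // A group key plus the row's own [start, end) session window.
    static final class Row {
        final String key;
        long start;
        long end;

        Row(String key, long start, long end) {
            this.key = key;
            this.start = start;
            this.end = end;
        }
    }

    // Precondition: rows are sorted by (key, start). Output keeps that order.
    static List<Row> updateSessions(List<Row> sorted) {
        List<Row> out = new ArrayList<>();
        List<Row> buffer = new ArrayList<>();
        long sessionEnd = Long.MIN_VALUE;
        for (Row r : sorted) {
            // Assumption: a row joins the session if it has the same key and
            // starts no later than the current session end.
            boolean sameSession = !buffer.isEmpty()
                && buffer.get(0).key.equals(r.key)
                && r.start <= sessionEnd;
            if (sameSession) {
                sessionEnd = Math.max(sessionEnd, r.end);
            } else {
                flush(buffer, sessionEnd, out);
                sessionEnd = r.end;
            }
            // Copy the row so the caller's objects are never mutated.
            buffer.add(new Row(r.key, r.start, r.end));
        }
        flush(buffer, sessionEnd, out);
        return out;
    }

    // Stamps every buffered row with the final session bounds and emits it.
    private static void flush(List<Row> buffer, long sessionEnd, List<Row> out) {
        if (buffer.isEmpty()) {
            return;
        }
        long sessionStart = buffer.get(0).start;
        for (Row b : buffer) {
            b.start = sessionStart;
            b.end = sessionEnd;
            out.add(b);
        }
        buffer.clear();
    }

    public static void main(String[] args) {
        List<Row> in = new ArrayList<>();
        in.add(new Row("a", 0, 10));
        in.add(new Row("a", 5, 15));
        in.add(new Row("a", 30, 40));
        in.add(new Row("b", 2, 12));
        for (Row r : updateSessions(in)) {
            System.out.println(r.key + " [" + r.start + ", " + r.end + ")");
        }
    }
}
```

The copy-and-buffer steps in the sketch correspond to the overheads the commit message mentions as the reason to prefer a merging iterator when possible.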
Yuming Wang	fb9316388a	[SPARK-32792][SQL][FOLLOWUP] Fix conflict with SPARK-34661 ### What changes were proposed in this pull request? This fixes the compilation error due to the logical conflicts between https://github.com/apache/spark/pull/31776 and https://github.com/apache/spark/pull/29642 . ### Why are the changes needed? To recover compilation. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Closes #32568 from wangyum/HOT-FIX. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-05-16 22:12:52 -07:00
Yuming Wang	d2d1f0b580	[SPARK-32792][SQL] Improve Parquet In filter pushdown ### What changes were proposed in this pull request? Support push down `GreaterThanOrEqual` minimum value and `LessThanOrEqual` maximum value for Parquet when [sources.In](`a744fea3be/sql/catalyst/src/main/scala/org/apache/spark/sql/sources/filters.scala (L162-L181)`)'s values exceeds `spark.sql.optimizer.inSetRewriteMinMaxThreshold`. For example: ```sql SELECT * FROM t WHERE id IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 15) ``` We will push down `id >= 1 and id <= 15`. Impala also has this improvement: https://issues.apache.org/jira/browse/IMPALA-3654 ### Why are the changes needed? Improve query performance. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test, [manual test](https://github.com/apache/spark/pull/29642#issuecomment-743109098) and benchmark test. Before this PR: ``` ================================================================================================ Pushdown benchmark for InSet -> InFilters ================================================================================================ Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz InSet -> InFilters (values count: 10, distribution: 10): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative --------------------------------------------------------------------------------------------------------------------------------------- Parquet Vectorized 5995 6026 53 2.6 381.2 1.0X Parquet Vectorized (Pushdown) 423 440 11 37.2 26.9 14.2X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz InSet -> InFilters (values count: 10, distribution: 50): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative --------------------------------------------------------------------------------------------------------------------------------------- 
Parquet Vectorized 5767 5887 154 2.7 366.7 1.0X Parquet Vectorized (Pushdown) 419 428 6 37.6 26.6 13.8X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz InSet -> InFilters (values count: 10, distribution: 90): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative --------------------------------------------------------------------------------------------------------------------------------------- Parquet Vectorized 5764 5857 96 2.7 366.4 1.0X Parquet Vectorized (Pushdown) 408 419 9 38.6 25.9 14.1X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz InSet -> InFilters (values count: 100, distribution: 10): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ---------------------------------------------------------------------------------------------------------------------------------------- Parquet Vectorized 5895 5949 41 2.7 374.8 1.0X Parquet Vectorized (Pushdown) 5908 5986 114 2.7 375.6 1.0X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz InSet -> InFilters (values count: 100, distribution: 50): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ---------------------------------------------------------------------------------------------------------------------------------------- Parquet Vectorized 5893 5988 106 2.7 374.7 1.0X Parquet Vectorized (Pushdown) 5875 5939 57 2.7 373.5 1.0X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz InSet -> InFilters (values count: 100, distribution: 90): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ---------------------------------------------------------------------------------------------------------------------------------------- Parquet Vectorized 5891 5954 42 2.7 374.5 1.0X Parquet Vectorized (Pushdown) 5901 5976 99 2.7 
375.2 1.0X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz InSet -> InFilters (values count: 2000, distribution: 10): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ----------------------------------------------------------------------------------------------------------------------------------------- Parquet Vectorized 6128 6158 40 2.6 389.6 1.0X Parquet Vectorized (Pushdown) 6145 6190 37 2.6 390.7 1.0X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz InSet -> InFilters (values count: 2000, distribution: 50): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ----------------------------------------------------------------------------------------------------------------------------------------- Parquet Vectorized 6142 6217 64 2.6 390.5 1.0X Parquet Vectorized (Pushdown) 6149 6235 90 2.6 391.0 1.0X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz InSet -> InFilters (values count: 2000, distribution: 90): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ----------------------------------------------------------------------------------------------------------------------------------------- Parquet Vectorized 6148 6218 64 2.6 390.9 1.0X Parquet Vectorized (Pushdown) 6145 6177 30 2.6 390.7 1.0X ``` After this PR: ``` ================================================================================================ Pushdown benchmark for InSet -> InFilters ================================================================================================ Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz InSet -> InFilters (values count: 10, distribution: 10): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative 
--------------------------------------------------------------------------------------------------------------------------------------- Parquet Vectorized 5745 5768 28 2.7 365.2 1.0X Parquet Vectorized (Pushdown) 401 412 12 39.2 25.5 14.3X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz InSet -> InFilters (values count: 10, distribution: 50): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative --------------------------------------------------------------------------------------------------------------------------------------- Parquet Vectorized 5796 5861 61 2.7 368.5 1.0X Parquet Vectorized (Pushdown) 417 482 37 37.7 26.5 13.9X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz InSet -> InFilters (values count: 10, distribution: 90): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative --------------------------------------------------------------------------------------------------------------------------------------- Parquet Vectorized 5754 5777 20 2.7 365.8 1.0X Parquet Vectorized (Pushdown) 408 418 9 38.6 25.9 14.1X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz InSet -> InFilters (values count: 100, distribution: 10): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ---------------------------------------------------------------------------------------------------------------------------------------- Parquet Vectorized 5878 5915 40 2.7 373.7 1.0X Parquet Vectorized (Pushdown) 929 940 10 16.9 59.1 6.3X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz InSet -> InFilters (values count: 100, distribution: 50): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative 
---------------------------------------------------------------------------------------------------------------------------------------- Parquet Vectorized 5886 5917 29 2.7 374.2 1.0X Parquet Vectorized (Pushdown) 3091 3114 20 5.1 196.5 1.9X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz InSet -> InFilters (values count: 100, distribution: 90): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ---------------------------------------------------------------------------------------------------------------------------------------- Parquet Vectorized 5913 5948 48 2.7 375.9 1.0X Parquet Vectorized (Pushdown) 5330 5427 98 3.0 338.9 1.1X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz InSet -> InFilters (values count: 2000, distribution: 10): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ----------------------------------------------------------------------------------------------------------------------------------------- Parquet Vectorized 6147 6228 72 2.6 390.8 1.0X Parquet Vectorized (Pushdown) 1023 1029 4 15.4 65.1 6.0X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz InSet -> InFilters (values count: 2000, distribution: 50): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ----------------------------------------------------------------------------------------------------------------------------------------- Parquet Vectorized 6164 6224 47 2.6 391.9 1.0X Parquet Vectorized (Pushdown) 3332 3360 45 4.7 211.9 1.8X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz InSet -> InFilters (values count: 2000, distribution: 90): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative 
----------------------------------------------------------------------------------------------------------------------------------------- Parquet Vectorized 6154 6192 38 2.6 391.3 1.0X Parquet Vectorized (Pushdown) 5588 5679 92 2.8 355.3 1.1X ``` Closes #29642 from wangyum/SPARK-32792. Lead-authored-by: Yuming Wang <yumwang@ebay.com> Co-authored-by: Yuming Wang <yumwang@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-05-16 21:20:52 -07:00
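The rewrite described in this commit (replace a long `IN` list with a min/max range for pushdown) can be sketched in isolation. This is a hedged illustration, not the Parquet filter code: `InPushdownSketch`, `pushdownPredicate`, and the threshold value used in `main` are hypothetical, and the predicate is rendered as text purely for readability.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

// Hypothetical sketch of turning a long IN list into a range predicate.
final class InPushdownSketch {

    // Returns the predicate a reader could check against min/max statistics.
    static String pushdownPredicate(String column, long[] values, int threshold) {
        if (values.length == 0) {
            throw new IllegalArgumentException("IN list must not be empty");
        }
        if (values.length <= threshold) {
            // Short lists are kept exact.
            String joined = Arrays.stream(values)
                .mapToObj(Long::toString)
                .collect(Collectors.joining(", "));
            return column + " IN (" + joined + ")";
        }
        // Long lists collapse to the enclosing range.
        long min = Arrays.stream(values).min().getAsLong();
        long max = Arrays.stream(values).max().getAsLong();
        return column + " >= " + min + " AND " + column + " <= " + max;
    }

    public static void main(String[] args) {
        long[] ids = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 15};
        // The threshold of 10 is an example value, not a documented default.
        System.out.println(pushdownPredicate("id", ids, 10));
    }
}
```

The range admits values that are not in the original list (13 and 14 in this example), so it can only be used to skip data; the exact `IN` condition still has to be evaluated on the rows that survive.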
Takeshi Yamamuro	2eef2f9035	[SPARK-35412][SQL] Fix a bug in groupBy of year-month/day-time intervals ### What changes were proposed in this pull request? To fix a bug below in groupBy of year-month/day-time intervals, this PR proposes to make `HashMapGenerator` handle the two types for hash-aggregates; ``` scala> Seq(java.time.Duration.ofDays(1)).toDF("a").groupBy("a").count().show() scala.MatchError: DayTimeIntervalType (of class org.apache.spark.sql.types.DayTimeIntervalType$) at org.apache.spark.sql.execution.aggregate.HashMapGenerator.genComputeHash(HashMapGenerator.scala:159) at org.apache.spark.sql.execution.aggregate.HashMapGenerator.$anonfun$generateHashFunction$1(HashMapGenerator.scala:102) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at scala.collection.immutable.List.foreach(List.scala:392) at scala.collection.TraversableLike.map(TraversableLike.scala:238) at scala.collection.TraversableLike.map$(TraversableLike.scala:231) at scala.collection.immutable.List.map(List.scala:298) at org.apache.spark.sql.execution.aggregate.HashMapGenerator.genHashForKeys$1(HashMapGenerator.scala:99) at org.apache.spark.sql.execution.aggregate.HashMapGenerator.generateHashFunction(HashMapGenerator.scala:111) ``` ### Why are the changes needed? Bugfix. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added a unit test. Closes #32560 from maropu/FixIntervalIssue. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-05-16 10:51:32 -07:00
Cheng Su	5c1567ba97	[SPARK-35363][SQL][FOLLOWUP] Use fresh name for findNextJoinRows instead of hardcoding it ### What changes were proposed in this pull request? This is a followup from the discussion in https://github.com/apache/spark/pull/32495#discussion_r632283178 . The hardcoded function name `findNextJoinRows` is not a real problem now, as we always do code generation for SMJ's children separately. This change makes it future-proof in case that assumption changes. ### Why are the changes needed? Fix the potential reliability issue. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing unit tests. Closes #32548 from c21/smj-followup. Authored-by: Cheng Su <chengsu@fb.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-05-16 10:49:31 -07:00
yangjie01	7ca0a0910f	[SPARK-34661][SQL] Clean up `OriginalType` and `DecimalMetadata` usage in Parquet related code ### What changes were proposed in this pull request? `OriginalType` and `DecimalMetadata` have been marked as `Deprecated` in new Parquet code. Apache Parquet suggests replacing `OriginalType` with `LogicalTypeAnnotation` and `DecimalMetadata` with `DecimalLogicalTypeAnnotation`, so the main change of this PR is to clean up these deprecated usages in Parquet-related code. ### Why are the changes needed? Clean up deprecated API usage. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the Jenkins or GitHub Actions checks. Closes #31776 from LuciferYang/cleanup-parquet-dep-api. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2021-05-16 09:03:26 -05:00
Yuming Wang	520a355516	[SPARK-35286][SQL] Replace SessionState.start with SessionState.setCurrentSessionState ### What changes were proposed in this pull request? This PR replaces `SessionState.start` with `shim.setCurrentSessionState/SessionState.setCurrentSessionState`. ### Why are the changes needed? To avoid [SessionState.createSessionDirs](https://github.com/apache/hive/blob/rel/release-2.3.8/ql/src/java/org/apache/hadoop/hive/ql/session/SessionState.java#L652-L696) creating too many directories that Spark SQL does not need: ![image](https://user-images.githubusercontent.com/5399861/116766834-28ea7080-aa5f-11eb-85ff-07bcaee444e5.png) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing test. Closes #32410 from wangyum/setCurrentSessionState. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Yuming Wang <yumwang@ebay.com>	2021-05-16 18:39:15 +08:00
QuangHuyViettel	9789ee84e4	[SPARK-32484][SQL] Fix log info BroadcastExchangeExec.scala ### What changes were proposed in this pull request? Fix the log info in BroadcastExchangeExec.scala. ### Why are the changes needed? The log message s"Cannot broadcast the table that is larger than 8GB: ${dataSize >> 30} GB" is misleading, because the 8GB figure is not accurate. ### Does this PR introduce _any_ user-facing change? yes ### How was this patch tested? no Closes #32544 from LittleCuteBug/SPARK-32484. Authored-by: QuangHuyViettel <quanghuynguyen236@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2021-05-15 13:08:42 -05:00
Chao Sun	a8032e7efa	[SPARK-35384][SQL][FOLLOWUP] Move `HashMap.get` out of `InvokeLike.invoke` ### What changes were proposed in this pull request? Move hash map lookup operation out of `InvokeLike.invoke` since it doesn't depend on the input. ### Why are the changes needed? We shouldn't need to look up the hash map for every input row evaluated by `InvokeLike.invoke` since it doesn't depend on input. This could speed up the performance a bit. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests. Closes #32532 from sunchao/SPARK-35384-follow-up. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-05-14 14:00:39 -07:00
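The optimization in this commit is a loop-invariant hoist: a lookup that does not depend on the input row is done once instead of per row. A minimal sketch of the pattern follows; `HoistedLookupSketch`, its registry, and both method names are hypothetical and unrelated to the real `InvokeLike` code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntUnaryOperator;

// Hypothetical sketch of hoisting a row-independent map lookup.
final class HoistedLookupSketch {

    static final Map<String, IntUnaryOperator> REGISTRY = new HashMap<>();

    static {
        REGISTRY.put("inc", x -> x + 1);
        REGISTRY.put("dbl", x -> x * 2);
    }

    // Before: the map is consulted once for every row.
    static int[] evalPerRowLookup(String name, int[] rows) {
        int[] out = new int[rows.length];
        for (int i = 0; i < rows.length; i++) {
            out[i] = REGISTRY.get(name).applyAsInt(rows[i]);
        }
        return out;
    }

    // After: the lookup is resolved once, outside the per-row path.
    static int[] evalHoisted(String name, int[] rows) {
        IntUnaryOperator fn = REGISTRY.get(name);
        int[] out = new int[rows.length];
        for (int i = 0; i < rows.length; i++) {
            out[i] = fn.applyAsInt(rows[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] rows = {1, 2, 3};
        System.out.println(java.util.Arrays.toString(evalHoisted("dbl", rows)));
    }
}
```

Both methods return the same results; only the number of hash lookups differs, which is why the commit describes the gain as small.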
yi.wu	94bd480761	[SPARK-35206][TESTS][SQL] Extract common used get project path into a function in SparkFunctionSuite ### What changes were proposed in this pull request? Add a common function `getWorkspaceFilePath` (which prefixes the path with the Spark home) to `SparkFunctionSuite`, and apply it at the places it was extracted from. ### Why are the changes needed? Spark SQL has test suites that read resources when running tests. The code for getting the resource path is repeated across suites. We can extract it into a function to ease code maintenance. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass existing tests. Closes #32315 from Ngone51/extract-common-file-path. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-05-14 22:17:50 +08:00
ulysses-you	6218bc5036	[SPARK-35332][SQL][FOLLOWUP] Refine wrong comment ### What changes were proposed in this pull request? Refine the comment in `CacheManager`. ### Why are the changes needed? Avoid misleading developers. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Not needed. Closes #32543 from ulysses-you/SPARK-35332-FOLLOWUP. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Kent Yao <yao@apache.org>	2021-05-14 17:10:21 +08:00
Pablo Langa	9ea55fe771	[SPARK-35207][SQL] Normalize hash function behavior with negative zero (floating point types) ### What changes were proposed in this pull request? Generally, we would expect that x = y => hash( x ) = hash( y ). However, +0 and -0 hash to different values for floating point types. ``` scala> spark.sql("select hash(cast('0.0' as double)), hash(cast('-0.0' as double))").show +-------------------------+--------------------------+ \|hash(CAST(0.0 AS DOUBLE))\|hash(CAST(-0.0 AS DOUBLE))\| +-------------------------+--------------------------+ \| -1670924195\| -853646085\| +-------------------------+--------------------------+ scala> spark.sql("select cast('0.0' as double) == cast('-0.0' as double)").show +--------------------------------------------+ \|(CAST(0.0 AS DOUBLE) = CAST(-0.0 AS DOUBLE))\| +--------------------------------------------+ \| true\| +--------------------------------------------+ ``` Here is an extract from IEEE 754: > The two zeros are distinguishable arithmetically only by either division-by-zero ( producing appropriately signed infinities ) or else by the CopySign function recommended by IEEE 754 /854. Infinities, SNaNs, NaNs and Subnormal numbers necessitate four more special cases From this, I deduce that the hash function must produce the same result for 0 and -0. ### Why are the changes needed? It is a correctness issue. ### Does this PR introduce _any_ user-facing change? This change only affects the hash function applied to the -0 value in float and double types. ### How was this patch tested? Unit testing and manual testing. Closes #32496 from planga82/feature/spark35207_hashnegativezero. Authored-by: Pablo Langa <soypab@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-05-14 12:40:36 +08:00
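The mismatch between equality and hashing for signed zeros can be reproduced in plain Java, since `0.0 == -0.0` holds while the two values have different bit patterns. The sketch below shows one way to normalize before hashing; `NegativeZeroHashSketch` and its methods are hypothetical, and the normalization shown is an illustration rather than the code Spark uses.

```java
// Hypothetical sketch: map -0.0 to 0.0 before taking the bits used for hashing.
final class NegativeZeroHashSketch {

    // -0.0 == 0.0 is true, so both zeros take the first branch and become +0.0.
    // NaN compares unequal to everything and is returned unchanged.
    static double normalize(double d) {
        return d == 0.0d ? 0.0d : d;
    }

    // Bit pattern a bit-based hash would see without normalization.
    static long rawBits(double d) {
        return Double.doubleToLongBits(d);
    }

    // Bit pattern after normalization.
    static long normalizedBits(double d) {
        return Double.doubleToLongBits(normalize(d));
    }

    public static void main(String[] args) {
        System.out.println("equal:           " + (0.0d == -0.0d));
        System.out.println("raw bits equal:  " + (rawBits(0.0d) == rawBits(-0.0d)));
        System.out.println("norm bits equal: " + (normalizedBits(0.0d) == normalizedBits(-0.0d)));
    }
}
```

Any hash computed from the normalized bits then satisfies `x = y => hash(x) = hash(y)` for the two zeros, while every other value keeps its original bit pattern.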
Takeshi Yamamuro	8fa739fb9d	[SPARK-35329][SQL] Split generated switch code into pieces in ExpandExec ### What changes were proposed in this pull request? This PR intends to split generated switch code into smaller ones in `ExpandExec`. In the current master, even a simple query like the one below generates a large method whose size (`maxMethodCodeSize:7448`) is close to `8000` (`CodeGenerator.DEFAULT_JVM_HUGE_METHOD_LIMIT`); ``` scala> val df = Seq(("2016-03-27 19:39:34", 1, "a"), ("2016-03-27 19:39:56", 2, "a"), ("2016-03-27 19:39:27", 4, "b")).toDF("time", "value", "id") scala> val rdf = df.select(window($"time", "10 seconds", "3 seconds", "0 second"), $"value").orderBy($"window.start".asc, $"value".desc).select("value") scala> sql("SET spark.sql.adaptive.enabled=false") scala> import org.apache.spark.sql.execution.debug._ scala> rdf.debugCodegen Found 2 WholeStageCodegen subtrees. == Subtree 1 / 2 (maxMethodCodeSize:7448; maxConstantPoolSize:189(0.29% used); numInnerClasses:0) == ^^^^ (1) Project [window#34.start AS _gen_alias_39#39, value#11] +- (1) Filter ((isnotnull(window#34) AND (cast(time#10 as timestamp) >= window#34.start)) AND (cast(time#10 as timestamp) < window#34.end)) +- (1) Expand [List(named_struct(start, precisetimestampcon... 
/ 028 / private void expand_doConsume_0(InternalRow localtablescan_row_0, UTF8String expand_expr_0_0, boolean expand_exprIsNull_0_0, int expand_expr_1_0) throws java.io.IOException { / 029 / boolean expand_isNull_0 = true; / 030 / InternalRow expand_value_0 = / 031 / null; / 032 / for (int expand_i_0 = 0; expand_i_0 < 4; expand_i_0 ++) { / 033 / switch (expand_i_0) { / 034 / case 0: (too many code lines) / 517 / break; / 518 / / 519 / case 1: (too many code lines) / 1002 / break; / 1003 / / 1004 / case 2: (too many code lines) / 1487 / break; / 1488 / / 1489 / case 3: (too many code lines) / 1972 / break; / 1973 / } / 1974 / ((org.apache.spark.sql.execution.metric.SQLMetric) references[33] / numOutputRows /).add(1); / 1975 / / 1976 / do { / 1977 / boolean filter_value_2 = !expand_isNull_0; / 1978 / if (!filter_value_2) continue; ``` The fix in this PR can make the method smaller as follows; ``` Found 2 WholeStageCodegen subtrees. == Subtree 1 / 2 (maxMethodCodeSize:1713; maxConstantPoolSize:210(0.32% used); numInnerClasses:0) == ^^^^ (1) Project [window#17.start AS _gen_alias_32#32, value#11] +- (1) Filter ((isnotnull(window#17) AND (cast(time#10 as timestamp) >= window#17.start)) AND (cast(time#10 as timestamp) < window#17.end)) +- (1) Expand [List(named_struct(start, precisetimestampcon... 
/* 032 / private void expand_doConsume_0(InternalRow localtablescan_row_0, UTF8String expand_expr_0_0, boolean expand_exprIsNull_0_0, int expand_expr_1_0) throws java.io.IOException { / 033 / for (int expand_i_0 = 0; expand_i_0 < 4; expand_i_0 ++) { / 034 / switch (expand_i_0) { / 035 / case 0: / 036 / expand_switchCaseCode_0(expand_exprIsNull_0_0, expand_expr_0_0); / 037 / break; / 038 / / 039 / case 1: / 040 / expand_switchCaseCode_1(expand_exprIsNull_0_0, expand_expr_0_0); / 041 / break; / 042 / / 043 / case 2: / 044 / expand_switchCaseCode_2(expand_exprIsNull_0_0, expand_expr_0_0); / 045 / break; / 046 / / 047 / case 3: / 048 / expand_switchCaseCode_3(expand_exprIsNull_0_0, expand_expr_0_0); / 049 / break; / 050 / } / 051 / ((org.apache.spark.sql.execution.metric.SQLMetric) references[33] / numOutputRows /).add(1); / 052 / / 053 / do { / 054 / boolean filter_value_2 = !expand_resultIsNull_0; / 055 / if (!filter_value_2) continue; / 056 */ ... ``` ### Why are the changes needed? For better generated code. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? GA passed. Closes #32457 from maropu/splitSwitchCode. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2021-05-13 17:53:46 -07:00
Liang-Chi Hsieh	6a949d1659	[SPARK-35397][SQL] Replace sys.err usage with explicit exception type ### What changes were proposed in this pull request? This patch replaces `sys.err` usages with explicit exception types. ### Why are the changes needed? Motivated by the previous comment https://github.com/apache/spark/pull/32519#discussion_r630787080, it sounds better to replace `sys.err` usages with explicit exception type. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #32535 from viirya/replace-sys-err. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-05-13 10:37:24 -07:00
Linhong Liu	6aa2594c6b	[SPARK-35366][SQL] Avoid using deprecated `buildForBatch` and `buildForStreaming` ### What changes were proposed in this pull request? Currently, in DSv2, we are still using the deprecated `buildForBatch` and `buildForStreaming`. This PR implements the `build`, `toBatch`, `toStreaming` interfaces to replace the deprecated ones. ### Why are the changes needed? Code refactor ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing UT. Closes #32497 from linhongliu-db/dsv2-writer. Lead-authored-by: Linhong Liu <linhong.liu@databricks.com> Co-authored-by: Linhong Liu <67896261+linhongliu-db@users.noreply.github.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-05-13 17:23:08 +00:00
gengjiaan	c2e15cccab	[SPARK-35062][SQL] Group exception messages in sql/streaming ### What changes were proposed in this pull request? This PR groups exception messages in `sql/core/src/main/scala/org/apache/spark/sql/streaming`. ### Why are the changes needed? It will largely help with the standardization of error messages and their maintenance. ### Does this PR introduce _any_ user-facing change? No. Error messages remain unchanged. ### How was this patch tested? No new tests - pass all original tests to make sure it doesn't break any existing behavior. Closes #32464 from beliefer/SPARK-35062. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-05-13 15:04:03 +00:00
ulysses-you	6f63057ede	[SPARK-35332][SQL] Make cache plan disable configs configurable ### What changes were proposed in this pull request? Add a new config to make cache plan disable configs configurable. ### Why are the changes needed? Some configs are disabled for the cache plan to avoid performance regressions, but not every query becomes slower with AQE or bucket scan enabled. A new config is useful so that users can decide whether some configs should be disabled during cache plan. ### Does this PR introduce _any_ user-facing change? Yes, a new config. ### How was this patch tested? Add test. Closes #32482 from ulysses-you/SPARK-35332. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-05-13 14:49:05 +00:00
Gengliang Wang	02c99f15ee	[SPARK-35162][SQL] New SQL functions: TRY_ADD/TRY_DIVIDE ### What changes were proposed in this pull request? Add new SQL functions: * TRY_ADD * TRY_DIVIDE These expressions are identical to the following expressions under ANSI mode, except that they return null if an error occurs: * ADD * DIVIDE Note: it is easy to add other expressions like `TRY_SUBTRACT`/`TRY_MULTIPLY`, but let's control the number of these new expressions and just add `TRY_ADD` and `TRY_DIVIDE` for now. ### Why are the changes needed? 1. Users can finish queries without interruptions in ANSI mode. 2. Users can get NULLs instead of unreasonable results if overflow occurs when ANSI mode is off. For example, the behavior of the following SQL operation is unreasonable: ``` 2147483647 + 2 => -2147483647 ``` With the new safe version of the SQL functions: ``` TRY_ADD(2147483647, 2) => null ``` Note: We should only add new expressions for important operators, instead of adding new safe expressions for all the expressions that can throw errors. ### Does this PR introduce _any_ user-facing change? Yes, new SQL functions: TRY_ADD/TRY_DIVIDE ### How was this patch tested? Unit test Closes #32292 from gengliangwang/try_add. Authored-by: Gengliang Wang <ltnwgl@gmail.com> Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>	2021-05-13 22:26:08 +08:00
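The "return null instead of failing or wrapping" behavior described for `TRY_ADD` and `TRY_DIVIDE` can be illustrated with standard Java arithmetic. This is a sketch of the idea only: `TryArithmeticSketch`, `tryAdd`, and `tryDivide` are hypothetical names, and a Java `null` stands in for SQL NULL.

```java
// Hypothetical sketch of null-on-error arithmetic.
final class TryArithmeticSketch {

    // Returns null on int overflow instead of wrapping around or throwing.
    static Integer tryAdd(int a, int b) {
        try {
            return Math.addExact(a, b);
        } catch (ArithmeticException e) {
            return null;
        }
    }

    // Returns null when the divisor is zero.
    static Double tryDivide(double a, double b) {
        if (b == 0.0d) {
            return null;
        }
        return a / b;
    }

    public static void main(String[] args) {
        System.out.println(2147483647 + 2);          // plain int addition wraps
        System.out.println(tryAdd(2147483647, 2));   // overflow becomes null
        System.out.println(tryDivide(1.0d, 0.0d));   // division by zero becomes null
    }
}
```

`Math.addExact` plays the role of the checked addition here: it throws `ArithmeticException` on overflow, and the wrapper converts that failure into a null result.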
jiake	b6d57b6b99	[SPARK-34637][SQL] Support DPP + AQE when the broadcast exchange can be reused ### What changes were proposed in this pull request? We have supported DPP in AQE when the join is Broadcast hash join before applying the AQE rules in [SPARK-34168](https://issues.apache.org/jira/browse/SPARK-34168), which has some limitations. It only applies DPP when the small table side is executed first, so that the big table side can then reuse the broadcast exchange of the small table side. This PR addresses the above limitations and applies DPP whenever the broadcast exchange can be reused. ### Why are the changes needed? Resolve the limitations when both DPP and AQE are enabled. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added new UT. Closes #31756 from JkSelf/supportDPP2. Authored-by: jiake <ke.a.jia@intel.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-05-13 13:07:02 +00:00
Wenchen Fan	d1b8bd7d11	[SPARK-34720][SQL] MERGE ... UPDATE/INSERT * should do by-name resolution ### What changes were proposed in this pull request? In Spark, we have an extension in the MERGE syntax: INSERT/UPDATE *. This is not from the ANSI standard or any other mainstream database, so we need to define the behavior on our own. The behavior today is very weird: assume the source table has `n1` columns and the target table has `n2` columns. We generate the assignments by taking the first `min(n1, n2)` columns from the source and target tables and pairing them by ordinal. This PR proposes a more reasonable behavior: take all the columns from the target table as keys, and find the corresponding columns from the source table by name as values. ### Why are the changes needed? Fix MERGE INSERT/UPDATE * to be more user-friendly and to make schema evolution easier. ### Does this PR introduce _any_ user-facing change? Yes, but MERGE is only supported by very few data sources. ### How was this patch tested? new tests Closes #32192 from cloud-fan/merge. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-05-13 12:58:24 +00:00
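The difference between the old ordinal pairing and the proposed by-name pairing can be sketched as follows. This is a hypothetical Java illustration, not the analyzer code: the class and method names are invented, column names are matched exactly, and a missing source column simply throws, which is an assumption about error handling rather than something stated in the commit message.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of how MERGE ... UPDATE/INSERT * assignments are paired.
final class MergeStarAssignments {
    // Old behavior: pair the first min(n1, n2) columns by position.
    static Map<String, String> byOrdinal(List<String> target, List<String> source) {
        Map<String, String> assignments = new LinkedHashMap<>();
        int n = Math.min(target.size(), source.size());
        for (int i = 0; i < n; i++) {
            assignments.put(target.get(i), source.get(i));
        }
        return assignments;
    }

    // Proposed behavior: every target column is a key, its value is the
    // source column with the same name.
    static Map<String, String> byName(List<String> target, List<String> source) {
        Map<String, String> assignments = new LinkedHashMap<>();
        for (String column : target) {
            if (!source.contains(column)) {
                // Assumption: treat a missing source column as an error.
                throw new IllegalArgumentException("no source column named " + column);
            }
            assignments.put(column, column);
        }
        return assignments;
    }

    public static void main(String[] args) {
        List<String> target = List.of("id", "name");
        List<String> source = List.of("name", "id"); // same columns, different order
        System.out.println(byOrdinal(target, source)); // {id=name, name=id}
        System.out.println(byName(target, source));    // {id=id, name=name}
    }
}
```

With reordered source columns, ordinal pairing silently assigns `name` into `id`, while by-name pairing keeps each column aligned with its namesake.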
Cheng Su	c1e995ac95	[SPARK-35350][SQL] Add code-gen for left semi sort merge join ### What changes were proposed in this pull request? As title. This PR is to add code-gen support for LEFT SEMI sort merge join. The main change is to add `semiJoin` code path in `SortMergeJoinExec.doProduce()` and introduce `onlyBufferFirstMatchedRow` in `SortMergeJoinExec.genScanner()`. The latter is for left semi sort merge join without condition. For this kind of query, we don't need to buffer all matched rows, but only the first one (this is same as non-code-gen code path). Example query: ``` val df1 = spark.range(10).select($"id".as("k1")) val df2 = spark.range(4).select($"id".as("k2")) val oneJoinDF = df1.join(df2.hint("SHUFFLE_MERGE"), $"k1" === $"k2", "left_semi") ``` Example of generated code for the query: ``` == Subtree 5 / 5 (maxMethodCodeSize:302; maxConstantPoolSize:156(0.24% used); numInnerClasses:0) == (5) Project [id#0L AS k1#2L] +- (5) SortMergeJoin [id#0L], [k2#6L], LeftSemi :- (2) Sort [id#0L ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(id#0L, 5), ENSURE_REQUIREMENTS, [id=#27] : +- (1) Range (0, 10, step=1, splits=2) +- (4) Sort [k2#6L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(k2#6L, 5), ENSURE_REQUIREMENTS, [id=#33] +- (3) Project [id#4L AS k2#6L] +- (3) Range (0, 4, step=1, splits=2) Generated code: / 001 / public Object generate(Object[] references) { / 002 / return new GeneratedIteratorForCodegenStage5(references); / 003 / } / 004 / / 005 / // codegenStageId=5 / 006 / final class GeneratedIteratorForCodegenStage5 extends org.apache.spark.sql.execution.BufferedRowIterator { / 007 / private Object[] references; / 008 / private scala.collection.Iterator[] inputs; / 009 / private scala.collection.Iterator smj_streamedInput_0; / 010 / private scala.collection.Iterator smj_bufferedInput_0; / 011 / private InternalRow smj_streamedRow_0; / 012 / private InternalRow smj_bufferedRow_0; / 013 / private long smj_value_2; / 014 / private 
org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray smj_matches_0; / 015 / private long smj_value_3; / 016 / private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] smj_mutableStateArray_0 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[2]; / 017 / / 018 / public GeneratedIteratorForCodegenStage5(Object[] references) { / 019 / this.references = references; / 020 / } / 021 / / 022 / public void init(int index, scala.collection.Iterator[] inputs) { / 023 / partitionIndex = index; / 024 / this.inputs = inputs; / 025 / smj_streamedInput_0 = inputs[0]; / 026 / smj_bufferedInput_0 = inputs[1]; / 027 / / 028 / smj_matches_0 = new org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray(1, 2147483647); / 029 / smj_mutableStateArray_0[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0); / 030 / smj_mutableStateArray_0[1] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0); / 031 / / 032 / } / 033 / / 034 / private boolean findNextJoinRows( / 035 / scala.collection.Iterator streamedIter, / 036 / scala.collection.Iterator bufferedIter) { / 037 / smj_streamedRow_0 = null; / 038 / int comp = 0; / 039 / while (smj_streamedRow_0 == null) { / 040 / if (!streamedIter.hasNext()) return false; / 041 / smj_streamedRow_0 = (InternalRow) streamedIter.next(); / 042 / long smj_value_0 = smj_streamedRow_0.getLong(0); / 043 / if (false) { / 044 / smj_streamedRow_0 = null; / 045 / continue; / 046 / / 047 / } / 048 / if (!smj_matches_0.isEmpty()) { / 049 / comp = 0; / 050 / if (comp == 0) { / 051 / comp = (smj_value_0 > smj_value_3 ? 1 : smj_value_0 < smj_value_3 ? 
-1 : 0); / 052 / } / 053 / / 054 / if (comp == 0) { / 055 / return true; / 056 / } / 057 / smj_matches_0.clear(); / 058 / } / 059 / / 060 / do { / 061 / if (smj_bufferedRow_0 == null) { / 062 / if (!bufferedIter.hasNext()) { / 063 / smj_value_3 = smj_value_0; / 064 / return !smj_matches_0.isEmpty(); / 065 / } / 066 / smj_bufferedRow_0 = (InternalRow) bufferedIter.next(); / 067 / long smj_value_1 = smj_bufferedRow_0.getLong(0); / 068 / if (false) { / 069 / smj_bufferedRow_0 = null; / 070 / continue; / 071 / } / 072 / smj_value_2 = smj_value_1; / 073 / } / 074 / / 075 / comp = 0; / 076 / if (comp == 0) { / 077 / comp = (smj_value_0 > smj_value_2 ? 1 : smj_value_0 < smj_value_2 ? -1 : 0); / 078 / } / 079 / / 080 / if (comp > 0) { / 081 / smj_bufferedRow_0 = null; / 082 / } else if (comp < 0) { / 083 / if (!smj_matches_0.isEmpty()) { / 084 / smj_value_3 = smj_value_0; / 085 / return true; / 086 / } else { / 087 / smj_streamedRow_0 = null; / 088 / } / 089 / } else { / 090 / if (smj_matches_0.isEmpty()) { / 091 / smj_matches_0.add((UnsafeRow) smj_bufferedRow_0); / 092 / } / 093 / / 094 / smj_bufferedRow_0 = null; / 095 / } / 096 / } while (smj_streamedRow_0 != null); / 097 / } / 098 / return false; // unreachable / 099 / } / 100 / / 101 / protected void processNext() throws java.io.IOException { / 102 / while (findNextJoinRows(smj_streamedInput_0, smj_bufferedInput_0)) { / 103 / long smj_value_4 = -1L; / 104 / smj_value_4 = smj_streamedRow_0.getLong(0); / 105 / scala.collection.Iterator<UnsafeRow> smj_iterator_0 = smj_matches_0.generateIterator(); / 106 / boolean smj_hasOutputRow_0 = false; / 107 / / 108 / while (!smj_hasOutputRow_0 && smj_iterator_0.hasNext()) { / 109 / InternalRow smj_bufferedRow_1 = (InternalRow) smj_iterator_0.next(); / 110 / / 111 / smj_hasOutputRow_0 = true; / 112 / ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] / numOutputRows /).add(1); / 113 / / 114 / // common sub-expressions / 115 / / 116 / smj_mutableStateArray_0[1].reset(); 
/ 117 / / 118 / smj_mutableStateArray_0[1].write(0, smj_value_4); / 119 / append((smj_mutableStateArray_0[1].getRow()).copy()); / 120 / / 121 / } / 122 / if (shouldStop()) return; / 123 / } / 124 / ((org.apache.spark.sql.execution.joins.SortMergeJoinExec) references[1] / plan /).cleanupResources(); / 125 / } / 126 / / 127 / } ``` ### Why are the changes needed? Improve query CPU performance. Test with one query: ``` def sortMergeJoin(): Unit = { val N = 2 << 20 codegenBenchmark("left semi sort merge join", N) { val df1 = spark.range(N).selectExpr(s"id * 2 as k1") val df2 = spark.range(N).selectExpr(s"id * 3 as k2") val df = df1.join(df2, col("k1") === col("k2"), "left_semi") assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[SortMergeJoinExec]).isDefined) df.noop() } } ``` Seeing a 30% run-time improvement: ``` Running benchmark: left semi sort merge join Running case: left semi sort merge join code-gen off Stopped after 2 iterations, 1369 ms Running case: left semi sort merge join code-gen on Stopped after 5 iterations, 2743 ms Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz left semi sort merge join: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ left semi sort merge join code-gen off 676 685 13 3.1 322.2 1.0X left semi sort merge join code-gen on 524 549 32 4.0 249.7 1.3X ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added unit test in `WholeStageCodegenSuite.scala` and `ExistenceJoinSuite.scala`. Closes #32528 from c21/smj-left-semi. Authored-by: Cheng Su <chengsu@fb.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-05-13 12:52:26 +00:00
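The idea behind `onlyBufferFirstMatchedRow` (for a left semi join without an extra condition, one matching buffered row is enough to emit the streamed row) can be shown with a much smaller merge loop. This is a hypothetical Java sketch over sorted `long` keys, assuming no nulls and no spilling; it is not the generated code above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a left semi sort merge join on sorted long keys.
final class LeftSemiMergeSketch {
    // Both inputs must be sorted ascending. Emits each streamed key that has
    // at least one equal key on the buffered side.
    static List<Long> leftSemi(long[] streamed, long[] buffered) {
        List<Long> output = new ArrayList<>();
        int b = 0;
        for (long key : streamed) {
            // Skip buffered keys that are smaller than the current streamed key.
            while (b < buffered.length && buffered[b] < key) {
                b++;
            }
            // One match suffices; the buffered cursor is not advanced, so a
            // repeated streamed key still finds the same match.
            if (b < buffered.length && buffered[b] == key) {
                output.add(key);
            }
        }
        return output;
    }

    public static void main(String[] args) {
        long[] k1 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; // like spark.range(10)
        long[] k2 = {0, 1, 2, 3};                   // like spark.range(4)
        System.out.println(leftSemi(k1, k2));       // [0, 1, 2, 3]
    }
}
```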
Kent Yao	51815430b2	[SPARK-35380][SQL] Loading SparkSessionExtensions from ServiceLoader ### What changes were proposed in this pull request? In https://github.com/yaooqinn/itachi/issues/8, we had a discussion about the current extension injection for the Spark session. We agreed that the current way is not convenient for either third-party developers or end users. It is much simpler if third-party developers can provide a resource file that lists default extensions for Spark to load ahead of time. ### Why are the changes needed? Better user experience. ### Does this PR introduce _any_ user-facing change? no, dev change ### How was this patch tested? new tests Closes #32515 from yaooqinn/SPARK-35380. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org>	2021-05-13 16:34:13 +08:00
Chao Sun	0ab9bd79b3	[SPARK-35384][SQL] Improve performance for InvokeLike.invoke ### What changes were proposed in this pull request? Change `map` in `InvokeLike.invoke` to a while loop to improve performance, following Spark [style guide](https://github.com/databricks/scala-style-guide#traversal-and-zipwithindex). ### Why are the changes needed? `InvokeLike.invoke`, which is used in non-codegen path for `Invoke` and `StaticInvoke`, currently uses `map` to evaluate arguments: ```scala val args = arguments.map(e => e.eval(input).asInstanceOf[Object]) if (needNullCheck && args.exists(_ == null)) { // return null if one of arguments is null null } else { ... ``` which is pretty expensive if the method itself is trivial. We can change it to a plain while loop. <img width="871" alt="Screen Shot 2021-05-12 at 12 19 59 AM" src="https://user-images.githubusercontent.com/506679/118055719-7f985a00-b33d-11eb-943b-cf85eab35f44.png"> Benchmark results show this can improve as much as 3x from `V2FunctionBenchmark`: Before ``` OpenJDK 64-Bit Server VM 1.8.0_292-b10 on Linux 5.4.0-1046-azure Intel(R) Xeon(R) CPU E5-2673 v3 2.40GHz scalar function (long + long) -> long, result_nullable = false codegen = false: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative -------------------------------------------------------------------------------------------------------------------------------------------------------------- native_long_add 36506 36656 251 13.7 73.0 1.0X java_long_add_default 47151 47540 370 10.6 94.3 0.8X java_long_add_magic 178691 182457 1327 2.8 357.4 0.2X java_long_add_static_magic 177151 178258 1151 2.8 354.3 0.2X ``` After ``` OpenJDK 64-Bit Server VM 1.8.0_292-b10 on Linux 5.4.0-1046-azure Intel(R) Xeon(R) CPU E5-2673 v3 2.40GHz scalar function (long + long) -> long, result_nullable = false codegen = false: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative 
-------------------------------------------------------------------------------------------------------------------------------------------------------------- native_long_add 29897 30342 568 16.7 59.8 1.0X java_long_add_default 40628 41075 664 12.3 81.3 0.7X java_long_add_magic 54553 54755 182 9.2 109.1 0.5X java_long_add_static_magic 55410 55532 127 9.0 110.8 0.5X ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests. Closes #32527 from sunchao/SPARK-35384. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-05-12 20:57:21 -07:00
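The change from `map` to a plain `while` loop in `InvokeLike.invoke` can be approximated in Java as below. This is a hypothetical analogue, not the Scala code from the PR: `evalArgs` and its parameters are invented names, argument expressions are modelled as `Function<Object, Object>`, and returning early on the first null is an assumption about how the null check is folded into the loop.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical Java analogue of evaluating Invoke arguments with a while loop.
final class InvokeArgsSketch {
    // Evaluates every argument against the input row into a preallocated array.
    // Returns null if needNullCheck is set and any argument evaluates to null.
    static Object[] evalArgs(List<Function<Object, Object>> arguments,
                             Object input,
                             boolean needNullCheck) {
        Object[] args = new Object[arguments.size()];
        int i = 0;
        while (i < args.length) {
            Object value = arguments.get(i).apply(input);
            if (needNullCheck && value == null) {
                return null; // assumption: short-circuit on the first null
            }
            args[i] = value;
            i++;
        }
        return args;
    }

    public static void main(String[] args) {
        Function<Object, Object> identity = row -> row;
        Function<Object, Object> constant = row -> 2L;
        Object[] evaluated = evalArgs(List.of(identity, constant), 1L, true);
        System.out.println(evaluated[0] + " " + evaluated[1]); // 1 2
    }
}
```

The loop avoids allocating an intermediate collection and a closure per row, which is where the benchmark difference for trivial methods comes from according to the commit message.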
Takeshi Yamamuro	3241aeb7f4	[SPARK-35385][SQL][TESTS] Skip duplicate queries in the TPCDS-related tests ### What changes were proposed in this pull request? This PR proposes to skip the "q6", "q34", "q64", "q74", "q75", "q78" queries in the TPCDS-related tests because the TPCDS v2.7 queries have almost the same ones; the only differences in these queries are ORDER BY columns. ### Why are the changes needed? To improve test performance. ### Does this PR introduce _any_ user-facing change? No, dev only. ### How was this patch tested? Existing tests. Closes #32520 from maropu/SkipDupQueries. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2021-05-13 09:46:25 +09:00
Chao Sun	bc95c3a69b	[SPARK-35361][SQL][FOLLOWUP] Switch to use while loop ### What changes were proposed in this pull request? Switch to a plain `while` loop, following the Spark [style guide](https://github.com/databricks/scala-style-guide#traversal-and-zipwithindex). ### Why are the changes needed? A `while` loop may yield better performance compared to `foreach`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? N/A Closes #32522 from sunchao/SPARK-35361-follow-up. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-05-12 12:41:12 -07:00
Liang-Chi Hsieh	f156a95641	[SPARK-35347][SQL][FOLLOWUP] Throw exception with an explicit exception type when cannot find the method instead of sys.error ### What changes were proposed in this pull request? A simple follow-up of #32474 to throw exception instead of sys.error. ### Why are the changes needed? An exception only fails the query, instead of sys.error. ### Does this PR introduce _any_ user-facing change? Yes, if `Invoke` or `StaticInvoke` cannot find the method, instead of original `sys.error` now we only throw an exception. ### How was this patch tested? Existing tests. Closes #32519 from viirya/SPARK-35347-followup. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2021-05-12 09:56:08 -07:00
Cheng Su	7bcadedbd2	[SPARK-35349][SQL] Add code-gen for left/right outer sort merge join ### What changes were proposed in this pull request? This PR is to add code-gen support for LEFT OUTER / RIGHT OUTER sort merge join. Currently sort merge join only supports inner join type (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala#L374 ). There's no fundamental reason why we cannot support code-gen for other join types. Here we add code-gen for LEFT OUTER / RIGHT OUTER join. Will submit followup PRs to add LEFT SEMI, LEFT ANTI and FULL OUTER code-gen separately. The change is to extend current sort merge join logic to work with LEFT OUTER and RIGHT OUTER (should work with LEFT SEMI/ANTI as well, but FULL OUTER join needs some other more code change). Replace left/right with streamed/buffered to make code extendable to other join types besides inner join. Example query: ``` val df1 = spark.range(10).select($"id".as("k1"), $"id".as("k3")) val df2 = spark.range(4).select($"id".as("k2"), $"id".as("k4")) df1.join(df2.hint("SHUFFLE_MERGE"), $"k1" === $"k2" && $"k3" + 1 < $"k4", "left_outer").explain("codegen") ``` Example generated code: ``` == Subtree 5 / 5 (maxMethodCodeSize:396; maxConstantPoolSize:159(0.24% used); numInnerClasses:0) == (5) SortMergeJoin [k1#2L], [k2#8L], LeftOuter, ((k3#3L + 1) < k4#9L) :- (2) Sort [k1#2L ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(k1#2L, 5), ENSURE_REQUIREMENTS, [id=#26] : +- (1) Project [id#0L AS k1#2L, id#0L AS k3#3L] : +- (1) Range (0, 10, step=1, splits=2) +- (4) Sort [k2#8L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(k2#8L, 5), ENSURE_REQUIREMENTS, [id=#32] +- (3) Project [id#6L AS k2#8L, id#6L AS k4#9L] +- (3) Range (0, 4, step=1, splits=2) Generated code: / 001 / public Object generate(Object[] references) { / 002 / return new GeneratedIteratorForCodegenStage5(references); / 003 / } / 004 / / 005 / // codegenStageId=5 / 006 / 
final class GeneratedIteratorForCodegenStage5 extends org.apache.spark.sql.execution.BufferedRowIterator { / 007 / private Object[] references; / 008 / private scala.collection.Iterator[] inputs; / 009 / private scala.collection.Iterator smj_streamedInput_0; / 010 / private scala.collection.Iterator smj_bufferedInput_0; / 011 / private InternalRow smj_streamedRow_0; / 012 / private InternalRow smj_bufferedRow_0; / 013 / private long smj_value_2; / 014 / private org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray smj_matches_0; / 015 / private long smj_value_3; / 016 / private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] smj_mutableStateArray_0 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[1]; / 017 / / 018 / public GeneratedIteratorForCodegenStage5(Object[] references) { / 019 / this.references = references; / 020 / } / 021 / / 022 / public void init(int index, scala.collection.Iterator[] inputs) { / 023 / partitionIndex = index; / 024 / this.inputs = inputs; / 025 / smj_streamedInput_0 = inputs[0]; / 026 / smj_bufferedInput_0 = inputs[1]; / 027 / / 028 / smj_matches_0 = new org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray(2147483632, 2147483647); / 029 / smj_mutableStateArray_0[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(4, 0); / 030 / / 031 / } / 032 / / 033 / private boolean findNextJoinRows( / 034 / scala.collection.Iterator streamedIter, / 035 / scala.collection.Iterator bufferedIter) { / 036 / smj_streamedRow_0 = null; / 037 / int comp = 0; / 038 / while (smj_streamedRow_0 == null) { / 039 / if (!streamedIter.hasNext()) return false; / 040 / smj_streamedRow_0 = (InternalRow) streamedIter.next(); / 041 / long smj_value_0 = smj_streamedRow_0.getLong(0); / 042 / if (false) { / 043 / if (!smj_matches_0.isEmpty()) { / 044 / smj_matches_0.clear(); / 045 / } / 046 / return false; / 047 / / 048 / } / 049 / if (!smj_matches_0.isEmpty()) { / 050 / comp = 0; / 051 
/ if (comp == 0) { / 052 / comp = (smj_value_0 > smj_value_3 ? 1 : smj_value_0 < smj_value_3 ? -1 : 0); / 053 / } / 054 / / 055 / if (comp == 0) { / 056 / return true; / 057 / } / 058 / smj_matches_0.clear(); / 059 / } / 060 / / 061 / do { / 062 / if (smj_bufferedRow_0 == null) { / 063 / if (!bufferedIter.hasNext()) { / 064 / smj_value_3 = smj_value_0; / 065 / return !smj_matches_0.isEmpty(); / 066 / } / 067 / smj_bufferedRow_0 = (InternalRow) bufferedIter.next(); / 068 / long smj_value_1 = smj_bufferedRow_0.getLong(0); / 069 / if (false) { / 070 / smj_bufferedRow_0 = null; / 071 / continue; / 072 / } / 073 / smj_value_2 = smj_value_1; / 074 / } / 075 / / 076 / comp = 0; / 077 / if (comp == 0) { / 078 / comp = (smj_value_0 > smj_value_2 ? 1 : smj_value_0 < smj_value_2 ? -1 : 0); / 079 / } / 080 / / 081 / if (comp > 0) { / 082 / smj_bufferedRow_0 = null; / 083 / } else if (comp < 0) { / 084 / if (!smj_matches_0.isEmpty()) { / 085 / smj_value_3 = smj_value_0; / 086 / return true; / 087 / } else { / 088 / return false; / 089 / } / 090 / } else { / 091 / smj_matches_0.add((UnsafeRow) smj_bufferedRow_0); / 092 / smj_bufferedRow_0 = null; / 093 / } / 094 / } while (smj_streamedRow_0 != null); / 095 / } / 096 / return false; // unreachable / 097 / } / 098 / / 099 / protected void processNext() throws java.io.IOException { / 100 / while (smj_streamedInput_0.hasNext()) { / 101 / findNextJoinRows(smj_streamedInput_0, smj_bufferedInput_0); / 102 / long smj_value_4 = -1L; / 103 / long smj_value_5 = -1L; / 104 / boolean smj_loaded_0 = false; / 105 / smj_value_5 = smj_streamedRow_0.getLong(1); / 106 / scala.collection.Iterator<UnsafeRow> smj_iterator_0 = smj_matches_0.generateIterator(); / 107 / boolean smj_foundMatch_0 = false; / 108 / / 109 / // the last iteration of this loop is to emit an empty row if there is no matched rows. / 110 / while (smj_iterator_0.hasNext() \|\| !smj_foundMatch_0) { / 111 / InternalRow smj_bufferedRow_1 = smj_iterator_0.hasNext() ? 
/ 112 / (InternalRow) smj_iterator_0.next() : null; / 113 / boolean smj_isNull_5 = true; / 114 / long smj_value_9 = -1L; / 115 / if (smj_bufferedRow_1 != null) { / 116 / long smj_value_8 = smj_bufferedRow_1.getLong(1); / 117 / smj_isNull_5 = false; / 118 / smj_value_9 = smj_value_8; / 119 / } / 120 / if (smj_bufferedRow_1 != null) { / 121 / boolean smj_isNull_6 = true; / 122 / boolean smj_value_10 = false; / 123 / long smj_value_11 = -1L; / 124 / / 125 / smj_value_11 = smj_value_5 + 1L; / 126 / / 127 / if (!smj_isNull_5) { / 128 / smj_isNull_6 = false; // resultCode could change nullability. / 129 / smj_value_10 = smj_value_11 < smj_value_9; / 130 / / 131 / } / 132 / if (smj_isNull_6 \|\| !smj_value_10) { / 133 / continue; / 134 / } / 135 / } / 136 / if (!smj_loaded_0) { / 137 / smj_loaded_0 = true; / 138 / smj_value_4 = smj_streamedRow_0.getLong(0); / 139 / } / 140 / boolean smj_isNull_3 = true; / 141 / long smj_value_7 = -1L; / 142 / if (smj_bufferedRow_1 != null) { / 143 / long smj_value_6 = smj_bufferedRow_1.getLong(0); / 144 / smj_isNull_3 = false; / 145 / smj_value_7 = smj_value_6; / 146 / } / 147 / smj_foundMatch_0 = true; / 148 / ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] / numOutputRows /).add(1); / 149 / / 150 / smj_mutableStateArray_0[0].reset(); / 151 / / 152 / smj_mutableStateArray_0[0].zeroOutNullBytes(); / 153 / / 154 / smj_mutableStateArray_0[0].write(0, smj_value_4); / 155 / / 156 / smj_mutableStateArray_0[0].write(1, smj_value_5); / 157 / / 158 / if (smj_isNull_3) { / 159 / smj_mutableStateArray_0[0].setNullAt(2); / 160 / } else { / 161 / smj_mutableStateArray_0[0].write(2, smj_value_7); / 162 / } / 163 / / 164 / if (smj_isNull_5) { / 165 / smj_mutableStateArray_0[0].setNullAt(3); / 166 / } else { / 167 / smj_mutableStateArray_0[0].write(3, smj_value_9); / 168 / } / 169 / append((smj_mutableStateArray_0[0].getRow()).copy()); / 170 / / 171 / } / 172 / if (shouldStop()) return; / 173 / } / 174 / 
((org.apache.spark.sql.execution.joins.SortMergeJoinExec) references[1] / plan /).cleanupResources(); / 175 / } / 176 / / 177 / } ``` ### Why are the changes needed? Improve query CPU performance. The example micro benchmark below showed a 10% run-time improvement. ``` def sortMergeJoinWithDuplicates(): Unit = { val N = 2 << 20 codegenBenchmark("sort merge join with duplicates", N) { val df1 = spark.range(N) .selectExpr(s"(id * 15485863) % ${N*10} as k1", "id as k3") val df2 = spark.range(N) .selectExpr(s"(id * 15485867) % ${N*10} as k2", "id as k4") val df = df1.join(df2, col("k1") === col("k2") && col("k3") * 3 < col("k4"), "left_outer") assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[SortMergeJoinExec]).isDefined) df.noop() } } ``` ``` Running benchmark: sort merge join with duplicates Running case: sort merge join with duplicates outer-smj-codegen off Stopped after 2 iterations, 2696 ms Running case: sort merge join with duplicates outer-smj-codegen on Stopped after 5 iterations, 6058 ms Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz sort merge join with duplicates: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------------------- sort merge join with duplicates outer-smj-codegen off 1333 1348 21 1.6 635.7 1.0X sort merge join with duplicates outer-smj-codegen on 1169 1212 47 1.8 557.4 1.1X ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added unit test in `WholeStageCodegenSuite.scala` and `WholeStageCodegenSuite.scala`. Closes #32476 from c21/smj-outer-codegen. Authored-by: Cheng Su <chengsu@fb.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-05-12 14:10:15 +00:00
Takeshi Yamamuro	101b0cc313	[SPARK-35253][SQL][BUILD] Bump up the janino version to v3.1.4 ### What changes were proposed in this pull request? This PR proposes to bump up the janino version from 3.0.16 to v3.1.4. The major changes of this upgrade are as follows: - Fixed issue #131: Janino 3.1.2 is 10x slower than 3.0.11: The Compiler's IClassLoader was initialized way too eagerly, thus lots of classes were loaded from the class path, which is very slow. - Improved the encoding of stack map frames according to JVMS11 4.7.4: Previously, only "full_frame"s were generated. - Fixed issue #107: Janino requires "org.codehaus.commons.compiler.io", but commons-compiler does not export this package - Fixed the promotion of the array access index expression (see JLS7 15.13 Array Access Expressions). For all the changes, please see the change log: http://janino-compiler.github.io/janino/changelog.html NOTE1: I've checked that there is no obvious performance regression. For all the data, see this link: https://docs.google.com/spreadsheets/d/1srxT9CioGQg1fLKM3Uo8z1sTzgCsMj4pg6JzpdcG6VU/edit?usp=sharing NOTE2: We upgraded janino to 3.1.2 (#27860) once before, but the commit was reverted in #29495 because of a correctness issue. Recently, #32374 checked whether Spark could land on v3.1.3, but a new bug was found there. These known issues have been fixed in v3.1.4 by the following PRs: - janino-compiler/janino#145 - janino-compiler/janino#146 ### Why are the changes needed? janino v3.0.X is no longer maintained. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? GA passed. Closes #32455 from maropu/janino_v3.1.4. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Sean Owen <srowen@gmail.com>	2021-05-12 08:57:57 -05:00
Angerszhuuuu	ed059541eb	[SPARK-29145][SQL][FOLLOWUP] Clean up code about support sub-queries in join conditions ### What changes were proposed in this pull request? Clean up the code according to the discussion at https://github.com/apache/spark/pull/25854#discussion_r629451135 ### Why are the changes needed? Code cleanup. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing UT. Closes #32499 from AngersZhuuuu/SPARK-29145-fix. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-05-12 13:45:53 +00:00
Yingyi Bu	d92018ee35	[SPARK-35298][SQL] Migrate to transformWithPruning for rules in Optimizer.scala ### What changes were proposed in this pull request? Added the following TreePattern enums: - ALIAS - AND_OR - AVERAGE - GENERATE - INTERSECT - SORT - SUM - DISTINCT_LIKE - PROJECT - REPARTITION_OPERATION - UNION Added tree traversal pruning to the following rules in Optimizer.scala: - EliminateAggregateFilter - RemoveRedundantAggregates - RemoveNoopOperators - RemoveNoopUnion - LimitPushDown - ColumnPruning - CollapseRepartition - OptimizeRepartition - OptimizeWindowFunctions - CollapseWindow - TransposeWindow - InferFiltersFromGenerate - InferFiltersFromConstraints - CombineUnions - CombineFilters - EliminateSorts - PruneFilters - EliminateLimits - DecimalAggregates - ConvertToLocalRelation - ReplaceDistinctWithAggregate - ReplaceIntersectWithSemiJoin - ReplaceExceptWithAntiJoin - RewriteExceptAll - RewriteIntersectAll - RemoveLiteralFromGroupExpressions - RemoveRepetitionFromGroupExpressions - OptimizeLimitZero ### Why are the changes needed? Reduce the number of tree traversals and hence improve the query compilation latency. 
perf diff:

| Rule name | Total Time (baseline) | Total Time (experiment) | experiment/baseline |
| --- | --- | --- | --- |
| RemoveRedundantAggregates | 51290766 | 67070477 | 1.31 |
| RemoveNoopOperators | 192371141 | 196631275 | 1.02 |
| RemoveNoopUnion | 49222561 | 43266681 | 0.88 |
| LimitPushDown | 40885185 | 21672646 | 0.53 |
| ColumnPruning | 2003406120 | 1285562149 | 0.64 |
| CollapseRepartition | 40648048 | 72646515 | 1.79 |
| OptimizeRepartition | 37813850 | 20600803 | 0.54 |
| OptimizeWindowFunctions | 174426904 | 46741409 | 0.27 |
| CollapseWindow | 38959957 | 24542426 | 0.63 |
| TransposeWindow | 33533191 | 20414930 | 0.61 |
| InferFiltersFromGenerate | 21758688 | 15597344 | 0.72 |
| InferFiltersFromConstraints | 518009794 | 493282321 | 0.95 |
| CombineUnions | 67694022 | 70550382 | 1.04 |
| CombineFilters | 35265060 | 29005424 | 0.82 |
| EliminateSorts | 57025509 | 19795776 | 0.35 |
| PruneFilters | 433964815 | 465579200 | 1.07 |
| EliminateLimits | 44275393 | 24476859 | 0.55 |
| DecimalAggregates | 83143172 | 28816090 | 0.35 |
| ReplaceDistinctWithAggregate | 21783760 | 18287489 | 0.84 |
| ReplaceIntersectWithSemiJoin | 22311271 | 16566393 | 0.74 |
| ReplaceExceptWithAntiJoin | 23838520 | 16588808 | 0.70 |
| RewriteExceptAll | 32750296 | 29421957 | 0.90 |
| RewriteIntersectAll | 29760454 | 21243599 | 0.71 |
| RemoveLiteralFromGroupExpressions | 28151861 | 25270947 | 0.90 |
| RemoveRepetitionFromGroupExpressions | 29587030 | 23447041 | 0.79 |
| OptimizeLimitZero | 18081943 | 15597344 | 0.86 |
| Accumulated | 4129959311 | 3112676285 | 0.75 |

### How was this patch tested? Existing tests. Closes #32439 from sigmod/optimizer. Authored-by: Yingyi Bu <yingyi.bu@databricks.com> Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>	2021-05-12 20:42:47 +08:00
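Why pattern-based pruning reduces tree traversals can be shown with a toy plan tree. The sketch below is a hypothetical Java model, not Catalyst's `transformWithPruning`: each node records the set of patterns present anywhere in its subtree, and a rule interested in one pattern skips every subtree whose set lacks it. The `Pattern` values reuse three enum names from the commit message; everything else is invented for illustration.

```java
import java.util.EnumSet;
import java.util.List;

// Hypothetical toy model of tree-pattern pruning; not Catalyst code.
final class TreePatternPruningSketch {
    enum Pattern { PROJECT, SORT, UNION }

    static final class Node {
        final Pattern kind;
        final List<Node> children;
        final EnumSet<Pattern> patterns; // patterns present in this subtree

        Node(Pattern kind, Node... children) {
            this.kind = kind;
            this.children = List.of(children);
            this.patterns = EnumSet.of(kind);
            for (Node child : children) {
                this.patterns.addAll(child.patterns);
            }
        }
    }

    // Counts the nodes a rule would visit if it skips subtrees that cannot
    // contain the required pattern.
    static int visitedWithPruning(Node node, Pattern required) {
        if (!node.patterns.contains(required)) {
            return 0; // whole subtree pruned without descending into it
        }
        int visited = 1;
        for (Node child : node.children) {
            visited += visitedWithPruning(child, required);
        }
        return visited;
    }

    public static void main(String[] args) {
        Node plan = new Node(Pattern.UNION,
            new Node(Pattern.PROJECT, new Node(Pattern.SORT)),
            new Node(Pattern.PROJECT, new Node(Pattern.PROJECT)));
        // A SORT-only rule visits 3 of the 5 nodes; the right branch is skipped.
        System.out.println(visitedWithPruning(plan, Pattern.SORT)); // 3
    }
}
```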
PengLei	82c520a3e2	[SPARK-35243][SQL] Support columnar execution on ANSI interval types ### What changes were proposed in this pull request? Add columnar execution support for the ANSI interval types, including YearMonthIntervalType and DayTimeIntervalType. ### Why are the changes needed? To support caching tables with ANSI interval types. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? run ./dev/lint-java run ./dev/scalastyle run test: CachedTableSuite run test: ColumnTypeSuite Closes #32452 from Peng-Lei/SPARK-35243. Lead-authored-by: PengLei <18066542445@189.cn> Co-authored-by: Lei Peng <peng.8lei@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-05-12 20:11:34 +09:00
Chao Sun	78221bda95	[SPARK-35361][SQL] Improve performance for ApplyFunctionExpression ### What changes were proposed in this pull request? In `ApplyFunctionExpression`, move `zipWithIndex` out of the loop for each input row. ### Why are the changes needed? When the `ScalarFunction` is trivial, `zipWithIndex` could incur significant costs, as shown below: <img width="899" alt="Screen Shot 2021-05-11 at 10 03 42 AM" src="https://user-images.githubusercontent.com/506679/117866421-fb19de80-b24b-11eb-8c94-d5e8c8b1eda9.png"> By removing it out of the loop, I'm seeing sometimes 2x speedup from `V2FunctionBenchmark`. For instance: Before: ``` scalar function (long + long) -> long, result_nullable = false codegen = false: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative native_long_add 32437 32896 434 15.4 64.9 1.0X java_long_add_default 85675 97045 NaN 5.8 171.3 0.4X ``` After: ``` scalar function (long + long) -> long, result_nullable = false codegen = false: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative native_long_add 30182 30387 279 16.6 60.4 1.0X java_long_add_default 42862 43009 209 11.7 85.7 0.7X ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests Closes #32507 from sunchao/SPARK-35361. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-05-12 10:16:35 +09:00
Yingyi Bu	7c9a9ec04f	[SPARK-35146][SQL] Migrate to transformWithPruning or resolveWithPruning for rules in finishAnalysis.scala ### What changes were proposed in this pull request? Added the following TreePattern enums: - BOOL_AGG - COUNT_IF - CURRENT_LIKE - RUNTIME_REPLACEABLE Added tree traversal pruning to the following rules: - ReplaceExpressions - RewriteNonCorrelatedExists - ComputeCurrentTime - GetCurrentDatabaseAndCatalog ### Why are the changes needed? Reduce the number of tree traversals and hence improve the query compilation latency. Performance improvement (org.apache.spark.sql.TPCDSQuerySuite): Rule name \| Total Time (baseline) \| Total Time (experiment) \| experiment/baseline ReplaceExpressions \| 27546369 \| 19753804 \| 0.72 RewriteNonCorrelatedExists \| 17304883 \| 2086194 \| 0.12 ComputeCurrentTime \| 35751301 \| 19984477 \| 0.56 GetCurrentDatabaseAndCatalog \| 37230787 \| 18874013 \| 0.51 ### How was this patch tested? Existing tests. Closes #32461 from sigmod/finish_analysis. Authored-by: Yingyi Bu <yingyi.bu@databricks.com> Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>	2021-05-11 17:11:38 +08:00
Cheng Su	c4ca23207b	[SPARK-35363][SQL] Refactor sort merge join code-gen be agnostic to join type ### What changes were proposed in this pull request? This is a pre-requisite of https://github.com/apache/spark/pull/32476, in discussion of https://github.com/apache/spark/pull/32476#issuecomment-836469779 . This is to refactor sort merge join code-gen to depend on streamed/buffered terminology, which makes the code-gen agnostic to different join types and can be extended to support other join types than inner join. ### Why are the changes needed? Pre-requisite of https://github.com/apache/spark/pull/32476. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing unit test in `InnerJoinSuite.scala` for inner join code-gen. Closes #32495 from c21/smj-refactor. Authored-by: Cheng Su <chengsu@fb.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2021-05-11 11:21:59 +09:00
gengjiaan	44bd0a8bd3	[SPARK-35088][SQL][FOLLOWUP] Improve the error message for Sequence expression ### What changes were proposed in this pull request? The Sequence expression outputs a confusing error message. This PR fixes the issue. ### Why are the changes needed? Improve the error message for Sequence expression ### Does this PR introduce _any_ user-facing change? Yes, this PR updates the error message of the Sequence expression. ### How was this patch tested? Tests updated. Closes #32492 from beliefer/SPARK-35088-followup. Authored-by: gengjiaan <gengjiaan@360.cn> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-05-11 09:45:09 +09:00
Gengliang Wang	d2a535f85b	[SPARK-34246][FOLLOWUP] Change the definition of `findTightestCommonType` for backward compatibility ### What changes were proposed in this pull request? Change the definition of `findTightestCommonType` from ``` def findTightestCommonType(t1: DataType, t2: DataType): Option[DataType] ``` to ``` val findTightestCommonType: (DataType, DataType) => Option[DataType] ``` ### Why are the changes needed? For backward compatibility. When running a MongoDB connector (built with Spark 3.1.1) with the latest master, there is such an error ``` java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.analysis.TypeCoercion$.findTightestCommonType()Lscala/Function2 ``` from https://github.com/mongodb/mongo-spark/blob/master/src/main/scala/com/mongodb/spark/sql/MongoInferSchema.scala#L150 In the previous release, the function was ``` static public scala.Function2<org.apache.spark.sql.types.DataType, org.apache.spark.sql.types.DataType, scala.Option<org.apache.spark.sql.types.DataType>> findTightestCommonType () ``` After https://github.com/apache/spark/pull/31349, the function becomes: ``` static public scala.Option<org.apache.spark.sql.types.DataType> findTightestCommonType (org.apache.spark.sql.types.DataType t1, org.apache.spark.sql.types.DataType t2) ``` This PR is to reduce the unnecessary API change. ### Does this PR introduce _any_ user-facing change? Yes, the definition of `TypeCoercion.findTightestCommonType` is consistent with previous release again. ### How was this patch tested? Existing unit tests Closes #32493 from gengliangwang/typecoercion. Authored-by: Gengliang Wang <ltnwgl@gmail.com> Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>	2021-05-10 23:26:39 +08:00
Angerszhuuuu	7182f8cece	[SPARK-35360][SQL] RepairTableCommand respects `spark.sql.addPartitionInBatch.size` too ### What changes were proposed in this pull request? RepairTableCommand respects `spark.sql.addPartitionInBatch.size` too ### Why are the changes needed? Make the batch size that RepairTableCommand uses when adding partitions configurable. ### Does this PR introduce _any_ user-facing change? Users can use `spark.sql.addPartitionInBatch.size` to change the batch size when repairing a table. ### How was this patch tested? Not needed. Closes #32489 from AngersZhuuuu/SPARK-35360. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-05-10 14:53:31 +05:00
Chao Sun	245dce1ea1	[SPARK-35261][SQL][TESTS][FOLLOW-UP] Change failOnError to false for NativeAdd in V2FunctionBenchmark ### What changes were proposed in this pull request? Change `failOnError` to false for `NativeAdd` in `V2FunctionBenchmark`. ### Why are the changes needed? Since `NativeAdd` is simply doing addition on long it's better to set `failOnError` to false so it will use native long addition instead of `Math.addExact`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? N/A Closes #32481 from sunchao/SPARK-35261-follow-up. Authored-by: Chao Sun <sunchao@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-05-10 07:20:05 +00:00
Angerszhuuuu	2c8ced9590	[SPARK-35111][SPARK-35112][SQL][FOLLOWUP] Rename ANSI interval patterns and regexps ### What changes were proposed in this pull request? Rename pattern strings and regexps of year-month and day-time intervals. ### Why are the changes needed? To improve code maintainability. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By existing test suites. Closes #32444 from AngersZhuuuu/SPARK-35111-followup. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-05-10 11:33:27 +05:00
Cheng Su	38eb5a6936	[SPARK-35354][SQL] Replace BaseJoinExec with ShuffledJoin in CoalesceBucketsInJoin ### What changes were proposed in this pull request? As title. We should use the more restrictive interface `ShuffledJoin` rather than `BaseJoinExec` in `CoalesceBucketsInJoin`, as the rule only applies to sort merge join and shuffled hash join (i.e. `ShuffledJoin`). ### Why are the changes needed? Code cleanup. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing unit test in `CoalesceBucketsInJoinSuite`. Closes #32480 from c21/minor-cleanup. Authored-by: Cheng Su <chengsu@fb.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2021-05-10 10:04:49 +09:00
Ruifeng Zheng	620f0727e3	[SPARK-35231][SQL] logical.Range override maxRowsPerPartition ### What changes were proposed in this pull request? When `numSlices` is available, `logical.Range` should compute an exact `maxRowsPerPartition`. ### Why are the changes needed? `maxRowsPerPartition` is used in the optimizer, so we should provide an exact value if possible. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing test suites. Closes #32350 from zhengruifeng/range_maxRowsPerPartition. Authored-by: Ruifeng Zheng <ruifengz@foxmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2021-05-09 21:44:49 +09:00
Liang-Chi Hsieh	5b65d8a129	[SPARK-35347][SQL] Use MethodUtils for looking up methods in Invoke and StaticInvoke ### What changes were proposed in this pull request? This patch proposes to use `MethodUtils` for looking up methods in `Invoke` and `StaticInvoke` expressions. ### Why are the changes needed? Currently we use our own logic in `Invoke` and `StaticInvoke` expressions for looking up methods. It is tricky to consider all the cases, and there is already an existing utility package for this purpose. We should reuse the utility package. ### Does this PR introduce _any_ user-facing change? No, internal change only. ### How was this patch tested? Existing tests. Closes #32474 from viirya/invoke-util. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-05-08 15:17:30 -07:00
Dongjoon Hyun	e31bef1ed4	Revert "[SPARK-35321][SQL] Don't register Hive permanent functions when creating Hive client" This reverts commit `b4ec9e2304`.	2021-05-08 13:01:17 -07:00
Takeshi Yamamuro	06c40091a6	[SPARK-35327][SQL][TESTS] Filters out the TPC-DS queries that can cause flaky test results ### What changes were proposed in this pull request? This PR proposes to filter out TPCDS v1.4 q6 and q75 in `TPCDSQueryTestSuite`. I saw `TPCDSQueryTestSuite` fail nondeterministically because the output row orders were different from those in the golden files. For example, the failure in the GA job, https://github.com/linhongliu-db/spark/runs/2507928605?check_suite_focus=true, happened because the `tpcds/q6.sql` query output rows were only sorted by `cnt`: `a0c76a8755/sql/core/src/test/resources/tpcds/q6.sql (L20)` Actually, `tpcds/q6.sql` and `tpcds-v2.7.0/q6.sql` are almost the same and the only difference is that `tpcds-v2.7.0/q6.sql` sorts by both `cnt` and `a.ca_state`: `a0c76a8755/sql/core/src/test/resources/tpcds-v2.7.0/q6.sql (L22)` So, I think it's okay just to test `tpcds-v2.7.0/q6.sql` in this case (q75 has the same issue). ### Why are the changes needed? For stable testing. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? GA passed. Closes #32454 from maropu/CleanUpTpcdsQueries. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2021-05-08 21:43:39 +09:00
Kent Yao	b0257801d5	[SPARK-35331][SQL] Support resolving missing attrs for distribute/cluster by/repartition hint ### What changes were proposed in this pull request? This PR makes the below case work well. ```sql select a b from values(1) t(a) distribute by a; ``` ```logtalk == Parsed Logical Plan == 'RepartitionByExpression ['a] +- 'Project ['a AS b#42] +- 'SubqueryAlias t +- 'UnresolvedInlineTable [a], [List(1)] == Analyzed Logical Plan == org.apache.spark.sql.AnalysisException: cannot resolve 'a' given input columns: [b]; line 1 pos 62; 'RepartitionByExpression ['a] +- Project [a#48 AS b#42] +- SubqueryAlias t +- LocalRelation [a#48] ``` ### Why are the changes needed? bugfix ### Does this PR introduce _any_ user-facing change? yes, the original attributes can be used in `distribute by` / `cluster by` and hints like `/*+ REPARTITION(3, c) */` ### How was this patch tested? new tests Closes #32465 from yaooqinn/SPARK-35331. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-05-08 05:00:51 -07:00
Chao Sun	323a6e848e	[SPARK-35232][SQL] Nested column pruning should retain column metadata ### What changes were proposed in this pull request? Retain column metadata during the process of nested column pruning, when constructing `StructField`. To test the above change, this also added the logic of column projection in `InMemoryTable`. Without the fix `DSV2CharVarcharDDLTestSuite` will fail. ### Why are the changes needed? The column metadata is used in a few places such as re-constructing CHAR/VARCHAR information such as in [SPARK-33901](https://issues.apache.org/jira/browse/SPARK-33901). Therefore, we should retain the info during nested column pruning. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests. Closes #32354 from sunchao/SPARK-35232. Authored-by: Chao Sun <sunchao@apache.org> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2021-05-07 22:37:54 -07:00
Chao Sun	f47e0f8379	[SPARK-35261][SQL] Support static magic method for stateless Java ScalarFunction ### What changes were proposed in this pull request? This allows `ScalarFunction` implemented in Java to optionally specify the magic method `invoke` to be static, which can be used if the UDF is stateless. Compared to the non-static method, it can potentially give better performance due to elimination of dynamic dispatch, etc. Also added a benchmark to measure performance of: the default `produceResult`, non-static magic method and static magic method. ### Why are the changes needed? For UDFs that are stateless (e.g., no need to maintain intermediate state between each function call), it's better to allow users to implement the UDF function as a static method, which could potentially give better performance. ### Does this PR introduce _any_ user-facing change? Yes. Spark users can now choose to define a static magic method for `ScalarFunction` when it is written in Java and when the UDF is stateless. ### How was this patch tested? Added new UT. Closes #32407 from sunchao/SPARK-35261. Authored-by: Chao Sun <sunchao@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-05-07 20:34:51 -07:00
Chao Sun	b4ec9e2304	[SPARK-35321][SQL] Don't register Hive permanent functions when creating Hive client ### What changes were proposed in this pull request? Instantiate a new Hive client through `Hive.getWithFastCheck(conf, false)` instead of `Hive.get(conf)`. ### Why are the changes needed? [HIVE-10319](https://issues.apache.org/jira/browse/HIVE-10319) introduced a new API `get_all_functions` which is only supported in Hive 1.3.0/2.0.0 and up. As result, when Spark 3.x talks to a HMS service of version 1.2 or lower, the following error will occur: ``` Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.TApplicationException: Invalid method name: 'get_all_functions' at org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3897) at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:248) at org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:231) ... 96 more Caused by: org.apache.thrift.TApplicationException: Invalid method name: 'get_all_functions' at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_all_functions(ThriftHiveMetastore.java:3845) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_all_functions(ThriftHiveMetastore.java:3833) ``` The `get_all_functions` is called only when `doRegisterAllFns` is set to true: ```java private Hive(HiveConf c, boolean doRegisterAllFns) throws HiveException { conf = c; if (doRegisterAllFns) { registerAllFunctionsOnce(); } } ``` what this does is to register all Hive permanent functions defined in HMS in Hive's `FunctionRegistry` class, via iterating through results from `get_all_functions`. To Spark, this seems unnecessary as it loads Hive permanent (not built-in) UDF via directly calling the HMS API, i.e., `get_function`. The `FunctionRegistry` is only used in loading Hive's built-in function that is not supported by Spark. 
At this time, it only applies to `histogram_numeric`. ### Does this PR introduce _any_ user-facing change? Yes with this fix Spark now should be able to talk to HMS server with Hive 1.2.x and lower (with HIVE-24608 too) ### How was this patch tested? Manually started a HMS server of Hive version 1.2.2, with patched Hive 2.3.8 using HIVE-24608. Without the PR it failed with the above exception. With the PR the error disappeared and I can successfully perform common operations such as create table, create database, list tables, etc. Closes #32446 from sunchao/SPARK-35321. Authored-by: Chao Sun <sunchao@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-05-07 15:06:04 -07:00


11192 commits