ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Gengliang Wang	5d45a415f3	Preparing Spark release v3.2.0-rc7	2021-10-06 11:45:26 +00:00
tianhanhu	9760c8ab60	[SPARK-36919][SQL] Make BadRecordException fields transient ### What changes were proposed in this pull request? While migrating a Spark application from 2.4.x to 3.1.x, we found a difference in the exception chaining behavior. When parsing a malformed CSV, where the root cause exception should be `Caused by: java.lang.RuntimeException: Malformed CSV record`, only the top-level exception is kept, and all lower-level exceptions and the root cause are lost. Thus, when we call `ExceptionUtils.getRootCause` on the exception, we get the exception itself back. The reason for the difference is that the `RuntimeException` is wrapped in `BadRecordException`, which has unserializable fields. When we try to serialize the exception from tasks and deserialize it in the scheduler, the exception is lost. This PR makes the unserializable fields of `BadRecordException` transient, so the rest of the exception can be serialized and deserialized properly. ### Why are the changes needed? Make BadRecordException serializable ### Does this PR introduce _any_ user-facing change? Users can get the root cause of BadRecordException ### How was this patch tested? Unit testing Closes #34167 from tianhanhu/master. Authored-by: tianhanhu <adrianhu96@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit `aed977c468`) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-10-06 19:06:22 +09:00
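The serialization behavior described in SPARK-36919 can be reproduced with plain Java serialization. The following is a minimal sketch, not Spark's code: `RecordException` and its `record` field are hypothetical stand-ins for an exception that carries a non-serializable field. Marking the field `transient` lets the rest of the exception, including its cause chain, survive a round trip.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.function.Supplier;

final class TransientFieldDemo {
    // Hypothetical exception type; the lambda stored in `record` is not serializable.
    static final class RecordException extends Exception {
        private static final long serialVersionUID = 1L;
        // `transient` makes Java serialization skip this field instead of failing on it.
        private final transient Supplier<String> record;

        RecordException(Supplier<String> record, Throwable cause) {
            super(cause);
            this.record = record;
        }
    }

    // Serializes the throwable to bytes and reads it back, as happens between processes.
    static Throwable roundTrip(Throwable t) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(t);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Throwable) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("round trip failed", e);
        }
    }

    // Walks the cause chain down to the innermost exception.
    static Throwable rootCause(Throwable t) {
        while (t.getCause() != null) {
            t = t.getCause();
        }
        return t;
    }

    public static void main(String[] args) {
        Throwable original =
            new RecordException(() -> "a,b,c", new RuntimeException("Malformed CSV record"));
        System.out.println(rootCause(roundTrip(original)).getMessage());
    }
}
```

Without the `transient` modifier, `writeObject` would throw `java.io.NotSerializableException` for the lambda, which is the kind of failure that causes the exception chain to be dropped.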
Wenchen Fan	688808900d	[SPARK-36926][3.2][SQL] Decimal average mistakenly overflow backport https://github.com/apache/spark/pull/34180 ### What changes were proposed in this pull request? This bug was introduced by https://github.com/apache/spark/pull/33177 When checking overflow of the sum value in the average function, we should use the `sumDataType` instead of the input decimal type. ### Why are the changes needed? fix a regression ### Does this PR introduce _any_ user-facing change? Yes, the result was wrong before this PR. ### How was this patch tested? a new test Closes #34193 from cloud-fan/bug. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-10-06 13:31:03 +08:00
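The overflow check in SPARK-36926 can be illustrated with `java.math.BigDecimal`. This is only a sketch under stated assumptions: the `fits` helper is hypothetical, and widening the sum type by 10 digits of precision is an assumption made for illustration, since the commit message only names `sumDataType`. The point is that an intermediate sum may exceed the input precision even though the final average fits.

```java
import java.math.BigDecimal;

final class DecimalAverageOverflow {
    // Hypothetical helper: true when the value needs no more digits than `precision` allows.
    static boolean fits(BigDecimal value, int precision) {
        return value.precision() <= precision;
    }

    public static void main(String[] args) {
        int inputPrecision = 2;                 // e.g. a decimal(2, 0) input column
        int sumPrecision = inputPrecision + 10; // assumed wider type for the running sum

        BigDecimal sum = new BigDecimal("99").add(new BigDecimal("99")); // 198
        BigDecimal avg = sum.divide(new BigDecimal("2"));                // 99

        // Checking the sum against the input type reports a spurious overflow.
        System.out.println(fits(sum, inputPrecision)); // false
        // Checking the sum against the wider sum type does not.
        System.out.println(fits(sum, sumPrecision));   // true
        // The final average fits the input precision.
        System.out.println(fits(avg, inputPrecision)); // true
    }
}
```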
Kousuke Saruta	8ffe00e745	[SPARK-36874][SQL] DeduplicateRelations should copy dataset_id tag to avoid ambiguous self join ### What changes were proposed in this pull request? This PR fixes an issue that ambiguous self join can't be detected if the left and right DataFrame are swapped. This is an example. ``` val df1 = Seq((1, 2, "A1"),(2, 1, "A2")).toDF("key1", "key2", "value") val df2 = df1.filter($"value" === "A2") df1.join(df2, df1("key1") === df2("key2")) // Ambiguous self join is detected and AnalysisException is thrown. df2.join(df1, df1("key1") === df2("key2")) // Ambiguous self join is not detected. ``` The root cause seems to be that an inner function `collectConflictPlans` in `DeduplicateRelations` doesn't copy the `dataset_id` tag when it copies a `LogicalPlan`. ### Why are the changes needed? Bug fix. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New tests. Closes #34172 from sarutak/fix-deduplication-issue. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `fa1805db48`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-10-05 11:17:12 +08:00
Gengliang Wang	4bd358474b	Preparing development version 3.2.1-SNAPSHOT	2021-09-28 10:53:42 +00:00
Gengliang Wang	dde73e2e1c	Preparing Spark release v3.2.0-rc6	2021-09-28 10:53:35 +00:00
Richard Chen	493aad03ab	[SPARK-36836][SQL] Fix incorrect result in `sha2` expression ### What changes were proposed in this pull request? `sha2(input, bit_length)` returns incorrect results when `bit_length == 224` for all inputs. This error can be reproduced by running `spark.sql("SELECT sha2('abc', 224)").show()`, for instance, in spark-shell. Spark currently returns ``` #\t}"4�"�B�w��U�*��你��l�� ``` while the expected result is ``` 23097d223405d8228642a477bda255b32aadbce4bda0b3f7e36c9da7 ``` This appears to happen because the `MessageDigest.digest()` function appears to return bytes intended to be interpreted as a `BigInt` rather than a string. Thus, the output of `MessageDigest.digest()` must first be interpreted as a `BigInt` and then transformed into a hex string rather than directly being interpreted as a hex string. ### Why are the changes needed? `sha2(input, bit_length)` with a `bit_length` input of `224` would previously return the incorrect result. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added new test to `HashExpressionsSuite.scala` which previously failed and now pass Closes #34086 from richardc-db/sha224. Authored-by: Richard Chen <r.chen@databricks.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit `6c6291b3f6`) Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-09-28 18:38:33 +08:00
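The approach described in SPARK-36836 (treat the digest bytes as an unsigned big integer, then render it as hex) can be sketched with the JDK alone. This is an illustration rather than the code from the patch; the class and method names are hypothetical.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Sha224Hex {
    // Returns the SHA-224 digest of the UTF-8 bytes of `input` as 56 lowercase hex digits.
    static String sha224Hex(String input) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-224")
                .digest(input.getBytes(StandardCharsets.UTF_8));
            // Signum 1 interprets the bytes as a non-negative integer;
            // %056x left-pads with zeros to 224 / 4 = 56 hex digits.
            return String.format("%056x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-224 is not available in this JDK", e);
        }
    }

    public static void main(String[] args) {
        // Prints 23097d223405d8228642a477bda255b32aadbce4bda0b3f7e36c9da7
        System.out.println(sha224Hex("abc"));
    }
}
```

The zero padding matters because `BigInteger` drops leading zero bytes, which would otherwise shorten the hex string for some inputs.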
Yuming Wang	8f0c846b1d	Revert "[SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type" ### What changes were proposed in this pull request? This reverts commit `aaa0d2a66b`. ### Why are the changes needed? This approach has 2 disadvantages: 1. It needs to disable `spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly`. 2. The filtering side will be evaluated 2 times. For example: https://github.com/apache/spark/pull/29726#issuecomment-780266596 Instead, we can use bloom filter join pruning in the future. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #34116 from wangyum/revert-SPARK-32855. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Yuming Wang <yumwang@ebay.com> (cherry picked from commit `e024bdc306`) Closes #34124 from wangyum/revert-SPARK-32855-branch-3.2. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-09-28 15:38:35 +08:00
Gengliang Wang	0c57bb8f7f	Preparing development version 3.2.1-SNAPSHOT	2021-09-27 08:24:50 +00:00
Gengliang Wang	49aea14c5a	Preparing Spark release v3.2.0-rc5	2021-09-27 08:24:44 +00:00
Gengliang Wang	2348cce37e	Preparing development version 3.2.1-SNAPSHOT	2021-09-26 12:28:46 +00:00
Gengliang Wang	2ed8c08c5b	Preparing Spark release v3.2.0-rc5	2021-09-26 12:28:40 +00:00
PengLei	eb794a4f58	[SPARK-36851][SQL] Incorrect parsing of negative ANSI typed interval literals ### What changes were proposed in this pull request? Handle incorrect parsing of negative ANSI typed interval literals [SPARK-36851](https://issues.apache.org/jira/browse/SPARK-36851) ### Why are the changes needed? Incorrect result: ``` spark-sql> select interval -'1' year; 1-0 ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Add ut testcase Closes #34107 from Peng-Lei/SPARK-36851. Authored-by: PengLei <peng.8lei@gmail.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit `0fdca1f0df`) Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-09-26 18:43:38 +08:00
Gengliang Wang	da722d43cb	Preparing development version 3.2.1-SNAPSHOT	2021-09-24 10:03:23 +00:00
Gengliang Wang	9e35703211	Preparing Spark release v3.2.0-rc5	2021-09-24 10:03:16 +00:00
Angerszhuuuu	b7174188e5	[SPARK-36792][SQL] InSet should handle NaN ### What changes were proposed in this pull request? InSet should handle NaN ``` InSet(Literal(Double.NaN), Set(Double.NaN, 1d)) should return true, but return false. ``` ### Why are the changes needed? InSet should handle NaN ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added UT Closes #34033 from AngersZhuuuu/SPARK-36792. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `64f4bf47af`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-09-24 16:19:47 +08:00
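The NaN behavior behind SPARK-36792 comes from IEEE 754 comparison semantics: `NaN == NaN` is false for primitive doubles, so a membership test built on `==` never finds NaN. The sketch below uses hypothetical helper names and plain arrays instead of Spark's `InSet`; it only illustrates the comparison issue.

```java
final class NaNMembership {
    // Membership test using primitive equality: never matches NaN.
    static boolean containsPrimitive(double[] set, double value) {
        for (double x : set) {
            if (x == value) {
                return true;
            }
        }
        return false;
    }

    // Membership test that additionally treats NaN as equal to NaN.
    static boolean containsNaNAware(double[] set, double value) {
        for (double x : set) {
            if (x == value || (Double.isNaN(x) && Double.isNaN(value))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        double[] set = {Double.NaN, 1d};
        System.out.println(containsPrimitive(set, Double.NaN)); // false
        System.out.println(containsNaNAware(set, Double.NaN));  // true
    }
}
```

The explicit `isNaN` check is used here instead of `Double.compare(x, value) == 0` because `Double.compare` also distinguishes `0.0` from `-0.0`, which would change behavior for non-NaN values.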
allisonwang-db	d0c97d6ed9	[SPARK-36747][SQL][3.2] Do not collapse Project with Aggregate when correlated subqueries are present in the project list ### What changes were proposed in this pull request? This PR adds a check in the optimizer rule `CollapseProject` to avoid combining Project with Aggregate when the project list contains one or more correlated scalar subqueries that reference the output of the aggregate. Combining Project with Aggregate can lead to an invalid plan after correlated subquery rewrite. This is because correlated scalar subqueries' references are used as join conditions, which cannot host aggregate expressions. For example ```sql select (select sum(c2) from t where c1 = cast(s as int)) from (select sum(c2) s from t) ``` ``` == Optimized Logical Plan == Aggregate [sum(c2)#10L AS scalarsubquery(s)#11L] <--- Aggregate has neither grouping nor aggregate expressions. +- Project [sum(c2)#10L] +- Join LeftOuter, (c1#2 = cast(sum(c2#3) as int)) <--- Aggregate expression in join condition :- LocalRelation [c2#3] +- Aggregate [c1#2], [sum(c2#3) AS sum(c2)#10L, c1#2] +- LocalRelation [c1#2, c2#3] java.lang.UnsupportedOperationException: Cannot generate code for expression: sum(input[0, int, false]) ``` Currently, we only allow a correlated scalar subquery in Aggregate if it is also in the grouping expressions. `079a9c5292/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala (L661-L666)` ### Why are the changes needed? To fix an existing optimizer issue. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test. Authored-by: allisonwang-db <allison.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `4a8dc5f7a3`) Signed-off-by: allisonwang-db <allison.wang@databricks.com> Closes #34081 from allisonwang-db/cp-spark-36747. Authored-by: allisonwang-db <allison.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-09-24 16:14:49 +08:00
Gengliang Wang	0fb7127f85	Preparing development version 3.2.1-SNAPSHOT	2021-09-23 08:46:28 +00:00
Gengliang Wang	b609f2fe0c	Preparing Spark release v3.2.0-rc4	2021-09-23 08:46:22 +00:00
Michael Chen	89894a4b1d	[SPARK-36795][SQL] Explain Formatted has Duplicate Node IDs Fixed explain formatted mode so it doesn't have duplicate node IDs when InMemoryRelation is present in query plan. Having duplicated node IDs in the plan makes it confusing. Yes, explain formatted string will change. Notice how `ColumnarToRow` and `InMemoryRelation` have node id of 2. Before changes => ``` == Physical Plan == AdaptiveSparkPlan (14) +- == Final Plan == * BroadcastHashJoin Inner BuildLeft (9) :- BroadcastQueryStage (5) : +- BroadcastExchange (4) : +- * Filter (3) : +- * ColumnarToRow (2) : +- InMemoryTableScan (1) : +- InMemoryRelation (2) : +- * ColumnarToRow (4) : +- Scan parquet default.t1 (3) +- * Filter (8) +- * ColumnarToRow (7) +- Scan parquet default.t2 (6) +- == Initial Plan == BroadcastHashJoin Inner BuildLeft (13) :- BroadcastExchange (11) : +- Filter (10) : +- InMemoryTableScan (1) : +- InMemoryRelation (2) : +- * ColumnarToRow (4) : +- Scan parquet default.t1 (3) +- Filter (12) +- Scan parquet default.t2 (6) (1) InMemoryTableScan Output [1]: [k#x] Arguments: [k#x], [isnotnull(k#x)] (2) InMemoryRelation Arguments: [k#x], CachedRDDBuilder(org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer401788d5,StorageLevel(disk, memory, deserialized, 1 replicas),(1) ColumnarToRow +- FileScan parquet default.t1[k#x] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/mike.chen/code/apacheSpark/spark/spark-warehouse/org.apach..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<k:int> ,None) (3) Scan parquet default.t1 Output [1]: [k#x] Batched: true Location: InMemoryFileIndex [file:/Users/mike.chen/code/apacheSpark/spark/spark-warehouse/org.apache.spark.sql.ExplainSuiteAE/t1] ReadSchema: struct<k:int> (4) ColumnarToRow [codegen id : 1] Input [1]: [k#x] (5) BroadcastQueryStage Output [1]: [k#x] Arguments: 0 (6) Scan parquet default.t2 Output [1]: [key#x] Batched: true Location: InMemoryFileIndex 
[file:/Users/mike.chen/code/apacheSpark/spark/spark-warehouse/org.apache.spark.sql.ExplainSuiteAE/t2] PushedFilters: [IsNotNull(key)] ReadSchema: struct<key:int> (7) ColumnarToRow Input [1]: [key#x] (8) Filter Input [1]: [key#x] Condition : isnotnull(key#x) (9) BroadcastHashJoin [codegen id : 2] Left keys [1]: [k#x] Right keys [1]: [key#x] Join condition: None (10) Filter Input [1]: [k#x] Condition : isnotnull(k#x) (11) BroadcastExchange Input [1]: [k#x] Arguments: HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [id=#x] (12) Filter Input [1]: [key#x] Condition : isnotnull(key#x) (13) BroadcastHashJoin Left keys [1]: [k#x] Right keys [1]: [key#x] Join condition: None (14) AdaptiveSparkPlan Output [2]: [k#x, key#x] Arguments: isFinalPlan=true ``` After Changes => ``` == Physical Plan == AdaptiveSparkPlan (17) +- == Final Plan == BroadcastHashJoin Inner BuildLeft (12) :- BroadcastQueryStage (8) : +- BroadcastExchange (7) : +- * Filter (6) : +- * ColumnarToRow (5) : +- InMemoryTableScan (1) : +- InMemoryRelation (2) : +- * ColumnarToRow (4) : +- Scan parquet default.t1 (3) +- * Filter (11) +- * ColumnarToRow (10) +- Scan parquet default.t2 (9) +- == Initial Plan == BroadcastHashJoin Inner BuildLeft (16) :- BroadcastExchange (14) : +- Filter (13) : +- InMemoryTableScan (1) : +- InMemoryRelation (2) : +- * ColumnarToRow (4) : +- Scan parquet default.t1 (3) +- Filter (15) +- Scan parquet default.t2 (9) (1) InMemoryTableScan Output [1]: [k#x] Arguments: [k#x], [isnotnull(k#x)] (2) InMemoryRelation Arguments: [k#x], CachedRDDBuilder(org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer3ccb12d,StorageLevel(disk, memory, deserialized, 1 replicas),*(1) ColumnarToRow +- FileScan parquet default.t1[k#x] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/mike.chen/code/apacheSpark/spark/spark-warehouse/org.apach..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<k:int> ,None) 
(3) Scan parquet default.t1 Output [1]: [k#x] Batched: true Location: InMemoryFileIndex [file:/Users/mike.chen/code/apacheSpark/spark/spark-warehouse/org.apache.spark.sql.ExplainSuiteAE/t1] ReadSchema: struct<k:int> (4) ColumnarToRow [codegen id : 1] Input [1]: [k#x] (5) ColumnarToRow [codegen id : 1] Input [1]: [k#x] (6) Filter [codegen id : 1] Input [1]: [k#x] Condition : isnotnull(k#x) (7) BroadcastExchange Input [1]: [k#x] Arguments: HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [id=#x] (8) BroadcastQueryStage Output [1]: [k#x] Arguments: 0 (9) Scan parquet default.t2 Output [1]: [key#x] Batched: true Location: InMemoryFileIndex [file:/Users/mike.chen/code/apacheSpark/spark/spark-warehouse/org.apache.spark.sql.ExplainSuiteAE/t2] PushedFilters: [IsNotNull(key)] ReadSchema: struct<key:int> (10) ColumnarToRow Input [1]: [key#x] (11) Filter Input [1]: [key#x] Condition : isnotnull(key#x) (12) BroadcastHashJoin [codegen id : 2] Left keys [1]: [k#x] Right keys [1]: [key#x] Join condition: None (13) Filter Input [1]: [k#x] Condition : isnotnull(k#x) (14) BroadcastExchange Input [1]: [k#x] Arguments: HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [id=#x] (15) Filter Input [1]: [key#x] Condition : isnotnull(key#x) (16) BroadcastHashJoin Left keys [1]: [k#x] Right keys [1]: [key#x] Join condition: None (17) AdaptiveSparkPlan Output [2]: [k#x, key#x] Arguments: isFinalPlan=true ``` add test Closes #34036 from ChenMichael/SPARK-36795-Duplicate-node-id-with-inMemoryRelation. Authored-by: Michael Chen <mike.chen@workday.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit `6d7ab7b52b`) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-09-23 15:55:15 +09:00
Hyukjin Kwon	af569d1b0a	[MINOR][SQL][DOCS] Correct the 'options' description on UnresolvedRelation ### What changes were proposed in this pull request? This PR fixes the 'options' description on `UnresolvedRelation`. This comment was added in https://github.com/apache/spark/pull/29535 but not valid anymore because V1 also uses this `options` (and merge the options with the table properties) per https://github.com/apache/spark/pull/29712. This PR can go through from `master` to `branch-3.1`. ### Why are the changes needed? To make `UnresolvedRelation.options`'s description clearer. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? Scala linter by `dev/linter-scala`. Closes #34075 from HyukjinKwon/minor-comment-unresolved-releation. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Huaxin Gao <huaxin_gao@apple.com> (cherry picked from commit `0076eba8d0`) Signed-off-by: Huaxin Gao <huaxin_gao@apple.com>	2021-09-22 23:00:35 -07:00
Angerszhuuuu	2ff038a7b3	[SPARK-36753][SQL] ArrayExcept handle duplicated Double.NaN and Float.NaN ### What changes were proposed in this pull request? For query ``` select array_except(array(cast('nan' as double), 1d), array(cast('nan' as double))) ``` This returns [NaN, 1d], but it should return [1d]. This issue occurs because `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` either. This PR fixes it based on https://github.com/apache/spark/pull/33955 ### Why are the changes needed? Fix bug ### Does this PR introduce _any_ user-facing change? ArrayExcept now treats `NaN` values as equal ### How was this patch tested? Added UT Closes #33994 from AngersZhuuuu/SPARK-36753. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `a7cbe69986`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-09-22 23:51:58 +08:00
Ivan Sadikov	fc0b85fb26	[SPARK-36803][SQL] Fix ArrayType conversion when reading Parquet files written in legacy mode ### What changes were proposed in this pull request? This PR fixes an issue when reading of a Parquet file written with legacy mode would fail due to incorrect Parquet LIST to ArrayType conversion. The issue arises when using schema evolution and utilising the parquet-mr reader. 2-level LIST annotated types could be parsed incorrectly as 3-level LIST annotated types because their underlying element type does not match the full inferred Catalyst schema. ### Why are the changes needed? It appears to be a long-standing issue with the legacy mode due to the imprecise check in ParquetRowConverter that was trying to determine Parquet backward compatibility using Catalyst schema: `DataType.equalsIgnoreCompatibleNullability(guessedElementType, elementType)` in https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala#L606. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added a new test case in ParquetInteroperabilitySuite.scala. Closes #34044 from sadikovi/parquet-legacy-write-mode-list-issue. Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `ec26d94eac`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-09-22 17:40:55 +08:00
Chao Sun	a28d8d9b0e	[SPARK-36820][3.2][SQL] Disable tests related to LZ4 for Hadoop 2.7 profile ### What changes were proposed in this pull request? Disable tests related to LZ4 in `FileSourceCodecSuite` and `FileSuite` when using `hadoop-2.7` profile. ### Why are the changes needed? At the moment, parquet-mr uses LZ4 compression codec provided by Hadoop, and only since HADOOP-17292 (in 3.3.1/3.4.0) the latter added `lz4-java` to remove the restriction that the codec can only be run with native library. As consequence, the test will fail when using `hadoop-2.7` profile. ### Does this PR introduce _any_ user-facing change? No, it's just test. ### How was this patch tested? Existing test Closes #34066 from sunchao/SpARK-36820-3.2. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2021-09-22 00:14:45 -07:00
Gengliang Wang	affd7a4d47	[SPARK-36670][FOLLOWUP][TEST] Remove brotli-codec dependency ### What changes were proposed in this pull request? Remove `com.github.rdblue:brotli-codec:0.1.1` dependency. ### Why are the changes needed? As Stephen Coy pointed out in the dev list, we should not have `com.github.rdblue:brotli-codec:0.1.1` dependency which is not available on Maven Central. This is to avoid possible artifact changes on `Jitpack.io`. Also, the dependency is for tests only. I suggest that we remove it now to unblock the 3.2.0 release ASAP. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? GA tests. Closes #34059 from gengliangwang/removeDeps. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit `ba5708d944`) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-09-21 10:57:34 -07:00
Max Gekk	7fa88b28a5	[SPARK-36807][SQL] Merge ANSI interval types to a tightest common type ### What changes were proposed in this pull request? In the PR, I propose to modify `StructType` to support merging of ANSI interval types with different fields. ### Why are the changes needed? This will allow merging of schemas from different datasource files. ### Does this PR introduce _any_ user-facing change? No, the ANSI interval types haven't released yet. ### How was this patch tested? Added new test to `StructTypeSuite`. Closes #34049 from MaxGekk/merge-ansi-interval-types. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit `d2340f8e1c`) Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-09-21 10:20:27 +03:00
Angerszhuuuu	337a1979d2	[SPARK-36754][SQL] ArrayIntersect handle duplicated Double.NaN and Float.NaN ### What changes were proposed in this pull request? For query ``` select array_intersect(array(cast('nan' as double), 1d), array(cast('nan' as double))) ``` This returns [NaN], but it should return []. This issue occurs because `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` either. This PR fixes it based on https://github.com/apache/spark/pull/33955 ### Why are the changes needed? Fix bug ### Does this PR introduce _any_ user-facing change? ArrayIntersect won't show equal `NaN` value ### How was this patch tested? Added UT Closes #33995 from AngersZhuuuu/SPARK-36754. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `2fc7f2f702`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-09-20 16:51:31 +08:00
Gengliang Wang	b0249851f6	Preparing development version 3.2.1-SNAPSHOT	2021-09-18 11:30:12 +00:00
Gengliang Wang	96044e9735	Preparing Spark release v3.2.0-rc3	2021-09-18 11:30:06 +00:00
Liang-Chi Hsieh	275ad6bd0b	[SPARK-36673][SQL][FOLLOWUP] Remove duplicate test in DataFrameSetOperationsSuite ### What changes were proposed in this pull request? As a followup of #34025 to remove duplicate test. ### Why are the changes needed? To remove duplicate test. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing test. Closes #34032 from viirya/remove. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit `f9644cc253`) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2021-09-17 11:52:26 -07:00
Angerszhuuuu	61d7f1da1b	[SPARK-36767][SQL] ArrayMin/ArrayMax/SortArray/ArraySort add comment and Unit test ### What changes were proposed in this pull request? Add comment about how ArrayMin/ArrayMax/SortArray/ArraySort handle NaN and add Unit test for this ### Why are the changes needed? Add Unit test ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added UT Closes #34008 from AngersZhuuuu/SPARK-36740. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `69e006dd53`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-09-17 21:42:21 +08:00
Liang-Chi Hsieh	895218996a	[SPARK-36673][SQL] Fix incorrect schema of nested types of union ### What changes were proposed in this pull request? This patch proposes to fix incorrect schema of `union`. ### Why are the changes needed? The current `union` result of nested struct columns is incorrect. By definition of `union` API, it should resolve columns by position, not by name. Right now when determining the `output` (aka. the schema) of union plan, we use `merge` API which actually merges two structs (simply think it as concatenate fields from two structs if not overlapping). The merging behavior doesn't match the `union` definition. So currently we get incorrect schema but the query result is correct. We should fix the incorrect schema. ### Does this PR introduce _any_ user-facing change? Yes, fixing a bug of incorrect schema. ### How was this patch tested? Added unit test. Closes #34025 from viirya/SPARK-36673. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `cdd7ae937d`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-09-17 21:37:40 +08:00
Angerszhuuuu	a78c4c44ed	[SPARK-36741][SQL] ArrayDistinct handle duplicated Double.NaN and Float.NaN ### What changes were proposed in this pull request? For query ``` select array_distinct(array(cast('nan' as double), cast('nan' as double))) ``` This returns [NaN, NaN], but it should return [NaN]. This issue occurs because `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` either. This PR fixes it based on https://github.com/apache/spark/pull/33955 ### Why are the changes needed? Fix bug ### Does this PR introduce _any_ user-facing change? ArrayDistinct won't show duplicated `NaN` values ### How was this patch tested? Added UT Closes #33993 from AngersZhuuuu/SPARK-36741. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `e356f6aa11`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-09-17 20:48:39 +08:00
Wenchen Fan	16215755b7	[SPARK-36789][SQL] Use the correct constant type as the null value holder in array functions ### What changes were proposed in this pull request? In array functions, we use constant 0 as the placeholder when adding a null value to an array buffer. This PR makes sure the constant 0 matches the type of the array element. ### Why are the changes needed? Fix a potential bug. Somehow we can hit this bug sometimes after https://github.com/apache/spark/pull/33955 . ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing tests Closes #34029 from cloud-fan/minor. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit `4145498826`) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-09-17 16:50:01 +09:00
Hyukjin Kwon	7d7c9915bb	[SPARK-36788][SQL] Change log level of AQE for non-supported plans from warning to debug ### What changes were proposed in this pull request? This PR suppresses the warnings for plans where AQE is not supported. Currently we show the warnings such as: ``` org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: Sort [a#324881 DESC NULLS FIRST], true, 23 +- Scan ExistingRDD[a#324881] ``` for every plan that AQE is not supported. ### Why are the changes needed? It's too noisy now. Below is the example of `SortSuite` run: ``` 14:51:40.675 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: Sort [a#324881 DESC NULLS FIRST], true, 23 +- Scan ExistingRDD[a#324881] . [info] - sorting on DayTimeIntervalType(0,1) with nullable=true, sortOrder=List('a DESC NULLS FIRST) (785 milliseconds) 14:51:41.416 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: ReferenceSort [a#324884 ASC NULLS FIRST], true +- Scan ExistingRDD[a#324884] . 14:51:41.467 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: Sort [a#324884 ASC NULLS FIRST], true, 23 +- Scan ExistingRDD[a#324884] . [info] - sorting on DayTimeIntervalType(0,1) with nullable=false, sortOrder=List('a ASC NULLS FIRST) (796 milliseconds) 14:51:42.210 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: ReferenceSort [a#324887 ASC NULLS LAST], true +- Scan ExistingRDD[a#324887] . 
14:51:42.259 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: Sort [a#324887 ASC NULLS LAST], true, 23 +- Scan ExistingRDD[a#324887] . [info] - sorting on DayTimeIntervalType(0,1) with nullable=false, sortOrder=List('a ASC NULLS LAST) (797 milliseconds) 14:51:43.009 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: ReferenceSort [a#324890 DESC NULLS LAST], true +- Scan ExistingRDD[a#324890] . 14:51:43.061 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: Sort [a#324890 DESC NULLS LAST], true, 23 +- Scan ExistingRDD[a#324890] . [info] - sorting on DayTimeIntervalType(0,1) with nullable=false, sortOrder=List('a DESC NULLS LAST) (848 milliseconds) 14:51:43.857 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: ReferenceSort [a#324893 DESC NULLS FIRST], true +- Scan ExistingRDD[a#324893] . 14:51:43.903 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: Sort [a#324893 DESC NULLS FIRST], true, 23 +- Scan ExistingRDD[a#324893] . [info] - sorting on DayTimeIntervalType(0,1) with nullable=false, sortOrder=List('a DESC NULLS FIRST) (827 milliseconds) 14:51:44.682 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: ReferenceSort [a#324896 ASC NULLS FIRST], true +- Scan ExistingRDD[a#324896] . 14:51:44.748 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: Sort [a#324896 ASC NULLS FIRST], true, 23 +- Scan ExistingRDD[a#324896] . 
[info] - sorting on YearMonthIntervalType(0,1) with nullable=true, sortOrder=List('a ASC NULLS FIRST) (565 milliseconds) 14:51:45.248 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: ReferenceSort [a#324899 ASC NULLS LAST], true +- Scan ExistingRDD[a#324899] . 14:51:45.312 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: Sort [a#324899 ASC NULLS LAST], true, 23 +- Scan ExistingRDD[a#324899] . [info] - sorting on YearMonthIntervalType(0,1) with nullable=true, sortOrder=List('a ASC NULLS LAST) (591 milliseconds) 14:51:45.841 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: ReferenceSort [a#324902 DESC NULLS LAST], true +- Scan ExistingRDD[a#324902] . 14:51:45.905 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: Sort [a#324902 DESC NULLS LAST], true, 23 +- Scan ExistingRDD[a#324902] . ``` ### Does this PR introduce _any_ user-facing change? Yes, it will show less warnings to users. Note that AQE is enabled by default from Spark 3.2, see SPARK-33679 ### How was this patch tested? Manually tested via unittests. Closes #34026 from HyukjinKwon/minor-log-level. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit `917d7dad4d`) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-09-17 12:01:53 +09:00
Wenchen Fan	c1bfe1a5c4	[SPARK-36783][SQL] ScanOperation should not push Filter through nondeterministic Project ### What changes were proposed in this pull request? `ScanOperation` collects adjacent Projects and Filters. The caller side always assume that the collected Filters should run before collected Projects, which means `ScanOperation` effectively pushes Filter through Project. Following `PushPredicateThroughNonJoin`, we should not push Filter through nondeterministic Project. This PR fixes `ScanOperation` to follow this rule. ### Why are the changes needed? Fix a bug that violates the semantic of nondeterministic expressions. ### Does this PR introduce _any_ user-facing change? Most likely no change, but in some cases, this is a correctness bug fix which changes the query result. ### How was this patch tested? existing tests Closes #34023 from cloud-fan/scan. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `dfd5237c0c`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-09-17 10:51:29 +08:00
Josh Rosen	3502fda783	[SPARK-36774][CORE][TESTS] Move SparkSubmitTestUtils to core module and use it in SparkSubmitSuite ### What changes were proposed in this pull request? This PR refactors test code in order to improve the debuggability of `SparkSubmitSuite`. The `sql/hive` module contains a `SparkSubmitTestUtils` helper class which launches `spark-submit` and captures its output in order to display better error messages when tests fail. This helper is currently used by `HiveSparkSubmitSuite` and `HiveExternalCatalogVersionsSuite`, but isn't used by `SparkSubmitSuite`. In this PR, I moved `SparkSubmitTestUtils` and `ProcessTestUtils` into the `core` module and updated `SparkSubmitSuite`, `BufferHolderSparkSubmitSuite`, and `WholestageCodegenSparkSubmitSuite` to use the relocated helper classes. This required me to change `SparkSubmitTestUtils` to make its timeouts configurable and to generalize its method for locating the `spark-submit` binary. ### Why are the changes needed? Previously, `SparkSubmitSuite` tests would fail with messages like: ``` [info] - launch simple application with spark-submit * FAILED * (1 second, 832 milliseconds) [info] Process returned with exit code 101. See the log4j logs for more detail. (SparkSubmitSuite.scala:1551) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) ``` which require the Spark developer to hunt in log4j logs in order to view the logs from the failed `spark-submit` command. After this change, those tests will fail with detailed error messages that include the text of the failed command plus timestamped logs captured from the failed process: ``` [info] - launch simple application with spark-submit * FAILED * (2 seconds, 800 milliseconds) [info] spark-submit returned with exit code 101. 
[info] Command line: '/Users/joshrosen/oss-spark/bin/spark-submit' '--class' 'invalidClassName' '--name' 'testApp' '--master' 'local' '--conf' 'spark.ui.enabled=false' '--conf' 'spark.master.rest.enabled=false' 'file:/Users/joshrosen/oss-spark/target/tmp/spark-0a8a0c93-3aaf-435d-9cf3-b97abd318d91/testJar-1631768004882.jar' [info] [info] 2021-09-15 21:53:26.041 - stderr> SLF4J: Class path contains multiple SLF4J bindings. [info] 2021-09-15 21:53:26.042 - stderr> SLF4J: Found binding in [jar:file:/Users/joshrosen/oss-spark/assembly/target/scala-2.12/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class] [info] 2021-09-15 21:53:26.042 - stderr> SLF4J: Found binding in [jar:file:/Users/joshrosen/.m2/repository/org/slf4j/slf4j-log4j12/1.7.30/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class] [info] 2021-09-15 21:53:26.042 - stderr> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. [info] 2021-09-15 21:53:26.042 - stderr> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] [info] 2021-09-15 21:53:26.619 - stderr> Error: Failed to load class invalidClassName. (SparkSubmitTestUtils.scala:97) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? I manually ran the affected test suites. Closes #34013 from JoshRosen/SPARK-36774-move-SparkSubmitTestUtils-to-core. Authored-by: Josh Rosen <joshrosen@databricks.com> Signed-off-by: Josh Rosen <joshrosen@databricks.com> (cherry picked from commit `3ae6e6775b`) Signed-off-by: Josh Rosen <joshrosen@databricks.com>	2021-09-16 14:31:22 -07:00
Dongjoon Hyun	63b8417794	[SPARK-36732][SQL][BUILD] Upgrade ORC to 1.6.11 ### What changes were proposed in this pull request? This PR aims to upgrade Apache ORC to 1.6.11 to bring the latest bug fixes. ### Why are the changes needed? Apache ORC 1.6.11 has the following fixes. - https://issues.apache.org/jira/projects/ORC/versions/12350499 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. Closes #33971 from dongjoon-hyun/SPARK-36732. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit `c217797297`) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-09-15 23:36:36 -07:00
Chao Sun	a7dc8242ea	[SPARK-36726] Upgrade Parquet to 1.12.1 ### What changes were proposed in this pull request? Upgrade Apache Parquet to 1.12.1 ### Why are the changes needed? Parquet 1.12.1 contains the following bug fixes: - PARQUET-2064: Make Range public accessible in RowRanges - PARQUET-2022: ZstdDecompressorStream should close `zstdInputStream` - PARQUET-2052: Integer overflow when writing huge binary using dictionary encoding - PARQUET-1633: Fix integer overflow - PARQUET-2054: fix TCP leaking when calling ParquetFileWriter.appendFile - PARQUET-2072: Do Not Determine Both Min/Max for Binary Stats - PARQUET-2073: Fix estimate remaining row count in ColumnWriteStoreBase - PARQUET-2078: Failed to read parquet file after writing with the same In particular PARQUET-2078 is a blocker for the upcoming Apache Spark 3.2.0 release. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests + a new test for the issue in SPARK-36696 Closes #33969 from sunchao/upgrade-parquet-12.1. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: DB Tsai <d_tsai@apple.com> (cherry picked from commit `a927b0836b`) Signed-off-by: DB Tsai <d_tsai@apple.com>	2021-09-15 19:17:49 +00:00
Angerszhuuuu	75bffd972d	[SPARK-36755][SQL] ArraysOverlap should handle duplicated Double.NaN and Float.NaN ### What changes were proposed in this pull request? For query ``` select arrays_overlap(array(cast('nan' as double), 1d), array(cast('nan' as double))) ``` This returns [false], but it should return [true]. The issue arises because `scala.collection.mutable.HashSet` can't handle `Double.NaN` and `Float.NaN`. ### Why are the changes needed? Fix bug ### Does this PR introduce _any_ user-facing change? arrays_overlap now treats `NaN` values as equal ### How was this patch tested? Added UT Closes #34006 from AngersZhuuuu/SPARK-36755. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `b665782f0d`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-09-15 22:32:18 +08:00
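The NaN problem described in SPARK-36755 can be illustrated without Spark. The sketch below is not the code from the PR; it is a minimal, assumed illustration of one way to make a set-based overlap check NaN-aware: compare canonical bit patterns instead of raw `Double` values. The names `NaNOverlapSketch`, `key`, and `arraysOverlap` are hypothetical and chosen only for this example.

```scala
// Illustrative sketch only; not Spark's implementation.
object NaNOverlapSketch {
  // Map each double to a key under which every NaN is identical.
  // java.lang.Double.doubleToLongBits collapses all NaN bit patterns to one
  // canonical value. 0.0 and -0.0 are also mapped to the same key, so that
  // values which compare equal with == always share a key.
  private def key(d: Double): Long =
    if (d == 0.0d) 0L else java.lang.Double.doubleToLongBits(d)

  // True when the two sequences share at least one element, treating NaN as
  // equal to NaN.
  def arraysOverlap(left: Seq[Double], right: Seq[Double]): Boolean = {
    val seen = left.map(key).toSet
    right.exists(d => seen.contains(key(d)))
  }
}
```

With this helper, `NaNOverlapSketch.arraysOverlap(Seq(Double.NaN, 1d), Seq(Double.NaN))` mirrors the query in the commit message and reports an overlap, while disjoint inputs do not.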
Angerszhuuuu	e64155691f	[SPARK-36702][SQL][FOLLOWUP] ArrayUnion handle duplicated Double.NaN and Float.NaN ### What changes were proposed in this pull request? According to https://github.com/apache/spark/pull/33955#discussion_r708570515 use normalized NaN ### Why are the changes needed? Use normalized NaN for duplicated NaN value ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing UT Closes #34003 from AngersZhuuuu/SPARK-36702-FOLLOWUP. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `638085953f`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-09-15 22:04:24 +08:00
Angerszhuuuu	a472612eb8	[SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.NaN ### What changes were proposed in this pull request? For query ``` select array_union(array(cast('nan' as double), cast('nan' as double)), array()) ``` This returns [NaN, NaN], but it should return [NaN]. The issue arises because `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` either. In this PR we add a wrapper for OpenHashSet that can handle `null`, `Double.NaN`, and `Float.NaN` together ### Why are the changes needed? Fix bug ### Does this PR introduce _any_ user-facing change? ArrayUnion won't show duplicated `NaN` values ### How was this patch tested? Added UT Closes #33955 from AngersZhuuuu/SPARK-36702-WrapOpenHashSet. Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com> Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `f71f37755d`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-09-14 18:26:02 +08:00
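The idea of wrapping a hash set so that `null` and `NaN` are tracked separately, as described in SPARK-36702, can be sketched in plain Scala. This is an assumed illustration and not the wrapper from the PR: it uses a standard `mutable.HashSet` rather than Spark's `OpenHashSet`, models SQL `NULL` as `None`, and the names `ArrayUnionSketch` and `union` are hypothetical.

```scala
import scala.collection.mutable

// Illustrative sketch only; not Spark's OpenHashSet wrapper.
object ArrayUnionSketch {
  // Order-preserving union that emits null (None) and NaN at most once each.
  // Ordinary values are deduplicated through a hash set; null and NaN are
  // tracked with dedicated flags, so the set never has to compare them.
  def union(left: Seq[Option[Double]], right: Seq[Option[Double]]): Seq[Option[Double]] = {
    val seen = mutable.HashSet.empty[Double]
    var seenNull = false
    var seenNaN = false
    val out = Seq.newBuilder[Option[Double]]
    (left ++ right).foreach {
      case None =>
        if (!seenNull) { seenNull = true; out += None }
      case Some(d) if d.isNaN =>
        if (!seenNaN) { seenNaN = true; out += Some(d) }
      case Some(d) =>
        // add returns true only when d was not already in the set
        if (seen.add(d)) out += Some(d)
    }
    out.result()
  }
}
```

Calling `ArrayUnionSketch.union(Seq(Some(Double.NaN), Some(Double.NaN)), Seq.empty)` corresponds to the query in the commit message and yields a single `NaN` element.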
Fu Chen	303590b3e9	[SPARK-36715][SQL] InferFiltersFromGenerate should not infer filter for udf ### What changes were proposed in this pull request? Fix InferFiltersFromGenerate bug, InferFiltersFromGenerate should not infer filter for generate when the children contain an expression which is instance of `org.apache.spark.sql.catalyst.expressions.UserDefinedExpression`. Before this pr, the following case will throw an exception. ```scala spark.udf.register("vec", (i: Int) => (0 until i).toArray) sql("select explode(vec(8)) as c1").show ``` ``` Once strategy's idempotence is broken for batch Infer Filters GlobalLimit 21 GlobalLimit 21 +- LocalLimit 21 +- LocalLimit 21 +- Project [cast(c1#3 as string) AS c1#12] +- Project [cast(c1#3 as string) AS c1#12] +- Generate explode(vec(8)), false, [c1#3] +- Generate explode(vec(8)), false, [c1#3] +- Filter ((size(vec(8), true) > 0) AND isnotnull(vec(8))) +- Filter ((size(vec(8), true) > 0) AND isnotnull(vec(8))) ! +- OneRowRelation +- Filter ((size(vec(8), true) > 0) AND isnotnull(vec(8))) ! +- OneRowRelation java.lang.RuntimeException: Once strategy's idempotence is broken for batch Infer Filters GlobalLimit 21 GlobalLimit 21 +- LocalLimit 21 +- LocalLimit 21 +- Project [cast(c1#3 as string) AS c1#12] +- Project [cast(c1#3 as string) AS c1#12] +- Generate explode(vec(8)), false, [c1#3] +- Generate explode(vec(8)), false, [c1#3] +- Filter ((size(vec(8), true) > 0) AND isnotnull(vec(8))) +- Filter ((size(vec(8), true) > 0) AND isnotnull(vec(8))) ! +- OneRowRelation +- Filter ((size(vec(8), true) > 0) AND isnotnull(vec(8))) ! 
+- OneRowRelation at org.apache.spark.sql.errors.QueryExecutionErrors$.onceStrategyIdempotenceIsBrokenForBatchError(QueryExecutionErrors.scala:1200) at org.apache.spark.sql.catalyst.rules.RuleExecutor.checkBatchIdempotence(RuleExecutor.scala:168) at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:254) at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:200) at scala.collection.immutable.List.foreach(List.scala:431) at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:200) at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:179) at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88) at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:179) at org.apache.spark.sql.execution.QueryExecution.$anonfun$optimizedPlan$1(QueryExecution.scala:138) at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:196) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:196) at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:134) at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:130) at org.apache.spark.sql.execution.QueryExecution.assertOptimized(QueryExecution.scala:148) at org.apache.spark.sql.execution.QueryExecution.$anonfun$executedPlan$1(QueryExecution.scala:166) at org.apache.spark.sql.execution.QueryExecution.withCteMap(QueryExecution.scala:73) at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:163) at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:163) at 
org.apache.spark.sql.execution.QueryExecution.simpleString(QueryExecution.scala:214) at org.apache.spark.sql.execution.QueryExecution.org$apache$spark$sql$execution$QueryExecution$$explainString(QueryExecution.scala:259) at org.apache.spark.sql.execution.QueryExecution.explainString(QueryExecution.scala:228) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:98) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3731) at org.apache.spark.sql.Dataset.head(Dataset.scala:2755) at org.apache.spark.sql.Dataset.take(Dataset.scala:2962) at org.apache.spark.sql.Dataset.getRows(Dataset.scala:288) at org.apache.spark.sql.Dataset.showString(Dataset.scala:327) at org.apache.spark.sql.Dataset.show(Dataset.scala:807) ``` ### Does this PR introduce _any_ user-facing change? No, only bug fix. ### How was this patch tested? Unit test. Closes #33956 from cfmcgrady/SPARK-36715. Authored-by: Fu Chen <cfmcgrady@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit `52c5ff20ca`) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-09-14 09:26:21 +09:00
Lukas Rytz	2e7583799e	[SPARK-36712][BUILD] Make scala-parallel-collections in 2.13 POM a direct dependency (not in maven profile) As [reported on `dev@spark.apache.org`](https://lists.apache.org/thread.html/r84cff66217de438f1389899e6d6891b573780159cd45463acf3657aa%40%3Cdev.spark.apache.org%3E), the published POMs when building with Scala 2.13 have the `scala-parallel-collections` dependency only in the `scala-2.13` profile of the pom. ### What changes were proposed in this pull request? This PR suggests working around this by un-commenting the `scala-parallel-collections` dependency when switching to 2.13 using the `change-scala-version.sh` script. I included an upgrade to scala-parallel-collections version 1.0.3, the changes compared to 0.2.0 are minor. - removed OSGi metadata - renamed some internal inner classes - added `Automatic-Module-Name` ### Why are the changes needed? According to the posts, this solves issues for developers that write unit tests for their applications. Stephen Coy suggested to use the https://www.mojohaus.org/flatten-maven-plugin. While this sounds like a more principled solution, it is possibly too risky to do at this specific point in time? ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Locally Closes #33948 from lrytz/parCollDep. Authored-by: Lukas Rytz <lukas.rytz@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit `1a62e6a2c1`) Signed-off-by: Sean Owen <srowen@gmail.com>	2021-09-13 11:06:58 -05:00
Yuto Akutsu	b043ee4de7	[SPARK-36738][SQL][DOC] Fixed the wrong documentation on Cot API ### What changes were proposed in this pull request? Fixed wrong documentation on Cot API ### Why are the changes needed? [Doc](https://spark.apache.org/docs/latest/api/sql/index.html#cot) says `1/java.lang.Math.cot` but it should be `1/java.lang.Math.tan`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual check. Closes #33978 from yutoacts/SPARK-36738. Authored-by: Yuto Akutsu <yuto.akutsu@nttdata.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit `3747cfdb40`) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-09-13 21:51:43 +09:00
Kousuke Saruta	b8a23e9ccc	[SPARK-36725][SQL][TESTS] Ensure HiveThriftServer2Suites to stop Thrift JDBC server on exit ### What changes were proposed in this pull request? This PR aims to ensure that HiveThriftServer2Suites (e.g. `thriftserver.UISeleniumSuite`) stop the Thrift JDBC server on exit using a shutdown hook. ### Why are the changes needed? Normally, HiveThriftServer2Suites stop the Thrift JDBC server via the `afterAll` method. But if they are killed by a signal (e.g. Ctrl-C), the Thrift JDBC server will remain. ``` $ jps 2792969 SparkSubmit ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Killed `thriftserver.UISeleniumSuite` by Ctrl-C and confirmed with jps that no Thrift JDBC server remains. Closes #33967 from sarutak/stop-thrift-on-exit. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit `c36d70836d`) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-09-11 15:54:48 -07:00
Liang-Chi Hsieh	b52fbeee2d	[SPARK-36669][SQL] Add Lz4 wrappers for Hadoop Lz4 codec ### What changes were proposed in this pull request? This patch proposes to add a few LZ4 wrapper classes for Parquet Lz4 compression output that uses Hadoop Lz4 codec. ### Why are the changes needed? Currently we use Hadoop 3.3.1's shaded client libraries. Lz4 is a provided dependency in Hadoop Common 3.3.1 for Lz4Codec. But it isn't excluded from relocation in these libraries. So to use lz4 as Parquet codec, we will hit the exception even if we include lz4 as a dependency. ``` [info] Cause: java.lang.NoClassDefFoundError: org/apache/hadoop/shaded/net/jpountz/lz4/LZ4Factory [info] at org.apache.hadoop.io.compress.lz4.Lz4Compressor.<init>(Lz4Compressor.java:66) [info] at org.apache.hadoop.io.compress.Lz4Codec.createCompressor(Lz4Codec.java:119) [info] at org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:152) [info] at org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:168) ``` Until the issue is fixed in a new Hadoop release, we can add a few wrapper classes for Lz4 codec. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Modified test. Closes #33940 from viirya/lz4-wrappers. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit `6bcf330191`) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2021-09-09 09:31:11 -07:00
Andrew Liu	6cb23c163c	[SPARK-36686][SQL] Fix SimplifyConditionalsInPredicate to be null-safe ### What changes were proposed in this pull request? fix SimplifyConditionalsInPredicate to be null-safe Reproducible: ``` import org.apache.spark.sql.types.{StructField, BooleanType, StructType} import org.apache.spark.sql.Row val schema = List( StructField("b", BooleanType, true) ) val data = Seq( Row(true), Row(false), Row(null) ) val df = spark.createDataFrame( spark.sparkContext.parallelize(data), StructType(schema) ) // cartesian product of true / false / null val df2 = df.select(col("b") as "cond").crossJoin(df.select(col("b") as "falseVal")) df2.createOrReplaceTempView("df2") spark.sql("SELECT * FROM df2 WHERE IF(cond, FALSE, falseVal)").show() // actual: // +-----+--------+ // \| cond\|falseVal\| // +-----+--------+ // \|false\| true\| // +-----+--------+ spark.sql("SET spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.SimplifyConditionalsInPredicate") spark.sql("SELECT * FROM df2 WHERE IF(cond, FALSE, falseVal)").show() // expected: // +-----+--------+ // \| cond\|falseVal\| // +-----+--------+ // \|false\| true\| // \| null\| true\| // +-----+--------+ ``` ### Why are the changes needed? is a regression that leads to incorrect results ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests Closes #33928 from hypercubestart/fix-SimplifyConditionalsInPredicate. Authored-by: Andrew Liu <andrewlliu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `9b633f2075`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-09-09 11:32:59 +08:00
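The null-safety issue in SPARK-36686 comes down to SQL three-valued logic, which can be modeled without Spark. The sketch below is an assumption-laden illustration, not the optimizer rule itself: `None` stands for SQL `NULL`, the "naive" rewrite `NOT cond AND falseVal` is assumed here as the kind of simplification that loses the `(null, true)` row shown in the commit message, and `NOT COALESCE(cond, FALSE) AND falseVal` is one null-safe alternative chosen for illustration (the commit message does not state the exact expression Spark uses). All names are hypothetical.

```scala
// Illustrative sketch only; not the SimplifyConditionalsInPredicate rule.
object NullSafeIfSketch {
  type Tri = Option[Boolean] // None models SQL NULL

  def not(a: Tri): Tri = a.map(!_)

  // SQL AND: false dominates, otherwise NULL propagates.
  def and(a: Tri, b: Tri): Tri = (a, b) match {
    case (Some(false), _) | (_, Some(false)) => Some(false)
    case (Some(true), Some(true))            => Some(true)
    case _                                   => None
  }

  // SQL IF: a NULL condition selects the else branch.
  def sqlIf(cond: Tri, thenVal: Tri, elseVal: Tri): Tri =
    if (cond.contains(true)) thenVal else elseVal

  // Assumed naive rewrite of IF(cond, FALSE, falseVal).
  def naive(cond: Tri, falseVal: Tri): Tri = and(not(cond), falseVal)

  // Assumed null-safe rewrite: treat a NULL condition as FALSE first.
  def nullSafe(cond: Tri, falseVal: Tri): Tri =
    and(not(Some(cond.getOrElse(false))), falseVal)

  // A WHERE clause keeps a row only when the predicate is exactly TRUE.
  def keeps(predicate: Tri): Boolean = predicate.contains(true)
}
```

For `cond = None` and `falseVal = Some(true)`, the original `IF` keeps the row and so does `nullSafe`, whereas `naive` evaluates to `NULL` and drops it, which matches the "actual" versus "expected" tables in the reproduction.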
Huaxin Gao	7e8860751c	[SPARK-34952][SQL][FOLLOWUP] Change column type to be NamedReference ### What changes were proposed in this pull request? Currently, we have `FieldReference` for aggregate column type, should be `NamedReference` instead ### Why are the changes needed? `FieldReference` is a private class, should use `NamedReference` instead ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing tests Closes #33927 from huaxingao/agg_followup. Authored-by: Huaxin Gao <huaxin_gao@apple.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `23794fb303`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-09-08 14:06:26 +08:00
yangjie01	c4332c7bf0	[SPARK-36684][SQL][TESTS] Add Jackson test dependencies to `sql/core` module at `hadoop-2.7` profile ### What changes were proposed in this pull request? SPARK-26346 upgrade Parquet related modules from 1.10.1 to 1.11.1 and `parquet-jackson 1.11.1` use `com.fasterxml.jackson` instead of `org.codehaus.jackson`. So, there are warning logs related to ``` 17:12:17.605 WARN org.apache.hadoop.fs.FileSystem: Cannot load filesystem java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider org.apache.hadoop.hdfs.web.WebHdfsFileSystem could not be instantiated ... Caused by: java.lang.ClassNotFoundException: org.codehaus.jackson.map.ObjectMapper at java.net.URLClassLoader.findClass(URLClassLoader.java:382) ... ``` when test `sql/core` modules with `hadoop-2.7` profile. This pr adds test dependencies related to `org.codehaus.jackson` in `sql/core` module when `hadoop-2.7` profile is activated. ### Why are the changes needed? Clean up test warning logs that shouldn't exist. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Pass GA or Jenkins Tests. - Manual test `mvn clean test -pl sql/core -am -DwildcardSuites=none -Phadoop-2.7` Before No test failed, but warning logs as follows: ``` [INFO] Running test.org.apache.spark.sql.JavaBeanDeserializationSuite 22:42:45.211 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable 22:42:46.827 WARN org.apache.hadoop.fs.FileSystem: Cannot load filesystem java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider org.apache.hadoop.hdfs.web.WebHdfsFileSystem could not be instantiated at java.util.ServiceLoader.fail(ServiceLoader.java:232) at java.util.ServiceLoader.access$100(ServiceLoader.java:185) at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384) at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404) at java.util.ServiceLoader$1.next(ServiceLoader.java:480) at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2631) at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2650) at org.apache.hadoop.fs.FsUrlStreamHandlerFactory.<init>(FsUrlStreamHandlerFactory.java:62) at org.apache.spark.sql.internal.SharedState$.liftedTree1$1(SharedState.scala:181) at org.apache.spark.sql.internal.SharedState$.org$apache$spark$sql$internal$SharedState$$setFsUrlStreamHandlerFactory(SharedState.scala:180) at org.apache.spark.sql.internal.SharedState.<init>(SharedState.scala:54) at org.apache.spark.sql.SparkSession.$anonfun$sharedState$1(SparkSession.scala:135) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.SparkSession.sharedState$lzycompute(SparkSession.scala:135) at org.apache.spark.sql.SparkSession.sharedState(SparkSession.scala:134) at org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:335) at org.apache.spark.sql.test.TestSparkSession.sessionState$lzycompute(TestSQLContext.scala:42) at org.apache.spark.sql.test.TestSparkSession.sessionState(TestSQLContext.scala:41) at org.apache.spark.sql.SparkSession.$anonfun$new$3(SparkSession.scala:109) at scala.Option.map(Option.scala:230) at org.apache.spark.sql.SparkSession.$anonfun$new$1(SparkSession.scala:109) at org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:194) at 
org.apache.spark.sql.types.DataType.sameType(DataType.scala:97) at org.apache.spark.sql.catalyst.analysis.TypeCoercion$.$anonfun$haveSameType$1(TypeCoercion.scala:291) at org.apache.spark.sql.catalyst.analysis.TypeCoercion$.$anonfun$haveSameType$1$adapted(TypeCoercion.scala:291) at scala.collection.LinearSeqOptimized.forall(LinearSeqOptimized.scala:85) at scala.collection.LinearSeqOptimized.forall$(LinearSeqOptimized.scala:82) at scala.collection.immutable.List.forall(List.scala:89) at org.apache.spark.sql.catalyst.analysis.TypeCoercion$.haveSameType(TypeCoercion.scala:291) at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataTypeCheck(Expression.scala:1074) at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataTypeCheck$(Expression.scala:1069) at org.apache.spark.sql.catalyst.expressions.If.dataTypeCheck(conditionalExpressions.scala:37) at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType(Expression.scala:1080) at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType$(Expression.scala:1079) at org.apache.spark.sql.catalyst.expressions.If.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType$lzycompute(conditionalExpressions.scala:37) at org.apache.spark.sql.catalyst.expressions.If.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType(conditionalExpressions.scala:37) at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataType(Expression.scala:1084) at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataType$(Expression.scala:1084) at org.apache.spark.sql.catalyst.expressions.If.dataType(conditionalExpressions.scala:37) at 
org.apache.spark.sql.catalyst.expressions.objects.MapObjects.$anonfun$dataType$4(objects.scala:815) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.catalyst.expressions.objects.MapObjects.dataType(objects.scala:815) at org.apache.spark.sql.catalyst.expressions.CreateNamedStruct.$anonfun$dataType$9(complexTypeCreator.scala:416) at scala.collection.immutable.List.map(List.scala:290) at org.apache.spark.sql.catalyst.expressions.CreateNamedStruct.dataType$lzycompute(complexTypeCreator.scala:410) at org.apache.spark.sql.catalyst.expressions.CreateNamedStruct.dataType(complexTypeCreator.scala:409) at org.apache.spark.sql.catalyst.expressions.CreateNamedStruct.dataType(complexTypeCreator.scala:398) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.isSerializedAsStruct(ExpressionEncoder.scala:309) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.isSerializedAsStructForTopLevel(ExpressionEncoder.scala:319) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.<init>(ExpressionEncoder.scala:248) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.javaBean(ExpressionEncoder.scala:75) at org.apache.spark.sql.Encoders$.bean(Encoders.scala:154) at org.apache.spark.sql.Encoders.bean(Encoders.scala) at test.org.apache.spark.sql.JavaBeanDeserializationSuite.testBeanWithArrayFieldDeserialization(JavaBeanDeserializationSuite.java:75) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63) at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) at org.junit.runners.ParentRunner.run(ParentRunner.java:413) at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:364) at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:272) at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:237) at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:158) at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:428) at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:162) at org.apache.maven.surefire.booter.ForkedBooter.run(ForkedBooter.java:562) at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:548) Caused by: java.lang.NoClassDefFoundError: org/codehaus/jackson/map/ObjectMapper at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.<clinit>(WebHdfsFileSystem.java:129) at 
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at java.lang.Class.newInstance(Class.java:442) at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:380) ... 81 more Caused by: java.lang.ClassNotFoundException: org.codehaus.jackson.map.ObjectMapper at java.net.URLClassLoader.findClass(URLClassLoader.java:382) at java.lang.ClassLoader.loadClass(ClassLoader.java:419) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) at java.lang.ClassLoader.loadClass(ClassLoader.java:352) ... 88 more ``` After There are no more warning logs like above Closes #33926 from LuciferYang/SPARK-36684. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit `acd9c92fa8`) Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-09-07 21:40:53 -07:00


11738 commits