ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
PengLei	eb794a4f58	[SPARK-36851][SQL] Incorrect parsing of negative ANSI typed interval literals ### What changes were proposed in this pull request? Handle incorrect parsing of negative ANSI typed interval literals [SPARK-36851](https://issues.apache.org/jira/browse/SPARK-36851) ### Why are the changes needed? Incorrect result: ``` spark-sql> select interval -'1' year; 1-0 ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Add ut testcase Closes #34107 from Peng-Lei/SPARK-36851. Authored-by: PengLei <peng.8lei@gmail.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit `0fdca1f0df`) Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-09-26 18:43:38 +08:00
Chao Sun	540e45c3cc	[SPARK-36835][FOLLOWUP][BUILD][TEST-HADOOP2.7] Fix maven issue for Hadoop 2.7 profile after enabling dependency reduced pom ### What changes were proposed in this pull request? Fix an issue where Maven may stuck in an infinite loop when building Spark, for Hadoop 2.7 profile. ### Why are the changes needed? After re-enabling `createDependencyReducedPom` for `maven-shade-plugin`, Spark build stopped working for Hadoop 2.7 profile and will stuck in an infinitely loop, likely due to a Maven shade plugin bug similar to https://issues.apache.org/jira/browse/MSHADE-148. This seems to be caused by the fact that, under `hadoop-2.7` profile, variable `hadoop-client-runtime.artifact` and `hadoop-client-api.artifact`are both `hadoop-client` which triggers the issue. As a workaround, this changes `hadoop-client-runtime.artifact` to be `hadoop-yarn-api` when using `hadoop-2.7`. Since `hadoop-yarn-api` is a dependency of `hadoop-client`, this essentially moves the former to the same level as the latter. It should have no effect as both are dependencies of Spark. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? N/A Closes #34100 from sunchao/SPARK-36835-followup. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit `937a74e6e7`) Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-09-26 13:39:50 +08:00
Gengliang Wang	da722d43cb	Preparing development version 3.2.1-SNAPSHOT	2021-09-24 10:03:23 +00:00
Gengliang Wang	9e35703211	Preparing Spark release v3.2.0-rc5	2021-09-24 10:03:16 +00:00
Gengliang Wang	3fff405c95	[SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data ### What changes were proposed in this pull request? Improve the perf and memory usage of cleaning up stage UI data. The new code make copy of the essential fields(stage id, attempt id, completion time) to an array and determine which stage data and `RDDOperationGraphWrapper` needs to be clean based on it ### Why are the changes needed? Fix the memory usage issue described in https://issues.apache.org/jira/browse/SPARK-36827 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Add new unit test for the InMemoryStore. Also, run a simple benchmark with ``` val testConf = conf.clone() .set(MAX_RETAINED_STAGES, 1000) val listener = new AppStatusListener(store, testConf, true) val stages = (1 to 5000).map { i => val s = new StageInfo(i, 0, s"stage$i", 4, Nil, Nil, "details1", resourceProfileId = ResourceProfile.DEFAULT_RESOURCE_PROFILE_ID) s.submissionTime = Some(i.toLong) s } listener.onJobStart(SparkListenerJobStart(4, time, Nil, null)) val start = System.nanoTime() stages.foreach { s => time +=1 s.submissionTime = Some(time) listener.onStageSubmitted(SparkListenerStageSubmitted(s, new Properties())) s.completionTime = Some(time) listener.onStageCompleted(SparkListenerStageCompleted(s)) } println(System.nanoTime() - start) ``` Before changes: InMemoryStore: 1.2s After changes: InMemoryStore: 0.23s Closes #34092 from gengliangwang/cleanStage. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit `7ac0a2c37b`) Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-09-24 17:24:32 +08:00
Angerszhuuuu	b7174188e5	[SPARK-36792][SQL] InSet should handle NaN ### What changes were proposed in this pull request? InSet should handle NaN ``` InSet(Literal(Double.NaN), Set(Double.NaN, 1d)) should return true, but return false. ``` ### Why are the changes needed? InSet should handle NaN ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added UT Closes #34033 from AngersZhuuuu/SPARK-36792. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `64f4bf47af`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-09-24 16:19:47 +08:00
allisonwang-db	d0c97d6ed9	[SPARK-36747][SQL][3.2] Do not collapse Project with Aggregate when correlated subqueries are present in the project list ### What changes were proposed in this pull request? This PR adds a check in the optimizer rule `CollapseProject` to avoid combining Project with Aggregate when the project list contains one or more correlated scalar subqueries that reference the output of the aggregate. Combining Project with Aggregate can lead to an invalid plan after correlated subquery rewrite. This is because correlated scalar subqueries' references are used as join conditions, which cannot host aggregate expressions. For example ```sql select (select sum(c2) from t where c1 = cast(s as int)) from (select sum(c2) s from t) ``` ``` == Optimized Logical Plan == Aggregate [sum(c2)#10L AS scalarsubquery(s)#11L] <--- Aggregate has neither grouping nor aggregate expressions. +- Project [sum(c2)#10L] +- Join LeftOuter, (c1#2 = cast(sum(c2#3) as int)) <--- Aggregate expression in join condition :- LocalRelation [c2#3] +- Aggregate [c1#2], [sum(c2#3) AS sum(c2)#10L, c1#2] +- LocalRelation [c1#2, c2#3] java.lang.UnsupportedOperationException: Cannot generate code for expression: sum(input[0, int, false]) ``` Currently, we only allow a correlated scalar subquery in Aggregate if it is also in the grouping expressions. `079a9c5292/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala (L661-L666)` ### Why are the changes needed? To fix an existing optimizer issue. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test. Authored-by: allisonwang-db <allison.wangdatabricks.com> Signed-off-by: Wenchen Fan <wenchendatabricks.com> (cherry picked from commit `4a8dc5f7a3`) Signed-off-by: allisonwang-db <allison.wangdatabricks.com> Closes #34081 from allisonwang-db/cp-spark-36747. Authored-by: allisonwang-db <allison.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-09-24 16:14:49 +08:00
Gengliang Wang	09a8535cc4	Revert "[SPARK-35672][CORE][YARN] Pass user classpath entries to exec… …utors using config instead of command line" ### What changes were proposed in this pull request? This reverts commit `866df69c62`. ### Why are the changes needed? After the change environment variables were not substituted in user classpath entries. Please find an example on SPARK-35672. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #34088 from gengliangwang/revertSPARK-35672. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-09-24 12:46:22 +09:00
Chao Sun	09283d3210	[SPARK-36835][BUILD] Enable createDependencyReducedPom for Maven shaded plugin ### What changes were proposed in this pull request? Enable `createDependencyReducedPom` for Spark's Maven shaded plugin so that the effective pom won't contain those shaded artifacts such as `org.eclipse.jetty` ### Why are the changes needed? At the moment, the effective pom leaks transitive dependencies to downstream apps for those shaded artifacts, which potentially will cause issues. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? I manually tested and the `core/dependency-reduced-pom.xml` no longer contains dependencies such as `jetty-XX`. Closes #34085 from sunchao/SPARK-36835. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit `ed88e610f0`) Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-09-24 10:16:46 +08:00
Gengliang Wang	0fb7127f85	Preparing development version 3.2.1-SNAPSHOT	2021-09-23 08:46:28 +00:00
Gengliang Wang	b609f2fe0c	Preparing Spark release v3.2.0-rc4	2021-09-23 08:46:22 +00:00
yi.wu	0ad382747d	[SPARK-36782][CORE][FOLLOW-UP] Only handle shuffle block in separate thread pool ### What changes were proposed in this pull request? This's a follow-up of https://github.com/apache/spark/pull/34043. This PR proposes to only handle shuffle blocks in the separate thread pool and leave other blocks the same behavior as it is. ### Why are the changes needed? To avoid any potential overhead. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass existing tests. Closes #34076 from Ngone51/spark-36782-follow-up. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit `9d8ac7c8e9`) Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-09-23 16:30:25 +08:00
Michael Chen	89894a4b1d	[SPARK-36795][SQL] Explain Formatted has Duplicate Node IDs Fixed explain formatted mode so it doesn't have duplicate node IDs when InMemoryRelation is present in query plan. Having duplicated node IDs in the plan makes it confusing. Yes, explain formatted string will change. Notice how `ColumnarToRow` and `InMemoryRelation` have node id of 2. Before changes => ``` == Physical Plan == AdaptiveSparkPlan (14) +- == Final Plan == * BroadcastHashJoin Inner BuildLeft (9) :- BroadcastQueryStage (5) : +- BroadcastExchange (4) : +- * Filter (3) : +- * ColumnarToRow (2) : +- InMemoryTableScan (1) : +- InMemoryRelation (2) : +- * ColumnarToRow (4) : +- Scan parquet default.t1 (3) +- * Filter (8) +- * ColumnarToRow (7) +- Scan parquet default.t2 (6) +- == Initial Plan == BroadcastHashJoin Inner BuildLeft (13) :- BroadcastExchange (11) : +- Filter (10) : +- InMemoryTableScan (1) : +- InMemoryRelation (2) : +- * ColumnarToRow (4) : +- Scan parquet default.t1 (3) +- Filter (12) +- Scan parquet default.t2 (6) (1) InMemoryTableScan Output [1]: [k#x] Arguments: [k#x], [isnotnull(k#x)] (2) InMemoryRelation Arguments: [k#x], CachedRDDBuilder(org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer401788d5,StorageLevel(disk, memory, deserialized, 1 replicas),(1) ColumnarToRow +- FileScan parquet default.t1[k#x] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/mike.chen/code/apacheSpark/spark/spark-warehouse/org.apach..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<k:int> ,None) (3) Scan parquet default.t1 Output [1]: [k#x] Batched: true Location: InMemoryFileIndex [file:/Users/mike.chen/code/apacheSpark/spark/spark-warehouse/org.apache.spark.sql.ExplainSuiteAE/t1] ReadSchema: struct<k:int> (4) ColumnarToRow [codegen id : 1] Input [1]: [k#x] (5) BroadcastQueryStage Output [1]: [k#x] Arguments: 0 (6) Scan parquet default.t2 Output [1]: [key#x] Batched: true Location: InMemoryFileIndex [file:/Users/mike.chen/code/apacheSpark/spark/spark-warehouse/org.apache.spark.sql.ExplainSuiteAE/t2] PushedFilters: [IsNotNull(key)] ReadSchema: struct<key:int> (7) ColumnarToRow Input [1]: [key#x] (8) Filter Input [1]: [key#x] Condition : isnotnull(key#x) (9) BroadcastHashJoin [codegen id : 2] Left keys [1]: [k#x] Right keys [1]: [key#x] Join condition: None (10) Filter Input [1]: [k#x] Condition : isnotnull(k#x) (11) BroadcastExchange Input [1]: [k#x] Arguments: HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [id=#x] (12) Filter Input [1]: [key#x] Condition : isnotnull(key#x) (13) BroadcastHashJoin Left keys [1]: [k#x] Right keys [1]: [key#x] Join condition: None (14) AdaptiveSparkPlan Output [2]: [k#x, key#x] Arguments: isFinalPlan=true ``` After Changes => ``` == Physical Plan == AdaptiveSparkPlan (17) +- == Final Plan == BroadcastHashJoin Inner BuildLeft (12) :- BroadcastQueryStage (8) : +- BroadcastExchange (7) : +- * Filter (6) : +- * ColumnarToRow (5) : +- InMemoryTableScan (1) : +- InMemoryRelation (2) : +- * ColumnarToRow (4) : +- Scan parquet default.t1 (3) +- * Filter (11) +- * ColumnarToRow (10) +- Scan parquet default.t2 (9) +- == Initial Plan == BroadcastHashJoin Inner BuildLeft (16) :- BroadcastExchange (14) : +- Filter (13) : +- InMemoryTableScan (1) : +- InMemoryRelation (2) : +- * ColumnarToRow (4) : +- Scan parquet default.t1 (3) +- Filter (15) +- Scan parquet default.t2 (9) (1) InMemoryTableScan Output [1]: [k#x] Arguments: [k#x], [isnotnull(k#x)] (2) InMemoryRelation Arguments: [k#x], CachedRDDBuilder(org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer3ccb12d,StorageLevel(disk, memory, deserialized, 1 replicas),*(1) ColumnarToRow +- FileScan parquet default.t1[k#x] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/mike.chen/code/apacheSpark/spark/spark-warehouse/org.apach..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<k:int> ,None) (3) Scan parquet default.t1 Output [1]: [k#x] Batched: true Location: InMemoryFileIndex [file:/Users/mike.chen/code/apacheSpark/spark/spark-warehouse/org.apache.spark.sql.ExplainSuiteAE/t1] ReadSchema: struct<k:int> (4) ColumnarToRow [codegen id : 1] Input [1]: [k#x] (5) ColumnarToRow [codegen id : 1] Input [1]: [k#x] (6) Filter [codegen id : 1] Input [1]: [k#x] Condition : isnotnull(k#x) (7) BroadcastExchange Input [1]: [k#x] Arguments: HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [id=#x] (8) BroadcastQueryStage Output [1]: [k#x] Arguments: 0 (9) Scan parquet default.t2 Output [1]: [key#x] Batched: true Location: InMemoryFileIndex [file:/Users/mike.chen/code/apacheSpark/spark/spark-warehouse/org.apache.spark.sql.ExplainSuiteAE/t2] PushedFilters: [IsNotNull(key)] ReadSchema: struct<key:int> (10) ColumnarToRow Input [1]: [key#x] (11) Filter Input [1]: [key#x] Condition : isnotnull(key#x) (12) BroadcastHashJoin [codegen id : 2] Left keys [1]: [k#x] Right keys [1]: [key#x] Join condition: None (13) Filter Input [1]: [k#x] Condition : isnotnull(k#x) (14) BroadcastExchange Input [1]: [k#x] Arguments: HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [id=#x] (15) Filter Input [1]: [key#x] Condition : isnotnull(key#x) (16) BroadcastHashJoin Left keys [1]: [k#x] Right keys [1]: [key#x] Join condition: None (17) AdaptiveSparkPlan Output [2]: [k#x, key#x] Arguments: isFinalPlan=true ``` add test Closes #34036 from ChenMichael/SPARK-36795-Duplicate-node-id-with-inMemoryRelation. Authored-by: Michael Chen <mike.chen@workday.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit `6d7ab7b52b`) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-09-23 15:55:15 +09:00
Hyukjin Kwon	af569d1b0a	[MINOR][SQL][DOCS] Correct the 'options' description on UnresolvedRelation ### What changes were proposed in this pull request? This PR fixes the 'options' description on `UnresolvedRelation`. This comment was added in https://github.com/apache/spark/pull/29535 but not valid anymore because V1 also uses this `options` (and merge the options with the table properties) per https://github.com/apache/spark/pull/29712. This PR can go through from `master` to `branch-3.1`. ### Why are the changes needed? To make `UnresolvedRelation.options`'s description clearer. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? Scala linter by `dev/linter-scala`. Closes #34075 from HyukjinKwon/minor-comment-unresolved-releation. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Huaxin Gao <huaxin_gao@apple.com> (cherry picked from commit `0076eba8d0`) Signed-off-by: Huaxin Gao <huaxin_gao@apple.com>	2021-09-22 23:00:35 -07:00
Fabian A.J. Thiele	d4050d7ee9	[SPARK-36782][CORE] Avoid blocking dispatcher-BlockManagerMaster during UpdateBlockInfo ### What changes were proposed in this pull request? Delegate potentially blocking call to `mapOutputTracker.updateMapOutput` from within `UpdateBlockInfo` from `dispatcher-BlockManagerMaster` to the threadpool to avoid blocking the endpoint. This code path is only accessed for `ShuffleIndexBlockId`, other blocks are still executed on the `dispatcher-BlockManagerMaster` itself. Change `updateBlockInfo` to return `Future[Boolean]` instead of `Boolean`. Response will be sent to RPC caller upon successful completion of the future. Introduce a unit test that forces `MapOutputTracker` to make a broadcast as part of `MapOutputTracker.serializeOutputStatuses` when running decommission tests. ### Why are the changes needed? [SPARK-36782](https://issues.apache.org/jira/browse/SPARK-36782) describes a deadlock occurring if the `dispatcher-BlockManagerMaster` is allowed to block while waiting for write access to data structures. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test as introduced in this PR. --- Ping eejbyfeldt for notice. Closes #34043 from f-thiele/SPARK-36782. Lead-authored-by: Fabian A.J. Thiele <fabian.thiele@posteo.de> Co-authored-by: Emil Ejbyfeldt <eejbyfeldt@liveintent.com> Co-authored-by: Fabian A.J. Thiele <fthiele@liveintent.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit `4ea54e8672`) Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-09-23 12:57:54 +08:00
jiaoqb	d203ed51ca	[SPARK-36791][DOCS] Fix spelling mistakes in running-on-yarn.md file where JHS_POST should be JHS_HOST ### What changes were proposed in this pull request? The PR fixes SPARK-36791 by replacing JHS_POST with JHS_HOST ### Why are the changes needed? There are spelling mistakes in running-on-yarn.md file where JHS_POST should be JHS_HOST ### Does this PR introduce any user-facing change? No ### How was this patch tested? Not needed for docs Closes #34031 from jiaoqingbo/jiaoqingbo. Authored-by: jiaoqb <jiaoqb@asiainfo.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit `8a1a91bd71`) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-09-23 12:48:06 +09:00
Xinrong Meng	423cff4567	[SPARK-36818][PYTHON] Fix filtering a Series by a boolean Series ### What changes were proposed in this pull request? Fix filtering a Series (without a name) by a boolean Series. ### Why are the changes needed? A bugfix. The issue is raised as https://github.com/databricks/koalas/issues/2199. ### Does this PR introduce _any_ user-facing change? Yes. #### From ```py >>> psser = ps.Series([0, 1, 2, 3, 4]) >>> ps.set_option('compute.ops_on_diff_frames', True) >>> psser.loc[ps.Series([True, True, True, False, False])] Traceback (most recent call last): ... KeyError: 'none key' ``` #### To ```py >>> psser = ps.Series([0, 1, 2, 3, 4]) >>> ps.set_option('compute.ops_on_diff_frames', True) >>> psser.loc[ps.Series([True, True, True, False, False])] 0 0 1 1 2 2 dtype: int64 ``` ### How was this patch tested? Unit test. Closes #34061 from xinrong-databricks/filter_series. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com> (cherry picked from commit `6a5ee0283c`) Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-09-22 12:53:06 -07:00
Angerszhuuuu	2ff038a7b3	[SPARK-36753][SQL] ArrayExcept handle duplicated Double.NaN and Float.NaN ### What changes were proposed in this pull request? For query ``` select array_except(array(cast('nan' as double), 1d), array(cast('nan' as double))) ``` This returns [NaN, 1d], but it should return [1d]. This issue is caused by `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` too. In this pr fix this based on https://github.com/apache/spark/pull/33955 ### Why are the changes needed? Fix bug ### Does this PR introduce _any_ user-facing change? ArrayExcept won't show handle equal `NaN` value ### How was this patch tested? Added UT Closes #33994 from AngersZhuuuu/SPARK-36753. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `a7cbe69986`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-09-22 23:51:58 +08:00
Ivan Sadikov	fc0b85fb26	[SPARK-36803][SQL] Fix ArrayType conversion when reading Parquet files written in legacy mode ### What changes were proposed in this pull request? This PR fixes an issue when reading of a Parquet file written with legacy mode would fail due to incorrect Parquet LIST to ArrayType conversion. The issue arises when using schema evolution and utilising the parquet-mr reader. 2-level LIST annotated types could be parsed incorrectly as 3-level LIST annotated types because their underlying element type does not match the full inferred Catalyst schema. ### Why are the changes needed? It appears to be a long-standing issue with the legacy mode due to the imprecise check in ParquetRowConverter that was trying to determine Parquet backward compatibility using Catalyst schema: `DataType.equalsIgnoreCompatibleNullability(guessedElementType, elementType)` in https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala#L606. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added a new test case in ParquetInteroperabilitySuite.scala. Closes #34044 from sadikovi/parquet-legacy-write-mode-list-issue. Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `ec26d94eac`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-09-22 17:40:55 +08:00
Chao Sun	a28d8d9b0e	[SPARK-36820][3.2][SQL] Disable tests related to LZ4 for Hadoop 2.7 profile ### What changes were proposed in this pull request? Disable tests related to LZ4 in `FileSourceCodecSuite` and `FileSuite` when using `hadoop-2.7` profile. ### Why are the changes needed? At the moment, parquet-mr uses LZ4 compression codec provided by Hadoop, and only since HADOOP-17292 (in 3.3.1/3.4.0) the latter added `lz4-java` to remove the restriction that the codec can only be run with native library. As consequence, the test will fail when using `hadoop-2.7` profile. ### Does this PR introduce _any_ user-facing change? No, it's just test. ### How was this patch tested? Existing test Closes #34066 from sunchao/SpARK-36820-3.2. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2021-09-22 00:14:45 -07:00
Xinrong Meng	4543ac62bc	[SPARK-36771][PYTHON][3.2] Fix `pop` of Categorical Series ### What changes were proposed in this pull request? Fix `pop` of Categorical Series to be consistent with the latest pandas (1.3.2) behavior. This is a backport of https://github.com/apache/spark/pull/34052. ### Why are the changes needed? As https://github.com/databricks/koalas/issues/2198, pandas API on Spark behaves differently from pandas on `pop` of Categorical Series. ### Does this PR introduce _any_ user-facing change? Yes, results of `pop` of Categorical Series change. #### From ```py >>> psser = ps.Series(["a", "b", "c", "a"], dtype="category") >>> psser 0 a 1 b 2 c 3 a dtype: category Categories (3, object): ['a', 'b', 'c'] >>> psser.pop(0) 0 >>> psser 1 b 2 c 3 a dtype: category Categories (3, object): ['a', 'b', 'c'] >>> psser.pop(3) 0 >>> psser 1 b 2 c dtype: category Categories (3, object): ['a', 'b', 'c'] ``` #### To ```py >>> psser = ps.Series(["a", "b", "c", "a"], dtype="category") >>> psser 0 a 1 b 2 c 3 a dtype: category Categories (3, object): ['a', 'b', 'c'] >>> psser.pop(0) 'a' >>> psser 1 b 2 c 3 a dtype: category Categories (3, object): ['a', 'b', 'c'] >>> psser.pop(3) 'a' >>> psser 1 b 2 c dtype: category Categories (3, object): ['a', 'b', 'c'] ``` ### How was this patch tested? Unit tests. Closes #34063 from xinrong-databricks/backport_cat_pop. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-09-21 19:16:27 -07:00
Gengliang Wang	affd7a4d47	[SPARK-36670][FOLLOWUP][TEST] Remove brotli-codec dependency ### What changes were proposed in this pull request? Remove `com.github.rdblue:brotli-codec:0.1.1` dependency. ### Why are the changes needed? As Stephen Coy pointed out in the dev list, we should not have `com.github.rdblue:brotli-codec:0.1.1` dependency which is not available on Maven Central. This is to avoid possible artifact changes on `Jitpack.io`. Also, the dependency is for tests only. I suggest that we remove it now to unblock the 3.2.0 release ASAP. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? GA tests. Closes #34059 from gengliangwang/removeDeps. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit `ba5708d944`) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-09-21 10:57:34 -07:00
Max Gekk	7fa88b28a5	[SPARK-36807][SQL] Merge ANSI interval types to a tightest common type ### What changes were proposed in this pull request? In the PR, I propose to modify `StructType` to support merging of ANSI interval types with different fields. ### Why are the changes needed? This will allow merging of schemas from different datasource files. ### Does this PR introduce _any_ user-facing change? No, the ANSI interval types haven't released yet. ### How was this patch tested? Added new test to `StructTypeSuite`. Closes #34049 from MaxGekk/merge-ansi-interval-types. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit `d2340f8e1c`) Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-09-21 10:20:27 +03:00
dgd-contributor	3d47c692d2	[SPARK-36785][PYTHON] Fix DataFrame.isin when DataFrame has NaN value ### What changes were proposed in this pull request? Fix DataFrame.isin when DataFrame has NaN value ### Why are the changes needed? Fix DataFrame.isin when DataFrame has NaN value ``` python >>> psdf = ps.DataFrame( ... {"a": [None, 2, 3, 4, 5, 6, 7, 8, None], "b": [None, 5, None, 3, 2, 1, None, 0, 0], "c": [1, 5, 1, 3, 2, 1, 1, 0, 0]}, ... ) >>> psdf a b c 0 NaN NaN 1 1 2.0 5.0 5 2 3.0 NaN 1 3 4.0 3.0 3 4 5.0 2.0 2 5 6.0 1.0 1 6 7.0 NaN 1 7 8.0 0.0 0 8 NaN 0.0 0 >>> other = [1, 2, None] >>> psdf.isin(other) a b c 0 None None True 1 True None None 2 None None True 3 None None None 4 None True True 5 None True True 6 None None True 7 None None None 8 None None None >>> psdf.to_pandas().isin(other) a b c 0 False False True 1 True False False 2 False False True 3 False False False 4 False True True 5 False True True 6 False False True 7 False False False 8 False False False ``` ### Does this PR introduce _any_ user-facing change? After this PR ``` python >>> psdf = ps.DataFrame( ... {"a": [None, 2, 3, 4, 5, 6, 7, 8, None], "b": [None, 5, None, 3, 2, 1, None, 0, 0], "c": [1, 5, 1, 3, 2, 1, 1, 0, 0]}, ... ) >>> psdf a b c 0 NaN NaN 1 1 2.0 5.0 5 2 3.0 NaN 1 3 4.0 3.0 3 4 5.0 2.0 2 5 6.0 1.0 1 6 7.0 NaN 1 7 8.0 0.0 0 8 NaN 0.0 0 >>> other = [1, 2, None] >>> psdf.isin(other) a b c 0 False False True 1 True False False 2 False False True 3 False False False 4 False True True 5 False True True 6 False False True 7 False False False 8 False False False ``` ### How was this patch tested? Unit tests Closes #34040 from dgd-contributor/SPARK-36785_dataframe.isin_fix. Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn> Signed-off-by: Takuya UESHIN <ueshin@databricks.com> (cherry picked from commit `cc182fe6f6`) Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-09-20 17:53:02 -07:00
Dongjoon Hyun	5d0e51e943	[SPARK-36806][K8S][R] Use R 4.0.4 in K8s R image ### What changes were proposed in this pull request? This PR aims to upgrade R from 3.6.3 to 4.0.4 in K8s R Docker image. ### Why are the changes needed? `openjdk:11-jre-slim` image is upgraded to `Debian 11`. ``` $ docker run -it openjdk:11-jre-slim cat /etc/os-release PRETTY_NAME="Debian GNU/Linux 11 (bullseye)" NAME="Debian GNU/Linux" VERSION_ID="11" VERSION="11 (bullseye)" VERSION_CODENAME=bullseye ID=debian HOME_URL="https://www.debian.org/" SUPPORT_URL="https://www.debian.org/support" BUG_REPORT_URL="https://bugs.debian.org/" ``` It causes `R 3.5` installation failures in our K8s integration test environment. - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47953/ ``` The following packages have unmet dependencies: r-base-core : Depends: libicu63 (>= 63.1-1~) but it is not installable Depends: libreadline7 (>= 6.0) but it is not installable E: Unable to correct problems, you have held broken packages. The command '/bin/sh -c apt-get update && apt install -y gnupg && echo "deb http://cloud.r-project.org/bin/linux/debian buster-cran35/" >> /etc/apt/sources.list && apt-key adv --keyserver keyserver.ubuntu.com --recv-key 'E19F5F87128899B192B1A2C2AD5F960A256A04AF' && apt-get update && apt install -y -t buster-cran35 r-base r-base-dev && rm -rf ``` ### Does this PR introduce _any_ user-facing change? Yes, this will recover the installation. ### How was this patch tested? Succeed to build SparkR docker image in the K8s integration test in Jenkins CI. - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47959/ ``` Successfully built 32e1a0cd5ff8 Successfully tagged kubespark/spark-r:3.3.0-SNAPSHOT_6e4f7e2d-054d-4978-812f-4f32fc546b51 ``` Closes #34048 from dongjoon-hyun/SPARK-36806. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit `a178752540`) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-09-20 14:38:32 -07:00
Angerszhuuuu	337a1979d2	[SPARK-36754][SQL] ArrayIntersect handle duplicated Double.NaN and Float.NaN ### What changes were proposed in this pull request? For query ``` select array_intersect(array(cast('nan' as double), 1d), array(cast('nan' as double))) ``` This returns [NaN], but it should return []. This issue is caused by `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` too. In this pr fix this based on https://github.com/apache/spark/pull/33955 ### Why are the changes needed? Fix bug ### Does this PR introduce _any_ user-facing change? ArrayIntersect won't show equal `NaN` value ### How was this patch tested? Added UT Closes #33995 from AngersZhuuuu/SPARK-36754. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `2fc7f2f702`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-09-20 16:51:31 +08:00
Gengliang Wang	b0249851f6	Preparing development version 3.2.1-SNAPSHOT	2021-09-18 11:30:12 +00:00
Gengliang Wang	96044e9735	Preparing Spark release v3.2.0-rc3	2021-09-18 11:30:06 +00:00
Ye Zhou	d4d8a6320f	[SPARK-36772] FinalizeShuffleMerge fails with an exception due to attempt id not matching ### What changes were proposed in this pull request? Remove the appAttemptId from TransportConf, and parsing through SparkEnv. ### Why are the changes needed? Push based shuffle will fail if there are any attemptId set in the SparkConf, as the attemptId is not set correctly in Driver. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Tested within our Yarn cluster. Without this PR, the Driver will fail to finalize the shuffle merge on all the mergers. After the patch, Driver can successfully finalize the shuffle merge and the push based shuffle can work fine. Also with unit test to verify the attemptId is being set in the BlockStoreClient in Driver. Closes #34018 from zhouyejoe/SPARK-36772. Authored-by: Ye Zhou <yezhou@linkedin.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit `cabc36b54d`) Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-09-18 15:52:14 +08:00
dgd-contributor	36ce9cce55	[SPARK-36762][PYTHON] Fix Series.isin when Series has NaN values ### What changes were proposed in this pull request? Fix Series.isin when Series has NaN values ### Why are the changes needed? Fix Series.isin when Series has NaN values ``` python >>> pser = pd.Series([None, 5, None, 3, 2, 1, None, 0, 0]) >>> psser = ps.from_pandas(pser) >>> pser.isin([1, 3, 5, None]) 0 False 1 True 2 False 3 True 4 False 5 True 6 False 7 False 8 False dtype: bool >>> psser.isin([1, 3, 5, None]) 0 None 1 True 2 None 3 True 4 None 5 True 6 None 7 None 8 None dtype: object ``` ### Does this PR introduce _any_ user-facing change? After this PR ``` python >>> pser = pd.Series([None, 5, None, 3, 2, 1, None, 0, 0]) >>> psser = ps.from_pandas(pser) >>> psser.isin([1, 3, 5, None]) 0 False 1 True 2 False 3 True 4 False 5 True 6 False 7 False 8 False dtype: bool ``` ### How was this patch tested? unit tests Closes #34005 from dgd-contributor/SPARK-36762_fix_series.isin_when_values_have_NaN. Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn> Signed-off-by: Takuya UESHIN <ueshin@databricks.com> (cherry picked from commit `32b8512912`) Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-09-17 17:48:27 -07:00
Liang-Chi Hsieh	275ad6bd0b	[SPARK-36673][SQL][FOLLOWUP] Remove duplicate test in DataFrameSetOperationsSuite ### What changes were proposed in this pull request? As a followup of #34025 to remove duplicate test. ### Why are the changes needed? To remove duplicate test. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing test. Closes #34032 from viirya/remove. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit `f9644cc253`) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2021-09-17 11:52:26 -07:00
Yang He	f093f3b939	[SPARK-36780][BUILD] Make `dev/mima` runs on Java 17 ### What changes were proposed in this pull request? Java 17 has been officially released. This PR makes `dev/mima` runs on Java 17. ### Why are the changes needed? To make tests pass on Java 17. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test. Closes #34022 from RabbidHY/SPARK-36780. Lead-authored-by: Yang He <stitch106hy@gmail.com> Co-authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit `5d0889bf36`) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-09-17 08:54:57 -07:00
Angerszhuuuu	61d7f1da1b	[SPARK-36767][SQL] ArrayMin/ArrayMax/SortArray/ArraySort add comment and Unit test ### What changes were proposed in this pull request? Add comment about how ArrayMin/ArrayMax/SortArray/ArraySort handle NaN and add Unit test for this ### Why are the changes needed? Add Unit test ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added UT Closes #34008 from AngersZhuuuu/SPARK-36740. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `69e006dd53`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-09-17 21:42:21 +08:00
Liang-Chi Hsieh	895218996a	[SPARK-36673][SQL] Fix incorrect schema of nested types of union ### What changes were proposed in this pull request? This patch proposes to fix incorrect schema of `union`. ### Why are the changes needed? The current `union` result of nested struct columns is incorrect. By definition of `union` API, it should resolve columns by position, not by name. Right now when determining the `output` (aka. the schema) of union plan, we use `merge` API which actually merges two structs (simply think it as concatenate fields from two structs if not overlapping). The merging behavior doesn't match the `union` definition. So currently we get incorrect schema but the query result is correct. We should fix the incorrect schema. ### Does this PR introduce _any_ user-facing change? Yes, fixing a bug of incorrect schema. ### How was this patch tested? Added unit test. Closes #34025 from viirya/SPARK-36673. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `cdd7ae937d`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-09-17 21:37:40 +08:00
Jungtaek Lim	af7dd18a5e	[SPARK-36764][SS][TEST] Fix race-condition on "ensure continuous stream is being used" in KafkaContinuousTest ### What changes were proposed in this pull request? The test “ensure continuous stream is being used“ in KafkaContinuousTest quickly checks the actual type of the execution, and stops the query. Stopping the streaming query in continuous mode is done by interrupting query execution thread and join with indefinite timeout. In parallel, started streaming query is going to generate execution plan, including running optimizer. Some parts of SessionState can be built at that time, as they are defined as lazy. The problem is, some of them seem to “swallow” the InterruptedException and let the thread run continuously. That said, the query can’t indicate whether there is a request on stopping query, so the query won’t stop. This PR fixes such scenario via ensuring that streaming query has started before the test stops the query. ### Why are the changes needed? Race-condition could end up with test hang till test framework marks it as timed-out. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #34004 from HeartSaVioR/SPARK-36764. Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `6099edc66e`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-09-17 21:28:19 +08:00
Angerszhuuuu	a78c4c44ed	[SPARK-36741][SQL] ArrayDistinct handle duplicated Double.NaN and Float.Nan ### What changes were proposed in this pull request? For query ``` select array_distinct(array(cast('nan' as double), cast('nan' as double))) ``` This returns [NaN, NaN], but it should return [NaN]. This issue is caused by `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` too. In this pr fix this based on https://github.com/apache/spark/pull/33955 ### Why are the changes needed? Fix bug ### Does this PR introduce _any_ user-facing change? ArrayDistinct won't show duplicated `NaN` value ### How was this patch tested? Added UT Closes #33993 from AngersZhuuuu/SPARK-36741. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `e356f6aa11`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-09-17 20:48:39 +08:00
Wenchen Fan	16215755b7	[SPARK-36789][SQL] Use the correct constant type as the null value holder in array functions ### What changes were proposed in this pull request? In array functions, we use constant 0 as the placeholder when adding a null value to an array buffer. This PR makes sure the constant 0 matches the type of the array element. ### Why are the changes needed? Fix a potential bug. Somehow we can hit this bug sometimes after https://github.com/apache/spark/pull/33955 . ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing tests Closes #34029 from cloud-fan/minor. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit `4145498826`) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-09-17 16:50:01 +09:00
Hyukjin Kwon	7d7c9915bb	[SPARK-36788][SQL] Change log level of AQE for non-supported plans from warning to debug ### What changes were proposed in this pull request? This PR suppresses the warnings for plans where AQE is not supported. Currently we show the warnings such as: ``` org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: Sort [a#324881 DESC NULLS FIRST], true, 23 +- Scan ExistingRDD[a#324881] ``` for every plan that AQE is not supported. ### Why are the changes needed? It's too noisy now. Below is the example of `SortSuite` run: ``` 14:51:40.675 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: Sort [a#324881 DESC NULLS FIRST], true, 23 +- Scan ExistingRDD[a#324881] . [info] - sorting on DayTimeIntervalType(0,1) with nullable=true, sortOrder=List('a DESC NULLS FIRST) (785 milliseconds) 14:51:41.416 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: ReferenceSort [a#324884 ASC NULLS FIRST], true +- Scan ExistingRDD[a#324884] . 14:51:41.467 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: Sort [a#324884 ASC NULLS FIRST], true, 23 +- Scan ExistingRDD[a#324884] . [info] - sorting on DayTimeIntervalType(0,1) with nullable=false, sortOrder=List('a ASC NULLS FIRST) (796 milliseconds) 14:51:42.210 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: ReferenceSort [a#324887 ASC NULLS LAST], true +- Scan ExistingRDD[a#324887] . 14:51:42.259 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: Sort [a#324887 ASC NULLS LAST], true, 23 +- Scan ExistingRDD[a#324887] . [info] - sorting on DayTimeIntervalType(0,1) with nullable=false, sortOrder=List('a ASC NULLS LAST) (797 milliseconds) 14:51:43.009 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: ReferenceSort [a#324890 DESC NULLS LAST], true +- Scan ExistingRDD[a#324890] . 14:51:43.061 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: Sort [a#324890 DESC NULLS LAST], true, 23 +- Scan ExistingRDD[a#324890] . [info] - sorting on DayTimeIntervalType(0,1) with nullable=false, sortOrder=List('a DESC NULLS LAST) (848 milliseconds) 14:51:43.857 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: ReferenceSort [a#324893 DESC NULLS FIRST], true +- Scan ExistingRDD[a#324893] . 14:51:43.903 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: Sort [a#324893 DESC NULLS FIRST], true, 23 +- Scan ExistingRDD[a#324893] . [info] - sorting on DayTimeIntervalType(0,1) with nullable=false, sortOrder=List('a DESC NULLS FIRST) (827 milliseconds) 14:51:44.682 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: ReferenceSort [a#324896 ASC NULLS FIRST], true +- Scan ExistingRDD[a#324896] . 14:51:44.748 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: Sort [a#324896 ASC NULLS FIRST], true, 23 +- Scan ExistingRDD[a#324896] . [info] - sorting on YearMonthIntervalType(0,1) with nullable=true, sortOrder=List('a ASC NULLS FIRST) (565 milliseconds) 14:51:45.248 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: ReferenceSort [a#324899 ASC NULLS LAST], true +- Scan ExistingRDD[a#324899] . 14:51:45.312 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: Sort [a#324899 ASC NULLS LAST], true, 23 +- Scan ExistingRDD[a#324899] . [info] - sorting on YearMonthIntervalType(0,1) with nullable=true, sortOrder=List('a ASC NULLS LAST) (591 milliseconds) 14:51:45.841 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: ReferenceSort [a#324902 DESC NULLS LAST], true +- Scan ExistingRDD[a#324902] . 14:51:45.905 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: Sort [a#324902 DESC NULLS LAST], true, 23 +- Scan ExistingRDD[a#324902] . ``` ### Does this PR introduce _any_ user-facing change? Yes, it will show less warnings to users. Note that AQE is enabled by default from Spark 3.2, see SPARK-33679 ### How was this patch tested? Manually tested via unittests. Closes #34026 from HyukjinKwon/minor-log-level. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit `917d7dad4d`) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-09-17 12:01:53 +09:00
Wenchen Fan	c1bfe1a5c4	[SPARK-36783][SQL] ScanOperation should not push Filter through nondeterministic Project ### What changes were proposed in this pull request? `ScanOperation` collects adjacent Projects and Filters. The caller side always assume that the collected Filters should run before collected Projects, which means `ScanOperation` effectively pushes Filter through Project. Following `PushPredicateThroughNonJoin`, we should not push Filter through nondeterministic Project. This PR fixes `ScanOperation` to follow this rule. ### Why are the changes needed? Fix a bug that violates the semantic of nondeterministic expressions. ### Does this PR introduce _any_ user-facing change? Most likely no change, but in some cases, this is a correctness bug fix which changes the query result. ### How was this patch tested? existing tests Closes #34023 from cloud-fan/scan. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `dfd5237c0c`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-09-17 10:51:29 +08:00
Josh Rosen	3502fda783	[SPARK-36774][CORE][TESTS] Move SparkSubmitTestUtils to core module and use it in SparkSubmitSuite ### What changes were proposed in this pull request? This PR refactors test code in order to improve the debugability of `SparkSubmitSuite`. The `sql/hive` module contains a `SparkSubmitTestUtils` helper class which launches `spark-submit` and captures its output in order to display better error messages when tests fail. This helper is currently used by `HiveSparkSubmitSuite` and `HiveExternalCatalogVersionsSuite`, but isn't used by `SparkSubmitSuite`. In this PR, I moved `SparkSubmitTestUtils` and `ProcessTestUtils` into the `core` module and updated `SparkSubmitSuite`, `BufferHolderSparkSubmitSuite`, and `WholestageCodegenSparkSubmitSuite` to use the relocated helper classes. This required me to change `SparkSubmitTestUtils` to make its timeouts configurable and to generalize its method for locating the `spark-submit` binary. ### Why are the changes needed? Previously, `SparkSubmitSuite` tests would fail with messages like: ``` [info] - launch simple application with spark-submit * FAILED * (1 second, 832 milliseconds) [info] Process returned with exit code 101. See the log4j logs for more detail. (SparkSubmitSuite.scala:1551) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) ``` which require the Spark developer to hunt in log4j logs in order to view the logs from the failed `spark-submit` command. After this change, those tests will fail with detailed error messages that include the text of failed command plus timestamped logs captured from the failed proces: ``` [info] - launch simple application with spark-submit * FAILED * (2 seconds, 800 milliseconds) [info] spark-submit returned with exit code 101. [info] Command line: '/Users/joshrosen/oss-spark/bin/spark-submit' '--class' 'invalidClassName' '--name' 'testApp' '--master' 'local' '--conf' 'spark.ui.enabled=false' '--conf' 'spark.master.rest.enabled=false' 'file:/Users/joshrosen/oss-spark/target/tmp/spark-0a8a0c93-3aaf-435d-9cf3-b97abd318d91/testJar-1631768004882.jar' [info] [info] 2021-09-15 21:53:26.041 - stderr> SLF4J: Class path contains multiple SLF4J bindings. [info] 2021-09-15 21:53:26.042 - stderr> SLF4J: Found binding in [jar:file:/Users/joshrosen/oss-spark/assembly/target/scala-2.12/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class] [info] 2021-09-15 21:53:26.042 - stderr> SLF4J: Found binding in [jar:file:/Users/joshrosen/.m2/repository/org/slf4j/slf4j-log4j12/1.7.30/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class] [info] 2021-09-15 21:53:26.042 - stderr> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. [info] 2021-09-15 21:53:26.042 - stderr> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] [info] 2021-09-15 21:53:26.619 - stderr> Error: Failed to load class invalidClassName. (SparkSubmitTestUtils.scala:97) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? I manually ran the affected test suites. Closes #34013 from JoshRosen/SPARK-36774-move-SparkSubmitTestUtils-to-core. Authored-by: Josh Rosen <joshrosen@databricks.com> Signed-off-by: Josh Rosen <joshrosen@databricks.com> (cherry picked from commit `3ae6e6775b`) Signed-off-by: Josh Rosen <joshrosen@databricks.com>	2021-09-16 14:31:22 -07:00
Dongjoon Hyun	fbd24621ce	[SPARK-36759][BUILD][FOLLOWUP] Update version in scala-2.12 profile and doc ### What changes were proposed in this pull request? This is a follow-up to fix the leftover during switching the Scala version. ### Why are the changes needed? This should be consistent. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This is not tested by UT. We need to check manually. There is no more `2.12.14`. ``` $ git grep 2.12.14 R/pkg/tests/fulltests/test_sparkSQL.R: c(as.Date("2012-12-14"), as.Date("2013-12-15"), as.Date("2014-12-16"))) data/mllib/ridge-data/lpsa.data:3.5307626,0.987291634724086 -0.36279314978779 -0.922212414640967 0.232904453212813 -0.522940888712441 1.79270085261407 0.342627053981254 1.26288870310799 sql/hive/src/test/resources/data/files/over10k:-3\|454\|65705\|4294967468\|62.12\|14.32\|true\|mike white\|2013-03-01 09:11:58.703087\|40.18\|joggying ``` Closes #34020 from dongjoon-hyun/SPARK-36759-2. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit `adbea252db`) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-09-16 05:11:05 -07:00
Kousuke Saruta	dda43fe5ee	[SPARK-36777][INFRA] Move Java 17 on GitHub Actions from EA to LTS release ### What changes were proposed in this pull request? This PR aims to move Java 17 on GA from early access release to LTS release. ### Why are the changes needed? Java 17 LTS was released a few days ago and it's available on GA. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? GA itself. Closes #34017 from sarutak/ga-java17. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Yuming Wang <yumwang@ebay.com> (cherry picked from commit `89a9456b13`) Signed-off-by: Yuming Wang <yumwang@ebay.com>	2021-09-16 18:05:04 +08:00
Gengliang Wang	d20ed030a8	[SPARK-36775][DOCS] Add documentation for ANSI store assignment rules ### What changes were proposed in this pull request? Add documentation for ANSI store assignment rules for - the valid source/target type combinations - runtime error will happen on numberic overflow ### Why are the changes needed? Better docs ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Build docs and preview: ![image](https://user-images.githubusercontent.com/1097932/133554600-8c80c0a9-8753-4c01-94d0-994d8082e319.png) Closes #34014 from gengliangwang/addStoreAssignDoc. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `ff7705ad2a`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-09-16 15:50:57 +08:00
Dongjoon Hyun	6352846f1c	[SPARK-36732][BUILD][FOLLOWUP] Fix dependency manifest	2021-09-15 23:38:48 -07:00
Dongjoon Hyun	63b8417794	[SPARK-36732][SQL][BUILD] Upgrade ORC to 1.6.11 ### What changes were proposed in this pull request? This PR aims to upgrade Apache ORC to 1.6.11 to bring the latest bug fixes. ### Why are the changes needed? Apache ORC 1.6.11 has the following fixes. - https://issues.apache.org/jira/projects/ORC/versions/12350499 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. Closes #33971 from dongjoon-hyun/SPARK-36732. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit `c217797297`) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-09-15 23:36:36 -07:00
Dongjoon Hyun	2067661869	[SPARK-36759][BUILD] Upgrade Scala to 2.12.15 ### What changes were proposed in this pull request? This PR aims to upgrade Scala to 2.12.15 to support Java 17/18 better. ### Why are the changes needed? Scala 2.12.15 improves compatibility with JDK 17 and 18: https://github.com/scala/scala/releases/tag/v2.12.15 - Avoids IllegalArgumentException in JDK 17+ for lambda deserialization - Upgrades to ASM 9.2, for JDK 18 support in optimizer ### Does this PR introduce _any_ user-facing change? Yes, this is a Scala version change. ### How was this patch tested? Pass the CIs Closes #33999 from dongjoon-hyun/SPARK-36759. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit `16f1f71ba5`) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-09-15 13:43:36 -07:00
Chao Sun	a7dc8242ea	[SPARK-36726] Upgrade Parquet to 1.12.1 ### What changes were proposed in this pull request? Upgrade Apache Parquet to 1.12.1 ### Why are the changes needed? Parquet 1.12.1 contains the following bug fixes: - PARQUET-2064: Make Range public accessible in RowRanges - PARQUET-2022: ZstdDecompressorStream should close `zstdInputStream` - PARQUET-2052: Integer overflow when writing huge binary using dictionary encoding - PARQUET-1633: Fix integer overflow - PARQUET-2054: fix TCP leaking when calling ParquetFileWriter.appendFile - PARQUET-2072: Do Not Determine Both Min/Max for Binary Stats - PARQUET-2073: Fix estimate remaining row count in ColumnWriteStoreBase - PARQUET-2078: Failed to read parquet file after writing with the same In particular PARQUET-2078 is a blocker for the upcoming Apache Spark 3.2.0 release. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests + a new test for the issue in SPARK-36696 Closes #33969 from sunchao/upgrade-parquet-12.1. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: DB Tsai <d_tsai@apple.com> (cherry picked from commit `a927b0836b`) Signed-off-by: DB Tsai <d_tsai@apple.com>	2021-09-15 19:17:49 +00:00
dgd-contributor	017bce7b11	[SPARK-36722][PYTHON] Fix Series.update with another in same frame ### What changes were proposed in this pull request? Fix Series.update with another in same frame also add test for update series in diff frame ### Why are the changes needed? Fix Series.update with another in same frame Pandas behavior: ``` python >>> pdf = pd.DataFrame( ... {"a": [None, 2, 3, 4, 5, 6, 7, 8, None], "b": [None, 5, None, 3, 2, 1, None, 0, 0]}, ... ) >>> pdf a b 0 NaN NaN 1 2.0 5.0 2 3.0 NaN 3 4.0 3.0 4 5.0 2.0 5 6.0 1.0 6 7.0 NaN 7 8.0 0.0 8 NaN 0.0 >>> pdf.a.update(pdf.b) >>> pdf a b 0 NaN NaN 1 5.0 5.0 2 3.0 NaN 3 3.0 3.0 4 2.0 2.0 5 1.0 1.0 6 7.0 NaN 7 0.0 0.0 8 0.0 0.0 ``` ### Does this PR introduce _any_ user-facing change? Before ```python >>> psdf = ps.DataFrame( ... {"a": [None, 2, 3, 4, 5, 6, 7, 8, None], "b": [None, 5, None, 3, 2, 1, None, 0, 0]}, ... ) >>> psdf.a.update(psdf.b) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Users/dgd/spark/python/pyspark/pandas/series.py", line 4551, in update combined = combine_frames(self._psdf, other._psdf, how="leftouter") File "/Users/dgd/spark/python/pyspark/pandas/utils.py", line 141, in combine_frames assert not same_anchor( AssertionError: We don't need to combine. `this` and `that` are same. >>> ``` After ```python >>> psdf = ps.DataFrame( ... {"a": [None, 2, 3, 4, 5, 6, 7, 8, None], "b": [None, 5, None, 3, 2, 1, None, 0, 0]}, ... ) >>> psdf.a.update(psdf.b) >>> psdf a b 0 NaN NaN 1 5.0 5.0 2 3.0 NaN 3 3.0 3.0 4 2.0 2.0 5 1.0 1.0 6 7.0 NaN 7 0.0 0.0 8 0.0 0.0 >>> ``` ### How was this patch tested? unit tests Closes #33968 from dgd-contributor/SPARK-36722_fix_update_same_anchor. Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn> Signed-off-by: Takuya UESHIN <ueshin@databricks.com> (cherry picked from commit `c15072cc73`) Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-09-15 11:08:12 -07:00
Angerszhuuuu	75bffd972d	[SPARK-36755][SQL] ArraysOverlap should handle duplicated Double.NaN and Float.NaN ### What changes were proposed in this pull request? For query ``` select arrays_overlap(array(cast('nan' as double), 1d), array(cast('nan' as double))) ``` This returns [false], but it should return [true]. This issue is caused by `scala.mutable.HashSet` can't handle `Double.NaN` and `Float.NaN`. ### Why are the changes needed? Fix bug ### Does this PR introduce _any_ user-facing change? arrays_overlap won't handle equal `NaN` value ### How was this patch tested? Added UT Closes #34006 from AngersZhuuuu/SPARK-36755. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `b665782f0d`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-09-15 22:32:18 +08:00
Angerszhuuuu	e64155691f	[SPARK-36702][SQL][FOLLOWUP] ArrayUnion handle duplicated Double.NaN and Float.NaN ### What changes were proposed in this pull request? According to https://github.com/apache/spark/pull/33955#discussion_r708570515 use normalized NaN ### Why are the changes needed? Use normalized NaN for duplicated NaN value ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Exiting UT Closes #34003 from AngersZhuuuu/SPARK-36702-FOLLOWUP. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `638085953f`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-09-15 22:04:24 +08:00

1 2 3 4 5 ...

31103 commits