### What changes were proposed in this pull request?
Migrating a Spark application from 2.4.x to 3.1.x revealed a difference in exception chaining behavior. When parsing a malformed CSV, the root cause should be `Caused by: java.lang.RuntimeException: Malformed CSV record`, but only the top-level exception is kept, and all lower-level exceptions and the root cause are lost. Thus, when we call ExceptionUtils.getRootCause on the exception, we just get the exception itself back.
The reason for the difference is that the RuntimeException is wrapped in BadRecordException, which has non-serializable fields. When we try to serialize the exception from tasks and deserialize it on the scheduler, the exception is lost.
This PR makes the non-serializable fields of BadRecordException transient, so the rest of the exception can be serialized and deserialized properly.
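A minimal, self-contained sketch of the idea (the class and field below are hypothetical stand-ins, not the real BadRecordException): marking the non-serializable member `@transient` lets the rest of the exception, including its cause chain, survive Java serialization between a task and the scheduler.
```
import java.io._

// Hypothetical wrapper exception with one non-serializable field (a closure).
class RecordParseException(
    @transient val partialResult: () => Option[AnyRef], // dropped during serialization
    cause: Throwable)
  extends Exception(cause)

object TransientDemo extends App {
  val original = new RecordParseException(() => None, new RuntimeException("Malformed CSV record"))

  // Round-trip through Java serialization, as happens between a task and the scheduler.
  val buffer = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(buffer)
  out.writeObject(original)
  out.close()
  val copy = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    .readObject().asInstanceOf[RecordParseException]

  println(copy.getCause)      // java.lang.RuntimeException: Malformed CSV record
  println(copy.partialResult) // null: the transient field is not carried over
}
```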
### Why are the changes needed?
Make BadRecordException serializable
### Does this PR introduce _any_ user-facing change?
Yes, users can now get the root cause of a BadRecordException.
### How was this patch tested?
Unit testing
Closes #34167 from tianhanhu/master.
Authored-by: tianhanhu <adrianhu96@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit aed977c468)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
We found an issue where a user configured both AQE and push-based shuffle, and the job started to hang after running some stages. We took thread dumps from the executors, which showed tasks still waiting to fetch shuffle blocks.
This PR proposes changes to fix the issue.
### What changes were proposed in this pull request?
Disabled batch fetch when push-based shuffle is enabled.
### Why are the changes needed?
Without this patch, enabling both AQE and push-based shuffle can cause tasks to hang.
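For context, a hedged sketch of the configuration combination involved (config keys as we understand them; verify against your Spark version and cluster setup, since push-based shuffle also requires YARN with the external shuffle service):
```
import org.apache.spark.SparkConf

// The combination that could hang before this patch: AQE plus push-based shuffle,
// with batched fetching of contiguous shuffle blocks still allowed.
val conf = new SparkConf()
  .set("spark.sql.adaptive.enabled", "true")                    // AQE
  .set("spark.shuffle.push.enabled", "true")                    // push-based shuffle
  .set("spark.sql.adaptive.fetchShuffleBlocksInBatch", "true")  // batch fetch; this fix disables it internally in that case
```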
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Tested the PR in our environment with the Spark shell; the queries are:
sql("""SELECT CASE WHEN rand() < 0.8 THEN 100 ELSE CAST(rand() * 30000000 AS INT) END AS s_item_id, CAST(rand() * 100 AS INT) AS s_quantity, DATE_ADD(current_date(), - CAST(rand() * 360 AS INT)) AS s_date FROM RANGE(1000000000)""").createOrReplaceTempView("sales")
// Dynamically coalesce partitions
sql("""SELECT s_date, sum(s_quantity) AS q FROM sales GROUP BY s_date ORDER BY q DESC""").collect
Unit tests to be added.
Closes #34156 from zhouyejoe/SPARK-36892.
Authored-by: Ye Zhou <yezhou@linkedin.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 31b6f614d3)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
Backport of https://github.com/apache/spark/pull/34180.
### What changes were proposed in this pull request?
This bug was introduced by https://github.com/apache/spark/pull/33177
When checking overflow of the sum value in the average function, we should use the `sumDataType` instead of the input decimal type.
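A hedged spark-shell illustration of the failure mode (values are illustrative; to our understanding the sum buffer for a DECIMAL(p, s) input uses the wider DECIMAL(p + 10, s) type):
```
// avg over DECIMAL(10, 2): each value fits the input type, but the running sum
// needs ~15 digits, so it only fits the wider sum type DECIMAL(20, 2).
// Checking overflow against the input type instead of sumDataType could give a wrong result.
spark.range(0, 100000)
  .selectExpr("CAST(99999999.99 AS DECIMAL(10, 2)) AS v")
  .selectExpr("avg(v)")
  .show()
```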
### Why are the changes needed?
fix a regression
### Does this PR introduce _any_ user-facing change?
Yes, the result was wrong before this PR.
### How was this patch tested?
a new test
Closes #34193 from cloud-fan/bug.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
- Make the val lazy wherever `isPushBasedShuffleEnabled` is invoked as a class instance variable, so that it is evaluated only after user-defined jars/classes in `spark.kryo.classesToRegister` have been downloaded and are available on the executor side, as part of the fix for the exception mentioned below (see the sketch after this list).
- Add a flag `checkSerializer` to control whether `isPushBasedShuffleEnabled` needs to check that the serializer `supportsRelocationOfSerializedObjects`, as part of the fix for the exception mentioned below. Specifically, we don't check this in `registerWithExternalShuffleServer()` in `BlockManager` and `createLocalDirsForMergedShuffleBlocks()` in `DiskBlockManager.scala`, as the same issue would arise otherwise.
- Move `instantiateClassFromConf` and `instantiateClass` from `SparkEnv` into `Utils`, in order to let `isPushBasedShuffleEnabled` leverage them for instantiating serializer instances.
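A minimal, self-contained sketch of the lazy-initialization idea (the class and names below are hypothetical, not the actual Spark code):
```
// The fragile serializer check runs on first use rather than at construction time,
// i.e. only after user jars from spark.kryo.classesToRegister are available.
class ShuffleSettings(check: () => Boolean) {
  lazy val pushBasedShuffleEnabled: Boolean = check() // previously a strict `val`
}

object LazyDemo extends App {
  val settings = new ShuffleSettings(() => { println("serializer check runs now"); true })
  println("constructed")                    // no check has run yet
  println(settings.pushBasedShuffleEnabled) // the check runs here, on first access
}
```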
### Why are the changes needed?
When a user tries to set classes for Kryo serialization via `spark.kryo.classesToRegister`, the exception below (or similar) is encountered in `isPushBasedShuffleEnabled`.
Reproduced the issue in our internal branch by launching spark-shell as:
```
spark-shell --spark-version 3.1.1 --packages ml.dmlc:xgboost4j_2.12:1.3.1 --conf spark.kryo.classesToRegister=ml.dmlc.xgboost4j.scala.Booster
```
```
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1911)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:61)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:393)
at org.apache.spark.executor.YarnCoarseGrainedExecutorBackend$.main(YarnCoarseGrainedExecutorBackend.scala:83)
at org.apache.spark.executor.YarnCoarseGrainedExecutorBackend.main(YarnCoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Failed to register classes with Kryo
at org.apache.spark.serializer.KryoSerializer.$anonfun$newKryo$5(KryoSerializer.scala:183)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:230)
at org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:171)
at org.apache.spark.serializer.KryoSerializer$$anon$1.create(KryoSerializer.scala:102)
at com.esotericsoftware.kryo.pool.KryoPoolQueueImpl.borrow(KryoPoolQueueImpl.java:48)
at org.apache.spark.serializer.KryoSerializer$PoolWrapper.borrow(KryoSerializer.scala:109)
at org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:346)
at org.apache.spark.serializer.KryoSerializerInstance.getAutoReset(KryoSerializer.scala:446)
at org.apache.spark.serializer.KryoSerializer.supportsRelocationOfSerializedObjects$lzycompute(KryoSerializer.scala:253)
at org.apache.spark.serializer.KryoSerializer.supportsRelocationOfSerializedObjects(KryoSerializer.scala:249)
at org.apache.spark.util.Utils$.isPushBasedShuffleEnabled(Utils.scala:2584)
at org.apache.spark.MapOutputTrackerWorker.<init>(MapOutputTracker.scala:1109)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:322)
at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:205)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$7(CoarseGrainedExecutorBackend.scala:442)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:62)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
... 4 more
Caused by: java.lang.ClassNotFoundException: ml.dmlc.xgboost4j.scala.Booster
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:217)
at org.apache.spark.serializer.KryoSerializer.$anonfun$newKryo$6(KryoSerializer.scala:174)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.serializer.KryoSerializer.$anonfun$newKryo$5(KryoSerializer.scala:173)
... 24 more
```
Registering user classes for Kryo serialization happens after serializer creation in SparkEnv. Serializer creation can also happen in `isPushBasedShuffleEnabled`, which can be called in some places before SparkEnv is created. Also, per JoshRosen's analysis, Kryo instantiation was probably failing because the added packages hadn't been downloaded to the executor yet (this code runs during executor startup, not task startup). The proposed change fixes this issue.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Passed existing tests.
Tested this patch in our internal branch where the user reported the issue. The issue is no longer reproducible with this patch.
Closes #34158 from rmcyang/SPARK-33781-bugFix.
Lead-authored-by: Minchu Yang <minyang@minyang-mn3.linkedin.biz>
Co-authored-by: Minchu Yang <31781684+rmcyang@users.noreply.github.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit e5b01cd823)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
### What changes were proposed in this pull request?
This PR fixes an issue where an ambiguous self-join can't be detected if the left and right DataFrames are swapped.
Here is an example:
```
val df1 = Seq((1, 2, "A1"),(2, 1, "A2")).toDF("key1", "key2", "value")
val df2 = df1.filter($"value" === "A2")
df1.join(df2, df1("key1") === df2("key2")) // Ambiguous self join is detected and AnalysisException is thrown.
df2.join(df1, df1("key1") === df2("key2")) // Ambiguous self join is not detected.
```
The root cause seems to be that the inner function `collectConflictPlans` in `DeduplicateRelations` doesn't copy the `dataset_id` tag when it copies a `LogicalPlan`.
### Why are the changes needed?
Bug fix.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New tests.
Closes #34172 from sarutak/fix-deduplication-issue.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit fa1805db48)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Fix `DataFrameGroupBy.apply` without shortcut.
Pandas' `DataFrameGroupBy.apply` sometimes behaves inconsistently when the UDF returns a `Series`, depending on whether there is only one group or more. E.g.:
```py
>>> pdf = pd.DataFrame(
... {"a": [1, 2, 3, 4, 5, 6], "b": [1, 1, 2, 3, 5, 8], "c": [1, 4, 9, 16, 25, 36]},
... columns=["a", "b", "c"],
... )
>>> pdf.groupby('b').apply(lambda x: x['a'])
b
1 0 1
1 2
2 2 3
3 3 4
5 4 5
8 5 6
Name: a, dtype: int64
>>> pdf[pdf['b'] == 1].groupby('b').apply(lambda x: x['a'])
a 0 1
b
1 1 2
```
If there is only one group, it returns a "wide" `DataFrame` instead of a `Series`.
In our non-shortcut path, there is always only one group because it is run via `groupby-applyInPandas`, so we get a `DataFrame` and should convert it to a `Series` ourselves.
### Why are the changes needed?
`DataFrameGroupBy.apply` without shortcut could raise an exception when it returns `Series`.
```py
>>> ps.options.compute.shortcut_limit = 3
>>> psdf = ps.DataFrame(
... {"a": [1, 2, 3, 4, 5, 6], "b": [1, 1, 2, 3, 5, 8], "c": [1, 4, 9, 16, 25, 36]},
... columns=["a", "b", "c"],
... )
>>> psdf.groupby("b").apply(lambda x: x["a"])
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
...
ValueError: Length mismatch: Expected axis has 2 elements, new values have 3 elements
```
### Does this PR introduce _any_ user-facing change?
The error above will be gone:
```py
>>> psdf.groupby("b").apply(lambda x: x["a"])
b
1 0 1
1 2
2 2 3
3 3 4
5 4 5
8 5 6
Name: a, dtype: int64
```
### How was this patch tested?
Added tests.
Closes #34160 from ueshin/issues/SPARK-36907/groupby-apply.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 38d39812c1)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR adds PySpark API document of `session_window`.
The docstring of the function doesn't comply with the numpydoc format, so this PR also fixes it.
Further, the API document of `window` doesn't have a `Parameters` section, so it's also added in this PR.
### Why are the changes needed?
To provide PySpark users with the API document of the newly added function.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
`make html` in `python/docs` and get the following docs.
[window]
![time-window-python-doc-after](https://user-images.githubusercontent.com/4736016/134963797-ce25b268-20ca-48e3-ac8d-cbcbd85ebb3e.png)
[session_window]
![session-window-python-doc-after](https://user-images.githubusercontent.com/4736016/134963853-dd9d8417-139b-41ee-9924-14544b1a91af.png)
Closes #34118 from sarutak/python-session-window-doc.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(cherry picked from commit 5a32e41e9c)
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
`sha2(input, bit_length)` returns incorrect results for all inputs when `bit_length == 224`.
This error can be reproduced by running `spark.sql("SELECT sha2('abc', 224)").show()`, for instance, in spark-shell.
Spark currently returns
```
#\t}"4�"�B�w��U�*��你���l��
```
while the expected result is
```
23097d223405d8228642a477bda255b32aadbce4bda0b3f7e36c9da7
```
This happens because `MessageDigest.digest()` returns bytes intended to be interpreted as a `BigInt` rather than a string. Thus, the output of `MessageDigest.digest()` must first be interpreted as a `BigInt` and then converted to a hex string, rather than being interpreted directly as a hex string.
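A small standalone Scala sketch of the conversion described above (an illustration of the approach, not the exact code added by this PR):
```
import java.math.BigInteger
import java.security.MessageDigest

object Sha224Demo extends App {
  val digest = MessageDigest.getInstance("SHA-224").digest("abc".getBytes("UTF-8"))

  // Interpret the digest bytes as an unsigned big integer, then render it as a
  // zero-padded hex string (224 bits = 56 hex characters).
  val hex = String.format("%056x", new BigInteger(1, digest))
  println(hex) // 23097d223405d8228642a477bda255b32aadbce4bda0b3f7e36c9da7
}
```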
### Why are the changes needed?
`sha2(input, bit_length)` with a `bit_length` input of `224` would previously return the incorrect result.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added a new test to `HashExpressionsSuite.scala` which previously failed and now passes.
Closes #34086 from richardc-db/sha224.
Authored-by: Richard Chen <r.chen@databricks.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 6c6291b3f6)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Add provided Guava dependency to `network-yarn` module.
### Why are the changes needed?
In Spark 3.1 and earlier, the `network-yarn` module implicitly relied on Guava from the `hadoop-client` dependency. This was changed by SPARK-33212 where we moved to the shaded Hadoop client, which no longer exposes the transitive Guava dependency. It stayed fine for a while since we were not using `createDependencyReducedPom`, so the module picked up the transitive dependency from `spark-network-common` instead. However, things started to break after SPARK-36835, where we restored `createDependencyReducedPom`, and now it is no longer able to locate Guava classes:
```
build/mvn test -pl common/network-yarn -Phadoop-3.2 -Phive-thriftserver -Pkinesis-asl -Pkubernetes -Pmesos -Pnetlib-lgpl -Pscala-2.12 -Pspark-ganglia-lgpl -Pyarn
...
[INFO] Compiling 1 Java source to /Users/sunchao/git/spark/common/network-yarn/target/scala-2.12/classes ...
[WARNING] [Warn] : bootstrap class path not set in conjunction with -source 8
[ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:32: package com.google.common.annotations does not exist
[ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:33: package com.google.common.base does not exist
[ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:34: package com.google.common.collect does not exist
[ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:118: cannot find symbol
symbol: class VisibleForTesting
location: class org.apache.spark.network.yarn.YarnShuffleService
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Tested with the above `mvn` command and it's now passing.
Closes #34125 from sunchao/SPARK-36873.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 53f58b6e51)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Currently, JAVA_HOME may improperly be set to the path "/usr"; with this change, JAVA_HOME is obtained from the "/usr/libexec/java_home" command on macOS.
### Why are the changes needed?
Command "./build/mvn xxx" will be stuck on MacOS 11.4, because JAVA_HOME is set to path "/usr" improperly.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
`build/mvn -DskipTests package` passed on `macOS 11.5.2`.
Closes #34111 from copperybean/work.
Authored-by: copperybean <copperybean.zhang@gmail.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 56c21e8e5a)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
This reverts commit aaa0d2a66b.
### Why are the changes needed?
This approach has 2 disadvantages:
1. It needs to disable `spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly`.
2. The filtering side will be evaluated 2 times. For example: https://github.com/apache/spark/pull/29726#issuecomment-780266596
Instead, we can use bloom filter join pruning in the future.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes #34116 from wangyum/revert-SPARK-32855.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
(cherry picked from commit e024bdc306)
Closes #34124 from wangyum/revert-SPARK-32855-branch-3.2.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Move `spark.yarn.isHadoopProvided` to Spark parent pom, so that under `resource-managers/yarn` we can make `hadoop-3.2` as the default profile.
### Why are the changes needed?
Currently under `resource-managers/yarn` there are 3 Maven profiles: `hadoop-provided`, `hadoop-2.7`, and `hadoop-3.2`, of which `hadoop-3.2` is activated by default (via `activeByDefault`). The activation, however, doesn't work when there are other explicitly activated profiles. Specifically, if users build Spark with `hadoop-provided`, Maven will fail because it can't find Hadoop 3.2 related dependencies, which are defined in the `hadoop-3.2` profile section.
To fix the issue, this proposes to move the `hadoop-provided` section to the parent pom. Currently this is only used to define a property `spark.yarn.isHadoopProvided`, and it shouldn't matter where we define it.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Tested via running the command:
```
build/mvn clean package -DskipTests -B -Pmesos -Pyarn -Pkubernetes -Pscala-2.12 -Phadoop-provided
```
which was failing before this PR but is succeeding with it.
Also checked active profiles with the command:
```
build/mvn -Pyarn -Phadoop-provided help:active-profiles
```
and it shows that `hadoop-3.2` is active for `spark-yarn` module now.
Closes #34110 from sunchao/SPARK-36835-followup2.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit f9efdeea8c)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Handle incorrect parsing of negative ANSI typed interval literals
[SPARK-36851](https://issues.apache.org/jira/browse/SPARK-36851)
### Why are the changes needed?
Incorrect result:
```
spark-sql> select interval -'1' year;
1-0
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Add a unit test case.
Closes #34107 from Peng-Lei/SPARK-36851.
Authored-by: PengLei <peng.8lei@gmail.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 0fdca1f0df)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Fix an issue where Maven may get stuck in an infinite loop when building Spark with the Hadoop 2.7 profile.
### Why are the changes needed?
After re-enabling `createDependencyReducedPom` for `maven-shade-plugin`, the Spark build stopped working for the Hadoop 2.7 profile and gets stuck in an infinite loop, likely due to a Maven shade plugin bug similar to https://issues.apache.org/jira/browse/MSHADE-148. This seems to be caused by the fact that, under the `hadoop-2.7` profile, the variables `hadoop-client-runtime.artifact` and `hadoop-client-api.artifact` are both `hadoop-client`, which triggers the issue.
As a workaround, this changes `hadoop-client-runtime.artifact` to be `hadoop-yarn-api` when using `hadoop-2.7`. Since `hadoop-yarn-api` is a dependency of `hadoop-client`, this essentially moves the former to the same level as the latter. It should have no effect as both are dependencies of Spark.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
N/A
Closes #34100 from sunchao/SPARK-36835-followup.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 937a74e6e7)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Improve the performance and memory usage of cleaning up stage UI data. The new code copies the essential fields (stage id, attempt id, completion time) into an array and determines which stage data and `RDDOperationGraphWrapper` entries need to be cleaned based on it.
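A small self-contained sketch of that idea (types and names here are hypothetical, not the actual AppStatusListener code):
```
// Only the fields needed to pick eviction candidates are copied out, so the
// cleanup decision never has to touch the full stage wrappers in the store.
case class StageKey(stageId: Int, attemptId: Int, completionTime: Long)

def stagesToClean(stages: Seq[StageKey], maxRetained: Int): Seq[StageKey] = {
  val countToDelete = math.max(stages.size - maxRetained, 0)
  // Evict the oldest completed stages first.
  stages.sortBy(_.completionTime).take(countToDelete)
}
```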
### Why are the changes needed?
Fix the memory usage issue described in https://issues.apache.org/jira/browse/SPARK-36827
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added a new unit test for the InMemoryStore.
Also ran a simple benchmark with:
```
// Benchmark snippet; `store`, `conf`, and `time` come from the surrounding test suite.
val testConf = conf.clone()
.set(MAX_RETAINED_STAGES, 1000)
val listener = new AppStatusListener(store, testConf, true)
val stages = (1 to 5000).map { i =>
val s = new StageInfo(i, 0, s"stage$i", 4, Nil, Nil, "details1",
resourceProfileId = ResourceProfile.DEFAULT_RESOURCE_PROFILE_ID)
s.submissionTime = Some(i.toLong)
s
}
listener.onJobStart(SparkListenerJobStart(4, time, Nil, null))
val start = System.nanoTime()
stages.foreach { s =>
time +=1
s.submissionTime = Some(time)
listener.onStageSubmitted(SparkListenerStageSubmitted(s, new Properties()))
s.completionTime = Some(time)
listener.onStageCompleted(SparkListenerStageCompleted(s))
}
println(System.nanoTime() - start)
```
Before changes:
InMemoryStore: 1.2s
After changes:
InMemoryStore: 0.23s
Closes #34092 from gengliangwang/cleanStage.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 7ac0a2c37b)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
InSet should handle NaN
```
InSet(Literal(Double.NaN), Set(Double.NaN, 1d)) should return true, but returns false.
```
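The underlying issue is that `NaN` is never `==`-equal to itself, so a plain set lookup misses it. A minimal plain-Scala sketch of the special-casing idea (hypothetical helper, not the actual InSet code):
```
// NaN breaks ordinary equality, but compare/isNaN can still identify it.
println(Double.NaN == Double.NaN)                              // false
println(java.lang.Double.compare(Double.NaN, Double.NaN) == 0) // true

// Hypothetical lookup that treats NaN as equal to NaN before falling back to the set.
def containsWithNaN(set: Set[Any], v: Any): Boolean = v match {
  case d: Double if d.isNaN =>
    set.exists {
      case x: Double => x.isNaN
      case _ => false
    }
  case f: Float if f.isNaN =>
    set.exists {
      case x: Float => x.isNaN
      case _ => false
    }
  case other => set.contains(other)
}

println(containsWithNaN(Set[Any](Double.NaN, 1d), Double.NaN)) // true
```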
### Why are the changes needed?
InSet should handle NaN
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added UT
Closes #34033 from AngersZhuuuu/SPARK-36792.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 64f4bf47af)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR adds a check in the optimizer rule `CollapseProject` to avoid combining Project with Aggregate when the project list contains one or more correlated scalar subqueries that reference the output of the aggregate. Combining Project with Aggregate can lead to an invalid plan after correlated subquery rewrite. This is because correlated scalar subqueries' references are used as join conditions, which cannot host aggregate expressions.
For example
```sql
select (select sum(c2) from t where c1 = cast(s as int)) from (select sum(c2) s from t)
```
```
== Optimized Logical Plan ==
Aggregate [sum(c2)#10L AS scalarsubquery(s)#11L] <--- Aggregate has neither grouping nor aggregate expressions.
+- Project [sum(c2)#10L]
+- Join LeftOuter, (c1#2 = cast(sum(c2#3) as int)) <--- Aggregate expression in join condition
:- LocalRelation [c2#3]
+- Aggregate [c1#2], [sum(c2#3) AS sum(c2)#10L, c1#2]
+- LocalRelation [c1#2, c2#3]
java.lang.UnsupportedOperationException: Cannot generate code for expression: sum(input[0, int, false])
```
Currently, we only allow a correlated scalar subquery in Aggregate if it is also in the grouping expressions.
079a9c5292/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala (L661-L666)
### Why are the changes needed?
To fix an existing optimizer issue.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test.
Authored-by: allisonwang-db <allison.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 4a8dc5f7a3)
Signed-off-by: allisonwang-db <allison.wang@databricks.com>
Closes #34081 from allisonwang-db/cp-spark-36747.
Authored-by: allisonwang-db <allison.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…utors using config instead of command line"
### What changes were proposed in this pull request?
This reverts commit 866df69c62.
### Why are the changes needed?
After the change, environment variables were not substituted in user classpath entries. Please find an example in SPARK-35672.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes #34088 from gengliangwang/revertSPARK-35672.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Enable `createDependencyReducedPom` for Spark's Maven shaded plugin so that the effective pom won't contain those shaded artifacts such as `org.eclipse.jetty`
### Why are the changes needed?
At the moment, the effective pom leaks transitive dependencies to downstream apps for those shaded artifacts, which potentially will cause issues.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
I manually tested and the `core/dependency-reduced-pom.xml` no longer contains dependencies such as `jetty-XX`.
Closes #34085 from sunchao/SPARK-36835.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit ed88e610f0)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
This is a follow-up of https://github.com/apache/spark/pull/34043. This PR proposes to handle only shuffle blocks in the separate thread pool and leave other blocks with the same behavior as before.
### Why are the changes needed?
To avoid any potential overhead.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass existing tests.
Closes #34076 from Ngone51/spark-36782-follow-up.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 9d8ac7c8e9)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
This PR fixes the 'options' description on `UnresolvedRelation`. This comment was added in https://github.com/apache/spark/pull/29535 but is not valid anymore because V1 also uses this `options` (and merges the options with the table properties) per https://github.com/apache/spark/pull/29712.
This PR can go through from `master` to `branch-3.1`.
### Why are the changes needed?
To make `UnresolvedRelation.options`'s description clearer.
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
Scala linter by `dev/linter-scala`.
Closes #34075 from HyukjinKwon/minor-comment-unresolved-releation.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Huaxin Gao <huaxin_gao@apple.com>
(cherry picked from commit 0076eba8d0)
Signed-off-by: Huaxin Gao <huaxin_gao@apple.com>
### What changes were proposed in this pull request?
Delegate the potentially blocking call to `mapOutputTracker.updateMapOutput` within `UpdateBlockInfo` from `dispatcher-BlockManagerMaster` to the thread pool to avoid blocking the endpoint. This code path is only taken for `ShuffleIndexBlockId`; other blocks are still handled on the `dispatcher-BlockManagerMaster` itself.
Change `updateBlockInfo` to return `Future[Boolean]` instead of `Boolean`. The response is sent to the RPC caller upon successful completion of the future.
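A minimal, self-contained sketch of that pattern using plain Scala Futures (names are hypothetical; the real code lives in BlockManagerMasterEndpoint):
```
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, ExecutionContextExecutorService, Future}

object AsyncUpdateDemo extends App {
  // Dedicated pool for updates that may block; the endpoint/dispatcher thread never waits on them.
  implicit val askThreadPool: ExecutionContextExecutorService =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(4))

  def updateBlockInfo(isShuffleIndexBlock: Boolean)(update: () => Boolean): Future[Boolean] =
    if (isShuffleIndexBlock) Future(update())  // potentially blocking, runs off the endpoint thread
    else Future.successful(update())           // fast path, handled inline as before

  // The RPC reply is sent only once the future completes.
  updateBlockInfo(isShuffleIndexBlock = true) { () => Thread.sleep(10); true }
    .foreach(success => println(s"reply($success)"))

  Thread.sleep(100)        // give the demo future time to complete
  askThreadPool.shutdown() // let the demo JVM exit
}
```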
Introduce a unit test that forces `MapOutputTracker` to make a broadcast as part of `MapOutputTracker.serializeOutputStatuses` when running decommission tests.
### Why are the changes needed?
[SPARK-36782](https://issues.apache.org/jira/browse/SPARK-36782) describes a deadlock occurring if the `dispatcher-BlockManagerMaster` is allowed to block while waiting for write access to data structures.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test as introduced in this PR.
---
Ping eejbyfeldt for notice.
Closes #34043 from f-thiele/SPARK-36782.
Lead-authored-by: Fabian A.J. Thiele <fabian.thiele@posteo.de>
Co-authored-by: Emil Ejbyfeldt <eejbyfeldt@liveintent.com>
Co-authored-by: Fabian A.J. Thiele <fthiele@liveintent.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 4ea54e8672)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
The PR fixes SPARK-36791 by replacing JHS_POST with JHS_HOST
### Why are the changes needed?
There is a spelling mistake in the running-on-yarn.md file where JHS_POST should be JHS_HOST.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Not needed for docs
Closes #34031 from jiaoqingbo/jiaoqingbo.
Authored-by: jiaoqb <jiaoqb@asiainfo.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 8a1a91bd71)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
For query
```
select array_except(array(cast('nan' as double), 1d), array(cast('nan' as double)))
```
This returns [NaN, 1d], but it should return [1d].
This issue is caused by `OpenHashSet` not being able to handle `Double.NaN` and `Float.NaN` either.
This PR fixes it based on https://github.com/apache/spark/pull/33955.
### Why are the changes needed?
Fix bug
### Does this PR introduce _any_ user-facing change?
Yes, `ArrayExcept` won't show equal `NaN` values anymore.
### How was this patch tested?
Added UT
Closes #33994 from AngersZhuuuu/SPARK-36753.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit a7cbe69986)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR fixes an issue where reading a Parquet file written with legacy mode would fail due to an incorrect Parquet LIST to ArrayType conversion.
The issue arises when using schema evolution and utilising the parquet-mr reader. 2-level LIST annotated types could be parsed incorrectly as 3-level LIST annotated types because their underlying element type does not match the full inferred Catalyst schema.
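For reference, a hedged spark-shell sketch of how to produce a file with the legacy 2-level LIST layout (this alone does not trigger the failure, which also needs schema evolution and the parquet-mr read path):
```
// Write an array column using the legacy (2-level) Parquet LIST layout.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")

import spark.implicits._
Seq(Seq(1, 2, 3), Seq(4, 5)).toDF("ids")
  .write.mode("overwrite").parquet("/tmp/legacy_list")

// Reading such files back with an evolved/merged schema could previously
// mis-detect the 2-level layout as the standard 3-level layout.
spark.read.parquet("/tmp/legacy_list").show()
```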
### Why are the changes needed?
It appears to be a long-standing issue with the legacy mode due to the imprecise check in ParquetRowConverter that was trying to determine Parquet backward compatibility using Catalyst schema: `DataType.equalsIgnoreCompatibleNullability(guessedElementType, elementType)` in https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala#L606.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added a new test case in ParquetInteroperabilitySuite.scala.
Closes #34044 from sadikovi/parquet-legacy-write-mode-list-issue.
Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit ec26d94eac)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Disable tests related to LZ4 in `FileSourceCodecSuite` and `FileSuite` when using `hadoop-2.7` profile.
### Why are the changes needed?
At the moment, parquet-mr uses the LZ4 compression codec provided by Hadoop, and only since HADOOP-17292 (in 3.3.1/3.4.0) did the latter add `lz4-java` to remove the restriction that the codec can only be run with the native library. As a consequence, the tests fail when using the `hadoop-2.7` profile.
### Does this PR introduce _any_ user-facing change?
No, it's just test.
### How was this patch tested?
Existing test
Closes #34066 from sunchao/SpARK-36820-3.2.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
Fix `pop` of Categorical Series to be consistent with the latest pandas (1.3.2) behavior.
This is a backport of https://github.com/apache/spark/pull/34052.
### Why are the changes needed?
As reported in https://github.com/databricks/koalas/issues/2198, pandas API on Spark behaves differently from pandas on `pop` of Categorical Series.
### Does this PR introduce _any_ user-facing change?
Yes, results of `pop` of Categorical Series change.
#### From
```py
>>> psser = ps.Series(["a", "b", "c", "a"], dtype="category")
>>> psser
0 a
1 b
2 c
3 a
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> psser.pop(0)
0
>>> psser
1 b
2 c
3 a
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> psser.pop(3)
0
>>> psser
1 b
2 c
dtype: category
Categories (3, object): ['a', 'b', 'c']
```
#### To
```py
>>> psser = ps.Series(["a", "b", "c", "a"], dtype="category")
>>> psser
0 a
1 b
2 c
3 a
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> psser.pop(0)
'a'
>>> psser
1 b
2 c
3 a
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> psser.pop(3)
'a'
>>> psser
1 b
2 c
dtype: category
Categories (3, object): ['a', 'b', 'c']
```
### How was this patch tested?
Unit tests.
Closes #34063 from xinrong-databricks/backport_cat_pop.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Remove `com.github.rdblue:brotli-codec:0.1.1` dependency.
### Why are the changes needed?
As Stephen Coy pointed out in the dev list, we should not have `com.github.rdblue:brotli-codec:0.1.1` dependency which is not available on Maven Central. This is to avoid possible artifact changes on `Jitpack.io`.
Also, the dependency is for tests only. I suggest that we remove it now to unblock the 3.2.0 release ASAP.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
GA tests.
Closes #34059 from gengliangwang/removeDeps.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit ba5708d944)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
In the PR, I propose to modify `StructType` to support merging of ANSI interval types with different fields.
### Why are the changes needed?
This will allow merging of schemas from different datasource files.
### Does this PR introduce _any_ user-facing change?
No, the ANSI interval types haven't been released yet.
### How was this patch tested?
Added new test to `StructTypeSuite`.
Closes #34049 from MaxGekk/merge-ansi-interval-types.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit d2340f8e1c)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This PR aims to upgrade R from 3.6.3 to 4.0.4 in K8s R Docker image.
### Why are the changes needed?
The `openjdk:11-jre-slim` image was upgraded to `Debian 11`.
```
$ docker run -it openjdk:11-jre-slim cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
```
It causes `R 3.5` installation failures in our K8s integration test environment.
- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47953/
```
The following packages have unmet dependencies:
r-base-core : Depends: libicu63 (>= 63.1-1~) but it is not installable
Depends: libreadline7 (>= 6.0) but it is not installable
E: Unable to correct problems, you have held broken packages.
The command '/bin/sh -c apt-get update && apt install -y gnupg && echo "deb http://cloud.r-project.org/bin/linux/debian buster-cran35/" >> /etc/apt/sources.list && apt-key adv --keyserver keyserver.ubuntu.com --recv-key 'E19F5F87128899B192B1A2C2AD5F960A256A04AF' && apt-get update &&
apt install -y -t buster-cran35 r-base r-base-dev && rm -rf
```
### Does this PR introduce _any_ user-facing change?
Yes, this will recover the installation.
### How was this patch tested?
Succeed to build SparkR docker image in the K8s integration test in Jenkins CI.
- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47959/
```
Successfully built 32e1a0cd5ff8
Successfully tagged kubespark/spark-r:3.3.0-SNAPSHOT_6e4f7e2d-054d-4978-812f-4f32fc546b51
```
Closes #34048 from dongjoon-hyun/SPARK-36806.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit a178752540)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
For query
```
select array_intersect(array(cast('nan' as double), 1d), array(cast('nan' as double)))
```
This returns [NaN], but it should return [].
This issue is caused by `OpenHashSet` not being able to handle `Double.NaN` and `Float.NaN` either.
This PR fixes it based on https://github.com/apache/spark/pull/33955.
### Why are the changes needed?
Fix bug
### Does this PR introduce _any_ user-facing change?
Yes, `ArrayIntersect` won't show equal `NaN` values anymore.
### How was this patch tested?
Added UT
Closes #33995 from AngersZhuuuu/SPARK-36754.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 2fc7f2f702)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Remove the appAttemptId from TransportConf, and pass it through SparkEnv instead.
### Why are the changes needed?
Push-based shuffle will fail if an attemptId is set in the SparkConf, as the attemptId is not set correctly in the Driver.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Tested within our YARN cluster. Without this PR, the Driver fails to finalize the shuffle merge on all the mergers. After the patch, the Driver can successfully finalize the shuffle merge and push-based shuffle works fine.
Also added a unit test to verify that the attemptId is set in the BlockStoreClient in the Driver.
Closes #34018 from zhouyejoe/SPARK-36772.
Authored-by: Ye Zhou <yezhou@linkedin.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit cabc36b54d)
Signed-off-by: Gengliang Wang <gengliang@apache.org>