Commit graph

31121 commits

Author SHA1 Message Date
Ye Zhou 88f4809142 [SPARK-36892][CORE] Disable batch fetch for a shuffle when push based shuffle is enabled
We found an issue where a user configured both AQE and push-based shuffle, and the job started to hang after running some stages. Thread dumps from the executors showed tasks still waiting to fetch shuffle blocks.
This PR proposes changes to fix the issue.

### What changes were proposed in this pull request?
Disabled batch fetch when push-based shuffle is enabled.
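
As an illustration (not the actual Spark internals, which presumably guard this inside the shuffle read path), these are the public configuration keys involved; the sketch below only shows the conflicting settings:

```
import org.apache.spark.SparkConf

// Hypothetical illustration only: the three public Spark 3.2 config keys involved.
val conf = new SparkConf()
  .set("spark.sql.adaptive.enabled", "true")                    // AQE
  .set("spark.sql.adaptive.fetchShuffleBlocksInBatch", "true")  // batch fetch of contiguous shuffle blocks
  .set("spark.shuffle.push.enabled", "true")                    // push-based shuffle
// With this patch, batch fetch is effectively skipped whenever push-based shuffle
// is enabled, so the two features no longer conflict and hang tasks.
```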

### Why are the changes needed?
Without this patch, enabling both AQE and push-based shuffle can cause tasks to hang.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Tested the PR internally with the Spark shell; the queries are:

sql("""SELECT CASE WHEN rand() < 0.8 THEN 100 ELSE CAST(rand() * 30000000 AS INT) END AS s_item_id, CAST(rand() * 100 AS INT) AS s_quantity, DATE_ADD(current_date(), - CAST(rand() * 360 AS INT)) AS s_date FROM RANGE(1000000000)""").createOrReplaceTempView("sales")
// Dynamically coalesce partitions
sql("""SELECT s_date, sum(s_quantity) AS q FROM sales GROUP BY s_date ORDER BY q DESC""").collect

Unit tests to be added.

Closes #34156 from zhouyejoe/SPARK-36892.

Authored-by: Ye Zhou <yezhou@linkedin.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 31b6f614d3)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-10-06 15:42:40 +08:00
Wenchen Fan 688808900d [SPARK-36926][3.2][SQL] Decimal average mistakenly overflow
backport https://github.com/apache/spark/pull/34180

### What changes were proposed in this pull request?

This bug was introduced by https://github.com/apache/spark/pull/33177

When checking overflow of the sum value in the average function, we should use the `sumDataType` instead of the input decimal type.
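
For context, a minimal sketch of the type widening involved, assuming Spark's documented decimal rules (the sum aggregate widens precision by 10, capped at the maximum of 38):

```
import org.apache.spark.sql.types.DecimalType

val input = DecimalType(10, 2)
// The running sum for sum()/avg() uses a wider decimal than the input column.
val sumDataType = DecimalType(
  math.min(input.precision + 10, DecimalType.MAX_PRECISION), input.scale)  // DECIMAL(20, 2)
// The overflow check on that running sum must be done against sumDataType,
// not the narrower input type, since a sum can legitimately exceed DECIMAL(10, 2).
```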

### Why are the changes needed?

fix a regression

### Does this PR introduce _any_ user-facing change?

Yes, the result was wrong before this PR.

### How was this patch tested?

a new test

Closes #34193 from cloud-fan/bug.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-10-06 13:31:03 +08:00
Minchu Yang c54229741a [SPARK-36705][FOLLOW-UP] Support the case when user's classes need to register for Kryo serialization
### What changes were proposed in this pull request?

- Make the val lazy wherever `isPushBasedShuffleEnabled` is invoked as a class instance variable, so that the check happens after the user-defined jars/classes in `spark.kryo.classesToRegister` are downloaded and available on the executor side, as part of the fix for the exception mentioned below (see the sketch after this list).

- Add a flag `checkSerializer` to control whether `isPushBasedShuffleEnabled` needs to check that the serializer `supportsRelocationOfSerializedObjects`, as part of the fix for the exception mentioned below. Specifically, we skip this check in `registerWithExternalShuffleServer()` in `BlockManager` and `createLocalDirsForMergedShuffleBlocks()` in `DiskBlockManager.scala`, as the same issue would arise otherwise.

- Move `instantiateClassFromConf` and `instantiateClass` from `SparkEnv` into `Utils`, in order to let `isPushBasedShuffleEnabled` leverage them for instantiating serializer instances.
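
A hypothetical sketch of the first item (class and helper names invented for illustration):

```
import org.apache.spark.SparkConf

class ShuffleComponent(conf: SparkConf) {
  // An eager `val` would instantiate the Kryo serializer during executor startup,
  // before the jars listed in spark.kryo.classesToRegister are downloaded, so class
  // registration would fail. A lazy val defers the check to first use.
  private lazy val pushBasedShuffleEnabled: Boolean = isPushBasedShuffleEnabled(conf)

  // Stand-in for the real helper in org.apache.spark.util.Utils, which additionally
  // verifies that the serializer supports relocation of serialized objects.
  private def isPushBasedShuffleEnabled(conf: SparkConf): Boolean =
    conf.getBoolean("spark.shuffle.push.enabled", defaultValue = false)
}
```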

### Why are the changes needed?

When a user tries to set classes for Kryo serialization via `spark.kryo.classesToRegister`, the exception below (or a similar one) would be encountered in `isPushBasedShuffleEnabled`.
We reproduced the issue in our internal branch by launching spark-shell as:
```
spark-shell --spark-version 3.1.1 --packages ml.dmlc:xgboost4j_2.12:1.3.1 --conf spark.kryo.classesToRegister=ml.dmlc.xgboost4j.scala.Booster
```

```
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1911)
	at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:61)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:393)
	at org.apache.spark.executor.YarnCoarseGrainedExecutorBackend$.main(YarnCoarseGrainedExecutorBackend.scala:83)
	at org.apache.spark.executor.YarnCoarseGrainedExecutorBackend.main(YarnCoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Failed to register classes with Kryo
	at org.apache.spark.serializer.KryoSerializer.$anonfun$newKryo$5(KryoSerializer.scala:183)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:230)
	at org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:171)
	at org.apache.spark.serializer.KryoSerializer$$anon$1.create(KryoSerializer.scala:102)
	at com.esotericsoftware.kryo.pool.KryoPoolQueueImpl.borrow(KryoPoolQueueImpl.java:48)
	at org.apache.spark.serializer.KryoSerializer$PoolWrapper.borrow(KryoSerializer.scala:109)
	at org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:346)
	at org.apache.spark.serializer.KryoSerializerInstance.getAutoReset(KryoSerializer.scala:446)
	at org.apache.spark.serializer.KryoSerializer.supportsRelocationOfSerializedObjects$lzycompute(KryoSerializer.scala:253)
	at org.apache.spark.serializer.KryoSerializer.supportsRelocationOfSerializedObjects(KryoSerializer.scala:249)
	at org.apache.spark.util.Utils$.isPushBasedShuffleEnabled(Utils.scala:2584)
	at org.apache.spark.MapOutputTrackerWorker.<init>(MapOutputTracker.scala:1109)
	at org.apache.spark.SparkEnv$.create(SparkEnv.scala:322)
	at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:205)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$7(CoarseGrainedExecutorBackend.scala:442)
	at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:62)
	at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
	... 4 more
Caused by: java.lang.ClassNotFoundException: ml.dmlc.xgboost4j.scala.Booster
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:348)
	at org.apache.spark.util.Utils$.classForName(Utils.scala:217)
	at org.apache.spark.serializer.KryoSerializer.$anonfun$newKryo$6(KryoSerializer.scala:174)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.serializer.KryoSerializer.$anonfun$newKryo$5(KryoSerializer.scala:173)
	... 24 more
```
Registering user classes for Kryo serialization happens after serializer creation in SparkEnv. Serializer creation can happen in `isPushBasedShuffleEnabled`, which can be called in some places before SparkEnv is created. Also, as per the analysis by JoshRosen, Kryo instantiation was probably failing because the added packages hadn't been downloaded to the executor yet (this code runs during executor startup, not task startup). The proposed change fixes this issue.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Passed existing tests.
Tested this patch in our internal branch where user reported the issue. Issue is now not reproducible with this patch.

Closes #34158 from rmcyang/SPARK-33781-bugFix.

Lead-authored-by: Minchu Yang <minyang@minyang-mn3.linkedin.biz>
Co-authored-by: Minchu Yang <31781684+rmcyang@users.noreply.github.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit e5b01cd823)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
2021-10-05 12:06:50 -05:00
Kousuke Saruta 8ffe00e745 [SPARK-36874][SQL] DeduplicateRelations should copy dataset_id tag to avoid ambiguous self join
### What changes were proposed in this pull request?

This PR fixes an issue where an ambiguous self-join can't be detected if the left and right DataFrames are swapped.
Here is an example:
```
val df1 = Seq((1, 2, "A1"),(2, 1, "A2")).toDF("key1", "key2", "value")
val df2 = df1.filter($"value" === "A2")

df1.join(df2, df1("key1") === df2("key2")) // Ambiguous self join is detected and AnalysisException is thrown.

df2.join(df1, df1("key1") === df2("key2")) // Ambiguous self join is not detected.
```

The root cause seems to be that an inner function `collectConflictPlans` in `DeduplicateRelations` doesn't copy the `dataset_id` tag when it copies a `LogicalPlan`.

### Why are the changes needed?

Bug fix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New tests.

Closes #34172 from sarutak/fix-deduplication-issue.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit fa1805db48)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-10-05 11:17:12 +08:00
Takuya UESHIN 04a19963e3 [SPARK-36907][PYTHON] Fix DataFrameGroupBy.apply without shortcut
### What changes were proposed in this pull request?

Fix `DataFrameGroupBy.apply` without shortcut.

Pandas' `DataFrameGroupBy.apply` sometimes behaves unexpectedly when the UDF returns a `Series`, depending on whether there is only one group or more. E.g.:

```py
>>> pdf = pd.DataFrame(
...      {"a": [1, 2, 3, 4, 5, 6], "b": [1, 1, 2, 3, 5, 8], "c": [1, 4, 9, 16, 25, 36]},
...      columns=["a", "b", "c"],
... )

>>> pdf.groupby('b').apply(lambda x: x['a'])
b
1  0    1
   1    2
2  2    3
3  3    4
5  4    5
8  5    6
Name: a, dtype: int64
>>> pdf[pdf['b'] == 1].groupby('b').apply(lambda x: x['a'])
a  0  1
b
1  1  2
```

If there is only one group, it returns a "wide" `DataFrame` instead of `Series`.

In our non-shortcut path, there is always only one group because it is run via `groupby-applyInPandas`, so we get a `DataFrame` and should then convert it to a `Series` ourselves.

### Why are the changes needed?

`DataFrameGroupBy.apply` without shortcut could raise an exception when it returns `Series`.

```py
>>> ps.options.compute.shortcut_limit = 3
>>> psdf = ps.DataFrame(
...     {"a": [1, 2, 3, 4, 5, 6], "b": [1, 1, 2, 3, 5, 8], "c": [1, 4, 9, 16, 25, 36]},
...     columns=["a", "b", "c"],
... )
>>> psdf.groupby("b").apply(lambda x: x["a"])
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
...
ValueError: Length mismatch: Expected axis has 2 elements, new values have 3 elements
```

### Does this PR introduce _any_ user-facing change?

The error above will be gone:

```py
>>> psdf.groupby("b").apply(lambda x: x["a"])
b
1  0    1
   1    2
2  2    3
3  3    4
5  4    5
8  5    6
Name: a, dtype: int64
```

### How was this patch tested?

Added tests.

Closes #34160 from ueshin/issues/SPARK-36907/groupby-apply.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 38d39812c1)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-10-03 12:25:27 +09:00
Kousuke Saruta 8b2b6bb0d3 [SPARK-36865][PYTHON][DOCS] Add PySpark API document of session_window
### What changes were proposed in this pull request?

This PR adds PySpark API document of `session_window`.
The docstring of the function doesn't comply with the numpydoc format, so this PR also fixes it.
Further, the API document of `window` doesn't have a `Parameters` section, so it's also added in this PR.

### Why are the changes needed?

To provide PySpark users with the API document of the newly added function.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Ran `make html` in `python/docs` and got the following docs.

[window]
![time-window-python-doc-after](https://user-images.githubusercontent.com/4736016/134963797-ce25b268-20ca-48e3-ac8d-cbcbd85ebb3e.png)

[session_window]
![session-window-python-doc-after](https://user-images.githubusercontent.com/4736016/134963853-dd9d8417-139b-41ee-9924-14544b1a91af.png)

Closes #34118 from sarutak/python-session-window-doc.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(cherry picked from commit 5a32e41e9c)
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-09-30 16:51:27 +09:00
Hyukjin Kwon 939c4d93b5 [MINOR][DOCS] Mention other Python dependency tools in documentation
### What changes were proposed in this pull request?

Self-contained.

### Why are the changes needed?

To give users more information on the available Python dependency management options in PySpark.

### Does this PR introduce _any_ user-facing change?
Yes, documentation change.

### How was this patch tested?
Manually built the docs and checked the results:
<img width="918" alt="Screen Shot 2021-09-29 at 10 11 56 AM" src="https://user-images.githubusercontent.com/6477701/135186536-2f271378-d06b-4c6b-a4be-691ce395db9f.png">
<img width="976" alt="Screen Shot 2021-09-29 at 10 12 22 AM" src="https://user-images.githubusercontent.com/6477701/135186541-0f4c5615-bc49-48e2-affd-dc2f5c0334bf.png">
<img width="920" alt="Screen Shot 2021-09-29 at 10 12 42 AM" src="https://user-images.githubusercontent.com/6477701/135186551-0b613096-7c86-4562-b345-ddd60208367b.png">

Closes #34134 from HyukjinKwon/minor-docs-py-deps.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 13c2b711e4)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-29 14:46:18 +09:00
Gengliang Wang 4bd358474b Preparing development version 3.2.1-SNAPSHOT 2021-09-28 10:53:42 +00:00
Gengliang Wang dde73e2e1c Preparing Spark release v3.2.0-rc6 2021-09-28 10:53:35 +00:00
Richard Chen 493aad03ab [SPARK-36836][SQL] Fix incorrect result in sha2 expression
### What changes were proposed in this pull request?

`sha2(input, bit_length)` returns incorrect results for all inputs when `bit_length == 224`.
This error can be reproduced by running `spark.sql("SELECT sha2('abc', 224)").show()`, for instance, in spark-shell.

Spark currently returns
```
#\t}"4�"�B�w��U�*��你���l��
```
while the expected result is
```
23097d223405d8228642a477bda255b32aadbce4bda0b3f7e36c9da7
```

This happens because `MessageDigest.digest()` returns bytes intended to be interpreted as a `BigInt` rather than a string. Thus, the output of `MessageDigest.digest()` must first be interpreted as a `BigInt` and then transformed into a hex string, rather than being interpreted directly as a hex string.
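
A standalone sketch of the intended conversion using plain JDK APIs (not the actual Spark code path): interpret the digest bytes as an unsigned `BigInteger` and format them as zero-padded hex (SHA-224 yields 28 bytes, i.e. 56 hex characters).

```
import java.math.BigInteger
import java.security.MessageDigest

val digest = MessageDigest.getInstance("SHA-224").digest("abc".getBytes("UTF-8"))
// Unsigned BigInteger interpretation, then zero-padded hex formatting.
val hex = String.format("%056x", new BigInteger(1, digest))
println(hex)  // 23097d223405d8228642a477bda255b32aadbce4bda0b3f7e36c9da7
```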

### Why are the changes needed?

`sha2(input, bit_length)` with a `bit_length` input of `224` would previously return the incorrect result.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added a new test to `HashExpressionsSuite.scala` which previously failed and now passes.

Closes #34086 from richardc-db/sha224.

Authored-by: Richard Chen <r.chen@databricks.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 6c6291b3f6)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-09-28 18:38:33 +08:00
Chao Sun cfc1bc295a [SPARK-36873][BUILD][TEST-MAVEN] Add provided Guava dependency for network-yarn module
### What changes were proposed in this pull request?

Add provided Guava dependency to `network-yarn` module.

### Why are the changes needed?

In Spark 3.1 and earlier, the `network-yarn` module implicitly relied on Guava from the `hadoop-client` dependency. This was changed by SPARK-33212, where we moved to the shaded Hadoop client, which no longer exposes the transitive Guava dependency. It stayed fine for a while since we were not using `createDependencyReducedPom`, so it picked up the transitive dependency from `spark-network-common` instead. However, things started to break after SPARK-36835, where we restored `createDependencyReducedPom`, and now it is no longer able to locate Guava classes:

```
build/mvn test -pl common/network-yarn -Phadoop-3.2 -Phive-thriftserver -Pkinesis-asl -Pkubernetes -Pmesos -Pnetlib-lgpl -Pscala-2.12 -Pspark-ganglia-lgpl -Pyarn
...
[INFO] Compiling 1 Java source to /Users/sunchao/git/spark/common/network-yarn/target/scala-2.12/classes ...
[WARNING] [Warn] : bootstrap class path not set in conjunction with -source 8
[ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:32: package com.google.common.annotations does not exist
[ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:33: package com.google.common.base does not exist
[ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:34: package com.google.common.collect does not exist
[ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:118: cannot find symbol
  symbol:   class VisibleForTesting
  location: class org.apache.spark.network.yarn.YarnShuffleService
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Tested with the above `mvn` command and it's now passing.

Closes #34125 from sunchao/SPARK-36873.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 53f58b6e51)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-09-28 18:23:43 +08:00
copperybean 33fb9693e8 [SPARK-36856][BUILD] Get correct JAVA_HOME for macOS
### What changes were proposed in this pull request?
Currently, JAVA_HOME may be set improperly to the path "/usr"; now JAVA_HOME is fetched from the command "/usr/libexec/java_home" on macOS.

### Why are the changes needed?
Command "./build/mvn xxx" will be stuck on MacOS 11.4, because JAVA_HOME is set to path "/usr" improperly.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
`build/mvn -DskipTests package` passed on `macOS 11.5.2`.

Closes #34111 from copperybean/work.

Authored-by: copperybean <copperybean.zhang@gmail.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 56c21e8e5a)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-09-28 17:27:18 +08:00
Yuming Wang 8f0c846b1d Revert "[SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type"
### What changes were proposed in this pull request?

This reverts commit aaa0d2a66b.

### Why are the changes needed?

This approach has two disadvantages:
1. It needs to disable `spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly`.
2. The filtering side will be evaluated twice. For example: https://github.com/apache/spark/pull/29726#issuecomment-780266596

Instead, we can use bloom filter join pruning in the future.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #34116 from wangyum/revert-SPARK-32855.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>

(cherry picked from commit e024bdc306)

Closes #34124 from wangyum/revert-SPARK-32855-branch-3.2.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-28 15:38:35 +08:00
Gengliang Wang 0c57bb8f7f Preparing development version 3.2.1-SNAPSHOT 2021-09-27 08:24:50 +00:00
Gengliang Wang 49aea14c5a Preparing Spark release v3.2.0-rc5 2021-09-27 08:24:44 +00:00
Chao Sun 228d12e30e [SPARK-36835][FOLLOWUP][BUILD][TEST-HADOOP2.7] Move spark.yarn.isHadoopProvided to parent pom
### What changes were proposed in this pull request?

Move `spark.yarn.isHadoopProvided` to Spark parent pom, so that under `resource-managers/yarn` we can make `hadoop-3.2` as the default profile.

### Why are the changes needed?

Currently under `resource-managers/yarn` there are 3 Maven profiles: `hadoop-provided`, `hadoop-2.7`, and `hadoop-3.2`, of which `hadoop-3.2` is activated by default (via `activeByDefault`). The activation, however, doesn't work when there are other explicitly activated profiles. Specifically, if users build Spark with `hadoop-provided`, Maven will fail because it can't find the Hadoop 3.2 related dependencies, which are defined in the `hadoop-3.2` profile section.

To fix the issue, this PR proposes to move the `hadoop-provided` section to the parent pom. Currently it is only used to define the property `spark.yarn.isHadoopProvided`, and it shouldn't matter where we define it.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Tested via running the command:
```
build/mvn clean package -DskipTests -B -Pmesos -Pyarn -Pkubernetes -Pscala-2.12 -Phadoop-provided
```
which was failing before this PR but is succeeding with it.

Also checked active profiles with the command:
```
build/mvn -Pyarn -Phadoop-provided help:active-profiles
```
and it shows that `hadoop-3.2` is active for `spark-yarn` module now.

Closes #34110 from sunchao/SPARK-36835-followup2.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit f9efdeea8c)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-09-27 15:17:28 +08:00
Gengliang Wang 2348cce37e Preparing development version 3.2.1-SNAPSHOT 2021-09-26 12:28:46 +00:00
Gengliang Wang 2ed8c08c5b Preparing Spark release v3.2.0-rc5 2021-09-26 12:28:40 +00:00
PengLei eb794a4f58 [SPARK-36851][SQL] Incorrect parsing of negative ANSI typed interval literals
### What changes were proposed in this pull request?
Handle incorrect parsing of negative ANSI typed interval literals
[SPARK-36851](https://issues.apache.org/jira/browse/SPARK-36851)

### Why are the changes needed?
Incorrect result:
```
spark-sql> select interval -'1' year;
1-0
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added a unit test case.

Closes #34107 from Peng-Lei/SPARK-36851.

Authored-by: PengLei <peng.8lei@gmail.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 0fdca1f0df)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-09-26 18:43:38 +08:00
Chao Sun 540e45c3cc [SPARK-36835][FOLLOWUP][BUILD][TEST-HADOOP2.7] Fix maven issue for Hadoop 2.7 profile after enabling dependency reduced pom
### What changes were proposed in this pull request?

Fix an issue where Maven may get stuck in an infinite loop when building Spark with the Hadoop 2.7 profile.

### Why are the changes needed?

After re-enabling `createDependencyReducedPom` for `maven-shade-plugin`, the Spark build stopped working for the Hadoop 2.7 profile and would get stuck in an infinite loop, likely due to a Maven shade plugin bug similar to https://issues.apache.org/jira/browse/MSHADE-148. This seems to be caused by the fact that, under the `hadoop-2.7` profile, the variables `hadoop-client-runtime.artifact` and `hadoop-client-api.artifact` are both `hadoop-client`, which triggers the issue.

As a workaround, this changes `hadoop-client-runtime.artifact` to be `hadoop-yarn-api` when using `hadoop-2.7`. Since `hadoop-yarn-api` is a dependency of `hadoop-client`, this essentially moves the former to the same level as the latter. It should have no effect as both are dependencies of Spark.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes #34100 from sunchao/SPARK-36835-followup.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 937a74e6e7)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-09-26 13:39:50 +08:00
Gengliang Wang da722d43cb Preparing development version 3.2.1-SNAPSHOT 2021-09-24 10:03:23 +00:00
Gengliang Wang 9e35703211 Preparing Spark release v3.2.0-rc5 2021-09-24 10:03:16 +00:00
Gengliang Wang 3fff405c95 [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data
### What changes were proposed in this pull request?

Improve the performance and memory usage of cleaning up stage UI data. The new code copies the essential fields (stage id, attempt id, completion time) into an array and determines which stage data and `RDDOperationGraphWrapper` entries need to be cleaned up based on it.
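
A hypothetical sketch of the idea (names invented): snapshot only the small fields needed for the eviction decision instead of repeatedly scanning the full stage objects in the KV store.

```
case class StageSnapshot(stageId: Int, attemptId: Int, completionTime: Long)

// Sort the lightweight snapshots once and evict the oldest-completed stages first.
def stagesToRemove(stages: Seq[StageSnapshot], count: Int): Seq[StageSnapshot] =
  stages.sortBy(_.completionTime).take(count)
```
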
### Why are the changes needed?

Fix the memory usage issue described in https://issues.apache.org/jira/browse/SPARK-36827

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added a new unit test for the InMemoryStore.
Also ran a simple benchmark with:
```
    val testConf = conf.clone()
      .set(MAX_RETAINED_STAGES, 1000)

    val listener = new AppStatusListener(store, testConf, true)
    val stages = (1 to 5000).map { i =>
      val s = new StageInfo(i, 0, s"stage$i", 4, Nil, Nil, "details1",
        resourceProfileId = ResourceProfile.DEFAULT_RESOURCE_PROFILE_ID)
      s.submissionTime = Some(i.toLong)
      s
    }
    listener.onJobStart(SparkListenerJobStart(4, time, Nil, null))
    val start = System.nanoTime()
    stages.foreach { s =>
      time +=1
      s.submissionTime = Some(time)
      listener.onStageSubmitted(SparkListenerStageSubmitted(s, new Properties()))
      s.completionTime = Some(time)
      listener.onStageCompleted(SparkListenerStageCompleted(s))
    }
    println(System.nanoTime() - start)
```

Before changes:
InMemoryStore: 1.2s

After changes:
InMemoryStore: 0.23s

Closes #34092 from gengliangwang/cleanStage.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 7ac0a2c37b)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-09-24 17:24:32 +08:00
Angerszhuuuu b7174188e5 [SPARK-36792][SQL] InSet should handle NaN
### What changes were proposed in this pull request?
InSet should handle NaN
```
InSet(Literal(Double.NaN), Set(Double.NaN, 1d)) should return true, but returns false.
```
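
For background, a plain Scala/JDK illustration (not Spark's `InSet` code) of why `NaN` needs special care in membership checks: primitive comparison says `NaN` is never equal to itself, while boxed equality and the total ordering treat two `NaN` values as equal, so different data structures can disagree.

```
val nan = Double.NaN
nan == nan                                                            // false: IEEE-754 comparison
java.lang.Double.compare(nan, nan) == 0                               // true: total ordering
java.lang.Double.valueOf(nan).equals(java.lang.Double.valueOf(nan))   // true: boxed equality
```
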
### Why are the changes needed?
InSet should handle NaN

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #34033 from AngersZhuuuu/SPARK-36792.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 64f4bf47af)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-24 16:19:47 +08:00
allisonwang-db d0c97d6ed9 [SPARK-36747][SQL][3.2] Do not collapse Project with Aggregate when correlated subqueries are present in the project list
### What changes were proposed in this pull request?

This PR adds a check in the optimizer rule `CollapseProject` to avoid combining Project with Aggregate when the project list contains one or more correlated scalar subqueries that reference the output of the aggregate. Combining Project with Aggregate can lead to an invalid plan after correlated subquery rewrite. This is because correlated scalar subqueries' references are used as join conditions, which cannot host aggregate expressions.

For example
```sql
select (select sum(c2) from t where c1 = cast(s as int)) from (select sum(c2) s from t)
```

```
== Optimized Logical Plan ==
Aggregate [sum(c2)#10L AS scalarsubquery(s)#11L] <--- Aggregate has neither grouping nor aggregate expressions.
+- Project [sum(c2)#10L]
   +- Join LeftOuter, (c1#2 = cast(sum(c2#3) as int))  <--- Aggregate expression in join condition
      :- LocalRelation [c2#3]
      +- Aggregate [c1#2], [sum(c2#3) AS sum(c2)#10L, c1#2]
         +- LocalRelation [c1#2, c2#3]

java.lang.UnsupportedOperationException: Cannot generate code for expression: sum(input[0, int, false])
```
Currently, we only allow a correlated scalar subquery in Aggregate if it is also in the grouping expressions.
079a9c5292/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala (L661-L666)

### Why are the changes needed?

To fix an existing optimizer issue.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test.

Authored-by: allisonwang-db <allison.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 4a8dc5f7a3)
Signed-off-by: allisonwang-db <allison.wang@databricks.com>

Closes #34081 from allisonwang-db/cp-spark-36747.

Authored-by: allisonwang-db <allison.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-24 16:14:49 +08:00
Gengliang Wang 09a8535cc4 Revert "[SPARK-35672][CORE][YARN] Pass user classpath entries to exec…
…utors using config instead of command line"

### What changes were proposed in this pull request?
This reverts commit 866df69c62.

### Why are the changes needed?
After the change, environment variables were not substituted in user classpath entries. Please find an example in SPARK-35672.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #34088 from gengliangwang/revertSPARK-35672.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-24 12:46:22 +09:00
Chao Sun 09283d3210 [SPARK-36835][BUILD] Enable createDependencyReducedPom for Maven shaded plugin
### What changes were proposed in this pull request?

Enable `createDependencyReducedPom` for Spark's Maven shade plugin so that the effective pom won't contain shaded artifacts such as `org.eclipse.jetty`.

### Why are the changes needed?

At the moment, the effective pom leaks transitive dependencies to downstream apps for those shaded artifacts, which potentially will cause issues.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

I manually tested and the `core/dependency-reduced-pom.xml` no longer contains dependencies such as `jetty-XX`.

Closes #34085 from sunchao/SPARK-36835.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit ed88e610f0)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-09-24 10:16:46 +08:00
Gengliang Wang 0fb7127f85 Preparing development version 3.2.1-SNAPSHOT 2021-09-23 08:46:28 +00:00
Gengliang Wang b609f2fe0c Preparing Spark release v3.2.0-rc4 2021-09-23 08:46:22 +00:00
yi.wu 0ad382747d [SPARK-36782][CORE][FOLLOW-UP] Only handle shuffle block in separate thread pool
### What changes were proposed in this pull request?

This is a follow-up of https://github.com/apache/spark/pull/34043. This PR proposes to handle only shuffle blocks in the separate thread pool and leave other blocks with the same behavior as before.

### Why are the changes needed?

To avoid any potential overhead.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass existing tests.

Closes #34076 from Ngone51/spark-36782-follow-up.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 9d8ac7c8e9)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-09-23 16:30:25 +08:00
Michael Chen 89894a4b1d [SPARK-36795][SQL] Explain Formatted has Duplicate Node IDs
Fixed explain formatted mode so it doesn't have duplicate node IDs when an InMemoryRelation is present in the query plan.

Having duplicated node IDs in the plan makes it confusing.

Yes, the explain formatted string will change.
Notice how `ColumnarToRow` and `InMemoryRelation` both have a node id of 2.
Before changes =>
```
== Physical Plan ==
AdaptiveSparkPlan (14)
+- == Final Plan ==
   * BroadcastHashJoin Inner BuildLeft (9)
   :- BroadcastQueryStage (5)
   :  +- BroadcastExchange (4)
   :     +- * Filter (3)
   :        +- * ColumnarToRow (2)
   :           +- InMemoryTableScan (1)
   :                 +- InMemoryRelation (2)
   :                       +- * ColumnarToRow (4)
   :                          +- Scan parquet default.t1 (3)
   +- * Filter (8)
      +- * ColumnarToRow (7)
         +- Scan parquet default.t2 (6)
+- == Initial Plan ==
   BroadcastHashJoin Inner BuildLeft (13)
   :- BroadcastExchange (11)
   :  +- Filter (10)
   :     +- InMemoryTableScan (1)
   :           +- InMemoryRelation (2)
   :                 +- * ColumnarToRow (4)
   :                    +- Scan parquet default.t1 (3)
   +- Filter (12)
      +- Scan parquet default.t2 (6)

(1) InMemoryTableScan
Output [1]: [k#x]
Arguments: [k#x], [isnotnull(k#x)]

(2) InMemoryRelation
Arguments: [k#x], CachedRDDBuilder(org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer401788d5,StorageLevel(disk, memory, deserialized, 1 replicas),*(1) ColumnarToRow
+- FileScan parquet default.t1[k#x] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/mike.chen/code/apacheSpark/spark/spark-warehouse/org.apach..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<k:int>
,None)

(3) Scan parquet default.t1
Output [1]: [k#x]
Batched: true
Location: InMemoryFileIndex [file:/Users/mike.chen/code/apacheSpark/spark/spark-warehouse/org.apache.spark.sql.ExplainSuiteAE/t1]
ReadSchema: struct<k:int>

(4) ColumnarToRow [codegen id : 1]
Input [1]: [k#x]

(5) BroadcastQueryStage
Output [1]: [k#x]
Arguments: 0

(6) Scan parquet default.t2
Output [1]: [key#x]
Batched: true
Location: InMemoryFileIndex [file:/Users/mike.chen/code/apacheSpark/spark/spark-warehouse/org.apache.spark.sql.ExplainSuiteAE/t2]
PushedFilters: [IsNotNull(key)]
ReadSchema: struct<key:int>

(7) ColumnarToRow
Input [1]: [key#x]

(8) Filter
Input [1]: [key#x]
Condition : isnotnull(key#x)

(9) BroadcastHashJoin [codegen id : 2]
Left keys [1]: [k#x]
Right keys [1]: [key#x]
Join condition: None

(10) Filter
Input [1]: [k#x]
Condition : isnotnull(k#x)

(11) BroadcastExchange
Input [1]: [k#x]
Arguments: HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [id=#x]

(12) Filter
Input [1]: [key#x]
Condition : isnotnull(key#x)

(13) BroadcastHashJoin
Left keys [1]: [k#x]
Right keys [1]: [key#x]
Join condition: None

(14) AdaptiveSparkPlan
Output [2]: [k#x, key#x]
Arguments: isFinalPlan=true
```

After Changes =>
```
== Physical Plan ==
AdaptiveSparkPlan (17)
+- == Final Plan ==
   * BroadcastHashJoin Inner BuildLeft (12)
   :- BroadcastQueryStage (8)
   :  +- BroadcastExchange (7)
   :     +- * Filter (6)
   :        +- * ColumnarToRow (5)
   :           +- InMemoryTableScan (1)
   :                 +- InMemoryRelation (2)
   :                       +- * ColumnarToRow (4)
   :                          +- Scan parquet default.t1 (3)
   +- * Filter (11)
      +- * ColumnarToRow (10)
         +- Scan parquet default.t2 (9)
+- == Initial Plan ==
   BroadcastHashJoin Inner BuildLeft (16)
   :- BroadcastExchange (14)
   :  +- Filter (13)
   :     +- InMemoryTableScan (1)
   :           +- InMemoryRelation (2)
   :                 +- * ColumnarToRow (4)
   :                    +- Scan parquet default.t1 (3)
   +- Filter (15)
      +- Scan parquet default.t2 (9)

(1) InMemoryTableScan
Output [1]: [k#x]
Arguments: [k#x], [isnotnull(k#x)]

(2) InMemoryRelation
Arguments: [k#x], CachedRDDBuilder(org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer3ccb12d,StorageLevel(disk, memory, deserialized, 1 replicas),*(1) ColumnarToRow
+- FileScan parquet default.t1[k#x] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/mike.chen/code/apacheSpark/spark/spark-warehouse/org.apach..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<k:int>
,None)

(3) Scan parquet default.t1
Output [1]: [k#x]
Batched: true
Location: InMemoryFileIndex [file:/Users/mike.chen/code/apacheSpark/spark/spark-warehouse/org.apache.spark.sql.ExplainSuiteAE/t1]
ReadSchema: struct<k:int>

(4) ColumnarToRow [codegen id : 1]
Input [1]: [k#x]

(5) ColumnarToRow [codegen id : 1]
Input [1]: [k#x]

(6) Filter [codegen id : 1]
Input [1]: [k#x]
Condition : isnotnull(k#x)

(7) BroadcastExchange
Input [1]: [k#x]
Arguments: HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [id=#x]

(8) BroadcastQueryStage
Output [1]: [k#x]
Arguments: 0

(9) Scan parquet default.t2
Output [1]: [key#x]
Batched: true
Location: InMemoryFileIndex [file:/Users/mike.chen/code/apacheSpark/spark/spark-warehouse/org.apache.spark.sql.ExplainSuiteAE/t2]
PushedFilters: [IsNotNull(key)]
ReadSchema: struct<key:int>

(10) ColumnarToRow
Input [1]: [key#x]

(11) Filter
Input [1]: [key#x]
Condition : isnotnull(key#x)

(12) BroadcastHashJoin [codegen id : 2]
Left keys [1]: [k#x]
Right keys [1]: [key#x]
Join condition: None

(13) Filter
Input [1]: [k#x]
Condition : isnotnull(k#x)

(14) BroadcastExchange
Input [1]: [k#x]
Arguments: HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [id=#x]

(15) Filter
Input [1]: [key#x]
Condition : isnotnull(key#x)

(16) BroadcastHashJoin
Left keys [1]: [k#x]
Right keys [1]: [key#x]
Join condition: None

(17) AdaptiveSparkPlan
Output [2]: [k#x, key#x]
Arguments: isFinalPlan=true
```

Added a test.

Closes #34036 from ChenMichael/SPARK-36795-Duplicate-node-id-with-inMemoryRelation.

Authored-by: Michael Chen <mike.chen@workday.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 6d7ab7b52b)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-23 15:55:15 +09:00
Hyukjin Kwon af569d1b0a [MINOR][SQL][DOCS] Correct the 'options' description on UnresolvedRelation
### What changes were proposed in this pull request?

This PR fixes the 'options' description on `UnresolvedRelation`. This comment was added in https://github.com/apache/spark/pull/29535 but is no longer valid because V1 also uses these `options` (and merges them with the table properties) per https://github.com/apache/spark/pull/29712.

This PR can go through from `master` to `branch-3.1`.

### Why are the changes needed?

To make `UnresolvedRelation.options`'s description clearer.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Scala linter by `dev/linter-scala`.

Closes #34075 from HyukjinKwon/minor-comment-unresolved-releation.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Huaxin Gao <huaxin_gao@apple.com>
(cherry picked from commit 0076eba8d0)
Signed-off-by: Huaxin Gao <huaxin_gao@apple.com>
2021-09-22 23:00:35 -07:00
Fabian A.J. Thiele d4050d7ee9 [SPARK-36782][CORE] Avoid blocking dispatcher-BlockManagerMaster during UpdateBlockInfo
### What changes were proposed in this pull request?
Delegate the potentially blocking call to `mapOutputTracker.updateMapOutput` within `UpdateBlockInfo` from `dispatcher-BlockManagerMaster` to the thread pool, to avoid blocking the endpoint. This code path is only taken for `ShuffleIndexBlockId`; other blocks are still handled on the `dispatcher-BlockManagerMaster` itself.

Change `updateBlockInfo` to return `Future[Boolean]` instead of `Boolean`. The response will be sent to the RPC caller upon successful completion of the future.
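
A hypothetical sketch of the pattern (names invented, not the actual BlockManagerMasterEndpoint code): run the potentially blocking update off the RPC dispatcher thread and reply to the caller when the Future completes.

```
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

val askThreadPool = ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(4))

def updateBlockInfo(isShuffleIndexBlock: Boolean)(update: () => Boolean): Future[Boolean] =
  if (isShuffleIndexBlock) {
    Future(update())(askThreadPool)   // delegate shuffle-index updates to the thread pool
  } else {
    Future.successful(update())       // other block types stay on the dispatcher thread
  }
```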

Introduce a unit test that forces `MapOutputTracker` to make a broadcast as part of `MapOutputTracker.serializeOutputStatuses` when running decommission tests.

### Why are the changes needed?
[SPARK-36782](https://issues.apache.org/jira/browse/SPARK-36782) describes a deadlock occurring if the `dispatcher-BlockManagerMaster` is allowed to block while waiting for write access to data structures.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit test as introduced in this PR.

---

Ping eejbyfeldt for notice.

Closes #34043 from f-thiele/SPARK-36782.

Lead-authored-by: Fabian A.J. Thiele <fabian.thiele@posteo.de>
Co-authored-by: Emil Ejbyfeldt <eejbyfeldt@liveintent.com>
Co-authored-by: Fabian A.J. Thiele <fthiele@liveintent.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 4ea54e8672)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-09-23 12:57:54 +08:00
jiaoqb d203ed51ca [SPARK-36791][DOCS] Fix spelling mistakes in running-on-yarn.md file where JHS_POST should be JHS_HOST
### What changes were proposed in this pull request?
The PR fixes SPARK-36791 by replacing JHS_POST with JHS_HOST

### Why are the changes needed?
There are spelling mistakes in the running-on-yarn.md file where JHS_POST should be JHS_HOST.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Not needed for docs

Closes #34031 from jiaoqingbo/jiaoqingbo.

Authored-by: jiaoqb <jiaoqb@asiainfo.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 8a1a91bd71)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-23 12:48:06 +09:00
Xinrong Meng 423cff4567 [SPARK-36818][PYTHON] Fix filtering a Series by a boolean Series
### What changes were proposed in this pull request?
Fix filtering a Series (without a name) by a boolean Series.

### Why are the changes needed?
A bugfix. The issue is raised as https://github.com/databricks/koalas/issues/2199.

### Does this PR introduce _any_ user-facing change?
Yes.

#### From
```py
>>> psser = ps.Series([0, 1, 2, 3, 4])
>>> ps.set_option('compute.ops_on_diff_frames', True)
>>> psser.loc[ps.Series([True, True, True, False, False])]
Traceback (most recent call last):
...
KeyError: 'none key'

```

#### To
```py
>>> psser = ps.Series([0, 1, 2, 3, 4])
>>> ps.set_option('compute.ops_on_diff_frames', True)
>>> psser.loc[ps.Series([True, True, True, False, False])]
0    0
1    1
2    2
dtype: int64
```

### How was this patch tested?
Unit test.

Closes #34061 from xinrong-databricks/filter_series.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit 6a5ee0283c)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-09-22 12:53:06 -07:00
Angerszhuuuu 2ff038a7b3 [SPARK-36753][SQL] ArrayExcept handle duplicated Double.NaN and Float.NaN
### What changes were proposed in this pull request?
For query
```
select array_except(array(cast('nan' as double), 1d), array(cast('nan' as double)))
```
This returns [NaN, 1d], but it should return [1d].
This issue is caused by the fact that `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` either.
This PR fixes it based on https://github.com/apache/spark/pull/33955

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
ArrayExcept no longer returns equal `NaN` values.

### How was this patch tested?
Added UT

Closes #33994 from AngersZhuuuu/SPARK-36753.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit a7cbe69986)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-22 23:51:58 +08:00
Ivan Sadikov fc0b85fb26 [SPARK-36803][SQL] Fix ArrayType conversion when reading Parquet files written in legacy mode
### What changes were proposed in this pull request?

This PR fixes an issue when reading of a Parquet file written with legacy mode would fail due to incorrect Parquet LIST to ArrayType conversion.

The issue arises when using schema evolution and utilising the parquet-mr reader. 2-level LIST annotated types could be parsed incorrectly as 3-level LIST annotated types because their underlying element type does not match the full inferred Catalyst schema.
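
A simplified spark-shell sketch of the scenario (path and column name invented); the actual repro additionally needs schema evolution so that the inferred Catalyst element type differs from the file's:

```
import spark.implicits._
// Write an array column using the legacy 2-level Parquet LIST layout.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")
Seq(Seq(1, 2, 3)).toDF("arr").write.mode("overwrite").parquet("/tmp/legacy_list_demo")
// Reading nested types goes through the parquet-mr converter path that the fix touches.
spark.read.parquet("/tmp/legacy_list_demo").show()
```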

### Why are the changes needed?

It appears to be a long-standing issue with the legacy mode due to the imprecise check in ParquetRowConverter that was trying to determine Parquet backward compatibility using Catalyst schema: `DataType.equalsIgnoreCompatibleNullability(guessedElementType, elementType)` in https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala#L606.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Added a new test case in ParquetInteroperabilitySuite.scala.

Closes #34044 from sadikovi/parquet-legacy-write-mode-list-issue.

Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit ec26d94eac)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-22 17:40:55 +08:00
Chao Sun a28d8d9b0e [SPARK-36820][3.2][SQL] Disable tests related to LZ4 for Hadoop 2.7 profile
### What changes were proposed in this pull request?

Disable tests related to LZ4 in `FileSourceCodecSuite` and `FileSuite` when using `hadoop-2.7` profile.
### Why are the changes needed?

At the moment, parquet-mr uses the LZ4 compression codec provided by Hadoop, and only since HADOOP-17292 (in 3.3.1/3.4.0) does the latter include `lz4-java`, removing the restriction that the codec can only run with the native library. As a consequence, the tests fail when using the `hadoop-2.7` profile.

### Does this PR introduce _any_ user-facing change?

No, it's just tests.

### How was this patch tested?

Existing test

Closes #34066 from sunchao/SpARK-36820-3.2.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-09-22 00:14:45 -07:00
Xinrong Meng 4543ac62bc [SPARK-36771][PYTHON][3.2] Fix pop of Categorical Series
### What changes were proposed in this pull request?
Fix `pop` of Categorical Series to be consistent with the latest pandas (1.3.2) behavior.

This is a backport of https://github.com/apache/spark/pull/34052.

### Why are the changes needed?
As reported in https://github.com/databricks/koalas/issues/2198, pandas API on Spark behaves differently from pandas on `pop` of a Categorical Series.

### Does this PR introduce _any_ user-facing change?
Yes, results of `pop` of Categorical Series change.

#### From
```py
>>> psser = ps.Series(["a", "b", "c", "a"], dtype="category")
>>> psser
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> psser.pop(0)
0
>>> psser
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> psser.pop(3)
0
>>> psser
1    b
2    c
dtype: category
Categories (3, object): ['a', 'b', 'c']
```

#### To
```py
>>> psser = ps.Series(["a", "b", "c", "a"], dtype="category")
>>> psser
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> psser.pop(0)
'a'
>>> psser
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> psser.pop(3)
'a'
>>> psser
1    b
2    c
dtype: category
Categories (3, object): ['a', 'b', 'c']

```

### How was this patch tested?
Unit tests.

Closes #34063 from xinrong-databricks/backport_cat_pop.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-09-21 19:16:27 -07:00
Gengliang Wang affd7a4d47 [SPARK-36670][FOLLOWUP][TEST] Remove brotli-codec dependency
### What changes were proposed in this pull request?

Remove `com.github.rdblue:brotli-codec:0.1.1` dependency.

### Why are the changes needed?

As Stephen Coy pointed out in the dev list, we should not have the `com.github.rdblue:brotli-codec:0.1.1` dependency, which is not available on Maven Central. This is to avoid possible artifact changes on `Jitpack.io`.
Also, the dependency is for tests only. I suggest that we remove it now to unblock the 3.2.0 release ASAP.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

GA tests.

Closes #34059 from gengliangwang/removeDeps.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit ba5708d944)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-09-21 10:57:34 -07:00
Max Gekk 7fa88b28a5 [SPARK-36807][SQL] Merge ANSI interval types to a tightest common type
### What changes were proposed in this pull request?
In the PR, I propose to modify `StructType` to support merging of ANSI interval types with different fields.

### Why are the changes needed?
This will allow merging of schemas from different datasource files.

### Does this PR introduce _any_ user-facing change?
No, the ANSI interval types haven't been released yet.

### How was this patch tested?
Added new test to `StructTypeSuite`.

Closes #34049 from MaxGekk/merge-ansi-interval-types.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit d2340f8e1c)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-09-21 10:20:27 +03:00
dgd-contributor 3d47c692d2 [SPARK-36785][PYTHON] Fix DataFrame.isin when DataFrame has NaN value
### What changes were proposed in this pull request?
Fix DataFrame.isin when DataFrame has NaN value

### Why are the changes needed?
Fix DataFrame.isin when DataFrame has NaN value

``` python
>>> psdf = ps.DataFrame(
...     {"a": [None, 2, 3, 4, 5, 6, 7, 8, None], "b": [None, 5, None, 3, 2, 1, None, 0, 0], "c": [1, 5, 1, 3, 2, 1, 1, 0, 0]},
... )
>>> psdf
     a    b  c
0  NaN  NaN  1
1  2.0  5.0  5
2  3.0  NaN  1
3  4.0  3.0  3
4  5.0  2.0  2
5  6.0  1.0  1
6  7.0  NaN  1
7  8.0  0.0  0
8  NaN  0.0  0
>>> other = [1, 2, None]

>>> psdf.isin(other)
      a     b     c
0  None  None  True
1  True  None  None
2  None  None  True
3  None  None  None
4  None  True  True
5  None  True  True
6  None  None  True
7  None  None  None
8  None  None  None

>>> psdf.to_pandas().isin(other)
       a      b      c
0  False  False   True
1   True  False  False
2  False  False   True
3  False  False  False
4  False   True   True
5  False   True   True
6  False  False   True
7  False  False  False
8  False  False  False
```

### Does this PR introduce _any_ user-facing change?
After this PR

``` python
>>> psdf = ps.DataFrame(
...     {"a": [None, 2, 3, 4, 5, 6, 7, 8, None], "b": [None, 5, None, 3, 2, 1, None, 0, 0], "c": [1, 5, 1, 3, 2, 1, 1, 0, 0]},
... )
>>> psdf
     a    b  c
0  NaN  NaN  1
1  2.0  5.0  5
2  3.0  NaN  1
3  4.0  3.0  3
4  5.0  2.0  2
5  6.0  1.0  1
6  7.0  NaN  1
7  8.0  0.0  0
8  NaN  0.0  0
>>> other = [1, 2, None]

>>> psdf.isin(other)
       a      b      c
0  False  False   True
1   True  False  False
2  False  False   True
3  False  False  False
4  False   True   True
5  False   True   True
6  False  False   True
7  False  False  False
8  False  False  False
```

### How was this patch tested?
Unit tests

Closes #34040 from dgd-contributor/SPARK-36785_dataframe.isin_fix.

Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit cc182fe6f6)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-09-20 17:53:02 -07:00
Dongjoon Hyun 5d0e51e943 [SPARK-36806][K8S][R] Use R 4.0.4 in K8s R image
### What changes were proposed in this pull request?

This PR aims to upgrade R from 3.6.3 to 4.0.4 in K8s R Docker image.

### Why are the changes needed?

The `openjdk:11-jre-slim` image has been upgraded to `Debian 11`.

```
$ docker run -it openjdk:11-jre-slim cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
```

It causes `R 3.5` installation failures in our K8s integration test environment.
- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47953/
```
The following packages have unmet dependencies:
 r-base-core : Depends: libicu63 (>= 63.1-1~) but it is not installable
               Depends: libreadline7 (>= 6.0) but it is not installable
E: Unable to correct problems, you have held broken packages.
The command '/bin/sh -c apt-get update &&   apt install -y gnupg &&   echo "deb http://cloud.r-project.org/bin/linux/debian buster-cran35/" >> /etc/apt/sources.list &&   apt-key adv --keyserver keyserver.ubuntu.com --recv-key 'E19F5F87128899B192B1A2C2AD5F960A256A04AF' &&   apt-get update &&
apt install -y -t buster-cran35 r-base r-base-dev &&   rm -rf
```

### Does this PR introduce _any_ user-facing change?

Yes, this will recover the installation.

### How was this patch tested?

Succeeded in building the SparkR docker image in the K8s integration test in Jenkins CI.

- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47959/
```
Successfully built 32e1a0cd5ff8
Successfully tagged kubespark/spark-r:3.3.0-SNAPSHOT_6e4f7e2d-054d-4978-812f-4f32fc546b51
```

Closes #34048 from dongjoon-hyun/SPARK-36806.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit a178752540)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-09-20 14:38:32 -07:00
Angerszhuuuu 337a1979d2 [SPARK-36754][SQL] ArrayIntersect handle duplicated Double.NaN and Float.NaN
### What changes were proposed in this pull request?
For query
```
select array_intersect(array(cast('nan' as double), 1d), array(cast('nan' as double)))
```
This returns [NaN], but it should return [].
This issue is caused by the fact that `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` either.
This PR fixes it based on https://github.com/apache/spark/pull/33955

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
ArrayIntersect no longer returns equal `NaN` values.

### How was this patch tested?
Added UT

Closes #33995 from AngersZhuuuu/SPARK-36754.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 2fc7f2f702)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-20 16:51:31 +08:00
Gengliang Wang b0249851f6 Preparing development version 3.2.1-SNAPSHOT 2021-09-18 11:30:12 +00:00
Gengliang Wang 96044e9735 Preparing Spark release v3.2.0-rc3 2021-09-18 11:30:06 +00:00
Ye Zhou d4d8a6320f [SPARK-36772] FinalizeShuffleMerge fails with an exception due to attempt id not matching
### What changes were proposed in this pull request?
Remove the appAttemptId from TransportConf, and pass it through SparkEnv instead.

### Why are the changes needed?
Push-based shuffle will fail if any attemptId is set in the SparkConf, as the attemptId is not set correctly in the Driver.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Tested within our YARN cluster. Without this PR, the Driver fails to finalize the shuffle merge on all the mergers. With the patch, the Driver can successfully finalize the shuffle merge and push-based shuffle works fine.
Also added a unit test to verify that the attemptId is set in the BlockStoreClient in the Driver.

Closes #34018 from zhouyejoe/SPARK-36772.

Authored-by: Ye Zhou <yezhou@linkedin.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit cabc36b54d)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-09-18 15:52:14 +08:00
dgd-contributor 36ce9cce55 [SPARK-36762][PYTHON] Fix Series.isin when Series has NaN values
### What changes were proposed in this pull request?
Fix Series.isin when Series has NaN values

### Why are the changes needed?
Fix Series.isin when Series has NaN values
``` python
>>> pser = pd.Series([None, 5, None, 3, 2, 1, None, 0, 0])
>>> psser = ps.from_pandas(pser)
>>> pser.isin([1, 3, 5, None])
0    False
1     True
2    False
3     True
4    False
5     True
6    False
7    False
8    False
dtype: bool
>>> psser.isin([1, 3, 5, None])
0    None
1    True
2    None
3    True
4    None
5    True
6    None
7    None
8    None
dtype: object
```

### Does this PR introduce _any_ user-facing change?
After this PR
``` python
>>> pser = pd.Series([None, 5, None, 3, 2, 1, None, 0, 0])
>>> psser = ps.from_pandas(pser)
>>> psser.isin([1, 3, 5, None])
0    False
1     True
2    False
3     True
4    False
5     True
6    False
7    False
8    False
dtype: bool

```

### How was this patch tested?
unit tests

Closes #34005 from dgd-contributor/SPARK-36762_fix_series.isin_when_values_have_NaN.

Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit 32b8512912)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-09-17 17:48:27 -07:00
Liang-Chi Hsieh 275ad6bd0b [SPARK-36673][SQL][FOLLOWUP] Remove duplicate test in DataFrameSetOperationsSuite
### What changes were proposed in this pull request?

As a follow-up of #34025, this removes a duplicate test.

### Why are the changes needed?

To remove duplicate test.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing test.

Closes #34032 from viirya/remove.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit f9644cc253)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-09-17 11:52:26 -07:00
Yang He f093f3b939 [SPARK-36780][BUILD] Make dev/mima runs on Java 17
### What changes were proposed in this pull request?

Java 17 has been officially released. This PR makes `dev/mima` run on Java 17.

### Why are the changes needed?

To make tests pass on Java 17.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test.

Closes #34022 from RabbidHY/SPARK-36780.

Lead-authored-by: Yang He <stitch106hy@gmail.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 5d0889bf36)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-09-17 08:54:57 -07:00