### What changes were proposed in this pull request?
Migrating a Spark application from 2.4.x to 3.1.x revealed a difference in exception chaining behavior. When parsing a malformed CSV, the root cause should be `Caused by: java.lang.RuntimeException: Malformed CSV record`, but only the top-level exception is kept, and all lower-level exceptions and the root cause are lost. Thus, when we call ExceptionUtils.getRootCause on the exception, we just get the exception itself back.
The reason for the difference is that the RuntimeException is wrapped in BadRecordException, which has non-serializable fields. When we try to serialize the exception from tasks and deserialize it on the scheduler, the exception is lost.
This PR makes the non-serializable fields of BadRecordException transient, so the rest of the exception can be serialized and deserialized properly.
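A minimal, self-contained sketch of the idea (the class and field below are hypothetical stand-ins, not the real BadRecordException): marking the non-serializable member `@transient` lets the rest of the exception, including its cause chain, survive Java serialization between a task and the scheduler.
```
import java.io._

// Hypothetical wrapper exception with one non-serializable field (a closure).
class RecordParseException(
    @transient val partialResult: () => Option[AnyRef], // dropped during serialization
    cause: Throwable)
  extends Exception(cause)

object TransientDemo extends App {
  val original = new RecordParseException(() => None, new RuntimeException("Malformed CSV record"))

  // Round-trip through Java serialization, as happens between a task and the scheduler.
  val buffer = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(buffer)
  out.writeObject(original)
  out.close()
  val copy = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    .readObject().asInstanceOf[RecordParseException]

  println(copy.getCause)      // java.lang.RuntimeException: Malformed CSV record
  println(copy.partialResult) // null: the transient field is not carried over
}
```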
### Why are the changes needed?
Make BadRecordException serializable
### Does this PR introduce _any_ user-facing change?
Yes, users can now get the root cause of a BadRecordException.
### How was this patch tested?
Unit testing
Closes #34167 from tianhanhu/master.
Authored-by: tianhanhu <adrianhu96@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit aed977c468)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
We found an issue where a user configured both AQE and push-based shuffle, and the job started to hang after running some stages. We took thread dumps from the executors, which showed tasks still waiting to fetch shuffle blocks.
This PR proposes changes to fix the issue.
### What changes were proposed in this pull request?
Disabled batch fetch when push-based shuffle is enabled.
### Why are the changes needed?
Without this patch, enabling both AQE and push-based shuffle can cause tasks to hang.
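For context, a hedged sketch of the configuration combination involved (config keys as we understand them; verify against your Spark version and cluster setup, since push-based shuffle also requires YARN with the external shuffle service):
```
import org.apache.spark.SparkConf

// The combination that could hang before this patch: AQE plus push-based shuffle,
// with batched fetching of contiguous shuffle blocks still allowed.
val conf = new SparkConf()
  .set("spark.sql.adaptive.enabled", "true")                    // AQE
  .set("spark.shuffle.push.enabled", "true")                    // push-based shuffle
  .set("spark.sql.adaptive.fetchShuffleBlocksInBatch", "true")  // batch fetch; this fix disables it internally in that case
```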
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Tested the PR in our environment with the Spark shell; the queries are:
sql("""SELECT CASE WHEN rand() < 0.8 THEN 100 ELSE CAST(rand() * 30000000 AS INT) END AS s_item_id, CAST(rand() * 100 AS INT) AS s_quantity, DATE_ADD(current_date(), - CAST(rand() * 360 AS INT)) AS s_date FROM RANGE(1000000000)""").createOrReplaceTempView("sales")
// Dynamically coalesce partitions
sql("""SELECT s_date, sum(s_quantity) AS q FROM sales GROUP BY s_date ORDER BY q DESC""").collect
Unit tests to be added.
Closes #34156 from zhouyejoe/SPARK-36892.
Authored-by: Ye Zhou <yezhou@linkedin.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 31b6f614d3)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
Backport of https://github.com/apache/spark/pull/34180.
### What changes were proposed in this pull request?
This bug was introduced by https://github.com/apache/spark/pull/33177
When checking overflow of the sum value in the average function, we should use the `sumDataType` instead of the input decimal type.
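A hedged spark-shell illustration of the failure mode (values are illustrative; to our understanding the sum buffer for a DECIMAL(p, s) input uses the wider DECIMAL(p + 10, s) type):
```
// avg over DECIMAL(10, 2): each value fits the input type, but the running sum
// needs ~15 digits, so it only fits the wider sum type DECIMAL(20, 2).
// Checking overflow against the input type instead of sumDataType could give a wrong result.
spark.range(0, 100000)
  .selectExpr("CAST(99999999.99 AS DECIMAL(10, 2)) AS v")
  .selectExpr("avg(v)")
  .show()
```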
### Why are the changes needed?
fix a regression
### Does this PR introduce _any_ user-facing change?
Yes, the result was wrong before this PR.
### How was this patch tested?
a new test
Closes #34193 from cloud-fan/bug.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
- Make the val lazy wherever `isPushBasedShuffleEnabled` is invoked as a class instance variable, so that it is evaluated only after user-defined jars/classes in `spark.kryo.classesToRegister` have been downloaded and are available on the executor side, as part of the fix for the exception mentioned below (see the sketch after this list).
- Add a flag `checkSerializer` to control whether `isPushBasedShuffleEnabled` needs to check that the serializer `supportsRelocationOfSerializedObjects`, as part of the fix for the exception mentioned below. Specifically, we don't check this in `registerWithExternalShuffleServer()` in `BlockManager` and `createLocalDirsForMergedShuffleBlocks()` in `DiskBlockManager.scala`, as the same issue would arise otherwise.
- Move `instantiateClassFromConf` and `instantiateClass` from `SparkEnv` into `Utils`, in order to let `isPushBasedShuffleEnabled` leverage them for instantiating serializer instances.
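A minimal, self-contained sketch of the lazy-initialization idea (the class and names below are hypothetical, not the actual Spark code):
```
// The fragile serializer check runs on first use rather than at construction time,
// i.e. only after user jars from spark.kryo.classesToRegister are available.
class ShuffleSettings(check: () => Boolean) {
  lazy val pushBasedShuffleEnabled: Boolean = check() // previously a strict `val`
}

object LazyDemo extends App {
  val settings = new ShuffleSettings(() => { println("serializer check runs now"); true })
  println("constructed")                    // no check has run yet
  println(settings.pushBasedShuffleEnabled) // the check runs here, on first access
}
```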
### Why are the changes needed?
When a user tries to set classes for Kryo serialization via `spark.kryo.classesToRegister`, the exception below (or similar) is encountered in `isPushBasedShuffleEnabled`.
Reproduced the issue in our internal branch by launching spark-shell as:
```
spark-shell --spark-version 3.1.1 --packages ml.dmlc:xgboost4j_2.12:1.3.1 --conf spark.kryo.classesToRegister=ml.dmlc.xgboost4j.scala.Booster
```
```
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1911)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:61)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:393)
at org.apache.spark.executor.YarnCoarseGrainedExecutorBackend$.main(YarnCoarseGrainedExecutorBackend.scala:83)
at org.apache.spark.executor.YarnCoarseGrainedExecutorBackend.main(YarnCoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Failed to register classes with Kryo
at org.apache.spark.serializer.KryoSerializer.$anonfun$newKryo$5(KryoSerializer.scala:183)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:230)
at org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:171)
at org.apache.spark.serializer.KryoSerializer$$anon$1.create(KryoSerializer.scala:102)
at com.esotericsoftware.kryo.pool.KryoPoolQueueImpl.borrow(KryoPoolQueueImpl.java:48)
at org.apache.spark.serializer.KryoSerializer$PoolWrapper.borrow(KryoSerializer.scala:109)
at org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:346)
at org.apache.spark.serializer.KryoSerializerInstance.getAutoReset(KryoSerializer.scala:446)
at org.apache.spark.serializer.KryoSerializer.supportsRelocationOfSerializedObjects$lzycompute(KryoSerializer.scala:253)
at org.apache.spark.serializer.KryoSerializer.supportsRelocationOfSerializedObjects(KryoSerializer.scala:249)
at org.apache.spark.util.Utils$.isPushBasedShuffleEnabled(Utils.scala:2584)
at org.apache.spark.MapOutputTrackerWorker.<init>(MapOutputTracker.scala:1109)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:322)
at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:205)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$7(CoarseGrainedExecutorBackend.scala:442)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:62)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
... 4 more
Caused by: java.lang.ClassNotFoundException: ml.dmlc.xgboost4j.scala.Booster
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:217)
at org.apache.spark.serializer.KryoSerializer.$anonfun$newKryo$6(KryoSerializer.scala:174)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.serializer.KryoSerializer.$anonfun$newKryo$5(KryoSerializer.scala:173)
... 24 more
```
Registering user classes for Kryo serialization happens after serializer creation in SparkEnv. Serializer creation can also happen in `isPushBasedShuffleEnabled`, which can be called in some places before SparkEnv is created. Also, per JoshRosen's analysis, Kryo instantiation was probably failing because the added packages hadn't been downloaded to the executor yet (this code runs during executor startup, not task startup). The proposed change fixes this issue.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Passed existing tests.
Tested this patch in our internal branch where the user reported the issue. The issue is no longer reproducible with this patch.
Closes #34158 from rmcyang/SPARK-33781-bugFix.
Lead-authored-by: Minchu Yang <minyang@minyang-mn3.linkedin.biz>
Co-authored-by: Minchu Yang <31781684+rmcyang@users.noreply.github.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit e5b01cd823)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
### What changes were proposed in this pull request?
This PR fixes an issue where an ambiguous self-join can't be detected if the left and right DataFrames are swapped.
Here is an example:
```
val df1 = Seq((1, 2, "A1"),(2, 1, "A2")).toDF("key1", "key2", "value")
val df2 = df1.filter($"value" === "A2")
df1.join(df2, df1("key1") === df2("key2")) // Ambiguous self join is detected and AnalysisException is thrown.
df2.join(df1, df1("key1") === df2("key2")) // Ambiguous self join is not detected.
```
The root cause seems to be that the inner function `collectConflictPlans` in `DeduplicateRelations` doesn't copy the `dataset_id` tag when it copies a `LogicalPlan`.
### Why are the changes needed?
Bug fix.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New tests.
Closes #34172 from sarutak/fix-deduplication-issue.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit fa1805db48)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Fix `DataFrameGroupBy.apply` without shortcut.
Pandas' `DataFrameGroupBy.apply` sometimes behaves inconsistently when the UDF returns a `Series`, depending on whether there is only one group or more. E.g.:
```py
>>> pdf = pd.DataFrame(
... {"a": [1, 2, 3, 4, 5, 6], "b": [1, 1, 2, 3, 5, 8], "c": [1, 4, 9, 16, 25, 36]},
... columns=["a", "b", "c"],
... )
>>> pdf.groupby('b').apply(lambda x: x['a'])
b
1 0 1
1 2
2 2 3
3 3 4
5 4 5
8 5 6
Name: a, dtype: int64
>>> pdf[pdf['b'] == 1].groupby('b').apply(lambda x: x['a'])
a 0 1
b
1 1 2
```
If there is only one group, it returns a "wide" `DataFrame` instead of a `Series`.
In our non-shortcut path, there is always only one group because it is run via `groupby-applyInPandas`, so we get a `DataFrame` and should convert it to a `Series` ourselves.
### Why are the changes needed?
`DataFrameGroupBy.apply` without shortcut could raise an exception when it returns `Series`.
```py
>>> ps.options.compute.shortcut_limit = 3
>>> psdf = ps.DataFrame(
... {"a": [1, 2, 3, 4, 5, 6], "b": [1, 1, 2, 3, 5, 8], "c": [1, 4, 9, 16, 25, 36]},
... columns=["a", "b", "c"],
... )
>>> psdf.groupby("b").apply(lambda x: x["a"])
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
...
ValueError: Length mismatch: Expected axis has 2 elements, new values have 3 elements
```
### Does this PR introduce _any_ user-facing change?
The error above will be gone:
```py
>>> psdf.groupby("b").apply(lambda x: x["a"])
b
1 0 1
1 2
2 2 3
3 3 4
5 4 5
8 5 6
Name: a, dtype: int64
```
### How was this patch tested?
Added tests.
Closes #34160 from ueshin/issues/SPARK-36907/groupby-apply.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 38d39812c1)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR adds PySpark API document of `session_window`.
The docstring of the function doesn't comply with the numpydoc format, so this PR also fixes it.
Further, the API document of `window` doesn't have a `Parameters` section, so it's also added in this PR.
### Why are the changes needed?
To provide PySpark users with the API document of the newly added function.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
`make html` in `python/docs` and get the following docs.
[window]
![time-window-python-doc-after](https://user-images.githubusercontent.com/4736016/134963797-ce25b268-20ca-48e3-ac8d-cbcbd85ebb3e.png)
[session_window]
![session-window-python-doc-after](https://user-images.githubusercontent.com/4736016/134963853-dd9d8417-139b-41ee-9924-14544b1a91af.png)
Closes #34118 from sarutak/python-session-window-doc.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(cherry picked from commit 5a32e41e9c)
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
`sha2(input, bit_length)` returns incorrect results for all inputs when `bit_length == 224`.
This error can be reproduced by running `spark.sql("SELECT sha2('abc', 224)").show()`, for instance, in spark-shell.
Spark currently returns
```
#\t}"4�"�B�w��U�*��你���l��
```
while the expected result is
```
23097d223405d8228642a477bda255b32aadbce4bda0b3f7e36c9da7
```
This happens because `MessageDigest.digest()` returns bytes intended to be interpreted as a `BigInt` rather than a string. Thus, the output of `MessageDigest.digest()` must first be interpreted as a `BigInt` and then converted to a hex string, rather than being interpreted directly as a hex string.
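A small standalone Scala sketch of the conversion described above (an illustration of the approach, not the exact code added by this PR):
```
import java.math.BigInteger
import java.security.MessageDigest

object Sha224Demo extends App {
  val digest = MessageDigest.getInstance("SHA-224").digest("abc".getBytes("UTF-8"))

  // Interpret the digest bytes as an unsigned big integer, then render it as a
  // zero-padded hex string (224 bits = 56 hex characters).
  val hex = String.format("%056x", new BigInteger(1, digest))
  println(hex) // 23097d223405d8228642a477bda255b32aadbce4bda0b3f7e36c9da7
}
```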
### Why are the changes needed?
`sha2(input, bit_length)` with a `bit_length` input of `224` would previously return the incorrect result.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added a new test to `HashExpressionsSuite.scala` which previously failed and now passes.
Closes #34086 from richardc-db/sha224.
Authored-by: Richard Chen <r.chen@databricks.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 6c6291b3f6)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Add provided Guava dependency to `network-yarn` module.
### Why are the changes needed?
In Spark 3.1 and earlier, the `network-yarn` module implicitly relied on Guava from the `hadoop-client` dependency. This was changed by SPARK-33212 where we moved to the shaded Hadoop client, which no longer exposes the transitive Guava dependency. It stayed fine for a while since we were not using `createDependencyReducedPom`, so the module picked up the transitive dependency from `spark-network-common` instead. However, things started to break after SPARK-36835, where we restored `createDependencyReducedPom`, and now it is no longer able to locate Guava classes:
```
build/mvn test -pl common/network-yarn -Phadoop-3.2 -Phive-thriftserver -Pkinesis-asl -Pkubernetes -Pmesos -Pnetlib-lgpl -Pscala-2.12 -Pspark-ganglia-lgpl -Pyarn
...
[INFO] Compiling 1 Java source to /Users/sunchao/git/spark/common/network-yarn/target/scala-2.12/classes ...
[WARNING] [Warn] : bootstrap class path not set in conjunction with -source 8
[ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:32: package com.google.common.annotations does not exist
[ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:33: package com.google.common.base does not exist
[ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:34: package com.google.common.collect does not exist
[ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:118: cannot find symbol
symbol: class VisibleForTesting
location: class org.apache.spark.network.yarn.YarnShuffleService
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Tested with the above `mvn` command and it's now passing.
Closes #34125 from sunchao/SPARK-36873.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 53f58b6e51)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Currently, JAVA_HOME may improperly be set to the path "/usr"; with this change, JAVA_HOME is obtained from the "/usr/libexec/java_home" command on macOS.
### Why are the changes needed?
Command "./build/mvn xxx" will be stuck on MacOS 11.4, because JAVA_HOME is set to path "/usr" improperly.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
`build/mvn -DskipTests package` passed on `macOS 11.5.2`.
Closes #34111 from copperybean/work.
Authored-by: copperybean <copperybean.zhang@gmail.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 56c21e8e5a)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
This reverts commit aaa0d2a66b.
### Why are the changes needed?
This approach has 2 disadvantages:
1. It needs to disable `spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly`.
2. The filtering side will be evaluated 2 times. For example: https://github.com/apache/spark/pull/29726#issuecomment-780266596
Instead, we can use bloom filter join pruning in the future.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes #34116 from wangyum/revert-SPARK-32855.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
(cherry picked from commit e024bdc306)
Closes #34124 from wangyum/revert-SPARK-32855-branch-3.2.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Move `spark.yarn.isHadoopProvided` to Spark parent pom, so that under `resource-managers/yarn` we can make `hadoop-3.2` as the default profile.
### Why are the changes needed?
Currently under `resource-managers/yarn` there are 3 Maven profiles: `hadoop-provided`, `hadoop-2.7`, and `hadoop-3.2`, of which `hadoop-3.2` is activated by default (via `activeByDefault`). The activation, however, doesn't work when there are other explicitly activated profiles. Specifically, if users build Spark with `hadoop-provided`, Maven will fail because it can't find Hadoop 3.2 related dependencies, which are defined in the `hadoop-3.2` profile section.
To fix the issue, this proposes to move the `hadoop-provided` section to the parent pom. Currently this is only used to define a property `spark.yarn.isHadoopProvided`, and it shouldn't matter where we define it.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Tested via running the command:
```
build/mvn clean package -DskipTests -B -Pmesos -Pyarn -Pkubernetes -Pscala-2.12 -Phadoop-provided
```
which was failing before this PR but is succeeding with it.
Also checked active profiles with the command:
```
build/mvn -Pyarn -Phadoop-provided help:active-profiles
```
and it shows that `hadoop-3.2` is active for `spark-yarn` module now.
Closes #34110 from sunchao/SPARK-36835-followup2.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit f9efdeea8c)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Handle incorrect parsing of negative ANSI typed interval literals
[SPARK-36851](https://issues.apache.org/jira/browse/SPARK-36851)
### Why are the changes needed?
Incorrect result:
```
spark-sql> select interval -'1' year;
1-0
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Add a unit test case.
Closes #34107 from Peng-Lei/SPARK-36851.
Authored-by: PengLei <peng.8lei@gmail.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 0fdca1f0df)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Fix an issue where Maven may get stuck in an infinite loop when building Spark with the Hadoop 2.7 profile.
### Why are the changes needed?
After re-enabling `createDependencyReducedPom` for `maven-shade-plugin`, the Spark build stopped working for the Hadoop 2.7 profile and gets stuck in an infinite loop, likely due to a Maven shade plugin bug similar to https://issues.apache.org/jira/browse/MSHADE-148. This seems to be caused by the fact that, under the `hadoop-2.7` profile, the variables `hadoop-client-runtime.artifact` and `hadoop-client-api.artifact` are both `hadoop-client`, which triggers the issue.
As a workaround, this changes `hadoop-client-runtime.artifact` to be `hadoop-yarn-api` when using `hadoop-2.7`. Since `hadoop-yarn-api` is a dependency of `hadoop-client`, this essentially moves the former to the same level as the latter. It should have no effect as both are dependencies of Spark.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
N/A
Closes #34100 from sunchao/SPARK-36835-followup.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 937a74e6e7)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Improve the performance and memory usage of cleaning up stage UI data. The new code copies the essential fields (stage id, attempt id, completion time) into an array and determines which stage data and `RDDOperationGraphWrapper` entries need to be cleaned based on it.
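A small self-contained sketch of that idea (types and names here are hypothetical, not the actual AppStatusListener code):
```
// Only the fields needed to pick eviction candidates are copied out, so the
// cleanup decision never has to touch the full stage wrappers in the store.
case class StageKey(stageId: Int, attemptId: Int, completionTime: Long)

def stagesToClean(stages: Seq[StageKey], maxRetained: Int): Seq[StageKey] = {
  val countToDelete = math.max(stages.size - maxRetained, 0)
  // Evict the oldest completed stages first.
  stages.sortBy(_.completionTime).take(countToDelete)
}
```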
### Why are the changes needed?
Fix the memory usage issue described in https://issues.apache.org/jira/browse/SPARK-36827
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added a new unit test for the InMemoryStore.
Also ran a simple benchmark with:
```
// Benchmark snippet; `store`, `conf`, and `time` come from the surrounding test suite.
val testConf = conf.clone()
.set(MAX_RETAINED_STAGES, 1000)
val listener = new AppStatusListener(store, testConf, true)
val stages = (1 to 5000).map { i =>
val s = new StageInfo(i, 0, s"stage$i", 4, Nil, Nil, "details1",
resourceProfileId = ResourceProfile.DEFAULT_RESOURCE_PROFILE_ID)
s.submissionTime = Some(i.toLong)
s
}
listener.onJobStart(SparkListenerJobStart(4, time, Nil, null))
val start = System.nanoTime()
stages.foreach { s =>
time +=1
s.submissionTime = Some(time)
listener.onStageSubmitted(SparkListenerStageSubmitted(s, new Properties()))
s.completionTime = Some(time)
listener.onStageCompleted(SparkListenerStageCompleted(s))
}
println(System.nanoTime() - start)
```
Before changes:
InMemoryStore: 1.2s
After changes:
InMemoryStore: 0.23s
Closes #34092 from gengliangwang/cleanStage.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 7ac0a2c37b)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
InSet should handle NaN
```
InSet(Literal(Double.NaN), Set(Double.NaN, 1d)) should return true, but returns false.
```
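The underlying issue is that `NaN` is never `==`-equal to itself, so a plain set lookup misses it. A minimal plain-Scala sketch of the special-casing idea (hypothetical helper, not the actual InSet code):
```
// NaN breaks ordinary equality, but compare/isNaN can still identify it.
println(Double.NaN == Double.NaN)                              // false
println(java.lang.Double.compare(Double.NaN, Double.NaN) == 0) // true

// Hypothetical lookup that treats NaN as equal to NaN before falling back to the set.
def containsWithNaN(set: Set[Any], v: Any): Boolean = v match {
  case d: Double if d.isNaN =>
    set.exists {
      case x: Double => x.isNaN
      case _ => false
    }
  case f: Float if f.isNaN =>
    set.exists {
      case x: Float => x.isNaN
      case _ => false
    }
  case other => set.contains(other)
}

println(containsWithNaN(Set[Any](Double.NaN, 1d), Double.NaN)) // true
```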
### Why are the changes needed?
InSet should handle NaN
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added UT
Closes #34033 from AngersZhuuuu/SPARK-36792.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 64f4bf47af)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR adds a check in the optimizer rule `CollapseProject` to avoid combining Project with Aggregate when the project list contains one or more correlated scalar subqueries that reference the output of the aggregate. Combining Project with Aggregate can lead to an invalid plan after correlated subquery rewrite. This is because correlated scalar subqueries' references are used as join conditions, which cannot host aggregate expressions.
For example
```sql
select (select sum(c2) from t where c1 = cast(s as int)) from (select sum(c2) s from t)
```
```
== Optimized Logical Plan ==
Aggregate [sum(c2)#10L AS scalarsubquery(s)#11L] <--- Aggregate has neither grouping nor aggregate expressions.
+- Project [sum(c2)#10L]
+- Join LeftOuter, (c1#2 = cast(sum(c2#3) as int)) <--- Aggregate expression in join condition
:- LocalRelation [c2#3]
+- Aggregate [c1#2], [sum(c2#3) AS sum(c2)#10L, c1#2]
+- LocalRelation [c1#2, c2#3]
java.lang.UnsupportedOperationException: Cannot generate code for expression: sum(input[0, int, false])
```
Currently, we only allow a correlated scalar subquery in Aggregate if it is also in the grouping expressions.
079a9c5292/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala (L661-L666)
### Why are the changes needed?
To fix an existing optimizer issue.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test.
Authored-by: allisonwang-db <allison.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 4a8dc5f7a3)
Signed-off-by: allisonwang-db <allison.wang@databricks.com>
Closes #34081 from allisonwang-db/cp-spark-36747.
Authored-by: allisonwang-db <allison.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…utors using config instead of command line"
### What changes were proposed in this pull request?
This reverts commit 866df69c62.
### Why are the changes needed?
After the change, environment variables were not substituted in user classpath entries. Please find an example in SPARK-35672.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes #34088 from gengliangwang/revertSPARK-35672.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Enable `createDependencyReducedPom` for Spark's Maven shaded plugin so that the effective pom won't contain those shaded artifacts such as `org.eclipse.jetty`
### Why are the changes needed?
At the moment, the effective pom leaks transitive dependencies to downstream apps for those shaded artifacts, which potentially will cause issues.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
I manually tested and the `core/dependency-reduced-pom.xml` no longer contains dependencies such as `jetty-XX`.
Closes #34085 from sunchao/SPARK-36835.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit ed88e610f0)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
This is a follow-up of https://github.com/apache/spark/pull/34043. This PR proposes to handle only shuffle blocks in the separate thread pool and leave other blocks with the same behavior as before.
### Why are the changes needed?
To avoid any potential overhead.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass existing tests.
Closes #34076 from Ngone51/spark-36782-follow-up.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 9d8ac7c8e9)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
This PR fixes the 'options' description on `UnresolvedRelation`. This comment was added in https://github.com/apache/spark/pull/29535 but is not valid anymore because V1 also uses this `options` (and merges the options with the table properties) per https://github.com/apache/spark/pull/29712.
This PR can go through from `master` to `branch-3.1`.
### Why are the changes needed?
To make `UnresolvedRelation.options`'s description clearer.
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
Scala linter by `dev/linter-scala`.
Closes #34075 from HyukjinKwon/minor-comment-unresolved-releation.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Huaxin Gao <huaxin_gao@apple.com>
(cherry picked from commit 0076eba8d0)
Signed-off-by: Huaxin Gao <huaxin_gao@apple.com>
### What changes were proposed in this pull request?
Delegate the potentially blocking call to `mapOutputTracker.updateMapOutput` within `UpdateBlockInfo` from `dispatcher-BlockManagerMaster` to the thread pool to avoid blocking the endpoint. This code path is only taken for `ShuffleIndexBlockId`; other blocks are still handled on the `dispatcher-BlockManagerMaster` itself.
Change `updateBlockInfo` to return `Future[Boolean]` instead of `Boolean`. The response is sent to the RPC caller upon successful completion of the future.
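A minimal, self-contained sketch of that pattern using plain Scala Futures (names are hypothetical; the real code lives in BlockManagerMasterEndpoint):
```
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, ExecutionContextExecutorService, Future}

object AsyncUpdateDemo extends App {
  // Dedicated pool for updates that may block; the endpoint/dispatcher thread never waits on them.
  implicit val askThreadPool: ExecutionContextExecutorService =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(4))

  def updateBlockInfo(isShuffleIndexBlock: Boolean)(update: () => Boolean): Future[Boolean] =
    if (isShuffleIndexBlock) Future(update())  // potentially blocking, runs off the endpoint thread
    else Future.successful(update())           // fast path, handled inline as before

  // The RPC reply is sent only once the future completes.
  updateBlockInfo(isShuffleIndexBlock = true) { () => Thread.sleep(10); true }
    .foreach(success => println(s"reply($success)"))

  Thread.sleep(100)        // give the demo future time to complete
  askThreadPool.shutdown() // let the demo JVM exit
}
```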
Introduce a unit test that forces `MapOutputTracker` to make a broadcast as part of `MapOutputTracker.serializeOutputStatuses` when running decommission tests.
### Why are the changes needed?
[SPARK-36782](https://issues.apache.org/jira/browse/SPARK-36782) describes a deadlock occurring if the `dispatcher-BlockManagerMaster` is allowed to block while waiting for write access to data structures.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test as introduced in this PR.
---
Ping eejbyfeldt for notice.
Closes #34043 from f-thiele/SPARK-36782.
Lead-authored-by: Fabian A.J. Thiele <fabian.thiele@posteo.de>
Co-authored-by: Emil Ejbyfeldt <eejbyfeldt@liveintent.com>
Co-authored-by: Fabian A.J. Thiele <fthiele@liveintent.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 4ea54e8672)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
The PR fixes SPARK-36791 by replacing JHS_POST with JHS_HOST
### Why are the changes needed?
There is a spelling mistake in the running-on-yarn.md file where JHS_POST should be JHS_HOST.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Not needed for docs
Closes #34031 from jiaoqingbo/jiaoqingbo.
Authored-by: jiaoqb <jiaoqb@asiainfo.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 8a1a91bd71)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
For query
```
select array_except(array(cast('nan' as double), 1d), array(cast('nan' as double)))
```
This returns [NaN, 1d], but it should return [1d].
This issue is caused by `OpenHashSet` not being able to handle `Double.NaN` and `Float.NaN` either.
This PR fixes it based on https://github.com/apache/spark/pull/33955.
### Why are the changes needed?
Fix bug
### Does this PR introduce _any_ user-facing change?
Yes, `ArrayExcept` won't show equal `NaN` values anymore.
### How was this patch tested?
Added UT
Closes #33994 from AngersZhuuuu/SPARK-36753.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit a7cbe69986)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR fixes an issue where reading a Parquet file written with legacy mode would fail due to an incorrect Parquet LIST to ArrayType conversion.
The issue arises when using schema evolution and utilising the parquet-mr reader. 2-level LIST annotated types could be parsed incorrectly as 3-level LIST annotated types because their underlying element type does not match the full inferred Catalyst schema.
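For reference, a hedged spark-shell sketch of how to produce a file with the legacy 2-level LIST layout (this alone does not trigger the failure, which also needs schema evolution and the parquet-mr read path):
```
// Write an array column using the legacy (2-level) Parquet LIST layout.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")

import spark.implicits._
Seq(Seq(1, 2, 3), Seq(4, 5)).toDF("ids")
  .write.mode("overwrite").parquet("/tmp/legacy_list")

// Reading such files back with an evolved/merged schema could previously
// mis-detect the 2-level layout as the standard 3-level layout.
spark.read.parquet("/tmp/legacy_list").show()
```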
### Why are the changes needed?
It appears to be a long-standing issue with the legacy mode due to the imprecise check in ParquetRowConverter that was trying to determine Parquet backward compatibility using Catalyst schema: `DataType.equalsIgnoreCompatibleNullability(guessedElementType, elementType)` in https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala#L606.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added a new test case in ParquetInteroperabilitySuite.scala.
Closes #34044 from sadikovi/parquet-legacy-write-mode-list-issue.
Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit ec26d94eac)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Disable tests related to LZ4 in `FileSourceCodecSuite` and `FileSuite` when using `hadoop-2.7` profile.
### Why are the changes needed?
At the moment, parquet-mr uses the LZ4 compression codec provided by Hadoop, and only since HADOOP-17292 (in 3.3.1/3.4.0) did the latter add `lz4-java` to remove the restriction that the codec can only be run with the native library. As a consequence, the tests fail when using the `hadoop-2.7` profile.
### Does this PR introduce _any_ user-facing change?
No, it's just test.
### How was this patch tested?
Existing test
Closes #34066 from sunchao/SpARK-36820-3.2.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
Fix `pop` of Categorical Series to be consistent with the latest pandas (1.3.2) behavior.
This is a backport of https://github.com/apache/spark/pull/34052.
### Why are the changes needed?
As reported in https://github.com/databricks/koalas/issues/2198, pandas API on Spark behaves differently from pandas on `pop` of Categorical Series.
### Does this PR introduce _any_ user-facing change?
Yes, results of `pop` of Categorical Series change.
#### From
```py
>>> psser = ps.Series(["a", "b", "c", "a"], dtype="category")
>>> psser
0 a
1 b
2 c
3 a
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> psser.pop(0)
0
>>> psser
1 b
2 c
3 a
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> psser.pop(3)
0
>>> psser
1 b
2 c
dtype: category
Categories (3, object): ['a', 'b', 'c']
```
#### To
```py
>>> psser = ps.Series(["a", "b", "c", "a"], dtype="category")
>>> psser
0 a
1 b
2 c
3 a
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> psser.pop(0)
'a'
>>> psser
1 b
2 c
3 a
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> psser.pop(3)
'a'
>>> psser
1 b
2 c
dtype: category
Categories (3, object): ['a', 'b', 'c']
```
### How was this patch tested?
Unit tests.
Closes #34063 from xinrong-databricks/backport_cat_pop.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Remove `com.github.rdblue:brotli-codec:0.1.1` dependency.
### Why are the changes needed?
As Stephen Coy pointed out in the dev list, we should not have `com.github.rdblue:brotli-codec:0.1.1` dependency which is not available on Maven Central. This is to avoid possible artifact changes on `Jitpack.io`.
Also, the dependency is for tests only. I suggest that we remove it now to unblock the 3.2.0 release ASAP.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
GA tests.
Closes #34059 from gengliangwang/removeDeps.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit ba5708d944)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
In the PR, I propose to modify `StructType` to support merging of ANSI interval types with different fields.
### Why are the changes needed?
This will allow merging of schemas from different datasource files.
### Does this PR introduce _any_ user-facing change?
No, the ANSI interval types haven't been released yet.
### How was this patch tested?
Added new test to `StructTypeSuite`.
Closes #34049 from MaxGekk/merge-ansi-interval-types.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit d2340f8e1c)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This PR aims to upgrade R from 3.6.3 to 4.0.4 in K8s R Docker image.
### Why are the changes needed?
The `openjdk:11-jre-slim` image was upgraded to `Debian 11`.
```
$ docker run -it openjdk:11-jre-slim cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
```
It causes `R 3.5` installation failures in our K8s integration test environment.
- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47953/
```
The following packages have unmet dependencies:
r-base-core : Depends: libicu63 (>= 63.1-1~) but it is not installable
Depends: libreadline7 (>= 6.0) but it is not installable
E: Unable to correct problems, you have held broken packages.
The command '/bin/sh -c apt-get update && apt install -y gnupg && echo "deb http://cloud.r-project.org/bin/linux/debian buster-cran35/" >> /etc/apt/sources.list && apt-key adv --keyserver keyserver.ubuntu.com --recv-key 'E19F5F87128899B192B1A2C2AD5F960A256A04AF' && apt-get update &&
apt install -y -t buster-cran35 r-base r-base-dev && rm -rf
```
### Does this PR introduce _any_ user-facing change?
Yes, this will recover the installation.
### How was this patch tested?
Succeed to build SparkR docker image in the K8s integration test in Jenkins CI.
- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47959/
```
Successfully built 32e1a0cd5ff8
Successfully tagged kubespark/spark-r:3.3.0-SNAPSHOT_6e4f7e2d-054d-4978-812f-4f32fc546b51
```
Closes #34048 from dongjoon-hyun/SPARK-36806.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit a178752540)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
For query
```
select array_intersect(array(cast('nan' as double), 1d), array(cast('nan' as double)))
```
This returns [NaN], but it should return [].
This issue is caused by `OpenHashSet` not being able to handle `Double.NaN` and `Float.NaN` either.
This PR fixes it based on https://github.com/apache/spark/pull/33955.
### Why are the changes needed?
Fix bug
### Does this PR introduce _any_ user-facing change?
Yes, `ArrayIntersect` won't show equal `NaN` values anymore.
### How was this patch tested?
Added UT
Closes #33995 from AngersZhuuuu/SPARK-36754.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 2fc7f2f702)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Remove the appAttemptId from TransportConf, and pass it through SparkEnv instead.
### Why are the changes needed?
Push-based shuffle will fail if an attemptId is set in the SparkConf, as the attemptId is not set correctly in the Driver.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Tested within our YARN cluster. Without this PR, the Driver fails to finalize the shuffle merge on all the mergers. After the patch, the Driver can successfully finalize the shuffle merge and push-based shuffle works fine.
Also added a unit test to verify that the attemptId is set in the BlockStoreClient in the Driver.
Closes #34018 from zhouyejoe/SPARK-36772.
Authored-by: Ye Zhou <yezhou@linkedin.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit cabc36b54d)
Signed-off-by: Gengliang Wang <gengliang@apache.org>