### What changes were proposed in this pull request?
- [x] Expand dictionary definitions into standalone functions.
- [x] Fix annotations for ordering functions.
### Why are the changes needed?
To simplify further maintenance of docstrings.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes #30143 from zero323/SPARK-32084.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to document the APIs in `Column` as well in API reference of PySpark documentation.
### Why are the changes needed?
To document common APIs in PySpark.
### Does this PR introduce _any_ user-facing change?
Yes, `Column.*` will be shown in API reference page.
### How was this patch tested?
Manually tested via `cd python` and `make clean html`.
Closes #30150 from HyukjinKwon/SPARK-32188.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
In the current Spark script transformation with Hive serde mode, the schema-less case produces a result different from Hive's.
This PR keeps the result consistent with Hive's script transform with serde.
#### Hive script transform with serde in schema-less mode
```
hive> create table t (c0 int, c1 int, c2 int);
hive> INSERT INTO t VALUES (1, 1, 1);
hive> INSERT INTO t VALUES (2, 2, 2);
hive> CREATE VIEW v AS SELECT TRANSFORM(c0, c1, c2) USING 'cat' FROM t;
hive> DESCRIBE v;
key string
value string
hive> SELECT * FROM v;
1 1 1
2 2 2
hive> SELECT key FROM v;
1
2
hive> SELECT value FROM v;
1 1
2 2
```
#### Spark script transform with Hive serde in schema-less mode
```
hive> create table t (c0 int, c1 int, c2 int);
hive> INSERT INTO t VALUES (1, 1, 1);
hive> INSERT INTO t VALUES (2, 2, 2);
hive> CREATE VIEW v AS SELECT TRANSFORM(c0, c1, c2) USING 'cat' FROM t;
hive> SELECT * FROM v;
1 1
2 2
```
**No-serde mode in Hive (ROW FORMAT DELIMITED)**
![image](https://user-images.githubusercontent.com/46485123/90088770-55841e00-dd52-11ea-92dd-7fe52d93f0b3.png)
### Why are the changes needed?
Keep the same behavior as Hive script transform.
### Does this PR introduce _any_ user-facing change?
Before this PR, with Hive serde script transform:
```
select transform(*)
USING 'cat'
from (
select 1, 2, 3, 4
) tmp
key value
1 2
```
After
```
select transform(*)
USING 'cat'
from (
select 1, 2, 3, 4
) tmp
key value
1 2 3 4
```
### How was this patch tested?
UT
Closes #29421 from AngersZhuuuu/SPARK-32388.
Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to use `hadoop-3.2` profile in K8s IT Jenkins jobs.
- [x] Switch the default value of `HADOOP_PROFILE` from `hadoop-2.7` to `hadoop-3.2`.
- [x] Remove `-Phadoop2.7` from Jenkins K8s IT job.
- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/configure
**BEFORE**
```
./dev/make-distribution.sh --name ${DATE}-${REVISION} --r --pip --tgz -DzincPort=${ZINC_PORT} \
-Phadoop-2.7 -Pkubernetes -Pkinesis-asl -Phive -Phive-thriftserver
```
**AFTER**
```
./dev/make-distribution.sh --name ${DATE}-${REVISION} --r --pip --tgz -DzincPort=${ZINC_PORT} \
-Pkubernetes -Pkinesis-asl -Phive -Phive-thriftserver
```
### Why are the changes needed?
Since Apache Spark 3.1.0, Hadoop 3 is the default.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Check the Jenkins K8s IT log and result.
- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34899/
```
+ /home/jenkins/workspace/SparkPullRequestBuilder-K8s/build/mvn clean package -DskipTests -DzincPort=4021 -Pkubernetes -Pkinesis-asl -Phive -Phive-thriftserver
Using `mvn` from path: /home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.6.3/bin/mvn
[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Build Order:
[INFO]
```
Closes #30153 from dongjoon-hyun/SPARK-33237.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This reinstates the old option `spark.sql.sources.write.jobUUID` to set a unique job ID in the jobconf so that Hadoop MR committers have a unique ID which is (a) consistent across tasks and workers and (b) not brittle compared to generated-timestamp job IDs. The latter matches what `JobID` requires, but since those IDs are generated per thread, they may not always be unique within a cluster.
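A minimal sketch of how a committer might consume the reinstated property; the key name is taken from the description above, while the helper itself is hypothetical and not part of Spark:
```scala
import org.apache.hadoop.mapreduce.JobContext

// Hypothetical helper: read the job-level UUID that Spark sets in the jobconf,
// falling back to the (timestamp-derived, less unique) job ID when it is absent.
def stagingJobId(context: JobContext): String =
  Option(context.getConfiguration.get("spark.sql.sources.write.jobUUID"))
    .getOrElse(context.getJobID.toString)
```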
### Why are the changes needed?
If a committer (e.g. the S3A staging committer) uses the job attempt ID as a unique ID, then any two jobs started within the same second have the same ID and can clash.
### Does this PR introduce _any_ user-facing change?
Good question. It is "developer-facing" in the context of anyone writing a committer, but it reinstates a property which was in Spark 1.x and "went away".
### How was this patch tested?
Testing: no test here. You'd have to create a new committer which extracted the value in both the job and the task(s) and verified consistency. That is possible (with a task output whose records contained the UUID), but it would be pretty convoluted and a high maintenance cost.
Because it's trying to address a race condition, it's hard to reproduce the problem downstream and so verify a fix in a test run. I'll just look at the logs to see what temporary dir is being used in the cluster FS and verify it's a UUID.
Closes #30141 from steveloughran/SPARK-33230-jobId.
Authored-by: Steve Loughran <stevel@cloudera.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
The page returned by `/jobs` in the Spark UI stores the detail information of each job in JavaScript like this:
```javascript
{
'className': 'executor added',
'group': 'executors',
'start': new Date(1602834008978),
'content': '<div class="executor-event-content"' +
'data-toggle="tooltip" data-placement="top"' +
'data-title="Executor 3<br>' +
'Added at 2020/10/16 15:40:08"' +
'data-html="true">Executor 3 added</div>'
}
```
If an application has a failed job, the failure reason corresponding to the job is stored in the `content` field of the JavaScript. If the failure reason contains the character **'**, the JavaScript code throws an exception, which causes the event timeline URL to stop responding. The following is an example of the broken JavaScript:
```javascript
{
'className': 'executor removed',
'group': 'executors',
'start': new Date(1602925908654),
'content': '<div class="executor-event-content"' +
'data-toggle="tooltip" data-placement="top"' +
'data-title="Executor 2<br>' +
'Removed at 2020/10/17 17:11:48' +
'<br>Reason: Container from a bad node: ... 20/10/17 16:00:42 WARN ShutdownHookManager: ShutdownHook **'$anon$2'** timeout..."' +
'data-html="true">Executor 2 removed</div>'
}
```
So we need to consider this special case: if the returned job info contains the character **'**, just remove it, as sketched below.
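A minimal sketch of the idea, using a hypothetical helper; the actual fix lives in the Spark UI code that builds the timeline JavaScript:
```scala
// Strip the single-quote character from the job failure reason before it is
// embedded into the single-quoted JavaScript string literal shown above.
def sanitizeFailureReason(reason: String): String = reason.replace("'", "")

// e.g. "ShutdownHook '$anon$2' timeout" becomes "ShutdownHook $anon$2 timeout"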
### Why are the changes needed?
Ensure that the UI page can function normally
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
This PR only fixes an exception in a special case; the manual test result is as below:
![fixed](https://user-images.githubusercontent.com/52202080/96711638-74490580-13d0-11eb-93e0-b44d9ed5da5c.gif)
Closes #30119 from akiyamaneko/timeline_view_cannot_open.
Authored-by: neko <echohlne@gmail.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
### What changes were proposed in this pull request?
This PR enables auto bucketed table scan by default, with the exception of cached queries, for which it is disabled (similar to AQE). The reason for disabling auto scan for cached queries is that the cached query's output partitioning can be leveraged later to avoid shuffle and sort when doing joins and aggregations.
### Why are the changes needed?
Enabling auto bucketed table scan by default is useful, as it can optimize queries automatically under the hood, without user interaction.
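A hedged usage sketch; the config key below is assumed from the feature description and may differ from the exact key used by this PR:
```scala
// Opt a session out of the new default behavior if needed (assumed key name).
spark.conf.set("spark.sql.sources.bucketing.autoBucketedScan.enabled", "false")
```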
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added unit test for cached query in `DisableUnnecessaryBucketedScanSuite.scala`. Also change a bunch of unit tests which should disable auto bucketed scan to make them work.
Closes #30138 from c21/enable-auto-bucket.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This PR aims to use a pre-built image for Github Action SparkR job.
### Why are the changes needed?
This will reduce the execution time and the flakiness.
**BEFORE (21 minutes 39 seconds)**
![Screen Shot 2020-10-16 at 1 24 43 PM](https://user-images.githubusercontent.com/9700541/96305593-fbeada80-0fb2-11eb-9b8e-86d8abaad9ef.png)
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the GitHub Action `sparkr` job in this PR.
Closes #30066 from dongjoon-hyun/SPARKR.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Make changes to `spark.sql.analyzer.maxIterations` take effect at runtime.
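A minimal usage sketch of the runtime behavior this PR enables (value illustrative):
```scala
// After this change, updating the value on a live session takes effect for
// subsequent analysis runs instead of being silently ignored.
spark.conf.set("spark.sql.analyzer.maxIterations", "200")
```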
### Why are the changes needed?
`spark.sql.analyzer.maxIterations` is not a static conf. However, before this patch, changing `spark.sql.analyzer.maxIterations` at runtime does not take effect.
### Does this PR introduce _any_ user-facing change?
Yes. Before this patch, changing `spark.sql.analyzer.maxIterations` at runtime does not take effect.
### How was this patch tested?
modified unit test
Closes #30108 from yuningzh-db/dynamic-analyzer-max-iterations.
Authored-by: Yuning Zhang <yuning.zhang@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This is to support left semi join in stream-stream join. The implementation of left semi join is (mostly in `StreamingSymmetricHashJoinExec` and `SymmetricHashJoinStateManager`):
* For a left-side input row, check if there's a match in the right-side state store.
    * If there's a match, output the left-side row, but do not put the row in the left-side state store (there is no need to keep it in the state store).
    * If there's no match, output nothing, but put the row in the left-side state store (with the "matched" field set to false in the state store).
* For a right-side input row, check if there's a match in the left-side state store.
    * For all matching left rows in the state store, output those whose "matched" field is false, then set the "matched" field of all of them to true. Only the left-side rows matched for the first time are output, to guarantee left semi join semantics.
* State store eviction: evict rows from the left/right-side state stores below the watermark, same as inner join.
Note: a follow-up optimization could be to evict matched left-side rows from the state store earlier, even when the rows are still above the watermark. However, this needs more changes in `SymmetricHashJoinStateManager`, so it is left as a follow-up.
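A hedged sketch of what a left semi stream-stream join could look like with this change; the column names, watermarks, and time condition are illustrative, adapted from the usual stream-stream join pattern:
```scala
import org.apache.spark.sql.functions.expr

val impressions = spark.readStream.format("rate").load()
  .selectExpr("value AS adId", "timestamp AS impressionTime")
  .withWatermark("impressionTime", "10 minutes")

val clicks = spark.readStream.format("rate").load()
  .selectExpr("value AS clickAdId", "timestamp AS clickTime")
  .withWatermark("clickTime", "20 minutes")

// Emit each impression at most once if a matching click arrives in range;
// per-ad click counts are not needed (left semi semantics).
val matchedImpressions = impressions.join(
  clicks,
  expr("adId = clickAdId AND clickTime BETWEEN impressionTime AND impressionTime + interval 1 hour"),
  "left_semi")
```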
### Why are the changes needed?
Current stream-stream join supports inner, left outer and right outer join (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala#L166 ). Internally we do see a lot of users using left semi stream-stream joins (outside Spark Structured Streaming), e.g. I want to get the ad impressions (join left side) which have a click (join right side), but I don't care how many clicks per ad (left semi semantics).
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added unit tests in `UnsupportedOperationChecker.scala` and `StreamingJoinSuite.scala`.
Closes #30076 from c21/stream-join.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
I am generating the SHA-512 using the standard shasum which also has a better output compared to GPG.
### Why are the changes needed?
This makes the hash much easier to verify for users that don't have GPG.
A user with GPG can check the keys, but a user without GPG will have a hard time validating the SHA-512 based on the 'pretty printed' format.
Apache Spark is the only project where I've seen this format; most other Apache projects have a one-line hash file.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
This patch assumes the build system has shasum (it should, but I can't test this).
Closes #30123 from emilianbold/master.
Authored-by: Emi <emilian.bold@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This patch adds new UTs to prevent SPARK-29438 for streaming aggregation as well as flatMapGroupsWithState, as we agree about the review comment quote here:
https://github.com/apache/spark/pull/26162#issuecomment-576929692
> LGTM for this PR. But on a additional note, this is a very subtle and easy-to-make bug with TaskContext.getPartitionId. I wonder if this bug is present in any other stateful operation. Can you please verify how partitionId is used in the other stateful operations?
For now they're not broken, but even better if we have UTs to prevent the case for the future.
### Why are the changes needed?
New UTs will prevent streaming aggregation and flatMapGroupsWithState from being broken in the future when placed on the right side of a UNION while the number of partitions changes on the left side of the UNION. Please refer to SPARK-29438 for more details.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added UTs.
Closes #27333 from HeartSaVioR/SPARK-29438-add-regression-test.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
Handle executor failure with multiple containers
Added a Spark property `spark.kubernetes.executor.checkAllContainers`, with the default being false. When it's true, the executor snapshot takes all containers in the executor pod into consideration when deciding whether the executor is in the "Running" state, if the pod restart policy is "Never". Also added the new Spark property to the docs.
### What changes were proposed in this pull request?
Checking of all containers in the executor pod when reporting executor status, if the `spark.kubernetes.executor.checkAllContainers` property is set to true.
### Why are the changes needed?
Currently, a pod remains "running" as long as there is at least one running container. This prevents Spark from noticing when a container has failed in an executor pod with multiple containers. With this change, users can configure the behavior to be different. Namely, if any container in the executor pod has failed, whether the executor process or one of its sidecars, the pod is considered to be failed and will be rescheduled.
### Does this PR introduce _any_ user-facing change?
Yes, a new Spark property is added.
Users are now able to choose whether to turn on this feature using the `spark.kubernetes.executor.checkAllContainers` property, as shown below.
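A minimal usage sketch (the property name comes from the description above; the rest of the conf is illustrative):
```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.kubernetes.executor.checkAllContainers", "true") // consider sidecar containers too
```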
### How was this patch tested?
Unit test was added and all passed.
I tried to run integration test by following the instruction [here](https://spark.apache.org/developer-tools.html) (section "Testing K8S") and also [here](https://github.com/apache/spark/blob/master/resource-managers/kubernetes/integration-tests/README.md), but I wasn't able to run it smoothly as it fails to talk with minikube cluster. Maybe it's because my minikube version is too new (I'm using v1.13.1)...? Since I've been trying it for two days and still can't make it work, I decided to submit this PR and hopefully the Jenkins test will pass.
Closes #29924 from huskysun/exec-sidecar-failure.
Authored-by: Shiqi Sun <s.sun@salesforce.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
### What changes were proposed in this pull request?
Add type hints guidelines to developer docs.
### Why are the changes needed?
Since it is a new and still somewhat evolving feature, we should provide clear guidelines for potential contributors.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Closes #30094 from zero323/SPARK-33003.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add a test case to audit all JDBC metadata behaviors, to detect and prevent potential silent API changes from either the upstream hive-jdbc module or the Spark Thrift Server side.
Forked from my kyuubi project here https://github.com/yaooqinn/kyuubi/blob/master/externals/kyuubi-spark-sql-engine/src/test/scala/org/apache/kyuubi/engine/spark/operation/SparkOperationSuite.scala
### Why are the changes needed?
Make the SparkThriftServer safer to evolve.
### Does this PR introduce _any_ user-facing change?
dev only
### How was this patch tested?
new tests
Closes #30101 from yaooqinn/SPARK-33193.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR proposes to exclude `org.apache.hadoop:hadoop-yarn-server-resourcemanager:jar:tests` from `hadoop-yarn-server-tests` when we use Hadoop 2 profile.
For some reason, after the SBT 1.3 upgrade in SPARK-21708, SBT started to pull the dependencies of `hadoop-yarn-server-tests` with the `tests` classifier:
```
org/apache/hadoop/hadoop-common/2.7.4/hadoop-common-2.7.4-tests.jar
org/apache/hadoop/hadoop-yarn-common/2.7.4/hadoop-yarn-common-2.7.4-tests.jar
org/apache/hadoop/hadoop-yarn-server-resourcemanager/2.7.4/hadoop-yarn-server-resourcemanager-2.7.4-tests.jar
```
These were not pulled before the upgrade.
This specific `hadoop-yarn-server-resourcemanager-2.7.4-tests.jar` causes the problem (SPARK-33104):
1. When the test case creates the Hadoop configuration here,
cc06266ade/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala (L122)
2. Such jars above have higher precedence in the class path, instead of the specified custom `core-site.xml` in the test:
e93b8f02cd/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala (L1375)
3. Later, `core-site.xml` in the jar is picked instead in Hadoop's `Configuration`:
Before this fix:
```
jar:file:/.../https/maven-central.storage-download.googleapis.com/maven2/org/apache/hadoop/
hadoop-yarn-server-resourcemanager/2.7.4/hadoop-yarn-server-resourcemanager-2.7.4-tests.jar!/core-site.xml
```
After this fix:
```
file:/.../spark/resource-managers/yarn/target/org.apache.spark.deploy.yarn.YarnClusterSuite/
org.apache.spark.deploy.yarn.YarnClusterSuite-localDir-nm-0_0/
usercache/.../filecache/10/__spark_conf__.zip/__hadoop_conf__/core-site.xml
```
4. the `core-site.xml` in the jar of course does not contain:
2cfd215dc4/resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnClusterSuite.scala (L133-L141)
and the specific test fails.
This PR uses a somewhat hacky approach: the module is excluded from `hadoop-yarn-server-tests` with the `tests` classifier, and then added back as a proper dependency (when the Hadoop 2 profile is used). In this way, SBT no longer pulls `hadoop-yarn-server-resourcemanager` with the `tests` classifier.
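A hedged, illustrative SBT-style sketch of that approach (exclude the transitive module from the `tests`-classified dependency, then add it back as a plain dependency); the real change lives in Spark's build definitions and differs in detail:
```scala
libraryDependencies ++= Seq(
  ("org.apache.hadoop" % "hadoop-yarn-server-tests" % "2.7.4" % Test)
    .classifier("tests")
    .excludeAll(ExclusionRule("org.apache.hadoop", "hadoop-yarn-server-resourcemanager")),
  // added back as a proper (non-"tests") dependency for the Hadoop 2 profile
  "org.apache.hadoop" % "hadoop-yarn-server-resourcemanager" % "2.7.4" % Test
)
```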
### Why are the changes needed?
To make the build pass. This is a blocker.
### Does this PR introduce _any_ user-facing change?
No, test-only.
### How was this patch tested?
Manually tested and debugged:
```bash
build/sbt clean "yarn/testOnly *.YarnClusterSuite -- -z SparkHadoopUtil" -Pyarn -Phadoop-2.7 -Phive -Phive-2.3
```
Closes #30133 from HyukjinKwon/SPARK-33104.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to upgrade SBT from 1.4.0 to 1.4.1.
### Why are the changes needed?
SBT 1.4.1 is a maintenance release at 1.4.x line. There are many bug fixes already.
- https://github.com/sbt/sbt/releases/tag/v1.4.1 (Released on 2020-10-19)
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CI and check [the Jenkins log](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130185/testReport).
```
========================================================================
Building Spark
========================================================================
[info] Building Spark using SBT with these arguments: -Phadoop-3.2 -Phive-2.3 -Phive -Pspark-ganglia-lgpl -Pkinesis-asl -Pyarn -Phadoop-cloud -Phive-thriftserver -Pkubernetes -Pmesos test:package streaming-kinesis-asl-assembly/assembly
Using /usr/java/jdk1.8.0_191 as default JAVA_HOME.
Note, this will be overridden by -java-home if it is set.
Attempting to fetch sbt
Launching sbt from build/sbt-launch-1.4.1.jar
[info] [launcher] getting org.scala-sbt sbt 1.4.1 (this may take some time)...
downloading https://repo1.maven.org/maven2/org/scala-sbt/sbt/1.4.1/sbt-1.4.1.jar ...
```
Closes #30137 from dongjoon-hyun/SBT_1.4.1.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
#### Case
The case here covers the behavior of static and dynamic SQL configs in `sharedState` and `sessionState`, and the specially handled config `spark.sql.warehouse.dir`.
The case can be found here: https://github.com/yaooqinn/sugar/blob/master/src/main/scala/com/netease/mammut/spark/training/sql/WarehouseSCBeforeSS.scala
```scala
import java.lang.reflect.Field

import org.apache.spark.sql.SparkSession
import org.apache.spark.{SparkConf, SparkContext}

object WarehouseSCBeforeSS extends App {
  val wh = "spark.sql.warehouse.dir"
  val td = "spark.sql.globalTempDatabase"
  val custom = "spark.sql.custom"

  val conf = new SparkConf()
    .setMaster("local")
    .setAppName("SPARK-32991")
    .set(wh, "./data1")
    .set(td, "bob")
  val sc = new SparkContext(conf)

  val spark = SparkSession.builder()
    .config(wh, "./data2")
    .config(td, "alice")
    .config(custom, "kyao")
    .getOrCreate()

  val confField: Field = spark.sharedState.getClass.getDeclaredField("conf")
  confField.setAccessible(true)
  private val shared: SparkConf = confField.get(spark.sharedState).asInstanceOf[SparkConf]
  println()
  println(s"=====> SharedState: $wh=${shared.get(wh)}")
  println(s"=====> SharedState: $td=${shared.get(td)}")
  println(s"=====> SharedState: $custom=${shared.get(custom, "")}")
  println(s"=====> SessionState: $wh=${spark.conf.get(wh)}")
  println(s"=====> SessionState: $td=${spark.conf.get(td)}")
  println(s"=====> SessionState: $custom=${spark.conf.get(custom, "")}")

  val spark2 = SparkSession.builder().config(td, "fred").getOrCreate()
  println(s"=====> SessionState 2: $wh=${spark2.conf.get(wh)}")
  println(s"=====> SessionState 2: $td=${spark2.conf.get(td)}")
  println(s"=====> SessionState 2: $custom=${spark2.conf.get(custom, "")}")

  SparkSession.setActiveSession(spark)
  spark.sql("RESET")
  println(s"=====> SessionState RESET: $wh=${spark.conf.get(wh)}")
  println(s"=====> SessionState RESET: $td=${spark.conf.get(td)}")
  println(s"=====> SessionState RESET: $custom=${spark.conf.get(custom, "")}")

  val spark3 = SparkSession.builder().getOrCreate()
  println(s"=====> SessionState 3: $wh=${spark2.conf.get(wh)}")
  println(s"=====> SessionState 3: $td=${spark2.conf.get(td)}")
  println(s"=====> SessionState 3: $custom=${spark2.conf.get(custom, "")}")
}
```
#### Outputs and analysis
```
// 1. Make the cloned spark conf in shared state respect the warehouse dir from the 1st SparkSession
//=====> SharedState: spark.sql.warehouse.dir=./data1
// 2. ⏬
//=====> SharedState: spark.sql.globalTempDatabase=alice
//=====> SharedState: spark.sql.custom=kyao
//=====> SessionState: spark.sql.warehouse.dir=./data2
//=====> SessionState: spark.sql.globalTempDatabase=alice
//=====> SessionState: spark.sql.custom=kyao
//=====> SessionState 2: spark.sql.warehouse.dir=./data2
//=====> SessionState 2: spark.sql.globalTempDatabase=alice
//=====> SessionState 2: spark.sql.custom=kyao
// 2'.🔼 OK until here
// 3. Make the below 3 ones respect the cloned spark conf in shared state with issue 1 fixed
//=====> SessionState RESET: spark.sql.warehouse.dir=./data1
//=====> SessionState RESET: spark.sql.globalTempDatabase=bob
//=====> SessionState RESET: spark.sql.custom=
// 4. Then the SparkSessions created after RESET will be corrected.
//=====> SessionState 3: spark.sql.warehouse.dir=./data1
//=====> SessionState 3: spark.sql.globalTempDatabase=bob
//=====> SessionState 3: spark.sql.custom=
```
In this PR, we gather all valid configs into the cloned conf of `sharedState` while it is being constructed (actually, only `spark.sql.warehouse.dir` is missing). Then we use this conf as the defaults for the `RESET` command.
`SparkSession.clearActiveSession/clearDefaultSession` will make the shared state invisible and unsharable. They will become internal-only soon (confirmed with Wenchen), so cases that call them will not be a problem.
### Why are the changes needed?
Bug fix for the programming API: calling RESET when users create a SparkContext first and configure the SparkSession later.
### Does this PR introduce _any_ user-facing change?
Yes. Before this change, when you use the programming API and call RESET, all configs are reset to `SparkContext.conf`; now they are reset to `SparkSession.sharedState.conf`.
### How was this patch tested?
new tests
Closes #30045 from yaooqinn/SPARK-32991.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR cleans up the RPC message flow among the multiple decommission use cases, it includes changes:
* Keep `Worker`'s decommission status be consistent between the case where decommission starts from `Worker` and the case where decommission starts from the `MasterWebUI`: sending `DecommissionWorker` from `Master` to `Worker` in the latter case.
* Change from two-way to one-way communication when notifying decommission between driver and executor: it's obviously unnecessary for the executor to acknowledge the decommission status to the driver since the decommission request comes from the driver, and the same holds in the reverse direction.
* Only send one message instead of two(`DecommissionSelf`/`DecommissionBlockManager`) when decommission the executor: executor and `BlockManager` are in the same JVM.
* Clean up codes around here.
### Why are the changes needed?
Before:
<img width="1948" alt="WeChat56c00cc34d9785a67a544dca036d49da" src="https://user-images.githubusercontent.com/16397174/92850308-dc461c80-f41e-11ea-8ac0-287825f4e0c4.png">
After:
<img width="1968" alt="WeChat05f7afb017e3f0132394c5e54245e49e" src="https://user-images.githubusercontent.com/16397174/93189571-de88dd80-f774-11ea-9300-1943920aa27d.png">
(Note: the diagrams only count the RPC calls that need to go through the network. Local RPC calls are not counted here.)
After this change, we removed 6 of the original RPC calls and added one more RPC call to keep the Worker's decommission status consistent. The RPC flow also becomes clearer.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Updated existing tests.
Closes #29817 from Ngone51/simplify-decommission-rpc.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
The `PLAN_CHANGE_LOG_LEVEL` config documentation is wrong. This fixes it.
### Why are the changes needed?
Fix wrong doc.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Only doc change.
Closes #30136 from viirya/minor-sqlconf.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
- Building Spark internally in orgs where access to the outside internet is not allowed takes a long time, because unsuccessful attempts are made to download artifacts from repositories which are not accessible. The unsuccessful attempts add a significant amount of time to the build; I have seen a difference of up to 1 hour for some runs.
- Adding one environment variable that, if present at the start of the build, overrides the default repos defined in the code and scripts.
Environment variable:
- DEFAULT_ARTIFACT_REPOSITORY=https://artifacts.internal.com/libs-release/
### Why are the changes needed?
To allow orgs to build spark internally without relying on external repositories for artifact downloads.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Multiple builds with and without env variables set.
Closes #29874 from ankits/SPARK-32998.
Authored-by: Ankit Srivastava <ankit_srivastava@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
1. Replace the metadata key `org.apache.spark.int96NoRebase` by `org.apache.spark.legacyINT96`.
2. Change the condition when new key should be saved to parquet metadata: it should be saved when the SQL config `spark.sql.legacy.parquet.int96RebaseModeInWrite` is set to `LEGACY`.
3. Change the handling of the metadata key on read:
- If the key is not present in the parquet metadata, take the rebase mode from the SQL config `spark.sql.legacy.parquet.int96RebaseModeInRead`.
- If the parquet files were saved by Spark < 3.1.0, use the `LEGACY` rebasing mode for the INT96 type.
- For files written by Spark >= 3.1.0, if `org.apache.spark.legacyINT96` is present in the metadata, perform rebasing; otherwise don't.
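A hedged usage sketch of the write-side config discussed above (path and values illustrative):
```scala
// Write INT96 timestamps with the legacy (hybrid Julian) rebase; per the
// description above, the file then carries the org.apache.spark.legacyINT96
// metadata key so readers know to rebase regardless of their read-side setting.
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "LEGACY")
spark.range(1)
  .selectExpr("timestamp'1000-01-01 00:00:00' AS ts")
  .write.mode("overwrite").parquet("/tmp/int96-legacy")
```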
### Why are the changes needed?
- To not increase parquet size by default when `spark.sql.legacy.parquet.int96RebaseModeInWrite` is `EXCEPTION` after https://github.com/apache/spark/pull/30121.
- To have the implementation similar to `org.apache.spark.legacyDateTime`
- To minimise impact on other subsystems that are based on file sizes like gathering statistics.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Modified test in `ParquetIOSuite`
Closes #30132 from MaxGekk/int96-flip-metadata-rebase-key.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
The purpose of this PR is to resolve SPARK-32978.
The main reason for the bad case described in SPARK-32978 is that `BasicWriteTaskStatsTracker` directly reports the number of newly added partitions per task, which makes it impossible to remove duplicates on the driver side.
The main change of this PR is to report the partitionValues to the driver and remove duplicates on the driver side, to make sure the number-of-dynamic-partitions metric is correct.
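An illustrative sketch of the driver-side idea, using plain strings instead of Spark's internal types; the real logic lives in the write stats trackers:
```scala
// Each task reports the partition values it touched; the driver takes the
// distinct union instead of summing per-task "new partition" counts.
val taskPartitionValues = Seq(
  Seq("year=2020/month=1", "year=2020/month=2"), // task 0
  Seq("year=2020/month=2", "year=2020/month=3")  // task 1 (month=2 already created by task 0)
)
val numDynamicParts = taskPartitionValues.flatten.distinct.size // 3, not 4
```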
### Why are the changes needed?
The number-of-dynamic-partitions metric we display in the UI should be correct.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Add a new test case for the scenario described in SPARK-32978.
Closes #30026 from LuciferYang/SPARK-32978.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Support using an HDFS location for `spark.sql.hive.metastore.jars`.
When users need to use a path to set the Hive metastore jars, they should set
`spark.sql.hive.metastore.jars=path` and put the real path in `spark.sql.hive.metastore.jars.path`,
since we use `File.pathSeparator` to split the path, and `File.pathSeparator` is `:` on Unix, which would split an HDFS location like `hdfs://nameservice/xx`. So the new config `spark.sql.hive.metastore.jars.path` is added to accept comma-separated paths.
Both ways remain supported.
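A hedged usage sketch of the two configs together (the metastore version and HDFS paths are illustrative):
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.sql.hive.metastore.version", "2.3.7")            // illustrative version
  .config("spark.sql.hive.metastore.jars", "path")
  .config("spark.sql.hive.metastore.jars.path",
    "hdfs://nameservice/hive/lib/*.jar,hdfs://nameservice/hive/aux/*.jar") // hypothetical paths
  .enableHiveSupport()
  .getOrCreate()
```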
### Why are the changes needed?
All Spark apps can fetch the internal-version Hive jars from an HDFS location; there is no need to distribute them to every node.
### Does this PR introduce _any_ user-facing change?
Users can use an HDFS location to store the Hive metastore jars.
### How was this patch tested?
Manually tested.
Closes #29881 from AngersZhuuuu/SPARK-32852.
Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Override the default SQL strings for the following in the MySQL JDBC dialect, according to the official documentation:
- ALTER TABLE UPDATE COLUMN TYPE
- ALTER TABLE UPDATE COLUMN NULLABILITY
Also write MySQL integration tests for JDBC.
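For reference, MySQL expresses both changes with `MODIFY COLUMN`, with nullability being part of the column definition; a hedged illustration with made-up table and column names, wrapped as Scala strings:
```scala
// Illustrative MySQL DDL the dialect needs to emit (names are made up).
val updateColumnType        = "ALTER TABLE emp MODIFY COLUMN salary BIGINT"
val updateColumnNullability = "ALTER TABLE emp MODIFY COLUMN salary BIGINT NOT NULL"
```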
### Why are the changes needed?
Improved code coverage and support mysql dialect for jdbc.
### Does this PR introduce _any_ user-facing change?
Yes, Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and nullability of columns (MySQL dialect)
### How was this patch tested?
Added tests.
Closes #30025 from ScrapCodes/mysql-dialect.
Authored-by: Prashant Sharma <prashsh1@in.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Support automatic query cancellation when a query runs too long on the Thrift server.
This is a rework of #28991, and the credit should go to the original author, leoluan2009.
Closes #28991
### Why are the changes needed?
In some cases, we use the Thrift server as a long-running application.
Sometimes we want no query to run longer than a given time.
In these cases, we can enable auto cancellation for time-consuming queries, which lets us release resources for other queries to run.
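A hedged usage sketch; the config key below is an assumption (the description does not name it) and the value is illustrative:
```scala
// Cancel any query running longer than 5 minutes on the Thrift server
// (assumed config name, not confirmed by the description above).
spark.conf.set("spark.sql.thriftServer.queryTimeout", "5m")
```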
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added tests.
Closes #29933 from maropu/pr28991.
Lead-authored-by: Xuedong Luan <luanxuedong2009@gmail.com>
Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Co-authored-by: Luan <luanxuedong2009@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This updates the misleading log messages for removed shuffle block during migration.
### Why are the changes needed?
1. For deleted shuffle blocks, `IndexShuffleBlockResolver` shows users a WARN message saying `skipping migration`. However, `BlockManagerDecommissioner` inconsistently shows users an INFO message including `Migrated ShuffleBlockInfo(...)`. Technically, we didn't migrate it, so we should not show a `Migrated` message in this case.
```
INFO BlockManagerDecommissioner: Trying to migrate shuffle ShuffleBlockInfo(109,18924) to BlockManagerId(...) (2 / 3)
WARN IndexShuffleBlockResolver: Failed to resolve shuffle block ShuffleBlockInfo(109,18924), skipping migration. This is expected to occur if a block is removed after decommissioning has started.
INFO BlockManagerDecommissioner: Got migration sub-blocks List()
...
INFO BlockManagerDecommissioner: Migrated ShuffleBlockInfo(109,18924) to BlockManagerId(...)
```
2. In addition, if the shuffle file is deleted while the information is in the queue, the above messages are repeated up to `spark.storage.decommission.maxReplicationFailuresPerBlock` times. We had better use one line instead of a group of messages for that case.
```
INFO BlockManagerDecommissioner: Trying to migrate shuffle ShuffleBlockInfo(109,18924) to BlockManagerId(...) (0 / 3)
...
INFO BlockManagerDecommissioner: Trying to migrate shuffle ShuffleBlockInfo(109,18924) to BlockManagerId(...) (1 / 3)
...
INFO BlockManagerDecommissioner: Trying to migrate shuffle ShuffleBlockInfo(109,18924) to BlockManagerId(...) (2 / 3)
```
3. Deciding whether to skip is the role of the `BlockManagerDecommissioner` class. `IndexShuffleBlockResolver.getMigrationBlocks` is used in two different ways, as follows; we had better inform users once, from `BlockManagerDecommissioner`.
- At the beginning, to get the sub-blocks.
- In case of `IOException`, to determine whether ignoring it or re-throwing. And, `BlockManagerDecommissioner` shows WARN message (`Skipping block ...`) again.
### Does this PR introduce _any_ user-facing change?
No. This is an update for log message info to be consistent.
### How was this patch tested?
Manually.
Closes #30129 from dongjoon-hyun/SPARK-33218.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
`REGEXP_REPLACE` replaces all substrings of a string that match a regexp with a replacement string.
But `REGEXP_REPLACE` lacks some flexibility, for example converting camel-case strings into strings containing lower-case words separated by underscores:
AddressLine1 -> address_line_1
If we support the position parameter, we can do it like this (e.g. in Oracle):
```
WITH strings as (
SELECT 'AddressLine1' s FROM dual union all
SELECT 'ZipCode' s FROM dual union all
SELECT 'Country' s FROM dual
)
SELECT s "STRING",
lower(regexp_replace(s, '([A-Z0-9])', '_\1', 2)) "MODIFIED_STRING"
FROM strings;
```
The output:
```
STRING MODIFIED_STRING
-------------------- --------------------
AddressLine1 address_line_1
ZipCode zip_code
Country country
```
Several mainstream databases support this syntax.
**Oracle**
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/REGEXP_REPLACE.html#GUID-EA80A33C-441A-4692-A959-273B5A224490
**Vertica**
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/RegularExpressions/REGEXP_REPLACE.htm?zoom_highlight=regexp_replace
**Redshift**
https://docs.aws.amazon.com/redshift/latest/dg/REGEXP_REPLACE.html
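A hedged sketch of the equivalent call in Spark SQL once the position argument is supported (note that Spark's replacement string uses `$1` rather than Oracle's `\1` for group references):
```scala
spark.sql("""
  SELECT lower(regexp_replace('AddressLine1', '([A-Z0-9])', '_$1', 2)) AS modified_string
""").show()
// modified_string: address_line_1 (the leading 'A' is skipped because matching starts at position 2)
```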
### Why are the changes needed?
The position parameter for `REGEXP_REPLACE` is very useful.
### Does this PR introduce _any_ user-facing change?
'Yes'.
### How was this patch tested?
Jenkins test.
Closes #29891 from beliefer/add-position-for-regex_replace.
Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This switches Spark to use shaded Hadoop clients, namely hadoop-client-api and hadoop-client-runtime, for Hadoop 3.x. For Hadoop 2.7, we'll still use the same modules such as hadoop-client.
In order to still keep default Hadoop profile to be hadoop-3.2, this defines the following Maven properties:
```
hadoop-client-api.artifact
hadoop-client-runtime.artifact
hadoop-client-minicluster.artifact
```
which default to:
```
hadoop-client-api
hadoop-client-runtime
hadoop-client-minicluster
```
but all switch to `hadoop-client` when the Hadoop profile is hadoop-2.7. A side effect of this is that we'll import the same dependency multiple times. For this I have to disable the Maven enforcer rule `banDuplicatePomDependencyVersions`.
Besides above, there are the following changes:
- explicitly add a few dependencies which are imported via transitive dependencies from Hadoop jars, but are removed from the shaded client jars.
- removed the use of `ProxyUriUtils.getPath` from `ApplicationMaster` which is a server-side/private API.
- modified `IsolatedClientLoader` to exclude `hadoop-auth` jars when Hadoop version is 3.x. This change should only matter when we're not sharing Hadoop classes with Spark (which is _mostly_ used in tests).
### Why are the changes needed?
This serves two purposes:
- to unblock Spark from upgrading to Hadoop 3.2.2/3.3.0+. Latest Hadoop versions have upgraded to use Guava 27+ and in order to adopt the latest Hadoop versions in Spark, we'll need to resolve the Guava conflicts. This takes the approach by switching to shaded client jars provided by Hadoop.
- avoid pulling 3rd party dependencies from Hadoop and avoid potential future conflicts.
### Does this PR introduce _any_ user-facing change?
When people use Spark with `hadoop-provided` option, they should make sure class path contains `hadoop-client-api` and `hadoop-client-runtime` jars. In addition, they may need to make sure these jars appear before other Hadoop jars in the order. Otherwise, classes may be loaded from the other non-shaded Hadoop jars and cause potential conflicts.
### How was this patch tested?
Relying on existing tests.
Closes #29843 from sunchao/SPARK-29250.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
### What changes were proposed in this pull request?
1. Set the default value for the SQL configs `spark.sql.legacy.parquet.int96RebaseModeInWrite` and `spark.sql.legacy.parquet.int96RebaseModeInRead` to `EXCEPTION`.
2. Update the SQL migration guide.
### Why are the changes needed?
The current default value `LEGACY` may lead to shifted timestamps on read or on write. We should leave the decision about rebasing to users.
### Does this PR introduce _any_ user-facing change?
Yes
### How was this patch tested?
By existing test suites like `ParquetIOSuite`.
Closes #30121 from MaxGekk/int96-exception-by-default.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Increase the tolerance for two tests that fail in some environments and pass in others (flaky? pass/fail is constant within the same environment).
### Why are the changes needed?
The tests `pyspark.ml.recommendation` and `pyspark.ml.tests.test_algorithms` fail with
```
File "/home/jenkins/python/pyspark/ml/tests/test_algorithms.py", line 96, in test_raw_and_probability_prediction
self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, atol=1))
AssertionError: False is not true
```
```
File "/home/jenkins/python/pyspark/ml/recommendation.py", line 256, in _main_.ALS
Failed example:
predictions[0]
Expected:
Row(user=0, item=2, newPrediction=0.6929101347923279)
Got:
Row(user=0, item=2, newPrediction=0.6929104924201965)
...
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
This patch changes a test target. I just executed the tests to verify they pass.
Closes #30104 from AlessandroPatti/apatti/rounding-errors.
Authored-by: Alessandro Patti <ale812@yahoo.it>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
1. Turn off/on the SQL config `spark.sql.legacy.parquet.int96RebaseModeInWrite` which was added by https://github.com/apache/spark/pull/30056 in `DateTimeRebaseBenchmark`. The parquet readers should infer correct rebasing mode automatically from metadata.
2. Regenerate benchmark results of `DateTimeRebaseBenchmark` in the environment:
| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge (spot instance) |
| AMI | ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) |
| Java | OpenJDK8/11 installed by`sudo add-apt-repository ppa:openjdk-r/ppa` & `sudo apt install openjdk-11-jdk`|
### Why are the changes needed?
To have up-to-date info about INT96 performance which is the default type for Catalyst's timestamp type.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By updating benchmark results:
```
$ SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.DateTimeRebaseBenchmark"
```
Closes #30118 from MaxGekk/int96-rebase-benchmark.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR is a small followup of https://github.com/apache/spark/pull/28793 and proposes to use `is_categorical_dtype` instead of deprecated `is_categorical`.
`is_categorical_dtype` exists from minimum pandas version we support (https://github.com/pandas-dev/pandas/blob/v0.23.2/pandas/core/dtypes/api.py), and `is_categorical` was deprecated from pandas 1.1.0 (87a1cc21ca).
### Why are the changes needed?
To avoid using deprecated APIs, and remove warnings.
### Does this PR introduce _any_ user-facing change?
Yes, it will remove warnings that says `is_categorical` is deprecated.
### How was this patch tested?
By running any pandas UDF with pandas 1.1.0+:
```python
import pandas as pd
from pyspark.sql.functions import pandas_udf
def func(x: pd.Series) -> pd.Series:
return x
spark.range(10).select(pandas_udf(func, "long")("id")).show()
```
Before:
```
/.../python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py:151: FutureWarning: is_categorical is deprecated and will be removed in a future version. Use is_categorical_dtype instead
...
```
After:
```
...
```
Closes #30114 from HyukjinKwon/replace-deprecated-is_categorical.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
### What changes were proposed in this pull request?
This PR changes `<` into `>` in the following to fix data loss during storage migrations.
```scala
// If we found any new shuffles to migrate or otherwise have not migrated everything.
- newShufflesToMigrate.nonEmpty || migratingShuffles.size < numMigratedShuffles.get()
+ newShufflesToMigrate.nonEmpty || migratingShuffles.size > numMigratedShuffles.get()
```
### Why are the changes needed?
`refreshOffloadingShuffleBlocks` should return `true` when the migration is still on-going.
Since `migratingShuffles` is defined like the following, `migratingShuffles.size > numMigratedShuffles.get()` means the migration is not finished.
```scala
// Shuffles which are either in queue for migrations or migrated
protected[storage] val migratingShuffles = mutable.HashSet[ShuffleBlockInfo]()
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CI with the updated test cases.
Closes #30116 from dongjoon-hyun/SPARK-33202.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR intends to upgrade snappy-java from 1.1.7.5 to 1.1.8.
### Why are the changes needed?
For performance improvements; the released `snappy-java` bundles the latest `Snappy` v1.1.8 binaries with small performance improvements.
- snappy-java release note: https://github.com/xerial/snappy-java/releases/tag/1.1.8
- snappy release note: https://github.com/google/snappy/releases/tag/1.1.8
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA tests.
Closes #30120 from maropu/Snappy1.1.8.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
1. Optimize `predictQuantiles` by pre-computing an auxiliary variable.
### Why are the changes needed?
In https://github.com/apache/spark/pull/30000, I optimized the `transform` method. I found that we can also optimize `predictQuantiles` by pre-computing an auxiliary variable.
It is about 56% faster than the existing implementation.
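A hedged sketch of the optimization pattern, assuming the AFT (Weibull) quantile formula q(p) = exp(prediction) * (-ln(1 - p))^scale; the names and structure are illustrative, not the actual `AFTSurvivalRegressionModel` code:
```scala
// Pre-compute the per-probability factor (-ln(1 - p))^scale once per model,
// so each call only does one exp and a few multiplications per quantile.
class AftQuantileHelper(quantileProbabilities: Array[Double], scale: Double) {
  private val baseQuantiles: Array[Double] =
    quantileProbabilities.map(p => math.exp(math.log(-math.log1p(-p)) * scale))

  /** `linearPrediction` is the linear predictor x*beta + intercept. */
  def predictQuantiles(linearPrediction: Double): Array[Double] =
    baseQuantiles.map(_ * math.exp(linearPrediction))
}
```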
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #30034 from zhengruifeng/aft_quantiles_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Address comments https://github.com/apache/spark/pull/29635#discussion_r507241899 to improve migration guide
### Why are the changes needed?
improve migration guide
### Does this PR introduce _any_ user-facing change?
No, only a doc update.
### How was this patch tested?
passing GitHub action
Closes #30113 from yaooqinn/SPARK-32785-F.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Add an environment variable `PYARROW_IGNORE_TIMEZONE` to pyspark tests in run-tests.py to use legacy nested timestamp behavior. This means that when converting arrow to pandas, nested timestamps with timezones will have the timezone localized during conversion.
### Why are the changes needed?
The default behavior was changed in PyArrow 2.0.0 to propagate timezone information. Using the environment variable enables testing with newer versions of pyarrow until the issue can be fixed in SPARK-32285.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests
Closes #30111 from BryanCutler/arrow-enable-legacy-nested-timestamps-SPARK-33189.
Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to fix `getMigrationBlocks` error handling and to add test coverage.
1. `getMigrationBlocks` should not fail in the index-file-only case.
2. `assert` throws `java.lang.AssertionError`, which is not an `Exception`.
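To illustrate point 2: `AssertionError` extends `Error`, not `Exception`, so an exception handler that only catches `Exception` lets it propagate (illustrative snippet, not the actual Spark code):
```scala
try {
  try {
    assert(false, "index file is missing")       // throws java.lang.AssertionError
  } catch {
    case e: Exception => println(s"handled: $e") // not reached: AssertionError is an Error
  }
} catch {
  case e: Throwable => println(s"escaped the Exception handler: $e")
}
```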
### Why are the changes needed?
To handle the exception correctly.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CI with the newly added test case.
Closes #30110 from dongjoon-hyun/SPARK-33198.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR aims to improve the log message for better analysis.
### Why are the changes needed?
Good logs are crucial always.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual review.
Closes #30109 from dongjoon-hyun/k8s_log.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This pull request changes the description about `to_avro` and `from_avro` functions to include Python as a supported language as the functions have been supported in Python since Apache Spark 3.0.0 [[SPARK-26856](https://issues.apache.org/jira/browse/SPARK-26856)].
### Why are the changes needed?
Same as above.
### Does this PR introduce _any_ user-facing change?
Yes. The description changed by this pull request is on https://spark.apache.org/docs/latest/sql-data-sources-avro.html#to_avro-and-from_avro.
### How was this patch tested?
Tested manually by building and checking the document in the local environment.
Closes #30105 from kjmrknsn/fix-docs-sql-data-sources-avro.
Authored-by: Keiji Yoshida <kjmrknsn@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to fix:
```
org.apache.spark.deploy.yarn.YarnClusterSuite.run Python application in yarn-client mode
org.apache.spark.deploy.yarn.YarnClusterSuite.run Python application in yarn-cluster mode
org.apache.spark.deploy.yarn.YarnClusterSuite.run Python application in yarn-cluster mode using spark.yarn.appMasterEnv to override local envvar
```
it currently fails as below:
```
20/10/16 19:20:36 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) (amp-jenkins-worker-03.amp executor 1): org.apache.spark.SparkException:
Error from python worker:
Traceback (most recent call last):
File "/usr/lib64/python2.6/runpy.py", line 104, in _run_module_as_main
loader, code, fname = _get_module_details(mod_name)
File "/usr/lib64/python2.6/runpy.py", line 79, in _get_module_details
loader = get_loader(mod_name)
File "/usr/lib64/python2.6/pkgutil.py", line 456, in get_loader
return find_loader(fullname)
File "/usr/lib64/python2.6/pkgutil.py", line 466, in find_loader
for importer in iter_importers(fullname):
File "/usr/lib64/python2.6/pkgutil.py", line 422, in iter_importers
__import__(pkg)
File "/home/jenkins/workspace/SparkPullRequestBuilder2/python/pyspark/__init__.py", line 53, in <module>
from pyspark.rdd import RDD, RDDBarrier
File "/home/jenkins/workspace/SparkPullRequestBuilder2/python/pyspark/rdd.py", line 34, in <module>
from pyspark.java_gateway import local_connect_and_auth
File "/home/jenkins/workspace/SparkPullRequestBuilder2/python/pyspark/java_gateway.py", line 29, in <module>
from py4j.java_gateway import java_import, JavaGateway, JavaObject, GatewayParameters
File "/home/jenkins/workspace/SparkPullRequestBuilder2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 60
PY4J_TRUE = {"yes", "y", "t", "true"}
^
SyntaxError: invalid syntax
```
I think this was broken when Python 2 was dropped but was not caught because this specific test does not run when there's no change in YARN codes. See also https://github.com/apache/spark/pull/29843#issuecomment-712540024
The root cause seems to be that the paths are different, see https://github.com/apache/spark/pull/29843#pullrequestreview-502595199. I _think_ Jenkins uses a different Python executable via Anaconda and the executor side does not know where it is for some reason.
This PR proposes to fix it just by explicitly specifying the absolute path for Python executable so the tests should pass in any environment.
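A hedged sketch of the idea using the public `spark.pyspark.*` configs (the actual test fix may set the path differently; the path below is illustrative):
```scala
import org.apache.spark.SparkConf

// Point both driver and executors at an absolute Python 3 executable instead
// of whatever a bare "python" resolves to on each node.
val conf = new SparkConf()
  .set("spark.pyspark.driver.python", "/usr/bin/python3")
  .set("spark.pyspark.python", "/usr/bin/python3")
```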
### Why are the changes needed?
To make tests pass.
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
This issue looks specific to Jenkins. It should run the tests on Jenkins.
Closes #30099 from HyukjinKwon/SPARK-33191.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
PyArrow 2.0.0 was uploaded to PyPI today (https://pypi.org/project/pyarrow/), and some tests fail with PyArrow 2.0.0+:
```
======================================================================
ERROR [0.774s]: test_grouped_over_window_with_key (pyspark.sql.tests.test_pandas_grouped_map.GroupedMapInPandasTests)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/__w/spark/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line 595, in test_grouped_over_window_with_key
.select('id', 'result').collect()
File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line 588, in collect
sock_info = self._jdf.collectToPython()
File "/__w/spark/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/__w/spark/spark/python/pyspark/sql/utils.py", line 117, in deco
raise converted from None
pyspark.sql.utils.PythonException:
An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 601, in main
process()
File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 593, in process
serializer.dump_stream(out_iter, outfile)
File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 255, in dump_stream
return ArrowStreamSerializer.dump_stream(self, init_stream_yield_batches(), stream)
File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 81, in dump_stream
for batch in iterator:
File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 248, in init_stream_yield_batches
for series in iterator:
File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 426, in mapper
return f(keys, vals)
File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 170, in <lambda>
return lambda k, v: [(wrapped(k, v), to_arrow_type(return_type))]
File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 158, in wrapped
result = f(key, pd.concat(value_series, axis=1))
File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/util.py", line 68, in wrapper
return f(*args, **kwargs)
File "/__w/spark/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line 590, in f
"{} != {}".format(expected_key[i][1], window_range)
AssertionError: {'start': datetime.datetime(2018, 3, 15, 0, 0), 'end': datetime.datetime(2018, 3, 20, 0, 0)} != {'start': datetime.datetime(2018, 3, 15, 0, 0, tzinfo=<StaticTzInfo 'Etc/UTC'>), 'end': datetime.datetime(2018, 3, 20, 0, 0, tzinfo=<StaticTzInfo 'Etc/UTC'>)}
```
https://github.com/apache/spark/runs/1278917457
This PR proposes to set the upper bound of PyArrow in the GitHub Actions build. This should be removed once PyArrow 2.0.0+ is properly supported (SPARK-33189).
### Why are the changes needed?
To make build pass.
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
GitHub Actions in this build will test it out.
Closes#30098 from HyukjinKwon/hot-fix-test.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
The Postgres and MSSQL connection providers are not able to get a custom `appEntry` because, under some circumstances, the driver is wrapped with `DriverWrapper`. Such a case is not handled in the mentioned providers. In this PR I've added handling for this edge case by passing the unwrapped `Driver` from `JdbcUtils`.
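A hedged sketch of the unwrapping step (a hypothetical helper; the class visibility and the `wrapped` field name are assumptions about Spark's JDBC internals, not taken from the actual diff):
```
import java.sql.Driver
import org.apache.spark.sql.execution.datasources.jdbc.DriverWrapper

// Hypothetical sketch: hand the connection provider the underlying Driver
// rather than the DriverWrapper it may have been registered with.
def unwrapDriver(driver: Driver): Driver = driver match {
  case wrapper: DriverWrapper => wrapper.wrapped // assumes the wrapped driver is exposed as `wrapped`
  case other => other
}
```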
### Why are the changes needed?
`DriverWrapper` is not considered.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing + additional unit tests.
Closes#30024 from gaborgsomogyi/SPARK-32229.
Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
1. Add the SQL config `spark.sql.legacy.parquet.int96RebaseModeInWrite` to control timestamp rebasing when saving timestamps as INT96 (a usage sketch follows this list). It supports the same set of values as `spark.sql.legacy.parquet.datetimeRebaseModeInWrite` but the default value is `LEGACY` to preserve backward compatibility with Spark <= 3.0.
2. Write the metadata key `org.apache.spark.int96NoRebase` to parquet files if `spark.sql.legacy.parquet.int96RebaseModeInWrite` is not set to `LEGACY`.
3. Add the SQL config `spark.sql.legacy.parquet.int96RebaseModeInRead` to control loading of INT96 timestamps when the parquet metadata doesn't have enough info about the writer (the `org.apache.spark.int96NoRebase` tag), i.e. whether INT96 was written by a Proleptic Gregorian system or some Julian one.
4. Modify the vectorized and parquet-mr readers to support loading/saving INT96 timestamps without rebasing, depending on the SQL configs and the metadata tag:
- **No rebasing** in testing when the SQL config `spark.test.forceNoRebase` is set to `true`
- **No rebasing** if the parquet metadata contains the tag `org.apache.spark.int96NoRebase`. This is the case when parquet files are saved by Spark >= 3.1 with `spark.sql.legacy.parquet.int96RebaseModeInWrite` set to `CORRECTED`, or saved by other systems that set the tag `org.apache.spark.int96NoRebase`.
- **With rebasing** if parquet files were saved by Spark (any version) without the metadata tag `org.apache.spark.int96NoRebase`.
- Rebasing depends on the SQL config `spark.sql.legacy.parquet.int96RebaseModeInRead` if there are no metadata tags `org.apache.spark.version` and `org.apache.spark.int96NoRebase`.
New SQL configs are added instead of re-using the existing `spark.sql.legacy.parquet.datetimeRebaseModeInWrite` and `spark.sql.legacy.parquet.datetimeRebaseModeInRead` for the following reasons:
- To allow users to have different modes for INT96 and for TIMESTAMP_MICROS (MILLIS). For example, users might want to save INT96 as LEGACY but TIMESTAMP_MICROS as CORRECTED.
- To have different modes for INT96 and DATE on load (or on save).
- To be backward compatible with Spark 2.4. For now, `spark.sql.legacy.parquet.datetimeRebaseModeInWrite/Read` are set to `EXCEPTION` by default.
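A hedged usage sketch of the new write config (a hypothetical spark-shell snippet; the path and data are illustrative, and it assumes the default INT96 output type for parquet timestamps):
```
import spark.implicits._

// Hypothetical sketch: write INT96 timestamps without rebasing; per item 2 above,
// this should also store the org.apache.spark.int96NoRebase metadata key.
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
Seq(java.sql.Timestamp.valueOf("1001-01-01 00:00:00")).toDF("ts")
  .write.mode("overwrite").parquet("/tmp/int96_no_rebase")
```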
### Why are the changes needed?
1. The Parquet spec says that INT96 must be stored as Julian days (see https://github.com/apache/parquet-format/pull/49). This doesn't mean that a reader (or a writer) is based on the Julian calendar, so rebasing from the Proleptic Gregorian to the Julian calendar may not be needed.
2. Rebasing from/to the Julian calendar can lose information because dates in one calendar don't exist in the other. For example, 1582-10-04..1582-10-15 exist in the Proleptic Gregorian calendar but not in the hybrid calendar (Julian + Gregorian), and vice versa, the Julian date 1000-02-29 doesn't exist in the Proleptic Gregorian calendar. We should allow users to save timestamps without losing such dates (rebasing shifts such dates to the next valid date).
3. It would also make Spark compatible with other systems such as Impala and newer versions of Hive that write proleptic Gregorian based INT96 timestamps.
### Does this PR introduce _any_ user-facing change?
It can, when `spark.sql.legacy.parquet.int96RebaseModeInWrite` is set to a non-default value such as `CORRECTED`.
### How was this patch tested?
- Added a test to check the metadata key `org.apache.spark.int96NoRebase`
- By `ParquetIOSuite`
Closes#30056 from MaxGekk/parquet-rebase-int96.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Currently, actual non-dynamic partition pruning is executed in the optimizer phase (PruneFileSourcePartitions) if an input relation has a catalog file index. The current code assumes the same partition filters are generated again in FileSourceStrategy and passed into FileSourceScanExec. FileSourceScanExec uses the partition filters when listing files, but these non-dynamic partition filters do nothing because unnecessary partitions are already pruned in advance, so in this case the filters are mainly used for explain output. If a WHERE clause has DNF-ed predicates, FileSourceStrategy cannot extract the same filters as PruneFileSourcePartitions, and then PartitionFilters is not shown in the explain output.
This patch proposes to extract the partition filters again in FileSourceStrategy and HiveStrategy with `extractPredicatesWithinOutputSet` added in https://github.com/apache/spark/pull/29101/files#diff-6be42cfa3c62a7536b1eb1d6447c073c, so that the partially pushed-down partition filters are shown in explain().
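A hedged repro sketch (hypothetical spark-shell snippet; the table name and rows are illustrative) that builds a partitioned datasource table and runs the DNF-ed predicate used in the plans below:
```
// Hypothetical sketch: partitioned datasource table plus a DNF-ed WHERE clause,
// so explain() output can be compared before and after this change.
spark.sql("CREATE TABLE t (i INT, p STRING) USING parquet PARTITIONED BY (p)")
spark.sql("INSERT INTO t VALUES (1, '1'), (1, '2')")
spark.sql("SELECT * FROM t WHERE p = '1' OR (p = '2' AND i = 1)").explain()
```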
### Why are the changes needed?
Without the patch, the explained plan is inconsistent with what is actually executed.
<b>Without the change</b>, the explained plans of `"SELECT * FROM t WHERE p = '1' OR (p = '2' AND i = 1)"` for datasource and Hive tables are as follows, respectively (the pushed-down partition filters are missing):
```
== Physical Plan ==
*(1) Filter ((p#21 = 1) OR ((p#21 = 2) AND (i#20 = 1)))
+- *(1) ColumnarToRow
+- FileScan parquet default.t[i#20,p#21] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/nanzhu/code/spark/sql/hive/target/tmp/hive_execution_test_group/war..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<i:int>
```
```
== Physical Plan ==
*(1) Filter ((p#33 = 1) OR ((p#33 = 2) AND (i#32 = 1)))
+- Scan hive default.t [i#32, p#33], HiveTableRelation [`default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [i#32], Partition Cols: [p#33], Pruned Partitions: [(p=1), (p=2)]]
```
<b>With the change</b>, the plans look like the following (the actually executed partition filters are shown):
```
== Physical Plan ==
*(1) Filter ((p#21 = 1) OR ((p#21 = 2) AND (i#20 = 1)))
+- *(1) ColumnarToRow
+- FileScan parquet default.t[i#20,p#21] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/nanzhu/code/spark/sql/hive/target/tmp/hive_execution_test_group/war..., PartitionFilters: [((p#21 = 1) OR (p#21 = 2))], PushedFilters: [], ReadSchema: struct<i:int>
```
```
== Physical Plan ==
*(1) Filter ((p#37 = 1) OR ((p#37 = 2) AND (i#36 = 1)))
+- Scan hive default.t [i#36, p#37], HiveTableRelation [`default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [i#36], Partition Cols: [p#37], Pruned Partitions: [(p=1), (p=2)]], [((p#37 = 1) OR (p#37 = 2))]
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test.
Closes#29831 from CodingCat/SPARK-32351.
Lead-authored-by: Nan Zhu <nanzhu@uber.com>
Co-authored-by: Nan Zhu <CodingCat@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>