### What changes were proposed in this pull request?
Currently in JDBCTableCatalog, we ignore the table options when creating a table:
```
// TODO (SPARK-32405): Apply table options while creating tables in JDBC Table Catalog
if (!properties.isEmpty) {
logWarning("Cannot create JDBC table with properties, these properties will be " +
"ignored: " + properties.asScala.map { case (k, v) => s"$k=$v" }.mkString("[", ", ", "]"))
}
```
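As a hedged illustration (the catalog name, connection settings, and the `ENGINE` option below are placeholders, and the exact DDL form is only illustrative), this is the kind of statement whose OPTIONS should now be applied when the table is created, instead of only triggering the warning above:
```scala
// Register a JDBC v2 catalog (values are illustrative only).
spark.conf.set("spark.sql.catalog.mysql",
  "org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog")
spark.conf.set("spark.sql.catalog.mysql.url", "jdbc:mysql://localhost:3306/db")
spark.conf.set("spark.sql.catalog.mysql.driver", "com.mysql.jdbc.Driver")

// With this change, the OPTIONS are forwarded to the underlying database
// when the table is created, rather than being ignored.
spark.sql("CREATE TABLE mysql.db.users (id INT, name STRING) OPTIONS (ENGINE 'InnoDB')")
```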
### Why are the changes needed?
We need to apply the table options when creating a table.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
Added a new test.
Closes#30154 from huaxingao/table_options.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR aims to support sorted shuffle block migration.
### Why are the changes needed?
Since the current shuffle block migration works in a random order, a failure during worker decommission affects all shuffles. It is better to finish the shuffles one by one to minimize the number of affected shuffles.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs with the newly added test case.
Closes#30293 from dongjoon-hyun/SPARK-33387.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
In `RenameBasedFSDataOutputStream.cancel`, we do two things in a single try/catch block: close the underlying stream and delete the temporary file. Closing the `OutputStream` can throw an `IOException`, so we could miss deleting the temporary file.
This patch proposes to delete the temporary file even when closing the underlying stream throws an error.
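A minimal sketch of the idea (method and variable names are placeholders, not the actual Spark code): separate the close and the delete so a failure in `close()` no longer skips the cleanup.
```scala
import java.io.{IOException, OutputStream}
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch only: delete the temporary file even if closing the stream throws.
def cancelWrite(underlying: OutputStream, fs: FileSystem, tempPath: Path): Unit = {
  try {
    underlying.close()
  } finally {
    try {
      fs.delete(tempPath, false)
    } catch {
      case e: IOException =>
        // Best effort: report and continue rather than silently leaking the temp file.
        println(s"Failed to delete temporary file $tempPath: ${e.getMessage}")
    }
  }
}
```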
### Why are the changes needed?
To avoid leaving temporary files during canceling writing in `RenameBasedFSDataOutputStream`.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test.
Closes#30290 from viirya/SPARK-33384.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
There are two similar compilation warnings about procedure-like declaration in Scala 2.13:
```
[WARNING] [Warn] /spark/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala:70: procedure syntax is deprecated for constructors: add `=`, as in method definition
```
and
```
[WARNING] [Warn] /spark/core/src/main/scala/org/apache/spark/storage/BlockManagerDecommissioner.scala:211: procedure syntax is deprecated: instead, add `: Unit =` to explicitly declare `run`'s return type
```
This PR is the first part of resolving SPARK-33352:
- For constructor definitions, add `=` to convert to function syntax
- For method definitions without a return type, add `: Unit =` to convert to function syntax (see the before/after sketch below)
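A minimal before/after sketch (the class and method names are made up for illustration):
```scala
// Before: procedure-like syntax, which Scala 2.13 warns about.
class ReceiverBefore(name: String) {
  def this() { this("default") }        // constructor defined without '='
  def run() { println(s"run $name") }   // method defined without ': Unit ='
}

// After: explicit function syntax, valid in both Scala 2.12 and 2.13.
class ReceiverAfter(name: String) {
  def this() = this("default")
  def run(): Unit = println(s"run $name")
}
```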
### Why are the changes needed?
Eliminate compilation warnings in Scala 2.13; this change should also be compatible with Scala 2.12.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass the Jenkins or GitHub Action
Closes#30255 from LuciferYang/SPARK-29392-FOLLOWUP.1.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Javadocs updated for the encoder to include maps as a collection type
### Why are the changes needed?
The javadocs were not updated with the fix for SPARK-16706.
### Does this PR introduce _any_ user-facing change?
Yes, the javadocs are updated
### How was this patch tested?
sbt was run to ensure the changes meet scalastyle.
Closes#30274 from hannahkamundson/SPARK-32860.
Lead-authored-by: Hannah Amundson <amundson.hannah@heb.com>
Co-authored-by: Hannah <48397717+hannahkamundson@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
There are 4 fields in `MutableApplicationInfo` that seem useless:
- `coresGranted`
- `maxCores`
- `coresPerExecutor`
- `memoryPerExecutorMB`
They are always `None` and not reassigned.
So the main change of this PR is to clean up these useless fields in `MutableApplicationInfo`.
### Why are the changes needed?
Clean up useless variables.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass the Jenkins or GitHub Action
Closes#30251 from LuciferYang/SPARK-33347.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
The changes in [SPARK-32501 Inconsistent NULL conversions to strings](https://issues.apache.org/jira/browse/SPARK-32501) introduced some behavior that I'd like to clean up a bit.
Here's sample code to illustrate the behavior I'd like to clean up:
```scala
val rows = Seq[String](null)
.toDF("value")
.withColumn("struct1", struct('value as "value1"))
.withColumn("struct2", struct('value as "value1", 'value as "value2"))
.withColumn("array1", array('value))
.withColumn("array2", array('value, 'value))
// Show the DataFrame using the "first" codepath.
rows.show(truncate=false)
+-----+-------+-------------+------+--------+
|value|struct1|struct2 |array1|array2 |
+-----+-------+-------------+------+--------+
|null |{ null}|{ null, null}|[] |[, null]|
+-----+-------+-------------+------+--------+
// Write the DataFrame to disk, then read it back and show it to trigger the "codegen" code path:
rows.write.parquet("rows")
spark.read.parquet("rows").show(truncate=false)
+-----+-------+-------------+-------+-------------+
|value|struct1|struct2 |array1 |array2 |
+-----+-------+-------------+-------+-------------+
|null |{ null}|{ null, null}|[ null]|[ null, null]|
+-----+-------+-------------+-------+-------------+
```
Notice:
1. If the first element of a struct is null, it is printed with a leading space (e.g. "\{ null\}"). I think it's preferable to print it without the leading space (e.g. "\{null\}"). This is consistent with how non-null values are printed inside a struct.
2. If the first element of an array is null, it is not printed at all in the first code path, and the "codegen" code path prints it with a leading space. I think both code paths should be consistent and print it without a leading space (e.g. "[null]").
The desired result of this PR is to produce the following output via both code paths:
```
+-----+-------+------------+------+------------+
|value|struct1|struct2 |array1|array2 |
+-----+-------+------------+------+------------+
|null |{null} |{null, null}|[null]|[null, null]|
+-----+-------+------------+------+------------+
```
This contribution is my original work and I license the work to the project under the project’s open source license.
### Why are the changes needed?
To correct errors and inconsistencies in how DataFrame.show() displays nulls inside arrays and structs.
### Does this PR introduce _any_ user-facing change?
Yes. This PR changes what is printed out by DataFrame.show().
### How was this patch tested?
I added new test cases in CastSuite.scala to cover the cases addressed by this PR.
Closes#30189 from stwhit/show_nulls.
Authored-by: Stuart White <stuart.white1@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
Add executor peak JVM memory metrics to the executors page.
![image](https://user-images.githubusercontent.com/1633312/97767765-9121bf00-1adb-11eb-93c7-7912d9fe7826.png)
### Why are the changes needed?
Users can see executor peak JVM memory metrics on the executors page.
### Does this PR introduce _any_ user-facing change?
Yes. Users can see executor peak JVM memory metrics on the executors page.
### How was this patch tested?
Manually tested
Closes#30186 from warrenzhu25/23432.
Authored-by: Warren Zhu <warren.zhu25@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
This PR proposes to introduce the `purge` option in `TableCatalog.dropTable` so that v2 catalogs can use the option if needed.
Related discussion: https://github.com/apache/spark/pull/30079#discussion_r510594110
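A minimal usage sketch (the catalog, namespace, table name, and provider are hypothetical): Spark's DDL already accepts the `PURGE` keyword, and with this change a v2 catalog can receive and act on that flag.
```scala
// Assumes a v2 catalog named "testcat" is configured; names are illustrative only.
spark.sql("CREATE TABLE testcat.ns.tbl (id INT) USING foo")
// The purge flag is now passed down to the v2 catalog instead of being ignored.
spark.sql("DROP TABLE testcat.ns.tbl PURGE")
```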
### Why are the changes needed?
Spark DDL supports passing the purge option to `DROP TABLE` command. However, the option is not used (ignored) for v2 catalogs.
### Does this PR introduce _any_ user-facing change?
This PR introduces a new API in `TableCatalog`.
### How was this patch tested?
Added a test.
Closes#30267 from imback82/purge_table.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Override the default SQL strings in the MsSQLServer JDBC dialect, according to the official documentation, for:
- ALTER TABLE RENAME COLUMN
- ALTER TABLE UPDATE COLUMN NULLABILITY

Also write MsSqlServer integration tests for JDBC (a rough sketch of the emitted SQL follows below).
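As a rough, non-authoritative sketch of the SQL Server syntax the dialect needs to emit (table, column, and type names are placeholders, and this is not the actual Spark dialect code):
```scala
// Hypothetical helper showing the statements the MsSqlServer dialect can generate.
object MsSqlServerDialectSketch {
  // SQL Server renames columns via the sp_rename stored procedure.
  def renameColumn(table: String, oldCol: String, newCol: String): String =
    s"EXEC sp_rename '$table.$oldCol', '$newCol', 'COLUMN'"

  // Changing nullability uses ALTER COLUMN and must restate the column type.
  def updateColumnNullability(table: String, col: String, colType: String, nullable: Boolean): String = {
    val nullability = if (nullable) "NULL" else "NOT NULL"
    s"ALTER TABLE $table ALTER COLUMN $col $colType $nullability"
  }
}
```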
### Why are the changes needed?
To add the support for alter table when interacting with MSSql Server.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Added tests.
Closes#30038 from ScrapCodes/mssql-dialect.
Authored-by: Prashant Sharma <prashsh1@in.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Fix the wrong URL and display name of the blocking thread on the threadDump page.
The blockingThreadId variable passed to the page should be of String type instead of Option type.
### Why are the changes needed?
The blocking thread ID on the UI page is not displayed correctly, and the corresponding URL does not redirect properly.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
This PR only involves minor changes to the page and does not affect other functions.
The manual test results are as follows: the thread name displayed on the page is correct, and clicking the link jumps to the corresponding URL.
![shows_ok](https://user-images.githubusercontent.com/52202080/98108177-89488d00-1ed6-11eb-9488-8446c3f38bad.gif)
Closes#30249 from akiyamaneko/thread-dump-improve.
Authored-by: neko <echohlne@gmail.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
### What changes were proposed in this pull request?
1. Rename `OffsetWindowSpec` to `OffsetWindowFunction`, as it's the base class for all offset based window functions.
2. Refine and add more comments.
3. Remove `isRelative` as it's useless.
### Why are the changes needed?
code refinement
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
existing tests
Closes#30261 from cloud-fan/window.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR fixes incorrect JIRA ids in `Analyzer.scala` introduced by SPARK-31670 (https://github.com/apache/spark/pull/28490)
```scala
- // SPARK-31607: Resolve Struct field in selectedGroupByExprs/groupByExprs and aggregations
+ // SPARK-31670: Resolve Struct field in selectedGroupByExprs/groupByExprs and aggregations
```
### Why are the changes needed?
Fix the wrong information.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
This is a comment change. Manually review.
Closes#30269 from dongjoon-hyun/SPARK-31670-MINOR.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to update SBT from 1.4.1 to 1.4.2.
### Why are the changes needed?
This will bring the latest bug fixes.
- https://github.com/sbt/sbt/releases/tag/v1.4.2
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs.
Closes#30268 from williamhyun/sbt.
Authored-by: William Hyun <williamhyun3@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Remove duplicated code in `ResolveOutputRelation` by adding `V2WriteCommand.withNewQuery`.
### Why are the changes needed?
code cleanup
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing tests
Closes#30264 from cloud-fan/ds-minor.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Currently when run in `cluster` mode on YARN, the Spark `yarn.Client` will print out the application report into the logs, to be easily viewed by users. For example:
```
INFO yarn.Client:
client token: Token { kind: YARN_CLIENT_TOKEN, service: }
diagnostics: N/A
ApplicationMaster host: X.X.X.X
ApplicationMaster RPC port: 0
queue: default
start time: 1602782566027
final status: UNDEFINED
tracking URL: http://hostname:8888/proxy/application_<id>/
user: xkrogen
```
I propose adding, alongside the application report, some additional lines like:
```
Driver Logs (stdout): http://hostname:8042/node/containerlogs/container_<id>/xkrogen/stdout?start=-4096
Driver Logs (stderr): http://hostname:8042/node/containerlogs/container_<id>/xkrogen/stderr?start=-4096
```
This information isn't contained in the `ApplicationReport`, so it's necessary to query the ResourceManager REST API. For now I have added this as an always-on feature, but if there is any concern about adding this REST dependency, I think hiding this feature behind an off-by-default flag is reasonable.
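A rough sketch of the REST lookup described above (host names, the application id, and the JSON handling are placeholders; the real implementation lives in `yarn.Client`):
```scala
import scala.io.Source

// The RM "appattempts" endpoint exposes a log link for the AM container, from
// which the stdout/stderr URLs (with a ?start=-4096 tail) can be derived.
val rmBase = "http://resourcemanager.example.com:8088"
val appId = "application_1604529046091_0001" // placeholder application id
val attemptsJson =
  Source.fromURL(s"$rmBase/ws/v1/cluster/apps/$appId/appattempts").mkString
println(attemptsJson) // parse the "logsLink" field of the latest attempt from here
```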
### Why are the changes needed?
Typically, the tracking URL can be used to find the logs of the ApplicationMaster/driver while the application is running. Later, the Spark History Server can be used to track this information down, using the stdout/stderr links on the Executors page.
However, in the situation when the driver crashed _before_ writing out a history file, the SHS may not be aware of this application, and thus does not contain links to the driver logs. When this situation arises, it can be difficult for users to debug further, since they can't easily find their driver logs.
It is possible to reach the logs by using the `yarn logs` commands, but the average Spark user isn't aware of this and shouldn't have to be.
With this information readily available in the logs, users can quickly jump to their driver logs, even if it crashed before the SHS became aware of the application. This has the additional benefit of providing a quick way to access driver logs, which often contain useful information, in a single click (instead of navigating through the Spark UI).
### Does this PR introduce _any_ user-facing change?
Yes, some additional print statements will be created in the application report when using YARN in cluster mode.
### How was this patch tested?
Added unit tests for the parsing logic in `yarn.ClientSuite`. Also tested against a live cluster. When the driver is running:
```
INFO Client: Application report for application_XXXXXXXXX_YYYYYY (state: RUNNING)
INFO Client:
client token: Token { kind: YARN_CLIENT_TOKEN, service: }
diagnostics: N/A
ApplicationMaster host: host.example.com
ApplicationMaster RPC port: ######
queue: queue_name
start time: 1604529046091
final status: UNDEFINED
tracking URL: http://host.example.com:8080/proxy/application_XXXXXXXXX_YYYYYY/
user: xkrogen
Driver Logs (stdout): http://host.example.com:8042/node/containerlogs/container_e07_XXXXXXXXX_YYYYYY_01_000001/xkrogen/stdout?start=-4096
Driver Logs (stderr): http://host.example.com:8042/node/containerlogs/container_e07_XXXXXXXXX_YYYYYY_01_000001/xkrogen/stderr?start=-4096
INFO Client: Application report for application_XXXXXXXXX_YYYYYY (state: RUNNING)
```
I confirmed that when the driver has not yet launched, the report does not include the two Driver Logs items. Will omit the output here for brevity since it looks the same.
Closes#30096 from xkrogen/xkrogen-SPARK-33185-yarn-client-print.
Authored-by: Erik Krogen <xkrogen@apache.org>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
### What changes were proposed in this pull request?
Update SQL migration guide for SPARK-33290
### Why are the changes needed?
Make the change better documented.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
N/A
Closes#30256 from sunchao/SPARK-33290-2.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR changes the behavior of the GitHub Actions job that caches dependencies.
SPARK-33226 upgraded sbt to 1.4.1.
As of 1.3.0, sbt uses Coursier as the dependency resolver / fetcher.
So let's change the dependency cache configuration for the GitHub Actions job.
### Why are the changes needed?
To make build faster with Coursier for the GitHub Actions job.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Should be done by GitHub Actions itself.
Closes#30259 from sarutak/coursier-cache.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Fix a small bug in `V2WriteCommand.resolved`. It should always require the `table` and `query` to be resolved.
### Why are the changes needed?
To prevent potential bugs where we skip resolving the input query.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
a new test
Closes#30265 from cloud-fan/ds-minor-2.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
There's a concept of 'read-only' and 'read+write' state stores in Spark which is only defined "implicitly". Spark doesn't prevent writes to a 'read-only' state store; Spark just assumes a read-only stateful operator will not modify the state store. Given it's not defined explicitly, the state store instance has to be implemented as 'read+write' even when it's being used as 'read-only', which sometimes brings confusion.
For example, abort() in HDFSBackedStateStore - d38f816748/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala (L143-L155)
The comment sounds as if the statement works differently between 'read-only' and 'read+write', but that's not true, as both state stores have state initialized as UPDATING (no difference). So the 'read-only' state also creates the temporary file, initializes output streams to write to the temporary file, closes the output streams, and finally deletes the temporary file. These unnecessary operations are done per batch/partition.
This patch explicitly defines 'read-only' StateStore, and enables state store provider to create 'read-only' StateStore instance if requested. Relevant code paths are modified, as well as 'read-only' StateStore implementation for HDFSBackedStateStore is introduced. The new implementation gets rid of unnecessary operations explained above.
In terms of backward compatibility, the only thing changed on the public API side is `StateStoreProvider`. The trait `StateStoreProvider` has to be changed to allow requesting a 'read-only' StateStore; this patch adds a default implementation which leverages the 'read+write' StateStore but wraps it with a 'write-protected' StateStore instance, so that custom providers don't need to change their code to reflect the change. But if the providers can optimize for a read-only workload, they'll be happy to make a change.
Please note that this patch makes ReadOnlyStateStore extend StateStore and be referred to as StateStore, as StateStore is used in so many places and it's not easy to support both traits if we differentiate them. So unfortunately the write methods are still exposed for read-only state; they just throw UnsupportedOperationException.
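A conceptual sketch of the 'write-protected' wrapper idea with deliberately simplified traits (this is not Spark's actual `StateStore`/`StateStoreProvider` API):
```scala
// Simplified stand-ins for the real traits, for illustration only.
trait SimpleStateStore {
  def get(key: String): Option[String]
  def put(key: String, value: String): Unit
}

// Read-only view over an existing read+write store: writes throw.
class WriteProtectedStore(delegate: SimpleStateStore) extends SimpleStateStore {
  def get(key: String): Option[String] = delegate.get(key)
  def put(key: String, value: String): Unit =
    throw new UnsupportedOperationException("Cannot mutate a read-only state store")
}

trait SimpleProvider {
  def getStore(): SimpleStateStore
  // Default implementation: custom providers keep working unchanged, but can
  // override this to build a genuinely cheaper read-only store.
  def getReadStore(): SimpleStateStore = new WriteProtectedStore(getStore())
}
```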
### Why are the changes needed?
The new API opens the chance to optimize a read-only state store instance compared with a read+write state store instance. HDFSBackedStateStoreProvider is modified to provide a read-only version of the state store which doesn't deal with the temporary file or the state machine.
### Does this PR introduce _any_ user-facing change?
Clearly "no" for most end users, and also "no" for custom state store providers as it doesn't touch trait `StateStore` as well as provides default implementation for added method in trait `StateStoreProvider`.
### How was this patch tested?
Modified UT. Existing UTs ensure the change doesn't break anything.
Closes#26935 from HeartSaVioR/SPARK-30294.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
Update join type in stream static joins code examples in structured streaming programming guide.
1) Scala, Java and Python examples have a common issue.
The join keyword is "right_join"; it should be "left_outer".
_Reasons:_
a) This code snippet is an example of a "left outer join", as the streaming df is on the left and the static df is on the right. Also, a right outer join between a streaming df (left) and a static df (right) is not supported.
b) The keyword "right_join/left_join" is unsupported and it should be "right_outer/left_outer".
So, all of these code snippets have been updated to "left_outer".
2) The R example is correct, but it shows a "right_outer" join with the static df (left) and the streaming df (right).
It is changed to "left_outer" to make it consistent with the other three examples in Scala, Java and Python (see the hedged snippet below).
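For reference, a hedged version of the corrected snippet (the DataFrame definitions and the join column are placeholders, not the guide's exact code):
```scala
val staticDf = spark.read.parquet("/path/to/static")
val streamingDf = spark.readStream.schema(staticDf.schema).parquet("/path/to/stream")

// Streaming DataFrame on the left, static DataFrame on the right:
// "left_outer" is a supported join type here, while "right_outer" (and the
// non-existent "right_join"/"left_join" keywords) are not.
streamingDf.join(staticDf, Seq("type"), "left_outer")
```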
### Why are the changes needed?
To fix the mistake in example code of documentation.
### Does this PR introduce _any_ user-facing change?
Yes, it is a user-facing change (but documentation update only).
**Screenshots 1: Scala/Java/python example (similar issue)**
_Before:_
<img width="941" alt="Screenshot 2020-11-05 at 12 16 09 AM" src="https://user-images.githubusercontent.com/62717942/98155351-19e59400-1efc-11eb-8142-e6a25a5e6497.png">
_After:_
<img width="922" alt="Screenshot 2020-11-05 at 12 17 12 AM" src="https://user-images.githubusercontent.com/62717942/98155503-5d400280-1efc-11eb-96e1-5ba0f3c35c82.png">
**Screenshots 2: R example (Make it consistent with above change)**
_Before:_
<img width="896" alt="Screenshot 2020-11-05 at 12 19 57 AM" src="https://user-images.githubusercontent.com/62717942/98155685-ac863300-1efc-11eb-93bc-b7ca4dd34634.png">
_After:_
<img width="919" alt="Screenshot 2020-11-05 at 12 20 51 AM" src="https://user-images.githubusercontent.com/62717942/98155739-c0ca3000-1efc-11eb-8f95-a7538fa784b7.png">
### How was this patch tested?
The change was tested locally.
1) cd docs/
SKIP_API=1 jekyll build
2) Verify docs/_site/structured-streaming-programming-guide.html file in browser.
Closes#30252 from sarveshdave1/doc-update-stream-static-joins.
Authored-by: Sarvesh Dave <sarveshdave1@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
This PR removes the old Probot Autolabeler labeling configuration, as the probot autolabeler has been deprecated. I've updated the configs in Iceberg and in Avro, and we also need to update here. This PR adds an additional workflow for labeling PRs and migrates the old probot config to the new format. Unfortunately, because certain features have not been released upstream, we will not get the _exact_ behavior as before. I have documented where that is and what changes are needed, and in the associated ticket I've also discussed other options and why I think this is the best way to go. A follow-up ticket is definitely needed to get the original behavior back in these few cases, but PRs have not been labeled for almost a month, so it's probably best to get it right 95% of the time and occasionally have some UI-related PRs labeled as `CORE` while the issue is resolved upstream and/or further investigated.
### Why are the changes needed?
The probot autolabeler is dead and will not be maintained going forward. This has been confirmed with github user [at]mithro in an issue in their repository.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
To test this PR, I first merged the config into my local fork. I then edited it several times and ran tests on that.
Unfortunately, I've overwritten my fork with the apache repo in order to create a proper PR. However, I've also added the config for the same thing in the Iceberg repo as well as the Avro repo.
I have now merged this PR into my local repo and will be running some tests on edge cases there and for validating in general:
- [Check that the SQL label is applied for changes directly below repo root's sql directory](https://github.com/kbendick/spark/pull/16) ✅
- [Check that the structured streaming label is applied](https://github.com/kbendick/spark/pull/20) ✅
- [Check that a wildcard at the end of a pattern will match nested files](https://github.com/kbendick/spark/pull/19) ✅
- [Check that the rule **/*pom.xml will match the root pom.xml file](https://github.com/kbendick/spark/pull/25) ✅
I've also discovered that we're likely not killing GitHub Actions runs (like large tests etc.) when users push to their PR. In most cases, I see that a user has to mark something as "OK to test", but it still seems like we might want to discuss whether or not we should add a cancellation step in order to save time / capacity on the runners. If so desired, we would add an action in each workflow that cancels old runs when a `push` event occurs on a PR. This will likely make waiting for test runners much faster if tests are automatically rerun on push by anybody (such as PMCs, PRs that have been marked OK to test, etc.). We could potentially free a large number of resources if a cancellation step were added to all of the workflows in the Apache account (as GitHub Actions API limits are set at the account level).
Admittedly, the fact that the "old" workflow runs weren't cancelled could be because I was working in a fork, but given that there are explicit actions that can be added to the start of workflows to cancel old PR workflows, and given that we don't have them configured, this is likely the case in this repo (and in most `apache` repos as well), at least under certain circumstances (e.g. repos that don't have "Ok to test"-like webhooks, as one example).
This is a separate issue though, which I can bring up on the mailing list once I'm done with this PR. Unfortunately I've been very busy the past two weeks, but if somebody else wanted to work on that I would be happy to support with any knowledge I have.
The last Apache repo to still have the probot autolabeler in it is Beam, at which point we can have Gavin from ASF Infra remove the permissions for the probot autolabeler entirely. See the associated JIRA ticket for the links to other tickets, like the one for ASF Infra to remove the dead probot autolabeler's read and write permissions to our PRs in the Apache organization.
Closes#30244 from kbendick/begin-migration-to-github-labeler-action.
Authored-by: Kyle Bendickson <kjbendickson@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This change is to support user provided nullable Avro schema for data with non-nullable catalyst schema in Avro writing.
Without this change, when users try to use a nullable Avro schema to write data with a non-nullable catalyst schema, it will throw an `IncompatibleSchemaException` with a message like `Cannot convert Catalyst type StringType to Avro type ["null","string"]`. With this change it will assume that the data is non-nullable, log a warning message for the nullability difference and serialize the data to Avro format with the nullable Avro schema provided.
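A hedged example of the scenario (the output path and the Avro schema contents are illustrative): the catalyst schema of `id` below is non-nullable, while the user-provided Avro schema declares the field as a nullable union.
```scala
import spark.implicits._

val nullableAvroSchema =
  """{
    |  "type": "record",
    |  "name": "topLevelRecord",
    |  "fields": [
    |    {"name": "id", "type": ["null", "long"], "default": null}
    |  ]
    |}""".stripMargin

// LongType from a Seq[Long] is non-nullable; before this change the write below
// failed with IncompatibleSchemaException, now it succeeds and logs a warning.
val df = Seq(1L, 2L, 3L).toDF("id")
df.write.format("avro").option("avroSchema", nullableAvroSchema).save("/tmp/avro_nullable_out")
```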
### Why are the changes needed?
This change is needed because sometimes our users do not have full control over the nullability of the Avro schemas they use, and this change provides them with the flexibility.
### Does this PR introduce _any_ user-facing change?
Yes. Users are allowed to use nullable Avro schemas for data with non-nullable catalyst schemas in Avro writing after the change.
### How was this patch tested?
Added unit tests.
Closes#30224 from bozhang2820/avro-nullable.
Authored-by: Bo Zhang <bo.zhang@databricks.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
### What changes were proposed in this pull request?
This PR adds a check to RowReader#hasNextRow such that multiple calls to RowReader#hasNextRow with no intervening call to RowReader#nextRow will avoid consuming more than 1 record.
This PR also modifies RowReader#nextRow such that consecutive calls will return new rows (previously consecutive calls would return the same row).
### Why are the changes needed?
SPARK-32346 slightly refactored the AvroFileFormat and AvroPartitionReaderFactory to use a new iterator-like trait called AvroUtils#RowReader. RowReader#hasNextRow consumes a raw input record and stores the deserialized row for the next call to RowReader#nextRow. Unfortunately, sometimes hasNextRow is called twice before nextRow is called, resulting in a lost row.
For example (which assumes V1 Avro reader):
```scala
val df = spark.range(0, 25).toDF("index")
df.write.mode("overwrite").format("avro").save("index_avro")
val loaded = spark.read.format("avro").load("index_avro")
// The following will give the expected size
loaded.collect.size
// The following will give the wrong size
loaded.orderBy("index").collect.size
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added tests, which fail without the fix.
Closes#30221 from bersprockets/avro_iterator_play.
Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR fixes the issue that spark-shell doesn't work if it's built with `sbt package` (without any profiles specified).
It's because hadoop-client-runtime.jar isn't copied to assembly/target/scala-2.12/jars.
```
$ bin/spark-shell
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/shaded/com/ctc/wstx/io/InputBootstrapper
at org.apache.spark.deploy.SparkHadoopUtil$.newConfiguration(SparkHadoopUtil.scala:426)
at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$2(SparkSubmit.scala:342)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:342)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:877)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1013)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1022)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.shaded.com.ctc.wstx.io.InputBootstrapper
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
```
### Why are the changes needed?
This is a bug.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Ran spark-shell and confirmed it works.
Closes#30250 from sarutak/copy-runtime-sbt.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR proposes to register the executor source with the Spark metrics system when running in local mode.
### Why are the changes needed?
The Apache Spark metrics system provides many useful insights on the Spark workload.
In particular, the [executor source metrics](https://github.com/apache/spark/blob/master/docs/monitoring.md#component-instance--executor) provide detailed info, including the number of active tasks, I/O metrics, and several task metrics details. The executor source metrics, contrary to other sources (for example the ExecutorMetrics source), are not available when running in local mode.
Having executor metrics in local mode can be useful when testing and troubleshooting Spark workloads in a development environment. The metrics can be fed to a dashboard to see the evolution of resource usage and can be used to troubleshoot performance,
as [in this example](https://github.com/cerndb/spark-dashboard).
Currently users will have to deploy on a cluster to be able to collect executor source metrics, while the possibility of having them in local mode is handy for testing.
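A small, hedged illustration of how this can be exercised (the application name and the query are placeholders):
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("local-executor-metrics-demo")
  .getOrCreate()

spark.range(0, 1000000L).selectExpr("sum(id)").show()
// While the application is running, the executor source metrics should now be
// visible at http://localhost:4040/metrics/json/ alongside the other sources.
```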
### Does this PR introduce _any_ user-facing change?
- This PR exposes executor source metrics data when running in local mode.
### How was this patch tested?
- Manually tested by running in local mode and inspecting the metrics listed in http://localhost:4040/metrics/json/
- Also added a test in `SourceConfigSuite`
Closes#28528 from LucaCanali/metricsWithLocalMode.
Authored-by: Luca Canali <luca.canali@cern.ch>
Signed-off-by: Thomas Graves <tgraves@apache.org>
### What changes were proposed in this pull request?
This PR aims to fix `semanticEquals` so that it works correctly on `GetMapValue` expressions having literal maps with `ArrayBasedMapData` and `GenericArrayData`.
### Why are the changes needed?
This is a regression from Apache Spark 1.6.x.
```scala
scala> sc.version
res1: String = 1.6.3
scala> sqlContext.sql("SELECT map('k1', 'v1')[k] FROM t GROUP BY map('k1', 'v1')[k]").show
+---+
|_c0|
+---+
| v1|
+---+
```
Apache Spark 2.x ~ 3.0.1 raise `RuntimeException` for the following queries.
```sql
CREATE TABLE t USING ORC AS SELECT map('k1', 'v1') m, 'k1' k
SELECT map('k1', 'v1')[k] FROM t GROUP BY 1
SELECT map('k1', 'v1')[k] FROM t GROUP BY map('k1', 'v1')[k]
SELECT map('k1', 'v1')[k] a FROM t GROUP BY a
```
**BEFORE**
```scala
Caused by: java.lang.RuntimeException: Couldn't find k#3 in [keys: [k1], values: [v1][k#3]#6]
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:85)
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:79)
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
```
**AFTER**
```sql
spark-sql> SELECT map('k1', 'v1')[k] FROM t GROUP BY 1;
v1
Time taken: 1.278 seconds, Fetched 1 row(s)
spark-sql> SELECT map('k1', 'v1')[k] FROM t GROUP BY map('k1', 'v1')[k];
v1
Time taken: 0.313 seconds, Fetched 1 row(s)
spark-sql> SELECT map('k1', 'v1')[k] a FROM t GROUP BY a;
v1
Time taken: 0.265 seconds, Fetched 1 row(s)
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs with the newly added test case.
Closes#30246 from dongjoon-hyun/SPARK-33338.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR changes some usages of `Seq` in `PostgresIntegrationSuite` to `scala.collection.Seq`.
When I run `docker-integration-test`, I noticed that `PostgresIntegrationSuite` failed due to `ClassCastException`.
The reason is the same as what is resolved in SPARK-29292.
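For context, a hedged illustration of the Scala 2.13 pitfall this works around (the column name is a placeholder): the unqualified `Seq` alias is `immutable.Seq` in 2.13, so a runtime cast of a generic sequence can fail, while declaring `scala.collection.Seq` stays compatible with both 2.12 and 2.13.
```scala
import org.apache.spark.sql.Row

// Row.getSeq returns a scala.collection.Seq; asking for the 2.13 default
// (immutable) Seq instead can end in a ClassCastException at runtime.
def readValues(row: Row): scala.collection.Seq[Int] =
  row.getAs[scala.collection.Seq[Int]]("values")
```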
### Why are the changes needed?
To pass `docker-integration-test` for Scala 2.13.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Ran the fixed `PostgresIntegrationSuite` and confirmed it finished successfully.
Closes#30166 from sarutak/fix-toseq-postgresql.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR changes `HiveExternalCatalogVersionsSuite` to, by default, use a standard temporary directory to store the Spark binaries that it localizes. It additionally adds a new System property, `spark.test.cache-dir`, which can be used to define a static location into which the Spark binary will be localized to allow for sharing between test executions. If the System property is used, the downloaded binaries won't be deleted after the test runs.
### Why are the changes needed?
In SPARK-22356 (PR #19579), the `sparkTestingDir` used by `HiveExternalCatalogVersionsSuite` became hard-coded to enable re-use of the downloaded Spark tarball between test executions:
```
// For local test, you can set `sparkTestingDir` to a static value like `/tmp/test-spark`, to
// avoid downloading Spark of different versions in each run.
private val sparkTestingDir = new File("/tmp/test-spark")
```
However this doesn't work, since it gets deleted every time:
```
override def afterAll(): Unit = {
  try {
    Utils.deleteRecursively(wareHousePath)
    Utils.deleteRecursively(tmpDataDir)
    Utils.deleteRecursively(sparkTestingDir)
  } finally {
    super.afterAll()
  }
}
```
It's bad that we're hard-coding to a `/tmp` directory, as in some cases this is not the proper place to store temporary files. We're not currently making any good use of it.
### Does this PR introduce _any_ user-facing change?
Developer-facing changes only, as this is in a test.
### How was this patch tested?
The test continues to execute as expected.
Closes#30122 from xkrogen/xkrogen-SPARK-33214-hiveexternalversioncatalogsuite-fix.
Authored-by: Erik Krogen <xkrogen@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to migrate `ANALYZE TABLE` and `ANALYZE TABLE ... FOR COLUMNS` to use `UnresolvedTableOrView` to resolve the table/view identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).
Note that `ANALYZE TABLE` is not supported for v2 tables.
### Why are the changes needed?
The changes allow consistent resolution behavior when resolving the table/view identifier. For example, the following is the current behavior:
```scala
sql("create temporary view t as select 1")
sql("create database db")
sql("create table db.t using csv as select 1")
sql("use db")
sql("ANALYZE TABLE t compute statistics") // Succeeds
```
With this change, ANALYZE TABLE above fails with the following:
```
org.apache.spark.sql.AnalysisException: t is a temp view not table or permanent view.; line 1 pos 0
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews$$anonfun$apply$7.$anonfun$applyOrElse$40(Analyzer.scala:872)
at scala.Option.map(Option.scala:230)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews$$anonfun$apply$7.applyOrElse(Analyzer.scala:870)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews$$anonfun$apply$7.applyOrElse(Analyzer.scala:856)
```
This is expected, since the temporary view is resolved first and `ANALYZE TABLE` doesn't support a temporary view.
### Does this PR introduce _any_ user-facing change?
After this PR, `ANALYZE TABLE t` is resolved to a temp view `t` instead of table `db.t`.
### How was this patch tested?
Updated existing tests.
Closes#30229 from imback82/parse_v1table.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add a `query.resolved` check before converting the Hive relation.
### Why are the changes needed?
For a better error message.
```
CREATE TABLE t STORED AS PARQUET AS
SELECT * FROM (
SELECT c3 FROM (
SELECT c1, c2 from values(1,2) t(c1, c2)
)
)
```
Before this PR, we get the following error message:
```
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to toAttribute on unresolved object, tree: *
at org.apache.spark.sql.catalyst.analysis.Star.toAttribute(unresolved.scala:244)
at org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicLogicalOperators.scala:52)
at org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicLogicalOperators.scala:52)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:392)
```
### Does this PR introduce _any_ user-facing change?
Yes, the error message changed.
### How was this patch tested?
Add test.
Closes#30230 from ulysses-you/SPARK-33323.
Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Rename `TableWriteExec` in `WriteToDataSourceV2Exec.scala` to `TableWriteExecHelper`.
### Why are the changes needed?
See [discussion](https://github.com/apache/spark/pull/30193#discussion_r516412653). The former is too general.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
N/A
Closes#30235 from sunchao/SPARK-33293-2.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
In older versions of Spark, the sorting function on the storage UI page works normally, but sorting in the new version causes the header content to be lost, so this PR fixes the bug.
### Why are the changes needed?
The header of the table on the page is similar to the following; **note that each `th` contains a `span` element**:
```html
<thead>
<tr>
....
<th width="" class="">
<span data-toggle="tooltip" title="" data-original-title="StorageLevel displays where the persisted RDD is stored, format of the persisted RDD (serialized or de-serialized) andreplication factor of the persisted RDD">
Storage Level
</span>
</th>
.....
</tr>
</thead>
```
Since [PR#26136](https://github.com/apache/spark/pull/26136), if the `th` in the table itself contains a `span` element, the `span` will be deleted directly after clicking to sort, and the original header content will be lost.
There are three problems in `sorttable.js`:
1. `sortrevind.class = "sorttable_sortrevind"` in [sorttab.js#107](9d5e48ea95/core/src/main/resources/org/apache/spark/ui/static/sorttable.js (L107)) and `sortfwdind.class = "sorttable_sortfwdind"` in [sorttab.js#125](9d5e48ea95/core/src/main/resources/org/apache/spark/ui/static/sorttable.js (L125))
The sorttable_xx value should be assigned to `className` instead of `class`, as the JavaScript uses `rowlists[j].className.search` rather than `rowlists[j].class.search` to determine whether the component has a sorting flag or not.
2. `rowlists[j].className.search(/\sorttable_sortrevind\b/)` in [sorttab.js#120](9d5e48ea95/core/src/main/resources/org/apache/spark/ui/static/sorttable.js (L120)) was wrong. The original intention is to search whether `className` contains the word `sorttable_sortrevind` , but the expression is wrong, it should be `\bsorttable_sortrevind\b` instead of `\sorttable_sortrevind\b`
3. The if-check statement in the following code snippet ([sorttab.js#141](9d5e48ea95/core/src/main/resources/org/apache/spark/ui/static/sorttable.js (L141))) was wrong. **If the `search` function does not find the target, it returns -1, but Boolean(-1) actually evaluates to true**. This statement will cause the span to be deleted even if it does not contain `sorttable_sortfwdind` or `sorttable_sortrevind`.
```javascript
rowlists = this.parentNode.getElementsByTagName("span");
for (var j=0; j < rowlists.length; j++) {
if (rowlists[j].className.search(/\bsorttable_sortfwdind\b/)
|| rowlists[j].className.search(/\sorttable_sortrevind\b/) ) {
rowlists[j].parentNode.removeChild(rowlists[j]);
}
}
```
### Does this PR introduce _any_ user-facing change?
NO.
### How was this patch tested?
The manual test result of the ui page is as below:
![fix sorted](https://user-images.githubusercontent.com/52202080/97543194-daeaa680-1a02-11eb-8b11-8109c3e4e9a3.gif)
Closes#30182 from akiyamaneko/ui_storage_sort_error.
Authored-by: neko <echohlne@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This PR adds support for passing `Column`s as input to PySpark sorting functions.
### Why are the changes needed?
According to SPARK-26979, PySpark functions should support both Column and str arguments, when possible.
### Does this PR introduce _any_ user-facing change?
PySpark users can now provide both `Column` and `str` as an argument for `asc*` and `desc*` functions.
### How was this patch tested?
New unit tests.
Closes#30227 from zero323/SPARK-33257.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to upgrade `Kubernetes-client` from 4.10.3 to 4.11.1.
### Why are the changes needed?
This upgrades the dependency for Apache Spark 3.1.0.
Since 4.12.0 is still new and has breaking API changes, this PR chooses the latest compatible one.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the all CIs including K8s IT.
Closes#30233 from dongjoon-hyun/SPARK-33324.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR proposes to migrate to [NumPy documentation style](https://numpydoc.readthedocs.io/en/latest/format.html), see also SPARK-33243.
While I am migrating, I also fixed some Python type hints accordingly.
### Why are the changes needed?
For better documentation as text itself, and generated HTMLs
### Does this PR introduce _any_ user-facing change?
Yes, they will see a better format of HTMLs, and better text format. See SPARK-33243.
### How was this patch tested?
Manually tested via running `./dev/lint-python`.
Closes#30181 from HyukjinKwon/SPARK-33250.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Remove the JSON formatted schema from comments for `from_json()` in Scala/Python APIs.
Closes#30201
### Why are the changes needed?
Schemas in JSON format are internal (not documented). They shouldn't be recommended for usage.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By linters.
Closes#30226 from MaxGekk/from_json-common-schema-parsing-2.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
In the PR, I propose to group `DateType` and `TimestampType` together in checking time zone needs in the `Cast.needsTimeZone()` method.
### Why are the changes needed?
To improve code maintainability.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By the existing test `"SPARK-33306: Timezone is needed when cast Date to String"`.
Closes#30223 from MaxGekk/WangGuangxin-SPARK-33306-followup.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR adds all built-in SerDes to `HiveSerDeReadWriteSuite`.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RowFormats&SerDe
### Why are the changes needed?
We will upgrade Parquet, ORC and Avro, and need to ensure compatibility.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
N/A
Closes#30228 from wangyum/SPARK-33319.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
As a followup to the comment at https://github.com/apache/spark/pull/29804#issuecomment-700650620, here we add the physical plan rule DisableUnnecessaryBucketedScan into AQE AdaptiveSparkPlanExec.queryStagePreparationRules, to make auto bucketed scan work with AQE (a usage sketch follows after the list below).
The change is mostly in:
* `AdaptiveSparkPlanExec.scala`: add physical plan rule `DisableUnnecessaryBucketedScan`
* `DisableUnnecessaryBucketedScan.scala`: propagate logical plan link for the file source scan exec operator, otherwise we lose the logical plan link information when AQE is enabled, and will get exception [here](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala#L176). (for example, for query `SELECT * FROM bucketed_table` with AQE is enabled)
* `DisableUnnecessaryBucketedScanSuite.scala`: add new test suite for AQE enabled - `DisableUnnecessaryBucketedScanWithoutHiveSupportSuiteAE`, and changed some of tests to use `AdaptiveSparkPlanHelper.find/collect`, to make the plan verification work when AQE enabled.
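As a hedged usage sketch (the table name is a placeholder, and the second configuration key is quoted from memory, so verify it against the SQL configuration docs):
```scala
// With both flags on, the preparation rule added here can disable an
// unnecessary bucketed scan even when AQE re-plans the query.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.sources.bucketing.autoBucketedScan.enabled", "true")
spark.sql("SELECT * FROM bucketed_table").collect()
```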
### Why are the changes needed?
It's reasonable to support disabling unnecessary bucketed scan when AQE is enabled; this helps optimize the query in that case.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added unit test in `DisableUnnecessaryBucketedScanSuite`.
Closes#30200 from c21/auto-bucket-aqe.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Support column rename for the MySQL dialect.
### Why are the changes needed?
At the moment, it does not work for MySQL version 5.x, so we should throw a proper exception for that case.
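Illustrative only (table and column names are placeholders): the statement the dialect can emit differs by server version, which is why 5.x needs an explicit error.
```scala
// MySQL 8.0+ supports renaming a column directly:
val rename80 = "ALTER TABLE employees RENAME COLUMN surname TO last_name"
// MySQL 5.x only offers CHANGE, which requires restating the full column
// definition that the JDBC dialect does not know, hence the proper exception.
val rename5x = "ALTER TABLE employees CHANGE surname last_name VARCHAR(100)"
```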
### Does this PR introduce _any_ user-facing change?
Yes, `column rename` with the MySQL dialect should work correctly.
### How was this patch tested?
Added tests for rename column.
Ran the tests against both versions of MySQL:
* `export MYSQL_DOCKER_IMAGE_NAME=mysql:5.7.31`
* `export MYSQL_DOCKER_IMAGE_NAME=mysql:8.0`
Closes#30142 from ScrapCodes/mysql-dialect-rename.
Authored-by: Prashant Sharma <prashsh1@in.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR modifies `R/pkg/tests/run-all.R` by:
- Removing `testthat` 1.x support, as Jenkins has been upgraded to 2.x with SPARK-30637 and this code is no longer relevant.
- Adding `testthat` 3.x support to avoid AppVeyor failures.
### Why are the changes needed?
Currently used internal API has been removed in the latest `testthat` release.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Tests executed against `testthat == 2.3.2` and `testthat == 3.0.0`
Closes#30219 from zero323/SPARK-33313.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Allow to run the Spark web UI behind a reverse proxy with URLs prefixed by a context root, like www.mydomain.com/spark. In particular, this allows to access multiple Spark clusters through the same virtual host, only distinguishing them by context root, like www.mydomain.com/cluster1, www.mydomain.com/cluster2, and it allows to run the Spark UI in a common cookie domain (for SSO) with other services.
### Why are the changes needed?
This PR is to take over https://github.com/apache/spark/pull/17455.
After changes, Spark allows showing customized prefix URL in all the `href` links of the HTML pages.
### Does this PR introduce _any_ user-facing change?
Yes, all the links on UI pages will contain the value of `spark.ui.reverseProxyUrl` if it is configured.
### How was this patch tested?
New HTML Unit tests in MasterSuite
Manual UI testing for master, worker and app UI with an nginx proxy
Spark config:
```
spark.ui.port 8080
spark.ui.reverseProxy=true
spark.ui.reverseProxyUrl=/path/to/spark/
```
nginx config:
```
server {
    listen 9000;
    set $SPARK_MASTER http://127.0.0.1:8080;
    # split spark UI path into prefix and local path within master UI
    location ~ ^(/path/to/spark/) {
        # strip prefix when forwarding request
        rewrite /path/to/spark(/.*) $1 break;
        #rewrite /path/to/spark/ "/" ;
        # forward to spark master UI
        proxy_pass $SPARK_MASTER;
        proxy_intercept_errors on;
        error_page 301 302 307 = handle_redirects;
    }
    location handle_redirects {
        set $saved_redirect_location '$upstream_http_location';
        proxy_pass $saved_redirect_location;
    }
}
```
Closes#29820 from gengliangwang/revertProxyURL.
Lead-authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Co-authored-by: Oliver Köth <okoeth@de.ibm.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
### What changes were proposed in this pull request?
As the Python evaluation consumes the parent iterator in a separate thread, it could consume more data from the parent even after the task ends and the parent is closed. Thus, we should use `ContextAwareIterator` to stop consuming after the task ends.
### Why are the changes needed?
Python/Pandas UDF right after off-heap vectorized reader could cause executor crash.
E.g.,:
```py
spark.range(0, 100000, 1, 1).write.parquet(path)
spark.conf.set("spark.sql.columnVector.offheap.enabled", True)
def f(x):
    return 0
fUdf = udf(f, LongType())
spark.read.parquet(path).select(fUdf('id')).head()
```
This is because, the Python evaluation consumes the parent iterator in a separate thread and it consumes more data from the parent even after the task ends and the parent is closed. If an off-heap column vector exists in the parent iterator, it could cause segmentation fault which crashes the executor.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added tests, and manually.
Closes#30177 from ueshin/issues/SPARK-33277/python_pandas_udf.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Relax pyspark typing for sql str functions. These functions all pass the first argument through `_to_java_column`, such that a string or Column object is acceptable.
### Why are the changes needed?
Convenience & ensuring the typing reflects the functionality
### Does this PR introduce _any_ user-facing change?
Yes, a backwards-compatible increase in functionality. But I think typing support is unreleased, so possibly no change to released versions.
### How was this patch tested?
Not tested. I am newish to Python typing with stubs, so someone should confirm this is the correct way to fix this.
Closes#30209 from dhimmel/patch-1.
Authored-by: Daniel Himmelstein <daniel.himmelstein@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
When `spark.sql.legacy.typeCoercion.datetimeToString.enabled` is enabled, Spark will cast a date to a string when comparing a date with a string. In Spark 3, a timezone is needed when casting a date to a string, as in 72ad9dcd5d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala (L309).
However, the timezone may not be set, because `CastBase.needsTimeZone` returns false for this kind of casting.
A simple way to reproduce this is
```
spark-shell --conf spark.sql.legacy.typeCoercion.datetimeToString.enabled=true
```
When we execute the following SQL,
```
select a.d1 from
(select to_date(concat('2000-01-0', id)) as d1 from range(1, 2)) a
join
(select concat('2000-01-0', id) as d2 from range(1, 2)) b
on a.d1 = b.d2
```
it will throw
```
java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:529)
at scala.None$.get(Option.scala:527)
at org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId(datetimeExpressions.scala:56)
at org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId$(datetimeExpressions.scala:56)
at org.apache.spark.sql.catalyst.expressions.CastBase.zoneId$lzycompute(Cast.scala:253)
at org.apache.spark.sql.catalyst.expressions.CastBase.zoneId(Cast.scala:253)
at org.apache.spark.sql.catalyst.expressions.CastBase.dateFormatter$lzycompute(Cast.scala:287)
at org.apache.spark.sql.catalyst.expressions.CastBase.dateFormatter(Cast.scala:287)
```
### Why are the changes needed?
As described above, it's a bug here.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Add more UT
Closes#30213 from WangGuangxin/SPARK-33306.
Authored-by: wangguangxin.cn <wangguangxin.cn@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Refactor `WriteToDataSourceV2Exec` via removing code duplication around write to table logic:
- renamed `AtomicTableWriteExec` to `TableWriteExec` so that the table write logic in this trait can be modified and shared with `CreateTableAsSelectExec`, `ReplaceTableAsSelectExec`, `AtomicCreateTableAsSelectExec ` and `AtomicReplaceTableAsSelectExec`.
- similar to the above, renamed `writeToStagedTable` to `writeToTable` in `TableWriteExec`.
- extended `writeToTable` so that it can handle both staged table as well as non-staged table.
### Why are the changes needed?
Simplify the logic and remove duplication, to make this piece of code easier to maintain.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass CIs with the existing test coverage.
Closes#30193 from sunchao/SPARK-33293.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>