Commit graph

28678 commits

Author SHA1 Message Date
neko f6c0007970 [SPARK-33342][WEBUI] fix the wrong url and display name of blocking thread in threadDump page
### What changes were proposed in this pull request?
fix the wrong url and display name of blocking thread in threadDump page.
The blockingThreadId variable passed to the page should be of string type instead of Option type.

### Why are the changes needed?
blocking threadId in the ui page is not displayed well, and the corresponding  url cannot be redirected normally

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
The pr  only involves minor changes to the page and does not affect other functions,
The manual test results are as follows. The thread name displayed on the page is correct, and you can click on the URL to jump to the corresponding url

![shows_ok](https://user-images.githubusercontent.com/52202080/98108177-89488d00-1ed6-11eb-9488-8446c3f38bad.gif)

Closes #30249 from akiyamaneko/thread-dump-improve.

Authored-by: neko <echohlne@gmail.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-11-06 13:45:02 +08:00
Wenchen Fan d16311051d [SPARK-32934][SQL][FOLLOW-UP] Refine class naming and code comments
### What changes were proposed in this pull request?

1. Rename `OffsetWindowSpec` to `OffsetWindowFunction`, as it's the base class for all offset based window functions.
2. Refine and add more comments.
3. Remove `isRelative` as it's useless.

### Why are the changes needed?

code refinement

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #30261 from cloud-fan/window.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-06 05:20:25 +00:00
Dongjoon Hyun 90f35c663e [MINOR][SQL] Fix incorrect JIRA ID comments in Analyzer
### What changes were proposed in this pull request?

This PR fixes incorrect JIRA ids in `Analyzer.scala` introduced by  SPARK-31670 (https://github.com/apache/spark/pull/28490)
```scala
- // SPARK-31607: Resolve Struct field in selectedGroupByExprs/groupByExprs and aggregations
+ // SPARK-31670: Resolve Struct field in selectedGroupByExprs/groupByExprs and aggregations
```

### Why are the changes needed?

Fix the wrong information.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

This is a comment change. Manually review.

Closes #30269 from dongjoon-hyun/SPARK-31670-MINOR.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-06 12:46:26 +09:00
William Hyun 4941b7ae18 [SPARK-33365][BUILD] Update SBT to 1.4.2
### What changes were proposed in this pull request?
This PR aims to update SBT from 1.4.1 to 1.4.2.

### Why are the changes needed?

This will bring the latest bug fixes.
- https://github.com/sbt/sbt/releases/tag/v1.4.2

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the CIs.

Closes #30268 from williamhyun/sbt.

Authored-by: William Hyun <williamhyun3@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-11-05 17:37:44 -08:00
Wenchen Fan cd4e3d3b0c [SPARK-33360][SQL] Simplify DS v2 write resolution
### What changes were proposed in this pull request?

Removing duplicated code in `ResolveOutputRelation`, by adding `V2WriteCommand.withNewQuery`

### Why are the changes needed?

code cleanup

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

existing tests

Closes #30264 from cloud-fan/ds-minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-11-05 15:44:04 -08:00
Erik Krogen 324275ae83 [SPARK-33185][YARN] Set up yarn.Client to print direct links to driver stdout/stderr
### What changes were proposed in this pull request?
Currently when run in `cluster` mode on YARN, the Spark `yarn.Client` will print out the application report into the logs, to be easily viewed by users. For example:
```
INFO yarn.Client:
 	 client token: Token { kind: YARN_CLIENT_TOKEN, service:  }
 	 diagnostics: N/A
 	 ApplicationMaster host: X.X.X.X
 	 ApplicationMaster RPC port: 0
 	 queue: default
 	 start time: 1602782566027
 	 final status: UNDEFINED
 	 tracking URL: http://hostname:8888/proxy/application_<id>/
 	 user: xkrogen
```

I propose adding, alongside the application report, some additional lines like:
```
         Driver Logs (stdout): http://hostname:8042/node/containerlogs/container_<id>/xkrogen/stdout?start=-4096
         Driver Logs (stderr): http://hostname:8042/node/containerlogs/container_<id>/xkrogen/stderr?start=-4096
```

This information isn't contained in the `ApplicationReport`, so it's necessary to query the ResourceManager REST API. For now I have added this as an always-on feature, but if there is any concern about adding this REST dependency, I think hiding this feature behind an off-by-default flag is reasonable.

### Why are the changes needed?
Typically, the tracking URL can be used to find the logs of the ApplicationMaster/driver while the application is running. Later, the Spark History Server can be used to track this information down, using the stdout/stderr links on the Executors page.

However, in the situation when the driver crashed _before_ writing out a history file, the SHS may not be aware of this application, and thus does not contain links to the driver logs. When this situation arises, it can be difficult for users to debug further, since they can't easily find their driver logs.

It is possible to reach the logs by using the `yarn logs` commands, but the average Spark user isn't aware of this and shouldn't have to be.

With this information readily available in the logs, users can quickly jump to their driver logs, even if it crashed before the SHS became aware of the application. This has the additional benefit of providing a quick way to access driver logs, which often contain useful information, in a single click (instead of navigating through the Spark UI).

### Does this PR introduce _any_ user-facing change?
Yes, some additional print statements will be created in the application report when using YARN in cluster mode.

### How was this patch tested?
Added unit tests for the parsing logic in `yarn.ClientSuite`. Also tested against a live cluster. When the driver is running:
```
INFO Client: Application report for application_XXXXXXXXX_YYYYYY (state: RUNNING)
INFO Client:
         client token: Token { kind: YARN_CLIENT_TOKEN, service:  }
         diagnostics: N/A
         ApplicationMaster host: host.example.com
         ApplicationMaster RPC port: ######
         queue: queue_name
         start time: 1604529046091
         final status: UNDEFINED
         tracking URL: http://host.example.com:8080/proxy/application_XXXXXXXXX_YYYYYY/
         user: xkrogen
         Driver Logs (stdout): http://host.example.com:8042/node/containerlogs/container_e07_XXXXXXXXX_YYYYYY_01_000001/xkrogen/stdout?start=-4096
         Driver Logs (stderr): http://host.example.com:8042/node/containerlogs/container_e07_XXXXXXXXX_YYYYYY_01_000001/xkrogen/stderr?start=-4096
INFO Client: Application report for application_XXXXXXXXX_YYYYYY (state: RUNNING)
```
I confirmed that when the driver has not yet launched, the report does not include the two Driver Logs items. Will omit the output here for brevity since it looks the same.

Closes #30096 from xkrogen/xkrogen-SPARK-33185-yarn-client-print.

Authored-by: Erik Krogen <xkrogen@apache.org>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
2020-11-05 12:38:42 -06:00
Chao Sun 1a704793f4 [SPARK-33290][SQL][DOCS][FOLLOW-UP] Update SQL migration guide
### What changes were proposed in this pull request?

Update SQL migration guide for SPARK-33290

### Why are the changes needed?

Make the change better documented.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

N/A

Closes #30256 from sunchao/SPARK-33290-2.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-11-05 10:09:28 -08:00
Kousuke Saruta 208b94e4c1 [SPARK-33353][BUILD] Cache dependencies for Coursier with new sbt in GitHub Actions
### What changes were proposed in this pull request?

This PR change the behavior of GitHub Actions job that caches dependencies.
SPARK-33226 upgraded sbt to 1.4.1.
As of 1.3.0, sbt uses Coursier as the dependency resolver / fetcher.
So let's change the dependency cache configuration for the GitHub Actions job.

### Why are the changes needed?

To make build faster with Coursier for the GitHub Actions job.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Should be done by GitHub Actions itself.

Closes #30259 from sarutak/coursier-cache.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-11-05 09:29:53 -08:00
Wenchen Fan 26ea417b14 [SPARK-33362][SQL] skipSchemaResolution should still require query to be resolved
### What changes were proposed in this pull request?

Fix a small bug in `V2WriteCommand.resolved`. It should always require the `table` and `query` to be resolved.

### Why are the changes needed?

To prevent potential bugs that we skip resolve the input query.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

a new test

Closes #30265 from cloud-fan/ds-minor-2.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-11-05 09:23:41 -08:00
Jungtaek Lim (HeartSaVioR) 21413b7dd4 [SPARK-30294][SS] Explicitly defines read-only StateStore and optimize for HDFSBackedStateStore
### What changes were proposed in this pull request?

There's a concept of 'read-only' and 'read+write' state store in Spark which is defined "implicitly". Spark doesn't prevent write for 'read-only' state store; Spark just assumes read-only stateful operator will not modify the state store. Given it's not defined explicitly, the instance of state store has to be implemented as 'read+write' even it's being used as 'read-only', which sometimes brings confusion.

For example, abort() in HDFSBackedStateStore - d38f816748/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala (L143-L155)

The comment sounds as if statement works differently between 'read-only' and 'read+write', but that's not true as both state store has state initialized as UPDATING (no difference). So 'read-only' state also creates the temporary file, initializes output streams to write to temporary file, closes output streams, and finally deletes the temporary file. This unnecessary operations are being done per batch/partition.

This patch explicitly defines 'read-only' StateStore, and enables state store provider to create 'read-only' StateStore instance if requested. Relevant code paths are modified, as well as 'read-only' StateStore implementation for HDFSBackedStateStore is introduced. The new implementation gets rid of unnecessary operations explained above.

In point of backward-compatibility view, the only thing being changed in public API side is `StateStoreProvider`. The trait `StateStoreProvider` has to be changed to allow requesting 'read-only' StateStore; this patch adds default implementation which leverages 'read+write' StateStore but wrapping with 'write-protected' StateStore instance, so that custom providers don't need to change their code to reflect the change. But if the providers can optimize for read-only workload, they'll be happy to make a change.

Please note that this patch makes ReadOnlyStateStore extend StateStore and being referred as StateStore, as StateStore is being used in so many places and it's not easy to support both traits if we differentiate them. So unfortunately these write methods are still exposed for read-only state; it just throws UnsupportedOperationException.

### Why are the changes needed?

The new API opens the chance to optimize read-only state store instance compared with read+write state store instance. HDFSBackedStateStoreProvider is modified to provide read-only version of state store which doesn't deal with temporary file as well as state machine.

### Does this PR introduce any user-facing change?

Clearly "no" for most end users, and also "no" for custom state store providers as it doesn't touch trait `StateStore` as well as provides default implementation for added method in trait `StateStoreProvider`.

### How was this patch tested?

Modified UT. Existing UTs ensure the change doesn't break anything.

Closes #26935 from HeartSaVioR/SPARK-30294.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
2020-11-05 18:21:17 +09:00
Sarvesh Dave e66201b30b [MINOR][SS][DOCS] Update join type in stream static joins code examples
### What changes were proposed in this pull request?
Update join type in stream static joins code examples in structured streaming programming guide.
1) Scala, Java and Python examples have a common issue.
    The join keyword is "right_join", it should be "left_outer".

    _Reasons:_
    a) This code snippet is an example of "left outer join" as the streaming df is on left and static df is on right. Also, right outer    join between stream df(left) and static df(right) is not supported.
    b) The keyword "right_join/left_join" is unsupported and it should be "right_outer/left_outer".

So, all of these code snippets have been updated to "left_outer".

2) R exmaple is correct, but the example is of "right_outer" with static df (left) and streaming df(right).
It is changed to "left_outer" to make it consistent with other three examples of scala, java and python.

### Why are the changes needed?
To fix the mistake in example code of documentation.

### Does this PR introduce _any_ user-facing change?
Yes, it is a user-facing change (but documentation update only).

**Screenshots 1: Scala/Java/python example (similar issue)**
_Before:_
<img width="941" alt="Screenshot 2020-11-05 at 12 16 09 AM" src="https://user-images.githubusercontent.com/62717942/98155351-19e59400-1efc-11eb-8142-e6a25a5e6497.png">

_After:_
<img width="922" alt="Screenshot 2020-11-05 at 12 17 12 AM" src="https://user-images.githubusercontent.com/62717942/98155503-5d400280-1efc-11eb-96e1-5ba0f3c35c82.png">

**Screenshots 2: R example (Make it consistent with above change)**
_Before:_
<img width="896" alt="Screenshot 2020-11-05 at 12 19 57 AM" src="https://user-images.githubusercontent.com/62717942/98155685-ac863300-1efc-11eb-93bc-b7ca4dd34634.png">

_After:_
<img width="919" alt="Screenshot 2020-11-05 at 12 20 51 AM" src="https://user-images.githubusercontent.com/62717942/98155739-c0ca3000-1efc-11eb-8f95-a7538fa784b7.png">

### How was this patch tested?
The change was tested locally.
1) cd docs/
    SKIP_API=1 jekyll build
2) Verify docs/_site/structured-streaming-programming-guide.html file in browser.

Closes #30252 from sarveshdave1/doc-update-stream-static-joins.

Authored-by: Sarvesh Dave <sarveshdave1@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
2020-11-05 16:22:31 +09:00
HyukjinKwon d530ed0ea8 Revert "[SPARK-33277][PYSPARK][SQL] Use ContextAwareIterator to stop consuming after the task ends"
This reverts commit b8a440f098.
2020-11-05 16:15:17 +09:00
Kyle Bendickson 0535b34ad4 [SPARK-33282] Migrate from deprecated probot autolabeler to GitHub labeler action
### What changes were proposed in this pull request?

This PR removes the old Probot Autolabeler labeling configuration, as the probot autolabeler has been deprecated. I've updated the configs in Iceberg and in Avro, and we also need to update here. This PR adds in an additional workflow for labeling PRs and migrates the old probot config to the new format. Unfortunately, because certain features have not been released upstream, we will not get the _exact_ behavior as before. I have documented where that is and what changes are neeeded, and in the associated ticket I've also discussed other options and why I think this is the best way to go. Definitely a follow up ticket is needed to get the original behavior back in these few cases, but PRs have not been labeled for almost a month and so it's probably best to get it right 95% of the time and occasionally have some UI related PRs labeled as `CORE` while the issue is resolved upstream and/or further investigated.

### Why are the changes needed?

The probot autolabeler is dead and will not be maintained going forward. This has been confirmed with github user [at]mithro in an issue in their repository.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

To test this PR, I first merged the config into my local fork. I then edited it several times and ran tests on that.

Unfortunately, I've overwritten my fork with the apache repo in order to create a proper PR. However, I've also added the config for the same thing in the Iceberg repo as well as the Avro repo.

I have now merged this PR into my local repo and will be running some tests on edge cases there and for validating in general:
- [Check that the SQL label is applied for changes directly below repo root's sql directory](https://github.com/kbendick/spark/pull/16) 
- [Check that the structured streaming label is applied](https://github.com/kbendick/spark/pull/20) 
- [Check that a wildcard at the end of a pattern will match nested files](https://github.com/kbendick/spark/pull/19) 
- [Check that the rule **/*pom.xml will match the root pom.xml file](https://github.com/kbendick/spark/pull/25) 

I've also discovered that we're likely not killing github actions that run (like large tests etc) when users push to their PR. In most cases, I see that a user has to mark something as "OK to test", but it still seems like we might want to discuss whether or not we should add a cancellation step In order to save time / capacity on the runners. If so desired, we would add an action in each workflow that cancels old runs when a `push` action occurs on a PR. This will likely make waiting for test runners much faster iff tests are automatically rerun on push by anybody (such as PMCs, PRs that have been marked OK to test, etc). We could free a large number of resources potentially if a cancellation step was added to all of the workflows in the Apache account (as github action API limits are set at the account level).

Admittedly, the fact that the "old" workflow runs weren't cancelled could admittedly be because of the fact that I was working in a fork, but given that there are explicit actions to be added to the start of workflows to cancel old PR workflows and given that we don't have them configured indicates to me that likely this is the case in this repo (and in most `apache` repos as well), at least under certain circumstances (e.g. repos that don't have "Ok to test"-like webhooks as one example).

This is a separate issue though, which I can bring up on the mailing list once I'm done with this PR. Unfortunately I've been very busy the past two weeks, but if somebody else wanted to work on that I would be happy to support with any knowledge I have.

The last Apache repo to still have the probot autolabeler in it is Beam, at which point we can have Gavin from ASF Infra remove the permissions for the probot autolabeler entirely. See the associated JIRA ticket for the links to other tickets, like the one for ASF Infra to remove the dead probot autolabeler's read and write permissions to our PRs in the Apache organization.

Closes #30244 from kbendick/begin-migration-to-github-labeler-action.

Authored-by: Kyle Bendickson <kjbendickson@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-05 16:10:52 +09:00
Bo Zhang 551b504cfe [SPARK-33316][SQL] Support user provided nullable Avro schema for non-nullable catalyst schema in Avro writing
### What changes were proposed in this pull request?
This change is to support user provided nullable Avro schema for data with non-nullable catalyst schema in Avro writing.

Without this change, when users try to use a nullable Avro schema to write data with a non-nullable catalyst schema, it will throw an `IncompatibleSchemaException` with a message like `Cannot convert Catalyst type StringType to Avro type ["null","string"]`. With this change it will assume that the data is non-nullable, log a warning message for the nullability difference and serialize the data to Avro format with the nullable Avro schema provided.

### Why are the changes needed?
This change is needed because sometimes our users do not have full control over the nullability of the Avro schemas they use, and this change provides them with the flexibility.

### Does this PR introduce _any_ user-facing change?
Yes. Users are allowed to use nullable Avro schemas for data with non-nullable catalyst schemas in Avro writing after the change.

### How was this patch tested?
Added unit tests.

Closes #30224 from bozhang2820/avro-nullable.

Authored-by: Bo Zhang <bo.zhang@databricks.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-11-05 12:27:20 +08:00
Bruce Robbins 7e8eb0447b [SPARK-33314][SQL] Avoid dropping rows in Avro reader
### What changes were proposed in this pull request?

This PR adds a check to  RowReader#hasNextRow such that multiple calls to RowReader#hasNextRow with no intervening call to RowReader#nextRow will avoid consuming more than 1 record.

This PR also modifies RowReader#nextRow such that consecutive calls will return new rows (previously consecutive calls would return the same row).

### Why are the changes needed?

SPARK-32346 slightly refactored the AvroFileFormat and AvroPartitionReaderFactory to use a new iterator-like trait called AvroUtils#RowReader. RowReader#hasNextRow consumes a raw input record and stores the deserialized row for the next call to RowReader#nextRow. Unfortunately, sometimes hasNextRow is called twice before nextRow is called, resulting in a lost row.

For example (which assumes V1 Avro reader):
```scala
val df = spark.range(0, 25).toDF("index")
df.write.mode("overwrite").format("avro").save("index_avro")
val loaded = spark.read.format("avro").load("index_avro")
// The following will give the expected size
loaded.collect.size
// The following will give the wrong size
loaded.orderBy("index").collect.size
```
### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added tests, which fail without the fix.

Closes #30221 from bersprockets/avro_iterator_play.

Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-05 11:50:11 +09:00
Kousuke Saruta d24dbe8955 [SPARK-33343][BUILD] Fix the build with sbt to copy hadoop-client-runtime.jar
### What changes were proposed in this pull request?

This PR fix the issue that spark-shell doesn't work if it's built with `sbt package` (without any profiles specified).
It's due to hadoop-client-runtime.jar isn't copied to assembly/target/scala-2.12/jars.
```
$ bin/spark-shell
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/shaded/com/ctc/wstx/io/InputBootstrapper
	at org.apache.spark.deploy.SparkHadoopUtil$.newConfiguration(SparkHadoopUtil.scala:426)
	at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$2(SparkSubmit.scala:342)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:342)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:877)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1013)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1022)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.shaded.com.ctc.wstx.io.InputBootstrapper
	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
```

### Why are the changes needed?

This is a bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Ran spark-shell and confirmed it works.

Closes #30250 from sarutak/copy-runtime-sbt.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-11-04 15:05:35 -08:00
Luca Canali b7fff03973 [SPARK-31711][CORE] Register the executor source with the metrics system when running in local mode
### What changes were proposed in this pull request?
This PR proposes to register the executor source with the Spark metrics system when running in local mode.

### Why are the changes needed?
The Apache Spark metrics system provides many useful insights on the Spark workload.
In particular, the [executor source metrics](https://github.com/apache/spark/blob/master/docs/monitoring.md#component-instance--executor) provide detailed info, including the number of active tasks, I/O metrics, and several task metrics details. The executor source metrics, contrary to other sources (for example ExecutorMetrics source), is not available when running in local mode.
Having executor metrics in local mode can be useful when testing and troubleshooting Spark workloads in a development environment. The metrics can be fed to a dashboard to see the evolution of resource usage and can be used to troubleshoot performance,
as [in this example](https://github.com/cerndb/spark-dashboard).
Currently users will have to deploy on a cluster to be able to collect executor source metrics, while the possibility of having them in local mode is handy for testing.

### Does this PR introduce _any_ user-facing change?
- This PR exposes executor source metrics data when running in local mode.

### How was this patch tested?
- Manually tested by running in local mode and inspecting the metrics listed in http://localhost:4040/metrics/json/
- Also added a test in `SourceConfigSuite`

Closes #28528 from LucaCanali/metricsWithLocalMode.

Authored-by: Luca Canali <luca.canali@cern.ch>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2020-11-04 16:48:55 -06:00
Dongjoon Hyun 42c0b175ce [SPARK-33338][SQL] GROUP BY using literal map should not fail
### What changes were proposed in this pull request?

This PR aims to fix `semanticEquals` works correctly on `GetMapValue` expressions having literal maps with `ArrayBasedMapData` and `GenericArrayData`.

### Why are the changes needed?

This is a regression from Apache Spark 1.6.x.
```scala
scala> sc.version
res1: String = 1.6.3

scala> sqlContext.sql("SELECT map('k1', 'v1')[k] FROM t GROUP BY map('k1', 'v1')[k]").show
+---+
|_c0|
+---+
| v1|
+---+
```

Apache Spark 2.x ~ 3.0.1 raise`RuntimeException` for the following queries.
```sql
CREATE TABLE t USING ORC AS SELECT map('k1', 'v1') m, 'k1' k
SELECT map('k1', 'v1')[k] FROM t GROUP BY 1
SELECT map('k1', 'v1')[k] FROM t GROUP BY map('k1', 'v1')[k]
SELECT map('k1', 'v1')[k] a FROM t GROUP BY a
```

**BEFORE**
```scala
Caused by: java.lang.RuntimeException: Couldn't find k#3 in [keys: [k1], values: [v1][k#3]#6]
	at scala.sys.package$.error(package.scala:27)
	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:85)
	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:79)
	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
```

**AFTER**
```sql
spark-sql> SELECT map('k1', 'v1')[k] FROM t GROUP BY 1;
v1
Time taken: 1.278 seconds, Fetched 1 row(s)
spark-sql> SELECT map('k1', 'v1')[k] FROM t GROUP BY map('k1', 'v1')[k];
v1
Time taken: 0.313 seconds, Fetched 1 row(s)
spark-sql> SELECT map('k1', 'v1')[k] a FROM t GROUP BY a;
v1
Time taken: 0.265 seconds, Fetched 1 row(s)
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs with the newly added test case.

Closes #30246 from dongjoon-hyun/SPARK-33338.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-11-04 08:35:10 -08:00
Kousuke Saruta 0b557b3290 [SPARK-33265][TEST] Rename classOf[Seq] to classOf[scala.collection.Seq] in PostgresIntegrationSuite for Scala 2.13
### What changes were proposed in this pull request?

This PR renames some part of `Seq` in `PostgresIntegrationSuite` to `scala.collection.Seq`.
When I run `docker-integration-test`, I noticed that `PostgresIntegrationSuite` failed due to `ClassCastException`.
The reason is the same as what is resolved in SPARK-29292.

### Why are the changes needed?

To pass `docker-integration-test` for Scala 2.13.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Ran `PostgresIntegrationSuite` fixed and confirmed it successfully finished.

Closes #30166 from sarutak/fix-toseq-postgresql.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-04 17:39:06 +09:00
Erik Krogen ff724d23b6 [SPARK-33214][TEST][HIVE] Stop HiveExternalCatalogVersionsSuite from using a hard-coded location to store localized Spark binaries
### What changes were proposed in this pull request?
This PR changes `HiveExternalCatalogVersionsSuite` to, by default, use a standard temporary directory to store the Spark binaries that it localizes. It additionally adds a new System property, `spark.test.cache-dir`, which can be used to define a static location into which the Spark binary will be localized to allow for sharing between test executions. If the System property is used, the downloaded binaries won't be deleted after the test runs.

### Why are the changes needed?
In SPARK-22356 (PR #19579), the `sparkTestingDir` used by `HiveExternalCatalogVersionsSuite` became hard-coded to enable re-use of the downloaded Spark tarball between test executions:
```
  // For local test, you can set `sparkTestingDir` to a static value like `/tmp/test-spark`, to
  // avoid downloading Spark of different versions in each run.
  private val sparkTestingDir = new File("/tmp/test-spark")
```
However this doesn't work, since it gets deleted every time:
```
  override def afterAll(): Unit = {
    try {
      Utils.deleteRecursively(wareHousePath)
      Utils.deleteRecursively(tmpDataDir)
      Utils.deleteRecursively(sparkTestingDir)
    } finally {
      super.afterAll()
    }
  }
```

It's bad that we're hard-coding to a `/tmp` directory, as in some cases this is not the proper place to store temporary files. We're not currently making any good use of it.

### Does this PR introduce _any_ user-facing change?
Developer-facing changes only, as this is in a test.

### How was this patch tested?
The test continues to execute as expected.

Closes #30122 from xkrogen/xkrogen-SPARK-33214-hiveexternalversioncatalogsuite-fix.

Authored-by: Erik Krogen <xkrogen@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-04 06:51:54 +00:00
Terry Kim 0ad35ba5f8 [SPARK-33321][SQL] Migrate ANALYZE TABLE commands to use UnresolvedTableOrView to resolve the identifier
### What changes were proposed in this pull request?

This PR proposes to migrate `ANALYZE TABLE` and `ANALYZE TABLE ... FOR COLUMNS` to use `UnresolvedTableOrView` to resolve the table/view identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).

Note that `ANALYZE TABLE` is not supported for v2 tables.

### Why are the changes needed?

The changes allow consistent resolution behavior when resolving the table/view identifier. For example, the following is the current behavior:
```scala
sql("create temporary view t as select 1")
sql("create database db")
sql("create table db.t using csv as select 1")
sql("use db")
sql("ANALYZE TABLE t compute statistics") // Succeeds
```
With this change, ANALYZE TABLE above fails with the following:
```
    org.apache.spark.sql.AnalysisException: t is a temp view not table or permanent view.; line 1 pos 0
	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews$$anonfun$apply$7.$anonfun$applyOrElse$40(Analyzer.scala:872)
	at scala.Option.map(Option.scala:230)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews$$anonfun$apply$7.applyOrElse(Analyzer.scala:870)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews$$anonfun$apply$7.applyOrElse(Analyzer.scala:856)
```
, which is expected since temporary view is resolved first and ANALYZE TABLE doesn't support a temporary view.

### Does this PR introduce _any_ user-facing change?

After this PR, `ANALYZE TABLE t` is resolved to a temp view `t` instead of table `db.t`.

### How was this patch tested?

Updated existing tests.

Closes #30229 from imback82/parse_v1table.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-04 06:50:37 +00:00
ulysses 1740b29b3f [SPARK-33323][SQL] Add query resolved check before convert hive relation
### What changes were proposed in this pull request?

Add query.resolved before  convert hive relation.

### Why are the changes needed?

For better error msg.
```
CREATE TABLE t STORED AS PARQUET AS
SELECT * FROM (
 SELECT c3 FROM (
  SELECT c1, c2 from values(1,2) t(c1, c2)
  )
)
```
 Before this PR, we get such error msg
```
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to toAttribute on unresolved object, tree: *
  at org.apache.spark.sql.catalyst.analysis.Star.toAttribute(unresolved.scala:244)
  at org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicLogicalOperators.scala:52)
  at org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicLogicalOperators.scala:52)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.immutable.List.foreach(List.scala:392)
```

### Does this PR introduce _any_ user-facing change?

Yes, error msg changed.

### How was this patch tested?

Add test.

Closes #30230 from ulysses-you/SPARK-33323.

Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-04 05:01:39 +00:00
Wenchen Fan 034070a23a Revert "[SPARK-33248][SQL] Add a configuration to control the legacy behavior of whether need to pad null value when value size less then schema size"
This reverts commit 0c943cd2fb.
2020-11-04 12:30:38 +08:00
Chao Sun d900c6ff49 [SPARK-33293][SQL][FOLLOW-UP] Rename TableWriteExec to TableWriteExecHelper
### What changes were proposed in this pull request?

Rename `TableWriteExec` in `WriteToDataSourceV2Exec.scala` to `TableWriteExecHelper`.

### Why are the changes needed?

See [discussion](https://github.com/apache/spark/pull/30193#discussion_r516412653). The former is too general.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

N/A

Closes #30235 from sunchao/SPARK-33293-2.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-11-03 14:53:01 -08:00
neko 56c623e98c [SPARK-33284][WEB-UI] In the Storage UI page, clicking any field to sort the table will cause the header content to be lost
### What changes were proposed in this pull request?
In the old version of spark in the storage UI page, the sorting function is normal, but sorting in the new version will cause the header content to be lost, So I try to fix the bug.

### Why are the changes needed?

The header field of the table on the page is similar to the following, **note that each th contains the span attribute**:

```html
<thead>
    <tr>
        ....
        <th width="" class="">
              <span data-toggle="tooltip" title="" data-original-title="StorageLevel displays where the persisted RDD is stored, format of the persisted RDD (serialized or de-serialized) andreplication factor of the persisted RDD">
                Storage Level
              </span>
        </th>
       .....
    </tr>
</thead>
```

Since  [PR#26136](https://github.com/apache/spark/pull/26136), if the `th` in the table itself contains the `span` attribute, the `span` will be deleted directly after clicking the sort, and the original header content will be lost.

There are three problems  in `sorttable.js`:

1. `sortrevind.class = "sorttable_sortrevind"` in  [sorttab.js#107](9d5e48ea95/core/src/main/resources/org/apache/spark/ui/static/sorttable.js (L107)) and  `sortfwdind.class = "sorttable_sortfwdind"` in  [sorttab.js#125](9d5e48ea95/core/src/main/resources/org/apache/spark/ui/static/sorttable.js (L125))
sorttable_xx attribute should be assigned to`className` instead of `class`, as javascript uses `rowlists[j].className.search` rather than `rowlists[j].class.search` to determine whether the component has a sorting flag or not.
2.  `rowlists[j].className.search(/\sorttable_sortrevind\b/)` in  [sorttab.js#120](9d5e48ea95/core/src/main/resources/org/apache/spark/ui/static/sorttable.js (L120)) was wrong. The original intention is to search whether `className` contains  the word `sorttable_sortrevind` , but the expression is wrong,  it should be `\bsorttable_sortrevind\b` instead of `\sorttable_sortrevind\b`
3. The if check statement in the following code snippet ([sorttab.js#141](9d5e48ea95/core/src/main/resources/org/apache/spark/ui/static/sorttable.js (L141))) was wrong. **If the `search` function does not find the target, it will return -1, but Boolean(-1) is actually equals true**. This statement will cause span to be deleted even if it does not contain `sorttable_sortfwdind` or `sorttable_sortrevind`.
```javascript
rowlists = this.parentNode.getElementsByTagName("span");
for (var j=0; j < rowlists.length; j++) {
              if (rowlists[j].className.search(/\bsorttable_sortfwdind\b/)
                  || rowlists[j].className.search(/\sorttable_sortrevind\b/) ) {
                  rowlists[j].parentNode.removeChild(rowlists[j]);
              }
          }
```

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
The manual test result of the ui page is as below:

![fix sorted](https://user-images.githubusercontent.com/52202080/97543194-daeaa680-1a02-11eb-8b11-8109c3e4e9a3.gif)

Closes #30182 from akiyamaneko/ui_storage_sort_error.

Authored-by: neko <echohlne@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-11-03 08:49:52 -06:00
zero323 4c8ee8856c [SPARK-33257][PYTHON][SQL] Support Column inputs in PySpark ordering functions (asc*, desc*)
### What changes were proposed in this pull request?

This PR adds support for passing `Column`s as input to PySpark sorting functions.

### Why are the changes needed?

According to SPARK-26979, PySpark functions should support both Column and str arguments, when possible.

### Does this PR introduce _any_ user-facing change?

PySpark users can now provide both `Column` and `str` as an argument for `asc*` and `desc*` functions.

### How was this patch tested?

New unit tests.

Closes #30227 from zero323/SPARK-33257.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-03 22:50:59 +09:00
Dongjoon Hyun 27d8136934 [SPARK-33324][K8S][BUILD] Upgrade kubernetes-client to 4.11.1
### What changes were proposed in this pull request?

This PR aims to upgrade `Kubernetes-client` from 4.10.3 to 4.11.1.

### Why are the changes needed?

This upgrades the dependency for Apache Spark 3.1.0.
Since 4.12.0 is still new and has a breaking API changes, this PR chooses the latest compatible one.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the all CIs including K8s IT.

Closes #30233 from dongjoon-hyun/SPARK-33324.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-11-02 22:23:26 -08:00
HyukjinKwon 3959f0d987 [SPARK-33250][PYTHON][DOCS] Migration to NumPy documentation style in SQL (pyspark.sql.*)
### What changes were proposed in this pull request?

This PR proposes to migrate to [NumPy documentation style](https://numpydoc.readthedocs.io/en/latest/format.html), see also SPARK-33243.
While I am migrating, I also fixed some Python type hints accordingly.

### Why are the changes needed?

For better documentation as text itself, and generated HTMLs

### Does this PR introduce _any_ user-facing change?

Yes, they will see a better format of HTMLs, and better text format. See SPARK-33243.

### How was this patch tested?

Manually tested via running `./dev/lint-python`.

Closes #30181 from HyukjinKwon/SPARK-33250.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-03 10:00:49 +09:00
Max Gekk bdabf60fb4 [SPARK-33299][SQL][DOCS] Don't mention schemas in JSON format in docs for from_json
### What changes were proposed in this pull request?
Remove the JSON formatted schema from comments for `from_json()` in Scala/Python APIs.

Closes #30201

### Why are the changes needed?
Schemas in JSON format is internal (not documented). It shouldn't be recommenced for usage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By linters.

Closes #30226 from MaxGekk/from_json-common-schema-parsing-2.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-11-02 10:10:24 -08:00
Max Gekk eecebd0302 [SPARK-33306][SQL][FOLLOWUP] Group DateType and TimestampType together in needsTimeZone()
### What changes were proposed in this pull request?
In the PR, I propose to group `DateType` and `TimestampType` together in checking time zone needs in the `Cast.needsTimeZone()` method.

### Why are the changes needed?
To improve code maintainability.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By the existing test `"SPARK-33306: Timezone is needed when cast Date to String"`.

Closes #30223 from MaxGekk/WangGuangxin-SPARK-33306-followup.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-11-02 10:07:18 -08:00
Yuming Wang 789d19cab5 [SPARK-33319][SQL][TEST] Add all built-in SerDes to HiveSerDeReadWriteSuite
### What changes were proposed in this pull request?

This pr add all built-in SerDes to `HiveSerDeReadWriteSuite`.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RowFormats&SerDe

### Why are the changes needed?

We will upgrade Parquet, ORC and Avro, need to ensure compatibility.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes #30228 from wangyum/SPARK-33319.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-11-02 08:34:50 -08:00
Cheng Su e52b858ef7 [SPARK-33027][SQL] Add DisableUnnecessaryBucketedScan rule to AQE
### What changes were proposed in this pull request?

As a followup comment from https://github.com/apache/spark/pull/29804#issuecomment-700650620 , here we add add the physical plan rule DisableUnnecessaryBucketedScan into AQE AdaptiveSparkPlanExec.queryStagePreparationRules, to make auto bucketed scan work with AQE.

The change is mostly in:
* `AdaptiveSparkPlanExec.scala`: add physical plan rule `DisableUnnecessaryBucketedScan`
* `DisableUnnecessaryBucketedScan.scala`: propagate logical plan link for the file source scan exec operator, otherwise we lose the logical plan link information when AQE is enabled, and will get exception [here](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala#L176). (for example, for query `SELECT * FROM bucketed_table` with AQE is enabled)
* `DisableUnnecessaryBucketedScanSuite.scala`: add new test suite for AQE enabled - `DisableUnnecessaryBucketedScanWithoutHiveSupportSuiteAE`, and changed some of tests to use `AdaptiveSparkPlanHelper.find/collect`, to make the plan verification work when AQE enabled.

### Why are the changes needed?

It's reasonable to add the support to allow disabling unnecessary bucketed scan with AQE is enabled, this helps optimize the query when AQE is enabled.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit test in `DisableUnnecessaryBucketedScanSuite`.

Closes #30200 from c21/auto-bucket-aqe.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-02 06:44:07 +00:00
Prashant Sharma 6226ccc092 [SPARK-33095] Follow up, support alter table column rename
### What changes were proposed in this pull request?

Support rename column for mysql dialect.

### Why are the changes needed?

At the moment, it does not work for mysql version 5.x. So, we should throw proper exception for that case.

### Does this PR introduce _any_ user-facing change?

Yes, `column rename` with mysql dialect should work correctly.

### How was this patch tested?

Added tests for rename column.
Ran the tests to pass with both versions of mysql.

* `export MYSQL_DOCKER_IMAGE_NAME=mysql:5.7.31`

* `export MYSQL_DOCKER_IMAGE_NAME=mysql:8.0`

Closes #30142 from ScrapCodes/mysql-dialect-rename.

Authored-by: Prashant Sharma <prashsh1@in.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-02 05:03:41 +00:00
zero323 d71b2febaf [SPARK-30663][SPARK-33313][TESTS][R] Drop testthat 1.x support and add testthat 3.x support
### What changes were proposed in this pull request?

This PR modifies `R/pkg/tests/run-all.R` by:

- Removing `testthat` 1.x support, as Jenkins has been upgraded to 2.x with SPARK-30637 and this code is no longer relevant.
- Add `testthat` 3.x support to avoid AppVeyor failures.

### Why are the changes needed?

Currently used internal API has been removed in the latest `testthat` release.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tests executed against `testthat == 2.3.2` and `testthat == 3.0.0`

Closes #30219 from zero323/SPARK-33313.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-02 08:54:08 +09:00
Gengliang Wang 2b6dfa5f7b [SPARK-20044][UI] Support Spark UI behind front-end reverse proxy using a path prefix Revert proxy url
### What changes were proposed in this pull request?

Allow to run the Spark web UI behind a reverse proxy with URLs prefixed by a context root, like www.mydomain.com/spark. In particular, this allows to access multiple Spark clusters through the same virtual host, only distinguishing them by context root, like www.mydomain.com/cluster1, www.mydomain.com/cluster2, and it allows to run the Spark UI in a common cookie domain (for SSO) with other services.

### Why are the changes needed?

This PR is to take over https://github.com/apache/spark/pull/17455.
After changes, Spark allows showing customized prefix URL in all the `href` links of the HTML pages.

### Does this PR introduce _any_ user-facing change?

Yes, all the links of UI pages will be contains the value of `spark.ui.reverseProxyUrl` if it is configurated.
### How was this patch tested?

New HTML Unit tests in MasterSuite
Manual UI testing for master, worker and app UI with an nginx proxy
Spark config:
```
spark.ui.port 8080
spark.ui.reverseProxy=true
spark.ui.reverseProxyUrl=/path/to/spark/
```
nginx config:
```
server {
    listen 9000;
    set $SPARK_MASTER http://127.0.0.1:8080;
    # split spark UI path into prefix and local path within master UI
    location ~ ^(/path/to/spark/) {
        # strip prefix when forwarding request
        rewrite /path/to/spark(/.*) $1  break;
        #rewrite /path/to/spark/ "/" ;
        # forward to spark master UI
        proxy_pass $SPARK_MASTER;
        proxy_intercept_errors on;
        error_page 301 302 307 = handle_redirects;
    }
    location handle_redirects {
        set $saved_redirect_location '$upstream_http_location';
        proxy_pass $saved_redirect_location;
    }
}
```

Closes #29820 from gengliangwang/revertProxyURL.

Lead-authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Co-authored-by: Oliver Köth <okoeth@de.ibm.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-11-01 23:57:57 +08:00
Takuya UESHIN b8a440f098 [SPARK-33277][PYSPARK][SQL] Use ContextAwareIterator to stop consuming after the task ends
### What changes were proposed in this pull request?

As the Python evaluation consumes the parent iterator in a separate thread, it could consume more data from the parent even after the task ends and the parent is closed. Thus, we should use `ContextAwareIterator` to stop consuming after the task ends.

### Why are the changes needed?

Python/Pandas UDF right after off-heap vectorized reader could cause executor crash.

E.g.,:

```py
spark.range(0, 100000, 1, 1).write.parquet(path)

spark.conf.set("spark.sql.columnVector.offheap.enabled", True)

def f(x):
    return 0

fUdf = udf(f, LongType())

spark.read.parquet(path).select(fUdf('id')).head()
```

This is because, the Python evaluation consumes the parent iterator in a separate thread and it consumes more data from the parent even after the task ends and the parent is closed. If an off-heap column vector exists in the parent iterator, it could cause segmentation fault which crashes the executor.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added tests, and manually.

Closes #30177 from ueshin/issues/SPARK-33277/python_pandas_udf.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-01 20:28:12 +09:00
Daniel Himmelstein 56587f076d [SPARK-33310][PYTHON] Relax pyspark typing for sql str functions
### What changes were proposed in this pull request?

Relax pyspark typing for sql str functions. These functions all pass the first argument through `_to_java_column`, such that a string or Column object is acceptable.

### Why are the changes needed?

Convenience & ensuring the typing reflects the functionality

### Does this PR introduce _any_ user-facing change?

Yes, a backwards-compatible increase in functionality. But I think typing support is unreleased, so possibly no change to released versions.

### How was this patch tested?

Not tested. I am newish to Python typing with stubs, so someone should confirm this is the correct way to fix this.

Closes #30209 from dhimmel/patch-1.

Authored-by: Daniel Himmelstein <daniel.himmelstein@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-01 19:09:12 +09:00
wangguangxin.cn 69c27f49ac [SPARK-33306][SQL] Timezone is needed when cast date to string
### What changes were proposed in this pull request?
When `spark.sql.legacy.typeCoercion.datetimeToString.enabled` is enabled, spark will cast date to string when compare date with string. In Spark3, timezone is needed when casting date to string as 72ad9dcd5d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala (L309).

Howerver, the timezone may not be set because `CastBase.needsTimeZone` returns false for this kind of casting.

A simple way to reproduce this is
```
spark-shell --conf spark.sql.legacy.typeCoercion.datetimeToString.enabled=true

```
when we execute the following sql,
```
select a.d1 from
(select to_date(concat('2000-01-0', id)) as d1 from range(1, 2)) a
join
(select concat('2000-01-0', id) as d2 from range(1, 2)) b
on a.d1 = b.d2
```
it will throw
```
java.util.NoSuchElementException: None.get
  at scala.None$.get(Option.scala:529)
  at scala.None$.get(Option.scala:527)
  at org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId(datetimeExpressions.scala:56)
  at org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId$(datetimeExpressions.scala:56)
  at org.apache.spark.sql.catalyst.expressions.CastBase.zoneId$lzycompute(Cast.scala:253)
  at org.apache.spark.sql.catalyst.expressions.CastBase.zoneId(Cast.scala:253)
  at org.apache.spark.sql.catalyst.expressions.CastBase.dateFormatter$lzycompute(Cast.scala:287)
  at org.apache.spark.sql.catalyst.expressions.CastBase.dateFormatter(Cast.scala:287)
```

### Why are the changes needed?
As described above, it's a bug here.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Add more UT

Closes #30213 from WangGuangxin/SPARK-33306.

Authored-by: wangguangxin.cn <wangguangxin.cn@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-31 15:14:46 -07:00
Chao Sun c51e5fc14b [SPARK-33293][SQL] Refactor WriteToDataSourceV2Exec and reduce code duplication
### What changes were proposed in this pull request?

Refactor `WriteToDataSourceV2Exec` via removing code duplication around write to table logic:
- renamed `AtomicTableWriteExec` to `TableWriteExec` so that the table write logic in this trait can be modified and shared with `CreateTableAsSelectExec`, `ReplaceTableAsSelectExec`, `AtomicCreateTableAsSelectExec ` and `AtomicReplaceTableAsSelectExec`.
- similar to the above, renamed `writeToStagedTable` to `writeToTable` in `TableWriteExec`.
- extended `writeToTable` so that it can handle both staged table as well as non-staged table.

### Why are the changes needed?

Simplify the logic and remove duplication, to make this piece of code easier to maintain.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass CIs with the existing test coverage.

Closes #30193 from sunchao/SPARK-33293.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-31 10:01:31 -07:00
Chao Sun 32b78d3795 [SPARK-33290][SQL] REFRESH TABLE should invalidate cache even though the table itself may not be cached
### What changes were proposed in this pull request?

In `CatalogImpl.refreshTable`, this moves the `uncacheQuery` call out of the condition `if (cache.nonEmpty)` so that it will be called whether the table itself is cached or not.

### Why are the changes needed?

In the case like the following:
```sql
CREATE TABLE t ...;
CREATE VIEW t1 AS SELECT * FROM t;
REFRESH TABLE t;
```

If the table `t` is refreshed, the view `t1` which is depending on `t` will not be invalidated. This could lead to incorrect result and is similar to [SPARK-19765](https://issues.apache.org/jira/browse/SPARK-19765).

On the other hand, if we have:

```sql
CREATE TABLE t ...;
CACHE TABLE t;
CREATE VIEW t1 AS SELECT * FROM t;
REFRESH TABLE t;
```

Then the view `t1` will be refreshed. The behavior is somewhat inconsistent.

### Does this PR introduce _any_ user-facing change?

Yes, with the change any cache that are depending on the table refreshed will be invalidated with the change. Previously this only happens if the table itself is cached.

### How was this patch tested?

Added a new UT for the case.

Closes #30187 from sunchao/SPARK-33290.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-31 09:49:18 -07:00
Thomas Graves 72ad9dcd5d [SPARK-32037][CORE] Rename blacklisting feature
### What changes were proposed in this pull request?

this PR renames the blacklisting feature. I ended up using  "excludeOnFailure" or "excluded" in most cases but there is a mix. I renamed the BlacklistTracker to HealthTracker, but for the TaskSetBlacklist HealthTracker didn't make sense to me since its not the health of the taskset itself but rather tracking the things its excluded on so I renamed it to be TaskSetExcludeList.  Everything else I tried to use the context and in most cases excluded made sense. It made more sense to me then blocked since you are basically excluding those executors and nodes from scheduling tasks on them. Then can be unexcluded later after timeouts and such. The configs I changed the name to use excludeOnFailure which I thought explained it.

I unfortunately couldn't get rid of some of them because its part of the event listener and history files.  To keep backwards compatibility I kept the events and some of the parsing so that the history server would still properly read older history files.  It is not forward compatible though - meaning a new application write the "Excluded" events so the older history server won't properly read display them as being blacklisted.

A few of the files below are showing up as deleted and recreated even though I did a git mv on them. I'm not sure why.

### Why are the changes needed?

get rid of problematic language

### Does this PR introduce _any_ user-facing change?

Config name changes but the old configs still work but are deprecated.

### How was this patch tested?

updated tests and also manually tested the UI changes and manually tested the history server reading older versions of history files and vice versa.

Closes #29906 from tgravescs/SPARK-32037.

Lead-authored-by: Thomas Graves <tgraves@nvidia.com>
Co-authored-by: Thomas Graves <tgraves@apache.org>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2020-10-30 17:16:53 -05:00
Holden Karau 491a0fb08b [SPARK-33262][K8S][FOLLOWUP] Verify pod allocation does not stall
### What changes were proposed in this pull request?

Add a test that pending executor does not stall pod allocation.

### Why are the changes needed?

Better test coverage

### Does this PR introduce _any_ user-facing change?

Test only change.

### How was this patch tested?

New test passes.

Closes #30205 from holdenk/verify-pod-allocation-does-not-stall.

Authored-by: Holden Karau <hkarau@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-30 11:26:30 -07:00
Dmitry Sabanin 7c897c1216 [MINOR][CORE][DOCS] Fix typo in "spark.storage.decommission.shuffleBlocks.enabled" description
### What changes were proposed in this pull request?
Small typo fix in the description of `spark.storage.decommission.shuffleBlocks.enabled` property.

Closes #30208 from dsabanin/patch-1.

Authored-by: Dmitry Sabanin <sdmitry@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-30 11:14:42 -07:00
HyukjinKwon 3af1651e50 [SPARK-33297][BUILD] Switch to use flat class loader strategy in SBT
### What changes were proposed in this pull request?

This PR proposes to switch the class loader strategy from `ScalaLibrary` to `Flat` (see https://www.scala-sbt.org/1.x/docs/In-Process-Classloaders.html):

https://github.com/apache/spark/runs/1314691686

```
Error:  java.util.MissingResourceException: Can't find bundle for base name org.scalactic.ScalacticBundle, locale en
Error:  	at java.util.ResourceBundle.throwMissingResourceException(ResourceBundle.java:1581)
Error:  	at java.util.ResourceBundle.getBundleImpl(ResourceBundle.java:1396)
Error:  	at java.util.ResourceBundle.getBundle(ResourceBundle.java:782)
Error:  	at org.scalactic.Resources$.resourceBundle$lzycompute(Resources.scala:8)
Error:  	at org.scalactic.Resources$.resourceBundle(Resources.scala:8)
Error:  	at org.scalactic.Resources$.pleaseDefineScalacticFillFilePathnameEnvVar(Resources.scala:256)
Error:  	at org.scalactic.source.PositionMacro$PositionMacroImpl.apply(PositionMacro.scala:65)
Error:  	at org.scalactic.source.PositionMacro$.genPosition(PositionMacro.scala:85)
Error:  	at sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source)
Error:  	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
Error:  	at java.lang.reflect.Method.invoke(Method.java:498)
```

See also https://github.com/sbt/sbt/issues/5736

### Why are the changes needed?

To make the build unflaky.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

GitHub Actions build in this test.

Closes #30198 from HyukjinKwon/SPARK-33297.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-30 17:53:30 +09:00
ulysses d59f6a7095 [SPARK-33294][SQL] Add query resolved check before analyze InsertIntoDir
### What changes were proposed in this pull request?

Add `query.resolved` before analyze `InsertIntoDir`.

### Why are the changes needed?

For better error msg.
```
INSERT OVERWRITE DIRECTORY '/tmp/file' USING PARQUET
SELECT * FROM (
 SELECT c3 FROM (
  SELECT c1, c2 from values(1,2) t(c1, c2)
  )
)
```
 Before this PR, we get such error msg
```
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to toAttribute on unresolved object, tree: *
  at org.apache.spark.sql.catalyst.analysis.Star.toAttribute(unresolved.scala:244)
  at org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicLogicalOperators.scala:52)
  at org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicLogicalOperators.scala:52)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.immutable.List.foreach(List.scala:392)
```

### Does this PR introduce _any_ user-facing change?

Yes, error msg changed.

### How was this patch tested?

New test.

Closes #30197 from ulysses-you/SPARK-33294.

Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-30 08:18:10 +00:00
angerszhu 0c943cd2fb [SPARK-33248][SQL] Add a configuration to control the legacy behavior of whether need to pad null value when value size less then schema size
### What changes were proposed in this pull request?
Add a configuration to control the legacy behavior of whether need to pad null value when value size less then schema size.
Since we can't decide whether it's a but and some use need it behavior same as Hive.

### Why are the changes needed?
Provides a compatible choice between historical behavior and Hive

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existed UT

Closes #30156 from AngersZhuuuu/SPARK-33284.

Lead-authored-by: angerszhu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-30 14:11:25 +09:00
Max Gekk 343e0bb3ad [SPARK-33286][SQL] Improve the error message about schema parsing by from_json/from_csv
# What changes were proposed in this pull request?
In the PR, I propose to improve the error message from `from_json`/`from_csv` by combining errors from all schema parsers:
- DataType.fromJson (except CSV)
- CatalystSqlParser.parseDataType
- CatalystSqlParser.parseTableSchema

Before the changes, `from_json` does not show error messages from the first parser in the chain that could mislead users.

### Why are the changes needed?
Currently, `from_json` outputs the error message from the fallback schema parser which can confuse end-users. For example:

```scala
    val invalidJsonSchema = """{"fields": [{"a":123}], "type": "struct"}"""
    df.select(from_json($"json", invalidJsonSchema, Map.empty[String, String])).show()
```
The JSON schema has an issue in `{"a":123}` but the error message doesn't point it out:
```
mismatched input '{' expecting {'ADD', 'AFTER', ...}(line 1, pos 0)

== SQL ==
{"fields": [{"a":123}], "type": "struct"}
^^^

org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input '{' expecting {'ADD', 'AFTER',  ... }(line 1, pos 0)

== SQL ==
{"fields": [{"a":123}], "type": "struct"}
^^^
```

### Does this PR introduce _any_ user-facing change?
Yes, after the changes for the example above:
```
Cannot parse the schema in JSON format: Failed to convert the JSON string '{"a":123}' to a field.
Failed fallback parsing: Cannot parse the data type:
mismatched input '{' expecting {'ADD', 'AFTER', ...}(line 1, pos 0)

== SQL ==
{"fields": [{"a":123}], "type": "struct"}
^^^

Failed fallback parsing:
mismatched input '{' expecting {'ADD', 'AFTER', ...}(line 1, pos 0)

== SQL ==
{"fields": [{"a":123}], "type": "struct"}
^^^
```

### How was this patch tested?
- By existing tests suites like `JsonFunctionsSuite` and `JsonExpressionsSuite`.
- Add new test to `JsonFunctionsSuite`.
- Re-gen results for `json-functions.sql`.

Closes #30183 from MaxGekk/fromDDL-error-msg.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-30 11:18:47 +09:00
Dongjoon Hyun 838791bf0b [SPARK-33292][SQL] Make Literal ArrayBasedMapData string representation disambiguous
### What changes were proposed in this pull request?

This PR aims to wrap `ArrayBasedMapData` literal representation with `map(...)`.

### Why are the changes needed?

Literal ArrayBasedMapData has inconsistent string representation from `LogicalPlan` to `Optimized Logical Plan/Physical Plan`. Also, the representation at `Optimized Logical Plan` and `Physical Plan` is ambiguous like `[1 AS a#0, keys: [key1], values: [value1] AS b#1]`.

**BEFORE**
```scala
scala> spark.version
res0: String = 2.4.7

scala> sql("SELECT 1 a, map('key1', 'value1') b").explain(true)
== Parsed Logical Plan ==
'Project [1 AS a#0, 'map(key1, value1) AS b#1]
+- OneRowRelation

== Analyzed Logical Plan ==
a: int, b: map<string,string>
Project [1 AS a#0, map(key1, value1) AS b#1]
+- OneRowRelation

== Optimized Logical Plan ==
Project [1 AS a#0, keys: [key1], values: [value1] AS b#1]
+- OneRowRelation

== Physical Plan ==
*(1) Project [1 AS a#0, keys: [key1], values: [value1] AS b#1]
+- Scan OneRowRelation[]
```

**AFTER**
```scala
scala> spark.version
res0: String = 3.1.0-SNAPSHOT

scala> sql("SELECT 1 a, map('key1', 'value1') b").explain(true)
== Parsed Logical Plan ==
'Project [1 AS a#4, 'map(key1, value1) AS b#5]
+- OneRowRelation

== Analyzed Logical Plan ==
a: int, b: map<string,string>
Project [1 AS a#4, map(key1, value1) AS b#5]
+- OneRowRelation

== Optimized Logical Plan ==
Project [1 AS a#4, map(keys: [key1], values: [value1]) AS b#5]
+- OneRowRelation

== Physical Plan ==
*(1) Project [1 AS a#4, map(keys: [key1], values: [value1]) AS b#5]
+- *(1) Scan OneRowRelation[]
```

### Does this PR introduce _any_ user-facing change?

Yes. This changes the query plan's string representation in `explain` command and UI. However, this is a bug fix.

### How was this patch tested?

Pass the CI with the newly added test case.

Closes #30190 from dongjoon-hyun/SPARK-33292.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-29 19:10:01 -07:00
luluorta cbd3fdea62 [SPARK-33008][SQL] Division by zero on divide-like operations returns incorrect result
### What changes were proposed in this pull request?
In ANSI mode, when a division by zero occurs performing a divide-like operation (Divide, IntegralDivide, Remainder or Pmod), we are returning an incorrect value. Instead, we should throw an exception, as stated in the SQL standard.

### Why are the changes needed?
Result corrupt.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
added UT + existing UTs (improved)

Closes #29882 from luluorta/SPARK-33008.

Authored-by: luluorta <luluorta@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-29 16:44:17 +00:00
yangjie01 fa6311731b [SPARK-33283][CORE] Remove useless externalBlockStoreSize from RDDInfo
### What changes were proposed in this pull request?
"external block store" API was removed after SPARK-12667,  `externalBlockStoreSize` in `RDDInfo` looks like always 0 and useless. So this pr just to remove this useless variable.

### Why are the changes needed?
remove useless variable.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #30179 from LuciferYang/SPARK-12667-FOLLOWUP.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-29 08:00:23 -07:00