Commit graph

31126 commits

yangjie01 9cefde8db3 [SPARK-36580][CORE][K8S] Use intersect and diff API on Set instead of manual implementation
### What changes were proposed in this pull request?
The main change of this PR is to replace `filter` + `contains` with the `intersect` API and `filterNot` + `contains` with `diff`.

**Before**

```scala
val set = Set(1, 2)
val others = Set(2, 3)
set.filter(others.contains(_))
set.filterNot(others.contains)
```

**After**
```scala
val set = Set(1, 2)
val others = Set(2, 3)
set.intersect(others)
set.diff(others)
```

### Why are the changes needed?
Code simplification: replace the manual implementations with the existing APIs.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Actions checks.

Closes #33829 from LuciferYang/SPARK-36580.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-29 09:24:37 -07:00
Andrew Olson 7d1be3710d [SPARK-36576][SS] Improve range split calculation for Kafka Source minPartitions option
### What changes were proposed in this pull request?

Proposing that the `KafkaOffsetRangeCalculator`'s range calculation logic be modified to exclude small (i.e. un-split) partitions from the overall proportional distribution math. This divides the large partitions more reasonably when they are accompanied by many small partitions, and provides optimal behavior for cases where a `minPartitions` value is deliberately computed based on the volume of data being read.

### Why are the changes needed?

While the [documentation](https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html) does contain a clear disclaimer,

> Please note that this configuration is like a hint: the number of Spark tasks will be **approximately** `minPartitions`. It can be less or more depending on rounding errors or Kafka partitions that didn't receive any new data.

there are cases where the calculated Kafka partition range splits can differ greatly from expectations. For evenly distributed data and most `minPartitions` values this would not be a major or commonly encountered concern. However, when the distribution of data across partitions is heavily skewed, somewhat surprising range split calculations can result.

For example, given the following input data:

- 1 partition containing 10,000 messages
- 1,000 partitions each containing 1 message

Spark processing code loading from this collection of 1,001 partitions may decide that it would like each task to read no more than 1,000 messages. Consequently, it could specify a `minPartitions` value of 1,010 — expecting the single large partition to be split into 10 equal chunks, along with the 1,000 small partitions each having their own task. That is far from what actually occurs. The `KafkaOffsetRangeCalculator` algorithm ends up splitting the large partition into 918 chunks of 10 or 11 messages, two orders of magnitude from the desired maximum message count per task and nearly double the number of Spark tasks hinted in the configuration.
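
To make the arithmetic concrete, here is a rough sketch of the two calculations for the example above (not the actual `KafkaOffsetRangeCalculator` code; rounding details are simplified):

```scala
// Sketch only: compares the current proportional split with the proposed one that
// first sets aside partitions too small to be split.
val sizes = Seq(10000L) ++ Seq.fill(1000)(1L)   // 1 large + 1,000 tiny partitions
val minPartitions = 1010
val total = sizes.sum                           // 11,000 messages

// Current behavior: every partition takes part in the proportional math, so the large
// partition gets roughly 10000/11000 of the 1,010 desired splits.
val splitsBefore = (10000L * minPartitions / total).toInt   // 918 chunks of 10-11 messages

// Proposed behavior: the 1,000 un-split partitions are set aside first, leaving
// 10 splits for the large partition, i.e. chunks of 1,000 messages each.
val unsplit = sizes.count(_ * minPartitions / total < 1)    // 1,000 partitions kept whole
val splitsAfter = minPartitions - unsplit                   // 10
```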

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing unit tests and added new unit tests

Closes #33827 from noslowerdna/SPARK-36576.

Authored-by: Andrew Olson <aolson1@cerner.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-08-29 16:38:29 +09:00
Hyukjin Kwon 22c492a6b8 [MINOR][DOCS] Add Apache license header to GitHub Actions workflow files
### What changes were proposed in this pull request?

Some of the GitHub Actions workflow files do not have the Apache license header. This PR adds it to them.

### Why are the changes needed?

To comply with the Apache license.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes #33862 from HyukjinKwon/minor-lisence.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-28 20:30:16 -07:00
Gengliang Wang 8a52ad9f82 [SPARK-36606][DOCS][TESTS] Enhance the docs and tests of try_add/try_divide
### What changes were proposed in this pull request?

The `try_add` function allows the following inputs:
- number, number
- date, number
- date, interval
- timestamp, interval
- interval, interval

And, the `try_divide` function allows the following inputs:

- number, number
- interval, number

However, in the current code there are only examples and tests for the (number, number) inputs. We should enhance the docs to let users know that the functions can be used for datetime and interval operations too.
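
For illustration, a few hedged spark-shell examples of such inputs (not taken from the added docs; output comments assume default session settings):

```scala
// Assumes a spark-shell session (`spark` is a SparkSession).
spark.sql("SELECT try_add(date'2021-08-29', interval 1 day)").show()          // 2021-08-30
spark.sql("SELECT try_add(timestamp'2021-08-29 10:00:00', interval 2 hours)").show()
spark.sql("SELECT try_divide(interval 6 hours, 3)").show()                    // 2 hours
spark.sql("SELECT try_add(2147483647, 1)").show()                             // NULL on integer overflow
```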

### Why are the changes needed?

Improve documentation and tests.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

New UT
Also build docs for preview:
![image](https://user-images.githubusercontent.com/1097932/131212897-8aea14c8-a882-4e12-94e2-f56bde7c0367.png)

Closes #33861 from gengliangwang/enhanceTryDoc.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-29 10:30:04 +09:00
yangjie01 8ffb59d4e9 [SPARK-36583][BUILD] Upgrade Apache commons-pool2 from 2.6.2 to 2.11.1
### What changes were proposed in this pull request?
This PR upgrades Apache `commons-pool2` from `2.6.2` to `2.11.1`; `2.11.1` is built for Java 8, while `2.6.2` is still a Java 7 build.

### Why are the changes needed?
Brings some bug fixes such as `DefaultPooledObject.getIdleTime() drops nanoseconds on Java 9 and greater`.
Other changes: [RELEASE-NOTES](https://gitbox.apache.org/repos/asf?p=commons-pool.git;a=blob;f=RELEASE-NOTES.txt)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Actions checks.

Closes #33831 from LuciferYang/commons-pool2.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-29 10:21:57 +09:00
Kousuke Saruta 4356d6603a [SPARK-36605][BUILD] Upgrade Jackson to 2.12.5
### What changes were proposed in this pull request?

This PR upgrades Jackson from `2.12.3` to `2.12.5`.

### Why are the changes needed?

Recently, Jackson `2.12.5` was released and it seems to be expected as the last full patch release for 2.12.x.
This release includes a fix for a regression in jackson-databind introduced in `2.12.3` which Spark 3.2 currently depends on.
https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.12.5

### Does this PR introduce _any_ user-facing change?

Dependency maintenance.

### How was this patch tested?

CIs.

Closes #33860 from sarutak/upgrade-jackson-2.12.5.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-08-28 15:57:24 -07:00
Kousuke Saruta ea8c31e5ea [SPARK-36509][CORE] Fix the issue that executors are never re-scheduled if the worker stops with standalone cluster
### What changes were proposed in this pull request?

This PR fixes an issue where executors are never re-scheduled if the worker they run on stops.
As a result, the application gets stuck.
You can easily reproduce this issue with the following procedure.

```
# Run master
$ sbin/start-master.sh

# Run worker 1
$ SPARK_LOG_DIR=/tmp/worker1 SPARK_PID_DIR=/tmp/worker1/ sbin/start-worker.sh -c 1 -h localhost -d /tmp/worker1 --webui-port 8081 spark://<hostname>:7077

# Run worker 2
$ SPARK_LOG_DIR=/tmp/worker2 SPARK_PID_DIR=/tmp/worker2/ sbin/start-worker.sh -c 1 -h localhost -d /tmp/worker2 --webui-port 8082 spark://<hostname>:7077

# Run Spark Shell
$ bin/spark-shell --master spark://<hostname>:7077 --executor-cores 1 --total-executor-cores 1

# Check which worker the executor runs on and then kill the worker.
$ kill <worker pid>
```

With the procedure above, we would expect the executor to be re-scheduled on the other worker, but it isn't.

The reason seems to be that `Master.schedule` cannot be called after the worker is marked as `WorkerState.DEAD`.
So, the solution this PR proposes is to call `Master.schedule` whenever `Master.removeWorker` is called.

This PR also fixes an issue where `ExecutorRunner` can send an `ExecutorStateChanged` message without changing its state.
This issue causes an assertion error:
```
2021-08-13 14:05:37,991 [dispatcher-event-loop-9] ERROR: Ignoring error java.lang.AssertionError: assertion failed: executor 0 state transfer from RUNNING to RUNNING is illegal
```

### Why are the changes needed?

It's a critical bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually tested with the procedure shown above and confirmed the executor is re-scheduled.

Closes #33818 from sarutak/fix-scheduling-stuck.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-08-28 18:01:55 +09:00
senthilkumarb fe7bf5f96f [SPARK-36327][SQL] Spark sql creates staging dir inside database directory rather than creating inside table directory
### What changes were proposed in this pull request?

This PR makes a minor change in the file SaveAsHiveFile.scala.

It drops `getParent` from the following part of the code:

```scala
if (extURI.getScheme == "viewfs") {
  getExtTmpPathRelTo(path.getParent, hadoopConf, stagingDir)
```
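
After the change, the call passes the table location itself rather than its parent (a sketch of the resulting code):

```scala
if (extURI.getScheme == "viewfs") {
  getExtTmpPathRelTo(path, hadoopConf, stagingDir)
```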

### Why are the changes needed?

Hive creates `.staging` directories inside the `/db/table/` location, but Spark SQL creates `.staging` directories inside the `/db/` location when we use Hadoop federation (viewFs). It works as expected (creating `.staging` inside the `/db/table/` location) for other filesystems like HDFS.

In HIVE:
```
 beeline
> use dicedb;
> insert into table part_test partition (j=1) values (1);
...
INFO  : Loading data to table dicedb.part_test partition (j=1) from **viewfs://cloudera/user/daisuke/dicedb/part_test/j=1/.hive-staging_hive_2021-07-19_13-04-44_989_6775328876605030677-1/-ext-10000**
```

but spark's behaviour,

```
spark-sql> use dicedb;
spark-sql> insert into table part_test partition (j=2) values (2);
21/07/19 13:07:37 INFO FileUtils: Creating directory if it doesn't exist: **viewfs://cloudera/user/daisuke/dicedb/.hive-staging_hive_2021-07-19_13-07-37_317_5083528872437596950-1**
...
```

The reason we require this change is that if we allow Spark SQL to create the `.staging` directory inside the `/db/` location, we end up with security issues: we would need to grant permission on the "viewfs:///db/" location to all users who submit Spark jobs.

After this change is applied, Spark SQL creates `.staging` inside `/db/table/`, similar to Hive, as shown below:

```
spark-sql> use dicedb;
21/07/28 00:22:47 INFO SparkSQLCLIDriver: Time taken: 0.929 seconds
spark-sql> insert into table part_test partition (j=8) values (8);
21/07/28 00:23:25 INFO HiveMetaStoreClient: Closed a connection to metastore, current connections: 1
21/07/28 00:23:26 INFO FileUtils: Creating directory if it doesn't exist: **viewfs://cloudera/user/daisuke/dicedb/part_test/.hive-staging_hive_2021-07-28_00-23-26_109_4548714524589026450-1**
```

The reason we don't see this issue in Hive but only in Spark SQL:

In Hive, a "/db/table/tmp" directory structure is passed as the path, and hence path.getParent returns "/db/table/". But in Spark we just pass "/db/table", so it is not necessary to use "path.getParent" for Hadoop federation (viewfs).

### Does this PR introduce _any_ user-facing change?
 No

### How was this patch tested?

Tested manually by creating hive-sql.jar

Closes #33577 from senthh/viewfs-792392.

Authored-by: senthilkumarb <senthilkumarb@cloudera.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-27 12:58:28 -07:00
Gengliang Wang e650d06ba9 [SPARK-36597][DOCS] Fix issues in SQL function docs
### What changes were proposed in this pull request?

* the functions make_dt_interval and make_ym_interval should make it clear that some of the fields are optional
* remove the `|` symbol from the doc of `bit_get` https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc1-docs/_site/api/sql/index.html#bit_get
* Address one missing comment in https://github.com/apache/spark/pull/33824#discussion_r695405699

### Why are the changes needed?

Improve the documentation.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Build doc and preview:
![image](https://user-images.githubusercontent.com/1097932/130996918-8c1fff88-ef5a-434b-8445-df7140bad3ba.png)
![image](https://user-images.githubusercontent.com/1097932/130996954-0ced28e7-fb90-4fcc-857e-6ccc31dc3c09.png)

![image](https://user-images.githubusercontent.com/1097932/130955106-5ae32dfc-6e89-4e28-bb8a-6c1b5213051c.png)

![image](https://user-images.githubusercontent.com/1097932/130922351-2f0f262d-5624-4d08-ba83-dfa3ed0b646b.png)

Closes #33847 from gengliangwang/auditSQLDoc.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-27 13:29:34 +08:00
Huaxin Gao 15e42b4442 [SPARK-36578][ML] UnivariateFeatureSelector API doc improvement
### What changes were proposed in this pull request?
Change API doc for `UnivariateFeatureSelector`

### Why are the changes needed?
make the doc look better

### Does this PR introduce _any_ user-facing change?
yes, API doc change

### How was this patch tested?
Manually checked

Closes #33855 from huaxingao/ml_doc.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-08-26 21:16:49 -07:00
Jungtaek Lim bc32144a91 [SPARK-36595][SQL][SS][DOCS] Document window & session_window function in SQL API doc
### What changes were proposed in this pull request?

This PR proposes to document the `window` & `session_window` functions in the SQL API doc page.

Screenshot of functions:

> window

![스크린샷 2021-08-26 오후 6 34 58](https://user-images.githubusercontent.com/1317309/130939754-0ea1b55e-39d4-4205-b79d-a9508c98921c.png)

> session_window

![스크린샷 2021-08-26 오후 6 35 19](https://user-images.githubusercontent.com/1317309/130939773-b6cb4b98-88f8-4d57-a188-ee40ed7b2b08.png)
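
For reference, a minimal usage sketch of the two functions (the `events` table and `ts` column here are hypothetical, not taken from the doc page):

```scala
// Tumbling windows of 10 minutes, and session windows with a 5-minute gap.
spark.sql("SELECT window(ts, '10 minutes'), count(*) FROM events GROUP BY window(ts, '10 minutes')").show()
spark.sql("SELECT session_window(ts, '5 minutes'), count(*) FROM events GROUP BY session_window(ts, '5 minutes')").show()
```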

### Why are the changes needed?

The description is missing for both the `window` / `session_window` functions in the SQL API page.

### Does this PR introduce _any_ user-facing change?

Yes, the description of `window` / `session_window` functions will be available in SQL API page.

### How was this patch tested?

Only doc changes.

Closes #33846 from HeartSaVioR/SPARK-36595.

Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-08-27 12:39:09 +09:00
Yuanjian Li dd3f0fa8c2 [SPARK-35611][SS][FOLLOW-UP] Improve the user guide document
### What changes were proposed in this pull request?
Improve the user guide document.

### Why are the changes needed?
Make the user guide clear.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Doc change only.

Closes #33854 from xuanyuanking/SPARK-35611-follow.

Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-27 10:27:06 +09:00
Wenchen Fan 72d6d64835 [SPARK-36587][SQL] Migrate CreateNamespaceStatement to v2 command framework
### What changes were proposed in this pull request?

This PR migrates CreateNamespaceStatement to the v2 command framework. Two new logical plans `UnresolvedObjectName` and `ResolvedObjectName` are introduced to support these CreateXXXStatements.

### Why are the changes needed?

Avoid duplicated code

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #33835 from cloud-fan/ddl.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-26 20:36:04 +08:00
Cheng Su 400dc7ba67 [SPARK-36594][SQL] ORC vectorized reader should properly check maximal number of fields
### What changes were proposed in this pull request?

Debugged internally and found a bug: the decision to enable the vectorized reader should be based on the schema checked recursively. Currently we check that `schema.length` is no more than `wholeStageMaxNumFields` to enable vectorization, but `schema.length` does not take nested columns' sub-fields into account (i.e. it treats a nested column the same as a primitive column). This check will be wrong when enabling vectorization for nested columns. We should follow the [same check](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala#L896) as `WholeStageCodegenExec` and count sub-fields recursively. This does not cause a correctness issue, but it causes a performance issue where we may enable vectorization for nested columns by mistake when a nested column has a lot of sub-fields.
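
A rough sketch of the recursive counting the check needs (illustrative only, not the code added by this PR):

```scala
import org.apache.spark.sql.types._

// Count leaf fields recursively so that nested structs are not treated as a single column.
def numLeafFields(dt: DataType): Int = dt match {
  case s: StructType => s.fields.map(f => numLeafFields(f.dataType)).sum
  case a: ArrayType  => numLeafFields(a.elementType)
  case m: MapType    => numLeafFields(m.keyType) + numLeafFields(m.valueType)
  case _             => 1
}

val schema = new StructType()
  .add("id", LongType)
  .add("payload", new StructType().add("a", IntegerType).add("b", StringType))

assert(schema.length == 2)         // top-level columns only: what the old check compared
assert(numLeafFields(schema) == 3) // what should be compared against wholeStageMaxNumFields
```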

### Why are the changes needed?

Avoid OOM/performance regression when reading ORC table with nested column types.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit test in `OrcQuerySuite.scala`. Verified test failed without this change.

Closes #33842 from c21/field-fix.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-26 19:42:25 +08:00
Leona Yoda aeb3da2798 [SPARK-36541][DOCS][PYTHON] Replace the word Koalas to pandas-on-Spark
### What changes were proposed in this pull request?

Replace images in the pandas-on-Spark documentation because those images use the word Koalas.

### Why are the changes needed?

Images in the "Transform and apply a function" documentation still use the word Koalas, although the word was replaced with pandas-on-Spark by this PR:
https://github.com/apache/spark/pull/32835

I think we have to make the images match that wording.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

`make html`

Screen shots
![130179112-8485fdde-b422-4834-8b23-fe69e7402118](https://user-images.githubusercontent.com/14937752/130186051-d6ff65f0-c121-40bd-b4f1-2fbc10e76f3e.png)
![130179239-8dae7812-4d81-4f8c-8558-b75e4eae3787](https://user-images.githubusercontent.com/14937752/130186063-17d4a95f-0b9d-49d3-85c7-13ea07e4b6bb.png)
![130179273-10f9fbc3-0a62-4e1a-ab6e-7049d75653a1](https://user-images.githubusercontent.com/14937752/130186074-7d684669-b9ef-4a4e-8a2d-c63bb9800ddb.png)
![130179311-616545af-dde2-4dec-807f-dde0a0d4bfbe](https://user-images.githubusercontent.com/14937752/130186095-20669673-b1d3-4552-97bf-86bbc1a5d43b.png)
Environment
- Windows 10
- Google Chrome 92.0.4515.159

[images.pptx](https://github.com/apache/spark/files/7029087/images.pptx)

Closes #33786 from yoda-mon/replace-pyspark-doc-images.

Authored-by: Leona Yoda <yodal@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-26 19:03:02 +09:00
itholic fe486185c4 [SPARK-36537][PYTHON] Revisit disabled tests for CategoricalDtype
### What changes were proposed in this pull request?

This PR proposes to enable the tests that were disabled because of behavior differences with pandas 1.3.

- The `inplace` argument for `CategoricalDtype` functions is deprecated as of pandas 1.3 and seems to have a bug, so we manually created the expected results and test against them.
- Fixed `GroupBy.transform`, since it doesn't work properly for `CategoricalDtype`.

### Why are the changes needed?

We should enable the tests as much as possible even if pandas has a bug.

And we should follow the behavior of the latest pandas.

### Does this PR introduce _any_ user-facing change?

Yes, `GroupBy.transform` now follow the behavior of latest pandas.

### How was this patch tested?

Unittests.

Closes #33817 from itholic/SPARK-36537.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-26 17:43:49 +09:00
itholic 97e7d6e667 [SPARK-36505][PYTHON] Improve test coverage for frame.py
### What changes were proposed in this pull request?

This PR proposes improving test coverage for pandas-on-Spark DataFrame code base, which is written in `frame.py`.

This PR did the following to improve coverage:
- Add unittest for untested code
- Remove unused code
- Add arguments to some functions for testing

### Why are the changes needed?

To make the project healthier by improving coverage.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unittest.

Closes #33833 from itholic/SPARK-36505.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-26 17:43:00 +09:00
Gengliang Wang 1a42aa5bd4 [SPARK-36457][DOCS] Review and fix issues in Scala/Java API docs
### What changes were proposed in this pull request?

Compare the 3.2.0 API doc with the latest release version 3.1.2. Fix the following issues:

- Add missing `Since` annotation for new APIs
- Remove the leaking class/object in API doc

### Why are the changes needed?

Improve API docs

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing UT

Closes #33824 from gengliangwang/auditDoc.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-26 12:59:18 +08:00
Pablo Langa 14622fcec8 [SPARK-36488][SQL] Improve error message with quotedRegexColumnNames
### What changes were proposed in this pull request?

When `spark.sql.parser.quotedRegexColumnNames=true` and a pattern is used in a place where it is not allowed, the message is a little bit confusing:

```
scala> spark.sql("set spark.sql.parser.quotedRegexColumnNames=true")

scala> spark.sql("SELECT `col_.?`/col_b FROM (SELECT 3 AS col_a, 1 as col_b)")
org.apache.spark.sql.AnalysisException: Invalid usage of '*' in expression 'divide'
```
This PR attempts to improve the error message
```
scala> spark.sql("SELECT `col_.?`/col_b FROM (SELECT 3 AS col_a, 1 as col_b)")
org.apache.spark.sql.AnalysisException: Invalid usage of regular expression in expression 'divide'
```

### Why are the changes needed?

To clarify the error message with this option active

### Does this PR introduce _any_ user-facing change?

Yes, change the error message

### How was this patch tested?

Unit testing and manual testing

Closes #33802 from planga82/feature/spark36488_improve_error_message.

Authored-by: Pablo Langa <soypab@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-26 11:33:40 +08:00
Peter Toth 3bb8cd918d [SPARK-36568][SQL] Better FileScan statistics estimation
### What changes were proposed in this pull request?
This PR modifies `FileScan.estimateStatistics()` to take the read schema into account.
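
The idea, as a rough sketch (the helper below is hypothetical, not the actual `FileScan` code): scale the size estimate by how much of the full data schema the pruned read schema covers.

```scala
import org.apache.spark.sql.types.StructType

// Sketch: estimate the scan size from the total file bytes, weighted by the ratio of
// the read schema's default row width to the full data schema's default row width.
def estimatedSizeInBytes(totalFileBytes: Long, dataSchema: StructType, readSchema: StructType): Long = {
  val fullWidth = dataSchema.map(_.dataType.defaultSize).sum.max(1)
  val readWidth = readSchema.map(_.dataType.defaultSize).sum.max(1)
  (totalFileBytes.toDouble * readWidth / fullWidth).toLong
}
```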

### Why are the changes needed?
`V2ScanRelationPushDown` can column-prune `DataSourceV2ScanRelation`s and change the read schema of `Scan` operations. The better statistics returned by `FileScan.estimateStatistics()` can mean better query plans. For example, with this change the broadcast issue in SPARK-36568 can be avoided.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added new UT.

Closes #33825 from peter-toth/SPARK-36568-scan-statistics-estimation.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-26 11:26:39 +08:00
Max Gekk 159ff9fd14 [SPARK-36590][SQL] Convert special timestamp_ntz values in the session time zone
### What changes were proposed in this pull request?
In the PR, I propose to use the session time zone (see the SQL config `spark.sql.session.timeZone`) instead of the JVM default time zone while converting special timestamp_ntz strings such as "today", "tomorrow" and so on.

### Why are the changes needed?
The current implementation is based on the system time zone, which contradicts other functions/classes that use the session time zone. For example, Spark doesn't respect the user's settings:
```sql
$ export TZ="Europe/Amsterdam"
$ ./bin/spark-sql -S
spark-sql> select timestamp_ntz'now';
2021-08-25 18:12:36.233

spark-sql> set spark.sql.session.timeZone=America/Los_Angeles;
spark.sql.session.timeZone	America/Los_Angeles
spark-sql> select timestamp_ntz'now';
2021-08-25 18:14:40.547
```

### Does this PR introduce _any_ user-facing change?
Yes. For the example above, after the changes:
```sql
spark-sql> select timestamp_ntz'now';
2021-08-25 18:47:46.832

spark-sql> set spark.sql.session.timeZone=America/Los_Angeles;
spark.sql.session.timeZone	America/Los_Angeles
spark-sql> select timestamp_ntz'now';
2021-08-25 09:48:05.211
```

### How was this patch tested?
By running the affected test suites:
```
$ build/sbt "test:testOnly *DateTimeUtilsSuite"
```

Closes #33838 from MaxGekk/fix-ts_ntz-special-values.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-26 10:09:18 +08:00
Max Gekk c4e739fb4b [SPARK-35581][SPARK-36567][SQL][DOCS][FOLLOWUP] Update the SQL migration guide about foldable special datetime values
### What changes were proposed in this pull request?
In the PR, I propose to update an existing item in the SQL migration guide, and mention that Spark 3.2 supports foldable special datetime values as well.
<img width="1292" alt="Screenshot 2021-08-25 at 23 29 51" src="https://user-images.githubusercontent.com/1580697/130860184-27f0ba56-6c2d-4a5a-91a8-195f2f8aa5da.png">

### Why are the changes needed?
To inform users about actual Spark SQL behavior introduced by https://github.com/apache/spark/pull/33816

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By generating docs, and checking results manually.

Closes #33840 from MaxGekk/special-datetime-cast-migr-guide.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-26 10:02:00 +08:00
Holden Karau ff3f3c4566 [SPARK-36058][K8S] Add support for statefulset APIs in K8s
### What changes were proposed in this pull request?

Generalize the pod allocator and add support for statefulsets.

### Why are the changes needed?

Allocating individual pods in Spark can be less than ideal for some clusters, and using higher-level operators like StatefulSets and ReplicaSets can be useful.

### Does this PR introduce _any_ user-facing change?

Yes new config options.

### How was this patch tested?

Completed: New unit & basic integration test
PV integration tests

Closes #33508 from holdenk/SPARK-36058-support-replicasets-or-job-api-like-things.

Lead-authored-by: Holden Karau <hkarau@netflix.com>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Signed-off-by: Holden Karau <hkarau@netflix.com>
2021-08-25 17:38:57 -07:00
Gengliang Wang 18143fb426 [SPARK-36585][SQL][DOCS] Support setting "since" version in FunctionRegistry
### What changes were proposed in this pull request?

Spark 3.2.0 includes two new functions, `regexp` and `regexp_like`, which are identical to `rlike`. However, in the generated documentation, the since versions of both functions are `1.0.0` since they are based on the expression `RLike`:

- https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc1-docs/_site/api/sql/index.html#regexp
- https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc1-docs/_site/api/sql/index.html#regexp_like

This PR is to:
* Support setting `since` version in FunctionRegistry
* Correct the `since` version of `regexp` and `regexp_like`

### Why are the changes needed?

Correct the SQL doc.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Run
```
sh sql/create-docs.sh
```
and check the SQL doc manually

Closes #33834 from gengliangwang/allowSQLFunVersion.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-25 22:32:20 +08:00
Kousuke Saruta b2ff01608f [SPARK-36398][SQL] Redact sensitive information in Spark Thrift Server log
### What changes were proposed in this pull request?

This PR fixes an issue where there is no way to redact sensitive information in the Spark Thrift Server log.
For example, a JDBC password can be exposed in the log.
```
21/08/25 18:52:37 INFO SparkExecuteStatementOperation: Submitting query 'CREATE TABLE mytbl2(a int) OPTIONS(url="jdbc:mysql//example.com:3306", driver="com.mysql.jdbc.Driver", dbtable="test_tbl", user="test_usr", password="abcde")' with ca14ae38-1aaf-4bf4-a099-06b8e5337613
```

### Why are the changes needed?

Bug fix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Ran ThriftServer, connect to it and execute `CREATE TABLE mytbl2(a int) OPTIONS(url="jdbc:mysql//example.com:3306", driver="com.mysql.jdbc.Driver", dbtable="test_tbl", user="test_usr", password="abcde");` with `spark.sql.redaction.string.regex=((?i)(?<=password=))(".*")|('.*')`
Then, confirmed the log.
```
21/08/25 18:54:11 INFO SparkExecuteStatementOperation: Submitting query 'CREATE TABLE mytbl2(a int) OPTIONS(url="jdbc:mysql//example.com:3306", driver="com.mysql.jdbc.Driver", dbtable="test_tbl", user="test_usr", password=*********(redacted))' with ffc627e2-b1a8-4d83-ab6d-d819b3ccd909
```

Closes #33832 from sarutak/fix-SPARK-36398.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-08-25 21:30:43 +09:00
Max Gekk df0ec56723 [SPARK-36567][SQL] Support foldable special datetime strings by CAST
### What changes were proposed in this pull request?
In the PR, I propose to add new correctness rule `SpecialDatetimeValues` to the final analysis phase. It replaces casts of strings to date/timestamp_ltz/timestamp_ntz by literals of such types if the strings contain special datetime values like `today`, `yesterday` and `tomorrow`, and the input strings are foldable.

### Why are the changes needed?
1. To avoid a breaking change.
2. To improve user experience with Spark SQL. After the PR https://github.com/apache/spark/pull/32714, users have to use typed literals instead of implicit casts. For instance,
in Spark 3.1 this worked:
```sql
select ts_col > 'now';
```
but the query fails at the moment, and users have to use a typed timestamp literal:
```sql
select ts_col > timestamp'now';
```

### Does this PR introduce _any_ user-facing change?
No. The previous release, 3.1, already supported the feature until it was removed by https://github.com/apache/spark/pull/32714.

### How was this patch tested?
1. Manually tested via the sql command line:
```sql
spark-sql> select cast('today' as date);
2021-08-24
spark-sql> select timestamp('today');
2021-08-24 00:00:00
spark-sql> select timestamp'tomorrow' > 'today';
true
```
2. By running new test suite:
```
$ build/sbt "sql/testOnly org.apache.spark.sql.catalyst.optimizer.SpecialDatetimeValuesSuite"
```

Closes #33816 from MaxGekk/foldable-datetime-special-values.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-25 14:08:59 +08:00
Kousuke Saruta bd0a4950ae [SPARK-35236][SQL][DOCS][FOLLOWUP] Mention ARCHIVE as an acceptable resource type for CREATE FUNCTION statement
### What changes were proposed in this pull request?

This PR modifies `sql-ref-syntax-ddl-create-function.md` to mention `ARCHIVE` as an acceptable resource type for `CREATE FUNCTION` statement.
`ARCHIVE` is acceptable as of SPARK-35236 (#32359).
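
For reference, a hedged example of the statement with an `ARCHIVE` resource (the class name and paths below are hypothetical, and the resources must exist for the statement to succeed):

```scala
// CREATE FUNCTION with JAR and ARCHIVE resources, issued through a SparkSession.
spark.sql("""
  CREATE FUNCTION my_udf AS 'com.example.MyUDF'
    USING JAR '/path/to/my_udf.jar', ARCHIVE '/path/to/resources.zip'
""")
```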

### Why are the changes needed?

To maintain the document.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`SKIP_API=1 bundle exec jekyll build`
![create-function-archive](https://user-images.githubusercontent.com/4736016/130630637-dcddfd8c-543b-4d21-997c-d2deaf917a4f.png)

Closes #33823 from sarutak/create-function-archive.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-25 10:04:53 +09:00
Hyukjin Kwon 93cec49212 [SPARK-36559][SQL][PYTHON] Create plans dedicated to distributed-sequence index for optimization
### What changes were proposed in this pull request?

This PR proposes to move distributed-sequence index implementation to SQL plan to leverage optimizations such as column pruning.

```python
import pyspark.pandas as ps
ps.set_option('compute.default_index_type', 'distributed-sequence')
ps.range(10).id.value_counts().to_frame().spark.explain()
```

**Before:**

```bash
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [count#51L DESC NULLS LAST], true, 0
   +- Exchange rangepartitioning(count#51L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=#70]
      +- HashAggregate(keys=[id#37L], functions=[count(1)], output=[__index_level_0__#48L, count#51L])
         +- Exchange hashpartitioning(id#37L, 200), ENSURE_REQUIREMENTS, [id=#67]
            +- HashAggregate(keys=[id#37L], functions=[partial_count(1)], output=[id#37L, count#63L])
               +- Project [id#37L]
                  +- Filter atleastnnonnulls(1, id#37L)
                     +- Scan ExistingRDD[__index_level_0__#36L,id#37L]
                        # ^^^ Base DataFrame created by the output RDD from zipWithIndex (and checkpointed)
```

**After:**

```bash
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [count#275L DESC NULLS LAST], true, 0
   +- Exchange rangepartitioning(count#275L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=#174]
      +- HashAggregate(keys=[id#258L], functions=[count(1)])
         +- HashAggregate(keys=[id#258L], functions=[partial_count(1)])
            +- Filter atleastnnonnulls(1, id#258L)
               +- Range (0, 10, step=1, splits=16)
                  # ^^^ Removed the Spark job execution for `zipWithIndex`
```

### Why are the changes needed?

To leverage optimization of SQL engine and avoid unnecessary shuffle to create default index.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unittests were added. Also, this PR will test all unittests in pandas API on Spark after switching the default index implementation to `distributed-sequence`.

Closes #33807 from HyukjinKwon/SPARK-36559.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-25 10:02:53 +09:00
PengLei 3e32ea17db [SPARK-36336][SQL] Add new exception of base exception used in QueryExecutionErrors
### What changes were proposed in this pull request?
When we refactor the query execution errors to use error classes in QueryExecutionErrors, we need to define some exceptions that mix SparkThrowable into a base exception type,
following the example [SparkArithmeticException](f90eb6a5db/core/src/main/scala/org/apache/spark/SparkException.scala (L75)).

Add SparkXXXException as follows:
- `SparkClassNotFoundException`
- `SparkConcurrentModificationException`
- `SparkDateTimeException`
- `SparkFileAlreadyExistsException`
- `SparkFileNotFoundException`
- `SparkNoSuchMethodException`
- `SparkIndexOutOfBoundsException`
- `SparkIOException`
- `SparkSecurityException`
- `SparkSQLException`
- `SparkSQLFeatureNotSupportedException`

Refactor some exceptions in QueryExecutionErrors to use error classes and the new exception types, in order to exercise the new exceptions.

Some were added by an earlier [PR](https://github.com/apache/spark/pull/33538), as follows:

- `SparkUnsupportedOperationException`
- `SparkIllegalStateException`
- `SparkNumberFormatException`
- `SparkIllegalArgumentException`
- `SparkArrayIndexOutOfBoundsException`
- `SparkNoSuchElementException`

### Why are the changes needed?
[SPARK-36336](https://issues.apache.org/jira/browse/SPARK-36336)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UTs.

Closes #33573 from Peng-Lei/SPARK-36336.

Authored-by: PengLei <peng.8lei@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-25 09:39:28 +09:00
Gengliang Wang de932f51ce Revert "[SPARK-34415][ML] Randomization in hyperparameter optimization"
### What changes were proposed in this pull request?

Revert 397b843890 and 5a48eb8d00

### Why are the changes needed?

As discussed in https://github.com/apache/spark/pull/33800#issuecomment-904140869, there is a correctness issue in the current implementation. Let's revert the code changes from branch 3.2 and fix it on the master branch later.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Ci tests

Closes #33819 from gengliangwang/revert-SPARK-34415.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-24 13:38:14 -07:00
yi.wu d6c453aaea [SPARK-36564][CORE] Fix NullPointerException in LiveRDDDistribution.toApi
### What changes were proposed in this pull request?

This PR fixes `NullPointerException` in `LiveRDDDistribution.toApi`.

### Why are the changes needed?

Looking at the stack trace, the NPE is caused by a null `exec.hostPort`. I can't get the complete log to take a close look, and can only guess that it might be because the event `SparkListenerBlockManagerAdded` was dropped or arrived out of order.

```
21/08/23 12:26:29 ERROR AsyncEventQueue: Listener AppStatusListener threw an exception
java.lang.NullPointerException
	at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:192)
	at com.google.common.collect.MapMakerInternalMap.putIfAbsent(MapMakerInternalMap.java:3507)
	at com.google.common.collect.Interners$WeakInterner.intern(Interners.java:85)
	at org.apache.spark.status.LiveEntityHelpers$.weakIntern(LiveEntity.scala:696)
	at org.apache.spark.status.LiveRDDDistribution.toApi(LiveEntity.scala:563)
	at org.apache.spark.status.LiveRDD.$anonfun$doUpdate$4(LiveEntity.scala:629)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
	at scala.collection.mutable.HashMap$$anon$2.$anonfun$foreach$3(HashMap.scala:158)
	at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237)
	at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230)
	at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44)
	at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:158)
	at scala.collection.TraversableLike.map(TraversableLike.scala:238)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
	at org.apache.spark.status.LiveRDD.doUpdate(LiveEntity.scala:629)
	at org.apache.spark.status.LiveEntity.write(LiveEntity.scala:51)
	at org.apache.spark.status.AppStatusListener.update(AppStatusListener.scala:1206)
	at org.apache.spark.status.AppStatusListener.maybeUpdate(AppStatusListener.scala:1212)
	at org.apache.spark.status.AppStatusListener.$anonfun$onExecutorMetricsUpdate$6(AppStatusListener.scala:956)
	...
```

### Does this PR introduce _any_ user-facing change?

Yes, users will see the expected RDD info in UI instead of the NPE error.

### How was this patch tested?

Pass existing tests.

Closes #33812 from Ngone51/fix-hostport-npe.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-24 13:33:42 -07:00
Luca Canali e03afc906f [SPARK-36573][BUILD][TEST] Add a default value to ORACLE_DOCKER_IMAGE
### What changes were proposed in this pull request?
Currently, the procedure to run the Oracle Integration Suite is based on building the Oracle RDBMS image from the Dockerfiles provided by Oracle.
Recently, Oracle has started providing database images, see  https://container-registry.oracle.com
Moreover, an Oracle employee is maintaining Oracle XE images that are streamlined for testing at https://hub.docker.com/r/gvenzl/oracle-xe and https://github.com/gvenzl/oci-oracle-xe. This solves the issue that the official images are quite large and make testing resource-intensive and slow.
This PR proposes to document the available options and to introduce a default value for `ORACLE_DOCKER_IMAGE`.

### Why are the changes needed?
This change will make it easier and faster to run the Oracle Integration Suite, removing the need to manually build an Oracle DB image.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually tested:
```
export ENABLE_DOCKER_INTEGRATION_TESTS=1
./build/sbt -Pdocker-integration-tests "testOnly org.apache.spark.sql.jdbc.OracleIntegrationSuite"
./build/sbt -Pdocker-integration-tests "testOnly org.apache.spark.sql.jdbc.v2.OracleIntegrationSuite"
```

Closes #33821 from LucaCanali/oracleDockerIntegration.

Authored-by: Luca Canali <luca.canali@cern.ch>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-24 13:30:21 -07:00
Gengliang Wang 5b4c216478 [SPARK-35535][SQL][FOLLOWUP] Move LocalScan to Catalyst package
### What changes were proposed in this pull request?

This is a follow-up of https://github.com/apache/spark/pull/32678. It moves `LocalScan` from SQL core package to Catalyst package.

### Why are the changes needed?

There are two packages for `org.apache.spark.sql.connector`:
- SQL Core: https://github.com/apache/spark/tree/master/sql/core/src/main/java/org/apache/spark/sql/connector
- Catalyst: https://github.com/apache/spark/tree/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector

As `LocalScan` doesn't depend on the classes of SQL Core, we should move it to catalyst.

### Does this PR introduce _any_ user-facing change?

No, the trait is not released yet.

### How was this patch tested?

Existing UT.

Closes #33826 from gengliangwang/moveLocalScan.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-24 13:23:50 -07:00
Aravind Patnam ee20fbb3dc [SPARK-36419][CORE] Optionally move final aggregation in RDD.treeAggregate to executor
### What changes were proposed in this pull request?

Move the final aggregation step of `RDD.treeAggregate` to an executor with a single partition, and fetch that result to the driver.
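
For context, a small usage sketch of the API in question (assumes a SparkContext `sc`, e.g. in spark-shell); the change only affects where the final combine step runs, not how the API is called:

```scala
val rdd = sc.parallelize(1 to 1000, 100)
// seqOp folds elements into a per-partition partial sum; combOp merges partial sums
// tree-wise across executors; the last argument is the tree depth.
val sum = rdd.treeAggregate(0L)((acc, x) => acc + x, (a, b) => a + b, 2)
assert(sum == 500500L)
```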

### Why are the changes needed?
1. `RDD.fold` pulls all shuffle partitions to the driver to merge the result.
        a. The driver becomes a single point of failure when there are a lot of partitions to do the final aggregation on.
2. The shuffle machinery at executors is much more robust/fault-tolerant than fetching results to the driver.

### Does this PR introduce any user-facing change?
The previous behavior always did the final aggregation in the driver. The user can now (optionally) provide a boolean config (default = false) ENABLE_EXECUTOR_TREE_AGGREGATE to do that final aggregation in a single partition executor before fetching the results to the driver. The only additional cost is that the user will see an extra stage in their job.

### How was this patch tested?
This patch was tested via unit tests, and also tested on a cluster.
The screenshots showing the extra stage on a cluster are attached below (before vs after).
![before](https://user-images.githubusercontent.com/24758726/128249830-eefc4bda-f737-4d68-960e-1d1907762538.png)
![after](https://user-images.githubusercontent.com/24758726/128249838-be70bc95-9f39-489c-be17-c9c80c4846a4.png)

Closes #33644 from akpatnam25/SPARK-36419.

Authored-by: Aravind Patnam <apatnam@linkedin.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-24 22:29:26 +08:00
Angerszhuuuu e83f8a872a [SPARK-35991][SQL][FOLLOWUP] Add back protected modifier of sparkConf to TPCBase
### What changes were proposed in this pull request?
Add back the protected modifier of sparkConf to TPCBase, according to https://github.com/apache/spark/pull/33736/files#r694054229

### Why are the changes needed?
Add back protected modifier of sparkConf to TPCBase

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not needed.

Closes #33813 from AngersZhuuuu/SPARK-35991-FOLLOWUP.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-24 11:32:30 +09:00
Hyukjin Kwon fa53aa06d1 [SPARK-36560][PYTHON][INFRA] Deflake PySpark coverage job
### What changes were proposed in this pull request?

This PR proposes to increase timeouts for:
- `pyspark.sql.tests.test_streaming.StreamingTests.test_parameter_accuracy`
- `pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests. test_parameter_accuracy`

to deflake PySpark coverage build:

- https://github.com/apache/spark/runs/3392972609?check_suite_focus=true
- https://github.com/apache/spark/runs/3388727798?check_suite_focus=true
- https://github.com/apache/spark/runs/3359880048?check_suite_focus=true
- https://github.com/apache/spark/runs/3338876122?check_suite_focus=true

### Why are the changes needed?

To have more stable PySpark coverage report: https://app.codecov.io/gh/apache/spark

### Does this PR introduce _any_ user-facing change?

Spark developers will be able to see more stable results in https://app.codecov.io/gh/apache/spark

### How was this patch tested?

GitHub Actions' scheduled jobs will test them out.

Closes #33808 from HyukjinKwon/SPARK-36560.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-24 11:08:43 +09:00
Huaxin Gao cd2342691d [SPARK-34952][SQL][FOLLOWUP] Move aggregates to a separate package
### What changes were proposed in this pull request?
Add `aggregate` package under `sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions` and move all the aggregates (e.g. `Count`, `Max`, `Min`, etc.) there.

### Why are the changes needed?
Right now these aggregates are under `sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions`. It looks OK now, but we plan to add a new `filter` package under `expressions` for all the DSV2 filters. It will look strange that filters have their own package, but aggregates don't.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests

Closes #33815 from huaxingao/agg_package.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-08-23 15:31:13 -07:00
Max Gekk 9f595c4ce3 [SPARK-36418][SPARK-36536][SQL][DOCS][FOLLOWUP] Update the SQL migration guide about using CAST in datetime parsing
### What changes were proposed in this pull request?
In the PR, I propose the update the SQL migration guide about the changes introduced by the PRs https://github.com/apache/spark/pull/33709 and https://github.com/apache/spark/pull/33769.

<img width="1011" alt="Screenshot 2021-08-23 at 11 40 35" src="https://user-images.githubusercontent.com/1580697/130419710-640f20b3-6a38-4eb1-a6d6-2e069dc5665c.png">

### Why are the changes needed?
To inform users about the upcoming changes in parsing datetime strings. This should help users to migrate on the new release.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By generating the doc, and checking by eyes:
```
$ SKIP_API=1 SKIP_RDOC=1 SKIP_PYTHONDOC=1 SKIP_SCALADOC=1 bundle exec jekyll build
```

Closes #33809 from MaxGekk/datetime-cast-migr-guide.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-08-23 13:07:37 +03:00
Yuto Akutsu adc485a94d [MINOR][DOCS] Mention Hadoop 3 in YARN introduction on cluster-overview.md
### What changes were proposed in this pull request?

Mentioning Hadoop 3 in YARN introduction on cluster-overview.md.

### Why are the changes needed?

It only mentioned Hadoop 2, but Hadoop 3 is currently the latest and, I believe, the most used version.

### Does this PR introduce _any_ user-facing change?

Yes, docs changed.

### How was this patch tested?

`SKIP_API=1 bundle exec jekyll build`
<img width="599" alt="minor_hadoop" src="https://user-images.githubusercontent.com/87687356/130002403-989f4290-d614-46f2-9823-1897fa39f370.png">

Closes #33783 from yutoacts/minor.

Authored-by: Yuto Akutsu <yuto.akutsu@nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-08-23 16:48:54 +09:00
Xinrong Meng 0b6af464dc [SPARK-36470][PYTHON] Implement CategoricalIndex.map and DatetimeIndex.map
### What changes were proposed in this pull request?
Implement `CategoricalIndex.map` and `DatetimeIndex.map`

`MultiIndex.map` cannot be implemented in the same way as the `map` of other indexes. It should be taken care of separately if necessary.

### Why are the changes needed?
Mapping values using input correspondence is a common operation that is supported in pandas. We shall support that as well.

### Does this PR introduce _any_ user-facing change?
Yes. `CategoricalIndex.map` and `DatetimeIndex.map` can be used now.

- CategoricalIndex.map

```py
>>> idx = ps.CategoricalIndex(['a', 'b', 'c'])
>>> idx
CategoricalIndex(['a', 'b', 'c'], categories=['a', 'b', 'c'], ordered=False, dtype='category')

>>> idx.map(lambda x: x.upper())
CategoricalIndex(['A', 'B', 'C'],  categories=['A', 'B', 'C'], ordered=False, dtype='category')

>>> pser = pd.Series([1, 2, 3], index=pd.CategoricalIndex(['a', 'b', 'c'], ordered=True))
>>> idx.map(pser)
CategoricalIndex([1, 2, 3], categories=[1, 2, 3], ordered=True, dtype='category')

>>> idx.map({'a': 'first', 'b': 'second', 'c': 'third'})
CategoricalIndex(['first', 'second', 'third'], categories=['first', 'second', 'third'], ordered=False, dtype='category')
```

- DatetimeIndex.map

```py
>>> pidx = pd.date_range(start="2020-08-08", end="2020-08-10")
>>> psidx = ps.from_pandas(pidx)

>>> mapper_dict = {
...   datetime.datetime(2020, 8, 8): datetime.datetime(2021, 8, 8),
...   datetime.datetime(2020, 8, 9): datetime.datetime(2021, 8, 9),
... }
>>> psidx.map(mapper_dict)
DatetimeIndex(['2021-08-08', '2021-08-09', 'NaT'], dtype='datetime64[ns]', freq=None)

>>> mapper_pser = pd.Series([1, 2, 3], index=pidx)
>>> psidx.map(mapper_pser)
Int64Index([1, 2, 3], dtype='int64')
>>> psidx
DatetimeIndex(['2020-08-08', '2020-08-09', '2020-08-10'], dtype='datetime64[ns]', freq=None)

>>> psidx.map(lambda x: x.strftime("%B %d, %Y, %r"))
Index(['August 08, 2020, 12:00:00 AM', 'August 09, 2020, 12:00:00 AM',
       'August 10, 2020, 12:00:00 AM'],
      dtype='object')
```

### How was this patch tested?
Unit tests.

Closes #33756 from xinrong-databricks/other_indexes_map.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-23 10:08:40 +09:00
Gengliang Wang 3da0e9500f [SPARK-36557][DOCS] Update the MAVEN_OPTS in Spark build docs
### What changes were proposed in this pull request?

As Jacek Laskowski pointed out on the dev list, there is a StackOverflowError when compiling Spark with the current MAVEN_OPTS in the Spark documentation.
We should update it with `-Xss64m` to avoid the error.

### Why are the changes needed?

Correct the documentation

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manual test. The MAVEN_OPTS is consistent with our github action build.

Closes #33804 from gengliangwang/updateBuildDoc.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-23 09:46:00 +09:00
Venkata krishnan Sowrirajan 7b2842e986 [SPARK-36374][FOLLOW-UP] Change config key spark.shuffle.server.mergedShuffleFileManagerImpl to spark.shuffle.push.server.mergedShuffleFileManagerImpl
### What changes were proposed in this pull request?

Minor changes to rename the config key from `spark.shuffle.server.mergedShuffleFileManagerImpl` to `spark.shuffle.push.server.mergedShuffleFileManagerImpl`. This was missed in https://github.com/apache/spark/pull/33615.

### Why are the changes needed?

To keep the config names consistent

### Does this PR introduce _any_ user-facing change?

Yes, this is a change in the config key name, but the new config name has not been released yet, so technically there is no user-facing change.

### How was this patch tested?

Existing tests.

Closes #33799 from venkata91/SPARK-36374-follow-up.

Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
2021-08-22 01:28:31 -05:00
Liang-Chi Hsieh 5876e04de2 [MINOR][SS][DOCS] Update doc for streaming deduplication
### What changes were proposed in this pull request?

This patch fixes an error about streaming deduplication in Structured Streaming, and also updates an item about unsupported operations.

### Why are the changes needed?

Update the user document.

### Does this PR introduce _any_ user-facing change?

No. It's a doc only change.

### How was this patch tested?

Doc only change.

Closes #33801 from viirya/minor-ss-deduplication.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-08-21 18:20:17 -07:00
Angerszhuuuu 5740d5641d [SPARK-36549][SQL] Add taskStatus supports multiple value to monitoring doc
### What changes were proposed in this pull request?
In the stage-related REST API, we support the `taskStatus` parameter as a list:
```
 QueryParam("taskStatus") taskStatus: JList[TaskStatus]
```
In a REST request we would write it like
```
taskStatus=SUCCESS&taskStatus=FAILED
```

It's useful but not shown in the doc, and many users don't know how to write the list parameters.
So this adds the feature to the monitoring doc too.

### Why are the changes needed?
Make doc clear

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
With the REST request
```
http://localhost:4040/api/v1/applications/local-1629432414554/stages/0?details=true&taskStatus=FAILED
```
The resulting tasks:
```
tasks" : {
    "0" : {
      "taskId" : 0,
      "index" : 0,
      "attempt" : 0,
      "launchTime" : "2021-08-20T04:06:55.515GMT",
      "duration" : 273,
      "executorId" : "driver",
      "host" : "host",
      "status" : "FAILED",
      "taskLocality" : "PROCESS_LOCAL",
      "speculative" : false,
      "accumulatorUpdates" : [ ],
      "errorMessage" : "java.lang.RuntimeException\n\tat org.apache.spark.ui.UISuite.$anonfun$new$8(UISuite.scala:95)\n\tat scala.runtime.java8.JFunction1$mcVI$sp.apply(JFunction1$mcVI$sp.java:23)\n\tat scala.collection.Iterator.foreach(Iterator.scala:943)\n\tat scala.collection.Iterator.foreach$(Iterator.scala:943)\n\tat org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)\n\tat org.apache.spark.rdd.RDD.$anonfun$foreach$2(RDD.scala:1003)\n\tat org.apache.spark.rdd.RDD.$anonfun$foreach$2$adapted(RDD.scala:1003)\n\tat org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2254)\n\tat org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)\n\tat org.apache.spark.scheduler.Task.run(Task.scala:136)\n\tat org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:507)\n\tat org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1468)\n\tat org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:510)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\n",
      "taskMetrics" : {
        "executorDeserializeTime" : 0,
        "executorDeserializeCpuTime" : 0,
        "executorRunTime" : 206,
        "executorCpuTime" : 0,
        "resultSize" : 0,
        "jvmGcTime" : 0,
        "resultSerializationTime" : 0,
        "memoryBytesSpilled" : 0,
        "diskBytesSpilled" : 0,
        "peakExecutionMemory" : 0,
        "inputMetrics" : {
          "bytesRead" : 0,
          "recordsRead" : 0
        },
        "outputMetrics" : {
          "bytesWritten" : 0,
          "recordsWritten" : 0
        },
        "shuffleReadMetrics" : {
          "remoteBlocksFetched" : 0,
          "localBlocksFetched" : 0,
          "fetchWaitTime" : 0,
          "remoteBytesRead" : 0,
          "remoteBytesReadToDisk" : 0,
          "localBytesRead" : 0,
          "recordsRead" : 0
        },
        "shuffleWriteMetrics" : {
          "bytesWritten" : 0,
          "writeTime" : 0,
          "recordsWritten" : 0
        }
      },
      "executorLogs" : { },
      "schedulerDelay" : 67,
      "gettingResultTime" : 0
    }
  },
```

With restful request
```
http://localhost:4040/api/v1/applications/local-1629432414554/stages/0?details=true&taskStatus=FAILED&taskStatus=SUCCESS
```
The resulting tasks:
```
"tasks" : {
    "1" : {
      "taskId" : 1,
      "index" : 1,
      "attempt" : 0,
      "launchTime" : "2021-08-20T04:06:55.786GMT",
      "duration" : 16,
      "executorId" : "driver",
      "host" : "host",
      "status" : "SUCCESS",
      "taskLocality" : "PROCESS_LOCAL",
      "speculative" : false,
      "accumulatorUpdates" : [ ],
      "taskMetrics" : {
        "executorDeserializeTime" : 2,
        "executorDeserializeCpuTime" : 2638000,
        "executorRunTime" : 2,
        "executorCpuTime" : 1993000,
        "resultSize" : 837,
        "jvmGcTime" : 0,
        "resultSerializationTime" : 0,
        "memoryBytesSpilled" : 0,
        "diskBytesSpilled" : 0,
        "peakExecutionMemory" : 0,
        "inputMetrics" : {
          "bytesRead" : 0,
          "recordsRead" : 0
        },
        "outputMetrics" : {
          "bytesWritten" : 0,
          "recordsWritten" : 0
        },
        "shuffleReadMetrics" : {
          "remoteBlocksFetched" : 0,
          "localBlocksFetched" : 0,
          "fetchWaitTime" : 0,
          "remoteBytesRead" : 0,
          "remoteBytesReadToDisk" : 0,
          "localBytesRead" : 0,
          "recordsRead" : 0
        },
        "shuffleWriteMetrics" : {
          "bytesWritten" : 0,
          "writeTime" : 0,
          "recordsWritten" : 0
        }
      },
      "executorLogs" : { },
      "schedulerDelay" : 12,
      "gettingResultTime" : 0
    },
    "0" : {
      "taskId" : 0,
      "index" : 0,
      "attempt" : 0,
      "launchTime" : "2021-08-20T04:06:55.515GMT",
      "duration" : 273,
      "executorId" : "driver",
      "host" : "host",
      "status" : "FAILED",
      "taskLocality" : "PROCESS_LOCAL",
      "speculative" : false,
      "accumulatorUpdates" : [ ],
      "errorMessage" : "java.lang.RuntimeException\n\tat org.apache.spark.ui.UISuite.$anonfun$new$8(UISuite.scala:95)\n\tat scala.runtime.java8.JFunction1$mcVI$sp.apply(JFunction1$mcVI$sp.java:23)\n\tat scala.collection.Iterator.foreach(Iterator.scala:943)\n\tat scala.collection.Iterator.foreach$(Iterator.scala:943)\n\tat org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)\n\tat org.apache.spark.rdd.RDD.$anonfun$foreach$2(RDD.scala:1003)\n\tat org.apache.spark.rdd.RDD.$anonfun$foreach$2$adapted(RDD.scala:1003)\n\tat org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2254)\n\tat org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)\n\tat org.apache.spark.scheduler.Task.run(Task.scala:136)\n\tat org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:507)\n\tat org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1468)\n\tat org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:510)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\n",
      "taskMetrics" : {
        "executorDeserializeTime" : 0,
        "executorDeserializeCpuTime" : 0,
        "executorRunTime" : 206,
        "executorCpuTime" : 0,
        "resultSize" : 0,
        "jvmGcTime" : 0,
        "resultSerializationTime" : 0,
        "memoryBytesSpilled" : 0,
        "diskBytesSpilled" : 0,
        "peakExecutionMemory" : 0,
        "inputMetrics" : {
          "bytesRead" : 0,
          "recordsRead" : 0
        },
        "outputMetrics" : {
          "bytesWritten" : 0,
          "recordsWritten" : 0
        },
        "shuffleReadMetrics" : {
          "remoteBlocksFetched" : 0,
          "localBlocksFetched" : 0,
          "fetchWaitTime" : 0,
          "remoteBytesRead" : 0,
          "remoteBytesReadToDisk" : 0,
          "localBytesRead" : 0,
          "recordsRead" : 0
        },
        "shuffleWriteMetrics" : {
          "bytesWritten" : 0,
          "writeTime" : 0,
          "recordsWritten" : 0
        }
      },
      "executorLogs" : { },
      "schedulerDelay" : 67,
      "gettingResultTime" : 0
    }
  },
```
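
For illustration only, a minimal Scala sketch (hypothetical, not part of this PR) of fetching the same filtered endpoint programmatically, assuming a local application is serving the UI on port 4040:
```scala
import scala.io.Source

// Application id and port are taken from the example request above; adjust as needed.
val appId = "local-1629432414554"
val url = s"http://localhost:4040/api/v1/applications/$appId/stages/0" +
  "?details=true&taskStatus=FAILED&taskStatus=SUCCESS"

// Fetch the JSON payload and print it; the repeated taskStatus parameters act as a filter.
val src = Source.fromURL(url)
try println(src.mkString) finally src.close()
```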

Closes #33793 from AngersZhuuuu/SPARK-36549.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-22 09:45:21 +09:00
Kent Yao f918c123a0 [SPARK-36552][SQL] Fix different behavior for writing char/varchar to hive and datasource table
### What changes were proposed in this pull request?

For Hive tables, the actual write path and the schema handling are inconsistent when `spark.sql.legacy.charVarcharAsString` is true.

This causes problems like the one described in SPARK-36552.

In this PR we respect `spark.sql.legacy.charVarcharAsString` when generating the Hive table schema from Spark data types.

### Why are the changes needed?

Bug fix.

### Does this PR introduce _any_ user-facing change?

Yes. When `spark.sql.legacy.charVarcharAsString` is true, Hive tables with char/varchar columns will follow plain string behavior.
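
As a minimal sketch (hypothetical, assuming a Hive-enabled session; not taken from this PR), the legacy flag makes a CHAR column behave as a plain STRING:
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("charVarcharAsStringSketch")
  .master("local[*]")
  .enableHiveSupport()
  .getOrCreate()

// With the legacy flag on, CHAR/VARCHAR are treated as STRING, so no padding or
// length check is applied when writing to the Hive table.
spark.conf.set("spark.sql.legacy.charVarcharAsString", "true")
spark.sql("CREATE TABLE t_char_sketch (c CHAR(5)) USING hive")
spark.sql("INSERT INTO t_char_sketch VALUES ('abcdef')")  // 6 chars; accepted as STRING
```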

### How was this patch tested?

Newly added test.

Closes #33798 from yaooqinn/SPARK-36552.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-22 09:38:39 +09:00
yangjie01 1ccb06ca8c Revert "[SPARK-34309][BUILD][CORE][SQL][K8S] Use Caffeine instead of Guava Cache"
### What changes were proposed in this pull request?
This PR reverts the changes of SPARK-34309, which include:

- https://github.com/apache/spark/pull/31517
- https://github.com/apache/spark/pull/33772

### Why are the changes needed?

1. No real performance improvement in Spark
2. It introduced an additional dependency

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #33784 from LuciferYang/revert-caffeine.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-22 09:36:15 +09:00
ulysses-you 90cbf9ca3e [SPARK-35083][CORE][FOLLOWUP] Improve docs and migration guide
### What changes were proposed in this pull request?

* improve docs in `docs/job-scheduling.md`
* add migration guide docs in `docs/core-migration-guide.md`

### Why are the changes needed?

Help users migrate.

### Does this PR introduce _any_ user-facing change?

Yes

### How was this patch tested?

Pass CI

Closes #33794 from ulysses-you/SPARK-35083-f.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Kent Yao <yao@apache.org>
2021-08-20 21:32:48 +08:00
sweisdb c441c7e365 Updates AuthEngine to pass the correct SecretKeySpec format
AuthEngineSuite was passing on some platforms (macOS) but failing on others (Linux) with an InvalidKeyException stemming from the SecretKeySpec construction. We should explicitly pass AES as the key format.

### What changes were proposed in this pull request?

Changes the AuthEngine SecretKeySpec from "RAW" to "AES".

### Why are the changes needed?

Unit tests were failing on some platforms with InvalidKeyExceptions when this key was used to instantiate a Cipher.
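
For context, a minimal standalone sketch (not Spark's actual AuthEngine code) showing why the algorithm name matters when the key is later handed to an AES cipher:
```scala
import java.security.SecureRandom
import javax.crypto.Cipher
import javax.crypto.spec.{IvParameterSpec, SecretKeySpec}

// 128-bit key material and IV, generated here only for the sake of the example.
val random = new SecureRandom()
val keyBytes = new Array[Byte](16)
val ivBytes = new Array[Byte](16)
random.nextBytes(keyBytes)
random.nextBytes(ivBytes)

// Declaring the key as "AES" (rather than "RAW") lets an AES Cipher accept it
// on providers that validate the key algorithm.
val key = new SecretKeySpec(keyBytes, "AES")
val cipher = Cipher.getInstance("AES/CTR/NoPadding")
cipher.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(ivBytes))
```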

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests on macOS and Linux platforms.

Closes #33790 from sweisdb/patch-1.

Authored-by: sweisdb <60895808+sweisdb@users.noreply.github.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-08-20 08:31:39 -05:00
Yesheng Ma 5c0762b5d2 [SPARK-36448][SQL] Exceptions in NoSuchItemException.scala have to be case classes
### What changes were proposed in this pull request?
Change all exceptions in NoSuchItemException.scala to case classes.

### Why are the changes needed?
Exceptions in NoSuchItemException.scala are not case classes. This causes issues because the Analyzer's executeAndCheck method always calls the `copy` method on the exception. However, since these exceptions are not case classes, the `copy` call was always delegated to `AnalysisException::copy`, which is not the specialized version.
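
To illustrate the mechanics (the class names below are hypothetical, not the actual Spark exceptions): a case class gets a synthesized `copy` that preserves its concrete type, while a plain subclass would fall back to the parent's `copy`:
```scala
// Parent exception analogous to AnalysisException in this example.
class BaseAnalysisError(val message: String) extends Exception(message)

// Declaring the subclass as a case class gives it its own type-preserving copy.
case class NoSuchWidgetException(name: String)
  extends BaseAnalysisError(s"Widget '$name' not found")

val original = NoSuchWidgetException("w1")
val copied   = original.copy(name = "w2")   // still a NoSuchWidgetException
println(copied.getMessage)                  // Widget 'w2' not found
```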

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UTs.

Closes #33673 from yeshengm/SPARK-36448.

Authored-by: Yesheng Ma <kimi.ysma@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-20 20:16:30 +08:00
Gengliang Wang 42eebb84f5 [SPARK-36551][BUILD] Add sphinx-plotly-directive in Spark release Dockerfile
### What changes were proposed in this pull request?

After https://github.com/apache/spark/pull/32726, the Python doc build requires `sphinx-plotly-directive`.
This PR installs it in `spark-rm/Dockerfile` to make sure `do-release-docker.sh` can run successfully.
It also mentions the dependency in the docs README.

### Why are the changes needed?

Fix the release script and update the docs README

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manual test locally.

Closes #33797 from gengliangwang/fixReleaseDocker.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-20 20:02:24 +08:00