Commit graph

Kousuke Saruta a3901ed384 [SPARK-36639][SQL] Fix an issue that sequence builtin function causes ArrayIndexOutOfBoundsException if the arguments are under the condition of start == stop && step < 0
### What changes were proposed in this pull request?

This PR fixes an issue where the `sequence` builtin function causes `ArrayIndexOutOfBoundsException` when the arguments satisfy `start == stop && step < 0`.
This is an example.
```
SELECT sequence(timestamp'2021-08-31', timestamp'2021-08-31', -INTERVAL 1 month);
21/09/02 04:14:42 ERROR SparkSQLDriver: Failed in [SELECT sequence(timestamp'2021-08-31', timestamp'2021-08-31', -INTERVAL 1 month)]
java.lang.ArrayIndexOutOfBoundsException: 1
```
Actually, this example succeeded before SPARK-31980 (#28819) was merged.
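For reference, a minimal sketch of the expected behavior after the fix (an assumption based on the description above): with `start == stop`, the call should return a single-element sequence instead of throwing, regardless of the step sign.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# With the fix, start == stop should no longer raise
# ArrayIndexOutOfBoundsException even when the step is negative.
spark.sql(
    "SELECT sequence(timestamp'2021-08-31', timestamp'2021-08-31', -INTERVAL 1 month) AS seq"
).show(truncate=False)
```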

### Why are the changes needed?

Bug fix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New tests.

Closes #33895 from sarutak/fix-sequence-issue.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
(cherry picked from commit cf3bc65e69)
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-09-03 23:25:33 +09:00
William Hyun 99f6f7f8f8 [SPARK-36657][SQL] Update comment in 'gen-sql-config-docs.py'
### What changes were proposed in this pull request?
This PR aims to update comments in `gen-sql-config-docs.py`.

### Why are the changes needed?
To bring it up to date for the Spark 3.2.0 release.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
N/A.

Closes #33902 from williamhyun/fixtool.

Authored-by: William Hyun <william@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit b72fa5ef1c)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-09-02 18:51:10 -07:00
Angerszhuuuu 8b4cc90c44 [SPARK-36637][SQL] Provide proper error message when use undefined window frame
### What changes were proposed in this pull request?
Two cases of using an undefined window frame, as shown below, should produce a proper error message.

1. For case using undefined window frame with window function
```
SELECT nth_value(employee_name, 2) OVER w second_highest_salary
FROM basic_pays;
```
The original error message is
```
Window function nth_value(employee_name#x, 2, false) requires an OVER clause.
```
It's confusing because the query uses a window frame `w` that is not defined.
Now the error message is
```
Window specification w is not defined in the WINDOW clause.
```

2. For case using undefined window frame with aggregation function
```
SELECT SUM(salary) OVER w sum_salary
FROM basic_pays;
```
The original error message is
```
Error in query: unresolved operator 'Aggregate [unresolvedwindowexpression(sum(salary#2), WindowSpecReference(w)) AS sum_salary#34]
+- SubqueryAlias spark_catalog.default.basic_pays
+- HiveTableRelation [`default`.`employees`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [name#0, dept#1, salary#2, age#3], Partition Cols: []]
```
In this case, when converting GlobalAggregate, we should skip UnresolvedWindowExpression.
Now the error message is
```
Window specification w is not defined in the WINDOW clause.
```
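For context, a hedged sketch of how the window specification `w` would normally be declared so that neither error occurs (table and column names follow the examples above and are placeholders; assumes a `SparkSession` named `spark`):

```python
# `w` must be defined in a WINDOW clause before `OVER w` can reference it.
query = """
SELECT employee_name,
       nth_value(employee_name, 2) OVER w AS second_highest_salary
FROM basic_pays
WINDOW w AS (ORDER BY salary DESC
             ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
"""
spark.sql(query).show()
```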

### Why are the changes needed?
Provide a proper error message.

### Does this PR introduce _any_ user-facing change?
Yes, error messages are improved as described above.

### How was this patch tested?
Added UT

Closes #33892 from AngersZhuuuu/SPARK-36637.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 568ad6aa44)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-02 22:32:47 +08:00
Cary Lee 11d10fc994 [SPARK-36617][PYTHON] Fix type hints for approxQuantile to support multi-column version
### What changes were proposed in this pull request?
Update both `DataFrame.approxQuantile` and `DataFrameStatFunctions.approxQuantile` to support overloaded definitions when multiple columns are supplied.

### Why are the changes needed?
The current type hints don't support the multi-column signature, a form that was added in Spark 2.2 (see [the approxQuantile docs](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.approxQuantile.html).) This change was also introduced to pyspark-stubs (https://github.com/zero323/pyspark-stubs/pull/552). zero323 asked me to open a PR for the upstream change.
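As a quick illustration of the overloaded forms the updated hints describe (a sketch assuming a `SparkSession` named `spark`; the data is made up):

```python
df = spark.createDataFrame([(1, 10.0), (2, 20.0), (3, 30.0)], ["a", "b"])

# Single column: returns List[float]
median_a = df.approxQuantile("a", [0.5], 0.25)

# Multiple columns: returns List[List[float]], one list per column
quartiles = df.approxQuantile(["a", "b"], [0.25, 0.75], 0.0)
```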

### Does this PR introduce _any_ user-facing change?
This change only affects type hints - it brings the `approxQuantile` type hints up to date with the actual code.

### How was this patch tested?
Ran `./dev/lint-python`.

Closes #33880 from carylee/master.

Authored-by: Cary Lee <cary@amperity.com>
Signed-off-by: zero323 <mszymkiewicz@gmail.com>
(cherry picked from commit 37f5ab07fa)
Signed-off-by: zero323 <mszymkiewicz@gmail.com>
2021-09-02 15:03:08 +02:00
Hyukjin Kwon e9f2e34261 [SPARK-36631][R] Ask users if they want to download and install SparkR in non Spark scripts
### What changes were proposed in this pull request?

This PR proposes to ask users if they want to download and install SparkR when they install SparkR from CRAN.

`SPARKR_ASK_INSTALLATION` environment variable was added in case other notebook projects are affected.

### Why are the changes needed?

This is required for CRAN. Currently SparkR is removed: https://cran.r-project.org/web/packages/SparkR/index.html.
See also https://lists.apache.org/thread.html/r02b9046273a518e347dfe85f864d23d63d3502c6c1edd33df17a3b86%40%3Cdev.spark.apache.org%3E

### Does this PR introduce _any_ user-facing change?

Yes, `sparkR.session(...)` will ask whether users want to download and install the Spark package when they are in the plain R shell or `Rscript`.

### How was this patch tested?

**R shell**

Valid input (`n`):

```
> sparkR.session(master="local")
Spark not found in SPARK_HOME:
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n): n
```
```
Error in sparkCheckInstall(sparkHome, master, deployMode) :
  Please make sure Spark package is installed in this machine.
- If there is one, set the path in sparkHome parameter or environment variable SPARK_HOME.
- If not, you may run install.spark function to do the job.
```

Invalid input:

```
> sparkR.session(master="local")
Spark not found in SPARK_HOME:
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n): abc
```
```
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n):
```

Valid input (`y`):

```
> sparkR.session(master="local")
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n): y
Spark not found in the cache directory. Installation will start.
MirrorUrl not provided.
Looking for preferred site from apache website...
Preferred mirror site found: https://ftp.riken.jp/net/apache/spark
Downloading spark-3.3.0 for Hadoop 2.7 from:
- https://ftp.riken.jp/net/apache/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.7.tgz
trying URL 'https://ftp.riken.jp/net/apache/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.7.tgz'
...
```

**Rscript**

```
cat tmp.R
```
```
library(SparkR, lib.loc = c(file.path(".", "R", "lib")))
sparkR.session(master="local")
```

```
Rscript tmp.R
```

Valid input (`n`):

```
Spark not found in SPARK_HOME:
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n): n
```
```
Error in sparkCheckInstall(sparkHome, master, deployMode) :
  Please make sure Spark package is installed in this machine.
- If there is one, set the path in sparkHome parameter or environment variable SPARK_HOME.
- If not, you may run install.spark function to do the job.
Calls: sparkR.session -> sparkCheckInstall
```

Invalid input:

```
Spark not found in SPARK_HOME:
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n): abc
```
```
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n):
```

Valid input (`y`):

```
...
Spark not found in SPARK_HOME:
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n): y
Spark not found in the cache directory. Installation will start.
MirrorUrl not provided.
Looking for preferred site from apache website...
Preferred mirror site found: https://ftp.riken.jp/net/apache/spark
Downloading spark-3.3.0 for Hadoop 2.7 from:
...
```

`bin/sparkR` and `bin/spark-submit *.R` are not affected (tested).

Closes #33887 from HyukjinKwon/SPARK-36631.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit e983ba8fce)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-02 13:27:55 +09:00
Leona Yoda b81e9741cd [SPARK-36621][PYTHON][DOCS] Add Apache license headers to Pandas API on Spark documents
### What changes were proposed in this pull request?

Add Apache license headers to the Pandas API on Spark documents.

### Why are the changes needed?

Pandas API on Spark document sources do not have license headers, while the other docs do.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

`make html`

Closes #33871 from yoda-mon/add-license-header.

Authored-by: Leona Yoda <yodal@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 77fdf5f0e4)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-02 12:35:51 +09:00
Dongjoon Hyun cdf21f729a [SPARK-36629][BUILD] Upgrade aircompressor to 1.21
### What changes were proposed in this pull request?

This PR aims to upgrade `aircompressor` dependency from 1.19 to 1.21.

### Why are the changes needed?

This brings the latest bug fix for a bug that exists in `aircompressor` 1.17 ~ 1.20.
- 1e364f7133

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #33883 from dongjoon-hyun/SPARK-36629.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit ff8cc4b800)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-31 22:35:53 -07:00
Gengliang Wang 1bad04d028 Preparing development version 3.2.1-SNAPSHOT 2021-08-31 17:04:14 +00:00
Gengliang Wang 03f5d23e96 Preparing Spark release v3.2.0-rc2 2021-08-31 17:04:08 +00:00
Jungtaek Lim 9a71c4ca84 [SPARK-36619][SS] Fix bugs around prefix-scan for HDFS backed state store and RocksDB state store
### What changes were proposed in this pull request?

This PR proposes to fix bugs around prefix-scan for both HDFS backed state store and RocksDB state store.

> HDFS backed state store

We did "shallow-copy" on copying prefix map, which leads the values of prefix map (mutable Set) to be "same instances" across multiple versions. This PR fixes it via creating a new mutable Set and copying elements.

> RocksDB state store

Prefix-scan iterators are only closed on RocksDB.rollback(), which is only called in RocksDBStateStore.abort().

While the `RocksDBStateStore.abort()` method will be called for streaming session window (since it has two physical plans, one for read and one for write), other stateful operators, which only have a single read-write physical plan, call either commit or abort and don't close the iterators on committing. These unclosed iterators can be "reused" and produce incorrect outputs.

This PR ensures that resetting prefix-scan iterators is done on loading RocksDB, which was only done in rollback.

### Why are the changes needed?

Please refer to the section above for an explanation of the bugs and their treatments.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Modified UT which failed without this PR and passes with this PR.

Closes #33870 from HeartSaVioR/SPARK-36619.

Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 60a72c938a)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-09-01 00:51:58 +08:00
yi.wu 8be53c3298 [SPARK-36614][CORE][UI] Correct executor loss reason caused by decommission in UI
### What changes were proposed in this pull request?

Post the correct executor loss reason to UI.

### Why are the changes needed?

To show the accurate loss reason.

### Does this PR introduce _any_ user-facing change?

Yes. Users can see the difference from the UI.

Before:
<img width="509" alt="WeChataad8d1f27d9f9aa7cf93ced4bcc820e2" src="https://user-images.githubusercontent.com/16397174/131341692-6f412607-87b8-405e-822d-0d28f07928da.png">
<img width="1138" alt="WeChat13c9f1345a096ff83d193e4e9853b165" src="https://user-images.githubusercontent.com/16397174/131341699-f2c9de09-635f-49df-8e27-2495f34276c0.png">

After:

<img width="599" alt="WeChata4313fa2dbf27bf2dbfaef5c1d4a19cf" src="https://user-images.githubusercontent.com/16397174/131341754-e3c93b5d-5252-4006-a4cc-94d76f41303b.png">
<img width="1182" alt="WeChat5559d52fd3070ae6c42fe32d56f9dc94" src="https://user-images.githubusercontent.com/16397174/131341761-e1e0644f-1e76-49c0-915a-26aad77ec272.png">

### How was this patch tested?

Manually tested.

Closes #33868 from Ngone51/fix-executor-remove-reason.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit ebe7bb6217)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-08-30 09:09:34 -07:00
gengjiaan d42536a6ee [SPARK-36574][SQL] pushDownPredicate=false should prevent push down filters to JDBC data source
### What changes were proposed in this pull request?
Spark SQL includes a data source that can read data from other databases using JDBC.
Spark also supports the case-insensitive option `pushDownPredicate`.
According to http://spark.apache.org/docs/latest/sql-data-sources-jdbc.html, if `pushDownPredicate` is set to false, no filter will be pushed down to the JDBC data source and thus all filters will be handled by Spark.
But I found that filters are still pushed down to the JDBC data source.
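A hedged sketch of the option in question (the URL and table name are placeholders; with the fix, the filter should be handled by Spark rather than appear as a pushed filter in the JDBC scan):

```python
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://host:5432/db")  # placeholder endpoint
      .option("dbtable", "people")                      # placeholder table
      .option("pushDownPredicate", "false")             # disable filter push-down
      .load())

# With pushDownPredicate=false, this filter should stay in Spark,
# not be pushed into the generated JDBC query.
df.filter(df.age > 21).explain()
```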

### Why are the changes needed?
Fix a bug where `pushDownPredicate=false` failed to prevent pushing down filters to the JDBC data source.

### Does this PR introduce _any_ user-facing change?
'No'.
The output of queries will not change.

### How was this patch tested?
Jenkins test.

Closes #33822 from beliefer/SPARK-36574.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit fcc91cfec4)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-30 19:09:45 +08:00
Sean Owen b76471c5df [SPARK-36603][CORE] Use WeakReference not SoftReference in LevelDB
### What changes were proposed in this pull request?

Use WeakReference not SoftReference in LevelDB

### Why are the changes needed?

(See discussion at https://github.com/apache/spark/pull/28769#issuecomment-906722390 )

"The soft reference to iterator introduced in this pr unfortunately ended up causing iterators to not be closed when they go out of scope (which would have happened earlier in the finalize)

This is because java is more conservative in cleaning up SoftReference's.
The net result was we ended up having 50k files for SHS while typically they get compacted away to 200 odd files.

Changing from SoftReference to WeakReference should make it much more aggressive in cleanup and prevent the issue - which we observed in a 3.1 SHS"
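As a loose illustration of why weak references are reclaimed promptly (plain Python, not the LevelDB/Java code in question):

```python
import weakref

class Holder:
    pass

obj = Holder()
ref = weakref.ref(obj)
print(ref() is obj)  # True while a strong reference exists

del obj              # drop the only strong reference
print(ref())         # None -- the referent is collected right away in CPython
```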

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests

Closes #33859 from srowen/SPARK-36603.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 89e907f76c)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-29 09:30:20 -07:00
Hyukjin Kwon e3edb65bf0 [MINOR][DOCS] Add Apache license header to GitHub Actions workflow files
### What changes were proposed in this pull request?

Some of GitHub Actions workflow files do not have the Apache license header. This PR adds them.

### Why are the changes needed?

To comply with the Apache license.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes #33862 from HyukjinKwon/minor-lisence.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 22c492a6b8)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-28 20:30:25 -07:00
Gengliang Wang 3719d87668 [SPARK-36606][DOCS][TESTS] Enhance the docs and tests of try_add/try_divide
### What changes were proposed in this pull request?

The `try_add` function allows the following inputs:
- number, number
- date, number
- date, interval
- timestamp, interval
- interval, interval

And, the `try_divide` function allows the following inputs:

- number, number
- interval, number

However, in the current code, there are only examples and tests about the (number, number) inputs. We should enhance the docs to let users know that the functions can be used for datetime and interval operations too.
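A few hedged examples of the non-numeric input combinations covered by the doc change (assumes a `SparkSession` named `spark`):

```python
spark.sql("SELECT try_add(date'2021-01-01', 1)").show()                                # date, number
spark.sql("SELECT try_add(date'2021-01-01', interval 1 month)").show()                 # date, interval
spark.sql("SELECT try_add(timestamp'2021-01-01 00:00:00', interval 2 hours)").show()   # timestamp, interval
spark.sql("SELECT try_divide(interval 6 months, 2)").show()                            # interval, number
```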

### Why are the changes needed?

Improve documentation and tests.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

New UT
Also build docs for preview:
![image](https://user-images.githubusercontent.com/1097932/131212897-8aea14c8-a882-4e12-94e2-f56bde7c0367.png)

Closes #33861 from gengliangwang/enhanceTryDoc.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 8a52ad9f82)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-29 10:30:13 +09:00
Kousuke Saruta 93f2b00501 [SPARK-36509][CORE] Fix the issue that executors are never re-scheduled if the worker stops with standalone cluster
### What changes were proposed in this pull request?

This PR fixes an issue where executors are never re-scheduled if the worker they run on stops.
As a result, the application gets stuck.
You can easily reproduce this issue with the following procedure.

```
# Run master
$ sbin/start-master.sh

# Run worker 1
$ SPARK_LOG_DIR=/tmp/worker1 SPARK_PID_DIR=/tmp/worker1/ sbin/start-worker.sh -c 1 -h localhost -d /tmp/worker1 --webui-port 8081 spark://<hostname>:7077

# Run worker 2
$ SPARK_LOG_DIR=/tmp/worker2 SPARK_PID_DIR=/tmp/worker2/ sbin/start-worker.sh -c 1 -h localhost -d /tmp/worker2 --webui-port 8082 spark://<hostname>:7077

# Run Spark Shell
$ bin/spark-shell --master spark://<hostname>:7077 --executor-cores 1 --total-executor-cores 1

# Check which worker the executor runs on and then kill the worker.
$ kill <worker pid>
```

With the procedure above, we expect the executor to be re-scheduled on the other worker, but it isn't.

The reason seems to be that `Master.schedule` cannot be called after the worker is marked as `WorkerState.DEAD`.
So, the solution this PR proposes is to call `Master.schedule` whenever `Master.removeWorker` is called.

This PR also fixes an issue where `ExecutorRunner` can send an `ExecutorStateChanged` message without changing its state.
This issue causes an assertion error.
```
2021-08-13 14:05:37,991 [dispatcher-event-loop-9] ERROR: Ignoring errorjava.lang.AssertionError: assertion failed: executor 0 state transfer from RUNNING to RUNNING is illegal
```

### Why are the changes needed?

It's a critical bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually tested with the procedure shown above and confirmed the executor is re-scheduled.

Closes #33818 from sarutak/fix-scheduling-stuck.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
(cherry picked from commit ea8c31e5ea)
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-08-28 18:02:11 +09:00
Gengliang Wang 9b28c2b09e [SPARK-36597][DOCS][3.2] Fix issues in SQL function docs
### What changes were proposed in this pull request?

* the functions make_dt_interval and make_ym_interval should make it clear that some of the fields are optional
* remove the `|` symbol from the doc of `bit_get` https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc1-docs/_site/api/sql/index.html#bit_get
* Address one missing comment in https://github.com/apache/spark/pull/33824#discussion_r695405699

### Why are the changes needed?

Improve the documentation.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Build doc and preview:
![image](https://user-images.githubusercontent.com/1097932/130996918-8c1fff88-ef5a-434b-8445-df7140bad3ba.png)
![image](https://user-images.githubusercontent.com/1097932/130996954-0ced28e7-fb90-4fcc-857e-6ccc31dc3c09.png)

![image](https://user-images.githubusercontent.com/1097932/130955106-5ae32dfc-6e89-4e28-bb8a-6c1b5213051c.png)

![image](https://user-images.githubusercontent.com/1097932/130922351-2f0f262d-5624-4d08-ba83-dfa3ed0b646b.png)

Closes #33857 from gengliangwang/SPARK-36597-3.2.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-27 13:00:12 -07:00
itholic 396b76466b [SPARK-36388][SPARK-36386][PYTHON][FOLLOWUP] Fix DataFrame groupby-rolling and groupby-expanding to follow pandas 1.3
This PR is followup for https://github.com/apache/spark/pull/33646 to add missing tests.

Some tests are missing

No

Unittest

Closes #33776 from itholic/SPARK-36388-followup.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit c91ae544fd)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-27 20:46:54 +09:00
Huaxin Gao 786d773585 [SPARK-36578][ML] UnivariateFeatureSelector API doc improvement
### What changes were proposed in this pull request?
Change API doc for `UnivariateFeatureSelector`

### Why are the changes needed?
make the doc look better

### Does this PR introduce _any_ user-facing change?
yes, API doc change

### How was this patch tested?
Manually checked

Closes #33855 from huaxingao/ml_doc.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 15e42b4442)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-08-26 21:16:59 -07:00
Jungtaek Lim 118a53d87f [SPARK-36595][SQL][SS][DOCS] Document window & session_window function in SQL API doc
### What changes were proposed in this pull request?

This PR proposes to document `window` & `session_window` function in SQL API doc page.

Screenshot of functions:

> window

![스크린샷 2021-08-26 오후 6 34 58](https://user-images.githubusercontent.com/1317309/130939754-0ea1b55e-39d4-4205-b79d-a9508c98921c.png)

> session_window

![스크린샷 2021-08-26 오후 6 35 19](https://user-images.githubusercontent.com/1317309/130939773-b6cb4b98-88f8-4d57-a188-ee40ed7b2b08.png)

### Why are the changes needed?

Descriptions are missing for both the `window` and `session_window` functions in the SQL API page.
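For reference, a hedged sketch of the two functions being documented (assumes a `SparkSession` named `spark` and an `events` table with an `event_time` timestamp column, both placeholders):

```python
# Tumbling windows of 10 minutes
spark.sql("""
    SELECT window(event_time, '10 minutes') AS w, count(*) AS cnt
    FROM events
    GROUP BY window(event_time, '10 minutes')
""").show(truncate=False)

# Session windows with a 5-minute gap
spark.sql("""
    SELECT session_window(event_time, '5 minutes') AS sw, count(*) AS cnt
    FROM events
    GROUP BY session_window(event_time, '5 minutes')
""").show(truncate=False)
```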

### Does this PR introduce _any_ user-facing change?

Yes, the description of `window` / `session_window` functions will be available in SQL API page.

### How was this patch tested?

Only doc changes.

Closes #33846 from HeartSaVioR/SPARK-36595.

Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(cherry picked from commit bc32144a91)
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-08-27 12:39:21 +09:00
Gengliang Wang eca81cc0ae [SPARK-36457][DOCS][3.2] Review and fix issues in Scala/Java API docs
### What changes were proposed in this pull request?

Compare the 3.2.0 API doc with the latest release version 3.1.2. Fix the following issues:

- Add missing `Since` annotation for new APIs
- Remove the leaking class/object in API doc

### Why are the changes needed?

Improve API docs

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing UT

Closes #33845 from gengliangwang/SPARK-36457-3.2.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-27 10:51:27 +08:00
Yuanjian Li f50f2d474c [SPARK-35611][SS][FOLLOW-UP] Improve the user guide document
### What changes were proposed in this pull request?
Improve the user guide document.

### Why are the changes needed?
Make the user guide clear.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Doc change only.

Closes #33854 from xuanyuanking/SPARK-35611-follow.

Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit dd3f0fa8c2)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-27 10:27:37 +09:00
Hyukjin Kwon 069f326e36 [MINOR] Address conflicts for SPARK-36367 cherry-pick 2021-08-27 10:24:18 +09:00
itholic 2dc15d9d84 [SPARK-36537][PYTHON] Revisit disabled tests for CategoricalDtype
This PR proposes to enable the tests, disabled since different behavior with pandas 1.3.

- The `inplace` argument for `CategoricalDtype` functions is deprecated as of pandas 1.3, and it seems to have a bug. So we manually created the expected results and test against them.
- Fixed `GroupBy.transform` since it doesn't work properly for `CategoricalDtype`.

We should enable the tests as much as possible even if pandas has a bug.

And we should follow the behavior of the latest pandas.

Yes, `GroupBy.transform` now follows the behavior of the latest pandas.

Unittests.

Closes #33817 from itholic/SPARK-36537.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit fe486185c4)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-27 10:00:23 +09:00
itholic 8829406366 [SPARK-36368][PYTHON] Fix CategoricalOps.astype to follow pandas 1.3
This PR proposes to fix the behavior of `astype` for `CategoricalDtype` to follow pandas 1.3.

**Before:**
```python
>>> pcat
0    a
1    b
2    c
dtype: category
Categories (3, object): ['a', 'b', 'c']

>>> pcat.astype(CategoricalDtype(["b", "c", "a"]))
0    a
1    b
2    c
dtype: category
Categories (3, object): ['b', 'c', 'a']
```

**After:**
```python
>>> pcat
0    a
1    b
2    c
dtype: category
Categories (3, object): ['a', 'b', 'c']

>>> pcat.astype(CategoricalDtype(["b", "c", "a"]))
0    a
1    b
2    c
dtype: category
Categories (3, object): ['a', 'b', 'c']  # CategoricalDtype is not updated if dtype is the same
```

`CategoricalDtype` is treated as the same `dtype` if the unique values are the same.

```python
>>> pcat1 = pser.astype(CategoricalDtype(["b", "c", "a"]))
>>> pcat2 = pser.astype(CategoricalDtype(["a", "b", "c"]))
>>> pcat1.dtype == pcat2.dtype
True
```

We should follow the latest pandas as much as possible.

Yes, the behavior is changed as example in the PR description.

Unittest

Closes #33757 from itholic/SPARK-36368.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit f2e593bcf1)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-27 10:00:12 +09:00
itholic 31557d4759 [SPARK-36387][PYTHON] Fix Series.astype from datetime to nullable string
This PR proposes to fix `Series.astype` when converting datetime type to StringDtype, to match the behavior of pandas 1.3.

In pandas < 1.3,
```python
>>> pd.Series(["2020-10-27 00:00:01", None], name="datetime").astype("string")
0    2020-10-27 00:00:01
1                    NaT
Name: datetime, dtype: string
```

This is changed to

```python
>>> pd.Series(["2020-10-27 00:00:01", None], name="datetime").astype("string")
0    2020-10-27 00:00:01
1                   <NA>
Name: datetime, dtype: string
```

in pandas >= 1.3, so we follow the behavior of latest pandas.

Because pandas-on-Spark always follows the behavior of the latest pandas.

Yes, the behavior is changed to match the latest pandas when converting datetime to nullable string (StringDtype).

Unittest passed

Closes #33735 from itholic/SPARK-36387.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit c0441bb7e8)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-27 10:00:01 +09:00
itholic 0fc8c393b4 [SPARK-36388][SPARK-36386][PYTHON] Fix DataFrame groupby-rolling and groupby-expanding to follow pandas 1.3
This PR proposes to fix `RollingGroupBy` and `ExpandingGroupBy` to follow latest pandas behavior.

`RollingGroupBy` and `ExpandingGroupBy` no longer return the grouped-by column in the values as of pandas 1.3.

Before:
```python
>>> df = pd.DataFrame({"A": [1, 1, 2, 3], "B": [0, 1, 2, 3]})
>>> df.groupby("A").rolling(2).sum()
       A    B
A
1 0  NaN  NaN
  1  2.0  1.0
2 2  NaN  NaN
3 3  NaN  NaN
```

After:
```python
>>> df = pd.DataFrame({"A": [1, 1, 2, 3], "B": [0, 1, 2, 3]})
>>> df.groupby("A").rolling(2).sum()
       B
A
1 0  NaN
  1  1.0
2 2  NaN
3 3  NaN
```

We should follow the behavior of pandas as much as possible.

Yes, the result of `RollingGroupBy` and `ExpandingGroupBy` is changed as described above.

Unit tests.

Closes #33646 from itholic/SPARK-36388.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit b8508f4876)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-27 09:59:48 +09:00
itholic f2f09e4cdb [SPARK-36369][PYTHON] Fix Index.union to follow pandas 1.3
This PR proposes fixing the `Index.union` to follow the behavior of pandas 1.3.

Before:
```python
>>> ps_idx1 = ps.Index([1, 1, 1, 1, 1, 2, 2])
>>> ps_idx2 = ps.Index([1, 1, 2, 2, 2, 2, 2])
>>> ps_idx1.union(ps_idx2)
Int64Index([1, 1, 1, 1, 1, 2, 2], dtype='int64')
```

After:
```python
>>> ps_idx1 = ps.Index([1, 1, 1, 1, 1, 2, 2])
>>> ps_idx2 = ps.Index([1, 1, 2, 2, 2, 2, 2])
>>> ps_idx1.union(ps_idx2)
Int64Index([1, 1, 1, 1, 1, 2, 2, 2, 2, 2], dtype='int64')
```

This bug is fixed in https://github.com/pandas-dev/pandas/issues/36289.

We should follow the behavior of pandas as much as possible.

Yes, the results for some cases that have duplicate values will change.

Unit test.

Closes #33634 from itholic/SPARK-36369.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit a9f371c247)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-27 09:59:32 +09:00
Takuya UESHIN cb075b5301 [SPARK-36345][SPARK-36367][INFRA][PYTHON] Disable tests failed by the incompatible behavior of pandas 1.3
Disable tests failed by the incompatible behavior of pandas 1.3.

Pandas 1.3 has been released.
There are some behavior changes and we should follow them, but we are not ready yet.

No.

Disabled some tests related to the behavior change.

Closes #33598 from ueshin/issues/SPARK-36367/disable_tests.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 8cb9cf39b6)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-27 09:58:42 +09:00
Gengliang Wang c25f1e4347 [SPARK-36227][SQL][FOLLOWUP][3.2] Remove unused import in TimestampNTZType.scala
### What changes were proposed in this pull request?

This is a follow-up of https://github.com/apache/spark/pull/33837
It is to fix compilation error: https://github.com/apache/spark/runs/3431646840

### Why are the changes needed?

Fix a compilation error

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass existing UTs

Closes #33851 from gengliangwang/fixCompile.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-27 02:18:24 +08:00
Gengliang Wang 52b3b2d5bc [SPARK-36227][SQL][DOCS][3.2] Remove TimestampNTZ from API docs
### What changes were proposed in this pull request?

Although we tried to remove TimestampNTZ from branch 3.2 in , it still shows up in our API doc:
https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc1-docs/_site/api/scala/org/apache/spark/sql/types/TimestampNTZType.html
https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc1-docs/_site/api/java/org/apache/spark/sql/types/DataType.html

This PR is to clean it up in the API docs by
* making the TimestampNTZ type private
* removing TimestampNTZ from DataTypes

The changes are only for branch 3.2.

### Why are the changes needed?

Fix API doc
### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manually check generated docs

Closes #33837 from gengliangwang/privateNTZ.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-26 18:27:33 +08:00
Leona Yoda 36be232eea [SPARK-36541][DOCS][PYTHON] Replace the word Koalas to pandas-on-Spark
### What changes were proposed in this pull request?

Replace images in the pandas API on Spark documentation because those images use the word Koalas.

### Why are the changes needed?

Images in the "Transform and apply a function" documentation still use the word Koalas, although the word was replaced with pandas-on-Spark by this PR:
https://github.com/apache/spark/pull/32835

I think we have to match the wording in those images.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

`make html`

Screen shots
![130179112-8485fdde-b422-4834-8b23-fe69e7402118](https://user-images.githubusercontent.com/14937752/130186051-d6ff65f0-c121-40bd-b4f1-2fbc10e76f3e.png)
![130179239-8dae7812-4d81-4f8c-8558-b75e4eae3787](https://user-images.githubusercontent.com/14937752/130186063-17d4a95f-0b9d-49d3-85c7-13ea07e4b6bb.png)
![130179273-10f9fbc3-0a62-4e1a-ab6e-7049d75653a1](https://user-images.githubusercontent.com/14937752/130186074-7d684669-b9ef-4a4e-8a2d-c63bb9800ddb.png)
![130179311-616545af-dde2-4dec-807f-dde0a0d4bfbe](https://user-images.githubusercontent.com/14937752/130186095-20669673-b1d3-4552-97bf-86bbc1a5d43b.png)
Environment
- Windows 10
- Google Chrome 92.0.4515.159

[images.pptx](https://github.com/apache/spark/files/7029087/images.pptx)

Closes #33786 from yoda-mon/replace-pyspark-doc-images.

Authored-by: Leona Yoda <yodal@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit aeb3da2798)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-26 19:03:11 +09:00
itholic 0feb19c53a [SPARK-36505][PYTHON] Improve test coverage for frame.py
### What changes were proposed in this pull request?

This PR proposes improving test coverage for pandas-on-Spark DataFrame code base, which is written in `frame.py`.

This PR did the following to improve coverage:
- Add unittest for untested code
- Remove unused code
- Add arguments to some functions for testing

### Why are the changes needed?

To make the project healthier by improving coverage.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unittest.

Closes #33833 from itholic/SPARK-36505.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 97e7d6e667)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-26 17:43:12 +09:00
Cheng Su c21303f02c [SPARK-36594][SQL][3.2] ORC vectorized reader should properly check maximal number of fields
### What changes were proposed in this pull request?

This is the patch on branch-3.2 for https://github.com/apache/spark/pull/33842. See the description in the other PR.

### Why are the changes needed?

Avoid OOM/performance regression when reading ORC table with nested column types.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit test in `OrcSourceSuite.scala`.

Closes #33843 from c21/branch-3.2.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-26 14:55:21 +08:00
Max Gekk 0c364e607d [SPARK-36590][SQL] Convert special timestamp_ntz values in the session time zone
In the PR, I propose to use the session time zone (see the SQL config `spark.sql.session.timeZone`) instead of the JVM default time zone while converting special timestamp_ntz strings such as "today", "tomorrow" and so on.

The current implementation is based on the system time zone, and it contradicts other functions/classes that use the session time zone. For example, Spark doesn't respect the user's settings:
```sql
$ export TZ="Europe/Amsterdam"
$ ./bin/spark-sql -S
spark-sql> select timestamp_ntz'now';
2021-08-25 18:12:36.233

spark-sql> set spark.sql.session.timeZone=America/Los_Angeles;
spark.sql.session.timeZone	America/Los_Angeles
spark-sql> select timestamp_ntz'now';
2021-08-25 18:14:40.547
```

Yes. For the example above, after the changes:
```sql
spark-sql> select timestamp_ntz'now';
2021-08-25 18:47:46.832

spark-sql> set spark.sql.session.timeZone=America/Los_Angeles;
spark.sql.session.timeZone	America/Los_Angeles
spark-sql> select timestamp_ntz'now';
2021-08-25 09:48:05.211
```

By running the affected test suites:
```
$ build/sbt "test:testOnly *DateTimeUtilsSuite"
```

Closes #33838 from MaxGekk/fix-ts_ntz-special-values.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 159ff9fd14)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-26 10:11:37 +08:00
Max Gekk 5198c0c316 [SPARK-35581][SPARK-36567][SQL][DOCS][FOLLOWUP] Update the SQL migration guide about foldable special datetime values
### What changes were proposed in this pull request?
In the PR, I propose to update an existing item in the SQL migration guide, and mention that Spark 3.2 supports foldable special datetime values as well.
<img width="1292" alt="Screenshot 2021-08-25 at 23 29 51" src="https://user-images.githubusercontent.com/1580697/130860184-27f0ba56-6c2d-4a5a-91a8-195f2f8aa5da.png">

### Why are the changes needed?
To inform users about actual Spark SQL behavior introduced by https://github.com/apache/spark/pull/33816

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By generating docs, and checking results manually.

Closes #33840 from MaxGekk/special-datetime-cast-migr-guide.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit c4e739fb4b)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-26 10:02:15 +08:00
Dongjoon Hyun d841679ecc [MINOR][3.2] Remove unused numpy import
### What changes were proposed in this pull request?

This fixes a Python linter failure.

### Why are the changes needed?

```
flake8 checks failed:
./python/pyspark/ml/tests/test_tuning.py:21:1: F401 'numpy as np' imported but unused
import numpy as np
F401 'numpy as np' imported but unused
Error: Process completed with exit code 1.
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the GitHub Action Linter job.

Closes #33841 from dongjoon-hyun/unused_import.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-26 09:52:54 +09:00
Gengliang Wang 464841224c [SPARK-36585][SQL][DOCS] Support setting "since" version in FunctionRegistry
### What changes were proposed in this pull request?

Spark 3.2.0 includes two new functions `regexp` and `regexp_like`, which are identical to `rlike`. However, in the generated documentation, the since versions of both functions are `1.0.0` since they are based on the expression `RLike`:

- https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc1-docs/_site/api/sql/index.html#regexp
- https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc1-docs/_site/api/sql/index.html#regexp_like

This PR is to:
* Support setting `since` version in FunctionRegistry
* Correct the `since` version of `regexp` and `regexp_like`
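As a hedged sketch of the equivalence mentioned above (assumes a `SparkSession` named `spark`), the three spellings below express the same predicate:

```python
spark.sql("SELECT 'spark-3.2' RLIKE '^spark'").show()
spark.sql("SELECT regexp('spark-3.2', '^spark')").show()
spark.sql("SELECT regexp_like('spark-3.2', '^spark')").show()
```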

### Why are the changes needed?

Correct the SQL doc
### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Run
```
sh sql/create-docs.sh
```
and check the SQL doc manually

Closes #33834 from gengliangwang/allowSQLFunVersion.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 18143fb426)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-25 22:32:49 +08:00
Kousuke Saruta fb38887e00 [SPARK-36398][SQL] Redact sensitive information in Spark Thrift Server log
### What changes were proposed in this pull request?

This PR fixes an issue where there is no way to redact sensitive information in the Spark Thrift Server log.
For example, a JDBC password can be exposed in the log.
```
21/08/25 18:52:37 INFO SparkExecuteStatementOperation: Submitting query 'CREATE TABLE mytbl2(a int) OPTIONS(url="jdbc:mysql//example.com:3306", driver="com.mysql.jdbc.Driver", dbtable="test_tbl", user="test_usr", password="abcde")' with ca14ae38-1aaf-4bf4-a099-06b8e5337613
```

### Why are the changes needed?

Bug fix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Ran ThriftServer, connect to it and execute `CREATE TABLE mytbl2(a int) OPTIONS(url="jdbc:mysql//example.com:3306", driver="com.mysql.jdbc.Driver", dbtable="test_tbl", user="test_usr", password="abcde");` with `spark.sql.redaction.string.regex=((?i)(?<=password=))(".*")|('.*')`
Then, confirmed the log.
```
21/08/25 18:54:11 INFO SparkExecuteStatementOperation: Submitting query 'CREATE TABLE mytbl2(a int) OPTIONS(url="jdbc:mysql//example.com:3306", driver="com.mysql.jdbc.Driver", dbtable="test_tbl", user="test_usr", password=*********(redacted))' with ffc627e2-b1a8-4d83-ab6d-d819b3ccd909
```
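For reference, a hedged sketch of supplying the same redaction regex programmatically (an assumption for illustration; the test above simply set it as a Spark SQL config):

```python
from pyspark.sql import SparkSession

# Regex copied from the test description above.
redaction_regex = r"""((?i)(?<=password=))(".*")|('.*')"""

spark = (SparkSession.builder
         .config("spark.sql.redaction.string.regex", redaction_regex)
         .getOrCreate())
```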

Closes #33832 from sarutak/fix-SPARK-36398.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
(cherry picked from commit b2ff01608f)
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-08-25 21:31:04 +09:00
Max Gekk a4c5140242 [SPARK-36567][SQL] Support foldable special datetime strings by CAST
### What changes were proposed in this pull request?
In the PR, I propose to add new correctness rule `SpecialDatetimeValues` to the final analysis phase. It replaces casts of strings to date/timestamp_ltz/timestamp_ntz by literals of such types if the strings contain special datetime values like `today`, `yesterday` and `tomorrow`, and the input strings are foldable.

### Why are the changes needed?
1. To avoid a breaking change.
2. To improve user experience with Spark SQL. After the PR https://github.com/apache/spark/pull/32714, users have to use typed literals instead of implicit casts. For instance,
at Spark 3.1:
```sql
select ts_col > 'now';
```
but the query fails at the moment, and users have to use a typed timestamp literal:
```sql
select ts_col > timestamp'now';
```

### Does this PR introduce _any_ user-facing change?
No. The previous release, 3.1, already supported the feature until it was removed by https://github.com/apache/spark/pull/32714.

### How was this patch tested?
1. Manually tested via the sql command line:
```sql
spark-sql> select cast('today' as date);
2021-08-24
spark-sql> select timestamp('today');
2021-08-24 00:00:00
spark-sql> select timestamp'tomorrow' > 'today';
true
```
2. By running new test suite:
```
$ build/sbt "sql/testOnly org.apache.spark.sql.catalyst.optimizer.SpecialDatetimeValuesSuite"
```

Closes #33816 from MaxGekk/foldable-datetime-special-values.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit df0ec56723)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-25 14:09:13 +08:00
Kousuke Saruta beabf91ea1 [SPARK-35236][SQL][DOCS][FOLLOWUP] Mention ARCHIVE as an acceptable resource type for CREATE FUNCTION statement
### What changes were proposed in this pull request?

This PR modifies `sql-ref-syntax-ddl-create-function.md` to mention `ARCHIVE` as an acceptable resource type for `CREATE FUNCTION` statement.
`ARCHIVE` is acceptable as of SPARK-35236 (#32359).
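A hedged sketch of the statement form being documented (the class name and paths are placeholders; assumes a `SparkSession` named `spark`):

```python
spark.sql("""
    CREATE FUNCTION my_udf AS 'com.example.MyUDF'
    USING JAR '/path/to/udf.jar', ARCHIVE '/path/to/resources.zip'
""")
```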

### Why are the changes needed?

To maintain the document.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`SKIP_API=1 bundle exec jekyll build`
![create-function-archive](https://user-images.githubusercontent.com/4736016/130630637-dcddfd8c-543b-4d21-997c-d2deaf917a4f.png)

Closes #33823 from sarutak/create-function-archive.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit bd0a4950ae)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-25 10:05:00 +09:00
Hyukjin Kwon 26ae9e93da [SPARK-36559][SQL][PYTHON] Create plans dedicated to distributed-sequence index for optimization
### What changes were proposed in this pull request?

This PR proposes to move distributed-sequence index implementation to SQL plan to leverage optimizations such as column pruning.

```python
import pyspark.pandas as ps
ps.set_option('compute.default_index_type', 'distributed-sequence')
ps.range(10).id.value_counts().to_frame().spark.explain()
```

**Before:**

```bash
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [count#51L DESC NULLS LAST], true, 0
   +- Exchange rangepartitioning(count#51L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=#70]
      +- HashAggregate(keys=[id#37L], functions=[count(1)], output=[__index_level_0__#48L, count#51L])
         +- Exchange hashpartitioning(id#37L, 200), ENSURE_REQUIREMENTS, [id=#67]
            +- HashAggregate(keys=[id#37L], functions=[partial_count(1)], output=[id#37L, count#63L])
               +- Project [id#37L]
                  +- Filter atleastnnonnulls(1, id#37L)
                     +- Scan ExistingRDD[__index_level_0__#36L,id#37L]
                        # ^^^ Base DataFrame created by the output RDD from zipWithIndex (and checkpointed)
```

**After:**

```bash
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [count#275L DESC NULLS LAST], true, 0
   +- Exchange rangepartitioning(count#275L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=#174]
      +- HashAggregate(keys=[id#258L], functions=[count(1)])
         +- HashAggregate(keys=[id#258L], functions=[partial_count(1)])
            +- Filter atleastnnonnulls(1, id#258L)
               +- Range (0, 10, step=1, splits=16)
                  # ^^^ Removed the Spark job execution for `zipWithIndex`
```

### Why are the changes needed?

To leverage optimization of SQL engine and avoid unnecessary shuffle to create default index.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unittests were added. Also, this PR will test all unittests in pandas API on Spark after switching the default index implementation to `distributed-sequence`.

Closes #33807 from HyukjinKwon/SPARK-36559.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 93cec49212)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-25 10:03:00 +09:00
Gengliang Wang 5463caac0d Revert "[SPARK-34415][ML] Randomization in hyperparameter optimization"
### What changes were proposed in this pull request?

Revert 397b843890 and 5a48eb8d00

### Why are the changes needed?

As discussed in https://github.com/apache/spark/pull/33800#issuecomment-904140869, there is a correctness issue in the current implementation. Let's revert the code changes from branch 3.2 and fix it on the master branch later.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

CI tests

Closes #33819 from gengliangwang/revert-SPARK-34415.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit de932f51ce)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-24 13:39:29 -07:00
yi.wu 36df86c0d0 [SPARK-36564][CORE] Fix NullPointerException in LiveRDDDistribution.toApi
### What changes were proposed in this pull request?

This PR fixes `NullPointerException` in `LiveRDDDistribution.toApi`.

### Why are the changes needed?

Looking at the stack trace, the NPE is caused by the null `exec.hostPort`. I can't get the complete log to take a close look, but my guess is that it might be because the event `SparkListenerBlockManagerAdded` is dropped or out of order.

```
21/08/23 12:26:29 ERROR AsyncEventQueue: Listener AppStatusListener threw an exception
java.lang.NullPointerException
	at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:192)
	at com.google.common.collect.MapMakerInternalMap.putIfAbsent(MapMakerInternalMap.java:3507)
	at com.google.common.collect.Interners$WeakInterner.intern(Interners.java:85)
	at org.apache.spark.status.LiveEntityHelpers$.weakIntern(LiveEntity.scala:696)
	at org.apache.spark.status.LiveRDDDistribution.toApi(LiveEntity.scala:563)
	at org.apache.spark.status.LiveRDD.$anonfun$doUpdate$4(LiveEntity.scala:629)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
	at scala.collection.mutable.HashMap$$anon$2.$anonfun$foreach$3(HashMap.scala:158)
	at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237)
	at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230)
	at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44)
	at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:158)
	at scala.collection.TraversableLike.map(TraversableLike.scala:238)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
	at org.apache.spark.status.LiveRDD.doUpdate(LiveEntity.scala:629)
	at org.apache.spark.status.LiveEntity.write(LiveEntity.scala:51)
	at org.apache.spark.status.AppStatusListener.update(AppStatusListener.scala:1206)
	at org.apache.spark.status.AppStatusListener.maybeUpdate(AppStatusListener.scala:1212)
	at org.apache.spark.status.AppStatusListener.$anonfun$onExecutorMetricsUpdate$6(AppStatusListener.scala:956)
	...
```

### Does this PR introduce _any_ user-facing change?

Yes, users will see the expected RDD info in the UI instead of the NPE error.

### How was this patch tested?

Pass existing tests.

Closes #33812 from Ngone51/fix-hostport-npe.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit d6c453aaea)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-24 13:33:51 -07:00
Gengliang Wang a313082d67 [SPARK-35535][SQL][FOLLOWUP] Move LocalScan to Catalyst package
### What changes were proposed in this pull request?

This is a follow-up of https://github.com/apache/spark/pull/32678. It moves `LocalScan` from SQL core package to Catalyst package.

### Why are the changes needed?

There are two packages for `org.apache.spark.sql.connector`:
SQL Core: https://github.com/apache/spark/tree/master/sql/core/src/main/java/org/apache/spark/sql/connector
Catalyst: https://github.com/apache/spark/tree/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector

As `LocalScan` doesn't depend on the classes of SQL Core, we should move it to catalyst.
### Does this PR introduce _any_ user-facing change?

No, the trait is not released yet.

### How was this patch tested?

Existing UT.

Closes #33826 from gengliangwang/moveLocalScan.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 5b4c216478)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-24 13:24:06 -07:00
Huaxin Gao e48de7884d [SPARK-34952][SQL][FOLLOWUP] Move aggregates to a separate package
### What changes were proposed in this pull request?
Add `aggregate` package under `sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions` and move all the aggregates (e.g. `Count`, `Max`, `Min`, etc.) there.

### Why are the changes needed?
Right now these aggregates are directly under `sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions`. It looks OK now, but we plan to add a new `filter` package under `expressions` for all the DSV2 filters. It would look strange for filters to have their own package while aggregates don't.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests

Closes #33815 from huaxingao/agg_package.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit cd2342691d)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-08-23 15:31:35 -07:00
Xinrong Meng 56c211bd6a [SPARK-36470][PYTHON] Implement CategoricalIndex.map and DatetimeIndex.map
Implement `CategoricalIndex.map` and `DatetimeIndex.map`

`MultiIndex.map` cannot be implemented in the same way as the `map` of other indexes. It should be taken care of separately if necessary.

Mapping values using input correspondence is a common operation that is supported in pandas. We shall support that as well.

Yes. `CategoricalIndex.map` and `DatetimeIndex.map` can be used now.

- CategoricalIndex.map

```py
>>> idx = ps.CategoricalIndex(['a', 'b', 'c'])
>>> idx
CategoricalIndex(['a', 'b', 'c'], categories=['a', 'b', 'c'], ordered=False, dtype='category')

>>> idx.map(lambda x: x.upper())
CategoricalIndex(['A', 'B', 'C'],  categories=['A', 'B', 'C'], ordered=False, dtype='category')

>>> pser = pd.Series([1, 2, 3], index=pd.CategoricalIndex(['a', 'b', 'c'], ordered=True))
>>> idx.map(pser)
CategoricalIndex([1, 2, 3], categories=[1, 2, 3], ordered=True, dtype='category')

>>> idx.map({'a': 'first', 'b': 'second', 'c': 'third'})
CategoricalIndex(['first', 'second', 'third'], categories=['first', 'second', 'third'], ordered=False, dtype='category')
```

- DatetimeIndex.map

```py
>>> pidx = pd.date_range(start="2020-08-08", end="2020-08-10")
>>> psidx = ps.from_pandas(pidx)

>>> mapper_dict = {
...   datetime.datetime(2020, 8, 8): datetime.datetime(2021, 8, 8),
...   datetime.datetime(2020, 8, 9): datetime.datetime(2021, 8, 9),
... }
>>> psidx.map(mapper_dict)
DatetimeIndex(['2021-08-08', '2021-08-09', 'NaT'], dtype='datetime64[ns]', freq=None)

>>> mapper_pser = pd.Series([1, 2, 3], index=pidx)
>>> psidx.map(mapper_pser)
Int64Index([1, 2, 3], dtype='int64')
>>> psidx
DatetimeIndex(['2020-08-08', '2020-08-09', '2020-08-10'], dtype='datetime64[ns]', freq=None)

>>> psidx.map(lambda x: x.strftime("%B %d, %Y, %r"))
Index(['August 08, 2020, 12:00:00 AM', 'August 09, 2020, 12:00:00 AM',
       'August 10, 2020, 12:00:00 AM'],
      dtype='object')
```

Unit tests.

Closes #33756 from xinrong-databricks/other_indexes_map.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 0b6af464dc)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-23 10:11:21 +09:00
Gengliang Wang eea7d0037e [SPARK-36557][DOCS] Update the MAVEN_OPTS in Spark build docs
### What changes were proposed in this pull request?

As Jacek Laskowski pointed out on the dev list, there is a StackOverflowError when compiling Spark with the current MAVEN_OPTS in the Spark documentation.
We should update it with `-Xss64m` to avoid the error.

### Why are the changes needed?

Correct the documentation

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manual test. The MAVEN_OPTS is consistent with our GitHub Actions build.

Closes #33804 from gengliangwang/updateBuildDoc.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 3da0e9500f)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-23 09:46:41 +09:00
Venkata krishnan Sowrirajan 0f2e318894 [SPARK-36374][FOLLOW-UP] Change config key spark.shuffle.server.mergedShuffleFileManagerImpl to spark.shuffle.push.server.mergedShuffleFileManagerImpl
### What changes were proposed in this pull request?

Minor change to rename the config key from `spark.shuffle.server.mergedShuffleFileManagerImpl` to `spark.shuffle.push.server.mergedShuffleFileManagerImpl`. This was missed in https://github.com/apache/spark/pull/33615.

### Why are the changes needed?

To keep the config names consistent

### Does this PR introduce _any_ user-facing change?

Yes, this is a change in the config key name. But the new config name changes are yet to be released. Technically there is no user facing change because of this change.

### How was this patch tested?

Existing tests.

Closes #33799 from venkata91/SPARK-36374-follow-up.

Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit 7b2842e986)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
2021-08-22 01:29:36 -05:00
Liang-Chi Hsieh 212a21ee4f [MINOR][SS][DOCS] Update doc for streaming deduplication
### What changes were proposed in this pull request?

This patch fixes an error about streaming deduplication in Structured Streaming, and also updates an item about unsupported operations.

### Why are the changes needed?

Update the user document.

### Does this PR introduce _any_ user-facing change?

No. It's a doc only change.

### How was this patch tested?

Doc only change.

Closes #33801 from viirya/minor-ss-deduplication.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 5876e04de2)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-08-21 18:20:27 -07:00