### What changes were proposed in this pull request?
This patch mainly proposes to add e2e test cases in Spark for the codecs used by the main data sources.
### Why are the changes needed?
We found there are no e2e test cases for codecs used by the main data sources such as Parquet and ORC, which makes it harder for developers to identify possible bugs early. We should add such tests in Spark.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added tests.
Closes#33912 from viirya/SPARK-36670.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 5a0ae694d0)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
Fix regression in AQEShuffleReadExec when used in conjunction with Spark plugins with custom partitioning.
Signed-off-by: Andy Grove <andygrove73@gmail.com>
### What changes were proposed in this pull request?
Return `UnknownPartitioning` rather than throw an exception in `AQEShuffleReadExec`.
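A minimal sketch of the proposed fallback pattern (illustrative only, not the actual `AQEShuffleReadExec` code; the method name is made up):
```scala
import org.apache.spark.sql.catalyst.plans.physical.{HashPartitioning, Partitioning, UnknownPartitioning}

// Fall back to UnknownPartitioning instead of throwing when the shuffle's
// partitioning is not one of the recognized cases.
def reportedPartitioning(p: Partitioning, numPartitions: Int): Partitioning = p match {
  case h: HashPartitioning => h
  case _                   => UnknownPartitioning(numPartitions)
}
```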
### Why are the changes needed?
The [RAPIDS Accelerator for Apache Spark](https://github.com/NVIDIA/spark-rapids) replaces `AQEShuffleReadExec` with a custom operator that runs on the GPU. Due to changes in [SPARK-36315](dd80457ffb), Spark now throws an exception if the shuffle exchange does not have recognized partitioning, and this happens before the postStageOptimizer rules so there is no opportunity to replace this operator now.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
I am still in the process of testing this change. I will update the PR in the next few days with status.
Closes#33910 from andygrove/SPARK-36666.
Authored-by: Andy Grove <andygrove73@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit f78d8394dc)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This patch proposes to add e2e tests for using Hadoop codecs to write sequence files.
### Why are the changes needed?
To improve test coverage.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added tests.
Closes#33924 from viirya/hadoop-seq-test.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 6745d77818)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
This PR fixes the description about the possible values of `spark.sql.catalogImplementation` property.
It was added in SPARK-36153 (#33362) but the possible values are `hive` or `in-memory` rather than `true` or `false`.
### Why are the changes needed?
To fix wrong description.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
I just confirmed `in-memory` and `hive` are the valid values using the Spark shell.
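For reference, the value can be checked from the Spark shell like this (a usage sketch; the result depends on how the session was built):
```scala
// Returns "hive" or "in-memory", the two documented values.
spark.conf.get("spark.sql.catalogImplementation")
```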
Closes#33923 from sarutak/fix-doc-about-catalogImplementation.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit a5fe5d368c)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to ensure StateStoreProvider instances are properly closed for each test in StateStoreSuite/RocksDBStateStoreSuite.
### Why are the changes needed?
While this doesn't break the tests, it is bad practice and may cause nasty problems in the future.
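A minimal sketch of the intended cleanup pattern (names are illustrative, not the actual suite code):
```scala
import scala.collection.mutable

// Track every provider created during a test and close them all in the test teardown,
// so no provider instance leaks across tests.
val createdProviders = mutable.ArrayBuffer.empty[AutoCloseable]

def newTrackedProvider[T <: AutoCloseable](provider: T): T = {
  createdProviders += provider
  provider
}

def afterEachTest(): Unit = {
  createdProviders.foreach(_.close())
  createdProviders.clear()
}
```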
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing UTs
Closes#33916 from HeartSaVioR/SPARK-36667.
Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 093c2080fe)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
Promote spark.sql.execution.topKSortFallbackThreshold to a user-facing config
### Why are the changes needed?
spark.sql.execution.topKSortFallbackThreshold is currently an internal config hidden from users, with Integer.MAX_VALUE - 15 as its default. In many real-world cases, if K is very large, there can be performance issues.
It's better to leave this choice to users.
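Once user-facing, the config can be tuned per session, e.g. from the Spark shell (the value below is only an illustration):
```scala
// Queries with ORDER BY ... LIMIT k fall back to a full sort once k exceeds the threshold.
spark.conf.set("spark.sql.execution.topKSortFallbackThreshold", 10000L)
spark.sql("SELECT * FROM range(1000000) ORDER BY id LIMIT 100000").count()
```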
### Does this PR introduce _any_ user-facing change?
spark.sql.execution.topKSortFallbackThreshold is now user-facing
### How was this patch tested?
Passing GA.
Closes#33904 from yaooqinn/SPARK-36659.
Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
(cherry picked from commit 7f1ad7be18)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR fixes an issue that `sequence` builtin function causes `ArrayIndexOutOfBoundsException` if the arguments are under the condition of `start == stop && step < 0`.
This is an example.
```
SELECT sequence(timestamp'2021-08-31', timestamp'2021-08-31', -INTERVAL 1 month);
21/09/02 04:14:42 ERROR SparkSQLDriver: Failed in [SELECT sequence(timestamp'2021-08-31', timestamp'2021-08-31', -INTERVAL 1 month)]
java.lang.ArrayIndexOutOfBoundsException: 1
```
Actually, this example succeeded before SPARK-31980 (#28819) was merged.
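A quick way to re-check the query from the Spark shell (a usage sketch; after the fix it should return a single-element array instead of throwing):
```scala
spark.sql(
  "SELECT sequence(timestamp'2021-08-31', timestamp'2021-08-31', -interval 1 month)"
).show(false)
```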
### Why are the changes needed?
Bug fix.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New tests.
Closes#33895 from sarutak/fix-sequence-issue.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
(cherry picked from commit cf3bc65e69)
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
### What changes were proposed in this pull request?
This PR aims to update comments in `gen-sql-config-docs.py`.
### Why are the changes needed?
To make it up to date for the Spark 3.2.0 release.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
N/A.
Closes#33902 from williamhyun/fixtool.
Authored-by: William Hyun <william@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit b72fa5ef1c)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Two cases of using an undefined window specification, shown below, should produce a proper error message.
1. Using an undefined window specification with a window function
```
SELECT nth_value(employee_name, 2) OVER w second_highest_salary
FROM basic_pays;
```
The original error message is:
```
Window function nth_value(employee_name#x, 2, false) requires an OVER clause.
```
This is confusing because the query uses a window specification `w` that is simply not defined.
Now the error message is:
```
Window specification w is not defined in the WINDOW clause.
```
2. Using an undefined window specification with an aggregate function
```
SELECT SUM(salary) OVER w sum_salary
FROM basic_pays;
```
The original error message is:
```
Error in query: unresolved operator 'Aggregate [unresolvedwindowexpression(sum(salary#2), WindowSpecReference(w)) AS sum_salary#34]
+- SubqueryAlias spark_catalog.default.basic_pays
+- HiveTableRelation [`default`.`employees`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [name#0, dept#1, salary#2, age#3], Partition Cols: []]
```
In this case, when converting to a global Aggregate, we should skip UnresolvedWindowExpression.
Now the error message is:
```
Window specification w is not defined in the WINDOW clause.
```
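For contrast, a sketch of the first query with the window specification `w` actually defined in a WINDOW clause (table and column names follow the example above):
```scala
spark.sql("""
  SELECT nth_value(employee_name, 2) OVER w AS second_highest_salary
  FROM basic_pays
  WINDOW w AS (ORDER BY salary DESC)
""").show()
```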
### Why are the changes needed?
Provide proper error messages.
### Does this PR introduce _any_ user-facing change?
Yes, error messages are improved as described above.
### How was this patch tested?
Added UT
Closes#33892 from AngersZhuuuu/SPARK-36637.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 568ad6aa44)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Update both `DataFrame.approxQuantile` and `DataFrameStatFunctions.approxQuantile` to support overloaded definitions when multiple columns are supplied.
### Why are the changes needed?
The current type hints don't support the multi-column signature, a form that was added in Spark 2.2 (see [the approxQuantile docs](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.approxQuantile.html).) This change was also introduced to pyspark-stubs (https://github.com/zero323/pyspark-stubs/pull/552). zero323 asked me to open a PR for the upstream change.
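For reference, the multi-column form that the updated type hints describe also exists in the Scala API; a small Spark shell sketch (column names are made up):
```scala
import spark.implicits._

val df = Seq((1.0, 10.0), (2.0, 20.0), (3.0, 30.0)).toDF("a", "b")

// One array of quantiles per requested column.
val quantiles: Array[Array[Double]] =
  df.stat.approxQuantile(Array("a", "b"), Array(0.5, 0.9), 0.01)
```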
### Does this PR introduce _any_ user-facing change?
This change only affects type hints - it brings the `approxQuantile` type hints up to date with the actual code.
### How was this patch tested?
Ran `./dev/lint-python`.
Closes#33880 from carylee/master.
Authored-by: Cary Lee <cary@amperity.com>
Signed-off-by: zero323 <mszymkiewicz@gmail.com>
(cherry picked from commit 37f5ab07fa)
Signed-off-by: zero323 <mszymkiewicz@gmail.com>
### What changes were proposed in this pull request?
This PR proposes to ask users if they want to download and install SparkR when they install SparkR from CRAN.
`SPARKR_ASK_INSTALLATION` environment variable was added in case other notebook projects are affected.
### Why are the changes needed?
This is required for CRAN. Currently SparkR is removed: https://cran.r-project.org/web/packages/SparkR/index.html.
See also https://lists.apache.org/thread.html/r02b9046273a518e347dfe85f864d23d63d3502c6c1edd33df17a3b86%40%3Cdev.spark.apache.org%3E
### Does this PR introduce _any_ user-facing change?
Yes, `sparkR.session(...)` will ask if users want to download and install Spark package or not if they are in the plain R shell or `Rscript`.
### How was this patch tested?
**R shell**
Valid input (`n`):
```
> sparkR.session(master="local")
Spark not found in SPARK_HOME:
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n): n
```
```
Error in sparkCheckInstall(sparkHome, master, deployMode) :
Please make sure Spark package is installed in this machine.
- If there is one, set the path in sparkHome parameter or environment variable SPARK_HOME.
- If not, you may run install.spark function to do the job.
```
Invalid input:
```
> sparkR.session(master="local")
Spark not found in SPARK_HOME:
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n): abc
```
```
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n):
```
Valid input (`y`):
```
> sparkR.session(master="local")
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n): y
Spark not found in the cache directory. Installation will start.
MirrorUrl not provided.
Looking for preferred site from apache website...
Preferred mirror site found: https://ftp.riken.jp/net/apache/spark
Downloading spark-3.3.0 for Hadoop 2.7 from:
- https://ftp.riken.jp/net/apache/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.7.tgz
trying URL 'https://ftp.riken.jp/net/apache/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.7.tgz'
...
```
**Rscript**
```
cat tmp.R
```
```
library(SparkR, lib.loc = c(file.path(".", "R", "lib")))
sparkR.session(master="local")
```
```
Rscript tmp.R
```
Valid input (`n`):
```
Spark not found in SPARK_HOME:
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n): n
```
```
Error in sparkCheckInstall(sparkHome, master, deployMode) :
Please make sure Spark package is installed in this machine.
- If there is one, set the path in sparkHome parameter or environment variable SPARK_HOME.
- If not, you may run install.spark function to do the job.
Calls: sparkR.session -> sparkCheckInstall
```
Invalid input:
```
Spark not found in SPARK_HOME:
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n): abc
```
```
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n):
```
Valid input (`y`):
```
...
Spark not found in SPARK_HOME:
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n): y
Spark not found in the cache directory. Installation will start.
MirrorUrl not provided.
Looking for preferred site from apache website...
Preferred mirror site found: https://ftp.riken.jp/net/apache/spark
Downloading spark-3.3.0 for Hadoop 2.7 from:
...
```
`bin/sparkR` and `bin/spark-submit *.R` are not affected (tested).
Closes#33887 from HyukjinKwon/SPARK-36631.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit e983ba8fce)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add Apache license headers to the Pandas API on Spark documents.
### Why are the changes needed?
Pandas API on Spark document sources do not have license headers, while the other docs do.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
`make html`
Closes#33871 from yoda-mon/add-license-header.
Authored-by: Leona Yoda <yodal@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 77fdf5f0e4)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to upgrade `aircompressor` dependency from 1.19 to 1.21.
### Why are the changes needed?
This brings the latest bug fix for an issue that exists in `aircompressor` 1.17 ~ 1.20.
- 1e364f7133
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs.
Closes#33883 from dongjoon-hyun/SPARK-36629.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit ff8cc4b800)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR proposes to fix bugs around prefix-scan for both HDFS backed state store and RocksDB state store.
> HDFS backed state store
We did a "shallow copy" when copying the prefix map, which causes the values of the prefix map (a mutable Set) to be the same instances across multiple versions. This PR fixes it by creating a new mutable Set and copying the elements.
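An illustration of the shallow- vs. deep-copy distinction described above (types simplified; not the actual state store code):
```scala
import scala.collection.mutable

val prefixMap = mutable.HashMap("prefix1" -> mutable.Set(1, 2))

// Shallow copy: the new map still points at the *same* mutable Set instances,
// so mutating one version mutates the other.
val shallow = prefixMap.clone()

// Deep copy (the fix): create a fresh Set per key, so older versions stay untouched.
val deep = prefixMap.map { case (key, values) => key -> values.clone() }
```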
> RocksDB state store
Prefix-scan iterators are only closed on RocksDB.rollback(), which is only called in RocksDBStateStore.abort().
While the `RocksDBStateStore.abort()` method will be called for streaming session window (since it has two physical plans, one for read and one for write), other stateful operators, which have only a single read-write physical plan, call either commit or abort and don't close the iterators on commit. These unclosed iterators can be "reused" and produce incorrect outputs.
This PR ensures that resetting prefix-scan iterators is done on loading RocksDB, which was only done in rollback.
### Why are the changes needed?
Please refer to the section above for an explanation of the bugs and their fixes.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Modified UT which failed without this PR and passes with this PR.
Closes#33870 from HeartSaVioR/SPARK-36619.
Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 60a72c938a)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Spark SQL includes a data source that can read data from other databases using JDBC.
Spark also supports the case-insensitive option `pushDownPredicate`.
According to http://spark.apache.org/docs/latest/sql-data-sources-jdbc.html, if `pushDownPredicate` is set to false, no filter will be pushed down to the JDBC data source and thus all filters will be handled by Spark.
But I found that filters are still pushed down to the JDBC data source.
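A sketch of how the option is expected to behave (the URL, table, and column are placeholders):
```scala
// With pushDownPredicate=false, the filter below should be evaluated by Spark
// rather than being pushed into the database query.
val df = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/db")
  .option("dbtable", "people")
  .option("pushDownPredicate", "false")
  .load()
  .filter("age > 21")

df.explain()  // the JDBC scan should show no pushed filter for `age`
```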
### Why are the changes needed?
Fix a bug where `pushDownPredicate=false` failed to prevent filters from being pushed down to the JDBC data source.
### Does this PR introduce _any_ user-facing change?
No.
The output of queries will not change.
### How was this patch tested?
Jenkins test.
Closes#33822 from beliefer/SPARK-36574.
Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit fcc91cfec4)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Use WeakReference not SoftReference in LevelDB
### Why are the changes needed?
(See discussion at https://github.com/apache/spark/pull/28769#issuecomment-906722390 )
"The soft reference to iterator introduced in this pr unfortunately ended up causing iterators to not be closed when they go out of scope (which would have happened earlier in the finalize)
This is because java is more conservative in cleaning up SoftReference's.
The net result was we ended up having 50k files for SHS while typically they get compacted away to 200 odd files.
Changing from SoftReference to WeakReference should make it much more aggressive in cleanup and prevent the issue - which we observed in a 3.1 SHS"
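A tiny illustration of the difference between the two reference types (illustrative only, not the LevelDB code):
```scala
import java.lang.ref.{SoftReference, WeakReference}

var payload = new Array[Byte](1024)
val soft = new SoftReference(payload)
val weak = new WeakReference(payload)

// Once the strong reference is gone, a weakly referenced object is eligible for
// collection at the next GC, while a softly referenced one may be kept until the JVM
// is under memory pressure - hence the more prompt iterator (and file) cleanup.
payload = null
```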
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests
Closes#33859 from srowen/SPARK-36603.
Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 89e907f76c)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Some of GitHub Actions workflow files do not have the Apache license header. This PR adds them.
### Why are the changes needed?
To comply with the Apache license.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
N/A
Closes#33862 from HyukjinKwon/minor-lisence.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 22c492a6b8)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
The `try_add` function allows the following inputs:
- number, number
- date, number
- date, interval
- timestamp, interval
- interval, interval
And, the `try_divide` function allows the following inputs:
- number, number
- interval, number
However, in the current code, there are only examples and tests about the (number, number) inputs. We should enhance the docs to let users know that the functions can be used for datetime and interval operations too.
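For instance, hedged examples of the datetime/interval combinations listed above (runnable in the Spark shell; exact output depends on the session time zone):
```scala
spark.sql("SELECT try_add(date'2021-03-31', interval 1 month)").show()
spark.sql("SELECT try_add(timestamp'2021-03-31 10:00:00', interval 2 hours)").show()
spark.sql("SELECT try_divide(interval 6 hours, 3)").show()
```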
### Why are the changes needed?
Improve documentation and tests.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
New UT
Also build docs for preview:
![image](https://user-images.githubusercontent.com/1097932/131212897-8aea14c8-a882-4e12-94e2-f56bde7c0367.png)
Closes#33861 from gengliangwang/enhanceTryDoc.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 8a52ad9f82)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR fixes an issue that executors are never re-scheduled if the worker which the executors run on stops.
As a result, the application gets stuck.
You can easily reproduce this issue by the following procedures.
```
# Run master
$ sbin/start-master.sh
# Run worker 1
$ SPARK_LOG_DIR=/tmp/worker1 SPARK_PID_DIR=/tmp/worker1/ sbin/start-worker.sh -c 1 -h localhost -d /tmp/worker1 --webui-port 8081 spark://<hostname>:7077
# Run worker 2
$ SPARK_LOG_DIR=/tmp/worker2 SPARK_PID_DIR=/tmp/worker2/ sbin/start-worker.sh -c 1 -h localhost -d /tmp/worker2 --webui-port 8082 spark://<hostname>:7077
# Run Spark Shell
$ bin/spark-shell --master spark://<hostname>:7077 --executor-cores 1 --total-executor-cores 1
# Check which worker the executor runs on and then kill the worker.
$ kill <worker pid>
```
With the procedure above, we would expect the executor to be re-scheduled on the other worker, but it isn't.
The reason seems to be that `Master.schedule` cannot be called after the worker is marked as `WorkerState.DEAD`.
So, the solution this PR proposes is to call `Master.schedule` whenever `Master.removeWorker` is called.
This PR also fixes an issue that `ExecutorRunner` can send `ExecutorStateChanged` message without changing its state.
This issue causes an assertion error.
```
2021-08-13 14:05:37,991 [dispatcher-event-loop-9] ERROR: Ignoring errorjava.lang.AssertionError: assertion failed: executor 0 state transfer from RUNNING to RUNNING is illegal
```
### Why are the changes needed?
It's a critical bug.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manually tested with the procedure shown above and confirmed the executor is re-scheduled.
Closes#33818 from sarutak/fix-scheduling-stuck.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
(cherry picked from commit ea8c31e5ea)
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
This PR is a follow-up for https://github.com/apache/spark/pull/33646 to add missing tests.
Some tests were missing.
No
Unittest
Closes#33776 from itholic/SPARK-36388-followup.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit c91ae544fd)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Change API doc for `UnivariateFeatureSelector`
### Why are the changes needed?
Make the doc look better.
### Does this PR introduce _any_ user-facing change?
Yes, API doc change.
### How was this patch tested?
Manually checked
Closes#33855 from huaxingao/ml_doc.
Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 15e42b4442)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR proposes to document `window` & `session_window` function in SQL API doc page.
Screenshots of the functions:
> window
![Screenshot 2021-08-26 6:34:58 PM](https://user-images.githubusercontent.com/1317309/130939754-0ea1b55e-39d4-4205-b79d-a9508c98921c.png)
> session_window
![Screenshot 2021-08-26 6:35:19 PM](https://user-images.githubusercontent.com/1317309/130939773-b6cb4b98-88f8-4d57-a188-ee40ed7b2b08.png)
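A small usage sketch of the two functions being documented, runnable in the Spark shell (the input data is made up):
```scala
import org.apache.spark.sql.functions.{col, session_window, to_timestamp, window}
import spark.implicits._

val events = Seq("2021-08-26 12:00:01", "2021-08-26 12:07:30", "2021-08-26 12:30:00")
  .toDF("raw")
  .select(to_timestamp(col("raw")).as("event_time"))

// Fixed 10-minute tumbling windows vs. sessions separated by a 5-minute gap.
events.groupBy(window(col("event_time"), "10 minutes")).count().show(false)
events.groupBy(session_window(col("event_time"), "5 minutes")).count().show(false)
```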
### Why are the changes needed?
Descriptions are missing for both the `window` and `session_window` functions on the SQL API page.
### Does this PR introduce _any_ user-facing change?
Yes, the descriptions of the `window` and `session_window` functions will be available on the SQL API page.
### How was this patch tested?
Only doc changes.
Closes#33846 from HeartSaVioR/SPARK-36595.
Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(cherry picked from commit bc32144a91)
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
Compare the 3.2.0 API doc with the latest release version 3.1.2. Fix the following issues:
- Add missing `Since` annotation for new APIs
- Remove the leaking class/object in API doc
### Why are the changes needed?
Improve API docs
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing UT
Closes#33845 from gengliangwang/SPARK-36457-3.2.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Improve the user guide document.
### Why are the changes needed?
Make the user guide clear.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Doc change only.
Closes#33854 from xuanyuanking/SPARK-35611-follow.
Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit dd3f0fa8c2)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
This PR proposes to enable the tests, disabled since different behavior with pandas 1.3.
- The `inplace` argument for `CategoricalDtype` functions is deprecated as of pandas 1.3, and it seems to have a bug. So we manually created the expected results and test against them.
- Fixed `GroupBy.transform` since it doesn't work properly with `CategoricalDtype`.
We should enable the tests as much as possible even if pandas has a bug.
And we should follow the behavior of the latest pandas.
Yes, `GroupBy.transform` now follows the behavior of the latest pandas.
Unittests.
Closes#33817 from itholic/SPARK-36537.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit fe486185c4)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
This PR proposes to fix the behavior of `astype` for `CategoricalDtype` to follow pandas 1.3.
**Before:**
```python
>>> pcat
0 a
1 b
2 c
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> pcat.astype(CategoricalDtype(["b", "c", "a"]))
0 a
1 b
2 c
dtype: category
Categories (3, object): ['b', 'c', 'a']
```
**After:**
```python
>>> pcat
0 a
1 b
2 c
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> pcat.astype(CategoricalDtype(["b", "c", "a"]))
0 a
1 b
2 c
dtype: category
Categories (3, object): ['a', 'b', 'c'] # CategoricalDtype is not updated if dtype is the same
```
`CategoricalDtype` is treated as the same `dtype` if the unique values are the same.
```python
>>> pcat1 = pser.astype(CategoricalDtype(["b", "c", "a"]))
>>> pcat2 = pser.astype(CategoricalDtype(["a", "b", "c"]))
>>> pcat1.dtype == pcat2.dtype
True
```
We should follow the latest pandas as much as possible.
Yes, the behavior is changed as example in the PR description.
Unittest
Closes#33757 from itholic/SPARK-36368.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit f2e593bcf1)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
This PR proposes to fix `Series.astype` when converting datetime type to StringDtype, to match the behavior of pandas 1.3.
In pandas < 1.3,
```python
>>> pd.Series(["2020-10-27 00:00:01", None], name="datetime").astype("string")
0 2020-10-27 00:00:01
1 NaT
Name: datetime, dtype: string
```
This is changed to
```python
>>> pd.Series(["2020-10-27 00:00:01", None], name="datetime").astype("string")
0 2020-10-27 00:00:01
1 <NA>
Name: datetime, dtype: string
```
in pandas >= 1.3, so we follow the behavior of the latest pandas.
Because pandas-on-Spark always follows the behavior of the latest pandas.
Yes, the behavior is changed to match the latest pandas when converting datetime to a nullable string (StringDtype).
Unittest passed
Closes#33735 from itholic/SPARK-36387.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit c0441bb7e8)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
This PR proposes to fix `RollingGroupBy` and `ExpandingGroupBy` to follow latest pandas behavior.
`RollingGroupBy` and `ExpandingGroupBy` no longer return the grouped-by column in values as of pandas 1.3.
Before:
```python
>>> df = pd.DataFrame({"A": [1, 1, 2, 3], "B": [0, 1, 2, 3]})
>>> df.groupby("A").rolling(2).sum()
A B
A
1 0 NaN NaN
1 2.0 1.0
2 2 NaN NaN
3 3 NaN NaN
```
After:
```python
>>> df = pd.DataFrame({"A": [1, 1, 2, 3], "B": [0, 1, 2, 3]})
>>> df.groupby("A").rolling(2).sum()
B
A
1 0 NaN
1 1.0
2 2 NaN
3 3 NaN
```
We should follow the behavior of pandas as much as possible.
Yes, the result of `RollingGroupBy` and `ExpandingGroupBy` is changed as described above.
Unit tests.
Closes#33646 from itholic/SPARK-36388.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit b8508f4876)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
Disable tests failed by the incompatible behavior of pandas 1.3.
Pandas 1.3 has been released.
There are some behavior changes that we should follow, but we are not ready for them yet.
No.
Disabled some tests related to the behavior change.
Closes#33598 from ueshin/issues/SPARK-36367/disable_tests.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 8cb9cf39b6)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This is a follow-up of https://github.com/apache/spark/pull/33837
It fixes a compilation error: https://github.com/apache/spark/runs/3431646840
### Why are the changes needed?
Fix a compilation error
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass existing UTs
Closes#33851 from gengliangwang/fixCompile.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
This PR proposes improving test coverage for pandas-on-Spark DataFrame code base, which is written in `frame.py`.
This PR did the following to improve coverage:
- Add unittest for untested code
- Remove unused code
- Add arguments to some functions for testing
### Why are the changes needed?
To make the project healthier by improving coverage.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unittest.
Closes#33833 from itholic/SPARK-36505.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 97e7d6e667)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This is the patch on branch-3.2 for https://github.com/apache/spark/pull/33842. See the description in the other PR.
### Why are the changes needed?
Avoid an OOM/performance regression when reading ORC tables with nested column types.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added unit test in `OrcSourceSuite.scala`.
Closes#33843 from c21/branch-3.2.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
In the PR, I propose to use the session time zone (see the SQL config `spark.sql.session.timeZone`) instead of the JVM default time zone while converting special timestamp_ntz strings such as "today", "tomorrow" and so on.
The current implementation is based on the system time zone, which contradicts other functions/classes that use the session time zone. For example, Spark doesn't respect the user's settings:
```sql
$ export TZ="Europe/Amsterdam"
$ ./bin/spark-sql -S
spark-sql> select timestamp_ntz'now';
2021-08-25 18:12:36.233
spark-sql> set spark.sql.session.timeZone=America/Los_Angeles;
spark.sql.session.timeZone America/Los_Angeles
spark-sql> select timestamp_ntz'now';
2021-08-25 18:14:40.547
```
Yes. For the example above, after the changes:
```sql
spark-sql> select timestamp_ntz'now';
2021-08-25 18:47:46.832
spark-sql> set spark.sql.session.timeZone=America/Los_Angeles;
spark.sql.session.timeZone America/Los_Angeles
spark-sql> select timestamp_ntz'now';
2021-08-25 09:48:05.211
```
By running the affected test suites:
```
$ build/sbt "test:testOnly *DateTimeUtilsSuite"
```
Closes#33838 from MaxGekk/fix-ts_ntz-special-values.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 159ff9fd14)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose to update an existing item in the SQL migration guide, and mention that Spark 3.2 supports foldable special datetime values as well.
<img width="1292" alt="Screenshot 2021-08-25 at 23 29 51" src="https://user-images.githubusercontent.com/1580697/130860184-27f0ba56-6c2d-4a5a-91a8-195f2f8aa5da.png">
### Why are the changes needed?
To inform users about actual Spark SQL behavior introduced by https://github.com/apache/spark/pull/33816
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
By generating docs, and checking results manually.
Closes#33840 from MaxGekk/special-datetime-cast-migr-guide.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit c4e739fb4b)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This fixes a Python linter failure.
### Why are the changes needed?
```
flake8 checks failed:
./python/pyspark/ml/tests/test_tuning.py:21:1: F401 'numpy as np' imported but unused
import numpy as np
F401 'numpy as np' imported but unused
Error: Process completed with exit code 1.
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the GitHub Action Linter job.
Closes#33841 from dongjoon-hyun/unused_import.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Spark 3.2.0 includes two new functions, `regexp` and `regexp_like`, which are identical to `rlike`. However, in the generated documentation, the `since` versions of both functions are `1.0.0` because they are based on the expression `RLike`:
- https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc1-docs/_site/api/sql/index.html#regexp
- https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc1-docs/_site/api/sql/index.html#regexp_like
This PR is to:
* Support setting `since` version in FunctionRegistry
* Correct the `since` version of `regexp` and `regexp_like`
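For context, the two aliases behave like `rlike`, which can be checked quickly from the Spark shell:
```scala
spark.sql(
  "SELECT regexp('Spark', '^Sp'), regexp_like('Spark', '^Sp'), 'Spark' rlike '^Sp'"
).show()
```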
### Why are the changes needed?
Correct the SQL doc
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Run
```
sh sql/create-docs.sh
```
and check the SQL doc manually
Closes#33834 from gengliangwang/allowSQLFunVersion.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 18143fb426)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
This PR fixes an issue that there is no way to redact sensitive information in Spark Thrift Server log.
For example, JDBC password can be exposed in the log.
```
21/08/25 18:52:37 INFO SparkExecuteStatementOperation: Submitting query 'CREATE TABLE mytbl2(a int) OPTIONS(url="jdbc:mysql//example.com:3306", driver="com.mysql.jdbc.Driver", dbtable="test_tbl", user="test_usr", password="abcde")' with ca14ae38-1aaf-4bf4-a099-06b8e5337613
```
### Why are the changes needed?
Bug fix.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Ran ThriftServer, connect to it and execute `CREATE TABLE mytbl2(a int) OPTIONS(url="jdbc:mysql//example.com:3306", driver="com.mysql.jdbc.Driver", dbtable="test_tbl", user="test_usr", password="abcde");` with `spark.sql.redaction.string.regex=((?i)(?<=password=))(".*")|('.*')`
Then, confirmed the log.
```
21/08/25 18:54:11 INFO SparkExecuteStatementOperation: Submitting query 'CREATE TABLE mytbl2(a int) OPTIONS(url="jdbc:mysql//example.com:3306", driver="com.mysql.jdbc.Driver", dbtable="test_tbl", user="test_usr", password=*********(redacted))' with ffc627e2-b1a8-4d83-ab6d-d819b3ccd909
```
Closes#33832 from sarutak/fix-SPARK-36398.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
(cherry picked from commit b2ff01608f)
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
### What changes were proposed in this pull request?
In the PR, I propose to add new correctness rule `SpecialDatetimeValues` to the final analysis phase. It replaces casts of strings to date/timestamp_ltz/timestamp_ntz by literals of such types if the strings contain special datetime values like `today`, `yesterday` and `tomorrow`, and the input strings are foldable.
### Why are the changes needed?
1. To avoid a breaking change.
2. To improve the user experience with Spark SQL. After the PR https://github.com/apache/spark/pull/32714, users have to use typed literals instead of implicit casts. For instance,
in Spark 3.1 the following worked:
```sql
select ts_col > 'now';
```
but now the query fails, and users have to use a typed timestamp literal:
```sql
select ts_col > timestamp'now';
```
### Does this PR introduce _any_ user-facing change?
No. The previous release, 3.1, already supported the feature until it was removed by https://github.com/apache/spark/pull/32714.
### How was this patch tested?
1. Manually tested via the sql command line:
```sql
spark-sql> select cast('today' as date);
2021-08-24
spark-sql> select timestamp('today');
2021-08-24 00:00:00
spark-sql> select timestamp'tomorrow' > 'today';
true
```
2. By running new test suite:
```
$ build/sbt "sql/testOnly org.apache.spark.sql.catalyst.optimizer.SpecialDatetimeValuesSuite"
```
Closes#33816 from MaxGekk/foldable-datetime-special-values.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit df0ec56723)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR modifies `sql-ref-syntax-ddl-create-function.md` to mention `ARCHIVE` as an acceptable resource type for `CREATE FUNCTION` statement.
`ARCHIVE` is acceptable as of SPARK-35236 (#32359).
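For illustration, the shape of a statement using the `ARCHIVE` resource type (class name and paths are placeholders):
```scala
spark.sql("""
  CREATE FUNCTION my_func AS 'com.example.MyUDF'
  USING JAR '/path/to/my_udf.jar', ARCHIVE '/path/to/extra_resources.zip'
""")
```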
### Why are the changes needed?
To maintain the document.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
`SKIP_API=1 bundle exec jekyll build`
![create-function-archive](https://user-images.githubusercontent.com/4736016/130630637-dcddfd8c-543b-4d21-997c-d2deaf917a4f.png)
Closes#33823 from sarutak/create-function-archive.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit bd0a4950ae)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to move distributed-sequence index implementation to SQL plan to leverage optimizations such as column pruning.
```python
import pyspark.pandas as ps
ps.set_option('compute.default_index_type', 'distributed-sequence')
ps.range(10).id.value_counts().to_frame().spark.explain()
```
**Before:**
```bash
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [count#51L DESC NULLS LAST], true, 0
+- Exchange rangepartitioning(count#51L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=#70]
+- HashAggregate(keys=[id#37L], functions=[count(1)], output=[__index_level_0__#48L, count#51L])
+- Exchange hashpartitioning(id#37L, 200), ENSURE_REQUIREMENTS, [id=#67]
+- HashAggregate(keys=[id#37L], functions=[partial_count(1)], output=[id#37L, count#63L])
+- Project [id#37L]
+- Filter atleastnnonnulls(1, id#37L)
+- Scan ExistingRDD[__index_level_0__#36L,id#37L]
# ^^^ Base DataFrame created by the output RDD from zipWithIndex (and checkpointed)
```
**After:**
```bash
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [count#275L DESC NULLS LAST], true, 0
+- Exchange rangepartitioning(count#275L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=#174]
+- HashAggregate(keys=[id#258L], functions=[count(1)])
+- HashAggregate(keys=[id#258L], functions=[partial_count(1)])
+- Filter atleastnnonnulls(1, id#258L)
+- Range (0, 10, step=1, splits=16)
# ^^^ Removed the Spark job execution for `zipWithIndex`
```
### Why are the changes needed?
To leverage optimization of SQL engine and avoid unnecessary shuffle to create default index.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unittests were added. Also, this PR will test all unittests in pandas API on Spark after switching the default index implementation to `distributed-sequence`.
Closes#33807 from HyukjinKwon/SPARK-36559.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 93cec49212)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Revert 397b843890 and 5a48eb8d00
### Why are the changes needed?
As discussed in https://github.com/apache/spark/pull/33800#issuecomment-904140869, there is correctness issue in the current implementation. Let's revert the code changes from branch 3.2 and fix it on master branch later
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI tests.
Closes#33819 from gengliangwang/revert-SPARK-34415.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit de932f51ce)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR fixes `NullPointerException` in `LiveRDDDistribution.toApi`.
### Why are the changes needed?
Looking at the stack trace, the NPE is caused by a null `exec.hostPort`. I can't get the complete log to take a close look, and can only guess that it might be because the `SparkListenerBlockManagerAdded` event was dropped or arrived out of order.
```
21/08/23 12:26:29 ERROR AsyncEventQueue: Listener AppStatusListener threw an exception
java.lang.NullPointerException
at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:192)
at com.google.common.collect.MapMakerInternalMap.putIfAbsent(MapMakerInternalMap.java:3507)
at com.google.common.collect.Interners$WeakInterner.intern(Interners.java:85)
at org.apache.spark.status.LiveEntityHelpers$.weakIntern(LiveEntity.scala:696)
at org.apache.spark.status.LiveRDDDistribution.toApi(LiveEntity.scala:563)
at org.apache.spark.status.LiveRDD.$anonfun$doUpdate$4(LiveEntity.scala:629)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
at scala.collection.mutable.HashMap$$anon$2.$anonfun$foreach$3(HashMap.scala:158)
at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237)
at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44)
at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:158)
at scala.collection.TraversableLike.map(TraversableLike.scala:238)
at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
at scala.collection.AbstractTraversable.map(Traversable.scala:108)
at org.apache.spark.status.LiveRDD.doUpdate(LiveEntity.scala:629)
at org.apache.spark.status.LiveEntity.write(LiveEntity.scala:51)
at org.apache.spark.status.AppStatusListener.update(AppStatusListener.scala:1206)
at org.apache.spark.status.AppStatusListener.maybeUpdate(AppStatusListener.scala:1212)
at org.apache.spark.status.AppStatusListener.$anonfun$onExecutorMetricsUpdate$6(AppStatusListener.scala:956)
...
```
### Does this PR introduce _any_ user-facing change?
Yes, users will see the expected RDD info in the UI instead of the NPE error.
### How was this patch tested?
Pass existing tests.
Closes#33812 from Ngone51/fix-hostport-npe.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit d6c453aaea)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>