### What changes were proposed in this pull request?
Spark SQL includes a data source that can read data from other databases using JDBC.
Spark also supports the case-insensitive option `pushDownPredicate`.
According to http://spark.apache.org/docs/latest/sql-data-sources-jdbc.html, if `pushDownPredicate` is set to false, no filter is pushed down to the JDBC data source, and all filters are handled by Spark.
However, filters are currently still pushed down to the JDBC data source even when the option is set to false.
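A minimal sketch of the scenario (assumes `spark-shell` and a hypothetical PostgreSQL table `people`; connection details are made up):

```scala
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/testdb") // hypothetical database
  .option("dbtable", "people")
  .option("pushDownPredicate", "false")
  .load()
  .filter("age > 21")

// With the fix, the JDBC scan in the plan shows PushedFilters: [],
// and Spark evaluates the filter itself.
df.explain()
```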
### Why are the changes needed?
Fix a bug where `pushDownPredicate=false` failed to prevent filters from being pushed down to the JDBC data source.
### Does this PR introduce _any_ user-facing change?
No.
The output of queries will not change.
### How was this patch tested?
Jenkins test.
Closes #33822 from beliefer/SPARK-36574.
Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit fcc91cfec4)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Use `WeakReference` instead of `SoftReference` in LevelDB.
### Why are the changes needed?
(See discussion at https://github.com/apache/spark/pull/28769#issuecomment-906722390 )
"The soft reference to iterator introduced in this pr unfortunately ended up causing iterators to not be closed when they go out of scope (which would have happened earlier in the finalize)
This is because java is more conservative in cleaning up SoftReference's.
The net result was we ended up having 50k files for SHS while typically they get compacted away to 200 odd files.
Changing from SoftReference to WeakReference should make it much more aggresive in cleanup and prevent the issue - which we observed in a 3.1 SHS"
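As an illustrative sketch of the pattern (not the actual LevelDB code; all names here are made up): track open iterators through `WeakReference`s so they can be closed eagerly once unreachable, instead of lingering until memory pressure clears a `SoftReference`.

```scala
import java.lang.ref.WeakReference
import java.util.concurrent.ConcurrentLinkedQueue

class IteratorTracker[T <: AutoCloseable] {
  private val refs = new ConcurrentLinkedQueue[WeakReference[T]]()

  def track(it: T): Unit = refs.add(new WeakReference[T](it))

  // Close anything still reachable and drop entries the GC already cleared.
  def closeAll(): Unit = {
    refs.forEach { ref =>
      val it = ref.get()
      if (it != null) it.close()
    }
    refs.clear()
  }
}
```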
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests
Closes #33859 from srowen/SPARK-36603.
Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 89e907f76c)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Some of GitHub Actions workflow files do not have the Apache license header. This PR adds them.
### Why are the changes needed?
To comply with the Apache license.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
N/A
Closes #33862 from HyukjinKwon/minor-lisence.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 22c492a6b8)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
The `try_add` function allows the following inputs:
- number, number
- date, number
- date, interval
- timestamp, interval
- interval, interval
And, the `try_divide` function allows the following inputs:
- number, number
- interval, number
However, the current code only includes examples and tests for the (number, number) inputs. We should enhance the docs so users know that the functions can be used for datetime and interval operations too, as sketched below.
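For reference, hedged examples of the non-numeric combinations (assumes Spark 3.2 SQL syntax):

```scala
spark.sql("SELECT try_add(date'2021-08-01', 7)").show()                              // date, number (days)
spark.sql("SELECT try_add(date'2021-08-01', interval 1 month)").show()               // date, interval
spark.sql("SELECT try_add(timestamp'2021-08-01 09:00:00', interval 2 hours)").show() // timestamp, interval
spark.sql("SELECT try_add(interval 1 month, interval 2 months)").show()              // interval, interval
spark.sql("SELECT try_divide(interval 6 months, 2)").show()                          // interval, number
```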
### Why are the changes needed?
Improve documentation and tests.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
New UT
Also build docs for preview:
![image](https://user-images.githubusercontent.com/1097932/131212897-8aea14c8-a882-4e12-94e2-f56bde7c0367.png)
Closes #33861 from gengliangwang/enhanceTryDoc.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 8a52ad9f82)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR fixes an issue where executors are never re-scheduled if the worker they run on stops.
As a result, the application gets stuck.
You can easily reproduce this issue with the following procedure.
```
# Run master
$ sbin/start-master.sh
# Run worker 1
$ SPARK_LOG_DIR=/tmp/worker1 SPARK_PID_DIR=/tmp/worker1/ sbin/start-worker.sh -c 1 -h localhost -d /tmp/worker1 --webui-port 8081 spark://<hostname>:7077
# Run worker 2
$ SPARK_LOG_DIR=/tmp/worker2 SPARK_PID_DIR=/tmp/worker2/ sbin/start-worker.sh -c 1 -h localhost -d /tmp/worker2 --webui-port 8082 spark://<hostname>:7077
# Run Spark Shell
$ bin/spark-shell --master spark://<hostname>:7077 --executor-cores 1 --total-executor-cores 1
# Check which worker the executor runs on and then kill the worker.
$ kill <worker pid>
```
With the procedure above, we would expect the executor to be re-scheduled on the other worker, but it is not.
The reason seems to be that `Master.schedule` is never called after the worker is marked as `WorkerState.DEAD`.
So, the solution this PR proposes is to call `Master.schedule` whenever `Master.removeWorker` is called (see the sketch after the log below).
This PR also fixes an issue where `ExecutorRunner` can send an `ExecutorStateChanged` message without changing its state.
This issue causes an assertion error:
```
2021-08-13 14:05:37,991 [dispatcher-event-loop-9] ERROR: Ignoring error
java.lang.AssertionError: assertion failed: executor 0 state transfer from RUNNING to RUNNING is illegal
```
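A toy model of the fix (names are illustrative, not Spark's internals): removing a worker must always be followed by a scheduling pass, otherwise executors that lived on the dead worker are never relaunched.

```scala
case class Worker(id: String, var alive: Boolean = true)

class ToyMaster(var workers: List[Worker]) {
  private def schedule(): Unit =
    println(s"scheduling executors on ${workers.count(_.alive)} live workers")

  def removeWorker(w: Worker): Unit = {
    w.alive = false
    workers = workers.filterNot(_ == w)
    schedule() // the essence of the fix: re-schedule after every removal
  }
}
```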
### Why are the changes needed?
It's a critical bug.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manually tested with the procedure shown above and confirmed the executor is re-scheduled.
Closes #33818 from sarutak/fix-scheduling-stuck.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
(cherry picked from commit ea8c31e5ea)
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
### What changes were proposed in this pull request?
This PR is a follow-up of https://github.com/apache/spark/pull/33646 to add missing tests.
### Why are the changes needed?
Some tests are missing.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unittest
Closes #33776 from itholic/SPARK-36388-followup.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit c91ae544fd)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Change API doc for `UnivariateFeatureSelector`
### Why are the changes needed?
Make the doc look better.
### Does this PR introduce _any_ user-facing change?
Yes, API doc change.
### How was this patch tested?
Manually checked
Closes #33855 from huaxingao/ml_doc.
Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 15e42b4442)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR proposes to document `window` & `session_window` function in SQL API doc page.
Screenshot of functions:
> window
![Screenshot 2021-08-26 6:34:58 PM](https://user-images.githubusercontent.com/1317309/130939754-0ea1b55e-39d4-4205-b79d-a9508c98921c.png)
> session_window
![Screenshot 2021-08-26 6:35:19 PM](https://user-images.githubusercontent.com/1317309/130939773-b6cb4b98-88f8-4d57-a188-ee40ed7b2b08.png)
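For context, a hedged usage sketch of the two functions being documented (Spark 3.2 syntax):

```scala
// Tumbling window aggregation.
spark.sql("""
  SELECT window(ts, '10 minutes') AS w, count(*) AS cnt
  FROM VALUES (timestamp'2021-08-26 10:01:00'), (timestamp'2021-08-26 10:14:00') AS t(ts)
  GROUP BY window(ts, '10 minutes')
""").show(truncate = false)

// Session window with a 5-minute gap.
spark.sql("""
  SELECT session_window(ts, '5 minutes') AS w, count(*) AS cnt
  FROM VALUES (timestamp'2021-08-26 10:01:00'), (timestamp'2021-08-26 10:03:00') AS t(ts)
  GROUP BY session_window(ts, '5 minutes')
""").show(truncate = false)
```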
### Why are the changes needed?
Descriptions are missing for both the `window` and `session_window` functions on the SQL API page.
### Does this PR introduce _any_ user-facing change?
Yes, the descriptions of the `window` / `session_window` functions will be available on the SQL API page.
### How was this patch tested?
Only doc changes.
Closes #33846 from HeartSaVioR/SPARK-36595.
Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(cherry picked from commit bc32144a91)
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
This PR compares the 3.2.0 API doc with the latest release version, 3.1.2, and fixes the following issues:
- Add missing `Since` annotations for new APIs
- Remove leaking classes/objects from the API doc
### Why are the changes needed?
Improve API docs
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing UT
Closes #33845 from gengliangwang/SPARK-36457-3.2.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Improve the user guide document.
### Why are the changes needed?
Make the user guide clear.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Doc change only.
Closes #33854 from xuanyuanking/SPARK-35611-follow.
Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit dd3f0fa8c2)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to re-enable the tests that were disabled because of behavior differences in pandas 1.3.
- The `inplace` argument for `CategoricalDtype` functions is deprecated as of pandas 1.3 and seems buggy, so we manually create the expected results and test against them.
- Fixed `GroupBy.transform`, since it doesn't work properly for `CategoricalDtype`.
### Why are the changes needed?
We should enable the tests as much as possible even if pandas has a bug, and we should follow the behavior of the latest pandas.
### Does this PR introduce _any_ user-facing change?
Yes, `GroupBy.transform` now follows the behavior of the latest pandas.
### How was this patch tested?
Unittests.
Closes #33817 from itholic/SPARK-36537.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit fe486185c4)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to fix the behavior of `astype` for `CategoricalDtype` to follow pandas 1.3.
**Before:**
```python
>>> pcat
0 a
1 b
2 c
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> pcat.astype(CategoricalDtype(["b", "c", "a"]))
0 a
1 b
2 c
dtype: category
Categories (3, object): ['b', 'c', 'a']
```
**After:**
```python
>>> pcat
0 a
1 b
2 c
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> pcat.astype(CategoricalDtype(["b", "c", "a"]))
0 a
1 b
2 c
dtype: category
Categories (3, object): ['a', 'b', 'c'] # CategoricalDtype is not updated if dtype is the same
```
In addition, a `CategoricalDtype` is now treated as the same `dtype` if the unique values are the same:
```python
>>> pcat1 = pser.astype(CategoricalDtype(["b", "c", "a"]))
>>> pcat2 = pser.astype(CategoricalDtype(["a", "b", "c"]))
>>> pcat1.dtype == pcat2.dtype
True
```
### Why are the changes needed?
We should follow the latest pandas as much as possible.
### Does this PR introduce _any_ user-facing change?
Yes, the behavior is changed as shown in the examples above.
### How was this patch tested?
Unittest
Closes #33757 from itholic/SPARK-36368.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit f2e593bcf1)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to fix `Series.astype` when converting the datetime type to StringDtype, to match the behavior of pandas 1.3.
In pandas < 1.3:
```python
>>> pd.Series(["2020-10-27 00:00:01", None], name="datetime").astype("string")
0 2020-10-27 00:00:01
1 NaT
Name: datetime, dtype: string
```
This is changed to
```python
>>> pd.Series(["2020-10-27 00:00:01", None], name="datetime").astype("string")
0 2020-10-27 00:00:01
1 <NA>
Name: datetime, dtype: string
```
in pandas >= 1.3, so we follow the behavior of the latest pandas.
### Why are the changes needed?
Because pandas-on-Spark always follows the behavior of the latest pandas.
### Does this PR introduce _any_ user-facing change?
Yes, the behavior is changed to match the latest pandas when converting datetime to nullable string (StringDtype).
### How was this patch tested?
Unittest passed.
Closes #33735 from itholic/SPARK-36387.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit c0441bb7e8)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to fix `RollingGroupBy` and `ExpandingGroupBy` to follow the latest pandas behavior.
As of pandas 1.3, `RollingGroupBy` and `ExpandingGroupBy` no longer return the grouped-by column in values.
Before:
```python
>>> df = pd.DataFrame({"A": [1, 1, 2, 3], "B": [0, 1, 2, 3]})
>>> df.groupby("A").rolling(2).sum()
A B
A
1 0 NaN NaN
1 2.0 1.0
2 2 NaN NaN
3 3 NaN NaN
```
After:
```python
>>> df = pd.DataFrame({"A": [1, 1, 2, 3], "B": [0, 1, 2, 3]})
>>> df.groupby("A").rolling(2).sum()
B
A
1 0 NaN
1 1.0
2 2 NaN
3 3 NaN
```
### Why are the changes needed?
We should follow the behavior of pandas as much as possible.
### Does this PR introduce _any_ user-facing change?
Yes, the results of `RollingGroupBy` and `ExpandingGroupBy` are changed as described above.
### How was this patch tested?
Unit tests.
Closes #33646 from itholic/SPARK-36388.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit b8508f4876)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Disable the tests that fail because of the incompatible behavior of pandas 1.3.
### Why are the changes needed?
pandas 1.3 has been released.
There are some behavior changes that we should follow, but we are not ready yet.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Disabled some tests related to the behavior change.
Closes #33598 from ueshin/issues/SPARK-36367/disable_tests.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 8cb9cf39b6)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This is a follow-up of https://github.com/apache/spark/pull/33837
It fixes a compilation error: https://github.com/apache/spark/runs/3431646840
### Why are the changes needed?
Fix a compilation error
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass existing UTs
Closes #33851 from gengliangwang/fixCompile.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
This PR proposes to improve test coverage for the pandas-on-Spark DataFrame code base, which lives in `frame.py`.
It does the following to improve coverage:
- Add unittest for untested code
- Remove unused code
- Add arguments to some functions for testing
### Why are the changes needed?
To make the project healthier by improving coverage.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unittest.
Closes #33833 from itholic/SPARK-36505.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 97e7d6e667)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This is the patch on branch-3.2 for https://github.com/apache/spark/pull/33842. See the description in the other PR.
### Why are the changes needed?
Avoid OOM/performance regressions when reading ORC tables with nested column types.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added unit test in `OrcSourceSuite.scala`.
Closes #33843 from c21/branch-3.2.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In this PR, I propose to use the session time zone (see the SQL config `spark.sql.session.timeZone`) instead of the JVM default time zone when converting special timestamp_ntz strings such as "today", "tomorrow", and so on.
### Why are the changes needed?
The current implementation is based on the system time zone, which contradicts other functions/classes that use the session time zone. For example, Spark doesn't respect the user's settings:
```sql
$ export TZ="Europe/Amsterdam"
$ ./bin/spark-sql -S
spark-sql> select timestamp_ntz'now';
2021-08-25 18:12:36.233
spark-sql> set spark.sql.session.timeZone=America/Los_Angeles;
spark.sql.session.timeZone America/Los_Angeles
spark-sql> select timestamp_ntz'now';
2021-08-25 18:14:40.547
```
### Does this PR introduce _any_ user-facing change?
Yes. For the example above, after the changes:
```sql
spark-sql> select timestamp_ntz'now';
2021-08-25 18:47:46.832
spark-sql> set spark.sql.session.timeZone=America/Los_Angeles;
spark.sql.session.timeZone America/Los_Angeles
spark-sql> select timestamp_ntz'now';
2021-08-25 09:48:05.211
```
### How was this patch tested?
By running the affected test suites:
```
$ build/sbt "test:testOnly *DateTimeUtilsSuite"
```
Closes #33838 from MaxGekk/fix-ts_ntz-special-values.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 159ff9fd14)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose to update an existing item in the SQL migration guide, and mention that Spark 3.2 supports foldable special datetime values as well.
<img width="1292" alt="Screenshot 2021-08-25 at 23 29 51" src="https://user-images.githubusercontent.com/1580697/130860184-27f0ba56-6c2d-4a5a-91a8-195f2f8aa5da.png">
### Why are the changes needed?
To inform users about the actual Spark SQL behavior introduced by https://github.com/apache/spark/pull/33816
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
By generating docs, and checking results manually.
Closes #33840 from MaxGekk/special-datetime-cast-migr-guide.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit c4e739fb4b)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR fixes a Python linter failure.
### Why are the changes needed?
```
flake8 checks failed:
./python/pyspark/ml/tests/test_tuning.py:21:1: F401 'numpy as np' imported but unused
import numpy as np
F401 'numpy as np' imported but unused
Error: Process completed with exit code 1.
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the GitHub Action Linter job.
Closes #33841 from dongjoon-hyun/unused_import.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Spark 3.2.0 includes two new functions, `regexp` and `regexp_like`, which are identical to `rlike`. However, in the generated documentation, the `since` versions of both functions are `1.0.0`, since they are based on the expression `RLike`:
- https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc1-docs/_site/api/sql/index.html#regexp
- https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc1-docs/_site/api/sql/index.html#regexp_like
This PR is to:
* Support setting the `since` version in FunctionRegistry
* Correct the `since` version of `regexp` and `regexp_like` (see the check below)
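One way to check the recorded version (the `Since` line appears in the extended function description):

```scala
// After the fix, the output should report Since: 3.2.0 instead of 1.0.0.
spark.sql("DESCRIBE FUNCTION EXTENDED regexp_like").show(truncate = false)
```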
### Why are the changes needed?
Correct the SQL doc
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Run
```
sh sql/create-docs.sh
```
and check the SQL doc manually
Closes #33834 from gengliangwang/allowSQLFunVersion.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 18143fb426)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
This PR fixes an issue where there is no way to redact sensitive information in the Spark Thrift Server log.
For example, a JDBC password can be exposed in the log:
```
21/08/25 18:52:37 INFO SparkExecuteStatementOperation: Submitting query 'CREATE TABLE mytbl2(a int) OPTIONS(url="jdbc:mysql//example.com:3306", driver="com.mysql.jdbc.Driver", dbtable="test_tbl", user="test_usr", password="abcde")' with ca14ae38-1aaf-4bf4-a099-06b8e5337613
```
### Why are the changes needed?
Bug fix.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Ran the Thrift Server, connected to it, and executed `CREATE TABLE mytbl2(a int) OPTIONS(url="jdbc:mysql//example.com:3306", driver="com.mysql.jdbc.Driver", dbtable="test_tbl", user="test_usr", password="abcde");` with `spark.sql.redaction.string.regex=((?i)(?<=password=))(".*")|('.*')`.
Then confirmed the log:
```
21/08/25 18:54:11 INFO SparkExecuteStatementOperation: Submitting query 'CREATE TABLE mytbl2(a int) OPTIONS(url="jdbc:mysql//example.com:3306", driver="com.mysql.jdbc.Driver", dbtable="test_tbl", user="test_usr", password=*********(redacted))' with ffc627e2-b1a8-4d83-ab6d-d819b3ccd909
```
Closes #33832 from sarutak/fix-SPARK-36398.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
(cherry picked from commit b2ff01608f)
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
### What changes were proposed in this pull request?
In this PR, I propose to add a new correctness rule, `SpecialDatetimeValues`, to the final analysis phase. It replaces casts of strings to date/timestamp_ltz/timestamp_ntz with literals of those types if the strings contain special datetime values like `today`, `yesterday` and `tomorrow`, and the input strings are foldable.
### Why are the changes needed?
1. To avoid a breaking change.
2. To improve user experience with Spark SQL. After https://github.com/apache/spark/pull/32714, users have to use typed literals instead of implicit casts. For instance, this query worked in Spark 3.1:
```sql
select ts_col > 'now';
```
but the query fails at the moment, and users have to use a typed timestamp literal:
```sql
select ts_col > timestamp'now';
```
### Does this PR introduce _any_ user-facing change?
No. The previous release, 3.1, already supported the feature until it was removed by https://github.com/apache/spark/pull/32714.
### How was this patch tested?
1. Manually tested via the sql command line:
```sql
spark-sql> select cast('today' as date);
2021-08-24
spark-sql> select timestamp('today');
2021-08-24 00:00:00
spark-sql> select timestamp'tomorrow' > 'today';
true
```
2. By running new test suite:
```
$ build/sbt "sql/testOnly org.apache.spark.sql.catalyst.optimizer.SpecialDatetimeValuesSuite"
```
Closes #33816 from MaxGekk/foldable-datetime-special-values.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit df0ec56723)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR modifies `sql-ref-syntax-ddl-create-function.md` to mention `ARCHIVE` as an acceptable resource type for the `CREATE FUNCTION` statement.
`ARCHIVE` has been acceptable as of SPARK-35236 (#32359).
### Why are the changes needed?
To maintain the document.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
`SKIP_API=1 bundle exec jekyll build`
![create-function-archive](https://user-images.githubusercontent.com/4736016/130630637-dcddfd8c-543b-4d21-997c-d2deaf917a4f.png)
Closes #33823 from sarutak/create-function-archive.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit bd0a4950ae)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to move distributed-sequence index implementation to SQL plan to leverage optimizations such as column pruning.
```python
import pyspark.pandas as ps
ps.set_option('compute.default_index_type', 'distributed-sequence')
ps.range(10).id.value_counts().to_frame().spark.explain()
```
**Before:**
```bash
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [count#51L DESC NULLS LAST], true, 0
+- Exchange rangepartitioning(count#51L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=#70]
+- HashAggregate(keys=[id#37L], functions=[count(1)], output=[__index_level_0__#48L, count#51L])
+- Exchange hashpartitioning(id#37L, 200), ENSURE_REQUIREMENTS, [id=#67]
+- HashAggregate(keys=[id#37L], functions=[partial_count(1)], output=[id#37L, count#63L])
+- Project [id#37L]
+- Filter atleastnnonnulls(1, id#37L)
+- Scan ExistingRDD[__index_level_0__#36L,id#37L]
# ^^^ Base DataFrame created by the output RDD from zipWithIndex (and checkpointed)
```
**After:**
```bash
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [count#275L DESC NULLS LAST], true, 0
+- Exchange rangepartitioning(count#275L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=#174]
+- HashAggregate(keys=[id#258L], functions=[count(1)])
+- HashAggregate(keys=[id#258L], functions=[partial_count(1)])
+- Filter atleastnnonnulls(1, id#258L)
+- Range (0, 10, step=1, splits=16)
# ^^^ Removed the Spark job execution for `zipWithIndex`
```
### Why are the changes needed?
To leverage the optimizations of the SQL engine and avoid the unnecessary shuffle otherwise needed to create the default index.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unittests were added. Also, this PR runs all the unittests in pandas API on Spark after switching the default index implementation to `distributed-sequence`.
Closes #33807 from HyukjinKwon/SPARK-36559.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 93cec49212)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Revert 397b843890 and 5a48eb8d00
### Why are the changes needed?
As discussed in https://github.com/apache/spark/pull/33800#issuecomment-904140869, there is a correctness issue in the current implementation. Let's revert the code changes from branch-3.2 and fix it on the master branch later.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI tests
Closes #33819 from gengliangwang/revert-SPARK-34415.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit de932f51ce)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR fixes `NullPointerException` in `LiveRDDDistribution.toApi`.
### Why are the changes needed?
Looking at the stack trace, the NPE is caused by a null `exec.hostPort`. I can't get the complete log to take a closer look, and can only guess that the `SparkListenerBlockManagerAdded` event was dropped or arrived out of order:
```
21/08/23 12:26:29 ERROR AsyncEventQueue: Listener AppStatusListener threw an exception
java.lang.NullPointerException
at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:192)
at com.google.common.collect.MapMakerInternalMap.putIfAbsent(MapMakerInternalMap.java:3507)
at com.google.common.collect.Interners$WeakInterner.intern(Interners.java:85)
at org.apache.spark.status.LiveEntityHelpers$.weakIntern(LiveEntity.scala:696)
at org.apache.spark.status.LiveRDDDistribution.toApi(LiveEntity.scala:563)
at org.apache.spark.status.LiveRDD.$anonfun$doUpdate$4(LiveEntity.scala:629)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
at scala.collection.mutable.HashMap$$anon$2.$anonfun$foreach$3(HashMap.scala:158)
at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237)
at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44)
at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:158)
at scala.collection.TraversableLike.map(TraversableLike.scala:238)
at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
at scala.collection.AbstractTraversable.map(Traversable.scala:108)
at org.apache.spark.status.LiveRDD.doUpdate(LiveEntity.scala:629)
at org.apache.spark.status.LiveEntity.write(LiveEntity.scala:51)
at org.apache.spark.status.AppStatusListener.update(AppStatusListener.scala:1206)
at org.apache.spark.status.AppStatusListener.maybeUpdate(AppStatusListener.scala:1212)
at org.apache.spark.status.AppStatusListener.$anonfun$onExecutorMetricsUpdate$6(AppStatusListener.scala:956)
...
```
### Does this PR introduce _any_ user-facing change?
Yes, users will see the expected RDD info in the UI instead of the NPE error.
### How was this patch tested?
Pass existing tests.
Closes #33812 from Ngone51/fix-hostport-npe.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit d6c453aaea)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Add an `aggregate` package under `sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions` and move all the aggregates (e.g. `Count`, `Max`, `Min`, etc.) there, as shown below.
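After the move, the classes resolve from the new package, e.g.:

```scala
// DSV2 aggregate push-down classes under the new `aggregate` package.
import org.apache.spark.sql.connector.expressions.aggregate.{Count, Max, Min}
```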
### Why are the changes needed?
Right now these aggregates live directly under `sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions`. That looks OK now, but we plan to add a new `filter` package under `expressions` for all the DSV2 filters, and it would look strange for filters to have their own package while aggregates don't.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests
Closes #33815 from huaxingao/agg_package.
Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit cd2342691d)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
As Jacek Laskowski pointed out on the dev list, there is a StackOverflowError when compiling Spark with the MAVEN_OPTS currently given in the Spark documentation.
We should update it with `-Xss64m` to avoid the error.
### Why are the changes needed?
Correct the documentation
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manual test. The MAVEN_OPTS is consistent with our GitHub Actions build.
Closes #33804 from gengliangwang/updateBuildDoc.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 3da0e9500f)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
A minor change to rename the config key from `spark.shuffle.server.mergedShuffleFileManagerImpl` to `spark.shuffle.push.server.mergedShuffleFileManagerImpl`, as shown below. This was missed in https://github.com/apache/spark/pull/33615.
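For example, on the server side the renamed key is set like this (the value is the merged-shuffle manager used for push-based shuffle; treat the exact class name as an assumption to verify against the docs):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf().set(
  "spark.shuffle.push.server.mergedShuffleFileManagerImpl",
  "org.apache.spark.network.shuffle.RemoteBlockPushResolver")
```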
### Why are the changes needed?
To keep the config names consistent
### Does this PR introduce _any_ user-facing change?
Yes, this changes a config key name, but the new config names have not been released yet, so technically there is no user-facing change.
### How was this patch tested?
Existing tests.
Closes #33799 from venkata91/SPARK-36374-follow-up.
Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit 7b2842e986)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
### What changes were proposed in this pull request?
This patch fixes an error in the streaming deduplication documentation for Structured Streaming, and also updates an item about unsupported operations.
### Why are the changes needed?
Update the user document.
### Does this PR introduce _any_ user-facing change?
No. It's a doc only change.
### How was this patch tested?
Doc only change.
Closes #33801 from viirya/minor-ss-deduplication.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 5876e04de2)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
For Hive tables, the actual write path and the schema handling are inconsistent when `spark.sql.legacy.charVarcharAsString` is true.
This causes problems like the one described in SPARK-36552.
In this PR, we respect `spark.sql.legacy.charVarcharAsString` when generating the Hive table schema from Spark data types.
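A minimal sketch of the behavior gated by the flag (hypothetical table name; requires a Hive-enabled SparkSession):

```scala
spark.conf.set("spark.sql.legacy.charVarcharAsString", "true")
spark.sql("CREATE TABLE t_char(c CHAR(5)) USING hive")

// With the fix, the generated Hive schema stores `c` as STRING,
// consistent with the write path under the legacy flag.
spark.sql("DESCRIBE TABLE t_char").show()
```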
### Why are the changes needed?
Bug fix.
### Does this PR introduce _any_ user-facing change?
Yes, when `spark.sql.legacy.charVarcharAsString` is true, Hive tables with char/varchar columns will follow string behavior.
### How was this patch tested?
Newly added test.
Closes #33798 from yaooqinn/SPARK-36552.
Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit f918c123a0)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
* Improve the docs in `docs/job-scheduling.md`
* Add migration guide docs in `docs/core-migration-guide.md`
### Why are the changes needed?
Help users migrate.
### Does this PR introduce _any_ user-facing change?
Yes
### How was this patch tested?
Pass CI
Closes #33794 from ulysses-you/SPARK-35083-f.
Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Kent Yao <yao@apache.org>
(cherry picked from commit 90cbf9ca3e)
Signed-off-by: Kent Yao <yao@apache.org>
AuthEngineSuite was passing on some platforms (macOS) but failing on others (Linux) with an InvalidKeyException stemming from this line. We should explicitly pass AES as the key format.
### What changes were proposed in this pull request?
Changes the AuthEngine SecretKeySpec from "RAW" to "AES".
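The change boils down to the key's declared algorithm (sketch; the key material below is a placeholder):

```scala
import javax.crypto.spec.SecretKeySpec

val keyBytes = new Array[Byte](16) // placeholder 128-bit key material

// Before: new SecretKeySpec(keyBytes, "RAW") was rejected by some JCE providers.
// After: declare the algorithm explicitly so Cipher.init accepts the key everywhere.
val key = new SecretKeySpec(keyBytes, "AES")
```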
### Why are the changes needed?
Unit tests were failing on some platforms with InvalidKeyExceptions when this key was used to instantiate a Cipher.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit tests on macOS and Linux platforms.
Closes #33790 from sweisdb/patch-1.
Authored-by: sweisdb <60895808+sweisdb@users.noreply.github.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(cherry picked from commit c441c7e365)
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
After https://github.com/apache/spark/pull/32726, the Python doc build requires `sphinx-plotly-directive`.
This PR installs it in `spark-rm/Dockerfile` to make sure `do-release-docker.sh` can run successfully.
It also mentions the requirement in the README of docs.
### Why are the changes needed?
Fix the release script and update the README of docs.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manual test locally.
Closes #33797 from gengliangwang/fixReleaseDocker.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 42eebb84f5)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
When preparing Spark 3.2.0 RC1, I hit the same issue as https://github.com/apache/spark/pull/31031:
```
[INFO] Compiling 21 Scala sources and 3 Java sources to /opt/spark-rm/output/spark-3.1.0-bin-hadoop2.7/resource-managers/yarn/target/scala-2.12/test-classes ...
[ERROR] ## Exception when compiling 24 sources to /opt/spark-rm/output/spark-3.1.0-bin-hadoop2.7/resource-managers/yarn/target/scala-2.12/test-classes
java.lang.SecurityException: class "javax.servlet.SessionCookieConfig"'s signer information does not match signer information of other classes in the same package
java.lang.ClassLoader.checkCerts(ClassLoader.java:891)
java.lang.ClassLoader.preDefineClass(ClassLoader.java:661)
```
This PR applies the same fix again by downgrading scala-maven-plugin to 4.3.0.
### Why are the changes needed?
To unblock the release process.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Build test
Closes #33791 from gengliangwang/downgrade.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit f0775d215e)
Signed-off-by: Gengliang Wang <gengliang@apache.org>