### What changes were proposed in this pull request?
In the PR, I propose to update an existing item in the SQL migration guide, and mention that Spark 3.2 supports foldable special datetime values as well.
<img width="1292" alt="Screenshot 2021-08-25 at 23 29 51" src="https://user-images.githubusercontent.com/1580697/130860184-27f0ba56-6c2d-4a5a-91a8-195f2f8aa5da.png">
### Why are the changes needed?
To inform users about actual Spark SQL behavior introduced by https://github.com/apache/spark/pull/33816
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
By generating docs, and checking results manually.
Closes #33840 from MaxGekk/special-datetime-cast-migr-guide.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit c4e739fb4b)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR modifies `sql-ref-syntax-ddl-create-function.md` to mention `ARCHIVE` as an acceptable resource type for `CREATE FUNCTION` statement.
`ARCHIVE` is acceptable as of SPARK-35236 (#32359).
### Why are the changes needed?
To maintain the document.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
`SKIP_API=1 bundle exec jekyll build`
![create-function-archive](https://user-images.githubusercontent.com/4736016/130630637-dcddfd8c-543b-4d21-997c-d2deaf917a4f.png)
Closes #33823 from sarutak/create-function-archive.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit bd0a4950ae)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Revert 397b843890 and 5a48eb8d00
### Why are the changes needed?
As discussed in https://github.com/apache/spark/pull/33800#issuecomment-904140869, there is a correctness issue in the current implementation. Let's revert the code changes from branch-3.2 and fix it on the master branch later.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI tests
Closes #33819 from gengliangwang/revert-SPARK-34415.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit de932f51ce)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
As Jacek Laskowski pointed out on the dev list, there is a StackOverflowError when compiling Spark with the current MAVEN_OPTS in the Spark documentation.
We should update it with `-Xss64m` to avoid the error.
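For reference, a sketch of the documented setting (only `-Xss64m` is the addition here; the heap and code-cache sizes are illustrative values, so check the building guide for the exact ones):

```shell
# Enlarge the JVM thread stack so scalac doesn't hit StackOverflowError during compilation
export MAVEN_OPTS="-Xss64m -Xmx2g -XX:ReservedCodeCacheSize=1g"
```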
### Why are the changes needed?
Correct the documentation
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manual test. The MAVEN_OPTS value is consistent with our GitHub Actions build.
Closes #33804 from gengliangwang/updateBuildDoc.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 3da0e9500f)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Minor change renaming the config key from `spark.shuffle.server.mergedShuffleFileManagerImpl` to `spark.shuffle.push.server.mergedShuffleFileManagerImpl`. This was missed in https://github.com/apache/spark/pull/33615.
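For illustration, the renamed key in `spark-defaults.conf` might look like this (the resolver class value is a sketch based on the push-based shuffle docs):

```
spark.shuffle.push.server.mergedShuffleFileManagerImpl  org.apache.spark.network.shuffle.RemoteBlockPushResolver
```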
### Why are the changes needed?
To keep the config names consistent
### Does this PR introduce _any_ user-facing change?
Yes, this is a change in the config key name, but the new config name has not been released yet, so technically there is no user-facing change.
### How was this patch tested?
Existing tests.
Closes #33799 from venkata91/SPARK-36374-follow-up.
Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit 7b2842e986)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
### What changes were proposed in this pull request?
This patch fixes an error about streaming deduplication in Structured Streaming, and also updates an item about an unsupported operation.
### Why are the changes needed?
Update the user document.
### Does this PR introduce _any_ user-facing change?
No. It's a doc only change.
### How was this patch tested?
Doc only change.
Closes #33801 from viirya/minor-ss-deduplication.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 5876e04de2)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
* improve docs in `docs/job-scheduling.md`
* add migration guide docs in `docs/core-migration-guide.md`
### Why are the changes needed?
Help users migrate.
### Does this PR introduce _any_ user-facing change?
yes
### How was this patch tested?
Pass CI
Closes #33794 from ulysses-you/SPARK-35083-f.
Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Kent Yao <yao@apache.org>
(cherry picked from commit 90cbf9ca3e)
Signed-off-by: Kent Yao <yao@apache.org>
### What changes were proposed in this pull request?
After https://github.com/apache/spark/pull/32726, Python doc build requires `sphinx-plotly-directive`.
This PR installs it in `spark-rm/Dockerfile` to make sure `do-release-docker.sh` can run successfully.
Also, this PR mentions it in the README of docs.
### Why are the changes needed?
Fix release script and update README of docs
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manual test locally.
Closes #33797 from gengliangwang/fixReleaseDocker.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 42eebb84f5)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Add more documents and checking logic for the new options `minOffsetsPerTrigger` and `maxTriggerDelay`.
### Why are the changes needed?
Have a clear description of the behavior introduced in SPARK-35312
### Does this PR introduce _any_ user-facing change?
Yes.
If the user sets minOffsetsPerTrigger > maxOffsetsPerTrigger, the new code will throw an AnalysisException. The original behavior was to ignore maxOffsetsPerTrigger silently.
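As a sketch, the interacting Kafka source options look like this (the values are hypothetical):

```
minOffsetsPerTrigger  10000    # wait until at least this many new offsets are available
maxTriggerDelay       15m      # ...but trigger anyway after this much delay
maxOffsetsPerTrigger  100000   # must be >= minOffsetsPerTrigger, else AnalysisException
```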
### How was this patch tested?
Existing tests.
Closes #33792 from xuanyuanking/SPARK-35312-follow.
Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(cherry picked from commit a0b24019ed)
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
…mote scheduler pool files support"
This reverts commit e3902d1975. The feature is an improvement rather than a behavior change.
Closes #33789 from gengliangwang/revertDoc.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit b36b1c7e8a)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Add remote scheduler pool files support to the migration guide.
### Why are the changes needed?
To highlight this useful support.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass existing tests.
Closes #33785 from Ngone51/SPARK-35083-follow-up.
Lead-authored-by: yi.wu <yi.wu@databricks.com>
Co-authored-by: wuyi <yi.wu@databricks.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit e3902d1975)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
This is a followup of https://github.com/apache/spark/pull/30648
ANALYZE TABLE and ANALYZE TABLES are essentially the same command, so it's odd to put them on 2 different doc pages. This PR proposes to merge them into one doc page.
### Why are the changes needed?
simplify the doc
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
N/A
Closes #33781 from cloud-fan/doc.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 07d173a8b0)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Refine the SQL reference doc:
- remove useless subitems in the sidebar
- remove useless sub-menu-pages (e.g. `sql-ref-syntax-aux.md`)
- avoid using `#####` in `sql-ref-literals.md`
### Why are the changes needed?
The subitems in the sidebar are quite useless, as the menu page serves the same functionality:
<img width="1040" alt="WX20210817-2358402x" src="https://user-images.githubusercontent.com/3182036/129765924-d7e69bc1-e351-4581-a6de-f2468022f372.png">
It's also extra work to keep the menu page and sidebar subitems in sync (the ANSI compliance page is already out of sync).
The sub-menu-pages are only referenced by the sidebar, and duplicate the content of the menu page. As a result, `sql-ref-syntax-aux.md` is already outdated compared to the menu page. It's easier to just look at the menu page.
The `#####` is not rendered properly:
<img width="776" alt="WX20210818-0001192x" src="https://user-images.githubusercontent.com/3182036/129766760-6f385443-e597-44aa-888d-14d128d45f84.png">
It's better to avoid using it.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
N/A
Closes #33767 from cloud-fan/doc.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 4b015e8d7d)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Revert [[SPARK-35028][SQL] ANSI mode: disallow group by aliases ](https://github.com/apache/spark/pull/32129)
### Why are the changes needed?
It turns out that many users are using the group by alias feature. Spark has its own precedence rule when alias names conflict with column names in the GROUP BY clause: always use the table column. This should be reasonable and acceptable.
Also, external DBMS such as PostgreSQL and MySQL allow grouping by alias, too.
As we are going to announce ANSI mode GA in Spark 3.2, I suggest allowing the group by alias in ANSI mode.
### Does this PR introduce _any_ user-facing change?
No, the feature is not released yet.
### How was this patch tested?
Unit tests
Closes #33758 from gengliangwang/revertGroupByAlias.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 8bfb4f1e72)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Add the document for the new RocksDBStateStoreProvider.
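For context, the documented way to opt in to the new provider is via the state store provider config; a sketch (see the Structured Streaming guide for the full option list):

```
spark.sql.streaming.stateStore.providerClass  org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider
```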
### Why are the changes needed?
User guide for the new feature.
### Does this PR introduce _any_ user-facing change?
No, doc only.
### How was this patch tested?
Doc only.
Closes #33683 from xuanyuanking/SPARK-36041.
Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 3d57e00a7f)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
Document the push-based shuffle feature with a high level overview of the feature and the corresponding configuration options for both the shuffle server side and the client side. This is how the changes to the doc look in the browser ([img](https://user-images.githubusercontent.com/8871522/129231582-ad86ee2f-246f-4b42-9528-4ccd693e86d2.png))
### Why are the changes needed?
Helps users understand the feature
### Does this PR introduce _any_ user-facing change?
Docs
### How was this patch tested?
N/A
Closes #33615 from venkata91/SPARK-36374.
Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit 2270ecf32f)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
### What changes were proposed in this pull request?
This patch supports dynamic gap duration in session window.
### Why are the changes needed?
The gap duration used in session windows is currently a static value. To support more complex usage, it is better to support a dynamic gap duration, which determines the gap by looking at the current data. For example, in our use case, we may want a different gap depending on a certain column in the input rows.
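A minimal SQL sketch of a dynamic gap, assuming an `events` table with `eventTime`, `eventType` and `userId` columns (all names here are hypothetical):

```
SELECT session_window.start, session_window.end, userId, COUNT(*) AS cnt
FROM events
GROUP BY
  session_window(
    eventTime,
    CASE WHEN eventType = 'click' THEN INTERVAL 5 MINUTES
         ELSE INTERVAL 30 MINUTES END),
  userId
```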
### Does this PR introduce _any_ user-facing change?
Yes, users can specify dynamic gap duration.
### How was this patch tested?
Modified existing tests and new test.
Closes #33691 from viirya/dynamic-session-window-gap.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(cherry picked from commit 8b8d91cf64)
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
This PR proposes to prohibit update mode in streaming aggregation with session window.
UnsupportedOperationChecker will check and prohibit the case. As a side effect, this PR also simplifies the code as we can remove the implementation of iterator to support outputs of update mode.
This PR also cleans up test code via deduplicating.
### Why are the changes needed?
The semantic of "update" mode for session window based streaming aggregation is quite unclear.
For normal streaming aggregation, Spark provides outputs which can be "upsert"ed based on the grouping key. This relies on the fact that the grouping key won't change.
This doesn't hold for session-window-based streaming aggregation, as the session range changes.
If end users leverage their knowledge about streaming aggregation, they will consider the key to be grouping key + session (since they specify both in groupBy), and it's quite possible that the existing row is not updated (overwritten), ending up with different rows.
If end users consider the key to be just the grouping key, there's a small chance to upsert the session correctly, though only the last updated session will be stored, so it won't work with event-time processing where there could be multiple active sessions.
### Does this PR introduce _any_ user-facing change?
No, as we haven't released this feature.
### How was this patch tested?
Updated tests.
Closes #33689 from HeartSaVioR/SPARK-36463.
Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(cherry picked from commit ed60aaa9f1)
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
Add local-cluster mode option to submitting-applications.md
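The master URL format being documented is `local-cluster[N, cores, memoryMB]`; a hypothetical invocation (app name and jar are placeholders):

```
./bin/spark-submit --master "local-cluster[2,1,1024]" --class MyTestSuite my-tests.jar
```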
### Why are the changes needed?
Help users to find/use this option for unit tests.
### Does this PR introduce _any_ user-facing change?
Yes, docs changed.
### How was this patch tested?
`SKIP_API=1 bundle exec jekyll build`
<img width="460" alt="docchange" src="https://user-images.githubusercontent.com/87687356/127125380-6beb4601-7cf4-4876-b2c6-459454ce2a02.png">
Closes #33537 from yutoacts/SPARK-595.
Lead-authored-by: Yuto Akutsu <yuto.akutsu@jp.nttdata.com>
Co-authored-by: Yuto Akutsu <yuto.akutsu@nttdata.com>
Co-authored-by: Yuto Akutsu <87687356+yutoacts@users.noreply.github.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
(cherry picked from commit 41b011e416)
Signed-off-by: Thomas Graves <tgraves@apache.org>
### What changes were proposed in this pull request?
Add documentation for new functions try_cast/try_add/try_divide
### Why are the changes needed?
Better documentation. These new functions are useful when migrating to the ANSI dialect.
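A few illustrative queries, per the behavior these docs describe (the try_* variants return NULL where the plain operation would raise an error under the ANSI dialect):

```
SELECT try_cast('abc' AS INT);   -- NULL instead of a cast error
SELECT try_add(2147483647, 1);   -- NULL instead of an integer overflow error
SELECT try_divide(1, 0);         -- NULL instead of a divide-by-zero error
```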
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Build docs and preview:
![image](https://user-images.githubusercontent.com/1097932/128209312-34a6cc6a-a73d-4aed-8646-22b1cb7ce702.png)
Closes #33638 from gengliangwang/addDocForTry.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 8a35243fa7)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add doc for the shuffle checksum configs in `configuration.md`.
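A sketch of the configs being documented (the values shown are the defaults as I understand them; treat them as illustrative):

```
spark.shuffle.checksum.enabled    true      # compute checksums for shuffle blocks
spark.shuffle.checksum.algorithm  ADLER32   # algorithm used to compute the checksum
```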
### Why are the changes needed?
doc
### Does this PR introduce _any_ user-facing change?
No, since Spark 3.2 hasn't been released.
### How was this patch tested?
Pass existing tests.
Closes #33637 from Ngone51/SPARK-36384.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 3b92c721b5)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR removes obsolete `contributing-to-spark.md` which is not referenced from anywhere.
### Why are the changes needed?
Just clean up.
### Does this PR introduce _any_ user-facing change?
No. Users can't access contributing-to-spark.html unless they point directly to the URL.
### How was this patch tested?
Built the document and confirmed that this change doesn't affect the result.
Closes #33619 from sarutak/remove-obsolete-contribution-doc.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit c31b653806)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This is follow-up of https://github.com/apache/spark/pull/31522.
It adds docs for the new metrics of task/job commit time
### Why are the changes needed?
So that users can understand the metrics better and know that the new metrics are only for file table writes.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Build docs and preview:
![image](https://user-images.githubusercontent.com/1097932/127198210-2ab201d3-5fca-4065-ace6-0b930390380f.png)
Closes #33542 from gengliangwang/addDocForMetrics.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit c9a7ff3f36)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Update the Java doc and the JDBC data source doc, and address follow-up comments.
### Why are the changes needed?
Update the doc and address follow-up comments.
### Does this PR introduce _any_ user-facing change?
Yes, add the new JDBC option `pushDownAggregate` in JDBC data source doc.
### How was this patch tested?
manually checked
Closes #33526 from huaxingao/aggPD_followup.
Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit c8dd97d456)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
1. Update the tables at https://spark.apache.org/docs/latest/sql-ref-datatypes.html about mapping ANSI interval types to Java/Scala/SQL types.
2. Remove `CalendarIntervalType` from the table of mapping Catalyst types to SQL types.
<img width="1028" alt="Screenshot 2021-07-27 at 20 52 57" src="https://user-images.githubusercontent.com/1580697/127204790-7ccb9c64-daf2-427d-963e-b7367aaa3439.png">
<img width="1017" alt="Screenshot 2021-07-27 at 20 53 22" src="https://user-images.githubusercontent.com/1580697/127204806-a0a51950-3c2d-4198-8a22-0f6614bb1487.png">
### Why are the changes needed?
To inform users which types from language APIs should be used as ANSI interval types.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manually checking by building the docs:
```
$ SKIP_RDOC=1 SKIP_API=1 SKIP_PYTHONDOC=1 bundle exec jekyll build
```
Closes #33543 from MaxGekk/doc-interval-type-lang-api.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
(cherry picked from commit 1614d00417)
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
### What changes were proposed in this pull request?
In the PR, I propose to update the page https://spark.apache.org/docs/latest/sql-ref-datatypes.html and add information about the year-month and day-time interval types introduced by SPARK-27790.
<img width="932" alt="Screenshot 2021-07-27 at 10 38 23" src="https://user-images.githubusercontent.com/1580697/127115289-e633ca3a-2c18-49a0-a7c0-22421ae5c363.png">
### Why are the changes needed?
To inform users about new ANSI interval types, and improve UX with Spark SQL.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Should be tested by a GitHub action.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
(cherry picked from commit f4837961a9)
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Closes #33539 from sarutak/backport-SPARK-34619-3.2.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Bugfix: link to the correct location of the PySpark DataFrame documentation
### Why are the changes needed?
The current link returns "Not found"
### Does this PR introduce _any_ user-facing change?
Website fix
### How was this patch tested?
Documentation change
Closes #33420 from dominikgehl/feature/SPARK-36209.
Authored-by: Dominik Gehl <dog@open.ch>
Signed-off-by: Sean Owen <srowen@gmail.com>
(cherry picked from commit 3a1db2ddd4)
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Spark 3.2.0 will use parquet-mr 1.12.0 (or higher), which contains the column encryption feature that can be called from Spark SQL. The aim of this PR is to document the use of Parquet encryption in Spark.
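For illustration only, a sketch of the kind of Hadoop configuration the new docs cover (the key names follow parquet-mr's properties-driven crypto factory; the class and column values below are hypothetical):

```
spark.hadoop.parquet.crypto.factory.class        org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory
spark.hadoop.parquet.encryption.kms.client.class com.example.MyKmsClient
spark.hadoop.parquet.encryption.column.keys      keyA:creditCard,ssn;keyB:address
spark.hadoop.parquet.encryption.footer.key       footerKey
```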
### Why are the changes needed?
- To provide information on how to use Parquet column encryption
### Does this PR introduce _any_ user-facing change?
Yes, documents a new feature.
### How was this patch tested?
bundle exec jekyll build
Closes #32895 from ggershinsky/parquet-encryption-doc.
Authored-by: Gidon Gershinsky <ggershinsky@apple.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Update TRANSFORM's doc to match the latest code.
![image](https://user-images.githubusercontent.com/46485123/126175747-672cccbc-4e42-440f-8f1e-f00b6dc1be5f.png)
### Why are the changes needed?
Keep the doc consistent with the code.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
No
Closes #33362 from AngersZhuuuu/SPARK-36153.
Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This PR documents a new feature "native support of session window" into Structured Streaming guide doc.
Screenshots are following:
![스크린샷 2021-07-20 오후 5 04 20](https://user-images.githubusercontent.com/1317309/126284848-526ec056-1028-4a70-a1f4-ae275d4b5437.png)
![스크린샷 2021-07-20 오후 3 34 38](https://user-images.githubusercontent.com/1317309/126276763-763cf841-aef7-412a-aa03-d93273f0c850.png)
### Why are the changes needed?
This change is needed to explain a new feature to the end users.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Documentation changes.
Closes #33433 from HeartSaVioR/SPARK-36172.
Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(cherry picked from commit 0eb31a06d6)
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
This PR documents the unicode literals added in SPARK-34051 (#31096) and a past PR in `sql-ref-literals.md`.
### Why are the changes needed?
To notify users about the literals.
### Does this PR introduce _any_ user-facing change?
Yes, but it just adds a sentence.
### How was this patch tested?
Built the document and confirmed the result.
```
SKIP_API=1 bundle exec jekyll build
```
![unicode-literals](https://user-images.githubusercontent.com/4736016/126283923-944dc162-1817-47bc-a7e8-c3145225586b.png)
Closes #33434 from sarutak/unicode-literal-doc.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit ba1294ea5a)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add reference to kubernetes-client's version
### Why are the changes needed?
Running Spark on Kubernetes potentially has an upper limit on the supported Kubernetes version.
I think it is better for users to be aware of it because Kubernetes updates so fast that users tend to run Spark jobs on unsupported versions.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
SKIP_API=1 bundle exec jekyll build
Closes #33255 from yoda-mon/add-reference-kubernetes-client.
Authored-by: yoda-mon <yodal@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit eea69c122f)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR modifies the comment for `UTF8String.trimAll` and `sql-migration-guide.md`.
The comment for `UTF8String.trimAll` reads as follows.
```
Trims whitespaces ({literal <=} ASCII 32) from both ends of this string.
```
Similarly, `sql-migration-guide.md` describes the behavior of `cast` as follows.
```
In Spark 3.0, when casting string value to integral types(tinyint, smallint, int and bigint),
datetime types(date, timestamp and interval) and boolean type,
the leading and trailing whitespaces (<= ASCII 32) will be trimmed before converted to these type values,
for example, `cast(' 1\t' as int)` results `1`, `cast(' 1\t' as boolean)` results `true`,
`cast('2019-10-10\t' as date)` results the date value `2019-10-10`.
In Spark version 2.4 and below, when casting string to integrals and booleans,
it does not trim the whitespaces from both ends; the foregoing results are `null`,
while to datetimes, only the trailing spaces (= ASCII 32) are removed.
```
But SPARK-32559 (#29375) changed the behavior, and only ASCII whitespace characters have been trimmed since Spark 3.0.1.
### Why are the changes needed?
To follow the previous change.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Confirmed the document built by the following command.
```
SKIP_API=1 bundle exec jekyll build
```
Closes #33287 from sarutak/fix-utf8string-trim-issue.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 57a4f310df)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR allows the parser to parse unit list interval literals like `'3' day '10' hours '3' seconds` or `'8' years '3' months` as `YearMonthIntervalType` or `DayTimeIntervalType`.
### Why are the changes needed?
For ANSI compliance.
### Does this PR introduce _any_ user-facing change?
Yes. I noted the following things in the `sql-migration-guide.md`.
* Unit list interval literals are parsed as `YearMonthIntervalType` or `DayTimeIntervalType` instead of `CalendarIntervalType`.
* `WEEK`, `MILLISECONDS`, `MICROSECOND` and `NANOSECOND` are not valid units for unit list interval literals.
* Units of year-month and day-time cannot be mixed like `1 YEAR 2 MINUTES`.
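Some illustrative queries based on the notes above (result types as this PR describes):

```
SELECT INTERVAL '8' YEARS '3' MONTHS;            -- YearMonthIntervalType
SELECT INTERVAL '3' DAY '10' HOURS '3' SECONDS;  -- DayTimeIntervalType
-- SELECT INTERVAL '1' YEAR '2' MINUTES;         -- error: year-month and day-time units cannot be mixed
```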
### How was this patch tested?
New tests and modified tests.
Closes #32949 from sarutak/day-time-multi-units.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 8e92ef825a)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose to update the SQL migration guide, in particular the section about the migration from Spark 2.4 to 3.0. The new item informs users about the following issue:
**What**: Spark doesn't detect the encoding (charset) in CSV files with a BOM correctly. Such files can be read only in multiLine mode, and only when the CSV option `encoding` matches the actual encoding of the files. For example, Spark cannot read UTF-16BE CSV files when `encoding` is set to UTF-8, which is the default. This is the case of the current ES ticket.
**Why**: In previous Spark versions, the encoding wasn't propagated to the underlying library, which means the library tried to detect the file encoding automatically. It could succeed for some encodings when a BOM is present at the beginning of the file. Starting from version 3.0, users can specify the file encoding via the CSV option `encoding`, which has UTF-8 as the default value. Spark propagates this default to the underlying library (uniVocity), and as a consequence encoding auto-detection is turned off.
**When**: Since Spark 3.0. In particular, the commit 2df34db586 causes the issue.
**Workaround**: Enable the encoding auto-detection mechanism in uniVocity by passing null as the value of the CSV option `encoding`. A more recommended approach is to set the `encoding` option explicitly.
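A sketch of the recommended workaround in SQL (the path, view name, and `header` value are hypothetical):

```
CREATE TEMPORARY VIEW utf16_csv
USING csv
OPTIONS (path '/data/file_utf16be.csv', encoding 'UTF-16BE', multiLine 'true', header 'true');
```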
### Why are the changes needed?
To improve user experience with Spark SQL. This should help users in their migration from Spark 2.4.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Should be checked by building docs in GA/jenkins.
Closes #33300 from MaxGekk/csv-encoding-migration-guide.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit e788a3fa88)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Update the sql-performance-tuning docs to reflect that AQE is `enabled` by default rather than `disabled`.
### Why are the changes needed?
Make docs correct.
### Does this PR introduce _any_ user-facing change?
yes, docs changed.
### How was this patch tested?
Not needed.
Closes #33295 from ulysses-you/enable-AQE.
Lead-authored-by: ulysses-you <ulyssesyou18@gmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 286c231c1e)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add new configs in sql-performance-tuning docs.
* spark.sql.adaptive.coalescePartitions.parallelismFirst
* spark.sql.adaptive.coalescePartitions.minPartitionSize
* spark.sql.adaptive.autoBroadcastJoinThreshold
* spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold
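For illustration, these might appear in `spark-defaults.conf` as follows (all values here are hypothetical, not recommendations):

```
spark.sql.adaptive.coalescePartitions.parallelismFirst    false
spark.sql.adaptive.coalescePartitions.minPartitionSize    1m
spark.sql.adaptive.autoBroadcastJoinThreshold             10m
spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold   0
```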
### Why are the changes needed?
Help users find them.
### Does this PR introduce _any_ user-facing change?
yes, docs changed.
### How was this patch tested?
![image](https://user-images.githubusercontent.com/12025282/125152379-be506200-e17e-11eb-80fe-68328ba1c8f5.png)
![image](https://user-images.githubusercontent.com/12025282/125152388-d1fbc880-e17e-11eb-8515-d4a5ed33159d.png)
Closes #32960 from ulysses-you/SPARK-35813.
Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 0e9786c712)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to remove the automatic build guides of documentation in `docs/README.md`.
### Why are the changes needed?
This doesn't work very well:
1. It doesn't detect changes in RST files. But PySpark internally generates RST files, so we can't simply include them in the detection. Otherwise, it goes into an infinite loop.
2. During PySpark documentation generation, it launches some jobs to generate plot images now. This is broken with the `entr` command, and the job fails. It seems related to how `entr` creates the process internally.
3. A minor issue, but the documentation build directory was changed (`_build` -> `build` in `python/docs`).
I don't think it's worthwhile testing and fixing the docs to show a working example because dev people are already able to do it manually.
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
Manually tested.
Closes #33266 from HyukjinKwon/SPARK-36051.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit a1ce64904f)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
As described in #32831, Spark has compatibility issues when querying a view created by an
older version. The root cause is that Spark changed the auto-generated alias name. To avoid
this in the future, we could ask the user to specify explicit column names when creating
a view.
### Why are the changes needed?
Avoid compatibility issues when querying a view
### Does this PR introduce _any_ user-facing change?
Yes. Users will get an error when running the query below after this change:
```
CREATE OR REPLACE VIEW v AS SELECT CAST(t.a AS INT), to_date(t.b, 'yyyyMMdd') FROM t
```
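A version with explicit column names, which would still be accepted (a sketch; the alias names are hypothetical):

```
CREATE OR REPLACE VIEW v AS
SELECT CAST(t.a AS INT) AS a, to_date(t.b, 'yyyyMMdd') AS b FROM t
```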
### How was this patch tested?
not yet
Closes #32832 from linhongliu-db/SPARK-35686-no-auto-alias.
Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In https://issues.apache.org/jira/browse/SPARK-34862, we added support for the ORC nested column vectorized reader, and it is disabled by default for now. So we would like to add user-facing documentation for it, and users can opt in to use it if they want.
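The opt-in is a single config; a sketch (config name as introduced by SPARK-34862):

```
spark.sql.orc.enableNestedColumnVectorizedReader  true
```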
### Why are the changes needed?
To make users aware of the feature, and give them instructions for using it.
### Does this PR introduce _any_ user-facing change?
Yes, the documentation itself.
### How was this patch tested?
Manually checked the generated documentation, as below.
<img width="1153" alt="Screen Shot 2021-07-01 at 12 19 40 AM" src="https://user-images.githubusercontent.com/4629931/124083422-b0724280-da02-11eb-93aa-a25d118ba56e.png">
<img width="1147" alt="Screen Shot 2021-07-01 at 12 19 52 AM" src="https://user-images.githubusercontent.com/4629931/124083442-b5cf8d00-da02-11eb-899f-827d55b8558d.png">
Closes #33168 from c21/orc-doc.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>