Commit graph

2961 commits

Takeshi Yamamuro c2bea045e3 [SPARK-33264][SQL][DOCS] Add a dedicated page for SQL-on-file in SQL documents
### What changes were proposed in this pull request?

This PR intends to add a dedicated page for SQL-on-file in SQL documents.
This comes from the comment: https://github.com/apache/spark/pull/30095/files#r508965149

### Why are the changes needed?

For better documentation.

### Does this PR introduce _any_ user-facing change?

<img width="544" alt="Screen Shot 2020-10-28 at 9 56 59" src="https://user-images.githubusercontent.com/692303/97378051-c1fbcb80-1904-11eb-86c0-a88c5269d41c.png">

### How was this patch tested?

N/A

Closes #30165 from maropu/DocForFile.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-10-28 11:21:35 +09:00
Stuart White 7d11d972c3 [SPARK-33246][SQL][DOCS] Correct documentation for null semantics of "NULL AND False"
### What changes were proposed in this pull request?

The documentation of the Spark SQL null semantics states that "NULL AND False" yields NULL.  This is incorrect.  "NULL AND False" yields False.

```
Seq[(java.lang.Boolean, java.lang.Boolean)](
  (null, false)
)
  .toDF("left_operand", "right_operand")
  .withColumn("AND", 'left_operand && 'right_operand)
  .show(truncate = false)

+------------+-------------+-----+
|left_operand|right_operand|AND  |
+------------+-------------+-----+
|null        |false        |false|
+------------+-------------+-----+
```

I propose the documentation be updated to reflect that "NULL AND False" yields False.

This contribution is my original work and I license it to the project under the project’s open source license.

### Why are the changes needed?

This change improves the accuracy of the documentation.

### Does this PR introduce _any_ user-facing change?

Yes.  This PR introduces a fix to the documentation.

### How was this patch tested?

Since this is only a documentation change, no tests were added.

Closes #30161 from stwhit/SPARK-33246.

Authored-by: Stuart White <stuart@spotright.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-10-28 08:36:14 +09:00
HyukjinKwon 9818f079aa [SPARK-33243][PYTHON][BUILD] Add numpydoc into documentation dependency
### What changes were proposed in this pull request?

This PR proposes to initiate the migration to NumPy documentation style (from reST style) in PySpark docstrings.
This PR also adds one migration example of `SparkContext`.

- **Before:**
    ...
    ![Screen Shot 2020-10-26 at 7 02 05 PM](https://user-images.githubusercontent.com/6477701/97161090-a8ea0200-17c0-11eb-8204-0e70d18fc571.png)
    ...
    ![Screen Shot 2020-10-26 at 7 02 09 PM](https://user-images.githubusercontent.com/6477701/97161100-aab3c580-17c0-11eb-92ad-f5ad4441ce16.png)
    ...

- **After:**

    ...
    ![Screen Shot 2020-10-26 at 7 24 08 PM](https://user-images.githubusercontent.com/6477701/97161219-d636b000-17c0-11eb-80ab-d17a570ecb4b.png)
    ...

See also https://numpydoc.readthedocs.io/en/latest/format.html

### Why are the changes needed?

There are many reasons for switching to NumPy documentation style.

1. Arguably, reST style doesn't fit well when a docstring grows large, because it provides less structure and syntax.

2. NumPy documentation style provides a more human-readable docstring format. For example, notebook users often just call `help(...)`, which renders docstrings via `pydoc`.

3. NumPy documentation style is commonly used in data science libraries, for example, pandas, NumPy, Dask, Koalas,
matplotlib, ... Using NumPy documentation style gives users a consistent documentation style.

### Does this PR introduce _any_ user-facing change?

The dependency itself doesn't change anything user-facing.
The documentation change in `SparkContext` does, as shown above.

### How was this patch tested?

Manually tested via running `cd python` and `make clean html`.

Closes #30149 from HyukjinKwon/SPARK-33243.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-27 14:03:57 +09:00
Shiqi Sun f659527727 [SPARK-30821][K8S] Handle executor failure with multiple containers
Handle executor failure with multiple containers

Added a spark property spark.kubernetes.executor.checkAllContainers,
with default being false. When it's true, the executor snapshot will
take all containers in the executor into consideration when deciding
whether the executor is in "Running" state, if the pod restart policy is
"Never". Also, added the new spark property to the doc.

### What changes were proposed in this pull request?

Checking of all containers in the executor pod when reporting executor status, if the `spark.kubernetes.executor.checkAllContainers` property is set to true.

### Why are the changes needed?

Currently, a pod remains "running" as long as there is at least one running container. This prevents Spark from noticing when a container has failed in an executor pod with multiple containers. With this change, the user can configure the behavior to be different. Namely, if any container in the executor pod has failed, either the executor process or one of its sidecars, the pod is considered to have failed, and it will be rescheduled.
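
For illustration, a minimal sketch of opting in (the property name is from this PR; everything else is illustrative):

```scala
import org.apache.spark.SparkConf

// Consider all containers (including sidecars) when deciding whether an
// executor pod is still healthy, instead of only the executor container.
val conf = new SparkConf()
  .set("spark.kubernetes.executor.checkAllContainers", "true")
```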

### Does this PR introduce _any_ user-facing change?

Yes, a new Spark property is added.
Users can now choose whether to turn on this feature using the `spark.kubernetes.executor.checkAllContainers` property.

### How was this patch tested?

A unit test was added and all tests passed.
I tried to run the integration test by following the instructions [here](https://spark.apache.org/developer-tools.html) (section "Testing K8S") and also [here](https://github.com/apache/spark/blob/master/resource-managers/kubernetes/integration-tests/README.md), but I wasn't able to run it smoothly as it fails to talk to the minikube cluster. Maybe it's because my minikube version is too new (I'm using v1.13.1)...? Since I've been trying for two days and still can't make it work, I decided to submit this PR and hopefully the Jenkins tests will pass.

Closes #29924 from huskysun/exec-sidecar-failure.

Authored-by: Shiqi Sun <s.sun@salesforce.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
2020-10-24 09:55:57 -07:00
Max Gekk ba13b94f6b [SPARK-33210][SQL] Set the rebasing mode for parquet INT96 type to EXCEPTION by default
### What changes were proposed in this pull request?
1. Set the default value for the SQL configs `spark.sql.legacy.parquet.int96RebaseModeInWrite` and `spark.sql.legacy.parquet.int96RebaseModeInRead` to `EXCEPTION`.
2. Update the SQL migration guide.

### Why are the changes needed?
The current default value `LEGACY` may lead to shifted timestamps on read or write. We should leave the decision about rebasing to users.
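
As a sketch of what the new default means in practice (config names are from this PR; the opt-in values below are illustrative, not a recommendation):

```scala
// In spark-shell: with the default EXCEPTION mode, reads/writes of ancient INT96
// timestamps that would require rebasing fail fast instead of silently shifting.
// Users who understand the implications can choose a mode explicitly:
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
```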

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
By existing test suites like `ParquetIOSuite`.

Closes #30121 from MaxGekk/int96-exception-by-default.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-22 03:04:29 +00:00
Kent Yao dcb0820433 [SPARK-32785][SQL][DOCS][FOLLOWUP] Update migration guide for incomplete interval literals
### What changes were proposed in this pull request?

Address the comments at https://github.com/apache/spark/pull/29635#discussion_r507241899 to improve the migration guide.

### Why are the changes needed?

Improve the migration guide.

### Does this PR introduce _any_ user-facing change?

No, only a doc update.

### How was this patch tested?

Passing GitHub Actions.

Closes #30113 from yaooqinn/SPARK-32785-F.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-10-21 15:51:16 +09:00
Keiji Yoshida 46ad325e56 [MINOR][DOCS] Fix the description about to_avro and from_avro functions
### What changes were proposed in this pull request?
This pull request changes the description about `to_avro` and `from_avro` functions to include Python as a supported language as the functions have been supported in Python since Apache Spark 3.0.0 [[SPARK-26856](https://issues.apache.org/jira/browse/SPARK-26856)].

### Why are the changes needed?
Same as above.

### Does this PR introduce _any_ user-facing change?
Yes. The description changed by this pull request is on https://spark.apache.org/docs/latest/sql-data-sources-avro.html#to_avro-and-from_avro.

### How was this patch tested?
Tested manually by building and checking the document in the local environment.

Closes #30105 from kjmrknsn/fix-docs-sql-data-sources-avro.

Authored-by: Keiji Yoshida <kjmrknsn@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-21 00:36:45 +09:00
liaoaoyuan97 f65a24412b [SPARK-33181][SQL][DOCS] Document Load Table Directly from File in SQL Select Reference
### What changes were proposed in this pull request?

Add a link to the feature "Run SQL on files directly" to the SQL reference documentation page.

### Why are the changes needed?

To make the SQL Reference complete.

### Does this PR introduce _any_ user-facing change?

Yes. Previously, reading from files directly in SQL was not included in the documentation (https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select.html) and not listed in from_items. The new link is added to the SELECT statement documentation, as shown below:

![image](https://user-images.githubusercontent.com/16770242/96517999-c34f3900-121e-11eb-8d56-c4ba0432855e.png)
![image](https://user-images.githubusercontent.com/16770242/96518808-8126f700-1220-11eb-8c98-fb398eee0330.png)

### How was this patch tested?

Manually built and tested

Closes #30095 from liaoaoyuan97/master.

Authored-by: liaoaoyuan97 <al3468@columbia.edu>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-20 10:23:58 +09:00
Keiji Yoshida d2f328aba6 [MINOR][DOCS] Fix the link to the pickle module page in RDD Programming Guide
### What changes were proposed in this pull request?
This pull request changes the link to the pickle module page from https://docs.python.org/2/library/pickle.html to https://docs.python.org/3/library/pickle.html in RDD Programming Guide.

### Why are the changes needed?
Python 2 is no longer supported, so it is preferable to refer to the pickle module page of Python 3.

### Does this PR introduce _any_ user-facing change?
Yes.
Before: the `Pickle` link's destination page was https://docs.python.org/2/library/pickle.html
After: the `Pickle` link's destination page is https://docs.python.org/3/library/pickle.html

### How was this patch tested?
By building the documentation site and checking that the link's destination page is changed correctly in the local environment.

Closes #30081 from kjmrknsn/docs-fix-pickle-link.

Authored-by: Keiji Yoshida <kjmrknsn@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-18 17:13:55 +09:00
Liang-Chi Hsieh 2c4599db4b [MINOR][SS][DOCS] Update Structured Streaming guide doc and update code typo
### What changes were proposed in this pull request?

This is a minor change to update the structured-streaming-programming-guide and fix typos in code.

### Why are the changes needed?

Keep the user-facing document correct and updated.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests.

Closes #30074 from viirya/ss-minor.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-16 22:18:12 -07:00
xuewei.linxuewei 306872eefa [SPARK-33139][SQL] protect setActiveSession and clearActiveSession
### What changes were proposed in this pull request?

This PR is a sub-task of [SPARK-33138](https://issues.apache.org/jira/browse/SPARK-33138). In order to make SQLConf.get reliable and stable, we need to make sure users can't pollute the SQLConf and SparkSession context by calling setActiveSession and clearActiveSession.

Change of the PR:

* Add a legacy config spark.sql.legacy.allowModifyActiveSession to fall back to the old behavior if users do need to call these two APIs.
* By default, calling these two APIs throws an exception (see the sketch after this list).
* Add two extra internal and private APIs, setActiveSessionInternal and clearActiveSessionInternal, for current internal usage.
* Change all internal references to the new internal APIs, except for SQLContext.setActive and SQLContext.clearActive.
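
A minimal sketch of the resulting behavior in spark-shell (based on the description above; the exact exception and message are not verified here):

```scala
import org.apache.spark.sql.SparkSession

SparkSession.setActiveSession(spark)
// => by default this now throws an exception; submitting with
//    --conf spark.sql.legacy.allowModifyActiveSession=true restores the old behavior.
SparkSession.clearActiveSession()
// => same treatment as above.
```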

### Why are the changes needed?

Make SQLConf.get reliable and stable.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

* Add UT in SparkSessionBuilderSuite to test the legacy config
* Existing test

Closes #30042 from leanken/leanken-SPARK-33139.

Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-16 06:05:17 +00:00
Dongjoon Hyun 8e7c39089f [SPARK-33155][K8S] spark.kubernetes.pyspark.pythonVersion allows only '3'
### What changes were proposed in this pull request?

This PR makes `spark.kubernetes.pyspark.pythonVersion` allow only `3`. In other words, it will reject `2` for `Python 2`.
- [x] Configuration description and check is updated.
- [x] Documentation is updated
- [x] Unit test cases are updated.
- [x] Docker image script is updated.

### Why are the changes needed?

After SPARK-32138, Apache Spark 3.1 dropped Python 2 support.

### Does this PR introduce _any_ user-facing change?

Yes, but Python 2 support is already dropped officially.

### How was this patch tested?

Pass the CI.

Closes #30049 from dongjoon-hyun/SPARK-DROP-PYTHON2.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-15 01:51:01 -07:00
xuewei.linxuewei dc697a8b59 [SPARK-13860][SQL] Change statistical aggregate function to return null instead of Double.NaN when divideByZero
### What changes were proposed in this pull request?

As [SPARK-13860](https://issues.apache.org/jira/browse/SPARK-13860) stated, TPCDS Query 39 returns wrong results using SparkSQL. The root cause is that when stddev_samp is applied to a single-element set, the TPCDS answer expects null, whereas SparkSQL returns Double.NaN, which causes the wrong result.

Add an extra legacy config to fall back to the NaN behavior, and return null by default to align with the TPCDS standard.

### Why are the changes needed?

SQL correctness issue.

### Does this PR introduce _any_ user-facing change?
Yes. See the SQL migration guide.

In Spark 3.1, statistical aggregation functions, including `std`, `stddev`, `stddev_samp`, `variance`, `var_samp`, `skewness`, `kurtosis`, `covar_samp`, and `corr`, will return `NULL` instead of `Double.NaN` when a divide-by-zero occurs during expression evaluation, for example, when `stddev_samp` is applied on a single-element set. In Spark version 3.0 and earlier, it returns `Double.NaN` in such a case. To restore the behavior before Spark 3.1, you can set `spark.sql.legacy.statisticalAggregate` to `true`.
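
For illustration, a minimal spark-shell sketch of the new default (the query is made up; the commented results follow the description above):

```scala
// stddev_samp over a single-element group divides by n - 1 = 0.
spark.sql("SELECT stddev_samp(col) FROM VALUES (1.0) AS t(col)").show()
// Spark 3.1 default: NULL
// With spark.sql.legacy.statisticalAggregate=true: NaN (pre-3.1 behavior)
```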

### How was this patch tested?
Updated DataFrameAggregateSuite/DataFrameWindowFunctionsSuite to test both the default and legacy behavior.
Adjusted DataFrameWindowFunctionsSuite/SQLQueryTestSuite and some R cases to the default return-null behavior.

Closes #29983 from leanken/leanken-SPARK-13860.

Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-13 13:21:45 +00:00
manubatham20 4a47b3e110 [DOC][MINOR] pySpark usage - removed repeated keyword causing confusion
### What changes were proposed in this pull request?
While explaining PySpark usage, the use of repeated synonymous words was causing confusion.
Removed the words "instead of a JAR" to keep it more readable.

### Why are the changes needed?
To keep the docs more readable and easy to understand.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No code changes, minor documentation change only. No tests added.

Closes #29956 from manubatham20/patch-1.

Authored-by: manubatham20 <manubatham2006@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-10-08 07:52:00 -05:00
Dongjoon Hyun 008a2ad1f8 [SPARK-20202][BUILD][SQL] Remove references to org.spark-project.hive (Hive 1.2.1)
### What changes were proposed in this pull request?

As of today,
- SPARK-30034 Apache Spark 3.0.0 switched its default Hive execution engine from Hive 1.2 to Hive 2.3. This removed the direct dependency on the forked Hive 1.2.1 in the Maven repository.
- SPARK-32981 Apache Spark 3.1.0(`master` branch) removed Hive 1.2 related artifacts from Apache Spark binary distributions.

This PR (SPARK-20202) aims to completely remove the following usage of the unofficial Apache Hive fork from Apache Spark master for Apache Spark 3.1.0.
```
<hive.group>org.spark-project.hive</hive.group>
<hive.version>1.2.1.spark2</hive.version>
```

For users of the forked Hive 1.2.1.spark2, Apache Spark 2.4 (LTS) and 3.0 (~ 2021.12) will continue to provide it.

### Why are the changes needed?

- First, the Apache Spark community should not use an unofficial forked release of another Apache project.
- Second, Apache Hive 1.2.1 was released on 2015-06-26, and the forked Hive `1.2.1.spark2` exposed many unfixable bugs in Apache Spark because the fork is not maintained at all. Apache Hive 2.3.0 was released on 2017-07-19 and has been used with fewer bugs compared with `1.2.1.spark2`. Many bugs still exist in the `hive-1.2` profile, and new Apache Spark unit tests have been added with the `HiveUtils.isHive23` condition so far.

### Does this PR introduce _any_ user-facing change?

No. This is a dev-only change. PRBuilder will not accept `[test-hive1.2]` on master and `branch-3.1`.

### How was this patch tested?

1. SBT/Hadoop 3.2/Hive 2.3 (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129366)
2. SBT/Hadoop 2.7/Hive 2.3 (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129382)
3. SBT/Hadoop 3.2/Hive 1.2 (This is already unsupported because Hive 1.2 doesn't work with Hadoop 3.2.)
4. SBT/Hadoop 2.7/Hive 1.2 (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129383, This is rejected)

Closes #29936 from dongjoon-hyun/SPARK-REMOVE-HIVE1.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-05 15:29:56 -07:00
Kousuke Saruta 005999721f [SPARK-33046][DOCS] Update how to build doc for Scala 2.13 with sbt
### What changes were proposed in this pull request?

This PR fixes the description how to build Spark for Scala 2.13 with sbt.
In the current doc, how to build Spark for Scala 2.13 with sbt is described as follows:
![scala-2 13-build-before](https://user-images.githubusercontent.com/4736016/94816248-80c3e900-0436-11eb-9bc2-99af5786971a.png)

But the build fails with this command because the scala-2.13 profile is not enabled and scala-parallel-collections is absent.

```
[error] /home/kou/work/oss/spark-scala-2.13/core/src/main/scala/org/apache/spark/rdd/UnionRDD.scala:23: object parallel is not a member of package collection
```

The correct command should be:
```
build/sbt -Pscala-2.13 compile
```

### Why are the changes needed?

The build command is wrong.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I checked that `sbt -Pscala-2.13` is correct with the following command:
```
build/sbt -Dscala.version=2.13.3 -Phive -Phive-thriftserver -Pyarn -Pkubernetes  compile
```

I also built the modified doc and checked the generated HTML:
![spark-scala-2 13-build-doc-after](https://user-images.githubusercontent.com/4736016/94869259-f2745500-047f-11eb-89e5-20816f3ed24d.png)

Closes #29921 from sarutak/fix-scala-2.13-build-doc.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-10-01 18:01:23 -05:00
iRakson d3dbe1a907 [SQL][DOC][MINOR] Corrects input table names in the examples of CREATE FUNCTION doc
### What changes were proposed in this pull request?
Fix Typo

### Why are the changes needed?
To maintain consistency.
The correct table name should be used for the SELECT command.

### Does this PR introduce _any_ user-facing change?
Yes. The CREATE FUNCTION doc now shows the correct table name.

### How was this patch tested?
Manually. Doc changes.

Closes #29920 from iRakson/fixTypo.

Authored-by: iRakson <raksonrakesh@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-10-01 20:50:16 +09:00
Peter Toth 28ed3a512a [SPARK-32723][WEBUI] Upgrade to jQuery 3.5.1
### What changes were proposed in this pull request?
Upgrade to the latest available version of jQuery (3.5.1).

### Why are the changes needed?
There are some CVEs reported (CVE-2020-11022, CVE-2020-11023) affecting older versions of jQuery. Although the Spark UI is read-only and those CVEs don't seem to affect Spark, using the latest version of this library can help when handling vulnerability reports from security scans.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual tests and checked the jQuery 3.5 upgrade guide.

Closes #29902 from peter-toth/SPARK-32723-upgrade-to-jquery-3.5.1.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-09-30 21:30:17 -07:00
GuoPhilipse 3bdbb5546d [SPARK-31753][SQL][DOCS][FOLLOW-UP] Add missing keywords in the SQL docs
### What changes were proposed in this pull request?
Update the sql-ref docs; the following keywords will be added in this PR:

CLUSTERED BY
SORTED BY
INTO num_buckets BUCKETS

### Why are the changes needed?
Let more users know the usage of these SQL keywords. A short example is sketched below.
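
A hedged example of the syntax these keywords belong to (the table and columns are made up for this sketch):

```scala
// In spark-shell: a bucketed, sorted table definition using the newly documented keywords.
spark.sql("""
  CREATE TABLE students (id INT, name STRING)
  USING parquet
  CLUSTERED BY (id) SORTED BY (id) INTO 4 BUCKETS
""")
```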

### Does this PR introduce _any_ user-facing change?
No
![image](https://user-images.githubusercontent.com/46367746/94428281-0a6b8080-01c3-11eb-9ff3-899f8da602ca.png)
![image](https://user-images.githubusercontent.com/46367746/94428285-0d667100-01c3-11eb-8a54-90e7641d917b.png)
![image](https://user-images.githubusercontent.com/46367746/94428288-0f303480-01c3-11eb-9e1d-023538aa6e2d.png)

### How was this patch tested?
Generated the HTML and tested.

Closes #29883 from GuoPhilipse/add-sql-missing-keywords.

Lead-authored-by: GuoPhilipse <46367746+GuoPhilipse@users.noreply.github.com>
Co-authored-by: GuoPhilipse <guofei_ok@126.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-10-01 08:15:53 +09:00
Dongjoon Hyun ece8d8e22c [SPARK-33006][K8S][DOCS] Add dynamic PVC usage example into K8s doc
### What changes were proposed in this pull request?

This updates the K8s document to describe the new dynamic PVC features.

### Why are the changes needed?

This will help the user use the new features easily.

### Does this PR introduce _any_ user-facing change?

Yes, but it's only a doc update.

### How was this patch tested?

Manual.

<img width="847" alt="Screen Shot 2020-09-28 at 3 54 53 PM" src="https://user-images.githubusercontent.com/9700541/94494923-3ed04400-01a5-11eb-81f9-127db42d4256.png">

<img width="779" alt="Screen Shot 2020-09-28 at 3 55 07 PM" src="https://user-images.githubusercontent.com/9700541/94494930-4394f800-01a5-11eb-9387-50ebc14af477.png">

Closes #29897 from dongjoon-hyun/SPARK-33006.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-09-30 09:27:57 -07:00
Dongjoon Hyun cc06266ade [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default
### What changes were proposed in this pull request?

Apache Spark 3.1's default Hadoop profile is `hadoop-3.2`. Instead of only documenting a warning, this PR aims to use a consistent and safer version of the Apache Hadoop file output committer algorithm, which is `v1`. This will prevent a silent correctness regression during migration from Apache Spark 2.4/3.0 to Apache Spark 3.1.0. Of course, if there is a user-provided configuration, `spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2`, that will still be used.

### Why are the changes needed?

Apache Spark provides multiple distributions with Hadoop 2.7 and Hadoop 3.2. `spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version` depends on the Hadoop version. Apache Hadoop 3.0 switched the default algorithm from `v1` to `v2`, and there is now a discussion to remove `v2`. We had better provide a consistent default behavior of `v1` across the various Spark distributions.

- [MAPREDUCE-7282](https://issues.apache.org/jira/browse/MAPREDUCE-7282) MR v2 commit algorithm should be deprecated and not the default

### Does this PR introduce _any_ user-facing change?

Yes. This changes the default behavior. Users can override this conf.

### How was this patch tested?

Manual.

**BEFORE (spark-3.0.1-bin-hadoop3.2)**
```scala
scala> sc.version
res0: String = 3.0.1

scala> sc.hadoopConfiguration.get("mapreduce.fileoutputcommitter.algorithm.version")
res1: String = 2
```

**AFTER**
```scala
scala> sc.hadoopConfiguration.get("mapreduce.fileoutputcommitter.algorithm.version")
res0: String = 1
```

Closes #29895 from dongjoon-hyun/SPARK-DEFAUT-COMMITTER.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-09-29 12:02:45 -07:00
Kousuke Saruta 790d9ef2d3 [SPARK-32955][DOCS] An item in the navigation bar in the WebUI has a wrong link
### What changes were proposed in this pull request?

This PR fixes a link in `_layouts/global.html`.
The item `More` in the navigation bar in the WebUI links to `api.html`, but that seems to be wrong.
This PR also removes `api.md` because it and the `api.html` generated from it are not referred to from anywhere.

### Why are the changes needed?

Fix the wrong link.

### Does this PR introduce _any_ user-facing change?

Yes. "More" item no longer links to `api.html`.

### How was this patch tested?

`SKIP_API=1 jekyll build` and confirmed that the item no longer links to `api.html`.
I also confirmed that `api.md` and `api.html` are no longer referenced from anywhere with the following command.
```
$ grep -Erl "api\.(html|md)" docs
```

Closes #29821 from sarutak/fix-api-doc-link.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-22 14:46:27 +09:00
itholic 9c653c957f [SPARK-32189][DOCS][PYTHON] Development - Setting up IDEs
### What changes were proposed in this pull request?

This PR proposes to document the way of setting up IDEs

![Screenshot 2020-09-21 10 43 12 AM](https://user-images.githubusercontent.com/44108233/93727715-5c2a6e80-fbf7-11ea-821b-555723b00bc8.png)
![Screenshot 2020-09-21 10 43 45 AM](https://user-images.githubusercontent.com/44108233/93727716-5f255f00-fbf7-11ea-9c6c-7b8a973bc511.png)

### Why are the changes needed?

To let users know how to set up IDEs.

### Does this PR introduce _any_ user-facing change?

Yes, it adds a new page in the documentation about setting up IDEs.

### How was this patch tested?

Manually built the doc.

Closes #29781 from itholic/SPARK-32189.

Authored-by: itholic <haejoon309@naver.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-21 12:29:17 +09:00
Udbhav30 88e87bc8eb [SPARK-32887][DOC] Correct the typo for SHOW TABLE
### What changes were proposed in this pull request?
Correct the typo in Show Table document

### Why are the changes needed?
The current Show Table document's syntax results in a parse error, so it is misleading to users.

### Does this PR introduce _any_ user-facing change?
Yes, the Show Table document is now corrected.

### How was this patch tested?
NA

Closes #29758 from Udbhav30/showtable.

Authored-by: Udbhav30 <u.agrawal30@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-09-17 09:25:17 -07:00
bowen.li 0549c20c6f [SPARK-32865][DOC] python section in quickstart page doesn't display SPARK_VERSION correctly
### What changes were proposed in this pull request?

In https://github.com/apache/spark/blame/master/docs/quick-start.md#L402, it should be `{{site.SPARK_VERSION}}` rather than `{site.SPARK_VERSION}`.

### Why are the changes needed?

SPARK_VERSION isn't displayed correctly, as shown below

![image](https://user-images.githubusercontent.com/1892692/93006726-d03c8680-f514-11ea-85e3-1d7cfb682ef2.png)

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

tested locally, as shown below

![image](https://user-images.githubusercontent.com/1892692/93006712-a6835f80-f514-11ea-8d78-6831c9d65265.png)

Closes #29738 from bowenli86/doc.

Authored-by: bowen.li <bowenli86@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-09-12 21:45:55 -07:00
Jungtaek Lim (HeartSaVioR) 8f61005723 [SPARK-32456][SS][FOLLOWUP] Update doc to note about using SQL statement with streaming Dataset
### What changes were proposed in this pull request?

This patch proposes to update the docs (both the SS guide doc and the Dataset dropDuplicates method doc) to leave a note about using SQL statements with a streaming Dataset.

Once end users create a temp view based on a streaming Dataset, they may stop thinking about "streaming" and do whatever they do with a batch query. In many cases it works, but not smoothly when streaming aggregation is involved; they still need to be concerned about maintaining the state store.
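
For illustration, a spark-shell sketch of the situation being called out (source, schema and sink are made up):

```scala
// A streaming Dataset registered as a temp view and then queried with plain SQL.
// It looks like a batch query, but it is still a streaming aggregation underneath,
// so state-store maintenance concerns still apply.
val events = spark.readStream.format("rate").load()   // columns: timestamp, value
events.createOrReplaceTempView("events")

val counts = spark.sql(
  "SELECT value % 10 AS bucket, count(*) AS cnt FROM events GROUP BY value % 10")

val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
```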

### Why are the changes needed?

Although SPARK-32456 fixed the weird error message, as a side effect some operations are now enabled on streaming workloads via SQL statements, which is error-prone if end users don't realize what they're doing.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Only doc change.

Closes #29461 from HeartSaVioR/SPARK-32456-FOLLOWUP-DOC.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-10 08:10:32 +00:00
HyukjinKwon c336ae39cd [SPARK-32186][DOCS][PYTHON] Development - Debugging
### What changes were proposed in this pull request?

This PR proposes to document the way of debugging PySpark. It's pretty much self-descriptive.

I made a demo site to review it more effectively: https://hyukjin-spark.readthedocs.io/en/stable/development/debugging.html

### Why are the changes needed?

To let users know how to debug PySpark applications.

### Does this PR introduce _any_ user-facing change?

Yes, it adds a new page in the documentation about debugging PySpark.

### How was this patch tested?

Manually built the doc.

Closes #29639 from HyukjinKwon/SPARK-32186.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-08 10:32:22 +09:00
Kent Yao de44e9cfa0 [SPARK-32785][SQL] Interval with dangling parts should not result in null
### What changes were proposed in this pull request?

Bugfix for incomplete interval values, e.g. interval '1', interval '1 day 2'. Currently these cases result in null, but actually we should fail them with an IllegalArgumentException.

### Why are the changes needed?

Correctness.

### Does this PR introduce _any_ user-facing change?

Yes, incomplete intervals now throw an exception.

#### before
```
bin/spark-sql -S -e "select interval '1', interval '+', interval '1 day -'"

NULL NULL NULL
```
#### after

```
-- !query
select interval '1'
-- !query schema
struct<>
-- !query output
org.apache.spark.sql.catalyst.parser.ParseException

Cannot parse the INTERVAL value: 1(line 1, pos 7)

== SQL ==
select interval '1'
```

### How was this patch tested?

unit tests added

Closes #29635 from yaooqinn/SPARK-32785.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-07 05:11:30 +00:00
Wenchen Fan ccc0250a08 [SPARK-32718][SQL] Remove unnecessary keywords for interval units
### What changes were proposed in this pull request?

Remove the YEAR, MONTH, DAY, HOUR, MINUTE, SECOND keywords. They are not useful in the parser, as we need to support plurals like YEARS, so the parser has to accept a general identifier as the interval unit anyway.

### Why are the changes needed?

These keywords are reserved in ANSI. If Spark has these keywords, then they become reserved under ANSI mode. This makes Spark unable to run TPCDS queries, as they use YEAR as an alias name.
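
A brief sketch of the consequence (assuming the ANSI-mode config and behavior described above; not taken from the PR):

```scala
// In spark-shell: with the unit keywords removed, YEAR is no longer reserved,
// so an ANSI-mode query can still use it as an alias, and plural units keep working.
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT 2020 AS year, interval 2 years 3 months AS period").show()
```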

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added `TPCDSQueryANSISuite`, to make sure Spark with ANSI mode can run TPCDS queries.

Closes #29560 from cloud-fan/keyword.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-08-29 14:06:01 -07:00
HyukjinKwon c154629171 [SPARK-32183][DOCS][PYTHON] User Guide - PySpark Usage Guide for Pandas with Apache Arrow
### What changes were proposed in this pull request?

This PR proposes to move Arrow usage guide from Spark documentation site to PySpark documentation site (at "User Guide").

Here is the demo for reviewing quicker: https://hyukjin-spark.readthedocs.io/en/stable/user_guide/arrow_pandas.html

### Why are the changes needed?

To have a single place for PySpark users, and better documentation.

### Does this PR introduce _any_ user-facing change?

Yes, it will move https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html to our PySpark documentation.

### How was this patch tested?

```bash
cd docs
SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll serve --watch
```

and

```bash
cd python/docs
make clean html
```

Closes #29548 from HyukjinKwon/SPARK-32183.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-28 15:09:06 +09:00
waleedfateem 8749b2b6fa [SPARK-32701][CORE][DOCS] mapreduce.fileoutputcommitter.algorithm.version default value
The current documentation states that the default value of spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version is 1, which is not entirely true, since this configuration isn't set anywhere in Spark but rather inherited from the Hadoop FileOutputCommitter class.

### What changes were proposed in this pull request?

I'm submitting this change, to clarify that the default value will entirely depend on the Hadoop version of the runtime environment.

### Why are the changes needed?

An application would end up using algorithm version 1 in certain environments, but without any changes the exact same application will use version 2 in environments running Hadoop 3.0 and later. This can have pretty bad consequences in certain scenarios; for example, two tasks can partially overwrite their output if speculation is enabled. Also, please refer to the following JIRA:
https://issues.apache.org/jira/browse/MAPREDUCE-7282
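
For illustration, a hedged sketch of pinning the algorithm explicitly so a job does not silently inherit v2 from a Hadoop 3.x environment (the app name and master are placeholders):

```scala
import org.apache.spark.sql.SparkSession

// Explicitly request committer algorithm v1 regardless of the Hadoop default.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("pin-committer-v1")
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "1")
  .getOrCreate()
```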

### Does this PR introduce _any_ user-facing change?

Yes. Configuration page content was modified where previously we explicitly highlighted that the default version for the FileOutputCommitter algorithm was v1, this now has changed to "Dependent on environment" with additional information in the description column to elaborate.

### How was this patch tested?

Checked changes locally in browser

Closes #29541 from waleedfateem/SPARK-32701.

Authored-by: waleedfateem <waleed.fateem@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-08-27 09:05:50 -05:00
Dale Clarke ed51a7f083 [SPARK-30654] Bootstrap4 docs upgrade
### What changes were proposed in this pull request?
We are using an older version of Bootstrap (v. 2.1.0) for the online documentation site.  Bootstrap 2.x was moved to EOL in Aug 2013 and Bootstrap 3.x was moved to EOL in July 2019 (https://github.com/twbs/release).  Older versions of Bootstrap are also getting flagged in security scans for various CVEs:

    https://snyk.io/vuln/SNYK-JS-BOOTSTRAP-72889
    https://snyk.io/vuln/SNYK-JS-BOOTSTRAP-173700
    https://snyk.io/vuln/npm:bootstrap:20180529
    https://snyk.io/vuln/npm:bootstrap:20160627

I haven't validated each CVE, but it would probably be good practice to resolve any potential issues and get on a supported release.

The bad news is that there have been quite a few changes between Bootstrap 2 and Bootstrap 4.  I've tried updating the library, refactoring/tweaking the CSS and JS to maintain a similar appearance and functionality, and testing the documentation.  This is a fairly large change so I'm sure additional testing and fixes will be needed.

### How was this patch tested?
This has been manually tested, but as there is a lot of documentation it is possible issues were missed.  Additional testing and feedback is welcomed.  If it appears a whole section was missed let me know and I'll take a pass at addressing that section.

Closes #27369 from clarkead/bootstrap4-docs-upgrade.

Authored-by: Dale Clarke <a.dale.clarke@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-08-27 09:03:39 -05:00
Terry Kim baaa756dee [SPARK-32516][SQL][FOLLOWUP] 'path' option cannot coexist with path parameter for DataFrameWriter.save(), DataStreamReader.load() and DataStreamWriter.start()
### What changes were proposed in this pull request?

This is a follow up PR to #29328 to apply the same constraint where `path` option cannot coexist with path parameter to `DataFrameWriter.save()`, `DataStreamReader.load()` and `DataStreamWriter.start()`.

### Why are the changes needed?

The current behavior silently overwrites the `path` option if a path parameter is passed to `DataFrameWriter.save()`, `DataStreamReader.load()` and `DataStreamWriter.start()`.

For example,
```
Seq(1).toDF.write.option("path", "/tmp/path1").parquet("/tmp/path2")
```
will write the result to `/tmp/path2`.

### Does this PR introduce _any_ user-facing change?

Yes, if the `path` option coexists with a path parameter to any of the above methods, it will throw an `AnalysisException`:
```
scala> Seq(1).toDF.write.option("path", "/tmp/path1").parquet("/tmp/path2")
org.apache.spark.sql.AnalysisException: There is a 'path' option set and save() is called with a  path parameter. Either remove the path option, or call save() without the parameter. To ignore this check, set 'spark.sql.legacy.pathOptionBehavior.enabled' to 'true'.;
```

The user can restore the previous behavior by setting `spark.sql.legacy.pathOptionBehavior.enabled` to `true`.

### How was this patch tested?

Added new tests.

Closes #29543 from imback82/path_option.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-27 06:21:04 +00:00
HyukjinKwon b54103016a [SPARK-32204][SPARK-32182][DOCS] Add a quickstart page with Binder integration in PySpark documentation
### What changes were proposed in this pull request?

This PR proposes to:
- add a notebook with a Binder integration which allows users to try PySpark in a live notebook. Please [try this here](https://mybinder.org/v2/gh/HyukjinKwon/spark/SPARK-32204?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb).
- reuse this notebook as a quickstart guide in PySpark documentation.

Note that Binder turns a Git repo into a collection of interactive notebooks. It works based on a Docker image: once somebody builds the image, other people can reuse it against a specific commit.
Therefore, if we run Binder with the images based on released tags in Spark, virtually all users can instantly launch the Jupyter notebooks.

<br/>

I made a simple demo to make it easier to review. Please see:
- [Main page](https://hyukjin-spark.readthedocs.io/en/stable/). Note that the link ("Live Notebook") in the main page wouldn't work since this PR is not merged yet.
- [Quickstart page](https://hyukjin-spark.readthedocs.io/en/stable/getting_started/quickstart.html)

<br/>

When reviewing the notebook file itself, please give my direct feedback which I will appreciate and address.
Another way might be:
- open [here](https://mybinder.org/v2/gh/HyukjinKwon/spark/SPARK-32204?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb).
- edit / change / update the notebook. Please feel free to change whatever you want. I can apply it as-is or update it slightly more when I apply it to this PR.
- download it as a `.ipynb` file:
    ![Screen Shot 2020-08-20 at 10 12 19 PM](https://user-images.githubusercontent.com/6477701/90774311-3e38c800-e332-11ea-8476-699a653984db.png)
- upload the `.ipynb` file here in a GitHub comment. Then, I will push a commit with that file, crediting correctly, of course.
- alternatively, push a commit into this PR right away if that's easier for you (if you're a committer).

References:
- https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html
- https://databricks.com/jp/blog/2020/03/31/10-minutes-from-pandas-to-koalas-on-apache-spark.html - my own blog post .. :-) and https://koalas.readthedocs.io/en/latest/getting_started/10min.html

### Why are the changes needed?

To improve PySpark's usability. The current quickstart experience for Python users is not very friendly.

### Does this PR introduce _any_ user-facing change?

Yes, it will add a documentation page, and expose a live notebook to PySpark users.

### How was this patch tested?

Manually tested, and GitHub Actions builds will test.

Closes #29491 from HyukjinKwon/SPARK-32204.

Lead-authored-by: HyukjinKwon <gurwls223@apache.org>
Co-authored-by: Fokko Driesprong <fokko@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-26 12:23:24 +09:00
Kent Yao 1f3bb51757 [SPARK-32683][DOCS][SQL] Fix doc error and add migration guide for datetime pattern F
### What changes were proposed in this pull request?

This PR fixes the doc error and add a migration guide for datetime pattern.

### Why are the changes needed?
This is a doc bug that we inherited from the JDK: https://bugs.openjdk.java.net/browse/JDK-8169482

The SimpleDateFormat (**F: Day of week in month**) we used in 2.x and the DateTimeFormatter (**F: week-of-month**) we use now both have meanings opposite to what is declared in the Java docs. Unfortunately, this also leads to silent data changes in Spark.

The `week-of-month` is actually the pattern `W` in DateTimeFormatter, which is banned from use in Spark 3.x.

If we want to keep pattern `F`, we need to accept the behavior change with a proper migration guide and fix the doc in Spark.

### Does this PR introduce _any_ user-facing change?

Yes, doc changed

### How was this patch tested?

Passing the CI doc generation job.

Closes #29538 from yaooqinn/SPARK-32683.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-25 13:17:03 +00:00
Terry Kim e3a88a9767 [SPARK-32516][SQL] 'path' option cannot coexist with load()'s path parameters
### What changes were proposed in this pull request?

This PR proposes to make the behavior consistent for the `path` option when loading dataframes with a single path (e.g, `option("path", path).format("parquet").load(path)` vs. `option("path", path).parquet(path)`) by disallowing `path` option to coexist with `load`'s path parameters.

### Why are the changes needed?

The current behavior is inconsistent:
```scala
scala> Seq(1).toDF.write.mode("overwrite").parquet("/tmp/test")

scala> spark.read.option("path", "/tmp/test").format("parquet").load("/tmp/test").show
+-----+
|value|
+-----+
|    1|
+-----+

scala> spark.read.option("path", "/tmp/test").parquet("/tmp/test").show
+-----+
|value|
+-----+
|    1|
|    1|
+-----+
```

### Does this PR introduce _any_ user-facing change?

Yes, now if the `path` option is specified along with `load`'s path parameters, it would fail:
```scala
scala> Seq(1).toDF.write.mode("overwrite").parquet("/tmp/test")

scala> spark.read.option("path", "/tmp/test").format("parquet").load("/tmp/test").show
org.apache.spark.sql.AnalysisException: There is a path option set and load() is called with path parameters. Either remove the path option or move it into the load() parameters.;
  at org.apache.spark.sql.DataFrameReader.verifyPathOptionDoesNotExist(DataFrameReader.scala:310)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:232)
  ... 47 elided

scala> spark.read.option("path", "/tmp/test").parquet("/tmp/test").show
org.apache.spark.sql.AnalysisException: There is a path option set and load() is called with path parameters. Either remove the path option or move it into the load() parameters.;
  at org.apache.spark.sql.DataFrameReader.verifyPathOptionDoesNotExist(DataFrameReader.scala:310)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:250)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:778)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:756)
  ... 47 elided
```

The user can restore the previous behavior by setting `spark.sql.legacy.pathOptionBehavior.enabled` to `true`.

### How was this patch tested?

Added a test

Closes #29328 from imback82/dfw_option.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-24 16:30:30 +00:00
Huaxin Gao db74fd0d33 [SPARK-32552][SQL][DOCS] Complete the documentation for Table-valued Function
### What changes were proposed in this pull request?
There are two types of TVFs. We only documented one type; this adds the doc for the second type.
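
For context, a small spark-shell example of a table-valued function used in a FROM clause, such as the built-in `range` (a generic example, not taken from the new doc page):

```scala
spark.sql("SELECT id FROM range(0, 4)").show()
// +---+
// | id|
// +---+
// |  0|
// |  1|
// |  2|
// |  3|
// +---+
```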

### Why are the changes needed?
Complete the Table-valued Function doc.

### Does this PR introduce _any_ user-facing change?
<img width="1099" alt="Screen Shot 2020-08-06 at 5 30 25 PM" src="https://user-images.githubusercontent.com/13592258/89595926-c5eae680-d80a-11ea-918b-0c3646f9930e.png">

<img width="1100" alt="Screen Shot 2020-08-06 at 5 30 49 PM" src="https://user-images.githubusercontent.com/13592258/89595929-c84d4080-d80a-11ea-9803-30eb502ccd05.png">

<img width="1101" alt="Screen Shot 2020-08-06 at 5 31 19 PM" src="https://user-images.githubusercontent.com/13592258/89595931-ca170400-d80a-11ea-8812-2f009746edac.png">

<img width="1100" alt="Screen Shot 2020-08-06 at 5 31 40 PM" src="https://user-images.githubusercontent.com/13592258/89595934-cb483100-d80a-11ea-9e18-9357aa9f2c5c.png">

### How was this patch tested?
Manually built and checked.

Closes #29355 from huaxingao/tvf.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-08-24 09:43:41 +09:00
Yuanjian Li 8b26c69ce7 [SPARK-31792][SS][DOC][FOLLOW-UP] Rephrase the description for some operations
### What changes were proposed in this pull request?
Rephrase the description for some operations to make it clearer.

### Why are the changes needed?
Add more detail in the document.

### Does this PR introduce _any_ user-facing change?
No, document only.

### How was this patch tested?
Document only.

Closes #29269 from xuanyuanking/SPARK-31792-follow.

Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
2020-08-22 21:32:23 +09:00
Brandon Jiang 1450b5e095 [MINOR][DOCS] fix typos in docs, log messages and comments
### What changes were proposed in this pull request?
Fix typos in docs, log messages and comments.

### Why are the changes needed?
Typo fixes to increase readability.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
A manual test has been performed to verify the updates.

Closes #29443 from brandonJY/spell-fix-doc.

Authored-by: Brandon Jiang <Brandon.jiang.a@outlook.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-08-22 06:45:35 +09:00
Chao Sun bf221debd0 [SPARK-32674][DOC] Add suggestion for parallel directory listing in tuning doc
### What changes were proposed in this pull request?

This adds a tuning guide section on increasing the parallelism of directory listing.

### Why are the changes needed?

Sometimes when a job's input has a large number of directories, the listing can become a bottleneck. There are a few parameters to tune this. This adds some info to the Spark tuning guide to make the knowledge better shared.
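
As an illustrative sketch only (the config names and values below are assumptions about the kind of knobs involved, not quoted from the added guide):

```scala
import org.apache.spark.sql.SparkSession

// Tune file-source partition discovery so large directory trees are listed in parallel.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("parallel-listing-tuning")
  .config("spark.sql.sources.parallelPartitionDiscovery.threshold", "32")
  .config("spark.sql.sources.parallelPartitionDiscovery.parallelism", "100")
  .getOrCreate()
```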

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes #29498 from sunchao/SPARK-32674.

Authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-21 16:48:54 +09:00
Gengliang Wang 1b39215a65 [SPARK-32018][FOLLOWUP][DOC] Add migration guide for decimal value overflow in sum aggregation
### What changes were proposed in this pull request?

Add migration guide for decimal value overflow behavior in sum aggregation, introduced in https://github.com/apache/spark/pull/29026

### Why are the changes needed?

Add migration guide for the behavior changes from 3.0 to 3.1.
See also: https://github.com/apache/spark/pull/29450#issuecomment-675222779

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Build docs and preview:
![image](https://user-images.githubusercontent.com/1097932/90589256-8b7e3380-e192-11ea-8ff1-05a447c20722.png)

Closes #29458 from gengliangwang/migrationGuideDecimalOverflow.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-08-19 11:37:53 +08:00
Luca Canali 21e0dd0461 [SPARK-32119][FOLLOWUP][DOC] Update monitoring doc following the improvement in SPARK-32119
### What changes were proposed in this pull request?
Update monitoring doc following the improvement/fix in SPARK-32119.

### Why are the changes needed?
SPARK-32119 removes the limitations listed in the monitoring doc "Distribution of the jar files containing the plugin code is currently not done by Spark."

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not relevant

Closes #29463 from LucaCanali/followupSPARK32119.

Authored-by: Luca Canali <luca.canali@cern.ch>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2020-08-18 18:53:34 +09:00
Kousuke Saruta 9a79bbc8b6 [SPARK-32610][DOCS] Fix the link to metrics.dropwizard.io in monitoring.md to refer the proper version
### What changes were proposed in this pull request?

This PR fixes the links to metrics.dropwizard.io in monitoring.md to refer to the proper version of the library.

### Why are the changes needed?

There are links to metrics.dropwizard.io in monitoring.md, but the link targets refer to version 3.1.0, while we use 4.1.1.
Now that users can create their own metrics using the Dropwizard library, it's better to fix the links to refer to the proper version.

### Does this PR introduce _any_ user-facing change?

Yes. The modified links refer to version 4.1.1.

### How was this patch tested?

Build the docs and visit all the modified links.

Closes #29426 from sarutak/fix-dropwizard-url.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-08-16 12:07:37 -05:00
HyukjinKwon 9dec67717b [SPARK-32584][PYTHON][DOCS] Exclude _images and _sources that are generated by Sphinx in Jekyll build
### What changes were proposed in this pull request?

This PR proposes to `include` the `_images` and `_sources` directories, generated by Sphinx, in the Jekyll build.

**For `_images` directory,**
After SPARK-31851, we now add some images to use within the pages built by Sphinx. Sphinx copies the images into the `_images` directory. Later, when Jekyll builds, underscore directories are ignored by default, which ends up with missing images in the main doc.

Before:
![Screen Shot 2020-08-11 at 1 52 46 PM](https://user-images.githubusercontent.com/6477701/89859104-2e571080-dbdb-11ea-817c-c04bbcd4088e.png)

After:
![Screen Shot 2020-08-11 at 1 49 00 PM](https://user-images.githubusercontent.com/6477701/89859105-30b96a80-dbdb-11ea-85c6-8a135eddf613.png)

**For `_sources` directory,**
Please refer [here](https://github.com/sphinx-contrib/sphinx-pretty-searchresults#source-links) and [here](https://www.sphinx-doc.org/en/master/usage/configuration.html#confval-html_copy_source). They are generated and used by default in the documentation built by Sphinx, so we should include them.

### Why are the changes needed?

To show the images correctly in PySpark documentation.

### Does this PR introduce _any_ user-facing change?

No, only in unreleased branches.

### How was this patch tested?

Manually tested via:

```bash
SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll serve --watch
```

Closes #29402 from HyukjinKwon/SPARK-32584.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-11 15:15:30 +09:00
Luca Canali 99f50c6286 [SPARK-32409][DOC] Document dependency between spark.metrics.staticSources.enabled and JVMSource registration
### What changes were proposed in this pull request?
Document the dependency between the config `spark.metrics.staticSources.enabled` and JVMSource registration.

### Why are the changes needed?

This PR just documents the dependency between the config `spark.metrics.staticSources.enabled` and JVM source registration.
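
A minimal sketch of the dependency being documented (illustrative only; the config name comes from the description above):

```scala
import org.apache.spark.SparkConf

// With static metrics sources disabled, the JVM metrics source is not
// registered either, even if it is requested in the metrics configuration.
val conf = new SparkConf()
  .set("spark.metrics.staticSources.enabled", "false")
```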

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?
Manually tested.

Closes #29203 from LucaCanali/bugJVMMetricsRegistration.

Authored-by: Luca Canali <luca.canali@cern.ch>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-08-10 09:32:01 -07:00
Dongjoon Hyun b421bf0196 [SPARK-32517][CORE] Add StorageLevel.DISK_ONLY_3
### What changes were proposed in this pull request?

This PR aims to add `StorageLevel.DISK_ONLY_3` as a built-in `StorageLevel`.

### Why are the changes needed?

In a YARN cluster, HDFS usually provides storage with replication factor 3, so we can technically save the result to HDFS to get `StorageLevel.DISK_ONLY_3`. However, disaggregated clusters or clusters without storage services are on the rise. Previously, in that situation, users were able to use the similar `MEMORY_AND_DISK_2` or a user-created `StorageLevel`. This PR aims to support those use cases officially for better UX.
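
A minimal spark-shell sketch of using the new level (the RDD and data are made up):

```scala
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 100)
rdd.persist(StorageLevel.DISK_ONLY_3)   // 3 on-disk replicas, no in-memory caching
rdd.count()
```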

### Does this PR introduce _any_ user-facing change?

Yes. This provides a new built-in option.

### How was this patch tested?

Pass the GitHub Action or Jenkins with the revised test cases.

Closes #29331 from dongjoon-hyun/SPARK-32517.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-08-10 07:33:06 -07:00
Takeshi Yamamuro bf4ac3bacc [SPARK-32554][K8S][DOCS] Remove the words "experimental" in the k8s document
### What changes were proposed in this pull request?

This PR targets at dropping the words "experimental" in the k8s document from the primary branch.

This update comes from a thread in the spark-dev mailing list: http://apache-spark-developers-list.1001551.n3.nabble.com/spark-on-k8s-is-still-experimental-td29942.html

### Why are the changes needed?

To prepare a GA announcement for the k8s scheduler in the next feature release (v3.1.0)

### Does this PR introduce _any_ user-facing change?

Yes

BEFORE:
<img width="938" alt="Screen Shot 2020-08-10 at 21 17 48" src="https://user-images.githubusercontent.com/692303/89781831-0752fd00-db4f-11ea-843a-67fb23fc8f71.png">
AFTER:
<img width="874" alt="Screen Shot 2020-08-10 at 21 17 21" src="https://user-images.githubusercontent.com/692303/89781816-01f5b280-db4f-11ea-9ab4-4d1012bad80e.png">

### How was this patch tested?

N/A

Closes #29368 from maropu/UpdateDocForK8S.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-08-10 06:38:19 -07:00
Liang-Chi Hsieh f9f992e9a4 [SPARK-32191][PYTHON][DOCS] Port migration guide for PySpark docs
### What changes were proposed in this pull request?

This proposes to port old PySpark migration guide to new PySpark docs.

### Why are the changes needed?

Better documentation.

### Does this PR introduce _any_ user-facing change?

No. Documentation only.

### How was this patch tested?

Generated document locally.

<img width="1521" alt="Screen Shot 2020-08-07 at 1 53 20 PM" src="https://user-images.githubusercontent.com/68855/89687618-672e7700-d8b5-11ea-8f29-67a9ab271fa8.png">

Closes #29385 from viirya/SPARK-32191.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-10 15:41:32 +09:00
Max Gekk 3a437ed22b [SPARK-32501][SQL] Convert null to "null" in structs, maps and arrays while casting to strings
### What changes were proposed in this pull request?
Convert `NULL` elements of maps, structs and arrays to the `"null"` string while converting map/struct/array values to strings. The SQL config `spark.sql.legacy.omitNestedNullInCast.enabled` controls the behaviour: when it is `true`, `NULL` elements of structs/maps/arrays are omitted; when it is `false`, `NULL` elements are converted to `"null"`.

### Why are the changes needed?
1. It is impossible to distinguish empty string and null, for instance:
```scala
scala> Seq(Seq(""), Seq(null)).toDF().show
+-----+
|value|
+-----+
|   []|
|   []|
+-----+
```
2. Inconsistent NULL conversions for top-level values and nested columns, for instance:
```scala
scala> sql("select named_struct('c', null), null").show
+---------------------+----+
|named_struct(c, NULL)|NULL|
+---------------------+----+
|                   []|null|
+---------------------+----+
```
3. `.show()` is different from conversions to Hive strings, and as a consequence its output is different from `spark-sql` (sql tests):
```sql
spark-sql> select named_struct('c', null) as struct;
{"c":null}
```
```scala
scala> sql("select named_struct('c', null) as struct").show
+------+
|struct|
+------+
|    []|
+------+
```

4. It is impossible to distinguish empty struct/array from struct/array with null in the current implementation:
```scala
scala> Seq[Seq[String]](Seq(), Seq(null)).toDF.show()
+-----+
|value|
+-----+
|   []|
|   []|
+-----+
```

### Does this PR introduce _any_ user-facing change?
Yes, before:
```scala
scala> Seq(Seq(""), Seq(null)).toDF().show
+-----+
|value|
+-----+
|   []|
|   []|
+-----+
```

After:
```scala
scala> Seq(Seq(""), Seq(null)).toDF().show
+------+
| value|
+------+
|    []|
|[null]|
+------+
```

### How was this patch tested?
By existing test suite `CastSuite`.

Closes #29311 from MaxGekk/nested-null-to-string.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-05 12:03:36 +00:00
HyukjinKwon 15b73339d9 [SPARK-32507][DOCS][PYTHON] Add main page for PySpark documentation
### What changes were proposed in this pull request?

This PR proposes to write the main page of PySpark documentation. The base work is finished at https://github.com/apache/spark/pull/29188.

### Why are the changes needed?

For better usability and readability in PySpark documentation.

### Does this PR introduce _any_ user-facing change?

Yes, it creates a new main page as below:

![Screen Shot 2020-07-31 at 10 02 44 PM](https://user-images.githubusercontent.com/6477701/89037618-d2d68880-d379-11ea-9a44-562f2aa0e3fd.png)

### How was this patch tested?

Manually built the PySpark documentation.

```bash
cd python
make clean html
```

Closes #29320 from HyukjinKwon/SPARK-32507.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-05 11:14:14 +09:00