Commit graph

2579 commits

Author SHA1 Message Date
Alessandro Patti 4a33cd928d [SPARK-33203][PYTHON][TEST] Fix tests failing with rounding errors
### What changes were proposed in this pull request?

Increase tolerance for two tests that fail in some environments and fail in others (flaky? Pass/fail is constant within the same environment)

### Why are the changes needed?
The tests `pyspark.ml.recommendation` and `pyspark.ml.tests.test_algorithms` fail with
```
File "/home/jenkins/python/pyspark/ml/tests/test_algorithms.py", line 96, in test_raw_and_probability_prediction
    self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, atol=1))
AssertionError: False is not true
```
```
File "/home/jenkins/python/pyspark/ml/recommendation.py", line 256, in _main_.ALS
Failed example:
    predictions[0]
Expected:
    Row(user=0, item=2, newPrediction=0.6929101347923279)
Got:
    Row(user=0, item=2, newPrediction=0.6929104924201965)
...
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

This path changes a test target. Just executed the tests to verify they pass.

Closes #30104 from AlessandroPatti/apatti/rounding-errors.

Authored-by: Alessandro Patti <ale812@yahoo.it>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-21 18:14:21 -07:00
HyukjinKwon 66005a3236 [SPARK-31964][PYTHON][FOLLOW-UP] Use is_categorical_dtype instead of deprecated is_categorical
### What changes were proposed in this pull request?

This PR is a small followup of https://github.com/apache/spark/pull/28793 and  proposes to use `is_categorical_dtype` instead of deprecated `is_categorical`.

`is_categorical_dtype` exists from minimum pandas version we support (https://github.com/pandas-dev/pandas/blob/v0.23.2/pandas/core/dtypes/api.py), and `is_categorical` was deprecated from pandas 1.1.0 (87a1cc21ca).

### Why are the changes needed?

To avoid using deprecated APIs, and remove warnings.

### Does this PR introduce _any_ user-facing change?

Yes, it will remove warnings that says `is_categorical` is deprecated.

### How was this patch tested?

By running any pandas UDF with pandas 1.1.0+:

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf

def func(x: pd.Series) -> pd.Series:
    return x

spark.range(10).select(pandas_udf(func, "long")("id")).show()
```

Before:

```
/.../python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py:151: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
...
```

After:

```
...
```

Closes #30114 from HyukjinKwon/replace-deprecated-is_categorical.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
2020-10-21 14:46:47 -07:00
Bryan Cutler 47a6568265 [SPARK-33189][PYTHON][TESTS] Add env var to tests for legacy nested timestamps in pyarrow
### What changes were proposed in this pull request?

Add an environment variable `PYARROW_IGNORE_TIMEZONE` to pyspark tests in run-tests.py to use legacy nested timestamp behavior. This means that when converting arrow to pandas, nested timestamps with timezones will have the timezone localized during conversion.

### Why are the changes needed?

The default behavior was changed in PyArrow 2.0.0 to propagate timezone information. Using the environment variable enables testing with newer versions of pyarrow until the issue can be fixed in SPARK-32285.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests

Closes #30111 from BryanCutler/arrow-enable-legacy-nested-timestamps-SPARK-33189.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-21 09:13:33 +09:00
xuewei.linxuewei 388e067a90 [SPARK-33139][SQL][FOLLOW-UP] Avoid using reflect call on session.py
### What changes were proposed in this pull request?

In [SPARK-33139](https://github.com/apache/spark/pull/30042), I was using reflect "Class.forName" in python code to invoke method in SparkSession which is not recommended. using getattr to access "SparkSession$.Module$" instead.

### Why are the changes needed?

Code refine.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?

Existing tests.

Closes #30092 from leanken/leanken-SPARK-33139-followup.

Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-19 16:40:48 +09:00
xuewei.linxuewei 306872eefa [SPARK-33139][SQL] protect setActionSession and clearActiveSession
### What changes were proposed in this pull request?

This PR is a sub-task of [SPARK-33138](https://issues.apache.org/jira/browse/SPARK-33138). In order to make SQLConf.get reliable and stable, we need to make sure user can't pollute the SQLConf and SparkSession Context via calling setActiveSession and clearActiveSession.

Change of the PR:

* add legacy config spark.sql.legacy.allowModifyActiveSession to fallback to old behavior if user do need to call these two API.
* by default, if user call these two API, it will throw exception
* add extra two internal and private API setActiveSessionInternal and clearActiveSessionInternal for current internal usage
* change all internal reference to new internal API except for SQLContext.setActive and SQLContext.clearActive

### Why are the changes needed?

Make SQLConf.get reliable and stable.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?

* Add UT in SparkSessionBuilderSuite to test the legacy config
* Existing test

Closes #30042 from leanken/leanken-SPARK-33139.

Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-16 06:05:17 +00:00
Chuliang Xiao 81d3a8eeca [MINOR][PYTHON] Fix the typo in the docstring of method agg()
### What changes were proposed in this pull request?
Change `df.groupBy.agg()` to `df.groupBy().agg()` in the docstring of `agg()`

### Why are the changes needed?
Fix typo in a docstring

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No

Closes #30060 from ChuliangXiao/patch-1.

Authored-by: Chuliang Xiao <ChuliangX@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-10-15 17:24:22 -07:00
zero323 83f8e13956 [SPARK-33086][FOLLOW-UP] Remove unused Optional import from pyspark.resource.profile stub
### What changes were proposed in this pull request?

Remove unused `typing.Optional` import from `pyspark.resource.profile` stub.

### Why are the changes needed?

Since SPARK-32319 we don't allow unused imports.  However, this one slipped both local and CI tests for some reason.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests and mypy check.

Closes #30002 from zero323/SPARK-33086-FOLLOWUP.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-12 10:29:28 +09:00
zero323 3beab8d8a8 [SPARK-32793][FOLLOW-UP] Minor corrections for PySpark annotations and SparkR
### What changes were proposed in this pull request?

- Annotated return types of `assert_true` and `raise_error` as discussed [here](https://github.com/apache/spark/pull/29947#pullrequestreview-504495801).
- Add `assert_true` and `raise_error`  to SparkR NAMESPACE.
- Validating message vector size in SparkR as discussed [here](https://github.com/apache/spark/pull/29947#pullrequestreview-504539004).

### Why are the changes needed?

As discussed in review for #29947.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- Existing tests.
- Validation of annotations using MyPy

Closes #29978 from zero323/SPARK-32793-FOLLOW-UP.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-09 09:50:45 +09:00
Karen Feng 39510b0e9b [SPARK-32793][SQL] Add raise_error function, adds error message parameter to assert_true
## What changes were proposed in this pull request?

Adds a SQL function `raise_error` which underlies the refactored `assert_true` function. `assert_true` now also (optionally) accepts a custom error message field.
`raise_error` is exposed in SQL, Python, Scala, and R.
`assert_true` was previously only exposed in SQL; it is now also exposed in Python, Scala, and R.

### Why are the changes needed?

Improves usability of `assert_true` by clarifying error messaging, and adds the useful helper function `raise_error`.

### Does this PR introduce _any_ user-facing change?

Yes:
- Adds `raise_error` function to the SQL, Python, Scala, and R APIs.
- Adds `assert_true` function to the SQL, Python and R APIs.

### How was this patch tested?

Adds unit tests in SQL, Python, Scala, and R for `assert_true` and `raise_error`.

Closes #29947 from karenfeng/spark-32793.

Lead-authored-by: Karen Feng <karen.feng@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-08 12:05:39 +09:00
zero323 473b3ba6aa [SPARK-32511][FOLLOW-UP][SQL][R][PYTHON] Add dropFields to SparkR and PySpark
### What changes were proposed in this pull request?

This PR adds `dropFields` method to:

- PySpark `Column`
- SparkR `Column`

### Why are the changes needed?

Feature parity.

### Does this PR introduce _any_ user-facing change?

No, new API.

### How was this patch tested?

- New unit tests.
- Manual verification of examples / doctests.
- Manual run of MyPy tests

Closes #29967 from zero323/SPARK-32511-FOLLOW-UP-PYSPARK-SPARKR.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-08 10:37:42 +09:00
zero323 37e1b0c4a5 [SPARK-33086][PYTHON] Add static annotations for pyspark.resource
### What changes were proposed in this pull request?

This PR replaces dynamically generated annotations for following modules:

- `pyspark.resource.information`
- `pyspark.resource.profile`
- `pyspark.resource.requests`

### Why are the changes needed?

These modules where not manually annotated in `pyspark-stubs`, but are part of the public API and we should provide more precise annotations.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

MyPy tests:

```
mypy --no-incremental --config python/mypy.ini python/pyspark
```

Closes #29969 from zero323/SPARK-32714-FOLLOW-UP-RESOURCE.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-08 10:32:30 +09:00
zero323 72da6f86cf [SPARK-33002][PYTHON] Remove non-API annotations
### What changes were proposed in this pull request?

This PR:

- removes annotations for modules which are not part of the public API.
- removes `__init__.pyi` files, if no annotations, beyond exports, are present.

### Why are the changes needed?

Primarily to reduce maintenance overhead and as requested in the comments to https://github.com/apache/spark/pull/29591

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests and additional MyPy checks:

```
mypy --no-incremental --config python/mypy.ini python/pyspark
MYPYPATH=python/ mypy --no-incremental --config python/mypy.ini examples/src/main/python/ml examples/src/main/python/sql examples/src/main/python/sql/streaming
```

Closes #29879 from zero323/SPARK-33002.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-07 19:53:59 +09:00
itholic 4e1ded67f8 [SPARK-32189][DOCS][PYTHON][FOLLOW-UP] Fixed broken link and typo in PySpark docs
### What changes were proposed in this pull request?

This PR is a follow-up of #29781 to fix broken link and typo.

<img width="638" alt="Screen Shot 2020-10-07 at 3 56 28 PM" src="https://user-images.githubusercontent.com/44108233/95297583-aa0ccb00-08b5-11eb-85db-89022c76d7e1.png">

<img width="734" alt="Screen Shot 2020-10-07 at 3 55 36 PM" src="https://user-images.githubusercontent.com/44108233/95297508-8ba6cf80-08b5-11eb-9caa-0b52a2482ada.png">

### Why are the changes needed?

Current link is not working properly because of wrong path.

### Does this PR introduce _any_ user-facing change?

Yes, the link is working properly now.

### How was this patch tested?

Manually built the doc.

Closes #29963 from itholic/SPARK-32189-FOLLOWUP.

Authored-by: itholic <haejoon309@naver.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-07 16:39:25 +09:00
HyukjinKwon 5ce321dc80 [SPARK-33017][PYTHON][DOCS][FOLLOW-UP] Add getCheckpointDir into API documentation
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/29918. We should add it into the documentation as well.

### Why are the changes needed?

To show users new APIs.

### Does this PR introduce _any_ user-facing change?

Yes, `SparkContext.getCheckpointDir` will be documented.

### How was this patch tested?

Manually built the PySpark documentation:

```bash
cd python/docs
make clean html
cd build/html
open index.html
```

Closes #29960 from HyukjinKwon/SPARK-33017.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-07 13:00:59 +09:00
Bryan Cutler 0812d6c17c [SPARK-33073][PYTHON] Improve error handling on Pandas to Arrow conversion failures
### What changes were proposed in this pull request?

This improves error handling when a failure in conversion from Pandas to Arrow occurs. And fixes tests to be compatible with upcoming Arrow 2.0.0 release.

### Why are the changes needed?

Current tests will fail with Arrow 2.0.0 because of a change in error message when the schema is invalid. For these cases, the current error message also includes information on disabling safe conversion config, which is mainly meant for floating point truncation and overflow. The tests have been updated to use a message that is show for past Arrow versions, and upcoming.

If the user enters an invalid schema, the error produced by pyarrow is not consistent and either `TypeError` or `ArrowInvalid`, with the latter being caught, and raised as a `RuntimeError` with the extra info.

The error handling is improved by:

- narrowing the exception type to `TypeError`s, which `ArrowInvalid` is a subclass and what is raised on safe conversion failures.
- The exception is only raised with additional information on disabling "spark.sql.execution.pandas.convertToArrowArraySafely" if it is enabled in the first place.
- The original exception is chained to better show it to the user.

### Does this PR introduce _any_ user-facing change?

Yes, the error re-raised changes from a RuntimeError to a ValueError, which better categorizes this type of error and in-line with the original Arrow error.

### How was this patch tested?

Existing tests, using pyarrow 1.0.1 and 2.0.0-snapshot

Closes #29951 from BryanCutler/arrow-better-handle-pandas-errors-SPARK-33073.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-06 18:11:24 +09:00
reidy-p 4ab9aa0305 [SPARK-33017][PYTHON] Add getCheckpointDir method to PySpark Context
### What changes were proposed in this pull request?

Adding a method to get the checkpoint directory from the PySpark context to match the Scala API

### Why are the changes needed?

To make the Scala and Python APIs consistent and remove the need to use the JavaObject

### Does this PR introduce _any_ user-facing change?

Yes, there is a new method which makes it easier to get the checkpoint directory directly rather than using the JavaObject

#### Previous behaviour:
```python
>>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/')
>>> sc._jsc.sc().getCheckpointDir().get()
'file:/tmp/spark/checkpoint/63f7b67c-e5dc-4d11-a70c-33554a71717a'
```
This method returns a confusing Scala error if it has not been set
```python
>>> sc._jsc.sc().getCheckpointDir().get()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/paul/Desktop/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/home/paul/Desktop/spark/python/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/home/paul/Desktop/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o25.get.
: java.util.NoSuchElementException: None.get
        at scala.None$.get(Option.scala:529)
        at scala.None$.get(Option.scala:527)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)

```

#### New method:
```python
>>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/')
>>> spark.sparkContext.getCheckpointDir()
'file:/tmp/spark/checkpoint/b38aca2e-8ace-44fc-a4c4-f4e36c2da2a7'
```

``getCheckpointDir()`` returns ``None`` if it has not been set
```python
>>> print(spark.sparkContext.getCheckpointDir())
None
```

### How was this patch tested?

Added to existing unit tests. But I'm not sure how to add a test for the case where ``getCheckpointDir()`` should return ``None`` since the existing checkpoint tests set the checkpoint directory in the ``setUp`` method before any tests are run as far as I can tell.

Closes #29918 from reidy-p/SPARK-33017.

Authored-by: reidy-p <paul_reidy@outlook.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-05 11:48:28 +09:00
Max Gekk 1b60ff5afe [MINOR][DOCS] Document when current_date and current_timestamp are evaluated
### What changes were proposed in this pull request?
Explicitly document that `current_date` and `current_timestamp` are executed at the start of query evaluation. And all calls of `current_date`/`current_timestamp` within the same query return the same value

### Why are the changes needed?
Users could expect that `current_date` and `current_timestamp` return the current date/timestamp at the moment of query execution but in fact the functions are folded by the optimizer at the start of query evaluation:
0df8dd6073/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala (L71-L91)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
by running `./dev/scalastyle`.

Closes #29892 from MaxGekk/doc-current_date.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-29 05:20:12 +00:00
HyukjinKwon 6868b40517 [SPARK-33020][PYTHON] Add nth_value as a PySpark function
### What changes were proposed in this pull request?

`nth_value` was added at SPARK-27951. This PR adds the corresponding PySpark API.

### Why are the changes needed?

To support the consistent APIs

### Does this PR introduce _any_ user-facing change?

Yes, it introduces a new PySpark function API.

### How was this patch tested?

Unittest was added.

Closes #29899 from HyukjinKwon/SPARK-33020.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-09-28 22:14:28 -07:00
HyukjinKwon 376ede1301 [SPARK-33021][PYTHON][TESTS] Move functions related test cases into test_functions.py
### What changes were proposed in this pull request?

Move functions related test cases from `test_context.py` to `test_functions.py`.

### Why are the changes needed?

To group the similar test cases.

### Does this PR introduce _any_ user-facing change?

Nope, test-only.

### How was this patch tested?

Jenkins and GitHub Actions should test.

Closes #29898 from HyukjinKwon/SPARK-33021.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-09-28 21:54:00 -07:00
Fabian Höring a7f84a0b45 [SPARK-32187][PYTHON][DOCS] Doc on Python packaging
### What changes were proposed in this pull request?

This PR proposes to document PySpark specific packaging guidelines.

### Why are the changes needed?

To have a single place for PySpark users, and better documentation.

### Does this PR introduce _any_ user-facing change?

Yes

### How was this patch tested?

```
cd python/docs
make clean html
```

Closes #29806 from fhoering/add_doc_python_packaging.

Lead-authored-by: Fabian Höring <f.horing@criteo.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-28 12:30:28 +09:00
zero323 c65b64552f [SPARK-32714][FOLLOW-UP][PYTHON] Address pyspark.install typing errors
### What changes were proposed in this pull request?

This PR adds two `type: ignores`, one in `pyspark.install` and one in related tests.

### Why are the changes needed?

To satisfy MyPy type checks. It seems like we originally missed some changes that happened around merge of
31a16fbb40

```
python/pyspark/install.py:30: error: Need type annotation for 'UNSUPPORTED_COMBINATIONS' (hint: "UNSUPPORTED_COMBINATIONS: List[<type>] = ...")  [var-annotated]
python/pyspark/tests/test_install_spark.py:105: error: Cannot find implementation or library stub for module named 'xmlrunner'  [import]
python/pyspark/tests/test_install_spark.py:105: note: See https://mypy.readthedocs.io/en/latest/running_mypy.html#missing-imports
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- Existing tests.
- MyPy tests
    ```
    mypy --show-error-code --no-incremental --config python/mypy.ini python/pyspark
   ```

Closes #29878 from zero323/SPARK-32714-FOLLOW-UP.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-27 16:21:23 +09:00
HyukjinKwon 688d016c7a [SPARK-32982][BUILD] Remove hive-1.2 profiles in PIP installation option
### What changes were proposed in this pull request?

This PR removes Hive 1.2 option (and therefore `HIVE_VERSION` environment variable as well).

### Why are the changes needed?

Hive 1.2 is a fork version. We shouldn't promote users to use.

### Does this PR introduce _any_ user-facing change?

Nope, `HIVE_VERSION` and Hive 1.2 are removed but this is new experimental feature in master only.

### How was this patch tested?

Manually tested:

```bash
SPARK_VERSION=3.0.1 HADOOP_VERSION=3.2 pip install pyspark-3.1.0.dev0.tar.gz -v
SPARK_VERSION=3.0.1 HADOOP_VERSION=2.7 pip install pyspark-3.1.0.dev0.tar.gz -v
SPARK_VERSION=3.0.1 HADOOP_VERSION=invalid pip install pyspark-3.1.0.dev0.tar.gz -v
```

Closes #29858 from HyukjinKwon/SPARK-32981.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-24 14:49:58 +09:00
zero323 31a16fbb40 [SPARK-32714][PYTHON] Initial pyspark-stubs port
### What changes were proposed in this pull request?

This PR proposes migration of [`pyspark-stubs`](https://github.com/zero323/pyspark-stubs) into Spark codebase.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

Yes. This PR adds type annotations directly to Spark source.

This can impact interaction with development tools for users, which haven't used `pyspark-stubs`.

### How was this patch tested?

- [x] MyPy tests of the PySpark source
    ```
    mypy --no-incremental --config python/mypy.ini python/pyspark
    ```
- [x] MyPy tests of Spark examples
    ```
   MYPYPATH=python/ mypy --no-incremental --config python/mypy.ini examples/src/main/python/ml examples/src/main/python/sql examples/src/main/python/sql/streaming
    ```
- [x] Existing Flake8 linter

- [x] Existing unit tests

Tested against:

- `mypy==0.790+dev.e959952d9001e9713d329a2f9b196705b028f894`
- `mypy==0.782`

Closes #29591 from zero323/SPARK-32681.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-24 14:15:36 +09:00
zhengruifeng 432afac07e [SPARK-32907][ML] adaptively blockify instances - revert blockify gmm
### What changes were proposed in this pull request?
revert blockify gmm

### Why are the changes needed?
WeichenXu123  and I thought we should use memory size instead of number of rows to blockify instance; then if a buffer's size is large and determined by number of rows, we should discard it.
In GMM, we found that the pre-allocated memory maybe too large and should be discarded:
```
transient private lazy val auxiliaryPDFMat = DenseMatrix.zeros(blockSize, numFeatures)
```
We had some offline discuss and thought it is better to revert blockify GMM.

### Does this PR introduce _any_ user-facing change?
blockSize added in master branch will be removed

### How was this patch tested?
existing testsuites

Closes #29782 from zhengruifeng/unblockify_gmm.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-09-23 15:54:56 +08:00
HyukjinKwon 942f577b6e [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI
### What changes were proposed in this pull request?

This PR proposes to add a way to select Hadoop and Hive versions in pip installation.
Users can select Hive or Hadoop versions as below:

```bash
HADOOP_VERSION=3.2 pip install pyspark
HIVE_VERSION=1.2 pip install pyspark
HIVE_VERSION=1.2 HADOOP_VERSION=2.7 pip install pyspark
```

When the environment variables are set, internally it downloads the corresponding Spark version and then sets the Spark home to it. Also this PR exposes a mirror to set as an environment variable, `PYSPARK_RELEASE_MIRROR`.

**Please NOTE that:**
- We cannot currently leverage pip's native installation option, for example:

    ```bash
    pip install pyspark --install-option="hadoop3.2"
    ```

    This is because of a limitation and bug in pip itself. Once they fix this issue, we can switch from the environment variables to the proper installation options, see SPARK-32837.

    It IS possible to workaround but very ugly or hacky with a big change. See [this PR](https://github.com/microsoft/nni/pull/139/files) as an example.

- In pip installation, we pack the relevant jars together. This PR _does not touch existing packaging way_ in order to prevent any behaviour changes.

  Once this experimental way is proven to be safe, we can avoid packing the relevant jars together (and keep only the relevant Python scripts). And downloads the Spark distribution as this PR proposes.

- This way is sort of consistent with SparkR:

  SparkR provides a method `SparkR::install.spark` to support CRAN installation. This is fine because SparkR is provided purely as a R library. For example, `sparkr` script is not packed together.

  PySpark cannot take this approach because PySpark packaging ships relevant executable script together, e.g.) `pyspark` shell.

  If PySpark has a method such as `pyspark.install_spark`, users cannot call it in `pyspark` because `pyspark` already assumes relevant Spark is installed, JVM is launched, etc.

- There looks no way to release that contains different Hadoop or Hive to PyPI due to [the version semantics](https://www.python.org/dev/peps/pep-0440/). This is not an option.

  The usual way looks either `--install-option` above with hacks or environment variables given my investigation.

### Why are the changes needed?

To provide users the options to select Hadoop and Hive versions.

### Does this PR introduce _any_ user-facing change?

Yes, users will be able to select Hive and Hadoop version as below when they install it from `pip`;

```bash
HADOOP_VERSION=3.2 pip install pyspark
HIVE_VERSION=1.2 pip install pyspark
HIVE_VERSION=1.2 HADOOP_VERSION=2.7 pip install pyspark
```

### How was this patch tested?

Unit tests were added. I also manually tested in Mac and Windows (after building Spark with `python/dist/pyspark-3.1.0.dev0.tar.gz`):

```bash
./build/mvn -DskipTests -Phive-thriftserver clean package
```

Mac:

```bash
SPARK_VERSION=3.0.1 HADOOP_VERSION=3.2 pip install pyspark-3.1.0.dev0.tar.gz
```

Windows:

```bash
set HADOOP_VERSION=3.2
set SPARK_VERSION=3.0.1
pip install pyspark-3.1.0.dev0.tar.gz
```

Closes #29703 from HyukjinKwon/SPARK-32017.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-23 09:30:51 +09:00
zero323 779f0a84ea [SPARK-32933][PYTHON] Use keyword-only syntax for keyword_only methods
### What changes were proposed in this pull request?

This PR adjusts signatures of methods decorated with `keyword_only` to indicate  using [Python 3 keyword-only syntax](https://www.python.org/dev/peps/pep-3102/).

__Note__:

For the moment the goal is not to replace `keyword_only`. For justification see https://github.com/apache/spark/pull/29591#discussion_r489402579

### Why are the changes needed?

Right now it is not clear that `keyword_only` methods are indeed keyword only. This proposal addresses that.

In practice we could probably capture `locals` and drop `keyword_only` completel, i.e:

```python
keyword_only
def __init__(self, *, featuresCol="features"):
    ...
    kwargs = self._input_kwargs
    self.setParams(**kwargs)
```

could be replaced with

```python
def __init__(self, *, featuresCol="features"):
    kwargs = locals()
    del kwargs["self"]
    ...
    self.setParams(**kwargs)
```

### Does this PR introduce _any_ user-facing change?

Docstrings and inspect tools will now indicate that `keyword_only` methods expect only keyword arguments.

For example with ` LinearSVC` will change from

```
>>> from pyspark.ml.classification import LinearSVC
>>> ?LinearSVC.__init__
Signature:
LinearSVC.__init__(
    self,
    featuresCol='features',
    labelCol='label',
    predictionCol='prediction',
    maxIter=100,
    regParam=0.0,
    tol=1e-06,
    rawPredictionCol='rawPrediction',
    fitIntercept=True,
    standardization=True,
    threshold=0.0,
    weightCol=None,
    aggregationDepth=2,
)
Docstring: __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction",                  maxIter=100, regParam=0.0, tol=1e-6, rawPredictionCol="rawPrediction",                  fitIntercept=True, standardization=True, threshold=0.0, weightCol=None,                  aggregationDepth=2):
File:      /path/to/python/pyspark/ml/classification.py
Type:      function
```

to

```
>>> from pyspark.ml.classification import LinearSVC
>>> ?LinearSVC.__init__
Signature:
LinearSVC.__init__   (
    self,
    *,
    featuresCol='features',
    labelCol='label',
    predictionCol='prediction',
    maxIter=100,
    regParam=0.0,
    tol=1e-06,
    rawPredictionCol='rawPrediction',
    fitIntercept=True,
    standardization=True,
    threshold=0.0,
    weightCol=None,
    aggregationDepth=2,
    blockSize=1,
)
Docstring: __init__(self, \*, featuresCol="features", labelCol="label", predictionCol="prediction",                  maxIter=100, regParam=0.0, tol=1e-6, rawPredictionCol="rawPrediction",                  fitIntercept=True, standardization=True, threshold=0.0, weightCol=None,                  aggregationDepth=2, blockSize=1):
File:      ~/Workspace/spark/python/pyspark/ml/classification.py
Type:      function
```

### How was this patch tested?

Existing tests.

Closes #29799 from zero323/SPARK-32933.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-23 09:28:33 +09:00
Max Gekk 7c14f177eb [SPARK-32306][SQL][DOCS] Clarify the result of percentile_approx()
### What changes were proposed in this pull request?
More precise description of the result of the `percentile_approx()` function and its synonym `approx_percentile()`. The proposed sentence clarifies that  the function returns **one of elements** (or array of elements) from the input column.

### Why are the changes needed?
To improve Spark docs and avoid misunderstanding of the function behavior.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
`./dev/scalastyle`

Closes #29835 from MaxGekk/doc-percentile_approx.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2020-09-22 12:45:19 -07:00
Takuya UESHIN 5440ea84ee [SPARK-32312][DOC][FOLLOWUP] Fix the minimum version of PyArrow in the installation guide
### What changes were proposed in this pull request?

Now that the minimum version of PyArrow is `1.0.0`, we should update the version in the installation guide.

### Why are the changes needed?

The minimum version of PyArrow was upgraded to `1.0.0`.

### Does this PR introduce _any_ user-facing change?

Users see the correct minimum version in the installation guide.

### How was this patch tested?

N/A

Closes #29829 from ueshin/issues/SPARK-32312/doc.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-22 11:04:14 +09:00
itholic 9c653c957f [SPARK-32189][DOCS][PYTHON] Development - Setting up IDEs
### What changes were proposed in this pull request?

This PR proposes to document the way of setting up IDEs

![스크린샷 2020-09-21 오전 10 43 12](https://user-images.githubusercontent.com/44108233/93727715-5c2a6e80-fbf7-11ea-821b-555723b00bc8.png)
![스크린샷 2020-09-21 오전 10 43 45](https://user-images.githubusercontent.com/44108233/93727716-5f255f00-fbf7-11ea-9c6c-7b8a973bc511.png)

### Why are the changes needed?

To let users know how to setup IDEs

### Does this PR introduce _any_ user-facing change?

Yes, it adds a new page in the documentation about setting IDEs.

### How was this patch tested?

Manually built the doc.

Closes #29781 from itholic/SPARK-32189.

Authored-by: itholic <haejoon309@naver.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-21 12:29:17 +09:00
zero323 7fb9f6884f [SPARK-32799][R][SQL] Add allowMissingColumns to SparkR unionByName
### What changes were proposed in this pull request?

Add optional `allowMissingColumns` argument to SparkR `unionByName`.

### Why are the changes needed?

Feature parity.

### Does this PR introduce _any_ user-facing change?

`unionByName` supports `allowMissingColumns`.

### How was this patch tested?

Existing unit tests. New unit tests targeting this feature.

Closes #29813 from zero323/SPARK-32799.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-21 09:39:34 +09:00
HyukjinKwon f893a19c4c [SPARK-32180][PYTHON][DOCS][FOLLOW-UP] Rephrase and add some more information in installation guide
### What changes were proposed in this pull request?

This PR:
- rephrases some wordings in installation guide to avoid using the terms that can be potentially ambiguous such as "different favors"
- documents extra dependency installation `pip install pyspark[sql]`
- uses the link that corresponds to the released version. e.g.) https://spark.apache.org/docs/latest/building-spark.html vs https://spark.apache.org/docs/3.0.0/building-spark.html
- adds some more details

I built it on Read the Docs to make it easier to review: https://hyukjin-spark.readthedocs.io/en/stable/getting_started/install.html

### Why are the changes needed?

To improve installation guide.

### Does this PR introduce _any_ user-facing change?

Yes, it updates the user-facing installation guide.

### How was this patch tested?

Manually built the doc and tested.

Closes #29779 from HyukjinKwon/SPARK-32180.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-20 10:58:17 +09:00
HyukjinKwon 657e39a334 [SPARK-32897][PYTHON] Don't show a deprecation warning at SparkSession.builder.getOrCreate
### What changes were proposed in this pull request?

In PySpark shell, if you call `SparkSession.builder.getOrCreate` as below:

```python
import warnings
from pyspark.sql import SparkSession, SQLContext
warnings.simplefilter('always', DeprecationWarning)
spark.stop()
SparkSession.builder.getOrCreate()
```

it shows the deprecation warning as below:

```
/.../spark/python/pyspark/sql/context.py:72: DeprecationWarning: Deprecated in 3.0.0. Use SparkSession.builder.getOrCreate() instead.
  DeprecationWarning)
```

via d3304268d3/python/pyspark/sql/session.py (L222)

We shouldn't print the deprecation warning from it. This is the only place ^.

### Why are the changes needed?

To prevent to inform users that `SparkSession.builder.getOrCreate` is deprecated mistakenly.

### Does this PR introduce _any_ user-facing change?

Yes, it won't show a deprecation warning to end users for calling `SparkSession.builder.getOrCreate`.

### How was this patch tested?

Manually tested as above.

Closes #29768 from HyukjinKwon/SPARK-32897.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2020-09-16 10:13:47 -07:00
zero323 c918909c1a [SPARK-32814][PYTHON] Replace __metaclass__ field with metaclass keyword
### What changes were proposed in this pull request?

Replace `__metaclass__` fields with `metaclass` keyword in the class statements.

### Why are the changes needed?

`__metaclass__` is no longer supported in Python 3. This means, for example, that types are no longer handled as singletons.

```
>>> from pyspark.sql.types import BooleanType
>>> BooleanType() is BooleanType()
False
```

and classes, which suppose to be abstract, are not

```
>>> import inspect
>>> from pyspark.ml import Estimator
>>> inspect.isabstract(Estimator)
False
```

### Does this PR introduce _any_ user-facing change?

Yes (classes which were no longer abstract or singleton in Python 3, are now), though visible changes should be consider a bug-fix.

### How was this patch tested?

Existing tests.

Closes #29664 from zero323/SPARK-32138-FOLLOW-UP-METACLASS.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-16 20:22:11 +09:00
Adam Binford e884290587 [SPARK-32835][PYTHON] Add withField method to the pyspark Column class
### What changes were proposed in this pull request?

This PR adds a `withField` method on the pyspark Column class to call the Scala API method added in https://github.com/apache/spark/pull/27066.

### Why are the changes needed?

To update the Python API to match a new feature in the Scala API.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New unit test

Closes #29699 from Kimahriman/feature/pyspark-with-field.

Authored-by: Adam Binford <adam.binford@radiantsolutions.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-16 20:18:36 +09:00
Liang-Chi Hsieh 550c1c9cfb [SPARK-32888][DOCS] Add user document about header flag and RDD as path for reading CSV
### What changes were proposed in this pull request?

This proposes to enhance user document of the API for loading a Dataset of strings storing CSV rows. If the header option is set to true, the API will remove all lines same with the header.

### Why are the changes needed?

This behavior can confuse users. We should explicitly document it.

### Does this PR introduce _any_ user-facing change?

No. Only doc change.

### How was this patch tested?

Only doc change.

Closes #29765 from viirya/SPARK-32888.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-16 20:16:15 +09:00
Abhishek Dixit 6f36db1fa5 [SPARK-31448][PYTHON] Fix storage level used in persist() in dataframe.py
### What changes were proposed in this pull request?
Since the data is serialized on the Python side, we should make cache() in PySpark dataframes use StorageLevel.MEMORY_AND_DISK mode which has deserialized=false. This change was done to `pyspark/rdd.py` as part of SPARK-2014 but was missed from `pyspark/dataframe.py`

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Using existing tests

Closes #29242 from abhishekd0907/SPARK-31448.

Authored-by: Abhishek Dixit <abhishekdixit0907@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-09-15 08:41:22 -05:00
Sean Owen ce566bed17 [SPARK-32180][FOLLOWUP] Fix .rst error in new Pyspark installation guide
This simply fixes an .rst generation error in https://github.com/apache/spark/pull/29640

Closes #29735 from srowen/SPARK-32180.2.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2020-09-11 20:08:22 -07:00
Rohit.Mishra f6322d1cb1 [SPARK-32180][PYTHON][DOCS] Installation page of Getting Started in PySpark documentation
### What changes were proposed in this pull request?
This PR proposes to add getting started- installation to new PySpark docs.

### Why are the changes needed?
Better documentation.

### Does this PR introduce _any_ user-facing change?
No. Documentation only.

### How was this patch tested?
Generating documents locally.

Closes #29640 from rohitmishr1484/SPARK-32180-Getting-Started-Installation.

Authored-by: Rohit.Mishra <rohit.mishra@utopusinsights.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-09-11 10:38:01 -05:00
Bryan Cutler e0538bd38c [SPARK-32312][SQL][PYTHON][TEST-JAVA11] Upgrade Apache Arrow to version 1.0.1
### What changes were proposed in this pull request?

Upgrade Apache Arrow to version 1.0.1 for the Java dependency and increase minimum version of PyArrow to 1.0.0.

This release marks a transition to binary stability of the columnar format (which was already informally backward-compatible going back to December 2017) and a transition to Semantic Versioning for the Arrow software libraries. Also note that the Java arrow-memory artifact has been split to separate dependence on netty-buffer and allow users to select an allocator. Spark will continue to use `arrow-memory-netty` to maintain performance benefits.

Version 1.0.0 - 1.0.0 include the following selected fixes/improvements relevant to Spark users:

ARROW-9300 - [Java] Separate Netty Memory to its own module
ARROW-9272 - [C++][Python] Reduce complexity in python to arrow conversion
ARROW-9016 - [Java] Remove direct references to Netty/Unsafe Allocators
ARROW-8664 - [Java] Add skip null check to all Vector types
ARROW-8485 - [Integration][Java] Implement extension types integration
ARROW-8434 - [C++] Ipc RecordBatchFileReader deserializes the Schema multiple times
ARROW-8314 - [Python] Provide a method to select a subset of columns of a Table
ARROW-8230 - [Java] Move Netty memory manager into a separate module
ARROW-8229 - [Java] Move ArrowBuf into the Arrow package
ARROW-7955 - [Java] Support large buffer for file/stream IPC
ARROW-7831 - [Java] unnecessary buffer allocation when calling splitAndTransferTo on variable width vectors
ARROW-6111 - [Java] Support LargeVarChar and LargeBinary types and add integration test with C++
ARROW-6110 - [Java] Support LargeList Type and add integration test with C++
ARROW-5760 - [C++] Optimize Take implementation
ARROW-300 - [Format] Add body buffer compression option to IPC message protocol using LZ4 or ZSTD
ARROW-9098 - RecordBatch::ToStructArray cannot handle record batches with 0 column
ARROW-9066 - [Python] Raise correct error in isnull()
ARROW-9223 - [Python] Fix to_pandas() export for timestamps within structs
ARROW-9195 - [Java] Wrong usage of Unsafe.get from bytearray in ByteFunctionsHelper class
ARROW-7610 - [Java] Finish support for 64 bit int allocations
ARROW-8115 - [Python] Conversion when mixing NaT and datetime objects not working
ARROW-8392 - [Java] Fix overflow related corner cases for vector value comparison
ARROW-8537 - [C++] Performance regression from ARROW-8523
ARROW-8803 - [Java] Row count should be set before loading buffers in VectorLoader
ARROW-8911 - [C++] Slicing a ChunkedArray with zero chunks segfaults

View release notes here:
https://arrow.apache.org/release/1.0.1.html
https://arrow.apache.org/release/1.0.0.html

### Why are the changes needed?

Upgrade brings fixes, improvements and stability guarantees.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests with pyarrow 1.0.0 and 1.0.1

Closes #29686 from BryanCutler/arrow-upgrade-100-SPARK-32312.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-10 14:16:19 +09:00
Wenchen Fan f7995c576a Revert "[SPARK-32677][SQL] Load function resource before create"
This reverts commit 05fcf26b79.
2020-09-09 18:15:22 +00:00
itholic c8c082ce38 [SPARK-32812][PYTHON][TESTS] Avoid initiating a process during the main process for run-tests.py
### What changes were proposed in this pull request?

In certain environments, seems it fails to run `run-tests.py` script as below:

```
Traceback (most recent call last):
 File "<string>", line 1, in <module>
...

raise RuntimeError('''
RuntimeError:
 An attempt has been made to start a new process before the
 current process has finished its bootstrapping phase.

This probably means that you are not using fork to start your
 child processes and you have forgotten to use the proper idiom
 in the main module:

if __name__ == '__main__':
 freeze_support()
 ...

The "freeze_support()" line can be omitted if the program
 is not going to be frozen to produce an executable.
Traceback (most recent call last):
...
 raise EOFError
EOFError

```

The reason is that `Manager.dict()` launches another process when the main process is initiated.

It works in most environments for an unknown reason but it should be good to avoid such pattern as guided from Python itself.

### Why are the changes needed?

To prevent the test failure for Python.

### Does this PR introduce _any_ user-facing change?

No, it fixes a test script.

### How was this patch tested?

Manually ran the script after fixing.

```
Running PySpark tests. Output is in /.../python/unit-tests.log
Will test against the following Python executables: ['/.../python3', 'python3.8']
Will test the following Python tests: ['pyspark.sql.dataframe']
/.../python3 python_implementation is CPython
/.../python3 version is: Python 3.8.5
python3.8 python_implementation is CPython
python3.8 version is: Python 3.8.5
Starting test(/.../python3): pyspark.sql.dataframe
Starting test(python3.8): pyspark.sql.dataframe
Finished test(/.../python3): pyspark.sql.dataframe (33s)
Finished test(python3.8): pyspark.sql.dataframe (34s)
Tests passed in 34 seconds
```

Closes #29666 from itholic/SPARK-32812.

Authored-by: itholic <haejoon309@naver.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-08 12:22:13 +09:00
HyukjinKwon c336ae39cd [SPARK-32186][DOCS][PYTHON] Development - Debugging
### What changes were proposed in this pull request?

This PR proposes to document the way of debugging PySpark. It's pretty much self-descriptive.

I made a demo site to review it more effectively: https://hyukjin-spark.readthedocs.io/en/stable/development/debugging.html

### Why are the changes needed?

To let users know how to debug PySpark applications.

### Does this PR introduce _any_ user-facing change?

Yes, it adds a new page in the documentation about debugging PySpark.

### How was this patch tested?

Manually built the doc.

Closes #29639 from HyukjinKwon/SPARK-32186.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-08 10:32:22 +09:00
itholic 8bd3770552 [SPARK-32798][PYTHON] Make unionByName optionally fill missing columns with nulls in PySpark
### What changes were proposed in this pull request?

This PR proposes to add new argument `allowMissingColumns` to `unionByName` for allowing users to specify whether to allow missing columns or not.

### Why are the changes needed?

To expose `allowMissingColumns` argument in Python API also. Currently this is only exposed in Scala/Java APIs.

### Does this PR introduce _any_ user-facing change?

Yes, it adds a new examples with new argument in the docstring.

### How was this patch tested?

Doctest added and manually tested

```
$ python/run-tests --testnames pyspark.sql.dataframe
Running PySpark tests. Output is in /.../spark/python/unit-tests.log
Will test against the following Python executables: ['/.../python3', 'python3.8']
Will test the following Python tests: ['pyspark.sql.dataframe']
/.../python3 python_implementation is CPython
/.../python3 version is: Python 3.8.5
python3.8 python_implementation is CPython
python3.8 version is: Python 3.8.5
Starting test(/.../python3): pyspark.sql.dataframe
Starting test(python3.8): pyspark.sql.dataframe
Finished test(python3.8): pyspark.sql.dataframe (35s)
Finished test(/.../python3): pyspark.sql.dataframe (35s)
Tests passed in 35 seconds
```

Closes #29657 from itholic/SPARK-32798.

Authored-by: itholic <haejoon309@naver.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-08 09:41:02 +09:00
ulysses 05fcf26b79 [SPARK-32677][SQL] Load function resource before create
### What changes were proposed in this pull request?

Change `CreateFunctionCommand` code that add class check before create function.

### Why are the changes needed?

We have different behavior between create permanent function and temporary function when function class is invaild. e.g.,
```
create function f as 'test.non.exists.udf';
-- Time taken: 0.104 seconds

create temporary function f as 'test.non.exists.udf'
-- Error in query: Can not load class 'test.non.exists.udf' when registering the function 'f', please make sure it is on the classpath;
```

And Hive also fails both of them.

### Does this PR introduce _any_ user-facing change?

Yes, user will get exception when create a invalid udf.

### How was this patch tested?

New test.

Closes #29502 from ulysses-you/function.

Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-07 06:00:23 +00:00
HyukjinKwon 32d87c2b59 [SPARK-32783][DOCS][PYTHON] Development - Testing PySpark
### What changes were proposed in this pull request?

This PR proposes to add a page to describe how to test PySpark. Note that it avoids duplication of https://spark.apache.org/developer-tools.html and it more aims to add put the relevant links together.

I made a demo site to review more effectively: https://hyukjin-spark.readthedocs.io/en/stable/development/testing.html

### Why are the changes needed?

To guide PySpark developers easily test.

### Does this PR introduce _any_ user-facing change?

Yes, it will adds a new documentation page.

### How was this patch tested?

Manually tested.

Closes #29634 from HyukjinKwon/SPARK-32783.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-04 18:06:25 +09:00
HyukjinKwon d80c85c2e3 [SPARK-32191][FOLLOW-UP][PYTHON][DOCS] Indent the table and reword the main page in migration guide
### What changes were proposed in this pull request?

This PR is a minor followup to fix:

1. Slightly reword the wording in the main page.

2. The indentation in the table at the migration guide;

    from

    ![Screen Shot 2020-09-01 at 1 53 40 PM](https://user-images.githubusercontent.com/6477701/91796204-91781800-ec5a-11ea-9f57-d7a9f4207ba0.png)

    to

    ![Screen Shot 2020-09-01 at 1 53 26 PM](https://user-images.githubusercontent.com/6477701/91796202-9046eb00-ec5a-11ea-9db2-815139ddfdb9.png)

### Why are the changes needed?

In order to show the migration guide pretty.

### Does this PR introduce _any_ user-facing change?

Yes, this is a change to user-facing documentation.

### How was this patch tested?

Manually built the documentation.

Closes #29606 from HyukjinKwon/SPARK-32191.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-01 15:08:03 +09:00
HyukjinKwon 86ca90ccd7 [SPARK-32190][PYTHON][DOCS] Development - Contribution Guide in PySpark
### What changes were proposed in this pull request?

This PR proposes to document PySpark specific contribution guides at "Development" section.

Here is the demo for reviewing quicker: https://hyukjin-spark.readthedocs.io/en/stable/development/contributing.html

### Why are the changes needed?

To have a single place for PySpark users, and better documentation.

### Does this PR introduce _any_ user-facing change?

Yes, it is a new documentation. See the demo linked above.

### How was this patch tested?

```bash
cd docs
SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll serve --watch
```

and

```bash
cd python/docs
make clean html
```

Closes #29596 from HyukjinKwon/SPARK-32190.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-01 14:20:07 +09:00
HyukjinKwon eaaf783148 [MINOR][DOCS] Fix the Binder link to point the quickstart notebook correctly
### What changes were proposed in this pull request?

This PR fixes the link of Binder in Quickstart notebook and documentation.

From:

https://mybinder.org/v2/gh/databricks/apache/master?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb

To:

https://mybinder.org/v2/gh/apache/spark/master?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb

This link is the same as the one in RST files:

b54103016a/python/docs/source/conf.py (L57)

### Why are the changes needed?

The link was wrong, and points out non-existent file and repo.

### Does this PR introduce _any_ user-facing change?

Yes, it will fixes the link so users can correctly try Binder.

### How was this patch tested?

Manually tested by building the documentation.

Closes #29597 from HyukjinKwon/minor-link-quickstart.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-31 22:00:56 +09:00
zero323 5574734093 [SPARK-32138][FOLLOW-UP] Drop obsolete StringIO import branching
### What changes were proposed in this pull request?

Removal of branched `StringIO` import.

### Why are the changes needed?

Top level `StringIO` is no longer present in Python 3.x.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #29590 from zero323/SPARK-32138-FOLLOW-UP.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-31 16:56:50 +09:00
Fokko Driesprong a1e459ed9f [SPARK-32719][PYTHON] Add Flake8 check missing imports
https://issues.apache.org/jira/browse/SPARK-32719

### What changes were proposed in this pull request?

Add a check to detect missing imports. This makes sure that if we use a specific class, it should be explicitly imported (not using a wildcard).

### Why are the changes needed?

To make sure that the quality of the Python code is up to standard.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing unit-tests and Flake8 static analysis

Closes #29563 from Fokko/fd-add-check-missing-imports.

Authored-by: Fokko Driesprong <fokko@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-31 11:23:31 +09:00