Commit graph

1035 commits

Author SHA1 Message Date
HyukjinKwon d530ed0ea8 Revert "[SPARK-33277][PYSPARK][SQL] Use ContextAwareIterator to stop consuming after the task ends"
This reverts commit b8a440f098.
2020-11-05 16:15:17 +09:00
zero323 4c8ee8856c [SPARK-33257][PYTHON][SQL] Support Column inputs in PySpark ordering functions (asc*, desc*)
### What changes were proposed in this pull request?

This PR adds support for passing `Column`s as input to PySpark sorting functions.

### Why are the changes needed?

According to SPARK-26979, PySpark functions should support both Column and str arguments, when possible.

### Does this PR introduce _any_ user-facing change?

PySpark users can now provide both `Column` and `str` as an argument for `asc*` and `desc*` functions.
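
For illustration, a minimal sketch (data and column name are made up, and `spark` is assumed to be an active SparkSession); both forms below should now be accepted:

```python
from pyspark.sql.functions import asc, asc_nulls_last, col

df = spark.createDataFrame([(2,), (1,), (3,)], ["id"])
df.orderBy(asc("id")).show()                  # str input, as before
df.orderBy(asc(col("id"))).show()             # Column input, now supported
df.orderBy(asc_nulls_last(col("id"))).show()  # also works for the *_nulls_* variants
```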

### How was this patch tested?

New unit tests.

Closes #30227 from zero323/SPARK-33257.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-03 22:50:59 +09:00
HyukjinKwon 3959f0d987 [SPARK-33250][PYTHON][DOCS] Migration to NumPy documentation style in SQL (pyspark.sql.*)
### What changes were proposed in this pull request?

This PR proposes to migrate to [NumPy documentation style](https://numpydoc.readthedocs.io/en/latest/format.html), see also SPARK-33243.
While I am migrating, I also fixed some Python type hints accordingly.

### Why are the changes needed?

For better documentation, both as plain text and as the generated HTML.

### Does this PR introduce _any_ user-facing change?

Yes, users will see better-formatted HTML and text. See SPARK-33243.

### How was this patch tested?

Manually tested via running `./dev/lint-python`.

Closes #30181 from HyukjinKwon/SPARK-33250.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-03 10:00:49 +09:00
Max Gekk bdabf60fb4 [SPARK-33299][SQL][DOCS] Don't mention schemas in JSON format in docs for from_json
### What changes were proposed in this pull request?
Remove the JSON formatted schema from comments for `from_json()` in Scala/Python APIs.

Closes #30201

### Why are the changes needed?
Schemas in JSON format are internal (not documented) and shouldn't be recommended for use.
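
As a hedged illustration (column name and schema are made up), the documented alternatives are a DDL-formatted string or a `StructType`:

```python
from pyspark.sql.functions import from_json, col

df = spark.createDataFrame([('{"a": 1, "b": "x"}',)], ["value"])
df.select(from_json(col("value"), "a INT, b STRING").alias("parsed")).show()
```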

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By linters.

Closes #30226 from MaxGekk/from_json-common-schema-parsing-2.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-11-02 10:10:24 -08:00
Takuya UESHIN b8a440f098 [SPARK-33277][PYSPARK][SQL] Use ContextAwareIterator to stop consuming after the task ends
### What changes were proposed in this pull request?

As the Python evaluation consumes the parent iterator in a separate thread, it could consume more data from the parent even after the task ends and the parent is closed. Thus, we should use `ContextAwareIterator` to stop consuming after the task ends.

### Why are the changes needed?

A Python/Pandas UDF right after an off-heap vectorized reader could cause an executor crash.

E.g.,:

```py
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

# `path` is any writable Parquet location
spark.range(0, 100000, 1, 1).write.parquet(path)

spark.conf.set("spark.sql.columnVector.offheap.enabled", True)

def f(x):
    return 0

fUdf = udf(f, LongType())

spark.read.parquet(path).select(fUdf('id')).head()
```

This is because the Python evaluation consumes the parent iterator in a separate thread, and it consumes more data from the parent even after the task ends and the parent is closed. If an off-heap column vector exists in the parent iterator, it could cause a segmentation fault which crashes the executor.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added tests, and manually.

Closes #30177 from ueshin/issues/SPARK-33277/python_pandas_udf.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-01 20:28:12 +09:00
Daniel Himmelstein 56587f076d [SPARK-33310][PYTHON] Relax pyspark typing for sql str functions
### What changes were proposed in this pull request?

Relax pyspark typing for sql str functions. These functions all pass the first argument through `_to_java_column`, such that a string or Column object is acceptable.
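
A hedged sketch of what such a relaxed stub signature could look like (the alias name and the chosen functions are illustrative, not the exact stub contents):

```python
# functions.pyi (illustrative sketch only)
from typing import Union
from pyspark.sql.column import Column

ColumnOrName = Union[Column, str]

def upper(col: ColumnOrName) -> Column: ...
def trim(col: ColumnOrName) -> Column: ...
```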

### Why are the changes needed?

Convenience & ensuring the typing reflects the functionality

### Does this PR introduce _any_ user-facing change?

Yes, a backwards-compatible increase in functionality. But I think typing support is unreleased, so possibly no change to released versions.

### How was this patch tested?

Not tested. I am newish to Python typing with stubs, so someone should confirm this is the correct way to fix this.

Closes #30209 from dhimmel/patch-1.

Authored-by: Daniel Himmelstein <daniel.himmelstein@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-01 19:09:12 +09:00
Max Gekk b409025641 [SPARK-33281][SQL] Return SQL schema instead of Catalog string from the SchemaOfCsv expression
### What changes were proposed in this pull request?
Return schema in SQL format instead of Catalog string from the SchemaOfCsv expression.

### Why are the changes needed?
To unify output of the `schema_of_json()` and `schema_of_csv()`.

### Does this PR introduce _any_ user-facing change?
Yes, they can, but `schema_of_csv()` is usually used in combination with `from_csv()`, so the schema format shouldn't matter much.

Before:
```
> SELECT schema_of_csv('1,abc');
  struct<_c0:int,_c1:string>
```

After:
```
> SELECT schema_of_csv('1,abc');
  STRUCT<`_c0`: INT, `_c1`: STRING>
```

### How was this patch tested?
By existing test suites `CsvFunctionsSuite` and `CsvExpressionsSuite`.

Closes #30180 from MaxGekk/schema_of_csv-sql-schema.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-29 21:02:10 +09:00
Max Gekk 9d5e48ea95 [SPARK-33270][SQL] Return SQL schema instead of Catalog string from the SchemaOfJson expression
### What changes were proposed in this pull request?
Return schema in SQL format instead of Catalog string from the `SchemaOfJson` expression.

### Why are the changes needed?
In some cases, `from_json()` cannot parse schemas returned by `schema_of_json`, for instance, when JSON fields have spaces (gaps). Such fields will be quoted after the changes, and can be parsed by `from_json()`.

Here is the example:
```scala
val in = Seq("""{"a b": 1}""").toDS()
in.select(from_json('value, schema_of_json("""{"a b": 100}""")) as "parsed")
```
raises the exception:
```
== SQL ==
struct<a b:bigint>
------^^^

	at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:263)
	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:130)
	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseTableSchema(ParseDriver.scala:76)
	at org.apache.spark.sql.types.DataType$.fromDDL(DataType.scala:131)
	at org.apache.spark.sql.catalyst.expressions.ExprUtils$.evalTypeExpr(ExprUtils.scala:33)
	at org.apache.spark.sql.catalyst.expressions.JsonToStructs.<init>(jsonExpressions.scala:537)
	at org.apache.spark.sql.functions$.from_json(functions.scala:4141)
```

### Does this PR introduce _any_ user-facing change?
Yes. For example, `schema_of_json` for the input `{"col":0}`.

Before: `struct<col:bigint>`
After: ``STRUCT<`col`: BIGINT>``

### How was this patch tested?
By existing test suites `JsonFunctionsSuite` and `JsonExpressionsSuite`.

Closes #30172 from MaxGekk/schema_of_json-sql-schema.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-29 10:30:41 +09:00
Takeshi Yamamuro a6216e2446 [SPARK-33268][SQL][PYTHON] Fix bugs for casting data from/to PythonUserDefinedType
### What changes were proposed in this pull request?

This PR intends to fix bugs for casting data from/to PythonUserDefinedType. A sequence of queries to reproduce this issue is as follows:
```
>>> from pyspark.sql import Row
>>> from pyspark.sql.functions import col
>>> from pyspark.sql.types import *
>>> from pyspark.testing.sqlutils import *
>>>
>>> row = Row(point=ExamplePoint(1.0, 2.0))
>>> df = spark.createDataFrame([row])
>>> df.select(col("point").cast(PythonOnlyUDT()))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/maropu/Repositories/spark/spark-master/python/pyspark/sql/dataframe.py", line 1402, in select
    jdf = self._jdf.select(self._jcols(*cols))
  File "/Users/maropu/Repositories/spark/spark-master/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/Users/maropu/Repositories/spark/spark-master/python/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/Users/maropu/Repositories/spark/spark-master/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o44.select.
: java.lang.NullPointerException
	at org.apache.spark.sql.types.UserDefinedType.acceptsType(UserDefinedType.scala:84)
	at org.apache.spark.sql.catalyst.expressions.Cast$.canCast(Cast.scala:96)
	at org.apache.spark.sql.catalyst.expressions.CastBase.checkInputDataTypes(Cast.scala:267)
	at org.apache.spark.sql.catalyst.expressions.CastBase.resolved$lzycompute(Cast.scala:290)
	at org.apache.spark.sql.catalyst.expressions.CastBase.resolved(Cast.scala:290)
```
The root cause of this issue is that, since `PythonUserDefinedType#userClass` is always null, `isAssignableFrom` in `UserDefinedType#acceptsType` throws a NullPointerException. To fix it, this PR defines `acceptsType` in `PythonUserDefinedType` and filters out the null case in `UserDefinedType#acceptsType`.

### Why are the changes needed?

Bug fixes.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added tests.

Closes #30169 from maropu/FixPythonUDTCast.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-28 08:33:02 -07:00
zero323 4e6a310f80 [SPARK-32084][PYTHON][SQL] Expand dictionary functions
### What changes were proposed in this pull request?

- [x] Expand dictionary definitions into standalone functions.
- [x] Fix annotations for ordering functions.

### Why are the changes needed?

To simplify further maintenance of docstrings.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #30143 from zero323/SPARK-32084.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-27 11:05:53 +09:00
HyukjinKwon 66005a3236 [SPARK-31964][PYTHON][FOLLOW-UP] Use is_categorical_dtype instead of deprecated is_categorical
### What changes were proposed in this pull request?

This PR is a small followup of https://github.com/apache/spark/pull/28793 and  proposes to use `is_categorical_dtype` instead of deprecated `is_categorical`.

`is_categorical_dtype` exists from the minimum pandas version we support (https://github.com/pandas-dev/pandas/blob/v0.23.2/pandas/core/dtypes/api.py), and `is_categorical` was deprecated in pandas 1.1.0 (87a1cc21ca).

### Why are the changes needed?

To avoid using deprecated APIs, and remove warnings.

### Does this PR introduce _any_ user-facing change?

Yes, it will remove warnings that say `is_categorical` is deprecated.

### How was this patch tested?

By running any pandas UDF with pandas 1.1.0+:

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf

def func(x: pd.Series) -> pd.Series:
    return x

spark.range(10).select(pandas_udf(func, "long")("id")).show()
```

Before:

```
/.../python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py:151: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
...
```

After:

```
...
```

Closes #30114 from HyukjinKwon/replace-deprecated-is_categorical.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
2020-10-21 14:46:47 -07:00
xuewei.linxuewei 388e067a90 [SPARK-33139][SQL][FOLLOW-UP] Avoid using reflect call on session.py
### What changes were proposed in this pull request?

In [SPARK-33139](https://github.com/apache/spark/pull/30042), I was using a reflective "Class.forName" call in Python code to invoke a method on SparkSession, which is not recommended. This follow-up uses getattr to access "SparkSession$.MODULE$" instead.
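
A hedged, illustrative sketch of the pattern (the exact call sites in session.py may differ): a Scala companion object's class name ends with `$` and its singleton instance lives in the static `MODULE$` field, so `getattr` is used rather than reflective `Class.forName`:

```python
jvm = spark._jvm  # Py4J JVM view from an active SparkSession (assumed available)
companion_cls = getattr(jvm.org.apache.spark.sql, "SparkSession$")
active = getattr(companion_cls, "MODULE$").getActiveSession()
```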

### Why are the changes needed?

Code refine.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?

Existing tests.

Closes #30092 from leanken/leanken-SPARK-33139-followup.

Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-19 16:40:48 +09:00
xuewei.linxuewei 306872eefa [SPARK-33139][SQL] Protect setActiveSession and clearActiveSession
### What changes were proposed in this pull request?

This PR is a sub-task of [SPARK-33138](https://issues.apache.org/jira/browse/SPARK-33138). In order to make SQLConf.get reliable and stable, we need to make sure users can't pollute the SQLConf and SparkSession context by calling setActiveSession and clearActiveSession.

Changes in this PR:

* add a legacy config, spark.sql.legacy.allowModifyActiveSession, to fall back to the old behavior if users really need to call these two APIs
* by default, calling these two APIs throws an exception
* add two extra internal and private APIs, setActiveSessionInternal and clearActiveSessionInternal, for current internal usage
* change all internal references to the new internal APIs, except for SQLContext.setActive and SQLContext.clearActive

### Why are the changes needed?

Make SQLConf.get reliable and stable.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?

* Add UT in SparkSessionBuilderSuite to test the legacy config
* Existing test

Closes #30042 from leanken/leanken-SPARK-33139.

Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-16 06:05:17 +00:00
Chuliang Xiao 81d3a8eeca [MINOR][PYTHON] Fix the typo in the docstring of method agg()
### What changes were proposed in this pull request?
Change `df.groupBy.agg()` to `df.groupBy().agg()` in the docstring of `agg()`

### Why are the changes needed?
Fix typo in a docstring

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No

Closes #30060 from ChuliangXiao/patch-1.

Authored-by: Chuliang Xiao <ChuliangX@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-10-15 17:24:22 -07:00
zero323 3beab8d8a8 [SPARK-32793][FOLLOW-UP] Minor corrections for PySpark annotations and SparkR
### What changes were proposed in this pull request?

- Annotated return types of `assert_true` and `raise_error` as discussed [here](https://github.com/apache/spark/pull/29947#pullrequestreview-504495801).
- Add `assert_true` and `raise_error`  to SparkR NAMESPACE.
- Validating message vector size in SparkR as discussed [here](https://github.com/apache/spark/pull/29947#pullrequestreview-504539004).

### Why are the changes needed?

As discussed in review for #29947.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- Existing tests.
- Validation of annotations using MyPy

Closes #29978 from zero323/SPARK-32793-FOLLOW-UP.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-09 09:50:45 +09:00
Karen Feng 39510b0e9b [SPARK-32793][SQL] Add raise_error function, adds error message parameter to assert_true
## What changes were proposed in this pull request?

Adds a SQL function `raise_error` which underlies the refactored `assert_true` function. `assert_true` now also (optionally) accepts a custom error message field.
`raise_error` is exposed in SQL, Python, Scala, and R.
`assert_true` was previously only exposed in SQL; it is now also exposed in Python, Scala, and R.
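
A minimal PySpark sketch (illustrative data, assuming an active `spark` session) of the two functions:

```python
from pyspark.sql.functions import assert_true, col, lit, raise_error

df = spark.range(5)
df.select(assert_true(col("id") < 10, lit("id out of range"))).collect()  # passes
# Uncommenting the next line raises an error carrying the custom message:
# df.select(raise_error(lit("something went wrong"))).collect()
```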

### Why are the changes needed?

Improves usability of `assert_true` by clarifying error messaging, and adds the useful helper function `raise_error`.

### Does this PR introduce _any_ user-facing change?

Yes:
- Adds `raise_error` function to the SQL, Python, Scala, and R APIs.
- Adds `assert_true` function to the SQL, Python and R APIs.

### How was this patch tested?

Adds unit tests in SQL, Python, Scala, and R for `assert_true` and `raise_error`.

Closes #29947 from karenfeng/spark-32793.

Lead-authored-by: Karen Feng <karen.feng@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-08 12:05:39 +09:00
zero323 473b3ba6aa [SPARK-32511][FOLLOW-UP][SQL][R][PYTHON] Add dropFields to SparkR and PySpark
### What changes were proposed in this pull request?

This PR adds `dropFields` method to:

- PySpark `Column`
- SparkR `Column`

### Why are the changes needed?

Feature parity.

### Does this PR introduce _any_ user-facing change?

No, new API.
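
A hedged PySpark sketch (struct and field names are made up) of the new `Column.dropFields`:

```python
from pyspark.sql.functions import col, lit, struct

df = spark.range(1).select(struct(lit(1).alias("a"), lit(2).alias("b")).alias("s"))
df.withColumn("s", col("s").dropFields("b")).printSchema()  # only field `a` remains in `s`
```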

### How was this patch tested?

- New unit tests.
- Manual verification of examples / doctests.
- Manual run of MyPy tests

Closes #29967 from zero323/SPARK-32511-FOLLOW-UP-PYSPARK-SPARKR.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-08 10:37:42 +09:00
zero323 72da6f86cf [SPARK-33002][PYTHON] Remove non-API annotations
### What changes were proposed in this pull request?

This PR:

- removes annotations for modules which are not part of the public API.
- removes `__init__.pyi` files, if no annotations, beyond exports, are present.

### Why are the changes needed?

Primarily to reduce maintenance overhead and as requested in the comments to https://github.com/apache/spark/pull/29591

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests and additional MyPy checks:

```
mypy --no-incremental --config python/mypy.ini python/pyspark
MYPYPATH=python/ mypy --no-incremental --config python/mypy.ini examples/src/main/python/ml examples/src/main/python/sql examples/src/main/python/sql/streaming
```

Closes #29879 from zero323/SPARK-33002.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-07 19:53:59 +09:00
Bryan Cutler 0812d6c17c [SPARK-33073][PYTHON] Improve error handling on Pandas to Arrow conversion failures
### What changes were proposed in this pull request?

This improves error handling when a failure in the conversion from Pandas to Arrow occurs, and fixes tests to be compatible with the upcoming Arrow 2.0.0 release.

### Why are the changes needed?

Current tests will fail with Arrow 2.0.0 because of a change in the error message when the schema is invalid. For these cases, the current error message also includes information on disabling the safe conversion config, which is mainly meant for floating point truncation and overflow. The tests have been updated to use a message that is shown by both past and upcoming Arrow versions.

If the user enters an invalid schema, the error produced by pyarrow is not consistent: it is either a `TypeError` or an `ArrowInvalid`, with the latter being caught and re-raised as a `RuntimeError` with the extra info.

The error handling is improved by (see the sketch after this list):

- narrowing the exception type to `TypeError`s, of which `ArrowInvalid` is a subclass and which is raised on safe conversion failures.
- raising the exception with the additional information on disabling "spark.sql.execution.pandas.convertToArrowArraySafely" only if that config is enabled in the first place.
- chaining the original exception to better show it to the user.
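
A generic sketch of the described pattern (not the actual Spark source; `series`, `arrow_type`, and the function name are placeholders):

```python
import pyarrow as pa

def to_arrow(series, arrow_type, safecheck_enabled):
    try:
        return pa.Array.from_pandas(series, type=arrow_type, safe=safecheck_enabled)
    except TypeError as e:
        hint = (" Disable 'spark.sql.execution.pandas.convertToArrowArraySafely' to bypass"
                " the safe-conversion check." if safecheck_enabled else "")
        # Chain the original pyarrow error so its details remain visible to the user.
        raise ValueError("Pandas to Arrow conversion failed.%s" % hint) from e
```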

### Does this PR introduce _any_ user-facing change?

Yes, the re-raised error changes from a RuntimeError to a ValueError, which better categorizes this type of error and is in line with the original Arrow error.

### How was this patch tested?

Existing tests, using pyarrow 1.0.1 and 2.0.0-snapshot

Closes #29951 from BryanCutler/arrow-better-handle-pandas-errors-SPARK-33073.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-06 18:11:24 +09:00
Max Gekk 1b60ff5afe [MINOR][DOCS] Document when current_date and current_timestamp are evaluated
### What changes were proposed in this pull request?
Explicitly document that `current_date` and `current_timestamp` are evaluated at the start of query evaluation, and that all calls of `current_date`/`current_timestamp` within the same query return the same value.
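
For illustration (assuming an active `spark` session), both columns below hold the same value because the function is folded once at the start of query evaluation:

```python
from pyspark.sql.functions import current_timestamp

spark.range(1).select(
    current_timestamp().alias("t1"),
    current_timestamp().alias("t2"),
).show(truncate=False)  # t1 == t2 within the same query
```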

### Why are the changes needed?
Users could expect that `current_date` and `current_timestamp` return the current date/timestamp at the moment of query execution but in fact the functions are folded by the optimizer at the start of query evaluation:
0df8dd6073/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala (L71-L91)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
by running `./dev/scalastyle`.

Closes #29892 from MaxGekk/doc-current_date.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-29 05:20:12 +00:00
HyukjinKwon 6868b40517 [SPARK-33020][PYTHON] Add nth_value as a PySpark function
### What changes were proposed in this pull request?

`nth_value` was added at SPARK-27951. This PR adds the corresponding PySpark API.
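
A hedged sketch of the new API (data and window spec are made up):

```python
from pyspark.sql import Window
from pyspark.sql.functions import nth_value

df = spark.createDataFrame([("a", 1), ("a", 2), ("a", 3), ("b", 4)], ["grp", "id"])
w = (Window.partitionBy("grp").orderBy("id")
     .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
df.select("grp", "id", nth_value("id", 2).over(w).alias("second")).show()
```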

### Why are the changes needed?

To support the consistent APIs

### Does this PR introduce _any_ user-facing change?

Yes, it introduces a new PySpark function API.

### How was this patch tested?

Unittest was added.

Closes #29899 from HyukjinKwon/SPARK-33020.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-09-28 22:14:28 -07:00
HyukjinKwon 376ede1301 [SPARK-33021][PYTHON][TESTS] Move functions related test cases into test_functions.py
### What changes were proposed in this pull request?

Move functions related test cases from `test_context.py` to `test_functions.py`.

### Why are the changes needed?

To group the similar test cases.

### Does this PR introduce _any_ user-facing change?

Nope, test-only.

### How was this patch tested?

Jenkins and GitHub Actions should test.

Closes #29898 from HyukjinKwon/SPARK-33021.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-09-28 21:54:00 -07:00
zero323 31a16fbb40 [SPARK-32714][PYTHON] Initial pyspark-stubs port
### What changes were proposed in this pull request?

This PR proposes migration of [`pyspark-stubs`](https://github.com/zero323/pyspark-stubs) into Spark codebase.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

Yes. This PR adds type annotations directly to Spark source.

This can impact interaction with development tools for users who haven't used `pyspark-stubs`.

### How was this patch tested?

- [x] MyPy tests of the PySpark source
    ```
    mypy --no-incremental --config python/mypy.ini python/pyspark
    ```
- [x] MyPy tests of Spark examples
    ```
   MYPYPATH=python/ mypy --no-incremental --config python/mypy.ini examples/src/main/python/ml examples/src/main/python/sql examples/src/main/python/sql/streaming
    ```
- [x] Existing Flake8 linter

- [x] Existing unit tests

Tested against:

- `mypy==0.790+dev.e959952d9001e9713d329a2f9b196705b028f894`
- `mypy==0.782`

Closes #29591 from zero323/SPARK-32681.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-24 14:15:36 +09:00
zero323 779f0a84ea [SPARK-32933][PYTHON] Use keyword-only syntax for keyword_only methods
### What changes were proposed in this pull request?

This PR adjusts signatures of methods decorated with `keyword_only` to indicate  using [Python 3 keyword-only syntax](https://www.python.org/dev/peps/pep-3102/).

__Note__:

For the moment the goal is not to replace `keyword_only`. For justification see https://github.com/apache/spark/pull/29591#discussion_r489402579

### Why are the changes needed?

Right now it is not clear that `keyword_only` methods are indeed keyword only. This proposal addresses that.

In practice we could probably capture `locals` and drop `keyword_only` completely, i.e.:

```python
@keyword_only
def __init__(self, *, featuresCol="features"):
    ...
    kwargs = self._input_kwargs
    self.setParams(**kwargs)
```

could be replaced with

```python
def __init__(self, *, featuresCol="features"):
    kwargs = locals()
    del kwargs["self"]
    ...
    self.setParams(**kwargs)
```

### Does this PR introduce _any_ user-facing change?

Docstrings and inspect tools will now indicate that `keyword_only` methods expect only keyword arguments.

For example, `LinearSVC` will change from

```
>>> from pyspark.ml.classification import LinearSVC
>>> ?LinearSVC.__init__
Signature:
LinearSVC.__init__(
    self,
    featuresCol='features',
    labelCol='label',
    predictionCol='prediction',
    maxIter=100,
    regParam=0.0,
    tol=1e-06,
    rawPredictionCol='rawPrediction',
    fitIntercept=True,
    standardization=True,
    threshold=0.0,
    weightCol=None,
    aggregationDepth=2,
)
Docstring: __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction",                  maxIter=100, regParam=0.0, tol=1e-6, rawPredictionCol="rawPrediction",                  fitIntercept=True, standardization=True, threshold=0.0, weightCol=None,                  aggregationDepth=2):
File:      /path/to/python/pyspark/ml/classification.py
Type:      function
```

to

```
>>> from pyspark.ml.classification import LinearSVC
>>> ?LinearSVC.__init__
Signature:
LinearSVC.__init__(
    self,
    *,
    featuresCol='features',
    labelCol='label',
    predictionCol='prediction',
    maxIter=100,
    regParam=0.0,
    tol=1e-06,
    rawPredictionCol='rawPrediction',
    fitIntercept=True,
    standardization=True,
    threshold=0.0,
    weightCol=None,
    aggregationDepth=2,
    blockSize=1,
)
Docstring: __init__(self, \*, featuresCol="features", labelCol="label", predictionCol="prediction",                  maxIter=100, regParam=0.0, tol=1e-6, rawPredictionCol="rawPrediction",                  fitIntercept=True, standardization=True, threshold=0.0, weightCol=None,                  aggregationDepth=2, blockSize=1):
File:      ~/Workspace/spark/python/pyspark/ml/classification.py
Type:      function
```

### How was this patch tested?

Existing tests.

Closes #29799 from zero323/SPARK-32933.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-23 09:28:33 +09:00
Max Gekk 7c14f177eb [SPARK-32306][SQL][DOCS] Clarify the result of percentile_approx()
### What changes were proposed in this pull request?
More precise description of the result of the `percentile_approx()` function and its synonym `approx_percentile()`. The proposed sentence clarifies that the function returns **one of the elements** (or an array of elements) from the input column.
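
For illustration (made-up data; `expr` is used so the sketch does not assume a dedicated Python wrapper), the result is always one of the input elements rather than an interpolated value:

```python
from pyspark.sql.functions import expr

df = spark.createDataFrame([(0,), (1,), (2,), (10,)], ["v"])
df.select(expr("percentile_approx(v, 0.5)").alias("approx_median")).show()  # an actual element of `v`
```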

### Why are the changes needed?
To improve Spark docs and avoid misunderstanding of the function behavior.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
`./dev/scalastyle`

Closes #29835 from MaxGekk/doc-percentile_approx.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2020-09-22 12:45:19 -07:00
zero323 7fb9f6884f [SPARK-32799][R][SQL] Add allowMissingColumns to SparkR unionByName
### What changes were proposed in this pull request?

Add optional `allowMissingColumns` argument to SparkR `unionByName`.

### Why are the changes needed?

Feature parity.

### Does this PR introduce _any_ user-facing change?

`unionByName` supports `allowMissingColumns`.

### How was this patch tested?

Existing unit tests. New unit tests targeting this feature.

Closes #29813 from zero323/SPARK-32799.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-21 09:39:34 +09:00
HyukjinKwon 657e39a334 [SPARK-32897][PYTHON] Don't show a deprecation warning at SparkSession.builder.getOrCreate
### What changes were proposed in this pull request?

In PySpark shell, if you call `SparkSession.builder.getOrCreate` as below:

```python
import warnings
from pyspark.sql import SparkSession, SQLContext
warnings.simplefilter('always', DeprecationWarning)
spark.stop()
SparkSession.builder.getOrCreate()
```

it shows the deprecation warning as below:

```
/.../spark/python/pyspark/sql/context.py:72: DeprecationWarning: Deprecated in 3.0.0. Use SparkSession.builder.getOrCreate() instead.
  DeprecationWarning)
```

via d3304268d3/python/pyspark/sql/session.py (L222)

We shouldn't print the deprecation warning from it. This is the only place ^.

### Why are the changes needed?

To avoid mistakenly informing users that `SparkSession.builder.getOrCreate` is deprecated.

### Does this PR introduce _any_ user-facing change?

Yes, it won't show a deprecation warning to end users for calling `SparkSession.builder.getOrCreate`.

### How was this patch tested?

Manually tested as above.

Closes #29768 from HyukjinKwon/SPARK-32897.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2020-09-16 10:13:47 -07:00
zero323 c918909c1a [SPARK-32814][PYTHON] Replace __metaclass__ field with metaclass keyword
### What changes were proposed in this pull request?

Replace `__metaclass__` fields with `metaclass` keyword in the class statements.
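
A minimal before/after sketch of the replacement (class and method names are illustrative):

```python
from abc import ABCMeta, abstractmethod

# Python 2 style: the __metaclass__ field is silently ignored by Python 3.
class OldEstimator(object):
    __metaclass__ = ABCMeta

# Python 3 style: the metaclass keyword actually takes effect.
class NewEstimator(metaclass=ABCMeta):
    @abstractmethod
    def fit(self, dataset): ...
```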

### Why are the changes needed?

`__metaclass__` is no longer supported in Python 3. This means, for example, that types are no longer handled as singletons.

```
>>> from pyspark.sql.types import BooleanType
>>> BooleanType() is BooleanType()
False
```

and classes which are supposed to be abstract are not:

```
>>> import inspect
>>> from pyspark.ml import Estimator
>>> inspect.isabstract(Estimator)
False
```

### Does this PR introduce _any_ user-facing change?

Yes (classes which were no longer abstract or singletons in Python 3 are now), though visible changes should be considered a bug fix.

### How was this patch tested?

Existing tests.

Closes #29664 from zero323/SPARK-32138-FOLLOW-UP-METACLASS.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-16 20:22:11 +09:00
Adam Binford e884290587 [SPARK-32835][PYTHON] Add withField method to the pyspark Column class
### What changes were proposed in this pull request?

This PR adds a `withField` method on the pyspark Column class to call the Scala API method added in https://github.com/apache/spark/pull/27066.
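
A hedged PySpark sketch (struct and field names are made up) of the new `Column.withField`:

```python
from pyspark.sql.functions import col, lit, struct

df = spark.range(1).select(struct(lit(1).alias("a")).alias("s"))
df.withColumn("s", col("s").withField("b", lit(2))).printSchema()  # `s` now has fields `a` and `b`
```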

### Why are the changes needed?

To update the Python API to match a new feature in the Scala API.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New unit test

Closes #29699 from Kimahriman/feature/pyspark-with-field.

Authored-by: Adam Binford <adam.binford@radiantsolutions.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-16 20:18:36 +09:00
Liang-Chi Hsieh 550c1c9cfb [SPARK-32888][DOCS] Add user document about header flag and RDD as path for reading CSV
### What changes were proposed in this pull request?

This proposes to enhance the user document of the API for loading a Dataset of strings storing CSV rows. If the header option is set to true, the API will remove all lines identical to the header.
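
For illustration (made-up data; PySpark's `DataFrameReader.csv` also accepts an RDD of strings), every line equal to the header is dropped, not just the first one:

```python
rows = spark.sparkContext.parallelize(["a,b", "1,2", "a,b", "3,4"])
spark.read.csv(rows, header=True).show()  # the repeated "a,b" line is removed as well
```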

### Why are the changes needed?

This behavior can confuse users. We should explicitly document it.

### Does this PR introduce _any_ user-facing change?

No. Only doc change.

### How was this patch tested?

Only doc change.

Closes #29765 from viirya/SPARK-32888.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-16 20:16:15 +09:00
Abhishek Dixit 6f36db1fa5 [SPARK-31448][PYTHON] Fix storage level used in persist() in dataframe.py
### What changes were proposed in this pull request?
Since the data is serialized on the Python side, we should make cache() in PySpark DataFrames use the StorageLevel.MEMORY_AND_DISK mode, which has deserialized=false. This change was done to `pyspark/rdd.py` as part of SPARK-2014 but was missed from `pyspark/dataframe.py`.
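
A quick way to check the effect (illustrative; the exact `StorageLevel` string may vary by version):

```python
df = spark.range(10)
df.cache()
print(df.storageLevel)  # expected to be a serialized MEMORY_AND_DISK level after this change
```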

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Using existing tests

Closes #29242 from abhishekd0907/SPARK-31448.

Authored-by: Abhishek Dixit <abhishekdixit0907@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-09-15 08:41:22 -05:00
Bryan Cutler e0538bd38c [SPARK-32312][SQL][PYTHON][TEST-JAVA11] Upgrade Apache Arrow to version 1.0.1
### What changes were proposed in this pull request?

Upgrade Apache Arrow to version 1.0.1 for the Java dependency and increase minimum version of PyArrow to 1.0.0.

This release marks a transition to binary stability of the columnar format (which was already informally backward-compatible going back to December 2017) and a transition to Semantic Versioning for the Arrow software libraries. Also note that the Java arrow-memory artifact has been split to separate dependence on netty-buffer and allow users to select an allocator. Spark will continue to use `arrow-memory-netty` to maintain performance benefits.

Versions 1.0.0 and 1.0.1 include the following selected fixes/improvements relevant to Spark users:

ARROW-9300 - [Java] Separate Netty Memory to its own module
ARROW-9272 - [C++][Python] Reduce complexity in python to arrow conversion
ARROW-9016 - [Java] Remove direct references to Netty/Unsafe Allocators
ARROW-8664 - [Java] Add skip null check to all Vector types
ARROW-8485 - [Integration][Java] Implement extension types integration
ARROW-8434 - [C++] Ipc RecordBatchFileReader deserializes the Schema multiple times
ARROW-8314 - [Python] Provide a method to select a subset of columns of a Table
ARROW-8230 - [Java] Move Netty memory manager into a separate module
ARROW-8229 - [Java] Move ArrowBuf into the Arrow package
ARROW-7955 - [Java] Support large buffer for file/stream IPC
ARROW-7831 - [Java] unnecessary buffer allocation when calling splitAndTransferTo on variable width vectors
ARROW-6111 - [Java] Support LargeVarChar and LargeBinary types and add integration test with C++
ARROW-6110 - [Java] Support LargeList Type and add integration test with C++
ARROW-5760 - [C++] Optimize Take implementation
ARROW-300 - [Format] Add body buffer compression option to IPC message protocol using LZ4 or ZSTD
ARROW-9098 - RecordBatch::ToStructArray cannot handle record batches with 0 column
ARROW-9066 - [Python] Raise correct error in isnull()
ARROW-9223 - [Python] Fix to_pandas() export for timestamps within structs
ARROW-9195 - [Java] Wrong usage of Unsafe.get from bytearray in ByteFunctionsHelper class
ARROW-7610 - [Java] Finish support for 64 bit int allocations
ARROW-8115 - [Python] Conversion when mixing NaT and datetime objects not working
ARROW-8392 - [Java] Fix overflow related corner cases for vector value comparison
ARROW-8537 - [C++] Performance regression from ARROW-8523
ARROW-8803 - [Java] Row count should be set before loading buffers in VectorLoader
ARROW-8911 - [C++] Slicing a ChunkedArray with zero chunks segfaults

View release notes here:
https://arrow.apache.org/release/1.0.1.html
https://arrow.apache.org/release/1.0.0.html

### Why are the changes needed?

Upgrade brings fixes, improvements and stability guarantees.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests with pyarrow 1.0.0 and 1.0.1

Closes #29686 from BryanCutler/arrow-upgrade-100-SPARK-32312.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-10 14:16:19 +09:00
Wenchen Fan f7995c576a Revert "[SPARK-32677][SQL] Load function resource before create"
This reverts commit 05fcf26b79.
2020-09-09 18:15:22 +00:00
itholic 8bd3770552 [SPARK-32798][PYTHON] Make unionByName optionally fill missing columns with nulls in PySpark
### What changes were proposed in this pull request?

This PR proposes to add a new argument, `allowMissingColumns`, to `unionByName`, allowing users to specify whether to allow missing columns or not.
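
A minimal sketch of the new argument (data is made up); the column `b`, missing from `df2`, is filled with nulls:

```python
df1 = spark.createDataFrame([(1, 2)], ["a", "b"])
df2 = spark.createDataFrame([(3,)], ["a"])
df1.unionByName(df2, allowMissingColumns=True).show()
```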

### Why are the changes needed?

To expose the `allowMissingColumns` argument in the Python API as well. Currently this is only exposed in the Scala/Java APIs.

### Does this PR introduce _any_ user-facing change?

Yes, it adds new examples with the new argument in the docstring.

### How was this patch tested?

Doctest added and manually tested

```
$ python/run-tests --testnames pyspark.sql.dataframe
Running PySpark tests. Output is in /.../spark/python/unit-tests.log
Will test against the following Python executables: ['/.../python3', 'python3.8']
Will test the following Python tests: ['pyspark.sql.dataframe']
/.../python3 python_implementation is CPython
/.../python3 version is: Python 3.8.5
python3.8 python_implementation is CPython
python3.8 version is: Python 3.8.5
Starting test(/.../python3): pyspark.sql.dataframe
Starting test(python3.8): pyspark.sql.dataframe
Finished test(python3.8): pyspark.sql.dataframe (35s)
Finished test(/.../python3): pyspark.sql.dataframe (35s)
Tests passed in 35 seconds
```

Closes #29657 from itholic/SPARK-32798.

Authored-by: itholic <haejoon309@naver.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-08 09:41:02 +09:00
ulysses 05fcf26b79 [SPARK-32677][SQL] Load function resource before create
### What changes were proposed in this pull request?

Change the `CreateFunctionCommand` code to add a class check before creating the function.

### Why are the changes needed?

We have different behavior between creating a permanent function and a temporary function when the function class is invalid, e.g.:
```
create function f as 'test.non.exists.udf';
-- Time taken: 0.104 seconds

create temporary function f as 'test.non.exists.udf'
-- Error in query: Can not load class 'test.non.exists.udf' when registering the function 'f', please make sure it is on the classpath;
```

And Hive also fails both of them.

### Does this PR introduce _any_ user-facing change?

Yes, users will get an exception when creating an invalid UDF.

### How was this patch tested?

New test.

Closes #29502 from ulysses-you/function.

Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-07 06:00:23 +00:00
Fokko Driesprong a1e459ed9f [SPARK-32719][PYTHON] Add Flake8 check missing imports
https://issues.apache.org/jira/browse/SPARK-32719

### What changes were proposed in this pull request?

Add a check to detect missing imports. This makes sure that if we use a specific class, it is explicitly imported (not pulled in via a wildcard).

### Why are the changes needed?

To make sure that the quality of the Python code is up to standard.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing unit-tests and Flake8 static analysis

Closes #29563 from Fokko/fd-add-check-missing-imports.

Authored-by: Fokko Driesprong <fokko@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-31 11:23:31 +09:00
HyukjinKwon 5775073a01 [SPARK-32722][PYTHON][DOCS] Update document type conversion for Pandas UDFs (pyarrow 1.0.1, pandas 1.1.1, Python 3.7)
### What changes were proposed in this pull request?

This PR updates the chart generated at SPARK-25666. We bumped up the minimal PyArrow version. It's better to use PyArrow 0.15.1+

### Why are the changes needed?

To track the changes in type coercion of PySpark <> PyArrow <> pandas.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Use this code to generate the chart:

```python
from pyspark.sql.types import *
from pyspark.sql.functions import pandas_udf

columns = [
    ('none', 'object(NoneType)'),
    ('bool', 'bool'),
    ('int8', 'int8'),
    ('int16', 'int16'),
    ('int32', 'int32'),
    ('int64', 'int64'),
    ('uint8', 'uint8'),
    ('uint16', 'uint16'),
    ('uint32', 'uint32'),
    ('uint64', 'uint64'),
    ('float16', 'float16'),
    ('float32', 'float32'),
    ('float64', 'float64'),
    ('date', 'datetime64[ns]'),
    ('tz_aware_dates', 'datetime64[ns, US/Eastern]'),
    ('string', 'object(string)'),
    ('decimal', 'object(Decimal)'),
    ('array', 'object(array[int32])'),
    ('float128', 'float128'),
    ('complex64', 'complex64'),
    ('complex128', 'complex128'),
    ('category', 'category'),
    ('tdeltas', 'timedelta64[ns]'),
]

def create_dataframe():
    import pandas as pd
    import numpy as np
    import decimal
    pdf = pd.DataFrame({
        'none': [None, None],
        'bool': [True, False],
        'int8': np.arange(1, 3).astype('int8'),
        'int16': np.arange(1, 3).astype('int16'),
        'int32': np.arange(1, 3).astype('int32'),
        'int64': np.arange(1, 3).astype('int64'),
        'uint8': np.arange(1, 3).astype('uint8'),
        'uint16': np.arange(1, 3).astype('uint16'),
        'uint32': np.arange(1, 3).astype('uint32'),
        'uint64': np.arange(1, 3).astype('uint64'),
        'float16': np.arange(1, 3).astype('float16'),
        'float32': np.arange(1, 3).astype('float32'),
        'float64': np.arange(1, 3).astype('float64'),
        'float128': np.arange(1, 3).astype('float128'),
        'complex64': np.arange(1, 3).astype('complex64'),
        'complex128': np.arange(1, 3).astype('complex128'),
        'string': list('ab'),
        'array': pd.Series([np.array([1, 2, 3], dtype=np.int32), np.array([1, 2, 3], dtype=np.int32)]),
        'decimal': pd.Series([decimal.Decimal('1'), decimal.Decimal('2')]),
        'date': pd.date_range('19700101', periods=2).values,
        'category': pd.Series(list("AB")).astype('category')})
    pdf['tdeltas'] = [pdf.date.diff()[1], pdf.date.diff()[0]]
    pdf['tz_aware_dates'] = pd.date_range('19700101', periods=2, tz='US/Eastern')
    return pdf

types =  [
    BooleanType(),
    ByteType(),
    ShortType(),
    IntegerType(),
    LongType(),
    FloatType(),
    DoubleType(),
    DateType(),
    TimestampType(),
    StringType(),
    DecimalType(10, 0),
    ArrayType(IntegerType()),
    MapType(StringType(), IntegerType()),
    StructType([StructField("_1", IntegerType())]),
    BinaryType(),
]

df = spark.range(2).repartition(1)
results = []
count = 0
total = len(types) * len(columns)
values = []
spark.sparkContext.setLogLevel("FATAL")
for t in types:
    result = []
    for column, pandas_t in columns:
        v = create_dataframe()[column][0]
        values.append(v)
        try:
            row = df.select(pandas_udf(lambda _: create_dataframe()[column], t)(df.id)).first()
            ret_str = repr(row[0])
        except Exception:
            ret_str = "X"
        result.append(ret_str)
        progress = "SQL Type: [%s]\n  Pandas Value(Type): %s(%s)]\n  Result Python Value: [%s]" % (
            t.simpleString(), v, pandas_t, ret_str)
        count += 1
        print("%s/%s:\n  %s" % (count, total, progress))
    results.append([t.simpleString()] + list(map(str, result)))

schema = ["SQL Type \\ Pandas Value(Type)"] + list(map(lambda values_column: "%s(%s)" % (values_column[0], values_column[1][1]), zip(values, columns)))
strings = spark.createDataFrame(results, schema=schema)._jdf.showString(20, 20, False)
print("\n".join(map(lambda line: "    # %s  # noqa" % line, strings.strip().split("\n"))))
```

Closes #29569 from HyukjinKwon/SPARK-32722.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-28 15:38:39 +09:00
HyukjinKwon c154629171 [SPARK-32183][DOCS][PYTHON] User Guide - PySpark Usage Guide for Pandas with Apache Arrow
### What changes were proposed in this pull request?

This PR proposes to move Arrow usage guide from Spark documentation site to PySpark documentation site (at "User Guide").

Here is the demo for reviewing quicker: https://hyukjin-spark.readthedocs.io/en/stable/user_guide/arrow_pandas.html

### Why are the changes needed?

To have a single place for PySpark users, and better documentation.

### Does this PR introduce _any_ user-facing change?

Yes, it will move https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html to our PySpark documentation.

### How was this patch tested?

```bash
cd docs
SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll serve --watch
```

and

```bash
cd python/docs
make clean html
```

Closes #29548 from HyukjinKwon/SPARK-32183.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-28 15:09:06 +09:00
Terry Kim baaa756dee [SPARK-32516][SQL][FOLLOWUP] 'path' option cannot coexist with path parameter for DataFrameWriter.save(), DataStreamReader.load() and DataStreamWriter.start()
### What changes were proposed in this pull request?

This is a follow up PR to #29328 to apply the same constraint where `path` option cannot coexist with path parameter to `DataFrameWriter.save()`, `DataStreamReader.load()` and `DataStreamWriter.start()`.

### Why are the changes needed?

The current behavior silently overwrites the `path` option if path parameter is passed to `DataFrameWriter.save()`, `DataStreamReader.load()` and `DataStreamWriter.start()`.

For example,
```
Seq(1).toDF.write.option("path", "/tmp/path1").parquet("/tmp/path2")
```
will write the result to `/tmp/path2`.

### Does this PR introduce _any_ user-facing change?

Yes, if `path` option coexists with path parameter to any of the above methods, it will throw `AnalysisException`:
```
scala> Seq(1).toDF.write.option("path", "/tmp/path1").parquet("/tmp/path2")
org.apache.spark.sql.AnalysisException: There is a 'path' option set and save() is called with a  path parameter. Either remove the path option, or call save() without the parameter. To ignore this check, set 'spark.sql.legacy.pathOptionBehavior.enabled' to 'true'.;
```

The user can restore the previous behavior by setting `spark.sql.legacy.pathOptionBehavior.enabled` to `true`.

### How was this patch tested?

Added new tests.

Closes #29543 from imback82/path_option.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-27 06:21:04 +00:00
unirt d3304268d3 [MINOR][PYTHON] Fix typo in a docstring of RDD.toDF
### What changes were proposed in this pull request?

Fixes a typo in the docstring of `toDF`.

### Why are the changes needed?

The third argument of `toDF` is actually `sampleRatio`.
related discussion: https://github.com/apache/spark/pull/12746#discussion-diff-62704834

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

This patch doesn't affect any logic, so existing tests should cover it.

Closes #29551 from unirt/minor_fix_docs.

Authored-by: unirt <lunirtc@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-08-26 10:34:49 -07:00
Nicholas Chammas f540031419 [SPARK-31000][PYTHON][SQL] Add ability to set table description via Catalog.createTable()
### What changes were proposed in this pull request?

This PR enhances `Catalog.createTable()` to allow users to set the table's description. This corresponds to the following SQL syntax:

```sql
CREATE TABLE ...
COMMENT 'this is a fancy table';
```
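
A hedged PySpark sketch of the corresponding catalog call (table name, source, and schema are made up):

```python
from pyspark.sql.types import LongType, StructField, StructType

schema = StructType([StructField("id", LongType())])
spark.catalog.createTable(
    "fancy_table",
    source="parquet",
    schema=schema,
    description="this is a fancy table",
)
```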

### Why are the changes needed?

This brings the Scala/Python catalog APIs a bit closer to what's already possible via SQL.

### Does this PR introduce any user-facing change?

Yes, it adds a new parameter to `Catalog.createTable()`.

### How was this patch tested?

Existing unit tests:

```sh
./python/run-tests \
  --python-executables python3.7 \
  --testnames 'pyspark.sql.tests.test_catalog,pyspark.sql.tests.test_context'
```

```
$ ./build/sbt
testOnly org.apache.spark.sql.internal.CatalogSuite org.apache.spark.sql.CachedTableSuite org.apache.spark.sql.hive.MetastoreDataSourcesSuite org.apache.spark.sql.hive.execution.HiveDDLSuite
```

Closes #27908 from nchammas/SPARK-31000-table-description.

Authored-by: Nicholas Chammas <nicholas.chammas@liveramp.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-25 13:42:31 +09:00
Nicholas Chammas 41cf1d093f [SPARK-32686][PYTHON] Un-deprecate inferring DataFrame schema from list of dict
### What changes were proposed in this pull request?

As discussed in https://github.com/apache/spark/pull/29491#discussion_r474451282 and in SPARK-32686, this PR un-deprecates Spark's ability to infer a DataFrame schema from a list of dictionaries. The ability is Pythonic and matches functionality offered by Pandas.

### Why are the changes needed?

This change clarifies to users that this behavior is supported and is not going away in the near future.

### Does this PR introduce _any_ user-facing change?

Yes. There used to be a `UserWarning` for this, but now there isn't.

### How was this patch tested?

I tested this manually.

Before:

```python
>>> spark.createDataFrame(spark.sparkContext.parallelize([{'a': 5}]))
/Users/nchamm/Documents/GitHub/nchammas/spark/python/pyspark/sql/session.py:388: UserWarning: Using RDD of dict to inferSchema is deprecated. Use pyspark.sql.Row instead
  warnings.warn("Using RDD of dict to inferSchema is deprecated. "
DataFrame[a: bigint]

>>> spark.createDataFrame([{'a': 5}])
.../python/pyspark/sql/session.py:378: UserWarning: inferring schema from dict is deprecated,please use pyspark.sql.Row instead
  warnings.warn("inferring schema from dict is deprecated,"
DataFrame[a: bigint]
```

After:

```python
>>> spark.createDataFrame(spark.sparkContext.parallelize([{'a': 5}]))
DataFrame[a: bigint]

>>> spark.createDataFrame([{'a': 5}])
DataFrame[a: bigint]
```

Closes #29510 from nchammas/SPARK-32686-df-dict-infer-schema.

Authored-by: Nicholas Chammas <nicholas.chammas@liveramp.com>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
2020-08-24 14:55:11 -07:00
alexander-daskalov 10edeafc69 [MINOR][SQL] Fixed approx_count_distinct rsd param description
### What changes were proposed in this pull request?

In the docs concerning the approx_count_distinct I have changed the description of the rsd parameter from **_maximum estimation error allowed_** to _**maximum relative standard deviation allowed**_

### Why are the changes needed?

Maximum estimation error allowed can be misleading. You can set the target relative standard deviation, which affects the estimation error, but on given runs the estimation error can still be above the rsd parameter.

### Does this PR introduce _any_ user-facing change?

This PR should make it easier for users reading the docs to understand that the rsd parameter in approx_count_distinct doesn't cap the estimation error, but just sets the target deviation instead.

### How was this patch tested?

No tests, as no code changes were made.

Closes #29424 from Comonut/fix-approx_count_distinct-rsd-param-description.

Authored-by: alexander-daskalov <alexander.daskalov@adevinta.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-08-14 22:10:41 +09:00
Weichen Xu fc62d72076 [MINOR] add test_createDataFrame_empty_partition in pyspark arrow tests
### What changes were proposed in this pull request?
add test_createDataFrame_empty_partition in pyspark arrow tests

### Why are the changes needed?
test edge cases.

### Does this PR introduce _any_ user-facing change?
no.

### How was this patch tested?
N/A

Closes #29398 from WeichenXu123/add_one_pyspark_arrow_test.

Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-10 18:43:41 +09:00
Fokko Driesprong 9fcf0ea718 [SPARK-32319][PYSPARK] Disallow the use of unused imports
Disallow the use of unused imports:

- Unnecessarily increases the memory footprint of the application
- Moves the imports that are required for the examples in the docstring from the file scope to the example itself. This keeps the files themselves clean, and gives a more complete example as it also includes the imports :)

```
fokkodriesprongFan spark % flake8 python | grep -i "imported but unused"
python/pyspark/cloudpickle.py:46:1: F401 'functools.partial' imported but unused
python/pyspark/cloudpickle.py:55:1: F401 'traceback' imported but unused
python/pyspark/heapq3.py:868:5: F401 '_heapq.*' imported but unused
python/pyspark/__init__.py:61:1: F401 'pyspark.version.__version__' imported but unused
python/pyspark/__init__.py:62:1: F401 'pyspark._globals._NoValue' imported but unused
python/pyspark/__init__.py:115:1: F401 'pyspark.sql.SQLContext' imported but unused
python/pyspark/__init__.py:115:1: F401 'pyspark.sql.HiveContext' imported but unused
python/pyspark/__init__.py:115:1: F401 'pyspark.sql.Row' imported but unused
python/pyspark/rdd.py:21:1: F401 're' imported but unused
python/pyspark/rdd.py:29:1: F401 'tempfile.NamedTemporaryFile' imported but unused
python/pyspark/mllib/regression.py:26:1: F401 'pyspark.mllib.linalg.SparseVector' imported but unused
python/pyspark/mllib/clustering.py:28:1: F401 'pyspark.mllib.linalg.SparseVector' imported but unused
python/pyspark/mllib/clustering.py:28:1: F401 'pyspark.mllib.linalg.DenseVector' imported but unused
python/pyspark/mllib/classification.py:26:1: F401 'pyspark.mllib.linalg.SparseVector' imported but unused
python/pyspark/mllib/feature.py:28:1: F401 'pyspark.mllib.linalg.DenseVector' imported but unused
python/pyspark/mllib/feature.py:28:1: F401 'pyspark.mllib.linalg.SparseVector' imported but unused
python/pyspark/mllib/feature.py:30:1: F401 'pyspark.mllib.regression.LabeledPoint' imported but unused
python/pyspark/mllib/tests/test_linalg.py:18:1: F401 'sys' imported but unused
python/pyspark/mllib/tests/test_linalg.py:642:5: F401 'pyspark.mllib.tests.test_linalg.*' imported but unused
python/pyspark/mllib/tests/test_feature.py:21:1: F401 'numpy.random' imported but unused
python/pyspark/mllib/tests/test_feature.py:21:1: F401 'numpy.exp' imported but unused
python/pyspark/mllib/tests/test_feature.py:23:1: F401 'pyspark.mllib.linalg.Vector' imported but unused
python/pyspark/mllib/tests/test_feature.py:23:1: F401 'pyspark.mllib.linalg.VectorUDT' imported but unused
python/pyspark/mllib/tests/test_feature.py:185:5: F401 'pyspark.mllib.tests.test_feature.*' imported but unused
python/pyspark/mllib/tests/test_util.py:97:5: F401 'pyspark.mllib.tests.test_util.*' imported but unused
python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.Vector' imported but unused
python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.SparseVector' imported but unused
python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.DenseVector' imported but unused
python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.VectorUDT' imported but unused
python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg._convert_to_vector' imported but unused
python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.DenseMatrix' imported but unused
python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.SparseMatrix' imported but unused
python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.MatrixUDT' imported but unused
python/pyspark/mllib/tests/test_stat.py:181:5: F401 'pyspark.mllib.tests.test_stat.*' imported but unused
python/pyspark/mllib/tests/test_streaming_algorithms.py:18:1: F401 'time.time' imported but unused
python/pyspark/mllib/tests/test_streaming_algorithms.py:18:1: F401 'time.sleep' imported but unused
python/pyspark/mllib/tests/test_streaming_algorithms.py:470:5: F401 'pyspark.mllib.tests.test_streaming_algorithms.*' imported but unused
python/pyspark/mllib/tests/test_algorithms.py:295:5: F401 'pyspark.mllib.tests.test_algorithms.*' imported but unused
python/pyspark/tests/test_serializers.py:90:13: F401 'xmlrunner' imported but unused
python/pyspark/tests/test_rdd.py:21:1: F401 'sys' imported but unused
python/pyspark/tests/test_rdd.py:29:1: F401 'pyspark.resource.ResourceProfile' imported but unused
python/pyspark/tests/test_rdd.py:885:5: F401 'pyspark.tests.test_rdd.*' imported but unused
python/pyspark/tests/test_readwrite.py:19:1: F401 'sys' imported but unused
python/pyspark/tests/test_readwrite.py:22:1: F401 'array.array' imported but unused
python/pyspark/tests/test_readwrite.py:309:5: F401 'pyspark.tests.test_readwrite.*' imported but unused
python/pyspark/tests/test_join.py:62:5: F401 'pyspark.tests.test_join.*' imported but unused
python/pyspark/tests/test_taskcontext.py:19:1: F401 'shutil' imported but unused
python/pyspark/tests/test_taskcontext.py:325:5: F401 'pyspark.tests.test_taskcontext.*' imported but unused
python/pyspark/tests/test_conf.py:36:5: F401 'pyspark.tests.test_conf.*' imported but unused
python/pyspark/tests/test_broadcast.py:148:5: F401 'pyspark.tests.test_broadcast.*' imported but unused
python/pyspark/tests/test_daemon.py:76:5: F401 'pyspark.tests.test_daemon.*' imported but unused
python/pyspark/tests/test_util.py:77:5: F401 'pyspark.tests.test_util.*' imported but unused
python/pyspark/tests/test_pin_thread.py:19:1: F401 'random' imported but unused
python/pyspark/tests/test_pin_thread.py:149:5: F401 'pyspark.tests.test_pin_thread.*' imported but unused
python/pyspark/tests/test_worker.py:19:1: F401 'sys' imported but unused
python/pyspark/tests/test_worker.py:26:5: F401 'resource' imported but unused
python/pyspark/tests/test_worker.py:203:5: F401 'pyspark.tests.test_worker.*' imported but unused
python/pyspark/tests/test_profiler.py:101:5: F401 'pyspark.tests.test_profiler.*' imported but unused
python/pyspark/tests/test_shuffle.py:18:1: F401 'sys' imported but unused
python/pyspark/tests/test_shuffle.py:171:5: F401 'pyspark.tests.test_shuffle.*' imported but unused
python/pyspark/tests/test_rddbarrier.py:43:5: F401 'pyspark.tests.test_rddbarrier.*' imported but unused
python/pyspark/tests/test_context.py:129:13: F401 'userlibrary.UserClass' imported but unused
python/pyspark/tests/test_context.py:140:13: F401 'userlib.UserClass' imported but unused
python/pyspark/tests/test_context.py:310:5: F401 'pyspark.tests.test_context.*' imported but unused
python/pyspark/tests/test_appsubmit.py:241:5: F401 'pyspark.tests.test_appsubmit.*' imported but unused
python/pyspark/streaming/dstream.py:18:1: F401 'sys' imported but unused
python/pyspark/streaming/tests/test_dstream.py:27:1: F401 'pyspark.RDD' imported but unused
python/pyspark/streaming/tests/test_dstream.py:647:5: F401 'pyspark.streaming.tests.test_dstream.*' imported but unused
python/pyspark/streaming/tests/test_kinesis.py:83:5: F401 'pyspark.streaming.tests.test_kinesis.*' imported but unused
python/pyspark/streaming/tests/test_listener.py:152:5: F401 'pyspark.streaming.tests.test_listener.*' imported but unused
python/pyspark/streaming/tests/test_context.py:178:5: F401 'pyspark.streaming.tests.test_context.*' imported but unused
python/pyspark/testing/utils.py:30:5: F401 'scipy.sparse' imported but unused
python/pyspark/testing/utils.py:36:5: F401 'numpy as np' imported but unused
python/pyspark/ml/regression.py:25:1: F401 'pyspark.ml.tree._TreeEnsembleParams' imported but unused
python/pyspark/ml/regression.py:25:1: F401 'pyspark.ml.tree._HasVarianceImpurity' imported but unused
python/pyspark/ml/regression.py:29:1: F401 'pyspark.ml.wrapper.JavaParams' imported but unused
python/pyspark/ml/util.py:19:1: F401 'sys' imported but unused
python/pyspark/ml/__init__.py:25:1: F401 'pyspark.ml.pipeline' imported but unused
python/pyspark/ml/pipeline.py:18:1: F401 'sys' imported but unused
python/pyspark/ml/stat.py:22:1: F401 'pyspark.ml.linalg.DenseMatrix' imported but unused
python/pyspark/ml/stat.py:22:1: F401 'pyspark.ml.linalg.Vectors' imported but unused
python/pyspark/ml/tests/test_training_summary.py:18:1: F401 'sys' imported but unused
python/pyspark/ml/tests/test_training_summary.py:364:5: F401 'pyspark.ml.tests.test_training_summary.*' imported but unused
python/pyspark/ml/tests/test_linalg.py:381:5: F401 'pyspark.ml.tests.test_linalg.*' imported but unused
python/pyspark/ml/tests/test_tuning.py:427:9: F401 'pyspark.sql.functions as F' imported but unused
python/pyspark/ml/tests/test_tuning.py:757:5: F401 'pyspark.ml.tests.test_tuning.*' imported but unused
python/pyspark/ml/tests/test_wrapper.py:120:5: F401 'pyspark.ml.tests.test_wrapper.*' imported but unused
python/pyspark/ml/tests/test_feature.py:19:1: F401 'sys' imported but unused
python/pyspark/ml/tests/test_feature.py:304:5: F401 'pyspark.ml.tests.test_feature.*' imported but unused
python/pyspark/ml/tests/test_image.py:19:1: F401 'py4j' imported but unused
python/pyspark/ml/tests/test_image.py:22:1: F401 'pyspark.testing.mlutils.PySparkTestCase' imported but unused
python/pyspark/ml/tests/test_image.py:71:5: F401 'pyspark.ml.tests.test_image.*' imported but unused
python/pyspark/ml/tests/test_persistence.py:456:5: F401 'pyspark.ml.tests.test_persistence.*' imported but unused
python/pyspark/ml/tests/test_evaluation.py:56:5: F401 'pyspark.ml.tests.test_evaluation.*' imported but unused
python/pyspark/ml/tests/test_stat.py:43:5: F401 'pyspark.ml.tests.test_stat.*' imported but unused
python/pyspark/ml/tests/test_base.py:70:5: F401 'pyspark.ml.tests.test_base.*' imported but unused
python/pyspark/ml/tests/test_param.py:20:1: F401 'sys' imported but unused
python/pyspark/ml/tests/test_param.py:375:5: F401 'pyspark.ml.tests.test_param.*' imported but unused
python/pyspark/ml/tests/test_pipeline.py:62:5: F401 'pyspark.ml.tests.test_pipeline.*' imported but unused
python/pyspark/ml/tests/test_algorithms.py:333:5: F401 'pyspark.ml.tests.test_algorithms.*' imported but unused
python/pyspark/ml/param/__init__.py:18:1: F401 'sys' imported but unused
python/pyspark/resource/tests/test_resources.py:17:1: F401 'random' imported but unused
python/pyspark/resource/tests/test_resources.py:20:1: F401 'pyspark.resource.ResourceProfile' imported but unused
python/pyspark/resource/tests/test_resources.py:75:5: F401 'pyspark.resource.tests.test_resources.*' imported but unused
python/pyspark/sql/functions.py:32:1: F401 'pyspark.sql.udf.UserDefinedFunction' imported but unused
python/pyspark/sql/functions.py:34:1: F401 'pyspark.sql.pandas.functions.pandas_udf' imported but unused
python/pyspark/sql/session.py:30:1: F401 'pyspark.sql.types.Row' imported but unused
python/pyspark/sql/session.py:30:1: F401 'pyspark.sql.types.StringType' imported but unused
python/pyspark/sql/readwriter.py:1084:5: F401 'pyspark.sql.Row' imported but unused
python/pyspark/sql/context.py:26:1: F401 'pyspark.sql.types.IntegerType' imported but unused
python/pyspark/sql/context.py:26:1: F401 'pyspark.sql.types.Row' imported but unused
python/pyspark/sql/context.py:26:1: F401 'pyspark.sql.types.StringType' imported but unused
python/pyspark/sql/context.py:27:1: F401 'pyspark.sql.udf.UDFRegistration' imported but unused
python/pyspark/sql/streaming.py:1212:5: F401 'pyspark.sql.Row' imported but unused
python/pyspark/sql/tests/test_utils.py:55:5: F401 'pyspark.sql.tests.test_utils.*' imported but unused
python/pyspark/sql/tests/test_pandas_map.py:18:1: F401 'sys' imported but unused
python/pyspark/sql/tests/test_pandas_map.py:22:1: F401 'pyspark.sql.functions.pandas_udf' imported but unused
python/pyspark/sql/tests/test_pandas_map.py:22:1: F401 'pyspark.sql.functions.PandasUDFType' imported but unused
python/pyspark/sql/tests/test_pandas_map.py:119:5: F401 'pyspark.sql.tests.test_pandas_map.*' imported but unused
python/pyspark/sql/tests/test_catalog.py:193:5: F401 'pyspark.sql.tests.test_catalog.*' imported but unused
python/pyspark/sql/tests/test_group.py:39:5: F401 'pyspark.sql.tests.test_group.*' imported but unused
python/pyspark/sql/tests/test_session.py:361:5: F401 'pyspark.sql.tests.test_session.*' imported but unused
python/pyspark/sql/tests/test_conf.py:49:5: F401 'pyspark.sql.tests.test_conf.*' imported but unused
python/pyspark/sql/tests/test_pandas_cogrouped_map.py:19:1: F401 'sys' imported but unused
python/pyspark/sql/tests/test_pandas_cogrouped_map.py:21:1: F401 'pyspark.sql.functions.sum' imported but unused
python/pyspark/sql/tests/test_pandas_cogrouped_map.py:21:1: F401 'pyspark.sql.functions.PandasUDFType' imported but unused
python/pyspark/sql/tests/test_pandas_cogrouped_map.py:29:5: F401 'pandas.util.testing.assert_series_equal' imported but unused
python/pyspark/sql/tests/test_pandas_cogrouped_map.py:32:5: F401 'pyarrow as pa' imported but unused
python/pyspark/sql/tests/test_pandas_cogrouped_map.py:248:5: F401 'pyspark.sql.tests.test_pandas_cogrouped_map.*' imported but unused
python/pyspark/sql/tests/test_udf.py:24:1: F401 'py4j' imported but unused
python/pyspark/sql/tests/test_pandas_udf_typehints.py:246:5: F401 'pyspark.sql.tests.test_pandas_udf_typehints.*' imported but unused
python/pyspark/sql/tests/test_functions.py:19:1: F401 'sys' imported but unused
python/pyspark/sql/tests/test_functions.py:362:9: F401 'pyspark.sql.functions.exists' imported but unused
python/pyspark/sql/tests/test_functions.py:387:5: F401 'pyspark.sql.tests.test_functions.*' imported but unused
python/pyspark/sql/tests/test_pandas_udf_scalar.py:21:1: F401 'sys' imported but unused
python/pyspark/sql/tests/test_pandas_udf_scalar.py:45:5: F401 'pyarrow as pa' imported but unused
python/pyspark/sql/tests/test_pandas_udf_window.py:355:5: F401 'pyspark.sql.tests.test_pandas_udf_window.*' imported but unused
python/pyspark/sql/tests/test_arrow.py:38:5: F401 'pyarrow as pa' imported but unused
python/pyspark/sql/tests/test_pandas_grouped_map.py:20:1: F401 'sys' imported but unused
python/pyspark/sql/tests/test_pandas_grouped_map.py:38:5: F401 'pyarrow as pa' imported but unused
python/pyspark/sql/tests/test_dataframe.py:382:9: F401 'pyspark.sql.DataFrame' imported but unused
python/pyspark/sql/avro/functions.py:125:5: F401 'pyspark.sql.Row' imported but unused
python/pyspark/sql/pandas/functions.py:19:1: F401 'sys' imported but unused
```

After:
```
fokkodriesprongFan spark % flake8 python | grep -i "imported but unused"
fokkodriesprongFan spark %
```

### What changes were proposed in this pull request?

Removing unused imports from the Python files to keep everything nice and tidy.

### Why are the changes needed?

Cleaning up imports that aren't used, and suppressing the warning for imports that are kept only as references for other modules, preserving backward compatibility (see the sketch below).
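
A minimal sketch of the suppression pattern, assuming a name that is only re-exported for downstream imports (the module and class below are illustrative, not copied from the PR):

```py
# Kept so that `from pyspark.ml import Pipeline` keeps working even though this
# module does not use the name itself; the `# noqa: F401` comment tells flake8
# the unused import is intentional.
from pyspark.ml.pipeline import Pipeline  # noqa: F401
```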

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Adding the rule to the existing Flake8 checks.

Closes #29121 from Fokko/SPARK-32319.

Authored-by: Fokko Driesprong <fokko@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-08-08 08:51:57 -07:00
Liang Zhang 2cb48eabdc [SPARK-32549][PYSPARK] Add column name in _infer_schema error message
### What changes were proposed in this pull request?

The current error message from `_infer_type` in `_infer_schema` only includes the unsupported column type but not the column name. This PR adds the column name to the error message to make it easier for users to identify which column they should drop or convert.

### Why are the changes needed?

Improve user experience.

### Does this PR introduce _any_ user-facing change?

Yes. The error message from `_infer_schema` is changed.
Before:
"not supported type: foo"
After:
"Column bar contains not supported type: foo"

### How was this patch tested?

Updated the existing unit test.

Closes #29365 from liangz1/types-error-colname.

Authored-by: Liang Zhang <liang.zhang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-07 11:50:46 +09:00
Max Gekk 7eb6f45688 [SPARK-32499][SQL] Use {} in conversions maps and structs to strings
### What changes were proposed in this pull request?
Change the casting of map and struct values to strings to use `{}` brackets instead of `[]`. The behavior is controlled by the SQL config `spark.sql.legacy.castComplexTypesToString.enabled`. When it is `true`, `CAST` wraps maps and structs in `[]` when casting to strings; otherwise, if it is `false` (the default), maps and structs are wrapped in `{}`.
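
A hedged sketch of the difference (the rendered strings are indicative; exact spacing may vary):

```py
df = spark.sql("SELECT named_struct('a', 1, 'b', 'x') AS s, map('k', 1) AS m")
df.selectExpr("CAST(s AS STRING)", "CAST(m AS STRING)").show()
# Default (spark.sql.legacy.castComplexTypesToString.enabled=false):
#   {1, x}    {k -> 1}
# Legacy (spark.sql.legacy.castComplexTypesToString.enabled=true):
#   [1, x]    [k -> 1]
```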

### Why are the changes needed?
- To distinguish structs/maps from arrays.
- To make `show`'s output consistent with Hive and conversions to Hive strings.
- To display DataFrame content in the same form in `spark-sql` and via `show`
- To be consistent with the `*.sql` tests

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
By existing test suite `CastSuite`.

Closes #29308 from MaxGekk/show-struct-map.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-04 14:57:09 +00:00
Max Gekk b2180c0950 [SPARK-32471][SQL][DOCS][TESTS][PYTHON][SS] Describe JSON option allowNonNumericNumbers
### What changes were proposed in this pull request?
1. Describe the JSON option `allowNonNumericNumbers`, which is used on read (see the sketch after this list)
2. Add new test cases for allowed JSON field values: NaN, +INF, +Infinity, Infinity, -INF and -Infinity
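
A hedged PySpark sketch of reading such values (the inline JSON strings and schema are illustrative):

```py
rdd = spark.sparkContext.parallelize(['{"v": NaN}', '{"v": -Infinity}', '{"v": 1.5}'])
df = (spark.read
      .schema("v DOUBLE")
      .option("allowNonNumericNumbers", "true")  # default is true; set here for clarity
      .json(rdd))
df.show()
# Expected values in column v: NaN, -Infinity, 1.5
```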

### Why are the changes needed?
To improve the UX of Spark SQL and to provide users with full information about the supported option.

### Does this PR introduce _any_ user-facing change?
Yes, in PySpark.

### How was this patch tested?
Added new test to `JsonParsingOptionsSuite`

Closes #29275 from MaxGekk/allowNonNumericNumbers-doc.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-29 12:14:13 +09:00
HyukjinKwon 5491c08bf1 Revert "[SPARK-31525][SQL] Return an empty list for df.head() when df is empty"
This reverts commit 44a5258ac2.
2020-07-29 12:07:35 +09:00
Tianshi Zhu 44a5258ac2 [SPARK-31525][SQL] Return an empty list for df.head() when df is empty
### What changes were proposed in this pull request?

Return an empty list instead of `None` when calling `df.head()` on an empty DataFrame.

### Why are the changes needed?

`df.head()` and `df.head(1)` are inconsistent when `df` is empty.
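
A hedged illustration of the proposed behavior, using `spark.range(0)` as an empty DataFrame:

```py
empty_df = spark.range(0)
empty_df.head(1)  # [] both before and after this change
empty_df.head()   # before: None; after (as proposed here): []
```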

### Does this PR introduce _any_ user-facing change?

Yes. If a user relies on `df.head()` to return None, things like `if df.head() is None:` will be broken.

### How was this patch tested?

Closes #29214 from tianshizz/SPARK-31525.

Authored-by: Tianshi Zhu <zhutianshirea@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-28 12:32:19 +09:00