Commit graph

1113 commits

Author SHA1 Message Date
Linhong Liu 2f700773c2 [SPARK-36224][SQL] Use Void as the type name of NullType
### What changes were proposed in this pull request?
Change the `NullType.simpleString` to "void" to set "void" as the formal type name of `NullType`

### Why are the changes needed?
This PR is intended to address the type name discussion in PR #28833. Here are the reasons:
1. The type name of NullType is displayed everywhere, e.g. schema string, error message, document. Hence it's not possible to hide it from users, we have to choose a proper name
2. The "void" is widely used as the type name of "NULL", e.g. Hive, pgSQL
3. Changing to "void" can enable the round trip of `toDDL`/`fromDDL` for NullType. (i.e. make `from_json(col, schema.toDDL)`) work

### Does this PR introduce _any_ user-facing change?
Yes, the type name of "NULL" is changed from "null" to "void". for example:
```
scala> sql("select null as a, 1 as b").schema.catalogString
res5: String = struct<a:void,b:int>
```

### How was this patch tested?
existing test cases

Closes #33437 from linhongliu-db/SPARK-36224-void-type-name.

Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-02 23:19:54 +08:00
Enrico Minack a65eb36bae [SPARK-36319][SQL][PYTHON] Make Observation return Map instead of Row
### What changes were proposed in this pull request?
The Observation API (Scala, Java, PySpark) now returns a `Map` / `Dict`. Before, it returned `Row` simply because the metrics are (internal to Observation) retrieved from the listener as rows. Since that is hidden from the user by the Observation API, there is no need to return `Row`.

While touching this code, this moves the unit tests from `DataFrameSuite,scala` to `DatasetSuite.scala` and from `JavaDataFrameSuite.java` to `JavaDatasetSuite.java`, which is a better place.

### Why are the changes needed?
This simplifies the API and accessing the metrics, especially in Java. There is no need for the concept `Row` when retrieving the observation result.

### Does this PR introduce _any_ user-facing change?
Yes, it changes the return type of `get` from `Row` to `Map` (Scala) / `Dict` (Python) and introduces `getAsJavaMap` (Java).

### How was this patch tested?
This is tested in `DatasetSuite.SPARK-34806: observation on datasets`, `JavaDatasetSuite.testObservation` and `test_dataframe.test_observe`.

Closes #33545 from EnricoMi/branch-observation-returns-map.

Authored-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-01 10:40:28 +09:00
Samuel Moseley a07df1acc6 [SPARK-36161][PYTHON] Add type check on dropDuplicates pyspark function
### What changes were proposed in this pull request?
Improve the error message for wrong type when calling dropDuplicates in pyspark.

### Why are the changes needed?
The current error message is cryptic and can be unclear to less experienced users.

### Does this PR introduce _any_ user-facing change?
Yes, it adds a type error for when a user gives the wrong type to dropDuplicates

### How was this patch tested?
There is currently no testing for error messages in pyspark dataframe functions

Closes #33364 from sammyjmoseley/sm/add-type-checking-for-drop-duplicates.

Lead-authored-by: Samuel Moseley <smoseley@palantir.com>
Co-authored-by: Sammy Moseley <moseley.sammy@gmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-29 19:11:48 +09:00
Enrico Minack f90eb6a5db [SPARK-36263][SQL][PYTHON] Add Dataframe.observation to PySpark
### What changes were proposed in this pull request?
With SPARK-34806 we can now easily add an equivalent for `Dataset.observe(Observation, Column, Column*)` to PySpark's `DataFrame` API.

### Why are the changes needed?
This further aligns the Python DataFrame API with Scala Dataset API.

### Does this PR introduce _any_ user-facing change?
Yes, it adds the `Observation` class and the `DataFrame.observe` method.

### How was this patch tested?
Adds test `test_observe` to `pyspark.sql.test.test_dataframe`.

Closes #33484 from EnricoMi/branch-observation-python.

Authored-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-28 01:39:34 +08:00
Luran He ede1bc6b51
[SPARK-36211][PYTHON] Correct typing of udf return value
The following code should type-check:

```python3
import uuid

import pyspark.sql.functions as F

my_udf = F.udf(lambda: str(uuid.uuid4())).asNondeterministic()
```

### What changes were proposed in this pull request?

The `udf` function should return a more specific type.

### Why are the changes needed?

Right now, `mypy` will throw spurious errors, such as for the code given above.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

This was not tested. Sorry, I am not very familiar with this repo -- are there any typing tests?

Closes #33399 from luranhe/patch-1.

Lead-authored-by: Luran He <luranjhe@gmail.com>
Co-authored-by: Luran He <luran.he@compass.com>
Signed-off-by: zero323 <mszymkiewicz@gmail.com>
2021-07-27 09:07:22 +02:00
Dominik Gehl ae1c20ee0d [SPARK-36225][PYTHON][DOCS] Use DataFrame in python docstrings
### What changes were proposed in this pull request?
Changing references to Dataset in python docstrings to DataFrame

### Why are the changes needed?
no Dataset class in pyspark

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Doc change only

Closes #33438 from dominikgehl/feature/SPARK-36225.

Lead-authored-by: Dominik Gehl <dog@open.ch>
Co-authored-by: Dominik Gehl <gehl@fastmail.fm>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-24 16:58:10 +09:00
Dominik Gehl 701756ac95 [SPARK-36226][PYTHON][DOCS] Improve python docstring links to other classes
### What changes were proposed in this pull request?
additional links to other classes in python documentation

### Why are the changes needed?
python docstring syntax wasn't fully used everywhere

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Documentation change only

Closes #33440 from dominikgehl/feature/python-docstrings.

Authored-by: Dominik Gehl <dog@open.ch>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-23 19:17:51 +09:00
Dominik Gehl 382fe44b55 [SPARK-36258][PYTHON] Exposing functionExists in pyspark sql catalog
### What changes were proposed in this pull request?
Exposing functionExists in pyspark sql catalog

### Why are the changes needed?
method was available in scala but not pyspark

### Does this PR introduce _any_ user-facing change?
Additional method

### How was this patch tested?
Unit tests

Closes #33481 from dominikgehl/SPARK-36258.

Authored-by: Dominik Gehl <dog@open.ch>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-23 19:15:41 +09:00
Dominik Gehl 2c35604044 [SPARK-36243][SQL][PYTHON][DOCS] Fixing pyspark tableExists issue with temporary views
### What changes were proposed in this pull request?
Additional tests for pyspark tableExists with regard to views and temporary views

### Why are the changes needed?
scala documentation indicates that tableExists works for tables/view and also temporary views. This unit tests try to verify that claim. While views seem ok, temporary views don't seem to work.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
tests

Closes #33461 from dominikgehl/bug/SPARK-36243.

Lead-authored-by: Dominik Gehl <dog@open.ch>
Co-authored-by: Dominik Gehl <gehl@fastmail.fm>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-22 18:12:01 +09:00
Dominik Gehl 463fcb3723 [SPARK-36207][PYTHON] Expose databaseExists in pyspark.sql.catalog
### What changes were proposed in this pull request?
Expose databaseExists in pyspark.sql.catalog

### Why are the changes needed?
Was available in scala, but not in pyspark

### Does this PR introduce _any_ user-facing change?
New method databaseExists

### How was this patch tested?
Unit tests in codebase

Closes #33416 from dominikgehl/feature/SPARK-36207.

Lead-authored-by: Dominik Gehl <dog@open.ch>
Co-authored-by: Dominik Gehl <gehl@fastmail.fm>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-20 22:10:06 +09:00
Dominik Gehl d7d961fabe [SPARK-36176][PYTHON] Expose tableExists in pyspark.sql.catalog
### What changes were proposed in this pull request?
exposing tableExists in pyspark.sql.catalog

### Why are the changes needed?
avoids pyspark users having to go through listTables

### Does this PR introduce _any_ user-facing change?
Yes, additional tableExists method available in pyspark

### How was this patch tested?
test added

Closes #33388 from dominikgehl/feature/SPARK-36176.

Authored-by: Dominik Gehl <dog@open.ch>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-20 09:28:52 +09:00
Dominik Gehl 2ef8ced27a [SPARK-36181][PYTHON] Update pyspark sql readwriter documentation
### What changes were proposed in this pull request?
Updating the pyspark sql readwriter documentation to the level of detail provided by the scala documentation

### Why are the changes needed?
documentation clarity

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Only documentation change

Closes #33394 from dominikgehl/feature/SPARK-36181.

Authored-by: Dominik Gehl <dog@open.ch>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-19 19:50:42 +09:00
Dominik Gehl 2d8d7b4aae [SPARK-36160][PYTHON][DOCS] Clarifying documentation for pyspark sql/column
### What changes were proposed in this pull request?
Adapting documentation of `between`, `getField`, `dropFields` and `cast` to the corresponding scala doc

### Why are the changes needed?
Documentation clarity

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Only documentation change

Closes #33369 from dominikgehl/feature/SPARK-36160.

Authored-by: Dominik Gehl <dog@open.ch>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-16 21:32:53 +09:00
Jungtaek Lim f2bf8b051b [SPARK-34893][SS] Support session window natively
Introduction: this PR is the last part of SPARK-10816 (EventTime based sessionization (session window)). Please refer #31937 to see the overall view of the code change. (Note that code diff could be diverged a bit.)

### What changes were proposed in this pull request?

This PR proposes to support native session window. Please refer the comments/design doc in SPARK-10816 for more details on the rationalization and design (could be outdated a bit compared to the PR).

The definition of the boundary of "session window" is [the timestamp of start event ~ the timestamp of last event + gap duration). That said, unlike time window, session window is a dynamic window which can expand if new input row is added to the session. To handle expansion of session window, Spark defines session window per input row, and "merge" windows if they can be merged (boundaries are overlapped).

This PR leverages two different approaches on merging session windows:

1. merging session windows with Spark's aggregation logic (a variant of sort aggregation)
2. updating session window for all rows bound to the same session, and applying aggregation logic afterwards

First one is preferable as it outperforms compared to the second one, though it can be only used if merging session window can be applied altogether with aggregation. It is not applicable on all the cases, so second one is used to cover the remaining cases.

This PR also applies the optimization on merging input rows and existing sessions with retaining the order (group keys + start timestamp of session window), leveraging the fact the number of existing sessions per group key won't be huge.

The state format is versioned, so that we can bring a new state format if we find a better one.

### Why are the changes needed?

For now, to deal with sessionization, Spark requires end users to play with (flat)MapGroupsWithState directly which has a couple of major drawbacks:

1. (flat)MapGroupsWithState is lower level API and end users have to code everything in details for defining session window and merging windows
2. built-in aggregate functions cannot be used and end users have to deal with aggregation by themselves
3. (flat)MapGroupsWithState is only available in Scala/Java.

With native support of session window, end users simply use "session_window" like they use "window" for tumbling/sliding window, and leverage built-in aggregate functions as well as UDAFs to simply define aggregations.

Quoting the query example from test suite:

```
    val inputData = MemoryStream[(String, Long)]

    // Split the lines into words, treat words as sessionId of events
    val events = inputData.toDF()
      .select($"_1".as("value"), $"_2".as("timestamp"))
      .withColumn("eventTime", $"timestamp".cast("timestamp"))
      .selectExpr("explode(split(value, ' ')) AS sessionId", "eventTime")
      .withWatermark("eventTime", "30 seconds")

    val sessionUpdates = events
      .groupBy(session_window($"eventTime", "10 seconds") as 'session, 'sessionId)
      .agg(count("*").as("numEvents"))
      .selectExpr("sessionId", "CAST(session.start AS LONG)", "CAST(session.end AS LONG)",
        "CAST(session.end AS LONG) - CAST(session.start AS LONG) AS durationMs",
        "numEvents")
```

which is same as StructuredSessionization (native session window is shorter and clearer even ignoring model classes).

39542bb81f/examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredSessionization.scala (L66-L105)

(Worth noting that the code in StructuredSessionization only works with processing time. The code doesn't consider old event can update the start time of old session.)

### Does this PR introduce _any_ user-facing change?

Yes. This PR brings the new feature to support session window on both batch and streaming query, which adds a new function "session_window" which usage is similar with "window".

### How was this patch tested?

New test suites. Also tested with benchmark code.

Closes #33081 from HeartSaVioR/SPARK-34893-SPARK-10816-PR-31570-part-5.

Lead-authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-07-16 20:38:16 +09:00
Dominik Gehl 2db7ed7964 [SPARK-36158][PYTHON][DOCS] Improving pyspark sql/functions months_between documentation
### What changes were proposed in this pull request?
Updating pyspark months_between documentation to the precision in the scala documentation

### Why are the changes needed?
Documentation clarity

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Only documentation change

Closes #33366 from dominikgehl/feature/SPARK-36158.

Authored-by: Dominik Gehl <dog@open.ch>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-07-15 19:20:38 +03:00
Hyukjin Kwon a71dd6af2f [SPARK-36146][PYTHON][INFRA][TESTS] Upgrade Python version from 3.6 to 3.9 in GitHub Actions' linter/docs
### What changes were proposed in this pull request?

This PR proposes to use Python 3.9 in documentation and linter at GitHub Actions. This PR also contains the fixes for mypy check (introduced by Python 3.9 upgrade)

```
python/pyspark/sql/pandas/_typing/protocols/frame.pyi:64: error: Name "np.ndarray" is not defined
python/pyspark/sql/pandas/_typing/protocols/frame.pyi:91: error: Name "np.recarray" is not defined
python/pyspark/sql/pandas/_typing/protocols/frame.pyi:165: error: Name "np.ndarray" is not defined
python/pyspark/pandas/categorical.py:82: error: Item "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "categories"
python/pyspark/pandas/categorical.py:109: error: Item "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "ordered"
python/pyspark/ml/linalg/__init__.pyi:184: error: Return type "ndarray[Any, Any]" of "toArray" incompatible with return type "NoReturn" in supertype "Matrix"
python/pyspark/ml/linalg/__init__.pyi:217: error: Return type "ndarray[Any, Any]" of "toArray" incompatible with return type "NoReturn" in supertype "Matrix"
python/pyspark/pandas/typedef/typehints.py:163: error: Module has no attribute "bool"; maybe "bool_" or "bool8"?
python/pyspark/pandas/typedef/typehints.py:174: error: Module has no attribute "float"; maybe "float_", "cfloat", or "float96"?
python/pyspark/pandas/typedef/typehints.py:180: error: Module has no attribute "int"; maybe "uint", "rint", or "intp"?
python/pyspark/pandas/ml.py:81: error: Value of type variable "_DTypeScalar_co" of "dtype" cannot be "object"
python/pyspark/pandas/indexing.py:1649: error: Module has no attribute "int"; maybe "uint", "rint", or "intp"?
python/pyspark/pandas/indexing.py:1656: error: Module has no attribute "int"; maybe "uint", "rint", or "intp"?
python/pyspark/pandas/frame.py:4969: error: Function "numpy.array" is not valid as a type
python/pyspark/pandas/frame.py:4969: note: Perhaps you need "Callable[...]" or a callback protocol?
python/pyspark/pandas/frame.py:4970: error: Function "numpy.array" is not valid as a type
python/pyspark/pandas/frame.py:4970: note: Perhaps you need "Callable[...]" or a callback protocol?
python/pyspark/pandas/frame.py:7402: error: "List[Any]" has no attribute "tolist"
python/pyspark/pandas/series.py:1030: error: Module has no attribute "_NoValue"
python/pyspark/pandas/series.py:1031: error: Module has no attribute "_NoValue"
python/pyspark/pandas/indexes/category.py:159: error: Item "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "categories"
python/pyspark/pandas/indexes/category.py:180: error: Item "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "ordered"
python/pyspark/pandas/namespace.py:2036: error: Argument 1 to "column_name" has incompatible type "float"; expected "str"
python/pyspark/pandas/mlflow.py:59: error: Incompatible types in assignment (expression has type "Type[floating[Any]]", variable has type "str")
python/pyspark/pandas/data_type_ops/categorical_ops.py:43: error: Item "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "categories"
python/pyspark/pandas/data_type_ops/categorical_ops.py:43: error: Item "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "ordered"
python/pyspark/pandas/data_type_ops/categorical_ops.py:56: error: Item "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "categories"
python/pyspark/pandas/tests/test_typedef.py:70: error: Name "np.float" is not defined
python/pyspark/pandas/tests/test_typedef.py:77: error: Name "np.float" is not defined
python/pyspark/pandas/tests/test_typedef.py:85: error: Name "np.float" is not defined
python/pyspark/pandas/tests/test_typedef.py💯 error: Name "np.float" is not defined
python/pyspark/pandas/tests/test_typedef.py:108: error: Name "np.float" is not defined
python/pyspark/mllib/clustering.pyi:152: error: Incompatible types in assignment (expression has type "ndarray[Any, Any]", base class "KMeansModel" defined the type as "List[ndarray[Any, Any]]")
python/pyspark/mllib/classification.pyi:93: error: Signature of "predict" incompatible with supertype "LinearClassificationModel"
Found 32 errors in 15 files (checked 315 source files)
1
```

### Why are the changes needed?

Python 3.6 is deprecated at SPARK-35938

### Does this PR introduce _any_ user-facing change?

No. Maybe static analysis, etc. by some type hints but they are really non-breaking..

### How was this patch tested?

I manually checked by GitHub Actions build in forked repository.

Closes #33356 from HyukjinKwon/SPARK-36146.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-15 08:01:54 -07:00
Dominik Gehl 802f632a28 [SPARK-36154][DOCS] Documenting week and quarter as valid formats in pyspark sql/functions trunc
### What changes were proposed in this pull request?
Added missing documentation of week and quarter as valid formats to pyspark sql/functions trunc

### Why are the changes needed?
Pyspark documentation and scala documentation didn't mentioned the same supported formats

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Only documentation change

Closes #33359 from dominikgehl/feature/SPARK-36154.

Authored-by: Dominik Gehl <dog@open.ch>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-07-15 16:51:11 +03:00
Dominik Gehl b5ee6d008d [SPARK-36149][PYTHON] Clarify documentation for dayofweek and weekofyear
### What changes were proposed in this pull request?
Clearly state which weekday corresponds to which integer

### Why are the changes needed?
Documentation clarity

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
only documentation change

Closes #33345 from dominikgehl/doc/pyspark-dayofweek.

Authored-by: Dominik Gehl <dog@open.ch>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-07-15 11:53:54 +03:00
itholic 2537fe8cba [SPARK-35929][PYTHON] Support to infer nested dict as a struct when creating a DataFrame
### What changes were proposed in this pull request?

Currently, inferring nested structs is always using `MapType`.

This behavior causes an issue because it infers the schema with a value type of the first field of the struct as below:

```python
data = [{"inside_struct": {"payment": 100.5, "name": "Lee"}}]
df = spark.createDataFrame(data)
df.show(truncate=False)
+--------------------------------+
|inside_struct                   |
+--------------------------------+
|{name -> null, payment -> 100.5}|
+--------------------------------+
```

The "name" became `null`, but it should've been `"Lee"`.

In this case, we need to be able to infer the schema with a `StructType` instead of a `MapType`.

Therefore, this PR proposes adding an new configuration `spark.sql.pyspark.inferNestedDictAsStruct.enabled` to handle which type is used for inferring nested structs.
- When `spark.sql.pyspark.inferNestedDictAsStruct.enabled` is `false` (by default), inferring nested structs by `MapType`
- When `spark.sql.pyspark.inferNestedDictAsStruct.enabled` is `true`, inferring nested structs by `StructType`

### Why are the changes needed?

Because always inferring the nested structs by `MapType` doesn't work properly for some cases.

### Does this PR introduce _any_ user-facing change?

New configuration `spark.sql.pyspark.inferNestedDictAsStruct.enabled` is added.

### How was this patch tested?

Added an unit test

Closes #33214 from itholic/SPARK-35929.

Lead-authored-by: itholic <haejoon.lee@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-07 15:14:18 +09:00
Wenchen Fan 0c9c8ff569 [SPARK-35968][SQL] Make sure partitions are not too small in AQE partition coalescing
### What changes were proposed in this pull request?

By default, AQE will set `COALESCE_PARTITIONS_MIN_PARTITION_NUM` to the spark default parallelism, which is usually quite big. This is to keep the parallelism on par with non-AQE, to avoid perf regressions.

However, this usually leads to many small/empty partitions, and hurts performance (although not worse than non-AQE). Users usually blindly set `COALESCE_PARTITIONS_MIN_PARTITION_NUM` to 1, which makes this config quite useless.

This PR adds a new config to set the min partition size, to avoid too small partitions after coalescing. By default, Spark will not respect the target size, and only respect this min partition size, to maximize the parallelism and avoid perf regression in AQE. This PR also adds a bool config to respect the target size when coalescing partitions, and it's recommended to set it to get better overall performance. This PR also deprecates the `COALESCE_PARTITIONS_MIN_PARTITION_NUM` config.

### Why are the changes needed?

AQE is default on now, we should make the perf better in the default case.

### Does this PR introduce _any_ user-facing change?

yes, a new config.

### How was this patch tested?

new tests

Closes #33172 from cloud-fan/aqe2.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-02 16:07:31 +08:00
itholic 03e6de2abe [SPARK-35605][PYTHON] Move to_pandas_on_spark to the Spark DataFrame
### What changes were proposed in this pull request?

This PR proposes move `to_pandas_on_spark` function from `pyspark.pandas.frame` to `pyspark.sql.dataframe`, and added the related tests to the PySpark DataFrame tests.

### Why are the changes needed?

Because now the Koalas is ported into PySpark, so we don't need to Spark auto-patch anymore.
And also `to_pandas_on_spark` is belongs to the pandas-on-Spark DataFrame doesn't look make sense.

### Does this PR introduce _any_ user-facing change?

No, it's kinda internal refactoring stuff.

### How was this patch tested?

Added the related tests and manually check they're passed.

Closes #33054 from itholic/SPARK-35605.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-28 11:47:09 +09:00
Keerthan Vasist f2c0a049a6 [SPARK-35643][PYTHON] Fix ambiguous reference in functions.py column()
### What changes were proposed in this pull request?
In functions.py, there is a function added `def column(col)`. There is also another method in the same file `def col(col)`. This leads to some ambiguity on whether the parameter is being referred to or the function. In pyspark 3.1.2, this leads to `TypeError: 'str' object is not callable` when the function `column(col)` is called - the highest preference is given to the string variable in scope as opposed to the function `col `in the file as intended.

This PR fixes that ambiguity by changing the variable name to `col_like`. I have filed this as an issue on JIRA here - https://issues.apache.org/jira/browse/SPARK-35643.

### Why are the changes needed?
In pyspark 3.1.2, we see `TypeError: 'str' object is not callable` when `column()` function is called. This Pr fixes that error.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
I don't believe this patch needs additional testing.

Closes #32771 from keerthanvasist/col.

Lead-authored-by: Keerthan Vasist <kvasist@amazon.com>
Co-authored-by: keerthanvasist <kvasist@amazon.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-05 12:40:39 +09:00
itholic 2658bc590f [SPARK-35081][DOCS] Add Data Source Option links to missing documents
### What changes were proposed in this pull request?

This PR proposes adding the missing link to Data Source Option page, for related functions such as `to_csv`, `to_json`, `from_csv`, `from_json`, `schema_of_csv`, `schema_of_json`.

- Before
<img width="797" alt="Screen Shot 2021-06-03 at 11 39 17 AM" src="https://user-images.githubusercontent.com/44108233/120578877-7b092200-c461-11eb-9e24-bd5349445c66.png">

- After
<img width="776" alt="Screen Shot 2021-06-03 at 11 59 14 AM" src="https://user-images.githubusercontent.com/44108233/120579868-29fa2d80-c463-11eb-9329-bd6c8f068f5b.png">

### Why are the changes needed?

To provide users available options in detail with the proper documentation link.

### Does this PR introduce _any_ user-facing change?

Yes, the link to Data Source Options page is added to the API documentations, as shown in the above screen capture.

### How was this patch tested?

Manually built the docs and checked one by one.

Closes #32762 from itholic/SPARK-35081.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-03 13:52:46 +09:00
itholic 48252bac95 [SPARK-35583][DOCS] Move JDBC data source options from Python and Scala into a single page
### What changes were proposed in this pull request?

This PR proposes move missing JDBC data source options from Python, Scala and Java into a single page.

### Why are the changes needed?

So far, the documentation for JDBC data source options is separated into different pages for each language API documents. However, this makes managing many options inconvenient, so it is efficient to manage all options in a single page and provide a link to that page in the API of each language.

### Does this PR introduce _any_ user-facing change?

Yes, the documents will be shown below after this change:

- "JDBC To Other Databases" page
<img width="803" alt="Screen Shot 2021-06-02 at 11 34 14 AM" src="https://user-images.githubusercontent.com/44108233/120415520-a115c000-c396-11eb-9663-9e666e08ed2b.png">

- Python
![Screen Shot 2021-06-01 at 2 57 40 PM](https://user-images.githubusercontent.com/44108233/120273628-ba146780-c2e9-11eb-96a8-11bd25415197.png)

- Scala
![Screen Shot 2021-06-01 at 2 57 03 PM](https://user-images.githubusercontent.com/44108233/120273567-a2d57a00-c2e9-11eb-9788-ea58028ca0a6.png)

- Java
![Screen Shot 2021-06-01 at 2 58 27 PM](https://user-images.githubusercontent.com/44108233/120273722-d912f980-c2e9-11eb-83b3-e09992d8c582.png)

### How was this patch tested?

Manually build docs and confirm the page.

Closes #32723 from itholic/SPARK-35583.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-02 14:21:16 +09:00
itholic fe09def323 [SPARK-35582][PYTHON][DOCS] Remove # noqa in Python API documents
### What changes were proposed in this pull request?

This PR aims to move `# noqa` in the Python docstring to the proper place so that hide them from the official documents.

### Why are the changes needed?

If we don't move `# noqa` to the proper place, it is exposed in the middle of the docstring, and it looks a bit wired as below:
<img width="613" alt="Screen Shot 2021-06-01 at 3 17 52 PM" src="https://user-images.githubusercontent.com/44108233/120275617-91da3800-c2ec-11eb-9778-16c5fe789418.png">

### Does this PR introduce _any_ user-facing change?

Yes, the `# noqa` is no more shown in the documents as below:
<img width="609" alt="Screen Shot 2021-06-01 at 3 21 00 PM" src="https://user-images.githubusercontent.com/44108233/120275927-fbf2dd00-c2ec-11eb-950d-346af2745711.png">

### How was this patch tested?

Manually build docs and check.

Closes #32728 from itholic/SPARK-35582.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-01 15:24:04 +09:00
itholic 73d4f67145 [SPARK-35433][DOCS] Move CSV data source options from Python and Scala into a single page
### What changes were proposed in this pull request?

This PR proposes move CSV data source options from Python, Scala and Java into a single page.

### Why are the changes needed?

So far, the documentation for CSV data source options is separated into different pages for each language API documents. However, this makes managing many options inconvenient, so it is efficient to manage all options in a single page and provide a link to that page in the API of each language.

### Does this PR introduce _any_ user-facing change?

Yes, the documents will be shown below after this change:

- "CSV Files" page
<img width="970" alt="Screen Shot 2021-05-27 at 12 35 36 PM" src="https://user-images.githubusercontent.com/44108233/119762269-586a8c80-bee8-11eb-8443-ae5b3c7a685c.png">

- Python
<img width="785" alt="Screen Shot 2021-05-25 at 4 12 10 PM" src="https://user-images.githubusercontent.com/44108233/119455390-83cc6a80-bd74-11eb-9156-65785ae27db0.png">

- Scala
<img width="718" alt="Screen Shot 2021-05-25 at 4 12 39 PM" src="https://user-images.githubusercontent.com/44108233/119455414-89c24b80-bd74-11eb-9775-aeda549d081e.png">

- Java
<img width="667" alt="Screen Shot 2021-05-25 at 4 13 09 PM" src="https://user-images.githubusercontent.com/44108233/119455422-8d55d280-bd74-11eb-97e8-86c1eabeadc2.png">

### How was this patch tested?

Manually build docs and confirm the page.

Closes #32658 from itholic/SPARK-35433.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-01 10:58:49 +09:00
itholic 79a6b0cc8a [SPARK-35509][DOCS] Move text data source options from Python and Scala into a single page
### What changes were proposed in this pull request?

This PR proposes move text data source options from Python, Scala and Java into a single page.

### Why are the changes needed?

So far, the documentation for text data source options is separated into different pages for each language API documents. However, this makes managing many options inconvenient, so it is efficient to manage all options in a single page and provide a link to that page in the API of each language.

### Does this PR introduce _any_ user-facing change?

Yes, the documents will be shown below after this change:

- "Text Files" page
<img width="823" alt="Screen Shot 2021-05-26 at 3 20 11 PM" src="https://user-images.githubusercontent.com/44108233/119611669-f5202200-be35-11eb-9307-45846949d300.png">

- Python
<img width="791" alt="Screen Shot 2021-05-25 at 5 04 26 PM" src="https://user-images.githubusercontent.com/44108233/119462469-b9c11d00-bd7b-11eb-8f19-2ba7b9ceb318.png">

- Scala
<img width="683" alt="Screen Shot 2021-05-25 at 5 05 10 PM" src="https://user-images.githubusercontent.com/44108233/119462483-bd54a400-bd7b-11eb-8177-74e4d7035e63.png">

- Java
<img width="665" alt="Screen Shot 2021-05-25 at 5 05 36 PM" src="https://user-images.githubusercontent.com/44108233/119462501-bfb6fe00-bd7b-11eb-8161-12c58fabe7e2.png">

### How was this patch tested?

Manually build docs and confirm the page.

Closes #32660 from itholic/SPARK-35509.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-26 17:12:49 +09:00
Hyukjin Kwon 20750a3f9e [SPARK-32194][PYTHON] Use proper exception classes instead of plain Exception
### What changes were proposed in this pull request?

This PR proposes to use a proper built-in exceptions instead of the plain `Exception` in Python.

While I am here, I fixed another minor issue at `DataFrams.schema` together:

```diff
- except AttributeError as e:
-     raise Exception(
-         "Unable to parse datatype from schema. %s" % e)
+ except Exception as e:
+     raise ValueError(
+         "Unable to parse datatype from schema. %s" % e) from e
```

Now it catches all exceptions during schema parsing, chains the exception with `ValueError`. Previously it only caught `AttributeError` that does not catch all cases.

### Why are the changes needed?

For users to expect the proper exceptions.

### Does this PR introduce _any_ user-facing change?

Yeah, the exception classes became different but should be compatible because previous exception was plain `Exception` which other exceptions inherit.

### How was this patch tested?

Existing unittests should cover,

Closes #31238

Closes #32650 from HyukjinKwon/SPARK-32194.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-26 11:54:40 +09:00
itholic d2bdd6595e [SPARK-35025][SQL][PYTHON][DOCS] Move Parquet data source options from Python and Scala into a single page
### What changes were proposed in this pull request?

This PR proposes move Parquet data source options from Python, Scala and Java into a single page.

### Why are the changes needed?

So far, the documentation for Parquet data source options is separated into different pages for each language API documents. However, this makes managing many options inconvenient, so it is efficient to manage all options in a single page and provide a link to that page in the API of each language.

### Does this PR introduce _any_ user-facing change?

Yes, the documents will be shown below after this change:

- "Parquet Files" page
![Screen Shot 2021-05-21 at 1 35 08 PM](https://user-images.githubusercontent.com/44108233/119082866-e7375f00-ba39-11eb-9ade-a931a5957b34.png)

- Python
![Screen Shot 2021-05-21 at 1 38 27 PM](https://user-images.githubusercontent.com/44108233/119082879-eef70380-ba39-11eb-9e8e-ee50eed98dbe.png)

- Scala
![Screen Shot 2021-05-21 at 1 36 52 PM](https://user-images.githubusercontent.com/44108233/119082884-f1595d80-ba39-11eb-98d5-966657df65f7.png)

- Java
![Screen Shot 2021-05-21 at 1 37 19 PM](https://user-images.githubusercontent.com/44108233/119082888-f4544e00-ba39-11eb-8bf8-47ce78ec0b01.png)

### How was this patch tested?

Manually build docs and confirm the page.

Closes #32161 from itholic/SPARK-34491.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-21 18:05:49 +09:00
itholic 419ddcb2a4 [SPARK-34494][SQL][DOCS] Move JSON data source options from Python and Scala into a single page
### What changes were proposed in this pull request?

This PR proposes move JSON data source options from Python, Scala and Java into a single page.

### Why are the changes needed?

So far, the documentation for JSON data source options is separated into different pages for each language API documents. However, this makes managing many options inconvenient, so it is efficient to manage all options in a single page and provide a link to that page in the API of each language.

### Does this PR introduce _any_ user-facing change?

Yes, the documents will be shown below after this change:

- "JSON Files" page
<img width="876" alt="Screen Shot 2021-05-20 at 8 48 27 PM" src="https://user-images.githubusercontent.com/44108233/118973662-ddb3e580-b9ac-11eb-987c-8139aa9c3fe2.png">

- Python
<img width="714" alt="Screen Shot 2021-04-16 at 5 04 11 PM" src="https://user-images.githubusercontent.com/44108233/114992491-ca0cef00-9ed5-11eb-9d0f-4de60d8b2516.png">

- Scala
<img width="726" alt="Screen Shot 2021-04-16 at 5 04 54 PM" src="https://user-images.githubusercontent.com/44108233/114992594-e315a000-9ed5-11eb-8bd3-af7e568fcfe1.png">

- Java
<img width="911" alt="Screen Shot 2021-04-16 at 5 06 11 PM" src="https://user-images.githubusercontent.com/44108233/114992751-10624e00-9ed6-11eb-888c-8668d3c74289.png">

### How was this patch tested?

Manually build docs and confirm the page.

Closes #32204 from itholic/SPARK-35081.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-21 18:05:13 +09:00
itholic 0fe65b5365 [SPARK-35395][DOCS] Move ORC data source options from Python and Scala into a single page
### What changes were proposed in this pull request?

This PR proposes move ORC data source options from Python, Scala and Java into a single page.

### Why are the changes needed?

So far, the documentation for ORC data source options is separated into different pages for each language API documents. However, this makes managing many options inconvenient, so it is efficient to manage all options in a single page and provide a link to that page in the API of each language.

### Does this PR introduce _any_ user-facing change?

Yes, the documents will be shown below after this change:

- "ORC Files" page
![Screen Shot 2021-05-21 at 2 07 14 PM](https://user-images.githubusercontent.com/44108233/119085078-f4564d00-ba3d-11eb-8990-3ba031d809da.png)

- Python
![Screen Shot 2021-05-21 at 2 06 46 PM](https://user-images.githubusercontent.com/44108233/119085097-00daa580-ba3e-11eb-8017-ac5a95a7c053.png)

- Scala
![Screen Shot 2021-05-21 at 2 06 09 PM](https://user-images.githubusercontent.com/44108233/119085135-164fcf80-ba3e-11eb-9cac-78dded523f38.png)

- Java
![Screen Shot 2021-05-21 at 2 06 30 PM](https://user-images.githubusercontent.com/44108233/119085125-118b1b80-ba3e-11eb-9434-f26612d7da13.png)

### How was this patch tested?

Manually build docs and confirm the page.

Closes #32546 from itholic/SPARK-35395.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-21 18:03:57 +09:00
Hyukjin Kwon 7eaabf4df5 [SPARK-35408][PYTHON][FOLLOW-UP] Avoid unnecessary f-string format
### What changes were proposed in this pull request?

This PR avoids using f-string format that's a new feature in Python 3.6. Although it's legitimate to use this syntax because Apache Spark supports Python 3.6+, this breaks unofficial support of Python 3.5.

This specific f-string format looks something unnecessary, and doesn't look worth enough to remove such unofficial support because of one string format in an error message.

**NOTE** that this PR doesn't mean that we're maintaining Python 3.5 since we dropped. It just looks like too much to remove that unofficial support only because of one string format and error message.

### Why are the changes needed?

To keep unofficial Python 3.5 support

### Does this PR introduce _any_ user-facing change?

Officially nope.

### How was this patch tested?

Ran the linters.

Closes #32598 from HyukjinKwon/SPARK-35408=followup.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-20 10:47:31 +09:00
Kousuke Saruta 9283bebbbd [SPARK-35418][SQL] Add sentences function to functions.{scala,py}
### What changes were proposed in this pull request?

This PR adds `sentences`, a string function, which is present as of `2.0.0` but missing in `functions.{scala,py}`.

### Why are the changes needed?

This function can be only used from SQL for now.
It's good if we can use this function from Scala/Python code as well as SQL.

### Does this PR introduce _any_ user-facing change?

Yes. Users can use this function from Scala and Python.

### How was this patch tested?

New test.

Closes #32566 from sarutak/sentences-function.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-05-19 20:07:28 +09:00
Gera Shegalov 9eb45ecb4f [SPARK-35408][PYTHON] Improve parameter validation in DataFrame.show
### What changes were proposed in this pull request?
Provide clearer error message tied to the user's Python code if incorrect parameters are passed to `DataFrame.show` rather than the message about a missing JVM method the user is not calling directly.

```
py4j.Py4JException: Method showString([class java.lang.Boolean, class java.lang.Integer, class java.lang.Boolean]) does not exist
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
	at py4j.Gateway.invoke(Gateway.java:274)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748
```

### Why are the changes needed?
For faster debugging through actionable error message.

### Does this PR introduce _any_ user-facing change?
No change for the correct parameters but different error messages for the parameters triggering an exception.

### How was this patch tested?
- unit test
- manually in PySpark REPL

Closes #32555 from gerashegalov/df_show_validation.

Authored-by: Gera Shegalov <gera@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-17 16:22:46 +09:00
Takuya UESHIN 17b59a9970 [SPARK-35382][PYTHON] Fix lambda variable name issues in nested DataFrame functions in Python APIs
### What changes were proposed in this pull request?

This PR fixes the same issue as #32424.

```py
from pyspark.sql.functions import flatten, struct, transform
df = spark.sql("SELECT array(1, 2, 3) as numbers, array('a', 'b', 'c') as letters")
df.select(flatten(
    transform(
        "numbers",
        lambda number: transform(
            "letters",
            lambda letter: struct(number.alias("n"), letter.alias("l"))
        )
    )
).alias("zipped")).show(truncate=False)
```

**Before:**

```
+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]|
+------------------------------------------------------------------------+
```

**After:**

```
+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]|
+------------------------------------------------------------------------+
```

### Why are the changes needed?

To produce the correct results.

### Does this PR introduce _any_ user-facing change?

Yes, it fixes the results to be correct as mentioned above.

### How was this patch tested?

Added a unit test as well as manually.

Closes #32523 from ueshin/issues/SPARK-35382/nested_higher_order_functions.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-13 14:58:01 +09:00
Sean Owen a189be8754 [MINOR][DOCS] Avoid some python docs where first sentence has "e.g." or similar
### What changes were proposed in this pull request?

Avoid some python docs where first sentence has "e.g." or similar as the period causes the docs to show only half of the first sentence as the summary.

### Why are the changes needed?

See for example https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.regression.LinearRegressionModel.html?highlight=linearregressionmodel#pyspark.ml.regression.LinearRegressionModel.summary where the method description is clearly truncated.

### Does this PR introduce _any_ user-facing change?

Only changes docs.

### How was this patch tested?

Manual testing of docs.

Closes #32508 from srowen/TruncatedPythonDesc.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-12 10:38:59 +09:00
HyukjinKwon 8aaa9e890a [SPARK-35250][SQL][DOCS] Fix duplicated STOP_AT_DELIMITER to SKIP_VALUE at CSV's unescapedQuoteHandling option documentation
### What changes were proposed in this pull request?

This is rather a followup of https://github.com/apache/spark/pull/30518 that should be ported back to `branch-3.1` too.
`STOP_AT_DELIMITER` was mistakenly used twice. The duplicated `STOP_AT_DELIMITER` should be `SKIP_VALUE` in the documentation.

### Why are the changes needed?

To correctly document.

### Does this PR introduce _any_ user-facing change?

Yes, it fixes the user-facing documentation.

### How was this patch tested?

I checked them via running linters.

Closes #32423 from HyukjinKwon/SPARK-35250.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-05-04 08:44:18 +09:00
Yikun Jiang 44b7931936 [SPARK-35176][PYTHON] Standardize input validation error type
### What changes were proposed in this pull request?
This PR corrects some exception type when the function input params are failed to validate due to TypeError.
In order to convenient to review, there are 3 commits in this PR:
- Standardize input validation error type on sql
- Standardize input validation error type on ml
- Standardize input validation error type on pandas

### Why are the changes needed?
As suggestion from Python exception doc [1]: "Raised when an operation or function is applied to an object of inappropriate type.", but there are many Value error are raised in some pyspark code, this patch fix them.

[1] https://docs.python.org/3/library/exceptions.html#TypeError

Note that: this patch only addresses the exsiting some wrong raise type for input validation, the input validation decorator/framework which mentioned in [SPARK-35176](https://issues.apache.org/jira/browse/SPARK-35176), would be submited in a speparated patch.

### Does this PR introduce _any_ user-facing change?
Yes, code can raise the right TypeError instead of ValueError.

### How was this patch tested?
Existing test case and UT

Closes #32368 from Yikun/SPARK-35176.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-05-03 15:34:24 +09:00
Alex Mooney faa928cefc [MINOR][PYTHON][DOCS] Fix docstring for pyspark.sql.DataFrameWriter.json lineSep param
### What changes were proposed in this pull request?

Add a new line to the `lineSep` parameter so that the doc renders correctly.

### Why are the changes needed?

> <img width="608" alt="image" src="https://user-images.githubusercontent.com/8269566/114631408-5c608900-9c71-11eb-8ded-ae1e21ae48b2.png">

The first line of the description is part of the signature and is **bolded**.

### Does this PR introduce _any_ user-facing change?

Yes, it changes how the docs for `pyspark.sql.DataFrameWriter.json` are rendered.

### How was this patch tested?

I didn't test it; I don't have the doc rendering tool chain on my machine, but the change is obvious.

Closes #32153 from AlexMooney/patch-1.

Authored-by: Alex Mooney <alexmooney@fastmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-04-14 13:14:51 +09:00
Yikun Jiang b43f7e6a97 [SPARK-35019][PYTHON][SQL] Fix type hints mismatches in pyspark.sql.*
### What changes were proposed in this pull request?
Fix type hints mismatches in pyspark.sql.*

### Why are the changes needed?
There were some mismatches in pyspark.sql.*

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
dev/lint-python passed.

Closes #32122 from Yikun/SPARK-35019.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-04-13 11:21:13 +09:00
Kousuke Saruta 14c7bb877d [SPARK-34872][SQL] quoteIfNeeded should quote a name which contains non-word characters
### What changes were proposed in this pull request?

This PR fixes an issue that `quoteIfNeeded` quotes a name only if it contains `.` or ``` ` ```.
This method should quote it if it contains non-word characters.

### Why are the changes needed?

It's a potential bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New test.

Closes #31964 from sarutak/fix-quoteIfNeeded.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-29 09:31:24 +00:00
Danny Meijer ad211ccd9d
[SPARK-34630][PYTHON][SQL] Added typehint for pyspark.sql.Column.contains
### What changes were proposed in this pull request?

This PR implements the missing typehints as per SPARK-34630.

### Why are the changes needed?

To satisfy the aforementioned Jira ticket

### Does this PR introduce _any_ user-facing change?

No, just adding a missing typehint for Project Zen

### How was this patch tested?

No tests needed (just adding a typehint)

Closes #31823 from dannymeijer/feature/SPARK-34630.

Authored-by: Danny Meijer <danny.meijer@nike.com>
Signed-off-by: zero323 <mszymkiewicz@gmail.com>
2021-03-24 15:21:19 +01:00
John Ayad ddfc75ec64 [SPARK-34803][PYSPARK] Pass the raised ImportError if pandas or pyarrow fail to import
### What changes were proposed in this pull request?

Pass the raised `ImportError` on failing to import pandas/pyarrow. This will help the user identify whether pandas/pyarrow are indeed not in the environment or if they threw a different `ImportError`.

### Why are the changes needed?

This can already happen in Pandas for example where it could throw an `ImportError` on its initialisation path if `dateutil` doesn't satisfy a certain version requirement https://github.com/pandas-dev/pandas/blob/0.24.x/pandas/compat/__init__.py#L438

### Does this PR introduce _any_ user-facing change?

Yes, it will now show the root cause of the exception when pandas or arrow is missing during import.

### How was this patch tested?

Manually tested.

```python
from pyspark.sql.functions import pandas_udf
spark.range(1).select(pandas_udf(lambda x: x))
```

Before:

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/...//spark/python/pyspark/sql/pandas/functions.py", line 332, in pandas_udf
    require_minimum_pyarrow_version()
  File "/.../spark/python/pyspark/sql/pandas/utils.py", line 53, in require_minimum_pyarrow_version
    raise ImportError("PyArrow >= %s must be installed; however, "
ImportError: PyArrow >= 1.0.0 must be installed; however, it was not found.
```

After:

```
Traceback (most recent call last):
  File "/.../spark/python/pyspark/sql/pandas/utils.py", line 49, in require_minimum_pyarrow_version
    import pyarrow
ModuleNotFoundError: No module named 'pyarrow'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/pandas/functions.py", line 332, in pandas_udf
    require_minimum_pyarrow_version()
  File "/.../spark/python/pyspark/sql/pandas/utils.py", line 55, in require_minimum_pyarrow_version
    raise ImportError("PyArrow >= %s must be installed; however, "
ImportError: PyArrow >= 1.0.0 must be installed; however, it was not found.
```

Closes #31902 from johnhany97/jayad/spark-34803.

Lead-authored-by: John Ayad <johnhany97@gmail.com>
Co-authored-by: John H. Ayad <johnhany97@gmail.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-03-22 23:29:28 +09:00
Kousuke Saruta 03dd33cc98 [SPARK-25769][SPARK-34636][SPARK-34626][SQL] sql method in UnresolvedAttribute, AttributeReference and Alias don't quote qualified names properly
### What changes were proposed in this pull request?

This PR fixes an issue that `sql` method in the following classes which take qualified names don't quote the qualified names properly.

* UnresolvedAttribute
* AttributeReference
* Alias

One instance caused by this issue is reported in SPARK-34626.
```
UnresolvedAttribute("a" :: "b" :: Nil).sql
`a.b` // expected: `a`.`b`
```
And other instances are like as follows.
```
UnresolvedAttribute("a`b"::"c.d"::Nil).sql
a`b.`c.d` // expected: `a``b`.`c.d`

AttributeReference("a.b", IntegerType)(qualifier = "c.d"::Nil).sql
c.d.`a.b` // expected: `c.d`.`a.b`

Alias(AttributeReference("a", IntegerType)(), "b.c")(qualifier = "d.e"::Nil).sql
`a` AS d.e.`b.c` // expected: `a` AS `d.e`.`b.c`
```

### Why are the changes needed?

This is a bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New test.

Closes #31754 from sarutak/fix-qualified-names.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-12 02:58:46 +00:00
Peter Toth ab8a9a0ceb [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite
### What changes were proposed in this pull request?

pyrolite 4.21 introduced and enabled value comparison by default (`valueCompare=true`) during object memoization and serialization: https://github.com/irmen/Pyrolite/blob/pyrolite-4.21/java/src/main/java/net/razorvine/pickle/Pickler.java#L112-L122
This change has undesired effect when we serialize a row (actually `GenericRowWithSchema`) to be passed to python: https://github.com/apache/spark/blob/branch-3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala#L60. A simple example is that
```
new GenericRowWithSchema(Array(1.0, 1.0), StructType(Seq(StructField("_1", DoubleType), StructField("_2", DoubleType))))
```
and
```
new GenericRowWithSchema(Array(1, 1), StructType(Seq(StructField("_1", IntegerType), StructField("_2", IntegerType))))
```
are currently equal and the second instance is replaced to the short code of the first one during serialization.

### Why are the changes needed?
The above can cause nasty issues like the one in https://issues.apache.org/jira/browse/SPARK-34545 description:

```
>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import *
>>>
>>> def udf1(data_type):
        def u1(e):
            return e[0]
        return udf(u1, data_type)
>>>
>>> df = spark.createDataFrame([((1.0, 1.0), (1, 1))], ['c1', 'c2'])
>>>
>>> df = df.withColumn("c3", udf1(DoubleType())("c1"))
>>> df = df.withColumn("c4", udf1(IntegerType())("c2"))
>>>
>>> df.select("c3").show()
+---+
| c3|
+---+
|1.0|
+---+

>>> df.select("c4").show()
+---+
| c4|
+---+
|  1|
+---+

>>> df.select("c3", "c4").show()
+---+----+
| c3|  c4|
+---+----+
|1.0|null|
+---+----+
```
This is because during serialization from JVM to Python `GenericRowWithSchema(1.0, 1.0)` (`c1`) is memoized first and when `GenericRowWithSchema(1, 1)` (`c2`) comes next, it is replaced to some short code of the `c1` (instead of serializing `c2` out) as they are `equal()`. The python functions then runs but the return type of `c4` is expected to be `IntegerType` and if a different type (`DoubleType`) comes back from python then it is discarded: https://github.com/apache/spark/blob/branch-3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala#L108-L113

After this PR:
```
>>> df.select("c3", "c4").show()
+---+---+
| c3| c4|
+---+---+
|1.0|  1|
+---+---+
```

### Does this PR introduce _any_ user-facing change?
Yes, fixes a correctness issue.

### How was this patch tested?
Added new UT + manual tests.

Closes #31682 from peter-toth/SPARK-34545-fix-row-comparison.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-03-07 19:12:42 -06:00
Takuya UESHIN 331d459ee7 [SPARK-34610][PYTHON][TEST] Fix Python UDF used in GroupedAggPandasUDFTests
### What changes were proposed in this pull request?

Fixes a Python UDF `plus_one` used in `GroupedAggPandasUDFTests` to always return float (double) values.

### Why are the changes needed?

The Python UDF `plus_one` used in `GroupedAggPandasUDFTests` is always returning `v + 1` regardless of its type. The return type of the UDF is 'double', so if the input is int, the result will be `null`.

```py
>>> df = spark.range(10).toDF('id') \
...             .withColumn("vs", array([lit(i * 1.0) + col('id') for i in range(20, 30)])) \
...             .withColumn("v", explode(col('vs'))) \
...             .drop('vs') \
...             .withColumn('w', lit(1.0))
>>> udf('double')
... def plus_one(v):
...   assert isinstance(v, (int, float))
...   return v + 1
...
>>> pandas_udf('double', PandasUDFType.GROUPED_AGG)
... def sum_udf(v):
...   return v.sum()
...
>>> df.groupby(plus_one(df.id)).agg(sum_udf(df.v)).show()
+------------+----------+
|plus_one(id)|sum_udf(v)|
+------------+----------+
|        null|    2900.0|
+------------+----------+
```

This is meaningless and should be:

```py
>>> udf('double')
... def plus_one(v):
...   assert isinstance(v, (int, float))
...   return float(v + 1)
...
>>> df.groupby(plus_one(df.id)).agg(sum_udf(df.v)).sort('plus_one(id)').show()
+------------+----------+
|plus_one(id)|sum_udf(v)|
+------------+----------+
|         1.0|     245.0|
|         2.0|     255.0|
|         3.0|     265.0|
|         4.0|     275.0|
|         5.0|     285.0|
|         6.0|     295.0|
|         7.0|     305.0|
|         8.0|     315.0|
|         9.0|     325.0|
|        10.0|     335.0|
+------------+----------+
```

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

Fixed the test.

Closes #31730 from ueshin/issues/SPARK-34610/test_pandas_udf_grouped_agg.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-03-04 10:03:54 +09:00
Richard Penney 7d0743b493 [SPARK-33678][SQL] Product aggregation function
### Why is this change being proposed?
This patch adds support for a new "product" aggregation function in `sql.functions` which multiplies-together all values in an aggregation group.

This is likely to be useful in statistical applications which involve combining probabilities, or financial applications that involve combining cumulative interest rates, but is also a versatile mathematical operation of similar status to `sum` or `stddev`. Other users [have noted](https://stackoverflow.com/questions/52991640/cumulative-product-in-spark) the absence of such a function in current releases of Spark.

This function is both much more concise than an expression of the form `exp(sum(log(...)))`, and avoids awkward edge-cases associated with some values being zero or negative, as well as being less computationally costly.

### Does this PR introduce _any_ user-facing change?
No - only adds new function.

### How was this patch tested?
Built-in tests have been added for the new `catalyst.expressions.aggregate.Product` class and its invocation via the (scala) `sql.functions.product` function. The latter, and the PySpark wrapper have also been manually tested in spark-shell and pyspark sessions. The SparkR wrapper is currently untested, and may need separate validation (I'm not an "R" user myself).

An illustration of the new functionality, within PySpark is as follows:
```
import pyspark.sql.functions as pf, pyspark.sql.window as pw

df = sqlContext.range(1, 17).toDF("x")
win = pw.Window.partitionBy(pf.lit(1)).orderBy(pf.col("x"))

df.withColumn("factorial", pf.product("x").over(win)).show(20, False)
+---+---------------+
|x  |factorial      |
+---+---------------+
|1  |1.0            |
|2  |2.0            |
|3  |6.0            |
|4  |24.0           |
|5  |120.0          |
|6  |720.0          |
|7  |5040.0         |
|8  |40320.0        |
|9  |362880.0       |
|10 |3628800.0      |
|11 |3.99168E7      |
|12 |4.790016E8     |
|13 |6.2270208E9    |
|14 |8.71782912E10  |
|15 |1.307674368E12 |
|16 |2.0922789888E13|
+---+---------------+
```

Closes #30745 from rwpenney/feature/agg-product.

Lead-authored-by: Richard Penney <rwp@rwpenney.uk>
Co-authored-by: Richard Penney <rwpenney@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-03-02 16:51:07 +09:00
Max Gekk 5957bc18a1 [SPARK-34451][SQL] Add alternatives for datetime rebasing SQL configs and deprecate legacy configs
### What changes were proposed in this pull request?
Move the datetime rebase SQL configs from the `legacy` namespace by:
1. Renaming of the existing rebase configs like `spark.sql.legacy.parquet.datetimeRebaseModeInRead` -> `spark.sql.parquet.datetimeRebaseModeInRead`.
2. Add the legacy configs as alternatives
3. Deprecate the legacy rebase configs.

### Why are the changes needed?
The rebasing SQL configs like `spark.sql.legacy.parquet.datetimeRebaseModeInRead` can be used not only for migration from previous Spark versions but also to read/write datatime columns saved by other systems/frameworks/libs. So, the configs shouldn't be considered as legacy configs.

### Does this PR introduce _any_ user-facing change?
Should not. Users will see a warning if they still use one of the legacy configs.

### How was this patch tested?
1. Manually checking new configs:
```scala
scala> spark.conf.get("spark.sql.parquet.datetimeRebaseModeInRead")
res0: String = EXCEPTION

scala> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY")
21/02/17 14:57:10 WARN SQLConf: The SQL config 'spark.sql.legacy.parquet.datetimeRebaseModeInRead' has been deprecated in Spark v3.2 and may be removed in the future. Use 'spark.sql.parquet.datetimeRebaseModeInRead' instead.

scala> spark.conf.get("spark.sql.parquet.datetimeRebaseModeInRead")
res2: String = LEGACY
```
2. By running a datetime rebasing test suite:
```
$ build/sbt "test:testOnly *ParquetRebaseDatetimeV1Suite"
```

Closes #31576 from MaxGekk/rebase-confs-alternatives.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-17 14:04:47 +00:00
Eric Lemmon e3b6e4ad43 [SPARK-33434][PYTHON][DOCS] Added RuntimeConfig to PySpark docs
### What changes were proposed in this pull request?
Documentation for `SparkSession.conf.isModifiable` is missing from the Python API site, so we added a Configuration section to the Spark SQL page to expose docs for the `RuntimeConfig` class (the class containing `isModifiable`). Then a `:class:` reference to `RuntimeConfig` was added to the `SparkSession.conf` docstring to create a link there as well.

### Why are the changes needed?
No docs were generated for `pyspark.sql.conf.RuntimeConfig`.

### Does this PR introduce _any_ user-facing change?
Yes--a new Configuration section to the Spark SQL page and a `Returns` section of the `SparkSession.conf` docstring, so this will now show a link to the `pyspark.sql.conf.RuntimeConfig` page. This is a change compared to both the released Spark version and the unreleased master branch.

### How was this patch tested?
First built the Python docs:
```bash
cd $SPARK_HOME/docs
SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll serve
```
Then verified all pages and links:
1. Configuration link displayed on the API Reference page, and it clicks through to Spark SQL page:
http://localhost:4000/api/python/reference/index.html
![image](https://user-images.githubusercontent.com/1160861/107601918-a2f02380-6bed-11eb-9b8f-974a0681a2a9.png)

2. Configuration section displayed on the Spark SQL page, and the RuntimeConfig link clicks through to the RuntimeConfig page:
http://localhost:4000/api/python/reference/pyspark.sql.html#configuration
![image](https://user-images.githubusercontent.com/1160861/107602058-0d08c880-6bee-11eb-8cbb-ad8c47588085.png)**

3. RuntimeConfig page displayed:
http://localhost:4000/api/python/reference/api/pyspark.sql.conf.RuntimeConfig.html
![image](https://user-images.githubusercontent.com/1160861/107602278-94eed280-6bee-11eb-95fc-445ea62ac1a4.png)

4. SparkSession.conf page displays the RuntimeConfig link, and it navigates to the RuntimeConfig page:
http://localhost:4000/api/python/reference/api/pyspark.sql.SparkSession.conf.html
![image](https://user-images.githubusercontent.com/1160861/107602435-1f373680-6bef-11eb-985a-b72432464940.png)

Closes #31483 from Eric-Lemmon/SPARK-33434-document-isModifiable.

Authored-by: Eric Lemmon <eric@lemmon.cc>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-02-13 09:32:55 -06:00
HyukjinKwon 92a83463c9 [SPARK-34408][PYTHON] Refactor spark.udf.register to share the same path to generate UDF instance
### What changes were proposed in this pull request?

This PR proposes to use `_create_udf` where we need to create `UserDefinedFunction` to maintain codes easier.

### Why are the changes needed?

For the better readability of codes and maintenance.

### Does this PR introduce _any_ user-facing change?

No, refactoring.

### How was this patch tested?

Ran the existing unittests. CI in this PR should test it out too.

Closes #31537 from HyukjinKwon/SPARK-34408.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-11 10:57:02 +09:00