### What changes were proposed in this pull request?
This PR completes the snake_case rule for the functions APIs across the languages; see also SPARK-10621.
In more details, this PR:
- Adds `count_distinct` in Scala, Python, and R, and documents that `count_distinct` is encouraged. `countDistinct` was not deprecated because it is pretty commonly used; we could deprecate it in future releases.
- (Scala-specific) Adds `typedlit` but doesn't deprecate `typedLit`, which is arguably commonly used. Likewise, we could deprecate it in future releases.
- Deprecates and renames:
- `sumDistinct` -> `sum_distinct`
- `bitwiseNOT` -> `bitwise_not`
- `shiftLeft` -> `shiftleft` (matched with SQL name in `FunctionRegistry`)
- `shiftRight` -> `shiftright` (matched with SQL name in `FunctionRegistry`)
- `shiftRightUnsigned` -> `shiftrightunsigned` (matched with SQL name in `FunctionRegistry`)
- (Scala-specific) `callUDF` -> `call_udf`
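The rename-plus-deprecation approach above can be sketched in plain Python. This is a hypothetical illustration of the pattern (not Spark's actual implementation), using a made-up `sum_distinct` that works on plain lists:

```python
import functools
import warnings


def sum_distinct(values):
    """New snake_case name: sum of the distinct values."""
    return sum(set(values))


def _deprecated_alias(new_func, old_name):
    """Wrap new_func so the old camelCase name warns on use."""
    @functools.wraps(new_func)
    def wrapper(*args, **kwargs):
        warnings.warn(
            f"{old_name} is deprecated; use {new_func.__name__} instead.",
            FutureWarning,
        )
        return new_func(*args, **kwargs)
    return wrapper


# Old name kept for compatibility, but it now emits a deprecation warning.
sumDistinct = _deprecated_alias(sum_distinct, "sumDistinct")
```

Both names keep returning the same result, so existing user code continues to work while nudging callers toward the snake_case spelling.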
### Why are the changes needed?
To keep the consistent naming in APIs.
### Does this PR introduce _any_ user-facing change?
Yes, it deprecates some APIs and adds new, renamed APIs as described above.
### How was this patch tested?
Unit tests were added.
Closes #31408 from HyukjinKwon/SPARK-34306.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Added typing for keyword-only single argument udf overload.
### Why are the changes needed?
The intended use case is:
```
@udf(returnType="string")
def f(x): ...
```
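The shape of such a keyword-only overload can be sketched with `typing.overload`. This is a minimal, self-contained stand-in (not PySpark's actual `udf` implementation; the attribute attached in the sketch is purely illustrative):

```python
from typing import Callable, overload


@overload
def udf(f: Callable) -> Callable: ...
@overload
def udf(*, returnType: str = ...) -> Callable[[Callable], Callable]: ...


def udf(f=None, returnType="string"):
    if f is not None:
        # Bare decorator usage: @udf
        return f

    # Keyword-only usage: udf(returnType="string") returns a decorator.
    def decorator(func):
        func.returnType = returnType  # record the declared type, for illustration
        return func

    return decorator


@udf(returnType="string")
def to_str(x):
    return str(x)
```

With the second overload in place, a type checker accepts the call with only the `returnType` keyword argument and still infers that the result is a decorator.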
### Does this PR introduce _any_ user-facing change?
Yes - a new typing signature for `udf` is considered valid.
### How was this patch tested?
Existing tests.
Closes #31282 from pgrz/patch-1.
Authored-by: pgrz <grzegorski.piotr@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR is a followup of https://github.com/apache/spark/pull/27406. It fixes the naming to match with Scala side.
Note that there is already some inconsistency, e.g. `col`, `e`, `expr`, and `column`. I did not change that part, but other mismatches such as `zero` vs `initialValue` or `col1`/`col2` vs `left`/`right` look unnecessary.
### Why are the changes needed?
To make the usage similar to the Scala side, and for consistency.
### Does this PR introduce _any_ user-facing change?
No, this is not released yet.
### How was this patch tested?
GitHub Actions and Jenkins build will test it out.
Closes #31062 from HyukjinKwon/SPARK-30681.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR adds the following functions (introduced in Scala API with SPARK-33061):
- `acosh`
- `asinh`
- `atanh`
to Python and R.
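For reference, these are the inverse hyperbolic functions. A quick plain-Python check of their defining logarithmic identities (this illustrates the math only, not the PySpark API):

```python
import math

x = 2.0
# asinh(x) = ln(x + sqrt(x^2 + 1))
assert math.isclose(math.asinh(x), math.log(x + math.sqrt(x * x + 1)))
# acosh(x) = ln(x + sqrt(x^2 - 1)), defined for x >= 1
assert math.isclose(math.acosh(x), math.log(x + math.sqrt(x * x - 1)))

y = 0.5
# atanh(y) = (1/2) ln((1 + y) / (1 - y)), defined for |y| < 1
assert math.isclose(math.atanh(y), 0.5 * math.log((1 + y) / (1 - y)))
```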
### Why are the changes needed?
Feature parity.
### Does this PR introduce _any_ user-facing change?
New functions.
### How was this patch tested?
New unit tests.
Closes #30501 from zero323/SPARK-33563.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR adds support for passing `Column`s as input to PySpark sorting functions.
### Why are the changes needed?
According to SPARK-26979, PySpark functions should support both Column and str arguments, when possible.
### Does this PR introduce _any_ user-facing change?
PySpark users can now provide both `Column` and `str` as an argument for `asc*` and `desc*` functions.
### How was this patch tested?
New unit tests.
Closes #30227 from zero323/SPARK-33257.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to migrate to [NumPy documentation style](https://numpydoc.readthedocs.io/en/latest/format.html), see also SPARK-33243.
While I am migrating, I also fixed some Python type hints accordingly.
### Why are the changes needed?
For better documentation, both as plain text and as generated HTML.
### Does this PR introduce _any_ user-facing change?
Yes, users will see better-formatted HTML and text documentation. See SPARK-33243.
### How was this patch tested?
Manually tested via running `./dev/lint-python`.
Closes #30181 from HyukjinKwon/SPARK-33250.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Relax pyspark typing for sql str functions. These functions all pass the first argument through `_to_java_column`, such that a string or Column object is acceptable.
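The relaxed annotation amounts to widening the parameter type to a union. A hypothetical sketch of the pattern (the `Column` class, alias name, and `upper` body here are illustrative stand-ins, not PySpark's real definitions):

```python
from typing import Union


class Column:
    """Hypothetical stand-in for pyspark.sql.Column."""

    def __init__(self, name: str) -> None:
        self.name = name


ColumnOrName = Union[Column, str]


def upper(col: ColumnOrName) -> Column:
    """Relaxed signature: the first argument may be a Column or its name,
    mirroring what the runtime conversion already accepts."""
    c = col if isinstance(col, Column) else Column(col)
    return Column(f"upper({c.name})")
```

Since the runtime already converts both forms, the relaxed stub only makes the static types match existing behavior.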
### Why are the changes needed?
Convenience, and ensuring the typing reflects the actual functionality.
### Does this PR introduce _any_ user-facing change?
Yes, a backwards-compatible increase in functionality. However, typing support is unreleased, so there is possibly no change to released versions.
### How was this patch tested?
Not tested. I am newish to Python typing with stubs, so someone should confirm this is the correct way to fix this.
Closes #30209 from dhimmel/patch-1.
Authored-by: Daniel Himmelstein <daniel.himmelstein@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
- [x] Expand dictionary definitions into standalone functions.
- [x] Fix annotations for ordering functions.
### Why are the changes needed?
To simplify further maintenance of docstrings.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes #30143 from zero323/SPARK-32084.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
- Annotated return types of `assert_true` and `raise_error` as discussed [here](https://github.com/apache/spark/pull/29947#pullrequestreview-504495801).
- Added `assert_true` and `raise_error` to the SparkR NAMESPACE.
- Validated message vector size in SparkR as discussed [here](https://github.com/apache/spark/pull/29947#pullrequestreview-504539004).
### Why are the changes needed?
As discussed in review for #29947.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- Existing tests.
- Validation of annotations using MyPy
Closes #29978 from zero323/SPARK-32793-FOLLOW-UP.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Adds a SQL function `raise_error` which underlies the refactored `assert_true` function. `assert_true` now also (optionally) accepts a custom error message field.
`raise_error` is exposed in SQL, Python, Scala, and R.
`assert_true` was previously only exposed in SQL; it is now also exposed in Python, Scala, and R.
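Roughly, the semantics of the two functions can be mirrored in plain Python (the real functions operate on `Column`s and raise Spark exceptions; the names below only echo the behavior):

```python
def raise_error(message):
    """Unconditionally raise with the given message, like SQL's raise_error."""
    raise RuntimeError(message)


def assert_true(condition, message=None):
    """Raise if the condition is false; the error message is optional."""
    if not condition:
        raise_error(message if message is not None
                    else f"'{condition}' is not true!")
```

The optional `message` argument is what the refactor adds: callers can now explain *why* an assertion failed instead of getting only a generic error.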
### Why are the changes needed?
Improves usability of `assert_true` by clarifying error messaging, and adds the useful helper function `raise_error`.
### Does this PR introduce _any_ user-facing change?
Yes:
- Adds `raise_error` function to the SQL, Python, Scala, and R APIs.
- Adds `assert_true` function to the SQL, Python and R APIs.
### How was this patch tested?
Adds unit tests in SQL, Python, Scala, and R for `assert_true` and `raise_error`.
Closes #29947 from karenfeng/spark-32793.
Lead-authored-by: Karen Feng <karen.feng@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
`nth_value` was added at SPARK-27951. This PR adds the corresponding PySpark API.
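The semantics of `nth_value` over a window frame can be sketched in plain Python over a list (an illustration of the SQL behavior, not the PySpark implementation; the `ignore_nulls` flag mirrors the optional `ignoreNulls` behavior):

```python
def nth_value(frame, n, ignore_nulls=False):
    """Return the n-th value (1-based) of a window frame, or None if the
    frame has fewer than n rows -- mirroring SQL nth_value semantics."""
    if ignore_nulls:
        frame = [v for v in frame if v is not None]
    return frame[n - 1] if len(frame) >= n else None
```

For example, `nth_value([10, 20, 30], 2)` yields the second row's value, while a frame shorter than `n` yields null.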
### Why are the changes needed?
To keep the APIs consistent.
### Does this PR introduce _any_ user-facing change?
Yes, it introduces a new PySpark function API.
### How was this patch tested?
A unit test was added.
Closes #29899 from HyukjinKwon/SPARK-33020.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR proposes migration of [`pyspark-stubs`](https://github.com/zero323/pyspark-stubs) into Spark codebase.
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
Yes. This PR adds type annotations directly to Spark source.
This can affect interaction with development tools for users who haven't used `pyspark-stubs`.
### How was this patch tested?
- [x] MyPy tests of the PySpark source
```
mypy --no-incremental --config python/mypy.ini python/pyspark
```
- [x] MyPy tests of Spark examples
```
MYPYPATH=python/ mypy --no-incremental --config python/mypy.ini examples/src/main/python/ml examples/src/main/python/sql examples/src/main/python/sql/streaming
```
- [x] Existing Flake8 linter
- [x] Existing unit tests
Tested against:
- `mypy==0.790+dev.e959952d9001e9713d329a2f9b196705b028f894`
- `mypy==0.782`
Closes #29591 from zero323/SPARK-32681.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>