### What changes were proposed in this pull request?
This is a followup of https://github.com/apache/spark/pull/30518 and should also be ported back to `branch-3.1`.
`STOP_AT_DELIMITER` was mistakenly used twice. The duplicated `STOP_AT_DELIMITER` should be `SKIP_VALUE` in the documentation.
### Why are the changes needed?
To document the option values correctly.
### Does this PR introduce _any_ user-facing change?
Yes, it fixes the user-facing documentation.
### How was this patch tested?
Verified by running the linters.
Closes #32423 from HyukjinKwon/SPARK-35250.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR corrects the exception types raised when function input parameters fail type validation, so that a TypeError is raised instead of a ValueError where appropriate.
To make review easier, this PR is split into 3 commits:
- Standardize input validation error type on sql
- Standardize input validation error type on ml
- Standardize input validation error type on pandas
### Why are the changes needed?
The Python exception documentation [1] describes TypeError as "Raised when an operation or function is applied to an object of inappropriate type." However, ValueError is raised in some PySpark code for such cases; this patch fixes them.
[1] https://docs.python.org/3/library/exceptions.html#TypeError
Note that this patch only addresses the existing wrong exception types for input validation; the input validation decorator/framework mentioned in [SPARK-35176](https://issues.apache.org/jira/browse/SPARK-35176) will be submitted in a separate patch.
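For illustration, a minimal sketch of the kind of change (a hypothetical validator, not actual PySpark code):
```python
def set_fraction(value):
    # Before this change, a wrong argument type would raise ValueError.
    # After this change, an inappropriate type raises TypeError, per the Python docs above.
    if not isinstance(value, (int, float)):
        raise TypeError("value must be an int or float, got %s" % type(value).__name__)
    # Value problems (right type, wrong value) still raise ValueError.
    if not 0 <= value <= 1:
        raise ValueError("value must be between 0 and 1, got %r" % value)
    return value
```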
### Does this PR introduce _any_ user-facing change?
Yes, the code now raises the appropriate TypeError instead of ValueError.
### How was this patch tested?
Existing test cases and unit tests.
Closes #32368 from Yikun/SPARK-35176.
Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add a new line to the `lineSep` parameter so that the doc renders correctly.
### Why are the changes needed?
> <img width="608" alt="image" src="https://user-images.githubusercontent.com/8269566/114631408-5c608900-9c71-11eb-8ded-ae1e21ae48b2.png">
The first line of the description is part of the signature and is **bolded**.
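For reference, a minimal sketch of the intended numpydoc layout (a hypothetical docstring, not the actual `DataFrameWriter.json` source): the description must start on its own indented line below the `name : type` line, otherwise Sphinx folds it into the bolded signature line.
```python
def write_json(path, lineSep=None):
    """Write rows as JSON (sketch only).

    Parameters
    ----------
    lineSep : str, optional
        defines the line separator that should be used for writing. Starting the
        description on this separate, indented line keeps it out of the signature.
    """
```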
### Does this PR introduce _any_ user-facing change?
Yes, it changes how the docs for `pyspark.sql.DataFrameWriter.json` are rendered.
### How was this patch tested?
I didn't test it; I don't have the doc rendering tool chain on my machine, but the change is obvious.
Closes #32153 from AlexMooney/patch-1.
Authored-by: Alex Mooney <alexmooney@fastmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Fix type hint mismatches in pyspark.sql.*
### Why are the changes needed?
There were some type hint mismatches in pyspark.sql.*.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
dev/lint-python passed.
Closes #32122 from Yikun/SPARK-35019.
Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR fixes an issue where `quoteIfNeeded` quotes a name only if it contains `.` or ``` ` ```.
The method should quote the name if it contains any non-word characters.
### Why are the changes needed?
It's a potential bug.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New test.
Closes #31964 from sarutak/fix-quoteIfNeeded.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR implements the missing typehints as per SPARK-34630.
### Why are the changes needed?
To satisfy the aforementioned Jira ticket
### Does this PR introduce _any_ user-facing change?
No, just adding a missing typehint for Project Zen
### How was this patch tested?
No tests needed (just adding a typehint)
Closes #31823 from dannymeijer/feature/SPARK-34630.
Authored-by: Danny Meijer <danny.meijer@nike.com>
Signed-off-by: zero323 <mszymkiewicz@gmail.com>
### What changes were proposed in this pull request?
Pass the raised `ImportError` on failing to import pandas/pyarrow. This will help the user identify whether pandas/pyarrow are indeed not in the environment or if they threw a different `ImportError`.
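A minimal sketch of the chaining pattern (a hypothetical helper body, mirroring the traceback shown below; the actual code lives in `pyspark/sql/pandas/utils.py`):
```python
def require_minimum_pyarrow_version():
    minimum_pyarrow_version = "1.0.0"
    try:
        import pyarrow  # noqa: F401
    except ImportError as error:
        # Chain the original error so the root cause stays visible to the user.
        raise ImportError(
            "PyArrow >= %s must be installed; however, it was not found."
            % minimum_pyarrow_version
        ) from error
```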
### Why are the changes needed?
This can already happen in pandas, for example, which can throw an `ImportError` on its initialisation path if `dateutil` doesn't satisfy a certain version requirement: https://github.com/pandas-dev/pandas/blob/0.24.x/pandas/compat/__init__.py#L438
### Does this PR introduce _any_ user-facing change?
Yes, it will now show the root cause of the exception when pandas or arrow is missing during import.
### How was this patch tested?
Manually tested.
```python
from pyspark.sql.functions import pandas_udf
spark.range(1).select(pandas_udf(lambda x: x))
```
Before:
```
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/...//spark/python/pyspark/sql/pandas/functions.py", line 332, in pandas_udf
require_minimum_pyarrow_version()
File "/.../spark/python/pyspark/sql/pandas/utils.py", line 53, in require_minimum_pyarrow_version
raise ImportError("PyArrow >= %s must be installed; however, "
ImportError: PyArrow >= 1.0.0 must be installed; however, it was not found.
```
After:
```
Traceback (most recent call last):
File "/.../spark/python/pyspark/sql/pandas/utils.py", line 49, in require_minimum_pyarrow_version
import pyarrow
ModuleNotFoundError: No module named 'pyarrow'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.../spark/python/pyspark/sql/pandas/functions.py", line 332, in pandas_udf
require_minimum_pyarrow_version()
File "/.../spark/python/pyspark/sql/pandas/utils.py", line 55, in require_minimum_pyarrow_version
raise ImportError("PyArrow >= %s must be installed; however, "
ImportError: PyArrow >= 1.0.0 must be installed; however, it was not found.
```
Closes #31902 from johnhany97/jayad/spark-34803.
Lead-authored-by: John Ayad <johnhany97@gmail.com>
Co-authored-by: John H. Ayad <johnhany97@gmail.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR fixes an issue where the `sql` method in the following classes, which take qualified names, doesn't quote the qualified names properly.
* UnresolvedAttribute
* AttributeReference
* Alias
One instance caused by this issue is reported in SPARK-34626.
```
UnresolvedAttribute("a" :: "b" :: Nil).sql
`a.b` // expected: `a`.`b`
```
Other instances are as follows.
```
UnresolvedAttribute("a`b"::"c.d"::Nil).sql
a`b.`c.d` // expected: `a``b`.`c.d`
AttributeReference("a.b", IntegerType)(qualifier = "c.d"::Nil).sql
c.d.`a.b` // expected: `c.d`.`a.b`
Alias(AttributeReference("a", IntegerType)(), "b.c")(qualifier = "d.e"::Nil).sql
`a` AS d.e.`b.c` // expected: `a` AS `d.e`.`b.c`
```
### Why are the changes needed?
This is a bug.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New test.
Closes #31754 from sarutak/fix-qualified-names.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
pyrolite 4.21 introduced and enabled value comparison by default (`valueCompare=true`) during object memoization and serialization: https://github.com/irmen/Pyrolite/blob/pyrolite-4.21/java/src/main/java/net/razorvine/pickle/Pickler.java#L112-L122
This change has an undesired effect when we serialize a row (actually `GenericRowWithSchema`) to be passed to Python: https://github.com/apache/spark/blob/branch-3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala#L60. A simple example is that
```
new GenericRowWithSchema(Array(1.0, 1.0), StructType(Seq(StructField("_1", DoubleType), StructField("_2", DoubleType))))
```
and
```
new GenericRowWithSchema(Array(1, 1), StructType(Seq(StructField("_1", IntegerType), StructField("_2", IntegerType))))
```
are considered equal, and the second instance is replaced with a short back-reference to the first one during serialization.
### Why are the changes needed?
The above can cause nasty issues like the one in https://issues.apache.org/jira/browse/SPARK-34545 description:
```
>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import *
>>>
>>> def udf1(data_type):
...     def u1(e):
...         return e[0]
...     return udf(u1, data_type)
>>>
>>> df = spark.createDataFrame([((1.0, 1.0), (1, 1))], ['c1', 'c2'])
>>>
>>> df = df.withColumn("c3", udf1(DoubleType())("c1"))
>>> df = df.withColumn("c4", udf1(IntegerType())("c2"))
>>>
>>> df.select("c3").show()
+---+
| c3|
+---+
|1.0|
+---+
>>> df.select("c4").show()
+---+
| c4|
+---+
| 1|
+---+
>>> df.select("c3", "c4").show()
+---+----+
| c3| c4|
+---+----+
|1.0|null|
+---+----+
```
This is because, during serialization from the JVM to Python, `GenericRowWithSchema(1.0, 1.0)` (`c1`) is memoized first, and when `GenericRowWithSchema(1, 1)` (`c2`) comes next, it is replaced with a short back-reference to `c1` (instead of serializing `c2` out) as they are `equal()`. The Python function then runs, but the return type of `c4` is expected to be `IntegerType`, and if a different type (`DoubleType`) comes back from Python it is discarded: https://github.com/apache/spark/blob/branch-3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala#L108-L113
After this PR:
```
>>> df.select("c3", "c4").show()
+---+---+
| c3| c4|
+---+---+
|1.0| 1|
+---+---+
```
### Does this PR introduce _any_ user-facing change?
Yes, fixes a correctness issue.
### How was this patch tested?
Added new UT + manual tests.
Closes #31682 from peter-toth/SPARK-34545-fix-row-comparison.
Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Fixes a Python UDF `plus_one` used in `GroupedAggPandasUDFTests` to always return float (double) values.
### Why are the changes needed?
The Python UDF `plus_one` used in `GroupedAggPandasUDFTests` always returns `v + 1` regardless of the input type. The return type of the UDF is 'double', so if the input is an int, the result will be `null`.
```py
>>> df = spark.range(10).toDF('id') \
... .withColumn("vs", array([lit(i * 1.0) + col('id') for i in range(20, 30)])) \
... .withColumn("v", explode(col('vs'))) \
... .drop('vs') \
... .withColumn('w', lit(1.0))
>>> @udf('double')
... def plus_one(v):
...     assert isinstance(v, (int, float))
...     return v + 1
...
>>> @pandas_udf('double', PandasUDFType.GROUPED_AGG)
... def sum_udf(v):
...     return v.sum()
...
>>> df.groupby(plus_one(df.id)).agg(sum_udf(df.v)).show()
+------------+----------+
|plus_one(id)|sum_udf(v)|
+------------+----------+
| null| 2900.0|
+------------+----------+
```
This is meaningless and should be:
```py
>>> @udf('double')
... def plus_one(v):
...     assert isinstance(v, (int, float))
...     return float(v + 1)
...
>>> df.groupby(plus_one(df.id)).agg(sum_udf(df.v)).sort('plus_one(id)').show()
+------------+----------+
|plus_one(id)|sum_udf(v)|
+------------+----------+
| 1.0| 245.0|
| 2.0| 255.0|
| 3.0| 265.0|
| 4.0| 275.0|
| 5.0| 285.0|
| 6.0| 295.0|
| 7.0| 305.0|
| 8.0| 315.0|
| 9.0| 325.0|
| 10.0| 335.0|
+------------+----------+
```
### Does this PR introduce _any_ user-facing change?
No, test-only.
### How was this patch tested?
Fixed the test.
Closes #31730 from ueshin/issues/SPARK-34610/test_pandas_udf_grouped_agg.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### Why is this change being proposed?
This patch adds support for a new "product" aggregation function in `sql.functions` which multiplies-together all values in an aggregation group.
This is likely to be useful in statistical applications which involve combining probabilities, or financial applications that involve combining cumulative interest rates, but is also a versatile mathematical operation of similar status to `sum` or `stddev`. Other users [have noted](https://stackoverflow.com/questions/52991640/cumulative-product-in-spark) the absence of such a function in current releases of Spark.
This function is both much more concise than an expression of the form `exp(sum(log(...)))`, and avoids awkward edge-cases associated with some values being zero or negative, as well as being less computationally costly.
### Does this PR introduce _any_ user-facing change?
No - it only adds a new function.
### How was this patch tested?
Built-in tests have been added for the new `catalyst.expressions.aggregate.Product` class and its invocation via the (scala) `sql.functions.product` function. The latter, and the PySpark wrapper have also been manually tested in spark-shell and pyspark sessions. The SparkR wrapper is currently untested, and may need separate validation (I'm not an "R" user myself).
An illustration of the new functionality, within PySpark is as follows:
```
import pyspark.sql.functions as pf, pyspark.sql.window as pw
df = sqlContext.range(1, 17).toDF("x")
win = pw.Window.partitionBy(pf.lit(1)).orderBy(pf.col("x"))
df.withColumn("factorial", pf.product("x").over(win)).show(20, False)
+---+---------------+
|x |factorial |
+---+---------------+
|1 |1.0 |
|2 |2.0 |
|3 |6.0 |
|4 |24.0 |
|5 |120.0 |
|6 |720.0 |
|7 |5040.0 |
|8 |40320.0 |
|9 |362880.0 |
|10 |3628800.0 |
|11 |3.99168E7 |
|12 |4.790016E8 |
|13 |6.2270208E9 |
|14 |8.71782912E10 |
|15 |1.307674368E12 |
|16 |2.0922789888E13|
+---+---------------+
```
Closes #30745 from rwpenney/feature/agg-product.
Lead-authored-by: Richard Penney <rwp@rwpenney.uk>
Co-authored-by: Richard Penney <rwpenney@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Move the datetime rebase SQL configs from the `legacy` namespace by:
1. Renaming the existing rebase configs, e.g. `spark.sql.legacy.parquet.datetimeRebaseModeInRead` -> `spark.sql.parquet.datetimeRebaseModeInRead`.
2. Adding the legacy configs as alternatives.
3. Deprecating the legacy rebase configs.
### Why are the changes needed?
The rebasing SQL configs like `spark.sql.legacy.parquet.datetimeRebaseModeInRead` can be used not only for migration from previous Spark versions but also to read/write datetime columns saved by other systems/frameworks/libs. So, the configs shouldn't be considered legacy configs.
### Does this PR introduce _any_ user-facing change?
Should not. Users will see a warning if they still use one of the legacy configs.
### How was this patch tested?
1. Manually checking new configs:
```scala
scala> spark.conf.get("spark.sql.parquet.datetimeRebaseModeInRead")
res0: String = EXCEPTION
scala> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY")
21/02/17 14:57:10 WARN SQLConf: The SQL config 'spark.sql.legacy.parquet.datetimeRebaseModeInRead' has been deprecated in Spark v3.2 and may be removed in the future. Use 'spark.sql.parquet.datetimeRebaseModeInRead' instead.
scala> spark.conf.get("spark.sql.parquet.datetimeRebaseModeInRead")
res2: String = LEGACY
```
2. By running a datetime rebasing test suite:
```
$ build/sbt "test:testOnly *ParquetRebaseDatetimeV1Suite"
```
Closes #31576 from MaxGekk/rebase-confs-alternatives.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to use `_create_udf` where we need to create `UserDefinedFunction` to maintain codes easier.
### Why are the changes needed?
For the better readability of codes and maintenance.
### Does this PR introduce _any_ user-facing change?
No, refactoring.
### How was this patch tested?
Ran the existing unittests. CI in this PR should test it out too.
Closes #31537 from HyukjinKwon/SPARK-34408.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Creating a Pandas dataframe via Apache Arrow currently can use twice as much memory as the final result, because during the conversion, both Pandas and Arrow retain a copy of the data. Arrow has a "self-destruct" mode now (Arrow >= 0.16) to avoid this, by freeing each column after conversion. This PR integrates support for this in toPandas, handling a couple of edge cases:
self_destruct has no effect unless the memory is allocated appropriately, which is handled in the Arrow serializer here. Essentially, the issue is that self_destruct frees memory column-wise, but Arrow record batches are oriented row-wise:
```
Record batch 0: allocation 0: column 0 chunk 0, column 1 chunk 0, ...
Record batch 1: allocation 1: column 0 chunk 1, column 1 chunk 1, ...
```
In this scenario, Arrow will drop references to all of column 0's chunks, but no memory will actually be freed, as the chunks were just slices of an underlying allocation. The PR copies each column into its own allocation so that memory is instead arranged as so:
```
Record batch 0: allocation 0 column 0 chunk 0, allocation 1 column 1 chunk 0, ...
Record batch 1: allocation 2 column 0 chunk 1, allocation 3 column 1 chunk 1, ...
```
The optimization is disabled by default, and can be enabled with the Spark SQL conf "spark.sql.execution.arrow.pyspark.selfDestruct.enabled" set to "true". We can't always apply this optimization because it's more likely to generate a dataframe with immutable buffers, which Pandas doesn't always handle well, and because it is slower overall (since it only converts one column at a time instead of in parallel).
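For example, the option could be enabled per session roughly as follows (a sketch; the self-destruct conf name is taken from the description above, and `spark.sql.execution.arrow.pyspark.enabled` is assumed to be the usual switch for Arrow-based `toPandas`):
```python
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.pyspark.selfDestruct.enabled", "true")

# Arrow buffers are now released column by column during the conversion,
# trading some conversion speed for a lower peak memory footprint.
pdf = spark.range(10_000_000).toPandas()
```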
### Why are the changes needed?
This lets us load larger datasets - in particular, with N bytes of memory, before we could never load a dataset bigger than N/2 bytes; now the overhead is more like N/1.25 or so.
### Does this PR introduce _any_ user-facing change?
Yes - it adds a new SQL conf "spark.sql.execution.arrow.pyspark.selfDestruct.enabled"
### How was this patch tested?
See the [mailing list](http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Reducing-memory-usage-of-toPandas-with-Arrow-quot-self-destruct-quot-option-td30149.html) - it was tested with Python memory_profiler. Unit tests added to check memory within certain bounds and correctness with the option enabled.
Closes #29818 from lidavidm/spark-32953.
Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
### What changes were proposed in this pull request?
In the PR, I propose new options for the Parquet datasource:
1. `datetimeRebaseMode`
2. `int96RebaseMode`
Both options affect loading of ancient date and timestamp column values from Parquet files. The `datetimeRebaseMode` option affects loading values of the `DATE`, `TIMESTAMP_MICROS` and `TIMESTAMP_MILLIS` types, while `int96RebaseMode` affects loading of `INT96` timestamps.
The options support the same values as the SQL configs `spark.sql.legacy.parquet.datetimeRebaseModeInRead` and `spark.sql.legacy.parquet.int96RebaseModeInRead`, namely:
- `"LEGACY"`, when an option is set to this value, Spark rebases dates/timestamps from the legacy hybrid calendar (Julian + Gregorian) to the Proleptic Gregorian calendar.
- `"CORRECTED"`, dates/timestamps are read AS IS from parquet files.
- `"EXCEPTION"`, when it is set as an option value, Spark will fail the reading if it sees ancient dates/timestamps that are ambiguous between the two calendars.
### Why are the changes needed?
1. The new options allow loading Parquet files from two or more sources with different rebasing modes in the same query. For instance:
```scala
val df1 = spark.read.option("datetimeRebaseMode", "legacy").parquet(folder1)
val df2 = spark.read.option("datetimeRebaseMode", "corrected").parquet(folder2)
df1.join(df2, ...)
```
Before the changes, this was impossible because the SQL config `spark.sql.legacy.parquet.datetimeRebaseModeInRead` affects both reads.
2. Mixing the Dataset/DataFrame and RDD APIs becomes possible. Since SQL configs are not propagated through RDDs, the following code fails on ancient timestamps:
```scala
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "legacy")
spark.read.parquet(folder).distinct.rdd.collect()
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
By running the modified test suites:
```
$ build/sbt "sql/test:testOnly *ParquetRebaseDatetimeV1Suite"
$ build/sbt "sql/test:testOnly *ParquetRebaseDatetimeV2Suite"
```
Closes #31489 from MaxGekk/parquet-rebase-options.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
The current implementation of some DDL commands does not unify the output and does not pass the output attributes properly to the physical command.
For example, `ShowTables` outputs the attribute `namespace`, but `ShowTablesCommand` outputs the attribute `database`.
This PR passes the output attributes from `ShowTables` to `ShowTablesCommand` and from `ShowTableExtended` to `ShowTablesCommand` in the query plan.
Take `show tables` and `show table extended like 'tbl'` as an example.
The output before this PR:

`show tables`

| database | tableName | isTemporary |
| -- | -- | -- |
| default | tbl | false |

If catalog is v2 session catalog, the output before this PR:

| namespace | tableName |
| -- | -- |
| default | tbl |

`show table extended like 'tbl'`

| database | tableName | isTemporary | information |
| -- | -- | -- | -- |
| default | tbl | false | Database: default... |

The output after this PR:

`show tables`

| namespace | tableName | isTemporary |
| -- | -- | -- |
| default | tbl | false |

`show table extended like 'tbl'`

| namespace | tableName | isTemporary | information |
| -- | -- | -- | -- |
| default | tbl | false | Database: default... |
### Why are the changes needed?
This PR has the following benefits:
First, it unifies the output schema of SHOW TABLES.
Second, passing the output attributes keeps the expression IDs unchanged, which avoids bugs when we apply more operators on top of the command's output DataFrame.
### Does this PR introduce _any_ user-facing change?
Yes.
The output schema of `SHOW TABLES` replaces `database` with `namespace`.
### How was this patch tested?
Jenkins test.
Closes #31245 from beliefer/SPARK-34157.
Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
This changeset is published into the public domain.
### What changes were proposed in this pull request?
Some typos and syntax issues in docstrings and the output of `dev/lint-python` have been fixed.
### Why are the changes needed?
In some places, the documentation did not refer to parameters or classes by the full and correct name, potentially causing uncertainty in the reader or rendering issues in Sphinx. Also, a typo in the standard output of `dev/lint-python` was fixed.
### Does this PR introduce _any_ user-facing change?
Slight improvements in documentation, and in standard output of `dev/lint-python`.
### How was this patch tested?
Manual testing and `dev/lint-python` run. No new Sphinx warnings arise due to this change.
Closes #31401 from DavidToneian/SPARK-34300.
Authored-by: David Toneian <david@toneian.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR completes the snake_case rule for the functions APIs across the languages; see also SPARK-10621.
In more details, this PR:
- Adds `count_distinct` in Scala, Python, and R, and documents that `count_distinct` is encouraged. `countDistinct` was not deprecated because it is pretty commonly used; we could deprecate it in future releases.
- (Scala-specific) adds `typedlit` but doesn't deprecate `typedLit`, which is arguably commonly used. Likewise, we could deprecate it in future releases.
- Deprecates and renames:
- `sumDistinct` -> `sum_distinct`
- `bitwiseNOT` -> `bitwise_not`
- `shiftLeft` -> `shiftleft` (matched with SQL name in `FunctionRegistry`)
- `shiftRight` -> `shiftright` (matched with SQL name in `FunctionRegistry`)
- `shiftRightUnsigned` -> `shiftrightunsigned` (matched with SQL name in `FunctionRegistry`)
- (Scala-specific) `callUDF` -> `call_udf`
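For example, the renamed functions can be used as follows (a sketch with a hypothetical DataFrame `df` and columns):
```python
from pyspark.sql import functions as F

df.select(
    F.count_distinct("name"),        # encouraged alongside F.countDistinct
    F.sum_distinct("amount"),        # replaces the deprecated F.sumDistinct
    F.bitwise_not(F.col("flags")),   # replaces the deprecated F.bitwiseNOT
    F.shiftleft(F.col("flags"), 1),  # matches the SQL name in FunctionRegistry
)
```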
### Why are the changes needed?
To keep the consistent naming in APIs.
### Does this PR introduce _any_ user-facing change?
Yes, it deprecates some APIs and adds new renamed APIs as described above.
### How was this patch tested?
Unittests were added.
Closes #31408 from HyukjinKwon/SPARK-34306.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Adds `NullType` support for Arrow executions.
### Why are the changes needed?
As Arrow supports null type, we can convert `NullType` between PySpark and pandas with Arrow enabled.
### Does this PR introduce _any_ user-facing change?
Yes, if a user has a DataFrame including `NullType`, it can now be converted with Arrow enabled.
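A quick sketch of the conversion (the column is produced via SQL so it gets `NullType`):
```python
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

df = spark.sql("SELECT NULL AS n")  # column `n` has NullType
pdf = df.toPandas()                 # now converted with Arrow enabled instead of falling back
```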
### How was this patch tested?
Added tests.
Closes #31285 from ueshin/issues/SPARK-33489/arrow_nulltype.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Added typing for keyword-only single argument udf overload.
### Why are the changes needed?
The intended use case is:
```
@udf(returnType="string")
def f(x): ...
```
### Does this PR introduce _any_ user-facing change?
Yes - a new typing for udf is considered valid.
### How was this patch tested?
Existing tests.
Closes #31282 from pgrz/patch-1.
Authored-by: pgrz <grzegorski.piotr@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR:
- Adds a small hierarchy of warnings to be used in PySpark applications. These extend built-in classes and a top-level `PySparkWarning`.
- Replaces `DeprecationWarnings` (intended for developers) with PySpark specific subclasses of `FutureWarning` (intended for end users).
### Why are the changes needed?
- To be more precise and give users additional control (in addition to standard module-level filters) over how PySpark warnings are handled.
- Correct semantics (at the moment we use `DeprecationWarning` in user-facing API, but it is intended "for warnings about deprecated features when those warnings are intended for other Python developers").
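For example, end users can target the new user-facing warnings with standard filters (a sketch using only the standard `warnings` machinery; the PySpark-specific class names are not assumed here):
```python
import warnings

# Silence PySpark deprecation notices, which are now FutureWarning subclasses,
# without touching FutureWarnings raised by other libraries.
warnings.filterwarnings("ignore", category=FutureWarning, module="pyspark")
```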
### Does this PR introduce _any_ user-facing change?
Yes. Code can raise a different type of warning than before.
### How was this patch tested?
Existing tests.
Closes #30985 from zero323/SPARK-33730.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to strip auto-generated casts. The main logic is:
1. Add a tag if a `Cast` is specified by the user.
2. Wrap `PrettyAttribute` in `usePrettyExpression`.
### Why are the changes needed?
Make SQL consistent with the DSL. Here is an inconsistent example before this PR:
```
-- output field name: FLOOR(1)
spark.emptyDataFrame.select(floor(lit(1)))
-- output field name: FLOOR(CAST(1 AS DOUBLE))
spark.sql("select floor(1)")
```
Note that we don't remove the `Cast`, so the auto-generated `Cast` can still work. The only changed place is `usePrettyExpression`, where we use `PrettyAttribute` to replace `Cast` to give a better SQL string.
### Does this PR introduce _any_ user-facing change?
Yes, the default field name may change.
### How was this patch tested?
Added a test; existing tests pass.
Closes #31034 from ulysses-you/SPARK-33989.
Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR is a follow-up of #29138 and #29195 to add more tests for the `slice` function.
### Why are the changes needed?
The original PRs are missing tests with column-based arguments instead of literals.
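For reference, the two calling styles look roughly like this (a sketch with hypothetical data):
```python
from pyspark.sql import functions as F

df = spark.createDataFrame([([1, 2, 3, 4], 2, 2)], ["x", "start", "length"])

df.select(F.slice("x", 2, 2)).show()                             # literal arguments
df.select(F.slice("x", F.col("start"), F.col("length"))).show()  # column-based arguments
```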
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added tests and existing tests.
Closes #31159 from ueshin/issues/SPARK-32338/slice_tests.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR is a followup of https://github.com/apache/spark/pull/27406. It fixes the naming to match with Scala side.
Note that there is already a bit of inconsistency, e.g. `col`, `e`, `expr` and `column`. This part I did not change, but other differences such as `zero` vs `initialValue` or `col1`/`col2` vs `left`/`right` look unnecessary.
### Why are the changes needed?
To make the usage similar with Scala side, and for consistency.
### Does this PR introduce _any_ user-facing change?
No, this is not released yet.
### How was this patch tested?
GitHub Actions and Jenkins build will test it out.
Closes #31062 from HyukjinKwon/SPARK-30681.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Follow up work for #30521, document the following behaviors in the API doc:
- Document the effects when configurations (provider/partitionBy) conflict with the existing table.
- Document the lack of functionality for creating a v2 table, and guide users to ensure a table is created beforehand to avoid an unintended/insufficient table being created.
### Why are the changes needed?
We don't yet have full support for creating a V2 table in this API. (TODO: SPARK-33638)
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Document only.
Closes #30885 from xuanyuanking/SPARK-33659.
Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to:
- Make doctests simpler to show the usage (since we're not running them now).
- Use the test utils to drop the tables if exists.
### Why are the changes needed?
Better docs and code readability.
### Does this PR introduce _any_ user-facing change?
No, dev-only. It includes some doc changes in unreleased branches.
### How was this patch tested?
Manually tested.
```bash
cd python
./run-tests --python-executable=python3.9,python3.8 --testnames "pyspark.sql.tests.test_streaming StreamingTests"
```
Closes #30873 from HyukjinKwon/SPARK-33836.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
This PR proposes to expose `DataStreamReader.table` (SPARK-32885) and `DataStreamWriter.toTable` (SPARK-32896) to PySpark, which are the only way to read and write with table in Structured Streaming.
### Why are the changes needed?
Please refer to SPARK-32885 and SPARK-32896 for the rationale behind these public APIs. This PR only exposes them to PySpark.
### Does this PR introduce _any_ user-facing change?
Yes, PySpark users will be able to read and write with table in Structured Streaming query.
### How was this patch tested?
Manually tested.
> v1 table
>> create table A and ingest to the table A
```
spark.sql("""
create table table_pyspark_parquet (
value long,
`timestamp` timestamp
) USING parquet
""")
df = spark.readStream.format('rate').option('rowsPerSecond', 100).load()
query = df.writeStream.toTable('table_pyspark_parquet', checkpointLocation='/tmp/checkpoint5')
query.lastProgress
query.stop()
```
>> read table A and ingest to the table B which doesn't exist
```
df2 = spark.readStream.table('table_pyspark_parquet')
query2 = df2.writeStream.toTable('table_pyspark_parquet_nonexist', format='parquet', checkpointLocation='/tmp/checkpoint2')
query2.lastProgress
query2.stop()
```
>> select tables
```
spark.sql("DESCRIBE TABLE table_pyspark_parquet").show()
spark.sql("SELECT * FROM table_pyspark_parquet").show()
spark.sql("DESCRIBE TABLE table_pyspark_parquet_nonexist").show()
spark.sql("SELECT * FROM table_pyspark_parquet_nonexist").show()
```
> v2 table (leveraging Apache Iceberg as it provides V2 table and custom catalog as well)
>> create table A and ingest to the table A
```
spark.sql("""
create table iceberg_catalog.default.table_pyspark_v2table (
value long,
`timestamp` timestamp
) USING iceberg
""")
df = spark.readStream.format('rate').option('rowsPerSecond', 100).load()
query = df.select('value', 'timestamp').writeStream.toTable('iceberg_catalog.default.table_pyspark_v2table', checkpointLocation='/tmp/checkpoint_v2table_1')
query.lastProgress
query.stop()
```
>> ingest to the non-existent table B
```
df2 = spark.readStream.format('rate').option('rowsPerSecond', 100).load()
query2 = df2.select('value', 'timestamp').writeStream.toTable('iceberg_catalog.default.table_pyspark_v2table_nonexist', checkpointLocation='/tmp/checkpoint_v2table_2')
query2.lastProgress
query2.stop()
```
>> ingest to the non-existent table C partitioned by `value % 10`
```
df3 = spark.readStream.format('rate').option('rowsPerSecond', 100).load()
df3a = df3.selectExpr('value', 'timestamp', 'value % 10 AS partition').repartition('partition')
query3 = df3a.writeStream.partitionBy('partition').toTable('iceberg_catalog.default.table_pyspark_v2table_nonexist_partitioned', checkpointLocation='/tmp/checkpoint_v2table_3')
query3.lastProgress
query3.stop()
```
>> select tables
```
spark.sql("DESCRIBE TABLE iceberg_catalog.default.table_pyspark_v2table").show()
spark.sql("SELECT * FROM iceberg_catalog.default.table_pyspark_v2table").show()
spark.sql("DESCRIBE TABLE iceberg_catalog.default.table_pyspark_v2table_nonexist").show()
spark.sql("SELECT * FROM iceberg_catalog.default.table_pyspark_v2table_nonexist").show()
spark.sql("DESCRIBE TABLE iceberg_catalog.default.table_pyspark_v2table_nonexist_partitioned").show()
spark.sql("SELECT * FROM iceberg_catalog.default.table_pyspark_v2table_nonexist_partitioned").show()
```
Closes #30835 from HeartSaVioR/SPARK-33836.
Lead-authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Co-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This replaces deprecated API usage in PySpark tests with the preferred APIs. These have been deprecated for some time and usage is not consistent within tests.
- https://docs.python.org/3/library/unittest.html#deprecated-aliases
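For example, tests now use the preferred assertion names from the linked documentation (a sketch, not an actual PySpark test):
```python
import unittest

class ExampleTest(unittest.TestCase):
    def test_preferred_assertions(self):
        # Preferred over the deprecated aliases assertEquals / assertRegexpMatches.
        self.assertEqual(1 + 1, 2)
        self.assertRegex("pyspark", r"spark$")

if __name__ == "__main__":
    unittest.main()
```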
### Why are the changes needed?
For consistency and eventual removal of deprecated APIs.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests
Closes #30557 from BryanCutler/replace-deprecated-apis-in-tests.
Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR intends to fix typos in the sub-modules:
* `R`
* `common`
* `dev`
* `mlib`
* `external`
* `project`
* `streaming`
* `resource-managers`
* `python`
Split per srowen https://github.com/apache/spark/pull/30323#issuecomment-728981618
NOTE: The misspellings have been reported at 706a726f87 (commitcomment-44064356)
### Why are the changes needed?
Misspelled words make it harder to read / understand content.
### Does this PR introduce _any_ user-facing change?
There are various fixes to documentation, etc...
### How was this patch tested?
No testing was performed
Closes #30402 from jsoref/spelling-R_common_dev_mlib_external_project_streaming_resource-managers_python.
Authored-by: Josh Soref <jsoref@users.noreply.github.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
There are some differences between Spark CSV, opencsv, and commons-csv; the typical case is described in SPARK-33566. When a value contains both unescaped quotes and an unescaped qualifier, the parsing results differ.
The reason for the difference is that Spark uses `STOP_AT_DELIMITER` as the default `UnescapedQuoteHandling` when building `CsvParser`, and it is not configurable.
On the other hand, opencsv and commons-csv use a parsing mechanism similar to `STOP_AT_CLOSING_QUOTE` by default.
So this PR makes the `unescapedQuoteHandling` option configurable, so that the same parsing result as opencsv and commons-csv can be obtained.
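A sketch of using the new option when reading CSV (hypothetical path):
```python
df = (spark.read
      .option("unescapedQuoteHandling", "STOP_AT_CLOSING_QUOTE")
      .csv("/path/to/data.csv"))
```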
### Why are the changes needed?
Make the `unescapedQuoteHandling` option configurable when reading CSV to make parsing more flexible.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Pass the Jenkins or GitHub Action
- Add a new case similar to that described in SPARK-33566
Closes #30518 from LuciferYang/SPARK-33566.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR adds the following functions (introduced in Scala API with SPARK-33061):
- `acosh`
- `asinh`
- `atanh`
to Python and R.
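A quick usage sketch of the new Python functions:
```python
from pyspark.sql import functions as F

spark.range(1, 4).select(
    F.acosh("id"),        # inverse hyperbolic cosine
    F.asinh("id"),        # inverse hyperbolic sine
    F.atanh(F.lit(0.5)),  # inverse hyperbolic tangent
).show()
```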
### Why are the changes needed?
Feature parity.
### Does this PR introduce _any_ user-facing change?
New functions.
### How was this patch tested?
New unit tests.
Closes #30501 from zero323/SPARK-33563.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Two new options, _modifiedBefore_ and _modifiedAfter_, are provided, expecting a value in 'YYYY-MM-DDTHH:mm:ss' format. _PartitioningAwareFileIndex_ considers these options during the process of checking for files, just before considering applied _PathFilters_ such as `pathGlobFilter`. In order to filter file results, a new PathFilter class was derived for this purpose. General house-keeping around classes extending PathFilter was performed for neatness. It became apparent support was needed to handle multiple potential path filters. Logic was introduced for this purpose and the associated tests written.
### Why are the changes needed?
When loading files from a data source, there can oftentimes be thousands of files within a given file path. In many cases I've seen, we want to start loading from a folder path and ideally be able to begin loading files having modification dates past a certain point. This would mean that out of thousands of potential files, only the ones with modification dates greater than the specified timestamp would be considered. This saves a ton of time automatically and reduces significant complexity managing this in code.
### Does this PR introduce _any_ user-facing change?
This PR introduces an option that can be used with batch-based Spark file data sources. A documentation update was made to reflect an example and usage of the new data source option.
**Example Usages**
_Load all CSV files modified after date:_
`spark.read.format("csv").option("modifiedAfter","2020-06-15T05:00:00").load()`
_Load all CSV files modified before date:_
`spark.read.format("csv").option("modifiedBefore","2020-06-15T05:00:00").load()`
_Load all CSV files modified between two dates:_
`spark.read.format("csv").option("modifiedAfter","2019-01-15T05:00:00").option("modifiedBefore","2020-06-15T05:00:00").load()
`
### How was this patch tested?
A handful of unit tests were added to support the positive, negative, and edge case code paths.
It's also live in a handful of our Databricks dev environments. (quoted from cchighman)
Closes #30411 from HeartSaVioR/SPARK-31962.
Lead-authored-by: CC Highman <christopher.highman@microsoft.com>
Co-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
This change adds MapType support for PySpark with Arrow, if using pyarrow >= 2.0.0.
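A sketch of the newly supported conversion (hypothetical data; requires pyarrow >= 2.0.0 as noted above):
```python
from pyspark.sql.types import MapType, StringType, IntegerType, StructType, StructField

spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

schema = StructType([StructField("m", MapType(StringType(), IntegerType()))])
df = spark.createDataFrame([({"a": 1, "b": 2},)], schema=schema)
pdf = df.toPandas()  # map values arrive as Python dicts
```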
### Why are the changes needed?
MapType was previously unsupported with Arrow.
### Does this PR introduce _any_ user-facing change?
User can now enable MapType for `createDataFrame()`, `toPandas()` with Arrow optimization, and with Pandas UDFs.
### How was this patch tested?
Added new PySpark tests for createDataFrame(), toPandas() and Scalar Pandas UDFs.
Closes #30393 from BryanCutler/arrow-add-MapType-SPARK-24554.
Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This minor PR updates the docs of `schema_of_csv` and `schema_of_json`. They now allow a foldable string column instead of only a string literal.
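For example, a foldable string column (a literal) is accepted, not only a plain Python string (sketch):
```python
from pyspark.sql import functions as F

spark.range(1).select(
    F.schema_of_json(F.lit('{"a": 1}')),
    F.schema_of_csv(F.lit("1,abc")),
).show(truncate=False)
```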
### Why are the changes needed?
The function docs of `schema_of_csv` and `schema_of_json` were not updated accordingly in previous PRs.
### Does this PR introduce _any_ user-facing change?
Yes, it updates user-facing docs.
### How was this patch tested?
Unit test.
Closes #30396 from viirya/minor-json-csv.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
In [SPARK-33139] we marked `setActiveSession` and `clearActiveSession` as deprecated APIs. It turns out they are widely used, and after discussion, even without this PR they should work with the unified view feature; it is only a risk if users really abuse these two APIs. So reverting the PR is needed.
[SPARK-33139] has two commits, including a follow-up. Revert them both.
### Why are the changes needed?
Revert.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing UT.
Closes #30367 from leanken/leanken-revert-SPARK-33139.
Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Removes encoding of the JVM response in `pyspark.sql.column.Column.__repr__`.
### Why are the changes needed?
API consistency and improved readability of the expressions.
### Does this PR introduce _any_ user-facing change?
Before this change
col("abc")
col("wąż")
result in
Column<b'abc'>
Column<b'w\xc4\x85\xc5\xbc'>
After this change we'll get
Column<'abc'>
Column<'wąż'>
### How was this patch tested?
Existing tests and manual inspection.
Closes #30322 from zero323/SPARK-33415.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR adds support for passing `Column`s as input to PySpark sorting functions.
### Why are the changes needed?
According to SPARK-26979, PySpark functions should support both Column and str arguments, when possible.
### Does this PR introduce _any_ user-facing change?
PySpark users can now provide both `Column` and `str` as an argument for `asc*` and `desc*` functions.
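For example, both argument styles are accepted (a sketch with hypothetical data):
```python
from pyspark.sql import functions as F

df = spark.createDataFrame([(2,), (1,)], ["x"])

df.orderBy(F.asc("x")).show()         # str argument
df.orderBy(F.asc(F.col("x"))).show()  # Column argument, now supported
```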
### How was this patch tested?
New unit tests.
Closes #30227 from zero323/SPARK-33257.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to migrate to [NumPy documentation style](https://numpydoc.readthedocs.io/en/latest/format.html), see also SPARK-33243.
While I am migrating, I also fixed some Python type hints accordingly.
### Why are the changes needed?
For better documentation as text itself, and generated HTMLs
### Does this PR introduce _any_ user-facing change?
Yes, they will see a better format of HTMLs, and better text format. See SPARK-33243.
### How was this patch tested?
Manually tested via running `./dev/lint-python`.
Closes #30181 from HyukjinKwon/SPARK-33250.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Remove the JSON formatted schema from comments for `from_json()` in Scala/Python APIs.
Closes #30201
### Why are the changes needed?
Schemas in JSON format are internal (not documented). They shouldn't be recommended for usage.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By linters.
Closes #30226 from MaxGekk/from_json-common-schema-parsing-2.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
As the Python evaluation consumes the parent iterator in a separate thread, it could consume more data from the parent even after the task ends and the parent is closed. Thus, we should use `ContextAwareIterator` to stop consuming after the task ends.
### Why are the changes needed?
Python/Pandas UDF right after off-heap vectorized reader could cause executor crash.
E.g.,:
```py
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

spark.range(0, 100000, 1, 1).write.parquet(path)
spark.conf.set("spark.sql.columnVector.offheap.enabled", True)
def f(x):
    return 0
fUdf = udf(f, LongType())
spark.read.parquet(path).select(fUdf('id')).head()
```
This is because the Python evaluation consumes the parent iterator in a separate thread, and it consumes more data from the parent even after the task ends and the parent is closed. If an off-heap column vector exists in the parent iterator, it could cause a segmentation fault which crashes the executor.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added tests, and manually.
Closes #30177 from ueshin/issues/SPARK-33277/python_pandas_udf.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Relax pyspark typing for sql str functions. These functions all pass the first argument through `_to_java_column`, such that a string or Column object is acceptable.
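A sketch of the relaxed stub signature (hypothetical names; the alias and exact signature are assumptions, not the actual stub):
```python
from typing import Union
from pyspark.sql.column import Column

ColumnOrName = Union[Column, str]

def upper(col: ColumnOrName) -> Column: ...
```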
### Why are the changes needed?
Convenience & ensuring the typing reflects the functionality
### Does this PR introduce _any_ user-facing change?
Yes, a backwards-compatible increase in functionality. But I think typing support is unreleased, so possibly no change to released versions.
### How was this patch tested?
Not tested. I am newish to Python typing with stubs, so someone should confirm this is the correct way to fix this.
Closes #30209 from dhimmel/patch-1.
Authored-by: Daniel Himmelstein <daniel.himmelstein@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Return the schema in SQL format instead of a Catalog string from the `SchemaOfCsv` expression.
### Why are the changes needed?
To unify the output of `schema_of_json()` and `schema_of_csv()`.
### Does this PR introduce _any_ user-facing change?
Yes, they can, but `schema_of_csv()` is usually used in combination with `from_csv()`, so the schema format shouldn't matter much.
Before:
```
> SELECT schema_of_csv('1,abc');
struct<_c0:int,_c1:string>
```
After:
```
> SELECT schema_of_csv('1,abc');
STRUCT<`_c0`: INT, `_c1`: STRING>
```
### How was this patch tested?
By existing test suites `CsvFunctionsSuite` and `CsvExpressionsSuite`.
Closes #30180 from MaxGekk/schema_of_csv-sql-schema.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>