### What changes were proposed in this pull request?
Remove old workarounds related to null ordering.
### Why are the changes needed?
In pandas-on-Spark, there are still some remaining places that call `Column._jc.(asc|desc)_nulls_(first|last)`, a workaround carried over from Koalas to support Spark 2.3. The workaround is no longer necessary.
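A minimal sketch of the call pattern being cleaned up (the DataFrame and column names are illustrative); the Java-column detour can be replaced with the public `Column` API that has been available since Spark 2.4:
```python
from pyspark.sql import SparkSession, Column

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([(1,), (None,)], ["v"])

# Koalas-era workaround: reach into the underlying Java column.
old_style = sdf.sort(Column(sdf["v"]._jc.asc_nulls_first()))

# Equivalent call through the public Column API (available since Spark 2.4).
new_style = sdf.sort(sdf["v"].asc_nulls_first())
```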
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Modified a couple of tests and ran the existing tests.
Closes#33597 from ueshin/issues/SPARK-36365/nulls_first_last.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR is a follow-up of https://github.com/apache/spark/pull/33570, which mistakenly changed the default value of the default index.
### Why are the changes needed?
It was mistakenly changed: the value was changed to check whether the tests actually pass, but I forgot to change it back.
### Does this PR introduce _any_ user-facing change?
No, pandas API on Spark is not released yet. It fixes up the default value that was mistakenly changed.
(The changed default value makes the test flaky because the row order is affected by an extra shuffle.)
### How was this patch tested?
Manually tested.
Closes#33596 from HyukjinKwon/SPARK-36338-followup.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to upgrade the PySpark GitHub Action job to use the latest Docker image `20210730`, which additionally includes `sklearn` and `mlflow`.
- 5ca94453d1
```
$ docker run -it --rm dongjoon/apache-spark-github-action-image:20210730 python3.9 -m pip list | grep mlflow
mlflow 1.19.0
$ docker run -it --rm dongjoon/apache-spark-github-action-image:20210730 python3.9 -m pip list | grep sklearn
sklearn 0.0
```
### Why are the changes needed?
This will save installation time.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the GitHub Action PySpark jobs.
Closes#33595 from dongjoon-hyun/SPARK-36345.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Changes in this PR:
- `AdaptiveSparkPlanExec` has new methods `finalPlanSupportsColumnar` and `doExecuteColumnar` to support adaptive queries where the final query stage produces columnar data.
- `SessionState` now has a new set of injectable rules named `finalQueryStagePrepRules` that can be applied to the final query stage.
- `AdaptiveSparkPlanExec` can now safely be wrapped by either `RowToColumnarExec` or `ColumnarToRowExec`.
A Spark plugin can use the new rules to remove the root `ColumnarToRowExec` transition that is inserted by previous rules and at execution time can call `finalPlanSupportsColumnar` to see if the final query stage is columnar. If the plan is columnar then the plugin can safely call `doExecuteColumnar`. The adaptive plan can be wrapped in either `RowToColumnarExec` or `ColumnarToRowExec` to force a particular output format. There are fast paths in both of these operators to avoid any redundant transitions.
### Why are the changes needed?
Without this change it is necessary to use reflection to get the final physical plan, to determine whether it is columnar, and to execute it as a columnar plan. `AdaptiveSparkPlanExec` only provides public methods for row-based execution.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
I have manually tested this patch with the RAPIDS Accelerator for Apache Spark.
Closes#33140 from andygrove/support-columnar-adaptive.
Authored-by: Andy Grove <andygrove73@gmail.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
### What changes were proposed in this pull request?
Move some logic related to `F.nanvl` to `DataTypeOps`.
### Why are the changes needed?
There are several places that branch on `FloatType` or `DoubleType` to use `F.nanvl`, but `DataTypeOps` should handle it.
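For context, a small sketch of the `F.nanvl` pattern used for float/double columns (the column name is illustrative), which maps `NaN` to a proper null so missing values can be handled uniformly:
```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([(1.0,), (float("nan"),), (None,)], "v double")

# nanvl(v, null) keeps ordinary values but replaces NaN with null.
sdf.select(F.nanvl("v", F.lit(None)).alias("v")).show()
```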
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#33582 from ueshin/issues/SPARK-36350/nan_to_null.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Upgrade Kubernetes Client Version to 5.6.0.
### Why are the changes needed?
The exponential backoff feature is extended with one more case:
[ Retry HTTP operation in case IOException too (exponential backoff)](https://github.com/fabric8io/kubernetes-client/pull/3293)
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Tested with existing unit and integration tests:
```
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- Launcher client dependencies
- SPARK-33615: Launcher client archives
- SPARK-33748: Launcher python client respecting PYSPARK_PYTHON
- SPARK-33748: Launcher python client respecting spark.pyspark.python and spark.pyspark.driver.python
- Launcher python client dependencies using a zip file
- Test basic decommissioning
- Test basic decommissioning with shuffle cleanup
- Test decommissioning with dynamic allocation & shuffle cleanups
- Test decommissioning timeouts
...
[INFO] BUILD SUCCESS
```
Closes#33593 from attilapiros/SPARK-36358.
Authored-by: attilapiros <piros.attila.zsolt@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR proposes to implement `distributed-sequence` index in Scala side.
### Why are the changes needed?
- Avoid unnecessary (de)serialization
- Keep the nullability in the input DataFrame when `distributed-sequence` is enabled. During the serialization, all fields are being nullable for now (see https://github.com/apache/spark/pull/32775#discussion_r645882104)
### Does this PR introduce _any_ user-facing change?
No to end users since pandas API on Spark is not released yet.
```python
import pyspark.pandas as ps
ps.set_option('compute.default_index_type', 'distributed-sequence')
ps.range(1).spark.print_schema()
```
Before:
```
root
|-- id: long (nullable = true)
```
After:
```
root
|-- id: long (nullable = false)
```
### How was this patch tested?
Manually tested, and existing tests should cover them.
Closes#33570 from HyukjinKwon/SPARK-36338.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR is a partial revert of https://github.com/apache/spark/pull/33567 that keeps the logic to skip mlflow-related tests if it is not installed.
### Why are the changes needed?
It's consistent with other optional libraries, e.g. PyArrow.
It also fixes up the potential dev breakage (see also https://github.com/apache/spark/pull/33567#issuecomment-889841829)
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
This is a partial revert. CI should test it out too.
Closes#33589 from HyukjinKwon/SPARK-36254.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This is a followup of https://github.com/apache/spark/pull/33352 , to simplify the JDBC aggregate pushdown:
1. We should get the schema of the aggregate query by asking the JDBC server, instead of calculating it ourselves. This simplifies the code a lot and is also more robust: the data type of SUM may vary across databases, so it's fragile to assume it always matches Spark's.
2. Because of 1, we can now remove the `dataType` property from the public `Sum` expression.
This PR also contains some small improvements:
1. Spark should deduplicate the aggregate expressions before pushing them down.
2. Improve the `toString` of public aggregate expressions to make them more SQL-like.
### Why are the changes needed?
code and API simplification
### Does this PR introduce _any_ user-facing change?
No, this API is not released yet.
### How was this patch tested?
existing tests
Closes#33579 from cloud-fan/dsv2.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
This PR proposes adding the Python packages `mlflow` and `sklearn` to enable the MLflow test in pandas API on Spark.
### Why are the changes needed?
To enable the MLflow test in pandas API on Spark.
### Does this PR introduce _any_ user-facing change?
No, it's test-only
### How was this patch tested?
Manually tested locally with `python/run-tests --testnames pyspark.pandas.mlflow`.
Closes#33567 from itholic/SPARK-36254.
Lead-authored-by: itholic <haejoon.lee@databricks.com>
Co-authored-by: Haejoon Lee <44108233+itholic@users.noreply.github.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR is a follow-up of https://github.com/apache/spark/pull/33414 to support more options for the `mode` argument in all APIs that have a `mode` argument, not only `DataFrame.to_csv`.
### Why are the changes needed?
To keep usage consistent across arguments that share the same name.
### Does this PR introduce _any_ user-facing change?
Yes, more options are available for all APIs that have a `mode` argument, the same as `DataFrame.to_csv`.
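A rough sketch of the intended consistency (the paths are placeholders; the accepted values are assumed to mirror the earlier `DataFrame.to_csv` change, i.e. pandas-style `'w'`/`'a'` alongside Spark's `'overwrite'`/`'append'`):
```python
import pyspark.pandas as ps

psdf = ps.DataFrame({"a": [1, 2], "b": [3, 4]})

# Spark-style values keep working across writers ...
psdf.to_parquet("/tmp/psdf_parquet", mode="overwrite")

# ... and the pandas-style aliases are assumed to be accepted consistently as well.
psdf.to_json("/tmp/psdf_json", mode="w")
psdf.to_csv("/tmp/psdf_csv", mode="a")
```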
### How was this patch tested?
Manually tested locally.
Closes#33569 from itholic/SPARK-35085-followup.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Move both `PruneFileSourcePartitionsSuite` and `PrunePartitionSuiteBase` to the package `org.apache.spark.sql.execution.datasources`. Did a bit of refactoring to enable this.
### Why are the changes needed?
Currently both `PruneFileSourcePartitionsSuite` and `PrunePartitionSuiteBase` are in package `org.apache.spark.sql.hive.execution` which doesn't look correct as these tests are not specific to Hive. Therefore, it's better to move them into `org.apache.spark.sql.execution.datasources`, the same place where the rule `PruneFileSourcePartitions` is at.
### Does this PR introduce _any_ user-facing change?
No, it's just test refactoring.
### How was this patch tested?
Using existing tests:
```
build/sbt "sql/testOnly *PruneFileSourcePartitionsSuite"
```
and
```
build/sbt "hive/testOnly *PruneHiveTablePartitionsSuite"
```
Closes#33564 from sunchao/SPARK-36136-partitions-suite.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
Just to fix some typos in ShuffleBlockPusher class.
### Why are the changes needed?
Fix the typos and make the code clearer.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
No tests needed.
Closes#33575 from zhuqi-lucas/master.
Authored-by: zhuqi-lucas <821684824@qq.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Reuse `IndexOpsMixin.isnull()` where the null check is needed.
### Why are the changes needed?
There are some places where we can reuse `IndexOpsMixin.isnull()` instead of directly using Spark `Column`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#33562 from ueshin/issues/SPARK-36333/reuse_isnull.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
As the discussion in https://github.com/apache/spark/pull/32928/files#r654049392, after confirming the compatibility, we can use a newer RocksDB version for the state store implementation.
### Why are the changes needed?
For further ARM support and to leverage the bug fixes in the newer version.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#33578 from xuanyuanking/SPARK-36347.
Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
The main change of this PR is to use `computeIfAbsent` to simplify the pattern of getting a value from a map and putting it when missing.
**Before**
```java
V oldValue = map.get(key);
if (oldValue == null) {
  V newValue = functionToProduceMissValue.apply(key);
  map.put(key, newValue);
}
```
**After**
```java
map.computeIfAbsent(key, functionToProduceMissValue);
```
### Why are the changes needed?
Simplify the code using the Java 8 API.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass the Jenkins or GitHub Action
Closes#33558 from LuciferYang/SPARK-36326.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Spark supports Timestamp in the Avro data source now. We should support TimestampNTZ in the Avro data source too.
As per the Avro spec https://avro.apache.org/docs/1.10.2/spec.html#Local+timestamp+%28microsecond+precision%29, Spark can convert the TimestampNTZ type from/to Avro's local timestamp type.
### Why are the changes needed?
To map Spark's TimestampNTZ type to Avro's local timestamp type.
### Does this PR introduce _any_ user-facing change?
Yes.
Spark will read/write the TimestampNTZ type as Avro's local timestamp.
### How was this patch tested?
New tests.
Closes#33413 from beliefer/SPARK-36175.
Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Improve the error message for wrong type when calling dropDuplicates in pyspark.
### Why are the changes needed?
The current error message is cryptic and can be unclear to less experienced users.
### Does this PR introduce _any_ user-facing change?
Yes, it adds a `TypeError` when a user passes the wrong type to `dropDuplicates`.
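A short sketch of the improved behavior (the exact message is illustrative):
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(3)

df.dropDuplicates(["id"])  # OK: subset is a list of column names
df.dropDuplicates("id")    # now raises a TypeError explaining that a list is expected,
                           # rather than surfacing a cryptic lower-level error
```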
### How was this patch tested?
There is currently no testing for error messages in PySpark DataFrame functions.
Closes#33364 from sammyjmoseley/sm/add-type-checking-for-drop-duplicates.
Lead-authored-by: Samuel Moseley <smoseley@palantir.com>
Co-authored-by: Sammy Moseley <moseley.sammy@gmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to add a new config to allow K8s API server-side caching for pod listing.
### Why are the changes needed?
Apache Spark currently requests the most recent data, which must be consistent. The new configuration loosens this restriction to reduce the server-side overhead by allowing K8s API server-side caching.
https://kubernetes.io/docs/reference/using-api/api-concepts/#the-resourceversion-parameter
- `resourceVersion`: unset
> Most Recent: Return data at the most recent resource version. The returned data must be consistent (i.e. served from etcd via a quorum read).
- `resourceVersion`: "0"
> Any: Return data at any resource version. The newest available resource version is preferred, but strong consistency is not required; data at any resource version may be served.
### Does this PR introduce _any_ user-facing change?
Yes, this is a new feature to reduce the K8s API server side overhead.
### How was this patch tested?
Pass the CIs.
Closes#33563 from dongjoon-hyun/SPARK-36334.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR proposes to support ANSI interval literals for `TimeWindow`.
### Why are the changes needed?
Watermark already supports ANSI interval literals, so it's good to support them for `TimeWindow` too.
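A sketch of the kind of query this enables (the data is illustrative); the window duration can be given as a day-time interval literal instead of a duration string:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    SELECT window(ts, INTERVAL '10' MINUTE), count(*)
    FROM VALUES (timestamp'2021-07-29 10:01:00'),
                (timestamp'2021-07-29 10:15:00') AS t(ts)
    GROUP BY window(ts, INTERVAL '10' MINUTE)
""").show(truncate=False)
```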
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New test.
Closes#33551 from sarutak/window-interval.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This is a follow-up of https://github.com/apache/spark/pull/33447, where the unit test was disabled due to a failure after the memory setting changed. I found the root cause: after https://github.com/apache/spark/pull/33447, the Spark memory page byte size in the unit test changed from `67108864` to `33554432` [1]. So the shuffled hash join build size also changed accordingly, due to the [memory page byte size change](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L457). Previously the unit test checked the exact value of the build size, so it no longer works. Here we change the unit test to verify the relative value of the build size instead, which should work.
[1]: I printed out the memory page byte size explicitly in unit test - `org.apache.spark.SparkException: chengsu pageSizeBytes: 33554432!` in https://github.com/c21/spark/runs/3186680616?check_suite_focus=true .
### Why are the changes needed?
Make previously disabled unit test work.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Changed unit test itself.
Closes#33494 from c21/test.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
In PR #32959, we found some weird datetime strings that can be parsed. ([details](https://github.com/apache/spark/pull/32959#discussion_r665015489))
This PR blocks the invalid datetime string.
### Why are the changes needed?
bug fix
### Does this PR introduce _any_ user-facing change?
Yes, below strings will have different results when cast to datetime.
```sql
select cast('12::' as timestamp); -- Before: 2021-07-07 12:00:00, After: NULL
select cast('T' as timestamp); -- Before: 2021-07-07 00:00:00, After: NULL
```
### How was this patch tested?
some new test cases
Closes#33490 from linhongliu-db/SPARK-35780-block-invalid-format.
Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Improve the rest of DataTypeOps tests by avoiding joins.
### Why are the changes needed?
bool, string, numeric DataTypeOps tests have been improved by avoiding joins.
We should improve the rest of the DataTypeOps tests in the same way.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit tests.
Closes#33546 from xinrong-databricks/test_no_join.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Proposing to add new metrics to `customMetrics` under `stateOperators` in the `StreamingQueryProgress` event. These metrics give better visibility into the RocksDB-based state store in streaming jobs. For full details of the metrics, refer to https://issues.apache.org/jira/browse/SPARK-36236.
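A rough sketch of where the new metrics surface (the query is illustrative; the RocksDB provider itself requires Spark 3.2+):
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set(
    "spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")

counts = (spark.readStream.format("rate").option("rowsPerSecond", 10).load()
          .groupBy("value").count())

query = (counts.writeStream.outputMode("complete")
         .format("memory").queryName("rocksdb_metrics_demo").start())
query.processAllAvailable()

# The proposed metrics appear under each state operator's customMetrics.
for op in query.lastProgress["stateOperators"]:
    print(op.get("customMetrics"))

query.stop()
```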
### Why are the changes needed?
The metrics currently available for the RocksDB state store do not provide observability into many operations, such as how much time RocksDB spends in compaction or what the cache hit ratio is. These metrics help compare performance differences in state store operations between slow and fast microbatches.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit tests.
Closes#33455 from vkorukanti/rocksdb-metrics.
Authored-by: Venki Korukanti <venki.korukanti@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
Adjust `astype` of fractional Series with missing values to follow pandas.
Non-goal: adjusting `astype` of Decimal Series with missing values to follow pandas.
### Why are the changes needed?
`astype` of fractional Series with missing values doesn't behave the same as pandas; for example, a float Series returns itself when cast to integer with `astype`, whereas pandas raises a ValueError.
We ought to follow pandas.
### Does this PR introduce _any_ user-facing change?
Yes.
From:
```py
>>> import numpy as np
>>> import pyspark.pandas as ps
>>> psser = ps.Series([1, 2, np.nan])
>>> psser.astype(int)
0 1.0
1 2.0
2 NaN
dtype: float64
```
To:
```py
>>> import numpy as np
>>> import pyspark.pandas as ps
>>> psser = ps.Series([1, 2, np.nan])
>>> psser.astype(int)
Traceback (most recent call last):
...
ValueError: Cannot convert fractions with missing values to integer
```
### How was this patch tested?
Unit tests.
Closes#33466 from xinrong-databricks/extension_astype.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
1. `conv()` behaves inconsistently: the returned value differs above the 64-character threshold.
```
scala> spark.sql("select conv(repeat('?', 64), 10, 16)").show
+---------------------------+
|conv(repeat(?, 64), 10, 16)|
+---------------------------+
| 0|
+---------------------------+
scala> spark.sql("select conv(repeat('?', 65), 10, 16)").show // which should be 0
+---------------------------+
|conv(repeat(?, 65), 10, 16)|
+---------------------------+
| FFFFFFFFFFFFFFFF|
+---------------------------+
scala> spark.sql("select conv(repeat('?', 65), 10, -16)").show // which should be 0
+----------------------------+
|conv(repeat(?, 65), 10, -16)|
+----------------------------+
| -1|
+----------------------------+
scala> spark.sql("select conv(repeat('?', 64), 10, -16)").show
+----------------------------+
|conv(repeat(?, 64), 10, -16)|
+----------------------------+
| 0|
+----------------------------+
```
2. `conv()` should return a result equal to the max unsigned long value in base `toBase` when there is an overflow.
```
scala> spark.sql("select conv('aaaaaaa0aaaaaaa0a', 16, 10)").show // which should be 18446744073709551615
+-------------------------------+
|conv(aaaaaaa0aaaaaaa0a, 16, 10)|
+-------------------------------+
| 12297828695278266890|
+-------------------------------+
```
### Why are the changes needed?
Bug fix; this pull request aims to make the `conv` function behave similarly to MySQL's `conv` function.
### Does this PR introduce _any_ user-facing change?
Yes, the results of the `conv()` function change as described above.
### How was this patch tested?
Added tests.
Closes#33459 from dgd-contributor/SPARK-36229_convInconsistencyBehaviorWithMoreThan64Characters.
Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR groups exception messages in core/src/main/scala/org/apache/spark/rdd.
### Why are the changes needed?
It will largely help with standardization of error messages and their maintenance.
### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.
### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.
Closes#33317 from dgd-contributor/SPARK-36095_GroupExceptionCoreRdd.
Lead-authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Fix `Series`/`Index.copy()` to drop extra columns.
### Why are the changes needed?
Currently `Series`/`Index.copy()` keeps a copy of the anchor DataFrame, which holds unnecessary columns.
We can drop those columns in `Series`/`Index.copy()`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#33549 from ueshin/issues/SPARK-36320/index_ops_copy.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
remove the `$` in the `-Xms$256m` memory options
### Why are the changes needed?
fix typo
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
not needed
Closes#33557 from linhongliu-db/minor-fix.
Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Currently, when a map is parsed by the `from_json` function, only StringType keys are supported. If you try to parse any other key type, it results in a cast exception.
For example:
```scala
Seq((s"""{"2021-05-05T20:05:08": "sampleValue"}"""))
.toDF("value")
.withColumn("value1", from_json(col("value"), MapType(TimestampType, StringType)))
.show
```
```
Exception in thread "main" java.lang.ClassCastException: class org.apache.spark.unsafe.types.UTF8String cannot be cast to class java.lang.Long (org.apache.spark.unsafe.types.UTF8String is in unnamed module of loader 'app'; java.lang.Long is in module java.base of loader 'bootstrap')
at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:107)
at org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$8$adapted(Cast.scala:297)
at org.apache.spark.sql.catalyst.expressions.CastBase.buildCast(Cast.scala:285)
at org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$7(Cast.scala:297)
```
This PR proposes to improve the error message.
```
org.apache.spark.sql.AnalysisException: cannot resolve 'entries' due to data type mismatch: Input schema map<timestamp,string> can only contain StringType as a key type for a MapType.;
'Project [unresolvedalias(from_json(MapType(TimestampType,StringType,true), value#1, Some(America/Los_Angeles)), Some(org.apache.spark.sql.Column$$Lambda$1496/54693608710e5bf9c))]
+- LocalRelation [value#1]
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:197)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:182)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$2(TreeNode.scala:535)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
...
```
In https://github.com/apache/spark/pull/32599 we decided to improve the error message instead of supporting this.
### Why are the changes needed?
Avoid confusion in the interpretation of the error
### Does this PR introduce _any_ user-facing change?
Yes, the error message returned in this case changes.
### How was this patch tested?
Unit testing and manual testing
Closes#33525 from planga82/feature/spark35320_improve_error_message.
Authored-by: Pablo Langa <soypab@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add back ParquetSchemaConverter.checkFieldNames()
### Why are the changes needed?
Fix code
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Closes#33552 from AngersZhuuuu/SPARK-36312-FOLLOWUP.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Unify the schema check code of FileFormat and check Avro schema field names in CREATE TABLE DDL too.
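A hedged sketch of the intent (assuming the Avro data source is on the classpath; the exact exception and message are illustrative): a field name that Avro cannot store should now be rejected when the table is created, not only when data is written:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Avro field names may not contain characters such as spaces, so this DDL
# is expected to fail the unified schema check up front.
spark.sql("CREATE TABLE bad_avro (`a b` INT) USING avro")
```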
### Why are the changes needed?
Refactor code
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Not needed.
Closes#33441 from AngersZhuuuu/SPARK-36202.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to migrate the following `ALTER TABLE ... ADD/REPLACE COLUMNS` commands to use `UnresolvedTable` as a `child` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).
### Why are the changes needed?
This is a part of effort to make the relation lookup behavior consistent: [SPARK-29900](https://issues.apache.org/jira/browse/SPARK-29900).
### Does this PR introduce _any_ user-facing change?
After this PR, the above `ALTER TABLE ... ADD/REPLACE COLUMNS` commands will have a consistent resolution behavior.
### How was this patch tested?
Updated existing tests.
Closes#33200 from imback82/alter_add_cols.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This is a follow-up of https://github.com/apache/spark/pull/31522.
It adds docs for the new task/job commit time metrics.
### Why are the changes needed?
So that users can understand the metrics better and know that the new metrics are only for file table writes.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Build docs and preview:
![image](https://user-images.githubusercontent.com/1097932/127198210-2ab201d3-5fca-4065-ace6-0b930390380f.png)
Closes#33542 from gengliangwang/addDocForMetrics.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
The last PR only added the inner field check for Hive DDL; this PR adds the check to the Parquet data source write API.
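A hedged sketch of the case being covered (path and field name are illustrative): a nested field whose name Parquet cannot store should now fail with an `AnalysisException` up front instead of aborting the write job:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The inner struct field name contains a space, which Parquet rejects.
df = spark.sql("SELECT named_struct('a b', 1) AS s")
df.write.mode("overwrite").parquet("/tmp/bad_inner_field_name")
```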
### Why are the changes needed?
So that invalid inner field names fail earlier with an `AnalysisException`, instead of a `SparkException` from the aborted write job.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added a UT.
Without this change it failed as:
```
[info] - SPARK-36312: ParquetWriteSupport should check inner field *** FAILED *** (8 seconds, 29 milliseconds)
[info] Expected exception org.apache.spark.sql.AnalysisException to be thrown, but org.apache.spark.SparkException was thrown (HiveDDLSuite.scala:3035)
[info] org.scalatest.exceptions.TestFailedException:
[info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
[info] at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
[info] at org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1563)
[info] at org.scalatest.Assertions.intercept(Assertions.scala:756)
[info] at org.scalatest.Assertions.intercept$(Assertions.scala:746)
[info] at org.scalatest.funsuite.AnyFunSuite.intercept(AnyFunSuite.scala:1563)
[info] at org.apache.spark.sql.hive.execution.HiveDDLSuite.$anonfun$new$396(HiveDDLSuite.scala:3035)
[info] at org.apache.spark.sql.hive.execution.HiveDDLSuite.$anonfun$new$396$adapted(HiveDDLSuite.scala:3034)
[info] at org.apache.spark.sql.catalyst.plans.SQLHelper.withTempPath(SQLHelper.scala:69)
[info] at org.apache.spark.sql.catalyst.plans.SQLHelper.withTempPath$(SQLHelper.scala:66)
[info] at org.apache.spark.sql.QueryTest.withTempPath(QueryTest.scala:34)
[info] at org.apache.spark.sql.hive.execution.HiveDDLSuite.$anonfun$new$395(HiveDDLSuite.scala:3034)
[info] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info] at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1468)
[info] at org.apache.spark.sql.test.SQLTestUtilsBase.withView(SQLTestUtils.scala:316)
[info] at org.apache.spark.sql.test.SQLTestUtilsBase.withView$(SQLTestUtils.scala:314)
[info] at org.apache.spark.sql.hive.execution.HiveDDLSuite.withView(HiveDDLSuite.scala:396)
[info] at org.apache.spark.sql.hive.execution.HiveDDLSuite.$anonfun$new$394(HiveDDLSuite.scala:3032)
[info] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info] at org.scalatest.Transformer.apply(Transformer.scala:22)
[info] at org.scalatest.Transformer.apply(Transformer.scala:20)
[info] at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
[info] at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
[info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
[info] at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62)
[info] at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
[info] at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
[info] at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:62)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
[info] at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
[info] at scala.collection.immutable.List.foreach(List.scala:431)
[info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
[info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
[info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
[info] at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1563)
[info] at org.scalatest.Suite.run(Suite.scala:1112)
[info] at org.scalatest.Suite.run$(Suite.scala:1094)
[info] at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1563)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273)
[info] at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272)
[info] at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:62)
[info] at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
[info] at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
[info] at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
[info] at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:62)
[info] at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:318)
[info] at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:513)
[info] at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:413)
[info] at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[info] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[info] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[info] at java.lang.Thread.run(Thread.java:748)
[info] Cause: org.apache.spark.SparkException: Job aborted.
[info] at org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:496)
[info] at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:251)
[info] at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186)
[info] at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113)
[info] at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111)
[info] at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125)
[info] at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:97)
[info] at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
[info] at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
[info] at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
[info] at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
[info] at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
[info] at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:97)
[info] at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:93)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
[info] at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
[info] at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
[info] at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
[info] at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
[info] at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
[info] at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457)
[info] at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:93)
[info] at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:80)
[info] at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:78)
[info] at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:115)
[info] at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:848)
[info] at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:382)
[info] at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:355)
[info] at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
[info] at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:781)
[in
```
Closes#33531 from AngersZhuuuu/SPARK-36312.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
AQEShuffleReadExec already reports "number of skewed partitions" and "number of skewed partition splits".
It would be useful to also report "number of coalesced partitions", and for ShuffleExchange to report "number of partitions".
This way it's clear what happened on the map side and on the reduce side.
![Metrics](https://user-images.githubusercontent.com/4297661/126729820-cf01b3fa-7bc4-44a5-8098-91689766a68a.png)
### Why are the changes needed?
Improves usability
### Does this PR introduce _any_ user-facing change?
Yes, it now provides more information about `AQEShuffleReadExec` operator behavior in the metrics system.
### How was this patch tested?
Existing tests
Closes#32776 from ekoifman/PRISM-91635-customshufflereader-sql-metrics.
Authored-by: Eugene Koifman <eugene.koifman@workday.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR fixes an issue in `ResolveAggregateFunctions` where non-aggregated nested fields in ORDER BY and HAVING are not resolved correctly. This is because nested fields are resolved as aliases that fail to be semantically equal to any grouping/aggregate expressions.
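A sketch of the query shape this fixes (the data is illustrative): a nested field used for grouping is referenced again, non-aggregated, in HAVING and ORDER BY:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# s.a is a grouping expression; before the fix its occurrences in
# HAVING/ORDER BY could fail to resolve against the aggregate.
spark.sql("""
    SELECT count(*) AS cnt
    FROM VALUES (named_struct('a', 1)), (named_struct('a', 2)) AS t(s)
    GROUP BY s.a
    HAVING s.a > 0
    ORDER BY s.a
""").show()
```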
### Why are the changes needed?
To fix an analyzer issue.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit tests.
Closes#33498 from allisonwang-db/spark-36275-resolve-agg-func.
Authored-by: allisonwang-db <allison.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Update the Javadoc and the JDBC data source doc, and address follow-up comments.
### Why are the changes needed?
To update the docs and address follow-up comments.
### Does this PR introduce _any_ user-facing change?
Yes, it adds the new JDBC option `pushDownAggregate` to the JDBC data source doc.
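A hedged sketch of how the documented option is used with the V2 JDBC path (the catalog name, URL, and table are placeholders):
```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.catalog.mydb",
                 "org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog")
         .config("spark.sql.catalog.mydb.url", "jdbc:postgresql://db-host:5432/sales")
         .config("spark.sql.catalog.mydb.pushDownAggregate", "true")
         .getOrCreate())

# With the option enabled, supported aggregates may be pushed to the database
# instead of being computed in Spark.
spark.sql("SELECT region, count(*) FROM mydb.public.orders GROUP BY region").show()
```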
### How was this patch tested?
manually checked
Closes#33526 from huaxingao/aggPD_followup.
Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
1. Update the tables at https://spark.apache.org/docs/latest/sql-ref-datatypes.html about mapping ANSI interval types to Java/Scala/SQL types.
2. Remove `CalendarIntervalType` from the table of mapping Catalyst types to SQL types.
<img width="1028" alt="Screenshot 2021-07-27 at 20 52 57" src="https://user-images.githubusercontent.com/1580697/127204790-7ccb9c64-daf2-427d-963e-b7367aaa3439.png">
<img width="1017" alt="Screenshot 2021-07-27 at 20 53 22" src="https://user-images.githubusercontent.com/1580697/127204806-a0a51950-3c2d-4198-8a22-0f6614bb1487.png">
### Why are the changes needed?
To inform users which types from language APIs should be used as ANSI interval types.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manually checked by building the docs:
```
$ SKIP_RDOC=1 SKIP_API=1 SKIP_PYTHONDOC=1 bundle exec jekyll build
```
Closes#33543 from MaxGekk/doc-interval-type-lang-api.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
### What changes were proposed in this pull request?
This PR proposes to update the Sessionization examples to use native support of session windows. It also adds an example for PySpark, as native support of session windows is available in PySpark as well.
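A small sketch of the native API the examples move to (column names and gap duration are illustrative; starting the streaming query is omitted):
```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

events = (spark.readStream.format("rate").option("rowsPerSecond", 5).load()
          .withColumn("userId", F.col("value") % 3))

# Events from the same user within a 10-second gap fall into the same session.
sessionized = (events
               .withWatermark("timestamp", "10 seconds")
               .groupBy(F.session_window("timestamp", "10 seconds"), "userId")
               .count())
```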
### Why are the changes needed?
We should guide users to the simplest way to achieve the same workload. I'll provide another example for cases that can't be handled with native support of session windows.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manually tested.
Closes#33548 from HeartSaVioR/SPARK-36314.
Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
Fix `IndexOpsMixin.hasnans` to use `IndexOpsMixin.isnull().any()`.
### Why are the changes needed?
`IndexOpsMixin.hasnans` has a potential issue that can trigger the `a window function inside an aggregate function` error.
It also returns a wrong value when the `Series`/`Index` is empty.
```py
>>> ps.Series([]).hasnans
None
```
whereas:
```py
>>> pd.Series([]).hasnans
False
```
`IndexOpsMixin.any()` is safe for both cases.
### Does this PR introduce _any_ user-facing change?
`IndexOpsMixin.hasnans` will return `False` when empty.
### How was this patch tested?
Added some tests.
Closes#33547 from ueshin/issues/SPARK-36310/hasnan.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
With SPARK-34806 we can now easily add an equivalent for `Dataset.observe(Observation, Column, Column*)` to PySpark's `DataFrame` API.
### Why are the changes needed?
This further aligns the Python DataFrame API with Scala Dataset API.
### Does this PR introduce _any_ user-facing change?
Yes, it adds the `Observation` class and the `DataFrame.observe` method.
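A brief usage sketch of the new API (the metric names are illustrative):
```python
from pyspark.sql import SparkSession, Observation, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)

observation = Observation("stats")
observed = df.observe(observation,
                      F.count(F.lit(1)).alias("rows"),
                      F.max("id").alias("max_id"))

observed.collect()       # an action must run before the metrics are available
print(observation.get)   # e.g. {'rows': 10, 'max_id': 9}
```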
### How was this patch tested?
Adds test `test_observe` to `pyspark.sql.test.test_dataframe`.
Closes#33484 from EnricoMi/branch-observation-python.
Authored-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose to update the page https://spark.apache.org/docs/latest/sql-ref-datatypes.html and add information about the year-month and day-time interval types introduced by SPARK-27790.
<img width="932" alt="Screenshot 2021-07-27 at 10 38 23" src="https://user-images.githubusercontent.com/1580697/127115289-e633ca3a-2c18-49a0-a7c0-22421ae5c363.png">
### Why are the changes needed?
To inform users about new ANSI interval types, and improve UX with Spark SQL.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Should be tested by a GitHub action.
Closes#33518 from MaxGekk/doc-interval-types.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
### What changes were proposed in this pull request?
Previously we blocked creating tables with null-type columns to follow the Hive behavior in PR #28833.
In this PR, I propose to restore the previous behavior to support null-type columns in a table.
### Why are the changes needed?
For a complex query, it's possible to generate a column with null type. If this happens to the input query of CTAS, the query will fail because Spark doesn't allow creating a table with a null-type column. From the user's perspective, it's hard to figure out why the null-type column is produced in a complicated query and how to fix it, so removing this constraint is friendlier to users.
### Does this PR introduce _any_ user-facing change?
Yes, this reverts the previous behavior change in #28833; for example, the command below will succeed after this PR:
```sql
CREATE TABLE t (col_1 void, col_2 int)
```
### How was this patch tested?
newly added and existing test cases
Closes#33488 from linhongliu-db/SPARK-36241-support-void-column.
Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR aims to skip MiMa in PySpark/SparkR/Docker GHA job.
### Why are the changes needed?
This will save GHA resources because MiMa is irrelevant to Python.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the GHA.
Closes#33532 from williamhyun/mima.
Lead-authored-by: William Hyun <william@apache.org>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
The following code should type-check:
```python3
import uuid
import pyspark.sql.functions as F
my_udf = F.udf(lambda: str(uuid.uuid4())).asNondeterministic()
```
### What changes were proposed in this pull request?
The `udf` function should return a more specific type.
### Why are the changes needed?
Right now, `mypy` will throw spurious errors, such as for the code given above.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
This was not tested. Sorry, I am not very familiar with this repo -- are there any typing tests?
Closes#33399 from luranhe/patch-1.
Lead-authored-by: Luran He <luranjhe@gmail.com>
Co-authored-by: Luran He <luran.he@compass.com>
Signed-off-by: zero323 <mszymkiewicz@gmail.com>