Commit graph

31099 commits

Author SHA1 Message Date
Takuya UESHIN 359dfd7ec2 [SPARK-36333][PYTHON] Reuse isnull where the null check is needed
### What changes were proposed in this pull request?

Reuse `IndexOpsMixin.isnull()` where the null check is needed.

### Why are the changes needed?

There are some places where we can reuse `IndexOpsMixin.isnull()` instead of directly using Spark `Column`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33562 from ueshin/issues/SPARK-36333/reuse_isnull.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit 07ed82be0b)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-29 15:33:25 -07:00
Yuanjian Li e8462a584c [SPARK-36347][SS] Upgrade the RocksDB version to 6.20.3
### What changes were proposed in this pull request?
As the discussion in https://github.com/apache/spark/pull/32928/files#r654049392, after confirming the compatibility, we can use a newer RocksDB version for the state store implementation.

### Why are the changes needed?
For better ARM support and to leverage the bug fixes in the newer version.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #33578 from xuanyuanking/SPARK-36347.

Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 4cd5fa96d8)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-07-29 11:09:10 -07:00
Samuel Moseley e42324982d [SPARK-36161][PYTHON] Add type check on dropDuplicates pyspark function
### What changes were proposed in this pull request?
Improve the error message for wrong type when calling dropDuplicates in pyspark.

### Why are the changes needed?
The current error message is cryptic and can be unclear to less experienced users.

### Does this PR introduce _any_ user-facing change?
Yes, it adds a type error when a user passes the wrong type to dropDuplicates, as sketched below.
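For illustration, a minimal sketch of the misuse the new check targets (data and column names are made up; the exact error wording may differ):

```python
# Hypothetical sketch: dropDuplicates expects a list of column names.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (1, "a")], ["id", "name"])

df.dropDuplicates(["id", "name"]).show()  # correct usage
df.dropDuplicates("id")                   # wrong type: now rejected with a clear type error
```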

### How was this patch tested?
There is currently no testing for error messages in pyspark dataframe functions

Closes #33364 from sammyjmoseley/sm/add-type-checking-for-drop-duplicates.

Lead-authored-by: Samuel Moseley <smoseley@palantir.com>
Co-authored-by: Sammy Moseley <moseley.sammy@gmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit a07df1acc6)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-29 19:11:57 +09:00
Kousuke Saruta d247a6cd1a [SPARK-36323][SQL] Support ANSI interval literals for TimeWindow
### What changes were proposed in this pull request?

This PR proposes to support ANSI interval literals for `TimeWindow`.
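A sketch of the intended usage (assumes an active SparkSession `spark` and a registered table `events` with a timestamp column `ts`):

```python
# Sketch only: `events`/`ts` are assumed names. After this change the window duration
# can be given as an ANSI interval literal instead of a string like '10 minutes'.
spark.sql("""
    SELECT window(ts, INTERVAL '10' MINUTE) AS w, count(*) AS cnt
    FROM events
    GROUP BY window(ts, INTERVAL '10' MINUTE)
""").show()
```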

### Why are the changes needed?

Watermark already supports ANSI interval literals, so it makes sense to support them for `TimeWindow` as well.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New test.

Closes #33551 from sarutak/window-interval.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit db18866742)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-07-29 08:52:10 +03:00
Cheng Su 6d188cbb08 [SPARK-36272][SQL][TEST] Change shuffled hash join metrics test to check relative value of build size
### What changes were proposed in this pull request?

This is a follow-up of https://github.com/apache/spark/pull/33447, where the unit test was disabled due to a failure after the memory setting change. The root cause is that after https://github.com/apache/spark/pull/33447, the Spark memory page byte size in the unit test changed from `67108864` to `33554432` [1], so the shuffled hash join build size also changed accordingly due to the [memory page byte size change](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L457). Previously the unit test checked the exact value of the build size, so it no longer works. Here we change the unit test to verify the relative value of the build size instead, which should work.

[1]: I printed out the memory page byte size explicitly in unit test - `org.apache.spark.SparkException: chengsu pageSizeBytes: 33554432!` in https://github.com/c21/spark/runs/3186680616?check_suite_focus=true .

### Why are the changes needed?

Make previously disabled unit test work.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Changed unit test itself.

Closes #33494 from c21/test.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 6a8dd3229a)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-29 11:14:48 +09:00
Linhong Liu fa521c1506 [SPARK-36286][SQL] Block some invalid datetime string
### What changes were proposed in this pull request?
In PR #32959, we found some weird datetime strings that can be parsed. ([details](https://github.com/apache/spark/pull/32959#discussion_r665015489))
This PR blocks the invalid datetime string.

### Why are the changes needed?
bug fix

### Does this PR introduce _any_ user-facing change?
Yes, below strings will have different results when cast to datetime.
```sql
select cast('12::' as timestamp); -- Before: 2021-07-07 12:00:00, After: NULL
select cast('T' as timestamp); -- Before: 2021-07-07 00:00:00, After: NULL
```

### How was this patch tested?
some new test cases

Closes #33490 from linhongliu-db/SPARK-35780-block-invalid-format.

Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit ed0e351f05)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-29 09:17:01 +08:00
Xinrong Meng 999cf81653 [SPARK-36190][PYTHON] Improve the rest of DataTypeOps tests by avoiding joins
### What changes were proposed in this pull request?
Improve the rest of DataTypeOps tests by avoiding joins.

### Why are the changes needed?
bool, string, numeric DataTypeOps tests have been improved by avoiding joins.
We should improve the rest of the DataTypeOps tests in the same way.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Unit tests.

Closes #33546 from xinrong-databricks/test_no_join.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit 9c5cb99d6e)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-28 15:53:51 -07:00
Venki Korukanti c236101d4c [SPARK-36236][SS] Additional metrics for RocksDB based state store implementation
### What changes were proposed in this pull request?

Proposing to add new metrics to `customMetrics` under `stateOperators` in the `StreamingQueryProgress` event. These metrics help provide better visibility into the RocksDB-based state store in streaming jobs. For full details of the metrics, refer to https://issues.apache.org/jira/browse/SPARK-36236.
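A rough sketch of where the new metrics surface (assumes `query` is a running streaming query using the RocksDB state store provider; the concrete metric names are listed in the JIRA):

```python
# Sketch only: `query` is an assumed pyspark.sql.streaming.StreamingQuery handle.
progress = query.lastProgress  # parsed progress as a dict, or None before the first batch
if progress is not None:
    for op in progress["stateOperators"]:
        # the RocksDB-specific metrics added by this change appear under "customMetrics"
        print(op.get("customMetrics", {}))
```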

### Why are the changes needed?

The current metrics available for the RocksDB state store do not provide observability into many operations, such as how much time is spent by RocksDB in compaction and what the cache hit ratio is. These metrics help compare performance differences in state store operations between slow and fast microbatches.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unittests

Closes #33455 from vkorukanti/rocksdb-metrics.

Authored-by: Venki Korukanti <venki.korukanti@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit eb4d1c0332)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-07-28 12:51:56 -07:00
Xinrong Meng bdf1570911 [SPARK-36143][PYTHON] Adjust astype of fractional Series with missing values to follow pandas
### What changes were proposed in this pull request?
Adjust `astype` of fractional Series with missing values to follow pandas.

Non-goal: Adjust the issue of `astype` of Decimal Series with missing values to follow pandas.

### Why are the changes needed?
`astype` of a fractional Series with missing values doesn't behave the same as pandas; for example, a float Series returns itself when cast to integer with `astype`, whereas pandas raises a ValueError.

We ought to follow pandas.

### Does this PR introduce _any_ user-facing change?
Yes.

From:
```py
>>> import numpy as np
>>> import pyspark.pandas as ps
>>> psser = ps.Series([1, 2, np.nan])
>>> psser.astype(int)
0    1.0
1    2.0
2    NaN
dtype: float64

```

To:
```py
>>> import numpy as np
>>> import pyspark.pandas as ps
>>> psser = ps.Series([1, 2, np.nan])
>>> psser.astype(int)
Traceback (most recent call last):
...
ValueError: Cannot convert fractions with missing values to integer

```

### How was this patch tested?
Unit tests.

Closes #33466 from xinrong-databricks/extension_astype.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit 01213095e2)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-28 11:26:58 -07:00
dgd-contributor c5b0cb2d94 [SPARK-36229][SQL] conv() inconsistently handles invalid strings with more than 64 invalid characters and return wrong value on overflow
### What changes were proposed in this pull request?
1/ conv() is inconsistent in behavior: the returned value differs above the 64-character threshold.

```
scala> spark.sql("select conv(repeat('?', 64), 10, 16)").show
+---------------------------+
|conv(repeat(?, 64), 10, 16)|
+---------------------------+
|                          0|
+---------------------------+

scala> spark.sql("select conv(repeat('?', 65), 10, 16)").show // which should be 0
+---------------------------+
|conv(repeat(?, 65), 10, 16)|
+---------------------------+
|           FFFFFFFFFFFFFFFF|
+---------------------------+

scala> spark.sql("select conv(repeat('?', 65), 10, -16)").show // which should be 0
+----------------------------+
|conv(repeat(?, 65), 10, -16)|
+----------------------------+
|                          -1|
+----------------------------+

scala> spark.sql("select conv(repeat('?', 64), 10, -16)").show
+----------------------------+
|conv(repeat(?, 64), 10, -16)|
+----------------------------+
|                           0|
+----------------------------+
```

2/ conv() should return a result equal to the max unsigned long value in base toBase when there is overflow.

```
scala> spark.sql("select conv('aaaaaaa0aaaaaaa0a', 16, 10)").show // which should be 18446744073709551615

+-------------------------------+
|conv(aaaaaaa0aaaaaaa0a, 16, 10)|
+-------------------------------+
|           12297828695278266890|
+-------------------------------+
```

### Why are the changes needed?
Bug fix. This pull request aims to make the conv function behave similarly to the conv function in the MySQL database.
### Does this PR introduce _any_ user-facing change?
change in result of conv() function
### How was this patch tested?
add test

Closes #33459 from dgd-contributor/SPARK-36229_convInconsistencyBehaviorWithMoreThan64Characters.

Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit e1c50ff779)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-29 00:19:19 +08:00
Takuya UESHIN c9af94ecb4 [SPARK-36320][PYTHON] Fix Series/Index.copy() to drop extra columns
### What changes were proposed in this pull request?

Fix `Series`/`Index.copy()` to drop extra columns.

### Why are the changes needed?

Currently `Series`/`Index.copy()` keeps the copy of the anchor DataFrame which holds unnecessary columns.
We can drop those when `Series`/`Index.copy()`.
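A small illustration (the user-visible result is unchanged; only the internal anchor frame of the copy is trimmed):

```python
# Illustrative sketch: copying a Series that anchors a wider DataFrame.
import pyspark.pandas as ps

psdf = ps.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
copied = psdf["a"].copy()
print(copied)  # same values as psdf["a"]; the unrelated column "b" is no longer kept internally
```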

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33549 from ueshin/issues/SPARK-36320/index_ops_copy.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 3c76a924ce)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-28 18:40:03 +09:00
Linhong Liu 3dc74e7aa6 [SPARK-36270][BUILD][FOLLOWUP] Fix typo in the sbt memory setting
### What changes were proposed in this pull request?
remove the `$` in the `-Xms$256m` memory options

### Why are the changes needed?
fix typo

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
not needed

Closes #33557 from linhongliu-db/minor-fix.

Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit aff1826331)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-28 17:18:07 +09:00
Angerszhuuuu b58170b192 [SPARK-36312][SQL][FOLLOWUP] Add back ParquetSchemaConverter.checkFieldNames
### What changes were proposed in this pull request?
Add back ParquetSchemaConverter.checkFieldNames()

### Why are the changes needed?
Fix code

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Closes #33552 from AngersZhuuuu/SPARK-36312-FOLLOWUP.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit f086c17b8e)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-28 14:38:36 +08:00
Angerszhuuuu 2f4f7936fd [SPARK-33865][SPARK-36202][SQL] When HiveDDL, we need check avro schema too
### What changes were proposed in this pull request?
Unify the schema check code of FileFormat and also check Avro schema field names in CREATE TABLE DDL

### Why are the changes needed?
Refactor code

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not needed

Closes #33441 from AngersZhuuuu/SPARK-36202.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 86f44578e5)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-28 14:04:37 +08:00
Terry Kim cd6b303d0f [SPARK-36006][SQL] Migrate ALTER TABLE ... ADD/REPLACE COLUMNS commands to use UnresolvedTable to resolve the identifier
### What changes were proposed in this pull request?

This PR proposes to migrate the following `ALTER TABLE ... ADD/REPLACE COLUMNS` commands to use `UnresolvedTable` as a `child` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).
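For reference, a sketch of the command shapes affected (identifiers are placeholders; REPLACE COLUMNS additionally requires a catalog/table that supports it):

```python
# Sketch only: assumes an active SparkSession `spark` and an existing table `ns.tbl`.
spark.sql("ALTER TABLE ns.tbl ADD COLUMNS (c3 INT, c4 STRING)")
spark.sql("ALTER TABLE ns.tbl REPLACE COLUMNS (c1 INT, c2 STRING)")
```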

### Why are the changes needed?

This is a part of effort to make the relation lookup behavior consistent: [SPARK-29900](https://issues.apache.org/jira/browse/SPARK-29900).

### Does this PR introduce _any_ user-facing change?

After this PR, the above `ALTER TABLE ... ADD/REPLACE COLUMNS` commands will have a consistent resolution behavior.

### How was this patch tested?

Updated existing tests.

Closes #33200 from imback82/alter_add_cols.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 809b88a162)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-28 14:00:44 +08:00
Gengliang Wang fa0c7f487b [SPARK-34399][DOCS][FOLLOWUP] Add docs for the new metrics of task/job commit time
### What changes were proposed in this pull request?

This is follow-up of https://github.com/apache/spark/pull/31522.
It adds docs for the new metrics of task/job commit time

### Why are the changes needed?

So that users can understand the metrics better and know that the new metrics are only for file table writes.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Build docs and preview:
![image](https://user-images.githubusercontent.com/1097932/127198210-2ab201d3-5fca-4065-ace6-0b930390380f.png)

Closes #33542 from gengliangwang/addDocForMetrics.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit c9a7ff3f36)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-28 13:54:48 +08:00
Angerszhuuuu 3c441135bb [SPARK-36312][SQL] ParquetWriterSupport.setSchema should check inner field
### What changes were proposed in this pull request?
The last PR only added the inner field check for Hive DDL; this PR adds the check for the Parquet data source write API.
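A sketch of the case now rejected earlier (path and field name are chosen for illustration; the nested field name contains a space, which Parquet field-name checking rejects):

```python
# Sketch only: assumes an active SparkSession `spark`.
# After this change the invalid nested field name fails with an AnalysisException
# during analysis instead of surfacing as a job-level SparkException at write time.
df = spark.sql("SELECT named_struct('a b', 1) AS s")
df.write.mode("overwrite").parquet("/tmp/spark-36312-sketch")
```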

### Why are the changes needed?
To fail earlier (at analysis time) with a clearer error instead of failing the write job.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT.

Without this change, the UT failed as follows:
```
[info] - SPARK-36312: ParquetWriteSupport should check inner field *** FAILED *** (8 seconds, 29 milliseconds)
[info]   Expected exception org.apache.spark.sql.AnalysisException to be thrown, but org.apache.spark.SparkException was thrown (HiveDDLSuite.scala:3035)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
[info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
[info]   at org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1563)
[info]   at org.scalatest.Assertions.intercept(Assertions.scala:756)
[info]   at org.scalatest.Assertions.intercept$(Assertions.scala:746)
[info]   at org.scalatest.funsuite.AnyFunSuite.intercept(AnyFunSuite.scala:1563)
[info]   at org.apache.spark.sql.hive.execution.HiveDDLSuite.$anonfun$new$396(HiveDDLSuite.scala:3035)
[info]   at org.apache.spark.sql.hive.execution.HiveDDLSuite.$anonfun$new$396$adapted(HiveDDLSuite.scala:3034)
[info]   at org.apache.spark.sql.catalyst.plans.SQLHelper.withTempPath(SQLHelper.scala:69)
[info]   at org.apache.spark.sql.catalyst.plans.SQLHelper.withTempPath$(SQLHelper.scala:66)
[info]   at org.apache.spark.sql.QueryTest.withTempPath(QueryTest.scala:34)
[info]   at org.apache.spark.sql.hive.execution.HiveDDLSuite.$anonfun$new$395(HiveDDLSuite.scala:3034)
[info]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1468)
[info]   at org.apache.spark.sql.test.SQLTestUtilsBase.withView(SQLTestUtils.scala:316)
[info]   at org.apache.spark.sql.test.SQLTestUtilsBase.withView$(SQLTestUtils.scala:314)
[info]   at org.apache.spark.sql.hive.execution.HiveDDLSuite.withView(HiveDDLSuite.scala:396)
[info]   at org.apache.spark.sql.hive.execution.HiveDDLSuite.$anonfun$new$394(HiveDDLSuite.scala:3032)
[info]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
[info]   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
[info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
[info]   at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62)
[info]   at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
[info]   at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
[info]   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:62)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
[info]   at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
[info]   at scala.collection.immutable.List.foreach(List.scala:431)
[info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
[info]   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
[info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
[info]   at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1563)
[info]   at org.scalatest.Suite.run(Suite.scala:1112)
[info]   at org.scalatest.Suite.run$(Suite.scala:1094)
[info]   at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1563)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273)
[info]   at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272)
[info]   at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:62)
[info]   at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
[info]   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
[info]   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
[info]   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:62)
[info]   at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:318)
[info]   at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:513)
[info]   at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:413)
[info]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[info]   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[info]   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[info]   at java.lang.Thread.run(Thread.java:748)
[info]   Cause: org.apache.spark.SparkException: Job aborted.
[info]   at org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:496)
[info]   at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:251)
[info]   at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186)
[info]   at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113)
[info]   at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111)
[info]   at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125)
[info]   at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:97)
[info]   at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
[info]   at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
[info]   at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
[info]   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
[info]   at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
[info]   at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:97)
[info]   at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:93)
[info]   at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
[info]   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
[info]   at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
[info]   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
[info]   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
[info]   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
[info]   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
[info]   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
[info]   at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457)
[info]   at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:93)
[info]   at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:80)
[info]   at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:78)
[info]   at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:115)
[info]   at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:848)
[info]   at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:382)
[info]   at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:355)
[info]   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
[info]   at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:781)
[in
```

Closes #33531 from AngersZhuuuu/SPARK-36312.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 59e0c25376)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-28 13:52:40 +08:00
Eugene Koifman c59e54fe0e [SPARK-35639][SQL] Add metrics about coalesced partitions to AQEShuffleRead in AQE
### What changes were proposed in this pull request?

AQEShuffleReadExec already reports "number of skewed partitions" and "number of skewed partition splits".
It would be useful to also report "number of coalesced partitions", and for ShuffleExchange to report "number of partitions".
This way it's clear what happened on the map side and on the reduce side.

![Metrics](https://user-images.githubusercontent.com/4297661/126729820-cf01b3fa-7bc4-44a5-8098-91689766a68a.png)

### Why are the changes needed?

Improves usability

### Does this PR introduce _any_ user-facing change?

Yes, it now provides more information about `AQEShuffleReadExec` operator behavior in the metrics system.

### How was this patch tested?

Existing tests

Closes #32776 from ekoifman/PRISM-91635-customshufflereader-sql-metrics.

Authored-by: Eugene Koifman <eugene.koifman@workday.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 41a16ebf11)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-28 13:50:04 +08:00
allisonwang-db 993ffafc3e [SPARK-36275][SQL] ResolveAggregateFunctions should works with nested fields
### What changes were proposed in this pull request?
This PR fixes an issue in `ResolveAggregateFunctions` where non-aggregated nested fields in ORDER BY and HAVING are not resolved correctly. This is because nested fields are resolved as aliases that fail to be semantically equal to any grouping/aggregate expressions.
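A sketch of the query shape affected (the schema is hypothetical): a nested field that is grouped on but not aggregated appears again in HAVING/ORDER BY:

```python
# Sketch only: assumes an active SparkSession `spark` and a table t(s STRUCT<a INT, b INT>, v INT).
spark.sql("""
    SELECT s.a, sum(v) AS total
    FROM t
    GROUP BY s.a
    HAVING s.a > 0
    ORDER BY s.a
""").show()
```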

### Why are the changes needed?
To fix an analyzer issue.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit tests.

Closes #33498 from allisonwang-db/spark-36275-resolve-agg-func.

Authored-by: allisonwang-db <allison.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 23a6ffa5dc)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-28 13:35:35 +08:00
allisonwang-db aea36aa977 [SPARK-36028][SQL][3.2] Allow Project to host outer references in scalar subqueries
This PR cherry picks https://github.com/apache/spark/pull/33235 to branch-3.2 to fix test failures introduced by https://github.com/apache/spark/pull/33284.

### What changes were proposed in this pull request?
This PR allows the `Project` node to host outer references in scalar subqueries when `decorrelateInnerQuery` is enabled. It is already supported by the new decorrelation framework and the `RewriteCorrelatedScalarSubquery` rule.

Note currently by default all correlated subqueries will be decorrelated, which is not necessarily the most optimal approach. Consider `SELECT (SELECT c1) FROM t`. This should be optimized as `SELECT c1 FROM t` instead of rewriting it as a left outer join. This will be done in a separate PR to optimize correlated scalar/lateral subqueries with OneRowRelation.

### Why are the changes needed?
To allow more types of correlated scalar subqueries.

### Does this PR introduce _any_ user-facing change?
Yes. This PR allows outer query column references in the SELECT clause of a correlated scalar subquery. For example:
```sql
SELECT (SELECT c1) FROM t;
```
Before this change:
```
org.apache.spark.sql.AnalysisException: Expressions referencing the outer query are not supported
outside of WHERE/HAVING clauses
```

After this change:
```
+------------------+
|scalarsubquery(c1)|
+------------------+
|0                 |
|1                 |
+------------------+
```

### How was this patch tested?
Added unit tests and SQL tests.

(cherry picked from commit ca348e50a4)
Signed-off-by: allisonwang-db <allison.wang@databricks.com>

Closes #33527 from allisonwang-db/spark-36028-3.2.

Authored-by: allisonwang-db <allison.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-28 12:54:15 +08:00
Huaxin Gao 33ef52e2c0 [SPARK-34952][SQL][FOLLOW-UP] DSv2 aggregate push down follow-up
### What changes were proposed in this pull request?
Update the Java doc and the JDBC data source doc, and address follow-up comments.

### Why are the changes needed?
To update the docs and address follow-up comments.

### Does this PR introduce _any_ user-facing change?
Yes, add the new JDBC option `pushDownAggregate` in JDBC data source doc.
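A sketch of how the documented option is set on a JDBC read (connection details are placeholders):

```python
# Sketch only: URL, table and credentials are placeholders; assumes an active SparkSession `spark`.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://host:5432/db")
      .option("dbtable", "some_table")
      .option("user", "user")
      .option("password", "password")
      .option("pushDownAggregate", "true")  # the option documented by this follow-up
      .load())
df.groupBy("dept").count().show()  # eligible aggregates may be pushed down to the source
```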

### How was this patch tested?
manually checked

Closes #33526 from huaxingao/aggPD_followup.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit c8dd97d456)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-28 12:52:58 +08:00
Max Gekk 56f1ee4b06 [SPARK-36318][SQL][DOCS] Update docs about mapping of ANSI interval types to Java/Scala/SQL types
### What changes were proposed in this pull request?
1. Update the tables at https://spark.apache.org/docs/latest/sql-ref-datatypes.html about mapping ANSI interval types to Java/Scala/SQL types.
2. Remove `CalendarIntervalType` from the table of mapping Catalyst types to SQL types.

<img width="1028" alt="Screenshot 2021-07-27 at 20 52 57" src="https://user-images.githubusercontent.com/1580697/127204790-7ccb9c64-daf2-427d-963e-b7367aaa3439.png">
<img width="1017" alt="Screenshot 2021-07-27 at 20 53 22" src="https://user-images.githubusercontent.com/1580697/127204806-a0a51950-3c2d-4198-8a22-0f6614bb1487.png">

### Why are the changes needed?
To inform users which types from language APIs should be used as ANSI interval types.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manually checking by building the docs:
```
$ SKIP_RDOC=1 SKIP_API=1 SKIP_PYTHONDOC=1 bundle exec jekyll build
```

Closes #33543 from MaxGekk/doc-interval-type-lang-api.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
(cherry picked from commit 1614d00417)
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-07-28 13:42:53 +09:00
Jungtaek Lim 16c6009957 [SPARK-36314][SS] Update Sessionization examples to use native support of session window
### What changes were proposed in this pull request?

This PR proposes to update Sessionization examples to use native support of session window. It also adds the example for PySpark as native support of session window is available to PySpark as well.
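A sketch of the native API the updated examples rely on (column names are assumed):

```python
# Sketch only: assumes a streaming DataFrame `events` with columns `userId` and `eventTime`.
from pyspark.sql import functions as F

sessionized = (events
    .withWatermark("eventTime", "10 minutes")
    .groupBy(F.session_window("eventTime", "5 minutes"), "userId")
    .count())
```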

### Why are the changes needed?

We should guide users toward the simplest way to achieve the same workload. I'll provide another example for cases that can't be handled with native support of session window.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually tested.

Closes #33548 from HeartSaVioR/SPARK-36314.

Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 1fafa8e191)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-07-27 20:10:13 -07:00
Takuya UESHIN 0e9e737a84 [SPARK-36310][PYTHON] Fix IndexOpsMixin.hasnans to use isnull().any()
### What changes were proposed in this pull request?

Fix `IndexOpsMixin.hasnans` to use `IndexOpsMixin.isnull().any()`.

### Why are the changes needed?

`IndexOpsMixin.hasnans` has a potential issue to cause `a window function inside an aggregate function` error.
Also it returns a wrong value when the `Series`/`Index` is empty.

```py
>>> ps.Series([]).hasnans
None
```

whereas:

```py
>>> pd.Series([]).hasnans
False
```

`IndexOpsMixin.any()` is safe for both cases.

### Does this PR introduce _any_ user-facing change?

`IndexOpsMixin.hasnans` will return `False` when empty.

### How was this patch tested?

Added some tests.

Closes #33547 from ueshin/issues/SPARK-36310/hasnan.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit bcc595c112)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-28 09:21:22 +09:00
Gengliang Wang ee3bd71c92 [SPARK-34249][DOCS] Add documentation for ANSI implicit cast rules
### What changes were proposed in this pull request?

Add documentation for the ANSI implicit cast rules which are introduced from https://github.com/apache/spark/pull/31349

### Why are the changes needed?

Better documentation.

### Does this PR introduce _any_ user-facing change?

No
### How was this patch tested?

Build and preview in local:
![image](https://user-images.githubusercontent.com/1097932/127149039-f0cc4766-8eca-4061-bc35-c8e67f009544.png)
![image](https://user-images.githubusercontent.com/1097932/127149072-1b65ef56-65ff-4327-9a5e-450d44719073.png)

![image](https://user-images.githubusercontent.com/1097932/127033375-b4536854-ca72-42fa-8ea9-dde158264aa5.png)
![image](https://user-images.githubusercontent.com/1097932/126950445-435ba521-92b8-44d1-8f2c-250b9efb4b98.png)
![image](https://user-images.githubusercontent.com/1097932/126950495-9aa4e960-60cd-4b20-88d9-b697ff57a7f7.png)

Closes #33516 from gengliangwang/addDoc.

Lead-authored-by: Gengliang Wang <gengliang@apache.org>
Co-authored-by: Serge Rielau <serge@rielau.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit df98d5b5f1)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-27 20:49:05 +08:00
Max Gekk 3d86128eae [SPARK-34619][SQL][DOCS][3.2] Describe ANSI interval types at the Data types page of the SQL reference
### What changes were proposed in this pull request?
In the PR, I propose to update the page https://spark.apache.org/docs/latest/sql-ref-datatypes.html and add information about the year-month and day-time interval types introduced by SPARK-27790.

<img width="932" alt="Screenshot 2021-07-27 at 10 38 23" src="https://user-images.githubusercontent.com/1580697/127115289-e633ca3a-2c18-49a0-a7c0-22421ae5c363.png">

### Why are the changes needed?
To inform users about new ANSI interval types, and improve UX with Spark SQL.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Should be tested by a GitHub action.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
(cherry picked from commit f4837961a9)
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>

Closes #33539 from sarutak/backport-SPARK-34619-3.2.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-07-27 15:17:03 +03:00
Liang-Chi Hsieh dcd37f9639 Revert "[SPARK-36136][SQL][TESTS] Refactor PruneFileSourcePartitionsSuite etc to a different package"
This reverts commit 634f96dde4.

Closes #33533 from viirya/revert-SPARK-36136.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 22ac98dcbf)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-27 19:11:51 +09:00
Linhong Liu 91b9de3d80 [SPARK-36241][SQL] Support creating tables with null column
### What changes were proposed in this pull request?
Previously we blocked creating tables with null-type columns to follow the Hive behavior in PR #28833.
In this PR, I propose to restore the previous behavior to support null-type columns in a table.

### Why are the changes needed?
For a complex query, it's possible to generate a column with null type. If this happens to the input query of
CTAS, the query will fail because Spark doesn't allow creating a table with a null-type column. From the user's
perspective, it's hard to figure out why the null-type column is produced in the complicated query and how to fix
it, so removing this constraint is more friendly to users.

### Does this PR introduce _any_ user-facing change?
Yes, this reverts the previous behavior change in #28833; for example, the command below will succeed after this PR
```sql
CREATE TABLE t (col_1 void, col_2 int)
```

### How was this patch tested?
newly added and existing test cases

Closes #33488 from linhongliu-db/SPARK-36241-support-void-column.

Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 8e7e14dc0d)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-27 17:32:16 +08:00
William Hyun dfa5c4dadc [SPARK-36285][INFRA][TESTS] Skip MiMa in PySpark/SparkR/Docker GHA job
This PR aims to skip MiMa in PySpark/SparkR/Docker GHA job.

This will save GHA resources because MiMa is irrelevant to Python.

No.
Pass the GHA.

Closes #33532 from williamhyun/mima.

Lead-authored-by: William Hyun <william@apache.org>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 674202e7b6)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-27 16:50:10 +09:00
Luran He 8a3b1cd811 [SPARK-36211][PYTHON] Correct typing of udf return value
The following code should type-check:

```python3
import uuid

import pyspark.sql.functions as F

my_udf = F.udf(lambda: str(uuid.uuid4())).asNondeterministic()
```

### What changes were proposed in this pull request?

The `udf` function should return a more specific type.

### Why are the changes needed?

Right now, `mypy` will throw spurious errors, such as for the code given above.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

This was not tested. Sorry, I am not very familiar with this repo -- are there any typing tests?

Closes #33399 from luranhe/patch-1.

Lead-authored-by: Luran He <luranjhe@gmail.com>
Co-authored-by: Luran He <luran.he@compass.com>
Signed-off-by: zero323 <mszymkiewicz@gmail.com>
(cherry picked from commit ede1bc6b51)
Signed-off-by: zero323 <mszymkiewicz@gmail.com>
2021-07-27 09:09:11 +02:00
Wenchen Fan 14328e043d [SPARK-36247][SQL] Check string length for char/varchar and apply type coercion in UPDATE/MERGE command
### What changes were proposed in this pull request?

We added the char/varchar support in 3.1, but the string length check is only applied to INSERT, not UPDATE/MERGE. This PR fixes it. This PR also adds the missing type coercion for UPDATE/MERGE.
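Since no built-in source supports UPDATE/MERGE (see the testing note below), this is only a shape sketch of the statements now covered by the length check and type coercion (table/column names are placeholders):

```python
# Shape sketch only: requires a v2 data source/catalog that supports UPDATE and MERGE.
# Values written into CHAR/VARCHAR columns via these commands now go through the same
# length check and type coercion as INSERT.
spark.sql("UPDATE target SET name = 'abcdef' WHERE id = 1")  # fails if name is VARCHAR(5)
spark.sql("""
    MERGE INTO target USING source ON target.id = source.id
    WHEN MATCHED THEN UPDATE SET target.name = source.name
""")
```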

### Why are the changes needed?

complete the char/varchar support and make UPDATE/MERGE easier to use by doing type coercion.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

New UT. No built-in source supports UPDATE/MERGE, so an end-to-end test is not applicable here.

Closes #33468 from cloud-fan/char.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 068f8d434a)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-27 13:57:26 +08:00
Leona 6027137928 [SPARK-36288][DOCS][PYTHON] Update API usage on pyspark pandas documents
### What changes were proposed in this pull request?

Update api usage examples on PySpark pandas API documents.

### Why are the changes needed?

If users follow the current documents to use the PySpark pandas API, they will see some API deprecation warnings.
Updating those documents helps users avoid that confusion.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

```
make html
```

Closes #33519 from yoda-mon/update-pyspark-configurations.

Authored-by: Leona <yodal@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 9a47483f74)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-27 12:31:00 +09:00
Takuya UESHIN f278f771e6 [SPARK-36267][PYTHON] Clean up CategoricalAccessor and CategoricalIndex
### What changes were proposed in this pull request?

Clean up `CategoricalAccessor` and `CategoricalIndex`.

- Clean up the classes
- Add deprecation warnings
- Clean up the docs

### Why are the changes needed?

To finalize the series of PRs for `CategoricalAccessor` and `CategoricalIndex`, we should clean up the classes.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33528 from ueshin/issues/SPARK-36267/cleanup.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit c40d9d46f1)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-27 12:17:26 +09:00
Yikun Jiang 139536c3ed [SPARK-36142][PYTHON] Follow Pandas when pow between fractional series with Na and bool literal
### What changes were proposed in this pull request?

Set the result to 1 when the exponent is 0 (or False).

### Why are the changes needed?
Currently, exponentiation between fractional series and bools is not consistent with pandas' behavior.
```
 >>> pser = pd.Series([1, 2, np.nan], dtype=float)
 >>> psser = ps.from_pandas(pser)
 >>> pser ** False
 0 1.0
 1 1.0
 2 1.0
 dtype: float64
 >>> psser ** False
 0 1.0
 1 1.0
 2 NaN
 dtype: float64
```
We ought to adjust that.

See more in [SPARK-36142](https://issues.apache.org/jira/browse/SPARK-36142)

### Does this PR introduce _any_ user-facing change?
Yes, it introduces a user-facing change: pow between a fractional Series with missing values and a bool literal now produces a different result, following pandas behavior.

### How was this patch tested?
- Added a test_pow_with_float_nan UT
- Existing tests in test_pow

Closes #33521 from Yikun/SPARK-36142.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit d52c2de08b)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-27 12:07:10 +09:00
Xinrong Meng 1641812e97 [SPARK-36260][PYTHON] Add set_categories to CategoricalAccessor and CategoricalIndex
### What changes were proposed in this pull request?
Add set_categories to CategoricalAccessor and CategoricalIndex.

### Why are the changes needed?
set_categories is supported in pandas CategoricalAccessor and CategoricalIndex. We ought to follow pandas.

### Does this PR introduce _any_ user-facing change?
Yes, users will be able to use `set_categories`.
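A brief sketch of the newly supported API (the data is illustrative):

```python
# Illustrative sketch: values not present in the new categories become missing, as in pandas.
import pyspark.pandas as ps

psser = ps.Series(["a", "b", "c"], dtype="category")
print(psser.cat.set_categories(["b", "c", "d"]))   # accessor form

psidx = ps.CategoricalIndex(["a", "b", "c"])
print(psidx.set_categories(["b", "c", "d"]))       # index form
```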

### How was this patch tested?
Unit tests.

Closes #33506 from xinrong-databricks/set_categories.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit 55971b70fe)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-26 17:12:45 -07:00
Min Shen bbec381f5e [SPARK-36266][SHUFFLE] Rename classes in shuffle RPC used for block push operations
### What changes were proposed in this pull request?
This is a follow-up to #29855 according to the [comments](https://github.com/apache/spark/pull/29855/files#r505536514)
In this PR, the following changes are made:

1. A new `BlockPushingListener` interface is created specifically for block push. The existing `BlockFetchingListener` interface is left as is, since it might be used by external shuffle solutions. These 2 interfaces are unified under `BlockTransferListener` to enable code reuse.
2. `RetryingBlockFetcher`, `BlockFetchStarter`, and `RetryingBlockFetchListener` are renamed to `RetryingBlockTransferor`, `BlockTransferStarter`, and `RetryingBlockTransferListener` respectively. This makes their names more generic to be reused across both block fetch and push.
3. Comments in `OneForOneBlockPusher` are further clarified to better explain how we handle retries for block push.

### Why are the changes needed?
To make code cleaner without sacrificing backward compatibility.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing unit tests.

Closes #33340 from Victsm/SPARK-32915-followup.

Lead-authored-by: Min Shen <mshen@linkedin.com>
Co-authored-by: Min Shen <victor.nju@gmail.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit c4aa54ed4e)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
2021-07-26 17:40:19 -05:00
Chao Sun ae7b32a9e8 [SPARK-36136][SQL][TESTS] Refactor PruneFileSourcePartitionsSuite etc to a different package
### What changes were proposed in this pull request?

Move both `PruneFileSourcePartitionsSuite` and `PrunePartitionSuiteBase` to the package `org.apache.spark.sql.execution.datasources`. Did a few refactoring to enable this.

### Why are the changes needed?

Currently both `PruneFileSourcePartitionsSuite` and `PrunePartitionSuiteBase` are in package `org.apache.spark.sql.hive.execution` which doesn't look correct as these tests are not specific to Hive. Therefore, it's better to move them into `org.apache.spark.sql.execution.datasources`, the same place where the rule `PruneFileSourcePartitions` is at.

### Does this PR introduce _any_ user-facing change?

No, it's just test refactoring.

### How was this patch tested?

Using existing tests:
```
build/sbt "sql/testOnly *PruneFileSourcePartitionsSuite"
```
and
```
build/sbt "hive/testOnly *PruneHiveTablePartitionsSuite"
```

Closes #33350 from sunchao/SPARK-36136-partitions-suite.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 634f96dde4)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-07-26 13:04:06 -07:00
Hyukjin Kwon a77c9d6d17 [SPARK-36217][SQL] Rename CustomShuffleReader and OptimizeLocalShuffleReader in AQE
### What changes were proposed in this pull request?

This PR proposes to rename:

- Rename `*Reader`/`*reader` to `*Read`/`*read` for rules and execution plan (user-facing doc/config name remain untouched)
  - `*ShuffleReaderExec` ->`*ShuffleReadExec`
  - `isLocalReader` -> `isLocalRead`
  - ...
- Rename `CustomShuffle*` prefix to `AQEShuffle*`
- Rename `OptimizeLocalShuffleReader` rule to `OptimizeShuffleWithLocalRead`

### Why are the changes needed?

There are multiple problems in the current naming:

- `CustomShuffle*` -> `AQEShuffle*`
    it sounds like it is a pluggable API. However, this is actually only used by AQE.
- `OptimizeLocalShuffleReader` -> `OptimizeShuffleWithLocalRead`
    it is the name of a rule but it can be misread as a reader, which is counterintuitive
- `*ReaderExec` -> `*ReadExec`
    "Reader execution" reads a bit odd; "read execution" is better (like `ScanExec`, `ProjectExec` and `FilterExec`). I can't find a reason to name it after something that performs an action. See also the generated plans:

    Before:

    ```
    ...
    * HashAggregate (12)
       +- CustomShuffleReader (11)
          +- ShuffleQueryStage (10)
             +- Exchange (9)
    ...
    ```

    After:

    ```
    ...
    * HashAggregate (12)
       +- AQEShuffleRead (11)
          +- ShuffleQueryStage (10)
             +- Exchange (9)
    ..
    ```

### Does this PR introduce _any_ user-facing change?

No, internal refactoring.

### How was this patch tested?

Existing unittests should cover the changes.

Closes #33429 from HyukjinKwon/SPARK-36217.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 6e3d404cec)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-26 22:42:16 +08:00
Venkata krishnan Sowrirajan 39d6e87bd9 [SPARK-32920][FOLLOW-UP] Fix shuffleMergeFinalized directly calling rdd.getNumPartitions as RDD is not serialized to executor
### What changes were proposed in this pull request?

`ShuffleMapTask` should not push blocks if a shuffle is already merge-finalized. Currently block push is disabled for retry cases. Also fix `shuffleMergeFinalized` directly calling `rdd.getNumPartitions`, which causes issues because the RDD is not serialized to the executor.

### Why are the changes needed?

No

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests

Closes #33426 from venkata91/SPARK-32920-follow-up.

Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit ba1a7ce5ec)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
2021-07-26 09:17:55 -05:00
Angerszhuuuu 07c7a6f739 [SPARK-34402][SQL] Group exception about data format schema
### What changes were proposed in this pull request?
Group exceptions about data format schemas across different formats (ORC/Parquet)

### Why are the changes needed?
To group related exceptions

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not needed

Closes #33296 from AngersZhuuuu/SPARK-34402.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit a63802f2c6)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-26 19:18:56 +08:00
Cheng Su f42cc10512 [SPARK-36269][SQL] Fix only set data columns to Hive column names config
### What changes were proposed in this pull request?

When reading a Hive table, we set the Hive column id and column name configs (`hive.io.file.readcolumn.ids` and `hive.io.file.readcolumn.names`). We should set only non-partition columns (data columns) for both configs, as Spark always [appends partition columns in its own Hive reader](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala#L240). The column id config has only non-partition columns, but the column name config has both partition and non-partition columns. We should keep them consistent by including only non-partition columns in both. This does not cause an issue for the public OSS Hive file formats for now, but for a customized internal Hive file format it does, as these two configs are expected to match.

### Why are the changes needed?

Fix the code logic to be more consistent.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing Hive tests.

Closes #33489 from c21/hive-col.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit e5616e32ee)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-26 18:48:26 +08:00
michaelzhang-db ec91818e14 [SPARK-36105][SQL] OptimizeLocalShuffleReader support reading data of multiple mappers in one task
### What changes were proposed in this pull request?
Added another partition spec to allow OptimizeLocalShuffleReader rule to read data from multiple mappers if the parallelism is less than the number of mappers.

### Why are the changes needed?
Optimization to the OptimizeLocalShuffleReader rule

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit tests

Closes #33310 from michaelzhang-db/supportDataFromMultipleMappers.

Authored-by: michaelzhang-db <michael.zhang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 094ae3708f)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-26 17:57:15 +08:00
Huaxin Gao b1f522cf97 [SPARK-34952][SQL] DSv2 Aggregate push down APIs
### What changes were proposed in this pull request?
Add interfaces and APIs to push down Aggregates to V2 Data Source

### Why are the changes needed?
improve performance

### Does this PR introduce _any_ user-facing change?
SQLConf.PARQUET_AGGREGATE_PUSHDOWN_ENABLED was added. If this is set to true, Aggregates are pushed down to Data Source.

### How was this patch tested?
New tests were added to test aggregates push down in https://github.com/apache/spark/pull/32049.  The original PR is split into two PRs. This PR doesn't contain new tests.

Closes #33352 from huaxingao/aggPushDownInterface.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit c561ee6865)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-26 16:01:43 +08:00
yi.wu 1e17a5bc19 [SPARK-32920][FOLLOW-UP][CORE] Shutdown shuffleMergeFinalizeScheduler when DAGScheduler stop
### What changes were proposed in this pull request?

Call `shuffleMergeFinalizeScheduler.shutdownNow()` in `DAGScheduler.stop()`.

### Why are the changes needed?

Avoid the thread leak.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass existing tests.

Closes #33495 from Ngone51/SPARK-32920-followup.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 21450b3254)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-24 17:41:20 -07:00
Erik Krogen 23e48dbf77 [SPARK-35259][SHUFFLE] Update ExternalBlockHandler Timer variables to expose correct units
### What changes were proposed in this pull request?
`ExternalBlockHandler` exposes 4 metrics which are Dropwizard `Timer` metrics, and are named with a `millis` suffix:
```
    private final Timer openBlockRequestLatencyMillis = new Timer();
    private final Timer registerExecutorRequestLatencyMillis = new Timer();
    private final Timer fetchMergedBlocksMetaLatencyMillis = new Timer();
    private final Timer finalizeShuffleMergeLatencyMillis = new Timer();
```
However these Dropwizard Timers by default use nanoseconds ([documentation](https://metrics.dropwizard.io/3.2.3/getting-started.html#timers)).

This causes `YarnShuffleServiceMetrics` to expose confusingly-named metrics like `openBlockRequestLatencyMillis_nanos_max` (the actual values are currently in nanos).

This PR adds a new `Timer` subclass, `TimerWithCustomTimeUnit`, which accepts a `TimeUnit` at creation time and exposes timing information using this time unit when values are read. Internally, values are still stored with nanosecond-level precision. The `Timer` metrics within `ExternalBlockHandler` are updated to use the new class with milliseconds as the unit. The logic to include the `nanos` suffix in the metric name within `YarnShuffleServiceMetrics` has also been removed, with the assumption that the metric name itself includes the units.

### Does this PR introduce _any_ user-facing change?
Yes, there are two changes.
First, the names for metrics exposed by `ExternalBlockHandler` via `YarnShuffleServiceMetrics` such as `openBlockRequestLatencyMillis_nanos_max` and `openBlockRequestLatencyMillis_nanos_50thPercentile` have been changed to remove the `_nanos` suffix. This would be considered a breaking change, but these names were only exposed as part of #32388, which has not yet been released (slated for 3.2.0). New names are like `openBlockRequestLatencyMillis_max` and `openBlockRequestLatencyMillis_50thPercentile`
Second, the values of the metrics themselves have changed, to expose milliseconds instead of nanoseconds. Note that this does not affect metrics such as `openBlockRequestLatencyMillis_count` or `openBlockRequestLatencyMillis_rate1`, only the `Snapshot`-related metrics (`max`, `median`, percentiles, etc.). For the YARN case, these metrics were also introduced by #32388, and thus also have not yet been released. It was possible for the nanosecond values to be consumed by some other metrics reporter reading the Dropwizard metrics directly, but I'm not aware of any such usages.

### How was this patch tested?
Unit tests have been updated.

Closes #33116 from xkrogen/xkrogen-SPARK-35259-ess-fix-metric-unit-prefix.

Authored-by: Erik Krogen <xkrogen@apache.org>
Signed-off-by: yi.wu <yi.wu@databricks.com>
(cherry picked from commit 70a15868fc)
Signed-off-by: yi.wu <yi.wu@databricks.com>
2021-07-24 21:27:56 +08:00
Sean Owen c7d246ba4e [SPARK-35310][MLLIB] Update to breeze 1.2
### What changes were proposed in this pull request?
Update to the latest breeze 1.2

### Why are the changes needed?
Minor bug fixes

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests

Closes #33449 from srowen/SPARK-35310.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-07-24 08:20:25 -05:00
Chandni Singh 96944ac17d [SPARK-36255][SHUFFLE][CORE] Stop pushing and retrying on FileNotFound exceptions
### What changes were proposed in this pull request?
Once a shuffle is cleaned up by the `ContextCleaner`, its shuffle files are deleted by the executors, so pushing that shuffle data can throw `FileNotFoundException`s. Both the `shuffle-block-push` threads and the netty event-loops can hit these exceptions, and when one is thrown from a `shuffle-block-push-thread` it causes the executor to exit. The fix here stops these threads from pushing more blocks once they encounter a `FileNotFoundException`; when the exception comes from the `shuffle-block-push-thread`, it is handled and logged as a warning instead of failing the executor.
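
A rough sketch of that handling, with purely illustrative names (not the actual Spark block-push code): once a `FileNotFoundException` is seen, later pushes are skipped and the error is logged rather than propagated to the uncaught-exception handler.

```scala
import java.io.FileNotFoundException
import java.util.concurrent.atomic.AtomicBoolean

// Illustrative push loop; not the actual Spark block-push code.
class BlockPusherSketch {
  private val shuffleCleanedUp = new AtomicBoolean(false)

  def pushBlock(readBlock: () => Array[Byte], send: Array[Byte] => Unit): Unit = {
    if (shuffleCleanedUp.get()) {
      return // shuffle files were already deleted; quietly skip remaining blocks
    }
    try {
      send(readBlock())
    } catch {
      case e: FileNotFoundException =>
        // The ContextCleaner removed the shuffle files: stop pushing and log a
        // warning instead of letting the error kill the executor.
        shuffleCleanedUp.set(true)
        System.err.println(s"Skipping remaining block pushes: ${e.getMessage}")
    }
  }
}
```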

### Why are the changes needed?
This fixes a bug that causes executors to exit when they are instructed to clean up shuffle data.
Below is the stacktrace of this exception:
```
21/06/17 16:03:57 ERROR util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[block-push-thread-1,5,main]
java.lang.Error: java.io.IOException: Error in opening FileSegmentManagedBuffer{file=********/application_1619720975011_11057757/blockmgr-560cb4cf-9918-4ea7-a007-a16c5e3a35fe/0a/shuffle_1_690_0.data, offset=10640, length=190}
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1155)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Error in opening FileSegmentManagedBuffer{file=*******/application_1619720975011_11057757/blockmgr-560cb4cf-9918-4ea7-a007-a16c5e3a35fe/0a/shuffle_1_690_0.data, offset=10640, length=190}
at org.apache.spark.network.buffer.FileSegmentManagedBuffer.nioByteBuffer(FileSegmentManagedBuffer.java:89)
at org.apache.spark.shuffle.ShuffleWriter.sliceReqBufferIntoBlockBuffers(ShuffleWriter.scala:294)
at org.apache.spark.shuffle.ShuffleWriter.org$apache$spark$shuffle$ShuffleWriter$$sendRequest(ShuffleWriter.scala:270)
at org.apache.spark.shuffle.ShuffleWriter.org$apache$spark$shuffle$ShuffleWriter$$pushUpToMax(ShuffleWriter.scala:191)
at org.apache.spark.shuffle.ShuffleWriter$$anon$2$$anon$4.run(ShuffleWriter.scala:244)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
... 2 more
Caused by: java.io.FileNotFoundException: ******/application_1619720975011_11057757/blockmgr-560cb4cf-9918-4ea7-a007-a16c5e3a35fe/0a/shuffle_1_690_0.data (No such file or directory)
at java.io.RandomAccessFile.open0(Native Method)
at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
at org.apache.spark.network.buffer.FileSegmentManagedBuffer.nioByteBuffer(FileSegmentManagedBuffer.java:62)
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added a unit test to verify that no more data is pushed when `FileNotFoundException` is encountered. Also verified the fix in our environment.

Closes #33477 from otterc/SPARK-36255.

Authored-by: Chandni Singh <singh.chandni@gmail.com>
Signed-off-by: yi.wu <yi.wu@databricks.com>
(cherry picked from commit 09e1c61272)
Signed-off-by: yi.wu <yi.wu@databricks.com>
2021-07-24 21:10:01 +08:00
Dominik Gehl 29c0b1e7b1 [SPARK-36225][PYTHON][DOCS] Use DataFrame in python docstrings
### What changes were proposed in this pull request?
Change references to `Dataset` in Python docstrings to `DataFrame`.

### Why are the changes needed?
There is no `Dataset` class in PySpark.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Doc change only

Closes #33438 from dominikgehl/feature/SPARK-36225.

Lead-authored-by: Dominik Gehl <dog@open.ch>
Co-authored-by: Dominik Gehl <gehl@fastmail.fm>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit ae1c20ee0d)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-24 16:58:19 +09:00
Takuya UESHIN c1434b1928 [SPARK-36279][INFRA][PYTHON] Fix lint-python to work with Python 3.9
### What changes were proposed in this pull request?

Fix `lint-python` to pick up `PYTHON_EXECUTABLE` from the environment first, so the Python executable can be switched, and explicitly set `PYTHON_EXECUTABLE` to `python3.9` in CI.

### Why are the changes needed?

Currently `lint-python` uses `python3`, which is not the Python we expect in CI.
As a result, the `black` check is skipped.

```
The python3 -m black command was not found. Skipping black checks for now.
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The `black` check in `lint-python` should work.

Closes #33507 from ueshin/issues/SPARK-36279/lint-python.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 663cbdfbe5)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-24 16:49:51 +09:00
Xinrong Meng 44cfce8548 [SPARK-36274][PYTHON] Fix equality comparison of unordered Categoricals
### What changes were proposed in this pull request?
Fix equality comparison of unordered Categoricals.

### Why are the changes needed?
The codes of a Categorical Series are used for Series equality comparison. However, that doesn't work for unordered Categoricals, where the same value can have different codes when the same categories are listed in a different order.

So we should map the codes back to their values and compare the values instead.

### Does this PR introduce _any_ user-facing change?
Yes.
From:
```py
>>> psser1 = ps.Series(pd.Categorical(list("abca")))
>>> psser2 = ps.Series(pd.Categorical(list("bcaa"), categories=list("bca")))
>>> with ps.option_context("compute.ops_on_diff_frames", True):
...     (psser1 == psser2).sort_index()
...
0     True
1     True
2     True
3    False
dtype: bool
```

To:
```py
>>> psser1 = ps.Series(pd.Categorical(list("abca")))
>>> psser2 = ps.Series(pd.Categorical(list("bcaa"), categories=list("bca")))
>>> with ps.option_context("compute.ops_on_diff_frames", True):
...     (psser1 == psser2).sort_index()
...
0    False
1    False
2    False
3     True
dtype: bool
```

### How was this patch tested?
Unit tests.

Closes #33497 from xinrong-databricks/cat_bug.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit 85adc2ff60)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-23 18:31:18 -07:00