Commit graph

29075 commits

Author SHA1 Message Date
Dongjoon Hyun 9e93fdb146 [SPARK-34103][INFRA] Fix MiMaExcludes by moving SPARK-23429 from 2.4 to 3.0
### What changes were proposed in this pull request?

This PR aims to fix `MiMaExcludes` rule by moving SPARK-23429 from 2.4 to 3.0.

### Why are the changes needed?

SPARK-23429 was added at Apache Spark 3.0.0.
This should land on `master` and `branch-3.1` and `branch-3.0`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the MiMa rule.

Closes #31174 from dongjoon-hyun/SPARK-34103.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-14 09:29:31 +09:00
Kousuke Saruta b7da108cae [SPARK-33690][SQL][FOLLOWUP] Escape further meta-characters in showString
### What changes were proposed in this pull request?

This is a followup PR for SPARK-33690 (#30647) .
In addition to the original PR, this PR intends to escape the following meta-characters in `Dataset#showString`.

  * `\r` (carriage return)
  * `\f` (form feed)
  * `\b` (backspace)
  * `\u000B` (vertical tab)
  * `\u0007` (bell)

### Why are the changes needed?

To avoid breaking the layout of `Dataset#showString`.
`\u0007` does not break the layout of `Dataset#showString` but it's noisy (beeps for each row) so it should be also escaped.
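
A minimal sketch of this kind of escaping, not the code from the PR. The helper name `escapeMetaCharacters` and the printed forms chosen for vertical tab (`\v`) and bell (`\a`) are assumptions made for illustration.

```scala
object EscapeSketch {
  // Hypothetical helper: replaces control characters with printable escapes so
  // that they cannot shift or break a fixed-width table layout.
  def escapeMetaCharacters(s: String): String = {
    val sb = new StringBuilder
    s.foreach {
      case '\n'     => sb.append("\\n")
      case '\t'     => sb.append("\\t")
      case '\r'     => sb.append("\\r")
      case '\f'     => sb.append("\\f")
      case '\b'     => sb.append("\\b")
      case '\u000B' => sb.append("\\v") // assumed printed form for vertical tab
      case '\u0007' => sb.append("\\a") // assumed printed form for bell
      case c        => sb.append(c)
    }
    sb.toString
  }
}
```

Each control character becomes a two-character visible sequence, so a cell's rendered width no longer depends on how a terminal interprets it.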

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Modified the existing tests.
I also built the documents and checked the generated HTML for `sql-migration-guide.md`.

Closes #31144 from sarutak/escape-metacharacters-in-getRows.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-01-13 18:13:01 -06:00
Kousuke Saruta 62d8466c74 [SPARK-34051][SQL] Support 32-bit unicode escape in string literals
### What changes were proposed in this pull request?
This PR adds support for 32-bit Unicode escapes in string literals, as PostgreSQL and some modern programming languages (e.g., Python 3, C++11, and Rust) provide.
In addition to the existing 16-bit Unicode escape such as `"\u0041"`, users can express Unicode characters such as `"\U00020BB7"` with this change.
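
As an illustration only, the following sketch decodes a 32-bit escape into a string in plain Scala. The object and method names are hypothetical, and this is not the parser change made in the PR; it only shows how a code point above `U+FFFF` ends up as a surrogate pair internally.

```scala
import java.util.regex.Matcher

object UnicodeEscapeSketch {
  // Matches a backslash, an upper-case U, and exactly eight hex digits.
  private val Escape32 = """\\U([0-9a-fA-F]{8})""".r

  // Hypothetical decoder: replaces each 32-bit escape with the character(s)
  // for that code point. Character.toChars yields a surrogate pair when the
  // code point lies outside the Basic Multilingual Plane.
  def decode(s: String): String =
    Escape32.replaceAllIn(s, m => {
      val codePoint = java.lang.Long.parseLong(m.group(1), 16)
      require(codePoint <= Character.MAX_CODE_POINT, s"invalid code point: ${m.group(1)}")
      // quoteReplacement keeps a decoded '$' or '\' from being read as regex syntax.
      Matcher.quoteReplacement(new String(Character.toChars(codePoint.toInt)))
    })
}
```

With this sketch, the eight-digit form of `U+20BB7` decodes to a two-`Char` string holding a single code point, which is what a user would otherwise have to spell out as two 16-bit escapes.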

### Why are the changes needed?
Users can express Unicode characters directly, without surrogate pairs.

### Does this PR introduce _any_ user-facing change?
Yes. Users can express all Unicode characters directly.

### How was this patch tested?
Added new assertions to the existing test case.

Closes #31096 from sarutak/32-bit-unicode-escape.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-01-13 18:10:03 -06:00
yangjie01 8b1ba233f1 [SPARK-34068][CORE][SQL][MLLIB][GRAPHX] Remove redundant collection conversion
### What changes were proposed in this pull request?
Some collection conversions are redundant and can be removed. For version compatibility, this PR cleans them up with the Scala 2.13 profile.

### Why are the changes needed?
Remove redundant collection conversion

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Pass the Jenkins or GitHub Action
- Manually tested `core`, `graphx`, `mllib`, `mllib-local`, `sql`, `yarn`, and `kafka-0-10` with Scala 2.13; all passed

Closes #31125 from LuciferYang/SPARK-34068.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-01-13 18:07:02 -06:00
yangjie01 8c5fecda73 [SPARK-34070][CORE][SQL] Replaces find and emptiness check with exists
### What changes were proposed in this pull request?
This PR uses `exists` to simplify `find` followed by an emptiness check. It is semantically equivalent but simpler.

**Before**

```
seq.find(p).isDefined

or

seq.find(p).isEmpty
```

**After**

```
seq.exists(p)

or

!seq.exists(p)
```
### Why are the changes needed?
Code simplification.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #31130 from LuciferYang/SPARK-34070.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-01-13 10:42:24 -06:00
schintap bd5039fc35 [SPARK-33741][CORE] Add min threshold time speculation config
### What changes were proposed in this pull request?
Add min threshold time speculation config

### Why are the changes needed?
When we turn on speculation with default configs, the last 10% of the tasks are subject to speculation. There are a lot of stages that run for only a few seconds to minutes, and in general we don't want to speculate tasks that finish within a minimum threshold. Setting a minimum threshold in the speculation config gives us better control over speculative tasks.
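
One way to picture how such a floor interacts with the usual median-based rule is sketched below. All names are hypothetical and the formula is an assumption about the general shape of the check, not a quotation of the scheduler code.

```scala
object SpeculationSketch {
  // Hypothetical helper: the effective threshold is the larger of the
  // multiplier-scaled median task duration and the configured minimum runtime.
  def speculationThresholdMs(medianMs: Double, multiplier: Double, minRuntimeMs: Long): Double =
    math.max(multiplier * medianMs, minRuntimeMs.toDouble)

  // A running task is a speculation candidate only once it exceeds that threshold.
  def isCandidate(runtimeMs: Long, medianMs: Double, multiplier: Double, minRuntimeMs: Long): Boolean =
    runtimeMs > speculationThresholdMs(medianMs, multiplier, minRuntimeMs)
}
```

With a median of 2 seconds and a multiplier of 1.5, a task running for 5 seconds would be a candidate when the minimum is 0, but not when the minimum is 60 seconds.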

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit test

Closes #30710 from redsanket/SPARK-33741.

Lead-authored-by: schintap <schintap@verizonmedia.com>
Co-authored-by: Sanket Chintapalli <chintapalli.sanketreddy@gmail.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2021-01-13 08:57:56 -06:00
Chao Sun 62d82b5b27 [SPARK-34076][SQL] SQLContext.dropTempTable fails if cache is non-empty
### What changes were proposed in this pull request?

This changes `CatalogImpl.dropTempView` and `CatalogImpl.dropGlobalTempView` to use the analyzed logical plan instead of `viewDef`, which is unresolved.

### Why are the changes needed?

Currently, `CatalogImpl.dropTempView` is implemented as follows:

```scala
override def dropTempView(viewName: String): Boolean = {
  sparkSession.sessionState.catalog.getTempView(viewName).exists { viewDef =>
    sparkSession.sharedState.cacheManager.uncacheQuery(
      sparkSession, viewDef, cascade = false)
    sessionCatalog.dropTempView(viewName)
  }
}
```

Here, the logical plan `viewDef` is not resolved, and when it is passed to `uncacheQuery`, it can fail at the `sameResult` call, where canonicalized plans are compared. The error message looks like:
```
Invalid call to qualifier on unresolved object, tree: 'key
```

This can be reproduced via:
```scala
sql(s"CREATE TEMPORARY VIEW $v AS SELECT key FROM src LIMIT 10")
sql(s"CREATE TABLE $t AS SELECT * FROM src")
sql(s"CACHE TABLE $t")
dropTempTable(v)
```

### Does this PR introduce _any_ user-facing change?

The only user-facing change is that `SQLContext.dropTempTable` could previously fail in the above scenario but works with this fix.

### How was this patch tested?

Added new unit tests.

Closes #31136 from sunchao/SPARK-34076.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-13 13:22:21 +00:00
LantaoJin f1b21ba505 [SPARK-34064][SQL] Cancel the running broadcast sub-jobs when SQL statement is cancelled
### What changes were proposed in this pull request?
#24595 introduced `private val runId: UUID = UUID.randomUUID` in `BroadcastExchangeExec` to cancel the broadcast execution in the Future when timeout happens. Since the runId is a random UUID instead of inheriting the job group id, when a SQL statement is cancelled, these broadcast sub-jobs are still executing. This PR uses the job group id of the outside thread as its `runId` to abort these broadcast sub-jobs when the SQL statement is cancelled.

### Why are the changes needed?
When broadcasting a table takes too long and the SQL statement is cancelled, the background Spark job keeps running and wastes resources.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Tested manually.
Broadcasting a table is too fast to cancel in a UT, but it is very easy to verify manually:
1. Start a Spark thrift-server with limited resources in YARN.
2. When the driver is running but no executors are launched, submit from beeline a SQL statement that broadcasts tables.
3. Cancel the SQL statement in beeline.

Without the patch, broadcast sub-jobs won't be cancelled.
![Screen Shot 2021-01-11 at 12 03 13 PM](https://user-images.githubusercontent.com/1853780/104150975-ab024b00-5416-11eb-8bf9-b5167bdad80a.png)

With this patch, broadcast sub-jobs will be cancelled.
![Screen Shot 2021-01-11 at 11 43 40 AM](https://user-images.githubusercontent.com/1853780/104150994-be151b00-5416-11eb-80ff-313d423c8a2e.png)

Closes #31119 from LantaoJin/SPARK-34064.

Authored-by: LantaoJin <jinlantao@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-13 12:58:27 +00:00
ulysses-you f64297d290 [SPARK-32850][TEST][FOLLOWUP] Fix flaky test due to timeout
### What changes were proposed in this pull request?

Increase test timeout.

### Why are the changes needed?

It's more reasonable to use 60s instead of 6s, since many places in the code use 60s in `TestUtils.waitUntilExecutorsUp`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the existing tests.

Closes #31166 from ulysses-you/SPARK-32850-FOLLOWUP.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-13 19:21:44 +09:00
Kent Yao 04f031acb3 [SPARK-34086][SQL] RaiseError generates too much code and may fails codegen in length check for char varchar
### What changes were proposed in this pull request?

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133928/testReport/org.apache.spark.sql.execution/LogicalPlanTagInSparkPlanSuite/q41/

We can reduce more than 8000 bytes by removing the unnecessary CONCAT expression.

W/ this fix, for q41 in TPCDS with [Using TPCDS original definitions for char/varchar columns](https://github.com/apache/spark/pull/31012) applied, we can reduce the stage code-gen size from 22523 to 14369
```
14369  - 22523 = - 8154
```

### Why are the changes needed?

Fix the perf regression (other improvements are still needed for q41 to work). There will be a huge performance regression if codegen fails.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

Modified UTs.

Closes #31150 from yaooqinn/SPARK-34086.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-13 09:52:36 +00:00
Max Gekk 861f8bb5fb [SPARK-34071][SQL][TESTS] Check stats of cached v1 tables after altering
### What changes were proposed in this pull request?
Port the test added by https://github.com/apache/spark/pull/31112 to:
1. v1 In-Memory catalog for `ALTER TABLE .. DROP PARTITION`
2. v1 In-Memory and Hive external catalogs for `ALTER TABLE .. ADD PARTITION`
3. v1 In-Memory and Hive external catalogs for `ALTER TABLE .. RENAME PARTITION`

### Why are the changes needed?
To improve test coverage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the modified test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableAddPartitionSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableDropPartitionSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableRenamePartitionSuite"
```

Closes #31131 from MaxGekk/cache-stats-tests.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-13 04:58:01 +00:00
Gabor Somogyi b0759dc600 [SPARK-34090][SS] Cache HadoopDelegationTokenManager.isServiceEnabled result used in KafkaTokenUtil.needTokenUpdate
### What changes were proposed in this pull request?
`HadoopDelegationTokenManager.isServiceEnabled` is a time-consuming operation that is called often in `KafkaTokenUtil.needTokenUpdate`, which slowed down Kafka processing heavily. SPARK-33635 changed the if condition to overcome this issue when no delegation token is used, but the problem still exists when delegation tokens are used. This PR caches the `HadoopDelegationTokenManager.isServiceEnabled` result in the `KafkaDataConsumer` instances, which solves the issue. Another solution would be to cache the result inside `HadoopDelegationTokenManager`, but since it is an object function and several applications may run inside one JVM, different `SparkConf` instances will arrive. Caching the result per `SparkConf` instance would be overkill.
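
The per-instance caching idea can be sketched in plain Scala as follows. The class and method names are stand-ins, and the call counter exists only to make the effect observable; none of this is the actual `KafkaDataConsumer` code.

```scala
object TokenCheckCacheSketch {
  // Stand-in for an expensive check; counts how often it is evaluated.
  var expensiveCalls = 0
  def expensiveIsServiceEnabled(): Boolean = {
    expensiveCalls += 1
    true
  }

  // Hypothetical consumer: the expensive result is computed at most once per
  // instance, on first use, and reused for every later call.
  class Consumer {
    private lazy val serviceEnabled: Boolean = expensiveIsServiceEnabled()

    def needTokenUpdate(tokenChanged: Boolean): Boolean =
      serviceEnabled && tokenChanged
  }
}
```

Because the cached value lives in the consumer instance, creating a new consumer re-evaluates the check, while repeated calls on an existing consumer do not.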

### Why are the changes needed?
Kafka stream processing is slow with delegation token.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
* Existing unit tests
* In a Kafka to Kafka live query, I double-checked that the `HadoopDelegationTokenManager.isServiceEnabled` call is executed only when a new `KafkaDataConsumer` is created (a new delegation token arrives or a task fails).

Closes #31154 from gaborgsomogyi/SPARK-34090.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-13 11:04:44 +09:00
Takuya UESHIN ad8e40e2ab [SPARK-32338][SQL][PYSPARK][FOLLOW-UP][TEST] Add more tests for slice function
### What changes were proposed in this pull request?

This PR is a follow-up of #29138 and #29195 to add more tests for `slice` function.

### Why are the changes needed?

The original PRs are missing tests with column-based arguments instead of literals.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added tests and existing tests.

Closes #31159 from ueshin/issues/SPARK-32338/slice_tests.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-13 09:56:38 +09:00
ulysses-you 65222b7051 [SPARK-34069][CORE] Kill barrier tasks should respect SPARK_JOB_INTERRUPT_ON_CANCEL
### What changes were proposed in this pull request?

Add a `shouldInterruptTaskThread` check when killing barrier tasks.

### Why are the changes needed?

We should interrupt the task thread if the user sets the local property `SPARK_JOB_INTERRUPT_ON_CANCEL` to true.

### Does this PR introduce _any_ user-facing change?

Yes, the task will be interrupted if the user sets `SPARK_JOB_INTERRUPT_ON_CANCEL` to true.

### How was this patch tested?

Add test.

Closes #31127 from ulysses-you/SPARK-34069.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
2021-01-12 13:19:45 -06:00
yi.wu 0099715aae [SPARK-34091][SQL] Shuffle batch fetch should be able to disable after it's been enabled
### What changes were proposed in this pull request?

Fix the setting issue of shuffle batch fetch in `ShuffledRowRDD`.

### Why are the changes needed?

Currently, we cannot disable the shuffle batch fetch mode once it has been enabled. This PR fixes the issue so that `ShuffledRowRDD` respects `spark.sql.adaptive.fetchShuffleBlocksInBatch` at runtime.

### Does this PR introduce _any_ user-facing change?

Yes. Before this PR, users could not disable batch fetch once they had enabled it. After this PR, they can.

### How was this patch tested?

Added unit test.

Closes #31155 from Ngone51/fix-batchfetch-set-issue.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-12 15:45:15 +00:00
Max Gekk 6c047958f9 [SPARK-34084][SQL] Fix auto updating of table stats in ALTER TABLE .. ADD PARTITION
### What changes were proposed in this pull request?
Fix an issue in `ALTER TABLE .. ADD PARTITION` which happens when:
- A table doesn't have stats
- `spark.sql.statistics.size.autoUpdate.enabled` is `true`

In that case, `ALTER TABLE .. ADD PARTITION` does not update table stats automatically.

### Why are the changes needed?
The changes fix the issue demonstrated by the example:
```sql
spark-sql> create table tbl (col0 int, part int) partitioned by (part);
spark-sql> insert into tbl partition (part = 0) select 0;
spark-sql> set spark.sql.statistics.size.autoUpdate.enabled=true;
spark-sql> alter table tbl add partition (part = 1);
```
The `add partition` command should update table stats but it does not. There are no stats in the output of:
```
spark-sql> describe table extended tbl;
```

### Does this PR introduce _any_ user-facing change?
Yes. After the changes, `ALTER TABLE .. ADD PARTITION` updates stats even when a table does not have them before the command:
```sql
spark-sql> alter table tbl add partition (part = 1);
spark-sql> describe table extended tbl;
col0	int	NULL
part	int	NULL
# Partition Information
# col_name	data_type	comment
part	int	NULL

# Detailed Table Information
...
Statistics	2 bytes
```

### How was this patch tested?
By running new UT and existing test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableAddPartitionSuite"
```

Closes #31149 from MaxGekk/fix-stats-in-add-partition.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-12 14:34:17 +00:00
Jarek Potiuk a4b70758d3 [SPARK-34053][INFRA][FOLLOW-UP] Disables canceling push/schedule workflows
Changes the configuration of the cancel workflow action to skip schedule/push events from canceling. As a result, duplicates of direct pushes (master merges or direct pushes to the Spark repository) are not cancelled.

### What changes were proposed in this pull request?

Update to CI cancel policy to skip direct pushes. Duplicates will only be cancelled for Pull Requests.

### Why are the changes needed?

[Apparently](https://github.com/apache/spark/pull/31104#issuecomment-758318463) the behavior of the cancel action, which also cancels duplicate master builds, is too aggressive for the Spark community. This change spares merges to master and scheduled builds from duplicate checking (as a result, all merges to master will always be built to completion).

The previous behavior of the action was that in case of subsequent merges to master, only the latest one was guaranteed to complete. Other changes could be cancelled before they complete to save job queue.

### Does this PR introduce _any_ user-facing change?

No, except if the master builds were somehow facing the users (but it's unlikely taking into account the ASF release policy).
Previously, a problem that a specific master merge build would have caught might only be detected later (in one of the subsequent builds). That could make root-cause investigation a bit more difficult, because the failure could have been introduced in any of the commits between two completed builds. At most, this affected the timeline close to a release, not the release itself.

### How was this patch tested?

This configuration parameter has been tested earlier in Airflow. We used it initially, but our master builds are heavy, we have extensive tests in the PRs, and investigating failed builds is not as difficult. We therefore found that limiting the strain on GitHub Actions by cancelling master builds was more important for the health of the whole ASF community, and we removed that configuration.

Tested in https://github.com/potiuk/spark/runs/1688506527?check_suite_focus=true#step:2:46 where the action found other master builds in progress but did not add them as candidates to cancel.

Closes #31153 from potiuk/skip-schedule-push-branches.

Authored-by: Jarek Potiuk <potiuk@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-12 23:10:20 +09:00
Kent Yao 99f84892a5 [SPARK-34003][SQL][FOLLOWUP] Avoid pushing modified Char/Varchar sort attributes into aggregate for existing ones
### What changes were proposed in this pull request?

In 0f8e5dd445, we partially fixed the rule conflicts between `PaddingAndLengthCheckForCharVarchar` and `ResolveAggregateFunctions`, but an error still exists in SQL like `SELECT substr(v, 1, 2), sum(i) FROM t GROUP BY v ORDER BY substr(v, 1, 2)`:

```sql
[info]   Failed to analyze query: org.apache.spark.sql.AnalysisException: expression 'spark_catalog.default.t.`v`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;
[info]   Project [substr(v, 1, 2)#100, sum(i)#101L]
[info]   +- Sort [aggOrder#102 ASC NULLS FIRST], true
[info]      +- !Aggregate [v#106], [substr(v#106, 1, 2) AS substr(v, 1, 2)#100, sum(cast(i#98 as bigint)) AS sum(i)#101L, substr(v#103, 1, 2) AS aggOrder#102
[info]         +- SubqueryAlias spark_catalog.default.t
[info]            +- Project [if ((length(v#97) <= 3)) v#97 else if ((length(rtrim(v#97, None)) > 3)) cast(raise_error(concat(input string of length , cast(length(v#97) as string),  exceeds varchar type length limitation: 3)) as string) else rpad(rtrim(v#97, None), 3,  ) AS v#106, i#98]
[info]               +- Relation[v#97,i#98] parquet
[info]
[info]   Project [substr(v, 1, 2)#100, sum(i)#101L]
[info]   +- Sort [aggOrder#102 ASC NULLS FIRST], true
[info]      +- !Aggregate [v#106], [substr(v#106, 1, 2) AS substr(v, 1, 2)#100, sum(cast(i#98 as bigint)) AS sum(i)#101L, substr(v#103, 1, 2) AS aggOrder#102
[info]         +- SubqueryAlias spark_catalog.default.t
[info]            +- Project [if ((length(v#97) <= 3)) v#97 else if ((length(rtrim(v#97, None)) > 3)) cast(raise_error(concat(input string of length , cast(length(v#97) as string),  exceeds varchar type length limitation: 3)) as string) else rpad(rtrim(v#97, None), 3,  ) AS v#106, i#98]
[info]               +- Relation[v#97,i#98] parquet

```
We need to look recursively into children to find char/varchars.

In this PR, we try to resolve the full attributes, including the original `Aggregate` expressions and the candidates in `SortOrder`, together. We then use the newly re-resolved `Aggregate` expressions to determine which candidate in the `SortOrder` shall be pushed. This avoids the mismatch for the same attributes that occurs without this change, where the expressions returned by `executeSameContext` change once `PaddingAndLengthCheckForCharVarchar` takes effect. With this change, the expressions can be matched correctly.

For those unmatched, we need to look recursively into children to find char/varchars instead of the expression itself only.

### Why are the changes needed?

bugfix

### Does this PR introduce _any_ user-facing change?

no
### How was this patch tested?

add new tests

Closes #31129 from yaooqinn/SPARK-34003-F.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-12 08:20:39 +00:00
Gengliang Wang 02a17e92f1 [SPARK-28646][SQL][FOLLOWUP] Add legacy config for allowing parameterless count
### What changes were proposed in this pull request?

Add a legacy configuration `spark.sql.legacy.allowParameterlessCount` in case users need the parameterless count.
This is a follow-up for https://github.com/apache/spark/pull/30541.

### Why are the changes needed?

Some users may depend on the legacy behavior. We need a legacy flag for it.

### Does this PR introduce _any_ user-facing change?

Yes, adding a legacy flag `spark.sql.legacy.allowParameterlessCount`.

### How was this patch tested?

Unit tests

Closes #31143 from gengliangwang/countLegacy.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-12 16:31:22 +09:00
Max Gekk f7cbeec487 [SPARK-34074][SQL] Update stats only when table size changes
### What changes were proposed in this pull request?
Do not alter table stats if they are the same as in the catalog (at least as of the most recent retrieval).
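
The idea can be sketched with a hypothetical in-memory stand-in for the catalog. `FakeCatalog` and `updateIfChanged` are illustrative names only and do not correspond to Spark classes or methods.

```scala
object StatsUpdateSketch {
  // Hypothetical stand-in for an external catalog that counts write calls.
  final class FakeCatalog(var sizeInBytes: Option[BigInt]) {
    var alterCalls = 0
    def alterTableStats(newSize: BigInt): Unit = {
      alterCalls += 1
      sizeInBytes = Some(newSize)
    }
  }

  // Write new stats only when they differ from what the catalog last returned.
  def updateIfChanged(catalog: FakeCatalog, newSize: BigInt): Unit =
    if (!catalog.sizeInBytes.contains(newSize)) catalog.alterTableStats(newSize)
}
```

When the recomputed size equals the stored one, no call reaches the catalog, which is where the saving comes from if the catalog is a remote service.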

### Why are the changes needed?
The changes reduce the number of calls to Hive external catalog.

### Does this PR introduce _any_ user-facing change?
Should not.

### How was this patch tested?
By running the modified test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableAddPartitionSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite"
```

Closes #31135 from MaxGekk/optimize-updateTableStats.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-12 03:28:28 +00:00
Ruifeng Zheng 7ff9ff153e [SPARK-34045][ML] OneVsRestModel.transform should not call setter of submodels
### What changes were proposed in this pull request?
Use a temporary model in `OneVsRestModel.transform` to avoid directly calling setters of the model.

### Why are the changes needed?
Params of the model (submodels) should not be changed in `transform`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
added testsuite

Closes #31086 from zhengruifeng/ovr_transform_tmp_model.

Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
2021-01-12 10:21:37 +08:00
“attilapiros” 6bd7a6200f [SPARK-33711][K8S] Avoid race condition between POD lifecycle manager and scheduler backend
### What changes were proposed in this pull request?

Missing POD detection is extended with a timestamp-based (and time-limit-based) check to avoid wrongly detecting missing PODs.

The two new timestamps:
- `fullSnapshotTs` is introduced for the `ExecutorPodsSnapshot` and is only updated by the pod polling snapshot source
- `registrationTs` is introduced for the `ExecutorData` and is initialized at the executor registration at the scheduler backend

Moreover, a new config `spark.kubernetes.executor.missingPodDetectDelta` is used to specify the accepted delta between the two.
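
A hedged sketch of how the two timestamps and the delta might combine is shown below. The function name and the exact comparison are assumptions for illustration; the PR's actual condition may differ in detail.

```scala
object MissingPodSketch {
  // Hypothetical check: an executor absent from a snapshot is treated as
  // missing only if the last full snapshot was taken more than `deltaMs`
  // after the executor registered at the scheduler backend.
  def isConsideredMissing(fullSnapshotTs: Long, registrationTs: Long, deltaMs: Long): Boolean =
    fullSnapshotTs - registrationTs > deltaMs
}
```

Under this rule, an executor that registered after (or shortly before) the last full snapshot is not reported as missing, because that snapshot could not reliably have contained it.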

### Why are the changes needed?

Watching a POD (`ExecutorPodsWatchSnapshotSource`) only informs about single POD changes. This could wrongly lead the executor POD lifecycle manager to detect missing PODs (PODs known by the scheduler backend but missing from POD snapshots).

A key indicator of this error is seeing this log message:

> "The executor with ID [some_id] was not found in the cluster but we didn't get a reason why. Marking the executor as failed. The executor may have been deleted but the driver missed the deletion event."

So one problem is that the missing POD detection check runs even when a single POD has changed, without a full consistent snapshot of all the PODs (see `ExecutorPodsPollingSnapshotSource`).
The other problem could be the race between the executor POD lifecycle manager and the scheduler backend: even with a full snapshot, the registration at the scheduler backend could precede the snapshot polling (and the processing of those polled snapshots).

### Does this PR introduce _any_ user-facing change?

Yes. When the POD is missing, the reason message explaining the executor's exit is extended with both timestamps (the polling time and the executor registration time), and the new config is mentioned as well.

### How was this patch tested?

The existing unit tests are extended.

Closes #30675 from attilapiros/SPARK-33711.

Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
2021-01-11 14:25:12 -08:00
Dongjoon Hyun 3556929c43 [SPARK-33970][SQL][TEST][FOLLOWUP] Use String comparision
### What changes were proposed in this pull request?

This is a follow-up to replace `version.toDouble > 2` with `version >= "2.0"`

### Why are the changes needed?

`toDouble` makes assumptions about the version format and can cause `java.lang.NumberFormatException`.
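
The difference can be seen in a small standalone sketch. The helper names are hypothetical and the version strings are arbitrary examples.

```scala
import scala.util.Try

object VersionCompareSketch {
  // toDouble only succeeds when the whole string is a valid floating-point literal.
  def parsesAsDouble(version: String): Boolean = Try(version.toDouble).isSuccess

  // String comparison is lexicographic and never throws.
  def atLeastTwo(version: String): Boolean = version >= "2.0"
}
```

A three-part version such as `"2.3.9"` fails `toDouble` but compares fine as a string. Note that lexicographic comparison is not numeric: `"10.0" >= "2.0"` is false, so this approach assumes single-digit major versions.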

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #31134 from dongjoon-hyun/SPARK-33970-FOLLOWUP.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-11 13:40:03 -08:00
Liang-Chi Hsieh ad9fad72a9 [MINOR][SS] Add some description about auto reset and data loss note to SS doc
### What changes were proposed in this pull request?

This patch adds some description to the SS doc about offset reset and data loss.

### Why are the changes needed?

During a recent SS test, the behavior of gradually decreasing input rows confused me. I do not see similar behavior in Flink. After looking into the code and doing some tests, I feel it is better to add some more description to the SS doc.

### Does this PR introduce _any_ user-facing change?

No, doc only.

### How was this patch tested?

Doc only.

Closes #31089 from viirya/ss-minor-5.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-11 11:47:07 -08:00
Liang-Chi Hsieh 0bcbafb4b8 [SPARK-34002][SQL] Fix the usage of encoder in ScalaUDF
### What changes were proposed in this pull request?

This patch fixes a few issues when using encoders to serialize input/output in `ScalaUDF`.

### Why are the changes needed?

This fixes a bug when using encoders in Scala UDF. First, the output data type should be corrected to the corresponding data type of the object serializer. Second, `catalystConverter` should not serialize `Option[_]` as the ordinary row because in `ScalaUDF` case it is serialized to a column, not the top-level row. Otherwise, there will be a redundant `value` struct wrapping the serialized `Option[_]` object.

### Does this PR introduce _any_ user-facing change?

Yes, fixing a bug of `ScalaUDF`.

### How was this patch tested?

Unit test.

Closes #31103 from viirya/SPARK-34002.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-11 11:31:35 -08:00
yikf 1495ad8c46 [SPARK-33991][CORE][WEBUI] Repair enumeration conversion error for AllJobsPage
### What changes were proposed in this pull request?
In the `AllJobsPage` class, the `schedulingMode` enumeration value is obtained by loading the `spark.scheduler.mode` configuration from `SparkConf`, but an enumeration conversion error occurs when this configuration is set to a lowercase value.

The reason is that the values of the `SchedulingMode` enumeration class are uppercase, so the lookup fails when `spark.scheduler.mode` is configured in lowercase.

The `org.apache.spark.scheduler.TaskSchedulerImpl` class converts the `spark.scheduler.mode` value to uppercase, so I think it should be converted in `AllJobsPage` as well.
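
The failure mode and the fix can be reproduced with a local stand-in enumeration. `Mode` below is a hypothetical mirror of an enumeration with uppercase values, not Spark's `SchedulingMode` itself.

```scala
import java.util.Locale

object SchedulingModeSketch {
  // Local stand-in for an enumeration whose values are all upper-case.
  object Mode extends Enumeration {
    val FAIR, FIFO, NONE = Value
  }

  // Enumeration.withName is case-sensitive, so normalize the configured value first.
  def parse(configured: String): Mode.Value =
    Mode.withName(configured.toUpperCase(Locale.ROOT))
}
```

Calling `Mode.withName("fair")` directly throws `NoSuchElementException`, while `parse("fair")` resolves to `Mode.FAIR`. Using `Locale.ROOT` keeps the uppercase conversion independent of the JVM's default locale.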

### Why are the changes needed?
An enumeration conversion error occurs in Spark when this configuration is set to a lowercase value.

### How was this patch tested?
Existing tests.

Closes #31015 from yikf/master.

Authored-by: yikf <13468507104@163.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-01-11 08:48:02 -06:00
angerszhu 5ef6907792 [SPARK-33084][CORE][SQL] Rename Unit test file and use fake ivy link
### What changes were proposed in this pull request?
According to https://github.com/apache/spark/pull/29966#discussion_r554514344, the suite file was given the wrong name; this PR fixes that.
It also changes the test to use a fake ivy link.

### Why are the changes needed?
To follow the file naming rule.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No

Closes #31118 from AngersZhuuuu/SPARK-33084-FOLLOW-UP.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-11 23:18:18 +09:00
hyukjinkwon ff493173ab [SPARK-34065][INFRA] Cancel the duplicated jobs only in PRs at GitHub Actions
### What changes were proposed in this pull request?

This is kind of a followup of https://github.com/apache/spark/pull/31104 but I decided to track it separately with a separate JIRA.

Currently the jobs are being canceled in main repo branches. If a commit is merged, for example, to master branch before the test finishes, it cancels the previous builds. This is a problem because we cannot, for example, detect logical conflict properly. We should only cancel the jobs in PRs:

![Screen Shot 2021-01-11 at 3 22 24 PM](https://user-images.githubusercontent.com/6477701/104152015-c7f04b80-5421-11eb-9e40-6b0a0e5b8442.png)

This PR proposes not to do this for commits in the main repo branches, and to do it only in PRs.

### Why are the changes needed?

- To keep the test coverage
- To run the test in the synced master branch instead of relying on the builds made in each PR with an outdated master branch
- To detect test failures from logical conflicts from merging two conflicting PRs at the same time.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

I manually tested in
- https://github.com/HyukjinKwon/spark/pull/27
- https://github.com/HyukjinKwon/spark/pull/28

I added Yi Wu as a co-author since he helped verify the current fix in the PR above.

I checked that it does not cancel in the main repo branch:

![Screen Shot 2021-01-11 at 3 58 52 PM](https://user-images.githubusercontent.com/6477701/104153656-3afbc100-5426-11eb-9309-85f6f4fd9ff3.png)

I checked it cancels in PRs:

![Screen Shot 2021-01-11 at 3 58 45 PM](https://user-images.githubusercontent.com/6477701/104153658-3d5e1b00-5426-11eb-89f7-786c3ae6849a.png)

Closes #31121 from HyukjinKwon/SPARK-34065.

Lead-authored-by: hyukjinkwon <gurwls223@apache.org>
Co-authored-by: yi.wu <yi.wu@databricks.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-11 16:37:16 +09:00
yi.wu 4afca0f706 [SPARK-31952][SQL] Fix incorrect memory spill metric when doing Aggregate
### What changes were proposed in this pull request?

This PR takes over https://github.com/apache/spark/pull/28780.

1. Counted the spilled memory size when creating the `UnsafeExternalSorter` with the existing `InMemorySorter`

2. Accumulate the `totalSpillBytes` when merging two `UnsafeExternalSorter`

### Why are the changes needed?

As mentioned in https://github.com/apache/spark/pull/28780:

> It happens when hash aggregate downgrades to sort based aggregate.
`UnsafeExternalSorter.createWithExistingInMemorySorter` calls spill on an `InMemorySorter` immediately, but the memory pointed to by `InMemorySorter` is acquired by the outside `BytesToBytesMap`, instead of the allocatedPages in `UnsafeExternalSorter`. So the memory spill bytes metric is always 0, but the disk bytes spill metric is right.

Besides, this PR also fixes the `UnsafeExternalSorter.merge` by accumulating the `totalSpillBytes` of two sorters. Thus, we can report the correct spilled size in `HashAggregateExec.finishAggregate`.

The issue can be reproduced with the following steps, then checking the SQL metrics in the UI:

```
bin/spark-shell --driver-memory 512m --executor-memory 512m --executor-cores 1 --conf "spark.default.parallelism=1"
scala> sql("select id, count(1) from range(10000000) group by id").write.csv("/tmp/result.json")
```

Before:

<img width="200" alt="WeChatfe5146180d91015e03b9a27852e9a443" src="https://user-images.githubusercontent.com/16397174/103625414-e6fc6280-4f75-11eb-8b93-c55095bdb5b8.png">

After:

<img width="200" alt="WeChat42ab0e73c5fbc3b14c12ab85d232071d" src="https://user-images.githubusercontent.com/16397174/103625420-e8c62600-4f75-11eb-8e1f-6f5e8ab561b9.png">

### Does this PR introduce _any_ user-facing change?

Yes, users can see the correct spill metrics after this PR.

### How was this patch tested?

Tested manually and added UTs.

Closes #31035 from Ngone51/SPARK-31952.

Lead-authored-by: yi.wu <yi.wu@databricks.com>
Co-authored-by: wangguangxin.cn <wangguangxin.cn@bytedance.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-11 07:15:28 +00:00
Max Gekk d97e99157e [SPARK-34060][SQL] Fix Hive table caching while updating stats by ALTER TABLE .. DROP PARTITION
### What changes were proposed in this pull request?
Fix canonicalisation of `HiveTableRelation` by normalisation of `CatalogTable`, and exclude table stats and temporary fields from the canonicalized plan.
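
The idea can be sketched with simplified stand-ins. The `TableMeta` and `Relation` classes below are hypothetical, not Spark's `CatalogTable` and `HiveTableRelation`, and which fields count as "temporary" is an assumption made for illustration:

```scala
// Hypothetical, simplified stand-in for table metadata.
final case class TableMeta(
    name: String,
    columns: Seq[String],
    sizeInBytes: Option[Long], // table stats, updated by commands
    lastAccessTime: Long)      // a field that changes between lookups

// Hypothetical, simplified stand-in for a relation over that table.
final case class Relation(table: TableMeta) {
  // Drop the fields that can change without changing the query result.
  def canonicalized: Relation =
    Relation(table.copy(sizeInBytes = None, lastAccessTime = 0L))
}

val cached = Relation(TableMeta("tbl", Seq("id", "part"), Some(100L), 1L))
// The same table after a command has updated its statistics.
val afterUpdate = Relation(TableMeta("tbl", Seq("id", "part"), Some(50L), 2L))

// Plain equality sees two different relations; canonical forms still match.
val plainMatch = cached == afterUpdate
val canonicalMatch = cached.canonicalized == afterUpdate.canonicalized
```

A cache that looks up plans by their canonical form would therefore keep finding the cached entry after the stats change.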

### Why are the changes needed?
This fixes the issue demonstrated by the example below:
```scala
scala> spark.conf.set("spark.sql.statistics.size.autoUpdate.enabled", true)
scala> sql(s"CREATE TABLE tbl (id int, part int) USING hive PARTITIONED BY (part)")
scala> sql("INSERT INTO tbl PARTITION (part=0) SELECT 0")
scala> sql("INSERT INTO tbl PARTITION (part=1) SELECT 1")
scala> sql("CACHE TABLE tbl")
scala> sql("SELECT * FROM tbl").show(false)
+---+----+
|id |part|
+---+----+
|0  |0   |
|1  |1   |
+---+----+

scala> spark.catalog.isCached("tbl")
scala> sql("ALTER TABLE tbl DROP PARTITION (part=0)")
scala> spark.catalog.isCached("tbl")
res19: Boolean = false
```
`ALTER TABLE .. DROP PARTITION` must keep the table in the cache.

### Does this PR introduce _any_ user-facing change?
Yes. After the changes, the drop partition command keeps the table in the cache while updating table stats:
```scala
scala> sql("ALTER TABLE tbl DROP PARTITION (part=0)")
scala> spark.catalog.isCached("tbl")
res19: Boolean = true
```

### How was this patch tested?
By running new UT in `AlterTableDropPartitionSuite`.

Closes #31112 from MaxGekk/fix-caching-hive-table-2.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-11 07:03:44 +00:00
Max Gekk 664ef184c1 [SPARK-34055][SQL][TESTS][FOLLOWUP] Check partition adding to cached Hive table
### What changes were proposed in this pull request?
Replace `USING parquet` by `$defaultUsing` which is `USING parquet` for v1 In-Memory catalog and `USING hive` for v1 Hive external catalog.

### Why are the changes needed?
The PR https://github.com/apache/spark/pull/31101 added a unit test, but it checks only the v1 In-Memory catalog. This PR runs the test for the Hive external catalog as well to improve test coverage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the affected test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableAddPartitionSuite"
```

Closes #31117 from MaxGekk/add-partition-refresh-cache-2-followup-2.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-11 07:02:49 +00:00
Yuming Wang f77eeb0451 [SPARK-33970][SQL][TEST] Add test default partition in metastoredirectsql
### What changes were proposed in this pull request?

This PR adds a test for the default partition in metastoredirectsql.

### Why are the changes needed?

Improve test.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes #31109 from wangyum/SPARK-33970.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-11 14:19:53 +09:00
Terry Kim 8391a4a687 [SPARK-34057][SQL] UnresolvedTableOrView should retain SQL text position for DDL commands
### What changes were proposed in this pull request?

Currently, there are many DDL commands where the position of the unresolved identifier is incorrect:
```
scala> sql("DROP TABLE unknown")
org.apache.spark.sql.AnalysisException: Table or view not found: unknown; line 1 pos 0;
```
, whereas the `pos` should be `11`.
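
The expected value is simply the 0-based offset of the identifier within the statement. The snippet below is only a sanity check with plain string indexing, not how the parser tracks positions:

```scala
// Illustrative only: 0-based column at which the identifier starts in the SQL text.
val sqlText = "DROP TABLE unknown"
val identifier = "unknown"
val pos = sqlText.indexOf(identifier)
```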

This PR proposes to fix this issue for commands using `UnresolvedTableOrView`:
```
DROP TABLE unknown
DESCRIBE TABLE unknown
ANALYZE TABLE unknown COMPUTE STATISTICS
ANALYZE TABLE unknown COMPUTE STATISTICS FOR COLUMNS col
ANALYZE TABLE unknown COMPUTE STATISTICS FOR ALL COLUMNS
SHOW CREATE TABLE unknown
REFRESH TABLE unknown
SHOW COLUMNS FROM unknown
SHOW COLUMNS FROM unknown IN db
ALTER TABLE unknown RENAME TO t
ALTER VIEW unknown RENAME TO v
```

### Why are the changes needed?

To fix a bug.

### Does this PR introduce _any_ user-facing change?

Yes, now the above example will print the following:
```
org.apache.spark.sql.AnalysisException: Table or view not found: unknown; line 1 pos 11;
```

### How was this patch tested?

Add a new test.

Closes #31106 from imback82/unresolved_table_or_view_message.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-11 04:28:39 +00:00
Weichen Xu 11fac232c8 [MINOR] Improve flaky NaiveBayes test
### What changes were proposed in this pull request?
Improve flaky NaiveBayes test

The current test may sometimes fail under a different BLAS library because of some `absTol` checks, with an error like:
```
Expected 0.7 and 0.6485507246376814 to be within 0.05 using absolute tolerance...

```

* Change `absTol` to `relTol`: in some cases (such as comparing 0.1 and 0.05), `absTol 0.05` allows a big difference.
* Remove the `exp` when comparing params. The `exp` will amplify the relative error.
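
Both points can be illustrated with a small sketch. The helpers below are hypothetical and are not the API of Spark's test utilities, and the numbers are made up for illustration:

```scala
// Hypothetical helpers, not Spark's test utilities.
def withinAbsTol(x: Double, y: Double, eps: Double): Boolean =
  math.abs(x - y) < eps

def relErr(x: Double, y: Double): Double =
  math.abs(x - y) / math.min(math.abs(x), math.abs(y))

def withinRelTol(x: Double, y: Double, eps: Double): Boolean =
  relErr(x, y) < eps

// For small values, an absolute tolerance of 0.05 accepts a large relative gap.
val absOk = withinAbsTol(0.1, 0.06, 0.05)
val relOk = withinRelTol(0.1, 0.06, 0.05)

// Log-space parameters that are close in relative terms drift apart after exp.
val logA = -10.0
val logB = -10.3
val errInLogSpace = relErr(logA, logB)
val errAfterExp = relErr(math.exp(logA), math.exp(logB))
```

Here the absolute check passes while the relative check fails, and the relative error after `exp` is roughly ten times the error in log space.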

### Why are the changes needed?
Flaky test

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
N/A

Closes #31004 from WeichenXu123/improve_bayes_tests.

Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
2021-01-11 11:58:57 +08:00
Kamil Breguła 3e5e08640e [SPARK-34053][INFRA] Cancel the previous build
Similar to: https://github.com/apache/spark/pull/31098 https://github.com/apache/calcite/pull/2318 (solution suggested by vlsi - https://github.com/apache/pulsar/issues/9154#issuecomment-756984731)

I used the action, which was maintained by potiuk instead of the original author, for two reasons:
- the original action was abandoned and is not supported (Proof: https://github.com/n1hility/cancel-previous-runs/issues/7)
- this action works with forks.  The original action only worked when the contribution was run in the same repository and the action had a token with full accesses.

> If you use forks, you should create a separate "Cancelling" workflow_run triggered workflow. The workflow_run should be responsible for all canceling actions. The examples below show the possible ways the action can be utilized.

### What changes were proposed in this pull request?
This PR aims to reduce the GitHub Action usage by cancelling the previous build.

### Why are the changes needed?
In most cases, the last commit is the meaningful one.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Due to the nature of the change, testing of this change is difficult.

> Note: This event will only trigger a workflow run if the workflow file is on the default branch.

https://docs.github.com/en/free-pro-teamlatest/actions/reference/events-that-trigger-workflows#workflow_run

However, you can see on my fork that this action is triggered.
https://github.com/mik-laj/spark/actions?query=workflow%3A%22Cancelling+Duplicates%22

I also asked the author of this action, potiuk (PMC of Apache Airflow), to review this change, and I received a positive review.

Closes #31104 from mik-laj/patch-1.

Lead-authored-by: Kamil Breguła <kamil.bregula@polidea.com>
Co-authored-by: Kamil Breguła <mik-laj@users.noreply.github.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-10 16:19:44 -08:00
HyukjinKwon 830249284d [SPARK-34059][SQL][CORE] Use for/foreach rather than map to make sure execute it eagerly
### What changes were proposed in this pull request?

This PR is basically a followup of https://github.com/apache/spark/pull/14332.
Calling `map` alone might leave it not executed due to lazy evaluation, e.g.)

```
scala> val foo = Seq(1,2,3)
foo: Seq[Int] = List(1, 2, 3)

scala> foo.map(println)
1
2
3
res0: Seq[Unit] = List((), (), ())

scala> foo.view.map(println)
res1: scala.collection.SeqView[Unit,Seq[_]] = SeqViewM(...)

scala> foo.view.foreach(println)
1
2
3
```

We should use `foreach` to make sure it's executed where the output is unused or `Unit`.

### Why are the changes needed?

To prevent the potential issues by not executing `map`.

### Does this PR introduce _any_ user-facing change?

No, the current code does not appear to cause any problems for now.

### How was this patch tested?

I found these items by running the IntelliJ inspection, double-checked them one by one, and fixed them. Ideally, these should be all the instances across the codebase.

Closes #31110 from HyukjinKwon/SPARK-34059.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-01-10 15:22:24 -08:00
Max Gekk 9a8d275226 [SPARK-34055][SQL][TESTS][FOLLOWUP] Increase the expected number of calls to Hive external catalog in partition adding
### What changes were proposed in this pull request?
Increase the number of calls to Hive external catalog in the test for `ALTER TABLE .. ADD PARTITION`.

### Why are the changes needed?
There is a logical conflict between https://github.com/apache/spark/pull/31101 and https://github.com/apache/spark/pull/31092. The first one fixes a caching issue and increases the number of calls to Hive external catalog.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the modified test:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableAddPartitionSuite"
```

Closes #31111 from MaxGekk/add-partition-refresh-cache-2-followup.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-10 18:29:02 +09:00
ulysses-you 48b9611ba3 [SPARK-32668][SQL] HiveGenericUDTF initialize UDTF should use StructObjectInspector method
### What changes were proposed in this pull request?

Use `initialize(StructObjectInspector argOIs)` instead of `initialize(ObjectInspector[] args)` in `HiveGenericUDTF`.

### Why are the changes needed?

In our case, we implement a Hive `GenericUDTF` and override `initialize(StructObjectInspector argOIs)`. It works when executed with Hive, but fails with Spark SQL. Here is the Spark SQL error message:
```
No handler for UDF/UDAF/UDTF 'com.xxxx.xxxUDTF': java.lang.IllegalStateException:
Should not be called directly Please make sure your function overrides
`public StructObjectInspector initialize(ObjectInspector[] args)`.
```

The reason is that Spark's `HiveGenericUDTF` calls `initialize(ObjectInspector[] argOIs)` to initialize a UDTF, but that is a deprecated method.
```
    public StructObjectInspector initialize(StructObjectInspector argOIs) throws UDFArgumentException {
        List<? extends StructField> inputFields = argOIs.getAllStructFieldRefs();
        ObjectInspector[] udtfInputOIs = new ObjectInspector[inputFields.size()];

        for(int i = 0; i < inputFields.size(); ++i) {
            udtfInputOIs[i] = ((StructField)inputFields.get(i)).getFieldObjectInspector();
        }

        return this.initialize(udtfInputOIs);
    }

    @Deprecated
    public StructObjectInspector initialize(ObjectInspector[] argOIs) throws UDFArgumentException {
        throw new IllegalStateException("Should not be called directly");
    }
```

We should use `initialize(StructObjectInspector argOIs)` to do this so that we are compatible with both methods, the same as Hive.

### Does this PR introduce _any_ user-facing change?

Yes, fix UDTF initialize method.

### How was this patch tested?

manual test and passed `HiveUDFDynamicLoadSuite`

Closes #29490 from ulysses-you/SPARK-32668.

Lead-authored-by: ulysses-you <ulyssesyou18@gmail.com>
Co-authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
2021-01-10 13:19:04 +08:00
Max Gekk e0e06c18fd [SPARK-34055][SQL] Refresh cache in ALTER TABLE .. ADD PARTITION
### What changes were proposed in this pull request?
Invoke `refreshTable()` from `CatalogImpl` which refreshes the cache in v1 `ALTER TABLE .. ADD PARTITION`.

### Why are the changes needed?
This fixes the issues portrayed by the example:
```sql
spark-sql> create table tbl (col int, part int) using parquet partitioned by (part);
spark-sql> insert into tbl partition (part=0) select 0;
spark-sql> cache table tbl;
spark-sql> select * from tbl;
0	0
spark-sql> show table extended like 'tbl' partition(part=0);
default	tbl	false	Partition Values: [part=0]
Location: file:/Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=0
...
```
Create new partition by copying the existing one:
```
$ cp -r /Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=0 /Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=1
```
```sql
spark-sql> alter table tbl add partition (part=1) location '/Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=1';
spark-sql> select * from tbl;
0	0
```

The last query must return `0	1` since it has been added by `ALTER TABLE .. ADD PARTITION`.

### Does this PR introduce _any_ user-facing change?
Yes. After the changes for the example above:
```sql
...
spark-sql> alter table tbl add partition (part=1) location '/Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=1';
spark-sql> select * from tbl;
0	0
0	1
```

### How was this patch tested?
By running the affected test suite:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableAddPartitionSuite"
```

Closes #31101 from MaxGekk/add-partition-refresh-cache-2.

Lead-authored-by: Max Gekk <max.gekk@gmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-10 14:06:17 +09:00
HyukjinKwon 105ba6e5f0 Revert "[SPARK-33933][SQL] Materialize BroadcastQueryStage first to avoid broadcast timeout in AQE"
This reverts commit d36cdd5541.
2021-01-10 13:52:48 +09:00
ulysses-you 48cd11c483 [SPARK-34030][SQL] Fold RepartitionByExpression num partition should at Optimizer
### What changes were proposed in this pull request?

Move `RepartitionByExpression` fold partition number code to a new rule at `Optimizer`.

### Why are the changes needed?

We ran into a problem when backporting SPARK-33806, because `UnresolvedFunction.foldable` throws an exception. It works on the master branch, but it is better to do the folding in the Optimizer, for these reasons:

1. It's not always safe to call `Expression.foldable` before analysis.
2. Folding the number of partitions to 1 is more of an optimization.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Add test.

Closes #31077 from ulysses-you/SPARK-34030.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-10 13:00:40 +09:00
Max Gekk 0af387480c [SPARK-34048][SQL][TESTS] Check the amount of calls to Hive external catalog
### What changes were proposed in this pull request?
Add new tests to unified test suites to check the total amount of calls via the Hive client.

### Why are the changes needed?
1. To improve test coverage
2. To make foundation for future optimizations

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the affected test suites like:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite"
```

Closes #31092 from MaxGekk/access-to-catalog-refreshTable.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-09 15:33:08 -08:00
Anton Okolnychyi 6b34745cb9 [SPARK-34049][SS] DataSource V2: Use Write abstraction in StreamExecution
### What changes were proposed in this pull request?

This PR makes `StreamExecution` use the `Write` abstraction introduced in SPARK-33779.

Note: we will need separate plans for streaming writes in order to support the required distribution and ordering in SS. This change only migrates to the `Write` abstraction.

### Why are the changes needed?

These changes prevent exceptions from data sources that implement only the `build` method in `WriteBuilder`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #31093 from aokolnychyi/spark-34049.

Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-08 20:37:35 -08:00
Chandni Singh d00f0695b7 [SPARK-32917][SHUFFLE][CORE] Adds support for executors to push shuffle blocks after successful map task completion
### What changes were proposed in this pull request?
This is the shuffle writer side change where executors can push data to remote shuffle services. This is needed for push-based shuffle - SPIP [SPARK-30602](https://issues.apache.org/jira/browse/SPARK-30602).
Summary of changes:
- This adds support for executors to push shuffle blocks after map tasks complete writing shuffle data.
- This also introduces a timeout specifically for creating connection to remote shuffle services.

### Why are the changes needed?
- These changes are needed for push-based shuffle. Refer to the SPIP in [SPARK-30602](https://issues.apache.org/jira/browse/SPARK-30602).
- The main reason to create a separate connection creation timeout is that the existing `connectionTimeoutMs` is overloaded: it is used for the connection creation timeout as well as the connection idle timeout. The connection creation timeout should be much lower than the idle timeout. The default for `connectionTimeoutMs` is 120s, which is quite high for just establishing connections. If a shuffle server node is bad, connection creation will fail within a few seconds. However, an overloaded shuffle server may take much longer to respond to a request, and the channel can stay idle for a much longer time, which is expected. Another reason is that with push-based shuffle, an executor may be fetching shuffle data and pushing shuffle data (for the next stage) simultaneously. Both of these tasks share the same connections with the shuffle service. If there is a bad shuffle server node and the connection creation timeout is very high, both tasks end up waiting a long time, eventually impacting performance.

### Does this PR introduce _any_ user-facing change?
Yes. This PR introduces client-side configs for push-based shuffle. If push-based shuffle is turned off, users will not see any change.

### How was this patch tested?
Added unit tests.
The reference PR with the consolidated changes covering the complete implementation is also provided in [SPARK-30602](https://issues.apache.org/jira/browse/SPARK-30602).
We have already verified the functionality and the improved performance as documented in the SPIP doc.

Lead-authored-by: Min Shen mshenlinkedin.com
Co-authored-by: Chandni Singh chsinghlinkedin.com
Co-authored-by: Ye Zhou yezhoulinkedin.com

Closes #30312 from otterc/SPARK-32917.

Lead-authored-by: Chandni Singh <singh.chandni@gmail.com>
Co-authored-by: Chandni Singh <chsingh@linkedin.com>
Co-authored-by: Min Shen <mshen@linked.in.com>
Co-authored-by: Ye Zhou <yezhou@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
2021-01-08 12:21:56 -06:00
Kousuke Saruta 0781ed4f5b [MINOR][SQL][TESTS] Fix the incorrect unicode escape test in ParserUtilsSuite
### What changes were proposed in this pull request?

This PR fixes an incorrect unicode literal test in `ParserUtilsSuite`.
In that suite, string literals in queries have unicode escape characters like `\u7328`, but the backslash should be escaped because the query strings are given as Java strings.
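
A small sketch of the difference, using made-up query fragments rather than the actual test strings from `ParserUtilsSuite`:

```scala
// With a single backslash, the compiler decodes the escape into one character
// before the SQL parser ever sees the text.
val decodedByCompiler = "'\u7328'"

// With an escaped backslash, the six-character escape sequence stays in the
// query text, so the SQL parser's own unescaping is what gets exercised.
val passedToParser = "'\\u7328'"
```

The first string contains three characters (two quotes and the decoded character), while the second contains eight.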

### Why are the changes needed?

Correct the test.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Run `ParserUtilsSuite` and it passed.

Closes #31088 from sarutak/fix-incorrect-unicode-test.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-01-08 09:44:33 -06:00
Kousuke Saruta 023eba2ad7 [SPARK-33796][DOCS][FOLLOWUP] Tweak the width of left-menu of Spark SQL Guide
### What changes were proposed in this pull request?

This PR tweaks the width of left-menu of Spark SQL Guide.
When I view the Spark SQL Guide with browsers on macOS, the title `Spark SQL Guide` looks fine.
But I often use Pop!_OS, an Ubuntu variant, and the title overlaps in browsers on it.
![spark-sql-guide-layout-before](https://user-images.githubusercontent.com/4736016/104002743-d56cc200-51e4-11eb-9e3a-28abcd46e0bf.png)

After this change, the title is no longer overlapped.
![spark-sql-guide-layout-after](https://user-images.githubusercontent.com/4736016/104002847-f9c89e80-51e4-11eb-85c0-01d69cee46b7.png)

### Why are the changes needed?

For the pretty layout.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Built the document with `cd docs && SKIP_API=1 jekyll build` and confirmed the layout.

Closes #31091 from sarutak/modify-layout-sparksql-guide.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-01-08 09:43:11 -06:00
Max Gekk 157b72ac9f [SPARK-33591][SQL] Recognize null in partition spec values
### What changes were proposed in this pull request?
1. Recognize `null` while parsing partition specs, and put `null` instead of `"null"` as partition values.
2. For V1 catalog: replace `null` by `__HIVE_DEFAULT_PARTITION__`.
3. For V2 catalogs: pass `null` AS IS, and let catalog implementations decide how to handle `null`s as partition values in spec.
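
The V1 handling in the list above can be sketched as follows. The `toV1Spec` helper and the `Map`-based spec are hypothetical simplifications, not the code in this PR:

```scala
// Name of the default partition, as given in the list above.
val DefaultPartitionName = "__HIVE_DEFAULT_PARTITION__"

// Hypothetical helper: what a V1 catalog would store for a parsed partition spec.
def toV1Spec(spec: Map[String, String]): Map[String, String] =
  spec.map { case (col, value) =>
    col -> (if (value == null) DefaultPartitionName else value)
  }

// PARTITION (p1 = null) is parsed to a real null value...
val nullSpec: Map[String, String] = Map("p1" -> null)
// ...while PARTITION (p1 = 'null') remains an ordinary string.
val stringSpec: Map[String, String] = Map("p1" -> "null")

val storedForNull = toV1Spec(nullSpec)
val storedForString = toV1Spec(stringSpec)
```

Keeping `null` and `"null"` distinct at parse time is what lets the two specs map to different partitions.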

### Why are the changes needed?
Currently, `null` in partition specs is recognized as the `"null"` string which could lead to incorrect results, for example:
```sql
spark-sql> CREATE TABLE tbl5 (col1 INT, p1 STRING) USING PARQUET PARTITIONED BY (p1);
spark-sql> INSERT INTO TABLE tbl5 PARTITION (p1 = null) SELECT 0;
spark-sql> SELECT isnull(p1) FROM tbl5;
false
```
Even though we inserted a row into the partition with the `null` value, **the resulting table doesn't contain `null`**.

### Does this PR introduce _any_ user-facing change?
Yes. After the changes, the example above works as expected:
```sql
spark-sql> SELECT isnull(p1) FROM tbl5;
true
```

### How was this patch tested?
1. By running the affected test suites `SQLQuerySuite`, `AlterTablePartitionV2SQLSuite` and `v1/ShowPartitionsSuite`.
2. Compiling by Scala 2.13:
```
$  ./dev/change-scala-version.sh 2.13
$ ./build/sbt -Pscala-2.13 compile
```

Closes #30538 from MaxGekk/partition-spec-value-null.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-08 14:14:27 +00:00
Gabor Somogyi 71d261ab8f [SPARK-34032][SS] Add truststore and keystore type config possibility for Kafka delegation token
### What changes were proposed in this pull request?
A Kafka delegation token is obtained with `AdminClient`, where security settings can be set. The keystore and truststore types, however, can't be set. In this PR I've added these new configurations. This can be useful when the type is different from the default. A good example is making Spark FIPS compliant, where the default JKS is not accepted.

### Why are the changes needed?
Missing configurations.

### Does this PR introduce _any_ user-facing change?
Yes, adding 2 additional config parameters.

### How was this patch tested?
Existing + modified unit tests + simple Kafka to Kafka app on cluster.

Closes #31070 from gaborgsomogyi/SPARK-34032.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
2021-01-08 20:04:56 +09:00
Kent Yao 0f8e5dd445 [SPARK-34003][SQL] Fix Rule conflicts between PaddingAndLengthCheckForCharVarchar and ResolveAggregateFunctions
### What changes were proposed in this pull request?

`ResolveAggregateFunctions` is a hacky rule: it calls `executeSameContext` to generate a `resolved agg` to determine which unresolved sort attributes should be pushed into the agg. However, since the `PaddingAndLengthCheckForCharVarchar` rule was added, which rewrites the query output, the `resolved agg` can no longer match the original attributes.

This causes some dissociated sort attributes to be pushed in, and the query fails:

``` logtalk
[info]   Failed to analyze query: org.apache.spark.sql.AnalysisException: expression 'testcat.t1.`v`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;
[info]   Project [v#14, sum(i)#11L]
[info]   +- Sort [aggOrder#12 ASC NULLS FIRST], true
[info]      +- !Aggregate [v#14], [v#14, sum(cast(i#7 as bigint)) AS sum(i)#11L, v#13 AS aggOrder#12]
[info]         +- SubqueryAlias testcat.t1
[info]            +- Project [if ((length(v#6) <= 3)) v#6 else if ((length(rtrim(v#6, None)) > 3)) cast(raise_error(concat(input string of length , cast(length(v#6) as string),  exceeds varchar type length limitation: 3)) as string) else rpad(rtrim(v#6, None), 3,  ) AS v#14, i#7]
[info]               +- RelationV2[v#6, i#7, index#15, _partition#16] testcat.t1
[info]
[info]   Project [v#14, sum(i)#11L]
[info]   +- Sort [aggOrder#12 ASC NULLS FIRST], true
[info]      +- !Aggregate [v#14], [v#14, sum(cast(i#7 as bigint)) AS sum(i)#11L, v#13 AS aggOrder#12]
[info]         +- SubqueryAlias testcat.t1
[info]            +- Project [if ((length(v#6) <= 3)) v#6 else if ((length(rtrim(v#6, None)) > 3)) cast(raise_error(concat(input string of length , cast(length(v#6) as string),  exceeds varchar type length limitation: 3)) as string) else rpad(rtrim(v#6, None), 3,  ) AS v#14, i#7]
[info]               +- RelationV2[v#6, i#7, index#15, _partition#16] testcat.t1
```

### Why are the changes needed?

bugfix
### Does this PR introduce _any_ user-facing change?

no
### How was this patch tested?

new tests

Closes #31027 from yaooqinn/SPARK-34003.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-08 09:05:22 +00:00
Gengliang Wang b95a847ce1 [SPARK-34046][SQL][TESTS] Use join hint for constructing joins in JoinSuite and WholeStageCodegenSuite
### What changes were proposed in this pull request?

There are some existing test cases that construct various joins by tuning SQL configurations such as AUTO_BROADCASTJOIN_THRESHOLD, PREFER_SORTMERGEJOIN, SHUFFLE_PARTITIONS, etc.

This can be tricky and not straightforward. In future development we might have to tweak the configurations again.
This PR constructs specific joins by using join hints in test cases.
### Why are the changes needed?

Make test cases for join simpler and more robust.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test

Closes #31087 from gengliangwang/joinhintInTest.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-08 07:52:39 +00:00