Commit graph

29139 commits

Author SHA1 Message Date
gengjiaan 687f465244 [SPARK-33890][SQL] Improve the implement of trim/trimleft/trimright
### What changes were proposed in this pull request?
The current implement of trim/trimleft/trimright have somewhat redundant.

### Why are the changes needed?
Improve the implement of trim/trimleft/trimright

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
Jenkins test

Closes #30905 from beliefer/SPARK-33890.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-30 06:06:17 +00:00
angerszhu 49aa6ebef1 [SPARK-32684][SQL][TESTS] Add a test case to check if null value is same as Hive's '\\N' in script transformation
### What changes were proposed in this pull request?
In hive script transform serde mode, NULL format default is `\\N`
```
String nullString = tbl.getProperty(
    serdeConstants.SERIALIZATION_NULL_FORMAT, "\\N");
nullSequence = new Text(nullString);
```

I make a mistake that in Spark's code we need to fix and keep same with hive too.  So add some test case to show this issue.

### Why are the changes needed?
add UT

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #30946 from AngersZhuuuu/SPARK-32684.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-30 05:28:01 +00:00
Holden Karau 448494ebcf [SPARK-33874][K8S] Handle long lived sidecars
### What changes were proposed in this pull request?

For liveness check when checkAllContainers is not set, we check the liveness status of the Spark container if we can find it.

### Why are the changes needed?

Some environments may deploy long lived logs collecting side cars which outlive the Spark application. Just because they remain alive does not mean the Spark executor should keep running.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Extended the existing pod status tests.

Closes #30892 from holdenk/SPARK-33874-handle-long-lived-sidecars.

Lead-authored-by: Holden Karau <hkarau@apple.com>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-30 14:06:34 +09:00
Liang-Chi Hsieh 951afc3acc [SPARK-33932][SS] Clean up KafkaOffsetReader API document
### What changes were proposed in this pull request?

This patch cleans up KafkaOffsetReader API document.

### Why are the changes needed?

KafkaOffsetReader API documents are duplicated among KafkaOffsetReaderConsumer and KafkaOffsetReaderAdmin. It seems to be good if the doc is centralized.

This also adds missing API doc too.

### Does this PR introduce _any_ user-facing change?

No, dev only.

### How was this patch tested?

Doc only.

Closes #30961 from viirya/SPARK-33932.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-30 10:20:54 +09:00
Max Gekk 2b6836cdc2 [SPARK-33936][SQL] Add the version when connector's methods and interfaces were updated
### What changes were proposed in this pull request?
Add the `since` tag to methods and interfaces added recently.

### Why are the changes needed?
1. To follow the existing convention for Spark API.
2. To inform devs when Spark API was changed.

### Does this PR introduce _any_ user-facing change?
Should not.

### How was this patch tested?
`dev/scalastyle`

Closes #30966 from MaxGekk/spark-23889-interfaces-followup.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-29 12:26:25 -08:00
Yuming Wang c42502493a [SPARK-33847][SQL][FOLLOWUP] Remove the CaseWhen should consider deterministic
### What changes were proposed in this pull request?

This pr fix remove the `CaseWhen` if elseValue is empty and other outputs are null because of we should consider deterministic.

### Why are the changes needed?

Fix bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #30960 from wangyum/SPARK-33847-2.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-29 14:35:01 +00:00
Max Gekk 16c594de79 [SPARK-33859][SQL][FOLLOWUP] Add version to SupportsPartitionManagement.renamePartition()
### What changes were proposed in this pull request?
Add the version 3.2.0 to new method `renamePartition()` in the `SupportsPartitionManagement` interface.

### Why are the changes needed?
To inform Spark devs when the method appears in the interface.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
`./dev/scalastyle`

Closes #30964 from MaxGekk/alter-table-rename-partition-v2-followup.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-29 14:30:37 +00:00
angerszhu aadda4b561 [SPARK-33930][SQL] Script Transform default FIELD DELIMIT should be \u0001 for no serde
### What changes were proposed in this pull request?
For same SQL
```
SELECT TRANSFORM(a, b, c, null)
ROW FORMAT DELIMITED
USING 'cat'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '&'
FROM (select 1 as a, 2 as b, 3  as c) t
```
In hive:
```
hive> SELECT TRANSFORM(a, b, c, null)
    > ROW FORMAT DELIMITED
    > USING 'cat'
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY '&'
    > FROM (select 1 as a, 2 as b, 3  as c) t;
OK
123\N	NULL
Time taken: 14.519 seconds, Fetched: 1 row(s)
hive> packet_write_wait: Connection to 10.191.58.100 port 32200: Broken pipe
```

In Spark
```
Spark master: local[*], Application Id: local-1609225830376
spark-sql> SELECT TRANSFORM(a, b, c, null)
         > ROW FORMAT DELIMITED
         > USING 'cat'
         > ROW FORMAT DELIMITED
         > FIELDS TERMINATED BY '&'
         > FROM (select 1 as a, 2 as b, 3  as c) t;
1	2	3	null	NULL
Time taken: 4.297 seconds, Fetched 1 row(s)
spark-sql>
```
We should keep same. Change default ROW FORMAT FIELD DELIMIT to `\u0001`

In hive default value is '1' to char is '\u0001'
```
bucket_count -1
column.name.delimiter ,
columns
columns.comments
columns.types
file.inputformat org.apache.hadoop.hive.ql.io.NullRowsInputFormat
```

### Why are the changes needed?
Keep same behavior with hive

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #30958 from AngersZhuuuu/SPARK-33930.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-29 23:26:27 +09:00
Yuming Wang 872107f67f [SPARK-33848][SQL][FOLLOWUP] Introduce allowList for push into (if / case) branches
### What changes were proposed in this pull request?

Introduce allowList push into (if / case) branches to fix potential bug.

### Why are the changes needed?

 Fix potential bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing test.

Closes #30955 from wangyum/SPARK-33848-2.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-29 13:34:43 +00:00
ulysses-you 3b1b209e90 [SPARK-33909][SQL] Check rand functions seed is legal at analyer side
### What changes were proposed in this pull request?

Move seed is legal check to `CheckAnalysis`.

### Why are the changes needed?

It's better to check seed expression is legal at analyzer side instead of execution, and user can get exception as soon as possible.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Add test.

Closes #30923 from ulysses-you/SPARK-33909.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-29 13:33:06 +00:00
Max Gekk e0d2ffec31 [SPARK-33859][SQL] Support V2 ALTER TABLE .. RENAME PARTITION
### What changes were proposed in this pull request?
1. Add `renamePartition()` to the `SupportsPartitionManagement`
2. Implement `renamePartition()` in `InMemoryPartitionTable`
3. Add v2 execution node `AlterTableRenamePartitionExec`
4. Resolve the logical node `AlterTableRenamePartition` to `AlterTableRenamePartitionExec` for v2 tables that support `SupportsPartitionManagement`
5. Move v1 tests to the base suite `org.apache.spark.sql.execution.command.AlterTableRenamePartitionSuiteBase` to run them for v2 table catalogs.

### Why are the changes needed?
To have feature parity with Datasource V1.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
By running the unified tests:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRenamePartitionSuite"
```

Closes #30935 from MaxGekk/alter-table-rename-partition-v2.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-29 13:29:48 +00:00
yangjie01 9d6dbe0fe5 [SPARK-33775][FOLLOWUP][TEST-MAVEN][BUILD] Suppress maven compilation warnings in Scala 2.13
### What changes were proposed in this pull request?
This pr is followup of SPARK-33775, the main change of this pr this sync suppression rules from `SparkBuild.scala` to `pom.xml` to let maven build have the same suppression ability for compilation warnings in Scala 2.13

### Why are the changes needed?
Suppress unimportant compilation warnings in Scala 2.13 with maven build.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Pass the Jenkins or GitHub Action
- Local manual test:The suppressed compilation warnings are no longer printed to the console.

Closes #30951 from LuciferYang/SPARK-33775-FOLLOWUP.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-29 21:42:00 +09:00
Liang-Chi Hsieh f9fe742442 [SPARK-32968][SQL] Prune unnecessary columns from CsvToStructs
### What changes were proposed in this pull request?

This patch proposes to do column pruning for CsvToStructs expression if we only require some fields from it.

### Why are the changes needed?

`CsvToStructs` takes a schema parameter used to tell CSV Parser what fields are needed to parse. If `CsvToStructs` is followed by GetStructField. We can prune the schema to only parse certain field.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test

Closes #30912 from viirya/SPARK-32968.

Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-29 21:37:17 +09:00
Dongjoon Hyun 2627825647 [SPARK-33931][INFRA] Recover GitHub Action build_and_test job
### What changes were proposed in this pull request?

This PR aims to recover GitHub Action `build_and_test` job.

### Why are the changes needed?

Currently, `build_and_test` job fails to start because of  the following in master/branch-3.1 at least.
```
r-lib/actions/setup-rv1 is not allowed to be used in apache/spark.
Actions in this workflow must be: created by GitHub, verified in the GitHub Marketplace,
within a repository owned by apache or match the following:
adoptopenjdk/*, apache/*, gradle/wrapper-validation-action.
```
- https://github.com/apache/spark/actions/runs/449826457

![Screen Shot 2020-12-28 at 10 06 11 PM](https://user-images.githubusercontent.com/9700541/103262174-f1f13a80-4958-11eb-8ceb-631527155775.png)

### Does this PR introduce _any_ user-facing change?

No. This is a test infra.

### How was this patch tested?

To check GitHub Action `build_and_test` job on this PR.

Closes #30959 from dongjoon-hyun/SPARK-33931.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-29 20:51:57 +09:00
yi.wu 1ef7ddd38a [SPARK-33928][SPARK-23365][TEST][CORE] Fix flaky o.a.s.ExecutorAllocationManagerSuite - " Don't update target num executors when killing idle executors"
### What changes were proposed in this pull request?

Use the testing mode for the test to fix the flaky.

### Why are the changes needed?

The test is flaky:

```scala
[info] - SPARK-23365 Don't update target num executors when killing idle executors *** FAILED *** (126 milliseconds)
[info] 1 did not equal 2 (ExecutorAllocationManagerSuite.scala:1615)
[info] org.scalatest.exceptions.TestFailedException:
[info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530)
[info] at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529)
[info] at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
[info] at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503)
[info] at org.apache.spark.ExecutorAllocationManagerSuite.$anonfun$new$84(ExecutorAllocationManagerSuite.scala:1617)
...
```
The root cause should be the same as https://github.com/apache/spark/pull/29773 since the test run under non-testing mode.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually checked. Flaky is gone by running the test hundreds of times after this fix.

Closes #30956 from Ngone51/fix-flaky-SPARK-23365.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-29 07:35:45 +00:00
Yuming Wang f7bdea334a [SPARK-33884][SQL] Simplify CaseWhenclauses with (true and false) and (false and true)
### What changes were proposed in this pull request?

This pr simplify `CaseWhen`clauses with (true and false) and (false and true):

Expression | cond.nullable | After simplify
-- | -- | --
case when cond then true else false end | true | cond <=> true
case when cond then true else false end | false | cond
case when cond then false else true end | true | !(cond <=> true)
case when cond then false else true end | false | !cond

### Why are the changes needed?

Improve query performance.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #30898 from wangyum/SPARK-33884.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-29 07:09:11 +00:00
Max Gekk 379afcd2ce [SPARK-33924][SQL][TESTS] Preserve partition metadata by INSERT INTO in v2 table catalog
### What changes were proposed in this pull request?
For `InMemoryPartitionTable` used in tests, set empty partition metadata only when a partition doesn't exists.

### Why are the changes needed?
This bug fix is needed to use `INSERT INTO .. PARTITION` in other tests.

### Does this PR introduce _any_ user-facing change?
No. It affects only the v2 table catalog used in tests.

### How was this patch tested?
Added new UT to `DataSourceV2SQLSuite`, and run the affected test suite by:
```
$ build/sbt -Phive -Phive-thriftserver "test:testOnly org.apache.spark.sql.connector.DataSourceV2SQLSuite"
```

Closes #30952 from MaxGekk/fix-insert-into-partition-v2.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-29 06:49:26 +00:00
HyukjinKwon b33fa53385 [SPARK-33925][CORE] Remove unused SecurityManager in Utils.fetchFile
### What changes were proposed in this pull request?

This is kind of a followup of https://github.com/apache/spark/pull/24033.
The first and last usage of that argument `SecurityManager` was removed in https://github.com/apache/spark/pull/24033.
After that,  we don't need to pass `SecurityManager` anymore in `Utils.fetchFile` and related code paths.

This PR proposes to remove it out.

### Why are the changes needed?

For better readability of codes.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually complied. GitHub Actions and Jenkins build should test it out as well.

Closes #30945 from HyukjinKwon/SPARK-33925.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-28 16:58:42 -08:00
Wenchen Fan c2eac1de02 [SPARK-33845][SQL][FOLLOWUP] fix SimplifyConditionals
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/30849, to fix a correctness issue caused by null value handling.

### Why are the changes needed?

Fix a correctness issue. `If(null, true, false)` should return false, not true.

### Does this PR introduce _any_ user-facing change?

Yes, but the bug only exist in the master branch.

### How was this patch tested?

updated tests.

Closes #30953 from cloud-fan/bug.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-28 16:44:57 -08:00
Dongjoon Hyun 6497ccbbda [SPARK-33916][CORE] Fix fallback storage offset and improve compression codec test coverage
### What changes were proposed in this pull request?

This PR aims to fix offset bug and improve compression codec test coverage.

### Why are the changes needed?

When the user choose a non-default codec, it causes a failure.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the extended test suite.

Closes #30934 from dongjoon-hyun/SPARK-33916.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-28 16:33:01 -08:00
Max Gekk 0617dfce7b [SPARK-33899][SQL] Fix assert failure in v1 SHOW TABLES/VIEWS on spark_catalog
### What changes were proposed in this pull request?
Remove `assert(ns.nonEmpty)` in `ResolveSessionCatalog` for:
- `SHOW TABLES`
- `SHOW TABLE EXTENDED`
- `SHOW VIEWS`

### Why are the changes needed?
Spark SQL shouldn't fail with internal assert failures even for invalid user inputs. For instance:
```sql
spark-sql> show tables in spark_catalog;
20/12/24 11:19:46 ERROR SparkSQLDriver: Failed in [show tables in spark_catalog]
java.lang.AssertionError: assertion failed
	at scala.Predef$.assert(Predef.scala:208)
	at org.apache.spark.sql.catalyst.analysis.ResolveSessionCatalog$$anonfun$apply$1.applyOrElse(ResolveSessionCatalog.scala:366)
	at org.apache.spark.sql.catalyst.analysis.ResolveSessionCatalog$$anonfun$apply$1.applyOrElse(ResolveSessionCatalog.scala:49)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsUp$3(AnalysisHelper.scala:90)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
```

### Does this PR introduce _any_ user-facing change?
Yes. After the changes, for the example above:
```sql
spark-sql> show tables in spark_catalog;
Error in query: multi-part identifier cannot be empty.
```

### How was this patch tested?
Added new UT to `v1/ShowTablesSuite`.

Closes #30915 from MaxGekk/remove-assert-ns-nonempty.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-28 09:07:21 +00:00
angerszhu fc508d1898 [SPARK-32685][SQL] When specify serde, default filed.delim is '\t'
### What changes were proposed in this pull request?
In hive script transform, when we use specified serde, the `filed.delim` is '\t'
![image](https://user-images.githubusercontent.com/46485123/103187960-7dd77800-4901-11eb-8241-f4636e66fbc8.png)
And change to other serde and explain query plan, `filed.delim` is same.

In spark current code, the result is as below:
![image](https://user-images.githubusercontent.com/46485123/103187999-95aefc00-4901-11eb-9850-5c385000b78c.png)

We should keep same as hive.

Notic:
the result's NULL value is   different is another issue  https://issues.apache.org/jira/browse/SPARK-32684

### Why are the changes needed?
Keep same with hive serde

### Does this PR introduce _any_ user-facing change?
In script transform, is not specified,  `field.delim` keep same with hive as `\t`

### How was this patch tested?
UT added

Closes #30942 from AngersZhuuuu/SPARK-32685.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-28 08:23:01 +00:00
yi.wu 00fa49aeaa [SPARK-33923][SQL][TESTS] Fix some tests with AQE enabled
### What changes were proposed in this pull request?

* Remove the explicit AQE disable confs
* Use `AdaptiveSparkPlanHelper` to check plans
* No longer extending `DisableAdaptiveExecutionSuite` for `BucketedReadSuite` but only disable AQE for two certain tests there.

### Why are the changes needed?

Some tests that are fixed in https://github.com/apache/spark/pull/30655 doesn't really require AQE off.  Instead, they could use `AdaptiveSparkPlanHelper` to pass when AQE on. It's better to run tests with AQE on since we've turned it on by default.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass all tests and the updated tests.

Closes #30941 from Ngone51/SPARK-33680-follow-up.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-28 00:03:45 -08:00
Liang-Chi Hsieh c75f779fd7 [SPARK-33827][SS] Unload inactive state store as soon as possible
### What changes were proposed in this pull request?

This patch proposes to unload inactive state store as soon as possible. The timing of unload inactive state stores, happens when we get to load active state store provider at executors. At the time, state store coordinator will return back the state store provider list including loaded stores that are already loaded by other executors in new batch. Each state store provider in the list will go to unload.

### Why are the changes needed?

Per the discussion at #30770, it makes sense to me we should unload inactive state store asap. Now we run a maintenance task periodically to unload inactive state stores. So there will be some delays between a state store becomes inactive and it is unloaded.

However, we can force Spark to always allocate a state store to same executor, by using task locality configuration. This can reduce the possibility to have inactive state store.

Normally, with locality configuration, we might not able to see inactive state store generally. There is still chance an executor can be failed and reallocated, but in this case, inactive state store is also lost too. So it is not an issue.

Making driver-executor bi-directional for unloading inactive state store looks non-trivial, and seems to me, it is not worth, after considering what we can do with locality.

This proposes a simpler but effective approach. We can check if loaded state store is already loaded at other executor during reporting active state store to the coordinator. If so, it means the loaded store is inactive now, and it is going to be unload by the next maintenance task. Then we unload that store immediately.

How do we make sure the loaded state store in previous batch is loaded at other executor in this batch before reporting in this executor? With task locality and preferred location, once an executor is ready to be scheduled, Spark should assign the state store provider previously loaded at the executor. So when this executor gets a new assignment other than previously loaded state store, it means the previously loaded one is already assigned to other executor.

There is still a delay between the state store is loaded at other executor, and unloading it when reporting active state store at this executor. But it should be minimized now. And there won't be multiple state store belonging to same operator are loaded at the same time at one single executor, because once the executor reports any active store, it will unload all inactive stores. This should not be an issue IMHO.

This is a minimal change to unload inactive state store asap without significant change.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test.

Closes #30827 from viirya/SPARK-33827.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2020-12-28 16:52:56 +09:00
Max Gekk 4a61fc1a92 [SPARK-33914][SQL][DOCS] Describe the structure of unified DS v1 and v2 tests
### What changes were proposed in this pull request?
Add comments for the unified datasource tests, describe what kind of tests they contain, and put refs to other test suits.

### Why are the changes needed?
To improve code maintenance.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running `./dev/scalastyle`.

Closes #30929 from MaxGekk/doc-unified-tests.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-28 07:03:29 +00:00
angerszhu 0a3f3d609d [SPARK-33908][CORE] Refactor SparkSubmitUtils.resolveMavenCoordinates() 's return parameter
### What changes were proposed in this pull request?
Per discuss in  https://github.com/apache/spark/pull/29966#discussion_r531917374
We'd better change `SparkSubmitUtils.resolveMavenCoordinates()` 's return value as `Seq[String]`

### Why are the changes needed?
refactor code

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existed UT

Closes #30922 from AngersZhuuuu/SPARK-33908.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-28 16:00:24 +09:00
Kent Yao 3fdbc48373 [SPARK-33901][SQL] Fix Char and Varchar display error after DDLs
### What changes were proposed in this pull request?

After CTAS / CREATE TABLE LIKE / CVAS/ alter table add columns, the target tables will display string instead of char/varchar

### Why are the changes needed?

bugfix

### Does this PR introduce _any_ user-facing change?

no
### How was this patch tested?

new tests

Closes #30918 from yaooqinn/SPARK-33901.

Lead-authored-by: Kent Yao <yao@apache.org>
Co-authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-28 06:48:27 +00:00
yangjie01 1be9e7e40b [SPAKR-33801][CORE][SQL] Fix compilation warnings about 'Unicode escapes in triple quoted strings are deprecated'
### What changes were proposed in this pull request?
There are total 15 compilation warnings about `Unicode escapes in triple quoted strings are deprecated` in Spark code now:
```
[WARNING] /spark-source/core/src/main/scala/org/apache/spark/util/Utils.scala:2930: Unicode escapes in triple quoted strings are deprecated, use the literal character instead
[WARNING] /spark-source/core/src/main/scala/org/apache/spark/util/Utils.scala:2931: Unicode escapes in triple quoted strings are deprecated, use the literal character instead
[WARNING] /spark-source/core/src/main/scala/org/apache/spark/util/Utils.scala:2932: Unicode escapes in triple quoted strings are deprecated, use the literal character instead
[WARNING] /spark-source/core/src/main/scala/org/apache/spark/util/Utils.scala:2933: Unicode escapes in triple quoted strings are deprecated, use the literal character instead
[WARNING] /spark-source/core/src/main/scala/org/apache/spark/util/Utils.scala:2934: Unicode escapes in triple quoted strings are deprecated, use the literal character instead
[WARNING] /spark-source/core/src/main/scala/org/apache/spark/util/Utils.scala:2935: Unicode escapes in triple quoted strings are deprecated, use the literal character instead
[WARNING] /spark-source/core/src/main/scala/org/apache/spark/util/Utils.scala:2936: Unicode escapes in triple quoted strings are deprecated, use the literal character instead
[WARNING] /spark-source/core/src/main/scala/org/apache/spark/util/Utils.scala:2937: Unicode escapes in triple quoted strings are deprecated, use the literal character instead
[WARNING] /spark-source/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVExprUtils.scala:82: Unicode escapes in triple quoted strings are deprecated, use the literal character instead
[WARNING] /spark-source/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/CSVExprUtilsSuite.scala:32: Unicode escapes in triple quoted strings are deprecated, use the literal character instead
[WARNING] /spark-source/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/CSVExprUtilsSuite.scala:79: Unicode escapes in triple quoted strings are deprecated, use the literal character instead
[WARNING] /spark-source/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/ParserUtilsSuite.scala:97: Unicode escapes in triple quoted strings are deprecated, use the literal character instead
[WARNING] /spark-source/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/ParserUtilsSuite.scala:101: Unicode escapes in triple quoted strings are deprecated, use the literal character instead
[WARNING] /spark-source/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonParsingOptionsSuite.scala:76: Unicode escapes in triple quoted strings are deprecated, use the literal character instead
[WARNING] /spark-source/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonParsingOptionsSuite.scala:83: Unicode escapes in triple quoted strings are deprecated, use the literal character instead
```

This pr try to fix these warnnings.

### Why are the changes needed?
Cleanup compilation warnings about `Unicode escapes in triple quoted strings are deprecated`

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #30926 from LuciferYang/SPARK-33801.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-28 15:29:09 +09:00
Terry Kim fe33262c91 [SPARK-33918][SQL] UnresolvedView should retain SQL text position for DDL commands
### What changes were proposed in this pull request?

Currently, there are many DDL commands where the position of the unresolved identifiers are incorrect:
```
scala> sql("DROP VIEW unknown")
org.apache.spark.sql.AnalysisException: View not found: unknown; line 1 pos 0;
```
, whereas the `pos` should be `10`.

This PR proposes to fix this issue for commands using `UnresolvedTable`:
```
DROP VIEW v
ALTER VIEW v SET TBLPROPERTIES ('k'='v')
ALTER VIEW v UNSET TBLPROPERTIES ('k')
ALTER VIEW v AS SELECT 1
```

### Why are the changes needed?

To fix a bug.

### Does this PR introduce _any_ user-facing change?

Yes, now the above example will print the following:
```
org.apache.spark.sql.AnalysisException: View not found: unknown; line 1 pos 10;
```

### How was this patch tested?

Add a new suite of tests.

Closes #30936 from imback82/position_view_fix.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-28 05:45:40 +00:00
yangjie01 e6f019836c [SPARK-33532][SQL] Add comments to a unreachable branch in SpecificParquetRecordReaderBase.initialize method
### What changes were proposed in this pull request?
This pr mainly adds a comment for the 'rowgroupoffsets! = null' branch in `SpecificParquetRecordReaderBase.init(InputSplit, TaskAttemptContext)` to indicate that spark read parquet process will not enter this branch after SPARK-13883 and SPARK-13989.  It is not deleted because PARQUET-131 wants to move `SpecificParquetRecordReaderBase` into the parquet-mr project.

### Why are the changes needed?
Add a useful comment.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #30484 from LuciferYang/SPARK-33532.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-28 14:07:50 +09:00
Gabor Somogyi 678294ddc2 [SPARK-33824][PYTHON][DOCS][FOLLOW-UP] Clarify about PYSPARK_DRIVER_PYTHON and spark.yarn.appMasterEnv.PYSPARK_PYTHON
### What changes were proposed in this pull request?

This PR proposes to clarify:
- `PYSPARK_DRIVER_PYTHON` should not be set for cluster modes in YARN and Kubernates.
- `spark.yarn.appMasterEnv.PYSPARK_PYTHON` is not required in YARN. This is just another way to set `PYSPARK_PYTHON` that is specific for a Spark application.

### Why are the changes needed?

To clarify what's required and not.

### Does this PR introduce _any_ user-facing change?

Yes, this is a user-facing doc change.

### How was this patch tested?

Manually tested.

Note that this credits to gaborgsomogyi who actually tested and raised a doubt about this offline to me.
I also manually tested all again to double check.

Closes #30938 from HyukjinKwon/SPARK-33824-followup.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-28 09:52:42 +09:00
Max Gekk b2dfeae18b [SPARK-33911][SQL][DOCS] Update the SQL migration guide about changes in HiveClientImpl
### What changes were proposed in this pull request?
Update the SQL migration guide about the changes made by:
- https://github.com/apache/spark/pull/30778
- https://github.com/apache/spark/pull/30711
- https://github.com/apache/spark/pull/30866

### Why are the changes needed?
To inform users about the recent changes in the upcoming releases.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
N/A

Closes #30925 from MaxGekk/sql-migr-guide-hiveclientimpl.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-27 17:57:42 +09:00
yangjie01 37ae0a6086 [SPARK-33560][TEST-MAVEN][BUILD] Add "unused-import" check to Maven compilation process
### What changes were proposed in this pull request?

Similar to SPARK-33441, this pr add `unused-import` check to Maven compilation process. After this pr `unused-import` will trigger Maven compilation error.

For Scala 2.13 profile, this pr also left TODO(SPARK-33499) similar to SPARK-33441 because `scala.language.higherKinds` no longer needs to be imported explicitly since Scala 2.13.1

### Why are the changes needed?
Let Maven build also check for unused imports as compilation error.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Pass the Jenkins or GitHub Action

- Local manual test:add an unused import intentionally to trigger maven compilation error.

Closes #30784 from LuciferYang/SPARK-33560.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-12-26 17:40:19 -06:00
kozakana 2553d53dc8 [SPARK-33897][SQL] Can't set option 'cross' in join method
### What changes were proposed in this pull request?

[The PySpark documentation](https://spark.apache.org/docs/3.0.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.join) says "Must be one of: inner, cross, outer, full, fullouter, full_outer, left, leftouter, left_outer, right, rightouter, right_outer, semi, leftsemi, left_semi, anti, leftanti and left_anti."
However, I get the following error when I set the cross option.

```
scala> val df1 = spark.createDataFrame(Seq((1,"a"),(2,"b")))
df1: org.apache.spark.sql.DataFrame = [_1: int, _2: string]

scala> val df2 = spark.createDataFrame(Seq((1,"A"),(2,"B"), (3, "C")))
df2: org.apache.spark.sql.DataFrame = [_1: int, _2: string]

scala> df1.join(right = df2, usingColumns = Seq("_1"), joinType = "cross").show()
java.lang.IllegalArgumentException: requirement failed: Unsupported using join type Cross
  at scala.Predef$.require(Predef.scala:281)
  at org.apache.spark.sql.catalyst.plans.UsingJoin.<init>(joinTypes.scala:106)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:1025)
  ... 53 elided
```

### Why are the changes needed?

The documentation says cross option can be set, but when I try to set it, I get an java.lang.IllegalArgumentException.

### Does this PR introduce _any_ user-facing change?

Accepting this PR fix will behave the same as the documentation.

### How was this patch tested?

There is already a test for [JoinTypes](1b9fd67904/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/JoinTypesTest.scala), but I can't find a test for the join option itself.

Closes #30803 from kozakana/allow_cross_option.

Authored-by: kozakana <goki727@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-26 16:30:50 +09:00
angerszhu 10b6466e91 [SPARK-33084][CORE][SQL] Add jar support ivy path
### What changes were proposed in this pull request?
Support add jar with ivy path

### Why are the changes needed?
Since submit app can support ivy, add jar we can also support ivy now.

### Does this PR introduce _any_ user-facing change?
User can add jar with sql like
```
add jar ivy:://group:artifict:version?exclude=xxx,xxx&transitive=true
add jar ivy:://group:artifict:version?exclude=xxx,xxx&transitive=false
```

core api
```
sparkContext.addJar("ivy:://group:artifict:version?exclude=xxx,xxx&transitive=true")
sparkContext.addJar("ivy:://group:artifict:version?exclude=xxx,xxx&transitive=false")
```

#### Doc Update snapshot
![image](https://user-images.githubusercontent.com/46485123/101227738-de451200-36d3-11eb-813d-78a8b879da4f.png)

### How was this patch tested?
Added UT

Closes #29966 from AngersZhuuuu/support-add-jar-ivy.

Lead-authored-by: angerszhu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-12-25 09:07:48 +09:00
Takeshi Yamamuro 65a9ac2ff4 [SPARK-30027][SQL] Support codegen for aggregate filters in HashAggregateExec
### What changes were proposed in this pull request?

This pr intends to support code generation for `HashAggregateExec` with filters.

Quick benchmark results:
```
$ ./bin/spark-shell --master=local[1] --conf spark.driver.memory=8g --conf spark.sql.shuffle.partitions=1 -v

scala> spark.range(100000000).selectExpr("id % 3 as k1", "id % 5 as k2", "rand() as v1", "rand() as v2").write.saveAsTable("t")
scala> sql("SELECT k1, k2, AVG(v1) FILTER (WHERE v2 > 0.5) FROM t GROUP BY k1, k2").write.format("noop").mode("overwrite").save()

>> Before this PR
Elapsed time: 16.170697619s

>> After this PR
Elapsed time: 6.7825313s
```

The query above is compiled into code below;

```
...
/* 285 */   private void agg_doAggregate_avg_0(boolean agg_exprIsNull_2_0, org.apache.spark.sql.catalyst.InternalRow agg_unsafeRowAggBuffer_0, double agg_expr_2_0) throws java.io.IOException {
/* 286 */     // evaluate aggregate function for avg
/* 287 */     boolean agg_isNull_10 = true;
/* 288 */     double agg_value_12 = -1.0;
/* 289 */     boolean agg_isNull_11 = agg_unsafeRowAggBuffer_0.isNullAt(0);
/* 290 */     double agg_value_13 = agg_isNull_11 ?
/* 291 */     -1.0 : (agg_unsafeRowAggBuffer_0.getDouble(0));
/* 292 */     if (!agg_isNull_11) {
/* 293 */       agg_agg_isNull_12_0 = true;
/* 294 */       double agg_value_14 = -1.0;
/* 295 */       do {
/* 296 */         if (!agg_exprIsNull_2_0) {
/* 297 */           agg_agg_isNull_12_0 = false;
/* 298 */           agg_value_14 = agg_expr_2_0;
/* 299 */           continue;
/* 300 */         }
/* 301 */
/* 302 */         if (!false) {
/* 303 */           agg_agg_isNull_12_0 = false;
/* 304 */           agg_value_14 = 0.0D;
/* 305 */           continue;
/* 306 */         }
/* 307 */
/* 308 */       } while (false);
/* 309 */
/* 310 */       agg_isNull_10 = false; // resultCode could change nullability.
/* 311 */
/* 312 */       agg_value_12 = agg_value_13 + agg_value_14;
/* 313 */
/* 314 */     }
/* 315 */     boolean agg_isNull_15 = false;
/* 316 */     long agg_value_17 = -1L;
/* 317 */     if (!false && agg_exprIsNull_2_0) {
/* 318 */       boolean agg_isNull_18 = agg_unsafeRowAggBuffer_0.isNullAt(1);
/* 319 */       long agg_value_20 = agg_isNull_18 ?
/* 320 */       -1L : (agg_unsafeRowAggBuffer_0.getLong(1));
/* 321 */       agg_isNull_15 = agg_isNull_18;
/* 322 */       agg_value_17 = agg_value_20;
/* 323 */     } else {
/* 324 */       boolean agg_isNull_19 = true;
/* 325 */       long agg_value_21 = -1L;
/* 326 */       boolean agg_isNull_20 = agg_unsafeRowAggBuffer_0.isNullAt(1);
/* 327 */       long agg_value_22 = agg_isNull_20 ?
/* 328 */       -1L : (agg_unsafeRowAggBuffer_0.getLong(1));
/* 329 */       if (!agg_isNull_20) {
/* 330 */         agg_isNull_19 = false; // resultCode could change nullability.
/* 331 */
/* 332 */         agg_value_21 = agg_value_22 + 1L;
/* 333 */
/* 334 */       }
/* 335 */       agg_isNull_15 = agg_isNull_19;
/* 336 */       agg_value_17 = agg_value_21;
/* 337 */     }
/* 338 */     // update unsafe row buffer
/* 339 */     if (!agg_isNull_10) {
/* 340 */       agg_unsafeRowAggBuffer_0.setDouble(0, agg_value_12);
/* 341 */     } else {
/* 342 */       agg_unsafeRowAggBuffer_0.setNullAt(0);
/* 343 */     }
/* 344 */
/* 345 */     if (!agg_isNull_15) {
/* 346 */       agg_unsafeRowAggBuffer_0.setLong(1, agg_value_17);
/* 347 */     } else {
/* 348 */       agg_unsafeRowAggBuffer_0.setNullAt(1);
/* 349 */     }
/* 350 */   }
...
```

### Why are the changes needed?

For high performance.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #27019 from maropu/AggregateFilterCodegen.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-24 14:44:16 -08:00
ulysses-you 9c30116fb4 [SPARK-33857][SQL] Unify the default seed of random functions
### What changes were proposed in this pull request?

Unify the seed of random functions
1. Add a hold place expression `UnresolvedSeed ` as the defualt seed.
2. Change `Rand`,`Randn`,`Uuid`,`Shuffle` default seed to `UnresolvedSeed `.
3. Replace `UnresolvedSeed ` to real seed at `ResolveRandomSeed` rule.

### Why are the changes needed?

`Uuid` and `Shuffle` use the `ResolveRandomSeed` rule to set the seed if user doesn't give a seed value. `Rand` and `Randn` do this at constructing.

It's better to unify the default seed at Analyzer side since we have used `ExpressionWithRandomSeed` at streaming query.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass exists test and add test.

Closes #30864 from ulysses-you/SPARK-33857.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-24 14:30:34 -08:00
sychen 700f5ab65c [SPARK-33900][WEBUI] Show shuffle read size / records correctly when only remotebytesread is available
### What changes were proposed in this pull request?
Shuffle Read Size / Records can also be displayed in remoteBytesRead>0 localBytesRead=0.

current:
![image](https://user-images.githubusercontent.com/3898450/103079421-c4ca2280-460e-11eb-9e2f-49d35b5d324d.png)
fix:
![image](https://user-images.githubusercontent.com/3898450/103079439-cc89c700-460e-11eb-9a41-6b2882980d11.png)

### Why are the changes needed?
At present, the page only displays the data of Shuffle Read Size / Records when localBytesRead>0.
When there is only remote reading, metrics cannot be seen on the stage page.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
manual test

Closes #30916 from cxzl25/SPARK-33900.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2020-12-25 00:54:26 +09:00
Kent Yao 29cca68e9e [SPARK-33892][SQL] Display char/varchar in DESC and SHOW CREATE TABLE
### What changes were proposed in this pull request?

Display char/varchar in
  - DESC table
  - DESC column
  - SHOW CREATE TABLE

### Why are the changes needed?

show the correct definition for users

### Does this PR introduce _any_ user-facing change?

yes, char/varchar  column's will print char/varchar instead of string
### How was this patch tested?

new tests

Closes #30908 from yaooqinn/SPARK-33892.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-24 08:56:02 +00:00
Max Gekk 54a67842e6 [SPARK-33881][SQL][TESTS] Check null and empty string as partition values in DS v1 and v2 tests
### What changes were proposed in this pull request?
Add tests to check handling `null` and `''` (empty string) as partition values in commands `SHOW PARTITIONS`, `ALTER TABLE .. ADD PARTITION`, `ALTER TABLE .. DROP PARTITION`.

### Why are the changes needed?
To improve test coverage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the modified test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.ShowPartitionsSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableAddPartitionSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableDropPartitionSuite"
```

Closes #30893 from MaxGekk/partition-value-empty-string.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-24 08:54:53 +00:00
gengjiaan 3e9821edfd [SPARK-33443][SQL] LEAD/LAG should support [ IGNORE NULLS | RESPECT NULLS ]
### What changes were proposed in this pull request?
The mainstream database support `[ IGNORE NULLS | RESPECT NULLS ]` for `LEAD`/`LAG`/`NTH_VALUE`/`FIRST_VALUE`/`LAST_VALUE`.
But the current implement of `LEAD`/`LAG` don't support this syntax.

**Oracle**
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/LEAD.html#GUID-0A0481F1-E98F-4535-A739-FCCA8D1B5B77

**Presto**
https://prestodb.io/docs/current/functions/window.html

**Redshift**
https://docs.aws.amazon.com/redshift/latest/dg/r_WF_LEAD.html

**DB2**
https://www.ibm.com/support/knowledgecenter/SSGU8G_14.1.0/com.ibm.sqls.doc/ids_sqs_1513.htm

**Teradata**
https://docs.teradata.com/r/756LNiPSFdY~4JcCCcR5Cw/GjCT6l7trjkIEjt~7Dhx4w

**Snowflake**
https://docs.snowflake.com/en/sql-reference/functions/lead.html
https://docs.snowflake.com/en/sql-reference/functions/lag.html

### Why are the changes needed?
Support `[ IGNORE NULLS | RESPECT NULLS ]` for `LEAD`/`LAG` is very useful.

### Does this PR introduce _any_ user-facing change?
'Yes'.

### How was this patch tested?
Jenkins test.

Closes #30387 from beliefer/SPARK-33443.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-24 08:13:48 +00:00
Yuming Wang 32d4a2b062 [SPARK-33861][SQL] Simplify conditional in predicate
### What changes were proposed in this pull request?

This pr simplify conditional in predicate, after this change we can push down the filter to datasource:

Expression | After simplify
-- | --
IF(cond, trueVal, false)                   | AND(cond, trueVal)
IF(cond, trueVal, true)                    | OR(NOT(cond), trueVal)
IF(cond, false, falseVal)                  | AND(NOT(cond), elseVal)
IF(cond, true, falseVal)                   | OR(cond, elseVal)
CASE WHEN cond THEN trueVal ELSE false END | AND(cond, trueVal)
CASE WHEN cond THEN trueVal END            | AND(cond, trueVal)
CASE WHEN cond THEN trueVal ELSE null END  | AND(cond, trueVal)
CASE WHEN cond THEN trueVal ELSE true END  | OR(NOT(cond), trueVal)
CASE WHEN cond THEN false ELSE elseVal END | AND(NOT(cond), elseVal)
CASE WHEN cond THEN false END              | false
CASE WHEN cond THEN true ELSE elseVal END  | OR(cond, elseVal)
CASE WHEN cond THEN true END               | cond

### Why are the changes needed?

Improve query performance.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #30865 from wangyum/SPARK-33861.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-24 08:10:28 +00:00
Kent Yao d7dc42d5f6 [SPARK-33895][SQL] Char and Varchar fail in MetaOperation of ThriftServer
### What changes were proposed in this pull request?

```
Caused by: java.lang.IllegalArgumentException: Unrecognized type name: CHAR(10)
	at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.toJavaSQLType(SparkGetColumnsOperation.scala:187)
	at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$addToRowSet$1(SparkGetColumnsOperation.scala:203)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.addToRowSet(SparkGetColumnsOperation.scala:195)
	at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$runInternal$4(SparkGetColumnsOperation.scala:99)
	at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$runInternal$4$adapted(SparkGetColumnsOperation.scala:98)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
```

meta operation is targeting raw table schema, we need to handle these types there.

### Why are the changes needed?

bugfix, see the above case
### Does this PR introduce _any_ user-facing change?

no
### How was this patch tested?

new tests

locally

![image](https://user-images.githubusercontent.com/8326978/103069196-cdfcc480-45f9-11eb-9c6a-d4c42123c6e3.png)

Closes #30914 from yaooqinn/SPARK-33895.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-24 07:40:38 +00:00
Terry Kim f1d3797291 [SPARK-33886][SQL] UnresolvedTable should retain SQL text position for DDL commands
### What changes were proposed in this pull request?

Currently, there are many DDL commands where the position of the unresolved identifiers are incorrect:
```
scala> sql("MSCK REPAIR TABLE unknown")
org.apache.spark.sql.AnalysisException: Table not found: unknown; line 1 pos 0;
```
, whereas the `pos` should be 18.

This PR proposes to fix this issue for commands using `UnresolvedTable`:
```
MSCK REPAIR TABLE t
LOAD DATA LOCAL INPATH 'filepath' INTO TABLE t
TRUNCATE TABLE t
SHOW PARTITIONS t
ALTER TABLE t RECOVER PARTITIONS
ALTER TABLE t ADD PARTITION (p=1)
ALTER TABLE t PARTITION (p=1) RENAME TO PARTITION (p=2)
ALTER TABLE t DROP PARTITION (p=1)
ALTER TABLE t SET SERDEPROPERTIES ('a'='b')
COMMENT ON TABLE t IS 'hello'"
```

### Why are the changes needed?

To fix a bug.

### Does this PR introduce _any_ user-facing change?

Yes, now the above example will print the following:
```
org.apache.spark.sql.AnalysisException: Table not found: unknown; line 1 pos 18;
```

### How was this patch tested?

Add a new suite of tests.

Closes #30900 from imback82/position_Fix.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-24 05:21:39 +00:00
Yuanjian Li 86c1cfc579 [SPARK-33659][SS] Document the current behavior for DataStreamWriter.toTable API
### What changes were proposed in this pull request?
Follow up work for #30521, document the following behaviors in the API doc:

- Figure out the effects when configurations are (provider/partitionBy) conflicting with the existing table.
- Document the lack of functionality on creating a v2 table, and guide that the users should ensure a table is created in prior to avoid the behavior unintended/insufficient table is being created.

### Why are the changes needed?
We didn't have full support for the V2 table created in the API now. (TODO SPARK-33638)

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Document only.

Closes #30885 from xuanyuanking/SPARK-33659.

Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-24 12:44:37 +09:00
offthewall123 61881bb698 [SPARK-33835][CORE] Refector AbstractCommandBuilder.buildJavaCommand: use firstNonEmpty
### What changes were proposed in this pull request?
refector AbstractCommandBuilder.buildJavaCommand: use firstNonEmpty

### Why are the changes needed?
For better code understanding, and firstNonEmpty can detect javaHome = "   ", an empty string.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
End to End.

Closes #30831 from offthewall123/refector_AbstractCommandBuilder.

Authored-by: offthewall123 <dingyu.xu@intel.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-12-23 20:01:53 -06:00
Kent Yao 368a2c341d [SPARK-33877][SQL][FOLLOWUP] SQL reference documents for INSERT w/ a column list
### What changes were proposed in this pull request?

followup of a3dd8dacee via suggestion https://github.com/apache/spark/pull/30888#discussion_r547822642
### Why are the changes needed?

doc improvement
### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

passing GA doc

Closes #30909 from yaooqinn/SPARK-33877-F.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-23 15:38:32 -08:00
Dongjoon Hyun d467d81726 [SPARK-33893][CORE] Exclude fallback block manager from executorList
### What changes were proposed in this pull request?

This PR aims to exclude fallback block manager from `executorList` function.

### Why are the changes needed?

When a fallback storage is used, the executors UI tab hangs because the executor list REST API result doesn't have `peakMemoryMetrics` of `ExecutorMetrics`. The root cause is that the block manager id used by fallback storage is included in the API result and it doesn't have `peakMemoryMetrics` because it's populated during HeartBeat reporting. We should hide it.

### Does this PR introduce _any_ user-facing change?

No. This is a bug fix on UI.

### How was this patch tested?

Manual. Run the following and visit Spark `executors` tab UI with browser.
```
bin/spark-shell -c spark.storage.decommission.fallbackStorage.path=file:///tmp/spark-storage/
```

Closes #30911 from dongjoon-hyun/SPARK-33893.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-23 15:31:56 -08:00
Takuya UESHIN 5c9b421c37 [SPARK-33277][PYSPARK][SQL] Use ContextAwareIterator to stop consuming after the task ends
### What changes were proposed in this pull request?

This is a retry of #30177.

This is not a complete fix, but it would take long time to complete (#30242).
As discussed offline, at least using `ContextAwareIterator` should be helpful enough for many cases.

As the Python evaluation consumes the parent iterator in a separate thread, it could consume more data from the parent even after the task ends and the parent is closed. Thus, we should use `ContextAwareIterator` to stop consuming after the task ends.

### Why are the changes needed?

Python/Pandas UDF right after off-heap vectorized reader could cause executor crash.

E.g.,:

```py
spark.range(0, 100000, 1, 1).write.parquet(path)

spark.conf.set("spark.sql.columnVector.offheap.enabled", True)

def f(x):
    return 0

fUdf = udf(f, LongType())

spark.read.parquet(path).select(fUdf('id')).head()
```

This is because, the Python evaluation consumes the parent iterator in a separate thread and it consumes more data from the parent even after the task ends and the parent is closed. If an off-heap column vector exists in the parent iterator, it could cause segmentation fault which crashes the executor.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added tests, and manually.

Closes #30899 from ueshin/issues/SPARK-33277/context_aware_iterator.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-23 14:48:01 -08:00
Chandni Singh 0677c39009 [SPARK-32916][SHUFFLE][TEST-MAVEN][TEST-HADOOP2.7] Ensure the number of chunks in meta file and index file are equal
### What changes were proposed in this pull request?
1. Fixes for bugs in `RemoteBlockPushResolver` where the number of chunks in meta file and index file are inconsistent due to exceptions while writing to either index file or meta file. This java class was introduced in https://github.com/apache/spark/pull/30062.
 - If the writing to index file fails, the position of meta file is not reset. This means that the number of chunks in meta file is inconsistent with index file.
- During the exception handling while writing to index/meta file, we just set the pointer to the start position. If the files are closed just after this then it doesn't get rid of any the extra bytes written to it.
2. Adds an IOException threshold. If the `RemoteBlockPushResolver` encounters IOExceptions greater than this threshold  while updating data/meta/index file of a shuffle partition, then it responds to the client with  exception- `IOExceptions exceeded the threshold` so that client can stop pushing data for this shuffle partition.
3. When the update to metadata fails, exception is not propagated back to the client. This results in the increased size of the current chunk. However, with (2) in place, the current chunk will still be of a manageable size.

### Why are the changes needed?
This fix is needed for the bugs mentioned above.
1. Moved writing to meta file after index file. This fixes the issue because if there is an exception writing to meta file, then the index file position is not updated. With this change, if there is an exception writing to index file, then none of the files are effectively updated and the same is true vice-versa.
2. Truncating the lengths of data/index/meta files when the partition is finalized.
3. When the IOExceptions have reached the threshold, it is most likely that future blocks will also face the issue. So, it is better to let the clients know so that they can stop pushing the blocks for that partition.
4. When just the meta update fails, client retries pushing the block which was successfully merged to data file. This can be avoided by letting the chunk grow slightly.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added unit tests for all the bugs and threshold.

Closes #30433 from otterc/SPARK-32916-followup.

Authored-by: Chandni Singh <singh.chandni@gmail.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
2020-12-23 12:42:18 -06:00