Commit graph

29711 commits

Author SHA1 Message Date
allisonwang-db 76a7fca4e1 [SPARK-34335][SQL] Support referencing subquery with column aliases by table alias
### What changes were proposed in this pull request?
This PR adds support for referencing subquery with column aliases by its table alias.

Before
```sql
-- AnalysisException: cannot resolve '`t.c1`' given input columns: [c1, c2];
SELECT t.c1, t.c2 FROM (SELECT 1 AS a, 1 AS b) t(c1, c2)
```

After:
```sql
-- [(1, 1)]
SELECT t.c1, t.c2 FROM (SELECT 1 AS a, 1 AS b) t(c1, c2)
```

### Why are the changes needed?
To allow users to reference subquery with column aliases by its table alias.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
Added parser tests and SQL query tests.

Closes #31444 from allisonwang-db/spark-34335.

Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-03 08:51:28 +00:00
Erik Krogen 791de00ddf [SPARK-34182][AVRO] Improve error messages when matching Catalyst-to-Avro schemas
### What changes were proposed in this pull request?
Improve the error messages for incompatibilities between Avro and Catalyst schemas. First, make `AvroSerializer` more similar to `AvroDeserializer` in printing out contextual information such as hierarchical field names. Standardize exception messages in both serializer and deserializer to always include such contextual information, and include a top-level exception which shows the full schemas which were being parsed when the incompatibility was found. Both now print out the hierarchical name for both the Avro and Catalyst fields, since they may be different due to case sensitivity and Avro union handling.

### Why are the changes needed?
The error messages in this type of failure scenario are very lacking in information on the write path (`AvroSerializer`). Below are two examples of messages that provide insufficient information to determine what went wrong (lacking in field names, context about the overall schema structure, etc.).
```
org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Catalyst type IntegerType to Avro type "float".

org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Catalyst type StructType(StructField(bar,IntegerType,true)) to Avro type {"type":"record","name":"test","fields":[{"name":"NOTbar","type":["null","int"],"default":null}]}.
```
The error messages currently existing in `AvroDeserializer` are much better, but still not very internally consistent, and it would be better if they were consistent with the newly added exception messages in `AvroSerializer`.

### Does this PR introduce _any_ user-facing change?
Error messages when there are incompatibilities between Avro and Catalyst schemas will be greatly improved on when writing Avro data using the `avroSchema` option, a little bit improved when reading Avro data, and much more consistent between the two.

Below is an example of a new message. See `AvroSerdeSuite` for more examples.
```
org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Catalyst type STRUCT<`foo`: STRUCT<`bar`: INT>> to Avro type {"type":"record","name":"top","fields":[{"name":"foo","type":"int"}]}
	at org.apache.spark.sql.avro.AvroSerializer.liftedTree1$1(AvroSerializer.scala:83)
...
Caused by: org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Catalyst field 'foo' to Avro field 'foo' because schema is incompatible (sqlType = STRUCT<`bar`: INT>, avroType = "int")
	at org.apache.spark.sql.avro.AvroSerializer.newConverter(AvroSerializer.scala:230)
...
```

### How was this patch tested?
New unit test suite, `AvroSerdeSuite`, was added to test corner cases on `AvroSerializer` and `AvroDeserializer` and verify that the exception messages are as expected. Existing tests in `AvroSuite` also continue to pass, with modifications in places where assertions were made about the exceptions that would be thrown.

Closes #31333 from xkrogen/xkrogen-SPARK-34182-avro-serde-errormessages.

Authored-by: Erik Krogen <xkrogen@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-03 08:49:36 +00:00
Prashant Sharma 89bf2afb33 [SPARK-34327][BUILD] Strip passwords from inlining into build information while releasing
### What changes were proposed in this pull request?

Strip passwords from getting inlined into build information, inadvertently.

` https://user:passdomain/foo -> https://domain/foo`

### Why are the changes needed?
This can be a serious security issue, esp. during a release.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tested by executing the following command on both Mac OSX and Ubuntu.

```
echo url=$(git config --get remote.origin.url |  sed 's|https://\(.*\)\(.*\)|https://\2|')
```

Closes #31436 from ScrapCodes/strip_pass.

Authored-by: Prashant Sharma <prashsh1@in.ibm.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-03 15:02:35 +09:00
Terry Kim a1d4bb3300 [SPARK-34313][SQL] Migrate ALTER TABLE SET/UNSET TBLPROPERTIES commands to use UnresolvedTable to resolve the identifier
### What changes were proposed in this pull request?

This PR proposes to migrate `ALTER TABLE ... SET/UNSET TBLPROPERTIES` to use `UnresolvedTable` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).

### Why are the changes needed?

This is a part of effort to make the relation lookup behavior consistent: [SPARK-29900](https://issues.apache.org/jira/browse/SPARK-29900).

### Does this PR introduce _any_ user-facing change?

After this PR, `ALTER TABLE SET/UNSET TBLPROPERTIES` will have a consistent resolution behavior.

### How was this patch tested?

Updated existing tests / added new tests.

Closes #31422 from imback82/v2_alter_table_set_unset_properties.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-03 05:44:58 +00:00
Ruifeng Zheng fc80a5b877 [SPARK-34307][SQL] TakeOrderedAndProjectExec avoid shuffle if input rdd has single partition
### What changes were proposed in this pull request?
when the child rdd has only one partition, skip the shuffle

### Why are the changes needed?
avoid unnecessary shuffle

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
existing testsuites

Closes #31409 from zhengruifeng/limit_with_single_partition.

Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-03 04:49:08 +00:00
HyukjinKwon e927bf90e0 Revert "[SPARK-34326][CORE][SQL] Fix UTs added in SPARK-31793 depending on the length of temp path"
This reverts commit 63866025d2.
2021-02-03 12:32:39 +09:00
Kousuke Saruta 603a7fd7b6 [SPARK-34308][SQL] Escape meta-characters in printSchema
### What changes were proposed in this pull request?

Similar to SPARK-33690, this PR improves the output layout of `printSchema` for the case column names contain meta characters.
Here is an example.

Before:
```
scala> val df1 = spark.sql("SELECT 'aaa\nbbb\tccc\rddd\feee\bfff\u000Bggg\u0007hhh'")
scala> df1.printSchema
root
 |-- aaa
ddd	ccc
   eefff
        ggghhh: string (nullable = false)
```

After:
```
scala> df1.printSchema
root
 |-- aaa\nbbb\tccc\rddd\feee\bfff\vggg\ahhh: string (nullable = false)
```

### Why are the changes needed?

To avoid breaking the layout of `Dataset#printSchema`

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New test.

Closes #31412 from sarutak/escape-meta-printSchema.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-03 11:06:41 +09:00
offthewall123 60c71c6d2d [SPARK-34325][CORE] Remove unused shuffleBlockResolver variable inSortShuffleWriter
### What changes were proposed in this pull request?
Remove unused shuffleBlockResolver variable in SortShuffleWriter.

### Why are the changes needed?
For better code understanding.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
End to End.

Closes #31433 from offthewall123/remove_shuffleBlockResolver_in_SortShuffleWriter.

Authored-by: offthewall123 <dingyu.xu@intel.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-03 09:40:08 +09:00
Wenchen Fan 00120ea537 [SPARK-34212][SQL][FOLLOWUP] Parquet vectorized reader can read decimal fields with a larger precision
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/31357

#31357 added a very strong restriction to the vectorized parquet reader, that the spark data type must exactly match the physical parquet type, when reading decimal fields. This restriction is actually not necessary, as we can safely read parquet decimals with a larger precision. This PR releases this restriction a little bit.

### Why are the changes needed?

To not fail queries unnecessarily.

### Does this PR introduce _any_ user-facing change?

Yes, now users can read parquet decimals with mismatched `DecimalType` as long as the scale is the same and precision is larger.

### How was this patch tested?

updated test.

Closes #31443 from cloud-fan/improve.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-03 09:26:36 +09:00
Jungtaek Lim (HeartSaVioR) 63866025d2 [SPARK-34326][CORE][SQL] Fix UTs added in SPARK-31793 depending on the length of temp path
### What changes were proposed in this pull request?

This PR proposes to fix the UTs being added in SPARK-31793, so that all things contributing the length limit are properly accounted.

### Why are the changes needed?

The test `DataSourceScanExecRedactionSuite.SPARK-31793: FileSourceScanExec metadata should contain limited file paths` is failing conditionally, depending on the length of the temp directory.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Modified UTs explain the missing points, which also do the test.

Closes #31435 from HeartSaVioR/SPARK-34326.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-02-03 07:35:22 +09:00
Liang-Chi Hsieh cadca8d352 [SPARK-34324][SQL] FileTable should not list TRUNCATE in capabilities by default
### What changes were proposed in this pull request?

This patch proposes to remove `TRUNCATE` from the default `capabilities` list from `FileTable`.

### Why are the changes needed?

The abstract class `FileTable` now lists `TRUNCATE` in its `capabilities`, but `FileTable` does not know if an implementation really supports truncation or not. Specifically, we can check existing `FileTable` implementations including `AvroTable`, `CSVTable`, `JsonTable`, etc. No one implementation really implements `SupportsTruncate` in its writer builder.

### Does this PR introduce _any_ user-facing change?

No, because seems to me `FileTable` is not of user-facing public API.

### How was this patch tested?

Existing unit tests.

Closes #31432 from viirya/SPARK-34324.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-02-02 11:41:05 -08:00
Kousuke Saruta d308794adb [SPARK-34263][SQL] Simplify the code for treating unicode/octal/escaped characters in string literals
### What changes were proposed in this pull request?

In the current master, the code for treating unicode/octal/escaped characters in string literals is a little bit complex so let's simplify it.

### Why are the changes needed?

To keep it easy to maintain.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`ParserUtilsSuite` passes.

Closes #31362 from sarutak/refactor-unicode-escapes.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-02-03 01:07:12 +09:00
Max Gekk 79515b82f1 [SPARK-34282][SQL][TESTS] Unify v1 and v2 TRUNCATE TABLE tests
### What changes were proposed in this pull request?
1. Move parser tests from `DDLParserSuite` to `TruncateTableParserSuite`.
2. Port DS v1 tests from `DDLSuite` and other test suites to `v1.TruncateTableSuiteBase` and to `v1.TruncateTableSuite`.
3. Add a test for DSv2 `TRUNCATE TABLE` to `v2.TruncateTableSuite`.

### Why are the changes needed?
To improve test coverage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running new test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *TruncateTableSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *CatalogedDDLSuite"
```

Closes #31387 from MaxGekk/unify-truncate-table-tests.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-02 14:32:35 +00:00
Gengliang Wang ff1b6ecc37 [SPARK-33591][SQL][FOLLOW-UP] Revise the version and doc of spark.sql.legacy.parseNullPartitionSpecAsStringLiteral
### What changes were proposed in this pull request?

Correct the version of SQL configuration `spark.sql.legacy.parseNullPartitionSpecAsStringLiteral` from 3.2.0 to 3.0.2.
Also, revise the documentation and test case.

### Why are the changes needed?

The release version in https://github.com/apache/spark/pull/31421 was wrong.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests

Closes #31434 from gengliangwang/reviseVersion.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-02 13:51:20 +00:00
gengjiaan 5b2ad59f64 [SPARK-33599][SQL] Restore the assert-like in catalyst/analysis
### What changes were proposed in this pull request?
There exists some `Exception` for assert in fact. Such as:
`throw new IllegalStateException("[BUG] unexpected plan returned by `lookupV2Relation`: " + other)`

This kind `Exception` seems should not put in single dedicated files.

### Why are the changes needed?
Reduce the workload of auditing.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
Jenkins test.

Closes #31395 from beliefer/SPARK-33599-restore-assert.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-02 13:28:28 +00:00
Kousuke Saruta 66f3480f2b [SPARK-34318][SQL] Dataset.colRegex should work with column names and qualifiers which contain newlines
### What changes were proposed in this pull request?

This PR fixes an issue that `Dataset.colRegex` doesn't work with column names or qualifiers which contain newlines.
In the current master, if column names or qualifiers passed to `colRegex` contain newlines, it throws exception.
```
val df = Seq(1, 2, 3).toDF("test\n_column").as("test\n_table")
val col1 = df.colRegex("`tes.*\n.*mn`")
org.apache.spark.sql.AnalysisException: Cannot resolve column name "`tes.*
.*mn`" among (test
_column)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$resolveException(Dataset.scala:272)
  at org.apache.spark.sql.Dataset.$anonfun$resolve$1(Dataset.scala:263)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.Dataset.resolve(Dataset.scala:263)
  at org.apache.spark.sql.Dataset.colRegex(Dataset.scala:1407)
  ... 47 elided

val col2 = df.colRegex("test\n_table.`tes.*\n.*mn`")
org.apache.spark.sql.AnalysisException: Cannot resolve column name "test
_table.`tes.*
.*mn`" among (test
_column)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$resolveException(Dataset.scala:272)
  at org.apache.spark.sql.Dataset.$anonfun$resolve$1(Dataset.scala:263)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.Dataset.resolve(Dataset.scala:263)
  at org.apache.spark.sql.Dataset.colRegex(Dataset.scala:1407)
  ... 47 elided
```

### Why are the changes needed?

Column names and qualifiers can contain newlines but `colRegex` can't work with them, so it's a bug.

### Does this PR introduce _any_ user-facing change?

Yes. users can pass column names and qualifiers even though they contain newlines.

### How was this patch tested?

New test.

Closes #31426 from sarutak/fix-colRegex.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-02-02 21:47:11 +09:00
William Hyun 5acc5b8f1e [SPARK-34323][BUILD] Upgrade zstd-jni to 1.4.8-3
### What changes were proposed in this pull request?

This PR aims to upgrade zstd-jni to 1.4.8-3.

### Why are the changes needed?

This will bring the latest improvements and bug fixes.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the CIs with the existing tests.

Closes #31430 from williamhyun/zstd-148.

Authored-by: William Hyun <williamhyun3@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-02 00:39:05 -08:00
Max Gekk 6d3674bb62 [SPARK-34312][SQL] Support partition(s) truncation by Supports(Atomic)PartitionManagement
### What changes were proposed in this pull request?
1. Add new method `truncatePartition()` to the `SupportsPartitionManagement` interface.
2. Add new method `truncatePartitions()` to the `SupportsAtomicPartitionManagement` interface.
3. Default implementation of new methods in `InMemoryPartitionTable`/`InMemoryAtomicPartitionTable`.

### Why are the changes needed?
This is the first step in supporting of v2 `TRUNCATE TABLE .. PARTITION`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running new tests:
```
$ build/sbt "test:testOnly *SupportsPartitionManagementSuite"
$ build/sbt "test:testOnly *SupportsAtomicPartitionManagementSuite"
```

Closes #31420 from MaxGekk/dsv2-truncate-table-partitions.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-02 08:25:59 +00:00
Terry Kim f024d3051c [SPARK-34317][SQL] Introduce relationTypeMismatchHint to UnresolvedTable for a better error message
### What changes were proposed in this pull request?

This PR proposes to add `relationTypeMismatchHint` to `UnresolvedTable` so that if a relation is resolved to a view when a table is expected, a hint message can be included as a part of the analysis exception message. Note that the same feature is already introduced to `UnresolvedView` in #30636.

This mostly affects `ALTER TABLE` commands where the analysis exception message will now contain `Please use ALTER VIEW as instead`.

### Why are the changes needed?

To give a better error message. (The hint used to exist but got removed for commands that migrated to the new resolution framework)

### Does this PR introduce _any_ user-facing change?

Yes, now `ALTER TABLE` commands include a hint to use `ALTER VIEW` instead.
```
sql("ALTER TABLE v SET SERDE 'whatever'")
```
Before:
```
"v is a view. 'ALTER TABLE ... SET [SERDE|SERDEPROPERTIES]' expects a table.
```
After this PR:
```
"v is a view. 'ALTER TABLE ... SET [SERDE|SERDEPROPERTIES]' expects a table. Please use ALTER VIEW instead.
```

### How was this patch tested?

Updated existing test cases to include the hint.

Closes #31424 from imback82/better_error.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-02 08:24:44 +00:00
Linhong Liu bb9bf66bb6 [SPARK-34199][SQL] Block table.* inside function to follow ANSI standard and other SQL engines
### What changes were proposed in this pull request?
In spark, the `count(table.*)` may cause very weird result, for example:
```
select count(*) from (select 1 as a, null as b) t;
output: 1
select count(t.*) from (select 1 as a, null as b) t;
output: 0
```
 This is because spark expands `t.*` while converts `*` to count(1), this will confuse
users. After checking the ANSI standard, `count(*)` should always be `count(1)` while `count(t.*)`
is not allowed. What's more, this is also not allowed by common databases, e.g. MySQL, Oracle.

So, this PR proposes to block the ambiguous behavior and print a clear error message for users.

### Why are the changes needed?
to avoid ambiguous behavior and follow ANSI standard and other SQL engines

### Does this PR introduce _any_ user-facing change?
Yes, `count(table.*)` behavior will be blocked and output an error message.

### How was this patch tested?
newly added and existing tests

Closes #31286 from linhongliu-db/fix-table-star.

Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-02 07:49:50 +00:00
yi.wu e9362c2571 [SPARK-34319][SQL] Resolve duplicate attributes for FlatMapCoGroupsInPandas/MapInPandas
### What changes were proposed in this pull request?

Resolve duplicate attributes for `FlatMapCoGroupsInPandas`.

### Why are the changes needed?

When performing self-join on top of `FlatMapCoGroupsInPandas`, analysis can fail because of conflicting attributes. For example,

```scala
df = spark.createDataFrame([(1, 1)], ("column", "value"))
row = df.groupby("ColUmn").cogroup(
    df.groupby("COLUMN")
).applyInPandas(lambda r, l: r + l, "column long, value long")
row.join(row).show()
```
error:

```scala
...
Conflicting attributes: column#163321L,value#163322L
;;
’Join Inner
:- FlatMapCoGroupsInPandas [ColUmn#163312L], [COLUMN#163312L], <lambda>(column#163312L, value#163313L, column#163312L, value#163313L), [column#163321L, value#163322L]
:  :- Project [ColUmn#163312L, column#163312L, value#163313L]
:  :  +- LogicalRDD [column#163312L, value#163313L], false
:  +- Project [COLUMN#163312L, column#163312L, value#163313L]
:     +- LogicalRDD [column#163312L, value#163313L], false
+- FlatMapCoGroupsInPandas [ColUmn#163312L], [COLUMN#163312L], <lambda>(column#163312L, value#163313L, column#163312L, value#163313L), [column#163321L, value#163322L]
   :- Project [ColUmn#163312L, column#163312L, value#163313L]
   :  +- LogicalRDD [column#163312L, value#163313L], false
   +- Project [COLUMN#163312L, column#163312L, value#163313L]
      +- LogicalRDD [column#163312L, value#163313L], false
...
```

### Does this PR introduce _any_ user-facing change?

yes, the query like the above example won't fail.

### How was this patch tested?

Adde unit tests.

Closes #31429 from Ngone51/fix-conflcting-attrs-of-FlatMapCoGroupsInPandas.

Lead-authored-by: yi.wu <yi.wu@databricks.com>
Co-authored-by: wuyi <yi.wu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-02 16:25:32 +09:00
Gengliang Wang 521397f2f9 [SPARK-33591][SQL][FOLLOWUP] Add legacy config for recognizing null partition spec values
### What changes were proposed in this pull request?

This is a follow up for https://github.com/apache/spark/pull/30538.
It adds a legacy conf `spark.sql.legacy.parseNullPartitionSpecAsStringLiteral` in case users wants the legacy behavior.
It also adds document for the behavior change.

### Why are the changes needed?

In case users want the legacy behavior, they can set `spark.sql.legacy.parseNullPartitionSpecAsStringLiteral` as true.

### Does this PR introduce _any_ user-facing change?

Yes, adding a legacy configuration to restore the old behavior.

### How was this patch tested?

Unit test.

Closes #31421 from gengliangwang/legacyNullStringConstant.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-02 16:13:40 +09:00
Dongjoon Hyun f66e38c963 [SPARK-34316][K8S] Support spark.kubernetes.executor.disableConfigMap
### What changes were proposed in this pull request?

This PR aims to add a new configuration `spark.kubernetes.executor.disableConfigMap`.

### Why are the changes needed?

This can be use to disable config map creating for executor pods due to https://github.com/apache/spark/pull/27735 .

### Does this PR introduce _any_ user-facing change?

No. By default, this doesn't change AS-IS behavior.
This is a new feature to add an ability to disable SPARK-30985.

### How was this patch tested?

Pass the newly added UT.

Closes #31428 from dongjoon-hyun/SPARK-34316.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-01 22:26:07 -08:00
David Toneian d99d0d27be [SPARK-34300][PYSPARK][DOCS][MINOR] Fix some typos and syntax issues in docstrings and output of dev/lint-python
This changeset is published into the public domain.

### What changes were proposed in this pull request?

Some typos and syntax issues in docstrings and the output of `dev/lint-python` have been fixed.

### Why are the changes needed?
In some places, the documentation did not refer to parameters or classes by the full and correct name, potentially causing uncertainty in the reader or rendering issues in Sphinx. Also, a typo in the standard output of `dev/lint-python` was fixed.

### Does this PR introduce _any_ user-facing change?

Slight improvements in documentation, and in standard output of `dev/lint-python`.

### How was this patch tested?

Manual testing and `dev/lint-python` run. No new Sphinx warnings arise due to this change.

Closes #31401 from DavidToneian/SPARK-34300.

Authored-by: David Toneian <david@toneian.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-02 09:30:50 +09:00
HyukjinKwon 30468a9015 [SPARK-34306][SQL][PYTHON][R] Use Snake naming rule across the function APIs
### What changes were proposed in this pull request?

This PR completes snake_case rule at functions APIs across the languages, see also SPARK-10621.

In more details, this PR:
- Adds `count_distinct` in Scala Python, and R, and document that `count_distinct` is encouraged. This was not deprecated because `countDistinct` is pretty commonly used. We could deprecate in the future releases.
- (Scala-specific) adds `typedlit` but doesn't deprecate `typedLit` which is arguably commonly used. Likewise, we could deprecate in the future releases.
- Deprecates and renames:
  - `sumDistinct` -> `sum_distinct`
  - `bitwiseNOT` -> `bitwise_not`
  - `shiftLeft` -> `shiftleft` (matched with SQL name in `FunctionRegistry`)
  - `shiftRight` -> `shiftright` (matched with SQL name in `FunctionRegistry`)
  - `shiftRightUnsigned` -> `shiftrightunsigned` (matched with SQL name in `FunctionRegistry`)
  - (Scala-specific) `callUDF` -> `call_udf`

### Why are the changes needed?

To keep the consistent naming in APIs.

### Does this PR introduce _any_ user-facing change?

Yes, it deprecates some APIs and add new renamed APIs as described above.

### How was this patch tested?

Unittests were added.

Closes #31408 from HyukjinKwon/SPARK-34306.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-02 09:29:40 +09:00
yangjie01 9db566a882 [SPARK-34310][CORE][SQL] Replaces map and flatten with flatMap
### What changes were proposed in this pull request?
Replaces `collection.map(f1).flatten(f2)` with `collection.flatMap` if possible. it's semantically consistent, but looks simpler.

### Why are the changes needed?
Code Simpilefications.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #31416 from LuciferYang/SPARK-34310.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-02-01 08:21:35 -06:00
Angerszhuuuu 74116b6b25 [SPARK-34239][SQL] Unify output of SHOW COLUMNS pass output attributes properly
### What changes were proposed in this pull request?
Passing around the output attributes should have more benefits like keeping the expr ID unchanged to avoid bugs when we apply more operators above the command output dataframe.

This PR keep SHOW COLUMNS command's output attribute exprId unchanged.

### Why are the changes needed?
 Keep SHOW PARTITIONS command's output attribute exprid unchanged.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #31377 from AngersZhuuuu/SPARK-34239.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-01 14:16:03 +00:00
Max Gekk 2b76e6d15c [SPARK-34301][SQL] Use logical plan of alter table in CatalogImpl.recoverPartitions()
### What changes were proposed in this pull request?
Replace v1 exec node `AlterTableRecoverPartitionsCommand` by the logical node `AlterTableRecoverPartitions` in `CatalogImpl.recoverPartitions()`.

### Why are the changes needed?
1. Print user friendly error message for views:
```
my_temp_table is a temp view. 'recoverPartitions()' expects a table
```
Before the changes:
```
Table or view 'my_temp_table' not found in database 'default'
```

2. To not bind to v1 `ALTER TABLE .. RECOVER PARTITIONS`, and to support v2 tables potentially as well.

### Does this PR introduce _any_ user-facing change?
Yes, it can.

### How was this patch tested?
By running new test in `CatalogSuite`:
```
$ build/sbt -Phive -Phive-thriftserver "test:testOnly org.apache.spark.sql.internal.CatalogSuite"
```

Closes #31403 from MaxGekk/catalogimpl-recoverPartitions.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-01 14:09:40 +00:00
Max Gekk 0837c1aa3d [SPARK-34303][SQL] Migrate ALTER TABLE .. SET LOCATION to new resolution framework
### What changes were proposed in this pull request?
1. Remove old statement `AlterTableSetLocationStatement`
2. Introduce new command `AlterTableSetLocation` for  `ALTER TABLE .. SET LOCATION`.

### Why are the changes needed?
This is a part of effort to make the relation lookup behavior consistent: SPARK-29900.

### Does this PR introduce _any_ user-facing change?
It can change the error message for views.

### How was this patch tested?
By running `ALTER TABLE .. SET LOCATION` tests:
```
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *DataSourceV2SQLSuite"
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *CatalogedDDLSuite"
```

Closes #31414 from MaxGekk/migrate-set-location-resolv-table.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-01 13:41:15 +00:00
Max Gekk 95302756f1 [SPARK-34266][SQL][DOCS] Update comments for SessionCatalog.refreshTable() and CatalogImpl.refreshTable()
### What changes were proposed in this pull request?
Describe `SessionCatalog.refreshTable()` and `CatalogImpl.refreshTable()`. what they do and when they are supposed to be used.

### Why are the changes needed?
To improve code maintenance.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running `./dev/scalastyle`

Closes #31364 from MaxGekk/doc-refreshTable.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-01 13:07:05 +00:00
HyukjinKwon 4e7e7ee6e5 [SPARK-33245][SQL][FOLLOW-UP] Remove bitwiseGet in Scala functions API
### What changes were proposed in this pull request?

This PR is a followup that removes `bitwiseGet` in functions API. This is mainly for SQL compliance, and arguably not very much commonly used.

### Why are the changes needed?

See https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L41-L59

### Does this PR introduce _any_ user-facing change?

No, it's a change in unreleased branches.

### How was this patch tested?

Existing tests should cover.

Closes #31410 from HyukjinKwon/SPARK-33245.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-01 18:21:27 +09:00
Ruifeng Zheng e0251ac143 [SPARK-34256][ML] VectorSlicer refine numFeatures checking and toString method
### What changes were proposed in this pull request?
1, update checking of numFeatures;
2, update `toString` to take `names` into account;

### Why are the changes needed?
1, should use `inputAttr.size` instead of `inputAttr.numAttributes` to get numFeatures;
2, this checking is necessary only if `$(indices).nonEmpty`, otherwise, `$(indices).max` will throw exception java.lang.UnsupportedOperationException: empty.max
3, in toString, should add length of names;

### Does this PR introduce _any_ user-facing change?
Yes, toString now count `names`;

### How was this patch tested?
existing testsuites

Closes #31354 from zhengruifeng/vector_slicer_clean.

Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
2021-02-01 10:09:14 +08:00
Terry Kim a8eb443bf8 [SPARK-34299][SQL] Clean up ResolveSessionCatalog's isTempView and isTempFunction
### What changes were proposed in this pull request?

`ResolveSessionCatalog`'s `isTempView` and `isTempFunction` are not being used anymore since the resolution of temp view/function has moved to `Analyzer`.

This PR proposes to remove `isTempView` and `isTempFunction` from `ResolveSessionCatalog`.

### Why are the changes needed?

To clean up unused variables.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests should cover as this PR just removes the unused variables.

Closes #31400 from imback82/cleanup_resolve_session_catalog.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-31 13:03:30 +09:00
Terry Kim bec88a66bd [SPARK-34269][SQL][TESTS][FOLLOWUP] Test a subquery with view in aggregate's grouping expression
### What changes were proposed in this pull request?

This PR is a follow-up to #31368 to add a test case that has a subquery with "view" in aggregate's grouping expression. The existing test tests a subquery with dataframe's temp view, so it doesn't contain a `View` node.

### Why are the changes needed?

To increase the test coverage.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added a new test.

Closes #31352 from imback82/grouping_expr.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-30 17:07:40 -08:00
Chao Sun 463d4ec350 [SPARK-34269][SQL][TESTS][FOLLOWUP] Add test cases for cache lookup and project removal
### What changes were proposed in this pull request?

This adds a few test cases for looking up cached temporary/permanent view created using clauses such as `ORDER BY` or `LIMIT`.

### Why are the changes needed?

Due to `EliminateView` and how canonization is done for `View`, which inserts an extra project operator, cache lookup could fail in the following simple example:
```sql
> CREATE TABLE t (key bigint, value string) USING parquet
> CACHE TABLE v1 AS SELECT * FROM t ORDER BY key
> SELECT * FROM t ORDER BY key
```

#31368 addresses this issue by removing the project operator if `canRemoveProject` check is successful. This PR adds a few tests.

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

This PR just adds unit tests.

Closes #31182 from sunchao/SPARK-34108.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-01-30 12:31:57 -08:00
Liang-Chi Hsieh 50d14c98c3 [SPARK-34270][SS] Combine StateStoreMetrics should not override StateStoreCustomMetric
### What changes were proposed in this pull request?

This patch proposes to sum up custom metric values instead of taking arbitrary one when combining `StateStoreMetrics`.

### Why are the changes needed?

For stateful join in structured streaming, we need to combine `StateStoreMetrics` from both left and right side. Currently we simply take arbitrary one from custom metrics with same name from left and right. By doing this we miss half of metric number.

### Does this PR introduce _any_ user-facing change?

Yes, this corrects metrics collected for stateful join.

### How was this patch tested?

Unit test.

Closes #31369 from viirya/SPARK-34270.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-29 20:50:39 -08:00
neko bd9eeebce0 [SPARK-34288][WEBUI] Add a tip info for the resources column in the executors page
### What changes were proposed in this pull request?
This is  an extension of PR #25613, I try to add a  tip info for `Resource` column to make it easier for users to know what the column actually means and avoid confusion.

### Why are the changes needed?
After upgrading from 2.3.2 to 3.0.1, the new `Resources` column in the executors page is always blank because it does not use GPU/FPGA,
and there is no tip info, so users are often confused when they do not know the exact meaning of this column.

### Does this PR introduce _any_ user-facing change?
add a tip info in the executors page.

### How was this patch tested?
 manual test  works well as below:
![fixed](https://user-images.githubusercontent.com/52202080/106248350-d032ee00-624b-11eb-9ae4-92319ed11110.png)

Closes #31392 from akiyamaneko/executors-resources-tips.

Authored-by: neko <echohlne@gmail.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2021-01-30 10:23:52 +08:00
Yuming Wang f2b22d1487 [SPARK-34289][SQL] Parquet vectorized reader support column index
### What changes were proposed in this pull request?

This pr make parquet vectorized reader support [column index](https://issues.apache.org/jira/browse/PARQUET-1201).

### Why are the changes needed?

Improve filter performance. for example: `id = 1`, we only need to read `page-0` in `block 1`:

```
block 1:
                     null count  min                                       max
page-0                         0  0                                         99
page-1                         0  100                                       199
page-2                         0  200                                       299
page-3                         0  300                                       399
page-4                         0  400                                       449

block 2:
                     null count  min                                       max
page-0                         0  450                                       549
page-1                         0  550                                       649
page-2                         0  650                                       749
page-3                         0  750                                       849
page-4                         0  850                                       899
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test and benchmark: https://github.com/apache/spark/pull/31393#issuecomment-769767724

Closes #31393 from wangyum/SPARK-34289.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-29 09:53:46 -08:00
“attilapiros” d3f049cbc2 [SPARK-34154][YARN][FOLLOWUP] Fix flaky LocalityPlacementStrategySuite test
### What changes were proposed in this pull request?

Fixing the flaky `handle large number of containers and tasks (SPARK-18750)` by avoiding to use `DNSToSwitchMapping` as in some situation DNS lookup could be extremely slow.

### Why are the changes needed?

After https://github.com/apache/spark/pull/31363 was merged the flaky `handle large number of containers and tasks (SPARK-18750)` test failed again in some other PRs but now we have the exact place where the test is stuck.

It is in the DNS lookup:

```
[info] - handle large number of containers and tasks (SPARK-18750) *** FAILED *** (30 seconds, 4 milliseconds)
[info]   Failed with an exception or a timeout at thread join:
[info]
[info]   java.lang.RuntimeException: Timeout at waiting for thread to stop (its stack trace is added to the exception)
[info]   	at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
[info]   	at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929)
[info]   	at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324)
[info]   	at java.net.InetAddress.getAllByName0(InetAddress.java:1277)
[info]   	at java.net.InetAddress.getAllByName(InetAddress.java:1193)
[info]   	at java.net.InetAddress.getAllByName(InetAddress.java:1127)
[info]   	at java.net.InetAddress.getByName(InetAddress.java:1077)
[info]   	at org.apache.hadoop.net.NetUtils.normalizeHostName(NetUtils.java:568)
[info]   	at org.apache.hadoop.net.NetUtils.normalizeHostNames(NetUtils.java:585)
[info]   	at org.apache.hadoop.net.CachedDNSToSwitchMapping.resolve(CachedDNSToSwitchMapping.java:109)
[info]   	at org.apache.spark.deploy.yarn.SparkRackResolver.coreResolve(SparkRackResolver.scala:75)
[info]   	at org.apache.spark.deploy.yarn.SparkRackResolver.resolve(SparkRackResolver.scala:66)
[info]   	at org.apache.spark.deploy.yarn.LocalityPreferredContainerPlacementStrategy.$anonfun$localityOfRequestedContainers$3(LocalityPreferredContainerPlacementStrategy.scala:142)
[info]   	at org.apache.spark.deploy.yarn.LocalityPreferredContainerPlacementStrategy$$Lambda$658/1080992036.apply$mcVI$sp(Unknown Source)
[info]   	at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
[info]   	at org.apache.spark.deploy.yarn.LocalityPreferredContainerPlacementStrategy.localityOfRequestedContainers(LocalityPreferredContainerPlacementStrategy.scala:138)
[info]   	at org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.org$apache$spark$deploy$yarn$LocalityPlacementStrategySuite$$runTest(LocalityPlacementStrategySuite.scala:94)
[info]   	at org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite$$anon$1.run(LocalityPlacementStrategySuite.scala:40)
[info]   	at java.lang.Thread.run(Thread.java:748) (LocalityPlacementStrategySuite.scala:61)
...
```

This could be because of the DNS servers used by those build machines are not configured to handle IPv6 queries and the client has to wait for the IPv6 query to timeout before falling back to IPv4.

This even make the tests more consistent. As when a single host was given to lookup via `resolve(hostName: String)` it gave a different answer from calling `resolve(hostNames: Seq[String])` with a `Seq` containing that single host.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests.

Closes #31397 from attilapiros/SPARK-34154-2nd.

Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-29 23:54:40 +09:00
Max Gekk 588ddcdf22 [SPARK-33163][SQL][TESTS][FOLLOWUP] Fix the test for the parquet metadata key 'org.apache.spark.legacyDateTime'
### What changes were proposed in this pull request?
1. Test both date and timestamp column types
2. Write the timestamp as the `TIMESTAMP_MICROS` logical type
3. Change the timestamp value to `'1000-01-01 01:02:03'` to check exception throwing.

### Why are the changes needed?
To improve test coverage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the modified test suite:
```
$ build/sbt "testOnly org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite"
```

Closes #31396 from MaxGekk/parquet-test-metakey-followup.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-29 22:25:01 +09:00
beliefer 0f7a4977c9 [SPARK-33601][SQL] Group exception messages in catalyst/parser
### What changes were proposed in this pull request?
This PR group exception messages in `/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser`.

### Why are the changes needed?
It will largely help with standardization of error messages and its maintenance.

### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.

### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.

Closes #31293 from beliefer/SPARK-33601.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-29 08:57:58 +00:00
yangjie01 2e192b6f45 [SPARK-34284][CORE][TESTS] Fix deprecated API usage of Apache commons-io
### What changes were proposed in this pull request?
There are some deprecated API usage compilation warning related to Apache commons-io as follows:

 ```
[WARNING] [Warn] /spark/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala:1109: [deprecation  org.apache.spark.deploy.SparkSubmitSuite.checkDownloadedFile.$org_scalatest_assert_macro_expr.$org_scalatest_assert_macro_left | origin=org.apache.commons.io.FileUtils.readFileToString | version=] method readFileToString in class FileUtils is deprecated
[WARNING] [Warn] /spark/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala:1110: [deprecation  org.apache.spark.deploy.SparkSubmitSuite.checkDownloadedFile.$org_scalatest_assert_macro_expr.$org_scalatest_assert_macro_right | origin=org.apache.commons.io.FileUtils.readFileToString | version=] method readFileToString in class FileUtils is deprecated
[WARNING] [Warn] /spark/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala:1152: [deprecation  org.apache.spark.deploy.SparkSubmitSuite | origin=org.apache.commons.io.FileUtils.write | version=] method write in class FileUtils is deprecated
[WARNING] [Warn] /spark/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala:1167: [deprecation  org.apache.spark.deploy.SparkSubmitSuite | origin=org.apache.commons.io.FileUtils.write | version=] method write in class FileUtils is deprecated
[WARNING] [Warn] /spark/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala:201: [deprecation  org.apache.spark.deploy.history.HistoryServerSuite.<local HistoryServerSuite>.$anonfun.exp | origin=org.apache.commons.io.IOUtils.toString | version=] method toString in class IOUtils is deprecated
[WARNING] [Warn] /spark/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala:716: [deprecation  org.apache.spark.deploy.history.HistoryServerSuite.getContentAndCode.inString.$anonfun | origin=org.apache.commons.io.IOUtils.toString | version=] method toString in class IOUtils is deprecated
[WARNING] [Warn] /spark/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala:732: [deprecation  org.apache.spark.deploy.history.HistoryServerSuite.connectAndGetInputStream.errString.$anonfun | origin=org.apache.commons.io.IOUtils.toString | version=] method toString in class IOUtils is deprecated
[WARNING] [Warn] /spark/streaming/src/test/scala/org/apache/spark/streaming/InputStreamsSuite.scala:267: [deprecation  org.apache.spark.streaming.InputStreamsSuite.<local InputStreamsSuite>.$anonfun.$anonfun.write | origin=org.apache.commons.io.IOUtils.write | version=] method write in class IOUtils is deprecated
[WARNING] [Warn] /spark/streaming/src/test/scala/org/apache/spark/streaming/StreamingContextSuite.scala:912: [deprecation  org.apache.spark.streaming.StreamingContextSuite.createCorruptedCheckpoint | origin=org.apache.commons.io.FileUtils.write | version=] method write in class FileUtils is deprecated
```

The main API change is to need to add a `java.nio.charset.Charset` parameter when the corresponding method is called, so the main change of is pr is add a `StandardCharsets.UTF_8` parameter to the these method.

### Why are the changes needed?
Fix deprecated API usage of Apache commons-io.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #31389 from LuciferYang/SPARK-34284.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-29 17:50:14 +09:00
Chircu 520e5d2ab8 [SPARK-34144][SQL] Exception thrown when trying to write LocalDate and Instant values to a JDBC relation
### What changes were proposed in this pull request?

When writing rows to a table only the old date time API types are handled in org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils#makeSetter. If the new API is used (spark.sql.datetime.java8API.enabled=true) casting Instant and LocalDate to Timestamp and Date respectively fails. The proposed change is to handle Instant and LocalDate values and transform them to Timestamp and Date.

### Why are the changes needed?

In the current state writing Instant or LocalDate values to a table fails with something like:
Caused by: java.lang.ClassCastException: class java.time.LocalDate cannot be cast to class java.sql.Date (java.time.LocalDate is in module java.base of loader 'bootstrap'; java.sql.Date is in module java.sql of loader 'platform') at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeSetter$11(JdbcUtils.scala:573) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeSetter$11$adapted(JdbcUtils.scala:572) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:678) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$saveTable$1(JdbcUtils.scala:858) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$saveTable$1$adapted(JdbcUtils.scala:856) at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2(RDD.scala:994) at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2$adapted(RDD.scala:994) at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2139) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449) ... 3 more

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added tests

Closes #31264 from cristichircu/SPARK-34144.

Lead-authored-by: Chircu <chircu@arezzosky.com>
Co-authored-by: Cristi Chircu <cristian.chircu@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-29 17:48:13 +09:00
Bo Zhang 3f350dbd78 [SPARK-33212][FOLLOW-UP][BUILD] Fix test "built-in Hadoop version should support shaded client" for hadoop-2.7
### What changes were proposed in this pull request?
We added test "built-in Hadoop version should support shaded client" in https://github.com/apache/spark/pull/31203, but it fails when profile hadoop-2.7 is activated. This change fixes the test by skipping the assertion when Hadoop version is 2.

### Why are the changes needed?
The test fails in master branch when profile hadoop-2.7 is activated.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Ran the test with hadoop-2.7 profile.

Closes #31391 from bozhang2820/fix-hadoop-2-version-test.

Authored-by: Bo Zhang <bo.zhang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-29 15:47:02 +09:00
Wenchen Fan b891862fb6 [SPARK-34269][SQL] Simplify SQL view resolution
### What changes were proposed in this pull request?

The currently SQL (temp or permanent) view resolution is done in 2 steps:
1. In `SessionCatalog`, we get the view metadata, parse the view SQL string, and wrap it with `View`.
2. At the beginning of the optimizer, we run `EliminateView`, which drops the wrapper `View`, and apply some special logic to match the view schema.

Step 2 is tricky, as we need to retain the output attr expr id, while we need to add an extra `Project` to add cast and alias. This PR simplifies the view solution by building a completed plan (with cast and alias added) in `SessionCatalog`, so that we only have 1 step.

### Why are the changes needed?

Code simplification. It also fixes issues like https://github.com/apache/spark/pull/31352

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

existing tests

Closes #31368 from cloud-fan/try.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-29 06:46:01 +00:00
Cheng Su d871b54a4e [SPARK-34237][SQL] Add more metrics (fallback, spill) to object hash aggregate
### What changes were proposed in this pull request?

This PR is to add two more metrics for `ObjectHashAggregateExec`, i.e. the spill size, and number of fallback to sort-based aggregation.

### Why are the changes needed?

As object hash aggregate fallback mechanism is special - it will fallback to sort-based aggregation based on number of keys seen so far [0]. This fallback logic sometimes is sub-optimal and leads to unnecessary sort, and performance degradation in run-time. The first step to help user/developer debug is to add more related metrics on UI, e.g. spill size, and number of fallback to sort-based aggregation. (spill size metrics was already added for hash aggregate [1])

[0]: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/ObjectAggregationIterator.scala#L161

[1]: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L68

### Does this PR introduce _any_ user-facing change?

Added two more metrics on Spark UI for operator `ObjectHashAggregateExec`. Screenshot is attached below.

### How was this patch tested?

* Added unit test in `SQLMetricsSuite.scala`.
* Tested on spark shell locally and verified the metrics shown up on UI.

<img width="399" alt="Screen Shot 2021-01-28 at 1 44 40 PM" src="https://user-images.githubusercontent.com/4629931/106204224-7a8a1300-6171-11eb-9814-c3432abadc29.png">

Closes #31340 from c21/object-hash.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-29 04:35:58 +00:00
ulysses-you 72b7f8abfb [SPARK-34261][SQL] Avoid side effect if create exists temporary function
### What changes were proposed in this pull request?

Add function exists check before load resource.

### Why are the changes needed?

We should not add jar into classpath if the create temporary function is already exists.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Add test.

Closes #31358 from ulysses-you/SPARK-34261.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-01-29 10:39:02 +09:00
Yuming Wang a7683afdf4 [SPARK-26346][BUILD][SQL] Upgrade Parquet to 1.11.1
### What changes were proposed in this pull request?

This PR upgrade Parquet to 1.11.1.

Parquet 1.11.1 new features:

- [PARQUET-1201](https://issues.apache.org/jira/browse/PARQUET-1201) - Column indexes
- [PARQUET-1253](https://issues.apache.org/jira/browse/PARQUET-1253) - Support for new logical type representation
- [PARQUET-1388](https://issues.apache.org/jira/browse/PARQUET-1388) - Nanosecond precision time and timestamp - parquet-mr

More details:
https://github.com/apache/parquet-mr/blob/apache-parquet-1.11.1/CHANGES.md

### Why are the changes needed?
Support column indexes to improve query performance.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing test.

Closes #26804 from wangyum/SPARK-26346.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
2021-01-29 08:07:49 +08:00
Dongjoon Hyun bc41c5a0e5 [SPARK-34273][CORE] Do not reregister BlockManager when SparkContext is stopped
### What changes were proposed in this pull request?

This PR aims to prevent `HeartbeatReceiver` asks `Executor` to re-register blocker manager when the SparkContext is already stopped.

### Why are the changes needed?

Currently, `HeartbeatReceiver` blindly asks re-registration for the new heartbeat message.
However, when SparkContext is stopped, we don't need to re-register new block manager.
Re-registration causes unnecessary executors' logs and and a delay on job termination.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs with the newly added test case.

Closes #31373 from dongjoon-hyun/SPARK-34273.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-28 13:06:42 -08:00
Dongjoon Hyun 78244bafe8 [SPARK-34281][K8S] Promote spark.kubernetes.executor.podNamePrefix to the public conf
### What changes were proposed in this pull request?

This PR aims to remove `internal()` from `spark.kubernetes.executor.podNamePrefix` in order to make it the configuration public.

### Why are the changes needed?

In line with K8s GA, this will allow some users control the full executor pod names officially.
This is useful when we want a custom executor pod name pattern independently from the app name.

### Does this PR introduce _any_ user-facing change?

No, this has been there since Apache Spark 2.3.0.

### How was this patch tested?

N/A.

Closes #31386 from dongjoon-hyun/SPARK-34281.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-28 13:01:18 -08:00