Commit graph

11765 commits

Author SHA1 Message Date
Gengliang Wang 18143fb426 [SPARK-36585][SQL][DOCS] Support setting "since" version in FunctionRegistry
### What changes were proposed in this pull request?

Spark 3.2.0 includes two new functions `regexp` and `regexp_like`, which are identical to `rlike`. However, in the generated documentation. the since versions of both functions are `1.0.0` since they are based on the expression `RLike`:

- https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc1-docs/_site/api/sql/index.html#regexp
- https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc1-docs/_site/api/sql/index.html#regexp_like

This PR is to:
* Support setting `since` version in FunctionRegistry
* Correct the `since` version of `regexp` and `regexp_like`

### Why are the changes needed?

Correct the SQL doc
### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Run
```
sh sql/create-docs.sh
```
and check the SQL doc manually

Closes #33834 from gengliangwang/allowSQLFunVersion.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-25 22:32:20 +08:00
Kousuke Saruta b2ff01608f [SPARK-36398][SQL] Redact sensitive information in Spark Thrift Server log
### What changes were proposed in this pull request?

This PR fixes an issue that there is no way to redact sensitive information in Spark Thrift Server log.
For example, JDBC password can be exposed in the log.
```
21/08/25 18:52:37 INFO SparkExecuteStatementOperation: Submitting query 'CREATE TABLE mytbl2(a int) OPTIONS(url="jdbc:mysql//example.com:3306", driver="com.mysql.jdbc.Driver", dbtable="test_tbl", user="test_usr", password="abcde")' with ca14ae38-1aaf-4bf4-a099-06b8e5337613
```

### Why are the changes needed?

Bug fix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Ran ThriftServer, connect to it and execute `CREATE TABLE mytbl2(a int) OPTIONS(url="jdbc:mysql//example.com:3306", driver="com.mysql.jdbc.Driver", dbtable="test_tbl", user="test_usr", password="abcde");` with `spark.sql.redaction.string.regex=((?i)(?<=password=))(".*")|('.*')`
Then, confirmed the log.
```
21/08/25 18:54:11 INFO SparkExecuteStatementOperation: Submitting query 'CREATE TABLE mytbl2(a int) OPTIONS(url="jdbc:mysql//example.com:3306", driver="com.mysql.jdbc.Driver", dbtable="test_tbl", user="test_usr", password=*********(redacted))' with ffc627e2-b1a8-4d83-ab6d-d819b3ccd909
```

Closes #33832 from sarutak/fix-SPARK-36398.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-08-25 21:30:43 +09:00
Max Gekk df0ec56723 [SPARK-36567][SQL] Support foldable special datetime strings by CAST
### What changes were proposed in this pull request?
In the PR, I propose to add new correctness rule `SpecialDatetimeValues` to the final analysis phase. It replaces casts of strings to date/timestamp_ltz/timestamp_ntz by literals of such types if the strings contain special datetime values like `today`, `yesterday` and `tomorrow`, and the input strings are foldable.

### Why are the changes needed?
1. To avoid a breaking change.
2. To improve user experience with Spark SQL. After the PR https://github.com/apache/spark/pull/32714, users have to use typed literals instead of implicit casts. For instance,
at Spark 3.1:
```sql
select ts_col > 'now';
```
but the query fails at the moment, and users have to use typed timestamp literal:
```sql
select ts_col > timestamp'now';
```

### Does this PR introduce _any_ user-facing change?
No. Previous release 3.1 has supported the feature already till it was removed by https://github.com/apache/spark/pull/32714.

### How was this patch tested?
1. Manually tested via the sql command line:
```sql
spark-sql> select cast('today' as date);
2021-08-24
spark-sql> select timestamp('today');
2021-08-24 00:00:00
spark-sql> select timestamp'tomorrow' > 'today';
true
```
2. By running new test suite:
```
$ build/sbt "sql/testOnly org.apache.spark.sql.catalyst.optimizer.SpecialDatetimeValuesSuite"
```

Closes #33816 from MaxGekk/foldable-datetime-special-values.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-25 14:08:59 +08:00
Hyukjin Kwon 93cec49212 [SPARK-36559][SQL][PYTHON] Create plans dedicated to distributed-sequence index for optimization
### What changes were proposed in this pull request?

This PR proposes to move distributed-sequence index implementation to SQL plan to leverage optimizations such as column pruning.

```python
import pyspark.pandas as ps
ps.set_option('compute.default_index_type', 'distributed-sequence')
ps.range(10).id.value_counts().to_frame().spark.explain()
```

**Before:**

```bash
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [count#51L DESC NULLS LAST], true, 0
   +- Exchange rangepartitioning(count#51L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=#70]
      +- HashAggregate(keys=[id#37L], functions=[count(1)], output=[__index_level_0__#48L, count#51L])
         +- Exchange hashpartitioning(id#37L, 200), ENSURE_REQUIREMENTS, [id=#67]
            +- HashAggregate(keys=[id#37L], functions=[partial_count(1)], output=[id#37L, count#63L])
               +- Project [id#37L]
                  +- Filter atleastnnonnulls(1, id#37L)
                     +- Scan ExistingRDD[__index_level_0__#36L,id#37L]
                        # ^^^ Base DataFrame created by the output RDD from zipWithIndex (and checkpointed)
```

**After:**

```bash
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [count#275L DESC NULLS LAST], true, 0
   +- Exchange rangepartitioning(count#275L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=#174]
      +- HashAggregate(keys=[id#258L], functions=[count(1)])
         +- HashAggregate(keys=[id#258L], functions=[partial_count(1)])
            +- Filter atleastnnonnulls(1, id#258L)
               +- Range (0, 10, step=1, splits=16)
                  # ^^^ Removed the Spark job execution for `zipWithIndex`
```

### Why are the changes needed?

To leverage optimization of SQL engine and avoid unnecessary shuffle to create default index.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unittests were added. Also, this PR will test all unittests in pandas API on Spark after switching the default index implementation to `distributed-sequence`.

Closes #33807 from HyukjinKwon/SPARK-36559.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-25 10:02:53 +09:00
PengLei 3e32ea17db [SPARK-36336][SQL] Add new exception of base exception used in QueryExecutionErrors
### What changes were proposed in this pull request?
When we refactor the query execution errors to use error classes in QueryExecutionErrors, we need define some exception that mix SparkThrowable into a base Exception type.
according the example [SparkArithmeticException](f90eb6a5db/core/src/main/scala/org/apache/spark/SparkException.scala (L75))

Add SparkXXXException as follows:
- `SparkClassNotFoundException`
- `SparkConcurrentModificationException`
- `SparkDateTimeException`
- `SparkFileAlreadyExistsException`
- `SparkFileNotFoundException`
- `SparkNoSuchMethodException`
- `SparkIndexOutOfBoundsException`
- `SparkIOException`
- `SparkSecurityException`
- `SparkSQLException`
- `SparkSQLFeatureNotSupportedException`

Refactor some exceptions in QueryExecutionErrors to use error classes and new exception for testing new exception

Some added by [PR](https://github.com/apache/spark/pull/33538) as follows:

- `SparkUnsupportedOperationException`
- `SparkIllegalStateException`
- `SparkNumberFormatException`
- `SparkIllegalArgumentException`
- `SparkArrayIndexOutOfBoundsException`
- `SparkNoSuchElementException`

### Why are the changes needed?
[SPARK-36336](https://issues.apache.org/jira/browse/SPARK-36336)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
existed ut test

Closes #33573 from Peng-Lei/SPARK-36336.

Authored-by: PengLei <peng.8lei@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-25 09:39:28 +09:00
Gengliang Wang 5b4c216478 [SPARK-35535][SQL][FOLLOWUP] Move LocalScan to Catalyst package
### What changes were proposed in this pull request?

This is a follow-up of https://github.com/apache/spark/pull/32678. It moves `LocalScan` from SQL core package to Catalyst package.

### Why are the changes needed?

There are two packages for `org.apache.spark.sql.connector`
SQL Core: https://github.com/apache/spark/tree/master/sql/core/src/main/java/org/apache/spark/sql/connector
Catalyst: https://github.com/apache/spark/tree/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector

As `LocalScan` doesn't depend on the classes of SQL Core, we should move it to catalyst.
### Does this PR introduce _any_ user-facing change?

No, the trait is not released yet.

### How was this patch tested?

Existing UT.

Closes #33826 from gengliangwang/moveLocalScan.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-24 13:23:50 -07:00
Angerszhuuuu e83f8a872a [SPARK-35991][SQL][FOLLOWUP] Add back protected modifier of sparkConf to TPCBase
### What changes were proposed in this pull request?
Add back protected modifier of sparkConf to TPCBase according to https://github.com/apache/spark/pull/33736/files#r694054229

### Why are the changes needed?
Add back protected modifier of sparkConf to TPCBase

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not need

Closes #33813 from AngersZhuuuu/SPARK-35991-FOLLOWUP.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-24 11:32:30 +09:00
Huaxin Gao cd2342691d [SPARK-34952][SQL][FOLLOWUP] Move aggregates to a separate package
### What changes were proposed in this pull request?
Add `aggregate` package under `sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions` and move all the aggregates (e.g. `Count`, `Max`, `Min`, etc.) there.

### Why are the changes needed?
Right now these aggregates are under `sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions`. It looks OK now, but we plan to add a new `filter` package under `expressions` for all the DSV2 filters. It will look strange that filters have their own package, but aggregates don't.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests

Closes #33815 from huaxingao/agg_package.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-08-23 15:31:13 -07:00
Kent Yao f918c123a0 [SPARK-36552][SQL] Fix different behavior for writing char/varchar to hive and datasource table
### What changes were proposed in this pull request?

For the hive table, the actual write path and the schema handling are inconsistent when `spark.sql.legacy.charVarcharAsString` is true.

This causes problems like SPARK-36552 described.

In this PR we respect `spark.sql.legacy.charVarcharAsString` when generates hive table schema from spark data types.

### Why are the changes needed?

bugfix

### Does this PR introduce _any_ user-facing change?

yes, when `spark.sql.legacy.charVarcharAsString` is true, hive table with char/varchar will respect string behavior.

### How was this patch tested?

newly added test

Closes #33798 from yaooqinn/SPARK-36552.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-22 09:38:39 +09:00
yangjie01 1ccb06ca8c Revert "[SPARK-34309][BUILD][CORE][SQL][K8S] Use Caffeine instead of Guava Cache"
### What changes were proposed in this pull request?
This pr revert the change of SPARK-34309, includes:

- https://github.com/apache/spark/pull/31517
- https://github.com/apache/spark/pull/33772

### Why are the changes needed?

1. No really performance improvement in Spark
2. Added an additional dependency

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #33784 from LuciferYang/revert-caffeine.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-22 09:36:15 +09:00
Yesheng Ma 5c0762b5d2 [SPARK-36448][SQL] Exceptions in NoSuchItemException.scala have to be case classes
### What changes were proposed in this pull request?
Change all exceptions in NoSuchItemException.scala to case classes.

### Why are the changes needed?
Exceptions in NoSuchItemException.scala are not case classes. This is causing issues because in Analyzer's executeAndCheck method always calls the `copy` method on the exception. However, since these exceptions are not case classes, the `copy` method was always delegated to `AnalysisException::copy`, which is not the specialized version.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UTs.

Closes #33673 from yeshengm/SPARK-36448.

Authored-by: Yesheng Ma <kimi.ysma@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-20 20:16:30 +08:00
gengjiaan 462aa7cd3c [SPARK-36428][TESTS][FOLLOWUP] Revert mistake change to DateExpressionsSuite
### What changes were proposed in this pull request?
https://github.com/apache/spark/pull/33775 commits the debug code mistakely.
This PR revert the test path.

### Why are the changes needed?
Revoke debug code.

### Does this PR introduce _any_ user-facing change?
 'No'.
Just adjust test.

### How was this patch tested?
Revert non-ansi test path.

Closes #33787 from beliefer/SPARK-36428-followup2.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-19 21:33:21 +08:00
Yuming Wang 2310b99e14 [SPARK-36444][SQL] Remove OptimizeSubqueries from batch of PartitionPruning
### What changes were proposed in this pull request?

Remove `OptimizeSubqueries` from batch of `PartitionPruning` to make DPP support more cases. For example:
```sql
SELECT date_id, product_id FROM fact_sk f
JOIN (select store_id + 3 as new_store_id from dim_store where country = 'US') s
ON f.store_id = s.new_store_id
```

Before this PR:
```
== Physical Plan ==
*(2) Project [date_id#3998, product_id#3999]
+- *(2) BroadcastHashJoin [store_id#4001], [new_store_id#3997], Inner, BuildRight, false
   :- *(2) ColumnarToRow
   :  +- FileScan parquet default.fact_sk[date_id#3998,product_id#3999,store_id#4001] Batched: true, DataFilters: [], Format: Parquet, PartitionFilters: [isnotnull(store_id#4001), dynamicpruningexpression(true)], PushedFilters: [], ReadSchema: struct<date_id:int,product_id:int>
   +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), [id=#274]
      +- *(1) Project [(store_id#4002 + 3) AS new_store_id#3997]
         +- *(1) Filter ((isnotnull(country#4004) AND (country#4004 = US)) AND isnotnull((store_id#4002 + 3)))
            +- *(1) ColumnarToRow
               +- FileScan parquet default.dim_store[store_id#4002,country#4004] Batched: true, DataFilters: [isnotnull(country#4004), (country#4004 = US), isnotnull((store_id#4002 + 3))], Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(country), EqualTo(country,US)], ReadSchema: struct<store_id:int,country:string>
```

After this PR:
```
== Physical Plan ==
*(2) Project [date_id#3998, product_id#3999]
+- *(2) BroadcastHashJoin [store_id#4001], [new_store_id#3997], Inner, BuildRight, false
   :- *(2) ColumnarToRow
   :  +- FileScan parquet default.fact_sk[date_id#3998,product_id#3999,store_id#4001] Batched: true, DataFilters: [], Format: Parquet, PartitionFilters: [isnotnull(store_id#4001), dynamicpruningexpression(store_id#4001 IN dynamicpruning#4007)], PushedFilters: [], ReadSchema: struct<date_id:int,product_id:int>
   :        +- SubqueryBroadcast dynamicpruning#4007, 0, [new_store_id#3997], [id=#263]
   :           +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), [id=#262]
   :              +- *(1) Project [(store_id#4002 + 3) AS new_store_id#3997]
   :                 +- *(1) Filter ((isnotnull(country#4004) AND (country#4004 = US)) AND isnotnull((store_id#4002 + 3)))
   :                    +- *(1) ColumnarToRow
   :                       +- FileScan parquet default.dim_store[store_id#4002,country#4004] Batched: true, DataFilters: [isnotnull(country#4004), (country#4004 = US), isnotnull((store_id#4002 + 3))], Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(country), EqualTo(country,US)], ReadSchema: struct<store_id:int,country:string>
   +- ReusedExchange [new_store_id#3997], BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), [id=#262]
```
This is because `OptimizeSubqueries` will infer more filters, so we cannot reuse broadcasts. The following is the plan if disable `spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly`:
```
== Physical Plan ==
*(2) Project [date_id#3998, product_id#3999]
+- *(2) BroadcastHashJoin [store_id#4001], [new_store_id#3997], Inner, BuildRight, false
   :- *(2) ColumnarToRow
   :  +- FileScan parquet default.fact_sk[date_id#3998,product_id#3999,store_id#4001] Batched: true, DataFilters: [], Format: Parquet, PartitionFilters: [isnotnull(store_id#4001), dynamicpruningexpression(store_id#4001 IN subquery#4009)], PushedFilters: [], ReadSchema: struct<date_id:int,product_id:int>
   :        +- Subquery subquery#4009, [id=#284]
   :           +- *(2) HashAggregate(keys=[new_store_id#3997#4008], functions=[])
   :              +- Exchange hashpartitioning(new_store_id#3997#4008, 5), ENSURE_REQUIREMENTS, [id=#280]
   :                 +- *(1) HashAggregate(keys=[new_store_id#3997 AS new_store_id#3997#4008], functions=[])
   :                    +- *(1) Project [(store_id#4002 + 3) AS new_store_id#3997]
   :                       +- *(1) Filter (((isnotnull(store_id#4002) AND isnotnull(country#4004)) AND (country#4004 = US)) AND isnotnull((store_id#4002 + 3)))
   :                          +- *(1) ColumnarToRow
   :                             +- FileScan parquet default.dim_store[store_id#4002,country#4004] Batched: true, DataFilters: [isnotnull(store_id#4002), isnotnull(country#4004), (country#4004 = US), isnotnull((store_id#4002..., Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(store_id), IsNotNull(country), EqualTo(country,US)], ReadSchema: struct<store_id:int,country:string>
   +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), [id=#305]
      +- *(1) Project [(store_id#4002 + 3) AS new_store_id#3997]
         +- *(1) Filter ((isnotnull(country#4004) AND (country#4004 = US)) AND isnotnull((store_id#4002 + 3)))
            +- *(1) ColumnarToRow
               +- FileScan parquet default.dim_store[store_id#4002,country#4004] Batched: true, DataFilters: [isnotnull(country#4004), (country#4004 = US), isnotnull((store_id#4002 + 3))], Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(country), EqualTo(country,US)], ReadSchema: struct<store_id:int,country:string>
```

### Why are the changes needed?

Improve DPP to support more cases.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test and benchmark test:
SQL | Before this PR(Seconds) | After this PR(Seconds)
-- | -- | --
TPC-DS q58 | 40 | 20
TPC-DS q83 | 18 | 14

Closes #33664 from wangyum/SPARK-36444.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
2021-08-19 16:44:50 +08:00
Shixiong Zhu ea4919801a [SPARK-36519][SS] Store RocksDB format version in the checkpoint for streaming queries
### What changes were proposed in this pull request?

RocksDB provides backward compatibility but it doesn't always provide forward compatibility. It's better to store the RocksDB format version in the checkpoint so that it would give us more information to provide the rollback guarantee when we upgrade the RocksDB version that may introduce incompatible change in a new Spark version.

A typical case is when a user upgrades their query to a new Spark version, and this new Spark version has a new RocksDB version which may use a new format. But the user hits some bug and decide to rollback. But in the old Spark version, the old RocksDB version cannot read the new format.

In order to handle this case, we will write the RocksDB format version to the checkpoint. When restarting from a checkpoint, we will force RocksDB to use the format version stored in the checkpoint. This will ensure the user can rollback their Spark version if needed.

We also provide a config `spark.sql.streaming.stateStore.rocksdb.formatVersion` for users who don't need to rollback their Spark versions to overwrite the format version specified in the checkpoint.

### Why are the changes needed?

Provide the Spark version rollback guarantee for streaming queries when a new RocksDB introduces an incompatible format change.

### Does this PR introduce _any_ user-facing change?

No. RocksDB state store is a new feature in Spark 3.2, which has not yet released.

### How was this patch tested?

The new unit tests.

Closes #33749 from zsxwing/SPARK-36519.

Authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-08-19 00:23:40 -07:00
Max Gekk 1235bd29f0 [SPARK-36536][SQL] Use CAST for datetime in CSV/JSON by default
### What changes were proposed in this pull request?
In the PR, I propose to split the `dateFormat` and `timestampFormat` options in CSV/JSON datasources to:
- In write (`dateFormatInWrite`/`timestampFormatInWrite`). CSV/JSON datasource will use it in formatting of dates/timestamps. If an user doesn't initialise it, it will be set to a default value.
- In read (`dateFormatInRead`/`timestampFormatInRead`). The datasources will use it while parsing of input dates/timestamps strings. If an user doesn't set it, we will keep it as uninitialized (None), and use CAST to parse the input dates/timestamps strings.

### Why are the changes needed?
This should improve user experience with Spark SQL, and make the default parsing behavior more flexible.

### Does this PR introduce _any_ user-facing change?
Potentially, it can.

### How was this patch tested?
By existing test suites, and by new tests that are added to `JsonSuite` and to `CSVSuite`:
```
$ build/sbt "sql/test:testOnly *CSVLegacyTimeParserSuite"
$ build/sbt "sql/test:testOnly *JsonFunctionsSuite"
$ build/sbt "sql/test:testOnly *CSVv1Suite"
$ build/sbt "sql/test:testOnly *JsonV2Suite"
```

Closes #33769 from MaxGekk/split-datetime-ds-options.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-08-19 09:30:50 +03:00
Kousuke Saruta 013f2b7439 [SPARK-36512][UI][TESTS] Fix UISeleniumSuite in sql/hive-thriftserver
### What changes were proposed in this pull request?

This PR fixes an issue that `UISeleniumSuite` in `sql/hive-thriftserver` doesn't work even though the ignored test is enabled due to the following two reasons.

(1) The suite waits for thriftserver starting up by reading a startup message from stdin but the expected message is never read.
(2) The URL and CSS selector for test are wrong.

To resolve (1), this PR adopt the way that `HiveThriftServer2Suite` does.
3f8ec0dae4/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala (L1222-L1248)

This PR also enables the ignored test. Let's disable it again in the following PR if it's still flaky.

### Why are the changes needed?

Bug fix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CIs.

Closes #33741 from sarutak/fix-thrift-uiseleniumsuite.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-19 13:27:58 +08:00
Angerszhuuuu 559fe96a40 [SPARK-35991][SQL] Add PlanStability suite for TPCH
### What changes were proposed in this pull request?
Add PlanStability suite for TPCH

### Why are the changes needed?
Add PlanStability suite for TPCH

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #33736 from AngersZhuuuu/SPARK-35991.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-19 11:40:34 +08:00
Kousuke Saruta c458edb77e [SPARK-36371][SQL] Support raw string literal
### What changes were proposed in this pull request?

This PR proposes to support raw string literal which escape no character using `\`.
The raw string literal is the form of `r"..."` or `r'...'` like the syntax BigQuery and Python supports.

Actually, there is no standard way to represent raw string literals.

In PostgreSQL, any special character isn't escaped by \ unless a string literal starts with E prefix.
https://www.postgresql.org/docs/13/sql-syntax-lexical.html#SQL-SYNTAX-CONSTANTS

In MySQL, special characters in a string literal are not escaped if NO_BACKSLASH_ESCAPES is enabled.
https://dev.mysql.com/doc/refman/8.0/en/string-literals.html

In MsSQLServer, any special character isn't escaped by \ but STRING_ESCAPE function can escape such characters.
https://docs.microsoft.com/en-us/sql/t-sql/functions/string-escape-transact-sql?view=sql-server-ver15

But BigQuery supports `r"..."` and `r'...'` forms so this PR proposes this syntax for the purpose.
https://cloud.google.com/bigquery/docs/reference/standard-sql/lexical#literals

### Why are the changes needed?

In the current master, sometimes it's too confusable to represent JSON and regex in a string literal if they contain backslash.
For example, in JSON, `\` needs to be escaped like as follows.
```
{"a": "\\"}
```
But, if the JSON above is represented in a string literal, further two `\` are needed because string literal also requires `\` to be escaped.
```
SELECT from_json('{"a": "\\\\"}', 'a string')
{"a":"\"}
```
With the raw string literal, we can represent such JSON like as follows.
```
SELECT from_json(r'{"a": "\\"}', 'a string')
{"a":"\"}
```

### Does this PR introduce _any_ user-facing change?

No. This PR just extends the existing syntax.

### How was this patch tested?

Added new test.
I also confirmed that the modified document is successfully built with `SKIP_API=1 bundle exec jekyll build`.
![raw_string_literal](https://user-images.githubusercontent.com/4736016/129223184-3bb4e206-f40f-42d2-a128-0cd8fc83b9c9.png)

Closes #33599 from sarutak/raw-string-literal.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-19 11:37:32 +08:00
tooptoop4 2fc9c0bfb5 [SPARK-36147][SQL] Warn if less files visible after stats write in BasicWriteStatsTracker
### What changes were proposed in this pull request?

This log should at least be WARN not INFO (in org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala )
"Expected $numSubmittedFiles files, but only saw $numFiles."

### Why are the changes needed?

INFO logs don't indicate possible issue but WARN logs should

### Does this PR introduce any user-facing change?

Yes, Log level changed.

### How was this patch tested?

manual, trivial change

Closes #33332 from tooptoop4/warn.

Authored-by: tooptoop4 <33283496+tooptoop4@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-19 10:31:10 +09:00
yangjie01 1859d9bc85 [SPARK-36407][CORE][SQL] Convert int to long to avoid potential integer multiplications overflow risk
### What changes were proposed in this pull request?
The main change of this pr is converting an int literal to a long literal to avoid potential integer multiplications overflow risk.

For example:

**Before**

```java
void f(int i) {
    long val = 65536 * i;
  }
```

**After**

```java
void f(int i) {
    long val = 65536L * i;
  }
```

### Why are the changes needed?
Avoid potential integer multiplications overflow risk

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #33629 from LuciferYang/cast-to-long.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-08-18 11:30:37 -05:00
gengjiaan 707eefa3c7 [SPARK-36428][SQL][FOLLOWUP] Simplify the implementation of make_timestamp
### What changes were proposed in this pull request?
The implement of https://github.com/apache/spark/pull/33665 make `make_timestamp` could accepts integer type as the seconds parameter.
This PR let `make_timestamp` accepts `decimal(16,6)` type as the seconds parameter and cast integer to `decimal(16,6)` is safe, so we can simplify the code.

### Why are the changes needed?
Simplify `make_timestamp`.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
New tests.

Closes #33775 from beliefer/SPARK-36428-followup.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-18 22:57:06 +08:00
Kousuke Saruta b914ff7d54 [SPARK-36400][SPARK-36398][SQL][WEBUI] Make ThriftServer recognize spark.sql.redaction.string.regex
### What changes were proposed in this pull request?

This PR fixes an issue that ThriftServer doesn't recognize `spark.sql.redaction.string.regex`.
The problem is that sensitive information included in queries can be exposed.
![thrift-password1](https://user-images.githubusercontent.com/4736016/129440772-46379cc5-987b-41ac-adce-aaf2139f6955.png)
![thrift-password2](https://user-images.githubusercontent.com/4736016/129440775-fd328c0f-d128-4a20-82b0-46c331b9fd64.png)

### Why are the changes needed?

Bug fix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Ran ThriftServer, connect to it and execute `CREATE TABLE mytbl2(a int) OPTIONS(url="jdbc:mysql//example.com:3306", driver="com.mysql.jdbc.Driver", dbtable="test_tbl", user="test_usr", password="abcde");` with `spark.sql.redaction.string.regex=((?i)(?<=password=))(".*")|('.*')`
Then, confirmed UI.

![thrift-hide-password1](https://user-images.githubusercontent.com/4736016/129440863-cabea247-d51f-41a4-80ac-6c64141e1fb7.png)
![thrift-hide-password2](https://user-images.githubusercontent.com/4736016/129440874-96cd0f0c-720b-4010-968a-cffbc85d2be5.png)

Closes #33743 from sarutak/thrift-redact.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-08-18 13:31:22 +09:00
Gengliang Wang 8bfb4f1e72 Revert "[SPARK-35028][SQL] ANSI mode: disallow group by aliases"
### What changes were proposed in this pull request?

Revert [[SPARK-35028][SQL] ANSI mode: disallow group by aliases ](https://github.com/apache/spark/pull/32129)

### Why are the changes needed?

It turns out that many users are using the group by alias feature.  Spark has its precedence rule when alias names conflict with column names in Group by clause: always use the table column. This should be reasonable and acceptable.
Also, external DBMS such as PostgreSQL and MySQL allow grouping by alias, too.

As we are going to announce ANSI mode GA in Spark 3.2, I suggest allowing the group by alias in ANSI mode.

### Does this PR introduce _any_ user-facing change?

No, the feature is not released yet.

### How was this patch tested?

Unit tests

Closes #33758 from gengliangwang/revertGroupByAlias.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-17 20:23:49 +08:00
Max Gekk 82a31508af [SPARK-36524][SQL] Common class for ANSI interval types
### What changes were proposed in this pull request?
Add new type `AnsiIntervalType` to `AbstractDataType.scala`, and extend it by `YearMonthIntervalType` and by `DayTimeIntervalType`

### Why are the changes needed?
To improve code maintenance. The change will allow to replace checking of both `YearMonthIntervalType` and `DayTimeIntervalType` by a check of `AnsiIntervalType`, for instance:
```scala
    case _: YearMonthIntervalType | _: DayTimeIntervalType => false
```
by
```scala
    case _: AnsiIntervalType => false
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By existing test suites.

Closes #33753 from MaxGekk/ansi-interval-type-trait.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-08-17 12:27:56 +03:00
Gengliang Wang 26d6b952dc [SPARK-36521][SQL] Disallow comparison between Interval and String
### What changes were proposed in this pull request?

Disallow comparison between Interval and String in the default type coercion rules.

### Why are the changes needed?

If a binary comparison contains interval type and string type, we can't decide which
interval type the string should be promoted as. There are many possible interval
types, such as year interval, month interval, day interval, hour interval, etc.

### Does this PR introduce _any_ user-facing change?

No, the new interval type is not released yet.

### How was this patch tested?

Existing UT

Closes #33750 from gengliangwang/disallowCom.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-08-16 22:41:14 +03:00
Max Gekk f620996142 [SPARK-36418][SQL] Use CAST in parsing of dates/timestamps with default pattern
### What changes were proposed in this pull request?
In the PR, I propose to use the `CAST` logic when the pattern is not specified in `DateFormatter` or `TimestampFormatter`. In particular, invoke the `DateTimeUtils.stringToTimestampAnsi()` or `stringToDateAnsi()` in the case.

### Why are the changes needed?
1. This can improve user experience with Spark SQL by making the default date/timestamp parsers more flexible and tolerant to their inputs.
2. We make the default case consistent to the behavior of the `CAST` expression which makes implementation more consistent.

### Does this PR introduce _any_ user-facing change?
The changes shouldn't introduce behavior change in regular cases but it can influence on corner cases. New implementation is able to parse more dates/timestamps by default. For instance, old (current) date parses can recognize dates only in the format **yyyy-MM-dd** but new one can handle:
   * `[+-]yyyy*`
   * `[+-]yyyy*-[m]m`
   * `[+-]yyyy*-[m]m-[d]d`
   * `[+-]yyyy*-[m]m-[d]d `
   * `[+-]yyyy*-[m]m-[d]d *`
   * `[+-]yyyy*-[m]m-[d]dT*`

Similarly for timestamps. The old (current) timestamp formatter is able to parse timestamps only in the format **yyyy-MM-dd HH:mm:ss** by default, but new implementation can handle:
   * `[+-]yyyy*`
   * `[+-]yyyy*-[m]m`
   * `[+-]yyyy*-[m]m-[d]d`
   * `[+-]yyyy*-[m]m-[d]d `
   * `[+-]yyyy*-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]`
   * `[+-]yyyy*-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]`
   * `[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]`
   * `T[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]`

### How was this patch tested?
By running the affected test suites:
```
$ build/sbt "test:testOnly *ImageFileFormatSuite"
$ build/sbt "test:testOnly *ParquetV2PartitionDiscoverySuite"
```

Closes #33709 from MaxGekk/datetime-cast-default-pattern.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-16 23:29:33 +08:00
Wenchen Fan f4b31c6068 [SPARK-36498][SQL] Reorder inner fields of the input query in byName V2 write
### What changes were proposed in this pull request?

Today, when we write data to a v2 table with byName mode, we only reorder the top-level columns, not inner struct fields. This doesn't make sense as Spark should treat inner struct fields as the first-class citizen (e.g. nested column pruning, filter pushdown with nested columns).

This PR improves `TableOutputResolver` to reorder inner fields as well.

### Why are the changes needed?

better user-experience

### Does this PR introduce _any_ user-facing change?

yes, more queries are allowed to write to v2 tables.

### How was this patch tested?

new test

Closes #33728 from cloud-fan/reorder.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-16 15:08:08 +08:00
Kousuke Saruta 9b9db5a8a0 [SPARK-36491][SQL] Make from_json/to_json to handle timestamp_ntz type properly
### What changes were proposed in this pull request?

This PR fixes an issue that `from_json` and `to_json` cannot handle `timestamp_ntz` type properly.
In the current master, `from_json`/`to_json` can handle `timestamp` type like as follows.
```
SELECT from_json('{"a":"2021-11-23 11:22:33"}', "a TIMESTAMP");
{"a":2021-11-23 11:22:33}
```
```
SELECT to_json(map("a", TIMESTAMP"2021-11-23 11:22:33"));
{"a":"2021-11-23T11:22:33.000+09:00"}
```
But they cannot handle `timestamp_ntz` type properly.
```
SELECT from_json('{"a":"2021-11-23 11:22:33"}', "a TIMESTAMP_NTZ");
21/08/12 16:16:00 ERROR SparkSQLDriver: Failed in [SELECT from_json('{"a":"2021-11-23 11:22:33"}', "a TIMESTAMP_NTZ")]
java.lang.Exception: Unsupported type: timestamp_ntz
        at org.apache.spark.sql.errors.QueryExecutionErrors$.unsupportedTypeError(QueryExecutionErrors.scala:777)
        at org.apache.spark.sql.catalyst.json.JacksonParser.makeConverter(JacksonParser.scala:339)
        at org.apache.spark.sql.catalyst.json.JacksonParser.$anonfun$makeConverter$17(JacksonParser.scala:313)
```
```
SELECT to_json(map("a", TIMESTAMP_NTZ"2021-11-23 11:22:33"));
21/08/12 16:14:07 ERROR SparkSQLDriver: Failed in [SELECT to_json(map("a", TIMESTAMP_NTZ"2021-11-23 11:22:33"))]
java.lang.RuntimeException: Failed to convert value 1637666553000000 (class of class java.lang.Long) with the type of TimestampNTZType to JSON.
        at org.apache.spark.sql.errors.QueryExecutionErrors$.failToConvertValueToJsonError(QueryExecutionErrors.scala:294)
        at org.apache.spark.sql.catalyst.json.JacksonGenerator.$anonfun$makeWriter$25(JacksonGenerator.scala:201)
        at org.apache.spark.sql.catalyst.json.JacksonGenerator.$anonfun$makeWriter$25$adapted(JacksonGenerator.scala:199)
        at org.apache.spark.sql.catalyst.json.JacksonGenerator.writeMapData(JacksonGenerator.scala:253)
        at org.apache.spark.sql.catalyst.json.JacksonGenerator.$anonfun$write$3(JacksonGenerator.scala:293)
        at org.apache.spark.sql.catalyst.json.JacksonGenerator.writeObject(JacksonGenerator.scala:206)
        at org.apache.spark.sql.catalyst.json.JacksonGenerator.write(JacksonGenerator.scala:292)
```
### Why are the changes needed?

Bug fix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New test.

Closes #33742 from sarutak/json-ntz.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-08-16 10:03:22 +03:00
Liang-Chi Hsieh 8b8d91cf64 [SPARK-36465][SS] Dynamic gap duration in session window
### What changes were proposed in this pull request?

This patch supports dynamic gap duration in session window.

### Why are the changes needed?

The gap duration used in session window for now is a static value. To support more complex usage, it is better to support dynamic gap duration which determines the gap duration by looking at the current data. For example, in our usecase, we may have different gap by looking at the certain column in the input rows.

### Does this PR introduce _any_ user-facing change?

Yes, users can specify dynamic gap duration.

### How was this patch tested?

Modified existing tests and new test.

Closes #33691 from viirya/dynamic-session-window-gap.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-08-16 11:06:00 +09:00
Pablo Langa a9ab41ad56 [SPARK-35320][SQL] Align error message for unsupported key types in MapType in Json reader
### What changes were proposed in this pull request?

This PR is related with https://github.com/apache/spark/pull/33525.
The purpose is to align error messages between the function from_json and the Json reader for unsupported key types in MapType.
Current behavior:
```
scala> spark.read.schema(StructType(Seq(StructField("col", MapType(IntegerType, StringType))))).json(Seq("""{"1": "test"}""").toDS()).show
+----+
| col|
+----+
|null|
+----+

```
```
scala> Seq("""{"1": "test"}""").toDF("col").write.json("/tmp/jsontests1234")

scala> spark.read.schema(StructType(Seq(StructField("col", MapType(IntegerType, StringType))))).json("/tmp/jsontests1234").show
+----+
| col|
+----+
|null|
+----+
```
With this change, an AnalysisException with the message `"Input schema $schema can only contain StringType as a key type for a MapType."` wil be thrown

### Why are the changes needed?

It's more consistent to align the behavior

### Does this PR introduce _any_ user-facing change?

Yes, now an Exception will be thrown

### How was this patch tested?

Unit testing, manual testing

Closes #33672 from planga82/feature/spark35320_improve_error_message_reader.

Authored-by: Pablo Langa <soypab@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-15 10:31:57 +09:00
Huaxin Gao 3f8ec0dae4 [SPARK-34952][SQL][FOLLOWUP] Normalize pushed down aggregate col name and group by col name
### What changes were proposed in this pull request?
Normalize pushed down aggregate col names and group by col names ...

### Why are the changes needed?
to handle case sensitive col names

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Modify existing test

Closes #33739 from huaxingao/normalize.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-13 22:31:21 -07:00
Gengliang Wang ecdea91602 [SPARK-36508][SQL] ANSI type coercion: disallow binary operations between Interval and String literal
### What changes were proposed in this pull request?

If a binary operation contains interval type and string literal, we can't decide which interval type the string literal should be promoted as. There are many possible interval types, such as year interval, month interval, day interval, hour interval, etc.
The related binary operation for Interval contains
- Add
- Subtract
- Comparisions

Note that `Interval Multiple/Divide StringLiteral` is valid as them is not binary operators(the left and right are not of the same type). This PR also add tests for them.

### Why are the changes needed?

Avoid ambiguously implicit casting string literals to interval types.

### Does this PR introduce _any_ user-facing change?

No, the ANSI type coercion is not released yet.

### How was this patch tested?

New tests.

Closes #33737 from gengliangwang/disallowStringAndInterval.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-14 10:45:05 +08:00
Huaxin Gao 959cf50a1b [SPARK-36503][SQL] Add RowToColumnConverter for BinaryType
### What changes were proposed in this pull request?
Add RowToColumnConverter for BinaryType

### Why are the changes needed?
currently, we have RowToColumnConverter for all data types except BinaryType
```
  private def getConverterForType(dataType: DataType, nullable: Boolean): TypeConverter = {
    val core = dataType match {
      case BooleanType => BooleanConverter
      case ByteType => ByteConverter
      case ShortType => ShortConverter
      case IntegerType | DateType => IntConverter
      case FloatType => FloatConverter
      case LongType | TimestampType => LongConverter
      case DoubleType => DoubleConverter
      case StringType => StringConverter
      case CalendarIntervalType => CalendarConverter
      case at: ArrayType => ArrayConverter(getConverterForType(at.elementType, at.containsNull))
      case st: StructType => new StructConverter(st.fields.map(
        (f) => getConverterForType(f.dataType, f.nullable)))
      case dt: DecimalType => new DecimalConverter(dt)
      case mt: MapType => MapConverter(getConverterForType(mt.keyType, nullable = false),
        getConverterForType(mt.valueType, mt.valueContainsNull))
      case unknown => throw QueryExecutionErrors.unsupportedDataTypeError(unknown.toString)
    }
```
so add one for BinaryType

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
modify existing test

Closes #33733 from huaxingao/binary_converter.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-13 10:08:45 -07:00
yangjie01 1da1e33a49 [SPARK-36495][SQL] Use type match to simplify methods in CatalystTypeConverter
### What changes were proposed in this pull request?

`CatalystTypeConverter.toCatalyst` method use `isInstanceOf  + asInstanceOf` for type conversion, the main change of this pr is use  type match to simplify this process.

`CatalystTypeConverters.createToCatalystConverter` method has a similar pattern.

### Why are the changes needed?
Code simplification

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Pass the Jenkins or GitHub Action
- Add a new case to `ScalaReflectionSuite` to add the coverage of the `case None` branch of `CatalystTypeConverter#toCatalyst` method

Closes #33722 from LuciferYang/SPARK-36495.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-13 19:24:22 +09:00
Kousuke Saruta 7fd34548b1 [SPARK-36490][SQL] Make from_csv/to_csv to handle timestamp_ntz type properly
### What changes were proposed in this pull request?

This PR fixes an issue that `from_csv` and `to_csv` cannot handle `timestamp_ntz` type properly.
In the current master, to_csv/from_csv can handle timestamp type like as follows.
```
SELECT to_csv(struct(TIMESTAMP"2021-11-23 11:22:33"));
2021-11-23T11:22:33.000+09:00
```
```
SELECT from_csv("2021-11-23 11:22:33", "a TIMESTAMP");
{"a":2021-11-23 11:22:33}
```

But they cannot handle timestamp_ntz type properly.
```
SELECT to_csv(struct(TIMESTAMP_NTZ"2021-11-23 11:22:33"));
-- 2021-11-23T11:22:33.000 is expected.
1637666553000000
```
```
SELECT from_csv("2021-11-23 11:22:33", "a TIMESTAMP_NTZ");
21/08/12 16:12:49 ERROR SparkSQLDriver: Failed in [SELECT from_csv("2021-11-23 11:22:33", "a TIMESTAMP_NTZ")]
java.lang.Exception: Unsupported type: timestamp_ntz
        at org.apache.spark.sql.errors.QueryExecutionErrors$.unsupportedTypeError(QueryExecutionErrors.scala:777)
        at org.apache.spark.sql.catalyst.csv.UnivocityParser.makeConverter(UnivocityParser.scala:234)
        at org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$valueConverters$1(UnivocityParser.scala:134)
```

### Why are the changes needed?

Bug fix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New test.

Closes #33719 from sarutak/csv-ntz.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-08-13 12:08:53 +03:00
Gengliang Wang eb6be7f1ee [SPARK-36499][SQL][TESTS] Test Interval multiply / divide null
### What changes were proposed in this pull request?

Test the following valid operations:
```
year-month interval * null
null * year-month interval
year-month interval / null
```
and invalid operations:
```
null / interval
int / interval
```

### Why are the changes needed?

Improve test coverage

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass CI

Closes #33729 from gengliangwang/addTest.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-08-13 11:05:57 +03:00
gengjiaan 7d82336734 [SPARK-36428][SQL] the seconds parameter of make_timestamp should accept integer type
### What changes were proposed in this pull request?
With ANSI mode, `SELECT make_timestamp(1, 1, 1, 1, 1, 1)` fails, because the 'seconds' parameter needs to be of type DECIMAL(8,6), and INT can't be implicitly casted to DECIMAL(8,6) under ANSI mode.

```
org.apache.spark.sql.AnalysisException
cannot resolve 'make_timestamp(1, 1, 1, 1, 1, 1)' due to data type mismatch: argument 6 requires decimal(8,6) type, however, '1' is of int type.; line 1 pos 7
```

We should update the function `make_timestamp` to allow integer type 'seconds' parameter.

### Why are the changes needed?
Make `make_timestamp` could accepts integer as 'seconds' parameter.

### Does this PR introduce _any_ user-facing change?
'Yes'.
`make_timestamp` could accepts integer as 'seconds' parameter.

### How was this patch tested?
New tests.

Closes #33665 from beliefer/SPARK-36428.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-13 13:13:02 +08:00
Maryann Xue 29b1e394c6 [SPARK-36447][SQL] Avoid inlining non-deterministic With-CTEs
### What changes were proposed in this pull request?
This PR fixes an existing correctness issue where a non-deterministic With-CTE can be executed multiple times producing different results, by deferring the inline of With-CTE to after the analysis stage. This fix also provides the future opportunity of performance improvement by executing deterministic With-CTEs only once in some circumstances.

The major changes include:
1. Added new With-CTE logical nodes: `CTERelationDef`, `CTERelationRef`, `WithCTE`. Each `CTERelationDef` has a unique ID and the mapping between CTE def and CTE ref is based on IDs rather than names. `WithCTE` is a resolved version of `With`, only that: 1) `WithCTE` is a multi-children logical node so that most logical rules can automatically apply to CTE defs; 2) In the main query and each subquery, there can only be at most one `WithCTE`, which means nested With-CTEs are combined.
2. Changed `CTESubstitution` rule so that if NOT in legacy mode, CTE defs will not be inlined immediately, but rather transformed into a `CTERelationRef` per reference.
3. Added new With-CTE rules: 1) `ResolveWithCTE` - to update `CTERelationRef`s with resolved output from corresponding `CTERelationDef`s; 2) `InlineCTE` - to inline deterministic CTEs or non-deterministic CTEs with only ONE reference; 3) `UpdateCTERelationStats` - to update stats for `CTERelationRef`s that are not inlined.
4. Added a CTE physical planning strategy to plan `CTERelationRef`s as an independent shuffle with round-robin partitioning so that such CTEs will only be materialized once and different references will later be a shuffle reuse.

A current limitation is that With-CTEs mixed with SQL commands or DMLs will still go through the old inline code path because of our non-standard language specs and not-unified command/DML interfaces.

### Why are the changes needed?
This is a correctness issue. Non-deterministic CTEs should produce the same output regardless of how many times it is referenced/used in query, while under the current implementation there is no such guarantee and would lead to incorrect query results.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added UTs.
Regenerated golden files for TPCDS plan stability tests. There is NO change to the `simplified.txt` files, the only differences are expression IDs.

Closes #33671 from maryannxue/spark-36447.

Authored-by: Maryann Xue <maryann.xue@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-13 11:37:02 +08:00
Dongjoon Hyun e8e5785f02 [SPARK-36502][SQL] Remove jaxb-api from sql/catalyst module
### What changes were proposed in this pull request?

This PR aims to remove `jaxb-api` usage from `sql/catalyst` module.

### Why are the changes needed?

We only use `DatatypeConverter.parseHexBinary` and `DatatypeConverter.printHexBinary` twice.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #33732 from dongjoon-hyun/SPARK-36502.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-13 12:31:09 +09:00
Gengliang Wang d4466d55ca [SPARK-36497][SQL] Support Interval add/subtract NULL
### What changes were proposed in this pull request?

Currently, `null + interval` will become `cast(cast(null as timestamp) + interval) as null`. This is a unexpected behavior and the result should not be of null type.
This weird behavior applies to `null - interval`, `interval + null`, `interval - null` as well.
To change it, I propose to cast the null as the same data type of the other element in the add/subtract:
```
null + interval => cast(null as interval) + interval
null - interval => cast(null as interval) - interval
interval + null=> interval + cast(null as interval)
interval - null => interval - cast(null as interval)
```

### Why are the changes needed?

Change the confusing behavior of `Interval +/- NULL` and `NULL +/- Interval`

### Does this PR introduce _any_ user-facing change?

No, the new interval type is not released yet.

### How was this patch tested?

Existing UT

Closes #33727 from gengliangwang/intervalTypeCoercion.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-13 11:10:32 +08:00
Gengliang Wang 48e333af54 [SPARK-36445][SQL][FOLLOWUP] ANSI type coercion: revisit promoting string literals in datetime expressions
### What changes were proposed in this pull request?

1. Promote more string literal in subtractions. In the ANSI type coercion rule, we already promoted
```
string - timestamp => cast(string as timestamp) - timestamp
```
This PR is to promote the following string literals:
```
string - date => cast(string as date) - date
date - string => date - cast(date as string)
timestamp - string => timestamp
```
It is very straightforward to cast the string literal as the data type of the other side in the subtraction.

2. Merge the string promotion logic from the rule `StringLiteralCoercion`:
```
date_sub(date, string) => date_sub(date, cast(string as int))
date_add(date, string) => date_add(date, cast(string as int))
```

### Why are the changes needed?

1. Promote the string literal in the subtraction as the data type of the other side. This is straightforward and consistent with PostgreSQL
2. Certerize all the string literal promotion in the ANSI type coercion rule

### Does this PR introduce _any_ user-facing change?

No, the new ANSI type coercion rules are not released yet.

### How was this patch tested?

Existing UT

Closes #33724 from gengliangwang/datetimeTypeCoercion.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-13 01:02:34 +08:00
Wenchen Fan 124d011ee7 [SPARK-35881][SQL][FOLLOWUP] Remove the AQE post stage creation extension
### What changes were proposed in this pull request?

This is a followup of #33140

It turns out that we may be able to complete the AQE and columnar execution integration without the AQE post stage creation extension. The rule `ApplyColumnarRulesAndInsertTransitions` can add to-columnar transition if the shuffle/broadcast supports columnar.

### Why are the changes needed?

remove APIs that are not needed.

### Does this PR introduce _any_ user-facing change?

No, the APIs are not released yet.

### How was this patch tested?

existing and manual tests

Closes #33701 from cloud-fan/aqe.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-12 21:35:28 +08:00
IonutBoicuAms 2b665751d9 [SPARK-36489][SQL] Aggregate functions over no grouping keys, on tables with a single bucket, return multiple rows
### What changes were proposed in this pull request?

This PR fixes a bug in `DisableUnnecessaryBucketedScan`.
When running any aggregate function, without any grouping keys, on a table with a single bucket, multiple rows are returned.
This happens because the aggregate function satisfies the `AllTuples` distribution, no `Exchange` will be planned, and the bucketed scan will be disabled.

### Why are the changes needed?

Bug fixing. Aggregates over no grouping keys should return a single row.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added new test in `DisableUnnecessaryBucketedScanSuite`.

Closes #33711 from IonutBoicuAms/fix-bug-disableunnecessarybucketedscan.

Authored-by: IonutBoicuAms <ionut.boicu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-12 15:22:38 +08:00
Wenchen Fan 00a4364f38 [SPARK-36479][SQL][TEST] Improve datetime test coverage in SQL files
### What changes were proposed in this pull request?

This PR adds more datetime tests in `date.sql` and `timestamp.sql`, especially for string promotion.

### Why are the changes needed?

improve test coverage

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A

Closes #33707 from cloud-fan/test.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-12 12:51:50 +08:00
Jungtaek Lim fac4e5eb3e [SPARK-36480][SS] SessionWindowStateStoreSaveExec should not filter input rows against watermark
### What changes were proposed in this pull request?

This PR proposes to remove the filter applying to input rows against watermark in SessionWindowStateStoreSaveExec, since SessionWindowStateStoreSaveExec is expected to store all inputs into state store, and apply eviction later.

### Why are the changes needed?

The code is logically not right, though I can't reproduce the actual problem.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests. I can't come up with broken case failing on previous code, but we can review the logic instead.

Closes #33708 from HeartSaVioR/SPARK-36480.

Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-08-11 20:10:59 -07:00
Cheng Su 79515e4b6c [SPARK-32986][SQL] Add bucketed scan info in query plan of data source v1
### What changes were proposed in this pull request?

As a followup from discussion in https://github.com/apache/spark/pull/29804#discussion_r493229395 , currently the query plan for data source v1 scan operator - `FileSourceScanExec` has no information to indicate whether the table is read as bucketed table or not. And if table not read as bucketed table, what's the reason behind it. Add this info into `FileSourceScanExec` physical query plan output, can help users and developers understand query plan more easily without spending a lot of time debugging why table is not read as bucketed table.

### Why are the changes needed?

Help users and developers debug query plan for bucketed table.

### Does this PR introduce _any_ user-facing change?

The added `Bucketed` information in physical query plan when reading bucketed table.
Note for reading non-bucketed table, the query plan stays same and nothing is changed.

Example:

```
Seq((1, 2), (2, 3)).toDF("i", "j").write.bucketBy(8, "i").saveAsTable("t1")
Seq(2, 3).toDF("i").write.bucketBy(8, "i").saveAsTable("t2")
val df1 = spark.table("t1")
val df2 = spark.table("t2")
df1.join(df2, df1("i") === df2("i"))
```

```
AdaptiveSparkPlan isFinalPlan=false
+- SortMergeJoin [i#20], [i#24], Inner
   :- Sort [i#20 ASC NULLS FIRST], false, 0
   :  +- Filter isnotnull(i#20)
   :     +- FileScan parquet default.t1[i#20,j#21] Batched: true, Bucketed: true, DataFilters: [isnotnull(i#20)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/chengsu/spark/sql/core/spark-warehouse/org.apache.spark.sq..., PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct<i:int,j:int>, SelectedBucketsCount: 8 out of 8
   +- Sort [i#24 ASC NULLS FIRST], false, 0
      +- Filter isnotnull(i#24)
         +- FileScan parquet default.t2[i#24] Batched: true, Bucketed: true, DataFilters: [isnotnull(i#24)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/chengsu/spark/sql/core/spark-warehouse/org.apache.spark.sq..., PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct<i:int>, SelectedBucketsCount: 8 out of 8
```

### How was this patch tested?

Added unit test in `ExplainSuite.scala`.

Closes #33698 from c21/scan-v1.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-11 19:41:36 -07:00
yangjie01 f7c85b855b [SPARK-36456][CORE][SQL][SS] Clean up compilation warnings related to method closeQuietly in class IOUtils is deprecated
### What changes were proposed in this pull request?
There are some compilation warnings related to `method closeQuietly in class IOUtils is deprecated`,  `Apache commons-io` suggests that we should use the `try-with-resources` statement or handle suppressed exceptions manually.

The main change of this pr is replace  `o.a.commons.io.IOUtils.closeQuietly` with `o.a.s.network.util.JavaUtils.closeQuietly` directly because all original logic is suppressing `IOException`.

### Why are the changes needed?
Clean up compilation warnings related to `method closeQuietly in class IOUtils is deprecated`

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Pass the Jenkins or GitHub Action

Closes #33682 from LuciferYang/closeQuietly.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-10 23:17:57 -07:00
Gengliang Wang 3029e62a82 [SPARK-36445][SQL] ANSI type coercion rule for date time operations
### What changes were proposed in this pull request?

Implement a new rule for the date-time operations in the ANSI type coercion system:
1. Date will be converted to Timestamp when it is in the subtraction with Timestmap.
2. Promote string literals in date_add/date_sub/time_add

### Why are the changes needed?

Currently the type coercion rule `DateTimeOperations` doesn't match the design of the ANSI type coercion system:
1. For date_add/date_sub, if the input is timestamp type, Spark should not convert it into date type since date type is narrower than the timestamp type.
2. For date_add/date_sub/time_add, string value can be implicit cast to date/timestamp only when it is literal.

Thus, we need to have a new rule for the date-time operations in the ANSI type coercion system.

### Does this PR introduce _any_ user-facing change?

No, the ANSI type coercion rules are not releaesd.

### How was this patch tested?

New UT

Closes #33666 from gengliangwang/datetimeOp.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-11 11:55:45 +08:00
Jungtaek Lim ed60aaa9f1 [SPARK-36463][SS] Prohibit update mode in streaming aggregation with session window
### What changes were proposed in this pull request?

This PR proposes to prohibit update mode in streaming aggregation with session window.

UnsupportedOperationChecker will check and prohibit the case. As a side effect, this PR also simplifies the code as we can remove the implementation of iterator to support outputs of update mode.

This PR also cleans up test code via deduplicating.

### Why are the changes needed?

The semantic of "update" mode for session window based streaming aggregation is quite unclear.

For normal streaming aggregation, Spark will provide the outputs which can be "upsert"ed based on the grouping key. This is based on the fact grouping key won't be changed.

This doesn't hold true for session window based streaming aggregation, as session range is changing.

If end users leverage their knowledge about streaming aggregation, they will consider the key as grouping key + session (since they'll specify these things in groupBy), and it's high likely possible that existing row is not updated (overwritten) and ended up with having different rows.

If end users consider the key as grouping key, there's a small chance for end users to upsert the session correctly, though only the last updated session will be stored so it won't work with event time processing which there could be multiple active sessions.

### Does this PR introduce _any_ user-facing change?

No, as we haven't released this feature.

### How was this patch tested?

Updated tests.

Closes #33689 from HeartSaVioR/SPARK-36463.

Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-08-11 10:45:52 +09:00
gengjiaan 186815be1c [SPARK-36429][SQL] JacksonParser should throw exception when data type unsupported
### What changes were proposed in this pull request?
Currently, when `set spark.sql.timestampType=TIMESTAMP_NTZ`, the behavior is different between `from_json` and `from_csv`.
```
-- !query
select from_json('{"t":"26/October/2015"}', 't Timestamp', map('timestampFormat', 'dd/MMMMM/yyyy'))
-- !query schema
struct<from_json({"t":"26/October/2015"}):struct<t:timestamp_ntz>>
-- !query output
{"t":null}
```

```
-- !query
select from_csv('26/October/2015', 't Timestamp', map('timestampFormat', 'dd/MMMMM/yyyy'))
-- !query schema
struct<>
-- !query output
java.lang.Exception
Unsupported type: timestamp_ntz
```

We should make `from_json` throws exception too.
This PR fix the discussion below
https://github.com/apache/spark/pull/33640#discussion_r682862523

### Why are the changes needed?
Make the behavior of `from_json` more reasonable.

### Does this PR introduce _any_ user-facing change?
'Yes'.
from_json throwing Exception when we set spark.sql.timestampType=TIMESTAMP_NTZ.

### How was this patch tested?
Tests updated.

Closes #33684 from beliefer/SPARK-36429-new.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-10 22:52:20 +08:00
Angerszhuuuu 89d8a4eacf [SPARK-36431][SQL] Support TypeCoercion of ANSI intervals with different fields
### What changes were proposed in this pull request?
 Support TypeCoercion of ANSI intervals with different fields

### Why are the changes needed?
 Support TypeCoercion of ANSI intervals with different fields

### Does this PR introduce _any_ user-facing change?
After this pr user can
 - use comparison function with  different fields of DayTimeIntervalType/YearMonthIntervalType such as `INTERVAL '1' YEAR` > `INTERVAL '11' MONTH`
 - support different field of ansi interval type in collection function such as `array(INTERVAL '1' YEAR, INTERVAL '11' MONTH)`
 - support different field of ansi interval type in `coalesce` etc..

### How was this patch tested?
Added UT

Closes #33661 from AngersZhuuuu/SPARK-SPARK-36431.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-08-10 14:22:31 +03:00
Cheng Pan 7f56b73cad [SPARK-36466][SQL] Table in unloaded catalog referenced by view should load correctly
### What changes were proposed in this pull request?

Retain `spark.sql.catalog.*` confs when resolving view.

### Why are the changes needed?

Currently, if a view in default catalog ref a table in another catalog (e.g. jdbc), `org.apache.spark.sql.AnalysisException: Table or view not found: cat.t` will be thrown on accessing the view if the catalog has not been loaded yet.

### Does this PR introduce _any_ user-facing change?

Yes, bug fix.

### How was this patch tested?

Add UT.

Closes #33692 from pan3793/SPARK-36466.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-10 17:31:21 +08:00
Terry Kim e1a5d94117 [SPARK-36449][SQL] v2 ALTER TABLE REPLACE COLUMNS should check duplicates for the user specified columns
### What changes were proposed in this pull request?

Currently, v2 ALTER TABLE REPLACE COLUMNS does not check duplicates for the user specified columns. For example,
```
spark.sql(s"CREATE TABLE $t (id int) USING $v2Format")
spark.sql(s"ALTER TABLE $t REPLACE COLUMNS (data string, data string)")
```
doesn't fail the analysis, and it's up to the catalog implementation to handle it.

### Why are the changes needed?

To check the duplicate columns during analysis.

### Does this PR introduce _any_ user-facing change?

Yes, now the above will command will print out the following:
```
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the user specified columns: `data`
```

### How was this patch tested?

Added new unit tests

Closes #33676 from imback82/replace_cols_duplicates.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-10 13:20:29 +08:00
Huaxin Gao b04330cd38 [SPARK-36454][SQL] Not push down partition filter to ORCScan for DSv2
### What changes were proposed in this pull request?
not push down partition filter to `ORCScan` for DSv2

### Why are the changes needed?
Seems to me that partition filter is only used for partition pruning and shouldn't be pushed down to `ORCScan`. We don't push down partition filter to ORCScan in DSv1
```
== Physical Plan ==
*(1) Filter (isnotnull(value#19) AND NOT (value#19 = a))
+- *(1) ColumnarToRow
   +- FileScan orc [value#19,p1#20,p2#21] Batched: true, DataFilters: [isnotnull(value#19), NOT (value#19 = a)], Format: ORC, Location: InMemoryFileIndex(1 paths)[file:/private/var/folders/pt/_5f4sxy56x70dv9zpz032f0m0000gn/T/spark-c1..., PartitionFilters: [isnotnull(p1#20), isnotnull(p2#21), (p1#20 = 1), (p2#21 = 2)], PushedFilters: [IsNotNull(value), Not(EqualTo(value,a))], ReadSchema: struct<value:string>
```
Also, we don't push down partition filter for parquet in DSv2.
https://github.com/apache/spark/pull/30652

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing test suites

Closes #33680 from huaxingao/orc_filter.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-09 10:47:03 -07:00
Mick Jermsurawong 33c6d1168c [SPARK-20384][SQL] Support value class in nested schema for Dataset
### What changes were proposed in this pull request?

- This PR revisits https://github.com/apache/spark/pull/22309, and [SPARK-20384](https://issues.apache.org/jira/browse/SPARK-20384) solving the original problem, but additionally will prevent backward-compat break on schema of top-level `AnyVal` value class.
- Why previous break? We currently support top-level value classes just as any other case class; field of the underlying type is present in schema. This means any dataframe SQL filtering on this expects the field name to be present. The previous PR changes this schema and would result in breaking current usage. See test `"schema for case class that is a value class"`. This PR keeps the schema.
- We actually currently support collection of value classes prior to this change, but not case class of nested value class. This means the schema of these classes shouldn't change to prevent breaking too.
- However, what we can change, without breaking, is schema of nested value class, which will fails due to the compile problem, and thus its schema now isn't actually valid. After the change, the schema of this nested value class is now flattened
- With this PR, there's flattening only for nested value class (new), but not for top-level and collection classes (existing behavior)
- This PR revisits https://github.com/apache/spark/pull/27153 by handling tuple `Tuple2[AnyVal, AnyVal]` which is a constructor ("nested class") but is a generic type, so it should not be flattened behaving similarly to `Seq[AnyVal]`

### Why are the changes needed?

- Currently, nested value class isn't supported. This is because when the generated code treats `anyVal` class in its unwrapped form, but we encode the type to be the wrapped case class. This results in compile of generated code
For example,
For a given `AnyVal` wrapper and its root-level class container
```
case class IntWrapper(i: Int) extends AnyVal
case class ComplexValueClassContainer(c: IntWrapper)
```
The problematic part of generated code:
```
    private InternalRow If_1(InternalRow i) {
        boolean isNull_42 = i.isNullAt(0);
        // 1) ******** The root-level case class we care
        org.apache.spark.sql.catalyst.encoders.ComplexValueClassContainer value_46 = isNull_42 ?
            null : ((org.apache.spark.sql.catalyst.encoders.ComplexValueClassContainer) i.get(0, null));
        if (isNull_42) {
            throw new NullPointerException(((java.lang.String) references[5] /* errMsg */ ));
        }
        boolean isNull_39 = true;
        // 2) ******** We specify its member to be unwrapped case class extending `AnyVal`
        org.apache.spark.sql.catalyst.encoders.IntWrapper value_43 = null;
        if (!false) {

            isNull_39 = false;
            if (!isNull_39) {
                // 3) ******** ERROR: `c()` compiled however is of type `int` and thus we see error
                value_43 = value_46.c();
            }
        }
```
We get this errror: Assignment conversion not possible from type "int" to type "org.apache.spark.sql.catalyst.encoders.IntWrapper"
```
java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException:
File 'generated.java', Line 159, Column 1: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 159, Column 1: Assignment conversion not possible from type "int" to type "org.apache.spark.sql.catalyst.encoders.IntWrapper"
```

From [doc](https://docs.scala-lang.org/overviews/core/value-classes.html) on value class: , Given: `class Wrapper(val underlying: Int) extends AnyVal`,
1) "The type at compile time is `Wrapper`, but at runtime, the representation is an `Int`". This implies that when our struct has a field of value class, the generated code should support the underlying type during runtime execution.
2) `Wrapper` "must be instantiated... when a value class is used as a type argument". This implies that `scala.Tuple[Wrapper, ...], Seq[Wrapper], Map[String, Wrapper], Option[Wrapper]` will still contain Wrapper as-is in during runtime instead of `Int`.

### Does this PR introduce _any_ user-facing change?

- Yes, this will allow support for the nested value class.

### How was this patch tested?

- Added unit tests to illustrate
  - raw schema
  - projection
  - round-trip encode/decode

Closes #33205 from mickjermsurawong-stripe/SPARK-20384-2.

Lead-authored-by: Mick Jermsurawong <mickjermsurawong@stripe.com>
Co-authored-by: Emil Ejbyfeldt <eejbyfeldt@liveintent.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-08-09 08:47:35 -05:00
yangjie01 900908b9be [SPARK-36410][CORE][SQL][STRUCTURED STREAMING][EXAMPLES] Replace anonymous classes with lambda expressions
### What changes were proposed in this pull request?
The main change of this pr is replace anonymous classes with lambda expressions in Java code

**Before**
```java
 new Thread(new Runnable() {
    Override
    public void run() {
      // run thread
    }
  });
```

**After**

```java
new Thread(() -> {
    // run thread
  });
```

### Why are the changes needed?
Code Simpilefications.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Pass the Jenkins or GitHub Action
- Manually test `JavaUserDefinedScalar` with command
   `bin/spark-submit run-example org.apache.spark.examples.sql.JavaUserDefinedScalar` passed

Closes #33635 from LuciferYang/lambda.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-09 19:28:31 +09:00
Wenchen Fan 9a539d5846 [SPARK-36430][SQL] Adaptively calculate the target size when coalescing shuffle partitions in AQE
### What changes were proposed in this pull request?

This PR fixes a performance regression introduced in https://github.com/apache/spark/pull/33172

Before #33172 , the target size is adaptively calculated based on the default parallelism of the spark cluster. Sometimes it's very small and #33172 sets a min partition size to fix perf issues. Sometimes the calculated size is reasonable, such as dozens of MBs.

After #33172 , we no longer calculate the target size adaptively, and by default always coalesce the partitions into 1 MB. This can cause perf regression if the adaptively calculated size is reasonable.

This PR brings back the code that adaptively calculate the target size based on the default parallelism of the spark cluster.

### Why are the changes needed?

fix perf regression

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #33655 from cloud-fan/minor.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-09 17:25:55 +08:00
Angerszhuuuu f3e079b09b [SPARK-36271][SQL] Unify V1 insert check field name before prepare writter
### What changes were proposed in this pull request?
Unify DataSource V1 insert schema check field name before prepare writer.
And in this PR we add check for avro V1 insert too.

### Why are the changes needed?
Unify code and add check for avro V1 insert too.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #33566 from AngersZhuuuu/SPARK-36271.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-09 17:18:06 +08:00
ulysses-you bb6f65acca [SPARK-36424][SQL] Support eliminate limits in AQE Optimizer
### What changes were proposed in this pull request?

* override the maxRows method in `LogicalQueryStage`
* add rule `EliminateLimits` in `AQEOptimizer`

### Why are the changes needed?

In Ad-hoc scenario, we always add limit for the query if user have no special limit value, but not all limit is nesessary.

With the power of AQE, we can eliminate limits using running statistics.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

add test

Closes #33651 from ulysses-you/SPARK-36424.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-09 16:51:51 +08:00
Angerszhuuuu e051a540a1 [SPARK-36352][SQL] Spark should check result plan's output schema name
### What changes were proposed in this pull request?
Spark should check result plan's output schema name

### Why are the changes needed?
In current code, some optimizer rule may change plan's output schema, since in the code we always use semantic equal to check output, but it may change the plan's output schema.
For example, for SchemaPruning, if we have a plan
```
Project[a, B]
|--Scan[A, b, c]
```
the origin output schema is `a, B`, after SchemaPruning. it become
```
Project[A, b]
|--Scan[A, b]
```
It change the plan's schema. when we use CTAS, the schema is same as query plan's output.
Then since we change the schema, it not consistent with origin SQL. So we need to check final result plan's schema with origin plan's schema

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
existed UT

Closes #33583 from AngersZhuuuu/SPARK-36352.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-09 16:47:56 +08:00
Wenchen Fan 8714eefe6f [SPARK-35881][SQL][FOLLOWUP] Add a boolean flag in AdaptiveSparkPlanExec to ask for columnar output
### What changes were proposed in this pull request?

This is a follow-up of https://github.com/apache/spark/pull/33140 to propose a simpler idea for integrating columnar execution into AQE.

Instead of making the `ColumnarToRowExec` and `RowToColumnarExec` dynamic to handle `AdaptiveSparkPlanExec`, it's simpler to let the consumer decide if it needs columnar output or not, and pass a boolean flag to `AdaptiveSparkPlanExec`.

For Spark vendors, they can set the flag differently in their custom columnar parquet writing command when the input plan is `AdaptiveSparkPlanExec`.

One argument is if we need to look at the final plan of AQE and consume the data differently (either row or columnar format). I can't think of a use case and I think we can always statically know if the AQE plan should output row or columnar data.

### Why are the changes needed?

code simplification.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

manual test

Closes #33624 from cloud-fan/aqe.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-09 16:33:52 +08:00
Terry Kim b3d7ebb2df [SPARK-36450][SQL] Remove unused UnresolvedV2Relation
### What changes were proposed in this pull request?

Now that all the commands that use `UnresolvedV2Relation` have been migrated to use `UnresolvedTable` and `UnresolvedView` (e.g, #33200), `UnresolvedV2Relation` can be removed.

### Why are the changes needed?

To remove unused code.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Removing dead code and no code coverage existed before.

Closes #33677 from imback82/remove_unresolvedv2relation.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-09 16:27:45 +08:00
Yuming Wang 4624e59ac6 [SPARK-36359][SQL] Coalesce drop all expressions after the first non nullable expression
### What changes were proposed in this pull request?

`Coalesce` drop all expressions after the first non nullable expression. For example:
```scala
sql("create table t1(a string, b string) using parquet")
sql("select a, Coalesce(count(b), 0) from t1 group by a").explain(true)
```

Before this pr:
```
== Optimized Logical Plan ==
Aggregate [a#0], [a#0, coalesce(count(b#1), 0) AS coalesce(count(b), 0)#3L]
+- Relation default.t1[a#0,b#1] parquet
```
After this pr:
```
== Optimized Logical Plan ==
Aggregate [a#0], [a#0, count(b#1) AS coalesce(count(b), 0)#3L]
+- Relation default.t1[a#0,b#1] parquet
```

### Why are the changes needed?

Improve query performance.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #33590 from wangyum/SPARK-36359.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
2021-08-06 23:54:24 +08:00
Kousuke Saruta e17612d0bf Revert "[SPARK-36429][SQL] JacksonParser should throw exception when data type unsupported"
### What changes were proposed in this pull request?

This PR reverts the change in SPARK-36429 (#33654).
See [conversation](https://github.com/apache/spark/pull/33654#issuecomment-894160037).

### Why are the changes needed?

To recover CIs.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes #33670 from sarutak/revert-SPARK-36429.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-08-06 20:56:24 +09:00
gaoyajun02 888f8f03c8 [SPARK-36339][SQL] References to grouping that not part of aggregation should be replaced
### What changes were proposed in this pull request?

Currently, references to grouping sets are reported as errors after aggregated expressions, e.g.
```
SELECT count(name) c, name
FROM VALUES ('Alice'), ('Bob') people(name)
GROUP BY name GROUPING SETS(name);
```
Error in query: expression 'people.`name`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;

### Why are the changes needed?

Fix the map anonymous function in the constructAggregateExprs function does not use underscores to avoid

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit tests.

Closes #33574 from gaoyajun02/SPARK-36339.

Lead-authored-by: gaoyajun02 <gaoyajun02@gmail.com>
Co-authored-by: gaoyajun02 <gaoyajun02@meituan.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-06 16:34:37 +08:00
ulysses-you c97fb68885 [SPARK-35221][SQL] Add the check of supported join hints
### What changes were proposed in this pull request?

Print warning msg if join hint is not supported for the specified build side.

### Why are the changes needed?

Currently we support specify the join implementation with hint, but Spark did not promise it.

For example broadcast outer join and hash outer join we need to check if its build side was supported. And at least we should print some warning log instead of changing to other join implementation silently.

### Does this PR introduce _any_ user-facing change?

Yes, warning log might be printed.

### How was this patch tested?

Add new test.

Closes #32355 from ulysses-you/SPARK-35221.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-06 15:29:43 +08:00
Kousuke Saruta 63c7d1847d [SPARK-36429][SQL][FOLLOWUP] Update a golden file to comply with the change in SPARK-36429
### What changes were proposed in this pull request?

This PR updates a golden to comply with the change in SPARK-36429 (#33654).

### Why are the changes needed?

To recover GA failure.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA itself.

Closes #33663 from sarutak/followup-SPARK-36429.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-06 15:20:54 +08:00
gengjiaan eb12727bc7 [SPARK-36429][SQL] JacksonParser should throw exception when data type unsupported
### What changes were proposed in this pull request?
Currently, when `set spark.sql.timestampType=TIMESTAMP_NTZ`, the behavior is different between `from_json` and `from_csv`.
```
-- !query
select from_json('{"t":"26/October/2015"}', 't Timestamp', map('timestampFormat', 'dd/MMMMM/yyyy'))
-- !query schema
struct<from_json({"t":"26/October/2015"}):struct<t:timestamp_ntz>>
-- !query output
{"t":null}
```

```
-- !query
select from_csv('26/October/2015', 't Timestamp', map('timestampFormat', 'dd/MMMMM/yyyy'))
-- !query schema
struct<>
-- !query output
java.lang.Exception
Unsupported type: timestamp_ntz
```

We should make `from_json` throws exception too.
This PR fix the discussion below
https://github.com/apache/spark/pull/33640#discussion_r682862523

### Why are the changes needed?
Make the behavior of `from_json` more reasonable.

### Does this PR introduce _any_ user-facing change?
'Yes'.
from_json throwing Exception when we set spark.sql.timestampType=TIMESTAMP_NTZ.

### How was this patch tested?
Tests updated.

Closes #33654 from beliefer/SPARK-36429.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-06 12:53:04 +08:00
Kent Yao c7fa3c9090 [SPARK-36421][SQL][DOCS] Use ConfigEntry.key to fix docs and set command results
### What changes were proposed in this pull request?

This PR fixes the issue that `ConfigEntry` to be introduced to the doc field directly without calling `.key`, which causes malformed documents on the web site and in the result of `SET -v`

1. https://spark.apache.org/docs/3.1.2/configuration.html#static-sql-configuration - spark.sql.hive.metastore.jars

2. set -v
![image](https://user-images.githubusercontent.com/8326978/128292412-85100f95-24fd-4b40-a14f-d31a256dab7d.png)

### Why are the changes needed?

bugfix

### Does this PR introduce _any_ user-facing change?

no, but contains doc fix
### How was this patch tested?

new tests

Closes #33647 from yaooqinn/SPARK-36421.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-06 11:01:47 +09:00
Kent Yao 0c94e47aec [SPARK-36414][SQL] Disable timeout for BroadcastQueryStageExec in AQE
### What changes were proposed in this pull request?

This reverts SPARK-31475, as there are always more concurrent jobs running in AQE mode, especially when running multiple queries at the same time. Currently, the broadcast timeout does not record accurately for the BroadcastQueryStageExec only, but also including the time waiting for being scheduled. If all the resources are currently being occupied for materializing other stages, it timeouts without a chance to run actually.

 

![image](https://user-images.githubusercontent.com/8326978/128169612-4c96c8f6-6f8e-48ed-8eaf-450f87982c3b.png)

 

The default value is 300s, and it's hard to adjust the timeout for AQE mode. Usually, you need an extremely large number for real-world cases. As you can see in the example, above, the timeout we used for it was 1800s, and obviously, it needed 3x more or something

 

### Why are the changes needed?

AQE is default now, we can make it more stable with this PR

### Does this PR introduce _any_ user-facing change?

yes, broadcast timeout now is not used for AQE

### How was this patch tested?

modified test

Closes #33636 from yaooqinn/SPARK-36414.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-05 21:15:35 +08:00
Wenchen Fan 095f9ff75b [SPARK-36409][SQL][TESTS] Splitting test cases from datetime.sql
### What changes were proposed in this pull request?

Currently `datetime.sql` contains a lot of tests and will be run 3 times: default mode, ansi mode, ntz mode. It wastes the test time and also large test files are hard to read.

This PR proposes to split it into smaller ones:
1. `date.sql`, which contains date literals, functions and operations. It will be run twice with default and ansi mode.
2. `timestamp.sql`, which contains timestamp (no ltz or ntz suffix) literals, functions and operations. It will be run 4 times: default mode + ans off, defaul mode + ansi on, ntz mode + ansi off, ntz mode + ansi on.
3. `datetime_special.sql`, which create datetime values whose year is outside of [0, 9999]. This is a separated file as JDBC doesn't support them and need to ignore this test file. It will be run 4 times as well.
4. `timestamp_ltz.sql`, which contains timestamp_ltz literals and constructors. It will be run twice with default and ntz mode, to make sure its result doesn't change with the timestamp mode. Note that, operations with ltz are tested by `timestamp.sql`
5. `timestamp_ntz.sql`, which contains timestamp_ntz literals and constructors. It will be run twice with default and ntz mode, to make sure its result doesn't change with the timestamp mode. Note that, operations with ntz are tested by `timestamp.sql`

### Why are the changes needed?

reduce test run time.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A

Closes #33640 from cloud-fan/test.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-05 20:55:32 +08:00
Angerszhuuuu 02810eecbf [SPARK-36353][SQL] RemoveNoopOperators should keep output schema
### What changes were proposed in this pull request?
 RemoveNoopOperators should keep output schema

### Why are the changes needed?
Expand function

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not need

Closes #33587 from AngersZhuuuu/SPARK-36355.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-05 20:43:35 +08:00
yangjie01 01cf6f4c6b [SPARK-34309][BUILD][CORE][SQL][K8S] Use Caffeine instead of Guava Cache
### What changes were proposed in this pull request?
There are 3 ways to use Guava cache in spark code:

1. `Loadingcache` is the main way to use Guava cache in spark code and the key usages are as follows:
  a. `LoadingCache` with `maximumsize` data eviction policy, such as `appCache` in `ApplicationCache`, `cache` in `Codegenerator`
  b. `LoadingCache` with `maximumWeight` data eviction policy, such as `shuffleIndexCache` in `ExternalShuffleBlockResolver`
  c. `LoadingCache` with 'expireAfterWrite' data eviction policy, such as `tableRelationCache` in `SessionCatalog`
2. `ManualCache` is another way to use Guava cache in spark code and the key usage is `cache` in `SharedInMemoryCache`, it use to caches partition file statuses in memory

3. The last use way is `hadoopJobMetadata` in `SparkEnv`, it uses Guava Cache to build a `soft-reference map`.

The goal of this pr is use `Caffeine` instead of `Guava Cache` because `Caffeine` is faster than `Guava Cache` from benchmarks, the main changes as follows:

1. Add `Caffeine` deps to maven `pom.xml`

2. Use `Caffeine` instead of Guava `LoadingCache`, `ManualCache` and soft-reference map in `SparkEnv`

3. Add `LocalCacheBenchmark` to compare performance of `Loadingcache` between `Guava Cache` and `Caffeine`

### Why are the changes needed?
`Caffeine` is faster than `Guava Cache` from benchmarks

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Pass the Jenkins or GitHub Action
- Add `LocalCacheBenchmark` to compare performance of `Loadingcache` between `Guava Cache` and `Caffeine`

Closes #31517 from LuciferYang/guava-cache-to-caffeine.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Holden Karau <hkarau@netflix.com>
2021-08-04 12:01:44 -07:00
Cheng Su de62b5ae32 [SPARK-36404][SQL] Support ORC nested column vectorized reader for data source v2
### What changes were proposed in this pull request?

We added support of nested columns in ORC vectorized reader for data source v1. Data source v2 and v1 both use same underlying implementation for vectorized reader (OrcColumnVector), so we can support data source v2 as well.

### Why are the changes needed?

Improve query performance for ORC data source v2 when reading nested columns.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added test in `OrcQuerySuite.scala`.

Closes #33626 from c21/orc-v2.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-03 19:55:14 -07:00
PengLei 87d49cbcb1 [SPARK-36381][SQL] Add case sensitive and case insensitive compare for checking column name exist when alter table
### What changes were proposed in this pull request?
Add the Resolver to `checkColumnNotExists` to check name exist in case sensitive.

### Why are the changes needed?
At now the resolver is `_ == _` of `findNestedField`  called by `checkColumnNotExists`
Add `alter.conf.resolver` to it.
[SPARK-36381](https://issues.apache.org/jira/browse/SPARK-36381)
### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Add ut tests

Closes #33618 from Peng-Lei/sensitive-cloumn-name.

Authored-by: PengLei <peng.8lei@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-04 10:04:13 +09:00
Yuming Wang 4a6afb4875 [SPARK-36280][SQL] Remove redundant aliases after RewritePredicateSubquery
### What changes were proposed in this pull request?

Remove redundant aliases after `RewritePredicateSubquery`. For example:
```scala
sql("CREATE TABLE t1 USING parquet AS SELECT id AS a, id AS b, id AS c FROM range(10)")
sql("CREATE TABLE t2 USING parquet AS SELECT id AS x, id AS y FROM range(8)")
sql(
  """
    |SELECT *
    |FROM  t1
    |WHERE  a IN (SELECT x
    |  FROM  (SELECT x AS x,
    |           Rank() OVER (partition BY x ORDER BY Sum(y) DESC) AS ranking
    |    FROM   t2
    |    GROUP  BY x) tmp1
    |  WHERE  ranking <= 5)
    |""".stripMargin).explain
```
Before this PR:
```
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- BroadcastHashJoin [a#10L], [x#7L], LeftSemi, BuildRight, false
   :- FileScan parquet default.t1[a#10L,b#11L,c#12L]
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true]),false), [id=#68]
      +- Project [x#7L]
         +- Filter (ranking#8 <= 5)
            +- Window [rank(_w2#25L) windowspecdefinition(x#15L, _w2#25L DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS ranking#8], [x#15L], [_w2#25L DESC NULLS LAST]
               +- Sort [x#15L ASC NULLS FIRST, _w2#25L DESC NULLS LAST], false, 0
                  +- Exchange hashpartitioning(x#15L, 5), ENSURE_REQUIREMENTS, [id=#62]
                     +- HashAggregate(keys=[x#15L], functions=[sum(y#16L)])
                        +- Exchange hashpartitioning(x#15L, 5), ENSURE_REQUIREMENTS, [id=#59]
                           +- HashAggregate(keys=[x#15L], functions=[partial_sum(y#16L)])
                              +- FileScan parquet default.t2[x#15L,y#16L]
```

After this PR:
```
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- BroadcastHashJoin [a#10L], [x#15L], LeftSemi, BuildRight, false
   :- FileScan parquet default.t1[a#10L,b#11L,c#12L]
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true]),false), [id=#67]
      +- Project [x#15L]
         +- Filter (ranking#8 <= 5)
            +- Window [rank(_w2#25L) windowspecdefinition(x#15L, _w2#25L DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS ranking#8], [x#15L], [_w2#25L DESC NULLS LAST]
               +- Sort [x#15L ASC NULLS FIRST, _w2#25L DESC NULLS LAST], false, 0
                  +- HashAggregate(keys=[x#15L], functions=[sum(y#16L)])
                     +- Exchange hashpartitioning(x#15L, 5), ENSURE_REQUIREMENTS, [id=#59]
                        +- HashAggregate(keys=[x#15L], functions=[partial_sum(y#16L)])
                           +- FileScan parquet default.t2[x#15L,y#16L]
```

### Why are the changes needed?

Reduce shuffle to improve query performance. This change can benefit TPC-DS q70.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #33509 from wangyum/SPARK-36280.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-03 13:56:59 -07:00
Max Gekk 67cbc93263 [SPARK-36349][SQL] Disallow ANSI intervals in file-based datasources
### What changes were proposed in this pull request?
In the PR, I propose to ban `YearMonthIntervalType` and `DayTimeIntervalType` at the analysis phase while creating a table using a built-in filed-based datasource or writing a dataset to such datasource. In particular, add the following case:
```scala
case _: DayTimeIntervalType | _: YearMonthIntervalType => false
```
to all methods that override either:
- V2 `FileTable.supportsDataType()`
- V1 `FileFormat.supportDataType()`

### Why are the changes needed?
To improve user experience with Spark SQL, and output a proper error message at the analysis phase.

### Does this PR introduce _any_ user-facing change?
Yes but ANSI interval types haven't released yet. So, for users this is new behavior.

### How was this patch tested?
By running the affected test suites:
```
$ build/sbt -Phive-2.3 "test:testOnly *HiveOrcSourceSuite"
```

Closes #33580 from MaxGekk/interval-ban-in-ds.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-08-03 20:30:20 +03:00
Kousuke Saruta 92cdb17d1a [SPARK-35815][SQL][FOLLOWUP] Add test considering the case spark.sql.legacy.interval.enabled is true
### What changes were proposed in this pull request?

This PR adds test considering the case `spark.sql.legacy.interval.enabled` is `true` for SPARK-35815.

### Why are the changes needed?

SPARK-35815 (#33456) changes `Dataset.withWatermark` to accept ANSI interval literals as `delayThreshold` but I noticed the change didn't work with `spark.sql.legacy.interval.enabled=true`.
We can't detect this issue because there is no test which considers the legacy interval type at that time.
In SPARK-36323 (#33551), this issue was resolved but it's better to add test.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New test.

Closes #33606 from sarutak/test-watermark-with-legacy-interval.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-08-03 13:48:41 +03:00
Wenchen Fan dd80457ffb [SPARK-36315][SQL] Only skip AQEShuffleReadRule in the final stage if it breaks the distribution requirement
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/30494

This PR proposes a new way to optimize the final query stage in AQE. We first collect the effective user-specified repartition (semantic-wise, user-specified repartition is only effective if it's the root node or under a few simple nodes), and get the required distribution for the final plan. When we optimize the final query stage, we skip certain `AQEShuffleReadRule` if it breaks the required distribution.

### Why are the changes needed?

The current solution for optimizing the final query stage is pretty hacky and overkill. As an example, the newly added rule `OptimizeSkewInRebalancePartitions` can hardly apply as it's very common that the query plan has shuffles with origin `ENSURE_REQUIREMENTS`, which is not supported by `OptimizeSkewInRebalancePartitions`.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

updated tests

Closes #33541 from cloud-fan/aqe.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-03 18:28:52 +08:00
Wenchen Fan 7cb9c1c241 [SPARK-36380][SQL] Simplify the logical plan names for ALTER TABLE ... COLUMN
### What changes were proposed in this pull request?

This a followup of the recent work such as https://github.com/apache/spark/pull/33200

For `ALTER TABLE` commands, the logical plans do not have the common `AlterTable` prefix in the name and just use names like `SetTableLocation`. This PR proposes to follow the same naming rule in `ALTER TABE ... COLUMN` commands.

This PR also moves these AlterTable commands to a individual file and give them a base trait.

### Why are the changes needed?

name simplification

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing test

Closes #33609 from cloud-fan/dsv2.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-08-03 10:43:00 +03:00
Yuming Wang c20af53580 [SPARK-36373][SQL] DecimalPrecision only add necessary cast
### What changes were proposed in this pull request?

This pr makes `DecimalPrecision` only add necessary cast similar to [`ImplicitTypeCasts`](96c2919988/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala (L675-L678)). For example:
```
EqualTo(AttributeReference("d1", DecimalType(5, 2))(), AttributeReference("d2", DecimalType(2, 1))())
```
It will add a useless cast to _d1_:
```
(cast(d1#6 as decimal(5,2)) = cast(d2#7 as decimal(5,2)))
```

### Why are the changes needed?

1. Avoid adding unnecessary cast. Although it will be removed by  `SimplifyCasts` later.
2. I'm trying to add an extended rule similar to `PullOutGroupingExpressions`. The current behavior will introduce additional alias. For example: `cast(d1 as decimal(5,2)) as cast_d1`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #33602 from wangyum/SPARK-36373.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
2021-08-03 08:12:54 +08:00
Chao Sun 7a27f8a07f [SPARK-36137][SQL] HiveShim should fallback to getAllPartitionsOf even if directSQL is enabled in remote HMS
### What changes were proposed in this pull request?

Change `HiveShim.getPartitionsByFilter` to always fallback to use `getAllPartitionsMethod` even if `hive.metastore.try.direct.sql` is set to true in the remote HMS.

### Why are the changes needed?

At the moment `getPartitionsByFilter` in `HiveShim` only fallback to use `getAllPartitionsMethod` when `hive.metastore.try.direct.sql` is disabled in the remote HMS, and will fail the query otherwise. However, in certain cases the remote HMS will fallback to use ORM (which only support string type for partition columns) to query the underlying RDBMS **even if this config is set to true**. In this scenario, currently Spark will not be able to recover from the exception and will just fail the query.

For instance, we encountered this bug [HIVE-21497](https://issues.apache.org/jira/browse/HIVE-21497) in HMS running Hive 3.1.2, and Spark was not able to pushdown filter for date column.

### Does this PR introduce _any_ user-facing change?

Yes, now if Spark is querying partitions from a remote HMS which throws exception even if `hive.metastore.try.direct.sql` is set to true, Spark will fallback to list all partitions and do the pruning on client side, instead of failing the query.

### How was this patch tested?

Tested locally with a HMS instance running 3.1.2. It's pretty hard to add a unit test for this since we don't have a mock HMS.

Closes #33382 from sunchao/SPARK-36137-direct-sql.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-08-02 16:48:43 -07:00
Hyukjin Kwon 0bbcbc6508 [SPARK-36379][SQL] Null at root level of a JSON array should not fail w/ permissive mode
### What changes were proposed in this pull request?

This PR proposes to fail properly so JSON parser can proceed and parse the input with the permissive mode.
Previously, we passed `null`s as are, the root `InternalRow`s became `null`s, and it causes the query fails even with permissive mode on.
Now, we fail explicitly if `null` is passed when the input array contains `null`.

Note that this is consistent with non-array JSON input:

**Permissive mode:**

```scala
spark.read.json(Seq("""{"a": "str"}""", """null""").toDS).collect()
```
```
res0: Array[org.apache.spark.sql.Row] = Array([str], [null])
```

**Failfast mode**:

```scala
spark.read.option("mode", "failfast").json(Seq("""{"a": "str"}""", """null""").toDS).collect()
```
```
org.apache.spark.SparkException: Malformed records are detected in record parsing. Parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'.
	at org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:70)
	at org.apache.spark.sql.DataFrameReader.$anonfun$json$7(DataFrameReader.scala:540)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
```

### Why are the changes needed?

To make the permissive mode to proceed and parse without throwing an exception.

### Does this PR introduce _any_ user-facing change?

**Permissive mode:**

```scala
spark.read.json(Seq("""[{"a": "str"}, null]""").toDS).collect()
```

Before:

```
java.lang.NullPointerException
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
```

After:

```
res0: Array[org.apache.spark.sql.Row] = Array([null])
```

NOTE that this behaviour is consistent when JSON object is malformed:

```scala
spark.read.schema("a int").json(Seq("""[{"a": 123}, {123123}, {"a": 123}]""").toDS).collect()
```

```
res0: Array[org.apache.spark.sql.Row] = Array([null])
```

Since we're parsing _one_ JSON array, related records all fail together.

**Failfast mode:**

```scala
spark.read.option("mode", "failfast").json(Seq("""[{"a": "str"}, null]""").toDS).collect()
```

Before:

```
java.lang.NullPointerException
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
```

After:

```
org.apache.spark.SparkException: Malformed records are detected in record parsing. Parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'.
	at org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:70)
	at org.apache.spark.sql.DataFrameReader.$anonfun$json$7(DataFrameReader.scala:540)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
```

### How was this patch tested?

Manually tested, and unit test was added.

Closes #33608 from HyukjinKwon/SPARK-36379.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-02 10:01:12 -07:00
Angerszhuuuu f3173956cb [SPARK-36086][SQL] CollapseProject project replace alias should use origin column name
### What changes were proposed in this pull request?
For added UT, without this patch will failed as below
```
[info] - SHOW TABLES V2: SPARK-36086: CollapseProject project replace alias should use origin column name *** FAILED *** (4 seconds, 935 milliseconds)
[info]   java.lang.RuntimeException: After applying rule org.apache.spark.sql.catalyst.optimizer.CollapseProject in batch Operator Optimization before Inferring Filters, the structural integrity of the plan is broken.
[info]   at org.apache.spark.sql.errors.QueryExecutionErrors$.structuralIntegrityIsBrokenAfterApplyingRuleError(QueryExecutionErrors.scala:1217)
[info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:229)
[info]   at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
[info]   at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
[info]   at scala.collection.immutable.List.foldLeft(List.scala:91)
[info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:208)
[info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:200)
[info]   at scala.collection.immutable.List.foreach(List.scala:431)
[info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:200)
[info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:179)
[info]   at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
```

CollapseProject project replace alias should use origin column name
### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #33576 from AngersZhuuuu/SPARK-36086.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-03 00:08:13 +08:00
Linhong Liu 2f700773c2 [SPARK-36224][SQL] Use Void as the type name of NullType
### What changes were proposed in this pull request?
Change the `NullType.simpleString` to "void" to set "void" as the formal type name of `NullType`

### Why are the changes needed?
This PR is intended to address the type name discussion in PR #28833. Here are the reasons:
1. The type name of NullType is displayed everywhere, e.g. schema string, error message, document. Hence it's not possible to hide it from users, we have to choose a proper name
2. The "void" is widely used as the type name of "NULL", e.g. Hive, pgSQL
3. Changing to "void" can enable the round trip of `toDDL`/`fromDDL` for NullType. (i.e. make `from_json(col, schema.toDDL)`) work

### Does this PR introduce _any_ user-facing change?
Yes, the type name of "NULL" is changed from "null" to "void". for example:
```
scala> sql("select null as a, 1 as b").schema.catalogString
res5: String = struct<a:void,b:int>
```

### How was this patch tested?
existing test cases

Closes #33437 from linhongliu-db/SPARK-36224-void-type-name.

Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-02 23:19:54 +08:00
Terry Kim 3b713e7f61 [SPARK-36372][SQL] v2 ALTER TABLE ADD COLUMNS should check duplicates for the user specified columns
### What changes were proposed in this pull request?

Currently, v2 ALTER TABLE ADD COLUMNS does not check duplicates for the user specified columns. For example,
```
spark.sql(s"CREATE TABLE $t (id int) USING $v2Format")
spark.sql("ALTER TABLE $t ADD COLUMNS (data string, data string)")
```
doesn't fail the analysis, and it's up to the catalog implementation to handle it. For v1 command, the duplication is checked before invoking the catalog.

### Why are the changes needed?

To check the duplicate columns during analysis and be consistent with v1 command.

### Does this PR introduce _any_ user-facing change?

Yes, now the above will command will print out the fllowing:
```
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the user specified columns: `data`
```

### How was this patch tested?

Added new unit tests

Closes #33600 from imback82/alter_add_duplicate_columns.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-02 17:54:50 +08:00
Dongjoon Hyun 22c49226f7 [SPARK-36362][CORE][SQL][FOLLOWUP] Fix java linter errors
### What changes were proposed in this pull request?

This is a follow-up of #33594 to fix the Java linter error.

### Why are the changes needed?

To recover GitHub Action.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the GitHub Action.

Closes #33601 from dongjoon-hyun/SPARK-36362.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-01 21:40:03 +09:00
Sean Owen 72615bc551 [SPARK-36362][CORE][SQL][TESTS] Omnibus Java code static analyzer warning fixes
### What changes were proposed in this pull request?

Fix up some minor Java issues:

- Some int*int multiplications that widen to long maybe could overflow
- Unnecessarily non-static inner classes
- Some tests "catch (AssertionError)" and do nothing
- Manual array iteration vs very slightly faster/simpler foreach
- Incorrect generic types that just happen to not cause a runtime error
- Missed opportunities for try-close
- Mutable enums
- .. and a few other minor things

### Why are the changes needed?

Some are minor but clear fixes; some may have a marginal perf impact or avoid a bug later. Also: maybe avoid future PRs to address these one by one.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests

Closes #33594 from srowen/SPARK-36362.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-31 22:35:57 -07:00
Enrico Minack a65eb36bae [SPARK-36319][SQL][PYTHON] Make Observation return Map instead of Row
### What changes were proposed in this pull request?
The Observation API (Scala, Java, PySpark) now returns a `Map` / `Dict`. Before, it returned `Row` simply because the metrics are (internal to Observation) retrieved from the listener as rows. Since that is hidden from the user by the Observation API, there is no need to return `Row`.

While touching this code, this moves the unit tests from `DataFrameSuite,scala` to `DatasetSuite.scala` and from `JavaDataFrameSuite.java` to `JavaDatasetSuite.java`, which is a better place.

### Why are the changes needed?
This simplifies the API and accessing the metrics, especially in Java. There is no need for the concept `Row` when retrieving the observation result.

### Does this PR introduce _any_ user-facing change?
Yes, it changes the return type of `get` from `Row` to `Map` (Scala) / `Dict` (Python) and introduces `getAsJavaMap` (Java).

### How was this patch tested?
This is tested in `DatasetSuite.SPARK-34806: observation on datasets`, `JavaDatasetSuite.testObservation` and `test_dataframe.test_observe`.

Closes #33545 from EnricoMi/branch-observation-returns-map.

Authored-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-01 10:40:28 +09:00
Andy Grove 0f538402fb [SPARK-35881][SQL] Add support for columnar execution of final query stage in AdaptiveSparkPlanExec
### What changes were proposed in this pull request?

Changes in this PR:

- `AdaptiveSparkPlanExec` has new methods `finalPlanSupportsColumnar` and `doExecuteColumnar` to support adaptive queries where the final query stage produces columnar data.
- `SessionState` now has a new set of injectable rules named `finalQueryStagePrepRules` that can be applied to the final query stage.
- `AdaptiveSparkPlanExec` can now safely be wrapped by either `RowToColumnarExec` or `ColumnarToRowExec`.

A Spark plugin can use the new rules to remove the root `ColumnarToRowExec` transition that is inserted by previous rules and at execution time can call `finalPlanSupportsColumnar` to see if the final query stage is columnar. If the plan is columnar then the plugin can safely call `doExecuteColumnar`. The adaptive plan can be wrapped in either `RowToColumnarExec` or `ColumnarToRowExec` to force a particular output format. There are fast paths in both of these operators to avoid any redundant transitions.

### Why are the changes needed?

Without this change it is necessary to use reflection to get the final physical plan to determine whether it is columnar and to execute it is a columnar plan. `AdaptiveSparkPlanExec` only provides public methods for row-based execution.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I have manually tested this patch with the RAPIDS Accelerator for Apache Spark.

Closes #33140 from andygrove/support-columnar-adaptive.

Authored-by: Andy Grove <andygrove73@gmail.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2021-07-30 13:21:50 -05:00
Hyukjin Kwon c6140d4d0a [SPARK-36338][PYTHON][SQL] Move distributed-sequence implementation to Scala side
### What changes were proposed in this pull request?

This PR proposes to implement `distributed-sequence` index in Scala side.

### Why are the changes needed?

- Avoid unnecessary (de)serialization
- Keep the nullability in the input DataFrame when `distributed-sequence` is enabled. During the serialization, all fields are being nullable for now (see https://github.com/apache/spark/pull/32775#discussion_r645882104)

### Does this PR introduce _any_ user-facing change?

No to end users since pandas API on Spark is not released yet.

```python
import pyspark.pandas as ps
ps.set_option('compute.default_index_type', 'distributed-sequence')
ps.range(1).spark.print_schema()
```

Before:

```
root
 |-- id: long (nullable = true)
```

After:

```
root
 |-- id: long (nullable = false)
```

### How was this patch tested?

Manually tested, and existing tests should cover them.

Closes #33570 from HyukjinKwon/SPARK-36338.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-30 22:29:23 +09:00
Wenchen Fan 387a251a68 [SPARK-34952][SQL][FOLLOWUP] Simplify JDBC aggregate pushdown
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/33352 , to simplify the JDBC aggregate pushdown:
1. We should get the schema of the aggregate query by asking the JDBC server, instead of calculating it by ourselves. This can simplify the code a lot, and is also more robust: the data type of SUM may vary in different databases, it's fragile to assume they are always the same as Spark.
2. because of 1, now we can remove the `dataType` property from the public `Sum` expression.

This PR also contains some small improvements:
1. Spark should deduplicate the aggregate expressions before pushing them down.
2. Improve the `toString` of public aggregate expressions to make them more SQL.

### Why are the changes needed?

code and API simplification

### Does this PR introduce _any_ user-facing change?

this API is not released yet.

### How was this patch tested?

existing tests

Closes #33579 from cloud-fan/dsv2.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-07-30 00:26:32 -07:00
Chao Sun 0ece865ea4 [SPARK-36136][SQL][TESTS] Refactor PruneFileSourcePartitionsSuite etc to a different package
### What changes were proposed in this pull request?

Move both `PruneFileSourcePartitionsSuite` and `PrunePartitionSuiteBase` to the package `org.apache.spark.sql.execution.datasources`. Did a few refactoring to enable this.

### Why are the changes needed?

Currently both `PruneFileSourcePartitionsSuite` and `PrunePartitionSuiteBase` are in package `org.apache.spark.sql.hive.execution` which doesn't look correct as these tests are not specific to Hive. Therefore, it's better to move them into `org.apache.spark.sql.execution.datasources`, the same place where the rule `PruneFileSourcePartitions` is at.

### Does this PR introduce _any_ user-facing change?

No, it's just test refactoring.

### How was this patch tested?

Using existing tests:
```
build/sbt "sql/testOnly *PruneFileSourcePartitionsSuite"
```
and
```
build/sbt "hive/testOnly *PruneHiveTablePartitionsSuite"
```

Closes #33564 from sunchao/SPARK-36136-partitions-suite.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-07-29 17:18:23 -07:00
Yuanjian Li 4cd5fa96d8 [SPARK-36347][SS] Upgrade the RocksDB version to 6.20.3
### What changes were proposed in this pull request?
As the discussion in https://github.com/apache/spark/pull/32928/files#r654049392, after confirming the compatibility, we can use a newer RocksDB version for the state store implementation.

### Why are the changes needed?
For further ARM support and leverage the bug fix for the newer version.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #33578 from xuanyuanking/SPARK-36347.

Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-07-29 11:08:58 -07:00
Kousuke Saruta db18866742 [SPARK-36323][SQL] Support ANSI interval literals for TimeWindow
### What changes were proposed in this pull request?

This PR proposes to support ANSI interval literals for `TimeWindow`.

### Why are the changes needed?

Watermark also supports ANSI interval literals so it's great to support for `TimeWindow`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New test.

Closes #33551 from sarutak/window-interval.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-07-29 08:51:51 +03:00
Cheng Su 6a8dd3229a [SPARK-36272][SQL][TEST] Change shuffled hash join metrics test to check relative value of build size
### What changes were proposed in this pull request?

This is a follow up of https://github.com/apache/spark/pull/33447, where the unit test is disabled, due to failure after memory setting changed. I found the root cause is after https://github.com/apache/spark/pull/33447, in unit test, Spark memory page byte size is changed from `67108864` to `33554432` [1]. So the shuffled hash join build size is also changed accordingly due to [memory page byte size change](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L457). Previously the unit test is checking the exact value of build size, so it no longer works. Here we change the unit test to verify the relative value of build size, and it should work.

[1]: I printed out the memory page byte size explicitly in unit test - `org.apache.spark.SparkException: chengsu pageSizeBytes: 33554432!` in https://github.com/c21/spark/runs/3186680616?check_suite_focus=true .

### Why are the changes needed?

Make previously disabled unit test work.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Changed unit test itself.

Closes #33494 from c21/test.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-29 11:14:34 +09:00
Linhong Liu ed0e351f05 [SPARK-36286][SQL] Block some invalid datetime string
### What changes were proposed in this pull request?
In PR #32959, we found some weird datetime strings that can be parsed. ([details](https://github.com/apache/spark/pull/32959#discussion_r665015489))
This PR blocks the invalid datetime string.

### Why are the changes needed?
bug fix

### Does this PR introduce _any_ user-facing change?
Yes, below strings will have different results when cast to datetime.
```sql
select cast('12::' as timestamp); -- Before: 2021-07-07 12:00:00, After: NULL
select cast('T' as timestamp); -- Before: 2021-07-07 00:00:00, After: NULL
```

### How was this patch tested?
some new test cases

Closes #33490 from linhongliu-db/SPARK-35780-block-invalid-format.

Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-29 09:16:46 +08:00
Venki Korukanti eb4d1c0332 [SPARK-36236][SS] Additional metrics for RocksDB based state store implementation
### What changes were proposed in this pull request?

Proposing adding new metrics to `customMetrics` under the `stateOperators` in `StreamingQueryProgress` event These metrics help have better visibility into the RocksDB based state store in streaming jobs. For full details of metrics, refer to https://issues.apache.org/jira/browse/SPARK-36236.

### Why are the changes needed?

Current metrics available for the RockDB state store, do not provide observability into many operations such as how much time is spent by the RocksDB in compaction and what is the cache hit ratio. These metrics help compare performance differences in state store operations between slow and fast microbatches .

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unittests

Closes #33455 from vkorukanti/rocksdb-metrics.

Authored-by: Venki Korukanti <venki.korukanti@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-07-28 12:51:42 -07:00
dgd-contributor e1c50ff779 [SPARK-36229][SQL] conv() inconsistently handles invalid strings with more than 64 invalid characters and return wrong value on overflow
### What changes were proposed in this pull request?
1/ conv() have inconsistency in behavior where the returned value is different above the 64 char threshold.

```
scala> spark.sql("select conv(repeat('?', 64), 10, 16)").show
+---------------------------+
|conv(repeat(?, 64), 10, 16)|
+---------------------------+
|                          0|
+---------------------------+

scala> spark.sql("select conv(repeat('?', 65), 10, 16)").show // which should be 0
+---------------------------+
|conv(repeat(?, 65), 10, 16)|
+---------------------------+
|           FFFFFFFFFFFFFFFF|
+---------------------------+

scala> spark.sql("select conv(repeat('?', 65), 10, -16)").show // which should be 0
+----------------------------+
|conv(repeat(?, 65), 10, -16)|
+----------------------------+
|                          -1|
+----------------------------+

scala> spark.sql("select conv(repeat('?', 64), 10, -16)").show
+----------------------------+
|conv(repeat(?, 64), 10, -16)|
+----------------------------+
|                           0|
+----------------------------+
```

2/ conv should return result equal to max unsigned long value in base toBase when there is overflow

```
scala> spark.sql(select conv('aaaaaaa0aaaaaaa0a', 16, 10)).show // which should be 18446744073709551615

+-------------------------------+
|conv(aaaaaaa0aaaaaaa0a, 16, 10)|
+-------------------------------+
|           12297828695278266890|
+-------------------------------+
```

### Why are the changes needed?
Bug fix, this pull request aim to make conv function behave similarly with the behavior of conv function from MySQL database
### Does this PR introduce _any_ user-facing change?
change in result of conv() function
### How was this patch tested?
add test

Closes #33459 from dgd-contributor/SPARK-36229_convInconsistencyBehaviorWithMoreThan64Characters.

Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-29 00:19:04 +08:00
Pablo Langa f12793de20 [SPARK-35320][SQL] Improve error message for unsupported key types in MapType in from_json expression
### What changes were proposed in this pull request?

Currently, when a map is parsed in a from_json function, only StringType key is supported. If you try to parse other type, it results on a cast exception.
For example:
```scala
Seq((s"""{"2021-05-05T20:05:08": "sampleValue"}"""))
  .toDF("value")
  .withColumn("value1", from_json(col("value"),  MapType(TimestampType, StringType)))
  .show
```
```
Exception in thread "main" java.lang.ClassCastException: class org.apache.spark.unsafe.types.UTF8String cannot be cast to class java.lang.Long (org.apache.spark.unsafe.types.UTF8String is in unnamed module of loader 'app'; java.lang.Long is in module java.base of loader 'bootstrap')
	at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:107)
	at org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$8$adapted(Cast.scala:297)
	at org.apache.spark.sql.catalyst.expressions.CastBase.buildCast(Cast.scala:285)
	at org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$7(Cast.scala:297)
```
This PR proposes to improve the error message.
```
org.apache.spark.sql.AnalysisException: cannot resolve 'entries' due to data type mismatch: Input schema map<timestamp,string> can only contain StringType as a key type for a MapType.;
'Project [unresolvedalias(from_json(MapType(TimestampType,StringType,true), value#1, Some(America/Los_Angeles)), Some(org.apache.spark.sql.Column$$Lambda$1496/54693608710e5bf9c))]
+- LocalRelation [value#1]
	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:197)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:182)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$2(TreeNode.scala:535)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
...
```
In https://github.com/apache/spark/pull/32599 we decide to improve the error message instead of support this.

### Why are the changes needed?

Avoid confusion in the interpretation of the error

### Does this PR introduce _any_ user-facing change?

Yes, the error message returned in this case

### How was this patch tested?

Unit testing and manual testing

Closes #33525 from planga82/feature/spark35320_improve_error_message.

Authored-by: Pablo Langa <soypab@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-28 14:46:06 +08:00