Commit graph

4187 commits

Author SHA1 Message Date
Maxim Gekk e933539cdd [SPARK-29864][SPARK-29920][SQL] Strict parsing of day-time strings to intervals
### What changes were proposed in this pull request?
In this PR, I propose a new implementation of `fromDayTimeString` which strictly parses strings in day-time formats to intervals. The new implementation accepts only strings that match the pattern defined by the `from` and `to` bounds. Here is the mapping of the user's bounds to patterns (see the sketch after this list):
- `[+|-]D+ H[H]:m[m]:s[s][.SSSSSSSSS]` for **DAY TO SECOND**
- `[+|-]D+ H[H]:m[m]` for **DAY TO MINUTE**
- `[+|-]D+ H[H]` for **DAY TO HOUR**
- `[+|-]H[H]:m[m]:s[s][.SSSSSSSSS]` for **HOUR TO SECOND**
- `[+|-]H[H]:m[m]` for **HOUR TO MINUTE**
- `[+|-]m[m]:s[s][.SSSSSSSSS]` for **MINUTE TO SECOND**
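For illustration, a minimal spark-shell sketch (assuming an active `SparkSession` named `spark`; the literal is chosen to match the DAY TO SECOND pattern above):

```scala
// A string matching `[+|-]D+ H[H]:m[m]:s[s][.SSSSSSSSS]` parses fine under DAY TO SECOND,
// while the same string with narrower bounds (e.g. HOUR TO MINUTE) is rejected after this change.
spark.sql("SELECT INTERVAL '10 11:12:13.123' DAY TO SECOND").show(false)
```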

Closes #26327
Closes #26358

### Why are the changes needed?
- Improve the user experience with Spark SQL, and respect the bounds specified by users.
- Behave the same as other broadly used DBMSs such as Oracle and MySQL.

### Does this PR introduce any user-facing change?
Yes, before:
```sql
spark-sql> SELECT INTERVAL '10 11:12:13.123' HOUR TO MINUTE;
interval 1 weeks 3 days 11 hours 12 minutes
```
After:
```sql
spark-sql> SELECT INTERVAL '10 11:12:13.123' HOUR TO MINUTE;
Error in query:
requirement failed: Interval string must match day-time format of '^(?<sign>[+|-])?(?<hour>\d{1,2}):(?<minute>\d{1,2})$': 10 11:12:13.123(line 1, pos 16)

== SQL ==
SELECT INTERVAL '10 11:12:13.123' HOUR TO MINUTE
----------------^^^
```

### How was this patch tested?
- Added tests to `IntervalUtilsSuite`
- By `ExpressionParserSuite`
- Updated `literals.sql`

Closes #26473 from MaxGekk/strict-from-daytime-string.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-12-12 01:08:53 +08:00
Terry Kim beae14d5ed [SPARK-30104][SQL] Fix catalog resolution for 'global_temp'
### What changes were proposed in this pull request?

`global_temp` is used as a database name to access global temp views. The current catalog lookup logic considers only the first element of a multi-part name when it resolves a catalog. This results in using the session catalog even when `global_temp` is used as a table name under a v2 catalog. This PR addresses this by making sure the multi-part name has two elements before using the session catalog.

### Why are the changes needed?

Currently, 'global_temp' can be used as a table name in certain commands (CREATE) but not in others (DESCRIBE):
```
// Assume "spark.sql.globalTempDatabase" is set to "global_temp".
sql(s"CREATE TABLE testcat.t (id bigint, data string) USING foo")
sql(s"CREATE TABLE testcat.global_temp (id bigint, data string) USING foo")
sql("USE testcat")

sql(s"DESCRIBE TABLE t").show
+---------------+---------+-------+
|       col_name|data_type|comment|
+---------------+---------+-------+
|             id|   bigint|       |
|           data|   string|       |
|               |         |       |
| # Partitioning|         |       |
|Not partitioned|         |       |
+---------------+---------+-------+

sql(s"DESCRIBE TABLE global_temp").show
org.apache.spark.sql.AnalysisException: Table not found: global_temp;;
  'DescribeTable 'UnresolvedV2Relation [global_temp], org.apache.spark.sql.connector.InMemoryTableSessionCatalog@2f1af64f, `global_temp`, false
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:47)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:46)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:122)
```

### Does this PR introduce any user-facing change?

Yes, `sql(s"DESCRIBE TABLE global_temp").show` in the above example now displays:
```
+---------------+---------+-------+
|       col_name|data_type|comment|
+---------------+---------+-------+
|             id|   bigint|       |
|           data|   string|       |
|               |         |       |
| # Partitioning|         |       |
|Not partitioned|         |       |
+---------------+---------+-------+
```
instead of throwing an exception.
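As a related illustration, a minimal spark-shell sketch (assuming an active `SparkSession` named `spark`; the view name is made up) showing that global temp views remain reachable through the `global_temp` database:

```scala
// Create a global temp view and query it via the reserved `global_temp` database.
spark.range(3).createOrReplaceGlobalTempView("gtv")
spark.sql("SELECT * FROM global_temp.gtv").show()
```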

### How was this patch tested?

Added new tests.

Closes #26741 from imback82/global_temp.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-12-11 16:56:42 +08:00
Sean Owen 3cc55f6a0a [SPARK-29392][CORE][SQL][FOLLOWUP] More removal of 'foo Symbol syntax for Scala 2.13
### What changes were proposed in this pull request?

Another continuation of https://github.com/apache/spark/pull/26748

### Why are the changes needed?

To cleanly cross compile with Scala 2.13.

### Does this PR introduce any user-facing change?

None.

### How was this patch tested?

Existing tests

Closes #26842 from srowen/SPARK-29392.4.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-10 19:41:24 -08:00
Kent Yao 8f0eb7dc86 [SPARK-29587][SQL] Support SQL Standard type real as float(4) numeric as decimal
### What changes were proposed in this pull request?
The DECIMAL and NUMERIC types are equivalent; both are part of the SQL standard.

The REAL type is 4 bytes, variable-precision, inexact, with 6 decimal digits of precision, the same as our FLOAT, and is also part of the SQL standard.

### Why are the changes needed?

Improve SQL standard support.
Other DBMSs:
https://www.postgresql.org/docs/9.3/datatype-numeric.html
https://prestodb.io/docs/current/language/types.html#floating-point
http://www.sqlservertutorial.net/sql-server-basics/sql-server-data-types/
MySQL treats REAL as a synonym for DOUBLE PRECISION (a nonstandard variation), unless the REAL_AS_FLOAT SQL mode is enabled.
In MySQL, NUMERIC is implemented as DECIMAL, so the following remarks about DECIMAL apply equally to NUMERIC.
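A minimal spark-shell sketch of the intended mapping (assumed behavior after this change; `spark` is an active SparkSession):

```scala
// REAL is expected to resolve to FloatType and NUMERIC to DecimalType.
spark.sql("SELECT CAST(1 AS REAL) AS r, CAST(1 AS NUMERIC(10, 2)) AS n").printSchema()
// Expected (assumed): r resolves to float, n resolves to decimal(10,2).
```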

### Does this PR introduce any user-facing change?

no
### How was this patch tested?

add ut

Closes #26537 from yaooqinn/SPARK-29587.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-12-11 02:22:08 +08:00
Kent Yao 24c4ce1e64 [SPARK-28351][SQL][FOLLOWUP] Remove 'DELETE FROM' from unsupportedHiveNativeCommands
### What changes were proposed in this pull request?

Minor change: remove `DELETE FROM` from the unsupported Hive native commands, because it is supported by the parser.

### Why are the changes needed?
Clear up the ambiguity.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

no

Closes #26836 from yaooqinn/SPARK-28351.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-10 09:54:50 -08:00
Yuanjian Li d9b3069412 [SPARK-30125][SQL] Remove PostgreSQL dialect
### What changes were proposed in this pull request?
Reprocess all PostgreSQL dialect related PRs, listing in order:

- #25158: PostgreSQL integral division support [revert]
- #25170: UT changes for the integral division support [revert]
- #25458: Accept "true", "yes", "1", "false", "no", "0", and unique prefixes as input and trim input for the boolean data type. [revert]
- #25697: Combine below 2 feature tags into "spark.sql.dialect" [revert]
- #26112: Date subtraction support [keep the ANSI-compliant part]
- #26444: Rename config "spark.sql.ansi.enabled" to "spark.sql.dialect.spark.ansi.enabled" [revert]
- #26463: Cast to boolean support for PostgreSQL dialect [revert]
- #26584: Make the behavior of the PostgreSQL dialect independent of ansi mode config [keep the ANSI-compliant part]

### Why are the changes needed?
As per the discussion in http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-PostgreSQL-dialect-td28417.html, we need to remove the PostgreSQL dialect from the codebase for several reasons:
1. The current approach makes the codebase complicated and hard to maintain.
2. Fully migrating PostgreSQL workloads to Spark SQL is not our focus for now.

### Does this PR introduce any user-facing change?
Yes, the config `spark.sql.dialect` will be removed.

### How was this patch tested?
Existing UT.

Closes #26763 from xuanyuanking/SPARK-30125.

Lead-authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-12-11 01:22:34 +08:00
fuwhu c2f29d5ea5 [SPARK-30138][SQL] Separate configuration key of max iterations for analyzer and optimizer
### What changes were proposed in this pull request?
separate the configuration keys "spark.sql.optimizer.maxIterations" and "spark.sql.analyzer.maxIterations".

### Why are the changes needed?
Currently, both Analyzer and Optimizer use conf "spark.sql.optimizer.maxIterations" to set the max iterations to run, which is a little confusing.
It is clearer to add a new conf "spark.sql.analyzer.maxIterations" for analyzer max iterations.
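A minimal sketch of tuning the two limits independently after this change (config keys as named in this PR; the values are arbitrary examples):

```scala
// Analyzer and optimizer fixed-point iteration limits can now be set separately.
spark.conf.set("spark.sql.analyzer.maxIterations", "200")
spark.conf.set("spark.sql.optimizer.maxIterations", "100")
```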

### Does this PR introduce any user-facing change?
no

### How was this patch tested?
Existing unit tests.

Closes #26766 from fuwhu/SPARK-30138.

Authored-by: fuwhu <bestwwg@163.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2019-12-09 19:43:32 +09:00
Aman Omer dcea7a4c9a [SPARK-29883][SQL] Implement a helper method for aliasing bool_and() and bool_or()
### What changes were proposed in this pull request?
This PR introduces a method `expressionWithAlias` in class `FunctionRegistry` which is used to register function's constructor. Currently, `expressionWithAlias` is used to register `BoolAnd` & `BoolOr`.

### Why are the changes needed?
Error message is wrong when alias name is used for `BoolAnd` & `BoolOr`.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Tested manually.

For query,
`select every('true');`

Output before this PR,

> Error in query: cannot resolve 'bool_and('true')' due to data type mismatch: Input to function 'bool_and' should have been boolean, but it's [string].; line 1 pos 7;

After this PR,

> Error in query: cannot resolve 'every('true')' due to data type mismatch: Input to function 'every' should have been boolean, but it's [string].; line 1 pos 7;

Closes #26712 from amanomer/29883.

Authored-by: Aman Omer <amanomer1996@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-12-09 13:23:16 +08:00
Pablo Langa bca9de6684 [SPARK-29922][SQL] SHOW FUNCTIONS should do multi-catalog resolution
### What changes were proposed in this pull request?

Add ShowFunctionsStatement and make SHOW FUNCTIONS go through the same catalog/table resolution framework of v2 commands.

We don't have the following method in the catalog, which is needed to implement the v2 command:
* catalog.listFunctions

### Why are the changes needed?

It's important to make all the commands have the same table resolution behavior, to avoid confusion when running e.g.
`SHOW FUNCTIONS LIKE namespace.function`
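A minimal spark-shell sketch of the command in question (assuming an active `SparkSession` named `spark`):

```scala
// SHOW FUNCTIONS now resolves its (optionally multi-part) name through the same catalog framework.
spark.sql("SHOW FUNCTIONS LIKE 'concat*'").show(false)
```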

### Does this PR introduce any user-facing change?

Yes. When running `SHOW FUNCTIONS LIKE namespace.function`, Spark fails the command if the current catalog is set to a v2 catalog.

### How was this patch tested?

Unit tests.

Closes #26667 from planga82/feature/SPARK-29922_ShowFunctions_V2Catalog.

Authored-by: Pablo Langa <soypab@gmail.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
2019-12-08 20:15:09 -08:00
Kent Yao e88d74052b [SPARK-30147][SQL] Trim the string when cast string type to booleans
### What changes were proposed in this pull request?

Currently, we trim the string when casting a string value to the other `canCast` types, e.g. int, double, decimal, interval, date, and timestamp, but not to boolean.
This makes type casting and coercion inconsistent in Spark.
It does not fit the ANSI SQL standard either.
```
If TD is boolean, then
Case:
a) If SD is character string, then SV is replaced by
    TRIM ( BOTH ' ' FROM VE )
    Case:
    i) If the rules for literal in Subclause 5.3, “literal”, can be applied to SV to determine a valid
value of the data type TD, then let TV be that value.
   ii) Otherwise, an exception condition is raised: data exception — invalid character value for cast.
b) If SD is boolean, then TV is SV
```
In this pull request, we trim all the whitespace from both ends of the string before converting it to a bool value. This behavior is the same as for the other types, but a bit different from the SQL standard, which trims only spaces.

### Why are the changes needed?

Type cast/coercion consistency

### Does this PR introduce any user-facing change?

Yes, strings with whitespace at both ends will be trimmed before being converted to booleans.

e.g. `select cast('\t true' as boolean)` now returns `true`; before this PR it returned `null`.
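A minimal spark-shell sketch of the new behavior (assumed; `spark` is an active SparkSession):

```scala
// The tab and surrounding spaces are trimmed before the boolean conversion.
spark.sql("SELECT CAST('\t true ' AS BOOLEAN) AS b").show()
// Expected (assumed): b = true (it was NULL before this change).
```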
### How was this patch tested?

add unit tests

Closes #26776 from yaooqinn/SPARK-30147.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2019-12-07 15:03:51 +09:00
Sean Owen a30ec19a73 [SPARK-30155][SQL] Rename parse() to parseString() to avoid conflict in Scala 2.13
### What changes were proposed in this pull request?

Rename internal method LegacyTypeStringParser.parse() to parseString().

### Why are the changes needed?

In Scala 2.13, the parse() definition clashes with supertype declarations.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Existing tests.

Closes #26784 from srowen/SPARK-30155.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-06 16:16:28 -08:00
wuyi 58be82ad4b [SPARK-30098][SQL] Use default datasource as provider for CREATE TABLE syntax
### What changes were proposed in this pull request?

In this PR, we propose to use the value of `spark.sql.sources.default` as the provider for the `CREATE TABLE` syntax instead of `hive` in Spark 3.0.

And to help the migration, we introduce a legacy conf `spark.sql.legacy.respectHiveDefaultProvider.enabled` and set its default to `false`.

### Why are the changes needed?

1. Currently, the `CREATE TABLE` syntax uses the hive provider to create tables, while the `DataFrameWriter.saveAsTable` API uses the value of `spark.sql.sources.default` as the provider. It would be better to make them consistent.

2. Users may get confused in some cases. For example:

```
CREATE TABLE t1 (c1 INT) USING PARQUET;
CREATE TABLE t2 (c1 INT);
```

In these two DDLs, users may think that `t2` should also use parquet as its provider, since Spark always advertises parquet as the default format. However, it's hive in this case.

On the other hand, if we omit the USING clause in a CTAS statement, we do pick parquet by default if `spark.sql.hive.convertCTAS=true`:

```
CREATE TABLE t3 USING PARQUET AS SELECT 1 AS VALUE;
CREATE TABLE t4 AS SELECT 1 AS VALUE;
```
And these two cases together can be really confusing.

3. Now, Spark SQL is very independent and popular. We do not need to be fully consistent with Hive's behavior.

### Does this PR introduce any user-facing change?

Yes. Before this PR, the `CREATE TABLE` syntax used the hive provider. Now it uses the value of `spark.sql.sources.default` as its provider.
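A minimal spark-shell sketch of the new default (assumed behavior; the table name is made up):

```scala
// With no USING clause, the provider now comes from spark.sql.sources.default (parquet by default).
spark.sql("CREATE TABLE t_default (c1 INT)")
spark.sql("DESCRIBE TABLE EXTENDED t_default").show(100, truncate = false)
// Expected (assumed): the Provider row shows `parquet` rather than `hive`.
```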

### How was this patch tested?

Added tests in `DDLParserSuite` and `HiveDDLSuite`.

Closes #26736 from Ngone51/dev-create-table-using-parquet-by-default.

Lead-authored-by: wuyi <yi.wu@databricks.com>
Co-authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-12-07 02:15:25 +08:00
Sean Owen c8ed71b3cd [SPARK-30011][SQL] Inline 2.12 "AsIfIntegral" classes, not present in 2.13
### What changes were proposed in this pull request?

Classes like DoubleAsIfIntegral are not found in Scala 2.13, but used in the current build. This change 'inlines' the 2.12 implementation and makes it work with both 2.12 and 2.13.

### Why are the changes needed?

To cross-compile with 2.13.

### Does this PR introduce any user-facing change?

It should not as it copies in 2.12's existing behavior.

### How was this patch tested?

Existing tests.

Closes #26769 from srowen/SPARK-30011.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-06 08:15:38 -08:00
gengjiaan 187f3c1773 [SPARK-28083][SQL] Support LIKE ... ESCAPE syntax
## What changes were proposed in this pull request?

The syntax 'LIKE predicate: ESCAPE clause' is ANSI SQL.
For example:

```
select 'abcSpark_13sd' LIKE '%Spark\\_%';             //true
select 'abcSpark_13sd' LIKE '%Spark/_%';              //false
select 'abcSpark_13sd' LIKE '%Spark"_%';              //false
select 'abcSpark_13sd' LIKE '%Spark/_%' ESCAPE '/';   //true
select 'abcSpark_13sd' LIKE '%Spark"_%' ESCAPE '"';   //true
select 'abcSpark%13sd' LIKE '%Spark\\%%';             //true
select 'abcSpark%13sd' LIKE '%Spark/%%';              //false
select 'abcSpark%13sd' LIKE '%Spark"%%';              //false
select 'abcSpark%13sd' LIKE '%Spark/%%' ESCAPE '/';   //true
select 'abcSpark%13sd' LIKE '%Spark"%%' ESCAPE '"';   //true
select 'abcSpark\\13sd' LIKE '%Spark\\\\_%';          //true
select 'abcSpark/13sd' LIKE '%Spark//_%';             //false
select 'abcSpark"13sd' LIKE '%Spark""_%';             //false
select 'abcSpark/13sd' LIKE '%Spark//_%' ESCAPE '/';  //true
select 'abcSpark"13sd' LIKE '%Spark""_%' ESCAPE '"';  //true
```
But Spark SQL only supports 'LIKE predicate'.

Note: If the input string or pattern string is null, then the result is null too.

Several mainstream databases support this syntax.

**PostgreSQL:**
https://www.postgresql.org/docs/11/functions-matching.html

**Vertica:**
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/LanguageElements/Predicates/LIKE-predicate.htm?zoom_highlight=like%20escape

**MySQL:**
https://dev.mysql.com/doc/refman/5.6/en/string-comparison-functions.html

**Oracle:**
https://docs.oracle.com/en/database/oracle/oracle-database/19/jjdbc/JDBC-reference-information.html#GUID-5D371A5B-D7F6-42EB-8C0D-D317F3C53708
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/Pattern-matching-Conditions.html#GUID-0779657B-06A8-441F-90C5-044B47862A0A

## How was this patch tested?

Existing UTs and new UTs.

I merged this PR into my production environment and ran the SQL above:
```
spark-sql> select 'abcSpark_13sd' LIKE '%Spark\\_%';
true
Time taken: 0.119 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark_13sd' LIKE '%Spark/_%';
false
Time taken: 0.103 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark_13sd' LIKE '%Spark"_%';
false
Time taken: 0.096 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark_13sd' LIKE '%Spark/_%' ESCAPE '/';
true
Time taken: 0.096 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark_13sd' LIKE '%Spark"_%' ESCAPE '"';
true
Time taken: 0.092 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark%13sd' LIKE '%Spark\\%%';
true
Time taken: 0.109 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark%13sd' LIKE '%Spark/%%';
false
Time taken: 0.1 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark%13sd' LIKE '%Spark"%%';
false
Time taken: 0.081 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark%13sd' LIKE '%Spark/%%' ESCAPE '/';
true
Time taken: 0.095 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark%13sd' LIKE '%Spark"%%' ESCAPE '"';
true
Time taken: 0.113 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark\\13sd' LIKE '%Spark\\\\_%';
true
Time taken: 0.078 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark/13sd' LIKE '%Spark//_%';
false
Time taken: 0.067 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark"13sd' LIKE '%Spark""_%';
false
Time taken: 0.084 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark/13sd' LIKE '%Spark//_%' ESCAPE '/';
true
Time taken: 0.091 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark"13sd' LIKE '%Spark""_%' ESCAPE '"';
true
Time taken: 0.091 seconds, Fetched 1 row(s)
```
I created a table; its schema is:
```
spark-sql> desc formatted gja_test;
key     string  NULL
value   string  NULL
other   string  NULL

# Detailed Table Information
Database        test
Table   gja_test
Owner   test
Created Time    Wed Apr 10 11:06:15 CST 2019
Last Access     Thu Jan 01 08:00:00 CST 1970
Created By      Spark 2.4.1-SNAPSHOT
Type    MANAGED
Provider        hive
Table Properties        [transient_lastDdlTime=1563443838]
Statistics      26 bytes
Location        hdfs://namenode.xxx:9000/home/test/hive/warehouse/test.db/gja_test
Serde Library   org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat     org.apache.hadoop.mapred.TextInputFormat
OutputFormat    org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Storage Properties      [field.delim=   , serialization.format= ]
Partition Provider      Catalog
Time taken: 0.642 seconds, Fetched 21 row(s)
```
Table `gja_test` contains three rows of data.
```
spark-sql> select * from gja_test;
a       A       ao
b       B       bo
"__     """__   "
Time taken: 0.665 seconds, Fetched 3 row(s)
```
Finally, I tested this function:
```
spark-sql> select * from gja_test where key like value escape '"';
"__     """__   "
Time taken: 0.687 seconds, Fetched 1 row(s)
```

Closes #25001 from beliefer/ansi-sql-like.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2019-12-06 00:07:38 -08:00
Terry Kim b86d4bb931 [SPARK-30001][SQL] ResolveRelations should handle both V1 and V2 tables
### What changes were proposed in this pull request?

This PR makes `Analyzer.ResolveRelations` responsible for looking up both v1 and v2 tables from the session catalog and create an appropriate relation.

### Why are the changes needed?

Currently there are two issues:
1. As described in [SPARK-29966](https://issues.apache.org/jira/browse/SPARK-29966), the logic for resolving relation can load a table twice, which is a perf regression (e.g., Hive metastore can be accessed twice).
2. As described in [SPARK-30001](https://issues.apache.org/jira/browse/SPARK-30001), if a catalog name is specified for v1 tables, the query fails:
```
scala> sql("create table t using csv as select 1 as i")
res2: org.apache.spark.sql.DataFrame = []

scala> sql("select * from t").show
+---+
|  i|
+---+
|  1|
+---+

scala> sql("select * from spark_catalog.t").show
org.apache.spark.sql.AnalysisException: Table or view not found: spark_catalog.t; line 1 pos 14;
'Project [*]
+- 'UnresolvedRelation [spark_catalog, t]
```

### Does this PR introduce any user-facing change?

Yes. Now the catalog name is resolved correctly:
```
scala> sql("create table t using csv as select 1 as i")
res0: org.apache.spark.sql.DataFrame = []

scala> sql("select * from t").show
+---+
|  i|
+---+
|  1|
+---+

scala> sql("select * from spark_catalog.t").show
+---+
|  i|
+---+
|  1|
+---+
```

### How was this patch tested?

Added new tests.

Closes #26684 from imback82/resolve_relation.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-12-06 15:45:13 +08:00
Sean Owen 7782b61a31 [SPARK-29392][CORE][SQL][FOLLOWUP] Avoid deprecated (in 2.13) Symbol syntax 'foo in favor of simpler expression, where it generated deprecation warnings
TL;DR - this is more of the same change in https://github.com/apache/spark/pull/26748

I told you it'd be iterative!

Closes #26765 from srowen/SPARK-29392.3.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-05 13:48:29 -08:00
Kent Yao b9cae37750 [SPARK-29774][SQL] Date and Timestamp type +/- null should be null as Postgres
### What changes were proposed in this pull request?
Add an analyzer rule to convert unresolved `Add`, `Subtract`, etc. to `TimeAdd`, `DateAdd`, etc. according to the following policy:
```scala
 /**
   * For [[Add]]:
   * 1. if both side are interval, stays the same;
   * 2. else if one side is interval, turns it to [[TimeAdd]];
   * 3. else if one side is date, turns it to [[DateAdd]] ;
   * 4. else stays the same.
   *
   * For [[Subtract]]:
   * 1. if both side are interval, stays the same;
   * 2. else if the right side is an interval, turns it to [[TimeSub]];
   * 3. else if one side is timestamp, turns it to [[SubtractTimestamps]];
   * 4. else if the right side is date, turns it to [[DateDiff]]/[[SubtractDates]];
   * 5. else if the left side is date, turns it to [[DateSub]];
   * 6. else stays the same.
   *
   * For [[Multiply]]:
   * 1. If one side is interval, turns it to [[MultiplyInterval]];
   * 2. otherwise, stays the same.
   *
   * For [[Divide]]:
   * 1. If the left side is interval, turns it to [[DivideInterval]];
   * 2. otherwise, stays the same.
   */
```
Besides, we change datetime functions from implicitly casting their input types to being strict about them; all available type coercions happen in the `DateTimeOperations` coercion rule.
### Why are the changes needed?

Feature parity between PostgreSQL and Spark, and making the null semantics consistent in Spark.

### Does this PR introduce any user-facing change?

1. The date_add/date_sub functions now only accept int/tinyint/smallint as the second argument; double/string etc. are forbidden, as in Hive, because they produce weird results.
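A minimal spark-shell sketch of the headline behavior (assumed; `spark` is an active SparkSession):

```scala
// Date/timestamp arithmetic with NULL now yields NULL, matching PostgreSQL.
spark.sql("SELECT DATE '2019-12-05' + NULL AS d, TIMESTAMP '2019-12-05 00:00:00' - NULL AS t").show()
// Expected (assumed): both d and t are NULL.
```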

### How was this patch tested?

add ut

Closes #26412 from yaooqinn/SPARK-29774.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-12-05 22:03:44 +08:00
Kent Yao 332e252a14 [SPARK-29425][SQL] The ownership of a database should be respected
### What changes were proposed in this pull request?

Keep the owner of a database when executing alter database commands

### Why are the changes needed?

Spark will inadvertently delete the owner of a database when executing database DDLs.

### Does this PR introduce any user-facing change?

NO

### How was this patch tested?

add and modify uts

Closes #26080 from yaooqinn/SPARK-29425.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-12-05 16:14:27 +08:00
turbofei 0ab922c1eb [SPARK-29860][SQL] Fix dataType mismatch issue for InSubquery
### What changes were proposed in this pull request?
There is an issue for InSubquery expression.
For example, there are two tables `ta` and `tb` created by the below statements.
```
 sql("create table ta(id Decimal(18,0)) using parquet")
 sql("create table tb(id Decimal(19,0)) using parquet")
```
The statement below would throw a dataType mismatch exception.

```
 sql("select * from ta where id in (select id from tb)").show()
```
However, this similar statement could execute successfully.

```
 sql("select * from ta where id in ((select id from tb))").show()
```
The root cause is that, unlike the `In` expression, the `InSubquery` expression does not find a common type for two DecimalTypes.
Besides that, the `InSubquery` expression also does not find a common type for DecimalType and double/float/bigint.
In this PR, I fix this issue by finding the wider type for the `InSubquery` expression when DecimalType is involved.

### Why are the changes needed?
Some `InSubquery` expressions would throw a dataType mismatch exception.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Unit test.

Closes #26485 from turboFei/SPARK-29860-in-subquery.

Authored-by: turbofei <fwang12@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-12-05 16:00:16 +08:00
Sean Owen ebd83a544e [SPARK-30009][CORE][SQL][FOLLOWUP] Remove OrderingUtil and Utils.nanSafeCompare{Doubles,Floats} and use java.lang.{Double,Float}.compare directly
### What changes were proposed in this pull request?

Follow up on https://github.com/apache/spark/pull/26654#discussion_r353826162
Instead of OrderingUtil or Utils.nanSafeCompare{Doubles,Floats}, just use java.lang.{Double,Float}.compare directly. All work identically w.r.t. NaN when used to `compare`.
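For reference, a small sketch of how `java.lang.Double.compare` orders NaN (plain Scala, no Spark needed):

```scala
// NaN compares greater than every other double, including positive infinity.
println(java.lang.Double.compare(Double.NaN, Double.PositiveInfinity)) // 1
println(java.lang.Double.compare(0.0, Double.NaN))                     // -1
println(java.lang.Double.compare(Double.NaN, Double.NaN))              // 0
```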

### Why are the changes needed?

Simplification of the previous change, which existed to support Scala 2.13 migration.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests

Closes #26761 from srowen/SPARK-30009.2.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-12-05 11:27:25 +08:00
Sean Owen 2ceed6f32c [SPARK-29392][CORE][SQL][FOLLOWUP] Avoid deprecated (in 2.13) Symbol syntax 'foo in favor of simpler expression, where it generated deprecation warnings
### What changes were proposed in this pull request?

Where it generates a deprecation warning in Scala 2.13, replace Symbol shorthand syntax `'foo` with an equivalent.

### Why are the changes needed?

Symbol syntax `'foo` is deprecated in Scala 2.13. The lines changed below otherwise generate about 440 warnings when building for 2.13.

The previous PR directly replaced many usages with `Symbol("foo")`. But it's also used to specify Columns via implicit conversion (`.select('foo)`) or even where simple Strings are used (`.as('foo)`), as it's kind of an abstraction for interned Strings.

While I find this syntax confusing and would like to deprecate it, here I just replaced it where it generates a build warning (not sure why all occurrences don't): `$"foo"` or just `"foo"`.

### Does this PR introduce any user-facing change?

Should not change behavior.

### How was this patch tested?

Existing tests.

Closes #26748 from srowen/SPARK-29392.2.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-04 15:03:26 -08:00
Sean Owen 4193d2f4cc [SPARK-30012][CORE][SQL] Change classes extending scala collection classes to work with 2.13
### What changes were proposed in this pull request?

Move some classes extending Scala collections into parallel source trees, to support 2.13; other minor collection-related modifications.

Modify some classes extending Scala collections to work with 2.13 as well as 2.12. In many cases, this means introducing parallel source trees, as the type hierarchy changed in ways that one class can't support both.

### Why are the changes needed?

To support building for Scala 2.13 in the future.

### Does this PR introduce any user-facing change?

There should be no behavior change.

### How was this patch tested?

Existing tests. Note that the 2.13 changes are not tested by the PR builder, of course. They compile in 2.13 but can't even be tested locally. Later, once the project can be compiled for 2.13, thus tested, it's possible the 2.13 implementations will need updates.

Closes #26728 from srowen/SPARK-30012.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-03 08:59:43 -08:00
Kent Yao 65552a81d1 [SPARK-30083][SQL] visitArithmeticUnary should wrap PLUS case with UnaryPositive for type checking
### What changes were proposed in this pull request?

`UnaryPositive` only accepts numeric and interval types as we defined, but `AstBuilder.visitArithmeticUnary` currently just bypasses it.

This should not be omitted for the type checking requirement.

### Why are the changes needed?

bug fix, you can find a pre-discussion here https://github.com/apache/spark/pull/26578#discussion_r347350398

### Does this PR introduce any user-facing change?
Yes, unary `+` applied to a non-numeric, non-interval value is now invalid.
```
-- !query 14
select +date '1900-01-01'
-- !query 14 schema
struct<DATE '1900-01-01':date>
-- !query 14 output
1900-01-01

-- !query 15
select +timestamp '1900-01-01'
-- !query 15 schema
struct<TIMESTAMP '1900-01-01 00:00:00':timestamp>
-- !query 15 output
1900-01-01 00:00:00

-- !query 16
select +map(1, 2)
-- !query 16 schema
struct<map(1, 2):map<int,int>>
-- !query 16 output
{1:2}

-- !query 17
select +array(1,2)
-- !query 17 schema
struct<array(1, 2):array<int>>
-- !query 17 output
[1,2]

-- !query 18
select -'1'
-- !query 18 schema
struct<(- CAST(1 AS DOUBLE)):double>
-- !query 18 output
-1.0

-- !query 19
select -X'1'
-- !query 19 schema
struct<>
-- !query 19 output
org.apache.spark.sql.AnalysisException
cannot resolve '(- X'01')' due to data type mismatch: argument 1 requires (numeric or interval) type, however, 'X'01'' is of binary type.; line 1 pos 7

-- !query 20
select +X'1'
-- !query 20 schema
struct<X'01':binary>
-- !query 20 output
```

### How was this patch tested?

add ut check

Closes #26716 from yaooqinn/SPARK-30083.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-12-03 23:42:21 +08:00
Kent Yao 39291cff95 [SPARK-30048][SQL] Enable aggregates with interval type values for RelationalGroupedDataset
### What changes were proposed in this pull request?

Now that min/max/sum/avg are supported for intervals, we should also enable them in `RelationalGroupedDataset`.

### Why are the changes needed?

API consistency improvement

### Does this PR introduce any user-facing change?

Yes, Dataset supports min/max/sum/avg (mean) on intervals, as sketched below.
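A minimal spark-shell sketch (assumed behavior; the data and column names are made up):

```scala
import org.apache.spark.sql.functions._

// Grouped min/max/sum/avg over an interval column through the Dataset API.
val df = spark.sql(
  "SELECT * FROM VALUES (1, INTERVAL 1 DAY), (1, INTERVAL 2 DAY), (2, INTERVAL 3 HOUR) AS t(k, i)")
df.groupBy("k").agg(min("i"), max("i"), sum("i"), avg("i")).show(false)
```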
### How was this patch tested?

add ut

Closes #26681 from yaooqinn/SPARK-30048.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-12-03 18:40:14 +08:00
herman d7b268ab32 [SPARK-29348][SQL] Add observable Metrics for Streaming queries
### What changes were proposed in this pull request?
Observable metrics are named arbitrary aggregate functions that can be defined on a query (DataFrame). As soon as the execution of a DataFrame reaches a completion point (e.g. finishes a batch query or reaches a streaming epoch), a named event is emitted that contains the metrics for the data processed since the last completion point.

A user can observe these metrics by attaching a listener to spark session, it depends on the execution mode which listener to attach:
- Batch: `QueryExecutionListener`. This will be called when the query completes. A user can access the metrics by using the `QueryExecution.observedMetrics` map.
- (Micro-batch) Streaming: `StreamingQueryListener`. This will be called when the streaming query completes an epoch. A user can access the metrics by using the `StreamingQueryProgress.observedMetrics` map. Please note that we currently do not support continuous execution streaming.

### Why are the changes needed?
This enables observable metrics.

### Does this PR introduce any user-facing change?
Yes. It adds the `observe` method to `Dataset`.
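A minimal batch-mode sketch (assuming a spark-shell `SparkSession` named `spark`; the metric and column names are made up):

```scala
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.functions._
import org.apache.spark.sql.util.QueryExecutionListener

// Read the named metric back once the query completes.
spark.listenerManager.register(new QueryExecutionListener {
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
    qe.observedMetrics.get("my_metrics").foreach(row => println(s"cnt = ${row.getAs[Long]("cnt")}"))
  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = ()
})

// Attach an observable metric to the Dataset and trigger execution.
spark.range(100).observe("my_metrics", count(lit(1)).as("cnt")).collect()
```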

### How was this patch tested?
- Added unit tests for the `CollectMetrics` logical node to the `AnalysisSuite`.
- Added unit tests for `StreamingProgress` JSON serialization to the `StreamingQueryStatusAndProgressSuite`.
- Added integration tests for streaming to the `StreamingQueryListenerSuite`.
- Added integration tests for batch to the `DataFrameCallbackSuite`.

Closes #26127 from hvanhovell/SPARK-29348.

Authored-by: herman <herman@databricks.com>
Signed-off-by: herman <herman@databricks.com>
2019-12-03 11:25:49 +01:00
Ali Afroozeh 68034a8056 [SPARK-30072][SQL] Create dedicated planner for subqueries
### What changes were proposed in this pull request?

This PR changes subquery planning by calling the planner and plan preparation rules on the subquery plan directly. Before we were creating a `QueryExecution` instance for subqueries to get the executedPlan. This would re-run analysis and optimization on the subqueries plan. Running the analysis again on an optimized query plan can have unwanted consequences, as some rules, for example `DecimalPrecision`, are not idempotent.

As an example, consider the expression `1.7 * avg(a)` which after applying the `DecimalPrecision` rule becomes:

```
promote_precision(1.7) * promote_precision(avg(a))
```

After the optimization, more specifically the constant folding rule, this expression becomes:

```
1.7 * promote_precision(avg(a))
```

Now if we run the analyzer on this optimized query again, we will get:

```
promote_precision(1.7) * promote_precision(promote_precision(avg(a)))
```

Which will later optimized as:

```
1.7 * promote_precision(promote_precision(avg(a)))
```

As can be seen, re-running the analysis and optimization on this expression results in an expression with extra nested promote_precision nodes. Adding unneeded nodes to the plan is problematic because it can eliminate situations where we can reuse the plan.

We opted to introduce dedicated planners for subqueries, instead of making the DecimalPrecision rule idempotent, because this eliminates this entire category of problems. Another benefit is that planning time for subqueries is reduced.

### How was this patch tested?
Unit tests

Closes #26705 from dbaliafroozeh/CreateDedicatedPlannerForSubqueries.

Authored-by: Ali Afroozeh <ali.afroozeh@databricks.com>
Signed-off-by: herman <herman@databricks.com>
2019-12-02 20:56:40 +01:00
Wenchen Fan e271664a01 [MINOR][SQL] Rename config name to spark.sql.analyzer.failAmbiguousSelfJoin.enabled
### What changes were proposed in this pull request?

add `.enabled` postfix to `spark.sql.analyzer.failAmbiguousSelfJoin`.

### Why are the changes needed?

to follow the existing naming style

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

not needed

Closes #26694 from cloud-fan/conf.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-12-02 21:05:06 +08:00
Kent Yao 4e073f3c50 [SPARK-30047][SQL] Support interval types in UnsafeRow
### What changes were proposed in this pull request?

Optimize aggregates on interval values from sort-based to hash-based, so that we can use `org.apache.spark.sql.catalyst.expressions.RowBasedKeyValueBatch` for better performance.

### Why are the changes needed?

Improve aggregates on interval values.

### Does this PR introduce any user-facing change?
no

### How was this patch tested?

add ut and existing ones

Closes #26680 from yaooqinn/SPARK-30047.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-12-02 20:47:23 +08:00
LantaoJin 04a5b8f5f8 [SPARK-29839][SQL] Supporting STORED AS in CREATE TABLE LIKE
### What changes were proposed in this pull request?
In SPARK-29421 (#26097), we can specify a different table provider for `CREATE TABLE LIKE` via `USING provider`.
Hive supports the `STORED AS` new file format syntax:
```sql
CREATE TABLE tbl(a int) STORED AS TEXTFILE;
CREATE TABLE tbl2 LIKE tbl STORED AS PARQUET;
```
For Hive compatibility, we should also support `STORED AS` in `CREATE TABLE LIKE`.

### Why are the changes needed?
See https://github.com/apache/spark/pull/26097#issue-327424759

### Does this PR introduce any user-facing change?
Add a new syntax based on current CTL:
CREATE TABLE tbl2 LIKE tbl [STORED AS hiveFormat];

### How was this patch tested?
Add UTs.

Closes #26466 from LantaoJin/SPARK-29839.

Authored-by: LantaoJin <jinlantao@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-12-02 16:11:58 +08:00
Yuanjian Li 169415ffac [SPARK-30025][CORE] Continuous shuffle block fetching should be disabled by default when the old fetch protocol is used
### What changes were proposed in this pull request?
Disable continuous shuffle block fetching when the old fetch protocol is in use.

### Why are the changes needed?
The new feature of continuous shuffle block fetching depends on the latest version of the shuffle fetch protocol. We should keep this constraint in `BlockStoreShuffleReader.fetchContinuousBlocksInBatch`.

### Does this PR introduce any user-facing change?
Users will not get the exception related to continuous shuffle block fetching when an old version of the external shuffle service is used.

### How was this patch tested?
Existing UT.

Closes #26663 from xuanyuanking/SPARK-30025.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-12-02 15:59:12 +08:00
HyukjinKwon 51e69feb49 [SPARK-29851][SQL][FOLLOW-UP] Use foreach instead of misusing map
### What changes were proposed in this pull request?

This PR proposes to use foreach instead of misusing map, as a small follow-up of #26476. Misusing map could potentially cause some weird errors, and it's not a good practice anyway. See also SPARK-16694.
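A hedged, generic illustration of the pattern (not the exact code touched by this PR):

```scala
val names = Seq("a", "b", "c")

// Misuse: map builds and discards a Seq[Unit] when only the side effect is wanted.
names.map(println)

// foreach makes the intent clear and builds no throwaway collection.
names.foreach(println)
```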

### Why are the changes needed?
To avoid potential issues like SPARK-16694

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing tests should cover.

Closes #26729 from HyukjinKwon/SPARK-29851.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-12-02 13:40:00 +09:00
Yuanjian Li d1465a1b0d [SPARK-30074][SQL] The maxNumPostShufflePartitions config should obey reducePostShufflePartitions enabled
### What changes were proposed in this pull request?
1. Make the maxNumPostShufflePartitions config obey the reducePostShufflePartitions config.
2. Update the description for all the SQLConf affected by `spark.sql.adaptive.enabled`.

### Why are the changes needed?
Make the relation between these confs clearer.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing UT.

Closes #26664 from xuanyuanking/SPARK-9853-follow.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-12-02 12:37:06 +08:00
wuyi 87ebfaf003 [SPARK-29956][SQL] A literal number with an exponent should be parsed to Double
### What changes were proposed in this pull request?

For a literal number with an exponent (e.g. 1e-45, 1E2), we'd parse it to Double by default rather than Decimal. And users can still set `spark.sql.legacy.exponentLiteralToDecimal.enabled=true` to fall back to the previous behavior.

### Why are the changes needed?

According to the ANSI SQL standard, the (partial) definition of `literal` is:

```
<approximate numeric literal> ::=
    <mantissa> E <exponent>
```
which indicates that a literal number with an exponent should be approximate numeric(e.g. Double) rather than exact numeric(e.g. Decimal).

And when we test Presto, we found that Presto also conforms to this standard:

```
presto:default> select typeof(1E2);
 _col0
--------
 double
(1 row)
```

```
presto:default> select typeof(1.2);
    _col0
--------------
 decimal(2,1)
(1 row)
```

We also find that literals like `1E2` were actually parsed as Double before Spark 2.1, but changed to Decimal after #14828 because, as it said, *the difference between the two confuses most users*. But we also see support (from a DB2 test) for the original behavior at #14828 (comment).

Although, we also see that PostgreSQL has its own implementation:

```
postgres=# select pg_typeof(1E2);
 pg_typeof
-----------
 numeric
(1 row)

postgres=# select pg_typeof(1.2);
 pg_typeof
-----------
 numeric
(1 row)
```

We still think that Spark should conform to this standard, considering the SQL standard, Spark's own history, the majority of DBMSs, and user experience.

### Does this PR introduce any user-facing change?

Yes.

For `1E2`, before this PR:

```
scala> spark.sql("select 1E2")
res0: org.apache.spark.sql.DataFrame = [1E+2: decimal(1,-2)]
```

After this PR:

```
scala> spark.sql("select 1E2")
res0: org.apache.spark.sql.DataFrame = [100.0: double]
```

And for `1E-45`, before this PR:

```
org.apache.spark.sql.catalyst.parser.ParseException:
decimal can only support precision up to 38
== SQL ==
select 1E-45
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:131)
  at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:76)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:605)
  at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:605)
  ... 47 elided
```

after this PR:

```
scala> spark.sql("select 1E-45");
res1: org.apache.spark.sql.DataFrame = [1.0E-45: double]
```

And before this PR, users may find it strange that `select 1e40` works but `select 1e-40` fails. Now both of them work well.

### How was this patch tested?

updated `literals.sql.out` and `ansi/literals.sql.out`

Closes #26595 from Ngone51/SPARK-29956.

Authored-by: wuyi <ngone_5451@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-12-02 11:34:56 +08:00
Yuming Wang 708ab57f37 [SPARK-28461][SQL] Pad Decimal numbers with trailing zeros to the scale of the column
## What changes were proposed in this pull request?

[HIVE-12063](https://issues.apache.org/jira/browse/HIVE-12063) improved pad decimal numbers with trailing zeros to the scale of the column. The following description is copied from the description of HIVE-12063.

> HIVE-7373 was to address the problems of trimming tailing zeros by Hive, which caused many problems including treating 0.0, 0.00 and so on as 0, which has different precision/scale. Please refer to HIVE-7373 description. However, HIVE-7373 was reverted by HIVE-8745 while the underlying problems remained. HIVE-11835 was resolved recently to address one of the problems, where 0.0, 0.00, and so on cannot be read into decimal(1,1).
 However, HIVE-11835 didn't address the problem of showing as 0 in query result for any decimal values such as 0.0, 0.00, etc. This causes confusion as 0 and 0.0 have different precision/scale than 0.
The proposal here is to pad zeros for query result to the type's scale. This not only removes the confusion described above, but also aligns with many other DBs. Internal decimal number representation doesn't change, however.

**Spark SQL**:
```sql
// bin/spark-sql
spark-sql> select cast(1 as decimal(38, 18));
1
spark-sql>

// bin/beeline
0: jdbc:hive2://localhost:10000/default> select cast(1 as decimal(38, 18));
+----------------------------+--+
| CAST(1 AS DECIMAL(38,18))  |
+----------------------------+--+
| 1.000000000000000000       |
+----------------------------+--+

// bin/spark-shell
scala> spark.sql("select cast(1 as decimal(38, 18))").show(false)
+-------------------------+
|CAST(1 AS DECIMAL(38,18))|
+-------------------------+
|1.000000000000000000     |
+-------------------------+

// bin/pyspark
>>> spark.sql("select cast(1 as decimal(38, 18))").show()
+-------------------------+
|CAST(1 AS DECIMAL(38,18))|
+-------------------------+
|     1.000000000000000000|
+-------------------------+

// bin/sparkR
> showDF(sql("SELECT cast(1 as decimal(38, 18))"))
+-------------------------+
|CAST(1 AS DECIMAL(38,18))|
+-------------------------+
|     1.000000000000000000|
+-------------------------+
```

**PostgreSQL**:
```sql
postgres=# select cast(1 as decimal(38, 18));
       numeric
----------------------
 1.000000000000000000
(1 row)
```
**Presto**:
```sql
presto> select cast(1 as decimal(38, 18));
        _col0
----------------------
 1.000000000000000000
(1 row)
```

## How was this patch tested?

unit tests and manual test:
```sql
spark-sql> select cast(1 as decimal(38, 18));
1.000000000000000000
```
Spark SQL Upgrading Guide:
![image](https://user-images.githubusercontent.com/5399861/69649620-4405c380-10a8-11ea-84b1-6ee675663b98.png)

Closes #26697 from wangyum/SPARK-28461.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-12-02 09:02:39 +09:00
Dongjoon Hyun 9cd174a7c9 Revert "[SPARK-28461][SQL] Pad Decimal numbers with trailing zeros to the scale of the column"
This reverts commit 19af1fe3a2.
2019-11-27 11:07:08 -08:00
fuwhu 16da714ea5 [SPARK-29979][SQL][FOLLOW-UP] improve the output of DescribeTableExec
### What changes were proposed in this pull request?
refine the output of "DESC TABLE" command.

After this PR, the output of "DESC TABLE" command is like below :

```
id                        bigint
data                      string

# Partitioning
Part 0                    id

# Detailed Table Information
Name                      testca.table_name
Comment                   this is a test table
Location                  /tmp/testcat/table_name
Provider                  foo
Table Properties          [bar=baz]
```

### Why are the changes needed?
Currently, "DESC TABLE" will show reserved properties (eg. location, comment) in the "Table Property" section.
Since reserved properties are different from common properties, it is reasonable to display reserved properties together with the other detailed table information and to display the remaining properties in a single field; this is also consistent with Hive and the DescribeTableCommand action.

### Does this PR introduce any user-facing change?
yes, the output of "DESC TABLE" command is refined as above.

### How was this patch tested?
Update existing unit tests.

Closes #26677 from fuwhu/SPARK-29979-FOLLOWUP-1.

Authored-by: fuwhu <bestwwg@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-27 23:16:53 +08:00
Yuming Wang 19af1fe3a2 [SPARK-28461][SQL] Pad Decimal numbers with trailing zeros to the scale of the column
## What changes were proposed in this pull request?

[HIVE-12063](https://issues.apache.org/jira/browse/HIVE-12063) improved pad decimal numbers with trailing zeros to the scale of the column. The following description is copied from the description of HIVE-12063.

> HIVE-7373 was to address the problems of trimming tailing zeros by Hive, which caused many problems including treating 0.0, 0.00 and so on as 0, which has different precision/scale. Please refer to HIVE-7373 description. However, HIVE-7373 was reverted by HIVE-8745 while the underlying problems remained. HIVE-11835 was resolved recently to address one of the problems, where 0.0, 0.00, and so on cannot be read into decimal(1,1).
 However, HIVE-11835 didn't address the problem of showing as 0 in query result for any decimal values such as 0.0, 0.00, etc. This causes confusion as 0 and 0.0 have different precision/scale than 0.
The proposal here is to pad zeros for query result to the type's scale. This not only removes the confusion described above, but also aligns with many other DBs. Internal decimal number representation doesn't change, however.

**Spark SQL**:
```sql
// bin/spark-sql
spark-sql> select cast(1 as decimal(38, 18));
1
spark-sql>

// bin/beeline
0: jdbc:hive2://localhost:10000/default> select cast(1 as decimal(38, 18));
+----------------------------+--+
| CAST(1 AS DECIMAL(38,18))  |
+----------------------------+--+
| 1.000000000000000000       |
+----------------------------+--+

// bin/spark-shell
scala> spark.sql("select cast(1 as decimal(38, 18))").show(false)
+-------------------------+
|CAST(1 AS DECIMAL(38,18))|
+-------------------------+
|1.000000000000000000     |
+-------------------------+

// bin/pyspark
>>> spark.sql("select cast(1 as decimal(38, 18))").show()
+-------------------------+
|CAST(1 AS DECIMAL(38,18))|
+-------------------------+
|     1.000000000000000000|
+-------------------------+

// bin/sparkR
> showDF(sql("SELECT cast(1 as decimal(38, 18))"))
+-------------------------+
|CAST(1 AS DECIMAL(38,18))|
+-------------------------+
|     1.000000000000000000|
+-------------------------+
```

**PostgreSQL**:
```sql
postgres=# select cast(1 as decimal(38, 18));
       numeric
----------------------
 1.000000000000000000
(1 row)
```
**Presto**:
```sql
presto> select cast(1 as decimal(38, 18));
        _col0
----------------------
 1.000000000000000000
(1 row)
```

## How was this patch tested?

unit tests and manual test:
```sql
spark-sql> select cast(1 as decimal(38, 18));
1.000000000000000000
```
Spark SQL Upgrading Guide:
![image](https://user-images.githubusercontent.com/5399861/69649620-4405c380-10a8-11ea-84b1-6ee675663b98.png)

Closes #25214 from wangyum/SPARK-28461.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-11-27 18:13:33 +09:00
wuyi a58d91b159 [SPARK-29768][SQL] Column pruning through nondeterministic expressions
### What changes were proposed in this pull request?

Support column pruning through non-deterministic expressions.

### Why are the changes needed?

In some cases, columns can still be pruned even though non-deterministic expressions appear.
E.g. for the plan `Filter('a = 1, Project(Seq('a, rand() as 'r), LogicalRelation('a, 'b)))`, we should still prune column `b` even though a non-deterministic expression appears.
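A hedged spark-shell sketch of the effect (the path and column names are made up; the expected output is an assumption):

```scala
import org.apache.spark.sql.functions._

// Write a two-column parquet file, then read it back with a projection that adds rand().
val path = "/tmp/spark-29768-demo"  // hypothetical location
spark.range(10).selectExpr("id AS a", "id AS b").write.mode("overwrite").parquet(path)

val df = spark.read.parquet(path)
df.select(col("a"), rand().as("r")).filter(col("a") === 1).explain()
// Expected (assumed): the scan's ReadSchema contains only `a`, i.e. column `b` is pruned.
```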

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added a new test file: `ScanOperationSuite`.
Added test in `FileSourceStrategySuite` to verify the right prune behavior for both DS v1 and v2.

Closes #26629 from Ngone51/SPARK-29768.

Authored-by: wuyi <ngone_5451@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-27 15:37:01 +08:00
Kent Yao 4fd585d2c5 [SPARK-30008][SQL] The dataType of collect_list/collect_set aggs should be ArrayType(_, false)
### What changes were proposed in this pull request?

```scala
// Do not allow null values. We follow the semantics of Hive's collect_list/collect_set here.
// See: org.apache.hadoop.hive.ql.udf.generic.GenericUDAFMkCollectionEvaluator
```
These two functions do not allow null values as they are defined, so their elements should not contain null.

### Why are the changes needed?

Casting collect_list(a) to ArrayType(_, false) fails before this fix.
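A minimal spark-shell sketch (the expected schema output is an assumption):

```scala
import org.apache.spark.sql.functions._

// The element type of collect_list should no longer claim to contain nulls.
spark.range(3).agg(collect_list("id").as("ids")).printSchema()
// Expected (assumed): ids is array<bigint> with element containsNull = false.
```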

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

add ut

Closes #26651 from yaooqinn/SPARK-30008.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-26 20:40:21 -08:00
Kent Yao ed0c33fdd4 [SPARK-30026][SQL] Whitespaces can be identified as delimiters in interval string
### What changes were proposed in this pull request?

We are now able to handle whitespace for integral and fractional types, and leading or trailing whitespace for interval, date, and timestamp types. But the current interval parser is not able to treat whitespace as a separator, as PostgreSQL can.

This PR makes the whitespace handling consistent for interval values.
Typed interval literal, multi-unit representation, and casting from strings are all supported.

```sql
postgres=# select interval E'1 \t day';
 interval
----------
 1 day
(1 row)

postgres=# select interval E'1\t' day;
 interval
----------
 1 day
(1 row)
```
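A hedged spark-shell sketch of the corresponding Spark behavior (assumed; `spark` is an active SparkSession):

```scala
// A tab between the value and the unit is now treated as a delimiter, both in the
// multi-unit literal form and when casting from a string.
spark.sql("SELECT INTERVAL '1\t day' AS i1, CAST('1 \t day' AS INTERVAL) AS i2").show(false)
```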

### Why are the changes needed?

Whitespace handling should be consistent for interval values and across different types in Spark.
PostgreSQL feature parity.

### Does this PR introduce any user-facing change?
Yes, interval strings with multi-unit values separated by whitespace are now valid.

### How was this patch tested?
add ut.

Closes #26662 from yaooqinn/SPARK-30026.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-27 01:20:38 +08:00
Sean Owen 29018025ba [SPARK-30009][CORE][SQL] Support different floating-point Ordering for Scala 2.12 / 2.13
### What changes were proposed in this pull request?

Make separate source trees for Scala 2.12/2.13 in order to accommodate mutually-incompatible support for Ordering of double, float.

Note: This isn't the last change that will need a split source tree for 2.13. But this particular change could go several ways:

- (Split source tree)
- Inline the Scala 2.12 implementation
- Reflection

For this change alone any are possible, and splitting the source tree is a bit overkill. But if it will be necessary for other JIRAs (see umbrella SPARK-25075), then it might be the easiest way to implement this.

### Why are the changes needed?

Scala 2.13 split Ordering.Double into Ordering.Double.TotalOrdering and Ordering.Double.IeeeOrdering. Neither can be used in a single build that supports 2.12 and 2.13.

TotalOrdering works like java.lang.Double.compare. IeeeOrdering works like Scala 2.12 Ordering.Double. They differ in how NaN is handled - compares always above other values? or always compares as 'false'? In theory they have different uses: TotalOrdering is important if floating-point values are sorted. IeeeOrdering behaves like 2.12 and JVM comparison operators.

I chose TotalOrdering as I think we care more about stable sorting, and because elsewhere we rely on java.lang comparisons. It is also possible to support with two methods.

### Does this PR introduce any user-facing change?

Pending tests, will see if it obviously affects any sort order. We need to see if it changes NaN sort order.

### How was this patch tested?

Existing tests so far.

Closes #26654 from srowen/SPARK-30009.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-26 08:25:53 -08:00
Huaxin Gao 373c2c3f44 [SPARK-29862][SQL] CREATE (OR REPLACE) ... VIEW should look up catalog/table like v2 commands
### What changes were proposed in this pull request?
Add CreateViewStatement and make CREARE VIEW  go through the same catalog/table resolution framework of v2 commands.

### Why are the changes needed?
It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.
```
USE my_catalog
DESC v // success and describe the view v from my_catalog
CREATE VIEW v AS SELECT 1 // report view not found as there is no view v in the session catalog
```
### Does this PR introduce any user-facing change?
Yes. When running CREATE VIEW ..., Spark fails the command if the current catalog is set to a v2 catalog, or the view name specifies a v2 catalog.

### How was this patch tested?
unit tests

Closes #26649 from huaxingao/spark-29862.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-26 14:10:46 +08:00
Kent Yao 8b0121bea8 [MINOR][DOC] Fix the CalendarIntervalType description
### What changes were proposed in this pull request?

Fix the outdated and incorrect description of CalendarIntervalType.

### Why are the changes needed?

api doc correctness

### Does this PR introduce any user-facing change?

no
### How was this patch tested?

no

Closes #26659 from yaooqinn/intervaldoc.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-26 12:49:56 +08:00
fuwhu 29ebd9336c [SPARK-29979][SQL] Add basic/reserved property key constants in TableCatalog and SupportsNamespaces
### What changes were proposed in this pull request?
Add "comment" and "location" property key constants in TableCatalog and SupportNamespaces.
And update code of implementation classes to use these constants instead of hard code.

### Why are the changes needed?
Currently, some basic/reserved keys (e.g. "location", "comment") of table and namespace properties are hard-coded or defined in specific logical plan implementation classes.
These keys can be centralized in TableCatalog and SupportsNamespaces interface and shared across different implementation classes.
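
A rough sketch of the idea (the trait and constant names below are illustrative, not the exact definitions in this PR):

```scala
// Centralize reserved property keys on the interface instead of repeating string literals
// in every implementation class.
trait TableCatalogLike {
  final val PROP_COMMENT: String = "comment"
  final val PROP_LOCATION: String = "location"
}

class MyCatalog extends TableCatalogLike {
  def describe(properties: Map[String, String]): String =
    properties.getOrElse(PROP_COMMENT, "")   // no hard-coded "comment" literal here
}
```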

### Does this PR introduce any user-facing change?
no

### How was this patch tested?
Existing unit test

Closes #26617 from fuwhu/SPARK-29979.

Authored-by: fuwhu <bestwwg@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-26 01:24:43 +08:00
Kent Yao de21f28f8a [SPARK-29986][SQL] casting string to date/timestamp/interval should trim all whitespaces
### What changes were proposed in this pull request?

A Java-like string trim method trims all whitespace characters less than or equal to 0x20. Currently, our UTF8String handles the space character (0x20) ONLY. This is not suitable for many cases in Spark, such as trimming interval strings, dates, timestamps, and PostgreSQL-like casting of strings to boolean.
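
For context, a plain Scala sketch (no Spark internals) of the gap described above:

```scala
val s = "\t 2019-11-25 \u0001"

// java.lang.String#trim drops every leading/trailing character <= 0x20, including tabs
// and control characters.
s.trim                                                      // "2019-11-25"

// A space-only trim, roughly what UTF8String did before this change, leaves the tab and
// the 0x01 control character in place, so a later date/timestamp/interval parse fails.
s.dropWhile(_ == ' ').reverse.dropWhile(_ == ' ').reverse   // "\t 2019-11-25 \u0001"
```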

### Why are the changes needed?

improve the whitespace handling in UTF8String, with some bugs fixed along the way

### Does this PR introduce any user-facing change?

yes,
strings with `control characters` at either end can now be converted to date/timestamp and interval values

### How was this patch tested?

add ut

Closes #26626 from yaooqinn/SPARK-29986.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-25 14:37:04 +08:00
Kent Yao 5cf475d288 [SPARK-30000][SQL] Trim the string when cast string type to decimals
### What changes were proposed in this pull request?

https://bugs.openjdk.java.net/browse/JDK-8170259
https://bugs.openjdk.java.net/browse/JDK-8170563

When we cast string type to decimal type, we rely on java.math.BigDecimal. It can't accept leading or trailing spaces, as you can see in the above links. This behavior is not consistent with the other numeric types; we need to fix it and keep consistency.
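
A quick JVM sketch of the underlying limitation (plain Scala/Java, independent of Spark):

```scala
import java.math.BigDecimal

new BigDecimal("1.23")          // fine
// new BigDecimal(" 1.23 ")     // throws NumberFormatException: BigDecimal does not trim for us
new BigDecimal(" 1.23 ".trim)   // fine once the input is trimmed, roughly what the cast path now does
```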

### Why are the changes needed?

make casting from string to numeric types consistent

### Does this PR introduce any user-facing change?

yes, strings with leading or trailing whitespace can now be cast to decimal if the trimmed value is valid

### How was this patch tested?

1. modify ut

#### Benchmark
```scala
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.execution.benchmark

import org.apache.spark.benchmark.Benchmark

/**
 * Benchmark trim the string when casting string type to Boolean/Numeric types.
 * To run this benchmark:
 * {{{
 *   1. without sbt:
 *      bin/spark-submit --class <this class> --jars <spark core test jar> <spark sql test jar>
 *   2. build/sbt "sql/test:runMain <this class>"
 *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
 *      Results will be written to "benchmarks/CastBenchmark-results.txt".
 * }}}
 */
object CastBenchmark extends SqlBasedBenchmark {

  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
    val title = "Cast String to Integral"
    runBenchmark(title) {
      withTempPath { dir =>
        val N = 500L << 14
        val df = spark.range(N)
        val types = Seq("decimal")
        (1 to 5).by(2).foreach { i =>
          df.selectExpr(s"concat(id, '${" " * i}') as str")
            .write.mode("overwrite").parquet(dir + i.toString)
        }

        val benchmark = new Benchmark(title, N, minNumIters = 5, output = output)
        Seq(true, false).foreach { trim =>
          types.foreach { t =>
            val str = if (trim) "trim(str)" else "str"
            val expr = s"cast($str as $t) as c_$t"
            (1 to 5).by(2).foreach { i =>
              benchmark.addCase(expr + s" - with $i spaces") { _ =>
                spark.read.parquet(dir + i.toString).selectExpr(expr).collect()
              }
            }
          }
        }
        benchmark.run()
      }
    }
  }
}

```

#### string trim vs not trim
```java
[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.1
[info] Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
[info] Cast String to Integral:                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] cast(trim(str) as decimal) as c_decimal - with 1 spaces           3362           5486         NaN          2.4         410.4       1.0X
[info] cast(trim(str) as decimal) as c_decimal - with 3 spaces           3251           5655         NaN          2.5         396.8       1.0X
[info] cast(trim(str) as decimal) as c_decimal - with 5 spaces           3208           5725         NaN          2.6         391.7       1.0X
[info] cast(str as decimal) as c_decimal - with 1 spaces          13962          16233        1354          0.6        1704.3       0.2X
[info] cast(str as decimal) as c_decimal - with 3 spaces          14273          14444         179          0.6        1742.4       0.2X
[info] cast(str as decimal) as c_decimal - with 5 spaces          14318          14535         125          0.6        1747.8       0.2X
```
#### string trim vs this fix
```java
[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.1
[info] Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
[info] Cast String to Integral:                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] cast(trim(str) as decimal) as c_decimal - with 1 spaces           3265           6299         NaN          2.5         398.6       1.0X
[info] cast(trim(str) as decimal) as c_decimal - with 3 spaces           3183           6241         693          2.6         388.5       1.0X
[info] cast(trim(str) as decimal) as c_decimal - with 5 spaces           3167           5923        1151          2.6         386.7       1.0X
[info] cast(str as decimal) as c_decimal - with 1 spaces           3161           5838        1126          2.6         385.9       1.0X
[info] cast(str as decimal) as c_decimal - with 3 spaces           3046           3457         837          2.7         371.8       1.1X
[info] cast(str as decimal) as c_decimal - with 5 spaces           3053           4445         NaN          2.7         372.7       1.1X
[info]
```

Closes #26640 from yaooqinn/SPARK-30000.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-25 12:47:07 +08:00
Sean Owen 13896e4eae [SPARK-30013][SQL] For scala 2.13, omit parens in various BigDecimal value() methods
### What changes were proposed in this pull request?

Omit parens on calls like BigDecimal.longValue()

### Why are the changes needed?

For some reason, this won't compile in Scala 2.13. The calls are otherwise equivalent in 2.12.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Existing tests

Closes #26653 from srowen/SPARK-30013.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-24 18:23:34 -08:00
Takeshi Yamamuro 3f3a18fff1 [SPARK-24690][SQL] Add a config to control plan stats computation in LogicalRelation
### What changes were proposed in this pull request?

This pr proposes a new independent config so that `LogicalRelation` can use `rowCount` to compute data statistics in logical plans even if CBO is disabled. In the master branch, we currently cannot enable `StarSchemaDetection.reorderStarJoins` because we need to turn off CBO to enable it, but `StarSchemaDetection` internally references the `rowCount` that `LogicalRelation` only exposes when CBO is enabled.

### Why are the changes needed?

Plan stats are pretty useful beyond CBO, e.g., for the star-schema detector and dynamic partition pruning.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added tests in `DataFrameJoinSuite`.

Closes #21668 from maropu/PlanStatsConf.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-24 08:30:24 -08:00
Dongjoon Hyun 13338eaa95 [SPARK-29554][SQL][FOLLOWUP] Rename Version to SparkVersion
### What changes were proposed in this pull request?

This is a follow-up of https://github.com/apache/spark/pull/26209 .
This renames class `Version` to class `SparkVersion`.

### Why are the changes needed?

According to the review comment, this uses a more specific class name.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the Jenkins with the existing tests.

Closes #26647 from dongjoon-hyun/SPARK-29554.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-23 19:53:52 -08:00
HyukjinKwon fc7a37b147 [SPARK-30003][SQL] Do not throw stack overflow exception in non-root unknown hint resolution
### What changes were proposed in this pull request?
This is rather a followup of https://github.com/apache/spark/pull/25464 (see https://github.com/apache/spark/pull/25464/files#r349543286)

It will cause an infinite recursion via mapping children - we should return the hint rather than its parent plan in unknown hint resolution.

### Why are the changes needed?

Prevent stack overflow during hint resolution.

### Does this PR introduce any user-facing change?

Yes, it avoids a stack overflow exception. It was caused by https://github.com/apache/spark/pull/25464 and exists only in the master branch.

No behaviour changes to end users as it happened only in the master.

### How was this patch tested?

Unittest was added.

Closes #26642 from HyukjinKwon/SPARK-30003.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-11-23 17:24:56 +09:00
Wenchen Fan 6e581cf164 [SPARK-29893][SQL][FOLLOWUP] code cleanup for local shuffle reader
### What changes were proposed in this pull request?

A few cleanups for https://github.com/apache/spark/pull/26516:
1. Move the calculation of partition start indices from the RDD to the rule. We can reuse code from "shrink number of reducers" in the future if we split partitions by size.
2. only check extra shuffles when adding local readers to the probe side.
3. add comments.
4. simplify the config name: `optimizedLocalShuffleReader` -> `localShuffleReader`

### Why are the changes needed?

make code more maintainable.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

existing tests

Closes #26625 from cloud-fan/aqe.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2019-11-22 10:26:54 -08:00
Wenchen Fan 6b4b6a87cd [SPARK-29558][SQL] ResolveTables and ResolveRelations should be order-insensitive
### What changes were proposed in this pull request?

Make `ResolveRelations` call `ResolveTables` at the beginning, and make `ResolveTables` call `ResolveTempViews`(newly added) at the beginning, to ensure the relation resolution priority.

### Why are the changes needed?

To resolve an `UnresolvedRelation`, the general process is:
1. try to resolve to (global) temp view first. If it's not a temp view, move on
2. if the table name specifies a catalog, lookup the table from the specified catalog. Otherwise, lookup table from the current catalog.
3. when looking up table from session catalog, return a v1 relation if the table provider is v1.

Currently, this process is done by 2 rules: `ResolveTables` and `ResolveRelations`. To avoid rule conflicts, we add a lot of checks:
1. `ResolveTables` only resolves `UnresolvedRelation` if it's not a temp view and the resolved table is not v1.
2. `ResolveRelations` only resolves `UnresolvedRelation` if the table name has less than 2 parts.

This requires to run `ResolveTables` before `ResolveRelations`, otherwise we may resolve a v2 table to a v1 relation.

To clearly guarantee the resolution priority, and avoid massive changes, this PR proposes to call one rule in another rule to ensure the rule execution order. Now the process is simple:
1. first run `ResolveTempViews`, see if we can resolve relation to temp view
2. then run `ResolveTables`, see if we can resolve relation to v2 tables.
3. finally run `ResolveRelations`, see if we can resolve relation to v1 tables.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

existing tests

Closes #26214 from cloud-fan/resolve.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Ryan Blue <blue@apache.org>
2019-11-21 09:47:42 -08:00
Ximo Guanter 54c5087a3a [SPARK-29248][SQL] provider number of partitions when creating v2 data writer factory
### What changes were proposed in this pull request?
When implementing a ScanBuilder, we require the implementor to provide the schema of the data and the number of partitions.

However, when someone is implementing WriteBuilder we only pass them the schema, but not the number of partitions. This is an asymmetrical developer experience.

This PR adds a PhysicalWriteInfo interface that is passed to createBatchWriterFactory and createStreamingWriterFactory, carrying the number of partitions of the data that is going to be written.
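
A rough sketch of how a connector could use the extra information (the class below and its provisioning hook are made up for illustration; only the `PhysicalWriteInfo`/`numPartitions` shape comes from the description):

```scala
import org.apache.spark.sql.connector.write.{BatchWrite, DataWriterFactory, PhysicalWriteInfo, WriterCommitMessage}

// A write that provisions its target (e.g. creates a Kafka topic with a matching partition
// count) before any writer tasks start.
class ProvisioningBatchWrite(provision: Int => Unit, factory: DataWriterFactory) extends BatchWrite {
  override def createBatchWriterFactory(info: PhysicalWriteInfo): DataWriterFactory = {
    provision(info.numPartitions())   // now known up front thanks to PhysicalWriteInfo
    factory
  }
  override def commit(messages: Array[WriterCommitMessage]): Unit = ()
  override def abort(messages: Array[WriterCommitMessage]): Unit = ()
}
```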

### Why are the changes needed?
Passing in the number of partitions on the WriteBuilder would enable data sources to provision their write targets before starting to write. For example:

- it could be used to provision a Kafka topic with a specific number of partitions
- it could be used to scale a microservice prior to sending the data to it
- it could be used to create a DSv2 source that sends the data to another Spark cluster (currently not possible since the reader wouldn't be able to know the number of partitions)
### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Tests passed

Closes #26591 from edrevo/temp.

Authored-by: Ximo Guanter <joaquin.guantergonzalbez@telefonica.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-22 00:19:25 +08:00
Takeshi Yamamuro cdcd43cbf2 [SPARK-29977][SQL] Remove newMutableProjection/newOrdering/newNaturalAscendingOrdering from SparkPlan
### What changes were proposed in this pull request?

This is to refactor `SparkPlan` code; it mainly removed `newMutableProjection`/`newOrdering`/`newNaturalAscendingOrdering` from `SparkPlan`.
The other modifications are listed below;
 - Move `BaseOrdering` from `o.a.s.sql.catalyst.expressions.codegen.GenerateOrdering.scala` to `o.a.s.sql.catalyst.expressions.ordering.scala`
 - `RowOrdering` extends `CodeGeneratorWithInterpretedFallback` for `BaseOrdering`
 - Remove the unused variables (`subexpressionEliminationEnabled` and `codeGenFallBack`) from `SparkPlan`

### Why are the changes needed?

For better code/test coverage.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing.

Closes #26615 from maropu/RefactorOrdering.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-21 23:51:12 +08:00
Kent Yao 7a70670345 [SPARK-29961][SQL] Implement builtin function - typeof
### What changes were proposed in this pull request?
Add typeof function for Spark to get the underlying type of value.
```sql
-- !query 0
select typeof(1)
-- !query 0 schema
struct<typeof(1):string>
-- !query 0 output
int

-- !query 1
select typeof(1.2)
-- !query 1 schema
struct<typeof(1.2):string>
-- !query 1 output
decimal(2,1)

-- !query 2
select typeof(array(1, 2))
-- !query 2 schema
struct<typeof(array(1, 2)):string>
-- !query 2 output
array<int>

-- !query 3
select typeof(a) from (values (1), (2), (3.1)) t(a)
-- !query 3 schema
struct<typeof(a):string>
-- !query 3 output
decimal(11,1)
decimal(11,1)
decimal(11,1)

```

##### presto

```sql
presto> select typeof(array[1]);
     _col0
----------------
 array(integer)
(1 row)
```
##### PostgreSQL

```sql
postgres=# select pg_typeof(a) from (values (1), (2), (3.0)) t(a);
 pg_typeof
-----------
 numeric
 numeric
 numeric
(3 rows)
```
##### impala
https://issues.apache.org/jira/browse/IMPALA-1597

### Why are the changes needed?
a function that is good to have, to help us debug, test, develop ...

### Does this PR introduce any user-facing change?

add a new function
### How was this patch tested?

add ut and example

Closes #26599 from yaooqinn/SPARK-29961.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-11-21 10:28:32 +09:00
LantaoJin 06e203b856 [SPARK-29911][SQL] Uncache cached tables when session closed
### What changes were proposed in this pull request?
The local temporary view is session-scoped. Its lifetime is the lifetime of the session that created it. But cached data is cross-session: its lifetime is the lifetime of the Spark application. This will cause a memory leak if a local temporary view is cached in memory and the session is closed without dropping it.
In this PR, we uncache the cached data of a local temporary view when its session is closed. This PR doesn't impact the cached data of global temp views or persisted views.

How to reproduce:
1. create a local temporary view v1
2. cache it in memory
3. close the session without dropping v1.

The application will hold the memory forever; in a long-running Thrift server scenario, it's even worse.
```shell
0: jdbc:hive2://localhost:10000> CACHE TABLE testCacheTable AS SELECT 1;
CACHE TABLE testCacheTable AS SELECT 1;
+---------+--+
| Result  |
+---------+--+
+---------+--+
No rows selected (1.498 seconds)
0: jdbc:hive2://localhost:10000> !close
!close
Closing: 0: jdbc:hive2://localhost:10000
0: jdbc:hive2://localhost:10000 (closed)> !connect 'jdbc:hive2://localhost:10000'
!connect 'jdbc:hive2://localhost:10000'
Connecting to jdbc:hive2://localhost:10000
Enter username for jdbc:hive2://localhost:10000:
lajin
Enter password for jdbc:hive2://localhost:10000:
***
Connected to: Spark SQL (version 3.0.0-SNAPSHOT)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
1: jdbc:hive2://localhost:10000> select * from testCacheTable;
select * from testCacheTable;
Error: Error running query: org.apache.spark.sql.AnalysisException: Table or view not found: testCacheTable; line 1 pos 14;
'Project [*]
+- 'UnresolvedRelation [testCacheTable] (state=,code=0)
```
<img width="1047" alt="Screen Shot 2019-11-15 at 2 03 49 PM" src="https://user-images.githubusercontent.com/1853780/68923527-7ca8c180-07b9-11ea-9cc7-74f276c46840.png">

### Why are the changes needed?
Resolve memory leak for thrift server

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Manual test in UI storage tab
And add an UT

Closes #26543 from LantaoJin/SPARK-29911.

Authored-by: LantaoJin <jinlantao@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-11-20 18:19:30 -06:00
Sean Owen 1febd373ea [MINOR][TESTS] Replace JVM assert with JUnit Assert in tests
### What changes were proposed in this pull request?

Use JUnit assertions in tests uniformly, not JVM assert() statements.

### Why are the changes needed?

assert() statements do not produce as useful errors when they fail, and, if they were somehow disabled, would fail to test anything.
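
A tiny illustration of the difference (generic test code, not taken from the patch):

```scala
import org.junit.Assert.assertEquals

val result = 1 + 2

// JVM assert: silently skipped when assertions are disabled (-da) and, on failure, only
// reports "assertion failed" without the values involved.
assert(result == 3)

// JUnit assert: always evaluated, and a failure reports expected vs. actual values.
assertEquals(3, result)
```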

### Does this PR introduce any user-facing change?

No. The assertion logic should be identical.

### How was this patch tested?

Existing tests.

Closes #26581 from srowen/assertToJUnit.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-11-20 14:04:15 -06:00
Yuanjian Li 23b3c4fafd [SPARK-29951][SQL] Make the behavior of Postgre dialect independent of ansi mode config
### What changes were proposed in this pull request?
Fix the inconsistent behavior of the built-in SQL functions LEFT/RIGHT.

### Why are the changes needed?
As noted in the comment at https://github.com/apache/spark/pull/26497#discussion_r345708065, the PostgreSQL dialect should not be affected by the ANSI mode config.
While rerunning the existing tests, only the LEFT/RIGHT built-in SQL functions broke that assumption. We fix this by following https://www.postgresql.org/docs/12/sql-keywords-appendix.html: `LEFT/RIGHT reserved (can be function or type)`

### Does this PR introduce any user-facing change?
Yes, the PostgreSQL dialect will not be affected by the ANSI mode config.

### How was this patch tested?
Existing UT.

Closes #26584 from xuanyuanking/SPARK-29951.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-21 00:56:48 +08:00
Takeshi Yamamuro 0032d85153 [SPARK-29968][SQL] Remove the Predicate code from SparkPlan
### What changes were proposed in this pull request?

This is to refactor Predicate code; it mainly removed `newPredicate` from `SparkPlan`.
Modifications are listed below;
 - Move `Predicate` from `o.a.s.sql.catalyst.expressions.codegen.GeneratePredicate.scala` to `o.a.s.sql.catalyst.expressions.predicates.scala`
 - To resolve the name conflict, rename `o.a.s.sql.catalyst.expressions.codegen.Predicate` to `o.a.s.sql.catalyst.expressions.BasePredicate`
 - Extend `CodeGeneratorWithInterpretedFallback` for `BasePredicate`

This comes from the cloud-fan suggestion: https://github.com/apache/spark/pull/26420#discussion_r348005497

### Why are the changes needed?

For better code/test coverage.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #26604 from maropu/RefactorPredicate.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-20 21:13:51 +08:00
Nikita Konda 5a70af7a6c [SPARK-29029][SQL] Use AttributeMap in PhysicalOperation.collectProjectsAndFilters
### What changes were proposed in this pull request?

This PR fixes the issue of substituting aliases while collecting filters in `PhysicalOperation.collectProjectsAndFilters`. When the `AttributeReference` in the alias map differs from the `AttributeReference` in the filter condition only in its qualifier, the alias is not substituted and an exception saying `key videoid#47L not found` is thrown in the following scenario.

```
[1] Project [userid#0]
+- [2] Filter (isnotnull(videoid#47L) && NOT (videoid#47L = 30))
   +- [3] Project [factorial(videoid#1) AS videoid#47L, userid#0]
      +- [4] Filter (isnotnull(avebitrate#2) && (avebitrate#2 < 10))
         +- [5] Relation[userid#0,videoid#1,avebitrate#2]
```

### Why are the changes needed?

We need to use `AttributeMap` where the key is `AttributeReference`'s `ExprId` instead of `Map[Attribute, Expression]` while collecting and substituting aliases in `PhysicalOperation.collectProjectsAndFilters`.
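
A minimal Catalyst sketch of why keying by `ExprId` matters (internal APIs, shown only to illustrate the lookup semantics):

```scala
import org.apache.spark.sql.catalyst.expressions.{AttributeMap, AttributeReference}
import org.apache.spark.sql.types.LongType

val videoid = AttributeReference("videoid", LongType)()        // gets a fresh ExprId
val aliasMap = AttributeMap(Seq(videoid -> "factorial(videoid#1)"))

// The same attribute referenced with a different qualifier still resolves, because the
// lookup goes through the ExprId rather than full object equality.
aliasMap.get(videoid.withQualifier(Seq("t")))                  // Some("factorial(videoid#1)")
```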

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
New unit tests were added in `TestPhysicalOperation` which reproduces the bug

Closes #25761 from nikitagkonda/SPARK-29029-use-attributemap-for-aliasmap-in-physicaloperation.

Authored-by: Nikita Konda <nikita.konda@workday.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-19 20:01:42 -08:00
Wenchen Fan 9e58b10c8e [SPARK-29945][SQL] do not handle negative sign specially in the parser
### What changes were proposed in this pull request?

Remove the special handling of the negative sign in the parser (interval literal and type constructor)

### Why are the changes needed?

The negative sign is an operator (UnaryMinus). We don't need to handle it specially, which amounts to doing constant folding on the parser side.

### Does this PR introduce any user-facing change?

The error message becomes a little different. Now it reports type mismatch for the `-` operator.

### How was this patch tested?

existing tests

Closes #26578 from cloud-fan/interval.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2019-11-20 11:08:04 +09:00
Maxim Gekk 40b8a08b8b [SPARK-29963][SQL][TESTS] Check formatting timestamps up to microsecond precision by JSON/CSV datasource
### What changes were proposed in this pull request?
In the PR, I propose to add tests from the commit 47cb1f359a for Spark 2.4 that check formatting of timestamp strings for various seconds fractions.

### Why are the changes needed?
To make sure that current behavior is the same as in Spark 2.4

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By running `CSVSuite`, `JsonFunctionsSuite` and `TimestampFormatterSuite`.

Closes #26601 from MaxGekk/format-timestamp-micros-tests.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-11-20 10:34:25 +09:00
Kent Yao 79ed4ae2db [SPARK-29926][SQL] Fix weird interval string whose value is only a dangling decimal point
### What changes were proposed in this pull request?

Currently, we support parsing '1. second' as 1s and even '. second' as 0s.

```sql
-- !query 118
select interval '1. seconds'
-- !query 118 schema
struct<1 seconds:interval>
-- !query 118 output
1 seconds

-- !query 119
select interval '. seconds'
-- !query 119 schema
struct<0 seconds:interval>
-- !query 119 output
0 seconds
```

```sql
postgres=# select interval '1. second';
ERROR:  invalid input syntax for type interval: "1. second"
LINE 1: select interval '1. second';

postgres=# select interval '. second';
ERROR:  invalid input syntax for type interval: ". second"
LINE 1: select interval '. second';
```
We fix this by fixing the new interval parser's VALUE_FRACTIONAL_PART state.

With further digging, we found that `1.` is valid in Python, R, Scala, Presto and so on... so this PR
ONLY forbids invalid interval values of the form '. seconds'.
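
For reference, a hedged spark-shell sketch of the intended post-fix behavior (the exact error message is not taken from the patch):

```scala
// '1.' as a numeric prefix stays valid, matching Python/R/Scala/Presto.
spark.sql("SELECT INTERVAL '1. seconds'").show()   // 1 seconds

// A bare dangling decimal point is no longer accepted.
spark.sql("SELECT INTERVAL '. seconds'")           // now fails to parse instead of yielding 0 seconds
```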

### Why are the changes needed?

bug fix

### Does this PR introduce any user-facing change?

yes, now we treat '. second' and the like as invalid intervals
### How was this patch tested?

add ut

Closes #26573 from yaooqinn/SPARK-29926.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-19 21:01:26 +08:00
Wenchen Fan 16134d6d0f [SPARK-29948][SQL] make the default alias consistent between date, timestamp and interval
### What changes were proposed in this pull request?

Update `Literal.sql` to make date, timestamp and interval consistent. They should all use the `TYPE 'value'` format.

### Why are the changes needed?

Make the default alias consistent. For example, without this patch we will see
```
scala> sql("select interval '1 day', date '2000-10-10'").show
+------+-----------------+
|1 days|DATE '2000-10-10'|
+------+-----------------+
|1 days|       2000-10-10|
+------+-----------------+
```

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

existing tests

Closes #26579 from cloud-fan/sql.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-19 15:37:35 +08:00
Terry Kim 3d45779b68 [SPARK-29728][SQL] Datasource V2: Support ALTER TABLE RENAME TO
### What changes were proposed in this pull request?

This PR adds `ALTER TABLE a.b.c RENAME TO x.y.x` support for V2 catalogs.

### Why are the changes needed?

The current implementation doesn't support this command V2 catalogs.

### Does this PR introduce any user-facing change?

Yes, now the renaming table works for v2 catalogs:
```
scala> spark.sql("SHOW TABLES IN testcat.ns1.ns2").show
+---------+---------+
|namespace|tableName|
+---------+---------+
|  ns1.ns2|      old|
+---------+---------+

scala> spark.sql("ALTER TABLE testcat.ns1.ns2.old RENAME TO testcat.ns1.ns2.new").show

scala> spark.sql("SHOW TABLES IN testcat.ns1.ns2").show
+---------+---------+
|namespace|tableName|
+---------+---------+
|  ns1.ns2|      new|
+---------+---------+
```
### How was this patch tested?

Added unit tests.

Closes #26539 from imback82/rename_table.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-19 12:03:29 +08:00
HyukjinKwon 882f54b0a3 [SPARK-29870][SQL][FOLLOW-UP] Keep CalendarInterval's toString
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/26418, which removed `CalendarInterval`'s `toString` with unfinished changes.

### Why are the changes needed?

1. Ideally we should make each PR isolated and separate targeting one issue without touching unrelated codes.

2. There are some other places where the string formats were exposed to users. For example:

    ```scala
    scala> sql("select interval 1 days as a").selectExpr("to_csv(struct(a))").show()
    ```
    ```
    +--------------------------+
    |to_csv(named_struct(a, a))|
    +--------------------------+
    |      "CalendarInterval...|
    +--------------------------+
    ```

3.  Such fixes:

    ```diff
     private def writeMapData(
        map: MapData, mapType: MapType, fieldWriter: ValueWriter): Unit = {
      val keyArray = map.keyArray()
    + val keyString = mapType.keyType match {
    +   case CalendarIntervalType =>
    +    (i: Int) => IntervalUtils.toMultiUnitsString(keyArray.getInterval(i))
    +   case _ => (i: Int) => keyArray.get(i, mapType.keyType).toString
    + }
    ```

    can cause performance regression due to type dispatch for each map.

### Does this PR introduce any user-facing change?

Yes, see 2. case above.

### How was this patch tested?

Manually tested.

Closes #26572 from HyukjinKwon/SPARK-29783.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-11-19 09:11:41 +09:00
HyukjinKwon 8469614c05 [SPARK-25694][SQL][FOLLOW-UP] Move 'spark.sql.defaultUrlStreamHandlerFactory.enabled' into StaticSQLConf.scala
### What changes were proposed in this pull request?

This PR is a followup of https://github.com/apache/spark/pull/26530 and proposes to move the configuration `spark.sql.defaultUrlStreamHandlerFactory.enabled` to `StaticSQLConf.scala` for consistency.

### Why are the changes needed?

To put the similar configurations together and for readability.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manually tested as described in https://github.com/apache/spark/pull/26530.

Closes #26570 from HyukjinKwon/SPARK-25694.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-11-19 09:08:20 +09:00
Kent Yao ae6b711b26 [SPARK-29941][SQL] Add ansi type aliases for char and decimal
### What changes were proposed in this pull request?

Checked with SQL Standard and PostgreSQL

> CHAR is equivalent to CHARACTER. DEC is equivalent to DECIMAL. INT is equivalent to INTEGER. VARCHAR is equivalent to CHARACTER VARYING. ...

```sql
postgres=# select dec '1.0';
numeric
---------
1.0
(1 row)

postgres=# select CHARACTER '. second';
  bpchar
----------
 . second
(1 row)

postgres=# select CHAR '. second';
  bpchar
----------
 . second
(1 row)
```

### Why are the changes needed?

For better ansi support
### Does this PR introduce any user-facing change?

yes, we add `character` as an alias for char and `dec` as an alias for decimal

### How was this patch tested?

add ut

Closes #26574 from yaooqinn/SPARK-29941.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-18 23:30:31 +08:00
fuwhu c32e228689 [SPARK-29859][SQL] ALTER DATABASE (SET LOCATION) should look up catalog like v2 commands
### What changes were proposed in this pull request?
Add AlterNamespaceSetLocationStatement, AlterNamespaceSetLocation, AlterNamespaceSetLocationExec to make ALTER DATABASE (SET LOCATION) look up catalog like v2 commands.
And also refine the code of AlterNamespaceSetProperties, AlterNamespaceSetPropertiesExec, DescribeNamespace, DescribeNamespaceExec to use SupportsNamespaces instead of CatalogPlugin for catalog parameter.

### Why are the changes needed?
It's important to make all the commands have the same catalog/namespace resolution behavior, to avoid confusing end-users.

### Does this PR introduce any user-facing change?
Yes, add "ALTER NAMESPACE ... SET LOCATION" whose function is same as "ALTER DATABASE ... SET LOCATION" and "ALTER SCHEMA ... SET LOCATION".

### How was this patch tested?
New unit tests

Closes #26562 from fuwhu/SPARK-29859.

Authored-by: fuwhu <bestwwg@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-18 20:40:23 +08:00
Kent Yao 50f6d930da [SPARK-29870][SQL] Unify the logic of multi-units interval string to CalendarInterval
### What changes were proposed in this pull request?

We now have two different implementations for converting multi-unit interval strings to CalendarInterval values.

One is used to convert interval string literals to CalendarInterval. This approach re-delegates the interval string to the Spark parser, which handles the string as `singleInterval` -> `multiUnitsInterval` and eventually calls `IntervalUtils.fromUnitStrings`.

The other is used in `Cast`, which eventually calls `IntervalUtils.stringToInterval`. This approach is ~10 times faster than the other.

We should unify these two for better performance and simpler logic. This pr uses the 2nd approach.

### Why are the changes needed?

We should unify these two for better performance and simpler logic.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

we shall not fail on existing uts

Closes #26491 from yaooqinn/SPARK-29870.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-18 15:50:06 +08:00
Kent Yao 5cebe587c7 [SPARK-29783][SQL] Support SQL Standard/ISO_8601 output style for interval type
### What changes were proposed in this pull request?

Add 3 interval output styles, named `SQL_STANDARD`, `ISO_8601`, and `MULTI_UNITS`, and a new conf `spark.sql.dialect.intervalOutputStyle` to select among them. The `MULTI_UNITS` style keeps the former display behavior and is the default. The newly added `SQL_STANDARD` and `ISO_8601` styles can be found in the following table (a usage sketch follows the table).

Style | conf | Year-Month Interval | Day-Time Interval | Mixed Interval
-- | -- | -- | -- | --
Format With Time Unit Designators | MULTI_UNITS | 1 year 2 mons | 1 days 2 hours 3 minutes 4.123456 seconds | interval 1 days 2 hours 3 minutes 4.123456 seconds
SQL STANDARD  | SQL_STANDARD | 1-2 | 3 4:05:06 | -1-2 3 -4:05:06
ISO8601 Basic Format| ISO_8601| P1Y2M| P3DT4H5M6S|P-1Y-2M3D-4H-5M-6S
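
A hedged usage sketch (conf name and style values are from the table above; the outputs in comments are expectations, not captured runs):

```scala
spark.conf.set("spark.sql.dialect.intervalOutputStyle", "SQL_STANDARD")
spark.sql("SELECT INTERVAL '1 year 2 months'").show()                      // expected: 1-2

spark.conf.set("spark.sql.dialect.intervalOutputStyle", "ISO_8601")
spark.sql("SELECT INTERVAL '3 days 4 hours 5 minutes 6 seconds'").show()   // expected: P3DT4H5M6S
```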

### Why are the changes needed?

for ANSI SQL support
### Does this PR introduce any user-facing change?

yes, interval output now has 3 styles
### How was this patch tested?

add new unit tests

cc cloud-fan maropu MaxGekk HyukjinKwon thanks.

Closes #26418 from yaooqinn/SPARK-29783.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-18 15:42:22 +08:00
gschiavon 73912379d0 [SPARK-29020][SQL] Improving array_sort behaviour
### What changes were proposed in this pull request?
I've noticed that there are two functions to sort arrays: sort_array and array_sort.

sort_array is from 1.5.0 and it has the possibility of ordering both ascending and descending

array_sort is from 2.4.0 and it only supports ordering in ascending order.

Basically I just added the possibility of ordering either ascending or descending using array_sort.

I think it would be good to have unified behaviours and not have to use sort_array when you want to order in descending order.
Imagine that you are new to Spark: you'd want to be able to sort arrays using the newest Spark functions.
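
A hedged sketch of the kind of usage this enables; the comparator-lambda form for descending order is an assumption about the final API, not something stated above:

```scala
// Ascending order: the existing array_sort behavior.
spark.sql("SELECT array_sort(array(3, 1, 2))").show()

// Descending order, assuming a comparator lambda is accepted as a second argument.
spark.sql(
  "SELECT array_sort(array(3, 1, 2), (l, r) -> CASE WHEN l < r THEN 1 WHEN l > r THEN -1 ELSE 0 END)"
).show()
```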

### Why are the changes needed?
Basically to be able to sort the array in descending order using *array_sort* instead of using *sort_array* from 1.5.0

### Does this PR introduce any user-facing change?
Yes, now you are able to sort the array in descending order. Note that it has the same behaviour with nulls as sort_array.

### How was this patch tested?
Test's added

This is the link to the [jira](https://issues.apache.org/jira/browse/SPARK-29020)

Closes #25728 from Gschiavon/improving-array-sort.

Lead-authored-by: gschiavon <german.schiavon@lifullconnect.com>
Co-authored-by: Takuya UESHIN <ueshin@databricks.com>
Co-authored-by: gschiavon <Gschiavon@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-11-18 16:07:05 +09:00
xy_xin d83cacfcf5 [SPARK-29907][SQL] Move DELETE/UPDATE/MERGE relative rules to dmlStatementNoWith to support cte
### What changes were proposed in this pull request?

SPARK-27444 introduced `dmlStatementNoWith` so that any dml that needs cte support can leverage it. It would be better to move the DELETE/UPDATE/MERGE rules to `dmlStatementNoWith`.

### Why are the changes needed?
With this change, we can support syntax like "WITH t AS (SELECT) DELETE FROM xxx", and likewise for UPDATE/MERGE.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?

New cases added.

Closes #26536 from xianyinxin/SPARK-29907.

Authored-by: xy_xin <xianyin.xxy@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-18 11:48:56 +08:00
Maxim Gekk 5eb8973f87 [SPARK-29930][SQL] Remove SQL configs declared to be removed in Spark 3.0
### What changes were proposed in this pull request?
In the PR, I propose to remove the following SQL configs:
1. `spark.sql.fromJsonForceNullableSchema`
2. `spark.sql.legacy.compareDateTimestampInTimestamp`
3. `spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation`

that are declared to be removed in Spark 3.0

### Why are the changes needed?
To make code cleaner and improve maintainability.

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
By `TypeCoercionSuite`, `JsonExpressionsSuite` and `DDLSuite`.

Closes #26559 from MaxGekk/remove-sql-configs.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-17 10:14:04 -08:00
fuwhu 388a737b98 [SPARK-29858][SQL] ALTER DATABASE (SET DBPROPERTIES) should look up catalog like v2 commands
### What changes were proposed in this pull request?
Add AlterNamespaceSetPropertiesStatement, AlterNamespaceSetProperties and AlterNamespaceSetPropertiesExec to make ALTER DATABASE (SET DBPROPERTIES) command look up catalog like v2 commands.

### Why are the changes needed?
It's important to make all the commands have the same catalog/namespace resolution behavior, to avoid confusing end-users.

### Does this PR introduce any user-facing change?
Yes, add "ALTER NAMESPACE ... SET (DBPROPERTIES | PROPERTIES) ..." whose function is same as "ALTER DATABASE ... SET DBPROPERTIES ..." and "ALTER SCHEMA ... SET DBPROPERTIES ...".

### How was this patch tested?
New unit test

Closes #26551 from fuwhu/SPARK-29858.

Authored-by: fuwhu <bestwwg@163.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-16 19:50:02 -08:00
Maxim Gekk e88267cb5a [SPARK-29928][SQL][TESTS] Check parsing timestamps up to microsecond precision by JSON/CSV datasource
### What changes were proposed in this pull request?
In the PR, I propose to add tests from the commit 9c7e8be1dc for Spark 2.4 that check parsing of timestamp strings for various seconds fractions.

### Why are the changes needed?
To make sure that current behavior is the same as in Spark 2.4

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By running `CSVSuite`, `JsonFunctionsSuite` and `TimestampFormatterSuite`.

Closes #26558 from MaxGekk/parse-timestamp-micros-tests.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-16 18:01:25 -08:00
Yuanjian Li 40ea4a11d7 [SPARK-29807][SQL] Rename "spark.sql.ansi.enabled" to "spark.sql.dialect.spark.ansi.enabled"
### What changes were proposed in this pull request?
Rename config "spark.sql.ansi.enabled" to "spark.sql.dialect.spark.ansi.enabled"

### Why are the changes needed?
The relation between "spark.sql.ansi.enabled" and "spark.sql.dialect" is confusing, since the "PostgreSQL" dialect should contain the features of "spark.sql.ansi.enabled".

To make things clearer, we can rename the "spark.sql.ansi.enabled" to "spark.sql.dialect.spark.ansi.enabled", thus the option "spark.sql.dialect.spark.ansi.enabled" is only for Spark dialect.

For casting and arithmetic operations, runtime exceptions should be thrown if "spark.sql.dialect" is "spark" and "spark.sql.dialect.spark.ansi.enabled" is true, or if "spark.sql.dialect" is PostgreSQL.

### Does this PR introduce any user-facing change?
Yes, the config name changed.

### How was this patch tested?
Existing UT.

Closes #26444 from xuanyuanking/SPARK-29807.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-16 17:46:39 +08:00
Dongjoon Hyun f77c10de38 [SPARK-29923][SQL][TESTS] Set io.netty.tryReflectionSetAccessible for Arrow on JDK9+
### What changes were proposed in this pull request?

This PR aims to add `io.netty.tryReflectionSetAccessible=true` to the testing configuration for JDK11 because this is an officially documented requirement of Apache Arrow.

Apache Arrow community documented this requirement at `0.15.0` ([ARROW-6206](https://github.com/apache/arrow/pull/5078)).
> #### For java 9 or later, should set "-Dio.netty.tryReflectionSetAccessible=true".
> This fixes `java.lang.UnsupportedOperationException: sun.misc.Unsafe or java.nio.DirectByteBuffer.(long, int) not available`. thrown by netty.

### Why are the changes needed?

After ARROW-3191, the Arrow Java library requires the property `io.netty.tryReflectionSetAccessible` to be set to true for JDK >= 9. After https://github.com/apache/spark/pull/26133, the JDK11 Jenkins jobs seem to fail.

- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-jdk-11/676/
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-jdk-11/677/
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-jdk-11/678/

```scala
Previous exception in task:
sun.misc.Unsafe or java.nio.DirectByteBuffer.<init>(long, int) not available
io.netty.util.internal.PlatformDependent.directBuffer(PlatformDependent.java:473)
io.netty.buffer.NettyArrowBuf.getDirectBuffer(NettyArrowBuf.java:243)
io.netty.buffer.NettyArrowBuf.nioBuffer(NettyArrowBuf.java:233)
io.netty.buffer.ArrowBuf.nioBuffer(ArrowBuf.java:245)
org.apache.arrow.vector.ipc.message.ArrowRecordBatch.computeBodyLength(ArrowRecordBatch.java:222)
```

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the Jenkins with JDK11.

Closes #26552 from dongjoon-hyun/SPARK-ARROW-JDK11.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-15 23:58:15 -08:00
Takeshi Yamamuro 6d6b233791 [SPARK-29343][SQL][FOLLOW-UP] Remove floating-point Sum/Average/CentralMomentAgg from order-insensitive aggregates
### What changes were proposed in this pull request?

This pr is to remove floating-point `Sum/Average/CentralMomentAgg` from order-insensitive aggregates in `EliminateSorts`.

This pr comes from the gatorsmile suggestion: https://github.com/apache/spark/pull/26011#discussion_r344583899

### Why are the changes needed?

Bug fix.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added tests in `SubquerySuite`.

Closes #26534 from maropu/SPARK-29343-FOLLOWUP.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-15 18:54:02 -08:00
fuwhu 16e7195299 [SPARK-29834][SQL] DESC DATABASE should look up catalog like v2 commands
### What changes were proposed in this pull request?
Add DescribeNamespaceStatement, DescribeNamespace and DescribeNamespaceExec
to make "DESC DATABASE" look up catalog like v2 commands.

### Why are the changes needed?
It's important to make all the commands have the same catalog/namespace resolution behavior, to avoid confusing end-users.

### Does this PR introduce any user-facing change?
Yes, add "DESC NAMESPACE" whose function is same as "DESC DATABASE" and "DESC SCHEMA".

### How was this patch tested?
New unit test

Closes #26513 from fuwhu/SPARK-29834.

Authored-by: fuwhu <bestwwg@163.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-15 18:50:42 -08:00
Pablo Langa 848bdfa218 [SPARK-29829][SQL] SHOW TABLE EXTENDED should do multi-catalog resolution
### What changes were proposed in this pull request?

Add ShowTableStatement and make SHOW TABLE EXTENDED go through the same catalog/table resolution framework of v2 commands.

We don't have these methods in the catalog to implement a V2 command:

- catalog.getPartition
- catalog.getTempViewOrPermanentTableMetadata

### Why are the changes needed?

It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users, e.g.

```sql
USE my_catalog
DESC t // success and describe the table t from my_catalog
SHOW TABLE EXTENDED FROM LIKE 't' // report table not found as there is no table t in the session catalog
```

### Does this PR introduce any user-facing change?

Yes. When running SHOW TABLE EXTENDED Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog.

### How was this patch tested?

Unit tests.

Closes #26540 from planga82/feature/SPARK-29481_ShowTableExtended.

Authored-by: Pablo Langa <soypab@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-15 14:25:33 -08:00
Kent Yao 0c68578fa9 [SPARK-29888][SQL] new interval string parser shall handle numeric with only fractional part
### What changes were proposed in this pull request?

The current string-to-interval cast logic does not support values such as cast('.111 second' as interval), which fails in the SIGN state and returns null; actually it should be 00:00:00.111.
```sql
-- !query 63
select interval '.111 seconds'
-- !query 63 schema
struct<0.111 seconds:interval>
-- !query 63 output
0.111 seconds

-- !query 64
select cast('.111 seconds' as interval)
-- !query 64 schema
struct<CAST(.111 seconds AS INTERVAL):interval>
-- !query 64 output
NULL
```
### Why are the changes needed?

bug fix.
### Does this PR introduce any user-facing change?

no
### How was this patch tested?

add ut

Closes #26514 from yaooqinn/SPARK-29888.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-15 13:33:30 +08:00
Terry Kim e46e487b08 [SPARK-29682][SQL] Resolve conflicting attributes in Expand correctly
### What changes were proposed in this pull request?

This PR addresses issues where conflicting attributes in `Expand` are not correctly handled.

### Why are the changes needed?

```Scala
val numsDF = Seq(1, 2, 3, 4, 5, 6).toDF("nums")
val cubeDF = numsDF.cube("nums").agg(max(lit(0)).as("agcol"))
cubeDF.join(cubeDF, "nums").show
```
fails with the following exception:
```
org.apache.spark.sql.AnalysisException:
Failure when resolving conflicting references in Join:
'Join Inner
:- Aggregate [nums#38, spark_grouping_id#36], [nums#38, max(0) AS agcol#35]
:  +- Expand [List(nums#3, nums#37, 0), List(nums#3, null, 1)], [nums#3, nums#38, spark_grouping_id#36]
:     +- Project [nums#3, nums#3 AS nums#37]
:        +- Project [value#1 AS nums#3]
:           +- LocalRelation [value#1]
+- Aggregate [nums#38, spark_grouping_id#36], [nums#38, max(0) AS agcol#58]
   +- Expand [List(nums#3, nums#37, 0), List(nums#3, null, 1)], [nums#3, nums#38, spark_grouping_id#36]
                                                                         ^^^^^^^
      +- Project [nums#3, nums#3 AS nums#37]
         +- Project [value#1 AS nums#3]
            +- LocalRelation [value#1]

Conflicting attributes: nums#38
```
As you can see from the above plan, `nums#38`, the output of `Expand` on the right side of the `Join`, should have been handled to produce a new attribute. Since the conflict is not resolved in `Expand`, the failure happens upstream at `Aggregate`. This PR addresses handling conflicting attributes in `Expand`.

### Does this PR introduce any user-facing change?

Yes, the previous example now shows the following output:
```
+----+-----+-----+
|nums|agcol|agcol|
+----+-----+-----+
|   1|    0|    0|
|   6|    0|    0|
|   4|    0|    0|
|   2|    0|    0|
|   5|    0|    0|
|   3|    0|    0|
+----+-----+-----+
```
### How was this patch tested?

Added new unit test.

Closes #26441 from imback82/spark-29682.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-14 14:47:14 +08:00
wuyi fe1f456b20 [SPARK-29837][SQL] PostgreSQL dialect: cast to boolean
### What changes were proposed in this pull request?

Make SparkSQL's `cast to boolean` behavior consistent with PostgreSQL when
spark.sql.dialect is configured as PostgreSQL.

### Why are the changes needed?

SparkSQL and PostgreSQL have a lot of different cast behaviors between types by default. We should make SparkSQL's cast behavior consistent with PostgreSQL when `spark.sql.dialect` is configured as PostgreSQL.

### Does this PR introduce any user-facing change?

Yes. If user switches to PostgreSQL dialect now, they will

* get an exception if they input an invalid string, e.g. "erut", while they got `null` before;

* get an exception if they input `TimestampType`, `DateType`, `LongType`, `ShortType`, `ByteType`, `DecimalType`, `DoubleType`, `FloatType` values, while they got a `true` or `false` result before.

And here is the evidence for those unsupported types from PostgreSQL:

timestamp:
```
postgres=# select cast(cast('2019-11-11' as timestamp) as boolean);
ERROR:  cannot cast type timestamp without time zone to boolean
```

date:
```
postgres=# select cast(cast('2019-11-11' as date) as boolean);
ERROR:  cannot cast type date to boolean
```

bigint:
```
postgres=# select cast(cast('20191111' as bigint) as boolean);
ERROR:  cannot cast type bigint to boolean
```

smallint:
```
postgres=# select cast(cast(2019 as smallint) as boolean);
ERROR:  cannot cast type smallint to boolean
```

bytea:
```
postgres=# select cast(cast('2019' as bytea) as boolean);
ERROR:  cannot cast type bytea to boolean
```

decimal:
```
postgres=# select cast(cast('2019' as decimal) as boolean);
ERROR:  cannot cast type numeric to boolean
```

float:
```
postgres=# select cast(cast('2019' as float) as boolean);
ERROR:  cannot cast type double precision to boolean
```

### How was this patch tested?

Added and tested manually.

Closes #26463 from Ngone51/dev-postgre-cast2bool.

Authored-by: wuyi <ngone_5451@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-14 11:55:01 +08:00
Burak Yavuz 363af16c72 [SPARK-29568][SS] Stop existing running streams when a new stream is launched
### What changes were proposed in this pull request?

This PR adds a SQL Conf: `spark.sql.streaming.stopActiveRunOnRestart`. When this conf is `true` (by default it is), an already running stream will be stopped, if a new copy gets launched on the same checkpoint location.
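
A short hedged sketch of the knob (conf name from the description above):

```scala
// Default (true): launching a second run on the same checkpoint location stops the existing one.
// Opting out restores the old behavior, where the new run fails instead.
spark.conf.set("spark.sql.streaming.stopActiveRunOnRestart", "false")
```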

### Why are the changes needed?

In multi-tenant environments where you have multiple SparkSessions, you can accidentally start multiple copies of the same stream (i.e. streams using the same checkpoint location). This will cause all new instantiations of the new stream to fail. However, sometimes you may want to turn off the old stream, as the old stream may have turned into a zombie (you no longer have access to the query handle or SparkSession).

It would be nice to have a SQL flag that allows the stopping of the old stream for such zombie cases.

### Does this PR introduce any user-facing change?

Yes. Now by default, if you launch a new copy of an already running stream on a multi-tenant cluster, the existing stream will be stopped.

### How was this patch tested?

Unit tests in StreamingQueryManagerSuite

Closes #26225 from brkyvz/stopStream.

Lead-authored-by: Burak Yavuz <brkyvz@gmail.com>
Co-authored-by: Burak Yavuz <burak@databricks.com>
Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
2019-11-13 08:59:46 -08:00
Wenchen Fan 4dcbdcd265 [SPARK-29863][SQL] Rename EveryAgg/AnyAgg to BoolAnd/BoolOr
### What changes were proposed in this pull request?

rename EveryAgg/AnyAgg to BoolAnd/BoolOr

### Why are the changes needed?

Under ANSI mode, `every`, `any` and `some` are reserved keywords and can't be used as function names. `EveryAgg`/`AnyAgg` have several aliases and I think it's better not to pick reserved keywords as the primary names.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

existing tests

Closes #26486 from cloud-fan/naming.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-13 21:42:42 +08:00
Wenchen Fan 942753a44b [SPARK-29753][SQL] refine the default catalog config
### What changes were proposed in this pull request?

rename the config to address the comment: https://github.com/apache/spark/pull/24594#discussion_r285431212

Improve the config description and provide a default value to simplify the code.

### Why are the changes needed?

make the config more understandable.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

existing tests

Closes #26395 from cloud-fan/config.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-13 21:27:36 +08:00
xy_xin d7bdc6aa17 [SPARK-29835][SQL] Remove the unnecessary conversion from Statement to LogicalPlan for DELETE/UPDATE
### What changes were proposed in this pull request?

The current parse-and-analyze flow for DELETE is: 1) the SQL string is first parsed to `DeleteFromStatement`; 2) the `DeleteFromStatement` is converted to `DeleteFromTable`. However, the SQL string can be parsed to `DeleteFromTable` directly, so `DeleteFromStatement` seems redundant.

It is the same for UPDATE.

This pr removes the unnecessary `DeleteFromStatement` and `UpdateTableStatement`.

### Why are the changes needed?

This makes the code for DELETE and UPDATE cleaner and keeps it aligned with MERGE INTO.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing tests and new tests.

Closes #26464 from xianyinxin/SPARK-29835.

Authored-by: xy_xin <xianyin.xxy@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-13 20:53:12 +08:00
Terry Kim b5a2ed6a37 [SPARK-29851][SQL] V2 catalog: Change default behavior of dropping namespace to cascade
### What changes were proposed in this pull request?

Currently, `SupportsNamespaces.dropNamespace` drops a namespace only if it is empty. Thus, to implement a cascading drop, one needs to iterate over all objects (tables, views, etc.) within the namespace (including its sub-namespaces recursively) and drop them one by one. This can have a negative impact on performance when there are a large number of objects.

Instead, this PR proposes to change the default behavior of dropping a namespace to cascading such that implementing cascading/non-cascading drop is simpler without performance penalties.
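
From the SQL side, a cascading drop now maps naturally onto a single catalog call. A hedged sketch with made-up catalog and table names (whether the CASCADE clause is accepted here depends on the DROP NAMESPACE syntax supported by the parser at this point):
```sql
CREATE NAMESPACE testcat.ns;
CREATE TABLE testcat.ns.t (id BIGINT) USING foo;
-- The cascading drop can be delegated to the catalog's dropNamespace in one call
DROP NAMESPACE testcat.ns CASCADE;
```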

### Why are the changes needed?

The new behavior makes implementing cascading/non-cascading drop simple without performance penalties.

### Does this PR introduce any user-facing change?

Yes. The default behavior of `SupportsNamespaces.dropNamespace` is now cascading.

### How was this patch tested?

Added new unit tests.

Closes #26476 from imback82/drop_ns_cascade.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-13 17:06:27 +08:00
Kent Yao f926809a1f [SPARK-29390][SQL] Add the justify_days(), justify_hours() and justify_interval() functions
### What changes were proposed in this pull request?

Add 3 interval functions, justify_days, justify_hours and justify_interval, to justify interval values.

### Why are the changes needed?

For feature parity with postgres

add three interval functions to justify interval values.

Function | Return Type | Description | Example | Result
-- | -- | -- | -- | --
justify_days(interval) | interval | Adjust interval so 30-day time periods are represented as months | justify_days(interval '35 days') | 1 mon 5 days
justify_hours(interval) | interval | Adjust interval so 24-hour time periods are represented as days | justify_hours(interval '27 hours') | 1 day 03:00:00
justify_interval(interval) | interval | Adjust interval using justify_days and justify_hours, with additional sign adjustments | justify_interval(interval '1 mon -1 hour') | 29 days 23:00:00
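
A quick sketch of invoking these from spark-sql (the result comments mirror the PostgreSQL examples above; Spark's exact interval rendering may differ):
```sql
SELECT justify_days(interval '35 days');              -- 1 month 5 days
SELECT justify_hours(interval '27 hours');            -- 1 day 3 hours
SELECT justify_interval(interval '1 month -1 hour');  -- 29 days 23 hours
```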

### Does this PR introduce any user-facing change?

yes. new interval functions are added

### How was this patch tested?

add ut

Closes #26465 from yaooqinn/SPARK-29390.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2019-11-13 15:04:39 +09:00
HyukjinKwon 80fbc382a6 Revert "[SPARK-29462] The data type of "array()" should be array<null>"
This reverts commit 0dcd739534.
2019-11-13 13:12:20 +09:00
Wenchen Fan 030e5d987e [SPARK-29789][SQL] should not parse the bucket column name when creating v2 tables
### What changes were proposed in this pull request?

When creating v2 expressions, we have public Java APIs, as well as internal Scala APIs. All of these APIs take a string column name and parse it to `NamedReference`.

This is convenient for end-users, but not for internal development. For example, the query plan already contains the parsed partition/bucket column names, and it's tricky if we need to quote the names before creating v2 expressions.

This PR proposes to change the internal Scala APIs to take `NamedReference` directly, with a new method to create `NamedReference` with the exact name parts. The public Java APIs are not changed.

### Why are the changes needed?

fix a bug, and make it easier to create v2 expressions correctly in the future.

### Does this PR introduce any user-facing change?

yes, now v2 CREATE TABLE works as expected.

### How was this patch tested?

a new test

Closes #26425 from cloud-fan/extract.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Ryan Blue <blue@apache.org>
2019-11-12 12:25:45 -08:00
Kent Yao d99398e9f5 [SPARK-29855][SQL] typed literals with negative sign with proper result or exception
### What changes were proposed in this pull request?

```sql
-- !query 83
select -integer '7'
-- !query 83 schema
struct<7:int>
-- !query 83 output
7

-- !query 86
select -date '1999-01-01'
-- !query 86 schema
struct<DATE '1999-01-01':date>
-- !query 86 output
1999-01-01

-- !query 87
select -timestamp '1999-01-01'
-- !query 87 schema
struct<TIMESTAMP('1999-01-01 00:00:00'):timestamp>
-- !query 87 output
1999-01-01 00:00:00
```
The integer result should be -7, and the date and timestamp results are confusing; those cases should throw exceptions.

### Why are the changes needed?

bug fix
### Does this PR introduce any user-facing change?

NO
### How was this patch tested?

ADD UTs

Closes #26479 from yaooqinn/SPARK-29855.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-11-12 23:53:07 +09:00
Pablo Langa 37e387a22d [SPARK-29519][SQL] SHOW TBLPROPERTIES should do multi-catalog resolution
### What changes were proposed in this pull request?

Add ShowTablePropertiesStatement and make SHOW TBLPROPERTIES go through the same catalog/table resolution framework of v2 commands.

### Why are the changes needed?

It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.

```
USE my_catalog
DESC t // success and describe the table t from my_catalog
SHOW TBLPROPERTIES t // report table not found as there is no table t in the session catalog
```

### Does this PR introduce any user-facing change?

yes. When running SHOW TBLPROPERTIES Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog.

### How was this patch tested?

Unit tests.

Closes #26176 from planga82/feature/SPARK-29519_SHOW_TBLPROPERTIES_datasourceV2.

Authored-by: Pablo Langa <soypab@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-12 13:31:28 +08:00
Jungtaek Lim (HeartSaVioR) c941362cb9 [SPARK-26154][SS] Streaming left/right outer join should not return outer nulls for already matched rows
### What changes were proposed in this pull request?

This patch fixes the edge case of streaming left/right outer join described below:

Suppose query is provided as

`select * from A join B on A.id = B.id AND (A.ts <= B.ts AND B.ts <= A.ts + interval 5 seconds)`

and there are two rows, L1 (from A) and R1 (from B), which satisfy L1.id = R1.id and L1.ts = R1.ts.
(we can simply imagine it from self-join)

Then Spark processes L1 and R1 as below:

- row L1 and row R1 are joined at batch 1
- row R1 is evicted at batch 2 due to join and watermark condition, whereas row L1 is not evicted
- row L1 is evicted at batch 3 due to join and watermark condition

When determining outer rows to match with null, Spark applies some assumption commented in codebase, as below:

```
Checking whether the current row matches a key in the right side state, and that key
has any value which satisfies the filter function when joined. If it doesn't,
we know we can join with null, since there was never (including this batch) a match
within the watermark period. If it does, there must have been a match at some point, so
we know we can't join with null.
```

But as the edge case above shows, the assumption is not correct. Since there is no good optimization assumption that avoids the edge case, we have to track whether a row has been matched before, and emit the null-joined row only when the row has never been matched.

To track the matching of row, the patch adds a new state to streaming join state manager, and mark whether the row is matched to others or not. We leverage the information when dealing with eviction of rows which would be candidates to match with null rows.

This approach introduces a new state format which is not compatible with the old state format - queries with the old state format will keep running, but they will still have the issue and must discard their checkpoint and rerun to take advantage of this patch.

### Why are the changes needed?

This patch fixes a correctness issue.

### Does this PR introduce any user-facing change?

No from a compatibility viewpoint, but we'll encourage end users to discard the old checkpoint and rerun the query if they run a stream-stream outer join query with an old checkpoint, which might make the answer "yes".

### How was this patch tested?

Added UT which fails on current Spark and passes with this patch. Also passed existing streaming join UTs.

Closes #26108 from HeartSaVioR/SPARK-26154-shorten-alternative.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-11-11 15:47:17 -08:00
DB Tsai a6a2748585 [SPARK-29805][SQL] Enable nested schema pruning and nested pruning on expressions by default
### What changes were proposed in this pull request?
Enable nested schema pruning and nested pruning on expressions by default. We have been using those features in production at Apple for a couple of months with great success. For some jobs, we reduce the data read by more than 8x and run 21x faster in wall-clock time.
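
If a particular workload regresses, the features can be switched back off; the config names below are cited from memory and should be treated as assumptions:
```sql
-- Assumed config names; both default to true after this change
SET spark.sql.optimizer.nestedSchemaPruning.enabled=false;
SET spark.sql.optimizer.expression.nestedPruning.enabled=false;
```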

### Why are the changes needed?
Better performance.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #26443 from dbtsai/enableNestedSchemaPrunning.

Authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2019-11-11 19:11:05 +00:00
Kent Yao d06a9cc4bd [SPARK-29822][SQL] Fix cast error when there are white spaces between signs and values
### What changes were proposed in this pull request?

With the latest string-to-interval optimization https://github.com/apache/spark/pull/26256, some interval strings cannot be cast when there are spaces between the sign and the unit value. After state `PARSE_SIGN`, the parser goes directly to `PARSE_UNIT_VALUE` and treats a space character as the end of the token. So when white spaces appear before the actual unit value, parsing fails; we should add a new state like `TRIM_VALUE` to skip these spaces.

How to reproduce (against the revisions merged since https://github.com/apache/spark/pull/26256):

```sql
select cast(v as interval) from values ('+     1 second') t(v);
select cast(v as interval) from values ('-     1 second') t(v);
```

### Why are the changes needed?

bug fix
### Does this PR introduce any user-facing change?

no
### How was this patch tested?

1. ut
2. new benchmark test

Closes #26449 from yaooqinn/SPARK-29605.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-11 21:53:33 +08:00
lajin 4de7131cff [SPARK-29421][SQL] Supporting Create Table Like Using Provider
### What changes were proposed in this pull request?
Hive support STORED AS new file format syntax:
```sql
CREATE TABLE tbl(a int) STORED AS TEXTFILE;
CREATE TABLE tbl2 LIKE tbl STORED AS PARQUET;
```
We add a similar syntax for Spark. Here we separate to two features:

1. specify a different table provider in CREATE TABLE LIKE
2. Hive compatibility

In this PR, we address the first one:
- [ ] Using `USING provider` to specify a different table provider in CREATE TABLE LIKE.
- [  ] Using `STORED AS file_format` in CREATE TABLE LIKE to address Hive compatibility.

### Why are the changes needed?
The CREATE TABLE tb1 LIKE tb2 command creates an empty table tb1 based on the definition of table tb2. The most common use case is to create tb1 with the same schema as tb2. An inconvenience is that this command also copies the FileFormat from tb2; it cannot change the input/output format and serde. Adding the ability to change the file format is useful for scenarios like upgrading a table from a low-performance file format to a high-performance one (parquet, orc).

### Does this PR introduce any user-facing change?
Add a new syntax based on current CTL:
```sql
CREATE TABLE tbl2 LIKE tbl [USING parquet];
```

### How was this patch tested?
Modify some exist UTs.

Closes #26097 from LantaoJin/SPARK-29421.

Authored-by: lajin <lajin@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-11 15:25:56 +08:00
Maxim Gekk 18440151b0 [SPARK-29393][SQL] Add make_interval function
### What changes were proposed in this pull request?
In the PR, I propose new expression `MakeInterval` and register it as the function `make_interval`. The function accepts the following parameters:
- `years` - the number of years in the interval, positive or negative. The parameter is multiplied by 12, and added to interval's `months`.
- `months` - the number of months in the interval, positive or negative.
- `weeks` - the number of weeks in the interval, positive or negative. The parameter is multiplied by 7, and added to interval's `days`.
- `hours`, `mins` - the number of hours and minutes. The parameters can be negative or positive. They are converted to microseconds and added to interval's `microseconds`.
- `seconds` - the number of seconds with a fractional part up to microsecond precision. Like `hours` and `mins`, it is converted to microseconds and added to the interval's total `microseconds`.

For example:
```sql
spark-sql> select make_interval(2019, 11, 1, 1, 12, 30, 01.001001);
2019 years 11 months 8 days 12 hours 30 minutes 1.001001 seconds
```

### Why are the changes needed?
- To improve user experience with Spark SQL, and allow users making `INTERVAL` columns from other columns containing `years`, `months` ... `seconds`. Currently, users can make an `INTERVAL` column from other columns only by constructing a `STRING` column and cast it to `INTERVAL`. Have a look at the `IntervalBenchmark` as an example.
- To maintain feature parity with PostgreSQL which provides such function:
```sql
# SELECT make_interval(2019, 11);
   make_interval
--------------------
 2019 years 11 mons
```

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
- By new tests for the `MakeInterval` expression to `IntervalExpressionsSuite`
- By tests in `interval.sql`

Closes #26446 from MaxGekk/make_interval.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-10 14:34:52 -08:00
Maxim Gekk d4de01f567 [SPARK-29408][SQL] Support - before interval in interval literals
### What changes were proposed in this pull request?
- `SqlBase.g4` is modified to support a negative sign `-` in the interval type constructor from a string and in interval literals
- Negate the interval in `AstBuilder` if a negative sign is present.
- Interval related SQL statements are moved from `inputs/datetime.sql` to new file `inputs/interval.sql`

For example:
```sql
spark-sql> select -interval '-1 month 1 day -1 second';
1 months -1 days 1 seconds
spark-sql> select -interval -1 month 1 day -1 second;
1 months -1 days 1 seconds
```

### Why are the changes needed?
For feature parity with PostgreSQL which supports that:
```sql
# select -interval '-1 month 1 day -1 second';
        ?column?
-------------------------
 1 mon -1 days +00:00:01
(1 row)
```

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
- Added tests to `ExpressionParserSuite`
- by `interval.sql`

Closes #26438 from MaxGekk/negative-interval.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-10 10:10:04 -08:00
Maxim Gekk 7ddcb5b46d [SPARK-29819][SQL] Introduce an enum for interval units
### What changes were proposed in this pull request?
In the PR, I propose an enumeration for interval units with the value `YEAR`, `MONTH`, `WEEK`, `DAY`, `HOUR`, `MINUTE`, `SECOND`, `MILLISECOND`, `MICROSECOND` and `NANOSECOND`.

### Why are the changes needed?
- This should prevent typos in interval unit names
- Stronger type checking of unit parameters.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By existing test suites `ExpressionParserSuite` and `IntervalUtilsSuite`

Closes #26455 from MaxGekk/interval-unit-enum.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-10 08:41:55 -08:00
Huaxin Gao 57b954e825 [SPARK-29730][SQL] ALTER VIEW QUERY should look up catalog/table like v2 commands
### What changes were proposed in this pull request?

Add AlterViewAsStatement and make ALTER VIEW ... QUERY go through the same catalog/table resolution framework of v2 commands.

### Why are the changes needed?

It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.
```
USE my_catalog
DESC v // success and describe the view v from my_catalog
ALTER VIEW v SELECT 1 // report view not found as there is no view v in the session catalog
```

### Does this PR introduce any user-facing change?

Yes. When running ALTER VIEW ... QUERY, Spark fails the command if the current catalog is set to a v2 catalog, or the view name specified a v2 catalog.

### How was this patch tested?

Unit tests.

Closes #26453 from huaxingao/spark-29730.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-09 17:06:09 -08:00
Xiao Li 1e2d76e80a [HOT-FIX] Fix the SQLBase.g4
### What changes were proposed in this pull request?
Remove the duplicate code

See the build failure: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-3.2/986/

### Why are the changes needed?
Fix the compilation

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
The existing tests

Closes #26445 from gatorsmile/hotfixPraser.

Authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2019-11-08 22:39:07 -08:00
xy_xin 7cfd589868 [SPARK-28893][SQL] Support MERGE INTO in the parser and add the corresponding logical plan
### What changes were proposed in this pull request?
This PR supports MERGE INTO in the parser and add the corresponding logical plan. The SQL syntax likes,
```
MERGE INTO [ds_catalog.][multi_part_namespaces.]target_table [AS target_alias]
USING [ds_catalog.][multi_part_namespaces.]source_table | subquery [AS source_alias]
ON <merge_condition>
[ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
[ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
[ WHEN NOT MATCHED [ AND <condition> ]  THEN <not_matched_action> ]
```
where
```
<matched_action>  =
  DELETE  |
  UPDATE SET *  |
  UPDATE SET column1 = value1 [, column2 = value2 ...]

<not_matched_action>  =
  INSERT *  |
  INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 ...])
```
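
A concrete statement following this grammar might look like the following (table and column names are made up for illustration):
```sql
MERGE INTO target AS t
USING source AS s
ON t.id = s.id
WHEN MATCHED AND s.op = 'delete' THEN DELETE
WHEN MATCHED THEN UPDATE SET t.value = s.value
WHEN NOT MATCHED THEN INSERT (id, value) VALUES (s.id, s.value)
```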

### Why are the changes needed?
This is initial work to introduce `MERGE INTO` support for the built-in datasource, and design work for `MERGE INTO` support in DSv2.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
New test cases.

Closes #26167 from xianyinxin/SPARK-28893.

Authored-by: xy_xin <xianyin.xxy@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-09 11:45:24 +08:00
Liang-Chi Hsieh 70987d8144 [SPARK-29680][SQL][FOLLOWUP] Replace qualifiedName with multipartIdentifier
### What changes were proposed in this pull request?

Replace qualifiedName with multipartIdentifier in parser rules of DDL commands.

### Why are the changes needed?

In some DDL rules, identifiers are still parsed with `qualifiedName`. We should use `multipartIdentifier` instead because it can catch invalid identifiers such as `test-table` and `test-col`.

### Does this PR introduce any user-facing change?

Yes. Invalid identifiers such as `test-table` will now be caught after this change.

### How was this patch tested?

Unit tests.

Closes #26419 from viirya/SPARK-29680-followup2.

Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
2019-11-08 14:18:06 -08:00
Kent Yao e026412d9c [SPARK-29679][SQL] Make interval type comparable and orderable
### What changes were proposed in this pull request?

The interval type now supports >, >=, <, <=, =, <=>, ORDER BY, min, max, etc., as sketched below.
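
A hedged sketch of the newly supported operations:
```sql
SELECT interval '1 day' > interval '23 hours';  -- true
SELECT v FROM VALUES (interval '1 day'), (interval '2 hours') AS t(v) ORDER BY v;
SELECT min(v), max(v) FROM VALUES (interval '1 day'), (interval '2 hours') AS t(v);
```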

### Why are the changes needed?

Part of SPARK-27764 Feature Parity between PostgreSQL and Spark
### Does this PR introduce any user-facing change?

Yes, we now support comparing intervals.

### How was this patch tested?

add ut

Closes #26337 from yaooqinn/SPARK-29679.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-08 22:45:11 +08:00
Kent Yao e7f7990bc3 [SPARK-29688][SQL] Support average for interval type values
### What changes were proposed in this pull request?

The avg aggregate now supports interval type values (see the sketch below).
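
A hedged sketch (output rendering is approximate):
```sql
SELECT avg(v) FROM VALUES (interval '1 day'), (interval '3 days') AS t(v);
-- 2 days
```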

### Why are the changes needed?

Part of SPARK-27764 Feature Parity between PostgreSQL and Spark

### Does this PR introduce any user-facing change?

yes, we can do avg on intervals

### How was this patch tested?

add ut

Closes #26347 from yaooqinn/SPARK-29688.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-08 21:55:07 +08:00
davidvrba afc943ff8a [SPARK-28477][SQL] Rewrite CaseWhen with single branch to If
### What changes were proposed in this pull request?
Spark's org.apache.spark.sql.functions does not have an `if` function, so conditions are expressed using `when-otherwise`. However, `If` (which is available in SQL) has more efficient codegen. This PR rewrites `when-otherwise` conditions to `If` where possible (`when-otherwise` with a single branch).

### Why are the changes needed?
It is an optimization enhancement. Here is a simple performance comparison (tested in local mode (with 4 cores)):
```
val df = spark.range(10000000000L).withColumn("x", rand)
val resultA = df.withColumn("r", when($"x" < 0.5, lit(1)).otherwise(lit(0))).agg(sum($"r"))
val resultB = df.withColumn("r", expr("if(x < 0.5, 1, 0)")).agg(sum($"r"))

resultA.collect() // takes 56s to finish
resultB.collect() // takes 30s to finish
```
### Does this PR introduce any user-facing change?
No

### How was this patch tested?
New test is added.

Closes #26294 from davidvrba/spark-28477_rewriteCaseWhenToIf.

Authored-by: davidvrba <vrba.dave@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-08 21:25:48 +08:00
Kent Yao 0a03839366 [SPARK-29787][SQL] Move methods add/subtract/negate from CalendarInterval to IntervalUtils
### What changes were proposed in this pull request?

Move method add/subtract/negate from CalendarInterval to IntervalUtils

### Why are the changes needed?

https://github.com/apache/spark/pull/26410#discussion_r343125468 suggested here
### Does this PR introduce any user-facing change?

no
### How was this patch tested?

add uts and move some

Closes #26423 from yaooqinn/SPARK-29787.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-08 10:28:58 +08:00
Kent Yao 9562b26914 [SPARK-29757][SQL] Move calendar interval constants together
### What changes were proposed in this pull request?
```java
  public static final int YEARS_PER_DECADE = 10;
  public static final int YEARS_PER_CENTURY = 100;
  public static final int YEARS_PER_MILLENNIUM = 1000;

  public static final byte MONTHS_PER_QUARTER = 3;
  public static final int MONTHS_PER_YEAR = 12;

  public static final byte DAYS_PER_WEEK = 7;
  public static final long DAYS_PER_MONTH = 30L;

  public static final long HOURS_PER_DAY = 24L;

  public static final long MINUTES_PER_HOUR = 60L;

  public static final long SECONDS_PER_MINUTE = 60L;
  public static final long SECONDS_PER_HOUR = MINUTES_PER_HOUR * SECONDS_PER_MINUTE;
  public static final long SECONDS_PER_DAY = HOURS_PER_DAY * SECONDS_PER_HOUR;

  public static final long MILLIS_PER_SECOND = 1000L;
  public static final long MILLIS_PER_MINUTE = SECONDS_PER_MINUTE * MILLIS_PER_SECOND;
  public static final long MILLIS_PER_HOUR = MINUTES_PER_HOUR * MILLIS_PER_MINUTE;
  public static final long MILLIS_PER_DAY = HOURS_PER_DAY * MILLIS_PER_HOUR;

  public static final long MICROS_PER_MILLIS = 1000L;
  public static final long MICROS_PER_SECOND = MILLIS_PER_SECOND * MICROS_PER_MILLIS;
  public static final long MICROS_PER_MINUTE = SECONDS_PER_MINUTE * MICROS_PER_SECOND;
  public static final long MICROS_PER_HOUR = MINUTES_PER_HOUR * MICROS_PER_MINUTE;
  public static final long MICROS_PER_DAY = HOURS_PER_DAY * MICROS_PER_HOUR;
  public static final long MICROS_PER_MONTH = DAYS_PER_MONTH * MICROS_PER_DAY;
  /* 365.25 days per year assumes leap year every four years */
  public static final long MICROS_PER_YEAR = (36525L * MICROS_PER_DAY) / 100;

  public static final long NANOS_PER_MICROS = 1000L;
  public static final long NANOS_PER_MILLIS = MICROS_PER_MILLIS * NANOS_PER_MICROS;
  public static final long NANOS_PER_SECOND = MILLIS_PER_SECOND * NANOS_PER_MILLIS;
```
The above parameters are defined in IntervalUtils, DateTimeUtils, and CalendarInterval, some of them are redundant, some of them are cross-referenced.

### Why are the changes needed?
To simplify code, enhance consistency and reduce risks

### Does this PR introduce any user-facing change?

no
### How was this patch tested?

modified uts

Closes #26399 from yaooqinn/SPARK-29757.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-07 19:48:19 +08:00
Wenchen Fan 9b61f90987 [SPARK-29761][SQL] do not output leading 'interval' in CalendarInterval.toString
### What changes were proposed in this pull request?

remove the leading "interval" in `CalendarInterval.toString`.

### Why are the changes needed?

Although the "interval" prefix is allowed when casting a string to an interval, it's not recommended.

This is also consistent with pgsql:
```
cloud0fan=# select interval '1' day;
 interval
----------
 1 day
(1 row)
```

### Does this PR introduce any user-facing change?

yes, when display a dataframe with interval type column, the result is different.

### How was this patch tested?

updated tests.

Closes #26401 from cloud-fan/interval.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-07 15:44:50 +08:00
Maxim Gekk 29dc59ac29 [SPARK-29605][SQL] Optimize string to interval casting
### What changes were proposed in this pull request?
In the PR, I propose new function `stringToInterval()` in `IntervalUtils` for converting `UTF8String` to `CalendarInterval`. The function is used in casting a `STRING` column to an `INTERVAL` column.
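The optimized path is exercised by an ordinary string-to-interval cast, for example (an illustrative query, not taken from the PR itself):
```sql
-- Casts the string literal to a CalendarInterval of 1 day, 2 hours and 3.5 seconds
SELECT CAST('1 day 2 hours 3.5 seconds' AS interval);
```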

### Why are the changes needed?
The proposed implementation is ~10 times faster. For example, parsing 9 interval units on JDK 8:
Before:
```
9 units w/ interval                               14004          14125         116          0.1       14003.6       0.0X
9 units w/o interval                              13785          14056         290          0.1       13784.9       0.0X
```
After:
```
9 units w/ interval                                1343           1344           1          0.7        1343.0       0.3X
9 units w/o interval                               1345           1349           8          0.7        1344.6       0.3X
```

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
- By new tests for `stringToInterval` in `IntervalUtilsSuite`
- By existing tests

Closes #26256 from MaxGekk/string-to-interval.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-07 12:39:52 +08:00
Kent Yao 3437862975 [SPARK-29387][SQL][FOLLOWUP] Fix issues of the multiply and divide for intervals
### What changes were proposed in this pull request?

Handle the inconsistent division-by-zero behavior between literals and columns, and fix the null issue too.

### Why are the changes needed?
BUG FIX
### 1. Handle the inconsistent division-by-zero behavior between literals and columns
```sql
-- !query 24
select
    k,
    v,
    cast(k as interval) / v,
    cast(k as interval) * v
from VALUES
     ('1 seconds', 1),
     ('2 seconds', 0),
     ('3 seconds', null),
     (null, null),
     (null, 0) t(k, v)
-- !query 24 schema
struct<k:string,v:int,divide_interval(CAST(k AS INTERVAL), CAST(v AS DOUBLE)):interval,multiply_interval(CAST(k AS INTERVAL), CAST(v AS DOUBLE)):interval>
-- !query 24 output
1 seconds   1   interval 1 seconds  interval 1 seconds
2 seconds   0   interval 0 microseconds interval 0 microseconds
3 seconds   NULL    NULL    NULL
NULL    0   NULL    NULL
NULL    NULL    NULL    NULL
```
```sql
-- !query 21
select interval '1 year 2 month' / 0
-- !query 21 schema
struct<divide_interval(interval 1 years 2 months, CAST(0 AS DOUBLE)):interval>
-- !query 21 output
NULL
```

In the first case, interval '2 seconds' / 0 produces `interval 0 microseconds`; in the second case, the result is `null`. The two should be consistent.

### 2. Null literal issues

```sql

  -- !query 20
select interval '1 year 2 month' / null
-- !query 20 schema
struct<>
-- !query 20 output
org.apache.spark.sql.AnalysisException
cannot resolve '(interval 1 years 2 months / NULL)' due to data type mismatch: differing types in '(interval 1 years 2 months / NULL)' (interval and null).; line 1 pos 7

-- !query 22
select interval '4 months 2 weeks 6 days' * null
-- !query 22 schema
struct<>
-- !query 22 output
org.apache.spark.sql.AnalysisException
cannot resolve '(interval 4 months 20 days * NULL)' due to data type mismatch: differing types in '(interval 4 months 20 days * NULL)' (interval and null).; line 1 pos 7

-- !query 23
select null * interval '4 months 2 weeks 6 days'
-- !query 23 schema
struct<>
-- !query 23 output
org.apache.spark.sql.AnalysisException
cannot resolve '(NULL * interval 4 months 20 days)' due to data type mismatch: differing types in '(NULL * interval 4 months 20 days)' (null and interval).; line 1 pos 7
```
Dividing or multiplying by null literals raises an error, whereas a null value in a column is fine, as in the first case.
### Does this PR introduce any user-facing change?

No (arguably yes), but it is just a follow-up.

### How was this patch tested?

add uts

cc cloud-fan MaxGekk maropu

Closes #26410 from yaooqinn/SPARK-29387.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-07 12:19:03 +08:00
Aman Omer 0dcd739534 [SPARK-29462] The data type of "array()" should be array<null>
### What changes were proposed in this pull request?
During creation of an array, if `CreateArray` does not get any children from which to set the array's data type, it creates an array of null type.

### Why are the changes needed?
When an empty array is created, it should be declared as array<null>.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Tested manually

Closes #26324 from amanomer/29462.

Authored-by: Aman Omer <amanomer1996@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-11-06 18:39:46 +09:00
Liang-Chi Hsieh 6233958ab6 [SPARK-29680][SQL] Remove ALTER TABLE CHANGE COLUMN syntax
### What changes were proposed in this pull request?

This patch removes v1 ALTER TABLE CHANGE COLUMN syntax.

### Why are the changes needed?

Since in v2 we have ALTER TABLE CHANGE COLUMN and ALTER TABLE RENAME COLUMN, this old syntax is not necessary now and can be confusing.

The v2 ALTER TABLE CHANGE COLUMN should fallback to v1 AlterTableChangeColumnCommand (#26354).

### Does this PR introduce any user-facing change?

Yes, the old v1 ALTER TABLE CHANGE COLUMN syntax is removed.

### How was this patch tested?

Unit tests.

Closes #26338 from viirya/SPARK-29680.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-06 10:42:44 +08:00
Maxim Gekk 4c53ac1822 [SPARK-29387][SQL] Support * and / operators for intervals
### What changes were proposed in this pull request?
Added new expressions `MultiplyInterval` and `DivideInterval` to multiply/divide an interval by a numeric. Updated `TypeCoercion.DateTimeOperations` to turn the `Multiply`/`Divide` expressions of `CalendarIntervalType` and `NumericType` to `MultiplyInterval`/`DivideInterval`.

To support new operations, added new methods `multiply()` and `divide()` to `CalendarInterval`.

### Why are the changes needed?
- To maintain feature parity with PostgreSQL which supports multiplication and division of intervals by doubles:
```sql
# select interval '1 hour' / double precision '1.5';
 ?column?
----------
 00:40:00
```
- To conform the SQL standard which defines those operations: `numeric * interval`, `interval * numeric` and `interval / numeric`. See [4.5.3  Operations involving datetimes and intervals](http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt).
- Improve Spark SQL UX and allow users to adjust interval columns. For example:
```sql
spark-sql> select (timestamp'now' - timestamp'yesterday') * 1.3;
interval 2 days 10 hours 39 minutes 38 seconds 568 milliseconds 900 microseconds
```

### Does this PR introduce any user-facing change?
Yes, previously the following query fails with the error:
```sql
spark-sql> select interval 1 hour 30 minutes * 1.5;
Error in query: cannot resolve '(interval 1 hours 30 minutes * 1.5BD)' due to data type mismatch: differing types in '(interval 1 hours 30 minutes * 1.5BD)' (interval and decimal(2,1)).; line 1 pos 7;
```
After:
```sql
spark-sql> select interval 1 hour 30 minutes * 1.5;
interval 2 hours 15 minutes
```

### How was this patch tested?
- Added tests for the `multiply()` and `divide()` methods to `CalendarIntervalSuite.java`
- New test suite `IntervalExpressionsSuite`
- by tests for `Multiply` -> `MultiplyInterval` and `Divide` -> `DivideInterval` in `TypeCoercionSuite`
- updated `datetime.sql`

Closes #26132 from MaxGekk/interval-mul-div.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-06 00:37:43 +08:00
Huaxin Gao 02eecfec99 [SPARK-29695][SQL] ALTER TABLE (SerDe properties) should look up catalog/table like v2 commands
### What changes were proposed in this pull request?
Add AlterTableSerDePropertiesStatement and make ALTER TABLE ... SET SERDE/SERDEPROPERTIES go through the same catalog/table resolution framework of v2 commands.

### Why are the changes needed?
It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.
```
USE my_catalog
DESC t // success and describe the table t from my_catalog
ALTER TABLE t SET SERDE 'org.apache.class' // report table not found as there is no table t in the session catalog
```

### Does this PR introduce any user-facing change?
Yes. When running ALTER TABLE ... SET SERDE/SERDEPROPERTIES, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog.

### How was this patch tested?
Unit tests.

Closes #26374 from huaxingao/spark_29695.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-04 21:42:39 -08:00
Terry Kim bc65c54f6b [SPARK-29734][SQL] Datasource V2: Support SHOW CURRENT NAMESPACE
### What changes were proposed in this pull request?

This PR introduces a new SQL command: `SHOW CURRENT NAMESPACE`.

### Why are the changes needed?

Datasource V2 supports multiple catalogs/namespaces and having `SHOW CURRENT NAMESPACE` to retrieve the current catalog/namespace info would be useful.

### Does this PR introduce any user-facing change?

Yes, the user can perform the following:
```
scala> spark.sql("SHOW CURRENT NAMESPACE").show
+-------------+---------+
|      catalog|namespace|
+-------------+---------+
|spark_catalog|  default|
+-------------+---------+

scala> spark.sql("USE testcat.ns1.ns2").show
scala> spark.sql("SHOW CURRENT NAMESPACE").show
+-------+---------+
|catalog|namespace|
+-------+---------+
|testcat|  ns1.ns2|
+-------+---------+
```

### How was this patch tested?

Added unit tests.

Closes #26379 from imback82/show_current_catalog.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-04 18:05:10 -08:00
Kent Yao 44b8fbcc58 [SPARK-29663][SQL] Support sum with interval type values
### What changes were proposed in this pull request?

The sum aggregate now supports interval values (see the sketch below).
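
A hedged sketch (output rendering is approximate):
```sql
SELECT sum(v) FROM VALUES (interval '1 hour'), (interval '2 hours') AS t(v);
-- 3 hours
```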

### Why are the changes needed?

Part of SPARK-27764 Feature Parity between PostgreSQL and Spark

### Does this PR introduce any user-facing change?

yes, sum can evaluate intervals
### How was this patch tested?

add ut

Closes #26325 from yaooqinn/SPARK-29663.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-05 01:05:07 +08:00
Terry Kim d4ea211187 [SPARK-29678][SQL] ALTER TABLE (ADD PARTITION) should look up catalog/table like v2 commands
### What changes were proposed in this pull request?

Add AlterTableAddPartitionStatement and make ALTER TABLE ... ADD PARTITION go through the same catalog/table resolution framework of v2 commands.

### Why are the changes needed?

It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.
```
USE my_catalog
DESC t // success and describe the table t from my_catalog
ALTER TABLE t ADD PARTITION (id=1) // report table not found as there is no table t in the session catalog
```

### Does this PR introduce any user-facing change?

Yes. When running ALTER TABLE ... ADD PARTITION, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog.

### How was this patch tested?

Unit tests

Closes #26369 from imback82/spark-29678.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-04 23:56:47 +08:00
Maxim Gekk 50538600ec [SPARK-29736][TESTS] Improve stability of tests for special datetime values
### What changes were proposed in this pull request?
- Retry the tests for special date-time values on failure. The tests can potentially fail when reference values are taken before midnight and the test code resolves special values after midnight. The retry guarantees that the tests run during the same day.
- Simplify getting the current timestamp via `Instant.now()`. This should avoid any issues of converting the current local datetime to an instant. For example, the same local time can map to 2 instants when clocks are turned back 1 hour on a daylight saving date.
- Extract common code to SQLHelper
- Set the tested zoneId to the session time zone in `DateTimeUtilsSuite`.

### Why are the changes needed?
To make the tests more stable.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By existing test suites `Date`/`TimestampFormatterSuite` and `DateTimeUtilsSuite`.

Closes #26380 from MaxGekk/retry-on-fail.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-04 16:59:32 +08:00
Maxim Gekk fb60c2a170 [SPARK-29671][SQL] Simplify string representation of intervals
### What changes were proposed in this pull request?
In the PR, I propose to change `CalendarInterval.toString`:
- to skip the `week` unit
- to convert `milliseconds` and `microseconds` as the fractional part of the `seconds` unit.

### Why are the changes needed?
To improve readability.

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
- By `CalendarIntervalSuite` and `IntervalUtilsSuite`
- `literals.sql`, `datetime.sql` and `interval.sql`

Closes #26367 from MaxGekk/interval-to-string-format.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-03 22:56:59 -08:00
wangguangxin.cn 83c39d15e1 [SPARK-29343][SQL] Eliminate sorts without limit in the subquery of Join/Aggregation
### What changes were proposed in this pull request?
This is somewhat a complement of https://github.com/apache/spark/pull/21853.
A `Sort` without a `Limit` operator in a `Join` subquery is useless; the same holds in `GroupBy` when the aggregation function is order-irrelevant, such as `count` or `sum`.
This PR tries to remove this kind of `Sort` operator in the SQL optimizer.

### Why are the changes needed?
For example,  `select count(1) from (select a from test1 order by a)` is equal to `select count(1) from (select a from test1)`.
`select * from (select a from test1 order by a) t1 join (select b from test2) t2 on t1.a = t2.b` is equal to `select * from (select a from test1) t1 join (select b from test2) t2 on t1.a = t2.b`.

Remove useless `Sort` operator can improve performance.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Adding new UT `RemoveSortInSubquerySuite.scala`

Closes #26011 from WangGuangxin/remove_sorts.

Authored-by: wangguangxin.cn <wangguangxin.cn@bytedance.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-04 14:52:19 +08:00
Kent Yao 5ba17d09ac [SPARK-29722][SQL] Non-reserved keywords should be able to be used in higher-order functions
### What changes were proposed in this pull request?

Support non-reserved keywords as identifiers in higher-order functions (see the sketch below).
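
A hedged sketch, using `day` as one arbitrary non-reserved keyword:
```sql
-- A non-reserved keyword can now serve as a lambda parameter name
SELECT transform(array(1, 2, 3), day -> day + 1);
-- [2, 3, 4]
```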

### Why are the changes needed?

These keywords are non-reserved, so they should be usable in this context.

### Does this PR introduce any user-facing change?

Yes, all non-reserved keywords can now be used in higher-order functions correctly.

### How was this patch tested?

add uts

Closes #26366 from yaooqinn/SPARK-29722.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-11-04 14:52:14 +09:00
Maxim Gekk 80a89873b2 [SPARK-29733][TESTS] Fix wrong order of parameters passed to assertEquals
### What changes were proposed in this pull request?
The `assertEquals` method of JUnit Assert requires the first parameter to be the expected value. In this PR, I propose to change the order of parameters when the expected value is passed as the second parameter.

### Why are the changes needed?
Wrong order of assert parameters confuses when the assert fails and the parameters have special string representation. For example:
```java
assertEquals(input1.add(input2), new CalendarInterval(5, 5, 367200000000L));
```
```
java.lang.AssertionError:
Expected :interval 5 months 5 days 101 hours
Actual   :interval 5 months 5 days 102 hours
```

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By existing tests.

Closes #26377 from MaxGekk/fix-order-in-assert-equals.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-03 11:21:28 -08:00
Wenchen Fan 31ae446e9c [SPARK-29623][SQL] do not allow multiple unit TO unit statements in interval literal syntax
### What changes were proposed in this pull request?

re-arrange the parser rules to make it clear that multiple unit TO unit statement like `SELECT INTERVAL '1-1' YEAR TO MONTH '2-2' YEAR TO MONTH` is not allowed.

### Why are the changes needed?

This is definitely an accident that we support such a weird syntax in the past. It's not supported by any other DBs and I can't think of any use case of it. Also no test covers this syntax in the current codebase.

### Does this PR introduce any user-facing change?

Yes, and a migration guide item is added.

### How was this patch tested?

new tests.

Closes #26285 from cloud-fan/syntax.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-02 21:35:56 +08:00
Huaxin Gao 14337f68e3 [SPARK-29643][SQL] ALTER TABLE/VIEW (DROP PARTITION) should look up catalog/table like v2 commands
### What changes were proposed in this pull request?
Add AlterTableDropPartitionStatement and make ALTER TABLE/VIEW ... DROP PARTITION go through the same catalog/table resolution framework of v2 commands.

### Why are the changes needed?
It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.
```
USE my_catalog
DESC t // success and describe the table t from my_catalog
ALTER TABLE t DROP PARTITION (id=1)  // report table not found as there is no table t in the session catalog
```

### Does this PR introduce any user-facing change?
Yes. When running ALTER TABLE/VIEW ... DROP PARTITION, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog.

### How was this patch tested?
Unit tests.

Closes #26303 from huaxingao/spark-29643.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-01 18:29:04 +08:00
Liu,Linhong a4382f7fe1 [SPARK-29486][SQL] CalendarInterval should have 3 fields: months, days and microseconds
### What changes were proposed in this pull request?
The current CalendarInterval has 2 fields: months and microseconds. This PR tries to change it to 3 fields: months, days and microseconds. This is because one logical day interval may have a different number of microseconds (daylight saving).

### Why are the changes needed?
One logical day interval may have a different number of microseconds (daylight saving). For example, in the PST timezone, there are 25 hours from 2019-11-2 12:00:00 to 2019-11-3 12:00:00.

### Does this PR introduce any user-facing change?
no

### How was this patch tested?
unit test and new added test cases

Closes #26134 from LinhongLiu/calendarinterval.

Authored-by: Liu,Linhong <liulinhong@baidu.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-01 18:12:33 +08:00
Huaxin Gao ae7450d1c9 [SPARK-29676][SQL] ALTER TABLE (RENAME PARTITION) should look up catalog/table like v2 commands
### What changes were proposed in this pull request?
Add AlterTableRenamePartitionStatement and make ALTER TABLE ... RENAME TO PARTITION go through the same catalog/table resolution framework of v2 commands.

### Why are the changes needed?
It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.
```
USE my_catalog
DESC t // success and describe the table t from my_catalog
ALTER TABLE t PARTITION (id=1) RENAME TO PARTITION (id=2) // report table not found as there is no table t in the session catalog
```

### Does this PR introduce any user-facing change?
Yes. When running ALTER TABLE ... RENAME TO PARTITION, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog.

### How was this patch tested?
Unit tests.

Closes #26350 from huaxingao/spark_29676.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
2019-10-31 20:28:31 -07:00
Wenchen Fan faf220aad9 [SPARK-29277][SQL][test-hadoop3.2] Add early DSv2 filter and projection pushdown
Bring back https://github.com/apache/spark/pull/25955

### What changes were proposed in this pull request?

This adds a new rule, `V2ScanRelationPushDown`, to push filters and projections in to a new `DataSourceV2ScanRelation` in the optimizer. That scan is then used when converting to a physical scan node. The new relation correctly reports stats based on the scan.

To run scan pushdown before rules where stats are used, this adds a new optimizer override, `earlyScanPushDownRules` and a batch for early pushdown in the optimizer, before cost-based join reordering. The other early pushdown rule, `PruneFileSourcePartitions`, is moved into the early pushdown rule set.

This also moves pushdown helper methods from `DataSourceV2Strategy` into a util class.

### Why are the changes needed?

This is needed for DSv2 sources to supply stats for cost-based rules in the optimizer.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

This updates the implementation of stats from `DataSourceV2Relation` so tests will fail if stats are accessed before early pushdown for v2 relations.

Closes #26341 from cloud-fan/back.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-31 08:25:32 -07:00
Maxim Gekk 5e9a155eba [SPARK-29520][SS] Fix checks of negative intervals
### What changes were proposed in this pull request?
- Added `getDuration()` to calculate interval duration in specified time units assuming provided days per months
- Added `isNegative()` which returns `true` if the interval duration is less than 0
- Fix checking negative intervals by using `isNegative()` in structured streaming classes
- Fix checking of `year-months` intervals

### Why are the changes needed?
This fixes incorrect checking of negative intervals. An interval is negative when its duration is negative, not when the interval's months **or** microseconds field is negative. This also fixes the check for `year-month` interval support, because the `month` field could be negative.

### Does this PR introduce any user-facing change?
Should not

### How was this patch tested?
- Added tests for the `getDuration()` and `isNegative()` methods to `IntervalUtilsSuite`
- By existing SS tests

Closes #26177 from MaxGekk/interval-is-positive.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-31 15:35:04 +08:00
Dongjoon Hyun 095f7b05fd Revert "[SPARK-29277][SQL] Add early DSv2 filter and projection pushdown"
This reverts commit cfc80d0eb1.
2019-10-30 23:11:22 -07:00
Terry Kim 3a06c129f4 [SPARK-29592][SQL] ALTER TABLE (set partition location) should look up catalog/table like v2 commands
### What changes were proposed in this pull request?

Update `AlterTableSetLocationStatement` to store `partitionSpec` and make `ALTER TABLE a.b.c PARTITION(...) SET LOCATION 'loc'` fail with an "unsupported" message if `partitionSpec` is set.

### Why are the changes needed?

It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.

```
USE my_catalog
DESC t // success and describe the table t from my_catalog
ALTER TABLE t PARTITION(...) SET LOCATION 'loc' // report set location with partition spec is not supported.
```
### Does this PR introduce any user-facing change?

yes. When running ALTER TABLE (set partition location), Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog.

### How was this patch tested?

New unit tests

Closes #26304 from imback82/alter_table_partition_loc.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-31 10:47:43 +08:00
Unknown 401a5f7715 [SPARK-29523][SQL] SHOW COLUMNS should do multi-catalog resolution
### What changes were proposed in this pull request?

Add ShowColumnsStatement and make SHOW COLUMNS go through the same catalog/table resolution framework of v2 commands.

### Why are the changes needed?

It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.

```
USE my_catalog
DESC t // success and describe the table t from my_catalog
SHOW COLUMNS FROM t // report table not found as there is no table t in the session catalog
```

### Does this PR introduce any user-facing change?

yes. When running SHOW COLUMNS Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog.

### How was this patch tested?

Unit tests.

Closes #26182 from planga82/feature/SPARK-29523_SHOW_COLUMNS_datasourceV2.

Authored-by: Unknown <soypab@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-31 10:13:12 +08:00
Maxim Gekk 3206a99870 [SPARK-29651][SQL] Fix parsing of interval seconds fraction
### What changes were proposed in this pull request?
In the PR, I propose to extract parsing of the seconds interval units to the private method `parseNanos` in `IntervalUtils` and modify the code to correctly parse the fractional part of the seconds unit of intervals in the cases:
- When the fractional part has less than 9 digits
- The seconds unit is negative

### Why are the changes needed?
The changes are needed to fix the issues:
```sql
spark-sql> select interval '10.123456 seconds';
interval 10 seconds 123 microseconds
```
The correct result must be `interval 10 seconds 123 milliseconds 456 microseconds`
```sql
spark-sql> select interval '-10.123456789 seconds';
interval -9 seconds -876 milliseconds -544 microseconds
```
but the whole interval should be negated, and the result must be `interval -10 seconds -123 milliseconds -456 microseconds`, taking into account the truncation to microseconds.

### Does this PR introduce any user-facing change?
Yes. After changes:
```sql
spark-sql> select interval '10.123456 seconds';
interval 10 seconds 123 milliseconds 456 microseconds
spark-sql> select interval '-10.123456789 seconds';
interval -10 seconds -123 milliseconds -456 microseconds
```

### How was this patch tested?
By existing and new tests in `ExpressionParserSuite`.

Closes #26313 from MaxGekk/fix-interval-nanos-parsing.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-31 09:20:46 +08:00
Ryan Blue cfc80d0eb1 [SPARK-29277][SQL] Add early DSv2 filter and projection pushdown
### What changes were proposed in this pull request?

This adds a new rule, `V2ScanRelationPushDown`, to push filters and projections in to a new `DataSourceV2ScanRelation` in the optimizer. That scan is then used when converting to a physical scan node. The new relation correctly reports stats based on the scan.

To run scan pushdown before rules where stats are used, this adds a new optimizer override, `earlyScanPushDownRules` and a batch for early pushdown in the optimizer, before cost-based join reordering. The other early pushdown rule, `PruneFileSourcePartitions`, is moved into the early pushdown rule set.

This also moves pushdown helper methods from `DataSourceV2Strategy` into a util class.

### Why are the changes needed?

This is needed for DSv2 sources to supply stats for cost-based rules in the optimizer.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

This updates the implementation of stats from `DataSourceV2Relation` so tests will fail if stats are accessed before early pushdown for v2 relations.

Closes #25955 from rdblue/move-v2-pushdown.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Ryan Blue <blue@apache.org>
2019-10-30 18:07:34 -07:00
Xingbo Jiang 8207c835b4 Revert "Prepare Spark release v3.0.0-preview-rc2"
This reverts commit 007c873ae3.
2019-10-30 17:45:44 -07:00
Xingbo Jiang 007c873ae3 Prepare Spark release v3.0.0-preview-rc2
### What changes were proposed in this pull request?

To push the built jars to maven release repository, we need to remove the 'SNAPSHOT' tag from the version name.

Made the following changes in this PR:
* Update all the `3.0.0-SNAPSHOT` version name to `3.0.0-preview`
* Update the sparkR version number check logic to allow jvm version like `3.0.0-preview`

**Please note those changes were generated by the release script in the past, but this time since we manually add tags on master branch, we need to manually apply those changes too.**

We shall revert the changes after 3.0.0-preview release passed.

### Why are the changes needed?

To make the maven release repository to accept the built jars.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

N/A
2019-10-30 17:42:59 -07:00
Kent Yao dc987f0c8b [SPARK-29653][SQL] Fix MICROS_PER_MONTH in IntervalUtils
### What changes were proposed in this pull request?

MICROS_PER_MONTH = DAYS_PER_MONTH * MICROS_PER_DAY

### Why are the changes needed?

fix bug

### Does this PR introduce any user-facing change?

no
### How was this patch tested?

add ut

Closes #26321 from yaooqinn/SPARK-29653.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-30 08:09:22 -07:00
Kent Yao 8e667db5d8 [SPARK-29629][SQL] Support typed integer literal expression
### What changes were proposed in this pull request?

```
postgres=# select date '2001-09-28' + integer '7';
  ?column?
------------
 2001-10-05
(1 row)
postgres=# select integer '7';
 int4
------
    7
(1 row)
```
Add support for typed integer literal expression from postgreSQL.

### Why are the changes needed?

SPARK-27764 Feature Parity between PostgreSQL and Spark

### Does this PR introduce any user-facing change?

support typed integer lit in SQL

### How was this patch tested?

add uts

Closes #26291 from yaooqinn/SPARK-29629.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-30 09:40:54 +09:00
ulysses 6958d7e629 [SPARK-28746][SQL] Add partitionby hint for sql queries
## What changes were proposed in this pull request?

Now, `RepartitionByExpression` is available via the Dataset method `Dataset.repartition()`, but Spark SQL does not have equivalent functionality.
In Hive, we can use `distribute by`, so it's worth adding a hint to support such functionality.
Similar jira [SPARK-24940](https://issues.apache.org/jira/browse/SPARK-24940)

## Why are the changes needed?

Make repartition hints consistent with the repartition API.

## Does this PR introduce any user-facing change?
This PR intends to support the queries below:
```
// SQL cases
 - sql("SELECT /*+ REPARTITION(c) */ * FROM t")
 - sql("SELECT /*+ REPARTITION(1, c) */ * FROM t")
 - sql("SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t")
 - sql("SELECT /*+ REPARTITION_BY_RANGE(1, c) */ * FROM t")
```

## How was this patch tested?
UT

Closes #25464 from ulysses-you/SPARK-28746.

Lead-authored-by: ulysses <youxiduo@weidian.com>
Co-authored-by: ulysses <646303253@qq.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2019-10-30 08:25:34 +09:00
Maxim Gekk 44c1c03924 [SPARK-29607][SQL] Move static methods from CalendarInterval to IntervalUtils
### What changes were proposed in this pull request?
In the PR, I propose to move all static methods from the `CalendarInterval` class to the `IntervalUtils` object. All those methods are rewritten from Java to Scala.

### Why are the changes needed?
- For consistency with other helper methods. Such methods were placed to the helper object `IntervalUtils`, see https://github.com/apache/spark/pull/26190
- Taking into account that `CalendarInterval` will be fully exposed to users in the future (see https://github.com/apache/spark/pull/25022), it would be nice to clean it up by moving service methods to an internal object.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
- By the tests moved from `CalendarIntervalSuite` to `IntervalUtilsSuite`
- By existing test suites

Closes #26261 from MaxGekk/refactoring-calendar-interval.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-30 01:15:18 +08:00
Huaxin Gao e92b75482f [SPARK-29612][SQL] ALTER TABLE (RECOVER PARTITIONS) should look up catalog/table like v2 commands
### What changes were proposed in this pull request?
Add AlterTableRecoverPartitionsStatement and make ALTER TABLE ... RECOVER PARTITIONS go through the same catalog/table resolution framework of v2 commands.

### Why are the changes needed?
It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.
```
USE my_catalog
DESC t // success and describe the table t from my_catalog
ALTER TABLE t RECOVER PARTITIONS  // report table not found as there is no table t in the session catalog
```

### Does this PR introduce any user-facing change?
Yes. When running ALTER TABLE ... RECOVER PARTITIONS, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specifies a v2 catalog.

### How was this patch tested?
Unit tests.

Closes #26269 from huaxingao/spark-29612.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-29 13:54:07 +08:00
Xingbo Jiang b33a58c0c6 Revert "Prepare Spark release v3.0.0-preview-rc1"
This reverts commit 5eddbb5f1d.
2019-10-28 22:32:34 -07:00
Xingbo Jiang 5eddbb5f1d Prepare Spark release v3.0.0-preview-rc1
### What changes were proposed in this pull request?

To push the built jars to maven release repository, we need to remove the 'SNAPSHOT' tag from the version name.

Made the following changes in this PR:
* Update all the `3.0.0-SNAPSHOT` version name to `3.0.0-preview`
* Update the PySpark version from `3.0.0.dev0` to `3.0.0`

**Please note those changes were generated by the release script in the past, but this time since we manually add tags on master branch, we need to manually apply those changes too.**

We shall revert the changes after 3.0.0-preview release passed.

### Why are the changes needed?

To make the maven release repository accept the built jars.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

N/A

Closes #26243 from jiangxb1987/3.0.0-preview-prepare.

Lead-authored-by: Xingbo Jiang <xingbo.jiang@databricks.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>
2019-10-28 22:31:29 -07:00
Terry Kim 59db1f617a [SPARK-29609][SQL] DataSourceV2: Support DROP NAMESPACE
### What changes were proposed in this pull request?

This PR adds `DROP NAMESPACE` support for V2 catalogs.

### Why are the changes needed?

Currently, you cannot drop namespaces for v2 catalogs.

### Does this PR introduce any user-facing change?

The user can now perform the following:
```SQL
CREATE NAMESPACE mycatalog.ns
DROP NAMESPACE mycatalog.ns
SHOW NAMESPACES IN mycatalog # Will show no namespaces
```
to drop a namespace `ns` inside `mycatalog` V2 catalog.

### How was this patch tested?

Added unit tests.

Closes #26262 from imback82/drop_namespace.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-28 15:00:22 -07:00
Liang-Chi Hsieh 2be1fe6abc [SPARK-29521][SQL] LOAD DATA INTO TABLE should look up catalog/table like v2 commands
### What changes were proposed in this pull request?

Add LoadDataStatement and make LOAD DATA INTO TABLE go through the same catalog/table resolution framework of v2 commands.

### Why are the changes needed?

It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.

```
USE my_catalog
DESC t // success and describe the table t from my_catalog
LOAD DATA INPATH 'filepath'  INTO TABLE t // report table not found as there is no table t in the session catalog
```

### Does this PR introduce any user-facing change?

Yes. When running LOAD DATA INTO TABLE, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specifies a v2 catalog.

### How was this patch tested?

Unit tests.

Closes #26178 from viirya/SPARK-29521.

Lead-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-29 00:41:20 +08:00
Takeshi Yamamuro e6c56937e2 [SPARK-29008][SQL][FOLLOW-UP] Checks correct param lengths of split methods
### What changes were proposed in this pull request?

This pr is to fix wrong code to check parameter lengths of split methods in `subexpressionEliminationForWholeStageCodegen`.

### Why are the changes needed?

Bug fix.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #26267 from maropu/SPARK-29008-FOLLOWUP.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-27 21:45:56 -07:00
Maxim Gekk 74514b46e5 [SPARK-29614][SQL][TEST] Fix failures of DateTimeUtilsSuite and TimestampFormatterSuite
### What changes were proposed in this pull request?
The `DateTimeUtilsSuite` and `TimestampFormatterSuite` assume constant time difference between `timestamp'yesterday'`, `timestamp'today'` and `timestamp'tomorrow'` which is wrong on daylight switching day - day length can be 23 or 25 hours. In the PR, I propose to use Java 8 time API to calculate instances of `yesterday` and `tomorrow` timestamps.
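
A hedged sketch of the idea (not the exact test code), assuming the default time zone:

```scala
import java.time.{LocalDate, ZoneId}

val zone      = ZoneId.systemDefault()
val today     = LocalDate.now(zone).atStartOfDay(zone)
val yesterday = today.minusDays(1)   // not necessarily today minus 24 hours on a DST switch day
val tomorrow  = today.plusDays(1)    // not necessarily today plus 24 hours on a DST switch day
```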

### Why are the changes needed?
The changes fix test failures and make the tests tolerant to daylight time switching.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By existing test suites `DateTimeUtilsSuite` and `TimestampFormatterSuite`.

Closes #26273 from MaxGekk/midnight-tolerant.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-27 13:48:43 -07:00
Kent Yao 9a46702791 [SPARK-29554][SQL] Add version SQL function
### What changes were proposed in this pull request?

```
hive> select version();
OK
3.1.1 rf4e0529634b6231a0072295da48af466cf2f10b7
Time taken: 2.113 seconds, Fetched: 1 row(s)
```
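
A hedged Spark-side usage sketch of the new function (assumes an active SparkSession `spark`; the exact output format is illustrative):

```scala
spark.sql("SELECT version()").show(false)
// Expected to show the Spark version plus the build revision, mirroring the Hive output above.
```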

### Why are the changes needed?

This follows Hive's behavior, and it is useful for debugging, development, etc.

### Does this PR introduce any user-facing change?

add a misc func

### How was this patch tested?

add ut

Closes #26209 from yaooqinn/SPARK-29554.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-25 23:02:11 -07:00
Dongjoon Hyun 5bdc58bf8a [SPARK-27653][SQL][FOLLOWUP] Fix since version of min_by/max_by
### What changes were proposed in this pull request?

This is a follow-up of https://github.com/apache/spark/pull/24557 to fix `since` version.

### Why are the changes needed?

This is found during 3.0.0-preview preparation.
The version will be exposed to our SQL document like the following. We had better fix this.
- https://spark.apache.org/docs/latest/api/sql/#array_min

### Does this PR introduce any user-facing change?

Yes. It's exposed at `DESC FUNCTION EXTENDED` SQL command and  SQL doc, but this is new at 3.0.0.

### How was this patch tested?

Manual.
```
spark-sql> DESC FUNCTION EXTENDED min_by;
Function: min_by
Class: org.apache.spark.sql.catalyst.expressions.aggregate.MinBy
Usage: min_by(x, y) - Returns the value of `x` associated with the minimum value of `y`.
Extended Usage:
    Examples:
      > SELECT min_by(x, y) FROM VALUES (('a', 10)), (('b', 50)), (('c', 20)) AS tab(x, y);
       a

    Since: 3.0.0
```

Closes #26264 from dongjoon-hyun/SPARK-27653.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-25 21:17:17 -07:00
Liang-Chi Hsieh 68dca9a095 [SPARK-29527][SQL] SHOW CREATE TABLE should look up catalog/table like v2 commands
### What changes were proposed in this pull request?

Add ShowCreateTableStatement and make SHOW CREATE TABLE go through the same catalog/table resolution framework of v2 commands.

### Why are the changes needed?

It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.

```
USE my_catalog
DESC t // success and describe the table t from my_catalog
SHOW CREATE TABLE t // report table not found as there is no table t in the session catalog
```

### Does this PR introduce any user-facing change?

Yes. When running SHOW CREATE TABLE, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specifies a v2 catalog.

### How was this patch tested?

Unit tests.

Closes #26184 from viirya/SPARK-29527.

Lead-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-25 23:09:08 +08:00
Kent Yao 0cf4f07c66 [SPARK-29545][SQL] Add support for bit_xor aggregate function
### What changes were proposed in this pull request?

bit_xor(expr) - Returns the bitwise XOR of all non-null input values, or null if none
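
A hedged usage example with illustrative values (3 XOR 5 = 6); assumes an active SparkSession `spark`:

```scala
spark.sql("SELECT bit_xor(v) FROM VALUES (3), (5) AS t(v)").show()
// expected result: 6
```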

### Why are the changes needed?

As we now support `bit_and` and `bit_or`, we'd better also support the related aggregate function **bit_xor** ahead of PostgreSQL, because many other popular databases support it.

http://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.help.sqlanywhere.12.0.1/dbreference/bit-xor-function.html

https://dev.mysql.com/doc/refman/5.7/en/group-by-functions.html#function_bit-or

https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/Aggregate/BIT_XOR.htm?TocPath=SQL%20Reference%20Manual%7CSQL%20Functions%7CAggregate%20Functions%7C_____10

### Does this PR introduce any user-facing change?

add a new bit agg
### How was this patch tested?

UTs added

Closes #26205 from yaooqinn/SPARK-29545.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2019-10-25 22:19:19 +09:00
Terry Kim dec99d8ac5 [SPARK-29526][SQL] UNCACHE TABLE should look up catalog/table like v2 commands
### What changes were proposed in this pull request?

Add UncacheTableStatement and make UNCACHE TABLE go through the same catalog/table resolution framework of v2 commands.

### Why are the changes needed?

It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.

```
USE my_catalog
DESC t // success and describe the table t from my_catalog
UNCACHE TABLE t // report table not found as there is no table t in the session catalog
```
### Does this PR introduce any user-facing change?

Yes. When running UNCACHE TABLE, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specifies a v2 catalog.

### How was this patch tested?

New unit tests

Closes #26237 from imback82/uncache_table.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-24 14:51:23 -07:00
stczwd dcf5eaf1a6 [SPARK-29444][FOLLOWUP] add doc and python parameter for ignoreNullFields in json generating
### What changes were proposed in this pull request?
Add a description for ignoreNullFields, which was committed in #26098, to DataFrameWriter and readwriter.py.
Enable users to use ignoreNullFields in PySpark.
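
A hedged Scala-side usage sketch; the option name is taken from this description, and the output path is hypothetical:

```scala
spark.range(1).selectExpr("CAST(null AS STRING) AS a", "id AS b")
  .write
  .option("ignoreNullFields", "false")   // keep fields whose values are null in the generated JSON
  .json("/tmp/json-with-nulls")
```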

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
run unit tests

Closes #26227 from stczwd/json-generator-doc.

Authored-by: stczwd <qcsd2011@163.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-24 10:25:04 -07:00
Wenchen Fan cdea520ff8 [SPARK-29532][SQL] Simplify interval string parsing
### What changes were proposed in this pull request?

Only use antlr4 to parse the interval string, and remove the duplicated parsing logic from `CalendarInterval`.

### Why are the changes needed?

Simplify the code and fix inconsistent behaviors.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Pass the Jenkins with the updated test cases.

Closes #26190 from cloud-fan/parser.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-24 09:15:59 -07:00
Sean Owen a35fb4fd50 [SPARK-29578][TESTS] Add "8634" as another skipped day for Kwajalein timzeone due to more recent timezone updates in later JDK 8
### What changes were proposed in this pull request?

Recent timezone definition changes in very new JDK 8 (and beyond) releases cause test failures. The below was observed on JDK 1.8.0_232. As before, the easy fix is to allow for these inconsequential variations in test results due to differing definition of timezones.

### Why are the changes needed?

Keeps test passing on the latest JDK releases.

### Does this PR introduce any user-facing change?

None

### How was this patch tested?

Existing tests

Closes #26236 from srowen/SPARK-29578.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-24 08:30:27 -05:00
angerszhu 67cf0433ee [SPARK-29145][SQL] Support sub-queries in join conditions
### What changes were proposed in this pull request?
Support using IN/EXISTS with a subquery in JOIN conditions in Spark SQL.

### Why are the changes needed?
Allow SQL users to use IN/EXISTS with a subquery in JOIN conditions.

### Does this PR introduce any user-facing change?

This PR enables users to use a subquery in a `JOIN`'s ON condition. For example, given three tables created as follows:
```
CREATE TABLE A(id String);
CREATE TABLE B(id String);
CREATE TABLE C(id String);
```
we can run a query like:
```
SELECT A.id  from  A JOIN B ON A.id = B.id and A.id IN (select C.id from C)
```

### How was this patch tested?
ADDED UT

Closes #25854 from AngersZhuuuu/SPARK-29145.

Lead-authored-by: angerszhu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2019-10-24 21:55:03 +09:00
Yuanjian Li 9e77d48315 [SPARK-21492][SQL][FOLLOW UP] Reimplement UnsafeExternalRowSorter in database style iterator
### What changes were proposed in this pull request?
Reimplement the iterator in UnsafeExternalRowSorter in database style. This can be done by reusing the `RowIterator` in our code base.

### Why are the changes needed?
During the work in #26164, after introducing a var `isReleased` in `hasNext`, it is possible that `isReleased` is false when `hasNext` is called but becomes true before `next` is called. A safer way is to use a database-style iterator: `advanceNext` and `getRow`.
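
A minimal sketch of the database-style iterator contract described here (a hypothetical trait for illustration, not Spark's actual `RowIterator`):

```scala
trait DbStyleIterator[T] {
  /** Advances to the next row; returns false once the iterator is exhausted. */
  def advanceNext(): Boolean
  /** Returns the current row; only valid after advanceNext() has returned true. */
  def getRow: T
}
```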

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing UT.

Closes #26229 from xuanyuanking/SPARK-21492-follow-up.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-24 15:43:13 +08:00
Liang-Chi Hsieh 177bf672e4 [SPARK-29522][SQL] CACHE TABLE should look up catalog/table like v2 commands
### What changes were proposed in this pull request?

Add CacheTableStatement and make CACHE TABLE go through the same catalog/table resolution framework of v2 commands.

### Why are the changes needed?

It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.

```
USE my_catalog
DESC t // success and describe the table t from my_catalog
CACHE TABLE t // report table not found as there is no table t in the session catalog
```
### Does this PR introduce any user-facing change?

Yes. When running CACHE TABLE, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specifies a v2 catalog.

### How was this patch tested?

Unit tests.

Closes #26179 from viirya/SPARK-29522.

Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-24 15:00:21 +08:00
Jungtaek Lim (HeartSaVioR) bfbf2821f3 [SPARK-29503][SQL] Remove conversion CreateNamedStruct to CreateNamedStructUnsafe
### What changes were proposed in this pull request?

There's a case where MapObjects has a lambda function which creates a nested struct - unsafe data inside a safe data struct. In this case, MapObjects doesn't copy the row returned from the lambda function (as the outermost data type is a safe data struct), so the nested unsafe data is never copied.

The culprit is that `UnsafeProjection.toUnsafeExprs` converts `CreateNamedStruct` to `CreateNamedStructUnsafe` (the only place where `CreateNamedStructUnsafe` is used), which temporarily mixes safe and unsafe data. This may not be needed at all, at least logically, as these evaluations are finally assembled into an `UnsafeRow`.

> Before the patch

```
/* 105 */   private ArrayData MapObjects_0(InternalRow i) {
/* 106 */     boolean isNull_1 = i.isNullAt(0);
/* 107 */     ArrayData value_1 = isNull_1 ?
/* 108 */     null : (i.getArray(0));
/* 109 */     ArrayData value_0 = null;
/* 110 */
/* 111 */     if (!isNull_1) {
/* 112 */
/* 113 */       int dataLength_0 = value_1.numElements();
/* 114 */
/* 115 */       ArrayData[] convertedArray_0 = null;
/* 116 */       convertedArray_0 = new ArrayData[dataLength_0];
/* 117 */
/* 118 */
/* 119 */       int loopIndex_0 = 0;
/* 120 */
/* 121 */       while (loopIndex_0 < dataLength_0) {
/* 122 */         value_MapObject_lambda_variable_1 = (int) (value_1.getInt(loopIndex_0));
/* 123 */         isNull_MapObject_lambda_variable_1 = value_1.isNullAt(loopIndex_0);
/* 124 */
/* 125 */         ArrayData arrayData_0 = ArrayData.allocateArrayData(
/* 126 */           -1, 1L, " createArray failed.");
/* 127 */
/* 128 */         mutableStateArray_0[0].reset();
/* 129 */
/* 130 */
/* 131 */         mutableStateArray_0[0].zeroOutNullBytes();
/* 132 */
/* 133 */
/* 134 */         if (isNull_MapObject_lambda_variable_1) {
/* 135 */           mutableStateArray_0[0].setNullAt(0);
/* 136 */         } else {
/* 137 */           mutableStateArray_0[0].write(0, value_MapObject_lambda_variable_1);
/* 138 */         }
/* 139 */         arrayData_0.update(0, (mutableStateArray_0[0].getRow()));
/* 140 */         if (false) {
/* 141 */           convertedArray_0[loopIndex_0] = null;
/* 142 */         } else {
/* 143 */           convertedArray_0[loopIndex_0] = arrayData_0 instanceof UnsafeArrayData? arrayData_0.copy() : arrayData_0;
/* 144 */         }
/* 145 */
/* 146 */         loopIndex_0 += 1;
/* 147 */       }
/* 148 */
/* 149 */       value_0 = new org.apache.spark.sql.catalyst.util.GenericArrayData(convertedArray_0);
/* 150 */     }
/* 151 */     globalIsNull_0 = isNull_1;
/* 152 */     return value_0;
/* 153 */   }
```

> After the patch

```
/* 104 */   private ArrayData MapObjects_0(InternalRow i) {
/* 105 */     boolean isNull_1 = i.isNullAt(0);
/* 106 */     ArrayData value_1 = isNull_1 ?
/* 107 */     null : (i.getArray(0));
/* 108 */     ArrayData value_0 = null;
/* 109 */
/* 110 */     if (!isNull_1) {
/* 111 */
/* 112 */       int dataLength_0 = value_1.numElements();
/* 113 */
/* 114 */       ArrayData[] convertedArray_0 = null;
/* 115 */       convertedArray_0 = new ArrayData[dataLength_0];
/* 116 */
/* 117 */
/* 118 */       int loopIndex_0 = 0;
/* 119 */
/* 120 */       while (loopIndex_0 < dataLength_0) {
/* 121 */         value_MapObject_lambda_variable_1 = (int) (value_1.getInt(loopIndex_0));
/* 122 */         isNull_MapObject_lambda_variable_1 = value_1.isNullAt(loopIndex_0);
/* 123 */
/* 124 */         ArrayData arrayData_0 = ArrayData.allocateArrayData(
/* 125 */           -1, 1L, " createArray failed.");
/* 126 */
/* 127 */         Object[] values_0 = new Object[1];
/* 128 */
/* 129 */
/* 130 */         if (isNull_MapObject_lambda_variable_1) {
/* 131 */           values_0[0] = null;
/* 132 */         } else {
/* 133 */           values_0[0] = value_MapObject_lambda_variable_1;
/* 134 */         }
/* 135 */
/* 136 */         final InternalRow value_3 = new org.apache.spark.sql.catalyst.expressions.GenericInternalRow(values_0);
/* 137 */         values_0 = null;
/* 138 */         arrayData_0.update(0, value_3);
/* 139 */         if (false) {
/* 140 */           convertedArray_0[loopIndex_0] = null;
/* 141 */         } else {
/* 142 */           convertedArray_0[loopIndex_0] = arrayData_0 instanceof UnsafeArrayData? arrayData_0.copy() : arrayData_0;
/* 143 */         }
/* 144 */
/* 145 */         loopIndex_0 += 1;
/* 146 */       }
/* 147 */
/* 148 */       value_0 = new org.apache.spark.sql.catalyst.util.GenericArrayData(convertedArray_0);
/* 149 */     }
/* 150 */     globalIsNull_0 = isNull_1;
/* 151 */     return value_0;
/* 152 */   }
```

### Why are the changes needed?

This patch fixes the bug described above.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

UT added which fails on master branch and passes on PR.

Closes #26173 from HeartSaVioR/SPARK-29503.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-24 00:41:48 +08:00
Terry Kim 53a5f17803 [SPARK-29513][SQL] REFRESH TABLE should look up catalog/table like v2 commands
### What changes were proposed in this pull request?

Add RefreshTableStatement and make REFRESH TABLE go through the same catalog/table resolution framework of v2 commands.

### Why are the changes needed?

It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.

```
USE my_catalog
DESC t // success and describe the table t from my_catalog
REFRESH TABLE t // report table not found as there is no table t in the session catalog
```
### Does this PR introduce any user-facing change?

Yes. When running REFRESH TABLE, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specifies a v2 catalog.

### How was this patch tested?

New unit tests

Closes #26183 from imback82/refresh_table.

Lead-authored-by: Terry Kim <yuminkim@gmail.com>
Co-authored-by: Terry Kim <terryk@terrys-mbp-2.lan>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
2019-10-23 08:26:47 -07:00
turbofei 70dd9c0cab [SPARK-29542][SQL][DOC] Make the descriptions of spark.sql.files.* be clearly
### What changes were proposed in this pull request?
As described in [SPARK-29542](https://issues.apache.org/jira/browse/SPARK-29542), the descriptions of `spark.sql.files.*` are confusing.
In this PR, I make their descriptions clearer.

### Why are the changes needed?
It makes the descriptions of `spark.sql.files.*` clearer.
### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing UT.

Closes #26200 from turboFei/SPARK-29542-partition-maxSize.

Authored-by: turbofei <fwang12@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-23 20:31:06 +09:00
Terry Kim c128ac564d [SPARK-29511][SQL] DataSourceV2: Support CREATE NAMESPACE
### What changes were proposed in this pull request?

This PR adds `CREATE NAMESPACE` support for V2 catalogs.

### Why are the changes needed?

Currently, you cannot explicitly create namespaces for v2 catalogs.

### Does this PR introduce any user-facing change?

The user can now perform the following:
```SQL
CREATE NAMESPACE mycatalog.ns
```
to create a namespace `ns` inside `mycatalog` V2 catalog.

### How was this patch tested?

Added unit tests.

Closes #26166 from imback82/create_namespace.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-23 12:17:20 +08:00
Huaxin Gao 3bf5355e24 [SPARK-29539][SQL] SHOW PARTITIONS should look up catalog/table like v2 commands
### What changes were proposed in this pull request?
Add ShowPartitionsStatement and make SHOW PARTITIONS go through the same catalog/table resolution framework of v2 commands.

### Why are the changes needed?
It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users.

### Does this PR introduce any user-facing change?
Yes. When running SHOW PARTITIONS, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specifies a v2 catalog.

### How was this patch tested?
Unit tests.

Closes #26198 from huaxingao/spark-29539.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
2019-10-22 14:47:17 -07:00
Liang-Chi Hsieh b4844eea1f [SPARK-29517][SQL] TRUNCATE TABLE should look up catalog/table like v2 commands
### What changes were proposed in this pull request?

Add TruncateTableStatement and make TRUNCATE TABLE go through the same catalog/table resolution framework of v2 commands.

### Why are the changes needed?

It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.

```
USE my_catalog
DESC t // success and describe the table t from my_catalog
TRUNCATE TABLE t // report table not found as there is no table t in the session catalog
```

### Does this PR introduce any user-facing change?

Yes. When running TRUNCATE TABLE, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specifies a v2 catalog.

### How was this patch tested?

Unit tests.

Closes #26174 from viirya/SPARK-29517.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-22 19:17:28 +08:00
denglingang 467c3f610f [SPARK-29529][DOCS] Remove unnecessary orc version and hive version in doc
### What changes were proposed in this pull request?

This PR remove unnecessary orc version and hive version in doc.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

N/A.

Closes #26146 from denglingang/SPARK-24576.

Lead-authored-by: denglingang <chitin1027@gmail.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-22 14:49:23 +09:00
angerszhu 484f93e255 [SPARK-29530][SQL] Make SQLConf in SQL parse process thread safe
### What changes were proposed in this pull request?
As I commented in [SPARK-29516](https://github.com/apache/spark/pull/26172#issuecomment-544364977),
the `SparkSession.sql()` parse process does not run under the current SparkSession's conf, so parser-related configuration may not take effect in multi-threaded situations.

In this PR, we add a SQLConf parameter to AbstractSqlParser and initialize it with the SessionState's conf.
Then each SparkSession's parsing uses its own SessionState's SQLConf and is therefore thread safe.
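
A conceptual sketch under assumed names (the class name is hypothetical); it only illustrates binding a parser to the session's own `SQLConf` rather than a shared/static configuration:

```scala
import org.apache.spark.sql.internal.SQLConf

// Hypothetical: each session builds its parser with its own SessionState's SQLConf,
// so configuration-dependent parsing stays isolated across threads.
class SessionScopedParser(conf: SQLConf) {
  def describe(): String = s"ANSI mode for this parser: ${conf.ansiEnabled}"
}
```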

### Why are the changes needed?
Fix bug

### Does this PR introduce any user-facing change?
NO

### How was this patch tested?
NO

Closes #26187 from AngersZhuuuu/SPARK-29530.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-22 10:38:06 +08:00
Kent Yao 5b4d9170ed [SPARK-27879][SQL] Add support for bit_and and bit_or aggregates
### What changes were proposed in this pull request?

```
bit_and(expression) -- The bitwise AND of all non-null input values, or null if none
bit_or(expression) -- The bitwise OR of all non-null input values, or null if none
```
More details:
https://www.postgresql.org/docs/9.3/functions-aggregate.html

### Why are the changes needed?

Postgres, MySQL, and many other popular databases support them.

### Does this PR introduce any user-facing change?

add two bit agg

### How was this patch tested?

add ut

Closes #26155 from yaooqinn/SPARK-27879.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-21 14:32:31 +08:00
Terry Kim ab92e1715e [SPARK-29512][SQL] REPAIR TABLE should look up catalog/table like v2 commands
### What changes were proposed in this pull request?

Add RepairTableStatement and make REPAIR TABLE go through the same catalog/table resolution framework of v2 commands.

### Why are the changes needed?

It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.

```
USE my_catalog
DESC t // success and describe the table t from my_catalog
MSCK REPAIR TABLE t // report table not found as there is no table t in the session catalog
```
### Does this PR introduce any user-facing change?

Yes. When running MSCK REPAIR TABLE, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specifies a v2 catalog.

### How was this patch tested?

New unit tests

Closes #26168 from imback82/repair_table.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
2019-10-18 22:43:58 -07:00
Wenchen Fan 2437878299 [SPARK-29502][SQL] typed interval expression should fail for invalid format
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/25241 .

The typed interval expression should fail for invalid format.

### Why are the changes needed?

To be consistent with the typed timestamp/date expressions.

### Does this PR introduce any user-facing change?

Yes. But this feature is not released yet.

### How was this patch tested?

updated test

Closes #26151 from cloud-fan/bug.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
2019-10-18 16:12:03 -07:00
Rahul Mahadev 4cfce3e5d0 [SPARK-29494][SQL] Fix for ArrayOutofBoundsException while converting string to timestamp
### What changes were proposed in this pull request?
* Adding an additional check in `stringToTimestamp` to handle cases where the input has trailing ':'
* Added a test to make sure this works.

### Why are the changes needed?
In a couple of scenarios, while converting from String to Timestamp, `DateTimeUtils.stringToTimestamp` throws an array-out-of-bounds exception if there is a trailing ':'. The contract of this method requires it to return `None` when the format of the string is incorrect.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Added a test in the `DateTimeTestUtils` suite to test if my fix works.

Closes #26143 from rahulsmahadev/SPARK-29494.

Lead-authored-by: Rahul Mahadev <rahul.mahadev@databricks.com>
Co-authored-by: Rahul Shivu Mahadev <51690557+rahulsmahadev@users.noreply.github.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-18 16:45:25 -05:00
Maxim Gekk 77fe8a8e7c [SPARK-28420][SQL] Support the INTERVAL type in date_part()
### What changes were proposed in this pull request?
The `date_part()` function can accept the `source` parameter of the `INTERVAL` type (`CalendarIntervalType`). The following values of the `field` parameter are supported:
- `"MILLENNIUM"` (`"MILLENNIA"`, `"MIL"`, `"MILS"`) - number of millenniums in the given interval. It is `YEAR / 1000`.
- `"CENTURY"` (`"CENTURIES"`, `"C"`, `"CENT"`) - number of centuries in the interval calculated as `YEAR / 100`.
- `"DECADE"` (`"DECADES"`, `"DEC"`, `"DECS"`) - decades in the `YEAR` part of the interval calculated as `YEAR / 10`.
- `"YEAR"` (`"Y"`, `"YEARS"`, `"YR"`, `"YRS"`) - years in a values of `CalendarIntervalType`. It is `MONTHS / 12`.
- `"QUARTER"` (`"QTR"`) - a quarter of year calculated as `MONTHS / 3 + 1`
- `"MONTH"` (`"MON"`, `"MONS"`, `"MONTHS"`) - the months part of the interval calculated as `CalendarInterval.months % 12`
- `"DAY"` (`"D"`, `"DAYS"`) - total number of days in `CalendarInterval.microseconds`
- `"HOUR"` (`"H"`, `"HOURS"`, `"HR"`, `"HRS"`) - the hour part of the interval.
- `"MINUTE"` (`"M"`, `"MIN"`, `"MINS"`, `"MINUTES"`) - the minute part of the interval.
- `"SECOND"` (`"S"`, `"SEC"`, `"SECONDS"`, `"SECS"`) - the seconds part with fractional microsecond part.
- `"MILLISECONDS"` (`"MSEC"`, `"MSECS"`, `"MILLISECON"`, `"MSECONDS"`, `"MS"`) - the millisecond part of the interval with fractional microsecond part.
- `"MICROSECONDS"` (`"USEC"`, `"USECS"`, `"USECONDS"`, `"MICROSECON"`, `"US"`) - the total number of microseconds in the `second`, `millisecond` and `microsecond` parts of the given interval.
- `"EPOCH"` - the total number of seconds in the interval including the fractional part with microsecond precision. Here we assume 365.25 days per year (leap year every four years).

For example:
```sql
> SELECT date_part('days', interval 1 year 10 months 5 days);
 5
> SELECT date_part('seconds', interval 30 seconds 1 milliseconds 1 microseconds);
 30.001001
```

### Why are the changes needed?
To maintain feature parity with PostgreSQL (https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT)

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
- Added new test suite `IntervalExpressionsSuite`
- Add new test cases to `date_part.sql`

Closes #25981 from MaxGekk/extract-from-intervals.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-18 23:54:59 +08:00
Terry Kim 39af51dbc6 [SPARK-29014][SQL] DataSourceV2: Fix current/default catalog usage
### What changes were proposed in this pull request?
The handling of the catalog across plans should be as follows ([SPARK-29014](https://issues.apache.org/jira/browse/SPARK-29014)):
* The *current* catalog should be used when no catalog is specified
* The default catalog is the catalog *current* is initialized to
* If the *default* catalog is not set, then *current* catalog is the built-in Spark session catalog.

This PR addresses the issue where *current* catalog usage is not followed as describe above.

### Why are the changes needed?

It is a bug as described in the previous section.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?

Unit tests added.

Closes #26120 from imback82/cleanup_catalog.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-18 22:45:42 +08:00
Wenchen Fan 74351468de [SPARK-29482][SQL] ANALYZE TABLE should look up catalog/table like v2 commands
### What changes were proposed in this pull request?

Add `AnalyzeTableStatement` and `AnalyzeColumnStatement`, and make ANALYZE TABLE go through the same catalog/table resolution framework of v2 commands.

### Why are the changes needed?

It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.
```
USE my_catalog
DESC t // success and describe the table t from my_catalog
ANALYZE TABLE t // report table not found as there is no table t in the session catalog
```

### Does this PR introduce any user-facing change?

Yes. When running ANALYZE TABLE, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specifies a v2 catalog.

### How was this patch tested?

new tests

Closes #26129 from cloud-fan/analyze-table.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2019-10-18 12:55:49 +02:00
stczwd 78b0cbe265 [SPARK-29444] Add configuration to support JacksonGenrator to keep fields with null values
### Why are the changes needed?
As mentioned in the JIRA, sometimes we need to be able to retain null columns when writing JSON.
For example, sparkmagic (widely used in Jupyter with Livy) generates SQL query results based on `Dataset.toJSON` and parses the JSON into a pandas DataFrame for display. If there is a null column, it is easy to end up with a missing column or even an empty query result. The loss of the null column in the first row may cause parsing exceptions or loss of the entire column's data.

### Does this PR introduce any user-facing change?
Example in spark-shell:
```scala
scala> spark.sql("select null as a, 1 as b").toJSON.collect.foreach(println)
{"b":1}

scala> spark.sql("set spark.sql.jsonGenerator.struct.ignore.null=false")
res2: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> spark.sql("select null as a, 1 as b").toJSON.collect.foreach(println)
{"a":null,"b":1}
```

### How was this patch tested?
Add new test to JacksonGeneratorSuite

Closes #26098 from stczwd/json.

Lead-authored-by: stczwd <qcsd2011@163.com>
Co-authored-by: Jackey Lee <qcsd2011@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-18 16:06:54 +08:00
Yuanjian Li 8616109061 [SPARK-9853][CORE][FOLLOW-UP] Regularize all the shuffle configurations related to adaptive execution
### What changes were proposed in this pull request?
1. Regularize all the shuffle configurations related to adaptive execution.
2. Add default value for `BlockStoreShuffleReader.shouldBatchFetch`.

### Why are the changes needed?
It's a follow-up PR for #26040.
Regularize the existing `spark.sql.adaptive.shuffle` namespace in SQLConf.

### Does this PR introduce any user-facing change?
Rename one released user config `spark.sql.adaptive.minNumPostShufflePartitions` to `spark.sql.adaptive.shuffle.minNumPostShufflePartitions`, other changed configs is not released yet.

### How was this patch tested?
Existing UT.

Closes #26147 from xuanyuanking/SPARK-9853.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-18 15:39:35 +08:00
Jiajia Li dc0bc7a6eb [MINOR][DOCS] Fix some typos
### What changes were proposed in this pull request?

This PR proposes a few typos:
1. Sparks => Spark's
2. parallize => parallelize
3. doesnt => doesn't

Closes #26140 from plusplusjiajia/fix-typos.

Authored-by: Jiajia Li <jiajia.li@intel.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-17 07:22:01 -07:00
Kent Yao 4b902d3b45 [SPARK-29491][SQL] Add bit_count function support
### What changes were proposed in this pull request?

BIT_COUNT(N) - Returns the number of bits that are set in the argument N as an unsigned 64-bit integer, or NULL if the argument is NULL
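
A hedged usage example with an illustrative value (13 is 0b1101, so three bits are set); assumes an active SparkSession `spark`:

```scala
spark.sql("SELECT bit_count(13)").show()
// expected result: 3
```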

### Why are the changes needed?

Supported by MySQL, Microsoft SQL Server, etc.

### Does this PR introduce any user-facing change?

add a built-in function
### How was this patch tested?

add uts

Closes #26139 from yaooqinn/SPARK-29491.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-17 20:22:38 +08:00
Yuanjian Li 239ee3f561 [SPARK-9853][CORE] Optimize shuffle fetch of continuous partition IDs
This PR takes over #19788. After we split the shuffle fetch protocol from `OpenBlock` in #24565, this optimization can be extended in the new shuffle protocol. Credit to yucai, closes #19788.

### What changes were proposed in this pull request?
This PR adds the support for continuous shuffle block fetching in batch:

- Shuffle client changes:
    - Add new feature tag `spark.shuffle.fetchContinuousBlocksInBatch`, implement the decision logic in `BlockStoreShuffleReader`.
    - Merge the continuous shuffle block ids in batch if needed in ShuffleBlockFetcherIterator.
- Shuffle server changes:
    - Add support in `ExternalBlockHandler` for the external shuffle service side.
    - Make `ShuffleBlockResolver.getBlockData` accept getting block data by range.
- Protocol changes:
    - Add new block id type `ShuffleBlockBatchId` represent continuous shuffle block ids.
    - Extend `FetchShuffleBlocks` and `OneForOneBlockFetcher`.
    - After the new shuffle fetch protocol completed in #24565, the backward compatibility for external shuffle service can be controlled by `spark.shuffle.useOldFetchProtocol`.

### Why are the changes needed?
In adaptive execution, one reducer may fetch multiple continuous shuffle blocks from one map output file. However, with the original approach, each reducer needs to fetch those blocks one by one (10 blocks in the example below). This requires many IOs and hurts performance. This PR adds support for fetching those continuous shuffle blocks in one IO (batch mode). See the example below:

The shuffle block is stored like below:
![image](https://user-images.githubusercontent.com/2989575/51654634-c37fbd80-1fd3-11e9-935e-5652863676c3.png)
The ShuffleId format is s"shuffle_$shuffleId_$mapId_$reduceId", referring to BlockId.scala.

In adaptive execution, one reducer may want to read output for reducer 5 to 14, whose block Ids are from shuffle_0_x_5 to shuffle_0_x_14.
Before this PR, Spark needs 10 disk IOs + 10 network IOs for each output file.
After this PR, Spark only needs 1 disk IO and 1 network IO. This reduces IO dramatically.
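
A hedged configuration sketch. The feature-tag name is taken from this description and may have been regularized into the `spark.sql.adaptive.shuffle` namespace by the follow-up commit listed earlier, so treat the exact key as illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("batch-shuffle-fetch-demo")
  .config("spark.sql.adaptive.enabled", "true")                    // batching targets adaptive execution
  .config("spark.shuffle.fetchContinuousBlocksInBatch", "true")    // feature tag named in this PR
  .getOrCreate()
```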

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Add new UT.
Integrate test with setting `spark.sql.adaptive.enabled=true`.

Closes #26040 from xuanyuanking/SPARK-9853.

Lead-authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Co-authored-by: yucai <yyu1@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-17 14:47:56 +08:00
Kent Yao 6d4cc7b855 [SPARK-27880][SQL] Add bool_and for every and bool_or for any as function aliases
### What changes were proposed in this pull request?

```
bool_or(x) <=> any/some(x) <=> max(x)
bool_and(x) <=> every(x) <=> min(x)
Args:
  x: boolean
```
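
A hedged usage example with illustrative values (assumes an active SparkSession `spark`):

```scala
spark.sql("SELECT bool_and(v), bool_or(v) FROM VALUES (true), (false), (true) AS t(v)").show()
// bool_and -> false (not every value is true); bool_or -> true (at least one value is true)
```
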
### Why are the changes needed?

PostgreSQL, Presto, Vertica, etc. also support this feature.
### Does this PR introduce any user-facing change?

add new functions support

### How was this patch tested?

add ut

Closes #26126 from yaooqinn/SPARK-27880.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-16 22:43:47 +08:00
Maxim Gekk d11cbf2e36 [SPARK-29364][SQL] Return an interval from date subtract according to SQL standard
### What changes were proposed in this pull request?
Proposed a new expression `SubtractDates` which is used for `date1` - `date2`. It has the `INTERVAL` type, and returns the interval between `date1` (inclusive) and `date2` (exclusive). For example:
```sql
> select date'tomorrow' - date'yesterday';
interval 2 days
```

Closes #26034

### Why are the changes needed?
- To conform the SQL standard which states the result type of `date operand 1` - `date operand 2` must be the interval type. See [4.5.3  Operations involving datetimes and intervals](http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt).
- Improve Spark SQL UX and allow mixing date and timestamp in subtractions. For example: `select timestamp'now' + (date'2019-10-01' - date'2019-09-15')`

### Does this PR introduce any user-facing change?
Before the query below returns number of days:
```sql
spark-sql> select date'2019-10-05' - date'2018-09-01';
399
```
After it returns an interval:
```sql
spark-sql> select date'2019-10-05' - date'2018-09-01';
interval 1 years 1 months 4 days
```

### How was this patch tested?
- by new tests in `DateExpressionsSuite` and `TypeCoercionSuite`.
- by existing tests in `date.sql`

Closes #26112 from MaxGekk/date-subtract.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
2019-10-16 06:26:01 -07:00
Jose Torres 5a482e7209 [SPARK-29468][SQL] Change Literal.sql to be correct for floats
### What changes were proposed in this pull request?
Change Literal.sql to output CAST('fpValue' AS FLOAT) instead of CAST(fpValue AS FLOAT) as the SQL for a floating point literal.

### Why are the changes needed?
The old version doesn't work for very small floating point numbers; the value will fail to parse if it doesn't fit in a DECIMAL(38).

This doesn't apply to doubles because they have special literal syntax.
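
A purely illustrative before/after of the generated SQL text described above (the value shown is hypothetical):

```scala
val before = "CAST(1.0E-20 AS FLOAT)"    // numeric literal: may fail to re-parse if it doesn't fit in DECIMAL(38)
val after  = "CAST('1.0E-20' AS FLOAT)"  // string literal: always re-parses, then casts back to FLOAT
```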

### Does this PR introduce any user-facing change?
Not really.

### How was this patch tested?
New unit tests.

Closes #26114 from jose-torres/fpliteral.

Authored-by: Jose Torres <joseph.torres@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-16 21:06:13 +08:00
Jeff Evans 95de93b24e [SPARK-24540][SQL] Support for multiple character delimiter in Spark CSV read
Updating univocity-parsers version to 2.8.3, which adds support for multiple character delimiters

Moving univocity-parsers version to spark-parent pom dependencyManagement section

Adding new utility method to build multi-char delimiter string, which delegates to existing one

Adding tests for multiple character delimited CSV

### What changes were proposed in this pull request?

Adds support for parsing CSV data using multiple-character delimiters.  Existing logic for converting the input delimiter string to characters was kept and invoked in a loop.  Project dependencies were updated to remove redundant declaration of `univocity-parsers` version, and also to change that version to the latest.

### Why are the changes needed?

It is quite common for people to have delimited data, where the delimiter is not a single character, but rather a sequence of characters.  Currently, it is difficult to handle such data in Spark (typically needs pre-processing).

### Does this PR introduce any user-facing change?

Yes. Specifying the "delimiter" option for the DataFrame read, and providing more than one character, will no longer result in an exception.  Instead, it will be converted as before and passed to the underlying library (Univocity), which has accepted multiple character delimiters since 2.8.0.
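
A hedged usage sketch (the path, header setting, and delimiter are hypothetical; assumes an active SparkSession `spark`):

```scala
val df = spark.read
  .option("header", "true")
  .option("delimiter", "||")    // more than one character is now passed through to Univocity
  .csv("/path/to/multi-delimited.csv")
```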

### How was this patch tested?

The `CSVSuite` tests were confirmed passing (including new methods), and `sbt` tests for `sql` were executed.

Closes #26027 from jeff303/SPARK-24540.

Authored-by: Jeff Evans <jeffrey.wayne.evans@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-15 15:44:51 -05:00
Gengliang Wang 322ec0ba9b [SPARK-28885][SQL] Follow ANSI store assignment rules in table insertion by default
### What changes were proposed in this pull request?

When inserting a value into a column with the different data type, Spark performs type coercion. Currently, we support 3 policies for the store assignment rules: ANSI, legacy and strict, which can be set via the option "spark.sql.storeAssignmentPolicy":
1. ANSI: Spark performs the type coercion as per ANSI SQL. In practice, the behavior is mostly the same as PostgreSQL. It disallows certain unreasonable type conversions such as converting `string` to `int` and `double` to `boolean`. It will throw a runtime exception if the value is out-of-range (overflow).
2. Legacy: Spark allows the type coercion as long as it is a valid `Cast`, which is very loose. E.g., converting either `string` to `int` or `double` to `boolean` is allowed. It is the current behavior in Spark 2.x for compatibility with Hive. When inserting an out-of-range value into an integral field, the low-order bits of the value are inserted (the same as Java/Scala numeric type casting). For example, if 257 is inserted into a field of Byte type, the result is 1.
3. Strict: Spark doesn't allow any possible precision loss or data truncation in store assignment, e.g., converting either `double` to `int` or `decimal` to `double` is not allowed. The rules were originally intended for the Dataset encoder. As far as I know, no mainstream DBMS uses this policy by default.

Currently, the V1 data source uses "Legacy" policy by default, while V2 uses "Strict". This proposal is to use "ANSI" policy by default for both V1 and V2 in Spark 3.0.
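
A hedged sketch of switching the policy explicitly (the config key is from this description; the policy values are assumed to be the upper-cased names listed above; assumes an active SparkSession `spark`):

```scala
spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI")   // or "LEGACY" / "STRICT"
spark.sql("CREATE TABLE t(i INT) USING parquet")
spark.sql("INSERT INTO t VALUES ('1')")   // string -> int: rejected under ANSI, allowed under LEGACY
```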

### Why are the changes needed?

Following the ANSI SQL standard is most reasonable among the 3 policies.

### Does this PR introduce any user-facing change?

Yes.
The default store assignment policy is ANSI for both V1 and V2 data sources.

### How was this patch tested?

Unit test

Closes #26107 from gengliangwang/ansiPolicyAsDefault.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-15 10:41:37 -07:00
jiake 9ac4b2dbc5 [SPARK-28560][SQL] Optimize shuffle reader to local shuffle reader when smj converted to bhj in adaptive execution
## What changes were proposed in this pull request?
Implement a rule in the new adaptive execution framework introduced in [SPARK-23128](https://issues.apache.org/jira/browse/SPARK-23128). This rule is used to optimize the shuffle reader to local shuffle reader when smj is converted to bhj in adaptive execution.

## How was this patch tested?
Existing tests

Closes #25295 from JkSelf/localShuffleOptimization.

Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-15 21:51:15 +08:00
Wenchen Fan 8915966bf4 [SPARK-29473][SQL] move statement logical plans to a new file
### What changes were proposed in this pull request?

move the statement logical plans that were created for v2 commands to a new file `statements.scala`, under the same package of `v2Commands.scala`.

This PR also includes some minor cleanups:
1. remove `private[sql]` from `ParsedStatement` as it's in the private package.
2. remove unnecessary override of `output` and `children`.
3. add missing classdoc.

### Why are the changes needed?

Similar to https://github.com/apache/spark/pull/26111 , this is to better organize the logical plans of data source v2.

It's a bit weird to put the statements in the package `org.apache.spark.sql.catalyst.plans.logical.sql` as `sql` is not a good sub-package name in Spark SQL.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

existing tests

Closes #26125 from cloud-fan/statement.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-15 15:05:49 +02:00
Wenchen Fan 9407fba037 [SPARK-29412][SQL] refine the document of v2 session catalog config
### What changes were proposed in this pull request?

Refine the document of v2 session catalog config, to clearly explain what it is, when it should be used and how to implement it.

### Why are the changes needed?

Make this config more understandable

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Pass the Jenkins with the newly updated test cases.

Closes #26071 from cloud-fan/config.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-15 10:18:58 +08:00
herman 1f1443ebb2 [SPARK-29347][SQL] Add JSON serialization for external Rows
### What changes were proposed in this pull request?
This PR adds JSON serialization for Spark external Rows.

### Why are the changes needed?
This is to be used for observable metrics where the `StreamingQueryProgress` contains a map of observed metrics rows which needs to be serialized in some cases.

### Does this PR introduce any user-facing change?
Yes, a user can call `toJson` on rows returned when collecting a DataFrame to the driver.

### How was this patch tested?
Added a new test suite: `RowJsonSuite` that should test this.

Closes #26013 from hvanhovell/SPARK-29347.

Authored-by: herman <herman@databricks.com>
Signed-off-by: herman <herman@databricks.com>
2019-10-15 00:24:17 +02:00
Wenchen Fan bfa09cf049 [SPARK-29463][SQL] move v2 commands to a new file
### What changes were proposed in this pull request?

move the v2 command logical plans from `basicLogicalOperators.scala` to a new file `v2Commands.scala`

### Why are the changes needed?

As we keep adding v2 commands, the `basicLogicalOperators.scala` grows bigger and bigger. It's better to have a separated file for them.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

not needed

Closes #26111 from cloud-fan/command.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-14 14:09:21 -07:00
Maxim Gekk da576a737c [SPARK-29369][SQL] Support string intervals without the interval prefix
### What changes were proposed in this pull request?
In the PR, I propose to move interval parsing to `CalendarInterval.fromCaseInsensitiveString()` which throws an `IllegalArgumentException` for invalid strings, and reuse it from `CalendarInterval.fromString()`. The former one handles `IllegalArgumentException` only and returns `NULL` for invalid interval strings. This will allow to support interval strings without the `interval` prefix in casting strings to intervals and in interval type constructor because they use `fromString()` for parsing string intervals.

For example:
```sql
spark-sql> select cast('1 year 10 days' as interval);
interval 1 years 1 weeks 3 days
spark-sql> SELECT INTERVAL '1 YEAR 10 DAYS';
interval 1 years 1 weeks 3 days
```

### Why are the changes needed?
To maintain feature parity with PostgreSQL which supports interval strings without prefix:
```sql
# select interval '2 months 1 microsecond';
        interval
------------------------
 2 mons 00:00:00.000001
```
and to improve Spark SQL UX.

### Does this PR introduce any user-facing change?
Yes, previously parsing of interval strings without `interval` gives `NULL`:
```sql
spark-sql> select interval '2 months 1 microsecond';
NULL
```
After:
```sql
spark-sql> select interval '2 months 1 microsecond';
interval 2 months 1 microseconds
```

### How was this patch tested?
- Added new tests to `CalendarIntervalSuite.java`
- A test for casting strings to intervals in `CastSuite`
- Test for interval type constructor from strings in `ExpressionParserSuite`

Closes #26079 from MaxGekk/interval-str-without-prefix.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-14 23:34:18 +08:00
Terry Kim ef6dce29b2 [SPARK-29279][SQL] Merge SHOW NAMESPACES and SHOW DATABASES code path
### What changes were proposed in this pull request?
Currently,  `SHOW NAMESPACES` and `SHOW DATABASES` are separate code paths. This PR merges two implementations.

### Why are the changes needed?
To remove code/behavior duplication

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Added new unit tests.

Closes #26006 from imback82/combine_show.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-14 22:35:26 +08:00
Maxim Gekk f302c2ee62 [SPARK-29328][SQL][FOLLOWUP] Revert calculation of mean seconds per month
### What changes were proposed in this pull request?

Revert this commit 18b7ad2fc5.

### Why are the changes needed?
See https://github.com/apache/spark/pull/16304#discussion_r92753590

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
There is no test for that.

Closes #26101 from MaxGekk/revert-mean-seconds-per-month.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-12 09:38:08 -05:00
Gengliang Wang 6edabeb0ee [SPARK-28989][SQL][FOLLOWUP] Update ANSI mode related config names in comments
### What changes were proposed in this pull request?

Update ANSI mode related config names in comments as "spark.sql.ansi.enabled"

### Why are the changes needed?

The removed configurations `spark.sql.parser.ansi.enabled` and `spark.sql.failOnIntegralTypeOverflow` still exist in code comments.

### Does this PR introduce any user-facing change?

No
### How was this patch tested?

Grep the whole code base to ensure the removed config names no longer exist.
```
git grep "parser.ansi.enabled"
git grep failOnIntegralTypeOverflow
git grep decimalOperationsNullOnOverflow
git grep ANSI_SQL_PARSER
git grep FAIL_ON_INTEGRAL_TYPE_OVERFLOW
git grep DECIMAL_OPERATIONS_NULL_ON_OVERFLOW
```

Closes #26067 from gengliangwang/spark-28989-followup.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-09 19:17:54 -07:00
Sean Owen cc7493fa21 [SPARK-29416][CORE][ML][SQL][MESOS][TESTS] Use .sameElements to compare arrays, instead of .deep (gone in 2.13)
### What changes were proposed in this pull request?

Use `.sameElements` to compare (non-nested) arrays, as `Arrays.deep` is removed in 2.13 and wasn't the best way to do this in the first place.
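
An illustrative one-liner for the replacement on flat arrays:

```scala
val a = Array(1, 2, 3)
val b = Array(1, 2, 3)
assert(a.sameElements(b))   // replaces a.deep == b.deep, which no longer exists in Scala 2.13
```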

### Why are the changes needed?

To compile with 2.13.

### Does this PR introduce any user-facing change?

None.

### How was this patch tested?

Existing tests.

Closes #26073 from srowen/SPARK-29416.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-09 17:00:48 -07:00
Sean Owen fa95a5c395 [SPARK-29411][CORE][ML][SQL][DSTREAM] Replace use of Unit object with () for Scala 2.13
### What changes were proposed in this pull request?

Replace `Unit` with equivalent `()` where code refers to the `Unit` companion object.

### Why are the changes needed?

It doesn't compile otherwise in Scala 2.13.
- https://github.com/scala/scala/blob/v2.13.0/src/library/scala/Unit.scala#L30

### Does this PR introduce any user-facing change?

Should be no behavior change at all.

### How was this patch tested?

Existing tests.

Closes #26070 from srowen/SPARK-29411.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-09 10:24:13 -07:00
herman ba4d413fc9 [SPARK-29346][SQL] Add Aggregating Accumulator
### What changes were proposed in this pull request?
This PR adds an accumulator that computes a global aggregate over a number of rows. A user can define an arbitrary number of aggregate functions which can be computed at the same time.

The accumulator uses the standard technique for implementing (interpreted) aggregation in Spark. It uses projections and manual updates for each of the aggregation steps (initialize buffer, update buffer with new input row, merge two buffers and compute the final result on the buffer). Note that two of the steps (update and merge) use the aggregation buffer both as input and output.

Accumulators do not have an explicit point at which they get serialized. A somewhat surprising side effect is that the buffers of a `TypedImperativeAggregate` go over the wire as-is instead of being serialized first. The merging logic for `TypedImperativeAggregate` assumes that the input buffer contains serialized buffers; the accumulator's implicit serialization violates this assumption. To get around this, I have added a `mergeBuffersObjects` method to `TypedImperativeAggregate` that merges two unserialized buffers.
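
To make the buffer lifecycle concrete, here is a deliberately simplified sketch built on Spark's public `AccumulatorV2` API. It is not the `AggregatingAccumulator` added by this PR, only an illustration of the same initialize/update/merge/finalize steps over a (sum, count) buffer:

```scala
import org.apache.spark.util.AccumulatorV2

// Toy accumulator whose "aggregation buffer" is a (sum, count) pair.
class MeanAccumulator extends AccumulatorV2[Double, Double] {
  private var sum = 0.0    // buffer field 1
  private var count = 0L   // buffer field 2

  override def isZero: Boolean = sum == 0.0 && count == 0L
  override def copy(): MeanAccumulator = {
    val acc = new MeanAccumulator
    acc.sum = sum
    acc.count = count
    acc
  }
  override def reset(): Unit = { sum = 0.0; count = 0L }          // initialize buffer
  override def add(v: Double): Unit = { sum += v; count += 1L }   // update with new input
  override def merge(other: AccumulatorV2[Double, Double]): Unit = other match {
    case o: MeanAccumulator => sum += o.sum; count += o.count     // merge two buffers
    case _ => throw new UnsupportedOperationException(
      s"Cannot merge ${getClass.getName} with ${other.getClass.getName}")
  }
  override def value: Double = if (count == 0L) Double.NaN else sum / count  // final result
}
```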

### Why are the changes needed?
This is the mechanism we are going to use to implement observable metrics.

### Does this PR introduce any user-facing change?
No, not yet.

### How was this patch tested?
Added `AggregatingAccumulator` test suite.

Closes #26012 from hvanhovell/SPARK-29346.

Authored-by: herman <herman@databricks.com>
Signed-off-by: herman <herman@databricks.com>
2019-10-09 16:05:14 +02:00
Sean Owen 2d871ad0e7 [SPARK-29392][CORE][SQL][STREAMING] Remove symbol literal syntax 'foo, deprecated in Scala 2.13, in favor of Symbol("foo")
### What changes were proposed in this pull request?

Syntax like `'foo` is deprecated in Scala 2.13. Replace usages with `Symbol("foo")`.
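
A hedged example of the substitution (the name `colName` is invented; in Spark tests these literals often appear as DataFrame column references such as `'value`):

```scala
object SymbolLiteralExample {
  def main(args: Array[String]): Unit = {
    // Before (deprecated in Scala 2.13): val colName = 'value
    // After:
    val colName = Symbol("value")
    println(colName.name) // prints "value"
  }
}
```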

### Why are the changes needed?

Avoids ~50 deprecation warnings when attempting to build with 2.13.

### Does this PR introduce any user-facing change?

None, should be no functional change at all.

### How was this patch tested?

Existing tests.

Closes #26061 from srowen/SPARK-29392.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-08 20:15:37 -07:00
gwang3 b3eba29493 [SPARK-29189][FOLLOW-UP][SQL] Beautify config name
### What changes were proposed in this pull request?
Beautify comment

### Why are the changes needed?
The current config name is awkward.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
No test.

Closes #26054 from wangshisan/SPARK-29189.

Authored-by: gwang3 <gwang3@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-08 15:44:42 -07:00
Guilherme de360e96d7 [SPARK-29336][SQL] Fix the implementation of QuantileSummaries.merge (guarantee that the relativeError will be respected)
### What changes were proposed in this pull request?

Reimplement `org.apache.spark.sql.catalyst.util.QuantileSummaries#merge` and add a test case demonstrating the previous bug.

### Why are the changes needed?

The original Greenwald-Khanna paper, from which the algorithm behind `approxQuantile` was taken, does not cover how to merge the results of multiple parallel QuantileSummaries. The current implementation violates some of the algorithm's invariants, and therefore the effective error can be larger than the specified relative error.

### Does this PR introduce any user-facing change?

Yes, in some cases the results from `approxQuantile` (`percentile_approx` in SQL) will now be within the expected error margin. For example:

```scala
import spark.implicits._  // provides .toDF on the RDD

val values = (1 to 100).toArray
val all_quantiles = values.indices.map(i => (i + 1).toDouble / values.length).toArray
for (n <- 0 until 5) {
  val df = spark.sparkContext.makeRDD(values).toDF("value").repartition(5)
  val all_answers = df.stat.approxQuantile("value", all_quantiles, 0.1)
  val all_answered_ranks = all_answers.map(ans => values.indexOf(ans)).toArray
  val error = all_answered_ranks.zipWithIndex.map { case (answer, expected) => Math.abs(expected - answer) }
  val max_error = error.max
  println(max_error)
}
```

In the current build it returns:

```
16
12
10
11
17
```

I couldn't run the code with this patch applied to double-check the implementation. Could someone confirm that it now outputs at most `10`?

### How was this patch tested?

A new unit test was added to uncover the previous bug.

Closes #26029 from sitegui/SPARK-29336.

Authored-by: Guilherme <sitegui@sitegui.com.br>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-08 08:11:10 -05:00