### What changes were proposed in this pull request?
In the PR, I propose to changed `CalendarInterval.toString`:
- to skip the `week` unit
- to convert `milliseconds` and `microseconds` as the fractional part of the `seconds` unit.
### Why are the changes needed?
To improve readability.
### Does this PR introduce any user-facing change?
Yes
### How was this patch tested?
- By `CalendarIntervalSuite` and `IntervalUtilsSuite`
- `literals.sql`, `datetime.sql` and `interval.sql`
Closes#26367 from MaxGekk/interval-to-string-format.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This is somewhat a complement of https://github.com/apache/spark/pull/21853.
The `Sort` without `Limit` operator in `Join` subquery is useless, it's the same case in `GroupBy` when the aggregation function is order irrelevant, such as `count`, `sum`.
This PR try to remove this kind of `Sort` operator in `SQL Optimizer`.
### Why are the changes needed?
For example, `select count(1) from (select a from test1 order by a)` is equal to `select count(1) from (select a from test1)`.
'select * from (select a from test1 order by a) t1 join (select b from test2) t2 on t1.a = t2.b' is equal to `select * from (select a from test1) t1 join (select b from test2) t2 on t1.a = t2.b`.
Remove useless `Sort` operator can improve performance.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Adding new UT `RemoveSortInSubquerySuite.scala`
Closes#26011 from WangGuangxin/remove_sorts.
Authored-by: wangguangxin.cn <wangguangxin.cn@bytedance.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Support non-reversed keywords to be used in high order functions.
### Why are the changes needed?
the keywords are non-reversed.
### Does this PR introduce any user-facing change?
yes, all non-reversed keywords can be used in high order function correctly
### How was this patch tested?
add uts
Closes#26366 from yaooqinn/SPARK-29722.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
The `assertEquals` method of JUnit Assert requires the first parameter to be the expected value. In this PR, I propose to change the order of parameters when the expected value is passed as the second parameter.
### Why are the changes needed?
Wrong order of assert parameters confuses when the assert fails and the parameters have special string representation. For example:
```java
assertEquals(input1.add(input2), new CalendarInterval(5, 5, 367200000000L));
```
```
java.lang.AssertionError:
Expected :interval 5 months 5 days 101 hours
Actual :interval 5 months 5 days 102 hours
```
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By existing tests.
Closes#26377 from MaxGekk/fix-order-in-assert-equals.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
re-arrange the parser rules to make it clear that multiple unit TO unit statement like `SELECT INTERVAL '1-1' YEAR TO MONTH '2-2' YEAR TO MONTH` is not allowed.
### Why are the changes needed?
This is definitely an accident that we support such a weird syntax in the past. It's not supported by any other DBs and I can't think of any use case of it. Also no test covers this syntax in the current codebase.
### Does this PR introduce any user-facing change?
Yes, and a migration guide item is added.
### How was this patch tested?
new tests.
Closes#26285 from cloud-fan/syntax.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
###What changes were proposed in this pull request?
Add AlterTableDropPartitionStatement and make ALTER TABLE/VIEW ... DROP PARTITION go through the same catalog/table resolution framework of v2 commands.
### Why are the changes needed?
It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.
```
USE my_catalog
DESC t // success and describe the table t from my_catalog
ALTER TABLE t DROP PARTITION (id=1) // report table not found as there is no table t in the session catalog
```
### Does this PR introduce any user-facing change?
Yes. When running ALTER TABLE/VIEW ... DROP PARTITION, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog.
### How was this patch tested?
Unit tests.
Closes#26303 from huaxingao/spark-29643.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Current CalendarInterval has 2 fields: months and microseconds. This PR try to change it
to 3 fields: months, days and microseconds. This is because one logical day interval may
have different number of microseconds (daylight saving).
### Why are the changes needed?
One logical day interval may have different number of microseconds (daylight saving).
For example, in PST timezone, there will be 25 hours from 2019-11-2 12:00:00 to
2019-11-3 12:00:00
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
unit test and new added test cases
Closes#26134 from LinhongLiu/calendarinterval.
Authored-by: Liu,Linhong <liulinhong@baidu.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add AlterTableRenamePartitionStatement and make ALTER TABLE ... RENAME TO PARTITION go through the same catalog/table resolution framework of v2 commands.
### Why are the changes needed?
It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.
```
USE my_catalog
DESC t // success and describe the table t from my_catalog
ALTER TABLE t PARTITION (id=1) RENAME TO PARTITION (id=2) // report table not found as there is no table t in the session catalog
```
### Does this PR introduce any user-facing change?
Yes. When running ALTER TABLE ... RENAME TO PARTITION, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog.
### How was this patch tested?
Unit tests.
Closes#26350 from huaxingao/spark_29676.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
Bring back https://github.com/apache/spark/pull/25955
### What changes were proposed in this pull request?
This adds a new rule, `V2ScanRelationPushDown`, to push filters and projections in to a new `DataSourceV2ScanRelation` in the optimizer. That scan is then used when converting to a physical scan node. The new relation correctly reports stats based on the scan.
To run scan pushdown before rules where stats are used, this adds a new optimizer override, `earlyScanPushDownRules` and a batch for early pushdown in the optimizer, before cost-based join reordering. The other early pushdown rule, `PruneFileSourcePartitions`, is moved into the early pushdown rule set.
This also moves pushdown helper methods from `DataSourceV2Strategy` into a util class.
### Why are the changes needed?
This is needed for DSv2 sources to supply stats for cost-based rules in the optimizer.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
This updates the implementation of stats from `DataSourceV2Relation` so tests will fail if stats are accessed before early pushdown for v2 relations.
Closes#26341 from cloud-fan/back.
Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
- Added `getDuration()` to calculate interval duration in specified time units assuming provided days per months
- Added `isNegative()` which return `true` is the interval duration is less than 0
- Fix checking negative intervals by using `isNegative()` in structured streaming classes
- Fix checking of `year-months` intervals
### Why are the changes needed?
This fixes incorrect checking of negative intervals. An interval is negative when its duration is negative but not if interval's months **or** microseconds is negative. Also this fixes checking of `year-month` interval support because the `month` field could be negative.
### Does this PR introduce any user-facing change?
Should not
### How was this patch tested?
- Added tests for the `getDuration()` and `isNegative()` methods to `IntervalUtilsSuite`
- By existing SS tests
Closes#26177 from MaxGekk/interval-is-positive.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Update `AlterTableSetLocationStatement` to store `partitionSpec` and make `ALTER TABLE a.b.c PARTITION(...) SET LOCATION 'loc'` fail if `partitionSpec` is set with unsupported message.
### Why are the changes needed?
It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.
```
USE my_catalog
DESC t // success and describe the table t from my_catalog
ALTER TABLE t PARTITION(...) SET LOCATION 'loc' // report set location with partition spec is not supported.
```
### Does this PR introduce any user-facing change?
yes. When running ALTER TABLE (set partition location), Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog.
### How was this patch tested?
New unit tests
Closes#26304 from imback82/alter_table_partition_loc.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add ShowColumnsStatement and make SHOW COLUMNS go through the same catalog/table resolution framework of v2 commands.
### Why are the changes needed?
It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.
USE my_catalog
DESC t // success and describe the table t from my_catalog
SHOW COLUMNS FROM t // report table not found as there is no table t in the session catalog
### Does this PR introduce any user-facing change?
yes. When running SHOW COLUMNS Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog.
### How was this patch tested?
Unit tests.
Closes#26182 from planga82/feature/SPARK-29523_SHOW_COLUMNS_datasourceV2.
Authored-by: Unknown <soypab@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose to extract parsing of the seconds interval units to the private method `parseNanos` in `IntervalUtils` and modify the code to correctly parse the fractional part of the seconds unit of intervals in the cases:
- When the fractional part has less than 9 digits
- The seconds unit is negative
### Why are the changes needed?
The changes are needed to fix the issues:
```sql
spark-sql> select interval '10.123456 seconds';
interval 10 seconds 123 microseconds
```
The correct result must be `interval 10 seconds 123 milliseconds 456 microseconds`
```sql
spark-sql> select interval '-10.123456789 seconds';
interval -9 seconds -876 milliseconds -544 microseconds
```
but the whole interval should be negated, and the result must be `interval -10 seconds -123 milliseconds -456 microseconds`, taking into account the truncation to microseconds.
### Does this PR introduce any user-facing change?
Yes. After changes:
```sql
spark-sql> select interval '10.123456 seconds';
interval 10 seconds 123 milliseconds 456 microseconds
spark-sql> select interval '-10.123456789 seconds';
interval -10 seconds -123 milliseconds -456 microseconds
```
### How was this patch tested?
By existing and new tests in `ExpressionParserSuite`.
Closes#26313 from MaxGekk/fix-interval-nanos-parsing.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This adds a new rule, `V2ScanRelationPushDown`, to push filters and projections in to a new `DataSourceV2ScanRelation` in the optimizer. That scan is then used when converting to a physical scan node. The new relation correctly reports stats based on the scan.
To run scan pushdown before rules where stats are used, this adds a new optimizer override, `earlyScanPushDownRules` and a batch for early pushdown in the optimizer, before cost-based join reordering. The other early pushdown rule, `PruneFileSourcePartitions`, is moved into the early pushdown rule set.
This also moves pushdown helper methods from `DataSourceV2Strategy` into a util class.
### Why are the changes needed?
This is needed for DSv2 sources to supply stats for cost-based rules in the optimizer.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
This updates the implementation of stats from `DataSourceV2Relation` so tests will fail if stats are accessed before early pushdown for v2 relations.
Closes#25955 from rdblue/move-v2-pushdown.
Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Ryan Blue <blue@apache.org>
### What changes were proposed in this pull request?
MICROS_PER_MONTH = DAYS_PER_MONTH * MICROS_PER_DAY
### Why are the changes needed?
fix bug
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
add ut
Closes#26321 from yaooqinn/SPARK-29653.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
```
postgres=# select date '2001-09-28' + integer '7';
?column?
------------
2001-10-05
(1 row)postgres=# select integer '7';
int4
------
7
(1 row)
```
Add support for typed integer literal expression from postgreSQL.
### Why are the changes needed?
SPARK-27764 Feature Parity between PostgreSQL and Spark
### Does this PR introduce any user-facing change?
support typed integer lit in SQL
### How was this patch tested?
add uts
Closes#26291 from yaooqinn/SPARK-29629.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Now, `RepartitionByExpression` is allowed at Dataset method `Dataset.repartition()`. But in spark sql, we do not have an equivalent functionality.
In hive, we can use `distribute by`, so it's worth to add a hint to support such function.
Similar jira [SPARK-24940](https://issues.apache.org/jira/browse/SPARK-24940)
## Why are the changes needed?
Make repartition hints consistent with repartition api .
## Does this PR introduce any user-facing change?
This pr intends to support quries below;
```
// SQL cases
- sql("SELECT /*+ REPARTITION(c) */ * FROM t")
- sql("SELECT /*+ REPARTITION(1, c) */ * FROM t")
- sql("SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t")
- sql("SELECT /*+ REPARTITION_BY_RANGE(1, c) */ * FROM t")
```
## How was this patch tested?
UT
Closes#25464 from ulysses-you/SPARK-28746.
Lead-authored-by: ulysses <youxiduo@weidian.com>
Co-authored-by: ulysses <646303253@qq.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
In the PR, I propose to move all static methods from the `CalendarInterval` class to the `IntervalUtils` object. All those methods are rewritten from Java to Scala.
### Why are the changes needed?
- For consistency with other helper methods. Such methods were placed to the helper object `IntervalUtils`, see https://github.com/apache/spark/pull/26190
- Taking into account that `CalendarInterval` will be fully exposed to users in the future (see https://github.com/apache/spark/pull/25022), it would be nice to clean it up by moving service methods to an internal object.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
- By moved tests from `CalendarIntervalSuite` to `IntervalUtilsSuite`
- By existing test suites
Closes#26261 from MaxGekk/refactoring-calendar-interval.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add AlterTableRecoverPartitionsStatement and make ALTER TABLE ... RECOVER PARTITIONS go through the same catalog/table resolution framework of v2 commands.
### Why are the changes needed?
It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.
```
USE my_catalog
DESC t // success and describe the table t from my_catalog
ALTER TABLE t RECOVER PARTITIONS // report table not found as there is no table t in the session catalog
```
### Does this PR introduce any user-facing change?
Yes. When running ALTER TABLE ... RECOVER PARTITIONS Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog.
### How was this patch tested?
Unit tests.
Closes#26269 from huaxingao/spark-29612.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR adds `DROP NAMESPACE` support for V2 catalogs.
### Why are the changes needed?
Currently, you cannot drop namespaces for v2 catalogs.
### Does this PR introduce any user-facing change?
The user can now perform the following:
```SQL
CREATE NAMESPACE mycatalog.ns
DROP NAMESPACE mycatalog.ns
SHOW NAMESPACES IN mycatalog # Will show no namespaces
```
to drop a namespace `ns` inside `mycatalog` V2 catalog.
### How was this patch tested?
Added unit tests.
Closes#26262 from imback82/drop_namespace.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Add LoadDataStatement and make LOAD DATA INTO TABLE go through the same catalog/table resolution framework of v2 commands.
### Why are the changes needed?
It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.
```
USE my_catalog
DESC t // success and describe the table t from my_catalog
LOAD DATA INPATH 'filepath' INTO TABLE t // report table not found as there is no table t in the session catalog
```
### Does this PR introduce any user-facing change?
yes. When running LOAD DATA INTO TABLE, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog.
### How was this patch tested?
Unit tests.
Closes#26178 from viirya/SPARK-29521.
Lead-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This pr is to fix wrong code to check parameter lengths of split methods in `subexpressionEliminationForWholeStageCodegen`.
### Why are the changes needed?
Bug fix.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#26267 from maropu/SPARK-29008-FOLLOWUP.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
The `DateTimeUtilsSuite` and `TimestampFormatterSuite` assume constant time difference between `timestamp'yesterday'`, `timestamp'today'` and `timestamp'tomorrow'` which is wrong on daylight switching day - day length can be 23 or 25 hours. In the PR, I propose to use Java 8 time API to calculate instances of `yesterday` and `tomorrow` timestamps.
### Why are the changes needed?
The changes fix test failures and make the tests tolerant to daylight time switching.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By existing test suites `DateTimeUtilsSuite` and `TimestampFormatterSuite`.
Closes#26273 from MaxGekk/midnight-tolerant.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
```
hive> select version();
OK
3.1.1 rf4e0529634b6231a0072295da48af466cf2f10b7
Time taken: 2.113 seconds, Fetched: 1 row(s)
```
### Why are the changes needed?
From hive behavior and I guess it is useful for debugging and developing etc.
### Does this PR introduce any user-facing change?
add a misc func
### How was this patch tested?
add ut
Closes#26209 from yaooqinn/SPARK-29554.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This is a follow-up of https://github.com/apache/spark/pull/24557 to fix `since` version.
### Why are the changes needed?
This is found during 3.0.0-preview preparation.
The version will be exposed to our SQL document like the following. We had better fix this.
- https://spark.apache.org/docs/latest/api/sql/#array_min
### Does this PR introduce any user-facing change?
Yes. It's exposed at `DESC FUNCTION EXTENDED` SQL command and SQL doc, but this is new at 3.0.0.
### How was this patch tested?
Manual.
```
spark-sql> DESC FUNCTION EXTENDED min_by;
Function: min_by
Class: org.apache.spark.sql.catalyst.expressions.aggregate.MinBy
Usage: min_by(x, y) - Returns the value of `x` associated with the minimum value of `y`.
Extended Usage:
Examples:
> SELECT min_by(x, y) FROM VALUES (('a', 10)), (('b', 50)), (('c', 20)) AS tab(x, y);
a
Since: 3.0.0
```
Closes#26264 from dongjoon-hyun/SPARK-27653.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Add ShowCreateTableStatement and make SHOW CREATE TABLE go through the same catalog/table resolution framework of v2 commands.
### Why are the changes needed?
It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.
```
USE my_catalog
DESC t // success and describe the table t from my_catalog
SHOW CREATE TABLE t // report table not found as there is no table t in the session catalog
```
### Does this PR introduce any user-facing change?
yes. When running SHOW CREATE TABLE, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog.
### How was this patch tested?
Unit tests.
Closes#26184 from viirya/SPARK-29527.
Lead-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add UncacheTableStatement and make UNCACHE TABLE go through the same catalog/table resolution framework of v2 commands.
### Why are the changes needed?
It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.
```
USE my_catalog
DESC t // success and describe the table t from my_catalog
UNCACHE TABLE t // report table not found as there is no table t in the session catalog
```
### Does this PR introduce any user-facing change?
yes. When running UNCACHE TABLE, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog.
### How was this patch tested?
New unit tests
Closes#26237 from imback82/uncache_table.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
# What changes were proposed in this pull request?
Add description for ignoreNullFields, which is commited in #26098 , in DataFrameWriter and readwriter.py.
Enable user to use ignoreNullFields in pyspark.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
run unit tests
Closes#26227 from stczwd/json-generator-doc.
Authored-by: stczwd <qcsd2011@163.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Only use antlr4 to parse the interval string, and remove the duplicated parsing logic from `CalendarInterval`.
### Why are the changes needed?
Simplify the code and fix inconsistent behaviors.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Pass the Jenkins with the updated test cases.
Closes#26190 from cloud-fan/parser.
Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Recent timezone definition changes in very new JDK 8 (and beyond) releases cause test failures. The below was observed on JDK 1.8.0_232. As before, the easy fix is to allow for these inconsequential variations in test results due to differing definition of timezones.
### Why are the changes needed?
Keeps test passing on the latest JDK releases.
### Does this PR introduce any user-facing change?
None
### How was this patch tested?
Existing tests
Closes#26236 from srowen/SPARK-29578.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
Support SparkSQL use iN/EXISTS with subquery in JOIN condition.
### Why are the changes needed?
Support SQL use iN/EXISTS with subquery in JOIN condition.
### Does this PR introduce any user-facing change?
This PR is for enable user use subquery in `JOIN`'s ON condition. such as we have create three table
```
CREATE TABLE A(id String);
CREATE TABLE B(id String);
CREATE TABLE C(id String);
```
we can do query like :
```
SELECT A.id from A JOIN B ON A.id = B.id and A.id IN (select C.id from C)
```
### How was this patch tested?
ADDED UT
Closes#25854 from AngersZhuuuu/SPARK-29145.
Lead-authored-by: angerszhu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Reimplement the iterator in UnsafeExternalRowSorter in database style. This can be done by reusing the `RowIterator` in our code base.
### Why are the changes needed?
During the job in #26164, after involving a var `isReleased` in `hasNext`, there's possible that `isReleased` is false when calling `hasNext`, but it becomes true before calling `next`. A safer way is using database-style iterator: `advanceNext` and `getRow`.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing UT.
Closes#26229 from xuanyuanking/SPARK-21492-follow-up.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add CacheTableStatement and make CACHE TABLE go through the same catalog/table resolution framework of v2 commands.
### Why are the changes needed?
It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.
```
USE my_catalog
DESC t // success and describe the table t from my_catalog
CACHE TABLE t // report table not found as there is no table t in the session catalog
```
### Does this PR introduce any user-facing change?
yes. When running CACHE TABLE, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog.
### How was this patch tested?
Unit tests.
Closes#26179 from viirya/SPARK-29522.
Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add RefreshTableStatement and make REFRESH TABLE go through the same catalog/table resolution framework of v2 commands.
### Why are the changes needed?
It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.
```
USE my_catalog
DESC t // success and describe the table t from my_catalog
REFRESH TABLE t // report table not found as there is no table t in the session catalog
```
### Does this PR introduce any user-facing change?
yes. When running REFRESH TABLE, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog.
### How was this patch tested?
New unit tests
Closes#26183 from imback82/refresh_table.
Lead-authored-by: Terry Kim <yuminkim@gmail.com>
Co-authored-by: Terry Kim <terryk@terrys-mbp-2.lan>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
### What changes were proposed in this pull request?
As described in [SPARK-29542](https://issues.apache.org/jira/browse/SPARK-29542) , the descriptions of `spark.sql.files.*` are confused.
In this PR, I make their descriptions be clearly.
### Why are the changes needed?
It makes the descriptions of `spark.sql.files.*` be clearly.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing UT.
Closes#26200 from turboFei/SPARK-29542-partition-maxSize.
Authored-by: turbofei <fwang12@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR adds `CREATE NAMESPACE` support for V2 catalogs.
### Why are the changes needed?
Currently, you cannot explicitly create namespaces for v2 catalogs.
### Does this PR introduce any user-facing change?
The user can now perform the following:
```SQL
CREATE NAMESPACE mycatalog.ns
```
to create a namespace `ns` inside `mycatalog` V2 catalog.
### How was this patch tested?
Added unit tests.
Closes#26166 from imback82/create_namespace.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add ShowPartitionsStatement and make SHOW PARTITIONS go through the same catalog/table resolution framework of v2 commands.
### Why are the changes needed?
It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users.
### Does this PR introduce any user-facing change?
Yes. When running SHOW PARTITIONS, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog.
### How was this patch tested?
Unit tests.
Closes#26198 from huaxingao/spark-29539.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
### What changes were proposed in this pull request?
Add TruncateTableStatement and make TRUNCATE TABLE go through the same catalog/table resolution framework of v2 commands.
### Why are the changes needed?
It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.
```
USE my_catalog
DESC t // success and describe the table t from my_catalog
TRUNCATE TABLE t // report table not found as there is no table t in the session catalog
```
### Does this PR introduce any user-facing change?
yes. When running TRUNCATE TABLE, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog.
### How was this patch tested?
Unit tests.
Closes#26174 from viirya/SPARK-29517.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR remove unnecessary orc version and hive version in doc.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
N/A.
Closes#26146 from denglingang/SPARK-24576.
Lead-authored-by: denglingang <chitin1027@gmail.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
As I have comment in [SPARK-29516](https://github.com/apache/spark/pull/26172#issuecomment-544364977)
SparkSession.sql() method parse process not under current sparksession's conf, so some configuration about parser is not valid in multi-thread situation.
In this pr, we add a SQLConf parameter to AbstractSqlParser and initial it with SessionState's conf.
Then for each SparkSession's parser process. It will use's it's own SessionState's SQLConf and to be thread safe
### Why are the changes needed?
Fix bug
### Does this PR introduce any user-facing change?
NO
### How was this patch tested?
NO
Closes#26187 from AngersZhuuuu/SPARK-29530.
Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
```
bit_and(expression) -- The bitwise AND of all non-null input values, or null if none
bit_or(expression) -- The bitwise OR of all non-null input values, or null if none
```
More details:
https://www.postgresql.org/docs/9.3/functions-aggregate.html
### Why are the changes needed?
Postgres, Mysql and many other popular db support them.
### Does this PR introduce any user-facing change?
add two bit agg
### How was this patch tested?
add ut
Closes#26155 from yaooqinn/SPARK-27879.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add RepairTableStatement and make REPAIR TABLE go through the same catalog/table resolution framework of v2 commands.
### Why are the changes needed?
It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.
```
USE my_catalog
DESC t // success and describe the table t from my_catalog
MSCK REPAIR TABLE t // report table not found as there is no table t in the session catalog
```
### Does this PR introduce any user-facing change?
yes. When running MSCK REPAIR TABLE, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog.
### How was this patch tested?
New unit tests
Closes#26168 from imback82/repair_table.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
### What changes were proposed in this pull request?
This is a followup of https://github.com/apache/spark/pull/25241 .
The typed interval expression should fail for invalid format.
### Why are the changes needed?
Te be consistent with the typed timestamp/date expression
### Does this PR introduce any user-facing change?
Yes. But this feature is not released yet.
### How was this patch tested?
updated test
Closes#26151 from cloud-fan/bug.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
### What changes were proposed in this pull request?
* Adding an additional check in `stringToTimestamp` to handle cases where the input has trailing ':'
* Added a test to make sure this works.
### Why are the changes needed?
In a couple of scenarios while converting from String to Timestamp `DateTimeUtils.stringToTimestamp` throws an array out of bounds exception if there is trailing ':'. The behavior of this method requires it to return `None` in case the format of the string is incorrect.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Added a test in the `DateTimeTestUtils` suite to test if my fix works.
Closes#26143 from rahulsmahadev/SPARK-29494.
Lead-authored-by: Rahul Mahadev <rahul.mahadev@databricks.com>
Co-authored-by: Rahul Shivu Mahadev <51690557+rahulsmahadev@users.noreply.github.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
The `date_part()` function can accept the `source` parameter of the `INTERVAL` type (`CalendarIntervalType`). The following values of the `field` parameter are supported:
- `"MILLENNIUM"` (`"MILLENNIA"`, `"MIL"`, `"MILS"`) - number of millenniums in the given interval. It is `YEAR / 1000`.
- `"CENTURY"` (`"CENTURIES"`, `"C"`, `"CENT"`) - number of centuries in the interval calculated as `YEAR / 100`.
- `"DECADE"` (`"DECADES"`, `"DEC"`, `"DECS"`) - decades in the `YEAR` part of the interval calculated as `YEAR / 10`.
- `"YEAR"` (`"Y"`, `"YEARS"`, `"YR"`, `"YRS"`) - years in a values of `CalendarIntervalType`. It is `MONTHS / 12`.
- `"QUARTER"` (`"QTR"`) - a quarter of year calculated as `MONTHS / 3 + 1`
- `"MONTH"` (`"MON"`, `"MONS"`, `"MONTHS"`) - the months part of the interval calculated as `CalendarInterval.months % 12`
- `"DAY"` (`"D"`, `"DAYS"`) - total number of days in `CalendarInterval.microseconds`
- `"HOUR"` (`"H"`, `"HOURS"`, `"HR"`, `"HRS"`) - the hour part of the interval.
- `"MINUTE"` (`"M"`, `"MIN"`, `"MINS"`, `"MINUTES"`) - the minute part of the interval.
- `"SECOND"` (`"S"`, `"SEC"`, `"SECONDS"`, `"SECS"`) - the seconds part with fractional microsecond part.
- `"MILLISECONDS"` (`"MSEC"`, `"MSECS"`, `"MILLISECON"`, `"MSECONDS"`, `"MS"`) - the millisecond part of the interval with fractional microsecond part.
- `"MICROSECONDS"` (`"USEC"`, `"USECS"`, `"USECONDS"`, `"MICROSECON"`, `"US"`) - the total number of microseconds in the `second`, `millisecond` and `microsecond` parts of the given interval.
- `"EPOCH"` - the total number of seconds in the interval including the fractional part with microsecond precision. Here we assume 365.25 days per year (leap year every four years).
For example:
```sql
> SELECT date_part('days', interval 1 year 10 months 5 days);
5
> SELECT date_part('seconds', interval 30 seconds 1 milliseconds 1 microseconds);
30.001001
```
### Why are the changes needed?
To maintain feature parity with PostgreSQL (https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT)
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
- Added new test suite `IntervalExpressionsSuite`
- Add new test cases to `date_part.sql`
Closes#25981 from MaxGekk/extract-from-intervals.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
The handling of the catalog across plans should be as follows ([SPARK-29014](https://issues.apache.org/jira/browse/SPARK-29014)):
* The *current* catalog should be used when no catalog is specified
* The default catalog is the catalog *current* is initialized to
* If the *default* catalog is not set, then *current* catalog is the built-in Spark session catalog.
This PR addresses the issue where *current* catalog usage is not followed as describe above.
### Why are the changes needed?
It is a bug as described in the previous section.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Unit tests added.
Closes#26120 from imback82/cleanup_catalog.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add `AnalyzeTableStatement` and `AnalyzeColumnStatement`, and make ANALYZE TABLE go through the same catalog/table resolution framework of v2 commands.
### Why are the changes needed?
It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.
```
USE my_catalog
DESC t // success and describe the table t from my_catalog
ANALYZE TABLE t // report table not found as there is no table t in the session catalog
```
### Does this PR introduce any user-facing change?
yes. When running ANALYZE TABLE, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog.
### How was this patch tested?
new tests
Closes#26129 from cloud-fan/analyze-table.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
### Why are the changes needed?
As mentioned in jira, sometimes we need to be able to support the retention of null columns when writing JSON.
For example, sparkmagic(used widely in jupyter with livy) will generate sql query results based on DataSet.toJSON and parse JSON to pandas DataFrame to display. If there is a null column, it is easy to have some column missing or even the query result is empty. The loss of the null column in the first row, may cause parsing exceptions or loss of entire column data.
### Does this PR introduce any user-facing change?
Example in spark-shell.
scala> spark.sql("select null as a, 1 as b").toJSON.collect.foreach(println)
{"b":1}
scala> spark.sql("set spark.sql.jsonGenerator.struct.ignore.null=false")
res2: org.apache.spark.sql.DataFrame = [key: string, value: string]
scala> spark.sql("select null as a, 1 as b").toJSON.collect.foreach(println)
{"a":null,"b":1}
### How was this patch tested?
Add new test to JacksonGeneratorSuite
Closes#26098 from stczwd/json.
Lead-authored-by: stczwd <qcsd2011@163.com>
Co-authored-by: Jackey Lee <qcsd2011@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
1. Regularize all the shuffle configurations related to adaptive execution.
2. Add default value for `BlockStoreShuffleReader.shouldBatchFetch`.
### Why are the changes needed?
It's a follow-up PR for #26040.
Regularize the existing `spark.sql.adaptive.shuffle` namespace in SQLConf.
### Does this PR introduce any user-facing change?
Rename one released user config `spark.sql.adaptive.minNumPostShufflePartitions` to `spark.sql.adaptive.shuffle.minNumPostShufflePartitions`, other changed configs is not released yet.
### How was this patch tested?
Existing UT.
Closes#26147 from xuanyuanking/SPARK-9853.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes a few typos:
1. Sparks => Spark's
2. parallize => parallelize
3. doesnt => doesn't
Closes#26140 from plusplusjiajia/fix-typos.
Authored-by: Jiajia Li <jiajia.li@intel.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
BIT_COUNT(N) - Returns the number of bits that are set in the argument N as an unsigned 64-bit integer, or NULL if the argument is NULL
### Why are the changes needed?
Supported by MySQL,Microsoft SQL Server ,etc.
### Does this PR introduce any user-facing change?
add a built-in function
### How was this patch tested?
add uts
Closes#26139 from yaooqinn/SPARK-29491.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
This PR takes over #19788. After we split the shuffle fetch protocol from `OpenBlock` in #24565, this optimization can be extended in the new shuffle protocol. Credit to yucai, closes#19788.
### What changes were proposed in this pull request?
This PR adds the support for continuous shuffle block fetching in batch:
- Shuffle client changes:
- Add new feature tag `spark.shuffle.fetchContinuousBlocksInBatch`, implement the decision logic in `BlockStoreShuffleReader`.
- Merge the continuous shuffle block ids in batch if needed in ShuffleBlockFetcherIterator.
- Shuffle server changes:
- Add support in `ExternalBlockHandler` for the external shuffle service side.
- Make `ShuffleBlockResolver.getBlockData` accept getting block data by range.
- Protocol changes:
- Add new block id type `ShuffleBlockBatchId` represent continuous shuffle block ids.
- Extend `FetchShuffleBlocks` and `OneForOneBlockFetcher`.
- After the new shuffle fetch protocol completed in #24565, the backward compatibility for external shuffle service can be controlled by `spark.shuffle.useOldFetchProtocol`.
### Why are the changes needed?
In adaptive execution, one reducer may fetch multiple continuous shuffle blocks from one map output file. However, as the original approach, each reducer needs to fetch those 10 reducer blocks one by one. This way needs many IO and impacts performance. This PR is to support fetching those continuous shuffle blocks in one IO (batch way). See below example:
The shuffle block is stored like below:
![image](https://user-images.githubusercontent.com/2989575/51654634-c37fbd80-1fd3-11e9-935e-5652863676c3.png)
The ShuffleId format is s"shuffle_$shuffleId_$mapId_$reduceId", referring to BlockId.scala.
In adaptive execution, one reducer may want to read output for reducer 5 to 14, whose block Ids are from shuffle_0_x_5 to shuffle_0_x_14.
Before this PR, Spark needs 10 disk IOs + 10 network IOs for each output file.
After this PR, Spark only needs 1 disk IO and 1 network IO. This way can reduce IO dramatically.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Add new UT.
Integrate test with setting `spark.sql.adaptive.enabled=true`.
Closes#26040 from xuanyuanking/SPARK-9853.
Lead-authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Co-authored-by: yucai <yyu1@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
bool_or(x) <=> any/some(x) <=> max(x)
bool_and(x) <=> every(x) <=> min(x)
Args:
x: boolean
### Why are the changes needed?
PostgreSQL, Presto and Vertica, etc also support this feature:
### Does this PR introduce any user-facing change?
add new functions support
### How was this patch tested?
add ut
Closes#26126 from yaooqinn/SPARK-27880.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Proposed new expression `SubtractDates` which is used in `date1` - `date2`. It has the `INTERVAL` type, and returns the interval from `date1` (inclusive) and `date2` (exclusive). For example:
```sql
> select date'tomorrow' - date'yesterday';
interval 2 days
```
Closes#26034
### Why are the changes needed?
- To conform the SQL standard which states the result type of `date operand 1` - `date operand 2` must be the interval type. See [4.5.3 Operations involving datetimes and intervals](http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt).
- Improve Spark SQL UX and allow mixing date and timestamp in subtractions. For example: `select timestamp'now' + (date'2019-10-01' - date'2019-09-15')`
### Does this PR introduce any user-facing change?
Before the query below returns number of days:
```sql
spark-sql> select date'2019-10-05' - date'2018-09-01';
399
```
After it returns an interval:
```sql
spark-sql> select date'2019-10-05' - date'2018-09-01';
interval 1 years 1 months 4 days
```
### How was this patch tested?
- by new tests in `DateExpressionsSuite` and `TypeCoercionSuite`.
- by existing tests in `date.sql`
Closes#26112 from MaxGekk/date-subtract.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
### What changes were proposed in this pull request?
Change Literal.sql to output CAST('fpValue' AS FLOAT) instead of CAST(fpValue AS FLOAT) as the SQL for a floating point literal.
### Why are the changes needed?
The old version doesn't work for very small floating point numbers; the value will fail to parse if it doesn't fit in a DECIMAL(38).
This doesn't apply to doubles because they have special literal syntax.
### Does this PR introduce any user-facing change?
Not really.
### How was this patch tested?
New unit tests.
Closes#26114 from jose-torres/fpliteral.
Authored-by: Jose Torres <joseph.torres@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Updating univocity-parsers version to 2.8.3, which adds support for multiple character delimiters
Moving univocity-parsers version to spark-parent pom dependencyManagement section
Adding new utility method to build multi-char delimiter string, which delegates to existing one
Adding tests for multiple character delimited CSV
### What changes were proposed in this pull request?
Adds support for parsing CSV data using multiple-character delimiters. Existing logic for converting the input delimiter string to characters was kept and invoked in a loop. Project dependencies were updated to remove redundant declaration of `univocity-parsers` version, and also to change that version to the latest.
### Why are the changes needed?
It is quite common for people to have delimited data, where the delimiter is not a single character, but rather a sequence of characters. Currently, it is difficult to handle such data in Spark (typically needs pre-processing).
### Does this PR introduce any user-facing change?
Yes. Specifying the "delimiter" option for the DataFrame read, and providing more than one character, will no longer result in an exception. Instead, it will be converted as before and passed to the underlying library (Univocity), which has accepted multiple character delimiters since 2.8.0.
### How was this patch tested?
The `CSVSuite` tests were confirmed passing (including new methods), and `sbt` tests for `sql` were executed.
Closes#26027 from jeff303/SPARK-24540.
Authored-by: Jeff Evans <jeffrey.wayne.evans@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
When inserting a value into a column with the different data type, Spark performs type coercion. Currently, we support 3 policies for the store assignment rules: ANSI, legacy and strict, which can be set via the option "spark.sql.storeAssignmentPolicy":
1. ANSI: Spark performs the type coercion as per ANSI SQL. In practice, the behavior is mostly the same as PostgreSQL. It disallows certain unreasonable type conversions such as converting `string` to `int` and `double` to `boolean`. It will throw a runtime exception if the value is out-of-range(overflow).
2. Legacy: Spark allows the type coercion as long as it is a valid `Cast`, which is very loose. E.g., converting either `string` to `int` or `double` to `boolean` is allowed. It is the current behavior in Spark 2.x for compatibility with Hive. When inserting an out-of-range value to a integral field, the low-order bits of the value is inserted(the same as Java/Scala numeric type casting). For example, if 257 is inserted to a field of Byte type, the result is 1.
3. Strict: Spark doesn't allow any possible precision loss or data truncation in store assignment, e.g., converting either `double` to `int` or `decimal` to `double` is allowed. The rules are originally for Dataset encoder. As far as I know, no mainstream DBMS is using this policy by default.
Currently, the V1 data source uses "Legacy" policy by default, while V2 uses "Strict". This proposal is to use "ANSI" policy by default for both V1 and V2 in Spark 3.0.
### Why are the changes needed?
Following the ANSI SQL standard is most reasonable among the 3 policies.
### Does this PR introduce any user-facing change?
Yes.
The default store assignment policy is ANSI for both V1 and V2 data sources.
### How was this patch tested?
Unit test
Closes#26107 from gengliangwang/ansiPolicyAsDefault.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
Implement a rule in the new adaptive execution framework introduced in [SPARK-23128](https://issues.apache.org/jira/browse/SPARK-23128). This rule is used to optimize the shuffle reader to local shuffle reader when smj is converted to bhj in adaptive execution.
## How was this patch tested?
Existing tests
Closes#25295 from JkSelf/localShuffleOptimization.
Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
move the statement logical plans that were created for v2 commands to a new file `statements.scala`, under the same package of `v2Commands.scala`.
This PR also includes some minor cleanups:
1. remove `private[sql]` from `ParsedStatement` as it's in the private package.
2. remove unnecessary override of `output` and `children`.
3. add missing classdoc.
### Why are the changes needed?
Similar to https://github.com/apache/spark/pull/26111 , this is to better organize the logical plans of data source v2.
It's a bit weird to put the statements in the package `org.apache.spark.sql.catalyst.plans.logical.sql` as `sql` is not a good sub-package name in Spark SQL.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
existing tests
Closes#26125 from cloud-fan/statement.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Refine the document of v2 session catalog config, to clearly explain what it is, when it should be used and how to implement it.
### Why are the changes needed?
Make this config more understandable
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Pass the Jenkins with the newly updated test cases.
Closes#26071 from cloud-fan/config.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR adds JSON serialization for Spark external Rows.
### Why are the changes needed?
This is to be used for observable metrics where the `StreamingQueryProgress` contains a map of observed metrics rows which needs to be serialized in some cases.
### Does this PR introduce any user-facing change?
Yes, a user can call `toJson` on rows returned when collecting a DataFrame to the driver.
### How was this patch tested?
Added a new test suite: `RowJsonSuite` that should test this.
Closes#26013 from hvanhovell/SPARK-29347.
Authored-by: herman <herman@databricks.com>
Signed-off-by: herman <herman@databricks.com>
### What changes were proposed in this pull request?
move the v2 command logical plans from `basicLogicalOperators.scala` to a new file `v2Commands.scala`
### Why are the changes needed?
As we keep adding v2 commands, the `basicLogicalOperators.scala` grows bigger and bigger. It's better to have a separated file for them.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
not needed
Closes#26111 from cloud-fan/command.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
In the PR, I propose to move interval parsing to `CalendarInterval.fromCaseInsensitiveString()` which throws an `IllegalArgumentException` for invalid strings, and reuse it from `CalendarInterval.fromString()`. The former one handles `IllegalArgumentException` only and returns `NULL` for invalid interval strings. This will allow to support interval strings without the `interval` prefix in casting strings to intervals and in interval type constructor because they use `fromString()` for parsing string intervals.
For example:
```sql
spark-sql> select cast('1 year 10 days' as interval);
interval 1 years 1 weeks 3 days
spark-sql> SELECT INTERVAL '1 YEAR 10 DAYS';
interval 1 years 1 weeks 3 days
```
### Why are the changes needed?
To maintain feature parity with PostgreSQL which supports interval strings without prefix:
```sql
# select interval '2 months 1 microsecond';
interval
------------------------
2 mons 00:00:00.000001
```
and to improve Spark SQL UX.
### Does this PR introduce any user-facing change?
Yes, previously parsing of interval strings without `interval` gives `NULL`:
```sql
spark-sql> select interval '2 months 1 microsecond';
NULL
```
After:
```sql
spark-sql> select interval '2 months 1 microsecond';
interval 2 months 1 microseconds
```
### How was this patch tested?
- Added new tests to `CalendarIntervalSuite.java`
- A test for casting strings to intervals in `CastSuite`
- Test for interval type constructor from strings in `ExpressionParserSuite`
Closes#26079 from MaxGekk/interval-str-without-prefix.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Currently, `SHOW NAMESPACES` and `SHOW DATABASES` are separate code paths. This PR merges two implementations.
### Why are the changes needed?
To remove code/behavior duplication
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Added new unit tests.
Closes#26006 from imback82/combine_show.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Revert this commit 18b7ad2fc5.
### Why are the changes needed?
See https://github.com/apache/spark/pull/16304#discussion_r92753590
### Does this PR introduce any user-facing change?
Yes
### How was this patch tested?
There is no test for that.
Closes#26101 from MaxGekk/revert-mean-seconds-per-month.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
Update ANSI mode related config names in comments as "spark.sql.ansi.enabled"
### Why are the changes needed?
The removed configuration `spark.sql.parser.ansi.enabled` and `spark.sql.failOnIntegralTypeOverflow` still exist in code comments.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Grep the whole code to ensure the remove config names no longer exist.
```
git grep "parser.ansi.enabled"
git grep failOnIntegralTypeOverflow
git grep decimalOperationsNullOnOverflow
git grep ANSI_SQL_PARSER
git grep FAIL_ON_INTEGRAL_TYPE_OVERFLOW
git grep DECIMAL_OPERATIONS_NULL_ON_OVERFLOW
```
Closes#26067 from gengliangwang/spark-28989-followup.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Use `.sameElements` to compare (non-nested) arrays, as `Arrays.deep` is removed in 2.13 and wasn't the best way to do this in the first place.
### Why are the changes needed?
To compile with 2.13.
### Does this PR introduce any user-facing change?
None.
### How was this patch tested?
Existing tests.
Closes#26073 from srowen/SPARK-29416.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Replace `Unit` with equivalent `()` where code refers to the `Unit` companion object.
### Why are the changes needed?
It doesn't compile otherwise in Scala 2.13.
- https://github.com/scala/scala/blob/v2.13.0/src/library/scala/Unit.scala#L30
### Does this PR introduce any user-facing change?
Should be no behavior change at all.
### How was this patch tested?
Existing tests.
Closes#26070 from srowen/SPARK-29411.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR adds an accumulator that computes a global aggregate over a number of rows. A user can define an arbitrary number of aggregate functions which can be computed at the same time.
The accumulator uses the standard technique for implementing (interpreted) aggregation in Spark. It uses projections and manual updates for each of the aggregation steps (initialize buffer, update buffer with new input row, merge two buffers and compute the final result on the buffer). Note that two of the steps (update and merge) use the aggregation buffer both as input and output.
Accumulators do not have an explicit point at which they get serialized. A somewhat surprising side effect is that the buffers of a `TypedImperativeAggregate` go over the wire as-is instead of serializing them. The merging logic for `TypedImperativeAggregate` assumes that the input buffer contains serialized buffers, this is violated by the accumulator's implicit serialization. In order to get around this I have added `mergeBuffersObjects` method that merges two unserialized buffers to `TypedImperativeAggregate`.
### Why are the changes needed?
This is the mechanism we are going to use to implement observable metrics.
### Does this PR introduce any user-facing change?
No, not yet.
### How was this patch tested?
Added `AggregatingAccumulator` test suite.
Closes#26012 from hvanhovell/SPARK-29346.
Authored-by: herman <herman@databricks.com>
Signed-off-by: herman <herman@databricks.com>
### What changes were proposed in this pull request?
Syntax like `'foo` is deprecated in Scala 2.13. Replace usages with `Symbol("foo")`
### Why are the changes needed?
Avoids ~50 deprecation warnings when attempting to build with 2.13.
### Does this PR introduce any user-facing change?
None, should be no functional change at all.
### How was this patch tested?
Existing tests.
Closes#26061 from srowen/SPARK-29392.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Beautify comment
### Why are the changes needed?
The config name now is pretty weird.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
No test.
Closes#26054 from wangshisan/SPARK-29189.
Authored-by: gwang3 <gwang3@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Reimplement `org.apache.spark.sql.catalyst.util.QuantileSummaries#merge` and add a test-case showing the previous bug.
### Why are the changes needed?
The original Greenwald-Khanna paper, from which the algorithm behind `approxQuantile` was taken, does not cover how to merge the result of multiple parallel QuantileSummaries. The current implementation violates some invariants and therefore the effective error can be larger than the specified.
### Does this PR introduce any user-facing change?
Yes, for same cases, the results from `approxQuantile` (`percentile_approx` in SQL) will now be within the expected error margin. For example:
```scala
var values = (1 to 100).toArray
val all_quantiles = values.indices.map(i => (i+1).toDouble / values.length).toArray
for (n <- 0 until 5) {
var df = spark.sparkContext.makeRDD(values).toDF("value").repartition(5)
val all_answers = df.stat.approxQuantile("value", all_quantiles, 0.1)
val all_answered_ranks = all_answers.map(ans => values.indexOf(ans)).toArray
val error = all_answered_ranks.zipWithIndex.map({ case (answer, expected) => Math.abs(expected - answer) }).toArray
val max_error = error.max
print(max_error + "\n")
}
```
In the current build it returns:
```
16
12
10
11
17
```
I couldn't run the code with this patch applied to double check the implementation. Can someone please confirm it now outputs at most `10`, please?
### How was this patch tested?
A new unit test was added to uncover the previous bug.
Closes#26029 from sitegui/SPARK-29336.
Authored-by: Guilherme <sitegui@sitegui.com.br>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
Set the default value of the `spark.sql.legacy.sizeOfNull` config to `false`. That changes behavior of the `size()` function for `NULL`. The function will return `NULL` for `NULL` instead of `-1`.
### Why are the changes needed?
There is the agreement in the PR https://github.com/apache/spark/pull/21598#issuecomment-399695523 to change behavior in Spark 3.0.
### Does this PR introduce any user-facing change?
Yes.
Before:
```sql
spark-sql> select size(NULL);
-1
```
After:
```sql
spark-sql> select size(NULL);
NULL
```
### How was this patch tested?
By the `check outputs of expression examples` test in `SQLQuerySuite` which runs expression examples.
Closes#26051 from MaxGekk/sizeof-null-returns-null.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add back the resolved logical plan for UPDATE TABLE. It was in https://github.com/apache/spark/pull/25626 before but was removed later.
### Why are the changes needed?
In https://github.com/apache/spark/pull/25626 , we decided to not add the update API in DS v2, but we still want to implement UPDATE for builtin source like JDBC. We should at least add the resolved logical plan.
### Does this PR introduce any user-facing change?
no, UPDATE is still not supported yet.
### How was this patch tested?
new tests.
Closes#26025 from cloud-fan/update.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
### What changes were proposed in this pull request?
In our PROD env, we have a pure Spark cluster, I think this is also pretty common, where computation is separated from storage layer. In such deploy mode, data locality is never reachable.
And there are some configurations in Spark scheduler to reduce waiting time for data locality(e.g. "spark.locality.wait"). While, problem is that, in listing file phase, the location informations of all the files, with all the blocks inside each file, are all fetched from the distributed file system. Actually, in a PROD environment, a table can be so huge that even fetching all these location informations need take tens of seconds.
To improve such scenario, Spark need provide an option, where data locality can be totally ignored, all we need in the listing file phase are the files locations, without any block location informations.
### Why are the changes needed?
And we made a benchmark in our PROD env, after ignore the block locations, we got a pretty huge improvement.
Table Size | Total File Number | Total Block Number | List File Duration(With Block Location) | List File Duration(Without Block Location)
-- | -- | -- | -- | --
22.6T | 30000 | 120000 | 16.841s | 1.730s
28.8 T | 42001 | 148964 | 10.099s | 2.858s
3.4 T | 20000 | 20000 | 5.833s | 4.881s
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Via ut.
Closes#25869 from wangshisan/SPARK-29189.
Authored-by: gwang3 <gwang3@ebay.com>
Signed-off-by: Imran Rashid <irashid@cloudera.com>
### What changes were proposed in this pull request?
In the PR, I propose to pass the `Pattern.CASE_INSENSITIVE` flag while compiling interval patterns in `CalendarInterval`. This makes casting string values to intervals case insensitive and tolerant to case of the `interval`, `year(s)`, `month(s)`, `week(s)`, `day(s)`, `hour(s)`, `minute(s)`, `second(s)`, `millisecond(s)` and `microsecond(s)`.
### Why are the changes needed?
There are at least 2 reasons:
- To maintain feature parity with PostgreSQL which is not sensitive to case:
```sql
# select cast('10 Days' as INTERVAL);
interval
----------
10 days
(1 row)
```
- Spark is tolerant to case of interval literals. Case insensitivity in casting should be convenient for Spark users.
```sql
spark-sql> SELECT INTERVAL 1 YEAR 1 WEEK;
interval 1 years 1 weeks
```
### Does this PR introduce any user-facing change?
Yes, current implementation produces `NULL` for `interval`, `year`, ... `microsecond` that are not in lower case.
Before:
```sql
spark-sql> SELECT CAST('INTERVAL 10 DAYS' as INTERVAL);
NULL
```
After:
```sql
spark-sql> SELECT CAST('INTERVAL 10 DAYS' as INTERVAL);
interval 1 weeks 3 days
```
### How was this patch tested?
- by new tests in `CalendarIntervalSuite.java`
- new test in `CastSuite`
Closes#26010 from MaxGekk/interval-case-insensitive.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
I introduced new constants `SECONDS_PER_MONTH` and `MILLIS_PER_MONTH`, and reused it in calculations of seconds/milliseconds per month. `SECONDS_PER_MONTH` is 2629746 because the average year of the Gregorian calendar is 365.2425 days long or 60 * 60 * 24 * 365.2425 = 31556952.0 = 12 * 2629746 seconds per year.
### Why are the changes needed?
Spark uses the proleptic Gregorian calendar (see https://issues.apache.org/jira/browse/SPARK-26651) in which the average year is 365.2425 days (see https://en.wikipedia.org/wiki/Gregorian_calendar) but existing implementation assumes 31 days per months or 12 * 31 = 372 days. That's far away from the the truth.
### Does this PR introduce any user-facing change?
Yes, the changes affect at least 3 methods in `GroupStateImpl`, `EventTimeWatermark` and `MonthsBetween`. For example, the `month_between()` function will return different result in some cases.
Before:
```sql
spark-sql> select months_between('2019-09-15', '1970-01-01');
596.4516129
```
After:
```sql
spark-sql> select months_between('2019-09-15', '1970-01-01');
596.45996838
```
### How was this patch tested?
By existing test suite `DateTimeUtilsSuite`, `DateFunctionsSuite` and `DateExpressionsSuite`.
Closes#25998 from MaxGekk/days-in-year.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
Added new rules to `TypeCoercion.DateTimeOperations` for the `Subtract` expression which is replaced by existing `TimestampDiff` expression if one of its parameter has the `DATE` type and another one is the `TIMESTAMP` type. The date argument is casted to timestamp.
### Why are the changes needed?
- To maintain feature parity with PostgreSQL which supports subtraction of a date from a timestamp and a timestamp from a date:
```sql
maxim=# select timestamp'now' - date'epoch';
?column?
----------------------------
18175 days 21:07:33.412875
(1 row)
maxim=# select date'2020-01-01' - timestamp'now';
?column?
-------------------------
86 days 02:52:00.945296
(1 row)
```
- To conform to the SQL standard which defines datetime subtraction as an interval.
### Does this PR introduce any user-facing change?
Yes, currently the queries bellow fails with the error:
```sql
spark-sql> select timestamp'now' - date'2019-10-01';
Error in query: cannot resolve '(TIMESTAMP('2019-10-06 21:05:07.234') - DATE '2019-10-01')' due to data type mismatch: differing types in '(TIMESTAMP('2019-10-06 21:05:07.234') - DATE '2019-10-01')' (timestamp and date).; line 1 pos 7;
'Project [unresolvedalias((1570385107234000 - 18170), None)]
+- OneRowRelation
```
after the changes:
```sql
spark-sql> select timestamp'now' - date'2019-10-01';
interval 5 days 21 hours 4 minutes 55 seconds 878 milliseconds
```
### How was this patch tested?
- Add new cases to the `rule for date/timestamp operations` test in `TypeCoercionSuite`
- by 2 new test in `datetime.sql`
Closes#26036 from MaxGekk/date-timestamp-subtract.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Added new expression `TimestampDiff` for timestamp subtractions. It accepts 2 timestamp expressions and returns another one of the `CalendarIntervalType`. While creating an instance of `CalendarInterval`, it initializes only the microsecond field by difference of the given timestamps in microseconds, and set the `months` field to zero. Also I added an rule for conversion `Subtract` to `TimestampDiff`, and enabled already ported test queries in `postgreSQL/timestamp.sql`.
### Why are the changes needed?
To maintain feature parity with PostgreSQL which allows to get timestamp difference:
```sql
# select timestamp'today' - timestamp'yesterday';
?column?
----------
1 day
(1 row)
```
### Does this PR introduce any user-facing change?
Yes, previously users got the following error from timestamp subtraction:
```sql
spark-sql> select timestamp'today' - timestamp'yesterday';
Error in query: cannot resolve '(TIMESTAMP('2019-10-04 00:00:00') - TIMESTAMP('2019-10-03 00:00:00'))' due to data type mismatch: '(TIMESTAMP('2019-10-04 00:00:00') - TIMESTAMP('2019-10-03 00:00:00'))' requires (numeric or interval) type, not timestamp; line 1 pos 7;
'Project [unresolvedalias((1570136400000000 - 1570050000000000), None)]
+- OneRowRelation
```
after the changes they should get an interval:
```sql
spark-sql> select timestamp'today' - timestamp'yesterday';
interval 1 days
```
### How was this patch tested?
- Added tests for `TimestampDiff` to `DateExpressionsSuite`
- By new test in `TypeCoercionSuite`.
- Enabled tests in `postgreSQL/timestamp.sql`.
Closes#26022 from MaxGekk/timestamp-diff.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Currently we deal with different `ParsedStatement` in many places and write duplicated catalog/table lookup logic. In general the lookup logic is
1. try look up the catalog by name. If no such catalog, and default catalog is not set, convert `ParsedStatement` to v1 command like `ShowDatabasesCommand`. Otherwise, convert `ParsedStatement` to v2 command like `ShowNamespaces`.
2. try look up the table by name. If no such table, fail. If the table is a `V1Table`, convert `ParsedStatement` to v1 command like `CreateTable`. Otherwise, convert `ParsedStatement` to v2 command like `CreateV2Table`.
However, since the code is duplicated we don't apply this lookup logic consistently. For example, we forget to consider the v2 session catalog in several places.
This PR centralizes the catalog/table lookup logic by 3 rules.
1. `ResolveCatalogs` (in catalyst). This rule resolves v2 catalog from the multipart identifier in SQL statements, and convert the statement to v2 command if the resolved catalog is not session catalog. If the command needs to resolve the table (e.g. ALTER TABLE), put an `UnresolvedV2Table` in the command.
2. `ResolveTables` (in catalyst). It resolves `UnresolvedV2Table` to `DataSourceV2Relation`.
3. `ResolveSessionCatalog` (in sql/core). This rule is only effective if the resolved catalog is session catalog. For commands that don't need to resolve the table, this rule converts the statement to v1 command directly. Otherwise, it converts the statement to v1 command if the resolved table is v1 table, and convert to v2 command if the resolved table is v2 table. Hopefully we can remove this rule eventually when v1 fallback is not needed anymore.
### Why are the changes needed?
Reduce duplicated code and make the catalog/table lookup logic consistent.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
existing tests
Closes#25747 from cloud-fan/lookup.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
1. With ANSI store assignment policy, an exception is thrown on casting failure
2. Introduce a new expression `AnsiCast` for the ANSI store assignment policy, so that the store assignment policy configuration won't affect the general `Cast`.
### Why are the changes needed?
As per ANSI SQL standard, ANSI store assignment policy should throw an exception on insertion failure, such as inserting out-of-range value to a numeric field.
### Does this PR introduce any user-facing change?
With ANSI store assignment policy, an exception is thrown on casting failure
### How was this patch tested?
Unit test
Closes#25997 from gengliangwang/newCast.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Scala 2.13 removes the parallel collections classes to a separate library, so first, this establishes a `scala-2.13` profile to bring it back, for future use.
However the library enables use of `.par` implicit conversions via a new class that is not in 2.12, which makes cross-building hard. This implements a suggested workaround from https://github.com/scala/scala-parallel-collections/issues/22 to avoid `.par` entirely.
### Why are the changes needed?
To compile for 2.13 and later to work with 2.13.
### Does this PR introduce any user-facing change?
Should not, no.
### How was this patch tested?
Existing tests.
Closes#25980 from srowen/SPARK-29296.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
Lambda functions to array `filter` can now take as input the index as well as the element. This behavior matches array `transform`.
### Why are the changes needed?
See JIRA. It's generally useful, and particularly so if you're working with fixed length arrays.
### Does this PR introduce any user-facing change?
Previously filter lambdas had to look like
`filter(arr, el -> whatever)`
Now, lambdas can take an index argument as well
`filter(array, (el, idx) -> whatever)`
### How was this patch tested?
I added unit tests to `HigherOrderFunctionsSuite`.
Closes#25666 from henrydavidge/filter-idx.
Authored-by: Henry D <henrydavidge@gmail.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
This PR exposes USE CATALOG/USE SQL commands as described in this [SPIP](https://docs.google.com/document/d/1jEcvomPiTc5GtB9F7d2RTVVpMY64Qy7INCA_rFEd9HQ/edit#)
It also exposes `currentCatalog` in `CatalogManager`.
Finally, it changes `SHOW NAMESPACES` and `SHOW TABLES` to use the current catalog if no catalog is specified (instead of default catalog).
### Why are the changes needed?
There is currently no mechanism to change current catalog/namespace thru SQL commands.
### Does this PR introduce any user-facing change?
Yes, you can perform the following:
```scala
// Sets the current catalog to 'testcat'
spark.sql("USE CATALOG testcat")
// Sets the current catalog to 'testcat' and current namespace to 'ns1.ns2'.
spark.sql("USE ns1.ns2 IN testcat")
// Now, the following will use 'testcat' as the current catalog and 'ns1.ns2' as the current namespace.
spark.sql("SHOW NAMESPACES")
```
### How was this patch tested?
Added new unit tests.
Closes#25771 from imback82/use_namespace.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Added new expression `SecondWithFraction` which produces the `seconds` part of timestamps/dates with fractional part containing microseconds. This expression is used only in the `DatePart` expression. As the result, `date_part()` and `extract` return seconds and microseconds as the fractional part of the seconds part when `field` is `SECOND` (or synonyms).
### Why are the changes needed?
The `date_part()` and `extract` were added to maintain feature parity with PostgreSQL which has different behavior for the `SECOND` value of the `field` parameter. The fix is needed to behave in the same way. Here is PostgreSQL's output:
```sql
# SELECT date_part('SECONDS', timestamp'2019-10-01 00:00:01.000001');
date_part
-----------
1.000001
(1 row)
```
### Does this PR introduce any user-facing change?
Yes, type of `date_part('SECOND', ...)` is changed from `INT` to `DECIMAL(8, 6)`.
Before:
```sql
spark-sql> SELECT date_part('SECONDS', '2019-10-01 00:00:01.000001');
1
```
After:
```sql
spark-sql> SELECT date_part('SECONDS', '2019-10-01 00:00:01.000001');
1.000001
```
### How was this patch tested?
- Added new tests to `DateExpressionSuite` for the `SecondWithFraction` expression
- Regenerated results of `date_part.sql`, `extract.sql` and `timestamp.sql`
- Updated results of `ExtractBenchmark`
Closes#25986 from MaxGekk/extract-seconds-from-timestamp.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to remove `scalatest` deprecation warnings with the following changes.
- `org.scalatest.mockito.MockitoSugar` -> `org.scalatestplus.mockito.MockitoSugar`
- `org.scalatest.selenium.WebBrowser` -> `org.scalatestplus.selenium.WebBrowser`
- `org.scalatest.prop.Checkers` -> `org.scalatestplus.scalacheck.Checkers`
- `org.scalatest.prop.GeneratorDrivenPropertyChecks` -> `org.scalatestplus.scalacheck.ScalaCheckDrivenPropertyChecks`
### Why are the changes needed?
According to the Jenkins logs, there are 118 warnings about this.
```
grep "is deprecated" ~/consoleText | grep scalatest | wc -l
118
```
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
After Jenkins passes, we need to check the Jenkins log.
Closes#25982 from dongjoon-hyun/SPARK-29307.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Scala 2.13 emits a deprecation warning for procedure-like declarations:
```
def foo() {
...
```
This is equivalent to the following, so should be changed to avoid a warning:
```
def foo(): Unit = {
...
```
### Why are the changes needed?
It will avoid about a thousand compiler warnings when we start to support Scala 2.13. I wanted to make the change in 3.0 as there are less likely to be back-ports from 3.0 to 2.4 than 3.1 to 3.0, for example, minimizing that downside to touching so many files.
Unfortunately, that makes this quite a big change.
### Does this PR introduce any user-facing change?
No behavior change at all.
### How was this patch tested?
Existing tests.
Closes#25968 from srowen/SPARK-29291.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
Please refer [the link on dev. mailing list](https://lists.apache.org/thread.html/cc6489a19316e7382661d305fabd8c21915e5faf6a928b4869ac2b4a%3Cdev.spark.apache.org%3E) to see rationalization of this patch.
This patch adds the functionality to detect the possible correct issue on multiple stateful operations in single streaming query and logs warning message to inform end users.
This patch also documents some notes to inform caveats when using multiple stateful operations in single query, and provide one known alternative.
## How was this patch tested?
Added new UTs in UnsupportedOperationsSuite to test various combination of stateful operators on streaming query.
Closes#24890 from HeartSaVioR/SPARK-28074.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
This patch adds AliasIdentifier to the list of classes that should be converted to Json in TreeNode.shouldConvertToJson.
### Why are the changes needed?
When asking prettyJson of an analyzed query plan which contains SubqueryAlias. The field of name of SubqueryAlias is "null", like:
```
[ {
"class" : "org.apache.spark.sql.catalyst.plans.logical.SubqueryAlias",
"num-children" : 1,
"name" : null,
"child" : 0
}, {
"class" : "org.apache.spark.sql.catalyst.plans.logical.Project",
...
```
Seems the alias name was in the Json before SPARK-19602.
It is fixed by this patch:
```
[ {
"class" : "org.apache.spark.sql.catalyst.plans.logical.SubqueryAlias",
"num-children" : 1,
"name" : {
"product-class" : "org.apache.spark.sql.catalyst.AliasIdentifier",
"identifier" : "t1"
},
"child" : 0
}, {
"class" : "org.apache.spark.sql.catalyst.plans.logical.Project",
...
```
### Does this PR introduce any user-facing change?
Yes. This patch changes null value of name field of SubqueryAlias in Json string to the alias identifier.
### How was this patch tested?
Added unit test.
Closes#25959 from viirya/SPARK-29186.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
### What changes were proposed in this pull request?
Support the syntax of `ALTER (DATABASE|SCHEMA) database_name SET LOCATION` path. Please note that only Hive 3.x metastore support this syntax.
Ref:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDLhttps://issues.apache.org/jira/browse/HIVE-8472
### Why are the changes needed?
Support more syntax.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Unit test.
Closes#25883 from wangyum/SPARK-28476.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
### What changes were proposed in this pull request?
This patch fixes examples of Like/RLike to test its origin intention correctly. The example doesn't consider the default value of spark.sql.parser.escapedStringLiterals: it's false by default.
Please take a look at current example of Like:
d72f39897b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala (L97-L106)
If spark.sql.parser.escapedStringLiterals=false, then it should fail as there's `\U` in pattern (spark.sql.parser.escapedStringLiterals=false by default) but it doesn't fail.
```
The escape character is '\'. If an escape character precedes a special symbol or another
escape character, the following character is matched literally. It is invalid to escape
any other character.
```
For the query
```
SET spark.sql.parser.escapedStringLiterals=false;
SELECT '%SystemDrive%\Users\John' like '\%SystemDrive\%\Users%';
```
SQL parser removes single `\` (not sure that is intended) so the expressions of Like are constructed as following (I've printed out expression of left and right for Like/RLike):
> LIKE - left `%SystemDrive%UsersJohn` / right `\%SystemDrive\%Users%`
which are no longer having origin intention (see left).
Below query tests the origin intention:
```
SET spark.sql.parser.escapedStringLiterals=false;
SELECT '%SystemDrive%\\Users\\John' like '\%SystemDrive\%\\\\Users%';
```
> LIKE - left `%SystemDrive%\Users\John` / right `\%SystemDrive\%\\Users%`
Note that `\\\\` is needed in pattern as `StringUtils.escapeLikeRegex` requires `\\` to represent normal character of `\`.
Same for RLIKE:
```
SET spark.sql.parser.escapedStringLiterals=true;
SELECT '%SystemDrive%\Users\John' rlike '%SystemDrive%\\Users.*';
```
> RLIKE - left `%SystemDrive%\Users\John` / right `%SystemDrive%\\Users.*`
which is OK, but
```
SET spark.sql.parser.escapedStringLiterals=false;
SELECT '%SystemDrive%\Users\John' rlike '%SystemDrive%\Users.*';
```
> RLIKE - left `%SystemDrive%UsersJohn` / right `%SystemDrive%Users.*`
which no longer haves origin intention.
Below query tests the origin intention:
```
SET spark.sql.parser.escapedStringLiterals=true;
SELECT '%SystemDrive%\\Users\\John' rlike '%SystemDrive%\\\\Users.*';
```
> RLIKE - left `%SystemDrive%\Users\John` / right `%SystemDrive%\\Users.*`
### Why are the changes needed?
Because the example doesn't test the origin intention. Spark is now running automated tests from these examples, so now it's not only documentation issue but also test issue.
### Does this PR introduce any user-facing change?
No, as it only corrects documentation.
### How was this patch tested?
Added debug log (like above) and ran queries from `spark-sql`.
Closes#25957 from HeartSaVioR/SPARK-29281.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
New test compares outputs of expression examples in comments with results of `hiveResultString()`. Also I fixed existing examples where actual and expected outputs are different.
### Why are the changes needed?
This prevents mistakes in expression examples, and fixes existing mistakes in comments.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Add new test to `SQLQuerySuite`.
Closes#25942 from MaxGekk/run-expr-examples.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Added a new config "spark.sql.additionalRemoteRepositories", a comma-delimited string config of the optional additional remote maven mirror.
### Why are the changes needed?
We need to connect the Maven repositories in IsolatedClientLoader for downloading Hive jars,
end-users can set this config if the default maven central repo is unreachable.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing UT.
Closes#25849 from xuanyuanking/SPARK-29175.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
After https://github.com/apache/spark/pull/25158 and https://github.com/apache/spark/pull/25458, SQL features of PostgreSQL are introduced into Spark. AFAIK, both features are implementation-defined behaviors, which are not specified in ANSI SQL.
In such a case, this proposal is to add a configuration `spark.sql.dialect` for choosing a database dialect.
After this PR, Spark supports two database dialects, `Spark` and `PostgreSQL`. With `PostgreSQL` dialect, Spark will:
1. perform integral division with the / operator if both sides are integral types;
2. accept "true", "yes", "1", "false", "no", "0", and unique prefixes as input and trim input for the boolean data type.
### Why are the changes needed?
Unify the external database dialect with one configuration, instead of small flags.
### Does this PR introduce any user-facing change?
A new configuration `spark.sql.dialect` for choosing a database dialect.
### How was this patch tested?
Existing tests.
Closes#25697 from gengliangwang/dialect.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
It is very confusing that the default save mode is different between the internal implementation of a Data source. The reason that we had to have saveModeForDSV2 was that there was no easy way to check the existence of a Table in DataSource v2. Now, we have catalogs for that. Therefore we should be able to remove the different save modes. We also have a plan forward for `save`, where we can't really check the existence of a table, and therefore create one. That will come in a future PR.
### Why are the changes needed?
Because it is confusing that the internal implementation of a data source (which is generally non-obvious to users) decides which default save mode is used within Spark.
### Does this PR introduce any user-facing change?
It changes the default save mode for V2 Tables in the DataFrameWriter APIs
### How was this patch tested?
Existing tests
Closes#25876 from brkyvz/removeSM.
Lead-authored-by: Burak Yavuz <brkyvz@gmail.com>
Co-authored-by: Burak Yavuz <burak@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This patch proposes to skip PlanExpression when doing subexpression elimination on executors.
### Why are the changes needed?
Subexpression elimination can possibly cause NPE when applying on execution subquery expression like ScalarSubquery on executors. It is because PlanExpression wraps query plan. To compare query plan on executor when eliminating subexpression, can cause unexpected error, like NPE when accessing transient fields.
The NPE looks like:
```
[info] - SPARK-29239: Subquery should not cause NPE when eliminating subexpression *** FAILED *** (175 milliseconds)
[info] org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1395.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1395.0 (TID 3447, 10.0.0.196, executor driver): java.lang.NullPointerException
[info] at org.apache.spark.sql.execution.LocalTableScanExec.stringArgs(LocalTableScanExec.scala:62)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.argString(TreeNode.scala:506)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.simpleString(TreeNode.scala:534)
[info] at org.apache.spark.sql.catalyst.plans.QueryPlan.simpleString(QueryPlan.scala:179)
[info] at org.apache.spark.sql.catalyst.plans.QueryPlan.verboseString(QueryPlan.scala:181)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:647)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:675)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:675)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:569)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:559)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:551)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.toString(TreeNode.scala:548)
[info] at org.apache.spark.sql.catalyst.errors.package$TreeNodeException.<init>(package.scala:36)
[info] at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:436)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:425)
[info] at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:102)
[info] at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:63)
[info] at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:132)
[info] at org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:261)
```
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Added unit test.
Closes#25925 from viirya/SPARK-29239.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose to replace function names in some expression examples by `_FUNC_`, and add a test to check that `_FUNC_` always present in all examples.
### Why are the changes needed?
Binding of a function name to an expression is performed in `FunctionRegistry` which is single source of truth. Expression examples should avoid using function name directly because this can make the examples invalid in the future.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Added new test to `SQLQuerySuite` which analyses expression example, and check presence of `_FUNC_`.
Closes#25924 from MaxGekk/fix-func-examples.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
when the current catalog is session catalog, get/set the current namespace from/to the `SessionCatalog`.
### Why are the changes needed?
It's super confusing that we don't have a single source of truth for the current namespace of the session catalog. It can be in `CatalogManager` or `SessionCatalog`.
Ideally, we should always track the current catalog/namespace in `CatalogManager`. However, there are many commands that do not support v2 catalog API. They ignore the current catalog in `CatalogManager` and blindly go to `SessionCatalog`. This means, we must keep track of the current namespace of session catalog even if the current catalog is not session catalog.
Thus, we can't use `CatalogManager` to track the current namespace of session catalog because it changes when the current catalog is changed. To keep single source of truth, we should only track the current namespace of session catalog in `SessionCatalog`.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Newly added and updated test cases.
Closes#25903 from cloud-fan/current.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
### What changes were proposed in this pull request?
Move the rule `RemoveAllHints` after the batch `Resolution`.
### Why are the changes needed?
User-defined hints can be resolved by the rules injected via `extendedResolutionRules` or `postHocResolutionRules`.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Added a test case
Closes#25746 from gatorsmile/moveRemoveAllHints.
Authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Added "array indices start at 1" in annotation to make it clear for the usage of function slice, in R Scala Python component
### Why are the changes needed?
It will throw exception if the value stare is 0, but array indices start at 0 most of times in other scenarios.
### Does this PR introduce any user-facing change?
Yes, more info provided to user.
### How was this patch tested?
No tests added, only doc change.
Closes#25704 from sheepstop/master.
Authored-by: sheepstop <yangting617@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
partition path should be qualified to store in catalog.
There are some scenes:
1. ALTER TABLE t PARTITION(b=1) SET LOCATION '/path/x'
should be qualified: file:/path/x
**Hive 2.0.0 does not support for location without schema here.**
```
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. {0} is not absolute or has no scheme information. Please specify a complete absolute uri with scheme information.
```
2. ALTER TABLE t PARTITION(b=1) SET LOCATION 'x'
should be qualified: file:/tablelocation/x
**Hive 2.0.0 does not support for relative location here.**
3. ALTER TABLE t ADD PARTITION(b=1) LOCATION '/path/x'
should be qualified: file:/path/x
**the same with Hive 2.0.0**
4. ALTER TABLE t ADD PARTITION(b=1) LOCATION 'x'
should be qualified: file:/tablelocation/x
**the same with Hive 2.0.0**
Currently only ALTER TABLE t ADD PARTITION(b=1) LOCATION for hive serde table has the expected qualified path. we should make other scenes to be consist with it.
Another change is for alter table location.
## How was this patch tested?
add / modify existing TestCases
Closes#17254 from windpiger/qualifiedPartitionPath.
Authored-by: windpiger <songjun@outlook.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR supports UPDATE in the parser and add the corresponding logical plan. The SQL syntax is a standard UPDATE statement:
```
UPDATE tableName tableAlias SET colName=value [, colName=value]+ WHERE predicate?
```
### Why are the changes needed?
With this change, we can start to implement UPDATE in builtin sources and think about how to design the update API in DS v2.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
New test cases added.
Closes#25626 from xianyinxin/SPARK-28892.
Authored-by: xy_xin <xianyin.xxy@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Rewrite
```
NOT isnull(x) -> isnotnull(x)
NOT isnotnull(x) -> isnull(x)
```
### Why are the changes needed?
Make LogicalPlan more readable and useful for query canonicalization. Make same condition equal when judge query canonicalization equal
### Does this PR introduce any user-facing change?
NO
### How was this patch tested?
Newly added UTs.
Closes#25878 from AngersZhuuuu/SPARK-29162.
Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Supported special string values for `DATE` type. They are simply notational shorthands that will be converted to ordinary date values when read. The following string values are supported:
- `epoch [zoneId]` - `1970-01-01`
- `today [zoneId]` - the current date in the time zone specified by `spark.sql.session.timeZone`.
- `yesterday [zoneId]` - the current date -1
- `tomorrow [zoneId]` - the current date + 1
- `now` - the date of running the current query. It has the same notion as `today`.
For example:
```sql
spark-sql> SELECT date 'tomorrow' - date 'yesterday';
2
```
### Why are the changes needed?
To maintain feature parity with PostgreSQL, see [8.5.1.4. Special Values](https://www.postgresql.org/docs/12/datatype-datetime.html)
### Does this PR introduce any user-facing change?
Previously, the parser fails on the special values with the error:
```sql
spark-sql> select date 'today';
Error in query:
Cannot parse the DATE value: today(line 1, pos 7)
```
After the changes, the special values are converted to appropriate dates:
```sql
spark-sql> select date 'today';
2019-09-06
```
### How was this patch tested?
- Added tests to `DateFormatterSuite` to check parsing special values from regular strings.
- Tests in `DateTimeUtilsSuite` check parsing those values from `UTF8String`
- Uncommented tests in `date.sql`
Closes#25708 from MaxGekk/datetime-special-values.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Refactoring of the `DateTimeUtils.getEpoch()` function by avoiding decimal operations that are pretty expensive, and converting the final result to the decimal type at the end.
### Why are the changes needed?
The changes improve performance of the `getEpoch()` method at least up to **20 times**.
Before:
```
Invoke extract for timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
cast to timestamp 256 277 33 39.0 25.6 1.0X
EPOCH of timestamp 23455 23550 131 0.4 2345.5 0.0X
```
After:
```
Invoke extract for timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
cast to timestamp 255 294 34 39.2 25.5 1.0X
EPOCH of timestamp 1049 1054 9 9.5 104.9 0.2X
```
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By existing test from `DateExpressionSuite`.
Closes#25881 from MaxGekk/optimize-extract-epoch.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Changed the `DateTimeUtils.getMilliseconds()` by avoiding the decimal division, and replacing it by setting scale and precision while converting microseconds to the decimal type.
### Why are the changes needed?
This improves performance of `extract` and `date_part()` by more than **50 times**:
Before:
```
Invoke extract for timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative Invoke extract for timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
cast to timestamp 397 428 45 25.2 39.7 1.0X
MILLISECONDS of timestamp 36723 36761 63 0.3 3672.3 0.0X
```
After:
```
Invoke extract for timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
cast to timestamp 278 284 6 36.0 27.8 1.0X
MILLISECONDS of timestamp 592 606 13 16.9 59.2 0.5X
```
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By existing test suite - `DateExpressionsSuite`
Closes#25871 from MaxGekk/optimize-epoch-millis.
Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This patch fixes the issue brought by [SPARK-21870](http://issues.apache.org/jira/browse/SPARK-21870): when generating code for parameter type, it doesn't consider array type in javaType. At least we have one, Spark should generate code for BinaryType as `byte[]`, but Spark create the code for BinaryType as `[B` and generated code fails compilation.
Below is the generated code which failed compilation (Line 380):
```
/* 380 */ private void agg_doAggregate_count_0([B agg_expr_1_1, boolean agg_exprIsNull_1_1, org.apache.spark.sql.catalyst.InternalRow agg_unsafeRowAggBuffer_1) throws java.io.IOException {
/* 381 */ // evaluate aggregate function for count
/* 382 */ boolean agg_isNull_26 = false;
/* 383 */ long agg_value_28 = -1L;
/* 384 */ if (!false && agg_exprIsNull_1_1) {
/* 385 */ long agg_value_31 = agg_unsafeRowAggBuffer_1.getLong(1);
/* 386 */ agg_isNull_26 = false;
/* 387 */ agg_value_28 = agg_value_31;
/* 388 */ } else {
/* 389 */ long agg_value_33 = agg_unsafeRowAggBuffer_1.getLong(1);
/* 390 */
/* 391 */ long agg_value_32 = -1L;
/* 392 */
/* 393 */ agg_value_32 = agg_value_33 + 1L;
/* 394 */ agg_isNull_26 = false;
/* 395 */ agg_value_28 = agg_value_32;
/* 396 */ }
/* 397 */ // update unsafe row buffer
/* 398 */ agg_unsafeRowAggBuffer_1.setLong(1, agg_value_28);
/* 399 */ }
```
There wasn't any test for HashAggregateExec specifically testing this, but randomized test in ObjectHashAggregateSuite could encounter this and that's why ObjectHashAggregateSuite is flaky.
### Why are the changes needed?
Without the fix, generated code from HashAggregateExec may fail compilation.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Added new UT. Without the fix, newly added UT fails.
Closes#25830 from HeartSaVioR/SPARK-29140.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
In the PR, I propose to change behavior of the `date_part()` function in handling `null` field, and make it the same as PostgreSQL has. If `field` parameter is `null`, the function should return `null` of the `double` type as PostgreSQL does:
```sql
# select date_part(null, date '2019-09-20');
date_part
-----------
(1 row)
# select pg_typeof(date_part(null, date '2019-09-20'));
pg_typeof
------------------
double precision
(1 row)
```
### Why are the changes needed?
The `date_part()` function was added to maintain feature parity with PostgreSQL but current behavior of the function is different in handling null as `field`.
### Does this PR introduce any user-facing change?
Yes.
Before:
```sql
spark-sql> select date_part(null, date'2019-09-20');
Error in query: null; line 1 pos 7
```
After:
```sql
spark-sql> select date_part(null, date'2019-09-20');
NULL
```
### How was this patch tested?
Add new tests to `DateFunctionsSuite for 2 cases:
- `field` = `null`, `source` = a date literal
- `field` = `null`, `source` = a date column
Closes#25865 from MaxGekk/date_part-null.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This pr is to propagate all the SQL configurations to executors in `SQLQueryTestSuite`. When the propagation enabled in the tests, a potential bug below becomes apparent;
```
CREATE TABLE num_data (id int, val decimal(38,10)) USING parquet;
....
select sum(udf(CAST(null AS Decimal(38,0)))) from range(1,4): QueryOutput(select sum(udf(CAST(null AS Decimal(38,0)))) from range(1,4),struct<>,java.lang.IllegalArgumentException
[info] requirement failed: MutableProjection cannot use UnsafeRow for output data types: decimal(38,0)) (SQLQueryTestSuite.scala:380)
```
The root culprit is that `InterpretedMutableProjection` has incorrect validation in the interpreter mode: `validExprs.forall { case (e, _) => UnsafeRow.isFixedLength(e.dataType) }`. This validation should be the same with the condition (`isMutable`) in `HashAggregate.supportsAggregate`: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L1126
### Why are the changes needed?
Bug fixes.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Added tests in `AggregationQuerySuite`
Closes#25831 from maropu/SPARK-29122.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
## What changes were proposed in this pull request?
This adds a new write API as proposed in the [SPIP to standardize logical plans](https://issues.apache.org/jira/browse/SPARK-23521). This new API:
* Uses clear verbs to execute writes, like `append`, `overwrite`, `create`, and `replace` that correspond to the new logical plans.
* Only creates v2 logical plans so the behavior is always consistent.
* Does not allow table configuration options for operations that cannot change table configuration. For example, `partitionedBy` can only be called when the writer executes `create` or `replace`.
Here are a few example uses of the new API:
```scala
df.writeTo("catalog.db.table").append()
df.writeTo("catalog.db.table").overwrite($"date" === "2019-06-01")
df.writeTo("catalog.db.table").overwritePartitions()
df.writeTo("catalog.db.table").asParquet.create()
df.writeTo("catalog.db.table").partitionedBy(days($"ts")).createOrReplace()
df.writeTo("catalog.db.table").using("abc").replace()
```
## How was this patch tested?
Added `DataFrameWriterV2Suite` that tests the new write API. Existing tests for v2 plans.
Closes#25681 from rdblue/SPARK-28612-add-data-frame-writer-v2.
Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
### What changes were proposed in this pull request?
This patch proposes to change the log level of logging generated code in case of compile error being occurred in UT. This would help to investigate compilation issue of generated code easier, as currently we got exception message of line number but there's no generated code being logged actually (as in most cases of UT the threshold of log level is at least WARN).
### Why are the changes needed?
This would help investigating issue on compilation error for generated code in UT.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
N/A
Closes#25835 from HeartSaVioR/MINOR-always-log-generated-code-on-fail-to-compile-in-unit-testing.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This addresses about 15 miscellaneous warnings that appear in the current build.
### Why are the changes needed?
No functional changes, it just slightly reduces the amount of extra warning output.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests, run manually.
Closes#25852 from srowen/BuildWarnings.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Currently, there are new configurations for compatibility with ANSI SQL:
* `spark.sql.parser.ansi.enabled`
* `spark.sql.decimalOperations.nullOnOverflow`
* `spark.sql.failOnIntegralTypeOverflow`
This PR is to add new configuration `spark.sql.ansi.enabled` and remove the 3 options above. When the configuration is true, Spark tries to conform to the ANSI SQL specification. It will be disabled by default.
### Why are the changes needed?
Make it simple and straightforward.
### Does this PR introduce any user-facing change?
The new features for ANSI compatibility will be set via one configuration `spark.sql.ansi.enabled`.
### How was this patch tested?
Existing unit tests.
Closes#25693 from gengliangwang/ansiEnabled.
Lead-authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Co-authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
This PR upgrade Scala to **2.12.10**.
Release notes:
- Fix regression in large string interpolations with non-String typed splices
- Revert "Generate shallower ASTs in pattern translation"
- Fix regression in classpath when JARs have 'a.b' entries beside 'a/b'
- Faster compiler: 5–10% faster since 2.12.8
- Improved compatibility with JDK 11, 12, and 13
- Experimental support for build pipelining and outline type checking
More details:
https://github.com/scala/scala/releases/tag/v2.12.10https://github.com/scala/scala/releases/tag/v2.12.9
## How was this patch tested?
Existing tests
Closes#25404 from wangyum/SPARK-28683.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
Simplify the return type for `lookupV2Relation` which makes the 3 callers more straightforward.
## How was this patch tested?
Existing unit tests.
Closes#25735 from jzhuge/lookupv2relation.
Authored-by: John Zhuge <jzhuge@apache.org>
Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
### What changes were proposed in this pull request?
#DataSet
fruit,color,price,quantity
apple,red,1,3
banana,yellow,2,4
orange,orange,3,5
xxx
This PR aims to fix the below
```
scala> spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", false)
scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").count
res1: Long = 4
```
This is caused by the issue [SPARK-24645](https://issues.apache.org/jira/browse/SPARK-24645).
SPARK-24645 issue can also be solved by [SPARK-25387](https://issues.apache.org/jira/browse/SPARK-25387)
### Why are the changes needed?
SPARK-24645 caused this regression, so reverted the code as it can also be solved by SPARK-25387
### Does this PR introduce any user-facing change?
No,
### How was this patch tested?
Added UT, and also tested the bug SPARK-24645
**SPARK-24645 regression**
![image](https://user-images.githubusercontent.com/35216143/65067957-4c08ff00-d9a5-11e9-8d43-a4a23a61e8b8.png)
Closes#25820 from sandeep-katta/SPARK-29101.
Authored-by: sandeep katta <sandeep.katta2007@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Supported special string values for `TIMESTAMP` type. They are simply notational shorthands that will be converted to ordinary timestamp values when read. The following string values are supported:
- `epoch [zoneId]` - `1970-01-01 00:00:00+00 (Unix system time zero)`
- `today [zoneId]` - midnight today.
- `yesterday [zoneId]` -midnight yesterday
- `tomorrow [zoneId]` - midnight tomorrow
- `now` - current query start time.
For example:
```sql
spark-sql> SELECT timestamp 'tomorrow';
2019-09-07 00:00:00
```
### Why are the changes needed?
To maintain feature parity with PostgreSQL, see [8.5.1.4. Special Values](https://www.postgresql.org/docs/12/datatype-datetime.html)
### Does this PR introduce any user-facing change?
Previously, the parser fails on the special values with the error:
```sql
spark-sql> select timestamp 'today';
Error in query:
Cannot parse the TIMESTAMP value: today(line 1, pos 7)
```
After the changes, the special values are converted to appropriate dates:
```sql
spark-sql> select timestamp 'today';
2019-09-06 00:00:00
```
### How was this patch tested?
- Added tests to `TimestampFormatterSuite` to check parsing special values from regular strings.
- Tests in `DateTimeUtilsSuite` check parsing those values from `UTF8String`
- Uncommented tests in `timestamp.sql`
Closes#25716 from MaxGekk/timestamp-special-values.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
1. After https://github.com/apache/spark/pull/21599, if the option "spark.sql.failOnIntegralTypeOverflow" is enabled, all the Binary Arithmetic operator will used the exact version function.
However, only `Add`/`Substract`/`Multiply` has a corresponding exact function in java.lang.Math . When the option "spark.sql.failOnIntegralTypeOverflow" is enabled, a runtime exception "BinaryArithmetics must override either exactMathMethod or genCode" is thrown if the other Binary Arithmetic operators are used, such as "Divide", "Remainder".
The exact math method should be called only when there is a corresponding function in `java.lang.Math`
2. Revise the log output of casting to `Int`/`Short`
3. Enable `spark.sql.failOnIntegralTypeOverflow` for pgSQL tests in `SQLQueryTestSuite`.
### Why are the changes needed?
1. Fix the bugs of https://github.com/apache/spark/pull/21599
2. The test case of pgSQL intends to check the overflow of integer/long type. We should enable `spark.sql.failOnIntegralTypeOverflow`.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Unit test.
Closes#25804 from gengliangwang/enableIntegerOverflowInSQLTest.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
**What changes were proposed in this pull request?**
Issue 1 : modifications not required as these are different formats for the same info. In the case of a Spark DataFrame, null is correct.
Issue 2 mentioned in JIRA Spark SQL "desc formatted tablename" is not showing the header # col_name,data_type,comment , seems to be the header has been removed knowingly as part of SPARK-20954.
Issue 3:
Corrected the Last Access time, the value shall display 'UNKNOWN' as currently system wont support the last access time evaluation, since hive was setting Last access time as '0' in metastore even though spark CatalogTable last access time value set as -1. this will make the validation logic of LasAccessTime where spark sets 'UNKNOWN' value if last access time value set as -1 (means not evaluated).
**Does this PR introduce any user-facing change?**
No
**How was this patch tested?**
Locally and corrected a ut.
Attaching the test report below
![SPARK-28930](https://user-images.githubusercontent.com/12999161/64484908-83a1d980-d236-11e9-8062-9facf3003e5e.PNG)
Closes#25720 from sujith71955/master_describe_info.
Authored-by: s71955 <sujithchacko.2010@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This pr refines the code of DELETE, including, 1, make `whereClause` to be optional, in which case DELETE will delete all of the data of a table; 2, add more test cases; 3, some other refines.
This is a following-up of SPARK-28351.
### Why are the changes needed?
An optional where clause in DELETE respects the SQL standard.
### Does this PR introduce any user-facing change?
Yes. But since this is a non-released feature, this change does not have any end-user affects.
### How was this patch tested?
New case is added.
Closes#25652 from xianyinxin/SPARK-28950.
Authored-by: xy_xin <xianyin.xxy@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose to create an instance of `TimestampFormatter` only once at the initialization, and reuse it inside of `nullSafeEval()` and `doGenCode()` in the case when the `fmt` parameter is foldable.
### Why are the changes needed?
The changes improve performance of the `date_format()` function.
Before:
```
format date: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
format date wholestage off 7180 / 7181 1.4 718.0 1.0X
format date wholestage on 7051 / 7194 1.4 705.1 1.0X
```
After:
```
format date: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
format date wholestage off 4787 / 4839 2.1 478.7 1.0X
format date wholestage on 4736 / 4802 2.1 473.6 1.0X
```
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
By existing test suites `DateExpressionsSuite` and `DateFunctionsSuite`.
Closes#25782 from MaxGekk/date_format-foldable.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
When InSet generates Java switch-based code, if the input set is empty, we don't generate switch condition, but a simple expression that is default case of original switch.
### Why are the changes needed?
SPARK-26205 adds an optimization to InSet that generates Java switch condition for certain cases. When the given set is empty, it is possibly that codegen causes compilation error:
```
[info] - SPARK-29100: InSet with empty input set *** FAILED *** (58 milliseconds)
[info] Code generation of input[0, int, true] INSET () failed:
[info] org.codehaus.janino.InternalCompilerException: failed to compile: org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass" in "generated.java": Compiling "apply(java.lang.Object _i)"; apply(java.lang.Object _i): Operand stack inconsistent at offset 45: Previous size 0, now 1
[info] org.codehaus.janino.InternalCompilerException: failed to compile: org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass" in "generated.java": Compiling "apply(java.lang.Object _i)"; apply(java.lang.Object _i): Operand stack inconsistent at offset 45: Previous size 0, now 1
[info] at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1308)
[info] at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1386)
[info] at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1383)
```
### Does this PR introduce any user-facing change?
Yes. Previously, when users have InSet against an empty set, generated code causes compilation error. This patch fixed it.
### How was this patch tested?
Unit test added.
Closes#25806 from viirya/SPARK-29100.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This pr proposes to define an individual method for each common subexpression in HashAggregateExec. In the current master, the common subexpr elimination code in HashAggregateExec is expanded in a single method; 4664a082c2/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala (L397)
The method size can be too big for JIT compilation, so I believe splitting it is beneficial for performance. For example, in a query `SELECT SUM(a + b), AVG(a + b + c) FROM VALUES (1, 1, 1) t(a, b, c)`,
the current master generates;
```
/* 098 */ private void agg_doConsume_0(InternalRow localtablescan_row_0, int agg_expr_0_0, int agg_expr_1_0, int agg_expr_2_0) throws java.io.IOException {
/* 099 */ // do aggregate
/* 100 */ // common sub-expressions
/* 101 */ int agg_value_6 = -1;
/* 102 */
/* 103 */ agg_value_6 = agg_expr_0_0 + agg_expr_1_0;
/* 104 */
/* 105 */ int agg_value_5 = -1;
/* 106 */
/* 107 */ agg_value_5 = agg_value_6 + agg_expr_2_0;
/* 108 */ boolean agg_isNull_4 = false;
/* 109 */ long agg_value_4 = -1L;
/* 110 */ if (!false) {
/* 111 */ agg_value_4 = (long) agg_value_5;
/* 112 */ }
/* 113 */ int agg_value_10 = -1;
/* 114 */
/* 115 */ agg_value_10 = agg_expr_0_0 + agg_expr_1_0;
/* 116 */ // evaluate aggregate functions and update aggregation buffers
/* 117 */ agg_doAggregate_sum_0(agg_value_10);
/* 118 */ agg_doAggregate_avg_0(agg_value_4, agg_isNull_4);
/* 119 */
/* 120 */ }
```
On the other hand, this pr generates;
```
/* 121 */ private void agg_doConsume_0(InternalRow localtablescan_row_0, int agg_expr_0_0, int agg_expr_1_0, int agg_expr_2_0) throws java.io.IOException {
/* 122 */ // do aggregate
/* 123 */ // common sub-expressions
/* 124 */ long agg_subExprValue_0 = agg_subExpr_0(agg_expr_2_0, agg_expr_0_0, agg_expr_1_0);
/* 125 */ int agg_subExprValue_1 = agg_subExpr_1(agg_expr_0_0, agg_expr_1_0);
/* 126 */ // evaluate aggregate functions and update aggregation buffers
/* 127 */ agg_doAggregate_sum_0(agg_subExprValue_1);
/* 128 */ agg_doAggregate_avg_0(agg_subExprValue_0);
/* 129 */
/* 130 */ }
```
I run some micro benchmarks for this pr;
```
(base) maropu~:$system_profiler SPHardwareDataType
Hardware:
Hardware Overview:
Processor Name: Intel Core i5
Processor Speed: 2 GHz
Number of Processors: 1
Total Number of Cores: 2
L2 Cache (per Core): 256 KB
L3 Cache: 4 MB
Memory: 8 GB
(base) maropu~:$java -version
java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
(base) maropu~:$ /bin/spark-shell --master=local[1] --conf spark.driver.memory=8g --conf spark.sql.shurtitions=1 -v
val numCols = 40
val colExprs = "id AS key" +: (0 until numCols).map { i => s"id AS _c$i" }
spark.range(3000000).selectExpr(colExprs: _*).createOrReplaceTempView("t")
val aggExprs = (2 until numCols).map { i =>
(0 until i).map(d => s"_c$d")
.mkString("AVG(", " + ", ")")
}
// Drops the time of a first run then pick that of a second run
timer { sql(s"SELECT ${aggExprs.mkString(", ")} FROM t").write.format("noop").save() }
// the master
maxCodeGen: 12957
Elapsed time: 36.309858661s
// this pr
maxCodeGen=4184
Elapsed time: 2.399490285s
```
### Why are the changes needed?
To avoid the too-long-function issue in JVMs.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added tests in `WholeStageCodegenSuite`
Closes#25710 from maropu/SplitSubexpr.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This pr proposes to print bytecode statistics (max class bytecode size, max method bytecode size, max constant pool size, and # of inner classes) for generated classes in debug prints, `debugCodegen`. Since these metrics are critical for codegen framework developments, I think its worth printing there. This pr intends to enable `debugCodegen` to print these metrics as following;
```
scala> sql("SELECT sum(v) FROM VALUES(1) t(v)").debugCodegen
Found 2 WholeStageCodegen subtrees.
== Subtree 1 / 2 (maxClassCodeSize:2693; maxMethodCodeSize:124; maxConstantPoolSize:130(0.20% used); numInnerClasses:0) ==
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
*(1) HashAggregate(keys=[], functions=[partial_sum(cast(v#0 as bigint))], output=[sum#5L])
+- *(1) LocalTableScan [v#0]
Generated code:
/* 001 */ public Object generate(Object[] references) {
/* 002 */ return new GeneratedIteratorForCodegenStage1(references);
/* 003 */ }
...
```
### Why are the changes needed?
For efficient developments
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Manually tested
Closes#25766 from maropu/PrintBytecodeStats.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
It's more clear to only do table lookup in `ResolveTables` rule (for v2 tables) and `ResolveRelations` rule (for v1 tables). `ResolveInsertInto` should only resolve the `InsertIntoStatement` with resolved relations.
### Why are the changes needed?
to make the code simpler
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
existing tests
Closes#25774 from cloud-fan/simplify.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR add the `namespaces` keyword to `TableIdentifierParserSuite`.
### Why are the changes needed?
Improve the test.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
N/A
Closes#25758 from highmoutain/3.0bugfix.
Authored-by: changchun.wang <251922566@qq.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This patch fixes the bug regarding NPE in SQLConf.get, which is only possible when SparkContext._dagScheduler is null due to stopping SparkContext. The logic doesn't seem to consider active SparkContext could be in progress of stopping.
Note that it can't be encountered easily as SparkContext.stop() blocks the main thread, but there're many cases which SQLConf.get is accessed concurrently while SparkContext.stop() is executing - users run another threads, or listener is accessing SQLConf.get after dagScheduler is set to null (this is the case what I encountered.)
### Why are the changes needed?
The bug brings NPE.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Previous patch #25753 was tested with new UT, and due to disruption with other tests in concurrent test run, the test is excluded in this patch.
Closes#25790 from HeartSaVioR/SPARK-29046-v2.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
In the PR, I propose to fix comments of date-time expressions, and replace the `yyyy` pattern by `uuuu` when the implementation supposes the former one.
### Why are the changes needed?
To make comments consistent to implementations.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By running Scala Style checker.
Closes#25796 from MaxGekk/year-pattern-uuuu-followup.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
reorganize the packages of DS v2 interfaces/classes:
1. `org.spark.sql.connector.catalog`: put `TableCatalog`, `Table` and other related interfaces/classes
2. `org.spark.sql.connector.expression`: put `Expression`, `Transform` and other related interfaces/classes
3. `org.spark.sql.connector.read`: put `ScanBuilder`, `Scan` and other related interfaces/classes
4. `org.spark.sql.connector.write`: put `WriteBuilder`, `BatchWrite` and other related interfaces/classes
### Why are the changes needed?
Data Source V2 has evolved a lot. It's a bit weird that `Expression` is in `org.spark.sql.catalog.v2` and `Table` is in `org.spark.sql.sources.v2`.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing tests
Closes#25700 from cloud-fan/package.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
# What changes were proposed in this pull request?
This patch fixes the bug regarding NPE in SQLConf.get, which is only possible when SparkContext._dagScheduler is null due to stopping SparkContext. The logic doesn't seem to consider active SparkContext could be in progress of stopping.
Note that it can't be encountered easily as `SparkContext.stop()` blocks the main thread, but there're many cases which SQLConf.get is accessed concurrently while SparkContext.stop() is executing - users run another threads, or listener is accessing SQLConf.get after dagScheduler is set to null (this is the case what I encountered.)
### Why are the changes needed?
The bug brings NPE.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added new UT to verify NPE doesn't occur. Without patch, the test fails with throwing NPE.
Closes#25753 from HeartSaVioR/SPARK-29046.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Remove `InsertIntoTable` and replace it's usage by `InsertIntoStatement`
### Why are the changes needed?
`InsertIntoTable` and `InsertIntoStatement` are almost identical (except some namings). It doesn't make sense to keep 2 identical plans. After the removal of `InsertIntoTable`, the analysis process becomes:
1. parser creates `InsertIntoStatement`
2. v2 rule `ResolveInsertInto` converts `InsertIntoStatement` to v2 commands.
3. v1 rules like `DataSourceAnalysis` and `HiveAnalysis` convert `InsertIntoStatement` to v1 commands.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing tests
Closes#25763 from cloud-fan/remove.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
- For trait without companion object constructor, currently the method to get constructor parameters `constructParams` in `ScalaReflection` will throw exception.
```
scala.ScalaReflectionException: <none> is not a term
at scala.reflect.api.Symbols$SymbolApi.asTerm(Symbols.scala:211)
at scala.reflect.api.Symbols$SymbolApi.asTerm$(Symbols.scala:211)
at scala.reflect.internal.Symbols$SymbolContextApiImpl.asTerm(Symbols.scala:106)
at org.apache.spark.sql.catalyst.ScalaReflection.getCompanionConstructor(ScalaReflection.scala:909)
at org.apache.spark.sql.catalyst.ScalaReflection.constructParams(ScalaReflection.scala:914)
at org.apache.spark.sql.catalyst.ScalaReflection.constructParams$(ScalaReflection.scala:912)
at org.apache.spark.sql.catalyst.ScalaReflection$.constructParams(ScalaReflection.scala:47)
at org.apache.spark.sql.catalyst.ScalaReflection.getConstructorParameters(ScalaReflection.scala:890)
at org.apache.spark.sql.catalyst.ScalaReflection.getConstructorParameters$(ScalaReflection.scala:886)
at org.apache.spark.sql.catalyst.ScalaReflection$.getConstructorParameters(ScalaReflection.scala:47)
```
- Instead this PR would throw exception:
```
Unable to find constructor for type [XXX]. This could happen if [XXX] is an interface or a trait without companion object constructor
UnsupportedOperationException:
```
In the normal usage of ExpressionEncoder, this can happen if the type is interface extending `scala.Product`. Also, since this is a protected method, this could have been other arbitrary types without constructor.
### Why are the changes needed?
- The error message `<none> is not a term` isn't helpful for users to understand the problem.
### Does this PR introduce any user-facing change?
- The exception would be thrown instead of runtime exception from the `scala.ScalaReflectionException`.
### How was this patch tested?
- Added a unit test to illustrate the `type` where expression encoder will fail and trigger the proposed error message.
Closes#25736 from mickjermsurawong-stripe/SPARK-29026.
Authored-by: Mick Jermsurawong <mickjermsurawong@stripe.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Implement the SHOW DATABASES logical and physical plans for data source v2 tables.
### Why are the changes needed?
To support `SHOW DATABASES` SQL commands for v2 tables.
### Does this PR introduce any user-facing change?
`spark.sql("SHOW DATABASES")` will return namespaces if the default catalog is set:
```
+---------------+
| namespace|
+---------------+
| ns1|
| ns1.ns1_1|
|ns1.ns1_1.ns1_2|
+---------------+
```
### How was this patch tested?
Added unit tests to `DataSourceV2SQLSuite`.
Closes#25601 from imback82/show_databases.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
This is a ANSI SQL and feature id is `T312`
```
<binary overlay function> ::=
OVERLAY <left paren> <binary value expression> PLACING <binary value expression>
FROM <start position> [ FOR <string length> ] <right paren>
```
This PR related to https://github.com/apache/spark/pull/24918 and support treat byte array.
ref: https://www.postgresql.org/docs/11/functions-binarystring.html
## How was this patch tested?
new UT.
There are some show of the PR on my production environment.
```
spark-sql> select overlay(encode('Spark SQL', 'utf-8') PLACING encode('_', 'utf-8') FROM 6);
Spark_SQL
Time taken: 0.285 s
spark-sql> select overlay(encode('Spark SQL', 'utf-8') PLACING encode('CORE', 'utf-8') FROM 7);
Spark CORE
Time taken: 0.202 s
spark-sql> select overlay(encode('Spark SQL', 'utf-8') PLACING encode('ANSI ', 'utf-8') FROM 7 FOR 0);
Spark ANSI SQL
Time taken: 0.165 s
spark-sql> select overlay(encode('Spark SQL', 'utf-8') PLACING encode('tructured', 'utf-8') FROM 2 FOR 4);
Structured SQL
Time taken: 0.141 s
```
Closes#25172 from beliefer/ansi-overlay-byte-array.
Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
The PR proposes to create a custom `RDD` which enables to propagate `SQLConf` also in cases not tracked by SQL execution, as it happens when a `Dataset` is converted to and RDD either using `.rdd` or `.queryExecution.toRdd` and then the returned RDD is used to invoke actions on it.
In this way, SQL configs are effective also in these cases, while earlier they were ignored.
### Why are the changes needed?
Without this patch, all the times `.rdd` or `.queryExecution.toRdd` are used, all the SQL configs set are ignored. An example of a reproducer can be:
```
withSQLConf(SQLConf.SUBEXPRESSION_ELIMINATION_ENABLED.key, "false") {
val df = spark.range(2).selectExpr((0 to 5000).map(i => s"id as field_$i"): _*)
df.createOrReplaceTempView("spark64kb")
val data = spark.sql("select * from spark64kb limit 10")
// Subexpression elimination is used here, despite it should have been disabled
data.describe()
}
```
### Does this PR introduce any user-facing change?
When a user calls `.queryExecution.toRdd`, a `SQLExecutionRDD` is returned wrapping the `RDD` of the execute. When `.rdd` is used, an additional `SQLExecutionRDD` is present in the hierarchy.
### How was this patch tested?
added UT
Closes#25643 from mgaido91/SPARK-28939.
Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
The `V2SessionCatalog` has 2 functionalities:
1. work as an adapter: provide v2 APIs and translate calls to the `SessionCatalog`.
2. allow users to extend it, so that they can add hooks to apply custom logic before calling methods of the builtin catalog (session catalog).
To leverage the second functionality, users must extend `V2SessionCatalog` which is an internal class. There is no doc to explain this usage.
This PR does 2 things:
1. refine the document of the config `spark.sql.catalog.session`.
2. add a public abstract class `CatalogExtension` for users to write implementations.
TODOs for followup PRs:
1. discuss if we should allow users to completely overwrite the v2 session catalog with a new one.
2. discuss to change the name of session catalog, so that it's less likely to conflict with existing namespace names.
## How was this patch tested?
existing tests
Closes#25104 from cloud-fan/session-catalog.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
When we set spark.sql.decimalOperations.allowPrecisionLoss to false.
For the sql below, the result will overflow and return null.
Case a:
`select case when 1=2 then 1 else 1.000000000000000000000001 end * 1`
Similar with the division operation.
This sql below will lost precision.
Case b:
`select case when 1=2 then 1 else 1.000000000000000000000001 end / 1`
Let us check the code of TypeCoercion.scala.
a75467432e/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala (L864-L875).
For binaryOperator, if the two operands have differnt datatype, rule ImplicitTypeCasts will find a common type and cast both operands to common type.
So, for these cases menthioned, their left operand is Decimal(34, 24) and right operand is Literal.
Their common type is Decimal(34,24), and Literal(1) will be casted to Decimal(34,24).
Then both operands are decimal type and they will be processed by decimalAndDecimal method of DecimalPrecision class.
Let's check the relative code.
a75467432e/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala (L123-L153)
When we don't allow precision loss, the result type of multiply operation in case a is Decimal(38, 38), and that of division operation in case b is Decimal(38, 20).
Then the multi operation in case a will overflow and division operation in case b will lost precision.
In this PR, we skip to handle the binaryOperator if DecimalType operands are involved and rule `DecimalPrecision` will handle it.
### Why are the changes needed?
Data will corrupt without this change.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Unit test.
Closes#25701 from turboFei/SPARK-29000.
Authored-by: turbofei <fwang12@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
The PR proposes to split the code for subexpression elimination before inlining the function calls all in the apply method for `Generate[Mutable|Unsafe]Projection`.
### Why are the changes needed?
Before this PR, code generation can fail due to the 64KB code size limit if a lot of subexpression elimination functions are generated. The added UT is a reproducer for the issue (thanks to the JIRA reporter and HyukjinKwon for it).
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
added UT
Closes#25642 from mgaido91/SPARK-28916.
Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR aims to avoid AQE regressions by avoiding changing a sort merge join to a broadcast hash join when the expected build plan has a high ratio of empty partitions, in which case sort merge join can actually perform faster. This PR achieves this by adding an internal join hint in order to let the planner know which side has this high ratio of empty partitions and it should avoid planning it as a build plan of a BHJ. Still, it won't affect the other side if the other side qualifies for a build plan of a BHJ.
### Why are the changes needed?
It is a performance improvement for AQE.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added UT.
Closes#25703 from maryannxue/aqe-demote-bhj.
Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
In the PR, I propose new function `date_part()`. The function is modeled on the traditional Ingres equivalent to the SQL-standard function `extract`:
```
date_part('field', source)
```
and added for feature parity with PostgreSQL (https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT).
The `source` can have `DATE` or `TIMESTAMP` type. Supported string values of `'field'` are:
- `millennium` - the current millennium for given date (or a timestamp implicitly casted to a date). For example, years in the 1900s are in the second millennium. The third millennium started _January 1, 2001_.
- `century` - the current millennium for given date (or timestamp). The first century starts at 0001-01-01 AD.
- `decade` - the current decade for given date (or timestamp). Actually, this is the year field divided by 10.
- isoyear` - the ISO 8601 week-numbering year that the date falls in. Each ISO 8601 week-numbering year begins with the Monday of the week containing the 4th of January.
- `year`, `month`, `day`, `hour`, `minute`, `second`
- `week` - the number of the ISO 8601 week-numbering week of the year. By definition, ISO weeks start on Mondays and the first week of a year contains January 4 of that year.
- `quarter` - the quarter of the year (1 - 4)
- `dayofweek` - the day of the week for date/timestamp (1 = Sunday, 2 = Monday, ..., 7 = Saturday)
- `dow` - the day of the week as Sunday (0) to Saturday (6)
- `isodow` - the day of the week as Monday (1) to Sunday (7)
- `doy` - the day of the year (1 - 365/366)
- `milliseconds` - the seconds field including fractional parts multiplied by 1,000.
- `microseconds` - the seconds field including fractional parts multiplied by 1,000,000.
- `epoch` - the number of seconds since 1970-01-01 00:00:00 local time in microsecond precision.
Here are examples:
```sql
spark-sql> select date_part('year', timestamp'2019-08-12 01:00:00.123456');
2019
spark-sql> select date_part('week', timestamp'2019-08-12 01:00:00.123456');
33
spark-sql> select date_part('doy', timestamp'2019-08-12 01:00:00.123456');
224
```
I changed implementation of `extract` to re-use `date_part()` internally.
## How was this patch tested?
Added `date_part.sql` and regenerated results of `extract.sql`.
Closes#25410 from MaxGekk/date_part.
Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Support generator in aggregate expressions.
In this PR, I check the aggregate logical plan, if its aggregateExpressions include generator, then convert this agg plan into "normal agg plan + generator plan + projection plan". I.e:
```
aggregate(with generator)
|--child_plan
```
===>
```
project
|--generator(resolved)
|--aggregate
|--child_plan
```
### Why are the changes needed?
We should support sql like:
```
select explode(array(min(a), max(a))) from t
```
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Unit test added.
Closes#25512 from WeichenXu123/explode_bug.
Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Rename `UnresolvedTable` to `V1Table` because it is not unresolved.
### Why are the changes needed?
The class name is inaccurate. This should be fixed before it is in a release.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#25683 from rdblue/SPARK-28979-rename-unresolved-table.
Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This patch implements dynamic partition pruning by adding a dynamic-partition-pruning filter if there is a partitioned table and a filter on the dimension table. The filter is then planned using a heuristic approach:
1. As a broadcast relation if it is a broadcast hash join. The broadcast relation will then be transformed into a reused broadcast exchange by the `ReuseExchange` rule; or
2. As a subquery duplicate if the estimated benefit of partition table scan being saved is greater than the estimated cost of the extra scan of the duplicated subquery; otherwise
3. As a bypassed condition (`true`).
### Why are the changes needed?
This is an important performance feature.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added UT
- Testing DPP by enabling / disabling the reuse broadcast results feature and / or the subquery duplication feature.
- Testing DPP with reused broadcast results.
- Testing the key iterators on different HashedRelation types.
- Testing the packing and unpacking of the broadcast keys in a LongType.
Closes#25600 from maryannxue/dpp.
Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
Datasource table now supports partition tables long ago. This commit adds the ability to translate
the InsertIntoTable(HiveTableRelation) to datasource table insertion.
## How was this patch tested?
Existing tests with some modification
Closes#25306 from advancedxy/SPARK-28573.
Authored-by: Xianjin YE <advancedxy@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
drop the table after the test `query builtin functions don't call the external catalog` executed
This is required for [SPARK-25464](https://github.com/apache/spark/pull/22466)
## How was this patch tested?
existing UT
Closes#25427 from sandeep-katta/cleanuptable.
Authored-by: sandeep katta <sandeep.katta2007@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
The Experimental and Evolving annotations are both (like Unstable) used to express that a an API may change. However there are many things in the code that have been marked that way since even Spark 1.x. Per the dev thread, anything introduced at or before Spark 2.3.0 is pretty much 'stable' in that it would not change without a deprecation cycle. Therefore I'd like to remove most of these annotations. And, remove the `:: Experimental ::` scaladoc tag too. And likewise for Python, R.
The changes below can be summarized as:
- Generally, anything introduced at or before Spark 2.3.0 has been unmarked as neither Evolving nor Experimental
- Obviously experimental items like DSv2, Barrier mode, ExperimentalMethods are untouched
- I _did_ unmark a few MLlib classes introduced in 2.4, as I am quite confident they're not going to change (e.g. KolmogorovSmirnovTest, PowerIterationClustering)
It's a big change to review, so I'd suggest scanning the list of _files_ changed to see if any area seems like it should remain partly experimental and examine those.
### Why are the changes needed?
Many of these annotations are incorrect; the APIs are de facto stable. Leaving them also makes legitimate usages of the annotations less meaningful.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#25558 from srowen/SPARK-28855.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
This adds a new write API as proposed in the [SPIP to standardize logical plans](https://issues.apache.org/jira/browse/SPARK-23521). This new API:
* Uses clear verbs to execute writes, like `append`, `overwrite`, `create`, and `replace` that correspond to the new logical plans.
* Only creates v2 logical plans so the behavior is always consistent.
* Does not allow table configuration options for operations that cannot change table configuration. For example, `partitionedBy` can only be called when the writer executes `create` or `replace`.
Here are a few example uses of the new API:
```scala
df.writeTo("catalog.db.table").append()
df.writeTo("catalog.db.table").overwrite($"date" === "2019-06-01")
df.writeTo("catalog.db.table").overwritePartitions()
df.writeTo("catalog.db.table").asParquet.create()
df.writeTo("catalog.db.table").partitionedBy(days($"ts")).createOrReplace()
df.writeTo("catalog.db.table").using("abc").replace()
```
## How was this patch tested?
Added `DataFrameWriterV2Suite` that tests the new write API. Existing tests for v2 plans.
Closes#25354 from rdblue/SPARK-28612-add-data-frame-writer-v2.
Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
## What changes were proposed in this pull request?
This PR aims to add "true", "yes", "1", "false", "no", "0", and unique prefixes as input for the boolean data type and ignore input whitespace. Please see the following what string representations are using for the boolean type in other databases.
https://www.postgresql.org/docs/devel/datatype-boolean.htmlhttps://docs.aws.amazon.com/redshift/latest/dg/r_Boolean_type.html
## How was this patch tested?
Added new tests to CastSuite.
Closes#25458 from younggyuchun/SPARK-27931.
Authored-by: younggyu chun <younggyuchun@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Adds support for the V2SessionCatalog for ALTER TABLE statements.
Implementation changes are ~50 loc. The rest is just test refactoring.
### Why are the changes needed?
To allow V2 DataSources to plug in through a configurable plugin interface without requiring the explicit use of catalog identifiers, and leverage ALTER TABLE statements.
### How was this patch tested?
By re-using existing tests in DataSourceV2SQLSuite.
Closes#25502 from brkyvz/alterV3.
Authored-by: Burak Yavuz <brkyvz@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
There are 2 in-memory `TableCatalog` and `Table` implementations for testing, in sql/catalyst and sql/core. This PR merges them.
After merging, there are 3 classes:
1. `InMemoryTable`
2. `InMemoryTableCatalog`
3. `StagingInMemoryTableCatalog`
For better maintainability, these 3 classes are put in 3 different files.
### Why are the changes needed?
reduce duplicated code
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
N/A
Closes#25610 from cloud-fan/dsv2-test.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Ryan Blue <blue@apache.org>
### What changes were proposed in this pull request?
Disallow conversions between `timestamp` type and `long` type in table insertion with ANSI store assignment policy.
### Why are the changes needed?
In the PR https://github.com/apache/spark/pull/25581, timestamp type is allowed to be converted to long type, since timestamp type is represented by long type internally, and both legacy mode and strict mode allows the conversion.
After reconsideration, I think we should disallow it. As per ANSI SQL section "4.4.2 Characteristics of numbers":
> A number is assignable only to sites of numeric type.
In PostgreSQL, the conversion between timestamp and long is also disallowed.
### Does this PR introduce any user-facing change?
Conversion between timestamp and long is disallowed in table insertion with ANSI store assignment policy.
### How was this patch tested?
Unit test
Closes#25615 from gengliangwang/disallowTimeStampToLong.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Revise the documentation of SQL option `spark.sql.storeAssignmentPolicy`.
### Why are the changes needed?
1. Need to point out the ANSI mode is mostly the same with PostgreSQL
2. Need to point out Legacy mode allows type coercion as long as it is valid casting
3. Better examples.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Uni test
Closes#25605 from gengliangwang/reviseDoc.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Introduce ANSI store assignment policy for table insertion.
With ANSI policy, Spark performs the type coercion of table insertion as per ANSI SQL.
### Why are the changes needed?
In Spark version 2.4 and earlier, when inserting into a table, Spark will cast the data type of input query to the data type of target table by coercion. This can be super confusing, e.g. users make a mistake and write string values to an int column.
In data source V2, by default, only upcasting is allowed when inserting data into a table. E.g. int -> long and int -> string are allowed, while decimal -> double or long -> int are not allowed. The rules of UpCast was originally created for Dataset type coercion. They are quite strict and different from the behavior of all existing popular DBMS. This is breaking change. It is possible that existing queries are broken after 3.0 releases.
Following ANSI SQL standard makes Spark consistent with the table insertion behaviors of popular DBMS like PostgreSQL/Oracle/Mysql.
### Does this PR introduce any user-facing change?
A new optional mode for table insertion.
### How was this patch tested?
Unit test
Closes#25581 from gengliangwang/ANSImode.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Make `spark.sql.crossJoin.enabled` default value true
### Why are the changes needed?
For implicit cross join, we can set up a watchdog to cancel it if running for a long time.
When "spark.sql.crossJoin.enabled" is false, because `CheckCartesianProducts` is implemented in logical plan stage, it may generate some mismatching error which may confuse end user:
* it's done in logical phase, so we may fail queries that can be executed via broadcast join, which is very fast.
* if we move the check to the physical phase, then a query may success at the beginning, and begin to fail when the table size gets larger (other people insert data to the table). This can be quite confusing.
* the CROSS JOIN syntax doesn't work well if join reorder happens.
* some non-equi-join will generate plan using cartesian product, but `CheckCartesianProducts` do not detect it and raise error.
So that in order to address this in simpler way, we can turn off showing this cross-join error by default.
For reference, I list some cases raising mismatching error here:
Providing:
```
spark.range(2).createOrReplaceTempView("sm1") // can be broadcast
spark.range(50000000).createOrReplaceTempView("bg1") // cannot be broadcast
spark.range(60000000).createOrReplaceTempView("bg2") // cannot be broadcast
```
1) Some join could be convert to broadcast nested loop join, but CheckCartesianProducts raise error. e.g.
```
select sm1.id, bg1.id from bg1 join sm1 where sm1.id < bg1.id
```
2) Some join will run by CartesianJoin but CheckCartesianProducts DO NOT raise error. e.g.
```
select bg1.id, bg2.id from bg1 join bg2 where bg1.id < bg2.id
```
### Does this PR introduce any user-facing change?
### How was this patch tested?
Closes#25520 from WeichenXu123/SPARK-28621.
Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
Currently we have 2 configs to specify which v2 sources should fallback to v1 code path. One config for read path, and one config for write path.
However, I found it's awkward to work with these 2 configs:
1. for `CREATE TABLE USING format`, should this be read path or write path?
2. for `V2SessionCatalog.loadTable`, we need to return `UnresolvedTable` if it's a DS v1 or we need to fallback to v1 code path. However, at that time, we don't know if the returned table will be used for read or write.
We don't have any new features or perf improvement in file source v2. The fallback API is just a safeguard if we have bugs in v2 implementations. There are not many benefits to support falling back to v1 for read and write path separately.
This PR proposes to merge these 2 configs into one.
## How was this patch tested?
existing tests
Closes#25465 from cloud-fan/merge-conf.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR adds support for INSERT INTO through both the SQL and DataFrameWriter APIs through the V2SessionCatalog.
### Why are the changes needed?
This will allow V2 tables to be plugged in through the V2SessionCatalog, and be used seamlessly with existing APIs.
### Does this PR introduce any user-facing change?
No behavior changes.
### How was this patch tested?
Pulled out a lot of tests so that they can be shared across the DataFrameWriter and SQL code paths.
Closes#25507 from brkyvz/insertSesh.
Lead-authored-by: Burak Yavuz <brkyvz@gmail.com>
Co-authored-by: Burak Yavuz <burak@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
This PR aims at improving the way physical plans are explained in spark.
Currently, the explain output for physical plan may look very cluttered and each operator's
string representation can be very wide and wraps around in the display making it little
hard to follow. This especially happens when explaining a query 1) Operating on wide tables
2) Has complex expressions etc.
This PR attempts to split the output into two sections. In the header section, we display
the basic operator tree with a number associated with each operator. In this section, we strictly
control what we output for each operator. In the footer section, each operator is verbosely
displayed. Based on the feedback from Maryann, the uncorrelated subqueries (SubqueryExecs) are not included in the main plan. They are printed separately after the main plan and can be
correlated by the originating expression id from its parent plan.
To illustrate, here is a simple plan displayed in old vs new way.
Example query1 :
```
EXPLAIN SELECT key, Max(val) FROM explain_temp1 WHERE key > 0 GROUP BY key HAVING max(val) > 0
```
Old :
```
*(2) Project [key#2, max(val)#15]
+- *(2) Filter (isnotnull(max(val#3)#18) AND (max(val#3)#18 > 0))
+- *(2) HashAggregate(keys=[key#2], functions=[max(val#3)], output=[key#2, max(val)#15, max(val#3)#18])
+- Exchange hashpartitioning(key#2, 200)
+- *(1) HashAggregate(keys=[key#2], functions=[partial_max(val#3)], output=[key#2, max#21])
+- *(1) Project [key#2, val#3]
+- *(1) Filter (isnotnull(key#2) AND (key#2 > 0))
+- *(1) FileScan parquet default.explain_temp1[key#2,val#3] Batched: true, DataFilters: [isnotnull(key#2), (key#2 > 0)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/explain_temp1], PartitionFilters: [], PushedFilters: [IsNotNull(key), GreaterThan(key,0)], ReadSchema: struct<key:int,val:int>
```
New :
```
Project (8)
+- Filter (7)
+- HashAggregate (6)
+- Exchange (5)
+- HashAggregate (4)
+- Project (3)
+- Filter (2)
+- Scan parquet default.explain_temp1 (1)
(1) Scan parquet default.explain_temp1 [codegen id : 1]
Output: [key#2, val#3]
(2) Filter [codegen id : 1]
Input : [key#2, val#3]
Condition : (isnotnull(key#2) AND (key#2 > 0))
(3) Project [codegen id : 1]
Output : [key#2, val#3]
Input : [key#2, val#3]
(4) HashAggregate [codegen id : 1]
Input: [key#2, val#3]
(5) Exchange
Input: [key#2, max#11]
(6) HashAggregate [codegen id : 2]
Input: [key#2, max#11]
(7) Filter [codegen id : 2]
Input : [key#2, max(val)#5, max(val#3)#8]
Condition : (isnotnull(max(val#3)#8) AND (max(val#3)#8 > 0))
(8) Project [codegen id : 2]
Output : [key#2, max(val)#5]
Input : [key#2, max(val)#5, max(val#3)#8]
```
Example Query2 (subquery):
```
SELECT * FROM explain_temp1 WHERE KEY = (SELECT Max(KEY) FROM explain_temp2 WHERE KEY = (SELECT Max(KEY) FROM explain_temp3 WHERE val > 0) AND val = 2) AND val > 3
```
Old:
```
*(1) Project [key#2, val#3]
+- *(1) Filter (((isnotnull(KEY#2) AND isnotnull(val#3)) AND (KEY#2 = Subquery scalar-subquery#39)) AND (val#3 > 3))
: +- Subquery scalar-subquery#39
: +- *(2) HashAggregate(keys=[], functions=[max(KEY#26)], output=[max(KEY)#45])
: +- Exchange SinglePartition
: +- *(1) HashAggregate(keys=[], functions=[partial_max(KEY#26)], output=[max#47])
: +- *(1) Project [key#26]
: +- *(1) Filter (((isnotnull(KEY#26) AND isnotnull(val#27)) AND (KEY#26 = Subquery scalar-subquery#38)) AND (val#27 = 2))
: : +- Subquery scalar-subquery#38
: : +- *(2) HashAggregate(keys=[], functions=[max(KEY#28)], output=[max(KEY)#43])
: : +- Exchange SinglePartition
: : +- *(1) HashAggregate(keys=[], functions=[partial_max(KEY#28)], output=[max#49])
: : +- *(1) Project [key#28]
: : +- *(1) Filter (isnotnull(val#29) AND (val#29 > 0))
: : +- *(1) FileScan parquet default.explain_temp3[key#28,val#29] Batched: true, DataFilters: [isnotnull(val#29), (val#29 > 0)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/explain_temp3], PartitionFilters: [], PushedFilters: [IsNotNull(val), GreaterThan(val,0)], ReadSchema: struct<key:int,val:int>
: +- *(1) FileScan parquet default.explain_temp2[key#26,val#27] Batched: true, DataFilters: [isnotnull(key#26), isnotnull(val#27), (val#27 = 2)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/explain_temp2], PartitionFilters: [], PushedFilters: [IsNotNull(key), IsNotNull(val), EqualTo(val,2)], ReadSchema: struct<key:int,val:int>
+- *(1) FileScan parquet default.explain_temp1[key#2,val#3] Batched: true, DataFilters: [isnotnull(key#2), isnotnull(val#3), (val#3 > 3)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/explain_temp1], PartitionFilters: [], PushedFilters: [IsNotNull(key), IsNotNull(val), GreaterThan(val,3)], ReadSchema: struct<key:int,val:int>
```
New:
```
Project (3)
+- Filter (2)
+- Scan parquet default.explain_temp1 (1)
(1) Scan parquet default.explain_temp1 [codegen id : 1]
Output: [key#2, val#3]
(2) Filter [codegen id : 1]
Input : [key#2, val#3]
Condition : (((isnotnull(KEY#2) AND isnotnull(val#3)) AND (KEY#2 = Subquery scalar-subquery#23)) AND (val#3 > 3))
(3) Project [codegen id : 1]
Output : [key#2, val#3]
Input : [key#2, val#3]
===== Subqueries =====
Subquery:1 Hosting operator id = 2 Hosting Expression = Subquery scalar-subquery#23
HashAggregate (9)
+- Exchange (8)
+- HashAggregate (7)
+- Project (6)
+- Filter (5)
+- Scan parquet default.explain_temp2 (4)
(4) Scan parquet default.explain_temp2 [codegen id : 1]
Output: [key#26, val#27]
(5) Filter [codegen id : 1]
Input : [key#26, val#27]
Condition : (((isnotnull(KEY#26) AND isnotnull(val#27)) AND (KEY#26 = Subquery scalar-subquery#22)) AND (val#27 = 2))
(6) Project [codegen id : 1]
Output : [key#26]
Input : [key#26, val#27]
(7) HashAggregate [codegen id : 1]
Input: [key#26]
(8) Exchange
Input: [max#35]
(9) HashAggregate [codegen id : 2]
Input: [max#35]
Subquery:2 Hosting operator id = 5 Hosting Expression = Subquery scalar-subquery#22
HashAggregate (15)
+- Exchange (14)
+- HashAggregate (13)
+- Project (12)
+- Filter (11)
+- Scan parquet default.explain_temp3 (10)
(10) Scan parquet default.explain_temp3 [codegen id : 1]
Output: [key#28, val#29]
(11) Filter [codegen id : 1]
Input : [key#28, val#29]
Condition : (isnotnull(val#29) AND (val#29 > 0))
(12) Project [codegen id : 1]
Output : [key#28]
Input : [key#28, val#29]
(13) HashAggregate [codegen id : 1]
Input: [key#28]
(14) Exchange
Input: [max#37]
(15) HashAggregate [codegen id : 2]
Input: [max#37]
```
Note:
I opened this PR as a WIP to start getting feedback. I will be on vacation starting tomorrow
would not be able to immediately incorporate the feedback. I will start to
work on them as soon as i can. Also, currently this PR provides a basic infrastructure
for explain enhancement. The details about individual operators will be implemented
in follow-up prs
## How was this patch tested?
Added a new test `explain.sql` that tests basic scenarios. Need to add more tests.
Closes#24759 from dilipbiswal/explain_feature.
Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
1. Fix the physical plan (`DescribeTableExec`) to have the same output attributes as the corresponding logical plan.
2. Remove `output` in statements since they are unresolved plans.
### Why are the changes needed?
Correctness of how output attributes should work.
### Does this PR introduce any user-facing change?
NO
### How was this patch tested?
Existing tests
Closes#25568 from imback82/describe_table.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
To follow ANSI SQL, we should support a configurable mode that throws exceptions when casting to integers causes overflow.
The behavior is similar to https://issues.apache.org/jira/browse/SPARK-26218, which throws exceptions on arithmetical operation overflow.
To unify it, the configuration is renamed from "spark.sql.arithmeticOperations.failOnOverFlow" to "spark.sql.failOnIntegerOverFlow"
## How was this patch tested?
Unit test
Closes#25461 from gengliangwang/AnsiCastIntegral.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR removes the `canonicalize(attrs: AttributeSeq)` from `PlanExpression` and taking care of normalizing expressions in `QueryPlan`.
### Why are the changes needed?
`Expression` has already a `canonicalized` method and having the `canonicalize` method in `PlanExpression` is confusing.
### Does this PR introduce any user-facing change?
Removes the `canonicalize` plan from `PlanExpression`. Also renames the `normalizeExprId` to `normalizeExpressions` in query plan.
### How was this patch tested?
This PR is a refactoring and passes the existing tests
Closes#25534 from dbaliafroozeh/ImproveCanonicalizeAPI.
Authored-by: Ali Afroozeh <ali.afroozeh@databricks.com>
Signed-off-by: herman <herman@databricks.com>
## What changes were proposed in this pull request?
Implements the SHOW TABLES logical and physical plans for data source v2 tables.
## How was this patch tested?
Added unit tests to `DataSourceV2SQLSuite`.
Closes#25247 from imback82/dsv2_show_tables.
Lead-authored-by: terryk <yuminkim@gmail.com>
Co-authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
After all the discussions in the dev list: http://apache-spark-developers-list.1001551.n3.nabble.com/Discuss-Follow-ANSI-SQL-on-table-insertion-td27531.html#a27562.
Here I propose that we can make the store assignment rules in the analyzer configurable, and the behavior of V1 and V2 should be consistent.
When inserting a value into a column with a different data type, Spark will perform type coercion. After this PR, we support 2 policies for the type coercion rules:
legacy and strict.
1. With legacy policy, Spark allows casting any value to any data type. The legacy policy is the only behavior in Spark 2.x and it is compatible with Hive.
2. With strict policy, Spark doesn't allow any possible precision loss or data truncation in type coercion, e.g. `int` and `long`, `float` -> `double` are not allowed.
Eventually, the "legacy" mode will be removed, so it is disallowed in data source V2.
To ensure backward compatibility with existing queries, the default store assignment policy for data source V1 is "legacy".
## How was this patch tested?
Unit test
Closes#25453 from gengliangwang/tableInsertRule.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Fix minor typo in SQLConf.
`FILE_COMRESSION_FACTOR` -> `FILE_COMPRESSION_FACTOR`
### Why are the changes needed?
Make conf more understandable.
### Does this PR introduce any user-facing change?
No. (`spark.sql.sources.fileCompressionFactor` is unchanged.)
### How was this patch tested?
Pass the Jenkins with the existing tests.
Closes#25538 from triplesheep/TYPO-FIX.
Authored-by: triplesheep <triplesheep0419@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR adds a simple cost model and a mechanism to compare the costs of the before and after plans of each re-optimization in Adaptive Query Execution. Now the workflow of AQE re-optimization is changed to: If the cost of the plan after re-optimization is lower than or equal to that of the plan before re-optimization and the plan has been changed after re-optimization (if equal), the current physical plan will be updated to the plan after re-optimization, otherwise it will remain unchanged until the next re-optimization.
### Why are the changes needed?
This new mechanism is to prevent regressions in Adaptive Query Execution caused by change of the plan introducing extra cost, in this PR specifically, change of SMJ to BHJ leading to extra `ShuffleExchangeExec`s.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added UT.
Closes#25456 from maryannxue/aqe-cost.
Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
<!--
Thanks for sending a pull request! Here are some tips for you:
1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
4. Be sure to keep the PR description updated to reflect all changes.
5. Please write your PR title to summarize what this PR proposes.
6. If possible, provide a concise example to reproduce the issue for a faster review.
-->
### What changes were proposed in this pull request?
<!--
Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue.
If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
2. If you fix some SQL features, you can provide some references of other DBMSes.
3. If there is design documentation, please add the link.
4. If there is a discussion in the mailing list, please add the link.
-->
The current namespace/catalog should be set to None at the beginning, so that we can read the new configs when reporting currennt namespace/catalog later.
### Why are the changes needed?
<!--
Please clarify why the changes are needed. For instance,
1. If you propose a new API, clarify the use case for a new API.
2. If you fix a bug, you can clarify why it is a bug.
-->
Fix a bug in CatalogManager, to reflect the change of default catalog config when reporting current catalog.
### Does this PR introduce any user-facing change?
<!--
If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
If no, write 'No'.
-->
No. The current namespace/catalog stuff is still internal right now.
### How was this patch tested?
<!--
If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
If tests were not added, please describe why they were not added and/or why it was difficult to add.
-->
a new test suite
Closes#25521 from cloud-fan/fix.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
## What changes were proposed in this pull request?
Introduces the collectInPlanAndSubqueries and subqueriesAll methods in QueryPlan that consider all the plans in the query plan, including the ones in nested subqueries.
## How was this patch tested?
Unit test added
Closes#25433 from dbaliafroozeh/IntroduceCollectInPlanAndSubqueries.
Authored-by: Ali Afroozeh <ali.afroozeh@databricks.com>
Signed-off-by: herman <herman@databricks.com>
## What changes were proposed in this pull request?
This PR adds a V1 fallback interface for writing to V2 Tables using V1 Writer interfaces. The only supported SaveMode that will be called on the target table will be an Append. The target table must use V2 interfaces such as `SupportsOverwrite` or `SupportsTruncate` to support Overwrite operations. It is up to the target DataSource implementation if this operation can be atomic or not.
We do not support dynamicPartitionOverwrite, as we cannot call a `commit` method that actually cleans up the data in the partitions that were touched through this fallback.
## How was this patch tested?
Will add tests and example implementation after comments + feedback. This is a proposal at this point.
Closes#25348 from brkyvz/v1WriteFallback.
Lead-authored-by: Burak Yavuz <brkyvz@gmail.com>
Co-authored-by: Burak Yavuz <burak@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
The expression `IntegralDivide`, which corresponds to the `div` operator, support only integral type. Postgres, though, allows it to work also with decimals.
The PR adds the support to decimal operands for this operation in order to have feature parity with postgres.
## How was this patch tested?
added UTs
Closes#25136 from mgaido91/SPARK-28322.
Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
## What changes were proposed in this pull request?
This is a pure refactor PR, which creates a new class `CatalogManager` to track the registered v2 catalogs, and provide the catalog up functionality.
`CatalogManager` also tracks the current catalog/namespace. We will implement corresponding commands in other PRs, like `USE CATALOG my_catalog`
## How was this patch tested?
existing tests
Closes#25368 from cloud-fan/refactor.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Some newer JDKs use the tzdata2018i database, which changes how certain (obscure) historical dates and timezones are handled. As previously, we can pretty much safely ignore these in tests, as the value may vary by JDK.
### Why are the changes needed?
Test otherwise fails using, for example, JDK 1.8.0_222. https://bugs.openjdk.java.net/browse/JDK-8215982 has a full list of JDKs which has this.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests
Closes#25504 from srowen/SPARK-28775.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This is a follow-up of #24761 which added a higher-order function `ArrayForAll`.
The PR mistakenly removed the `prettyName` from `ArrayExists` and forgot to add it to `ArrayForAll`.
### Why are the changes needed?
This reverts the `prettyName` back to `ArrayExists` not to affect explained plans, and adds it to `ArrayForAll` to clarify the `prettyName` as the same as the expressions around.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#25501 from ueshin/issues/SPARK-27905/pretty_names.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
This PR makes analysis error messages more meaningful when the function does not support the modifier DISTINCT:
```sql
postgres=# select upper(distinct a) from (values('a'), ('b')) v(a);
ERROR: DISTINCT specified, but upper is not an aggregate function
LINE 1: select upper(distinct a) from (values('a'), ('b')) v(a);
spark-sql> select upper(distinct a) from (values('a'), ('b')) v(a);
Error in query: upper does not support the modifier DISTINCT; line 1 pos 7
spark-sql>
```
After this pr:
```sql
spark-sql> select upper(distinct a) from (values('a'), ('b')) v(a);
Error in query: DISTINCT specified, but upper is not an aggregate function; line 1 pos 7
spark-sql>
```
## How was this patch tested?
Unit test
Closes#25486 from wangyum/DISTINCT.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
**Before Fix**
When a non existent permanent function is dropped, generic NoSuchFunctionException was thrown.- which printed "This function is neither a registered temporary function nor a permanent function registered in the database" .
This creates a ambiguity when a temp function in the same name exist.
**After Fix**
NoSuchPermanentFunctionException will be thrown, which will print
"NoSuchPermanentFunctionException:Function not found in database "
## How was this patch tested?
Unit test was run and corrected the UT.
Closes#25394 from PavithraRamachandran/funcIssue.
Lead-authored-by: pavithra <pavi.rams@gmail.com>
Co-authored-by: pavithraramachandran <pavi.rams@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
## What changes were proposed in this pull request?
We add support for the V2SessionCatalog for saveAsTable, such that V2 tables can plug in and leverage existing DataFrameWriter.saveAsTable APIs to write and create tables through the session catalog.
## How was this patch tested?
Unit tests. A lot of tests broke under hive when things were not working properly under `ResolveTables`, therefore I believe the current set of tests should be sufficient in testing the table resolution and read code paths.
Closes#25402 from brkyvz/saveAsV2.
Lead-authored-by: Burak Yavuz <brkyvz@gmail.com>
Co-authored-by: Burak Yavuz <burak@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
In the PR, I propose new expressions `Epoch`, `IsoYear`, `Milliseconds` and `Microseconds`, and support additional parameters of `extract()` for feature parity with PostgreSQL (https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT):
1. `epoch` - the number of seconds since 1970-01-01 00:00:00 local time in microsecond precision.
2. `isoyear` - the ISO 8601 week-numbering year that the date falls in. Each ISO 8601 week-numbering year begins with the Monday of the week containing the 4th of January.
3. `milliseconds` - the seconds field including fractional parts multiplied by 1,000.
4. `microseconds` - the seconds field including fractional parts multiplied by 1,000,000.
Here are examples:
```sql
spark-sql> SELECT EXTRACT(EPOCH FROM TIMESTAMP '2019-08-11 19:07:30.123456');
1565550450.123456
spark-sql> SELECT EXTRACT(ISOYEAR FROM DATE '2006-01-01');
2005
spark-sql> SELECT EXTRACT(MILLISECONDS FROM TIMESTAMP '2019-08-11 19:07:30.123456');
30123.456
spark-sql> SELECT EXTRACT(MICROSECONDS FROM TIMESTAMP '2019-08-11 19:07:30.123456');
30123456
```
## How was this patch tested?
Added new tests to `DateExpressionsSuite`, and uncommented existing tests in `extract.sql` and `pgSQL/date.sql`.
Closes#25408 from MaxGekk/extract-ext3.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
This pr adds DELETE support for V2 datasources. As a first step, this pr only support delete by source filters:
```scala
void delete(Filter[] filters);
```
which could not deal with complicated cases like subqueries.
Since it's uncomfortable to embed the implementation of DELETE in the current V2 APIs, a new mix-in of datasource is added, which is called `SupportsMaintenance`, similar to `SupportsRead` and `SupportsWrite`. A datasource which can be maintained means we can perform DELETE/UPDATE/MERGE/OPTIMIZE on the datasource, as long as the datasource implements the necessary mix-ins.
## How was this patch tested?
new test case.
Please review https://spark.apache.org/contributing.html before opening a pull request.
Closes#25115 from xianyinxin/SPARK-28351.
Authored-by: xy_xin <xianyin.xxy@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
Here is the problem description from the JIRA.
```
When the inputs contain the constant 'infinity', Spark SQL does not generate the expected results.
SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
FROM (VALUES ('1'), (CAST('infinity' AS DOUBLE))) v(x);
SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
FROM (VALUES ('infinity'), ('1')) v(x);
SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
FROM (VALUES ('infinity'), ('infinity')) v(x);
SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
FROM (VALUES ('-infinity'), ('infinity')) v(x);
The root cause: Spark SQL does not recognize the special constants in a case insensitive way. In PostgreSQL, they are recognized in a case insensitive way.
Link: https://www.postgresql.org/docs/9.3/datatype-numeric.html
```
In this PR, the casting code is enhanced to handle these `special` string literals in case insensitive manner.
## How was this patch tested?
Added tests in CastSuite and modified existing test suites.
Closes#25331 from dilipbiswal/double_infinity.
Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
Changed type of `sec` argument in the `make_timestamp()` function from `DOUBLE` to `DECIMAL(8, 6)`. The scale is set to 6 to cover microsecond fractions, and the precision is 2 digits for seconds + 6 digits for microsecond fraction. New type prevents losing precision in some cases, for example:
Before:
```sql
spark-sql> select make_timestamp(2019, 8, 12, 0, 0, 58.000001);
2019-08-12 00:00:58
```
After:
```sql
spark-sql> select make_timestamp(2019, 8, 12, 0, 0, 58.000001);
2019-08-12 00:00:58.000001
```
Also switching to `DECIMAL` fixes rounding `sec` towards "nearest neighbor" unless both neighbors are equidistant, in which case round up. For example:
Before:
```sql
spark-sql> select make_timestamp(2019, 8, 12, 0, 0, 0.1234567);
2019-08-12 00:00:00.123456
```
After:
```sql
spark-sql> select make_timestamp(2019, 8, 12, 0, 0, 0.1234567);
2019-08-12 00:00:00.123457
```
## How was this patch tested?
This was tested by `DateExpressionsSuite` and `pgSQL/timestamp.sql`.
Closes#25421 from MaxGekk/make_timestamp-decimal.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
In the PR, I propose additional synonyms for the `field` argument of `extract` supported by PostgreSQL. The `extract.sql` is updated to check all supported values of the `field` argument. The list of synonyms was taken from https://github.com/postgres/postgres/blob/master/src/backend/utils/adt/datetime.c .
## How was this patch tested?
By running `extract.sql` via:
```
$ build/sbt "sql/test-only *SQLQueryTestSuite -- -z extract.sql"
```
Closes#25438 from MaxGekk/extract-field-synonyms.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
A GROUPED_AGG pandas python udf can't work, if without group by clause, like `select udf(id) from table`.
This doesn't match with aggregate function like sum, count..., and also dataset API like `df.agg(udf(df['id']))`.
When we parse a udf (or an aggregate function) like that from SQL syntax, it is known as a function in a project. `GlobalAggregates` rule in analysis makes such project as aggregate, by looking for aggregate expressions. At the moment, we should also look for GROUPED_AGG pandas python udf.
## How was this patch tested?
Added tests.
Closes#25352 from viirya/SPARK-28422.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
#25242 proposed to disallow upcasting complex data types to string type, however, upcasting from null type to any types should still be safe.
## How was this patch tested?
Add corresponding case in `CastSuite`.
Closes#25425 from jiangxb1987/nullToString.
Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
This PR replaces `CatalogUtils.maskCredentials` with `SQLConf.get.redactOptions` to match other redacts.
## How was this patch tested?
unit test and manual tests:
Before this PR:
```sql
spark-sql> DESC EXTENDED test_spark_28675;
id int NULL
# Detailed Table Information
Database default
Table test_spark_28675
Owner root
Created Time Fri Aug 09 08:23:17 GMT-07:00 2019
Last Access Wed Dec 31 17:00:00 GMT-07:00 1969
Created By Spark 3.0.0-SNAPSHOT
Type MANAGED
Provider org.apache.spark.sql.jdbc
Location file:/user/hive/warehouse/test_spark_28675
Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Storage Properties [url=###, driver=com.mysql.jdbc.Driver, dbtable=test_spark_28675]
spark-sql> SHOW TABLE EXTENDED LIKE 'test_spark_28675';
default test_spark_28675 false Database: default
Table: test_spark_28675
Owner: root
Created Time: Fri Aug 09 08:23:17 GMT-07:00 2019
Last Access: Wed Dec 31 17:00:00 GMT-07:00 1969
Created By: Spark 3.0.0-SNAPSHOT
Type: MANAGED
Provider: org.apache.spark.sql.jdbc
Location: file:/user/hive/warehouse/test_spark_28675
Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Storage Properties: [url=###, driver=com.mysql.jdbc.Driver, dbtable=test_spark_28675]
Schema: root
|-- id: integer (nullable = true)
```
After this PR:
```sql
spark-sql> DESC EXTENDED test_spark_28675;
id int NULL
# Detailed Table Information
Database default
Table test_spark_28675
Owner root
Created Time Fri Aug 09 08:19:49 GMT-07:00 2019
Last Access Wed Dec 31 17:00:00 GMT-07:00 1969
Created By Spark 3.0.0-SNAPSHOT
Type MANAGED
Provider org.apache.spark.sql.jdbc
Location file:/user/hive/warehouse/test_spark_28675
Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Storage Properties [url=*********(redacted), driver=com.mysql.jdbc.Driver, dbtable=test_spark_28675]
spark-sql> SHOW TABLE EXTENDED LIKE 'test_spark_28675';
default test_spark_28675 false Database: default
Table: test_spark_28675
Owner: root
Created Time: Fri Aug 09 08:19:49 GMT-07:00 2019
Last Access: Wed Dec 31 17:00:00 GMT-07:00 1969
Created By: Spark 3.0.0-SNAPSHOT
Type: MANAGED
Provider: org.apache.spark.sql.jdbc
Location: file:/user/hive/warehouse/test_spark_28675
Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Storage Properties: [url=*********(redacted), driver=com.mysql.jdbc.Driver, dbtable=test_spark_28675]
Schema: root
|-- id: integer (nullable = true)
```
Closes#25395 from wangyum/SPARK-28675.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
In the PR, I propose new expressions `Millennium`, `Century` and `Decade`, and support additional parameters of `extract()` for feature parity with PostgreSQL (https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT):
1. `millennium` - the current millennium for given date (or a timestamp implicitly casted to a date). For example, years in the 1900s are in the second millennium. The third millennium started _January 1, 2001_.
2. `century` - the current millennium for given date (or timestamp). The first century starts at 0001-01-01 AD.
3. `decade` - the current decade for given date (or timestamp). Actually, this is the year field divided by 10.
Here are examples:
```sql
spark-sql> SELECT EXTRACT(MILLENNIUM FROM DATE '1981-01-19');
2
spark-sql> SELECT EXTRACT(CENTURY FROM DATE '1981-01-19');
20
spark-sql> SELECT EXTRACT(DECADE FROM DATE '1981-01-19');
198
```
## How was this patch tested?
Added new tests to `DateExpressionsSuite` and uncommented existing tests in `pgSQL/date.sql`.
Closes#25388 from MaxGekk/extract-ext2.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
Right now, batch DataFrame always changes the schema to nullable automatically (See this line: 325bc8e9c6/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala (L399)). But streaming file source is missing this.
This PR updates the streaming file source schema to force it be nullable. I also added a flag `spark.sql.streaming.fileSource.schema.forceNullable` to disable this change since some users may rely on the old behavior.
## How was this patch tested?
The new unit test.
Closes#25382 from zsxwing/SPARK-28651.
Authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
I propose new levels of truncations for the `date_trunc()` and `trunc()` functions:
1. `MICROSECOND` and `MILLISECOND` truncate values of the `TIMESTAMP` type to microsecond and millisecond precision.
2. `DECADE`, `CENTURY` and `MILLENNIUM` truncate dates/timestamps to lowest date of current decade/century/millennium.
Also the `WEEK` and `QUARTER` levels have been supported by the `trunc()` function.
The function is implemented similarly to `date_trunc` in PostgreSQL: https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-TRUNC to maintain feature parity with it.
Here are examples of `TRUNC`:
```sql
spark-sql> SELECT TRUNC('2015-10-27', 'DECADE');
2010-01-01
spark-sql> set spark.sql.datetime.java8API.enabled=true;
spark.sql.datetime.java8API.enabled true
spark-sql> SELECT TRUNC('1999-10-27', 'millennium');
1001-01-01
```
Examples of `DATE_TRUNC`:
```sql
spark-sql> SELECT DATE_TRUNC('CENTURY', '2015-03-05T09:32:05.123456');
2001-01-01T00:00:00Z
```
## How was this patch tested?
Added new tests to `DateTimeUtilsSuite`, `DateExpressionsSuite` and `DateFunctionsSuite`, and uncommented existing tests in `pgSQL/date.sql`.
Closes#25336 from MaxGekk/date_truct-ext.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
Adds checks around:
- The existence of transforms in the table schema (even in nested fields)
- Duplications of transforms
- Case sensitivity checks around column names
in the V2 table creation code paths.
## How was this patch tested?
Unit tests.
Closes#25305 from brkyvz/v2CreateTable.
Authored-by: Burak Yavuz <brkyvz@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
This PR makes `spark.sql.function.preferIntegralDivision` to internal configuration because it is only used for PostgreSQL test cases.
More details:
https://github.com/apache/spark/pull/25158#discussion_r309764541
## How was this patch tested?
N/A
Closes#25376 from wangyum/SPARK-28395-2.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
In `Catalogs.load`, the `pluginClassName` in the following code
```
String pluginClassName = conf.getConfString("spark.sql.catalog." + name, null);
```
is always null for built-in catalogs, e.g there is a SQLConf entry `spark.sql.catalog.session`.
This is because of https://github.com/apache/spark/pull/18852: SQLConf.conf.getConfString(key, null) always returns null.
## How was this patch tested?
Apply code changes of https://github.com/apache/spark/pull/24768 and tried loading session catalog.
Closes#25094 from gengliangwang/fixCatalogLoad.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
## What changes were proposed in this pull request?
The flag `spark.sql.decimalOperations.nullOnOverflow` is not honored by the `Cast` operator. This means that a casting which causes an overflow currently returns `null`.
The PR makes `Cast` respecting that flag, ie. when it is turned to false and a decimal overflow occurs, an exception id thrown.
## How was this patch tested?
Added UT
Closes#25253 from mgaido91/SPARK-28470.
Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
## What changes were proposed in this pull request?
Sometimes when you explain a query, you will get stuck for a while. What's worse, you will get stuck again if you explain again.
This is caused by `FileSourceScanExec`:
1. In its `toString`, it needs to report the number of partitions it reads. This needs to query the hive metastore.
2. In its `outputOrdering`, it needs to get all the files. This needs to query the hive metastore.
This PR fixes by:
1. `toString` do not need to report the number of partitions it reads. We should report it via SQL metrics.
2. The `outputOrdering` is not very useful. We can only apply it if a) all the bucket columns are read. b) there is only one file in each bucket. This condition is really hard to meet, and even if we meet, sorting an already sorted file is pretty fast and avoiding the sort is not that useful. I think it's worth to give up this optimization so that explain don't need to get stuck.
## How was this patch tested?
existing tests
Closes#25328 from cloud-fan/ui.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
Implements the `DESCRIBE TABLE` logical and physical plans for data source v2 tables.
## How was this patch tested?
Added unit tests to `DataSourceV2SQLSuite`.
Closes#25040 from mccheah/describe-table-v2.
Authored-by: mcheah <mcheah@palantir.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
Add's the higher order function `forall`, which tests an array to see if a predicate holds for every element.
The function is implemented in `org.apache.spark.sql.catalyst.expressions.ArrayForAll`.
The function is added to the function registry under the pretty name `forall`.
## How was this patch tested?
I've added appropriate unit tests for the new ArrayForAll expression in
`sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HigherOrderFunctionsSuite.scala`.
Also added tests for the function in `sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala`.
Not sure who is best to ask about this PR so:
HyukjinKwon rxin gatorsmile ueshin srowen hvanhovell gatorsmile
Closes#24761 from nvander1/feature/for_all.
Lead-authored-by: Nik Vanderhoof <nikolasrvanderhoof@gmail.com>
Co-authored-by: Nik <nikolasrvanderhoof@gmail.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
## What changes were proposed in this pull request?
In the PR, I propose to use existing expressions `DayOfYear`, `WeekDay` and `DayOfWeek`, and support additional parameters of `extract()` for feature parity with PostgreSQL (https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT):
1. `dow` - the day of the week as Sunday (0) to Saturday (6)
2. `isodow` - the day of the week as Monday (1) to Sunday (7)
3. `doy` - the day of the year (1 - 365/366)
Here are examples:
```sql
spark-sql> SELECT EXTRACT(DOW FROM TIMESTAMP '2001-02-16 20:38:40');
5
spark-sql> SELECT EXTRACT(ISODOW FROM TIMESTAMP '2001-02-18 20:38:40');
7
spark-sql> SELECT EXTRACT(DOY FROM TIMESTAMP '2001-02-16 20:38:40');
47
```
## How was this patch tested?
Updated `extract.sql`.
Closes#25367 from MaxGekk/extract-ext.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
This PR targets to rename `PullOutPythonUDFInJoinCondition` to `ExtractPythonUDFFromJoinCondition` and move to 'Extract Python UDFs' together with other Python UDF related rules.
Currently `PullOutPythonUDFInJoinCondition` rule is alone outside of other 'Extract Python UDFs' rules together.
and the name `ExtractPythonUDFFromJoinCondition` is matched to existing Python UDF extraction rules.
## How was this patch tested?
Existing tests should cover.
Closes#25358 from HyukjinKwon/move-python-join-rule.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
This patch tries to keep consistency whenever UTF-8 charset is needed, as using `StandardCharsets.UTF_8` instead of using "UTF-8". If the String type is needed, `StandardCharsets.UTF_8.name()` is used.
This change also brings the benefit of getting rid of `UnsupportedEncodingException`, as we're providing `Charset` instead of `String` whenever possible.
This also changes some private Catalyst helper methods to operate on encodings as `Charset` objects rather than strings.
## How was this patch tested?
Existing unit tests.
Closes#25335 from HeartSaVioR/SPARK-28601.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
This is an alternative solution of https://github.com/apache/spark/pull/24442 . It fails the query if ambiguous self join is detected, instead of trying to disambiguate it. The problem is that, it's hard to come up with a reasonable rule to disambiguate, the rule proposed by #24442 is mostly a heuristic.
### background of the self-join problem:
This is a long-standing bug and I've seen many people complaining about it in JIRA/dev list.
A typical example:
```
val df1 = …
val df2 = df1.filter(...)
df1.join(df2, df1("a") > df2("a")) // returns empty result
```
The root cause is, `Dataset.apply` is so powerful that users think it returns a column reference which can point to the column of the Dataset at anywhere. This is not true in many cases. `Dataset.apply` returns an `AttributeReference` . Different Datasets may share the same `AttributeReference`. In the example above, `df2` adds a Filter operator above the logical plan of `df1`, and the Filter operator reserves the output `AttributeReference` of its child. This means, `df1("a")` is exactly the same as `df2("a")`, and `df1("a") > df2("a")` always evaluates to false.
### The rule to detect ambiguous column reference caused by self join:
We can reuse the infra in #24442 :
1. each Dataset has a globally unique id.
2. the `AttributeReference` returned by `Dataset.apply` carries the ID and column position(e.g. 3rd column of the Dataset) via metadata.
3. the logical plan of a `Dataset` carries the ID via `TreeNodeTag`
When self-join happens, the analyzer asks the right side plan of join to re-generate output attributes with new exprIds. Based on it, a simple rule to detect ambiguous self join is:
1. find all column references (i.e. `AttributeReference`s with Dataset ID and col position) in the root node of a query plan.
2. for each column reference, traverse the query plan tree, find a sub-plan that carries Dataset ID and the ID is the same as the one in the column reference.
3. get the corresponding output attribute of the sub-plan by the col position in the column reference.
4. if the corresponding output attribute has a different exprID than the column reference, then it means this sub-plan is on the right side of a self-join and has regenerated its output attributes. This is an ambiguous self join because the column reference points to a table being self-joined.
## How was this patch tested?
existing tests and new test cases
Closes#25107 from cloud-fan/new-self-join.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
This adds an interface for catalog plugins that exposes namespace operations:
* `listNamespaces`
* `namespaceExists`
* `loadNamespaceMetadata`
* `createNamespace`
* `alterNamespace`
* `dropNamespace`
## How was this patch tested?
API only. Existing tests for regressions.
Closes#24560 from rdblue/SPARK-27661-add-catalog-namespace-api.
Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
## What changes were proposed in this pull request?
Explained why "Subquery" and "Join Reorder" optimization batches should be `FixedPoint(1)`, which was introduced in SPARK-28532 and SPARK-28530.
## How was this patch tested?
Existing UTs.
Closes#25320 from yeshengm/SPARK-28530-followup.
Lead-authored-by: Xiao Li <gatorsmile@gmail.com>
Co-authored-by: Yesheng Ma <kimi.ysma@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
See discussion on the JIRA (and dev). At heart, we find that math.log and math.pow can actually return slightly different results across platforms because of hardware optimizations. For the actual SQL log and pow functions, I propose that we should use StrictMath instead to ensure the answers are already the same. (This should have the benefit of helping tests pass on aarch64.)
Further, the atanh function (which is not part of java.lang.Math) can be implemented in a slightly different and more accurate way.
## How was this patch tested?
Existing tests (which will need to be changed).
Some manual testing locally to understand the numeric issues.
Closes#25279 from srowen/SPARK-28519.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
When PythonUDF is used in group by, and it is also in aggregate expression, like
```
SELECT pyUDF(a + 1), COUNT(b) FROM testData GROUP BY pyUDF(a + 1)
```
It causes analysis exception in `CheckAnalysis`, like
```
org.apache.spark.sql.AnalysisException: expression 'testdata.`a`' is neither present in the group by, nor is it an aggregate function.
```
First, `CheckAnalysis` can't check semantic equality between PythonUDFs.
Second, even we make it possible, runtime exception will be thrown
```
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: pythonUDF1#8615
...
Cause: java.lang.RuntimeException: Couldn't find pythonUDF1#8615 in [cast(pythonUDF0#8614 as int)#8617,count(b#8599)#8607L]
```
The cause is, `ExtractPythonUDFs` extracts both PythonUDFs in group by and aggregate expression. The PythonUDFs are two different aliases now in the logical aggregate. In runtime, we can't bind the resulting expression in aggregate to its grouping and aggregate attributes.
This patch proposes a rule `ExtractGroupingPythonUDFFromAggregate` to extract PythonUDFs in group by and evaluate them before aggregate. We replace the group by PythonUDF in aggregate expression with aliased result.
The query plan of query `SELECT pyUDF(a + 1), pyUDF(COUNT(b)) FROM testData GROUP BY pyUDF(a + 1)`, like
```
== Optimized Logical Plan ==
Project [CAST(pyUDF(cast((a + 1) as string)) AS INT)#8608, cast(pythonUDF0#8616 as bigint) AS CAST(pyUDF(cast(count(b) as string)) AS BIGINT)#8610L]
+- BatchEvalPython [pyUDF(cast(agg#8613L as string))], [pythonUDF0#8616]
+- Aggregate [cast(groupingPythonUDF#8614 as int)], [cast(groupingPythonUDF#8614 as int) AS CAST(pyUDF(cast((a + 1) as string)) AS INT)#8608, count(b#8599) AS agg#8613L]
+- Project [pythonUDF0#8615 AS groupingPythonUDF#8614, b#8599]
+- BatchEvalPython [pyUDF(cast((a#8598 + 1) as string))], [pythonUDF0#8615]
+- LocalRelation [a#8598, b#8599]
== Physical Plan ==
*(3) Project [CAST(pyUDF(cast((a + 1) as string)) AS INT)#8608, cast(pythonUDF0#8616 as bigint) AS CAST(pyUDF(cast(count(b) as string)) AS BIGINT)#8610L]
+- BatchEvalPython [pyUDF(cast(agg#8613L as string))], [pythonUDF0#8616]
+- *(2) HashAggregate(keys=[cast(groupingPythonUDF#8614 as int)#8617], functions=[count(b#8599)], output=[CAST(pyUDF(cast((a + 1) as string)) AS INT)#8608, agg#8613L])
+- Exchange hashpartitioning(cast(groupingPythonUDF#8614 as int)#8617, 5), true
+- *(1) HashAggregate(keys=[cast(groupingPythonUDF#8614 as int) AS cast(groupingPythonUDF#8614 as int)#8617], functions=[partial_count(b#8599)], output=[cast(groupingPythonUDF#8614 as int)#8617, count#8619L])
+- *(1) Project [pythonUDF0#8615 AS groupingPythonUDF#8614, b#8599]
+- BatchEvalPython [pyUDF(cast((a#8598 + 1) as string))], [pythonUDF0#8615]
+- LocalTableScan [a#8598, b#8599]
```
## How was this patch tested?
Added tests.
Closes#25215 from viirya/SPARK-28445.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
```sql
spark-sql> select cast(1);
19/07/26 00:54:17 ERROR SparkSQLDriver: Failed in [select cast(1)]
java.lang.UnsupportedOperationException: empty.init
at scala.collection.TraversableLike$class.init(TraversableLike.scala:451)
at scala.collection.mutable.ArrayOps$ofInt.scala$collection$IndexedSeqOptimized$$super$init(ArrayOps.scala:234)
at scala.collection.IndexedSeqOptimized$class.init(IndexedSeqOptimized.scala:135)
at scala.collection.mutable.ArrayOps$ofInt.init(ArrayOps.scala:234)
at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$$anonfun$7$$anonfun$11.apply(FunctionRegistry.scala:565)
at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$$anonfun$7$$anonfun$11.apply(FunctionRegistry.scala:558)
at scala.Option.getOrElse(Option.scala:121)
```
The reason is that we did not handle the case [`validParametersCount.length == 0`](2d74f14d74/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala (L588)) because the [parameter types](2d74f14d74/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala (L589)) can be `Expression`, `DataType` and `Option`. This PR makes it handle the case `validParametersCount.length == 0`.
## How was this patch tested?
unit tests
Closes#25261 from wangyum/SPARK-28521.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
When an overflow occurs performing an arithmetic operation, we are returning an incorrect value. Instead, we should throw an exception, as stated in the SQL standard.
## How was this patch tested?
added UT + existing UTs (improved)
Closes#21599 from mgaido91/SPARK-24598.
Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
This PR moves `replaceFunctionName(usage: String, functionName: String)`
from `DescribeFunctionCommand` to `ExpressionInfo` in order to make `ExpressionInfo` returns actual name instead of placeholder. We can get `ExpressionInfo`s directly through `SessionCatalog.lookupFunctionInfo` API and get the real names.
## How was this patch tested?
unit tests
Closes#25314 from wangyum/SPARK-28581.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
This PR aims to support ANSI SQL `Boolean-Predicate` syntax.
```sql
expression IS [NOT] TRUE
expression IS [NOT] FALSE
expression IS [NOT] UNKNOWN
```
There are some mainstream database support this syntax.
- **PostgreSQL:** https://www.postgresql.org/docs/9.1/functions-comparison.html
- **Hive:** https://issues.apache.org/jira/browse/HIVE-13583
- **Redshift:** https://docs.aws.amazon.com/redshift/latest/dg/r_Boolean_type.html
- **Vertica:** https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/LanguageElements/Predicates/Boolean-predicate.htm
For example:
```sql
spark-sql> select null is true, null is not true;
false true
spark-sql> select false is true, false is not true;
false true
spark-sql> select true is true, true is not true;
true false
spark-sql> select null is false, null is not false;
false true
spark-sql> select false is false, false is not false;
true false
spark-sql> select true is false, true is not false;
false true
spark-sql> select null is unknown, null is not unknown;
true false
spark-sql> select false is unknown, false is not unknown;
false true
spark-sql> select true is unknown, true is not unknown;
false true
```
**Note**: A null input is treated as the logical value "unknown".
## How was this patch tested?
Pass the Jenkins with the newly added test cases.
Closes#25074 from beliefer/ansi-sql-boolean-test.
Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
This PR makes the optimizer rule PullupCorrelatedPredicates idempotent.
## How was this patch tested?
A new test PullupCorrelatedPredicatesSuite
Closes#25268 from dilipbiswal/pr-25164.
Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
New function `make_timestamp()` takes 6 columns `year`, `month`, `day`, `hour`, `min`, `sec` + optionally `timezone`, and makes new column of the `TIMESTAMP` type. If values in the input columns are `null` or out of valid ranges, the function returns `null`. Valid ranges are:
- `year` - `[1, 9999]`
- `month` - `[1, 12]`
- `day` - `[1, 31]`
- `hour` - `[0, 23]`
- `min` - `[0, 59]`
- `sec` - `[0, 60]`. If the `sec` argument equals to 60, the seconds field is set to 0 and 1 minute is added to the final timestamp.
- `timezone` - an identifier of timezone. Actual database of timezones can be found there: https://www.iana.org/time-zones.
Also constructed timestamp must be valid otherwise `make_timestamp` returns `null`.
The function is implemented similarly to `make_timestamp` in PostgreSQL: https://www.postgresql.org/docs/11/functions-datetime.html to maintain feature parity with it.
Here is an example:
```sql
select make_timestamp(2014, 12, 28, 6, 30, 45.887);
2014-12-28 06:30:45.887
select make_timestamp(2014, 12, 28, 6, 30, 45.887, 'CET');
2014-12-28 10:30:45.887
select make_timestamp(2019, 6, 30, 23, 59, 60)
2019-07-01 00:00:00
```
Returned value has Spark Catalyst type `TIMESTAMP` which is similar to Oracle's `TIMESTAMP WITH LOCAL TIME ZONE` (see https://docs.oracle.com/cd/B28359_01/server.111/b28298/ch4datetime.htm#i1006169) where data is stored in the session time zone, and the time zone offset is not stored as part of the column data. When users retrieve the data, Spark returns it in the session time zone specified by the SQL config `spark.sql.session.timeZone`.
## How was this patch tested?
Added new tests to `DateExpressionsSuite`, and uncommented a test for `make_timestamp` in `pgSQL/timestamp.sql`.
Closes#25220 from MaxGekk/make_timestamp.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
These are what I found during working on #22282.
- Remove unused value: `UnsafeArraySuite#defaultTz`
- Remove redundant new modifier to the case class, `KafkaSourceRDDPartition`
- Remove unused variables from `RDD.scala`
- Remove trailing space from `structured-streaming-kafka-integration.md`
- Remove redundant parameter from `ArrowConvertersSuite`: `nullable` is `true` by default.
- Remove leading empty line: `UnsafeRow`
- Remove trailing empty line: `KafkaTestUtils`
- Remove unthrown exception type: `UnsafeMapData`
- Replace unused declarations: `expressions`
- Remove duplicated default parameter: `AnalysisErrorSuite`
- `ObjectExpressionsSuite`: remove duplicated parameters, conversions and unused variable
Closes#25251 from dongjinleekr/cleanup/201907.
Authored-by: Lee Dongjin <dongjin@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
This PR aims to add a SQL function alias `random` to the existing `rand` function.
Please note that this adds the alias to SQL layer only because this is for PostgreSQL feature parity.
- [PostgreSQL Random function](https://www.postgresql.org/docs/11/functions-math.html)
- [SPARK-23160 Port window.sql](https://github.com/apache/spark/pull/24881/files#diff-14489bae6b27814d4cde0456a7ae75c8R702)
- [SPARK-28406 Port union.sql](https://github.com/apache/spark/pull/25163/files#diff-23a3430e0e1ff88830cbb43701da1f2cR402)
## How was this patch tested?
Manual.
```sql
spark-sql> DESCRIBE FUNCTION random;
Function: random
Class: org.apache.spark.sql.catalyst.expressions.Rand
Usage: random([seed]) - Returns a random value with independent and identically distributed (i.i.d.) uniformly distributed values in [0, 1).
```
Closes#25282 from dongjoon-hyun/SPARK-28086.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
In the PR, I propose to use `uuuu` for years instead of `yyyy` in date/timestamp patterns without the era pattern `G` (https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html). **Parsing/formatting of positive years (current era) will be the same.** The difference is in formatting negative years belong to previous era - BC (Before Christ).
I replaced the `yyyy` pattern by `uuuu` everywhere except:
1. Test, Suite & Benchmark. Existing tests must work as is.
2. `SimpleDateFormat` because it doesn't support the `uuuu` pattern.
3. Comments and examples (except comments related to already replaced patterns).
Before the changes, the year of common era `100` and the year of BC era `-99`, showed similarly as `100`. After the changes negative years will be formatted with the `-` sign.
Before:
```Scala
scala> Seq(java.time.LocalDate.of(-99, 1, 1)).toDF().show
+----------+
| value|
+----------+
|0100-01-01|
+----------+
```
After:
```Scala
scala> Seq(java.time.LocalDate.of(-99, 1, 1)).toDF().show
+-----------+
| value|
+-----------+
|-0099-01-01|
+-----------+
```
## How was this patch tested?
By existing test suites, and added tests for negative years to `DateFormatterSuite` and `TimestampFormatterSuite`.
Closes#25230 from MaxGekk/year-pattern-uuuu.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
This PR add support typed `interval` expression:
```sql
spark-sql> select interval 'interval 3 year 1 hour';
interval 3 years 1 hours
spark-sql>
```
Please note that this pr did not add a cast alias for `interval` type like [other types](2d74f14d74/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala (L529-L541)) because neither PostgreSQL nor Hive supports this syntax.
## How was this patch tested?
unit tests
Closes#25241 from wangyum/SPARK-28424.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
Since for AQP the cost for joins can change between multiple runs, there is no reason that we have an idempotence enforcement on this optimizer batch. We thus make it `FixedPoint(1)` instead of `Once`.
## How was this patch tested?
Existing UTs.
Closes#25266 from yeshengm/SPARK-28530.
Lead-authored-by: Yesheng Ma <kimi.ysma@gmail.com>
Co-authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
In the Catalyst optimizer, the batch subquery actually calls the optimizer recursively. Therefore it makes no sense to enforce idempotence on it and we change this batch to `FixedPoint(1)`.
## How was this patch tested?
Existing UTs.
Closes#25267 from yeshengm/SPARK-28532.
Authored-by: Yesheng Ma <kimi.ysma@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
In SPARK-15370, We checked the expression at the root of the correlated subquery, in order to fix count bug. If a `PythonUDF` in in the checking path, evaluating it causes the failure as we can't statically evaluate `PythonUDF`. The Python UDF test added at SPARK-28277 shows this issue.
If we can statically evaluate the expression, we intercept NULL values coming from the outer join and replace them with the value that the subquery's expression like before, if it is not, we replace them with the `PythonUDF` expression, with statically evaluated parameters.
After this, the last query in `udf-except.sql` which throws `java.lang.UnsupportedOperationException` can be run:
```
SELECT t1.k
FROM t1
WHERE t1.v <= (SELECT udf(max(udf(t2.v)))
FROM t2
WHERE udf(t2.k) = udf(t1.k))
MINUS
SELECT t1.k
FROM t1
WHERE udf(t1.v) >= (SELECT min(udf(t2.v))
FROM t2
WHERE t2.k = t1.k)
-- !query 2 schema
struct<k:string>
-- !query 2 output
two
```
Note that this issue is also for other non-foldable expressions, like rand. As like PythonUDF, we can't call `eval` on this kind of expressions in optimization. The evaluation needs to defer to query runtime.
## How was this patch tested?
Added tests.
Closes#25204 from viirya/SPARK-28441.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
In adaptive query processing (AQE), query plans are optimized on the fly during execution. However, a few `Once` rules can be problematic for such optimization since they can either generate wrong plan/unnecessary intermediate plan nodes.
This PR enforces idempotence for "Once" batches that are supposed to run once. This is a key enabler for AQE re-optimization and can improve robustness for existing optimizer rules.
Once batches that are currently not idempotent are marked in a blacklist. We will submit followup PRs to fix idempotence of these rules.
## How was this patch tested?
Existing UTs. Failing Once rules are temporarily blacklisted.
Closes#25249 from yeshengm/idempotence-checker.
Authored-by: Yesheng Ma <kimi.ysma@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
Support multiple catalogs in the following InsertTable use cases:
- INSERT INTO [TABLE] catalog.db.tbl
- INSERT OVERWRITE TABLE catalog.db.tbl
Support matrix:
Overwrite|Partitioned Table|Partition Clause |Partition Overwrite Mode|Action
---------|-----------------|-----------------|------------------------|-----
false|*|*|*|AppendData
true|no|(empty)|*|OverwriteByExpression(true)
true|yes|p1,p2 or p1 or p2 or (empty)|STATIC|OverwriteByExpression(true)
true|yes|p2,p2 or p1 or p2 or (empty)|DYNAMIC|OverwritePartitionsDynamic
true|yes|p1=23,p2=3|*|OverwriteByExpression(p1=23 and p2=3)
true|yes|p1=23,p2 or p1=23|STATIC|OverwriteByExpression(p1=23)
true|yes|p1=23,p2 or p1=23|DYNAMIC|OverwritePartitionsDynamic
Notes:
- Assume the partitioned table has 2 partitions: p1 and p2.
- `STATIC` is the default Partition Overwrite Mode for data source tables.
- DSv2 tables currently do not support `IfPartitionNotExists`.
## How was this patch tested?
New tests.
All existing catalyst and sql/core tests.
Closes#24832 from jzhuge/SPARK-27845-pr.
Lead-authored-by: Ryan Blue <blue@apache.org>
Co-authored-by: John Zhuge <jzhuge@apache.org>
Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
## What changes were proposed in this pull request?
In the current implementation. complex types like Array/Map/StructType are allowed to upcast as StringType.
This is not safe casting. We should disallow it.
## How was this patch tested?
Update the existing test case
Closes#25242 from gengliangwang/fixUpCastStringType.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
https://github.com/apache/spark/pull/7355 add support casting between IntervalType and StringType for scala interface:
```scala
import org.apache.spark.sql.types._
import org.apache.spark.sql.catalyst.expressions._
Cast(Literal("interval 3 month 1 hours"), CalendarIntervalType).eval()
res0: Any = interval 3 months 1 hours
```
But SQL interface does not support it:
```sql
scala> spark.sql("SELECT CAST('interval 3 month 1 hour' AS interval)").show
org.apache.spark.sql.catalyst.parser.ParseException:
DataType interval is not supported.(line 1, pos 41)
== SQL ==
SELECT CAST('interval 3 month 1 hour' AS interval)
-----------------------------------------^^^
at org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$visitPrimitiveDataType$1(AstBuilder.scala:1931)
at org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:108)
at org.apache.spark.sql.catalyst.parser.AstBuilder.visitPrimitiveDataType(AstBuilder.scala:1909)
at org.apache.spark.sql.catalyst.parser.AstBuilder.visitPrimitiveDataType(AstBuilder.scala:52)
...
```
This PR add supports accepting the `interval` keyword in the schema string. So that SQL interface can support this feature.
## How was this patch tested?
unit tests
Closes#25189 from wangyum/SPARK-28435.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
query plan was designed to be immutable, but sometimes we do allow it to carry mutable states, because of the complexity of the SQL system. One example is `TreeNodeTag`. It's a state of `TreeNode` and can be carried over during copy and transform. The adaptive execution framework relies on it to link the logical and physical plans.
This leads to a problem: when we get `QueryExecution#analyzed`, the plan can be changed unexpectedly because it's mutable. I hit a real issue in https://github.com/apache/spark/pull/25107 : I use `TreeNodeTag` to carry dataset id in logical plans. However, the analyzed plan ends up with many duplicated dataset id tags in different nodes. It turns out that, the optimizer transforms the logical plan and add the tag to more nodes.
For example, the logical plan is `SubqueryAlias(Filter(...))`, and I expect only the `SubqueryAlais` has the dataset id tag. However, the optimizer removes `SubqueryAlias` and carries over the dataset id tag to `Filter`. When I go back to the analyzed plan, both `SubqueryAlias` and `Filter` has the dataset id tag, which breaks my assumption.
Since now query plan is mutable, I think it's better to limit the life cycle of a query plan instance. We can clone the query plan between analyzer, optimizer and planner, so that the life cycle is limited in one stage.
## How was this patch tested?
new test
Closes#25111 from cloud-fan/clone.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
This PR change `CalendarIntervalType`'s readable string representation from `calendarinterval` to `interval`.
## How was this patch tested?
Existing UT
Closes#25225 from wangyum/SPARK-28469.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
Fix CSV datasource to throw `com.univocity.parsers.common.TextParsingException` with large size message, which will make log output consume large disk space.
This issue is troublesome when sometimes we need parse CSV with large size column.
This PR proposes to set CSV parser/writer settings by `setErrorContentLength(1000)` to limit the error message length.
## How was this patch tested?
Manually.
```
val s = "a" * 40 * 1000000
Seq(s).toDF.write.mode("overwrite").csv("/tmp/bogdan/es4196.csv")
spark.read .option("maxCharsPerColumn", 30000000) .csv("/tmp/bogdan/es4196.csv").count
```
**Before:**
The thrown message will include error content of about 30MB size (The column size exceed the max value 30MB, so the error content include the whole parsed content, so it is 30MB).
**After:**
The thrown message will include error content like "...aaa...aa" (the number of 'a' is 1024), i.e. limit the content size to be 1024.
Closes#25184 from WeichenXu123/limit_csv_exception_size.
Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
New function `make_date()` takes 3 columns `year`, `month` and `day`, and makes new column of the `DATE` type. If values in the input columns are `null` or out of valid ranges, the function returns `null`. Valid ranges are:
- `year` - `[1, 9999]`
- `month` - `[1, 12]`
- `day` - `[1, 31]`
Also constructed date must be valid otherwise `make_date` returns `null`.
The function is implemented similarly to `make_date` in PostgreSQL: https://www.postgresql.org/docs/11/functions-datetime.html to maintain feature parity with it.
Here is an example:
```sql
select make_date(2013, 7, 15);
2013-07-15
```
## How was this patch tested?
Added new tests to `DateExpressionsSuite`.
Closes#25210 from MaxGekk/make_date-timestamp.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
Because `Encoder` is not thread safe, the user cannot reuse an `Encoder` in multiple `Dataset`s. However, creating an `Encoder` for a complicated class is slow due to Scala Reflection. To eliminate the cost of Scala Reflection, right now I usually use the private API `ExpressionEncoder.copy` as follows:
```scala
object FooEncoder {
private lazy val _encoder: ExpressionEncoder[Foo] = ExpressionEncoder[Foo]()
implicit def encoder: ExpressionEncoder[Foo] = _encoder.copy()
}
```
This PR proposes a new method `makeCopy` in `Encoder` so that the above codes can be rewritten using public APIs.
```scala
object FooEncoder {
private lazy val _encoder: Encoder[Foo] = Encoders.product[Foo]()
implicit def encoder: Encoder[Foo] = _encoder.makeCopy
}
```
The method name is consistent with `TreeNode.makeCopy`.
## How was this patch tested?
Jenkins
Closes#25209 from zsxwing/encoder-copy.
Authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
Implements the `REPLACE TABLE` and `REPLACE TABLE AS SELECT` logical plans. `REPLACE TABLE` is now a valid operation in spark-sql provided that the tables being modified are managed by V2 catalogs.
This also introduces an atomic mix-in that table catalogs can choose to implement. Table catalogs can now implement `TransactionalTableCatalog`. The semantics of this API are that table creation and replacement can be "staged" and then "committed".
On the execution of `REPLACE TABLE AS SELECT`, `REPLACE TABLE`, and `CREATE TABLE AS SELECT`, if the catalog implements transactional operations, the physical plan will use said functionality. Otherwise, these operations fall back on non-atomic variants. For `REPLACE TABLE` in particular, the usage of non-atomic operations can unfortunately lead to inconsistent state.
## How was this patch tested?
Unit tests - multiple additions to `DataSourceV2SQLSuite`.
Closes#24798 from mccheah/spark-27724.
Authored-by: mcheah <mcheah@palantir.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
When a `ScalaUDF` returns a value which overflows, currently it returns null regardless of the value of the config `spark.sql.decimalOperations.nullOnOverflow`.
The PR makes it respect the above-mentioned config and behave accordingly.
## How was this patch tested?
added UT
Closes#25144 from mgaido91/SPARK-28369.
Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
This pr proposes to add a prefix '*' to non-nullable attribute names in PlanTestBase.comparePlans failures. In the current master, nullability mismatches might generate the same error message for left/right logical plans like this;
```
// This failure message was extracted from #24765
- constraints should be inferred from aliased literals *** FAILED ***
== FAIL: Plans do not match ===
!'Join Inner, (two#0 = a#0) 'Join Inner, (two#0 = a#0)
:- Filter (isnotnull(a#0) AND (2 <=> a#0)) :- Filter (isnotnull(a#0) AND (2 <=> a#0))
: +- LocalRelation <empty>, [a#0, b#0, c#0] : +- LocalRelation <empty>, [a#0, b#0, c#0]
+- Project [2 AS two#0] +- Project [2 AS two#0]
+- LocalRelation <empty>, [a#0, b#0, c#0] +- LocalRelation <empty>, [a#0, b#0, c#0] (PlanTest.scala:145)
```
With this pr, this error message is changed to one below;
```
- constraints should be inferred from aliased literals *** FAILED ***
== FAIL: Plans do not match ===
!'Join Inner, (*two#0 = a#0) 'Join Inner, (*two#0 = *a#0)
:- Filter (isnotnull(a#0) AND (2 <=> a#0)) :- Filter (isnotnull(a#0) AND (2 <=> a#0))
: +- LocalRelation <empty>, [a#0, b#0, c#0] : +- LocalRelation <empty>, [a#0, b#0, c#0]
+- Project [2 AS two#0] +- Project [2 AS two#0]
+- LocalRelation <empty>, [a#0, b#0, c#0] +- LocalRelation <empty>, [a#0, b#0, c#0] (PlanTest.scala:145)
```
## How was this patch tested?
N/A
Closes#25213 from maropu/MarkForNullability.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
The optimize rule `PushDownPredicate` has been combined into `PushDownPredicates`, update the comment that references the old rule.
## How was this patch tested?
N/A
Closes#25207 from jiangxb1987/comment.
Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Performance issue using explode was found when a complex field contains huge array is to get duplicated as the number of exploded array elements. Given example:
```scala
val df = spark.sparkContext.parallelize(Seq(("1",
Array.fill(M)({
val i = math.random
(i.toString, (i + 1).toString, (i + 2).toString, (i + 3).toString)
})))).toDF("col", "arr")
.selectExpr("col", "struct(col, arr) as st")
.selectExpr("col", "st.col as col1", "explode(st.arr) as arr_col")
```
The explode causes `st` to be duplicated as many as the exploded elements.
Benchmarks it:
```
[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4
[info] Intel(R) Core(TM) i7-8750H CPU 2.20GHz
[info] generate big nested struct array: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] generate big nested struct array wholestage off 52668 53162 699 0.0 877803.4 1.0X
[info] generate big nested struct array wholestage on 47261 49093 1125 0.0 787690.2 1.1X
[info]
```
The query plan:
```
== Physical Plan ==
Project [col#508, st#512.col AS col1#515, arr_col#519]
+- Generate explode(st#512.arr), [col#508, st#512], false, [arr_col#519]
+- Project [_1#503 AS col#508, named_struct(col, _1#503, arr, _2#504) AS st#512]
+- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1, true, false) AS _1#503, mapobjects(MapObjects_loopValue84, MapObjects_loopIsNull84, ObjectType(class scala.Tuple4), if (isnull(lambdavariable(MapObjects_loopValue84, MapObjects_loopIsNull84, ObjectType(class scala.Tuple4), true))) null else named_struct(_1, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(lambdavariable(MapObjects_loopValue84, MapObjects_loopIsNull84, ObjectType(class scala.Tuple4), true))._1, true, false), _2, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(lambdavariable(MapObjects_loopValue84, MapObjects_loopIsNull84, ObjectType(class scala.Tuple4), true))._2, true, false), _3, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(lambdavariable(MapObjects_loopValue84, MapObjects_loopIsNull84, ObjectType(class scala.Tuple4), true))._3, true, false), _4, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(lambdavariable(MapObjects_loopValue84, MapObjects_loopIsNull84, ObjectType(class scala.Tuple4), true))._4, true, false)), knownnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2, None) AS _2#504]
+- Scan[obj#534]
```
This patch takes nested column pruning approach to prune unnecessary nested fields. It adds a projection of the needed nested fields as aliases on the child of `Generate`, and substitutes them by alias attributes on the projection on top of `Generate`.
Benchmarks it after the change:
```
[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4
[info] Intel(R) Core(TM) i7-8750H CPU 2.20GHz
[info] generate big nested struct array: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] generate big nested struct array wholestage off 311 331 28 0.2 5188.6 1.0X
[info] generate big nested struct array wholestage on 297 312 15 0.2 4947.3 1.0X
[info]
```
The query plan:
```
== Physical Plan ==
Project [col#592, _gen_alias_608#608 AS col1#599, arr_col#603]
+- Generate explode(st#596.arr), [col#592, _gen_alias_608#608], false, [arr_col#603]
+- Project [_1#587 AS col#592, named_struct(col, _1#587, arr, _2#588) AS st#596, _1#587 AS _gen_alias_608#608]
+- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(in
put[0, scala.Tuple2, true]))._1, true, false) AS _1#587, mapobjects(MapObjects_loopValue102, MapObjects_loopIsNull102, ObjectType(class scala.Tuple4),
if (isnull(lambdavariable(MapObjects_loopValue102, MapObjects_loopIsNull102, ObjectType(class scala.Tuple4), true))) null else named_struct(_1, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(lambdavariable(MapObjects_loopValue102, MapObjects_loopIsNull102, ObjectType(class scala.Tuple4), true))._1, true, false), _2, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(lambdavariable(MapObjects_loopValue102, MapObjects_loopIsNull102, ObjectType(class scala.Tuple4), true))._2, true, false), _3, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(lambdavariable(MapObjects_loopValue102, MapObjects_loopIsNull102, ObjectType(class scala.Tuple4), true))._3, true, false), _4, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(lambdavariable(MapObjects_loopValue102, MapObjects_loopIsNull102, ObjectType(class scala.Tuple4), true))._4, true, false)), knownnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2, None) AS _2#588]
+- Scan[obj#586]
```
This behavior is controlled by a SQL config `spark.sql.optimizer.expression.nestedPruning.enabled`.
## How was this patch tested?
Added benchmark.
Closes#24637 from viirya/SPARK-27707.
Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
The `DateTimeUtils.timestampAddInterval` method was rewritten by using Java 8 time API. To add months and microseconds, I used the `plusMonths()` and `plus()` methods of `ZonedDateTime`. Also the signature of `timestampAddInterval()` was changed to accept an `ZoneId` instance instead of `TimeZone`. Using `ZoneId` allows to avoid the conversion `TimeZone` -> `ZoneId` on every invoke of `timestampAddInterval()`.
## How was this patch tested?
By existing test suites `DateExpressionsSuite`, `TypeCoercionSuite` and `CollectionExpressionsSuite`.
Closes#25173 from MaxGekk/timestamp-add-interval.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
A `Filter` predicate using `PythonUDF` can't be push down into join condition, currently. A predicate like that should be able to push down to join condition. For `PythonUDF`s that can't be evaluated in join condition, `PullOutPythonUDFInJoinCondition` will pull them out later.
An example like:
```scala
val pythonTestUDF = TestPythonUDF(name = "udf")
val left = Seq((1, 2), (2, 3)).toDF("a", "b")
val right = Seq((1, 2), (3, 4)).toDF("c", "d")
val df = left.crossJoin(right).where(pythonTestUDF($"a") === pythonTestUDF($"c"))
```
Query plan before the PR:
```
== Physical Plan ==
*(3) Project [a#2121, b#2122, c#2132, d#2133]
+- *(3) Filter (pythonUDF0#2142 = pythonUDF1#2143)
+- BatchEvalPython [udf(a#2121), udf(c#2132)], [pythonUDF0#2142, pythonUDF1#2143]
+- BroadcastNestedLoopJoin BuildRight, Cross
:- *(1) Project [_1#2116 AS a#2121, _2#2117 AS b#2122]
: +- LocalTableScan [_1#2116, _2#2117]
+- BroadcastExchange IdentityBroadcastMode
+- *(2) Project [_1#2127 AS c#2132, _2#2128 AS d#2133]
+- LocalTableScan [_1#2127, _2#2128]
```
Query plan after the PR:
```
== Physical Plan ==
*(3) Project [a#2121, b#2122, c#2132, d#2133]
+- *(3) BroadcastHashJoin [pythonUDF0#2142], [pythonUDF0#2143], Cross, BuildRight
:- BatchEvalPython [udf(a#2121)], [pythonUDF0#2142]
: +- *(1) Project [_1#2116 AS a#2121, _2#2117 AS b#2122]
: +- LocalTableScan [_1#2116, _2#2117]
+- BroadcastExchange HashedRelationBroadcastMode(List(input[2, string, true]))
+- BatchEvalPython [udf(c#2132)], [pythonUDF0#2143]
+- *(2) Project [_1#2127 AS c#2132, _2#2128 AS d#2133]
+- LocalTableScan [_1#2127, _2#2128]
```
After this PR, the join can use `BroadcastHashJoin`, instead of `BroadcastNestedLoopJoin`.
## How was this patch tested?
Added tests.
Closes#25106 from viirya/pythonudf-join-condition.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Existing random generators in tests produce wide ranges of values that can be out of supported ranges for:
- `DateType`, the valid range is `[0001-01-01, 9999-12-31]`
- `TimestampType` supports values in `[0001-01-01T00:00:00.000000Z, 9999-12-31T23:59:59.999999Z]`
- `CalendarIntervalType` should define intervals for the ranges above.
Dates and timestamps produced by random literal generators are usually out of valid ranges for those types. And tests just check invalid values or values caused by arithmetic overflow.
In the PR, I propose to restrict tested pseudo-random values by valid ranges of `DateType`, `TimestampType` and `CalendarIntervalType`. This should allow to check valid values in test, and avoid wasting time on a priori invalid inputs.
## How was this patch tested?
The changes were checked by `DateExpressionsSuite` and modified `DateTimeUtils.dateAddMonths`:
```Scala
def dateAddMonths(days: SQLDate, months: Int): SQLDate = {
localDateToDays(LocalDate.ofEpochDay(days).plusMonths(months))
}
```
Closes#25166 from MaxGekk/datetime-lit-random-gen.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
Idempotence of the `NormalizeFloatingNumbers` rule was broken due to the implementation of `ExtractEquiJoinKeys`. There is no reason that we don't remove `EqualNullSafe` join keys from an equi-join's `otherPredicates`.
## How was this patch tested?
A new UT.
Closes#25126 from yeshengm/spark-28306.
Authored-by: Yesheng Ma <kimi.ysma@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
In the PR, I propose to use the `plusMonths()` method of `LocalDate` to add months to a date. This method adds the specified amount to the months field of `LocalDate` in three steps:
1. Add the input months to the month-of-year field
2. Check if the resulting date would be invalid
3. Adjust the day-of-month to the last valid day if necessary
The difference between current behavior and propose one is in handling the last day of month in the original date. For example, adding 1 month to `2019-02-28` will produce `2019-03-28` comparing to the current implementation where the result is `2019-03-31`.
The proposed behavior is implemented in MySQL and PostgreSQL.
## How was this patch tested?
By existing test suites `DateExpressionsSuite`, `DateFunctionsSuite` and `DateTimeUtilsSuite`.
Closes#25153 from MaxGekk/add-months.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
Adding support to hyperbolic functions like asinh\acosh\atanh in spark SQL.
Feature parity: https://www.postgresql.org/docs/12/functions-math.html#FUNCTIONS-MATH-HYP-TABLE
The followings are the diffence from PostgreSQL.
```
spark-sql> SELECT acosh(0); (PostgreSQL returns `ERROR: input is out of range`)
NaN
spark-sql> SELECT atanh(2); (PostgreSQL returns `ERROR: input is out of range`)
NaN
```
Teradata has similar behavior as PostgreSQL with out of range input float values - It outputs **Invalid Input: numeric value within range only.**
These newly added asinh/acosh/atanh handles special input(NaN, +-Infinity) in the same way as existing cos/sin/tan/acos/asin/atan in spark. For which input value range is not (-∞, ∞)):
out of range float values: Spark returns NaN and PostgreSQL shows input is out of range
NaN: Spark returns NaN, PostgreSQL also returns NaN
Infinity: Spark return NaN, PostgreSQL shows input is out of range
## How was this patch tested?
```
spark.sql("select asinh(xx)")
spark.sql("select acosh(xx)")
spark.sql("select atanh(xx)")
./build/sbt "testOnly org.apache.spark.sql.MathFunctionsSuite"
./build/sbt "testOnly org.apache.spark.sql.catalyst.expressions.MathExpressionsSuite"
```
Closes#25041 from Tonix517/SPARK-28133.
Authored-by: Tony Zhang <tony.zhang@uber.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
This PR adds compatibility of handling a `WITH` clause within another `WITH` cause. Before this PR these queries retuned `1` while after this PR they return `2` as PostgreSQL does:
```
WITH
t AS (SELECT 1),
t2 AS (
WITH t AS (SELECT 2)
SELECT * FROM t
)
SELECT * FROM t2
```
```
WITH t AS (SELECT 1)
SELECT (
WITH t AS (SELECT 2)
SELECT * FROM t
)
```
As this is an incompatible change, the PR introduces the `spark.sql.legacy.cte.substitution.enabled` flag as an option to restore old behaviour.
## How was this patch tested?
Added new UTs.
Closes#25029 from peter-toth/SPARK-28228.
Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
There are some hardcoded configs, using config entry to replace them.
## How was this patch tested?
Existing UT
Closes#25059 from WangGuangxin/ConfigEntry.
Authored-by: wangguangxin.cn <wangguangxin.cn@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
Implement `ALTER TABLE` for v2 tables:
* Add `AlterTable` logical plan and `AlterTableExec` physical plan
* Convert `ALTER TABLE` parsed plans to `AlterTable` when a v2 catalog is responsible for an identifier
* Validate that columns to alter exist in analyzer checks
* Fix nested type handling in `CatalogV2Util`
## How was this patch tested?
* Add extensive tests in `DataSourceV2SQLSuite`
Closes#24937 from rdblue/SPARK-28139-add-v2-alter-table.
Lead-authored-by: Ryan Blue <blue@apache.org>
Co-authored-by: Ryan Blue <rdblue@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>