### What changes were proposed in this pull request?
override the `sql` method of `StringTrim`, `StringTrimLeft` and `StringTrimRight`, to use the standard SQL syntax.
### Why are the changes needed?
The current implementation is wrong. It gives you a SQL string that returns different result.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
new tests
Closes#28281 from cloud-fan/sql.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
```sql
spark-sql> SELECT extract(dayofweek from '2009-07-26');
1
spark-sql> SELECT extract(dow from '2009-07-26');
0
spark-sql> SELECT extract(isodow from '2009-07-26');
7
spark-sql> SELECT dayofweek('2009-07-26');
1
spark-sql> SELECT weekday('2009-07-26');
6
```
Currently, there are 4 types of day-of-week range:
1. the function `dayofweek`(2.3.0) and extracting `dayofweek`(2.4.0) result as of Sunday(1) to Saturday(7)
2. extracting `dow`(3.0.0) results as of Sunday(0) to Saturday(6)
3. extracting` isodow` (3.0.0) results as of Monday(1) to Sunday(7)
4. the function `weekday`(2.4.0) results as of Monday(0) to Sunday(6)
Actually, extracting `dayofweek` and `dow` are both derived from PostgreSQL but have different meanings.
https://issues.apache.org/jira/browse/SPARK-23903https://issues.apache.org/jira/browse/SPARK-28623
In this PR, we make extracting `dow` as same as extracting `dayofweek` and the `dayofweek` function for historical reason and not breaking anything.
Also, add more documentation to the extracting function to make extract field more clear to understand.
### Why are the changes needed?
Consistency insurance
### Does this PR introduce any user-facing change?
yes, doc updated and extract `dow` is as same as `dayofweek`
### How was this patch tested?
1. modified ut
2. local SQL doc verification
#### before
![image](https://user-images.githubusercontent.com/8326978/79601949-3535b100-811c-11ea-957b-a33d68641181.png)
#### after
![image](https://user-images.githubusercontent.com/8326978/79601847-12a39800-811c-11ea-8ff6-aa329255d099.png)
Closes#28248 from yaooqinn/SPARK-31474.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR skips creating the partition specs in `ShufflePartitionsUtil` for 0-size partitions, which avoids launching unnecessary tasks that do nothing.
### Why are the changes needed?
launching tasks that do nothing is a waste.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
updated tests
Closes#28226 from cloud-fan/aqe.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR is the follow-up PR of https://github.com/apache/spark/pull/28003
- add a migration guide
- add an end-to-end test case.
### Why are the changes needed?
The original PR made the major behavior change in the user-facing RESET command.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Added a new end-to-end test
Closes#28265 from gatorsmile/spark-31234followup.
Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
### What changes were proposed in this pull request?
1. Generate rebasing arrays for micros up to 2037 in `RebaseDateTimeSuite.generateRebaseJson()`.
2. Exclude 4 time zones from the black list in `generateRebaseJson()`.
3. Re-generate JSON files with rebasing info - `gregorian-julian-rebase-micros.json` and `julian-gregorian-rebase-micros.json`.
### Why are the changes needed?
1. `sun.util.calendar.ZoneInfo` resolves DST after 2037 year incorrectly. See aa318070b2/jdk/src/share/classes/sun/util/calendar/ZoneInfo.java (L55-L62) . By restricting the rebase arrays to 2037 year, we follow the behaviour of `ZoneInfo` which uses DST of 2037 for all years beyond 2037.
2. To enable optimization of micros rebasing via switch arrays for the time zones:
- Asia/Tehran
- Iran
- Africa/Casablanca
- Africa/El_Aaiun
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By existing test suites `RebaseDateTimeUtils`, `DateTimeUtilsSuite` and `DateFunctionsSuite`.
Closes#28253 from MaxGekk/fix-4-time-zones-rebasing.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR intends to add an ExpressionInfo entry for EXTRACT for better documentations.
This PR comes from the comment in https://github.com/apache/spark/pull/21479#discussion_r409900080
### Why are the changes needed?
To make SQL documentations complete.
### Does this PR introduce any user-facing change?
Yes, this PR updates the `Spark SQL, Built-in Functions` page.
### How was this patch tested?
Run the example tests.
Closes#28251 from maropu/AddExtractExpr.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR is to dump the codegen and compilation time for benchmark query tests.
### Why are the changes needed?
Measure the codegen and compilation time costs in TPC-DS queries
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Manual test in my local laptop:
```
23:13:12.845 WARN org.apache.spark.sql.TPCDSQuerySuite:
=== Metrics of Whole-stage Codegen ===
Total code generation time: 21.275102261 seconds
Total compilation time: 12.223771828 seconds
```
Closes#28252 from gatorsmile/testMastercode.
Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This PR intends to fix a bug that occurs when comparing null types to decimal types in master/branch-3.0;
```
scala> Seq(BigDecimal(10)).toDF("v1").selectExpr("v1 = NULL").explain(true)
org.apache.spark.sql.AnalysisException: cannot resolve '(`v1` = NULL)' due to data type mismatch: differing types in '(`v1` = NULL)' (decimal(38,18) and null).; line 1 pos 0;
'Project [(v1#5 = null) AS (v1 = NULL)#7]
+- Project [value#2 AS v1#5]
+- LocalRelation [value#2]
...
```
The query above passed in v2.4.5.
### Why are the changes needed?
bugfix
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added tests.
Closes#28241 from maropu/SPARK-31468.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Currently, we can extract `millennium/century/decade/year/quarter/month/week/day/hour/minute/second(with fractions)//millisecond/microseconds` and `epoch` from interval values
While getting the `millennium/century/decade/year`, it means how many the interval `months` part can be converted to that unit-value. The content of `millennium/century/decade` will overlap `year` and each other.
While getting `month/day` and so on, it means the integral remainder of the previous unit. Here all the units including `year` are individual.
So while extracting `year`, `month`, `day`, `hour`, `minute`, `second`, which are ANSI primary datetime units, the semantic is `extracting`, but others might refer to `transforming`.
While getting epoch we have treat month as 30 days which varies the natural Calendar rules we use.
To avoid ambiguity, I suggest we should only support those extract field defined ANSI with their abbreviations.
### Why are the changes needed?
Extracting `millennium`, `century` etc does not obey the meaning of extracting, and they are not so useful and worth maintaining.
The `extract` is ANSI standard expression and `date_part` is its pg-specific alias function. The current support extract-fields are fully bought from PostgreSQL.
With a look at other systems like Presto/Hive, they don't support those ambiguous fields too.
e.g. Hive 2.2.x also take it from PostgreSQL but without introducing those ambiguous fields https://issues.apache.org/jira/secure/attachment/12828349/HIVE-14579
e.g. presto
```sql
presto> select extract(quater from interval '10-0' year to month);
Query 20200417_094723_00020_m8xq4 failed: line 1:8: Invalid EXTRACT field: quater
select extract(quater from interval '10-0' year to month)
presto> select extract(decade from interval '10-0' year to month);
Query 20200417_094737_00021_m8xq4 failed: line 1:8: Invalid EXTRACT field: decade
select extract(decade from interval '10-0' year to month)
```
### Does this PR introduce any user-facing change?
Yes, as we already have previews versions, this PR will remove support for extracting `millennium/century/decade/quarter/week/millisecond/microseconds` and `epoch` from intervals with `date_part` function
### How was this patch tested?
rm some used tests
Closes#28242 from yaooqinn/SPARK-31469.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
I checked all the config of Spark again. find some new commit not add version information.
**Test.scala**
Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.testing.skipValidateCores | 3.1.0 | SPARK-29154 | 474b1bb5c2bce2f83c4dd8e19b9b7c5b3aebd6c4#diff-8b4ea8f3b0cc1e7ce7e943de1abbb165 |
**SQL**
Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.sql.legacy.integerGroupingId | 3.1.0 | SPARK-30279 | 71c73d58f6e88d2558ed2e696897767d93bac60f#diff-9a6b543db706f1a90f790783d6930a13 |
The two config only exists in branch master.
### Why are the changes needed?
Supplement version information.
### Does this PR introduce any user-facing change?
'No'.
### How was this patch tested?
Jenkins test.
Closes#28233 from beliefer/sql-conf-version-legacy-integerGroupingId.
Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Make `UnsafeKVExternalSorter` / `VariableLengthRowBasedKeyValueBatch ` also respect `UnsafeAlignedOffset` when reading the record and update some out of date comemnts.
### Why are the changes needed?
Since `BytesToBytesMap` respects `UnsafeAlignedOffset` when writing the record, `UnsafeKVExternalSorter` should also respect `UnsafeAlignedOffset` when reading the record from `BytesToBytesMap` otherwise it will causes data correctness issue.
Unlike `UnsafeKVExternalSorter` may reading records from `BytesToBytesMap`, `VariableLengthRowBasedKeyValueBatch` writes and reads records by itself. Thus, similar to #22053 and [comment](https://github.com/apache/spark/pull/22053#issuecomment-411975239) there, fix for `VariableLengthRowBasedKeyValueBatch` more likely an improvement for the support of SPARC platform.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Manually tested `HashAggregationQueryWithControlledFallbackSuite` with `UAO_SIZE=8` to simulate SPARC platform. And tests only pass with this fix.
Closes#28195 from Ngone51/fix_uao.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR moves the `ExpressionEncoder.toRow` and `ExpressionEncoder.fromRow` functions into their own function objects(`ExpressionEncoder.Serializer` & `ExpressionEncoder.Deserializer`). This effectively makes the `ExpressionEncoder` stateless, thread-safe and (more) reusable. The function objects are not thread safe, however they are documented as such and should be used in a more limited scope (making it easier to reason about thread safety).
### Why are the changes needed?
ExpressionEncoders are not thread-safe. We had various (nasty) bugs because of this.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#28223 from hvanhovell/SPARK-31450.
Authored-by: herman <herman@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
SPARK-21870 (cb0cddf#diff-06dc5de6163687b7810aa76e7e152a76R146-R149) caused significant performance regression in cases where the source code size is fairly large as `HashAggregateExec` uses `Block.length` to decide on splitting the code. The change in `length` makes sense as the comment and extra new lines shouldn't be taken into account when deciding on splitting, but the regular expression based approach is very slow and adds a big relative overhead to cases where the execution is quick (small number of rows).
This PR:
- restores `Block.length` to its original form
- places comments in `HashAggragateExec` with `CodegenContext.registerComment` so as to appear only when comments are enabled (`spark.sql.codegen.comments=true`)
Before this PR:
```
deeply nested struct field r/w: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
250 deep x 400 rows (read in-mem) 1137 1143 8 0.1 11368.3 0.0X
```
After this PR:
```
deeply nested struct field r/w: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
250 deep x 400 rows (read in-mem) 167 180 7 0.6 1674.3 0.1X
```
### Why are the changes needed?
To fix performance regression.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing UTs.
Closes#28083 from peter-toth/SPARK-30564-use-comment-placeholders.
Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
In the PR, I propose to change rebasing of not-existed timestamps in the hybrid calendar (Julian + Gregorian since 1582-10-15) in the range [1582-10-05, 1582-10-15). Not existed timestamps from the range are shifted to the first valid date in the hybrid calendar - 1582-10-15. The changes affect only `rebaseGregorianToJulianMicros()` because reverse rebasing from the hybrid timestamps to Proleptic Gregorian timestamps does not have such problem.
The shifting affects only the date part of timestamps while keeping the time part as is. For example:
```
1582-10-10 00:11:22.334455 -> 1582-10-15 00:11:22.334455
```
### Why are the changes needed?
Currently, not-existed timestamps are shifted by standard difference between Julian and Gregorian calendar on 1582-10-04, for example 1582-10-14 00:00:00 -> 1582-10-24 00:00:00. That contradicts to shifting of not existed dates in other cases, for example:
```
scala> sql("select timestamp'1990-9-31 12:12:12'").show
+----------------------------------+
|TIMESTAMP('1990-10-01 12:12:12.0')|
+----------------------------------+
| 1990-10-01 12:12:12|
+----------------------------------+
```
### Does this PR introduce any user-facing change?
Yes, this impacts on conversion of Spark SQL `TIMESTAMP` values to external timestamps based on non-Proleptic Gregorian calendar. For example, while saving the 1582-10-14 12:13:14 date to ORC files, it will be shifted to the next valid date 1582-10-15 12:13:14.
### How was this patch tested?
- Added tests to `RebaseDateTimeSuite` and to `OrcSourceSuite`
- By existing test suites `DateTimeUtilsSuite`, `DateFunctionsSuite`, `DateExpressionsSuite`, `CollectionExpressionsSuite`, `ParquetIOSuite`.
Closes#28227 from MaxGekk/fix-not-exist-timestamps.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose to change rebasing of not-existed dates in the hybrid calendar (Julian + Gregorian since 1582-10-15) in the range (1582-10-04, 1582-10-15). Not existed dates from the range are shifted to the first valid date in the hybrid calendar - 1582-10-15. The changes affect only `rebaseGregorianToJulianDays()` because reverse rebasing from the hybrid dates to Proleptic Gregorian dates does not have such problem.
### Why are the changes needed?
Currently, not-existed dates are shifted by standard difference between Julian and Gregorian calendar on 1582-10-04, for example 1582-10-14 -> 1582-10-24. That's contradict to shifting not existed dates in other cases, for example:
```
scala> sql("select date'1990-9-31'").show
+-----------------+
|DATE '1990-10-01'|
+-----------------+
| 1990-10-01|
+-----------------+
```
### Does this PR introduce any user-facing change?
Yes, this impacts on conversion of Spark SQL `DATE` values to external dates based on non-Proleptic Gregorian calendar. For example, while saving the 1582-10-14 date to ORC files, it will be shifted to the next valid date 1582-10-15.
### How was this patch tested?
- Added tests to `RebaseDateTimeSuite` and to `OrcSourceSuite`
- By existing test suites `DateTimeUtilsSuite`, `DateFunctionsSuite`, `DateExpressionsSuite`, `CollectionExpressionsSuite`, `ParquetIOSuite`.
Closes#28225 from MaxGekk/fix-not-exist-dates.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Optimise the `toJavaDate()` method of `DateTimeUtils` by:
1. Re-using `rebaseGregorianToJulianDays` optimised by #28067
2. Creating `java.sql.Date` instances from milliseconds in UTC since the epoch instead of date-time fields. This allows to avoid "normalization" inside of `java.sql.Date`.
Also new benchmark for collecting dates is added to `DateTimeBenchmark`.
### Why are the changes needed?
The changes fix the performance regression of collecting `DATE` values comparing to Spark 2.4 (see `DateTimeBenchmark` in https://github.com/MaxGekk/spark/pull/27):
Spark 2.4.6-SNAPSHOT:
```
To/from Java's date-time: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
From java.sql.Date 559 603 38 8.9 111.8 1.0X
Collect dates 2306 3221 1558 2.2 461.1 0.2X
```
Before the changes:
```
To/from Java's date-time: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
From java.sql.Date 1052 1130 73 4.8 210.3 1.0X
Collect dates 3251 4943 1624 1.5 650.2 0.3X
```
After:
```
To/from Java's date-time: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
From java.sql.Date 416 419 3 12.0 83.2 1.0X
Collect dates 1928 2759 1180 2.6 385.6 0.2X
```
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
- By existing tests suites, in particular, `DateTimeUtilsSuite`, `RebaseDateTimeSuite`, `DateFunctionsSuite`, `DateExpressionsSuite`.
- Re-run `DateTimeBenchmark` in the environment:
| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 |
Closes#28212 from MaxGekk/optimize-toJavaDate.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose to re-use optimized implementation of days rebase function `rebaseJulianToGregorianDays()` introduced by the PR #28067 in conversion of `java.sql.Date` values to Catalyst's `DATE` values. The function `fromJavaDate` in `DateTimeUtils` was re-written by taking the implementation from Spark 2.4, and by rebasing the final results via `rebaseJulianToGregorianDays()`.
Also I updated `DateTimeBenchmark`, and added a benchmark for conversion from `java.sql.Date`.
### Why are the changes needed?
The PR fixes the regression of parallelizing a collection of `java.sql.Date` values, and improves performance of converting external values to Catalyst's `DATE` values:
- x4 on the master branch
- 30% against Spark 2.4.6-SNAPSHOT
Spark 2.4.6-SNAPSHOT:
```
To/from java.sql.Timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
From java.sql.Date 614 655 43 8.1 122.8 1.0X
```
Before the changes:
```
To/from java.sql.Timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
From java.sql.Date 1154 1206 46 4.3 230.9 1.0X
```
After:
```
To/from java.sql.Timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
From java.sql.Date 427 434 7 11.7 85.3 1.0X
```
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
- By existing tests suites, in particular, `DateTimeUtilsSuite`, `RebaseDateTimeSuite`, `DateFunctionsSuite`, `DateExpressionsSuite`.
- Re-run `DateTimeBenchmark` in the environment:
| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 |
Closes#28205 from MaxGekk/optimize-fromJavaDate.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Refine the code comments of days rebasing, to be consistent with the micros rebasing. i.e. one method is the actual implementation and the other variant is the optimized version.
### Why are the changes needed?
improve code comments
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
N/A
Closes#28199 from cloud-fan/comment.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Reuse the `rebaseGregorianToJulianMicros()` and `rebaseJulianToGregorianMicros()` functions introduced by the PR #28119 in `DateTimeUtils`.`toJavaTimestamp()` and `fromJavaTimestamp()`. Actually, new implementation is derived from Spark 2.4 + rebasing via pre-calculated rebasing maps.
### Why are the changes needed?
The changes speed up conversions to/from java.sql.Timestamp, and as a consequence the PR improve performance of ORC datasource in loading/saving timestamps:
- Saving ~ **x2.8 faster** in master, and -11% against Spark 2.4.6
- Loading - **x3.2-4.5 faster** in master, -5% against Spark 2.4.6
Before:
```
Save timestamps to ORC: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
after 1582 59877 59877 0 1.7 598.8 0.0X
before 1582 61361 61361 0 1.6 613.6 0.0X
Load timestamps from ORC: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, vec off 48197 48288 118 2.1 482.0 1.0X
after 1582, vec on 38247 38351 128 2.6 382.5 1.3X
before 1582, vec off 53179 53359 249 1.9 531.8 0.9X
before 1582, vec on 44076 44268 269 2.3 440.8 1.1X
```
After:
```
Save timestamps to ORC: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
after 1582 21250 21250 0 4.7 212.5 0.1X
before 1582 22105 22105 0 4.5 221.0 0.1X
Load timestamps from ORC: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, vec off 14903 14933 40 6.7 149.0 1.0X
after 1582, vec on 8342 8426 73 12.0 83.4 1.8X
before 1582, vec off 15528 15575 76 6.4 155.3 1.0X
before 1582, vec on 9025 9075 61 11.1 90.2 1.7X
```
Spark 2.4.6-SNAPSHOT:
```
Save timestamps to ORC: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
after 1582 18858 18858 0 5.3 188.6 1.0X
before 1582 18508 18508 0 5.4 185.1 1.0X
Load timestamps from ORC: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, vec off 14063 14177 143 7.1 140.6 1.0X
after 1582, vec on 5955 6029 100 16.8 59.5 2.4X
before 1582, vec off 14119 14126 7 7.1 141.2 1.0X
before 1582, vec on 5991 6007 25 16.7 59.9 2.3X
```
### Does this PR introduce any user-facing change?
Yes, the `to_utc_timestamp` function returns the later local timestamp in the case of overlapping local timestamps at daylight saving time. it's changed back to the 2.4 behavior.
### How was this patch tested?
- By existing test suite `DateTimeUtilsSuite`, `RebaseDateTimeSuite`, `DateFunctionsSuite`, `DateExpressionsSuites`, `ParquetIOSuite`, `OrcHadoopFsRelationSuite`.
- Re-generating results of the benchmarks `DateTimeBenchmark` and `DateTimeRebaseBenchmark` in the environment:
| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 |
Closes#28189 from MaxGekk/optimize-to-from-java-timestamp.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose to fallback to rebasing via local dates/timestamps for days/micros of before common era (BCE).
### Why are the changes needed?
It fixes the bug of rebasing dates/timestamps of BCE.
### Does this PR introduce any user-facing change?
Yes
### How was this patch tested?
- By existing tests in `RebaseDateTimeSuite` and `DateTimeUtilsSuite`
- Added tests for negative years to `RebaseDateTimeSuite`
Closes#28172 from MaxGekk/fix-era-in-date-micros-rebasing.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
With benchmark original, where the timestamp values are valid to the new parser
the result is
```scala
[info] Running benchmark: Read dates and timestamps
[info] Running case: timestamp strings
[info] Stopped after 3 iterations, 5781 ms
[info] Running case: parse timestamps from Dataset[String]
[info] Stopped after 3 iterations, 44764 ms
[info] Running case: infer timestamps from Dataset[String]
[info] Stopped after 3 iterations, 93764 ms
[info] Running case: from_json(timestamp)
[info] Stopped after 3 iterations, 59021 ms
```
When we modify the benchmark to
```scala
def timestampStr: Dataset[String] = {
spark.range(0, rowsNum, 1, 1).mapPartitions { iter =>
iter.map(i => s"""{"timestamp":"1970-01-01T01:02:03.${i % 100}"}""")
}.select($"value".as("timestamp")).as[String]
}
readBench.addCase("timestamp strings", numIters) { _ =>
timestampStr.noop()
}
readBench.addCase("parse timestamps from Dataset[String]", numIters) { _ =>
spark.read.schema(tsSchema).json(timestampStr).noop()
}
readBench.addCase("infer timestamps from Dataset[String]", numIters) { _ =>
spark.read.json(timestampStr).noop()
}
```
where the timestamp values are invalid for the new parser which causes a fallback to legacy parser(2.4).
the result is
```scala
[info] Running benchmark: Read dates and timestamps
[info] Running case: timestamp strings
[info] Stopped after 3 iterations, 5623 ms
[info] Running case: parse timestamps from Dataset[String]
[info] Stopped after 3 iterations, 506637 ms
[info] Running case: infer timestamps from Dataset[String]
[info] Stopped after 3 iterations, 509076 ms
```
About 10x perf-regression
BUT if we modify the timestamp pattern to `....HH:mm:ss[.SSS][XXX]` which make all timestamp values valid for the new parser to prohibit fallback, the result is
```scala
[info] Running benchmark: Read dates and timestamps
[info] Running case: timestamp strings
[info] Stopped after 3 iterations, 5623 ms
[info] Running case: parse timestamps from Dataset[String]
[info] Stopped after 3 iterations, 506637 ms
[info] Running case: infer timestamps from Dataset[String]
[info] Stopped after 3 iterations, 509076 ms
```
### Why are the changes needed?
Fix performance regression.
### Does this PR introduce any user-facing change?
NO
### How was this patch tested?
new tests added.
Closes#28181 from yaooqinn/SPARK-31414.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Check more strictly that a field name can be used as a valid Java identifier in `ScalaReflection.serializerFor`
To check that, `SourceVersion` is used so that we need not add reserved keywords to be checked manually for the future Java versions (e.g, underscore, var, yield), .
### Why are the changes needed?
In the current implementation, `enum` is not checked even though it's a reserved keyword.
Also, there are lots of characters and sequences of character including numeric literals but they are not checked.
So we can't get better error message with following code.
```
case class Data(`0`: Int)
Seq(Data(1)).toDF.show
20/04/11 03:24:24 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 43, Column 1: Expression "value_0 = value_3" is not a type
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 43, Column 1: Expression "value_0 = value_3" is not a type
...
```
### Does this PR introduce any user-facing change?
Yes. With this change and the code example above, we can get following error message.
```
java.lang.UnsupportedOperationException: `0` is not a valid identifier of Java and cannot be used as field name
- root class: "Data"
...
```
### How was this patch tested?
Add another assertion to existing test case.
Closes#28184 from sarutak/improve-identifier-check.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Currently NOT IN subqueries (predicated null aware subquery) are not allowed inside OR expressions. We currently catch this condition in checkAnalysis and throw an error.
This PR enhances the subquery rewrite to support this type of queries.
Query
```SQL
SELECT * FROM s1 WHERE a > 5 or b NOT IN (SELECT c FROM s2);
```
Optimized Plan
```SQL
== Optimized Logical Plan ==
Project [a#3, b#4]
+- Filter ((a#3 > 5) || NOT exists#7)
+- Join ExistenceJoin(exists#7), ((b#4 = c#5) || isnull((b#4 = c#5)))
:- HiveTableRelation `default`.`s1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#3, b#4]
+- Project [c#5]
+- HiveTableRelation `default`.`s2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c#5, d#6]
```
This is rework from #22141.
The original author of this PR is dilipbiswal.
Closes#22141
### Why are the changes needed?
For better usability.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added new tests in SQLQueryTestSuite, RewriteSubquerySuite and SubquerySuite.
Output from DB2 as a reference:
[nested-not-db2.txt](https://github.com/apache/spark/files/2299945/nested-not-db2.txt)
Closes#28158 from maropu/pr22141.
Lead-authored-by: Dilip Biswal <dkbiswal@gmail.com>
Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Co-authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Since 3.0.0, we make CalendarInterval public for input, it's better for it to be inferred to CalendarIntervalType.
In the PR, we add a rule for CalendarInterval to be mapped to CalendarIntervalType in ScalaRelection, then records(e.g case class, tuples ...) contains interval fields are able to convert to a Dataframe.
### Why are the changes needed?
CalendarInterval is public but can not be used as input for Datafame.
```scala
scala> import org.apache.spark.unsafe.types.CalendarInterval
import org.apache.spark.unsafe.types.CalendarInterval
scala> Seq((1, new CalendarInterval(1, 2, 3))).toDF("a", "b")
java.lang.UnsupportedOperationException: Schema for type org.apache.spark.unsafe.types.CalendarInterval is not supported
at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$schemaFor$1(ScalaReflection.scala:735)
```
this should be supported as well as
```scala
scala> sql("select interval 2 month 1 day a")
res2: org.apache.spark.sql.DataFrame = [a: interval]
```
### Does this PR introduce any user-facing change?
Yes, records(e.g case class, tuples ...) contains interval fields are able to convert to a Dataframe
### How was this patch tested?
add uts
Closes#28165 from yaooqinn/SPARK-31392.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
improve the code comment and make them consistent between `rebaseJulianToGregorian*` and `rebaseGregorianToJulian*`
### Why are the changes needed?
improve readability.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
N/A
Closes#28166 from cloud-fan/comment.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose to optimise the `DateTimeUtils`.`rebaseJulianToGregorianMicros()` and `rebaseGregorianToJulianMicros()` functions, and make them faster by using pre-calculated rebasing tables. This approach allows to avoid expensive conversions via local timestamps. For example, the `America/Los_Angeles` time zone has just a few time points when difference between Proleptic Gregorian calendar and the hybrid calendar (Julian + Gregorian since 1582-10-15) is changed in the time interval 0001-01-01 .. 2100-01-01:
| i | local timestamp | Proleptic Greg. seconds | Hybrid (Julian+Greg) seconds | difference in minutes|
| -- | ------- |----|----| ---- |
|0|0001-01-01 00:00|-62135568422|-62135740800|-2872|
|1|0100-03-01 00:00|-59006333222|-59006419200|-1432|
|...|...|...|...|...|
|13|1582-10-15 00:00|-12219264422|-12219264000|7|
|14|1883-11-18 12:00|-2717640000|-2717640000|0|
The difference in microseconds between Proleptic and hybrid calendars for any local timestamp in time intervals `[local timestamp(i), local timestamp(i+1))`, and for any microseconds in the time interval `[Gregorian micros(i), Gregorian micros(i+1))` is the same. In this way, we can rebase an input micros by following the steps:
1. Look at the table, and find the time interval where the micros falls to
2. Take the difference between 2 calendars for this time interval
3. Add the difference to the input micros. The result is rebased microseconds that has the same local timestamp representation.
Here are details of the implementation:
- Pre-calculated tables are stored to JSON files `gregorian-julian-rebase-micros.json` and `julian-gregorian-rebase-micros.json` in the resource folder of `sql/catalyst`. The diffs and switch time points are stored as seconds, for example:
```json
[
{
"tz" : "America/Los_Angeles",
"switches" : [ -62135740800, -59006419200, ... , -2717640000 ],
"diffs" : [ 172378, 85978, ..., 0 ]
}
]
```
The JSON files are generated by 2 tests in `RebaseDateTimeSuite` - `generate 'gregorian-julian-rebase-micros.json'` and `generate 'julian-gregorian-rebase-micros.json'`. Both tests are disabled by default.
The `switches` time points are ordered from old to recent timestamps. This condition is checked by the test `validate rebase records in JSON files` in `RebaseDateTimeSuite`. Also sizes of the `switches` and `diffs` arrays are the same (this is checked by the same test).
- The **_Asia/Tehran, Iran, Africa/Casablanca and Africa/El_Aaiun_** time zones weren't added to the JSON files, see [SPARK-31385](https://issues.apache.org/jira/browse/SPARK-31385)
- The rebase info from the JSON files is placed to hash tables - `gregJulianRebaseMap` and `julianGregRebaseMap`. I use `AnyRefMap` because it is almost 2 times faster than Scala's immutable Map. Also I tried `java.util.HashMap` but it has worse lookup time than `AnyRefMap` in our case.
The hash maps store the switch time points and diffs in microseconds precision to avoid conversions from microseconds to seconds in the runtime.
- I moved the code related to days and microseconds rebasing to the separate object `RebaseDateTime` to do not pollute `DateTimeUtils`. Tests related to date-time rebasing are moved to `RebaseDateTimeSuite` for the same reason.
- I placed rebasing via local timestamp to separate methods that require zone id as the first parameter assuming that the caller has zone id already. This allows to void unnecessary retrieving the default time zone. The methods are marked as `private[sql]` because they are used in `RebaseDateTimeSuite` as reference implementation.
- Modified the `rebaseGregorianToJulianMicros()` and `rebaseJulianToGregorianMicros()` methods in `RebaseDateTime` to look up the rebase tables first of all. If hash maps don't contain rebasing info for the given time zone id, the methods falls back to the implementation via local timestamps. This allows to support time zones specified as zone offsets like '-08:00'.
### Why are the changes needed?
To make timestamps rebasing faster:
- Saving timestamps to parquet files is ~ **x3.8 faster**
- Loading timestamps from parquet files is ~**x2.8 faster**.
- Loading timestamps by Vectorized reader ~**x4.6 faster**.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
- Added the test `validate rebase records in JSON files` to `RebaseDateTimeSuite`. The test validates 2 json files from the resource folder - `gregorian-julian-rebase-micros.json` and `julian-gregorian-rebase-micros.json`, and it checks per each time zone records that
- the number of switch points is equal to the number of diffs between calendars. If the numbers are different, this will violate the assumption made in `RebaseDateTime.rebaseMicros`.
- swith points are ordered from old to recent timestamps. This pre-condition is required for linear search in the `rebaseMicros` function.
- Added the test `optimization of micros rebasing - Gregorian to Julian` to `RebaseDateTimeSuite` which iterates over timestamps from 0001-01-01 to 2100-01-01 with the steps 1 ± 0.5 months, and checks that optimised function `RebaseDateTime`.`rebaseGregorianToJulianMicros()` returns the same result as non-optimised one. The check is performed for the UTC, PST, CET, Africa/Dakar, America/Los_Angeles, Antarctica/Vostok, Asia/Hong_Kong, Europe/Amsterdam time zones.
- Added the test `optimization of micros rebasing - Julian to Gregorian` to `RebaseDateTimeSuite` which does similar checks as the test above but for rebasing from the hybrid calendar (Julian + Gregorian) to Proleptic Gregorian calendar.
- The tests for days rebasing are moved from `DateTimeUtilsSuite` to `RebaseDateTimeSuite` because the rebasing related code is moved from `DateTimeUtils` to the separate object `RebaseDateTime`.
- Re-run `DateTimeRebaseBenchmark` at the America/Los_Angeles time zone (it is set explicitly in the PR #28127):
| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 |
Closes#28119 from MaxGekk/optimize-rebase-micros.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
A new function `json_object_keys` is proposed in this PR. This function will return all the keys of the outmost json object. It takes Json Object as an argument.
- If invalid json expression is given, `NULL` will be returned.
- If an empty string or json array is given, `NULL` will be returned.
- If valid json object is given, all the keys of the outmost object will be returned as an array.
- For empty json object, empty array is returned.
We can also get JSON object keys using `map_keys+from_json`. But `json_object_keys` is more efficient.
```
Performance result for json_object = {"a":[1,2,3,4,5], "b":[2,4,5,12333321]}
Intel(R) Core(TM) i7-9750H CPU 2.60GHz
JSON functions: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
json_object_keys 11666 12361 673 0.9 1166.6 1.0X
from_json+map_keys 15309 15973 701 0.7 1530.9 0.8X
```
### Why are the changes needed?
This function will help naive users in directly extracting the keys from json string and its fairly intuitive as well. Also its extends the functionality of spark-sql for json strings.
Some of the most popular DBMSs supports this function.
- PostgreSQL
- MySQL
- MariaDB
### Does this PR introduce any user-facing change?
Yes. Now users can extract keys of json objects using this function.
### How was this patch tested?
UTs added.
Closes#27836 from iRakson/jsonKeys.
Authored-by: iRakson <raksonrakesh@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
At the moment we do not have any function to compute length of JSON array directly.
I propose a `json_array_length` function which will return the length of the outermost JSON array.
- This function will return length of the outermost JSON array, if JSON array is valid.
```
scala> spark.sql("select json_array_length('[1,2,3,[33,44],{\"key\":[2,3,4]}]')").show
+--------------------------------------------------+
|json_array_length([1,2,3,[33,44],{"key":[2,3,4]}])|
+--------------------------------------------------+
| 5|
+--------------------------------------------------+
scala> spark.sql("select json_array_length('[[1],[2,3]]')").show
+------------------------------+
|json_array_length([[1],[2,3]])|
+------------------------------+
| 2|
+------------------------------+
```
- In case of any other valid JSON string, invalid JSON string or null array or `NULL` input , `NULL` will be returned.
```
scala> spark.sql("select json_array_length('')").show
+-------------------+
|json_array_length()|
+-------------------+
| null|
+-------------------+
```
### Why are the changes needed?
- As mentioned in JIRA, this function is supported by presto, postgreSQL, redshift, SQLite, MySQL, MariaDB, IBM DB2.
- for better user experience and ease of use.
```
Performance Result for Json array - [1, 2, 3, 4]
Intel(R) Core(TM) i7-9750H CPU 2.60GHz
JSON functions: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
json_array_length 7728 7762 53 1.3 772.8 1.0X
size+from_json 12739 12895 199 0.8 1273.9 0.6X
```
### Does this PR introduce any user-facing change?
Yes, now users can get length of a json array by using `json_array_length`.
### How was this patch tested?
Added UT.
Closes#27759 from iRakson/jsonArrayLength.
Authored-by: iRakson <raksonrakesh@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Previously, user can issue `SHOW TABLES` to get info of both tables and views.
This PR (SPARK-31113) implements `SHOW VIEWS` SQL command similar to HIVE to get views only.(https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowViews)
**Hive** -- Only show view names
```
hive> SHOW VIEWS;
OK
view_1
view_2
...
```
**Spark(Hive-Compatible)** -- Only show view names, used in tests and `SparkSQLDriver` for CLI applications
```
SHOW VIEWS IN showdb;
view_1
view_2
...
```
**Spark** -- Show more information database/viewName/isTemporary
```
spark-sql> SHOW VIEWS;
userdb view_1 false
userdb view_2 false
...
```
### Why are the changes needed?
`SHOW VIEWS` command provides better granularity to only get information of views.
### Does this PR introduce any user-facing change?
Add new `SHOW VIEWS` SQL command
### How was this patch tested?
Add new test `show-views.sql` and pass existing tests
Closes#27897 from Eric5553/ShowViews.
Authored-by: Eric Wu <492960551@qq.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR fixes the added version of `spark.sql.execution.pandas.udf.buffer.size` to 3.0.0 (see also SPARK-27870)
### Why are the changes needed?
To show the correct version added.
### Does this PR introduce any user-facing change?
Yes but only in the unreleased branches. It will change the version shown in SQL documentation.
### How was this patch tested?
Not tested. Jenkins will test it out.
Closes#28144 from HyukjinKwon/SPARK-30841-followup.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This is a minor PR used to fix some typo and improve comments mentioned with https://github.com/apache/spark/pull/28081/files#r402874997
### Why are the changes needed?
Fix some typo and improve comments.
### Does this PR introduce any user-facing change?
'No'.
### How was this patch tested?
Jenkins test.
Closes#28112 from beliefer/fix-typo-in-codegen.
Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
In the PR, I propose to add tests to check that code generation doesn't fail if expressions string argument contains escape chars. The PR adds similar tests added by https://github.com/apache/spark/pull/20182 for `from_utc_timestamp` / `to_utc_timestamp`.
### Why are the changes needed?
To prevent regressions in the future.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By running the affected tests
Closes#28115 from MaxGekk/tests-arg-escape.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This patch proposed to skip predicates on PythonUDFs to be pushdown through Aggregate.
### Why are the changes needed?
The predicates on PythonUDFs cannot be pushdown through Aggregate. Pushed down predicates cannot be evaluate because PythonUDFs cannot be evaluated on Filter and cause error like:
```
Caused by: java.lang.UnsupportedOperationException: Cannot generate code for expression: mean(input[1, struct<bar:bigint>, true].bar)
at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode(Expression.scala:304)
at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode$(Expression.scala:303)
at org.apache.spark.sql.catalyst.expressions.PythonUDF.doGenCode(PythonUDF.scala:52)
at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:146)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:141)
at org.apache.spark.sql.catalyst.expressions.CastBase.doGenCode(Cast.scala:821)
at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:146)
at scala.Option.getOrElse(Option.scala:189)
```
### Does this PR introduce any user-facing change?
Yes. Previously the predicates on PythonUDFs will be pushdown through Aggregate can cause error. After this change, the query can work.
### How was this patch tested?
Unit test.
Closes#28089 from viirya/SPARK-30921.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
The auto-generated alias name of built-in function `version()` is `sparkversion()`. After this PR, it is updated to `version()`.
### Why are the changes needed?
Based on our auto-generated alias name convention for the built-in functions, the alias names should be consistent with the function names.
This built-in function `version` is added in the upcoming Spark 3.0. Thus, we should fix it before the release.
### Does this PR introduce any user-facing change?
Yes. Update the column name in schema if users do not specify the alias.
### How was this patch tested?
Added a test case.
Closes#28131 from gatorsmile/spark-29554followup.
Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
1. Fix the `rebaseGregorianToJulianMicros()` function in `DateTimeUtils` by passing the daylight saving offset associated with the input `micros` to the constructed instance of `GregorianCalendar`. The problem is in `cal.getTimeInMillis` which returns earliest instant in the case of local date-time overlaps, see https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/master/jdk/src/share/classes/java/util/GregorianCalendar.java#L2783-L2786 . I fixed the issue by keeping the standard zone offset as is, and set the DST offset only. I don't set `ZONE_OFFSET` because time zone resolution works differently in Java 8 and Java 7 time APIs. So, if I would set the standard zone offsets too, this could change the behavior, and rebasing won't give the same result as Spark 2.4.
2. Fix `rebaseJulianToGregorianMicros()` by changing resulted zoned date-time if `DST_OFFSET` is zero which means the input date-time has passed an autumn daylight savings cutover. So, I take the latest local timestamp out of 2 overlapped timestamps. Otherwise I return a zoned date-time w/o any modification because it is equal to calling the `withEarlierOffsetAtOverlap()` method, so, we can optimize the case.
### Why are the changes needed?
This fixes the bug of loosing of DST offset info in rebasing timestamps via local date-time. For example, there are 2 different timestamps in the `America/Los_Angeles` time zone: `2019-11-03T01:00:00-07:00` and `2019-11-03T01:00:00-08:00`, though they are mapped to the same local date-time `2019-11-03T01:00`, see
<img width="456" alt="Screen Shot 2020-04-02 at 10 19 24" src="https://user-images.githubusercontent.com/1580697/78245697-95a7da00-74f0-11ea-9eba-c08138851cb3.png">
Currently, the UTC timestamp `2019-11-03T09:00:00Z` is converted to `2019-11-03T01:00:00-08:00`, and then to `2019-11-03T01:00:00` (in the original calendar, for instance Proleptic Gregorian calendar) and back to the UTC timestamp `2019-11-03T08:00:00Z` (in the hybrid calendar - Gregorian for the timestamp). That's wrong because the local timestamp must be converted to the original timestamp `2019-11-03T09:00:00Z`.
### Does this PR introduce any user-facing change?
Yes
### How was this patch tested?
- Added a test to `DateTimeUtilsSuite` which checks that rebased micros are the same as the input during DST. The result must be the same if Java 8 and 7 time API functions return the same time zone offsets.
- Run the following code to check that there is no difference between rebased and original micros for modern timestamps:
```scala
test("rebasing differences") {
withDefaultTimeZone(getZoneId("America/Los_Angeles")) {
val start = instantToMicros(LocalDateTime.of(1, 1, 1, 0, 0, 0)
.atZone(getZoneId("America/Los_Angeles"))
.toInstant)
val end = instantToMicros(LocalDateTime.of(2030, 1, 1, 0, 0, 0)
.atZone(getZoneId("America/Los_Angeles"))
.toInstant)
var micros = start
var diff = Long.MaxValue
var counter = 0
while (micros < end) {
val rebased = rebaseGregorianToJulianMicros(micros)
val curDiff = rebased - micros
if (curDiff != diff) {
counter += 1
diff = curDiff
val ldt = microsToInstant(micros).atZone(getZoneId("America/Los_Angeles")).toLocalDateTime
println(s"local date-time = $ldt diff = ${diff / MICROS_PER_MINUTE} minutes")
}
micros += 30 * MICROS_PER_MINUTE
}
println(s"counter = $counter")
}
}
```
```
local date-time = 0001-01-01T00:00 diff = -2872 minutes
local date-time = 0100-03-01T00:00 diff = -1432 minutes
local date-time = 0200-03-01T00:00 diff = 7 minutes
local date-time = 0300-03-01T00:00 diff = 1447 minutes
local date-time = 0500-03-01T00:00 diff = 2887 minutes
local date-time = 0600-03-01T00:00 diff = 4327 minutes
local date-time = 0700-03-01T00:00 diff = 5767 minutes
local date-time = 0900-03-01T00:00 diff = 7207 minutes
local date-time = 1000-03-01T00:00 diff = 8647 minutes
local date-time = 1100-03-01T00:00 diff = 10087 minutes
local date-time = 1300-03-01T00:00 diff = 11527 minutes
local date-time = 1400-03-01T00:00 diff = 12967 minutes
local date-time = 1500-03-01T00:00 diff = 14407 minutes
local date-time = 1582-10-15T00:00 diff = 7 minutes
local date-time = 1883-11-18T12:22:58 diff = 0 minutes
counter = 15
```
The code is not added to `DateTimeUtilsSuite` because it takes > 30 seconds.
- By running the updated benchmark `DateTimeRebaseBenchmark` via the command:
```
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.DateTimeRebaseBenchmark"
```
in the environment:
| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK 1.8.0_242-8u242/11.0.6+10 |
Closes#28101 from MaxGekk/fix-local-date-overlap.
Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR intends to add a new SQL config for controlling a plan explain mode in the events of (e.g., `SparkListenerSQLExecutionStart` and `SparkListenerSQLAdaptiveExecutionUpdate`) SQL listeners. In the current master, the output of `QueryExecution.toString` (this is equivalent to the "extended" explain mode) is stored in these events. I think it is useful to control the content via `SQLConf`. For example, the query "Details" content (TPCDS q66 query) of a SQL tab in a Spark web UI will be changed as follows;
Before this PR:
![q66-extended](https://user-images.githubusercontent.com/692303/78211668-950b4580-74e8-11ea-90c6-db52d437534b.png)
After this PR:
![q66-formatted](https://user-images.githubusercontent.com/692303/78211674-9ccaea00-74e8-11ea-9d1d-43c7e2b0f314.png)
### Why are the changes needed?
For better usability.
### Does this PR introduce any user-facing change?
Yes; since Spark 3.1, SQL UI data adopts the `formatted` mode for the query plan explain results. To restore the behavior before Spark 3.0, you can set `spark.sql.ui.explainMode` to `extended`.
### How was this patch tested?
Added unit tests.
Closes#28097 from maropu/SPARK-31325.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
### What changes were proposed in this pull request?
After my investigation, `SQLQueryTestSuite` spent a lot of time compiling the generated java code.
Take `group-by.sql` as an example.
At first, I added some debug log into `SQLQueryTestSuite`.
Please reference 92b6af740c/sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala (L402)
The execution command is as follows:
`build/sbt "~sql/test-only *SQLQueryTestSuite -- -z group-by.sql"`
The output show below:
```
00:56:06.192 WARN org.apache.spark.sql.SQLQueryTestSuite: group-by.sql using configs: spark.sql.codegen.wholeStage=true. run time: 20604
00:56:13.719 WARN org.apache.spark.sql.SQLQueryTestSuite: group-by.sql using configs: spark.sql.codegen.wholeStage=false,spark.sql.codegen.factoryMode=CODEGEN_ONLY. run time: 7526
00:56:18.786 WARN org.apache.spark.sql.SQLQueryTestSuite: group-by.sql using configs: spark.sql.codegen.wholeStage=false,spark.sql.codegen.factoryMode=NO_CODEGEN. run time: 5066
```
According to the log, we know.
Config | Run time(ms)
-- | --
spark.sql.codegen.wholeStage=true | 20604
spark.sql.codegen.wholeStage=false,spark.sql.codegen.factoryMode=CODEGEN_ONLY | 7526
spark.sql.codegen.wholeStage=false,spark.sql.codegen.factoryMode=NO_CODEGEN | 5066
We should display the total compile time for generated java code.
This PR will add the following to `SQLQueryTestSuite`'s output.
```
=== Metrics of Whole Codegen ===
Total compile time: 80.564516529 seconds
```
Note: At first, I wanted to use `CodegenMetrics.METRIC_COMPILATION_TIME` to do this. After many experiments, I found that `CodegenMetrics.METRIC_COMPILATION_TIME` is only effective for a single test case, and cannot play a role in the whole life cycle of `SQLQueryTestSuite`.
I checked the type of ` CodegenMetrics.METRIC_COMPILATION_TIME` is `Histogram` and the latter preserves 1028 elements.` Histogram` is a metric which calculates the distribution of a value.
### Why are the changes needed?
Display the total compile time for generated java code.
### Does this PR introduce any user-facing change?
'No'.
### How was this patch tested?
Jenkins test.
Closes#28081 from beliefer/output-codegen-compile-time.
Lead-authored-by: beliefer <beliefer@163.com>
Co-authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
rename `QueryPlan.collectInPlanAndSubqueries` to `collectWithSubqueries`
### Why are the changes needed?
The old name is too verbose. `QueryPlan` is internal but it's the core of catalyst and we'd better make the API name clearer before we release it.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
N/A
Closes#28092 from cloud-fan/rename.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
In the PR, I propose to replace the following SQL configs:
1. `spark.sql.legacy.parquet.rebaseDateTime.enabled` by
- `spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled` (`false` by default). The config enables rebasing dates/timestamps while saving to Parquet files. If it is set to `true`, dates/timestamps are converted to local date-time in Proleptic Gregorian calendar, date-time fields are extracted, and used in building new local date-time in the hybrid calendar (Julian + Gregorian). The resulted local date-time is converted to days or microseconds since the epoch.
- `spark.sql.legacy.parquet.rebaseDateTimeInRead.enabled` (`false` by default). The config enables rebasing of dates/timestamps in reading from Parquet files.
2. `spark.sql.legacy.avro.rebaseDateTime.enabled` by
- `spark.sql.legacy.avro.rebaseDateTimeInWrite.enabled` (`false` by default). It enables dates/timestamps rebasing from Proleptic Gregorian calendar to the hybrid calendar via local date/timestamps.
- `spark.sql.legacy.avro.rebaseDateTimeInRead.enabled` (`false` by default). It enables rebasing dates/timestamps from the hybrid calendar to Proleptic Gregorian calendar in read. The rebasing is performed by converting micros/millis/days to a local date/timestamp in the source calendar, interpreting the resulted date/timestamp in the target calendar, and getting the number of micros/millis/days since the epoch 1970-01-01 00:00:00Z.
### Why are the changes needed?
This allows to load dates/timestamps saved by Spark 2.4, and save to Parquet/Avro files without rebasing. And the reverse use case - load data saved by Spark 3.0, and save it in the form which is compatible with Spark 2.4.
### Does this PR introduce any user-facing change?
Yes, users have to use new SQL configs. Old SQL configs are removed by the PR.
### How was this patch tested?
By existing test suites `AvroV1Suite`, `AvroV2Suite` and `ParquetIOSuite`.
Closes#28082 from MaxGekk/split-rebase-configs.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Create statement plans in `DataFrameWriter(V2)`, like the SQL API.
### Why are the changes needed?
It's better to leave all the resolution work to the analyzer.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
existing tests
Closes#27992 from cloud-fan/statement.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose to replace current implementation of the `rebaseGregorianToJulianDays()` and `rebaseJulianToGregorianDays()` functions in `DateTimeUtils` by new one which is based on the fact that difference between Proleptic Gregorian and the hybrid (Julian+Gregorian) calendars was changed only 14 times for entire supported range of valid dates `[0001-01-01, 9999-12-31]`:
| date | Proleptic Greg. days | Hybrid (Julian+Greg) days | diff|
| ---- | ----|----|----|
|0001-01-01|-719162|-719164|-2|
|0100-03-01|-682944|-682945|-1|
|0200-03-01|-646420|-646420|0|
|0300-03-01|-609896|-609895|1|
|0500-03-01|-536847|-536845|2|
|0600-03-01|-500323|-500320|3|
|0700-03-01|-463799|-463795|4|
|0900-03-01|-390750|-390745|5|
|1000-03-01|-354226|-354220|6|
|1100-03-01|-317702|-317695|7|
|1300-03-01|-244653|-244645|8|
|1400-03-01|-208129|-208120|9|
|1500-03-01|-171605|-171595|10|
|1582-10-15|-141427|-141427|0|
For the given days since the epoch, the proposed implementation finds the range of days which the input days belongs to, and adds the diff in days between calendars to the input. The result is rebased days since the epoch in the target calendar.
For example, if need to rebase -650000 days from Proleptic Gregorian calendar to the hybrid calendar. In that case, the input falls to the bucket [-682944, -646420), the diff associated with the range is -1. To get the rebased days in Julian calendar, we should add -1 to -650000, and the result is -650001.
### Why are the changes needed?
To make dates rebasing faster.
### Does this PR introduce any user-facing change?
No, the results should be the same for valid range of the `DATE` type `[0001-01-01, 9999-12-31]`.
### How was this patch tested?
- Added 2 tests to `DateTimeUtilsSuite` for the `rebaseGregorianToJulianDays()` and `rebaseJulianToGregorianDays()` functions. The tests check that results of old and new implementation (optimized version) are the same for all supported dates.
- Re-run `DateTimeRebaseBenchmark` on:
| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK8/11 |
Closes#28067 from MaxGekk/optimize-rebasing.
Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
```sql
scala> spark.sql(" select * from values(1), (2) t(key) where key in (select 1 as key where 1=0)").queryExecution
res15: org.apache.spark.sql.execution.QueryExecution =
== Parsed Logical Plan ==
'Project [*]
+- 'Filter 'key IN (list#39 [])
: +- Project [1 AS key#38]
: +- Filter (1 = 0)
: +- OneRowRelation
+- 'SubqueryAlias t
+- 'UnresolvedInlineTable [key], [List(1), List(2)]
== Analyzed Logical Plan ==
key: int
Project [key#40]
+- Filter key#40 IN (list#39 [])
: +- Project [1 AS key#38]
: +- Filter (1 = 0)
: +- OneRowRelation
+- SubqueryAlias t
+- LocalRelation [key#40]
== Optimized Logical Plan ==
Join LeftSemi, (key#40 = key#38)
:- LocalRelation [key#40]
+- LocalRelation <empty>, [key#38]
== Physical Plan ==
*(1) BroadcastHashJoin [key#40], [key#38], LeftSemi, BuildRight
:- *(1) LocalTableScan [key#40]
+- Br...
```
`LocalRelation <empty> ` should be able to propagate after subqueries are lift up to joins
### Why are the changes needed?
optimize query
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
add new tests
Closes#28043 from yaooqinn/SPARK-31280.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
SPARK-25387 avoids npe for bad csv input, but when reading bad csv input with `columnNameCorruptRecord` specified, `getCurrentInput` is called and it still throws npe.
### Why are the changes needed?
Bug fix.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Add a test.
Closes#28029 from wzhfy/corrupt_column_npe.
Authored-by: Zhenhua Wang <wzh_zju@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This patch proposes to prune unnecessary nested fields from Generate which has no Project on top of it.
### Why are the changes needed?
In Optimizer, we can prune nested columns from Project(projectList, Generate). However, unnecessary columns could still possibly be read in Generate, if no Project on top of it. We should prune it too.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Unit test.
Closes#27517 from viirya/SPARK-29721-2.
Lead-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
The query plan of Spark SQL is a mutually recursive structure: QueryPlan -> Expression (PlanExpression) -> QueryPlan, but the transformations do not take this into account.
This PR refines the comments of `QueryPlan` to highlight this fact.
### Why are the changes needed?
better document.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
N/A
Closes#28050 from cloud-fan/comment.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
In the PR, I propose to change types of `DateTimeTestUtils` values and functions by replacing `java.util.TimeZone` to `java.time.ZoneId`. In particular:
1. Type of `ALL_TIMEZONES` is changed to `Seq[ZoneId]`.
2. Remove `val outstandingTimezones: Seq[TimeZone]`.
3. Change the type of the time zone parameter in `withDefaultTimeZone` to `ZoneId`.
4. Modify affected test suites.
### Why are the changes needed?
Currently, Spark SQL's date-time expressions and functions have been already ported on Java 8 time API but tests still use old time APIs. In particular, `DateTimeTestUtils` exposes functions that accept only TimeZone instances. This is inconvenient, and CPU consuming because need to convert TimeZone instances to ZoneId instances via strings (zone ids).
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By affected test suites executed by jenkins builds.
Closes#28033 from MaxGekk/with-default-time-zone.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
OuterReference is one LeafExpression, so it's children is Nil, which makes its SQL representation always be outer(). This makes our explain-command and error msg unclear when OuterReference exists.
e.g.
```scala
org.apache.spark.sql.AnalysisException:
Aggregate/Window/Generate expressions are not valid in where clause of the query.
Expression in where clause: [(in.`value` = max(outer()))]
Invalid expressions: [max(outer())];;
```
This PR override its `sql` method with its `prettyName` and single argment `e`'s `sql` methond
### Why are the changes needed?
improve err message
### Does this PR introduce any user-facing change?
yes, the err msg caused by OuterReference has changed
### How was this patch tested?
modified ut results
Closes#27985 from yaooqinn/SPARK-31225.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
1. `DataSourceStrategy.scala` is extended to create `org.apache.spark.sql.sources.Filter` from nested expressions.
2. Translation from nested `org.apache.spark.sql.sources.Filter` to `org.apache.parquet.filter2.predicate.FilterPredicate` is implemented to support nested predicate pushdown for Parquet.
### Why are the changes needed?
Better performance for handling nested predicate pushdown.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
New tests are added.
Closes#27728 from dbtsai/SPARK-17636.
Authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Skew join handling comes with an overhead: we need to read some data repeatedly. We should treat a partition as skewed if it's large enough so that it's beneficial to do so.
Currently the size threshold is the advisory partition size, which is 64 MB by default. This is not large enough for the skewed partition size threshold.
This PR adds a new config for the threshold and set default value as 256 MB.
### Why are the changes needed?
Avoid skew join handling that may introduce a perf regression.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
existing tests
Closes#27967 from cloud-fan/aqe.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR targets for non-nullable null type not to coerce to nullable type in complex types.
Non-nullable fields in struct, elements in an array and entries in map can mean empty array, struct and map. They are empty so it does not need to force the nullability when we find common types.
This PR also reverts and supersedes d7b97a1d0d
### Why are the changes needed?
To make type coercion coherent and consistent. Currently, we correctly keep the nullability even between non-nullable fields:
```scala
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
spark.range(1).select(array(lit(1)).cast(ArrayType(IntegerType, false))).printSchema()
spark.range(1).select(array(lit(1)).cast(ArrayType(DoubleType, false))).printSchema()
```
```scala
spark.range(1).selectExpr("concat(array(1), array(1)) as arr").printSchema()
```
### Does this PR introduce any user-facing change?
Yes.
```scala
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
spark.range(1).select(array().cast(ArrayType(IntegerType, false))).printSchema()
```
```scala
spark.range(1).selectExpr("concat(array(), array(1)) as arr").printSchema()
```
**Before:**
```
org.apache.spark.sql.AnalysisException: cannot resolve 'array()' due to data type mismatch: cannot cast array<null> to array<int>;;
'Project [cast(array() as array<int>) AS array()#68]
+- Range (0, 1, step=1, splits=Some(12))
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:149)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:140)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$2(TreeNode.scala:333)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:333)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:330)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:399)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:237)
```
```
root
|-- arr: array (nullable = false)
| |-- element: integer (containsNull = true)
```
**After:**
```
root
|-- array(): array (nullable = false)
| |-- element: integer (containsNull = false)
```
```
root
|-- arr: array (nullable = false)
| |-- element: integer (containsNull = false)
```
### How was this patch tested?
Unittests were added and manually tested.
Closes#27991 from HyukjinKwon/SPARK-31227.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose to add a few `ZoneId` constant values to the `DateTimeTestUtils` object, and reuse the constants in tests. Proposed the following constants:
- PST = -08:00
- UTC = +00:00
- CEST = +02:00
- CET = +01:00
- JST = +09:00
- MIT = -09:30
- LA = America/Los_Angeles
### Why are the changes needed?
All proposed constant values (except `LA`) are initialized by zone offsets according to their definitions. This will allow to avoid:
- Using of 3-letter time zones that have been already deprecated in JDK, see _Three-letter time zone IDs_ in https://docs.oracle.com/javase/8/docs/api/java/util/TimeZone.html
- Incorrect mapping of 3-letter time zones to zone offsets, see SPARK-31237. For example, `PST` is mapped to `America/Los_Angeles` instead of the `-08:00` zone offset.
Also this should improve stability and maintainability of test suites.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By running affected test suites.
Closes#28001 from MaxGekk/replace-pst.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Spark introduced CHAR type for hive compatibility but it only works for hive tables. CHAR type is never documented and is treated as STRING type for non-Hive tables.
However, this leads to confusing behaviors
**Apache Spark 3.0.0-preview2**
```
spark-sql> CREATE TABLE t(a CHAR(3));
spark-sql> INSERT INTO TABLE t SELECT 'a ';
spark-sql> SELECT a, length(a) FROM t;
a 2
```
**Apache Spark 2.4.5**
```
spark-sql> CREATE TABLE t(a CHAR(3));
spark-sql> INSERT INTO TABLE t SELECT 'a ';
spark-sql> SELECT a, length(a) FROM t;
a 3
```
According to the SQL standard, `CHAR(3)` should guarantee all the values are of length 3. Since `CHAR(3)` is treated as STRING so Spark doesn't guarantee it.
This PR forbids CHAR type in non-Hive tables as it's not supported correctly.
### Why are the changes needed?
avoid confusing/wrong behavior
### Does this PR introduce any user-facing change?
yes, now users can't create/alter non-Hive tables with CHAR type.
### How was this patch tested?
new tests
Closes#27902 from cloud-fan/char.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This pr intends to add unit tests for the other join hints (`MERGEJOIN`, `SHUFFLE_HASH`, and `SHUFFLE_REPLICATE_NL`). This is a followup PR of #27935.
### Why are the changes needed?
For better test coverage.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added unit tests.
Closes#28013 from maropu/SPARK-25121-FOLLOWUP.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
In the PR, I propose to update the doc for `spark.sql.session.timeZone`, and restrict format of config's values to 2 forms:
1. Geographical regions, such as `America/Los_Angeles`.
2. Fixed offsets - a fully resolved offset from UTC. For example, `-08:00`.
### Why are the changes needed?
Other formats such as three-letter time zone IDs are ambitious, and depend on the locale. For example, `CST` could be U.S. `Central Standard Time` and `China Standard Time`. Such formats have been already deprecated in JDK, see [Three-letter time zone IDs](https://docs.oracle.com/javase/8/docs/api/java/util/TimeZone.html).
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By running `./dev/scalastyle`, and manual testing.
Closes#27999 from MaxGekk/doc-session-time-zone.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
To support case class parameter for typed Scala UDF, e.g.
```
case class TestData(key: Int, value: String)
val f = (d: TestData) => d.key * d.value.toInt
val myUdf = udf(f)
val df = Seq(("data", TestData(50, "2"))).toDF("col1", "col2")
checkAnswer(df.select(myUdf(Column("col2"))), Row(100) :: Nil)
```
### Why are the changes needed?
Currently, Spark UDF can only work on data types like java.lang.String, o.a.s.sql.Row, Seq[_], etc. This is inconvenient if user want to apply an operation on one column, and the column is struct type. You must access data from a Row object, instead of domain object like Dataset operations. It will be great if UDF can work on types that are supported by Dataset, e.g. case class.
And here's benchmark result of using case class comparing to row:
```scala
// case class: 58ms 65ms 59ms 64ms 61ms
// row: 59ms 64ms 73ms 84ms 69ms
val f1 = (d: TestData) => s"${d.key}, ${d.value}"
val f2 = (r: Row) => s"${r.getInt(0)}, ${r.getString(1)}"
val udf1 = udf(f1)
// set spark.sql.legacy.allowUntypedScalaUDF=true
val udf2 = udf(f2, StringType)
val df = spark.range(100000).selectExpr("cast (id as int) as id")
.select(struct('id, lit("str")).as("col"))
df.cache().collect()
// warmup to exclude some extra influence
df.select(udf1('col)).write.mode(SaveMode.Overwrite).format("noop").save()
df.select(udf2('col)).write.mode(SaveMode.Overwrite).format("noop").save()
start = System.currentTimeMillis()
df.select(udf1('col)).write.mode(SaveMode.Overwrite).format("noop").save()
println(System.currentTimeMillis() - start)
start = System.currentTimeMillis()
df.select(udf2('col)).write.mode(SaveMode.Overwrite).format("noop").save()
println(System.currentTimeMillis() - start)
```
### Does this PR introduce any user-facing change?
Yes. User now could be able to use typed Scala UDF with case class as input parameter.
### How was this patch tested?
Added unit tests.
Closes#27937 from Ngone51/udf_caseclass_support.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose to apply rebasing for all dates/timestamps in conversion functions `fromJavaDate()`, `toJavaDate()`, `toJavaTimestamp()` and `fromJavaTimestamp()`. The rebasing is performed via building a local date-time in an original calendar, extracting date-time fields from the result, and creating new local date-time in the target calendar.
### Why are the changes needed?
The changes are need to be compatible with previous Spark version (2.4.5 and earlier versions) not only before the Gregorian cutover date `1582-10-15` but also for dates after the date. For instance, Gregorian calendar implementation in Java 7 `java.util.GregorianCalendar` is not accurate in resolving time zone offsets as Gregorian calendar introduced since Java 8.
### Does this PR introduce any user-facing change?
Yes, this PR can introduce behavior changes for dates after `1582-10-15`, in particular conversions of zone ids to zone offsets will be much more accurate.
### How was this patch tested?
By existing test suites `DateTimeUtilsSuite`, `DateFunctionsSuite`, `DateExpressionsSuite`, `CollectionExpressionsSuite`, `HiveOrcHadoopFsRelationSuite`, `ParquetIOSuite`.
Closes#27980 from MaxGekk/reuse-rebase-funcs-in-java-funcs.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR (SPARK-31229) is rather a followup of https://github.com/apache/spark/pull/27926 (SPARK-31166). It adds unittests for `TypeCoercion.findTypeForComplex` and `Cast.canCast` about struct, map and array with the respect to null types.
### Why are the changes needed?
To detect which scope was broken in the future easily.
### Does this PR introduce any user-facing change?
No, it's a test-only.
### How was this patch tested?
Unittests were added.
Closes#27990 from HyukjinKwon/SPARK-31166-followup.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
https://github.com/apache/spark/pull/26412 introduced a behavior change that `date_add`/`date_sub` functions can't accept string and double values in the second parameter. This is reasonable as it's error-prone to cast string/double to int at runtime.
However, using string literals as function arguments is very common in SQL databases. To avoid breaking valid use cases that the string literal is indeed an integer, this PR proposes to add ansi_cast for string literal in date_add/date_sub functions. If the string value is not a valid integer, we fail at query compiling time because of constant folding.
### Why are the changes needed?
avoid breaking changes
### Does this PR introduce any user-facing change?
Yes, now 3.0 can run `date_add('2011-11-11', '1')` like 2.4
### How was this patch tested?
new tests.
Closes#27965 from cloud-fan/string.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Improve `ScalaReflection` to only don't erasure non user defined `AnyVal` type, but still erasure other types, e.g. `Any`. And this brings two benefits:
1. Give better encode error message for some unsupported types, e.g. `Any`
2. Won't miss the walk path for the `AnyVal` type
### Why are the changes needed?
Firstly, PR #15284 added encode(serializeFor/deserializeFor) support for value class, which extends `AnyVal`, by not erasure types. But, this also introduce a problem that when user try to encoder unsupported types, e.g. `Any`, it will fail on `java.lang.ClassNotFoundException: scala.Any` due to the reason that `scala.Any` doesn't erasure to `java.lang.Object`.
Also, in current `getClassNameFromType()`, it always erasure types which could missing walked path for user defined `AnyVal` types.
### Does this PR introduce any user-facing change?
Yes. For the test below:
```
case class Bar(i: Any)
case class Foo(i: Bar) extends AnyVal
test() {
implicitly[ExpressionEncoder[Foo]]
}
```
Before:
```
java.lang.ClassNotFoundException: scala.Any
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:355)
...
````
After:
```
java.lang.UnsupportedOperationException: No Encoder found for Any
- field (class: "java.lang.Object", name: "i")
- field (class: "org.apache.spark.sql.catalyst.encoders.Bar", name: "i")
- root class: "org.apache.spark.sql.catalyst.encoders.Foo"
at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerFor$1(ScalaReflection.scala:561)
```
### How was this patch tested?
Added unit test and test manually.
Closes#27959 from Ngone51/impr_anyval.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose to fix the issue of rebasing leap years in Julian calendar to Proleptic Gregorian calendar in which the years are not leap years. In the Julian calendar, every four years is a leap year, with a leap day added to the month of February. In Proleptic Gregorian calendar, every year that is exactly divisible by four is a leap year, except for years that are exactly divisible by 100, but these centurial years are leap years, if they are exactly divisible by 400. In this ways, the date **1000-02-29** exists in the Julian calendar but not in Proleptic Gregorian calendar.
I modified the `rebaseJulianToGregorianMicros()` and `rebaseJulianToGregorianDays()` in `DateTimeUtils` by passing 1 as a day number of month while forming `LocalDate` or `LocalDateTime`, and adding the number of days using the `plusDays()` method. For example, **1000-02-29** doesn't exist in Proleptic Gregorian calendar, and `LocalDate.of(1000, 2, 29)` throws an exception. To avoid the issue, I build the `LocalDate.of(1000, 2, 1)` date and add 28 days. The `plusDays(28)` method produces the next valid date after `1000-02-28` which is **1000-03-01**.
### Why are the changes needed?
Before the changes, the `java.time.DateTimeException` exception is raised while loading the date `1000-02-29` from parquet files saved by Spark 2.4.5:
```scala
scala> spark.conf.set("spark.sql.legacy.parquet.rebaseDateTime.enabled", true)
scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_date_leap").show
20/03/21 03:03:59 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
java.time.DateTimeException: Invalid date 'February 29' as '1000' is not a leap year
```
The parquet files were saved via the commands:
```shell
$ export TZ="America/Los_Angeles"
```
```scala
scala> scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
scala> val df = Seq(java.sql.Date.valueOf("1000-02-29")).toDF("dateS").select($"dateS".as("date"))
df: org.apache.spark.sql.DataFrame = [date: date]
scala> df.write.mode("overwrite").parquet("/Users/maxim/tmp/before_1582/2_4_5_date_leap")
scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_date_leap").show
+----------+
| date|
+----------+
|1000-02-29|
+----------+
```
### Does this PR introduce any user-facing change?
Yes, after the fix:
```scala
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
scala> spark.conf.set("spark.sql.legacy.parquet.rebaseDateTime.enabled", true)
scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_date_leap").show
+----------+
| date|
+----------+
|1000-03-01|
+----------+
```
### How was this patch tested?
Added tests to `DateTimeUtilsSuite`.
Closes#27974 from MaxGekk/julian-date-29-feb.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Fix errors and missing parts for datetime pattern document
1. The pattern we use is similar to DateTimeFormatter and SimpleDateFormat but not identical. So we shouldn't use any of them in the API docs but use a link to the doc of our own.
2. Some pattern letters are missing
3. Some pattern letters are explicitly banned - Set('A', 'c', 'e', 'n', 'N')
4. the second fraction pattern different logic for parsing and formatting
### Why are the changes needed?
fix and improve doc
### Does this PR introduce any user-facing change?
yes, new and updated doc
### How was this patch tested?
pass Jenkins
viewed locally with `jekyll serve`
![image](https://user-images.githubusercontent.com/8326978/77044447-6bd3bb00-69fa-11ea-8d6f-7084166c5dea.png)
Closes#27956 from yaooqinn/SPARK-31189.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
The PR addresses the issue of compatibility with Spark 2.4 and earlier version in reading/writing dates and timestamp via **Avro** datasource. Previous releases are based on a hybrid calendar - Julian + Gregorian. Since Spark 3.0, Proleptic Gregorian calendar is used by default, see SPARK-26651. In particular, the issue pops up for dates/timestamps before 1582-10-15 when the hybrid calendar switches from/to Gregorian to/from Julian calendar. The same local date in different calendar is converted to different number of days since the epoch 1970-01-01. For example, the 1001-01-01 date is converted to:
- -719164 in Julian calendar. Spark 2.4 saves the number as a value of DATE type into **Avro** files.
- -719162 in Proleptic Gregorian calendar. Spark 3.0 saves the number as a date value.
The PR proposes rebasing from/to Proleptic Gregorian calendar to the hybrid one under the SQL config:
```
spark.sql.legacy.avro.rebaseDateTime.enabled
```
which is set to `false` by default which means the rebasing is not performed by default.
The details of the implementation:
1. Re-use 2 methods of `DateTimeUtils` added by the PR https://github.com/apache/spark/pull/27915 for rebasing microseconds.
2. Re-use 2 methods of `DateTimeUtils` added by the PR https://github.com/apache/spark/pull/27915 for rebasing days.
3. Use `rebaseGregorianToJulianMicros()` and `rebaseGregorianToJulianDays()` while saving timestamps/dates to **Avro** files if the SQL config is on.
4. Use `rebaseJulianToGregorianMicros()` and `rebaseJulianToGregorianDays()` while loading timestamps/dates from **Avro** files if the SQL config is on.
5. The SQL config `spark.sql.legacy.avro.rebaseDateTime.enabled` controls conversions from/to dates, and timestamps of the `timestamp-millis`, `timestamp-micros` logical types.
### Why are the changes needed?
For the backward compatibility with Spark 2.4 and earlier versions. The changes allow users to read dates/timestamps saved by previous version, and get the same result. Also after the changes, users can enable the rebasing in write, and save dates/timestamps that can be loaded correctly by Spark 2.4 and earlier versions.
### Does this PR introduce any user-facing change?
Yes, the timestamp `1001-01-01 01:02:03.123456` saved by Spark 2.4.5 as `timestamp-micros` is interpreted by Spark 3.0.0-preview2 differently:
```scala
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
scala> spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_date_avro").show(false)
+----------+
|date |
+----------+
|1001-01-07|
+----------+
```
After the changes:
```scala
scala> spark.conf.set("spark.sql.legacy.avro.rebaseDateTime.enabled", true)
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
scala> spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_date_avro").show(false)
+----------+
|date |
+----------+
|1001-01-01|
+----------+
```
### How was this patch tested?
1. Added tests to `AvroLogicalTypeSuite` to check rebasing in read. The test reads back avro files saved by Spark 2.4.5 via:
```shell
$ export TZ="America/Los_Angeles"
```
```scala
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
scala> val df = Seq("1001-01-01").toDF("dateS").select($"dateS".cast("date").as("date"))
df: org.apache.spark.sql.DataFrame = [date: date]
scala> df.write.format("avro").save("/Users/maxim/tmp/before_1582/2_4_5_date_avro")
scala> val df2 = Seq("1001-01-01 01:02:03.123456").toDF("tsS").select($"tsS".cast("timestamp").as("ts"))
df2: org.apache.spark.sql.DataFrame = [ts: timestamp]
scala> df2.write.format("avro").save("/Users/maxim/tmp/before_1582/2_4_5_ts_avro")
scala> :paste
// Entering paste mode (ctrl-D to finish)
val timestampSchema = s"""
| {
| "namespace": "logical",
| "type": "record",
| "name": "test",
| "fields": [
| {"name": "ts", "type": ["null", {"type": "long","logicalType": "timestamp-millis"}], "default": null}
| ]
| }
|""".stripMargin
// Exiting paste mode, now interpreting.
scala> df3.write.format("avro").option("avroSchema", timestampSchema).save("/Users/maxim/tmp/before_1582/2_4_5_ts_millis_avro")
```
2. Added the following tests to `AvroLogicalTypeSuite` to check rebasing of dates/timestamps (in microsecond and millisecond precision). The tests write rebased a date/timestamps and read them back w/ enabled/disabled rebasing, and compare results. :
- `rebasing microseconds timestamps in write`
- `rebasing milliseconds timestamps in write`
- `rebasing dates in write`
Closes#27953 from MaxGekk/rebase-avro-datetime.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
A few `CREATE TABLE` test cases have some assumption on the default value of `LEGACY_CREATE_HIVE_TABLE_BY_DEFAULT_ENABLED`. This PR (SPARK-31181) makes the test cases more explicit from test-case side.
The configuration change was tested via https://github.com/apache/spark/pull/27894 during discussing SPARK-31136. This PR has only the test case part from that PR.
### Why are the changes needed?
This makes our test case more robust in terms of the default value of `LEGACY_CREATE_HIVE_TABLE_BY_DEFAULT_ENABLED`. Even in the case where we switch the conf value, that will be one-liner with no test case changes.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Pass the Jenkins with the existing tests.
Closes#27946 from dongjoon-hyun/SPARK-EXPLICIT-TEST.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This is a followup of https://github.com/apache/spark/pull/26933
Fraction string like "1.23" is definitely not a valid integral format and we should fail to do the cast under the ANSI mode.
### Why are the changes needed?
correct the ANSI cast behavior from string to integral
### Does this PR introduce any user-facing change?
Yes under ANSI mode, but ANSI mode is off by default.
### How was this patch tested?
new test
Closes#27957 from cloud-fan/ansi.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
The PR addresses the issue of compatibility with Spark 2.4 and earlier version in reading/writing dates and timestamp via Parquet datasource. Previous releases are based on a hybrid calendar - Julian + Gregorian. Since Spark 3.0, Proleptic Gregorian calendar is used by default, see SPARK-26651. In particular, the issue pops up for dates/timestamps before 1582-10-15 when the hybrid calendar switches from/to Gregorian to/from Julian calendar. The same local date in different calendar is converted to different number of days since the epoch 1970-01-01. For example, the 1001-01-01 date is converted to:
- -719164 in Julian calendar. Spark 2.4 saves the number as a value of DATE type into parquet.
- -719162 in Proleptic Gregorian calendar. Spark 3.0 saves the number as a date value.
According to the parquet spec, parquet timestamps of the `TIMESTAMP_MILLIS`, `TIMESTAMP_MICROS` output type and parquet dates should be based on Proleptic Gregorian calendar but the `INT96` timestamps should be stored as Julian days. Since the version 3.0, Spark conforms the spec but for the backward compatibility with previous version, the PR proposes rebasing from/to Proleptic Gregorian calendar to the hybrid one under the SQL config:
```
spark.sql.legacy.parquet.rebaseDateTime.enabled
```
which is set to `false` by default which means the rebasing is not performed by default.
The details of the implementation:
1. Added 2 methods to `DateTimeUtils` for rebasing microseconds. `rebaseGregorianToJulianMicros()` builds a local timestamp in Proleptic Gregorian calendar, extracts date-time fields `year`, `month`, ..., `second fraction` from the local timestamp and uses them to build another local timestamp based on the hybrid calendar (using `java.util.Calendar` API). After that it calculates the number of microseconds since the epoch using the resulted local timestamp. The function performs the conversion via the system JVM time zone for compatibility with Spark 2.4 and earlier versions. The `rebaseJulianToGregorianMicros()` function does reverse conversion.
2. Added 2 methods to `DateTimeUtils` for rebasing days. `rebaseGregorianToJulianDays()` builds a local date from the passed number of days since the epoch in Proleptic Gregorian calendar, interprets the resulted date as a local date in the hybrid calendar and gets the number of days since the epoch from the resulted local date. The conversion is performed via the `UTC` time zone because the conversion is independent from time zones, and `UTC` is selected to void round issues of casting days to milliseconds and back. The `rebaseJulianToGregorianDays()` functions does revers conversion.
3. Use `rebaseGregorianToJulianMicros()` and `rebaseGregorianToJulianDays()` while saving timestamps/dates to parquet files if the SQL config is on.
4. Use `rebaseJulianToGregorianMicros()` and `rebaseJulianToGregorianDays()` while loading timestamps/dates from parquet files if the SQL config is on.
5. The SQL config `spark.sql.legacy.parquet.rebaseDateTime.enabled` controls conversions from/to dates, timestamps of `TIMESTAMP_MILLIS`, `TIMESTAMP_MICROS`, see the SQL config `spark.sql.parquet.outputTimestampType`.
6. The rebasing is always performed for `INT96` timestamps, independently from `spark.sql.legacy.parquet.rebaseDateTime.enabled`.
7. Supported the vectorized parquet reader, see the SQL config `spark.sql.parquet.enableVectorizedReader`.
### Why are the changes needed?
- For the backward compatibility with Spark 2.4 and earlier versions. The changes allow users to read dates/timestamps saved by previous version, and get the same result. Also after the changes, users can enable the rebasing in write, and save dates/timestamps that can be loaded correctly by Spark 2.4 and earlier versions.
- It fixes the bug of incorrect saving/loading timestamps of the `INT96` type
### Does this PR introduce any user-facing change?
Yes, the timestamp `1001-01-01 01:02:03.123456` saved by Spark 2.4.5 as `TIMESTAMP_MICROS` is interpreted by Spark 3.0.0-preview2 differently:
```scala
scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_ts_micros").show(false)
+--------------------------+
|ts |
+--------------------------+
|1001-01-07 11:32:20.123456|
+--------------------------+
```
After the changes:
```scala
scala> spark.conf.set("spark.sql.legacy.parquet.rebaseDateTime.enabled", true)
scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_ts_micros").show(false)
+--------------------------+
|ts |
+--------------------------+
|1001-01-01 01:02:03.123456|
+--------------------------+
```
### How was this patch tested?
1. Added tests to `ParquetIOSuite` to check rebasing in read for regular reader and vectorized parquet reader. The test reads back parquet files saved by Spark 2.4.5 via:
```shell
$ export TZ="America/Los_Angeles"
```
```scala
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
scala> val df = Seq("1001-01-01").toDF("dateS").select($"dateS".cast("date").as("date"))
df: org.apache.spark.sql.DataFrame = [date: date]
scala> df.write.parquet("/Users/maxim/tmp/before_1582/2_4_5_date")
scala> val df = Seq("1001-01-01 01:02:03.123456").toDF("tsS").select($"tsS".cast("timestamp").as("ts"))
df: org.apache.spark.sql.DataFrame = [ts: timestamp]
scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
scala> df.write.parquet("/Users/maxim/tmp/before_1582/2_4_5_ts_micros")
scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MILLIS")
scala> df.write.parquet("/Users/maxim/tmp/before_1582/2_4_5_ts_millis")
scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "INT96")
scala> df.write.parquet("/Users/maxim/tmp/before_1582/2_4_5_ts_int96")
```
2. Manually check the write code path. Save date/timestamps (TIMESTAMP_MICROS, TIMESTAMP_MILLIS, INT96) by Spark 3.1.0-SNAPSHOT (after the changes):
```bash
$ export TZ="America/Los_Angeles"
```
```scala
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
scala> spark.conf.set("spark.sql.legacy.parquet.rebaseDateTime.enabled", true)
scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
scala> val df = Seq(("1001-01-01", "1001-01-01 01:02:03.123456")).toDF("dateS", "tsS").select($"dateS".cast("date").as("d"), $"tsS".cast("timestamp").as("ts"))
df: org.apache.spark.sql.DataFrame = [d: date, ts: timestamp]
scala> df.write.parquet("/Users/maxim/tmp/before_1582/3_0_0_micros")
scala> spark.read.parquet("/Users/maxim/tmp/before_1582/3_0_0_micros").show(false)
+----------+--------------------------+
|d |ts |
+----------+--------------------------+
|1001-01-01|1001-01-01 01:02:03.123456|
+----------+--------------------------+
```
Read the saved date/timestamp by Spark 2.4.5:
```scala
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
scala> spark.read.parquet("/Users/maxim/tmp/before_1582/3_0_0_micros").show(false)
+----------+--------------------------+
|d |ts |
+----------+--------------------------+
|1001-01-01|1001-01-01 01:02:03.123456|
+----------+--------------------------+
```
Closes#27915 from MaxGekk/rebase-parquet-datetime.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
A followup of https://github.com/apache/spark/pull/27936 to update document.
### Why are the changes needed?
correct document
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
N/A
Closes#27950 from cloud-fan/null.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
pattern `''` means literal `'`
```sql
select date_format(to_timestamp("11111904-01-23 15:02:01", 'y-MM-dd HH:mm:ss'), "y-MM-dd HH:mm:ss''SSSSSSSSS");
5377-02-14 06:27:19'000000519
```
0946a9514f missed this case and this pr add it back.
### Why are the changes needed?
bugfix
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
add ut
Closes#27949 from yaooqinn/SPARK-31150-2.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Prpend `-` to the compare result instead of creating a new reverse comparator for each compare when sorting in DESC order in InterpretedOrdering.
### Why are the changes needed?
Currently, we'll create a new reverse comparator for each compare in InterpretedOrdering, which could generate lots of small and instant object and hurt JVM when there're plenty of data.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Pass Jenkins.
Closes#27938 from Ngone51/reverse_comparator.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
The meaning of 'u' was day number of the week in SimpleDateFormat, it was changed to year in DateTimeFormatter. Now we keep the old meaning of 'u' by substituting 'u' to 'e' internally and use DateTimeFormatter to parse the pattern string. In DateTimeFormatter, the 'e' and 'c' also represents day-of-week. e.g.
```sql
select date_format(timestamp '2019-10-06', 'yyyy-MM-dd uuuu');
select date_format(timestamp '2019-10-06', 'yyyy-MM-dd uuee');
select date_format(timestamp '2019-10-06', 'yyyy-MM-dd eeee');
```
Because of the substitution, they all goes to `.... eeee` silently. The users may congitive problems of their meanings, so we should mark them as illegal pattern characters to stay the same as before.
This pr move the method `convertIncompatiblePattern` from `DatetimeUtils` to `DateTimeFormatterHelper` object, since it is quite specific for `DateTimeFormatterHelper` class.
And 'e' and 'c' char checking in this method.
Besides,`convertIncompatiblePattern` has a bug that will lose the last `'` if it ends with it, this pr fixes this too. e.g.
```sql
spark-sql> select date_format(timestamp "2019-10-06", "yyyy-MM-dd'S'");
20/03/18 11:19:45 ERROR SparkSQLDriver: Failed in [select date_format(timestamp "2019-10-06", "yyyy-MM-dd'S'")]
java.lang.IllegalArgumentException: Pattern ends with an incomplete string literal: uuuu-MM-dd'S
spark-sql> select to_timestamp("2019-10-06S", "yyyy-MM-dd'S'");
NULL
```
### Why are the changes needed?
avoid vagueness
bug fix
### Does this PR introduce any user-facing change?
no, these are not exposed yet
### How was this patch tested?
add ut
Closes#27939 from yaooqinn/SPARK-31176.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
```
<extract expression> ::= EXTRACT <left paren> <extract field> FROM <extract source> <right paren>
<extract source> ::= <datetime value expression> | <interval value expression>
```
We now only support datetime values as extract source for `extract` expression but it's alternative function `date_part` supports both datetime and interval.
This pr adds interval value support for `extract` expression as extract source
### Why are the changes needed?
For ANSI compliance and the semantic consistency between extract and `date_part`, we support intervals for extract expressions.
### Does this PR introduce any user-facing change?
yes, in the `extract(abc from xyz)` expression, the `xyz` can be intervals
### How was this patch tested?
add unit tests
Closes#27876 from yaooqinn/SPARK-31119.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Make `size(null)` return null under ANSI mode, regardless of the `spark.sql.legacy.sizeOfNull` config.
### Why are the changes needed?
In https://github.com/apache/spark/pull/27834, we change the result of `size(null)` to be -1 to match the 2.4 behavior and avoid breaking changes.
However, it's true that the "return -1" behavior is error-prone when being used with aggregate functions. The current ANSI mode controls a bunch of "better behaviors" like failing on overflow. We don't enable these "better behaviors" by default because they are too breaking. The "return null" behavior of `size(null)` is a good fit of the ANSI mode.
### Does this PR introduce any user-facing change?
No as ANSI mode is off by default.
### How was this patch tested?
new tests
Closes#27936 from cloud-fan/null.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR is to support parsing timestamp values with variable length second fraction parts.
e.g. 'yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]' can parse timestamp with 0~6 digit-length second fraction but fail >=7
```sql
select to_timestamp(v, 'yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]') from values
('2019-10-06 10:11:12.'),
('2019-10-06 10:11:12.0'),
('2019-10-06 10:11:12.1'),
('2019-10-06 10:11:12.12'),
('2019-10-06 10:11:12.123UTC'),
('2019-10-06 10:11:12.1234'),
('2019-10-06 10:11:12.12345CST'),
('2019-10-06 10:11:12.123456PST') t(v)
2019-10-06 03:11:12.123
2019-10-06 08:11:12.12345
2019-10-06 10:11:12
2019-10-06 10:11:12
2019-10-06 10:11:12.1
2019-10-06 10:11:12.12
2019-10-06 10:11:12.1234
2019-10-06 10:11:12.123456
select to_timestamp('2019-10-06 10:11:12.1234567PST', 'yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]')
NULL
```
Since 3.0, we use java 8 time API to parse and format timestamp values. when we create the `DateTimeFormatter`, we use `appendPattern` to create the build first, where the 'S..S' part will be parsed to a fixed-length(= `'S..S'.length`). This fits the formatting part but too strict for the parsing part because the trailing zeros are very likely to be truncated.
### Why are the changes needed?
improve timestamp parsing and more compatible with 2.4.x
### Does this PR introduce any user-facing change?
no, the related changes are newly added
### How was this patch tested?
add uts
Closes#27906 from yaooqinn/SPARK-31150.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
After https://github.com/apache/spark/pull/27542, `map()` returns `map<null, null>` instead of `map<string, string>`. However, this breaks queries which union `map()` and other maps.
The reason is, `TypeCoercion` rules and `Cast` think it's illegal to cast null type map key to other types, as it makes the key nullable, but it's actually legal. This PR fixes it.
### Why are the changes needed?
To avoid breaking queries.
### Does this PR introduce any user-facing change?
Yes, now some queries that work in 2.x can work in 3.0 as well.
### How was this patch tested?
new test
Closes#27926 from cloud-fan/bug.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR is kind of a followup of #26808. It leverages the helper method for aliasing in built-in SQL expressions to use the alias as its output column name where it's applicable.
- `Expression`, `UnaryMathExpression` and `BinaryMathExpression` search the alias in the tags by default.
- When the naming is different in its implementation, it has to be overwritten for the expression specifically. E.g., `CallMethodViaReflection`, `Remainder`, `CurrentTimestamp`,
`FormatString` and `XPathDouble`.
This PR fixes the aliases of the functions below:
| class | alias |
|--------------------------|------------------|
|`Rand` |`random` |
|`Ceil` |`ceiling` |
|`Remainder` |`mod` |
|`Pow` |`pow` |
|`Signum` |`sign` |
|`Chr` |`char` |
|`Length` |`char_length` |
|`Length` |`character_length`|
|`FormatString` |`printf` |
|`Substring` |`substr` |
|`Upper` |`ucase` |
|`XPathDouble` |`xpath_number` |
|`DayOfMonth` |`day` |
|`CurrentTimestamp` |`now` |
|`Size` |`cardinality` |
|`Sha1` |`sha` |
|`CallMethodViaReflection` |`java_method` |
Note: `EqualTo`, `=` and `==` aliases were excluded because it's unable to leverage this helper method. It should fix the parser.
Note: this PR also excludes some instances such as `ToDegrees`, `ToRadians`, `UnaryMinus` and `UnaryPositive` that needs an explicit name overwritten to make the scope of this PR smaller.
### Why are the changes needed?
To respect expression name.
### Does this PR introduce any user-facing change?
Yes, it will change the output column name.
### How was this patch tested?
Manually tested, and unittests were added.
Closes#27901 from HyukjinKwon/31146.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
It's not needed at all as now we replace "y" with "u" if there is no "G". So the era is either explicitly specified (e.g. "yyyy G") or can be inferred from the year (e.g. "uuuu").
### Why are the changes needed?
By default we use "uuuu" as the year pattern, which indicates the era already. If we set a default era, it can get conflicted and fail the parsing.
### Does this PR introduce any user-facing change?
yea, now spark can parse date/timestamp with negative year via the "yyyy" pattern, which will be converted to "uuuu" under the hood.
### How was this patch tested?
new tests
Closes#27707 from cloud-fan/bug.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This reverts commit 47d6e80a2e.
### Why are the changes needed?
There is no standard requiring that `div` must return the type of the operand, and always returning long type looks fine. This is kind of a cosmetic change and we should avoid it if it breaks existing queries. This is similar to reverting TRIM function parameter order change.
### Does this PR introduce any user-facing change?
Yes, change the behavior of `div` back to be the same as 2.4.
### How was this patch tested?
N/A
Closes#27835 from cloud-fan/revert2.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
spark.sql.legacy.timeParser.enabled should be removed from SQLConf and the migration guide
spark.sql.legacy.timeParsePolicy is the right one
### Why are the changes needed?
fix doc
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
Pass the jenkins
Closes#27889 from yaooqinn/SPARK-31131.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
AQE has a perf regression when using the default settings: if we coalesce the shuffle partitions into one or few partitions, we may leave many CPU cores idle and the perf is worse than with AQE off (which leverages all CPU cores).
Technically, this is not a bad thing. If there are many queries running at the same time, it's better to coalesce shuffle partitions into fewer partitions. However, the default settings of AQE should try to avoid any perf regression as possible as we can.
This PR changes the default value of minPartitionNum when coalescing shuffle partitions, to be `SparkContext.defaultParallelism`, so that AQE can leverage all the CPU cores.
### Why are the changes needed?
avoid AQE perf regression
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing tests
Closes#27879 from cloud-fan/aqe.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the error message, adding an example for typed Scala UDF.
### Why are the changes needed?
Help user to know how to migrate to typed Scala UDF.
### Does this PR introduce any user-facing change?
No, it's a new error message in Spark 3.0.
### How was this patch tested?
Pass Jenkins.
Closes#27884 from Ngone51/spark_31010_followup.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR reverts https://github.com/apache/spark/pull/26051 and https://github.com/apache/spark/pull/26066
### Why are the changes needed?
There is no standard requiring that `size(null)` must return null, and returning -1 looks reasonable as well. This is kind of a cosmetic change and we should avoid it if it breaks existing queries. This is similar to reverting TRIM function parameter order change.
### Does this PR introduce any user-facing change?
Yes, change the behavior of `size(null)` back to be the same as 2.4.
### How was this patch tested?
N/A
Closes#27834 from cloud-fan/revert.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
`DateTimeUtilsSuite.daysToMicros and microsToDays` takes 30 seconds, which is too long for a UT.
This PR changes the test to check random data, to reduce testing time. Now this test takes 1 second.
### Why are the changes needed?
make test faster
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
N/A
Closes#27873 from cloud-fan/test.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose to change conversion of java.sql.Timestamp/Date values to/from internal values of Catalyst's TimestampType/DateType before cutover day `1582-10-15` of Gregorian calendar. I propose to construct local date-time from microseconds/days since the epoch. Take each date-time component `year`, `month`, `day`, `hour`, `minute`, `second` and `second fraction`, and construct java.sql.Timestamp/Date using the extracted components.
### Why are the changes needed?
This will rebase underlying time/date offset in the way that collected java.sql.Timestamp/Date values will have the same local time-date component as the original values in Gregorian calendar.
Here is the example which demonstrates the issue:
```sql
scala> sql("select date '1100-10-10'").collect()
res1: Array[org.apache.spark.sql.Row] = Array([1100-10-03])
```
### Does this PR introduce any user-facing change?
Yes, after the changes:
```sql
scala> sql("select date '1100-10-10'").collect()
res0: Array[org.apache.spark.sql.Row] = Array([1100-10-10])
```
### How was this patch tested?
By running `DateTimeUtilsSuite`, `DateFunctionsSuite` and `DateExpressionsSuite`.
Closes#27807 from MaxGekk/rebase-timestamp-before-1582.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
When encoding Java Beans to Spark DataFrame, respecting `javax.annotation.Nonnull` and producing non-null fields.
### Why are the changes needed?
When encoding Java Beans to Spark DataFrame, non-primitive types are encoded as nullable fields. Although It works for most cases, it can be an issue under a few situations, e.g. the one described in the JIRA ticket when saving DataFrame to Avro format with non-null field.
We should allow Spark users more flexibility when creating Spark DataFrame from Java Beans. Currently, Spark users cannot create DataFrame with non-nullable fields in the schema from beans with non-nullable properties.
Although it is possible to project top-level columns with SQL expressions like `AssertNotNull` to make it non-null, for nested fields it is more tricky to do it similarly.
### Does this PR introduce any user-facing change?
Yes. After this change, Spark users can use `javax.annotation.Nonnull` to annotate non-null fields in Java Beans when encoding beans to Spark DataFrame.
### How was this patch tested?
Added unit test.
Closes#27851 from viirya/SPARK-31071.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In Spark version 2.4 and earlier, datetime parsing, formatting and conversion are performed by using the hybrid calendar (Julian + Gregorian).
Since the Proleptic Gregorian calendar is de-facto calendar worldwide, as well as the chosen one in ANSI SQL standard, Spark 3.0 switches to it by using Java 8 API classes (the java.time packages that are based on ISO chronology ). The switching job is completed in SPARK-26651.
But after the switching, there are some patterns not compatible between Java 8 and Java 7, Spark needs its own definition on the patterns rather than depends on Java API.
In this PR, we achieve this by writing the document and shadow the incompatible letters. See more details in [SPARK-31030](https://issues.apache.org/jira/browse/SPARK-31030)
### Why are the changes needed?
For backward compatibility.
### Does this PR introduce any user-facing change?
No.
After we define our own datetime parsing and formatting patterns, it's same to old Spark version.
### How was this patch tested?
Existing and new added UT.
Locally document test:
![image](https://user-images.githubusercontent.com/4833765/76064100-f6acc280-5fc3-11ea-9ef7-82e7dc074205.png)
Closes#27830 from xuanyuanking/SPARK-31030.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Currently, we parse interval from multi units strings or from date-time/year-month pattern strings, the former handles all whitespace, the latter not or even spaces.
### Why are the changes needed?
behavior consistency
### Does this PR introduce any user-facing change?
yes, interval in date-time/year-month like
```
select interval '\n-\t10\t 12:34:46.789\t' day to second
-- !query 126 schema
struct<INTERVAL '-10 days -12 hours -34 minutes -46.789 seconds':interval>
-- !query 126 output
-10 days -12 hours -34 minutes -46.789 seconds
```
is valid now.
### How was this patch tested?
add ut.
Closes#26815 from yaooqinn/SPARK-30189.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
RuleExecutor already support metering for analyzer/optimizer rules. By providing such information in `PlanChangeLogger`, user can get more information when debugging rule changes .
This PR enhanced `PlanChangeLogger` to display RuleExecutor metrics. This can be easily done by calling the existing API `resetMetrics` and `dumpTimeSpent`, but there might be conflicts if user is also collecting total metrics of a sql job. Thus I introduced `QueryExecutionMetrics`, as the snapshot of `QueryExecutionMetering`, to better support this feature.
Information added to `PlanChangeLogger`
```
=== Metrics of Executed Rules ===
Total number of runs: 554
Total time: 0.107756568 seconds
Total number of effective runs: 11
Total time of effective runs: 0.047615486 seconds
```
### Why are the changes needed?
Provide better plan change debugging user experience
### Does this PR introduce any user-facing change?
Only add more debugging info of `planChangeLog`, default log level is TRACE.
### How was this patch tested?
Update existing tests to verify the new logs
Closes#27846 from Eric5553/ExplainRuleExecMetrics.
Authored-by: Eric Wu <492960551@qq.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes two things:
1. Convert `null` to `string` type during schema inference of `schema_of_json` as JSON datasource does. This is a bug fix as well because `null` string is not the proper DDL formatted string and it is unable for SQL parser to recognise it as a type string. We should match it to JSON datasource and return a string type so `schema_of_json` returns a proper DDL formatted string.
2. Let `schema_of_json` respect `dropFieldIfAllNull` option during schema inference.
### Why are the changes needed?
To let `schema_of_json` return a proper DDL formatted string, and respect `dropFieldIfAllNull` option.
### Does this PR introduce any user-facing change?
Yes, it does.
```scala
import collection.JavaConverters._
import org.apache.spark.sql.functions._
spark.range(1).select(schema_of_json(lit("""{"id": ""}"""))).show()
spark.range(1).select(schema_of_json(lit("""{"id": "a", "drop": {"drop": null}}"""), Map("dropFieldIfAllNull" -> "true").asJava)).show(false)
```
**Before:**
```
struct<id:null>
struct<drop:struct<drop:null>,id:string>
```
**After:**
```
struct<id:string>
struct<id:string>
```
### How was this patch tested?
Manually tested, and unittests were added.
Closes#27854 from HyukjinKwon/SPARK-31065.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This is a follow up for https://github.com/apache/spark/pull/27650 where allow None provider for create table. Here we are doing the same thing for ReplaceTable.
### Why are the changes needed?
Although currently the ASTBuilder doesn't seem to allow `replace` without `USING` clause. This would allow `DataFrameWriterV2` to use the statements instead of commands directly.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests
Closes#27838 from yuchenhuo/SPARK-30902.
Authored-by: Yuchen Huo <yuchen.huo@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
The newly added catalog APIs are marked as Experimental but other DS v2 APIs are marked as Evolving.
This PR makes it consistent and mark all Connector APIs as Evolving.
### Why are the changes needed?
For consistency.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
N/A
Closes#27811 from cloud-fan/tag.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Introduced a new parameter `emptyCollection` for `CreateMap` and `CreateArray` functiion to remove dependency on SQLConf.get.
### Why are the changes needed?
This allows to avoid the issue when the configuration change between different phases of planning, and this can silently break a query plan which can lead to crashes or data corruption.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing UTs.
Closes#27657 from iRakson/SPARK-30899.
Authored-by: iRakson <raksonrakesh@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This pr intends to support 32 or more grouping attributes for GROUPING_ID. In the current master, an integer overflow can occur to compute grouping IDs;
e75d9afb2f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala (L613)
For example, the query below generates wrong grouping IDs in the master;
```
scala> val numCols = 32 // or, 31
scala> val cols = (0 until numCols).map { i => s"c$i" }
scala> sql(s"create table test_$numCols (${cols.map(c => s"$c int").mkString(",")}, v int) using parquet")
scala> val insertVals = (0 until numCols).map { _ => 1 }.mkString(",")
scala> sql(s"insert into test_$numCols values ($insertVals,3)")
scala> sql(s"select grouping_id(), sum(v) from test_$numCols group by grouping sets ((${cols.mkString(",")}), (${cols.init.mkString(",")}))").show(10, false)
scala> sql(s"drop table test_$numCols")
// numCols = 32
+-------------+------+
|grouping_id()|sum(v)|
+-------------+------+
|0 |3 |
|0 |3 | // Wrong Grouping ID
+-------------+------+
// numCols = 31
+-------------+------+
|grouping_id()|sum(v)|
+-------------+------+
|0 |3 |
|1 |3 |
+-------------+------+
```
To fix this issue, this pr change code to use long values for `GROUPING_ID` instead of int values.
### Why are the changes needed?
To support more cases in `GROUPING_ID`.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added unit tests.
Closes#26918 from maropu/FixGroupingIdIssue.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
In the PR, I propose:
1. To replace matching by `Literal` in `ExprUtils.evalSchemaExpr()` to checking foldable property of the `schema` expression.
2. To replace matching by `Literal` in `ExprUtils.evalTypeExpr()` to checking foldable property of the `schema` expression.
3. To change checking of the input parameter in the `SchemaOfCsv` expression, and allow foldable `child` expression.
4. To change checking of the input parameter in the `SchemaOfJson` expression, and allow foldable `child` expression.
### Why are the changes needed?
This should improve Spark SQL UX for `from_csv`/`from_json`. Currently, Spark expects only literals:
```sql
spark-sql> select from_csv('1,Moscow', replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', ''));
Error in query: Schema should be specified in DDL format as a string literal or output of the schema_of_csv function instead of replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', '');; line 1 pos 7
spark-sql> select from_json('{"id":1, "city":"Moscow"}', replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', ''));
Error in query: Schema should be specified in DDL format as a string literal or output of the schema_of_json function instead of replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', '');; line 1 pos 7
```
and only string literals are acceptable as CSV examples by `schema_of_csv`/`schema_of_json`:
```sql
spark-sql> select schema_of_csv(concat_ws(',', 0.1, 1));
Error in query: cannot resolve 'schema_of_csv(concat_ws(',', CAST(0.1BD AS STRING), CAST(1 AS STRING)))' due to data type mismatch: The input csv should be a string literal and not null; however, got concat_ws(',', CAST(0.1BD AS STRING), CAST(1 AS STRING)).; line 1 pos 7;
'Project [unresolvedalias(schema_of_csv(concat_ws(,, cast(0.1 as string), cast(1 as string))), None)]
+- OneRowRelation
spark-sql> select schema_of_json(regexp_replace('{"item_id": 1, "item_price": 0.1}', 'item_', ''));
Error in query: cannot resolve 'schema_of_json(regexp_replace('{"item_id": 1, "item_price": 0.1}', 'item_', ''))' due to data type mismatch: The input json should be a string literal and not null; however, got regexp_replace('{"item_id": 1, "item_price": 0.1}', 'item_', '').; line 1 pos 7;
'Project [unresolvedalias(schema_of_json(regexp_replace({"item_id": 1, "item_price": 0.1}, item_, )), None)]
+- OneRowRelation
```
### Does this PR introduce any user-facing change?
Yes, after the changes users can pass any foldable string expression as the `schema` parameter to `from_csv()/from_json()`. For the example above:
```sql
spark-sql> select from_csv('1,Moscow', replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', ''));
{"id":1,"city":"Moscow"}
spark-sql> select from_json('{"id":1, "city":"Moscow"}', replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', ''));
{"id":1,"city":"Moscow"}
```
After change the `schema_of_csv`/`schema_of_json` functions accept foldable expressions, for example:
```sql
spark-sql> select schema_of_csv(concat_ws(',', 0.1, 1));
struct<_c0:double,_c1:int>
spark-sql> select schema_of_json(regexp_replace('{"item_id": 1, "item_price": 0.1}', 'item_', ''));
struct<id:bigint,price:double>
```
### How was this patch tested?
Added new test to `CsvFunctionsSuite` and to `JsonFunctionsSuite`.
Closes#27804 from MaxGekk/foldable-arg-csv-json-func.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR aims to show a deprecation warning on two-parameter TRIM/LTRIM/RTRIM function usages based on the community decision.
- https://lists.apache.org/thread.html/r48b6c2596ab06206b7b7fd4bbafd4099dccd4e2cf9801aaa9034c418%40%3Cdev.spark.apache.org%3E
### Why are the changes needed?
For backward compatibility, SPARK-28093 is reverted. However, from Apache Spark 3.0.0, we should give a safe guideline to use SQL syntax instead of the esoteric function signatures.
### Does this PR introduce any user-facing change?
Yes. This shows a directional warning.
### How was this patch tested?
Pass the Jenkins with a newly added test case.
Closes#27643 from dongjoon-hyun/SPARK-30886.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR adds an internal config for changing the logging level of adaptive execution query plan evolvement.
### Why are the changes needed?
To make AQE debugging easier.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added UT.
Closes#27798 from maryannxue/aqe-log-level.
Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>