Commit graph

4384 commits

Author SHA1 Message Date
Peter Toth 4f53bfbbd5
[SPARK-31535][SQL] Fix nested CTE substitution
### What changes were proposed in this pull request?

This PR fixes a CTE substitution issue so that the following SQL returns the correct empty result:
```
WITH t(c) AS (SELECT 1)
SELECT * FROM t
WHERE c IN (
  WITH t(c) AS (SELECT 2)
  SELECT * FROM t
)
```
Before this PR the result was `1`.
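
The intended scoping can be checked outside Spark. The sketch below is only an illustration and is not part of this PR: it assumes Python's built-in `sqlite3` module as a stand-in engine in which the inner `t` shadows the outer one, and `run_nested_cte` is a hypothetical helper name.

```python
import sqlite3

# Same query as in the PR description: the inner CTE reuses the name `t`.
QUERY = """
WITH t(c) AS (SELECT 1)
SELECT * FROM t
WHERE c IN (
  WITH t(c) AS (SELECT 2)
  SELECT * FROM t
)
"""


def run_nested_cte():
    # In-memory database; no tables are needed because both CTEs are constant.
    conn = sqlite3.connect(":memory:")
    try:
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


rows = run_nested_cte()
print(rows)
```

When the inner `t` is resolved to `SELECT 2`, the outer value `1` is not in the subquery result, so no rows come back.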

### Why are the changes needed?
To fix a correctness issue.

### Does this PR introduce any user-facing change?
Yes, fixes a correctness issue.

### How was this patch tested?
Added new test case.

Closes #28318 from peter-toth/SPARK-31535-fix-nested-cte-substitution.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-26 15:31:32 -07:00
Takeshi Yamamuro e01125db0d
[SPARK-31562][SQL] Update ExpressionDescription for substring, current_date, and current_timestamp
### What changes were proposed in this pull request?

This PR intends to add entries for substring, current_date, and current_timestamp in the SQL built-in function documents. Specifically, the entries are as follows;

 - SELECT current_date;
 - SELECT current_timestamp;
 - SELECT substring('abcd' FROM 1);
 - SELECT substring('abcd' FROM 1 FOR 2);

### Why are the changes needed?

To make the SQL (built-in functions) references complete.

### Does this PR introduce any user-facing change?

<img width="1040" alt="Screen Shot 2020-04-25 at 16 51 07" src="https://user-images.githubusercontent.com/692303/80274851-6ca5ee00-8718-11ea-9a35-9ae82008cb4b.png">

<img width="974" alt="Screen Shot 2020-04-25 at 17 24 24" src="https://user-images.githubusercontent.com/692303/80275032-a88d8300-8719-11ea-92ec-95b80169ae28.png">

<img width="862" alt="Screen Shot 2020-04-25 at 17 27 48" src="https://user-images.githubusercontent.com/692303/80275114-36696e00-871a-11ea-8e39-02e93eabb92f.png">

### How was this patch tested?

Added test examples.

Closes #28342 from maropu/SPARK-31562.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-26 11:46:52 -07:00
Max Gekk 7d8216a664
[SPARK-31563][SQL] Fix failure of InSet.sql for collections of Catalyst's internal types
### What changes were proposed in this pull request?
In the PR, I propose to fix the `InSet.sql` method for the cases when the input collection contains values of Catalyst's internal types, for instance `UTF8String`. Elements of the input set `hset` are converted to Scala types, and wrapped by `Literal` to properly form the SQL view of the input collection.

### Why are the changes needed?
The changes fix a bug in `InSet.sql`, which makes a wrong assumption about the types of collection elements. See more details in SPARK-31563.

### Does this PR introduce any user-facing change?
Highly likely, not.

### How was this patch tested?
Added a test to `ColumnExpressionSuite`

Closes #28343 from MaxGekk/fix-InSet-sql.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-25 09:29:51 -07:00
Kent Yao f92652d0b5
[SPARK-31528][SQL] Remove millennium, century, decade from trunc/date_trunc functions
### What changes were proposed in this pull request?

Similar to https://jira.apache.org/jira/browse/SPARK-31507, millennium, century, and decade are not commonly used in most modern platforms.

For example
Negative:
https://docs.snowflake.com/en/sql-reference/functions-date-time.html#supported-date-and-time-parts
https://prestodb.io/docs/current/functions/datetime.html#date_trunc
https://teradata.github.io/presto/docs/148t/functions/datetime.html#date_trunc
https://www.oracletutorial.com/oracle-date-functions/oracle-trunc/

Positive:
https://docs.aws.amazon.com/redshift/latest/dg/r_Dateparts_for_datetime_functions.html
https://www.postgresql.org/docs/9.1/functions-datetime.html#FUNCTIONS-DATETIME-TRUNC

This PR removes support for these `fmt` values from the `trunc` and `date_trunc` functions.

### Why are the changes needed?

Clean up uncommon datetime units for easier maintenance; we can add them back later if they prove useful.

### Does this PR introduce any user-facing change?
no, targeting 3.0.0, these are newly added in 3.0.0

### How was this patch tested?

Removed and modified existing unit tests.

Closes #28313 from yaooqinn/SPARK-31528.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-24 18:28:41 -07:00
Kent Yao caf3ab8411
[SPARK-31552][SQL] Fix ClassCastException in ScalaReflection arrayClassFor
### What changes were proposed in this pull request?

The two methods `arrayClassFor` and `dataTypeFor` in `ScalaReflection` call each other circularly, but the cases handled in `dataTypeFor` are not fully handled in `arrayClassFor`.

For example:
```scala
scala> implicit def newArrayEncoder[T <: Array[_] : TypeTag]: Encoder[T] = ExpressionEncoder()
newArrayEncoder: [T <: Array[_]](implicit evidence$1: reflect.runtime.universe.TypeTag[T])org.apache.spark.sql.Encoder[T]

scala> val decOne = Decimal(1, 38, 18)
decOne: org.apache.spark.sql.types.Decimal = 1E-18

scala> val decTwo = Decimal(2, 38, 18)
decTwo: org.apache.spark.sql.types.Decimal = 2E-18

scala> val decSpark = Array(decOne, decTwo)
decSpark: Array[org.apache.spark.sql.types.Decimal] = Array(1E-18, 2E-18)

scala> Seq(decSpark).toDF()
java.lang.ClassCastException: org.apache.spark.sql.types.DecimalType cannot be cast to org.apache.spark.sql.types.ObjectType
  at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$arrayClassFor$1(ScalaReflection.scala:131)
  at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
  at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
  at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
  at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
  at org.apache.spark.sql.catalyst.ScalaReflection$.arrayClassFor(ScalaReflection.scala:120)
  at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$dataTypeFor$1(ScalaReflection.scala:105)
  at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
  at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
  at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
  at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
  at org.apache.spark.sql.catalyst.ScalaReflection$.dataTypeFor(ScalaReflection.scala:88)
  at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerForType$1(ScalaReflection.scala:399)
  at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
  at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
  at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
  at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
  at org.apache.spark.sql.catalyst.ScalaReflection$.serializerForType(ScalaReflection.scala:393)
  at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:57)
  at newArrayEncoder(<console>:57)
  ... 53 elided

scala>
```

In this PR, we add the missing cases to `arrayClassFor`

### Why are the changes needed?

bugfix as described above

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

add a test for array encoders

Closes #28324 from yaooqinn/SPARK-31552.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-24 18:04:26 -07:00
Yuming Wang b10263b8e5 [SPARK-30724][SQL] Support 'LIKE ANY' and 'LIKE ALL' operators
### What changes were proposed in this pull request?

`LIKE ANY/SOME` and `LIKE ALL` operators are mostly used when we are matching a text field against a number of patterns. For example:

Teradata / Hive 3.0 / Snowflake:
```sql
--like any
select 'foo' LIKE ANY ('%foo%','%bar%');

--like all
select 'foo' LIKE ALL ('%foo%','%bar%');
```
PostgreSQL:
```sql
-- like any
select 'foo' LIKE ANY (array['%foo%','%bar%']);

-- like all
select 'foo' LIKE ALL (array['%foo%','%bar%']);
```

This PR adds support for these two operators.
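
As a rough illustration of the semantics, `LIKE ANY` is true when at least one pattern matches and `LIKE ALL` is true only when every pattern matches. The sketch below is plain Python, not Spark code; the helper names are hypothetical, and it assumes only the `%` and `_` wildcards, with no escape character and no NULL handling.

```python
import re


def like_to_regex(pattern):
    """Translate a SQL LIKE pattern (% and _ wildcards only) into a compiled regex."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # any sequence of characters
        elif ch == "_":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def like(value, pattern):
    # LIKE must match the whole value, hence fullmatch.
    return like_to_regex(pattern).fullmatch(value) is not None


def like_any(value, patterns):
    return any(like(value, p) for p in patterns)


def like_all(value, patterns):
    return all(like(value, p) for p in patterns)


print(like_any("foo", ["%foo%", "%bar%"]))
print(like_all("foo", ["%foo%", "%bar%"]))
```

With the patterns from the examples above, `'foo'` satisfies `LIKE ANY` but not `LIKE ALL`, because `%bar%` does not match.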

More details:
https://docs.teradata.com/reader/756LNiPSFdY~4JcCCcR5Cw/4~AyrPNmDN0Xk4SALLo6aQ
https://issues.apache.org/jira/browse/HIVE-15229
https://docs.snowflake.net/manuals/sql-reference/functions/like_any.html

### Why are the changes needed?

To smoothly migrate SQLs to Spark SQL.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Unit test.

Closes #27477 from wangyum/SPARK-30724.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-04-24 22:20:32 +09:00
Max Gekk 26165427c7 [SPARK-31488][SQL] Support java.time.LocalDate in Parquet filter pushdown
### What changes were proposed in this pull request?
1. Modified `ParquetFilters.valueCanMakeFilterOn()` to accept filters with `java.time.LocalDate` attributes.
2. Modified `ParquetFilters.dateToDays()` to support both types `java.sql.Date` and `java.time.LocalDate` in conversions to days.
3. Add implicit conversion from `LocalDate` to `Expression` (`Literal`).

### Why are the changes needed?
To support pushed down filters with `java.time.LocalDate` attributes. Before the changes, date filters are not pushed down to Parquet datasource when `spark.sql.datetime.java8API.enabled` is `true`.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Added a test to `ParquetFilterSuite`

Closes #28259 from MaxGekk/parquet-filter-java8-date-time.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-24 02:21:53 +00:00
Takeshi Yamamuro 42f496f6ac [SPARK-31526][SQL][TESTS] Add a new test suite for ExpressionInfo
### What changes were proposed in this pull request?

This PR intends to add a new test suite for `ExpressionInfo`. Major changes are as follows;

 - Added a new test suite named `ExpressionInfoSuite`
 - To improve test coverage, added a test for error handling in `ExpressionInfoSuite`
 - Moved the `ExpressionInfo`-related tests from `UDFSuite` to `ExpressionInfoSuite`
 - Moved the related tests from `SQLQuerySuite` to `ExpressionInfoSuite`
 - Added a comment in `ExpressionInfoSuite` (followup of https://github.com/apache/spark/pull/28224)

### Why are the changes needed?

To improve test suites/coverage.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added tests.

Closes #28308 from maropu/SPARK-31526.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-24 11:19:20 +09:00
Yuanjian Li ca90e1932d [SPARK-31515][SQL] Canonicalize Cast should consider the value of needTimeZone
### What changes were proposed in this pull request?
Override the canonicalized fields with respect to the result of `needsTimeZone`.

### Why are the changes needed?
The current approach breaks semantic equality of two cast expressions that do not involve datetime types. If we do not need `timeZone` information when casting from the `from` type to the `to` type, then the `timeZoneId` should not influence the canonicalized result.
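
The idea can be sketched in a few lines. This is a simplified Python model, not Catalyst code: the `Cast` dataclass, the string type names, and the `TZ_SENSITIVE` set are assumptions made for illustration, and the set is only a small subset of the conversions that depend on a time zone.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative subset of (from, to) pairs whose result depends on the time zone.
TZ_SENSITIVE = {
    ("string", "timestamp"),
    ("timestamp", "string"),
    ("date", "timestamp"),
    ("timestamp", "date"),
}


def needs_time_zone(from_type, to_type):
    return (from_type, to_type) in TZ_SENSITIVE


@dataclass(frozen=True)
class Cast:
    child: str
    from_type: str
    to_type: str
    time_zone_id: Optional[str] = None

    def canonicalized(self):
        # Drop the time zone when the conversion does not use it.
        tz = self.time_zone_id if needs_time_zone(self.from_type, self.to_type) else None
        return Cast(self.child, self.from_type, self.to_type, tz)

    def semantic_equals(self, other):
        return self.canonicalized() == other.canonicalized()


a = Cast("col", "int", "long", "UTC")
b = Cast("col", "int", "long", "Asia/Tokyo")
print(a.semantic_equals(b))
```

Two `int` to `long` casts compare as equal regardless of their time zone ids, while two `string` to `timestamp` casts with different zones do not.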

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
New UT added.

Closes #28288 from xuanyuanking/SPARK-31515.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-04-23 14:32:10 +09:00
Kent Yao 3b5792114a [SPARK-31474][SQL][FOLLOWUP] Replace _FUNC_ placeholder with functionname in the note field of expression info
### What changes were proposed in this pull request?

\_FUNC\_ has been used in `note()` of `ExpressionDescription` since https://github.com/apache/spark/pull/28248, and there may be more cases later. We should replace it with the function name in the documentation.

### Why are the changes needed?

doc fix

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

pass Jenkins, and verify locally with Jekyll serve

Closes #28305 from yaooqinn/SPARK-31474-F.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-23 13:33:04 +09:00
Max Gekk e7856a7902 [MINOR][SQL] Add comments for filters values and return values of Row.get()/apply()
### What changes were proposed in this pull request?
- Document row field values of `DATE` and `TIMESTAMP` type returned by `Row.get()` and `Row.apply`.
- Refer to `Row.get()` from the description of filter values

### Why are the changes needed?
Reflect the current behaviour of Row's `apply()` and `get()` methods in comments, to inform users that the return types depend on the SQL config setting `spark.sql.datetime.java8API.enabled`.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Run `$ ./dev/scalastyle`

Closes #28300 from MaxGekk/doc-filter-date-time.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-23 04:23:33 +00:00
Kent Yao 37d2e037ed [SPARK-31507][SQL] Remove uncommon fields support and update some fields with meaningful names for extract function
### What changes were proposed in this pull request?

Extracting millennium, century, decade, millisecond, microsecond and epoch from datetime is neither ANSI standard nor common in modern SQL platforms. Most of the systems listed below do not support these, except PostgreSQL and Redshift.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions050.htm

https://prestodb.io/docs/current/functions/datetime.html

https://docs.cloudera.com/documentation/enterprise/5-8-x/topics/impala_datetime_functions.html

https://docs.snowflake.com/en/sql-reference/functions-date-time.html#label-supported-date-time-parts

https://www.postgresql.org/docs/9.1/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT

This PR removes support for these fields from the `extract` function for date and timestamp values.

`isoyear` is PostgreSQL specific but `yearofweek` is more commonly used across platforms.
`isodow` is PostgreSQL specific but `iso` as a suffix is more commonly used across platforms, so `dow_iso` and `dayofweek_iso` are used to replace it.

For historical reasons, we have [`dayofweek`, `dow`] implemented for representing a non-ISO day-of-week and a newly added `isodow` from PostgreSQL for ISO day-of-week. Many other systems support only one week-numbering system and use either full names or abbreviations. Things in Spark become a little complicated:
1. Because `isodow` exists, we would need to add an `iso` prefix to `dayofweek` to make a pair for it too, giving [`dayofweek`, `isodayofweek`, `dow` and `isodow`].
2. Because `iso`-prefixed names are rare and more systems choose the `iso`-suffixed form, we might end up with [`dayofweek`, `dayofweekiso`, `dow`, `dowiso`].
3. `dayofweekiso` looks nice and has use cases in the platforms listed above, e.g. Snowflake, but `dowiso` looks weird and no use cases were found.
4. After a discussion with the community, we agreed that an underscore before `iso` looks much better. Because `isodow` is new and there is no standard for `iso` naming, this keeps things simple and clear for end users, provided the names are well documented.

Thus, we finally end up with [`dayofweek`, `dow`] for the non-ISO day-of-week system and [`dayofweek_iso`, `dow_iso`] for the ISO system.

### Why are the changes needed?

Remove some nonstandard and uncommon features as we can add them back if necessary

### Does this PR introduce any user-facing change?

NO, we should target this to 3.0.0 and these are added during 3.0.0

### How was this patch tested?

Remove unused tests

Closes #28284 from yaooqinn/SPARK-31507.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-22 10:24:49 +00:00
Wenchen Fan b209b5f406
[SPARK-31503][SQL] fix the SQL string of the TRIM functions
### What changes were proposed in this pull request?

override the `sql` method of `StringTrim`, `StringTrimLeft` and `StringTrimRight`, to use the standard SQL syntax.

### Why are the changes needed?

The current implementation is wrong. It gives you a SQL string that returns a different result.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

new tests

Closes #28281 from cloud-fan/sql.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-21 11:22:18 -07:00
Kent Yao 1985437110 [SPARK-31474][SQL] Consistency between dayofweek/dow in extract expression and dayofweek function
### What changes were proposed in this pull request?
```sql
spark-sql> SELECT extract(dayofweek from '2009-07-26');
1
spark-sql> SELECT extract(dow from '2009-07-26');
0
spark-sql> SELECT extract(isodow from '2009-07-26');
7
spark-sql> SELECT dayofweek('2009-07-26');
1
spark-sql> SELECT weekday('2009-07-26');
6
```
Currently, there are 4 types of day-of-week range:
1. the function `dayofweek` (2.3.0) and extracting `dayofweek` (2.4.0) return values from Sunday(1) to Saturday(7)
2. extracting `dow` (3.0.0) returns values from Sunday(0) to Saturday(6)
3. extracting `isodow` (3.0.0) returns values from Monday(1) to Sunday(7)
4. the function `weekday` (2.4.0) returns values from Monday(0) to Sunday(6)
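
The four numbering schemes listed above can be reproduced with Python's `datetime` module. This is only an illustration of the ranges, not Spark code, and `day_of_week_variants` is a hypothetical helper name.

```python
from datetime import date


def day_of_week_variants(d):
    """Return the day-of-week of `d` under the four numbering schemes."""
    iso = d.isoweekday()             # Monday=1 .. Sunday=7
    return {
        "dayofweek": iso % 7 + 1,    # Sunday=1 .. Saturday=7
        "dow": iso % 7,              # Sunday=0 .. Saturday=6
        "isodow": iso,               # Monday=1 .. Sunday=7
        "weekday": d.weekday(),      # Monday=0 .. Sunday=6
    }


# 2009-07-26 is a Sunday, the date used in the queries above.
print(day_of_week_variants(date(2009, 7, 26)))
```

For 2009-07-26 this gives 1, 0, 7 and 6 respectively, the same values as the `spark-sql` session above.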

Actually, extracting `dayofweek` and `dow` are both derived from PostgreSQL but have different meanings.
https://issues.apache.org/jira/browse/SPARK-23903
https://issues.apache.org/jira/browse/SPARK-28623

In this PR, we make extracting `dow` the same as extracting `dayofweek` and the `dayofweek` function, for historical reasons and to avoid breaking anything.

Also, add more documentation to the extract function to make the extract fields clearer.

### Why are the changes needed?

Consistency insurance

### Does this PR introduce any user-facing change?

Yes, the doc is updated and extracting `dow` now behaves the same as `dayofweek`.

### How was this patch tested?

1. modified ut
2. local SQL doc verification
#### before
![image](https://user-images.githubusercontent.com/8326978/79601949-3535b100-811c-11ea-957b-a33d68641181.png)

#### after
![image](https://user-images.githubusercontent.com/8326978/79601847-12a39800-811c-11ea-8ff6-aa329255d099.png)

Closes #28248 from yaooqinn/SPARK-31474.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-21 11:55:33 +00:00
Takeshi Yamamuro e42dbe7cd4 [SPARK-31429][SQL][DOC] Automatically generates a SQL document for built-in functions
### What changes were proposed in this pull request?

This PR intends to add a Python script that generates a SQL document for built-in functions, and to add that document to the SQL references.

### Why are the changes needed?

To make SQL references complete.

### Does this PR introduce any user-facing change?

Yes;

![a](https://user-images.githubusercontent.com/692303/79406712-c39e1b80-7fd2-11ea-8b85-9f9cbb6efed3.png)
![b](https://user-images.githubusercontent.com/692303/79320526-eb46a280-7f44-11ea-8639-90b1fb2b8848.png)
![c](https://user-images.githubusercontent.com/692303/79320707-3365c500-7f45-11ea-9984-69ffe800fb87.png)

### How was this patch tested?

Manually checked and added tests.

Closes #28224 from maropu/SPARK-31429.

Lead-authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-21 10:55:13 +09:00
Wenchen Fan 69f9ee18b6
[SPARK-31452][SQL] Do not create partition spec for 0-size partitions in AQE
### What changes were proposed in this pull request?

This PR skips creating the partition specs in `ShufflePartitionsUtil` for 0-size partitions, which avoids launching unnecessary tasks that do nothing.
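
A minimal sketch of the idea, assuming a hypothetical `coalesce_specs` helper and a target size in bytes. It is not the algorithm in `ShufflePartitionsUtil`; it only shows how empty partitions can be left out of the resulting `(start, end)` ranges so that no task is created for them.

```python
def coalesce_specs(sizes, target):
    """Group adjacent non-empty partitions into [start, end) ranges of roughly `target` bytes.

    Zero-size partitions never start a range and are never included in one,
    so a range that would contain only empty partitions is not emitted.
    """
    specs = []
    start = None   # index where the currently open range begins
    acc = 0        # bytes accumulated in the open range
    for i, size in enumerate(sizes):
        if size == 0:
            # Close the open range (if any) and skip the empty partition.
            if start is not None:
                specs.append((start, i))
                start, acc = None, 0
            continue
        if start is not None and acc + size > target:
            # Adding this partition would exceed the target; close the range first.
            specs.append((start, i))
            start, acc = None, 0
        if start is None:
            start = i
        acc += size
    if start is not None:
        specs.append((start, len(sizes)))
    return specs


print(coalesce_specs([0, 10, 20, 0, 0, 30, 0], target=40))
print(coalesce_specs([0, 0, 0], target=40))
```

When every partition is empty, the result is an empty list, which corresponds to launching no tasks at all.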

### Why are the changes needed?

launching tasks that do nothing is a waste.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

updated tests

Closes #28226 from cloud-fan/aqe.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-20 13:50:07 -07:00
gatorsmile 6c792a79c1 [SPARK-31234][SQL][FOLLOW-UP] ResetCommand should not affect static SQL Configuration
### What changes were proposed in this pull request?
This PR is the follow-up PR of https://github.com/apache/spark/pull/28003

- add a migration guide
- add an end-to-end test case.

### Why are the changes needed?
The original PR made a major behavior change in the user-facing RESET command.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Added a new end-to-end test

Closes #28265 from gatorsmile/spark-31234followup.

Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2020-04-20 13:08:55 -07:00
Max Gekk 88d39e5a89 [SPARK-31385][SQL] Restrict micros rebasing via switch arrays up to 2037 year
### What changes were proposed in this pull request?
1. Generate rebasing arrays for micros up to 2037 in `RebaseDateTimeSuite.generateRebaseJson()`.
2. Exclude 4 time zones from the black list in `generateRebaseJson()`.
3. Re-generate JSON files with rebasing info - `gregorian-julian-rebase-micros.json` and `julian-gregorian-rebase-micros.json`.
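
As a rough picture of rebasing via switch arrays: a sorted array of switch points is paired with an array of offsets, and a timestamp is rebased by adding the offset of the last switch point at or before it. The sketch below is illustrative Python only; the `SWITCHES` and `DIFFS` values are made up and do not come from the generated JSON files, and `rebase_micros` is a hypothetical name.

```python
from bisect import bisect_right

# Hypothetical rebase table for one time zone: switch points in microseconds
# since the epoch (ascending) and the offset that applies from each point onwards.
SWITCHES = [-10_000, 0, 5_000]
DIFFS = [300, 200, 0]


def rebase_micros(micros, switches=SWITCHES, diffs=DIFFS):
    """Add the offset of the last switch point that is <= micros."""
    idx = bisect_right(switches, micros) - 1
    if idx < 0:
        # This sketch has no fallback for values before the first switch point.
        raise ValueError("timestamp precedes the first switch point")
    return micros + diffs[idx]


print(rebase_micros(-1))
print(rebase_micros(10**12))
```

In this sketch, any value past the last switch point reuses the final offset, so the table does not need entries beyond its last switch.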

### Why are the changes needed?
1. `sun.util.calendar.ZoneInfo` resolves DST after the year 2037 incorrectly. See aa318070b2/jdk/src/share/classes/sun/util/calendar/ZoneInfo.java (L55-L62). By restricting the rebase arrays to the year 2037, we follow the behaviour of `ZoneInfo`, which uses the DST of 2037 for all years beyond 2037.
2. To enable optimization of micros rebasing via switch arrays for the time zones:
    - Asia/Tehran
    - Iran
    - Africa/Casablanca
    - Africa/El_Aaiun

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By existing test suites `RebaseDateTimeUtils`, `DateTimeUtilsSuite` and `DateFunctionsSuite`.

Closes #28253 from MaxGekk/fix-4-time-zones-rebasing.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-20 06:35:16 +00:00
Takeshi Yamamuro 74aed8cc8b
[SPARK-31476][SQL] Add an ExpressionInfo entry for EXTRACT
### What changes were proposed in this pull request?

This PR intends to add an ExpressionInfo entry for EXTRACT for better documentation.
This PR comes from the comment in https://github.com/apache/spark/pull/21479#discussion_r409900080

### Why are the changes needed?

To make the SQL documentation complete.

### Does this PR introduce any user-facing change?

Yes, this PR updates the `Spark SQL, Built-in Functions` page.

### How was this patch tested?

Run the example tests.

Closes #28251 from maropu/AddExtractExpr.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-18 13:37:12 -07:00
gatorsmile 6bf5f01a4a [SPARK-31477][SQL] Dump codegen and compile time in BenchmarkQueryTest
### What changes were proposed in this pull request?
This PR is to dump the codegen and compilation time for benchmark query tests.

### Why are the changes needed?
Measure the codegen and compilation time costs in TPC-DS queries

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Manual test in my local laptop:
```
23:13:12.845 WARN org.apache.spark.sql.TPCDSQuerySuite:
=== Metrics of Whole-stage Codegen ===
Total code generation time: 21.275102261 seconds
Total compilation time: 12.223771828 seconds
```

Closes #28252 from gatorsmile/testMastercode.

Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-04-18 20:59:45 +09:00
Takeshi Yamamuro a7fb330ed3 [SPARK-31468][SQL] Null types should be implicitly casted to Decimal types
### What changes were proposed in this pull request?

This PR intends to fix a bug that occurs when comparing null types to decimal types in master/branch-3.0;
```
scala> Seq(BigDecimal(10)).toDF("v1").selectExpr("v1 = NULL").explain(true)
org.apache.spark.sql.AnalysisException: cannot resolve '(`v1` = NULL)' due to data type mismatch: differing types in '(`v1` = NULL)' (decimal(38,18) and null).; line 1 pos 0;
'Project [(v1#5 = null) AS (v1 = NULL)#7]
+- Project [value#2 AS v1#5]
   +- LocalRelation [value#2]
...
```
The query above passed in v2.4.5.

### Why are the changes needed?

bugfix

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added tests.

Closes #28241 from maropu/SPARK-31468.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-17 14:11:17 +00:00
Kent Yao 697083c051 [SPARK-31469][SQL] Make extract interval field ANSI compliance
### What changes were proposed in this pull request?

Currently, we can extract `millennium/century/decade/year/quarter/month/week/day/hour/minute/second(with fractions)/millisecond/microseconds` and `epoch` from interval values.

When extracting `millennium/century/decade/year`, the result is how many of that unit the interval's `months` part can be converted to. The values of `millennium/century/decade` overlap with `year` and with each other.

When extracting `month/day` and so on, the result is the integral remainder of the previous unit. Here all the units, including `year`, are independent.

So when extracting `year`, `month`, `day`, `hour`, `minute`, `second`, which are ANSI primary datetime units, the semantics are `extracting`, whereas the others are closer to `transforming`.
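
The difference can be seen with a small worked example. The following Python sketch is an illustration under stated assumptions, not Spark's implementation: it handles only the `months` part of a non-negative interval, and `interval_month_fields` is a hypothetical helper.

```python
MONTHS_PER_YEAR = 12


def interval_month_fields(total_months):
    """Derive the month-based fields from the months part of a non-negative interval."""
    years = total_months // MONTHS_PER_YEAR
    return {
        # "Transforming" fields: each re-expresses the same months in a larger unit.
        "millennium": years // 1000,
        "century": years // 100,
        "decade": years // 10,
        # "Extracting" fields: year plus the remainder in months.
        "year": years,
        "month": total_months % MONTHS_PER_YEAR,
    }


# 130 months is 10 years and 10 months.
print(interval_month_fields(130))
```

Here `decade` is 1 and `year` is 10 for the same interval, which is the overlap described above, while `year` and `month` together account for the 130 months exactly once.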

When extracting epoch, we have to treat a month as 30 days, which deviates from the natural calendar rules we use.

To avoid ambiguity, I suggest we only support the extract fields defined by ANSI, along with their abbreviations.

### Why are the changes needed?

Extracting `millennium`, `century` etc. does not obey the meaning of extracting, and they are neither very useful nor worth maintaining.

`extract` is an ANSI standard expression and `date_part` is its PostgreSQL-specific alias function. The currently supported extract fields are taken entirely from PostgreSQL.

Other systems such as Presto and Hive do not support those ambiguous fields either.

e.g. Hive 2.2.x also takes it from PostgreSQL but without introducing those ambiguous fields https://issues.apache.org/jira/secure/attachment/12828349/HIVE-14579

e.g. presto

```sql
presto> select extract(quater from interval '10-0' year to month);
Query 20200417_094723_00020_m8xq4 failed: line 1:8: Invalid EXTRACT field: quater
select extract(quater from interval '10-0' year to month)

presto> select extract(decade from interval '10-0' year to month);
Query 20200417_094737_00021_m8xq4 failed: line 1:8: Invalid EXTRACT field: decade
select extract(decade from interval '10-0' year to month)

```

### Does this PR introduce any user-facing change?

Yes, as we already have preview versions, this PR will remove support for extracting `millennium/century/decade/quarter/week/millisecond/microseconds` and `epoch` from intervals with the `date_part` function.

### How was this patch tested?

Removed some tests.

Closes #28242 from yaooqinn/SPARK-31469.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-17 13:59:02 +00:00
beliefer 1513673f83 [SPARK-30913][SPARK-30841][CORE][SQL][FOLLOWUP] Supplement version information to the configuration of Tests.scala and SQL
### What changes were proposed in this pull request?
I checked all the Spark configs again and found that some new commits did not add version information.

**Tests.scala**
Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.testing.skipValidateCores | 3.1.0 | SPARK-29154 | 474b1bb5c2bce2f83c4dd8e19b9b7c5b3aebd6c4#diff-8b4ea8f3b0cc1e7ce7e943de1abbb165 |  

**SQL**
Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.sql.legacy.integerGroupingId | 3.1.0 | SPARK-30279 | 71c73d58f6e88d2558ed2e696897767d93bac60f#diff-9a6b543db706f1a90f790783d6930a13 |  

These two configs exist only in the master branch.

### Why are the changes needed?
Supplement version information.

### Does this PR introduce any user-facing change?
'No'.

### How was this patch tested?
Jenkins test.

Closes #28233 from beliefer/sql-conf-version-legacy-integerGroupingId.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-17 17:10:48 +09:00
yi.wu 40f9dbb628 [SPARK-31425][SQL][CORE] UnsafeKVExternalSorter/VariableLengthRowBasedKeyValueBatch should also respect UnsafeAlignedOffset
### What changes were proposed in this pull request?

Make `UnsafeKVExternalSorter` / `VariableLengthRowBasedKeyValueBatch` also respect `UnsafeAlignedOffset` when reading the record, and update some out-of-date comments.

### Why are the changes needed?

Since `BytesToBytesMap` respects `UnsafeAlignedOffset` when writing the record, `UnsafeKVExternalSorter` should also respect `UnsafeAlignedOffset` when reading the record from `BytesToBytesMap`; otherwise it will cause data correctness issues.

Unlike `UnsafeKVExternalSorter`, which may read records from `BytesToBytesMap`, `VariableLengthRowBasedKeyValueBatch` writes and reads records by itself. Thus, similar to #22053 and the [comment](https://github.com/apache/spark/pull/22053#issuecomment-411975239) there, the fix for `VariableLengthRowBasedKeyValueBatch` is more likely an improvement for SPARC platform support.
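
The correctness issue comes down to the reader and the writer disagreeing on the width of the size prefix. The sketch below is a simplified Python model of that mismatch, not Spark's memory layout; `write_record`, `read_record` and the `uao_size` parameter are hypothetical names, with `uao_size` standing in for the aligned offset size (4 or 8 bytes).

```python
import struct

# Size prefix format per offset size: 4-byte int or 8-byte long (little-endian).
_FORMATS = {4: "<i", 8: "<q"}


def write_record(payload, uao_size):
    """Prefix the payload with its length, stored in `uao_size` bytes."""
    return struct.pack(_FORMATS[uao_size], len(payload)) + payload


def read_record(buf, uao_size):
    """Read the length prefix, then take the payload that starts after `uao_size` bytes."""
    (length,) = struct.unpack_from(_FORMATS[uao_size], buf, 0)
    return buf[uao_size:uao_size + length]


record = write_record(b"spark", uao_size=8)
print(read_record(record, uao_size=8))  # reader agrees with the writer
print(read_record(record, uao_size=4))  # reader assumes a 4-byte prefix
```

A reader that hard-codes a 4-byte prefix starts the payload inside the 8-byte prefix and returns padding bytes instead of the data, which is the kind of mismatch that only shows up when the offset size is 8.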

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manually tested `HashAggregationQueryWithControlledFallbackSuite` with `UAO_SIZE=8` to simulate the SPARC platform. The tests pass only with this fix.

Closes #28195 from Ngone51/fix_uao.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-17 04:48:27 +00:00
herman fab4ca5156
[SPARK-31450][SQL] Make ExpressionEncoder thread-safe
### What changes were proposed in this pull request?
This PR moves the `ExpressionEncoder.toRow` and `ExpressionEncoder.fromRow` functions into their own function objects (`ExpressionEncoder.Serializer` and `ExpressionEncoder.Deserializer`). This effectively makes the `ExpressionEncoder` stateless, thread-safe and (more) reusable. The function objects are not thread-safe; however, they are documented as such and should be used in a more limited scope (making it easier to reason about thread safety).
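
The pattern can be sketched generically: a shared, immutable encoder hands out short-lived function objects that own the mutable state. The Python below is an analogy only, with hypothetical `Encoder` and `Serializer` classes; it does not mirror the Scala API.

```python
import threading


class Encoder:
    """Immutable description of how to encode a value; safe to share across threads."""

    def __init__(self, field_names):
        self._field_names = tuple(field_names)

    def create_serializer(self):
        return Serializer(self._field_names)


class Serializer:
    """Owns a reusable mutable buffer, so one instance must not be shared across threads."""

    def __init__(self, field_names):
        self._field_names = field_names
        self._row = [None] * len(field_names)

    def __call__(self, obj):
        for i, name in enumerate(self._field_names):
            self._row[i] = obj[name]
        return tuple(self._row)


def serialize_in_threads(encoder, n_threads=4, n_rows=100):
    results = {}

    def work(tid):
        serializer = encoder.create_serializer()  # one serializer per thread
        results[tid] = [serializer({"id": i, "name": f"t{tid}"}) for i in range(n_rows)]

    threads = [threading.Thread(target=work, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


out = serialize_in_threads(Encoder(["id", "name"]))
print(len(out))
```

Keeping the buffer inside the per-thread `Serializer` limits the scope in which thread safety has to be reasoned about, while the `Encoder` itself can be shared freely.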

### Why are the changes needed?
ExpressionEncoders are not thread-safe. We had various (nasty) bugs because of this.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #28223 from hvanhovell/SPARK-31450.

Authored-by: herman <herman@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-16 18:47:46 -07:00
Peter Toth 7ad6ba36f2 [SPARK-30564][SQL] Revert Block.length and use comment placeholders in HashAggregateExec
### What changes were proposed in this pull request?
SPARK-21870 (cb0cddf#diff-06dc5de6163687b7810aa76e7e152a76R146-R149) caused a significant performance regression in cases where the source code size is fairly large, as `HashAggregateExec` uses `Block.length` to decide on splitting the code. The change in `length` makes sense, as the comment and extra new lines shouldn't be taken into account when deciding on splitting, but the regular-expression-based approach is very slow and adds a big relative overhead to cases where the execution is quick (small number of rows).
This PR:
- restores `Block.length` to its original form
- places comments in `HashAggregateExec` with `CodegenContext.registerComment` so that they appear only when comments are enabled (`spark.sql.codegen.comments=true`)

Before this PR:
```
deeply nested struct field r/w:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
250 deep x 400 rows (read in-mem)                  1137           1143           8          0.1       11368.3       0.0X
```

After this PR:
```
deeply nested struct field r/w:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
250 deep x 400 rows (read in-mem)                   167            180           7          0.6        1674.3       0.1X
```
### Why are the changes needed?
To fix performance regression.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing UTs.

Closes #28083 from peter-toth/SPARK-30564-use-comment-placeholders.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-04-16 17:52:22 +09:00
Max Gekk c76c31e2c6 [SPARK-31455][SQL] Fix rebasing of not-existed timestamps
### What changes were proposed in this pull request?
In the PR, I propose to change the rebasing of non-existent timestamps in the hybrid calendar (Julian + Gregorian since 1582-10-15) in the range [1582-10-05, 1582-10-15). Non-existent timestamps from the range are shifted to the first valid date in the hybrid calendar, 1582-10-15. The changes affect only `rebaseGregorianToJulianMicros()` because the reverse rebasing from hybrid timestamps to Proleptic Gregorian timestamps does not have this problem.

The shifting affects only the date part of timestamps while keeping the time part as is. For example:
```
1582-10-10 00:11:22.334455 -> 1582-10-15 00:11:22.334455
```
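
The shifting rule can be sketched with `java.time` local date-times. This is an illustration only: the object and method names are hypothetical, and it works on local date-time fields, whereas `rebaseGregorianToJulianMicros()` operates on microseconds since the epoch.

```scala
import java.time.{LocalDate, LocalDateTime}

object CutoverSketch {
  // Last day of the Julian part and first day of the Gregorian part
  // of the hybrid calendar.
  private val lastJulianDay = LocalDate.of(1582, 10, 4)
  private val firstGregorianDay = LocalDate.of(1582, 10, 15)

  // Moves a timestamp from the gap [1582-10-05, 1582-10-15) to 1582-10-15,
  // keeping the time part as is. Other timestamps are returned unchanged.
  def shiftNonExistent(ts: LocalDateTime): LocalDateTime = {
    val date = ts.toLocalDate
    if (date.isAfter(lastJulianDay) && date.isBefore(firstGregorianDay)) {
      LocalDateTime.of(firstGregorianDay, ts.toLocalTime)
    } else {
      ts
    }
  }
}
```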

### Why are the changes needed?
Currently, non-existent timestamps are shifted by the standard difference between the Julian and Gregorian calendars on 1582-10-04, for example 1582-10-14 00:00:00 -> 1582-10-24 00:00:00. That contradicts the shifting of non-existent dates in other cases, for example:
```
scala> sql("select timestamp'1990-9-31 12:12:12'").show
+----------------------------------+
|TIMESTAMP('1990-10-01 12:12:12.0')|
+----------------------------------+
|               1990-10-01 12:12:12|
+----------------------------------+
```

### Does this PR introduce any user-facing change?
Yes, this affects the conversion of Spark SQL `TIMESTAMP` values to external timestamps based on a non-Proleptic Gregorian calendar. For example, while saving the timestamp 1582-10-14 12:13:14 to ORC files, it will be shifted to the next valid date, 1582-10-15 12:13:14.

### How was this patch tested?
- Added tests to `RebaseDateTimeSuite` and to `OrcSourceSuite`
- By existing test suites `DateTimeUtilsSuite`, `DateFunctionsSuite`, `DateExpressionsSuite`, `CollectionExpressionsSuite`, `ParquetIOSuite`.

Closes #28227 from MaxGekk/fix-not-exist-timestamps.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-16 02:54:38 +00:00
Max Gekk 2b10d70bad [SPARK-31423][SQL] Fix rebasing of not-existed dates
### What changes were proposed in this pull request?
In the PR, I propose to change the rebasing of non-existent dates in the hybrid calendar (Julian + Gregorian since 1582-10-15) in the range (1582-10-04, 1582-10-15). Non-existent dates from the range are shifted to the first valid date in the hybrid calendar, 1582-10-15. The changes affect only `rebaseGregorianToJulianDays()` because the reverse rebasing from hybrid dates to Proleptic Gregorian dates does not have this problem.

### Why are the changes needed?
Currently, non-existent dates are shifted by the standard difference between the Julian and Gregorian calendars on 1582-10-04, for example 1582-10-14 -> 1582-10-24. That contradicts the shifting of non-existent dates in other cases, for example:
```
scala> sql("select date'1990-9-31'").show
+-----------------+
|DATE '1990-10-01'|
+-----------------+
|       1990-10-01|
+-----------------+
```

### Does this PR introduce any user-facing change?
Yes, this affects the conversion of Spark SQL `DATE` values to external dates based on a non-Proleptic Gregorian calendar. For example, while saving the 1582-10-14 date to ORC files, it will be shifted to the next valid date, 1582-10-15.

### How was this patch tested?
- Added tests to `RebaseDateTimeSuite` and to `OrcSourceSuite`
- By existing test suites `DateTimeUtilsSuite`, `DateFunctionsSuite`, `DateExpressionsSuite`, `CollectionExpressionsSuite`, `ParquetIOSuite`.

Closes #28225 from MaxGekk/fix-not-exist-dates.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-15 16:33:56 +00:00
Max Gekk 744c2480b5 [SPARK-31443][SQL] Fix perf regression of toJavaDate
### What changes were proposed in this pull request?
Optimise the `toJavaDate()` method of `DateTimeUtils` by:
1. Re-using `rebaseGregorianToJulianDays` optimised by #28067
2. Creating `java.sql.Date` instances from milliseconds in UTC since the epoch instead of date-time fields. This avoids "normalization" inside `java.sql.Date`.
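
The second point can be sketched as follows. This is a simplified illustration, not the code of `toJavaDate()`: the object and method names are hypothetical, calendar rebasing is left out entirely, and the offset lookup is an approximation that can be off near time zone transitions.

```scala
import java.sql.Date
import java.util.TimeZone

object JavaDateSketch {
  private val MillisPerDay = 86400000L

  // Builds a java.sql.Date for local midnight of the given epoch day
  // directly from milliseconds, instead of passing year/month/day fields.
  def fromEpochDays(days: Int): Date = {
    // Milliseconds of local midnight, as if the local time zone were UTC.
    val localMillis = Math.multiplyExact(days.toLong, MillisPerDay)
    val tz = TimeZone.getDefault
    // Approximate the zone offset at that local time (assumption: no
    // transition happens between the estimated and the actual instant).
    val offset = tz.getOffset(localMillis - tz.getRawOffset)
    new Date(localMillis - offset)
  }
}
```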

A new benchmark for collecting dates is also added to `DateTimeBenchmark`.

### Why are the changes needed?
The changes fix the performance regression of collecting `DATE` values compared to Spark 2.4 (see `DateTimeBenchmark` in https://github.com/MaxGekk/spark/pull/27):

Spark 2.4.6-SNAPSHOT:
```
To/from Java's date-time:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
From java.sql.Date                                  559            603          38          8.9         111.8       1.0X
Collect dates                                      2306           3221        1558          2.2         461.1       0.2X
```
Before the changes:
```
To/from Java's date-time:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
From java.sql.Date                                 1052           1130          73          4.8         210.3       1.0X
Collect dates                                      3251           4943        1624          1.5         650.2       0.3X
```
After:
```
To/from Java's date-time:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
From java.sql.Date                                  416            419           3         12.0          83.2       1.0X
Collect dates                                      1928           2759        1180          2.6         385.6       0.2X
```

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
- By existing test suites, in particular, `DateTimeUtilsSuite`, `RebaseDateTimeSuite`, `DateFunctionsSuite`, `DateExpressionsSuite`.
- Re-run `DateTimeBenchmark` in the environment:

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 |

Closes #28212 from MaxGekk/optimize-toJavaDate.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-15 06:19:12 +00:00
Max Gekk 2c5d489679 [SPARK-31439][SQL] Fix perf regression of fromJavaDate
### What changes were proposed in this pull request?
In the PR, I propose to re-use the optimized implementation of the days rebase function `rebaseJulianToGregorianDays()`, introduced by PR #28067, in the conversion of `java.sql.Date` values to Catalyst's `DATE` values. The function `fromJavaDate` in `DateTimeUtils` was rewritten by taking the implementation from Spark 2.4 and rebasing the final results via `rebaseJulianToGregorianDays()`.

I also updated `DateTimeBenchmark` and added a benchmark for conversion from `java.sql.Date`.

### Why are the changes needed?
The PR fixes the regression of parallelizing a collection of `java.sql.Date` values, and improves performance of converting external values to Catalyst's `DATE` values:
- x4 on the master branch
- 30% against Spark 2.4.6-SNAPSHOT

Spark 2.4.6-SNAPSHOT:
```
To/from java.sql.Timestamp:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
From java.sql.Date                                  614            655          43          8.1         122.8       1.0X
```

Before the changes:
```
To/from java.sql.Timestamp:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
From java.sql.Date                                 1154           1206          46          4.3         230.9       1.0X
```

After:
```
To/from java.sql.Timestamp:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
From java.sql.Date                                  427            434           7         11.7          85.3       1.0X
```

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
- By existing test suites, in particular, `DateTimeUtilsSuite`, `RebaseDateTimeSuite`, `DateFunctionsSuite`, `DateExpressionsSuite`.
- Re-run `DateTimeBenchmark` in the environment:

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 |

Closes #28205 from MaxGekk/optimize-fromJavaDate.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-14 14:44:00 +00:00
Wenchen Fan 6b88d136de [SPARK-31402][SQL][FOLLOWUP] Refine code comments in RebaseDateTime
### What changes were proposed in this pull request?

Refine the code comments of days rebasing, to be consistent with the micros rebasing. i.e. one method is the actual implementation and the other variant is the optimized version.

### Why are the changes needed?

improve code comments

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

N/A

Closes #28199 from cloud-fan/comment.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-14 08:06:55 +00:00
Max Gekk a0f8cc08a3 [SPARK-31426][SQL] Fix perf regressions of toJavaTimestamp/fromJavaTimestamp
### What changes were proposed in this pull request?
Reuse the `rebaseGregorianToJulianMicros()` and `rebaseJulianToGregorianMicros()` functions introduced by PR #28119 in `DateTimeUtils`.`toJavaTimestamp()` and `fromJavaTimestamp()`. The new implementation is derived from Spark 2.4 plus rebasing via pre-calculated rebasing maps.

### Why are the changes needed?
The changes speed up conversions to/from java.sql.Timestamp, and as a consequence the PR improves the performance of the ORC datasource in loading/saving timestamps:
- Saving ~ **x2.8 faster** in master, and -11% against Spark 2.4.6
- Loading - **x3.2-4.5 faster** in master, -5% against Spark 2.4.6

Before:
```
Save timestamps to ORC:                   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1582                                        59877          59877           0          1.7         598.8       0.0X
before 1582                                       61361          61361           0          1.6         613.6       0.0X

Load timestamps from ORC:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, vec off                               48197          48288         118          2.1         482.0       1.0X
after 1582, vec on                                38247          38351         128          2.6         382.5       1.3X
before 1582, vec off                              53179          53359         249          1.9         531.8       0.9X
before 1582, vec on                               44076          44268         269          2.3         440.8       1.1X
```

After:
```
Save timestamps to ORC:                   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1582                                        21250          21250           0          4.7         212.5       0.1X
before 1582                                       22105          22105           0          4.5         221.0       0.1X

Load timestamps from ORC:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, vec off                               14903          14933          40          6.7         149.0       1.0X
after 1582, vec on                                 8342           8426          73         12.0          83.4       1.8X
before 1582, vec off                              15528          15575          76          6.4         155.3       1.0X
before 1582, vec on                                9025           9075          61         11.1          90.2       1.7X
```

Spark 2.4.6-SNAPSHOT:
```
Save timestamps to ORC:                   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1582                                        18858          18858           0          5.3         188.6       1.0X
before 1582                                       18508          18508           0          5.4         185.1       1.0X

Load timestamps from ORC:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, vec off                               14063          14177         143          7.1         140.6       1.0X
after 1582, vec on                                 5955           6029         100         16.8          59.5       2.4X
before 1582, vec off                              14119          14126           7          7.1         141.2       1.0X
before 1582, vec on                                5991           6007          25         16.7          59.9       2.3X
```

### Does this PR introduce any user-facing change?
Yes, the `to_utc_timestamp` function returns the later local timestamp in the case of overlapping local timestamps at daylight saving time. It's changed back to the 2.4 behavior.

### How was this patch tested?
- By existing test suites `DateTimeUtilsSuite`, `RebaseDateTimeSuite`, `DateFunctionsSuite`, `DateExpressionsSuite`, `ParquetIOSuite`, `OrcHadoopFsRelationSuite`.
- Re-generating results of the benchmarks `DateTimeBenchmark` and `DateTimeRebaseBenchmark` in the environment:

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 |

Closes #28189 from MaxGekk/optimize-to-from-java-timestamp.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-14 04:50:20 +00:00
Max Gekk cf63ad61f5 [SPARK-31402][SQL] Fix rebasing of BCE dates/timestamps
### What changes were proposed in this pull request?
In the PR, I propose to fall back to rebasing via local dates/timestamps for days/micros before the common era (BCE).

### Why are the changes needed?
It fixes a bug in rebasing BCE dates/timestamps.

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
- By existing tests in `RebaseDateTimeSuite` and `DateTimeUtilsSuite`
- Added tests for negative years to `RebaseDateTimeSuite`

Closes #28172 from MaxGekk/fix-era-in-date-micros-rebasing.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-13 06:07:31 +00:00
Kent Yao d65f534c5a [SPARK-31414][SQL] Fix performance regression with new TimestampFormatter for json and csv time parsing
### What changes were proposed in this pull request?

With the original benchmark, where the timestamp values are valid for the new parser, the result is:
```scala
[info] Running benchmark: Read dates and timestamps
[info]   Running case: timestamp strings
[info]   Stopped after 3 iterations, 5781 ms
[info]   Running case: parse timestamps from Dataset[String]
[info]   Stopped after 3 iterations, 44764 ms
[info]   Running case: infer timestamps from Dataset[String]
[info]   Stopped after 3 iterations, 93764 ms
[info]   Running case: from_json(timestamp)
[info]   Stopped after 3 iterations, 59021 ms
```
When we modify the benchmark to

```scala
     def timestampStr: Dataset[String] = {
        spark.range(0, rowsNum, 1, 1).mapPartitions { iter =>
          iter.map(i => s"""{"timestamp":"1970-01-01T01:02:03.${i % 100}"}""")
        }.select($"value".as("timestamp")).as[String]
      }

      readBench.addCase("timestamp strings", numIters) { _ =>
        timestampStr.noop()
      }

      readBench.addCase("parse timestamps from Dataset[String]", numIters) { _ =>
        spark.read.schema(tsSchema).json(timestampStr).noop()
      }

      readBench.addCase("infer timestamps from Dataset[String]", numIters) { _ =>
        spark.read.json(timestampStr).noop()
      }
```
where the timestamp values are invalid for the new parser, which causes a fallback to the legacy parser (2.4), the result is:

```scala
[info] Running benchmark: Read dates and timestamps
[info]   Running case: timestamp strings
[info]   Stopped after 3 iterations, 5623 ms
[info]   Running case: parse timestamps from Dataset[String]
[info]   Stopped after 3 iterations, 506637 ms
[info]   Running case: infer timestamps from Dataset[String]
[info]   Stopped after 3 iterations, 509076 ms
```
That is about a 10x performance regression.
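
The fallback path can be sketched with plain JDK classes. This is an illustration under stated assumptions, not Spark's `TimestampFormatter`: the object name, both patterns, and the use of `Either` to mark which parser succeeded are all hypothetical. It shows that a value whose fraction does not match the strict pattern first fails in the `java.time` parser and is then parsed a second time by the lenient legacy parser, so such values pay for a thrown exception and two parsing attempts.

```scala
import java.text.SimpleDateFormat
import java.time.LocalDateTime
import java.time.format.{DateTimeFormatter, DateTimeParseException}
import java.util.Date

object FallbackParseSketch {
  // Assumed patterns, for illustration only. In java.time, `SSS` requires
  // exactly three fraction digits; the legacy parser is lenient about that.
  private val strict = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss[.SSS]")
  private val legacyPattern = "yyyy-MM-dd'T'HH:mm:ss.SSS"

  // Right: parsed by the java.time formatter.
  // Left: the java.time formatter failed and the legacy parser was used.
  def parse(s: String): Either[Date, LocalDateTime] =
    try {
      Right(LocalDateTime.parse(s, strict))
    } catch {
      case _: DateTimeParseException =>
        // SimpleDateFormat is not thread-safe, so this sketch creates one per call.
        Left(new SimpleDateFormat(legacyPattern).parse(s))
    }
}
```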

BUT if we modify the timestamp pattern to `....HH:mm:ss[.SSS][XXX]`, which makes all timestamp values valid for the new parser and so prevents the fallback, the result is

```scala
[info] Running benchmark: Read dates and timestamps
[info]   Running case: timestamp strings
[info]   Stopped after 3 iterations, 5623 ms
[info]   Running case: parse timestamps from Dataset[String]
[info]   Stopped after 3 iterations, 506637 ms
[info]   Running case: infer timestamps from Dataset[String]
[info]   Stopped after 3 iterations, 509076 ms
```

### Why are the changes needed?

 Fix performance regression.

### Does this PR introduce any user-facing change?

NO
### How was this patch tested?

new tests added.

Closes #28181 from yaooqinn/SPARK-31414.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-13 03:11:28 +00:00
Kousuke Saruta 6cd0bef7fe
[SPARK-31416][SQL] Check more strictly that a field name can be used as a valid Java identifier for codegen
### What changes were proposed in this pull request?

Check more strictly that a field name can be used as a valid Java identifier in `ScalaReflection.serializerFor`
To check that, `SourceVersion` is used so that we need not manually add reserved keywords to be checked for future Java versions (e.g., underscore, var, yield).
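
A check of this kind can be sketched with `javax.lang.model.SourceVersion` from the JDK. The object and method names below are hypothetical, and the exact condition used in `ScalaReflection.serializerFor` may differ; this only illustrates how `isIdentifier` and `isKeyword` combine.

```scala
import javax.lang.model.SourceVersion

object IdentifierCheckSketch {
  // A name is accepted only if it is syntactically an identifier and is not
  // a keyword of the latest source version known to the running JDK.
  def isValidJavaFieldName(name: String): Boolean =
    SourceVersion.isIdentifier(name) && !SourceVersion.isKeyword(name)
}
```

With this condition, names such as `0` (not an identifier) and `enum` (a keyword) are rejected, while an ordinary name such as `value` is accepted.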

### Why are the changes needed?

In the current implementation, `enum` is not checked even though it's a reserved keyword.
Also, there are many characters and character sequences, including numeric literals, that are not checked.
So we can't get a good error message with the following code.
```
case class  Data(`0`: Int)
Seq(Data(1)).toDF.show

20/04/11 03:24:24 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 43, Column 1: Expression "value_0 = value_3" is not a type
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 43, Column 1: Expression "value_0 = value_3" is not a type

...

```

### Does this PR introduce any user-facing change?

Yes. With this change and the code example above, we get the following error message.
```
java.lang.UnsupportedOperationException: `0` is not a valid identifier of Java and cannot be used as field name
- root class: "Data"

...
```

### How was this patch tested?

Add another assertion to existing test case.

Closes #28184 from sarutak/improve-identifier-check.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-12 13:14:41 -07:00
Dilip Biswal f0e2fc37d1 [SPARK-25154][SQL] Support NOT IN sub-queries inside nested OR conditions
### What changes were proposed in this pull request?

Currently NOT IN subqueries (predicated null aware subquery) are not allowed inside OR expressions. We currently catch this condition in checkAnalysis and throw an error.

This PR enhances the subquery rewrite to support this type of query.

Query
```SQL
SELECT * FROM s1 WHERE a > 5 or b NOT IN (SELECT c FROM s2);
```
Optimized Plan
```SQL
== Optimized Logical Plan ==
Project [a#3, b#4]
+- Filter ((a#3 > 5) || NOT exists#7)
   +- Join ExistenceJoin(exists#7), ((b#4 = c#5) || isnull((b#4 = c#5)))
      :- HiveTableRelation `default`.`s1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#3, b#4]
      +- Project [c#5]
         +- HiveTableRelation `default`.`s2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c#5, d#6]
```
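
The semantics of the optimized plan can be mimicked on in-memory collections. This is a sketch in plain Scala, not Spark code: the names are hypothetical, `None` stands for SQL `NULL`, and column `a` is assumed to be non-null for brevity.

```scala
object NotInSketch {
  // SQL three-valued equality: the result is NULL (None) if either side is NULL.
  def sqlEquals(b: Option[Int], c: Option[Int]): Option[Boolean] =
    for (x <- b; y <- c) yield x == y

  // Mirrors the ExistenceJoin condition `(b = c) || isnull(b = c)`:
  // a row "exists" if some subquery value is equal or the comparison is NULL.
  def exists(b: Option[Int], s2: Seq[Option[Int]]): Boolean =
    s2.exists(c => sqlEquals(b, c).getOrElse(true))

  // Mirrors the filter `(a > 5) || NOT exists`.
  def query(s1: Seq[(Int, Option[Int])], s2: Seq[Option[Int]]): Seq[(Int, Option[Int])] =
    s1.filter { case (a, b) => a > 5 || !exists(b, s2) }
}
```

Treating a `NULL` comparison as a match is what keeps `NOT IN` null-aware: a single `NULL` in the subquery result removes every row that is not kept by the other branch of the `OR`.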
This is a rework of #22141.
The original author of this PR is dilipbiswal.

Closes #22141

### Why are the changes needed?

For better usability.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added new tests in SQLQueryTestSuite, RewriteSubquerySuite and SubquerySuite.
Output from DB2 as a reference:
[nested-not-db2.txt](https://github.com/apache/spark/files/2299945/nested-not-db2.txt)

Closes #28158 from maropu/pr22141.

Lead-authored-by: Dilip Biswal <dkbiswal@gmail.com>
Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Co-authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-04-11 08:28:11 +09:00
Kent Yao a454510917 [SPARK-31392][SQL] Support CalendarInterval to be reflect to CalendarntervalType
### What changes were proposed in this pull request?

Since 3.0.0, CalendarInterval is public for input, so it's better for it to be inferred as CalendarIntervalType.
In the PR, we add a rule in ScalaReflection that maps CalendarInterval to CalendarIntervalType, so that records (e.g. case classes, tuples, ...) containing interval fields can be converted to a DataFrame.

### Why are the changes needed?

CalendarInterval is public but cannot be used as input for a DataFrame.

```scala
scala> import org.apache.spark.unsafe.types.CalendarInterval
import org.apache.spark.unsafe.types.CalendarInterval

scala> Seq((1, new CalendarInterval(1, 2, 3))).toDF("a", "b")
java.lang.UnsupportedOperationException: Schema for type org.apache.spark.unsafe.types.CalendarInterval is not supported
  at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$schemaFor$1(ScalaReflection.scala:735)
```

This should be supported, just like:
```scala
scala> sql("select interval 2 month 1 day a")
res2: org.apache.spark.sql.DataFrame = [a: interval]
```
### Does this PR introduce any user-facing change?

Yes, records (e.g. case classes, tuples, ...) containing interval fields can be converted to a DataFrame.
### How was this patch tested?

Added unit tests.

Closes #28165 from yaooqinn/SPARK-31392.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-10 07:34:01 +00:00
Wenchen Fan 148950fa2b [SPARK-31359][DOC][FOLLOWUP] improve code comments in RebaseDateTime
### What changes were proposed in this pull request?

improve the code comment and make them consistent between `rebaseJulianToGregorian*` and `rebaseGregorianToJulian*`

### Why are the changes needed?

improve readability.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

N/A

Closes #28166 from cloud-fan/comment.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-10 03:43:32 +00:00
Max Gekk e2d9399602 [SPARK-31359][SQL] Speed up timestamps rebasing
### What changes were proposed in this pull request?
In the PR, I propose to optimise the `DateTimeUtils`.`rebaseJulianToGregorianMicros()` and `rebaseGregorianToJulianMicros()` functions, and make them faster by using pre-calculated rebasing tables. This approach avoids expensive conversions via local timestamps. For example, the `America/Los_Angeles` time zone has just a few time points at which the difference between the Proleptic Gregorian calendar and the hybrid calendar (Julian + Gregorian since 1582-10-15) changes in the time interval 0001-01-01 .. 2100-01-01:

| i | local  timestamp | Proleptic Greg. seconds | Hybrid (Julian+Greg) seconds | difference in minutes|
| -- | ------- |----|----| ---- |
|0|0001-01-01 00:00|-62135568422|-62135740800|-2872|
|1|0100-03-01 00:00|-59006333222|-59006419200|-1432|
|...|...|...|...|...|
|13|1582-10-15 00:00|-12219264422|-12219264000|7|
|14|1883-11-18 12:00|-2717640000|-2717640000|0|

The difference in microseconds between the Proleptic and hybrid calendars is the same for any local timestamp in the time interval `[local timestamp(i), local timestamp(i+1))`, and for any microseconds in the time interval `[Gregorian micros(i), Gregorian micros(i+1))`. In this way, we can rebase input micros by following these steps:
1. Look at the table, and find the time interval into which the micros fall
2. Take the difference between 2 calendars for this time interval
3. Add the difference to the input micros. The result is rebased microseconds that has the same local timestamp representation.
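
The three steps above can be sketched as a lookup over two parallel arrays. This is a minimal illustration, assuming that switch points and diffs are already in microseconds; the `RebaseInfo` case class and the object name are hypothetical, and real tables would come from the pre-calculated data rather than from hand-written values.

```scala
object RebaseSketch {
  // switches(i) is the start (in microseconds) of the i-th interval, ordered
  // from old to recent; diffs(i) is the calendar difference (in microseconds)
  // that is valid inside that interval.
  final case class RebaseInfo(switches: Array[Long], diffs: Array[Long]) {
    require(switches.nonEmpty && switches.length == diffs.length)
  }

  def rebaseMicros(info: RebaseInfo, micros: Long): Long = {
    // Step 1: linear search from the most recent interval backwards.
    // Values older than the first switch point use the first diff.
    var i = info.switches.length - 1
    while (i > 0 && micros < info.switches(i)) i -= 1
    // Steps 2 and 3: take the diff of that interval and add it to the input.
    micros + info.diffs(i)
  }
}
```

Searching from the most recent interval first favours modern timestamps, which fall into the last interval and are found after a single comparison.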

Here are details of the implementation:
- Pre-calculated tables are stored in the JSON files `gregorian-julian-rebase-micros.json` and `julian-gregorian-rebase-micros.json` in the resource folder of `sql/catalyst`. The diffs and switch time points are stored as seconds, for example:
```json
[
  {
    "tz" : "America/Los_Angeles",
    "switches" : [ -62135740800, -59006419200, ... , -2717640000 ],
    "diffs" : [ 172378, 85978, ..., 0 ]
  }
]
```
  The JSON files are generated by 2 tests in `RebaseDateTimeSuite` - `generate 'gregorian-julian-rebase-micros.json'` and `generate 'julian-gregorian-rebase-micros.json'`. Both tests are disabled by default.
  The `switches` time points are ordered from old to recent timestamps. This condition is checked by the test `validate rebase records in JSON files` in `RebaseDateTimeSuite`. Also sizes of the `switches` and `diffs` arrays are the same (this is checked by the same test).

- The **_Asia/Tehran, Iran, Africa/Casablanca and Africa/El_Aaiun_** time zones weren't added to the JSON files, see [SPARK-31385](https://issues.apache.org/jira/browse/SPARK-31385)
- The rebase info from the JSON files is placed into hash tables, `gregJulianRebaseMap` and `julianGregRebaseMap`. I use `AnyRefMap` because it is almost 2 times faster than Scala's immutable Map. I also tried `java.util.HashMap`, but it has a worse lookup time than `AnyRefMap` in our case.
The hash maps store the switch time points and diffs in microseconds precision to avoid conversions from microseconds to seconds in the runtime.

- I moved the code related to days and microseconds rebasing to the separate object `RebaseDateTime` so as not to pollute `DateTimeUtils`. Tests related to date-time rebasing are moved to `RebaseDateTimeSuite` for the same reason.

- I placed rebasing via local timestamps in separate methods that take the zone id as the first parameter, assuming that the caller already has the zone id. This avoids unnecessarily retrieving the default time zone. The methods are marked as `private[sql]` because they are used in `RebaseDateTimeSuite` as the reference implementation.

- Modified the `rebaseGregorianToJulianMicros()` and `rebaseJulianToGregorianMicros()` methods in `RebaseDateTime` to look up the rebase tables first. If the hash maps don't contain rebasing info for the given time zone id, the methods fall back to the implementation via local timestamps. This makes it possible to support time zones specified as zone offsets like '-08:00'.

### Why are the changes needed?
To make timestamps rebasing faster:
- Saving timestamps to parquet files is ~ **x3.8 faster**
- Loading timestamps from parquet files is ~**x2.8 faster**.
- Loading timestamps by Vectorized reader ~**x4.6 faster**.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
- Added the test `validate rebase records in JSON files` to `RebaseDateTimeSuite`. The test validates 2 JSON files from the resource folder, `gregorian-julian-rebase-micros.json` and `julian-gregorian-rebase-micros.json`, and checks for each time zone's records that
  - the number of switch points is equal to the number of diffs between calendars. If the numbers are different, this will violate the assumption made in `RebaseDateTime.rebaseMicros`.
  - switch points are ordered from old to recent timestamps. This pre-condition is required for linear search in the `rebaseMicros` function.
- Added the test `optimization of micros rebasing - Gregorian to Julian` to `RebaseDateTimeSuite` which iterates over timestamps from 0001-01-01 to 2100-01-01 with the steps 1 ± 0.5 months, and checks that optimised function `RebaseDateTime`.`rebaseGregorianToJulianMicros()` returns the same result as non-optimised one. The check is performed for the UTC, PST, CET, Africa/Dakar, America/Los_Angeles, Antarctica/Vostok, Asia/Hong_Kong, Europe/Amsterdam time zones.
- Added the test `optimization of micros rebasing - Julian to Gregorian` to `RebaseDateTimeSuite` which does similar checks as the test above but for rebasing from the hybrid calendar (Julian + Gregorian) to Proleptic Gregorian calendar.
- The tests for days rebasing are moved from `DateTimeUtilsSuite` to `RebaseDateTimeSuite` because the rebasing related code is moved from `DateTimeUtils` to the separate object `RebaseDateTime`.
- Re-run `DateTimeRebaseBenchmark` at the America/Los_Angeles time zone (it is set explicitly in the PR #28127):

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 |

Closes #28119 from MaxGekk/optimize-rebase-micros.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-09 05:23:52 +00:00
iRakson b56242332d
[SPARK-31009][SQL] Support json_object_keys function
### What changes were proposed in this pull request?
A new function `json_object_keys` is proposed in this PR. This function will return all the keys of the outermost JSON object. It takes a JSON object as an argument.

- If an invalid JSON expression is given, `NULL` will be returned.
- If an empty string or a JSON array is given, `NULL` will be returned.
- If a valid JSON object is given, all the keys of the outermost object will be returned as an array.
- For an empty JSON object, an empty array is returned.

We can also get JSON object keys using `map_keys+from_json`, but `json_object_keys` is more efficient.
```
Performance result for json_object = {"a":[1,2,3,4,5], "b":[2,4,5,12333321]}

Intel(R) Core(TM) i7-9750H CPU  2.60GHz
JSON functions:                           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
json_object_keys                                  11666          12361         673          0.9        1166.6       1.0X
from_json+map_keys                                15309          15973         701          0.7        1530.9       0.8X

```

### Why are the changes needed?
This function will help less experienced users extract the keys directly from a JSON string, and it is fairly intuitive as well. It also extends the functionality of Spark SQL for JSON strings.

Some of the most popular DBMSs support this function:
- PostgreSQL
- MySQL
- MariaDB

### Does this PR introduce any user-facing change?
Yes. Now users can extract keys of json objects using this function.

### How was this patch tested?
UTs added.

Closes #27836 from iRakson/jsonKeys.

Authored-by: iRakson <raksonrakesh@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-08 13:04:59 -07:00
iRakson 71022d7130
[SPARK-31008][SQL] Support json_array_length function
### What changes were proposed in this pull request?
At the moment we do not have any function to compute the length of a JSON array directly.
I propose a `json_array_length` function which will return the length of the outermost JSON array.

- This function will return the length of the outermost JSON array if the JSON array is valid.
```

scala> spark.sql("select json_array_length('[1,2,3,[33,44],{\"key\":[2,3,4]}]')").show
+--------------------------------------------------+
|json_array_length([1,2,3,[33,44],{"key":[2,3,4]}])|
+--------------------------------------------------+
|                                                 5|
+--------------------------------------------------+

scala> spark.sql("select json_array_length('[[1],[2,3]]')").show
+------------------------------+
|json_array_length([[1],[2,3]])|
+------------------------------+
|                             2|
+------------------------------+

```
- For any other valid JSON string, an invalid JSON string, a null array, or `NULL` input, `NULL` will be returned.
```
scala> spark.sql("select json_array_length('')").show
+-------------------+
|json_array_length()|
+-------------------+
|               null|
+-------------------+
```

### Why are the changes needed?

- As mentioned in JIRA, this function is supported by Presto, PostgreSQL, Redshift, SQLite, MySQL, MariaDB, IBM DB2.

- For a better user experience and ease of use.

```
Performance Result for Json array - [1, 2, 3, 4]

Intel(R) Core(TM) i7-9750H CPU  2.60GHz
JSON functions:                           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
json_array_length                                  7728           7762          53          1.3         772.8       1.0X
size+from_json                                    12739          12895         199          0.8        1273.9       0.6X

```

### Does this PR introduce any user-facing change?
Yes, now users can get the length of a JSON array by using `json_array_length`.

### How was this patch tested?
Added UT.

Closes #27759 from iRakson/jsonArrayLength.

Authored-by: iRakson <raksonrakesh@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-07 15:34:33 -07:00
Eric Wu a28ed86a38
[SPARK-31113][SQL] Add SHOW VIEWS command
### What changes were proposed in this pull request?
Previously, users could issue `SHOW TABLES` to get info on both tables and views.
This PR (SPARK-31113) implements the `SHOW VIEWS` SQL command, similar to Hive, to get views only (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowViews).

**Hive** -- Only show view names
```
hive> SHOW VIEWS;
OK
view_1
view_2
...
```

**Spark (Hive-Compatible)** -- Only show view names, used in tests and `SparkSQLDriver` for CLI applications
```
SHOW VIEWS IN showdb;
view_1
view_2
...
```

**Spark** -- Show more information: database/viewName/isTemporary
```
spark-sql> SHOW VIEWS;
userdb	view_1	false
userdb	view_2	false
...
```

### Why are the changes needed?
The `SHOW VIEWS` command provides finer granularity by returning information about views only.

### Does this PR introduce any user-facing change?
Adds a new `SHOW VIEWS` SQL command.

### How was this patch tested?
Added a new test `show-views.sql` and passed existing tests.

Closes #27897 from Eric5553/ShowViews.

Authored-by: Eric Wu <492960551@qq.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-07 09:25:01 -07:00
HyukjinKwon 6cf7336a07 [SPARK-30841][SQL][FOLLOW-UP] Change 'version' of spark.sql.execution.pandas.udf.buffer.size to 3.0.0
### What changes were proposed in this pull request?

This PR fixes the added version of `spark.sql.execution.pandas.udf.buffer.size` to 3.0.0 (see also SPARK-27870)

### Why are the changes needed?

To show the correct version added.

### Does this PR introduce any user-facing change?

Yes but only in the unreleased branches. It will change the version shown in SQL documentation.

### How was this patch tested?

Not tested. Jenkins will test it out.

Closes #28144 from HyukjinKwon/SPARK-30841-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-07 23:06:53 +09:00
beliefer 112caea214 [SPARK-31315][SQL][FOLLOWUP][MINOR] Fix some typo and improve comments
### What changes were proposed in this pull request?
This is a minor PR to fix some typos and improve comments, as mentioned in https://github.com/apache/spark/pull/28081/files#r402874997

### Why are the changes needed?
Fix some typos and improve comments.

### Does this PR introduce any user-facing change?
'No'.

### How was this patch tested?
Jenkins test.

Closes #28112 from beliefer/fix-typo-in-codegen.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-06 17:43:02 +09:00
Maxim Gekk 09198528b9 [SPARK-31343][SQL][TESTS] Check codegen does not fail on expressions with escape chars in string parameters
### What changes were proposed in this pull request?
In the PR, I propose to add tests to check that code generation doesn't fail if an expression's string argument contains escape chars. The PR adds tests similar to those added by https://github.com/apache/spark/pull/20182 for `from_utc_timestamp` / `to_utc_timestamp`.

### Why are the changes needed?
To prevent regressions in the future.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By running the affected tests

Closes #28115 from MaxGekk/tests-arg-escape.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-06 05:43:29 +00:00
Liang-Chi Hsieh 1f02871489 [SPARK-30921][PYSPARK] Predicates on python udf should not be pushdown through Aggregate
### What changes were proposed in this pull request?

This patch proposes to skip pushing down predicates on PythonUDFs through Aggregate.

### Why are the changes needed?

Predicates on PythonUDFs cannot be pushed down through Aggregate. The pushed-down predicates cannot be evaluated because PythonUDFs cannot be evaluated in Filter, which causes an error like:

```
Caused by: java.lang.UnsupportedOperationException: Cannot generate code for expression: mean(input[1, struct<bar:bigint>, true].bar)
        at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode(Expression.scala:304)
        at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode$(Expression.scala:303)
        at org.apache.spark.sql.catalyst.expressions.PythonUDF.doGenCode(PythonUDF.scala:52)
        at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:146)
        at scala.Option.getOrElse(Option.scala:189)
        at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:141)
        at org.apache.spark.sql.catalyst.expressions.CastBase.doGenCode(Cast.scala:821)
        at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:146)
        at scala.Option.getOrElse(Option.scala:189)
```

### Does this PR introduce any user-facing change?

Yes. Previously, predicates on PythonUDFs were pushed down through Aggregate, which could cause an error. After this change, the query works.

### How was this patch tested?

Unit test.

Closes #28089 from viirya/SPARK-30921.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-06 09:36:20 +09:00
gatorsmile 39a6f518cb
[SPARK-29554][SQL][FOLLOWUP] Update Auto-generated Alias Name for Version
### What changes were proposed in this pull request?
The auto-generated alias name of built-in function `version()` is `sparkversion()`. After this PR, it is updated to `version()`.

### Why are the changes needed?
Based on our auto-generated alias name convention for the built-in functions, the alias names should be consistent with the function names.

This built-in function `version` is added in the upcoming Spark 3.0. Thus, we should fix it before the release.

### Does this PR introduce any user-facing change?
Yes. Update the column name in schema if users do not specify the alias.

### How was this patch tested?
Added a test case.

Closes #28131 from gatorsmile/spark-29554followup.

Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-05 16:37:03 -07:00
Maxim Gekk 820bb9985a [SPARK-31328][SQL] Fix rebasing of overlapped local timestamps during daylight saving time
### What changes were proposed in this pull request?
1. Fix the `rebaseGregorianToJulianMicros()` function in `DateTimeUtils` by passing the daylight saving offset associated with the input `micros` to the constructed instance of `GregorianCalendar`. The problem is in `cal.getTimeInMillis`, which returns the earliest instant in the case of local date-time overlaps, see https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/master/jdk/src/share/classes/java/util/GregorianCalendar.java#L2783-L2786 . I fixed the issue by keeping the standard zone offset as is and setting the DST offset only. I don't set `ZONE_OFFSET` because time zone resolution works differently in the Java 8 and Java 7 time APIs. So, if I set the standard zone offset too, this could change the behavior, and rebasing wouldn't give the same result as Spark 2.4.
2. Fix `rebaseJulianToGregorianMicros()` by changing the resulting zoned date-time if `DST_OFFSET` is zero, which means the input date-time has passed an autumn daylight saving cutover. So, I take the latest local timestamp out of the 2 overlapped timestamps. Otherwise, I return the zoned date-time without any modification because it is equal to the result of calling the `withEarlierOffsetAtOverlap()` method, so we can optimize this case.

### Why are the changes needed?
This fixes the bug of losing DST offset info when rebasing timestamps via local date-time. For example, there are 2 different timestamps in the `America/Los_Angeles` time zone: `2019-11-03T01:00:00-07:00` and `2019-11-03T01:00:00-08:00`, though they are mapped to the same local date-time `2019-11-03T01:00`, see
<img width="456" alt="Screen Shot 2020-04-02 at 10 19 24" src="https://user-images.githubusercontent.com/1580697/78245697-95a7da00-74f0-11ea-9eba-c08138851cb3.png">
Currently, the UTC timestamp `2019-11-03T09:00:00Z` is converted to `2019-11-03T01:00:00-08:00`, and then to `2019-11-03T01:00:00` (in the original calendar, for instance Proleptic Gregorian calendar) and back to the UTC timestamp `2019-11-03T08:00:00Z` (in the hybrid calendar - Gregorian for the timestamp). That's wrong because the local timestamp must be converted to the original timestamp `2019-11-03T09:00:00Z`.
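
The round trip described above can be reproduced with the Java 8 time API alone. This is a minimal sketch, assuming a JDK with up-to-date time zone data; the object `DstOverlapSketch` and its method names are hypothetical and are not the code changed by this PR. It shows that converting through a bare local date-time picks the earlier offset at the overlap, while passing the original offset as the preferred one preserves the instant.

```scala
import java.time.{Instant, LocalDateTime, ZoneId, ZonedDateTime}

// Hypothetical illustration of the DST overlap problem; not Spark's implementation.
object DstOverlapSketch {
  private val zone: ZoneId = ZoneId.of("America/Los_Angeles")

  // Drops the offset: at an overlap, `atZone` resolves the local date-time
  // to the earlier offset (the "summer" one).
  def viaLocalDateTime(instant: Instant): Instant = {
    val local: LocalDateTime = LocalDateTime.ofInstant(instant, zone)
    local.atZone(zone).toInstant
  }

  // Keeps the original offset and passes it as the preferred offset,
  // so the overlap is resolved back to the original instant.
  def viaLocalDateTimeWithOffset(instant: Instant): Instant = {
    val zoned: ZonedDateTime = instant.atZone(zone)
    ZonedDateTime.ofLocal(zoned.toLocalDateTime, zone, zoned.getOffset).toInstant
  }

  def main(args: Array[String]): Unit = {
    val original = Instant.parse("2019-11-03T09:00:00Z")
    println(viaLocalDateTime(original))           // 2019-11-03T08:00:00Z
    println(viaLocalDateTimeWithOffset(original)) // 2019-11-03T09:00:00Z
  }
}
```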

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
- Added a test to `DateTimeUtilsSuite` which checks that rebased micros are the same as the input during DST. The result must be the same if Java 8 and 7 time API functions return the same time zone offsets.
- Ran the following code to check that there is no difference between rebased and original micros for modern timestamps:
```scala
    test("rebasing differences") {
      withDefaultTimeZone(getZoneId("America/Los_Angeles")) {
        val start = instantToMicros(LocalDateTime.of(1, 1, 1, 0, 0, 0)
          .atZone(getZoneId("America/Los_Angeles"))
          .toInstant)
        val end = instantToMicros(LocalDateTime.of(2030, 1, 1, 0, 0, 0)
          .atZone(getZoneId("America/Los_Angeles"))
          .toInstant)

        var micros = start
        var diff = Long.MaxValue
        var counter = 0
        while (micros < end) {
          val rebased = rebaseGregorianToJulianMicros(micros)
          val curDiff = rebased - micros
          if (curDiff != diff) {
            counter += 1
            diff = curDiff
            val ldt = microsToInstant(micros).atZone(getZoneId("America/Los_Angeles")).toLocalDateTime
            println(s"local date-time = $ldt diff = ${diff / MICROS_PER_MINUTE} minutes")
          }
          micros += 30 * MICROS_PER_MINUTE
        }
        println(s"counter = $counter")
      }
    }
```
```
local date-time = 0001-01-01T00:00 diff = -2872 minutes
local date-time = 0100-03-01T00:00 diff = -1432 minutes
local date-time = 0200-03-01T00:00 diff = 7 minutes
local date-time = 0300-03-01T00:00 diff = 1447 minutes
local date-time = 0500-03-01T00:00 diff = 2887 minutes
local date-time = 0600-03-01T00:00 diff = 4327 minutes
local date-time = 0700-03-01T00:00 diff = 5767 minutes
local date-time = 0900-03-01T00:00 diff = 7207 minutes
local date-time = 1000-03-01T00:00 diff = 8647 minutes
local date-time = 1100-03-01T00:00 diff = 10087 minutes
local date-time = 1300-03-01T00:00 diff = 11527 minutes
local date-time = 1400-03-01T00:00 diff = 12967 minutes
local date-time = 1500-03-01T00:00 diff = 14407 minutes
local date-time = 1582-10-15T00:00 diff = 7 minutes
local date-time = 1883-11-18T12:22:58 diff = 0 minutes
counter = 15
```
The code is not added to `DateTimeUtilsSuite` because it takes > 30 seconds.
- By running the updated benchmark `DateTimeRebaseBenchmark` via the command:
```
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.DateTimeRebaseBenchmark"
```
in the environment:

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK 1.8.0_242-8u242/11.0.6+10 |

Closes #28101 from MaxGekk/fix-local-date-overlap.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-03 04:35:31 +00:00
Takeshi Yamamuro d98df7626b [SPARK-31325][SQL][WEB UI] Control a plan explain mode in the events of SQL listeners via SQLConf
### What changes were proposed in this pull request?

This PR intends to add a new SQL config for controlling a plan explain mode in the events of SQL listeners (e.g., `SparkListenerSQLExecutionStart` and `SparkListenerSQLAdaptiveExecutionUpdate`). In the current master, the output of `QueryExecution.toString` (this is equivalent to the "extended" explain mode) is stored in these events. I think it is useful to control the content via `SQLConf`. For example, the query "Details" content (TPCDS q66 query) of a SQL tab in a Spark web UI will be changed as follows:

Before this PR:
![q66-extended](https://user-images.githubusercontent.com/692303/78211668-950b4580-74e8-11ea-90c6-db52d437534b.png)

After this PR:
![q66-formatted](https://user-images.githubusercontent.com/692303/78211674-9ccaea00-74e8-11ea-9d1d-43c7e2b0f314.png)

### Why are the changes needed?

For better usability.

### Does this PR introduce any user-facing change?

Yes; since Spark 3.1, SQL UI data adopts the `formatted` mode for the query plan explain results. To restore the behavior before Spark 3.0, you can set `spark.sql.ui.explainMode` to `extended`.

### How was this patch tested?

Added unit tests.

Closes #28097 from maropu/SPARK-31325.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-04-02 21:09:16 -07:00
beliefer a9260d0349 [SPARK-31315][SQL] SQLQueryTestSuite: Display the total compile time for generated java code
### What changes were proposed in this pull request?
My investigation showed that `SQLQueryTestSuite` spends a lot of time compiling the generated Java code.
Take `group-by.sql` as an example.
First, I added some debug logging to `SQLQueryTestSuite`.
Please reference 92b6af740c/sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala (L402)
The execution command is as follows:
`build/sbt "~sql/test-only *SQLQueryTestSuite -- -z group-by.sql"`
The output is shown below:
```
00:56:06.192 WARN org.apache.spark.sql.SQLQueryTestSuite: group-by.sql using configs: spark.sql.codegen.wholeStage=true. run time: 20604
00:56:13.719 WARN org.apache.spark.sql.SQLQueryTestSuite: group-by.sql using configs: spark.sql.codegen.wholeStage=false,spark.sql.codegen.factoryMode=CODEGEN_ONLY. run time: 7526
00:56:18.786 WARN org.apache.spark.sql.SQLQueryTestSuite: group-by.sql using configs: spark.sql.codegen.wholeStage=false,spark.sql.codegen.factoryMode=NO_CODEGEN. run time: 5066
```
According to the log:

Config | Run time(ms)
-- | --
spark.sql.codegen.wholeStage=true | 20604
spark.sql.codegen.wholeStage=false,spark.sql.codegen.factoryMode=CODEGEN_ONLY | 7526
spark.sql.codegen.wholeStage=false,spark.sql.codegen.factoryMode=NO_CODEGEN | 5066

We should display the total compile time for generated java code.

This PR will add the following to `SQLQueryTestSuite`'s output.
```
=== Metrics of Whole Codegen ===
Total compile time: 80.564516529 seconds
```

Note: At first, I wanted to use `CodegenMetrics.METRIC_COMPILATION_TIME` to do this. After many experiments, I found that `CodegenMetrics.METRIC_COMPILATION_TIME` is only effective for a single test case, and cannot play a role in the whole life cycle of `SQLQueryTestSuite`.
I checked that the type of `CodegenMetrics.METRIC_COMPILATION_TIME` is `Histogram`, and the latter preserves 1028 elements. `Histogram` is a metric which calculates the distribution of a value.
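
The limitation in the note above can be illustrated with a small sketch: a metric that keeps a bounded sample cannot give the total over a whole suite run, while a running counter can. Everything here is hypothetical and is not the code added by this PR. In particular, `BoundedSample` simply keeps the most recent values, which is a simplification assumed for illustration and not how `Histogram` selects its sample.

```scala
import java.util.concurrent.atomic.AtomicLong
import scala.collection.mutable

// Hypothetical sketch; names and sampling policy are assumptions for illustration.
object CompileTimeSketch {
  // Keeps only the most recent `capacity` values, so older values are lost.
  final class BoundedSample(capacity: Int) {
    private val values = mutable.Queue.empty[Long]
    def update(v: Long): Unit = {
      values.enqueue(v)
      if (values.size > capacity) values.dequeue()
    }
    def sum: Long = values.sum
  }

  // Running total that survives across all test cases.
  final class TotalTime {
    private val nanos = new AtomicLong(0L)
    def add(v: Long): Unit = nanos.addAndGet(v)
    def totalNanos: Long = nanos.get()
  }

  // Records the same compile time `compilations` times into both metrics
  // and returns (sum over the bounded sample, running total), in nanoseconds.
  def simulate(compilations: Int, nanosEach: Long, capacity: Int): (Long, Long) = {
    val sample = new BoundedSample(capacity)
    val total = new TotalTime
    (1 to compilations).foreach { _ =>
      sample.update(nanosEach)
      total.add(nanosEach)
    }
    (sample.sum, total.totalNanos)
  }

  def main(args: Array[String]): Unit = {
    // 2000 compilations of 1 ms each, with a sample capacity of 1028.
    val (sampled, total) = simulate(2000, 1000000L, 1028)
    println(s"Sum over bounded sample: ${sampled / 1e9} seconds")
    println(s"Total compile time: ${total / 1e9} seconds")
  }
}
```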

### Why are the changes needed?
Display the total compile time for generated java code.

### Does this PR introduce any user-facing change?
'No'.

### How was this patch tested?
Jenkins test.

Closes #28081 from beliefer/output-codegen-compile-time.

Lead-authored-by: beliefer <beliefer@163.com>
Co-authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-02 09:13:22 +00:00