## What changes were proposed in this pull request?
Check namespace existence when calling "use namespace", and throw `NoSuchNamespaceException` if the namespace does not exist.
### Why are the changes needed?
Users need to know that the namespace does not exist when they try to set a wrong namespace.
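For illustration, a minimal sketch of the new behavior, assuming a v2 catalog and a namespace name that does not exist (both hypothetical):
```scala
// Switching to a namespace that does not exist now fails fast
// instead of silently accepting the bogus namespace.
spark.sql("USE testcat.ns_does_not_exist")
// expected: org.apache.spark.sql.catalyst.analysis.NoSuchNamespaceException
```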
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Ran all suites and added a test for this.
Closes#27900 from stczwd/SPARK-31100.
Authored-by: stczwd <qcsd2011@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This patch fixes wrong groupBy result if the grouping key is a null-value struct.
### Why are the changes needed?
`NormalizeFloatingNumbers` reconstructs a struct if the input expression is of StructType. If the input struct is null, it will reconstruct a struct with null-value fields instead of null.
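A hedged sketch of the affected query shape (column and table names are illustrative); the grouping key is a nullable struct containing a double field, which is the shape `NormalizeFloatingNumbers` rewrites:
```scala
// Every odd id produces a NULL struct key; before the fix the null keys could be
// normalized into a struct of null fields and grouped incorrectly.
val df = spark.sql(
  "SELECT IF(id % 2 = 0, named_struct('a', CAST(id AS DOUBLE)), NULL) AS key FROM range(4)")
df.groupBy("key").count().show(false)
```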
### Does this PR introduce _any_ user-facing change?
Yes, fixing incorrect groupBy result.
### How was this patch tested?
Unit test.
Closes#28962 from viirya/SPARK-32136.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
As a followup of #28900, this patch extends partition coalescing to repartitioning using hints and SQL syntax without specifying the number of partitions, when AQE is enabled.
### Why are the changes needed?
When repartitioning using hints and SQL syntax, we should follow the shuffling behavior of repartition by expression/range and coalesce partitions when AQE is enabled.
### Does this PR introduce _any_ user-facing change?
Yes. After this change, if users don't specify the number of partitions when repartitioning using `REPARTITION`/`REPARTITION_BY_RANGE` hint or `DISTRIBUTE BY`/`CLUSTER BY`, AQE will coalesce partitions.
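A hedged sketch of the repartitioning forms covered by this change; since no partition number is given, AQE is free to coalesce the resulting shuffle partitions:
```scala
spark.conf.set("spark.sql.adaptive.enabled", "true")

// Hint forms without an explicit partition number.
spark.sql("SELECT /*+ REPARTITION(id) */ * FROM range(1000)")
spark.sql("SELECT /*+ REPARTITION_BY_RANGE(id) */ * FROM range(1000)")

// SQL syntax forms.
spark.sql("SELECT * FROM range(1000) DISTRIBUTE BY id")
spark.sql("SELECT * FROM range(1000) CLUSTER BY id")
```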
### How was this patch tested?
Unit tests.
Closes#28952 from viirya/SPARK-32056-sql.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Set the JSON option `inferTimestamp` to `false` if the user doesn't pass it as a datasource option.
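A minimal sketch of opting back in after this change (the path is hypothetical):
```scala
// Timestamp inference is now off by default; re-enable it explicitly if needed.
val df = spark.read
  .option("inferTimestamp", "true")
  .json("/path/to/logs.json")
```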
### Why are the changes needed?
To prevent a perf regression while inferring schemas from JSON with potential timestamp fields.
### Does this PR introduce _any_ user-facing change?
Yes
### How was this patch tested?
- Modified existing tests in `JsonSuite` and `JsonInferSchemaSuite`.
- Regenerated results of `JsonBenchmark` in the environment:
| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK 64-Bit Server VM 1.8.0_252 and OpenJDK 64-Bit Server VM 11.0.7+10 |
Closes#28966 from MaxGekk/json-inferTimestamps-disable-by-default.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This is a followup of https://github.com/apache/spark/pull/28534 , to make `TIMESTAMP_SECONDS` function support fractional input as well.
### Why are the changes needed?
Previously the cast function could cast fractional values to timestamp. Now we suggest users use these new functions instead, so we need to cover all the cast use cases.
### Does this PR introduce _any_ user-facing change?
Yes, now `TIMESTAMP_SECONDS` function accepts fractional input.
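A quick illustration of the new behavior:
```scala
// Fractional seconds are now accepted; previously only integral input worked.
spark.sql("SELECT TIMESTAMP_SECONDS(1230219000.123)").show(false)
```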
### How was this patch tested?
new tests
Closes#28956 from cloud-fan/follow.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Spark can't push down a scan predicate containing an **Or** condition:
For example, suppose I have a table `default.test` whose partition column is `dt`.
If we use the query:
```
select * from default.test
where dt=20190625 or (dt = 20190626 and id in (1,2,3) )
```
In this case, Spark resolves the **Or** condition as one expression, and since this expression references "id", it can't be pushed down.
Based on PR https://github.com/apache/spark/pull/28733, in my PR, for SQL like
`select * from default.test`
`where dt = 20190626 or (dt = 20190627 and xxx="a") `
the condition `dt = 20190626 or (dt = 20190627 and xxx="a" )` will be converted to CNF:
```
(dt = 20190626 or dt = 20190627) and (dt = 20190626 or xxx = "a" )
```
and then the condition `dt = 20190626 or dt = 20190627` will be pushed down during partition pruning.
### Why are the changes needed?
Optimize partition pruning
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
Added UT
Closes#28805 from AngersZhuuuu/cnf-for-partition-pruning.
Lead-authored-by: angerszhu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Update documentation to reflect changes in faf220aad9.
I've changed the documentation to reflect that updated statistics may be used to improve the query plan.
### Why are the changes needed?
I believe the documentation is stale and misleading.
### Does this PR introduce _any_ user-facing change?
Yes, this is a javadoc documentation fix.
### How was this patch tested?
Doc fix.
Closes#28925 from emkornfield/spark-32095.
Authored-by: Micah Kornfield <micahk@google.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Fix wrong column ordinals in the exception messages for Union and set operations.
### Why are the changes needed?
Union and set operations can only be performed on tables with compatible column types; when a table has more than two columns, the exception messages report the wrong column index.
Steps to reproduce:
```
drop table if exists test1;
drop table if exists test2;
drop table if exists test3;
create table if not exists test1(id int, age int, name timestamp);
create table if not exists test2(id int, age timestamp, name timestamp);
create table if not exists test3(id int, age int, name int);
insert into test1 select 1,2,'2020-01-01 01:01:01';
insert into test2 select 1,'2020-01-01 01:01:01','2020-01-01 01:01:01';
insert into test3 select 1,3,4;
```
Query1:
```sql
select * from test1 except select * from test2;
```
Result1:
```
Error: org.apache.spark.sql.AnalysisException: Except can only be performed on tables with the compatible column types. timestamp <> int at the second column of the second table;; 'Except false :- Project [id#620, age#621, name#622] : +- SubqueryAlias `default`.`test1` : +- HiveTableRelation `default`.`test1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#620, age#621, name#622] +- Project [id#623, age#624, name#625] +- SubqueryAlias `default`.`test2` +- HiveTableRelation `default`.`test2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#623, age#624, name#625] (state=,code=0)
```
Query2:
```sql
select * from test1 except select * from test3;
```
Result2:
```
Error: org.apache.spark.sql.AnalysisException: Except can only be performed on tables with the compatible column types
int <> timestamp at the 2th column of the second table;
```
Query1 above produces the right exception message.
Query2 above produces the wrong error information; it should be changed to the following:
```
Error: org.apache.spark.sql.AnalysisException: Except can only be performed on tables with the compatible column types.
int <> timestamp at the third column of the second table
```
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
unit test
Closes#28951 from GuoPhilipse/32131-correct-error-messages.
Lead-authored-by: GuoPhilipse <46367746+GuoPhilipse@users.noreply.github.com>
Co-authored-by: GuoPhilipse <guofei_ok@126.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR adds support for an unlimited number of MATCHED and NOT MATCHED clauses in the MERGE INTO statement.
### Why are the changes needed?
Currently the MERGE INTO syntax is:
```
MERGE INTO [db_name.]target_table [AS target_alias]
USING [db_name.]source_table [<time_travel_version>] [AS source_alias]
ON <merge_condition>
[ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
[ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
[ WHEN NOT MATCHED [ AND <condition> ] THEN <not_matched_action> ]
```
It would be nice to support an unlimited number of MATCHED and NOT MATCHED clauses in the MERGE INTO statement, because users may want to handle different "AND <condition>"s, the result of which acts like a series of "CASE WHEN"s. The expected syntax looks like:
```
MERGE INTO [db_name.]target_table [AS target_alias]
USING [db_name.]source_table [<time_travel_version>] [AS source_alias]
ON <merge_condition>
[when_matched_clause [, ...]]
[when_not_matched_clause [, ...]]
```
where when_matched_clause is
```
WHEN MATCHED [ AND <condition> ] THEN <matched_action>
```
and when_not_matched_clause is
```
WHEN NOT MATCHED [ AND <condition> ] THEN <not_matched_action>
```
matched_action can be one of
```
DELETE
UPDATE SET * or
UPDATE SET col1 = value1 [, col2 = value2, ...]
```
and not_matched_action can be one of
```
INSERT *
INSERT (col1 [, col2, ...]) VALUES (value1 [, value2, ...])
```
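For illustration, a hedged end-to-end sketch of the extended syntax with multiple clauses (catalog, table, and column names are hypothetical, and the target must be a table that supports MERGE):
```scala
spark.sql("""
  MERGE INTO testcat.ns.target t
  USING testcat.ns.source s
  ON t.id = s.id
  WHEN MATCHED AND s.op = 'delete' THEN DELETE
  WHEN MATCHED AND s.op = 'update' THEN UPDATE SET t.value = s.value
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED AND s.value > 0 THEN INSERT (id, value) VALUES (s.id, s.value)
  WHEN NOT MATCHED THEN INSERT *
""")
```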
### Does this PR introduce _any_ user-facing change?
Yes. The SQL command changes, but it is backward compatible.
### How was this patch tested?
New tests added.
Closes#28875 from xianyinxin/SPARK-32030.
Authored-by: xy_xin <xianyin.xxy@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This patch proposes to coalesce partitions for repartition by expressions without specifying number of partitions, when AQE is enabled.
### Why are the changes needed?
When repartitioning by partition expressions, users can either specify the number of partitions or omit it. If the number of partitions is specified, we should not coalesce partitions because that breaks user expectation. But if the number of partitions is not specified, AQE should be able to coalesce partitions as it does for other shuffles.
### Does this PR introduce _any_ user-facing change?
Yes. After this change, if users don't specify the number of partitions when repartitioning data by expressions, AQE will coalesce partitions.
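A minimal sketch of the distinction, assuming AQE is enabled:
```scala
import org.apache.spark.sql.functions.col

spark.conf.set("spark.sql.adaptive.enabled", "true")

// No partition number given: AQE may coalesce the shuffle partitions.
val coalesced = spark.range(1000).repartition(col("id"))

// Explicit partition number: respected as-is, no coalescing.
val fixed = spark.range(1000).repartition(10, col("id"))
```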
### How was this patch tested?
Added unit test.
Closes#28900 from viirya/SPARK-32056.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR fixes `UserDefinedType.equal()` by comparing the UDT class instead of checking `acceptsType()`.
### Why are the changes needed?
It's weird that an equality comparison between two UDT types can give different results when switching the order:
```scala
// ExampleSubTypeUDT.userClass is a subclass of ExampleBaseTypeUDT.userClass
val udt1 = new ExampleBaseTypeUDT
val udt2 = new ExampleSubTypeUDT
println(udt1 == udt2) // true
println(udt2 == udt1) // false
```
### Does this PR introduce _any_ user-facing change?
Yes.
Before:
```scala
// ExampleSubTypeUDT.userClass is a subclass of ExampleBaseTypeUDT.userClass
val udt1 = new ExampleBaseTypeUDT
val udt2 = new ExampleSubTypeUDT
println(udt1 == udt2) // true
println(udt2 == udt1) // false
```
After:
```scala
// ExampleSubTypeUDT.userClass is a subclass of ExampleBaseTypeUDT.userClass
val udt1 = new ExampleBaseTypeUDT
val udt2 = new ExampleSubTypeUDT
println(udt1 == udt2) // false
println(udt2 == udt1) // false
```
### How was this patch tested?
Added a unit test.
Closes#28923 from Ngone51/fix-udt-equal.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Bug fix for overflow case in `UTF8String.substringSQL`.
### Why are the changes needed?
The SQL query `SELECT SUBSTRING("abc", -1207959552, -1207959552)` incorrectly returns `"abc"` instead of the expected output `""`. For the query `SUBSTRING("abc", -100, -100)`, we get the right output of `""`.
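For reference, the reported case (the expected result after the fix is an empty string):
```scala
spark.sql("""SELECT SUBSTRING("abc", -1207959552, -1207959552)""").show()
// before the fix: "abc"; after the fix: "" (same as SUBSTRING("abc", -100, -100))
```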
### Does this PR introduce _any_ user-facing change?
Yes, bug fix for the overflow case.
### How was this patch tested?
New UT.
Closes#28937 from xuanyuanking/SPARK-32115.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This pull request fixes a bug present in the csv type inference.
We have problems when we have different types in the same column.
**Previously:**
```
$ cat /example/f1.csv
col1
43200000
true
spark.read.csv(path="file:///example/*.csv", header=True, inferSchema=True).show()
+----+
|col1|
+----+
|null|
|true|
+----+
root
|-- col1: boolean (nullable = true)
```
**Now**
```
spark.read.csv(path="file:///example/*.csv", header=True, inferSchema=True).show()
+-------------+
|col1 |
+-------------+
|43200000 |
|true |
+-------------+
root
|-- col1: string (nullable = true)
```
Previously, the type inference hierarchy was the following:
> IntegerType
> > LongType
> > > DecimalType
> > > > DoubleType
> > > > > TimestampType
> > > > > > BooleanType
> > > > > > > StringType
So when, for example, we have integers in a column and the last element is a boolean, the whole column is incorrectly inferred as a boolean column, and all the numbers are shown as null when you look at the data.
We need the following hierarchy instead. When a column contains different numeric types, it is resolved correctly; when it contains other mixed types, it is resolved as a String column.
> IntegerType
> > LongType
> > > DecimalType
> > > > DoubleType
> > > > > StringType
> TimestampType
> > StringType
> BooleanType
> > StringType
> StringType
### Why are the changes needed?
Fix the bug explained
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test and manual tests
Closes#28896 from planga82/feature/SPARK-32025_csv_inference.
Authored-by: Pablo Langa <soypab@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR tries to address the comment: https://github.com/apache/spark/pull/28645#discussion_r442183888
It changes `canUpCast/canCast` to allow cast from sub UDT to base UDT, in order to achieve the goal to allow UserDefinedType to use `ExpressionEncoder` to deserialize rows in ScalaUDF as well.
One thing worth mentioning: even though we allow casting from a sub UDT to a base UDT, it doesn't really do the cast in `Cast`, because for now the sub UDT and base UDT are considered the same type (because of #16660), see:
5264164a67/sql/catalyst/src/main/scala/org/apache/spark/sql/types/UserDefinedType.scala (L81-L86)
5264164a67/sql/catalyst/src/main/scala/org/apache/spark/sql/types/UserDefinedType.scala (L92-L95)
Therefore, the optimize rule `SimplifyCast` will eliminate the cast at the end.
### Why are the changes needed?
Reduce the special case caused by `UserDefinedType` in `ResolveEncodersInUDF` and `ScalaUDF`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
It should be covered by the test of `SPARK-19311`, which is also updated a little in this PR.
Closes#28920 from Ngone51/fix-udf-udt.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This change simplifies the ArrowColumnVector ListArray accessor to use provided Arrow APIs available in v0.15.0 to calculate element indices.
### Why are the changes needed?
This simplifies the code by avoiding manual calculations on the Arrow offset buffer and makes use of more stable APIs.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests
Closes#28915 from BryanCutler/arrow-simplify-ArrowColumnVector-ListArray-SPARK-32080.
Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Replace the Decimal operation with an integer operation in the `MakeInterval` & `MakeTimestamp` expressions. For instance, `(secs * Decimal(MICROS_PER_SECOND)).toLong` can be replaced by the unscaled long because the former already contains microseconds.
### Why are the changes needed?
To improve performance.
Before:
```
make_timestamp(): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
...
make_timestamp(2019, 1, 2, 3, 4, 50.123456) 94 99 4 10.7 93.8 38.8X
```
After:
```
make_timestamp(2019, 1, 2, 3, 4, 50.123456) 76 92 15 13.1 76.5 48.1X
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- By existing test suites `IntervalExpressionsSuite`, `DateExpressionsSuite` and etc.
- Re-generate results of `MakeDateTimeBenchmark` in the environment:
| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK 64-Bit Server VM 1.8.0_252 and OpenJDK 64-Bit Server VM 11.0.7+10 |
Closes#28886 from MaxGekk/make_interval-opt-decimal.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Extract common code from the expressions that get date or time fields from input dates/timestamps to new expressions `GetDateField` and `GetTimeField`, and re-use the common traits from the affected classes.
### Why are the changes needed?
Code deduplication improves maintainability.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By `DateExpressionsSuite`
Closes#28894 from MaxGekk/get-date-time-field-expr.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Call the `replace()` method from `UTF8String` to remove the `GMT` string from the input of `DateTimeUtils.cleanLegacyTimestampStr`. It removes all `GMT` substrings.
### Why are the changes needed?
Simpler code improves maintainability
### Does this PR introduce _any_ user-facing change?
Should not
### How was this patch tested?
By existing test suites `JsonSuite` and `UnivocityParserSuite`.
Closes#28892 from MaxGekk/simplify-cleanLegacyTimestampStr.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR changes the references of the `PartialMerge`/`Final` `AggregateExpression` from `aggBufferAttributes` to `inputAggBufferAttributes`.
After this change, the tests of `SPARK-31620` can fail on the assertion of `QueryTest.assertEmptyMissingInput`. So, this PR also fixes it by overriding the `inputAggBufferAttributes` of the Aggregate operators.
### Why are the changes needed?
With my understanding of the Aggregate framework, especially according to the logic of `AggUtils.planAggXXX`, I think the right references for the `PartialMerge`/`Final` `AggregateExpression` should be `inputAggBufferAttributes`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Before this patch, for an Aggregate operator, its input attributes were always equal to or more than its reference attributes (because it referred to its own attributes while it should refer to the attributes from the child). Therefore, its missing inputs were always empty and broke nothing, so it was impossible to add a UT for this patch.
However, after correcting the right references in this PR, the problem is then exposed by `QueryTest.assertEmptyMissingInput` in the UT of SPARK-31620, since missing inputs are no longer always empty. This PR can fix the problem.
Closes#28869 from Ngone51/fix-agg-reference.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
A minor PR that adds a couple of usage examples for ArrayFilter and ArrayExists that show how to deal with NULL data.
### Why are the changes needed?
Enhances the examples to show how to filter out null values from an array and also to test whether a null value exists in an array.
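A hedged sketch of the kind of examples added, using the SQL higher-order functions backed by ArrayFilter and ArrayExists:
```scala
// Drop NULL elements from an array.
spark.sql("SELECT filter(array(1, NULL, 3), x -> x IS NOT NULL)").show(false)

// Check whether an array contains a NULL element.
spark.sql("SELECT exists(array(1, NULL, 3), x -> x IS NULL)").show(false)
```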
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Tested manually.
Closes#28890 from dilipbiswal/array_func_description.
Authored-by: Dilip Biswal <dkbiswal@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add a new config `spark.sql.files.minPartitionNum` to control the number of file split partitions at the session level.
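A minimal usage sketch of the new config (the value is arbitrary):
```scala
// Session-level lower bound on the number of file split partitions.
spark.conf.set("spark.sql.files.minPartitionNum", "16")
```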
### Why are the changes needed?
Aims to control file split partitions at the session level.
For more details, see the discussion in [PR-28778](https://github.com/apache/spark/pull/28778).
### Does this PR introduce _any_ user-facing change?
Yes, new config.
### How was this patch tested?
Add UT.
Closes#28853 from ulysses-you/SPARK-32019.
Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Change the precision of seconds and its fraction from 8 to 18 to be able to construct intervals up to the maximum allowed microseconds value (long).
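A hedged example of the larger seconds values that can now be passed (six fractional digits for microseconds):
```scala
// The seconds argument of make_interval now accepts up to 18 digits of precision.
spark.sql("SELECT make_interval(0, 0, 0, 0, 0, 0, 123456789.123456)").show(false)
```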
### Why are the changes needed?
To improve UX of Spark SQL.
### Does this PR introduce _any_ user-facing change?
Yes
### How was this patch tested?
- Add tests to IntervalExpressionsSuite
- Add an example to the `MakeInterval` expression
- Add tests to `interval.sql`
Closes#28873 from MaxGekk/make_interval-sec-precision.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
1. Treat an equal start and end the same as the "greater than" case in the loop condition of `org.apache.spark.sql.catalyst.expressions.Sequence.TemporalSequenceImpl#eval`
2. Unit test for interval `day`, `month`, `year`
### Why are the changes needed?
A bug occurs when the sequence input has equal start and end dates, which makes the `while` loop run forever.
### Does this PR introduce _any_ user-facing change?
Yes.
Before this PR, people would get a `java.lang.ArrayIndexOutOfBoundsException` when evaluating the following:
`sql("select sequence(cast('2011-03-01' as date), cast('2011-03-01' as date), interval 1 year)").show(false)`
### How was this patch tested?
Unit test.
Closes#28819 from TJX2014/master-SPARK-31980.
Authored-by: TJX2014 <xiaoxingstack@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
When two bucketed tables with different numbers of buckets are joined, a full shuffle can be introduced:
```
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "0")
val df1 = (0 until 20).map(i => (i % 5, i % 13, i.toString)).toDF("i", "j", "k")
val df2 = (0 until 20).map(i => (i % 7, i % 11, i.toString)).toDF("i", "j", "k")
df1.write.format("parquet").bucketBy(8, "i").saveAsTable("t1")
df2.write.format("parquet").bucketBy(4, "i").saveAsTable("t2")
val t1 = spark.table("t1")
val t2 = spark.table("t2")
val joined = t1.join(t2, t1("i") === t2("i"))
joined.explain
== Physical Plan ==
*(5) SortMergeJoin [i#44], [i#50], Inner
:- *(2) Sort [i#44 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(i#44, 200), true, [id=#105]
: +- *(1) Project [i#44, j#45, k#46]
: +- *(1) Filter isnotnull(i#44)
: +- *(1) ColumnarToRow
: +- FileScan parquet default.t1[i#44,j#45,k#46] Batched: true, DataFilters: [isnotnull(i#44)], Format: Parquet, Location: InMemoryFileIndex[...], PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct<i:int,j:int,k:string>, SelectedBucketsCount: 8 out of 8
+- *(4) Sort [i#50 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(i#50, 200), true, [id=#115]
+- *(3) Project [i#50, j#51, k#52]
+- *(3) Filter isnotnull(i#50)
+- *(3) ColumnarToRow
+- FileScan parquet default.t2[i#50,j#51,k#52] Batched: true, DataFilters: [isnotnull(i#50)], Format: Parquet, Location: InMemoryFileIndex[...], PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct<i:int,j:int,k:string>, SelectedBucketsCount: 4 out of 4
```
This PR proposes to introduce coalescing buckets when the following conditions are met to eliminate the full shuffle:
- Join is the sort merge one (which is created only for equi-join).
- Join keys match with output partition expressions on their respective sides.
- The larger bucket number is divisible by the smaller bucket number.
- `spark.sql.bucketing.coalesceBucketsInSortMergeJoin.enabled` is set to `true`.
- The ratio of the number of buckets should be less than the value set in `spark.sql.bucketing.coalesceBucketsInSortMergeJoin.maxBucketRatio`.
### Why are the changes needed?
Eliminating the full shuffle can benefit scenarios where two large tables are joined, especially when the tables are already bucketed but differ in the number of buckets.
### Does this PR introduce any user-facing change?
If the bucket coalescing conditions explained above are met, a full shuffle can be eliminated (also note that you will see `SelectedBucketsCount: 8 out of 8 (Coalesced to 4)` in the physical plan):
```
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "0")
spark.conf.set("spark.sql.bucketing.coalesceBucketsInSortMergeJoin.enabled", "true")
val df1 = (0 until 20).map(i => (i % 5, i % 13, i.toString)).toDF("i", "j", "k")
val df2 = (0 until 20).map(i => (i % 7, i % 11, i.toString)).toDF("i", "j", "k")
df1.write.format("parquet").bucketBy(8, "i").saveAsTable("t1")
df2.write.format("parquet").bucketBy(4, "i").saveAsTable("t2")
val t1 = spark.table("t1")
val t2 = spark.table("t2")
val joined = t1.join(t2, t1("i") === t2("i"))
joined.explain
== Physical Plan ==
*(3) SortMergeJoin [i#44], [i#50], Inner
:- *(1) Sort [i#44 ASC NULLS FIRST], false, 0
: +- *(1) Project [i#44, j#45, k#46]
: +- *(1) Filter isnotnull(i#44)
: +- *(1) ColumnarToRow
: +- FileScan parquet default.t1[i#44,j#45,k#46] Batched: true, DataFilters: [isnotnull(i#44)], Format: Parquet, Location: InMemoryFileIndex[...], PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct<i:int,j:int,k:string>, SelectedBucketsCount: 8 out of 8 (Coalesced to 4)
+- *(2) Sort [i#50 ASC NULLS FIRST], false, 0
+- *(2) Project [i#50, j#51, k#52]
+- *(2) Filter isnotnull(i#50)
+- *(2) ColumnarToRow
+- FileScan parquet default.t2[i#50,j#51,k#52] Batched: true, DataFilters: [isnotnull(i#50)], Format: Parquet, Location: InMemoryFileIndex[...], PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct<i:int,j:int,k:string>, SelectedBucketsCount: 4 out of 4
```
### How was this patch tested?
Added unit tests
Closes#28123 from imback82/coalescing_bucket.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This PR adds support for typed Scala UDF to accept composed types of case classes, e.g. Seq[T], Array[T], Map[Int, T] (assuming T is a case class type), as input parameter types.
### Why are the changes needed?
After #27937, typed Scala UDF supports a case class as its input parameter type. However, it cannot accept composed types of case classes, such as Seq[T], Array[T], Map[Int, T] (assuming T is a case class type), which causes confusion (e.g. https://github.com/apache/spark/pull/27937#discussion_r422699979) for users.
### Does this PR introduce _any_ user-facing change?
Yes.
Run the query:
```
scala> case class Person(name: String, age: Int)
scala> Seq((1, Seq(Person("Jack", 5)))).toDF("id", "persons").withColumn("ages", udf{ s: Seq[Person] => s.head.age }.apply(col("persons"))).show
```
Before:
```
org.apache.spark.SparkException: Failed to execute user defined function($read$$Lambda$2861/628175152: (array<struct<name:string,age:int>>) => int)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1129)
at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:156)
at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(InterpretedMutableProjection.scala:83)
at org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$17.$anonfun$applyOrElse$69(Optimizer.scala:1492)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
....
Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to Person
at $anonfun$res3$1(<console>:30)
at $anonfun$res3$1$adapted(<console>:30)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.$anonfun$f$2(ScalaUDF.scala:156)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1126)
... 142 more
```
After:
```
+---+-----------+----+
| id| persons|ages|
+---+-----------+----+
| 1|[[Jack, 5]]| [5]|
+---+-----------+----+
```
### How was this patch tested?
Added tests.
Closes#28645 from Ngone51/impr-udf.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Introduce UnsafeRow format validation for streaming state store.
### Why are the changes needed?
Currently, Structured Streaming directly puts the UnsafeRow into the StateStore without any schema validation. This is dangerous behavior when users reuse the checkpoint file during migration. Any change or bug fix related to the aggregate function may cause random exceptions, or even wrong answers, e.g. SPARK-28067.
### Does this PR introduce _any_ user-facing change?
Yes. If underlying changes are detected when the checkpoint is reused during migration, an `InvalidUnsafeRowException` will be thrown.
### How was this patch tested?
UT added. Will also add integrated tests for more scenario in another PR separately.
Closes#28707 from xuanyuanking/SPARK-31894.
Lead-authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Co-authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
1. Move TimeZoneUTC and TimeZoneGMT to DateTimeTestUtils
2. Remove TimeZoneGMT
3. Use ZoneId.systemDefault() instead of defaultTimeZone().toZoneId
4. Alias SQLDate & SQLTimestamp to internal types of DateType and TimestampType
5. Avoid one `*` in `DateTimeUtils.fromJulianDay()`
6. Use toTotalMonths in `DateTimeUtils`.`subtractDates()`
7. Remove `julianCommonEraStart`, `timestampToString()`, `microsToEpochDays()`, `epochDaysToMicros()`, `instantToDays()` from `DateTimeUtils`.
8. Make splitDate() private.
9. Remove `def daysToMicros(days: Int): Long` and `def microsToDays(micros: Long): Int`.
### Why are the changes needed?
This simplifies the common code related to date-time operations, and should improve maintainability. In particular:
1. TimeZoneUTC and TimeZoneGMT are moved to DateTimeTestUtils because they are used only in tests
2. TimeZoneGMT can be removed because it is equal to TimeZoneUTC
3. After the PR #27494, Spark expressions and DateTimeUtils functions switched to ZoneId instead of TimeZone completely. `defaultTimeZone()` with `TimeZone` as return type is not needed anymore.
4. SQLDate and SQLTimestamp types can be explicitly aliased to internal types of DateType and TimestampType instead of declaring this in a comment.
5. Avoid one `*` in `DateTimeUtils.fromJulianDay()`.
6. Use toTotalMonths in `DateTimeUtils`.`subtractDates()`.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By existing test suites
Closes#27617 from MaxGekk/move-time-zone-consts.
Lead-authored-by: Max Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Better error message when SPARK_HOME or spark.test.home is not set.
### Why are the changes needed?
Currently the error message is not easily consumable, as it prints the real error (see below) only after printing the current environment, which is rather long.
**Old output**
`
time.name" -> "Java(TM) SE Runtime Environment", "sun.boot.library.path" -> "/Library/Java/JavaVirtualMachines/jdk1.8.0_221.jdk/Contents/Home/jre/lib",
"java.vm.version" -> "25.221-b11",
. . .
. . .
. . .
) did not contain key "SPARK_HOME" spark.test.home or SPARK_HOME is not set.
at org.scalatest.Assertions.newAssertionFailedExceptio
`
**New output**
`
An exception or error caused a run to abort: spark.test.home or SPARK_HOME is not set.
org.scalatest.exceptions.TestFailedException: spark.test.home or SPARK_HOME is not set
`
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Ran the tests in IntelliJ manually to see the new error.
Closes#28825 from dilipbiswal/minor-spark-31950-followup.
Authored-by: Dilip Biswal <dkbiswal@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
1. Change the max step from 1 week to 30 minutes in the tests `RebaseDateTimeSuite`.`generate 'gregorian-julian-rebase-micros.json'` and `generate 'julian-gregorian-rebase-micros.json'`.
2. Parallelise JSON files generation in the function `generateRebaseJson` by using `ThreadUtils.parmap`.
### Why are the changes needed?
1. To prevent the bugs that are fixed by https://github.com/apache/spark/pull/28787 and https://github.com/apache/spark/pull/28816.
2. The parallelisation speeds up JSON file generation.
### Does this PR introduce _any_ user-facing change?
Yes
### How was this patch tested?
By generating the JSON file `julian-gregorian-rebase-micros.json`.
Closes#28827 from MaxGekk/rebase-30-min.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
To avoid the unnecessary CNF computation for full-outer joins, this PR fixes code for filtering out full-outer joins at the entrance of the rule.
### Why are the changes needed?
To mitigate optimizer overhead.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#28810 from maropu/SPARK-31705.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
### What changes were proposed in this pull request?
1. Pass the time zone given as the first parameter of `RebaseDateTime.rebaseJulianToGregorianMicros()` and `rebaseGregorianToJulianMicros()` to the Java 7 `GregorianCalendar`:
```scala
val cal = new Calendar.Builder()
// `gregory` is a hybrid calendar that supports both the Julian and Gregorian calendar systems
.setCalendarType("gregory")
...
.setTimeZone(tz)
.build()
```
This makes the instance of the calendar independent from the default JVM time zone.
2. Change the type of the first parameter from `ZoneId` to `TimeZone`. This avoids an unnecessary conversion from `TimeZone` to `ZoneId`, for example in
```scala
def rebaseJulianToGregorianMicros(micros: Long): Long = {
...
if (rebaseRecord == null || micros < rebaseRecord.switches(0)) {
rebaseJulianToGregorianMicros(timeZone.toZoneId, micros)
```
and back to `TimeZone` inside of `rebaseJulianToGregorianMicros(zoneId: ZoneId, ...)`
3. Modify tests in `RebaseDateTimeSuite`, and set the default JVM time zone only for functions that depend on it.
### Why are the changes needed?
1. Ignoring the passed parameter and using a global variable is bad practice.
2. Depending on global state doesn't allow running the functions in parallel; there is a non-zero probability that the functions return a wrong result if the default JVM time zone is changed during their execution.
3. This opens an opportunity for parallelising the generation of the JSON files `gregorian-julian-rebase-micros.json` and `julian-gregorian-rebase-micros.json`. Currently, the tests `generate 'gregorian-julian-rebase-micros.json'` and `generate 'julian-gregorian-rebase-micros.json'` generate the JSON files by iterating over all time zones sequentially with a step of 1 week. Due to the large step, we can miss some spikes in diffs between the 2 calendars (Java 8 Gregorian and Java 7 hybrid calendars), as PR https://github.com/apache/spark/pull/28787 fixed and https://github.com/apache/spark/pull/28816 should fix.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running existing tests from `RebaseDateTimeSuite`.
Closes#28824 from MaxGekk/pure-micros-rebasing.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
We fail casting from numeric to timestamp by default.
## Why are the changes needed?
Casting from numeric to timestamp is non-standard; moreover, it may generate different results between Spark and other systems, for example Hive.
## Does this PR introduce any user-facing change?
Yes. Users cannot cast numeric to timestamp directly; they have to use the following functions to achieve the same effect: TIMESTAMP_SECONDS/TIMESTAMP_MILLIS/TIMESTAMP_MICROS.
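A hedged sketch of the suggested replacements named above:
```scala
// Instead of CAST(<numeric> AS TIMESTAMP), use the explicit conversion functions.
spark.sql("SELECT TIMESTAMP_SECONDS(1230219000)").show(false)
spark.sql("SELECT TIMESTAMP_MILLIS(1230219000123)").show(false)
spark.sql("SELECT TIMESTAMP_MICROS(1230219000123456)").show(false)
```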
## How was this patch tested?
unit test added
Closes#28593 from GuoPhilipse/31710-fix-compatibility.
Lead-authored-by: GuoPhilipse <guofei_ok@126.com>
Co-authored-by: GuoPhilipse <46367746+GuoPhilipse@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
It fixes microseconds rebasing from the hybrid calendar (Julian + Gregorian) to Proleptic Gregorian calendar in the function `RebaseDateTime`.`rebaseJulianToGregorianMicros(zoneId: ZoneId, micros: Long): Long` in the case of local timestamp overlapping.
In the case of overlapping, we look 1 day ahead to determine which instant we should take - the earlier or later zoned timestamp. If our current standard zone and DST offsets are equal to the zone offset of the next date, we choose the later timestamp, otherwise the earlier one. For example, the local timestamp **1945-11-18 01:30:00.0** can be mapped to two instants (microseconds since 1970-01-01 00:00:00Z): -761211000000000 or -761207400000000. If the first one is passed to `rebaseJulianToGregorianMicros()`, we take the earlier instant in the Proleptic Gregorian calendar while rebasing, **1945-11-18T01:30+09:00[Asia/Hong_Kong]**, otherwise the later one, **1945-11-18T01:30+08:00[Asia/Hong_Kong]**.
Note: The fix assumes that only one transition of standard or DST offsets can occur during a day.
### Why are the changes needed?
The current implementation of `rebaseJulianToGregorianMicros()` handles timestamp overlapping only during daylight saving time, but overlapping can also happen during a transition from one standard time zone to another. For example, in the case of `Asia/Hong_Kong`, the time zone switched from `Japan Standard Time` (UTC+9) to `Hong Kong Time` (UTC+8) on _Sunday, 18 November, 1945 01:59:59 AM_. The changes allow handling this special case as well.
### Does this PR introduce _any_ user-facing change?
There is no behaviour change for timestamps of CE after 0001-01-01. The PR might affect timestamps of BCE for which the modified `rebaseJulianToGregorianMicros()` is called directly.
### How was this patch tested?
1. By existing tests in `DateTimeUtilsSuite`, `RebaseDateTimeSuite`, `DateFunctionsSuite`, `DateExpressionsSuite` and `TimestampFormatterSuite`.
2. Added new checks to `RebaseDateTimeSuite`.`SPARK-31959: JST -> HKT at Asia/Hong_Kong in 1945`:
```scala
assert(rebaseJulianToGregorianMicros(hkZid, rebasedEarlierMicros) === earlierMicros)
assert(rebaseJulianToGregorianMicros(hkZid, rebasedLaterMicros) === laterMicros)
```
3. Regenerated `julian-gregorian-rebase-micros.json` with the step of 30 minutes, and got the same JSON file. The JSON file isn't affected because previously it was generated with the step of 1 week. And the spike in diffs/switch points during 1 hour of timestamp overlapping wasn't detected.
Closes#28816 from MaxGekk/fix-overlap-julian-2-grep.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR intends to move keywords `ANTI`, `SEMI`, and `MINUS` from reserved to non-reserved.
### Why are the changes needed?
To comply with the ANSI/SQL standard.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added tests.
Closes#28807 from maropu/SPARK-26905-2.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
An old JDK can have an outdated time zone database in which `Asia/Hong_Kong` doesn't have timestamp overlapping in 1946 at all. This PR changes the test "SPARK-31959: JST -> HKT at Asia/Hong_Kong in 1945" in `RebaseDateTimeSuite` and makes it tolerant to that case.
### Why are the changes needed?
To fix the test failures on old JDK w/ outdated tzdb like on Jenkins machine `research-jenkins-worker-09`.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running the test on old JDK
Closes#28832 from MaxGekk/HongKong-tz-1945-followup.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
## What changes were proposed in this pull request?
In the NestedColumnAliasing rule, we create aliases for nested field accesses in the project list. We already consider the case where a top-level parent field and nested fields under it are both accessed; in that case, we don't create the aliases because they are redundant.
There is another case, where a nested parent field and nested fields under it are both accessed, which we don't consider yet. We don't need to create aliases in this case either.
## How was this patch tested?
Added test.
Closes#24525 from viirya/SPARK-27633.
Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This proposes a minor refactoring to match `NestedColumnAliasing` to `GeneratorNestedColumnAliasing` so it returns the pruned plan directly.
```scala
case p @ NestedColumnAliasing(nestedFieldToAlias, attrToAliases) =>
NestedColumnAliasing.replaceToAliases(p, nestedFieldToAlias, attrToAliases)
```
vs
```scala
case GeneratorNestedColumnAliasing(p) => p
```
### Why are the changes needed?
Just for readability.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests should cover.
Closes#28812 from HyukjinKwon/SPARK-31977.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This PR intends to extract SQL reserved/non-reserved keywords from the ANTLR grammar file (`SqlBase.g4`) directly.
This approach is based on the cloud-fan suggestion: https://github.com/apache/spark/pull/28779#issuecomment-642033217
### Why are the changes needed?
It is hard to maintain a full set of the keywords in `TableIdentifierParserSuite`, so it would be nice if we could extract them from the `SqlBase.g4` file directly.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#28802 from maropu/SPARK-31950-2.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Currently we only push nested column pruning through a few operators such as LIMIT, SAMPLE, etc. This patch extends the feature to other operators including RepartitionByExpression, Join.
### Why are the changes needed?
Currently nested column pruning is only applied on a few operators, which limits its benefit. Extending nested column pruning coverage makes this feature more generally applicable across different queries.
### Does this PR introduce _any_ user-facing change?
Yes. More SQL operators are covered by nested column pruning.
### How was this patch tested?
Added unit test, end-to-end tests.
Closes#28556 from viirya/others-column-pruning.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Fix the bug in microseconds rebasing during transitions from one standard time zone offset to another one. In the PR, I propose to change the implementation of `rebaseGregorianToJulianMicros` which performs rebasing via local timestamps. In the case of overlapping:
1. Check that the original instant belongs to earlier or later instant of overlapped local timestamp.
2. If it is the earlier instant, take the zone and DST offsets from the previous day; otherwise,
3. set the time zone offsets of the Julian timestamp from the next day.
Note: The fix assumes that transitions cannot happen more often than once per 2 days.
### Why are the changes needed?
The current implementation handles timestamp overlapping only during daylight saving time, but overlapping can also happen during a transition from one standard time zone to another. For example, in the case of `Asia/Hong_Kong`, the time zone switched from `Japan Standard Time` (UTC+9) to `Hong Kong Time` (UTC+8) on _Sunday, 18 November, 1945 01:59:59 AM_. The changes allow handling this special case as well.
### Does this PR introduce _any_ user-facing change?
It might affect micros rebasing before the common era when the non-optimised version of `rebaseGregorianToJulianMicros()` is used directly.
### How was this patch tested?
1. By existing tests in `DateTimeUtilsSuite`, `RebaseDateTimeSuite`, `DateFunctionsSuite`, `DateExpressionsSuite` and `TimestampFormatterSuite`.
2. Added new test to `RebaseDateTimeSuite`
3. Regenerated `gregorian-julian-rebase-micros.json` with the step of 30 minutes, and got the same JSON file. The JSON file isn't affected because previously it was generated with the step of 1 week. And the spike in diffs/switch points during 1 hour of timestamp overlapping wasn't detected.
Closes#28787 from MaxGekk/HongKong-tz-1945.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR normalizes all binary comparison expressions when comparing plans.
### Why are the changes needed?
Improve the test framework; otherwise this test will fail:
```scala
test("SPARK-31912 Normalize all binary comparison expressions") {
val original = testRelation
.where('a === 'b && Literal(13) >= 'b).as("x")
val optimized = testRelation
.where(IsNotNull('a) && IsNotNull('b) && 'a === 'b && 'b <= 13 && 'a <= 13).as("x")
comparePlans(Optimize.execute(original.analyze), optimized.analyze)
}
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual test.
Closes#28734 from wangyum/SPARK-31912.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
### What changes were proposed in this pull request?
A minor fix to the append method of `StringConcat` to cap the length at `MAX_ROUNDED_ARRAY_LENGTH`, making sure it does not overflow and cause a `StringIndexOutOfBoundsException`.
Thanks to **Jeffrey Stokes** for reporting the issue and explaining the underlying problem in detail in the JIRA.
### Why are the changes needed?
This fixes StringIndexOutOfBoundsException on an overflow.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added a test in StringsUtilSuite.
Closes#28750 from dilipbiswal/SPARK-31916.
Authored-by: Dilip Biswal <dkbiswal@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This PR adds a new rule to support pushing predicates through a join by rewriting the join condition to CNF (conjunctive normal form). The following example shows the steps of this rule:
1. Prepare Table:
```sql
CREATE TABLE x(a INT);
CREATE TABLE y(b INT);
...
SELECT * FROM x JOIN y ON ((a < 0 and a > b) or a > 10);
```
2. Convert the join condition to CNF:
```
(a < 0 or a > 10) and (a > b or a > 10)
```
3. Split conjunctive predicates:

| Predicates |
| ---- |
| (a < 0 or a > 10) |
| (a > b or a > 10) |

4. Push predicate:

| Table | Predicate |
| ---- | ---- |
| x | (a < 0 or a > 10) |
### Why are the changes needed?
Improve query performance. PostgreSQL, [Impala](https://issues.apache.org/jira/browse/IMPALA-9183) and Hive support this feature.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test and benchmark test.
SQL | Before this PR | After this PR
--- | --- | ---
TPCDS 5T Q13 | 84s | 21s
TPCDS 5T q85 | 66s | 34s
TPCH 1T q19 | 37s | 32s
Closes#28733 from gengliangwang/cnf.
Lead-authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Co-authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
### What changes were proposed in this pull request?
This is a followup of https://github.com/apache/spark/pull/23388 .
https://github.com/apache/spark/pull/23388 has an issue: it doesn't handle subquery expressions and assumes they will be turned into joins. However, this is not true for non-correlated subquery expressions.
This PR fixes this issue. It now doesn't skip `Subquery`, and subquery expressions will be handled by `OptimizeSubqueries`, which runs the optimizer with the subquery.
Note that, correlated subquery expressions will be handled twice: once in `OptimizeSubqueries`, once later when it becomes join. This is OK as `NormalizeFloatingNumbers` is idempotent now.
### Why are the changes needed?
fix a bug
### Does this PR introduce _any_ user-facing change?
yes, see the newly added test.
### How was this patch tested?
new test
Closes#28785 from cloud-fan/normalize.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
If a datetime pattern contains no year field, the day-of-year field should not be ignored if it exists.
e.g.
```
spark-sql> select to_timestamp('31', 'DD');
1970-01-01 00:00:00
spark-sql> select to_timestamp('31 30', 'DD dd');
1970-01-30 00:00:00
spark.sql.legacy.timeParserPolicy legacy
spark-sql> select to_timestamp('31', 'DD');
1970-01-31 00:00:00
spark-sql> select to_timestamp('31 30', 'DD dd');
NULL
```
This PR only fixes some corner cases that use the 'D' pattern to parse datetimes without a 'y' field.
### Why are the changes needed?
fix some corner cases
### Does this PR introduce _any_ user-facing change?
yes, the day of year field will not be ignored
### How was this patch tested?
add unit tests.
Closes#28766 from yaooqinn/SPARK-31939.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Currently we only push nested column pruning from a Project through a few operators such as LIMIT, SAMPLE, etc. There are a few operators like Aggregate, Expand which can prune nested columns by themselves, without a Project on top.
This patch extends the feature to those operators.
### Why are the changes needed?
Currently nested column pruning is only applied in a few cases, which limits its benefit. Extending nested column pruning coverage makes this feature more generally applicable across different queries.
### Does this PR introduce _any_ user-facing change?
Yes. More SQL operators are covered by nested column pruning.
### How was this patch tested?
Added unit test, end-to-end tests.
Closes#28560 from viirya/SPARK-27217-2.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR intends to add `TYPE` in the ANSI non-reserved list because it is not reserved in the standard. See SPARK-26905 for a full set of the reserved/non-reserved keywords of `SQL:2016`.
Note: The current master behaviour is as follows;
```
scala> sql("SET spark.sql.ansi.enabled=false")
scala> sql("create table t1 (type int)")
res4: org.apache.spark.sql.DataFrame = []
scala> sql("SET spark.sql.ansi.enabled=true")
scala> sql("create table t2 (type int)")
org.apache.spark.sql.catalyst.parser.ParseException:
no viable alternative at input 'type'(line 1, pos 17)
== SQL ==
create table t2 (type int)
-----------------^^^
```
### Why are the changes needed?
To follow the ANSI/SQL standard.
### Does this PR introduce _any_ user-facing change?
Allows users to use `TYPE` as an identifier.
### How was this patch tested?
Update the keyword lists in `TableIdentifierParserSuite`.
Closes#28773 from maropu/SPARK-26905.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Currently, `date_format`, `from_unixtime`, `unix_timestamp`, `to_unix_timestamp`, `to_timestamp`, and `to_date` have different exception handling behavior for formatting datetime values.
In this PR, we apply the exception handling behavior of `date_format` to `from_unixtime`, `unix_timestamp`, `to_unix_timestamp`, `to_timestamp` and `to_date`.
In the phase of creating the datetime formatter or formatting, exceptions will be raised.
e.g.
```java
spark-sql> select date_format(make_timestamp(1, 1 ,1,1,1,1), 'yyyyyyyyyyy-MM-aaa');
20/05/28 15:25:38 ERROR SparkSQLDriver: Failed in [select date_format(make_timestamp(1, 1 ,1,1,1,1), 'yyyyyyyyyyy-MM-aaa')]
org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'yyyyyyyyyyy-MM-aaa' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
```
```java
spark-sql> select date_format(make_timestamp(1, 1 ,1,1,1,1), 'yyyyyyyyyyy-MM-AAA');
20/05/28 15:26:10 ERROR SparkSQLDriver: Failed in [select date_format(make_timestamp(1, 1 ,1,1,1,1), 'yyyyyyyyyyy-MM-AAA')]
java.lang.IllegalArgumentException: Illegal pattern character: A
```
```java
spark-sql> select date_format(make_timestamp(1,1,1,1,1,1), 'yyyyyyyyyyy-MM-dd');
20/05/28 15:23:23 ERROR SparkSQLDriver: Failed in [select date_format(make_timestamp(1,1,1,1,1,1), 'yyyyyyyyyyy-MM-dd')]
java.lang.ArrayIndexOutOfBoundsException: 11
at java.time.format.DateTimeFormatterBuilder$NumberPrinterParser.format(DateTimeFormatterBuilder.java:2568)
```
In the phase of parsing, `DateTimeParseException | DateTimeException | ParseException` will be suppressed, but `SparkUpgradeException` will still be raised
e.g.
```java
spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
spark.sql.legacy.timeParserPolicy exception
spark-sql> select to_timestamp("2020-01-27T20:06:11.847-0800", "yyyy-MM-dd'T'HH:mm:ss.SSSz");
20/05/28 15:31:15 ERROR SparkSQLDriver: Failed in [select to_timestamp("2020-01-27T20:06:11.847-0800", "yyyy-MM-dd'T'HH:mm:ss.SSSz")]
org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to parse '2020-01-27T20:06:11.847-0800' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.
```
```java
spark-sql> set spark.sql.legacy.timeParserPolicy=corrected;
spark.sql.legacy.timeParserPolicy corrected
spark-sql> select to_timestamp("2020-01-27T20:06:11.847-0800", "yyyy-MM-dd'T'HH:mm:ss.SSSz");
NULL
spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
spark.sql.legacy.timeParserPolicy legacy
spark-sql> select to_timestamp("2020-01-27T20:06:11.847-0800", "yyyy-MM-dd'T'HH:mm:ss.SSSz");
2020-01-28 12:06:11.847
```
### Why are the changes needed?
Consistency
### Does this PR introduce _any_ user-facing change?
Yes, invalid datetime patterns will fail `from_unixtime`, `unix_timestamp`, `to_unix_timestamp`, `to_timestamp` and `to_date` instead of returning `NULL`.
### How was this patch tested?
add more tests
Closes#28650 from yaooqinn/SPARK-31830.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR makes `repartition`/`DISTRIBUTE BY` obey [initialPartitionNum](af4248b2d6/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (L446-L455)) when adaptive execution is enabled.
### Why are the changes needed?
To make `DISTRIBUTE BY`/`GROUP BY` use the same partition number.
How to reproduce:
```scala
spark.sql("CREATE TABLE spark_31220(id int)")
spark.sql("set spark.sql.adaptive.enabled=true")
spark.sql("set spark.sql.adaptive.coalescePartitions.initialPartitionNum=1000")
```
Before this PR:
```
scala> spark.sql("SELECT id from spark_31220 GROUP BY id").explain
== Physical Plan ==
AdaptiveSparkPlan(isFinalPlan=false)
+- HashAggregate(keys=[id#5], functions=[])
+- Exchange hashpartitioning(id#5, 1000), true, [id=#171]
+- HashAggregate(keys=[id#5], functions=[])
+- FileScan parquet default.spark_31220[id#5]
scala> spark.sql("SELECT id from spark_31220 DISTRIBUTE BY id").explain
== Physical Plan ==
AdaptiveSparkPlan(isFinalPlan=false)
+- Exchange hashpartitioning(id#5, 200), false, [id=#179]
+- FileScan parquet default.spark_31220[id#5]
```
After this PR:
```
scala> spark.sql("SELECT id from spark_31220 GROUP BY id").explain
== Physical Plan ==
AdaptiveSparkPlan(isFinalPlan=false)
+- HashAggregate(keys=[id#5], functions=[])
+- Exchange hashpartitioning(id#5, 1000), true, [id=#171]
+- HashAggregate(keys=[id#5], functions=[])
+- FileScan parquet default.spark_31220[id#5]
scala> spark.sql("SELECT id from spark_31220 DISTRIBUTE BY id").explain
== Physical Plan ==
AdaptiveSparkPlan(isFinalPlan=false)
+- Exchange hashpartitioning(id#5, 1000), false, [id=#179]
+- FileScan parquet default.spark_31220[id#5]
```
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Unit test.
Closes#27986 from wangyum/SPARK-31220.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Update comments for `DateTimeUtils`.`toJavaDate` and `fromJavaDate`, and for the legacy date formatters `LegacySimpleDateFormatter` and `LegacyFastDateFormatter`, regarding the default JVM time zone. The comments say that the default JVM time zone is used intentionally for backward compatibility with Spark 2.4 and earlier versions.
Closes#28709
### Why are the changes needed?
To document current behaviour of related methods in `DateTimeUtils` and the legacy date formatters. For example, correctness of `HiveResult.hiveResultString` and `toHiveString` is directly related to the same time zone used by `toJavaDate` and `LegacyFastDateFormatter`.
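For illustration only (this is a sketch of why the default JVM time zone matters for these conversions, not the Spark implementation; the helper name is made up):
```scala
import java.sql.Date
import java.time.LocalDate

// Sketch: converting days-since-epoch to java.sql.Date goes through the JVM
// default time zone, which is why the comments call this out explicitly for
// backward compatibility with Spark 2.4.
def daysToJavaDate(daysSinceEpoch: Int): Date = {
  val localDate = LocalDate.ofEpochDay(daysSinceEpoch)
  // Date.valueOf interprets the LocalDate in the default JVM time zone.
  Date.valueOf(localDate)
}
```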
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running the Scala style checker `./dev/scalastyle`
Closes#28767 from MaxGekk/doc-legacy-formatters.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### Why are the changes needed?
The qualifier name should contain the catalog name.
### Does this PR introduce _any_ user-facing change?
NO.
### How was this patch tested?
UT.
Closes#28726 from lipzhu/SPARK-31906.
Authored-by: lipzhu <lipzhu@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In this PR, we throw `SparkUpgradeException` when getting `DateTimeException` for datetime formatting in the `EXCEPTION` legacy Time Parser Policy.
### Why are the changes needed?
`DateTimeException` is also declared by `java.time.format.DateTimeFormatter#format`, but in Spark it can barely occur. The only case we have suspected so far is due to a JDK bug; see https://bugs.openjdk.java.net/browse/JDK-8079628.
For the `from_unixtime` function, we currently suppress the `DateTimeException` caused by `DD` and return `NULL`. It is a silent data change that should be avoided on Java 8.
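A hedged illustration (Java 8 only; the epoch value is arbitrary and the exact exception and message may differ):
```scala
// 12345678 seconds after the epoch falls on 1970-05-23 UTC, day-of-year 143.
// Formatting that with the two-letter 'DD' field hits JDK-8079628 on Java 8;
// previously the error was swallowed and NULL was returned, now a
// SparkUpgradeException surfaces the problem instead.
spark.sql("SELECT from_unixtime(12345678, 'DD')").show()
```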
### Does this PR introduce _any_ user-facing change?
Yes, when running on Java 8 and using the `from_unixtime` function with pattern `DD` to format datetimes, if dayofyear >= 100, a `SparkUpgradeException` will alert users instead of silently returning null. For `date_format`, `SparkUpgradeException` takes the place of `DateTimeException`.
### How was this patch tested?
add unit tests.
Closes#28736 from yaooqinn/SPARK-31867-F.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
After all these attempts, https://github.com/apache/spark/pull/28692, https://github.com/apache/spark/pull/28719 and https://github.com/apache/spark/pull/28727,
they all have limitations as mentioned in their discussions.
Maybe the only way is to forbid them all.
### Why are the changes needed?
These week-based fields need a Locale to express their semantics, because the first day of the week varies from country to country.
From the Java doc of WeekFields
```java
/**
 * Gets the first day-of-week.
 * <p>
 * The first day-of-week varies by culture.
 * For example, the US uses Sunday, while France and the ISO-8601 standard use Monday.
 * This method returns the first day using the standard {@code DayOfWeek} enum.
 *
 * @return the first day-of-week, not null
 */
public DayOfWeek getFirstDayOfWeek() {
    return firstDayOfWeek;
}
```
But for the SimpleDateFormat, the day-of-week is not localized
```
u Day number of week (1 = Monday, ..., 7 = Sunday) Number 1
```
Currently, the default locale we use is the US, so the result can move backward by a day, a week, or even a year.
e.g.
For the date `2019-12-29` (a Sunday): in a Sunday-start system (e.g. en-US) it belongs to week-based-year 2020, while in a Monday-start system (en-GB) it belongs to 2019. The week-of-week-based-year (`w`) is affected as well.
```sql
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY', 'locale', 'en-US'));
2020
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY', 'locale', 'en-GB'));
2019
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-US'));
2020-01-01
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-GB'));
2019-52-07
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2020-01-05', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-US'));
2020-02-01
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2020-01-05', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-GB'));
2020-01-07
```
For other countries, please refer to [First Day of the Week in Different Countries](http://chartsbin.com/view/41671)
### Does this PR introduce _any_ user-facing change?
With this change, users can no longer use the week-based fields 'Y', 'w', 'u' or 'W', but they can use 'e' in place of 'u'. At least this turns the problem into an explicit failure rather than a silent data change.
### How was this patch tested?
add unit tests
Closes#28728 from yaooqinn/SPARK-31879-NEW2.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This is a followup of https://github.com/apache/spark/pull/26589, which caches the table relations to speed up the table lookup. However, it brings some side effects: the rule `ResolveRelations` may return exactly the same relations, while before it always returns relations with fresh attribute IDs.
This PR is to eliminate this side effect.
### Why are the changes needed?
There is no bug report yet, but this side effect may impact things like self-join. It's better to restore the 2.4 behavior and always return refresh relations.
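A hypothetical illustration of the kind of self-join that relies on fresh attribute IDs (the table and column names here are made up for the example):
```scala
import org.apache.spark.sql.functions.col

// If both sides of this self-join carried identical attribute IDs, the join
// condition could resolve both references to the same underlying column.
val t = spark.range(10).toDF("id")  // stand-in for a resolved table relation
val joined = t.as("a").join(t.as("b"), col("a.id") === col("b.id"))
joined.explain()
```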
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
N/A
Closes#28717 from cloud-fan/fix.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR sets the hour to 12/0 when the AMPM_OF_DAY field exists but the hour is absent.
### Why are the changes needed?
When the hour is absent but the am-pm marker is present, the resulting time is incorrect for PM values.
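The idea can be sketched as below (an assumed helper for illustration, not the actual parser code): when the parsed result carries an AM/PM field but no hour, default the hour to 0 for AM and 12 for PM.
```scala
import java.time.temporal.{ChronoField, TemporalAccessor}

// Hypothetical helper: derive the hour-of-day when only the AM/PM marker was parsed.
def hourFrom(parsed: TemporalAccessor): Int =
  if (parsed.isSupported(ChronoField.HOUR_OF_DAY)) {
    parsed.get(ChronoField.HOUR_OF_DAY)
  } else if (parsed.isSupported(ChronoField.AMPM_OF_DAY)) {
    // AMPM_OF_DAY is 0 for AM and 1 for PM, so this yields 0 or 12.
    parsed.get(ChronoField.AMPM_OF_DAY) * 12
  } else {
    0
  }
```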
### Does this PR introduce _any_ user-facing change?
Yes, the change is user-facing, but it restores the 2.4 behavior to keep backward compatibility,
e.g.
```sql
spark-sql> select to_timestamp('33:33 PM', 'mm:ss a');
1970-01-01 12:33:33
spark-sql> select to_timestamp('33:33 AM', 'mm:ss a');
1970-01-01 00:33:33
```
otherwise, the results are all `1970-01-01 00:33:33`
### How was this patch tested?
add unit tests
Closes#28713 from yaooqinn/SPARK-31896.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR disables week-based date fields for parsing
closes#28674
### Why are the changes needed?
1. It's an un-fixable behavior change: filling the gap between SimpleDateFormat and DateTimeFormatter while keeping backward compatibility across different JDKs is not feasible. A lot of effort has been made to prove it at https://github.com/apache/spark/pull/28674
2. The existing behavior itself in 2.4 is confusing, e.g.
```sql
spark-sql> select to_timestamp('1', 'w');
1969-12-28 00:00:00
spark-sql> select to_timestamp('1', 'u');
1970-01-05 00:00:00
```
The 'u' here seems not to go to the Monday of the first week in the week-based form, nor to the first day of the year in the non-week-based form, but to the Monday of the second week in the week-based form.
And, e.g.
```sql
spark-sql> select to_timestamp('2020 2020', 'YYYY yyyy');
2020-01-01 00:00:00
spark-sql> select to_timestamp('2020 2020', 'yyyy YYYY');
2019-12-29 00:00:00
spark-sql> select to_timestamp('2020 2020 1', 'YYYY yyyy w');
NULL
spark-sql> select to_timestamp('2020 2020 1', 'yyyy YYYY w');
2019-12-29 00:00:00
```
I think we don't need to introduce all the weird behavior from Java
3. The current test coverage for week-based date fields is almost 0%, which indicates that we've never imagined using it.
4. Avoiding JDK bugs
https://issues.apache.org/jira/browse/SPARK-31880
### Does this PR introduce _any_ user-facing change?
Yes, the 'Y'/'W'/'w'/'u'/'F'/'E' patterns can no longer be used in datetime parsing functions.
### How was this patch tested?
more tests added
Closes#28706 from yaooqinn/SPARK-31892.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR switches the default Locale from `US` to `GB` to change the behavior of the first day of the week from Sunday-started to Monday-started, the same as in v2.4
### Why are the changes needed?
#### cases
```sql
spark-sql> select to_timestamp('2020-1-1', 'YYYY-w-u');
2019-12-29 00:00:00
spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
spark.sql.legacy.timeParserPolicy legacy
spark-sql> select to_timestamp('2020-1-1', 'YYYY-w-u');
2019-12-30 00:00:00
```
#### reasons
These week-based fields need a Locale to express their semantics, because the first day of the week varies from country to country.
From the Java doc of WeekFields
```java
/**
 * Gets the first day-of-week.
 * <p>
 * The first day-of-week varies by culture.
 * For example, the US uses Sunday, while France and the ISO-8601 standard use Monday.
 * This method returns the first day using the standard {@code DayOfWeek} enum.
 *
 * @return the first day-of-week, not null
 */
public DayOfWeek getFirstDayOfWeek() {
    return firstDayOfWeek;
}
```
But for the SimpleDateFormat, the day-of-week is not localized
```
u Day number of week (1 = Monday, ..., 7 = Sunday) Number 1
```
Currently, the default locale we use is the US, so the result moved a day backward.
For other countries, please refer to [First Day of the Week in Different Countries](http://chartsbin.com/view/41671)
With this change, the first-day-of-week calculation is restored for these functions when using the default locale.
### Does this PR introduce _any_ user-facing change?
Yes, but the behavior change restores the old behavior of v2.4
### How was this patch tested?
add unit tests
Closes#28692 from yaooqinn/SPARK-31879.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
We should use dataType.catalogString to unify the data type mismatch message.
Before:
```sql
spark-sql> create table SPARK_31834(a int) using parquet;
spark-sql> insert into SPARK_31834 select '1';
Error in query: Cannot write incompatible data to table '`default`.`spark_31834`':
- Cannot safely cast 'a': StringType to IntegerType;
```
After:
```sql
spark-sql> create table SPARK_31834(a int) using parquet;
spark-sql> insert into SPARK_31834 select '1';
Error in query: Cannot write incompatible data to table '`default`.`spark_31834`':
- Cannot safely cast 'a': string to int;
```
### How was this patch tested?
UT.
Closes#28654 from lipzhu/SPARK-31834.
Authored-by: lipzhu <lipzhu@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
JIRA SPARK-28067: Wrong results are returned for aggregate sum with decimals with whole stage codegen enabled
**Repro:**
WholeStage codegen enabled -> Wrong results
WholeStage codegen disabled -> Throws exception: Decimal precision 39 exceeds max precision 38
**Issues:**
1. Wrong results are returned which is bad
2. Inconsistency between whole stage enabled and disabled.
**Cause:**
Sum does not take care of the possibility of overflow in the intermediate steps, i.e. the updateExpressions and mergeExpressions (a conceptual sketch follows the list below).
This PR makes the following changes:
- Add changes to check if overflow occurs for decimal in aggregate Sum and if there is an overflow, it will return null for the Sum operation when spark.sql.ansi.enabled is false.
- When spark.sql.ansi.enabled is true, then the sum operation will return an exception if an overflow occurs for the decimal operation in Sum.
- This is keeping it consistent with the behavior defined in spark.sql.ansi.enabled property
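A conceptual sketch of the overflow handling (plain Scala with `java.math.BigDecimal`, not the actual Catalyst `Sum` expressions; precision and scale values are illustrative):
```scala
import java.math.BigDecimal

val MaxPrecision = 38  // Spark's maximum decimal precision

// Illustrative only: add a value into the running sum and detect overflow.
def addWithOverflowCheck(acc: BigDecimal, value: BigDecimal, ansiEnabled: Boolean): BigDecimal = {
  val sum = acc.add(value)
  if (sum.precision() > MaxPrecision) {
    if (ansiEnabled) {
      // ANSI mode: surface the overflow as an error.
      throw new ArithmeticException(s"$sum cannot be represented as Decimal($MaxPrecision, 18)")
    } else {
      // Non-ANSI mode: overflow yields null instead of a wrong result.
      null
    }
  } else {
    sum
  }
}
```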
**Before the fix: Scenario 1:** - WRONG RESULTS
```
scala> val df = Seq(
| (BigDecimal("10000000000000000000"), 1),
| (BigDecimal("10000000000000000000"), 1),
| (BigDecimal("10000000000000000000"), 2),
| (BigDecimal("10000000000000000000"), 2),
| (BigDecimal("10000000000000000000"), 2),
| (BigDecimal("10000000000000000000"), 2),
| (BigDecimal("10000000000000000000"), 2),
| (BigDecimal("10000000000000000000"), 2),
| (BigDecimal("10000000000000000000"), 2),
| (BigDecimal("10000000000000000000"), 2),
| (BigDecimal("10000000000000000000"), 2),
| (BigDecimal("10000000000000000000"), 2)).toDF("decNum", "intNum")
df: org.apache.spark.sql.DataFrame = [decNum: decimal(38,18), intNum: int]
scala> val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, "intNum").agg(sum("decNum"))
df2: org.apache.spark.sql.DataFrame = [sum(decNum): decimal(38,18)]
scala> df2.show(40,false)
+---------------------------------------+
|sum(decNum) |
+---------------------------------------+
|20000000000000000000.000000000000000000|
+---------------------------------------+
```
**Before fix: Scenario2: Setting spark.sql.ansi.enabled to true** - WRONG RESULTS
```
scala> spark.conf.set("spark.sql.ansi.enabled", "true")
scala> val df = Seq(
| (BigDecimal("10000000000000000000"), 1),
| (BigDecimal("10000000000000000000"), 1),
| (BigDecimal("10000000000000000000"), 2),
| (BigDecimal("10000000000000000000"), 2),
| (BigDecimal("10000000000000000000"), 2),
| (BigDecimal("10000000000000000000"), 2),
| (BigDecimal("10000000000000000000"), 2),
| (BigDecimal("10000000000000000000"), 2),
| (BigDecimal("10000000000000000000"), 2),
| (BigDecimal("10000000000000000000"), 2),
| (BigDecimal("10000000000000000000"), 2),
| (BigDecimal("10000000000000000000"), 2)).toDF("decNum", "intNum")
df: org.apache.spark.sql.DataFrame = [decNum: decimal(38,18), intNum: int]
scala> val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, "intNum").agg(sum("decNum"))
df2: org.apache.spark.sql.DataFrame = [sum(decNum): decimal(38,18)]
scala> df2.show(40,false)
+---------------------------------------+
|sum(decNum) |
+---------------------------------------+
|20000000000000000000.000000000000000000|
+---------------------------------------+
```
**After the fix: Scenario1:**
```
scala> val df = Seq(
| (BigDecimal("10000000000000000000"), 1),
| (BigDecimal("10000000000000000000"), 1),
| (BigDecimal("10000000000000000000"), 2),
| (BigDecimal("10000000000000000000"), 2),
| (BigDecimal("10000000000000000000"), 2),
| (BigDecimal("10000000000000000000"), 2),
| (BigDecimal("10000000000000000000"), 2),
| (BigDecimal("10000000000000000000"), 2),
| (BigDecimal("10000000000000000000"), 2),
| (BigDecimal("10000000000000000000"), 2),
| (BigDecimal("10000000000000000000"), 2),
| (BigDecimal("10000000000000000000"), 2)).toDF("decNum", "intNum")
df: org.apache.spark.sql.DataFrame = [decNum: decimal(38,18), intNum: int]
scala> val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, "intNum").agg(sum("decNum"))
df2: org.apache.spark.sql.DataFrame = [sum(decNum): decimal(38,18)]
scala> df2.show(40,false)
+-----------+
|sum(decNum)|
+-----------+
|null |
+-----------+
```
**After fix: Scenario2: Setting the spark.sql.ansi.enabled to true:**
```
scala> spark.conf.set("spark.sql.ansi.enabled", "true")
scala> val df = Seq(
| (BigDecimal("10000000000000000000"), 1),
| (BigDecimal("10000000000000000000"), 1),
| (BigDecimal("10000000000000000000"), 2),
| (BigDecimal("10000000000000000000"), 2),
| (BigDecimal("10000000000000000000"), 2),
| (BigDecimal("10000000000000000000"), 2),
| (BigDecimal("10000000000000000000"), 2),
| (BigDecimal("10000000000000000000"), 2),
| (BigDecimal("10000000000000000000"), 2),
| (BigDecimal("10000000000000000000"), 2),
| (BigDecimal("10000000000000000000"), 2),
| (BigDecimal("10000000000000000000"), 2)).toDF("decNum", "intNum")
df: org.apache.spark.sql.DataFrame = [decNum: decimal(38,18), intNum: int]
scala> val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, "intNum").agg(sum("decNum"))
df2: org.apache.spark.sql.DataFrame = [sum(decNum): decimal(38,18)]
scala> df2.show(40,false)
20/02/18 10:57:43 ERROR Executor: Exception in task 5.0 in stage 4.0 (TID 30)
java.lang.ArithmeticException: Decimal(expanded,100000000000000000000.000000000000000000,39,18}) cannot be represented as Decimal(38, 18).
```
### Why are the changes needed?
The changes are needed in order to fix the wrong results that are returned for decimal aggregate sum.
### Does this PR introduce any user-facing change?
Prior to this change, users would see wrong results for an aggregate sum that involved decimal overflow; now they will see null. If the user sets the spark.sql.ansi.enabled flag to true, an exception is thrown instead of an incorrect result.
### How was this patch tested?
New test has been added and existing tests for sql, catalyst and hive suites were run ok.
Closes#27627 from skambha/decaggfixwrongresults.
Lead-authored-by: Sunitha Kambhampati <skambha@us.ibm.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to make PySpark exception more Pythonic by hiding JVM stacktrace by default. It can be enabled by turning on `spark.sql.pyspark.jvmStacktrace.enabled` configuration.
```
Traceback (most recent call last):
...
pyspark.sql.utils.PythonException:
An exception was thrown from Python worker in the executor. The below is the Python worker stacktrace.
Traceback (most recent call last):
...
```
If this `spark.sql.pyspark.jvmStacktrace.enabled` is enabled, it appends:
```
JVM stacktrace:
org.apache.spark.Exception: ...
...
```
For example, the codes below:
```python
from pyspark.sql.functions import udf
@udf
def divide_by_zero(v):
    raise v / 0
spark.range(1).select(divide_by_zero("id")).show()
```
will show an error message that looks like a Python exception thrown locally.
<details>
<summary>Python exception message when <code>spark.sql.pyspark.jvmStacktrace.enabled</code> is off (default)</summary>
```
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.../spark/python/pyspark/sql/dataframe.py", line 427, in show
print(self._jdf.showString(n, 20, vertical))
File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/.../spark/python/pyspark/sql/utils.py", line 131, in deco
raise_from(converted)
File "<string>", line 3, in raise_from
pyspark.sql.utils.PythonException:
An exception was thrown from Python worker in the executor. The below is the Python worker stacktrace.
Traceback (most recent call last):
File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 605, in main
process()
File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 597, in process
serializer.dump_stream(out_iter, outfile)
File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream
for obj in iterator:
File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched
for item in iterator:
File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper
result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr>
result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda>
return lambda *a: f(*a)
File "/.../spark/python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper
return f(*args, **kwargs)
File "<stdin>", line 3, in divide_by_zero
ZeroDivisionError: division by zero
```
</details>
<details>
<summary>Python exception message when <code>spark.sql.pyspark.jvmStacktrace.enabled</code> is on</summary>
```
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.../spark/python/pyspark/sql/dataframe.py", line 427, in show
print(self._jdf.showString(n, 20, vertical))
File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/.../spark/python/pyspark/sql/utils.py", line 137, in deco
raise_from(converted)
File "<string>", line 3, in raise_from
pyspark.sql.utils.PythonException:
An exception was thrown from Python worker in the executor. The below is the Python worker stacktrace.
Traceback (most recent call last):
File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 605, in main
process()
File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 597, in process
serializer.dump_stream(out_iter, outfile)
File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream
for obj in iterator:
File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched
for item in iterator:
File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper
result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr>
result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda>
return lambda *a: f(*a)
File "/.../spark/python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper
return f(*args, **kwargs)
File "<stdin>", line 3, in divide_by_zero
ZeroDivisionError: division by zero
JVM stacktrace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, 192.168.35.193, executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 605, in main
process()
File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 597, in process
serializer.dump_stream(out_iter, outfile)
File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream
for obj in iterator:
File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched
for item in iterator:
File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper
result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr>
result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda>
return lambda *a: f(*a)
File "/.../spark/python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper
return f(*args, **kwargs)
File "<stdin>", line 3, in divide_by_zero
ZeroDivisionError: division by zero
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:516)
at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:81)
at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:64)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:469)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:753)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:469)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:472)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2117)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2066)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2065)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2065)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1021)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1021)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1021)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2297)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2246)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2235)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:823)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2108)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2129)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2148)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:467)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:420)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:47)
at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3653)
at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2695)
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3644)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3642)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2695)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2902)
at org.apache.spark.sql.Dataset.getRows(Dataset.scala:300)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:337)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 605, in main
process()
File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 597, in process
serializer.dump_stream(out_iter, outfile)
File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream
for obj in iterator:
File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched
for item in iterator:
File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper
result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr>
result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda>
return lambda *a: f(*a)
File "/.../spark/python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper
return f(*args, **kwargs)
File "<stdin>", line 3, in divide_by_zero
ZeroDivisionError: division by zero
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:516)
at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:81)
at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:64)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:469)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:753)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:469)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:472)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
```
</details>
<details>
<summary>Python exception message without this change</summary>
```
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.../spark/python/pyspark/sql/dataframe.py", line 427, in show
print(self._jdf.showString(n, 20, vertical))
File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/.../spark/python/pyspark/sql/utils.py", line 98, in deco
return f(*a, **kw)
File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o160.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 10 in stage 5.0 failed 4 times, most recent failure: Lost task 10.3 in stage 5.0 (TID 37, 192.168.35.193, executor 3): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 605, in main
process()
File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 597, in process
serializer.dump_stream(out_iter, outfile)
File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream
for obj in iterator:
File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched
for item in iterator:
File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper
result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr>
result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda>
return lambda *a: f(*a)
File "/.../spark/python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper
return f(*args, **kwargs)
File "<stdin>", line 3, in divide_by_zero
ZeroDivisionError: division by zero
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:516)
at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:81)
at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:64)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:469)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:753)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:469)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:472)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2117)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2066)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2065)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2065)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1021)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1021)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1021)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2297)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2246)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2235)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:823)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2108)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2129)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2148)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:467)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:420)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:47)
at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3653)
at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2695)
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3644)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3642)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2695)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2902)
at org.apache.spark.sql.Dataset.getRows(Dataset.scala:300)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:337)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 605, in main
process()
File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 597, in process
serializer.dump_stream(out_iter, outfile)
File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream
for obj in iterator:
File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched
for item in iterator:
File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper
result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr>
result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda>
return lambda *a: f(*a)
File "/.../spark/python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper
return f(*args, **kwargs)
File "<stdin>", line 3, in divide_by_zero
ZeroDivisionError: division by zero
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:516)
at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:81)
at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:64)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:469)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:753)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:469)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:472)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
```
</details>
<br/>
Another example with Python 3.7:
```python
sql("a")
```
<details>
<summary>Python exception message when <code>spark.sql.pyspark.jvmStacktrace.enabled</code> is off (default)</summary>
```
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.../spark/python/pyspark/sql/session.py", line 646, in sql
return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/.../spark/python/pyspark/sql/utils.py", line 131, in deco
raise_from(converted)
File "<string>", line 3, in raise_from
pyspark.sql.utils.ParseException:
mismatched input 'a' expecting {'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 0)
== SQL ==
a
^^^
```
</details>
<details>
<summary>Python exception message when <code>spark.sql.pyspark.jvmStacktrace.enabled</code> is on</summary>
```
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.../spark/python/pyspark/sql/session.py", line 646, in sql
return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/.../spark/python/pyspark/sql/utils.py", line 131, in deco
raise_from(converted)
File "<string>", line 3, in raise_from
pyspark.sql.utils.ParseException:
mismatched input 'a' expecting {'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 0)
== SQL ==
a
^^^
JVM stacktrace:
org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input 'a' expecting {'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 0)
== SQL ==
a
^^^
at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:266)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:133)
at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:49)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:81)
at org.apache.spark.sql.SparkSession.$anonfun$sql$2(SparkSession.scala:604)
at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:604)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:601)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
```
</details>
<details>
<summary>Python exception message without this change</summary>
```
Traceback (most recent call last):
File "/.../spark/python/pyspark/sql/utils.py", line 98, in deco
return f(*a, **kw)
File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o26.sql.
: org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input 'a' expecting {'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 0)
== SQL ==
a
^^^
at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:266)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:133)
at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:49)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:81)
at org.apache.spark.sql.SparkSession.$anonfun$sql$2(SparkSession.scala:604)
at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:604)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:601)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.../spark/python/pyspark/sql/session.py", line 646, in sql
return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/.../spark/python/pyspark/sql/utils.py", line 102, in deco
raise converted
pyspark.sql.utils.ParseException:
mismatched input 'a' expecting {'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 0)
== SQL ==
a
^^^
```
</details>
### Why are the changes needed?
Currently, PySpark exceptions are very unfriendly to Python users, burying the error in a long JVM stacktrace. See "Python exception message without this change" above.
### Does this PR introduce _any_ user-facing change?
Yes, it will change the exception message. See the examples above.
### How was this patch tested?
Manually tested by
```bash
./bin/pyspark --conf spark.sql.pyspark.jvmStacktrace.enabled=true
```
and running the examples above.
Closes#28661 from HyukjinKwon/python-debug.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
1. Replace `SimpleDateFormat` by `FastDateFormat` as the legacy formatter of `FractionTimestampFormatter`.
2. Optimise `LegacyFastTimestampFormatter` for `java.sql.Timestamp` w/o fractional part.
### Why are the changes needed?
1. By default `HiveResult`.`hiveResultString` retrieves timestamp values as instances of `java.sql.Timestamp`, and uses the legacy parser `SimpleDateFormat` to convert the timestamps to strings. After the fix https://github.com/apache/spark/pull/28024, the fractional formatter and its companion legacy formatter `SimpleDateFormat` are created for every value. By switching from `LegacySimpleTimestampFormatter` to `LegacyFastTimestampFormatter`, we can utilize the internal cache of `FastDateFormat` and avoid re-parsing the default pattern `yyyy-MM-dd HH:mm:ss`.
2. The second change in the method `def format(ts: Timestamp): String` of `LegacyFastTimestampFormatter` is needed to optimize the formatter for patterns without the fractional part and avoid conversions to microseconds.
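A hedged sketch of the idea behind the first change (illustrative only, not the Spark code; the helper name is made up):
```scala
import java.sql.Timestamp
import org.apache.commons.lang3.time.FastDateFormat

// FastDateFormat.getInstance caches compiled patterns, so repeated formatting
// with the default pattern does not re-parse "yyyy-MM-dd HH:mm:ss" every time.
val legacyFormat = FastDateFormat.getInstance("yyyy-MM-dd HH:mm:ss")

// When the pattern has no fractional-seconds field, the java.sql.Timestamp can
// be formatted directly, avoiding a round trip through microseconds.
def formatTimestamp(ts: Timestamp): String = legacyFormat.format(ts)
```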
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By existing tests in `TimestampFormatter`.
Closes#28678 from MaxGekk/fastdateformat-as-legacy-frac-formatter.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
As mentioned in https://github.com/apache/spark/pull/28673 and suggested by cloud-fan at https://github.com/apache/spark/pull/28673#discussion_r432817075,
in this PR we disable datetime patterns in the form of `y..y` and `Y..Y` whose lengths are greater than 10, to avoid the kind of JDK bug described below.
The new datetime formatter introduces a silent data change like,
```sql
spark-sql> select from_unixtime(1, 'yyyyyyyyyyy-MM-dd');
NULL
spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
spark.sql.legacy.timeParserPolicy legacy
spark-sql> select from_unixtime(1, 'yyyyyyyyyyy-MM-dd');
00000001970-01-01
spark-sql>
```
For patterns that support `SignStyle.EXCEEDS_PAD`, e.g. `y..y` (len >= 4), when using the `NumberPrinterParser` to format it:
```java
switch (signStyle) {
case EXCEEDS_PAD:
if (minWidth < 19 && value >= EXCEED_POINTS[minWidth]) {
buf.append(decimalStyle.getPositiveSign());
}
break;
....
```
the `minWidth` == `len(y..y)`
the `EXCEED_POINTS` is
```java
/**
* Array of 10 to the power of n.
*/
static final long[] EXCEED_POINTS = new long[] {
0L,
10L,
100L,
1000L,
10000L,
100000L,
1000000L,
10000000L,
100000000L,
1000000000L,
10000000000L,
};
```
So when `len(y..y)` is greater than 10, an `ArrayIndexOutOfBoundsException` will be raised.
At the caller side, for `from_unixtime` the exception is suppressed and a silent data change occurs; for `date_format`, the `ArrayIndexOutOfBoundsException` propagates.
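The guard can be sketched roughly as follows (assumed names; the real implementation raises a Spark-specific error and also needs to skip quoted literals inside the pattern):
```scala
// Illustrative check: reject patterns whose run of 'y' or 'Y' exceeds 10
// characters before they ever reach DateTimeFormatter.
def checkYearPatternLength(pattern: String): Unit = {
  if ("y{11,}|Y{11,}".r.findFirstIn(pattern).isDefined) {
    throw new IllegalArgumentException(
      s"Pattern '$pattern' contains more than 10 consecutive 'y' or 'Y' characters, which is not supported.")
  }
}
```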
### Why are the changes needed?
fix silent data change
### Does this PR introduce _any_ user-facing change?
Yes, a SparkUpgradeException will take the place of the `null` result when the pattern contains 10 or more consecutive 'y' or 'Y' characters
### How was this patch tested?
new tests
Closes#28684 from yaooqinn/SPARK-31867-2.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR makes a minor change in deciding whether a partition is skewed by comparing the partition size to the median size of coalesced partitions instead of median size of raw partitions before coalescing.
### Why are the changes needed?
This change is in line with the target size criteria for splitting skew join partitions and can also cope with situations of extra empty partitions caused by over-partitioning. This PR also improves the skew join tests in the AQE test suite.
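Conceptually (an assumed helper with illustrative thresholds, not the AQE rule itself), the skew test now compares a partition against the median of the coalesced sizes:
```scala
// A partition counts as skewed when it is much larger than the median of the
// *coalesced* partition sizes rather than the raw, pre-coalescing sizes.
def isSkewed(size: Long, coalescedSizes: Seq[Long], factor: Double, threshold: Long): Boolean = {
  val sorted = coalescedSizes.sorted
  val median = sorted(sorted.length / 2)
  size > math.max(factor * median, threshold.toDouble)
}
```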
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Updated UTs.
Closes#28669 from maryannxue/spark-31864.
Authored-by: Maryann Xue <maryann.xue@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to upgrade Janino to 3.1.2, which was released recently.
Major changes were done for refactoring, and there are lots of commits with no commit message. Below are the pairs of (commit title, commit) which seem to deal with bugs or specific improvements (rather than refactoring) after 3.0.15.
* Issue #119: Guarantee executing popOperand() in popUninitializedVariableOperand() via moving popOperand() out of "assert"
* Issue #116: replace operand to final target type if boxing conversion and widening reference conversion happen together
* Merged pull request `#114` "Grow the code for relocatables, and do fixup, and relocate".
* 367c58e73e
* issue `#107`: Janino requires "org.codehaus.commons.compiler.io", but commons-compiler does not export this package
* f7d99596d4
* Throw an NYI CompileException when a static interface method is invoked.
* efd3884983
* Fixed the promotion of the array access index expression (see JLS7 15.13 Array Access Expressions)
* 32fdb5f5f1
* Issue `#104`: ClassLoaderIClassLoader 's ClassNotFoundException handle mechanism enhancement
* 6e8a97d609
The full changelog is available at: http://janino-compiler.github.io/janino/changelog.html
### Why are the changes needed?
We received reports of failures where Janino throws an error while compiling the code generated for a user's query. The issue is here: https://github.com/janino-compiler/janino/issues/113 It contains the generated code, the symptom (error), and an analysis of the bug, so please refer to the link for more details.
Janino 3.1.1 contains the PR https://github.com/janino-compiler/janino/pull/114 which enables Janino to compile such queries properly. I've also fixed a couple more bugs, as 3.1.1 made Spark UTs fail - hence we need to upgrade to 3.1.2.
Furthermore, from my testing, https://github.com/janino-compiler/janino/issues/90 (which Josh Rosen filed before) seems to be resolved in 3.1.2 as well.
Janino looks to be maintained by a single person, and there are not even version branches or release tags, so we can't expect the Janino maintainer to release a new bugfix version - hence we have to try out the new minor version.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing UTs.
Closes#27860 from HeartSaVioR/SPARK-31101.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Delete duplicate code in the cast test suite.
### Why are the changes needed?
keep spark code clean
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
no need
Closes#28655 from GuoPhilipse/delete-duplicate-code-castsuit.
Lead-authored-by: GuoPhilipse <46367746+GuoPhilipse@users.noreply.github.com>
Co-authored-by: GuoPhilipse <guofei_ok@126.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
If `LLL`/`qqq` is used in the datetime pattern string, and the current JDK in use has a bug for the stand-alone form (see https://bugs.openjdk.java.net/browse/JDK-8114833), throw an exception with a clear error message.
### Why are the changes needed?
to keep backward compatibility with Spark 2.4
### Does this PR introduce _any_ user-facing change?
Yes
Spark 2.4
```
scala> sql("select date_format('1990-1-1', 'LLL')").show
+---------------------------------------------+
|date_format(CAST(1990-1-1 AS TIMESTAMP), LLL)|
+---------------------------------------------+
| Jan|
+---------------------------------------------+
```
Spark 3.0 with Java 11
```
scala> sql("select date_format('1990-1-1', 'LLL')").show
+---------------------------------------------+
|date_format(CAST(1990-1-1 AS TIMESTAMP), LLL)|
+---------------------------------------------+
| Jan|
+---------------------------------------------+
```
Spark 3.0 with Java 8
```
// before this PR
+---------------------------------------------+
|date_format(CAST(1990-1-1 AS TIMESTAMP), LLL)|
+---------------------------------------------+
| 1|
+---------------------------------------------+
// after this PR
scala> sql("select date_format('1990-1-1', 'LLL')").show
org.apache.spark.SparkUpgradeException
```
### How was this patch tested?
manual test with java 8 and 11
Closes#28646 from cloud-fan/format.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Currently, the legacy fractional formatter is based on the implementation from Spark 2.4 which formats the input timestamp twice:
```
val timestampString = ts.toString
val formatted = legacyFormatter.format(ts)
```
to strip trailing zeros. This PR proposes to avoid the first formatting by building the fraction of the second directly.
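An illustrative sketch of the idea, with a made-up helper name (not the actual `FractionTimestampFormatter` code):
```scala
// Build the fraction-of-second suffix directly from the microseconds value and strip
// trailing zeros, so the timestamp does not have to be formatted twice.
def fractionSuffix(micros: Long): String = {
  val frac = ((micros % 1000000L) + 1000000L) % 1000000L // non-negative micros of second
  if (frac == 0L) "" else "." + "%06d".format(frac).reverse.dropWhile(_ == '0').reverse
}

fractionSuffix(123000L) // ".123"
fractionSuffix(0L)      // ""
```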
### Why are the changes needed?
It makes legacy fractional formatter faster.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By existing test "format fraction of second" in `TimestampFormatterSuite` + added test for timestamps before 1970-01-01 00:00:00Z
Closes#28643 from MaxGekk/optimize-legacy-fract-format.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR modifies some codegen-related tests to test escape characters for datetime functions which are time zone aware. If the time zone is absent, the formatter could produce `null` because of `java.util.NoSuchElementException: None.get`, bypassing the real intention of those test cases.
### Why are the changes needed?
fix tests
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
passing the modified test cases.
Closes#28653 from yaooqinn/SPARK-31835.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR extracts the logic for selecting the planned join type out of the `JoinSelection` rule and moves it to `JoinSelectionHelper` in Catalyst.
### Why are the changes needed?
This change both cleans up the code in `JoinSelection` and allows the logic to be in one place and be used from other rules that need to make decision based on the join type before the planning time.
### Does this PR introduce _any_ user-facing change?
`BuildSide`, `BuildLeft`, and `BuildRight` are moved from `org.apache.spark.sql.execution` to Catalyst in `org.apache.spark.sql.catalyst.optimizer`.
### How was this patch tested?
This is a refactoring, passes existing tests.
Closes#28540 from dbaliafroozeh/RefactorJoinSelection.
Authored-by: Ali Afroozeh <ali.afroozeh@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Some expression aliases are not displayed correctly in the schema. This PR fixes them:
- `ln`
- `rint`
- `lcase`
- `position`
### Why are the changes needed?
Improve the implementation of some expressions.
### Does this PR introduce _any_ user-facing change?
Yes. This PR lets users see the correct alias in the schema.
### How was this patch tested?
Jenkins test.
Closes#28551 from beliefer/show-correct-alias-in-schema.
Lead-authored-by: beliefer <beliefer@163.com>
Co-authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to set the alias, and class name in its `ExpressionInfo` for `struct`.
- Class name in `ExpressionInfo`
- from: `org.apache.spark.sql.catalyst.expressions.NamedStruct`
- to:`org.apache.spark.sql.catalyst.expressions.CreateNamedStruct`
- Alias name: `named_struct(col1, v, ...)` -> `struct(v, ...)`
This PR takes over https://github.com/apache/spark/pull/28631
### Why are the changes needed?
To show the correct output name and class names to users.
### Does this PR introduce _any_ user-facing change?
Yes.
**Before:**
```scala
scala> sql("DESC FUNCTION struct").show(false)
+------------------------------------------------------------------------------------+
|function_desc |
+------------------------------------------------------------------------------------+
|Function: struct |
|Class: org.apache.spark.sql.catalyst.expressions.NamedStruct |
|Usage: struct(col1, col2, col3, ...) - Creates a struct with the given field values.|
+------------------------------------------------------------------------------------+
```
```scala
scala> sql("SELECT struct(1, 2)").show(false)
+------------------------------+
|named_struct(col1, 1, col2, 2)|
+------------------------------+
|[1, 2] |
+------------------------------+
```
**After:**
```scala
scala> sql("DESC FUNCTION struct").show(false)
+------------------------------------------------------------------------------------+
|function_desc |
+------------------------------------------------------------------------------------+
|Function: struct |
|Class: org.apache.spark.sql.catalyst.expressions.CreateNamedStruct |
|Usage: struct(col1, col2, col3, ...) - Creates a struct with the given field values.|
+------------------------------------------------------------------------------------+
```
```scala
scala> sql("SELECT struct(1, 2)").show(false)
+------------+
|struct(1, 2)|
+------------+
|[1, 2] |
+------------+
```
### How was this patch tested?
Manually tested, and Jenkins tests.
Closes#28633 from HyukjinKwon/SPARK-31808.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Convert `java.time.Instant` to `java.sql.Timestamp` in filters pushed down to the ORC datasource when the Java 8 time API is enabled.
### Why are the changes needed?
The changes fix the exception raised while pushing date filters when `spark.sql.datetime.java8API.enabled` is set to `true`:
```
java.lang.IllegalArgumentException: Wrong value class java.time.Instant for TIMESTAMP.EQUALS leaf
at org.apache.hadoop.hive.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl.checkLiteralType(SearchArgumentImpl.java:192)
at org.apache.hadoop.hive.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl.<init>(SearchArgumentImpl.java:75)
```
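A minimal sketch of the conversion described above; the helper name is hypothetical, the real change lives in the ORC filter pushdown code:
```scala
import java.sql.Timestamp
import java.time.Instant

// ORC's SearchArgument accepts java.sql.Timestamp literals for TIMESTAMP predicates,
// so Java 8 Instant values must be converted before building predicate leaves.
def toOrcTimestampLiteral(value: Any): Any = value match {
  case instant: Instant => Timestamp.from(instant)
  case other => other
}
```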
### Does this PR introduce any user-facing change?
Yes
### How was this patch tested?
Added tests to `OrcFilterSuite`.
Closes#28636 from MaxGekk/orc-timestamp-filter-pushdown.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Five consecutive pattern letters of 'G/M/L/E/u/Q/q' mean Narrow-Text Style since we switched to `java.time.DateTimeFormatterBuilder` in 3.0.0, which outputs only the leading letter of the value, e.g. `December` becomes `D`. In Spark 2.4 they mean Full-Text Style.
In this PR, we explicitly disable Narrow-Text Style for these pattern characters.
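A small demonstration of the JDK behavior described above:
```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.util.Locale

// Five pattern letters select the narrow text style, which prints only the leading
// letter of the value; four letters select the full text style as in Spark 2.4.
val date = LocalDate.of(2019, 12, 25)
DateTimeFormatter.ofPattern("MMMM", Locale.US).format(date)  // "December"
DateTimeFormatter.ofPattern("MMMMM", Locale.US).format(date) // "D"
```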
### Why are the changes needed?
Without this change, there will be a silent data change.
### Does this PR introduce _any_ user-facing change?
Yes. Queries with datetime operations using datetime patterns, e.g. `G/M/L/E/u`, will fail if the pattern length is 5; other patterns, e.g. 'k', 'm', also only accept a certain number of letters.
1. Datetime patterns that are not supported by the new parser but are supported by the legacy one, e.g. "GGGGG", "MMMMM", "LLLLL", "EEEEE", "uuuuu", "aa", "aaa", get a SparkUpgradeException. Two options are given to end users: one is to use legacy mode, and the other is to follow the new online doc for correct datetime patterns.
2. Datetime patterns that are supported by neither the new parser nor the legacy one, e.g. "QQQQQ", "qqqqq", get an IllegalArgumentException which is captured by Spark internally and results in NULL for end users.
### How was this patch tested?
add unit tests
Closes#28592 from yaooqinn/SPARK-31771.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
As we support multiple catalogs with DataSourceV2, we may need the `CURRENT_CATALOG` value expression from the SQL standard.
`CURRENT_CATALOG` is a general value specification in the SQL Standard, described as:
> The value specified by CURRENT_CATALOG is the character string that represents the current default catalog name.
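Example usage of the new function (the output below assumes the default session catalog, which is named spark_catalog):
```scala
spark.sql("SELECT current_catalog()").show()
// +-----------------+
// |current_catalog()|
// +-----------------+
// |    spark_catalog|
// +-----------------+
```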
### Why are the changes needed?
improve catalog v2 with ANSI SQL standard.
### Does this PR introduce any user-facing change?
Yes, adds a new function `current_catalog()` that returns the current active catalog.
### How was this patch tested?
add ut
Closes#27006 from yaooqinn/SPARK-30352.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Use `format()` methods for Java date-time types in `Row.jsonValue`. The PR https://github.com/apache/spark/pull/28582 added the methods to avoid conversions to days and microseconds.
### Why are the changes needed?
To avoid unnecessary overhead of converting Java date-time types to micros/days before formatting. Also formatters have to convert input micros/days back to Java types to pass instances to standard library API.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By existing tests in `RowJsonSuite`.
Closes#28620 from MaxGekk/toJson-format-Java-datetime-types.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add and register three new functions: `TIMESTAMP_SECONDS`, `TIMESTAMP_MILLIS` and `TIMESTAMP_MICROS`
A test is added.
Reference: [BigQuery](https://cloud.google.com/bigquery/docs/reference/standard-sql/timestamp_functions?hl=en#timestamp_seconds)
### Why are the changes needed?
People will have a convenient way to get timestamps from seconds, milliseconds, and microseconds.
### Does this PR introduce _any_ user-facing change?
Yes, people will have the following ways to get timestamp:
```scala
sql("select TIMESTAMP_SECONDS(t.a) as timestamp from values(1230219000),(-1230219000) as t(a)").show(false)
```
```
+-------------------+
|timestamp          |
+-------------------+
|2008-12-25 23:30:00|
|1931-01-07 16:30:00|
+-------------------+
```
```scala
sql("select TIMESTAMP_MILLIS(t.a) as timestamp from values(1230219000123),(-1230219000123) as t(a)").show(false)
```
```
+-----------------------+
|timestamp              |
+-----------------------+
|2008-12-25 23:30:00.123|
|1931-01-07 16:29:59.877|
+-----------------------+
```
```scala
sql("select TIMESTAMP_MICROS(t.a) as timestamp from values(1230219000123123),(-1230219000123123) as t(a)").show(false)
```
```
+--------------------------+
|timestamp                 |
+--------------------------+
|2008-12-25 23:30:00.123123|
|1931-01-07 16:29:59.876877|
+--------------------------+
```
### How was this patch tested?
Unit test.
Closes#28534 from TJX2014/master-SPARK-31710.
Authored-by: TJX2014 <xiaoxingstack@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR allows missing hour fields when parsing date/timestamp string, with 0 as the default value.
If the year field is missing, this PR still fails the query by default, but provides a new legacy config to allow it and use 1970 as the default value. It's not a good default value, as 1970 is not a leap year, which means Feb 29 can never be parsed. We just pick it for backward compatibility.
### Why are the changes needed?
To keep backward compatibility with Spark 2.4.
### Does this PR introduce _any_ user-facing change?
Yes.
Spark 2.4:
```
scala> sql("select to_timestamp('16', 'dd')").show
+------------------------+
|to_timestamp('16', 'dd')|
+------------------------+
| 1970-01-16 00:00:00|
+------------------------+
scala> sql("select to_date('16', 'dd')").show
+-------------------+
|to_date('16', 'dd')|
+-------------------+
| 1970-01-16|
+-------------------+
scala> sql("select to_timestamp('2019 40', 'yyyy mm')").show
+----------------------------------+
|to_timestamp('2019 40', 'yyyy mm')|
+----------------------------------+
| 2019-01-01 00:40:00|
+----------------------------------+
scala> sql("select to_timestamp('2019 10:10:10', 'yyyy hh:mm:ss')").show
+----------------------------------------------+
|to_timestamp('2019 10:10:10', 'yyyy hh:mm:ss')|
+----------------------------------------------+
| 2019-01-01 10:10:10|
+----------------------------------------------+
```
in branch 3.0
```
scala> sql("select to_timestamp('16', 'dd')").show
+--------------------+
|to_timestamp(16, dd)|
+--------------------+
| null|
+--------------------+
scala> sql("select to_date('16', 'dd')").show
+---------------+
|to_date(16, dd)|
+---------------+
| null|
+---------------+
scala> sql("select to_timestamp('2019 40', 'yyyy mm')").show
+------------------------------+
|to_timestamp(2019 40, yyyy mm)|
+------------------------------+
| 2019-01-01 00:00:00|
+------------------------------+
scala> sql("select to_timestamp('2019 10:10:10', 'yyyy hh:mm:ss')").show
+------------------------------------------+
|to_timestamp(2019 10:10:10, yyyy hh:mm:ss)|
+------------------------------------------+
| 2019-01-01 00:00:00|
+------------------------------------------+
```
After this PR, the behavior becomes the same as 2.4, if the legacy config is enabled.
### How was this patch tested?
new tests
Closes#28576 from cloud-fan/bug.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
1. Add new methods that accept date-time Java types to the DateFormatter and TimestampFormatter traits. The methods format input date-time instances to strings:
- TimestampFormatter:
- `def format(ts: Timestamp): String`
- `def format(instant: Instant): String`
- DateFormatter:
- `def format(date: Date): String`
- `def format(localDate: LocalDate): String`
2. Re-use the added methods from `HiveResult.toHiveString`
3. Borrow the code for formatting of `java.sql.Timestamp` from Spark 2.4 `DateTimeUtils.timestampToString` to `FractionTimestampFormatter` because legacy formatters don't support variable length patterns for seconds fractions.
### Why are the changes needed?
To avoid unnecessary overhead of converting Java date-time types to micros/days before formatting. Also formatters have to convert input micros/days back to Java types to pass instances to standard library API.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By existing tests for toHiveString and new tests in `TimestampFormatterSuite`.
Closes#28582 from MaxGekk/opt-format-old-types.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR adds a private `WriteBuilder` mixin trait: `SupportsStreamingUpdate`, so that the builtin v2 streaming sinks can still support the update mode.
Note: it's private because we don't have a proper design yet. I didn't take the proposal in https://github.com/apache/spark/pull/23702#discussion_r258593059 because we may want something more general, like updating by an expression `key1 = key2 + 10`.
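A rough sketch of the mixin idea, for illustration only; the member shown is an assumption, not necessarily the exact shape of the private trait:
```scala
import org.apache.spark.sql.connector.write.WriteBuilder

// Built-in v2 streaming sinks mix this in so the update output mode can still be
// planned against them, without exposing a public, finalized API.
trait SupportsStreamingUpdateSketch extends WriteBuilder {
  def update(): WriteBuilder
}
```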
### Why are the changes needed?
In Spark 2.4, all builtin v2 streaming sinks support all streaming output modes, and v2 sinks are enabled by default, see https://issues.apache.org/jira/browse/SPARK-22911
It's too risky for 3.0 to go back to v1 sinks, so I propose to add a private trait to fix builtin v2 sinks, to keep backward compatibility.
### Does this PR introduce _any_ user-facing change?
Yes, now all the builtin v2 streaming sinks support all streaming output modes, which is the same as 2.4
### How was this patch tested?
existing tests.
Closes#28523 from cloud-fan/update.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Eliminate the `UpCast` if its child's data type is already a decimal type.
### Why are the changes needed?
While deserializing an internal `Decimal` value to an external `BigDecimal` (Java/Scala) value, Spark should also respect the `Decimal`'s precision and scale; otherwise it will cause precision loss and look weird in some cases, e.g.:
```
sql("select cast(11111111111111111111111111111111111111 as decimal(38, 0)) as d")
.write.mode("overwrite")
.parquet(f.getAbsolutePath)
// can fail
spark.read.parquet(f.getAbsolutePath).as[BigDecimal]
```
```
[info] org.apache.spark.sql.AnalysisException: Cannot up cast `d` from decimal(38,0) to decimal(38,18).
[info] The type path of the target object is:
[info] - root class: "scala.math.BigDecimal"
[info] You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object;
[info] at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveUpCast$$fail(Analyzer.scala:3060)
[info] at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$33$$anonfun$applyOrElse$174.applyOrElse(Analyzer.scala:3087)
[info] at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$33$$anonfun$applyOrElse$174.applyOrElse(Analyzer.scala:3071)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309)
[info] at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:309)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:314)
```
### Does this PR introduce _any_ user-facing change?
Yes. The cases mentioned above (which cause precision loss) fail before this change but run successfully after it.
### How was this patch tested?
Added tests.
Closes#28572 from Ngone51/fix_encoder.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Change the casting of timestamp to int in ANSI and non-codegen mode, and make the error message consistent with the error messages in the codegen mode. In particular, casting to int is implemented in the same way as casting to short and byte.
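An illustrative sketch of the described behavior, not the exact `Cast` code:
```scala
// The timestamp is first reduced to whole seconds; in ANSI mode the cast fails with an
// error that reports the input value when the seconds value does not fit into Int.
def timestampMicrosToIntAnsi(micros: Long): Int = {
  val seconds = Math.floorDiv(micros, 1000000L)
  if (seconds < Int.MinValue || seconds > Int.MaxValue) {
    throw new ArithmeticException(s"Casting $micros to int causes overflow")
  }
  seconds.toInt
}
```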
### Why are the changes needed?
1. The error message in the non-codegen mode diverges from the error message in the codegen mode.
2. The error message contains intermediate results that could confuse users.
### Does this PR introduce _any_ user-facing change?
Yes. Before the changes, the error message for casting a timestamp to int contains an intermediate result; after the changes, it contains the input value that causes the arithmetic overflow.
### How was this patch tested?
By running the modified test suite `AnsiCastSuite`.
Closes#28549 from MaxGekk/fix-error-msg-cast-timestamp.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This patch effectively reverts SPARK-30098 via below changes:
* Removed the config
* Removed the changes done in parser rule
* Removed the usage of config in tests
* Removed tests which depend on the config
* Rolled back some tests to before SPARK-30098 which were affected by SPARK-30098
* Reflect the change into docs (migration doc, create table syntax)
### Why are the changes needed?
SPARK-30098 brought confusion and frustration around the CREATE TABLE DDL query, and we agreed about the bad effect of the change.
Please go through the [discussion thread](http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Resolve-ambiguous-parser-rule-between-two-quot-create-table-quot-s-td29051i20.html) to see the details.
### Does this PR introduce _any_ user-facing change?
No, compared to Spark 2.4.x. End users who experimented with the Spark 3.0.0 previews will see the behavior going back to Spark 2.4.x, but I believe we don't guarantee compatibility in preview releases.
### How was this patch tested?
Existing UTs.
Closes#28517 from HeartSaVioR/revert-SPARK-30098.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Set default time zone and locale in the default constructor of `SparkFunSuite`:
- Default time zone to `America/Los_Angeles`
- Default locale to `Locale.US`
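A minimal sketch of these settings, which conceptually run in the `SparkFunSuite` constructor:
```scala
import java.util.{Locale, TimeZone}

// Make every test suite start from the same defaults.
TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
Locale.setDefault(Locale.US)
```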
### Why are the changes needed?
1. To deduplicate code by moving common time zone and locale settings to one place SparkFunSuite
2. To have the same default time zone and locale in all tests. This should prevent errors like https://github.com/apache/spark/pull/28538
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
by running all affected test suites
Closes#28548 from MaxGekk/timezone-settings-SparkFunSuite.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
- Resolve the HAVING condition together with expanding the GROUPING SETS/CUBE/ROLLUP expressions in `ResolveGroupingAnalytics`:
  - Change the resolving direction of the operations to top-down.
  - Try resolving the condition of the filter as though it is in the aggregate clause, by reusing the function in `ResolveAggregateFunctions`.
  - Push the aggregate expressions into the aggregate which contains the expanded operations.
- Use `UnresolvedHaving` for all HAVING clauses.
### Why are the changes needed?
Correctness bug fix. See the demo and analysis in SPARK-31663.
### Does this PR introduce _any_ user-facing change?
Yes, correctness bug fix for HAVING with GROUPING SETS.
### How was this patch tested?
New UTs added.
Closes#28501 from xuanyuanking/SPARK-31663.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Added tests to check casting timestamps before 1970-01-01 00:00:00Z to ByteType, ShortType, IntegerType and LongType in ansi and non-ansi modes.
### Why are the changes needed?
To improve test coverage and prevent errors while modifying the CAST expression code.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running the modified test suites:
```
$ ./build/sbt "test:testOnly *CastSuite"
```
Closes#28531 from MaxGekk/test-cast-timestamp-to-byte.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR tries to fix a bug in `org.apache.spark.sql.hive.execution.ScriptTransformationExec`. This bug appeared in our online cluster. `ScriptTransformationExec` should throw an exception when the user uses a Python script that contains a parse error, but the current implementation may miss this failure case.
### Why are the changes needed?
When the user uses a Python script that contains a parse error, there will be no output. So `scriptOutputReader.next(scriptOutputWritable) <= 0` matches, and then we use `checkFailureAndPropagate()` to check the `proc`. But the `proc` may still be alive and `writerThread.exception` is not defined, so `checkFailureAndPropagate` cannot detect this failure case. In the end, the Spark SQL job runs successfully and returns no result. In fact, the Spark SQL job should fail and show the exception properly.
For example, the erroneous Python script is below.
``` python
# encoding: utf8
import unknow_module
import sys
for line in sys.stdin:
    print line
```
The bug can be reproduced by running the following code in our cluster.
```
spark.range(100*100).toDF("index").createOrReplaceTempView("test")
spark.sql("select TRANSFORM(index) USING 'python error_python.py' as new_index from test").collect.foreach(println)
```
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing UT
Closes#27724 from slamke/transformation.
Authored-by: sunke.03 <sunke.03@bytedance.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
When reading/writing datetime values that are before the rebase switch day from/to Avro/Parquet files, fail by default and ask users to set a config to explicitly choose whether to rebase or not.
### Why are the changes needed?
Rebasing or not rebasing leads to different behaviors, and we should let users decide explicitly. In most cases, users won't hit this exception, as it only affects ancient datetime values.
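A usage sketch for choosing the behavior explicitly; the config keys below follow the Spark 3.0 naming and are listed as an assumption, so check the migration guide of your version for the exact names and the Avro counterparts:
```scala
// "CORRECTED" reads/writes ancient values as-is in the Proleptic Gregorian calendar,
// "LEGACY" rebases them to/from the hybrid Julian + Gregorian calendar.
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY")
```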
### Does this PR introduce _any_ user-facing change?
Yes, now users will see an error when reading/writing dates before 1582-10-15 or timestamps before 1900-01-01 from/to Parquet/Avro files, with an error message to ask setting a config.
### How was this patch tested?
updated tests
Closes#28477 from cloud-fan/rebase.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Generates java.time.LocalDate/java.time.Instant values for DateType/TimestampType in `RandomDataGenerator.forType` when the SQL config `spark.sql.datetime.java8API.enabled` is set to `true`.
### Why are the changes needed?
To improve test coverage, and check java.time.Instant/java.time.LocalDate types in round trip tests.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running modified test suites `RowEncoderSuite`, `RandomDataGeneratorSuite` and `HadoopFsRelationTest`.
Closes#28502 from MaxGekk/random-java8-datetime.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR introduces a change that sets the insideComment flag to false on a newline, fixing the issue introduced by SPARK-30049.
### Why are the changes needed?
Previously on SPARK-30049 a comment containing an unclosed quote produced the following issue:
```
spark-sql> SELECT 1 -- someone's comment here
> ;
Error in query:
extraneous input ';' expecting <EOF>(line 2, pos 0)
== SQL ==
SELECT 1 -- someone's comment here
;
^^^
```
This was caused because there was no flag inside the splitSemiColon method to make comment sections ignore quotes. SPARK-30049 added that flag and fixed the issue, but introduced the following problem:
```
spark-sql> select
> 1,
> -- two
> 2;
Error in query:
mismatched input '<EOF>' expecting {'(', 'ADD', 'AFTER', 'ALL', 'ALTER', ...}(line 3, pos 2)
== SQL ==
select
1,
--^^^
```
This issue is caused by not turning off the insideComment flag on a newline.
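A minimal, self-contained sketch of the splitting logic described above (not the real `splitSemiColon` code): a `;` only splits statements when it is neither inside quotes nor inside a single-line comment, and the comment state is cleared on a newline.
```scala
def splitStatements(text: String): Seq[String] = {
  val out = scala.collection.mutable.ArrayBuffer.empty[String]
  val cur = new StringBuilder
  var insideQuote = false
  var insideComment = false
  var i = 0
  while (i < text.length) {
    val c = text(i)
    if (insideComment) {
      if (c == '\n') insideComment = false // the missing turn-off this PR adds
      cur.append(c)
    } else if (c == '\'') {
      insideQuote = !insideQuote
      cur.append(c)
    } else if (!insideQuote && c == '-' && i + 1 < text.length && text(i + 1) == '-') {
      insideComment = true
      cur.append(c)
    } else if (!insideQuote && c == ';') {
      out += cur.toString
      cur.clear()
    } else {
      cur.append(c)
    }
    i += 1
  }
  if (cur.nonEmpty) out += cur.toString
  out.toSeq
}
```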
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
- For the previous tests using line continuity (`\`), a line-continuity rule was added to the SqlBase.g4 file to add the functionality to the SQL context.
- A new test for inline comments was added.
Closes#27920 from javierivanov/SPARK-31102.
Authored-by: Javier Fuentes <j.fuentes.m@icloud.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Some expression aliases are not displayed correctly in the schema. This PR fixes them:
- `TimeWindow`
- `MaxBy`
- `MinBy`
- `UnaryMinus`
- `BitwiseCount`
This PR also fixes a typo issue; please look at b7cde42b04/sql/core/src/test/scala/org/apache/spark/sql/ExpressionsSchemaSuite.scala (L142)
Note:
1. `MaxBy` and `MinBy` extend `MaxMinBy`, and the latter adds a method `funcName` that is not needed. We can reuse `prettyName` to replace `funcName`.
2. Some functions in Spark SQL have inelegant implementations. For example, `BitwiseCount` overrides the sql method as shown below:
`override def sql: String = s"bit_count(${child.sql})"`
I don't think this is elegant enough, because `Expression` gives the following definition:
```
def sql: String = {
val childrenSQL = children.map(_.sql).mkString(", ")
s"$prettyName($childrenSQL)"
}
```
By this definition, `BitwiseCount` should override `prettyName` method.
### Why are the changes needed?
Improve the implementation of some expressions.
### Does this PR introduce any user-facing change?
Yes. This PR lets users see the correct alias in the schema.
### How was this patch tested?
Jenkins test.
Closes#28164 from beliefer/elegant-pretty-name-for-function.
Lead-authored-by: beliefer <beliefer@163.com>
Co-authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Modified `RandomDataGenerator.forType` for DateType and TimestampType to generate special date/timestamp values with a probability of 0.5. This triggers dictionary encoding in the Parquet datasource test HadoopFsRelationTest "test all data types". Currently, dictionary encoding is tested only for numeric types like ShortType.
### Why are the changes needed?
To extend test coverage. Currently, the probability of exercising dictionary encoding in the test HadoopFsRelationTest "test all data types" for DateType and TimestampType is close to zero because dates/timestamps are uniformly distributed over a wide range, and the chance of generating the same values is pretty low. As a result, the Parquet datasource cannot apply dictionary encoding for such column types.
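An illustrative sketch of the generation strategy, with made-up special values (not the actual `RandomDataGenerator` change):
```scala
import java.sql.Date
import scala.util.Random

// With probability 0.5 pick one of a few special dates so repeated values show up often
// enough for Parquet to choose dictionary encoding; otherwise generate an arbitrary date.
val specialDates = Seq("0001-01-01", "1582-10-15", "1970-01-01", "9999-12-31").map(Date.valueOf)

def randomDate(rand: Random): Date = {
  if (rand.nextBoolean()) specialDates(rand.nextInt(specialDates.length))
  else new Date((rand.nextLong() % 250000000000000L).abs) // arbitrary post-1970 epoch millis
}
```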
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running `ParquetHadoopFsRelationSuite` and `JsonHadoopFsRelationSuite`.
Closes#28481 from MaxGekk/test-random-parquet-dict-enc.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Shift dates that do not exist in the Proleptic Gregorian calendar by 1 day. The reason is that `RowEncoderSuite` generates random dates/timestamps in the hybrid calendar, and some of them, like 1000-02-29, don't exist in the Proleptic Gregorian calendar because 1000 is not a leap year there.
### Why are the changes needed?
This makes RowEncoderSuite much more stable.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running RowEncoderSuite and set non-existing date manually:
```scala
val date = new java.sql.Date(1000 - 1900, 1, 29)
Try { date.toLocalDate; date }.getOrElse(new Date(date.getTime + MILLIS_PER_DAY))
```
Closes#28486 from MaxGekk/fix-RowEncoderSuite.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
RuntimeReplaceable expressions are replaced at runtime; thus, their original parameters are not resolved to PrettyAttribute and remain debug-style strings if we implement their `sql` methods directly with their parameters' `sql` methods.
This PR is raised with suggestions by maropu and cloud-fan at https://github.com/apache/spark/pull/28402/files#r417656589. In this PR, we re-implement the `sql` methods of the RuntimeReplaceable ones with toPrettySQL.
### Why are the changes needed?
Consistency of schema output between RuntimeReplaceable expressions and normal ones.
For example, `date_format` vs `to_timestamp`, before this PR, they output differently
#### Before
```sql
select date_format(timestamp '2019-10-06', 'yyyy-MM-dd uuuu')
struct<date_format(TIMESTAMP '2019-10-06 00:00:00', yyyy-MM-dd uuuu):string>
select to_timestamp("2019-10-06S10:11:12.12345", "yyyy-MM-dd'S'HH:mm:ss.SSSSSS")
struct<to_timestamp('2019-10-06S10:11:12.12345', 'yyyy-MM-dd\'S\'HH:mm:ss.SSSSSS'):timestamp>
```
#### After
```sql
select date_format(timestamp '2019-10-06', 'yyyy-MM-dd uuuu')
struct<date_format(TIMESTAMP '2019-10-06 00:00:00', yyyy-MM-dd uuuu):string>
select to_timestamp("2019-10-06T10:11:12'12", "yyyy-MM-dd'T'HH:mm:ss''SSSS")
struct<to_timestamp(2019-10-06T10:11:12'12, yyyy-MM-dd'T'HH:mm:ss''SSSS):timestamp>
```
### Does this PR introduce _any_ user-facing change?
Yes, the schema output style changed for the runtime replaceable expressions as shown in the above example
### How was this patch tested?
regenerate all related tests
Closes#28420 from yaooqinn/SPARK-31615.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This is a followup to address the https://github.com/apache/spark/pull/28366#discussion_r420611872 by refining the SQL config document.
### Why are the changes needed?
Make it less confusing for developers.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Only doc change.
Closes#28468 from viirya/SPARK-31365-followup.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This PR proposes to deprecate 'spark.sql.optimizer.metadataOnly' configuration and remove it in the future release.
### Why are the changes needed?
This optimization can cause a potential correctness issue, see also SPARK-26709.
Also, it seems difficult to extend the optimization. Basically you should whitelist all available functions. It costs some maintenance overhead, see also SPARK-31590.
Looks like we should just let users use `SparkSessionExtensions` instead if they must use it, and remove it on the Spark side.
### Does this PR introduce _any_ user-facing change?
Yes, setting `spark.sql.optimizer.metadataOnly` will show a deprecation warning:
```scala
scala> spark.conf.unset("spark.sql.optimizer.metadataOnly")
```
```
20/05/06 12:57:23 WARN SQLConf: The SQL config 'spark.sql.optimizer.metadataOnly' has been
deprecated in Spark v3.0 and may be removed in the future. Avoid to depend on this optimization
to prevent a potential correctness issue. If you must use, use 'SparkSessionExtensions' instead to
inject it as a custom rule.
```
```scala
scala> spark.conf.set("spark.sql.optimizer.metadataOnly", "true")
```
```
20/05/06 12:57:44 WARN SQLConf: The SQL config 'spark.sql.optimizer.metadataOnly' has been
deprecated in Spark v3.0 and may be removed in the future. Avoid to depend on this optimization
to prevent a potential correctness issue. If you must use, use 'SparkSessionExtensions' instead to
inject it as a custom rule.
```
### How was this patch tested?
Manually tested.
Closes#28459 from HyukjinKwon/SPARK-31647.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This patch proposes to replace `NESTED_PREDICATE_PUSHDOWN_ENABLED` with `NESTED_PREDICATE_PUSHDOWN_V1_SOURCE_LIST` which can configure which v1 data sources are enabled with nested predicate pushdown.
### Why are the changes needed?
We added the nested predicate pushdown feature, which is configured by `NESTED_PREDICATE_PUSHDOWN_ENABLED`. However, this config is an all-or-nothing config and applies to all data sources.
In order not to introduce an API-breaking change after enabling nested predicate pushdown, we'd like to configure nested predicate pushdown per data source. Please also refer to the comments https://github.com/apache/spark/pull/27728#discussion_r410829720.
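A usage sketch of the per-source setting; the config key is an assumption based on this PR's description, so check `SQLConf` for the exact name:
```scala
// Enable nested predicate pushdown only for the listed v1 file sources.
spark.conf.set("spark.sql.optimizer.nestedPredicatePushdown.supportedFileSources", "parquet,orc")
```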
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Added/Modified unit tests.
Closes#28366 from viirya/SPARK-31365.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Perform days rebasing while converting days from JSON string field. In Spark 2.4 and earlier versions, the days are interpreted as days since the epoch in the hybrid calendar (Julian + Gregorian since 1582-10-15). Since Spark 3.0, the base calendar was switched to Proleptic Gregorian calendar, so, the days should be rebased to represent the same local date.
### Why are the changes needed?
The changes fix a bug and restore compatibility with Spark 2.4 in which:
```scala
scala> spark.read.schema("d date").json(Seq("{'d': '-141704'}").toDS).show
+----------+
| d|
+----------+
|1582-01-01|
+----------+
```
### Does this PR introduce _any_ user-facing change?
Yes.
Before:
```scala
scala> spark.read.schema("d date").json(Seq("{'d': '-141704'}").toDS).show
+----------+
| d|
+----------+
|1582-01-11|
+----------+
```
After:
```scala
scala> spark.read.schema("d date").json(Seq("{'d': '-141704'}").toDS).show
+----------+
| d|
+----------+
|1582-01-01|
+----------+
```
### How was this patch tested?
Add a test to `JsonSuite`.
Closes#28453 from MaxGekk/json-rebase-legacy-days.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Skip timestamp rebasing after a global threshold when there is no difference between the Julian and Gregorian calendars. This avoids checking the hash maps of switch points and fixes perf regressions in `toJavaTimestamp()` and `fromJavaTimestamp()`.
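A simplified sketch of the fast path, with hypothetical names (the real constants and switch maps live in `RebaseDateTime`):
```scala
// At or after the last Julian/Gregorian switch instant the two calendars agree,
// so the per-time-zone switch-point maps don't need to be consulted at all.
def rebaseMicros(micros: Long, lastSwitchMicros: Long)(rebaseSlow: Long => Long): Long = {
  if (micros >= lastSwitchMicros) micros else rebaseSlow(micros)
}
```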
### Why are the changes needed?
The changes fix perf regressions of conversions to/from external type `java.sql.Timestamp`.
Before (see the PR's results https://github.com/apache/spark/pull/28440):
```
================================================================================================
Conversion from/to external types
================================================================================================
OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 2.50GHz
To/from Java's date-time: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
From java.sql.Timestamp 376 388 10 13.3 75.2 1.1X
Collect java.sql.Timestamp 1878 1937 64 2.7 375.6 0.2X
```
After:
```
================================================================================================
Conversion from/to external types
================================================================================================
OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 2.50GHz
To/from Java's date-time: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
From java.sql.Timestamp 249 264 24 20.1 49.8 1.7X
Collect java.sql.Timestamp 1503 1523 24 3.3 300.5 0.3X
```
Perf improvements on average:
1. From java.sql.Timestamp: ~34%
2. To java.sql.Timestamp: ~16%
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By existing test suites `DateTimeUtilsSuite` and `RebaseDateTimeSuite`.
Closes#28441 from MaxGekk/opt-rebase-common-threshold.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This reverts commit 43a73e387c. It sets `INT96` as the timestamp type while saving timestamps to parquet files.
### Why are the changes needed?
To be compatible with Hive and Presto that don't support the `TIMESTAMP_MICROS` type in current stable releases.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By existing test suites.
Closes#28450 from MaxGekk/parquet-int96.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Push the rebase logic to the lower level of the parquet vectorized reader, to make the final code more vectorization-friendly.
### Why are the changes needed?
Parquet vectorized reader is carefully implemented to make it more likely to be vectorized by the JVM. However, the newly added datetime rebase degrades the performance a lot, as it breaks vectorization, even if the datetime values don't need rebasing (which is very likely, as dates before 1582 are rare).
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
Run part of the `DateTimeRebaseBenchmark` locally. The results:
before this patch
```
[info] Load dates from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1582, vec on, rebase off 2677 2838 142 37.4 26.8 1.0X
[info] after 1582, vec on, rebase on 3828 4331 805 26.1 38.3 0.7X
[info] before 1582, vec on, rebase off 2903 2926 34 34.4 29.0 0.9X
[info] before 1582, vec on, rebase on 4163 4197 38 24.0 41.6 0.6X
[info] Load timestamps from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1900, vec on, rebase off 3537 3627 104 28.3 35.4 1.0X
[info] after 1900, vec on, rebase on 6891 7010 105 14.5 68.9 0.5X
[info] before 1900, vec on, rebase off 3692 3770 72 27.1 36.9 1.0X
[info] before 1900, vec on, rebase on 7588 7610 30 13.2 75.9 0.5X
```
After this patch
```
[info] Load dates from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1582, vec on, rebase off 2758 2944 197 36.3 27.6 1.0X
[info] after 1582, vec on, rebase on 2908 2966 51 34.4 29.1 0.9X
[info] before 1582, vec on, rebase off 2840 2878 37 35.2 28.4 1.0X
[info] before 1582, vec on, rebase on 3407 3433 24 29.4 34.1 0.8X
[info] Load timestamps from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1900, vec on, rebase off 3861 4003 139 25.9 38.6 1.0X
[info] after 1900, vec on, rebase on 4194 4283 77 23.8 41.9 0.9X
[info] before 1900, vec on, rebase off 3849 3937 79 26.0 38.5 1.0X
[info] before 1900, vec on, rebase on 7512 7546 55 13.3 75.1 0.5X
```
Date type is 30% faster if the values don't need rebasing, and 20% faster if they do.
Timestamp type is 60% faster if the values don't need rebasing, and there is no difference if they do.
Closes#28406 from cloud-fan/perf.
Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
The collect_set() aggregate function should produce a set of distinct elements. When the column argument's type is BinaryType, this is not the case.
Example:
```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
case class R(id: String, value: String, bytes: Array[Byte])
def makeR(id: String, value: String) = R(id, value, value.getBytes)
val df = Seq(makeR("a", "dog"), makeR("a", "cat"), makeR("a", "cat"), makeR("b", "fish")).toDF()
// In the example below "bytesSet" erroneously has duplicates but "stringSet" does not (as expected).
df.agg(collect_set('value) as "stringSet", collect_set('bytes) as "byteSet").show(truncate=false)
// The same problem is displayed when using window functions.
val win = Window.partitionBy('id).rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
val result = df.select(
collect_set('value).over(win) as "stringSet",
collect_set('bytes).over(win) as "bytesSet"
)
.select('stringSet, 'bytesSet, size('stringSet) as "stringSetSize", size('bytesSet) as "bytesSetSize")
.show()
```
We use a HashSet buffer to accumulate the results. The problem is that array equality in Scala doesn't behave as expected: arrays are just plain Java arrays, and their equality doesn't compare the contents of the arrays:
`Array(1, 2, 3) == Array(1, 2, 3) => false`
The result is that duplicates are not removed in the HashSet.
The proposed solution is that in the last stage, when we have all the data in the HashSet buffer, we remove duplicates by changing the type of the elements and then transforming them back to the original type.
This transformation is only applied when the element type is BinaryType.
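A small demonstration of the problem and of the deduplication idea described above:
```scala
// JVM arrays use reference equality, so a HashSet keeps "duplicate" arrays;
// converting the elements to an immutable Seq (value equality) deduplicates them,
// after which they can be mapped back to arrays.
val buffer = scala.collection.mutable.HashSet.empty[Array[Byte]]
buffer += Array[Byte](1, 2, 3)
buffer += Array[Byte](1, 2, 3)
buffer.size              // 2
buffer.map(_.toSeq).size // 1
```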
### Why are the changes needed?
Fix the bug explained
### Does this PR introduce any user-facing change?
Yes. Now `collect_set()` correctly deduplicates arrays of bytes.
### How was this patch tested?
Unit testing
Closes#28351 from planga82/feature/SPARK-31500_COLLECT_SET_bug.
Authored-by: Pablo Langa <soypab@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Add test cases covering all interval units: MICROSECOND MILLISECOND SECOND MINUTE HOUR DAY WEEK MONTH YEAR
### Why are the changes needed?
For test coverage.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Test only.
Closes#28418 from xuanyuanking/SPARK-28424.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
In this PR, I propose to fix the two legacy timestamp formatters, `LegacySimpleTimestampFormatter` and `LegacyFastTimestampFormatter`, to perform micros rebasing when parsing/formatting from/to strings.
### Why are the changes needed?
Legacy timestamp formatters operate on the hybrid calendar (Julian + Gregorian), so the input micros should be rebased to have the same date-time fields as in the Proleptic Gregorian calendar used by Spark SQL; see SPARK-26651.
### Does this PR introduce any user-facing change?
Yes
### How was this patch tested?
Added tests to `TimestampFormatterSuite`
Closes#28408 from MaxGekk/fix-rebasing-in-legacy-timestamp-formatter.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In `CTESubstitution`, resolve CTE relations first, then traverse the main plan only once to substitute CTE relations.
### Why are the changes needed?
Currently we will traverse the main query many times (if there are many CTE relations), which can be pretty slow if the main query is large.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
local perf test
```
scala> :pa
// Entering paste mode (ctrl-D to finish)
def test(i: Int): Unit = 1.to(i).foreach { _ =>
spark.sql("""
with
t1 as (select 1),
t2 as (select 1),
t3 as (select 1),
t4 as (select 1),
t5 as (select 1),
t6 as (select 1),
t7 as (select 1),
t8 as (select 1),
t9 as (select 1)
select * from t1, t2, t3, t4, t5, t6, t7, t8, t9""").queryExecution.assertAnalyzed()
}
// Exiting paste mode, now interpreting.
test: (i: Int)Unit
scala> test(10000)
scala> println(org.apache.spark.sql.catalyst.rules.RuleExecutor.dumpTimeSpent)
```
The result before this patch
```
Rule Effective Time / Total Time Effective Runs / Total Runs
CTESubstitution 3328796344 / 3924576425 10000 / 20000
```
The result after this patch
```
Rule Effective Time / Total Time Effective Runs / Total Runs
CTESubstitution 1503085936 / 2091992092 10000 / 20000
```
About 2 times faster.
Closes#28407 from cloud-fan/cte.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Because of ebc8fa50d0 and beec8d535f, the SQL output strings for the date/timestamp - interval operations have a malformed format, such as `struct<dateval:date,dateval + (- INTERVAL '2 years 2 months').....`
This PR restores the old behavior by adding a `RuntimeReplaceable` implementation for both of the operations to get their pretty SQL strings back.
### Why are the changes needed?
restore the SQL string for datetime operations
### Does this PR introduce any user-facing change?
No, we are restoring the previous behavior here.
### How was this patch tested?
added unit tests
Closes#28402 from yaooqinn/SPARK-31586-F.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
- Check all available legacy formats in the tests added by https://github.com/apache/spark/pull/28345
- Check dates rebasing in legacy parsers for only one direction either days -> string or string -> days.
### Why are the changes needed?
Round trip tests can hide issues in dates rebasing. For example, if we remove rebasing from legacy parsers (from `parse()` and `format()`) the tests will pass.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By running `DateFormatterSuite`.
Closes#28398 from MaxGekk/test-rebasing-in-legacy-date-formatter.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR intends to update `sql` in `Rand`/`Randn` with no argument to make a column name deterministic.
Before this PR (a column name changes run-by-run):
```
scala> sql("select rand()").show()
+-------------------------+
|rand(7986133828002692830)|
+-------------------------+
| 0.9524061403696937|
+-------------------------+
```
After this PR (a column name fixed):
```
scala> sql("select rand()").show()
+------------------+
| rand()|
+------------------+
|0.7137935639522275|
+------------------+
// If a seed given, it is still shown in a column name
// (the same with the current behaviour)
scala> sql("select rand(1)").show()
+------------------+
| rand(1)|
+------------------+
|0.6363787615254752|
+------------------+
// We can still check a seed in explain output:
scala> sql("select rand()").explain()
== Physical Plan ==
*(1) Project [rand(-2282124938778456838) AS rand()#0]
+- *(1) Scan OneRowRelation[]
```
Note: This fix comes from #28194; the ongoing PR tests the output schema of expressions, so their schemas must be deterministic for the tests.
### Why are the changes needed?
To make output schema deterministic.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added unit tests.
Closes#28392 from maropu/SPARK-31594.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR addresses two things:
- `SHOW TBLPROPERTIES` should support views (a regression introduced by #26921).
- `SHOW TBLPROPERTIES` on a temporary view should return an empty result (the 2.4 behavior) instead of throwing `AnalysisException`.
### Why are the changes needed?
It's a bug.
### Does this PR introduce any user-facing change?
Yes, now `SHOW TBLPROPERTIES` works on views:
```
scala> sql("CREATE VIEW view TBLPROPERTIES('p1'='v1', 'p2'='v2') AS SELECT 1 AS c1")
scala> sql("SHOW TBLPROPERTIES view").show(truncate=false)
+---------------------------------+-------------+
|key |value |
+---------------------------------+-------------+
|view.catalogAndNamespace.numParts|2 |
|view.query.out.col.0 |c1 |
|view.query.out.numCols |1 |
|p2 |v2 |
|view.catalogAndNamespace.part.0 |spark_catalog|
|p1 |v1 |
|view.catalogAndNamespace.part.1 |default |
+---------------------------------+-------------+
```
And for a temporary view:
```
scala> sql("CREATE TEMPORARY VIEW tview TBLPROPERTIES('p1'='v1', 'p2'='v2') AS SELECT 1 AS c1")
scala> sql("SHOW TBLPROPERTIES tview").show(truncate=false)
+---+-----+
|key|value|
+---+-----+
+---+-----+
```
### How was this patch tested?
Added tests.
Closes#28375 from imback82/show_tblproperties_followup.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
With suggestion from cloud-fan https://github.com/apache/spark/pull/28222#issuecomment-620586933
I checked with both Presto and PostgreSQL: one implements intervals with ANSI-style year-month/day-time, and the other is mixed and non-ANSI. Both add the exceeded days in the interval's time part to the total days of the operation which extracts the day field from interval values.
```sql
presto> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:00' as timestamp)));
_col0
-------
14
(1 row)
Query 20200428_135239_00000_ahn7x, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:01 [0 rows, 0B] [0 rows/s, 0B/s]
presto> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:01' as timestamp)));
_col0
-------
13
(1 row)
Query 20200428_135246_00001_ahn7x, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]
presto>
```
```sql
postgres=# SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:00' as timestamp)));
date_part
-----------
14
(1 row)
postgres=# SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:01' as timestamp)));
date_part
-----------
13
```
```
spark-sql> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:01' as timestamp)));
0
spark-sql> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:00' as timestamp)));
0
```
In the ANSI standard, a day is exactly 24 hours, so we don't need to worry about the conceptual day for interval extraction. The meaning of the conceptual day only takes effect when we add it to a zoned timestamp value.
### Why are the changes needed?
Both satisfy the ANSI standard and common use cases in modern SQL platforms
### Does this PR introduce any user-facing change?
No, it is new in 3.0.
### How was this patch tested?
add more uts
Closes#28396 from yaooqinn/SPARK-31597.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose to simplify the code of `InSet.sql` and create `Literal` instances directly from Catalyst's internal values by using the default `Literal` constructor.
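For illustration, a minimal sketch of the idea (not the exact Spark code): the default `Literal(value, dataType)` constructor already takes a Catalyst internal value, so no conversion back to external Scala types is needed.
```scala
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.types.StringType
import org.apache.spark.unsafe.types.UTF8String

// UTF8String is Catalyst's internal representation of StringType values.
val internalValue = UTF8String.fromString("a")

// Default constructor: the internal value is used as-is, no conversion needed.
val lit = Literal(internalValue, StringType)
println(lit.sql) // expected: 'a'
```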
### Why are the changes needed?
This simplifies code and avoids unnecessary conversions to external types.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By existing test `SPARK-31563: sql of InSet for UTF8String collection` in `ColumnExpressionSuite`.
Closes#28399 from MaxGekk/fix-InSet-sql-followup.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
The implementation of TimeSub for the operation of subtracting an interval from a timestamp is almost a duplicate of TimeAdd. We can replace it with TimeAdd(l, -r) since they are equivalent.
Suggestion from https://github.com/apache/spark/pull/28310#discussion_r414259239
Besides, the coercion rules for TimeAdd/TimeSub(date, interval) are no longer needed, so this PR removes them as well since they are touched here anyway.
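A rough sketch of the rewrite (simplified, not the exact Spark code): subtraction is expressed by negating the interval and reusing `TimeAdd`.
```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, TimeAdd, UnaryMinus}

// Build `timestamp - interval` as `TimeAdd(timestamp, -interval)`,
// so no dedicated TimeSub expression is needed.
def timestampMinusInterval(ts: Expression, interval: Expression): Expression =
  TimeAdd(ts, UnaryMinus(interval))
```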
### Why are the changes needed?
Remove redundant and now-useless code for easier maintenance.
### Does this PR introduce any user-facing change?
Yes, the SQL string of `datetime - interval` becomes `datetime + (- interval)`.
### How was this patch tested?
modified existing unit tests.
Closes#28381 from yaooqinn/SPARK-31586.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add a new logical node AggregateWithHaving, and the parser should create this plan for HAVING. The analyzer resolves it to Filter(..., Aggregate(...)).
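A hypothetical sketch of what such a node could look like, based only on the description above (the exact shape in Spark may differ):
```scala
import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression}
import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, UnaryNode}

// Placeholder created by the parser for HAVING; it stays unresolved until the
// analyzer rewrites it into Filter(havingCondition, aggregate) once the child
// Aggregate is fully resolved.
case class AggregateWithHaving(
    havingCondition: Expression,
    child: Aggregate) extends UnaryNode {
  override lazy val resolved: Boolean = false
  override def output: Seq[Attribute] = child.output
}
```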
### Why are the changes needed?
The SQL parser in Spark creates Filter(..., Aggregate(...)) for the HAVING query, and Spark has a special analyzer rule ResolveAggregateFunctions to resolve the aggregate functions and grouping columns in the Filter operator.
It works for simple cases in a very tricky way as it relies on rule execution order:
1. Rule ResolveReferences hits the Aggregate operator and resolves attributes inside aggregate functions, but the function itself is still unresolved as it's an UnresolvedFunction. This stops resolving the Filter operator as the child Aggregate operator is still unresolved.
2. Rule ResolveFunctions resolves UnresolvedFunction. This makes the Aggregate operator resolved.
3. Rule ResolveAggregateFunctions resolves the Filter operator if its child is a resolved Aggregate. This rule can correctly resolve the grouping columns.
In the example query, I put a CAST, which needs to be resolved by rule ResolveTimeZone, which runs after ResolveAggregateFunctions. This breaks step 3 as the Aggregate operator is unresolved at that time. Then the analyzer starts the next round and the Filter operator is resolved by ResolveReferences, which wrongly resolves the grouping columns.
See the demo below:
```
SELECT SUM(a) AS b, '2020-01-01' AS fake FROM VALUES (1, 10), (2, 20) AS T(a, b) GROUP BY b HAVING b > 10
```
The query's result is
```
+---+----------+
| b| fake|
+---+----------+
| 2|2020-01-01|
+---+----------+
```
But if we add CAST, it will return an empty result.
```
SELECT SUM(a) AS b, CAST('2020-01-01' AS DATE) AS fake FROM VALUES (1, 10), (2, 20) AS T(a, b) GROUP BY b HAVING b > 10
```
### Does this PR introduce any user-facing change?
Yes, bug fix for cast in having aggregate expressions.
### How was this patch tested?
New UT added.
Closes#28294 from xuanyuanking/SPARK-31519.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This is a followup of https://github.com/apache/spark/pull/28318, to make the code more readable, by adding some comments to explain the trick and simplify the code to use a boolean flag instead of 2 string sets.
This PR also fixes various problems:
1. the name check should consider case sensitivity
2. forward name conflicts like `with t as (with t2 as ...), t2 as ...` are not real conflicts and we shouldn't fail (see the sketch below).
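For illustration, a hedged spark-shell example of the forward-reference case from point 2, which should be accepted rather than rejected:
```scala
// The outer t2 is defined after the nested t2 used inside t; this is not a real conflict.
spark.sql("""
  WITH
    t  AS (WITH t2 AS (SELECT 1 AS c) SELECT * FROM t2),
    t2 AS (SELECT 2 AS c)
  SELECT * FROM t
""").show()
// Expected: a single row with value 1 (the nested t2 is the one used inside t).
```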
### Why are the changes needed?
correct the behavior
### Does this PR introduce any user-facing change?
Yes, fixes the aforementioned behaviors.
### How was this patch tested?
new tests
Closes#28371 from cloud-fan/followup.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Remove all the extra whitespaces in the formatted explain.
### Why are the changes needed?
The number of extra whitespaces in the formatted explain differs between master and branch-3.0. As a result, whenever we backport formatted-explain-related tests from master to branch-3.0, they fail on branch-3.0. Besides, extra whitespaces are always disallowed in Spark. Thus, we should remove them as much as possible.
### Does this PR introduce any user-facing change?
No, formatted explain is newly added in Spark 3.0.
### How was this patch tested?
Updated sql query tests.
Closes#28315 from Ngone51/fix_extra_spaces.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
To follow ANSI, the expressions `date + interval`, `interval + date` and `date - interval` should only accept intervals whose `microseconds` part is 0.
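A hedged illustration of the intent (whether the rejection surfaces at analysis or execution time is not covered here):
```scala
// Intervals whose microseconds part is 0 keep working for date arithmetic:
spark.sql("SELECT DATE'2020-01-01' + INTERVAL 2 DAYS").show()

// An interval with a non-zero microseconds part is expected to be rejected
// for date +/- interval after this change:
spark.sql("SELECT DATE'2020-01-01' + INTERVAL 1 SECOND").show()
```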
### Why are the changes needed?
Better ANSI compliance
### Does this PR introduce any user-facing change?
No, this PR should target 3.0.0 in which this feature is newly added.
### How was this patch tested?
add more unit tests
Closes#28310 from yaooqinn/SPARK-31527.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR modifies LegacyDateFormatter#parse to return proleptic Gregorian days rather than hybrid Julian days.
### Why are the changes needed?
The legacy time parser currently returns epoch days in the hybrid Julian calendar. However, the callers to the legacy parser (e.g., UnivocityParser, JacksonParser) expect epoch days in the proleptic Gregorian calendar. As a result, pre-Gregorian dates like '1000-01-01' get interpreted as '1000-01-06'.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Manual testing and modified existing unit tests.
Closes#28345 from bersprockets/SPARK-31557.
Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR fixes a CTE substitution issue so as to the following SQL return the correct empty result:
```
WITH t(c) AS (SELECT 1)
SELECT * FROM t
WHERE c IN (
WITH t(c) AS (SELECT 2)
SELECT * FROM t
)
```
Before this PR the result was `1`.
### Why are the changes needed?
To fix a correctness issue.
### Does this PR introduce any user-facing change?
Yes, fixes a correctness issue.
### How was this patch tested?
Added new test case.
Closes#28318 from peter-toth/SPARK-31535-fix-nested-cte-substitution.
Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
In the PR, I propose to fix the `InSet.sql` method for the cases when input collection contains values of internal Catalyst's types, for instance `UTF8String`. Elements of the input set `hset` are converted to Scala types, and wrapped by `Literal` to properly form SQL view of the input collection.
### Why are the changes needed?
The changes fixed the bug in `InSet.sql` that makes wrong assumption about types of collection elements. See more details in SPARK-31563.
### Does this PR introduce any user-facing change?
Highly likely, not.
### How was this patch tested?
Added a test to `ColumnExpressionSuite`
Closes#28343 from MaxGekk/fix-InSet-sql.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
The two methods `arrayClassFor` and `dataTypeFor` in `ScalaReflection` call each other circularly, but the cases handled in `dataTypeFor` are not fully handled in `arrayClassFor`.
For example:
```scala
scala> implicit def newArrayEncoder[T <: Array[_] : TypeTag]: Encoder[T] = ExpressionEncoder()
newArrayEncoder: [T <: Array[_]](implicit evidence$1: reflect.runtime.universe.TypeTag[T])org.apache.spark.sql.Encoder[T]
scala> val decOne = Decimal(1, 38, 18)
decOne: org.apache.spark.sql.types.Decimal = 1E-18
scala> val decTwo = Decimal(2, 38, 18)
decTwo: org.apache.spark.sql.types.Decimal = 2E-18
scala> val decSpark = Array(decOne, decTwo)
decSpark: Array[org.apache.spark.sql.types.Decimal] = Array(1E-18, 2E-18)
scala> Seq(decSpark).toDF()
java.lang.ClassCastException: org.apache.spark.sql.types.DecimalType cannot be cast to org.apache.spark.sql.types.ObjectType
at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$arrayClassFor$1(ScalaReflection.scala:131)
at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
at org.apache.spark.sql.catalyst.ScalaReflection$.arrayClassFor(ScalaReflection.scala:120)
at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$dataTypeFor$1(ScalaReflection.scala:105)
at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
at org.apache.spark.sql.catalyst.ScalaReflection$.dataTypeFor(ScalaReflection.scala:88)
at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerForType$1(ScalaReflection.scala:399)
at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
at org.apache.spark.sql.catalyst.ScalaReflection$.serializerForType(ScalaReflection.scala:393)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:57)
at newArrayEncoder(<console>:57)
... 53 elided
scala>
```
In this PR, we add the missing cases to `arrayClassFor`
### Why are the changes needed?
bugfix as described above
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
add a test for array encoders
Closes#28324 from yaooqinn/SPARK-31552.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
`LIKE ANY/SOME` and `LIKE ALL` operators are mostly used when we are matching a text field with numbers of patterns. For example:
Teradata / Hive 3.0 / Snowflake:
```sql
--like any
select 'foo' LIKE ANY ('%foo%','%bar%');
--like all
select 'foo' LIKE ALL ('%foo%','%bar%');
```
PostgreSQL:
```sql
-- like any
select 'foo' LIKE ANY (array['%foo%','%bar%']);
-- like all
select 'foo' LIKE ALL (array['%foo%','%bar%']);
```
This PR adds support for these two operators.
More details:
https://docs.teradata.com/reader/756LNiPSFdY~4JcCCcR5Cw/4~AyrPNmDN0Xk4SALLo6aQ
https://issues.apache.org/jira/browse/HIVE-15229
https://docs.snowflake.net/manuals/sql-reference/functions/like_any.html
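A hedged sketch of how this looks in Spark SQL once supported (semantics assumed to match the Teradata/Hive style above):
```scala
// LIKE ANY: true if the string matches at least one of the patterns.
spark.sql("SELECT 'foo' LIKE ANY ('%foo%', '%bar%') AS any_match").show()
// Expected: true

// LIKE ALL: true only if the string matches every pattern.
spark.sql("SELECT 'foo' LIKE ALL ('%foo%', '%bar%') AS all_match").show()
// Expected: false
```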
### Why are the changes needed?
To smoothly migrate SQLs to Spark SQL.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Unit test.
Closes#27477 from wangyum/SPARK-30724.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
1. Modified `ParquetFilters.valueCanMakeFilterOn()` to accept filters with `java.time.LocalDate` attributes.
2. Modified `ParquetFilters.dateToDays()` to support both types `java.sql.Date` and `java.time.LocalDate` in conversions to days.
3. Add implicit conversion from `LocalDate` to `Expression` (`Literal`).
### Why are the changes needed?
To support pushed down filters with `java.time.LocalDate` attributes. Before the changes, date filters are not pushed down to Parquet datasource when `spark.sql.datetime.java8API.enabled` is `true`.
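A hedged spark-shell example of the scenario this enables (the path and column name are illustrative):
```scala
import java.time.LocalDate
import spark.implicits._

// With the Java 8 datetime API enabled, filter values are java.time.LocalDate
// instances instead of java.sql.Date.
spark.conf.set("spark.sql.datetime.java8API.enabled", "true")

val df = spark.read.parquet("/tmp/events") // hypothetical dataset with a DATE column
val filtered = df.filter($"event_date" === LocalDate.of(2020, 4, 17))

// After this change the date predicate can be pushed down to Parquet;
// before it, such filters were simply not pushed down.
filtered.explain()
```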
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Added a test to `ParquetFilterSuite`
Closes#28259 from MaxGekk/parquet-filter-java8-date-time.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR intends to add a new test suite for `ExpressionInfo`. Major changes are as follows;
- Added a new test suite named `ExpressionInfoSuite`
- To improve test coverage, added a test for error handling in `ExpressionInfoSuite`
- Moved the `ExpressionInfo`-related tests from `UDFSuite` to `ExpressionInfoSuite`
- Moved the related tests from `SQLQuerySuite` to `ExpressionInfoSuite`
- Added a comment in `ExpressionInfoSuite` (followup of https://github.com/apache/spark/pull/28224)
### Why are the changes needed?
To improve test suites/coverage.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added tests.
Closes#28308 from maropu/SPARK-31526.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Override the canonicalized fields with respect to the result of `needsTimeZone`.
### Why are the changes needed?
The current approach breaks semantic equality of two cast expressions that don't involve datetime types. If we don't need `timeZone` information when casting the `from` type to the `to` type, then the timeZoneId should not influence the canonicalized result.
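A toy sketch of the idea (deliberately not Spark's actual `Cast`; `ToyCast` and its simplified `needsTimeZone` are made up for illustration):
```scala
import org.apache.spark.sql.types.{DataType, DateType, TimestampType}

// The time zone only matters when the (from, to) pair really uses it, so it
// should be dropped from the canonical form otherwise; two semantically equal
// casts then canonicalize to the same value.
case class ToyCast(from: DataType, to: DataType, timeZoneId: Option[String]) {
  // Simplified stand-in for Cast.needsTimeZone.
  def needsTimeZone: Boolean =
    from == TimestampType || to == TimestampType || from == DateType || to == DateType

  // Canonical form used for semantic-equality checks.
  def canonicalized: ToyCast = if (needsTimeZone) this else copy(timeZoneId = None)
}
```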
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
New UT added.
Closes#28288 from xuanyuanking/SPARK-31515.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
\_FUNC\_ has been used in the note() of `ExpressionDescription` since https://github.com/apache/spark/pull/28248, and there may be more such cases later, so we should replace it with the function name in the documentation.
### Why are the changes needed?
doc fix
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
pass Jenkins, and verify locally with Jekyll serve
Closes#28305 from yaooqinn/SPARK-31474-F.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
- Document row field values of `DATE` and `TIMESTAMP` type returned by `Row.get()` and `Row.apply`.
- Refer to `Row.get()` from the description of filter values
### Why are the changes needed?
Reflect the current behaviour of Row's `apply()` and `get()` methods in comments to inform users that the return types depend on the SQL config setting `spark.sql.datetime.java8API.enabled`.
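A hedged spark-shell illustration of the behaviour being documented:
```scala
// Default: java.sql.Date / java.sql.Timestamp instances are returned.
spark.conf.set("spark.sql.datetime.java8API.enabled", "false")
val legacy = spark.sql("SELECT DATE'2020-04-22' AS d").head().get(0)
println(legacy.getClass) // expected: class java.sql.Date

// With the Java 8 API enabled, java.time.LocalDate / java.time.Instant are returned.
spark.conf.set("spark.sql.datetime.java8API.enabled", "true")
val modern = spark.sql("SELECT DATE'2020-04-22' AS d").head().get(0)
println(modern.getClass) // expected: class java.time.LocalDate
```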
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Run `$ ./dev/scalastyle`
Closes#28300 from MaxGekk/doc-filter-date-time.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Extracting millennium, century, decade, millisecond, microsecond and epoch from datetime is neither ANSI standard nor quite common in modern SQL platforms. Most of the systems listed below do not support these, except PostgreSQL and Redshift.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions050.htm
https://prestodb.io/docs/current/functions/datetime.html
https://docs.cloudera.com/documentation/enterprise/5-8-x/topics/impala_datetime_functions.html
https://docs.snowflake.com/en/sql-reference/functions-date-time.html#label-supported-date-time-parts
https://www.postgresql.org/docs/9.1/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT
This PR removes support for these extract fields from the extract function for date and timestamp values.
`isoyear` is PostgreSQL-specific, but `yearofweek` is more commonly used across platforms.
`isodow` is PostgreSQL-specific, but `iso` as a suffix is more commonly used across platforms, so `dow_iso` and `dayofweek_iso` are used to replace it.
For historical reasons, we have [`dayofweek`, `dow`] implemented for representing a non-ISO day-of-week and a newly added `isodow` from PostgreSQL for the ISO day-of-week. Many other systems only support one week-numbering system and use either full names or abbreviations. Things in Spark become a little bit complicated:
1. Because of the existence of `isodow`, we would need to add an iso-prefix to `dayofweek` to make a pair for it too: [`dayofweek`, `isodayofweek`, `dow` and `isodow`].
2. Because `iso`-prefixed systems are rare and more systems choose the `iso`-suffixed way, we may end up with [`dayofweek`, `dayofweekiso`, `dow`, `dowiso`].
3. `dayofweekiso` looks nice and has use cases in the platforms listed above, e.g. Snowflake, but `dowiso` looks weird and no use cases were found.
4. After a discussion in the community, we agreed that an underscore before `iso` looks much better because `isodow` is new and there is no standard for `iso`-style names, so this keeps things simple and clear for end-users as long as they are well documented.
Thus, we finally end up with [`dayofweek`, `dow`] for the non-ISO day-of-week system and [`dayofweek_iso`, `dow_iso`] for the ISO system.
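For illustration, a hedged example of the resulting extract fields (2009-07-26 is a Sunday):
```scala
// Non-ISO system: Sunday(1) ... Saturday(7)
spark.sql("SELECT EXTRACT(DAYOFWEEK FROM DATE'2009-07-26') AS dow").show()          // expected: 1

// ISO system: Monday(1) ... Sunday(7)
spark.sql("SELECT EXTRACT(DAYOFWEEK_ISO FROM DATE'2009-07-26') AS dow_iso").show()  // expected: 7
```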
### Why are the changes needed?
Remove some nonstandard and uncommon features; we can add them back if necessary.
### Does this PR introduce any user-facing change?
No, we should target this at 3.0.0, in which these fields were added.
### How was this patch tested?
Remove unused tests
Closes#28284 from yaooqinn/SPARK-31507.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
override the `sql` method of `StringTrim`, `StringTrimLeft` and `StringTrimRight`, to use the standard SQL syntax.
### Why are the changes needed?
The current implementation is wrong. It gives you a SQL string that returns a different result.
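For reference, a hedged example of the standard trim syntax the generated SQL string should follow:
```scala
// ANSI-style syntax: TRIM(BOTH | LEADING | TRAILING trimStr FROM srcStr)
spark.sql("SELECT TRIM(BOTH 'x' FROM 'xxabcxx') AS trimmed").show()
// Expected: abc
```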
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
new tests
Closes#28281 from cloud-fan/sql.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
```sql
spark-sql> SELECT extract(dayofweek from '2009-07-26');
1
spark-sql> SELECT extract(dow from '2009-07-26');
0
spark-sql> SELECT extract(isodow from '2009-07-26');
7
spark-sql> SELECT dayofweek('2009-07-26');
1
spark-sql> SELECT weekday('2009-07-26');
6
```
Currently, there are 4 types of day-of-week range:
1. the function `dayofweek`(2.3.0) and extracting `dayofweek`(2.4.0) result as of Sunday(1) to Saturday(7)
2. extracting `dow`(3.0.0) results as of Sunday(0) to Saturday(6)
3. extracting `isodow` (3.0.0) results as of Monday(1) to Sunday(7)
4. the function `weekday`(2.4.0) results as of Monday(0) to Sunday(6)
Actually, extracting `dayofweek` and `dow` are both derived from PostgreSQL but have different meanings.
https://issues.apache.org/jira/browse/SPARK-23903
https://issues.apache.org/jira/browse/SPARK-28623
In this PR, we make extracting `dow` the same as extracting `dayofweek` and the `dayofweek` function, for historical reasons and to avoid breaking anything.
Also, add more documentation to the extract function to make the extract fields clearer to understand.
### Why are the changes needed?
To ensure consistency across the day-of-week functions and extract fields.
### Does this PR introduce any user-facing change?
Yes, the docs are updated and extracting `dow` is now the same as `dayofweek`.
### How was this patch tested?
1. modified ut
2. local SQL doc verification
#### before
![image](https://user-images.githubusercontent.com/8326978/79601949-3535b100-811c-11ea-957b-a33d68641181.png)
#### after
![image](https://user-images.githubusercontent.com/8326978/79601847-12a39800-811c-11ea-8ff6-aa329255d099.png)
Closes#28248 from yaooqinn/SPARK-31474.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR skips creating the partition specs in `ShufflePartitionsUtil` for 0-size partitions, which avoids launching unnecessary tasks that do nothing.
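A simplified toy sketch of the idea (plain Scala, not the actual `ShufflePartitionsUtil` code): specs are only emitted for reducer partitions that actually contain data.
```scala
// Toy model: each reducer partition has a size in bytes; a "spec" is the
// index range [start, end) of partitions a single task should read.
final case class PartitionSpec(start: Int, end: Int)

def createSpecsSkippingEmpty(sizes: Array[Long]): Seq[PartitionSpec] =
  sizes.indices
    .filter(i => sizes(i) > 0)         // skip 0-size partitions: no task is launched for them
    .map(i => PartitionSpec(i, i + 1)) // no coalescing in this toy version

// createSpecsSkippingEmpty(Array(100L, 0L, 0L, 42L))
//   == Seq(PartitionSpec(0, 1), PartitionSpec(3, 4))
```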
### Why are the changes needed?
Launching tasks that do nothing is a waste of resources.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
updated tests
Closes#28226 from cloud-fan/aqe.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR is the follow-up PR of https://github.com/apache/spark/pull/28003
- add a migration guide
- add an end-to-end test case.
### Why are the changes needed?
The original PR made a major behavior change to the user-facing RESET command.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Added a new end-to-end test
Closes#28265 from gatorsmile/spark-31234followup.
Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
### What changes were proposed in this pull request?
1. Generate rebasing arrays for micros up to 2037 in `RebaseDateTimeSuite.generateRebaseJson()`.
2. Exclude 4 time zones from the black list in `generateRebaseJson()`.
3. Re-generate JSON files with rebasing info - `gregorian-julian-rebase-micros.json` and `julian-gregorian-rebase-micros.json`.
### Why are the changes needed?
1. `sun.util.calendar.ZoneInfo` resolves DST incorrectly after the year 2037. See aa318070b2/jdk/src/share/classes/sun/util/calendar/ZoneInfo.java (L55-L62). By restricting the rebase arrays to the year 2037, we follow the behaviour of `ZoneInfo`, which uses the DST rules of 2037 for all years beyond 2037.
2. To enable optimization of micros rebasing via switch arrays for the time zones:
- Asia/Tehran
- Iran
- Africa/Casablanca
- Africa/El_Aaiun
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By existing test suites `RebaseDateTimeUtils`, `DateTimeUtilsSuite` and `DateFunctionsSuite`.
Closes#28253 from MaxGekk/fix-4-time-zones-rebasing.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR intends to add an ExpressionInfo entry for EXTRACT for better documentations.
This PR comes from the comment in https://github.com/apache/spark/pull/21479#discussion_r409900080
### Why are the changes needed?
To make SQL documentations complete.
### Does this PR introduce any user-facing change?
Yes, this PR updates the `Spark SQL, Built-in Functions` page.
### How was this patch tested?
Run the example tests.
Closes#28251 from maropu/AddExtractExpr.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR is to dump the codegen and compilation time for benchmark query tests.
### Why are the changes needed?
Measure the codegen and compilation time costs in TPC-DS queries
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Manual test in my local laptop:
```
23:13:12.845 WARN org.apache.spark.sql.TPCDSQuerySuite:
=== Metrics of Whole-stage Codegen ===
Total code generation time: 21.275102261 seconds
Total compilation time: 12.223771828 seconds
```
Closes#28252 from gatorsmile/testMastercode.
Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This PR intends to fix a bug that occurs when comparing null types to decimal types in master/branch-3.0;
```
scala> Seq(BigDecimal(10)).toDF("v1").selectExpr("v1 = NULL").explain(true)
org.apache.spark.sql.AnalysisException: cannot resolve '(`v1` = NULL)' due to data type mismatch: differing types in '(`v1` = NULL)' (decimal(38,18) and null).; line 1 pos 0;
'Project [(v1#5 = null) AS (v1 = NULL)#7]
+- Project [value#2 AS v1#5]
+- LocalRelation [value#2]
...
```
The query above passed in v2.4.5.
### Why are the changes needed?
bugfix
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added tests.
Closes#28241 from maropu/SPARK-31468.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Currently, we can extract `millennium/century/decade/year/quarter/month/week/day/hour/minute/second (with fractions)/millisecond/microseconds` and `epoch` from interval values.
When getting `millennium/century/decade/year`, the result means how much of the interval's `months` part can be converted to that unit; the values of `millennium/century/decade` overlap `year` and each other.
When getting `month/day` and so on, the result means the integral remainder of the previous unit; here all the units, including `year`, are individual.
So when extracting `year`, `month`, `day`, `hour`, `minute`, `second`, which are ANSI primary datetime units, the semantics are `extracting`, but the others are closer to `transforming`.
When getting epoch, we have to treat a month as 30 days, which deviates from the natural calendar rules we use.
To avoid ambiguity, I suggest we only support the extract fields defined by ANSI, along with their abbreviations.
### Why are the changes needed?
Extracting `millennium`, `century`, etc. does not obey the meaning of extracting, and these fields are not useful enough to be worth maintaining.
`extract` is an ANSI standard expression and `date_part` is its PostgreSQL-specific alias function. The currently supported extract fields are fully borrowed from PostgreSQL.
Looking at other systems like Presto/Hive, they don't support those ambiguous fields either.
e.g. Hive 2.2.x also took it from PostgreSQL but without introducing those ambiguous fields: https://issues.apache.org/jira/secure/attachment/12828349/HIVE-14579
e.g. presto
```sql
presto> select extract(quater from interval '10-0' year to month);
Query 20200417_094723_00020_m8xq4 failed: line 1:8: Invalid EXTRACT field: quater
select extract(quater from interval '10-0' year to month)
presto> select extract(decade from interval '10-0' year to month);
Query 20200417_094737_00021_m8xq4 failed: line 1:8: Invalid EXTRACT field: decade
select extract(decade from interval '10-0' year to month)
```
### Does this PR introduce any user-facing change?
Yes, as we already have preview versions, this PR removes support for extracting `millennium/century/decade/quarter/week/millisecond/microseconds` and `epoch` from intervals with the `date_part` function.
### How was this patch tested?
Removed some now-unused tests.
Closes#28242 from yaooqinn/SPARK-31469.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
I checked all the configs of Spark again and found some new commits that did not add version information.
**Test.scala**
Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.testing.skipValidateCores | 3.1.0 | SPARK-29154 | 474b1bb5c2bce2f83c4dd8e19b9b7c5b3aebd6c4#diff-8b4ea8f3b0cc1e7ce7e943de1abbb165 |
**SQL**
Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.sql.legacy.integerGroupingId | 3.1.0 | SPARK-30279 | 71c73d58f6e88d2558ed2e696897767d93bac60f#diff-9a6b543db706f1a90f790783d6930a13 |
The two configs only exist in the master branch.
### Why are the changes needed?
Supplement version information.
### Does this PR introduce any user-facing change?
'No'.
### How was this patch tested?
Jenkins test.
Closes#28233 from beliefer/sql-conf-version-legacy-integerGroupingId.
Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Make `UnsafeKVExternalSorter` / `VariableLengthRowBasedKeyValueBatch` also respect `UnsafeAlignedOffset` when reading records, and update some out-of-date comments.
### Why are the changes needed?
Since `BytesToBytesMap` respects `UnsafeAlignedOffset` when writing records, `UnsafeKVExternalSorter` should also respect `UnsafeAlignedOffset` when reading records from `BytesToBytesMap`; otherwise it will cause data correctness issues.
Unlike `UnsafeKVExternalSorter`, which may read records from `BytesToBytesMap`, `VariableLengthRowBasedKeyValueBatch` writes and reads records by itself. Thus, similar to #22053 and the [comment](https://github.com/apache/spark/pull/22053#issuecomment-411975239) there, the fix for `VariableLengthRowBasedKeyValueBatch` is more likely an improvement for the support of the SPARC platform.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Manually tested `HashAggregationQueryWithControlledFallbackSuite` with `UAO_SIZE=8` to simulate SPARC platform. And tests only pass with this fix.
Closes#28195 from Ngone51/fix_uao.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR moves the `ExpressionEncoder.toRow` and `ExpressionEncoder.fromRow` functions into their own function objects (`ExpressionEncoder.Serializer` & `ExpressionEncoder.Deserializer`). This effectively makes `ExpressionEncoder` stateless, thread-safe and (more) reusable. The function objects are not thread-safe; however, they are documented as such and should be used in a more limited scope (making it easier to reason about thread safety).
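A hedged usage sketch, assuming the function objects are obtained via factory methods on the encoder (e.g. `createSerializer`/`createDeserializer`):
```scala
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

case class Person(name: String, age: Int)

// The encoder itself is stateless and safe to share across threads.
val encoder = ExpressionEncoder[Person]()

// The serializer/deserializer function objects are NOT thread-safe:
// create one per thread/task instead of sharing a single instance.
val toRow = encoder.createSerializer()
val internalRow = toRow(Person("Ann", 30))

val fromRow = encoder.resolveAndBind().createDeserializer()
val person = fromRow(internalRow) // Person("Ann", 30)
```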
### Why are the changes needed?
ExpressionEncoders are not thread-safe. We had various (nasty) bugs because of this.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#28223 from hvanhovell/SPARK-31450.
Authored-by: herman <herman@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
SPARK-21870 (cb0cddf#diff-06dc5de6163687b7810aa76e7e152a76R146-R149) caused a significant performance regression in cases where the source code size is fairly large, as `HashAggregateExec` uses `Block.length` to decide on splitting the code. The change in `length` makes sense, as comments and extra new lines shouldn't be taken into account when deciding on splitting, but the regular-expression-based approach is very slow and adds a big relative overhead in cases where the execution is quick (small number of rows).
This PR:
- restores `Block.length` to its original form
- places comments in `HashAggregateExec` with `CodegenContext.registerComment` so that they appear only when comments are enabled (`spark.sql.codegen.comments=true`)
Before this PR:
```
deeply nested struct field r/w: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
250 deep x 400 rows (read in-mem) 1137 1143 8 0.1 11368.3 0.0X
```
After this PR:
```
deeply nested struct field r/w: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
250 deep x 400 rows (read in-mem) 167 180 7 0.6 1674.3 0.1X
```
### Why are the changes needed?
To fix performance regression.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing UTs.
Closes#28083 from peter-toth/SPARK-30564-use-comment-placeholders.
Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
In the PR, I propose to change the rebasing of non-existent timestamps in the hybrid calendar (Julian + Gregorian since 1582-10-15) in the range [1582-10-05, 1582-10-15). Non-existent timestamps in the range are shifted to the first valid date in the hybrid calendar - 1582-10-15. The changes affect only `rebaseGregorianToJulianMicros()` because the reverse rebasing from hybrid timestamps to Proleptic Gregorian timestamps does not have such a problem.
The shifting affects only the date part of timestamps while keeping the time part as is. For example:
```
1582-10-10 00:11:22.334455 -> 1582-10-15 00:11:22.334455
```
### Why are the changes needed?
Currently, non-existent timestamps are shifted by the standard difference between the Julian and Gregorian calendars on 1582-10-04, for example 1582-10-14 00:00:00 -> 1582-10-24 00:00:00. That contradicts the shifting of non-existent dates in other cases, for example:
```
scala> sql("select timestamp'1990-9-31 12:12:12'").show
+----------------------------------+
|TIMESTAMP('1990-10-01 12:12:12.0')|
+----------------------------------+
| 1990-10-01 12:12:12|
+----------------------------------+
```
### Does this PR introduce any user-facing change?
Yes, this impacts the conversion of Spark SQL `TIMESTAMP` values to external timestamps based on the non-Proleptic Gregorian calendar. For example, while saving the timestamp 1582-10-14 12:13:14 to ORC files, it will be shifted to the next valid date, 1582-10-15 12:13:14.
### How was this patch tested?
- Added tests to `RebaseDateTimeSuite` and to `OrcSourceSuite`
- By existing test suites `DateTimeUtilsSuite`, `DateFunctionsSuite`, `DateExpressionsSuite`, `CollectionExpressionsSuite`, `ParquetIOSuite`.
Closes#28227 from MaxGekk/fix-not-exist-timestamps.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose to change the rebasing of non-existent dates in the hybrid calendar (Julian + Gregorian since 1582-10-15) in the range (1582-10-04, 1582-10-15). Non-existent dates in the range are shifted to the first valid date in the hybrid calendar - 1582-10-15. The changes affect only `rebaseGregorianToJulianDays()` because the reverse rebasing from hybrid dates to Proleptic Gregorian dates does not have such a problem.
### Why are the changes needed?
Currently, non-existent dates are shifted by the standard difference between the Julian and Gregorian calendars on 1582-10-04, for example 1582-10-14 -> 1582-10-24. That contradicts the shifting of non-existent dates in other cases, for example:
```
scala> sql("select date'1990-9-31'").show
+-----------------+
|DATE '1990-10-01'|
+-----------------+
| 1990-10-01|
+-----------------+
```
### Does this PR introduce any user-facing change?
Yes, this impacts the conversion of Spark SQL `DATE` values to external dates based on the non-Proleptic Gregorian calendar. For example, while saving the date 1582-10-14 to ORC files, it will be shifted to the next valid date, 1582-10-15.
### How was this patch tested?
- Added tests to `RebaseDateTimeSuite` and to `OrcSourceSuite`
- By existing test suites `DateTimeUtilsSuite`, `DateFunctionsSuite`, `DateExpressionsSuite`, `CollectionExpressionsSuite`, `ParquetIOSuite`.
Closes#28225 from MaxGekk/fix-not-exist-dates.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Optimise the `toJavaDate()` method of `DateTimeUtils` by:
1. Re-using `rebaseGregorianToJulianDays` optimised by #28067
2. Creating `java.sql.Date` instances from milliseconds in UTC since the epoch instead of from date-time fields. This allows us to avoid "normalization" inside `java.sql.Date`.
Also, a new benchmark for collecting dates is added to `DateTimeBenchmark`.
### Why are the changes needed?
The changes fix the performance regression of collecting `DATE` values compared to Spark 2.4 (see `DateTimeBenchmark` in https://github.com/MaxGekk/spark/pull/27):
Spark 2.4.6-SNAPSHOT:
```
To/from Java's date-time: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
From java.sql.Date 559 603 38 8.9 111.8 1.0X
Collect dates 2306 3221 1558 2.2 461.1 0.2X
```
Before the changes:
```
To/from Java's date-time: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
From java.sql.Date 1052 1130 73 4.8 210.3 1.0X
Collect dates 3251 4943 1624 1.5 650.2 0.3X
```
After:
```
To/from Java's date-time: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
From java.sql.Date 416 419 3 12.0 83.2 1.0X
Collect dates 1928 2759 1180 2.6 385.6 0.2X
```
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
- By existing tests suites, in particular, `DateTimeUtilsSuite`, `RebaseDateTimeSuite`, `DateFunctionsSuite`, `DateExpressionsSuite`.
- Re-run `DateTimeBenchmark` in the environment:
| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 |
Closes#28212 from MaxGekk/optimize-toJavaDate.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>