### What changes were proposed in this pull request?
Handle type coercion when resolving V2 function. In particular:
- prior to evaluating function arguments, insert cast whenever the argument type doesn't match the expected input type.
- use `BoundFunction.inputTypes()` to lookup magic method for scalar function
### Why are the changes needed?
For V2 functions, the actual argument types should not necessarily match those of the input types, and Spark should handle type coercion whenever it is needed.
### Does this PR introduce _any_ user-facing change?
Yes. Now V2 function resolution should be able to handle type coercion properly.
### How was this patch tested?
Added a few new tests.
Closes#32764 from sunchao/SPARK-35390.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR group exception messages in `sql/hive/src/main/scala/org/apache/spark/sql/hive/client`.
### Why are the changes needed?
It will largely help with standardization of error messages and its maintenance.
### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.
### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.
Closes#32763 from beliefer/SPARK-35058.
Lead-authored-by: beliefer <beliefer@163.com>
Co-authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose to extend Spark SQL API to accept `java.time.LocalDateTime` as an external type of recently added new Catalyst type - `TimestampWithoutTZ`. The Java class `java.time.LocalDateTime` has a similar semantic to ANSI SQL timestamp without timezone type, and it is the most suitable to be an external type for `TimestampWithoutTZType`. In more details:
* Added `TimestampWithoutTZConverter` which converts java.time.LocalDateTime instances to/from internal representation of the Catalyst type `TimestampWithoutTZType` (to Long type). The `TimestampWithoutTZConverter` object uses new methods of DateTimeUtils:
* localDateTimeToMicros() converts the input date time to the total length in microseconds.
* microsToLocalDateTime() obtains a java.time.LocalDateTime
* Support new type `TimestampWithoutTZType` in RowEncoder via the methods createDeserializerForLocalDateTime() and createSerializerForLocalDateTime().
* Extended the Literal API to construct literals from `java.time.LocalDateTime` instances.
### Why are the changes needed?
To allow users parallelization of `java.time.LocalDateTime` collections, and construct timestamp without time zone columns. Also to collect such columns back to the driver side.
### Does this PR introduce _any_ user-facing change?
The PR extends existing functionality. So, users can parallelize instances of the java.time.LocalDateTime class and collect them back.
```
scala> val ds = Seq(java.time.LocalDateTime.parse("1970-01-01T00:00:00")).toDS
ds: org.apache.spark.sql.Dataset[java.time.LocalDateTime] = [value: timestampwithouttz]
scala> ds.collect()
res0: Array[java.time.LocalDateTime] = Array(1970-01-01T00:00)
```
### How was this patch tested?
New unit tests
Closes#32814 from gengliangwang/LocalDateTime.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Fix Scala doc for removed parameters for `InvokeLike.invoke`.
### Why are the changes needed?
#32532 forgot to update the Scala doc after removing 2 parameters for `InvokeLike.invoke`. This fixes it.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
N/A
Closes#32827 from sunchao/SPARK-35384-followup.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This patch introduces a new option to specify the minimum number of offsets to read per trigger i.e. minOffsetsPerTrigger and maxTriggerDelay to avoid the infinite wait for the trigger.
This new option will allow skipping trigger/batch when the number of records available in Kafka is low. This is a very useful feature in cases where we have a sudden burst of data at certain intervals in a day and data volume is low for the rest of the day.
'maxTriggerDelay' option will help to avoid cases of infinite delay in scheduling trigger and the trigger will happen irrespective of records available if the maxTriggerDelay time exceeds the last trigger. It would be an optional parameter with a default value of 15 mins. This option will be only applicable if minOffsetsPerTrigger is set.
minOffsetsPerTrigger option would be optional of course, but once specified it would take precedence over maxOffestsPerTrigger which will be honored only after minOffsetsPerTrigger is satisfied.
### Why are the changes needed?
There are many scenarios where there is a sudden burst of data at certain intervals in a day and data volume is low for the rest of the day. Tunning such jobs is difficult as decreasing trigger processing time increasing the number of batches and hence cluster resource usage and adds to small file issues. Increasing trigger processing time adds consumer lag. This patch tries to address this issue.
### How was this patch tested?
This patch was tested by adding test cases as well as manually on a cluster where the job was running for a full one day with a data burst happening once a day.
Here is the picture of databurst and hence consumer lag:
<img width="1198" alt="Screenshot 2021-04-29 at 11 39 35 PM" src="https://user-images.githubusercontent.com/1044003/116997587-9b2ab180-acfa-11eb-91fd-524802ce3316.png">
This is how the job behaved at burst time running every 4.5 mins (which is the specified trigger time):
<img width="1154" alt="Burst Time" src="https://user-images.githubusercontent.com/1044003/116997919-12f8dc00-acfb-11eb-9b0a-98387fc67560.png">
This is job behavior during the non-burst time where it is skipping 2 to 3 triggers and running once every 9 to 13.5 mins
<img width="1154" alt="Non Burst Time" src="https://user-images.githubusercontent.com/1044003/116998244-8b5f9d00-acfb-11eb-8340-33d47149ef81.png">
Here are some more stats from the two-run i.e. one normal run and the other with minOffsetsperTrigger set:
| Run | Data Size | Number of Batch Runs | Number of Files |
| ------------- | ------------- |------------- |------------- |
| Normal Run | 54.2 GB | 320 | 21968 |
| Run with minOffsetsperTrigger | 54.2 GB | 120 | 12104 |
Closes#32653 from satishgopalani/SPARK-35312.
Authored-by: Satish Gopalani <satish.gopalani@pubmatic.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
Cleanup unreachable code.
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existed test.
Closes#32791 from pan3793/cleanup.
Authored-by: Cheng Pan <379377944@qq.com>
Signed-off-by: Kent Yao <yao@apache.org>
### What changes were proposed in this pull request?
Extend Catalyst's type system by a new type that conforms to the SQL standard (see SQL:2016, section 4.6.2): TimestampWithoutTZType represents the timestamp without time zone type
### Why are the changes needed?
Spark SQL today supports the TIMESTAMP data type. However the semantics provided actually match TIMESTAMP WITH LOCAL TIMEZONE as defined by Oracle. Timestamps embedded in a SQL query or passed through JDBC are presumed to be in session local timezone and cast to UTC before being processed.
These are desirable semantics in many cases, such as when dealing with calendars.
In many (more) other cases, such as when dealing with log files it is desirable that the provided timestamps not be altered.
SQL users expect that they can model either behavior and do so by using TIMESTAMP WITHOUT TIME ZONE for time zone insensitive data and TIMESTAMP WITH LOCAL TIME ZONE for time zone sensitive data.
Most traditional RDBMS map TIMESTAMP to TIMESTAMP WITHOUT TIME ZONE and will be surprised to see TIMESTAMP WITH LOCAL TIME ZONE, a feature that does not exist in the standard.
In this new feature, we will introduce TIMESTAMP WITH LOCAL TIMEZONE to describe the existing timestamp type and add TIMESTAMP WITHOUT TIME ZONE for standard semantic.
Using these two types will provide clarity.
This is a starting PR. See more details in https://issues.apache.org/jira/browse/SPARK-35662
### Does this PR introduce _any_ user-facing change?
Yes, a new data type for Timestamp without time zone type. It is still in development.
### How was this patch tested?
Unit test
Closes#32802 from gengliangwang/TimestampNTZType.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
It's a long-standing bug that we forgot to resolve `UnresolvedAlias` in `CollectMetrics`. It's a bit hard to trigger this bug before 3.2 as most likely people won't create `UnresolvedAlias` when calling `Dataset.observe`. However things have been changed after https://github.com/apache/spark/pull/30974
This PR proposes to handle `CollectMetrics` in the rule `ResolveAliases`.
### Why are the changes needed?
bug fix
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
updated test
Closes#32803 from cloud-fan/minor.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Optimizes the retrieval of approximate quantiles for an array of percentiles.
* Adds an overload for QuantileSummaries.query that accepts an array of percentiles and optimizes the computation to do a single pass over the sketch and avoid redundant computation.
* Modifies the ApproximatePercentiles operator to call into the new method.
All formatting changes are the result of running ./dev/scalafmt
### Why are the changes needed?
The existing implementation does repeated calls per input percentile resulting in redundant computation.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added unit tests for the new method.
Closes#32700 from alkispoly-db/spark_35558_approx_quants_array.
Authored-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
A followup for 345d35ed1a, in this PR we support CURRENT_USER without tailing parentheses in default mode. And for ANSI mode, we can only use CURRENT_USER without tailing parentheses because it is a reserved keyword that cannot be used as a function name
### Why are the changes needed?
1. make it the same as current_date/current_timestamp
2. better ANSI compliance
### Does this PR introduce _any_ user-facing change?
no, just a followup
### How was this patch tested?
new tests
Closes#32770 from yaooqinn/SPARK-21957-F.
Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add database if exists check in `SeesionCatalog`
### Why are the changes needed?
Curently execute `drop database test` will throw unfriendly error msg.
```
Error in query: org.apache.hadoop.hive.metastore.api.NoSuchObjectException: test
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.metastore.api.NoSuchObjectException: test
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:112)
at org.apache.spark.sql.hive.HiveExternalCatalog.dropDatabase(HiveExternalCatalog.scala:200)
at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.dropDatabase(ExternalCatalogWithListener.scala:53)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.dropDatabase(SessionCatalog.scala:273)
at org.apache.spark.sql.execution.command.DropDatabaseCommand.run(ddl.scala:111)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:228)
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3707)
```
### Does this PR introduce _any_ user-facing change?
Yes, more cleaner error msg.
### How was this patch tested?
Add test.
Closes#32768 from ulysses-you/SPARK-35629.
Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Sets `references` for `NamedLambdaVariable` and `LambdaFunction`.
| Expression | NamedLambdaVariable | LambdaFunction |
| --- | --- | --- |
| References before | None | All function references |
| References after | self.toAttribute | Function references minus arguments' references |
In `NestedColumnAliasing`, this means that `ExtractValue(ExtractValue(attr, lv: NamedLambdaVariable), ...)` now references both `attr` and `lv`, rather than just `attr`. As a result, it will not be included in the nested column references.
### Why are the changes needed?
Before, lambda key was referenced outside of lambda function.
#### Example 1
Before:
```
Project [transform(keys#0, lambdafunction(_extract_v1#0, lambda key#0, false)) AS a#0]
+- 'Join Cross
:- Project [kvs#0[lambda key#0].v1 AS _extract_v1#0]
: +- LocalRelation <empty>, [kvs#0]
+- LocalRelation <empty>, [keys#0]
```
After:
```
Project [transform(keys#418, lambdafunction(kvs#417[lambda key#420].v1, lambda key#420, false)) AS a#419]
+- Join Cross
:- LocalRelation <empty>, [kvs#417]
+- LocalRelation <empty>, [keys#418]
```
#### Example 2
Before:
```
Project [transform(keys#0, lambdafunction(kvs#0[lambda key#0].v1, lambda key#0, false)) AS a#0]
+- GlobalLimit 5
+- LocalLimit 5
+- Project [keys#0, _extract_v1#0 AS _extract_v1#0]
+- GlobalLimit 5
+- LocalLimit 5
+- Project [kvs#0[lambda key#0].v1 AS _extract_v1#0, keys#0]
+- LocalRelation <empty>, [kvs#0, keys#0]
```
After:
```
Project [transform(keys#428, lambdafunction(kvs#427[lambda key#430].v1, lambda key#430, false)) AS a#429]
+- GlobalLimit 5
+- LocalLimit 5
+- Project [keys#428, kvs#427]
+- GlobalLimit 5
+- LocalLimit 5
+- LocalRelation <empty>, [kvs#427, keys#428]
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Scala unit tests for the examples above
Closes#32773 from karenfeng/SPARK-35636.
Authored-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This pr add in/inset predicate support for `UnwrapCastInBinaryComparison`.
Current implement doesn't pushdown filters for `In/InSet` which contains `Cast`.
For instance:
```scala
spark.range(50).selectExpr("cast(id as int) as id").write.mode("overwrite").parquet("/tmp/parquet/t1")
spark.read.parquet("/tmp/parquet/t1").where("id in (1L, 2L, 4L)").explain
```
before this pr:
```
== Physical Plan ==
*(1) Filter cast(id#5 as bigint) IN (1,2,4)
+- *(1) ColumnarToRow
+- FileScan parquet [id#5] Batched: true, DataFilters: [cast(id#5 as bigint) IN (1,2,4)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/tmp/parquet/t1], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int>
```
after this pr:
```
== Physical Plan ==
*(1) Filter id#95 IN (1,2,4)
+- *(1) ColumnarToRow
+- FileScan parquet [id#95] Batched: true, DataFilters: [id#95 IN (1,2,4)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/tmp/parquet/t1], PartitionFilters: [], PushedFilters: [In(id, [1,2,4])], ReadSchema: struct<id:int>
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New test.
Closes#32488 from cfmcgrady/SPARK-35316.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR removes `canPlanAsBroadcastHashJoin` check in `EliminateOuterJoin.
### Why are the changes needed?
We can always removes outer join if it only has DISTINCT on streamed side.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test.
Closes#32744 from wangyum/SPARK-34808-2.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR group exception messages in `sql/hive/src/main/scala/org/apache/spark/sql/hive/execution`.
### Why are the changes needed?
It will largely help with standardization of error messages and its maintenance.
### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.
### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.
Closes#32694 from beliefer/SPARK-35059.
Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Currently, we do not have a suitable definition of the `user` concept in Spark. We only have a `sparkUser` app widely but do not support identify or retrieve the user information from a session in STS or a runtime query execution.
`current_user()` is very popular and supported by plenty of other modern or old school databases, and also ANSI compliant.
This PR add `current_user()` as a SQL function. And, they are the same. In this PR, we add these functions w/o ambiguity.
1. For a normal single-threaded Spark application, clearly the `sparkUser` is always equivalent to `current_user()` .
2. For a multi-threaded Spark application, e.g. Spark thrift server, we use a `ThreadLocal` variable to store the client-side user(after authenticated) before running the query and retrieve it in the parser.
### Why are the changes needed?
`current_user()` is very popular and supported by plenty of other modern or old school databases, and also ANSI compliant.
### Does this PR introduce _any_ user-facing change?
yes, added `current_user()` as a SQL function
### How was this patch tested?
new tests in thrift server and sql/catalyst
Closes#32718 from yaooqinn/SPARK-21957.
Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
A test case of AdaptiveQueryExecSuite becomes flaky since there are too many debug logs in RootLogger:
https://github.com/Yikun/spark/runs/2715222392?check_suite_focus=truehttps://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139125/testReport/
To fix it, I suggest supporting multiple loggers in the testing method withLogAppender. So that the LogAppender gets clean target log outputs.
### Why are the changes needed?
Fix a flaky test case.
Also, reduce unnecessary memory cost in tests.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test
Closes#32725 from gengliangwang/fixFlakyLogAppender.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
In the PR, I propose to support special datetime values introduced by #25708 and by #25716 only in typed literals, and don't recognize them in parsing strings to dates/timestamps. The following string values are supported only in typed timestamp literals:
- `epoch [zoneId]` - `1970-01-01 00:00:00+00 (Unix system time zero)`
- `today [zoneId]` - midnight today.
- `yesterday [zoneId]` - midnight yesterday
- `tomorrow [zoneId]` - midnight tomorrow
- `now` - current query start time.
For example:
```sql
spark-sql> SELECT timestamp 'tomorrow';
2019-09-07 00:00:00
```
Similarly, the following special date values are supported only in typed date literals:
- `epoch [zoneId]` - `1970-01-01`
- `today [zoneId]` - the current date in the time zone specified by `spark.sql.session.timeZone`.
- `yesterday [zoneId]` - the current date -1
- `tomorrow [zoneId]` - the current date + 1
- `now` - the date of running the current query. It has the same notion as `today`.
For example:
```sql
spark-sql> SELECT date 'tomorrow' - date 'yesterday';
2
```
### Why are the changes needed?
In the current implementation, Spark supports the special date/timestamp value in any input strings casted to dates/timestamps that leads to the following problems:
- If executors have different system time, the result is inconsistent, and random. Column values depend on where the conversions were performed.
- The special values play the role of distributed non-deterministic functions though users might think of the values as constants.
### Does this PR introduce _any_ user-facing change?
Yes but the probability should be small.
### How was this patch tested?
By running existing test suites:
```
$ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z interval.sql"
$ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z date.sql"
$ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z timestamp.sql"
$ build/sbt "test:testOnly *DateTimeUtilsSuite"
```
Closes#32714 from MaxGekk/remove-datetime-special-values.
Lead-authored-by: Max Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This PR adds a unit test to show a bug in the latest janino version which fails to compile valid Java code. Unfortunately, I can't share the exact query that can trigger this bug (includes some custom expressions), but this pattern is not very uncommon and I believe can be triggered by some real queries.
A follow-up is needed before the 3.2 release, to either fix this bug in janino, or revert the janino version upgrade, or work around it in Spark.
### Why are the changes needed?
make it easy for people to debug janino, as I'm not a janino expert.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
N/A
Closes#32716 from cloud-fan/janino.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Currently, the results of following SQL queries are not redacted:
```
SET [KEY];
SET;
```
For example:
```
scala> spark.sql("set javax.jdo.option.ConnectionPassword=123456").show()
+--------------------+------+
| key| value|
+--------------------+------+
|javax.jdo.option....|123456|
+--------------------+------+
scala> spark.sql("set javax.jdo.option.ConnectionPassword").show()
+--------------------+------+
| key| value|
+--------------------+------+
|javax.jdo.option....|123456|
+--------------------+------+
scala> spark.sql("set").show()
+--------------------+--------------------+
| key| value|
+--------------------+--------------------+
|javax.jdo.option....| 123456|
```
We should hide the sensitive information and redact the query output.
### Why are the changes needed?
Security.
### Does this PR introduce _any_ user-facing change?
Yes, the sensitive information in the output of Set commands are redacted
### How was this patch tested?
Unit test
Closes#32712 from gengliangwang/redactSet.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Handle Currying Product while serializing TreeNode to JSON. While processing [Product](https://github.com/apache/spark/blob/v3.1.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L820), we may get an assert error for cases like Currying Product because of the mismatch of sizes between field name and field values.
Fallback to use reflection to get all the values for constructor parameters when we meet such cases.
### Why are the changes needed?
Avoid throwing error while serializing TreeNode to JSON, try to output as much information as possible.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
New UT case added.
Closes#32713 from ivoson/SPARK-35411-followup.
Authored-by: Tengfei Huang <tengfei.h@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This pr add new rule to removes outer join if it only has distinct on streamed side. For example:
```scala
spark.range(200L).selectExpr("id AS a").createTempView("t1")
spark.range(300L).selectExpr("id AS b").createTempView("t2")
spark.sql("SELECT DISTINCT a FROM t1 LEFT JOIN t2 ON a = b").explain(true)
```
Before this pr:
```
== Optimized Logical Plan ==
Aggregate [a#2L], [a#2L]
+- Project [a#2L]
+- Join LeftOuter, (a#2L = b#6L)
:- Project [id#0L AS a#2L]
: +- Range (0, 200, step=1, splits=Some(2))
+- Project [id#4L AS b#6L]
+- Range (0, 300, step=1, splits=Some(2))
```
After this pr:
```
== Optimized Logical Plan ==
Aggregate [a#2L], [a#2L]
+- Project [id#0L AS a#2L]
+- Range (0, 200, step=1, splits=Some(2))
```
### Why are the changes needed?
Improve query performance. [DB2](https://www.ibm.com/docs/en/db2-for-zos/11?topic=manipulation-how-db2-simplifies-join-operations) support this feature:
![image](https://user-images.githubusercontent.com/5399861/119594277-0d7c4680-be0e-11eb-8bd4-366d8c4639f0.png)
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test.
Closes#31908 from wangyum/SPARK-34808.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
### What changes were proposed in this pull request?
This PR refactors `SubqueryExpression` class. It removes the children field from SubqueryExpression's constructor and adds `outerAttrs` and `joinCond`.
### Why are the changes needed?
Currently, the children field of a subquery expression is used to store both collected outer references in the subquery plan and join conditions after correlated predicates are pulled up.
For example:
`SELECT (SELECT max(c1) FROM t1 WHERE t1.c1 = t2.c1) FROM t2`
During the analysis phase, outer references in the subquery are stored in the children field: `scalar-subquery [t2.c1]`, but after the optimizer rule `PullupCorrelatedPredicates`, the children field will be used to store the join conditions, which contain both the inner and the outer references: `scalar-subquery [t1.c1 = t2.c1]`. This is why the references of SubqueryExpression excludes the inner plan's output:
29ed1a2de4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala (L68-L69)
This can be confusing and error-prone. The references for a subquery expression should always be defined as outer attribute references.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#32687 from allisonwang-db/refactor-subquery-expr.
Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Added the following TreePattern enums:
- EXCHANGE
- IN_SUBQUERY_EXEC
- UPDATE_FIELDS
Migrated `transformAllExpressions` call sites to use `transformAllExpressionsWithPruning`
### Why are the changes needed?
Reduce the number of tree traversals and hence improve the query compilation latency.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Perf diff:
Rule name | Total Time (baseline) | Total Time (experiment) | experiment/baseline
OptimizeUpdateFields | 54646396 | 27444424 | 0.5
ReplaceUpdateFieldsExpression | 24694303 | 2087517 | 0.08
Closes#32643 from sigmod/all_expressions.
Authored-by: Yingyi Bu <yingyi.bu@databricks.com>
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>
### What changes were proposed in this pull request?
This PR fixes a build error with Scala 2.13 on GA.
#32301 seems to bring this error.
### Why are the changes needed?
To recover CI.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA
Closes#32696 from sarutak/followup-SPARK-35194.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
### What changes were proposed in this pull request?
Refactors `NestedColumnAliasing` and `GeneratorNestedColumnAliasing` for readability.
### Why are the changes needed?
Improves readability for future maintenance.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#32301 from karenfeng/refactor-nested-column-aliasing.
Authored-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Spark conv function is from MySQL and it's better to follow the MySQL behavior. MySQL returns the max unsigned long if the input string is too big, and Spark should follow it.
However, seems Spark has different behavior in two cases:
MySQL allows leading spaces but Spark does not.
If the input string is way too long, Spark fails with ArrayIndexOutOfBoundException
This patch now help conv follow behavior in those two cases
conv allows leading spaces
conv will return the max unsigned long when the input string is way too long
### Why are the changes needed?
fixing it to match the behavior of conv function to the (almost) only one reference of another DBMS, MySQL
### Does this PR introduce _any_ user-facing change?
Yes, as pointed out above
### How was this patch tested?
Add test
Closes#32684 from dgd-contributor/SPARK-33428.
Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR group exception messages in `sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver`.
### Why are the changes needed?
It will largely help with standardization of error messages and its maintenance.
### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.
### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.
Closes#32646 from beliefer/SPARK-35057.
Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
The main code change is:
* Change rule `DemoteBroadcastHashJoin` to `DynamicJoinSelection` and add shuffle hash join selection code.
* Specify a join strategy hint `SHUFFLE_HASH` if AQE think a join can be converted to SHJ.
* Skip `preferSortMerge` config check in AQE side if a join can be converted to SHJ.
### Why are the changes needed?
Use AQE runtime statistics to decide if we can use shuffled hash join instead of sort merge join. Currently, the formula of shuffled hash join selection dose not work due to the dymanic shuffle partition number.
Add a new config spark.sql.adaptive.shuffledHashJoinLocalMapThreshold to decide if join can be converted to shuffled hash join safely.
### Does this PR introduce _any_ user-facing change?
Yes, add a new config.
### How was this patch tested?
Add test.
Closes#32550 from ulysses-you/SPARK-35282-2.
Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add the function type, such as "scala_udf", "python_udf", "java_udf", "hive", "built-in" to the `ExpressionInfo` for UDF.
### Why are the changes needed?
Make the `ExpressionInfo` of UDF more meaningful
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
existing and newly added UT
Closes#32587 from linhongliu-db/udf-language.
Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
* remove `EliminateUnnecessaryJoin`, using `AQEPropagateEmptyRelation` instead.
* eliminate join, aggregate, limit, repartition, sort, generate which is beneficial.
### Why are the changes needed?
Make `EliminateUnnecessaryJoin` available with more case.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Add test.
Closes#32602 from ulysses-you/SPARK-35455.
Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Addressed the dongjoon-hyun comments on the previous PR #30018.
Extended the `RemoveRedundantAggregates` rule to remove redundant aggregations in even more queries. For example in
```
dataset
.dropDuplicates()
.groupBy('a)
.agg(max('b))
```
the `dropDuplicates` is not needed, because the result on `max` does not depend on duplicate values.
### Why are the changes needed?
Improve performance.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
UT
Closes#31914 from tanelk/SPARK-33122_redundant_aggs_followup.
Lead-authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Co-authored-by: Tanel Kiis <tanel.kiis@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Use `SpecificInternalRow` instead of `GenericInternalRow` to avoid boxing / unboxing cost.
### Why are the changes needed?
Since it doesn't know the input row schema, `GenericInternalRow` potentially need to apply boxing for input arguments. It's better to use `SpecificInternalRow` instead since we know input data types.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#32647 from sunchao/specific-input-row.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR fixes a bug with subexpression elimination for CaseWhen statements. https://github.com/apache/spark/pull/30245 added support for creating subexpressions that are present in all branches of conditional statements. However, for a statement to be in "all branches" of a CaseWhen statement, it must also be in the elseValue.
### Why are the changes needed?
Fix a bug where a subexpression can be created and run for branches of a conditional that don't pass. This can cause issues especially with a UDF in a branch that gets executed assuming the condition is true.
### Does this PR introduce _any_ user-facing change?
Yes, fixes a potential bug where a UDF could be eagerly executed even though it might expect to have already passed some form of validation. For example:
```
val col = when($"id" < 0, myUdf($"id"))
spark.range(1).select(when(col > 0, col)).show()
```
`myUdf($"id")` is considered a subexpression and eagerly evaluated, because it is pulled out as a common expression from both executions of the when clause, but if `id >= 0` it should never actually be run.
### How was this patch tested?
Updated existing test with new case.
Closes#32595 from Kimahriman/bug-case-subexpr-elimination.
Authored-by: Adam Binford <adamq43@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
This patch fixes a bug when dealing with common expressions in conditional expressions such as `CaseWhen` during subexpression elimination.
For example, previously we find common expressions among conditions of `CaseWhen`, but children expressions are also counted into. We should not count these children expressions as common expressions.
### Why are the changes needed?
If the redundant children expressions are counted as common expressions too, they will be redundantly evaluated and miss the subexpression elimination opportunity.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added tests.
Closes#32559 from viirya/SPARK-35410.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR proposes to avoid wrapping if-else to the constant literals for `percentage` and `accuracy` in `percentile_approx`. They are expected to be literals (or foldable expressions).
Pivot works by two phrase aggregations, and it works with manipulating the input to `null` for non-matched values (pivot column and value).
Note that pivot supports an optimized version without such logic with changing input to `null` for some types (non-nested types basically). So the issue fixed by this PR is only for complex types.
```scala
val df = Seq(
("a", -1.0), ("a", 5.5), ("a", 2.5), ("b", 3.0), ("b", 5.2)).toDF("type", "value")
.groupBy().pivot("type", Seq("a", "b")).agg(
percentile_approx(col("value"), array(lit(0.5)), lit(10000)))
df.show()
```
**Before:**
```
org.apache.spark.sql.AnalysisException: cannot resolve 'percentile_approx((IF((type <=> CAST('a' AS STRING)), value, CAST(NULL AS DOUBLE))), (IF((type <=> CAST('a' AS STRING)), array(0.5D), NULL)), (IF((type <=> CAST('a' AS STRING)), 10000, CAST(NULL AS INT))))' due to data type mismatch: The accuracy or percentage provided must be a constant literal;
'Aggregate [percentile_approx(if ((type#7 <=> cast(a as string))) value#8 else cast(null as double), if ((type#7 <=> cast(a as string))) array(0.5) else cast(null as array<double>), if ((type#7 <=> cast(a as string))) 10000 else cast(null as int), 0, 0) AS a#16, percentile_approx(if ((type#7 <=> cast(b as string))) value#8 else cast(null as double), if ((type#7 <=> cast(b as string))) array(0.5) else cast(null as array<double>), if ((type#7 <=> cast(b as string))) 10000 else cast(null as int), 0, 0) AS b#18]
+- Project [_1#2 AS type#7, _2#3 AS value#8]
+- LocalRelation [_1#2, _2#3]
```
**After:**
```
+-----+-----+
| a| b|
+-----+-----+
|[2.5]|[3.0]|
+-----+-----+
```
### Why are the changes needed?
To make percentile_approx work with pivot as expected
### Does this PR introduce _any_ user-facing change?
Yes. It threw an exception but now it returns a correct result as shown above.
### How was this patch tested?
Manually tested and unit test was added.
Closes#32619 from HyukjinKwon/SPARK-35480.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This patch sorts equivalent expressions based on their child-parent relation.
### Why are the changes needed?
`EquivalentExpressions` maintains a map of equivalent expressions. It is `HashMap` now so the insertion order is not guaranteed to be preserved later. Subexpression elimination relies on retrieving subexpressions from the map. If there is child-parent relationships among the subexpressions, we want the child expressions come first than parent expressions, so we can replace child expressions in parent expressions with subexpression evaluation.
For example, we have two different expressions `Add(Literal(1), Literal(2))` and `Add(Literal(3), add)`.
Case 1: child subexpr comes first.
```scala
addExprTree(add)
addExprTree(Add(Literal(3), add))
addExprTree(Add(Literal(3), add))
```
Case 2: parent subexpr comes first. For this case, we need to sort equivalent expressions.
```
addExprTree(Add(Literal(3), add)) => We add `Add(Literal(3), add)` into the map first, then add `add` into the map
addExprTree(add)
addExprTree(Add(Literal(3), add))
```
As we are going to sort equivalent expressions at all, we don't need `LinkedHashMap` but just do sorting.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added tests.
Closes#32586 from viirya/use-listhashmap.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
### Why are the changes needed?
Fix scala compile error.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass GA
Closes#32617 from ulysses-you/scala2-13.
Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR group exception messages in `sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst`.
### Why are the changes needed?
It will largely help with standardization of error messages and its maintenance.
### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.
### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.
Closes#32478 from beliefer/SPARK-35063.
Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to format strings correctly for `PushedFilters`. For example, `explain()` for a query below prints `v in (array('a'))` as `PushedFilters: [In(v, [WrappedArray(a)])]`;
```
scala> sql("create table t (v array<string>) using parquet")
scala> sql("select * from t where v in (array('a'), null)").explain()
== Physical Plan ==
*(1) Filter v#4 IN ([a],null)
+- FileScan parquet default.t[v#4] Batched: false, DataFilters: [v#4 IN ([a],null)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/maropu/Repositories/spark/spark-3.1.1-bin-hadoop2.7/spark-warehouse/t], PartitionFilters: [], PushedFilters: [In(v, [WrappedArray(a),null])], ReadSchema: struct<v:array<string>>
```
This PR makes `explain()` print it as `PushedFilters: [In(v, [[a]])]`;
```
scala> sql("select * from t where v in (array('a'), null)").explain()
== Physical Plan ==
*(1) Filter v#4 IN ([a],null)
+- FileScan parquet default.t[v#4] Batched: false, DataFilters: [v#4 IN ([a],null)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/maropu/Repositories/spark/spark-3.1.1-bin-hadoop2.7/spark-warehouse/t], PartitionFilters: [], PushedFilters: [In(v, [[a],null])], ReadSchema: struct<v:array<string>>
```
NOTE: This PR includes a bugfix caused by #32577 (See the cloud-fan comment: https://github.com/apache/spark/pull/32577/files#r636108150).
### Why are the changes needed?
To improve explain strings.
### Does this PR introduce _any_ user-facing change?
Yes, this PR improves the explain strings for pushed-down filters.
### How was this patch tested?
Added tests in `SQLQueryTestSuite`.
Closes#32615 from maropu/ExplainPartitionFilters.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR reduces the execution time of `DeduplicateRelations` by:
1) use `Set` instead `Seq` to check duplicate relations
2) avoid plan output traverse and attribute rewrites when there are no changes in the children plan
### Why are the changes needed?
Rule `DeduplicateRelations` is slow.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Run `TPCDSQuerySuite` and checked the run time of `DeduplicateRelations`. The time has been reduced by 77.9% after this PR.
Closes#32590 from Ngone51/improve-dedup.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
### What changes were proposed in this pull request?
CTAS with location clause acts as an insert overwrite. This can cause problems when there are subdirectories within a location directory.
This causes some users to accidentally wipe out directories with very important data. We should not allow CTAS with location to a non-empty directory.
### Why are the changes needed?
Hive already handled this scenario: HIVE-11319
Steps to reproduce:
```scala
sql("""create external table `demo_CTAS`( `comment` string) PARTITIONED BY (`col1` string, `col2` string) STORED AS parquet location '/tmp/u1/demo_CTAS'""")
sql("""INSERT OVERWRITE TABLE demo_CTAS partition (col1='1',col2='1') VALUES ('abc')""")
sql("select* from demo_CTAS").show
sql("""create table ctas1 location '/tmp/u2/ctas1' as select * from demo_CTAS""")
sql("select* from ctas1").show
sql("""create table ctas2 location '/tmp/u2' as select * from demo_CTAS""")
```
Before the fix: Both create table operations will succeed. But values in table ctas1 will be replaced by ctas2 accidentally.
After the fix: `create table ctas2...` will throw `AnalysisException`:
```
org.apache.spark.sql.AnalysisException: CREATE-TABLE-AS-SELECT cannot create table with location to a non-empty directory /tmp/u2 . To allow overwriting the existing non-empty directory, set 'spark.sql.legacy.allowNonEmptyLocationInCTAS' to true.
```
### Does this PR introduce _any_ user-facing change?
Yes, if the location directory is not empty, CTAS with location will throw AnalysisException
```
sql("""create table ctas2 location '/tmp/u2' as select * from demo_CTAS""")
```
```
org.apache.spark.sql.AnalysisException: CREATE-TABLE-AS-SELECT cannot create table with location to a non-empty directory /tmp/u2 . To allow overwriting the existing non-empty directory, set 'spark.sql.legacy.allowNonEmptyLocationInCTAS' to true.
```
`CREATE TABLE AS SELECT` with non-empty `LOCATION` will throw `AnalysisException`. To restore the behavior before Spark 3.2, need to set `spark.sql.legacy.allowNonEmptyLocationInCTAS` to `true`. , default value is `false`.
Updated SQL migration guide.
### How was this patch tested?
Test case added in SQLQuerySuite.scala
Closes#32411 from vinodkc/br_fixCTAS_nonempty_dir.
Authored-by: Vinod KC <vinod.kc.in@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Updating column stats for Union operator stats estimation
### Why are the changes needed?
This is a followup PR to update the null count also in the Union stats operator estimation. https://github.com/apache/spark/pull/30334
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Updated UTs, manual testing
Closes#32494 from shahidki31/shahid/updateNullCountForUnion.
Lead-authored-by: shahid <shahidki31@gmail.com>
Co-authored-by: Shahid <shahidki31@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Update histogram statistics for RANGE operator stats estimation
### Why are the changes needed?
If histogram optimization is enabled, this statistics can be used in various cost based optimizations.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added UTs. Manual test.
Closes#32498 from shahidki31/shahid/histogram.
Lead-authored-by: shahid <shahidki31@gmail.com>
Co-authored-by: Shahid <shahidki31@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>