### What changes were proposed in this pull request?
In https://github.com/yaooqinn/itachi/issues/8, we had a discussion about the current extension injection for the spark session. We've agreed that the current way is not that convenient for both third-party developers and end-users.
It's much simple if third-party developers can provide a resource file that contains default extensions for Spark to load ahead
### Why are the changes needed?
better use experience
### Does this PR introduce _any_ user-facing change?
no, dev change
### How was this patch tested?
new tests
Closes#32515 from yaooqinn/SPARK-35380.
Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
### What changes were proposed in this pull request?
Change `map` in `InvokeLike.invoke` to a while loop to improve performance, following Spark [style guide](https://github.com/databricks/scala-style-guide#traversal-and-zipwithindex).
### Why are the changes needed?
`InvokeLike.invoke`, which is used in non-codegen path for `Invoke` and `StaticInvoke`, currently uses `map` to evaluate arguments:
```scala
val args = arguments.map(e => e.eval(input).asInstanceOf[Object])
if (needNullCheck && args.exists(_ == null)) {
// return null if one of arguments is null
null
} else {
...
```
which is pretty expensive if the method itself is trivial. We can change it to a plain while loop.
<img width="871" alt="Screen Shot 2021-05-12 at 12 19 59 AM" src="https://user-images.githubusercontent.com/506679/118055719-7f985a00-b33d-11eb-943b-cf85eab35f44.png">
Benchmark results show this can improve as much as 3x from `V2FunctionBenchmark`:
Before
```
OpenJDK 64-Bit Server VM 1.8.0_292-b10 on Linux 5.4.0-1046-azure
Intel(R) Xeon(R) CPU E5-2673 v3 2.40GHz
scalar function (long + long) -> long, result_nullable = false codegen = false: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
--------------------------------------------------------------------------------------------------------------------------------------------------------------
native_long_add 36506 36656 251 13.7 73.0 1.0X
java_long_add_default 47151 47540 370 10.6 94.3 0.8X
java_long_add_magic 178691 182457 1327 2.8 357.4 0.2X
java_long_add_static_magic 177151 178258 1151 2.8 354.3 0.2X
```
After
```
OpenJDK 64-Bit Server VM 1.8.0_292-b10 on Linux 5.4.0-1046-azure
Intel(R) Xeon(R) CPU E5-2673 v3 2.40GHz
scalar function (long + long) -> long, result_nullable = false codegen = false: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
--------------------------------------------------------------------------------------------------------------------------------------------------------------
native_long_add 29897 30342 568 16.7 59.8 1.0X
java_long_add_default 40628 41075 664 12.3 81.3 0.7X
java_long_add_magic 54553 54755 182 9.2 109.1 0.5X
java_long_add_static_magic 55410 55532 127 9.0 110.8 0.5X
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests.
Closes#32527 from sunchao/SPARK-35384.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR proposes to skip the "q6", "q34", "q64", "q74", "q75", "q78" queries in the TPCDS-related tests because the TPCDS v2.7 queries have almost the same ones; the only differences in these queries are ORDER BY columns.
### Why are the changes needed?
To improve test performance.
### Does this PR introduce _any_ user-facing change?
No, dev only.
### How was this patch tested?
Existing tests.
Closes#32520 from maropu/SkipDupQueries.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Switch to plain `while` loop following Spark [style guide](https://github.com/databricks/scala-style-guide#traversal-and-zipwithindex).
### Why are the changes needed?
`while` loop may yield better performance comparing to `foreach`.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
N/A
Closes#32522 from sunchao/SPARK-35361-follow-up.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
A simple follow-up of #32474 to throw exception instead of sys.error.
### Why are the changes needed?
An exception only fails the query, instead of sys.error.
### Does this PR introduce _any_ user-facing change?
Yes, if `Invoke` or `StaticInvoke` cannot find the method, instead of original `sys.error` now we only throw an exception.
### How was this patch tested?
Existing tests.
Closes#32519 from viirya/SPARK-35347-followup.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
This PR proposes to bump up the janino version from 3.0.16 to v3.1.4.
The major changes of this upgrade are as follows:
- Fixed issue #131: Janino 3.1.2 is 10x slower than 3.0.11: The Compiler's IClassLoader was initialized way too eagerly, thus lots of classes were loaded from the class path, which is very slow.
- Improved the encoding of stack map frames according to JVMS11 4.7.4: Previously, only "full_frame"s were generated.
- Fixed issue #107: Janino requires "org.codehaus.commons.compiler.io", but commons-compiler does not export this package
- Fixed the promotion of the array access index expression (see JLS7 15.13 Array Access Expressions).
For all the changes, please see the change log: http://janino-compiler.github.io/janino/changelog.html
NOTE1: I've checked that there is no obvious performance regression. For all the data, see a link: https://docs.google.com/spreadsheets/d/1srxT9CioGQg1fLKM3Uo8z1sTzgCsMj4pg6JzpdcG6VU/edit?usp=sharing
NOTE2: We upgraded janino to 3.1.2 (#27860) once before, but the commit had been reverted in #29495 because of the correctness issue. Recently, #32374 had checked if Spark could land on v3.1.3 or not, but a new bug was found there. These known issues has been fixed in v3.1.4 by following PRs:
- janino-compiler/janino#145
- janino-compiler/janino#146
### Why are the changes needed?
janino v3.0.X is no longer maintained.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA passed.
Closes#32455 from maropu/janino_v3.1.4.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
According to discuss https://github.com/apache/spark/pull/25854#discussion_r629451135
### Why are the changes needed?
Clean code
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existed UT
Closes#32499 from AngersZhuuuu/SPARK-29145-fix.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Columnar execution support for ANSI interval types include YearMonthIntervalType and DayTimeIntervalType
### Why are the changes needed?
support cache tables with ANSI interval types.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
run ./dev/lint-java
run ./dev/scalastyle
run test: CachedTableSuite
run test: ColumnTypeSuite
Closes#32452 from Peng-Lei/SPARK-35243.
Lead-authored-by: PengLei <18066542445@189.cn>
Co-authored-by: Lei Peng <peng.8lei@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
In `ApplyFunctionExpression`, move `zipWithIndex` out of the loop for each input row.
### Why are the changes needed?
When the `ScalarFunction` is trivial, `zipWithIndex` could incur significant costs, as shown below:
<img width="899" alt="Screen Shot 2021-05-11 at 10 03 42 AM" src="https://user-images.githubusercontent.com/506679/117866421-fb19de80-b24b-11eb-8c94-d5e8c8b1eda9.png">
By removing it out of the loop, I'm seeing sometimes 2x speedup from `V2FunctionBenchmark`. For instance:
Before:
```
scalar function (long + long) -> long, result_nullable = false codegen = false: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
native_long_add 32437 32896 434 15.4 64.9 1.0X
java_long_add_default 85675 97045 NaN 5.8 171.3 0.4X
```
After:
```
scalar function (long + long) -> long, result_nullable = false codegen = false: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
native_long_add 30182 30387 279 16.6 60.4 1.0X
java_long_add_default 42862 43009 209 11.7 85.7 0.7X
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests
Closes#32507 from sunchao/SPARK-35361.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Added the following TreePattern enums:
- BOOL_AGG
- COUNT_IF
- CURRENT_LIKE
- RUNTIME_REPLACEABLE
Added tree traversal pruning to the following rules:
- ReplaceExpressions
- RewriteNonCorrelatedExists
- ComputeCurrentTime
- GetCurrentDatabaseAndCatalog
### Why are the changes needed?
Reduce the number of tree traversals and hence improve the query compilation latency.
Performance improvement (org.apache.spark.sql.TPCDSQuerySuite):
Rule name | Total Time (baseline) | Total Time (experiment) | experiment/baseline
ReplaceExpressions | 27546369 | 19753804 | 0.72
RewriteNonCorrelatedExists | 17304883 | 2086194 | 0.12
ComputeCurrentTime | 35751301 | 19984477 | 0.56
GetCurrentDatabaseAndCatalog | 37230787 | 18874013 | 0.51
### How was this patch tested?
Existing tests.
Closes#32461 from sigmod/finish_analysis.
Authored-by: Yingyi Bu <yingyi.bu@databricks.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
### What changes were proposed in this pull request?
This is a pre-requisite of https://github.com/apache/spark/pull/32476, in discussion of https://github.com/apache/spark/pull/32476#issuecomment-836469779 . This is to refactor sort merge join code-gen to depend on streamed/buffered terminology, which makes the code-gen agnostic to different join types and can be extended to support other join types than inner join.
### Why are the changes needed?
Pre-requisite of https://github.com/apache/spark/pull/32476.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing unit test in `InnerJoinSuite.scala` for inner join code-gen.
Closes#32495 from c21/smj-refactor.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Sequence expression output a message looks confused.
This PR will fix the issue.
### Why are the changes needed?
Improve the error message for Sequence expression
### Does this PR introduce _any_ user-facing change?
Yes. this PR updates the error message of Sequence expression.
### How was this patch tested?
Tests updated.
Closes#32492 from beliefer/SPARK-35088-followup.
Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Change the definition of `findTightestCommonType` from
```
def findTightestCommonType(t1: DataType, t2: DataType): Option[DataType]
```
to
```
val findTightestCommonType: (DataType, DataType) => Option[DataType]
```
### Why are the changes needed?
For backward compatibility.
When running a MongoDB connector (built with Spark 3.1.1) with the latest master, there is such an error
```
java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.analysis.TypeCoercion$.findTightestCommonType()Lscala/Function2
```
from https://github.com/mongodb/mongo-spark/blob/master/src/main/scala/com/mongodb/spark/sql/MongoInferSchema.scala#L150
In the previous release, the function was
```
static public scala.Function2<org.apache.spark.sql.types.DataType, org.apache.spark.sql.types.DataType, scala.Option<org.apache.spark.sql.types.DataType>> findTightestCommonType ()
```
After https://github.com/apache/spark/pull/31349, the function becomes:
```
static public scala.Option<org.apache.spark.sql.types.DataType> findTightestCommonType (org.apache.spark.sql.types.DataType t1, org.apache.spark.sql.types.DataType t2)
```
This PR is to reduce the unnecessary API change.
### Does this PR introduce _any_ user-facing change?
Yes, the definition of `TypeCoercion.findTightestCommonType` is consistent with previous release again.
### How was this patch tested?
Existing unit tests
Closes#32493 from gengliangwang/typecoercion.
Authored-by: Gengliang Wang <ltnwgl@gmail.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
### What changes were proposed in this pull request?
RepairTableCommand respects `spark.sql.addPartitionInBatch.size` too
### Why are the changes needed?
Make RepairTableCommand add partition batch size configurable.
### Does this PR introduce _any_ user-facing change?
User can use `spark.sql.addPartitionInBatch.size` to change batch size when repair table.
### How was this patch tested?
Not need
Closes#32489 from AngersZhuuuu/SPARK-35360.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Change `failOnError` to false for `NativeAdd` in `V2FunctionBenchmark`.
### Why are the changes needed?
Since `NativeAdd` is simply doing addition on long it's better to set `failOnError` to false so it will use native long addition instead of `Math.addExact`.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
N/A
Closes#32481 from sunchao/SPARK-35261-follow-up.
Authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Rename pattern strings and regexps of year-month and day-time intervals.
### Why are the changes needed?
To improve code maintainability.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By existing test suites.
Closes#32444 from AngersZhuuuu/SPARK-35111-followup.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
As title. We should use a more restrictive interface `ShuffledJoin` other than `BaseJoinExec` in `CoalesceBucketsInJoin`, as the rule only applies to sort merge join and shuffled hash join (i.e. `ShuffledJoin`).
### Why are the changes needed?
Code cleanup.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing unit test in `CoalesceBucketsInJoinSuite`.
Closes#32480 from c21/minor-cleanup.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
when `numSlices` is avaiable, `logical.Range` should compute a exact `maxRowsPerPartition`
### Why are the changes needed?
`maxRowsPerPartition` is used in optimizer, we should provide an exact value if possible
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing testsuites
Closes#32350 from zhengruifeng/range_maxRowsPerPartition.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This patch proposes to use `MethodUtils` for looking up methods `Invoke` and `StaticInvoke` expressions.
### Why are the changes needed?
Currently we wrote our logic in `Invoke` and `StaticInvoke` expressions for looking up methods. It is tricky to consider all the cases and there is already existing utility package for this purpose. We should reuse the utility package.
### Does this PR introduce _any_ user-facing change?
No, internal change only.
### How was this patch tested?
Existing tests.
Closes#32474 from viirya/invoke-util.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR proposes to filter out TPCDS v1.4 q6 and q75 in `TPCDSQueryTestSuite`.
I saw`TPCDSQueryTestSuite` failed nondeterministically because output row orders were different with those in the golden files. For example, the failure in the GA job, https://github.com/linhongliu-db/spark/runs/2507928605?check_suite_focus=true, happened because the `tpcds/q6.sql` query output rows were only sorted by `cnt`:
a0c76a8755/sql/core/src/test/resources/tpcds/q6.sql (L20)
Actually, `tpcds/q6.sql` and `tpcds-v2.7.0/q6.sql` are almost the same and the only difference is that `tpcds-v2.7.0/q6.sql` sorts both `cnt` and `a.ca_state`:
a0c76a8755/sql/core/src/test/resources/tpcds-v2.7.0/q6.sql (L22)
So, I think it's okay just to test `tpcds-v2.7.0/q6.sql` in this case (q75 has the same issue).
### Why are the changes needed?
For stable testing.
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
GA passed.
Closes#32454 from maropu/CleanUpTpcdsQueries.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This PR makes the below case work well.
```sql
select a b from values(1) t(a) distribute by a;
```
```logtalk
== Parsed Logical Plan ==
'RepartitionByExpression ['a]
+- 'Project ['a AS b#42]
+- 'SubqueryAlias t
+- 'UnresolvedInlineTable [a], [List(1)]
== Analyzed Logical Plan ==
org.apache.spark.sql.AnalysisException: cannot resolve 'a' given input columns: [b]; line 1 pos 62;
'RepartitionByExpression ['a]
+- Project [a#48 AS b#42]
+- SubqueryAlias t
+- LocalRelation [a#48]
```
### Why are the changes needed?
bugfix
### Does this PR introduce _any_ user-facing change?
yes, the original attributes can be used in `distribute by` / `cluster by` and hints like `/*+ REPARTITION(3, c) */`
### How was this patch tested?
new tests
Closes#32465 from yaooqinn/SPARK-35331.
Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Retain column metadata during the process of nested column pruning, when constructing `StructField`.
To test the above change, this also added the logic of column projection in `InMemoryTable`. Without the fix `DSV2CharVarcharDDLTestSuite` will fail.
### Why are the changes needed?
The column metadata is used in a few places such as re-constructing CHAR/VARCHAR information such as in [SPARK-33901](https://issues.apache.org/jira/browse/SPARK-33901). Therefore, we should retain the info during nested column pruning.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests.
Closes#32354 from sunchao/SPARK-35232.
Authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
This allows `ScalarFunction` implemented in Java to optionally specify the magic method `invoke` to be static, which can be used if the UDF is stateless. Comparing to the non-static method, it can potentially give better performance due to elimination of dynamic dispatch, etc.
Also added a benchmark to measure performance of: the default `produceResult`, non-static magic method and static magic method.
### Why are the changes needed?
For UDFs that are stateless (e.g., no need to maintain intermediate state between each function call), it's better to allow users to implement the UDF function as static method which could potentially give better performance.
### Does this PR introduce _any_ user-facing change?
Yes. Spark users can now have the choice to define static magic method for `ScalarFunction` when it is written in Java and when the UDF is stateless.
### How was this patch tested?
Added new UT.
Closes#32407 from sunchao/SPARK-35261.
Authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Instantiate a new Hive client through `Hive.getWithFastCheck(conf, false)` instead of `Hive.get(conf)`.
### Why are the changes needed?
[HIVE-10319](https://issues.apache.org/jira/browse/HIVE-10319) introduced a new API `get_all_functions` which is only supported in Hive 1.3.0/2.0.0 and up. As result, when Spark 3.x talks to a HMS service of version 1.2 or lower, the following error will occur:
```
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.TApplicationException: Invalid method name: 'get_all_functions'
at org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3897)
at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:248)
at org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:231)
... 96 more
Caused by: org.apache.thrift.TApplicationException: Invalid method name: 'get_all_functions'
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_all_functions(ThriftHiveMetastore.java:3845)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_all_functions(ThriftHiveMetastore.java:3833)
```
The `get_all_functions` is called only when `doRegisterAllFns` is set to true:
```java
private Hive(HiveConf c, boolean doRegisterAllFns) throws HiveException {
conf = c;
if (doRegisterAllFns) {
registerAllFunctionsOnce();
}
}
```
what this does is to register all Hive permanent functions defined in HMS in Hive's `FunctionRegistry` class, via iterating through results from `get_all_functions`. To Spark, this seems unnecessary as it loads Hive permanent (not built-in) UDF via directly calling the HMS API, i.e., `get_function`. The `FunctionRegistry` is only used in loading Hive's built-in function that is not supported by Spark. At this time, it only applies to `histogram_numeric`.
### Does this PR introduce _any_ user-facing change?
Yes with this fix Spark now should be able to talk to HMS server with Hive 1.2.x and lower (with HIVE-24608 too)
### How was this patch tested?
Manually started a HMS server of Hive version 1.2.2, with patched Hive 2.3.8 using HIVE-24608. Without the PR it failed with the above exception. With the PR the error disappeared and I can successfully perform common operations such as create table, create database, list tables, etc.
Closes#32446 from sunchao/SPARK-35321.
Authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This patch proposes to make StaticInvoke able to find method with given method name even the parameter types do not exactly match to argument classes.
### Why are the changes needed?
Unlike `Invoke`, `StaticInvoke` only tries to get the method with exact argument classes. If the calling method's parameter types are not exactly matched with the argument classes, `StaticInvoke` cannot find the method.
`StaticInvoke` should be able to find the method under the cases too.
### Does this PR introduce _any_ user-facing change?
Yes. `StaticInvoke` can find a method even the argument classes are not exactly matched.
### How was this patch tested?
Unit test.
Closes#32413 from viirya/static-invoke.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
This PR group exception messages in `sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog`.
### Why are the changes needed?
It will largely help with standardization of error messages and its maintenance.
### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.
### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.
Closes#32377 from beliefer/SPARK-35021.
Lead-authored-by: beliefer <beliefer@163.com>
Co-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Added the following TreePattern enums:
- APPEND_COLUMNS
- DESERIALIZE_TO_OBJECT
- LAMBDA_VARIABLE
- MAP_OBJECTS
- SERIALIZE_FROM_OBJECT
- PROJECT
- TYPED_FILTER
Added tree traversal pruning to the following rules dealing with objects:
- EliminateSerialization
- CombineTypedFilters
- EliminateMapObjects
- ObjectSerializerPruning
### Why are the changes needed?
Reduce the number of tree traversals and hence improve the query compilation latency.
### How was this patch tested?
Existing tests.
Closes#32451 from sigmod/object.
Authored-by: Yingyi Bu <yingyi.bu@databricks.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
### What changes were proposed in this pull request?
If `targetObject` is not nullable, we don't need the object null check in `Invoke`.
### Why are the changes needed?
small perf improvement
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
existing tests
Closes#32466 from cloud-fan/invoke.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR group exception messages in `sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util`.
### Why are the changes needed?
It will largely help with standardization of error messages and its maintenance.
### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.
### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.
Closes#32367 from beliefer/SPARK-35020.
Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This is a follow-up of https://github.com/apache/spark/pull/32198
Before https://github.com/apache/spark/pull/32198, in `WriteTaskStatsTracker.newRow`, we know that the row is written to the current file. After https://github.com/apache/spark/pull/32198 , we no longer know this connection.
This PR adds the file path parameter in `WriteTaskStatsTracker.newRow` to bring back the connection.
### Why are the changes needed?
To not break some custom `WriteTaskStatsTracker` implementations.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
N/A
Closes#32459 from cloud-fan/minor.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This is a follow up to https://github.com/apache/spark/pull/32032#discussion_r620928086. Basically, `children`/`innerChildren` should be mutually exclusive for `AlterViewAsCommand` and `CreateViewCommand`, which extend `AnalysisOnlyCommand`. Otherwise, there could be an issue in the `EXPLAIN` command. Currently, this is not an issue, because these commands will be analyzed (children will always be empty) when the `EXPLAIN` command is run.
### Why are the changes needed?
To be future-proof where these commands are directly used.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added new tsts
Closes#32447 from imback82/SPARK-34701-followup.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
`EXPLAIN CODEGEN <query>` (and Dataset.explain("codegen")) prints out the generated code for each stage of plan. The current implementation is to match `WholeStageCodegenExec` operator in query plan and prints out generated code (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/debug/package.scala#L111-L118 ). This does not work with AQE as we wrap the whole query plan inside `AdaptiveSparkPlanExec` and do not run whole stage code-gen physical plan rule eagerly (`CollapseCodegenStages`). This introduces unexpected behavior change for EXPLAIN query (and Dataset.explain), as we enable AQE by default now.
The change is to explain code-gen for the current executed plan of AQE.
### Why are the changes needed?
Make `EXPLAIN CODEGEN` work same as before.
### Does this PR introduce _any_ user-facing change?
No (when comparing with latest Spark release 3.1.1).
### How was this patch tested?
Added unit test in `ExplainSuite.scala`.
Closes#32430 from c21/explain-aqe.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
When checking the path in `FileStreamSink.hasMetadata`, we should ignore the error and assume the user wants to read a batch output.
### Why are the changes needed?
Keep the original behavior of ignoring the error.
### Does this PR introduce _any_ user-facing change?
Yes.
The path checking will not throw an exception when checking file sink format
### How was this patch tested?
New UT added.
Closes#31638 from xuanyuanking/SPARK-34526.
Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
This patch changes custom metric updating to update per certain rows (currently 100), instead of per row.
### Why are the changes needed?
Based on previous discussion https://github.com/apache/spark/pull/31451#discussion_r605413557, we should only update custom metrics per certain (e.g. 100) rows and also at the end of the task. Updating per row doesn't make too much benefit.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing unit test.
Closes#32330 from viirya/metric-update.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This patch changes a few places using `FileSystem` API to manipulate checkpoint file to `CheckpointFileManager`.
### Why are the changes needed?
`CheckpointFileManager` is designed to handle checkpoint file manipulation. However, there are a few places exposing `FileSystem` from checkpoint files/paths. We should use `CheckpointFileManager` to manipulate checkpoint files. For example, we may want to have one storage system for checkpoint file. If all checkpoint file manipulation is performed through `CheckpointFileManager`, we can only implement `CheckpointFileManager` for the storage system, and don't need to implement `FileSystem` API for it.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing unit tests.
Closes#32361 from viirya/checkpoint-manager.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
Hide internal view properties for describe table command, because those
properties are generated by spark and should be transparent to the end-user.
### Why are the changes needed?
Avoid internal properties confusing the users.
### Does this PR introduce _any_ user-facing change?
Yes
Before this change, the user will see below output for `describe formatted test_view`
```
....
Table Properties [view.catalogAndNamespace.numParts=2, view.catalogAndNamespace.part.0=spark_catalog, view.catalogAndNamespace.part.1=default, view.query.out.col.0=c, view.query.out.col.1=v, view.query.out.numCols=2, view.referredTempFunctionsNames=[], view.referredTempViewNames=[]]
...
```
After this change, the internal properties will be hidden for `describe formatted test_view`
```
...
Table Properties []
...
```
### How was this patch tested?
existing UT
Closes#32441 from linhongliu-db/hide-properties.
Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR intends to replace `maropu/spark-tpcds-datagen` with `databricks/tpcds-kit` for using a newer dsdgen and update the golden files in `tpcds-query-results`.
### Why are the changes needed?
For better testing.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA passed.
Closes#32420 from maropu/UseTpcdsKit.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This PR aims to enrich ORC encryption test coverage for nested columns.
### Why are the changes needed?
This will provide a test coverage for this feature.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs with the newly added test case.
Closes#32449 from dongjoon-hyun/SPARK-35325.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This pull request proposes a new API for streaming sources to signal that they can report metrics, and adds a use case to support Kafka micro batch stream to report the stats of # of offsets for the current offset falling behind the latest.
A public interface is added.
`metrics`: returns the metrics reported by the streaming source with given offset.
### Why are the changes needed?
The new API can expose any custom metrics for the "current" offset for streaming sources. Different from #31398, this PR makes metrics available to user through progress report, not through spark UI. A use case is that people want to know how the current offset falls behind the latest offset.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test for Kafka micro batch source v2 are added to test the Kafka use case.
Closes#31944 from yijiacui-db/SPARK-34297.
Authored-by: Yijia Cui <yijia.cui@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
To fix lambda variable name issues in nested DataFrame functions, this PR modifies code to use a global counter for `LambdaVariables` names created by higher order functions.
This is the rework of #31887. Closes#31887.
### Why are the changes needed?
This moves away from the current hard-coded variable names which break on nested function calls. There is currently a bug where nested transforms in particular fail (the inner variable shadows the outer variable)
For this query:
```
val df = Seq(
(Seq(1,2,3), Seq("a", "b", "c"))
).toDF("numbers", "letters")
df.select(
f.flatten(
f.transform(
$"numbers",
(number: Column) => { f.transform(
$"letters",
(letter: Column) => { f.struct(
number.as("number"),
letter.as("letter")
) }
) }
)
).as("zipped")
).show(10, false)
```
This is the current (incorrect) output:
```
+------------------------------------------------------------------------+
|zipped |
+------------------------------------------------------------------------+
|[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]|
+------------------------------------------------------------------------+
```
And this is the correct output after fix:
```
+------------------------------------------------------------------------+
|zipped |
+------------------------------------------------------------------------+
|[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]|
+------------------------------------------------------------------------+
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added the new test in `DataFrameFunctionsSuite`.
Closes#32424 from maropu/pr31887.
Lead-authored-by: dsolow <dsolow@sayari.com>
Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Co-authored-by: dmsolow <dsolow@sayarianalytics.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Added the following TreePattern enums:
- CREATE_NAMED_STRUCT
- EXTRACT_VALUE
- JSON_TO_STRUCT
- OUTER_REFERENCE
- AGGREGATE
- LOCAL_RELATION
- EXCEPT
- LIMIT
- WINDOW
Used them in the following rules:
- DecorrelateInnerQuery
- LimitPushDownThroughWindow
- OptimizeCsvJsonExprs
- PropagateEmptyRelation
- PullOutGroupingExpressions
- PushLeftSemiLeftAntiThroughJoin
- ReplaceExceptWithFilter
- RewriteDistinctAggregates
- SimplifyConditionalsInPredicate
- UnwrapCastInBinaryComparison
### Why are the changes needed?
Reduce the number of tree traversals and hence improve the query compilation latency.
### How was this patch tested?
Existing tests.
Closes#32421 from sigmod/opt.
Authored-by: Yingyi Bu <yingyi.bu@databricks.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
### What changes were proposed in this pull request?
This is rather a followup of https://github.com/apache/spark/pull/30518 that should be ported back to `branch-3.1` too.
`STOP_AT_DELIMITER` was mistakenly used twice. The duplicated `STOP_AT_DELIMITER` should be `SKIP_VALUE` in the documentation.
### Why are the changes needed?
To correctly document.
### Does this PR introduce _any_ user-facing change?
Yes, it fixes the user-facing documentation.
### How was this patch tested?
I checked them via running linters.
Closes#32423 from HyukjinKwon/SPARK-35250.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Fixing some typos in the documenting comments.
### Why are the changes needed?
To make reading the docs more pleasant.
### Does this PR introduce _any_ user-facing change?
Yes, since the user sees the docs.
### How was this patch tested?
It was not tested, because no code was changed.
Closes#32400 from Dobiasd/patch-1.
Authored-by: Tobias Hermann <editgym@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
In `StaticInvoke`, when result is nullable, don't box the return value if its type is primitive.
### Why are the changes needed?
It is unnecessary to apply boxing when the method return value is of primitive type, and it would hurt performance a lot if the method is simple. The check is done in `Invoke` but not in `StaticInvoke`.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added a UT.
Closes#32416 from sunchao/SPARK-35281.
Authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
1. Extend Spark SQL parser to support parsing of:
- `INTERVAL YEAR TO MONTH` to `YearMonthIntervalType`
- `INTERVAL DAY TO SECOND` to `DayTimeIntervalType`
2. Assign new names to the ANSI interval types according to the SQL standard to be able to parse the names back by Spark SQL parser. Override the `typeName()` name of `YearMonthIntervalType`/`DayTimeIntervalType`.
### Why are the changes needed?
To be able to use new ANSI interval types in SQL. The SQL standard requires the types to be defined according to the rules:
```
<interval type> ::= INTERVAL <interval qualifier>
<interval qualifier> ::= <start field> TO <end field> | <single datetime field>
<start field> ::= <non-second primary datetime field> [ <left paren> <interval leading field precision> <right paren> ]
<end field> ::= <non-second primary datetime field> | SECOND [ <left paren> <interval fractional seconds precision> <right paren> ]
<primary datetime field> ::= <non-second primary datetime field | SECOND
<non-second primary datetime field> ::= YEAR | MONTH | DAY | HOUR | MINUTE
<interval fractional seconds precision> ::= <unsigned integer>
<interval leading field precision> ::= <unsigned integer>
```
Currently, Spark SQL supports only `YEAR TO MONTH` and `DAY TO SECOND` as `<interval qualifier>`.
### Does this PR introduce _any_ user-facing change?
Should not since the types has not been released yet.
### How was this patch tested?
By running the affected tests such as:
```
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z interval.sql"
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z datetime.sql"
$ build/sbt "test:testOnly *ExpressionTypeCheckingSuite"
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z windowFrameCoercion.sql"
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z literals.sql"
```
Closes#32409 from MaxGekk/parse-ansi-interval-types.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>