### What changes were proposed in this pull request?
This PR proposes to define an individual method for each common subexpression in `HashAggregateExec`. In the current master, the common subexpression elimination code in `HashAggregateExec` is expanded inline in a single method: 4664a082c2/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala (L397)
The resulting method can be too big for JIT compilation, so I believe splitting it is beneficial for performance. For example, for the query `SELECT SUM(a + b), AVG(a + b + c) FROM VALUES (1, 1, 1) t(a, b, c)`,
the current master generates:
```
/* 098 */ private void agg_doConsume_0(InternalRow localtablescan_row_0, int agg_expr_0_0, int agg_expr_1_0, int agg_expr_2_0) throws java.io.IOException {
/* 099 */ // do aggregate
/* 100 */ // common sub-expressions
/* 101 */ int agg_value_6 = -1;
/* 102 */
/* 103 */ agg_value_6 = agg_expr_0_0 + agg_expr_1_0;
/* 104 */
/* 105 */ int agg_value_5 = -1;
/* 106 */
/* 107 */ agg_value_5 = agg_value_6 + agg_expr_2_0;
/* 108 */ boolean agg_isNull_4 = false;
/* 109 */ long agg_value_4 = -1L;
/* 110 */ if (!false) {
/* 111 */ agg_value_4 = (long) agg_value_5;
/* 112 */ }
/* 113 */ int agg_value_10 = -1;
/* 114 */
/* 115 */ agg_value_10 = agg_expr_0_0 + agg_expr_1_0;
/* 116 */ // evaluate aggregate functions and update aggregation buffers
/* 117 */ agg_doAggregate_sum_0(agg_value_10);
/* 118 */ agg_doAggregate_avg_0(agg_value_4, agg_isNull_4);
/* 119 */
/* 120 */ }
```
On the other hand, this PR generates:
```
/* 121 */ private void agg_doConsume_0(InternalRow localtablescan_row_0, int agg_expr_0_0, int agg_expr_1_0, int agg_expr_2_0) throws java.io.IOException {
/* 122 */ // do aggregate
/* 123 */ // common sub-expressions
/* 124 */ long agg_subExprValue_0 = agg_subExpr_0(agg_expr_2_0, agg_expr_0_0, agg_expr_1_0);
/* 125 */ int agg_subExprValue_1 = agg_subExpr_1(agg_expr_0_0, agg_expr_1_0);
/* 126 */ // evaluate aggregate functions and update aggregation buffers
/* 127 */ agg_doAggregate_sum_0(agg_subExprValue_1);
/* 128 */ agg_doAggregate_avg_0(agg_subExprValue_0);
/* 129 */
/* 130 */ }
```
I ran some microbenchmarks for this PR:
```
(base) maropu~:$system_profiler SPHardwareDataType
Hardware:
Hardware Overview:
Processor Name: Intel Core i5
Processor Speed: 2 GHz
Number of Processors: 1
Total Number of Cores: 2
L2 Cache (per Core): 256 KB
L3 Cache: 4 MB
Memory: 8 GB
(base) maropu~:$java -version
java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
(base) maropu~:$ ./bin/spark-shell --master=local[1] --conf spark.driver.memory=8g --conf spark.sql.shuffle.partitions=1 -v
val numCols = 40
val colExprs = "id AS key" +: (0 until numCols).map { i => s"id AS _c$i" }
spark.range(3000000).selectExpr(colExprs: _*).createOrReplaceTempView("t")
val aggExprs = (2 until numCols).map { i =>
(0 until i).map(d => s"_c$d")
.mkString("AVG(", " + ", ")")
}
// Drop the time of the first run, then pick that of the second run
timer { sql(s"SELECT ${aggExprs.mkString(", ")} FROM t").write.format("noop").save() }
// the master
maxCodeGen: 12957
Elapsed time: 36.309858661s
// this pr
maxCodeGen: 4184
Elapsed time: 2.399490285s
```
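The `timer` helper used above is not shown in the snippet; a minimal sketch of what it might look like (hypothetical, not part of the PR):

```scala
// Hypothetical sketch of the `timer` helper assumed above: measure the
// wall-clock time of a block and print the elapsed seconds.
def timer[T](block: => T): T = {
  val start = System.nanoTime()
  val result = block // run the measured code
  val elapsedSec = (System.nanoTime() - start) / 1e9
  println(f"Elapsed time: $elapsedSec%.9fs")
  result
}
```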
### Why are the changes needed?
To avoid the too-long-function issue in JVMs.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added tests in `WholeStageCodegenSuite`
Closes#25710 from maropu/SplitSubexpr.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
…create table through spark-sql or beeline
## What changes were proposed in this pull request?
Fix the table owner to use the user name instead of the principal when creating a table through spark-sql.
`private val userName = conf.getUser` gets the UGI's user name, which is the principal info. I copied the source code into `HiveClientImpl` and used `ugi.getShortUserName()` instead of `ugi.getUserName()`, so the owner displays correctly.
## How was this patch tested?
1. Create a table in a Kerberos cluster
2. Use `desc formatted tbName` to check the owner
Closes#23952 from hddong/SPARK-26929-fix-table-owner.
Lead-authored-by: hongdd <jn_hdd@163.com>
Co-authored-by: hongdongdong <hongdongdong@cmss.chinamobile.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
### What changes were proposed in this pull request?
This PR proposes to print bytecode statistics (max class bytecode size, max method bytecode size, max constant pool size, and # of inner classes) for generated classes in the debug output of `debugCodegen`. Since these metrics are critical for codegen framework development, I think it's worth printing them there. This PR enables `debugCodegen` to print these metrics as follows:
```
scala> sql("SELECT sum(v) FROM VALUES(1) t(v)").debugCodegen
Found 2 WholeStageCodegen subtrees.
== Subtree 1 / 2 (maxClassCodeSize:2693; maxMethodCodeSize:124; maxConstantPoolSize:130(0.20% used); numInnerClasses:0) ==
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
*(1) HashAggregate(keys=[], functions=[partial_sum(cast(v#0 as bigint))], output=[sum#5L])
+- *(1) LocalTableScan [v#0]
Generated code:
/* 001 */ public Object generate(Object[] references) {
/* 002 */ return new GeneratedIteratorForCodegenStage1(references);
/* 003 */ }
...
```
### Why are the changes needed?
For more efficient development.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Manually tested
Closes#25766 from maropu/PrintBytecodeStats.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
It's more clear to only do table lookup in `ResolveTables` rule (for v2 tables) and `ResolveRelations` rule (for v1 tables). `ResolveInsertInto` should only resolve the `InsertIntoStatement` with resolved relations.
### Why are the changes needed?
to make the code simpler
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
existing tests
Closes#25774 from cloud-fan/simplify.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR add the `namespaces` keyword to `TableIdentifierParserSuite`.
### Why are the changes needed?
Improve the test.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
N/A
Closes#25758 from highmoutain/3.0bugfix.
Authored-by: changchun.wang <251922566@qq.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This patch fixes a bug regarding an NPE in `SQLConf.get`, which is only possible when `SparkContext._dagScheduler` is null due to the SparkContext being stopped. The logic doesn't seem to consider that the active SparkContext could be in the process of stopping.
Note that it can't be encountered easily because `SparkContext.stop()` blocks the main thread, but there are many cases in which `SQLConf.get` is accessed concurrently while `SparkContext.stop()` is executing - users run other threads, or a listener accesses `SQLConf.get` after `dagScheduler` is set to null (this is the case I encountered).
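The null-safe access pattern behind the fix can be modeled in isolation: read the nullable field once into a local, and fall back to a default when the context is mid-shutdown (toy code with illustrative names, not Spark's actual implementation):

```scala
// Toy model of the race: `scheduler` can be nulled out by a concurrent stop().
class ToyContext { @volatile var scheduler: AnyRef = new Object }

// Read the field once into a local so a concurrent stop() between the null
// check and the use cannot cause an NPE; fall back to a default conf instead.
def activeConf(ctx: ToyContext, fallback: String)(fromScheduler: AnyRef => String): String = {
  val s = ctx.scheduler
  if (s == null) fallback else fromScheduler(s)
}
```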
### Why are the changes needed?
The bug causes an NPE.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
The previous patch (#25753) was tested with a new UT, but because it disrupted other tests in concurrent test runs, the test is excluded from this patch.
Closes#25790 from HeartSaVioR/SPARK-29046-v2.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
In the PR, I propose to fix the comments of date-time expressions and replace the `yyyy` pattern with `uuuu` where the implementation actually assumes `uuuu`.
### Why are the changes needed?
To make comments consistent to implementations.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By running the Scala style checker.
Closes#25796 from MaxGekk/year-pattern-uuuu-followup.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
ThriftServerSessionPage displays timestamp 0 (1970/01/01) instead of nothing if query finish time and close time are not set.
![image](https://user-images.githubusercontent.com/25019163/64711118-6d578000-d4b9-11e9-9b11-2e3616319a98.png)
Change it to display nothing, like ThriftServerPage.
### Why are the changes needed?
Obvious bug.
### Does this PR introduce any user-facing change?
Finish time and Close time will be displayed correctly on ThriftServerSessionPage in JDBC/ODBC Spark UI.
### How was this patch tested?
Manual test.
Closes#25762 from juliuszsompolski/SPARK-29056.
Authored-by: Juliusz Sompolski <julek@databricks.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
### What changes were proposed in this pull request?
`Column.isInCollection()` with a large collection generates an expression with a huge number of child expressions, which makes the analyzer and optimizer take a long time to run.
In this PR, the `isInCollection()` function directly generates an `InSet` expression, avoiding the creation of too many child expressions.
### Why are the changes needed?
`Column.isInCollection()` with a large collection sometimes becomes a bottleneck when running SQL.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Manually benchmarked in spark-shell:
```
def testExplainTime(collectionSize: Int) = {
val df = spark.range(10).withColumn("id2", col("id") + 1)
val list = Range(0, collectionSize).toList
val startTime = System.currentTimeMillis()
df.where(col("id").isInCollection(list)).where(col("id2").isInCollection(list)).explain()
val elapsedTime = System.currentTimeMillis() - startTime
println(s"cost time: ${elapsedTime}ms")
}
```
Then tested with collection sizes 5, 10, 100, 1000, and 10000; the results are:
collection size | explain time (before) | explain time (after)
------ | ------ | ------
5 | 26ms | 29ms
10 | 30ms | 48ms
100 | 104ms | 50ms
1000 | 1202ms | 58ms
10000 | 10012ms | 523ms
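The size difference can be modeled with a toy expression tree (stand-in classes, not Spark's actual `In`/`InSet` expressions):

```scala
// Toy model: the old path builds one child expression per value, while the new
// path builds a single set-backed leaf, so the analyzer walks far fewer nodes.
sealed trait Expr
case class Literal(v: Long) extends Expr
case class In(column: String, list: Seq[Expr]) extends Expr    // old: many children
case class InSet(column: String, hset: Set[Long]) extends Expr // new: one node

val values = 0L until 10000L
val before = In("id", values.map(Literal(_))) // 10000 child expressions to analyze
val after = InSet("id", values.toSet)         // a single constant-size node
```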
Closes#25754 from WeichenXu123/improve_in_collection.
Lead-authored-by: WeichenXu <weichen.xu@databricks.com>
Co-authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
### What changes were proposed in this pull request?
This PR adds a utility class `AdaptiveSparkPlanHelper` which provides methods for tree traversal of an `AdaptiveSparkPlanExec` plan. Unlike their counterparts in `TreeNode` or
`QueryPlan`, these methods traverse down into the leaf nodes of adaptive plans, i.e., `AdaptiveSparkPlanExec` and `QueryStageExec`.
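The traversal difference can be sketched with a toy plan tree (illustrative stand-ins, not the actual `AdaptiveSparkPlanHelper` API):

```scala
// Toy plans: an "adaptive" node looks like a leaf to TreeNode-style traversal,
// but hides a sub-plan that the helper methods also need to visit.
sealed trait Plan { def name: String; def children: Seq[Plan] }
case class Leaf(name: String) extends Plan { val children = Nil }
case class Node(name: String, children: Plan*) extends Plan
case class Adaptive(name: String, hidden: Plan) extends Plan { val children = Nil }

// Like TreeNode.foreach, but also descends into the hidden plan of adaptive nodes.
def foreachAdaptive(p: Plan)(f: Plan => Unit): Unit = {
  f(p)
  p match {
    case a: Adaptive => foreachAdaptive(a.hidden)(f)
    case _ => p.children.foreach(foreachAdaptive(_)(f))
  }
}
```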
### Why are the changes needed?
This utility class can greatly simplify tree traversal code for adaptive spark plans.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Refined `AdaptiveQueryExecSuite` with the help of the new utility methods.
Closes#25764 from maryannxue/aqe-utils.
Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose to extend `ExtractBenchmark` and add new ones for:
- `EXTRACT` and `DATE` as input column
- the `DATE_PART` function and `DATE`/`TIMESTAMP` input column
### Why are the changes needed?
The `EXTRACT` expression was rebased on the `DATE_PART` expression by PR https://github.com/apache/spark/pull/25410, where some of the sub-expressions take a `DATE` column as input (`Millennium`, `Year`, etc.) but others require a `TIMESTAMP` column (`Hour`, `Minute`). Separate benchmarks for `DATE` should exclude the overhead of implicit `DATE` <-> `TIMESTAMP` conversions.
### Does this PR introduce any user-facing change?
No, it doesn't.
### How was this patch tested?
- Regenerated results of `ExtractBenchmark`
Closes#25772 from MaxGekk/date_part-benchmark.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
reorganize the packages of DS v2 interfaces/classes:
1. `org.spark.sql.connector.catalog`: put `TableCatalog`, `Table` and other related interfaces/classes
2. `org.spark.sql.connector.expression`: put `Expression`, `Transform` and other related interfaces/classes
3. `org.spark.sql.connector.read`: put `ScanBuilder`, `Scan` and other related interfaces/classes
4. `org.spark.sql.connector.write`: put `WriteBuilder`, `BatchWrite` and other related interfaces/classes
### Why are the changes needed?
Data Source V2 has evolved a lot. It's a bit weird that `Expression` is in `org.spark.sql.catalog.v2` and `Table` is in `org.spark.sql.sources.v2`.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing tests
Closes#25700 from cloud-fan/package.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the method `SQLMetricsTestUtils.testMetricsDynamicPartition()`, there is a CREATE TABLE statement without a `withTable` block. It causes test failures if the same table name is used in other unit tests.
### Why are the changes needed?
To avoid "table already exists" in tests.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing UTs
Closes#25752 from LantaoJin/SPARK-29045.
Authored-by: LantaoJin <jinlantao@gmail.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
### What changes were proposed in this pull request?
This patch fixes a bug regarding an NPE in `SQLConf.get`, which is only possible when `SparkContext._dagScheduler` is null due to the SparkContext being stopped. The logic doesn't seem to consider that the active SparkContext could be in the process of stopping.
Note that it can't be encountered easily because `SparkContext.stop()` blocks the main thread, but there are many cases in which `SQLConf.get` is accessed concurrently while `SparkContext.stop()` is executing - users run other threads, or a listener accesses `SQLConf.get` after `dagScheduler` is set to null (this is the case I encountered).
### Why are the changes needed?
The bug causes an NPE.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added a new UT to verify that the NPE doesn't occur. Without the patch, the test fails by throwing an NPE.
Closes#25753 from HeartSaVioR/SPARK-29046.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
JIRA: https://issues.apache.org/jira/browse/SPARK-29050
'a hdfs' change into 'an hdfs'
'an unique' change into 'a unique'
'an url' change into 'a url'
'a error' change into 'an error'
Closes#25756 from dengziming/feature_fix_typos.
Authored-by: dengziming <dengziming@growingio.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Remove `InsertIntoTable` and replace its usage with `InsertIntoStatement`.
### Why are the changes needed?
`InsertIntoTable` and `InsertIntoStatement` are almost identical (except for some naming). It doesn't make sense to keep two identical plans. After the removal of `InsertIntoTable`, the analysis process becomes:
1. parser creates `InsertIntoStatement`
2. v2 rule `ResolveInsertInto` converts `InsertIntoStatement` to v2 commands.
3. v1 rules like `DataSourceAnalysis` and `HiveAnalysis` convert `InsertIntoStatement` to v1 commands.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing tests
Closes#25763 from cloud-fan/remove.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
- For a trait without a companion object constructor, the method that gets constructor parameters, `constructParams` in `ScalaReflection`, currently throws an exception:
```
scala.ScalaReflectionException: <none> is not a term
at scala.reflect.api.Symbols$SymbolApi.asTerm(Symbols.scala:211)
at scala.reflect.api.Symbols$SymbolApi.asTerm$(Symbols.scala:211)
at scala.reflect.internal.Symbols$SymbolContextApiImpl.asTerm(Symbols.scala:106)
at org.apache.spark.sql.catalyst.ScalaReflection.getCompanionConstructor(ScalaReflection.scala:909)
at org.apache.spark.sql.catalyst.ScalaReflection.constructParams(ScalaReflection.scala:914)
at org.apache.spark.sql.catalyst.ScalaReflection.constructParams$(ScalaReflection.scala:912)
at org.apache.spark.sql.catalyst.ScalaReflection$.constructParams(ScalaReflection.scala:47)
at org.apache.spark.sql.catalyst.ScalaReflection.getConstructorParameters(ScalaReflection.scala:890)
at org.apache.spark.sql.catalyst.ScalaReflection.getConstructorParameters$(ScalaReflection.scala:886)
at org.apache.spark.sql.catalyst.ScalaReflection$.getConstructorParameters(ScalaReflection.scala:47)
```
- Instead, this PR throws an `UnsupportedOperationException`:
```
UnsupportedOperationException: Unable to find constructor for type [XXX]. This could happen if [XXX] is an interface, or a trait without companion object constructor
```
In the normal usage of ExpressionEncoder, this can happen if the type is an interface extending `scala.Product`. Also, since this is a protected method, it could have been reached by other arbitrary types without a constructor.
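The underlying reflection behavior can be reproduced with plain Scala reflection; for example (illustrative, using a hypothetical trait `Foo`):

```scala
import scala.reflect.runtime.universe._

// A trait with no companion object constructor, like the Product-extending
// interfaces mentioned above.
trait Foo extends Product { def a: Int }

// The companion symbol of such a trait is NoSymbol; calling `.asTerm` on it is
// what raised the unhelpful `scala.ScalaReflectionException: <none> is not a term`.
val companion = symbolOf[Foo].companion
assert(companion == NoSymbol)
```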
### Why are the changes needed?
- The error message `<none> is not a term` isn't helpful for users to understand the problem.
### Does this PR introduce any user-facing change?
- The `UnsupportedOperationException` is thrown instead of the runtime `scala.ScalaReflectionException`.
### How was this patch tested?
- Added a unit test to illustrate a `type` for which the expression encoder fails and triggers the proposed error message.
Closes#25736 from mickjermsurawong-stripe/SPARK-29026.
Authored-by: Mick Jermsurawong <mickjermsurawong@stripe.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Currently, the type info returned by the Spark Thrift Server includes:
1. INTERVAL_YEAR_MONTH
2. INTERVAL_DAY_TIME
3. UNION
4. USER_DEFINED
Spark doesn't support INTERVAL_YEAR_MONTH, INTERVAL_DAY_TIME, or UNION,
and won't return the USER_DEFINED type.
This PR overrides `GetTypeInfoOperation` with `SparkGetTypeInfoOperation` to exclude the types we don't need.
In hive-1.2.1 the Type class is `org.apache.hive.service.cli.Type`.
In hive-2.3.x the Type class is `org.apache.hadoop.hive.serde2.thrift.Type`.
`ThriftserverShimUtils` is used to handle this version difference and exclude the types we don't need.
### Why are the changes needed?
We should return only Spark's own type info.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Manual test & added UT
Closes#25694 from AngersZhuuuu/SPARK-28982.
Lead-authored-by: angerszhu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
### What changes were proposed in this pull request?
Implement the SHOW DATABASES logical and physical plans for data source v2 tables.
### Why are the changes needed?
To support `SHOW DATABASES` SQL commands for v2 tables.
### Does this PR introduce any user-facing change?
`spark.sql("SHOW DATABASES")` will return namespaces if the default catalog is set:
```
+---------------+
| namespace|
+---------------+
| ns1|
| ns1.ns1_1|
|ns1.ns1_1.ns1_2|
+---------------+
```
### How was this patch tested?
Added unit tests to `DataSourceV2SQLSuite`.
Closes#25601 from imback82/show_databases.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Removes a useless `Properties` instance, following hvanhovell's suggestion.
### Why are the changes needed?
Avoid useless code.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
existing UTs
Closes#25742 from mgaido91/SPARK-28939_followup.
Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
The `--hiveconf`/`--hivevar` parameters are only meant to provide initial values, so setting them once in `SparkSQLSessionManager#openSession` is sufficient; re-setting them in every `SparkExecuteStatementOperation` causes later modifications of the variables to be lost.
### Why are the changes needed?
It is wrong to set the `--hivevar`/`--hiveconf` variables in every `SparkExecuteStatementOperation`, because it prevents variable updates.
### Does this PR introduce any user-facing change?
```
cat <<EOF > test.sql
select '\${a}', '\${b}';
set b=bvalue_MOD_VALUE;
set b;
EOF
beeline -u jdbc:hive2://localhost:10000 --hiveconf a=avalue --hivevar b=bvalue -f test.sql
```
current result:
```
+-----------------+-----------------+--+
| avalue | bvalue |
+-----------------+-----------------+--+
| avalue | bvalue |
+-----------------+-----------------+--+
+-----------------+-----------------+--+
| key | value |
+-----------------+-----------------+--+
| b | bvalue |
+-----------------+-----------------+--+
1 row selected (0.022 seconds)
```
after modification:
```
+-----------------+-----------------+--+
| avalue | bvalue |
+-----------------+-----------------+--+
| avalue | bvalue |
+-----------------+-----------------+--+
+-----------------+-----------------+--+
| key | value |
+-----------------+-----------------+--+
| b | bvalue_MOD_VALUE|
+-----------------+-----------------+--+
1 row selected (0.022 seconds)
```
### How was this patch tested?
modified the existing unit test
Closes#25722 from cxzl25/fix_SPARK-26598.
Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
### What changes were proposed in this pull request?
This PR aims to recover the JDK11 compilation with a workaround.
For now, the master branch is broken like the following due to a [Scala bug](https://github.com/scala/bug/issues/10418) which is fixed in `2.13.0-RC2`.
```
[ERROR] [Error] /spark/sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecutionRDD.scala:42: ambiguous reference to overloaded definition,
both method putAll in class Properties of type (x$1: java.util.Map[_, _])Unit
and method putAll in class Hashtable of type (x$1: java.util.Map[_ <: Object, _ <: Object])Unit
match argument types (java.util.Map[String,String])
```
- https://github.com/apache/spark/actions (JDK11 build monitoring)
### Why are the changes needed?
This workaround recovers JDK11 compilation.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Manually built with JDK11 because this is a JDK11 compilation fix.
- Jenkins builds with JDK8 and tests with JDK11.
- GitHub action will verify this after merging.
Closes#25738 from dongjoon-hyun/SPARK-28939.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Remove the project node if the streaming scan is columnar
### Why are the changes needed?
This is a followup of https://github.com/apache/spark/pull/25586. Batch and streaming share the same DS v2 read API so both can support columnar reads. We should apply #25586 to streaming scan as well.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
existing tests
Closes#25727 from cloud-fan/follow.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
This is ANSI SQL, feature ID `T312`:
```
<binary overlay function> ::=
OVERLAY <left paren> <binary value expression> PLACING <binary value expression>
FROM <start position> [ FOR <string length> ] <right paren>
```
This PR is related to https://github.com/apache/spark/pull/24918 and adds support for byte arrays.
ref: https://www.postgresql.org/docs/11/functions-binarystring.html
## How was this patch tested?
New UTs.
Some examples of the feature on my production environment:
```
spark-sql> select overlay(encode('Spark SQL', 'utf-8') PLACING encode('_', 'utf-8') FROM 6);
Spark_SQL
Time taken: 0.285 s
spark-sql> select overlay(encode('Spark SQL', 'utf-8') PLACING encode('CORE', 'utf-8') FROM 7);
Spark CORE
Time taken: 0.202 s
spark-sql> select overlay(encode('Spark SQL', 'utf-8') PLACING encode('ANSI ', 'utf-8') FROM 7 FOR 0);
Spark ANSI SQL
Time taken: 0.165 s
spark-sql> select overlay(encode('Spark SQL', 'utf-8') PLACING encode('tructured', 'utf-8') FROM 2 FOR 4);
Structured SQL
Time taken: 0.141 s
```
Closes#25172 from beliefer/ansi-overlay-byte-array.
Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
- Remove SQLContext.createExternalTable and Catalog.createExternalTable, deprecated in favor of createTable since 2.2.0, plus tests of deprecated methods
- Remove HiveContext, deprecated in 2.0.0, in favor of `SparkSession.builder.enableHiveSupport`
- Remove deprecated KinesisUtils.createStream methods, plus tests of deprecated methods, deprecated in 2.2.0
- Remove deprecated MLlib (not Spark ML) linear method support, mostly utility constructors and 'train' methods, and associated docs. This includes methods in LinearRegression, LogisticRegression, Lasso, RidgeRegression. These have been deprecated since 2.0.0
- Remove deprecated Pyspark MLlib linear method support, including LogisticRegressionWithSGD, LinearRegressionWithSGD, LassoWithSGD
- Remove 'runs' argument in KMeans.train() method, which has been a no-op since 2.0.0
- Remove deprecated ChiSqSelector isSorted protected method
- Remove deprecated 'yarn-cluster' and 'yarn-client' master argument in favor of 'yarn' and deploy mode 'cluster', etc
Notes:
- I was not able to remove deprecated DataFrameReader.json(RDD) in favor of DataFrameReader.json(Dataset); the former was deprecated in 2.2.0, but, it is still needed to support Pyspark's .json() method, which can't use a Dataset.
- Looks like SQLContext.createExternalTable was not actually deprecated in Pyspark, but, almost certainly was meant to be? Catalog.createExternalTable was.
- I afterwards noted that the toDegrees, toRadians functions were almost removed fully in SPARK-25908, but Felix suggested keeping just the R version as they hadn't been technically deprecated. I'd like to revisit that. Do we really want the inconsistency? I'm not against reverting it again, but then that implies leaving SQLContext.createExternalTable just in Pyspark too, which seems weird.
- I *kept* LogisticRegressionWithSGD, LinearRegressionWithSGD, LassoWithSGD, RidgeRegressionWithSGD in Pyspark, though deprecated, as it is hard to remove them (still used by StreamingLogisticRegressionWithSGD?) and they are not fully removed in Scala. Maybe should not have been deprecated.
### Why are the changes needed?
Deprecated items are easiest to remove in a major release, so we should do so as much as possible for Spark 3. This does not target items deprecated 'recently' as of Spark 2.3, which is still 18 months old.
### Does this PR introduce any user-facing change?
Yes, in that deprecated items are removed from some public APIs.
### How was this patch tested?
Existing tests.
Closes#25684 from srowen/SPARK-28980.
Lead-authored-by: Sean Owen <sean.owen@databricks.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
The PR proposes to create a custom `RDD` which enables propagating `SQLConf` also in cases not tracked by SQL execution, as happens when a `Dataset` is converted to an RDD either using `.rdd` or `.queryExecution.toRdd` and the returned RDD is then used to invoke actions on it.
In this way, SQL configs are effective also in these cases, while earlier they were ignored.
### Why are the changes needed?
Without this patch, whenever `.rdd` or `.queryExecution.toRdd` is used, all the SQL configs that were set are ignored. An example reproducer:
```
withSQLConf(SQLConf.SUBEXPRESSION_ELIMINATION_ENABLED.key, "false") {
val df = spark.range(2).selectExpr((0 to 5000).map(i => s"id as field_$i"): _*)
df.createOrReplaceTempView("spark64kb")
val data = spark.sql("select * from spark64kb limit 10")
// Subexpression elimination is used here, even though it should have been disabled
data.describe()
}
```
### Does this PR introduce any user-facing change?
When a user calls `.queryExecution.toRdd`, a `SQLExecutionRDD` is returned wrapping the `RDD` of the executed plan. When `.rdd` is used, an additional `SQLExecutionRDD` is present in the hierarchy.
### How was this patch tested?
added UT
Closes#25643 from mgaido91/SPARK-28939.
Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
The `V2SessionCatalog` has 2 functionalities:
1. work as an adapter: provide v2 APIs and translate calls to the `SessionCatalog`.
2. allow users to extend it, so that they can add hooks to apply custom logic before calling methods of the builtin catalog (session catalog).
To leverage the second functionality, users must extend `V2SessionCatalog` which is an internal class. There is no doc to explain this usage.
This PR does 2 things:
1. refine the document of the config `spark.sql.catalog.session`.
2. add a public abstract class `CatalogExtension` for users to write implementations.
TODOs for followup PRs:
1. discuss if we should allow users to completely overwrite the v2 session catalog with a new one.
2. discuss to change the name of session catalog, so that it's less likely to conflict with existing namespace names.
## How was this patch tested?
existing tests
Closes#25104 from cloud-fan/session-catalog.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
`bin/spark-shell` supports querying interval values:
```scala
scala> spark.sql("SELECT interval 3 months 1 hours AS i").show(false)
+-------------------------+
|i |
+-------------------------+
|interval 3 months 1 hours|
+-------------------------+
```
But `sbin/start-thriftserver.sh` can't support querying interval values:
```sql
0: jdbc:hive2://localhost:10000/default> SELECT interval 3 months 1 hours AS i;
Error: java.lang.IllegalArgumentException: Unrecognized type name: interval (state=,code=0)
```
This PR maps `CalendarIntervalType` to `StringType` in `TableSchema` to make the Thrift server support querying interval values, because we do not support the `INTERVAL_YEAR_MONTH` and `INTERVAL_DAY_TIME` types:
02c33694c8/sql/hive-thriftserver/v1.2.1/src/main/java/org/apache/hive/service/cli/Type.java (L73-L78)
[SPARK-27791](https://issues.apache.org/jira/browse/SPARK-27791): Support SQL year-month INTERVAL type
[SPARK-27793](https://issues.apache.org/jira/browse/SPARK-27793): Support SQL day-time INTERVAL type
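The mapping itself is conceptually a one-case fallback; a sketch with stand-in types (not the PR's actual code):

```scala
// Stand-ins for Spark's DataType hierarchy, just to show the shape of the mapping.
sealed trait DataType
case object CalendarIntervalType extends DataType
case object StringType extends DataType
case object LongType extends DataType

// When building the Thrift TableSchema, report interval columns as strings,
// since the Hive Type enum has no INTERVAL_YEAR_MONTH / INTERVAL_DAY_TIME entries.
def typeForTableSchema(dt: DataType): DataType = dt match {
  case CalendarIntervalType => StringType
  case other => other
}
```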
## How was this patch tested?
unit tests
Closes#25277 from wangyum/Thriftserver-support-interval-type.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
### What changes were proposed in this pull request?
When we set `spark.sql.decimalOperations.allowPrecisionLoss` to false, the result of the SQL below overflows and returns null.
Case a:
`select case when 1=2 then 1 else 1.000000000000000000000001 end * 1`
Similarly, with the division operation, the SQL below loses precision.
Case b:
`select case when 1=2 then 1 else 1.000000000000000000000001 end / 1`
Let us check the code in TypeCoercion.scala:
a75467432e/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala (L864-L875).
For a binary operator, if the two operands have different data types, the `ImplicitTypeCasts` rule finds a common type and casts both operands to it.
So, for the cases mentioned, the left operand is Decimal(34, 24) and the right operand is a Literal.
Their common type is Decimal(34, 24), so Literal(1) is cast to Decimal(34, 24).
Then both operands are of decimal type and are processed by the `decimalAndDecimal` method of the `DecimalPrecision` class.
Let's check the relative code.
a75467432e/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala (L123-L153)
When precision loss is not allowed, the result type of the multiplication in case a is Decimal(38, 38), and that of the division in case b is Decimal(38, 20).
The multiplication in case a then overflows, and the division in case b loses precision.
In this PR, we skip handling a binary operator in `ImplicitTypeCasts` when DecimalType operands are involved, and leave it to the `DecimalPrecision` rule.
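As an illustration of the overflow in case a (a hedged sketch, not Spark's code): a Decimal(precision, scale) can hold at most `precision - scale` integral digits, and the capped multiply result type Decimal(38, 38) leaves zero of them, so even a value just above 1 cannot be represented:

```java
// Hedged sketch: why the CASE WHEN example overflows under Decimal(38, 38).
// After both operands are cast to Decimal(34, 24), the multiply result type
// (p1 + p2 + 1, s1 + s2) = (69, 48) is capped to (38, 38), which leaves
// zero integral digits.
import java.math.BigDecimal;
import java.math.RoundingMode;

public class DecimalOverflowDemo {
    // Returns true if `value` fits into a SQL Decimal(precision, scale).
    static boolean fits(BigDecimal value, int precision, int scale) {
        BigDecimal rounded = value.setScale(scale, RoundingMode.HALF_UP);
        return rounded.precision() <= precision;
    }

    public static void main(String[] args) {
        BigDecimal v = new BigDecimal("1.000000000000000000000001")
                .multiply(BigDecimal.ONE);
        System.out.println(fits(v, 38, 38)); // false: Spark returns null here
        System.out.println(fits(v, 38, 24)); // true: room for the integral digit
    }
}
```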
### Why are the changes needed?
Without this change, query results are corrupted.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Unit test.
Closes #25701 from turboFei/SPARK-29000.
Authored-by: turbofei <fwang12@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
The PR proposes to split the generated code for subexpression elimination into separate functions, instead of inlining all of it in the apply method of `Generate[Mutable|Unsafe]Projection`.
### Why are the changes needed?
Before this PR, code generation can fail due to the 64KB code size limit if a lot of subexpression elimination functions are generated. The added UT is a reproducer for the issue (thanks to the JIRA reporter and HyukjinKwon for it).
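A minimal sketch of the idea (a hypothetical helper, not Spark's actual `CodegenContext` API): each eliminated subexpression is emitted as its own method, so the apply method shrinks to a sequence of calls and every generated method stays well below the JVM's 64KB-per-method bytecode limit:

```java
// Hedged sketch (hypothetical names): split per-subexpression code into
// separate methods instead of inlining one large block into apply().
import java.util.List;

public class SplitSubExprDemo {
    static String generate(List<String> subExprBodies) {
        StringBuilder methods = new StringBuilder();
        StringBuilder applyBody = new StringBuilder();
        for (int i = 0; i < subExprBodies.size(); i++) {
            // One small method per subexpression...
            methods.append("private int subExpr_").append(i).append("() { ")
                   .append(subExprBodies.get(i)).append(" }\n");
            // ...and apply() only calls them.
            applyBody.append("  int subExprValue_").append(i)
                     .append(" = subExpr_").append(i).append("();\n");
        }
        return methods + "public void apply() {\n" + applyBody + "}";
    }

    public static void main(String[] args) {
        System.out.println(generate(List.of("return a + b;", "return a + b + c;")));
    }
}
```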
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
added UT
Closes #25642 from mgaido91/SPARK-28916.
Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This pr cleans up string template formats for generated code in HashAggregateExec. This change comes from rednaxelafx's comment: https://github.com/apache/spark/pull/20965#discussion_r316418729
### Why are the changes needed?
To improve code-readability.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
N/A
Closes #25714 from maropu/SPARK-21870-FOLLOWUP.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This PR aims to avoid AQE regressions by not changing a sort merge join to a broadcast hash join when the expected build plan has a high ratio of empty partitions, in which case the sort merge join can actually perform faster. This PR achieves this by adding an internal join hint to let the planner know which side has the high ratio of empty partitions, so that side is not planned as the build side of a BHJ. The other side is unaffected and can still become the build side of a BHJ if it qualifies.
### Why are the changes needed?
It is a performance improvement for AQE.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added UT.
Closes #25703 from maryannxue/aqe-demote-bhj.
Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
In the PR, I propose new function `date_part()`. The function is modeled on the traditional Ingres equivalent to the SQL-standard function `extract`:
```
date_part('field', source)
```
and added for feature parity with PostgreSQL (https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT).
The `source` can have `DATE` or `TIMESTAMP` type. Supported string values of `'field'` are:
- `millennium` - the millennium of the given date (or of a timestamp implicitly cast to a date). For example, years in the 1900s are in the second millennium. The third millennium started _January 1, 2001_.
- `century` - the century of the given date (or timestamp). The first century starts at 0001-01-01 AD.
- `decade` - the decade of the given date (or timestamp); this is the year field divided by 10.
- `isoyear` - the ISO 8601 week-numbering year that the date falls in. Each ISO 8601 week-numbering year begins with the Monday of the week containing the 4th of January.
- `year`, `month`, `day`, `hour`, `minute`, `second`
- `week` - the number of the ISO 8601 week-numbering week of the year. By definition, ISO weeks start on Mondays and the first week of a year contains January 4 of that year.
- `quarter` - the quarter of the year (1 - 4)
- `dayofweek` - the day of the week for date/timestamp (1 = Sunday, 2 = Monday, ..., 7 = Saturday)
- `dow` - the day of the week as Sunday (0) to Saturday (6)
- `isodow` - the day of the week as Monday (1) to Sunday (7)
- `doy` - the day of the year (1 - 365/366)
- `milliseconds` - the seconds field including fractional parts multiplied by 1,000.
- `microseconds` - the seconds field including fractional parts multiplied by 1,000,000.
- `epoch` - the number of seconds since 1970-01-01 00:00:00 local time in microsecond precision.
Here are examples:
```sql
spark-sql> select date_part('year', timestamp'2019-08-12 01:00:00.123456');
2019
spark-sql> select date_part('week', timestamp'2019-08-12 01:00:00.123456');
33
spark-sql> select date_part('doy', timestamp'2019-08-12 01:00:00.123456');
224
```
I changed implementation of `extract` to re-use `date_part()` internally.
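As an illustration only (not Spark's implementation), the `week`, `doy`, `dow` and `isodow` semantics above can be reproduced with `java.time`, matching the spark-sql examples for 2019-08-12, which is a Monday:

```java
// Hedged sketch: date_part field semantics expressed with java.time.
import java.time.LocalDate;
import java.time.temporal.IsoFields;

public class DatePartDemo {
    static int datePart(String field, LocalDate d) {
        switch (field) {
            case "year":   return d.getYear();
            case "week":   return d.get(IsoFields.WEEK_OF_WEEK_BASED_YEAR);
            case "doy":    return d.getDayOfYear();
            case "isodow": return d.getDayOfWeek().getValue();      // Monday = 1 .. Sunday = 7
            case "dow":    return d.getDayOfWeek().getValue() % 7;  // Sunday = 0 .. Saturday = 6
            default: throw new IllegalArgumentException(field);
        }
    }

    public static void main(String[] args) {
        LocalDate d = LocalDate.of(2019, 8, 12);
        System.out.println(datePart("year", d)); // 2019
        System.out.println(datePart("week", d)); // 33
        System.out.println(datePart("doy", d));  // 224
    }
}
```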
## How was this patch tested?
Added `date_part.sql` and regenerated results of `extract.sql`.
Closes #25410 from MaxGekk/date_part.
Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Add a listListeners() method to StreamingQueryManager that lists all StreamingQueryListeners that have been added to that manager.
### Why are the changes needed?
While it's best practice to keep handles on all listeners added, it's still nice to have an API to be able to list what listeners have been added to a StreamingQueryManager.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Modified existing unit tests to use the new API instead of using reflection.
Closes #25518 from mukulmurthy/26046-listener.
Authored-by: Mukul Murthy <mukul.murthy@gmail.com>
Signed-off-by: Jose Torres <torres.joseph.f+github@gmail.com>
## What changes were proposed in this pull request?
This PR disables schema verification and allows schema auto-creation in the Derby database, in case the config for the Metastore is set otherwise.
## How was this patch tested?
NA
Closes #25663 from bogdanghit/hive-schema.
Authored-by: Bogdan Ghit <bogdan.ghit@databricks.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
### What changes were proposed in this pull request?
Merge the `V2WriteSupportCheck` and `V2StreamingScanSupportCheck` rules into one rule: `TableCapabilityCheck`.
### Why are the changes needed?
It's a little confusing to have two rules to check DS v2 table capability, where one rule checks write support and the other checks streaming scan support. We can clearly tell from the rule names that the batch scan check is missing.
It's better to have a centralized place for this check, with a name that clearly says it checks table capability.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing tests
Closes #25679 from cloud-fan/dsv2-check.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to port `pgSQL/aggregates_part3.sql` into UDF test base.
<details><summary>Diff comparing to 'pgSQL/aggregates_part3.sql'</summary>
<p>
```diff
diff --git a/sql/core/src/test/resources/sql-tests/results/pgSQL/aggregates_part3.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-aggregates_part3.sql.out
index f102383cb4d..eff33f280cf 100644
--- a/sql/core/src/test/resources/sql-tests/results/pgSQL/aggregates_part3.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-aggregates_part3.sql.out
@@ -3,7 +3,7 @@
-- !query 0
-select max(min(unique1)) from tenk1
+select udf(max(min(unique1))) from tenk1
-- !query 0 schema
struct<>
-- !query 0 output
@@ -12,11 +12,11 @@ It is not allowed to use an aggregate function in the argument of another aggreg
-- !query 1
-select (select count(*)
- from (values (1)) t0(inner_c))
+select udf((select udf(count(*))
+ from (values (1)) t0(inner_c))) as col
from (values (2),(3)) t1(outer_c)
-- !query 1 schema
-struct<scalarsubquery():bigint>
+struct<col:bigint>
-- !query 1 output
1
1
```
</p>
</details>
### Why are the changes needed?
To improve test coverage in UDFs.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Manually tested via:
```bash
build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/pgSQL/udf-aggregates_part3.sql"
```
as guided in https://issues.apache.org/jira/browse/SPARK-27921
Closes #25676 from HyukjinKwon/SPARK-28272.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to port `pgSQL/aggregates_part4.sql` into UDF test base.
<details><summary>Diff comparing to 'pgSQL/aggregates_part3.sql'</summary>
<p>
```diff
```
</p>
</details>
### Why are the changes needed?
To improve test coverage in UDFs.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Manually tested via:
```bash
build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/pgSQL/udf-aggregates_part4.sql"
```
as guided in https://issues.apache.org/jira/browse/SPARK-27921
Closes #25677 from HyukjinKwon/SPARK-28971.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
`DataFrameReader.json()` accepts a partition column that is of numeric, date or timestamp type, according to the implementation in `JDBCRelation.scala`. Update the scaladoc accordingly, to match the documentation in `sql-data-sources-jdbc.md` too.
### Why are the changes needed?
scaladoc is incorrect.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
N/A
Closes #25687 from srowen/SPARK-28977.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Support generator in aggregate expressions.
In this PR, I check the aggregate logical plan; if its aggregateExpressions include a generator, this agg plan is converted into "normal agg plan + generator plan + projection plan", i.e.:
```
aggregate(with generator)
|--child_plan
```
===>
```
project
|--generator(resolved)
|--aggregate
|--child_plan
```
### Why are the changes needed?
We should support sql like:
```
select explode(array(min(a), max(a))) from t
```
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Unit test added.
Closes #25512 from WeichenXu123/explode_bug.
Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Remove unnecessary physical projection added to ensure rows are `UnsafeRow` when the DSv2 scan is columnar. This is not needed because conversions are automatically added to convert from columnar operators to `UnsafeRow` when the next operator does not support columnar execution.
### Why are the changes needed?
Removes an extra projection and copy.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes #25586 from rdblue/SPARK-28878-remove-dsv2-project-with-columnar.
Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Adds the provider information to the table properties in saveAsTable.
### Why are the changes needed?
Otherwise, catalog implementations don't know what kind of Table definition to create.
### Does this PR introduce any user-facing change?
nope
### How was this patch tested?
Existing unit tests check the existence of the provider now.
Closes #25669 from brkyvz/provider.
Authored-by: Burak Yavuz <brkyvz@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Rename `UnresolvedTable` to `V1Table` because it is not unresolved.
### Why are the changes needed?
The class name is inaccurate. This should be fixed before it is in a release.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes #25683 from rdblue/SPARK-28979-rename-unresolved-table.
Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Replaces some incorrect usages of `new Configuration()`, as it loads the default configs defined by Hadoop.
### Why are the changes needed?
An unexpected config could be accessed instead of the expected one; see SPARK-28203 for an example.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes #25616 from advancedxy/remove_invalid_configuration.
Authored-by: Xianjin YE <advancedxy@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
This patch implements dynamic partition pruning by adding a dynamic-partition-pruning filter if there is a partitioned table and a filter on the dimension table. The filter is then planned using a heuristic approach:
1. As a broadcast relation if it is a broadcast hash join. The broadcast relation will then be transformed into a reused broadcast exchange by the `ReuseExchange` rule; or
2. As a subquery duplicate if the estimated benefit of the saved partitioned table scan is greater than the estimated cost of the extra scan of the duplicated subquery; otherwise
3. As a bypassed condition (`true`).
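The three-branch heuristic above can be sketched as follows (hypothetical names, not Spark's planner API); the ordering mirrors the list: broadcast reuse first, then a cost-based subquery duplicate, else a bypassed `true` condition:

```java
// Hedged sketch of the dynamic-partition-pruning planning heuristic.
public class DppHeuristicDemo {
    enum PruningPlan { REUSED_BROADCAST, DUPLICATED_SUBQUERY, BYPASSED }

    static PruningPlan plan(boolean isBroadcastHashJoin,
                            double estimatedScanSavings,
                            double extraSubqueryScanCost) {
        // 1. A broadcast hash join lets us reuse the broadcast exchange.
        if (isBroadcastHashJoin) return PruningPlan.REUSED_BROADCAST;
        // 2. Otherwise duplicate the subquery only if it pays for itself.
        if (estimatedScanSavings > extraSubqueryScanCost) return PruningPlan.DUPLICATED_SUBQUERY;
        // 3. Otherwise fall back to a bypassed `true` condition.
        return PruningPlan.BYPASSED;
    }

    public static void main(String[] args) {
        System.out.println(plan(true, 0.0, 0.0));    // REUSED_BROADCAST
        System.out.println(plan(false, 10.0, 1.0));  // DUPLICATED_SUBQUERY
        System.out.println(plan(false, 1.0, 10.0));  // BYPASSED
    }
}
```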
### Why are the changes needed?
This is an important performance feature.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added UT
- Testing DPP by enabling / disabling the reuse broadcast results feature and / or the subquery duplication feature.
- Testing DPP with reused broadcast results.
- Testing the key iterators on different HashedRelation types.
- Testing the packing and unpacking of the broadcast keys in a LongType.
Closes #25600 from maryannxue/dpp.
Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>