### What changes were proposed in this pull request?
Invalidate char/varchar in `spark.readStream.schema`, just as was done for `spark.read.schema` in da72b87374
### Why are the changes needed?
Bug fix: char/varchar is only allowed in table schemas while `spark.sql.legacy.charVarcharAsString=false`
### Does this PR introduce _any_ user-facing change?
Yes, char/varchar types now fail when defining Structured Streaming readers while `spark.sql.legacy.charVarcharAsString=false`
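A minimal illustration of the new behavior (the format and path below are made up for the example):
```scala
// Assumption for illustration: a parquet source dir at /tmp/stream-src.
// With the legacy flag off (the default), char/varchar in a streaming
// reader schema is now rejected up front with an AnalysisException.
spark.conf.set("spark.sql.legacy.charVarcharAsString", "false")
val stream = spark.readStream
  .schema("v VARCHAR(10)")   // fails after this change
  .format("parquet")
  .load("/tmp/stream-src")
```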
### How was this patch tested?
new tests
Closes #31003 from yaooqinn/SPARK-33980.
Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR proposes to upgrade cloudpickle from 1.5.0 to 1.6.0.
It virtually contains one fix:
4510be850d
From a cursory look, this isn't a regression, and the case isn't even properly supported in Python:
```python
>>> import pickle
>>> pickle.dumps({}.keys())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: cannot pickle 'dict_keys' object
```
So it seems fine not to backport.
### Why are the changes needed?
To leverage bug fixes from the cloudpickle upstream.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Jenkins build and GitHub actions build will test it out.
Closes #31007 from HyukjinKwon/cloudpickle-upgrade.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR intends to add a new option `--cbo` to enable CBO in TPCDSQueryBenchmark. I think this option is useful for monitoring performance changes with CBO enabled.
### Why are the changes needed?
To monitor performance changes with CBO enabled.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manually checked.
Closes #31011 from maropu/AddOptionForCBOInTPCDSBenchmark.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Remove partition data by `ALTER TABLE .. DROP PARTITION` in V2 table catalog used in tests.
### Why are the changes needed?
This is a bug fix. Before the fix, `ALTER TABLE .. DROP PARTITION` does not remove the data belonging to the dropped partition. As a consequence, the `select` query returns the removed data.
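A minimal sketch of the idea behind the fix, assuming illustrative names for the in-memory structures (not the actual `InMemoryPartitionTable` fields):
```scala
import java.util.concurrent.ConcurrentHashMap
import org.apache.spark.sql.catalyst.InternalRow

// Sketch: dropping a partition must discard its rows, not just its metadata.
class PartitionedTestTableSketch {
  private val partitions = new ConcurrentHashMap[InternalRow, java.util.Map[String, String]]()
  private val rows = new ConcurrentHashMap[InternalRow, Seq[InternalRow]]()

  def dropPartition(ident: InternalRow): Boolean = {
    rows.remove(ident)               // the missing piece: remove the data too
    partitions.remove(ident) != null
  }
}
```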
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running tests suites for v1 and v2 catalogs:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite"
```
Closes #31014 from MaxGekk/fix-drop-partition-v2.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR upgrades Py4J from 0.10.9 to 0.10.9.1, which contains some bug fixes and improvements.
It contains one bug fix (4152353ac1).
### Why are the changes needed?
To leverage fixes from the upstream in Py4J.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Jenkins build and GitHub Actions will test it out.
Closes #31009 from HyukjinKwon/SPARK-33984.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR proposes to implement `DESCRIBE COLUMN` for v2 tables.
Note that the `isExtended` option is not implemented in this PR.
### Why are the changes needed?
Parity with v1 tables.
### Does this PR introduce _any_ user-facing change?
Yes, now, `DESCRIBE COLUMN` works for v2 tables.
```scala
sql("CREATE TABLE testcat.tbl (id bigint, data string COMMENT 'hello') USING foo")
sql("DESCRIBE testcat.tbl data").show
```
```
+---------+----------+
|info_name|info_value|
+---------+----------+
| col_name| data|
|data_type| string|
| comment| hello|
+---------+----------+
```
Before this PR, the command would fail with: `Describing columns is not supported for v2 tables.`
### How was this patch tested?
Added new test.
Closes #30881 from imback82/describe_col_v2.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In hive-1.2.1, the Hive serde simply splits `serdeConstants.LIST_COLUMNS` and `serdeConstants.LIST_COLUMN_TYPES` on commas.
When we run the following UT with Spark 2.4:
```
test("insert overwrite directory with comma col name") {
withTempDir { dir =>
val path = dir.toURI.getPath
val v1 =
s"""
| INSERT OVERWRITE DIRECTORY '${path}'
| STORED AS TEXTFILE
| SELECT 1 as a, 'c' as b, if(1 = 1, "true", "false")
""".stripMargin
sql(v1).explain(true)
sql(v1).show()
}
}
```
it fails as below, since a column name contains `,` and the column-name and column-type lists then differ in size:
```
19:56:05.618 ERROR org.apache.spark.sql.execution.datasources.FileFormatWriter: [ angerszhu ] Aborting job dd774f18-93fa-431f-9468-3534c7d8acda.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.hadoop.hive.serde2.SerDeException: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 5 elements while columns.types has 3 elements!
at org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:145)
at org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.<init>(LazySerDeParameters.java:85)
at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125)
at org.apache.spark.sql.hive.execution.HiveOutputWriter.<init>(HiveFileFormat.scala:119)
at org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103)
at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:120)
at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:108)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:287)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:219)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:218)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$12.apply(Executor.scala:461)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:467)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
```
Since hive-2.3, Hive sets COLUMN_NAME_DELIMITER to a special char when a column name contains ',':
6f4c35c9e9/metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreUtils.java (L1180-L1188)
6f4c35c9e9/metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreUtils.java (L1044-L1075)
In script transform, we already parse column names to avoid this problem:
554600c2af/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationExec.scala (L257-L261)
So I think we should do the same thing in `InsertIntoHiveDirCommand` too. I have verified that this approach makes Spark 2.4 work well.
### Why are the changes needed?
Safer use of the serde.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Closes #30850 from AngersZhuuuu/SPARK-33844.
Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Remove the special handling of `CacheTable` in `TestHiveQueryExecution.analyzed` because it prevents supporting `spark_catalog` in Hive table names. `spark_catalog` can be handled by the few lines below:
```scala
case UnresolvedRelation(ident, _, _) =>
if (ident.length > 1 && ident.head.equalsIgnoreCase(CatalogManager.SESSION_CATALOG_NAME)) {
```
added by https://github.com/apache/spark/pull/30883.
### Why are the changes needed?
1. To have feature parity with v1 In-Memory catalog.
2. To be able to write unified tests for In-Memory and Hive external catalogs.
### Does this PR introduce _any_ user-facing change?
Should not.
### How was this patch tested?
By running the test suite with new UT:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *CachedTableSuite"
```
Closes #30997 from MaxGekk/cache-table-spark_catalog.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
The JDBC SQL TIME type is incorrectly represented as TimestampType; we change it to a physical Int in milliseconds for now.
### Why are the changes needed?
Currently, for JDBC, the SQL TIME type is incorrectly represented as Spark TimestampType. It should instead be represented as a physical int in milliseconds: a time of day, with no reference to a particular calendar, time zone, or date, with a precision of one millisecond, storing the number of milliseconds after midnight, 00:00:00.000.
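For instance, here is the encoding worked out for one value (a plain arithmetic illustration):
```scala
// 01:02:03.456 as milliseconds after midnight:
val millis = (1 * 3600 + 2 * 60 + 3) * 1000 + 456  // = 3723456
```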
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Closes #30902 from saikocat/SPARK-33888.
Lead-authored-by: Hoa <hoameomu@gmail.com>
Co-authored-by: Hoa <saikocatz@gmail.com>
Co-authored-by: Duc Hoa, Nguyen <hoa.nd@teko.vn>
Co-authored-by: Duc Hoa, Nguyen <hoameomu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In Hive, we can always write:
```
add file /path/to/script.py;
select transform(col1, col2, ..)
using 'script.py' as (col1, col2, ...)
from ...
```
Since Spark wraps the script command with `/bin/bash -c`, this case throws `script.py: command not found`.
This PR adds the SparkFiles root dir path to the subprocess's `PATH` environment variable, so the subprocess can find `script.py` as a program under `PATH`.
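A minimal sketch of the idea (the script name is illustrative; `SparkFiles.getRootDirectory()` is the helper for files shipped via `add file`):
```scala
import org.apache.spark.SparkFiles

// Prepend the SparkFiles root dir to PATH for the child process, so a file
// shipped via `add file` resolves by its bare name, e.g. `script.py`.
val builder = new ProcessBuilder("script.py")
val env = builder.environment()
env.put("PATH", SparkFiles.getRootDirectory() + java.io.File.pathSeparator + env.get("PATH"))
```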
### Why are the changes needed?
Support SQL migration from Hive to Spark.
### Does this PR introduce _any_ user-facing change?
Users can directly use the script file name as the program in script transform SQL.
```
add file /path/to/script.py;
select transform(col1, col2, ..)
using 'script.py' as (col1, col2, ...)
from ...
```
### How was this patch tested?
UT
Closes #30973 from AngersZhuuuu/SPARK-33934.
Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR fixes missing rowCount statistics for some operators when CBO is enabled, e.g.:
```scala
spark.range(1000).selectExpr("id as a", "id as b").write.saveAsTable("t1")
spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS")
spark.sql("set spark.sql.cbo.enabled=true")
spark.sql("set spark.sql.cbo.planStats.enabled=true")
spark.sql("select * from (select * from t1 distribute by a limit 100) distribute by b").explain("cost")
```
Before this pr:
```
== Optimized Logical Plan ==
RepartitionByExpression [b#2129L], Statistics(sizeInBytes=2.3 KiB)
+- GlobalLimit 100, Statistics(sizeInBytes=2.3 KiB, rowCount=100)
+- LocalLimit 100, Statistics(sizeInBytes=23.4 KiB)
+- RepartitionByExpression [a#2128L], Statistics(sizeInBytes=23.4 KiB)
+- Relation[a#2128L,b#2129L] parquet, Statistics(sizeInBytes=23.4 KiB, rowCount=1.00E+3)
```
After this pr:
```
== Optimized Logical Plan ==
RepartitionByExpression [b#2129L], Statistics(sizeInBytes=2.3 KiB, rowCount=100)
+- GlobalLimit 100, Statistics(sizeInBytes=2.3 KiB, rowCount=100)
+- LocalLimit 100, Statistics(sizeInBytes=23.4 KiB, rowCount=1.00E+3)
+- RepartitionByExpression [a#2128L], Statistics(sizeInBytes=23.4 KiB, rowCount=1.00E+3)
+- Relation[a#2128L,b#2129L] parquet, Statistics(sizeInBytes=23.4 KiB, rowCount=1.00E+3)
```
### Why are the changes needed?
[`JoinEstimation.estimateInnerOuterJoin`](d6a68e0b67/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala (L55-L156)) needs the row count.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test.
Closes #30987 from wangyum/SPARK-33954.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
The error messages for specifying filter and distinct for the aggregate function are mixed together and should be separated. This can increase readability and ease of use.
### Why are the changes needed?
increase readability and ease of use.
### Does this PR introduce _any_ user-facing change?
Yes.
### How was this patch tested?
Jenkins test
Closes #30982 from beliefer/SPARK-33951.
Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Invoke `refreshTable()` from `AlterTableDropPartitionCommand.run()` after partitions dropping. In particular, this invalidates the cache associated with the modified table.
### Why are the changes needed?
This fixes the issues portrayed by the example:
```sql
spark-sql> CREATE TABLE tbl1 (col0 int, part0 int) USING parquet PARTITIONED BY (part0);
spark-sql> INSERT INTO tbl1 PARTITION (part0=0) SELECT 0;
spark-sql> INSERT INTO tbl1 PARTITION (part0=1) SELECT 1;
spark-sql> CACHE TABLE tbl1;
spark-sql> SELECT * FROM tbl1;
0 0
1 1
spark-sql> ALTER TABLE tbl1 DROP PARTITION (part0=0);
spark-sql> SELECT * FROM tbl1;
0 0
1 1
```
The last query must not return `0 0`, since that row was deleted by the previous command.
### Does this PR introduce _any_ user-facing change?
Yes. After the changes for the example above:
```sql
...
spark-sql> ALTER TABLE tbl1 DROP PARTITION (part0=0);
spark-sql> SELECT * FROM tbl1;
1 1
```
### How was this patch tested?
By running the affected test suite:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite"
```
Closes #30983 from MaxGekk/drop-partition-refresh-cache.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In https://github.com/apache/spark/pull/21632/files#diff-0fdae8a6782091746ed20ea43f77b639f9c6a5f072dd2f600fcf9a7b37db4f47, a new field `rawCount` was added into `NodeData`, which causes tree models trained in 2.4 to fail to load in 3.0/3.1/master.
The field `rawCount` is only used in training, and not in `transform`/`predict`/`featureImportance`, so I just set it to -1L.
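A hedged sketch of the compatibility fallback (the row access is illustrative, not Spark's exact `NodeData` loader):
```scala
import org.apache.spark.sql.Row

// Default rawCount to -1L when 2.4-era metadata lacks the field; it is only
// used during training, so -1L is safe for transform/predict.
def readRawCount(nodeRow: Row): Long =
  if (nodeRow.schema.fieldNames.contains("rawCount")) nodeRow.getAs[Long]("rawCount")
  else -1L
```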
### Why are the changes needed?
To support loading old tree models in 3.0/3.1/master.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added test suites.
Closes #30889 from zhengruifeng/fix_tree_load.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This patch proposes to add latest offset to source progress for streaming queries.
### Why are the changes needed?
Currently we record start and end offsets per source in the streaming progress. The latest offset is important information for a streaming process, but the progress lacks this info. We can use it to track the processing lag and adjust streaming queries. We should add the latest offset to the source progress.
### Does this PR introduce _any_ user-facing change?
Yes, a new metric for the latest source offset is added to the source progress.
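A hedged usage sketch (the query setup is illustrative; after this change `latestOffset` sits next to `startOffset`/`endOffset` on `SourceProgress`):
```scala
// Inspect per-source offsets, including the new latest offset, from progress.
val query = spark.readStream.format("rate").load()
  .writeStream.format("console").start()
Option(query.lastProgress).foreach { p =>
  p.sources.foreach { s =>
    println(s"start=${s.startOffset} end=${s.endOffset} latest=${s.latestOffset}")
  }
}
```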
### How was this patch tested?
Unit test. Manually test in Spark cluster:
```
"description" : "KafkaV2[Subscribe[page_view_events]]",
"startOffset" : {
"page_view_events" : {
"2" : 582370921,
"4" : 391910836,
"1" : 631009201,
"3" : 406601346,
"0" : 195799112
}
},
"endOffset" : {
"page_view_events" : {
"2" : 583764414,
"4" : 392338002,
"1" : 632183480,
"3" : 407101489,
"0" : 197304028
}
},
"latestOffset" : {
"page_view_events" : {
"2" : 589852545,
"4" : 394204277,
"1" : 637313869,
"3" : 409286602,
"0" : 203878962
}
},
"numInputRows" : 4999997,
"inputRowsPerSecond" : 29287.70501405811,
```
Closes #30988 from viirya/latest-offset.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This patch fixes an incorrect condition when comparing offset range size and min partition config.
### Why are the changes needed?
When calculating offset ranges, we consider the `minPartitions` configuration. If `minPartitions` is not set or is less than or equal to the size of the given ranges, there are already enough partitions at Kafka, so we don't need to split offsets to satisfy the min partition requirement. But the current condition is `offsetRanges.size > minPartitions.get`, which is not correct: `getRanges` currently splits offsets in unnecessary cases.
Besides, in the non-split case, we can assign the preferred executor location and reuse the `KafkaConsumer`. So unnecessarily splitting offset ranges misses the chance to reuse `KafkaConsumer`s.
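A minimal sketch of the corrected predicate (names are illustrative, not the exact reader code):
```scala
// Split offset ranges only when minPartitions is set AND strictly exceeds
// the number of ranges; otherwise keep ranges as-is and reuse consumers.
def shouldDivideRanges(numRanges: Int, minPartitions: Option[Int]): Boolean =
  minPartitions.exists(_ > numRanges)
```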
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test.
Manual test in Spark cluster with Kafka.
Closes #30994 from viirya/ss-minor4.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Skip table stats in canonicalizing of `HiveTableRelation`.
### Why are the changes needed?
The changes fix a regression compared to Spark 3.0; see SPARK-33963.
### Does this PR introduce _any_ user-facing change?
Yes. After the changes, Spark behaves as in version 3.0.1.
### How was this patch tested?
By running new UT:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *CachedTableSuite"
```
Closes #30995 from MaxGekk/fix-caching-hive-table.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR improves the statistics estimation of `Tail`:
```scala
spark.sql("set spark.sql.cbo.enabled=true")
spark.range(100).selectExpr("id as a", "id as b", "id as c", "id as e").write.saveAsTable("t1")
println(Tail(Literal(5), spark.sql("SELECT * FROM t1").queryExecution.logical).queryExecution.stringWithStats)
```
Before this pr:
```
== Optimized Logical Plan ==
Tail 5, Statistics(sizeInBytes=3.8 KiB)
+- Relation[a#24L,b#25L,c#26L,e#27L] parquet, Statistics(sizeInBytes=3.8 KiB)
```
After this pr:
```
== Optimized Logical Plan ==
Tail 5, Statistics(sizeInBytes=200.0 B, rowCount=5)
+- Relation[a#24L,b#25L,c#26L,e#27L] parquet, Statistics(sizeInBytes=3.8 KiB)
```
### Why are the changes needed?
Improve statistics estimation.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test.
Closes #30991 from wangyum/SPARK-33959.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to upgrade SBT to 1.4.6 to fix the SBT regression.
### Why are the changes needed?
[SBT 1.4.6](https://github.com/sbt/sbt/releases/tag/v1.4.6) has the following fixes
- Updates to Coursier 2.0.8, which fixes the cache directory setting on Windows
- Fixes performance regression in shell tab completion
- Fixes match error when using withDottyCompat
- Fixes thread-safety in AnalysisCallback handler
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs.
Closes #30993 from dongjoon-hyun/SPARK-SBT-1.4.6.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR adds rowCount for the `Range` operator:
```scala
spark.sql("set spark.sql.cbo.enabled=true")
spark.sql("select id from range(100)").explain("cost")
```
Before this pr:
```
== Optimized Logical Plan ==
Range (0, 100, step=1, splits=None), Statistics(sizeInBytes=800.0 B)
```
After this pr:
```
== Optimized Logical Plan ==
Range (0, 100, step=1, splits=None), Statistics(sizeInBytes=800.0 B, rowCount=100)
```
### Why are the changes needed?
[`JoinEstimation.estimateInnerOuterJoin`](d6a68e0b67/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala (L55-L156)) needs the row count.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test.
Closes #30989 from wangyum/SPARK-33956.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR aims to update commons-lang3 to 3.11 to support Java 16+ better.
### Why are the changes needed?
commons-lang3 has the following bug fixes and Java 16 support.
- https://commons.apache.org/proper/commons-lang/changes-report.html#a3.11
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
Pass the CIs.
Closes #30990 from williamhyun/Commons-lang3.
Authored-by: William Hyun <williamhyun3@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Check whether `executorSummary.peakMemoryMetrics` is defined before accessing it. Without the check, the UI risks getting stuck at the Executors page.
### Why are the changes needed?
The app live UI may get stuck at the Executors page without this fix.
Steps to reproduce (with the master branch): in macOS standalone mode, open a spark-shell with `$SPARK_HOME/bin/spark-shell --master spark://localhost:7077`, then run `val x = sc.makeRDD(1 to 100000, 5)` followed by `x.count()`.
Then open the app UI in the browser and click the Executors page; it will get stuck at this page:
![image](https://user-images.githubusercontent.com/26694233/103105677-ca1a7380-45f4-11eb-9245-c69f4a4e816b.png)
Also, the JSON returned from the API endpoint http://localhost:4040/api/v1/applications/app-20201224134418-0003/executors misses "peakMemoryMetrics" for executor objects. I attached the full JSON text in https://issues.apache.org/jira/browse/SPARK-33906.
I debugged it and observed that `ExecutorMetricsPoller.getExecutorUpdates` returns an empty map, which causes `peakExecutorMetrics` to be `None` in https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/LiveEntity.scala#L345. The likely reason for the empty map is that the stage completion time is shorter than the heartbeat interval, so the stage entry in `stageTCMP` has already been removed before `reportHeartbeat` is called.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual test, rerun the steps of bug reproduce and see the bug is gone.
Closes #30920 from baohe-zhang/SPARK-33906.
Authored-by: Baohe Zhang <baohe.zhang@verizonmedia.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
While using SparkSession's initial options to generate the sharable Spark conf and Hadoop conf in `SharedState`, we should put the log statement inside the code block where the warehouse keys are handled.
### Why are the changes needed?
Bug fix: remove the ambiguous log message emitted when setting `spark.sql.warehouse.dir` via `SparkSession.builder.config`, and warn only when `hive.metastore.warehouse.dir` is set.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
new tests
Closes #30978 from yaooqinn/SPARK-33944.
Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Fix UT according to https://github.com/apache/spark/pull/29966#issuecomment-752830046
Change the `StructType` construction from
```
def inputSchema: StructType = StructType(StructField("inputColumn", LongType) :: Nil)
```
to
```
def inputSchema: StructType = new StructType().add("inputColumn", LongType)
```
The whole UDF class is:
```
package org.apache.spark.examples.sql

import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

class Spark33084 extends UserDefinedAggregateFunction {
  // Data types of input arguments of this aggregate function
  def inputSchema: StructType = new StructType().add("inputColumn", LongType)
  // Data types of values in the aggregation buffer
  def bufferSchema: StructType =
    new StructType().add("sum", LongType).add("count", LongType)
  // The data type of the returned value
  def dataType: DataType = DoubleType
  // Whether this function always returns the same output on the identical input
  def deterministic: Boolean = true
  // Initializes the given aggregation buffer. The buffer itself is a `Row` that in addition to
  // standard methods like retrieving a value at an index (e.g., get(), getBoolean()), provides
  // the opportunity to update its values. Note that arrays and maps inside the buffer are still
  // immutable.
  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0L
    buffer(1) = 0L
  }
  // Updates the given aggregation buffer `buffer` with new input data from `input`
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      buffer(0) = buffer.getLong(0) + input.getLong(0)
      buffer(1) = buffer.getLong(1) + 1
    }
  }
  // Merges two aggregation buffers and stores the updated buffer values back to `buffer1`
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
    buffer1(1) = buffer1.getLong(1) + buffer2.getLong(1)
  }
  // Calculates the final result
  def evaluate(buffer: Row): Double = buffer.getLong(0).toDouble / buffer.getLong(1)
}
```
### Why are the changes needed?
Fix the UT for Scala 2.13.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existed UT
Closes #30980 from AngersZhuuuu/spark-33084-followup.
Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR proposes to make the worker/executor decommission signal configurable.
* Added confs: `spark.worker.decommission.signal` / `spark.executor.decommission.signal` (see the sketch after this list)
* Rename `WorkerSigPWRReceived`/ `ExecutorSigPWRReceived` to `WorkerDecomSigReceived`/ `ExecutorDecomSigReceived`
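A hedged sketch of picking a non-default signal via the new confs (the value is illustrative; in practice these would typically live in `spark-defaults.conf`):
```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.worker.decommission.signal", "TERM")
  .set("spark.executor.decommission.signal", "TERM")
```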
### Why are the changes needed?
The current signal `PWR` doesn't work on macOS, since `PWR` is not part of POSIX and macOS only provides POSIX-compliant signals. So developers currently can't run end-to-end decommission tests in their macOS environments.
Besides, the configuration becomes more flexible for users in case the default signal (`PWR`) conflicts with their own applications/environments.
### Does this PR introduce _any_ user-facing change?
No (it's a new API for 3.2)
### How was this patch tested?
Manually tested.
Closes #30968 from Ngone51/configurable-decom-signal.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Removed **hiveClientCalls.count** from the CodeGenerator metrics under component instance = Executor.
### Why are the changes needed?
Wrong information regarding metrics was displayed in the monitoring documentation. I had referred to this documentation when adding metrics logging to Graphite. This metric was not being reported, and I had to check whether the issue was in my application, the Spark code, or the documentation. The documentation had the wrong info.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual, checked it on my forked repository feature branch [SPARK-33942](https://github.com/coderbond007/spark/blob/SPARK-33942/docs/monitoring.md)
Closes #30976 from coderbond007/SPARK-33942.
Authored-by: Pradyumn Agrawal (pradyumn.ag) <pradyumn.ag@media.net>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
There are only 3 compilation warnings related to `view bounds are deprecated` in Spark code:
```
[WARNING] /spark-source/core/src/main/scala/org/apache/spark/rdd/SequenceFileRDDFunctions.scala:35: view bounds are deprecated; use an implicit parameter instead.
[WARNING] /spark-source/core/src/main/scala/org/apache/spark/rdd/SequenceFileRDDFunctions.scala:35: view bounds are deprecated; use an implicit parameter instead.
[WARNING] /spark-source/core/src/main/scala/org/apache/spark/rdd/SequenceFileRDDFunctions.scala:55: view bounds are deprecated; use an implicit parameter instead.
```
This PR fixes these compilation warnings.
### Why are the changes needed?
Fix compilation warnings about `view bounds are deprecated`.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass the Jenkins or GitHub Action
Closes #30924 from LuciferYang/SPARK-33804.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
As a follow-up task to SPARK-32958, this patch takes the safer approach of only pruning columns from `JsonToStructs` if the parsing options are empty, to avoid unexpected behavior changes regarding parsing.
This patch also adds a few e2e tests to make sure failfast parsing behavior is not changed.
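A minimal sketch of the guard, assuming the surrounding pruning rule is in place (`options` is the real field on `JsonToStructs`):
```scala
import org.apache.spark.sql.catalyst.expressions.JsonToStructs

// Prune the parse schema only when no parsing options are set, so modes such
// as FAILFAST keep their exact behavior.
def safeToPrune(j: JsonToStructs): Boolean = j.options.isEmpty
```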
### Why are the changes needed?
It is to avoid unexpected behavior change regarding parsing.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test.
Closes #30970 from viirya/SPARK-33907-3.2.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
In the `saveAsTable()` and `insertInto()` methods of `DataFrameWriter`, recognize `spark_catalog` as the default session catalog in table names.
### Why are the changes needed?
1. To simplify writing of unified v1 and v2 tests
2. To improve Spark SQL user experience. `insertInto()` should have feature parity with the `INSERT INTO` SQL command. Currently, `insertInto()` fails on a table from a namespace in `spark_catalog`:
```scala
scala> sql("CREATE NAMESPACE spark_catalog.ns")
scala> Seq(0).toDF().write.saveAsTable("spark_catalog.ns.tbl")
org.apache.spark.sql.AnalysisException: Couldn't find a catalog to handle the identifier spark_catalog.ns.tbl.
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:629)
... 47 elided
scala> Seq(0).toDF().write.insertInto("spark_catalog.ns.tbl")
org.apache.spark.sql.AnalysisException: Couldn't find a catalog to handle the identifier spark_catalog.ns.tbl.
at org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:498)
... 47 elided
```
but `INSERT INTO` succeeds:
```sql
spark-sql> create table spark_catalog.ns.tbl (c int);
spark-sql> insert into spark_catalog.ns.tbl select 0;
spark-sql> select * from spark_catalog.ns.tbl;
0
```
### Does this PR introduce _any_ user-facing change?
Yes. After the changes for the example above:
```scala
scala> Seq(0).toDF().write.saveAsTable("spark_catalog.ns.tbl")
scala> Seq(1).toDF().write.insertInto("spark_catalog.ns.tbl")
scala> spark.table("spark_catalog.ns.tbl").show(false)
+-----+
|value|
+-----+
|0 |
|1 |
+-----+
```
### How was this patch tested?
By running the affected test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.ShowPartitionsSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.FileFormatWriterSuite"
```
Closes #30919 from MaxGekk/insert-into-spark_catalog.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
1. Replace `SessionCatalogAndNamespace` with `DatabaseInSessionCatalog` when resolving the database name from the v1 session catalog.
2. Throw more precise errors from `DatabaseInSessionCatalog`
3. Fix expected error messages in `v1.ShowTablesSuiteBase`
Closes #30947
### Why are the changes needed?
The current error message "multi-part identifier cannot be empty" may confuse users, and it is just a consequence of an "incorrectly" applied implicit class. For example, take `SHOW TABLES IN spark_catalog`:
1. Spark cuts off `spark_catalog` from namespaces in `SessionCatalogAndNamespace`, so, `ns == Seq.empty` here: 0617dfce7b/sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala (L365)
2. Then `ns.length != 1` is `true` and Spark tries to raise the exception at 0617dfce7b/sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala (L367)
3. ... but `ns.quoted` triggers implicit wrapping of `Seq.empty` by `MultipartIdentifierHelper`, and hits the second check `if (parts.isEmpty)` at 156704ba0d/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Implicits.scala (L120-L122)
So, Spark throws the exception at the third step instead of `new AnalysisException(s"The database name is not valid: $quoted")` at the second step. And even at the second step, the exception doesn't show the actual reason, as it is pretty generic.
### Does this PR introduce _any_ user-facing change?
Yes, in the case of v1 DDL commands when a database is not specified or a nested database is set.
### How was this patch tested?
By running the affected test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *DDLSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *ShowTablesSuite"
```
Closes #30963 from MaxGekk/database-in-session-catalog.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to fix the `Dockerfile` for Spark release.
- Port b135db3b1a to `Dockerfile`
- Upgrade Ubuntu 18.04 -> 20.04 (because of porting b135db3)
- Remove Python 2 (because of Ubuntu upgrade)
- Use built-in Python 3.8.5 (because of Ubuntu upgrade)
- Node.js 11 -> 12 (because of Ubuntu upgrade)
- Ruby 2.5 -> 2.7 (because of Ubuntu upgrade)
- Python dependencies and Jekyll + plugins upgrade to the latest as it's used in GitHub Actions build (unrelated to the issue itself)
### Why are the changes needed?
To make a Spark release :-).
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
Manually tested via:
```bash
cd dev/create-release/spark-rm
docker build -t spark-rm --build-arg UID=$UID .
```
```
...
Successfully built 516d7943634f
Successfully tagged spark-rm:latest
```
Closes #30971 from HyukjinKwon/SPARK-33927.
Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This minor patch makes two variables that call `fetchEarliestOffsets` `lazy`, because their values are not always needed.
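A minimal sketch of the pattern (names are illustrative, not the actual reader fields):
```scala
import org.apache.kafka.common.TopicPartition

// Caching the RPC result behind a lazy val defers the Kafka call until the
// offsets are actually needed, and skips it entirely when they are not.
class OffsetReaderSketch(fetchEarliestOffsets: () => Map[TopicPartition, Long]) {
  lazy val earliestOffsets: Map[TopicPartition, Long] = fetchEarliestOffsets()
}
```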
### Why are the changes needed?
To avoid unnecessary Kafka RPC calls.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test.
Closes #30969 from viirya/ss-minor3.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
The current implementations of trim/trimleft/trimright are somewhat redundant.
### Why are the changes needed?
Improve the implementation of trim/trimleft/trimright.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Jenkins test
Closes #30905 from beliefer/SPARK-33890.
Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In Hive script transform serde mode, the default NULL format is `\\N`:
```
String nullString = tbl.getProperty(
    serdeConstants.SERIALIZATION_NULL_FORMAT, "\\N");
nullSequence = new Text(nullString);
```
I previously made a mistake in thinking that Spark's code needed a fix here to keep the same behavior as Hive. So this PR adds some test cases to cover the issue.
### Why are the changes needed?
add UT
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added UT
Closes #30946 from AngersZhuuuu/SPARK-32684.
Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
For the liveness check when `checkAllContainers` is not set, we check the liveness status of the Spark container if we can find it.
### Why are the changes needed?
Some environments may deploy long-lived log-collecting sidecars which outlive the Spark application. Just because they remain alive does not mean the Spark executor should keep running.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Extended the existing pod status tests.
Closes #30892 from holdenk/SPARK-33874-handle-long-lived-sidecars.
Lead-authored-by: Holden Karau <hkarau@apple.com>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This patch cleans up KafkaOffsetReader API document.
### Why are the changes needed?
The KafkaOffsetReader API documents are duplicated between KafkaOffsetReaderConsumer and KafkaOffsetReaderAdmin, so it seems better to centralize them.
This also adds missing API docs.
### Does this PR introduce _any_ user-facing change?
No, dev only.
### How was this patch tested?
Doc only.
Closes #30961 from viirya/SPARK-33932.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add the `since` tag to methods and interfaces added recently.
### Why are the changes needed?
1. To follow the existing convention for Spark API.
2. To inform devs when Spark API was changed.
### Does this PR introduce _any_ user-facing change?
Should not.
### How was this patch tested?
`dev/scalastyle`
Closes #30966 from MaxGekk/spark-23889-interfaces-followup.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR fixes the removal of `CaseWhen` when `elseValue` is empty and the other outputs are null: the rule should also take determinism into account.
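A hedged sketch of the intended guard (not the exact optimizer rule code):
```scala
import org.apache.spark.sql.catalyst.expressions.{CaseWhen, Literal}

// Only collapse a CaseWhen with no else branch and all-null outputs when
// every branch condition is deterministic.
def canCollapseToNull(cw: CaseWhen): Boolean =
  cw.elseValue.isEmpty && cw.branches.forall { case (cond, value) =>
    cond.deterministic && value.semanticEquals(Literal(null, value.dataType))
  }
```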
### Why are the changes needed?
Fix bug.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test.
Closes #30960 from wangyum/SPARK-33847-2.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add the version 3.2.0 to new method `renamePartition()` in the `SupportsPartitionManagement` interface.
### Why are the changes needed?
To inform Spark devs when the method appears in the interface.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
`./dev/scalastyle`
Closes #30964 from MaxGekk/alter-table-rename-partition-v2-followup.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
For the same SQL:
```
SELECT TRANSFORM(a, b, c, null)
ROW FORMAT DELIMITED
USING 'cat'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '&'
FROM (select 1 as a, 2 as b, 3 as c) t
```
In hive:
```
hive> SELECT TRANSFORM(a, b, c, null)
> ROW FORMAT DELIMITED
> USING 'cat'
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '&'
> FROM (select 1 as a, 2 as b, 3 as c) t;
OK
123\N NULL
Time taken: 14.519 seconds, Fetched: 1 row(s)
```
In Spark
```
Spark master: local[*], Application Id: local-1609225830376
spark-sql> SELECT TRANSFORM(a, b, c, null)
> ROW FORMAT DELIMITED
> USING 'cat'
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '&'
> FROM (select 1 as a, 2 as b, 3 as c) t;
1 2 3 null NULL
Time taken: 4.297 seconds, Fetched 1 row(s)
spark-sql>
```
We should keep the same behavior: change the default ROW FORMAT FIELD DELIMITER to `\u0001`.
In Hive, the default value is '1', which as a char is '\u0001':
```
bucket_count -1
column.name.delimiter ,
columns
columns.comments
columns.types
file.inputformat org.apache.hadoop.hive.ql.io.NullRowsInputFormat
```
### Why are the changes needed?
Keep the same behavior as Hive.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added UT
Closes #30958 from AngersZhuuuu/SPARK-33930.
Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Introduce an allowlist when pushing expressions into (if / case) branches, to fix a potential bug.
### Why are the changes needed?
Fix a potential bug.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing test.
Closes #30955 from wangyum/SPARK-33848-2.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Move the check that the seed expression is legal into `CheckAnalysis`.
### Why are the changes needed?
It's better to check that the seed expression is legal on the analyzer side instead of at execution time, so users get the exception as early as possible.
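A hedged illustration, assuming the legality check covers non-constant seeds:
```scala
// The non-foldable seed (column `id`) is now rejected during analysis
// instead of surfacing as a failure mid-execution.
spark.range(10).selectExpr("rand(id)").collect()
// => AnalysisException after this change
```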
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Add test.
Closes #30923 from ulysses-you/SPARK-33909.
Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
1. Add `renamePartition()` to the `SupportsPartitionManagement` interface (see the sketch after this list)
2. Implement `renamePartition()` in `InMemoryPartitionTable`
3. Add v2 execution node `AlterTableRenamePartitionExec`
4. Resolve the logical node `AlterTableRenamePartition` to `AlterTableRenamePartitionExec` for v2 tables that support `SupportsPartitionManagement`
5. Move v1 tests to the base suite `org.apache.spark.sql.execution.command.AlterTableRenamePartitionSuiteBase` to run them for v2 table catalogs.
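A hedged sketch of the new method's shape (the signature is inferred from the description, not copied from the actual Java interface):
```scala
import org.apache.spark.sql.catalyst.InternalRow

trait SupportsPartitionManagementSketch {
  /** Renames partition `from` to `to`; returns true on success. @since 3.2.0 */
  def renamePartition(from: InternalRow, to: InternalRow): Boolean
}
```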
### Why are the changes needed?
To have feature parity with Datasource V1.
### Does this PR introduce _any_ user-facing change?
Yes
### How was this patch tested?
By running the unified tests:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRenamePartitionSuite"
```
Closes #30935 from MaxGekk/alter-table-rename-partition-v2.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR is a follow-up of SPARK-33775. The main change is to sync the suppression rules from `SparkBuild.scala` to `pom.xml`, so that the Maven build has the same ability to suppress compilation warnings in Scala 2.13.
### Why are the changes needed?
Suppress unimportant compilation warnings in Scala 2.13 with maven build.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Pass the Jenkins or GitHub Action
- Local manual test: the suppressed compilation warnings are no longer printed to the console.
Closes #30951 from LuciferYang/SPARK-33775-FOLLOWUP.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This patch proposes to do column pruning for CsvToStructs expression if we only require some fields from it.
### Why are the changes needed?
`CsvToStructs` takes a schema parameter used to tell the CSV parser what fields are needed. If `CsvToStructs` is followed by `GetStructField`, we can prune the schema to only parse certain fields.
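A minimal illustration of a query that benefits (spark-shell style; `spark.implicits._` assumed in scope):
```scala
import spark.implicits._

// Only field `a` of the parsed struct is referenced, so the optimizer may
// narrow the CSV parse schema down to that single field.
val df = Seq("1,2,3").toDF("csv")
  .selectExpr("from_csv(csv, 'a INT, b INT, c INT').a AS a")
df.explain(true)
```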
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test
Closes #30912 from viirya/SPARK-32968.
Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to recover GitHub Action `build_and_test` job.
### Why are the changes needed?
Currently, the `build_and_test` job fails to start because of the following error, in master/branch-3.1 at least:
```
r-lib/actions/setup-r@v1 is not allowed to be used in apache/spark.
Actions in this workflow must be: created by GitHub, verified in the GitHub Marketplace,
within a repository owned by apache or match the following:
adoptopenjdk/*, apache/*, gradle/wrapper-validation-action.
```
- https://github.com/apache/spark/actions/runs/449826457
![Screen Shot 2020-12-28 at 10 06 11 PM](https://user-images.githubusercontent.com/9700541/103262174-f1f13a80-4958-11eb-8ceb-631527155775.png)
### Does this PR introduce _any_ user-facing change?
No. This is a test infra.
### How was this patch tested?
To check GitHub Action `build_and_test` job on this PR.
Closes #30959 from dongjoon-hyun/SPARK-33931.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Use the testing mode for the test to fix its flakiness.
### Why are the changes needed?
The test is flaky:
```scala
[info] - SPARK-23365 Don't update target num executors when killing idle executors *** FAILED *** (126 milliseconds)
[info] 1 did not equal 2 (ExecutorAllocationManagerSuite.scala:1615)
[info] org.scalatest.exceptions.TestFailedException:
[info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530)
[info] at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529)
[info] at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
[info] at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503)
[info] at org.apache.spark.ExecutorAllocationManagerSuite.$anonfun$new$84(ExecutorAllocationManagerSuite.scala:1617)
...
```
The root cause should be the same as https://github.com/apache/spark/pull/29773, since the test ran in non-testing mode.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manually checked. The flakiness is gone after running the test hundreds of times with this fix.
Closes #30956 from Ngone51/fix-flaky-SPARK-23365.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR simplifies `CaseWhen` clauses with (true and false) and (false and true):
Expression | cond.nullable | After simplify
-- | -- | --
case when cond then true else false end | true | cond <=> true
case when cond then true else false end | false | cond
case when cond then false else true end | true | !(cond <=> true)
case when cond then false else true end | false | !cond
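A small illustration of the second row of the table above (`id > 5` is not nullable, so the branches collapse to the condition itself):
```scala
val df = spark.range(10).selectExpr(
  "CASE WHEN id > 5 THEN true ELSE false END AS c")
df.explain(true)  // the optimized plan should show (id > 5) instead of CASE WHEN
```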
### Why are the changes needed?
Improve query performance.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test.
Closes #30898 from wangyum/SPARK-33884.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>