### What changes were proposed in this pull request?
1. Recognize `spark_catalog` as the default session catalog in the checks of `TestHiveQueryExecution`.
2. Move v2 and v1 in-memory catalog test `"SPARK-33305: DROP TABLE should also invalidate cache"` to the common trait `command/DropTableSuiteBase`, and run it with v1 Hive external catalog.
### Why are the changes needed?
To run the in-memory catalog tests against the Hive catalog as well.
### Does this PR introduce _any_ user-facing change?
No, the changes affect only tests.
### How was this patch tested?
By running the affected test suites for `DROP TABLE`:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *DropTableSuite"
```
Closes #30883 from MaxGekk/fix-spark_catalog-hive-tests.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
```scala
val nestedStruct = new StructType()
.add(StructField("b", StringType).withComment("Nested comment"))
val struct = new StructType()
.add(StructField("a", nestedStruct).withComment("comment"))
struct.toDDL
```
Currently, this returns:
```
`a` STRUCT<`b`: STRING> COMMENT 'comment'
```
With this PR, the code above returns:
```
`a` STRUCT<`b`: STRING COMMENT 'Nested comment'> COMMENT 'comment'
```
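The intended behavior can be sketched outside Spark. The following is a minimal Python illustration, not Spark's implementation: `Field`, `Struct`, `type_sql`, `comment_sql`, and `field_ddl` are hypothetical names introduced here, and comment escaping is left out. It shows the comment being emitted at every nesting level rather than only at the top level.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class Field:
    """Hypothetical stand-in for a StructField: name, type, optional comment."""
    name: str
    data_type: Union[str, "Struct"]
    comment: Optional[str] = None


@dataclass
class Struct:
    """Hypothetical stand-in for a StructType: an ordered list of fields."""
    fields: List[Field]


def comment_sql(field: Field) -> str:
    # Escaping of quotes inside the comment is omitted in this sketch.
    return f" COMMENT '{field.comment}'" if field.comment is not None else ""


def type_sql(data_type: Union[str, Struct]) -> str:
    # Leaf types are rendered as-is; nested structs recurse into their fields,
    # carrying each nested field's comment along.
    if isinstance(data_type, str):
        return data_type
    inner = ", ".join(
        f"`{f.name}`: {type_sql(f.data_type)}{comment_sql(f)}"
        for f in data_type.fields
    )
    return f"STRUCT<{inner}>"


def field_ddl(field: Field) -> str:
    # Top-level fields separate name and type with a space, nested ones with a colon.
    return f"`{field.name}` {type_sql(field.data_type)}{comment_sql(field)}"


nested = Struct([Field("b", "STRING", "Nested comment")])
top = Field("a", nested, "comment")
print(field_ddl(top))
```

The key point is that `comment_sql` is called inside `type_sql` as well, which is what makes the nested comment appear.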
### Why are the changes needed?
My team is using nested columns as first-class citizens, and I thought it would be nice to have comments for nested columns.
### Does this PR introduce _any_ user-facing change?
Now, when users call something like this,
```scala
spark.table("foo.bar").schema.fields.map(_.toDDL).mkString(", ")
```
they will get comments for the nested columns.
### How was this patch tested?
I added unit tests under `org.apache.spark.sql.types.StructTypeSuite`. They check that a nested StructType's comment is included in the DDL string.
Closes #30851 from jacobhjkim/structtype-toddl.
Authored-by: Jacob Kim <me@jacobkim.io>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR tries to rename `dataSourceRewriteRules` into something more generic.
### Why are the changes needed?
These changes are needed to address the post-review discussion [here](https://github.com/apache/spark/pull/30558#discussion_r533885837).
### Does this PR introduce _any_ user-facing change?
Yes, but the changes haven't been released yet.
### How was this patch tested?
Existing tests.
Closes #30808 from aokolnychyi/spark-33784.
Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR adds logic to build logical writes introduced in SPARK-33779.
Note: This PR contains a subset of changes discussed in PR #29066.
### Why are the changes needed?
These changes are the next step as discussed in the [design doc](https://docs.google.com/document/d/1X0NsQSryvNmXBY9kcvfINeYyKC-AahZarUqg3nS1GQs/edit#) for SPARK-23889.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes #30806 from aokolnychyi/spark-33808.
Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add cases to match arrays whose element type is primitive.
### Why are the changes needed?
We get an exception when using `Literal.create(Array(1, 2, 3), ArrayType(IntegerType))`:
```
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Literal must have a corresponding value to array<int>, but class int[] found.
at scala.Predef$.require(Predef.scala:281)
at org.apache.spark.sql.catalyst.expressions.Literal$.validateLiteralValue(literals.scala:215)
at org.apache.spark.sql.catalyst.expressions.Literal.<init>(literals.scala:292)
at org.apache.spark.sql.catalyst.expressions.Literal$.create(literals.scala:140)
```
The same problem occurs with other arrays whose element type is primitive.
### Does this PR introduce _any_ user-facing change?
Yes.
### How was this patch tested?
Added a test.
Closes #30868 from ulysses-you/SPARK-33860.
Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
ORC supports filter pushdown, but this optimization reads file metadata from external storage even when the filters are empty.
This PR adds an extra `filters.nonEmpty` check when `spark.sql.orc.filterPushdown` is true.
### Why are the changes needed?
The ORC filter pushdown operation should be triggered only when `filters.nonEmpty` is true.
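The guard can be illustrated with a small Python sketch. This is not the Spark code: `build_search_argument`, its string-based filters, and the `read_file_schema` callback are all hypothetical stand-ins, with the callback playing the role of the expensive metadata read. The point is only the ordering: the cheap checks run first, so the read is skipped when there is nothing to push down.

```python
from typing import Callable, Dict, List, Optional


def build_search_argument(
    pushdown_enabled: bool,
    filters: List[str],
    read_file_schema: Callable[[], Dict[str, str]],
) -> Optional[str]:
    """Return a pushed-down predicate, or None when nothing should be pushed."""
    # Check the cheap conditions first so the metadata is read only when needed.
    if not (pushdown_enabled and filters):
        return None
    schema = read_file_schema()  # stands in for the expensive metadata read
    # Keep only filters whose leading column name exists in the file schema.
    usable = [f for f in filters if f.split()[0] in schema]
    return " AND ".join(usable) if usable else None


reads = []


def counting_reader() -> Dict[str, str]:
    # Records every call so the number of metadata reads can be inspected.
    reads.append(1)
    return {"id": "bigint"}


print(build_search_argument(True, [], counting_reader), len(reads))
print(build_search_argument(True, ["id > 1"], counting_reader), len(reads))
```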
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass Jenkins or GitHub Actions.
Closes #30663 from LuciferYang/pushdownfilter-when-filter-nonempty.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR removed an unused variable `CompressionCodec.DEFAULT_COMPRESSION_CODEC`.
### Why are the changes needed?
Apache Spark 3.0.0 centralized this default value to `IO_COMPRESSION_CODEC.defaultValue` via [SPARK-26462](https://github.com/apache/spark/pull/23447).
We had better remove this variable to avoid any potential confusion in the future.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CI compilation.
Closes #30880 from dongjoon-hyun/minor.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Verify ALTER TABLE CHANGE COLUMN with char and varchar types and avoid unexpected changes.
For v1 tables, changing the type is not allowed. This PR fixes a regression that used the replaced string type instead of the original char/varchar type when altering char/varchar columns.
For v2 tables, only the following changes are valid, and all others are invalid:
- char/varchar to string,
- char(x) to char(x),
- char(x)/varchar(x) to varchar(y) if x <= y.
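The v2 rules above can be written down as a small predicate. The following Python sketch is only an illustration of those rules, not the Spark implementation. `parse_type` and `is_valid_v2_change` are hypothetical helper names, and type names are handled as plain strings.

```python
import re
from typing import Optional, Tuple


def parse_type(type_name: str) -> Tuple[str, Optional[int]]:
    """Split 'char(5)' into ('char', 5); types without a length give (name, None)."""
    match = re.fullmatch(r"\s*(\w+)\s*(?:\(\s*(\d+)\s*\))?\s*", type_name.lower())
    if match is None:
        raise ValueError(f"cannot parse type: {type_name!r}")
    length = int(match.group(2)) if match.group(2) else None
    return match.group(1), length


def is_valid_v2_change(old: str, new: str) -> bool:
    """Apply the v2 rules to a char/varchar source column."""
    old_kind, old_len = parse_type(old)
    new_kind, new_len = parse_type(new)
    if old_kind not in ("char", "varchar") or old_len is None:
        raise ValueError("this sketch only covers char(n)/varchar(n) source columns")
    if new_kind == "string":
        return True  # char/varchar to string
    if new_len is None:
        return False  # any other target without a length is not covered by the rules
    if old_kind == "char" and new_kind == "char":
        return old_len == new_len  # char(x) to char(x)
    if new_kind == "varchar":
        return old_len <= new_len  # char(x)/varchar(x) to varchar(y) if x <= y
    return False  # everything else, e.g. varchar(x) to char(y)


for old, new in [("char(5)", "string"), ("char(5)", "varchar(4)"), ("varchar(5)", "varchar(8)")]:
    print(old, "->", new, is_valid_v2_change(old, new))
```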
### Why are the changes needed?
Verify ALTER TABLE CHANGE COLUMN with char and varchar types and avoid unexpected changes.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New tests.
Closes #30833 from yaooqinn/SPARK-33834.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
* Implement `SparkScriptTransformationExec` based on `BaseScriptTransformationExec`
* Implement `SparkScriptTransformationWriterThread`, based on `BaseScriptTransformationWriterThread`, for writing data
* Add the rule `SparkScripts` to convert a script LogicalPlan to a SparkPlan in Spark SQL (without Hive mode)
* Add `SparkScriptTransformationSuite` to test Spark-specific cases
* Add tests in `SQLQueryTestSuite`
And we will close #29085.
### Why are the changes needed?
Allow users to use script transformation without Hive.
### Does this PR introduce _any_ user-facing change?
Users can use script transformation without Hive in no-serde mode.
For example:
**default no serde**
```
SELECT TRANSFORM(a, b, c)
USING 'cat' AS (a int, b string, c long)
FROM testData
```
**no serde with a specified ROW FORMAT DELIMITED**
```
SELECT TRANSFORM(a, b, c)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY '\u0002'
MAP KEYS TERMINATED BY '\u0003'
LINES TERMINATED BY '\n'
NULL DEFINED AS 'null'
USING 'cat' AS (a, b, c)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY '\u0004'
MAP KEYS TERMINATED BY '\u0005'
LINES TERMINATED BY '\n'
NULL DEFINED AS 'NULL'
FROM testData
```
### How was this patch tested?
Added unit tests.
Closes #29414 from AngersZhuuuu/SPARK-32106-MINOR.
Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This PR aims to test all compression codecs for encrypted spilling.
### Why are the changes needed?
To improve test coverage. Currently, only `CompressionCodec.DEFAULT_COMPRESSION_CODEC` is tested.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs with the updated test cases.
Closes #30879 from dongjoon-hyun/SPARK-33873.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Reopened from https://github.com/apache/spark/pull/27525.
The exception messages raised by `dstream.py` for windowed operations were improved to state explicitly which slide duration they refer to.
### Why are the changes needed?
The batch interval of a dstream is improperly named a sliding window. The term sliding window is also used for the new window of a dstream collected over a window of RDDs in a parent dstream. We should probably fix the naming convention for sliding windows in the dstream class, but for now this more explicit exception message may reduce confusion.
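To make the distinction concrete, here is a minimal standalone sketch of such a check. It assumes the rule being enforced is that the window and slide durations must be whole multiples of the parent stream's batch interval. `validate_window_params` and the message wording are hypothetical and do not reproduce the text in `dstream.py`.

```python
from typing import Optional


def validate_window_params(
    window_s: float, slide_s: Optional[float], batch_interval_ms: int
) -> None:
    """Raise ValueError unless both durations are multiples of the batch interval."""
    # Name the batch interval explicitly so it is not mistaken for the
    # slide duration of the window being created.
    if round(window_s * 1000) % batch_interval_ms != 0:
        raise ValueError(
            "window duration must be a multiple of the parent stream's "
            f"batch interval ({batch_interval_ms} ms)"
        )
    if slide_s is not None and round(slide_s * 1000) % batch_interval_ms != 0:
        raise ValueError(
            "slide duration must be a multiple of the parent stream's "
            f"batch interval ({batch_interval_ms} ms)"
        )


validate_window_params(30, 10, 5000)  # both are multiples of 5 s, so this passes
try:
    validate_window_params(7, None, 5000)
except ValueError as err:
    print(err)
```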
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
It wasn't tested, since this only changes the exception message.
Closes #30871 from kykrueger/kykrueger-patch-1.
Authored-by: Kyle Krueger <kyle.s.krueger@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR proposes to:
- Make doctests simpler to show the usage (since we're not running them now).
- Use the test utils to drop the tables if they exist.
### Why are the changes needed?
Better docs and code readability.
### Does this PR introduce _any_ user-facing change?
No, dev-only. It includes some doc changes in unreleased branches.
### How was this patch tested?
Manually tested.
```bash
cd python
./run-tests --python-executable=python3.9,python3.8 --testnames "pyspark.sql.tests.test_streaming StreamingTests"
```
Closes #30873 from HyukjinKwon/SPARK-33836.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
This PR proposes to have its own metastore directory to avoid potential conflict in catalog operations.
### Why are the changes needed?
To make PySpark tests less flaky.
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
Manually tested by trying some sleeps in https://github.com/apache/spark/pull/30873.
Closes #30875 from HyukjinKwon/SPARK-33869.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR pushes `UnaryExpression` into (if / case) branches. The use case is:
```sql
create table t1 using parquet as select id from range(10);
explain select id from t1 where (CASE WHEN id = 1 THEN '1' WHEN id = 3 THEN '2' end) > 3;
```
Before this PR:
```
== Physical Plan ==
*(1) Filter (cast(CASE WHEN (id#1L = 1) THEN 1 WHEN (id#1L = 3) THEN 2 END as int) > 3)
+- *(1) ColumnarToRow
+- FileScan parquet default.t1[id#1L] Batched: true, DataFilters: [(cast(CASE WHEN (id#1L = 1) THEN 1 WHEN (id#1L = 3) THEN 2 END as int) > 3)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark.sql.DataF..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>
```
After this PR:
```
== Physical Plan ==
LocalTableScan <empty>, [id#1L]
```
This change can also improve this case:
a78d6ce376/sql/core/src/test/resources/tpcds/q62.sql (L5-L22)
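The rewrite can be sketched on a toy expression tree. This Python sketch is not Catalyst code: `Literal`, `CaseWhen`, `Unary`, and `push_into_branches` are hypothetical names, conditions are kept as opaque strings, and the sketch assumes the rewrite applies only when every branch value is a literal. Once the cast sits on literal branch values, it can be evaluated ahead of time, which leaves plain constants for later optimizations to reason about.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple


@dataclass(frozen=True)
class Literal:
    value: Any


@dataclass(frozen=True)
class CaseWhen:
    # Each branch is (condition, value); conditions stay opaque strings here.
    branches: Tuple[Tuple[str, Any], ...]
    else_value: Optional[Any] = None


@dataclass(frozen=True)
class Unary:
    name: str
    fn: Callable[[Any], Any]
    child: Any


def push_into_branches(expr: Any) -> Any:
    """Rewrite Unary(CaseWhen(...)) into a CaseWhen over pre-evaluated literals."""
    if not (isinstance(expr, Unary) and isinstance(expr.child, CaseWhen)):
        return expr
    case = expr.child
    values = [value for _, value in case.branches]
    if case.else_value is not None:
        values.append(case.else_value)
    # Only rewrite when every branch value is a literal that can be folded.
    if not all(isinstance(value, Literal) for value in values):
        return expr

    def apply(literal: Literal) -> Literal:
        return Literal(expr.fn(literal.value))

    return CaseWhen(
        tuple((cond, apply(value)) for cond, value in case.branches),
        apply(case.else_value) if case.else_value is not None else None,
    )


case = CaseWhen((("id = 1", Literal("1")), ("id = 3", Literal("2"))))
print(push_into_branches(Unary("cast_as_int", int, case)))
```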
### Why are the changes needed?
Improve query performance.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test.
Closes #30853 from wangyum/SPARK-33848.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Add comments for the `PURGE` option to the logical nodes `DropTable` and `AlterTableDropPartition`.
### Why are the changes needed?
To improve code maintenance.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running `./dev/scalastyle`
Closes #30837 from MaxGekk/comment-purge-logical-node.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to fill missing group tags and re-categorize all the group tags for built-in functions.
New groups below are added in this PR:
- binary_funcs
- bitwise_funcs
- collection_funcs
- predicate_funcs
- conditional_funcs
- conversion_funcs
- csv_funcs
- generator_funcs
- hash_funcs
- lambda_funcs
- math_funcs
- misc_funcs
- string_funcs
- struct_funcs
- xml_funcs
A basic policy to re-categorize functions is that functions in the same file are categorized into the same group. For example, all the functions in `hash.scala` are categorized into `hash_funcs`. But, there are some exceptional/ambiguous cases when categorizing them. Here are some special notes:
- All the aggregate functions are categorized into `agg_funcs`.
- `array_funcs` and `map_funcs` are sub-groups of `collection_funcs`. For example, `array_contains` is used only for arrays, so it is assigned to `array_funcs`. On the other hand, `reverse` is used for both arrays and strings, so it is assigned to `collection_funcs`.
- Some functions logically belong to multiple groups. In this case, these functions are categorized based on the file that they belong to. For example, `schema_of_csv` can be grouped into both `csv_funcs` and `struct_funcs` in terms of input types, but it is assigned to `csv_funcs` because it belongs to the `csvExpressions.scala` file that holds the other CSV-related functions.
- Functions in `nullExpressions.scala`, `complexTypeCreator.scala`, `randomExpressions.scala`, and `regexExpressions.scala` are categorized based on their functionalities. For example:
  - `isnull` in `nullExpressions.scala` is assigned to `predicate_funcs` because it is a predicate function.
  - `array` in `complexTypeCreator.scala` is assigned to `array_funcs` based on its output type (the other functions in `array_funcs` are categorized based on their input types, though).
A category list (after this PR) is as follows (the list below includes the exprs that already have a group tag in the current master):
|group|name|class|
|-----|----|-----|
|agg_funcs|any|org.apache.spark.sql.catalyst.expressions.aggregate.BoolOr|
|agg_funcs|approx_count_distinct|org.apache.spark.sql.catalyst.expressions.aggregate.HyperLogLogPlusPlus|
|agg_funcs|approx_percentile|org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile|
|agg_funcs|avg|org.apache.spark.sql.catalyst.expressions.aggregate.Average|
|agg_funcs|bit_and|org.apache.spark.sql.catalyst.expressions.aggregate.BitAndAgg|
|agg_funcs|bit_or|org.apache.spark.sql.catalyst.expressions.aggregate.BitOrAgg|
|agg_funcs|bit_xor|org.apache.spark.sql.catalyst.expressions.aggregate.BitXorAgg|
|agg_funcs|bool_and|org.apache.spark.sql.catalyst.expressions.aggregate.BoolAnd|
|agg_funcs|bool_or|org.apache.spark.sql.catalyst.expressions.aggregate.BoolOr|
|agg_funcs|collect_list|org.apache.spark.sql.catalyst.expressions.aggregate.CollectList|
|agg_funcs|collect_set|org.apache.spark.sql.catalyst.expressions.aggregate.CollectSet|
|agg_funcs|corr|org.apache.spark.sql.catalyst.expressions.aggregate.Corr|
|agg_funcs|count_if|org.apache.spark.sql.catalyst.expressions.aggregate.CountIf|
|agg_funcs|count_min_sketch|org.apache.spark.sql.catalyst.expressions.aggregate.CountMinSketchAgg|
|agg_funcs|count|org.apache.spark.sql.catalyst.expressions.aggregate.Count|
|agg_funcs|covar_pop|org.apache.spark.sql.catalyst.expressions.aggregate.CovPopulation|
|agg_funcs|covar_samp|org.apache.spark.sql.catalyst.expressions.aggregate.CovSample|
|agg_funcs|cube|org.apache.spark.sql.catalyst.expressions.Cube|
|agg_funcs|every|org.apache.spark.sql.catalyst.expressions.aggregate.BoolAnd|
|agg_funcs|first_value|org.apache.spark.sql.catalyst.expressions.aggregate.First|
|agg_funcs|first|org.apache.spark.sql.catalyst.expressions.aggregate.First|
|agg_funcs|grouping_id|org.apache.spark.sql.catalyst.expressions.GroupingID|
|agg_funcs|grouping|org.apache.spark.sql.catalyst.expressions.Grouping|
|agg_funcs|kurtosis|org.apache.spark.sql.catalyst.expressions.aggregate.Kurtosis|
|agg_funcs|last_value|org.apache.spark.sql.catalyst.expressions.aggregate.Last|
|agg_funcs|last|org.apache.spark.sql.catalyst.expressions.aggregate.Last|
|agg_funcs|max_by|org.apache.spark.sql.catalyst.expressions.aggregate.MaxBy|
|agg_funcs|max|org.apache.spark.sql.catalyst.expressions.aggregate.Max|
|agg_funcs|mean|org.apache.spark.sql.catalyst.expressions.aggregate.Average|
|agg_funcs|min_by|org.apache.spark.sql.catalyst.expressions.aggregate.MinBy|
|agg_funcs|min|org.apache.spark.sql.catalyst.expressions.aggregate.Min|
|agg_funcs|percentile_approx|org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile|
|agg_funcs|percentile|org.apache.spark.sql.catalyst.expressions.aggregate.Percentile|
|agg_funcs|rollup|org.apache.spark.sql.catalyst.expressions.Rollup|
|agg_funcs|skewness|org.apache.spark.sql.catalyst.expressions.aggregate.Skewness|
|agg_funcs|some|org.apache.spark.sql.catalyst.expressions.aggregate.BoolOr|
|agg_funcs|stddev_pop|org.apache.spark.sql.catalyst.expressions.aggregate.StddevPop|
|agg_funcs|stddev_samp|org.apache.spark.sql.catalyst.expressions.aggregate.StddevSamp|
|agg_funcs|stddev|org.apache.spark.sql.catalyst.expressions.aggregate.StddevSamp|
|agg_funcs|std|org.apache.spark.sql.catalyst.expressions.aggregate.StddevSamp|
|agg_funcs|sum|org.apache.spark.sql.catalyst.expressions.aggregate.Sum|
|agg_funcs|var_pop|org.apache.spark.sql.catalyst.expressions.aggregate.VariancePop|
|agg_funcs|var_samp|org.apache.spark.sql.catalyst.expressions.aggregate.VarianceSamp|
|agg_funcs|variance|org.apache.spark.sql.catalyst.expressions.aggregate.VarianceSamp|
|array_funcs|array_contains|org.apache.spark.sql.catalyst.expressions.ArrayContains|
|array_funcs|array_distinct|org.apache.spark.sql.catalyst.expressions.ArrayDistinct|
|array_funcs|array_except|org.apache.spark.sql.catalyst.expressions.ArrayExcept|
|array_funcs|array_intersect|org.apache.spark.sql.catalyst.expressions.ArrayIntersect|
|array_funcs|array_join|org.apache.spark.sql.catalyst.expressions.ArrayJoin|
|array_funcs|array_max|org.apache.spark.sql.catalyst.expressions.ArrayMax|
|array_funcs|array_min|org.apache.spark.sql.catalyst.expressions.ArrayMin|
|array_funcs|array_position|org.apache.spark.sql.catalyst.expressions.ArrayPosition|
|array_funcs|array_remove|org.apache.spark.sql.catalyst.expressions.ArrayRemove|
|array_funcs|array_repeat|org.apache.spark.sql.catalyst.expressions.ArrayRepeat|
|array_funcs|array_union|org.apache.spark.sql.catalyst.expressions.ArrayUnion|
|array_funcs|arrays_overlap|org.apache.spark.sql.catalyst.expressions.ArraysOverlap|
|array_funcs|arrays_zip|org.apache.spark.sql.catalyst.expressions.ArraysZip|
|array_funcs|array|org.apache.spark.sql.catalyst.expressions.CreateArray|
|array_funcs|flatten|org.apache.spark.sql.catalyst.expressions.Flatten|
|array_funcs|sequence|org.apache.spark.sql.catalyst.expressions.Sequence|
|array_funcs|shuffle|org.apache.spark.sql.catalyst.expressions.Shuffle|
|array_funcs|slice|org.apache.spark.sql.catalyst.expressions.Slice|
|array_funcs|sort_array|org.apache.spark.sql.catalyst.expressions.SortArray|
|bitwise_funcs|&|org.apache.spark.sql.catalyst.expressions.BitwiseAnd|
|bitwise_funcs|^|org.apache.spark.sql.catalyst.expressions.BitwiseXor|
|bitwise_funcs|bit_count|org.apache.spark.sql.catalyst.expressions.BitwiseCount|
|bitwise_funcs|shiftrightunsigned|org.apache.spark.sql.catalyst.expressions.ShiftRightUnsigned|
|bitwise_funcs|shiftright|org.apache.spark.sql.catalyst.expressions.ShiftRight|
|bitwise_funcs|~|org.apache.spark.sql.catalyst.expressions.BitwiseNot|
|collection_funcs|cardinality|org.apache.spark.sql.catalyst.expressions.Size|
|collection_funcs|concat|org.apache.spark.sql.catalyst.expressions.Concat|
|collection_funcs|reverse|org.apache.spark.sql.catalyst.expressions.Reverse|
|collection_funcs|size|org.apache.spark.sql.catalyst.expressions.Size|
|conditional_funcs|coalesce|org.apache.spark.sql.catalyst.expressions.Coalesce|
|conditional_funcs|ifnull|org.apache.spark.sql.catalyst.expressions.IfNull|
|conditional_funcs|if|org.apache.spark.sql.catalyst.expressions.If|
|conditional_funcs|nanvl|org.apache.spark.sql.catalyst.expressions.NaNvl|
|conditional_funcs|nullif|org.apache.spark.sql.catalyst.expressions.NullIf|
|conditional_funcs|nvl2|org.apache.spark.sql.catalyst.expressions.Nvl2|
|conditional_funcs|nvl|org.apache.spark.sql.catalyst.expressions.Nvl|
|conditional_funcs|when|org.apache.spark.sql.catalyst.expressions.CaseWhen|
|conversion_funcs|bigint|org.apache.spark.sql.catalyst.expressions.Cast|
|conversion_funcs|binary|org.apache.spark.sql.catalyst.expressions.Cast|
|conversion_funcs|boolean|org.apache.spark.sql.catalyst.expressions.Cast|
|conversion_funcs|cast|org.apache.spark.sql.catalyst.expressions.Cast|
|conversion_funcs|date|org.apache.spark.sql.catalyst.expressions.Cast|
|conversion_funcs|decimal|org.apache.spark.sql.catalyst.expressions.Cast|
|conversion_funcs|double|org.apache.spark.sql.catalyst.expressions.Cast|
|conversion_funcs|float|org.apache.spark.sql.catalyst.expressions.Cast|
|conversion_funcs|int|org.apache.spark.sql.catalyst.expressions.Cast|
|conversion_funcs|smallint|org.apache.spark.sql.catalyst.expressions.Cast|
|conversion_funcs|string|org.apache.spark.sql.catalyst.expressions.Cast|
|conversion_funcs|timestamp|org.apache.spark.sql.catalyst.expressions.Cast|
|conversion_funcs|tinyint|org.apache.spark.sql.catalyst.expressions.Cast|
|csv_funcs|from_csv|org.apache.spark.sql.catalyst.expressions.CsvToStructs|
|csv_funcs|schema_of_csv|org.apache.spark.sql.catalyst.expressions.SchemaOfCsv|
|csv_funcs|to_csv|org.apache.spark.sql.catalyst.expressions.StructsToCsv|
|datetime_funcs|add_months|org.apache.spark.sql.catalyst.expressions.AddMonths|
|datetime_funcs|current_date|org.apache.spark.sql.catalyst.expressions.CurrentDate|
|datetime_funcs|current_timestamp|org.apache.spark.sql.catalyst.expressions.CurrentTimestamp|
|datetime_funcs|current_timezone|org.apache.spark.sql.catalyst.expressions.CurrentTimeZone|
|datetime_funcs|date_add|org.apache.spark.sql.catalyst.expressions.DateAdd|
|datetime_funcs|date_format|org.apache.spark.sql.catalyst.expressions.DateFormatClass|
|datetime_funcs|date_from_unix_date|org.apache.spark.sql.catalyst.expressions.DateFromUnixDate|
|datetime_funcs|date_part|org.apache.spark.sql.catalyst.expressions.DatePart|
|datetime_funcs|date_sub|org.apache.spark.sql.catalyst.expressions.DateSub|
|datetime_funcs|date_trunc|org.apache.spark.sql.catalyst.expressions.TruncTimestamp|
|datetime_funcs|datediff|org.apache.spark.sql.catalyst.expressions.DateDiff|
|datetime_funcs|dayofmonth|org.apache.spark.sql.catalyst.expressions.DayOfMonth|
|datetime_funcs|dayofweek|org.apache.spark.sql.catalyst.expressions.DayOfWeek|
|datetime_funcs|dayofyear|org.apache.spark.sql.catalyst.expressions.DayOfYear|
|datetime_funcs|day|org.apache.spark.sql.catalyst.expressions.DayOfMonth|
|datetime_funcs|extract|org.apache.spark.sql.catalyst.expressions.Extract|
|datetime_funcs|from_unixtime|org.apache.spark.sql.catalyst.expressions.FromUnixTime|
|datetime_funcs|from_utc_timestamp|org.apache.spark.sql.catalyst.expressions.FromUTCTimestamp|
|datetime_funcs|hour|org.apache.spark.sql.catalyst.expressions.Hour|
|datetime_funcs|last_day|org.apache.spark.sql.catalyst.expressions.LastDay|
|datetime_funcs|make_date|org.apache.spark.sql.catalyst.expressions.MakeDate|
|datetime_funcs|make_interval|org.apache.spark.sql.catalyst.expressions.MakeInterval|
|datetime_funcs|make_timestamp|org.apache.spark.sql.catalyst.expressions.MakeTimestamp|
|datetime_funcs|minute|org.apache.spark.sql.catalyst.expressions.Minute|
|datetime_funcs|months_between|org.apache.spark.sql.catalyst.expressions.MonthsBetween|
|datetime_funcs|month|org.apache.spark.sql.catalyst.expressions.Month|
|datetime_funcs|next_day|org.apache.spark.sql.catalyst.expressions.NextDay|
|datetime_funcs|now|org.apache.spark.sql.catalyst.expressions.Now|
|datetime_funcs|quarter|org.apache.spark.sql.catalyst.expressions.Quarter|
|datetime_funcs|second|org.apache.spark.sql.catalyst.expressions.Second|
|datetime_funcs|timestamp_micros|org.apache.spark.sql.catalyst.expressions.MicrosToTimestamp|
|datetime_funcs|timestamp_millis|org.apache.spark.sql.catalyst.expressions.MillisToTimestamp|
|datetime_funcs|timestamp_seconds|org.apache.spark.sql.catalyst.expressions.SecondsToTimestamp|
|datetime_funcs|to_date|org.apache.spark.sql.catalyst.expressions.ParseToDate|
|datetime_funcs|to_timestamp|org.apache.spark.sql.catalyst.expressions.ParseToTimestamp|
|datetime_funcs|to_unix_timestamp|org.apache.spark.sql.catalyst.expressions.ToUnixTimestamp|
|datetime_funcs|to_utc_timestamp|org.apache.spark.sql.catalyst.expressions.ToUTCTimestamp|
|datetime_funcs|trunc|org.apache.spark.sql.catalyst.expressions.TruncDate|
|datetime_funcs|unix_date|org.apache.spark.sql.catalyst.expressions.UnixDate|
|datetime_funcs|unix_micros|org.apache.spark.sql.catalyst.expressions.UnixMicros|
|datetime_funcs|unix_millis|org.apache.spark.sql.catalyst.expressions.UnixMillis|
|datetime_funcs|unix_seconds|org.apache.spark.sql.catalyst.expressions.UnixSeconds|
|datetime_funcs|unix_timestamp|org.apache.spark.sql.catalyst.expressions.UnixTimestamp|
|datetime_funcs|weekday|org.apache.spark.sql.catalyst.expressions.WeekDay|
|datetime_funcs|weekofyear|org.apache.spark.sql.catalyst.expressions.WeekOfYear|
|datetime_funcs|year|org.apache.spark.sql.catalyst.expressions.Year|
|generator_funcs|explode_outer|org.apache.spark.sql.catalyst.expressions.Explode|
|generator_funcs|explode|org.apache.spark.sql.catalyst.expressions.Explode|
|generator_funcs|inline_outer|org.apache.spark.sql.catalyst.expressions.Inline|
|generator_funcs|inline|org.apache.spark.sql.catalyst.expressions.Inline|
|generator_funcs|posexplode_outer|org.apache.spark.sql.catalyst.expressions.PosExplode|
|generator_funcs|posexplode|org.apache.spark.sql.catalyst.expressions.PosExplode|
|generator_funcs|stack|org.apache.spark.sql.catalyst.expressions.Stack|
|hash_funcs|crc32|org.apache.spark.sql.catalyst.expressions.Crc32|
|hash_funcs|hash|org.apache.spark.sql.catalyst.expressions.Murmur3Hash|
|hash_funcs|md5|org.apache.spark.sql.catalyst.expressions.Md5|
|hash_funcs|sha1|org.apache.spark.sql.catalyst.expressions.Sha1|
|hash_funcs|sha2|org.apache.spark.sql.catalyst.expressions.Sha2|
|hash_funcs|sha|org.apache.spark.sql.catalyst.expressions.Sha1|
|hash_funcs|xxhash64|org.apache.spark.sql.catalyst.expressions.XxHash64|
|json_funcs|from_json|org.apache.spark.sql.catalyst.expressions.JsonToStructs|
|json_funcs|get_json_object|org.apache.spark.sql.catalyst.expressions.GetJsonObject|
|json_funcs|json_array_length|org.apache.spark.sql.catalyst.expressions.LengthOfJsonArray|
|json_funcs|json_object_keys|org.apache.spark.sql.catalyst.expressions.JsonObjectKeys|
|json_funcs|json_tuple|org.apache.spark.sql.catalyst.expressions.JsonTuple|
|json_funcs|schema_of_json|org.apache.spark.sql.catalyst.expressions.SchemaOfJson|
|json_funcs|to_json|org.apache.spark.sql.catalyst.expressions.StructsToJson|
|lambda_funcs|aggregate|org.apache.spark.sql.catalyst.expressions.ArrayAggregate|
|lambda_funcs|array_sort|org.apache.spark.sql.catalyst.expressions.ArraySort|
|lambda_funcs|exists|org.apache.spark.sql.catalyst.expressions.ArrayExists|
|lambda_funcs|filter|org.apache.spark.sql.catalyst.expressions.ArrayFilter|
|lambda_funcs|forall|org.apache.spark.sql.catalyst.expressions.ArrayForAll|
|lambda_funcs|map_filter|org.apache.spark.sql.catalyst.expressions.MapFilter|
|lambda_funcs|map_zip_with|org.apache.spark.sql.catalyst.expressions.MapZipWith|
|lambda_funcs|transform_keys|org.apache.spark.sql.catalyst.expressions.TransformKeys|
|lambda_funcs|transform_values|org.apache.spark.sql.catalyst.expressions.TransformValues|
|lambda_funcs|transform|org.apache.spark.sql.catalyst.expressions.ArrayTransform|
|lambda_funcs|zip_with|org.apache.spark.sql.catalyst.expressions.ZipWith|
|map_funcs|element_at|org.apache.spark.sql.catalyst.expressions.ElementAt|
|map_funcs|map_concat|org.apache.spark.sql.catalyst.expressions.MapConcat|
|map_funcs|map_entries|org.apache.spark.sql.catalyst.expressions.MapEntries|
|map_funcs|map_from_arrays|org.apache.spark.sql.catalyst.expressions.MapFromArrays|
|map_funcs|map_from_entries|org.apache.spark.sql.catalyst.expressions.MapFromEntries|
|map_funcs|map_keys|org.apache.spark.sql.catalyst.expressions.MapKeys|
|map_funcs|map_values|org.apache.spark.sql.catalyst.expressions.MapValues|
|map_funcs|map|org.apache.spark.sql.catalyst.expressions.CreateMap|
|map_funcs|str_to_map|org.apache.spark.sql.catalyst.expressions.StringToMap|
|math_funcs|%|org.apache.spark.sql.catalyst.expressions.Remainder|
|math_funcs|*|org.apache.spark.sql.catalyst.expressions.Multiply|
|math_funcs|+|org.apache.spark.sql.catalyst.expressions.Add|
|math_funcs|-|org.apache.spark.sql.catalyst.expressions.Subtract|
|math_funcs|/|org.apache.spark.sql.catalyst.expressions.Divide|
|math_funcs|abs|org.apache.spark.sql.catalyst.expressions.Abs|
|math_funcs|acosh|org.apache.spark.sql.catalyst.expressions.Acosh|
|math_funcs|acos|org.apache.spark.sql.catalyst.expressions.Acos|
|math_funcs|asinh|org.apache.spark.sql.catalyst.expressions.Asinh|
|math_funcs|asin|org.apache.spark.sql.catalyst.expressions.Asin|
|math_funcs|atan2|org.apache.spark.sql.catalyst.expressions.Atan2|
|math_funcs|atanh|org.apache.spark.sql.catalyst.expressions.Atanh|
|math_funcs|atan|org.apache.spark.sql.catalyst.expressions.Atan|
|math_funcs|bin|org.apache.spark.sql.catalyst.expressions.Bin|
|math_funcs|bround|org.apache.spark.sql.catalyst.expressions.BRound|
|math_funcs|cbrt|org.apache.spark.sql.catalyst.expressions.Cbrt|
|math_funcs|ceiling|org.apache.spark.sql.catalyst.expressions.Ceil|
|math_funcs|ceil|org.apache.spark.sql.catalyst.expressions.Ceil|
|math_funcs|conv|org.apache.spark.sql.catalyst.expressions.Conv|
|math_funcs|cosh|org.apache.spark.sql.catalyst.expressions.Cosh|
|math_funcs|cos|org.apache.spark.sql.catalyst.expressions.Cos|
|math_funcs|cot|org.apache.spark.sql.catalyst.expressions.Cot|
|math_funcs|degrees|org.apache.spark.sql.catalyst.expressions.ToDegrees|
|math_funcs|div|org.apache.spark.sql.catalyst.expressions.IntegralDivide|
|math_funcs|expm1|org.apache.spark.sql.catalyst.expressions.Expm1|
|math_funcs|exp|org.apache.spark.sql.catalyst.expressions.Exp|
|math_funcs|e|org.apache.spark.sql.catalyst.expressions.EulerNumber|
|math_funcs|factorial|org.apache.spark.sql.catalyst.expressions.Factorial|
|math_funcs|floor|org.apache.spark.sql.catalyst.expressions.Floor|
|math_funcs|greatest|org.apache.spark.sql.catalyst.expressions.Greatest|
|math_funcs|hex|org.apache.spark.sql.catalyst.expressions.Hex|
|math_funcs|hypot|org.apache.spark.sql.catalyst.expressions.Hypot|
|math_funcs|least|org.apache.spark.sql.catalyst.expressions.Least|
|math_funcs|ln|org.apache.spark.sql.catalyst.expressions.Log|
|math_funcs|log10|org.apache.spark.sql.catalyst.expressions.Log10|
|math_funcs|log1p|org.apache.spark.sql.catalyst.expressions.Log1p|
|math_funcs|log2|org.apache.spark.sql.catalyst.expressions.Log2|
|math_funcs|log|org.apache.spark.sql.catalyst.expressions.Logarithm|
|math_funcs|mod|org.apache.spark.sql.catalyst.expressions.Remainder|
|math_funcs|negative|org.apache.spark.sql.catalyst.expressions.UnaryMinus|
|math_funcs|pi|org.apache.spark.sql.catalyst.expressions.Pi|
|math_funcs|pmod|org.apache.spark.sql.catalyst.expressions.Pmod|
|math_funcs|positive|org.apache.spark.sql.catalyst.expressions.UnaryPositive|
|math_funcs|power|org.apache.spark.sql.catalyst.expressions.Pow|
|math_funcs|pow|org.apache.spark.sql.catalyst.expressions.Pow|
|math_funcs|radians|org.apache.spark.sql.catalyst.expressions.ToRadians|
|math_funcs|randn|org.apache.spark.sql.catalyst.expressions.Randn|
|math_funcs|random|org.apache.spark.sql.catalyst.expressions.Rand|
|math_funcs|rand|org.apache.spark.sql.catalyst.expressions.Rand|
|math_funcs|rint|org.apache.spark.sql.catalyst.expressions.Rint|
|math_funcs|round|org.apache.spark.sql.catalyst.expressions.Round|
|math_funcs|shiftleft|org.apache.spark.sql.catalyst.expressions.ShiftLeft|
|math_funcs|signum|org.apache.spark.sql.catalyst.expressions.Signum|
|math_funcs|sign|org.apache.spark.sql.catalyst.expressions.Signum|
|math_funcs|sinh|org.apache.spark.sql.catalyst.expressions.Sinh|
|math_funcs|sin|org.apache.spark.sql.catalyst.expressions.Sin|
|math_funcs|sqrt|org.apache.spark.sql.catalyst.expressions.Sqrt|
|math_funcs|tanh|org.apache.spark.sql.catalyst.expressions.Tanh|
|math_funcs|tan|org.apache.spark.sql.catalyst.expressions.Tan|
|math_funcs|unhex|org.apache.spark.sql.catalyst.expressions.Unhex|
|math_funcs|width_bucket|org.apache.spark.sql.catalyst.expressions.WidthBucket|
|misc_funcs|assert_true|org.apache.spark.sql.catalyst.expressions.AssertTrue|
|misc_funcs|current_catalog|org.apache.spark.sql.catalyst.expressions.CurrentCatalog|
|misc_funcs|current_database|org.apache.spark.sql.catalyst.expressions.CurrentDatabase|
|misc_funcs|input_file_block_length|org.apache.spark.sql.catalyst.expressions.InputFileBlockLength|
|misc_funcs|input_file_block_start|org.apache.spark.sql.catalyst.expressions.InputFileBlockStart|
|misc_funcs|input_file_name|org.apache.spark.sql.catalyst.expressions.InputFileName|
|misc_funcs|java_method|org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection|
|misc_funcs|monotonically_increasing_id|org.apache.spark.sql.catalyst.expressions.MonotonicallyIncreasingID|
|misc_funcs|raise_error|org.apache.spark.sql.catalyst.expressions.RaiseError|
|misc_funcs|reflect|org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection|
|misc_funcs|spark_partition_id|org.apache.spark.sql.catalyst.expressions.SparkPartitionID|
|misc_funcs|typeof|org.apache.spark.sql.catalyst.expressions.TypeOf|
|misc_funcs|uuid|org.apache.spark.sql.catalyst.expressions.Uuid|
|misc_funcs|version|org.apache.spark.sql.catalyst.expressions.SparkVersion|
|predicate_funcs|!|org.apache.spark.sql.catalyst.expressions.Not|
|predicate_funcs|<=>|org.apache.spark.sql.catalyst.expressions.EqualNullSafe|
|predicate_funcs|<=|org.apache.spark.sql.catalyst.expressions.LessThanOrEqual|
|predicate_funcs|<|org.apache.spark.sql.catalyst.expressions.LessThan|
|predicate_funcs|==|org.apache.spark.sql.catalyst.expressions.EqualTo|
|predicate_funcs|=|org.apache.spark.sql.catalyst.expressions.EqualTo|
|predicate_funcs|>=|org.apache.spark.sql.catalyst.expressions.GreaterThanOrEqual|
|predicate_funcs|>|org.apache.spark.sql.catalyst.expressions.GreaterThan|
|predicate_funcs|and|org.apache.spark.sql.catalyst.expressions.And|
|predicate_funcs|in|org.apache.spark.sql.catalyst.expressions.In|
|predicate_funcs|isnan|org.apache.spark.sql.catalyst.expressions.IsNaN|
|predicate_funcs|isnotnull|org.apache.spark.sql.catalyst.expressions.IsNotNull|
|predicate_funcs|isnull|org.apache.spark.sql.catalyst.expressions.IsNull|
|predicate_funcs|like|org.apache.spark.sql.catalyst.expressions.Like|
|predicate_funcs|not|org.apache.spark.sql.catalyst.expressions.Not|
|predicate_funcs|or|org.apache.spark.sql.catalyst.expressions.Or|
|predicate_funcs|regexp_like|org.apache.spark.sql.catalyst.expressions.RLike|
|predicate_funcs|rlike|org.apache.spark.sql.catalyst.expressions.RLike|
|string_funcs|ascii|org.apache.spark.sql.catalyst.expressions.Ascii|
|string_funcs|base64|org.apache.spark.sql.catalyst.expressions.Base64|
|string_funcs|bit_length|org.apache.spark.sql.catalyst.expressions.BitLength|
|string_funcs|char_length|org.apache.spark.sql.catalyst.expressions.Length|
|string_funcs|character_length|org.apache.spark.sql.catalyst.expressions.Length|
|string_funcs|char|org.apache.spark.sql.catalyst.expressions.Chr|
|string_funcs|chr|org.apache.spark.sql.catalyst.expressions.Chr|
|string_funcs|concat_ws|org.apache.spark.sql.catalyst.expressions.ConcatWs|
|string_funcs|decode|org.apache.spark.sql.catalyst.expressions.Decode|
|string_funcs|elt|org.apache.spark.sql.catalyst.expressions.Elt|
|string_funcs|encode|org.apache.spark.sql.catalyst.expressions.Encode|
|string_funcs|find_in_set|org.apache.spark.sql.catalyst.expressions.FindInSet|
|string_funcs|format_number|org.apache.spark.sql.catalyst.expressions.FormatNumber|
|string_funcs|format_string|org.apache.spark.sql.catalyst.expressions.FormatString|
|string_funcs|initcap|org.apache.spark.sql.catalyst.expressions.InitCap|
|string_funcs|instr|org.apache.spark.sql.catalyst.expressions.StringInstr|
|string_funcs|lcase|org.apache.spark.sql.catalyst.expressions.Lower|
|string_funcs|left|org.apache.spark.sql.catalyst.expressions.Left|
|string_funcs|length|org.apache.spark.sql.catalyst.expressions.Length|
|string_funcs|levenshtein|org.apache.spark.sql.catalyst.expressions.Levenshtein|
|string_funcs|locate|org.apache.spark.sql.catalyst.expressions.StringLocate|
|string_funcs|lower|org.apache.spark.sql.catalyst.expressions.Lower|
|string_funcs|lpad|org.apache.spark.sql.catalyst.expressions.StringLPad|
|string_funcs|ltrim|org.apache.spark.sql.catalyst.expressions.StringTrimLeft|
|string_funcs|octet_length|org.apache.spark.sql.catalyst.expressions.OctetLength|
|string_funcs|overlay|org.apache.spark.sql.catalyst.expressions.Overlay|
|string_funcs|parse_url|org.apache.spark.sql.catalyst.expressions.ParseUrl|
|string_funcs|position|org.apache.spark.sql.catalyst.expressions.StringLocate|
|string_funcs|printf|org.apache.spark.sql.catalyst.expressions.FormatString|
|string_funcs|regexp_extract_all|org.apache.spark.sql.catalyst.expressions.RegExpExtractAll|
|string_funcs|regexp_extract|org.apache.spark.sql.catalyst.expressions.RegExpExtract|
|string_funcs|regexp_replace|org.apache.spark.sql.catalyst.expressions.RegExpReplace|
|string_funcs|repeat|org.apache.spark.sql.catalyst.expressions.StringRepeat|
|string_funcs|replace|org.apache.spark.sql.catalyst.expressions.StringReplace|
|string_funcs|right|org.apache.spark.sql.catalyst.expressions.Right|
|string_funcs|rpad|org.apache.spark.sql.catalyst.expressions.StringRPad|
|string_funcs|rtrim|org.apache.spark.sql.catalyst.expressions.StringTrimRight|
|string_funcs|sentences|org.apache.spark.sql.catalyst.expressions.Sentences|
|string_funcs|soundex|org.apache.spark.sql.catalyst.expressions.SoundEx|
|string_funcs|space|org.apache.spark.sql.catalyst.expressions.StringSpace|
|string_funcs|split|org.apache.spark.sql.catalyst.expressions.StringSplit|
|string_funcs|substring_index|org.apache.spark.sql.catalyst.expressions.SubstringIndex|
|string_funcs|substring|org.apache.spark.sql.catalyst.expressions.Substring|
|string_funcs|substr|org.apache.spark.sql.catalyst.expressions.Substring|
|string_funcs|translate|org.apache.spark.sql.catalyst.expressions.StringTranslate|
|string_funcs|trim|org.apache.spark.sql.catalyst.expressions.StringTrim|
|string_funcs|ucase|org.apache.spark.sql.catalyst.expressions.Upper|
|string_funcs|unbase64|org.apache.spark.sql.catalyst.expressions.UnBase64|
|string_funcs|upper|org.apache.spark.sql.catalyst.expressions.Upper|
|struct_funcs|named_struct|org.apache.spark.sql.catalyst.expressions.CreateNamedStruct|
|struct_funcs|struct|org.apache.spark.sql.catalyst.expressions.CreateNamedStruct|
|window_funcs|cume_dist|org.apache.spark.sql.catalyst.expressions.CumeDist|
|window_funcs|dense_rank|org.apache.spark.sql.catalyst.expressions.DenseRank|
|window_funcs|lag|org.apache.spark.sql.catalyst.expressions.Lag|
|window_funcs|lead|org.apache.spark.sql.catalyst.expressions.Lead|
|window_funcs|nth_value|org.apache.spark.sql.catalyst.expressions.NthValue|
|window_funcs|ntile|org.apache.spark.sql.catalyst.expressions.NTile|
|window_funcs|percent_rank|org.apache.spark.sql.catalyst.expressions.PercentRank|
|window_funcs|rank|org.apache.spark.sql.catalyst.expressions.Rank|
|window_funcs|row_number|org.apache.spark.sql.catalyst.expressions.RowNumber|
|xml_funcs|xpath_boolean|org.apache.spark.sql.catalyst.expressions.xml.XPathBoolean|
|xml_funcs|xpath_double|org.apache.spark.sql.catalyst.expressions.xml.XPathDouble|
|xml_funcs|xpath_float|org.apache.spark.sql.catalyst.expressions.xml.XPathFloat|
|xml_funcs|xpath_int|org.apache.spark.sql.catalyst.expressions.xml.XPathInt|
|xml_funcs|xpath_long|org.apache.spark.sql.catalyst.expressions.xml.XPathLong|
|xml_funcs|xpath_number|org.apache.spark.sql.catalyst.expressions.xml.XPathDouble|
|xml_funcs|xpath_short|org.apache.spark.sql.catalyst.expressions.xml.XPathShort|
|xml_funcs|xpath_string|org.apache.spark.sql.catalyst.expressions.xml.XPathString|
|xml_funcs|xpath|org.apache.spark.sql.catalyst.expressions.xml.XPathList|
Closes #30040
NOTE: An original author of this PR is tanelk, so the credit should be given to tanelk.
### Why are the changes needed?
For better documentation.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added a test in `ExpressionInfoSuite` to check that expressions have a group tag.
Closes #30867 from maropu/pr30040.
Lead-authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Co-authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Improve `SimplifyConditionals`.
Simplify `If(cond, TrueLiteral, FalseLiteral)` to `cond`.
Simplify `If(cond, FalseLiteral, TrueLiteral)` to `Not(cond)`.
The use case is:
```sql
create table t1 using parquet as select id from range(10);
select if (id > 2, false, true) from t1;
```
Before this pr:
```
== Physical Plan ==
*(1) Project [if ((id#1L > 2)) false else true AS (IF((id > CAST(2 AS BIGINT)), false, true))#2]
+- *(1) ColumnarToRow
   +- FileScan parquet default.t1[id#1L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark.sql.DataF..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>
```
After this pr:
```
== Physical Plan ==
*(1) Project [(id#1L <= 2) AS (IF((id > CAST(2 AS BIGINT)), false, true))#2]
+- *(1) ColumnarToRow
   +- FileScan parquet default.t1[id#1L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark.sql.DataF..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>
```
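The two rewrites can be illustrated with a toy model in plain Python. This is only a sketch: the node classes and `simplify_if` are hypothetical stand-ins, not Catalyst classes, and the sketch ignores NULL conditions (SQL three-valued logic).

```python
from dataclasses import dataclass


# Toy expression nodes; illustrative stand-ins, not Catalyst classes.
@dataclass(frozen=True)
class Literal:
    value: object


@dataclass(frozen=True)
class Attr:
    name: str


@dataclass(frozen=True)
class GreaterThan:
    left: object
    right: object


@dataclass(frozen=True)
class Not:
    child: object


@dataclass(frozen=True)
class If:
    cond: object
    then: object
    otherwise: object


def is_bool_literal(expr, expected):
    # Use identity so that Literal(1) is not mistaken for Literal(True).
    return isinstance(expr, Literal) and expr.value is expected


def simplify_if(expr):
    """Rewrite If(cond, true, false) to cond and If(cond, false, true) to Not(cond)."""
    if isinstance(expr, If):
        if is_bool_literal(expr.then, True) and is_bool_literal(expr.otherwise, False):
            return expr.cond
        if is_bool_literal(expr.then, False) and is_bool_literal(expr.otherwise, True):
            return Not(expr.cond)
    return expr  # anything else is left untouched


cond = GreaterThan(Attr("id"), Literal(2))
print(simplify_if(If(cond, Literal(False), Literal(True))))
```

In the plans above, the resulting `Not(id > 2)` appears as `(id#1L <= 2)`, which suggests a later simplification step that this sketch does not model.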
### Why are the changes needed?
Improve query performance.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test.
Closes #30849 from wangyum/SPARK-33798-2.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
It's a known issue that re-analyzing an optimized plan can lead to various issues. We made several attempts to prevent it, but the current solution, `AlreadyOptimized`, is still not 100% safe, as people can inject catalyst rules that call the analyzer directly.
This PR proposes a simpler and safer idea: we set the `analyzed` flag to true after optimization, and the analyzer skips plans whose `analyzed` flag is true.
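The idea can be sketched in plain Python as follows. `ToyPlan`, `ToyAnalyzer` and `toy_optimize` are hypothetical names used only for illustration; they do not mirror the real Spark classes.

```python
class ToyPlan:
    """Toy logical plan; `analyzed` plays the role of the flag described above."""

    def __init__(self, name):
        self.name = name
        self.analyzed = False


class ToyAnalyzer:
    def __init__(self):
        self.runs = 0  # counts how many plans were actually analyzed

    def execute(self, plan):
        if plan.analyzed:
            return plan  # already analyzed (or optimized): skip it
        self.runs += 1
        plan.analyzed = True
        return plan


def toy_optimize(plan):
    # The optimizer output is marked as analyzed so that it is never re-analyzed.
    optimized = ToyPlan(f"optimized({plan.name})")
    optimized.analyzed = True
    return optimized


analyzer = ToyAnalyzer()
plan = analyzer.execute(ToyPlan("query"))
optimized = toy_optimize(plan)
analyzer.execute(optimized)  # e.g. an injected rule calling the analyzer directly
print(analyzer.runs)
```

Because the check lives in the analyzer itself, it also covers callers that bypass the normal query execution path.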
### Why are the changes needed?
make the code simpler and safer
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
existing tests.
Closes #30777 from cloud-fan/ds.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Throw `PartitionAlreadyExistsException` from `ALTER TABLE .. RENAME TO PARTITION` for a table from Hive V1 External Catalog in the case when the target partition already exists.
### Why are the changes needed?
1. To make the V1 In-Memory and Hive External Catalogs behave the same.
2. To avoid propagating Hive's internal exceptions to users.
### Does this PR introduce _any_ user-facing change?
Yes. After the changes, the partition renaming command throws `PartitionAlreadyExistsException` for tables from the Hive catalog.
### How was this patch tested?
Added new UT:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *HiveCatalogedDDLSuite"
```
Closes #30866 from MaxGekk/throw-PartitionAlreadyExistsException.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR proposes to expose `DataStreamReader.table` (SPARK-32885) and `DataStreamWriter.toTable` (SPARK-32896) to PySpark, which are the only ways to read from and write to a table in Structured Streaming.
### Why are the changes needed?
Please refer to SPARK-32885 and SPARK-32896 for the rationale behind these public APIs. This PR only exposes them to PySpark.
### Does this PR introduce _any_ user-facing change?
Yes, PySpark users will be able to read from and write to a table in a Structured Streaming query.
### How was this patch tested?
Manually tested.
> v1 table
>> create table A and ingest to the table A
```
spark.sql("""
create table table_pyspark_parquet (
value long,
`timestamp` timestamp
) USING parquet
""")
df = spark.readStream.format('rate').option('rowsPerSecond', 100).load()
query = df.writeStream.toTable('table_pyspark_parquet', checkpointLocation='/tmp/checkpoint5')
query.lastProgress
query.stop()
```
>> read table A and ingest to the table B which doesn't exist
```
df2 = spark.readStream.table('table_pyspark_parquet')
query2 = df2.writeStream.toTable('table_pyspark_parquet_nonexist', format='parquet', checkpointLocation='/tmp/checkpoint2')
query2.lastProgress
query2.stop()
```
>> select tables
```
spark.sql("DESCRIBE TABLE table_pyspark_parquet").show()
spark.sql("SELECT * FROM table_pyspark_parquet").show()
spark.sql("DESCRIBE TABLE table_pyspark_parquet_nonexist").show()
spark.sql("SELECT * FROM table_pyspark_parquet_nonexist").show()
```
> v2 table (leveraging Apache Iceberg as it provides V2 table and custom catalog as well)
>> create table A and ingest to the table A
```
spark.sql("""
create table iceberg_catalog.default.table_pyspark_v2table (
value long,
`timestamp` timestamp
) USING iceberg
""")
df = spark.readStream.format('rate').option('rowsPerSecond', 100).load()
query = df.select('value', 'timestamp').writeStream.toTable('iceberg_catalog.default.table_pyspark_v2table', checkpointLocation='/tmp/checkpoint_v2table_1')
query.lastProgress
query.stop()
```
>> ingest to the non-existent table B
```
df2 = spark.readStream.format('rate').option('rowsPerSecond', 100).load()
query2 = df2.select('value', 'timestamp').writeStream.toTable('iceberg_catalog.default.table_pyspark_v2table_nonexist', checkpointLocation='/tmp/checkpoint_v2table_2')
query2.lastProgress
query2.stop()
```
>> ingest to the non-existent table C partitioned by `value % 10`
```
df3 = spark.readStream.format('rate').option('rowsPerSecond', 100).load()
df3a = df3.selectExpr('value', 'timestamp', 'value % 10 AS partition').repartition('partition')
query3 = df3a.writeStream.partitionBy('partition').toTable('iceberg_catalog.default.table_pyspark_v2table_nonexist_partitioned', checkpointLocation='/tmp/checkpoint_v2table_3')
query3.lastProgress
query3.stop()
```
>> select tables
```
spark.sql("DESCRIBE TABLE iceberg_catalog.default.table_pyspark_v2table").show()
spark.sql("SELECT * FROM iceberg_catalog.default.table_pyspark_v2table").show()
spark.sql("DESCRIBE TABLE iceberg_catalog.default.table_pyspark_v2table_nonexist").show()
spark.sql("SELECT * FROM iceberg_catalog.default.table_pyspark_v2table_nonexist").show()
spark.sql("DESCRIBE TABLE iceberg_catalog.default.table_pyspark_v2table_nonexist_partitioned").show()
spark.sql("SELECT * FROM iceberg_catalog.default.table_pyspark_v2table_nonexist_partitioned").show()
```
Closes #30835 from HeartSaVioR/SPARK-33836.
Lead-authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Co-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
1. Move the `DROP TABLE` parsing tests to `DropTableParserSuite`
2. Move the v1 tests for `DROP TABLE` from `DDLSuite` and the v2 tests from `DataSourceV2SQLSuite` to the common trait `DropTableSuiteBase`, so that the tests run for V1, Hive V1 and V2 DS.
### Why are the changes needed?
- The unification allows running common `DROP TABLE` tests for DSv1, Hive DSv1 and DSv2.
- We can detect missing features and differences between DSv1 and DSv2 implementations.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running new test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *DropTableParserSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *DropTableSuite"
```
Closes #30854 from MaxGekk/unify-drop-table-tests.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to migrate `ALTER TABLE ... RENAME TO PARTITION` to use `UnresolvedTable` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).
Note that `ALTER TABLE ... RENAME TO PARTITION` is not supported for v2 tables.
### Why are the changes needed?
This PR makes the resolution behavior consistent. For example,
```
sql("CREATE DATABASE test")
sql("CREATE TABLE spark_catalog.test.t (id bigint, val string) USING csv PARTITIONED BY (id)")
sql("CREATE TEMPORARY VIEW t AS SELECT 2")
sql("USE spark_catalog.test")
sql("ALTER TABLE t PARTITION (id=1) RENAME TO PARTITION (id=2)") // works fine assuming id=1 exists.
```
, but after this PR:
```
sql("ALTER TABLE t PARTITION (id=1) RENAME TO PARTITION (id=2)")
org.apache.spark.sql.AnalysisException: t is a temp view. 'ALTER TABLE ... RENAME TO PARTITION' expects a table; line 1 pos 0
```
, which is consistent with the behavior of other commands.
### Does this PR introduce _any_ user-facing change?
After this PR, `ALTER TABLE` in the above example is resolved to a temp view `t` first instead of `spark_catalog.test.t`.
### How was this patch tested?
Updated existing tests.
Closes #30862 from imback82/alter_table_rename_partition_v2.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This is a followup PR for #30573.
After this change is applied, stage memory metrics will be updated on stage end.
### Why are the changes needed?
After #30573, executor memory metrics are updated on stage end but stage memory metrics are not.
It's better to update both metrics like `updateStageLevelPeakExecutorMetrics` does.
### Does this PR introduce _any_ user-facing change?
Yes. Stage memory metrics are updated more accurately.
### How was this patch tested?
After I ran a job and visited `/api/v1/<appid>/stages`, I confirmed that the `peakExecutorMemory` metrics are shown even though the lifetime of each stage is very short.
I also modified the JSON files for `HistoryServerSuite`.
Closes #30858 from sarutak/followup-SPARK-26341.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR mainly improves and cleans up the test code introduced in #30855 based on the comment.
The test code is actually taken from another test, `explain formatted - check presence of subquery in case of DPP`, so this PR cleans up that code too (removes an unnecessary `withTable`).
### Why are the changes needed?
To keep the test code clean.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
`ExplainSuite` passes.
Closes #30861 from sarutak/followup-SPARK-33850.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Make the `hasNext` method of `BytesToBytesMap`'s `MapIterator` idempotent.
### Why are the changes needed?
`hasNext` may be called multiple times. If it is not guarded, a second call to `hasNext` after reaching the end of the iterator throws a NoSuchElement exception.
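To illustrate what an idempotent `hasNext` means, here is a minimal sketch in plain Python. `ToyMapIterator` is a hypothetical class, and the one-time cleanup via a deque is an assumption made for illustration; it is not the actual `BytesToBytesMap` logic.

```python
from collections import deque


class ToyMapIterator:
    """Java-style iterator whose has_next() does one-time cleanup at the end."""

    def __init__(self, records, guarded=True):
        self._records = list(records)
        self._pos = 0
        # Hypothetical resource that must be released exactly once.
        self._pending_cleanup = deque(["spill-file-0"])
        self._guarded = guarded

    def has_next(self):
        if self._pos < len(self._records):
            return True
        if self._guarded and not self._pending_cleanup:
            return False  # cleanup already done: repeated calls are safe
        self._pending_cleanup.popleft()  # raises IndexError when called twice unguarded
        return False

    def next(self):
        if not self.has_next():
            raise StopIteration
        record = self._records[self._pos]
        self._pos += 1
        return record


it = ToyMapIterator(["a", "b"])
while it.has_next():
    print(it.next())
print(it.has_next(), it.has_next())  # both calls are safe with the guard
```

With `guarded=False`, the second call to `has_next()` past the end raises `IndexError`, which is the Python analogue of the failure described above.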
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Update a unit test to cover this case.
Closes #30728 from advancedxy/SPARK-33756.
Authored-by: Xianjin YE <advancedxy@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
`HiveDDLSuite` has many of the following patterns:
```scala
val e = intercept[AnalysisException] {
  sql(sqlString)
}
assert(e.message.contains(exceptionMessage))
```
However, an `assertAnalysisError` helper function already exists that does exactly the same thing.
### Why are the changes needed?
To simplify the test code.
### Does this PR introduce _any_ user-facing change?
No, just refactoring the test code.
### How was this patch tested?
Existing tests
Closes #30857 from imback82/hive_ddl_suite_use_assertAnalysisError.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR aims to use `ListBuffer` instead of `Stack` in `SparkBuild.scala` to remove a deprecation warning.
### Why are the changes needed?
Stack is deprecated in Scala 2.12.0.
```scala
% build/sbt compile
...
[warn] /Users/william/spark/project/SparkBuild.scala:1112:25:
class Stack in package mutable is deprecated (since 2.12.0):
Stack is an inelegant and potentially poorly-performing wrapper around List.
Use a List assigned to a var instead.
[warn] val stack = new Stack[File]()
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual.
Closes #30860 from williamhyun/SPARK-33854.
Authored-by: William Hyun <williamhyun3@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR fixes an issue where, when AQE is enabled, `EXPLAIN FORMATTED` doesn't show the plan for subqueries.
```scala
val df = spark.range(1, 100)
df.createTempView("df")
spark.sql("SELECT (SELECT min(id) AS v FROM df)").explain("FORMATTED")
== Physical Plan ==
AdaptiveSparkPlan (3)
+- Project (2)
   +- Scan OneRowRelation (1)
(1) Scan OneRowRelation
Output: []
Arguments: ParallelCollectionRDD[0] at explain at <console>:24, OneRowRelation, UnknownPartitioning(0)
(2) Project
Output [1]: [Subquery subquery#3, [id=#20] AS scalarsubquery()#5L]
Input: []
(3) AdaptiveSparkPlan
Output [1]: [scalarsubquery()#5L]
Arguments: isFinalPlan=false
```
After this change, the plan for the subquery is shown.
```scala
== Physical Plan ==
* Project (2)
+- * Scan OneRowRelation (1)
(1) Scan OneRowRelation [codegen id : 1]
Output: []
Arguments: ParallelCollectionRDD[0] at explain at <console>:24, OneRowRelation, UnknownPartitioning(0)
(2) Project [codegen id : 1]
Output [1]: [Subquery scalar-subquery#3, [id=#24] AS scalarsubquery()#5L]
Input: []
===== Subqueries =====
Subquery:1 Hosting operator id = 2 Hosting Expression = Subquery scalar-subquery#3, [id=#24]
* HashAggregate (6)
+- Exchange (5)
   +- * HashAggregate (4)
      +- * Range (3)
(3) Range [codegen id : 1]
Output [1]: [id#0L]
Arguments: Range (1, 100, step=1, splits=Some(12))
(4) HashAggregate [codegen id : 1]
Input [1]: [id#0L]
Keys: []
Functions [1]: [partial_min(id#0L)]
Aggregate Attributes [1]: [min#7L]
Results [1]: [min#8L]
(5) Exchange
Input [1]: [min#8L]
Arguments: SinglePartition, ENSURE_REQUIREMENTS, [id=#20]
(6) HashAggregate [codegen id : 2]
Input [1]: [min#8L]
Keys: []
Functions [1]: [min(id#0L)]
Aggregate Attributes [1]: [min(id#0L)#4L]
Results [1]: [min(id#0L)#4L AS v#2L]
```
### Why are the changes needed?
For better debuggability.
### Does this PR introduce _any_ user-facing change?
Yes. Users can see the formatted plan for subqueries.
### How was this patch tested?
New test.
Closes #30855 from sarutak/fix-aqe-explain.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
The title is pretty self-explanatory.
### What changes were proposed in this pull request?
Fixing typos in the docs for `foreachBatch` functions.
### Why are the changes needed?
To fix typos in JavaDoc/ScalaDoc.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Yes.
Closes #30782 from ammar1x/patch-1.
Lead-authored-by: Ammar Al-Batool <ammar.albatool@gmail.com>
Co-authored-by: Ammar Al-Batool <ammar.al-batool@disneystreaming.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Currently, renaming v2 tables does not invalidate/recreate the cache, leading to incorrect behavior (the cache is not used) when v2 tables are renamed. This PR fixes the behavior.
### Why are the changes needed?
Fixing a bug since the cache associated with the renamed table is not being cleaned up/recreated.
### Does this PR introduce _any_ user-facing change?
Yes, now when a v2 table is renamed, the cache is correctly updated.
### How was this patch tested?
Added a new test
Closes #30825 from imback82/rename_recreate_cache_v2.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
As https://github.com/apache/spark/pull/29893#discussion_r545303780 mentioned:
> We need to set spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict") before executing this suite; otherwise, test("insert with column list - follow table output order + partitioned table") will fail.
The suite does not fail today because some test cases [running before this suite] do not change the default value of hive.exec.dynamic.partition.mode back to strict. However, the order of test suite execution is not deterministic.
### Why are the changes needed?
avoid flakiness in tests
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
existing tests
Closes #30843 from yaooqinn/SPARK-32976-F.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR aims to upgrade Zstd library to 1.4.8.
### Why are the changes needed?
This will bring the Zstd 1.4.7 and 1.4.8 improvements and bug fixes, plus the following from `zstd-jni`.
- https://github.com/facebook/zstd/releases/tag/v1.4.7
- https://github.com/facebook/zstd/releases/tag/v1.4.8
- https://github.com/luben/zstd-jni/issues/153 (Apple M1 architecture)
### Does this PR introduce _any_ user-facing change?
This will unblock Apple Silicon usage.
### How was this patch tested?
Pass the CIs.
Closes #30848 from dongjoon-hyun/SPARK-33843.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
There has been a lot of work on improving ALS's recommendForAll.
I found that it may be optimized further by:
1. using GEMV and sharing a pre-allocated buffer per task;
2. using guava.ordering instead of BoundedPriorityQueue.
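The two ideas can be sketched in plain Python. This is only an illustration under stated assumptions: `gemv` is a hand-written matrix-vector product standing in for a BLAS call, `heapq.nlargest` stands in for the Guava ordering, and `top_k_items` is a hypothetical helper, not the ALS code.

```python
import heapq


def gemv(matrix, vector, out):
    """Write the matrix-vector product into the pre-allocated list `out`."""
    for i, row in enumerate(matrix):
        out[i] = sum(a * b for a, b in zip(row, vector))
    return out


def top_k_items(user_factors, item_factors, item_ids, k):
    """Return the k best (score, item_id) pairs per user, best first."""
    buffer = [0.0] * len(item_factors)  # allocated once, reused for every user
    recommendations = {}
    for user_id, factors in user_factors.items():
        scores = gemv(item_factors, factors, buffer)
        # nlargest copies the selected pairs, so reusing the buffer is safe.
        recommendations[user_id] = heapq.nlargest(k, zip(scores, item_ids))
    return recommendations


items = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
users = {1: [2.0, 1.0], 2: [0.0, 3.0]}
print(top_k_items(users, items, [10, 20, 30], k=2))
```

Scoring one user against a whole item block with a single matrix-vector product, and writing into a shared buffer, avoids allocating a score array per user.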
### Why are the changes needed?
In my test, using `f2jBLAS.sgemv`, it is about 2.3X faster than the existing implementation.
|Impl| Master | GEMM | GEMV | GEMV + array aggregator | GEMV + guava ordering + array aggregator | GEMV + guava ordering|
|------|----------|------------|----------|------------|------------|------------|
|Duration|341229|363741|191201|189790|148417|147222|
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing test suites.
Closes #30468 from zhengruifeng/als_rec_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
The Hive metastore limits the length of a table property. To work around this, Spark splits the schema JSON string into several parts when saving it to the Hive metastore as table properties. We need to do the same for histogram column stats, as they can grow very large.
This PR refactors the table property splitting code, so that we can share it between the schema json string and histogram column stats.
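A minimal sketch of such a shared split/reassemble helper, in plain Python. The `numParts` / `part.N` key suffixes, the function names and the threshold value are assumptions chosen for illustration, not necessarily what Spark uses.

```python
def split_large_property(key, value, threshold):
    """Split `value` into chunks of at most `threshold` characters."""
    parts = [value[i:i + threshold] for i in range(0, len(value), threshold)]
    # Hypothetical key layout: a part count plus one property per chunk.
    props = {f"{key}.numParts": str(len(parts))}
    for index, part in enumerate(parts):
        props[f"{key}.part.{index}"] = part
    return props


def read_large_property(props, key):
    """Reassemble a value written by split_large_property."""
    num_parts = props.get(f"{key}.numParts")
    if num_parts is None:
        return props.get(key)  # the value was stored unsplit
    return "".join(props[f"{key}.part.{i}"] for i in range(int(num_parts)))


histogram_json = '{"bins": [' + ", ".join(str(n) for n in range(50)) + "]}"
props = split_large_property("stats.histogram", histogram_json, threshold=40)
print(props["stats.histogram.numParts"])
print(read_large_property(props, "stats.histogram") == histogram_json)
```

Because the helper only deals with strings and keys, the same code path can serve both the schema JSON string and the histogram stats.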
### Why are the changes needed?
To be able to analyze a table when its histogram data is large.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
existing test and new tests
Closes #30809 from cloud-fan/cbo.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
FIX Github Action with unidoc
### Why are the changes needed?
FIX Github Action with unidoc
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
Pass GA
Closes #30846 from yaooqinn/SPARK-33599.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR groups exception messages in `/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis`.
### Why are the changes needed?
It will largely help with the standardization of error messages and their maintenance.
### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.
### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.
Closes #30717 from beliefer/SPARK-33599.
Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR adds a new rule (`PushFoldableIntoBranches`) to push foldable expressions down through `CaseWhen`/`If`. This is a real case from production:
```sql
create table t1 using parquet as select * from range(100);
create table t2 using parquet as select * from range(200);
create temp view v1 as
select 'a' as event_type, * from t1
union all
select CASE WHEN id = 1 THEN 'b' WHEN id = 3 THEN 'c' end as event_type, * from t2;
explain select * from v1 where event_type = 'a';
```
Before this PR:
```
== Physical Plan ==
Union
:- *(1) Project [a AS event_type#30533, id#30535L]
:  +- *(1) ColumnarToRow
:     +- FileScan parquet default.t1[id#30535L] Batched: true, DataFilters: [], Format: Parquet
+- *(2) Project [CASE WHEN (id#30536L = 1) THEN b WHEN (id#30536L = 3) THEN c END AS event_type#30534, id#30536L]
   +- *(2) Filter (CASE WHEN (id#30536L = 1) THEN b WHEN (id#30536L = 3) THEN c END = a)
      +- *(2) ColumnarToRow
         +- FileScan parquet default.t2[id#30536L] Batched: true, DataFilters: [(CASE WHEN (id#30536L = 1) THEN b WHEN (id#30536L = 3) THEN c END = a)], Format: Parquet
```
After this PR:
```
== Physical Plan ==
*(1) Project [a AS event_type#8, id#4L]
+- *(1) ColumnarToRow
   +- FileScan parquet default.t1[id#4L] Batched: true, DataFilters: [], Format: Parquet
```
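Why the `t2` side can disappear can be seen with a toy model in plain Python. This is a sketch under assumptions: branches are `(condition, value)` pairs with constant values, `None` plays the role of SQL NULL, and both function names are hypothetical. The pruning of the never-true filter is presumably done by other optimizer rules and is only hinted at here.

```python
def push_equals_into_case(branches, else_value, literal):
    """Rewrite (CASE WHEN c THEN v ... ELSE e END) = literal branch by branch."""

    def fold(value):
        # NULL = literal stays NULL; constant = literal folds to a boolean.
        return None if value is None else (value == literal)

    return [(cond, fold(value)) for cond, value in branches], fold(else_value)


def can_never_be_true(branches, else_value):
    """A filter is useless when every possible outcome is false or NULL."""
    return all(value is not True for _, value in branches) and else_value is not True


# CASE WHEN id = 1 THEN 'b' WHEN id = 3 THEN 'c' END, with an implicit ELSE NULL.
case_branches = [("id = 1", "b"), ("id = 3", "c")]
pushed, pushed_else = push_equals_into_case(case_branches, None, "a")
print(pushed, pushed_else)
print(can_never_be_true(pushed, pushed_else))
```

Once the comparison with `'a'` is pushed into the branches, every branch folds to false or NULL, so the filter over `t2` can never pass and only the `t1` side of the union remains, as the plan above shows.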
### Why are the changes needed?
Improve query performance.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test.
Closes #30790 from wangyum/SPARK-33798.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add `spark.sql.files.minPartitionNum` and its description to sql-performance-tuning.md.
### Why are the changes needed?
Help users find it.
### Does this PR introduce _any_ user-facing change?
Yes, it's the doc.
### How was this patch tested?
Pass CI.
Closes #30838 from ulysses-you/SPARK-33840.
Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Currently, when the Parquet vectorized reader is enabled, using a binary-typed column as a partition column returns an incorrect value, as the UT below shows:
```scala
test("Parquet vector reader incorrect with binary partition value") {
  Seq(false, true).foreach(tag => {
    withSQLConf("spark.sql.parquet.enableVectorizedReader" -> tag.toString) {
      withTable("t1") {
        sql(
          """CREATE TABLE t1(name STRING, id BINARY, part BINARY)
            | USING PARQUET PARTITIONED BY (part)""".stripMargin)
        sql(s"INSERT INTO t1 PARTITION(part = 'Spark SQL') VALUES('a', X'537061726B2053514C')")
        if (tag) {
          checkAnswer(sql("SELECT name, cast(id as string), cast(part as string) FROM t1"),
            Row("a", "Spark SQL", ""))
        } else {
          checkAnswer(sql("SELECT name, cast(id as string), cast(part as string) FROM t1"),
            Row("a", "Spark SQL", "Spark SQL"))
        }
      }
    }
  })
}
```
### Why are the changes needed?
Fix a data correctness issue.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added a unit test.
Closes #30824 from AngersZhuuuu/SPARK-33593.
Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR adds `-Pspark-ganglia-lgpl` to the build definition with Scala 2.13 on GitHub Actions.
### Why are the changes needed?
To keep the code buildable with Scala 2.13.
With this change, all the sub-modules seem to be buildable with Scala 2.13.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
I confirmed that the Scala 2.13 build passes with the following commands.
```
$ ./dev/change-scala-version.sh 2.13
$ build/sbt -Pspark-ganglia-lgpl -Pscala-2.13 compile test:compile
```
Closes #30834 from sarutak/ganglia-scala-2.13.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to update `CACHE TABLE` to use a `LogicalPlan` when caching a query to avoid creating a `DataFrame` as suggested here: https://github.com/apache/spark/pull/30743#discussion_r543123190
For reference, `UNCACHE TABLE` also uses `LogicalPlan`: 0c12900120/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/CacheTableExec.scala (L91-L98)
### Why are the changes needed?
To avoid creating an unnecessary `DataFrame` and to stay consistent with `uncacheQuery`, which `UNCACHE TABLE` uses.
### Does this PR introduce _any_ user-facing change?
No, just internal changes.
### How was this patch tested?
Covered by existing tests, since this is an internal refactoring.
Closes #30815 from imback82/cache_with_logical_plan.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR updates the Structured Streaming documentation to cover the state store and task locality.
### Why are the changes needed?
While running some tests for Structured Streaming, I found that state store locality sometimes becomes an issue, and it is not very straightforward for end users. It would be great to document it.
### Does this PR introduce _any_ user-facing change?
No, only doc change.
### How was this patch tested?
Not tested; this is a doc-only change.
Closes #30789 from viirya/ss-statestore-doc.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
This PR proposes to restructure and refine the Python dependency management page.
I recently wrote a blog post that will be published soon, and decided to contribute some of its content back to the PySpark documentation.
FWIW, it has been reviewed by some tech writers and engineers.
I built the site to make the review easier: https://hyukjin-spark.readthedocs.io/en/stable/user_guide/python_packaging.html
### Why are the changes needed?
For better documentation.
### Does this PR introduce _any_ user-facing change?
It's a doc change, but only in unreleased branches for now.
### How was this patch tested?
I manually built the docs as:
```bash
cd python/docs
make clean html
open
```
Closes #30822 from HyukjinKwon/SPARK-33824.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR intends to fix the bug that throws an unsupported-operation exception when running [the TPCDS q5](https://github.com/apache/spark/blob/master/sql/core/src/test/resources/tpcds/q5.sql) with AQE enabled ([this option is enabled by default now via SPARK-33679](031c5ef280)):
```
java.lang.UnsupportedOperationException: BroadcastExchange does not support the execute() code path.
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecute(BroadcastExchangeExec.scala:189)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
at org.apache.spark.sql.execution.exchange.ReusedExchangeExec.doExecute(Exchange.scala:60)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
at org.apache.spark.sql.execution.adaptive.QueryStageExec.doExecute(QueryStageExec.scala:115)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:321)
at org.apache.spark.sql.execution.SparkPlan.executeCollectIterator(SparkPlan.scala:397)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.$anonfun$relationFuture$1(BroadcastExchangeExec.scala:118)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$1(SQLExecution.scala:185)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
...
```
I checked the AQE code and found that `EnsureRequirements` wrongly puts a `BroadcastExchange` on top of a `BroadcastQueryStage` in the `reOptimize` phase, as follows:
```
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), [id=#2183]
   +- BroadcastQueryStage 2
      +- ReusedExchange [d_date_sk#1086], BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), [id=#1963]
```
The root cause is that the `Cast` expression in the required child distribution does not have a `timeZoneId` set (`timeZoneId=None`), while the `Cast` expression in `child.outputPartitioning` does. This difference can make the distribution requirement check fail in `EnsureRequirements`:
1e85707738/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala (L47-L50)
The `Cast` class that does not have a `timeZoneId` field is generated in the `HashJoin` object. To fix this issue, this PR proposes to use the `CastSupport.cast` method there.
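To illustrate why a missing `timeZoneId` matters, here is a minimal Scala sketch. It assumes the requirement check ultimately compares expressions structurally. The `Expr`, `Column`, and `Cast` types and the `castWithZone` helper are hypothetical stand-ins, not Spark's Catalyst classes or the `CastSupport` API.

```scala
// Hypothetical stand-ins for Catalyst expressions; not Spark's real classes.
object CastEqualitySketch {
  sealed trait Expr
  final case class Column(name: String) extends Expr
  final case class Cast(child: Expr, dataType: String, timeZoneId: Option[String] = None)
    extends Expr

  // Built directly, with no time zone attached (timeZoneId = None).
  val required: Expr = Cast(Column("d_date_sk"), "bigint")

  // The same cast, but carrying a time zone ("UTC" is an arbitrary example value).
  val provided: Expr = Cast(Column("d_date_sk"), "bigint", Some("UTC"))

  // Case-class equality compares every field, so the two casts are not equal.
  val satisfied: Boolean = required == provided

  // Building both sides through one helper that always sets the zone makes them equal.
  def castWithZone(child: Expr, dataType: String, zone: String): Expr =
    Cast(child, dataType, Some(zone))
}
```

In this sketch `satisfied` is `false` even though both casts are semantically the same, which mirrors how an extra exchange can be inserted when the two sides are built differently.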
### Why are the changes needed?
Bugfix.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manually checked that q5 passed.
Closes #30818 from maropu/BugfixInAQE.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Use `local[2]` so that the tasks launch at the same time, and change the counters (`numOnTaskXXX`) to `AtomicInteger` to ensure thread safety.
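As a rough illustration of why the counters need to be atomic once two tasks run concurrently, the following sketch increments a shared `AtomicInteger` from a two-thread pool. The object name, the `numOnTaskStart` field, and the `runTasks` helper are hypothetical and are not taken from the test suite.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

object AtomicCounterSketch {
  // Hypothetical counter standing in for one of the numOnTaskXXX fields.
  val numOnTaskStart = new AtomicInteger(0)

  // Runs `tasks` tasks on a two-thread pool (similar in spirit to local[2]);
  // each task bumps the shared counter `incrementsPerTask` times.
  def runTasks(tasks: Int, incrementsPerTask: Int): Int = {
    numOnTaskStart.set(0)
    val pool = Executors.newFixedThreadPool(2)
    (1 to tasks).foreach { _ =>
      pool.execute(new Runnable {
        override def run(): Unit =
          (1 to incrementsPerTask).foreach(_ => numOnTaskStart.incrementAndGet())
      })
    }
    pool.shutdown()
    pool.awaitTermination(30, TimeUnit.SECONDS)
    numOnTaskStart.get()
  }
}
```

Because `incrementAndGet()` is atomic, `runTasks(2, 10000)` always returns 20000; with a plain `var` counter, concurrent increments could be lost.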
### Why are the changes needed?
The test is still flaky after the fix https://github.com/apache/spark/pull/30072. See: https://github.com/apache/spark/pull/30728/checks?check_run_id=1557987642
It is also easy to reproduce by running the test repeatedly (e.g. 100 times) locally.
The test sets up a stage with 2 tasks to run on an executor with 1 core, so the 2 tasks have to be launched one by one.
Task-2 is launched after task-1 fails. However, since failed tasks are not retried in local mode (MAX_LOCAL_TASK_FAILURES = 1), the stage aborts right after task-1 fails and cancels the running task-2 at the same time. There is a chance that task-2 gets canceled before `PluginContainer.onTaskStart` is called, which leads to the test failure.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Tested manually after the fix and the test is no longer flaky.
Closes #30823 from Ngone51/debug-flaky-spark-33088.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>