### What changes were proposed in this pull request?
when the child rdd has only one partition, skip the shuffle
### Why are the changes needed?
avoid unnecessary shuffle
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing testsuites
Closes#31409 from zhengruifeng/limit_with_single_partition.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Similar to SPARK-33690, this PR improves the output layout of `printSchema` for the case column names contain meta characters.
Here is an example.
Before:
```
scala> val df1 = spark.sql("SELECT 'aaa\nbbb\tccc\rddd\feee\bfff\u000Bggg\u0007hhh'")
scala> df1.printSchema
root
|-- aaa
ddd ccc
eefff
ggghhh: string (nullable = false)
```
After:
```
scala> df1.printSchema
root
|-- aaa\nbbb\tccc\rddd\feee\bfff\vggg\ahhh: string (nullable = false)
```
### Why are the changes needed?
To avoid breaking the layout of `Dataset#printSchema`
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New test.
Closes#31412 from sarutak/escape-meta-printSchema.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This is a followup of https://github.com/apache/spark/pull/31357#31357 added a very strong restriction to the vectorized parquet reader, that the spark data type must exactly match the physical parquet type, when reading decimal fields. This restriction is actually not necessary, as we can safely read parquet decimals with a larger precision. This PR releases this restriction a little bit.
### Why are the changes needed?
To not fail queries unnecessarily.
### Does this PR introduce _any_ user-facing change?
Yes, now users can read parquet decimals with mismatched `DecimalType` as long as the scale is the same and precision is larger.
### How was this patch tested?
updated test.
Closes#31443 from cloud-fan/improve.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to fix the UTs being added in SPARK-31793, so that all things contributing the length limit are properly accounted.
### Why are the changes needed?
The test `DataSourceScanExecRedactionSuite.SPARK-31793: FileSourceScanExec metadata should contain limited file paths` is failing conditionally, depending on the length of the temp directory.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Modified UTs explain the missing points, which also do the test.
Closes#31435 from HeartSaVioR/SPARK-34326.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
This patch proposes to remove `TRUNCATE` from the default `capabilities` list from `FileTable`.
### Why are the changes needed?
The abstract class `FileTable` now lists `TRUNCATE` in its `capabilities`, but `FileTable` does not know if an implementation really supports truncation or not. Specifically, we can check existing `FileTable` implementations including `AvroTable`, `CSVTable`, `JsonTable`, etc. No one implementation really implements `SupportsTruncate` in its writer builder.
### Does this PR introduce _any_ user-facing change?
No, because seems to me `FileTable` is not of user-facing public API.
### How was this patch tested?
Existing unit tests.
Closes#31432 from viirya/SPARK-34324.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
1. Move parser tests from `DDLParserSuite` to `TruncateTableParserSuite`.
2. Port DS v1 tests from `DDLSuite` and other test suites to `v1.TruncateTableSuiteBase` and to `v1.TruncateTableSuite`.
3. Add a test for DSv2 `TRUNCATE TABLE` to `v2.TruncateTableSuite`.
### Why are the changes needed?
To improve test coverage.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running new test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *TruncateTableSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *CatalogedDDLSuite"
```
Closes#31387 from MaxGekk/unify-truncate-table-tests.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Correct the version of SQL configuration `spark.sql.legacy.parseNullPartitionSpecAsStringLiteral` from 3.2.0 to 3.0.2.
Also, revise the documentation and test case.
### Why are the changes needed?
The release version in https://github.com/apache/spark/pull/31421 was wrong.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit tests
Closes#31434 from gengliangwang/reviseVersion.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR fixes an issue that `Dataset.colRegex` doesn't work with column names or qualifiers which contain newlines.
In the current master, if column names or qualifiers passed to `colRegex` contain newlines, it throws exception.
```
val df = Seq(1, 2, 3).toDF("test\n_column").as("test\n_table")
val col1 = df.colRegex("`tes.*\n.*mn`")
org.apache.spark.sql.AnalysisException: Cannot resolve column name "`tes.*
.*mn`" among (test
_column)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$resolveException(Dataset.scala:272)
at org.apache.spark.sql.Dataset.$anonfun$resolve$1(Dataset.scala:263)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.Dataset.resolve(Dataset.scala:263)
at org.apache.spark.sql.Dataset.colRegex(Dataset.scala:1407)
... 47 elided
val col2 = df.colRegex("test\n_table.`tes.*\n.*mn`")
org.apache.spark.sql.AnalysisException: Cannot resolve column name "test
_table.`tes.*
.*mn`" among (test
_column)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$resolveException(Dataset.scala:272)
at org.apache.spark.sql.Dataset.$anonfun$resolve$1(Dataset.scala:263)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.Dataset.resolve(Dataset.scala:263)
at org.apache.spark.sql.Dataset.colRegex(Dataset.scala:1407)
... 47 elided
```
### Why are the changes needed?
Column names and qualifiers can contain newlines but `colRegex` can't work with them, so it's a bug.
### Does this PR introduce _any_ user-facing change?
Yes. users can pass column names and qualifiers even though they contain newlines.
### How was this patch tested?
New test.
Closes#31426 from sarutak/fix-colRegex.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This PR proposes to add `relationTypeMismatchHint` to `UnresolvedTable` so that if a relation is resolved to a view when a table is expected, a hint message can be included as a part of the analysis exception message. Note that the same feature is already introduced to `UnresolvedView` in #30636.
This mostly affects `ALTER TABLE` commands where the analysis exception message will now contain `Please use ALTER VIEW as instead`.
### Why are the changes needed?
To give a better error message. (The hint used to exist but got removed for commands that migrated to the new resolution framework)
### Does this PR introduce _any_ user-facing change?
Yes, now `ALTER TABLE` commands include a hint to use `ALTER VIEW` instead.
```
sql("ALTER TABLE v SET SERDE 'whatever'")
```
Before:
```
"v is a view. 'ALTER TABLE ... SET [SERDE|SERDEPROPERTIES]' expects a table.
```
After this PR:
```
"v is a view. 'ALTER TABLE ... SET [SERDE|SERDEPROPERTIES]' expects a table. Please use ALTER VIEW instead.
```
### How was this patch tested?
Updated existing test cases to include the hint.
Closes#31424 from imback82/better_error.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In spark, the `count(table.*)` may cause very weird result, for example:
```
select count(*) from (select 1 as a, null as b) t;
output: 1
select count(t.*) from (select 1 as a, null as b) t;
output: 0
```
This is because spark expands `t.*` while converts `*` to count(1), this will confuse
users. After checking the ANSI standard, `count(*)` should always be `count(1)` while `count(t.*)`
is not allowed. What's more, this is also not allowed by common databases, e.g. MySQL, Oracle.
So, this PR proposes to block the ambiguous behavior and print a clear error message for users.
### Why are the changes needed?
to avoid ambiguous behavior and follow ANSI standard and other SQL engines
### Does this PR introduce _any_ user-facing change?
Yes, `count(table.*)` behavior will be blocked and output an error message.
### How was this patch tested?
newly added and existing tests
Closes#31286 from linhongliu-db/fix-table-star.
Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This is a follow up for https://github.com/apache/spark/pull/30538.
It adds a legacy conf `spark.sql.legacy.parseNullPartitionSpecAsStringLiteral` in case users wants the legacy behavior.
It also adds document for the behavior change.
### Why are the changes needed?
In case users want the legacy behavior, they can set `spark.sql.legacy.parseNullPartitionSpecAsStringLiteral` as true.
### Does this PR introduce _any_ user-facing change?
Yes, adding a legacy configuration to restore the old behavior.
### How was this patch tested?
Unit test.
Closes#31421 from gengliangwang/legacyNullStringConstant.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR completes snake_case rule at functions APIs across the languages, see also SPARK-10621.
In more details, this PR:
- Adds `count_distinct` in Scala Python, and R, and document that `count_distinct` is encouraged. This was not deprecated because `countDistinct` is pretty commonly used. We could deprecate in the future releases.
- (Scala-specific) adds `typedlit` but doesn't deprecate `typedLit` which is arguably commonly used. Likewise, we could deprecate in the future releases.
- Deprecates and renames:
- `sumDistinct` -> `sum_distinct`
- `bitwiseNOT` -> `bitwise_not`
- `shiftLeft` -> `shiftleft` (matched with SQL name in `FunctionRegistry`)
- `shiftRight` -> `shiftright` (matched with SQL name in `FunctionRegistry`)
- `shiftRightUnsigned` -> `shiftrightunsigned` (matched with SQL name in `FunctionRegistry`)
- (Scala-specific) `callUDF` -> `call_udf`
### Why are the changes needed?
To keep the consistent naming in APIs.
### Does this PR introduce _any_ user-facing change?
Yes, it deprecates some APIs and add new renamed APIs as described above.
### How was this patch tested?
Unittests were added.
Closes#31408 from HyukjinKwon/SPARK-34306.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Replaces `collection.map(f1).flatten(f2)` with `collection.flatMap` if possible. it's semantically consistent, but looks simpler.
### Why are the changes needed?
Code Simpilefications.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass the Jenkins or GitHub Action
Closes#31416 from LuciferYang/SPARK-34310.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Passing around the output attributes should have more benefits like keeping the expr ID unchanged to avoid bugs when we apply more operators above the command output dataframe.
This PR keep SHOW COLUMNS command's output attribute exprId unchanged.
### Why are the changes needed?
Keep SHOW PARTITIONS command's output attribute exprid unchanged.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added UT
Closes#31377 from AngersZhuuuu/SPARK-34239.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Replace v1 exec node `AlterTableRecoverPartitionsCommand` by the logical node `AlterTableRecoverPartitions` in `CatalogImpl.recoverPartitions()`.
### Why are the changes needed?
1. Print user friendly error message for views:
```
my_temp_table is a temp view. 'recoverPartitions()' expects a table
```
Before the changes:
```
Table or view 'my_temp_table' not found in database 'default'
```
2. To not bind to v1 `ALTER TABLE .. RECOVER PARTITIONS`, and to support v2 tables potentially as well.
### Does this PR introduce _any_ user-facing change?
Yes, it can.
### How was this patch tested?
By running new test in `CatalogSuite`:
```
$ build/sbt -Phive -Phive-thriftserver "test:testOnly org.apache.spark.sql.internal.CatalogSuite"
```
Closes#31403 from MaxGekk/catalogimpl-recoverPartitions.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
1. Remove old statement `AlterTableSetLocationStatement`
2. Introduce new command `AlterTableSetLocation` for `ALTER TABLE .. SET LOCATION`.
### Why are the changes needed?
This is a part of effort to make the relation lookup behavior consistent: SPARK-29900.
### Does this PR introduce _any_ user-facing change?
It can change the error message for views.
### How was this patch tested?
By running `ALTER TABLE .. SET LOCATION` tests:
```
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *DataSourceV2SQLSuite"
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *CatalogedDDLSuite"
```
Closes#31414 from MaxGekk/migrate-set-location-resolv-table.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Describe `SessionCatalog.refreshTable()` and `CatalogImpl.refreshTable()`. what they do and when they are supposed to be used.
### Why are the changes needed?
To improve code maintenance.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running `./dev/scalastyle`
Closes#31364 from MaxGekk/doc-refreshTable.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR is a followup that removes `bitwiseGet` in functions API. This is mainly for SQL compliance, and arguably not very much commonly used.
### Why are the changes needed?
See https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L41-L59
### Does this PR introduce _any_ user-facing change?
No, it's a change in unreleased branches.
### How was this patch tested?
Existing tests should cover.
Closes#31410 from HyukjinKwon/SPARK-33245.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
`ResolveSessionCatalog`'s `isTempView` and `isTempFunction` are not being used anymore since the resolution of temp view/function has moved to `Analyzer`.
This PR proposes to remove `isTempView` and `isTempFunction` from `ResolveSessionCatalog`.
### Why are the changes needed?
To clean up unused variables.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests should cover as this PR just removes the unused variables.
Closes#31400 from imback82/cleanup_resolve_session_catalog.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR is a follow-up to #31368 to add a test case that has a subquery with "view" in aggregate's grouping expression. The existing test tests a subquery with dataframe's temp view, so it doesn't contain a `View` node.
### Why are the changes needed?
To increase the test coverage.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added a new test.
Closes#31352 from imback82/grouping_expr.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This adds a few test cases for looking up cached temporary/permanent view created using clauses such as `ORDER BY` or `LIMIT`.
### Why are the changes needed?
Due to `EliminateView` and how canonization is done for `View`, which inserts an extra project operator, cache lookup could fail in the following simple example:
```sql
> CREATE TABLE t (key bigint, value string) USING parquet
> CACHE TABLE v1 AS SELECT * FROM t ORDER BY key
> SELECT * FROM t ORDER BY key
```
#31368 addresses this issue by removing the project operator if `canRemoveProject` check is successful. This PR adds a few tests.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
This PR just adds unit tests.
Closes#31182 from sunchao/SPARK-34108.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
This patch proposes to sum up custom metric values instead of taking arbitrary one when combining `StateStoreMetrics`.
### Why are the changes needed?
For stateful join in structured streaming, we need to combine `StateStoreMetrics` from both left and right side. Currently we simply take arbitrary one from custom metrics with same name from left and right. By doing this we miss half of metric number.
### Does this PR introduce _any_ user-facing change?
Yes, this corrects metrics collected for stateful join.
### How was this patch tested?
Unit test.
Closes#31369 from viirya/SPARK-34270.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This pr make parquet vectorized reader support [column index](https://issues.apache.org/jira/browse/PARQUET-1201).
### Why are the changes needed?
Improve filter performance. for example: `id = 1`, we only need to read `page-0` in `block 1`:
```
block 1:
null count min max
page-0 0 0 99
page-1 0 100 199
page-2 0 200 299
page-3 0 300 399
page-4 0 400 449
block 2:
null count min max
page-0 0 450 549
page-1 0 550 649
page-2 0 650 749
page-3 0 750 849
page-4 0 850 899
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test and benchmark: https://github.com/apache/spark/pull/31393#issuecomment-769767724Closes#31393 from wangyum/SPARK-34289.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
1. Test both date and timestamp column types
2. Write the timestamp as the `TIMESTAMP_MICROS` logical type
3. Change the timestamp value to `'1000-01-01 01:02:03'` to check exception throwing.
### Why are the changes needed?
To improve test coverage.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running the modified test suite:
```
$ build/sbt "testOnly org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite"
```
Closes#31396 from MaxGekk/parquet-test-metakey-followup.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
When writing rows to a table only the old date time API types are handled in org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils#makeSetter. If the new API is used (spark.sql.datetime.java8API.enabled=true) casting Instant and LocalDate to Timestamp and Date respectively fails. The proposed change is to handle Instant and LocalDate values and transform them to Timestamp and Date.
### Why are the changes needed?
In the current state writing Instant or LocalDate values to a table fails with something like:
Caused by: java.lang.ClassCastException: class java.time.LocalDate cannot be cast to class java.sql.Date (java.time.LocalDate is in module java.base of loader 'bootstrap'; java.sql.Date is in module java.sql of loader 'platform') at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeSetter$11(JdbcUtils.scala:573) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeSetter$11$adapted(JdbcUtils.scala:572) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:678) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$saveTable$1(JdbcUtils.scala:858) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$saveTable$1$adapted(JdbcUtils.scala:856) at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2(RDD.scala:994) at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2$adapted(RDD.scala:994) at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2139) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449) ... 3 more
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added tests
Closes#31264 from cristichircu/SPARK-34144.
Lead-authored-by: Chircu <chircu@arezzosky.com>
Co-authored-by: Cristi Chircu <cristian.chircu@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
The currently SQL (temp or permanent) view resolution is done in 2 steps:
1. In `SessionCatalog`, we get the view metadata, parse the view SQL string, and wrap it with `View`.
2. At the beginning of the optimizer, we run `EliminateView`, which drops the wrapper `View`, and apply some special logic to match the view schema.
Step 2 is tricky, as we need to retain the output attr expr id, while we need to add an extra `Project` to add cast and alias. This PR simplifies the view solution by building a completed plan (with cast and alias added) in `SessionCatalog`, so that we only have 1 step.
### Why are the changes needed?
Code simplification. It also fixes issues like https://github.com/apache/spark/pull/31352
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing tests
Closes#31368 from cloud-fan/try.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR is to add two more metrics for `ObjectHashAggregateExec`, i.e. the spill size, and number of fallback to sort-based aggregation.
### Why are the changes needed?
As object hash aggregate fallback mechanism is special - it will fallback to sort-based aggregation based on number of keys seen so far [0]. This fallback logic sometimes is sub-optimal and leads to unnecessary sort, and performance degradation in run-time. The first step to help user/developer debug is to add more related metrics on UI, e.g. spill size, and number of fallback to sort-based aggregation. (spill size metrics was already added for hash aggregate [1])
[0]: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/ObjectAggregationIterator.scala#L161
[1]: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L68
### Does this PR introduce _any_ user-facing change?
Added two more metrics on Spark UI for operator `ObjectHashAggregateExec`. Screenshot is attached below.
### How was this patch tested?
* Added unit test in `SQLMetricsSuite.scala`.
* Tested on spark shell locally and verified the metrics shown up on UI.
<img width="399" alt="Screen Shot 2021-01-28 at 1 44 40 PM" src="https://user-images.githubusercontent.com/4629931/106204224-7a8a1300-6171-11eb-9814-c3432abadc29.png">
Closes#31340 from c21/object-hash.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add function exists check before load resource.
### Why are the changes needed?
We should not add jar into classpath if the create temporary function is already exists.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Add test.
Closes#31358 from ulysses-you/SPARK-34261.
Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Object hash aggregate will fallback to sort-based aggregation based on number of keys seen so far [0]. The default config threshold is 128 (spark.sql.objectHashAggregate.sortBased.fallbackThreshold in [1]). There's an edge case we can do better, where we do not fallback if there's no more input rows. Suppose the task only has 128 group-by keys in hash ma, we don't need to fallback in this case, and we can save the extra sort. This is an rare edge case in production, but it can happen.
[0]: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/ObjectAggregationIterator.scala#L161
[1]: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L1615
### Why are the changes needed?
To avoid unnecessary sort in query. Save resource.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Add a unit test to verify task fallback or not is challenging. Given the change is pretty minor, besides relying on existing test in `ObjectHashAggregateSuite.scala`, I manually ran the followed query, and verified in debug mode that the code path for fallback was not executed. And verified the code path for fallback was executed without this change.
```
withSQLConf(
SQLConf.USE_OBJECT_HASH_AGG.key -> "true",
SQLConf.OBJECT_AGG_SORT_BASED_FALLBACK_THRESHOLD.key -> "1") {
Seq.fill(1)(Tuple1(Array.empty[Int]))
.toDF("c0")
.groupBy(lit(1))
.agg(typed_count($"c0"), max($"c0")).collect()
}
```
Closes#31353 from c21/object-hash-fallback.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add `count_distinct` as an option argument to Dataset#summary (the method already supports count, min, max, etc.)
### Why are the changes needed?
The `summary()` method is used for lightweight exploratory data analysis. A distinct count of all the columns is one of the most common exploratory data analysis queries.
Distinct counts can be expensive, so this shouldn't be enabled by default. The proposed implementation is completely backwards compatible.
### Does this PR introduce _any_ user-facing change?
Yes, users can now call `df.summary("count_distinct")`, which wasn't an option before. Users can still call `df.summary()` without any arguments and the output is the same. `count_distinct` was not added as one of the `defaultStatistics`.
### How was this patch tested?
Unit tests.
### Additional comments
If this idea is accepted, we should add a PySpark implementation in this PR, as suggested by zero323.
Closes#31254 from MrPowers/SPARK-34165.
Authored-by: MrPowers <matthewkevinpowers@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Use `count` to simplify `find + size(or length)` operation, it's semantically consistent, but looks simpler.
**Before**
```
seq.filter(p).size
```
**After**
```
seq.count(p)
```
### Why are the changes needed?
Code Simpilefications.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass the Jenkins or GitHub Action
Closes#31374 from LuciferYang/SPARK-34275.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Invoke `CatalogImpl.refreshTable()` in v1 implementation of the `ALTER TABLE .. SET LOCATION` command to refresh cached table data.
### Why are the changes needed?
The example below portraits the issue:
- Create a source table:
```sql
spark-sql> CREATE TABLE src_tbl (c0 int, part int) USING hive PARTITIONED BY (part);
spark-sql> INSERT INTO src_tbl PARTITION (part=0) SELECT 0;
spark-sql> SHOW TABLE EXTENDED LIKE 'src_tbl' PARTITION (part=0);
default src_tbl false Partition Values: [part=0]
Location: file:/Users/maximgekk/proj/refresh-cache-set-location/spark-warehouse/src_tbl/part=0
...
```
- Set new location for the empty partition (part=0):
```sql
spark-sql> CREATE TABLE dst_tbl (c0 int, part int) USING hive PARTITIONED BY (part);
spark-sql> ALTER TABLE dst_tbl ADD PARTITION (part=0);
spark-sql> INSERT INTO dst_tbl PARTITION (part=1) SELECT 1;
spark-sql> CACHE TABLE dst_tbl;
spark-sql> SELECT * FROM dst_tbl;
1 1
spark-sql> ALTER TABLE dst_tbl PARTITION (part=0) SET LOCATION '/Users/maximgekk/proj/refresh-cache-set-location/spark-warehouse/src_tbl/part=0';
spark-sql> SELECT * FROM dst_tbl;
1 1
```
The last query does not return new loaded data.
### Does this PR introduce _any_ user-facing change?
Yes. After the changes, the example above works correctly:
```sql
spark-sql> ALTER TABLE dst_tbl PARTITION (part=0) SET LOCATION '/Users/maximgekk/proj/refresh-cache-set-location/spark-warehouse/src_tbl/part=0';
spark-sql> SELECT * FROM dst_tbl;
0 0
1 1
```
### How was this patch tested?
Added new test to `org.apache.spark.sql.hive.CachedTableSuite`:
```
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *CachedTableSuite"
```
Closes#31361 from MaxGekk/refresh-cache-set-location.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Passing around the output attributes should have more benefits like keeping the expr ID unchanged to avoid bugs when we apply more operators above the command output dataframe.
This PR keep SHOW PARTITIONS command's output attribute exprId unchanged.
And benefit for https://issues.apache.org/jira/browse/SPARK-34238
### Why are the changes needed?
Keep SHOW PARTITIONS command's output attribute exprid unchanged.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added UT
Closes#31341 from AngersZhuuuu/SPARK-34238.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In PR #30140, it will compare new and old plans when replacing view and uncache data
if the view has changed. But the compared new plan is not analyzed which will cause
`UnresolvedException` when calling `sameResult`. So in this PR, we use the analyzed
plan to compare to fix this problem.
### Why are the changes needed?
bug fix
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
newly added tests
Closes#31360 from linhongliu-db/SPARK-34260.
Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
When generating SQL queries only the old date time API types are handled for values in org.apache.spark.sql.jdbc.JdbcDialect#compileValue. If the new API is used (spark.sql.datetime.java8API.enabled=true) Instant and LocalDate values are not quoted and errors are thrown. The change proposed is to handle Instant and LocalDate values the same way that Timestamp and Date are.
### Why are the changes needed?
In the current state if an Instant is used in a filter, an exception will be thrown.
Ex (dataset was read from PostgreSQL): dataset.filter(current_timestamp().gt(col(VALID_FROM)))
Stacktrace (the T11 is from an instant formatted like yyyy-MM-dd'T'HH:mm:ss.SSSSSS'Z'):
Caused by: org.postgresql.util.PSQLException: ERROR: syntax error at or near "T11"Caused by: org.postgresql.util.PSQLException: ERROR: syntax error at or near "T11" Position: 285 at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2103) at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1836) at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:257) at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:512) at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:388) at org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:273) at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:304) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:834)
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Test added
Closes#31148 from cristichircu/SPARK-33867.
Lead-authored-by: Chircu <chircu@arezzosky.com>
Co-authored-by: Cristi Chircu <chircu@arezzosky.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Remove `SessionState.refreshTable()` and modify the tests where the method is used.
### Why are the changes needed?
There are already 2 methods with the same name in:
- `SessionCatalog`
- `CatalogImpl`
One more method in `SessionState` does not give any benefits. By removing it, we can improve code maintenance.
### Does this PR introduce _any_ user-facing change?
Should not because `SessionState` is an internal class.
### How was this patch tested?
By running the modified test suites:
```
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *MetastoreDataSourcesSuite"
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *HiveOrcQuerySuite"
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *HiveParquetMetastoreSuite"
```
Closes#31366 from MaxGekk/remove-refreshTable-from-SessionState.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This is a followup of https://github.com/apache/spark/pull/31319 .
When reading parquet int/long as decimal, the behavior should be the same as reading int/long and then cast to the decimal type. This PR changes to the expected behavior.
When reading parquet binary as decimal, we don't really know how to interpret the binary (it may from a string), and should fail. This PR changes to the expected behavior.
### Why are the changes needed?
To make the behavior more sane.
### Does this PR introduce _any_ user-facing change?
Yes, but it's a followup.
### How was this patch tested?
updated test
Closes#31357 from cloud-fan/bug.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR changes the column types in the table definitions of `TPCDSBase` from string to char and varchar, with respect to the original definitions for char/varchar columns in the official doc - [TPC-DS_v2.9.0](http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-ds_v2.9.0.pdf).
### Why are the changes needed?
Comply with both TPCDS standard and ANSI, and using string will get wrong results with those TPCDS queries
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
plan stability check
Closes#31012 from yaooqinn/tpcds.
Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
1. Take into account the SQL config `spark.sql.statistics.size.autoUpdate.enabled` in the `TRUNCATE TABLE` command as other commands do.
2. Re-calculate actual table size in fs. Before the changes, `TRUNCATE TABLE` always sets table size to 0 in stats.
### Why are the changes needed?
This fixes the bug that is demonstrated by the example:
1. Create a partitioned table with 2 non-empty partitions:
```sql
spark-sql> CREATE TABLE tbl (c0 int, part int) PARTITIONED BY (part);
spark-sql> INSERT INTO tbl PARTITION (part=0) SELECT 0;
spark-sql> INSERT INTO tbl PARTITION (part=1) SELECT 1;
spark-sql> ANALYZE TABLE tbl COMPUTE STATISTICS;
spark-sql> DESCRIBE TABLE EXTENDED tbl;
...
Statistics 4 bytes, 2 rows
...
```
2. Truncate only one partition:
```sql
spark-sql> TRUNCATE TABLE tbl PARTITION (part=1);
spark-sql> SELECT * FROM tbl;
0 0
```
3. The table is still non-empty but `TRUNCATE TABLE` reseted stats:
```
spark-sql> DESCRIBE TABLE EXTENDED tbl;
...
Statistics 0 bytes, 0 rows
...
```
### Does this PR introduce _any_ user-facing change?
It could impact on performance of following queries.
### How was this patch tested?
Added new test to `StatisticsCollectionSuite`:
```
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *StatisticsCollectionSuite"
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *StatisticsSuite"
```
Closes#31350 from MaxGekk/fix-stats-in-trunc-table.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
we need to check whether the `lit` is null before calling `numChars`
### Why are the changes needed?
fix an obvious NPE bug
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
new tests
Closes#31336 from yaooqinn/SPARK-34233.
Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
For v2 static partitions overwriting, we use `EqualTo ` to generate the `deleteExpr`
This is not right for null partition values, and cause the problem like below because `ConstantFolding` converts it to lit(null)
```scala
SPARK-34223: static partition with null raise NPE *** FAILED *** (19 milliseconds)
[info] org.apache.spark.sql.AnalysisException: Cannot translate expression to source filter: null
[info] at org.apache.spark.sql.execution.datasources.v2.V2Writes$$anonfun$apply$1.$anonfun$applyOrElse$1(V2Writes.scala:50)
[info] at scala.collection.immutable.List.flatMap(List.scala:366)
[info] at org.apache.spark.sql.execution.datasources.v2.V2Writes$$anonfun$apply$1.applyOrElse(V2Writes.scala:47)
[info] at org.apache.spark.sql.execution.datasources.v2.V2Writes$$anonfun$apply$1.applyOrElse(V2Writes.scala:39)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:317)
[info] at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
```
The right way is to use EqualNullSafe instead to delete the null partitions.
### Why are the changes needed?
bugfix
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
an original test to new place
Closes#31339 from yaooqinn/SPARK-34236.
Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR aims to the correctness issues during reading decimal values from Parquet files.
- For **MR** code path, `ParquetRowConverter` can read Parquet's decimal values with the original precision and scale written in the corresponding footer.
- For **Vectorized** code path, `VectorizedColumnReader` throws `SchemaColumnConvertNotSupportedException`.
### Why are the changes needed?
Currently, Spark returns incorrect results when the Parquet file's decimal precision and scale are different from the Spark's schema. This happens when there is multiple files with different decimal schema or HiveMetastore has a new schema.
**BEFORE (Simplified example for correctness)**
```scala
scala> sql("SELECT 1.0 a").write.parquet("/tmp/decimal")
scala> spark.read.schema("a DECIMAL(3,2)").parquet("/tmp/decimal").show
+----+
| a|
+----+
|0.10|
+----+
```
This works correctly in the other data sources, `ORC/JSON/CSV`, like the following.
```scala
scala> sql("SELECT 1.0 a").write.orc("/tmp/decimal_orc")
scala> spark.read.schema("a DECIMAL(3,2)").orc("/tmp/decimal_orc").show
+----+
| a|
+----+
|1.00|
+----+
```
**AFTER**
1. **Vectorized** path: Instead of incorrect result, we will raise an explicit exception.
```scala
scala> spark.read.schema("a DECIMAL(3,2)").parquet("/tmp/decimal").show
java.lang.UnsupportedOperationException: Schema evolution not supported.
```
2. **MR** path (complex schema or explicit configuration): Spark returns correct results.
```scala
scala> spark.read.schema("a DECIMAL(3,2), b DECIMAL(18, 3), c MAP<INT,INT>").parquet("/tmp/decimal").show
+----+-------+--------+
| a| b| c|
+----+-------+--------+
|1.00|100.000|{1 -> 2}|
+----+-------+--------+
scala> spark.read.schema("a DECIMAL(3,2), b DECIMAL(18, 3), c MAP<INT,INT>").parquet("/tmp/decimal").printSchema
root
|-- a: decimal(3,2) (nullable = true)
|-- b: decimal(18,3) (nullable = true)
|-- c: map (nullable = true)
| |-- key: integer
| |-- value: integer (valueContainsNull = true)
```
### Does this PR introduce _any_ user-facing change?
Yes. This fixes the correctness issue.
### How was this patch tested?
Pass with the newly added test case.
Closes#31319 from dongjoon-hyun/SPARK-34212.
Lead-authored-by: Dongjoon Hyun <dhyun@apple.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
https://github.com/apache/spark/pull/27507 implements `regexp_extract_all` and added the scala function version of it.
According https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L41-L59, it seems good for remove the scala function version. Although I think is regexp_extract_all is very useful, if we just reference the description.
### Why are the changes needed?
`regexp_extract_all` is less common.
### Does this PR introduce _any_ user-facing change?
'No'. `regexp_extract_all` was added in Spark 3.1.0 which isn't released yet.
### How was this patch tested?
Jenkins test.
Closes#31346 from beliefer/SPARK-24884-followup.
Authored-by: beliefer <beliefer@163.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR adds repartition and sort nodes to satisfy the required distribution and ordering introduced in SPARK-33779.
Note: This PR contains the final part of changes discussed in PR #29066.
### Why are the changes needed?
These changes are the next step as discussed in the [design doc](https://docs.google.com/document/d/1X0NsQSryvNmXBY9kcvfINeYyKC-AahZarUqg3nS1GQs/edit#) for SPARK-23889.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
This PR comes with a new test suite.
Closes#31083 from aokolnychyi/spark-34026.
Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Using a function like `.mkString` or `.getLines` directly on a `scala.io.Source` opened by `fromFile`, `fromURL`, `fromURI ` will leak the underlying file handle, this pr use the `Utils.tryWithResource` method wrap the `BufferedSource` to ensure these `BufferedSource` closed.
### Why are the changes needed?
Avoid file handle leak.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass the Jenkins or GitHub Action
Closes#31323 from LuciferYang/source-not-closed.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR fixed all `OffsetWindowFunctionFrameBase#prepare` implementations to reset the states, and also add more comments in `WindowFunctionFrame` classdoc to explain why we need to reset states during preparation: `WindowFunctionFrame` instances are reused to process multiple partitions.
### Why are the changes needed?
To fix a correctness bug caused by the new feature "window function with ignore nulls" in the master branch.
### Does this PR introduce _any_ user-facing change?
yes
### How was this patch tested?
new test
Closes#31325 from cloud-fan/bug.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
with a simple case, the null will be passed to InsertIntoHadoopFsRelationCommand blindly, we should avoid the npe
```scala
test("NPE") {
withTable("t") {
sql(s"CREATE TABLE t(i STRING, c string) USING $format PARTITIONED BY (c)")
sql("INSERT OVERWRITE t PARTITION (c=null) VALUES ('1')")
checkAnswer(spark.table("t"), Row("1", null))
}
}
```
```logtalk
java.lang.NullPointerException
at scala.collection.immutable.StringOps$.length(StringOps.scala:51)
at scala.collection.immutable.StringOps.length(StringOps.scala:51)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:35)
at scala.collection.IndexedSeqOptimized.foreach
at scala.collection.immutable.StringOps.foreach(StringOps.scala:33)
at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.escapePathName(ExternalCatalogUtils.scala:69)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.orig-s0.0000030000-r30676-expand-or-complete(InsertIntoHadoopFsRelationCommand.scala:231)
```
### Why are the changes needed?
a bug fix
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
new tests
Closes#31320 from yaooqinn/SPARK-34223.
Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
On the read-side, the char length check and padding bring issues to CBO and predicate pushdown and other issues to the catalyst.
This PR reverts 6da5cdf1db that added read side length check) so that we only do length check for the write side, and data sources/vendors are responsible to enforce the char/varchar constraints for data import operations like ADD PARTITION. It doesn't make sense for Spark to report errors on the read-side if the data is already dirty.
This PR also moves the char padding to the write-side, so that it 1) avoids read side issues like CBO and filter pushdown. 2) the data source can preserve char type semantic better even if it's read by systems other than Spark.
### Why are the changes needed?
fix perf regression when tables have char/varchar type columns
closes#31278
### Does this PR introduce _any_ user-facing change?
yes, spark will not raise error for oversized char/varchar values in read side
### How was this patch tested?
modified ut
the dropped read side benchmark
```
================================================================================================
Char Varchar Read Side Perf w/o Tailing Spaces
================================================================================================
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
Read with length 20: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 20 1564 1573 9 63.9 15.6 1.0X
read char with length 20 1532 1551 18 65.3 15.3 1.0X
read varchar with length 20 1520 1531 13 65.8 15.2 1.0X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
Read with length 40: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 40 1573 1613 41 63.6 15.7 1.0X
read char with length 40 1575 1577 2 63.5 15.7 1.0X
read varchar with length 40 1568 1576 11 63.8 15.7 1.0X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
Read with length 60: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 60 1526 1540 23 65.5 15.3 1.0X
read char with length 60 1514 1539 23 66.0 15.1 1.0X
read varchar with length 60 1486 1497 10 67.3 14.9 1.0X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
Read with length 80: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 80 1531 1542 19 65.3 15.3 1.0X
read char with length 80 1514 1529 15 66.0 15.1 1.0X
read varchar with length 80 1524 1565 42 65.6 15.2 1.0X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
Read with length 100: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 100 1597 1623 25 62.6 16.0 1.0X
read char with length 100 1499 1512 16 66.7 15.0 1.1X
read varchar with length 100 1517 1524 8 65.9 15.2 1.1X
================================================================================================
Char Varchar Read Side Perf w/ Tailing Spaces
================================================================================================
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
Read with length 20: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 20 1524 1526 1 65.6 15.2 1.0X
read char with length 20 1532 1537 9 65.3 15.3 1.0X
read varchar with length 20 1520 1532 15 65.8 15.2 1.0X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
Read with length 40: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 40 1556 1580 32 64.3 15.6 1.0X
read char with length 40 1600 1611 17 62.5 16.0 1.0X
read varchar with length 40 1648 1716 88 60.7 16.5 0.9X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
Read with length 60: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 60 1504 1524 20 66.5 15.0 1.0X
read char with length 60 1509 1512 3 66.2 15.1 1.0X
read varchar with length 60 1519 1535 21 65.8 15.2 1.0X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
Read with length 80: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 80 1640 1652 17 61.0 16.4 1.0X
read char with length 80 1625 1666 35 61.5 16.3 1.0X
read varchar with length 80 1590 1605 13 62.9 15.9 1.0X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
Read with length 100: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 100 1622 1628 5 61.6 16.2 1.0X
read char with length 100 1614 1646 30 62.0 16.1 1.0X
read varchar with length 100 1594 1606 11 62.7 15.9 1.0X
```
Closes#31281 from yaooqinn/SPARK-34192.
Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose to convert `null` partition values to `"__HIVE_DEFAULT_PARTITION__"` before storing in the `In-Memory` catalog internally. Currently, the `In-Memory` catalog maintains null partitions as `"__HIVE_DEFAULT_PARTITION__"` in file system but as `null` values in memory that could cause some issues like in SPARK-34203.
### Why are the changes needed?
`InMemoryCatalog` stores partitions in the file system in the Hive compatible form, for instance, it converts the `null` partition value to `"__HIVE_DEFAULT_PARTITION__"` but at the same time it keeps null as is internally. That causes an issue demonstrated by the example below:
```
$ ./bin/spark-shell -c spark.sql.catalogImplementation=in-memory
```
```scala
scala> spark.conf.get("spark.sql.catalogImplementation")
res0: String = in-memory
scala> sql("CREATE TABLE tbl (col1 INT, p1 STRING) USING parquet PARTITIONED BY (p1)")
res1: org.apache.spark.sql.DataFrame = []
scala> sql("INSERT OVERWRITE TABLE tbl VALUES (0, null)")
res2: org.apache.spark.sql.DataFrame = []
scala> sql("ALTER TABLE tbl DROP PARTITION (p1 = null)")
org.apache.spark.sql.catalyst.analysis.NoSuchPartitionsException: The following partitions not found in table 'tbl' database 'default':
Map(p1 -> null)
at org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.dropPartitions(InMemoryCatalog.scala:440)
```
### Does this PR introduce _any_ user-facing change?
Yes. After the changes, `ALTER TABLE .. DROP PARTITION` can drop the `null` partition in `In-Memory` catalog:
```scala
scala> spark.table("tbl").show(false)
+----+----+
|col1|p1 |
+----+----+
|0 |null|
+----+----+
scala> sql("ALTER TABLE tbl DROP PARTITION (p1 = null)")
res4: org.apache.spark.sql.DataFrame = []
scala> spark.table("tbl").show(false)
+----+---+
|col1|p1 |
+----+---+
+----+---+
```
### How was this patch tested?
Added new test to `AlterTableDropPartitionSuiteBase`:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite"
```
Closes#31322 from MaxGekk/insert-overwrite-null-part.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
optimize: change Option[LogicalRelation] to LogicalRelation
### Why are the changes needed?
simplify code
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
Existed unit test.
Closes#31315 from monkeyboy123/spark-34067-follow-up.
Authored-by: Dereck Li <monkeyboy.ljh@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
The `RowBasedKeyValueBatch` has two different implementations depending on whether the aggregation key and value uses only fixed length data types (`FixedLengthRowBasedKeyValueBatch`) or not (`VariableLengthRowBasedKeyValueBatch`).
Before this PR the decision about the used implementation was based on by accessing the schema fields by their name.
But if two fields has the same name and one with variable length and the other with fixed length type (and all the other fields are with fixed length types) a bad decision could be made.
When `FixedLengthRowBasedKeyValueBatch` is chosen but there is a variable length field then an aggregation function could calculate with invalid values. This case is illustrated by the example used in the unit test:
`with T as (select id as a, -id as x from range(3)),
U as (select id as b, cast(id as string) as x from range(3))
select T.x, U.x, min(a) as ma, min(b) as mb from T join U on a=b group by U.x, T.x`
where the 'x' column in the left side of the join is a Long but on the right side is a String.
### Why are the changes needed?
Fixes the issue where duplicate field name aggregation has null values in the dataframe.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added UT, tested manually on spark shell.
Closes#30788 from yliou/SPARK-33726.
Authored-by: yliou <yliou@berkeley.edu>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR keeps partitioning of input tables in TPCDSQueryBenchmark when `--cbo` option is enabled.
https://github.com/apache/spark/pull/31011 introduced the `--cbo` option but unfortunately in that mode the table partitioning of the input data is lost. This means that the results of CBO mode is very different to non CBO mode, one example is that Dynamic Partition Pruning doesn't kick in in CBO mode.
### Why are the changes needed?
To monitor performance changed with CBO enabled.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manually checked.
Closes#31218 from peter-toth/SPARK-34147-keep-partitioning-in-tpcdsquerybenchmark.
Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Compare the 3.1.1 API doc with the latest release version 3.0.1. Fix the following issues:
- Add missing `Since` annotation for new APIs
- Remove the leaking class/object in API doc
### Why are the changes needed?
Fix the issues in the Spark 3.1.1 release API docs.
### Does this PR introduce _any_ user-facing change?
Yes, API doc changes.
### How was this patch tested?
Manually test.
Closes#31271 from xuanyuanking/SPARK-34185.
Lead-authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Adds `NullType` support for Arrow executions.
### Why are the changes needed?
As Arrow supports null type, we can convert `NullType` between PySpark and pandas with Arrow enabled.
### Does this PR introduce _any_ user-facing change?
Yes, if a user has a DataFrame including `NullType`, it will be able to convert with Arrow enabled.
### How was this patch tested?
Added tests.
Closes#31285 from ueshin/issues/SPARK-33489/arrow_nulltype.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This pr add partition columns for TPCDS tables. The partition column is consistent with the [TPCDSTables](https://github.com/databricks/spark-sql-perf/blob/master/src/main/scala/com/databricks/spark/sql/perf/tpcds/TPCDSTables.scala).
### Why are the changes needed?
Better track plan changes. For example, [this is the change](3fe1a93a40) after SPARK-34119.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
N/A
Closes#31243 from wangyum/SPARK-34155.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Invoke `CatalogImpl.refreshTable()` instead of `SessionCatalog.refreshTable` in v1 implementation of the `LOAD DATA` command. `SessionCatalog.refreshTable` just refreshes metadata comparing to `CatalogImpl.refreshTable()` which refreshes cached table data as well.
### Why are the changes needed?
The example below portraits the issue:
- Create a source table:
```sql
spark-sql> CREATE TABLE src_tbl (c0 int, part int) USING hive PARTITIONED BY (part);
spark-sql> INSERT INTO src_tbl PARTITION (part=0) SELECT 0;
spark-sql> SHOW TABLE EXTENDED LIKE 'src_tbl' PARTITION (part=0);
default src_tbl false Partition Values: [part=0]
Location: file:/Users/maximgekk/proj/load-data-refresh-cache/spark-warehouse/src_tbl/part=0
...
```
- Load data from the source table to a cached destination table:
```sql
spark-sql> CREATE TABLE dst_tbl (c0 int, part int) USING hive PARTITIONED BY (part);
spark-sql> INSERT INTO dst_tbl PARTITION (part=1) SELECT 1;
spark-sql> CACHE TABLE dst_tbl;
spark-sql> SELECT * FROM dst_tbl;
1 1
spark-sql> LOAD DATA LOCAL INPATH '/Users/maximgekk/proj/load-data-refresh-cache/spark-warehouse/src_tbl/part=0' INTO TABLE dst_tbl PARTITION (part=0);
spark-sql> SELECT * FROM dst_tbl;
1 1
```
The last query does not return new loaded data.
### Does this PR introduce _any_ user-facing change?
Yes. After the changes, the example above works correctly:
```sql
spark-sql> LOAD DATA LOCAL INPATH '/Users/maximgekk/proj/load-data-refresh-cache/spark-warehouse/src_tbl/part=0' INTO TABLE dst_tbl PARTITION (part=0);
spark-sql> SELECT * FROM dst_tbl;
0 0
1 1
```
### How was this patch tested?
Added new test to `org.apache.spark.sql.hive.CachedTableSuite`:
```
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *CachedTableSuite"
```
Closes#31304 from MaxGekk/load-data-refresh-cache.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Rename `SessionCatalog.isTemporaryTable()` to `SessionCatalog.isTempView()`.
### Why are the changes needed?
To improve code maintenance. Currently, there are two methods that do the same but have different names:
```scala
def isTempView(nameParts: Seq[String]): Boolean
```
and
```scala
def isTemporaryTable(name: TableIdentifier): Boolean
```
### Does this PR introduce _any_ user-facing change?
Should not since `SessionCatalog` is not public API.
### How was this patch tested?
By running the existing tests:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *SessionCatalogSuite"
```
Closes#31295 from MaxGekk/replace-isTemporaryTable-by-isTempView.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This is a long-standing bug that exists since we have the ambiguous self-join check. A column reference is not ambiguous if it can only come from one join side (e.g. the other side has a project to only pick a few columns). An example is
```
Join(b#1 = 3)
TableScan(t, [a#0, b#1])
Project(a#2)
TableScan(t, [a#2, b#3])
```
It's a self-join, but `b#1` is not ambiguous because it can't come from the right side, which only has column `a`.
### Why are the changes needed?
to not fail valid self-join queries.
### Does this PR introduce _any_ user-facing change?
yea as a bug fix
### How was this patch tested?
a new test
Closes#31287 from cloud-fan/self-join.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR is the same as https://github.com/apache/spark/pull/30998, but with a better UT.
In AdaptiveSparkPlanExec.getFinalPhysicalPlan, when newStages are generated, sort the new stages by class type to make sure BroadcastQueryState precede others.
This partial fix only grantee the start of materialization for BroadcastQueryStage is prior to others, but because the submission of collect job for broadcasting is run in another thread, the issue is not completely solved.
### Why are the changes needed?
When enable AQE, in getFinalPhysicalPlan, spark traversal the physical plan bottom up and create query stage for materialized part by createQueryStages and materialize those new created query stages to submit map stages or broadcasting. When ShuffleQueryStage are materializing before BroadcastQueryStage, the map stage(job) and broadcast job are submitted almost at the same time, but map stage will hold all the computing resources. If the map stage runs slow (when lots of data needs to process and the resource is limited), the broadcast job cannot be started(and finished) before spark.sql.broadcastTimeout, thus cause whole job failed (introduced in SPARK-31475).
The workaround to increase spark.sql.broadcastTimeout doesn't make sense and graceful, because the data to broadcast is very small.
The order of calling materialize can guarantee that the order of task to be scheduled in normal circumstances, but, the guarantee is not strict since the submit of broadcast job and shuffle map job are in different thread.
1. for broadcast job, call doPrepare() in main thread, and then start the real materialization in "broadcast-exchange-0" thread pool: calling getByteArrayRdd().collect() to submit collect job
2. for shuffle map job, call ShuffleExchangeExec.mapOutputStatisticsFuture() which call sparkContext.submitMapStage() directly in main thread to submit map stage
1 is trigger before 2, so in normal cases, the broadcast job will be submit first.
However, we can not control how fast the two thread runs, so the "broadcast-exchange-0" thread could run a little bit slower than main thread, result in map stage submit first. So there's still risk for the shuffle map job schedule earlier before broadcast job.
Since completely fix the issue is complex and might introduce major changes, we need more time to follow up. This partial fix is better than do nothing, it resolved most cases in SPARK-33933.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
Add UT
Closes#31269 from zhongyu09/aqe-broadcast-partial-fix.
Authored-by: Yu Zhong <zhongyu8@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR fixes the issue that reading tables which contain spatial datatypes from MS SQL Server fails.
MS SQL server supports two non-standard spatial JDBC types, `geometry` and `geography` but Spark SQL can't treat them
```
java.sql.SQLException: Unrecognized SQL type -157
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getCatalystType(JdbcUtils.scala:251)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$getSchema$1(JdbcUtils.scala:321)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:321)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:63)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:226)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:364)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:366)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:355)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:355)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:381)
```
Considering the [data type mapping](https://docs.microsoft.com/ja-jp/sql/connect/jdbc/using-basic-data-types?view=sql-server-ver15) says, I think those spatial types can be mapped to Catalyst's `BinaryType`.
### Why are the changes needed?
To provide better support.
### Does this PR introduce _any_ user-facing change?
Yes. MS SQL Server users can use `geometry` and `geography` types in datasource tables.
### How was this patch tested?
New test case added to `MsSqlServerIntegrationSuite`.
Closes#31283 from sarutak/mssql-spatial-types.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR fixes the regression bug brought by SPARK-33888 (#30902).
After that PR merged, `PostgresDIalect#getCatalystType` throws Exception for array types.
```
[info] - Type mapping for various types *** FAILED *** (551 milliseconds)
[info] java.util.NoSuchElementException: key not found: scale
[info] at scala.collection.immutable.Map$EmptyMap$.apply(Map.scala:106)
[info] at scala.collection.immutable.Map$EmptyMap$.apply(Map.scala:104)
[info] at org.apache.spark.sql.types.Metadata.get(Metadata.scala:111)
[info] at org.apache.spark.sql.types.Metadata.getLong(Metadata.scala:51)
[info] at org.apache.spark.sql.jdbc.PostgresDialect$.getCatalystType(PostgresDialect.scala:43)
[info] at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:321)
```
### Why are the changes needed?
To fix the regression bug.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
I confirmed the test case `SPARK-22291: Conversion error when transforming array types of uuid, inet and cidr to StingType in PostgreSQL` in `PostgresIntegrationSuite` passed.
I also confirmed whether all the `v2.*IntegrationSuite` pass because this PR changed them and they passed.
Closes#31262 from sarutak/fix-postgres-dialect-regression.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR changes cache refreshing of v1 tables in v1 commands. In particular, v1 table dependents are not removed from the cache after this PR. Comparing to current implementation, we just clear cached data of all dependents and keep them in the cache. So, the next actions will fill in the cached data of the original v1 table and its dependents. In more details:
1. Modified the `CatalogImpl.refreshTable()` method to use `recacheByPlan()` instead of `lookupCachedData()`, `uncacheQuery()` and `cacheQuery()`. Users can call this method via public API like `spark.catalog.refreshTable()`.
2. Rewritten the part in `CatalogImpl.refreshTable()` which was responsible for table meta-data refreshing because this code stopped to work properly after removing of the second `sparkSession.table(tableIdent)`.
3. Added new private method `invalidateCachedTable()` to `SessionCatalog`. Comparing to the existing `SessionCatalog.refreshTable`, it invalidates the relation cache only. If we called `SessionCatalog.refreshTable` from `CatalogImpl.refreshTable()`, we would refresh temporary and global temporary views twice (that could lead to refreshing file index twice).
### Why are the changes needed?
1. This should improve user experience with table/view caching. For example, let's imagine that an user has cached v1 table and cached view based on the table. And the user passed the table to external library which drops/renames/adds partitions in the v1 table. Unfortunately, the user gets the view uncached after that even he/she hasn't uncached the view explicitly.
2. To improve code maintenance.
3. To reduce the amount of calls to Hive external catalog.
4. Also this should speed up table recaching.
5. To have the same behavior as for v2 tables supported by https://github.com/apache/spark/pull/31172
### Does this PR introduce _any_ user-facing change?
From the view of the correctness of query results, there are no behavior changes but the changes might influence on consuming memory and query execution time. For example:
Before:
```scala
scala> sql("CREATE TABLE tbl (c int)")
scala> sql("CACHE TABLE tbl")
scala> sql("CREATE VIEW v AS SELECT * FROM tbl")
scala> sql("CACHE TABLE v")
scala> spark.catalog.isCached("v")
res6: Boolean = true
scala> spark.catalog.refreshTable("tbl")
scala> spark.catalog.isCached("v")
res8: Boolean = false
```
After:
```scala
scala> spark.catalog.refreshTable("tbl")
scala> spark.catalog.isCached("v")
res8: Boolean = true
```
### How was this patch tested?
1. Added new unit tests that create a view, a temporary view and a global temporary view on top of v1/v2 tables, and refresh the base table via `ALTER TABLE .. ADD/DROP/RENAME PARTITION`.
2. By running the unified test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableAddPartitionSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite"
# build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRenamePartitionSuite"
```
Closes#31206 from MaxGekk/refreshTable-recache-by-plan.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add `drop table` in charvarchar sql test.
### Why are the changes needed?
1. `drop table` is also a test case, for better coverage.
2. It's more clear to drop table which created in current test.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Add test.
Closes#31277 from ulysses-you/SPARK-33901-FOLLOWUP.
Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add tests to check that v2 table dependents are re-cached after table altering via the commands:
- `ALTER TABLE .. ADD PARTITION`
- `ALTER TABLE .. DROP PARTITION`
- `ALTER TABLE .. RENAME PARTITION`
### Why are the changes needed?
To improve test coverage and prevent regressions.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running the modified test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableAddPartitionSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableDropPartitionSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableRenamePartitionSuite"
```
Closes#31250 from MaxGekk/check-v2-dependents-recached.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
The current implement of `UnboundedOffsetWindowFunctionFrame` and `UnboundedPrecedingOffsetWindowFunctionFrame` only support `nth_value` that respect nulls. So nth_value will execute with `UnboundedWindowFunctionFrame` and `UnboundedPrecedingWindowFunctionFrame`.
`UnboundedWindowFunctionFrame` and `UnboundedPrecedingWindowFunctionFrame` will call `updateExpressions` of `nth_value` multiple times.
### Why are the changes needed?
Improve performance for nth_value ignore nulls over offset window
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Jenkins test
Closes#31178 from beliefer/SPARK-34096.
Authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
For varchar(N), we currently trim all spaces first to check whether the remained length exceeds, it not necessary to visit them all but at most to those after N.
### Why are the changes needed?
improve varchar performance for write side
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
benchmark and existing ut
Closes#31253 from yaooqinn/SPARK-34164.
Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Call `copyTagsFrom` for the new node created by `MultiInstanceRelation.newInstance()`.
### Why are the changes needed?
```scala
val df = spark.range(2)
df.join(df, df("id") <=> df("id")).show()
```
For this query, it's supposed to be non-ambiguous join by the rule `DetectAmbiguousSelfJoin` because of the same attribute reference in the condition:
537a49fc09/sql/core/src/main/scala/org/apache/spark/sql/execution/analysis/DetectAmbiguousSelfJoin.scala (L125)
However, `DetectAmbiguousSelfJoin` can not apply this prediction due to the right side plan doesn't contain the dataset_id TreeNodeTag, which is missing after `MultiInstanceRelation.newInstance`. That's why we should preserve the tags info for the copied node.
Fortunately, the query is still considered as non-ambiguous join because `DetectAmbiguousSelfJoin` only checks the left side plan and the reference is the same as the left side plan. However, this's not the expected behavior but only a coincidence.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Updated a unit test
Closes#31260 from Ngone51/fix-missing-tags.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This passes original SQL text to `CacheTableAsSelect` command in DSv1 and v2 so that it will be stored instead of the analyzed logical plan, similar to `CREATE VIEW` command.
In addition, this changes the behavior of dropping temporary view to also invalidate dependent caches in a cascade, when the config `SQLConf.STORE_ANALYZED_PLAN_FOR_VIEW` is false (which is the default value).
### Why are the changes needed?
Currently, after creating a temporary view with `CACHE TABLE ... AS SELECT` command, the view can still be queried even after the source table is dropped or replaced (in v2). This can cause correctness issue.
For instance, in the following:
```sql
> CREATE TABLE t ...;
> CACHE TABLE v AS SELECT * FROM t;
> DROP TABLE t;
> SELECT * FROM v;
```
The last select query still returns the old (and stale) result instead of fail. Note that the cache is already invalidated as part of dropping table `t`, but the temporary view `v` still exist.
On the other hand, the following:
```sql
> CREATE TABLE t ...;
> CREATE TEMPORARY VIEW v AS SELECT * FROM t;
> CACHE TABLE v;
> DROP TABLE t;
> SELECT * FROM v;
```
will throw "Table or view not found" error in the last select query.
This is related to #30567 which aligns the behavior of temporary view and global view by storing the original SQL text for temporary view, as opposed to the analyzed logical plan. However, the PR only handles `CreateView` case but not the `CacheTableAsSelect` case.
This also changes uncache logic and use cascade invalidation for temporary views created above. This is to align its behavior to how a permanent view is handled as of today, and also to avoid potential issues where a dependent view becomes invalid while its data is still kept in cache.
### Does this PR introduce _any_ user-facing change?
Yes, now when `SQLConf.STORE_ANALYZED_PLAN_FOR_VIEW` is set to false (the default value), whenever a table/permanent view/temp view that a cached view depends on is dropped, the cached view itself will become invalid during analysis, i.e., user will get "Table or view not found" error. In addition, when the dependent is a temp view in the previous case, the cache itself will also be invalidated.
### How was this patch tested?
Modified/Enhanced some existing tests.
Closes#31107 from sunchao/SPARK-34052.
Lead-authored-by: Chao Sun <sunchao@apple.com>
Co-authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
1. Port DS V2 tests from `AlterTablePartitionV2SQLSuite ` to the test suite `v2.AlterTableRecoverPartitionsSuite`.
2. Port DS v1 tests from `DDLSuite` to `v1.AlterTableRecoverPartitionsSuiteBase`.
### Why are the changes needed?
To improve test coverage.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running new test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRecoverPartitionsParserSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRecoverPartitionsSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *CatalogedDDLSuite"
```
Closes#31105 from MaxGekk/unify-recover-partitions-tests.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Clear table cache after adding partitions to v2 table in `AlterTableAddPartitionExec`.
### Why are the changes needed?
This PR fixes correctness issue. Without the fix, queries on cached tables modified via `ALTER TABLE .. ADD PARTITION` return incorrect results.
### Does this PR introduce _any_ user-facing change?
Yes.
### How was this patch tested?
Added new UT to `org.apache.spark.sql.execution.command.v2.AlterTableAddPartitionSuite`:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableAddPartitionSuite"
```
Closes#31229 from MaxGekk/v2-add-partition-recache.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Change `Column.named` code to let expression check if exists `UnresolvedExtractValue` and use `UnresolvedAlias` to assign name.
### Why are the changes needed?
It's more reasonable to treat user specify expression as unresolved expression then we should assign name after analyze.
Let's say we have this code
```
spark.range(1).selectExpr("id as id1", "id as id2")
.selectExpr("cast(struct(id1, id2).id1 as int)")
```
before this PR, the field name is `CAST(struct(id1, id2)[id1] AS INT)`.
After, the field name is `CAST(struct(id1, id2).id1 AS INT)`.
### Does this PR introduce _any_ user-facing change?
Yes, the default field name may be changed.
### How was this patch tested?
Add test.
Closes#30974 from ulysses-you/SPARK-33939-0.
Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This could reduce the `generate.java` size to prevent codegen fallback which causes performance regression.
here is a case from tpcds that could be fixed by this improvement
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133964/testReport/org.apache.spark.sql.execution/LogicalPlanTagInSparkPlanSuite/q41/
The original case generate 20K bytes, we are trying to reduce it to less than 8k
### Why are the changes needed?
performance improvement as in the PR benchmark test, the performance w/ codegen is 2~3x better than w/o codegen.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
yes, it's a code reflect so the existing ut should be enough
cross-check with https://github.com/apache/spark/pull/31012 where the tpcds shall all pass
benchmark compared with master
```logtalk
================================================================================================
Char Varchar Read Side Perf
================================================================================================
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
Read with length 20, hasSpaces: false: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 20 1571 1667 83 63.6 15.7 1.0X
read char with length 20 1710 1764 58 58.5 17.1 0.9X
read varchar with length 20 1774 1792 16 56.4 17.7 0.9X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
Read with length 40, hasSpaces: false: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 40 1824 1927 91 54.8 18.2 1.0X
read char with length 40 1788 1928 137 55.9 17.9 1.0X
read varchar with length 40 1676 1700 40 59.7 16.8 1.1X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
Read with length 60, hasSpaces: false: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 60 1727 1762 30 57.9 17.3 1.0X
read char with length 60 1628 1674 43 61.4 16.3 1.1X
read varchar with length 60 1651 1665 13 60.6 16.5 1.0X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
Read with length 80, hasSpaces: true: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 80 1748 1778 28 57.2 17.5 1.0X
read char with length 80 1673 1678 9 59.8 16.7 1.0X
read varchar with length 80 1667 1684 27 60.0 16.7 1.0X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
Read with length 100, hasSpaces: true: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 100 1709 1743 48 58.5 17.1 1.0X
read char with length 100 1610 1664 67 62.1 16.1 1.1X
read varchar with length 100 1614 1673 53 61.9 16.1 1.1X
================================================================================================
Char Varchar Write Side Perf
================================================================================================
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
Write with length 20, hasSpaces: false: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
write string with length 20 2277 2327 67 4.4 227.7 1.0X
write char with length 20 2421 2443 19 4.1 242.1 0.9X
write varchar with length 20 2393 2419 27 4.2 239.3 1.0X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
Write with length 40, hasSpaces: false: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
write string with length 40 2249 2290 38 4.4 224.9 1.0X
write char with length 40 2386 2444 57 4.2 238.6 0.9X
write varchar with length 40 2397 2405 12 4.2 239.7 0.9X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
Write with length 60, hasSpaces: false: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
write string with length 60 2326 2367 41 4.3 232.6 1.0X
write char with length 60 2478 2501 37 4.0 247.8 0.9X
write varchar with length 60 2475 2503 24 4.0 247.5 0.9X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
Write with length 80, hasSpaces: true: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
write string with length 80 9367 9773 354 1.1 936.7 1.0X
write char with length 80 10454 10621 238 1.0 1045.4 0.9X
write varchar with length 80 18943 19503 571 0.5 1894.3 0.5X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
Write with length 100, hasSpaces: true: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
write string with length 100 11055 11104 59 0.9 1105.5 1.0X
write char with length 100 12204 12275 63 0.8 1220.4 0.9X
write varchar with length 100 21737 22275 574 0.5 2173.7 0.5X
```
Closes#31199 from yaooqinn/SPARK-34130.
Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This pr keep necessary stats after partition pruning.
### Why are the changes needed?
Improve query performance. It will push down aggregate since SPARK-34081 because it can be planed as BroadcastHashJoin. But it lacks column statistics after [`PruneFileSourcePartitions`](d0c83f372b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitions.scala (L102-L103)). Therefore, it will eventually be planned as SortMergeJoin.
Please see the log:
```
join.right.stats: org.apache.spark.sql.catalyst.optimizer.PushDownPredicates: Statistics(sizeInBytes=348.8 KiB, rowCount=1.79E+4)
join.right.stats: org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions: Statistics(sizeInBytes=1414.2 EiB)
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test and benchmark test
SQL | Before this PR(Seconds) | After this PR(Seconds)
-- | -- | --
q14a | 594 | 384
q14b | 600 | 402
This change will not affect the results of `PlanStabilitySuite`, because it does not have partition column.
Closes#31205 from wangyum/SPARK-34119.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
While adding new partition to v2 `InMemoryAtomicPartitionTable`/`InMemoryPartitionTable`, add single row to the table content when the table is fully partitioned.
### Why are the changes needed?
The `ALTER TABLE .. ADD PARTITION` command does not change content of fully partitioned v2 table. For instance, `INSERT INTO` changes table content:
```scala
sql(s"CREATE TABLE t (p0 INT, p1 STRING) USING _ PARTITIONED BY (p0, p1)")
sql(s"INSERT INTO t SELECT 1, 'def'")
sql(s"SELECT * FROM t").show(false)
+---+---+
|p0 |p1 |
+---+---+
|1 |def|
+---+---+
```
but `ALTER TABLE .. ADD PARTITION` doesn't change v2 table content:
```scala
sql(s"ALTER TABLE t ADD PARTITION (p0 = 0, p1 = 'abc')")
sql(s"SELECT * FROM t").show(false)
+---+---+
|p0 |p1 |
+---+---+
+---+---+
```
### Does this PR introduce _any_ user-facing change?
No, the changes impact only on tests but for the example above in tests:
```scala
sql(s"ALTER TABLE t ADD PARTITION (p0 = 0, p1 = 'abc')")
sql(s"SELECT * FROM t").show(false)
+---+---+
|p0 |p1 |
+---+---+
|0 |abc|
+---+---+
```
### How was this patch tested?
By running the unified tests for `ALTER TABLE .. ADD PARTITION`.
Closes#31216 from MaxGekk/add-partition-by-all-columns.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Change null Literal to PrettyAttribute during ResolveAlias.
### Why are the changes needed?
We will convert `Literal(null)` to target data type during analysis. Then the generated alias name will include something like `CAST(NULL AS String)` instead of `NULL`.
```
spark.sql("SELECT RAND(null)").columns
-- before
rand(CAST(NULL AS INT))
-- after
rand(NULL)
```
### Does this PR introduce _any_ user-facing change?
Yes, the default column name maybe changed.
### How was this patch tested?
Add test and pass exists test.
Closes#31233 from ulysses-you/SPARK-34150.
Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Fix table size parsing from the `Statistics` field which is formed at: c3d81fbe79/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala (L573) . Before the fix, `getTableSize()` returns only the last digit. This works for Hive table in the tests because its size < 10 bytes, and accidentally works for V1 In-Memory catalog table in the tests.
### Why are the changes needed?
This makes tests more reliable. For example, the parsing can not work in `AlterTableDropPartitionSuite` when table size before partition dropping:
```
+---------+
|data_type|
+---------+
|878 bytes|
+---------+
```
After:
```
+---------+
|data_type|
+---------+
|439 bytes|
+---------+
```
at:
```scala
val onePartSize = getTableSize(t)
assert(0 < onePartSize && onePartSize < twoPartSize)
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By existing test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableAddPartitionSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableDropPartitionSuite"
```
Closes#31237 from MaxGekk/optimize-updateTableStats-followup.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This patch moves move general StateStore-related tests into `StateStoreSuiteBase`.
### Why are the changes needed?
There are some general StateStore tests in `StateStoreSuite` which is `HDFSBackedStateStoreProvider`-specific test suite. We should move general tests into `StateStoreSuiteBase`, so it is easier to incorporate other StateStores.
### Does this PR introduce _any_ user-facing change?
No, dev only.
### How was this patch tested?
Unit test.
Closes#31219 from viirya/SPARK-34148.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Invoke `refreshTable()` from `CatalogImpl` which refreshes the cache in v1 `ALTER TABLE .. RECOVER PARTITIONS`.
### Why are the changes needed?
This fixes the issues portrayed by the example:
```sql
spark-sql> create table tbl (col int, part int) using parquet partitioned by (part);
spark-sql> insert into tbl partition (part=0) select 0;
spark-sql> cache table tbl;
spark-sql> select * from tbl;
0 0
spark-sql> show table extended like 'tbl' partition(part=0);
default tbl false Partition Values: [part=0]
Location: file:/Users/maximgekk/proj/recover-partitions-refresh-cache/spark-warehouse/tbl/part=0
...
```
Create new partition by copying the existing one:
```
$ cp -r /Users/maximgekk/proj/recover-partitions-refresh-cache/spark-warehouse/tbl/part=0 /Users/maximgekk/proj/recover-partitions-refresh-cache/spark-warehouse/tbl/part=1
```
```sql
spark-sql> alter table tbl recover partitions;
spark-sql> select * from tbl;
0 0
```
The last query must return `0 1` since it has been recovered by `ALTER TABLE .. RECOVER PARTITIONS`.
### Does this PR introduce _any_ user-facing change?
Yes. After the changes for the example above:
```sql
...
spark-sql> alter table tbl recover partitions;
spark-sql> select * from tbl;
0 0
0 1
```
### How was this patch tested?
By running the affected test suite:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *CachedTableSuite"
```
Closes#31066 from MaxGekk/recover-partitions-refresh-cache.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Currently, there are many DDL commands where the position of the unresolved identifiers are incorrect:
```
scala> sql("CACHE TABLE unknown")
org.apache.spark.sql.AnalysisException: Table or view not found: unknown; line 1 pos 0;
```
, whereas the `pos` should be `12`.
This PR proposes to fix this issue for commands using `UnresolvedRelation`:
```
CACHE TABLE unknown
UNCACHE TABLE unknown
DELETE FROM unknown
UPDATE unknown SET name='abc'
MERGE INTO unknown1 AS target USING unknown2 AS source ON target.col = source.col WHEN MATCHED THEN DELETE
INSERT INTO TABLE unknown SELECT 1
INSERT OVERWRITE TABLE unknown VALUES (1, 'a')
```
### Why are the changes needed?
To fix a bug.
### Does this PR introduce _any_ user-facing change?
Yes, now the above example will print the following:
```
org.apache.spark.sql.AnalysisException: Table or view not found: unknown; line 1 pos 12;
```
### How was this patch tested?
Add a new test.
Closes#31209 from imback82/unresolved_relation_message.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Hive 2.3.8 changes:
HIVE-19662: Upgrade Avro to 1.8.2
HIVE-24324: Remove deprecated API usage from Avro
HIVE-23980: Shade Guava from hive-exec in Hive 2.3
HIVE-24436: Fix Avro NULL_DEFAULT_VALUE compatibility issue
HIVE-24512: Exclude calcite in packaging.
HIVE-22708: Fix for HttpTransport to replace String.equals
HIVE-24551: Hive should include transitive dependencies from calcite after shading it
HIVE-24553: Exclude calcite from test-jar dependency of hive-exec
### Why are the changes needed?
Upgrade Avro and Parquet to latest version.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing test add test try to upgrade Parquet to 1.11.1 and Avro to 1.10.1: https://github.com/apache/spark/pull/30517Closes#30657 from wangyum/SPARK-33696.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
The type-coercion for numeric types of average and sum is not necessary at all, as the resultType and sumType can prevent the overflow.
### Why are the changes needed?
rm unnecessary logic which may cause potential performance regressions
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
tpcds tests for plan
Closes#31079 from yaooqinn/SPARK-34037.
Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
Some local variables are declared as `var`, but they are never reassigned and should be declared as `val`, so this pr turn these from `var` to `val` except for `mockito` related cases.
### Why are the changes needed?
Use `val` instead of `var` when possible.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass the Jenkins or GitHub Action
Closes#31142 from LuciferYang/SPARK-33346.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Add support for `orc.force.positional.evolution` config that forces ORC top level column matching by position rather than by name.
This does work in Hive:
```
> set orc.force.positional.evolution;
+--------------------------------------+
| set |
+--------------------------------------+
| orc.force.positional.evolution=true |
+--------------------------------------+
> create table t (c1 string, c2 string) stored as orc;
> insert into t values ('foo', 'bar');
> alter table t change c1 c3 string;
```
The orc file in this case contains the original `c1` and `c2` columns that doesn't match the metadata in HMS. But due to the positional evolution setting, Hive is capable to return all the data:
```
> select * from t;
+--------+--------+
| t.c3 | t.c2 |
+--------+--------+
| foo | bar |
+--------+--------+
```
Without this PR Spark returns `null`s for the renamed `c3` column.
After this PR Spark returns the data in `c3` column.
### Why are the changes needed?
Hive/ORC does support it.
### Does this PR introduce _any_ user-facing change?
Yes, we will support `orc.force.positional.evolution`.
### How was this patch tested?
New UT.
Closes#29737 from peter-toth/SPARK-32864-support-orc-forced-positional-evolution.
Lead-authored-by: Peter Toth <peter.toth@gmail.com>
Co-authored-by: Peter Toth <ptoth@cloudera.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
On the read-side, we should respect the original data instead of trimming it first.
It brings extra overhead on the code-gen code side, trimming and padding for the same field, and it's also unnecessary and a bug
### Why are the changes needed?
bugfix and perf regression
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
new tests
Closes#31181 from yaooqinn/SPARK-34114.
Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR changes cache refreshing of v2 tables in v2 commands. In particular, v2 table dependents are not removed from the cache after this PR. Comparing to current implementation, we just clear cached data of all dependents and keep them in the cache. So, the next actions will fill in the cached data of the original v2 table and its dependents. In more details:
1. Add new method `recacheTable()` to `DataSourceV2Strategy` and pass it the exec node where need to recache table. New method uses `recacheByPlan` to refresh data cache of v2 tables, and keeps table dependents still cached **while clearing their caches**.
2. Simplify `invalidateCache` (and rename it `invalidateTableCache`) by retargeting it for only table cache invalidation.
3. Modify a test for `REFRESH TABLE` and check that v2 table dependent is still cached after refreshing the base table.
### Why are the changes needed?
1. This should improve user experience with table/view caching. For example, let's imagine that an user has cached v2 table and cached view based on the table. And the user passed the table to external library which drops/renames/adds partitions in the v2 table. Unfortunately, the user gets the view uncached after that even he/she hasn't uncached the view explicitly.
2. Improve code maintenance.
3. Reduce the number of calls to the Cache Manager when need to recache a table. Before the changes, `invalidateCache()` invokes the Cache Manager 3 times: `lookupCachedData()`, `uncacheQuery()` and `cacheQuery()`.
4. Also this should speed up table recaching.
### Does this PR introduce _any_ user-facing change?
From the view of the correctness of query results, there are no behavior changes but the changes might influence on consuming memory and query execution time.
### How was this patch tested?
By running the existing test suites for v2 the add/drop/rename partition commands.
Closes#31172 from MaxGekk/dsv2-recache-table.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This pr use `exists` or `forall` to simplify `filter + emptiness check`, it's semantically consistent, but looks simpler. The rule as follow:
- `seq.filter(p).size == 0)` -> `!seq.exists(p)`
- `seq.filter(p).length > 0` -> `seq.exists(p)`
- `seq.filterNot(p).isEmpty` -> `seq.forall(p)`
- `seq.filterNot(p).nonEmpty` -> `!seq.forall(p)`
### Why are the changes needed?
Code Simpilefications.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass the Jenkins or GitHub Action
Closes#31184 from LuciferYang/SPARK-34118.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This patch proposes to pull the test of `numKeys` metric into a separate test in `StateStoreSuite`.
### Why are the changes needed?
Right now in `StateStoreSuite`, the tests of get/put/remove/commit are mixed with `numKeys` metric test. I found it is flaky when I was testing with other `StateStore` implementation.
Current test logic is tightly bound to the in-memory map behavior of `HDFSBackedStateStore`. For example, put can immediately show up in the `numKeys` metric.
But for a `StateStore` implementation relying on external storage, e.g. RocksDB, the metric might be updated once the data is actually committed. And `StateStoreSuite` should be a common test suite for all kinds of StateStore implementations.
Specifically, we also are able to check these metrics after state store is updated (committed). So I think we can refactor the test a little bit to make it easier to incorporate other `StateStore` externally.
### Does this PR introduce _any_ user-facing change?
No, dev only.
### How was this patch tested?
Unit test.
Closes#31183 from viirya/SPARK-34116.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR aims to strip auto-generated cast. The main logic is:
1. Add tag if Cast is specified by user.
2. Wrap `PrettyAttribute` in usePrettyExpression.
### Why are the changes needed?
Make sql consistent with dsl. Here is an inconsistent example before this PR:
```
-- output field name: FLOOR(1)
spark.emptyDataFrame.select(floor(lit(1)))
-- output field name: FLOOR(CAST(1 AS DOUBLE))
spark.sql("select floor(1)")
```
Note that, we don't remove the `Cast` so the auto-generated `Cast` can still work. The only changed place is `usePrettyExpression`, we use `PrettyAttribute` replace `Cast` to give a better sql string.
### Does this PR introduce _any_ user-facing change?
Yes, the default field name may change.
### How was this patch tested?
Add test and pass exists test.
Closes#31034 from ulysses-you/SPARK-33989.
Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
PartitionPruning push down pruningHasBenefit function into insertPredicate function to decrease calculate time
### Why are the changes needed?
to accelerate PartitionPruning prune calculate
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
existed unit test
Closes#31122 from monkeyboy123/optimize-dynamic-pruning.
Authored-by: Dereck Li <monkeyboy.ljh@gmail.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
### What changes were proposed in this pull request?
Fix a regression from https://github.com/apache/spark/pull/29959.
In Spark, the following file paths are considered as hidden paths and they are ignored on file reads:
1. starts with "_" and doesn't contain "="
2. starts with "."
However, after the refactoring PR https://github.com/apache/spark/pull/29959, the hidden paths are not filtered out on partition inference: https://github.com/apache/spark/pull/29959/files#r556432426
This PR is to fix the bug. To archive the goal, the method `InMemoryFileIndex.shouldFilterOut` is refactored as `HadoopFSUtils.shouldFilterOutPathName`
### Why are the changes needed?
Bugfix
### Does this PR introduce _any_ user-facing change?
Yes, it fixes a bug for reading file paths with partitions.
### How was this patch tested?
Unit test
Closes#31169 from gengliangwang/fileListingBug.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This is a followup PR for SPARK-33690 (#30647) .
In addition to the original PR, this PR intends to escape the following meta-characters in `Dataset#showString`.
* `\r` (carrige ret)
* `\f` (form feed)
* `\b` (backspace)
* `\u000B` (vertical tab)
* `\u0007` (bell)
### Why are the changes needed?
To avoid breaking the layout of `Dataset#showString`.
`\u0007` does not break the layout of `Dataset#showString` but it's noisy (beeps for each row) so it should be also escaped.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Modified the existing tests.
I also build the documents and check the generated html for `sql-migration-guide.md`.
Closes#31144 from sarutak/escape-metacharacters-in-getRows.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
There are some redundant collection conversion can be removed, for version compatibility, clean up these with Scala-2.13 profile.
### Why are the changes needed?
Remove redundant collection conversion
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Pass the Jenkins or GitHub Action
- Manual test `core`, `graphx`, `mllib`, `mllib-local`, `sql`, `yarn`,`kafka-0-10` in Scala 2.13 passed
Closes#31125 from LuciferYang/SPARK-34068.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This pr use `exists` to simplify `find + emptiness check`, it's semantically consistent, but looks simpler.
**Before**
```
seq.find(p).isDefined
or
seq.find(p).isEmpty
```
**After**
```
seq.exists(p)
or
!seq.exists(p)
```
### Why are the changes needed?
Code Simpilefications.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass the Jenkins or GitHub Action
Closes#31130 from LuciferYang/SPARK-34070.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This changes `CatalogImpl.dropTempView` and `CatalogImpl.dropGlobalTempView` use analyzed logical plan instead of `viewDef` which is unresolved.
### Why are the changes needed?
Currently, `CatalogImpl.dropTempView` is implemented as following:
```scala
override def dropTempView(viewName: String): Boolean = {
sparkSession.sessionState.catalog.getTempView(viewName).exists { viewDef =>
sparkSession.sharedState.cacheManager.uncacheQuery(
sparkSession, viewDef, cascade = false)
sessionCatalog.dropTempView(viewName)
}
}
```
Here, the logical plan `viewDef` is not resolved, and when passing to `uncacheQuery`, it could fail at `sameResult` call, where canonicalized plan is compared. The error message looks like:
```
Invalid call to qualifier on unresolved object, tree: 'key
```
This can be reproduced via:
```scala
sql(s"CREATE TEMPORARY VIEW $v AS SELECT key FROM src LIMIT 10")
sql(s"CREATE TABLE $t AS SELECT * FROM src")
sql(s"CACHE TABLE $t")
dropTempTable(v)
```
### Does this PR introduce _any_ user-facing change?
The only user-facing change is that, previously `SQLContext.dropTempTable` may fail in the above scenario but will work with this fix.
### How was this patch tested?
Added new unit tests.
Closes#31136 from sunchao/SPARK-34076.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>