### What changes were proposed in this pull request?
Passing around the output attributes has more benefits, such as keeping the exprIds unchanged, which avoids bugs when we apply more operators on top of the command output DataFrame.
This PR does two things:
1. After this PR, a `SHOW TBLPROPERTIES` command's output shows both `key` and `value` columns whether or not you specify a table property `key`. Before this PR, the output only showed a `value` column when you specified a table property `key`.
2. Keeps the `SHOW TBLPROPERTIES` command's output attribute exprIds unchanged.
### Why are the changes needed?
1. Keep `SHOW TBLPROPERTIES`'s output schema consistent.
2. Keep `SHOW TBLPROPERTIES` command's output attribute exprId unchanged.
### Does this PR introduce _any_ user-facing change?
After this PR, a `SHOW TBLPROPERTIES` command's output shows both `key` and `value` columns whether or not you specify a table property `key`. Before this PR, the output only showed a `value` column when you specified a table property `key`.
Before this PR:
```
sql > SHOW TBLPROPERTIES table_name('key')
value
value_of_key
```
After this PR
```
sql > SHOW TBLPROPERTIES table_name('key')
key value
key value_of_key
```
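As a hedged illustration (hypothetical table and property names, not from this PR's tests), the now-stable schema and unchanged exprIds let operators stacked on top of the command's DataFrame keep resolving:
```scala
// Spark shell style sketch; `spark` is the active SparkSession.
import org.apache.spark.sql.functions.col

val props = spark.sql("SHOW TBLPROPERTIES table_name")
// `key` and `value` are always present now, so filtering on `key` works whether or not
// a specific property key was requested in the command itself.
props.filter(col("key") === "owner").select("value").show()
```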
### How was this patch tested?
Added UT
Closes#31378 from AngersZhuuuu/SPARK-34240.
Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
With https://github.com/apache/spark/pull/31133, Avro schema evolution was introduced for partitioned Hive tables where the schema is given by `avro.schema.literal`.
Here that functionality is extended to support schema evolution where the schema is defined via `avro.schema.url`.
### Why are the changes needed?
Without this PR the problem described in https://github.com/apache/spark/pull/31133 can be reproduced with tables where `avro.schema.url` is used, as in this case the property value given at the partition level is always used for `avro.schema.url`.
So, for example, when a new column (with a default value) is added to the table, one of the following problems happens:
- when the new field is added after the last one, the cell values will be nulls instead of the default value
- when the schema is extended somewhere before the last field, values will be listed in the wrong column positions
A similar error happens when one of the fields is removed from the schema.
For details please check the attached unit tests where both cases are checked.
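A minimal sketch (assumed table name and schema paths, not taken from the attached tests) of the scenario: the table's reader schema is referenced via `avro.schema.url`, and the file behind the URL is later replaced by an evolved schema:
```scala
// Partitioned Avro-backed Hive table whose reader schema comes from an external .avsc file.
spark.sql("""
  CREATE TABLE t (col1 STRING)
  PARTITIONED BY (ds STRING)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS
    INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  TBLPROPERTIES ('avro.schema.url' = '/schemas/t_v1.avsc')
""")
// ... partitions are written with the v1 schema ...
// Point the table at an evolved schema, e.g. one with a new field that has a default value.
spark.sql("ALTER TABLE t SET TBLPROPERTIES ('avro.schema.url' = '/schemas/t_v2.avsc')")
// With this PR, reading the old partitions follows the table-level avro.schema.url
// instead of the stale partition-level value.
spark.sql("SELECT * FROM t").show()
```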
### Does this PR introduce _any_ user-facing change?
Fixes the potential value error.
### How was this patch tested?
The existing unit tests for schema evolution are generalized and reused.
New tests:
- `SPARK-34370: support Avro schema evolution (add column with avro.schema.url)`
- `SPARK-34370: support Avro schema evolution (remove column with avro.schema.url)`
Closes#31501 from attilapiros/SPARK-34370.
Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Before this PR, for a partitioned Avro Hive table, when the SerDe is configured to read the partition data,
the table-level properties were overwritten by the partition-level properties.
This PR changes this ordering by giving table-level properties higher precedence, so when a new evolved schema
is set for the table, this new schema is used to read the partition data instead of the original schema which was used for writing the data.
This new behavior is consistent with Apache Hive.
See the example used in the unit test `SPARK-26836: support Avro schema evolution`, in Hive this results in:
```
0: jdbc:hive2://<IP>:10000> select * from t;
INFO : Compiling command(queryId=hive_20210111141102_7a6349d0-f9ed-4aad-ac07-b94b44de2394): select * from t
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:t.col1, type:string, comment:null), FieldSchema(name:t.col2, type:string, comment:null), FieldSchema(name:t.ds, type:string, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20210111141102_7a6349d0-f9ed-4aad-ac07-b94b44de2394); Time taken: 0.098 seconds
INFO : Executing command(queryId=hive_20210111141102_7a6349d0-f9ed-4aad-ac07-b94b44de2394): select * from t
INFO : Completed executing command(queryId=hive_20210111141102_7a6349d0-f9ed-4aad-ac07-b94b44de2394); Time taken: 0.013 seconds
INFO : OK
+---------------+-------------+-------------+
| t.col1 | t.col2 | t.ds |
+---------------+-------------+-------------+
| col1_default | col2_value | 1981-01-07 |
| col1_value | col2_value | 1983-04-27 |
+---------------+-------------+-------------+
2 rows selected (0.159 seconds)
```
### Why are the changes needed?
Without this change the old schema would be used. This can cause a correctness issue when the new schema introduces
a new field with a default value (following the rules of schema evolution) before an existing field.
In this case the rows coming from the partitions written with the old schema will **contain values in the wrong column positions**.
For example, check the attached unit test `SPARK-26836: support Avro schema evolution`.
Without this fix the result of the select on the table would be:
```
+----------+----------+----------+
| col1| col2| ds|
+----------+----------+----------+
|col2_value| null|1981-01-07|
|col1_value|col2_value|1983-04-27|
+----------+----------+----------+
```
With this fix:
```
+------------+----------+----------+
| col1| col2| ds|
+------------+----------+----------+
|col1_default|col2_value|1981-01-07|
| col1_value|col2_value|1983-04-27|
+------------+----------+----------+
```
### Does this PR introduce _any_ user-facing change?
It just fixes the value errors.
When a new column is introduced, even at the last position, the given default value will be used instead of `null`.
### How was this patch tested?
This was tested with the unit tests included in the PR, and manually on Apache Spark / Hive.
Closes#31133 from attilapiros/SPARK-26836.
Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR proposes to migrate `ALTER TABLE ... SET/UNSET TBLPROPERTIES` to use `UnresolvedTable` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).
### Why are the changes needed?
This is a part of effort to make the relation lookup behavior consistent: [SPARK-29900](https://issues.apache.org/jira/browse/SPARK-29900).
### Does this PR introduce _any_ user-facing change?
After this PR, `ALTER TABLE SET/UNSET TBLPROPERTIES` will have a consistent resolution behavior.
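A hypothetical illustration (table/view names are made up) of the lookup order that the unified framework applies, temp view first:
```scala
spark.sql("CREATE TABLE t (id INT) USING parquet")
spark.sql("CREATE TEMPORARY VIEW t AS SELECT 1 AS id")
// With consistent resolution, the identifier `t` resolves to the temporary view first,
// so the command fails with an "expects a table" style error instead of silently
// altering the underlying table.
spark.sql("ALTER TABLE t SET TBLPROPERTIES ('k' = 'v')")
```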
### How was this patch tested?
Updated existing tests / added new tests.
Closes#31422 from imback82/v2_alter_table_set_unset_properties.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
1. Move parser tests from `DDLParserSuite` to `TruncateTableParserSuite`.
2. Port DS v1 tests from `DDLSuite` and other test suites to `v1.TruncateTableSuiteBase` and to `v1.TruncateTableSuite`.
3. Add a test for DSv2 `TRUNCATE TABLE` to `v2.TruncateTableSuite`.
### Why are the changes needed?
To improve test coverage.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running new test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *TruncateTableSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *CatalogedDDLSuite"
```
Closes#31387 from MaxGekk/unify-truncate-table-tests.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to add `relationTypeMismatchHint` to `UnresolvedTable` so that if a relation is resolved to a view when a table is expected, a hint message can be included as a part of the analysis exception message. Note that the same feature is already introduced to `UnresolvedView` in #30636.
This mostly affects `ALTER TABLE` commands, where the analysis exception message will now contain `Please use ALTER VIEW instead.`
### Why are the changes needed?
To give a better error message. (The hint used to exist but got removed for commands that migrated to the new resolution framework)
### Does this PR introduce _any_ user-facing change?
Yes, now `ALTER TABLE` commands include a hint to use `ALTER VIEW` instead.
```
sql("ALTER TABLE v SET SERDE 'whatever'")
```
Before:
```
"v is a view. 'ALTER TABLE ... SET [SERDE|SERDEPROPERTIES]' expects a table.
```
After this PR:
```
"v is a view. 'ALTER TABLE ... SET [SERDE|SERDEPROPERTIES]' expects a table. Please use ALTER VIEW instead.
```
### How was this patch tested?
Updated existing test cases to include the hint.
Closes#31424 from imback82/better_error.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR completes snake_case rule at functions APIs across the languages, see also SPARK-10621.
In more details, this PR:
- Adds `count_distinct` in Scala, Python, and R, and documents that `count_distinct` is encouraged. `countDistinct` was not deprecated because it is pretty commonly used; we could deprecate it in future releases.
- (Scala-specific) adds `typedlit` but doesn't deprecate `typedLit` which is arguably commonly used. Likewise, we could deprecate in the future releases.
- Deprecates and renames:
- `sumDistinct` -> `sum_distinct`
- `bitwiseNOT` -> `bitwise_not`
- `shiftLeft` -> `shiftleft` (matched with SQL name in `FunctionRegistry`)
- `shiftRight` -> `shiftright` (matched with SQL name in `FunctionRegistry`)
- `shiftRightUnsigned` -> `shiftrightunsigned` (matched with SQL name in `FunctionRegistry`)
- (Scala-specific) `callUDF` -> `call_udf`
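A brief usage sketch of the new snake_case names (spark-shell style; `spark` is the active SparkSession):
```scala
import org.apache.spark.sql.functions._

val df = spark.range(10).withColumn("v", col("id") % 3)
df.select(
  count_distinct(col("v")),   // encouraged alongside countDistinct
  sum_distinct(col("v")),     // replaces deprecated sumDistinct
  bitwise_not(col("id")),     // replaces deprecated bitwiseNOT
  shiftleft(col("id"), 2),    // replaces deprecated shiftLeft
  shiftright(col("id"), 1)    // replaces deprecated shiftRight
).show()
```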
### Why are the changes needed?
To keep the consistent naming in APIs.
### Does this PR introduce _any_ user-facing change?
Yes, it deprecates some APIs and adds new renamed APIs as described above.
### How was this patch tested?
Unittests were added.
Closes#31408 from HyukjinKwon/SPARK-34306.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Passing around the output attributes has more benefits, such as keeping the exprIds unchanged, which avoids bugs when we apply more operators on top of the command output DataFrame.
This PR keeps the `SHOW COLUMNS` command's output attribute exprIds unchanged.
### Why are the changes needed?
Keep the `SHOW COLUMNS` command's output attribute exprIds unchanged.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added UT
Closes#31377 from AngersZhuuuu/SPARK-34239.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
1. Remove old statement `AlterTableSetLocationStatement`
2. Introduce new command `AlterTableSetLocation` for `ALTER TABLE .. SET LOCATION`.
### Why are the changes needed?
This is a part of effort to make the relation lookup behavior consistent: SPARK-29900.
### Does this PR introduce _any_ user-facing change?
It can change the error message for views.
### How was this patch tested?
By running `ALTER TABLE .. SET LOCATION` tests:
```
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *DataSourceV2SQLSuite"
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *CatalogedDDLSuite"
```
Closes#31414 from MaxGekk/migrate-set-location-resolv-table.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
`ResolveSessionCatalog`'s `isTempView` and `isTempFunction` are not being used anymore since the resolution of temp view/function has moved to `Analyzer`.
This PR proposes to remove `isTempView` and `isTempFunction` from `ResolveSessionCatalog`.
### Why are the changes needed?
To clean up unused variables.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests should cover as this PR just removes the unused variables.
Closes#31400 from imback82/cleanup_resolve_session_catalog.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
We added test "built-in Hadoop version should support shaded client" in https://github.com/apache/spark/pull/31203, but it fails when profile hadoop-2.7 is activated. This change fixes the test by skipping the assertion when Hadoop version is 2.
### Why are the changes needed?
The test fails in master branch when profile hadoop-2.7 is activated.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Ran the test with hadoop-2.7 profile.
Closes#31391 from bozhang2820/fix-hadoop-2-version-test.
Authored-by: Bo Zhang <bo.zhang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add a function-existence check before loading the resource.
### Why are the changes needed?
We should not add the jar to the classpath if the temporary function being created already exists.
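A sketch of the scenario this PR addresses (class and jar names are hypothetical):
```scala
spark.sql("CREATE TEMPORARY FUNCTION my_udf AS 'com.example.MyUDF' USING JAR '/tmp/my-udf-v1.jar'")
// Creating the same temporary function again fails with "already exists". Before this PR
// the jar was still added to the classpath before the failure; after this PR the existence
// check happens first, so the resource is not loaded as a side effect.
spark.sql("CREATE TEMPORARY FUNCTION my_udf AS 'com.example.MyUDF' USING JAR '/tmp/my-udf-v2.jar'")
```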
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Add test.
Closes#31358 from ulysses-you/SPARK-34261.
Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Invoke `CatalogImpl.refreshTable()` in v1 implementation of the `ALTER TABLE .. SET LOCATION` command to refresh cached table data.
### Why are the changes needed?
The example below portrays the issue:
- Create a source table:
```sql
spark-sql> CREATE TABLE src_tbl (c0 int, part int) USING hive PARTITIONED BY (part);
spark-sql> INSERT INTO src_tbl PARTITION (part=0) SELECT 0;
spark-sql> SHOW TABLE EXTENDED LIKE 'src_tbl' PARTITION (part=0);
default src_tbl false Partition Values: [part=0]
Location: file:/Users/maximgekk/proj/refresh-cache-set-location/spark-warehouse/src_tbl/part=0
...
```
- Set new location for the empty partition (part=0):
```sql
spark-sql> CREATE TABLE dst_tbl (c0 int, part int) USING hive PARTITIONED BY (part);
spark-sql> ALTER TABLE dst_tbl ADD PARTITION (part=0);
spark-sql> INSERT INTO dst_tbl PARTITION (part=1) SELECT 1;
spark-sql> CACHE TABLE dst_tbl;
spark-sql> SELECT * FROM dst_tbl;
1 1
spark-sql> ALTER TABLE dst_tbl PARTITION (part=0) SET LOCATION '/Users/maximgekk/proj/refresh-cache-set-location/spark-warehouse/src_tbl/part=0';
spark-sql> SELECT * FROM dst_tbl;
1 1
```
The last query does not return the newly loaded data.
### Does this PR introduce _any_ user-facing change?
Yes. After the changes, the example above works correctly:
```sql
spark-sql> ALTER TABLE dst_tbl PARTITION (part=0) SET LOCATION '/Users/maximgekk/proj/refresh-cache-set-location/spark-warehouse/src_tbl/part=0';
spark-sql> SELECT * FROM dst_tbl;
0 0
1 1
```
### How was this patch tested?
Added new test to `org.apache.spark.sql.hive.CachedTableSuite`:
```
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *CachedTableSuite"
```
Closes#31361 from MaxGekk/refresh-cache-set-location.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Use `majorMinorPatchVersion` to check major & minor version in `IsolatedClientLoader.hiveVersion`.
### Why are the changes needed?
Currently `IsolatedClientLoader.hiveVersion` needs to enumerate all Hive patch versions. Therefore, whenever we upgrade the Hive version, we'd need to remember to update the method as well. It would be better if we just checked the major & minor version.
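A minimal sketch (not the actual Spark helper) of how extracting major/minor/patch lets the mapping rely on major.minor only:
```scala
// Parse "major.minor[.patch][suffix]" into numeric components.
def majorMinorPatch(version: String): Option[(Int, Int, Int)] = {
  val Pattern = """^(\d+)\.(\d+)(?:\.(\d+))?.*""".r
  version match {
    case Pattern(major, minor, patch) =>
      Some((major.toInt, minor.toInt, Option(patch).map(_.toInt).getOrElse(0)))
    case _ => None
  }
}

// e.g. "2.3.8" and a future "2.3.9" both map to Hive 2.3 without enumerating patch versions.
assert(majorMinorPatch("2.3.8").exists { case (maj, min, _) => maj == 2 && min == 3 })
```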
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
This is a refactoring and relies on existing tests.
Closes#31371 from sunchao/replace-hive-version.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Remove `SessionState.refreshTable()` and modify the tests where the method is used.
### Why are the changes needed?
There are already 2 methods with the same name in:
- `SessionCatalog`
- `CatalogImpl`
One more method in `SessionState` does not give any benefits. By removing it, we can improve code maintenance.
### Does this PR introduce _any_ user-facing change?
Should not because `SessionState` is an internal class.
### How was this patch tested?
By running the modified test suites:
```
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *MetastoreDataSourcesSuite"
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *HiveOrcQuerySuite"
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *HiveParquetMetastoreSuite"
```
Closes#31366 from MaxGekk/remove-refreshTable-from-SessionState.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
1. Add back Maven enforcer for duplicate dependencies check
2. Stricter check on Hadoop versions which support the shaded client in `IsolatedClientLoader`. To do a proper version check, this adds a util function `majorMinorPatchVersion` to extract the major/minor/patch version from a string.
3. Cleanup unnecessary code
### Why are the changes needed?
The Maven enforcer was removed as part of #30556. This proposes to add it back.
Also, the Hadoop shaded client doesn't work in certain cases (see [these comments](https://github.com/apache/spark/pull/30701#discussion_r558522227) for details). This strictly checks that the current Hadoop version (i.e., 3.2.2 at the moment) has good support for the shaded client, or otherwise falls back to the old unshaded ones.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#31203 from sunchao/SPARK-33212-followup.
Lead-authored-by: Chao Sun <sunchao@apple.com>
Co-authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Using a function like `.mkString` or `.getLines` directly on a `scala.io.Source` opened by `fromFile`, `fromURL`, or `fromURI` leaks the underlying file handle. This PR uses the `Utils.tryWithResource` method to wrap the `BufferedSource` and ensure it is closed.
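A minimal sketch of the pattern (the real `Utils.tryWithResource` is Spark-internal; a hand-rolled stand-in is used here to keep the example self-contained):
```scala
import scala.io.Source

// Stand-in for Utils.tryWithResource: run `f` and always close the resource.
def tryWithResource[R <: AutoCloseable, T](createResource: => R)(f: R => T): T = {
  val resource = createResource
  try f(resource) finally resource.close()
}

// Before: Source.fromFile("/tmp/example.txt").mkString leaks the underlying file handle.
// After: the BufferedSource is closed even if reading throws.
val contents = tryWithResource(Source.fromFile("/tmp/example.txt"))(_.mkString)
```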
### Why are the changes needed?
Avoid file handle leak.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass the Jenkins or GitHub Action
Closes#31323 from LuciferYang/source-not-closed.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Follow the comment https://github.com/apache/spark/pull/31271#discussion_r562598983:
- Remove the API tag `Unstable` for `HiveSessionStateBuilder`
- Add document for spark.sql.hive package to emphasize it's a private package
### Why are the changes needed?
Follow the rule for a private package.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Doc change only.
Closes#31321 from xuanyuanking/SPARK-34185-follow.
Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add a notice about keeping the Hive version consistent when configuring the Hive jars location.
With PR #29881, if we don't keep the Hive versions consistent, we get the error below.
```
Builtin jars can only be used when hive execution version == hive metastore version. Execution: 2.3.8 != Metastore: 1.2.1. Specify a valid path to the correct hive jars using spark.sql.hive.metastore.jars or change spark.sql.hive.metastore.version to 2.3.8.
```
![image](https://user-images.githubusercontent.com/46485123/105795169-512d8380-5fc7-11eb-97c3-0259a0d2aa58.png)
### Why are the changes needed?
Make the config doc more detailed.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Not needed; doc change only.
Closes#31317 from AngersZhuuuu/SPARK-32852-followup.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
On the read side, the char length check and padding bring issues to CBO and predicate pushdown, and other issues to the catalyst.
This PR reverts 6da5cdf1db (which added the read-side length check) so that we only do the length check on the write side, and data sources/vendors are responsible for enforcing the char/varchar constraints for data import operations like ADD PARTITION. It doesn't make sense for Spark to report errors on the read side if the data is already dirty.
This PR also moves the char padding to the write side, so that 1) it avoids read-side issues like CBO and filter pushdown, and 2) the data source can preserve char type semantics better even if it's read by systems other than Spark.
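A hedged sketch (hypothetical table) of the write-side enforcement that remains after this change:
```scala
spark.sql("CREATE TABLE chars (c CHAR(5)) USING parquet")
// The length check now happens on write: an oversized value fails here, e.g.
//   spark.sql("INSERT INTO chars VALUES ('toolongvalue')")   // error at write time
// Padding also happens on write, so the stored value is 'ab   '.
spark.sql("INSERT INTO chars VALUES ('ab')")
// Reads no longer re-check or pad, which avoids the CBO / filter pushdown issues above.
spark.sql("SELECT c, length(c) FROM chars").show()
```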
### Why are the changes needed?
Fix the performance regression when tables have char/varchar type columns.
Closes #31278.
### Does this PR introduce _any_ user-facing change?
Yes, Spark will not raise an error for oversized char/varchar values on the read side.
### How was this patch tested?
Modified UTs.
The dropped read-side benchmark:
```
================================================================================================
Char Varchar Read Side Perf w/o Tailing Spaces
================================================================================================
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
Read with length 20: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 20 1564 1573 9 63.9 15.6 1.0X
read char with length 20 1532 1551 18 65.3 15.3 1.0X
read varchar with length 20 1520 1531 13 65.8 15.2 1.0X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
Read with length 40: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 40 1573 1613 41 63.6 15.7 1.0X
read char with length 40 1575 1577 2 63.5 15.7 1.0X
read varchar with length 40 1568 1576 11 63.8 15.7 1.0X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
Read with length 60: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 60 1526 1540 23 65.5 15.3 1.0X
read char with length 60 1514 1539 23 66.0 15.1 1.0X
read varchar with length 60 1486 1497 10 67.3 14.9 1.0X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
Read with length 80: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 80 1531 1542 19 65.3 15.3 1.0X
read char with length 80 1514 1529 15 66.0 15.1 1.0X
read varchar with length 80 1524 1565 42 65.6 15.2 1.0X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
Read with length 100: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 100 1597 1623 25 62.6 16.0 1.0X
read char with length 100 1499 1512 16 66.7 15.0 1.1X
read varchar with length 100 1517 1524 8 65.9 15.2 1.1X
================================================================================================
Char Varchar Read Side Perf w/ Tailing Spaces
================================================================================================
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
Read with length 20: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 20 1524 1526 1 65.6 15.2 1.0X
read char with length 20 1532 1537 9 65.3 15.3 1.0X
read varchar with length 20 1520 1532 15 65.8 15.2 1.0X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
Read with length 40: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 40 1556 1580 32 64.3 15.6 1.0X
read char with length 40 1600 1611 17 62.5 16.0 1.0X
read varchar with length 40 1648 1716 88 60.7 16.5 0.9X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
Read with length 60: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 60 1504 1524 20 66.5 15.0 1.0X
read char with length 60 1509 1512 3 66.2 15.1 1.0X
read varchar with length 60 1519 1535 21 65.8 15.2 1.0X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
Read with length 80: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 80 1640 1652 17 61.0 16.4 1.0X
read char with length 80 1625 1666 35 61.5 16.3 1.0X
read varchar with length 80 1590 1605 13 62.9 15.9 1.0X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
Read with length 100: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 100 1622 1628 5 61.6 16.2 1.0X
read char with length 100 1614 1646 30 62.0 16.1 1.0X
read varchar with length 100 1594 1606 11 62.7 15.9 1.0X
```
Closes#31281 from yaooqinn/SPARK-34192.
Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Compare the 3.1.1 API doc with the latest release version 3.0.1. Fix the following issues:
- Add missing `Since` annotation for new APIs
- Remove the leaking class/object in API doc
### Why are the changes needed?
Fix the issues in the Spark 3.1.1 release API docs.
### Does this PR introduce _any_ user-facing change?
Yes, API doc changes.
### How was this patch tested?
Manually test.
Closes#31271 from xuanyuanking/SPARK-34185.
Lead-authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Invoke `CatalogImpl.refreshTable()` instead of `SessionCatalog.refreshTable` in v1 implementation of the `LOAD DATA` command. `SessionCatalog.refreshTable` just refreshes metadata comparing to `CatalogImpl.refreshTable()` which refreshes cached table data as well.
### Why are the changes needed?
The example below portrays the issue:
- Create a source table:
```sql
spark-sql> CREATE TABLE src_tbl (c0 int, part int) USING hive PARTITIONED BY (part);
spark-sql> INSERT INTO src_tbl PARTITION (part=0) SELECT 0;
spark-sql> SHOW TABLE EXTENDED LIKE 'src_tbl' PARTITION (part=0);
default src_tbl false Partition Values: [part=0]
Location: file:/Users/maximgekk/proj/load-data-refresh-cache/spark-warehouse/src_tbl/part=0
...
```
- Load data from the source table to a cached destination table:
```sql
spark-sql> CREATE TABLE dst_tbl (c0 int, part int) USING hive PARTITIONED BY (part);
spark-sql> INSERT INTO dst_tbl PARTITION (part=1) SELECT 1;
spark-sql> CACHE TABLE dst_tbl;
spark-sql> SELECT * FROM dst_tbl;
1 1
spark-sql> LOAD DATA LOCAL INPATH '/Users/maximgekk/proj/load-data-refresh-cache/spark-warehouse/src_tbl/part=0' INTO TABLE dst_tbl PARTITION (part=0);
spark-sql> SELECT * FROM dst_tbl;
1 1
```
The last query does not return the newly loaded data.
### Does this PR introduce _any_ user-facing change?
Yes. After the changes, the example above works correctly:
```sql
spark-sql> LOAD DATA LOCAL INPATH '/Users/maximgekk/proj/load-data-refresh-cache/spark-warehouse/src_tbl/part=0' INTO TABLE dst_tbl PARTITION (part=0);
spark-sql> SELECT * FROM dst_tbl;
0 0
1 1
```
### How was this patch tested?
Added new test to `org.apache.spark.sql.hive.CachedTableSuite`:
```
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *CachedTableSuite"
```
Closes#31304 from MaxGekk/load-data-refresh-cache.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
`HiveExternalCatalogVersionsSuite` can't run in an organization's internal environment where access to the outside internet is not allowed, because `HiveExternalCatalogVersionsSuite` downloads the Spark release package from the internet.
Similar to SPARK-32998, this PR adds an environment variable `SPARK_RELEASE_MIRROR` to let users specify an accessible download address for the Spark release package and run `HiveExternalCatalogVersionsSuite` in such an internal environment.
### Why are the changes needed?
Allow `HiveExternalCatalogVersionsSuite` to run in an organization's internal environment without relying on an external Spark release download address.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Pass the Jenkins or GitHub Action
- Manual test with and without the env variable set, in an internal environment that can't access the internet.
Execute:
```
mvn clean install -Dhadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive -pl sql/hive -am -DskipTests
mvn clean install -Dhadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive -pl sql/hive -DwildcardSuites=org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite -Dtest=none
```
**Without env**
```
HiveExternalCatalogVersionsSuite:
19:50:35.123 WARN org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite: Failed to download Spark 3.0.1 from https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz: Network is unreachable (connect failed)
19:50:35.126 WARN org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite: Failed to download Spark 3.0.1 from https://dist.apache.org/repos/dist/release/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz: Network is unreachable (connect failed)
org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite *** ABORTED ***
Exception encountered when invoking run on a nested suite - Unable to download Spark 3.0.1 (HiveExternalCatalogVersionsSuite.scala:125)
Run completed in 2 seconds, 669 milliseconds.
Total number of tests run: 0
Suites: completed 1, aborted 1
Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0
```
**With env**
```
export SPARK_RELEASE_MIRROR=${spark-release.internal.com}/dist/release/
```
```
HiveExternalCatalogVersionsSuite
- backward compatibility
Run completed in 1 minute, 32 seconds.
Total number of tests run: 1
Suites: completed 2, aborted 0
Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```
Closes#31294 from LuciferYang/SPARK-34202.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR changes cache refreshing of v1 tables in v1 commands. In particular, v1 table dependents are not removed from the cache after this PR. Compared to the current implementation, we just clear the cached data of all dependents and keep them in the cache, so the next actions will fill in the cached data of the original v1 table and its dependents. In more detail:
1. Modified the `CatalogImpl.refreshTable()` method to use `recacheByPlan()` instead of `lookupCachedData()`, `uncacheQuery()` and `cacheQuery()`. Users can call this method via public API like `spark.catalog.refreshTable()`.
2. Rewrote the part in `CatalogImpl.refreshTable()` which was responsible for table metadata refreshing, because this code stopped working properly after removing the second `sparkSession.table(tableIdent)`.
3. Added a new private method `invalidateCachedTable()` to `SessionCatalog`. Compared to the existing `SessionCatalog.refreshTable`, it invalidates the relation cache only. If we called `SessionCatalog.refreshTable` from `CatalogImpl.refreshTable()`, we would refresh temporary and global temporary views twice (which could lead to refreshing the file index twice).
### Why are the changes needed?
1. This should improve the user experience with table/view caching. For example, imagine that a user has a cached v1 table and a cached view based on that table, and the user passes the table to an external library which drops/renames/adds partitions in the v1 table. Unfortunately, the view ends up uncached after that even though the user hasn't uncached the view explicitly.
2. To improve code maintenance.
3. To reduce the amount of calls to Hive external catalog.
4. Also this should speed up table recaching.
5. To have the same behavior as for v2 tables supported by https://github.com/apache/spark/pull/31172
### Does this PR introduce _any_ user-facing change?
From the point of view of query result correctness, there are no behavior changes, but the changes might influence memory consumption and query execution time. For example:
Before:
```scala
scala> sql("CREATE TABLE tbl (c int)")
scala> sql("CACHE TABLE tbl")
scala> sql("CREATE VIEW v AS SELECT * FROM tbl")
scala> sql("CACHE TABLE v")
scala> spark.catalog.isCached("v")
res6: Boolean = true
scala> spark.catalog.refreshTable("tbl")
scala> spark.catalog.isCached("v")
res8: Boolean = false
```
After:
```scala
scala> spark.catalog.refreshTable("tbl")
scala> spark.catalog.isCached("v")
res8: Boolean = true
```
### How was this patch tested?
1. Added new unit tests that create a view, a temporary view and a global temporary view on top of v1/v2 tables, and refresh the base table via `ALTER TABLE .. ADD/DROP/RENAME PARTITION`.
2. By running the unified test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableAddPartitionSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRenamePartitionSuite"
```
Closes#31206 from MaxGekk/refreshTable-recache-by-plan.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Update Avro dependency to version 1.10.1
### Why are the changes needed?
To catch up multiple improvements of Avro as well as fix security issues on transitive dependencies.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Since no API changes were required, we just ran the existing tests.
Closes#31232 from iemejia/SPARK-27733-avro-upgrade.
Authored-by: Ismaël Mejía <iemejia@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
There is one Java UT error when testing the sql/hive module independently in Scala 2.13 after SPARK-33212; the error message is as follows:
```
[ERROR] Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 20.353 s <<< FAILURE! - in org.apache.spark.sql.hive.JavaDataFrameSuite
[ERROR] org.apache.spark.sql.hive.JavaDataFrameSuite.testUDAF Time elapsed: 18.548 s <<< ERROR!
java.lang.NoClassDefFoundError: scala/collection/parallel/TaskSupport
at org.apache.spark.sql.hive.JavaDataFrameSuite.checkAnswer(JavaDataFrameSuite.java:41)
at org.apache.spark.sql.hive.JavaDataFrameSuite.testUDAF(JavaDataFrameSuite.java:92)
Caused by: java.lang.ClassNotFoundException: scala.collection.parallel.TaskSupport
at org.apache.spark.sql.hive.JavaDataFrameSuite.checkAnswer(JavaDataFrameSuite.java:41)
at org.apache.spark.sql.hive.JavaDataFrameSuite.testUDAF(JavaDataFrameSuite.java:92)
```
This PR adds a Scala-2.13 profile with a dependency on `scala-parallel-collections_` to the `sql/hive` module to fix the Java UT in Scala 2.13.
### Why are the changes needed?
Recover the ability to run mvn tests for the sql/hive module independently in Scala 2.13.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Pass the Jenkins or GitHub Action
- Manual test
```
dev/change-scala-version.sh 2.13
mvn clean install -Dhadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive -Pscala-2.13 -pl sql/hive -am -DskipTests
mvn test -Dhadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive -Pscala-2.13 -pl sql/hive
```
**Before**
```
[ERROR] Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 18.725 s <<< FAILURE! - in org.apache.spark.sql.hive.JavaDataFrameSuite
[ERROR] org.apache.spark.sql.hive.JavaDataFrameSuite.testUDAF Time elapsed: 16.853 s <<< ERROR!
java.lang.NoClassDefFoundError: scala/collection/parallel/TaskSupport
at org.apache.spark.sql.hive.JavaDataFrameSuite.checkAnswer(JavaDataFrameSuite.java:41)
at org.apache.spark.sql.hive.JavaDataFrameSuite.testUDAF(JavaDataFrameSuite.java:92)
Caused by: java.lang.ClassNotFoundException: scala.collection.parallel.TaskSupport
at org.apache.spark.sql.hive.JavaDataFrameSuite.checkAnswer(JavaDataFrameSuite.java:41)
at org.apache.spark.sql.hive.JavaDataFrameSuite.testUDAF(JavaDataFrameSuite.java:92)
[INFO] Running org.apache.spark.sql.hive.JavaMetastoreDataSourcesSuite
16:15:36.186 WARN org.apache.spark.sql.hive.test.TestHiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider org.apache.spark.sql.json. Persisting data source table `default`.`javasavedtable` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
16:15:36.288 WARN org.apache.hadoop.hive.ql.session.SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
16:15:36.396 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist
16:15:36.397 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
16:15:36.397 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.stats.retries.wait does not exist
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 3.481 s - in org.apache.spark.sql.hive.JavaMetastoreDataSourcesSuite
[INFO]
[INFO] Results:
[INFO]
[ERROR] Errors:
[ERROR] JavaDataFrameSuite.testUDAF:92->checkAnswer:41 » NoClassDefFound scala/collect...
[INFO]
[ERROR] Tests run: 3, Failures: 0, Errors: 1, Skipped: 0
```
**After**
```
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 19.287 s - in org.apache.spark.sql.hive.JavaDataFrameSuite
[INFO] Running org.apache.spark.sql.hive.JavaMetastoreDataSourcesSuite
16:12:16.697 WARN org.apache.spark.sql.hive.test.TestHiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider org.apache.spark.sql.json. Persisting data source table `default`.`javasavedtable` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
16:12:17.540 WARN org.apache.hadoop.hive.ql.session.SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
16:12:17.653 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist
16:12:17.653 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
16:12:17.654 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.stats.retries.wait does not exist
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 3.58 s - in org.apache.spark.sql.hive.JavaMetastoreDataSourcesSuite
[INFO]
[INFO] Results:
[INFO]
[INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0
```
Closes#31259 from LuciferYang/SPARK-34176.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
1. Port DS V2 tests from `AlterTablePartitionV2SQLSuite ` to the test suite `v2.AlterTableRecoverPartitionsSuite`.
2. Port DS v1 tests from `DDLSuite` to `v1.AlterTableRecoverPartitionsSuiteBase`.
### Why are the changes needed?
To improve test coverage.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running new test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRecoverPartitionsParserSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRecoverPartitionsSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *CatalogedDDLSuite"
```
Closes#31105 from MaxGekk/unify-recover-partitions-tests.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR keeps the necessary stats after partition pruning.
### Why are the changes needed?
Improve query performance. Since SPARK-34081, the aggregate can be pushed down because it can be planned as a BroadcastHashJoin. But column statistics are lost after [`PruneFileSourcePartitions`](d0c83f372b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitions.scala (L102-L103)), so it will eventually be planned as a SortMergeJoin.
Please see the log:
```
join.right.stats: org.apache.spark.sql.catalyst.optimizer.PushDownPredicates: Statistics(sizeInBytes=348.8 KiB, rowCount=1.79E+4)
join.right.stats: org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions: Statistics(sizeInBytes=1414.2 EiB)
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test and benchmark test
SQL | Before this PR (seconds) | After this PR (seconds)
-- | -- | --
q14a | 594 | 384
q14b | 600 | 402
This change will not affect the results of `PlanStabilitySuite`, because it does not have partition columns.
Closes#31205 from wangyum/SPARK-34119.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Remove unused call of `getRawTable()` from `HiveExternalCatalog.alterPartitions()`.
### Why are the changes needed?
It reduces the number of calls to Hive External catalog.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running the modified test suite:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRenamePartitionSuite"
```
Closes#31234 from MaxGekk/remove-getRawTable-from-alterPartitions.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Invoke `refreshTable()` from `CatalogImpl` which refreshes the cache in v1 `ALTER TABLE .. RECOVER PARTITIONS`.
### Why are the changes needed?
This fixes the issues portrayed by the example:
```sql
spark-sql> create table tbl (col int, part int) using parquet partitioned by (part);
spark-sql> insert into tbl partition (part=0) select 0;
spark-sql> cache table tbl;
spark-sql> select * from tbl;
0 0
spark-sql> show table extended like 'tbl' partition(part=0);
default tbl false Partition Values: [part=0]
Location: file:/Users/maximgekk/proj/recover-partitions-refresh-cache/spark-warehouse/tbl/part=0
...
```
Create new partition by copying the existing one:
```
$ cp -r /Users/maximgekk/proj/recover-partitions-refresh-cache/spark-warehouse/tbl/part=0 /Users/maximgekk/proj/recover-partitions-refresh-cache/spark-warehouse/tbl/part=1
```
```sql
spark-sql> alter table tbl recover partitions;
spark-sql> select * from tbl;
0 0
```
The last query should also return `0 1`, since that partition has been recovered by `ALTER TABLE .. RECOVER PARTITIONS`.
### Does this PR introduce _any_ user-facing change?
Yes. After the changes for the example above:
```sql
...
spark-sql> alter table tbl recover partitions;
spark-sql> select * from tbl;
0 0
0 1
```
### How was this patch tested?
By running the affected test suite:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *CachedTableSuite"
```
Closes#31066 from MaxGekk/recover-partitions-refresh-cache.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
The `java.io.File.toURL` method does not automatically escape characters that are illegal in URLs.
The Java doc recommends that new code convert an abstract pathname into a URL by first converting it into a URI, via the `toURI` method, and then converting the URI into a URL via the `URI.toURL` method.
So this PR cleans up the relevant cases in the Spark code.
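A minimal sketch of the replacement applied across the code base:
```scala
import java.io.File

val file = new File("/tmp/has spaces and #illegal chars.txt")
// Deprecated, does not escape characters that are illegal in URLs:
//   val url = file.toURL
// Recommended: go through a URI first, which performs the escaping.
val url = file.toURI.toURL
println(url) // roughly: file:/tmp/has%20spaces%20and%20%23illegal%20chars.txt
```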
### Why are the changes needed?
Cleaning up `Deprecated` Java API usage.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass the Jenkins or GitHub Action
Closes#31230 from LuciferYang/SPARK-34151.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Hive 2.3.8 changes:
HIVE-19662: Upgrade Avro to 1.8.2
HIVE-24324: Remove deprecated API usage from Avro
HIVE-23980: Shade Guava from hive-exec in Hive 2.3
HIVE-24436: Fix Avro NULL_DEFAULT_VALUE compatibility issue
HIVE-24512: Exclude calcite in packaging.
HIVE-22708: Fix for HttpTransport to replace String.equals
HIVE-24551: Hive should include transitive dependencies from calcite after shading it
HIVE-24553: Exclude calcite from test-jar dependency of hive-exec
### Why are the changes needed?
Upgrade Avro and Parquet to latest version.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests, plus a test that tries to upgrade Parquet to 1.11.1 and Avro to 1.10.1: https://github.com/apache/spark/pull/30517
Closes#30657 from wangyum/SPARK-33696.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This:
1. switches Spark to use shaded Hadoop clients, namely hadoop-client-api and hadoop-client-runtime, for Hadoop 3.x.
2. upgrades the built-in version for Hadoop 3.x to Hadoop 3.2.2
Note that for Hadoop 2.7, we'll still use the same modules such as hadoop-client.
In order to still keep default Hadoop profile to be hadoop-3.2, this defines the following Maven properties:
```
hadoop-client-api.artifact
hadoop-client-runtime.artifact
hadoop-client-minicluster.artifact
```
which default to:
```
hadoop-client-api
hadoop-client-runtime
hadoop-client-minicluster
```
but all switch to `hadoop-client` when the Hadoop profile is hadoop-2.7. A side effect of this is that we'll import the same dependency multiple times. For this I have to disable the Maven enforcer rule `banDuplicatePomDependencyVersions`.
Besides above, there are the following changes:
- explicitly add a few dependencies which are imported via transitive dependencies from Hadoop jars, but are removed from the shaded client jars.
- removed the use of `ProxyUriUtils.getPath` from `ApplicationMaster` which is a server-side/private API.
- modified `IsolatedClientLoader` to exclude `hadoop-auth` jars when Hadoop version is 3.x. This change should only matter when we're not sharing Hadoop classes with Spark (which is _mostly_ used in tests).
### Why are the changes needed?
Hadoop 3.2.2 is released with new features and bug fixes, so it's good for the Spark community to adopt it. However, latest Hadoop versions starting from Hadoop 3.2.1 have upgraded to use Guava 27+. In order to resolve Guava conflicts, this takes the approach by switching to shaded client jars provided by Hadoop. This also has the benefits of avoid pulling other 3rd party dependencies from Hadoop side so as to avoid more potential future conflicts.
### Does this PR introduce _any_ user-facing change?
When people use Spark with `hadoop-provided` option, they should make sure class path contains `hadoop-client-api` and `hadoop-client-runtime` jars. In addition, they may need to make sure these jars appear before other Hadoop jars in the order. Otherwise, classes may be loaded from the other non-shaded Hadoop jars and cause potential conflicts.
### How was this patch tested?
Relying on existing tests.
Closes#30701 from sunchao/test-hadoop-3.2.2.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Some local variables are declared as `var`, but they are never reassigned and should be declared as `val`, so this PR turns these from `var` to `val`, except for `mockito`-related cases.
### Why are the changes needed?
Use `val` instead of `var` when possible.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass the Jenkins or GitHub Action
Closes#31142 from LuciferYang/SPARK-33346.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Add support for `orc.force.positional.evolution` config that forces ORC top level column matching by position rather than by name.
This does work in Hive:
```
> set orc.force.positional.evolution;
+--------------------------------------+
| set |
+--------------------------------------+
| orc.force.positional.evolution=true |
+--------------------------------------+
> create table t (c1 string, c2 string) stored as orc;
> insert into t values ('foo', 'bar');
> alter table t change c1 c3 string;
```
The ORC file in this case contains the original `c1` and `c2` columns, which don't match the metadata in HMS. But due to the positional evolution setting, Hive is capable of returning all the data:
```
> select * from t;
+--------+--------+
| t.c3 | t.c2 |
+--------+--------+
| foo | bar |
+--------+--------+
```
Without this PR, Spark returns `null`s for the renamed `c3` column.
After this PR, Spark returns the data in the `c3` column.
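A hypothetical usage sketch from the Spark side (how exactly the property is set may vary; here it is assumed to be picked up from the session configuration):
```scala
spark.conf.set("orc.force.positional.evolution", "true")
// Reading the Hive example table above: with this PR the renamed column c3 carries the
// original data ("foo") instead of nulls.
spark.sql("SELECT * FROM t").show()
```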
### Why are the changes needed?
Hive/ORC does support it.
### Does this PR introduce _any_ user-facing change?
Yes, we will support `orc.force.positional.evolution`.
### How was this patch tested?
New UT.
Closes#29737 from peter-toth/SPARK-32864-support-orc-forced-positional-evolution.
Lead-authored-by: Peter Toth <peter.toth@gmail.com>
Co-authored-by: Peter Toth <ptoth@cloudera.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
There are some redundant collection conversions that can be removed; for version compatibility, these are cleaned up with the Scala 2.13 profile.
### Why are the changes needed?
Remove redundant collection conversion
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Pass the Jenkins or GitHub Action
- Manual test `core`, `graphx`, `mllib`, `mllib-local`, `sql`, `yarn`,`kafka-0-10` in Scala 2.13 passed
Closes#31125 from LuciferYang/SPARK-34068.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This changes `CatalogImpl.dropTempView` and `CatalogImpl.dropGlobalTempView` to use the analyzed logical plan instead of `viewDef`, which is unresolved.
### Why are the changes needed?
Currently, `CatalogImpl.dropTempView` is implemented as following:
```scala
override def dropTempView(viewName: String): Boolean = {
  sparkSession.sessionState.catalog.getTempView(viewName).exists { viewDef =>
    sparkSession.sharedState.cacheManager.uncacheQuery(
      sparkSession, viewDef, cascade = false)
    sessionCatalog.dropTempView(viewName)
  }
}
```
Here, the logical plan `viewDef` is not resolved, and when it is passed to `uncacheQuery`, it can fail at the `sameResult` call, where the canonicalized plans are compared. The error message looks like:
```
Invalid call to qualifier on unresolved object, tree: 'key
```
This can be reproduced via:
```scala
sql(s"CREATE TEMPORARY VIEW $v AS SELECT key FROM src LIMIT 10")
sql(s"CREATE TABLE $t AS SELECT * FROM src")
sql(s"CACHE TABLE $t")
dropTempTable(v)
```
### Does this PR introduce _any_ user-facing change?
The only user-facing change is that `SQLContext.dropTempTable` could previously fail in the above scenario but works with this fix.
### How was this patch tested?
Added new unit tests.
Closes#31136 from sunchao/SPARK-34076.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Port the test added by https://github.com/apache/spark/pull/31112 to:
1. v1 In-Memory catalog for `ALTER TABLE .. DROP PARTITION`
2. v1 In-Memory and Hive external catalogs for `ALTER TABLE .. ADD PARTITION`
3. v1 In-Memory and Hive external catalogs for `ALTER TABLE .. RENAME PARTITION`
### Why are the changes needed?
To improve test coverage.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running the modified test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableAddPartitionSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableDropPartitionSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableRenamePartitionSuite"
```
Closes#31131 from MaxGekk/cache-stats-tests.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Do not alter table stats if they are the same as in the catalog (at least since the most recent retrieval).
### Why are the changes needed?
The changes reduce the number of calls to Hive external catalog.
### Does this PR introduce _any_ user-facing change?
Should not.
### How was this patch tested?
By running the modified test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableAddPartitionSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite"
```
Closes#31135 from MaxGekk/optimize-updateTableStats.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This is a follow-up to replace `version.toDouble > 2` with `version >= "2.0"`
### Why are the changes needed?
`toDouble` has some assumption and can cause `java.lang.NumberFormatException`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs.
Closes#31134 from dongjoon-hyun/SPARK-33970-FOLLOWUP.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Fix canonicalisation of `HiveTableRelation` by normalisation of `CatalogTable`, and exclude table stats and temporary fields from the canonicalized plan.
### Why are the changes needed?
This fixes the issue demonstrated by the example below:
```scala
scala> spark.conf.set("spark.sql.statistics.size.autoUpdate.enabled", true)
scala> sql(s"CREATE TABLE tbl (id int, part int) USING hive PARTITIONED BY (part)")
scala> sql("INSERT INTO tbl PARTITION (part=0) SELECT 0")
scala> sql("INSERT INTO tbl PARTITION (part=1) SELECT 1")
scala> sql("CACHE TABLE tbl")
scala> sql("SELECT * FROM tbl").show(false)
+---+----+
|id |part|
+---+----+
|0 |0 |
|1 |1 |
+---+----+
scala> spark.catalog.isCached("tbl")
scala> sql("ALTER TABLE tbl DROP PARTITION (part=0)")
scala> spark.catalog.isCached("tbl")
res19: Boolean = false
```
`ALTER TABLE .. DROP PARTITION` must keep the table in the cache.
### Does this PR introduce _any_ user-facing change?
Yes. After the changes, the drop partition command keeps the table in the cache while updating table stats:
```scala
scala> sql("ALTER TABLE tbl DROP PARTITION (part=0)")
scala> spark.catalog.isCached("tbl")
res19: Boolean = true
```
### How was this patch tested?
By running new UT in `AlterTableDropPartitionSuite`.
Closes#31112 from MaxGekk/fix-caching-hive-table-2.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR adds a test for the default partition when Hive metastore direct SQL is used.
### Why are the changes needed?
Improve test coverage.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
N/A
Closes#31109 from wangyum/SPARK-33970.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Increase the expected number of calls to the Hive external catalog in the test for `ALTER TABLE .. ADD PARTITION`.
### Why are the changes needed?
There is a logical conflict between https://github.com/apache/spark/pull/31101 and https://github.com/apache/spark/pull/31092. The former fixes a caching issue and increases the number of calls to the Hive external catalog.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running the modified test:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableAddPartitionSuite"
```
Closes#31111 from MaxGekk/add-partition-refresh-cache-2-followup.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Use `initialize(StructObjectInspector argOIs)` instead of `initialize(ObjectInspector[] args)` in `HiveGenericUDTF`.
### Why are the changes needed?
In our case, we implement a Hive `GenericUDTF` and override `initialize(StructObjectInspector argOIs)`. The UDTF runs fine with Hive but fails with Spark SQL. Here is the Spark SQL error message:
```
No handler for UDF/UDAF/UDTF 'com.xxxx.xxxUDTF': java.lang.IllegalStateException:
Should not be called directly Please make sure your function overrides
`public StructObjectInspector initialize(ObjectInspector[] args)`.
```
The reason is that Spark's `HiveGenericUDTF` calls `initialize(ObjectInspector[] argOIs)` to initialize the UDTF, but that method is deprecated:
```
public StructObjectInspector initialize(StructObjectInspector argOIs) throws UDFArgumentException {
List<? extends StructField> inputFields = argOIs.getAllStructFieldRefs();
ObjectInspector[] udtfInputOIs = new ObjectInspector[inputFields.size()];
for(int i = 0; i < inputFields.size(); ++i) {
udtfInputOIs[i] = ((StructField)inputFields.get(i)).getFieldObjectInspector();
}
return this.initialize(udtfInputOIs);
}
@Deprecated
public StructObjectInspector initialize(ObjectInspector[] argOIs) throws UDFArgumentException {
throw new IllegalStateException("Should not be called directly");
}
```
We should use `initialize(StructObjectInspector argOIs)` instead, so that UDTFs overriding either method keep working, the same as Hive does.
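For illustration, a hypothetical UDTF that overrides only the non-deprecated variant; Hive can load it, but before this fix Spark failed with the error above because it invoked the deprecated `ObjectInspector[]` variant:
```scala
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF
import org.apache.hadoop.hive.serde2.objectinspector.{ObjectInspector, ObjectInspectorFactory, StructObjectInspector}
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory

// Hypothetical example class, not part of this patch.
class IdentityStringUDTF extends GenericUDTF {
  override def initialize(argOIs: StructObjectInspector): StructObjectInspector = {
    // Output a single string column named "col".
    val names = new java.util.ArrayList[String]()
    names.add("col")
    val ois = new java.util.ArrayList[ObjectInspector]()
    ois.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector)
    ObjectInspectorFactory.getStandardStructObjectInspector(names, ois)
  }

  override def process(args: Array[AnyRef]): Unit = {
    // Echo the first argument back as a single-column row.
    forward(Array[AnyRef](String.valueOf(args(0))))
  }

  override def close(): Unit = {}
}
```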
### Does this PR introduce _any_ user-facing change?
Yes, this fixes UDTF initialization, so UDTFs that only override `initialize(StructObjectInspector)` now work with Spark SQL.
### How was this patch tested?
Manual test, and passed `HiveUDFDynamicLoadSuite`.
Closes#29490 from ulysses-you/SPARK-32668.
Lead-authored-by: ulysses-you <ulyssesyou18@gmail.com>
Co-authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
### What changes were proposed in this pull request?
Add new tests to the unified test suites to check the total number of calls made via the Hive client.
### Why are the changes needed?
1. To improve test coverage
2. To lay the foundation for future optimizations
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running the affected test suites like:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite"
```
Closes#31092 from MaxGekk/access-to-catalog-refreshTable.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
1. Recognize `null` while parsing partition specs, and put `null` instead of the string `"null"` as the partition value.
2. For the V1 catalog: replace `null` with `__HIVE_DEFAULT_PARTITION__`.
3. For V2 catalogs: pass `null` as is, and let catalog implementations decide how to handle `null`s as partition values in the spec (a minimal sketch follows below).
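A minimal sketch of the V1 behaviour, with an illustrative helper name: a genuine NULL partition value is stored under Hive's default partition name rather than as the string `"null"`.
```scala
// Sketch only: map a real NULL to Hive's default partition name.
val HIVE_DEFAULT_PARTITION = "__HIVE_DEFAULT_PARTITION__"

def normalizeV1PartitionValue(value: String): String =
  if (value == null) HIVE_DEFAULT_PARTITION else value

// PARTITION (p1 = null) now yields p1 -> "__HIVE_DEFAULT_PARTITION__"
// instead of p1 -> "null".
```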
### Why are the changes needed?
Currently, `null` in partition specs is recognized as the `"null"` string which could lead to incorrect results, for example:
```sql
spark-sql> CREATE TABLE tbl5 (col1 INT, p1 STRING) USING PARQUET PARTITIONED BY (p1);
spark-sql> INSERT INTO TABLE tbl5 PARTITION (p1 = null) SELECT 0;
spark-sql> SELECT isnull(p1) FROM tbl5;
false
```
Even though we inserted a row into the partition with a `null` value, **the resulting table doesn't contain `null`**.
### Does this PR introduce _any_ user-facing change?
Yes. After the changes, the example above works as expected:
```sql
spark-sql> SELECT isnull(p1) FROM tbl5;
true
```
### How was this patch tested?
1. By running the affected test suites `SQLQuerySuite`, `AlterTablePartitionV2SQLSuite` and `v1/ShowPartitionsSuite`.
2. Compiling by Scala 2.13:
```
$ ./dev/change-scala-version.sh 2.13
$ ./build/sbt -Pscala-2.13 compile
```
Closes#30538 from MaxGekk/partition-spec-value-null.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
1. Port DS V2 tests from `DataSourceV2SQLSuite` to the base test suite `ShowNamespacesSuiteBase` to run those tests for v1 catalogs.
2. Port DS v1 tests from `DDLSuite` to `ShowNamespacesSuiteBase` to run the tests for v2 catalogs too.
### Why are the changes needed?
To improve test coverage.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running new test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *ShowNamespacesSuite"
```
Closes#30937 from MaxGekk/unify-show-namespaces-tests.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In Hive 1.2.1, the Hive SerDe simply splits `serdeConstants.LIST_COLUMNS` and `serdeConstants.LIST_COLUMN_TYPES` on commas.
When we run the following UT with Spark 2.4:
```
test("insert overwrite directory with comma col name") {
withTempDir { dir =>
val path = dir.toURI.getPath
val v1 =
s"""
| INSERT OVERWRITE DIRECTORY '${path}'
| STORED AS TEXTFILE
| SELECT 1 as a, 'c' as b, if(1 = 1, "true", "false")
""".stripMargin
sql(v1).explain(true)
sql(v1).show()
}
}
```
it fails as below, because a column name contains `,`, so the column-name list and the column-type list end up with different sizes.
```
19:56:05.618 ERROR org.apache.spark.sql.execution.datasources.FileFormatWriter: [ angerszhu ] Aborting job dd774f18-93fa-431f-9468-3534c7d8acda.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.hadoop.hive.serde2.SerDeException: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 5 elements while columns.types has 3 elements!
at org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:145)
at org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.<init>(LazySerDeParameters.java:85)
at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125)
at org.apache.spark.sql.hive.execution.HiveOutputWriter.<init>(HiveFileFormat.scala:119)
at org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103)
at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:120)
at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:108)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:287)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:219)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:218)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$12.apply(Executor.scala:461)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:467)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
```
Since Hive 2.3, `COLUMN_NAME_DELIMITER` is set to a special character when a column name contains ',':
6f4c35c9e9/metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreUtils.java (L1180-L1188)
6f4c35c9e9/metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreUtils.java (L1044-L1075)
And in script transform, we already parse column names to avoid this problem:
554600c2af/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationExec.scala (L257-L261)
So I think we should do the same thing in `InsertIntoHiveDirCommand`. I have verified that this approach makes Spark 2.4 work well.
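A hedged sketch of the delimiter choice (the helper name and the exact fallback character are assumptions here, mirroring what Hive's `MetaStoreUtils` does since 2.3): when any column name contains `,`, switch to a different delimiter so that splitting `LIST_COLUMNS` keeps the name list and the type list the same length.
```scala
// Sketch only: fall back to the NUL character as the column-name delimiter
// when a column name itself contains a comma (assumed fallback character).
def columnNameDelimiter(columnNames: Seq[String]): String =
  if (columnNames.exists(_.contains(","))) "\u0000" else ","
```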
### Why are the changes needed?
Safer use of the SerDe.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Closes#30850 from AngersZhuuuu/SPARK-33844.
Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>