### What changes were proposed in this pull request?
This PR addresses two things:
- `SHOW TBLPROPERTIES` should support views (a regression introduced by #26921)
- `SHOW TBLPROPERTIES` on a temporary view should return an empty result (the 2.4 behavior) instead of throwing `AnalysisException`.
### Why are the changes needed?
It's a bug.
### Does this PR introduce any user-facing change?
Yes, now `SHOW TBLPROPERTIES` works on views:
```
scala> sql("CREATE VIEW view TBLPROPERTIES('p1'='v1', 'p2'='v2') AS SELECT 1 AS c1")
scala> sql("SHOW TBLPROPERTIES view").show(truncate=false)
+---------------------------------+-------------+
|key |value |
+---------------------------------+-------------+
|view.catalogAndNamespace.numParts|2 |
|view.query.out.col.0 |c1 |
|view.query.out.numCols |1 |
|p2 |v2 |
|view.catalogAndNamespace.part.0 |spark_catalog|
|p1 |v1 |
|view.catalogAndNamespace.part.1 |default |
+---------------------------------+-------------+
```
And for a temporary view:
```
scala> sql("CREATE TEMPORARY VIEW tview TBLPROPERTIES('p1'='v1', 'p2'='v2') AS SELECT 1 AS c1")
scala> sql("SHOW TBLPROPERTIES tview").show(truncate=false)
+---+-----+
|key|value|
+---+-----+
+---+-----+
```
### How was this patch tested?
Added tests.
Closes#28375 from imback82/show_tblproperties_followup.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
HiveClient instance is cross-session, the following configurations which are defined in HiveUtils and used to create it should be considered static:
1. spark.sql.hive.metastore.version - used to determine the hive version in Spark
2. spark.sql.hive.metastore.jars - hive metastore related jars location which is used by spark to create hive client
3. spark.sql.hive.metastore.sharedPrefixes and spark.sql.hive.metastore.barrierPrefixes - package names of classes that are shared or separated between Spark's class loader and the Hive client class loader
These are used only once, when creating the Hive metastore client. They should be static in SQLConf so they are retrieved correctly, and we should avoid them being changed by users with the SET/RESET command.
As for spark.sql.hive.version - the fake counterpart of spark.sql.hive.metastore.version - it is used by the JDBC/Thrift client for backward compatibility.
### Why are the changes needed?
Bug fix: these configurations should not be changeable at runtime.
### Does this PR introduce any user-facing change?
Yes, the following set of configs are no longer allowed to be changed:
```
Seq("spark.sql.hive.metastore.version ",
"spark.sql.hive.metastore.jars",
"spark.sql.hive.metastore.sharedPrefixes",
"spark.sql.hive.metastore.barrierPrefixes")
```
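For illustration, attempting to change one of these entries at runtime should now fail roughly as follows (a sketch; the exact exception message is an assumption):
```scala
// Assumed behavior once these entries are static configs:
spark.sql("SET spark.sql.hive.metastore.version=2.3.7")
// org.apache.spark.sql.AnalysisException:
//   Cannot modify the value of a static config: spark.sql.hive.metastore.version
```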
### How was this patch tested?
add unit test
Closes#28302 from yaooqinn/SPARK-31522.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
**Hive 2.3.7** fixed these issues:
- HIVE-21508: ClassCastException when initializing HiveMetaStoreClient on JDK10 or newer
- HIVE-21980: Parsing time can be high in case of deeply nested subqueries
- HIVE-22249: Support Parquet through HCatalog
### Why are the changes needed?
Fix CCE during creating HiveMetaStoreClient in JDK11 environment: [SPARK-29245](https://issues.apache.org/jira/browse/SPARK-29245).
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
- [x] Test Jenkins with Hadoop 2.7 (https://github.com/apache/spark/pull/28148#issuecomment-616757840)
- [x] Test Jenkins with Hadoop 3.2 on JDK11 (https://github.com/apache/spark/pull/28148#issuecomment-616294353)
- [x] Manual test with remote hive metastore.
Hive side:
```
export JAVA_HOME=/usr/lib/jdk1.8.0_221
export PATH=$JAVA_HOME/bin:$PATH
cd /usr/lib/hive-2.3.6 # Start Hive metastore with Hive 2.3.6
bin/schematool -dbType derby -initSchema --verbose
bin/hive --service metastore
```
Spark side:
```
export JAVA_HOME=/usr/lib/jdk-11.0.3
export PATH=$JAVA_HOME/bin:$PATH
build/sbt clean package -Phive -Phadoop-3.2 -Phive-thriftserver
export SPARK_PREPEND_CLASSES=true
bin/spark-sql --conf spark.hadoop.hive.metastore.uris=thrift://localhost:9083
```
Closes#28148 from wangyum/SPARK-31381.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Make `UnsafeKVExternalSorter` / `VariableLengthRowBasedKeyValueBatch` also respect `UnsafeAlignedOffset` when reading records, and update some out-of-date comments.
### Why are the changes needed?
Since `BytesToBytesMap` respects `UnsafeAlignedOffset` when writing records, `UnsafeKVExternalSorter` should also respect `UnsafeAlignedOffset` when reading records from `BytesToBytesMap`; otherwise it can cause data correctness issues.
Unlike `UnsafeKVExternalSorter`, which may read records from `BytesToBytesMap`, `VariableLengthRowBasedKeyValueBatch` writes and reads records by itself. Thus, similar to #22053 and the [comment](https://github.com/apache/spark/pull/22053#issuecomment-411975239) there, the fix for `VariableLengthRowBasedKeyValueBatch` is more of an improvement for SPARC platform support.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Manually tested `HashAggregationQueryWithControlledFallbackSuite` with `UAO_SIZE=8` to simulate the SPARC platform. The tests only pass with this fix.
Closes#28195 from Ngone51/fix_uao.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR adds `AdaptiveTestUtils` to make AQE tests simpler. It includes:
`DisableAdaptiveExecution` - a test tag to skip a single test case if AQE is enabled.
`EnableAdaptiveExecutionSuite` - a helper trait to enable AQE for all tests except those tagged with `DisableAdaptiveExecution`.
`DisableAdaptiveExecutionSuite` - a helper trait to disable AQE for all tests.
`assertExceptionMessage` - a method to handle message of normal or AQE exception in a consistent way.
`assertExceptionCause` - a method to handle cause of normal or AQE exception in a consistent way.
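For illustration, a suite might use these helpers roughly as follows (a sketch with assumed package locations; the actual utility API may differ in details):
```scala
import org.apache.spark.sql.{QueryTest, Row}
import org.apache.spark.sql.execution.adaptive.{DisableAdaptiveExecution, EnableAdaptiveExecutionSuite}
import org.apache.spark.sql.test.SharedSparkSession

// Runs every test with AQE enabled, except the ones tagged with DisableAdaptiveExecution.
class MyQuerySuite extends QueryTest with SharedSparkSession with EnableAdaptiveExecutionSuite {

  test("runs with AQE enabled") {
    checkAnswer(sql("SELECT 1"), Row(1))
  }

  // Skipped in the AQE-enabled suite because it asserts on the exact plan shape.
  test("needs a non-adaptive plan", DisableAdaptiveExecution("asserts on exact plan")) {
    assert(sql("SELECT 1").queryExecution.executedPlan.toString.nonEmpty)
  }
}
```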
### Why are the changes needed?
With these utils, we can:
- reduce a lot of duplicated code;
- handle normal or AQE exception in a consistent way;
- improve the stability of AQE tests;
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Updated tests with the util.
Closes#28162 from Ngone51/add_aqe_test_utils.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In `TestHiveQueryExecution`, if we detect a database in the referenced table, we should create the table under that database.
### Why are the changes needed?
This fixes the `hive/SQLQuerySuite` test `derived from Hive query file: drop_database_removes_partition_dirs.q`, which currently only passes when run with the whole test suite but fails when run separately.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Ran the test separately and together with the whole test suite.
Closes#28177 from Ngone51/fix_derived.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Previously, users could issue `SHOW TABLES` to get info on both tables and views.
This PR (SPARK-31113) implements a `SHOW VIEWS` SQL command, similar to Hive's (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowViews), to get views only.
**Hive** -- Only show view names
```
hive> SHOW VIEWS;
OK
view_1
view_2
...
```
**Spark(Hive-Compatible)** -- Only show view names, used in tests and `SparkSQLDriver` for CLI applications
```
SHOW VIEWS IN showdb;
view_1
view_2
...
```
**Spark** -- Show more information database/viewName/isTemporary
```
spark-sql> SHOW VIEWS;
userdb view_1 false
userdb view_2 false
...
```
### Why are the changes needed?
`SHOW VIEWS` command provides better granularity to only get information of views.
### Does this PR introduce any user-facing change?
Add new `SHOW VIEWS` SQL command
### How was this patch tested?
Add new test `show-views.sql` and pass existing tests
Closes#27897 from Eric5553/ShowViews.
Authored-by: Eric Wu <492960551@qq.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Currently the `SHOW CREATE TABLE` command doesn't support views, but `SHOW CREATE TABLE AS SERDE` does. Since the view syntax is essentially the same between Hive DDL and Spark DDL, we should be able to support views in both commands.
This is Hive syntax for creating views:
```
CREATE VIEW [IF NOT EXISTS] [db_name.]view_name [(column_name [COMMENT column_comment], ...) ]
[COMMENT view_comment]
[TBLPROPERTIES (property_name = property_value, ...)]
AS SELECT ...;
```
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateView
This is Spark syntax for creating views:
```
CREATE [OR REPLACE] [[GLOBAL] TEMPORARY] VIEW [IF NOT EXISTS] [db_name.]view_name
create_view_clauses
AS query;
```
https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-ddl-create-view.html
Looks like it is the same. We could support views in both commands.
This patch proposes to add views support to `SHOW CREATE TABLE`.
### Why are the changes needed?
To extend the view support of `SHOW CREATE TABLE`, so users can use `SHOW CREATE TABLE` to show Spark DDL for views.
### Does this PR introduce any user-facing change?
Yes. `SHOW CREATE TABLE` can be used to show Spark DDL for views.
### How was this patch tested?
Unit tests.
Closes#27984 from viirya/spark-view.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This patch proposes to cache the Class instance for the UDF instance in `HiveFunctionWrapper`, to fix the case where a Hive simple UDF is somehow transformed (the expression is copied) and evaluated later with another classloader (for the case where the current thread context classloader is somehow changed). In this case, Spark currently throws a `ClassNotFoundException`.
This only occurs for Hive simple UDFs, as `HiveFunctionWrapper` caches the UDF instance whereas it doesn't do so for the `UDF` type. The comment says Spark has to create an instance every time for `UDF`, so we cannot simply do the same. This patch caches the Class instance instead, and switches the current thread context classloader to the one which loaded the Class instance.
This patch extends the test coverage as well. We only tested with `GenericUDTF` for SPARK-26560, and this patch actually only requires the `UDF` case. But to avoid regressions for other types as well, this patch adds all available types (UDF, GenericUDF, AbstractGenericUDAFResolver, UDAF, GenericUDTF) into the test coverage.
Credit to cloud-fan as he discovered the problem and proposed the solution.
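A minimal, self-contained sketch of the caching approach described above (names are illustrative, not the actual `HiveFunctionWrapper` code):
```scala
// Cache the UDF's Class once, and make sure the context classloader is the one that
// loaded it before instantiating the UDF again (e.g. after the expression was copied).
class FunctionWrapperSketch(functionClassName: String) {
  @transient private var udfClass: Class[_] = _

  def createFunction[T](): T = {
    if (udfClass == null) {
      udfClass = Thread.currentThread().getContextClassLoader.loadClass(functionClassName)
    }
    // Switch to the classloader that loaded the cached Class, so later evaluations on a
    // thread with a different context classloader don't hit ClassNotFoundException.
    Thread.currentThread().setContextClassLoader(udfClass.getClassLoader)
    udfClass.getConstructor().newInstance().asInstanceOf[T]
  }
}
```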
### Why are the changes needed?
Above section describes why it's a bug and how it's fixed.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
New UTs added.
Closes#28079 from HeartSaVioR/SPARK-31312.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Based on the discussion in the mailing list [[Proposal] Modification to Spark's Semantic Versioning Policy](http://apache-spark-developers-list.1001551.n3.nabble.com/Proposal-Modification-to-Spark-s-Semantic-Versioning-Policy-td28938.html) , this PR is to add back the following APIs whose maintenance cost are relatively small.
- HiveContext
- createExternalTable APIs
### Why are the changes needed?
Avoid breaking the APIs that are commonly used.
### Does this PR introduce any user-facing change?
Adding back the APIs that were removed in the 3.0 branch does not introduce user-facing changes, because Spark 3.0 has not been released yet.
### How was this patch tested?
add a new test suite for createExternalTable APIs.
Closes#27815 from gatorsmile/addAPIsBack.
Lead-authored-by: gatorsmile <gatorsmile@gmail.com>
Co-authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
### What changes were proposed in this pull request?
1. `DataSourceStrategy.scala` is extended to create `org.apache.spark.sql.sources.Filter` from nested expressions.
2. Translation from nested `org.apache.spark.sql.sources.Filter` to `org.apache.parquet.filter2.predicate.FilterPredicate` is implemented to support nested predicate pushdown for Parquet.
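For example, a predicate on a nested field can now be pushed into the Parquet reader (a sketch with hypothetical paths and case classes; the dotted-name filter representation is an assumption):
```scala
import spark.implicits._

case class Address(city: String, zip: Int)
case class Person(name: String, address: Address)

Seq(Person("Ann", Address("SF", 94105))).toDF()
  .write.mode("overwrite").parquet("/tmp/people")

// The predicate on address.city is translated into a data source filter on the nested
// field (e.g. EqualTo("address.city", "SF")) and pushed to Parquet instead of being
// evaluated only after the scan.
spark.read.parquet("/tmp/people")
  .filter($"address.city" === "SF")
  .show()
```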
### Why are the changes needed?
Better performance for handling nested predicate pushdown.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
New tests are added.
Closes#27728 from dbtsai/SPARK-17636.
Authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the Spark CLI, we create a Hive `CliSessionState` and it does not load the `hive-site.xml`. So the configurations in `hive-site.xml` will not take effect as they do in other Spark-Hive integration apps.
Also, the warehouse directory is not correctly picked up. If the `default` database does not exist, the `CliSessionState` will create one the first time it talks to the metastore. The `Location` of the default DB will be neither the value of `spark.sql.warehouse.dir` nor the user-specified value of `hive.metastore.warehouse.dir`, but the default value of `hive.metastore.warehouse.dir`, which is always `/user/hive/warehouse`.
This PR fixes the CliSuite failure with the hive-1.2 profile in https://github.com/apache/spark/pull/27933.
In https://github.com/apache/spark/pull/27933, we fixed the issue in the JIRA by deciding the warehouse dir using all properties from the Spark conf and Hadoop conf, but properties from `--hiveconf` are not included; they are applied to the `CliSessionState` instance after it is initialized. When this command-line option key is `hive.metastore.warehouse.dir`, the actual warehouse dir is overridden. Because the logic in Hive for creating the non-existent default database changed, that test passed with Hive `2.3.6` but failed with `1.2`. So in this PR, Hadoop/Hive configurations are ordered by:
`spark.hive.xxx > spark.hadoop.xxx > --hiveconf xxx > hive-site.xml`, applied through `SharedState.loadHiveConfFile` before the SessionState starts.
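A hypothetical helper (not the actual `SharedState` code) illustrating that precedence order:
```scala
// Resolve the effective value of a Hadoop/Hive property for the CLI, highest priority first.
def effectiveValue(
    key: String,
    sparkHiveConf: Map[String, String],   // spark.hive.xxx
    sparkHadoopConf: Map[String, String], // spark.hadoop.xxx
    hiveconfArgs: Map[String, String],    // --hiveconf xxx
    hiveSiteXml: Map[String, String]      // hive-site.xml
  ): Option[String] =
  sparkHiveConf.get(key)
    .orElse(sparkHadoopConf.get(key))
    .orElse(hiveconfArgs.get(key))
    .orElse(hiveSiteXml.get(key))
```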
### Why are the changes needed?
Bug fix for the Spark SQL CLI to pick up the right configurations.
### Does this PR introduce any user-facing change?
yes,
1. the non-existent default database will be created in the location specified by the user via `spark.sql.warehouse.dir` or `hive.metastore.warehouse.dir`, or in the default value of `spark.sql.warehouse.dir` if neither is specified.
2. configurations from `hive-site.xml` will not override command-line options or the properties defined with the `spark.hadoop.` (or `spark.hive.`) prefix in the Spark conf.
### How was this patch tested?
add cli ut
Closes#27969 from yaooqinn/SPARK-31170-2.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR (SPARK-31238) aims the followings.
1. Modified the ORC vectorized reader, in particular, OrcColumnVector v1.2 and v2.3. After the changes, it uses `DateTimeUtils.rebaseJulianToGregorianDays()` added by https://github.com/apache/spark/pull/27915. The method performs rebasing of days from the hybrid calendar (Julian + Gregorian) to the Proleptic Gregorian calendar. It builds a local date in the original calendar, extracts the date fields `year`, `month` and `day` from the local date, and builds another local date in the target calendar. After that, it calculates days from the epoch `1970-01-01` for the resulting local date.
2. Introduced rebasing of dates while saving ORC files: in particular, I modified `OrcShimUtils.getDateWritable` v1.2 and v2.3 to return `DaysWritable` instead of Hive's `DateWritable`. The `DaysWritable` class was added by the PR https://github.com/apache/spark/pull/27890 (and fixed by https://github.com/apache/spark/pull/27962). I moved `DaysWritable` from `sql/hive` to `sql/core` to re-use it in the ORC datasource.
### Why are the changes needed?
For backward compatibility with Spark 2.4 and earlier versions. The changes allow users to read dates/timestamps saved by previous versions and get the same results.
### Does this PR introduce any user-facing change?
Yes. Before the changes, loading the date `1200-01-01` saved by Spark 2.4.5 returns the following:
```scala
scala> spark.read.orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc").show(false)
+----------+
|dt |
+----------+
|1200-01-08|
+----------+
```
After the changes
```scala
scala> spark.read.orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc").show(false)
+----------+
|dt |
+----------+
|1200-01-01|
+----------+
```
### How was this patch tested?
- By running `OrcSourceSuite` and `HiveOrcSourceSuite`.
- Add new test `SPARK-31238: compatibility with Spark 2.4 in reading dates` to `OrcSuite` which reads an ORC file saved by Spark 2.4.5 via the commands:
```shell
$ export TZ="America/Los_Angeles"
```
```scala
scala> sql("select cast('1200-01-01' as date) dt").write.mode("overwrite").orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc")
scala> spark.read.orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc").show(false)
+----------+
|dt |
+----------+
|1200-01-01|
+----------+
```
- Add round trip test `SPARK-31238: rebasing dates in write`. The test `SPARK-31238: compatibility with Spark 2.4 in reading dates` confirms rebasing in read. So, we can check rebasing in write.
Closes#28016 from MaxGekk/rebase-date-orc.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Spark introduced the CHAR type for Hive compatibility, but it only works for Hive tables. The CHAR type was never documented and is treated as the STRING type for non-Hive tables.
However, this leads to confusing behaviors:
**Apache Spark 3.0.0-preview2**
```
spark-sql> CREATE TABLE t(a CHAR(3));
spark-sql> INSERT INTO TABLE t SELECT 'a ';
spark-sql> SELECT a, length(a) FROM t;
a 2
```
**Apache Spark 2.4.5**
```
spark-sql> CREATE TABLE t(a CHAR(3));
spark-sql> INSERT INTO TABLE t SELECT 'a ';
spark-sql> SELECT a, length(a) FROM t;
a 3
```
According to the SQL standard, `CHAR(3)` should guarantee all the values are of length 3. Since `CHAR(3)` is treated as STRING, Spark doesn't guarantee it.
This PR forbids the CHAR type in non-Hive tables as it's not supported correctly.
### Why are the changes needed?
avoid confusing/wrong behavior
### Does this PR introduce any user-facing change?
yes, now users can't create/alter non-Hive tables with CHAR type.
### How was this patch tested?
new tests
Closes#27902 from cloud-fan/char.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
In the PR, I propose to apply rebasing for all dates/timestamps in conversion functions `fromJavaDate()`, `toJavaDate()`, `toJavaTimestamp()` and `fromJavaTimestamp()`. The rebasing is performed via building a local date-time in an original calendar, extracting date-time fields from the result, and creating new local date-time in the target calendar.
### Why are the changes needed?
The changes are needed for compatibility with previous Spark versions (2.4.5 and earlier), not only for dates before the Gregorian cutover date `1582-10-15` but also for dates after it. For instance, the Gregorian calendar implementation in Java 7 (`java.util.GregorianCalendar`) is not as accurate in resolving time zone offsets as the Gregorian calendar introduced in Java 8.
### Does this PR introduce any user-facing change?
Yes, this PR can introduce behavior changes for dates after `1582-10-15`, in particular conversions of zone ids to zone offsets will be much more accurate.
### How was this patch tested?
By existing test suites `DateTimeUtilsSuite`, `DateFunctionsSuite`, `DateExpressionsSuite`, `CollectionExpressionsSuite`, `HiveOrcHadoopFsRelationSuite`, `ParquetIOSuite`.
Closes#27980 from MaxGekk/reuse-rebase-funcs-in-java-funcs.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
The cached RDD for the plan "select 1" stays in memory forever until the session closes. This cached data cannot be used since the view temp1 has been replaced by another plan. It's a memory leak.
We can reproduce by below commands:
```
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.0.0-SNAPSHOT
/_/
Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_201)
Type in expressions to have them evaluated.
Type :help for more information.
scala> spark.sql("create or replace temporary view temp1 as select 1")
scala> spark.sql("cache table temp1")
scala> spark.sql("create or replace temporary view temp1 as select 1, 2")
scala> spark.sql("cache table temp1")
scala> assert(spark.sharedState.cacheManager.lookupCachedData(sql("select 1, 2")).isDefined)
scala> assert(spark.sharedState.cacheManager.lookupCachedData(sql("select 1")).isDefined)
```
### Why are the changes needed?
Fix the memory leak, especially for long-running mode.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Add a unit test.
Closes#27185 from LantaoJin/SPARK-30494.
Authored-by: LantaoJin <jinlantao@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Hive 2.3+ supports the `getTablesByType` API, which provides an efficient way to get Hive tables of a specific type. Now, we have the following mappings when using `HiveExternalCatalog`:
```
CatalogTableType.EXTERNAL => HiveTableType.EXTERNAL_TABLE
CatalogTableType.MANAGED => HiveTableType.MANAGED_TABLE
CatalogTableType.VIEW => HiveTableType.VIRTUAL_VIEW
```
Without this API, we need to achieve the goal via `getTables` + `getTablesByName` + filtering by type.
This PR adds `getTablesByType` to `HiveShim`. For Hive versions that don't support this API, an `UnsupportedOperationException` will be thrown, and the upper-level logic should catch the exception and fall back to the filter solution mentioned above.
Since the JDK11-related fix in Hive is not released yet, manual tests against Hive 2.3.7-SNAPSHOT were done by following the instructions of SPARK-29245.
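A minimal sketch of the fallback logic described above (assumed interface and method names, not the actual `HiveShim`/`HiveClient` API):
```scala
trait ViewListing {
  def getTablesByType(db: String, pattern: String, tableType: String): Seq[String]
  def getTables(db: String, pattern: String): Seq[String]
  def isView(db: String, table: String): Boolean
}

def listViews(client: ViewListing, db: String, pattern: String): Seq[String] =
  try {
    client.getTablesByType(db, pattern, "VIRTUAL_VIEW")
  } catch {
    case _: UnsupportedOperationException =>
      // Hive versions without getTablesByType: list everything and filter by type.
      client.getTables(db, pattern).filter(t => client.isView(db, t))
  }
```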
### Why are the changes needed?
This API will provide better usability and performance when we want to get a list of Hive tables of a specific type, for example `HiveTableType.VIRTUAL_VIEW` corresponding to `CatalogTableType.VIEW`.
### Does this PR introduce any user-facing change?
No, this is a support function.
### How was this patch tested?
Added tests in `VersionsSuite` and manually ran the JDK11 test with the following settings:
- Hive 2.3.6 Metastore on JDK8
- Hive 2.3.7-SNAPSHOT library build from source of Hive 2.3 branch
- Spark build with Hive 2.3.7-SNAPSHOT on jdk-11.0.6
Closes#27952 from Eric5553/GetTableByType.
Authored-by: Eric Wu <492960551@qq.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
In the PR, I propose to correct and re-use functions from `DateTimeUtils` for rebasing days before the cutover day `1582-10-15` in `org.apache.spark.sql.hive.DaysWritable`.
### Why are the changes needed?
0. Existing rebasing of days in `DaysWritable` is not correct.
1. To deduplicate code in `DaysWritable`
2. To use functions that are better tested and cross checked by loading dates/timestamps from Parquet/Avro files written by Spark 2.4.5
### Does this PR introduce any user-facing change?
This PR can introduce behavior change because the replaced code is different from the re-used code from `DateTimeUtils`.
### How was this patch tested?
By existing test suite, for instance `HiveOrcHadoopFsRelationSuite`.
Closes#27962 from MaxGekk/reuse-rebase-funcs.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
A few `CREATE TABLE` test cases have some assumption on the default value of `LEGACY_CREATE_HIVE_TABLE_BY_DEFAULT_ENABLED`. This PR (SPARK-31181) makes the test cases more explicit from test-case side.
The configuration change was tested via https://github.com/apache/spark/pull/27894 during discussing SPARK-31136. This PR has only the test case part from that PR.
### Why are the changes needed?
This makes our test cases more robust with respect to the default value of `LEGACY_CREATE_HIVE_TABLE_BY_DEFAULT_ENABLED`. Even in the case where we switch the conf value, that will be a one-liner with no test case changes.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Pass the Jenkins with the existing tests.
Closes#27946 from dongjoon-hyun/SPARK-EXPLICIT-TEST.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the Spark CLI, we create a Hive `CliSessionState` and it does not load the `hive-site.xml`. So the configurations in `hive-site.xml` will not take effect as they do in other Spark-Hive integration apps.
Also, the warehouse directory is not correctly picked up. If the `default` database does not exist, the `CliSessionState` will create one the first time it talks to the metastore. The `Location` of the default DB will be neither the value of `spark.sql.warehouse.dir` nor the user-specified value of `hive.metastore.warehouse.dir`, but the default value of `hive.metastore.warehouse.dir`, which is always `/user/hive/warehouse`.
### Why are the changes needed?
Fix a bug so that the Spark SQL CLI picks up the right configurations.
### Does this PR introduce any user-facing change?
Yes, the non-existent default database will be created in the location specified by the user via `spark.sql.warehouse.dir` or `hive.metastore.warehouse.dir`, or in the default value of `spark.sql.warehouse.dir` if neither is specified.
### How was this patch tested?
add cli ut
Closes#27933 from yaooqinn/SPARK-31170.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Move the code related to days rebasing from/to Julian calendar from `HiveInspectors` to new class `DaysWritable`.
### Why are the changes needed?
To improve maintainability of the `HiveInspectors` trait which is already pretty complex.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By `HiveOrcHadoopFsRelationSuite`.
Closes#27890 from MaxGekk/replace-DateWritable-by-DaysWritable.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
As a common usage, and according to the Spark docs, users often just copy their `hive-site.xml` to Spark directly from Hive projects. Sometimes, the config file is not that clean for Spark and may cause some side effects.
For example, `hive.session.history.enabled` will create a log file for the Hive jobs, which is useless for Spark and will not be deleted on JVM exit.
This PR:
1) disables `hive.session.history.enabled` explicitly to avoid creating the `hive_job_log` file, e.g.
```
Hive history file=/var/folders/01/h81cs4sn3dq2dd_k4j6fhrmc0000gn/T//kentyao/hive_job_log_79c63b29-95a4-4935-a9eb-2d89844dfe4f_493861201.txt
```
2) sets `hive.execution.engine` to `spark` explicitly in case the config is `tez`, which causes unnecessary problems like this:
```
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/tez/dag/api/SessionNotRunning
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:529)
```
### Why are the changes needed?
Reduce the overhead of internal complexity and users' Hive-related cognitive load when running Spark.
### Does this PR introduce any user-facing change?
Yes, the `hive_job_log` file will not be created even if the feature is enabled, and Spark will not try to initialize Tez-related classes.
### How was this patch tested?
add ut and verify manually
Closes#27827 from yaooqinn/SPARK-31066.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose to change the conversion of java.sql.Timestamp/Date values to/from internal values of Catalyst's TimestampType/DateType before the cutover day `1582-10-15` of the Gregorian calendar. I propose to construct a local date-time from microseconds/days since the epoch, take each date-time component (`year`, `month`, `day`, `hour`, `minute`, `second` and `second fraction`), and construct java.sql.Timestamp/Date using the extracted components.
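A minimal sketch of the idea for the date case (not the actual `DateTimeUtils` code; it assumes the session time zone equals the JVM default time zone):
```scala
import java.time.LocalDate
import java.util.{Calendar, TimeZone}

def toJavaDateRebased(daysSinceEpoch: Int): java.sql.Date = {
  // Interpret the day count in the Proleptic Gregorian calendar and extract its fields.
  val localDate = LocalDate.ofEpochDay(daysSinceEpoch)
  // Rebuild a date with the same local fields in the hybrid Julian + Gregorian calendar
  // that backs java.sql.Date.
  val cal = Calendar.getInstance(TimeZone.getDefault)
  cal.clear()
  cal.set(localDate.getYear, localDate.getMonthValue - 1, localDate.getDayOfMonth)
  new java.sql.Date(cal.getTimeInMillis)
}
```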
### Why are the changes needed?
This will rebase the underlying time/date offset in such a way that collected java.sql.Timestamp/Date values have the same local date-time components as the original values in the Gregorian calendar.
Here is the example which demonstrates the issue:
```scala
scala> sql("select date '1100-10-10'").collect()
res1: Array[org.apache.spark.sql.Row] = Array([1100-10-03])
```
### Does this PR introduce any user-facing change?
Yes, after the changes:
```scala
scala> sql("select date '1100-10-10'").collect()
res0: Array[org.apache.spark.sql.Row] = Array([1100-10-10])
```
### How was this patch tested?
By running `DateTimeUtilsSuite`, `DateFunctionsSuite` and `DateExpressionsSuite`.
Closes#27807 from MaxGekk/rebase-timestamp-before-1582.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR adds functionality to HiveExternalCatalog to be able to change the provider of a table.
This is useful for catalogs in Spark 3.0 to be able to use alterTable to change the provider of a table as part of an atomic REPLACE TABLE function.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Unit tests
Closes#27822 from brkyvz/externalCat.
Authored-by: Burak Yavuz <brkyvz@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Currently, the user cannot specify the session catalog name (`spark_catalog`) in qualified column names for v1 tables:
```
SELECT spark_catalog.default.t.i FROM spark_catalog.default.t
```
fails with `cannot resolve 'spark_catalog.default.t.i`.
This is inconsistent with v2 table behavior where catalog name can be used:
```
SELECT testcat.ns1.tbl.id FROM testcat.ns1.tbl
```
This PR proposes to fix the inconsistency and allow the user to specify session catalog name in column names for v1 tables.
### Why are the changes needed?
Fixing an inconsistent behavior.
### Does this PR introduce any user-facing change?
Yes, now the following query works:
```
SELECT spark_catalog.default.t.i FROM spark_catalog.default.t
```
### How was this patch tested?
Added new tests.
Closes#27776 from imback82/spark_catalog_col_name_resolution.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
This reverts commit afaeb29599.
### What changes were proposed in this pull request?
Based on the result and comment from https://github.com/apache/spark/pull/27552#discussion_r385531744
In the hive module, the server side provides datetime values simply using `value.toString`, and the client side regenerates the results in `HiveBaseResultSet` with `java.sql.Date(Timestamp).valueOf`.
There will be an inconsistency between client and server if we use Java 8 APIs.
### Why are the changes needed?
The change is still not clear enough.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
N/A
Closes#27733 from yaooqinn/SPARK-30808.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This patch fixes several incorrect uses of `assume()` in our tests.
If a call to `assume(condition)` fails then it will cause the test to be marked as skipped instead of failed: this feature allows test cases to be skipped if certain prerequisites are missing. For example, we use this to skip certain tests when running on Windows (or when Python dependencies are unavailable).
In contrast, `assert(condition)` will fail the test if the condition doesn't hold.
If `assume()` is accidentally substituted for `assert()` then the resulting test will be marked as skipped in cases where it should have failed, undermining the purpose of the test.
This patch fixes several such cases, replacing certain `assume()` calls with `assert()`.
Credit to ahirreddy for spotting this problem.
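A minimal ScalaTest sketch of the difference (a hypothetical test, not one of the fixed cases; the import path may vary by ScalaTest version):
```scala
import org.scalatest.funsuite.AnyFunSuite

class ExampleSuite extends AnyFunSuite {
  test("windows-only behavior") {
    // If this prerequisite is false, the test is cancelled and reported as skipped.
    assume(sys.props("os.name").toLowerCase.contains("windows"))
    // If this condition is false, the test fails.
    assert(1 + 1 == 2)
  }
}
```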
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#27754 from JoshRosen/fix-assume-vs-assert.
Lead-authored-by: Josh Rosen <rosenville@gmail.com>
Co-authored-by: Josh Rosen <joshrosen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Make rule `PruneHiveTablePartitions` to execute as `earlyScanPushDownRules`.
### Why are the changes needed?
Similar to the rule `PruneFileSourcePartitions`, `PruneHiveTablePartitions` should also be executed as one of the `earlyScanPushDownRules` to eliminate its impact on statistics computation later.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Pass Jenkins.
Closes#27723 from Ngone51/early_hive_prune.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR intends to remove the unused trait `GivenWhenThen` from `HiveComparisonTest`.
### Why are the changes needed?
For better code.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing tests.
Closes#27726 from maropu/MINOR-20200228.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
### What changes were proposed in this pull request?
This PR groups all Hive-upgrade-related migration guides for Spark 3.0 together.
It also adds another behavior change of `ScriptTransform` to the new Hive section.
### Why are the changes needed?
Make the doc clearer to users.
### Does this PR introduce any user-facing change?
No, new doc for Spark 3.0.
### How was this patch tested?
N/A.
Closes#27670 from Ngone51/hive_migration.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
After https://github.com/apache/spark/pull/27659 (see https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7-hive-2.3/253/), the test below fails consistently, specifically in the Jenkins job https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7-hive-2.3/
```
org.apache.spark.sql.hive.execution.HiveSerDeSuite.Test the default fileformat for Hive-serde tables
```
The profile is the same as the PR builder's, but it seems to fail specifically on this machine.
Several configurations used in `HiveSerDeSuite` are not being set, presumably due to the inconsistency between `SQLConf.get` and the active Spark session described in https://github.com/apache/spark/pull/27387, and as a side effect of the cloned session in https://github.com/apache/spark/pull/27659.
This PR proposes to explicitly set the configuration against `TestHive` by using `withExistingConf` in `withSQLConf`.
### Why are the changes needed?
To make `spark-master-test-sbt-hadoop-2.7-hive-2.3` job pass.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Cannot reproduce locally. Presumably it cannot be reproduced in the PR builder either. We should see if the tests pass in the `spark-master-test-sbt-hadoop-2.7-hive-2.3` job after this PR is merged.
Closes#27705 from HyukjinKwon/SPARK-30906.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
### What changes were proposed in this pull request?
After https://github.com/apache/spark/pull/27387 (see https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7-hive-2.3/202/), the tests below fail consistently, specifically in the Jenkins job https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7-hive-2.3/
```
org.apache.spark.sql.hive.HiveShowCreateTableSuite.simple hive table
org.apache.spark.sql.hive.HiveShowCreateTableSuite.simple external hive table
org.apache.spark.sql.hive.HiveShowCreateTableSuite.hive bucketing is supported
```
The profile is the same as the PR builder's, but it seems to fail specifically on this machine. It seems the legacy configuration `spark.sql.legacy.createHiveTableByDefault.enabled` is not being set due to the inconsistency between `SQLConf.get` and the active Spark session as described in https://github.com/apache/spark/pull/27387.
This PR proposes to explicitly set the configuration against the session used instead of `SQLConf.get`.
### Why are the changes needed?
To make `spark-master-test-sbt-hadoop-2.7-hive-2.3` job pass.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Cannot reproduce locally. Presumably it cannot be reproduced in the PR builder either. We should see if the tests pass in the `spark-master-test-sbt-hadoop-2.7-hive-2.3` job after this PR is merged.
Closes#27703 from HyukjinKwon/SPARK-30798-followup.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This patch is to bump the master branch version to 3.1.0-SNAPSHOT.
### Why are the changes needed?
N/A
### Does this PR introduce any user-facing change?
N/A
### How was this patch tested?
N/A
Closes#27698 from gatorsmile/updateVersion.
Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
For the following:
```
CREATE TABLE t USING json AS SELECT 1 AS i
SELECT * FROM spark_catalog.t
```
`spark_catalog.t` is resolved to `spark_catalog.default.t` assuming the current namespace is `default`. However, this is not consistent with V2 behavior where the namespace must be specified if the catalog name is provided. This PR proposes to fix this inconsistency.
### Why are the changes needed?
To be consistent with V2 table naming scheme in SQL commands.
### Does this PR introduce any user-facing change?
Yes, now the user has to specify the namespace if the catalog name is provided. For example,
```
SELECT * FROM spark_catalog.t # Will throw AnalysisException with 'Session catalog cannot have an empty namespace: spark_catalog.t'
SELECT * FROM spark_catalog.default.t # OK
```
### How was this patch tested?
Added new tests
Closes#27642 from imback82/disallow_spark_catalog_wihtout_db.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### Why are the changes needed?
At present, `HiveClientImpl.runHive` does not throw an exception when it runs incorrectly, which causes it to fail to report error information properly.
Example:
```scala
spark.sql("add jar file:///tmp/not_exists.jar")
spark.sql("show databases").show()
```
`/tmp/not_exists.jar` doesn't exist, thus `add jar` fails. However, this code runs to completion without causing an application failure.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
add new suite tests
Closes#27644 from stczwd/SPARK-30868.
Authored-by: lijunqing <lijunqing@baidu.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add a new `CommandCheck` rule that fails fast when duplicate columns are detected in `AnalyzeColumnCommand`.
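A hypothetical repro of the new behavior (the exact error message is an assumption):
```scala
spark.sql("CREATE TABLE t(key INT, value STRING) USING parquet")
// Before: statistics for `key` would be computed twice.
// After: fails fast with an AnalysisException about the duplicate column `key`.
spark.sql("ANALYZE TABLE t COMPUTE STATISTICS FOR COLUMNS key, key")
```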
### Why are the changes needed?
To avoid duplicate statistics computation for the same column in `AnalyzeColumnCommand`.
### Does this PR introduce any user-facing change?
Yes. Users now get an exception when they input duplicate columns.
### How was this patch tested?
Added new test.
Closes#27651 from Ngone51/fail_on_dup_cols.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
A table generated by `CREATE TABLE LIKE` from a partitioned table is also a partitioned table. But when running `ALTER TABLE ADD PARTITION` on it, it throws `AnalysisException: ALTER TABLE ADD PARTITION is not allowed`. That's because the default value of `tracksPartitionsInCatalog` from `CREATE TABLE LIKE` is always false.
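A hypothetical reproduction of the scenario described above:
```scala
spark.sql("CREATE TABLE src(id INT, p INT) USING parquet PARTITIONED BY (p)")
spark.sql("CREATE TABLE dst LIKE src")
// Failed before the fix with: AnalysisException: ALTER TABLE ADD PARTITION is not allowed
spark.sql("ALTER TABLE dst ADD PARTITION (p = 1)")
```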
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Add a unit test.
Closes#27538 from LantaoJin/SPARK-30785.
Authored-by: LantaoJin <jinlantao@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
- Set `spark.sql.datetime.java8API.enabled` to `true` in `hiveResultString()`, and restore it back at the end of the call.
- Convert collected `java.time.Instant` & `java.time.LocalDate` to `java.sql.Timestamp` and `java.sql.Date` for correct formatting.
### Why are the changes needed?
Because the textual representation of timestamps/dates before the year 1582 is incorrect:
```shell
$ export TZ="America/Los_Angeles"
$ ./bin/spark-sql -S
```
```sql
spark-sql> set spark.sql.session.timeZone=America/Los_Angeles;
spark.sql.session.timeZone America/Los_Angeles
spark-sql> SELECT DATE_TRUNC('MILLENNIUM', DATE '1970-03-20');
1001-01-01 00:07:02
```
It must be 1001-01-01 00:**00:00**.
### Does this PR introduce any user-facing change?
Yes. After the changes:
```shell
$ export TZ="America/Los_Angeles"
$ ./bin/spark-sql -S
```
```sql
spark-sql> set spark.sql.session.timeZone=America/Los_Angeles;
spark.sql.session.timeZone America/Los_Angeles
spark-sql> SELECT DATE_TRUNC('MILLENNIUM', DATE '1970-03-20');
1001-01-01 00:00:00
```
### How was this patch tested?
By running the hive-thriftserver tests. In particular:
```
./build/sbt -Phadoop-2.7 -Phive-2.3 -Phive-thriftserver "hive-thriftserver/test:testOnly *SparkThriftServerProtocolVersionsSuite"
```
Closes#27552 from MaxGekk/hive-thriftserver-java8-time-api.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Let the sub-optimizer's `postHocOptimizationBatches` also include the super class's `postHocOptimizationBatches`.
### Why are the changes needed?
It's necessary according to the design of the Catalyst optimizer.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Pass jenkins.
Closes#27607 from Ngone51/spark_15616_followup.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
[HIVE-15167](https://issues.apache.org/jira/browse/HIVE-15167) removed the `SerDe` interface. This may break custom `SerDe` builds for Hive 1.2. This PR updates the migration guide for this change.
### Why are the changes needed?
Otherwise:
```
2020-01-27 05:11:20.446 - stderr> 20/01/27 05:11:20 INFO DAGScheduler: ResultStage 2 (main at NativeMethodAccessorImpl.java:0) failed in 1.000 s due to Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 13, 10.110.21.210, executor 1): java.lang.NoClassDefFoundError: org/apache/hadoop/hive/serde2/SerDe
2020-01-27 05:11:20.446 - stderr> at java.lang.ClassLoader.defineClass1(Native Method)
2020-01-27 05:11:20.446 - stderr> at java.lang.ClassLoader.defineClass(ClassLoader.java:756)
2020-01-27 05:11:20.446 - stderr> at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
2020-01-27 05:11:20.446 - stderr> at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
2020-01-27 05:11:20.446 - stderr> at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
2020-01-27 05:11:20.446 - stderr> at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
2020-01-27 05:11:20.446 - stderr> at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
2020-01-27 05:11:20.446 - stderr> at java.security.AccessController.doPrivileged(Native Method)
2020-01-27 05:11:20.446 - stderr> at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
2020-01-27 05:11:20.446 - stderr> at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
2020-01-27 05:11:20.446 - stderr> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
2020-01-27 05:11:20.446 - stderr> at java.lang.ClassLoader.loadClass(ClassLoader.java:405)
2020-01-27 05:11:20.446 - stderr> at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
2020-01-27 05:11:20.446 - stderr> at java.lang.Class.forName0(Native Method)
2020-01-27 05:11:20.446 - stderr> at java.lang.Class.forName(Class.java:348)
2020-01-27 05:11:20.446 - stderr> at org.apache.hadoop.hive.ql.plan.TableDesc.getDeserializerClass(TableDesc.java:76)
.....
```
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Manual test
Closes#27492 from wangyum/SPARK-30755.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
No v2 command supports temp views and the `ResolveCatalogs`/`ResolveSessionCatalog` framework is designed with this assumption.
However, `ResolveSessionCatalog` needs to fall back to v1 commands, which do support temp views (e.g. CACHE TABLE). To work around this, we added a hack in `CatalogAndIdentifier`, which does not expand the given identifier with the current namespace if the catalog is the session catalog.
This works fine in most cases, as temp views should take precedence over tables during lookup. So if `CatalogAndIdentifier` returns a single name "t", the v1 commands can still resolve it to temp views correctly, or resolve it to the table "default.t" if the temp view doesn't exist.
However, if users write `spark_catalog.t`, it shouldn't be resolved to temp views as temp views don't belong to any catalog. `CatalogAndIdentifier` can't distinguish between `spark_catalog.t` and `t`, so the caller side may mistakenly resolve `spark_catalog.t` to a temp view.
This PR proposes to fix this issue by
1. remove the hack in `CatalogAndIdentifier`, and clearly document that this shouldn't be used to resolve temp views.
2. update `ResolveSessionCatalog` to explicitly look up temp views first before calling `CatalogAndIdentifier`, for v1 commands that support temp views.
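A minimal, self-contained sketch of that lookup order (assumed helper names, not the actual analyzer code):
```scala
def resolveV1Name(nameParts: Seq[String], tempViews: Set[String]): String =
  nameParts match {
    case Seq(name) if tempViews.contains(name) =>
      s"temp view `$name`" // temp views win for single-part names
    case "spark_catalog" +: rest =>
      s"table `${rest.mkString(".")}` in the session catalog" // never a temp view
    case other =>
      s"table `${other.mkString(".")}`" // expanded with the current namespace
  }
```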
### Why are the changes needed?
To avoid releasing a behavior that we should not support.
Removing the hack also fixes the problem we hit in https://github.com/apache/spark/pull/27532/files#diff-57b3d87be744b7d79a9beacf8e5e5eb2R937
### Does this PR introduce any user-facing change?
yes, now it's not allowed to refer to a temp view with `spark_catalog` prefix.
### How was this patch tested?
new tests
Closes#27550 from cloud-fan/ns.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
We should convert Spark InternalRows to hive data via `HiveInspectors.wrapperFor`.
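Conceptually, the wrapper converts Catalyst values into the Java/Hive types that the SerDe's object inspectors expect; for example (a simplified sketch, not the actual `wrapperFor` code):
```scala
import org.apache.hadoop.hive.common.type.HiveDecimal
import org.apache.spark.sql.types.Decimal

// Catalyst's Decimal must become Hive's HiveDecimal before a Hive SerDe serializes it,
// otherwise the ClassCastException shown below is thrown.
def toHiveDecimal(d: Decimal): HiveDecimal = HiveDecimal.create(d.toJavaBigDecimal)
```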
### Why are the changes needed?
We may hit below exception without this change:
```
[info] org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, 192.168.1.6, executor driver): java.lang.ClassCastException: org.apache.spark.sql.types.Decimal cannot be cast to org.apache.hadoop.hive.common.type.HiveDecimal
[info] at org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaHiveDecimalObjectInspector.getPrimitiveJavaObject(JavaHiveDecimalObjectInspector.java:55)
[info] at org.apache.hadoop.hive.serde2.lazy.LazyUtils.writePrimitiveUTF8(LazyUtils.java:321)
[info] at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:292)
[info] at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:247)
[info] at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.doSerialize(LazySimpleSerDe.java:231)
[info] at org.apache.hadoop.hive.serde2.AbstractEncodingAwareSerDe.serialize(AbstractEncodingAwareSerDe.java:55)
[info] at org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread.$anonfun$run$2(ScriptTransformationExec.scala:300)
[info] at org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread.$anonfun$run$2$adapted(ScriptTransformationExec.scala:281)
[info] at scala.collection.Iterator.foreach(Iterator.scala:941)
[info] at scala.collection.Iterator.foreach$(Iterator.scala:941)
[info] at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
[info] at org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread.$anonfun$run$1(ScriptTransformationExec.scala:281)
[info] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info] at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1932)
[info] at org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread.run(ScriptTransformationExec.scala:270)
```
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added a new test. But please note that this test returns different results between Hive 1.2 and Hive 2.3 due to a `HiveDecimal` or `SerDe` difference (the root cause is not known yet).
Closes#27556 from Ngone51/script_transform.
Lead-authored-by: yi.wu <yi.wu@databricks.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR scopes `SparkSession.active` to prevent problems with processing queries with possibly different spark sessions (and different configs). A new method, `withActive` is introduced on `SparkSession` that restores the previous spark session after the block of code is executed.
### Why are the changes needed?
`SparkSession.active` is a thread local variable that points to the current thread's spark session. It is important to note that the `SQLConf.get` method depends on `SparkSession.active`. In the current implementation it is possible that `SparkSession.active` points to a different session which causes various problems. Most of these problems arise because part of the query processing is done using the configurations of a different session. For example, when creating a data frame using a new session, i.e., `session.sql("...")`, part of the data frame is constructed using the currently active spark session, which can be a different session from the one used later for processing the query.
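A minimal sketch of the `withActive` pattern (the method added by this PR may differ in details):
```scala
import org.apache.spark.sql.SparkSession

def withActive[T](session: SparkSession)(block: => T): T = {
  val previous = SparkSession.getActiveSession
  SparkSession.setActiveSession(session)
  try block
  finally previous match {
    case Some(s) => SparkSession.setActiveSession(s) // restore the previously active session
    case None => SparkSession.clearActiveSession()
  }
}
```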
### Does this PR introduce any user-facing change?
The `withActive` method is introduced on `SparkSession`.
### How was this patch tested?
Unit tests (to be added)
Closes#27387 from dbaliafroozeh/UseWithActiveSessionInQueryExecution.
Authored-by: Ali Afroozeh <ali.afroozeh@databricks.com>
Signed-off-by: herman <herman@databricks.com>
### What changes were proposed in this pull request?
Add class documentation for `PruneFileSourcePartitions` and `PruneHiveTablePartitions`.
### Why are the changes needed?
To describe these two classes.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
no
Closes#27535 from fuwhu/SPARK-15616-FOLLOW-UP.
Authored-by: fuwhu <bestwwg@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR fixes the issue where queries with qualified columns like `SELECT t.a FROM t` would fail to resolve for v2 tables.
This PR would allow qualified column names in query as following:
```SQL
SELECT testcat.ns1.ns2.tbl.foo FROM testcat.ns1.ns2.tbl
SELECT ns1.ns2.tbl.foo FROM testcat.ns1.ns2.tbl
SELECT ns2.tbl.foo FROM testcat.ns1.ns2.tbl
SELECT tbl.foo FROM testcat.ns1.ns2.tbl
```
### Why are the changes needed?
This is a bug because you cannot qualify column names in queries.
### Does this PR introduce any user-facing change?
Yes, now users can qualify column names for v2 tables.
### How was this patch tested?
Added new tests.
Closes#27391 from imback82/qualified_col.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
`HiveTableScanExec` does not need to prune partitions again after `SessionCatalog.listPartitionsByFilter` is called.
### Why are the changes needed?
In `HiveTableScanExec`, it will push down to the Hive metastore for partition pruning if `spark.sql.hive.metastorePartitionPruning` is true, and then it prunes the returned partitions again using the partition filters, because some predicates, e.g. `b like 'xyz'`, are not supported in the Hive metastore. But this problem is already fixed in `HiveExternalCatalog.listPartitionsByFilter`, which can now return exactly what we want. So it is no longer necessary to double-prune in `HiveTableScanExec`.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
Existing unit tests.
Closes#27232 from fuwhu/SPARK-30525.
Authored-by: fuwhu <bestwwg@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
This patch adds a DDL command `SHOW CREATE TABLE AS SERDE`. It is used to generate Hive DDL for a Hive table.
For the original `SHOW CREATE TABLE`, it now always shows Spark DDL. If given a Hive table, it tries to generate Spark DDL.
For the Hive serde to data source conversion, this uses the existing mapping inside `HiveSerDe`. If it can't find a mapping there, it throws an analysis exception on the unsupported serde configuration.
It is arguable that some Hive file format + row serde combinations might be mapped to a Spark data source, e.g., CSV. That is not included in this PR. To be conservative, it may not be supported.
For Hive serde properties, for now this doesn't save them to Spark DDL because it may not be useful to keep Hive serde properties in a Spark table.
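For illustration, the two commands are expected to behave roughly as follows (`hive_tbl` is a hypothetical Hive-serde table):
```scala
// Always emits Spark DDL, converting the Hive serde to a data source when a mapping exists.
spark.sql("SHOW CREATE TABLE hive_tbl").show(truncate = false)
// Emits Hive DDL for Hive tables.
spark.sql("SHOW CREATE TABLE hive_tbl AS SERDE").show(truncate = false)
```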
## How was this patch tested?
Added test.
Closes#24938 from viirya/SPARK-27946.
Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>