### What changes were proposed in this pull request?
This is the first step to remove `HiveClient` from `HiveSessionState`. In the metastore interaction, we always use the fully qualified table name when accessing/operating a table. That means, we always specify the database. Thus, it is not necessary to use `HiveClient` to change the active database in Hive metastore.
In `HiveSessionCatalog `, `setCurrentDatabase` is the only function that uses `HiveClient`. Thus, we can remove it after removing `setCurrentDatabase`
### How was this patch tested?
The existing test cases.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#14821 from gatorsmile/setCurrentDB.
### What changes were proposed in this pull request?
Address the comments by yhuai in the original PR: https://github.com/apache/spark/pull/14207
First, issue an exception instead of logging a warning when users specify the partitioning columns without a given schema.
Second, refactor the codes a little.
### How was this patch tested?
Fixed the test cases.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#14572 from gatorsmile/followup16552.
## What changes were proposed in this pull request?
`CreateTables` rule turns a general `CreateTable` plan to `CreateHiveTableAsSelectCommand` for hive serde table. However, this rule is logically a planner strategy, we should move it to `HiveStrategies`, to be consistent with other DDL commands.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#14825 from cloud-fan/ctas.
## What changes were proposed in this pull request?
This PR enables the tests for `TimestampType` for JSON and unifies the logics for verifying schema when writing in CSV.
In more details, this PR,
- Enables the tests for `TimestampType` for JSON and
This was disabled due to an issue in `DatatypeConverter.parseDateTime` which parses dates incorrectly, for example as below:
```scala
val d = javax.xml.bind.DatatypeConverter.parseDateTime("0900-01-01T00:00:00.000").getTime
println(d.toString)
```
```
Fri Dec 28 00:00:00 KST 899
```
However, since we use `FastDateFormat`, it seems we are safe now.
```scala
val d = FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSS").parse("0900-01-01T00:00:00.000")
println(d)
```
```
Tue Jan 01 00:00:00 PST 900
```
- Verifies all unsupported types in CSV
There is a separate logics to verify the schemas in `CSVFileFormat`. This is actually not quite correct enough because we don't support `NullType` and `CalanderIntervalType` as well `StructType`, `ArrayType`, `MapType`. So, this PR adds both types.
## How was this patch tested?
Tests in `JsonHadoopFsRelation` and `CSVSuite`
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#14829 from HyukjinKwon/SPARK-16216-followup.
This patch updates `Literal.sql` to properly generate SQL for `NaN` and `Infinity` float and double literals: these special values need to be handled differently from regular values, since simply appending a suffix to the value's `toString()` representation will not work for these values.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#14777 from JoshRosen/SPARK-17205.
### What changes were proposed in this pull request?
Since `HiveClient` is used to interact with the Hive metastore, it should be hidden in `HiveExternalCatalog`. After moving `HiveClient` into `HiveExternalCatalog`, `HiveSharedState` becomes a wrapper of `HiveExternalCatalog`. Thus, removal of `HiveSharedState` becomes straightforward. After removal of `HiveSharedState`, the reflection logic is directly applied on the choice of `ExternalCatalog` types, based on the configuration of `CATALOG_IMPLEMENTATION`.
~~`HiveClient` is also used/invoked by the other entities besides HiveExternalCatalog, we defines the following two APIs: getClient and getNewClient~~
### How was this patch tested?
The existing test cases
Author: gatorsmile <gatorsmile@gmail.com>
Closes#14757 from gatorsmile/removeHiveClient.
## What changes were proposed in this pull request?
### Default - ISO 8601
Currently, CSV datasource is writing `Timestamp` and `Date` as numeric form and JSON datasource is writing both as below:
- CSV
```
// TimestampType
1414459800000000
// DateType
16673
```
- Json
```
// TimestampType
1970-01-01 11:46:40.0
// DateType
1970-01-01
```
So, for CSV we can't read back what we write and for JSON it becomes ambiguous because the timezone is being missed.
So, this PR make both **write** `Timestamp` and `Date` in ISO 8601 formatted string (please refer the [ISO 8601 specification](https://www.w3.org/TR/NOTE-datetime)).
- For `Timestamp` it becomes as below: (`yyyy-MM-dd'T'HH:mm:ss.SSSZZ`)
```
1970-01-01T02:00:01.000-01:00
```
- For `Date` it becomes as below (`yyyy-MM-dd`)
```
1970-01-01
```
### Custom date format option - `dateFormat`
This PR also adds the support to write and read dates and timestamps in a formatted string as below:
- **DateType**
- With `dateFormat` option (e.g. `yyyy/MM/dd`)
```
+----------+
| date|
+----------+
|2015/08/26|
|2014/10/27|
|2016/01/28|
+----------+
```
### Custom date format option - `timestampFormat`
- **TimestampType**
- With `dateFormat` option (e.g. `dd/MM/yyyy HH:mm`)
```
+----------------+
| date|
+----------------+
|2015/08/26 18:00|
|2014/10/27 18:30|
|2016/01/28 20:00|
+----------------+
```
## How was this patch tested?
Unit tests were added in `CSVSuite` and `JsonSuite`. For JSON, existing tests cover the default cases.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#14279 from HyukjinKwon/SPARK-16216-json-csv.
## What changes were proposed in this pull request?
Actually Spark SQL doesn't support index, the catalog table type `INDEX` is from Hive. However, most operations in Spark SQL can't handle index table, e.g. create table, alter table, etc.
Logically index table should be invisible to end users, and Hive also generates special table name for index table to avoid users accessing it directly. Hive has special SQL syntax to create/show/drop index tables.
At Spark SQL side, although we can describe index table directly, but the result is unreadable, we should use the dedicated SQL syntax to do it(e.g. `SHOW INDEX ON tbl`). Spark SQL can also read index table directly, but the result is always empty.(Can hive read index table directly?)
This PR remove the table type `INDEX`, to make it clear that Spark SQL doesn't support index currently.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#14752 from cloud-fan/minor2.
## What changes were proposed in this pull request?
This PR removes implemented functions from comments of `HiveSessionCatalog.scala`: `java_method`, `posexplode`, `str_to_map`.
## How was this patch tested?
Manual.
Author: Weiqing Yang <yangweiqing001@gmail.com>
Closes#14769 from Sherry302/cleanComment.
When Spark emits SQL for a string literal, it should wrap the string in single quotes, not double quotes. Databases which adhere more strictly to the ANSI SQL standards, such as Postgres, allow only single-quotes to be used for denoting string literals (see http://stackoverflow.com/a/1992331/590203).
Author: Josh Rosen <joshrosen@databricks.com>
Closes#14763 from JoshRosen/SPARK-17194.
## What changes were proposed in this pull request?
The range operator previously didn't support SQL generation, which made it not possible to use in views.
## How was this patch tested?
Unit tests.
cc hvanhovell
Author: Eric Liang <ekl@databricks.com>
Closes#14724 from ericl/spark-17162.
## What changes were proposed in this pull request?
Spark SQL doesn't have its own meta store yet, and use hive's currently. However, hive's meta store has some limitations(e.g. columns can't be too many, not case-preserving, bad decimal type support, etc.), so we have some hacks to successfully store data source table metadata into hive meta store, i.e. put all the information in table properties.
This PR moves these hacks to `HiveExternalCatalog`, tries to isolate hive specific logic in one place.
changes overview:
1. **before this PR**: we need to put metadata(schema, partition columns, etc.) of data source tables to table properties before saving it to external catalog, even the external catalog doesn't use hive metastore(e.g. `InMemoryCatalog`)
**after this PR**: the table properties tricks are only in `HiveExternalCatalog`, the caller side doesn't need to take care of it anymore.
2. **before this PR**: because the table properties tricks are done outside of external catalog, so we also need to revert these tricks when we read the table metadata from external catalog and use it. e.g. in `DescribeTableCommand` we will read schema and partition columns from table properties.
**after this PR**: The table metadata read from external catalog is exactly the same with what we saved to it.
bonus: now we can create data source table using `SessionCatalog`, if schema is specified.
breaks: `schemaStringLengthThreshold` is not configurable anymore. `hive.default.rcfile.serde` is not configurable anymore.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#14155 from cloud-fan/catalog-table.
## What changes were proposed in this pull request?
Currently `LogicalRelation.newInstance()` simply creates another `LogicalRelation` object with the same parameters. However, the `newInstance()` method inherited from `MultiInstanceRelation` should return a copy of object with unique expression ids. Current `LogicalRelation.newInstance()` can cause failure when doing self-join.
## How was this patch tested?
Jenkins tests.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Closes#14682 from viirya/fix-localrelation.
## What changes were proposed in this pull request?
This patch adds support for SQL generation for inline tables. With this, it would be possible to create a view that depends on inline tables.
## How was this patch tested?
Added a test case in LogicalPlanToSQLSuite.
Author: petermaxlee <petermaxlee@gmail.com>
Closes#14709 from petermaxlee/SPARK-17150.
## What changes were proposed in this pull request?
This patch creates array.sql in SQLQueryTestSuite for testing array related functions, including:
- indexing
- array creation
- size
- array_contains
- sort_array
## How was this patch tested?
The patch itself is about adding tests.
Author: petermaxlee <petermaxlee@gmail.com>
Closes#14708 from petermaxlee/SPARK-17149.
A review of the code, working back from Hadoop's `FileSystem.exists()` and `FileSystem.isDirectory()` code, then removing uses of the calls when superfluous.
1. delete is harmless if called on a nonexistent path, so don't do any checks before deletes
1. any `FileSystem.exists()` check before `getFileStatus()` or `open()` is superfluous as the operation itself does the check. Instead the `FileNotFoundException` is caught and triggers the downgraded path. When a `FileNotFoundException` was thrown before, the code still creates a new FNFE with the error messages. Though now the inner exceptions are nested, for easier diagnostics.
Initially, relying on Jenkins test runs.
One troublespot here is that some of the codepaths are clearly error situations; it's not clear that they have coverage anyway. Trying to create the failure conditions in tests would be ideal, but it will also be hard.
Author: Steve Loughran <stevel@apache.org>
Closes#14371 from steveloughran/cloud/SPARK-16736-superfluous-fs-calls.
## What changes were proposed in this pull request?
This PR adds a field to subquery alias in order to make the usage of views in a resolved `LogicalPlan` more visible (and more understandable).
For example, the following view and query:
```sql
create view constants as select 1 as id union all select 1 union all select 42
select * from constants;
```
...now yields the following analyzed plan:
```
Project [id#39]
+- SubqueryAlias c, `default`.`constants`
+- Project [gen_attr_0#36 AS id#39]
+- SubqueryAlias gen_subquery_0
+- Union
:- Union
: :- Project [1 AS gen_attr_0#36]
: : +- OneRowRelation$
: +- Project [1 AS gen_attr_1#37]
: +- OneRowRelation$
+- Project [42 AS gen_attr_2#38]
+- OneRowRelation$
```
## How was this patch tested?
Added tests for the two code paths in `SessionCatalogSuite` (sql/core) and `HiveMetastoreCatalogSuite` (sql/hive)
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#14657 from hvanhovell/SPARK-17068.
## What changes were proposed in this pull request?
This PR is a small follow-up to https://github.com/apache/spark/pull/14554. This also widens the visibility of a few (similar) Hive classes.
## How was this patch tested?
No test. Only a visibility change.
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#14654 from hvanhovell/SPARK-16964-hive.
## What changes were proposed in this pull request?
This patch updates the SQL parser to parse negative numeric literals as numeric literals, instead of unary minus of positive literals.
This allows the parser to parse the minimal value for each data type, e.g. "-32768S".
## How was this patch tested?
Updated test cases.
Author: petermaxlee <petermaxlee@gmail.com>
Closes#14608 from petermaxlee/SPARK-17013.
## What changes were proposed in this pull request?
This patch introduces SQLQueryTestSuite, a basic framework for end-to-end SQL test cases defined in spark/sql/core/src/test/resources/sql-tests. This is a more standard way to test SQL queries end-to-end in different open source database systems, because it is more manageable to work with files.
This is inspired by HiveCompatibilitySuite, but simplified for general Spark SQL tests. Once this is merged, I can work towards porting SQLQuerySuite over, and eventually also move the existing HiveCompatibilitySuite to use this framework.
Unlike HiveCompatibilitySuite, SQLQueryTestSuite compares both the output schema and the output data (in string form).
When there is a mismatch, the error message looks like the following:
```
[info] - blacklist.sql !!! IGNORED !!!
[info] - number-format.sql *** FAILED *** (2 seconds, 405 milliseconds)
[info] Expected "...147483648 -214748364[8]", but got "...147483648 -214748364[9]" Result should match for query #1 (SQLQueryTestSuite.scala:171)
[info] org.scalatest.exceptions.TestFailedException:
[info] at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495)
[info] at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
[info] at org.scalatest.Assertions$class.assertResult(Assertions.scala:1171)
```
## How was this patch tested?
This is a test infrastructure change.
Author: petermaxlee <petermaxlee@gmail.com>
Closes#14472 from petermaxlee/SPARK-16866.
### What changes were proposed in this pull request?
The `comment` in `CatalogTable` returned from Hive is always empty. We store it in the table property when creating a table. However, when we try to retrieve the table metadata from Hive metastore, we do not rebuild it. The `comment` is always empty.
This PR is to fix the issue.
### How was this patch tested?
Fixed the test case to verify the change.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#14550 from gatorsmile/tableComment.
## What changes were proposed in this pull request?
MSCK REPAIR TABLE could be used to recover the partitions in external catalog based on partitions in file system.
Another syntax is: ALTER TABLE table RECOVER PARTITIONS
The implementation in this PR will only list partitions (not the files with a partition) in driver (in parallel if needed).
## How was this patch tested?
Added unit tests for it and Hive compatibility test suite.
Author: Davies Liu <davies@databricks.com>
Closes#14500 from davies/repair_table.
## What changes were proposed in this pull request?
This package is meant to be internal, and as a result it does not make sense to mark things as private[sql] or private[spark]. It simply makes debugging harder when Spark developers need to inspect the plans at runtime.
This patch removes all private[sql] and private[spark] visibility modifiers in org.apache.spark.sql.execution.
## How was this patch tested?
N/A - just visibility changes.
Author: Reynold Xin <rxin@databricks.com>
Closes#14554 from rxin/remote-private.
## What changes were proposed in this pull request?
For ORC source, Spark SQL has a writer option `compression`, which is used to set the codec and its value will be also set to `orc.compress` (the orc conf used for codec). However, if a user only set `orc.compress` in the writer option, we should not use the default value of `compression` (snappy) as the codec. Instead, we should respect the value of `orc.compress`.
This PR makes ORC data source not ignoring `orc.compress` when `comperssion` is unset.
So, here is the behaviour,
1. Check `compression` and use this if it is set.
2. If `compression` is not set, check `orc.compress` and use it.
3. If `compression` and `orc.compress` are not set, then use the default snappy.
## How was this patch tested?
Unit test in `OrcQuerySuite`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#14518 from HyukjinKwon/SPARK-16610.
## What changes were proposed in this pull request?
Avoid using postfix operation for command execution in SQLQuerySuite where it wasn't whitelisted and audit existing whitelistings removing postfix operators from most places. Some notable places where postfix operation remains is in the XML parsing & time units (seconds, millis, etc.) where it arguably can improve readability.
## How was this patch tested?
Existing tests.
Author: Holden Karau <holden@us.ibm.com>
Closes#14407 from holdenk/SPARK-16779.
#### What changes were proposed in this pull request?
When doing a CTAS with a Partition By clause, we got a wrong error message.
For example,
```SQL
CREATE TABLE gen__tmp
PARTITIONED BY (key string)
AS SELECT key, value FROM mytable1
```
The error message we get now is like
```
Operation not allowed: Schema may not be specified in a Create Table As Select (CTAS) statement(line 2, pos 0)
```
However, based on the code, the message we should get is like
```
Operation not allowed: A Create Table As Select (CTAS) statement is not allowed to create a partitioned table using Hive's file formats. Please use the syntax of "CREATE TABLE tableName USING dataSource OPTIONS (...) PARTITIONED BY ...\" to create a partitioned table through a CTAS statement.(line 2, pos 0)
```
Currently, partitioning columns is part of the schema. This PR fixes the bug by changing the detection orders.
#### How was this patch tested?
Added test cases.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#14113 from gatorsmile/ctas.
## What changes were proposed in this pull request?
When we create the HiveConf for metastore client, we use a Hadoop Conf as the base, which may contain Hive settings in hive-site.xml (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala#L49). However, HiveConf's initialize function basically ignores the base Hadoop Conf and always its default values (i.e. settings with non-null default values) as the base (https://github.com/apache/hive/blob/release-1.2.1/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L2687). So, even a user put javax.jdo.option.ConnectionURL in hive-site.xml, it is not used and Hive will use its default, which is jdbc:derby:;databaseName=metastore_db;create=true.
This issue only shows up when `spark.sql.hive.metastore.jars` is not set to builtin.
## How was this patch tested?
New test in HiveSparkSubmitSuite.
Author: Yin Huai <yhuai@databricks.com>
Closes#14497 from yhuai/SPARK-16901.
## What changes were proposed in this pull request?
we have various logical plans for CREATE TABLE and CTAS: `CreateTableUsing`, `CreateTableUsingAsSelect`, `CreateHiveTableAsSelectLogicalPlan`. This PR unifies them to reduce the complexity and centralize the error handling.
## How was this patch tested?
existing tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes#14482 from cloud-fan/table.
## What changes were proposed in this pull request?
These 2 methods take `CatalogTable` as parameter, which already have the database information.
## How was this patch tested?
existing test
Author: Wenchen Fan <wenchen@databricks.com>
Closes#14476 from cloud-fan/minor5.
## What changes were proposed in this pull request?
Partition discovery is rather expensive, so we should do it at execution time instead of during physical planning. Right now there is not much benefit since ListingFileCatalog will read scan for all partitions at planning time anyways, but this can be optimized in the future. Also, there might be more information for partition pruning not available at planning time.
This PR moves a lot of the file scan logic from planning to execution time. All file scan operations are handled by `FileSourceScanExec`, which handles both batched and non-batched file scans. This requires some duplication with `RowDataSourceScanExec`, but is probably worth it so that `FileSourceScanExec` does not need to depend on an input RDD.
TODO: In another pr, move DataSourceScanExec to it's own file.
## How was this patch tested?
Existing tests (it might be worth adding a test that catalog.listFiles() is delayed until execution, but this can be delayed until there is an actual benefit to doing so).
Author: Eric Liang <ekl@databricks.com>
Closes#14241 from ericl/refactor.
### What changes were proposed in this pull request?
This PR is to remove `TestHiveSharedState`.
Also, this is also associated with the Hive refractoring for removing `HiveSharedState`.
### How was this patch tested?
The existing test cases
Author: gatorsmile <gatorsmile@gmail.com>
Closes#14463 from gatorsmile/removeTestHiveSharedState.
## What changes were proposed in this pull request?
With SPARK-15034, we could use the value of spark.sql.warehouse.dir to set the warehouse location. In TestHive, we can now simply set the temporary warehouse path in sc's conf, and thus, param "warehousePath" could be removed.
## How was this patch tested?
exsiting testsuites.
Author: jiangxingbo <jiangxingbo@meituan.com>
Closes#14401 from jiangxb1987/warehousePath.
## What changes were proposed in this pull request?
`StructField` has very similar semantic with `CatalogColumn`, except that `CatalogColumn` use string to express data type. I think it's reasonable to use `StructType` as the `CatalogTable.schema` and remove `CatalogColumn`.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#14363 from cloud-fan/column.
#### What changes were proposed in this pull request?
Currently, in Spark SQL, the initial creation of schema can be classified into two groups. It is applicable to both Hive tables and Data Source tables:
**Group A. Users specify the schema.**
_Case 1 CREATE TABLE AS SELECT_: the schema is determined by the result schema of the SELECT clause. For example,
```SQL
CREATE TABLE tab STORED AS TEXTFILE
AS SELECT * from input
```
_Case 2 CREATE TABLE_: users explicitly specify the schema. For example,
```SQL
CREATE TABLE jsonTable (_1 string, _2 string)
USING org.apache.spark.sql.json
```
**Group B. Spark SQL infers the schema at runtime.**
_Case 3 CREATE TABLE_. Users do not specify the schema but the path to the file location. For example,
```SQL
CREATE TABLE jsonTable
USING org.apache.spark.sql.json
OPTIONS (path '${tempDir.getCanonicalPath}')
```
Before this PR, Spark SQL does not store the inferred schema in the external catalog for the cases in Group B. When users refreshing the metadata cache, accessing the table at the first time after (re-)starting Spark, Spark SQL will infer the schema and store the info in the metadata cache for improving the performance of subsequent metadata requests. However, the runtime schema inference could cause undesirable schema changes after each reboot of Spark.
This PR is to store the inferred schema in the external catalog when creating the table. When users intend to refresh the schema after possible changes on external files (table location), they issue `REFRESH TABLE`. Spark SQL will infer the schema again based on the previously specified table location and update/refresh the schema in the external catalog and metadata cache.
In this PR, we do not use the inferred schema to replace the user specified schema for avoiding external behavior changes . Based on the design, user-specified schemas (as described in Group A) can be changed by ALTER TABLE commands, although we do not support them now.
#### How was this patch tested?
TODO: add more cases to cover the changes.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#14207 from gatorsmile/userSpecifiedSchema.
## What changes were proposed in this pull request?
We currently test subquery SQL building using the `HiveCompatibilitySuite`. The is not desired since SQL building is actually a part of `sql/core` and because we are slowly reducing our dependency on Hive. This PR adds the same tests from the whitelist of `HiveCompatibilitySuite` into `LogicalPlanToSQLSuite`.
## How was this patch tested?
This adds more testcases. Pass the Jenkins tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#14383 from dongjoon-hyun/SPARK-15232.
## What changes were proposed in this pull request?
Currently, the generated SQLs have not-stable IDs for generated attributes.
The stable generated SQL will give more benefit for understanding or testing the queries.
This PR provides stable SQL generation by the followings.
- Provide unique ids for generated subqueries, `gen_subquery_xxx`.
- Provide unique and stable ids for generated attributes, `gen_attr_xxx`.
**Before**
```scala
scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
res0: String = SELECT `gen_attr_0` AS `1` FROM (SELECT 1 AS `gen_attr_0`) AS gen_subquery_0
scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
res1: String = SELECT `gen_attr_4` AS `1` FROM (SELECT 1 AS `gen_attr_4`) AS gen_subquery_0
```
**After**
```scala
scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
res1: String = SELECT `gen_attr_0` AS `1` FROM (SELECT 1 AS `gen_attr_0`) AS gen_subquery_0
scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
res2: String = SELECT `gen_attr_0` AS `1` FROM (SELECT 1 AS `gen_attr_0`) AS gen_subquery_0
```
## How was this patch tested?
Pass the existing Jenkins tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#14257 from dongjoon-hyun/SPARK-16621.
## What changes were proposed in this pull request?
Currently there are 2 inconsistence:
1. for data source table, we only print partition names, for hive table, we also print partition schema. After this PR, we will always print schema
2. if column doesn't have comment, data source table will print empty string, hive table will print null. After this PR, we will always print null
## How was this patch tested?
new test in `HiveDDLSuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#14302 from cloud-fan/minor3.
## What changes were proposed in this pull request?
This PR contains three changes.
First, this PR changes the behavior of lead/lag back to Spark 1.6's behavior, which is described as below:
1. lead/lag respect null input values, which means that if the offset row exists and the input value is null, the result will be null instead of the default value.
2. If the offset row does not exist, the default value will be used.
3. OffsetWindowFunction's nullable setting also considers the nullability of its input (because of the first change).
Second, this PR fixes the evaluation of lead/lag when the input expression is a literal. This fix is a result of the first change. In current master, if a literal is used as the input expression of a lead or lag function, the result will be this literal even if the offset row does not exist.
Third, this PR makes ResolveWindowFrame not fire if a window function is not resolved.
## How was this patch tested?
New tests in SQLWindowFunctionSuite
Author: Yin Huai <yhuai@databricks.com>
Closes#14284 from yhuai/lead-lag.
## What changes were proposed in this pull request?
Currently, `SQLBuilder` raises `empty.reduceLeft` exceptions on *unoptimized* `EXISTS` queries. We had better prevent this.
```scala
scala> sql("CREATE TABLE t1(a int)")
scala> val df = sql("select * from t1 b where exists (select * from t1 a)")
scala> new org.apache.spark.sql.catalyst.SQLBuilder(df).toSQL
java.lang.UnsupportedOperationException: empty.reduceLeft
```
## How was this patch tested?
Pass the Jenkins tests with a new test suite.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#14307 from dongjoon-hyun/SPARK-16672.
## What changes were proposed in this pull request?
**Issue 1: Disallow Creating/Altering a View when the same-name Table Exists (without IF NOT EXISTS)**
When we create OR alter a view, we check whether the view already exists. In the current implementation, if a table with the same name exists, we treat it as a view. However, this is not the right behavior. We should follow what Hive does. For example,
```
hive> CREATE TABLE tab1 (id int);
OK
Time taken: 0.196 seconds
hive> CREATE OR REPLACE VIEW tab1 AS SELECT * FROM t1;
FAILED: SemanticException [Error 10218]: Existing table is not a view
The following is an existing table, not a view: default.tab1
hive> ALTER VIEW tab1 AS SELECT * FROM t1;
FAILED: SemanticException [Error 10218]: Existing table is not a view
The following is an existing table, not a view: default.tab1
hive> CREATE VIEW IF NOT EXISTS tab1 AS SELECT * FROM t1;
OK
Time taken: 0.678 seconds
```
**Issue 2: Strange Error when Issuing Load Table Against A View**
Users should not be allowed to issue LOAD DATA against a view. Currently, when users doing it, we got a very strange runtime error. For example,
```SQL
LOAD DATA LOCAL INPATH "$testData" INTO TABLE $viewName
```
```
java.lang.reflect.InvocationTargetException was thrown.
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.sql.hive.client.Shim_v0_14.loadTable(HiveShim.scala:680)
```
## How was this patch tested?
Added test cases
Author: gatorsmile <gatorsmile@gmail.com>
Closes#14314 from gatorsmile/tableDDLAgainstView.
## What changes were proposed in this pull request?
This PR fixes a minor formatting issue of `WindowSpecDefinition.sql` when no partitioning expressions are present.
Before:
```sql
( ORDER BY `a` ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
```
After:
```sql
(ORDER BY `a` ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
```
## How was this patch tested?
New test case added in `ExpressionSQLBuilderSuite`.
Author: Cheng Lian <lian@databricks.com>
Closes#14334 from liancheng/window-spec-sql-format.
## What changes were proposed in this pull request?
It's weird that we have `BucketSpec` to abstract bucket info, but don't use it in `CatalogTable`. This PR moves `BucketSpec` into catalyst module.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#14331 from cloud-fan/check.
## What changes were proposed in this pull request?
`CreateViewCommand` only needs some information of a `CatalogTable`, but not all of them. We have some tricks(e.g. we need to check the table type is `VIEW`, we need to make `CatalogColumn.dataType` nullable) to allow it to take a `CatalogTable`.
This PR cleans it up and only pass in necessary information to `CreateViewCommand`.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#14297 from cloud-fan/minor2.
## What changes were proposed in this pull request?
Default `TreeNode.withNewChildren` implementation doesn't work for `Last` and when both constructor arguments are the same, e.g.:
```sql
LAST_VALUE(FALSE) -- The 2nd argument defaults to FALSE
LAST_VALUE(FALSE, FALSE)
LAST_VALUE(TRUE, TRUE)
```
This is because although `Last` is a unary expression, both of its constructor arguments, `child` and `ignoreNullsExpr`, are `Expression`s. When they have the same value, `TreeNode.withNewChildren` treats both of them as child nodes by mistake. `First` is also affected by this issue in exactly the same way.
This PR fixes this issue by making `ignoreNullsExpr` a child expression of `First` and `Last`.
## How was this patch tested?
New test case added in `WindowQuerySuite`.
Author: Cheng Lian <lian@databricks.com>
Closes#14295 from liancheng/spark-16648-last-value.
## What changes were proposed in this pull request?
we also store data source table options in this field, it's unreasonable to call it `serdeProperties`.
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes#14283 from cloud-fan/minor1.
## Problem
The current `sed` in `test_script.sh` is missing a `$`, leading to the failure of `script` test on OS X:
```
== Results ==
!== Correct Answer - 2 == == Spark Answer - 2 ==
![x1_y1] [x1]
![x2_y2] [x2]
```
In addition, this `script` test would also fail on systems like Windows where we couldn't be able to invoke `bash` or `echo | sed`.
## What changes were proposed in this pull request?
This patch
- fixes `sed` in `test_script.sh`
- adds command guards so that the `script` test would pass on systems like Windows
## How was this patch tested?
- Jenkins
- Manually verified tests pass on OS X
Author: Liwei Lin <lwlin7@gmail.com>
Closes#14280 from lw-lin/osx-sed.
## What changes were proposed in this pull request?
after https://github.com/apache/spark/pull/12945, we renamed the `registerTempTable` to `createTempView`, as we do create a view actually. This PR renames `SQLTestUtils.withTempTable` to reflect this change.
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes#14318 from cloud-fan/minor4.
## What changes were proposed in this pull request?
This PR adds `str_to_map` SQL function in order to remove Hive fallback.
## How was this patch tested?
Pass the Jenkins tests with newly added.
Author: Sandeep Singh <sandeep@techaddict.me>
Closes#13990 from techaddict/SPARK-16287.
## What changes were proposed in this pull request?
Due to backward-compatibility reasons, the following Parquet schema is ambiguous:
```
optional group f (LIST) {
repeated group list {
optional group element {
optional int32 element;
}
}
}
```
According to the parquet-format spec, when interpreted as a standard 3-level layout, this type is equivalent to the following SQL type:
```
ARRAY<STRUCT<element: INT>>
```
However, when interpreted as a legacy 2-level layout, it's equivalent to
```
ARRAY<STRUCT<element: STRUCT<element: INT>>>
```
Historically, to disambiguate these cases, we employed two methods:
- `ParquetSchemaConverter.isElementType()`
Used to disambiguate the above cases while converting Parquet types to Spark types.
- `ParquetRowConverter.isElementType()`
Used to disambiguate the above cases while instantiating row converters that convert Parquet records to Spark rows.
Unfortunately, these two methods make different decision about the above problematic Parquet type, and caused SPARK-16344.
`ParquetRowConverter.isElementType()` is necessary for Spark 1.4 and earlier versions because Parquet requested schemata are directly converted from Spark schemata in these versions. The converted Parquet schemata may be incompatible with actual schemata of the underlying physical files when the files are written by a system/library that uses a schema conversion scheme that is different from Spark when writing Parquet LIST and MAP fields.
In Spark 1.5, Parquet requested schemata are always properly tailored from schemata of physical files to be read. Thus `ParquetRowConverter.isElementType()` is no longer necessary. This PR replaces this method with a simply yet accurate scheme: whenever an ambiguous Parquet type is hit, convert the type in question back to a Spark type using `ParquetSchemaConverter` and check whether it matches the corresponding Spark type.
## How was this patch tested?
New test cases added in `ParquetHiveCompatibilitySuite` and `ParquetQuerySuite`.
Author: Cheng Lian <lian@databricks.com>
Closes#14014 from liancheng/spark-16344-for-master-and-2.0.
## What changes were proposed in this pull request?
In 2.0, we add a new logic to convert HiveTableScan on ORC tables to Spark's native code path. However, during this conversion, we drop the original metastore schema (https://issues.apache.org/jira/browse/SPARK-15705). Because of this regression, I am changing the default value of `spark.sql.hive.convertMetastoreOrc` to false.
Author: Yin Huai <yhuai@databricks.com>
Closes#14267 from yhuai/SPARK-15705-changeDefaultValue.
## What changes were proposed in this pull request?
This PR improves `LogicalPlanToSQLSuite` to check the generated SQL directly by **structure**. So far, `LogicalPlanToSQLSuite` relies on `checkHiveQl` to ensure the **successful SQL generation** and **answer equality**. However, it does not guarantee the generated SQL is the same or will not be changed unnoticeably.
## How was this patch tested?
Pass the Jenkins. This is only a testsuite change.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#14235 from dongjoon-hyun/SPARK-16590.
## What changes were proposed in this pull request?
In ScriptInputOutputSchema, we read default RecordReader and RecordWriter from conf. Since Spark 2.0 has deleted those config keys from hive conf, we have to set default reader/writer class name by ourselves. Otherwise we will get None for LazySimpleSerde, the data written would not be able to read by script. The test case added worked fine with previous version of Spark, but would fail now.
## How was this patch tested?
added a test case in SQLQuerySuite.
Closes#14169
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Author: Yin Huai <yhuai@databricks.com>
Closes#14249 from yhuai/scriptTransformation.
## What changes were proposed in this pull request?
There are some calls to methods or fields (getParameters, properties) which are then passed to Java/Scala collection converters. Unfortunately those fields can be null in some cases and then the conversions throws NPE. We fix it by wrapping calls to those fields and methods with option and then do the conversion.
## How was this patch tested?
Manually tested with a custom Hive metastore.
Author: Jacek Lewandowski <lewandowski.jacek@gmail.com>
Closes#14200 from jacek-lewandowski/SPARK-16528.
#### What changes were proposed in this pull request?
If we create a table pointing to a parquet/json datasets without specifying the schema, describe table command does not show the schema at all. It only shows `# Schema of this table is inferred at runtime`. In 1.6, describe table does show the schema of such a table.
~~For data source tables, to infer the schema, we need to load the data source tables at runtime. Thus, this PR calls the function `lookupRelation`.~~
For data source tables, we infer the schema before table creation. Thus, this PR set the inferred schema as the table schema when table creation.
#### How was this patch tested?
Added test cases
Author: gatorsmile <gatorsmile@gmail.com>
Closes#14148 from gatorsmile/describeSchema.
## What changes were proposed in this pull request?
This patch implements reflect SQL function, which can be used to invoke a Java method in SQL. Slightly different from Hive, this implementation requires the class name and the method name to be literals. This implementation also supports only a smaller number of data types, and requires the function to be static, as suggested by rxin in #13969.
java_method is an alias for reflect, so this should also resolve SPARK-16277.
## How was this patch tested?
Added expression unit tests and an end-to-end test.
Author: petermaxlee <petermaxlee@gmail.com>
Closes#14138 from petermaxlee/reflect-static.
This option is used by Hive to directly delete the files instead of
moving them to the trash. This is needed in certain configurations
where moving the files does not work. For non-Hive tables and partitions,
Spark already behaves as if the PURGE option was set, so there's no
need to do anything.
Hive support for PURGE was added in 0.14 (for tables) and 1.2 (for
partitions), so the code reflects that: trying to use the option with
older versions of Hive will cause an exception to be thrown.
The change is a little noisier than I would like, because of the code
to propagate the new flag through all the interfaces and implementations;
the main changes are in the parser and in HiveShim, aside from the tests
(DDLCommandSuite, VersionsSuite).
Tested by running sql and catalyst unit tests, plus VersionsSuite which
has been updated to test the version-specific behavior. I also ran an
internal test suite that uses PURGE and would not pass previously.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#13831 from vanzin/SPARK-16119.
## What changes were proposed in this pull request?
when query only use metadata (example: partition key), it can return results based on metadata without scanning files. Hive did it in HIVE-1003.
## How was this patch tested?
add unit tests
Author: Lianhui Wang <lianhuiwang09@gmail.com>
Author: Wenchen Fan <wenchen@databricks.com>
Author: Lianhui Wang <lianhuiwang@users.noreply.github.com>
Closes#13494 from lianhuiwang/metadata-only.
## What changes were proposed in this pull request?
In order to make it clear which filters are fully handled by the
underlying datasource we will mark them with an *. This will give a
clear visual queue to users that the filter is being treated differently
by catalyst than filters which are just presented to the underlying
DataSource.
Examples from the FilteredScanSuite, in this example `c IN (...)` is handled by the source, `b < ...` is not
### Before
```
//SELECT a FROM oneToTenFiltered WHERE a + b > 9 AND b < 16 AND c IN ('bbbbbBBBBB', 'cccccCCCCC', 'dddddDDDDD', 'foo')
== Physical Plan ==
Project [a#0]
+- Filter (((a#0 + b#1) > 9) && (b#1 < 16))
+- Scan SimpleFilteredScan(1,10)[a#0,b#1] PushedFilters: [LessThan(b,16), In(c, [bbbbbBBBBB,cccccCCCCC,dddddDDDDD,foo]]
```
### After
```
== Physical Plan ==
Project [a#0]
+- Filter (((a#0 + b#1) > 9) && (b#1 < 16))
+- Scan SimpleFilteredScan(1,10)[a#0,b#1] PushedFilters: [LessThan(b,16), *In(c, [bbbbbBBBBB,cccccCCCCC,dddddDDDDD,foo]]
```
## How was the this patch tested?
Manually tested with the Spark Cassandra Connector, a source which fully handles underlying filters. Now fully handled filters appear with an * next to their names. I can add an automated test as well if requested
Post 1.6.1
Tested by modifying the FilteredScanSuite to run explains.
Author: Russell Spitzer <Russell.Spitzer@gmail.com>
Closes#11317 from RussellSpitzer/SPARK-12639-Star.
Some Hadoop classes needed by the Hive metastore client jars are not present
in Spark's packaging (for example, "org/apache/hadoop/mapred/MRVersion"). So
if the parent class loader fails to find a class, try to load it from the
isolated class loader, in case it's available there.
Tested by setting spark.sql.hive.metastore.jars to local paths with Hive/Hadoop
libraries and verifying that Spark can talk to the metastore.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#14020 from vanzin/SPARK-16349.
## What changes were proposed in this pull request?
This PR prevents dropping the current database to avoid errors like the followings.
```scala
scala> sql("create database delete_db")
scala> sql("use delete_db")
scala> sql("drop database delete_db")
scala> sql("create table t as select 1")
org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database `delete_db` not found;
```
## How was this patch tested?
Pass the Jenkins tests including an updated testcase.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#14115 from dongjoon-hyun/SPARK-16459.
## What changes were proposed in this pull request?
This patch implements all remaining xpath functions that Hive supports and not natively supported in Spark: xpath_int, xpath_short, xpath_long, xpath_float, xpath_double, xpath_string, and xpath.
## How was this patch tested?
Added unit tests and end-to-end tests.
Author: petermaxlee <petermaxlee@gmail.com>
Closes#13991 from petermaxlee/SPARK-16318.
## What changes were proposed in this pull request?
This PR adds parse_url SQL functions in order to remove Hive fallback.
A new implementation of #13999
## How was this patch tested?
Pass the exist tests including new testcases.
Author: wujian <jan.chou.wu@gmail.com>
Closes#14008 from janplus/SPARK-16281.
## What changes were proposed in this pull request?
This PR implements `sentences` SQL function.
## How was this patch tested?
Pass the Jenkins tests with a new testcase.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#14004 from dongjoon-hyun/SPARK_16285.
## What changes were proposed in this pull request?
In #13537 we truncate `simpleString` if it is a long `StructType`. But sometimes we need `catalogString` to reconstruct `TypeInfo`, for example in description of [SPARK-16415 ](https://issues.apache.org/jira/browse/SPARK-16415). So we need to keep the implementation of `catalogString` not affected by our truncate.
## How was this patch tested?
added a test case.
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#14089 from adrian-wang/catalogstring.
#### What changes were proposed in this pull request?
When creating a view, a common user error is the number of columns produced by the `SELECT` clause does not match the number of column names specified by `CREATE VIEW`.
For example, given Table `t1` only has 3 columns
```SQL
create view v1(col2, col4, col3, col5) as select * from t1
```
Currently, Spark SQL reports the following error:
```
requirement failed
java.lang.IllegalArgumentException: requirement failed
at scala.Predef$.require(Predef.scala:212)
at org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:90)
```
This error message is very confusing. This PR is to detect the error and issue a meaningful error message.
#### How was this patch tested?
Added test cases
Author: gatorsmile <gatorsmile@gmail.com>
Closes#14047 from gatorsmile/viewMismatchedColumns.
#### What changes were proposed in this pull request?
Different from the other leaf nodes, `MetastoreRelation` and `SimpleCatalogRelation` have a pre-defined `alias`, which is used to change the qualifier of the node. However, based on the existing alias handling, alias should be put in `SubqueryAlias`.
This PR is to separate alias handling from `MetastoreRelation` and `SimpleCatalogRelation` to make it consistent with the other nodes. It simplifies the signature and conversion to a `BaseRelation`.
For example, below is an example query for `MetastoreRelation`, which is converted to a `LogicalRelation`:
```SQL
SELECT tmp.a + 1 FROM test_parquet_ctas tmp WHERE tmp.a > 2
```
Before changes, the analyzed plan is
```
== Analyzed Logical Plan ==
(a + 1): int
Project [(a#951 + 1) AS (a + 1)#952]
+- Filter (a#951 > 2)
+- SubqueryAlias tmp
+- Relation[a#951] parquet
```
After changes, the analyzed plan becomes
```
== Analyzed Logical Plan ==
(a + 1): int
Project [(a#951 + 1) AS (a + 1)#952]
+- Filter (a#951 > 2)
+- SubqueryAlias tmp
+- SubqueryAlias test_parquet_ctas
+- Relation[a#951] parquet
```
**Note: the optimized plans are the same.**
For `SimpleCatalogRelation`, the existing code always generates two Subqueries. Thus, no change is needed.
#### How was this patch tested?
Added test cases.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#14053 from gatorsmile/removeAliasFromMetastoreRelation.
#### What changes were proposed in this pull request?
In `CREATE TABLE AS SELECT`, if the `SELECT` query failed, the table should not exist. For example,
```SQL
CREATE TABLE tab
STORED AS TEXTFILE
SELECT 1 AS a, (SELECT a FROM (SELECT 1 AS a UNION ALL SELECT 2 AS a) t) AS b
```
The above query failed as expected but an empty table `t` is created.
This PR is to drop the created table when hitting any non-fatal exception.
#### How was this patch tested?
Added a test case to verify the behavior
Author: gatorsmile <gatorsmile@gmail.com>
Closes#13926 from gatorsmile/dropTableAfterException.
## What changes were proposed in this pull request?
These two configs should always be true after Spark 2.0. This patch removes them from the config list. Note that ideally this should've gone into branch-2.0, but due to the timing of the release we should only merge this in master for Spark 2.1.
## How was this patch tested?
Updated test cases.
Author: Reynold Xin <rxin@databricks.com>
Closes#14061 from rxin/SPARK-16388.
## What changes were proposed in this pull request?
Currently, if due to some failure, the outstream gets destroyed or closed and later `outstream.close()` leads to IOException in such case. Due to this, the `stderrBuffer` does not get logged and there is no way for users to see why the job failed.
The change is to first display the stderr buffer and then try closing the outstream.
## How was this patch tested?
The correct way to test this fix would be to grep the log to see if the `stderrBuffer` gets logged but I dont think having test cases which do that is a good idea.
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
…
Author: Tejas Patil <tejasp@fb.com>
Closes#13834 from tejasapatil/script_transform.
#### What changes were proposed in this pull request?
- Remove useless `MetastoreRelation` from the signature of `SparkHiveWriterContainer` and `SparkHiveDynamicPartitionWriterContainer`.
- Avoid unnecessary metadata retrieval using Hive client in `InsertIntoHiveTable`.
#### How was this patch tested?
Existing test cases already cover it.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#14062 from gatorsmile/removeMetastoreRelation.
## What changes were proposed in this pull request?
This PR implements `stack` table generating function.
## How was this patch tested?
Pass the Jenkins tests including new testcases.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#14033 from dongjoon-hyun/SPARK-16286.
## What changes were proposed in this pull request?
This PR removes `SessionState.executeSql` in favor of `SparkSession.sql`. We can remove this safely since the visibility `SessionState` is `private[sql]` and `executeSql` is only used in one **ignored** test, `test("Multiple Hive Instances")`.
## How was this patch tested?
Pass the Jenkins tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#14055 from dongjoon-hyun/SPARK-16383.
## What changes were proposed in this pull request?
This patch fixes the bug that the refresh command does not work on temporary views. This patch is based on https://github.com/apache/spark/pull/13989, but removes the public Dataset.refresh() API as well as improved test coverage.
Note that I actually think the public refresh() API is very useful. We can in the future implement it by also invalidating the lazy vals in QueryExecution (or alternatively just create a new QueryExecution).
## How was this patch tested?
Re-enabled a previously ignored test, and added a new test suite for Hive testing behavior of temporary views against MetastoreRelation.
Author: Reynold Xin <rxin@databricks.com>
Author: petermaxlee <petermaxlee@gmail.com>
Closes#14009 from rxin/SPARK-16311.
## What changes were proposed in this pull request?
It seems ORC supports all the types in ([`PredicateLeaf.Type`](e085b7e9bd/storage-api/src/java/org/apache/hadoop/hive/ql/io/sarg/PredicateLeaf.java (L50-L56))) which includes boolean types. So, this was tested first.
This PR adds the support for pushing filters down for `BooleanType` in ORC data source.
This PR also removes `OrcTableScan` class and the companion object, which is not used anymore.
## How was this patch tested?
Unittest in `OrcFilterSuite` and `OrcQuerySuite`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#12972 from HyukjinKwon/SPARK-15198.
(Please note this is a revision of PR #13686, which has been closed in favor of this PR.)
This PR addresses [SPARK-15968](https://issues.apache.org/jira/browse/SPARK-15968).
## What changes were proposed in this pull request?
The `getCached` method of [HiveMetastoreCatalog](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala) computes `pathsInMetastore` from the metastore relation's catalog table. This only returns the table base path, which is incomplete/inaccurate for a nonempty partitioned table. As a result, cached lookups on nonempty partitioned tables always miss.
Rather than get `pathsInMetastore` from
metastoreRelation.catalogTable.storage.locationUri.toSeq
I modified the `getCached` method to take a `pathsInMetastore` argument. Calls to this method pass in the paths computed from calls to the Hive metastore. This is how `getCached` was implemented in Spark 1.5:
e0c3212a9b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala (L444).
I also added a call in `InsertIntoHiveTable.scala` to invalidate the table from the SQL session catalog.
## How was this patch tested?
I've added a new unit test to `parquetSuites.scala`:
SPARK-15968: nonempty partitioned metastore Parquet table lookup should use cached relation
Note that the only difference between this new test and the one above it in the file is that the new test populates its partitioned table with a single value, while the existing test leaves the table empty. This reveals a subtle, unexpected hole in test coverage present before this patch.
Note I also modified a different but related unit test in `parquetSuites.scala`:
SPARK-15248: explicitly added partitions should be readable
This unit test asserts that Spark SQL should return data from a table partition which has been placed there outside a metastore query immediately after it is added. I changed the test so that, instead of adding the data as a parquet file saved in the partition's location, the data is added through a SQL `INSERT` query. I made this change because I could find no way to efficiently support partitioned table caching without failing that test.
In addition to my primary motivation, I can offer a few reasons I believe this is an acceptable weakening of that test. First, it still validates a fix for [SPARK-15248](https://issues.apache.org/jira/browse/SPARK-15248), the issue for which it was written. Second, the assertion made is stronger than that required for non-partitioned tables. If you write data to the storage location of a non-partitioned metastore table without using a proper SQL DML query, a subsequent call to show that data will not return it. I believe this is an intentional limitation put in place to make table caching feasible, but I'm only speculating.
Building a large `HadoopFsRelation` requires `stat`-ing all of its data files. In our environment, where we have tables with 10's of thousands of partitions, the difference between using a cached relation versus a new one is a matter of seconds versus minutes. Caching partitioned table metadata vastly improves the usability of Spark SQL for these cases.
Thanks.
Author: Michael Allman <michael@videoamp.com>
Closes#13818 from mallman/spark-15968.
#### What changes were proposed in this pull request?
LogicalPlan `InsertIntoHiveTable` is useless. Thus, we can remove it from the code base.
#### How was this patch tested?
The existing test cases
Author: gatorsmile <gatorsmile@gmail.com>
Closes#14037 from gatorsmile/InsertIntoHiveTable.
## What changes were proposed in this pull request?
This PR implements `inline` table generating function.
## How was this patch tested?
Pass the Jenkins tests with new testcase.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13976 from dongjoon-hyun/SPARK-16288.
## What changes were proposed in this pull request?
This PR adds `map_keys` and `map_values` SQL functions in order to remove Hive fallback.
## How was this patch tested?
Pass the Jenkins tests including new testcases.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13967 from dongjoon-hyun/SPARK-16278.
## What changes were proposed in this pull request?
This patch introduces a flag to disable loading test tables in TestHiveSparkSession and disables that in Python. This fixes an issue in which python/run-tests would fail due to failure to load test tables.
Note that these test tables are not used outside of HiveCompatibilitySuite. In the long run we should probably decouple the loading of test tables from the test Hive setup.
## How was this patch tested?
This is a test only change.
Author: Reynold Xin <rxin@databricks.com>
Closes#14005 from rxin/SPARK-15954.
## What changes were proposed in this pull request?
This patch implements the elt function, as it is implemented in Hive.
## How was this patch tested?
Added expression unit test in StringExpressionsSuite and end-to-end test in StringFunctionsSuite.
Author: petermaxlee <petermaxlee@gmail.com>
Closes#13966 from petermaxlee/SPARK-16276.
## What changes were proposed in this pull request?
This PR implements `posexplode` table generating function. Currently, master branch raises the following exception for `map` argument. It's different from Hive.
**Before**
```scala
scala> sql("select posexplode(map('a', 1, 'b', 2))").show
org.apache.spark.sql.AnalysisException: No handler for Hive UDF ... posexplode() takes an array as a parameter; line 1 pos 7
```
**After**
```scala
scala> sql("select posexplode(map('a', 1, 'b', 2))").show
+---+---+-----+
|pos|key|value|
+---+---+-----+
| 0| a| 1|
| 1| b| 2|
+---+---+-----+
```
For `array` argument, `after` is the same with `before`.
```
scala> sql("select posexplode(array(1, 2, 3))").show
+---+---+
|pos|col|
+---+---+
| 0| 1|
| 1| 2|
| 2| 3|
+---+---+
```
## How was this patch tested?
Pass the Jenkins tests with newly added testcases.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13971 from dongjoon-hyun/SPARK-16289.
## What changes were proposed in this pull request?
Force the sorter to Spill when number of elements in the pointer array reach a certain size. This is to workaround the issue of timSort failing on large buffer size.
## How was this patch tested?
Tested by running a job which was failing without this change due to TimSort bug.
Author: Sital Kedia <skedia@fb.com>
Closes#13107 from sitalkedia/fix_TimSort.
## What changes were proposed in this pull request?
This patch implements xpath_boolean expression for Spark SQL, a xpath function that returns true or false. The implementation is modelled after Hive's xpath_boolean, except that how the expression handles null inputs. Hive throws a NullPointerException at runtime if either of the input is null. This implementation returns null if either of the input is null.
## How was this patch tested?
Created two new test suites. One for unit tests covering the expression, and the other for end-to-end test in SQL.
Author: petermaxlee <petermaxlee@gmail.com>
Closes#13964 from petermaxlee/SPARK-16274.
## What changes were proposed in this pull request?
After SPARK-15674, `DDLStrategy` prints out the following deprecation messages in the testsuites.
```
12:10:53.284 WARN org.apache.spark.sql.execution.SparkStrategies$DDLStrategy:
CREATE TEMPORARY TABLE normal_orc_source USING... is deprecated,
please use CREATE TEMPORARY VIEW viewName USING... instead
```
Total : 40
- JDBCWriteSuite: 14
- DDLSuite: 6
- TableScanSuite: 6
- ParquetSourceSuite: 5
- OrcSourceSuite: 2
- SQLQuerySuite: 2
- HiveCommandSuite: 2
- JsonSuite: 1
- PrunedScanSuite: 1
- FilteredScanSuite 1
This PR replaces `CREATE TEMPORARY TABLE` with `CREATE TEMPORARY VIEW` in order to remove the deprecation messages in the above testsuites except `DDLSuite`, `SQLQuerySuite`, `HiveCommandSuite`.
The Jenkins results shows only remaining 10 messages.
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61422/consoleFull
## How was this patch tested?
This is a testsuite-only change.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13956 from dongjoon-hyun/SPARK-16267.
## What changes were proposed in this pull request?
This PR supports a fallback lookup by casting `DecimalType` into `DoubleType` for the external functions with `double`-type parameter.
**Reported Error Scenarios**
```scala
scala> sql("select percentile(value, 0.5) from values 1,2,3 T(value)")
org.apache.spark.sql.AnalysisException: ... No matching method for class org.apache.hadoop.hive.ql.udf.UDAFPercentile with (int, decimal(38,18)). Possible choices: _FUNC_(bigint, array<double>) _FUNC_(bigint, double) ; line 1 pos 7
scala> sql("select percentile_approx(value, 0.5) from values 1.0,2.0,3.0 T(value)")
org.apache.spark.sql.AnalysisException: ... Only a float/double or float/double array argument is accepted as parameter 2, but decimal(38,18) was passed instead.; line 1 pos 7
```
## How was this patch tested?
Pass the Jenkins tests (including a new testcase).
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13930 from dongjoon-hyun/SPARK-16228.
#### What changes were proposed in this pull request?
Based on the previous discussion with cloud-fan hvanhovell in another related PR https://github.com/apache/spark/pull/13764#discussion_r67994276, it looks reasonable to add convenience methods for users to add `comment` when defining `StructField`.
Currently, the column-related `comment` attribute is stored in `Metadata` of `StructField`. For example, users can add the `comment` attribute using the following way:
```Scala
StructType(
StructField(
"cl1",
IntegerType,
nullable = false,
new MetadataBuilder().putString("comment", "test").build()) :: Nil)
```
This PR is to add more user friendly methods for the `comment` attribute when defining a `StructField`. After the changes, users are provided three different ways to do it:
```Scala
val struct = (new StructType)
.add("a", "int", true, "test1")
val struct = (new StructType)
.add("c", StringType, true, "test3")
val struct = (new StructType)
.add(StructField("d", StringType).withComment("test4"))
```
#### How was this patch tested?
Added test cases:
- `DataTypeSuite` is for testing three types of API changes,
- `DataFrameReaderWriterSuite` is for parquet, json and csv formats - using in-memory catalog
- `OrcQuerySuite.scala` is for orc format using Hive-metastore
Author: gatorsmile <gatorsmile@gmail.com>
Closes#13860 from gatorsmile/newMethodForComment.
## What changes were proposed in this pull request?
This patch removes the blind fallback into Hive for functions. Instead, it creates a whitelist and adds only a small number of functions to the whitelist, i.e. the ones we intend to support in the long run in Spark.
## How was this patch tested?
Updated tests to reflect the change.
Author: Reynold Xin <rxin@databricks.com>
Closes#13939 from rxin/hive-whitelist.
## What changes were proposed in this pull request?
- Fix tests regarding show functions functionality
- Revert `catalog.ListFunctions` and `SHOW FUNCTIONS` to return to `Spark 1.X` functionality.
Cherry picked changes from this PR: https://github.com/apache/spark/pull/13413/files
## How was this patch tested?
Unit tests.
Author: Bill Chambers <bill@databricks.com>
Author: Bill Chambers <wchambers@ischool.berkeley.edu>
Closes#13916 from anabranch/master.
## What changes were proposed in this pull request?
When reading partitions of a partitioned Hive SerDe table, we only initializes the deserializer using partition properties. However, for SerDes like `AvroSerDe`, essential properties (e.g. Avro schema information) may be defined in table properties. We should merge both table properties and partition properties before initializing the deserializer.
Note that an individual partition may have different properties than the one defined in the table properties (e.g. partitions within a table can have different SerDes). Thus, for any property key defined in both partition and table properties, the value set in partition properties wins.
## How was this patch tested?
New test case added in `QueryPartitionSuite`.
Author: Cheng Lian <lian@databricks.com>
Closes#13865 from liancheng/spark-13709-partitioned-avro-table.
## What changes were proposed in this pull request?
SPARK-14535 removed all calls to class OrcTableScan. This removes the dead code.
## How was this patch tested?
Existing unit tests.
Author: Brian Cho <bcho@fb.com>
Closes#13869 from dafrista/clean-up-orctablescan.
#### What changes were proposed in this pull request?
This PR is to improve test coverage. It verifies whether `Comment` of `Column` can be appropriate handled.
The test cases verify the related parts in Parser, both SQL and DataFrameWriter interface, and both Hive Metastore catalog and In-memory catalog.
#### How was this patch tested?
N/A
Author: gatorsmile <gatorsmile@gmail.com>
Closes#13764 from gatorsmile/dataSourceComment.
## What changes were proposed in this pull request?
Extend the returning of unwrapper functions from primitive types to all types.
This PR is based on https://github.com/apache/spark/pull/13676. It only fixes a bug with scala-2.10 compilation. All credit should go to dafrista.
## How was this patch tested?
The patch should pass all unit tests. Reading ORC files with non-primitive types with this change reduced the read time by ~15%.
Author: Brian Cho <bcho@fb.com>
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#13854 from hvanhovell/SPARK-15956-scala210.
This reverts commit 0a9c027595. It breaks the 2.10 build, I'll fix this in a different PR.
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#13853 from hvanhovell/SPARK-15956-revert.
## What changes were proposed in this pull request?
This PR migrates some test cases introduced in #12313 as a follow-up of #13754 and #13766. These test cases cover `DataFrameWriter.insertInto()`, while the former two only cover SQL `INSERT` statements.
Note that the `testPartitionedTable` utility method tests both Hive SerDe tables and data source tables.
## How was this patch tested?
N/A
Author: Cheng Lian <lian@databricks.com>
Closes#13810 from liancheng/spark-16037-follow-up-tests.
## What changes were proposed in this pull request?
This PR adds the static partition support to INSERT statement when the target table is a data source table.
## How was this patch tested?
New tests in InsertIntoHiveTableSuite and DataSourceAnalysisSuite.
**Note: This PR is based on https://github.com/apache/spark/pull/13766. The last commit is the actual change.**
Author: Yin Huai <yhuai@databricks.com>
Closes#13769 from yhuai/SPARK-16030-1.
## What changes were proposed in this pull request?
The current table insertion has some weird behaviours:
1. inserting into a partitioned table with mismatch columns has confusing error message for hive table, and wrong result for datasource table
2. inserting into a partitioned table without partition list has wrong result for hive table.
This PR fixes these 2 problems.
## How was this patch tested?
new test in hive `SQLQuerySuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#13754 from cloud-fan/insert2.