spark-instrumented-optimizer/sql/core
Wenchen Fan 0706e64c49 [SPARK-30098][SQL] Add a configuration to use default datasource as provider for CREATE TABLE command
### What changes were proposed in this pull request?

For CRETE TABLE [AS SELECT] command, creates native Parquet table if neither USING nor STORE AS is specified and `spark.sql.legacy.createHiveTableByDefault` is false.

This is a retry after we unify the CREATE TABLE syntax. It partially reverts d2bec5e265

This PR allows `CREATE EXTERNAL TABLE` when `LOCATION` is present. This was not allowed for data source tables before, which is an unnecessary behavior different with hive tables.

### Why are the changes needed?

Changing from Hive text table to native Parquet table has many benefits:
1. be consistent with `DataFrameWriter.saveAsTable`.
2. better performance
3. better support for nested types (Hive text table doesn't work well with nested types, e.g. `insert into t values struct(null)` actually inserts a null value not `struct(null)` if `t` is a Hive text table, which leads to wrong result)
4. better interoperability as Parquet is a more popular open file format.

### Does this PR introduce _any_ user-facing change?

No by default. If the config is set, the behavior change is described below:

Behavior-wise, the change is very small as the native Parquet table is also Hive-compatible. All the Spark DDL commands that works for hive tables also works for native Parquet tables, with two exceptions: `ALTER TABLE SET [SERDE | SERDEPROPERTIES]` and `LOAD DATA`.

char/varchar behavior has been taken care by https://github.com/apache/spark/pull/30412, and there is no behavior difference between data source and hive tables.

One potential issue is `CREATE TABLE ... LOCATION ...` while users want to directly access the files later. It's more like a corner case and the legacy config should be good enough.

Another potential issue is users may use Spark to create the table and then use Hive to add partitions with different serde. This is not allowed for Spark native tables.

### How was this patch tested?

Re-enable the tests

Closes #30554 from cloud-fan/create-table.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-03 15:24:44 +00:00
..
benchmarks [SPARK-33523][SQL][TEST][FOLLOWUP] Fix benchmark case name in SubExprEliminationBenchmark 2020-11-25 15:22:47 -08:00
src [SPARK-30098][SQL] Add a configuration to use default datasource as provider for CREATE TABLE command 2020-12-03 15:24:44 +00:00
pom.xml [SPARK-33107][BUILD][FOLLOW-UP] Remove com.twitter:parquet-hadoop-bundle:1.6.0 and orc.classifier 2020-10-11 21:54:56 -07:00