[SPARK-24675][SQL] Rename table: validate existence of new location

## What changes were proposed in this pull request?
If a table is renamed to a new location that already exists, its data won't show up after the rename.
```
scala>  Seq("hello").toDF("a").write.format("parquet").saveAsTable("t")

scala> sql("select * from t").show()
+-----+
|    a|
+-----+
|hello|
+-----+

scala> sql("alter table t rename to test")
res2: org.apache.spark.sql.DataFrame = []

scala> sql("select * from test").show()
+---+
|  a|
+---+
+---+
```
The resulting file layout looks like this:
```
$ tree test
test
├── gabage
└── t
    ├── _SUCCESS
    └── part-00000-856b0f10-08f1-42d6-9eb3-7719261f3d5e-c000.snappy.parquet
```

In Hive, if the new location already exists, the rename fails even when the location is empty.

We should have the same validation in the session catalog, to guard against unexpected bugs like this.
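With the validation added by this patch, the same rename is expected to fail fast instead of silently producing an empty table. A rough sketch of the post-fix spark-shell session (the location shown in the message is illustrative; the message format comes from the new check in `SessionCatalog`):
```
scala> Seq("hello").toDF("a").write.format("parquet").saveAsTable("t")

scala> // the default location of `test` already exists under the warehouse dir
scala> sql("alter table t rename to test")
org.apache.spark.sql.AnalysisException: Can not rename the managed table('`t`').
The associated location('<warehouse-dir>/test') already exists.
```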

## How was this patch tested?

New unit test.

Author: Gengliang Wang <gengliang.wang@databricks.com>

Closes #21655 from gengliangwang/validate_rename_table.


@@ -1850,6 +1850,7 @@ working with timestamps in `pandas_udf`s to get the best performance, see
- Since Spark 2.4, writing a dataframe with an empty or nested empty schema using any file formats (parquet, orc, json, text, csv etc.) is not allowed. An exception is thrown when attempting to write dataframes with empty schema.
- Since Spark 2.4, Spark compares a DATE type with a TIMESTAMP type after promotes both sides to TIMESTAMP. To set `false` to `spark.sql.hive.compareDateTimestampInTimestamp` restores the previous behavior. This option will be removed in Spark 3.0.
- Since Spark 2.4, creating a managed table with nonempty location is not allowed. An exception is thrown when attempting to create a managed table with nonempty location. To set `true` to `spark.sql.allowCreatingManagedTableUsingNonemptyLocation` restores the previous behavior. This option will be removed in Spark 3.0.
- Since Spark 2.4, renaming a managed table to an existing location is not allowed. An exception is thrown when attempting to rename a managed table to an existing location.
- Since Spark 2.4, the type coercion rules can automatically promote the argument types of the variadic SQL functions (e.g., IN/COALESCE) to the widest common type, no matter how the input arguments order. In prior Spark versions, the promotion could fail in some specific orders (e.g., TimestampType, IntegerType and StringType) and throw an exception.
- Since Spark 2.4, Spark has enabled non-cascading SQL cache invalidation in addition to the traditional cache invalidation mechanism. The non-cascading cache invalidation mechanism allows users to remove a cache without impacting its dependent caches. This new cache invalidation mechanism is used in scenarios where the data of the cache to be removed is still valid, e.g., calling unpersist() on a Dataset, or dropping a temporary view. This allows users to free up memory and keep the desired caches valid at the same time.
- In version 2.3 and earlier, `to_utc_timestamp` and `from_utc_timestamp` respect the timezone in the input timestamp string, which breaks the assumption that the input timestamp is in a specific timezone. Therefore, these 2 functions can return unexpected results. In version 2.4 and later, this problem has been fixed. `to_utc_timestamp` and `from_utc_timestamp` will return null if the input timestamp string contains timezone. As an example, `from_utc_timestamp('2000-10-10 00:00:00', 'GMT+1')` will return `2000-10-10 01:00:00` in both Spark 2.3 and 2.4. However, `from_utc_timestamp('2000-10-10 00:00:00+00:00', 'GMT+1')`, assuming a local timezone of GMT+8, will return `2000-10-10 09:00:00` in Spark 2.3 but `null` in 2.4. For people who don't care about this problem and want to retain the previous behavior to keep their query unchanged, you can set `spark.sql.function.rejectTimezoneInString` to false. This option will be removed in Spark 3.0 and should only be used as a temporary workaround.


@@ -619,6 +619,7 @@ class SessionCatalog(
      requireTableExists(TableIdentifier(oldTableName, Some(db)))
      requireTableNotExists(TableIdentifier(newTableName, Some(db)))
      validateName(newTableName)
      validateNewLocationOfRename(oldName, newName)
      externalCatalog.renameTable(db, oldTableName, newTableName)
    } else {
      if (newName.database.isDefined) {
@@ -1366,4 +1367,23 @@ class SessionCatalog(
    // copy over temporary views
    tempViews.foreach(kv => target.tempViews.put(kv._1, kv._2))
  }

  /**
   * Validate the new location before renaming a managed table, which should be non-existent.
   */
  private def validateNewLocationOfRename(
      oldName: TableIdentifier,
      newName: TableIdentifier): Unit = {
    val oldTable = getTableMetadata(oldName)
    if (oldTable.tableType == CatalogTableType.MANAGED) {
      val databaseLocation =
        externalCatalog.getDatabase(oldName.database.getOrElse(currentDb)).locationUri
      val newTableLocation = new Path(new Path(databaseLocation), formatTableName(newName.table))
      val fs = newTableLocation.getFileSystem(hadoopConf)
      if (fs.exists(newTableLocation)) {
        throw new AnalysisException(s"Can not rename the managed table('$oldName')" +
          s". The associated location('$newTableLocation') already exists.")
      }
    }
  }
}
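Outside of the Spark codebase, the core of this check is just an existence test on the would-be default table location. A minimal standalone sketch using only the Hadoop FileSystem API (the helper name and parameters are illustrative placeholders for what `SessionCatalog` derives from the metastore):
```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Illustrative helper, not Spark internals: would the default location of the
// renamed managed table collide with an existing path under the database dir?
def targetLocationExists(databaseLocation: String, newTableName: String): Boolean = {
  // Managed tables live at <databaseLocation>/<table name>; Spark lower-cases
  // the table name when the analyzer is case-insensitive (formatTableName).
  val newTableLocation = new Path(new Path(databaseLocation), newTableName.toLowerCase)
  val fs = newTableLocation.getFileSystem(new Configuration())
  fs.exists(newTableLocation)
}
```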


@@ -441,6 +441,24 @@ abstract class DDLSuite extends QueryTest with SQLTestUtils {
    }
  }

  test("rename a managed table with existing empty directory") {
    val tableLoc = new File(spark.sessionState.catalog.defaultTablePath(TableIdentifier("tab2")))
    try {
      withTable("tab1") {
        sql(s"CREATE TABLE tab1 USING $dataSource AS SELECT 1, 'a'")
        tableLoc.mkdir()
        val ex = intercept[AnalysisException] {
          sql("ALTER TABLE tab1 RENAME TO tab2")
        }.getMessage
        val expectedMsg = "Can not rename the managed table('`tab1`'). The associated location"
        assert(ex.contains(expectedMsg))
      }
    } finally {
      waitForTasksToFinish()
      Utils.deleteRecursively(tableLoc)
    }
  }

  private def checkSchemaInCreatedDataSourceTable(
      path: File,
      userSpecifiedSchema: Option[String],