[SPARK-36327][SQL] Spark sql creates staging dir inside database directory rather than creating inside table directory
### What changes were proposed in this pull request?

This PR makes a minor change in `SaveAsHiveFile.scala`: it drops the `getParent` call from the following code:

```scala
if (extURI.getScheme == "viewfs") {
  getExtTmpPathRelTo(path.getParent, hadoopConf, stagingDir)
```

### Why are the changes needed?

Hive creates `.hive-staging` directories inside the `/db/table/` location, but Spark SQL creates them inside the `/db/` location when Hadoop federation (`viewfs`) is used. Other filesystems such as HDFS behave as expected (the staging directory is created inside `/db/table/`).

In Hive:

```
beeline
> use dicedb;
> insert into table part_test partition (j=1) values (1);
...
INFO : Loading data to table dicedb.part_test partition (j=1) from viewfs://cloudera/user/daisuke/dicedb/part_test/j=1/.hive-staging_hive_2021-07-19_13-04-44_989_6775328876605030677-1/-ext-10000
```

But Spark's behaviour:

```
spark-sql> use dicedb;
spark-sql> insert into table part_test partition (j=2) values (2);
21/07/19 13:07:37 INFO FileUtils: Creating directory if it doesn't exist: viewfs://cloudera/user/daisuke/dicedb/.hive-staging_hive_2021-07-19_13-07-37_317_5083528872437596950-1
...
```

This change is needed because allowing Spark SQL to create the `.hive-staging` directory inside the `/db/` location leads to a security issue: every user who submits Spark jobs would need write permission on the `viewfs:///db/` location itself.

After this change, Spark SQL creates the `.hive-staging` directory inside `/db/table/`, matching Hive:

```
spark-sql> use dicedb;
21/07/28 00:22:47 INFO SparkSQLCLIDriver: Time taken: 0.929 seconds
spark-sql> insert into table part_test partition (j=8) values (8);
21/07/28 00:23:25 INFO HiveMetaStoreClient: Closed a connection to metastore, current connections: 1
21/07/28 00:23:26 INFO FileUtils: Creating directory if it doesn't exist: viewfs://cloudera/user/daisuke/dicedb/part_test/.hive-staging_hive_2021-07-28_00-23-26_109_4548714524589026450-1
```

The reason this issue occurs only in Spark SQL and not in Hive: Hive passes a path of the form `/db/table/tmp`, so `path.getParent` returns `/db/table/`. Spark passes `/db/table` directly, so `path.getParent` is not needed for Hadoop federation (`viewfs`).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tested manually by creating hive-sql.jar.

Closes #33577 from senthh/viewfs-792392.

Authored-by: senthilkumarb <senthilkumarb@cloudera.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
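The effect of dropping `getParent` can be illustrated with plain POSIX-style path arithmetic. This is a minimal Python sketch using `pathlib.PurePosixPath` as a stand-in for Hadoop's `org.apache.hadoop.fs.Path`; the example paths are taken from the logs above, and the point is only that calling `.parent` on a path that is already the table directory climbs one level too high, into the database directory:

```python
from pathlib import PurePosixPath

# Hive passes a path like /db/table/tmp, so taking the parent
# lands back inside the table directory -- staging goes under /db/table/.
hive_path = PurePosixPath("/user/daisuke/dicedb/part_test/tmp")
print(hive_path.parent)   # /user/daisuke/dicedb/part_test

# Spark passes /db/table directly, so taking the parent climbs
# into the database directory -- staging wrongly goes under /db/.
spark_path = PurePosixPath("/user/daisuke/dicedb/part_test")
print(spark_path.parent)  # /user/daisuke/dicedb
```

Since Spark already hands the table location to `getExtTmpPathRelTo`, passing `path` unchanged (rather than `path.getParent`) keeps the staging directory under the table directory, matching Hive.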
parent e650d06ba9
commit fe7bf5f96f
```diff
@@ -188,7 +188,7 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand {
       stagingDir: String): Path = {
     val extURI: URI = path.toUri
     if (extURI.getScheme == "viewfs") {
-      getExtTmpPathRelTo(path.getParent, hadoopConf, stagingDir)
+      getExtTmpPathRelTo(path, hadoopConf, stagingDir)
     } else {
       new Path(getExternalScratchDir(extURI, hadoopConf, stagingDir), "-ext-10000")
     }
```