[SPARK-36327][SQL] Spark sql creates staging dir inside database directory rather than creating inside table directory

### What changes were proposed in this pull request?

This PR makes a minor change in `SaveAsHiveFile.scala`:

1. Drop `getParent` from the following code:

```scala
if (extURI.getScheme == "viewfs") {
  getExtTmpPathRelTo(path.getParent, hadoopConf, stagingDir)
```

### Why are the changes needed?

Hive creates `.staging` directories inside the `/db/table/` location, but Spark SQL creates them inside the `/db/` location when Hadoop federation (`viewfs`) is used. For other filesystems such as HDFS, Spark works as expected and creates `.staging` inside the `/db/table/` location.

In Hive:
```
 beeline
> use dicedb;
> insert into table part_test partition (j=1) values (1);
...
INFO  : Loading data to table dicedb.part_test partition (j=1) from **viewfs://cloudera/user/daisuke/dicedb/part_test/j=1/.hive-staging_hive_2021-07-19_13-04-44_989_6775328876605030677-1/-ext-10000**
```

But Spark's behavior:

```
spark-sql> use dicedb;
spark-sql> insert into table part_test partition (j=2) values (2);
21/07/19 13:07:37 INFO FileUtils: Creating directory if it doesn't exist: **viewfs://cloudera/user/daisuke/dicedb/.hive-staging_hive_2021-07-19_13-07-37_317_5083528872437596950-1**
...
```

This change is needed because, if Spark SQL is allowed to create the `.staging` directory inside the `/db/` location, we end up with security issues: permission on the `viewfs:///db/` location would have to be granted to every user who submits Spark jobs.

After this change is applied, Spark SQL creates `.staging` inside `/db/table/`, similar to Hive, as shown below:

```
spark-sql> use dicedb;
21/07/28 00:22:47 INFO SparkSQLCLIDriver: Time taken: 0.929 seconds
spark-sql> insert into table part_test partition (j=8) values (8);
21/07/28 00:23:25 INFO HiveMetaStoreClient: Closed a connection to metastore, current connections: 1
21/07/28 00:23:26 INFO FileUtils: Creating directory if it doesn't exist: **viewfs://cloudera/user/daisuke/dicedb/part_test/.hive-staging_hive_2021-07-28_00-23-26_109_4548714524589026450-1**
```

The reason this issue occurs in Spark SQL but not in Hive:

In Hive, a `/db/table/tmp` directory structure is passed as the path, so `path.getParent` returns `/db/table/`. Spark passes just `/db/table`, so `path.getParent` is not needed for Hadoop federation (`viewfs`).
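The effect of `getParent` on a `viewfs` table location can be seen with a minimal sketch using Hadoop's `Path` API (illustrative only, not part of the patch; the table location is taken from the logs above):

```scala
import org.apache.hadoop.fs.Path

object StagingPathDemo {
  def main(args: Array[String]): Unit = {
    // Table location as it appears in the logs above.
    val tablePath = new Path("viewfs://cloudera/user/daisuke/dicedb/part_test")

    // Before the fix: the staging dir was resolved against the parent,
    // i.e. the database directory.
    println(tablePath.getParent) // viewfs://cloudera/user/daisuke/dicedb

    // After the fix: the staging dir is resolved against the table
    // directory itself, matching Hive's behavior.
    println(tablePath) // viewfs://cloudera/user/daisuke/dicedb/part_test
  }
}
```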

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

Tested manually by building hive-sql.jar.

Closes #33577 from senthh/viewfs-792392.

Authored-by: senthilkumarb <senthilkumarb@cloudera.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

```diff
@@ -188,7 +188,7 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand {
       stagingDir: String): Path = {
     val extURI: URI = path.toUri
     if (extURI.getScheme == "viewfs") {
-      getExtTmpPathRelTo(path.getParent, hadoopConf, stagingDir)
+      getExtTmpPathRelTo(path, hadoopConf, stagingDir)
     } else {
       new Path(getExternalScratchDir(extURI, hadoopConf, stagingDir), "-ext-10000")
     }
```
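With `path.getParent` removed, `getExtTmpPathRelTo` resolves the staging directory against the table location itself on `viewfs`, producing the `/db/table/.hive-staging...` layout shown in the logs above.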