spark-instrumented-optimizer/external
Terry Kim 3d1dce75d9 [SPARK-32621][SQL] 'path' option can cause issues while inferring schema in CSV/JSON datasources
### What changes were proposed in this pull request?

When CSV/JSON datasources infer schema (e.g, `def inferSchema(files: Seq[FileStatus])`, they use the `files` along with the original options. `files` in `inferSchema` could have been deduced from the "path" option if the option was present, so this can cause issues (e.g., reading more data, listing the path again) since the "path" option is **added** to the `files`.

### Why are the changes needed?

The current behavior can cause the following issue:
```scala
class TestFileFilter extends PathFilter {
  override def accept(path: Path): Boolean = path.getParent.getName != "p=2"
}

val path = "/tmp"
val df = spark.range(2)
df.write.json(path + "/p=1")
df.write.json(path + "/p=2")

val extraOptions = Map(
  "mapred.input.pathFilter.class" -> classOf[TestFileFilter].getName,
  "mapreduce.input.pathFilter.class" -> classOf[TestFileFilter].getName
)

// This works fine.
assert(spark.read.options(extraOptions).json(path).count == 2)

// The following with "path" option fails with the following:
// assertion failed: Conflicting directory structures detected. Suspicious paths
//	file:/tmp
//	file:/tmp/p=1
assert(spark.read.options(extraOptions).format("json").option("path", path).load.count() === 2)
```

### Does this PR introduce _any_ user-facing change?

Yes, the above failure doesn't happen and you get the consistent experience when you use `spark.read.csv(path)` or `spark.read.format("csv").option("path", path).load`.

### How was this patch tested?

Updated existing tests.

Closes #29437 from imback82/path_bug.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-19 16:23:22 +00:00
..
avro [SPARK-32621][SQL] 'path' option can cause issues while inferring schema in CSV/JSON datasources 2020-08-19 16:23:22 +00:00
docker [SPARK-32353][TEST] Update docker/spark-test and clean up unused stuff 2020-07-17 12:05:45 -07:00
docker-integration-tests [SPARK-32576][SQL][TEST][FOLLOWUP] Add tests for all the character array types in PostgresIntegrationSuite 2020-08-10 19:05:50 +09:00
kafka-0-10 [SPARK-29292][STREAMING][SQL][BUILD] Get streaming, catalyst, sql compiling for Scala 2.13 2020-07-14 02:06:50 -07:00
kafka-0-10-assembly [SPARK-30950][BUILD] Setting version to 3.1.0-SNAPSHOT 2020-02-25 19:44:31 -08:00
kafka-0-10-sql [SPARK-32568][BUILD][SS] Upgrade Kafka to 2.6.0 2020-08-08 10:31:36 +09:00
kafka-0-10-token-provider [SPARK-30874][SQL] Support Postgres Kerberos login in JDBC connector 2020-03-12 19:04:35 -07:00
kinesis-asl [SPARK-32398][TESTS][CORE][STREAMING][SQL][ML] Update to scalatest 3.2.0 for Scala 2.13.3+ 2020-07-23 16:20:17 -07:00
kinesis-asl-assembly [SPARK-30950][BUILD] Setting version to 3.1.0-SNAPSHOT 2020-02-25 19:44:31 -08:00
spark-ganglia-lgpl [SPARK-30950][BUILD] Setting version to 3.1.0-SNAPSHOT 2020-02-25 19:44:31 -08:00