spark-instrumented-optimizer/sql/core
Peter Toth b0cee9605e
[SPARK-25062][SQL] Clean up BlockLocations in InMemoryFileIndex
## What changes were proposed in this pull request?

`InMemoryFileIndex` contains a cache of `LocatedFileStatus` objects. Each `LocatedFileStatus` object can contain several `BlockLocation` objects, which may be instances of `BlockLocation` itself or of one of its subclasses. The cache is filled by listing files recursively, either on the driver or on the executors, depending on the parallel discovery threshold (`spark.sql.sources.parallelPartitionDiscovery.threshold`). If the listing happens on the executors, the block location objects are converted to plain `BlockLocation` objects to satisfy serialization requirements. If it happens on the driver, there is no conversion, so depending on the file system a `BlockLocation` object can be an instance of a subclass such as `HdfsBlockLocation` and consume more memory. This PR adds the same conversion to the driver-side case, which decreases memory consumption.
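
The conversion described above can be sketched as follows. This is a minimal illustration, not the code from this patch: `BlockLocation` and `HeavyBlockLocation` here are simplified, hypothetical stand-ins for Hadoop's `BlockLocation` and a file-system-specific subclass such as `HdfsBlockLocation`, and `BlockLocationCleanup.toPlain` is an assumed helper name. The idea is to keep an object that is already a plain `BlockLocation` and otherwise copy only the base fields into a new plain instance, so that any extra subclass state is no longer referenced by the cache.

```scala
// Simplified stand-in for Hadoop's BlockLocation (hypothetical, illustration only).
class BlockLocation(
    val names: Array[String],
    val hosts: Array[String],
    val offset: Long,
    val length: Long)

// Stand-in for a heavier file-system-specific subclass such as HdfsBlockLocation.
// `extraState` models the additional data that makes the subclass cost more memory.
class HeavyBlockLocation(
    names: Array[String],
    hosts: Array[String],
    offset: Long,
    length: Long,
    val extraState: Array[Byte])
  extends BlockLocation(names, hosts, offset, length)

object BlockLocationCleanup {
  // Return the location unchanged if it is exactly a BlockLocation;
  // otherwise copy only the base fields into a new plain instance.
  def toPlain(loc: BlockLocation): BlockLocation =
    if (loc.getClass == classOf[BlockLocation]) loc
    else new BlockLocation(loc.names, loc.hosts, loc.offset, loc.length)

  def main(args: Array[String]): Unit = {
    val heavy = new HeavyBlockLocation(
      Array("dn1:50010"), Array("dn1"), 0L, 128L, new Array[Byte](1024))
    val cleaned = toPlain(heavy)
    // The cleaned object is a plain BlockLocation with the same base fields.
    println(cleaned.getClass == classOf[BlockLocation])
    // Converting an already plain object returns the same instance.
    println(toPlain(cleaned) eq cleaned)
  }
}
```

Comparing classes exactly, rather than using `isInstanceOf`, matters in this sketch because every subclass instance is also a `BlockLocation`; an `isInstanceOf` check would never trigger the copy.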

## How was this patch tested?

Added a unit test.

Closes #22603 from peter-toth/SPARK-25062.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2018-10-06 14:50:03 -07:00
benchmarks [SPARK-25488][SQL][TEST] Refactor MiscBenchmark to use main method 2018-10-06 08:47:43 -07:00
src [SPARK-25062][SQL] Clean up BlockLocations in InMemoryFileIndex 2018-10-06 14:50:03 -07:00
pom.xml [SPARK-25592] Setting version to 3.0.0-SNAPSHOT 2018-10-02 08:48:24 -07:00