spark-instrumented-optimizer

main	[SPARK-18679][SQL] Fix regression in file listing performance for non-catalog tables	2016-12-02 20:59:39 +08:00
test	[SPARK-18666][WEB UI] Remove the codes checking deprecated config spark.sql.unsafe.enabled	2016-12-01 01:57:58 -08:00

Eric Liang 294163ee93 [SPARK-18679][SQL] Fix regression in file listing performance for non-catalog tables

## What changes were proposed in this pull request?

In Spark 2.1 ListingFileCatalog was significantly refactored (and renamed to InMemoryFileIndex). This introduced a regression where parallelism could only be introduced at the very top of the tree. However, in many cases (e.g. `spark.read.parquet(topLevelDir)`), the top of the tree is only a single directory.

This PR simplifies and fixes the parallel recursive listing code to allow parallelism to be introduced at any level during recursive descent (though note that once we decide to list a sub-tree in parallel, the sub-tree is listed in serial on executors).
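The listing strategy described above can be illustrated with a small sketch. This is not Spark's implementation: the in-memory `TREE`, the `PARALLEL_THRESHOLD` value, and the function names `list_serial`, `list_files`, and `list_all` are all hypothetical, and a thread pool stands in for a distributed job. It only shows the shape of the idea: descend recursively on the driver, fan out at whatever level has enough sub-directories, and list each fanned-out sub-tree serially.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory file tree: a dict is a directory, None marks a file.
# The top level holds a single directory, as in spark.read.parquet(topLevelDir).
TREE = {
    "table": {
        "_SUCCESS": None,
        "date=1": {"a.parquet": None, "b.parquet": None},
        "date=2": {"c.parquet": None},
        "date=3": {"d.parquet": None},
    }
}

# Hypothetical cut-off: fan out once a directory has this many sub-directories.
PARALLEL_THRESHOLD = 2


def list_serial(path, node):
    """List every file under `node` on a single worker, with no further fan-out."""
    files = []
    for name, child in sorted(node.items()):
        child_path = f"{path}/{name}"
        if isinstance(child, dict):
            files.extend(list_serial(child_path, child))
        else:
            files.append(child_path)
    return files


def list_files(path, node, pool, threshold=PARALLEL_THRESHOLD):
    """Recursive listing that may introduce parallelism at any depth."""
    entries = sorted(node.items())
    files = [f"{path}/{name}" for name, child in entries if not isinstance(child, dict)]
    subdirs = [(f"{path}/{name}", child) for name, child in entries if isinstance(child, dict)]

    if len(subdirs) >= threshold:
        # Enough sub-trees at this level: list them in parallel,
        # each one serially on its own worker.
        for result in pool.map(lambda d: list_serial(*d), subdirs):
            files.extend(result)
    else:
        # Too few sub-trees to be worth a fan-out: keep descending on the caller.
        for sub_path, sub_node in subdirs:
            files.extend(list_files(sub_path, sub_node, pool, threshold))
    return files


def list_all(root, tree):
    """Convenience wrapper that owns the worker pool."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list_files(root, tree, pool)
```

Calling `list_all("/data", TREE)` descends through the single top-level directory on the caller and only fans out one level down, where the three `date=` partitions sit. Workers only ever run `list_serial`, so they never submit nested work to the pool.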

cc mallman cloud-fan

## How was this patch tested?

Checked metrics in unit tests.

Author: Eric Liang <ekl@databricks.com>

Closes #16112 from ericl/spark-18679.

2016-12-02 20:59:39 +08:00
