spark-instrumented-optimizer/sql/core/src/main
petermaxlee 9812f7d538 [SPARK-17165][SQL] FileStreamSource should not track the list of seen files indefinitely
## What changes were proposed in this pull request?
Before this change, FileStreamSource used an in-memory hash set to track the files already processed by the engine. The set could grow without bound, eventually leading to an OOM or overflow of the hash set.

This patch introduces a new user-configurable option called "maxFileAge", which defaults to 24 hours. If a file is older than this age, FileStreamSource purges it from the in-memory map used to track the files that have been processed.
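
The purge behavior described above can be illustrated with a minimal Python sketch. This is not the patch's code: the class name `SeenFiles`, its methods, and the choice to measure age against the newest timestamp seen so far (rather than wall-clock time) are all assumptions made for illustration.

```python
class SeenFiles:
    """Hypothetical tracker of processed files with age-based purging."""

    def __init__(self, max_age_ms):
        if max_age_ms < 0:
            raise ValueError("max_age_ms must be non-negative")
        self.max_age_ms = max_age_ms
        self._files = {}          # path -> file timestamp in milliseconds
        self._latest_ts = 0       # newest timestamp added so far
        self._last_purge_ts = 0   # files older than this were purged

    def __len__(self):
        return len(self._files)

    def add(self, path, timestamp_ms):
        """Record a processed file and advance the newest-timestamp mark."""
        self._files[path] = timestamp_ms
        if timestamp_ms > self._latest_ts:
            self._latest_ts = timestamp_ms

    def is_new_file(self, path, timestamp_ms):
        """A file is new only if it is not tracked and not older than the
        last purge threshold (assumption: purged files count as seen)."""
        return timestamp_ms >= self._last_purge_ts and path not in self._files

    def purge(self):
        """Drop entries older than (newest timestamp - max age).
        Returns the number of entries removed."""
        self._last_purge_ts = self._latest_ts - self.max_age_ms
        stale = [p for p, ts in self._files.items() if ts < self._last_purge_ts]
        for p in stale:
            del self._files[p]
        return len(stale)
```

In this sketch, treating files older than the purge threshold as already seen keeps a purged file from being processed again, at the cost of ignoring files that arrive with a very old timestamp.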

## How was this patch tested?
Added unit tests for the underlying utility and an end-to-end test in FileStreamSourceSuite to validate the purge. Also verified that the new test cases fail when the timeout is set to a very large number.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #14728 from petermaxlee/SPARK-17165.
2016-08-26 11:30:23 -07:00
java/org/apache/spark/sql [MINOR][SQL] Fix some typos in comments and test hints 2016-08-22 13:31:38 -07:00
resources [SPARK-16031] Add debug-only socket source in Structured Streaming 2016-06-19 21:27:04 -07:00
scala/org/apache/spark/sql [SPARK-17165][SQL] FileStreamSource should not track the list of seen files indefinitely 2016-08-26 11:30:23 -07:00