spark-instrumented-optimizer

petermaxlee 9812f7d538 [SPARK-17165][SQL] FileStreamSource should not track the list of seen files indefinitely	2016-08-26 11:30:23 -07:00

## What changes were proposed in this pull request?

Before this change, FileStreamSource used an in-memory hash set to track the list of files processed by the engine. The list could grow indefinitely, leading to OOM or overflow of the hash set.

This patch introduces a new user-defined option called "maxFileAge", which defaults to 24 hours. If a file is older than this age, FileStreamSource purges it from the in-memory map used to track the list of files that have been processed.

## How was this patch tested?

Added unit tests for the underlying utility, and also added an end-to-end test to validate the purge in FileStreamSourceSuite. Also verified that the new test cases fail when the timeout is set to a very large number.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #14728 from petermaxlee/SPARK-17165.
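The purge behaviour described in the commit message can be illustrated with a minimal sketch. This is not the code from the patch: the class name `SeenFiles`, its method names, and the choice to measure age against the newest timestamp seen so far (rather than the wall clock) are all assumptions made for illustration.

```scala
import scala.collection.mutable

// Hypothetical sketch, not Spark's actual class: tracks seen files by
// modification timestamp and forgets entries older than maxAgeMs.
final class SeenFiles(maxAgeMs: Long) {
  require(maxAgeMs >= 0, "maxAgeMs must be non-negative")

  // path -> modification timestamp in milliseconds
  private val seen = mutable.HashMap.empty[String, Long]
  // Largest timestamp observed so far; age is measured relative to this (assumption).
  private var latestTimestamp = 0L
  // Files with a timestamp below this value are treated as too old to track.
  private var purgeThreshold = 0L

  // Records a file as processed.
  def add(path: String, timestampMs: Long): Unit = {
    seen.put(path, timestampMs)
    if (timestampMs > latestTimestamp) latestTimestamp = timestampMs
  }

  // A file is new if it is not older than the threshold and has not been recorded.
  def isNew(path: String, timestampMs: Long): Boolean =
    timestampMs >= purgeThreshold && !seen.contains(path)

  // Drops entries older than maxAgeMs relative to the newest file; returns the count removed.
  def purge(): Int = {
    purgeThreshold = latestTimestamp - maxAgeMs
    val stale = seen.collect { case (path, ts) if ts < purgeThreshold => path }.toList
    seen --= stale
    stale.size
  }

  def size: Int = seen.size
}
```

With this shape the map stays bounded by the number of files newer than the cutoff. The trade-off is that a file older than the cutoff is never reported as new, even if it was never processed, which is why the threshold check sits in `isNew` as well as in `purge`.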
..
java/org/apache/spark/sql	[MINOR][SQL] Fix some typos in comments and test hints	2016-08-22 13:31:38 -07:00
resources	[SPARK-16031] Add debug-only socket source in Structured Streaming	2016-06-19 21:27:04 -07:00
scala/org/apache/spark/sql	[SPARK-17165][SQL] FileStreamSource should not track the list of seen files indefinitely	2016-08-26 11:30:23 -07:00