spark-instrumented-optimizer

History

Tathagata Das cb0e9b0980 [SPARK-4518][SPARK-4519][Streaming] Refactored file stream to prevent files from being processed multiple times Because of a corner case, a file already selected for batch t can get considered again for batch t+2. This refactoring fixes it by remembering all the files selected in the last 1 minute, so that this corner case does not arise. Also uses spark context's hadoop configuration to access the file system API for listing directories. pwendell Please take look. I still have not run long-running integration tests, so I cannot say for sure whether this has indeed solved the issue. You could do a first pass on this in the meantime. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #3419 from tdas/filestream-fix2 and squashes the following commits: c19dd8a [Tathagata Das] Addressed PR comments. 513b608 [Tathagata Das] Updated docs. d364faf [Tathagata Das] Added the current time condition back 5526222 [Tathagata Das] Removed unnecessary imports. 38bb736 [Tathagata Das] Fix long line. 203bbc7 [Tathagata Das] Un-ignore tests. eaef4e1 [Tathagata Das] Fixed SPARK-4519 9dbd40a [Tathagata Das] Refactored FileInputDStream to remember last few batches.	2014-11-24 13:50:20 -08:00
..
src	[SPARK-4518][SPARK-4519][Streaming] Refactored file stream to prevent files from being processed multiple times	2014-11-24 13:50:20 -08:00
pom.xml	Bumping version to 1.3.0-SNAPSHOT.	2014-11-18 21:24:18 -08:00

Tathagata Das cb0e9b0980 [SPARK-4518][SPARK-4519][Streaming] Refactored file stream to prevent files from being processed multiple times

Because of a corner case, a file already selected for batch t can get considered again for batch t+2. This refactoring fixes it by remembering all the files selected in the last 1 minute, so that this corner case does not arise. Also uses spark context's hadoop configuration to access the file system API for listing directories.

pwendell Please take look. I still have not run long-running integration tests, so I cannot say for sure whether this has indeed solved the issue. You could do a first pass on this in the meantime.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #3419 from tdas/filestream-fix2 and squashes the following commits:

c19dd8a [Tathagata Das] Addressed PR comments.
513b608 [Tathagata Das] Updated docs.
d364faf [Tathagata Das] Added the current time condition back
5526222 [Tathagata Das] Removed unnecessary imports.
38bb736 [Tathagata Das] Fix long line.
203bbc7 [Tathagata Das] Un-ignore tests.
eaef4e1 [Tathagata Das] Fixed SPARK-4519
9dbd40a [Tathagata Das] Refactored FileInputDStream to remember last few batches.

2014-11-24 13:50:20 -08:00

src

[SPARK-4518][SPARK-4519][Streaming] Refactored file stream to prevent files from being processed multiple times

2014-11-24 13:50:20 -08:00

pom.xml

Bumping version to 1.3.0-SNAPSHOT.

2014-11-18 21:24:18 -08:00