JIRA: https://issues.apache.org/jira/browse/SPARK-7800
`isDefined` is set to true twice in `Location.putNewKey`. The first assignment is unnecessary and causes a problem because it happens too early, before some of the assertion checks. For example, if an attempt with an incorrect `keyLengthBytes` marks `isDefined` as true, the location cannot be used later.
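The ordering issue can be illustrated with a minimal sketch. This is not the actual `BytesToBytesMap.Location` code: the class `LocationSketch`, its fields, and the multiple-of-8 length rule are assumptions made for illustration, and explicit exceptions stand in for the asserts so the example behaves the same with assertions disabled.

```java
// Hypothetical stand-in for a map location; not the real Spark class.
final class LocationSketch {
    private boolean isDefined = false;
    private byte[] key;
    private byte[] value;

    boolean isDefined() {
        return isDefined;
    }

    void putNewKey(byte[] key, int keyLengthBytes, byte[] value, int valueLengthBytes) {
        if (isDefined) {
            throw new IllegalStateException("Can only set value once for a key");
        }
        // Validate BEFORE touching any state. The length rule below is an
        // assumed example of a check that may reject the arguments.
        if (keyLengthBytes % 8 != 0 || valueLengthBytes % 8 != 0) {
            throw new IllegalArgumentException("Lengths must be multiples of 8 bytes");
        }
        this.key = java.util.Arrays.copyOf(key, keyLengthBytes);
        this.value = java.util.Arrays.copyOf(value, valueLengthBytes);
        // Mark the location as defined only once the write has succeeded,
        // so a rejected attempt leaves the location reusable.
        isDefined = true;
    }
}
```

If `isDefined = true` were moved above the validation, a rejected call would leave the location permanently marked as defined, and the retry with a correct length would fail the "set once" check.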
ping JoshRosen
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #6324 from viirya/dup_isdefined and squashes the following commits:
cbfe03b [Liang-Chi Hsieh] isDefined should not marked too early in putNewKey.
(cherry picked from commit 5a3c04bb92)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
This patch modifies `BytesToBytesMap.iterator()` to iterate through records in the order that they appear in the data pages rather than iterating through the hashtable pointer arrays. This results in fewer random memory accesses, significantly improving performance for scan-and-copy operations.
This is possible because our data pages are laid out as sequences of `[keyLength][data][valueLength][data]` entries. To mark the end of a partially-filled data page, we write `-1` as a special end-of-page length (`BytesToBytesMap` supports empty/zero-length keys and values, which is why we had to use a negative length).
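A minimal sketch of such a sequential scan is shown below. It uses an on-heap `ByteBuffer` as a stand-in for a data page, which is an assumption made for illustration; the class `PageScanSketch` and its helper methods are hypothetical and do not reflect the actual `BytesToBytesMap` internals.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of scanning a page in storage order.
final class PageScanSketch {
    static final int END_OF_PAGE = -1;

    /** Appends one [keyLength][key][valueLength][value] record at the current position. */
    static void putRecord(ByteBuffer page, byte[] key, byte[] value) {
        page.putInt(key.length).put(key).putInt(value.length).put(value);
    }

    /** Writes the end-of-page marker after the last record, if there is room for it. */
    static void markEnd(ByteBuffer page) {
        if (page.remaining() >= Integer.BYTES) {
            page.putInt(page.position(), END_OF_PAGE);
        }
    }

    /** Walks the page front to back and returns each record as "key=value". */
    static List<String> scan(ByteBuffer page) {
        List<String> records = new ArrayList<>();
        int offset = 0;
        while (offset + Integer.BYTES <= page.capacity()) {
            int keyLength = page.getInt(offset);
            if (keyLength == END_OF_PAGE) {
                break; // the rest of the page is unused
            }
            offset += Integer.BYTES;
            String key = readString(page, offset, keyLength);
            offset += keyLength;
            int valueLength = page.getInt(offset);
            offset += Integer.BYTES;
            String value = readString(page, offset, valueLength);
            offset += valueLength;
            records.add(key + "=" + value);
        }
        return records;
    }

    private static String readString(ByteBuffer page, int offset, int length) {
        byte[] bytes = new byte[length];
        for (int i = 0; i < length; i++) {
            bytes[i] = page.get(offset + i);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

Because a zero-length key followed by a zero-length value is a valid record, a run of zero bytes cannot signal the end of the page; the negative length is the only value that can never be mistaken for a real entry.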
This patch incorporates / closes #5836.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #6159 from JoshRosen/SPARK-7251 and squashes the following commits:
05bd90a [Josh Rosen] Compare capacity, not size, to MAX_CAPACITY
2a20d71 [Josh Rosen] Fix maximum BytesToBytesMap capacity
bc4854b [Josh Rosen] Guard against overflow when growing BytesToBytesMap
f5feadf [Josh Rosen] Add test for iterating over an empty map
273b842 [Josh Rosen] [SPARK-7251] Perform sequential scan when iterating over entries in BytesToBytesMap
(cherry picked from commit f2faa7af30)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
When on-heap memory allocation is used, ExecutorMemoryManager should maintain a cache / pool of buffers for re-use by tasks. This will significantly improve the performance of the new Tungsten sort-shuffle for jobs with many short-lived tasks by eliminating a major source of GC.
This pull request is a minimum viable implementation of this idea. In its current form, this patch significantly improves performance on a stress test which launches huge numbers of short-lived shuffle map tasks back-to-back in the same JVM.
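The pooling idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this patch: the class `BufferPoolSketch`, the use of `long[]` buffers, and pooling by exact array length are all hypothetical choices. Buffers are held through `WeakReference` so that the pool never prevents the garbage collector from reclaiming them.

```java
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map;

// Hypothetical size-keyed pool of on-heap buffers.
final class BufferPoolSketch {
    private final Map<Integer, LinkedList<WeakReference<long[]>>> poolsBySize = new HashMap<>();

    /** Returns a pooled buffer of the requested length if one survives, else a new one. */
    synchronized long[] allocate(int words) {
        LinkedList<WeakReference<long[]>> pool = poolsBySize.get(words);
        if (pool != null) {
            while (!pool.isEmpty()) {
                long[] buffer = pool.pop().get();
                if (buffer != null) {
                    // Note: a reused buffer still holds its previous contents.
                    return buffer;
                }
                // Otherwise the buffer was garbage collected; try the next one.
            }
            poolsBySize.remove(words);
        }
        return new long[words];
    }

    /** Hands a buffer back to the pool so that a later task can reuse it. */
    synchronized void free(long[] buffer) {
        poolsBySize
            .computeIfAbsent(buffer.length, k -> new LinkedList<>())
            .add(new WeakReference<>(buffer));
    }
}
```

With this shape, a task that frees a buffer and a following task that requests the same size share one allocation instead of producing a new large array per task.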
Author: Josh Rosen <joshrosen@databricks.com>
Closes #6227 from JoshRosen/SPARK-7698 and squashes the following commits:
fd6cb55 [Josh Rosen] SoftReference -> WeakReference
b154e86 [Josh Rosen] WIP sketch of pooling in ExecutorMemoryManager
(cherry picked from commit 7956dd7ab0)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>