Large parts of the VertexSetRDD were restructured:
1) the OpenHashSet is now used as an index map
2) mapValues and mapValuesWithVertices are now lazy and view based
3) the cogroup code is currently disabled (since it is not used in any of the tests)
The GraphImpl was updated to also use the OpenHashSet and PrimitiveOpenHashMap
wherever possible:
1) the LocalVidMaps (used to track replicated vertices) are now implemented
using the OpenHashSet
2) an OpenHashMap is temporarily constructed to combine the local OpenHashSet
with the local (replicated) vertex attribute arrays
3) because the OpenHashSet constructor grabs a class manifest, all operations
that construct OpenHashSets have been moved to the GraphImpl singleton to prevent
implicit variable capture within closures.
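To illustrate the capture problem in point 3, here is a minimal sketch. It uses `ClassTag` (the modern counterpart of a class manifest), and `Holder` and `CaptureCheck` are hypothetical names, not part of GraphImpl. The assumption is that a closure which needs a class-level tag reaches it through `this`, while a closure built in the singleton sees only the tag passed to the method.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}
import scala.reflect.ClassTag

// Hypothetical stand-in for a class like GraphImpl; deliberately not Serializable.
class Holder[T: ClassTag] {
  // The ClassTag is stored in a field of this instance, so the closure
  // below has to capture `this` to reach it.
  def capturing: Int => Int = (n: Int) => new Array[T](n).length
}

object Holder {
  // Here the ClassTag is a method parameter, so the closure captures
  // only the tag and no Holder instance.
  def nonCapturing[T: ClassTag]: Int => Int = (n: Int) => new Array[T](n).length
}

object CaptureCheck {
  // Returns true if `f` and everything it captured can be Java-serialized.
  def serializes(f: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(f)
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    println(serializes((new Holder[Int]).capturing))
    println(serializes(Holder.nonCapturing[Int]))
  }
}
```

Trying to serialize each closure is a quick way to see whether it dragged the enclosing instance along, which is what matters when closures are shipped to workers.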
Add support for local:// URI scheme for addJar()
This PR adds support for a new URI scheme for SparkContext.addJar(): `local://file/path`.
The *local* scheme indicates that `/file/path` exists on every worker node. It is intended for big library JARs, which would be really expensive to serve through the standard HTTP fileserver distribution method, especially on big clusters. Today the only inexpensive way to do this (assuming the file is already on every host via, say, NFS or rsync) is to add the JAR to the SPARK_CLASSPATH, but we want a method that does not require the user to modify the Spark configuration.
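As a rough sketch of the dispatch this implies (not the code in this PR), a path could be classified by its scheme. `JarPathSketch`, `OnEveryWorker`, and `ServedByDriver` are hypothetical names, and the prefix stripping is an assumption chosen so that `local://file/path` maps to `/file/path` as described above.

```scala
import java.net.URI

object JarPathSketch {
  sealed trait JarSource
  // The file is assumed to already exist at this path on every worker.
  case class OnEveryWorker(path: String) extends JarSource
  // The file has to be shipped from the driver (e.g. via the HTTP file server).
  case class ServedByDriver(path: String) extends JarSource

  def classify(jar: String): JarSource =
    new URI(jar).getScheme match {
      // Assumption: drop "local:/" and keep one slash, so that
      // "local://file/path" becomes "/file/path".
      case "local" => OnEveryWorker(jar.stripPrefix("local:/"))
      // Any other scheme, or no scheme at all, goes through distribution.
      case _ => ServedByDriver(jar)
    }

  def main(args: Array[String]): Unit = {
    println(classify("local://file/path"))
    println(classify("/tmp/app.jar"))
  }
}
```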
I would add something to the docs, but it's not obvious where to add it.
Oh, and it would be great if this could be merged in time for 0.8.1.
Reduce the memory footprint of BlockInfo objects
This pull request reduces the memory footprint of all BlockInfo objects and makes additional optimizations for shuffle blocks. For all BlockInfo objects, these changes remove two boolean fields and one Object field. For shuffle blocks, we additionally remove an Object field and a boolean field.
When storing tens of thousands of these objects, this may add up to significant memory savings. A ShuffleBlockInfo now only needs to wrap a single long.
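One way status flags can be folded into a single long is with sentinel values. This is a sketch under assumptions, not necessarily the layout this PR uses: `CompactBlockInfo`, the sentinel values, and the method names are all hypothetical.

```scala
object BlockInfoSketch {
  // Negative sentinels (assumed values) stand in for separate boolean fields.
  private val Pending = -1L
  private val Failed = -2L

  final class CompactBlockInfo {
    // Single field: a sentinel while pending or failed,
    // otherwise the block size in bytes.
    @volatile private var state: Long = Pending

    def isPending: Boolean = state == Pending
    def isFailed: Boolean = state == Failed

    def markReady(sizeInBytes: Long): Unit = {
      require(sizeInBytes >= 0, "size must be non-negative")
      state = sizeInBytes
    }

    def markFailed(): Unit = {
      state = Failed
    }

    // The size is only meaningful once the block is ready.
    def size: Long = {
      require(state >= 0, "block is not ready")
      state
    }
  }
}
```

Because valid sizes are never negative, the two states and the size can share one field without ambiguity.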
This was motivated by a [report of high blockInfo memory usage during shuffles](https://mail-archives.apache.org/mod_mbox/incubator-spark-user/201310.mbox/%3C20131026134353.202b2b9b%40sh9%3E).
I haven't run benchmarks to measure the exact memory savings.
/cc @aarondav
Display both task ID and task attempt ID in UI, and rename taskId to taskAttemptId
Previously only the task attempt ID was shown in the UI; this was confusing because the job can be shown as complete while there are tasks still running. Showing the task ID in addition to the attempt ID makes it clear which tasks are redundant.
This commit also renames taskId to taskAttemptId in TaskInfo and in the local/cluster schedulers. This identifier was used to uniquely identify attempts, not tasks, so the old naming was confusing. The new naming is also more consistent with MapReduce.
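The distinction between the two identifiers can be sketched as follows. The `Attempt` record and `redundant` helper are hypothetical and only illustrate why seeing both IDs helps; they are not the TaskInfo fields or UI code. The sketch assumes an attempt that is no longer running has finished successfully.

```scala
object TaskAttemptSketch {
  // Hypothetical record: taskId identifies the task, taskAttemptId one run of it.
  case class Attempt(taskId: Int, taskAttemptId: Long, running: Boolean)

  // Attempts that are still running for a task that already has a
  // finished attempt, i.e. the redundant ones.
  def redundant(attempts: Seq[Attempt]): Seq[Attempt] = {
    val finishedTasks = attempts.filterNot(_.running).map(_.taskId).toSet
    attempts.filter(a => a.running && finishedTasks(a.taskId))
  }

  def main(args: Array[String]): Unit = {
    val attempts = Seq(
      Attempt(taskId = 0, taskAttemptId = 0L, running = false),
      Attempt(taskId = 1, taskAttemptId = 1L, running = false),
      Attempt(taskId = 1, taskAttemptId = 2L, running = true)
    )
    redundant(attempts).foreach(println)
  }
}
```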
Eliminate extra memory usage when shuffle file consolidation is disabled
Without this change, we hit SPARK-946 even when shuffle file consolidation is disabled.
A fix for SPARK-946 itself is still forthcoming.