Large parts of the VertexSetRDD were restructured:
1) the OpenHashSet is now used as the index map
2) mapValues and mapValuesWithVertices are now lazy and view-based
3) the cogroup code is currently disabled (since it is not used in any of the tests)
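As a rough illustration of what "view-based lazy" means in point 2, the sketch below uses a plain Scala collection view rather than the actual VertexSetRDD code. The object name, the `expensive` function, and the counter are hypothetical and exist only to make the deferred evaluation visible.

```scala
object LazyMapValuesSketch {
  // Counts how many times the mapping function has actually run.
  var evaluations = 0

  // Hypothetical stand-in for a user-supplied mapValues function.
  def expensive(x: Int): Int = { evaluations += 1; x * 10 }

  val values: Vector[Int] = Vector(1, 2, 3)

  // No work happens here: the view only records the pending transformation.
  val mapped = values.view.map(expensive)

  // Forcing the view applies the function to each element.
  def force(): Vector[Int] = mapped.toVector
}
```

Until `force()` is called the mapping function is never invoked, so a chain of such transformations need not materialize intermediate value arrays.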
The GraphImpl was updated to also use the OpenHashSet and PrimitiveOpenHashMap
wherever possible:
1) the LocalVidMaps (used to track replicated vertices) are now implemented
using the OpenHashSet
2) an OpenHashMap is temporarily constructed to combine the local OpenHashSet
with the local (replicated) vertex attribute arrays
3) because the OpenHashSet constructor grabs a class manifest, all operations
that construct OpenHashSets have been moved to the GraphImpl singleton to prevent
implicit variable capture within closures.
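The index-map idea in points 1 and 2 can be sketched as follows. This is not Spark's OpenHashSet: `LongIndexSet`, `IndexMapSketch`, and their methods are hypothetical, simplified stand-ins (fixed capacity, linear probing, no resizing). Because the sketch is specialized to `Long` keys it needs no class manifest; it only mirrors the pattern of keeping set construction in a singleton and pairing slot positions with a separate attribute array.

```scala
// Hypothetical open-addressing set of vertex ids that exposes slot positions.
final class LongIndexSet(capacity: Int) {
  require(capacity > 0 && (capacity & (capacity - 1)) == 0,
    "capacity must be a power of two")
  private val mask = capacity - 1
  private val keys = new Array[Long](capacity)
  private val used = new Array[Boolean](capacity)
  private var count = 0

  // Linear probing: stop at the slot holding k or at the first empty slot.
  private def slot(k: Long): Int = {
    var pos = (k ^ (k >>> 32)).toInt & mask
    while (used(pos) && keys(pos) != k) pos = (pos + 1) & mask
    pos
  }

  /** Adds k if absent and returns its slot position. */
  def add(k: Long): Int = {
    val pos = slot(k)
    if (!used(pos)) {
      // Always keep one slot empty so that probing terminates.
      require(count < capacity - 1, "set is full")
      keys(pos) = k; used(pos) = true; count += 1
    }
    pos
  }

  /** Returns the slot of k, or -1 if k is not in the set. */
  def getPos(k: Long): Int = {
    val pos = slot(k)
    if (used(pos)) pos else -1
  }
}

// Construction lives in a singleton, outside of any closure.
object IndexMapSketch {
  def buildIndex(ids: Seq[Long], capacity: Int): LongIndexSet = {
    val set = new LongIndexSet(capacity)
    ids.foreach(id => set.add(id))
    set
  }

  // Combine the index with a parallel attribute array to act like a map.
  def lookup[V](index: LongIndexSet, values: Array[V], id: Long): Option[V] = {
    val pos = index.getPos(id)
    if (pos >= 0) Some(values(pos)) else None
  }
}
```

A caller would build the index once, allocate an attribute array of the same capacity, and read or write attributes at `index.getPos(id)`.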
Add support for local:// URI scheme for addJars()
This PR adds support for a new URI scheme for SparkContext.addJars(): `local://file/path`.
The *local* scheme indicates that `/file/path` exists on every worker node. It is intended for large library JARs, which are expensive to serve through the standard HTTP fileserver distribution method, especially on large clusters. Today the only inexpensive alternative (assuming the file is already on every host via, say, NFS or rsync) is to add the JAR to the SPARK_CLASSPATH, but we want a method that does not require the user to modify the Spark configuration.
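A minimal sketch of how a path could be classified by scheme is shown below. It is not the code in this PR: `JarPathSketch`, `JarSource`, and `classify` are hypothetical names. The sketch relies on `java.net.URI`, which treats the text immediately after `//` as an authority, so the example uses the empty-authority form `local:///path`.

```scala
import java.net.URI

object JarPathSketch {
  sealed trait JarSource
  // JAR already present at this path on every worker; nothing to ship.
  case class OnEveryWorker(path: String) extends JarSource
  // JAR must be distributed from the driver (e.g. via the HTTP file server).
  case class ServedByDriver(path: String) extends JarSource

  // Hypothetical dispatch on the URI scheme of the string passed by the user.
  def classify(jar: String): JarSource = {
    val uri = new URI(jar)
    uri.getScheme match {
      case "local"       => OnEveryWorker(uri.getPath)
      case null | "file" => ServedByDriver(uri.getPath)
      case other =>
        throw new IllegalArgumentException("unsupported scheme: " + other)
    }
  }
}
```

With this shape, `classify("local:///opt/libs/big.jar")` yields a worker-local path, while a bare path such as `/tmp/app.jar` falls through to the distributed case.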
I would add something to the docs, but it's not obvious where to add it.
Oh, and it would be great if this could be merged in time for 0.8.1.
Reduce the memory footprint of BlockInfo objects
This pull request reduces the memory footprint of all BlockInfo objects and makes additional optimizations for shuffle blocks. For all BlockInfo objects, these changes remove two boolean fields and one Object field. For shuffle blocks, we additionally remove an Object field and a boolean field.
When storing tens of thousands of these objects, this may add up to significant memory savings. A ShuffleBlockInfo now only needs to wrap a single long.
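One way a single long can replace separate boolean and size fields is to reserve negative values as state sentinels. The description above does not spell out the encoding, so the sketch below is an assumption: the class name, the sentinel values, and the method names are all hypothetical.

```scala
object ShuffleBlockInfoSketch {
  // Hypothetical sentinels: negative values encode state,
  // non-negative values are the block size in bytes.
  val Pending: Long = -1L
  val Failed: Long = -2L
}

final class ShuffleBlockInfoSketch {
  import ShuffleBlockInfoSketch._

  // The only field: state and size share one long.
  @volatile private var state: Long = Pending

  def isPending: Boolean = state == Pending
  def isFailed: Boolean = state == Failed

  /** Size in bytes once the block is ready, otherwise None. */
  def size: Option[Long] = {
    val s = state // read the volatile field once
    if (s >= 0) Some(s) else None
  }

  def markReady(sizeInBytes: Long): Unit = {
    require(sizeInBytes >= 0, "size must be non-negative")
    state = sizeInBytes
  }

  def markFailure(): Unit = { state = Failed }
}
```

The pending and failed flags are derived from the stored value instead of occupying their own fields, which is where the per-object saving would come from under this assumption.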
This was motivated by a [report of high blockInfo memory usage during shuffles](https://mail-archives.apache.org/mod_mbox/incubator-spark-user/201310.mbox/%3C20131026134353.202b2b9b%40sh9%3E).
I haven't run benchmarks to measure the exact memory savings.
/cc @aarondav