Commit graph

4955 commits

Author SHA1 Message Date
Aaron Davidson 84991a1b91 Memory-optimized shuffle file consolidation
Overhead of each shuffle block for consolidation has been reduced from >300 bytes
to 8 bytes (1 primitive Long). Verified via profiler testing with 1 mil shuffle blocks,
net overhead was ~8,400,000 bytes.

Despite the memory-optimized implementation incurring extra CPU overhead, the runtime
of the shuffle phase in this test was only around 2% slower, while the reduce phase
was 40% faster, when compared to not using any shuffle file consolidation.
2013-11-03 21:34:13 -08:00
Reynold Xin b5dc3393a5 Merge pull request #70 from rxin/hash1
Fast, memory-efficient hash set, hash table implementations optimized for primitive data types.

This pull request adds two hash table implementations optimized for primitive data types. For primitive types, the new hash tables are much faster than the current Spark AppendOnlyMap (3X faster; note that the current AppendOnlyMap is already much better than the Java map) while using much less space (1/4 of the space).

Details:

This PR first adds an open hash set implementation (OpenHashSet) optimized for primitive types (using Scala's specialization feature). This OpenHashSet is designed to serve as a building block for more advanced structures. It is currently used to build the following two hash tables, but can be used in the future to build multi-valued hash tables as well (GraphX has this use case). Note that there are some peculiarities in the code for working around some Scala compiler bugs.

Building on top of OpenHashSet, this PR adds two different hash table implementations:
1. OpenHashMap: for nullable keys, optional specialization for primitive values
2. PrimitiveKeyOpenHashMap: for primitive keys that are not nullable, and optional specialization for primitive values

I tested the update speed of these two implementations using the changeValue function (which is what Aggregator and cogroup would use). Runtime relative to AppendOnlyMap for inserting 10 million items:

Int to Int: ~30%
java.lang.Integer to java.lang.Integer: ~100%
Int to java.lang.Integer: ~50%
java.lang.Integer to Int: ~85%
2013-11-03 20:43:15 -08:00
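
The open hash set described in the pull request above can be illustrated with a minimal sketch. This is not Spark's OpenHashSet: the class name `IntOpenHashSetSketch`, the 0.7 load factor, the hash mixing constant, and the use of `java.util.BitSet` to mark occupied slots are all assumptions made for illustration, and the sketch is fixed to `Int` keys instead of using Scala's specialization feature. It only shows why storing keys unboxed in a primitive array with linear probing avoids per-entry object overhead.

```scala
import java.util.{BitSet => JBitSet}

// Hypothetical sketch, not Spark's OpenHashSet: an append-only,
// open-addressing set of Int keys. Keys are stored unboxed in an Array[Int];
// a bit set records which slots are occupied.
final class IntOpenHashSetSketch(initialCapacity: Int = 16) {
  require(initialCapacity > 0 && (initialCapacity & (initialCapacity - 1)) == 0,
    "capacity must be a power of two")

  private var capacity = initialCapacity
  private var mask = capacity - 1
  private var keys = new Array[Int](capacity)
  private var used = new JBitSet(capacity)
  private var count = 0

  def size: Int = count

  // Mix the bits so that keys sharing low bits spread over the table.
  private def hash(k: Int): Int = {
    val h = k * 0x1B873593
    h ^ (h >>> 16)
  }

  // Linear probing: returns the slot that holds k, or the first free slot.
  // Terminates because the table is never allowed to fill up completely.
  private def findSlot(k: Int): Int = {
    var pos = hash(k) & mask
    while (used.get(pos) && keys(pos) != k) {
      pos = (pos + 1) & mask
    }
    pos
  }

  def contains(k: Int): Boolean = used.get(findSlot(k))

  def add(k: Int): Unit = {
    val pos = findSlot(k)
    if (!used.get(pos)) {
      keys(pos) = k
      used.set(pos)
      count += 1
      if (count > capacity * 0.7) grow() // assumed load factor
    }
  }

  // Double the table and re-insert every occupied slot.
  private def grow(): Unit = {
    val oldKeys = keys
    val oldUsed = used
    capacity *= 2
    mask = capacity - 1
    keys = new Array[Int](capacity)
    used = new JBitSet(capacity)
    var i = oldUsed.nextSetBit(0)
    while (i >= 0) {
      val pos = findSlot(oldKeys(i))
      keys(pos) = oldKeys(i)
      used.set(pos)
      i = oldUsed.nextSetBit(i + 1)
    }
  }
}
```

A hash map can be layered on such a set by keeping a parallel values array indexed by the slot position, which is roughly the "building block" role the pull request describes.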
Reynold Xin eb5f8a3f97 Code review feedback. 2013-11-03 18:11:44 -08:00
Reynold Xin 1e9543b567 Fixed a bug that used twice the amount of memory for the primitive arrays due to a Scala compiler bug.
Also addressed Matei's code review comment.
2013-11-02 23:19:01 -07:00
Reynold Xin da6bb0aedd Merge branch 'master' into hash1 2013-11-02 22:45:15 -07:00
Reynold Xin 41ead7a745 Merge pull request #133 from Mistobaan/link_fix
update default github
2013-11-02 14:41:50 -07:00
Reynold Xin d407c0732a Merge pull request #134 from rxin/readme
Fixed a typo in Hadoop version in README.
2013-11-02 14:36:37 -07:00
Reynold Xin 895747bb05 Fixed a typo in Hadoop version in README. 2013-11-02 12:58:44 -07:00
Fabrizio (Misto) Milo 4b5d61f31f update default github 2013-11-01 18:41:49 -07:00
Reynold Xin e7c7b804b5 Merge pull request #132 from Mistobaan/doc_fix
fix persistent-hdfs
2013-11-01 17:58:10 -07:00
Fabrizio (Misto) Milo 3f89354c45 fix persistent-hdfs 2013-11-01 17:47:37 -07:00
Matei Zaharia d6d11c2edb Merge pull request #129 from velvia/2013-11/document-local-uris
Document & finish support for local: URIs

Document all the supported URI schemes for addJar / addFile on the Cluster Overview page.
Add support for local: URI to addFile.
2013-11-01 15:40:33 -07:00
Dan Crankshaw d87d112b2c Merge branch 'master' of github.com:amplab/graphx 2013-11-01 12:04:09 -07:00
Evan Chan f3679fd494 Add local: URI support to addFile as well 2013-11-01 11:08:03 -07:00
Evan Chan e54a37fe15 Document all the URIs for addJar/addFile 2013-11-01 10:58:11 -07:00
Reynold Xin 99bfcc91e0 Merge pull request #46 from jegonzal/VertexSetWithHashSet
Switched VertexSetRDD and GraphImpl to use OpenHashSet
2013-10-31 21:38:10 -07:00
Joseph E. Gonzalez db89ac4bc8 Changing var to val for keySet in OpenHashMaps 2013-10-31 21:19:26 -07:00
Joseph E. Gonzalez e7d37472b8 After some testing I realized that the IndexedSeq is still instantiating the array (not maintaining a view) so I have replaced all IndexedSeq[V] with (Int => V) 2013-10-31 21:09:39 -07:00
Joseph E. Gonzalez 63311d9c72 renamed update to setMerge 2013-10-31 20:12:30 -07:00
Dan Crankshaw e218e30b52 Merge branch 'master' of github.com:amplab/graphx 2013-10-31 19:54:17 -07:00
Dan Crankshaw 0a61cafba8 Added logging to Graph, GraphLab, and Pregel. 2013-10-31 19:54:06 -07:00
Joseph E. Gonzalez 7f58440334 Merge branch 'master' of https://github.com/amplab/graphx into VertexSetWithHashSet 2013-10-31 18:30:50 -07:00
Reynold Xin fcaaf86803 Merge pull request #44 from jegonzal/rxinBitSet
Switching VertexSetRDD to use @rxin BitSet and OpenHash
2013-10-31 18:27:30 -07:00
Joseph E. Gonzalez 8381aeffb3 This commit introduces the OpenHashSet and OpenHashMap as indexing primitives.
Large parts of the VertexSetRDD were restructured to take advantage of:

  1) the OpenHashSet as an index map
  2) view based lazy mapValues and mapValuesWithVertices
  3) the cogroup code is currently disabled (since it is not used in any of the tests)

The GraphImpl was updated to also use the OpenHashSet and PrimitiveKeyOpenHashMap
wherever possible:

  1) the LocalVidMaps (used to track replicated vertices) are now implemented
     using the OpenHashSet
  2) an OpenHashMap is temporarily constructed to combine the local OpenHashSet
     with the local (replicated) vertex attribute arrays
  3) because the OpenHashSet constructor grabs a class manifest all operations
     that construct OpenHashSets have been moved to the GraphImpl Singleton to prevent
     implicit variable capture within closures.
2013-10-31 18:13:02 -07:00
Joseph E. Gonzalez 4ad58e2b9a This commit makes three changes to the (PrimitiveKey)OpenHashMap
  1) _keySet  --renamed--> keySet
  2) keySet and _values are made externally accessible
  3) added an update function which merges duplicate values
2013-10-31 18:09:42 -07:00
Dan Crankshaw b3bcfc09c7 Merge branch 'master' of github.com:amplab/graphx 2013-10-31 18:03:00 -07:00
Joseph E. Gonzalez d74ad4ebc9 Adding ability to access local BitSet and to safely get a value at a given position 2013-10-31 18:01:34 -07:00
Joseph E. Gonzalez aeb773fa47 Merging with upstream master. 2013-10-31 10:12:12 -07:00
Reynold Xin 3f3c727bc5 Merge pull request #41 from jegonzal/LineageTracking
Optimizing Graph Lineage
2013-10-31 09:52:25 -07:00
Reynold Xin 944f6b8048 Merge pull request #43 from amplab/FixBitSetCastException
Fix BitSet cast exception
2013-10-31 09:40:35 -07:00
Joseph E. Gonzalez d6b5122532 Switching to the @rxin BitSet implementation for VertexSet Value tables. 2013-10-31 01:44:24 -07:00
Joseph E. Gonzalez 51aff8ddcf Adding logical AND/OR, setUntil, and iterators to the BitSet. 2013-10-31 01:43:50 -07:00
Dan Crankshaw c430d2e21d Added bitset to kryo register 2013-10-31 01:01:59 -07:00
Joseph E. Gonzalez a6267df25f Merge branch 'hash1' of https://github.com/rxin/incubator-spark into rxinBitSet 2013-10-30 23:24:33 -07:00
Dan Crankshaw 37b4afbbf9 Merge branch 'cleanup' 2013-10-30 23:17:50 -07:00
Joseph E. Gonzalez a3ce484a2c Adding additional type constraints to VertexSetRDD to help diagnose issues with recent benchmarks. 2013-10-30 21:02:21 -07:00
Matei Zaharia 8f1098a3f0 Merge pull request #117 from stephenh/avoid_concurrent_modification_exception
Handle ConcurrentModificationExceptions in SparkContext init.

System.getProperties.toMap will fail-fast when concurrently modified,
and it seems like some other thread started by SparkContext does
a System.setProperty during its initialization.

Handle this by just looping on ConcurrentModificationException, which
seems the safest, since the non-fail-fast methods (Hashtable.entrySet)
have undefined behavior under concurrent modification.
2013-10-30 20:11:48 -07:00
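
The "loop on ConcurrentModificationException" approach described in the commit above might look like the following sketch. The object and method names (`RetrySketch`, `retryOnConcurrentModification`, `snapshotSystemProperties`) are hypothetical and are not taken from the Spark source; the snapshot is built with `stringPropertyNames` here purely to keep the example self-contained.

```scala
import java.util.ConcurrentModificationException

object RetrySketch {
  // Hypothetical helper: re-runs `body` until it completes without a
  // ConcurrentModificationException. Any other exception propagates.
  def retryOnConcurrentModification[T](body: => T): T = {
    var result: Option[T] = None
    while (result.isEmpty) {
      try {
        result = Some(body)
      } catch {
        case _: ConcurrentModificationException => () // another thread raced us; try again
      }
    }
    result.get
  }

  // Copies the JVM system properties into an immutable Scala Map,
  // retrying if a concurrent System.setProperty interrupts the copy.
  def snapshotSystemProperties(): Map[String, String] =
    retryOnConcurrentModification {
      val props = System.getProperties
      val builder = Map.newBuilder[String, String]
      val names = props.stringPropertyNames().iterator()
      while (names.hasNext) {
        val key = names.next()
        val value = props.getProperty(key)
        if (value != null) builder += (key -> value)
      }
      builder.result()
    }
}
```

The loop is unbounded, matching the "just looping" wording of the commit message; a bounded retry count would be a reasonable variation if the modifications could be sustained.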
Joseph E. Gonzalez 09ea661bbb removing completely unnecessary map operation. 2013-10-30 20:07:26 -07:00
Joseph E. Gonzalez 003f8a505d Removing potential additional shuffle dependency where an already partitioned RDD[(Vid, VD)] is repartitioned. 2013-10-30 20:06:54 -07:00
Joseph E. Gonzalez d513addb77 added lineage tracking code 2013-10-30 20:05:29 -07:00
Matei Zaharia dc9ce16f6b Merge pull request #126 from kayousterhout/local_fix
Fixed incorrect log message in local scheduler

This change is especially relevant at the moment, because some users are seeing this failure, and the log message is misleading/incorrect (because for the tests, the max failures is set to 0, not 4)
2013-10-30 17:01:56 -07:00
Matei Zaharia 33de11c51d Merge pull request #124 from tgravescs/sparkHadoopUtilFix
Pull SparkHadoopUtil out of SparkEnv (jira SPARK-886)

Having the logic to initialize the correct SparkHadoopUtil in SparkEnv prevents it from being used until after the SparkContext is initialized. This causes issues like https://spark-project.atlassian.net/browse/SPARK-886. It also makes it hard to use in singleton objects. For instance, I want to use it in the security code.
2013-10-30 16:58:27 -07:00
Joseph E. Gonzalez a4b8ddf417 removing unused commented code 2013-10-30 16:07:05 -07:00
Ankur Dave 5064f9b2d2 Merge remote-tracking branch 'spark-upstream/master'
Conflicts:
	project/SparkBuild.scala
2013-10-30 15:59:09 -07:00
Dan Crankshaw a0c86c3689 Merge pull request #38 from jegonzal/Documentation
Improving Documentation
2013-10-30 15:34:39 -07:00
Kay Ousterhout ff038eb4e0 Fixed incorrect log message in local scheduler 2013-10-30 15:27:23 -07:00
Dan Crankshaw e1099f4d89 Fixed issue with canonical edge partitioner. 2013-10-30 15:03:21 -07:00
Matei Zaharia 618c1f6cf3 Merge pull request #125 from velvia/2013-10/local-jar-uri
Add support for local:// URI scheme for addJars()

This PR adds support for a new URI scheme for SparkContext.addJars(): `local://file/path`.
The *local* scheme indicates that `/file/path` exists on every worker node. It exists for big library JARs, which would be really expensive to serve using the standard HTTP fileserver distribution method, especially on big clusters. Today the only inexpensive way of doing this (assuming such a file is on every host via, say, NFS or rsync) is to add the JAR to the SPARK_CLASSPATH, but we want a method where the user does not need to modify the Spark configuration.

I would add something to the docs, but it's not obvious where to add it.

Oh, and it would be great if this could be merged in time for 0.8.1.
2013-10-30 12:03:44 -07:00
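
The scheme-based dispatch described in the pull request above can be sketched as follows. This is an illustration only, not the Spark implementation: the `JarUriSketch` object, the `JarSource` types, and the treatment of schemes other than `local` and `file` are assumptions. The only behavior taken from the description is that a `local` path is assumed to exist on every worker node, while ordinary files are served by the standard HTTP fileserver.

```scala
import java.net.URI

object JarUriSketch {
  // Hypothetical result types describing how a JAR would reach the workers.
  sealed trait JarSource
  case class AlreadyOnWorkers(path: String) extends JarSource // local: nothing to ship
  case class ServeFromDriver(path: String) extends JarSource  // shipped via the HTTP fileserver
  case class FetchDirectly(uri: String) extends JarSource     // assumed: workers fetch it themselves

  def classifyJar(path: String): JarSource = {
    val uri = new URI(path)
    uri.getScheme match {
      case "local"       => AlreadyOnWorkers(uri.getPath)
      case null | "file" => ServeFromDriver(uri.getPath)
      case _             => FetchDirectly(path)
    }
  }
}
```

For example, `JarUriSketch.classifyJar("local:/opt/libs/big.jar")` yields `AlreadyOnWorkers("/opt/libs/big.jar")`, so nothing needs to be copied from the driver.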
Stephen Haberman 09f3b677cb Avoid match errors when filtering for spark.hadoop settings. 2013-10-30 12:29:39 -05:00
tgravescs f231aaa24c move the hadoopJobMetadata back into SparkEnv 2013-10-30 11:46:12 -05:00