Commit graph

4726 commits

Author SHA1 Message Date
Reynold Xin 64ad3b18d9 Merge branch 'master' into rxin
Conflicts:
	graph/src/main/scala/org/apache/spark/graph/impl/GraphImpl.scala
2013-11-07 19:23:42 -08:00
Reynold Xin 2406bf33e4 Use custom serializer for aggregation messages when the data type is int/double. 2013-11-07 19:18:58 -08:00
Ankur Dave 6ee05be1c8 Merge pull request #49 from jegonzal/graphxshell
GraphX Console with Logo Text
2013-11-07 19:12:41 -08:00
Ankur Dave a9f96b54e4 Merge pull request #56 from jegonzal/PregelAPIChanges
Changing Pregel API to use mapReduceTriplets instead of aggregateNeighbors
2013-11-07 18:56:56 -08:00
Joseph E. Gonzalez e9308e0e75 Changing Pregel API to operate directly on edge triplets in SendMessage rather than (Vid, EdgeTriplet) pairs. 2013-11-07 18:04:06 -08:00
Reynold Xin 5907137d11 Merge pull request #54 from amplab/rxin
Converted for loops to while loops in EdgePartition.
2013-11-07 16:58:31 -08:00
Reynold Xin 6fadff2b92 Converted for loops to while loops in EdgePartition. 2013-11-07 16:54:33 -08:00
Reynold Xin edf41647f4 Merge pull request #53 from amplab/rxin
Added GraphX to classpath.
2013-11-07 16:22:43 -08:00
Reynold Xin 95f1f5315e Added GraphX to classpath. 2013-11-07 16:22:05 -08:00
Reynold Xin c379e10455 Merge pull request #51 from jegonzal/VertexSetRDD
Reverting to Array based (materialized) output in VertexSetRDD
2013-11-07 16:01:47 -08:00
Joseph E. Gonzalez 8ac15e8e43 Merge branch 'master' of https://github.com/amplab/graphx into graphxshell 2013-11-05 01:37:12 -08:00
Joseph E. Gonzalez 3e504938c2 merging upstream changes 2013-11-05 01:36:48 -08:00
Joey ca44b5134a Merge pull request #50 from amplab/mergemerge
Merge Spark master into graphx
2013-11-05 01:32:55 -08:00
Joseph E. Gonzalez 2dc9ec2387 Reverting to Array based (materialized) output of all VertexSetRDD operations. 2013-11-05 01:15:12 -08:00
Reynold Xin 551a43fd3d Merge branch 'master' of github.com:apache/incubator-spark into mergemerge
Conflicts:
	README.md
	core/src/main/scala/org/apache/spark/util/collection/OpenHashMap.scala
	core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala
	core/src/main/scala/org/apache/spark/util/collection/PrimitiveKeyOpenHashMap.scala
2013-11-04 21:02:36 -08:00
Joseph E. Gonzalez 3c37928fab This commit adds a new graphx-shell which is essentially the same as
the spark shell but with GraphX packages automatically imported
and with Kryo serialization enabled for GraphX types.

In addition the graphx-shell has a nifty new logo.

To make these changes minimally invasive in the SparkILoop.scala
I added some additional environment variables:

   SPARK_BANNER_TEXT: If set this string is displayed instead
   of the spark logo

   SPARK_SHELL_INIT_BLOCK: if set this expression is evaluated in the
   spark shell after the spark context is created.
2013-11-04 20:10:15 -08:00
Reynold Xin 7a26104ab7 Merge pull request #130 from aarondav/shuffle
Memory-optimized shuffle file consolidation

Reduces overhead of each shuffle block for consolidation from >300 bytes to 8 bytes (1 primitive Long). Verified via profiler testing with 1 mil shuffle blocks, net overhead was ~8,400,000 bytes.

Despite the memory-optimized implementation incurring extra CPU overhead, the runtime of the shuffle phase in this test was only around 2% slower, while the reduce phase was 40% faster, when compared to not using any shuffle file consolidation.

This is accomplished by replacing the map from ShuffleBlockId to FileSegment (i.e., block id to where it's located), which had high overhead due to being a gigantic, timestamped, concurrent map with a more space-efficient structure. Namely, the following are introduced (I have omitted the word "Shuffle" from some names for clarity):
**ShuffleFile** - there is one ShuffleFile per consolidated shuffle file on disk. We store an array of offsets into the physical shuffle file for each ShuffleMapTask that wrote into the file. This is sufficient to reconstruct FileSegments for mappers that are in the file.
**FileGroup** - contains a set of ShuffleFiles, one per reducer, that a MapTask can use to write its output. There is one FileGroup created per _concurrent_ MapTask. The FileGroup contains an array of the mapIds that have been written to all files in the group. The positions of elements in this array map directly onto the positions in each ShuffleFile's offsets array.

In order to locate the FileSegment associated with a BlockId, we have another structure which maps each reducer to the set of ShuffleFiles that were created for it. (There will be as many ShuffleFiles per reducer as there are FileGroups.) To lookup a given ShuffleBlockId (shuffleId, reducerId, mapId), we thus search through all ShuffleFiles associated with that reducer.

As a time optimization, we ensure that FileGroups are only reused for MapTasks with monotonically increasing mapIds. This allows us to perform a binary search to locate a mapId inside a group, and also enables potential future optimization (based on the usual monotonic access order).
2013-11-04 17:54:06 -08:00
Aaron Davidson 1ba11b1c6a Minor cleanup in ShuffleBlockManager 2013-11-04 17:16:41 -08:00
Aaron Davidson 6201e5e249 Refactor ShuffleBlockManager to reduce public interface
- ShuffleBlocks has been removed and replaced by ShuffleWriterGroup.
- ShuffleWriterGroup no longer contains a reference to a ShuffleFileGroup.
- ShuffleFile has been removed and its contents are now within ShuffleFileGroup.
- ShuffleBlockManager.forShuffle has been replaced by a more stateful forMapTask.
2013-11-04 09:41:04 -08:00
Aaron Davidson b0cf19fe3c Add javadoc and remove unused code 2013-11-03 22:16:58 -08:00
Aaron Davidson 39d93ed4b9 Clean up test files properly
For some reason, even calling
java.nio.Files.createTempDirectory().getFile.deleteOnExit()
does not delete the directory on exit. Guava's analagous function
seems to work, however.
2013-11-03 21:52:59 -08:00
Aaron Davidson a0bb569a81 use OpenHashMap, remove monotonicity requirement, fix failure bug 2013-11-03 21:34:56 -08:00
Aaron Davidson 8703898d3f Address Reynold's comments 2013-11-03 21:34:44 -08:00
Aaron Davidson 3ca52309f2 Fix test breakage 2013-11-03 21:34:44 -08:00
Aaron Davidson 1592adfa25 Add documentation and address other comments 2013-11-03 21:34:44 -08:00
Aaron Davidson 7d44dec9bd Fix weird bug with specialized PrimitiveVector 2013-11-03 21:34:43 -08:00
Aaron Davidson 7453f31181 Address minor comments 2013-11-03 21:34:43 -08:00
Aaron Davidson 84991a1b91 Memory-optimized shuffle file consolidation
Overhead of each shuffle block for consolidation has been reduced from >300 bytes
to 8 bytes (1 primitive Long). Verified via profiler testing with 1 mil shuffle blocks,
net overhead was ~8,400,000 bytes.

Despite the memory-optimized implementation incurring extra CPU overhead, the runtime
of the shuffle phase in this test was only around 2% slower, while the reduce phase
was 40% faster, when compared to not using any shuffle file consolidation.
2013-11-03 21:34:13 -08:00
Reynold Xin b5dc3393a5 Merge pull request #70 from rxin/hash1
Fast, memory-efficient hash set, hash table implementations optimized for primitive data types.

This pull request adds two hash table implementations optimized for primitive data types. For primitive types, the new hash tables are much faster than the current Spark AppendOnlyMap (3X faster - note that the current AppendOnlyMap is already much better than the Java map) while uses much less space (1/4 of the space).

Details:

This PR first adds a open hash set implementation (OpenHashSet) optimized for primitive types (using Scala's specialization feature). This OpenHashSet is designed to serve as building blocks for more advanced structures. It is currently used to build the following two hash tables, but can be used in the future to build multi-valued hash tables as well (GraphX has this use case). Note that there are some peculiarities in the code for working around some Scala compiler bugs.

Building on top of OpenHashSet, this PR adds two different hash tables implementations:
1. OpenHashSet: for nullable keys, optional specialization for primitive values
2. PrimitiveKeyOpenHashMap: for primitive keys that are not nullable, and optional specialization for primitive values

I tested the update speed of these two implementations using the changeValue function (which is what Aggregator and cogroup would use). Runtime relative to AppendOnlyMap for inserting 10 million items:

Int to Int: ~30%
java.lang.Integer to java.lang.Integer: ~100%
Int to java.lang.Integer: ~50%
java.lang.Integer to Int: ~85%
2013-11-03 20:43:15 -08:00
Reynold Xin eb5f8a3f97 Code review feedback. 2013-11-03 18:11:44 -08:00
Reynold Xin 1e9543b567 Fixed a bug that uses twice amount of memory for the primitive arrays due to a scala compiler bug.
Also addressed Matei's code review comment.
2013-11-02 23:19:01 -07:00
Reynold Xin da6bb0aedd Merge branch 'master' into hash1 2013-11-02 22:45:15 -07:00
Reynold Xin 41ead7a745 Merge pull request #133 from Mistobaan/link_fix
update default github
2013-11-02 14:41:50 -07:00
Reynold Xin d407c0732a Merge pull request #134 from rxin/readme
Fixed a typo in Hadoop version in README.
2013-11-02 14:36:37 -07:00
Reynold Xin 895747bb05 Fixed a typo in Hadoop version in README. 2013-11-02 12:58:44 -07:00
Fabrizio (Misto) Milo 4b5d61f31f update default github 2013-11-01 18:41:49 -07:00
Reynold Xin e7c7b804b5 Merge pull request #132 from Mistobaan/doc_fix
fix persistent-hdfs
2013-11-01 17:58:10 -07:00
Fabrizio (Misto) Milo 3f89354c45 fix persistent-hdfs 2013-11-01 17:47:37 -07:00
Matei Zaharia d6d11c2edb Merge pull request #129 from velvia/2013-11/document-local-uris
Document & finish support for local: URIs

Review all the supported URI schemes for addJar / addFile to the Cluster Overview page.
Add support for local: URI to addFile.
2013-11-01 15:40:33 -07:00
Evan Chan f3679fd494 Add local: URI support to addFile as well 2013-11-01 11:08:03 -07:00
Evan Chan e54a37fe15 Document all the URIs for addJar/addFile 2013-11-01 10:58:11 -07:00
Reynold Xin 99bfcc91e0 Merge pull request #46 from jegonzal/VertexSetWithHashSet
Switched VertexSetRDD and GraphImpl to use OpenHashSet
2013-10-31 21:38:10 -07:00
Joseph E. Gonzalez db89ac4bc8 Changing var to val for keySet in OpenHashMaps 2013-10-31 21:19:26 -07:00
Joseph E. Gonzalez e7d37472b8 After some testing I realized that the IndexedSeq is still instantiating the array (not maintaining a view) so I have replaced all IndexedSeq[V] with (Int => V) 2013-10-31 21:09:39 -07:00
Joseph E. Gonzalez 63311d9c72 renamed update to setMerge 2013-10-31 20:12:30 -07:00
Joseph E. Gonzalez 7f58440334 Merge branch 'master' of https://github.com/amplab/graphx into VertexSetWithHashSet 2013-10-31 18:30:50 -07:00
Reynold Xin fcaaf86803 Merge pull request #44 from jegonzal/rxinBitSet
Switching to VertexSetRDD to use @rxin BitSet and OpenHash
2013-10-31 18:27:30 -07:00
Joseph E. Gonzalez 8381aeffb3 This commit introduces the OpenHashSet and OpenHashMap as indexing primitives.
Large parts of the VertexSetRDD were restructured to take advantage of:

  1) the OpenHashSet as an index map
  2) view based lazy mapValues and mapValuesWithVertices
  3) the cogroup code is currently disabled (since it is not used in any of the tests)

The GraphImpl was updated to also use the OpenHashSet and PrimitiveOpenHashMap
wherever possible:

  1) the LocalVidMaps (used to track replicated vertices) are now implemented
     using the OpenHashSet
  2) an OpenHashMap is temporarily constructed to combine the local OpenHashSet
     with the local (replicated) vertex attribute arrays
  3) because the OpenHashSet constructor grabs a class manifest all operations
     that construct OpenHashSets have been moved to the GraphImpl Singleton to prevent
     implicit variable capture within closures.
2013-10-31 18:13:02 -07:00
Joseph E. Gonzalez 4ad58e2b9a This commit makes three changes to the (PrimitiveKey)OpenHashMap
1) _keySet  --renamed--> keySet
  2) keySet and _values are made externally accessible
  3) added an update function which merges duplicate values
2013-10-31 18:09:42 -07:00
Joseph E. Gonzalez d74ad4ebc9 Adding ability to access local BitSet and to safely get a value at a given position 2013-10-31 18:01:34 -07:00