4743 commits

Author SHA1 Message Date
Joseph E. Gonzalez bff223454a trying to address issues with GraphImpl being caught in closures. 2013-10-13 22:27:10 -07:00
Joseph E. Gonzalez f89e6e5cbf removing benchmark code 2013-10-13 20:45:01 -07:00
Joseph E. Gonzalez 141c22e28c merging in master changes 2013-10-13 20:43:23 -07:00
Joseph E. Gonzalez 637b67da56 merging changes from upstream benchmarking branch 2013-10-13 19:54:09 -07:00
Joseph E. Gonzalez 494472a6cc Integrated IndexedRDD into graph design. 2013-10-13 19:42:32 -07:00
Aaron Davidson da896115ec Change BlockId filename to name + rest of Patrick's comments 2013-10-13 11:15:02 -07:00
Aaron Davidson d60352283c Add unit test and address rest of Reynold's comments 2013-10-12 22:45:15 -07:00
Aaron Davidson a395911138 Refactor BlockId into an actual type
This is an unfortunately invasive change which converts all of our BlockId
strings into actual BlockId types. Here are some advantages of doing this now:

+ Type safety

+ Code clarity - it's now obvious what the key of a shuffle or rdd block is,
  for instance. Additionally, appearing in tuple/map type signatures is a big
  readability bonus. A Seq[(String, BlockStatus)] is not very clear.
  Further, we can now use more Scala features, like matching on BlockId types.

+ Explicit usage - we can now formally tell where various BlockIds are being used
  (without doing string searches); this makes updating current BlockIds a much
  clearer process, and compiler-supported.
  (I'm looking at you, shuffle file consolidation.)

+ It will only get harder to make this change as time goes on.

Since this touches a lot of files, it'd be best to either get this patch
in quickly or throw it on the ground to avoid too many secondary merge conflicts.
2013-10-12 22:44:57 -07:00
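
The type-safety and pattern-matching points in the message above can be pictured with a small sealed hierarchy. This is only a sketch: `RddBlock`, `ShuffleBlock`, their fields, and the `name` formats are assumptions made for illustration, not the definitions introduced by the patch.

```scala
// Illustrative only: names, fields, and string formats are assumptions.
object BlockIdSketch {
  sealed trait BlockId {
    def name: String // string form of the id, e.g. usable as a file name
  }

  final case class RddBlock(rddId: Int, partition: Int) extends BlockId {
    def name: String = s"rdd_${rddId}_$partition"
  }

  final case class ShuffleBlock(shuffleId: Int, mapId: Int, reduceId: Int) extends BlockId {
    def name: String = s"shuffle_${shuffleId}_${mapId}_$reduceId"
  }

  // Because the trait is sealed, the compiler can warn when a case is missing,
  // which a match on raw strings could never do.
  def describe(id: BlockId): String = id match {
    case RddBlock(rdd, part)   => s"partition $part of RDD $rdd"
    case ShuffleBlock(s, m, r) => s"shuffle $s output of map $m for reducer $r"
  }
}
```

With a shape like this, a signature such as `Seq[(BlockId, BlockStatus)]` says what the key is, whereas `Seq[(String, BlockStatus)]` does not.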
Reynold Xin 99796904ae Merge pull request #52 from harveyfeng/hadoop-closure
Add an optional closure parameter to HadoopRDD instantiation to use when creating local JobConfs.

Having HadoopRDD accept this optional closure eliminates the need for the HadoopFileRDD added earlier. It makes the HadoopRDD more general, in that the caller can specify any JobConf initialization flow.
2013-10-12 21:23:26 -07:00
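
A rough sketch of the optional-closure idea described above, using stand-in types so it runs without Hadoop on the classpath. `FakeJobConf` and `ConfigurableReader` are hypothetical names, not classes from the pull request, and the real HadoopRDD constructor may look quite different.

```scala
object JobConfClosureSketch {
  // Stand-in for Hadoop's JobConf so the sketch has no external dependencies.
  final class FakeJobConf {
    private val settings = scala.collection.mutable.Map.empty[String, String]
    def set(key: String, value: String): Unit = settings.update(key, value)
    def get(key: String): Option[String] = settings.get(key)
  }

  // Hypothetical reader: if a closure is supplied, it is applied to every
  // locally created configuration; if not, the configuration is left as-is.
  final class ConfigurableReader(initLocalConf: Option[FakeJobConf => Unit] = None) {
    def newLocalConf(): FakeJobConf = {
      val conf = new FakeJobConf
      initLocalConf.foreach(init => init(conf))
      conf
    }
  }
}
```

Passing `None` gives the plain behavior, while passing `Some(closure)` lets the caller choose any initialization flow without a dedicated subclass.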
Harvey Feng 6c32aab87d Remove the new HadoopRDD constructor from SparkContext API, plus some minor style changes. 2013-10-12 21:02:08 -07:00
Reynold Xin 88866ea9c9 Fixed PairRDDFunctionsSuite after removing InterruptibleRDD. 2013-10-12 20:05:23 -07:00
Reynold Xin 6b288b75d4 Job cancellation: address Matei's code review feedback. 2013-10-12 15:53:31 -07:00
jerryshao c23cd72b4b Upgrade Kafka 0.7.2 to Kafka 0.8.0-beta1 for Spark Streaming 2013-10-12 20:00:42 +08:00
Dan Crankshaw 1a961dd1f2 Fixed connected components CL params 2013-10-12 01:47:38 +00:00
Shivaram Venkataraman c441904bce Add a comment and exclude tools 2013-10-11 18:23:15 -07:00
Reynold Xin ab0940f0c2 Job cancellation: addressed code review feedback round 2 from Kay. 2013-10-11 18:15:04 -07:00
Dan Crankshaw 1e5535cfcf Added connected components back 2013-10-11 16:38:52 -07:00
Reynold Xin 97ffebbe87 Fixed dagscheduler suite because of a logging message change. 2013-10-11 16:18:22 -07:00
Reynold Xin dca80094d3 Merge pull request #54 from aoiwelle/remove_unused_imports
Remove unnecessary mutable imports

It appears that the imports aren't necessary here.
2013-10-11 16:08:15 -07:00
Dan Crankshaw 543a54dffa Tried to fix some indenting 2013-10-11 16:07:49 -07:00
Reynold Xin a61cf40ab9 Job cancellation: addressed code review feedback from Kay. 2013-10-11 15:58:14 -07:00
Dan Crankshaw c4a23f95c3 Updated code so benchmarks actually run. 2013-10-11 22:57:43 +00:00
Matei Zaharia fb25f32300 Merge pull request #53 from witgo/master
Add a zookeeper compile dependency to fix build in maven
2013-10-11 15:44:43 -07:00
Matei Zaharia d6ead47809 Merge pull request #32 from mridulm/master
Address review comments, move to incubator spark

Also includes a small fix to speculative execution.

<edit> Continued from https://github.com/mesos/spark/pull/914 </edit>
2013-10-11 15:43:01 -07:00
Reynold Xin e2047d3927 Making takeAsync and collectAsync deterministic. 2013-10-11 13:04:45 -07:00
Reynold Xin 09f7609254 Properly handle interrupted exception in FutureAction. 2013-10-11 11:20:15 -07:00
Neal Wiggins 67d4a31f87 Remove unnecessary mutable imports 2013-10-11 09:47:27 -07:00
LiGuoqiang fc60c412ab Add a zookeeper compile dependency to fix build in maven 2013-10-11 16:31:47 +08:00
Reynold Xin 42fb1df694 Merge branch 'master' of github.com:apache/incubator-spark into kill
Conflicts:
	core/src/main/scala/org/apache/spark/rdd/CoGroupedRDD.scala
2013-10-10 23:48:05 -07:00
Reynold Xin d9e724e756 Fixed the broken local scheduler test. 2013-10-10 23:08:13 -07:00
Reynold Xin 37397b73ba Added comprehensive tests for job cancellation in a variety of environments (local vs cluster, fifo vs fair). 2013-10-10 22:57:43 -07:00
Reynold Xin 80cdbf4f49 Switched to use daemon thread in executor and fixed a bug in job cancellation for fair scheduler. 2013-10-10 22:40:48 -07:00
Matei Zaharia 8f11c36fe1 Merge remote-tracking branch 'tgravescs/sparkYarnDistCache'
Closes #11

Conflicts:
	docs/running-on-yarn.md
	yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala
2013-10-10 19:34:33 -07:00
Reynold Xin 058508b625 Changed the name of the local cluster executor from local to localhost. 2013-10-10 19:24:00 -07:00
Reynold Xin ec2e2ed1e1 Use the same Executor in LocalScheduler as in ClusterScheduler. 2013-10-10 18:55:25 -07:00
Matei Zaharia c71499b779 Merge pull request #19 from aarondav/master-zk
Standalone Scheduler fault tolerance using ZooKeeper

This patch implements full distributed fault tolerance for standalone scheduler Masters.
There is only one master Leader at a time, which is actively serving scheduling
requests. If this Leader crashes, another master will eventually be elected, reconstruct
the state from the first Master, and continue serving scheduling requests.

Leader election is performed using the ZooKeeper leader election pattern. We try to minimize
the use of ZooKeeper and the assumptions about ZooKeeper's behavior, so there is a layer of
retries and session monitoring on top of the ZooKeeper client.

Master failover follows directly from the single-node Master recovery via the file
system (patch d5a96fe), save that the Master state is stored in ZooKeeper instead.

Configuration:
By default, no recovery mechanism is enabled (spark.deploy.recoveryMode = NONE).
By setting spark.deploy.recoveryMode to ZOOKEEPER and setting spark.deploy.zookeeper.url
to an appropriate ZooKeeper URL, ZooKeeper recovery mode is enabled.
By setting spark.deploy.recoveryMode to FILESYSTEM and setting spark.deploy.recoveryDirectory
to an appropriate directory accessible by the Master, we keep the behavior from d5a96fe.

Additionally, places where a Master could be specified by a spark:// url can now take
comma-delimited lists to specify backup masters. Note that this is only used for registration
of NEW Workers and application Clients. Once a Worker or Client has registered with the
Master Leader, it is "in the system" and will never need to register again.
2013-10-10 17:16:42 -07:00
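
As an illustration of the configuration described above, the sketch below collects the ZooKeeper-mode properties and splits a comma-delimited master list. The property names come from the commit message. The ZooKeeper address, the host names, and the exact URL layout (one `spark://` prefix followed by comma-separated `host:port` entries) are assumptions, and `parseMasters` is a hypothetical helper rather than a Spark function.

```scala
object RecoverySketch {
  // Property names are taken from the commit message; the values are placeholders.
  val zookeeperMode: Map[String, String] = Map(
    "spark.deploy.recoveryMode"  -> "ZOOKEEPER",
    "spark.deploy.zookeeper.url" -> "zk1.example.com:2181"
  )

  // Assumed format: a single spark:// prefix, then comma-separated host:port pairs.
  def parseMasters(url: String): Seq[String] = {
    require(url.startsWith("spark://"), s"not a spark:// URL: $url")
    url.stripPrefix("spark://").split(",").toSeq.map(_.trim).filter(_.nonEmpty)
  }
}
```

Under this assumption, a new Worker or Client given `spark://host1:7077,host2:7077` would see two candidate masters to register with.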
Harvey Feng 5a99e67894 Add an optional closure parameter to HadoopRDD instantiation to use when creating any local JobConfs. 2013-10-10 16:35:52 -07:00
Joseph E. Gonzalez fa2f87ca63 added replication and balance reporting 2013-10-10 14:48:40 -07:00
Aaron Davidson 66c20635fa Minor clarification and cleanup to spark-standalone.md 2013-10-10 14:45:12 -07:00
Joseph E. Gonzalez 5f756fb63f added support for random vertex cuts 2013-10-10 14:10:47 -07:00
Joseph E. Gonzalez 8dfac4ea8f added support for random vertex cuts 2013-10-10 14:09:01 -07:00
Dan Crankshaw 5867a824de Merge pull request #19 from dcrankshaw/master
Merge canonical 2d partitioner and group edges into benchmarks
2013-10-10 14:02:37 -07:00
Matei Zaharia cd08f73483 Merge pull request #44 from mateiz/fast-map
A fast and low-memory append-only map for shuffle operations

This is a continuation of the old repo's pull request https://github.com/mesos/spark/pull/823 to add a more efficient hashmap class for shuffles. I've optimized and tested this more thoroughly now so I think it's good to go. I've also addressed some of the comments that were outstanding there.

The idea is to reduce the cost of shuffles by taking advantage of the properties their hashmaps need. In particular, the hashmaps there are append-only, and a common operation is updating a key's value based on the old value. The included AppendOnlyMap class uses open addressing to use less space than Java's HashMap (by not having a linked list per bucket), does not support deletes, and has a changeValue operation to update a key in place without walking the probe sequence twice. In micro-benchmarks against java.util.HashMap and scala.collection.mutable.HashMap, this is 20-30% smaller and 10-40% faster depending on the number and type of keys. It's also noticeably faster than fastutil's Object2ObjectOpenHashMap.

I've also tested this in Spark apps now. While the speed gain is modest (partly due to other overheads, like serialization), there is some, and I think the lower memory usage is worth it. Here's one example where the speedup is most noticeable, in spark-shell on local mode:
```
scala> val nums = sc.parallelize(1 to 8).flatMap(x => (1 to 5e6.toInt)).cache

scala> nums.count

scala> def time(x: => Unit) = { val now = System.currentTimeMillis; x; System.currentTimeMillis - now }

scala> (1 to 8).map(_ => time(nums.map(x => (x % 100000, x)).reduceByKey(_ + _).count) / 1000.0)
```

This prints the following times before and after this change:
```
Before: Vector(4.368, 2.635, 2.549, 2.522, 2.233, 2.222, 2.214, 2.195)

After: Vector(3.588, 1.741, 1.706, 1.648, 1.777, 1.81, 1.776, 1.731)
```

I've also run the spark-perf suite, enhanced with some tests that use Ints (https://github.com/amplab/spark-perf/pull/9), and it shows some speedup on those, but less on the string ones (presumably due to existing overhead): https://gist.github.com/mateiz/6897121.
2013-10-10 13:55:47 -07:00
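
To make the design described above concrete, here is a minimal sketch of an append-only map that keeps entries in one flat array instead of per-bucket linked lists and updates a key in a single lookup. It is not the AppendOnlyMap from the pull request: the class name `TinyAppendOnlyMap`, the `Option`-based `changeValue` signature, linear probing, the 0.7 load factor, and the lack of null-key support are all simplifying assumptions.

```scala
object AppendOnlySketch {
  // Slot i holds its key at data(2 * i) and its value at data(2 * i + 1).
  // A null key marks an empty slot; there is no delete operation.
  final class TinyAppendOnlyMap[K <: AnyRef, V](initialCapacity: Int = 64) {
    require(initialCapacity > 0 && (initialCapacity & (initialCapacity - 1)) == 0,
      "capacity must be a power of two")

    private var capacity = initialCapacity
    private var data = new Array[AnyRef](2 * capacity)
    private var count = 0

    def size: Int = count

    // Insert or update with one probe sequence. `update` receives Some(old value)
    // if the key is present, or None on first insertion.
    def changeValue(key: K, update: Option[V] => V): V = {
      var pos = key.hashCode & (capacity - 1)
      while (data(2 * pos) != null && data(2 * pos) != key) {
        pos = (pos + 1) & (capacity - 1) // linear probing
      }
      val isNew = data(2 * pos) == null
      val old = if (isNew) None else Some(data(2 * pos + 1).asInstanceOf[V])
      val updated = update(old)
      data(2 * pos) = key
      data(2 * pos + 1) = updated.asInstanceOf[AnyRef]
      if (isNew) {
        count += 1
        if (count * 10 > capacity * 7) grow() // keep the load factor at or below 0.7
      }
      updated
    }

    def get(key: K): Option[V] = {
      var pos = key.hashCode & (capacity - 1)
      while (data(2 * pos) != null) {
        if (data(2 * pos) == key) return Some(data(2 * pos + 1).asInstanceOf[V])
        pos = (pos + 1) & (capacity - 1)
      }
      None
    }

    // Double the table and re-insert every entry at its new position.
    private def grow(): Unit = {
      val oldData = data
      val oldCapacity = capacity
      capacity *= 2
      data = new Array[AnyRef](2 * capacity)
      var i = 0
      while (i < oldCapacity) {
        val k = oldData(2 * i)
        if (k != null) {
          var pos = k.hashCode & (capacity - 1)
          while (data(2 * pos) != null) pos = (pos + 1) & (capacity - 1)
          data(2 * pos) = k
          data(2 * pos + 1) = oldData(2 * i + 1)
        }
        i += 1
      }
    }
  }
}
```

A reduceByKey-style aggregation then needs one call per record, for example `m.changeValue(word, old => old.getOrElse(0) + 1)`, rather than a `get` followed by a `put`.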
Dan Crankshaw 9929e7b9a5 Merge branch 'benchmarks' of github.com:amplab/graphx 2013-10-10 13:36:51 -07:00
Dan Crankshaw 4b46d519db Merge pull request #17 from amplab/product2
product 2 change
2013-10-10 13:35:36 -07:00
Reynold Xin 357733d292 Rename kill -> cancel in user facing API / documentation. 2013-10-10 13:27:38 -07:00
Matei Zaharia 001d13f7b9 Merge branch 'master' into fast-map
Conflicts:
	core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala
2013-10-10 13:26:43 -07:00
Reynold Xin ddf64f019f Support job cancellation in multi-pool scheduler. 2013-10-10 13:20:27 -07:00
Reynold Xin 3bd2890d2b Fixed the deadlock situation in multi-job actions and added more unit tests. 2013-10-10 12:07:09 -07:00
Aaron Davidson 42d8b8efe6 Address Matei's comments on documentation
Updates to the documentation and changing some logError()s to logWarning()s.
2013-10-10 00:33:47 -07:00