ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Dan Crankshaw	1a961dd1f2	Fixed connected components CL params	2013-10-12 01:47:38 +00:00
Shivaram Venkataraman	c441904bce	Add a comment and exclude tools	2013-10-11 18:23:15 -07:00
Reynold Xin	ab0940f0c2	Job cancellation: addressed code review feedback round 2 from Kay.	2013-10-11 18:15:04 -07:00
Dan Crankshaw	1e5535cfcf	Added connected components back	2013-10-11 16:38:52 -07:00
Reynold Xin	97ffebbe87	Fixed dagscheduler suite because of a logging message change.	2013-10-11 16:18:22 -07:00
Reynold Xin	dca80094d3	Merge pull request #54 from aoiwelle/remove_unused_imports Remove unnecessary mutable imports It appears that the imports aren't necessary here.	2013-10-11 16:08:15 -07:00
Dan Crankshaw	543a54dffa	Tried to fix some indenting	2013-10-11 16:07:49 -07:00
Reynold Xin	a61cf40ab9	Job cancellation: addressed code review feedback from Kay.	2013-10-11 15:58:14 -07:00
Dan Crankshaw	c4a23f95c3	Updated code so benchmarks actually run.	2013-10-11 22:57:43 +00:00
Matei Zaharia	fb25f32300	Merge pull request #53 from witgo/master Add a zookeeper compile dependency to fix build in maven Add a zookeeper compile dependency to fix build in maven	2013-10-11 15:44:43 -07:00
Matei Zaharia	d6ead47809	Merge pull request #32 from mridulm/master Address review comments, move to incubator spark Also includes a small fix to speculative execution. <edit> Continued from https://github.com/mesos/spark/pull/914 </edit>	2013-10-11 15:43:01 -07:00
Reynold Xin	e2047d3927	Making takeAsync and collectAsync deterministic.	2013-10-11 13:04:45 -07:00
Reynold Xin	09f7609254	Properly handle interrupted exception in FutureAction.	2013-10-11 11:20:15 -07:00
Neal Wiggins	67d4a31f87	Remove unnecessary mutable imports	2013-10-11 09:47:27 -07:00
LiGuoqiang	fc60c412ab	Add a zookeeper compile dependency to fix build in maven	2013-10-11 16:31:47 +08:00
Reynold Xin	42fb1df694	Merge branch 'master' of github.com:apache/incubator-spark into kill Conflicts: core/src/main/scala/org/apache/spark/rdd/CoGroupedRDD.scala	2013-10-10 23:48:05 -07:00
Reynold Xin	d9e724e756	Fixed the broken local scheduler test.	2013-10-10 23:08:13 -07:00
Reynold Xin	37397b73ba	Added comprehensive tests for job cancellation in a variety of environments (local vs cluster, fifo vs fair).	2013-10-10 22:57:43 -07:00
Reynold Xin	80cdbf4f49	Switched to use daemon thread in executor and fixed a bug in job cancellation for fair scheduler.	2013-10-10 22:40:48 -07:00
Matei Zaharia	8f11c36fe1	Merge remote-tracking branch 'tgravescs/sparkYarnDistCache' Closes #11 Conflicts: docs/running-on-yarn.md yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala	2013-10-10 19:34:33 -07:00
Reynold Xin	058508b625	Changed the name of the local cluster executor from local to localhost.	2013-10-10 19:24:00 -07:00
Reynold Xin	ec2e2ed1e1	Use the same Executor in LocalScheduler as in ClusterScheduler.	2013-10-10 18:55:25 -07:00
Matei Zaharia	c71499b779	Merge pull request #19 from aarondav/master-zk Standalone Scheduler fault tolerance using ZooKeeper This patch implements full distributed fault tolerance for standalone scheduler Masters. There is only one master Leader at a time, which is actively serving scheduling requests. If this Leader crashes, another master will eventually be elected, reconstruct the state from the first Master, and continue serving scheduling requests. Leader election is performed using the ZooKeeper leader election pattern. We try to minimize the use of ZooKeeper and the assumptions about ZooKeeper's behavior, so there is a layer of retries and session monitoring on top of the ZooKeeper client. Master failover follows directly from the single-node Master recovery via the file system (patch `d5a96fe`), save that the Master state is stored in ZooKeeper instead. Configuration: By default, no recovery mechanism is enabled (spark.deploy.recoveryMode = NONE). By setting spark.deploy.recoveryMode to ZOOKEEPER and setting spark.deploy.zookeeper.url to an appropriate ZooKeeper URL, ZooKeeper recovery mode is enabled. By setting spark.deploy.recoveryMode to FILESYSTEM and setting spark.deploy.recoveryDirectory to an appropriate directory accessible by the Master, we will keep the behavior of from `d5a96fe`. Additionally, places where a Master could be specificied by a spark:// url can now take comma-delimited lists to specify backup masters. Note that this is only used for registration of NEW Workers and application Clients. Once a Worker or Client has registered with the Master Leader, it is "in the system" and will never need to register again.	2013-10-10 17:16:42 -07:00
Harvey Feng	5a99e67894	Add an optional closure parameter to HadoopRDD instantiation to used when creating any local JobConfs.	2013-10-10 16:35:52 -07:00
Joseph E. Gonzalez	fa2f87ca63	added replication and balance reporting	2013-10-10 14:48:40 -07:00
Aaron Davidson	66c20635fa	Minor clarification and cleanup to spark-standalone.md	2013-10-10 14:45:12 -07:00
Joseph E. Gonzalez	5f756fb63f	added support for random vertex cuts	2013-10-10 14:10:47 -07:00
Joseph E. Gonzalez	8dfac4ea8f	added support for random vertex cuts	2013-10-10 14:09:01 -07:00
Dan Crankshaw	5867a824de	Merge pull request #19 from dcrankshaw/master Merge canonical 2d partitioner and group edges into benchmarks	2013-10-10 14:02:37 -07:00
Matei Zaharia	cd08f73483	Merge pull request #44 from mateiz/fast-map A fast and low-memory append-only map for shuffle operations This is a continuation of the old repo's pull request https://github.com/mesos/spark/pull/823 to add a more efficient hashmap class for shuffles. I've optimized and tested this more thoroughly now so I think it's good to go. I've also addressed some of the comments that were outstanding there. The idea is to reduce the cost of shuffles by taking advantage of the properties their hashmaps need. In particular, the hashmaps there are append-only, and a common operation is updating a key's value based on the old value. The included AppendOnlyMap class uses open hashing to use less space than Java's (by not having a linked list per bucket), does not support deletes, and has a changeValue operation to update a key in place without following the hash chain twice. In micro-benchmarks against java.util.HashMap and scala.collection.mutable.HashMap, this is 20-30% smaller and 10-40% faster depending on the number and type of keys. It's also noticeably faster than fastutil's Object2ObjectOpenHashMap. I've also tested this in Spark apps now. While the speed gain is modest (partly due to other overheads, like serialization), there is some, and I think the lower memory usage is worth it. Here's one example where the speedup is most noticeable, in spark-shell on local mode: ``` scala> val nums = sc.parallelize(1 to 8).flatMap(x => (1 to 5e6.toInt)).cache scala> nums.count scala> def time(x: => Unit) = { val now = System.currentTimeMillis; x; System.currentTimeMillis - now } scala> (1 to 8).map(_ => time(nums.map(x => (x % 100000, x)).reduceByKey(_ + _).count) / 1000.0) ``` This prints the following times before and after this change: ``` Before: Vector(4.368, 2.635, 2.549, 2.522, 2.233, 2.222, 2.214, 2.195) After: Vector(3.588, 1.741, 1.706, 1.648, 1.777, 1.81, 1.776, 1.731) ``` I've also run the spark-perf suite, enhanced with some tests that use Ints (https://github.com/amplab/spark-perf/pull/9), and it shows some speedup on those, but less on the string ones (presumably due to existing overhead): https://gist.github.com/mateiz/6897121.	2013-10-10 13:55:47 -07:00
Dan Crankshaw	9929e7b9a5	Merge branch 'benchmarks' of github.com:amplab/graphx	2013-10-10 13:36:51 -07:00
Dan Crankshaw	4b46d519db	Merge pull request #17 from amplab/product2 product 2 change	2013-10-10 13:35:36 -07:00
Reynold Xin	357733d292	Rename kill -> cancel in user facing API / documentation.	2013-10-10 13:27:38 -07:00
Matei Zaharia	001d13f7b9	Merge branch 'master' into fast-map Conflicts: core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala	2013-10-10 13:26:43 -07:00
Reynold Xin	ddf64f019f	Support job cancellation in multi-pool scheduler.	2013-10-10 13:20:27 -07:00
Reynold Xin	3bd2890d2b	Fixed the deadlock situation in multi-job actions and added more unit tests.	2013-10-10 12:07:09 -07:00
Aaron Davidson	42d8b8efe6	Address Matei's comments on documentation Updates to the documentation and changing some logError()s to logWarning()s.	2013-10-10 00:33:47 -07:00
Reynold Xin	0353f74a9a	Put the job cancellation handling into the dagscheduler's main event loop.	2013-10-10 00:28:00 -07:00
Reynold Xin	dbae7795ba	Merge branch 'master' of github.com:apache/incubator-spark into kill Conflicts: core/src/main/scala/org/apache/spark/CacheManager.scala core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala core/src/main/scala/org/apache/spark/scheduler/DAGSchedulerSource.scala	2013-10-09 22:57:35 -07:00
Reynold Xin	53895f9cde	Implemented FutureAction, FutureJob, CancellablePromise. Implemented more unit tests for async actions.	2013-10-09 22:43:06 -07:00
Reynold Xin	320418f7c8	Merge pull request #49 from mateiz/kryo-fix-2 Fix Chill serialization of Range objects It used to write out each element one by one, creating very large objects.	2013-10-09 16:55:30 -07:00
Reynold Xin	215238cb39	Merge pull request #50 from kayousterhout/SPARK-908 Fix race condition in SparkListenerSuite (fixes SPARK-908).	2013-10-09 16:49:44 -07:00
Matei Zaharia	c84c205289	Fix Chill serialization of Range objects, which used to write out each element, and register user and Spark classes before Chill's serializers to let them override Chill's behavior in general.	2013-10-09 16:23:40 -07:00
Kay Ousterhout	36966f65df	Style fixes	2013-10-09 15:36:34 -07:00
Kay Ousterhout	3f7e9b265c	Fixed comment to use javadoc style	2013-10-09 15:23:04 -07:00
Kay Ousterhout	a34a4e8174	Fix race condition in SparkListenerSuite (fixes SPARK-908).	2013-10-09 15:07:53 -07:00
Matei Zaharia	7827efc87b	Merge pull request #46 from mateiz/py-sort-update Fix PySpark docs and an overly long line of code after #38 Just noticed these after merging that commit (https://github.com/apache/incubator-spark/pull/38).	2013-10-09 15:07:25 -07:00
Patrick Wendell	7b3ae04ea7	Merge pull request #45 from pwendell/metrics_units Use standard abbreviation in metrics description (MBytes -> MB) This is a small change - older commits are shown here because Github hasn't sync'ed yet with apache.	2013-10-09 12:14:19 -07:00
Matei Zaharia	478b2b7edc	Fix PySpark docs and an overly long line of code after `fdbae41e`	2013-10-09 12:08:04 -07:00
Matei Zaharia	b4fa11f6c9	Merge pull request #38 from AndreSchumacher/pyspark_sorting SPARK-705: implement sortByKey() in PySpark This PR contains the implementation of a RangePartitioner in Python and uses its partition ID's to get a global sort in PySpark.	2013-10-09 11:59:47 -07:00

... 5 6 7 8 9 ...

4680 commits