Commit graph

4309 commits

Author SHA1 Message Date
Reynold Xin 2a7ae1736a Merge pull request #84 from rxin/kill1
Added documentation for setJobGroup. Also some minor cleanup in SparkContext.
2013-10-20 11:45:21 -07:00
Reynold Xin fabd05dabc Updated setGroupId documentation and marked dagSchedulerSource and blockManagerSource as private in SparkContext. 2013-10-20 10:54:30 -07:00
Matei Zaharia e4abb75d70 Merge pull request #85 from rxin/clean
Moved the top level spark package object from spark to org.apache.spark

This is a pretty annoying documentation bug ...
2013-10-20 09:38:37 -07:00
Matei Zaharia 747f538925 Merge pull request #83 from ewencp/pyspark-accumulator-add-method
Add an add() method to pyspark accumulators.

Add a regular method for adding a term to accumulators in
pyspark. Currently if you have a non-global accumulator, adding to it
is awkward. The += operator can't be used for non-global accumulators
captured via closure because it involves an assignment. The only way
to do it is using __iadd__ directly.

Adding this method lets you write code like this:

def main():
    sc = SparkContext()
    accum = sc.accumulator(0)

    rdd = sc.parallelize([1,2,3])
    def f(x):
        accum.add(x)
    rdd.foreach(f)
    print accum.value

where using accum += x instead would have caused UnboundLocalError
exceptions in workers. Currently it would have to be written as
accum.__iadd__(x).
2013-10-19 23:40:40 -07:00
Reynold Xin 8396a6649e Moved the top level spark package object from spark to org.apache.spark 2013-10-19 23:26:15 -07:00
Reynold Xin eb9bf69462 Added documentation for setJobGroup. Also some minor cleanup in SparkContext. 2013-10-19 23:16:44 -07:00
Ewen Cheslack-Postava 7eaa56de7f Add an add() method to pyspark accumulators.
Add a regular method for adding a term to accumulators in
pyspark. Currently if you have a non-global accumulator, adding to it
is awkward. The += operator can't be used for non-global accumulators
captured via closure because it involves an assignment. The only way
to do it is using __iadd__ directly.

Adding this method lets you write code like this:

def main():
    sc = SparkContext()
    accum = sc.accumulator(0)

    rdd = sc.parallelize([1,2,3])
    def f(x):
        accum.add(x)
    rdd.foreach(f)
    print accum.value

where using accum += x instead would have caused UnboundLocalError
exceptions in workers. Currently it would have to be written as
accum.__iadd__(x).
2013-10-19 19:55:39 -07:00
Reynold Xin 6511bbe2ad Merge pull request #78 from mosharaf/master
Removed BitTorrentBroadcast and TreeBroadcast.

TorrentBroadcast replaces both.
2013-10-19 11:34:56 -07:00
Mosharaf Chowdhury 29617c27a1 Removed BitTorrentBroadcast and TreeBroadcast. TorrentBroadcast is replacing both. 2013-10-18 23:54:11 -07:00
Reynold Xin f628804c02 Merge pull request #76 from pwendell/master
Clarify compression property.

Clarifies that this governs compression of internal data, not input
data or output data.
2013-10-18 23:19:42 -07:00
Patrick Wendell 6b62836285 Clarify compression property.
Clarifies that this governs compression of internal data, not input
data or output data.
2013-10-18 23:08:44 -07:00
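
For reference, a hedged example of setting an internal-data compression property the 0.8-era way, via system properties. The exact property this doc change covers isn't named above, so `spark.io.compression.codec` is an assumption:

```
// Assumption: spark.io.compression.codec is the kind of property meant here.
// It governs compression of Spark's own internal data (e.g. shuffle output),
// not the compression of input files or of data the job writes out.
System.setProperty("spark.io.compression.codec",
  "org.apache.spark.io.LZFCompressionCodec")
val sc = new org.apache.spark.SparkContext("local", "compression-demo")
```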
Matei Zaharia 599dcb0ddf Merge pull request #74 from rxin/kill
Job cancellation via job group id.

This PR adds a simple API to group together a set of jobs belonging to a thread and threads spawned from it. It also allows the cancellation of all jobs in this group.

An example:

    sc.setJobGroup("this_is_the_group_id", "some job description")
    sc.parallelize(1 to 10000, 2).map { i => Thread.sleep(10); i }.count()

In a separate thread:

    sc.cancelJobGroup("this_is_the_group_id")
2013-10-18 22:49:00 -07:00
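
For illustration, the two snippets above combine into a minimal self-contained program (master URL, app name, and thread handling are placeholders):

```
import org.apache.spark.SparkContext

object JobGroupCancellation {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "job-group-demo")

    // Tag every job submitted from this thread (and threads it spawns).
    sc.setJobGroup("this_is_the_group_id", "some job description")

    // Cancel the whole group from a separate thread after one second.
    new Thread(new Runnable {
      def run(): Unit = {
        Thread.sleep(1000)
        sc.cancelJobGroup("this_is_the_group_id")
      }
    }).start()

    // A deliberately slow job; it belongs to the group and gets cancelled.
    sc.parallelize(1 to 10000, 2).map { i => Thread.sleep(10); i }.count()
  }
}
```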
Reynold Xin 806f3a3adb Job cancellation via job group id. 2013-10-18 21:46:08 -07:00
Matei Zaharia 8de9706b86 Merge pull request #66 from shivaram/sbt-assembly-deps
Add SBT target to assemble dependencies

This pull request is an attempt to address the long assembly build times during development. Instead of rebuilding the assembly jar for every Spark change, this pull request adds a new SBT target `spark` that packages all the Spark modules and builds an assembly of the dependencies.

So the workflow should now be something like

```
./sbt/sbt spark # Doing this once should suffice
## Make changes
./sbt/sbt compile
./sbt/sbt test or ./spark-shell
```
2013-10-18 20:32:39 -07:00
Matei Zaharia e5316d0685 Merge pull request #68 from mosharaf/master
Faster and stable/reliable broadcast

HttpBroadcast is noticeably slow, but the alternatives (TreeBroadcast or BitTorrentBroadcast) are notoriously unreliable. The main problem with them is that they try to manage the memory for the pieces of a broadcast themselves. Right now, the BroadcastManager does not know which machines the tasks reading from a broadcast variable are running on, or when they have finished. Consequently, we try to guess, and often guess wrong, which blows up the memory usage and kills/hangs jobs.

This very simple implementation solves the problem by not trying to manage the intermediate pieces; instead, it offloads that duty to the BlockManager, which is quite good at juggling blocks. Otherwise, it is very similar to the BitTorrentBroadcast implementation (without the fancy optimizations). And it runs much faster than the HttpBroadcast we have right now.

I've been using this for another project for the last couple of weeks, and just today did some benchmarking against the Http one. The following shows the improvements with increasing broadcast size for cold runs. Each line represents the number of receivers.
![fix-bc-first](https://f.cloud.github.com/assets/232966/1349342/ffa149e4-36e7-11e3-9fa6-c74555829356.png)

After the first broadcast is over, i.e., after the JVM is warmed up and, for HttpBroadcast, the server is already running (I think), the following are the improvements for warm runs.
![fix-bc-succ](https://f.cloud.github.com/assets/232966/1349352/5a948bae-36e8-11e3-98ce-34f19ebd33e0.jpg)
The curves are not as nice as the cold runs, but the improvements are obvious, especially for larger broadcasts and more receivers.

Depending on how it goes, we should deprecate and/or remove the old TreeBroadcast and BitTorrentBroadcast implementations, and hopefully SPARK-889 will not be necessary anymore.
2013-10-18 20:30:56 -07:00
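
As a rough sketch of the approach, not the actual TorrentBroadcast code: the serialized value is cut into fixed-size pieces that are handed to the block store. The map below stands in for the BlockManager, which is what really tracks, serves, and evicts the pieces:

```
import scala.collection.mutable

object TorrentLikeSketch {
  val BlockSize = 4 * 1024 * 1024 // 4MB, matching the default set below

  // Split a serialized payload into BlockSize pieces.
  def chunk(bytes: Array[Byte]): Seq[Array[Byte]] =
    bytes.grouped(BlockSize).toSeq

  // Sender: register each piece under a deterministic id so any receiver
  // can fetch pieces from any peer that already holds them.
  def write(id: Long, bytes: Array[Byte],
            store: mutable.Map[String, Array[Byte]]): Int = {
    val pieces = chunk(bytes)
    pieces.zipWithIndex.foreach { case (piece, i) =>
      store(s"broadcast_${id}_piece$i") = piece
    }
    pieces.size
  }

  // Receiver: fetch all pieces (in any order) and reassemble the payload.
  def read(id: Long, numPieces: Int,
           store: mutable.Map[String, Array[Byte]]): Array[Byte] =
    (0 until numPieces).flatMap(i => store(s"broadcast_${id}_piece$i")).toArray
}
```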
Matei Zaharia 8d528af829 Merge pull request #71 from aarondav/scdefaults
Spark shell exits if it cannot create SparkContext

Mainly, this occurs if you provide a messed up MASTER url (one that doesn't match one
of our regexes). Previously, we would default to Mesos, fail, and then start the shell
anyway, except that any Spark command would fail. Simply exiting seems clearer.
2013-10-18 20:24:10 -07:00
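
A minimal sketch of the fail-fast behavior described above (illustrative names, not the actual repl code):

```
import org.apache.spark.SparkContext

object ShellInitSketch {
  // Build the context up front and exit the process on failure, rather than
  // starting a shell in which every Spark command would fail anyway.
  def createSparkContext(master: String): SparkContext =
    try {
      new SparkContext(master, "Spark shell")
    } catch {
      case e: Exception =>
        System.err.println("Failed to create SparkContext: " + e.getMessage)
        sys.exit(1)
    }
}
```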
Mosharaf Chowdhury 08391dbcb8 Should compile now. 2013-10-17 23:06:17 -07:00
Mosharaf Chowdhury 8612641362 Added an after block to reset spark.broadcast.factory 2013-10-17 22:44:04 -07:00
Aaron Davidson 74737264c4 Spark shell exits if it cannot create SparkContext
Mainly, this occurs if you provide a messed up MASTER url (one that doesn't match one
of our regexes). Previously, we would default to Mesos, fail, and then start the shell
anyway, except that any Spark command would fail.
2013-10-17 18:51:19 -07:00
Mosharaf Chowdhury 90ab55fd37 Merge remote-tracking branch 'upstream/master' 2013-10-17 18:12:28 -07:00
Mosharaf Chowdhury e178ae4e9b BroadcastSuite updated to test both HttpBroadcast and TorrentBroadcast in local, local[N], local-cluster settings. 2013-10-17 16:38:43 -07:00
Matei Zaharia fc26e5b832 Merge pull request #69 from KarthikTunga/master
Fix for issue SPARK-627. Implementing --config argument in the scripts.

This code fix is for issue SPARK-627. I added code to handle --config arguments in the scripts. If the <conf-dir> is not a directory, the scripts exit. I removed the --hosts argument; the same effect can be achieved by giving a different config directory. Let me know if an explicit --hosts argument is required.
2013-10-17 13:21:07 -07:00
Mosharaf Chowdhury 6a84e40efe Merge remote-tracking branch 'upstream/master' 2013-10-17 13:14:33 -07:00
Mosharaf Chowdhury 35b2415fb3 Code styling. Updated doc. 2013-10-17 13:14:12 -07:00
Matei Zaharia cf64f63f8a Merge pull request #67 from kayousterhout/remove_tsl
Removed TaskSchedulerListener interface.

The interface was used only by the DAG scheduler (so it wasn't necessary
to define the additional interface), and the naming makes it very
confusing when reading the code (because "listener" was used
to describe the DAG scheduler, rather than SparkListeners, which
implement a nearly-identical interface but serve a different
function).

@mateiz - is there a reason for this interface that I'm missing?
2013-10-17 11:12:28 -07:00
Mosharaf Chowdhury e663750488 Removed unused code.
Changes to match Spark coding style.
2013-10-17 00:19:50 -07:00
Kay Ousterhout 809f547633 Fixed unit tests 2013-10-16 23:16:12 -07:00
KarthikTunga 8537f19268 SPARK-627 , Implementing --config arguments in the scripts 2013-10-16 23:00:33 -07:00
KarthikTunga ff4fb1f7ee SPARK-627 , Implementing --config arguments in the scripts 2013-10-16 22:55:15 -07:00
KarthikTunga a32aa6b351 Implementing --config argument in the scripts 2013-10-16 22:51:09 -07:00
Mosharaf Chowdhury e96bd0068f BroadcastTest2 --> BroadcastTest 2013-10-16 21:33:33 -07:00
Mosharaf Chowdhury a8d0981832 Fixes for the new BlockId naming convention. 2013-10-16 21:33:33 -07:00
Mosharaf Chowdhury feb45d391f Default blockSize is 4MB.
BroadcastTest2 example added for testing broadcasts.
2013-10-16 21:33:33 -07:00
Mosharaf Chowdhury 6e5a60fab4 Removed unnecessary code, and added comment of memory-latency tradeoff. 2013-10-16 21:33:33 -07:00
Mosharaf Chowdhury 4602e2bf6e Torrent-ish broadcast based on BlockManager. 2013-10-16 21:33:33 -07:00
Shivaram Venkataraman 0a4b76fcc2 Rename SBT target to assemble-deps. 2013-10-16 17:05:46 -07:00
Kay Ousterhout ec512583ab Removed TaskSchedulerListener interface.
The interface was used only by the DAG scheduler (so it wasn't necessary
to define the additional interface), and the naming makes it very
confusing when reading the code (because "listener" was used
to describe the DAG scheduler, rather than SparkListeners, which
implement a nearly-identical interface but serve a different
function).
2013-10-16 16:57:42 -07:00
Matei Zaharia f9973cae3a Merge pull request #65 from tgravescs/fixYarn
Fix yarn build

Fix the yarn build after renaming StandAloneX to CoarseGrainedX from pull request 34.
2013-10-16 15:58:41 -07:00
Shivaram Venkataraman 1dcded45e2 Exclude assembly jar from classpath if using deps 2013-10-16 13:43:41 -07:00
tgravescs cc7df2b3cc Fix yarn build 2013-10-16 10:09:16 -05:00
Matei Zaharia 28e9c2abc0 Merge pull request #63 from pwendell/master
Fixing spark streaming example and a bug in examples build.

- Examples assembly included a log4j.properties which clobbered Spark's
- Example had an error where some classes weren't serializable
- Did some other clean-up in this example
2013-10-15 23:59:56 -07:00
Matei Zaharia 4e46fde818 Merge pull request #62 from harveyfeng/master
Make TaskContext's stageId publicly accessible.
2013-10-15 23:14:27 -07:00
Patrick Wendell 35befe07bb Fixing spark streaming example and a bug in examples build.
- Examples assembly included a log4j.properties which clobbered Spark's
- Example had an error where some classes weren't serializable
- Did some other clean-up in this example
2013-10-15 22:55:43 -07:00
Harvey Feng 65b46236e7 Proper formatting for SparkHadoopWriter class extensions. 2013-10-15 21:51:52 -07:00
Matei Zaharia b5346064d6 Merge pull request #8 from vchekan/checkpoint-ttl-restore
Serialize and restore spark.cleaner.ttl to savepoint

In accordance to conversation in spark-dev maillist, preserve spark.cleaner.ttl parameter when serializing checkpoint.
2013-10-15 21:25:03 -07:00
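
A hypothetical sketch of the serialize-and-restore idea (names are illustrative, not the actual Checkpoint code):

```
// Carry spark.cleaner.ttl inside the checkpoint so a restored context keeps
// the same metadata-cleaner behavior it was saved with.
object CheckpointTtlSketch {
  case class Checkpoint(data: Array[Byte], cleanerTtl: Option[String])

  def save(data: Array[Byte]): Checkpoint =
    Checkpoint(data, Option(System.getProperty("spark.cleaner.ttl")))

  def restore(cp: Checkpoint): Unit =
    cp.cleanerTtl.foreach(ttl => System.setProperty("spark.cleaner.ttl", ttl))
}
```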
Matei Zaharia 6dbd2208ff Merge pull request #34 from kayousterhout/rename
Renamed StandaloneX to CoarseGrainedX.

(as suggested by @rxin here https://github.com/apache/incubator-spark/pull/14)

The previous names were confusing because the components weren't just
used in Standalone mode.  The scheduler used for Standalone
mode is called SparkDeploySchedulerBackend, so referring to the base class
as StandaloneSchedulerBackend was misleading.
2013-10-15 19:02:57 -07:00
Matei Zaharia 983b83f24d Merge pull request #61 from kayousterhout/daemon_thread
Unified daemon thread pools

As requested by @mateiz in an earlier pull request, this refactors various daemon thread pools to use a set of methods in utils.scala, and also changes the thread-pool-creation methods in utils.scala to use named thread pools for improved debugging.
2013-10-15 19:02:46 -07:00
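
A small sketch of what such named daemon thread-pool helpers look like; the exact signatures in Utils may differ:

```
import java.util.concurrent.{Executors, ExecutorService, ThreadFactory}
import java.util.concurrent.atomic.AtomicInteger

object DaemonPoolSketch {
  // Threads get a stable "<prefix>-<n>" name, which makes stack traces much
  // easier to read than the default "pool-1-thread-7".
  def namedDaemonThreadFactory(prefix: String): ThreadFactory = new ThreadFactory {
    private val counter = new AtomicInteger(0)
    def newThread(r: Runnable): Thread = {
      val t = new Thread(r, s"$prefix-${counter.incrementAndGet()}")
      t.setDaemon(true) // daemon threads never block JVM shutdown
      t
    }
  }

  def newDaemonFixedThreadPool(nThreads: Int, prefix: String): ExecutorService =
    Executors.newFixedThreadPool(nThreads, namedDaemonThreadFactory(prefix))
}
```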
Harvey Feng c4c76e37a7 Fix line length > 100 chars in SparkHadoopWriter 2013-10-15 18:35:59 -07:00
Harvey Feng 5b8083fee5 Make TaskContext's stageId publicly accessible. 2013-10-15 18:06:37 -07:00
Kay Ousterhout f95a2be045 Fixed build error after merging in master 2013-10-15 14:51:37 -07:00