ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Ewen Cheslack-Postava	56d230e614	Add classmethod to SparkContext to set system properties. Add a new classmethod to SparkContext to set system properties like is possible in Scala/Java. Unlike the Java/Scala implementations, there's no access to System until the JVM bridge is created. Since SparkContext handles that, move the initialization of the JVM connection to a separate classmethod that can safely be called repeatedly as long as the same instance (or no instance) is provided.	2013-10-22 00:22:37 -07:00
Matei Zaharia	b84193c5b8	Merge pull request #92 from tgravescs/sparkYarnFixClasspath Fix the Worker to use CoarseGrainedExecutorBackend and modify classpath ... ...to be explicit about inclusion of spark.jar and app.jar. Be explicit so if there are any conflicts in packaging between spark.jar and app.jar we don't get random results due to the classpath having /*, which can include things in a different order.	2013-10-21 23:35:13 -07:00
Matei Zaharia	731c94e91d	Merge pull request #56 from jerryshao/kafka-0.8-dev Upgrade Kafka 0.7.2 to Kafka 0.8.0-beta1 for Spark Streaming Conflicts: streaming/pom.xml	2013-10-21 23:31:38 -07:00
Reynold Xin	48952d67e6	Merge pull request #87 from aarondav/shuffle-base Basic shuffle file consolidation The Spark shuffle phase can produce a large number of files, as one file is created per mapper per reducer. For large or repeated jobs, this often produces millions of shuffle files, which sees extremely degraded performance from the OS file system. This patch seeks to reduce that burden by combining multiple shuffle files into one. This PR draws upon the work of @jason-dai in https://github.com/mesos/spark/pull/669. However, it simplifies the design in order to get the majority of the gain with less overall intellectual and code burden. The vast majority of code in this pull request is a refactor to allow the insertion of a clean layer of indirection between logical block ids and physical files. This, I feel, provides some design clarity in addition to enabling shuffle file consolidation. The main goal is to produce one shuffle file per reducer per active mapper thread. This allows us to isolate the mappers (simplifying the failure modes), while still allowing us to reduce the number of mappers tremendously for large tasks. In order to accomplish this, we simply create a new set of shuffle files for every parallel task, and return the files to a pool which will be given out to the next run task. I have run some ad hoc query testing on 5 m1.xlarge EC2 nodes with 2g of executor memory and the following microbenchmark: scala> val nums = sc.parallelize(1 to 1000, 1000).flatMap(x => (1 to 1e6.toInt)) scala> def time(x: => Unit) = { val now = System.currentTimeMillis; x; System.currentTimeMillis - now } scala> (1 to 8).map(_ => time(nums.map(x => (x % 100000, 2000, x)).reduceByKey(_ + _).count) / 1000.0) For this particular workload, with 1000 mappers and 2000 reducers, I saw the old method running at around 15 minutes, with the consolidated shuffle files running at around 4 minutes. 
There was a very sharp increase in running time for the non-consolidated version after around 1 million total shuffle files. Below this threshold, however, there wasn't a significant difference between the two. Better performance measurement of this patch is warranted, and I plan on doing so in the near future as part of a general investigation of our shuffle file bottlenecks and performance.	2013-10-21 22:45:00 -07:00
Aaron Davidson	053ef949ac	Merge ShufflePerfTester patch into shuffle block consolidation	2013-10-21 22:17:53 -07:00
Prabeesh K	9ca1bd9530	Update MQTTWordCount.scala	2013-10-22 09:05:57 +05:30
Reynold Xin	a51359c917	Merge pull request #95 from aarondav/perftest Minor: Put StoragePerfTester in org/apache/	2013-10-21 20:33:29 -07:00
Aaron Davidson	97053c4a91	Put StoragePerfTester in org/apache/	2013-10-21 20:25:40 -07:00
Prabeesh K	dbafa11396	Update MQTTWordCount.scala	2013-10-22 08:50:34 +05:30
Matei Zaharia	39d2e9b293	Merge pull request #94 from aarondav/mesos-fix Fix mesos urls This was a bug I introduced in https://github.com/apache/incubator-spark/pull/71. Previously, we explicitly removed the mesos:// part; with #71, this no longer occurs.	2013-10-21 18:58:48 -07:00
Aaron Davidson	0071f0899c	Fix mesos urls This was a bug I introduced in https://github.com/apache/incubator-spark/pull/71. Previously, we explicitly removed the mesos:// part; with PR 71, this no longer occurred.	2013-10-21 15:56:14 -07:00
Kyle Ellrott	73bf8587e2	Fixing graph/pom.xml	2013-10-21 15:13:31 -07:00
Kay Ousterhout	916270f5f3	Show "GETTING_RESULTS" state in UI. This commit adds a set of calls using the SparkListener interface that indicate when a task is remotely fetching results, so that we can display this (potentially time-consuming) phase of execution to users through the UI.	2013-10-21 12:46:57 -07:00
Aaron Davidson	4aa0ba1df7	Remove executorId from Task.run()	2013-10-21 12:19:15 -07:00
tgravescs	b6571541a6	Fix the Worker to use CoarseGrainedExecutorBackend and modify classpath to be explicit about inclusion of spark.jar and app.jar	2013-10-21 14:05:15 -05:00
Patrick Wendell	aa61bfd399	Merge pull request #88 from rxin/clean Made the following traits/interfaces/classes non-public: SparkHadoopWriter SparkHadoopMapRedUtil SparkHadoopMapReduceUtil SparkHadoopUtil PythonAccumulatorParam BlockManagerSlaveActor	2013-10-21 11:57:05 -07:00
Tathagata Das	0666498799	Updated TransformDStream to allow n-ary DStream transform. Added transformWith, leftOuterJoin and rightOuterJoin operations to DStream for Scala and Java APIs. Also added n-ary union and n-ary transform operations to StreamingContext for Scala and Java APIs.	2013-10-21 05:34:09 -07:00
Aaron Davidson	444162afe7	Documentation update	2013-10-20 22:59:45 -07:00
Aaron Davidson	947fceaa73	Close shuffle writers during failure & remove executorId from TaskContext	2013-10-20 22:47:10 -07:00
Patrick Wendell	35886f3474	Merge pull request #41 from pwendell/shuffle-benchmark Provide Instrumentation for Shuffle Write Performance Shuffle write performance can have a major impact on the performance of jobs. This patch adds a few pieces of instrumentation related to shuffle writes. They are: 1. A listing of the time spent performing blocking writes for each task. This is implemented by keeping track of the aggregate delay seen by many individual writes. 2. An undocumented option `spark.shuffle.sync` which forces shuffle data to sync to disk. This is necessary for measuring shuffle performance in the absence of the OS buffer cache. 3. An internal utility which micro-benchmarks write throughput for simulated shuffle outputs. I'm going to do some performance testing on this to see whether these small timing calls add overhead. From a feature perspective, however, I consider this complete. Any feedback is appreciated.	2013-10-20 22:20:32 -07:00
Reynold Xin	5b9380e017	Merge pull request #89 from rxin/executor Don't setup the uncaught exception handler in local mode. This avoids unit test failures for Spark streaming. java.util.concurrent.RejectedExecutionException: Task org.apache.spark.streaming.JobManager$JobHandler@38cf728d rejected from java.util.concurrent.ThreadPoolExecutor@3b69a41e[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 14] at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048) at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821) at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372) at org.apache.spark.streaming.JobManager.runJob(JobManager.scala:54) at org.apache.spark.streaming.Scheduler$$anonfun$generateJobs$2.apply(Scheduler.scala:108) at org.apache.spark.streaming.Scheduler$$anonfun$generateJobs$2.apply(Scheduler.scala:108) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.streaming.Scheduler.generateJobs(Scheduler.scala:108) at org.apache.spark.streaming.Scheduler$$anonfun$1.apply$mcVJ$sp(Scheduler.scala:41) at org.apache.spark.streaming.util.RecurringTimer.org$apache$spark$streaming$util$RecurringTimer$$loop(RecurringTimer.scala:66) at org.apache.spark.streaming.util.RecurringTimer$$anon$1.run(RecurringTimer.scala:34)	2013-10-20 21:03:51 -07:00
Reynold Xin	b4d8478454	Made JobLogger public again and some minor cleanup.	2013-10-20 18:59:28 -07:00
Matei Zaharia	261bcf27b3	Merge pull request #80 from rxin/build Exclusion rules for Maven build files.	2013-10-20 17:59:51 -07:00
Aaron Davidson	4b68ddf3d0	Cleanup old shuffle file metadata from memory	2013-10-20 17:56:41 -07:00
Matei Zaharia	edc5e3f8f4	Merge pull request #75 from JoshRosen/block-manager-cleanup Code de-duplication in BlockManager The BlockManager has a few methods that duplicate most of their code. This pull request extracts the duplicated code into private doPut(), doGetLocal(), and doGetRemote() methods that unify the storing/reading of bytes or objects. I believe that I preserved the logic of the original code, but I'd appreciate some help in reviewing this.	2013-10-20 17:18:06 -07:00
Aaron Davidson	42a049723d	Address Josh and Reynold's comments	2013-10-20 16:11:59 -07:00
Josh Rosen	1fa5baf9ab	Unwrap a long line that actually fits.	2013-10-20 14:50:21 -07:00
Josh Rosen	640f253a65	Fix test failures in local mode due to updateEpoch	2013-10-20 14:49:05 -07:00
Josh Rosen	68d6806ea4	Minor cleanup based on @aarondav's code review.	2013-10-20 13:20:14 -07:00
Reynold Xin	7414805e4e	Don't setup the uncaught exception handler in local mode. This avoids unit test failures for Spark streaming.	2013-10-20 13:03:48 -07:00
Reynold Xin	8e1937f8ba	Made the following traits/interfaces/classes non-public: SparkHadoopWriter SparkHadoopMapRedUtil SparkHadoopMapReduceUtil SparkHadoopUtil PythonAccumulatorParam JobLogger BlockManagerSlaveActor	2013-10-20 12:22:07 -07:00
Reynold Xin	2a7ae1736a	Merge pull request #84 from rxin/kill1 Added documentation for setJobGroup. Also some minor cleanup in SparkContext.	2013-10-20 11:45:21 -07:00
Aaron Davidson	38b8048f29	Fix compiler errors Whoops. Last-second changes require testing too, it seems.	2013-10-20 11:03:36 -07:00
Reynold Xin	fabd05dabc	Updated setGroupId documentation and marked dagSchedulerSource and blockManagerSource as private in SparkContext.	2013-10-20 10:54:30 -07:00
Matei Zaharia	e4abb75d70	Merge pull request #85 from rxin/clean Moved the top level spark package object from spark to org.apache.spark This is a pretty annoying documentation bug ...	2013-10-20 09:38:37 -07:00
Aaron Davidson	136b9b3a3e	Basic shuffle file consolidation The Spark shuffle phase can produce a large number of files, as one file is created per mapper per reducer. For large or repeated jobs, this often produces millions of shuffle files, which sees extremely degraded performance from the OS file system. This patch seeks to reduce that burden by combining multiple shuffle files into one. This PR draws upon the work of Jason Dai in https://github.com/mesos/spark/pull/669. However, it simplifies the design in order to get the majority of the gain with less overall intellectual and code burden. The vast majority of code in this pull request is a refactor to allow the insertion of a clean layer of indirection between logical block ids and physical files. This, I feel, provides some design clarity in addition to enabling shuffle file consolidation. The main goal is to produce one shuffle file per reducer per active mapper thread. This allows us to isolate the mappers (simplifying the failure modes), while still allowing us to reduce the number of mappers tremendously for large tasks. In order to accomplish this, we simply create a new set of shuffle files for every parallel task, and return the files to a pool which will be given out to the next run task.	2013-10-20 02:58:26 -07:00
Aaron Davidson	861dc409d7	Refactor of DiskStore for shuffle file consolidation The main goal of this refactor was to allow the interposition of a new layer which maps logical BlockIds to physical locations other than a file with the same name as the BlockId. In particular, BlockIds will need to be mappable to chunks of files, as multiple will be stored in the same file. In order to accomplish this, the following changes have been made: - Creation of DiskBlockManager, which manages the association of logical BlockIds to physical disk locations (called FileSegments). By default, Blocks are simply mapped to physical files of the same name, as before. - The DiskStore now indirects all requests for a given BlockId through the DiskBlockManager in order to resolve the actual File location. - DiskBlockObjectWriter has been merged into BlockObjectWriter. - The Netty PathResolver has been changed to map BlockIds into FileSegments, as this codepath is the only one that uses Netty, and that is likely to remain the case. Overall, I think this refactor produces a clearer division between the logical Block paradigm and their physical on-disk location. There is now an explicit (and documented) mapping from one to the other.	2013-10-20 02:48:41 -07:00
Matei Zaharia	747f538925	Merge pull request #83 from ewencp/pyspark-accumulator-add-method Add an add() method to pyspark accumulators. Add a regular method for adding a term to accumulators in pyspark. Currently if you have a non-global accumulator, adding to it is awkward. The += operator can't be used for non-global accumulators captured via closure because it involves an assignment. The only way to do it is using __iadd__ directly. Adding this method lets you write code like this: def main(): sc = SparkContext() accum = sc.accumulator(0) rdd = sc.parallelize([1,2,3]) def f(x): accum.add(x) rdd.foreach(f) print accum.value where using accum += x instead would have caused UnboundLocalError exceptions in workers. Currently it would have to be written as accum.__iadd__(x).	2013-10-19 23:40:40 -07:00
Reynold Xin	8396a6649e	Moved the top level spark package object from spark to org.apache.spark	2013-10-19 23:26:15 -07:00
Reynold Xin	eb9bf69462	Added documentation for setJobGroup. Also some minor cleanup in SparkContext.	2013-10-19 23:16:44 -07:00
Josh Rosen	9159d2d09d	Split MapOutputTracker into Master/Worker classes. Previously, MapOutputTracker contained fields and methods that were only applicable to the master or worker instances. This commit introduces a MasterMapOutputTracker class to prevent the master-specific methods from being accessed on workers. I also renamed a few methods and made others protected/private.	2013-10-19 20:01:22 -07:00
Ewen Cheslack-Postava	7eaa56de7f	Add an add() method to pyspark accumulators. Add a regular method for adding a term to accumulators in pyspark. Currently if you have a non-global accumulator, adding to it is awkward. The += operator can't be used for non-global accumulators captured via closure because it involves an assignment. The only way to do it is using __iadd__ directly. Adding this method lets you write code like this: def main(): sc = SparkContext() accum = sc.accumulator(0) rdd = sc.parallelize([1,2,3]) def f(x): accum.add(x) rdd.foreach(f) print accum.value where using accum += x instead would have caused UnboundLocalError exceptions in workers. Currently it would have to be written as accum.__iadd__(x).	2013-10-19 19:55:39 -07:00
Josh Rosen	867d8fdf2a	De-duplicate code in dropOld[Non]BroadcastBlocks.	2013-10-19 19:53:12 -07:00
Josh Rosen	6925a1322b	Code de-duplication in put() and putBytes().	2013-10-19 19:53:12 -07:00
Josh Rosen	8279185651	De-duplication in getRemote() and getRemoteBytes().	2013-10-19 19:53:12 -07:00
Josh Rosen	babccb695e	De-duplication in getLocal() and getLocalBytes().	2013-10-19 19:52:10 -07:00
Reynold Xin	4e44d65b5e	Exclusion rules for Maven build files.	2013-10-19 12:35:55 -07:00
Reynold Xin	6511bbe2ad	Merge pull request #78 from mosharaf/master Removed BitTorrentBroadcast and TreeBroadcast. TorrentBroadcast replaces both.	2013-10-19 11:34:56 -07:00
Joseph E. Gonzalez	ebdbedc3e9	Documenting VertexSetRDD and added some testing code for VertexSetRDD	2013-10-19 01:26:08 -07:00
Mosharaf Chowdhury	29617c27a1	Removed BitTorrentBroadcast and TreeBroadcast. TorrentBroadcast is replacing both.	2013-10-18 23:54:11 -07:00
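
Note on commit 7eaa56de7f (add() for pyspark accumulators): the scoping problem described in that message can be reproduced without Spark. The sketch below is an illustration only. The Accumulator class is a hypothetical stand-in, not the real pyspark class, and the two helper functions are invented for this example. It is written for Python 3, whereas the snippet in the commit message uses the Python 2 print statement.

```python
class Accumulator:
    """Hypothetical stand-in for a pyspark accumulator (not the real class)."""

    def __init__(self, value):
        self.value = value

    def __iadd__(self, term):
        # Backs the += operator; returns self so the name can be rebound.
        self.value += term
        return self

    def add(self, term):
        # Plain method call: no assignment to the captured name is involved.
        self.value += term


def sum_with_add(items):
    accum = Accumulator(0)

    def f(x):
        accum.add(x)  # only reads the name captured from the enclosing scope

    for item in items:
        f(item)
    return accum.value


def sum_with_augmented_assignment(items):
    accum = Accumulator(0)

    def f(x):
        # The assignment makes `accum` local to f, so reading it here
        # raises UnboundLocalError before __iadd__ is ever called.
        accum += x

    for item in items:
        f(item)
    return accum.value
```

Calling sum_with_add([1, 2, 3]) returns 6, while sum_with_augmented_assignment([1, 2, 3]) raises UnboundLocalError on the first element. In Python 3 a nonlocal declaration inside f would also avoid the error, but that statement does not exist in Python 2, which is one reason a plain add() method is convenient.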


4750 commits
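
Note on commit 136b9b3a3e (basic shuffle file consolidation): the message describes going from one file per mapper per reducer to one file per reducer per active mapper thread, with file sets returned to a pool for the next task. The sketch below only models the file counts under that description. ShuffleFileGroupPool, simulate_consolidated, and the figure of 20 concurrent map tasks are assumptions made for illustration and are not taken from the Spark code or from the benchmark setup.

```python
class ShuffleFileGroupPool:
    """Hypothetical pool of per-task shuffle file groups (one file per reducer)."""

    def __init__(self, num_reducers):
        self.num_reducers = num_reducers
        self._idle = []          # groups returned by finished tasks
        self.groups_created = 0  # groups ever created

    def acquire(self):
        # Reuse an idle group if one exists, otherwise create a new one.
        if self._idle:
            return self._idle.pop()
        group_id = self.groups_created
        self.groups_created += 1
        return group_id

    def release(self, group_id):
        # A finished task hands its group back for the next task to run.
        self._idle.append(group_id)

    def total_files(self):
        return self.groups_created * self.num_reducers


def unconsolidated_files(num_mappers, num_reducers):
    # Old scheme: one file per mapper per reducer.
    return num_mappers * num_reducers


def simulate_consolidated(num_mappers, num_reducers, concurrent_tasks):
    # Run the map tasks in waves of at most `concurrent_tasks` at a time.
    pool = ShuffleFileGroupPool(num_reducers)
    remaining = num_mappers
    while remaining > 0:
        wave = min(concurrent_tasks, remaining)
        held = [pool.acquire() for _ in range(wave)]
        for group_id in held:
            pool.release(group_id)
        remaining -= wave
    return pool.total_files()
```

With the 1000 mappers and 2000 reducers quoted in the pull request, unconsolidated_files(1000, 2000) gives 2,000,000 files, above the roughly 1 million mark where the message reports a sharp slowdown. Under the assumed 20 concurrent map tasks, simulate_consolidated(1000, 2000, 20) gives 40,000 files.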