ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Joseph E. Gonzalez	ba5c75692a	Updating analytics to reflect changes in the pregel interface and moving degree information into the edge attribute.	2013-10-22 15:03:00 -07:00
Joseph E. Gonzalez	46b195253e	Adding some additional graph generators to support unit testing of the analytics package.	2013-10-22 15:01:49 -07:00
Joseph E. Gonzalez	14a3329a11	Changing the Pregel interface slightly to better support type inference.	2013-10-22 15:01:20 -07:00
Josh Rosen	768eb9c962	Remove redundant Java Function call() definitions This should fix SPARK-902, an issue where some Java API Function classes could cause AbstractMethodErrors when user code is compiled using the Eclipse compiler. Thanks to @MartinWeindel for diagnosing this problem. (This PR subsumes / closes #30)	2013-10-22 14:26:52 -07:00
Patrick Wendell	97184de1db	Merge pull request #99 from pwendell/master Use correct formatting for comments in StoragePerfTester	2013-10-22 13:10:14 -07:00
Patrick Wendell	ab5ece19a3	Formatting cleanup	2013-10-22 13:03:08 -07:00
Ewen Cheslack-Postava	c8748c25eb	Add notes to python documentation about using SparkContext.setSystemProperty.	2013-10-22 11:49:52 -07:00
Patrick Wendell	c404adb9d2	Merge pull request #90 from pwendell/master SPARK-940: Do not directly pass Stage objects to SparkListener. This patch updates the SparkListener interface to pass StageInfo objects rather than directly pass spark Stages. The reason for this patch is explained in detail in SPARK-940.	2013-10-22 11:30:19 -07:00
Ewen Cheslack-Postava	317a9eb1ce	Pass self to SparkContext._ensure_initialized. The constructor for SparkContext should pass in self so that we track the current context and produce errors if another one is created. Add a doctest to make sure creating multiple contexts triggers the exception.	2013-10-22 11:26:49 -07:00
Patrick Wendell	c22046b3cc	Minor clean-up in review	2013-10-22 11:00:50 -07:00
Patrick Wendell	7de0ea4d42	Response to code review and adding some more tests	2013-10-22 11:00:50 -07:00
Patrick Wendell	2fa3c4c49c	Fix for Spark-870. This patch fixes a bug where the Spark UI didn't display the correct number of total tasks if the number of tasks in a Stage doesn't equal the number of RDD partitions. It also cleans up the listener API a bit by embedding this information in the StageInfo class rather than passing it separately.	2013-10-22 11:00:25 -07:00
Patrick Wendell	a854f5bfcf	SPARK-940: Do not directly pass Stage objects to SparkListener.	2013-10-22 11:00:06 -07:00
Matei Zaharia	aa9019fc82	Merge pull request #98 from aarondav/docs Docs: Fix links to RDD API documentation	2013-10-22 10:30:02 -07:00
Matei Zaharia	a0e08f0fb9	Merge pull request #82 from JoshRosen/map-output-tracker-refactoring Split MapOutputTracker into Master/Worker classes Previously, MapOutputTracker contained fields and methods that were only applicable to the master or worker instances. This commit introduces a MasterMapOutputTracker class to prevent the master-specific methods from being accessed on workers. I also renamed a few methods and made others protected/private.	2013-10-22 10:20:43 -07:00
Kay Ousterhout	37b9b4cc11	Shorten GETTING_RESULT to GET_RESULT	2013-10-22 10:05:33 -07:00
Aaron Davidson	962bec97ee	Docs: Fix links to RDD API documentation	2013-10-22 09:39:36 -07:00
Ewen Cheslack-Postava	56d230e614	Add classmethod to SparkContext to set system properties. Add a new classmethod to SparkContext to set system properties, as is possible in Scala/Java. Unlike the Java/Scala implementations, there's no access to System until the JVM bridge is created. Since SparkContext handles that, move the initialization of the JVM connection to a separate classmethod that can safely be called repeatedly as long as the same instance (or no instance) is provided.	2013-10-22 00:22:37 -07:00
Matei Zaharia	b84193c5b8	Merge pull request #92 from tgravescs/sparkYarnFixClasspath Fix the Worker to use CoarseGrainedExecutorBackend and modify classpath ... ...to be explicit about inclusion of spark.jar and app.jar. Be explicit so if there are any conflicts in packaging between spark.jar and app.jar we don't get random results due to the classpath having /*, which can include things in a different order.	2013-10-21 23:35:13 -07:00
Matei Zaharia	731c94e91d	Merge pull request #56 from jerryshao/kafka-0.8-dev Upgrade Kafka 0.7.2 to Kafka 0.8.0-beta1 for Spark Streaming Conflicts: streaming/pom.xml	2013-10-21 23:31:38 -07:00
Reynold Xin	48952d67e6	Merge pull request #87 from aarondav/shuffle-base Basic shuffle file consolidation The Spark shuffle phase can produce a large number of files, as one file is created per mapper per reducer. For large or repeated jobs, this often produces millions of shuffle files, which sees extremely degraded performance from the OS file system. This patch seeks to reduce that burden by combining multiple shuffle files into one. This PR draws upon the work of @jason-dai in https://github.com/mesos/spark/pull/669. However, it simplifies the design in order to get the majority of the gain with less overall intellectual and code burden. The vast majority of code in this pull request is a refactor to allow the insertion of a clean layer of indirection between logical block ids and physical files. This, I feel, provides some design clarity in addition to enabling shuffle file consolidation. The main goal is to produce one shuffle file per reducer per active mapper thread. This allows us to isolate the mappers (simplifying the failure modes), while still allowing us to reduce the number of mappers tremendously for large tasks. In order to accomplish this, we simply create a new set of shuffle files for every parallel task, and return the files to a pool which will be given out to the next run task. I have run some ad hoc query testing on 5 m1.xlarge EC2 nodes with 2g of executor memory and the following microbenchmark: scala> val nums = sc.parallelize(1 to 1000, 1000).flatMap(x => (1 to 1e6.toInt)) scala> def time(x: => Unit) = { val now = System.currentTimeMillis; x; System.currentTimeMillis - now } scala> (1 to 8).map(_ => time(nums.map(x => (x % 100000, 2000, x)).reduceByKey(_ + _).count) / 1000.0) For this particular workload, with 1000 mappers and 2000 reducers, I saw the old method running at around 15 minutes, with the consolidated shuffle files running at around 4 minutes. 
There was a very sharp increase in running time for the non-consolidated version after around 1 million total shuffle files. Below this threshold, however, there wasn't a significant difference between the two. Better performance measurement of this patch is warranted, and I plan on doing so in the near future as part of a general investigation of our shuffle file bottlenecks and performance.	2013-10-21 22:45:00 -07:00
Aaron Davidson	053ef949ac	Merge ShufflePerfTester patch into shuffle block consolidation	2013-10-21 22:17:53 -07:00
Prabeesh K	9ca1bd9530	Update MQTTWordCount.scala	2013-10-22 09:05:57 +05:30
Reynold Xin	a51359c917	Merge pull request #95 from aarondav/perftest Minor: Put StoragePerfTester in org/apache/	2013-10-21 20:33:29 -07:00
Aaron Davidson	97053c4a91	Put StoragePerfTester in org/apache/	2013-10-21 20:25:40 -07:00
Prabeesh K	dbafa11396	Update MQTTWordCount.scala	2013-10-22 08:50:34 +05:30
Matei Zaharia	39d2e9b293	Merge pull request #94 from aarondav/mesos-fix Fix mesos urls This was a bug I introduced in https://github.com/apache/incubator-spark/pull/71. Previously, we explicitly removed the mesos:// part; with #71, this no longer occurs.	2013-10-21 18:58:48 -07:00
Aaron Davidson	0071f0899c	Fix mesos urls This was a bug I introduced in https://github.com/apache/incubator-spark/pull/71. Previously, we explicitly removed the mesos:// part; with PR 71, this no longer occurred.	2013-10-21 15:56:14 -07:00
Kyle Ellrott	73bf8587e2	Fixing graph/pom.xml	2013-10-21 15:13:31 -07:00
Kay Ousterhout	916270f5f3	Show "GETTING_RESULTS" state in UI. This commit adds a set of calls using the SparkListener interface that indicate when a task is remotely fetching results, so that we can display this (potentially time-consuming) phase of execution to users through the UI.	2013-10-21 12:46:57 -07:00
Aaron Davidson	4aa0ba1df7	Remove executorId from Task.run()	2013-10-21 12:19:15 -07:00
tgravescs	b6571541a6	Fix the Worker to use CoarseGrainedExecutorBackend and modify classpath to be explicit about inclusion of spark.jar and app.jar	2013-10-21 14:05:15 -05:00
Patrick Wendell	aa61bfd399	Merge pull request #88 from rxin/clean Made the following traits/interfaces/classes non-public: SparkHadoopWriter SparkHadoopMapRedUtil SparkHadoopMapReduceUtil SparkHadoopUtil PythonAccumulatorParam BlockManagerSlaveActor	2013-10-21 11:57:05 -07:00
Tathagata Das	0666498799	Updated TransformDStream to allow n-ary DStream transform. Added transformWith, leftOuterJoin and rightOuterJoin operations to DStream for Scala and Java APIs. Also added n-ary union and n-ary transform operations to StreamingContext for Scala and Java APIs.	2013-10-21 05:34:09 -07:00
Aaron Davidson	444162afe7	Documentation update	2013-10-20 22:59:45 -07:00
Aaron Davidson	947fceaa73	Close shuffle writers during failure & remove executorId from TaskContext	2013-10-20 22:47:10 -07:00
Patrick Wendell	35886f3474	Merge pull request #41 from pwendell/shuffle-benchmark Provide Instrumentation for Shuffle Write Performance Shuffle write performance can have a major impact on the performance of jobs. This patch adds a few pieces of instrumentation related to shuffle writes. They are: 1. A listing of the time spent performing blocking writes for each task. This is implemented by keeping track of the aggregate delay seen by many individual writes. 2. An undocumented option `spark.shuffle.sync` which forces shuffle data to sync to disk. This is necessary for measuring shuffle performance in the absence of the OS buffer cache. 3. An internal utility which micro-benchmarks write throughput for simulated shuffle outputs. I'm going to do some performance testing on this to see whether these small timing calls add overhead. From a feature perspective, however, I consider this complete. Any feedback is appreciated.	2013-10-20 22:20:32 -07:00
Reynold Xin	5b9380e017	Merge pull request #89 from rxin/executor Don't setup the uncaught exception handler in local mode. This avoids unit test failures for Spark streaming. java.util.concurrent.RejectedExecutionException: Task org.apache.spark.streaming.JobManager$JobHandler@38cf728d rejected from java.util.concurrent.ThreadPoolExecutor@3b69a41e[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 14] at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048) at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821) at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372) at org.apache.spark.streaming.JobManager.runJob(JobManager.scala:54) at org.apache.spark.streaming.Scheduler$$anonfun$generateJobs$2.apply(Scheduler.scala:108) at org.apache.spark.streaming.Scheduler$$anonfun$generateJobs$2.apply(Scheduler.scala:108) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.streaming.Scheduler.generateJobs(Scheduler.scala:108) at org.apache.spark.streaming.Scheduler$$anonfun$1.apply$mcVJ$sp(Scheduler.scala:41) at org.apache.spark.streaming.util.RecurringTimer.org$apache$spark$streaming$util$RecurringTimer$$loop(RecurringTimer.scala:66) at org.apache.spark.streaming.util.RecurringTimer$$anon$1.run(RecurringTimer.scala:34)	2013-10-20 21:03:51 -07:00
Reynold Xin	b4d8478454	Made JobLogger public again and some minor cleanup.	2013-10-20 18:59:28 -07:00
Matei Zaharia	261bcf27b3	Merge pull request #80 from rxin/build Exclusion rules for Maven build files.	2013-10-20 17:59:51 -07:00
Aaron Davidson	4b68ddf3d0	Cleanup old shuffle file metadata from memory	2013-10-20 17:56:41 -07:00
Matei Zaharia	edc5e3f8f4	Merge pull request #75 from JoshRosen/block-manager-cleanup Code de-duplication in BlockManager The BlockManager has a few methods that duplicate most of their code. This pull request extracts the duplicated code into private doPut(), doGetLocal(), and doGetRemote() methods that unify the storing/reading of bytes or objects. I believe that I preserved the logic of the original code, but I'd appreciate some help in reviewing this.	2013-10-20 17:18:06 -07:00
Aaron Davidson	42a049723d	Address Josh and Reynold's comments	2013-10-20 16:11:59 -07:00
Josh Rosen	1fa5baf9ab	Unwrap a long line that actually fits.	2013-10-20 14:50:21 -07:00
Josh Rosen	640f253a65	Fix test failures in local mode due to updateEpoch	2013-10-20 14:49:05 -07:00
Josh Rosen	68d6806ea4	Minor cleanup based on @aarondav's code review.	2013-10-20 13:20:14 -07:00
Reynold Xin	7414805e4e	Don't setup the uncaught exception handler in local mode. This avoids unit test failures for Spark streaming.	2013-10-20 13:03:48 -07:00
Reynold Xin	8e1937f8ba	Made the following traits/interfaces/classes non-public: SparkHadoopWriter SparkHadoopMapRedUtil SparkHadoopMapReduceUtil SparkHadoopUtil PythonAccumulatorParam JobLogger BlockManagerSlaveActor	2013-10-20 12:22:07 -07:00
Reynold Xin	2a7ae1736a	Merge pull request #84 from rxin/kill1 Added documentation for setJobGroup. Also some minor cleanup in SparkContext.	2013-10-20 11:45:21 -07:00
Aaron Davidson	38b8048f29	Fix compiler errors Whoops. Last-second changes require testing too, it seems.	2013-10-20 11:03:36 -07:00

4717 commits