ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Matei Zaharia	618c1f6cf3	Merge pull request #125 from velvia/2013-10/local-jar-uri Add support for local:// URI scheme for addJars() This PR adds support for a new URI scheme for SparkContext.addJars(): `local://file/path`. The local scheme indicates that the `/file/path` exists on every worker node. The reason for its existence is for big library JARs, which would be really expensive to serve using the standard HTTP fileserver distribution method, especially for big clusters. Today the only inexpensive method (assuming such a file is on every host, via say NFS, rsync, etc.) of doing this is to add the JAR to the SPARK_CLASSPATH, but we want a method where the user does not need to modify the Spark configuration. I would add something to the docs, but it's not obvious where to add it. Oh, and it would be great if this could be merged in time for 0.8.1.	2013-10-30 12:03:44 -07:00
Evan Chan	de0285556a	Add support for local:// URI scheme for addJars() This indicates that a jar is available locally on each worker node.	2013-10-30 09:41:35 -07:00
Josh Rosen	cb9c8a922f	Extract BlockInfo classes from BlockManager. This saves space, since the inner classes needed to keep a reference to the enclosing BlockManager.	2013-10-29 18:06:51 -07:00
Josh Rosen	846b1cf5ab	Store fewer BlockInfo fields for shuffle blocks.	2013-10-29 15:14:29 -07:00
Josh Rosen	2d7cf6a271	Restructure BlockInfo fields to reduce memory use.	2013-10-27 23:01:03 -07:00
Matei Zaharia	aec9bf9060	Merge pull request #112 from kayousterhout/ui_task_attempt_id Display both task ID and task attempt ID in UI, and rename taskId to taskAttemptId Previously only the task attempt ID was shown in the UI; this was confusing because the job can be shown as complete while there are tasks still running. Showing the task ID in addition to the attempt ID makes it clear which tasks are redundant. This commit also renames taskId to taskAttemptId in TaskInfo and in the local/cluster schedulers. This identifier was used to uniquely identify attempts, not tasks, so the current naming was confusing. The new naming is also more consistent with map reduce.	2013-10-27 19:32:00 -07:00
Aaron Davidson	4261e834cb	Use flag instead of name check.	2013-10-26 23:53:38 -07:00
Aaron Davidson	596f18479e	Eliminate extra memory usage when shuffle file consolidation is disabled Otherwise, we see SPARK-946 even when shuffle file consolidation is disabled. Fixing SPARK-946 is still forthcoming.	2013-10-26 22:35:01 -07:00
Kay Ousterhout	ae22b4dd99	Display both task ID and task index in UI	2013-10-26 22:18:39 -07:00
Matei Zaharia	bab496c120	Merge pull request #108 from alig/master Changes to enable executing by using HDFS as a synchronization point between driver and executors, as well as ensuring executors exit properly.	2013-10-25 18:28:43 -07:00
Matei Zaharia	d307db6e55	Merge pull request #102 from tdas/transform Added new Spark Streaming operations New operations: transformWith, which allows arbitrary 2-to-1 DStream transform, added to Scala and Java API; StreamingContext.transform to allow arbitrary n-to-1 DStream; leftOuterJoin and rightOuterJoin between 2 DStreams, added to Scala and Java API; missing variations of join and cogroup added to Scala and Java API; missing JavaStreamingContext.union. Updated a number of Java and Scala API docs	2013-10-25 17:26:06 -07:00
Ali Ghodsi	eef261c892	fixing comments on PR	2013-10-25 16:48:33 -07:00
Matei Zaharia	85e2cab6f6	Merge pull request #111 from kayousterhout/ui_name Properly display the name of a stage in the UI. This fixes a bug introduced by the fix for SPARK-940, which changed the UI to display the RDD name rather than the stage name. As a result, no name for the stage was shown when using the Spark shell, which meant that there was no way to click on the stage to see more details (e.g., the running tasks). This commit changes the UI back to using the stage name. @pwendell -- let me know if this change was intentional	2013-10-25 14:46:06 -07:00
Tathagata Das	dc9570782a	Merge branch 'apache-master' into transform	2013-10-25 14:22:23 -07:00
Kay Ousterhout	a9c8d83aaf	Properly display the name of a stage in the UI. This fixes a bug introduced by the fix for SPARK-940, which changed the UI to display the RDD name rather than the stage name. As a result, no name for the stage was shown when using the Spark shell, which meant that there was no way to click on the stage to see more details (e.g., the running tasks). This commit changes the UI back to using the stage name.	2013-10-25 12:00:09 -07:00
Patrick Wendell	e5f6d5697b	Spacing fix	2013-10-24 22:08:06 -07:00
Patrick Wendell	31e92b72e3	Adding Java versions and associated tests	2013-10-24 21:14:56 -07:00
Patrick Wendell	05ac9940ee	Adding tests	2013-10-24 14:31:34 -07:00
Patrick Wendell	2fda84fe3f	Always use a shuffle	2013-10-24 14:31:34 -07:00
Patrick Wendell	08c1a42d7d	Add a `repartition` operator. This patch adds an operator called repartition with more straightforward semantics than the current `coalesce` operator. There are a few use cases where this operator is useful: 1. If a user wants to increase the number of partitions in the RDD. This is more common now with streaming. E.g. a user is ingesting data on one node but they want to add more partitions to ensure parallelism of subsequent operations across threads or the cluster. Right now they have to call rdd.coalesce(numSplits, shuffle=true) - that's super confusing. 2. If a user has input data where the number of partitions is not known. E.g. > sc.textFile("some file").coalesce(50).... This is both vague semantically (am I growing or shrinking this RDD) but also, may not work correctly if the base RDD has fewer than 50 partitions. The new operator forces shuffles every time, so it will always produce exactly the number of new partitions. It also throws an exception rather than silently not-working if a bad input is passed. I am currently adding streaming tests (requires refactoring some of the test suite to allow testing at partition granularity), so this is not ready for merge yet. But feedback is welcome.	2013-10-24 14:31:33 -07:00
Ali Ghodsi	05a0df2b9e	Makes Spark SIMR ready.	2013-10-24 11:59:51 -07:00
Tathagata Das	0400aba1c0	Merge branch 'apache-master' into transform	2013-10-24 11:05:00 -07:00
Tathagata Das	bacfe5ebca	Added JavaStreamingContext.transform	2013-10-24 10:56:24 -07:00
Matei Zaharia	1dc776b863	Merge pull request #93 from kayousterhout/ui_new_state Show "GETTING_RESULTS" state in UI. This commit adds a set of calls using the SparkListener interface that indicate when a task is remotely fetching results, so that we can display this (potentially time-consuming) phase of execution to users through the UI.	2013-10-23 22:05:52 -07:00
Kay Ousterhout	b45352e373	Clear akka frame size property in tests	2013-10-23 18:23:28 -07:00
Kay Ousterhout	c42f5d1787	Fixed broken tests	2013-10-23 17:35:01 -07:00
Josh Rosen	210858ac02	Add unpersist() to JavaDoubleRDD and JavaPairRDD. Also add support for new optional `blocking` argument.	2013-10-23 17:27:01 -07:00
Kay Ousterhout	a5f8f54ecd	Merge remote-tracking branch 'upstream/master' into ui_new_state Conflicts: core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala	2013-10-23 16:06:28 -07:00
Tathagata Das	9fccb17a5f	Removed Function3.call() based on Josh's comment.	2013-10-23 12:07:07 -07:00
Tathagata Das	fe8626efd1	Merge branch 'apache-master' into transform	2013-10-22 23:40:40 -07:00
Tathagata Das	72d2e1dd77	Fixed bug in Java transformWith, added more Java testcases for transform and transformWith, added missing variations of Java join and cogroup, updated various Scala and Java API docs.	2013-10-22 23:35:51 -07:00
Josh Rosen	768eb9c962	Remove redundant Java Function call() definitions This should fix SPARK-902, an issue where some Java API Function classes could cause AbstractMethodErrors when user code is compiled using the Eclipse compiler. Thanks to @MartinWeindel for diagnosing this problem. (This PR subsumes / closes #30)	2013-10-22 14:26:52 -07:00
Patrick Wendell	ab5ece19a3	Formatting cleanup	2013-10-22 13:03:08 -07:00
Patrick Wendell	c22046b3cc	Minor clean-up in review	2013-10-22 11:00:50 -07:00
Patrick Wendell	7de0ea4d42	Response to code review and adding some more tests	2013-10-22 11:00:50 -07:00
Patrick Wendell	2fa3c4c49c	Fix for Spark-870. This patch fixes a bug where the Spark UI didn't display the correct number of total tasks if the number of tasks in a Stage doesn't equal the number of RDD partitions. It also cleans up the listener API a bit by embedding this information in the StageInfo class rather than passing it separately.	2013-10-22 11:00:25 -07:00
Patrick Wendell	a854f5bfcf	SPARK-940: Do not directly pass Stage objects to SparkListener.	2013-10-22 11:00:06 -07:00
Matei Zaharia	a0e08f0fb9	Merge pull request #82 from JoshRosen/map-output-tracker-refactoring Split MapOutputTracker into Master/Worker classes Previously, MapOutputTracker contained fields and methods that were only applicable to the master or worker instances. This commit introduces a MasterMapOutputTracker class to prevent the master-specific methods from being accessed on workers. I also renamed a few methods and made others protected/private.	2013-10-22 10:20:43 -07:00
Kay Ousterhout	37b9b4cc11	Shorten GETTING_RESULT to GET_RESULT	2013-10-22 10:05:33 -07:00
Aaron Davidson	053ef949ac	Merge ShufflePerfTester patch into shuffle block consolidation	2013-10-21 22:17:53 -07:00
Reynold Xin	a51359c917	Merge pull request #95 from aarondav/perftest Minor: Put StoragePerfTester in org/apache/	2013-10-21 20:33:29 -07:00
Aaron Davidson	97053c4a91	Put StoragePerfTester in org/apache/	2013-10-21 20:25:40 -07:00
Aaron Davidson	0071f0899c	Fix mesos urls This was a bug I introduced in https://github.com/apache/incubator-spark/pull/71 Previously, we explicitly removed the mesos:// part; with PR 71, this no longer occurred.	2013-10-21 15:56:14 -07:00
Kay Ousterhout	916270f5f3	Show "GETTING_RESULTS" state in UI. This commit adds a set of calls using the SparkListener interface that indicate when a task is remotely fetching results, so that we can display this (potentially time-consuming) phase of execution to users through the UI.	2013-10-21 12:46:57 -07:00
Aaron Davidson	4aa0ba1df7	Remove executorId from Task.run()	2013-10-21 12:19:15 -07:00
Patrick Wendell	aa61bfd399	Merge pull request #88 from rxin/clean Made the following traits/interfaces/classes non-public: SparkHadoopWriter SparkHadoopMapRedUtil SparkHadoopMapReduceUtil SparkHadoopUtil PythonAccumulatorParam BlockManagerSlaveActor	2013-10-21 11:57:05 -07:00
Tathagata Das	0666498799	Updated TransformDStream to allow n-ary DStream transform. Added transformWith, leftOuterJoin and rightOuterJoin operations to DStream for Scala and Java APIs. Also added n-ary union and n-ary transform operations to StreamingContext for Scala and Java APIs.	2013-10-21 05:34:09 -07:00
Aaron Davidson	444162afe7	Documentation update	2013-10-20 22:59:45 -07:00
Aaron Davidson	947fceaa73	Close shuffle writers during failure & remove executorId from TaskContext	2013-10-20 22:47:10 -07:00
Patrick Wendell	35886f3474	Merge pull request #41 from pwendell/shuffle-benchmark Provide Instrumentation for Shuffle Write Performance Shuffle write performance can have a major impact on the performance of jobs. This patch adds a few pieces of instrumentation related to shuffle writes. They are: 1. A listing of the time spent performing blocking writes for each task. This is implemented by keeping track of the aggregate delay seen by many individual writes. 2. An undocumented option `spark.shuffle.sync` which forces shuffle data to sync to disk. This is necessary for measuring shuffle performance in the absence of the OS buffer cache. 3. An internal utility which micro-benchmarks write throughput for simulated shuffle outputs. I'm going to do some performance testing on this to see whether these small timing calls add overhead. From a feature perspective, however, I consider this complete. Any feedback is appreciated.	2013-10-20 22:20:32 -07:00


2322 commits