ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Matei Zaharia	ed25105fd9	Merge pull request #174 from ahirreddy/master Write Spark UI url to driver file on HDFS This makes the SIMR code path simpler	2013-11-14 19:43:55 -08:00
Raymond Liu	d4cd32330e	Some fixes for previous master merge commits	2013-11-15 10:22:31 +08:00
Kay Ousterhout	29c88e408e	Don't retry tasks when they fail due to a NotSerializableException	2013-11-14 15:15:19 -08:00
Kay Ousterhout	52144caaa7	Don't retry tasks if result wasn't serializable	2013-11-14 14:56:53 -08:00
Kay Ousterhout	b4546ba9e6	Fix bug where scheduler could hang after task failure. When a task fails, we need to call reviveOffers() so that the task can be rescheduled on a different machine. In the current code, the state in ClusterTaskSetManager indicating which tasks are pending may be updated after revive offers is called (there's a race condition here), so when revive offers is called, the task set manager does not yet realize that there are failed tasks that need to be relaunched.	2013-11-14 13:55:03 -08:00
Kay Ousterhout	2b807e4f2f	Fix bug where scheduler could hang after task failure. When a task fails, we need to call reviveOffers() so that the task can be rescheduled on a different machine. In the current code, the state in ClusterTaskSetManager indicating which tasks are pending may be updated after revive offers is called (there's a race condition here), so when revive offers is called, the task set manager does not yet realize that there are failed tasks that need to be relaunched.	2013-11-14 13:33:11 -08:00
Reynold Xin	1a4cfbea33	Merge pull request #169 from kayousterhout/mesos_fix Don't ignore spark.cores.max when using Mesos Coarse mode totalCoresAcquired is decremented but never incremented, causing Spark to effectively ignore spark.cores.max in coarse grained Mesos mode.	2013-11-14 10:32:11 -08:00
Kay Ousterhout	c64690d725	Changed local backend to use Akka actor	2013-11-14 09:34:56 -08:00
Lian, Cheng	cc8995c8f4	Fixed a scaladoc typo in HadoopRDD.scala	2013-11-14 18:17:05 +08:00
Kay Ousterhout	5125cd3466	Don't ignore spark.cores.max when using Mesos Coarse mode	2013-11-13 23:06:17 -08:00
Raymond Liu	a60620b76a	Merge branch 'master' into scala-2.10	2013-11-14 12:44:19 +08:00
Kay Ousterhout	46f9c6b858	Fixed naming issues and added back ability to specify max task failures.	2013-11-13 17:12:14 -08:00
Matei Zaharia	2054c61a18	Merge pull request #159 from liancheng/dagscheduler-actor-refine Migrate the daemon thread started by DAGScheduler to Akka actor `DAGScheduler` adopts an event queue and a daemon thread polling the it to process events sent to a `DAGScheduler`. This is a classical actor use case. By migrating this thread to Akka actor, we may benefit from both cleaner code and better performance (context switching cost of Akka actor is much less than that of a native thread). But things become a little complicated when taking existing test code into consideration. Code in `DAGSchedulerSuite` is somewhat tightly coupled with `DAGScheduler`, and directly calls `DAGScheduler.processEvent` instead of posting event messages to `DAGScheduler`. To minimize code change, I chose to let the actor to delegate messages to `processEvent`. Maybe this doesn't follow conventional actor usage, but I tried to make it apparently correct. Another tricky part is that, since `DAGScheduler` depends on the `ActorSystem` provided by its field `env`, `env` cannot be null. But the `dagScheduler` field created in `DAGSchedulerSuite.before` was given a null `env`. What's more, `BlockManager.blockIdsToBlockManagers` checks whether `env` is null to determine whether to run the production code or the test code (bad smell here, huh?). I went through all callers of `BlockManager.blockIdsToBlockManagers`, and made sure that if `env != null` holds, then `blockManagerMaster == null` must also hold. That's the logic behind `BlockManager.scala` [line 896](https://github.com/liancheng/incubator-spark/compare/dagscheduler-actor-refine?expand=1#diff-2b643ea78c1add0381754b1f47eec132L896). At last, since `DAGScheduler` instances are always `start()`ed after creation, I removed the `start()` method, and starts the `eventProcessActor` within the constructor.	2013-11-13 16:49:55 -08:00
Ahir Reddy	0ea1f8b225	Write Spark UI url to driver file on HDFS	2013-11-13 15:23:36 -08:00
Kay Ousterhout	150615a31e	Merge remote-tracking branch 'upstream/master' into consolidate_schedulers Conflicts: core/src/main/scala/org/apache/spark/scheduler/ClusterScheduler.scala	2013-11-13 14:38:44 -08:00
Kay Ousterhout	68e5ad58b7	Extracted TaskScheduler interface. Also changed the default maximum number of task failures to be 0 when running in local mode.	2013-11-13 14:32:50 -08:00
Matei Zaharia	39af914b27	Merge pull request #166 from ahirreddy/simr-spark-ui SIMR Backend Scheduler will now write Spark UI URL to HDFS, which is to ... ...be retrieved by SIMR clients	2013-11-13 08:39:05 -08:00
Raymond Liu	0f2e3c6e31	Merge branch 'master' into scala-2.10	2013-11-13 16:55:11 +08:00
Matei Zaharia	b8bf04a085	Merge pull request #160 from xiajunluan/JIRA-923 Fix bug JIRA-923 Fix column sort issue in UI for JIRA-923. https://spark-project.atlassian.net/browse/SPARK-923 Conflicts: core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala core/src/main/scala/org/apache/spark/ui/jobs/StageTable.scala	2013-11-12 16:19:50 -08:00
Ahir Reddy	ccb099e804	SIMR Backend Scheduler will now write Spark UI URL to HDFS, which is to be retrieved by SIMR clients	2013-11-12 15:58:41 -08:00
Prashant Sharma	6860b79f6e	Remove deprecated actorFor and use actorSelection everywhere.	2013-11-12 12:43:53 +05:30
Prashant Sharma	a8bfdd4377	Enabled remote death watch and a way to configure the timeouts for akka heartbeats.	2013-11-12 12:04:00 +05:30
Andrew xia	e13da05424	fix format error	2013-11-11 19:15:45 +08:00
Andrew xia	37d2f3749e	cut lines to less than 100	2013-11-11 15:49:32 +08:00
Andrew xia	b3208063af	Fix bug JIRA-923	2013-11-11 15:39:10 +08:00
Lian, Cheng	e2a43b3dcc	Made some changes according to suggestions from @aarondav	2013-11-11 12:21:54 +08:00
Josh Rosen	ffa5bedf46	Send PySpark commands as bytes insetad of strings.	2013-11-10 16:46:00 -08:00
Josh Rosen	cbb7f04aef	Add custom serializer support to PySpark. For now, this only adds MarshalSerializer, but it lays the groundwork for other supporting custom serializers. Many of these mechanisms can also be used to support deserialization of different data formats sent by Java, such as data encoded by MsgPack. This also fixes a bug in SparkContext.union().	2013-11-10 16:45:38 -08:00
Lian, Cheng	ba55285177	Put the periodical resubmitFailedStages() call into a scheduled task	2013-11-11 01:25:35 +08:00
Reynold Xin	c845611fc3	Moved the Spark internal class registration for Kryo into an object, and added more classes (e.g. MapStatus, BlockManagerId) to the registration.	2013-11-09 23:00:08 -08:00
Reynold Xin	7c5f70d873	Call Kryo setReferences before calling user specified Kryo registrator.	2013-11-09 22:43:36 -08:00
Matei Zaharia	87954d4c85	Merge pull request #154 from soulmachine/ClusterScheduler Replace the thread inside ClusterScheduler.start() with an Akka scheduler Threads are precious resources so that we shouldn't abuse them	2013-11-09 17:53:25 -08:00
Reynold Xin	83bf1920c8	Merge pull request #155 from rxin/jobgroup Don't reset job group when a new job description is set.	2013-11-09 15:40:29 -08:00
Reynold Xin	28f27097cf	Don't reset job group when a new job description is set.	2013-11-09 13:59:31 -08:00
Matei Zaharia	8af99f2356	Merge pull request #149 from tgravescs/fixSecureHdfsAccess Fix secure hdfs access for spark on yarn https://github.com/apache/incubator-spark/pull/23 broke secure hdfs access. Not sure if it works with secure hdfs on standalone. Fixing it at least for spark on yarn. The broadcasting of jobconf change also broke secure hdfs access as it didn't take into account things calling the getPartitions before sparkContext is initialized. The DAGScheduler does this as it tries to getShuffleMapStage.	2013-11-09 13:48:00 -08:00
Matei Zaharia	72a601ec31	Merge pull request #152 from rxin/repl Propagate SparkContext local properties from spark-repl caller thread to the repl execution thread.	2013-11-09 11:55:16 -08:00
soulmachine	28115fa8cb	replace the thread with a Akka scheduler	2013-11-09 22:38:27 +08:00
Lian, Cheng	765ebca04f	Remove unnecessary null checking	2013-11-09 21:13:03 +08:00
Lian, Cheng	2539c06745	Replaced the daemon thread started by DAGScheduler with an actor	2013-11-09 19:05:18 +08:00
Reynold Xin	319299941d	Propagate the SparkContext local property from the thread that calls the spark-repl to the actual execution thread.	2013-11-09 00:32:14 -08:00
Russell Cardullo	ef85a51f85	Add graphite sink for metrics This adds a metrics sink for graphite. The sink must be configured with the host and port of a graphite node and optionally may be configured with a prefix that will be prepended to all metrics that are sent to graphite.	2013-11-08 16:36:03 -08:00
Aaron Davidson	dd63c548c2	Use SPARK_HOME instead of user.dir in ExecutorRunnerTest	2013-11-08 12:51:05 -08:00
tgravescs	13a19505e4	Don't call the doAs if user is unknown or the same user that is already running	2013-11-08 12:04:09 -06:00
tgravescs	f95cb04e40	Remove the runAsUser as it breaks secure hdfs access	2013-11-08 10:07:15 -06:00
tgravescs	5f9ed51719	Fix access to Secure HDFS	2013-11-08 08:41:57 -06:00
Reynold Xin	3d4ad84b63	Merge pull request #148 from squito/include_appId Include appId in executor cmd line args add the appId back into the executor cmd line args. I also made a pretty lame regression test, just to make sure it doesn't get dropped in the future. not sure it will run on the build server, though, b/c `ExecutorRunner.buildCommandSeq()` expects to be abel to run the scripts in `bin`.	2013-11-07 11:08:27 -08:00
Imran Rashid	ca66f5d5a2	fix formatting	2013-11-07 07:23:59 -06:00
Imran Rashid	8d3cdda9a2	very basic regression test to make sure appId doesnt get dropped in future	2013-11-07 01:35:48 -06:00
Imran Rashid	36e832bff0	include the appid in the cmd line arguments to Executors	2013-11-07 01:11:49 -06:00
jerryshao	12dc385a49	Add Spark multi-user support for standalone mode and Mesos	2013-11-07 11:18:09 +08:00
Reynold Xin	aadeda5e76	Merge pull request #144 from liancheng/runjob-clean Removed unused return value in SparkContext.runJob Return type of this `runJob` version is `Unit`: def runJob[T, U: ClassManifest]( rdd: RDD[T], func: (TaskContext, Iterator[T]) => U, partitions: Seq[Int], allowLocal: Boolean, resultHandler: (Int, U) => Unit) { ... } It's obviously unnecessary to "return" `result`.	2013-11-06 13:27:47 -08:00
Aaron Davidson	80e98d2bd7	Attempt to fix SparkListenerSuite breakage Could not reproduce locally, but this test could've been flaky if the build machine was too fast.	2013-11-06 08:03:35 -08:00
Lian, Cheng	a0c4565183	Removed unused return value in SparkContext.runJob	2013-11-06 23:18:59 +08:00
Reynold Xin	a02eed6811	Ignore a task update status if the executor doesn't exist anymore.	2013-11-05 18:46:38 -08:00
Lian, Cheng	8b4c994e8c	Using compact case class pattern matching syntax to simplify code in DAGScheduler.processEvent	2013-11-05 17:18:42 +08:00
Reynold Xin	81065321c0	Merge pull request #139 from aarondav/shuffle-next Never store shuffle blocks in BlockManager After the BlockId refactor (PR #114), it became very clear that ShuffleBlocks are of no use within BlockManager (they had a no-arg constructor!). This patch completely eliminates them, saving us around 100-150 bytes per shuffle block. The total, system-wide overhead per shuffle block is now a flat 8 bytes, excluding state saved by the MapOutputTracker. Note: This should not be merged directly into 0.8.0 -- see #138	2013-11-04 20:47:14 -08:00
Aaron Davidson	93c90844cb	Never store shuffle blocks in BlockManager After the BlockId refactor (PR #114), it became very clear that ShuffleBlocks are of no use within BlockManager (they had a no-arg constructor!). This patch completely eliminates them, saving us around 100-150 bytes per shuffle block. The total, system-wide overhead per shuffle block is now a flat 8 bytes, excluding state saved by the MapOutputTracker.	2013-11-04 18:43:42 -08:00
Reynold Xin	0b26a392df	Merge pull request #128 from shimingfei/joblogger-doc add javadoc to JobLogger, and some small fix against Spark-941 add javadoc to JobLogger, output more info for RDD, modify recordStageDepGraph to avoid output duplicate stage dependency information (cherry picked from commit `518cf22eb2`) Signed-off-by: Reynold Xin <rxin@apache.org>	2013-11-04 18:22:06 -08:00
Aaron Davidson	1ba11b1c6a	Minor cleanup in ShuffleBlockManager	2013-11-04 17:16:41 -08:00
Aaron Davidson	6201e5e249	Refactor ShuffleBlockManager to reduce public interface - ShuffleBlocks has been removed and replaced by ShuffleWriterGroup. - ShuffleWriterGroup no longer contains a reference to a ShuffleFileGroup. - ShuffleFile has been removed and its contents are now within ShuffleFileGroup. - ShuffleBlockManager.forShuffle has been replaced by a more stateful forMapTask.	2013-11-04 09:41:04 -08:00
Aaron Davidson	b0cf19fe3c	Add javadoc and remove unused code	2013-11-03 22:16:58 -08:00
Aaron Davidson	39d93ed4b9	Clean up test files properly For some reason, even calling java.nio.Files.createTempDirectory().getFile.deleteOnExit() does not delete the directory on exit. Guava's analagous function seems to work, however.	2013-11-03 21:52:59 -08:00
Aaron Davidson	a0bb569a81	use OpenHashMap, remove monotonicity requirement, fix failure bug	2013-11-03 21:34:56 -08:00
Aaron Davidson	8703898d3f	Address Reynold's comments	2013-11-03 21:34:44 -08:00
Aaron Davidson	3ca52309f2	Fix test breakage	2013-11-03 21:34:44 -08:00
Aaron Davidson	1592adfa25	Add documentation and address other comments	2013-11-03 21:34:44 -08:00
Aaron Davidson	7d44dec9bd	Fix weird bug with specialized PrimitiveVector	2013-11-03 21:34:43 -08:00
Aaron Davidson	7453f31181	Address minor comments	2013-11-03 21:34:43 -08:00
Aaron Davidson	84991a1b91	Memory-optimized shuffle file consolidation Overhead of each shuffle block for consolidation has been reduced from >300 bytes to 8 bytes (1 primitive Long). Verified via profiler testing with 1 mil shuffle blocks, net overhead was ~8,400,000 bytes. Despite the memory-optimized implementation incurring extra CPU overhead, the runtime of the shuffle phase in this test was only around 2% slower, while the reduce phase was 40% faster, when compared to not using any shuffle file consolidation.	2013-11-03 21:34:13 -08:00
Reynold Xin	eb5f8a3f97	Code review feedback.	2013-11-03 18:11:44 -08:00
Josh Rosen	7d68a81a8e	Remove Pickle-wrapping of Java objects in PySpark. If we support custom serializers, the Python worker will know what type of input to expect, so we won't need to wrap Tuple2 and Strings into pickled tuples and strings.	2013-11-03 11:03:02 -08:00
Josh Rosen	a48d88d206	Replace magic lengths with constants in PySpark. Write the length of the accumulators section up-front rather than terminating it with a negative length. I find this easier to read.	2013-11-03 10:54:24 -08:00
Reynold Xin	1e9543b567	Fixed a bug that uses twice amount of memory for the primitive arrays due to a scala compiler bug. Also addressed Matei's code review comment.	2013-11-02 23:19:01 -07:00
Reynold Xin	da6bb0aedd	Merge branch 'master' into hash1	2013-11-02 22:45:15 -07:00
Evan Chan	f3679fd494	Add local: URI support to addFile as well	2013-11-01 11:08:03 -07:00
Kay Ousterhout	fb64828b0b	Cleaned up imports and fixed test bug	2013-10-31 23:42:56 -07:00
Matei Zaharia	8f1098a3f0	Merge pull request #117 from stephenh/avoid_concurrent_modification_exception Handle ConcurrentModificationExceptions in SparkContext init. System.getProperties.toMap will fail-fast when concurrently modified, and it seems like some other thread started by SparkContext does a System.setProperty during it's initialization. Handle this by just looping on ConcurrentModificationException, which seems the safest, since the non-fail-fast methods (Hastable.entrySet) have undefined behavior under concurrent modification.	2013-10-30 20:11:48 -07:00
Kay Ousterhout	a124658e53	Fixed most issues with unit tests	2013-10-30 19:29:38 -07:00
Kay Ousterhout	5e91495f5c	Deduplicate Local and Cluster schedulers. The code in LocalScheduler/LocalTaskSetManager was nearly identical to the code in ClusterScheduler/ClusterTaskSetManager. The redundancy made making updating the schedulers unnecessarily painful and error- prone. This commit combines the two into a single TaskScheduler/ TaskSetManager.	2013-10-30 18:48:34 -07:00
Matei Zaharia	dc9ce16f6b	Merge pull request #126 from kayousterhout/local_fix Fixed incorrect log message in local scheduler This change is especially relevant at the moment, because some users are seeing this failure, and the log message is misleading/incorrect (because for the tests, the max failures is set to 0, not 4)	2013-10-30 17:01:56 -07:00
Matei Zaharia	33de11c51d	Merge pull request #124 from tgravescs/sparkHadoopUtilFix Pull SparkHadoopUtil out of SparkEnv (jira SPARK-886) Having the logic to initialize the correct SparkHadoopUtil in SparkEnv prevents it from being used until after the SparkContext is initialized. This causes issues like https://spark-project.atlassian.net/browse/SPARK-886. It also makes it hard to use in singleton objects. For instance I want to use it in the security code.	2013-10-30 16:58:27 -07:00
Kay Ousterhout	ff038eb4e0	Fixed incorrect log message in local scheduler	2013-10-30 15:27:23 -07:00
Matei Zaharia	618c1f6cf3	Merge pull request #125 from velvia/2013-10/local-jar-uri Add support for local:// URI scheme for addJars() This PR adds support for a new URI scheme for SparkContext.addJars(): `local://file/path`. The local scheme indicates that the `/file/path` exists on every worker node. The reason for its existence is for big library JARs, which would be really expensive to serve using the standard HTTP fileserver distribution method, especially for big clusters. Today the only inexpensive method (assuming such a file is on every host, via say NFS, rsync, etc.) of doing this is to add the JAR to the SPARK_CLASSPATH, but we want a method where the user does not need to modify the Spark configuration. I would add something to the docs, but it's not obvious where to add it. Oh, and it would be great if this could be merged in time for 0.8.1.	2013-10-30 12:03:44 -07:00
Stephen Haberman	09f3b677cb	Avoid match errors when filtering for spark.hadoop settings.	2013-10-30 12:29:39 -05:00
tgravescs	f231aaa24c	move the hadoopJobMetadata back into SparkEnv	2013-10-30 11:46:12 -05:00
Evan Chan	de0285556a	Add support for local:// URI scheme for addJars() This indicates that a jar is available locally on each worker node.	2013-10-30 09:41:35 -07:00
tgravescs	54d9c6f253	Merge remote-tracking branch 'upstream/master' into sparkHadoopUtilFix	2013-10-30 10:41:21 -05:00
Josh Rosen	cb9c8a922f	Extract BlockInfo classes from BlockManager. This saves space, since the inner classes needed to keep a reference to the enclosing BlockManager.	2013-10-29 18:06:51 -07:00
Stephen Haberman	3a388c320c	Use Properties.clone() instead.	2013-10-29 19:20:40 -05:00
Josh Rosen	846b1cf5ab	Store fewer BlockInfo fields for shuffle blocks.	2013-10-29 15:14:29 -07:00
tgravescs	eeb5f64c67	Remove SparkHadoopUtil stuff from SparkEnv	2013-10-29 17:12:16 -05:00
Josh Rosen	2d7cf6a271	Restructure BlockInfo fields to reduce memory use.	2013-10-27 23:01:03 -07:00
Matei Zaharia	aec9bf9060	Merge pull request #112 from kayousterhout/ui_task_attempt_id Display both task ID and task attempt ID in UI, and rename taskId to taskAttemptId Previously only the task attempt ID was shown in the UI; this was confusing because the job can be shown as complete while there are tasks still running. Showing the task ID in addition to the attempt ID makes it clear which tasks are redundant. This commit also renames taskId to taskAttemptId in TaskInfo and in the local/cluster schedulers. This identifier was used to uniquely identify attempts, not tasks, so the current naming was confusing. The new naming is also more consistent with map reduce.	2013-10-27 19:32:00 -07:00
Stephen Haberman	a6ae2b4832	Handle ConcurrentModificationExceptions in SparkContext init. System.getProperties.toMap will fail-fast when concurrently modified, and it seems like some other thread started by SparkContext does a System.setProperty during it's initialization. Handle this by just looping on ConcurrentModificationException, which seems the safest, since the non-fail-fast methods (Hastable.entrySet) have undefined behavior under concurrent modification.	2013-10-27 14:08:32 -05:00
Aaron Davidson	4261e834cb	Use flag instead of name check.	2013-10-26 23:53:38 -07:00
Aaron Davidson	596f18479e	Eliminate extra memory usage when shuffle file consolidation is disabled Otherwise, we see SPARK-946 even when shuffle file consolidation is disabled. Fixing SPARK-946 is still forthcoming.	2013-10-26 22:35:01 -07:00
Kay Ousterhout	ae22b4dd99	Display both task ID and task index in UI	2013-10-26 22:18:39 -07:00
Matei Zaharia	bab496c120	Merge pull request #108 from alig/master Changes to enable executing by using HDFS as a synchronization point between driver and executors, as well as ensuring executors exit properly.	2013-10-25 18:28:43 -07:00
Matei Zaharia	d307db6e55	Merge pull request #102 from tdas/transform Added new Spark Streaming operations New operations - transformWith which allows arbitrary 2-to-1 DStream transform, added to Scala and Java API - StreamingContext.transform to allow arbitrary n-to-1 DStream - leftOuterJoin and rightOuterJoin between 2 DStreams, added to Scala and Java API - missing variations of join and cogroup added to Scala Java API - missing JavaStreamingContext.union Updated a number of Java and Scala API docs	2013-10-25 17:26:06 -07:00
Ali Ghodsi	eef261c892	fixing comments on PR	2013-10-25 16:48:33 -07:00
Matei Zaharia	85e2cab6f6	Merge pull request #111 from kayousterhout/ui_name Properly display the name of a stage in the UI. This fixes a bug introduced by the fix for SPARK-940, which changed the UI to display the RDD name rather than the stage name. As a result, no name for the stage was shown when using the Spark shell, which meant that there was no way to click on the stage to see more details (e.g., the running tasks). This commit changes the UI back to using the stage name. @pwendell -- let me know if this change was intentional	2013-10-25 14:46:06 -07:00
Tathagata Das	dc9570782a	Merge branch 'apache-master' into transform	2013-10-25 14:22:23 -07:00
Kay Ousterhout	a9c8d83aaf	Properly display the name of a stage in the UI. This fixes a bug introduced by the fix for SPARK-940, which changed the UI to display the RDD name rather than the stage name. As a result, no name for the stage was shown when using the Spark shell, which meant that there was no way to click on the stage to see more details (e.g., the running tasks). This commit changes the UI back to using the stage name.	2013-10-25 12:00:09 -07:00
Patrick Wendell	e5f6d5697b	Spacing fix	2013-10-24 22:08:06 -07:00
Patrick Wendell	31e92b72e3	Adding Java versions and associated tests	2013-10-24 21:14:56 -07:00
Patrick Wendell	05ac9940ee	Adding tests	2013-10-24 14:31:34 -07:00
Patrick Wendell	2fda84fe3f	Always use a shuffle	2013-10-24 14:31:34 -07:00
Patrick Wendell	08c1a42d7d	Add a `repartition` operator. This patch adds an operator called repartition with more straightforward semantics than the current `coalesce` operator. There are a few use cases where this operator is useful: 1. If a user wants to increase the number of partitions in the RDD. This is more common now with streaming. E.g. a user is ingesting data on one node but they want to add more partitions to ensure parallelism of subsequent operations across threads or the cluster. Right now they have to call rdd.coalesce(numSplits, shuffle=true) - that's super confusing. 2. If a user has input data where the number of partitions is not known. E.g. > sc.textFile("some file").coalesce(50).... This is both vague semantically (am I growing or shrinking this RDD) but also, may not work correctly if the base RDD has fewer than 50 partitions. The new operator forces shuffles every time, so it will always produce exactly the number of new partitions. It also throws an exception rather than silently not-working if a bad input is passed. I am currently adding streaming tests (requires refactoring some of the test suite to allow testing at partition granularity), so this is not ready for merge yet. But feedback is welcome.	2013-10-24 14:31:33 -07:00
Ali Ghodsi	05a0df2b9e	Makes Spark SIMR ready.	2013-10-24 11:59:51 -07:00
Tathagata Das	0400aba1c0	Merge branch 'apache-master' into transform	2013-10-24 11:05:00 -07:00
Tathagata Das	bacfe5ebca	Added JavaStreamingContext.transform	2013-10-24 10:56:24 -07:00
Matei Zaharia	1dc776b863	Merge pull request #93 from kayousterhout/ui_new_state Show "GETTING_RESULTS" state in UI. This commit adds a set of calls using the SparkListener interface that indicate when a task is remotely fetching results, so that we can display this (potentially time-consuming) phase of execution to users through the UI.	2013-10-23 22:05:52 -07:00
Kay Ousterhout	b45352e373	Clear akka frame size property in tests	2013-10-23 18:23:28 -07:00
Kay Ousterhout	c42f5d1787	Fixed broken tests	2013-10-23 17:35:01 -07:00
Josh Rosen	210858ac02	Add unpersist() to JavaDoubleRDD and JavaPairRDD. Also add support for new optional `blocking` argument.	2013-10-23 17:27:01 -07:00
Kay Ousterhout	a5f8f54ecd	Merge remote-tracking branch 'upstream/master' into ui_new_state Conflicts: core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala	2013-10-23 16:06:28 -07:00
Tathagata Das	9fccb17a5f	Removed Function3.call() based on Josh's comment.	2013-10-23 12:07:07 -07:00
Tathagata Das	fe8626efd1	Merge branch 'apache-master' into transform	2013-10-22 23:40:40 -07:00
Tathagata Das	72d2e1dd77	Fixed bug in Java transformWith, added more Java testcases for transform and transformWith, added missing variations of Java join and cogroup, updated various Scala and Java API docs.	2013-10-22 23:35:51 -07:00
Josh Rosen	768eb9c962	Remove redundant Java Function call() definitions This should fix SPARK-902, an issue where some Java API Function classes could cause AbstractMethodErrors when user code is compiled using the Eclipse compiler. Thanks to @MartinWeindel for diagnosing this problem. (This PR subsumes / closes #30)	2013-10-22 14:26:52 -07:00
Patrick Wendell	ab5ece19a3	Formatting cleanup	2013-10-22 13:03:08 -07:00
Patrick Wendell	c22046b3cc	Minor clean-up in review	2013-10-22 11:00:50 -07:00
Patrick Wendell	7de0ea4d42	Response to code review and adding some more tests	2013-10-22 11:00:50 -07:00
Patrick Wendell	2fa3c4c49c	Fix for Spark-870. This patch fixes a bug where the Spark UI didn't display the correct number of total tasks if the number of tasks in a Stage doesn't equal the number of RDD partitions. It also cleans up the listener API a bit by embedding this information in the StageInfo class rather than passing it seperately.	2013-10-22 11:00:25 -07:00
Patrick Wendell	a854f5bfcf	SPARK-940: Do not directly pass Stage objects to SparkListener.	2013-10-22 11:00:06 -07:00
Matei Zaharia	a0e08f0fb9	Merge pull request #82 from JoshRosen/map-output-tracker-refactoring Split MapOutputTracker into Master/Worker classes Previously, MapOutputTracker contained fields and methods that were only applicable to the master or worker instances. This commit introduces a MasterMapOutputTracker class to prevent the master-specific methods from being accessed on workers. I also renamed a few methods and made others protected/private.	2013-10-22 10:20:43 -07:00
Kay Ousterhout	37b9b4cc11	Shorten GETTING_RESULT to GET_RESULT	2013-10-22 10:05:33 -07:00
Aaron Davidson	053ef949ac	Merge ShufflePerfTester patch into shuffle block consolidation	2013-10-21 22:17:53 -07:00
Reynold Xin	a51359c917	Merge pull request #95 from aarondav/perftest Minor: Put StoragePerfTester in org/apache/	2013-10-21 20:33:29 -07:00
Aaron Davidson	97053c4a91	Put StoragePerfTester in org/apache/	2013-10-21 20:25:40 -07:00
Aaron Davidson	0071f0899c	Fix mesos urls This was a bug I introduced in https://github.com/apache/incubator-spark/pull/71 Previously, we explicitly removed the mesos:// part; with PR 71, this no longer occured.	2013-10-21 15:56:14 -07:00
Kay Ousterhout	916270f5f3	Show "GETTING_RESULTS" state in UI. This commit adds a set of calls using the SparkListener interface that indicate when a task is remotely fetching results, so that we can display this (potentially time-consuming) phase of execution to users through the UI.	2013-10-21 12:46:57 -07:00
Aaron Davidson	4aa0ba1df7	Remove executorId from Task.run()	2013-10-21 12:19:15 -07:00
Patrick Wendell	aa61bfd399	Merge pull request #88 from rxin/clean Made the following traits/interfaces/classes non-public: Made the following traits/interfaces/classes non-public: SparkHadoopWriter SparkHadoopMapRedUtil SparkHadoopMapReduceUtil SparkHadoopUtil PythonAccumulatorParam BlockManagerSlaveActor	2013-10-21 11:57:05 -07:00
Tathagata Das	0666498799	Updated TransformDStream to allow n-ary DStream transform. Added transformWith, leftOuterJoin and rightOuterJoin operations to DStream for Scala and Java APIs. Also added n-ary union and n-ary transform operations to StreamingContext for Scala and Java APIs.	2013-10-21 05:34:09 -07:00
Holden Karau	20b33bc4b5	Remove extranious type declerations	2013-10-21 00:21:37 -07:00
Holden Karau	092b87e7c8	Remove extranious type definitions from inside of tests	2013-10-21 00:20:15 -07:00
Holden Karau	699f7d28c0	CR feedback	2013-10-21 00:10:03 -07:00
Aaron Davidson	444162afe7	Documentation update	2013-10-20 22:59:45 -07:00
Aaron Davidson	947fceaa73	Close shuffle writers during failure & remove executorId from TaskContext	2013-10-20 22:47:10 -07:00
Patrick Wendell	35886f3474	Merge pull request #41 from pwendell/shuffle-benchmark Provide Instrumentation for Shuffle Write Performance Shuffle write performance can have a major impact on the performance of jobs. This patch adds a few pieces of instrumentation related to shuffle writes. They are: 1. A listing of the time spent performing blocking writes for each task. This is implemented by keeping track of the aggregate delay seen by many individual writes. 2. An undocumented option `spark.shuffle.sync` which forces shuffle data to sync to disk. This is necessary for measuring shuffle performance in the absence of the OS buffer cache. 3. An internal utility which micro-benchmarks write throughput for simulated shuffle outputs. I'm going to do some performance testing on this to see whether these small timing calls add overhead. From a feature perspective, however, I consider this complete. Any feedback is appreciated.	2013-10-20 22:20:32 -07:00
Reynold Xin	5b9380e017	Merge pull request #89 from rxin/executor Don't setup the uncaught exception handler in local mode. This avoids unit test failures for Spark streaming. java.util.concurrent.RejectedExecutionException: Task org.apache.spark.streaming.JobManager$JobHandler@38cf728d rejected from java.util.concurrent.ThreadPoolExecutor@3b69a41e[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 14] at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048) at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821) at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372) at org.apache.spark.streaming.JobManager.runJob(JobManager.scala:54) at org.apache.spark.streaming.Scheduler$$anonfun$generateJobs$2.apply(Scheduler.scala:108) at org.apache.spark.streaming.Scheduler$$anonfun$generateJobs$2.apply(Scheduler.scala:108) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.streaming.Scheduler.generateJobs(Scheduler.scala:108) at org.apache.spark.streaming.Scheduler$$anonfun$1.apply$mcVJ$sp(Scheduler.scala:41) at org.apache.spark.streaming.util.RecurringTimer.org$apache$spark$streaming$util$RecurringTimer$$loop(RecurringTimer.scala:66) at org.apache.spark.streaming.util.RecurringTimer$$anon$1.run(RecurringTimer.scala:34)	2013-10-20 21:03:51 -07:00
Reynold Xin	b4d8478454	Made JobLogger public again and some minor cleanup.	2013-10-20 18:59:28 -07:00
Aaron Davidson	4b68ddf3d0	Cleanup old shuffle file metadata from memory	2013-10-20 17:56:41 -07:00
Matei Zaharia	edc5e3f8f4	Merge pull request #75 from JoshRosen/block-manager-cleanup Code de-duplication in BlockManager The BlockManager has a few methods that duplicate most of their code. This pull request extracts the duplicated code into private doPut(), doGetLocal(), and doGetRemote() methods that unify the storing/reading of bytes or objects. I believe that I preserved the logic of the original code, but I'd appreciate some help in reviewing this.	2013-10-20 17:18:06 -07:00
Aaron Davidson	42a049723d	Address Josh and Reynold's comments	2013-10-20 16:11:59 -07:00
Josh Rosen	1fa5baf9ab	Unwrap a long line that actually fits.	2013-10-20 14:50:21 -07:00
Josh Rosen	640f253a65	Fix test failures in local mode due to updateEpoch	2013-10-20 14:49:05 -07:00
Josh Rosen	68d6806ea4	Minor cleanup based on @aarondav's code review.	2013-10-20 13:20:14 -07:00
Reynold Xin	7414805e4e	Don't setup the uncaught exception handler in local mode. This avoids unit test failures for Spark streaming.	2013-10-20 13:03:48 -07:00
Reynold Xin	8e1937f8ba	Made the following traits/interfaces/classes non-public: SparkHadoopWriter SparkHadoopMapRedUtil SparkHadoopMapReduceUtil SparkHadoopUtil PythonAccumulatorParam JobLogger BlockManagerSlaveActor	2013-10-20 12:22:07 -07:00
Reynold Xin	2a7ae1736a	Merge pull request #84 from rxin/kill1 Added documentation for setJobGroup. Also some minor cleanup in SparkContext.	2013-10-20 11:45:21 -07:00
Aaron Davidson	38b8048f29	Fix compiler errors Whoops. Last-second changes require testing too, it seems.	2013-10-20 11:03:36 -07:00
Reynold Xin	fabd05dabc	Updated setGroupId documentation and marked dagSchedulerSource and blockManagerSource as private in SparkContext.	2013-10-20 10:54:30 -07:00
Matei Zaharia	e4abb75d70	Merge pull request #85 from rxin/clean Moved the top level spark package object from spark to org.apache.spark This is a pretty annoying documentation bug ...	2013-10-20 09:38:37 -07:00
Aaron Davidson	136b9b3a3e	Basic shuffle file consolidation The Spark shuffle phase can produce a large number of files, as one file is created per mapper per reducer. For large or repeated jobs, this often produces millions of shuffle files, which sees extremely degredaded performance from the OS file system. This patch seeks to reduce that burden by combining multipe shuffle files into one. This PR draws upon the work of Jason Dai in https://github.com/mesos/spark/pull/669. However, it simplifies the design in order to get the majority of the gain with less overall intellectual and code burden. The vast majority of code in this pull request is a refactor to allow the insertion of a clean layer of indirection between logical block ids and physical files. This, I feel, provides some design clarity in addition to enabling shuffle file consolidation. The main goal is to produce one shuffle file per reducer per active mapper thread. This allows us to isolate the mappers (simplifying the failure modes), while still allowing us to reduce the number of mappers tremendously for large tasks. In order to accomplish this, we simply create a new set of shuffle files for every parallel task, and return the files to a pool which will be given out to the next run task.	2013-10-20 02:58:26 -07:00
Aaron Davidson	861dc409d7	Refactor of DiskStore for shuffle file consolidation The main goal of this refactor was to allow the interposition of a new layer which maps logical BlockIds to physical locations other than a file with the same name as the BlockId. In particular, BlockIds will need to be mappable to chunks of files, as multiple will be stored in the same file. In order to accomplish this, the following changes have been made: - Creation of DiskBlockManager, which manages the association of logical BlockIds to physical disk locations (called FileSegments). By default, Blocks are simply mapped to physical files of the same name, as before. - The DiskStore now indirects all requests for a given BlockId through the DiskBlockManager in order to resolve the actual File location. - DiskBlockObjectWriter has been merged into BlockObjectWriter. - The Netty PathResolver has been changed to map BlockIds into FileSegments, as this codepath is the only one that uses Netty, and that is likely to remain the case. Overall, I think this refactor produces a clearer division between the logical Block paradigm and their physical on-disk location. There is now an explicit (and documented) mapping from one to the other.	2013-10-20 02:48:41 -07:00
Holden Karau	e58c69d955	Add tests for the Java implementation.	2013-10-20 01:17:13 -07:00
Reynold Xin	8396a6649e	Moved the top level spark package object from spark to org.apache.spark	2013-10-19 23:26:15 -07:00
Reynold Xin	eb9bf69462	Added documentation for setJobGroup. Also some minor cleanup in SparkContext.	2013-10-19 23:16:44 -07:00
Josh Rosen	9159d2d09d	Split MapOutputTracker into Master/Worker classes. Previously, MapOutputTracker contained fields and methods that were only applicable to the master or worker instances. This commit introduces a MasterMapOutputTracker class to prevent the master-specific methods from being accessed on workers. I also renamed a few methods and made others protected/private.	2013-10-19 20:01:22 -07:00
Josh Rosen	867d8fdf2a	De-duplicate code in dropOld[Non]BroadcastBlocks.	2013-10-19 19:53:12 -07:00
Josh Rosen	6925a1322b	Code de-duplication in put() and putBytes().	2013-10-19 19:53:12 -07:00
Josh Rosen	8279185651	De-duplication in getRemote() and getRemoteBytes().	2013-10-19 19:53:12 -07:00
Josh Rosen	babccb695e	De-duplication in getLocal() and getLocalBytes().	2013-10-19 19:52:10 -07:00
Reynold Xin	6511bbe2ad	Merge pull request #78 from mosharaf/master Removed BitTorrentBroadcast and TreeBroadcast. TorrentBroadcast replaces both.	2013-10-19 11:34:56 -07:00
Holden Karau	2a37235825	Initial commit of adding histogram functionality to the DoubleRDDFunctions.	2013-10-19 00:57:25 -07:00
Mosharaf Chowdhury	29617c27a1	Removed BitTorrentBroadcast and TreeBroadcast. TorrentBroadcast is replacing both.	2013-10-18 23:54:11 -07:00
Matei Zaharia	599dcb0ddf	Merge pull request #74 from rxin/kill Job cancellation via job group id. This PR adds a simple API to group together a set of jobs belonging to a thread and threads spawned from it. It also allows the cancellation of all jobs in this group. An example: sc.setJobDescription("this_is_the_group_id", "some job description") sc.parallelize(1 to 10000, 2).map { i => Thread.sleep(10); i }.count() In a separate thread: sc.cancelJobGroup("this_is_the_group_id")	2013-10-18 22:49:00 -07:00
Reynold Xin	806f3a3adb	Job cancellation via job group id.	2013-10-18 21:46:08 -07:00
Matei Zaharia	e5316d0685	Merge pull request #68 from mosharaf/master Faster and stable/reliable broadcast HttpBroadcast is noticeably slow, but the alternatives (TreeBroadcast or BitTorrentBroadcast) are notoriously unreliable. The main problem with them is they try to manage the memory for the pieces of a broadcast themselves. Right now, the BroadcastManager does not know which machines the tasks reading from a broadcast variable is running and when they have finished. Consequently, we try to guess and often guess wrong, which blows up the memory usage and kills/hangs jobs. This very simple implementation solves the problem by not trying to manage the intermediate pieces; instead, it offloads that duty to the BlockManager which is quite good at juggling blocks. Otherwise, it is very similar to the BitTorrentBroadcast implementation (without fancy optimizations). And it runs much faster than HttpBroadcast we have right now. I've been using this for another project for last couple of weeks, and just today did some benchmarking against the Http one. The following shows the improvements for increasing broadcast size for cold runs. Each line represent the number of receivers. ![fix-bc-first](https://f.cloud.github.com/assets/232966/1349342/ffa149e4-36e7-11e3-9fa6-c74555829356.png) After the first broadcast is over, i.e., after JVM is wormed up and for HttpBroadcast the server is already running (I think), the following are the improvements for warm runs. ![fix-bc-succ](https://f.cloud.github.com/assets/232966/1349352/5a948bae-36e8-11e3-98ce-34f19ebd33e0.jpg) The curves are not as nice as the cold runs, but the improvements are obvious, specially for larger broadcasts and more receivers. Depending on how it goes, we should deprecate and/or remove old TreeBroadcast and BitTorrentBroadcast implementations, and hopefully, SPARK-889 will not be necessary any more.	2013-10-18 20:30:56 -07:00
Hossein Falaki	2d511ab320	Made SerializableHyperLogLog Externalizable and added Kryo tests	2013-10-18 15:30:45 -07:00
Hossein Falaki	13227aaa28	Added stream-lib dependency to Maven build	2013-10-18 14:10:24 -07:00
Hossein Falaki	79868fe724	Improved code style.	2013-10-17 23:39:20 -07:00
Mosharaf Chowdhury	08391dbcb8	Should compile now.	2013-10-17 23:06:17 -07:00
Hossein Falaki	b611d9a65c	Fixed document typo	2013-10-17 23:05:22 -07:00
Mosharaf Chowdhury	8612641362	Added an after block to reset spark.broadcast.factory	2013-10-17 22:44:04 -07:00
Hossein Falaki	ec5df800fd	Added countDistinctByKey to PairRDDFunctions that counts the approximate number of unique values for each key in the RDD.	2013-10-17 22:26:00 -07:00
Hossein Falaki	1a701358c0	Added a countDistinct method to RDD that takes takes an accuracy parameter and returns the (approximate) number of distinct elements in the RDD.	2013-10-17 22:24:48 -07:00
Hossein Falaki	843727af99	Added a serializable wrapper for HyperLogLog	2013-10-17 22:17:06 -07:00
Aaron Davidson	74737264c4	Spark shell exits if it cannot create SparkContext Mainly, this occurs if you provide a messed up MASTER url (one that doesn't match one of our regexes). Previously, we would default to Mesos, fail, and then start the shell anyway, except that any Spark command would fail.	2013-10-17 18:51:19 -07:00
Mosharaf Chowdhury	e178ae4e9b	BroadcastSuite updated to test both HttpBroadcast and TorrentBroadcast in local, local[N], local-cluster settings.	2013-10-17 16:38:43 -07:00
Mosharaf Chowdhury	6a84e40efe	Merge remote-tracking branch 'upstream/master'	2013-10-17 13:14:33 -07:00
Mosharaf Chowdhury	35b2415fb3	Code styling. Updated doc.	2013-10-17 13:14:12 -07:00
Mosharaf Chowdhury	e663750488	Removed unused code. Changes to match Spark coding style.	2013-10-17 00:19:50 -07:00
Kay Ousterhout	809f547633	Fixed unit tests	2013-10-16 23:16:12 -07:00
Reynold Xin	3e7df8f6c6	Added a number of very fast, memory-efficient data structures: BitSet, OpenHashSet, OpenHashMap, PrimitiveKeyOpenHashMap.	2013-10-16 22:58:52 -07:00
Mosharaf Chowdhury	a8d0981832	Fixes for the new BlockId naming convention.	2013-10-16 21:33:33 -07:00
Mosharaf Chowdhury	feb45d391f	Default blockSize is 4MB. BroadcastTest2 example added for testing broadcasts.	2013-10-16 21:33:33 -07:00
Mosharaf Chowdhury	6e5a60fab4	Removed unnecessary code, and added comment of memory-latency tradeoff.	2013-10-16 21:33:33 -07:00
Mosharaf Chowdhury	4602e2bf6e	Torrent-ish broadcast based on BlockManager.	2013-10-16 21:33:33 -07:00
Kay Ousterhout	ec512583ab	Removed TaskSchedulerListener interface. The interface was used only by the DAG scheduler (so it wasn't necessary to define the additional interface), and the naming makes it very confusing when reading the code (because "listener" was used to describe the DAG scheduler, rather than SparkListeners, which implement a nearly-identical interface but serve a different function).	2013-10-16 16:57:42 -07:00
Matei Zaharia	4e46fde818	Merge pull request #62 from harveyfeng/master Make TaskContext's stageId publicly accessible.	2013-10-15 23:14:27 -07:00
Harvey Feng	65b46236e7	Proper formatting for SparkHadoopWriter class extensions.	2013-10-15 21:51:52 -07:00
Matei Zaharia	6dbd2208ff	Merge pull request #34 from kayousterhout/rename Renamed StandaloneX to CoarseGrainedX. (as suggested by @rxin here https://github.com/apache/incubator-spark/pull/14) The previous names were confusing because the components weren't just used in Standalone mode. The scheduler used for Standalone mode is called SparkDeploySchedulerBackend, so referring to the base class as StandaloneSchedulerBackend was misleading.	2013-10-15 19:02:57 -07:00
Harvey Feng	c4c76e37a7	Fix line length > 100 chars in SparkHadoopWriter	2013-10-15 18:35:59 -07:00
Harvey Feng	5b8083fee5	Make TaskContext's stageId publicly accessible.	2013-10-15 18:06:37 -07:00
Kay Ousterhout	f95a2be045	Fixed build error after merging in master	2013-10-15 14:51:37 -07:00
Kay Ousterhout	acc7638f7c	Merge remote branch 'upstream/master' into rename	2013-10-15 14:43:56 -07:00
Kay Ousterhout	707ad8cc4f	Unified daemon thread pools	2013-10-15 14:23:43 -07:00
Reynold Xin	f41feb7b33	Bump up logging level to warning for failed tasks.	2013-10-14 23:35:32 -07:00
Reynold Xin	9cd8786e4a	Merge branch 'master' of github.com:apache/incubator-spark into kill Conflicts: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala	2013-10-14 21:51:30 -07:00
Reynold Xin	3b11f43e36	Merge pull request #57 from aarondav/bid Refactor BlockId into an actual type Converts all of our BlockId strings into actual BlockId types. Here are some advantages of doing this now: + Type safety + Code clarity - it's now obvious what the key of a shuffle or rdd block is, for instance. Additionally, appearing in tuple/map type signatures is a big readability bonus. A Seq[(String, BlockStatus)] is not very clear. Further, we can now use more Scala features, like matching on BlockId types. + Explicit usage - we can now formally tell where various BlockIds are being used (without doing string searches); this makes updating current BlockIds a much clearer process, and compiler-supported. (I'm looking at you, shuffle file consolidation.) + It will only get harder to make this change as time goes on. Downside is, of course, that this is a very invasive change touching a lot of different files, which will inevitably lead to merge conflicts for many.	2013-10-14 14:20:01 -07:00
Aaron Davidson	4a45019fb0	Address Matei's comments	2013-10-14 00:24:17 -07:00
Aaron Davidson	da896115ec	Change BlockId filename to name + rest of Patrick's comments	2013-10-13 11:15:02 -07:00
Aaron Davidson	d60352283c	Add unit test and address rest of Reynold's comments	2013-10-12 22:45:15 -07:00
Aaron Davidson	a395911138	Refactor BlockId into an actual type This is an unfortunately invasive change which converts all of our BlockId strings into actual BlockId types. Here are some advantages of doing this now: + Type safety + Code clarity - it's now obvious what the key of a shuffle or rdd block is, for instance. Additionally, appearing in tuple/map type signatures is a big readability bonus. A Seq[(String, BlockStatus)] is not very clear. Further, we can now use more Scala features, like matching on BlockId types. + Explicit usage - we can now formally tell where various BlockIds are being used (without doing string searches); this makes updating current BlockIds a much clearer process, and compiler-supported. (I'm looking at you, shuffle file consolidation.) + It will only get harder to make this change as time goes on. Since this touches a lot of files, it'd be best to either get this patch in quickly or throw it on the ground to avoid too many secondary merge conflicts.	2013-10-12 22:44:57 -07:00
Reynold Xin	99796904ae	Merge pull request #52 from harveyfeng/hadoop-closure Add an optional closure parameter to HadoopRDD instantiation to use when creating local JobConfs. Having HadoopRDD accept this optional closure eliminates the need for the HadoopFileRDD added earlier. It makes the HadoopRDD more general, in that the caller can specify any JobConf initialization flow.	2013-10-12 21:23:26 -07:00
Harvey Feng	6c32aab87d	Remove the new HadoopRDD constructor from SparkContext API, plus some minor style changes.	2013-10-12 21:02:08 -07:00
Reynold Xin	88866ea9c9	Fixed PairRDDFunctionsSuite after removing InterruptibleRDD.	2013-10-12 20:05:23 -07:00
Reynold Xin	6b288b75d4	Job cancellation: address Matei's code review feedback.	2013-10-12 15:53:31 -07:00
Reynold Xin	ab0940f0c2	Job cancellation: addressed code review feedback round 2 from Kay.	2013-10-11 18:15:04 -07:00
Reynold Xin	97ffebbe87	Fixed dagscheduler suite because of a logging message change.	2013-10-11 16:18:22 -07:00
Reynold Xin	a61cf40ab9	Job cancellation: addressed code review feedback from Kay.	2013-10-11 15:58:14 -07:00
Matei Zaharia	fb25f32300	Merge pull request #53 from witgo/master Add a zookeeper compile dependency to fix build in maven Add a zookeeper compile dependency to fix build in maven	2013-10-11 15:44:43 -07:00
Matei Zaharia	d6ead47809	Merge pull request #32 from mridulm/master Address review comments, move to incubator spark Also includes a small fix to speculative execution. <edit> Continued from https://github.com/mesos/spark/pull/914 </edit>	2013-10-11 15:43:01 -07:00
Reynold Xin	e2047d3927	Making takeAsync and collectAsync deterministic.	2013-10-11 13:04:45 -07:00
Reynold Xin	09f7609254	Properly handle interrupted exception in FutureAction.	2013-10-11 11:20:15 -07:00
LiGuoqiang	fc60c412ab	Add a zookeeper compile dependency to fix build in maven	2013-10-11 16:31:47 +08:00
Reynold Xin	42fb1df694	Merge branch 'master' of github.com:apache/incubator-spark into kill Conflicts: core/src/main/scala/org/apache/spark/rdd/CoGroupedRDD.scala	2013-10-10 23:48:05 -07:00
Reynold Xin	d9e724e756	Fixed the broken local scheduler test.	2013-10-10 23:08:13 -07:00
Reynold Xin	37397b73ba	Added comprehensive tests for job cancellation in a variety of environments (local vs cluster, fifo vs fair).	2013-10-10 22:57:43 -07:00
Reynold Xin	80cdbf4f49	Switched to use daemon thread in executor and fixed a bug in job cancellation for fair scheduler.	2013-10-10 22:40:48 -07:00
Matei Zaharia	8f11c36fe1	Merge remote-tracking branch 'tgravescs/sparkYarnDistCache' Closes #11 Conflicts: docs/running-on-yarn.md yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala	2013-10-10 19:34:33 -07:00
Reynold Xin	058508b625	Changed the name of the local cluster executor from local to localhost.	2013-10-10 19:24:00 -07:00
Reynold Xin	ec2e2ed1e1	Use the same Executor in LocalScheduler as in ClusterScheduler.	2013-10-10 18:55:25 -07:00
Matei Zaharia	c71499b779	Merge pull request #19 from aarondav/master-zk Standalone Scheduler fault tolerance using ZooKeeper This patch implements full distributed fault tolerance for standalone scheduler Masters. There is only one master Leader at a time, which is actively serving scheduling requests. If this Leader crashes, another master will eventually be elected, reconstruct the state from the first Master, and continue serving scheduling requests. Leader election is performed using the ZooKeeper leader election pattern. We try to minimize the use of ZooKeeper and the assumptions about ZooKeeper's behavior, so there is a layer of retries and session monitoring on top of the ZooKeeper client. Master failover follows directly from the single-node Master recovery via the file system (patch `d5a96fe`), save that the Master state is stored in ZooKeeper instead. Configuration: By default, no recovery mechanism is enabled (spark.deploy.recoveryMode = NONE). By setting spark.deploy.recoveryMode to ZOOKEEPER and setting spark.deploy.zookeeper.url to an appropriate ZooKeeper URL, ZooKeeper recovery mode is enabled. By setting spark.deploy.recoveryMode to FILESYSTEM and setting spark.deploy.recoveryDirectory to an appropriate directory accessible by the Master, we will keep the behavior of from `d5a96fe`. Additionally, places where a Master could be specificied by a spark:// url can now take comma-delimited lists to specify backup masters. Note that this is only used for registration of NEW Workers and application Clients. Once a Worker or Client has registered with the Master Leader, it is "in the system" and will never need to register again.	2013-10-10 17:16:42 -07:00
Harvey Feng	5a99e67894	Add an optional closure parameter to HadoopRDD instantiation to used when creating any local JobConfs.	2013-10-10 16:35:52 -07:00
Reynold Xin	357733d292	Rename kill -> cancel in user facing API / documentation.	2013-10-10 13:27:38 -07:00
Matei Zaharia	001d13f7b9	Merge branch 'master' into fast-map Conflicts: core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala	2013-10-10 13:26:43 -07:00
Reynold Xin	ddf64f019f	Support job cancellation in multi-pool scheduler.	2013-10-10 13:20:27 -07:00
Reynold Xin	3bd2890d2b	Fixed the deadlock situation in multi-job actions and added more unit tests.	2013-10-10 12:07:09 -07:00
Prashant Sharma	bfbd7e5d9f	Fixed some scala warnings in core.	2013-10-10 15:22:31 +05:30
Prashant Sharma	34da58ae50	Changed message-frame-size to maximum-frame-size as property. Removed a test accidentally added during merge.	2013-10-10 15:13:44 +05:30
Aaron Davidson	42d8b8efe6	Address Matei's comments on documentation Updates to the documentation and changing some logError()s to logWarning()s.	2013-10-10 00:33:47 -07:00
Reynold Xin	0353f74a9a	Put the job cancellation handling into the dagscheduler's main event loop.	2013-10-10 00:28:00 -07:00
Reynold Xin	dbae7795ba	Merge branch 'master' of github.com:apache/incubator-spark into kill Conflicts: core/src/main/scala/org/apache/spark/CacheManager.scala core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala core/src/main/scala/org/apache/spark/scheduler/DAGSchedulerSource.scala	2013-10-09 22:57:35 -07:00
Reynold Xin	53895f9cde	Implemented FutureAction, FutureJob, CancellablePromise. Implemented more unit tests for async actions.	2013-10-09 22:43:06 -07:00
Prashant Sharma	026ab75661	Merge branch 'master' of github.com:apache/incubator-spark into scala-2.10	2013-10-10 09:42:55 +05:30
Prashant Sharma	26860639c5	Merge branch 'scala-2.10' of github.com:ScrapCodes/spark into scala-2.10 Conflicts: core/src/main/scala/org/apache/spark/scheduler/cluster/ClusterTaskSetManager.scala project/SparkBuild.scala	2013-10-10 09:42:23 +05:30
Reynold Xin	320418f7c8	Merge pull request #49 from mateiz/kryo-fix-2 Fix Chill serialization of Range objects It used to write out each element one by one, creating very large objects.	2013-10-09 16:55:30 -07:00
Reynold Xin	215238cb39	Merge pull request #50 from kayousterhout/SPARK-908 Fix race condition in SparkListenerSuite (fixes SPARK-908).	2013-10-09 16:49:44 -07:00
Matei Zaharia	c84c205289	Fix Chill serialization of Range objects, which used to write out each element, and register user and Spark classes before Chill's serializers to let them override Chill's behavior in general.	2013-10-09 16:23:40 -07:00
Kay Ousterhout	36966f65df	Style fixes	2013-10-09 15:36:34 -07:00
Kay Ousterhout	3f7e9b265c	Fixed comment to use javadoc style	2013-10-09 15:23:04 -07:00
Kay Ousterhout	a34a4e8174	Fix race condition in SparkListenerSuite (fixes SPARK-908).	2013-10-09 15:07:53 -07:00
Patrick Wendell	bd3bcc5f8e	Use standard abbreviations in metrics labels	2013-10-09 11:16:24 -07:00
Patrick Wendell	19d445d37c	Merge pull request #22 from GraceH/metrics-naming SPARK-900 Use coarser grained naming for metrics see SPARK-900 Use coarser grained naming for metrics. Now the new metric name is formatted as {XXX.YYY.ZZZ.COUNTER_UNIT}, XXX.YYY.ZZZ represents the group name, which can group several metrics under the same Ganglia view.	2013-10-09 11:08:34 -07:00
Matei Zaharia	12d593129d	Create fewer function objects in uses of AppendOnlyMap.changeValue	2013-10-08 23:16:51 -07:00
Matei Zaharia	0b35051f19	Address some comments on code clarity	2013-10-08 23:16:17 -07:00
Matei Zaharia	4acbc5afdd	Moved files that were in the wrong directory after package rename	2013-10-08 23:16:17 -07:00
Matei Zaharia	0e40cfabf8	Fix some review comments	2013-10-08 23:16:16 -07:00
Matei Zaharia	b535db7d89	Added a fast and low-memory append-only map implementation for cogroup and parallel reduce operations	2013-10-08 23:14:38 -07:00
Reynold Xin	e67d5b962a	Merge pull request #43 from mateiz/kryo-fix Don't allocate Kryo buffers unless needed I noticed that the Kryo serializer could be slower than the Java one by 2-3x on small shuffles because it spend a lot of time initializing Kryo Input and Output objects. This is because our default buffer size for them is very large. Since the serializer is often used on streams, I made the initialization lazy for that, and used a smaller buffer (auto-managed by Kryo) for input.	2013-10-08 22:57:38 -07:00
Grace Huang	f7628e4033	remove those futile suffixes like number/count	2013-10-09 08:36:41 +08:00
Aaron Davidson	749233b869	Revert change to spark-class Also adds comment about how to configure for FaultToleranceTest.	2013-10-08 11:41:52 -07:00
Grace Huang	22bed59d2d	create metrics name manually.	2013-10-08 18:01:11 +08:00
Grace Huang	188abbf8f1	Revert "SPARK-900 Use coarser grained naming for metrics" This reverts commit `4b68be5f3c`.	2013-10-08 17:45:14 +08:00
Grace Huang	a2af6b543a	Revert "remedy the line-wrap while exceeding 100 chars" This reverts commit `892fb8ffa8`.	2013-10-08 17:44:56 +08:00
Prashant Sharma	7be75682b9	Merge branch 'master' into wip-merge-master Conflicts: bagel/pom.xml core/pom.xml core/src/test/scala/org/apache/spark/ui/UISuite.scala examples/pom.xml mllib/pom.xml pom.xml project/SparkBuild.scala repl/pom.xml streaming/pom.xml tools/pom.xml In scala 2.10, a shorter representation is used for naming artifacts so changed to shorter scala version for artifacts and made it a property in pom.	2013-10-08 11:29:40 +05:30
Patrick Wendell	9e9e9e1b42	Making the timing block more narrow for the sync	2013-10-07 21:28:12 -07:00
Patrick Wendell	8b377718b8	Responses to review	2013-10-07 20:03:35 -07:00
Matei Zaharia	a8725bf8f8	Don't allocate Kryo buffers unless needed	2013-10-07 19:16:35 -07:00
Patrick Wendell	391133f66a	Fix inconsistent and incorrect log messages in shuffle read path	2013-10-07 17:24:18 -07:00
Patrick Wendell	b08306c5cf	Minor cleanup	2013-10-07 16:30:25 -07:00
Patrick Wendell	524d01ea31	Perf benchmark	2013-10-07 15:15:42 -07:00
Patrick Wendell	d15acd6457	Trying new approach with writes	2013-10-07 15:15:42 -07:00
Patrick Wendell	a224c8c9b8	Adding option to force sync to the filesystem	2013-10-07 15:15:42 -07:00
Patrick Wendell	3478ca6762	Track and report write throughput for shuffle tasks.	2013-10-07 15:15:41 -07:00
Reynold Xin	213b70a2db	Merge pull request #31 from sundeepn/branch-0.8 Resolving package conflicts with hadoop 0.23.9 Hadoop 0.23.9 is having a package conflict with easymock's dependencies. (cherry picked from commit `023e3fdf00`) Signed-off-by: Reynold Xin <rxin@apache.org>	2013-10-07 10:54:22 -07:00
Kay Ousterhout	fdc52b2f8b	Added back fully qualified class name	2013-10-06 18:45:43 -07:00
Aaron Davidson	718e8c2052	Change url format to spark://host1:port1,host2:port2 This replaces the format of spark://host1:port1,spark://host2:port2 and is more consistent with ZooKeeper's zk:// urls.	2013-10-06 00:02:08 -07:00
Aaron Davidson	e1190229e1	Add end-to-end test for standalone scheduler fault tolerance Docker files drawn mostly from Matt Masse. Some updates from Andre Schumacher.	2013-10-05 23:20:31 -07:00
Patrick Wendell	aa9fb84994	Merging build changes in from 0.8	2013-10-05 22:07:00 -07:00
Matei Zaharia	4a25b116d4	Merge pull request #20 from harveyfeng/hadoop-config-cache Allow users to pass broadcasted Configurations and cache InputFormats across Hadoop file reads. Note: originally from https://github.com/mesos/spark/pull/942 Currently motivated by Shark queries on Hive-partitioned tables, where there's a JobConf broadcast for every Hive-partition (i.e., every subdirectory read). The only thing different about those JobConfs is the input path - the Hadoop Configuration that the JobConfs are constructed from remain the same. This PR only modifies the old Hadoop API RDDs, but similar additions to the new API might reduce computation latencies a little bit for high-frequency FileInputDStreams (which only uses the new API right now). As a small bonus, added InputFormats caching, to avoid reflection calls for every RDD#compute(). Few other notes: Added a general soft-reference hashmap in SparkHadoopUtil because I wanted to avoid adding another class to SparkEnv. SparkContext default hadoopConfiguration isn't cached. There's no equals() method for Configuration, so there isn't a good way to determine when configuration properties have changed.	2013-10-05 19:28:55 -07:00
Harvey Feng	6a2bbec5e3	Some comments regarding JobConf and InputFormat caching for HadoopRDDs.	2013-10-05 17:53:58 -07:00
Harvey Feng	96929f28bb	Make HadoopRDD object Spark private.	2013-10-05 17:14:19 -07:00
Harvey Feng	b5e93c1227	Fix API changes; lines > 100 chars.	2013-10-05 16:57:08 -07:00
Aaron Davidson	0f070279e7	Address Matei's comments	2013-10-05 15:15:29 -07:00
Martin Weindel	e09f4a9601	fixed some warnings	2013-10-05 23:08:23 +02:00
Matei Zaharia	100222b048	Merge pull request #27 from davidmccauley/master SPARK-920/921 - JSON endpoint updates 920 - Removal of duplicate scheme part of Spark URI, it was appearing as spark://spark//host:port in the JSON field. JSON now delivered as: url:spark://127.0.0.1:7077 921 - Adding the URL of the Main Application UI will allow custom interfaces (that use the JSON output) to redirect from the standalone UI.	2013-10-05 13:38:59 -07:00
Mridul Muralidharan	b5025d90bb	- Allow for finer control of cleaner - Address review comments, move to incubator spark - Also includes a change to speculation - including preventing exceptions in rare cases.	2013-10-06 00:35:51 +05:30
Prashant Sharma	3e41495288	Fixed tests, changed property akka.remote.netty.x to akka.remote.netty.tcp.x	2013-10-05 16:39:25 +05:30
Prashant Sharma	c810ee0690	Merge branch 'master' into scala-2.10 Conflicts: core/src/test/scala/org/apache/spark/DistributedSuite.scala project/SparkBuild.scala	2013-10-05 15:52:57 +05:30
Aaron Davidson	db6f154940	Fix race conditions during recovery One major change was the use of messages instead of raw functions as the parameter of Akka scheduled timers. Since messages are serialized, unlike raw functions, the behavior is easier to think about and doesn't cause race conditions when exceptions are thrown. Another change is to avoid using global pointers that might change without a lock.	2013-10-04 19:54:33 -07:00
Kay Ousterhout	7b5ae23a37	Renamed StandaloneX to CoarseGrainedX. The previous names were confusing because the components weren't just used in Standalone mode -- in fact, the scheduler used for Standalone mode is called SparkDeploySchedulerBackend. So, the previous names were misleading.	2013-10-04 13:56:43 -07:00
Andre Schumacher	c84946fe21	Fixing SPARK-602: PythonPartitioner Currently PythonPartitioner determines partition ID by hashing a byte-array representation of PySpark's key. This PR lets PythonPartitioner use the actual partition ID, which is required e.g. for sorting via PySpark.	2013-10-04 11:56:47 -07:00
Reynold Xin	d29e8035a0	Added countAsync and various unit tests for async actions.	2013-10-03 15:13:44 -07:00
tgravescs	0fff4ee852	Adding in the --addJars option to make SparkContext.addJar work on yarn and cleanup the classpaths	2013-10-03 11:52:16 -05:00
Reynold Xin	802bfb870d	- Created AsyncRDDActions. - Make FutureJob a Scala Future instead of Java Future.	2013-10-03 01:22:28 -07:00
Reynold Xin	e8e917f209	Merge branch 'master' into kill Conflicts: core/src/main/scala/org/apache/spark/TaskEndReason.scala core/src/main/scala/org/apache/spark/executor/Executor.scala core/src/main/scala/org/apache/spark/scheduler/cluster/ClusterTaskSetManager.scala core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala	2013-10-02 23:01:34 -07:00
Reynold Xin	1c48ba0d9f	Merge remote-tracking branch 'origin' into kill Conflicts: core/src/main/scala/org/apache/spark/scheduler/TaskScheduler.scala core/src/main/scala/org/apache/spark/scheduler/cluster/ClusterTaskSetManager.scala core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala	2013-10-02 16:40:44 -07:00
David McCauley	1577b373a9	SPARK-921 - Add Application UI URL to ApplicationInfo Json output	2013-10-02 15:03:41 +01:00
David McCauley	351da54676	SPARK-920 - JSON endpoint URI scheme part (spark://) duplicated	2013-10-02 13:23:38 +01:00
Prashant Sharma	5829692885	Merge branch 'master' into scala-2.10 Conflicts: core/src/main/scala/org/apache/spark/ui/jobs/JobProgressUI.scala docs/_config.yml project/SparkBuild.scala repl/src/main/scala/org/apache/spark/repl/SparkILoop.scala	2013-10-01 11:57:24 +05:30
Kay Ousterhout	0dcad2edcb	Added additional unit test for repeated task failures	2013-09-30 23:26:15 -07:00
Kay Ousterhout	dea4677c88	Fixed compilation errors and broken test.	2013-09-30 22:07:01 -07:00
Kay Ousterhout	8deda427bc	Merge remote-tracking branch 'upstream/master' into results_through-bm Conflicts: core/src/main/scala/org/apache/spark/scheduler/cluster/ClusterScheduler.scala core/src/main/scala/org/apache/spark/scheduler/cluster/ClusterTaskSetManager.scala core/src/main/scala/org/apache/spark/scheduler/local/LocalTaskSetManager.scala	2013-09-30 10:16:58 -07:00
Kay Ousterhout	58b764b7c6	Addressed Matei's code review comments	2013-09-30 10:11:59 -07:00
Prashant Sharma	9865fd6aa0	Fixed non termination of Executor backend, when sc.stop is not called.	2013-09-30 18:09:12 +05:30

... 4 5 6 7 8 ...

2714 commits