ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Patrick Wendell	0ab505a29e	Merge pull request #395 from hsaputra/remove_simpleredundantreturn_scala Remove simple redundant return statements for Scala methods/functions Remove simple redundant return statements for Scala methods/functions: -) Only change simple return statements at the end of method -) Ignore the complex if-else check -) Ignore the ones inside synchronized -) Add small changes to making var to val if possible and remove () for simple get This hopefully makes the review simpler =) Pass compile and tests.	2014-01-12 21:31:04 -08:00
Patrick Wendell	405bfe86ef	Merge pull request #394 from tdas/error-handling Better error handling in Spark Streaming and more API cleanup Earlier, errors in jobs generated by Spark Streaming (or in the generation of jobs) could not be caught from the main driver thread (i.e. the thread that called StreamingContext.start()) as they would be thrown in different threads. With this change, after `ssc.start`, one can call `ssc.awaitTermination()` which will block until the ssc is closed, or there is an exception. This makes it easier to debug. This change also adds ssc.stop(<stop-spark-context>) where you can stop StreamingContext without stopping the SparkContext. Also fixes the bug that came up with PRs #393 and #381. MetadataCleaner default value has been changed from 3500 to -1 for normal SparkContext and 3600 when creating a StreamingContext. Also, updated StreamingListenerBus with changes similar to SparkListenerBus in #392. And changed a lot of protected[streaming] to private[streaming].	2014-01-12 20:04:21 -08:00
Patrick Wendell	28a6b0cdbc	Merge pull request #398 from pwendell/streaming-api Rename DStream.foreach to DStream.foreachRDD `foreachRDD` makes it clear that the granularity of this operator is per-RDD. As it stands, `foreach` is inconsistent with `map`, `filter`, and the other DStream operators which get pushed down to individual records within each RDD.	2014-01-12 19:49:36 -08:00
Henry Saputra	5a8abfb70e	Address code review concerns and comments.	2014-01-12 19:15:09 -08:00
Patrick Wendell	e6e20ceee0	Adding deprecated versions of old code	2014-01-12 18:54:03 -08:00
Tathagata Das	aa2c993858	Merge remote-tracking branch 'apache/master' into error-handling	2014-01-12 17:37:46 -08:00
Tathagata Das	c7fabb745b	Changed StreamingContext.stopForWait to awaitTermination.	2014-01-12 17:21:13 -08:00
Patrick Wendell	f4d77f8cb8	Rename DStream.foreach to DStream.foreachRDD `foreachRDD` makes it clear that the granularity of this operator is per-RDD. As it stands, `foreach` is inconsistent with `map`, `filter`, and the other DStream operators which get pushed down to individual records within each RDD.	2014-01-12 17:21:00 -08:00
Patrick Wendell	074f50232f	Merge pull request #396 from pwendell/executor-env Setting load defaults to true in executor This preserves the behavior in earlier releases. If properties are set for the executors via `spark-env.sh` on the slaves, then they should take precedence over spark defaults. This is useful if system administrators are setting properties for a standalone cluster, such as shuffle locations. /cc @andrewor14 who initially reported this issue.	2014-01-12 17:01:13 -08:00
Reynold Xin	82e2b92c6d	Merge pull request #392 from rxin/listenerbus Stop SparkListenerBus daemon thread when DAGScheduler is stopped. Otherwise this leads to hundreds of SparkListenerBus daemon threads in our unit tests (and is also problematic if a user application launches multiple SparkContexts).	2014-01-12 16:55:11 -08:00
Tathagata Das	7883b8f579	Fixed bugs to ensure better cleanup of JobScheduler, JobGenerator and NetworkInputTracker upon close.	2014-01-12 16:44:07 -08:00
Patrick Wendell	cfb1e6c13c	Setting load defaults to true in executor	2014-01-12 15:35:08 -08:00
Henry Saputra	f1c5eca494	Fix accidental comment modification.	2014-01-12 10:40:21 -08:00
Henry Saputra	91a563608e	Merge branch 'master' into remove_simpleredundantreturn_scala	2014-01-12 10:34:13 -08:00
Henry Saputra	93a65e5fde	Remove simple redundant return statement for Scala methods/functions: -) Only change simple return statements at the end of method -) Ignore the complex if-else check -) Ignore the ones inside synchronized	2014-01-12 10:30:04 -08:00
Tathagata Das	c5921e5c61	Fixed bugs.	2014-01-12 01:12:08 -08:00
Tathagata Das	18f4889d96	Merge remote-tracking branch 'apache/master' into error-handling	2014-01-11 23:40:57 -08:00
Tathagata Das	4d9b0ab420	Added waitForStop and stop to JavaStreamingContext.	2014-01-11 23:35:51 -08:00
Tathagata Das	f5108ffc24	Converted JobScheduler to use actors for event handling. Changed protected[streaming] to private[streaming] in StreamingContext and DStream. Added waitForStop to StreamingContext, and StreamingContextSuite.	2014-01-11 23:15:09 -08:00
Reynold Xin	288a878999	Merge pull request #389 from rxin/clone-writables Minor update for clone writables and more documentation.	2014-01-11 21:53:19 -08:00
Reynold Xin	dbc11df411	Merge pull request #388 from pwendell/master Fix UI bug introduced in #244. The 'duration' field was incorrectly renamed to 'task time' in the table that lists stages.	2014-01-11 18:07:13 -08:00
Reynold Xin	362cda18bc	Renamed cloneKeyValues to cloneRecords; updated docs.	2014-01-11 18:01:29 -08:00
Patrick Wendell	409866b351	Merge pull request #393 from pwendell/revert-381 Revert PR 381 This PR missed a bunch of test cases that require "spark.cleaner.ttl". I think it is what is causing test failures on Jenkins right now (though it's a bit hard to tell because the DNS for cs.berkeley.edu is down). I'm submitting this to see if it fixes Jenkins. I did try just patching various tests but it was taking a really long time because there are a bunch of them, so for now I'm just seeing if a revert works.	2014-01-11 17:12:06 -08:00
Patrick Wendell	07b952e1d1	Revert "Fix default TTL for metadata cleaner" This reverts commit `669ba4caa9`.	2014-01-11 16:07:10 -08:00
Patrick Wendell	22d4d62420	Revert "Fix one unit test that was not setting spark.cleaner.ttl" This reverts commit `942c80b34c`.	2014-01-11 16:07:03 -08:00
Reynold Xin	2180c87188	Stop SparkListenerBus daemon thread when DAGScheduler is stopped.	2014-01-11 13:36:37 -08:00
Reynold Xin	6510f04e4d	Merge pull request #387 from jerryshao/conf-fix Fix configure didn't work small problem in ALS	2014-01-11 12:48:26 -08:00
Reynold Xin	b0fbfccadc	Minor update for clone writables and more documentation.	2014-01-11 12:35:10 -08:00
Reynold Xin	ee6e7f9b8c	Merge pull request #359 from ScrapCodes/clone-writables We clone hadoop key and values by default and reuse objects if asked to. We try to clone for most common types of writables and we call WritableUtils.clone otherwise intention is to optimize, for example for NullWritable there is no need and for Long, int and String creating a new object with value set would be faster than doing copy on object hopefully. There is another way to do this PR where we ask for both key and values whether to clone them or not, but could not think of a use case for it except either of them is actually a NullWritable for which I have already worked around. So thought that would be unnecessary.	2014-01-11 12:07:55 -08:00
Patrick Wendell	b313e15616	Fix UI bug introduced in #244 . The 'duration' field was incorrectly renamed to 'task time' in the table that lists stages.	2014-01-11 10:52:57 -08:00
Patrick Wendell	4216178d5e	Merge pull request #373 from jerryshao/kafka-upgrade Upgrade Kafka dependency to 0.8.0 release version	2014-01-11 09:46:48 -08:00
jerryshao	cbfbc01938	Fix configure didn't work small problem in ALS	2014-01-11 16:22:45 +08:00
Reynold Xin	92ad18b00e	Merge pull request #376 from prabeesh/master Change clientId to random clientId The client identifier should be unique across all clients connecting to the same server. A convenience method is provided to generate a random client id that should satisfy this criteria - generateClientId(). Returns a randomly generated client identifier based on the current user's login name and the system time. As the client identifier is used by the server to identify a client when it reconnects, the client must use the same identifier between connections if durable subscriptions are to be used.	2014-01-10 23:25:15 -08:00
Reynold Xin	0b5ce7af17	Merge pull request #386 from pwendell/typo-fix Small typo fix	2014-01-10 23:23:21 -08:00
Matei Zaharia	1d7bef0c91	Merge pull request #381 from mateiz/default-ttl Fix default TTL for metadata cleaner It seems to have been set to 3500 in a previous commit for debugging, but it should be off by default.	2014-01-10 18:53:03 -08:00
Patrick Wendell	44d6a8e3d8	Merge pull request #382 from RongGu/master Fix a type error in comment lines Fix a type error in comment lines	2014-01-10 17:51:50 -08:00
Patrick Wendell	08370a52b8	Small typo fix	2014-01-10 17:47:15 -08:00
Patrick Wendell	88faa30a42	Merge pull request #385 from shivaram/add-i2-instances Add i2 instance types to Spark EC2. Using data from http://aws.amazon.com/amazon-linux-ami/instance-type-matrix/ and http://www.ec2instances.info/	2014-01-10 17:14:22 -08:00
Matei Zaharia	942c80b34c	Fix one unit test that was not setting spark.cleaner.ttl	2014-01-10 16:32:36 -08:00
Patrick Wendell	f26553102c	Merge pull request #383 from tdas/driver-test API for automatic driver recovery for streaming programs and other bug fixes 1. Added Scala and Java API for automatically loading checkpoint if it exists in the provided checkpoint directory. Scala API: `StreamingContext.getOrCreate(<checkpoint dir>, <function to create new StreamingContext>)` returns a StreamingContext Java API: `JavaStreamingContext.getOrCreate(<checkpoint dir>, <factory obj of type JavaStreamingContextFactory>)`, returns a JavaStreamingContext See the RecoverableNetworkWordCount below as an example of how to use it. 2. Refactored streaming.Checkpoint*** code to fix bugs and make the DStream metadata checkpoint writing and reading more robust. Specifically, it fixes and improves the logic behind backing up and writing metadata checkpoint files. Also, it ensures that spark.driver.* and spark.hostPort are cleared from SparkConf before being written to checkpoint. 3. Fixed bug in cleaning up of checkpointed RDDs created by DStream. Specifically, this fix ensures that checkpointed RDD's files are not prematurely cleaned up, thus ensuring reliable recovery. 4. TimeStampedHashMap is upgraded to optionally update the timestamp on map.get(key). This allows clearing of data based on access time (i.e., clear records that were last accessed before a threshold timestamp). 5. Added caching for file modification time in FileInputDStream using the updated TimeStampedHashMap. Without the caching, enumerating the mod times to find new files can take seconds if there are 1000s of files. This cache is automatically cleared. This PR is not entirely final as I may make some minor additions - a Java example, and adding StreamingContext.getOrCreate to unit test. Edit: Java example to be added later, unit test added.	2014-01-10 16:25:44 -08:00
Patrick Wendell	d37408f39c	Merge pull request #377 from andrewor14/master External Sorting for Aggregator and CoGroupedRDDs (Revisited) (This pull request is re-opened from https://github.com/apache/incubator-spark/pull/303, which was closed because Jenkins / github was misbehaving) The target issue for this patch is the out-of-memory exceptions triggered by aggregate operations such as reduce, groupBy, join, and cogroup. The existing AppendOnlyMap used by these operations resides purely in memory, and grows with the size of the input data until the amount of allocated memory is exceeded. Under large workloads, this problem is aggravated by the fact that OOM frequently occurs only after a very long (> 1 hour) map phase, in which case the entire job must be restarted. The solution is to spill the contents of this map to disk once a certain memory threshold is exceeded. This functionality is provided by ExternalAppendOnlyMap, which additionally sorts this buffer before writing it out to disk, and later merges these buffers back in sorted order. Under normal circumstances in which OOM is not triggered, ExternalAppendOnlyMap is simply a wrapper around AppendOnlyMap and incurs little overhead. Only when the memory usage is expected to exceed the given threshold does ExternalAppendOnlyMap spill to disk.	2014-01-10 16:25:01 -08:00
Tathagata Das	4f39e79c23	Merge remote-tracking branch 'apache/master' into driver-test Conflicts: streaming/src/main/scala/org/apache/spark/streaming/DStreamGraph.scala	2014-01-10 15:47:01 -08:00
Andrew Or	2e393cd5fd	Update documentation for externalSorting	2014-01-10 15:45:38 -08:00
Tathagata Das	82f07deeda	Modified streaming.FailureSuite tests to test StreamingContext.getOrCreate.	2014-01-10 15:37:05 -08:00
Reynold Xin	0eaf01c5ed	Merge pull request #369 from pillis/master SPARK-961 Add a Vector.random() method Added method and testcases	2014-01-10 15:32:19 -08:00
Andrew Or	e4c51d2113	Address Patrick's and Reynold's comments Aside from trivial formatting changes, use nulls instead of Options for DiskMapIterator, and add documentation for spark.shuffle.externalSorting and spark.shuffle.memoryFraction. Also, set spark.shuffle.memoryFraction to 0.3, and spark.storage.memoryFraction = 0.6.	2014-01-10 15:09:51 -08:00
RongGu	94776f753f	fix a type error in comment lines	2014-01-11 05:43:56 +08:00
Thomas Graves	7cef8435d7	Merge pull request #371 from tgravescs/yarn_client_addjar_misc_fixes Yarn client addjar and misc fixes Fix the addJar functionality in yarn-client mode, add support for the other options supported in yarn-standalone mode, set the application type on yarn in hadoop 2.X, add documentation, change heartbeat interval to be same code as the yarn-standalone so it doesn't take so long to get containers and exit.	2014-01-10 15:34:15 -06:00
Patrick Wendell	7b58f116e5	Merge pull request #384 from pwendell/debug-logs Make DEBUG-level logs consumable. Removes two things that caused issues with the debug logs: (a) Internal polling in the DAGScheduler was polluting the logs. (b) The Scala REPL logs were really noisy.	2014-01-10 12:47:46 -08:00
Shivaram Venkataraman	7c4e6e1bf1	Add i2 instance types to Spark EC2.	2014-01-10 12:44:55 -08:00


5453 commits