ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Patrick Wendell	e6d4a74d2d	Revert "SPARK-729: Closures not always serialized at capture time" This reverts commit `8ca3b2bc90`.	2014-04-10 02:10:40 -07:00
William Benton	8ca3b2bc90	SPARK-729: Closures not always serialized at capture time [SPARK-729](https://spark-project.atlassian.net/browse/SPARK-729) concerns when free variables in closure arguments to transformations are captured. Currently, it is possible for closures to get the environment in which they are serialized (not the environment in which they are created). There are a few possible approaches to solving this problem and this PR will discuss some of them. The approach I took has the advantage of being simple, obviously correct, and minimally-invasive, but it preserves something that has been bothering me about Spark's closure handling, so I'd like to discuss an alternative and get some feedback on whether or not it is worth pursuing. ## What I did The basic approach I took depends on the work I did for #143, and so this PR is based atop that. Specifically: #143 modifies `ClosureCleaner.clean` to preemptively determine whether or not closures are serializable immediately upon closure cleaning (rather than waiting for a job involving that closure to be scheduled). Thus non-serializable closure exceptions will be triggered by the line defining the closure rather than triggered where the closure is used. Since the easiest way to determine whether or not a closure is serializable is to attempt to serialize it, the code in #143 is creating a serialized closure as part of `ClosureCleaner.clean`. `clean` currently modifies its argument, but the method in `SparkContext` that wraps it returns a value (a reference to the modified-in-place argument). This branch modifies `ClosureCleaner.clean` so that it returns a value: if it is cleaning a serializable closure, it returns the result of deserializing its serialized argument; therefore it is returning a closure with an environment captured at cleaning time. `SparkContext.clean` then returns the result of `ClosureCleaner.clean`, rather than a reference to its modified-in-place argument. 
I've added tests for this behavior (777a1bc). The pull request as it stands, given the changes in #143, is nearly trivial. There is some overhead from deserializing the closure, but it is minimal and the benefit of obvious operational correctness (vs. a more sophisticated but harder-to-validate transformation in `ClosureCleaner`) seems pretty important. I think this is a fine way to solve this problem, but it's not perfect. ## What we might want to do The thing that has been bothering me about Spark's handling of closures is that it seems like we should be able to statically ensure that cleaning and serialization happen exactly once for a given closure. If we serialize a closure in order to determine whether or not it is serializable, we should be able to hang on to the generated byte buffer and use it instead of re-serializing the closure later. By replacing closures with instances of a sum type that encodes whether or not a closure has been cleaned or serialized, we could handle clean, to-be-cleaned, and serialized closures separately with case matches. Here's a somewhat-concrete sketch (taken from my git stash) of what this might look like:
```scala
package org.apache.spark.util

import java.nio.ByteBuffer
import scala.reflect.ClassManifest

sealed abstract class ClosureBox[T] { def func: T }
final case class RawClosure[T](func: T) extends ClosureBox[T] {}
final case class CleanedClosure[T](func: T) extends ClosureBox[T] {}
final case class SerializedClosure[T](func: T, bytebuf: ByteBuffer) extends ClosureBox[T] {}

object ClosureBoxImplicits {
  implicit def closureBoxFromFunc[T <: AnyRef](fun: T) = new RawClosure[T](fun)
}
```
With these types declared, we'd be able to change `ClosureCleaner.clean` to take a `ClosureBox[T=>U]` (possibly generated by implicit conversion) and return a `ClosureBox[T=>U]` (either a `CleanedClosure[T=>U]` or a `SerializedClosure[T=>U]`, depending on whether or not serializability-checking was enabled) instead of a `T=>U`. 
A case match could thus short-circuit cleaning or serializing closures that had already been cleaned or serialized (both in `ClosureCleaner` and in the closure serializer). Cleaned-and-serialized closures would be represented by a boxed tuple of the original closure and a serialized copy (complete with an environment quiesced at transformation time). Additional implicit conversions could convert from `ClosureBox` instances to the underlying function type where appropriate. Tracking this sort of state in the type system seems like the right thing to do to me. ### Why we might not want to do that _It's pretty invasive._ Every function type used by every `RDD` subclass would have to change to reflect that they expected a `ClosureBox[T=>U]` instead of a `T=>U`. This obscures what's going on and is not a little ugly. Although I really like the idea of using the type system to enforce the clean-or-serialize once discipline, it might not be worth adding another layer of types (even if we could hide some of the extra boilerplate with judicious application of implicit conversions). _It statically guarantees a property whose absence is unlikely to cause any serious problems as it stands._ It appears that all closures are currently dynamically cleaned once and it's not obvious that repeated closure-cleaning is likely to be a problem in the future. Furthermore, serializing closures is relatively cheap, so doing it once to check for serialization and once again to actually ship them across the wire doesn't seem like a big deal. Taken together, these seem like a high price to pay for statically guaranteeing that closures are operated upon only once. ## Other possibilities I felt like the serialize-and-deserialize approach was best due to its obvious simplicity. But it would be possible to do a more sophisticated transformation within `ClosureCleaner.clean`. 
It might also be possible for `clean` to modify its argument in a way so that whether or not a given closure had been cleaned would be apparent upon inspection; this would buy us some of the operational benefits of the `ClosureBox` approach but not the static cleanliness. I'm interested in any feedback or discussion on whether or not the problems with the type-based approach indeed outweigh the advantage, as well as of approaches to this issue and to closure handling in general. Author: William Benton <willb@redhat.com> Closes #189 from willb/spark-729 and squashes the following commits: f4cafa0 [William Benton] Stylistic changes and cleanups b3d9c86 [William Benton] Fixed style issues in tests 9b56ce0 [William Benton] Added array-element capture test 97e9d91 [William Benton] Split closure-serializability failure tests 12ef6e3 [William Benton] Skip proactive closure capture for runJob 8ee3ee7 [William Benton] Predictable closure environment capture 12c63a7 [William Benton] Added tests for variable capture in closures d6e8dd6 [William Benton] Don't check serializability of DStream transforms. 4ecf841 [William Benton] Make proactive serializability checking optional. d8df3db [William Benton] Adds proactive closure-serializablilty checking 21b4b06 [William Benton] Test cases for SPARK-897. d5947b3 [William Benton] Ensure assertions in Graph.apply are asserted.	2014-04-09 18:56:27 -07:00
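The serialize-then-deserialize idea described in the SPARK-729 message above can be illustrated outside Spark. The following is a minimal sketch, assuming plain Java serialization; `AddOffset`, `CaptureSketch`, `snapshot`, and `demo` are hypothetical names for illustration and are not part of `ClosureCleaner` or any Spark API. It shows that a copy obtained by a serialization round trip keeps the environment as it was at capture time, even if the original environment is mutated afterwards.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for a closure: a serializable function that reads a
// mutable array from its environment.
final class AddOffset(val offsets: Array[Int]) extends (Int => Int) with Serializable {
  def apply(x: Int): Int = x + offsets(0)
}

object CaptureSketch {
  // Round-trips a value through Java serialization. The returned copy owns a
  // private copy of everything reachable from `value` at the time of the call.
  def snapshot[T <: AnyRef](value: T): T = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(value) finally out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  // Returns (result from the original function, result from the snapshot)
  // after the shared array has been mutated.
  def demo(): (Int, Int) = {
    val offsets = Array(1)
    val original = new AddOffset(offsets)
    val captured = snapshot(original) // environment is frozen here
    offsets(0) = 100                  // later mutation of the environment
    (original(1), captured(1))
  }
}
```

Here `original` still sees the mutated array, while `captured` keeps the value from the moment `snapshot` was called, which is the behaviour the message calls capturing the environment at cleaning time.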
Kan Zhang	eb5f2b6423	SPARK-1407 drain event queue before stopping event logger Author: Kan Zhang <kzhang@apache.org> Closes #366 from kanzhang/SPARK-1407 and squashes the following commits: cd0629f [Kan Zhang] code refactoring and adding test b073ee6 [Kan Zhang] SPARK-1407 drain event queue before stopping event logger	2014-04-09 15:25:29 -07:00
Patrick Wendell	87bd1f9ef7	SPARK-1093: Annotate developer and experimental API's This patch marks some existing classes as private[spark] and adds two types of API annotations: - `EXPERIMENTAL API` = experimental user-facing module - `DEVELOPER API - UNSTABLE` = developer-facing API that might change There is some discussion of the different mechanisms for doing this here: https://issues.apache.org/jira/browse/SPARK-1081 I was pretty aggressive with marking things private. Keep in mind that if we want to open something up in the future we can, but we can never reduce visibility. A few notes here: - In the past we've been inconsistent with the visibility of the X-RDD classes. This patch marks them private whenever there is an existing function in RDD that can directly create them (e.g. CoalescedRDD and rdd.coalesce()). One trade-off here is users can't subclass them. - Noted that compression and serialization formats don't have to be wire compatible across versions. - Compression codecs and serialization formats are semi-private as users typically don't instantiate them directly. 
- Metrics sources are made private - user only interacts with them through Spark's reflection Author: Patrick Wendell <pwendell@gmail.com> Author: Andrew Or <andrewor14@gmail.com> Closes #274 from pwendell/private-apis and squashes the following commits: 44179e4 [Patrick Wendell] Merge remote-tracking branch 'apache-github/master' into private-apis 042c803 [Patrick Wendell] spark.annotations -> spark.annotation bfe7b52 [Patrick Wendell] Adding experimental for approximate counts 8d0c873 [Patrick Wendell] Warning in SparkEnv 99b223a [Patrick Wendell] Cleaning up annotations e849f64 [Patrick Wendell] Merge pull request #2 from andrewor14/annotations 982a473 [Andrew Or] Generalize jQuery matching for non Spark-core API docs a01c076 [Patrick Wendell] Merge pull request #1 from andrewor14/annotations c1bcb41 [Andrew Or] DeveloperAPI -> DeveloperApi 0d48908 [Andrew Or] Comments and new lines (minor) f3954e0 [Andrew Or] Add identifier tags in comments to work around scaladocs bug 99192ef [Andrew Or] Dynamically add badges based on annotations 824011b [Andrew Or] Add support for injecting arbitrary JavaScript to API docs 037755c [Patrick Wendell] Some changes after working with andrew or f7d124f [Patrick Wendell] Small fixes c318b24 [Patrick Wendell] Use CSS styles e4c76b9 [Patrick Wendell] Logging f390b13 [Patrick Wendell] Better visibility for workaround constructors d6b0afd [Patrick Wendell] Small chang to existing constructor 403ba52 [Patrick Wendell] Style fix 870a7ba [Patrick Wendell] Work around for SI-8479 7fb13b2 [Patrick Wendell] Changes to UnionRDD and EmptyRDD 4a9e90c [Patrick Wendell] EXPERIMENTAL API --> EXPERIMENTAL c581dce [Patrick Wendell] Changes after building against Shark. 
8452309 [Patrick Wendell] Style fixes 1ed27d2 [Patrick Wendell] Formatting and coloring of badges cd7a465 [Patrick Wendell] Code review feedback 2f706f1 [Patrick Wendell] Don't use floats 542a736 [Patrick Wendell] Small fixes cf23ec6 [Patrick Wendell] Marking GraphX as alpha d86818e [Patrick Wendell] Another naming change 5a76ed6 [Patrick Wendell] More visiblity clean-up 42c1f09 [Patrick Wendell] Using better labels 9d48cbf [Patrick Wendell] Initial pass	2014-04-09 01:14:46 -07:00
Holden Karau	fa0524fd02	Spark-939: allow user jars to take precedence over spark jars I still need to do a small bit of re-factoring [mostly the one Java file, which I'll switch back to a Scala file and use in both class loaders], but comments on other things I should do would be great. Author: Holden Karau <holden@pigscanfly.ca> Closes #217 from holdenk/spark-939-allow-user-jars-to-take-precedence-over-spark-jars and squashes the following commits: cf0cac9 [Holden Karau] Fix the executorclassloader 1955232 [Holden Karau] Fix long line in TestUtils 8f89965 [Holden Karau] Fix tests for new class name 7546549 [Holden Karau] CR feedback, merge some of the testutils methods down, rename the classloader 644719f [Holden Karau] User the class generator for the repl class loader tests too f0b7114 [Holden Karau] Fix the core/src/test/scala/org/apache/spark/executor/ExecutorURLClassLoaderSuite.scala tests 204b199 [Holden Karau] Fix the generated classes 9f68f10 [Holden Karau] Start rewriting the ExecutorURLClassLoaderSuite to not use the hard coded classes 858aba2 [Holden Karau] Remove a bunch of test junk 261aaee [Holden Karau] simplify executorurlclassloader a bit 7a7bf5f [Holden Karau] CR feedback d4ae848 [Holden Karau] rewrite component into scala aa95083 [Holden Karau] CR feedback 7752594 [Holden Karau] re-add https comment a0ef85a [Holden Karau] Fix style issues 125ea7f [Holden Karau] Easier to just remove those files, we don't need them bb8d179 [Holden Karau] Fix issues with the repl class loader 241b03d [Holden Karau] fix my rat excludes a343350 [Holden Karau] Update rat-excludes and remove a useless file d90d217 [Holden Karau] Fix fall back with custom class loader and add a test for it 4919bf9 [Holden Karau] Fix parent calling class loader issue 8a67302 [Holden Karau] Test are good 9e2d236 [Holden Karau] It works comrade 691ee00 [Holden Karau] It works ish dc4fe44 [Holden Karau] Does not depend on being in my home directory 47046ff [Holden Karau] Remove bad 
import' 22d83cb [Holden Karau] Add a test suite for the executor url class loader suite 7ef4628 [Holden Karau] Clean up 792d961 [Holden Karau] Almost works 16aecd1 [Holden Karau] Doesn't quite work 8d2241e [Holden Karau] Adda FakeClass for testing ClassLoader precedence options 648b559 [Holden Karau] Both class loaders compile. Now for testing e1d9f71 [Holden Karau] One loader workers.	2014-04-08 22:30:03 -07:00
Holden Karau	ce8ec54561	Spark 1271: Co-Group and Group-By should pass Iterable[X] Author: Holden Karau <holden@pigscanfly.ca> Closes #242 from holdenk/spark-1320-cogroupandgroupshouldpassiterator and squashes the following commits: f289536 [Holden Karau] Fix bad merge, should have been Iterable rather than Iterator 77048f8 [Holden Karau] Fix merge up to master d3fe909 [Holden Karau] use toSeq instead 7a092a3 [Holden Karau] switch resultitr to resultiterable eb06216 [Holden Karau] maybe I should have had a coffee first. use correct import for guava iterables c5075aa [Holden Karau] If guava 14 had iterables 2d06e10 [Holden Karau] Fix Java 8 cogroup tests for the new API 11e730c [Holden Karau] Fix streaming tests 66b583d [Holden Karau] Fix the core test suite to compile 4ed579b [Holden Karau] Refactor from iterator to iterable d052c07 [Holden Karau] Python tests now pass with iterator pandas 3bcd81d [Holden Karau] Revert "Try and make pickling list iterators work" cd1e81c [Holden Karau] Try and make pickling list iterators work c60233a [Holden Karau] Start investigating moving to iterators for python API like the Java/Scala one. tl;dr: We will have to write our own iterator since the default one doesn't pickle well 88a5cef [Holden Karau] Fix cogroup test in JavaAPISuite for streaming a5ee714 [Holden Karau] oops, was checking wrong iterator e687f21 [Holden Karau] Fix groupbykey test in JavaAPISuite of streaming ec8cc3e [Holden Karau] Fix test issues\! 
4b0eeb9 [Holden Karau] Switch cast in PairDStreamFunctions fa395c9 [Holden Karau] Revert "Add a join based on the problem in SVD" ec99e32 [Holden Karau] Revert "Revert this but for now put things in list pandas" b692868 [Holden Karau] Revert 7e533f7 [Holden Karau] Fix the bug 8a5153a [Holden Karau] Revert me, but we have some stuff to debug b4e86a9 [Holden Karau] Add a join based on the problem in SVD c4510e2 [Holden Karau] Revert this but for now put things in list pandas b4e0b1d [Holden Karau] Fix style issues 71e8b9f [Holden Karau] I really need to stop calling size on iterators, it is the path of sadness. b1ae51a [Holden Karau] Fix some of the types in the streaming JavaAPI suite. Probably still needs more work 37888ec [Holden Karau] core/tests now pass 249abde [Holden Karau] org.apache.spark.rdd.PairRDDFunctionsSuite passes 6698186 [Holden Karau] Revert "I think this might be a bad rabbit hole. Started work to make CoGroupedRDD use iterator and then went crazy" fe992fe [Holden Karau] hmmm try and fix up basic operation suite 172705c [Holden Karau] Fix Java API suite caafa63 [Holden Karau] I think this might be a bad rabbit hole. Started work to make CoGroupedRDD use iterator and then went crazy 88b3329 [Holden Karau] Fix groupbykey to actually give back an iterator 4991af6 [Holden Karau] Fix some tests be50246 [Holden Karau] Calling size on an iterator is not so good if we want to use it after 687ffbc [Holden Karau] This is the it compiles point of replacing Seq with Iterator and JList with JIterator in the groupby and cogroup signatures	2014-04-08 18:15:59 -07:00
Sandeep	12c077d5aa	SPARK-1433: Upgrade Mesos dependency to 0.17.0 Mesos 0.13.0 was released 6 months ago. Upgrade Mesos dependency to 0.17.0 Author: Sandeep <sandeep@techaddict.me> Closes #355 from techaddict/mesos_update and squashes the following commits: f1abeee [Sandeep] SPARK-1433: Upgrade Mesos dependency to 0.17.0 Mesos 0.13.0 was released 6 months ago. Upgrade Mesos dependency to 0.17.0	2014-04-08 16:19:22 -07:00
Kay Ousterhout	fac6085cd7	[SPARK-1397] Notify SparkListeners when stages fail or are cancelled. [I wanted to post this for folks to comment but it depends on (and thus includes the changes in) a currently outstanding PR, #305. You can look at just the second commit: `93f08baf73` to see just the changes relevant to this PR] Previously, when stages fail or get cancelled, the SparkListener is only notified indirectly through the SparkListenerJobEnd, where we sometimes pass in a single stage that failed. This worked before job cancellation, because jobs would only fail due to a single stage failure. However, with job cancellation, multiple running stages can fail when a job gets cancelled. Right now, this is not handled correctly, which results in stages that get stuck in the “Running Stages” window in the UI even though they’re dead. This PR changes the SparkListenerStageCompleted event to a SparkListenerStageEnded event, and uses this event to tell SparkListeners when stages fail in addition to when they complete successfully. This change is NOT publicly backward compatible for two reasons. First, it changes the SparkListener interface. We could alternately add a new event, SparkListenerStageFailed, and keep the existing SparkListenerStageCompleted. However, this is less consistent with the listener events for tasks / jobs ending, and will result in some code duplication for listeners (because failed and completed stages are handled in similar ways). Note that I haven’t finished updating the JSON code to correctly handle the new event because I’m waiting for feedback on whether this is a good or bad idea (hence the “WIP”). It is also not backwards compatible because it changes the publicly visible JobWaiter.jobFailed() method to no longer include a stage that caused the failure. I think this change should definitely stay, because with cancellation (as described above), a failure isn’t necessarily caused by a single stage. 
Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #309 from kayousterhout/stage_cancellation and squashes the following commits: 5533ecd [Kay Ousterhout] Fixes in response to Mark's review 320c7c7 [Kay Ousterhout] Notify SparkListeners when stages fail or are cancelled.	2014-04-08 14:42:02 -07:00
Kan Zhang	a8d86b080a	SPARK-1348 binding Master, Worker, and App Web UI to all interfaces Author: Kan Zhang <kzhang@apache.org> Closes #318 from kanzhang/SPARK-1348 and squashes the following commits: e625a5f [Kan Zhang] reverting the changes to startJettyServer() 7a8084e [Kan Zhang] SPARK-1348 binding Master, Worker, and App Web UI to all interfaces	2014-04-08 14:30:24 -07:00
Kay Ousterhout	6dc5f5849c	[SPARK-1396] Properly cleanup DAGScheduler on job cancellation. Previously, when jobs were cancelled, not all of the state in the DAGScheduler was cleaned up, leading to a slow memory leak in the DAGScheduler. As we expose easier ways to cancel jobs, it's more important to fix these issues. This commit also fixes a second and less serious problem, which is that previously, when a stage failed, not all of the appropriate stages were cancelled. See the "failure of stage used by two jobs" test for an example of this. This just meant that extra work was done, and is not a correctness problem. This commit adds 3 tests. “run shuffle with map stage failure” is a new test to more thoroughly test this functionality, and passes on both the old and new versions of the code. “trivial job cancellation” fails on the old code because all state wasn’t cleaned up correctly when jobs were cancelled (we didn’t remove the job from resultStageToJob). “failure of stage used by two jobs” fails on the old code because taskScheduler.cancelTasks wasn’t called for one of the stages (see test comments). This should be checked in before #246, which makes it easier to cancel stages / jobs. Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #305 from kayousterhout/incremental_abort_fix and squashes the following commits: f33d844 [Kay Ousterhout] Mark review comments 9217080 [Kay Ousterhout] Properly cleanup DAGScheduler on job cancellation.	2014-04-08 01:03:33 -07:00
Tathagata Das	11eabbe125	[SPARK-1103] Automatic garbage collection of RDD, shuffle and broadcast data This PR allows Spark to automatically clean up metadata and data related to persisted RDDs, shuffles and broadcast variables when the corresponding RDDs, shuffles and broadcast variables fall out of scope from the driver program. This is still a work in progress as broadcast cleanup has not been implemented. Implementation Details A new class `ContextCleaner` is responsible for cleaning all the state. It is instantiated as part of a `SparkContext`. RDD and ShuffleDependency classes have an overridden `finalize()` function that gets called whenever their instances go out of scope. The `finalize()` function enqueues the object’s identifier (i.e. RDD ID, shuffle ID, etc.) with the `ContextCleaner`, which is a very short and cheap operation and should not significantly affect the garbage collection mechanism. The `ContextCleaner`, on a different thread, performs the cleanup, whose details are given below. RDD cleanup: `ContextCleaner` calls `RDD.unpersist()` to clean up persisted RDDs. Regarding metadata, the DAGScheduler automatically cleans up all metadata related to an RDD after all jobs have completed. Only the `SparkContext.persistentRDDs` keeps strong references to persisted RDDs. The `TimeStampedHashMap` used for that has been replaced by `TimeStampedWeakValueHashMap` that keeps only weak references to the RDDs, allowing them to be garbage collected. Shuffle cleanup: New BlockManager message `RemoveShuffle(<shuffle ID>)` asks the `BlockManagerMaster` and currently active `BlockManager`s to delete all the disk blocks related to the shuffle ID. `ContextCleaner` cleans up shuffle data using this message and also cleans up the metadata in the `MapOutputTracker` of the driver. The `MapOutputTracker` at the workers, which caches the shuffle metadata, maintains a `BoundedHashMap` to limit the shuffle information it caches. 
Refetching the shuffle information from the driver is not too costly. Broadcast cleanup: To be done. [This PR](https://github.com/apache/incubator-spark/pull/543/) adds a mechanism for explicit cleanup of broadcast variables. `Broadcast.finalize()` will enqueue its own ID with ContextCleaner and that PR's mechanism will be used to unpersist the Broadcast data. Other cleanup: `ShuffleMapTask` and `ResultTask` cached tasks and used TTL-based cleanup (using `TimeStampedHashMap`), so nothing got cleaned up if TTL was not set. Instead, they now use `BoundedHashMap` to keep a limited amount of map output information. Cost of repopulating the cache if necessary is very small. Current state of implementation Implemented RDD and shuffle cleanup. Things left to be done: - Cleanup for broadcast variables. - Automatic cleanup of keys with empty weak refs as values in `TimeStampedWeakValueHashMap` Author: Tathagata Das <tathagata.das1565@gmail.com> Author: Andrew Or <andrewor14@gmail.com> Author: Roman Pastukhov <ignatich@mail.ru> Closes #126 from tdas/state-cleanup and squashes the following commits: 61b8d6e [Tathagata Das] Fixed issue with Tachyon + new BlockManager methods. f489fdc [Tathagata Das] Merge remote-tracking branch 'apache/master' into state-cleanup d25a86e [Tathagata Das] Fixed stupid typo. cff023c [Tathagata Das] Fixed issues based on Andrew's comments. 4d05314 [Tathagata Das] Scala style fix. 2b95b5e [Tathagata Das] Added more documentation on Broadcast implementations, specially which blocks are told about to the driver. Also, fixed Broadcast API to hide destroy functionality. 41c9ece [Tathagata Das] Added more unit tests for BlockManager, DiskBlockManager, and ContextCleaner. 6222697 [Tathagata Das] Fixed bug and adding unit test for removeBroadcast in BlockManagerSuite. 
104a89a [Tathagata Das] Fixed failing BroadcastSuite unit tests by introducing blocking for removeShuffle and removeBroadcast in BlockManager* a430f06 [Tathagata Das] Fixed compilation errors. b27f8e8 [Tathagata Das] Merge pull request #3 from andrewor14/cleanup cd72d19 [Andrew Or] Make automatic cleanup configurable (not documented) ada45f0 [Andrew Or] Merge branch 'state-cleanup' of github.com:tdas/spark into cleanup a2cc8bc [Tathagata Das] Merge remote-tracking branch 'apache/master' into state-cleanup c5b1d98 [Andrew Or] Address Patrick's comments a6460d4 [Andrew Or] Merge github.com:apache/spark into cleanup 762a4d8 [Tathagata Das] Merge pull request #1 from andrewor14/cleanup f0aabb1 [Andrew Or] Correct semantics for TimeStampedWeakValueHashMap + add tests 5016375 [Andrew Or] Address TD's comments 7ed72fb [Andrew Or] Fix style test fail + remove verbose test message regarding broadcast 634a097 [Andrew Or] Merge branch 'state-cleanup' of github.com:tdas/spark into cleanup 7edbc98 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into state-cleanup 8557c12 [Andrew Or] Merge github.com:apache/spark into cleanup e442246 [Andrew Or] Merge github.com:apache/spark into cleanup 88904a3 [Andrew Or] Make TimeStampedWeakValueHashMap a wrapper of TimeStampedHashMap fbfeec8 [Andrew Or] Add functionality to query executors for their local BlockStatuses 34f436f [Andrew Or] Generalize BroadcastBlockId to remove BroadcastHelperBlockId 0d17060 [Andrew Or] Import, comments, and style fixes (minor) c92e4d9 [Andrew Or] Merge github.com:apache/spark into cleanup f201a8d [Andrew Or] Test broadcast cleanup in ContextCleanerSuite + remove BoundedHashMap e95479c [Andrew Or] Add tests for unpersisting broadcast 544ac86 [Andrew Or] Clean up broadcast blocks through BlockManager* d0edef3 [Andrew Or] Add framework for broadcast cleanup ba52e00 [Andrew Or] Refactor broadcast classes c7ccef1 [Andrew Or] Merge branch 'bc-unpersist-merge' of 
github.com:ignatich/incubator-spark into cleanup 6c9dcf6 [Tathagata Das] Added missing Apache license d2f8b97 [Tathagata Das] Removed duplicate unpersistRDD. a007307 [Tathagata Das] Merge remote-tracking branch 'apache/master' into state-cleanup 620eca3 [Tathagata Das] Changes based on PR comments. f2881fd [Tathagata Das] Changed ContextCleaner to use ReferenceQueue instead of finalizer e1fba5f [Tathagata Das] Style fix 892b952 [Tathagata Das] Removed use of BoundedHashMap, and made BlockManagerSlaveActor cleanup shuffle metadata in MapOutputTrackerWorker. a7260d3 [Tathagata Das] Added try-catch in context cleaner and null value cleaning in TimeStampedWeakValueHashMap. e61daa0 [Tathagata Das] Modifications based on the comments on PR 126. ae9da88 [Tathagata Das] Removed unncessary TimeStampedHashMap from DAGScheduler, added try-catches in finalize() methods, and replaced ArrayBlockingQueue to LinkedBlockingQueue to avoid blocking in Java's finalizing thread. cb0a5a6 [Tathagata Das] Fixed docs and styles. a24fefc [Tathagata Das] Merge remote-tracking branch 'apache/master' into state-cleanup 8512612 [Tathagata Das] Changed TimeStampedHashMap to use WrappedJavaHashMap. e427a9e [Tathagata Das] Added ContextCleaner to automatically clean RDDs and shuffles when they fall out of scope. Also replaced TimeStampedHashMap to BoundedHashMaps and TimeStampedWeakValueHashMap for the necessary hashmap behavior. 80dd977 [Roman Pastukhov] Fix for Broadcast unpersist patch. 1e752f1 [Roman Pastukhov] Added unpersist method to Broadcast.	2014-04-07 23:40:36 -07:00
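One of the squashed commits above (f2881fd) switches `ContextCleaner` from finalizers to a `ReferenceQueue`. The general pattern can be sketched as follows, assuming only the JDK's `java.lang.ref` classes; `CleanupRef` and `SketchCleaner` are hypothetical names for illustration and do not reproduce Spark's actual `ContextCleaner`. Each tracked object gets a weak reference that carries a resource ID, and a drain step runs the cleanup callback for every reference the garbage collector has enqueued.

```scala
import java.lang.ref.{ReferenceQueue, WeakReference}
import scala.collection.mutable

// Hypothetical weak reference that remembers which resource to clean up.
final class CleanupRef(owner: AnyRef, val resourceId: Int, queue: ReferenceQueue[AnyRef])
  extends WeakReference[AnyRef](owner, queue)

// Hypothetical cleaner: `cleanup` is invoked once per enqueued reference.
final class SketchCleaner(cleanup: Int => Unit) {
  private val queue = new ReferenceQueue[AnyRef]
  // Keep the reference objects themselves strongly reachable, otherwise they
  // could be collected before the GC has a chance to enqueue them.
  private val pending = mutable.Set.empty[CleanupRef]

  def register(owner: AnyRef, resourceId: Int): CleanupRef = synchronized {
    val ref = new CleanupRef(owner, resourceId, queue)
    pending += ref
    ref
  }

  // Processes every reference currently in the queue without blocking and
  // returns how many resources were cleaned.
  def drain(): Int = synchronized {
    var cleaned = 0
    var ref = queue.poll()
    while (ref != null) {
      ref match {
        case c: CleanupRef =>
          pending -= c
          cleanup(c.resourceId)
          cleaned += 1
        case _ => // not one of ours; ignore
      }
      ref = queue.poll()
    }
    cleaned
  }
}
```

In a real cleaner a background thread would typically block on `queue.remove()` instead of polling. For experimenting, calling `enqueue()` on the returned reference simulates the garbage collector noticing that the owner is unreachable.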
Aaron Davidson	0307db0f55	SPARK-1099: Introduce local[*] mode to infer number of cores This is the default mode for running spark-shell and pyspark, intended to allow users running spark for the first time to see the performance benefits of using multiple cores, while not breaking backwards compatibility for users who use "local" mode and expect exactly 1 core. Author: Aaron Davidson <aaron@databricks.com> Closes #182 from aarondav/110 and squashes the following commits: a88294c [Aaron Davidson] Rebased changes for new spark-shell a9f393e [Aaron Davidson] SPARK-1099: Introduce local[*] mode to infer number of cores	2014-04-07 13:06:30 -07:00
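The distinction drawn in the SPARK-1099 message, where plain "local" means exactly one core and the core-inferring mode (written `local[*]` in Spark) uses as many as the machine has, can be sketched as a small parser. This is an illustrative sketch only: `LocalMasterSketch` and `threadCount` are hypothetical names, and the real parsing inside `SparkContext` may differ.

```scala
object LocalMasterSketch {
  // Matches "local[N]" for a decimal N, or "local[*]".
  private val LocalN = """local\[([0-9]+|\*)\]""".r

  // Returns the number of worker threads implied by a local master string,
  // or None if the string is not a local master at all.
  def threadCount(master: String): Option[Int] = master match {
    case "local"     => Some(1) // unchanged: exactly one core
    case LocalN("*") => Some(Runtime.getRuntime.availableProcessors()) // inferred
    case LocalN(n)   => Some(n.toInt)
    case _           => None
  }
}
```

For example, `LocalMasterSketch.threadCount("local[4]")` yields `Some(4)`, while a cluster URL such as `spark://host:7077` yields `None`.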
Davis Shepherd	a3c51c6ea2	SPARK-1432: Make sure that all metadata fields are properly cleaned While working on spark-1337 with @pwendell, we noticed that not all of the metadata maps in JobProgressListener were being properly cleaned. This could lead to a (hypothetical) memory leak issue should a job run long enough. This patch aims to address the issue. Author: Davis Shepherd <davis@conviva.com> Closes #338 from dgshep/master and squashes the following commits: a77b65c [Davis Shepherd] In the contex of SPARK-1337: Make sure that all metadata fields are properly cleaned	2014-04-07 10:02:00 -07:00
Evan Chan	1440154c27	SPARK-1154: Clean up app folders in worker nodes This is a fix for [SPARK-1154](https://issues.apache.org/jira/browse/SPARK-1154). The issue is that worker nodes fill up with a huge number of app-* folders after some time. This change adds a periodic cleanup task which asynchronously deletes app directories older than a configurable TTL. Two new configuration parameters have been introduced: spark.worker.cleanup_interval spark.worker.app_data_ttl This change does not include moving the downloads of application jars to a location outside of the work directory. We will address that if we have time, but that potentially involves caching so it will come either as part of this PR or a separate PR. Author: Evan Chan <ev@ooyala.com> Author: Kelvin Chu <kelvinkwchu@yahoo.com> Closes #288 from velvia/SPARK-1154-cleanup-app-folders and squashes the following commits: 0689995 [Evan Chan] CR from @aarondav - move config, clarify for standalone mode 9f10d96 [Evan Chan] CR from @pwendell - rename configs and add cleanup.enabled f2f6027 [Evan Chan] CR from @andrewor14 553d8c2 [Kelvin Chu] change the variable name to currentTimeMillis since it actually tracks in seconds 8dc9cb5 [Kelvin Chu] Fixed a bug in Utils.findOldFiles() after merge. cb52f2b [Kelvin Chu] Change the name of findOldestFiles() to findOldFiles() 72f7d2d [Kelvin Chu] Fix a bug of Utils.findOldestFiles(). file.lastModified is returned in milliseconds. ad99955 [Kelvin Chu] Add unit test for Utils.findOldestFiles() dc1a311 [Evan Chan] Don't recompute current time with every new file e3c408e [Evan Chan] Document the two new settings b92752b [Evan Chan] SPARK-1154: Add a periodic task to clean up app directories	2014-04-06 19:21:40 -07:00
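The SPARK-1154 message above describes deleting app directories older than a configurable TTL, and one squashed commit notes that `file.lastModified` is reported in milliseconds. A minimal sketch of the selection step, assuming a TTL given in seconds, might look like this; `ExpirySketch` and `findExpiredEntries` are hypothetical names and this is not the code of Spark's `Utils.findOldFiles()`.

```scala
import java.io.File

object ExpirySketch {
  // Returns the direct children of `dir` whose last-modified time is older
  // than `ttlSeconds`, measured against `nowMillis`.
  // Note the unit conversion: File.lastModified() is in milliseconds.
  def findExpiredEntries(
      dir: File,
      ttlSeconds: Long,
      nowMillis: Long = System.currentTimeMillis()): Seq[File] = {
    val cutoffMillis = nowMillis - ttlSeconds * 1000L
    // listFiles() returns null if `dir` is not a readable directory.
    Option(dir.listFiles()).map(_.toSeq).getOrElse(Seq.empty)
      .filter(_.lastModified() < cutoffMillis)
  }
}
```

A periodic task could call this on the worker's work directory and delete the returned entries. Computing the cutoff once per scan, rather than once per file, mirrors the "Don't recompute current time with every new file" commit listed above.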
Egor Pakhomov	e258e5040f	[SPARK-1259] Make RDD locally iterable Author: Egor Pakhomov <pahomov.egor@gmail.com> Closes #156 from epahomov/SPARK-1259 and squashes the following commits: 8ec8f24 [Egor Pakhomov] Make to local iterator shorter 34aa300 [Egor Pakhomov] Fix toLocalIterator docs 08363ef [Egor Pakhomov] SPARK-1259 from toLocallyIterable to toLocalIterator 6a994eb [Egor Pakhomov] SPARK-1259 Make RDD locally iterable 8be3dcf [Egor Pakhomov] SPARK-1259 Make RDD locally iterable 33ecb17 [Egor Pakhomov] SPARK-1259 Make RDD locally iterable	2014-04-06 16:43:01 -07:00
Mridul Muralidharan	6e88583aef	[SPARK-1371] fix computePreferredLocations signature to not depend on underlying implementation Change to Map and Set - not mutable HashMap and HashSet Author: Mridul Muralidharan <mridulm80@apache.org> Closes #302 from mridulm/master and squashes the following commits: df747af [Mridul Muralidharan] Address review comments 17e2907 [Mridul Muralidharan] fix computePreferredLocations signature to not depend on underlying implementation	2014-04-05 15:23:37 -07:00
Kay Ousterhout	2d0150c1a2	Remove the getStageInfo() method from SparkContext. This method exposes the Stage objects, which are private to Spark and should not be exposed to the user. This method was added in `01d77f329f`; ccing @squito here in case there's a good reason to keep this! Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #308 from kayousterhout/remove_public_method and squashes the following commits: 2e2f009 [Kay Ousterhout] Remove the getStageInfo() method from SparkContext.	2014-04-05 15:17:50 -07:00
Haoyuan Li	b50ddfde03	SPARK-1305: Support persisting RDD's directly to Tachyon Move the PR#468 of apache-incubator-spark to the apache-spark "Adding an option to persist Spark RDD blocks into Tachyon." Author: Haoyuan Li <haoyuan@cs.berkeley.edu> Author: RongGu <gurongwalker@gmail.com> Closes #158 from RongGu/master and squashes the following commits: 72b7768 [Haoyuan Li] merge master 9f7fa1b [Haoyuan Li] fix code style ae7834b [Haoyuan Li] minor cleanup a8b3ec6 [Haoyuan Li] merge master branch e0f4891 [Haoyuan Li] better check offheap. 55b5918 [RongGu] address matei's comment on the replication of offHeap storagelevel 7cd4600 [RongGu] remove some logic code for tachyonstore's replication 51149e7 [RongGu] address aaron's comment on returning value of the remove() function in tachyonstore 8adfcfa [RongGu] address arron's comment on inTachyonSize 120e48a [RongGu] changed the root-level dir name in Tachyon 5cc041c [Haoyuan Li] address aaron's comments 9b97935 [Haoyuan Li] address aaron's comments d9a6438 [Haoyuan Li] fix for pspark 77d2703 [Haoyuan Li] change python api.git status 3dcace4 [Haoyuan Li] address matei's comments 91fa09d [Haoyuan Li] address patrick's comments 589eafe [Haoyuan Li] use TRY_CACHE instead of MUST_CACHE 64348b2 [Haoyuan Li] update conf docs. 
ed73e19 [Haoyuan Li] Merge branch 'master' of github.com:RongGu/spark-1 619a9a8 [RongGu] set number of directories in TachyonStore back to 64; added a TODO tag for duplicated code from the DiskStore be79d77 [RongGu] find a way to clean up some unnecessay metods and classed to make the code simpler 49cc724 [Haoyuan Li] update docs with off_headp option 4572f9f [RongGu] reserving the old apply function API of StorageLevel 04301d3 [RongGu] rename StorageLevel.TACHYON to Storage.OFF_HEAP c9aeabf [RongGu] rename the StorgeLevel.TACHYON as StorageLevel.OFF_HEAP 76805aa [RongGu] unifies the config properties name prefix; add the configs into docs/configuration.md e700d9c [RongGu] add the SparkTachyonHdfsLR example and some comments fd84156 [RongGu] use randomUUID to generate sparkapp directory name on tachyon;minor code style fix 939e467 [Haoyuan Li] 0.4.1-thrift from maven central 86a2eab [Haoyuan Li] tachyon 0.4.1-thrift is in the staging repo. but jenkins failed to download it. temporarily revert it back to 0.4.1 16c5798 [RongGu] make the dependency on tachyon as tachyon-0.4.1-thrift eacb2e8 [RongGu] Merge branch 'master' of https://github.com/RongGu/spark-1 bbeb4de [RongGu] fix the JsonProtocolSuite test failure problem 6adb58f [RongGu] Merge branch 'master' of https://github.com/RongGu/spark-1 d827250 [RongGu] fix JsonProtocolSuie test failure 716e93b [Haoyuan Li] revert the version ca14469 [Haoyuan Li] bump tachyon version to 0.4.1-thrift 2825a13 [RongGu] up-merging to the current master branch of the apache spark 6a22c1a [Haoyuan Li] fix scalastyle 8968b67 [Haoyuan Li] exclude more libraries from tachyon dependency to be the same as referencing tachyon-client. 77be7e8 [RongGu] address mateiz's comment about the temp folder name problem. The implementation followed mateiz's advice. 
1dcadf9 [Haoyuan Li] typo bf278fa [Haoyuan Li] fix python tests e82909c [Haoyuan Li] minor cleanup 776a56c [Haoyuan Li] address patrick's and ali's comments from the previous PR 8859371 [Haoyuan Li] various minor fixes and clean up e3ddbba [Haoyuan Li] add doc to use Tachyon cache mode. fcaeab2 [Haoyuan Li] address Aaron's comment e554b1e [Haoyuan Li] add python code 47304b3 [Haoyuan Li] make tachyonStore in BlockMananger lazy val; add more comments StorageLevels. dc8ef24 [Haoyuan Li] add old storelevel constructor e01a271 [Haoyuan Li] update tachyon 0.4.1 8011a96 [RongGu] fix a brought-in mistake in StorageLevel `70ca182` [RongGu] a bit change in comment 556978b [RongGu] fix the scalastyle errors 791189b [RongGu] "Adding an option to persist Spark RDD blocks into Tachyon." move the PR#468 of apache-incubator-spark to the apache-spark	2014-04-04 20:38:20 -07:00
Matei Zaharia	60e18ce7dd	SPARK-1414. Python API for SparkContext.wholeTextFiles Also clarified comment on each file having to fit in memory Author: Matei Zaharia <matei@databricks.com> Closes #327 from mateiz/py-whole-files and squashes the following commits: 9ad64a5 [Matei Zaharia] SPARK-1414. Python API for SparkContext.wholeTextFiles	2014-04-04 17:29:29 -07:00
Thomas Graves	198892fe8d	[SPARK-1198] Allow pipes tasks to run in different sub-directories This works as is on Linux/Mac/etc but doesn't cover working on Windows. In here I use ln -sf for symlinks. Putting this up for comments on that. Do we want to create perhaps some classes for doing shell commands - Linux vs Windows. Is there some other way we want to do this? I assume we are still supporting jdk1.6? Also should I update the Java API for pipes to allow this parameter? Author: Thomas Graves <tgraves@apache.org> Closes #128 from tgravescs/SPARK1198 and squashes the following commits: abc1289 [Thomas Graves] remove extra tag in pom file ba23fc0 [Thomas Graves] Add support for symlink on windows, remove commons-io usage da4b221 [Thomas Graves] Merge branch 'master' of https://github.com/tgravescs/spark into SPARK1198 61be271 [Thomas Graves] Fix file name filter 6b783bd [Thomas Graves] style fixes 1ab49ca [Thomas Graves] Add support for running pipe tasks is separate directories	2014-04-04 17:16:31 -07:00
Sandy Ryza	16b8308887	SPARK-1375. Additional spark-submit cleanup Author: Sandy Ryza <sandy@cloudera.com> Closes #278 from sryza/sandy-spark-1375 and squashes the following commits: 5fbf1e9 [Sandy Ryza] SPARK-1375. Additional spark-submit cleanup	2014-04-04 13:28:42 -07:00
Xusen Yin	f1fa617023	[SPARK-1133] Add whole text files reader in MLlib Here is a pointer to the former [PR164](https://github.com/apache/spark/pull/164). I add the pull request for the JIRA issue [SPARK-1133](https://spark-project.atlassian.net/browse/SPARK-1133), which brings a new files reader API in MLlib. Author: Xusen Yin <yinxusen@gmail.com> Closes #252 from yinxusen/whole-files-input and squashes the following commits: 7191be6 [Xusen Yin] refine comments 0af3faf [Xusen Yin] add JavaAPI test 01745ee [Xusen Yin] fix deletion error cc97dca [Xusen Yin] move whole text file API to Spark core d792cee [Xusen Yin] remove the typo character "+" 6bdf2c2 [Xusen Yin] test for small local file system block size a1f1e7e [Xusen Yin] add two extra spaces 28cb0fe [Xusen Yin] add whole text files reader	2014-04-04 11:12:47 -07:00
Patrick Wendell	ee6e9e7d86	SPARK-1337: Application web UI garbage collects newest stages Simple fix... Author: Patrick Wendell <pwendell@gmail.com> Closes #320 from pwendell/stage-clean-up and squashes the following commits: 29be62e [Patrick Wendell] SPARK-1337: Application web UI garbage collects newest stages instead old ones	2014-04-03 22:13:56 -07:00
Andrew Or	de8eefa804	[SPARK-1385] Use existing code for JSON de/serialization of BlockId `BlockId.scala` offers a way to reconstruct a BlockId from a string through regex matching. `util/JsonProtocol.scala` duplicates this functionality by explicitly matching on the BlockId type. With this PR, the de/serialization of BlockIds will go through the first (older) code path. (Most of the line changes in this PR involve changing `==` to `===` in `JsonProtocolSuite.scala`) Author: Andrew Or <andrewor14@gmail.com> Closes #289 from andrewor14/blockid-json and squashes the following commits: 409d226 [Andrew Or] Simplify JSON de/serialization for BlockId	2014-04-02 10:43:09 -07:00
Kay Ousterhout	11973a7bda	Renamed stageIdToActiveJob to jobIdToActiveJob. This data structure was misused and, as a result, later renamed to an incorrect name. This data structure seems to have gotten into this tangled state as a result of @henrydavidge using the stageID instead of the job Id to index into it and later @andrewor14 renaming the data structure to reflect this misunderstanding. This patch renames it and removes an incorrect indexing into it. The incorrect indexing into it meant that the code added by @henrydavidge to warn when a task size is too large (added here `57579934f0`) was not always executed; this commit fixes that. Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #301 from kayousterhout/fixCancellation and squashes the following commits: bd3d3a4 [Kay Ousterhout] Renamed stageIdToActiveJob to jobIdToActiveJob.	2014-04-02 10:35:52 -07:00
Andrew Or	ada310a9d3	[Hot Fix #42 ] Persisted RDD disappears on storage page if re-used If a previously persisted RDD is re-used, its information disappears from the Storage page. This is because the tasks associated with re-using the RDD do not report the RDD's blocks as updated (which is correct). On stage submit, however, we overwrite any existing information regarding that RDD with a fresh one, whether or not the information for the RDD already exists. Author: Andrew Or <andrewor14@gmail.com> Closes #281 from andrewor14/ui-storage-fix and squashes the following commits: 408585a [Andrew Or] Fix storage UI bug	2014-03-31 23:01:14 -07:00
Andrew Or	94fe7fd4fa	[SPARK-1377] Upgrade Jetty to 8.1.14v20131031 Previous version was 7.6.8v20121106. The only difference between Jetty 7 and Jetty 8 is that the former uses Servlet API 2.5, while the latter uses Servlet API 3.0. Author: Andrew Or <andrewor14@gmail.com> Closes #280 from andrewor14/jetty-upgrade and squashes the following commits: dd57104 [Andrew Or] Merge github.com:apache/spark into jetty-upgrade e75fa85 [Andrew Or] Upgrade Jetty to 8.1.14v20131031	2014-03-31 21:42:36 -07:00
Patrick Wendell	841721e03c	SPARK-1352: Improve robustness of spark-submit script 1. Better error messages when required arguments are missing. 2. Support for unit testing cases where presented arguments are invalid. 3. Bug fix: Only use environment variables when they are set (otherwise will cause NPE). 4. A verbose mode to aid debugging. 5. Visibility of several variables is set to private. 6. Deprecation warning for existing scripts. Author: Patrick Wendell <pwendell@gmail.com> Closes #271 from pwendell/spark-submit and squashes the following commits: 9146def [Patrick Wendell] SPARK-1352: Improve robustness of spark-submit script	2014-03-31 12:07:14 -07:00
Prashant Sharma	d666053679	SPARK-1352 - Comment style single space before ending / check. Author: Prashant Sharma <prashant.s@imaginea.com> Closes #261 from ScrapCodes/comment-style-check2 and squashes the following commits: 6cde61e [Prashant Sharma] comment style space before ending / check.	2014-03-30 10:06:56 -07:00
Michael Armbrust	92b83959ca	Don't swallow all kryo errors, only those that indicate we are out of data. Author: Michael Armbrust <michael@databricks.com> Closes #142 from marmbrus/kryoErrors and squashes the following commits: 9c72d1f [Michael Armbrust] Make the test more future proof. 78f5a42 [Michael Armbrust] Don't swallow all kryo errors, only those that indicate we are out of data.	2014-03-29 22:01:29 -07:00
Sandy Ryza	1617816090	SPARK-1126. spark-app preliminary This is a starting version of the spark-app script for running compiled binaries against Spark. It still needs tests and some polish. The only testing I've done so far has been using it to launch jobs in yarn-standalone mode against a pseudo-distributed cluster. This leaves out the changes required for launching python scripts. I think it might be best to save those for another JIRA/PR (while keeping to the design so that they won't require backwards-incompatible changes). Author: Sandy Ryza <sandy@cloudera.com> Closes #86 from sryza/sandy-spark-1126 and squashes the following commits: d428d85 [Sandy Ryza] Commenting, doc, and import fixes from Patrick's comments e7315c6 [Sandy Ryza] Fix failing tests 34de899 [Sandy Ryza] Change --more-jars to --jars and fix docs 299ddca [Sandy Ryza] Fix scalastyle a94c627 [Sandy Ryza] Add newline at end of SparkSubmit 04bc4e2 [Sandy Ryza] SPARK-1126. spark-submit script	2014-03-29 14:41:36 -07:00
Prashant Sharma	60abc25254	SPARK-1096, a space after comment start style checker. Author: Prashant Sharma <prashant.s@imaginea.com> Closes #124 from ScrapCodes/SPARK-1096/scalastyle-comment-check and squashes the following commits: 214135a [Prashant Sharma] Review feedback. 5eba88c [Prashant Sharma] Fixed style checks for ///+ comments. e54b2f8 [Prashant Sharma] improved message, work around. 83e7144 [Prashant Sharma] removed dependency on scalastyle in plugin, since scalastyle sbt plugin already depends on the right version. Incase we update the plugin we will have to adjust our spark-style project to depend on right scalastyle version. 810a1d6 [Prashant Sharma] SPARK-1096, a space after comment style checker. ba33193 [Prashant Sharma] scala style as a project	2014-03-28 00:21:49 -07:00
Takuya UESHIN	3d89043b7e	[SPARK-1210] Prevent ContextClassLoader of Actor from becoming ClassLoader of Executo... ...r. Constructor of `org.apache.spark.executor.Executor` should not set context class loader of current thread, which is backend Actor's thread. Run the following code in local-mode REPL. ``` scala> case class Foo(i: Int) scala> val ret = sc.parallelize((1 to 100).map(Foo), 10).collect ``` This causes errors as follows: ``` ERROR actor.OneForOneStrategy: [L$line5.$read$$iwC$$iwC$$iwC$$iwC$Foo; java.lang.ArrayStoreException: [L$line5.$read$$iwC$$iwC$$iwC$$iwC$Foo; at scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:88) at org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:870) at org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:870) at org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:56) at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:859) at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:616) at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) ``` This is because the class loaders to deserialize result `Foo` instances might be different from backend Actor's, and the Actor's class loader should be the same as Driver's. 
Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #15 from ueshin/wip/wrongcontextclassloader and squashes the following commits: d79e8c0 [Takuya UESHIN] Change a parent class loader of ExecutorURLClassLoader. c6c09b6 [Takuya UESHIN] Add a test to collect objects of class defined in repl. 43e0feb [Takuya UESHIN] Prevent ContextClassLoader of Actor from becoming ClassLoader of Executor.	2014-03-27 22:17:15 -07:00
Petko Nikolov	6f986f0b87	[SPARK-1268] Adding XOR and AND-NOT operations to spark.util.collection.BitSet Symmetric difference (xor) in particular is useful for computing some distance metrics (e.g. Hamming). Unit tests added. Author: Petko Nikolov <nikolov@soundcloud.com> Closes #172 from petko-nikolov/bitset-imprv and squashes the following commits: 451f28b [Petko Nikolov] fixed style mistakes 5beba18 [Petko Nikolov] rm outer loop in andNot test 0e61035 [Petko Nikolov] conform to spark style; rm redundant asserts; more unit tests added; use arraycopy instead of loop d53cdb9 [Petko Nikolov] rm incidentally added space 4e1df43 [Petko Nikolov] adding xor and and-not to BitSet; unit tests added	2014-03-27 15:49:07 -07:00
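To illustrate why symmetric difference helps with Hamming distance, here is a small sketch. It assumes `java.util.BitSet` as a stand-in, not Spark's own `spark.util.collection.BitSet` (whose method signatures may differ), and `HammingSketch` with its helpers are hypothetical names.

```scala
import java.util.BitSet

object HammingSketch {
  // Build a bit set with the given bit positions set.
  def fromBits(indices: Int*): BitSet = {
    val bits = new BitSet()
    indices.foreach(i => bits.set(i))
    bits
  }

  // Hamming distance = number of positions where the two sets differ,
  // i.e. the cardinality of their symmetric difference (xor).
  def hamming(a: BitSet, b: BitSet): Int = {
    val diff = a.clone().asInstanceOf[BitSet] // java.util.BitSet.xor mutates, so copy first
    diff.xor(b)
    diff.cardinality()
  }

  // AND-NOT: bits set in `a` but not in `b`.
  def andNot(a: BitSet, b: BitSet): BitSet = {
    val result = a.clone().asInstanceOf[BitSet]
    result.andNot(b)
    result
  }
}
```

For the sets {0, 1, 2} and {1, 2, 3}, the xor is {0, 3}, so the Hamming distance is 2, and the and-not of the first with the second is {0}.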
NirmalReddy	3e63d98f09	Spark 1095 : Adding explicit return types to all public methods Excluded those that are self-evident and the cases that are discussed in the mailing list. Author: NirmalReddy <nirmal_reddy2000@yahoo.com> Author: NirmalReddy <nirmal.reddy@imaginea.com> Closes #168 from NirmalReddy/Spark-1095 and squashes the following commits: ac54b29 [NirmalReddy] import misplaced 8c5ff3e [NirmalReddy] Changed syntax of unit returning methods 02d0778 [NirmalReddy] fixed explicit types in all the other packages 1c17773 [NirmalReddy] fixed explicit types in core package	2014-03-26 18:24:55 -07:00
Patrick Wendell	be6d96c15b	SPARK-1324: SparkUI Should Not Bind to SPARK_PUBLIC_DNS /cc @aarondav and @andrewor14 Author: Patrick Wendell <pwendell@gmail.com> Closes #231 from pwendell/ui-binding and squashes the following commits: e8025f8 [Patrick Wendell] SPARK-1324: SparkUI Should Not Bind to SPARK_PUBLIC_DNS	2014-03-26 18:22:15 -07:00
Cheng Lian	345825d979	Unified package definition format in Spark SQL According to discussions in comments of PR #208, this PR unifies package definition format in Spark SQL. Some broken links in ScalaDoc and typos detected along the way are also fixed. Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #225 from liancheng/packageDefinition and squashes the following commits: 75c47b3 [Cheng Lian] Fixed file line length 4f87968 [Cheng Lian] Unified package definition format in Spark SQL	2014-03-26 15:36:18 -07:00
Reynold Xin	b859853ba4	SPARK-1321 Use Guava's top k implementation rather than our BoundedPriorityQueue based implementation Also updated the documentation for top and takeOrdered. On my simple test of sorting 100 million (Int, Int) tuples using Spark, Guava's top k implementation (in Ordering) is much faster than the BoundedPriorityQueue implementation for roughly sorted input (10 - 20X faster), and still faster for purely random input (2 - 5X). Author: Reynold Xin <rxin@apache.org> Closes #229 from rxin/takeOrdered and squashes the following commits: 0d11844 [Reynold Xin] Use Guava's top k implementation rather than our BoundedPriorityQueue based implementation. Also updated the documentation for top and takeOrdered.	2014-03-26 00:09:44 -07:00
witgo	8237df8060	Avoid Option while generating call site This is an update on https://github.com/apache/spark/pull/180, which changes the solution from blacklisting "Option.scala" to avoiding the Option code path while generating the call path. Also includes a unit test to prevent this issue in the future, and some minor refactoring. Thanks @witgo for reporting this issue and working on the initial solution! Author: witgo <witgo@qq.com> Author: Aaron Davidson <aaron@databricks.com> Closes #222 from aarondav/180 and squashes the following commits: f74aad1 [Aaron Davidson] Avoid Option while generating call site & add unit tests d2b4980 [witgo] Modify the position of the filter 1bc22d7 [witgo] Fix Stage.name return "apply at Option.scala:120"	2014-03-25 13:28:13 -07:00
Shivaram Venkataraman	f8111eaeb0	SPARK-1319: Fix scheduler to account for tasks using > 1 CPUs. Move CPUS_PER_TASK to TaskSchedulerImpl as the value is a constant and use it in both Mesos and CoarseGrained scheduler backends. Thanks @kayousterhout for the design discussion Author: Shivaram Venkataraman <shivaram@eecs.berkeley.edu> Closes #219 from shivaram/multi-cpus and squashes the following commits: 5c7d685 [Shivaram Venkataraman] Don't pass availableCpus to TaskSetManager 260e4d5 [Shivaram Venkataraman] Add a check for non-zero CPUs in TaskSetManager 73fcf6f [Shivaram Venkataraman] Add documentation for spark.task.cpus 647bc45 [Shivaram Venkataraman] Fix scheduler to account for tasks using > 1 CPUs. Move CPUS_PER_TASK to TaskSchedulerImpl as the value is a constant and use it in both Mesos and CoarseGrained scheduler backends.	2014-03-25 13:05:30 -07:00
Sean Owen	71d4ed271b	SPARK-1316. Remove use of Commons IO (This follows from a side point on SPARK-1133, in discussion of the PR: https://github.com/apache/spark/pull/164 ) Commons IO is barely used in the project, and can easily be replaced with equivalent calls to Guava or the existing Spark `Utils.scala` class. Removing a dependency feels good, and this one in particular can get a little problematic since Hadoop uses it too. Author: Sean Owen <sowen@cloudera.com> Closes #226 from srowen/SPARK-1316 and squashes the following commits: 21efef3 [Sean Owen] Remove use of Commons IO	2014-03-25 10:21:25 -07:00
CodingCat	5140598df8	SPARK-1128: set hadoop task properties when constructing HadoopRDD https://spark-project.atlassian.net/browse/SPARK-1128 The task properties are not set when constructing HadoopRDD in current implementation, this may limit the implementation based on ``` mapred.tip.id mapred.task.id mapred.task.is.map mapred.task.partition mapred.job.id ``` This patch also contains a small fix in createJobID (SparkHadoopWriter.scala), where the current implementation actually is not using time parameter Author: CodingCat <zhunansjtu@gmail.com> Author: Nan Zhu <CodingCat@users.noreply.github.com> Closes #101 from CodingCat/SPARK-1128 and squashes the following commits: ed0980f [CodingCat] make SparkHiveHadoopWriter belongs to spark package 5b1ad7d [CodingCat] move SparkHiveHadoopWriter to org.apache.spark package 258f92c [CodingCat] code cleanup af88939 [CodingCat] update the comments and permission of SparkHadoopWriter 9bd1fe3 [CodingCat] move configuration for jobConf to HadoopRDD b7bdfa5 [Nan Zhu] style fix a3153a8 [Nan Zhu] style fix c3258d2 [CodingCat] set hadoop task properties while using InputFormat	2014-03-24 21:55:03 -07:00
Emtiaz Ahmed	646e55405b	Fix to Stage UI to display numbers on progress bar Fixes an issue on Stage UI to display numbers on progress bar which are today hidden behind the progress bar div. Please refer to the attached images to see the issue. ![screen shot 2014-03-21 at 4 48 46 pm](https://f.cloud.github.com/assets/563652/2489083/8c127e80-b153-11e3-807c-048ebd45104b.png) ![screen shot 2014-03-21 at 4 49 00 pm](https://f.cloud.github.com/assets/563652/2489084/8c12cf5c-b153-11e3-8747-9d93ff6fceb4.png) Author: Emtiaz Ahmed <emtiazahmed@gmail.com> Closes #201 from emtiazahmed/master and squashes the following commits: a7964fe [Emtiaz Ahmed] Fix to Stage UI to display numbers on progress bar	2014-03-21 18:05:53 -07:00
zsxwing	2c0aa22e2e	SPARK-1279: Fix improper use of SimpleDateFormat `SimpleDateFormat` is not thread-safe. Some places share the same SimpleDateFormat object across multiple threads without any safeguard, which can cause the Web UI to display incorrect dates. This PR creates a new `SimpleDateFormat` each time one is needed. Another solution is to use `ThreadLocal` to store a `SimpleDateFormat` per thread. If this PR hurts performance, I can switch to the latter. Author: zsxwing <zsxwing@gmail.com> Closes #179 from zsxwing/SPARK-1278 and squashes the following commits: 21fabd3 [zsxwing] SPARK-1278: Fix improper use of SimpleDateFormat	2014-03-21 16:08:18 -07:00
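The two options named in the commit above (a fresh formatter per call versus one formatter per thread) can be sketched side by side. This is an illustration only: `DateFormatSketch`, the date pattern, and the fixed UTC time zone are assumptions made so the example is deterministic, not details taken from the PR.

```scala
import java.text.SimpleDateFormat
import java.util.{Date, Locale, TimeZone}

object DateFormatSketch {
  // Hypothetical pattern, chosen for illustration.
  private val Pattern = "yyyy/MM/dd HH:mm:ss"

  private def newFormat(): SimpleDateFormat = {
    val fmt = new SimpleDateFormat(Pattern, Locale.US)
    fmt.setTimeZone(TimeZone.getTimeZone("UTC"))
    fmt
  }

  // Option 1: create a new SimpleDateFormat for every call.
  // No instance is ever shared, so there is nothing to synchronize.
  def formatFresh(date: Date): String = newFormat().format(date)

  // Option 2: keep one SimpleDateFormat per thread.
  // Each thread lazily gets its own instance and reuses it.
  private val perThread = new ThreadLocal[SimpleDateFormat] {
    override def initialValue(): SimpleDateFormat = newFormat()
  }

  def formatThreadLocal(date: Date): String = perThread.get().format(date)
}
```

Option 1 trades some allocation for simplicity; option 2 avoids repeated construction at the cost of holding one formatter per thread for the thread's lifetime.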
Andrew Or	ca76423e23	[Hot Fix #42 ] Do not stop SparkUI if bind() is not called This is a bug fix for #42 (`79d07d6604`). In Master, we do not bind() each SparkUI because we do not want to start a server for each finished application. However, when we remove the associated application, we call stop() on the SparkUI, which throws an assertion failure. This fix ensures we don't call stop() on a SparkUI that was never bind()'ed. Author: Andrew Or <andrewor14@gmail.com> Closes #188 from andrewor14/ui-fix and squashes the following commits: 94a925f [Andrew Or] Do not stop SparkUI if bind() is not called	2014-03-20 14:13:16 -07:00
Aaron Davidson	ffe272d97c	Revert "SPARK-1099:Spark's local mode should probably respect spark.cores.max by default" This reverts commit `16789317a3`. Jenkins was not run for this PR.	2014-03-19 17:56:48 -07:00
qqsun8819	16789317a3	SPARK-1099:Spark's local mode should probably respect spark.cores.max by default This is for JIRA: https://spark-project.atlassian.net/browse/SPARK-1099 Here is what this patch does (also commented in the JIRA). @aarondav This is really a behavioral change, so I made it with great caution and welcome any review advice: 1. I changed how the "MASTER=local" pattern creates LocalBackEnd. In the past we passed 1 core to it; now it uses a default number of cores. The reason is that when someone uses spark-shell to start local mode, the REPL uses the "MASTER=local" pattern by default, so if one also specifies cores on the spark-shell command line, everything goes through this path. Passing 1 core here is therefore no longer suitable. 2. In LocalBackEnd, the "totalCores" variable is now computed by a different rule (in the past it just took the user-passed cores, e.g. 1 for the "MASTER=local" pattern and 2 for the "MASTER=local[2]" pattern). The rules: a. The second argument of LocalBackEnd's constructor, which indicates the cores, has a default value of Int.MaxValue, so if the user does not pass it, its value is Int.MaxValue. b. In getMaxCores, we first compare that value to Int.MaxValue. If it is not equal, we assume the user has passed their desired value and just use it. c. If b is not satisfied, we get the cores from spark.cores.max and the real logical cores from Runtime. If the cores specified by spark.cores.max exceed the logical cores, we use the logical cores; otherwise we use spark.cores.max. 3. In the test("local") case of SparkContextSchedulerCreationSuite, the assertion is changed from 1 to the logical cores, because the "MASTER=local" pattern uses default values.
Author: qqsun8819 <jin.oyj@alibaba-inc.com> Closes #110 from qqsun8819/local-cores and squashes the following commits: 731aefa [qqsun8819] 1 LocalBackend not change 2 In SparkContext do some process to the cores and pass it to original LocalBackend constructor 78b9c60 [qqsun8819] 1 SparkContext MASTER=local pattern use default cores instead of 1 to construct LocalBackEnd , for use of spark-shell and cores specified in cmd line 2 some test case change from local to local[1]. 3 SparkContextSchedulerCreationSuite test spark.cores.max config in local pattern 6ae1ee8 [qqsun8819] Add a static function in LocalBackEnd to let it use spark.cores.max specified cores when no cores are passed to it	2014-03-19 16:33:54 -07:00
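Rules a to c from the (later reverted) commit above can be read as the following sketch. It is a paraphrase under assumptions, not the patch itself: `LocalCoresSketch` and `maxCores` are hypothetical names, `spark.cores.max` is modeled as an `Option[Int]`, and falling back to the logical core count when that setting is absent is an assumption based on the test change in point 3.

```scala
object LocalCoresSketch {
  // passedCores: the constructor argument, Int.MaxValue when the user passed nothing.
  // configuredMax: the value of spark.cores.max, if set (modeled here as an Option).
  // logicalCores: what Runtime.getRuntime.availableProcessors() would report.
  def maxCores(passedCores: Int, configuredMax: Option[Int], logicalCores: Int): Int =
    if (passedCores != Int.MaxValue) {
      // Rule b: an explicit value such as local[2] wins.
      passedCores
    } else {
      configuredMax match {
        // Rule c: spark.cores.max is capped at the number of logical cores.
        case Some(limit) => math.min(limit, logicalCores)
        // Assumption: with no spark.cores.max, use all logical cores.
        case None => logicalCores
      }
    }
}
```

For example, on a machine with 8 logical cores, `local[2]` yields 2, plain `local` with `spark.cores.max=4` yields 4, and plain `local` with `spark.cores.max=16` yields 8.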
Andrew Or	79d07d6604	[SPARK-1132] Persisting Web UI through refactoring the SparkListener interface The fleeting nature of the Spark Web UI has long been a problem reported by many users: The existing Web UI disappears as soon as the associated application terminates. This is because SparkUI is tightly coupled with SparkContext, and cannot be instantiated independently from it. To solve this, some state must be saved to persistent storage while the application is still running. The approach taken by this PR involves persisting the UI state through SparkListenerEvents. This requires a major refactor of the SparkListener interface because existing events (1) maintain deep references, making de/serialization difficult, and (2) do not encode all the information displayed on the UI. In this design, each existing listener for the UI (e.g. ExecutorsListener) maintains state that can be fully constructed from SparkListenerEvents. This state is then supplied to the parent UI (e.g. ExecutorsUI), which renders the associated page(s) on demand. This PR introduces two important classes: the EventLoggingListener, and the ReplayListenerBus. In a live application, SparkUI registers an EventLoggingListener with the SparkContext in addition to the existing listeners. Over the course of the application, this listener serializes and logs all events to persisted storage. Then, after the application has finished, the SparkUI can be revived by replaying all the logged events to the existing UI listeners through the ReplayListenerBus. This feature is currently integrated with the Master Web UI, which optionally rebuilds a SparkUI from event logs as soon as the corresponding application finishes. More details can be found in the commit messages, comments within the code, and the [design doc](https://spark-project.atlassian.net/secure/attachment/12900/PersistingSparkWebUI.pdf). Comments and feedback are most welcome. 
Author: Andrew Or <andrewor14@gmail.com> Author: andrewor14 <andrewor14@gmail.com> Closes #42 from andrewor14/master and squashes the following commits: e5f14fa [Andrew Or] Merge github.com:apache/spark a1c5cd9 [Andrew Or] Merge github.com:apache/spark b8ba817 [Andrew Or] Remove UI from map when removing application in Master 83af656 [Andrew Or] Scraps and pieces (no functionality change) 222adcd [Andrew Or] Merge github.com:apache/spark 124429f [Andrew Or] Clarify LiveListenerBus behavior + Add tests for new behavior f80bd31 [Andrew Or] Simplify static handler and BlockManager status update logic 9e14f97 [Andrew Or] Moved around functionality + renamed classes per Patrick 6740e49 [Andrew Or] Fix comment nits 650eb12 [Andrew Or] Add unit tests + Fix bugs found through tests 45fd84c [Andrew Or] Remove now deprecated test c5c2c8f [Andrew Or] Remove list of (TaskInfo, TaskMetrics) from StageInfo 3456090 [Andrew Or] Address Patrick's comments bf80e3d [Andrew Or] Imports, comments, and code formatting, once again (minor) ac69ec8 [Andrew Or] Fix test fail d801d11 [Andrew Or] Merge github.com:apache/spark (major) dc93915 [Andrew Or] Imports, comments, and code formatting (minor) 77ba283 [Andrew Or] Address Kay's and Patrick's comments b6eaea7 [Andrew Or] Treating SparkUI as a handler of MasterUI d59da5f [Andrew Or] Avoid logging all the blocks on each executor d6e3b4a [Andrew Or] Merge github.com:apache/spark ca258a4 [Andrew Or] Master UI - add support for reading compressed event logs 176e68e [Andrew Or] Fix deprecated message for JavaSparkContext (minor) 4f69c4a [Andrew Or] Master UI - Rebuild SparkUI on application finish 291b2be [Andrew Or] Correct directory in log message "INFO: Logging events to <dir>" 1ba3407 [Andrew Or] Add a few configurable options to event logging e375431 [Andrew Or] Add new constructors for SparkUI 18b256d [Andrew Or] Refactor out event logging and replaying logic from UI bb4c503 [Andrew Or] Use a more mnemonic path for logging aef411c [Andrew 
Or] Fix bug: storage status was not reflected on UI in the local case 03eda0b [Andrew Or] Fix HDFS flush behavior 36b3e5d [Andrew Or] Add HDFS support for event logging cceff2b [andrewor14] Fix 100 char format fail 2fee310 [Andrew Or] Address Patrick's comments 2981d61 [Andrew Or] Move SparkListenerBus out of DAGScheduler + Clean up 5d2cec1 [Andrew Or] JobLogger: ID -> Id 0503e4b [Andrew Or] Fix PySpark tests + remove sc.clearFiles/clearJars 4d2fb0c [Andrew Or] Fix format fail faa113e [Andrew Or] General clean up d47585f [Andrew Or] Clean up FileLogger 472fd8a [Andrew Or] Fix a couple of tests 996d7a2 [Andrew Or] Reflect RDD unpersist on UI 7b2f811 [Andrew Or] Guard against TaskMetrics NPE + Fix tests d1f4285 [Andrew Or] Migrate from lift-json to json4s-jackson 28019ca [Andrew Or] Merge github.com:apache/spark bbe3501 [Andrew Or] Embed storage status and RDD info in Task events 6631c02 [Andrew Or] More formatting changes, this time mainly for Json DSL 70e7e7a [Andrew Or] Formatting changes e9e1c6d [Andrew Or] Move all JSON de/serialization logic to JsonProtocol d646df6 [Andrew Or] Completely decouple SparkUI from SparkContext 6814da0 [Andrew Or] Explicitly register each UI listener rather than through some magic 64d2ce1 [Andrew Or] Fix BlockManagerUI bug by introducing new event 4273013 [Andrew Or] Add a gateway SparkListener to simplify event logging 904c729 [Andrew Or] Fix another major bug 5ac906d [Andrew Or] Mostly naming, formatting, and code style changes 3fd584e [Andrew Or] Fix two major bugs f3fc13b [Andrew Or] General refactor 4dfcd22 [Andrew Or] Merge git://git.apache.org/incubator-spark into persist-ui b3976b0 [Andrew Or] Add functionality of reconstructing a persisted UI from SparkContext 8add36b [Andrew Or] JobProgressUI: Add JSON functionality d859efc [Andrew Or] BlockManagerUI: Add JSON functionality c4cd480 [Andrew Or] Also deserialize new events 8a2ebe6 [Andrew Or] Fix bugs for EnvironmentUI and ExecutorsUI de8a1cd [Andrew Or] Serialize events both 
to and from JSON (rather than just to) bf0b2e9 [Andrew Or] ExecutorUI: Serialize events rather than arbitary executor information bb222b9 [Andrew Or] ExecutorUI: render completely from JSON dcbd312 [Andrew Or] Add JSON Serializability for all SparkListenerEvent's 10ed49d [Andrew Or] Merge github.com:apache/incubator-spark into persist-ui 8e09306 [Andrew Or] Use JSON for ExecutorsUI e3ae35f [Andrew Or] Merge github.com:apache/incubator-spark 3ddeb7e [Andrew Or] Also privatize fields 090544a [Andrew Or] Privatize methods 13920c9 [Andrew Or] Update docs bd5a1d7 [Andrew Or] Typo: phyiscal -> physical 287ef44 [Andrew Or] Avoid reading the entire batch into memory; also simplify streaming logic 3df7005 [Andrew Or] Merge branch 'master' of github.com:andrewor14/incubator-spark a531d2e [Andrew Or] Relax assumptions on compressors and serializers when batching 164489d [Andrew Or] Relax assumptions on compressors and serializers when batching	2014-03-19 13:17:01 -07:00
Mridul Muralidharan	ab747d39dd	Bugfixes/improvements to scheduler Moves PR #517 from apache-incubator-spark to apache-spark. Author: Mridul Muralidharan <mridul@gmail.com> Closes #159 from mridulm/master and squashes the following commits: 5ff59c2 [Mridul Muralidharan] Change property in suite also 167fad8 [Mridul Muralidharan] Address review comments 9bda70e [Mridul Muralidharan] Address review comments, akwats add to failedExecutors 270d841 [Mridul Muralidharan] Address review comments fa5d9f1 [Mridul Muralidharan] Bugfixes/improvements to scheduler : PR #517	2014-03-19 12:46:55 -07:00
Thomas Graves	6112270c94	SPARK-1203 fix saving to hdfs from yarn Author: Thomas Graves <tgraves@apache.org> Closes #173 from tgravescs/SPARK-1203 and squashes the following commits: 4fd5ded [Thomas Graves] adding import 964e3f7 [Thomas Graves] SPARK-1203 fix saving to hdfs from yarn	2014-03-19 08:09:20 -05:00
shiyun.wxm	d55ec86de2	bugfix: Wrong "Duration" in "Active Stages" in stages page If a stage that has already completed once loses part of its data, it is resubmitted. At that point, it appears that stage.completionTime > stage.submissionTime. Author: shiyun.wxm <shiyun.wxm@taobao.com> Closes #170 from BlackNiuza/duration_problem and squashes the following commits: a86d261 [shiyun.wxm] tow space indent c0d7b24 [shiyun.wxm] change the style 3b072e1 [shiyun.wxm] fix scala style f20701e [shiyun.wxm] bugfix: "Duration" in "Active Stages" in stages page	2014-03-19 01:42:34 -07:00
witgo	cc2655a237	Fix SPARK-1256: Master web UI and Worker web UI returns a 404 error Author: witgo <witgo@qq.com> Closes #150 from witgo/SPARK-1256 and squashes the following commits: 08044a2 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1256 c99b030 [witgo] Fix SPARK-1256	2014-03-18 21:57:47 -07:00
CodingCat	2fa26ec02f	SPARK-1102: Create a saveAsNewAPIHadoopDataset method https://spark-project.atlassian.net/browse/SPARK-1102 Create a saveAsNewAPIHadoopDataset method By @mateiz: "Right now RDDs can only be saved as files using the new Hadoop API, not as "datasets" with no filename and just a JobConf. See http://codeforhire.com/2014/02/18/using-spark-with-mongodb/ for an example of how you have to give a bogus filename. For the old Hadoop API, we have saveAsHadoopDataset." Author: CodingCat <zhunansjtu@gmail.com> Closes #12 from CodingCat/SPARK-1102 and squashes the following commits: 6ba0c83 [CodingCat] add test cases for saveAsHadoopDataSet (new&old API) a8d11ba [CodingCat] style fix......... 95a6929 [CodingCat] code clean 7643c88 [CodingCat] change the parameter type back to Configuration a8583ee [CodingCat] Create a saveAsNewAPIHadoopDataset method	2014-03-18 11:06:18 -07:00
Patrick Wendell	e7423d4040	Revert "SPARK-1236 - Upgrade Jetty to 9.1.3.v20140225." This reverts commit `ca4bf8c572`. Jetty 9 requires JDK7, which is probably not a dependency we want to bump right now. Before Spark 1.0 we should consider upgrading to Jetty 8. However, in the meantime, to ease some pain, let's revert this. Sorry for not catching this during the initial review. cc/ @rxin Author: Patrick Wendell <pwendell@gmail.com> Closes #167 from pwendell/jetty-revert and squashes the following commits: 811b1c5 [Patrick Wendell] Revert "SPARK-1236 - Upgrade Jetty to 9.1.3.v20140225."	2014-03-18 00:46:03 -07:00
Dan McClary	e3681f26fa	Spark 1246 add min max to stat counter Here's the addition of min and max to statscounter.py and min and max methods to rdd.py. Author: Dan McClary <dan.mcclary@gmail.com> Closes #144 from dwmclary/SPARK-1246-add-min-max-to-stat-counter and squashes the following commits: fd3fd4b [Dan McClary] fixed error, updated test 82cde0e [Dan McClary] flipped incorrectly assigned inf values in StatCounter 5d96799 [Dan McClary] added max and min to StatCounter repr for pyspark 21dd366 [Dan McClary] added max and min to StatCounter output, updated doc 1a97558 [Dan McClary] added max and min to StatCounter output, updated doc a5c13b0 [Dan McClary] Added min and max to Scala and Java RDD, added min and max to StatCounter ed67136 [Dan McClary] broke min/max out into separate transaction, added to rdd.py 1e7056d [Dan McClary] added underscore to getBucket 37a7dea [Dan McClary] cleaned up boundaries for histogram -- uses real min/max when buckets are derived 29981f2 [Dan McClary] fixed indentation on doctest comment eaf89d9 [Dan McClary] added correct doctest for histogram 4916016 [Dan McClary] added histogram method, added max and min to statscounter	2014-03-18 00:45:47 -07:00
Patrick Wendell	796977acdb	SPARK-1244: Throw exception if map output status exceeds frame size This is a very small change on top of @andrewor14's patch in #147. Author: Patrick Wendell <pwendell@gmail.com> Author: Andrew Or <andrewor14@gmail.com> Closes #152 from pwendell/akka-frame and squashes the following commits: e5fb3ff [Patrick Wendell] Reversing test order 393af4c [Patrick Wendell] Small improvement suggested by Andrew Or 8045103 [Patrick Wendell] Breaking out into two tests 2b4e085 [Patrick Wendell] Consolidate Executor use of akka frame size c9b6109 [Andrew Or] Simplify test + make access to akka frame size more modular 281d7c9 [Andrew Or] Throw exception on spark.akka.frameSize exceeded + Unit tests	2014-03-17 14:03:32 -07:00
CodingCat	dc9654638f	SPARK-1240: handle the case of empty RDD when takeSample https://spark-project.atlassian.net/browse/SPARK-1240 It seems that the current implementation does not handle the empty RDD case when running takeSample. In this patch, before calling sample() inside the takeSample API, I add a check for this case and return an empty Array when the RDD is empty; in sample(), I also add a check for an invalid fraction value. I also add several lines to the test case to cover this. Author: CodingCat <zhunansjtu@gmail.com> Closes #135 from CodingCat/SPARK-1240 and squashes the following commits: fef57d4 [CodingCat] fix the same problem in PySpark 36db06b [CodingCat] create new test cases for takeSample from an empty red 810948d [CodingCat] further fix a40e8fb [CodingCat] replace if with require ad483fd [CodingCat] handle the case with empty RDD when take sample	2014-03-16 22:14:59 -07:00
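The two guards described in SPARK-1240 can be sketched roughly as follows. This is a simplified Python illustration over plain lists, not the Spark code: `sample` and `take_sample` are hypothetical stand-ins, sampling is without replacement only, and the `seed` parameter is an assumption added for reproducibility.

```python
import random


def sample(items, fraction, seed=None):
    """Keep each element with probability `fraction` (hypothetical helper)."""
    # Guard against an invalid fraction before doing any work.
    if fraction < 0.0:
        raise ValueError(f"Invalid fraction value: {fraction}")
    rng = random.Random(seed)
    return [x for x in items if rng.random() < fraction]


def take_sample(items, num, seed=None):
    """Return up to `num` randomly chosen elements (hypothetical helper)."""
    if num < 0:
        raise ValueError(f"Negative number of elements requested: {num}")
    # The empty-input case is handled up front and yields an empty result.
    if not items or num == 0:
        return []
    rng = random.Random(seed)
    return rng.sample(items, min(num, len(items)))
```

Checking for the empty input before sampling avoids computing a sampling fraction from a size of zero.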
Reynold Xin	f5486e9f75	SPARK-1255: Allow user to pass Serializer object instead of class name for shuffle. This is more general than simply passing a string name and leaves more room for performance optimizations. Note that this is technically an API breaking change in the following two ways: 1. The shuffle serializer specification in ShuffleDependency now require an object instead of a String (of the class name), but I suspect nobody else in this world has used this API other than me in GraphX and Shark. 2. Serializer's in Spark from now on are required to be serializable. Author: Reynold Xin <rxin@apache.org> Closes #149 from rxin/serializer and squashes the following commits: 5acaccd [Reynold Xin] Properly call serializer's constructors. 2a8d75a [Reynold Xin] Added more documentation for the serializer option in ShuffleDependency. 7420185 [Reynold Xin] Allow user to pass Serializer object instead of class name for shuffle.	2014-03-16 09:57:21 -07:00
Michael Armbrust	e19044cb10	Fix serialization of MutablePair. Also provide an interface for easy updating. Author: Michael Armbrust <michael@databricks.com> Closes #141 from marmbrus/mutablePair and squashes the following commits: f5c4783 [Michael Armbrust] Change function name to update 8bfd973 [Michael Armbrust] Fix serialization of MutablePair. Also provide an interface for easy updating.	2014-03-14 11:40:26 -07:00
Reynold Xin	ca4bf8c572	SPARK-1236 - Upgrade Jetty to 9.1.3.v20140225. Author: Reynold Xin <rxin@apache.org> Closes #113 from rxin/jetty9 and squashes the following commits: 867a2ce [Reynold Xin] Updated Jetty version to 9.1.3.v20140225 in Maven build file. d7c97ca [Reynold Xin] Return the correctly bound port. d14706f [Reynold Xin] Upgrade Jetty to 9.1.3.v20140225.	2014-03-13 12:16:04 -07:00
Patrick Wendell	4ea23db0ef	SPARK-1019: pyspark RDD take() throws an NPE Author: Patrick Wendell <pwendell@gmail.com> Closes #112 from pwendell/pyspark-take and squashes the following commits: daae80e [Patrick Wendell] SPARK-1019: pyspark RDD take() throws an NPE	2014-03-12 23:16:59 -07:00
CodingCat	6bd2eaa4a5	hot fix for PR105 - change to Java annotation Author: CodingCat <zhunansjtu@gmail.com> Closes #133 from CodingCat/SPARK-1160-2 and squashes the following commits: 6607155 [CodingCat] hot fix for PR105 - change to Java annotation	2014-03-12 19:49:18 -07:00
CodingCat	9032f7c0d5	SPARK-1160: Deprecate toArray in RDD https://spark-project.atlassian.net/browse/SPARK-1160 reported by @mateiz: "It's redundant with collect() and the name doesn't make sense in Java, where we return a List (we can't return an array due to the way Java generics work). It's also missing in Python." In this patch, I deprecated the method and changed the source files using it by replacing toArray with collect() directly Author: CodingCat <zhunansjtu@gmail.com> Closes #105 from CodingCat/SPARK-1060 and squashes the following commits: 286f163 [CodingCat] deprecate in JavaRDDLike ee17b4e [CodingCat] add message and since 2ff7319 [CodingCat] deprecate toArray in RDD	2014-03-12 17:43:12 -07:00
liguoqiang	5d1ec64e79	Fix #SPARK-1149 Bad partitioners can cause Spark to hang Author: liguoqiang <liguoqiang@rd.tuan800.com> Closes #44 from witgo/SPARK-1149 and squashes the following commits: 3dcdcaf [liguoqiang] Merge branch 'master' into SPARK-1149 8425395 [liguoqiang] Merge remote-tracking branch 'upstream/master' into SPARK-1149 3dad595 [liguoqiang] review comment e3e56aa [liguoqiang] Merge branch 'master' into SPARK-1149 b0d5c07 [liguoqiang] review comment d0a6005 [liguoqiang] review comment 3395ee7 [liguoqiang] Merge remote-tracking branch 'upstream/master' into SPARK-1149 ac006a3 [liguoqiang] code Formatting 3feb3a8 [liguoqiang] Merge branch 'master' into SPARK-1149 adc443e [liguoqiang] partitions check bugfix 928e1e3 [liguoqiang] Added a unit test for PairRDDFunctions.lookup with bad partitioner db6ecc5 [liguoqiang] Merge branch 'master' into SPARK-1149 1e3331e [liguoqiang] Merge branch 'master' into SPARK-1149 3348619 [liguoqiang] Optimize performance for partitions check 61e5a87 [liguoqiang] Merge branch 'master' into SPARK-1149 e68210a [liguoqiang] add partition index check to submitJob 3a65903 [liguoqiang] make the code more readable 6bb725e [liguoqiang] fix #SPARK-1149 Bad partitioners can cause Spark to hang	2014-03-12 13:00:04 -07:00
Patrick Wendell	16788a6542	SPARK-1167: Remove metrics-ganglia from default build due to LGPL issues... This patch removes Ganglia integration from the default build. It allows users willing to link against LGPL code to use Ganglia by adding build flags or linking against a new Spark artifact called spark-ganglia-lgpl. This brings Spark in line with the Apache policy on LGPL code enumerated here: https://www.apache.org/legal/3party.html#options-optional Author: Patrick Wendell <pwendell@gmail.com> Closes #108 from pwendell/ganglia and squashes the following commits: 326712a [Patrick Wendell] Responding to review feedback 5f28ee4 [Patrick Wendell] SPARK-1167: Remove metrics-ganglia from default build due to LGPL issues.	2014-03-11 11:16:59 -07:00
Patrick Wendell	2a5161708f	SPARK-1205: Clean up callSite/origin/generator. This patch removes the `generator` field and simplifies + documents the tracking of callsites. There are two places where we care about call sites: when a job is run and when an RDD is created. This patch retains both of those features but does a slight refactoring and renaming to make things less confusing. There was another feature of an RDD called the `generator`, which was by default the user class in which the RDD was created. This is used exclusively in the JobLogger. It has been subsumed by the ability to name a job group. The job logger can later be refactored to read the job group directly (will require some work) but for now this just preserves the default logged value of the user class. I'm not sure any users ever used the ability to override this. Author: Patrick Wendell <pwendell@gmail.com> Closes #106 from pwendell/callsite and squashes the following commits: fc1d009 [Patrick Wendell] Compile fix e17fb76 [Patrick Wendell] Review feedback: callSite -> creationSite 62e77ef [Patrick Wendell] Review feedback 576e60b [Patrick Wendell] SPARK-1205: Clean up callSite/origin/generator.	2014-03-10 16:28:41 -07:00
Patrick Wendell	b9be160951	SPARK-782 Clean up for ASM dependency. This makes two changes. 1) Spark uses the shaded version of asm that is (conveniently) published with Kryo. 2) Existing exclude rules around asm are updated to reflect the new groupId of `org.ow2.asm`. The groupId change had made all of the old rules ineffective with newer Hadoop versions that pull in new asm versions. Author: Patrick Wendell <pwendell@gmail.com> Closes #100 from pwendell/asm and squashes the following commits: 9235f3f [Patrick Wendell] SPARK-782 Clean up for ASM dependency.	2014-03-09 13:17:07 -07:00
Jiacheng Guo	f6f9d02e85	Add timeout for fetch file Currently, when fetching a file, the connection's connect timeout and read timeout are based on the default JVM settings. This change makes them use spark.worker.timeout instead. This can be useful when the connection between workers is unreliable, and it prevents a task set from being removed prematurely. Author: Jiacheng Guo <guojc03@gmail.com> Closes #98 from guojc/master and squashes the following commits: abfe698 [Jiacheng Guo] add space according request 2a37c34 [Jiacheng Guo] Add timeout for fetch file Currently, when fetch a file, the connection's connect timeout and read timeout is based on the default jvm setting, in this change, I change it to use spark.worker.timeout. This can be usefull, when the connection status between worker is not perfect. And prevent prematurely remove task set.	2014-03-09 11:38:40 -07:00
Aaron Davidson	52834d761b	SPARK-929: Fully deprecate usage of SPARK_MEM (Continued from old repo, prior discussion at https://github.com/apache/incubator-spark/pull/615) This patch cements our deprecation of the SPARK_MEM environment variable by replacing it with three more specialized variables: SPARK_DAEMON_MEMORY, SPARK_EXECUTOR_MEMORY, and SPARK_DRIVER_MEMORY. The creation of the latter two variables means that we can safely set driver/job memory without accidentally setting the executor memory. Neither is public. SPARK_EXECUTOR_MEMORY is only used by the Mesos scheduler (and set within SparkContext). The proper way of configuring executor memory is through the "spark.executor.memory" property. SPARK_DRIVER_MEMORY is the new way of specifying the amount of memory used by jobs launched by spark-class, without possibly affecting executor memory. Other memory considerations: - The repl's memory can be set through the "--drivermem" command-line option, which really just sets SPARK_DRIVER_MEMORY. - run-example doesn't use spark-class, so the only way to modify examples' memory is actually an unusual use of SPARK_JAVA_OPTS (which is normally overridden in all cases by spark-class). This patch also fixes a lurking bug where spark-shell misused spark-class (the first argument is supposed to be the main class name, not java options), as well as a bug in the Windows spark-class2.cmd. I have not yet tested this patch on either Windows or Mesos, however. Author: Aaron Davidson <aaron@databricks.com> Closes #99 from aarondav/sparkmem and squashes the following commits: 9df4c68 [Aaron Davidson] SPARK-929: Fully deprecate usage of SPARK_MEM	2014-03-09 11:08:39 -07:00
Patrick Wendell	e59a3b6c41	SPARK-1190: Do not initialize log4j if slf4j log4j backend is not being used Author: Patrick Wendell <pwendell@gmail.com> Closes #107 from pwendell/logging and squashes the following commits: be21c11 [Patrick Wendell] Logging fix	2014-03-08 16:02:42 -08:00
Cheng Lian	0b7b7fd45c	[SPARK-1194] Fix the same-RDD rule for cache replacement SPARK-1194: https://spark-project.atlassian.net/browse/SPARK-1194 In the current implementation, when selecting candidate blocks to be swapped out, once we find a block from the same RDD that the block to be stored belongs to, cache eviction fails and aborts. In this PR, we keep selecting blocks not from the RDD that the block to be stored belongs to until either enough free space can be ensured (cache eviction succeeds) or all such blocks are checked (cache eviction fails). Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #96 from liancheng/fix-spark-1194 and squashes the following commits: 2524ab9 [Cheng Lian] Added regression test case for SPARK-1194 6e40c22 [Cheng Lian] Remove redundant comments 40cdcb2 [Cheng Lian] Bug fix, and addressed PR comments from @mridulm 62c92ac [Cheng Lian] Fixed SPARK-1194 https://spark-project.atlassian.net/browse/SPARK-1194	2014-03-07 23:26:46 -08:00
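The corrected same-RDD rule from SPARK-1194 can be sketched as a small selection loop. This is an illustrative Python sketch, not the Spark implementation: `select_victims`, its parameters, and the `(block_id, rdd_id, size)` tuple layout are all hypothetical names chosen for the example.

```python
def select_victims(cached_blocks, incoming_rdd_id, space_needed, free_space):
    """Pick blocks to evict so that `space_needed` bytes become available.

    `cached_blocks` is an iterable of (block_id, rdd_id, size) tuples in
    eviction order. Returns the list of block ids to drop, or None when
    eviction cannot free enough space.
    """
    victims = []
    for block_id, rdd_id, size in cached_blocks:
        if free_space >= space_needed:
            break
        # Blocks of the RDD being stored are skipped rather than
        # treated as a reason to abort the whole eviction.
        if rdd_id == incoming_rdd_id:
            continue
        victims.append(block_id)
        free_space += size
    # Eviction succeeds only if enough space can actually be ensured.
    return victims if free_space >= space_needed else None
```

The difference from the old behaviour described in the commit is the `continue`: encountering a block from the same RDD no longer ends the search.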
Prashant Sharma	6e730edcde	Spark 1165 rdd.intersection in python and java Author: Prashant Sharma <prashant.s@imaginea.com> Author: Prashant Sharma <scrapcodes@gmail.com> Closes #80 from ScrapCodes/SPARK-1165/RDD.intersection and squashes the following commits: 9b015e9 [Prashant Sharma] Added a note, shuffle is required for intersection. 1fea813 [Prashant Sharma] correct the lines wrapping d0c71f3 [Prashant Sharma] SPARK-1165 RDD.intersection in java d6effee [Prashant Sharma] SPARK-1165 Implemented RDD.intersection in python.	2014-03-07 18:48:07 -08:00
Thomas Graves	b7cd9e992c	SPARK-1195: set map_input_file environment variable in PipedRDD Hadoop uses the config mapreduce.map.input.file to indicate the input filename to the map when the input split is of type FileSplit. Some of the Hadoop input and output formats set or use this config. This config can also be used by user code. PipedRDD runs an external process and the configs aren't available to that process. Hadoop Streaming does something very similar, and the way it makes configs available is by exporting them into the environment, replacing '.' with '_'. Spark should also export this variable when launching the pipe command so the user code has access to that config. Note that the config mapreduce.map.input.file is the new one; the old one, which is deprecated but not yet removed, is map.input.file. So we should handle both. Perhaps it would be better to abstract this out somehow so it goes into the HadoopPartition code? Author: Thomas Graves <tgraves@apache.org> Closes #94 from tgravescs/map_input_file and squashes the following commits: cc97a6a [Thomas Graves] Update test to check for existence of command, add a getPipeEnvVars function to HadoopRDD e3401dc [Thomas Graves] Merge remote-tracking branch 'upstream/master' into map_input_file 2ba805e [Thomas Graves] set map_input_file environment variable in PipedRDD	2014-03-07 10:36:55 -08:00
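The naming convention described in SPARK-1195 (config keys exported to the environment with '.' replaced by '_') can be sketched as below. This is a Python illustration only, not the Spark code: `to_env_name` and `pipe_env_vars` are hypothetical helpers, and the file path in the usage note is made up.

```python
import os


def to_env_name(config_key):
    """Convert a Hadoop-style config key into an environment variable name."""
    return config_key.replace(".", "_")


def pipe_env_vars(input_file, base_env=None):
    """Build the environment for a piped child process (hypothetical helper).

    Both the new key (mapreduce.map.input.file) and the deprecated key
    (map.input.file) are exported, as the commit message suggests.
    """
    env = dict(os.environ if base_env is None else base_env)
    for key in ("mapreduce.map.input.file", "map.input.file"):
        env[to_env_name(key)] = input_file
    return env
```

For example, `pipe_env_vars("hdfs://namenode/data/part-00000", {})` yields a mapping with the keys `mapreduce_map_input_file` and `map_input_file`, which could then be passed as the `env` argument of `subprocess.Popen`.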
Aaron Davidson	dabeb6f160	SPARK-1136: Fix FaultToleranceTest for Docker 0.8.1 This patch allows the FaultToleranceTest to work in newer versions of Docker. See https://spark-project.atlassian.net/browse/SPARK-1136 for more details. Besides changing the Docker and FaultToleranceTest internals, this patch also changes the behavior of Master to accept new Workers which share an address with a Worker that we are currently trying to recover. This can only happen when the Worker itself was restarted and got the same IP address/port at the same time as a Master recovery occurs. Finally, this adds a good bit of ASCII art to the test to make failures, successes, and actions more apparent. This is very much needed. Author: Aaron Davidson <aaron@databricks.com> Closes #5 from aarondav/zookeeper and squashes the following commits: 5d7a72a [Aaron Davidson] SPARK-1136: Fix FaultToleranceTest for Docker 0.8.1	2014-03-07 10:22:27 -08:00
Sandy Ryza	328c73d037	SPARK-1197. Change yarn-standalone to yarn-cluster and fix up running on YARN docs This patch changes "yarn-standalone" to "yarn-cluster" (but still supports the former). It also cleans up the Running on YARN docs and adds a section on how to view logs. Author: Sandy Ryza <sandy@cloudera.com> Closes #95 from sryza/sandy-spark-1197 and squashes the following commits: 563ef3a [Sandy Ryza] Review feedback 6ad06d4 [Sandy Ryza] Change yarn-standalone to yarn-cluster and fix up running on YARN docs	2014-03-06 17:12:58 -08:00
Thomas Graves	7edbea41b4	SPARK-1189: Add Security to Spark - Akka, Http, ConnectionManager, UI use servlets Resubmitted pull request; it was previously https://github.com/apache/incubator-spark/pull/332. Author: Thomas Graves <tgraves@apache.org> Closes #33 from tgravescs/security-branch-0.9-with-client-rebase and squashes the following commits: dfe3918 [Thomas Graves] Fix merge conflict since startUserClass now using runAsUser 05eebed [Thomas Graves] Fix dependency lost in upmerge d1040ec [Thomas Graves] Fix up various imports 05ff5e0 [Thomas Graves] Fix up imports after upmerging to master ac046b3 [Thomas Graves] Merge remote-tracking branch 'upstream/master' into security-branch-0.9-with-client-rebase 13733e1 [Thomas Graves] Pass securityManager and SparkConf around where we can. Switch to use sparkConf for reading config whereever possible. Added ConnectionManagerSuite unit tests. 4a57acc [Thomas Graves] Change UI createHandler routines to createServlet since they now return servlets 2f77147 [Thomas Graves] Rework from comments 50dd9f2 [Thomas Graves] fix header in SecurityManager ecbfb65 [Thomas Graves] Fix spacing and formatting b514bec [Thomas Graves] Fix reference to config ed3d1c1 [Thomas Graves] Add security.md 6f7ddf3 [Thomas Graves] Convert SaslClient and SaslServer to scala, change spark.authenticate.ui to spark.ui.acls.enable, and fix up various other things from review comments 2d9e23e [Thomas Graves] Merge remote-tracking branch 'upstream/master' into security-branch-0.9-with-client-rebase_rework 5721c5a [Thomas Graves] update AkkaUtilsSuite test for the actorSelection changes, fix typos based on comments, and remove extra lines I missed in rebase from AkkaUtils f351763 [Thomas Graves] Add Security to Spark - Akka, Http, ConnectionManager, UI to use servlets	2014-03-06 18:27:50 -06:00
Kyle Ellrott	40566e10aa	SPARK-942: Do not materialize partitions when DISK_ONLY storage level is used This is a port of a pull request originally targeted at incubator-spark: https://github.com/apache/incubator-spark/pull/180 Essentially, if a user returns a generative iterator (from a flatMap operation), when trying to persist the data, Spark would first unroll the iterator into an ArrayBuffer, and then try to figure out if it could store the data. In cases where the user provided an iterator that generated more data than available memory, this would cause a crash. With this patch, if the user requests a persist with a 'StorageLevel.DISK_ONLY', the iterator will be unrolled as it is fed into the serializer. To do this, two changes were made: 1) The type of the 'values' argument in the putValues method of the BlockStore interface was changed from ArrayBuffer to Iterator (and all code interfacing with this method was modified to connect correctly). 2) The JavaSerializer now calls the ObjectOutputStream 'reset' method every 1000 objects. This was done because the ObjectOutputStream caches objects (thus preventing them from being GC'd) to write more compact serialization. If reset is never called, eventually the memory fills up; if it is called too often, then the serialization streams become much larger because of redundant class descriptions. Author: Kyle Ellrott <kellrott@gmail.com> Closes #50 from kellrott/iterator-to-disk and squashes the following commits: 9ef7cb8 [Kyle Ellrott] Fixing formatting issues. 60e0c57 [Kyle Ellrott] Fixing issues (formatting, variable names, etc.) from review comments 8aa31cd [Kyle Ellrott] Merge ../incubator-spark into iterator-to-disk 33ac390 [Kyle Ellrott] Merge branch 'iterator-to-disk' of github.com:kellrott/incubator-spark into iterator-to-disk 2f684ea [Kyle Ellrott] Refactoring the BlockManager to replace the Either[Either[A,B]] usage. Now using trait 'Values'. 
Also modified BlockStore.putBytes call to return PutResult, so that it behaves like putValues. f70d069 [Kyle Ellrott] Adding docs for spark.serializer.objectStreamReset configuration 7ccc74b [Kyle Ellrott] Moving the 'LargeIteratorSuite' to simply test persistance of iterators. It doesn't try to invoke a OOM error any more 16a4cea [Kyle Ellrott] Streamlined the LargeIteratorSuite unit test. It should now run in ~25 seconds. Confirmed that it still crashes an unpatched copy of Spark. c2fb430 [Kyle Ellrott] Removing more un-needed array-buffer to iterator conversions 627a8b7 [Kyle Ellrott] Wrapping a few long lines 0f28ec7 [Kyle Ellrott] Adding second putValues to BlockStore interface that accepts an ArrayBuffer (rather then an Iterator). This will allow BlockStores to have slightly different behaviors dependent on whether they get an Iterator or ArrayBuffer. In the case of the MemoryStore, it needs to duplicate and cache an Iterator into an ArrayBuffer, but if handed a ArrayBuffer, it can skip the duplication. 656c33e [Kyle Ellrott] Fixing the JavaSerializer to read from the SparkConf rather then the System property. 8644ee8 [Kyle Ellrott] Merge branch 'master' into iterator-to-disk 00c98e0 [Kyle Ellrott] Making the Java ObjectStreamSerializer reset rate configurable by the system variable 'spark.serializer.objectStreamReset', default is not 10000. 40fe1d7 [Kyle Ellrott] Removing rouge space 31fe08e [Kyle Ellrott] Removing un-needed semi-colons 9df0276 [Kyle Ellrott] Added check to make sure that streamed-to-dist RDD actually returns good data in the LargeIteratorSuite a6424ba [Kyle Ellrott] Wrapping long line 2eeda75 [Kyle Ellrott] Fixing dumb mistake ("\|\|" instead of "&&") 0e6f808 [Kyle Ellrott] Deleting temp output directory when done 95c7f67 [Kyle Ellrott] Simplifying StorageLevel checks 56f71cd [Kyle Ellrott] Merge branch 'master' into iterator-to-disk 44ec35a [Kyle Ellrott] Adding some comments. 
5eb2b7e [Kyle Ellrott] Changing the JavaSerializer reset to occur every 1000 objects. f403826 [Kyle Ellrott] Merge branch 'master' into iterator-to-disk 81d670c [Kyle Ellrott] Adding unit test for straight to disk iterator methods. d32992f [Kyle Ellrott] Merge remote-tracking branch 'origin/master' into iterator-to-disk cac1fad [Kyle Ellrott] Fixing MemoryStore, so that it converts incoming iterators to ArrayBuffer objects. This was previously done higher up the stack. efe1102 [Kyle Ellrott] Changing CacheManager and BlockManager to pass iterators directly to the serializer when a 'DISK_ONLY' persist is called. This is in response to SPARK-942.	2014-03-06 14:51:19 -08:00
CodingCat	a3da508819	SPARK-1171: when executor is removed, we should minus totalCores instead of just freeCores on that executor https://spark-project.atlassian.net/browse/SPARK-1171 When an executor is removed, the current implementation only subtracts that executor's freeCores. It should actually subtract the executor's totalCores. Author: CodingCat <zhunansjtu@gmail.com> Author: Nan Zhu <CodingCat@users.noreply.github.com> Closes #63 from CodingCat/simplify_CoarseGrainedSchedulerBackend and squashes the following commits: f6bf93f [Nan Zhu] code clean 19c2bb4 [CodingCat] use copy idiom to reconstruct the workerOffers 43c13e9 [CodingCat] keep WorkerOffer immutable af470d3 [CodingCat] style fix 0c0e409 [CodingCat] simplify the implementation of CoarseGrainedSchedulerBackend	2014-03-05 14:00:28 -08:00
Prashant Sharma	2d8e0a062c	SPARK-1164 Deprecated reduceByKeyToDriver as it is an alias for reduceByKeyLocally Author: Prashant Sharma <prashant.s@imaginea.com> Closes #72 from ScrapCodes/SPARK-1164/deprecate-reducebykeytodriver and squashes the following commits: ee521cd [Prashant Sharma] SPARK-1164 Deprecated reduceByKeyToDriver as it is an alias for reduceByKeyLocally	2014-03-04 10:27:02 -08:00
Prashant Sharma	181ec50307	[java8API] SPARK-964 Investigate the potential for using JDK 8 lambda expressions for the Java/Scala APIs Author: Prashant Sharma <prashant.s@imaginea.com> Author: Patrick Wendell <pwendell@gmail.com> Closes #17 from ScrapCodes/java8-lambdas and squashes the following commits: 95850e6 [Patrick Wendell] Some doc improvements and build changes to the Java 8 patch. 85a954e [Prashant Sharma] Nit. import orderings. 673f7ac [Prashant Sharma] Added support for -java-home as well 80a13e8 [Prashant Sharma] Used fake class tag syntax 26eb3f6 [Prashant Sharma] Patrick's comments on PR. 35d8d79 [Prashant Sharma] Specified java 8 building in the docs 31d4cd6 [Prashant Sharma] Maven build to support -Pjava8-tests flag. 4ab87d3 [Prashant Sharma] Review feedback on the pr c33dc2c [Prashant Sharma] SPARK-964, Java 8 API Support.	2014-03-03 22:31:30 -08:00
Kay Ousterhout	b14ede789a	Remove broken/unused Connection.getChunkFIFO method. This method appears to be broken -- since it never removes anything from messages, and it adds new messages to it, the while loop is an infinite loop. The method also does not appear to have ever been used since the code was added in 2012, so this commit removes it. cc @mateiz who originally added this method in case there's a reason it should be here! (`63051dd2bc`) Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #69 from kayousterhout/remove_get_fifo and squashes the following commits: 053bc59 [Kay Ousterhout] Remove broken/unused Connection.getChunkFIFO method.	2014-03-03 21:27:18 -08:00
Kay Ousterhout	b55cade853	Remove the remoteFetchTime metric. This metric is confusing: it adds up all of the time to fetch shuffle inputs, but fetches often happen in parallel, so remoteFetchTime can be much longer than the task execution time. @squito it looks like you added this metric -- do you have a use case for it? cc @shivaram -- I know you've looked at the shuffle performance a lot so chime in here if this metric has turned out to be useful for you! Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #62 from kayousterhout/remove_fetch_variable and squashes the following commits: 43341eb [Kay Ousterhout] Remote the remoteFetchTime metric.	2014-03-03 16:12:00 -08:00
Kay Ousterhout	369aad6f9e	Removed accidentally checked in comment It looks like this comment was added a while ago by @mridulm as part of a merge and was accidentally checked in. We should remove it. Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #61 from kayousterhout/remove_comment and squashes the following commits: 0b2b3f2 [Kay Ousterhout] Removed accidentally checked in comment	2014-03-03 14:39:49 -08:00
Aaron Davidson	46bcb9551e	SPARK-1137: Make ZK PersistenceEngine not crash for wrong serialVersionUID Previously, ZooKeeperPersistenceEngine would crash the whole Master process if there was stored data from a prior Spark version. Now, we just delete these files. Author: Aaron Davidson <aaron@databricks.com> Closes #4 from aarondav/zookeeper2 and squashes the following commits: fa8b40f [Aaron Davidson] SPARK-1137: Make ZK PersistenceEngine not crash for wrong serialVersionUID	2014-03-02 01:00:42 -08:00
CodingCat	3a8b698e96	[SPARK-1100] prevent Spark from overwriting directory silently Thanks to Diana Carroll for reporting this issue (https://spark-project.atlassian.net/browse/SPARK-1100). The current saveAsTextFile/SequenceFile silently overwrites the output directory if the directory already exists. This behaviour is not desirable because: (1) overwriting the data silently is not user-friendly; (2) if the partition number of the two write operations changed, the output directory will contain the results generated by both runs. My fix includes: (1) adding some new APIs with a flag that lets users define whether they want to overwrite the directory: if the flag is set to true, the output directory is deleted first and the new data is then written, to prevent the output directory from containing results from multiple runs; if the flag is set to false, Spark will throw an exception if the output directory already exists; (2) changing the JavaAPI part; (3) making overwriting the default behaviour. Two questions: (1) should we deprecate the old APIs without such a flag? (2) I noticed that Spark Streaming also calls these APIs; I thought we don't need to change the related part in streaming? @tdas Author: CodingCat <zhunansjtu@gmail.com> Closes #11 from CodingCat/SPARK-1100 and squashes the following commits: 6a4e3a3 [CodingCat] code clean ef2d43f [CodingCat] add new test cases and code clean ac63136 [CodingCat] checkOutputSpecs not applicable to FSOutputFormat ec490e8 [CodingCat] prevent Spark from overwriting directory silently and leaving dirty directory	2014-03-01 17:27:54 -08:00
Patrick Wendell	ec992e1822	Revert "[SPARK-1150] fix repo location in create script" This reverts commit `9aa0957118`.	2014-03-01 17:15:38 -08:00
Mark Grover	9aa0957118	[SPARK-1150] fix repo location in create script https://spark-project.atlassian.net/browse/SPARK-1150 fix the repo location in create_release script Author: Mark Grover <mark@apache.org> Closes #48 from CodingCat/script_fixes and squashes the following commits: 01f4bf7 [Mark Grover] Fixing some nitpicks d2244d4 [Mark Grover] SPARK-676: Abbreviation in SPARK_MEM but not in SPARK_WORKER_MEMORY	2014-03-01 16:21:22 -08:00
Kay Ousterhout	556c56689b	[SPARK-979] Randomize order of offers. This commit randomizes the order of resource offers to avoid scheduling all tasks on the same small set of machines. This is a much simpler solution to SPARK-979 than #7. Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #27 from kayousterhout/randomize and squashes the following commits: 435d817 [Kay Ousterhout] [SPARK-979] Randomize order of offers.	2014-03-01 11:24:22 -08:00
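The idea behind randomizing offers can be illustrated with a small sketch. This is not Spark's scheduler code: the function name `assign_tasks`, the `(host, free_cores)` offer shape, and the injected `rng` are all hypothetical, chosen only to show why shuffling the offers spreads tasks across machines.

```python
import random

def assign_tasks(offers, num_tasks, rng):
    """Assign tasks one per free core, visiting offers in random order.

    offers: list of (host, free_cores) pairs (hypothetical shape).
    rng: a random.Random instance, injected so runs are reproducible.
    """
    shuffled = list(offers)   # copy so the caller's list is left untouched
    rng.shuffle(shuffled)     # randomize which machines are tried first
    assignments = []
    for host, free_cores in shuffled:
        for _ in range(free_cores):
            if len(assignments) == num_tasks:
                return assignments
            assignments.append(host)
    return assignments
```

Without the shuffle, a small job would always fill the first hosts in the list; with it, the chosen hosts vary from one scheduling round to the next.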
Sandy Ryza	46dff34458	SPARK-1051. On YARN, executors don't doAs submitting user This reopens https://github.com/apache/incubator-spark/pull/538 against the new repo Author: Sandy Ryza <sandy@cloudera.com> Closes #29 from sryza/sandy-spark-1051 and squashes the following commits: 708ce49 [Sandy Ryza] SPARK-1051. doAs submitting user in YARN	2014-02-28 12:43:01 -06:00
Kay Ousterhout	edf8a56ab7	Remove BlockFetchTracker trait This trait seems to have been created a while ago when there were multiple implementations; now that there's just one, I think it makes sense to merge it into the BlockFetcherIterator trait. Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #39 from kayousterhout/remove_tracker and squashes the following commits: 8173939 [Kay Ousterhout] Remove BlockFetchTracker.	2014-02-27 21:52:55 -08:00
Sean Owen	12bbca2065	SPARK 1084.1 (resubmitted) (Ported from https://github.com/apache/incubator-spark/pull/637 ) Author: Sean Owen <sowen@cloudera.com> Closes #31 from srowen/SPARK-1084.1 and squashes the following commits: 6c4a32c [Sean Owen] Suppress warnings about legitimate unchecked array creations, or change code to avoid it f35b833 [Sean Owen] Fix two misc javadoc problems 254e8ef [Sean Owen] Fix one new style error introduced in scaladoc warning commit 5b2fce2 [Sean Owen] Fix scaladoc invocation warning, and enable javac warnings properly, with plugin config updates 007762b [Sean Owen] Remove dead scaladoc links b8ff8cb [Sean Owen] Replace deprecated Ant <tasks> with <target>	2014-02-27 11:12:21 -08:00
Raymond Liu	aace2c097e	Show Master status on UI page For standalone HA mode, A status is useful to identify the current master, already in json format too. Author: Raymond Liu <raymond.liu@intel.com> Closes #24 from colorant/status and squashes the following commits: df630b3 [Raymond Liu] Show Master status on UI page	2014-02-26 23:51:32 -08:00
Xiangrui Meng	5a3ad107c0	SPARK-1129: use a predefined seed when seed is zero in XORShiftRandom If the seed is zero, XORShift generates all zeros, which would create unexpected result. JIRA: https://spark-project.atlassian.net/browse/SPARK-1129 Author: Xiangrui Meng <meng@databricks.com> Closes #645 from mengxr/xor and squashes the following commits: 1b086ab [Xiangrui Meng] use MurmurHash3 to set seed in XORShiftRandom 45c6f16 [Xiangrui Meng] minor style change 51f4050 [Xiangrui Meng] use a predefined seed when seed is zero in XORShiftRandom	2014-02-26 23:22:30 -08:00
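The zero-seed problem described above can be reproduced in a few lines. This is a minimal sketch, not the XORShiftRandom source: the shift constants (21, 35, 4), the 64-bit mask, and `FALLBACK_SEED` are assumptions made for illustration, and the final commit reportedly hashes the seed with MurmurHash3 rather than substituting a constant.

```python
MASK64 = (1 << 64) - 1
FALLBACK_SEED = 0x9E3779B97F4A7C15  # arbitrary nonzero constant (hypothetical)

def xorshift_step(state):
    """One xorshift update on a 64-bit state (shift constants are assumed)."""
    state ^= (state << 21) & MASK64
    state ^= state >> 35
    state ^= (state << 4) & MASK64
    return state & MASK64

def safe_seed(seed):
    """Replace a zero seed with a predefined nonzero one."""
    seed &= MASK64
    return seed if seed != 0 else FALLBACK_SEED
```

Every operation in the update is a shift or XOR, so a state of zero maps to zero forever; any nonzero starting state avoids that fixed point.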
Kay Ousterhout	71f69d66ce	Remove references to ClusterScheduler (SPARK-1140) ClusterScheduler was renamed to TaskSchedulerImpl; this commit updates comments and tests accordingly. Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #9 from kayousterhout/cluster_scheduler_death and squashes the following commits: d6fd119 [Kay Ousterhout] Remove references to ClusterScheduler.	2014-02-26 22:52:42 -08:00
Prashant Sharma	0e40e2b126	Deprecated and added a few java api methods for corresponding scala api. PR [402](https://github.com/apache/incubator-spark/pull/402) from incubator repo. Author: Prashant Sharma <prashant.s@imaginea.com> Closes #19 from ScrapCodes/java-api-completeness and squashes the following commits: 11d0c2b [Prashant Sharma] Integer -> java.lang.Integer 737819a [Prashant Sharma] SPARK-1095 add explicit return types to APIs. 3ddc8bb [Prashant Sharma] Deprecated *With functions in scala and added a few missing Java APIs	2014-02-26 21:17:44 -08:00
William Benton	fbedc8eff2	SPARK-1078: Replace lift-json with json4s-jackson. The aim of the Json4s project is to provide a common API for Scala JSON libraries. It is Apache-licensed, easier for downstream distributions to package, and mostly API-compatible with lift-json. Furthermore, the Jackson-backed implementation parses faster than lift-json on all but the smallest inputs. Author: William Benton <willb@redhat.com> Closes #582 from willb/json4s and squashes the following commits: 7ca62c4 [William Benton] Replace lift-json with json4s-jackson.	2014-02-26 10:09:50 -08:00
Raymond Liu	c852201ce9	For SPARK-1082, Use Curator for ZK interaction in standalone cluster Author: Raymond Liu <raymond.liu@intel.com> Closes #611 from colorant/curator and squashes the following commits: 7556aa1 [Raymond Liu] Address review comments af92e1f [Raymond Liu] Fix coding style 964f3c2 [Raymond Liu] Ignore NodeExists exception 6df2966 [Raymond Liu] Rewrite zookeeper client code with curator	2014-02-24 23:20:38 -08:00
Bryn Keller	4d88030486	For outputformats that are Configurable, call setConf before sending data to them. [SPARK-1108] This allows us to use, e.g. HBase's TableOutputFormat with PairRDDFunctions.saveAsNewAPIHadoopFile, which otherwise would throw NullPointerException because the output table name hasn't been configured. Note this bug also affects branch-0.9 Author: Bryn Keller <bryn.keller@intel.com> Closes #638 from xoltar/SPARK-1108 and squashes the following commits: 7e94e7d [Bryn Keller] Import, comment, and format cleanup per code review 7cbcaa1 [Bryn Keller] For outputformats that are Configurable, call setConf before sending data to them. This allows us to use, e.g. HBase TableOutputFormat, which otherwise would throw NullPointerException because the output table name hasn't been configured	2014-02-24 17:35:22 -08:00
Matei Zaharia	0187cef0f2	Fix removal from shuffleToMapStage to search for a key-value pair with our stage instead of using our shuffleID.	2014-02-24 13:14:56 -08:00
Matei Zaharia	cd32d5e4de	SPARK-1124: Fix infinite retries of reduce stage when a map stage failed In the previous code, if you had a failing map stage and then tried to run reduce stages on it repeatedly, the first reduce stage would fail correctly, but the later ones would mistakenly believe that all map outputs are available and start failing infinitely with fetch failures from "null".	2014-02-23 23:48:32 -08:00
Punya Biswal	29ac7ea52f	Migrate Java code to Scala or move it to src/main/java These classes can't be migrated: StorageLevels: impossible to create static fields in Scala JavaSparkContextVarargsWorkaround: incompatible varargs JavaAPISuite: should test Java APIs in pure Java (for sanity) Author: Punya Biswal <pbiswal@palantir.com> Closes #605 from punya/move-java-sources and squashes the following commits: 25b00b2 [Punya Biswal] Remove redundant type param; reformat 853da46 [Punya Biswal] Use factory method rather than constructor e5d53d9 [Punya Biswal] Migrate Java code to Scala or move it to src/main/java	2014-02-22 17:53:48 -08:00
Xiangrui Meng	aaec7d4a80	SPARK-1117: update accumulator docs The current doc hints spark doesn't support accumulators of type `Long`, which is wrong. JIRA: https://spark-project.atlassian.net/browse/SPARK-1117 Author: Xiangrui Meng <meng@databricks.com> Closes #631 from mengxr/acc and squashes the following commits: 45ecd25 [Xiangrui Meng] update accumulator docs	2014-02-21 22:44:45 -08:00
Andrew Or	fefd22f4c3	[SPARK-1113] External spilling - fix Int.MaxValue hash code collision bug The original poster of this bug is @guojc, who opened a PR that preceded this one at https://github.com/apache/incubator-spark/pull/612. ExternalAppendOnlyMap uses key hash code to order the buffer streams from which spilled files are read back into memory. When a buffer stream is empty, the default hash code for that stream is equal to Int.MaxValue. This is, however, a perfectly legitimate candidate for a key hash code. When reading from a spilled map containing such a key, a hash collision may occur, in which case we attempt to read from an empty stream and throw NoSuchElementException. The fix is to maintain the invariant that empty buffer streams are never added back to the merge queue to be considered. This guarantees that we never read from an empty buffer stream, ever again. This PR also includes two new tests for hash collisions. Author: Andrew Or <andrewor14@gmail.com> Closes #624 from andrewor14/spilling-bug and squashes the following commits: 9e7263d [Andrew Or] Slightly optimize next() 2037ae2 [Andrew Or] Move a few comments around... cf95942 [Andrew Or] Remove default value of Int.MaxValue for minKeyHash c11f03b [Andrew Or] Fix Int.MaxValue hash collision bug in ExternalAppendOnlyMap 21c1a39 [Andrew Or] Add hash collision tests to ExternalAppendOnlyMapSuite	2014-02-21 20:05:39 -08:00
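The invariant in this fix (empty streams never re-enter the merge queue) can be sketched with a generic k-way merge. This is an illustration under assumptions, not ExternalAppendOnlyMap itself: `merge_by_hash` and the `(hash, key)` item shape are hypothetical, and Python's `heapq` stands in for the merge queue.

```python
import heapq

def merge_by_hash(streams):
    """Yield (hash, key) items from several hash-sorted lists in hash order."""
    heap = []
    for idx, stream in enumerate(streams):
        it = iter(stream)
        first = next(it, None)
        if first is not None:   # invariant: empty streams never enter the queue
            heapq.heappush(heap, (first[0], idx, first, it))
    while heap:
        _, idx, item, it = heapq.heappop(heap)
        yield item
        nxt = next(it, None)
        if nxt is not None:     # re-add only while the stream still has data
            heapq.heappush(heap, (nxt[0], idx, nxt, it))
```

Because an exhausted stream is simply dropped, no sentinel hash such as Int.MaxValue is needed to represent "empty", so a real key with that hash code cannot be confused with it.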
Patrick Wendell	45b15e27a8	SPARK-1111: URL Validation Throws Error for HDFS URL's Fixes an error where HDFS URL's cause an exception. Should be merged into master and 0.9. Author: Patrick Wendell <pwendell@gmail.com> Closes #625 from pwendell/url-validation and squashes the following commits: d14bfe3 [Patrick Wendell] SPARK-1111: URL Validation Throws Error for HDFS URL's	2014-02-21 11:11:55 -08:00
Aaron Davidson	3fede4831e	Super minor: Add require for mergeCombiners in combineByKey We changed the behavior in 0.9.0 from requiring that mergeCombiners be null when mapSideCombine was false to requiring that mergeCombiners never be null, for external sorting. This patch adds a require() to make this behavior change explicitly messaged rather than resulting in a NPE. Author: Aaron Davidson <aaron@databricks.com> Closes #623 from aarondav/master and squashes the following commits: 520b80c [Aaron Davidson] Super minor: Add require for mergeCombiners in combineByKey	2014-02-20 16:46:13 -08:00
NirmalReddy	ccb327a49a	Optimized imports Optimized imports and arranged according to scala style guide @ https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Imports Author: NirmalReddy <nirmal.reddy@imaginea.com> Author: NirmalReddy <nirmal_reddy2000@yahoo.com> Closes #613 from NirmalReddy/opt-imports and squashes the following commits: 578b4f5 [NirmalReddy] imported java.lang.Double as JDouble a2cbcc5 [NirmalReddy] addressed the comments 776d664 [NirmalReddy] Optimized imports in core	2014-02-18 14:44:36 -08:00
Aaron Davidson	f74ae0ebce	SPARK-1098: Minor cleanup of ClassTag usage in Java API Our usage of fake ClassTags in this manner is probably not healthy, but I'm not sure if there's a better solution available, so I just cleaned up and documented the current one. Author: Aaron Davidson <aaron@databricks.com> Closes #604 from aarondav/master and squashes the following commits: b398e89 [Aaron Davidson] SPARK-1098: Minor cleanup of ClassTag usage in Java API	2014-02-17 19:23:27 -08:00
Andrew Ash	c0795cf481	Worker registration logging fix Author: Andrew Ash <andrew@andrewash.com> Closes #608 from ash211/patch-7 and squashes the following commits: bd85f2a [Andrew Ash] Worker registration logging fix	2014-02-17 09:51:55 -08:00
Punya Biswal	5af4477c2b	Add subtractByKey to the JavaPairRDD wrapper Author: Punya Biswal <pbiswal@palantir.com> Closes #600 from punya/subtractByKey-java and squashes the following commits: e961913 [Punya Biswal] Hide implicit ClassTags from Java API c5d317b [Punya Biswal] Add subtractByKey to the JavaPairRDD wrapper	2014-02-16 18:55:59 -08:00
Bijay Bisht	73cfdcfe71	fix for https://spark-project.atlassian.net/browse/SPARK-1052 Author: Bijay Bisht <bijay.bisht@gmail.com> Closes #568 from bijaybisht/SPARK-1052 and squashes the following commits: da70395 [Bijay Bisht] fix for https://spark-project.atlassian.net/browse/SPARK-1052 - comments incorporated fdb1d94 [Bijay Bisht] fix for https://spark-project.atlassian.net/browse/SPARK-1052 (cherry picked from commit `e797c1abd9`) Signed-off-by: Aaron Davidson <aaron@databricks.com>	2014-02-16 16:54:03 -08:00
CodingCat	1cad381387	[SPARK-1092] print warning information if user use SPARK_MEM to regulate executor memory usage https://spark-project.atlassian.net/browse/SPARK-1092?jql=project%20%3D%20SPARK print warning information if user set SPARK_MEM to regulate memory usage of executors ---- OUTDATED: Currently, users will usually set SPARK_MEM to control the memory usage of driver programs, (in spark-class) JAVA_OPTS="$OUR_JAVA_OPTS" JAVA_OPTS="$JAVA_OPTS -Djava.library.path=$SPARK_LIBRARY_PATH" JAVA_OPTS="$JAVA_OPTS -Xms$SPARK_MEM -Xmx$SPARK_MEM" if they didn't set spark.executor.memory, the value in this environment variable will also affect the memory usage of executors, because the following lines in SparkContext private[spark] val executorMemory = conf.getOption("spark.executor.memory") .orElse(Option(System.getenv("SPARK_MEM"))) .map(Utils.memoryStringToMb) .getOrElse(512) also since SPARK_MEM has been (proposed to) deprecated in SPARK-929 (https://spark-project.atlassian.net/browse/SPARK-929) and the corresponding PR (https://github.com/apache/incubator-spark/pull/104) we should remove this line Author: CodingCat <zhunansjtu@gmail.com> Closes #602 from CodingCat/clean_spark_mem and squashes the following commits: 302bb28 [CodingCat] print warning information if user use SPARK_MEM to regulate executor memory usage	2014-02-16 12:25:38 -08:00
Xiangrui Meng	7e29e02791	Merge pull request #591 from mengxr/transient-new. SPARK-1076: [Fix #578] add @transient to some vals I'll try to be more careful next time. Author: Xiangrui Meng <meng@databricks.com> Closes #591 and squashes the following commits: 2b4f044 [Xiangrui Meng] add @transient to prev in ZippedWithIndexRDD add @transient to seed in PartitionwiseSampledRDD	2014-02-12 16:26:25 -08:00
Xiangrui Meng	2bea0709f9	Merge pull request #589 from mengxr/index. SPARK-1076: Convert Int to Long to avoid overflow Patch for PR #578. Author: Xiangrui Meng <meng@databricks.com> Closes #589 and squashes the following commits: 98c435e [Xiangrui Meng] cast Int to Long to avoid Int overflow	2014-02-12 10:47:52 -08:00
Xiangrui Meng	e733d655df	Merge pull request #578 from mengxr/rank. SPARK-1076: zipWithIndex and zipWithUniqueId to RDD Assigning ranks to an ordered or unordered data set is a common operation. This could be done by first counting records in each partition and then assigning ranks in parallel. The purpose of assigning ranks to an unordered set is usually to get a unique id for each item, e.g., to map feature names to feature indices. In such cases, the assignment could be done without counting records, saving one spark job. https://spark-project.atlassian.net/browse/SPARK-1076 == update == Because assigning ranks is very similar to Scala's zipWithIndex, I changed the method name to zipWithIndex and put the index in the value field. Author: Xiangrui Meng <meng@databricks.com> Closes #578 and squashes the following commits: 52a05e1 [Xiangrui Meng] changed assignRanks to zipWithIndex changed assignUniqueIds to zipWithUniqueId minor updates 756881c [Xiangrui Meng] simplified RankedRDD by implementing assignUniqueIds separately moved counting iterator size to Utils do not count items in the last partition and skip counting if there is only one partition 630868c [Xiangrui Meng] newline 21b434b [Xiangrui Meng] add assignRanks and assignUniqueIds to RDD	2014-02-12 00:42:42 -08:00
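The two strategies described in this commit can be sketched on plain Python lists standing in for partitions. This is an illustration, not the RDD implementation: the function names are hypothetical, and the `i * n + k` id scheme is an assumption about how ids can be made unique without counting.

```python
from itertools import accumulate

def zip_with_index(partitions):
    """Pair each item with its global rank; needs per-partition counts first."""
    counts = [len(p) for p in partitions[:-1]]  # the last partition need not be counted
    starts = [0] + list(accumulate(counts))     # starting offset of each partition
    return [[(x, start + i) for i, x in enumerate(p)]
            for p, start in zip(partitions, starts)]

def zip_with_unique_id(partitions):
    """Pair each item with a unique id without counting any partition."""
    n = len(partitions)
    # The i-th item of partition k gets id i * n + k (assumed scheme).
    return [[(x, i * n + k) for i, x in enumerate(p)]
            for k, p in enumerate(partitions)]
```

The first function needs one pass to count records before it can assign offsets, which corresponds to the extra job the commit message mentions; the second trades consecutive ids for skipping that pass.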
Raymond Liu	68b2c0d02d	Merge pull request #583 from colorant/zookeeper. Minor fix for ZooKeeperPersistenceEngine to use configured working dir Author: Raymond Liu <raymond.liu@intel.com> Closes #583 and squashes the following commits: 91b0609 [Raymond Liu] Minor fix for ZooKeeperPersistenceEngine to use configured working dir	2014-02-11 22:39:48 -08:00
Holden Karau	b0dab1bb9f	Merge pull request #571 from holdenk/switchtobinarysearch. SPARK-1072 Use binary search when needed in RangePartitioner Author: Holden Karau <holden@pigscanfly.ca> Closes #571 and squashes the following commits: f31a2e1 [Holden Karau] Switch to using CollectionsUtils in Partitioner 4c7a0c3 [Holden Karau] Add CollectionsUtil as suggested by aarondav 7099962 [Holden Karau] Add the binary search to only init once 1bef01d [Holden Karau] CR feedback a21e097 [Holden Karau] Use binary search if we have more than 1000 elements inside of RangePartitioner	2014-02-11 14:48:59 -08:00
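A rough sketch of the lookup strategy named in this commit: scan linearly for small bound arrays and switch to binary search for large ones. The function name `partition_for` and the use of Python's `bisect` are assumptions for illustration; only the threshold of 1000 comes from the commit message.

```python
import bisect

BINARY_SEARCH_THRESHOLD = 1000  # "more than 1000 elements", per the commit message

def partition_for(key, range_bounds):
    """Return the index of the first bound >= key (len(range_bounds) if none).

    range_bounds must be sorted in ascending order.
    """
    if len(range_bounds) > BINARY_SEARCH_THRESHOLD:
        # O(log n) lookup once the bound array is large.
        return bisect.bisect_left(range_bounds, key)
    # A linear scan is cheap and cache-friendly for small arrays.
    partition = 0
    while partition < len(range_bounds) and key > range_bounds[partition]:
        partition += 1
    return partition
```

Both branches return the same index for sorted bounds, so the threshold only affects speed, not which partition a key lands in.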
Patrick Wendell	d6a9bdc097	Revert "Merge pull request #560 from pwendell/logging. Closes #560." This reverts commit `b6d40b7823`.	2014-02-09 23:35:06 -08:00
Prashant Sharma	919bd7f669	Merge pull request #567 from ScrapCodes/style2. SPARK-1058, Fix Style Errors and Add Scala Style to Spark Build. Pt 2 Continuation of PR #557 With this all scala style errors are fixed across the code base !! The reason for creating a separate PR was to not interrupt an already reviewed and ready to merge PR. Hope this gets reviewed soon and merged too. Author: Prashant Sharma <prashant.s@imaginea.com> Closes #567 and squashes the following commits: 3b1ec30 [Prashant Sharma] scala style fixes	2014-02-09 22:17:52 -08:00
qqsun8819	afc8f3cb9a	Merge pull request #551 from qqsun8819/json-protocol. [SPARK-1038] Add more fields in JsonProtocol and add tests that verify the JSON itself This is a PR for SPARK-1038. Two major changes: 1. Add some fields to JsonProtocol that are new and important to standalone-related data structures. 2. Use Diff in liftweb.json to verify the stringified Json output, to detect when someone changes a type T to Option[T]. Author: qqsun8819 <jin.oyj@alibaba-inc.com> Closes #551 and squashes the following commits: fdf0b4e [qqsun8819] [SPARK-1038] 1. Change code style for more readable according to rxin review 2. change submitdate hard-coded string to a date object toString for more complexiblity 095a26f [qqsun8819] [SPARK-1038] mod according to review of pwendel, use hard-coded json string for json data validation. Each test use its own json string 0524e41 [qqsun8819] Merge remote-tracking branch 'upstream/master' into json-protocol d203d5c [qqsun8819] [SPARK-1038] Add more fields in JsonProtocol and add tests that verify the JSON itself	2014-02-09 13:57:29 -08:00
Patrick Wendell	b69f8b2a01	Merge pull request #557 from ScrapCodes/style. Closes #557 . SPARK-1058, Fix Style Errors and Add Scala Style to Spark Build. Author: Patrick Wendell <pwendell@gmail.com> Author: Prashant Sharma <scrapcodes@gmail.com> == Merge branch commits == commit 1a8bd1c059b842cb95cc246aaea74a79fec684f4 Author: Prashant Sharma <scrapcodes@gmail.com> Date: Sun Feb 9 17:39:07 2014 +0530 scala style fixes commit f91709887a8e0b608c5c2b282db19b8a44d53a43 Author: Patrick Wendell <pwendell@gmail.com> Date: Fri Jan 24 11:22:53 2014 -0800 Adding scalastyle snapshot	2014-02-09 10:09:19 -08:00
CodingCat	b6dba10ae5	Merge pull request #556 from CodingCat/JettyUtil. Closes #556 . [SPARK-1060] startJettyServer should explicitly use IP information https://spark-project.atlassian.net/browse/SPARK-1060 In the current implementation, the webserver in Master/Worker is started with val (srv, bPort) = JettyUtils.startJettyServer("0.0.0.0", port, handlers) inside startJettyServer: val server = new Server(currentPort) //here, the Server will take "0.0.0.0" as the hostname, i.e. will always bind to the IP address of the first NIC this can cause wrong IP binding, e.g. if the host has two NICs, N1 and N2, the user specify the SPARK_LOCAL_IP as the N2's IP address, however, when starting the web server, for the reason stated above, it will always bind to the N1's address Author: CodingCat <zhunansjtu@gmail.com> == Merge branch commits == commit 6c6d9a8ccc9ec4590678a3b34cb03df19092029d Author: CodingCat <zhunansjtu@gmail.com> Date: Thu Feb 6 14:53:34 2014 -0500 startJettyServer should explicitly use IP information	2014-02-08 23:39:17 -08:00
Patrick Wendell	b6d40b7823	Merge pull request #560 from pwendell/logging. Closes #560 . [WIP] SPARK-1067: Default log4j initialization causes errors for those not using log4j To fix this - we add a check when initializing log4j. Author: Patrick Wendell <pwendell@gmail.com> == Merge branch commits == commit ffdce513877f64b6eed6d36138c3e0003d392889 Author: Patrick Wendell <pwendell@gmail.com> Date: Fri Feb 7 15:22:29 2014 -0800 Logging fix	2014-02-08 23:35:31 -08:00
Qiuzhuang Lian	f0ce736fad	Merge pull request #561 from Qiuzhuang/master. Closes #561 . Kill drivers in postStop() for Worker. JIRA SPARK-1068:https://spark-project.atlassian.net/browse/SPARK-1068 Author: Qiuzhuang Lian <Qiuzhuang.Lian@gmail.com> == Merge branch commits == commit 9c19ce63637eee9369edd235979288d3d9fc9105 Author: Qiuzhuang Lian <Qiuzhuang.Lian@gmail.com> Date: Sat Feb 8 16:07:39 2014 +0800 Kill drivers in postStop() for Worker. JIRA SPARK-1068:https://spark-project.atlassian.net/browse/SPARK-1068	2014-02-08 12:59:48 -08:00
Andrew Ash	3a9d82cc9e	Merge pull request #506 from ash211/intersection. Closes #506 . SPARK-1062 Add rdd.intersection(otherRdd) method Author: Andrew Ash <andrew@andrewash.com> == Merge branch commits == commit 5d9982b171b9572649e9828f37ef0b43f0242912 Author: Andrew Ash <andrew@andrewash.com> Date: Thu Feb 6 18:11:45 2014 -0800 Minor fixes - style: (v,null) => (v, null) - mention the shuffle in Javadoc commit b86d02f14e810902719cef893cf6bfa18ff9acb0 Author: Andrew Ash <andrew@andrewash.com> Date: Sun Feb 2 13:17:40 2014 -0800 Overload .intersection() for numPartitions and custom Partitioner commit bcaa34911fcc6bb5bc5e4f9fe46d1df73cb71c09 Author: Andrew Ash <andrew@andrewash.com> Date: Sun Feb 2 13:05:40 2014 -0800 Better naming of parameters in intersection's filter commit b10a6af2d793ec6e9a06c798007fac3f6b860d89 Author: Andrew Ash <andrew@andrewash.com> Date: Sat Jan 25 23:06:26 2014 -0800 Follow spark code format conventions of tab => 2 spaces commit 965256e4304cca514bb36a1a36087711dec535ec Author: Andrew Ash <andrew@andrewash.com> Date: Fri Jan 24 00:28:01 2014 -0800 Add rdd.intersection(otherRdd) method	2014-02-06 22:39:08 -08:00
Andrew Or	1896c6e7c9	Merge pull request #533 from andrewor14/master. Closes #533 . External spilling - generalize batching logic The existing implementation consists of a hack for Kryo specifically and only works for LZF compression. Introducing an intermediate batch-level stream takes care of pre-fetching and other arbitrary behavior of higher level streams in a more general way. Author: Andrew Or <andrewor14@gmail.com> == Merge branch commits == commit 3ddeb7ef89a0af2b685fb5d071aa0f71c975cc82 Author: Andrew Or <andrewor14@gmail.com> Date: Wed Feb 5 12:09:32 2014 -0800 Also privatize fields commit 090544a87a0767effd0c835a53952f72fc8d24f0 Author: Andrew Or <andrewor14@gmail.com> Date: Wed Feb 5 10:58:23 2014 -0800 Privatize methods commit 13920c918efe22e66a1760b14beceb17a61fd8cc Author: Andrew Or <andrewor14@gmail.com> Date: Tue Feb 4 16:34:15 2014 -0800 Update docs commit bd5a1d7350467ed3dc19c2de9b2c9f531f0e6aa3 Author: Andrew Or <andrewor14@gmail.com> Date: Tue Feb 4 13:44:24 2014 -0800 Typo: phyiscal -> physical commit 287ef44e593ad72f7434b759be3170d9ee2723d2 Author: Andrew Or <andrewor14@gmail.com> Date: Tue Feb 4 13:38:32 2014 -0800 Avoid reading the entire batch into memory; also simplify streaming logic Additionally, address formatting comments. commit 3df700509955f7074821e9aab1e74cb53c58b5a5 Merge: a531d2e 164489d Author: Andrew Or <andrewor14@gmail.com> Date: Mon Feb 3 18:27:49 2014 -0800 Merge branch 'master' of github.com:andrewor14/incubator-spark commit a531d2e347acdcecf2d0ab72cd4f965ab5e145d8 Author: Andrew Or <andrewor14@gmail.com> Date: Mon Feb 3 18:18:04 2014 -0800 Relax assumptions on compressors and serializers when batching This commit introduces an intermediate layer of an input stream on the batch level. This guards against interference from higher level streams (i.e. compression and deserialization streams), especially pre-fetching, without specifically targeting particular libraries (Kryo) and forcing shuffle spill compression to use LZF. 
commit 164489d6f176bdecfa9dabec2dfce5504d1ee8af Author: Andrew Or <andrewor14@gmail.com> Date: Mon Feb 3 18:18:04 2014 -0800 Relax assumptions on compressors and serializers when batching This commit introduces an intermediate layer of an input stream on the batch level. This guards against interference from higher level streams (i.e. compression and deserialization streams), especially pre-fetching, without specifically targeting particular libraries (Kryo) and forcing shuffle spill compression to use LZF.	2014-02-06 22:05:53 -08:00
Kay Ousterhout	0b448df6ac	Merge pull request #450 from kayousterhout/fetch_failures. Closes #450 . Only run ResubmitFailedStages event after a fetch fails Previously, the ResubmitFailedStages event was called every 200 milliseconds, leading to a lot of unnecessary event processing and clogged DAGScheduler logs. Author: Kay Ousterhout <kayousterhout@gmail.com> == Merge branch commits == commit e603784b3a562980e6f1863845097effe2129d3b Author: Kay Ousterhout <kayousterhout@gmail.com> Date: Wed Feb 5 11:34:41 2014 -0800 Re-add check for empty set of failed stages commit d258f0ef50caff4bbb19fb95a6b82186db1935bf Author: Kay Ousterhout <kayousterhout@gmail.com> Date: Wed Jan 15 23:35:41 2014 -0800 Only run ResubmitFailedStages event after a fetch fails Previously, the ResubmitFailedStages event was called every 200 milliseconds, leading to a lot of unnecessary event processing and clogged DAGScheduler logs.	2014-02-06 16:15:24 -08:00
Kay Ousterhout	18ad59e2c6	Merge pull request #321 from kayousterhout/ui_kill_fix. Closes #321 . Inform DAG scheduler about all started/finished tasks. Previously, the DAG scheduler was not always informed when tasks started and finished. The simplest example here is for speculated tasks: the DAGScheduler was only told about the first attempt of a task, meaning that SparkListeners were also not told about multiple task attempts, so users can't see what's going on with speculation in the UI. The DAGScheduler also wasn't always told about finished tasks, so in the UI, some tasks will never be shown as finished (this occurs, for example, if a task set gets killed). The other problem is that the fairness accounting was wrong -- the number of running tasks in a pool was decreased when a task set was considered done, even if all of its tasks hadn't yet finished. Author: Kay Ousterhout <kayousterhout@gmail.com> == Merge branch commits == commit c8d547d0f7a17f5a193bef05f5872b9f475675c5 Author: Kay Ousterhout <kayousterhout@gmail.com> Date: Wed Jan 15 16:47:33 2014 -0800 Addressed Reynold's review comments. Always use a TaskEndReason (remove the option), and explicitly signal when we don't know the reason. Also, always tell DAGScheduler (and associated listeners) about started tasks, even when they're speculated. commit 3fee1e2e3c06b975ff7f95d595448f38cce97a04 Author: Kay Ousterhout <kayousterhout@gmail.com> Date: Wed Jan 8 22:58:13 2014 -0800 Fixed broken test and improved logging commit ff12fcaa2567c5d02b75a1d5db35687225bcd46f Author: Kay Ousterhout <kayousterhout@gmail.com> Date: Sun Dec 29 21:08:20 2013 -0800 Inform DAG scheduler about all finished tasks. Previously, the DAG scheduler was not always informed when tasks finished. For example, when a task set was aborted, the DAG scheduler was never told when the tasks in that task set finished. The DAG scheduler was also never told about the completion of speculated tasks. 
This led to confusion with SparkListeners because information about the completion of those tasks was never passed on to the listeners (so in the UI, for example, some tasks will never be shown as finished). The other problem is that the fairness accounting was wrong -- the number of running tasks in a pool was decreased when a task set was considered done, even if all of its tasks hadn't yet finished.	2014-02-06 16:10:48 -08:00
Sandy Ryza	446403b637	Merge pull request #554 from sryza/sandy-spark-1056. Closes #554 . SPARK-1056. Fix header comment in Executor to not imply that it's only u... ...sed for Mesos and Standalone. Author: Sandy Ryza <sandy@cloudera.com> == Merge branch commits == commit 1f2443d902a26365a5c23e4af9077e1539ed2eab Author: Sandy Ryza <sandy@cloudera.com> Date: Thu Feb 6 15:03:50 2014 -0800 SPARK-1056. Fix header comment in Executor to not imply that it's only used for Mesos and Standalone	2014-02-06 15:41:16 -08:00
Kay Ousterhout	79c95527a7	Merge pull request #545 from kayousterhout/fix_progress. Closes #545 . Fix off-by-one error with task progress info log. Author: Kay Ousterhout <kayousterhout@gmail.com> == Merge branch commits == commit 29798fc685c4e7e3eb3bf91c75df7fa8ec94a235 Author: Kay Ousterhout <kayousterhout@gmail.com> Date: Wed Feb 5 13:40:01 2014 -0800 Fix off-by-one error with task progress info log.	2014-02-05 23:38:12 -08:00
CodingCat	18c4ee71e2	Merge pull request #549 from CodingCat/deadcode_master. Closes #549 . remove actorToWorker in master.scala, which is actually not used actorToWorker is actually not used in the code....just remove it Author: CodingCat <zhunansjtu@gmail.com> == Merge branch commits == commit 52656c2d4bbf9abcd8bef65d454badb9cb14a32c Author: CodingCat <zhunansjtu@gmail.com> Date: Thu Feb 6 00:28:26 2014 -0500 remove actorToWorker in master.scala, which is actually not used	2014-02-05 22:08:47 -08:00
Stevo Slavić	0c05cd374d	Merge pull request #535 from sslavic/patch-2. Closes #535 . Fixed typo in scaladoc Author: Stevo Slavić <sslavic@gmail.com> == Merge branch commits == commit 0a77f789e281930f4168543cc0d3b3ffbf5b3764 Author: Stevo Slavić <sslavic@gmail.com> Date: Tue Feb 4 15:30:27 2014 +0100 Fixed typo in scaladoc	2014-02-04 09:45:46 -08:00
Xiangrui Meng	23af00f9e0	Merge pull request #528 from mengxr/sample. Closes #528 . Refactor RDD sampling and add randomSplit to RDD (update) Replace SampledRDD by PartitionwiseSampledRDD, which accepts a RandomSampler instance as input. The current sample with/without replacement can be easily integrated via BernoulliSampler and PoissonSampler. The benefits are: 1) RDD.randomSplit is implemented in the same way, related to https://github.com/apache/incubator-spark/pull/513 2) Stratified sampling and importance sampling can be implemented in the same manner as well. Unit tests are included for samplers and RDD.randomSplit. This should performance better than my previous request where the BernoulliSampler creates many Iterator instances: https://github.com/apache/incubator-spark/pull/513 Author: Xiangrui Meng <meng@databricks.com> == Merge branch commits == commit e8ce957e5f0a600f2dec057924f4a2ca6adba373 Author: Xiangrui Meng <meng@databricks.com> Date: Mon Feb 3 12:21:08 2014 -0800 more docs to PartitionwiseSampledRDD commit fbb4586d0478ff638b24bce95f75ff06f713d43b Author: Xiangrui Meng <meng@databricks.com> Date: Mon Feb 3 00:44:23 2014 -0800 move XORShiftRandom to util.random and use it in BernoulliSampler commit 987456b0ee8612fd4f73cb8c40967112dc3c4c2d Author: Xiangrui Meng <meng@databricks.com> Date: Sat Feb 1 11:06:59 2014 -0800 relax assertions in SortingSuite because the RangePartitioner has large variance in this case commit 3690aae416b2dc9b2f9ba32efa465ba7948477f4 Author: Xiangrui Meng <meng@databricks.com> Date: Sat Feb 1 09:56:28 2014 -0800 test split ratio of RDD.randomSplit commit 8a410bc933a60c4d63852606f8bbc812e416d6ae Author: Xiangrui Meng <meng@databricks.com> Date: Sat Feb 1 09:25:22 2014 -0800 add a test to ensure seed distribution and minor style update commit ce7e866f674c30ab48a9ceb09da846d5362ab4b6 Author: Xiangrui Meng <meng@databricks.com> Date: Fri Jan 31 18:06:22 2014 -0800 minor style change commit 750912b4d77596ed807d361347bd2b7e3b9b7a74 
Author: Xiangrui Meng <meng@databricks.com> Date: Fri Jan 31 18:04:54 2014 -0800 fix some long lines commit c446a25c38d81db02821f7f194b0ce5ab4ed7ff5 Author: Xiangrui Meng <meng@databricks.com> Date: Fri Jan 31 17:59:59 2014 -0800 add complement to BernoulliSampler and minor style changes commit dbe2bc2bd888a7bdccb127ee6595840274499403 Author: Xiangrui Meng <meng@databricks.com> Date: Fri Jan 31 17:45:08 2014 -0800 switch to partition-wise sampling for better performance commit a1fca5232308feb369339eac67864c787455bb23 Merge: `ac712e4` cf6128f Author: Xiangrui Meng <meng@databricks.com> Date: Fri Jan 31 16:33:09 2014 -0800 Merge branch 'sample' of github.com:mengxr/incubator-spark into sample commit cf6128fb672e8c589615adbd3eaa3cbdb72bd461 Author: Xiangrui Meng <meng@databricks.com> Date: Sun Jan 26 14:40:07 2014 -0800 set SampledRDD deprecated in 1.0 commit f430f847c3df91a3894687c513f23f823f77c255 Author: Xiangrui Meng <meng@databricks.com> Date: Sun Jan 26 14:38:59 2014 -0800 update code style commit a8b5e2021a9204e318c80a44d00c5c495f1befb6 Author: Xiangrui Meng <meng@databricks.com> Date: Sun Jan 26 12:56:27 2014 -0800 move package random to util.random commit ab0fa2c4965033737a9e3a9bf0a59cbb0df6a6f5 Author: Xiangrui Meng <meng@databricks.com> Date: Sun Jan 26 12:50:35 2014 -0800 add Apache headers and update code style commit 985609fe1a55655ad11966e05a93c18c138a403d Author: Xiangrui Meng <meng@databricks.com> Date: Sun Jan 26 11:49:25 2014 -0800 add new lines commit b21bddf29850a2c006a868869b8f91960a029322 Author: Xiangrui Meng <meng@databricks.com> Date: Sun Jan 26 11:46:35 2014 -0800 move samplers to random.IndependentRandomSampler and add tests commit c02dacb4a941618e434cefc129c002915db08be6 Author: Xiangrui Meng <meng@databricks.com> Date: Sat Jan 25 15:20:24 2014 -0800 add RandomSampler commit 8ff7ba3c5cf1fc338c29ae8b5fa06c222640e89c Author: Xiangrui Meng <meng@databricks.com> Date: Fri Jan 24 13:23:22 2014 -0800 init impl of IndependentlySampledRDD	
2014-02-03 13:02:09 -08:00
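The message above says `RDD.randomSplit` is built on the same sampler abstraction as Bernoulli sampling. A minimal sketch of that idea, assuming a range-based sampler: every split replays the same seeded random sequence and keeps the elements whose draw falls inside its own sub-range of [0, 1). `RangeSampler` and `RandomSplitSketch` are hypothetical names for illustration, not the Spark classes, and the sketch works on a local `Seq` rather than on RDD partitions.

```scala
import scala.util.Random

// Hypothetical, simplified sampler (not Spark's BernoulliSampler):
// keeps an item when its random draw x satisfies lb <= x < ub.
final class RangeSampler(lb: Double, ub: Double, seed: Long) {
  def sample[T](items: Iterator[T]): Iterator[T] = {
    val rng = new Random(seed) // same seed => same draw sequence for every split
    items.filter { _ =>
      val x = rng.nextDouble()
      x >= lb && x < ub
    }
  }
}

object RandomSplitSketch {
  // Splits data into weights.length pieces; assumes non-negative weights with a positive sum.
  def randomSplit[T](data: Seq[T], weights: Seq[Double], seed: Long): Seq[Seq[T]] = {
    val total = weights.sum
    val cumulative = weights.map(_ / total).scanLeft(0.0)(_ + _)
    // Pin the last bound to 1.0 so rounding error cannot leave a gap at the top.
    val bounds = cumulative.init :+ 1.0
    bounds.sliding(2).map { case Seq(lb, ub) =>
      new RangeSampler(lb, ub, seed).sample(data.iterator).toVector
    }.toVector
  }

  def main(args: Array[String]): Unit = {
    val splits = randomSplit(1 to 100, Seq(0.7, 0.3), seed = 42L)
    println(splits.map(_.size))
  }
}
```

Because each element receives the same draw in every pass, it lands in exactly one split, so the splits are disjoint and together cover the input.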
Aaron Davidson	1625d8c446	Merge pull request #530 from aarondav/cleanup. Closes #530 . Remove explicit conversion to PairRDDFunctions in cogroup() As SparkContext._ is already imported, using the implicit conversion appears to make the code much cleaner. Perhaps there was some sinister reason for doing the conversion explicitly, however. Author: Aaron Davidson <aaron@databricks.com> == Merge branch commits == commit aa4a63f1bfd5b5178fe67364dd7ce4d84c357996 Author: Aaron Davidson <aaron@databricks.com> Date: Sun Feb 2 23:48:04 2014 -0800 Remove explicit conversion to PairRDDFunctions in cogroup() As SparkContext._ is already imported, using the implicit conversion appears to make the code much cleaner. Perhaps there was some sinister reason for doing the conversion explicitly, however.	2014-02-03 11:25:39 -08:00
Erik Selin	0ff38c2220	Merge pull request #494 from tyro89/worker_registration_issue Issue with failed worker registrations I've been going through the spark source after having some odd issues with workers dying and not coming back. After some digging (I'm very new to scala and spark) I believe I've found a worker registration issue. It looks to me like a failed registration follows the same code path as a successful registration, which ends up with workers believing they are connected (since they received a `RegisteredWorker` event) even though they are not registered on the Master. This is a quick fix that I hope addresses this issue (assuming I didn't completely misread the code and I'm about to look like a silly person :P) I'm opening this pr now to start a chat with you guys while I do some more testing on my side :) Author: Erik Selin <erik.selin@jadedpixel.com> == Merge branch commits == commit 973012f8a2dcf1ac1e68a69a2086a1b9a50f401b Author: Erik Selin <erik.selin@jadedpixel.com> Date: Tue Jan 28 23:36:12 2014 -0500 break logwarning into two lines to respect line character limit. commit e3754dc5b94730f37e9806974340e6dd93400f85 Author: Erik Selin <erik.selin@jadedpixel.com> Date: Tue Jan 28 21:16:21 2014 -0500 add log warning when worker registration fails due to attempt to re-register on same address. commit 14baca241fa7823e1213cfc12a3ff2a9b865b1ed Author: Erik Selin <erik.selin@jadedpixel.com> Date: Wed Jan 22 21:23:26 2014 -0500 address code style comment commit 71c0d7e6f59cd378d4e24994c21140ab893954ee Author: Erik Selin <erik.selin@jadedpixel.com> Date: Wed Jan 22 16:01:42 2014 -0500 Make a failed registration not persist, not send a `RegisteredWorker` event and not run `schedule` but rather send a `RegisterWorkerFailed` message to the worker attempting to register.	2014-01-29 12:44:54 -08:00
Josh Rosen	1381fc72f7	Switch from MUTF8 to UTF8 in PySpark serializers. This fixes SPARK-1043, a bug introduced in 0.9.0 where PySpark couldn't serialize strings > 64kB. This fix was written by @tyro89 and @bouk in #512. This commit squashes and rebases their pull request in order to fix some merge conflicts.	2014-01-28 20:20:08 -08:00
Reynold Xin	84670f2715	Merge pull request #466 from liyinan926/file-overwrite-new Allow files added through SparkContext.addFile() to be overwritten This is useful for the cases when a file needs to be refreshed and downloaded by the executors periodically. For example, a possible use case is: the driver periodically renews a Hadoop delegation token and writes it to a token file. The token file needs to be downloaded by the executors whenever it gets renewed. However, the current implementation throws an exception when the target file exists and its contents do not match those of the new source. This PR adds an option to allow files to be overwritten to support use cases similar to the above.	2014-01-27 17:08:35 -08:00
Reynold Xin	f16c21e22f	Merge pull request #490 from hsaputra/modify_checkoption_with_isdefined Replace the check for None Option with isDefined and isEmpty in Scala code Propose to replace the Scala check for Option "!= None" with Option.isDefined and "=== None" with Option.isEmpty. I think using a method call where possible, rather than an operator function plus argument, will make the Scala code easier to read and understand. Pass compile and tests.	2014-01-27 14:24:06 -08:00
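As a small illustration of the replacement proposed above, the sketch below checks that `isDefined` and `isEmpty` agree with the comparison style they replace. `OptionChecks` and `describe` are hypothetical names, not code from the pull request.

```scala
object OptionChecks {
  // Hypothetical helper for illustration; not taken from the Spark code base.
  def describe(home: Option[String]): String =
    if (home.isDefined) s"set to ${home.get}" else "not set"

  def main(args: Array[String]): Unit = {
    val some: Option[String] = Some("/opt/spark")
    val none: Option[String] = None

    // The comparison style the PR moves away from gives the same answers.
    assert((some != None) == some.isDefined)
    assert((none == None) == none.isEmpty)

    println(describe(some))
    println(describe(none))
  }
}
```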
Reynold Xin	c40619d487	Merge pull request #504 from JoshRosen/SPARK-1025 Fix PySpark hang when input files are deleted (SPARK-1025) This pull request addresses [SPARK-1025](https://spark-project.atlassian.net/browse/SPARK-1025), an issue where PySpark could hang if its input files were deleted.	2014-01-25 22:41:30 -08:00
Josh Rosen	740e865f40	Fix ClassCastException in JavaPairRDD.collectAsMap() (SPARK-1040) This fixes an issue where collectAsMap() could fail when called on a JavaPairRDD that was derived by transforming a non-JavaPairRDD. The root problem was that we were creating the JavaPairRDD's ClassTag by casting a ClassTag[AnyRef] to a ClassTag[Tuple2[K2, V2]]. To fix this, I cast a ClassTag[Tuple2[_, _]] instead, since this actually produces a ClassTag of the appropriate type because ClassTags don't capture type parameters: scala> implicitly[ClassTag[Tuple2[_, _]]] == implicitly[ClassTag[Tuple2[Int, Int]]] res8: Boolean = true scala> implicitly[ClassTag[AnyRef]].asInstanceOf[ClassTag[Tuple2[Int, Int]]] == implicitly[ClassTag[Tuple2[Int, Int]]] res9: Boolean = false	2014-01-25 16:41:12 -08:00
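The REPL session quoted above can be reproduced outside Spark. The sketch below is a standalone illustration, assuming only the Scala standard library: a `ClassTag` records the erased runtime class, so a tag for `Tuple2[_, _]` equals a tag for `Tuple2[Int, Int]`, while a cast `ClassTag[AnyRef]` still reports `Object`. The object name `ClassTagErasure` is hypothetical.

```scala
import scala.reflect.{ClassTag, classTag}

object ClassTagErasure {
  def main(args: Array[String]): Unit = {
    val wildcard = classTag[Tuple2[_, _]]
    val intPair  = classTag[Tuple2[Int, Int]]

    // ClassTags compare by runtime class, and type parameters are erased.
    assert(wildcard == intPair)
    assert(wildcard.runtimeClass == classOf[Tuple2[_, _]])

    // asInstanceOf changes only the static type; the tag still describes Object.
    val fromAnyRef = classTag[AnyRef].asInstanceOf[ClassTag[Tuple2[Int, Int]]]
    assert(fromAnyRef != intPair)
    assert(fromAnyRef.runtimeClass == classOf[Object])

    println(s"${wildcard.runtimeClass} vs ${fromAnyRef.runtimeClass}")
  }
}
```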
Patrick Wendell	3d6e754193	Merge pull request #503 from pwendell/master Fix bug on read-side of external sort when using Snappy. This case wasn't handled correctly and this patch fixes it.	2014-01-23 19:47:00 -08:00
Patrick Wendell	ff44732171	Minor fix	2014-01-23 19:23:12 -08:00
Patrick Wendell	c3196171f3	Merge pull request #502 from pwendell/clone-1 Remove Hadoop object cloning and warn users making Hadoop RDD's. The code introduced in #359 used Hadoop's WritableUtils.clone() to duplicate objects when reading from Hadoop files. Some users have reported exceptions when cloning data in various file formats, including Avro and another custom format. This patch removes that functionality to ensure stability for the 0.9 release. Instead, it puts a clear warning in the documentation that copying may be necessary for Hadoop data sets.	2014-01-23 19:11:59 -08:00
Patrick Wendell	cad3002fea	Merge pull request #501 from JoshRosen/cartesian-rdd-fixes Fix two bugs in PySpark cartesian(): SPARK-978 and SPARK-1034 This pull request fixes two bugs in PySpark's `cartesian()` method: - [SPARK-978](https://spark-project.atlassian.net/browse/SPARK-978): PySpark's cartesian method throws ClassCastException exception - [SPARK-1034](https://spark-project.atlassian.net/browse/SPARK-1034): Py4JException on PySpark Cartesian Result The JIRAs have more details describing the fixes.	2014-01-23 19:08:34 -08:00
Patrick Wendell	268ecbd231	Minor changes after auditing diff from earlier version	2014-01-23 18:30:11 -08:00
Josh Rosen	f83068497b	Fix for SPARK-1025: PySpark hang on missing files.	2014-01-23 18:24:51 -08:00
Patrick Wendell	c58d4ea3d4	Response to Matei's review	2014-01-23 18:12:40 -08:00
Patrick Wendell	0213b4032a	Fix bug on read-side of external sort when using Snappy. This case wasn't handled correctly and this patch fixes it.	2014-01-23 18:04:55 -08:00
Patrick Wendell	7101017803	Remove Hadoop object cloning and warn users making Hadoop RDD's. The code introduced in #359 used Hadoop's WritableUtils.clone() to duplicate objects when reading from Hadoop files. Some users have reported exceptions when cloning data in various file formats, including Avro and another custom format. This patch removes that functionality to ensure stability for the 0.9 release. Instead, it puts a clear warning in the documentation that copying may be necessary for Hadoop data sets.	2014-01-23 17:39:23 -08:00
Josh Rosen	61569906cc	Fix SPARK-978: ClassCastException in PySpark cartesian.	2014-01-23 15:09:19 -08:00
Josh Rosen	0035dbbc81	Fix SPARK-1034: Py4JException on PySpark Cartesian Result	2014-01-23 13:05:59 -08:00
Josh Rosen	fad6aacfb0	Merge pull request #406 from eklavya/master Extending Java API coverage Hi, I have added three new methods to JavaRDD. Please review and merge.	2014-01-23 11:14:15 -08:00
eklavya	60e7457266	fixed ClassTag in mapPartitions	2014-01-23 17:40:36 +05:30
Patrick Wendell	a1cd185122	Merge pull request #496 from pwendell/master Fix bug in worker clean-up in UI Introduced in `d5a96fec` (/cc @aarondav). This should be picked into 0.8 and 0.9 as well. The bug causes old (zombie) workers on a node to not disappear immediately from the UI when a new one registers.	2014-01-22 19:37:29 -08:00
Patrick Wendell	034dce2a7e	Merge pull request #447 from CodingCat/SPARK-1027 fix for SPARK-1027 fix for SPARK-1027 (https://spark-project.atlassian.net/browse/SPARK-1027) FIXES 1. change sparkhome from String to Option(String) in ApplicationDesc 2. remove sparkhome parameter in LaunchExecutor message 3. adjust involved files	2014-01-22 18:58:02 -08:00
Patrick Wendell	6285513147	Fix bug in worker clean-up in UI Introduced in `d5a96fec`. This should be picked into 0.8 and 0.9 as well.	2014-01-22 18:19:52 -08:00
CodingCat	2b3c461451	refactor sparkHome to val clean code	2014-01-22 20:20:46 -05:00
Kay Ousterhout	19da82c50f	Fixed bug where task set managers are added to queue twice This bug leads to a small performance hit because task set managers will get offered each rejected resource offer twice, but doesn't lead to any incorrect functionality.	2014-01-22 09:52:12 -08:00
Henry Saputra	90ea9d5a8f	Replace the code to check for Option != None with Option.isDefined call in Scala code. This hopefully will make the code cleaner.	2014-01-21 23:22:10 -08:00
Patrick Wendell	a9bcc980b6	Style clean-up	2014-01-21 00:05:28 -08:00
Patrick Wendell	a917a87e02	Adding small code comment	2014-01-20 23:11:45 -08:00
Patrick Wendell	d46df96de3	Avoid matching attempt files in the checkpoint	2014-01-20 20:03:23 -08:00
Patrick Wendell	de526ad527	Remove shuffle files if they are still present on a machine.	2014-01-20 19:11:22 -08:00
Patrick Wendell	f84400e86c	Fixing speculation bug	2014-01-20 19:05:03 -08:00
Patrick Wendell	c324ac10ee	Force use of LZF when spilling data	2014-01-20 19:00:48 -08:00
Patrick Wendell	1b299142a8	Bug fix for reporting of spill output	2014-01-20 18:34:00 -08:00
Patrick Wendell	54867e9566	Minor fixes	2014-01-20 18:33:21 -08:00
Patrick Wendell	cdb003e376	Removing docs on akka options	2014-01-20 16:40:58 -08:00
CodingCat	29f4b6a2d9	fix for SPARK-1027 change TestClient & Worker to Some("xxx") kill manager if it is started remove unnecessary .get when fetch "SPARK_HOME" values	2014-01-20 02:50:30 -05:00
CodingCat	f9a95d6736	executor creation failed should not make the worker restart	2014-01-20 02:50:30 -05:00
Thomas Graves	dd56b2125e	update comment	2014-01-19 12:21:39 -06:00
Thomas Graves	ceb79a3931	Only log error on missing jar to allow spark examples to jar.	2014-01-19 12:16:58 -06:00
Yinan Li	584323c6b1	Addressed comments from Reynold Signed-off-by: Yinan Li <liyinan926@gmail.com>	2014-01-18 21:28:17 -08:00
Patrick Wendell	73dfd42fba	Merge pull request #437 from mridulm/master Minor api usability changes - Expose checkpoint directory - since it is autogenerated now - null check for jars - Expose SparkHadoopUtil : so that configuration creation is abstracted even from user code to avoid duplication of functionality already in spark.	2014-01-18 16:23:56 -08:00
Patrick Wendell	bf5699543b	Merge pull request #462 from mateiz/conf-file-fix Remove Typesafe Config usage and conf files to fix nested property names With Typesafe Config we had the subtle problem of no longer allowing nested property names, which are used for a few of our properties: http://apache-spark-developers-list.1001551.n3.nabble.com/Config-properties-broken-in-master-td208.html This PR is for branch 0.9 but should be added into master too. (cherry picked from commit `34e911ce9a`) Signed-off-by: Patrick Wendell <pwendell@gmail.com>	2014-01-18 16:20:00 -08:00
Yinan Li	fd833e7ab1	Allow files added through SparkContext.addFile() to be overwritten This is useful for the cases when a file needs to be refreshed and downloaded by the executors periodically. Signed-off-by: Yinan Li <liyinan926@gmail.com>	2014-01-18 15:26:59 -08:00
Patrick Wendell	5316bcac3c	Use renamed shuffle spill config in CoGroupedRDD.scala	2014-01-18 11:58:42 -08:00
Mridul Muralidharan	b690e11d9c	Address review comment	2014-01-17 18:28:55 +05:30
Patrick Wendell	d4fd89e3c8	Merge pull request #438 from ScrapCodes/clone-records-java-api Clone records java api	2014-01-16 23:17:30 -08:00
Prashant Sharma	fcb4fc653d	adding clone records field to equivalent java apis	2014-01-17 11:16:03 +05:30
Mridul Muralidharan	edd82c58a2	Use method, not variable	2014-01-16 17:26:42 +05:30
Mridul Muralidharan	1a0da89277	Address review comments	2014-01-16 17:23:25 +05:30
Reynold Xin	c06a307ca2	Merge pull request #445 from kayousterhout/exec_lost Fail rather than hanging if a task crashes the JVM. Prior to this commit, if a task crashes the JVM, the task (and all other tasks running on that executor) is marked as KILLED rather than FAILED. As a result, the TaskSetManager will retry the task indefinitely rather than failing the job after maxFailures. Eventually, this makes the job hang, because the Standalone Scheduler removes the application after 10 works have failed, and then the app is left in a state where it's disconnected from the master and waiting to reconnect. This commit fixes that problem by marking tasks as FAILED rather than killed when an executor is lost. The downside of this commit is that if task A fails because another task running on the same executor caused the VM to crash, the failure will incorrectly be counted as a failure of task A. This should not be an issue because we typically set maxFailures to 3, and it is unlikely that a task will be co-located with a JVM-crashing task multiple times.	2014-01-15 23:47:25 -08:00
Kay Ousterhout	a268d63411	Fail rather than hanging if a task crashes the JVM. Prior to this commit, if a task crashes the JVM, the task (and all other tasks running on that executor) is marked as KILLED rather than FAILED. As a result, the TaskSetManager will retry the task indefinitely rather than failing the job after maxFailures. This commit fixes that problem by marking tasks as FAILED rather than killed when an executor is lost. The downside of this commit is that if task A fails because another task running on the same executor caused the VM to crash, the failure will incorrectly be counted as a failure of task A. This should not be an issue because we typically set maxFailures to 3, and it is unlikely that a task will be co-located with a JVM-crashing task multiple times.	2014-01-15 16:03:40 -08:00
Patrick Wendell	59f475c79f	Merge pull request #442 from pwendell/standalone Workers should use working directory as spark home if it's not specified If users don't set SPARK_HOME in their environment file when launching an application, the standalone cluster should default to the spark home of the worker.	2014-01-15 13:55:14 -08:00
Patrick Wendell	00a3f7eec5	Workers should use working directory as spark home if it's not specified	2014-01-15 11:05:36 -08:00
Mridul Muralidharan	0aea33d39e	Expose method and class - so that we can use it from user code (particularly since checkpoint directory is autogenerated now)	2014-01-15 12:44:44 +05:30
Tathagata Das	0e15bd7827	Merge remote-tracking branch 'apache/master' into filestream-fix	2014-01-14 22:21:20 -08:00
Tathagata Das	1f4718c480	Changed SparkConf to not be serializable. And also fixed unit-test log paths in log4j.properties of external modules.	2014-01-14 22:20:14 -08:00
Reynold Xin	74b46acdc5	Merge pull request #428 from pwendell/writeable-objects Don't clone records for text files	2014-01-14 14:59:13 -08:00
Reynold Xin	d601a76d1f	Merge pull request #427 from pwendell/deprecate-aggregator Deprecate rather than remove old combineValuesByKey function	2014-01-14 14:52:24 -08:00
Patrick Wendell	b1b22b7a13	Style fix	2014-01-14 13:56:27 -08:00
Patrick Wendell	8ea2cd56e4	Adding fix covering combineCombinersByKey as well	2014-01-14 13:52:23 -08:00
Patrick Wendell	b683608c9f	Deprecate rather than remove old combineValuesByKey function	2014-01-14 12:15:10 -08:00
Patrick Wendell	6f965a46a9	Don't clone records for text files	2014-01-14 11:57:53 -08:00
Reynold Xin	f12e506c9e	Fixed a typo in JavaSparkContext's API doc.	2014-01-14 11:42:28 -08:00
Reynold Xin	1b5623fd0b	Maintain Serializable API compatibility by reverting back to java.io.Serializable for Broadcast and Accumulator.	2014-01-14 11:30:59 -08:00
Reynold Xin	55db77416b	Added license header for package.scala in the Java API package.	2014-01-14 11:20:12 -08:00
Reynold Xin	f8c12e9457	Added package doc for the Java API.	2014-01-14 11:16:25 -08:00
Reynold Xin	6a12b9ebc5	Updated API doc for Accumulable and Accumulator.	2014-01-14 11:16:08 -08:00
Reynold Xin	71b3007dbd	Broadcast variable visibility change & doc update. Note that previously Broadcast class was accidentally marked as private[spark]. It needs to be public for broadcast variables to work. Also exposing the broadcast variable id.	2014-01-14 11:15:21 -08:00
Patrick Wendell	23034798d7	Add missing header files	2014-01-14 01:17:13 -08:00
Saurabh Rawat	1442cd5d50	Modifications as suggested in PR feedback- - more variants of mapPartitions added to JavaRDDLike - move setGenerator to JavaRDDLike - clean up	2014-01-14 14:19:02 +05:30
Patrick Wendell	0984647aae	Enable compression by default for spills	2014-01-13 23:25:25 -08:00
Patrick Wendell	4a805aff5e	Merge pull request #367 from ankurdave/graphx GraphX: Unifying Graphs and Tables GraphX extends Spark's distributed fault-tolerant collections API and interactive console with a new graph API which leverages recent advances in graph systems (e.g., [GraphLab](http://graphlab.org)) to enable users to easily and interactively build, transform, and reason about graph structured data at scale. See http://amplab.github.io/graphx/. Thanks to @jegonzal, @rxin, @ankurdave, @dcrankshaw, @jianpingjwang, @amatsukawa, @kellrott, and @adamnovak. Tasks left: - [x] Graph-level uncache - [x] Uncache previous iterations in Pregel - [x] ~~Uncache previous iterations in GraphLab~~ (postponed to post-release) - [x] - Describe GC issue with GraphLab - [ ] Write `docs/graphx-programming-guide.md` - [x] - Mention future Bagel support in docs - [ ] - Section on caching/uncaching in docs: As with Spark, cache something that is used more than once. In an iterative algorithm, try to cache and force (i.e., materialize) something every iteration, then uncache the cached things that depended on the newly materialized RDD but that won't be referenced again. - [x] Undo modifications to core collections and instead copy them to org.apache.spark.graphx - [x] Make Graph serializable to work around capture in Spark shell - [x] Rename graph -> graphx in package name and subproject - [x] Remove standalone PageRank - [x] ~~Fix amplab/graphx#52 by checking `iter.hasNext`~~	2014-01-13 22:58:38 -08:00
Patrick Wendell	945fe7a37e	Merge pull request #408 from pwendell/external-serializers Improvements to external sorting 1. Adds the option of compressing outputs. 2. Adds batching to the serialization to prevent OOM on the read side. 3. Slight renaming of config options. 4. Use Spark's buffer size for reads in addition to writes.	2014-01-13 22:56:12 -08:00
Patrick Wendell	68641bce61	Merge pull request #413 from rxin/scaladoc Adjusted visibility of various components and documentation for 0.9.0 release.	2014-01-13 22:54:13 -08:00
Patrick Wendell	0ca0d4d657	Merge pull request #401 from andrewor14/master External sorting - Add number of bytes spilled to Web UI Additionally, update test suite for external sorting to induce spilling.	2014-01-13 22:32:21 -08:00
Patrick Wendell	08b9fec93d	Merge pull request #409 from tdas/unpersist Automatically unpersisting RDDs that have been cleaned up from DStreams Earlier RDDs generated by DStreams were forgotten but not unpersisted. The system relied on the natural BlockManager LRU to drop the data. The cleaner.ttl was a hammer to clean up RDDs but it is something that needs to be set separately and needs to be set very conservatively (at best, a few minutes). This automatic unpersisting allows the system to handle this automatically, which reduces memory usage. As a side effect it will also improve GC performance as there are fewer objects stored in memory. In fact, for some workloads, it may allow RDDs to be cached as deserialized, which speeds up processing without too much GC overhead. This is disabled by default. To enable it set configuration spark.streaming.unpersist to true. In a future release, this will be set to true by default. Also, reduced sleep time in TaskSchedulerImpl.stop() from 5 seconds to 1 second. From my conversation with Matei, there does not seem to be any good reason for the sleep for letting messages be sent out be so long.	2014-01-13 22:29:03 -08:00
Andrew Or	839934140f	Wording changes per Patrick	2014-01-13 20:51:38 -08:00
Reynold Xin	33022d6656	Adjusted visibility of various components.	2014-01-13 19:58:53 -08:00
Harvey	9e84e70509	Add default value for HadoopRDD's `cloneRecords` constructor arg, to maintain backwards compatibility.	2014-01-13 19:43:40 -08:00
Patrick Wendell	d4cd5debf4	Fix for Kryo Serializer	2014-01-13 19:03:59 -08:00
Reynold Xin	e2d25d2dfe	Merge branch 'master' into graphx	2014-01-13 16:21:26 -08:00
Tathagata Das	27311b1332	Added unpersisting and modified testsuite to better test out metadata cleaning.	2014-01-13 14:57:07 -08:00
Patrick Wendell	c3816de504	Changing option wording per discussion with Andrew	2014-01-13 13:25:06 -08:00
Patrick Wendell	5d61e051c2	Improvements to external sorting 1. Adds the option of compressing outputs. 2. Adds batching to the serialization to prevent OOM on the read side. 3. Slight renaming of config options. 4. Use Spark's buffer size for reads in addition to writes.	2014-01-13 12:21:39 -08:00
Saurabh Rawat	e922973373	Modifications as suggested in PR feedback- - mapPartitions, foreachPartition moved to JavaRDDLike - call scala rdd's setGenerator instead of setting directly in JavaRDD	2014-01-13 23:40:04 +05:30
eklavya	fa42951e3b	Remove default param from mapPartitions	2014-01-13 18:13:22 +05:30
eklavya	8fe562c0fa	Remove classtag from mapPartitions.	2014-01-13 18:09:58 +05:30
eklavya	6a65feebc7	Added foreachPartition method to JavaRDD.	2014-01-13 17:56:47 +05:30
eklavya	dbadc6b994	Added mapPartitions method to JavaRDD.	2014-01-13 17:56:10 +05:30
eklavya	aae8a01425	Added setter method setGenerator to JavaRDD.	2014-01-13 17:53:35 +05:30
Andrew Or	a1f0992fae	Report bytes spilled for both memory and disk on Web UI	2014-01-12 23:42:57 -08:00
Andrew Or	69c9aebed0	Enable external sorting by default	2014-01-12 22:43:01 -08:00
Reynold Xin	e6ed13f255	Merge pull request #397 from pwendell/host-port Remove now un-needed hostPort option I noticed this was logging some scary error messages in various places. After I looked into it, this is no longer really used. I removed the option and re-wrote the one remaining use case (it was unnecessary there anyways).	2014-01-12 22:35:14 -08:00
Andrew Or	8d40e7222f	Get rid of spill map in SparkEnv	2014-01-12 22:34:33 -08:00
Patrick Wendell	0b96d85c20	Merge pull request #399 from pwendell/consolidate-off Disable shuffle file consolidation by default After running various performance tests for the 0.9 release, this still seems to have performance issues even on XFS. So let's keep this off-by-default for 0.9 and users can experiment with it depending on their disk configurations.	2014-01-12 21:31:43 -08:00
Patrick Wendell	0ab505a29e	Merge pull request #395 from hsaputra/remove_simpleredundantreturn_scala Remove simple redundant return statements for Scala methods/functions Remove simple redundant return statements for Scala methods/functions: -) Only change simple return statements at the end of method -) Ignore the complex if-else check -) Ignore the ones inside synchronized -) Add small changes to making var to val if possible and remove () for simple get This hopefully makes the review simpler =) Pass compile and tests.	2014-01-12 21:31:04 -08:00
Patrick Wendell	2802cc80bc	Disable shuffle file consolidation by default	2014-01-12 19:16:43 -08:00
Henry Saputra	5a8abfb70e	Address code review concerns and comments.	2014-01-12 19:15:09 -08:00
Tathagata Das	aa2c993858	Merge remote-tracking branch 'apache/master' into error-handling	2014-01-12 17:37:46 -08:00
Patrick Wendell	074f50232f	Merge pull request #396 from pwendell/executor-env Setting load defaults to true in executor This preserves the behavior in earlier releases. If properties are set for the executors via `spark-env.sh` on the slaves, then they should take precedence over spark defaults. This is useful if system administrators are setting properties for a standalone cluster, such as shuffle locations. /cc @andrewor14 who initially reported this issue.	2014-01-12 17:01:13 -08:00
Reynold Xin	82e2b92c6d	Merge pull request #392 from rxin/listenerbus Stop SparkListenerBus daemon thread when DAGScheduler is stopped. Otherwise this leads to hundreds of SparkListenerBus daemon threads in our unit tests (and is also problematic if a user application launches multiple SparkContexts).	2014-01-12 16:55:11 -08:00
Patrick Wendell	0bb33076e2	Removing mentions in tests	2014-01-12 16:53:58 -08:00
Patrick Wendell	0d4886c000	Remove now un-needed hostPort option	2014-01-12 16:47:52 -08:00
Patrick Wendell	cfb1e6c13c	Setting load defaults to true in executor	2014-01-12 15:35:08 -08:00
Henry Saputra	f1c5eca494	Fix accidental comment modification.	2014-01-12 10:40:21 -08:00
Henry Saputra	91a563608e	Merge branch 'master' into remove_simpleredundantreturn_scala	2014-01-12 10:34:13 -08:00
Henry Saputra	93a65e5fde	Remove simple redundant return statement for Scala methods/functions: -) Only change simple return statements at the end of method -) Ignore the complex if-else check -) Ignore the ones inside synchronized	2014-01-12 10:30:04 -08:00
Tathagata Das	18f4889d96	Merge remote-tracking branch 'apache/master' into error-handling	2014-01-11 23:40:57 -08:00
Tathagata Das	f5108ffc24	Converted JobScheduler to use actors for event handling. Changed protected[streaming] to private[streaming] in StreamingContext and DStream. Added waitForStop to StreamingContext, and StreamingContextSuite.	2014-01-11 23:15:09 -08:00
Reynold Xin	288a878999	Merge pull request #389 from rxin/clone-writables Minor update for clone writables and more documentation.	2014-01-11 21:53:19 -08:00
Reynold Xin	dbc11df411	Merge pull request #388 from pwendell/master Fix UI bug introduced in #244. The 'duration' field was incorrectly renamed to 'task time' in the table that lists stages.	2014-01-11 18:07:13 -08:00
Reynold Xin	362cda18bc	Renamed cloneKeyValues to cloneRecords; updated docs.	2014-01-11 18:01:29 -08:00
Patrick Wendell	07b952e1d1	Revert "Fix default TTL for metadata cleaner" This reverts commit `669ba4caa9`.	2014-01-11 16:07:10 -08:00
Reynold Xin	2180c87188	Stop SparkListenerBus daemon thread when DAGScheduler is stopped.	2014-01-11 13:36:37 -08:00
Reynold Xin	b0fbfccadc	Minor update for clone writables and more documentation.	2014-01-11 12:35:10 -08:00
Reynold Xin	ee6e7f9b8c	Merge pull request #359 from ScrapCodes/clone-writables We clone hadoop key and values by default and reuse objects if asked to. We try to clone the most common types of writables directly and call WritableUtils.clone otherwise. The intention is to optimize: for example, for NullWritable there is no need to clone, and for Long, int and String, creating a new object with the value set should be faster than copying the object. There is another way to do this PR where we ask for both key and values whether to clone them or not, but I could not think of a use case for it except when either of them is actually a NullWritable, which I have already worked around. So I thought that would be unnecessary.	2014-01-11 12:07:55 -08:00
Patrick Wendell	b313e15616	Fix UI bug introduced in #244 . The 'duration' field was incorrectly renamed to 'task time' in the table that lists stages.	2014-01-11 10:52:57 -08:00
Reynold Xin	0b5ce7af17	Merge pull request #386 from pwendell/typo-fix Small typo fix	2014-01-10 23:23:21 -08:00
Andrew Or	bb8098f203	Add number of bytes spilled to Web UI	2014-01-10 21:40:55 -08:00
Ankur Dave	d1d2b6d9b6	Remove blank lines added to Spark core	2014-01-10 21:17:32 -08:00
Matei Zaharia	1d7bef0c91	Merge pull request #381 from mateiz/default-ttl Fix default TTL for metadata cleaner It seems to have been set to 3500 in a previous commit for debugging, but it should be off by default.	2014-01-10 18:53:03 -08:00
Ankur Dave	41d6586e8e	Revert changes to Spark's (PrimitiveKey)OpenHashMap; copy PKOHM to graphx	2014-01-10 18:00:54 -08:00
Patrick Wendell	44d6a8e3d8	Merge pull request #382 from RongGu/master Fix a type error in comment lines Fix a type error in comment lines	2014-01-10 17:51:50 -08:00
Patrick Wendell	08370a52b8	Small typo fix	2014-01-10 17:47:15 -08:00
Patrick Wendell	f26553102c	Merge pull request #383 from tdas/driver-test API for automatic driver recovery for streaming programs and other bug fixes 1. Added Scala and Java API for automatically loading checkpoint if it exists in the provided checkpoint directory. Scala API: `StreamingContext.getOrCreate(<checkpoint dir>, <function to create new StreamingContext>)` returns a StreamingContext Java API: `JavaStreamingContext.getOrCreate(<checkpoint dir>, <factory obj of type JavaStreamingContextFactory>)` returns a JavaStreamingContext See the RecoverableNetworkWordCount below as an example of how to use it. 2. Refactored streaming.Checkpoint*** code to fix bugs and make the DStream metadata checkpoint writing and reading more robust. Specifically, it fixes and improves the logic behind backing up and writing metadata checkpoint files. Also, it ensures that spark.driver.* and spark.hostPort are cleared from SparkConf before being written to checkpoint. 3. Fixed bug in cleaning up of checkpointed RDDs created by DStream. Specifically, this fix ensures that checkpointed RDD's files are not prematurely cleaned up, thus ensuring reliable recovery. 4. TimeStampedHashMap is upgraded to optionally update the timestamp on map.get(key). This allows clearing of data based on access time (i.e., clear records that were last accessed before a threshold timestamp). 5. Added caching for file modification time in FileInputDStream using the updated TimeStampedHashMap. Without the caching, enumerating the mod times to find new files can take seconds if there are 1000s of files. This cache is automatically cleared. This PR is not entirely final as I may make some minor additions - a Java example, and adding StreamingContext.getOrCreate to unit test. Edit: Java example to be added later, unit test added.	2014-01-10 16:25:44 -08:00
Patrick Wendell	d37408f39c	Merge pull request #377 from andrewor14/master External Sorting for Aggregator and CoGroupedRDDs (Revisited) (This pull request is re-opened from https://github.com/apache/incubator-spark/pull/303, which was closed because Jenkins / github was misbehaving) The target issue for this patch is the out-of-memory exceptions triggered by aggregate operations such as reduce, groupBy, join, and cogroup. The existing AppendOnlyMap used by these operations resides purely in memory, and grows with the size of the input data until the amount of allocated memory is exceeded. Under large workloads, this problem is aggravated by the fact that OOM frequently occurs only after a very long (> 1 hour) map phase, in which case the entire job must be restarted. The solution is to spill the contents of this map to disk once a certain memory threshold is exceeded. This functionality is provided by ExternalAppendOnlyMap, which additionally sorts this buffer before writing it out to disk, and later merges these buffers back in sorted order. Under normal circumstances in which OOM is not triggered, ExternalAppendOnlyMap is simply a wrapper around AppendOnlyMap and incurs little overhead. Only when the memory usage is expected to exceed the given threshold does ExternalAppendOnlyMap spill to disk.	2014-01-10 16:25:01 -08:00
Tathagata Das	4f39e79c23	Merge remote-tracking branch 'apache/master' into driver-test Conflicts: streaming/src/main/scala/org/apache/spark/streaming/DStreamGraph.scala	2014-01-10 15:47:01 -08:00
Reynold Xin	0eaf01c5ed	Merge pull request #369 from pillis/master SPARK-961 Add a Vector.random() method Added method and testcases	2014-01-10 15:32:19 -08:00
Andrew Or	e4c51d2113	Address Patrick's and Reynold's comments Aside from trivial formatting changes, use nulls instead of Options for DiskMapIterator, and add documentation for spark.shuffle.externalSorting and spark.shuffle.memoryFraction. Also, set spark.shuffle.memoryFraction to 0.3, and spark.storage.memoryFraction = 0.6.	2014-01-10 15:09:51 -08:00
RongGu	94776f753f	fix a type error in comment lines	2014-01-11 05:43:56 +08:00
Thomas Graves	7cef8435d7	Merge pull request #371 from tgravescs/yarn_client_addjar_misc_fixes Yarn client addjar and misc fixes Fix the addJar functionality in yarn-client mode, add support for the other options supported in yarn-standalone mode, set the application type on yarn in hadoop 2.X, add documentation, change heartbeat interval to be same code as the yarn-standalone so it doesn't take so long to get containers and exit.	2014-01-10 15:34:15 -06:00
Patrick Wendell	7b58f116e5	Merge pull request #384 from pwendell/debug-logs Make DEBUG-level logs consumable. Removes two things that caused issues with the debug logs: (a) Internal polling in the DAGScheduler was polluting the logs. (b) The Scala REPL logs were really noisy.	2014-01-10 12:47:46 -08:00
Tathagata Das	e4bb845238	Updated docs based on Patrick's comments in PR 383.	2014-01-10 12:17:09 -08:00
Patrick Wendell	e9ed2d9e82	Make DEBUG-level logs consumable. Removes two things that caused issues with the debug logs: (a) Internal polling in the DAGScheduler was polluting the logs. (b) The Scala REPL logs were really noisy.	2014-01-10 10:33:24 -08:00
Tathagata Das	740730a179	Fixed conf/slaves and updated docs.	2014-01-10 05:06:15 -08:00
Matei Zaharia	669ba4caa9	Fix default TTL for metadata cleaner It seems to have been set to 3500 in a previous commit for debugging, but it should be off by default	2014-01-10 00:21:36 -08:00
Pillis	8d021b42bc	SPARK-961. Add a Vector.random() method - update 1	2014-01-10 00:07:36 -08:00
Matei Zaharia	0ebc97305a	Merge pull request #375 from mateiz/option-fix Fix bug added when we changed AppDescription.maxCores to an Option The Scala compiler warned about this -- we were comparing an Option against an integer now.	2014-01-09 23:58:49 -08:00
Patrick Wendell	460f655cc6	Enable shuffle consolidation by default. Bump this to being enabled for 0.9.0.	2014-01-09 22:42:50 -08:00
Andrew Or	aa5002bb96	Defensively allocate memory from global pool This is an alternative to the existing approach, which evenly distributes the collective shuffle memory among all running tasks. In the new approach, each thread requests a chunk of memory whenever its map is about to multiplicatively grow. If there is sufficient memory in the global pool, the thread allocates it and grows its map. Otherwise, it spills. A danger with the previous approach is that a new task may quickly fill up its map before old tasks finish spilling, potentially causing an OOM. This approach prevents this scenario as it favors existing tasks over new tasks; any thread that may step over the boundary of other threads defensively backs off and starts spilling. Testing through spark-perf reveals: (1) When no spills have occurred, the performance of external sorting using this memory management approach is essentially the same as without external sorting. (2) When one or more spills have occurred, the performance of external sorting is a small multiple (3x) worse	2014-01-09 21:43:58 -08:00
Andrew Or	d76e1f90a8	Merge github.com:apache/incubator-spark Conflicts: core/src/main/scala/org/apache/spark/SparkEnv.scala streaming/src/test/java/org/apache/spark/streaming/JavaAPISuite.java	2014-01-09 21:38:48 -08:00
Tathagata Das	38d75e18fa	Merge remote-tracking branch 'apache/master' into driver-test	2014-01-09 19:31:36 -08:00
Reynold Xin	4b074fac05	Merge pull request #374 from mateiz/completeness Add some missing Java API methods These are primarily for setting job groups, canceling jobs, and setting names on RDDs. Seemed like useful stuff to expose in Java.	2014-01-09 19:03:55 -08:00
Reynold Xin	a9d533333d	Merge pull request #294 from RongGu/master Bug fixes for updating the RDD block's memory and disk usage information Bug fixes for updating the RDD block's memory and disk usage information. From the code context, we can find that the memSize and diskSize here are both always equal to the size of the block. Actually, they are never zero. Thus, the logic here is wrong for recording the block usage in BlockStatus, especially for the blocks which are dropped from memory to ensure space for the new input rdd blocks. I have verified that this causes the storage metrics shown on the Storage webpage to be wrong and misleading. With this patch, the metrics will be okay. Finally, Merry Christmas, guys:)	2014-01-09 18:46:46 -08:00
Patrick Wendell	d86a85e9ca	Merge pull request #293 from pwendell/standalone-driver SPARK-998: Support Launching Driver Inside of Standalone Mode [NOTE: I need to bring the tests up to date with new changes, so for now they will fail] This patch provides support for launching driver programs inside of a standalone cluster manager. It also supports monitoring and re-launching of driver programs which is useful for long running, recoverable applications such as Spark Streaming jobs. For those jobs, this patch allows a deployment mode which is resilient to the failure of any worker node, failure of a master node (provided a multi-master setup), and even failures of the application itself, provided they are recoverable on a restart. Driver information, such as the status and logs from a driver, is displayed in the UI There are a few small TODO's here, but the code is generally feature-complete. They are: - Bring tests up to date and add test coverage - Restarting on failure should be optional and maybe off by default. - See if we can re-use akka connections to facilitate clients behind a firewall A sensible place to start for review would be to look at the `DriverClient` class which presents users the ability to launch their driver program. I've also added an example program (`DriverSubmissionTest`) that allows you to test this locally and play around with killing workers, etc. Most of the code is devoted to persisting driver state in the cluster manager, exposing it in the UI, and dealing correctly with various types of failures.
Instructions to test locally:
- `sbt/sbt assembly/assembly examples/assembly`
- start a local version of the standalone cluster manager

```
./spark-class org.apache.spark.deploy.client.DriverClient \
  -j -Dspark.test.property=something \
  -e SPARK_TEST_KEY=SOMEVALUE \
  launch spark://10.99.1.14:7077 \
  ../path-to-examples-assembly-jar \
  org.apache.spark.examples.DriverSubmissionTest 1000 some extra options --some-option-here -X 13
```

- Go in the UI and make sure it started correctly, look at the output etc
- Kill workers, the driver program, masters, etc.	2014-01-09 18:37:52 -08:00
Matei Zaharia	c43eb00644	Fix bug added when we changed AppDescription.maxCores to an Option The Scala compiler warned about this -- we were comparing an Option against an integer now.	2014-01-09 18:14:20 -08:00
Matei Zaharia	142921c6c0	Add some missing Java API methods	2014-01-09 18:11:12 -08:00
Patrick Wendell	26cdb5f68a	Merge pull request #372 from pwendell/log4j-fix-1 Send logs to stderr by default (instead of stdout).	2014-01-09 17:16:34 -08:00
Patrick Wendell	2af98198ad	Send logs to stderr by default (instead of stdout).	2014-01-09 15:57:44 -08:00
Matei Zaharia	12f414ed43	Merge pull request #362 from mateiz/conf-getters Use typed getters for configuration settings This improves some of the code style after SPARK-544.	2014-01-09 15:31:30 -08:00
Tathagata Das	f1d206c6b4	Merge branch 'standalone-driver' into driver-test Conflicts: core/src/main/scala/org/apache/spark/SparkContext.scala core/src/main/scala/org/apache/spark/deploy/worker/DriverRunner.scala examples/src/main/java/org/apache/spark/streaming/examples/JavaNetworkWordCount.java streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala	2014-01-09 15:06:24 -08:00
Tathagata Das	6f713e2a3e	Changed the way StreamingContext finds and reads checkpoint files, and added JavaStreamingContext.getOrCreate.	2014-01-09 13:42:04 -08:00
Patrick Wendell	67b9a33628	Some usability improvements	2014-01-09 12:42:37 -08:00
Thomas Graves	c617083e47	yarn-client addJar fix and misc other	2014-01-09 10:24:35 -06:00
Pillis	181471906e	SPARK-961 Add a Vector.random() method	2014-01-09 10:16:19 +01:00
Reynold Xin	365cac9465	Merge pull request #361 from rxin/clean Minor style cleanup. Mostly on indenting & line width changes. Focused on the few important files since they are the files that new contributors usually read first.	2014-01-09 00:56:16 -08:00
Reynold Xin	295d82583a	Minor update on SparkContext.broadcast's JavaDoc.	2014-01-09 00:30:22 -08:00
Ankur Dave	7309a29c75	Removed Kryo dependency and graphx-shell	2014-01-09 00:13:23 -08:00
Matei Zaharia	a01f3401e3	Use typed getters for configuration settings	2014-01-09 00:07:29 -08:00
Prashant Sharma	59b03e015d	Fixes corresponding to Reynolds feedback comments	2014-01-09 12:26:30 +05:30
Ankur Dave	78d6b13ac8	Fix mis-merge in `44fd30d3fb`	2014-01-08 21:19:14 -08:00
Ankur Dave	91227566bc	Merge remote-tracking branch 'spark-upstream/master' into HEAD Conflicts: README.md core/src/main/scala/org/apache/spark/util/collection/OpenHashMap.scala core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala core/src/main/scala/org/apache/spark/util/collection/PrimitiveKeyOpenHashMap.scala pom.xml project/SparkBuild.scala repl/src/main/scala/org/apache/spark/repl/SparkILoop.scala	2014-01-08 21:19:08 -08:00
Patrick Wendell	112c0a1776	Fixing config option "retained_stages" => "retainedStages". This is a very esoteric option and it's out of sync with the style we use. So it seems fitting to fix it for 0.9.0.	2014-01-08 21:16:16 -08:00
Patrick Wendell	0f9d2ace6b	Adding polling to driver submission client.	2014-01-08 16:56:26 -08:00
Reynold Xin	46f6a3b6aa	Minor style cleanup. Mostly on indenting & line width changes.	2014-01-08 14:55:04 -08:00
Reynold Xin	56ebfeaa52	Merge pull request #357 from hsaputra/set_boolean_paramname Set boolean param name for call to SparkHadoopMapReduceUtil.newTaskAttemptID Set boolean param name for call to SparkHadoopMapReduceUtil.newTaskAttemptID to make it clear which param being set.	2014-01-08 11:50:06 -08:00
Reynold Xin	5cae05f59e	Merge pull request #356 from hsaputra/remove_deprecated_cleanup_method Remove calls to deprecated mapred's OutputCommitter.cleanupJob Since Hadoop 1.0.4 the mapred OutputCommitter.commitJob should do cleanup job via call to OutputCommitter.cleanupJob, Remove SparkHadoopWriter.cleanup since it is used only by PairRDDFunctions. In fact the implementation of mapred OutputCommitter.commitJob looks like this: public void commitJob(JobContext jobContext) throws IOException { cleanupJob(jobContext); }	2014-01-08 11:47:28 -08:00
walker	d942f95d7e	Merge remote branch 'upstream/master'	2014-01-09 01:22:26 +08:00
Prashant Sharma	277b4a36c5	we clone hadoop key and values by default and reuse if specified.	2014-01-08 16:32:55 +05:30
Patrick Wendell	bc81ce040d	Merge remote-tracking branch 'apache-github/master' into standalone-driver Conflicts: core/src/test/scala/org/apache/spark/deploy/JsonProtocolSuite.scala pom.xml	2014-01-08 00:38:31 -08:00
Henry Saputra	aa56585d21	Resolve PR review over 100 chars	2014-01-08 00:38:29 -08:00
Patrick Wendell	3ec21f2eee	Show more helpful information in UI	2014-01-08 00:30:10 -08:00
Patrick Wendell	c78b381e91	Fixes	2014-01-08 00:09:12 -08:00
Patrick Wendell	d0533f7046	Rename to Client	2014-01-07 23:38:51 -08:00
Patrick Wendell	3d939e5fe8	Adding --verbose option to DriverClient	2014-01-07 23:27:18 -08:00
Henry Saputra	f6b6f88367	Set boolean param name for two files call to SparkHadoopMapReduceUtil.newTaskAttemptID to make it clear which param being set.	2014-01-07 23:23:17 -08:00
Henry Saputra	4517326ec6	Remove calls to deprecated mapred's OutputCommitter.cleanupJob because since Hadoop 1.0.4 the mapred OutputCommitter.commitJob should do cleanup job. In fact the implementation of mapred OutputCommitter.commitJob looks like this: public void commitJob(JobContext jobContext) throws IOException { cleanupJob(jobContext); } (The jobContext input argument is type of org.apache.hadoop.mapred.JobContext)	2014-01-07 22:55:56 -08:00
Patrick Wendell	f5f12dc282	Merge pull request #336 from liancheng/akka-remote-lookup Get rid of `Either[ActorRef, ActorSelection]` In this pull request, instead of returning an `Either[ActorRef, ActorSelection]`, `registerOrLookup` identifies the remote actor blockingly to obtain an `ActorRef`, or throws an exception if the remote actor doesn't exist or the lookup times out (configured by `spark.akka.lookupTimeout`). This function is only called when a `SparkEnv` is constructed (instantiating driver or executor), so the blocking call is considered acceptable. Executor side `ActorSelection`s/`ActorRef`s to driver side `MapOutputTrackerMasterActor` and `BlockManagerMasterActor` are affected by this pull request. `ActorSelection` is dangerous and should be used with care. It's only absolutely safe to send messages via an `ActorSelection` when the remote actor is stateless, so that actor incarnation is irrelevant. But as pointed out by @ScrapCodes in the comments below, the executor exits immediately once the connection to the driver is lost, so `ActorSelection`s are not harmful in this scenario. So this pull request is mostly a code style patch.	2014-01-07 21:56:35 -08:00
Matei Zaharia	d75dc428da	Merge pull request #350 from mateiz/standalone-limit Add way to limit default # of cores used by apps in standalone mode Also documents the spark.deploy.spreadOut option, and fixes a config option that had a dash in its name.	2014-01-08 00:30:03 -05:00
Matei Zaharia	2c421749ea	Address review comments	2014-01-07 19:30:23 -05:00
Patrick Wendell	e21a707a13	Adding unit tests and some refactoring to promote testability.	2014-01-07 15:39:47 -08:00
Patrick Wendell	e688e11206	Add log4j exclusion rule to maven. To make this work I had to rename the defaults file. Otherwise maven's pattern matching rules included it when trying to match other log4j.properties files. I also fixed a bug in the existing maven build where two <transformers> tags were present in assembly/pom.xml such that one overwrote the other.	2014-01-07 12:56:24 -08:00
Andrew Or	80ba9f8ba0	Get SparkConf from SparkEnv, rather than creating new ones	2014-01-07 12:44:22 -08:00
Matei Zaharia	d8bcc8e9a0	Add way to limit default # of cores used by applications on standalone mode Also documents the spark.deploy.spreadOut option.	2014-01-07 14:35:52 -05:00
Reynold Xin	15d9534501	Merge pull request #318 from srowen/master Suggested small changes to Java code for slightly more standard style, encapsulation and in some cases performance Sorry if this is too abrupt or not a welcome set of changes, but thought I'd see if I could contribute a little. I'm a Java developer and just getting seriously into Spark. So I thought I'd suggest a number of small changes to the couple Java parts of the code to make it a little tighter, more standard and even a bit faster. Feel free to take all, some or none of this. Happy to explain any of it.	2014-01-07 08:10:02 -08:00
Prashant Sharma	c729fa7c8e	formatting related fixes suggested by Patrick.	2014-01-07 13:08:16 +05:30
Prashant Sharma	b84dc780d3	Allow configuration to be printed in logs for diagnosis.	2014-01-07 13:01:43 +05:30
Prashant Sharma	b3018811e1	Allow users to set arbitrary akka configurations via spark conf.	2014-01-07 13:01:43 +05:30
Patrick Wendell	6a3daead2d	Fixes after merge	2014-01-06 20:12:45 -08:00
Patrick Wendell	c0498f9265	Merge remote-tracking branch 'apache-github/master' into standalone-driver Conflicts: core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala core/src/main/scala/org/apache/spark/deploy/client/TestClient.scala core/src/main/scala/org/apache/spark/deploy/master/Master.scala core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala core/src/main/scala/org/apache/spark/scheduler/cluster/SparkDeploySchedulerBackend.scala	2014-01-06 17:29:21 -08:00
Patrick Wendell	f236ddd1a2	Changes based on review feedback.	2014-01-06 17:15:52 -08:00
Adam Novak	fa8ce3fdd7	Changing org.apache.spark.util.collection.PrimitiveKeyOpenHashMap to have a real no-argument constructor, instead of a one-argument constructor with a default value. The lack of a real no-argument constructor was causing "sbt/sbt publish-local" to fail thusly: ``` [error] /pod/home/anovak/build/graphx/core/src/main/scala/org/apache/spark/storage/ShuffleBlockManager.scala:172: not enough arguments for constructor PrimitiveKeyOpenHashMap: (initialCapacity: Int)(implicit evidence$3: ClassManifest[Int], implicit evidence$4: ClassManifest[Int])org.apache.spark.util.collection.PrimitiveKeyOpenHashMap[Int,Int] [error] private val mapIdToIndex = new PrimitiveKeyOpenHashMap[Int, Int]() [error] ^ [info] No documentation generated with unsucessful compiler run [error] one error found [error] (core/compile:doc) Scaladoc generation failed [error] Total time: 67 s, completed Jan 6, 2014 2:20:51 PM ``` In theory a no-argument constructor ought not to differ from one with a single argument that has a default value, but in practice there seems to be an issue.	2014-01-06 14:52:15 -08:00
Patrick Wendell	357083c29f	Merge pull request #330 from tgravescs/fix_addjars_null_handling Fix handling of empty SPARK_EXAMPLES_JAR Currently if SPARK_EXAMPLES_JAR is left unset you get a null pointer exception when running the examples (at least on spark on yarn). The null now gets turned into a string of "null" when it's put into the SparkConf so addJar no longer properly ignores it. This fixes that so that it can be left unset.	2014-01-06 10:29:04 -08:00
walker	2ad315e80f	add inline comments	2014-01-07 01:27:57 +08:00
walker	6ab1db8071	add inline comments	2014-01-07 01:21:25 +08:00
walker	a0c6d96e27	Merge remote branch 'upstream/master'	2014-01-07 01:05:18 +08:00
Sean Owen	7379b2915f	Merge remote-tracking branch 'upstream/master'	2014-01-06 15:13:16 +00:00
Thomas Graves	25446dd931	Add warning to null setJars check	2014-01-06 07:58:59 -06:00
Tathagata Das	ac1f4b06c1	Added a hashmap to cache file mod times.	2014-01-05 23:42:53 -08:00
Patrick Wendell	a2e7e04974	Merge pull request #333 from pwendell/logging-silence Quiet ERROR-level Akka Logs This fixes an issue I've seen where akka logs a bunch of things at ERROR level when connecting to a standalone cluster, even in the normal case. I noticed that even when lifecycle logging was disabled, the netty code inside of akka still logged away via akka's EndpointWriter class. There are also some other log streams that I think are new in akka 2.2.1 that I've disabled. Finally, I added some better logging to the standalone client. This makes it more clear when a connection failure occurs what is going on. Previously it never explicitly said if a connection attempt had failed. The commit messages here have some more detail.	2014-01-05 22:37:36 -08:00
Patrick Wendell	675d7eb4f0	Responding to Aaron's review	2014-01-05 21:23:14 -08:00
Reynold Xin	5b0986a1d6	Merge pull request #334 from pwendell/examples-fix Removing SPARK_EXAMPLES_JAR in the code This re-writes all of the examples to use the `SparkContext.jarOfClass` mechanism for loading the examples jar. This is necessary for environments like YARN and the Standalone mode where example programs will be submitted from inside the cluster rather than at the client using `./spark-example`. This still leaves SPARK_EXAMPLES_JAR in place in the shell scripts for setting up the classpath if `./spark-example` is run.	2014-01-05 19:25:09 -08:00
Tathagata Das	2394794591	Merge branch 'filestream-fix' into driver-test Conflicts: streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala	2014-01-06 02:23:53 +00:00
Tathagata Das	8e88db3ca5	Bug fixes to the DriverRunner and minor changes here and there.	2014-01-06 02:21:56 +00:00
Lian, Cheng	a4048ff31e	Get rid of `Either[ActorRef, ActorSelection]` Although we can send messages via an ActorSelection, it would be better to identify the actor and obtain an ActorRef first, so that we can get informed earlier if the remote actor doesn't exist, and get rid of the annoying Either wrapper.	2014-01-06 09:18:17 +08:00
Reynold Xin	63f906322d	Fall back to zero-arg constructor for Serializer initialization if there is no constructor that accepts SparkConf. This maintains backward compatibility with older serializers implemented by users.	2014-01-05 15:52:43 -08:00
Patrick Wendell	94fdcda896	Provide logging when attempts to connect to the master fail. Without these it's a bit less clear what's going on for the user. One thing I realized when doing this is that akka itself actually retries the initial association. So the retry we currently have is redundant with akka's.	2014-01-05 15:16:01 -08:00
Patrick Wendell	aaaa673184	Quiet akka when remote lifecycle logging is disabled. I noticed when connecting to a standalone cluster Spark gives a bunch of Akka ERROR logs that make it seem like something is failing. This patch does two things: 1. Akka dead letter logging is turned on/off according to the existing lifecycle spark property. 2. We explicitly silence akka's EndpointWriter log in log4j. This is necessary because for some reason that log doesn't pick up on the lifecycle logging settings. After a few hours of debugging this was the only solution I found that worked.	2014-01-05 15:15:59 -08:00
Patrick Wendell	79f52809c8	Removing SPARK_EXAMPLES_JAR in the code	2014-01-05 11:49:42 -08:00
Andrew Or	4de9c9554c	Use AtomicInteger for numRunningTasks	2014-01-04 11:16:30 -08:00
Thomas Graves	ad35c1a5f2	Fix handling of empty SPARK_EXAMPLES_JAR	2014-01-04 11:42:17 -06:00
Tathagata Das	3d4474330d	Removed the exponential backoff for testing.	2014-01-04 08:39:00 -08:00
Andrew Or	2db7884f6f	Address Mark's comments	2014-01-04 01:20:09 -08:00
Andrew Or	4296d96c82	Assign spill threshold as a fraction of maximum memory Further, divide this threshold by the number of tasks running concurrently. Note that this does not guard against the following scenario: a new task quickly fills up its share of the memory before old tasks finish spilling their contents, in which case the total memory used by such maps may exceed what was specified. Currently, spark.shuffle.safetyFraction mitigates the effect of this.	2014-01-04 00:00:57 -08:00
Patrick Wendell	604fad9c39	Merge remote-tracking branch 'apache-github/master' into remove-binaries Conflicts: core/src/test/scala/org/apache/spark/DriverSuite.scala docs/python-programming-guide.md	2014-01-03 21:29:33 -08:00
Patrick Wendell	9e6f3bdcda	Changes on top of Prashant's patch. Closes #316	2014-01-03 18:30:17 -08:00
Andrew Or	333d58df86	Remove unnecessary ClassTag's	2014-01-03 17:55:26 -08:00
Andrew Or	838b0e7d15	Refactor using SparkConf	2014-01-03 16:13:40 -08:00
Patrick Wendell	4ae101ff38	Merge pull request #317 from ScrapCodes/spark-915-segregate-scripts Spark-915 segregate scripts	2014-01-03 11:24:35 -08:00
Prashant Sharma	9ae382c363	sbin/compute-classpath* -> bin/compute-classpath*	2014-01-03 15:12:29 +05:30
Prashant Sharma	74ba97fcf7	sbin/spark-class* -> bin/spark-class*	2014-01-03 15:08:01 +05:30
Prashant Sharma	94f2fffa23	fixed review comments	2014-01-03 14:43:37 +05:30
Andrew Or	df413e996f	Merge remote-tracking branch 'spark/master' Conflicts: core/src/main/scala/org/apache/spark/rdd/CoGroupedRDD.scala	2014-01-02 20:51:23 -08:00
Tathagata Das	a1b8dd53e3	Added StreamingContext.getOrCreate for automatic recovery, and added RecoverableNetworkWordCount example to use it.	2014-01-02 19:07:22 -08:00
Reynold Xin	0475ca8f81	Merge pull request #320 from kayousterhout/erroneous_failed_msg Remove erroneous FAILED state for killed tasks. Currently, when tasks are killed, the Executor first sends a status update for the task with a "KILLED" state, and then sends a second status update with a "FAILED" state saying that the task failed due to an exception. The second FAILED state is misleading/unnecessary, and occurs due to a NonLocalReturnControl Exception that gets thrown due to the way we kill tasks. This commit eliminates that problem. I'm not at all sure that this is the best way to fix this problem, so alternate suggestions welcome. @rxin guessing you're the right person to look at this.	2014-01-02 15:17:08 -08:00
Aaron Davidson	8831923219	TempBlockId takes UUID and is explicitly non-serializable	2014-01-02 13:52:35 -08:00
Patrick Wendell	588a1695f4	Merge pull request #297 from tdas/window-improvement Improvements to DStream window ops and refactoring of Spark's CheckpointSuite - Added a new RDD - PartitionerAwareUnionRDD. Using this RDD, one can take multiple RDDs partitioned by the same partitioner and unify them into a single RDD while preserving the partitioner. So m RDDs with p partitions each will be unified to a single RDD with p partitions and the same partitioner. The preferred location for each partition of the unified RDD will be the most common preferred location of the corresponding partitions of the parent RDDs. For example, location of partition 0 of the unified RDD will be where most of partition 0 of the parent RDDs are located. - Improved the performance of DStream's reduceByKeyAndWindow and groupByKeyAndWindow. Both these operations work by doing per-batch reduceByKey/groupByKey and then using PartitionerAwareUnionRDD to union the RDDs across the window. This eliminates a shuffle related to the window operation, which can reduce batch processing time by 30-40% for simple workloads. - Fixed bugs and simplified Spark's CheckpointSuite. Some of the tests were incorrect and unreliable. Added missing tests for ZippedRDD. I can go into greater detail if necessary. - Added mapSideCombine option to combineByKeyAndWindow.	2014-01-02 13:20:54 -08:00
Matei Zaharia	7bafb68d77	Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/incubator-spark	2014-01-02 15:57:28 -05:00
Matei Zaharia	ca67909cd4	Merge pull request #311 from tmyklebu/master SPARK-991: Report information gleaned from a Python stacktrace in the UI Scala: - Added setCallSite/clearCallSite to SparkContext and JavaSparkContext. These functions mutate a LocalProperty called "externalCallSite." - Add a wrapper, getCallSite, that checks for an externalCallSite and, if none is found, calls the usual Utils.formatSparkCallSite. - Change everything that calls Utils.formatSparkCallSite to call getCallSite instead. Except getCallSite. - Add setCallSite/clearCallSite wrappers to JavaSparkContext. Python: - Add a gruesome hack to rdd.py that inspects the traceback and guesses what you want to see in the UI. - Add a RAII wrapper around said gruesome hack that calls setCallSite/clearCallSite as appropriate. - Wire said RAII wrapper up around three calls into the Scala code. I'm not sure that I hit all the spots with the RAII wrapper. I'm also not sure that my gruesome hack does exactly what we want. One could also approach this change by refactoring runJob/submitJob/runApproximateJob to take a call site, then threading that parameter through everything that needs to know it. One might object to the pointless-looking wrappers in JavaSparkContext. Unfortunately, I can't directly access the SparkContext from Python---or, if I can, I don't know how---so I need to wrap everything that matters in JavaSparkContext. Conflicts: core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala	2014-01-02 15:54:54 -05:00
Kay Ousterhout	a1b438d94d	Remove erroneous FAILED state for killed tasks. Currently, when tasks are killed, the Executor first sends a status update for the task with a "KILLED" state, and then sends a second status update with a "FAILED" state saying that the task failed due to an exception. The second FAILED state is misleading/unnecessary, and occurs due to a NonLocalReturnControl Exception that gets thrown due to the way we kill tasks. This commit eliminates that problem.	2014-01-02 12:34:46 -08:00
Kay Ousterhout	5a3c00c958	Removed redundant TaskSetManager.error() function. This function was leftover from a while ago, and now just passes all calls through to the abort() function, so this commit deletes it.	2014-01-02 11:13:58 -08:00
Sean Owen	66d501276b	Suggested small changes to Java code for slightly more standard style, encapsulation and in some cases performance	2014-01-02 16:17:57 +00:00
Prashant Sharma	980afd280a	Merge branch 'scripts-reorg' of github.com:shane-huang/incubator-spark into spark-915-segregate-scripts Conflicts: bin/spark-shell core/pom.xml core/src/main/scala/org/apache/spark/SparkContext.scala core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala core/src/main/scala/org/apache/spark/ui/UIWorkloadGenerator.scala core/src/test/scala/org/apache/spark/DriverSuite.scala python/run-tests sbin/compute-classpath.sh sbin/spark-class sbin/stop-slaves.sh	2014-01-02 17:55:21 +05:30
Matei Zaharia	0f6060733d	Fixed two uses of conf.get with no default value in Mesos	2014-01-01 22:09:42 -05:00
Matei Zaharia	e2c68642c6	Miscellaneous fixes from code review. Also replaced SparkConf.getOrElse with just a "get" that takes a default value, and added getInt, getLong, etc to make code that uses this simpler later on.	2014-01-01 22:03:39 -05:00
Matei Zaharia	45ff8f413d	Merge remote-tracking branch 'apache/master' into conf2 Conflicts: core/src/main/scala/org/apache/spark/SparkContext.scala core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala core/src/main/scala/org/apache/spark/storage/BlockManagerMasterActor.scala	2014-01-01 21:25:00 -05:00
Patrick Wendell	f8d245bdfc	Merge remote-tracking branch 'apache-github/master' into log4j-fix-2 Conflicts: streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala	2014-01-01 16:10:51 -08:00
Andrew Or	92c304fd03	Simplify ExternalAppendOnlyMap on the assumption that the mergeCombiners function is specified	2014-01-01 11:42:33 -08:00
Matei Zaharia	0e5b2adb5c	Merge remote-tracking branch 'apache/master' into conf2 Conflicts: project/SparkBuild.scala	2014-01-01 13:28:54 -05:00
Andrew Or	3bc9e391a3	Merge branch 'master' of github.com:andrewor14/incubator-spark	2013-12-31 20:02:12 -08:00
Andrew Or	83dfa16664	Address Patrick's and Reynold's comments	2013-12-31 20:02:05 -08:00
Reynold Xin	8b8e70ebde	Merge pull request #73 from falaki/ApproximateDistinctCount Approximate distinct count Added countApproxDistinct() to RDD and countApproxDistinctByKey() to PairRDDFunctions to approximately count distinct number of elements and distinct number of values per key, respectively. Both functions use HyperLogLog from stream-lib for counting. Both functions take a parameter that controls the trade-off between accuracy and memory consumption. Also added Scala docs and test suites for both methods.	2013-12-31 17:48:24 -08:00
Aaron Davidson	08302b113a	Rename IntermediateBlockId to TempBlockId	2013-12-31 17:44:15 -08:00
Patrick Wendell	37c43c9dd1	Adding outer checkout when initializing logging	2013-12-31 17:36:56 -08:00
Andrew Or	8bbe08b21e	Merge branch 'master' of github.com:andrewor14/incubator-spark	2013-12-31 17:26:26 -08:00
Andrew Or	53d8d36684	Add support and test for null keys in ExternalAppendOnlyMap Also add safeguard against use of destructively sorted AppendOnlyMap	2013-12-31 17:19:02 -08:00
Hossein Falaki	bee445c927	Made the code more compact and readable	2013-12-31 16:58:18 -08:00
Hossein Falaki	acb0323053	minor improvements	2013-12-31 15:34:26 -08:00
Matei Zaharia	ba9338f104	Merge remote-tracking branch 'apache/master' into conf2 Conflicts: core/src/main/scala/org/apache/spark/rdd/CheckpointRDD.scala streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala	2013-12-31 18:23:14 -05:00
Patrick Wendell	63b411dd86	Merge pull request #238 from ngbinh/upgradeNetty upgrade Netty from 4.0.0.Beta2 to 4.0.13.Final the changes are listed at https://github.com/netty/netty/wiki/New-and-noteworthy	2013-12-31 14:31:28 -08:00
Andrew Or	3ce22df954	Add warning message for spilling	2013-12-31 11:33:10 -08:00
Andrew Or	94ddc91d06	Address Aaron's and Jerry's comments	2013-12-31 10:50:08 -08:00
Patrick Wendell	55b7e2fdff	Merge pull request #289 from tdas/filestream-fix Bug fixes for file input stream and checkpointing - Fixed bugs in the file input stream that led the stream to fail due to transient HDFS errors (listing files while a background thread is deleting them fails and causes errors, etc.) - Updated Spark's CheckpointRDD and Streaming's CheckpointWriter to use SparkContext.hadoopConfiguration, to allow checkpoints to be written to any HDFS compatible store requiring special configuration. - Changed the API of SparkContext.setCheckpointDir() - eliminated the unnecessary 'useExisting' parameter. Now SparkContext will always create a unique subdirectory within the user specified checkpoint directory. This is to ensure that previous checkpoint files are not accidentally overwritten. - Fixed bug where setting checkpoint directory as a relative local path caused the checkpointing to fail.	2013-12-31 10:12:51 -08:00
Tathagata Das	fcd17a1e8e	Fixed comments and long lines based on comments on PR 289.	2013-12-31 02:01:45 -08:00
Patrick Wendell	4abb0c57ab	Tiny typo fix	2013-12-31 00:05:03 -08:00
Patrick Wendell	3c254f2eec	Minor fixes	2013-12-30 23:55:33 -08:00
Aaron Davidson	375d11743c	Add new line at end of file	2013-12-30 23:42:37 -08:00
Patrick Wendell	18181e6c41	Removing initLogging entirely	2013-12-30 23:39:47 -08:00
Aaron Davidson	daa7792ad6	Refactor SamplingSizeTracker into SizeTrackingAppendOnlyMap	2013-12-30 23:39:02 -08:00
Hossein Falaki	c3073b6cf2	Added Java API for countApproxDistinct	2013-12-30 19:31:06 -08:00
Hossein Falaki	ed06500d30	Added Java API for countApproxDistinctByKey	2013-12-30 19:30:42 -08:00
Hossein Falaki	a7de8e9b1c	Renamed countDistinct and countDistinctByKey methods to include Approx	2013-12-30 19:28:03 -08:00
Matei Zaharia	0fa5809768	Updated docs for SparkConf and handled review comments	2013-12-30 22:17:28 -05:00
Hossein Falaki	d50ccc5ca9	Using origin version	2013-12-30 15:08:34 -08:00
Andrew Or	347fafe4fc	Fix CheckpointSuite test fail	2013-12-30 13:10:33 -08:00
Andrew Or	d6e7910d92	Simplify merge logic based on the invariant that all spills contain unique keys	2013-12-30 13:01:00 -08:00
Patrick Wendell	1cbef081e3	Response to Shivaram's review	2013-12-30 12:46:09 -08:00
Andrew Or	2b71ab97c4	Merge pull request from aarondav: Utilize DiskBlockManager pathway for temp file writing This gives us a couple of advantages: - Uses spark.local.dir and randomly selects a directory/disk. - Ensure files are deleted on normal DiskBlockManager cleanup. - Availability of same stats as usual DiskBlockObjectWriter (currently unused). Also enable basic cleanup when iterator is fully drained. Still requires cleanup for operations that fail or don't go through all elements.	2013-12-30 11:01:30 -08:00
Patrick Wendell	50e3b8ec4c	Merge pull request #308 from kayousterhout/stage_naming Changed naming of StageCompleted event to be consistent The rest of the SparkListener events are named with "SparkListener" as the prefix of the name; this commit renames the StageCompleted event to SparkListenerStageCompleted for consistency.	2013-12-30 07:44:26 -08:00
Patrick Wendell	cffe1c1d5c	SPARK-1008: Logging improvements 1. Adds a default log4j file that gets loaded if users haven't specified a log4j file. 2. Isolates use of the tools assembly jar. I found this produced SLF4J warnings after building with SBT (and I've seen similar warnings on the mailing list).	2013-12-29 23:14:33 -08:00
Andrew Or	015a510b0a	Merge branch 'master' of github.com:andrewor14/incubator-spark	2013-12-29 22:03:47 -08:00
Andrew Or	4a014dc59c	Make serializer a parameter to ExternalAppendOnlyMap	2013-12-29 21:55:53 -08:00
Kay Ousterhout	c2c1af39f5	Updated code style according to Patrick's comments	2013-12-29 21:10:08 -08:00
Aaron Davidson	e3cac47e65	Use Comparator instead of Ordering to lower object creation costs	2013-12-29 19:58:37 -08:00
Matei Zaharia	994f080f8a	Properly show Spark properties on web UI, and change app name property	2013-12-29 22:19:33 -05:00
Andrew Or	8fbff9f5d0	Address Aaron's comments	2013-12-29 16:22:44 -08:00
Matei Zaharia	11540b798d	Added tests for SparkConf and fixed a bug Typesafe Config caches system properties the first time it's invoked by default, ignoring later changes unless you do something special	2013-12-29 18:44:06 -05:00
Matei Zaharia	1ee7f5aee4	Fix a change that was lost during merge	2013-12-29 18:15:46 -05:00
Matei Zaharia	0bd1900cbc	Fix a few settings that were being read as system properties after merge	2013-12-29 15:38:46 -05:00
Patrick Wendell	7a99702ce2	Respect supervise option at Master	2013-12-29 12:12:58 -08:00
Matei Zaharia	b4ceed40d6	Merge remote-tracking branch 'origin/master' into conf2 Conflicts: core/src/main/scala/org/apache/spark/SparkContext.scala core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala core/src/main/scala/org/apache/spark/scheduler/cluster/ClusterTaskSetManager.scala core/src/main/scala/org/apache/spark/scheduler/local/LocalScheduler.scala core/src/main/scala/org/apache/spark/util/MetadataCleaner.scala core/src/test/scala/org/apache/spark/scheduler/TaskResultGetterSuite.scala core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala new-yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala streaming/src/test/scala/org/apache/spark/streaming/BasicOperationsSuite.scala streaming/src/test/scala/org/apache/spark/streaming/CheckpointSuite.scala streaming/src/test/scala/org/apache/spark/streaming/InputStreamsSuite.scala streaming/src/test/scala/org/apache/spark/streaming/TestSuiteBase.scala streaming/src/test/scala/org/apache/spark/streaming/WindowOperationsSuite.scala	2013-12-29 15:08:08 -05:00
Patrick Wendell	a8729770f5	Slight change to retry logic	2013-12-29 11:57:57 -08:00
Patrick Wendell	8da1012f9b	TODO clean-up	2013-12-29 11:38:12 -08:00
Patrick Wendell	faefea3fd8	Adding driver ID to submission response	2013-12-29 11:31:10 -08:00
Patrick Wendell	6ffa9bb226	Documentation and adding supervise option	2013-12-29 11:26:56 -08:00
Patrick Wendell	35f6dc252a	Changes to allow fate sharing of drivers/executors and workers.	2013-12-29 11:14:36 -08:00
Matei Zaharia	cd00225db9	Add SparkConf support in Python	2013-12-29 14:03:39 -05:00
Tor Myklebust	d812aeece9	Factor call site reporting out to SparkContext.	2013-12-28 23:21:49 -05:00
Matei Zaharia	20631348d1	Fix other failing tests	2013-12-28 23:17:58 -05:00
Matei Zaharia	5bbe73864e	Fix Executor not getting properties in local mode	2013-12-28 17:31:58 -05:00
Matei Zaharia	a16c52ed1b	Check for SPARK_YARN_MODE through a system property too since it can sometimes be set that way (undoes a change in previous commit)	2013-12-28 17:24:21 -05:00
Matei Zaharia	642029e7f4	Various fixes to configuration code - Got rid of global SparkContext.globalConf - Pass SparkConf to serializers and compression codecs - Made SparkConf public instead of private[spark] - Improved API of SparkContext and SparkConf - Switched executor environment vars to be passed through SparkConf - Fixed some places that were still using system properties - Fixed some tests, though others are still failing This still fails several tests in core, repl and streaming, likely due to properties not being set or cleared correctly (some of the tests run fine in isolation).	2013-12-28 17:13:15 -05:00
Patrick Wendell	7375047d51	Merge pull request #304 from kayousterhout/remove_unused Removed unused failed and causeOfFailure variables (in TaskSetManager)	2013-12-28 13:25:06 -08:00
Matei Zaharia	ad3dfd1531	Merge pull request #307 from kayousterhout/other_failure Removed unused OtherFailure TaskEndReason. The OtherFailure TaskEndReason was added by @mateiz 3 years ago in this commit: `24a1e7f838` Unless I am missing something, it doesn't seem to have been used then, and is not used now, so seems safe for deletion.	2013-12-27 22:10:14 -05:00
Kay Ousterhout	b4619e509b	Changed naming of StageCompleted event to be consistent The rest of the SparkListener events are named with "SparkListener" as the prefix of the name; this commit renames the StageCompleted event to SparkListenerStageCompleted for consistency.	2013-12-27 17:45:20 -08:00
Kay Ousterhout	e17d7518ab	Removed unused OtherFailure TaskEndReason.	2013-12-27 15:51:27 -08:00
Kay Ousterhout	8419148e5f	Remove unused hasPendingTasks methods	2013-12-27 15:19:42 -08:00
Patrick Wendell	c8c8b42a6f	Some notes and TODO about dependencies	2013-12-27 15:13:11 -08:00
Kay Ousterhout	0c71ffe924	Style fixes as per Reynold's review	2013-12-27 12:19:38 -08:00
Kay Ousterhout	8c81068e16	Fixed >100char lines in DAGScheduler.scala	2013-12-27 11:36:54 -08:00
Binh Nguyen	2c5bade4ee	Fix failed unit tests Also clean up a bit.	2013-12-27 11:24:30 -08:00
Kay Ousterhout	baaabcedc9	Removed unused failed and causeOfFailure variables	2013-12-27 11:12:36 -08:00
Aaron Davidson	2a7b3511f4	Add Apache headers	2013-12-27 10:55:16 -08:00
Reynold Xin	7be1e57786	Merge pull request #298 from aarondav/minor Minor: Decrease margin of left side of Log page Before ![before](https://f.cloud.github.com/assets/1400247/1812647/1a4be53e-6e87-11e3-9d5b-f851274be0e9.png) After ![after](https://f.cloud.github.com/assets/1400247/1812648/1ca1ea2c-6e87-11e3-946c-31be9258f450.png) It's a start anyway...	2013-12-26 23:41:40 -10:00
Andrew Or	d0cfbc41e2	Rename spark.shuffle.buffer variables	2013-12-27 00:07:09 -08:00
Andrew Or	8f3175773c	Final cleanup	2013-12-26 23:40:08 -08:00
Aaron Davidson	1dc0440c1a	Use real serializer & manual ordering	2013-12-26 23:40:08 -08:00
Aaron Davidson	0f66b7f2fc	Return efficient iterator if no spillage happened	2013-12-26 23:40:08 -08:00
Andrew Or	ec8c5dc644	Sort AppendOnlyMap in-place	2013-12-26 23:40:08 -08:00
Aaron Davidson	0289eb752a	Allow Product2 rather than just tuple kv pairs	2013-12-26 23:40:07 -08:00
Andrew Or	64b2d54a02	Move maps to util, and refactor more	2013-12-26 23:40:07 -08:00
Aaron Davidson	804beb43be	SamplingSizeTracker + Map + test suite	2013-12-26 23:40:07 -08:00
Andrew Or	7ad4408255	New minor edits	2013-12-26 23:40:07 -08:00
Aaron Davidson	fcc443b3db	Minor cleanup for Scala style	2013-12-26 23:40:07 -08:00
Andrew Or	2a2ca2a661	Add toggle for ExternalAppendOnlyMap in Aggregator and CoGroupedRDD	2013-12-26 23:40:07 -08:00
Andrew Or	28685a4820	Provide for cases when mergeCombiners is not specified in ExternalAppendOnlyMap	2013-12-26 23:40:07 -08:00
Andrew Or	17def8cc11	Refactor ExternalAppendOnlyMap to take in KVC instead of just KV	2013-12-26 23:40:07 -08:00
Andrew Or	6a45ec1972	Working ExternalAppendOnlyMap for both CoGroupedRDDs and Aggregator	2013-12-26 23:40:07 -08:00
Andrew Or	97fbb3ec52	Working ExternalAppendOnlyMap for Aggregator, but not for CoGroupedRDD	2013-12-26 23:40:07 -08:00
Aaron Davidson	4f2fb761b0	Decrease margin of left side of log page	2013-12-26 15:38:45 -08:00
Patrick Wendell	5c1b4f6405	Minor fixes	2013-12-26 14:39:39 -08:00
Tathagata Das	5fde4566ea	Added Apache boilerplate and class docs to PartitionerAwareUnionRDD.	2013-12-26 14:33:37 -08:00
Patrick Wendell	c23d640516	Addressing smaller changes from Aaron's review	2013-12-26 12:38:39 -08:00
Tathagata Das	3579647cdc	Merge branch 'apache-master' into window-improvement	2013-12-26 12:12:10 -08:00
Patrick Wendell	da20270b83	Merge pull request #1 from aarondav/driver Refactor DriverClient to be more Actor-based	2013-12-26 12:11:52 -08:00
Patrick Wendell	a97ad55c45	Removing accidental file	2013-12-26 12:11:28 -08:00
Tathagata Das	c4a54f51b5	Merge branch 'master' into window-improvement	2013-12-26 12:03:11 -08:00
Patrick Wendell	5938cfc153	Updated approach to driver restarting	2013-12-26 12:02:19 -08:00
Mark Hamstra	c529dceaff	Avoid a lump of coal (NPE) in JobProgressListener's stocking.	2013-12-25 23:10:02 -08:00
Tathagata Das	94479673eb	Fixed bug in PartitionAwareUnionRDD	2013-12-26 00:07:45 +00:00
Aaron Davidson	61372b11f4	Refactor DriverClient to be more Actor-based	2013-12-25 10:55:25 -08:00
walker	0af4b4f3e8	Bug fixes for updating the RDD block's memory and disk usage information	2013-12-25 20:07:01 +08:00
Patrick Wendell	bbc362833b	Removing un-used variable	2013-12-25 01:38:57 -08:00
Patrick Wendell	18ad419b52	Small fix from rebase	2013-12-25 01:22:38 -08:00
Patrick Wendell	55f833803a	Minor bug fix	2013-12-25 01:19:25 -08:00
Patrick Wendell	c9c0f745af	Minor style clean-up	2013-12-25 01:19:25 -08:00
Patrick Wendell	b2b7514ba3	Import clean-up (yay Aaron)	2013-12-25 01:19:25 -08:00
Patrick Wendell	d5f23e0083	Adding scheduling and reporting based on cores	2013-12-25 01:19:01 -08:00
Patrick Wendell	760823d393	Adding better option parsing	2013-12-25 01:19:01 -08:00
Patrick Wendell	6a4acc4c2d	Initial cut at driver submission.	2013-12-25 01:19:01 -08:00
Patrick Wendell	1070b566d4	Renaming Client => AppClient	2013-12-25 01:17:01 -08:00
Patrick Wendell	85a344b4f0	Merge pull request #127 from kayousterhout/consolidate_schedulers Deduplicate Local and Cluster schedulers. The code in LocalScheduler/LocalTaskSetManager was nearly identical to the code in ClusterScheduler/ClusterTaskSetManager. The redundancy made updating the schedulers unnecessarily painful and error-prone. This commit combines the two into a single TaskScheduler/TaskSetManager. Unfortunately the diff makes this change look much more invasive than it is -- TaskScheduler.scala is only superficially changed (names updated, overrides removed) from the old ClusterScheduler.scala, and the same with TaskSetManager.scala. Thanks @rxin for suggesting this change!	2013-12-24 16:35:06 -08:00
Binh Nguyen	786f393a98	Fix imports order	2013-12-24 14:59:30 -08:00
Binh Nguyen	9115a5de62	Remove import * and fix some formatting	2013-12-24 14:59:30 -08:00
Binh Nguyen	040dd3ecd5	upgrade Netty from 4.0.0.Beta2 to 4.0.13.Final	2013-12-24 14:58:18 -08:00
Patrick Wendell	c2dd6bcd6e	Merge pull request #279 from aarondav/shuffle-cleanup0 Clean up shuffle files once their metadata is gone Previously, we would only clean the in-memory metadata for consolidated shuffle files. Additionally, fixes a bug where the Metadata Cleaner was ignoring type-specific TTLs.	2013-12-24 14:36:47 -08:00
Kay Ousterhout	1efe3adf56	Responded to Reynold's style comments	2013-12-24 14:18:39 -08:00
Tathagata Das	d4dfab503a	Fixed Python API for sc.setCheckpointDir. Also other fixes based on Reynold's comments on PR 289.	2013-12-24 14:01:13 -08:00
Tathagata Das	9f79fd89dc	Merge branch 'apache-master' into filestream-fix	2013-12-24 11:38:17 -08:00
Prashant Sharma	2573add94c	spark-544, introducing SparkConf and related configuration overhaul.	2013-12-25 00:09:36 +05:30
Matei Zaharia	23a9ae6be3	Merge pull request #277 from tdas/scheduler-update Refactored the streaming scheduler and added StreamingListener interface - Refactored the streaming scheduler for cleaner code. Specifically, the JobManager was renamed to JobScheduler, as it does the actual scheduling of Spark jobs to the SparkContext. The earlier Scheduler was renamed to JobGenerator, as it actually generates the jobs from the DStreams. The JobScheduler starts the JobGenerator. Also, moved all the scheduler related code from spark.streaming to spark.streaming.scheduler package. - Implemented the StreamingListener interface, similar to SparkListener. The streaming version of StatusReportListener prints the batch processing time statistics (for now). Added StreamingListenerSuite to test it. - Refactored streaming TestSuiteBase for deduping code in the other streaming testsuites.	2013-12-24 00:08:48 -05:00
Reynold Xin	11107c9de5	Merge pull request #244 from leftnoteasy/master Added SPARK-968 implementation for review Added SPARK-968 implementation for review	2013-12-23 10:38:20 -08:00
wangda.tan	2f689ba97b	SPARK-968, added executor address showing in aggregated metrics by executors table	2013-12-23 15:03:45 +08:00
Kay Ousterhout	b7bfae1afe	Correctly merged in maxTaskFailures fix	2013-12-22 07:34:44 -08:00
wangda.tan	c979eecdf6	added changes according to comments from rxin	2013-12-22 21:43:15 +08:00
Kay Ousterhout	30186aa264	Renamed ClusterScheduler to TaskSchedulerImpl	2013-12-20 14:58:04 -08:00
Kay Ousterhout	c06945cfe0	Merge remote branch 'upstream/master' into consolidate_schedulers Conflicts: core/src/main/scala/org/apache/spark/scheduler/cluster/ClusterTaskSetManager.scala core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala	2013-12-20 14:39:30 -08:00
Patrick Wendell	0bc57c5767	Merge pull request #280 from aarondav/minor Minor cleanup for standalone scheduler See commit messages	2013-12-20 11:56:54 -08:00
Tathagata Das	61f4bbda0d	Added tests for PartitionerAwareUnionRDD in the CheckpointSuite. Refactored CheckpointSuite to make the tests simpler and more reliable. Added missing test for ZippedRDD.	2013-12-20 00:41:47 -08:00
Patrick Wendell	eca68d4425	Merge pull request #272 from tmyklebu/master Track and report task result serialisation time. - DirectTaskResult now has a ByteBuffer valueBytes instead of a T value. - DirectTaskResult now has a member function T value() that deserialises valueBytes. - Executor serialises value into a ByteBuffer and passes it to DTR's ctor. - Executor tracks the time taken to do so and puts it in a new field in TaskMetrics. - StagePage now reports serialisation time from TaskMetrics along with the other things it reported.	2013-12-19 18:12:22 -08:00
Aaron Davidson	6613ab663d	Fix compiler warning in SparkZooKeeperSession	2013-12-19 17:56:13 -08:00
Aaron Davidson	4d74b899b7	Remove firstApp from the standalone scheduler Master As a lonely child with no one to care for it... we had to put it down.	2013-12-19 17:53:41 -08:00
Aaron Davidson	1ab031eaff	Extraordinarily minor code/comment cleanup	2013-12-19 17:51:29 -08:00
Aaron Davidson	0647ec9757	Clean up shuffle files once their metadata is gone Previously, we would only clean the in-memory metadata for consolidated shuffle files. Additionally, fixes a bug where the Metadata Cleaner was ignoring type-specific TTLs.	2013-12-19 15:40:48 -08:00
Reynold Xin	7990c56375	Merge pull request #276 from shivaram/collectPartition Add collectPartition to JavaRDD interface. This interface is useful for implementing `take` from other language frontends where the data is serialized. Also remove `takePartition` from PythonRDD and use `collectPartition` in rdd.py. Thanks @concretevitamin for the original change and tests.	2013-12-19 13:35:09 -08:00
Tathagata Das	de41c436a0	Merge branch 'scheduler-update' into window-improvement Conflicts: streaming/src/main/scala/org/apache/spark/streaming/dstream/WindowedDStream.scala	2013-12-19 12:05:08 -08:00
Shivaram Venkataraman	9cc3a6d3c0	Add comment explaining collectPartitions's use	2013-12-19 11:49:17 -08:00
Shivaram Venkataraman	d3234f9726	Make collectPartitions take an array of partitions Change the implementation to use runJob instead of PartitionPruningRDD. Also update the unit tests and the python take implementation to use the new interface.	2013-12-19 11:40:34 -08:00
Tathagata Das	984c582487	Merge branch 'scheduler-update' into filestream-fix Conflicts: core/src/main/scala/org/apache/spark/rdd/CheckpointRDD.scala streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala streaming/src/test/scala/org/apache/spark/streaming/CheckpointSuite.scala	2013-12-19 11:20:48 -08:00
Nick Pentreath	a76f53416c	Add toString to Java RDD, and __repr__ to Python RDD	2013-12-19 14:38:20 +02:00
Tathagata Das	ec71b445ad	Minor changes.	2013-12-18 23:39:28 -08:00
Aaron Davidson	293a0af5a1	In experimental clusters we've observed that a 10 second timeout was insufficient, despite having a low number of nodes and relatively small workload (16 nodes, <1.5 TB data). This would cause an entire job to fail at the beginning of the reduce phase. There is no particular reason for this value to be small as a timeout should only occur in an exceptional situation. Also centralized the reading of spark.akka.askTimeout to AkkaUtils (surely this can later be cleaned up to use Typesafe). Finally, deleted some lurking implicits. If anyone can think of a reason they should still be there, please let me know.	2013-12-18 21:42:29 -08:00
Tathagata Das	e93b391d75	Merge branch 'apache-master' into scheduler-update Conflicts: streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala streaming/src/main/scala/org/apache/spark/streaming/dstream/ForEachDStream.scala	2013-12-18 17:51:14 -08:00
Tathagata Das	b80ec05635	Added StatsReportListener to generate processing time statistics across multiple batches.	2013-12-18 15:35:24 -08:00
Shivaram Venkataraman	af0cd6bd27	Add collectPartition to JavaRDD interface. Also remove takePartition from PythonRDD and use collectPartition in rdd.py.	2013-12-18 11:40:07 -08:00
Tor Myklebust	d3b1af4b6c	Add a serialisation time column to the StagePage.	2013-12-18 14:25:56 -05:00
Reynold Xin	9a6864d016	Fixed a performance problem in RDD.top and BoundedPriorityQueue (size in BoundedPriority was actually traversing the entire queue to calculate the size, resulting in bad performance in insertion).	2013-12-17 18:44:39 -08:00
wangda.tan	59e53fa21c	spark-968, changes to avoid an NPE	2013-12-17 17:57:27 +08:00
wangda.tan	36060f4f50	spark-898, changes according to review comments	2013-12-17 17:55:38 +08:00
Tor Myklebust	b2f0329511	Missed a spot; had an objectSer here too.	2013-12-17 00:18:46 -05:00
Tor Myklebust	25fa976580	Merge branch 'master' of git://github.com/apache/incubator-spark	2013-12-16 23:48:37 -05:00
Tor Myklebust	963d6f065a	Incorporate pwendell's code review suggestions.	2013-12-16 23:14:52 -05:00
Reynold Xin	883e034aeb	Merge pull request #245 from gregakespret/task-maxfailures-fix Fix for spark.task.maxFailures not enforced correctly. Docs at http://spark.incubator.apache.org/docs/latest/configuration.html say: ``` spark.task.maxFailures Number of individual task failures before giving up on the job. Should be greater than or equal to 1. Number of allowed retries = this value - 1. ``` Previous implementation worked incorrectly. When for example `spark.task.maxFailures` was set to 1, the job was aborted only after the second task failure, not after the first one.	2013-12-16 14:16:02 -08:00
Tor Myklebust	882d544856	UI to display serialisation time of a stage.	2013-12-16 13:27:03 -05:00
Tor Myklebust	8a397a959b	Track task value serialisation time in TaskMetrics.	2013-12-16 12:07:39 -05:00
wangda.tan	8ab8c6a526	Merge branch 'master' of git://github.com/apache/incubator-spark	2013-12-16 21:45:43 +08:00
Reynold Xin	bad85b051d	Use murmur3 hash for open hashset. (cherry picked from commit 212ff6834515543163aa63a3f4f762ebe641f8ca) Signed-off-by: Ankur Dave <ankurdave@gmail.com>	2013-12-15 17:23:15 -08:00
Josh Rosen	2fd781d347	Merge pull request #249 from ngbinh/partitionInJavaSortByKey Expose numPartitions parameter in JavaPairRDD.sortByKey() This change makes Java and Scala API on sortByKey() the same.	2013-12-14 12:59:37 -08:00
Prashant Sharma	1ae3c0fc5e	Added a comment about ActorRef and ActorSelection difference.	2013-12-14 10:44:24 +05:30
Prashant Sharma	a854cc536d	Review comments on the PR for scala 2.10 migration.	2013-12-13 15:19:51 +05:30
Tathagata Das	097e120c0c	Refactored streaming scheduler and added listener interface. - Refactored Scheduler + JobManager to JobGenerator + JobScheduler and added JobSet for cleaner code. Moved scheduler related code to streaming.scheduler package. - Added StreamingListener trait (similar to SparkListener) to enable gathering of streaming stats like processing times and delays. Added StreamingContext.addListener() to add listeners. - Deduped some code in streaming tests by modifying TestSuiteBase, and added StreamingListenerSuite.	2013-12-12 20:48:02 -08:00
Tathagata Das	5e9ce83d68	Fixed multiple file stream and checkpointing bugs. - Made file stream more robust to transient failures. - Changed Spark.setCheckpointDir API to not have the second 'useExisting' parameter. Spark will always create a unique directory for checkpointing underneath the directory provided to the function. - Fixed bug wrt local relative paths as checkpoint directory. - Made DStream and RDD checkpointing use SparkContext.hadoopConfiguration, so that more HDFS compatible filesystems are supported for checkpointing.	2013-12-11 14:01:36 -08:00
Prashant Sharma	603af51bb5	Merge branch 'master' into akka-bug-fix Conflicts: core/pom.xml core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala pom.xml project/SparkBuild.scala streaming/pom.xml yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala	2013-12-11 10:21:53 +05:30
Binh Nguyen	0b494f7db4	Hook directly to Scala API	2013-12-10 11:17:52 -08:00
Binh Nguyen	e85af50767	Leave default value of numPartitions to Scala code.	2013-12-10 11:04:14 -08:00
Binh Nguyen	c82d4f079b	Use braces to shorten the line.	2013-12-10 01:04:52 -08:00
Binh Nguyen	5013fb64b2	Expose numPartitions parameter in JavaPairRDD.sortByKey() This change makes Java and Scala API on sortByKey() the same.	2013-12-10 00:38:16 -08:00
Prashant Sharma	17db6a9041	Style fixes and addressed review comments at #221	2013-12-10 11:47:16 +05:30
Patrick Wendell	5b74609d97	License headers	2013-12-09 16:41:01 -08:00
Grega Kespret	14a1df6572	Fix for spark.task.maxFailures not enforced correctly.	2013-12-09 10:39:02 +01:00
Aaron Davidson	40f63eb034	Merge master into 127	2013-12-08 11:16:52 -08:00
wangda.tan	850c4b709a	Merge branch 'master' of https://github.com/leftnoteasy/incubator-spark-1	2013-12-09 00:12:46 +08:00
wangda.tan	48e4f2ad14	SPARK-968, In stage UI, add an overview section that shows task stats grouped by executor id	2013-12-09 00:02:59 +08:00
Matei Zaharia	e0392343a0	Merge pull request #190 from markhamstra/Stages4Jobs stageId <--> jobId mapping in DAGScheduler Okay, I think this one is ready to go -- or at least it's ready for review and discussion. It's a carry-over of https://github.com/mesos/spark/pull/842 with updates for the newer job cancellation functionality. The prior discussion still applies. I've actually changed the job cancellation flow a bit: Instead of ``cancelTasks`` going to the TaskScheduler and then ``taskSetFailed`` coming back to the DAGScheduler (resulting in ``abortStage`` there), the DAGScheduler now takes care of figuring out which stages should be cancelled, tells the TaskScheduler to cancel tasks for those stages, then does the cleanup within the DAGScheduler directly without the need for any further prompting by the TaskScheduler. I know of three outstanding issues, each of which can and should, I believe, be handled in follow-up pull requests: 1) https://spark-project.atlassian.net/browse/SPARK-960 2) JobLogger should be re-factored to eliminate duplication 3) Related to 2), the WebUI should also become a consumer of the DAGScheduler's new understanding of the relationship between jobs and stages so that it can display progress indication and the like grouped by job. Right now, some of this information is just being sent out as part of ``SparkListenerJobStart`` messages, but more or different job <--> stage information may need to be exported from the DAGScheduler to meet listeners' needs. Except for the eventQueue -> Actor commit, the rest can be cherry-picked almost cleanly into branch-0.8. A little merging is needed in MapOutputTracker and the DAGScheduler. Merged versions of those files are in `aba2b40ce0` Note that between the recent Actor change in the DAGScheduler and the cleaning up of DAGScheduler data structures on job completion in this PR, some races have been introduced into the DAGSchedulerSuite. 
Those tests usually pass, and I don't think that better-behaved code that doesn't directly inspect DAGScheduler data structures should be seeing any problems, but I'll work on fixing DAGSchedulerSuite as either an addition to this PR or as a separate request. UPDATE: Fixed the race that I introduced. Created a JIRA issue (SPARK-965) for the one that was introduced with the switch to eventProcessorActor in the DAGScheduler.	2013-12-06 11:49:59 -08:00
Matei Zaharia	bfa68609d9	Merge pull request #233 from hsaputra/changecontexttobackend Change the name of input argument in ClusterScheduler#initialize from context to backend. The SchedulerBackend used to be called ClusterSchedulerContext so just want to make small change of the input param in the ClusterScheduler#initialize to reflect this.	2013-12-06 11:04:03 -08:00
Matei Zaharia	3fb302c08d	Merge pull request #205 from kayousterhout/logging Added logging of scheduler delays to UI This commit adds two metrics to the UI: 1) The time to get task results, if they're fetched remotely 2) The scheduler delay. When the scheduler starts getting overwhelmed (because it can't keep up with the rate at which tasks are being submitted), the result is that tasks get delayed on the tail-end: the message from the worker saying that the task has completed ends up in a long queue and takes a while to be processed by the scheduler. This commit records that delay in the UI so that users can tell when the scheduler is becoming the bottleneck.	2013-12-06 11:03:32 -08:00
Matei Zaharia	87676a6af2	Merge pull request #220 from rxin/zippart Memoize preferred locations in ZippedPartitionsBaseRDD so preferred location computation doesn't lead to exponential explosion. This was a problem in GraphX where we have a whole chain of RDDs that are ZippedPartitionsRDD's, and the preferred locations were taking eternity to compute. (cherry picked from commit `e36fe55a03`) Signed-off-by: Reynold Xin <rxin@apache.org>	2013-12-06 11:01:42 -08:00
Aaron Davidson	94b5881ee9	Fix long lines	2013-12-06 00:22:00 -08:00
Aaron Davidson	5a864e3fce	Rename SparkActorSystem to IndestructibleActorSystem	2013-12-06 00:21:43 -08:00
Prashant Sharma	c9cd2af71e	Merge branch 'wip-scala-2.10' into akka-bug-fix	2013-12-06 13:32:15 +05:30
Prashant Sharma	4e70480038	Leftover akka -> akka.tcp changes	2013-12-06 12:29:53 +05:30
Henry Saputra	1cb259cb57	Change the name of input argument in ClusterScheduler#initialize from context to backend. The SchedulerBackend used to be called ClusterSchedulerContext so just want to make small change of the input param in the ClusterScheduler#initialize to reflect this.	2013-12-05 18:50:26 -08:00
Mark Hamstra	aebb123fd3	jobWaiter.synchronized before jobWaiter.wait	2013-12-05 17:16:44 -08:00
Reynold Xin	3fc4534d19	wip delta join.	2013-12-05 14:55:26 -08:00
Patrick Wendell	5d460253d6	Merge pull request #228 from pwendell/master Document missing configs and set shuffle consolidation to false.	2013-12-05 12:31:24 -08:00
Matei Zaharia	72b696156c	Merge pull request #199 from harveyfeng/yarn-2.2 Hadoop 2.2 migration Includes support for the YARN API stabilized in the Hadoop 2.2 release, and a few style patches. Short description for each set of commits: `a98f5a0` - "Misc style changes in the 'yarn' package" `a67ebf4` - "A few more style fixes in the 'yarn' package" Both of these are some minor style changes, such as fixing lines over 100 chars, to the existing YARN code. `ab8652f` - "Add a 'new-yarn' directory ... " Copies everything from `SPARK_HOME/yarn` to `SPARK_HOME/new-yarn`. No actual code changes here. `4f1c3fa` - "Hadoop 2.2 YARN API migration ..." API patches to code in the `SPARK_HOME/new-yarn` directory. There are a few more small style changes mixed in, too. Based on @colorant's Hadoop 2.2 support for the scala-2.10 branch in #141. `a1a1c62` - "Add optional Hadoop 2.2 settings in sbt build ... " If Spark should be built against Hadoop 2.2, then: a) the `org.apache.spark.deploy.yarn` package will be compiled from the `new-yarn` directory. b) Protobuf v2.5 will be used as a Spark dependency, since Hadoop 2.2 depends on it. Also, Spark will be built against a version of Akka v2.0.5 that's built against Protobuf 2.5, named `akka-2.0.5-protobuf-2.5`. The patched Akka is here: https://github.com/harveyfeng/akka/tree/2.0.5-protobuf-2.5, and was published to local Ivy during testing. There's also a new boolean environment variable, `SPARK_IS_NEW_HADOOP`, that users can manually set if their `SPARK_HADOOP_VERSION` specification does not start with `2.2`, which is how the build file tries to detect a 2.2 version. Not sure if this is necessary or done in the best way, though...	2013-12-04 23:33:04 -08:00
Patrick Wendell	b1c6fa1584	Document missing configs and set shuffle consolidation to false.	2013-12-04 18:39:34 -08:00
Patrick Wendell	182f9baeed	Merge pull request #227 from pwendell/master Fix small bug in web UI and minor clean-up. There was a bug where sorting order didn't work correctly for write time metrics. I also cleaned up some earlier code that fixed the same issue for read and write bytes.	2013-12-04 15:52:07 -08:00

3383 commits