Commit graph

2666 commits

Author SHA1 Message Date
Aaron Davidson 08302b113a Rename IntermediateBlockId to TempBlockId 2013-12-31 17:44:15 -08:00
Andrew Or 8bbe08b21e Merge branch 'master' of github.com:andrewor14/incubator-spark 2013-12-31 17:26:26 -08:00
Andrew Or 53d8d36684 Add support and test for null keys in ExternalAppendOnlyMap
Also add safeguard against use of destructively sorted AppendOnlyMap
2013-12-31 17:19:02 -08:00
Andrew Or 3ce22df954 Add warning message for spilling 2013-12-31 11:33:10 -08:00
Andrew Or 94ddc91d06 Address Aaron's and Jerry's comments 2013-12-31 10:50:08 -08:00
Aaron Davidson 375d11743c Add new line at end of file 2013-12-30 23:42:37 -08:00
Aaron Davidson daa7792ad6 Refactor SamplingSizeTracker into SizeTrackingAppendOnlyMap 2013-12-30 23:39:02 -08:00
Andrew Or 347fafe4fc Fix CheckpointSuite test fail 2013-12-30 13:10:33 -08:00
Andrew Or d6e7910d92 Simplify merge logic based on the invariant that all spills contain unique keys 2013-12-30 13:01:00 -08:00
Andrew Or 2b71ab97c4 Merge pull request from aarondav: Utilize DiskBlockManager pathway for temp file writing
This gives us a couple advantages:

- Uses spark.local.dir and randomly selects a directory/disk.
- Ensure files are deleted on normal DiskBlockManager cleanup.
- Availability of same stats as usual DiskBlockObjectWriter (currenty unused).

Also enable basic cleanup when iterator is fully drained.
Still requires cleanup for operations that fail or don't go through all elements.
2013-12-30 11:01:30 -08:00
Andrew Or 015a510b0a Merge branch 'master' of github.com:andrewor14/incubator-spark 2013-12-29 22:03:47 -08:00
Andrew Or 2a48d71528 Add test suite for ExternalAppendOnlyMap 2013-12-29 21:56:13 -08:00
Andrew Or 4a014dc59c Make serializer a parameter to ExternalAppendOnlyMap 2013-12-29 21:55:53 -08:00
Aaron Davidson e3cac47e65 Use Comparator instead of Ordering
lower object creation costs
2013-12-29 19:58:37 -08:00
Andrew Or 8fbff9f5d0 Address Aaron's comments 2013-12-29 16:22:44 -08:00
Aaron Davidson 2a7b3511f4 Add Apache headers 2013-12-27 10:55:16 -08:00
Andrew Or d0cfbc41e2 Rename spark.shuffle.buffer variables 2013-12-27 00:07:09 -08:00
Andrew Or 8f3175773c Final cleanup 2013-12-26 23:40:08 -08:00
Aaron Davidson 1dc0440c1a Use real serializer & manual ordering 2013-12-26 23:40:08 -08:00
Aaron Davidson 0f66b7f2fc Return efficient iterator if no spillage happened 2013-12-26 23:40:08 -08:00
Andrew Or ec8c5dc644 Sort AppendOnlyMap in-place 2013-12-26 23:40:08 -08:00
Aaron Davidson 0289eb752a Allow Product2 rather than just tuple kv pairs 2013-12-26 23:40:07 -08:00
Andrew Or 64b2d54a02 Move maps to util, and refactor more 2013-12-26 23:40:07 -08:00
Aaron Davidson 804beb43be SamplingSizeTracker + Map + test suite 2013-12-26 23:40:07 -08:00
Andrew Or 7ad4408255 New minor edits 2013-12-26 23:40:07 -08:00
Aaron Davidson fcc443b3db Minor cleanup for Scala style 2013-12-26 23:40:07 -08:00
Andrew Or 2a2ca2a661 Add toggle for ExternalAppendOnlyMap in Aggregator and CoGroupedRDD 2013-12-26 23:40:07 -08:00
Andrew Or 28685a4820 Provide for cases when mergeCombiners is not specified in ExternalAppendOnlyMap 2013-12-26 23:40:07 -08:00
Andrew Or 17def8cc11 Refactor ExternalAppendOnlyMap to take in KVC instead of just KV 2013-12-26 23:40:07 -08:00
Andrew Or 6a45ec1972 Working ExternalAppendOnlyMap for both CoGroupedRDDs and Aggregator 2013-12-26 23:40:07 -08:00
Andrew Or 97fbb3ec52 Working ExternalAppendOnlyMap for Aggregator, but not for CoGroupedRDD 2013-12-26 23:40:07 -08:00
Mark Hamstra c529dceaff Avoid a lump of coal (NPE) in JobProgressListener's stocking. 2013-12-25 23:10:02 -08:00
Patrick Wendell 85a344b4f0 Merge pull request #127 from kayousterhout/consolidate_schedulers
Deduplicate Local and Cluster schedulers.

The code in LocalScheduler/LocalTaskSetManager was nearly identical
to the code in ClusterScheduler/ClusterTaskSetManager. The redundancy
made making updating the schedulers unnecessarily painful and error-
prone. This commit combines the two into a single TaskScheduler/
TaskSetManager.

Unfortunately the diff makes this change look much more invasive than it is -- TaskScheduler.scala is only superficially changed (names updated, overrides removed) from the old ClusterScheduler.scala, and the same with
TaskSetManager.scala.

Thanks @rxin for suggesting this change!
2013-12-24 16:35:06 -08:00
Patrick Wendell c2dd6bcd6e Merge pull request #279 from aarondav/shuffle-cleanup0
Clean up shuffle files once their metadata is gone

Previously, we would only clean the in-memory metadata for consolidated shuffle files.

Additionally, fixes a bug where the Metadata Cleaner was ignoring type-specific TTLs.
2013-12-24 14:36:47 -08:00
Kay Ousterhout 1efe3adf56 Responded to Reynold's style comments 2013-12-24 14:18:39 -08:00
Matei Zaharia 23a9ae6be3 Merge pull request #277 from tdas/scheduler-update
Refactored the streaming scheduler and added StreamingListener interface

- Refactored the streaming scheduler for cleaner code. Specifically, the JobManager was renamed to JobScheduler, as it does the actual scheduling of Spark jobs to the SparkContext. The earlier Scheduler was renamed to JobGenerator, as it actually generates the jobs from the DStreams. The JobScheduler starts the JobGenerator. Also, moved all the scheduler related code from spark.streaming to spark.streaming.scheduler package.
- Implemented the StreamingListener interface, similar to SparkListener. The streaming version of StatusReportListener prints the batch processing time statistics (for now). Added StreamingListernerSuite to test it.
- Refactored streaming TestSuiteBase for deduping code in the other streaming testsuites.
2013-12-24 00:08:48 -05:00
Reynold Xin 11107c9de5 Merge pull request #244 from leftnoteasy/master
Added SPARK-968 implementation for review

Added SPARK-968 implementation for review
2013-12-23 10:38:20 -08:00
wangda.tan 2f689ba97b SPARK-968, added executor address showing in aggregated metrics by executors table 2013-12-23 15:03:45 +08:00
Kay Ousterhout b7bfae1afe Correctly merged in maxTaskFailures fix 2013-12-22 07:34:44 -08:00
wangda.tan c979eecdf6 added changes according to comments from rxin 2013-12-22 21:43:15 +08:00
Kay Ousterhout b8ae096a40 Fix build error in test 2013-12-21 23:28:48 -08:00
Kay Ousterhout 30186aa264 Renamed ClusterScheduler to TaskSchedulerImpl 2013-12-20 14:58:04 -08:00
Kay Ousterhout c06945cfe0 Merge remote branch 'upstream/master' into consolidate_schedulers
Conflicts:
	core/src/main/scala/org/apache/spark/scheduler/cluster/ClusterTaskSetManager.scala
	core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
	core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala
2013-12-20 14:39:30 -08:00
Patrick Wendell 0bc57c5767 Merge pull request #280 from aarondav/minor
Minor cleanup for standalone scheduler

See commit messages
2013-12-20 11:56:54 -08:00
Patrick Wendell eca68d4425 Merge pull request #272 from tmyklebu/master
Track and report task result serialisation time.

 - DirectTaskResult now has a ByteBuffer valueBytes instead of a T value.
 - DirectTaskResult now has a member function T value() that deserialises valueBytes.
 - Executor serialises value into a ByteBuffer and passes it to DTR's ctor.
 - Executor tracks the time taken to do so and puts it in a new field in TaskMetrics.
 - StagePage now reports serialisation time from TaskMetrics along with the other things it reported.
2013-12-19 18:12:22 -08:00
Aaron Davidson 6613ab663d Fix compiler warning in SparkZooKeeperSession 2013-12-19 17:56:13 -08:00
Aaron Davidson 4d74b899b7 Remove firstApp from the standalone scheduler Master
As a lonely child with no one to care for it... we had to put it down.
2013-12-19 17:53:41 -08:00
Aaron Davidson 1ab031eaff Extraordinarily minor code/comment cleanup 2013-12-19 17:51:29 -08:00
Aaron Davidson 0647ec9757 Clean up shuffle files once their metadata is gone
Previously, we would only clean the in-memory metadata for consolidated
shuffle files.

Additionally, fixes a bug where the Metadata Cleaner was ignoring type-
specific TTLs.
2013-12-19 15:40:48 -08:00
Reynold Xin 7990c56375 Merge pull request #276 from shivaram/collectPartition
Add collectPartition to JavaRDD interface.

This interface is useful for implementing `take` from other language frontends where the data is serialized. Also remove `takePartition` from PythonRDD and use `collectPartition` in rdd.py.

Thanks @concretevitamin for the original change and tests.
2013-12-19 13:35:09 -08:00