Commit graph

133 commits

Author SHA1 Message Date
Matei Zaharia ccb67ff2ca Merge pull request #425 from stephenh/toDebugString
Add RDD.toDebugString.
2013-01-29 10:44:18 -08:00
Matei Zaharia 64ba6a8c2c Simplify checkpointing code and RDD class a little:
- RDD's getDependencies and getSplits methods are now guaranteed to be
  called only once, so subclasses can safely do computation in there
  without worrying about caching the results.

- The management of a "splits_" variable that is cleared out when we
  checkpoint an RDD is now done in the RDD class.

- A few of the RDD subclasses are simpler.

- CheckpointRDD's compute() method no longer assumes that it is given a
  CheckpointRDDSplit -- it can work just as well on a split from the
  original RDD, because it only looks at its index. This is important
  because things like UnionRDD and ZippedRDD remember the parent's
  splits as part of their own and wouldn't work on checkpointed parents.

- RDD.iterator can now reuse cached data if an RDD is computed before it
  is checkpointed. It seems like it wouldn't do this before (it always
  called iterator() on the CheckpointRDD, which read from HDFS).
2013-01-28 22:30:12 -08:00
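A minimal sketch of the memoization scheme described in the commit above, using hypothetical names (SimpleRDD, splits_, clearSplits) rather than the real Spark classes: the base class caches the result of getSplits so subclasses can do real computation there exactly once, and checkpointing clears the cached value.

```scala
// Hypothetical names only: illustrates caching the result of getSplits in a
// base class so subclasses can compute it safely, at most once.
abstract class SimpleRDD[T] {
  // Subclasses implement this; the base class guarantees at most one call
  // until the cached value is cleared by checkpointing.
  protected def getSplits: Array[Int]

  @volatile private var splits_ : Array[Int] = null

  final def splits: Array[Int] = {
    if (splits_ == null) {
      synchronized {
        if (splits_ == null) {
          splits_ = getSplits
        }
      }
    }
    splits_
  }

  // Called once the RDD has been checkpointed: drop the cached splits so
  // later calls can be served from the checkpoint data instead.
  def clearSplits(): Unit = synchronized { splits_ = null }
}
```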
Stephen Haberman cbf72bffa5 Include name, if set, in RDD.toString(). 2013-01-29 00:20:36 -06:00
Stephen Haberman 3cda14af3f Add number of splits. 2013-01-29 00:12:31 -06:00
Stephen Haberman b45857c965 Add RDD.toDebugString.
Original idea by Nathan Kronenfeld.
2013-01-28 23:56:56 -06:00
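A small usage sketch for the new method, assuming the pre-1.0 `spark` package layout and a local SparkContext; the exact output format is not reproduced here.

```scala
import spark.SparkContext
import spark.SparkContext._

object ToDebugStringExample {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "toDebugString example")
    val counts = sc.parallelize(1 to 100, 4)
      .map(x => (x % 10, 1))
      .reduceByKey(_ + _)
    // Prints the RDD and its recursive dependencies, one per line,
    // including the number of splits of each RDD in the lineage.
    println(counts.toDebugString)
    sc.stop()
  }
}
```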
Matei Zaharia 6ad8540b40 Merge pull request #401 from squito/blockmanager_ui
Blockmanager ui
2013-01-27 15:51:08 -08:00
Imran Rashid 539491bbc3 code reformatting 2013-01-25 09:29:59 -08:00
Charles Reiss 2849931000 Eliminate CacheTracker.
Replaces DAGScheduler's queries of CacheTracker with BlockManagerMaster
queries.

Adds CacheManager to locally coordinate computation of cached RDDs.
2013-01-22 22:19:30 -08:00
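A hedged sketch of the local coordination role described above; SimpleCacheManager and SimpleBlockStore are illustrative stand-ins, not the actual Spark classes. One task computes a cached partition and publishes it to the block store while other tasks asking for the same partition wait for the result.

```scala
import scala.collection.mutable

// Illustrative stand-in for the node-local block store.
trait SimpleBlockStore {
  def get(key: String): Option[Seq[Any]]
  def put(key: String, values: Seq[Any]): Unit
}

// Sketch of per-node coordination: if two tasks ask for the same cached
// partition, one computes and stores it while the other waits, then reads
// the stored copy instead of recomputing.
class SimpleCacheManager(store: SimpleBlockStore) {
  private val loading = new mutable.HashSet[String]   // keys being computed now

  def getOrCompute(key: String)(compute: => Seq[Any]): Seq[Any] = {
    store.get(key).getOrElse {
      val computedElsewhere = loading.synchronized {
        while (loading.contains(key)) loading.wait()
        val hit = store.get(key)
        if (hit.isEmpty) loading.add(key)   // this thread will do the work
        hit
      }
      computedElsewhere.getOrElse {
        try {
          val values = compute
          store.put(key, values)
          values
        } finally {
          loading.synchronized { loading.remove(key); loading.notifyAll() }
        }
      }
    }
  }
}
```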
Imran Rashid 905c720e5e Merge branch 'master' into blockmanager_ui
Conflicts:
	core/src/main/scala/spark/RDD.scala
2013-01-22 12:02:27 -08:00
Tathagata Das 214345ceac Fixed issue https://spark-project.atlassian.net/browse/STREAMING-29, along with updates to doc comments in SparkContext.checkpoint(). 2013-01-19 23:50:17 -08:00
Imran Rashid d98caa0fa0 Merge remote-tracking branch 'dennybritz/blockmanagerUI' into blockmanager_ui
Conflicts:
	core/src/main/scala/spark/RDD.scala
	core/src/main/scala/spark/storage/BlockManagerMaster.scala
	core/src/main/scala/spark/storage/StorageLevel.scala
2013-01-18 18:11:26 -08:00
Tathagata Das cd1521cfdb Merge branch 'master' into streaming
Conflicts:
	core/src/main/scala/spark/rdd/CoGroupedRDD.scala
	core/src/main/scala/spark/rdd/FilteredRDD.scala
	docs/_layouts/global.html
	docs/index.md
	run
2013-01-15 12:08:51 -08:00
Stephen Haberman 4ee6b22775 Merge branch 'master' into tupleBy
Conflicts:
	core/src/test/scala/spark/RDDSuite.scala
2013-01-08 09:10:10 -06:00
Stephen Haberman 8dc06069fe Rename RDD.tupleBy to keyBy. 2013-01-06 15:21:45 -06:00
Stephen Haberman 1fdb6946b5 Add RDD.tupleBy. 2013-01-05 13:07:59 -06:00
Stephen Haberman f4e6b9361f Add RDD.collect(PartialFunction). 2013-01-05 12:14:08 -06:00
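Usage sketches for keyBy and collect(PartialFunction), assuming the pre-1.0 `spark` package and a local SparkContext:

```scala
import spark.SparkContext
import spark.SparkContext._

object KeyByCollectExample {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "keyBy and collect example")
    val words = sc.parallelize(Seq("spark", "rdd", "scala"))

    // keyBy: pair each element with a key derived from it.
    val byLength = words.keyBy(_.length)                      // RDD[(Int, String)]

    // collect(PartialFunction): filter and transform in one pass.
    val shortUpper = words.collect { case w if w.length <= 4 => w.toUpperCase }

    println(byLength.collect().toSeq)
    println(shortUpper.collect().toSeq)
    sc.stop()
  }
}
```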
Tathagata Das d34dba25c2 Merge branch 'mesos' into dev-merge 2013-01-01 15:48:39 -08:00
Josh Rosen f803953998 Raise exception when hashing Java arrays (SPARK-597) 2012-12-31 20:20:11 -08:00
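The motivation behind SPARK-597, sketched: JVM arrays hash and compare by reference identity, so equal-looking arrays would silently land in different partitions if used as shuffle keys, which is why hashing them now raises an exception.

```scala
object ArrayKeyProblem {
  def main(args: Array[String]) {
    val a = Array(1, 2, 3)
    val b = Array(1, 2, 3)
    // JVM arrays hash and compare by reference identity, so using them as
    // shuffle keys would silently put equal arrays in different partitions.
    println(a.hashCode == b.hashCode)   // almost certainly false
    println(a == b)                     // false: reference comparison
    println(a.sameElements(b))          // true: element-wise comparison
  }
}
```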
Tathagata Das 9e644402c1 Improved Jekyll and Scala docs. Made many classes and methods private to remove them from the Scala docs. 2012-12-29 18:31:51 -08:00
Tathagata Das 7c33f76291 Merge branch 'mesos' into dev-merge 2012-12-26 19:19:07 -08:00
Tathagata Das 836042bb9f Merge branch 'dev-checkpoint' of github.com:radlab/spark into dev-merge
Conflicts:
	core/src/main/scala/spark/ParallelCollection.scala
	core/src/main/scala/spark/RDD.scala
	core/src/main/scala/spark/rdd/BlockRDD.scala
	core/src/main/scala/spark/rdd/CartesianRDD.scala
	core/src/main/scala/spark/rdd/CoGroupedRDD.scala
	core/src/main/scala/spark/rdd/CoalescedRDD.scala
	core/src/main/scala/spark/rdd/FilteredRDD.scala
	core/src/main/scala/spark/rdd/FlatMappedRDD.scala
	core/src/main/scala/spark/rdd/GlommedRDD.scala
	core/src/main/scala/spark/rdd/HadoopRDD.scala
	core/src/main/scala/spark/rdd/MapPartitionsRDD.scala
	core/src/main/scala/spark/rdd/MapPartitionsWithSplitRDD.scala
	core/src/main/scala/spark/rdd/MappedRDD.scala
	core/src/main/scala/spark/rdd/PipedRDD.scala
	core/src/main/scala/spark/rdd/SampledRDD.scala
	core/src/main/scala/spark/rdd/ShuffledRDD.scala
	core/src/main/scala/spark/rdd/UnionRDD.scala
	core/src/main/scala/spark/scheduler/ResultTask.scala
	core/src/test/scala/spark/CheckpointSuite.scala
2012-12-26 19:09:01 -08:00
Mark Hamstra 61be8566e2 Allow distinct() to be called without parentheses when using the default number of splits. 2012-12-24 02:36:47 -08:00
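A sketch of the overload pattern that makes the parentheses optional; Numbers is a toy class, not the RDD source, and numSplits is ignored in this sketch.

```scala
// Toy illustration, not the RDD source: an explicit zero-argument overload
// lets callers write data.distinct with no parentheses, while data.distinct(8)
// still chooses a split count. A single method with a default argument would
// not allow dropping the parentheses.
class Numbers(values: Seq[Int], defaultSplits: Int) {
  def distinct(numSplits: Int): Seq[Int] = values.distinct   // numSplits ignored here
  def distinct(): Seq[Int] = distinct(defaultSplits)
}

object DistinctExample {
  def main(args: Array[String]) {
    val data = new Numbers(Seq(1, 1, 2, 3, 3), defaultSplits = 4)
    println(data.distinct)       // no parentheses, uses the default split count
    println(data.distinct(8))
  }
}
```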
Reynold Xin eac566a7f4 Merge branch 'master' of github.com:mesos/spark into dev
Conflicts:
	core/src/main/scala/spark/MapOutputTracker.scala
	core/src/main/scala/spark/PairRDDFunctions.scala
	core/src/main/scala/spark/ParallelCollection.scala
	core/src/main/scala/spark/RDD.scala
	core/src/main/scala/spark/rdd/BlockRDD.scala
	core/src/main/scala/spark/rdd/CartesianRDD.scala
	core/src/main/scala/spark/rdd/CoGroupedRDD.scala
	core/src/main/scala/spark/rdd/CoalescedRDD.scala
	core/src/main/scala/spark/rdd/FilteredRDD.scala
	core/src/main/scala/spark/rdd/FlatMappedRDD.scala
	core/src/main/scala/spark/rdd/GlommedRDD.scala
	core/src/main/scala/spark/rdd/HadoopRDD.scala
	core/src/main/scala/spark/rdd/MapPartitionsRDD.scala
	core/src/main/scala/spark/rdd/MapPartitionsWithSplitRDD.scala
	core/src/main/scala/spark/rdd/MappedRDD.scala
	core/src/main/scala/spark/rdd/PipedRDD.scala
	core/src/main/scala/spark/rdd/SampledRDD.scala
	core/src/main/scala/spark/rdd/ShuffledRDD.scala
	core/src/main/scala/spark/rdd/UnionRDD.scala
	core/src/main/scala/spark/storage/BlockManager.scala
	core/src/main/scala/spark/storage/BlockManagerId.scala
	core/src/main/scala/spark/storage/BlockManagerMaster.scala
	core/src/main/scala/spark/storage/StorageLevel.scala
	core/src/main/scala/spark/util/MetadataCleaner.scala
	core/src/main/scala/spark/util/TimeStampedHashMap.scala
	core/src/test/scala/spark/storage/BlockManagerSuite.scala
	run
2012-12-20 14:53:40 -08:00
Tathagata Das 5184141936 Introduced getSplits, getDependencies, and getPreferredLocations in RDD and RDDCheckpointData. 2012-12-18 13:30:53 -08:00
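An illustrative template of the three methods named above, using hypothetical class names and simplified signatures (the real RDD API works with Split, Dependency, and related types):

```scala
// Hypothetical template: a custom dataset fills in the three methods and the
// base class calls them, so it can later swap in checkpoint data transparently.
abstract class SketchDataset[T] {
  protected def getSplits: Array[Int]
  protected def getDependencies: List[SketchDataset[_]]
  protected def getPreferredLocations(split: Int): Seq[String]
}

class RangeSketchDataset(n: Int, numSplits: Int) extends SketchDataset[Int] {
  protected def getSplits = (0 until numSplits).toArray
  protected def getDependencies = Nil                       // no parent datasets
  protected def getPreferredLocations(split: Int) = Nil     // no locality preference
}
```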
Reynold Xin 4f076e105e SPARK-635: Pass a TaskContext object to compute() interface and use
that to close Hadoop input stream. Incorporated Matei's comments.
2012-12-13 16:41:15 -08:00
Reynold Xin eacb98e900 SPARK-635: Pass a TaskContext object to compute() interface and use that
to close Hadoop input stream.
2012-12-13 15:41:53 -08:00
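A hedged sketch of the callback pattern SPARK-635 introduces; SimpleTaskContext and its methods are illustrative names. compute() receives a per-task context and registers cleanup work, such as closing a Hadoop input stream, to run when the task finishes.

```scala
import scala.collection.mutable.ArrayBuffer

// Illustrative names: a per-task context that collects cleanup callbacks and
// runs them when the task completes, whether or not all input was consumed.
class SimpleTaskContext(val stageId: Int, val splitId: Int) {
  private val onComplete = new ArrayBuffer[() => Unit]

  def addOnCompleteCallback(f: () => Unit): Unit = { onComplete += f }

  def executeOnCompleteCallbacks(): Unit = onComplete.foreach(f => f())
}

object TaskContextSketch {
  def main(args: Array[String]) {
    val context = new SimpleTaskContext(stageId = 0, splitId = 3)
    val stream = new java.io.ByteArrayInputStream(Array[Byte](1, 2, 3))
    // A HadoopRDD-style compute() would register its input stream here so the
    // executor closes it at task completion instead of leaking it.
    context.addOnCompleteCallback(() => { stream.close(); println("stream closed") })
    // ... the task body would read from the stream here ...
    context.executeOnCompleteCallbacks()
  }
}
```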
Tathagata Das 8e74fac215 Made checkpoint data in RDDs optional to further reduce serialized size. 2012-12-11 15:36:12 -08:00
Tathagata Das 746afc2e65 Bunch of bug fixes related to checkpointing in RDDs. RDDCheckpointData object is used to lock all serialization and dependency changes for checkpointing. ResultTask converted to Externalizable and serialized RDD is cached like ShuffleMapTask. 2012-12-10 23:36:37 -08:00
Tathagata Das 21a0852976 Refactored RDD checkpointing to minimize extra fields in RDD class. 2012-12-04 22:10:25 -08:00
Tathagata Das b4dba55f78 Made RDD checkpoint not create a new thread. Fixed bug in detecting when spark.cleaner.delay is insufficient. 2012-12-02 02:03:05 +00:00
Matei Zaharia f86960cba9 Merge pull request #313 from rxin/pde_size_compress
Added a partition preserving flag to MapPartitionsWithSplitRDD.
2012-11-27 22:39:25 -08:00
Matei Zaharia 27e43abd19 Added a zip() operation for RDDs with the same shape (number of
partitions and number of elements in each partition)
2012-11-27 22:27:47 -08:00
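A usage sketch for zip(): the two RDDs here have the same number of partitions and the same number of elements per partition, which the operation requires.

```scala
import spark.SparkContext

object ZipExample {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "zip example")
    val ids    = sc.parallelize(1 to 4, 2)
    val labels = sc.parallelize(Seq("a", "b", "c", "d"), 2)
    // Same number of partitions and elements per partition, so zip is valid.
    val pairs = ids.zip(labels)            // RDD[(Int, String)]
    println(pairs.collect().toSeq)         // (1,a), (2,b), (3,c), (4,d)
    sc.stop()
  }
}
```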
Reynold Xin bd6dd1a3a6 Added a partition preserving flag to MapPartitionsWithSplitRDD. 2012-11-27 19:43:30 -08:00
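An illustrative sketch (toy types, not MapPartitionsWithSplitRDD itself) of what the flag buys: when the caller promises the per-partition function does not change how keys are partitioned, the result can keep the parent's partitioner instead of dropping it.

```scala
// Toy types only: threading a "preserves partitioning" promise through a
// mapPartitions-style operator so the partitioner survives the transformation.
case class Sketch[K, V](partitioner: Option[String],   // name of the partitioner, if any
                        partitions: Seq[Seq[(K, V)]]) {
  def mapPartitionsWithSplit[W](f: (Int, Seq[(K, V)]) => Seq[(K, W)],
                                preservesPartitioning: Boolean = false): Sketch[K, W] =
    Sketch(
      if (preservesPartitioning) partitioner else None,
      partitions.zipWithIndex.map { case (p, i) => f(i, p) }
    )
}
```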
Tathagata Das c97ebf6437 Fixed bug in the number of splits in RDD after checkpointing. Modified reduceByKeyAndWindow (naive) computation from window+reduceByKey to reduceByKey+window+reduceByKey. 2012-11-19 23:22:07 +00:00
Tathagata Das 8a25d530ed Optimized checkpoint writing by reusing FileSystem object. Fixed bug in updating of checkpoint data in DStream where the checkpointed RDDs, upon recovery, were not recognized as checkpointed RDDs and therefore deleted from HDFS. Made InputStreamsSuite more robust to timing delays. 2012-11-13 02:16:28 -08:00
Denny 4a1be7e0db Refactor BlockManager UI and add worker details. 2012-11-12 10:56:35 -08:00
Tathagata Das d154238789 Made checkpointing of the DStream graph work with checkpointing of RDDs. For streams requiring checkpointing of their RDDs, the default checkpoint interval is set to 10 seconds. 2012-11-04 12:12:06 -08:00
Tathagata Das 34e569f40e Added 'synchronized' to RDD serialization to ensure checkpoint-related changes are reflected atomically in the task closure. Added tests to ensure that jobs running on an RDD while checkpointing is in progress do not affect the result of the job. 2012-10-31 00:56:40 -07:00
Tathagata Das 0dcd770fdc Added checkpointing support to all RDDs, along with CheckpointSuite to test checkpointing in them. 2012-10-30 16:09:37 -07:00
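A minimal usage sketch for RDD checkpointing; the checkpoint directory below is a placeholder path.

```scala
import spark.SparkContext

object CheckpointExample {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "checkpoint example")
    sc.setCheckpointDir("/tmp/spark-checkpoints")   // placeholder directory
    val data = sc.parallelize(1 to 1000, 4).map(_ * 2)
    data.checkpoint()        // marked for checkpointing; written after the first job
    println(data.count())    // running a job materializes the checkpoint
    sc.stop()
  }
}
```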
Denny 531ac136bf BlockManager UI. 2012-10-29 14:53:47 -07:00
Tathagata Das ac12abc17f Modified the RDD API to make dependencies a var (so they can be changed to a checkpointed Hadoop RDD) and to route other references to parent RDDs either through dependencies or through a weak reference (to allow finalizing parents once no dependencies refer to them). 2012-10-29 11:55:27 -07:00
Josh Rosen 4775c55641 Change ShuffleFetcher to return an Iterator. 2012-10-13 14:59:20 -07:00
Matei Zaharia b4067cbad4 More doc updates, and moved Serializer to a subpackage. 2012-10-12 18:19:21 -07:00
Matei Zaharia dca496bb77 Document cartesian() operation 2012-10-12 14:46:41 -07:00
Patrick Wendell dc8adbd359 Adding Java documentation 2012-10-11 00:49:03 -07:00
Matei Zaharia ee2fcb2ce6 Added documentation to all the *RDDFunction classes, and moved them into
the spark package to make them more visible. Also documented various
other miscellaneous things in the API.
2012-10-09 18:38:36 -07:00
Andy Konwinski 1d79ff6028 Fixes a typo, adds scaladoc comments to SparkContext constructors. 2012-10-08 22:49:17 -07:00
Patrick Wendell ac310098ef More docs in RDD class 2012-10-08 22:25:11 -07:00
Andy Konwinski bd688940a1 A start on scaladoc for the public APIs. 2012-10-08 21:13:29 -07:00
Matei Zaharia 65113b7e1b Only group elements ten at a time into SequenceFile records in
saveAsObjectFile
2012-10-06 17:14:41 -07:00
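A hedged sketch of the batching idea: serialize elements in groups of ten, producing one small record per group instead of one huge record per partition. toRecords is an illustrative helper, not the saveAsObjectFile implementation.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

object GroupedSerialization {
  // Serialize a partition's elements in batches of ten, yielding one
  // byte-array record per batch (illustrative of the saveAsObjectFile change).
  def toRecords(partition: Iterator[AnyRef], batchSize: Int = 10): Iterator[Array[Byte]] =
    partition.grouped(batchSize).map { batch =>
      val bytes = new ByteArrayOutputStream()
      val out = new ObjectOutputStream(bytes)
      out.writeObject(batch.toArray)
      out.close()
      bytes.toByteArray
    }

  def main(args: Array[String]) {
    val records = toRecords((1 to 25).map(Int.box).iterator)
    println(records.map(_.length).toSeq)   // three byte-array records, covering 10 + 10 + 5 elements
  }
}
```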