Stephen Haberman
1c67c7dfd1
Add a shuffle parameter to coalesce.
...
This is useful for when you want just 1 output file (part-00000) but
still up the upstream RDD to be computed in parallel.
2013-03-22 08:54:44 -05:00
Stephen Haberman
4632c45af1
Finished subtractByKeys.
2013-03-14 10:35:34 -05:00
Stephen Haberman
63fe225587
Simplify SubtractedRDD in preparation from subtractByKey.
2013-03-13 17:17:34 -05:00
Stephen Haberman
44032bc476
Merge branch 'master' into bettersplits
...
Conflicts:
core/src/main/scala/spark/RDD.scala
core/src/main/scala/spark/scheduler/cluster/StandaloneSchedulerBackend.scala
core/src/test/scala/spark/ShuffleSuite.scala
2013-02-24 22:08:14 -06:00
Stephen Haberman
f442e7d83c
Update for split->partition rename.
2013-02-24 00:27:14 -06:00
Stephen Haberman
cec87a0653
Merge branch 'master' into subtract
2013-02-23 23:27:55 -06:00
Matei Zaharia
06e5e6627f
Renamed "splits" to "partitions"
2013-02-17 22:13:26 -08:00
Stephen Haberman
924f47dd11
Add RDD.subtract.
...
Instead of reusing the cogroup primitive, this adds a SubtractedRDD
that knows it only needs to keep rdd1's values (per split) in memory.
2013-02-16 13:38:42 -06:00
Stephen Haberman
6cd68c31cb
Update default.parallelism docs, have StandaloneSchedulerBackend use it.
...
Only brand new RDDs (e.g. parallelize and makeRDD) now use default
parallelism, everything else uses their largest parent's partitioner
or partition size.
2013-02-16 00:29:11 -06:00
Matei Zaharia
ea08537143
Fixed an exponential recursion that could happen with doCheckpoint due
...
to lack of memoization
2013-02-11 13:23:50 -08:00
Matei Zaharia
f750daa510
Merge pull request #452 from stephenh/misc
...
Add RDD.coalesce, clean up some RDDs, other misc.
2013-02-09 18:12:56 -08:00
Stephen Haberman
da52b16b38
Remove RDD.coalesce default arguments.
2013-02-09 10:11:54 -06:00
Mark Hamstra
b8863a79d3
Merge branch 'master' of https://github.com/mesos/spark into commutative
...
Conflicts:
core/src/main/scala/spark/RDD.scala
2013-02-08 18:26:00 -08:00
Mark Hamstra
934a53c8b6
Change docs on 'reduce' since the merging of local reduces no longer preserves
...
ordering, so the reduce function must also be commutative.
2013-02-05 22:19:58 -08:00
Stephen Haberman
a9c8d53cfa
Clean up RDDs, mainly to use getSplits.
...
Also made sure clearDependencies() was calling super, to ensure
the getSplits/getDependencies vars in the RDD base class get
cleaned up.
2013-02-05 22:16:59 -06:00
Stephen Haberman
f2bc748013
Add RDD.coalesce.
2013-02-05 21:23:36 -06:00
Matei Zaharia
8b3041c723
Reduced the memory usage of reduce and similar operations
...
These operations used to wait for all the results to be available in an
array on the driver program before merging them. They now merge values
incrementally as they arrive.
2013-02-01 15:38:42 -08:00
Matei Zaharia
9970926ede
formatting
2013-02-01 14:07:34 -08:00
Matei Zaharia
ccb67ff2ca
Merge pull request #425 from stephenh/toDebugString
...
Add RDD.toDebugString.
2013-01-29 10:44:18 -08:00
Matei Zaharia
64ba6a8c2c
Simplify checkpointing code and RDD class a little:
...
- RDD's getDependencies and getSplits methods are now guaranteed to be
called only once, so subclasses can safely do computation in there
without worrying about caching the results.
- The management of a "splits_" variable that is cleared out when we
checkpoint an RDD is now done in the RDD class.
- A few of the RDD subclasses are simpler.
- CheckpointRDD's compute() method no longer assumes that it is given a
CheckpointRDDSplit -- it can work just as well on a split from the
original RDD, because it only looks at its index. This is important
because things like UnionRDD and ZippedRDD remember the parent's
splits as part of their own and wouldn't work on checkpointed parents.
- RDD.iterator can now reuse cached data if an RDD is computed before it
is checkpointed. It seems like it wouldn't do this before (it always
called iterator() on the CheckpointRDD, which read from HDFS).
2013-01-28 22:30:12 -08:00
Stephen Haberman
cbf72bffa5
Include name, if set, in RDD.toString().
2013-01-29 00:20:36 -06:00
Stephen Haberman
3cda14af3f
Add number of splits.
2013-01-29 00:12:31 -06:00
Stephen Haberman
b45857c965
Add RDD.toDebugString.
...
Original idea by Nathan Kronenfeld.
2013-01-28 23:56:56 -06:00
Matei Zaharia
6ad8540b40
Merge pull request #401 from squito/blockmanager_ui
...
Blockmanager ui
2013-01-27 15:51:08 -08:00
Imran Rashid
539491bbc3
code reformatting
2013-01-25 09:29:59 -08:00
Charles Reiss
2849931000
Eliminate CacheTracker.
...
Replaces DAGScheduler's queries of CacheTracker with BlockManagerMaster
queries.
Adds CacheManager to locally coordinate computation of cached RDDs.
2013-01-22 22:19:30 -08:00
Imran Rashid
905c720e5e
Merge branch 'master' into blockmanager_ui
...
Conflicts:
core/src/main/scala/spark/RDD.scala
2013-01-22 12:02:27 -08:00
Tathagata Das
214345ceac
Fixed issue https://spark-project.atlassian.net/browse/STREAMING-29 , along with updates to doc comments in SparkContext.checkpoint().
2013-01-19 23:50:17 -08:00
Imran Rashid
d98caa0fa0
Merge remote-tracking branch 'dennybritz/blockmanagerUI' into blockmanager_ui
...
Conflicts:
core/src/main/scala/spark/RDD.scala
core/src/main/scala/spark/storage/BlockManagerMaster.scala
core/src/main/scala/spark/storage/StorageLevel.scala
2013-01-18 18:11:26 -08:00
Tathagata Das
cd1521cfdb
Merge branch 'master' into streaming
...
Conflicts:
core/src/main/scala/spark/rdd/CoGroupedRDD.scala
core/src/main/scala/spark/rdd/FilteredRDD.scala
docs/_layouts/global.html
docs/index.md
run
2013-01-15 12:08:51 -08:00
Stephen Haberman
4ee6b22775
Merge branch 'master' into tupleBy
...
Conflicts:
core/src/test/scala/spark/RDDSuite.scala
2013-01-08 09:10:10 -06:00
Stephen Haberman
8dc06069fe
Rename RDD.tupleBy to keyBy.
2013-01-06 15:21:45 -06:00
Stephen Haberman
1fdb6946b5
Add RDD.tupleBy.
2013-01-05 13:07:59 -06:00
Stephen Haberman
f4e6b9361f
Add RDD.collect(PartialFunction).
2013-01-05 12:14:08 -06:00
Tathagata Das
d34dba25c2
Merge branch 'mesos' into dev-merge
2013-01-01 15:48:39 -08:00
Josh Rosen
f803953998
Raise exception when hashing Java arrays (SPARK-597)
2012-12-31 20:20:11 -08:00
Tathagata Das
9e644402c1
Improved jekyll and scala docs. Made many classes and method private to remove them from scala docs.
2012-12-29 18:31:51 -08:00
Tathagata Das
7c33f76291
Merge branch 'mesos' into dev-merge
2012-12-26 19:19:07 -08:00
Tathagata Das
836042bb9f
Merge branch 'dev-checkpoint' of github.com:radlab/spark into dev-merge
...
Conflicts:
core/src/main/scala/spark/ParallelCollection.scala
core/src/main/scala/spark/RDD.scala
core/src/main/scala/spark/rdd/BlockRDD.scala
core/src/main/scala/spark/rdd/CartesianRDD.scala
core/src/main/scala/spark/rdd/CoGroupedRDD.scala
core/src/main/scala/spark/rdd/CoalescedRDD.scala
core/src/main/scala/spark/rdd/FilteredRDD.scala
core/src/main/scala/spark/rdd/FlatMappedRDD.scala
core/src/main/scala/spark/rdd/GlommedRDD.scala
core/src/main/scala/spark/rdd/HadoopRDD.scala
core/src/main/scala/spark/rdd/MapPartitionsRDD.scala
core/src/main/scala/spark/rdd/MapPartitionsWithSplitRDD.scala
core/src/main/scala/spark/rdd/MappedRDD.scala
core/src/main/scala/spark/rdd/PipedRDD.scala
core/src/main/scala/spark/rdd/SampledRDD.scala
core/src/main/scala/spark/rdd/ShuffledRDD.scala
core/src/main/scala/spark/rdd/UnionRDD.scala
core/src/main/scala/spark/scheduler/ResultTask.scala
core/src/test/scala/spark/CheckpointSuite.scala
2012-12-26 19:09:01 -08:00
Mark Hamstra
61be8566e2
Allow distinct() to be called without parentheses when using the default number of splits.
2012-12-24 02:36:47 -08:00
Reynold Xin
eac566a7f4
Merge branch 'master' of github.com:mesos/spark into dev
...
Conflicts:
core/src/main/scala/spark/MapOutputTracker.scala
core/src/main/scala/spark/PairRDDFunctions.scala
core/src/main/scala/spark/ParallelCollection.scala
core/src/main/scala/spark/RDD.scala
core/src/main/scala/spark/rdd/BlockRDD.scala
core/src/main/scala/spark/rdd/CartesianRDD.scala
core/src/main/scala/spark/rdd/CoGroupedRDD.scala
core/src/main/scala/spark/rdd/CoalescedRDD.scala
core/src/main/scala/spark/rdd/FilteredRDD.scala
core/src/main/scala/spark/rdd/FlatMappedRDD.scala
core/src/main/scala/spark/rdd/GlommedRDD.scala
core/src/main/scala/spark/rdd/HadoopRDD.scala
core/src/main/scala/spark/rdd/MapPartitionsRDD.scala
core/src/main/scala/spark/rdd/MapPartitionsWithSplitRDD.scala
core/src/main/scala/spark/rdd/MappedRDD.scala
core/src/main/scala/spark/rdd/PipedRDD.scala
core/src/main/scala/spark/rdd/SampledRDD.scala
core/src/main/scala/spark/rdd/ShuffledRDD.scala
core/src/main/scala/spark/rdd/UnionRDD.scala
core/src/main/scala/spark/storage/BlockManager.scala
core/src/main/scala/spark/storage/BlockManagerId.scala
core/src/main/scala/spark/storage/BlockManagerMaster.scala
core/src/main/scala/spark/storage/StorageLevel.scala
core/src/main/scala/spark/util/MetadataCleaner.scala
core/src/main/scala/spark/util/TimeStampedHashMap.scala
core/src/test/scala/spark/storage/BlockManagerSuite.scala
run
2012-12-20 14:53:40 -08:00
Tathagata Das
5184141936
Introduced getSpits, getDependencies, and getPreferredLocations in RDD and RDDCheckpointData.
2012-12-18 13:30:53 -08:00
Reynold Xin
4f076e105e
SPARK-635: Pass a TaskContext object to compute() interface and use
...
that to close Hadoop input stream. Incorporated Matei's command.
2012-12-13 16:41:15 -08:00
Reynold Xin
eacb98e900
SPARK-635: Pass a TaskContext object to compute() interface and use that
...
to close Hadoop input stream.
2012-12-13 15:41:53 -08:00
Tathagata Das
8e74fac215
Made checkpoint data in RDDs optional to further reduce serialized size.
2012-12-11 15:36:12 -08:00
Tathagata Das
746afc2e65
Bunch of bug fixes related to checkpointing in RDDs. RDDCheckpointData object is used to lock all serialization and dependency changes for checkpointing. ResultTask converted to Externalizable and serialized RDD is cached like ShuffleMapTask.
2012-12-10 23:36:37 -08:00
Tathagata Das
21a0852976
Refactored RDD checkpointing to minimize extra fields in RDD class.
2012-12-04 22:10:25 -08:00
Tathagata Das
b4dba55f78
Made RDD checkpoint not create a new thread. Fixed bug in detecting when spark.cleaner.delay is insufficient.
2012-12-02 02:03:05 +00:00
Matei Zaharia
f86960cba9
Merge pull request #313 from rxin/pde_size_compress
...
Added a partition preserving flag to MapPartitionsWithSplitRDD.
2012-11-27 22:39:25 -08:00
Matei Zaharia
27e43abd19
Added a zip() operation for RDDs with the same shape (number of
...
partitions and number of elements in each partition)
2012-11-27 22:27:47 -08:00