Commit graph

34 commits

Author SHA1 Message Date
Reynold Xin eacb98e900 SPARK-635: Pass a TaskContext object to the compute() interface and use it
to close the Hadoop input stream.
2012-12-13 15:41:53 -08:00
Matei Zaharia 0bd20c63e2 Merge remote-tracking branch 'JoshRosen/shuffle_refactoring' into dev
Conflicts:
	core/src/main/scala/spark/Dependency.scala
	core/src/main/scala/spark/rdd/CoGroupedRDD.scala
	core/src/main/scala/spark/rdd/ShuffledRDD.scala
2012-10-23 22:01:45 -07:00
Thomas Dudziak d9c2a89c57 Support for Hadoop 2 distributions such as cdh4 2012-10-18 16:08:54 -07:00
Josh Rosen 33cd3a0c12 Remove map-side combining from ShuffleMapTask.
This separation of concerns simplifies the 
ShuffleDependency and ShuffledRDD interfaces.

Map-side combining can be performed in a
mapPartitions() call prior to shuffling the RDD.

I don't anticipate this having much of a 
performance impact: in both approaches, each tuple
is hashed twice: once in the bucket partitioning
and once in the combiner's hashtable.  The same
steps are being performed, but in a different
order and through one extra Iterator.
2012-10-13 14:59:20 -07:00
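
As a hedged illustration of the approach the message describes (all names and data below are made up, and the pre-Apache `spark` package layout of this era is assumed), map-side combining can be performed explicitly in a mapPartitions() call before the shuffle:

    // Sketch only: pre-combine each partition locally, then let the shuffle
    // merge the per-partition partial sums. Assumes the old `spark` package;
    // later releases use org.apache.spark.
    import spark.SparkContext
    import spark.SparkContext._               // implicit PairRDDFunctions
    import scala.collection.mutable.HashMap

    object MapSideCombineSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local", "map-side-combine-sketch")
        val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)), 2)

        val preCombined = pairs.mapPartitions { iter =>
          val sums = new HashMap[String, Int]
          for ((k, v) <- iter) sums(k) = sums.getOrElse(k, 0) + v
          sums.iterator                        // local (map-side) combine
        }
        val counts = preCombined.reduceByKey(_ + _)  // shuffle merges partials
        println(counts.collect().toList)
      }
    }
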
Josh Rosen 4775c55641 Change ShuffleFetcher to return an Iterator. 2012-10-13 14:59:20 -07:00
Matei Zaharia ee2fcb2ce6 Added documentation to all the *RDDFunction classes, and moved them into
the spark package to make them more visible. Also documented various other
parts of the API.
2012-10-09 18:38:36 -07:00
Andy Konwinski d7363a6b8a Moves all files in core/src/main/scala/ that have RDD in their name
from that directory to a new core/src/main/scala/rdd directory.
2012-10-05 19:23:45 -07:00
Andy Konwinski e0067da082 Moves all files in core/src/main/scala/ that have RDD in them from
package spark to package spark.rdd and updates all references to them.
2012-10-05 19:23:45 -07:00
Denny 18a1faedf6 Stylistic changes; made Accumulable and Broadcast public 2012-10-02 19:28:37 -07:00
Denny 4d9f4b01af Make classes package private 2012-10-02 19:00:19 -07:00
Matei Zaharia 1ef4f0fbd2 Allow controlling number of splits in sortByKey. 2012-09-26 19:18:47 -07:00
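
A hedged usage sketch of the knob this commit adds (made-up data; the second argument was called numSplits at the time and numPartitions in later releases):

    // Sketch only: ask sortByKey for an explicit number of output splits.
    import spark.SparkContext
    import spark.SparkContext._

    object SortByKeySketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local", "sortByKey-sketch")
        val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3)))
        val sorted = pairs.sortByKey(true, 4)   // ascending, 4 output splits
        println(sorted.collect().toList)
      }
    }
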
Reynold Xin 397d3816e1 Separated ShuffledRDD into multiple classes: RepartitionShuffledRDD,
ShuffledSortedRDD, and ShuffledAggregatedRDD.
2012-09-19 12:31:45 -07:00
Matei Zaharia be622cf867 Formatting 2012-07-11 17:31:44 -07:00
Matei Zaharia e8ae77df24 Added more methods for loading/saving with new Hadoop API 2012-07-11 17:31:33 -07:00
Matei Zaharia e72afdb817 Some refactoring to make cluster scheduler pluggable. 2012-07-06 15:23:26 -07:00
Matei Zaharia f58da6164e Merge branch 'master' into dev 2012-06-15 23:47:11 -07:00
Matei Zaharia a96558caa3 Performance improvements to shuffle operations: in particular, preserve
RDD partitioning in more cases where it's possible, and use iterators
instead of materializing collections when doing joins.
2012-06-09 14:44:18 -07:00
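
One concrete case of "preserve RDD partitioning in more cases where it's possible" (a hedged sketch with made-up names, assuming the pre-Apache `spark` package): two pair RDDs that already share a partitioner can be joined without re-shuffling either side.

    // Sketch only: co-partitioned inputs let join() avoid an extra shuffle.
    import spark.{HashPartitioner, SparkContext}
    import spark.SparkContext._

    object CoPartitionedJoinSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local", "join-sketch")
        val part = new HashPartitioner(4)
        // Partition both sides the same way up front...
        val users  = sc.parallelize(Seq((1, "alice"), (2, "bob"))).partitionBy(part)
        val orders = sc.parallelize(Seq((1, 9.99), (2, 4.50))).partitionBy(part)
        // ...so join() sees co-partitioned inputs and does not shuffle again.
        val joined = users.join(orders)
        println(joined.collect().toList)
      }
    }
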
Matei Zaharia 63051dd2bc Merge in engine improvements from the Spark Streaming project, developed
jointly with Tathagata Das and Haoyuan Li. This commit imports the changes
and ports them to Mesos 0.9, but does not yet pass unit tests due to
various classes not supporting a graceful stop() yet.
2012-06-07 12:45:38 -07:00
Matei Zaharia 048276799a Commit task outputs to Hadoop-supported storage systems in parallel on the
cluster instead of on the master. Fixes #110.
2012-06-06 16:46:53 -07:00
Reynold Xin 42dcdbcb2f Removed the extra spaces in OrderedRDDFunctions and SortedRDD. 2012-03-29 15:21:57 -07:00
Antonio 620798161b Added fixes to sorting 2012-02-13 00:07:39 -08:00
Antonio e93f622665 Added sorting by key for pair RDDs 2012-02-11 00:56:28 -08:00
haoyuan 194c42ab79 Code formatting. 2012-02-10 08:19:53 -08:00
Charles Reiss 02d43e6986 Add new Hadoop API writing support. 2011-12-01 14:01:28 -08:00
root 3a0e6c4363 Miscellaneous fixes:
- Executor should initialize logging properly
- groupByKey should allow custom partitioner
2011-10-17 18:07:35 +00:00
Ankur Dave 2d7057bf5d Implement PairRDDFunctions.partitionBy 2011-10-09 15:52:09 -07:00
Ankur Dave 06637cb69e Fix PairRDDFunctions.groupWith partitioning
This commit fixes a bug in groupWith that was causing it to destroy
partitioning information. It replaces a call to map with a call to
mapValues, which preserves partitioning.
2011-10-09 15:48:46 -07:00
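
A hedged sketch of the pattern behind this fix (made-up data, pre-Apache `spark` package assumed): mapValues keeps the parent RDD's partitioner, while an equivalent map() discards it, so downstream co-partitioned operations would have to shuffle again.

    // Sketch only: mapValues preserves partitioning, map() does not.
    import spark.SparkContext
    import spark.SparkContext._

    object MapValuesSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local", "mapValues-sketch")
        val grouped = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3))).groupByKey(4)
        val viaMap       = grouped.map { case (k, vs) => (k, vs.size) } // partitioner: None
        val viaMapValues = grouped.mapValues(_.size)                    // partitioner: kept
        println((viaMap.partitioner, viaMapValues.partitioner))
      }
    }
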
Ankur Dave 2911a783d6 Add custom partitioner support to PairRDDFunctions.combineByKey 2011-10-09 15:47:20 -07:00
Ismael Juma 0fba22b3d2 Fix issue #65: Change @serializable to extends Serializable in 2.9 branch
Note that we use scala.Serializable introduced in Scala 2.9 instead of
java.io.Serializable. Also, case classes inherit from scala.Serializable by
default.
2011-08-02 10:16:33 +01:00
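
A minimal before/after sketch of the change on a made-up class:

    // Scala 2.8 style removed by this fix:
    //   @serializable class TaskResultHolder(val value: Int)
    // Scala 2.9 style: extend scala.Serializable explicitly
    // (case classes already inherit it by default).
    class TaskResultHolder(val value: Int) extends Serializable
    case class Box(value: Int)   // serializable with no annotation needed
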
Matei Zaharia 38f38dda5b Merge branch 'master' into scala-2.9 2011-07-14 12:42:02 -04:00
Matei Zaharia 969644df8e Cleaned up a few issues to do with default parallelism levels. Also
renamed HadoopFileWriter to HadoopWriter (since it's not only for files)
and fixed a bug in lookup().
2011-07-14 12:40:56 -04:00
Matei Zaharia d0c7958364 Merge branch 'master' into scala-2.9
Conflicts:
	core/src/main/scala/spark/HadoopFileWriter.scala
2011-07-13 23:09:33 -04:00
Matei Zaharia 9c0069188b Updated save code to allow non-file-based OutputFormats and added a test
for file-related stuff
2011-07-13 23:04:06 -04:00
Matei Zaharia 25c3a7781c Moved PairRDD and SequenceFileRDD functions to separate source files 2011-07-10 00:06:15 -04:00