Commit graph

15 commits

Author SHA1 Message Date
Matei Zaharia 93a200bc7e Renamed aggregateSplit() to splitRdd(), plus some style fixes 2010-10-23 15:34:03 -07:00
Matei Zaharia 023ed194b4 Fixed some whitespace 2010-10-16 21:21:16 -07:00
Matei Zaharia 0e2adecdab Simplified UnionRDD slightly and added a SparkContext.union method for efficiently union-ing a large number of RDDs 2010-10-16 17:13:52 -07:00
Matei Zaharia 630a982b88 Added a getId method to split to force classes to specify a unique ID
for each split. This replaces the previous method of calling
split.toString, which would produce different results for the same split
each time it is deserialized (because the default implementation returns
the Java object's address).
2010-10-07 17:17:07 -07:00
Justin Ma b3517614d8 Added toString() methods to UnionSplit, SeededSplit and CartesianSplit to
ensure that the proper keys will be generated when they are cached.
2010-10-07 14:38:25 -07:00
Matei Zaharia 9f20b6b433 Added reduceByKey operation for RDDs containing pairs 2010-10-03 20:28:20 -07:00
root 34eccedbf5 Fixed a rather bad bug in HDFS files that has been in for a while:
caching was not working because Split objects did not have a
consistent toString value
2010-10-03 05:06:06 +00:00
Matei Zaharia 7090dea44b Changed printlns to log statements and fixed a bug in run that was causing it to fail on a Mesos cluster 2010-09-28 23:54:29 -07:00
Justin Ma 7a9ff1cc9a - Got rid of 'Split' type parameter in RDD
- Added SampledRDD, SplitRDD and CartesianRDD
- Made Split a class rather than a type parameter
- Added numCores() to Scheduler to help set default level of parallelism
2010-08-31 12:08:09 -07:00
Justin Ma ea8c2785dd now we have sampling with replacement (at least on a per-split basis) 2010-08-18 15:59:35 -07:00
Justin Ma 156bccbe23 HdfsFile.scala: added a try/catch block to exit gracefully for corrupted gzip files
MesosScheduler.scala: formatted the slaveOffer() output to include the serialized task size
RDD.scala: added support for aggregating RDDs on a per-split basis
(aggregateSplit()) as well as for sampling without replacement (sample())
2010-08-18 15:25:57 -07:00
Matei Zaharia b56ed67553 Updated code to work with Nexus->Mesos name change 2010-07-25 23:53:46 -04:00
Matei Zaharia 7d0eae17e3 Merge branch 'dev'
Conflicts:
	src/scala/spark/HdfsFile.scala
	src/scala/spark/NexusScheduler.scala
	src/test/spark/repl/ReplSuite.scala
2010-06-27 15:21:54 -07:00
Matei Zaharia 323571a177 Initial work on union operation. 2010-06-18 12:54:33 -07:00
Matei Zaharia cd247b7d86 Created common RDD superclass for distributed files and parallel arrays.
This also means that parallel arrays now get all the functionality files
used to have (filter, map, reduce, cache, etc).
2010-06-17 12:49:42 -07:00