Matei Zaharia
93a200bc7e
Renamed aggregateSplit() to splitRdd(), plus some style fixes
2010-10-23 15:34:03 -07:00
Matei Zaharia
023ed194b4
Fixed some whitespace
2010-10-16 21:21:16 -07:00
Matei Zaharia
0e2adecdab
Simplified UnionRDD slightly and added a SparkContext.union method for efficiently union-ing a large number of RDDs
2010-10-16 17:13:52 -07:00
Matei Zaharia
630a982b88
Added a getId method to split to force classes to specify a unique ID
for each split. This replaces the previous method of calling
split.toString, which would produce different results for the same split
each time it is deserialized (because the default implementation returns
the Java object's address).
2010-10-07 17:17:07 -07:00
Justin Ma
b3517614d8
Added toString() methods to UnionSplit, SeededSplit and CartesianSplit to
ensure that the proper keys will be generated when they are cached.
2010-10-07 14:38:25 -07:00
Matei Zaharia
9f20b6b433
Added reduceByKey operation for RDDs containing pairs
2010-10-03 20:28:20 -07:00
root
34eccedbf5
Fixed a rather bad bug in HDFS files that has been in for a while:
caching was not working because Split objects did not have a
consistent toString value
2010-10-03 05:06:06 +00:00
Matei Zaharia
7090dea44b
Changed printlns to log statements and fixed a bug in run that was causing it to fail on a Mesos cluster
2010-09-28 23:54:29 -07:00
Justin Ma
7a9ff1cc9a
- Got rid of 'Split' type parameter in RDD
- Added SampledRDD, SplitRDD and CartesianRDD
- Made Split a class rather than a type parameter
- Added numCores() to Scheduler to help set default level of parallelism
2010-08-31 12:08:09 -07:00
Justin Ma
ea8c2785dd
now we have sampling with replacement (at least on a per-split basis)
2010-08-18 15:59:35 -07:00
Justin Ma
156bccbe23
HdfsFile.scala: added a try/catch block to exit gracefully for corrupted gzip files
MesosScheduler.scala: formatted the slaveOffer() output to include the serialized task size
RDD.scala: added support for aggregating RDDs on a per-split basis
(aggregateSplit()) as well as for sampling without replacement (sample())
2010-08-18 15:25:57 -07:00
Matei Zaharia
b56ed67553
Updated code to work with Nexus->Mesos name change
2010-07-25 23:53:46 -04:00
Matei Zaharia
7d0eae17e3
Merge branch 'dev'
Conflicts:
src/scala/spark/HdfsFile.scala
src/scala/spark/NexusScheduler.scala
src/test/spark/repl/ReplSuite.scala
2010-06-27 15:21:54 -07:00
Matei Zaharia
323571a177
Initial work on union operation.
2010-06-18 12:54:33 -07:00
Matei Zaharia
cd247b7d86
Created common RDD superclass for distributed files and parallel arrays.
This also means that parallel arrays now get all the functionality files
used to have (filter, map, reduce, cache, etc).
2010-06-17 12:49:42 -07:00