Reynold Xin
348bcbca1f
Added a method to RDD to expose the ClassManifest.
2012-09-24 16:56:27 -07:00
Matei Zaharia
2d761e3353
Ported performance and FT improvements from latest streaming work
2012-09-12 14:54:40 -07:00
Matei Zaharia
6601a6212b
Added a unit test for cross-partition balancing in sort, and changes to
RangePartitioner to make it pass. It turns out that the first partition
was always kind of small due to how we picked partition boundaries.
2012-08-03 16:40:45 -04:00
Matei Zaharia
400221f851
Merge branch 'dev' of git://github.com/tdas/spark into dev
2012-07-30 13:54:57 -07:00
Tathagata Das
cf429699e1
Updated the new checkpoint RDD to remember partitioning of the original RDD.
2012-07-27 23:16:37 +00:00
Tathagata Das
024905f682
Added BlockRDD and a first-cut version of checkpoint() to RDD class.
2012-07-27 12:00:49 -07:00
Josh Rosen
e23938c3be
Use mapValues() in JavaPairRDD.cogroupResultToJava().
2012-07-22 15:10:01 -07:00
Josh Rosen
01dce3f569
Add Java API
Add distinct() method to RDD.
Fix bug in DoubleRDDFunctions.
2012-07-18 17:34:29 -07:00
Matei Zaharia
c53670b9bf
Various code style fixes, mostly from IntelliJ IDEA
2012-06-29 18:47:12 -07:00
Matei Zaharia
f58da6164e
Merge branch 'master' into dev
2012-06-15 23:47:11 -07:00
Matei Zaharia
a96558caa3
Performance improvements to shuffle operations: in particular, preserve
RDD partitioning in more cases where it's possible, and use iterators
instead of materializing collections when doing joins.
2012-06-09 14:44:18 -07:00
Matei Zaharia
63051dd2bc
Merge in engine improvements from the Spark Streaming project, developed
jointly with Tathagata Das and Haoyuan Li. This commit imports the changes
and ports them to Mesos 0.9, but does not yet pass unit tests due to
various classes not supporting a graceful stop() yet.
2012-06-07 12:45:38 -07:00
Reynold Xin
d0c6e9f639
Made some RDD dependencies transient to reduce the amount of data needed
to be serialized in closure serialization. This can significantly reduce
the task setup time in Shark when the query involves a large number of
(Hive) partitions.
2012-05-16 14:16:55 -07:00
Reynold Xin
e601b3b9e5
Added the ability to set environment variables in piped RDD.
2012-04-17 16:40:56 -07:00
Matei Zaharia
335a6036ad
Converted some tabs to spaces
2012-04-05 11:58:01 -07:00
haoyuan
194c42ab79
Code format.
2012-02-10 08:19:53 -08:00
haoyuan
445e0bb1b5
Format the code a bit more.
2012-02-09 15:50:26 -08:00
haoyuan
651932e703
Format the code as coding style agreed by Matei/TD/Haoyuan
2012-02-09 13:26:23 -08:00
Matei Zaharia
fabcc82528
Merge pull request #103 from edisontung/master
Made improvements to takeSample. Also changed SparkLocalKMeans to SparkKMeans
2012-01-13 19:20:03 -08:00
Edison Tung
1ecc221f84
Fixed bugs
I've fixed the bugs detailed in the diff. One of the bugs was already
fixed on the local file (forgot to commit).
2012-01-09 11:59:52 -08:00
Edison Tung
42f8847a21
Revert de01b6deaaee1b43321e0aac330f4a98c0ea61c6^..HEAD
2011-12-01 13:43:25 -08:00
Edison Tung
de01b6deaa
Fixed bug in RDD
Math.min takes 2 args, not 1. This was not committed earlier for some
reason
2011-12-01 13:34:37 -08:00
Matei Zaharia
22b8fcf632
Added fold() and aggregate() operations that reuse an object to
merge results into rather than requiring a new object allocation
for each element merged. Fixes #95.
2011-11-30 11:37:47 -08:00
Edison Tung
a3bc012af8
added takeSamples method
takeSamples method takes a specified number of samples from the RDD and
outputs it in an array.
2011-11-21 16:38:44 -08:00
Ismael Juma
0fba22b3d2
Fix issue #65: Change @serializable to extends Serializable in 2.9 branch
Note that we use scala.Serializable introduced in Scala 2.9 instead of
java.io.Serializable. Also, case classes inherit from scala.Serializable by
default.
2011-08-02 10:16:33 +01:00
Matei Zaharia
8ea67307b9
Merge branch 'master' into scala-2.9
2011-07-14 14:47:12 -04:00
Matei Zaharia
9ac461d85d
Remove RDD.toString because it looked confusing
2011-07-14 14:39:32 -04:00
Matei Zaharia
38f38dda5b
Merge branch 'master' into scala-2.9
2011-07-14 12:42:02 -04:00
Matei Zaharia
969644df8e
Cleaned up a few issues to do with default parallelism levels. Also
renamed HadoopFileWriter to HadoopWriter (since it's not only for files)
and fixed a bug for lookup().
2011-07-14 12:40:56 -04:00
Matei Zaharia
d0c7958364
Merge branch 'master' into scala-2.9
Conflicts:
core/src/main/scala/spark/HadoopFileWriter.scala
2011-07-13 23:09:33 -04:00
Matei Zaharia
9c0069188b
Updated save code to allow non-file-based OutputFormats and added a test
for file-related stuff
2011-07-13 23:04:06 -04:00
Matei Zaharia
080869c6ef
Merge branch 'master' into scala-2.9
2011-07-13 00:20:08 -04:00
Matei Zaharia
842e14d567
Added mapPartitions operation and a bunch of tests for RDD ops
2011-07-13 00:19:52 -04:00
Matei Zaharia
9b568d37f7
Merge branch 'master' into scala-2.9
Conflicts:
core/src/main/scala/spark/RDD.scala
2011-07-11 22:25:53 -04:00
Matei Zaharia
25c3a7781c
Moved PairRDD and SequenceFileRDD functions to separate source files
2011-07-10 00:06:15 -04:00
Matei Zaharia
393607d5ef
Merge branch 'master' into scala-2.9
2011-06-27 18:08:25 -07:00
Matei Zaharia
2f652f1656
Fix a compile error
2011-06-27 18:07:16 -07:00
Tathagata Das
3f08e1129f
Merge branch 'master' into td-rdd-save
Conflicts:
core/src/main/scala/spark/SparkContext.scala
2011-06-27 13:43:44 -07:00
Tathagata Das
ad842ac823
Merge branch 'master' into td-rdd-save
Conflicts:
core/src/main/scala/spark/RDD.scala
2011-06-27 13:39:11 -07:00
Matei Zaharia
bae8a97968
Merge branch 'master' into scala-2.9
Conflicts:
repl/src/main/scala/spark/repl/SparkInterpreterLoop.scala
2011-06-26 19:22:27 -07:00
Tathagata Das
38f2ba99cc
Further changes to HadoopFileWriter. Implemented ability to save RDDs as SequenceFiles and ObjectFiles.
1> HadoopFileWriter changed to take class types as constructor parameters (no more generic type)
2> Multiple types of RDD.saveAsHadoopFile() implemented to provide more saving options
3> RDD.saveAsSequenceFile() automatically converts basic types to Writable types before saving as SequenceFile
4> RDD.saveAsObjectFile() serializes objects and saves them to an ObjectFile
5> SparkContext.objectFile() opens the saved ObjectFiles
2011-06-24 19:51:21 -07:00
Olivier Grisel
2e3531d8bf
Implemented RDD.leftOuterJoin and RDD.rightOuterJoin
2011-06-24 11:00:51 +02:00
Matei Zaharia
214250016a
Added simple version of lookup
2011-06-20 11:59:16 -07:00
Matei Zaharia
23b42af70a
Merge branch 'master' into scala-2.9
2011-06-19 23:06:21 -07:00
Matei Zaharia
23b1c309fb
Added pipe() operation on RDDs for mapping through a shell command.
2011-06-19 23:05:19 -07:00
Tathagata Das
b5e6645505
Cleaner reimplementation of HadoopFileWriter. Introduced TaskContext.
1> HadoopFileWriter works correctly with task failures
2> It can also take a user-specified JobConf object for configuration settings
3> A Task can now get information like stage ID, split ID, and attempt ID using TaskContext class
4> Minor changes in SparkContext, DAGScheduler and subclasses to allow specification of TaskContext as a parameter
2011-06-16 20:57:57 -07:00
Tathagata Das
869836a2fa
Implemented TaskContext to hold contextual information (jobID, taskID, attemptID) of a task
2011-06-10 19:47:28 -07:00
Tathagata Das
389e56156f
HadoopFileWriter changed to use Hadoop's OutputCommitter
2011-06-09 15:29:22 -07:00
Tathagata Das
24d845833c
First-cut implementation of RDD.SaveAsText
2011-06-05 04:14:43 -07:00
Ismael Juma
82f10bd794
Remove unnecessary toStream calls.
2011-06-01 16:12:42 +01:00