Mark Hamstra
61be8566e2
Allow distinct() to be called without parentheses when using the default number of splits.
2012-12-24 02:36:47 -08:00
Reynold Xin
4f076e105e
SPARK-635: Pass a TaskContext object to compute() interface and use
...
that to close Hadoop input stream. Incorporated Matei's command.
2012-12-13 16:41:15 -08:00
Reynold Xin
eacb98e900
SPARK-635: Pass a TaskContext object to compute() interface and use that
...
to close Hadoop input stream.
2012-12-13 15:41:53 -08:00
Matei Zaharia
f86960cba9
Merge pull request #313 from rxin/pde_size_compress
...
Added a partition preserving flag to MapPartitionsWithSplitRDD.
2012-11-27 22:39:25 -08:00
Matei Zaharia
27e43abd19
Added a zip() operation for RDDs with the same shape (number of
...
partitions and number of elements in each partition)
2012-11-27 22:27:47 -08:00
Reynold Xin
bd6dd1a3a6
Added a partition preserving flag to MapPartitionsWithSplitRDD.
2012-11-27 19:43:30 -08:00
Josh Rosen
4775c55641
Change ShuffleFetcher to return an Iterator.
2012-10-13 14:59:20 -07:00
Matei Zaharia
b4067cbad4
More doc updates, and moved Serializer to a subpackage.
2012-10-12 18:19:21 -07:00
Matei Zaharia
dca496bb77
Document cartesian() operation
2012-10-12 14:46:41 -07:00
Patrick Wendell
dc8adbd359
Adding Java documentation
2012-10-11 00:49:03 -07:00
Matei Zaharia
ee2fcb2ce6
Added documentation to all the *RDDFunction classes, and moved them into
...
the spark package to make them more visible. Also documented various
other miscellaneous things in the API.
2012-10-09 18:38:36 -07:00
Andy Konwinski
1d79ff6028
Fixes a typo, adds scaladoc comments to SparkContext constructors.
2012-10-08 22:49:17 -07:00
Patrick Wendell
ac310098ef
More docs in RDD class
2012-10-08 22:25:11 -07:00
Andy Konwinski
bd688940a1
A start on scaladoc for the public APIs.
2012-10-08 21:13:29 -07:00
Matei Zaharia
65113b7e1b
Only group elements ten at a time into SequenceFile records in
...
saveAsObjectFile
2012-10-06 17:14:41 -07:00
Andy Konwinski
a242cdd0a6
Factor subclasses of RDD out of RDD.scala into their own classes
...
in the rdd package.
2012-10-05 19:53:54 -07:00
Andy Konwinski
e0067da082
Moves all files in core/src/main/scala/ that have RDD in them from
...
package spark to package spark.rdd and updates all references to them.
2012-10-05 19:23:45 -07:00
Matei Zaharia
8c82f43db3
Scaladoc documentation for some core Spark functionality
2012-10-04 22:59:36 -07:00
Matei Zaharia
6cf5dffc72
Make more stuff private[spark]
2012-10-02 22:28:55 -07:00
Matei Zaharia
802aa8aef9
Some bug fixes and logging fixes for broadcast.
2012-10-01 15:20:42 -07:00
Matei Zaharia
53f90d0f0e
Use underscores instead of colons in RDD IDs
2012-10-01 10:48:53 -07:00
Matei Zaharia
2f11e3c285
Merge pull request #227 from JoshRosen/fix/distinct_numsplits
...
Allow controlling number of splits in distinct().
2012-09-28 23:57:24 -07:00
Josh Rosen
8654165e69
Use null as dummy value in distinct().
2012-09-28 23:55:17 -07:00
Josh Rosen
37c199bbb0
Allow controlling number of splits in distinct().
2012-09-28 23:44:19 -07:00
Matei Zaharia
3d7267999d
Print and track user call sites in more places in Spark
2012-09-28 17:42:00 -07:00
Matei Zaharia
9f6efbf06a
Merge pull request #225 from pwendell/dev
...
Log message which records RDD origin
2012-09-28 16:28:07 -07:00
Patrick Wendell
9fc78f8f29
Fixing some whitespace issues
2012-09-28 16:05:50 -07:00
Patrick Wendell
bc909c2903
Changes based on Matei's comments
2012-09-28 16:04:36 -07:00
Patrick Wendell
c387e40fb1
Log message which records RDD origin
...
This adds tracking to determine the "origin" of an RDD. Origin is defined by
the boundary between the user's code and the spark code, during an RDD's
instantiation. It is meant to help users understand where a Spark RDD is
coming from in their code.
This patch also logs origin data when stages are submitted to the scheduler.
Finally, it adds a new log message to fix an inconsitency in the way that
dependent stages (those missing parents) and independent stages (those
without) are logged during submission.
2012-09-28 15:51:46 -07:00
Matei Zaharia
7bcb08cef5
Renamed storage levels to something cleaner; fixes #223 .
2012-09-27 17:50:59 -07:00
Reynold Xin
1ad1331a34
Added MapPartitionsWithSplitRDD.
2012-09-26 17:11:28 -07:00
Matei Zaharia
051785c7e6
Several fixes to sampling issues pointed out by Henry Milner:
...
- takeSample was biased towards earlier partitions
- There were some range errors in takeSample
- SampledRDDs with replacement didn't produce appropriate counts
across partitions (we took exactly frac of each one)
2012-09-25 21:46:58 -07:00
Reynold Xin
7a4cd92861
Renamed RDD.manifest to RDD.elementClassManifest
2012-09-24 23:42:33 -07:00
Reynold Xin
348bcbca1f
Added a method to RDD to expose the ClassManifest.
2012-09-24 16:56:27 -07:00
Matei Zaharia
2d761e3353
Ported performance and FT improvements from latest streaming work
2012-09-12 14:54:40 -07:00
Matei Zaharia
6601a6212b
Added a unit test for cross-partition balancing in sort, and changes to
...
RangePartitioner to make it pass. It turns out that the first partition
was always kind of small due to how we picked partition boundaries.
2012-08-03 16:40:45 -04:00
Matei Zaharia
400221f851
Merge branch 'dev' of git://github.com/tdas/spark into dev
2012-07-30 13:54:57 -07:00
Tathagata Das
cf429699e1
Updated the new checkpoint RDD to remember partitioning of the original RDD.
2012-07-27 23:16:37 +00:00
Tathagata Das
024905f682
Added BlockRDD and a first-cut version of checkpoint() to RDD class.
2012-07-27 12:00:49 -07:00
Josh Rosen
e23938c3be
Use mapValues() in JavaPairRDD.cogroupResultToJava().
2012-07-22 15:10:01 -07:00
Josh Rosen
01dce3f569
Add Java API
...
Add distinct() method to RDD.
Fix bug in DoubleRDDFunctions.
2012-07-18 17:34:29 -07:00
Matei Zaharia
c53670b9bf
Various code style fixes, mostly from IntelliJ IDEA
2012-06-29 18:47:12 -07:00
Matei Zaharia
f58da6164e
Merge branch 'master' into dev
2012-06-15 23:47:11 -07:00
Matei Zaharia
a96558caa3
Performance improvements to shuffle operations: in particular, preserve
...
RDD partitioning in more cases where it's possible, and use iterators
instead of materializing collections when doing joins.
2012-06-09 14:44:18 -07:00
Matei Zaharia
63051dd2bc
Merge in engine improvements from the Spark Streaming project, developed
...
jointly with Tathagata Das and Haoyuan Li. This commit imports the changes
and ports them to Mesos 0.9, but does not yet pass unit tests due to
various classes not supporting a graceful stop() yet.
2012-06-07 12:45:38 -07:00
Reynold Xin
d0c6e9f639
Made some RDD dependencies transient to reduce the amount of data needed
...
to be serialized in closure serialization. This can significantly reduce
the task setup time in Shark when the query involves a large number of
(Hive) partitions.
2012-05-16 14:16:55 -07:00
Reynold Xin
e601b3b9e5
Added the ability to set environmental variables in piped rdd.
2012-04-17 16:40:56 -07:00
Matei Zaharia
335a6036ad
Converted some tabs to spaces
2012-04-05 11:58:01 -07:00
haoyuan
194c42ab79
Code format.
2012-02-10 08:19:53 -08:00
haoyuan
445e0bb1b5
Format the code a bit mroe.
2012-02-09 15:50:26 -08:00