Commit graph

784 commits

Author SHA1 Message Date
Reynold Xin 98df9d2853 Added removeRdd function in BlockManager. 2013-05-01 20:17:09 -07:00
Reynold Xin 3227ec8edd Cleaned up Ram's code. Moved SparkContext.remove to RDD.unpersist.
Also updated unit tests to make sure they are properly testing for
concurrency.
2013-05-01 16:07:44 -07:00
harshars 8481562731 Merged Ram's commit on removing RDDs.
Conflicts:
	core/src/main/scala/spark/SparkContext.scala
2013-05-01 14:42:17 -07:00
Mridul Muralidharan d960e7e0f8 a) Add support for hyper local scheduling - specific to a host + port - before trying host local scheduling.
b) Add some fixes to test code to ensure it passes (and fixes some other issues).

c) Fix bug in task scheduling which incorrectly used availableCores instead of all cores on the node.
2013-05-01 20:24:00 +05:30
Prashant Sharma e8a9d1cdf9 Fixed Warning: expect -> expectResult 2013-05-01 11:35:02 +05:30
Matei Zaharia f708dda81e Merge pull request #585 from pwendell/listener-perf
[Fix SPARK-742] Task Metrics should not employ per-record timing by default
2013-04-30 07:51:40 -07:00
Patrick Wendell 540be6b154 Modified version of the fix which just removes all per-record tracking. 2013-04-29 11:32:07 -07:00
Patrick Wendell 224fbac061 Spark-742: TaskMetrics should not employ per-record timing.
This patch does three things:

1. Makes TimedIterator a trait with two implementations (one a no-op)
2. Makes the default behavior to use the no-op implementation
3. Removes DelegateBlockFetchTracker. This is just cleanup, but it seems like
   the triat doesn't really reduce complexity in any way.

In the future we can add other implementations, e.g. ones which perform sampling.
2013-04-29 11:13:43 -07:00
Prashant Sharma 8f3ac240cb Fixed Warning: ClassManifest -> ClassTag 2013-04-29 16:39:13 +05:30
Matei Zaharia 0f45347c7b More unit test fixes 2013-04-28 22:29:27 -07:00
Matei Zaharia bce4089f22 Fix BlockManagerSuite to deal with clearing spark.hostPort 2013-04-28 22:23:48 -07:00
Matei Zaharia 68c07ea198 Merge pull request #582 from shivaram/master
Add zip partitions interface
2013-04-28 20:19:33 -07:00
Shivaram Venkataraman 15acd49f07 Actually rename classes to ZippedPartitions*
(the previous commit only renamed the file)
2013-04-28 16:03:22 -07:00
Shivaram Venkataraman 6e84635ab9 Rename classes from MapZipped* to Zipped* 2013-04-28 15:58:40 -07:00
Mridul Muralidharan afee902443 Attempt to fix streaming test failures after yarn branch merge 2013-04-28 22:26:45 +05:30
Shivaram Venkataraman 0cc6642b7c Rename to zipPartitions and style changes 2013-04-28 05:11:03 -07:00
Shivaram Venkataraman c9c4954d99 Add an interface to zip iterators of multiple RDDs
The current code supports 2, 3 or 4 arguments but can be extended
to more arguments if required.
2013-04-26 16:57:46 -07:00
Prashant Sharma ad88f083a6 scala 2.10 and master merge 2013-04-24 18:08:26 +05:30
Prashant Sharma 185bb9525a Manually merged scala-2.10 and master 2013-04-22 14:14:03 +05:30
Andrew xia e0603d7e8b refactor the Schedulable interface and add unit test for SchedulingAlgorithm 2013-04-18 13:13:54 +08:00
Mridul Muralidharan 19652a44be Fix issue with FileSuite failing 2013-04-15 19:16:36 +05:30
Mridul Muralidharan d90d2af103 Checkpoint commit - compiles and passes a lot of tests - not all though, looking into FileSuite issues 2013-04-15 18:12:11 +05:30
Stephen Haberman dd854d5b9f Use Boolean in the Java API, and != for assert. 2013-03-23 11:49:45 -05:00
Stephen Haberman 4ca273edc4 Merge branch 'master' into shufflecoalesce
Conflicts:
	core/src/test/scala/spark/RDDSuite.scala
2013-03-23 11:45:45 -05:00
Matei Zaharia fd53f2fc7b Merge pull request #510 from markhamstra/WithThing
mapWith, flatMapWith and filterWith
2013-03-23 07:13:21 -07:00
Stephen Haberman 1c67c7dfd1 Add a shuffle parameter to coalesce.
This is useful for when you want just 1 output file (part-00000) but
still up the upstream RDD to be computed in parallel.
2013-03-22 08:54:44 -05:00
Matei Zaharia 35588490cb Merge pull request #538 from rxin/cogroup
Added mapSideCombine flag to CoGroupedRDD. Added unit test for CoGroupedRDD.
2013-03-20 19:27:47 -07:00
Reynold Xin 00a11304fd Added mapSideCombine flag to CoGroupedRDD. Added unit test for
CoGroupedRDD.
2013-03-20 13:49:51 +08:00
Mark Hamstra 1fb192ef40 Merge branch 'master' of https://github.com/mesos/spark into foldByKey 2013-03-16 12:17:13 -07:00
Mark Hamstra 80fc8c82ed _With[Matei] 2013-03-16 12:16:29 -07:00
Mark Hamstra 38454c4aed Merge branch 'master' of https://github.com/mesos/spark into WithThing 2013-03-16 11:54:44 -07:00
Matei Zaharia c1e9cdc49f Merge pull request #525 from stephenh/subtractByKey
Add PairRDDFunctions.subtractByKey.
2013-03-16 11:47:45 -07:00
Mark Hamstra ef75be3bf7 Merge branch 'master' of https://github.com/mesos/spark into foldByKey 2013-03-15 21:41:24 -07:00
Matei Zaharia cdbfd1e196 Merge pull request #516 from squito/fix_local_metrics
Fix local metrics
2013-03-15 15:13:28 -07:00
Mark Hamstra 1a4070477d whitespace cleanup 2013-03-15 11:28:28 -07:00
Mark Hamstra 16a4ca4537 restrict V type of foldByKey in order to retain ClassManifest; added foldByKey to Java API and test 2013-03-14 13:58:37 -07:00
Stephen Haberman 7d8bb4df3a Allow subtractByKey's other argument to have a different value type. 2013-03-14 14:44:15 -05:00
Stephen Haberman 4632c45af1 Finished subtractByKeys. 2013-03-14 10:35:34 -05:00
Stephen Haberman e7f1a69c6b Add a test for NextIterator. 2013-03-13 10:46:33 -05:00
Mark Hamstra 562893bea3 deleted excess curly braces 2013-03-10 22:43:08 -07:00
Imran Rashid 8a11ac3dc7 increase sleep time 2013-03-10 22:31:44 -07:00
Imran Rashid 9f97f2f9d8 add a small wait to one task to make sure some task runtime really is non-zero 2013-03-10 22:30:18 -07:00
Mark Hamstra 1289e7176b refactored _With API and added foreachPartition 2013-03-10 22:27:13 -07:00
Mark Hamstra b57df1f5e3 Merge branch 'master' of https://github.com/mesos/spark into WithThing 2013-03-10 16:56:31 -07:00
Matei Zaharia 2e1bbc4e7e Merge remote-tracking branch 'woggling/dag-sched-driver-port'
Conflicts:
	core/src/test/scala/spark/scheduler/DAGSchedulerSuite.scala
2013-03-10 16:52:54 -07:00
Matei Zaharia 91a9d093bd Merge pull request #512 from patelh/fix-kryo-serializer
Fix reference bug in Kryo serializer, add test, update version
2013-03-10 15:48:23 -07:00
Matei Zaharia a59cc6060f Merge remote-tracking branch 'stephenh/nomocks'
Conflicts:
	core/src/main/scala/spark/storage/BlockManagerMaster.scala
	core/src/test/scala/spark/scheduler/DAGSchedulerSuite.scala
2013-03-10 13:39:10 -07:00
Imran Rashid 20f01a0a1b enable task metrics in local mode, add tests 2013-03-09 21:17:31 -08:00
Charles Reiss d0216cb38b Prevent DAGSchedulerSuite from corrupting driver.port.
Use the LocalSparkContext abstraction to properly manage clearing
spark.driver.port.
2013-03-09 10:49:02 -08:00
Hiral Patel 664e5fd24b Fix reference bug in Kryo serializer, add test, update version 2013-03-07 22:16:11 -08:00
Mark Hamstra 5ff0810b11 refactor mapWith, flatMapWith and filterWith to each use two parameter lists 2013-03-05 12:25:44 -08:00
Mark Hamstra d046d8ad32 whitespace formatting 2013-03-05 00:48:13 -08:00
Mark Hamstra 9148b968cf mapWith, flatMapWith and filterWith 2013-03-04 15:48:47 -08:00
Matei Zaharia 04fb81ffe5 Merge pull request #506 from rxin/spark-706
Fixed SPARK-706: Failures in block manager put leads to read task hanging.
2013-03-03 17:20:07 -08:00
Imran Rashid d36abdb053 Merge branch 'master' into stageInfo 2013-03-03 15:20:46 -08:00
Reynold Xin 44134e12bb Fixed SPARK-706: Failures in block manager put leads to read task
hanging.
2013-02-28 15:14:59 -08:00
Stephen Haberman db957e5bd7 Fix MapOutputTrackerSuite. 2013-02-26 01:38:50 -06:00
Stephen Haberman a65aa549ff Override DAGScheduler.runLocally so we can remove the Thread.sleep. 2013-02-25 23:49:32 -06:00
Stephen Haberman a4adeb255c Merge branch 'master' into nomocks
Conflicts:
	core/src/test/scala/spark/scheduler/DAGSchedulerSuite.scala
2013-02-25 23:48:52 -06:00
Tathagata Das c02e064938 Fixed replication bug in BlockManager 2013-02-25 17:27:46 -08:00
Matei Zaharia d6e6abece3 Merge pull request #459 from stephenh/bettersplits
Change defaultPartitioner to use upstream split size.
2013-02-25 09:22:04 -08:00
Stephen Haberman c44ccf2862 Use default parallelism if its set. 2013-02-24 23:54:03 -06:00
Stephen Haberman 44032bc476 Merge branch 'master' into bettersplits
Conflicts:
	core/src/main/scala/spark/RDD.scala
	core/src/main/scala/spark/scheduler/cluster/StandaloneSchedulerBackend.scala
	core/src/test/scala/spark/ShuffleSuite.scala
2013-02-24 22:08:14 -06:00
Tathagata Das dff53d1b94 Merge branch 'mesos-master' into streaming 2013-02-24 12:17:22 -08:00
Stephen Haberman f442e7d83c Update for split->partition rename. 2013-02-24 00:27:14 -06:00
Stephen Haberman cec87a0653 Merge branch 'master' into subtract 2013-02-23 23:27:55 -06:00
Charles Reiss 50cf8c8b79 Add fault tolerance test that uses replicated RDDs. 2013-02-22 16:11:53 -08:00
Imran Rashid ff127cfcd3 Merge branch 'master' into stageInfo
Conflicts:
	core/src/main/scala/spark/SparkContext.scala
	core/src/main/scala/spark/storage/BlockManager.scala
2013-02-21 15:16:21 -08:00
Imran Rashid 69f9a7035f fully revert change to addOnCompleteCallback -- missed this in e9f53ec 2013-02-21 15:07:46 -08:00
Tathagata Das 334ab92441 Fixed bug in CheckpointSuite 2013-02-20 10:26:36 -08:00
Tathagata Das fb9956256d Merge branch 'mesos-master' into streaming
Conflicts:
	core/src/main/scala/spark/rdd/CheckpointRDD.scala
	streaming/src/main/scala/spark/streaming/dstream/ReducedWindowedDStream.scala
2013-02-20 09:01:29 -08:00
Matei Zaharia 06e5e6627f Renamed "splits" to "partitions" 2013-02-17 22:13:26 -08:00
Matei Zaharia 340cc54e47 Merge pull request #471 from stephenh/parallelrdd
Move ParallelCollection into spark.rdd package.
2013-02-16 16:39:15 -08:00
Matei Zaharia 3260b6120e Merge pull request #470 from stephenh/morek
Make CoGroupedRDDs explicitly have the same key type.
2013-02-16 16:38:38 -08:00
Stephen Haberman 924f47dd11 Add RDD.subtract.
Instead of reusing the cogroup primitive, this adds a SubtractedRDD
that knows it only needs to keep rdd1's values (per split) in memory.
2013-02-16 13:38:42 -06:00
Stephen Haberman e7713adb99 Move ParallelCollection into spark.rdd package. 2013-02-16 13:20:48 -06:00
Stephen Haberman ae2234687d Make CoGroupedRDDs explicitly have the same key type. 2013-02-16 13:10:31 -06:00
Stephen Haberman 4328873294 Add assertion about dependencies. 2013-02-16 01:16:40 -06:00
Stephen Haberman c34b8ad2c5 Avoid a shuffle if combineByKey is passed the same partitioner. 2013-02-16 00:54:03 -06:00
Stephen Haberman 6a2d957843 Tweak test names. 2013-02-16 00:33:49 -06:00
Imran Rashid bffee929ab Merge branch 'master' into stageInfo
Conflicts:
	core/src/main/scala/spark/rdd/CoGroupedRDD.scala
	core/src/main/scala/spark/storage/BlockManager.scala
2013-02-15 10:35:04 -08:00
Patrick Wendell f0b68c623c Initial cut at replacing K, V in Java files 2013-02-11 10:03:37 -08:00
Tathagata Das 16baea62bc Fixed bug in CheckpointRDD to prevent exception when the original RDD had zero splits. 2013-02-10 19:14:49 -08:00
Imran Rashid b7d9e24394 use TaskMetrics to gather all stats; lots of plumbing to get it all the way back to driver 2013-02-10 14:18:52 -08:00
Stephen Haberman 680f42e6cd Change defaultPartitioner to use upstream split size.
Previously it used the SparkContext.defaultParallelism, which occassionally
ended up being a very bad guess. Looking at upstream RDDs seems to make
better use of the context.

Also sorted the upstream RDDs by partition size first, as if we have
a hugely-partitioned RDD and tiny-partitioned RDD, it is unlikely
we want the resulting RDD to be tiny-partitioned.
2013-02-10 02:27:03 -06:00
Stephen Haberman 921be76533 Use stubs instead of mocks for DAGSchedulerSuite. 2013-02-09 16:42:18 -06:00
Imran Rashid 04e828f7c1 general fixes to Distribution, plus some tests 2013-02-08 19:07:36 -08:00
Stephen Haberman f2bc748013 Add RDD.coalesce. 2013-02-05 21:23:36 -06:00
Imran Rashid 379564c7e0 setup plumbing to get task metrics; lots of unfinished parts, but basic flow in place 2013-02-05 18:30:21 -08:00
Matei Zaharia a4611d66f0 Merge pull request #449 from stephenh/longerdriversuite
Increase DriverSuite timeout.
2013-02-05 17:58:22 -08:00
Stephen Haberman 1ba3393ceb Increase DriverSuite timeout. 2013-02-05 17:56:50 -06:00
Imran Rashid 295b534398 task context keeps a handle on Task -- giant hack, temporary for tracking shuffle times & amount 2013-02-05 10:18:16 -08:00
Matei Zaharia f6ec547ea7 Small fix to test for distinct 2013-02-04 13:14:54 -08:00
Matei Zaharia aa4ee1e9e5 Fix failing test 2013-02-04 11:06:31 -08:00
Charles Reiss 6107957962 Merge remote-tracking branch 'base/master' into dag-sched-tests
Conflicts:
	core/src/main/scala/spark/scheduler/DAGScheduler.scala
2013-02-02 00:33:30 -08:00
Charles Reiss 1fd5ee323d Code review changes: add sc.stop; style of multiline comments; parens on procedure calls. 2013-02-01 22:33:38 -08:00
Matei Zaharia ae26911ec0 Add back test for distinct without parens 2013-02-01 21:07:24 -08:00
Matei Zaharia 8b3041c723 Reduced the memory usage of reduce and similar operations
These operations used to wait for all the results to be available in an
array on the driver program before merging them. They now merge values
incrementally as they arrive.
2013-02-01 15:38:42 -08:00
Charles Reiss 7f51458774 Comment at top of DAGSchedulerSuite 2013-01-30 09:34:53 -08:00
Charles Reiss 9c0bae75ad Change DAGSchedulerSuite to run DAGScheduler in the same Thread. 2013-01-30 09:22:07 -08:00
Charles Reiss 4bf3d7ea12 Clear spark.master.port to cleanup for other tests 2013-01-29 19:05:58 -08:00
Charles Reiss 9eac7d01f0 Add DAGScheduler tests. 2013-01-29 18:55:43 -08:00
Matei Zaharia 9ae11603b4 Merge pull request #415 from stephenh/driver
Replace old 'master' term with 'driver'.
2013-01-29 10:41:42 -08:00
Matei Zaharia 64ba6a8c2c Simplify checkpointing code and RDD class a little:
- RDD's getDependencies and getSplits methods are now guaranteed to be
  called only once, so subclasses can safely do computation in there
  without worrying about caching the results.

- The management of a "splits_" variable that is cleared out when we
  checkpoint an RDD is now done in the RDD class.

- A few of the RDD subclasses are simpler.

- CheckpointRDD's compute() method no longer assumes that it is given a
  CheckpointRDDSplit -- it can work just as well on a split from the
  original RDD, because it only looks at its index. This is important
  because things like UnionRDD and ZippedRDD remember the parent's
  splits as part of their own and wouldn't work on checkpointed parents.

- RDD.iterator can now reuse cached data if an RDD is computed before it
  is checkpointed. It seems like it wouldn't do this before (it always
  called iterator() on the CheckpointRDD, which read from HDFS).
2013-01-28 22:30:12 -08:00
Stephen Haberman 13368818af Merge branch 'master' into driver
Conflicts:
	core/src/main/scala/spark/SparkContext.scala
	core/src/main/scala/spark/SparkEnv.scala
	core/src/main/scala/spark/deploy/LocalSparkCluster.scala
	core/src/main/scala/spark/executor/StandaloneExecutorBackend.scala
	core/src/main/scala/spark/scheduler/cluster/SparkDeploySchedulerBackend.scala
	core/src/main/scala/spark/scheduler/cluster/StandaloneClusterMessage.scala
	core/src/main/scala/spark/scheduler/cluster/StandaloneSchedulerBackend.scala
	core/src/main/scala/spark/storage/BlockManagerMaster.scala
	core/src/main/scala/spark/storage/ThreadingTest.scala
	core/src/test/scala/spark/MapOutputTrackerSuite.scala
2013-01-28 23:30:24 -06:00
Imran Rashid efff7bfb33 add long and float accumulatorparams 2013-01-28 20:23:11 -08:00
Matei Zaharia 44b4a0f88f Track workers by executor ID instead of hostname to allow multiple
executors per machine and remove the need for multiple IP addresses in
unit tests.
2013-01-27 19:23:49 -08:00
Matei Zaharia 49f6472c0f Merge pull request #418 from woggling/reregister-deadlock
Fix BlockManager reregistration deadlock; do BlockManager reregistration more asynchronously
2013-01-26 18:59:02 -08:00
Charles Reiss ad4232b4da Fix deadlock in BlockManager reregistration triggered by failed updates. 2013-01-26 18:30:38 -08:00
Josh Rosen d49cf0e587 Fix JavaRDDLike.flatMap(PairFlatMapFunction) (SPARK-668).
This workaround is easier than rewriting JavaRDDLike in Java.
2013-01-26 16:13:18 -08:00
Stephen Haberman 7dfb82a992 Replace old 'master' term with 'driver'. 2013-01-25 11:03:00 -06:00
Stephen Haberman ec43a51b38 Merge branch 'master' into localsparkcontext
Conflicts:
	core/src/test/scala/spark/FileServerSuite.scala
	core/src/test/scala/spark/RDDSuite.scala
2013-01-24 21:17:30 -06:00
Stephen Haberman 230bda2047 Add LocalSparkContext to manage common sc variable. 2013-01-24 11:01:01 -06:00
Matei Zaharia fe5e4812fc Merge pull request #409 from rxin/splitpruningrdd
Added pruntSplits method to RDD.
2013-01-23 22:23:22 -08:00
Reynold Xin eedc542a02 Removed pruneSplits method in RDD and renamed SplitsPruningRDD to
PartitionPruningRDD.
2013-01-23 22:14:23 -08:00
Reynold Xin 45cd50d5fe Updated assert == to ===. 2013-01-23 16:06:58 -08:00
Matei Zaharia 548856a224 Merge remote-tracking branch 'woggling/remove-machines'
Conflicts:
	core/src/main/scala/spark/scheduler/DAGScheduler.scala
2013-01-23 15:44:17 -08:00
Reynold Xin c24b3819dd Added an extra assert for split size check. 2013-01-23 15:34:59 -08:00
Reynold Xin eb222b7206 Added pruntSplits method to RDD. 2013-01-23 15:29:02 -08:00
Charles Reiss 5c7422292e Remove more dead code from test. 2013-01-23 12:59:51 -08:00
Charles Reiss 88b9d240fd Remove dead code in test. 2013-01-23 12:40:38 -08:00
Matei Zaharia 1a3aeeca23 Merge pull request #407 from woggling/no-cache-tracker
Eliminate CacheTracker
2013-01-23 12:28:48 -08:00
Matei Zaharia 4147e1d47b Merge pull request #406 from tdas/master
Changed StorageLevel and BlockManagerId API to prevent duplication in memory
2013-01-23 12:18:31 -08:00
Matei Zaharia 4d77d554e1 Merge pull request #394 from JoshRosen/add_file_fix
Add SparkFiles.get() API to access files added through addFile().
2013-01-23 12:16:30 -08:00
Charles Reiss 0b506dd2ec Add tests of various node failure scenarios. 2013-01-23 01:38:15 -08:00
Tathagata Das 5e11f1e51f Modified StorageLevel API to ensure zero duplicate objects. 2013-01-22 23:42:53 -08:00
Tathagata Das bacade6caf Modified BlockManagerId API to ensure zero duplicate objects. Fixed BlockManagerId testcase in BlockManagerTestSuite. 2013-01-22 22:55:26 -08:00
Josh Rosen 43e9ff9596 Add test for driver hanging on exit (SPARK-530). 2013-01-22 22:47:26 -08:00
Charles Reiss 2849931000 Eliminate CacheTracker.
Replaces DAGScheduler's queries of CacheTracker with BlockManagerMaster
queries.

Adds CacheManager to locally coordinate computation of cached RDDs.
2013-01-22 22:19:30 -08:00
Josh Rosen ef711902c1 Don't download files to master's working directory.
This should avoid exceptions caused by existing
files with different contents.

I also removed some unused code.
2013-01-21 17:34:17 -08:00
Stephen Haberman ffd1623595 Minor cleanup. 2013-01-21 15:55:46 -06:00
folone fd6e51deec Fixed the failing test. 2013-01-20 17:02:58 +01:00
folone ad8aff6ca4 Merge remote-tracking branch 'upstream/master' 2013-01-20 14:43:20 +01:00
Tathagata Das 4f8fe58b25 Merge branch 'mesos-streaming' into streaming
Conflicts:
	core/src/main/scala/spark/api/java/JavaRDDLike.scala
	core/src/main/scala/spark/api/java/JavaSparkContext.scala
	core/src/test/scala/spark/JavaAPISuite.java
2013-01-20 01:13:56 -08:00
Patrick Wendell d5570c7968 Adding checkpointing to Java API 2013-01-17 18:41:58 -08:00
Tathagata Das f466ee44bc Merge branch 'master' into streaming
Conflicts:
	core/src/main/scala/spark/MapOutputTracker.scala
2013-01-16 12:57:11 -08:00
Matei Zaharia 4beb084f64 Merge pull request #374 from woggling/null-mapout
Generate FetchFailedException even for cached missing map outputs
2013-01-15 14:22:29 -08:00
Tathagata Das cd1521cfdb Merge branch 'master' into streaming
Conflicts:
	core/src/main/scala/spark/rdd/CoGroupedRDD.scala
	core/src/main/scala/spark/rdd/FilteredRDD.scala
	docs/_layouts/global.html
	docs/index.md
	run
2013-01-15 12:08:51 -08:00
Charles Reiss 4078623b9f Remove broken attempt to test fetching case. 2013-01-15 12:05:54 -08:00
Stephen Haberman d228bff440 Add a test. 2013-01-15 11:48:50 -06:00
Charles Reiss b038999797 Fix accidental spark.master.host reuse 2013-01-14 17:04:44 -08:00
Charles Reiss 7ba34bc007 Additional tests for MapOutputTracker. 2013-01-14 15:27:02 -08:00
folone 25c0739bad Moved to scala 2.10.0. Notable changes are:
- akka  2.0.3  → 2.1.0
- spray 1.0-M1 → 1.1-M7
For now the repl subproject is commented out, as scala reflection api changed very much since the introduction of macros.
2013-01-14 09:52:11 +01:00
Matei Zaharia 72408e8dfa Make filter preserve partitioner info, since it can 2013-01-13 19:34:07 -08:00
Ryan LeCompte ea20ae6618 add one extra test 2013-01-12 09:18:00 -08:00
Ryan LeCompte 2c77eeebb6 correct test params 2013-01-12 00:13:45 -08:00
Ryan LeCompte 0cfea7a2ec add unit test 2013-01-11 23:48:07 -08:00
Stephen Haberman 8ac0f35be4 Add JavaRDDLike.keyBy. 2013-01-08 09:57:45 -06:00
Stephen Haberman 4ee6b22775 Merge branch 'master' into tupleBy
Conflicts:
	core/src/test/scala/spark/RDDSuite.scala
2013-01-08 09:10:10 -06:00
Matei Zaharia f7cf035b9b Merge pull request #350 from tdas/streaming
Spark Streaming
2013-01-07 17:40:11 -08:00
Shivaram Venkataraman b1336e2fe4 Update expected size of strings to match our dummy string class 2013-01-07 17:00:32 -08:00
Tathagata Das 4719e6d8fe Changed locations for unit test logs. 2013-01-07 16:06:07 -08:00
Shivaram Venkataraman 55c66d365f Use a dummy string class in Size Estimator tests to make it resistant to jdk
versions
2013-01-07 15:58:00 -08:00
Shivaram Venkataraman 77d751731c Remove unused BoundedMemoryCache file and associated test case. 2013-01-07 15:57:46 -08:00
Matei Zaharia 1941d9602d Merge branch 'master' of github.com:mesos/spark 2013-01-07 16:50:39 -05:00
Matei Zaharia 9c32f300fb Add Accumulable.setValue for easier use in Java 2013-01-07 16:50:23 -05:00
Stephen Haberman 8dc06069fe Rename RDD.tupleBy to keyBy. 2013-01-06 15:21:45 -06:00
Matei Zaharia b1663752c6 Merge pull request #351 from stephenh/values
Add PairRDDFunctions.keys and values.
2013-01-05 19:15:54 -08:00
Matei Zaharia 0982572519 Add methods called just 'accumulator' for int/double in Java API 2013-01-05 22:11:28 -05:00
Matei Zaharia 86af64b0a6 Fix Accumulators in Java, and add a test for them 2013-01-05 20:55:17 -05:00
Matei Zaharia ecf9c08901 Fix Accumulators in Java, and add a test for them 2013-01-05 20:54:08 -05:00
Stephen Haberman 1fdb6946b5 Add RDD.tupleBy. 2013-01-05 13:07:59 -06:00
Stephen Haberman 6a0db3b449 Fix typo. 2013-01-05 12:56:17 -06:00
Stephen Haberman f4e6b9361f Add RDD.collect(PartialFunction). 2013-01-05 12:14:08 -06:00
Stephen Haberman 8d57c78c83 Add PairRDDFunctions.keys and values. 2013-01-05 12:04:01 -06:00
Tathagata Das 3dc87dd923 Fixed compilation bug in RDDSuite created during merge for mesos/master. 2013-01-01 16:38:04 -08:00
Tathagata Das d34dba25c2 Merge branch 'mesos' into dev-merge 2013-01-01 15:48:39 -08:00
Matei Zaharia 55809fbc6d Merge pull request #349 from woggling/cache-finally
Avoid stalls when computation of cached RDD throws exception
2013-01-01 08:21:33 -08:00
Charles Reiss 21636ee4fa Test with exception while computing cached RDD. 2013-01-01 08:07:40 -08:00
Josh Rosen f803953998 Raise exception when hashing Java arrays (SPARK-597) 2012-12-31 20:20:11 -08:00
Tathagata Das 0bc0a60d30 Modifications to make sure LocalScheduler terminate cleanly without errors when SparkContext is shutdown, to minimize spurious exception during master failure tests. 2012-12-27 15:37:33 -08:00
Tathagata Das 7c33f76291 Merge branch 'mesos' into dev-merge 2012-12-26 19:19:07 -08:00
Tathagata Das 836042bb9f Merge branch 'dev-checkpoint' of github.com:radlab/spark into dev-merge
Conflicts:
	core/src/main/scala/spark/ParallelCollection.scala
	core/src/main/scala/spark/RDD.scala
	core/src/main/scala/spark/rdd/BlockRDD.scala
	core/src/main/scala/spark/rdd/CartesianRDD.scala
	core/src/main/scala/spark/rdd/CoGroupedRDD.scala
	core/src/main/scala/spark/rdd/CoalescedRDD.scala
	core/src/main/scala/spark/rdd/FilteredRDD.scala
	core/src/main/scala/spark/rdd/FlatMappedRDD.scala
	core/src/main/scala/spark/rdd/GlommedRDD.scala
	core/src/main/scala/spark/rdd/HadoopRDD.scala
	core/src/main/scala/spark/rdd/MapPartitionsRDD.scala
	core/src/main/scala/spark/rdd/MapPartitionsWithSplitRDD.scala
	core/src/main/scala/spark/rdd/MappedRDD.scala
	core/src/main/scala/spark/rdd/PipedRDD.scala
	core/src/main/scala/spark/rdd/SampledRDD.scala
	core/src/main/scala/spark/rdd/ShuffledRDD.scala
	core/src/main/scala/spark/rdd/UnionRDD.scala
	core/src/main/scala/spark/scheduler/ResultTask.scala
	core/src/test/scala/spark/CheckpointSuite.scala
2012-12-26 19:09:01 -08:00
Mark Hamstra 903f3518df fall back to filter-map-collect when calling lookup() on an RDD without a partitioner 2012-12-24 13:18:45 -08:00
Mark Hamstra 61be8566e2 Allow distinct() to be called without parentheses when using the default number of splits. 2012-12-24 02:36:47 -08:00
Reynold Xin eac566a7f4 Merge branch 'master' of github.com:mesos/spark into dev
Conflicts:
	core/src/main/scala/spark/MapOutputTracker.scala
	core/src/main/scala/spark/PairRDDFunctions.scala
	core/src/main/scala/spark/ParallelCollection.scala
	core/src/main/scala/spark/RDD.scala
	core/src/main/scala/spark/rdd/BlockRDD.scala
	core/src/main/scala/spark/rdd/CartesianRDD.scala
	core/src/main/scala/spark/rdd/CoGroupedRDD.scala
	core/src/main/scala/spark/rdd/CoalescedRDD.scala
	core/src/main/scala/spark/rdd/FilteredRDD.scala
	core/src/main/scala/spark/rdd/FlatMappedRDD.scala
	core/src/main/scala/spark/rdd/GlommedRDD.scala
	core/src/main/scala/spark/rdd/HadoopRDD.scala
	core/src/main/scala/spark/rdd/MapPartitionsRDD.scala
	core/src/main/scala/spark/rdd/MapPartitionsWithSplitRDD.scala
	core/src/main/scala/spark/rdd/MappedRDD.scala
	core/src/main/scala/spark/rdd/PipedRDD.scala
	core/src/main/scala/spark/rdd/SampledRDD.scala
	core/src/main/scala/spark/rdd/ShuffledRDD.scala
	core/src/main/scala/spark/rdd/UnionRDD.scala
	core/src/main/scala/spark/storage/BlockManager.scala
	core/src/main/scala/spark/storage/BlockManagerId.scala
	core/src/main/scala/spark/storage/BlockManagerMaster.scala
	core/src/main/scala/spark/storage/StorageLevel.scala
	core/src/main/scala/spark/util/MetadataCleaner.scala
	core/src/main/scala/spark/util/TimeStampedHashMap.scala
	core/src/test/scala/spark/storage/BlockManagerSuite.scala
	run
2012-12-20 14:53:40 -08:00
Tathagata Das 8512dd3225 Merge branch 'dev' of github.com:radlab/spark into dev-checkpoint
Conflicts:
	core/src/main/scala/spark/ParallelCollection.scala
	core/src/test/scala/spark/CheckpointSuite.scala
	streaming/src/main/scala/spark/streaming/DStream.scala
2012-12-20 14:24:19 -08:00
Tathagata Das fe777eb77d Fixed bugs in CheckpointRDD and spark.CheckpointSuite. 2012-12-20 13:39:27 -08:00
Reynold Xin 9397c5014e Let the slave notify the master block removal. 2012-12-20 01:37:09 -08:00
Tathagata Das 5184141936 Introduced getSpits, getDependencies, and getPreferredLocations in RDD and RDDCheckpointData. 2012-12-18 13:30:53 -08:00
Reynold Xin 8c01295b85 Fixed conflicts from merging Charles' and TD's block manager changes. 2012-12-14 00:26:36 -08:00
Reynold Xin 0235667f73 Merge branch 'master' of github.com:mesos/spark into spark-633 2012-12-13 22:33:41 -08:00
Reynold Xin 97434f49b8 Merged TD's block manager refactoring. 2012-12-13 22:32:19 -08:00
Reynold Xin f4a9e1b9be Fixed the broken Java unit test from SPARK-635. 2012-12-13 22:22:12 -08:00
Reynold Xin 1b7a0451ed Added the ability in block manager to remove blocks. 2012-12-13 00:04:42 -08:00
Tathagata Das 8e74fac215 Made checkpoint data in RDDs optional to further reduce serialized size. 2012-12-11 15:36:12 -08:00
Tathagata Das fa28f25619 Fixed bug in UnionRDD and CoGroupedRDD 2012-12-11 13:59:43 -08:00
Tathagata Das 2a87d816a2 Added clear property to JavaAPISuite to remove port binding errors. 2012-12-11 01:44:43 -08:00
Tathagata Das 746afc2e65 Bunch of bug fixes related to checkpointing in RDDs. RDDCheckpointData object is used to lock all serialization and dependency changes for checkpointing. ResultTask converted to Externalizable and serialized RDD is cached like ShuffleMapTask. 2012-12-10 23:36:37 -08:00
Charles Reiss 5d3e917d09 Use Akka scheduler for BlockManager heart beats.
Adds required ActorSystem argument to BlockManager constructors.
2012-12-10 00:31:50 -08:00
Tathagata Das e427216018 Removed unnecessary testcases. 2012-12-08 12:46:59 -08:00
Tathagata Das 1f3a75ae9e Modified checkpoint testsuite to more comprehensively test checkpointing of various RDDs. Fixed checkpoint bug (splits referring to parent RDDs or parent splits) in UnionRDD and CoalescedRDD. Fixed bug in testing ShuffledRDD. Removed unnecessary and useless map-side combining step for narrow dependencies in CoGroupedRDD. Removed unncessary WeakReference stuff from many other RDDs. 2012-12-07 13:45:52 -08:00
Charles Reiss a2a94fdbc7 Tests for block manager heartbeats. 2012-12-05 23:36:05 -08:00
Tathagata Das 21a0852976 Refactored RDD checkpointing to minimize extra fields in RDD class. 2012-12-04 22:10:25 -08:00
Tathagata Das e463ae4920 Modified StorageLevel and BlockManagerId to cache common objects and use cached object while deserializing. 2012-11-28 14:05:01 -08:00
Matei Zaharia 3ebd8e1885 Added zip to Java API 2012-11-27 22:38:09 -08:00
Matei Zaharia 27e43abd19 Added a zip() operation for RDDs with the same shape (number of
partitions and number of elements in each partition)
2012-11-27 22:27:47 -08:00
Matei Zaharia 935c468b71 Merge pull request #311 from woggling/map-output-npe
Fix NullPointerException when map output unregistered from MapOutputTracker twice
2012-11-27 20:50:48 -08:00
Reynold Xin f24bfd2dd1 For size compression, compress non zero values into non zero values. 2012-11-27 19:20:45 -08:00
Charles Reiss 5fa868b98b Tests for MapOutputTracker. 2012-11-27 16:05:36 -08:00
Tathagata Das 10c1abcb6a Fixed checkpointing bug in CoGroupedRDD. CoGroupSplits kept around the RDD splits of its parent RDDs, thus checkpointing its parents did not release the references to the parent splits. 2012-11-17 17:27:00 -08:00
Tathagata Das 04e9e9d93c Refactored BlockManagerMaster (not BlockManagerMasterActor) to simplify the code and fix live lock problem in unlimited attempts to contact the master. Also added testcases in the BlockManagerSuite to test BlockManagerMaster methods getPeers and getLocations. 2012-11-11 08:54:21 -08:00
Tathagata Das 34e569f40e Added 'synchronized' to RDD serialization to ensure checkpoint-related changes are reflected atomically in the task closure. Added to tests to ensure that jobs running on an RDD on which checkpointing is in progress does hurt the result of the job. 2012-10-31 00:56:40 -07:00
Tathagata Das 0dcd770fdc Added checkpointing support to all RDDs, along with CheckpointSuite to test checkpointing in them. 2012-10-30 16:09:37 -07:00
Matei Zaharia 0bd20c63e2 Merge remote-tracking branch 'JoshRosen/shuffle_refactoring' into dev
Conflicts:
	core/src/main/scala/spark/Dependency.scala
	core/src/main/scala/spark/rdd/CoGroupedRDD.scala
	core/src/main/scala/spark/rdd/ShuffledRDD.scala
2012-10-23 22:01:45 -07:00
Matei Zaharia 8815aeba0c Take executor environment vars as an arguemnt to SparkContext 2012-10-13 15:31:11 -07:00
Josh Rosen 33cd3a0c12 Remove map-side combining from ShuffleMapTask.
This separation of concerns simplifies the 
ShuffleDependency and ShuffledRDD interfaces.

Map-side combining can be performed in a
mapPartitions() call prior to shuffling the RDD.

I don't anticipate this having much of a 
performance impact: in both approaches, each tuple
is hashed twice: once in the bucket partitioning
and once in the combiner's hashtable.  The same
steps are being performed, but in a different
order and through one extra Iterator.
2012-10-13 14:59:20 -07:00
Josh Rosen 10bcd217d2 Remove mapSideCombine field from Aggregator.
Instead, the presence or absense of a ShuffleDependency's aggregator
will control whether map-side combining is performed.
2012-10-13 14:59:20 -07:00
Josh Rosen 4775c55641 Change ShuffleFetcher to return an Iterator. 2012-10-13 14:59:20 -07:00
Matei Zaharia 682b2d9329 Added a test for when an RDD only partially fits in memory 2012-10-12 14:58:26 -07:00
Shivaram Venkataraman 8577523f37 Add test to verify if RDD is computed even if block manager has insufficient
memory
2012-10-12 14:14:57 -07:00
Shivaram Venkataraman 2cf40c5fd5 Change block manager to accept a ArrayBuffer instead of an iterator to ensure
that the computation can proceed even if we run out of memory to cache the
block. Update CacheTracker to use this new interface
2012-10-11 00:42:46 -07:00
Matei Zaharia efc5423210 Made compression configurable separately for shuffle, broadcast and RDDs 2012-10-07 11:30:53 -07:00
Reynold Xin 80f59e17e2 Fixed a bug in addFile that if the file is specified as "file:///", the
symlink is created wrong for local mode.
2012-10-07 00:54:38 -07:00
Matei Zaharia eca570f66a Removed the need to sleep in tests due to waiting for Akka to shut down 2012-10-07 00:17:59 -07:00
Matei Zaharia dc28a3ac0a Modified shuffle to limit the maximum outstanding data size in bytes,
instead of the maximum number of outstanding fetches. This should make
it faster when there are many small map output files, as well as more
robust to overallocating memory on large map outputs.
2012-10-06 20:07:10 -07:00
Matei Zaharia 9a3b3f32a3 Pass sizes of map outputs back to MapOutputTracker 2012-10-06 18:46:04 -07:00
Matei Zaharia 716e10ca32 Minor formatting fixes 2012-10-05 22:03:06 -07:00
Andy Konwinski a242cdd0a6 Factor subclasses of RDD out of RDD.scala into their own classes
in the rdd package.
2012-10-05 19:53:54 -07:00
Andy Konwinski e0067da082 Moves all files in core/src/main/scala/ that have RDD in them from
package spark to package spark.rdd and updates all references to them.
2012-10-05 19:23:45 -07:00
Shivaram Venkataraman b6e4f46a96 Fix SizeEstimator tests to work with String classes in JDK 6 and 7
Conflicts:

	core/src/test/scala/spark/BoundedMemoryCacheSuite.scala
2012-10-05 16:58:57 -07:00
Imran Rashid e0698f8f26 change tests to show utility of localValue 2012-10-04 23:05:42 -07:00
Imran Rashid 82a3327862 make accumulator.localValue public, add tests
Conflicts:
	core/src/test/scala/spark/AccumulatorSuite.scala
2012-10-04 23:05:01 -07:00
Matei Zaharia 97cbd699d7 Merge branch 'dev' of github.com:mesos/spark into dev 2012-10-02 17:31:01 -07:00
Matei Zaharia 5fda59ab99 Added a test for overly large blocks in memory store 2012-10-02 17:30:40 -07:00
Matei Zaharia 6098f7e87a Fixed cache replacement behavior of BlockManager:
- Partitions that get dropped to disk will now be loaded back into RAM
  after they're accessed again
- Same-RDD rule for cache replacement is now implemented (don't drop
  partitions from an RDD to make room for other partitions from itself)
- Items stored as MEMORY_AND_DISK go into memory only first, instead of
  being eagerly written out to disk
- MemoryStore.ensureFreeSpace is called within a lock on the writer
  thread to prevent race conditions (this can still be optimized to
  allow multiple concurrent calls to it but it's a start)
- MemoryStore does not accept blocks larger than its limit
2012-10-02 17:25:38 -07:00
Reynold Xin 0898a21b95 Merge branch 'dev' of https://github.com/mesos/spark into dev 2012-10-02 13:08:01 -07:00
Matei Zaharia 22684653a5 Revert "Place Spray repo ahead of Cloudera in Maven search path"
This reverts commit 42e0a68082.
2012-10-02 12:01:32 -07:00
Reynold Xin b8cd681169 Allow whitespaces in cluster URL configuration for local cluster. 2012-10-02 11:52:12 -07:00
Matei Zaharia 42e0a68082 Place Spray repo ahead of Cloudera in Maven search path 2012-10-02 11:37:19 -07:00
Matei Zaharia 74a9244255 Write all unit test output to a file 2012-10-01 15:07:42 -07:00
Matei Zaharia 0b84871dbc Remove some printlns in tests 2012-10-01 10:57:26 -07:00
Matei Zaharia 2314132d57 Added a (failing) test for LRU with MEMORY_AND_DISK. 2012-09-30 22:52:16 -07:00
Matei Zaharia 83143f9a5f Fixed several bugs that caused weird behavior with files in spark-shell:
- SizeEstimator was following through a ClassLoader field of Hadoop
  JobConfs, which referenced the whole interpreter, Scala compiler, etc.
  Chaos ensued, giving an estimated size in the tens of gigabytes.
- Broadcast variables in local mode were only stored as MEMORY_ONLY and
  never made accessible over a server, so they fell out of the cache when
  they were deemed too large and couldn't be reloaded.
2012-09-30 21:19:39 -07:00
Matei Zaharia fd0374b9de Comment 2012-09-29 21:43:06 -07:00
Matei Zaharia 143ef4f90d Added a CoalescedRDD class for reducing the number of partitions in an RDD. 2012-09-29 21:30:52 -07:00
Matei Zaharia c45758ddde Comment 2012-09-29 20:27:54 -07:00
Matei Zaharia 9b326d01e9 Made BlockManager unmap memory-mapped files when necessary to reduce the
number of open files. Also optimized sending of disk-based blocks.
2012-09-29 20:21:54 -07:00
Matei Zaharia 009b0e37e7 Added an option to compress blocks in the block store 2012-09-27 18:45:44 -07:00
Matei Zaharia 7bcb08cef5 Renamed storage levels to something cleaner; fixes #223. 2012-09-27 17:50:59 -07:00
Matei Zaharia 920fab23c3 Merge pull request #222 from rxin/dev
Added MapPartitionsWithSplitRDD.
2012-09-26 23:16:45 -07:00
Matei Zaharia 1ef4f0fbd2 Allow controlling number of splits in sortByKey. 2012-09-26 19:18:47 -07:00
Reynold Xin 1ad1331a34 Added MapPartitionsWithSplitRDD. 2012-09-26 17:11:28 -07:00
Matei Zaharia d71a358c46 Fixed a test that was getting extremely lucky before, and increased the
number of samples used for sorting
2012-09-26 00:25:34 -07:00
Matei Zaharia 6eeb379cf8 Fix some test issues 2012-09-24 15:39:58 -07:00
Reynold Xin 397d3816e1 Separated ShuffledRDD into multiple classes: RepartitionShuffledRDD,
ShuffledSortedRDD, and ShuffledAggregatedRDD.
2012-09-19 12:31:45 -07:00
Denny 5e4076e3f2 Merge branch 'dev' into feature/fileserver
Conflicts:
	core/src/main/scala/spark/SparkContext.scala
2012-09-11 16:57:17 -07:00
Matei Zaharia 6d7f907e73 Manually merge pull request #175 by Imran Rashid 2012-09-11 16:00:06 -07:00
Denny 4d3471dd07 Fix serialization bugs and added local cluster tests 2012-09-10 15:39:58 -07:00
Denny b864c36a30 Dynamically adding jar files and caching fileSets. 2012-09-10 12:49:09 -07:00
Denny f275fb07da General FileServer
A general fileserver for both JARs and regular files.
2012-09-10 12:48:59 -07:00
Matei Zaharia a13780670d Added a unit test for local-cluster mode and simplified some of the code involved in that 2012-09-10 12:48:58 -07:00
Matei Zaharia 995982b3c9 Added a unit test for local-cluster mode and simplified some of the code involved in that 2012-09-07 17:08:36 -07:00
Reynold Xin c308fbcb79 Removed cache add/remove log messages from CacheTracker.
Added log messages on BlockManagerMaster to reflect block add/remove.
Also did some minor cleanup of storage package code.
2012-09-05 15:59:48 -07:00
Reynold Xin a8a2a08a1a Added a test for testing map-side combine on/off switch. 2012-08-30 12:34:28 -07:00
Matei Zaharia 2c16ae36d7 Set log level in tests to WARN 2012-08-23 20:38:14 -07:00
Matei Zaharia deedb9e7b7 Fix further issues with tests and broadcast.
The broadcast fix is to store values as MEMORY_ONLY_DESER instead of
MEMORY_ONLY, which will save substantial time on serialization.
2012-08-23 20:31:49 -07:00
Shivaram Venkataraman 0f4fbb057b Change BlockManagerSuite test cases to use a deterministic size estimator and
update the results to match the new estimates
2012-08-13 13:32:23 -07:00
Shivaram Venkataraman 22ba3a3f77 Add test-cases for 32-bit and no-compressed oops scenarios. 2012-08-13 13:32:10 -07:00
Shivaram Venkataraman 1f68c4b03b Update test cases to match the new size estimates. Uses 64-bit and compressed
oops setting to get deterministic results
2012-08-13 13:31:54 -07:00
Matei Zaharia 6ae3c375a9 Renamed apply() to call() in Java API and allowed it to throw Exceptions 2012-08-12 23:10:19 +02:00
Matei Zaharia e463e7a333 Merge pull request #167 from JoshRosen/piped-rdd-fixes
Detect non-zero exit status from PipedRDD process
2012-08-10 00:56:42 -07:00
Shivaram Venkataraman ce3444d2cb Fix testcheckpoint to reuse spark context defined in the class 2012-08-03 18:52:26 -07:00
Matei Zaharia 62898b631f Made range partition balance tests more aggressive.
This is because we pull out such a large sample (10x the number of
partitions) that we should expect pretty good balance. The tests are
also deterministic so there's no worry about them failing irreproducibly.
2012-08-03 16:46:48 -04:00
Matei Zaharia 6601a6212b Added a unit test for cross-partition balancing in sort, and changes to
RangePartitioner to make it pass. It turns out that the first partition
was always kind of small due to how we picked partition boundaries.
2012-08-03 16:40:45 -04:00
Matei Zaharia 3ee2530c0c Merge branch 'block-manager-fix' into dev 2012-07-30 13:58:46 -07:00
Matei Zaharia 400221f851 Merge branch 'dev' of git://github.com/tdas/spark into dev 2012-07-30 13:54:57 -07:00
Matei Zaharia ed1b0f8388 Made BlockManagerMaster no longer be a singleton.
Also cleaned up a few formatting things throughout block manager code.
2012-07-30 13:53:47 -07:00
Matei Zaharia d7f089323a Fixed AccumulatorSuite to clean up SparkContext with BeforeAndAfter 2012-07-28 20:25:42 -07:00
Imran Rashid f7149c5e46 tasks cannot access value of accumulator 2012-07-28 20:16:17 -07:00
Imran Rashid f1face1ea9 rename addToAccum to addAccumulator 2012-07-28 20:16:01 -07:00
Imran Rashid 2d666b9d76 add some functionality to Vector, delete copy in AccumulatorSuite 2012-07-28 20:15:51 -07:00
Imran Rashid 83659af11c Accumulator now inherits from Accumulable, whcih simplifies a bunch of other things (eg., no +:=)
Conflicts:

	core/src/main/scala/spark/Accumulators.scala
2012-07-28 20:13:51 -07:00
Imran Rashid ae07f3864c add Accumulatable, add corresponding docs & tests for accumulators 2012-07-28 20:12:41 -07:00
Matei Zaharia f6f917bd00 Add a sleep to prevent a failing test.
The BlockManager's put seems to be slightly asynchronous, which can
cause it to fail this test by not removing stuff from the cache before
we put the next value. We should probably change the semantics of put()
in this case but it's hard right now. It will also be hard for
asynchronously replicated puts.
2012-07-27 16:59:36 -07:00
Matei Zaharia c0c78d2119 Renamed test more descriptively 2012-07-27 16:28:18 -07:00
Matei Zaharia dee8ff1b9d Added a second version of union() without varargs. 2012-07-27 16:27:52 -07:00
Matei Zaharia b51d733a57 Fixed Java union methods having same erasure.
Changed union() methods on lists to take a separate "first element"
argument in order to differentiate them to the compiler, because Java 7
considered it an error to have them all take Lists parameterized with
different types.
2012-07-27 12:23:27 -07:00
Tathagata Das 024905f682 Added BlockRDD and a first-cut version of checkpoint() to RDD class. 2012-07-27 12:00:49 -07:00
Tathagata Das 0426769f89 Modified the block dropping code for better performance. 2012-07-26 20:53:45 -07:00
Matei Zaharia 5c5aa2ff81 Merge pull request #153 from JoshRosen/new-java-api
Java API
2012-07-26 17:20:52 -07:00
Josh Rosen c5e2810dc7 Add persist(), splits(), glom(), and mapPartitions() to Java API. 2012-07-26 12:46:47 -07:00
Josh Rosen bf61c10072 Detect non-zero exit status from PipedRDD process. 2012-07-26 11:32:59 -07:00
Denny 4f4a34c025 Stlystic changes
Conflicts:

	core/src/test/scala/spark/MesosSchedulerSuite.scala
2012-07-23 16:32:20 -07:00
Denny 866e6949df Always destroy SparkContext in after block for the unit tests.
Conflicts:

	core/src/test/scala/spark/ShuffleSuite.scala
2012-07-23 16:29:17 -07:00
Josh Rosen 042dcbde33 Add type annotations to Java API methods.
Add missing Scala Map to java.util.Map conversions.
2012-07-22 17:35:29 -07:00
Josh Rosen 01dce3f569 Add Java API
Add distinct() method to RDD.

Fix bug in DoubleRDDFunctions.
2012-07-18 17:34:29 -07:00
Matei Zaharia 408b5a1332 More work on deploy code (adding Worker class) 2012-06-30 16:45:57 -07:00
Matei Zaharia 2fb6e7d71e Initial framework to get a master and web UI up. 2012-06-30 14:45:55 -07:00
Matei Zaharia c53670b9bf Various code style fixes, mostly from IntelliJ IDEA 2012-06-29 18:47:12 -07:00
Matei Zaharia 3920189932 Upgraded to Akka 2 and fixed test execution (which was still parallel
across projects).
2012-06-28 23:51:28 -07:00
Tathagata Das e896a505e2 Added testcase for ByteBufferInputStream bugs. 2012-06-17 16:11:12 -07:00
Matei Zaharia f58da6164e Merge branch 'master' into dev 2012-06-15 23:47:11 -07:00
Tathagata Das c6156da9e2 Multiple bug fixes to pass the testsuites ShuffleSuite and BlockManagerSuite. 2012-06-13 16:26:49 -04:00
Matei Zaharia e75b1b5cb4 Change the default broadcast implementation to a simple HTTP-based
broadcast. Fixes #139.
2012-06-09 15:58:07 -07:00
Matei Zaharia a96558caa3 Performance improvements to shuffle operations: in particular, preserve
RDD partitioning in more cases where it's possible, and use iterators
instead of materializing collections when doing joins.
2012-06-09 14:44:18 -07:00
Matei Zaharia c2c7299d7a Added BlockManagerSuite, which I'd forgotten to merge. 2012-06-07 13:47:10 -07:00
Matei Zaharia 63051dd2bc Merge in engine improvements from the Spark Streaming project, developed
jointly with Tathagata Das and Haoyuan Li. This commit imports the changes
and ports them to Mesos 0.9, but does not yet pass unit tests due to
various classes not supporting a graceful stop() yet.
2012-06-07 12:45:38 -07:00
Matei Zaharia 6ae2746d1e Handle arrays that contain the same element many times better in
SizeEstimator. Also added a test for SizeEstimator. Fixes #136.
2012-06-06 16:13:02 -07:00
Matei Zaharia 0a617958d1 Some refactoring to make BoundedMemoryCache test similar to others 2012-06-06 16:12:08 -07:00
Matei Zaharia e141f644ca Merge pull request #132 from Benky/rb-first-iteration
Little refactoring and unit tests for CacheTrackerActor
2012-05-26 13:15:06 -07:00
Richard Benkovsky ae64920337 MesosScheduler refactoring 2012-05-22 11:04:54 +02:00
Richard Benkovsky 3a1bcd4028 Added tests for CacheTrackerActor 2012-05-22 11:04:54 +02:00
Richard Benkovsky 518506a7c5 Added tests for Utils.copyStream 2012-05-22 11:04:51 +02:00
Richard Benkovsky 565245871f BoundedMemoryCache.put fails when estimated size of 'value' is larger than cache capacity 2012-05-20 22:13:35 +02:00
Reynold Xin 16461e2eda Updated Cache's put method to use a case class for response. Previously
it was pretty ugly that put() should return -1 for failures.
2012-05-15 00:31:52 -07:00
Reynold Xin 019e48833f Added the capacity to report cache usage status back to the cache
trackor. This is essential for building a dashboard to see the status of
caches on all slaves.
2012-05-14 18:39:04 -07:00
Reynold Xin 761ea65a98 Added a test for the previous commit (failing to serialize task results
would throw an exception for local tasks).
2012-04-24 15:14:35 -07:00
Reynold Xin e601b3b9e5 Added the ability to set environmental variables in piped rdd. 2012-04-17 16:40:56 -07:00
Matei Zaharia c7af538ac1 Some fixes to sorting for when the RDD has fewer elements than the
number of partitions we ask to partition it into. Also, removed a test
that was taking way too long to run.
2012-03-17 13:08:36 -07:00
Matei Zaharia 1e10df0a46 Merge pull request #111 from alupher/master
Adding sorting to RDDs
2012-02-24 15:50:14 -08:00
Antonio 0d93d95bcf Removed unnecessary import 2012-02-21 19:57:12 -08:00
Antonio 2990298f71 Added sorting testing suite 2012-02-21 19:54:21 -08:00
Matei Zaharia aa04f87cd2 Added support for parallel execution of jobs in DAGScheduler. 2012-02-19 22:50:23 -08:00
Matei Zaharia a766780f4c Added some tests for multithreaded access to Spark. 2012-02-09 22:27:53 -08:00
Matei Zaharia 43a3335090 Simplifying test 2012-02-05 22:46:51 -08:00
Matei Zaharia eb05154b7a Fixed a failure recovery bug and added some tests for fault recovery. 2012-01-13 19:08:25 -08:00
Matei Zaharia e269f6f7ea Register RDDs with the MapOutputTracker even if they have no partitions.
Fixes #105.
2012-01-05 15:59:20 -05:00
Matei Zaharia 735843a049 Merge remote-tracking branch 'origin/charles-newhadoop' 2011-12-02 21:59:30 -08:00
Charles Reiss 66f05f383e Add new Hadoop API reading support. 2011-12-01 14:02:10 -08:00
Charles Reiss 02d43e6986 Add new Hadoop API writing support. 2011-12-01 14:01:28 -08:00
Matei Zaharia 22b8fcf632 Added fold() and aggregate() operations that reuse an object to
merge results into rather than requiring a new object allocation
for each element merged. Fixes #95.
2011-11-30 11:37:47 -08:00
Matei Zaharia 9e4c79a4d3 Closure cleaner unit test 2011-11-08 00:40:15 -08:00
Matei Zaharia c2b7fd6899 Make parallelize() work efficiently for ranges of Long, Double, etc
(splitting them into sub-ranges). Fixes #87.
2011-11-02 15:16:02 -07:00
Matei Zaharia d12122502b Various improvements to Kryo serializer:
- Replaced modified Kryo version with the standard one augmented with
  the kryo-serializers package, which includes support for classes with
  no-arg constructors (that was why we had a modified Kryo before)
- The kryo-serializers version also fixes issue #72.
- Added a bunch of tests.
- Serialize maps and a few other common types properly by default.
2011-07-21 22:09:33 -07:00
Matei Zaharia e4c3402d2d Renamed ParallelArray to ParallelCollection 2011-07-14 14:47:01 -04:00
Matei Zaharia 2604939f64 Simplified and documented code a little and added test 2011-07-14 00:19:00 -04:00
Matei Zaharia 9c0069188b Updated save code to allow non-file-based OutputFormats and added a test
for file-related stuff
2011-07-13 23:04:06 -04:00
Matei Zaharia 842e14d567 Added mapPartitions operation and a bunch of tests for RDD ops 2011-07-13 00:19:52 -04:00
Olivier Grisel 2e3531d8bf Implemented RDD.leftOuterJoin and RDD.rightOuterJoin 2011-06-24 11:00:51 +02:00
Olivier Grisel 005d1605a4 add missing test for RDD.groupWith 2011-06-23 02:10:52 +02:00
Ismael Juma 1396678baa Move REPL classes to separate module. 2011-05-27 11:22:50 +01:00
Matei Zaharia 4db50e26c7 Fixed unit tests by making them clean up the SparkContext after use and
thus clean up the various singletons (RDDCache, MapOutputTracker, etc).
This isn't perfect yet (ideally we shouldn't use singleton objects at
all) but we can fix that later.
2011-05-13 12:03:58 -07:00
Matei Zaharia e5c4cd8a5e Made examples and core subprojects 2011-02-01 15:11:08 -08:00