ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Matei Zaharia	8815aeba0c	Take executor environment vars as an argument to SparkContext	2012-10-13 15:31:11 -07:00
Josh Rosen	33cd3a0c12	Remove map-side combining from ShuffleMapTask. This separation of concerns simplifies the ShuffleDependency and ShuffledRDD interfaces. Map-side combining can be performed in a mapPartitions() call prior to shuffling the RDD. I don't anticipate this having much of a performance impact: in both approaches, each tuple is hashed twice: once in the bucket partitioning and once in the combiner's hashtable. The same steps are being performed, but in a different order and through one extra Iterator.	2012-10-13 14:59:20 -07:00
Josh Rosen	10bcd217d2	Remove mapSideCombine field from Aggregator. Instead, the presence or absence of a ShuffleDependency's aggregator will control whether map-side combining is performed.	2012-10-13 14:59:20 -07:00
Josh Rosen	4775c55641	Change ShuffleFetcher to return an Iterator.	2012-10-13 14:59:20 -07:00
Matei Zaharia	682b2d9329	Added a test for when an RDD only partially fits in memory	2012-10-12 14:58:26 -07:00
Shivaram Venkataraman	8577523f37	Add test to verify if RDD is computed even if block manager has insufficient memory	2012-10-12 14:14:57 -07:00
Shivaram Venkataraman	2cf40c5fd5	Change block manager to accept an ArrayBuffer instead of an iterator to ensure that the computation can proceed even if we run out of memory to cache the block. Update CacheTracker to use this new interface	2012-10-11 00:42:46 -07:00
Matei Zaharia	efc5423210	Made compression configurable separately for shuffle, broadcast and RDDs	2012-10-07 11:30:53 -07:00
Reynold Xin	80f59e17e2	Fixed a bug in addFile that if the file is specified as "file:///", the symlink is created wrong for local mode.	2012-10-07 00:54:38 -07:00
Matei Zaharia	eca570f66a	Removed the need to sleep in tests due to waiting for Akka to shut down	2012-10-07 00:17:59 -07:00
Matei Zaharia	dc28a3ac0a	Modified shuffle to limit the maximum outstanding data size in bytes, instead of the maximum number of outstanding fetches. This should make it faster when there are many small map output files, as well as more robust to overallocating memory on large map outputs.	2012-10-06 20:07:10 -07:00
Matei Zaharia	9a3b3f32a3	Pass sizes of map outputs back to MapOutputTracker	2012-10-06 18:46:04 -07:00
Matei Zaharia	716e10ca32	Minor formatting fixes	2012-10-05 22:03:06 -07:00
Andy Konwinski	a242cdd0a6	Factor subclasses of RDD out of RDD.scala into their own classes in the rdd package.	2012-10-05 19:53:54 -07:00
Andy Konwinski	e0067da082	Moves all files in core/src/main/scala/ that have RDD in them from package spark to package spark.rdd and updates all references to them.	2012-10-05 19:23:45 -07:00
Shivaram Venkataraman	b6e4f46a96	Fix SizeEstimator tests to work with String classes in JDK 6 and 7 Conflicts: core/src/test/scala/spark/BoundedMemoryCacheSuite.scala	2012-10-05 16:58:57 -07:00
Imran Rashid	e0698f8f26	change tests to show utility of localValue	2012-10-04 23:05:42 -07:00
Imran Rashid	82a3327862	make accumulator.localValue public, add tests Conflicts: core/src/test/scala/spark/AccumulatorSuite.scala	2012-10-04 23:05:01 -07:00
Matei Zaharia	97cbd699d7	Merge branch 'dev' of github.com:mesos/spark into dev	2012-10-02 17:31:01 -07:00
Matei Zaharia	5fda59ab99	Added a test for overly large blocks in memory store	2012-10-02 17:30:40 -07:00
Matei Zaharia	6098f7e87a	Fixed cache replacement behavior of BlockManager: - Partitions that get dropped to disk will now be loaded back into RAM after they're accessed again - Same-RDD rule for cache replacement is now implemented (don't drop partitions from an RDD to make room for other partitions from itself) - Items stored as MEMORY_AND_DISK go into memory only first, instead of being eagerly written out to disk - MemoryStore.ensureFreeSpace is called within a lock on the writer thread to prevent race conditions (this can still be optimized to allow multiple concurrent calls to it but it's a start) - MemoryStore does not accept blocks larger than its limit	2012-10-02 17:25:38 -07:00
Reynold Xin	0898a21b95	Merge branch 'dev' of https://github.com/mesos/spark into dev	2012-10-02 13:08:01 -07:00
Matei Zaharia	22684653a5	Revert "Place Spray repo ahead of Cloudera in Maven search path" This reverts commit `42e0a68082`.	2012-10-02 12:01:32 -07:00
Reynold Xin	b8cd681169	Allow whitespaces in cluster URL configuration for local cluster.	2012-10-02 11:52:12 -07:00
Matei Zaharia	42e0a68082	Place Spray repo ahead of Cloudera in Maven search path	2012-10-02 11:37:19 -07:00
Matei Zaharia	74a9244255	Write all unit test output to a file	2012-10-01 15:07:42 -07:00
Matei Zaharia	0b84871dbc	Remove some printlns in tests	2012-10-01 10:57:26 -07:00
Matei Zaharia	2314132d57	Added a (failing) test for LRU with MEMORY_AND_DISK.	2012-09-30 22:52:16 -07:00
Matei Zaharia	83143f9a5f	Fixed several bugs that caused weird behavior with files in spark-shell: - SizeEstimator was following through a ClassLoader field of Hadoop JobConfs, which referenced the whole interpreter, Scala compiler, etc. Chaos ensued, giving an estimated size in the tens of gigabytes. - Broadcast variables in local mode were only stored as MEMORY_ONLY and never made accessible over a server, so they fell out of the cache when they were deemed too large and couldn't be reloaded.	2012-09-30 21:19:39 -07:00
Matei Zaharia	fd0374b9de	Comment	2012-09-29 21:43:06 -07:00
Matei Zaharia	143ef4f90d	Added a CoalescedRDD class for reducing the number of partitions in an RDD.	2012-09-29 21:30:52 -07:00
Matei Zaharia	c45758ddde	Comment	2012-09-29 20:27:54 -07:00
Matei Zaharia	9b326d01e9	Made BlockManager unmap memory-mapped files when necessary to reduce the number of open files. Also optimized sending of disk-based blocks.	2012-09-29 20:21:54 -07:00
Matei Zaharia	009b0e37e7	Added an option to compress blocks in the block store	2012-09-27 18:45:44 -07:00
Matei Zaharia	7bcb08cef5	Renamed storage levels to something cleaner; fixes #223.	2012-09-27 17:50:59 -07:00
Matei Zaharia	920fab23c3	Merge pull request #222 from rxin/dev Added MapPartitionsWithSplitRDD.	2012-09-26 23:16:45 -07:00
Matei Zaharia	1ef4f0fbd2	Allow controlling number of splits in sortByKey.	2012-09-26 19:18:47 -07:00
Reynold Xin	1ad1331a34	Added MapPartitionsWithSplitRDD.	2012-09-26 17:11:28 -07:00
Matei Zaharia	d71a358c46	Fixed a test that was getting extremely lucky before, and increased the number of samples used for sorting	2012-09-26 00:25:34 -07:00
Matei Zaharia	6eeb379cf8	Fix some test issues	2012-09-24 15:39:58 -07:00
Reynold Xin	397d3816e1	Separated ShuffledRDD into multiple classes: RepartitionShuffledRDD, ShuffledSortedRDD, and ShuffledAggregatedRDD.	2012-09-19 12:31:45 -07:00
Denny	5e4076e3f2	Merge branch 'dev' into feature/fileserver Conflicts: core/src/main/scala/spark/SparkContext.scala	2012-09-11 16:57:17 -07:00
Matei Zaharia	6d7f907e73	Manually merge pull request #175 by Imran Rashid	2012-09-11 16:00:06 -07:00
Denny	4d3471dd07	Fix serialization bugs and added local cluster tests	2012-09-10 15:39:58 -07:00
Denny	b864c36a30	Dynamically adding jar files and caching fileSets.	2012-09-10 12:49:09 -07:00
Denny	f275fb07da	General FileServer A general fileserver for both JARs and regular files.	2012-09-10 12:48:59 -07:00
Matei Zaharia	a13780670d	Added a unit test for local-cluster mode and simplified some of the code involved in that	2012-09-10 12:48:58 -07:00
Matei Zaharia	995982b3c9	Added a unit test for local-cluster mode and simplified some of the code involved in that	2012-09-07 17:08:36 -07:00
Reynold Xin	c308fbcb79	Removed cache add/remove log messages from CacheTracker. Added log messages on BlockManagerMaster to reflect block add/remove. Also did some minor cleanup of storage package code.	2012-09-05 15:59:48 -07:00
Reynold Xin	a8a2a08a1a	Added a test for testing map-side combine on/off switch.	2012-08-30 12:34:28 -07:00
Matei Zaharia	2c16ae36d7	Set log level in tests to WARN	2012-08-23 20:38:14 -07:00
Matei Zaharia	deedb9e7b7	Fix further issues with tests and broadcast. The broadcast fix is to store values as MEMORY_ONLY_DESER instead of MEMORY_ONLY, which will save substantial time on serialization.	2012-08-23 20:31:49 -07:00
Shivaram Venkataraman	0f4fbb057b	Change BlockManagerSuite test cases to use a deterministic size estimator and update the results to match the new estimates	2012-08-13 13:32:23 -07:00
Shivaram Venkataraman	22ba3a3f77	Add test-cases for 32-bit and no-compressed oops scenarios.	2012-08-13 13:32:10 -07:00
Shivaram Venkataraman	1f68c4b03b	Update test cases to match the new size estimates. Uses 64-bit and compressed oops setting to get deterministic results	2012-08-13 13:31:54 -07:00
Matei Zaharia	6ae3c375a9	Renamed apply() to call() in Java API and allowed it to throw Exceptions	2012-08-12 23:10:19 +02:00
Matei Zaharia	e463e7a333	Merge pull request #167 from JoshRosen/piped-rdd-fixes Detect non-zero exit status from PipedRDD process	2012-08-10 00:56:42 -07:00
Shivaram Venkataraman	ce3444d2cb	Fix testcheckpoint to reuse spark context defined in the class	2012-08-03 18:52:26 -07:00
Matei Zaharia	62898b631f	Made range partition balance tests more aggressive. This is because we pull out such a large sample (10x the number of partitions) that we should expect pretty good balance. The tests are also deterministic so there's no worry about them failing irreproducibly.	2012-08-03 16:46:48 -04:00
Matei Zaharia	6601a6212b	Added a unit test for cross-partition balancing in sort, and changes to RangePartitioner to make it pass. It turns out that the first partition was always kind of small due to how we picked partition boundaries.	2012-08-03 16:40:45 -04:00
Matei Zaharia	3ee2530c0c	Merge branch 'block-manager-fix' into dev	2012-07-30 13:58:46 -07:00
Matei Zaharia	400221f851	Merge branch 'dev' of git://github.com/tdas/spark into dev	2012-07-30 13:54:57 -07:00
Matei Zaharia	ed1b0f8388	Made BlockManagerMaster no longer be a singleton. Also cleaned up a few formatting things throughout block manager code.	2012-07-30 13:53:47 -07:00
Matei Zaharia	d7f089323a	Fixed AccumulatorSuite to clean up SparkContext with BeforeAndAfter	2012-07-28 20:25:42 -07:00
Imran Rashid	f7149c5e46	tasks cannot access value of accumulator	2012-07-28 20:16:17 -07:00
Imran Rashid	f1face1ea9	rename addToAccum to addAccumulator	2012-07-28 20:16:01 -07:00
Imran Rashid	2d666b9d76	add some functionality to Vector, delete copy in AccumulatorSuite	2012-07-28 20:15:51 -07:00
Imran Rashid	83659af11c	Accumulator now inherits from Accumulable, which simplifies a bunch of other things (e.g., no +:=) Conflicts: core/src/main/scala/spark/Accumulators.scala	2012-07-28 20:13:51 -07:00
Imran Rashid	ae07f3864c	add Accumulatable, add corresponding docs & tests for accumulators	2012-07-28 20:12:41 -07:00
Matei Zaharia	f6f917bd00	Add a sleep to prevent a failing test. The BlockManager's put seems to be slightly asynchronous, which can cause it to fail this test by not removing stuff from the cache before we put the next value. We should probably change the semantics of put() in this case but it's hard right now. It will also be hard for asynchronously replicated puts.	2012-07-27 16:59:36 -07:00
Matei Zaharia	c0c78d2119	Renamed test more descriptively	2012-07-27 16:28:18 -07:00
Matei Zaharia	dee8ff1b9d	Added a second version of union() without varargs.	2012-07-27 16:27:52 -07:00
Matei Zaharia	b51d733a57	Fixed Java union methods having same erasure. Changed union() methods on lists to take a separate "first element" argument in order to differentiate them to the compiler, because Java 7 considered it an error to have them all take Lists parameterized with different types.	2012-07-27 12:23:27 -07:00
Tathagata Das	024905f682	Added BlockRDD and a first-cut version of checkpoint() to RDD class.	2012-07-27 12:00:49 -07:00
Tathagata Das	0426769f89	Modified the block dropping code for better performance.	2012-07-26 20:53:45 -07:00
Matei Zaharia	5c5aa2ff81	Merge pull request #153 from JoshRosen/new-java-api Java API	2012-07-26 17:20:52 -07:00
Josh Rosen	c5e2810dc7	Add persist(), splits(), glom(), and mapPartitions() to Java API.	2012-07-26 12:46:47 -07:00
Josh Rosen	bf61c10072	Detect non-zero exit status from PipedRDD process.	2012-07-26 11:32:59 -07:00
Denny	4f4a34c025	Stylistic changes Conflicts: core/src/test/scala/spark/MesosSchedulerSuite.scala	2012-07-23 16:32:20 -07:00
Denny	866e6949df	Always destroy SparkContext in after block for the unit tests. Conflicts: core/src/test/scala/spark/ShuffleSuite.scala	2012-07-23 16:29:17 -07:00
Josh Rosen	042dcbde33	Add type annotations to Java API methods. Add missing Scala Map to java.util.Map conversions.	2012-07-22 17:35:29 -07:00
Josh Rosen	01dce3f569	Add Java API Add distinct() method to RDD. Fix bug in DoubleRDDFunctions.	2012-07-18 17:34:29 -07:00
Matei Zaharia	408b5a1332	More work on deploy code (adding Worker class)	2012-06-30 16:45:57 -07:00
Matei Zaharia	2fb6e7d71e	Initial framework to get a master and web UI up.	2012-06-30 14:45:55 -07:00
Matei Zaharia	c53670b9bf	Various code style fixes, mostly from IntelliJ IDEA	2012-06-29 18:47:12 -07:00
Matei Zaharia	3920189932	Upgraded to Akka 2 and fixed test execution (which was still parallel across projects).	2012-06-28 23:51:28 -07:00
Tathagata Das	e896a505e2	Added testcase for ByteBufferInputStream bugs.	2012-06-17 16:11:12 -07:00
Matei Zaharia	f58da6164e	Merge branch 'master' into dev	2012-06-15 23:47:11 -07:00
Tathagata Das	c6156da9e2	Multiple bug fixes to pass the testsuites ShuffleSuite and BlockManagerSuite.	2012-06-13 16:26:49 -04:00
Matei Zaharia	e75b1b5cb4	Change the default broadcast implementation to a simple HTTP-based broadcast. Fixes #139.	2012-06-09 15:58:07 -07:00
Matei Zaharia	a96558caa3	Performance improvements to shuffle operations: in particular, preserve RDD partitioning in more cases where it's possible, and use iterators instead of materializing collections when doing joins.	2012-06-09 14:44:18 -07:00
Matei Zaharia	c2c7299d7a	Added BlockManagerSuite, which I'd forgotten to merge.	2012-06-07 13:47:10 -07:00
Matei Zaharia	63051dd2bc	Merge in engine improvements from the Spark Streaming project, developed jointly with Tathagata Das and Haoyuan Li. This commit imports the changes and ports them to Mesos 0.9, but does not yet pass unit tests due to various classes not supporting a graceful stop() yet.	2012-06-07 12:45:38 -07:00
Matei Zaharia	6ae2746d1e	Handle arrays that contain the same element many times better in SizeEstimator. Also added a test for SizeEstimator. Fixes #136.	2012-06-06 16:13:02 -07:00
Matei Zaharia	0a617958d1	Some refactoring to make BoundedMemoryCache test similar to others	2012-06-06 16:12:08 -07:00
Matei Zaharia	e141f644ca	Merge pull request #132 from Benky/rb-first-iteration Little refactoring and unit tests for CacheTrackerActor	2012-05-26 13:15:06 -07:00
Richard Benkovsky	ae64920337	MesosScheduler refactoring	2012-05-22 11:04:54 +02:00
Richard Benkovsky	3a1bcd4028	Added tests for CacheTrackerActor	2012-05-22 11:04:54 +02:00
Richard Benkovsky	518506a7c5	Added tests for Utils.copyStream	2012-05-22 11:04:51 +02:00
Richard Benkovsky	565245871f	BoundedMemoryCache.put fails when estimated size of 'value' is larger than cache capacity	2012-05-20 22:13:35 +02:00
Reynold Xin	16461e2eda	Updated Cache's put method to use a case class for response. Previously it was pretty ugly that put() should return -1 for failures.	2012-05-15 00:31:52 -07:00
Reynold Xin	019e48833f	Added the capacity to report cache usage status back to the cache tracker. This is essential for building a dashboard to see the status of caches on all slaves.	2012-05-14 18:39:04 -07:00
Reynold Xin	761ea65a98	Added a test for the previous commit (failing to serialize task results would throw an exception for local tasks).	2012-04-24 15:14:35 -07:00
Reynold Xin	e601b3b9e5	Added the ability to set environmental variables in piped rdd.	2012-04-17 16:40:56 -07:00
Matei Zaharia	c7af538ac1	Some fixes to sorting for when the RDD has fewer elements than the number of partitions we ask to partition it into. Also, removed a test that was taking way too long to run.	2012-03-17 13:08:36 -07:00
Matei Zaharia	1e10df0a46	Merge pull request #111 from alupher/master Adding sorting to RDDs	2012-02-24 15:50:14 -08:00
Antonio	0d93d95bcf	Removed unnecessary import	2012-02-21 19:57:12 -08:00
Antonio	2990298f71	Added sorting testing suite	2012-02-21 19:54:21 -08:00
Matei Zaharia	aa04f87cd2	Added support for parallel execution of jobs in DAGScheduler.	2012-02-19 22:50:23 -08:00
Matei Zaharia	a766780f4c	Added some tests for multithreaded access to Spark.	2012-02-09 22:27:53 -08:00
Matei Zaharia	43a3335090	Simplifying test	2012-02-05 22:46:51 -08:00
Matei Zaharia	eb05154b7a	Fixed a failure recovery bug and added some tests for fault recovery.	2012-01-13 19:08:25 -08:00
Matei Zaharia	e269f6f7ea	Register RDDs with the MapOutputTracker even if they have no partitions. Fixes #105.	2012-01-05 15:59:20 -05:00
Matei Zaharia	735843a049	Merge remote-tracking branch 'origin/charles-newhadoop'	2011-12-02 21:59:30 -08:00
Charles Reiss	66f05f383e	Add new Hadoop API reading support.	2011-12-01 14:02:10 -08:00
Charles Reiss	02d43e6986	Add new Hadoop API writing support.	2011-12-01 14:01:28 -08:00
Matei Zaharia	22b8fcf632	Added fold() and aggregate() operations that reuse an object to merge results into rather than requiring a new object allocation for each element merged. Fixes #95.	2011-11-30 11:37:47 -08:00
Matei Zaharia	9e4c79a4d3	Closure cleaner unit test	2011-11-08 00:40:15 -08:00
Matei Zaharia	c2b7fd6899	Make parallelize() work efficiently for ranges of Long, Double, etc (splitting them into sub-ranges). Fixes #87.	2011-11-02 15:16:02 -07:00
Matei Zaharia	d12122502b	Various improvements to Kryo serializer: - Replaced modified Kryo version with the standard one augmented with the kryo-serializers package, which includes support for classes with no-arg constructors (that was why we had a modified Kryo before) - The kryo-serializers version also fixes issue #72. - Added a bunch of tests. - Serialize maps and a few other common types properly by default.	2011-07-21 22:09:33 -07:00
Matei Zaharia	e4c3402d2d	Renamed ParallelArray to ParallelCollection	2011-07-14 14:47:01 -04:00
Matei Zaharia	2604939f64	Simplified and documented code a little and added test	2011-07-14 00:19:00 -04:00
Matei Zaharia	9c0069188b	Updated save code to allow non-file-based OutputFormats and added a test for file-related stuff	2011-07-13 23:04:06 -04:00
Matei Zaharia	842e14d567	Added mapPartitions operation and a bunch of tests for RDD ops	2011-07-13 00:19:52 -04:00
Olivier Grisel	2e3531d8bf	Implemented RDD.leftOuterJoin and RDD.rightOuterJoin	2011-06-24 11:00:51 +02:00
Olivier Grisel	005d1605a4	add missing test for RDD.groupWith	2011-06-23 02:10:52 +02:00
Ismael Juma	1396678baa	Move REPL classes to separate module.	2011-05-27 11:22:50 +01:00
Matei Zaharia	4db50e26c7	Fixed unit tests by making them clean up the SparkContext after use and thus clean up the various singletons (RDDCache, MapOutputTracker, etc). This isn't perfect yet (ideally we shouldn't use singleton objects at all) but we can fix that later.	2011-05-13 12:03:58 -07:00
Matei Zaharia	e5c4cd8a5e	Made examples and core subprojects	2011-02-01 15:11:08 -08:00


229 commits