ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Charles Reiss	cf79de425d	Fix NullPointerException when unregistering a map output twice.	2012-11-27 16:12:05 -08:00
Matei Zaharia	3ff6f4bdee	Merge pull request #304 from mbautin/configurable_local_ip SPARK-624: make the default local IP customizable	2012-11-19 13:23:39 -08:00
mbautin	00f4e3ff9c	Addressing Matei's comment: SPARK_LOCAL_IP environment variable	2012-11-19 11:52:10 -08:00
Charles Reiss	12c24e786c	Set default uncaught exception handler to exit. Among other things, should prevent OutOfMemoryErrors in some daemon threads (such as the network manager) from causing a spark executor to enter a state where it cannot make progress but does not report an error.	2012-11-16 20:12:31 -08:00
mbautin	1f5a7e0e64	SPARK-624: make the default local IP customizable	2012-11-15 13:57:47 -08:00
Matei Zaharia	c23a74df0a	Use DNS names instead of IP addresses in standalone mode, to allow matching with data locality hints from storage systems.	2012-11-15 00:10:52 -08:00
Matei Zaharia	173e0354c0	Detect correctly when one has disconnected from a standalone cluster. SPARK-617 #resolve	2012-11-11 21:06:57 -08:00
root	acf8272324	Fix K-means example a little	2012-11-10 23:07:21 -08:00
Tathagata Das	9915989bfa	Incorporated Matei's suggestions. Tested with 5 producer(consumer) threads each doing 50k puts (gets), took 15 minutes to run, no errors or deadlocks.	2012-11-09 15:46:15 -08:00
Tathagata Das	de00bc63db	Fixed deadlock in BlockManager. 1. Changed the lock structure of BlockManager by replacing the 337 coarse-grained locks to use BlockInfo objects as per-block fine-grained locks. 2. Changed the MemoryStore lock structure by making the block putting threads lock on a different object (not the memory store) thus making sure putting threads minimally blocks to the getting threads. 3. Added spark.storage.ThreadingTest to stress test the BlockManager using 5 block producer and 5 block consumer threads.	2012-11-09 14:09:37 -08:00
Matei Zaharia	6607f546cc	Added an option to spread out jobs in the standalone mode.	2012-11-08 23:13:12 -08:00
Matei Zaharia	66cbdee941	Fix for connections not being reused (from Josh Rosen)	2012-11-08 09:53:40 -08:00
Imran Rashid	809b2bb1fe	fix bug in getting slave id out of mesos	2012-11-08 00:34:28 -08:00
Matei Zaharia	bb1bce7924	Various fixes to standalone mode and web UI: - Don't report a job as finishing multiple times - Don't show state of workers as LOADING when they're running - Show start and finish times in web UI - Sort web UI tables by ID and time by default	2012-11-07 16:49:53 -08:00
Matei Zaharia	e2b8477487	Made Akka timeout and message frame size configurable, and upped the defaults	2012-11-06 15:58:05 -08:00
Shivaram Venkataraman	a7d967a1ca	Remove unnecessary hash-map put in MemoryStore	2012-11-01 10:46:38 -07:00
Josh Rosen	2ccf3b6652	Fix PySpark hash partitioning bug. A Java array's hashCode is based on its object identity, not its elements, so this was causing serialized keys to be hashed incorrectly. This commit adds a PySpark-specific workaround and adds more tests.	2012-10-28 22:30:28 -07:00
root	e782187b4a	Don't throw an error in the block manager when a block is cached on the master due to a locally computed operation Conflicts: core/src/main/scala/spark/storage/BlockManagerMaster.scala	2012-10-26 00:33:45 -07:00
Matei Zaharia	f63a40fd99	Strip leading mesos:// in URLs passed to Mesos	2012-10-24 21:52:13 -07:00
Matei Zaharia	d290e964ea	Merge pull request #281 from rxin/memreport Added a method to report slave memory status; force serialize accumulator update in local mode.	2012-10-23 22:04:35 -07:00
Matei Zaharia	0bd20c63e2	Merge remote-tracking branch 'JoshRosen/shuffle_refactoring' into dev Conflicts: core/src/main/scala/spark/Dependency.scala core/src/main/scala/spark/rdd/CoGroupedRDD.scala core/src/main/scala/spark/rdd/ShuffledRDD.scala	2012-10-23 22:01:45 -07:00
Josh Rosen	d4f2e5b0ef	Remove PYTHONPATH from SparkContext's executorEnvs. It makes more sense to pass it in the dictionary of environment variables that is used to construct PythonRDD.	2012-10-22 10:28:59 -07:00
Josh Rosen	c23bf1aff4	Add PySpark README and run scripts.	2012-10-20 00:22:27 +00:00
Josh Rosen	52989c8a2c	Update Python API for v0.6.0 compatibility.	2012-10-19 10:24:49 -07:00
Josh Rosen	e21eb6e00d	Merge tag 'v0.6.0' into python-api	2012-10-19 09:44:32 -07:00
Thomas Dudziak	d9c2a89c57	Support for Hadoop 2 distributions such as cdh4	2012-10-18 16:08:54 -07:00
Reynold Xin	4a3fb06ac2	Updated Kryo to 2.20.	2012-10-16 01:10:01 -07:00
Reynold Xin	63fae9bc23	Serialize accumulator updates in TaskResult for local mode.	2012-10-15 21:38:28 -07:00
Reynold Xin	42d20fa8da	Added a method to report slave memory status.	2012-10-14 22:30:53 -07:00
Matei Zaharia	64dbf8d372	Made ShuffleDependency automatically find a shuffle ID for itself	2012-10-14 10:00:22 -07:00
Matei Zaharia	8815aeba0c	Take executor environment vars as an argument to SparkContext	2012-10-13 15:31:11 -07:00
Josh Rosen	33cd3a0c12	Remove map-side combining from ShuffleMapTask. This separation of concerns simplifies the ShuffleDependency and ShuffledRDD interfaces. Map-side combining can be performed in a mapPartitions() call prior to shuffling the RDD. I don't anticipate this having much of a performance impact: in both approaches, each tuple is hashed twice: once in the bucket partitioning and once in the combiner's hashtable. The same steps are being performed, but in a different order and through one extra Iterator.	2012-10-13 14:59:20 -07:00
Josh Rosen	10bcd217d2	Remove mapSideCombine field from Aggregator. Instead, the presence or absence of a ShuffleDependency's aggregator will control whether map-side combining is performed.	2012-10-13 14:59:20 -07:00
Josh Rosen	4775c55641	Change ShuffleFetcher to return an Iterator.	2012-10-13 14:59:20 -07:00
Josh Rosen	110832e88f	Add helper methods to Aggregator.	2012-10-13 14:57:56 -07:00
Denny	0700d1920a	Protect from null env variables in mesos.	2012-10-13 13:57:59 -07:00
Denny	21047d923e	Protect from setting null environment variables.	2012-10-13 13:44:24 -07:00
Denny	fa41d50f7d	Don't use system envs for Mesos.	2012-10-13 13:15:50 -07:00
Denny	67c42a41d0	Let the user specify environment variables to be passed to the Executors. Also removed unused variables in the ExecutorRunner.	2012-10-13 13:08:44 -07:00
Matei Zaharia	b4067cbad4	More doc updates, and moved Serializer to a subpackage.	2012-10-12 18:19:21 -07:00
Matei Zaharia	8d7b77bcb5	Some doc and usability improvements: - Added a StorageLevels class for easy access to StorageLevel constants in Java - Added doc comments on Function classes in Java - Updated Accumulator and HadoopWriter docs slightly	2012-10-12 17:53:20 -07:00
Matei Zaharia	dca496bb77	Document cartesian() operation	2012-10-12 14:46:41 -07:00
Matei Zaharia	23015ccac0	Merge pull request #271 from shivaram/block-manager-npe-fix Change block manager to accept an ArrayBuffer	2012-10-12 14:36:28 -07:00
Patrick Wendell	dc8adbd359	Adding Java documentation	2012-10-11 00:49:03 -07:00
Shivaram Venkataraman	2cf40c5fd5	Change block manager to accept an ArrayBuffer instead of an iterator to ensure that the computation can proceed even if we run out of memory to cache the block. Update CacheTracker to use this new interface	2012-10-11 00:42:46 -07:00
Denny	d3f095f904	Fixed bug when fetching Jar dependencies. Instead of checking currentFiles check currentJars.	2012-10-10 16:09:53 -07:00
Matei Zaharia	ee2fcb2ce6	Added documentation to all the *RDDFunction classes, and moved them into the spark package to make them more visible. Also documented various other miscellaneous things in the API.	2012-10-09 18:38:36 -07:00
Matei Zaharia	bc0bc672d0	Updates to documentation: - Edited quick start and tuning guide to simplify them a little - Simplified top menu bar - Made private a SparkContext constructor parameter that was left as public - Various small fixes	2012-10-09 14:30:23 -07:00
Andy Konwinski	1d79ff6028	Fixes a typo, adds scaladoc comments to SparkContext constructors.	2012-10-08 22:49:17 -07:00
Patrick Wendell	ac310098ef	More docs in RDD class	2012-10-08 22:25:11 -07:00
Andy Konwinski	bd688940a1	A start on scaladoc for the public APIs.	2012-10-08 21:13:29 -07:00
Mosharaf Chowdhury	edc67bfba8	Merge branch 'dev' into bc-fix-dev	2012-10-08 16:19:13 -07:00
Matei Zaharia	efc5423210	Made compression configurable separately for shuffle, broadcast and RDDs	2012-10-07 11:30:53 -07:00
Matei Zaharia	039cc6228e	Merge pull request #251 from JoshRosen/docs/internals Document Dependency classes and make minor interface improvements	2012-10-07 09:56:53 -07:00
Reynold Xin	f66c0e9561	Changed the println to logInfo in Utils.fetchFile.	2012-10-07 01:53:24 -07:00
Matei Zaharia	d72db3d7dc	Merge pull request #250 from rxin/dev Fixed a bug in addFile that if the file is specified as "file:///", the symlink is created incorrectly for local mode.	2012-10-07 00:56:53 -07:00
Reynold Xin	80f59e17e2	Fixed a bug in addFile that if the file is specified as "file:///", the symlink is created wrong for local mode.	2012-10-07 00:54:38 -07:00
Josh Rosen	e10308f5a0	Make ShuffleDependency.aggregator explicitly optional. It was confusing to be using new Aggregator[K, V, V](null, null, null, false) to represent the absence of an aggregator.	2012-10-07 00:36:04 -07:00
Matei Zaharia	f930fe5d81	Improve error message	2012-10-07 07:34:36 +00:00
Matei Zaharia	a3bf0ce57f	Don't crash on ask timeout exceptions in deploy.Client.stop() (fixes a crash in tests)	2012-10-07 07:25:41 +00:00
Matei Zaharia	eca570f66a	Removed the need to sleep in tests due to waiting for Akka to shut down	2012-10-07 00:17:59 -07:00
Josh Rosen	4f72066a9a	Document the Dependency classes.	2012-10-07 00:05:37 -07:00
Josh Rosen	3f2571fe98	Remove unused isShuffle field from Dependency.	2012-10-07 00:03:55 -07:00
Matei Zaharia	b2fc3dd902	Log message	2012-10-07 06:43:52 +00:00
Matei Zaharia	ea096f7cd5	More logging	2012-10-07 06:35:48 +00:00
root	554b42cb24	Log more info in MapOutputTracker	2012-10-07 05:02:18 +00:00
root	a73b25826b	Made Akka thread pool and message batch sizes configurable	2012-10-07 04:19:54 +00:00
root	ce915cadee	Made run script add test-classes onto the classpath only if SPARK_TESTING is set; fixes #216	2012-10-07 04:19:16 +00:00
root	975009d688	Avoid acquiring locks in BlockManager when fetching shuffle outputs	2012-10-07 04:02:10 +00:00
root	0bc63f7ef1	Log initial number of fetches in reducer	2012-10-07 03:51:04 +00:00
Matei Zaharia	dc28a3ac0a	Modified shuffle to limit the maximum outstanding data size in bytes, instead of the maximum number of outstanding fetches. This should make it faster when there are many small map output files, as well as more robust to overallocating memory on large map outputs.	2012-10-06 20:07:10 -07:00
Matei Zaharia	9a3b3f32a3	Pass sizes of map outputs back to MapOutputTracker	2012-10-06 18:46:04 -07:00
Matei Zaharia	0e42832e6a	Made block store return the size of each block put in	2012-10-06 18:00:53 -07:00
Matei Zaharia	b0110de5b6	Warn about user programs that try to set spark.cache.class	2012-10-06 17:27:14 -07:00
Matei Zaharia	65113b7e1b	Only group elements ten at a time into SequenceFile records in saveAsObjectFile	2012-10-06 17:14:41 -07:00
Matei Zaharia	716e10ca32	Minor formatting fixes	2012-10-05 22:03:06 -07:00
Matei Zaharia	70f02fa912	Merge branch 'dev' of github.com:mesos/spark into dev	2012-10-05 22:00:22 -07:00
Andy Konwinski	a242cdd0a6	Factor subclasses of RDD out of RDD.scala into their own classes in the rdd package.	2012-10-05 19:53:54 -07:00
Andy Konwinski	d7363a6b8a	Moves all files in core/src/main/scala/ that have RDD in their name from that directory to a new core/src/main/scala/rdd directory.	2012-10-05 19:23:45 -07:00
Andy Konwinski	e0067da082	Moves all files in core/src/main/scala/ that have RDD in them from package spark to package spark.rdd and updates all references to them.	2012-10-05 19:23:45 -07:00
Matei Zaharia	69588baf65	Cleaning up code slightly	2012-10-05 19:16:09 -07:00
root	f52bc09a34	Reduce some overly aggressive logging in connection manager	2012-10-06 01:54:39 +00:00
Matei Zaharia	e3ae98b54e	Merge pull request #247 from squito/dev Dev	2012-10-05 10:27:18 -07:00
Imran Rashid	e0698f8f26	change tests to show utility of localValue	2012-10-04 23:05:42 -07:00
Imran Rashid	82a3327862	make accumulator.localValue public, add tests Conflicts: core/src/test/scala/spark/AccumulatorSuite.scala	2012-10-04 23:05:01 -07:00
Matei Zaharia	8c82f43db3	Scaladoc documentation for some core Spark functionality	2012-10-04 22:59:36 -07:00
Reynold Xin	45f4b7cc7e	Made Serializer and JavaSerializer non private.	2012-10-03 10:20:59 -07:00
Matei Zaharia	833f1d0c86	Made StorageLevel public	2012-10-03 08:27:25 -07:00
Matei Zaharia	6cf5dffc72	Make more stuff private[spark]	2012-10-02 22:28:55 -07:00
Mosharaf Chowdhury	119e50c7b9	Conflict fixed	2012-10-02 22:25:39 -07:00
Matei Zaharia	626f701931	Merge pull request #240 from dennybritz/private_classes Package-Private Classes	2012-10-02 21:24:32 -07:00
Denny	0361353a70	Make Java API abstract wrapped functions private	2012-10-02 20:02:53 -07:00
Denny	b9badcd5bd	accidentally removed trait	2012-10-02 19:35:07 -07:00
Denny	18a1faedf6	Stylistic changes and Public Accumulable and Broadcast	2012-10-02 19:28:37 -07:00
Denny	b7a913e1fa	Make dependency classes public - used by spark	2012-10-02 19:04:23 -07:00
Denny	4d9f4b01af	Make classes package private	2012-10-02 19:00:19 -07:00
Matei Zaharia	97cbd699d7	Merge branch 'dev' of github.com:mesos/spark into dev	2012-10-02 17:31:01 -07:00
Matei Zaharia	6098f7e87a	Fixed cache replacement behavior of BlockManager: - Partitions that get dropped to disk will now be loaded back into RAM after they're accessed again - Same-RDD rule for cache replacement is now implemented (don't drop partitions from an RDD to make room for other partitions from itself) - Items stored as MEMORY_AND_DISK go into memory only first, instead of being eagerly written out to disk - MemoryStore.ensureFreeSpace is called within a lock on the writer thread to prevent race conditions (this can still be optimized to allow multiple concurrent calls to it but it's a start) - MemoryStore does not accept blocks larger than its limit	2012-10-02 17:25:38 -07:00
Reynold Xin	7997585616	Added a check to make sure SPARK_MEM <= memoryPerSlave for local cluster mode.	2012-10-02 15:45:25 -07:00
Reynold Xin	0898a21b95	Merge branch 'dev' of https://github.com/mesos/spark into dev	2012-10-02 13:08:01 -07:00
Matei Zaharia	22684653a5	Revert "Place Spray repo ahead of Cloudera in Maven search path" This reverts commit `42e0a68082`.	2012-10-02 12:01:32 -07:00
Reynold Xin	b8cd681169	Allow whitespaces in cluster URL configuration for local cluster.	2012-10-02 11:52:12 -07:00
Matei Zaharia	42e0a68082	Place Spray repo ahead of Cloudera in Maven search path	2012-10-02 11:37:19 -07:00
Matei Zaharia	b9fb8d6463	Include date in folder name for Spark local dir.	2012-10-01 15:55:16 -07:00
Matei Zaharia	bc881e4798	Merge branch 'dev' of github.com:mesos/spark into dev	2012-10-01 15:21:56 -07:00
Matei Zaharia	802aa8aef9	Some bug fixes and logging fixes for broadcast.	2012-10-01 15:20:42 -07:00
Reynold Xin	f264153162	Fixed #232 : DirectBuffer's cleaner was empty and Spark tried to invoke clean on it.	2012-10-01 14:07:34 -07:00
Matei Zaharia	3b348f909d	Improve log messages from BlockManager	2012-10-01 12:01:38 -07:00
Matei Zaharia	53f90d0f0e	Use underscores instead of colons in RDD IDs	2012-10-01 10:48:53 -07:00
Matei Zaharia	2314132d57	Added a (failing) test for LRU with MEMORY_AND_DISK.	2012-09-30 22:52:16 -07:00
Matei Zaharia	3128c57f90	Simplified Class / ClassLoader test	2012-09-30 21:48:27 -07:00
Matei Zaharia	83143f9a5f	Fixed several bugs that caused weird behavior with files in spark-shell: - SizeEstimator was following through a ClassLoader field of Hadoop JobConfs, which referenced the whole interpreter, Scala compiler, etc. Chaos ensued, giving an estimated size in the tens of gigabytes. - Broadcast variables in local mode were only stored as MEMORY_ONLY and never made accessible over a server, so they fell out of the cache when they were deemed too large and couldn't be reloaded.	2012-09-30 21:19:39 -07:00
Matei Zaharia	fd0374b9de	Comment	2012-09-29 21:43:06 -07:00
Matei Zaharia	5718cef2a4	Removed Logging trait from CoalescedRDD since we don't log anything	2012-09-29 21:40:43 -07:00
Matei Zaharia	143ef4f90d	Added a CoalescedRDD class for reducing the number of partitions in an RDD.	2012-09-29 21:30:52 -07:00
Matei Zaharia	ebd52347b5	Merge branch 'dev' of github.com:mesos/spark into dev	2012-09-29 20:22:31 -07:00
Matei Zaharia	9b326d01e9	Made BlockManager unmap memory-mapped files when necessary to reduce the number of open files. Also optimized sending of disk-based blocks.	2012-09-29 20:21:54 -07:00
Matei Zaharia	2f11e3c285	Merge pull request #227 from JoshRosen/fix/distinct_numsplits Allow controlling number of splits in distinct().	2012-09-28 23:57:24 -07:00
Josh Rosen	8654165e69	Use null as dummy value in distinct().	2012-09-28 23:55:17 -07:00
Josh Rosen	37c199bbb0	Allow controlling number of splits in distinct().	2012-09-28 23:44:19 -07:00
Matei Zaharia	56dcad5936	Don't create a Cache in SparkEnv because we don't use it	2012-09-28 23:40:56 -07:00
Matei Zaharia	1d44644f4f	Logging tweaks	2012-09-28 23:28:16 -07:00
Matei Zaharia	815d6bd69a	Renamed subdirs option	2012-09-28 19:02:41 -07:00
Matei Zaharia	e54e1d7043	Made subdirs per local dir configurable, and reduced lock usage a bit	2012-09-28 19:00:50 -07:00
Matei Zaharia	ae8c7d6cfa	Made disk store use multiple directories, deleted ShuffleManager	2012-09-28 18:28:13 -07:00
Matei Zaharia	3d7267999d	Print and track user call sites in more places in Spark	2012-09-28 17:42:00 -07:00
Matei Zaharia	9f6efbf06a	Merge pull request #225 from pwendell/dev Log message which records RDD origin	2012-09-28 16:28:07 -07:00
Matei Zaharia	0121a26bd1	Changed the way tasks' dependency files are sent to workers so that custom serializers or Kryo registrators can be loaded.	2012-09-28 16:14:05 -07:00
Patrick Wendell	9fc78f8f29	Fixing some whitespace issues	2012-09-28 16:05:50 -07:00
Patrick Wendell	bc909c2903	Changes based on Matei's comments	2012-09-28 16:04:36 -07:00
Patrick Wendell	c387e40fb1	Log message which records RDD origin This adds tracking to determine the "origin" of an RDD. Origin is defined by the boundary between the user's code and the spark code, during an RDD's instantiation. It is meant to help users understand where a Spark RDD is coming from in their code. This patch also logs origin data when stages are submitted to the scheduler. Finally, it adds a new log message to fix an inconsistency in the way that dependent stages (those missing parents) and independent stages (those without) are logged during submission.	2012-09-28 15:51:46 -07:00
Matei Zaharia	2a8bfbca00	Fixed a bug where isLocal was set to false when using local[K]	2012-09-28 14:50:54 -07:00
Matei Zaharia	4a138403ef	Fix a bug in JAR fetcher that made it always fetch the JAR	2012-09-27 21:32:06 -07:00
Matei Zaharia	009b0e37e7	Added an option to compress blocks in the block store	2012-09-27 18:45:44 -07:00
Matei Zaharia	7bcb08cef5	Renamed storage levels to something cleaner; fixes #223 .	2012-09-27 17:50:59 -07:00
Matei Zaharia	920fab23c3	Merge pull request #222 from rxin/dev Added MapPartitionsWithSplitRDD.	2012-09-26 23:16:45 -07:00
Matei Zaharia	ea05fc130b	Updates to standalone cluster, web UI and deploy docs.	2012-09-26 22:54:39 -07:00
Matei Zaharia	1ef4f0fbd2	Allow controlling number of splits in sortByKey.	2012-09-26 19:18:47 -07:00
Reynold Xin	1ad1331a34	Added MapPartitionsWithSplitRDD.	2012-09-26 17:11:28 -07:00
Matei Zaharia	ee71fa49c1	Look for Kryo registrator using context class loader	2012-09-26 14:15:16 -07:00
Matei Zaharia	d71a358c46	Fixed a test that was getting extremely lucky before, and increased the number of samples used for sorting	2012-09-26 00:25:34 -07:00
Matei Zaharia	051785c7e6	Several fixes to sampling issues pointed out by Henry Milner: - takeSample was biased towards earlier partitions - There were some range errors in takeSample - SampledRDDs with replacement didn't produce appropriate counts across partitions (we took exactly frac of each one)	2012-09-25 21:46:58 -07:00
Matei Zaharia	4d3339a3ec	Merge pull request #217 from rxin/dev Added a method to RDD to expose the ClassManifest.	2012-09-24 23:52:32 -07:00
Reynold Xin	7a4cd92861	Renamed RDD.manifest to RDD.elementClassManifest	2012-09-24 23:42:33 -07:00
Matei Zaharia	296e24b440	Merge pull request #218 from rnpandya/dev Scripts to start Spark under windows	2012-09-24 21:10:31 -07:00
Reynold Xin	348bcbca1f	Added a method to RDD to expose the ClassManifest.	2012-09-24 16:56:27 -07:00
Ravi Pandya	39215357af	Windows command scripts for sbt and run	2012-09-24 15:43:19 -07:00
Matei Zaharia	6eeb379cf8	Fix some test issues	2012-09-24 15:39:58 -07:00
Matei Zaharia	f855e4fad2	Merge pull request #208 from rxin/dev Separated ShuffledRDD into multiple classes.	2012-09-24 12:32:01 -07:00
root	107a5ca879	Make default number of parallel fetches slightly smaller since it doesn't seem to hurt performance much and it will cause slightly less GC.	2012-09-23 06:06:12 +00:00

679 commits