ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Matei Zaharia	0e40cfabf8	Fix some review comments	2013-10-08 23:16:16 -07:00
Matei Zaharia	b535db7d89	Added a fast and low-memory append-only map implementation for cogroup and parallel reduce operations	2013-10-08 23:14:38 -07:00
Reynold Xin	213b70a2db	Merge pull request #31 from sundeepn/branch-0.8 Resolving package conflicts with hadoop 0.23.9 Hadoop 0.23.9 is having a package conflict with easymock's dependencies. (cherry picked from commit `023e3fdf00`) Signed-off-by: Reynold Xin <rxin@apache.org>	2013-10-07 10:54:22 -07:00
Patrick Wendell	aa9fb84994	Merging build changes in from 0.8	2013-10-05 22:07:00 -07:00
Aaron Davidson	0f070279e7	Address Matei's comments	2013-10-05 15:15:29 -07:00
Aaron Davidson	db6f154940	Fix race conditions during recovery One major change was the use of messages instead of raw functions as the parameter of Akka scheduled timers. Since messages are serialized, unlike raw functions, the behavior is easier to think about and doesn't cause race conditions when exceptions are thrown. Another change is to avoid using global pointers that might change without a lock.	2013-10-04 19:54:33 -07:00
Andre Schumacher	c84946fe21	Fixing SPARK-602: PythonPartitioner Currently PythonPartitioner determines partition ID by hashing a byte-array representation of PySpark's key. This PR lets PythonPartitioner use the actual partition ID, which is required e.g. for sorting via PySpark.	2013-10-04 11:56:47 -07:00
Reynold Xin	d29e8035a0	Added countAsync and various unit tests for async actions.	2013-10-03 15:13:44 -07:00
Reynold Xin	e8e917f209	Merge branch 'master' into kill Conflicts: core/src/main/scala/org/apache/spark/TaskEndReason.scala core/src/main/scala/org/apache/spark/executor/Executor.scala core/src/main/scala/org/apache/spark/scheduler/cluster/ClusterTaskSetManager.scala core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala	2013-10-02 23:01:34 -07:00
Reynold Xin	1c48ba0d9f	Merge remote-tracking branch 'origin' into kill Conflicts: core/src/main/scala/org/apache/spark/scheduler/TaskScheduler.scala core/src/main/scala/org/apache/spark/scheduler/cluster/ClusterTaskSetManager.scala core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala	2013-10-02 16:40:44 -07:00
Kay Ousterhout	0dcad2edcb	Added additional unit test for repeated task failures	2013-09-30 23:26:15 -07:00
Kay Ousterhout	dea4677c88	Fixed compilation errors and broken test.	2013-09-30 22:07:01 -07:00
Kay Ousterhout	8deda427bc	Merge remote-tracking branch 'upstream/master' into results_through-bm Conflicts: core/src/main/scala/org/apache/spark/scheduler/cluster/ClusterScheduler.scala core/src/main/scala/org/apache/spark/scheduler/cluster/ClusterTaskSetManager.scala core/src/main/scala/org/apache/spark/scheduler/local/LocalTaskSetManager.scala	2013-09-30 10:16:58 -07:00
Kay Ousterhout	58b764b7c6	Addressed Matei's code review comments	2013-09-30 10:11:59 -07:00
Aaron Davidson	42d72308fb	Add license notices	2013-09-26 15:45:20 -07:00
Reynold Xin	70a0b993d4	Merge pull request #14 from kayousterhout/untangle_scheduler Improved organization of scheduling packages. This commit does not change any code -- only file organization. Please let me know if there was some masterminded strategy behind the existing organization that I failed to understand! There are two components of this change: (1) Moving files out of the cluster package, and down a level to the scheduling package. These files are all used by the local scheduler in addition to the cluster scheduler(s), so should not be in the cluster package. As a result of this change, none of the files in the local package reference files in the cluster package. (2) Moving the mesos package to within the cluster package. The mesos scheduling code is for a cluster, and represents a specific case of cluster scheduling (the Mesos-related classes often subclass cluster scheduling classes). Thus, the most logical place for it seems to be within the cluster package. The one thing about the scheduling code that seems a little funny to me is the naming of the SchedulerBackends. The StandaloneSchedulerBackend is not just for Standalone mode, but instead is used by Mesos coarse grained mode and Yarn, and the backend that is just for Standalone mode is instead called SparkDeploySchedulerBackend. I didn't change this because I wasn't sure if there was a reason for this naming that I'm just not aware of.	2013-09-26 14:11:54 -07:00
Patrick Wendell	6566a19b38	Merge pull request #9 from rxin/limit Smarter take/limit implementation.	2013-09-26 08:01:04 -07:00
Kay Ousterhout	d85fe41b2b	Improved organization of scheduling packages. This commit does not change any code -- only file organization. There are two components of this change: (1) Moving files out of the cluster package, and down a level to the scheduling package. These files are all used by the local scheduler in addition to the cluster scheduler(s), so should not be in the cluster package. As a result of this change, none of the files in the local package reference files in the cluster package. (2) Moving the mesos package to within the cluster package. The mesos scheduling code is for a cluster, and represents a specific case of cluster scheduling (the Mesos-related classes often subclass cluster scheduling classes). Thus, the most logical place for it is within the cluster package.	2013-09-25 12:45:46 -07:00
Reynold Xin	ff540a015b	Merge branch 'master' of github.com:markhamstra/incubator-spark	2013-09-23 11:55:02 -07:00
Kay Ousterhout	c75eb14fe5	Send Task results through the block manager when larger than Akka frame size. This change requires adding an extra failure mode: tasks can complete successfully, but the result gets lost or flushed from the block manager before it's been fetched.	2013-09-22 21:20:48 -07:00
Reynold Xin	a2ea069a5f	Merge pull request #937 from jerryshao/localProperties-fix Fix PR926 local properties issues in Spark Streaming like scenarios	2013-09-21 23:04:42 -07:00
jerryshao	aa0c29f747	Add barrier for local properties unit test and fix some styles	2013-09-22 09:53:11 +08:00
Reynold Xin	42571d30d0	Smarter take/limit implementation.	2013-09-20 17:09:53 -07:00
Ankur Dave	026dba6aba	After unit tests, clear port properties unconditionally In MapOutputTrackerSuite, the "remote fetch" test sets spark.driver.port and spark.hostPort, assuming that they will be cleared by LocalSparkContext. However, the test never sets sc, so it remains null, causing LocalSparkContext to skip clearing these properties. Subsequent tests therefore fail with java.net.BindException: "Address already in use". This commit makes LocalSparkContext clear the properties even if sc is null.	2013-09-19 22:05:23 -07:00
jerryshao	ffa5f8e11d	Fix issue when local properties pass from parent to child thread	2013-09-18 17:33:24 +08:00
Reynold Xin	37d8f37a8e	Added a submitJob interface that returns a Future of the result.	2013-09-17 21:13:59 -07:00
Reynold Xin	cbc48be13b	Initial commit for job killing.	2013-09-16 18:54:06 -07:00
Patrick Wendell	bddf135670	Change port from 3030 to 4040	2013-09-11 10:01:38 -07:00
Matei Zaharia	a85758c200	Merge pull request #907 from stephenh/document_coalesce_shuffle Add better docs for coalesce.	2013-09-09 13:45:40 -07:00
Stephen Haberman	59003d387d	Use a set since shuffle could change order.	2013-09-09 11:45:03 -05:00
Matei Zaharia	7d3204b056	Merge pull request #905 from mateiz/docs2 Job scheduling and cluster mode docs	2013-09-08 21:39:12 -07:00
Patrick Wendell	f68848d95d	Merge pull request #906 from pwendell/ganglia-sink Clean-up of Metrics Code/Docs and Add Ganglia Sink	2013-09-08 18:32:16 -07:00
Matei Zaharia	170b3869ee	Fix unit test failure due to changed default	2013-09-08 17:51:27 -07:00
Patrick Wendell	c190b48bf5	Adding more docs and some code cleanup	2013-09-08 13:46:28 -07:00
Stephen Haberman	df5fd35273	Add better docs for coalesce. Include the useful tip that if shuffle=true, coalesce can actually increase the number of partitions. This makes coalesce more like a generic `RDD.repartition` operation. (Ideally this `RDD.repartition` could automatically choose either a coalesce or a shuffle if numPartitions was either less than or greater than, respectively, the current number of partitions.)	2013-09-08 15:39:04 -05:00
Matei Zaharia	651a96adf7	More fair scheduler docs and property names. Also changed uses of "job" terminology to "application" when they referred to an entire Spark program, to avoid confusion.	2013-09-08 00:29:11 -07:00
Matei Zaharia	98fb69822c	Work in progress: - Add job scheduling docs - Rename some fair scheduler properties - Organize intro page better - Link to Apache wiki for "contributing to Spark"	2013-09-08 00:29:11 -07:00
Reynold Xin	1e15feb5a3	Hot fix to resolve the compilation error caused by SPARK-821.	2013-09-06 22:44:05 +08:00
Aaron Davidson	3a04e76c89	Reynold's second round of comments	2013-09-05 21:43:26 -07:00
Aaron Davidson	4f2236a1c5	Add unit test and address comments	2013-09-05 18:06:30 -07:00
Aaron Davidson	1418d18af4	SPARK-821: Don't cache results when action run locally on driver Caching the results of local actions (e.g., rdd.first()) causes the driver to store entire partitions in its own memory, which may be highly constrained. This patch simply makes the CacheManager avoid caching the result of all locally-run computations.	2013-09-05 15:34:42 -07:00
Aaron Davidson	714e7f9e32	Fix line over 100 chars	2013-09-04 22:40:08 -07:00
Aaron Davidson	37db141aef	Address Patrick's comments	2013-09-04 21:34:20 -07:00
Aaron Davidson	9e6f2b6822	SPARK-884: Add unit test to validate Spark JSON output This unit test simply validates that the outputs of the JsonProtocol methods are syntactically valid JSON.	2013-09-04 15:26:46 -07:00
Mark Hamstra	c9bc8af3d1	Removed repetitive import; fixes hidden definition compiler warning.	2013-09-03 15:25:20 -07:00
Matei Zaharia	12b2f1f9c9	Add missing license headers found with RAT	2013-09-02 12:23:03 -07:00
Matei Zaharia	246bf67f58	Fix test	2013-09-02 10:57:34 -07:00
Matei Zaharia	0a8cc30921	Move some classes to more appropriate packages: * RDD, RDDFunctions -> org.apache.spark.rdd * Utils, ClosureCleaner, SizeEstimator -> org.apache.spark.util * JavaSerializer, KryoSerializer -> org.apache.spark.serializer	2013-09-01 14:13:16 -07:00
Matei Zaharia	46eecd110a	Initial work to rename package to org.apache.spark	2013-09-01 14:13:13 -07:00
Matei Zaharia	53cd50c069	Change build and run instructions to use assemblies This commit makes Spark invocation saner by using an assembly JAR to find all of Spark's dependencies instead of adding all the JARs in lib_managed. It also packages the examples into an assembly and uses that as SPARK_EXAMPLES_JAR. Finally, it replaces the old "run" script with two better-named scripts: "run-examples" for examples, and "spark-class" for Spark internal classes (e.g. REPL, master, etc). This is also designed to minimize the confusion people have in trying to use "run" to run their own classes; it's not meant to do that, but now at least if they look at it, they can modify run-examples to do a decent job for them. As part of this, Bagel's examples are also now properly moved to the examples package instead of bagel.	2013-08-29 21:19:04 -07:00
Ali Ghodsi	c0942a710f	Bug in test fixed	2013-08-20 16:16:05 -07:00
Ali Ghodsi	5db41919b5	Added a test to make sure no locality preferences are ignored	2013-08-20 16:16:05 -07:00
Ali Ghodsi	7b123b3126	Simpler code	2013-08-20 16:16:05 -07:00
Ali Ghodsi	a75a64eade	Fixed almost all of Matei's feedback	2013-08-20 16:16:05 -07:00
Ali Ghodsi	f1c853d76d	fixed Matei's comments	2013-08-20 16:16:04 -07:00
Ali Ghodsi	d6b6c680be	comment in the test to make it more understandable	2013-08-20 16:16:04 -07:00
Ali Ghodsi	b69e7166ba	Coalescer now uses current preferred locations for derived RDDs. Made run() in DAGScheduler thread safe and added a method to be able to ask it for preferred locations. Added a similar method that wraps the former inside SparkContext.	2013-08-20 16:16:04 -07:00
Ali Ghodsi	3b5bb8a4ae	added one test that will test a future functionality	2013-08-20 16:13:37 -07:00
Ali Ghodsi	33a0f59354	Added error messages to the tests to make failed tests less cryptic	2013-08-20 16:13:37 -07:00
Ali Ghodsi	f24861b60a	Fix bug in tests	2013-08-20 16:13:36 -07:00
Ali Ghodsi	937f72feb8	word wrap before 100 chars per line	2013-08-20 16:13:36 -07:00
Ali Ghodsi	7a2a33e32d	Large scale load and locality tests for the coalesced partitions added	2013-08-20 16:13:36 -07:00
Ali Ghodsi	1ede102ba5	load balancing coalescer	2013-08-20 16:13:36 -07:00
Matei Zaharia	8cae72e94e	Merge pull request #828 from mateiz/sched-improvements Scheduler fixes and improvements	2013-08-19 23:40:04 -07:00
Matei Zaharia	efeb142981	Merge pull request #849 from mateiz/web-fixes Small fixes to web UI	2013-08-19 19:23:50 -07:00
Matei Zaharia	793a722f8e	Allow some wiggle room in UISuite port test and in EC2 ports	2013-08-19 18:51:00 -07:00
Matei Zaharia	498a26189b	Small fixes to web UI: - Use SPARK_PUBLIC_DNS environment variable if set (for EC2) - Use a non-ephemeral port (3030 instead of 33000) by default - Updated test to use non-ephemeral port too	2013-08-19 18:17:49 -07:00
Reynold Xin	5054abd41b	Code review feedback. (added tests for cogroup and subtract; added more documentation on MutablePair)	2013-08-19 12:58:02 -07:00
Reynold Xin	acc4aa1f47	Added a test for sorting using MutablePair's.	2013-08-19 11:02:10 -07:00
Reynold Xin	71d705a66e	Made PairRDDFunctions take only Tuple2, but made the rest of the shuffle code path work with general Product2.	2013-08-19 00:40:43 -07:00
Matei Zaharia	8ac3d1e263	Added unit tests for ClusterTaskSetManager, and fix a bug found with resetting locality level after a non-local launch	2013-08-18 19:51:07 -07:00
Matei Zaharia	2a4ed10210	Address some review comments: - When a resourceOffers() call has multiple offers, force the TaskSets to consider them in increasing order of locality levels so that they get a chance to launch stuff locally across all offers - Simplify ClusterScheduler.prioritizeContainers - Add docs on the new configuration options	2013-08-18 19:51:07 -07:00
Matei Zaharia	222c897128	Comment cleanup (via Kay) and some debug messages	2013-08-18 19:51:07 -07:00
Matei Zaharia	90a04dab8d	Initial work towards scheduler refactoring: - Replace use of hostPort vs host in Task.preferredLocations with a TaskLocation class that contains either an executorId and a host or just a host. This is part of a bigger effort to eliminate hostPort based data structures and just use executorID, since the hostPort vs host stuff is confusing (and not checkable with static typing, leading to ugly debug code), and hostPorts are not provided by Mesos. - Replaced most hostPort-based data structures and fields as above. - Simplified ClusterTaskSetManager to deal with preferred locations in a more concise way and generally be more concise. - Updated the way ClusterTaskSetManager handles racks: instead of enqueueing a task to a separate queue for all the hosts in the rack, which would create lots of large queues, have one queue per rack name. - Removed non-local fallback stuff in ClusterScheduler that tried to launch less-local tasks on a node once the local ones were all assigned. This change didn't work because many cluster schedulers send offers for just one node at a time (even the standalone and YARN ones do so as nodes join the cluster one by one). Thus, lots of non-local tasks would be assigned even though a node with locality for them would be able to receive tasks just a short time later. - Renamed MapOutputTracker "generations" to "epochs".	2013-08-18 19:51:06 -07:00
Reynold Xin	2c00ea3efc	Moved shuffle serializer setting from a constructor parameter to a setSerializer method in various RDDs that involve shuffle operations.	2013-08-17 21:43:29 -07:00
Matei Zaharia	e89ffc7b3c	Merge pull request #839 from jegonzal/zip_partitions Currying RDD.zipPartitions	2013-08-16 14:02:34 -07:00
Joseph E. Gonzalez	53b2639a1e	Reversing the argument order in zipPartitions to enable stronger type inference.	2013-08-16 12:38:59 -07:00
Patrick Wendell	659553b21d	Merge pull request #836 from pwendell/rename Rename `memoryBytesToString` and `memoryMegabytesToString`	2013-08-15 16:56:31 -07:00
Patrick Wendell	4c6ade1ad5	Rename `memoryBytesToString` and `memoryMegabytesToString` These are used all over the place now and they are not specific to memory at all. memoryBytesToString --> bytesToString memoryMegabytesToString --> megabytesToString	2013-08-15 15:58:07 -07:00
Reynold Xin	3886b54933	A few small scheduler / job description changes. 1. Renamed SparkContext.addLocalProperty to setLocalProperty. And allow this function to unset a property. 2. Renamed SparkContext.setDescription to setCurrentJobDescription. 3. Throw an exception if the fair scheduler allocation file is invalid.	2013-08-14 17:19:42 -07:00
Patrick Wendell	ed6a1646e6	Slight change to pr-784	2013-08-13 09:29:40 -07:00
Patrick Wendell	a0133bfbad	Merge pull request #784 from jerryshao/dev-metrics-servlet Add MetricsServlet for Spark metrics system	2013-08-13 09:28:18 -07:00
jerryshao	09c7179e81	MetricsServlet code refactor according to comments	2013-08-12 13:23:23 +08:00
jerryshao	320e87e7ab	Add MetricsServlet for Spark metrics system	2013-08-12 13:23:23 +08:00
Josh Rosen	d7f78b443b	Change scala.Option to Guava Optional in Java APIs.	2013-08-11 12:05:09 -07:00
Matei Zaharia	d1e1c1b24d	Add test for Kryo with WrappedArray (which was failing in Chill 0.3.0)	2013-08-08 13:34:11 -07:00
Matei Zaharia	6b043a6f11	Merge pull request #724 from dlyubimov/SPARK-826 SPARK-826: fold(), reduce(), collect() always attempt to use java serialization	2013-08-06 22:31:02 -07:00
Patrick Wendell	5b3784a79c	Show user-defined job name in UI	2013-08-02 15:47:41 -07:00
Patrick Wendell	5e7b38fbb3	Merge pull request #695 from xiajunluan/pool_ui Enhance job ui in spark ui system with adding pool information	2013-08-01 14:59:33 -07:00
Dmitriy Lyubimov	d29ee3689b	Merge fixes merge commit hasn't picked	2013-08-01 00:21:26 -07:00
Dmitriy Lyubimov	cb6be5bd7e	Merge remote-tracking branch 'mesos/master' into SPARK-826 Conflicts: core/src/main/scala/spark/scheduler/cluster/ClusterTaskSetManager.scala core/src/main/scala/spark/scheduler/local/LocalTaskSetManager.scala core/src/test/scala/spark/KryoSerializerSuite.scala	2013-07-31 22:09:22 -07:00
Dmitriy Lyubimov	28f1550f01	More elegant rewrite of the same.	2013-07-31 21:41:00 -07:00
Dmitriy Lyubimov	7c52ecc6a4	(1) added reduce test case. (2) added nested streaming in ParallelCollectionRDD (3) added kryo with fold test which still doesn't work	2013-07-31 19:27:30 -07:00
Andrew xia	5670c96f29	Merge branch 'master' into Pool_UI Conflicts: core/src/main/scala/spark/SparkContext.scala core/src/main/scala/spark/scheduler/DAGScheduler.scala core/src/main/scala/spark/scheduler/SparkListener.scala core/src/main/scala/spark/scheduler/cluster/ClusterTaskSetManager.scala core/src/main/scala/spark/scheduler/cluster/TaskSetManager.scala core/src/main/scala/spark/scheduler/local/LocalTaskSetManager.scala core/src/main/scala/spark/ui/jobs/IndexPage.scala core/src/main/scala/spark/ui/jobs/JobProgressUI.scala	2013-07-31 19:36:36 +08:00
Reynold Xin	98024eadc3	Renamed compressionOutputStream and compressionInputStream to compressedOutputStream and compressedInputStream.	2013-07-30 18:28:46 -07:00
Reynold Xin	56774b176e	Added unit test for compression codecs.	2013-07-30 17:12:33 -07:00
Reynold Xin	ad7e9d0d64	CompressionCodec cleanup. Moved it to spark.io package.	2013-07-30 17:11:54 -07:00
Dmitriy Lyubimov	13a9d66645	adding ===	2013-07-30 16:10:55 -07:00
Dmitriy Lyubimov	1bca91633e	+ bug fixes; test added Conflicts: core/src/test/scala/spark/KryoSerializerSuite.scala	2013-07-30 11:04:11 -07:00
Dmitriy Lyubimov	23f3e0f117	mixing in SharedSparkContext for the kryo-collect test	2013-07-26 19:15:11 -07:00

601 commits