Commit graph

601 commits

Author SHA1 Message Date
Matei Zaharia 0e40cfabf8 Fix some review comments 2013-10-08 23:16:16 -07:00
Matei Zaharia b535db7d89 Added a fast and low-memory append-only map implementation for cogroup
and parallel reduce operations
2013-10-08 23:14:38 -07:00
Reynold Xin 213b70a2db Merge pull request #31 from sundeepn/branch-0.8
Resolving package conflicts with hadoop 0.23.9

Hadoop 0.23.9 is having a package conflict with easymock's dependencies.

(cherry picked from commit 023e3fdf00)
Signed-off-by: Reynold Xin <rxin@apache.org>
2013-10-07 10:54:22 -07:00
Patrick Wendell aa9fb84994 Merging build changes in from 0.8 2013-10-05 22:07:00 -07:00
Aaron Davidson 0f070279e7 Address Matei's comments 2013-10-05 15:15:29 -07:00
Aaron Davidson db6f154940 Fix race conditions during recovery
One major change was the use of messages instead of raw functions as the
parameter of Akka scheduled timers. Since messages are serialized, unlike
raw functions, the behavior is easier to think about and doesn't cause
race conditions when exceptions are thrown.

Another change is to avoid using global pointers that might change without
a lock.
2013-10-04 19:54:33 -07:00
Andre Schumacher c84946fe21 Fixing SPARK-602: PythonPartitioner
Currently PythonPartitioner determines partition ID by hashing a
byte-array representation of PySpark's key. This PR lets
PythonPartitioner use the actual partition ID, which is required e.g.
for sorting via PySpark.
2013-10-04 11:56:47 -07:00
Reynold Xin d29e8035a0 Added countAsync and various unit tests for async actions. 2013-10-03 15:13:44 -07:00
Reynold Xin e8e917f209 Merge branch 'master' into kill
Conflicts:
	core/src/main/scala/org/apache/spark/TaskEndReason.scala
	core/src/main/scala/org/apache/spark/executor/Executor.scala
	core/src/main/scala/org/apache/spark/scheduler/cluster/ClusterTaskSetManager.scala
	core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
2013-10-02 23:01:34 -07:00
Reynold Xin 1c48ba0d9f Merge remote-tracking branch 'origin' into kill
Conflicts:
	core/src/main/scala/org/apache/spark/scheduler/TaskScheduler.scala
	core/src/main/scala/org/apache/spark/scheduler/cluster/ClusterTaskSetManager.scala
	core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
2013-10-02 16:40:44 -07:00
Kay Ousterhout 0dcad2edcb Added additional unit test for repeated task failures 2013-09-30 23:26:15 -07:00
Kay Ousterhout dea4677c88 Fixed compilation errors and broken test. 2013-09-30 22:07:01 -07:00
Kay Ousterhout 8deda427bc Merge remote-tracking branch 'upstream/master' into results_through-bm
Conflicts:
	core/src/main/scala/org/apache/spark/scheduler/cluster/ClusterScheduler.scala
	core/src/main/scala/org/apache/spark/scheduler/cluster/ClusterTaskSetManager.scala
	core/src/main/scala/org/apache/spark/scheduler/local/LocalTaskSetManager.scala
2013-09-30 10:16:58 -07:00
Kay Ousterhout 58b764b7c6 Addressed Matei's code review comments 2013-09-30 10:11:59 -07:00
Aaron Davidson 42d72308fb Add license notices 2013-09-26 15:45:20 -07:00
Reynold Xin 70a0b993d4 Merge pull request #14 from kayousterhout/untangle_scheduler
Improved organization of scheduling packages.

This commit does not change any code -- only file organization.
Please let me know if there was some masterminded strategy behind
the existing organization that I failed to understand!

There are two components of this change:
(1) Moving files out of the cluster package, and down
a level to the scheduling package. These files are all used by
the local scheduler in addition to the cluster scheduler(s), so
should not be in the cluster package. As a result of this change,
none of the files in the local package reference files in the
cluster package.

(2) Moving the mesos package to within the cluster package.
The mesos scheduling code is for a cluster, and represents a
specific case of cluster scheduling (the Mesos-related classes
often subclass cluster scheduling classes). Thus, the most logical
place for it seems to be within the cluster package.

The one thing about the scheduling code that seems a little funny to me
is the naming of the SchedulerBackends.  The StandaloneSchedulerBackend
is not just for Standalone mode, but instead is used by Mesos coarse grained
mode and Yarn, and the backend that *is* just for Standalone mode is instead called SparkDeploySchedulerBackend. I didn't change this because I wasn't sure if there
was a reason for this naming that I'm just not aware of.
2013-09-26 14:11:54 -07:00
Patrick Wendell 6566a19b38 Merge pull request #9 from rxin/limit
Smarter take/limit implementation.
2013-09-26 08:01:04 -07:00
Kay Ousterhout d85fe41b2b Improved organization of scheduling packages.
This commit does not change any code -- only file organization.

There are two components of this change:
(1) Moving files out of the cluster package, and down
a level to the scheduling package. These files are all used by
the local scheduler in addition to the cluster scheduler(s), so
should not be in the cluster package. As a result of this change,
none of the files in the local package reference files in the
cluster package.

(2) Moving the mesos package to within the cluster package.
The mesos scheduling code is for a cluster, and represents a
specific case of cluster scheduling (the Mesos-related classes
often subclass cluster scheduling classes). Thus, the most logical
place for it is within the cluster package.
2013-09-25 12:45:46 -07:00
Reynold Xin ff540a015b Merge branch 'master' of github.com:markhamstra/incubator-spark 2013-09-23 11:55:02 -07:00
Kay Ousterhout c75eb14fe5 Send Task results through the block manager when larger than Akka frame size.
This change requires adding an extra failure mode: tasks can complete
successfully, but the result gets lost or flushed from the block manager
before it's been fetched.
2013-09-22 21:20:48 -07:00
Reynold Xin a2ea069a5f Merge pull request #937 from jerryshao/localProperties-fix
Fix PR926 local properties issues in Spark Streaming like scenarios
2013-09-21 23:04:42 -07:00
jerryshao aa0c29f747 Add barrier for local properties unit test and fix some styles 2013-09-22 09:53:11 +08:00
Reynold Xin 42571d30d0 Smarter take/limit implementation. 2013-09-20 17:09:53 -07:00
Ankur Dave 026dba6aba After unit tests, clear port properties unconditionally
In MapOutputTrackerSuite, the "remote fetch" test sets spark.driver.port
and spark.hostPort, assuming that they will be cleared by
LocalSparkContext. However, the test never sets sc, so it remains null,
causing LocalSparkContext to skip clearing these properties. Subsequent
tests therefore fail with java.net.BindException: "Address already in
use".

This commit makes LocalSparkContext clear the properties even if sc is
null.
2013-09-19 22:05:23 -07:00
jerryshao ffa5f8e11d Fix issue when local properties pass from parent to child thread 2013-09-18 17:33:24 +08:00
Reynold Xin 37d8f37a8e Added a submitJob interface that returns a Future of the result. 2013-09-17 21:13:59 -07:00
Reynold Xin cbc48be13b Initial commit for job killing. 2013-09-16 18:54:06 -07:00
Patrick Wendell bddf135670 Change port from 3030 to 4040 2013-09-11 10:01:38 -07:00
Matei Zaharia a85758c200 Merge pull request #907 from stephenh/document_coalesce_shuffle
Add better docs for coalesce.
2013-09-09 13:45:40 -07:00
Stephen Haberman 59003d387d Use a set since shuffle could change order. 2013-09-09 11:45:03 -05:00
Matei Zaharia 7d3204b056 Merge pull request #905 from mateiz/docs2
Job scheduling and cluster mode docs
2013-09-08 21:39:12 -07:00
Patrick Wendell f68848d95d Merge pull request #906 from pwendell/ganglia-sink
Clean-up of Metrics Code/Docs and Add Ganglia Sink
2013-09-08 18:32:16 -07:00
Matei Zaharia 170b3869ee Fix unit test failure due to changed default 2013-09-08 17:51:27 -07:00
Patrick Wendell c190b48bf5 Adding more docs and some code cleanup 2013-09-08 13:46:28 -07:00
Stephen Haberman df5fd35273 Add better docs for coalesce.
Include the useful tip that if shuffle=true, coalesce can actually
increase the number of partitions.

This makes coalesce more like a generic `RDD.repartition` operation.

(Ideally this `RDD.repartition` could automatically choose either a coalesce or
a shuffle if numPartitions was either less than or greater than, respectively,
the current number of partitions.)
2013-09-08 15:39:04 -05:00
Matei Zaharia 651a96adf7 More fair scheduler docs and property names.
Also changed uses of "job" terminology to "application" when they
referred to an entire Spark program, to avoid confusion.
2013-09-08 00:29:11 -07:00
Matei Zaharia 98fb69822c Work in progress:
- Add job scheduling docs
- Rename some fair scheduler properties
- Organize intro page better
- Link to Apache wiki for "contributing to Spark"
2013-09-08 00:29:11 -07:00
Reynold Xin 1e15feb5a3 Hot fix to resolve the compilation error caused by SPARK-821. 2013-09-06 22:44:05 +08:00
Aaron Davidson 3a04e76c89 Reynold's second round of comments 2013-09-05 21:43:26 -07:00
Aaron Davidson 4f2236a1c5 Add unit test and address comments 2013-09-05 18:06:30 -07:00
Aaron Davidson 1418d18af4 SPARK-821: Don't cache results when action run locally on driver
Caching the results of local actions (e.g., rdd.first()) causes the driver to
store entire partitions in its own memory, which may be highly constrained.
This patch simply makes the CacheManager avoid caching the result of all locally-run computations.
2013-09-05 15:34:42 -07:00
Aaron Davidson 714e7f9e32 Fix line over 100 chars 2013-09-04 22:40:08 -07:00
Aaron Davidson 37db141aef Address Patrick's comments 2013-09-04 21:34:20 -07:00
Aaron Davidson 9e6f2b6822 SPARK-884: Add unit test to validate Spark JSON output
This unit test simply validates that the outputs of
the JsonProtocol methods are syntactically valid JSON.
2013-09-04 15:26:46 -07:00
Mark Hamstra c9bc8af3d1 Removed repetative import; fixes hidden definition compiler warning. 2013-09-03 15:25:20 -07:00
Matei Zaharia 12b2f1f9c9 Add missing license headers found with RAT 2013-09-02 12:23:03 -07:00
Matei Zaharia 246bf67f58 Fix test 2013-09-02 10:57:34 -07:00
Matei Zaharia 0a8cc30921 Move some classes to more appropriate packages:
* RDD, *RDDFunctions -> org.apache.spark.rdd
* Utils, ClosureCleaner, SizeEstimator -> org.apache.spark.util
* JavaSerializer, KryoSerializer -> org.apache.spark.serializer
2013-09-01 14:13:16 -07:00
Matei Zaharia 46eecd110a Initial work to rename package to org.apache.spark 2013-09-01 14:13:13 -07:00
Matei Zaharia 53cd50c069 Change build and run instructions to use assemblies
This commit makes Spark invocation saner by using an assembly JAR to
find all of Spark's dependencies instead of adding all the JARs in
lib_managed. It also packages the examples into an assembly and uses
that as SPARK_EXAMPLES_JAR. Finally, it replaces the old "run" script
with two better-named scripts: "run-examples" for examples, and
"spark-class" for Spark internal classes (e.g. REPL, master, etc). This
is also designed to minimize the confusion people have in trying to use
"run" to run their own classes; it's not meant to do that, but now at
least if they look at it, they can modify run-examples to do a decent
job for them.

As part of this, Bagel's examples are also now properly moved to the
examples package instead of bagel.
2013-08-29 21:19:04 -07:00
Ali Ghodsi c0942a710f Bug in test fixed 2013-08-20 16:16:05 -07:00
Ali Ghodsi 5db41919b5 Added a test to make sure no locality preferences are ignored 2013-08-20 16:16:05 -07:00
Ali Ghodsi 7b123b3126 Simpler code 2013-08-20 16:16:05 -07:00
Ali Ghodsi a75a64eade Fixed almost all of Matei's feedback 2013-08-20 16:16:05 -07:00
Ali Ghodsi f1c853d76d fixed Matei's comments 2013-08-20 16:16:04 -07:00
Ali Ghodsi d6b6c680be comment in the test to make it more understandable 2013-08-20 16:16:04 -07:00
Ali Ghodsi b69e7166ba Coalescer now uses current preferred locations for derived RDDs. Made run() in DAGScheduler thread safe and added a method to be able to ask it for preferred locations. Added a similar method that wraps the former inside SparkContext. 2013-08-20 16:16:04 -07:00
Ali Ghodsi 3b5bb8a4ae added one test that will test a future functionality 2013-08-20 16:13:37 -07:00
Ali Ghodsi 33a0f59354 Added error messages to the tests to make failed tests less cryptic 2013-08-20 16:13:37 -07:00
Ali Ghodsi f24861b60a Fix bug in tests 2013-08-20 16:13:36 -07:00
Ali Ghodsi 937f72feb8 word wrap before 100 chars per line 2013-08-20 16:13:36 -07:00
Ali Ghodsi 7a2a33e32d Large scale load and locality tests for the coalesced partitions added 2013-08-20 16:13:36 -07:00
Ali Ghodsi 1ede102ba5 load balancing coalescer 2013-08-20 16:13:36 -07:00
Matei Zaharia 8cae72e94e Merge pull request #828 from mateiz/sched-improvements
Scheduler fixes and improvements
2013-08-19 23:40:04 -07:00
Matei Zaharia efeb142981 Merge pull request #849 from mateiz/web-fixes
Small fixes to web UI
2013-08-19 19:23:50 -07:00
Matei Zaharia 793a722f8e Allow some wiggle room in UISuite port test and in EC2 ports 2013-08-19 18:51:00 -07:00
Matei Zaharia 498a26189b Small fixes to web UI:
- Use SPARK_PUBLIC_DNS environment variable if set (for EC2)
- Use a non-ephemeral port (3030 instead of 33000) by default
- Updated test to use non-ephemeral port too
2013-08-19 18:17:49 -07:00
Reynold Xin 5054abd41b Code review feedback. (added tests for cogroup and substract; added more documentation on MutablePair) 2013-08-19 12:58:02 -07:00
Reynold Xin acc4aa1f47 Added a test for sorting using MutablePair's. 2013-08-19 11:02:10 -07:00
Reynold Xin 71d705a66e Made PairRDDFunctions taking only Tuple2, but made the rest of the shuffle code path working with general Product2. 2013-08-19 00:40:43 -07:00
Matei Zaharia 8ac3d1e263 Added unit tests for ClusterTaskSetManager, and fix a bug found with
resetting locality level after a non-local launch
2013-08-18 19:51:07 -07:00
Matei Zaharia 2a4ed10210 Address some review comments:
- When a resourceOffers() call has multiple offers, force the TaskSets
  to consider them in increasing order of locality levels so that they
  get a chance to launch stuff locally across all offers

- Simplify ClusterScheduler.prioritizeContainers

- Add docs on the new configuration options
2013-08-18 19:51:07 -07:00
Matei Zaharia 222c897128 Comment cleanup (via Kay) and some debug messages 2013-08-18 19:51:07 -07:00
Matei Zaharia 90a04dab8d Initial work towards scheduler refactoring:
- Replace use of hostPort vs host in Task.preferredLocations with a
  TaskLocation class that contains either an executorId and a host or
  just a host. This is part of a bigger effort to eliminate hostPort
  based data structures and just use executorID, since the hostPort vs
  host stuff is confusing (and not checkable with static typing, leading
  to ugly debug code), and hostPorts are not provided by Mesos.

- Replaced most hostPort-based data structures and fields as above.

- Simplified ClusterTaskSetManager to deal with preferred locations in a
  more concise way and generally be more concise.

- Updated the way ClusterTaskSetManager handles racks: instead of
  enqueueing a task to a separate queue for all the hosts in the rack,
  which would create lots of large queues, have one queue per rack name.

- Removed non-local fallback stuff in ClusterScheduler that tried to
  launch less-local tasks on a node once the local ones were all
  assigned. This change didn't work because many cluster schedulers send
  offers for just one node at a time (even the standalone and YARN ones
  do so as nodes join the cluster one by one). Thus, lots of non-local
  tasks would be assigned even though a node with locality for them
  would be able to receive tasks just a short time later.

- Renamed MapOutputTracker "generations" to "epochs".
2013-08-18 19:51:06 -07:00
Reynold Xin 2c00ea3efc Moved shuffle serializer setting from a constructor parameter to a setSerializer method in various RDDs that involve shuffle operations. 2013-08-17 21:43:29 -07:00
Matei Zaharia e89ffc7b3c Merge pull request #839 from jegonzal/zip_partitions
Currying RDD.zipPartitions
2013-08-16 14:02:34 -07:00
Joseph E. Gonzalez 53b2639a1e Reversing the argument order in zipPartitions to enable stronger type inference. 2013-08-16 12:38:59 -07:00
Patrick Wendell 659553b21d Merge pull request #836 from pwendell/rename
Rename `memoryBytesToString` and `memoryMegabytesToString`
2013-08-15 16:56:31 -07:00
Patrick Wendell 4c6ade1ad5 Rename memoryBytesToString and memoryMegabytesToString
These are used all over the place now and they are not specific to memory at all.

memoryBytesToString --> bytesToString
memoryMegabytesToString --> megabytesToString
2013-08-15 15:58:07 -07:00
Reynold Xin 3886b54933 A few small scheduler / job description changes.
1. Renamed SparkContext.addLocalProperty to setLocalProperty. And allow this function to unset a property.

2. Renamed SparkContext.setDescription to setCurrentJobDescription.

3. Throw an exception if the fair scheduler allocation file is invalid.
2013-08-14 17:19:42 -07:00
Patrick Wendell ed6a1646e6 Slight change to pr-784 2013-08-13 09:29:40 -07:00
Patrick Wendell a0133bfbad Merge pull request #784 from jerryshao/dev-metrics-servlet
Add MetricsServlet for Spark metrics system
2013-08-13 09:28:18 -07:00
jerryshao 09c7179e81 MetricsServlet code refactor according to comments 2013-08-12 13:23:23 +08:00
jerryshao 320e87e7ab Add MetricsServlet for Spark metrics system 2013-08-12 13:23:23 +08:00
Josh Rosen d7f78b443b Change scala.Option to Guava Optional in Java APIs. 2013-08-11 12:05:09 -07:00
Matei Zaharia d1e1c1b24d Add test for Kryo with WrappedArray (which was failing in Chill 0.3.0) 2013-08-08 13:34:11 -07:00
Matei Zaharia 6b043a6f11 Merge pull request #724 from dlyubimov/SPARK-826
SPARK-826: fold(), reduce(), collect() always attempt to use java serialization
2013-08-06 22:31:02 -07:00
Patrick Wendell 5b3784a79c Show user-defined job name in UI 2013-08-02 15:47:41 -07:00
Patrick Wendell 5e7b38fbb3 Merge pull request #695 from xiajunluan/pool_ui
Enhance job ui in spark ui system with adding pool information
2013-08-01 14:59:33 -07:00
Dmitriy Lyubimov d29ee3689b Merge fixes merge commit hasn't picked 2013-08-01 00:21:26 -07:00
Dmitriy Lyubimov cb6be5bd7e Merge remote-tracking branch 'mesos/master' into SPARK-826
Conflicts:
	core/src/main/scala/spark/scheduler/cluster/ClusterTaskSetManager.scala
	core/src/main/scala/spark/scheduler/local/LocalTaskSetManager.scala
	core/src/test/scala/spark/KryoSerializerSuite.scala
2013-07-31 22:09:22 -07:00
Dmitriy Lyubimov 28f1550f01 More elegant rewrite of the same. 2013-07-31 21:41:00 -07:00
Dmitriy Lyubimov 7c52ecc6a4 (1) added reduce test case.
(2) added nested streaming in ParallelCollectionRDD
(3) added kryo with fold test which still doesn't work
2013-07-31 19:27:30 -07:00
Andrew xia 5670c96f29 Merge branch 'master' into Pool_UI
Conflicts:
	core/src/main/scala/spark/SparkContext.scala
	core/src/main/scala/spark/scheduler/DAGScheduler.scala
	core/src/main/scala/spark/scheduler/SparkListener.scala
	core/src/main/scala/spark/scheduler/cluster/ClusterTaskSetManager.scala
	core/src/main/scala/spark/scheduler/cluster/TaskSetManager.scala
	core/src/main/scala/spark/scheduler/local/LocalTaskSetManager.scala
	core/src/main/scala/spark/ui/jobs/IndexPage.scala
	core/src/main/scala/spark/ui/jobs/JobProgressUI.scala
2013-07-31 19:36:36 +08:00
Reynold Xin 98024eadc3 Renamed compressionOutputStream and compressionInputStream to compressedOutputStream and compressedInputStream. 2013-07-30 18:28:46 -07:00
Reynold Xin 56774b176e Added unit test for compression codecs. 2013-07-30 17:12:33 -07:00
Reynold Xin ad7e9d0d64 CompressionCodec cleanup. Moved it to spark.io package. 2013-07-30 17:11:54 -07:00
Dmitriy Lyubimov 13a9d66645 adding === 2013-07-30 16:10:55 -07:00
Dmitriy Lyubimov 1bca91633e + bug fixes;
test added

Conflicts:

	core/src/test/scala/spark/KryoSerializerSuite.scala
2013-07-30 11:04:11 -07:00
Dmitriy Lyubimov 23f3e0f117 mixing in SharedSparkContext for the kryo-collect test 2013-07-26 19:15:11 -07:00