Commit graph

2588 commits

Author SHA1 Message Date
Tathagata Das 984c582487 Merge branch 'scheduler-update' into filestream-fix
Conflicts:
	core/src/main/scala/org/apache/spark/rdd/CheckpointRDD.scala
	streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala
	streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala
	streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala
	streaming/src/test/scala/org/apache/spark/streaming/CheckpointSuite.scala
2013-12-19 11:20:48 -08:00
Tathagata Das ec71b445ad Minor changes. 2013-12-18 23:39:28 -08:00
Tathagata Das e93b391d75 Merge branch 'apache-master' into scheduler-update
Conflicts:
	streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala
	streaming/src/main/scala/org/apache/spark/streaming/dstream/ForEachDStream.scala
2013-12-18 17:51:14 -08:00
Tathagata Das b80ec05635 Added StatsReportListener to generate processing time statistics across multiple batches. 2013-12-18 15:35:24 -08:00
Reynold Xin 9a6864d016 Fixed a performance problem in RDD.top and BoundedPriorityQueue (size in BoundedPriority was actually traversing the entire queue to calculate the size, resulting in bad performance in insertion). 2013-12-17 18:44:39 -08:00
Patrick Wendell c1fec89895 Cleanup 2013-12-16 21:56:21 -08:00
Patrick Wendell c6f95e603e Attempt with extra repositories 2013-12-16 21:53:51 -08:00
Reynold Xin 883e034aeb Merge pull request #245 from gregakespret/task-maxfailures-fix
Fix for spark.task.maxFailures not enforced correctly.

Docs at http://spark.incubator.apache.org/docs/latest/configuration.html say:

```
spark.task.maxFailures

Number of individual task failures before giving up on the job. Should be greater than or equal to 1. Number of allowed retries = this value - 1.
```

Previous implementation worked incorrectly. When for example `spark.task.maxFailures` was set to 1, the job was aborted only after the second task failure, not after the first one.
2013-12-16 14:16:02 -08:00
Mark Hamstra 09ed7ddfa0 Use scala.binary.version in POMs 2013-12-15 12:39:58 -08:00
Josh Rosen 2fd781d347 Merge pull request #249 from ngbinh/partitionInJavaSortByKey
Expose numPartitions parameter in JavaPairRDD.sortByKey()

This change makes Java and Scala API on sortByKey() the same.
2013-12-14 12:59:37 -08:00
Prashant Sharma 1ae3c0fc5e Added a comment about ActorRef and ActorSelection difference. 2013-12-14 10:44:24 +05:30
Prashant Sharma a854cc536d Review comments on the PR for scala 2.10 migration. 2013-12-13 15:19:51 +05:30
Tathagata Das 097e120c0c Refactored streaming scheduler and added listener interface.
- Refactored Scheduler + JobManager to JobGenerator + JobScheduler and
  added JobSet for cleaner code. Moved scheduler related code to
  streaming.scheduler package.
- Added StreamingListener trait (similar to SparkListener) to enable
  gathering to streaming stats like processing times and delays.
  StreamingContext.addListener() to added listeners.
- Deduped some code in streaming tests by modifying TestSuiteBase, and
  added StreamingListenerSuite.
2013-12-12 20:48:02 -08:00
Tathagata Das 5e9ce83d68 Fixed multiple file stream and checkpointing bugs.
- Made file stream more robust to transient failures.
- Changed Spark.setCheckpointDir API to not have the second
  'useExisting' parameter. Spark will always create a unique directory
  for checkpointing underneath the directory provide to the funtion.
- Fixed bug wrt local relative paths as checkpoint directory.
- Made DStream and RDD checkpointing use
  SparkContext.hadoopConfiguration, so that more HDFS compatible
  filesystems are supported for checkpointing.
2013-12-11 14:01:36 -08:00
Prashant Sharma 603af51bb5 Merge branch 'master' into akka-bug-fix
Conflicts:
	core/pom.xml
	core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
	pom.xml
	project/SparkBuild.scala
	streaming/pom.xml
	yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala
2013-12-11 10:21:53 +05:30
Binh Nguyen 0b494f7db4 Hook directly to Scala API 2013-12-10 11:17:52 -08:00
Binh Nguyen e85af50767 Leave default value of numPartitions to Scala code. 2013-12-10 11:04:14 -08:00
Grega Kespret 558af87334 Fix tests. 2013-12-10 11:43:42 +01:00
Binh Nguyen c82d4f079b Use braces to shorten the line. 2013-12-10 01:04:52 -08:00
Binh Nguyen 5013fb64b2 Expose numPartitions parameter in JavaPairRDD.sortByKey()
This change make Java and Scala API on sortByKey() the same.
2013-12-10 00:38:16 -08:00
Prashant Sharma 17db6a9041 Style fixes and addressed review comments at #221 2013-12-10 11:47:16 +05:30
Patrick Wendell 5b74609d97 License headers 2013-12-09 16:41:01 -08:00
Grega Kespret 14a1df6572 Fix for spark.task.maxFailures not enforced correctly. 2013-12-09 10:39:02 +01:00
Prashant Sharma 7ad6921ae0 Incorporated Patrick's feedback comment on #211 and made maven build/dep-resolution atleast a bit faster. 2013-12-07 12:45:57 +05:30
Matei Zaharia e0392343a0 Merge pull request #190 from markhamstra/Stages4Jobs
stageId <--> jobId mapping in DAGScheduler

Okay, I think this one is ready to go -- or at least it's ready for review and discussion.  It's a carry-over of https://github.com/mesos/spark/pull/842 with updates for the newer job cancellation functionality.  The prior discussion still applies.  I've actually changed the job cancellation flow a bit: Instead of ``cancelTasks`` going to the TaskScheduler and then ``taskSetFailed`` coming back to the DAGScheduler (resulting in ``abortStage`` there), the DAGScheduler now takes care of figuring out which stages should be cancelled, tells the TaskScheduler to cancel tasks for those stages, then does the cleanup within the DAGScheduler directly without the need for any further prompting by the TaskScheduler.

I know of three outstanding issues, each of which can and should, I believe, be handled in follow-up pull requests:

1) https://spark-project.atlassian.net/browse/SPARK-960
2) JobLogger should be re-factored to eliminate duplication
3) Related to 2), the WebUI should also become a consumer of the DAGScheduler's new understanding of the relationship between jobs and stages so that it can display progress indication and the like grouped by job.  Right now, some of this information is just being sent out as part of ``SparkListenerJobStart`` messages, but more or different job <--> stage information may need to be exported from the DAGScheduler to meet listeners needs.

Except for the eventQueue -> Actor commit, the rest can be cherry-picked almost cleanly into branch-0.8.  A little merging is needed in MapOutputTracker and the DAGScheduler.  Merged versions of those files are in aba2b40ce0

Note that between the recent Actor change in the DAGScheduler and the cleaning up of DAGScheduler data structures on job completion in this PR, some races have been introduced into the DAGSchedulerSuite.  Those tests usually pass, and I don't think that better-behaved code that doesn't directly inspect DAGScheduler data structures should be seeing any problems, but I'll work on fixing DAGSchedulerSuite as either an addition to this PR or as a separate request.

UPDATE: Fixed the race that I introduced.  Created a JIRA issue (SPARK-965) for the one that was introduced with the switch to eventProcessorActor in the DAGScheduler.
2013-12-06 11:49:59 -08:00
Matei Zaharia bfa68609d9 Merge pull request #233 from hsaputra/changecontexttobackend
Change the name of input argument in ClusterScheduler#initialize from context to backend.

The SchedulerBackend used to be called ClusterSchedulerContext so just want to make small
change of the input param in the ClusterScheduler#initialize to reflect this.
2013-12-06 11:04:03 -08:00
Matei Zaharia 3fb302c08d Merge pull request #205 from kayousterhout/logging
Added logging of scheduler delays to UI

This commit adds two metrics to the UI:

1) The time to get task results, if they're fetched remotely

2) The scheduler delay.  When the scheduler starts getting overwhelmed (because it can't keep up with the rate at which tasks are being submitted), the result is that tasks get delayed on the tail-end: the message from the worker saying that the task has completed ends up in a long queue and takes a while to be processed by the scheduler.  This commit records that delay in the UI so that users can tell when the scheduler is becoming the bottleneck.
2013-12-06 11:03:32 -08:00
Matei Zaharia 87676a6af2 Merge pull request #220 from rxin/zippart
Memoize preferred locations in ZippedPartitionsBaseRDD

so preferred location computation doesn't lead to exponential explosion.

This was a problem in GraphX where we have a whole chain of RDDs that are ZippedPartitionsRDD's, and the preferred locations were taking eternity to compute.

(cherry picked from commit e36fe55a03)
Signed-off-by: Reynold Xin <rxin@apache.org>
2013-12-06 11:01:42 -08:00
Aaron Davidson 94b5881ee9 Fix long lines 2013-12-06 00:22:00 -08:00
Aaron Davidson 5a864e3fce Rename SparkActorSystem to IndestructibleActorSystem 2013-12-06 00:21:43 -08:00
Prashant Sharma c9cd2af71e Merge branch 'wip-scala-2.10' into akka-bug-fix 2013-12-06 13:32:15 +05:30
Mark Hamstra ee888f6b25 FutureAction result tests 2013-12-05 23:01:18 -08:00
Prashant Sharma 4e70480038 A left over akka -> akka.tcp changes 2013-12-06 12:29:53 +05:30
Henry Saputra 1cb259cb57 Change the name of input ragument in ClusterScheduler#initialize from context to backend.
The SchedulerBackend used to be called ClusterSchedulerContext so just want to make small
change of the input param in the ClusterScheduler#initialize to reflect this.
2013-12-05 18:50:26 -08:00
Mark Hamstra aebb123fd3 jobWaiter.synchronized before jobWaiter.wait 2013-12-05 17:16:44 -08:00
Patrick Wendell 5d460253d6 Merge pull request #228 from pwendell/master
Document missing configs and set shuffle consolidation to false.
2013-12-05 12:31:24 -08:00
Patrick Wendell 75d161b357 Forcing shuffle consolidation in DiskBlockManagerSuite 2013-12-05 11:36:41 -08:00
Matei Zaharia 72b696156c Merge pull request #199 from harveyfeng/yarn-2.2
Hadoop 2.2 migration

Includes support for the YARN API stabilized in the Hadoop 2.2 release, and a few style patches.

Short description for each set of commits:

a98f5a0 - "Misc style changes in the 'yarn' package"
a67ebf4 - "A few more style fixes in the 'yarn' package"
Both of these are some minor style changes, such as fixing lines over 100 chars, to the existing YARN code.

ab8652f - "Add a 'new-yarn' directory ... "
Copies everything from `SPARK_HOME/yarn` to `SPARK_HOME/new-yarn`. No actual code changes here.

4f1c3fa - "Hadoop 2.2 YARN API migration ..."
API patches to code in the `SPARK_HOME/new-yarn` directory. There are a few more small style changes mixed in, too.
Based on @colorant's Hadoop 2.2 support for the scala-2.10 branch in #141.

a1a1c62 - "Add optional Hadoop 2.2 settings in sbt build ... "
If Spark should be built against Hadoop 2.2, then:
a) the `org.apache.spark.deploy.yarn` package will be compiled from the `new-yarn` directory.
b) Protobuf v2.5 will be used as a Spark dependency, since Hadoop 2.2 depends on it. Also, Spark will be built against a version of Akka v2.0.5 that's built against Protobuf 2.5, named `akka-2.0.5-protobuf-2.5`. The patched Akka is here: https://github.com/harveyfeng/akka/tree/2.0.5-protobuf-2.5, and was published to local Ivy during testing.

There's also a new boolean environment variable, `SPARK_IS_NEW_HADOOP`, that users can manually set if their `SPARK_HADOOP_VERSION` specification does not start with `2.2`, which is how the build file tries to detect a 2.2 version. Not sure if this is necessary or done in the best way, though...
2013-12-04 23:33:04 -08:00
Patrick Wendell b1c6fa1584 Document missing configs and set shuffle consolidation to false. 2013-12-04 18:39:34 -08:00
Patrick Wendell 182f9baeed Merge pull request #227 from pwendell/master
Fix small bug in web UI and minor clean-up.

There was a bug where sorting order didn't work correctly for write time metrics.

I also cleaned up some earlier code that fixed the same issue for read and
write bytes.
2013-12-04 15:52:07 -08:00
Patrick Wendell 380b90b9b3 Fix small bug in web UI and minor clean-up.
There was a bug where sorting order didn't work correctly for write time metrics.

I also cleaned up some earlier code that fixed the same issue for read and
write bytes.
2013-12-04 14:41:48 -08:00
Andrew Ash 217611680d Add missing space after "Serialized" in StorageLevel
Current code creates outputs like:

scala> res0.getStorageLevel.description
res2: String = Serialized1x Replicated
2013-12-04 11:29:20 -08:00
Matei Zaharia d6e5473872 Merge pull request #223 from rxin/transient
Mark partitioner, name, and generator field in RDD as @transient.

As part of the effort to reduce serialized task size.
2013-12-04 10:28:50 -08:00
Reynold Xin 974a69d79c Marked doCheckpointCalled as transient. 2013-12-03 11:34:38 -08:00
Mark Hamstra 403234dd0d SparkListenerJobStart posted from local jobs 2013-12-03 09:57:32 -08:00
Mark Hamstra f55d0b935d Synchronous, inline cleanup after runLocally 2013-12-03 09:57:32 -08:00
Mark Hamstra c9fcd909d0 Local jobs post SparkListenerJobEnd, and DAGScheduler data structure
cleanup always occurs before any posting of SparkListenerJobEnd.
2013-12-03 09:57:32 -08:00
Mark Hamstra 9ae2d094a9 Tightly couple stageIdToJobIds and jobIdToStageIds 2013-12-03 09:57:32 -08:00
Mark Hamstra 27c45e5236 Cleaned up job cancellation handling 2013-12-03 09:57:32 -08:00
Mark Hamstra 686a420ddc Refactoring to make job removal, stage removal, task cancellation clearer 2013-12-03 09:57:32 -08:00