e3486e1b95
## What changes were proposed in this pull request?

Propose new APIs and modify job/task scheduling to support barrier execution mode, which requires all tasks in the same barrier stage to start at the same time, and retries all tasks when some tasks fail in the middle. The barrier execution mode is useful for some ML/DL workloads.

The proposed API changes include:

- `RDDBarrier`, which marks an RDD as barrier (Spark must launch all the tasks of the current stage together).
- `BarrierTaskContext`, which supports a global sync of all tasks in a barrier stage and provides extra `BarrierTaskInfo`s.

In DAGScheduler, we retry all tasks of a barrier stage when some tasks fail in the middle; this is achieved by unregistering map outputs for a shuffleId (for ShuffleMapStage) or clearing the finished partitions in an active job (for ResultStage).

## How was this patch tested?

- Add `RDDBarrierSuite` to ensure we convert RDDs correctly.
- Add new test cases in `DAGSchedulerSuite` to ensure we do task scheduling correctly.
- Add new test cases in `SparkContextSuite` to ensure the barrier execution mode actually works (both under local mode and local cluster mode).
- Add new test cases in `TaskSchedulerImplSuite` to ensure we schedule the tasks of a barrier taskSet together.

Author: Xingbo Jiang <xingbo.jiang@databricks.com>

Closes #21758 from jiangxb1987/barrier-execution-mode.
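A minimal sketch of how the proposed APIs fit together from a user's perspective, based on the API names described above (`RDD.barrier()`, `BarrierTaskContext.get()`, `barrier()`, `getTaskInfos()`); the app name, master URL, and data are illustrative assumptions, and running it requires a Spark build that includes this change:

```scala
import org.apache.spark.{BarrierTaskContext, SparkConf, SparkContext}

object BarrierModeSketch {
  def main(args: Array[String]): Unit = {
    // Local mode with enough slots to launch every barrier task at once;
    // a barrier stage cannot start until all of its tasks can be scheduled.
    val sc = new SparkContext(
      new SparkConf().setAppName("barrier-demo").setMaster("local[4]"))

    val rdd = sc.parallelize(1 to 8, numSlices = 4)

    // rdd.barrier() marks the stage as a barrier stage (RDDBarrier):
    // all 4 tasks are launched together, and the whole stage is retried
    // if any task fails in the middle.
    val result = rdd.barrier().mapPartitions { iter =>
      val ctx = BarrierTaskContext.get()
      // Global sync point: blocks until every task in the stage reaches here.
      ctx.barrier()
      // Extra BarrierTaskInfo for every task in the stage (e.g. addresses),
      // useful for setting up communication in ML/DL workloads.
      val peers = ctx.getTaskInfos().map(_.address)
      iter.map(_ * 2)
    }.collect()

    println(result.mkString(","))
    sc.stop()
  }
}
```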