spark-instrumented-optimizer/common
yi.wu df43300227 [SPARK-36206][CORE] Support shuffle data corruption diagnosis via shuffle checksum
### What changes were proposed in this pull request?

This PR adds support to diagnose shuffle data corruption. Basically, the diagnosis mechanism works like this:
The shuffler reader would calculate the checksum (c1) for the corrupted shuffle block and send it to the server where the block is stored. At the server, it would read back the checksum (c2) that is stored in the checksum file and recalculate the checksum (c3) for the corresponding shuffle block. Then, if c2 != c3, we suspect the corruption is caused by the disk issue. Otherwise, if c1 != c3, we suspect the corruption is caused by the network issue. Otherwise, the checksum verifies pass. In any case of the error, the cause remains unknown.

After the shuffle reader receives the diagnosis response, it'd take the action bases on the type of cause. Only in case of the network issue, we'd give a retry. Otherwise, we'd throw the fetch failure directly. Also note that, if the corruption happens inside BufferReleasingInputStream, the reducer will throw the fetch failure immediately no matter what the cause is since the data has been partially consumed by downstream RDDs. If corruption happens again after retry, the reducer will throw the fetch failure directly this time without the diagnosis.

Please check out https://github.com/apache/spark/pull/32385 to see the completed proposal of the shuffle checksum project.

### Why are the changes needed?

Shuffle data corruption is a long-standing issue in Spark. For example, in SPARK-18105, people continually reports corruption issue. However, data corruption is difficult to reproduce in most cases and even harder to tell the root cause. We don't know if it's a Spark issue or not. With the diagnosis support for the shuffle corruption, Spark itself can at least distinguish the cause between disk and network, which is very important for users.

### Does this PR introduce _any_ user-facing change?

Yes, users may know the cause of the shuffle corruption after this change.

### How was this patch tested?

Added tests.

Closes #33451 from Ngone51/SPARK-36206.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit a98d919da4)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
2021-08-02 09:59:30 -05:00
..
kvstore [SPARK-35824][CORE][TESTS] Convert LevelDBSuite.IntKeyType from a nested class to a normal class 2021-06-19 11:36:01 -07:00
network-common [SPARK-32923][CORE][SHUFFLE] Handle indeterminate stage retries for push-based shuffle 2021-08-01 23:30:02 -05:00
network-shuffle [SPARK-36206][CORE] Support shuffle data corruption diagnosis via shuffle checksum 2021-08-02 09:59:30 -05:00
network-yarn [SPARK-35259][SHUFFLE] Update ExternalBlockHandler Timer variables to expose correct units 2021-07-24 21:27:56 +08:00
sketch [SPARK-35757][CORE] Add bitwise AND operation and functionality for intersecting bloom filters 2021-06-17 06:29:33 +00:00
tags [SPARK-34578][SQL][TESTS][TEST-MAVEN] Refactor ORC encryption tests and ignore ORC shim loaded by old Hadoop library 2021-03-02 16:52:27 +09:00
unsafe [SPARK-36081][SPARK-36066][SQL] Update the document about the behavior change of trimming characters for cast 2021-07-13 20:29:05 +08:00