spark-instrumented-optimizer

History

Yuexin Zhang e70df2cea4 [SPARK-29683][YARN] False report isAllNodeBlacklisted when RM is having issue ### What changes were proposed in this pull request? Improve the check logic on if all node managers are really being backlisted. ### Why are the changes needed? I observed when the AM is out of sync with ResourceManager, or RM is having issue report back with current number of available NMs, something like below happens: ... 20/05/13 09:01:21 INFO RetryInvocationHandler: java.io.EOFException: End of File Exception between local host is: "client.zyx.com/x.x.x.124"; destination host is: "rm.zyx.com":8030; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException, while invoking ApplicationMasterProtocolPBClientImpl.allocate over rm543. Trying to failover immediately. ... 20/05/13 09:01:28 WARN AMRMClientImpl: ApplicationMaster is out of sync with ResourceManager, hence resyncing. ... then the spark job would suddenly run into AllNodeBlacklisted state: ... 20/05/13 09:01:31 INFO ApplicationMaster: Final app status: FAILED, exitCode: 11, (reason: Due to executor failures all available nodes are blacklisted) ... but actually there's no black listed nodes in currentBlacklistedYarnNodes, and I do not see any blacklisting message from: https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocatorBlacklistTracker.scala#L119 We should only return isAllNodeBlacklisted =true when we see there are >0 numClusterNodes AND 'currentBlacklistedYarnNodes.size >= numClusterNodes'. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? A minor change. No changes on tests. Closes #28606 from cnZach/false_AllNodeBlacklisted_when_RM_is_having_issue. Authored-by: Yuexin Zhang <zach.yx.zhang@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-06-01 09:46:18 -05:00
..
src	[SPARK-29683][YARN] False report isAllNodeBlacklisted when RM is having issue	2020-06-01 09:46:18 -05:00
pom.xml	[SPARK-30950][BUILD] Setting version to 3.1.0-SNAPSHOT	2020-02-25 19:44:31 -08:00

Yuexin Zhang e70df2cea4 [SPARK-29683][YARN] False report isAllNodeBlacklisted when RM is having issue

### What changes were proposed in this pull request?

Improve the check logic on if all node managers are really being backlisted.

### Why are the changes needed?

I observed when the AM is out of sync with ResourceManager, or RM is having issue report back with current number of available NMs, something like below happens:
...
20/05/13 09:01:21 INFO RetryInvocationHandler: java.io.EOFException: End of File Exception between local host is: "client.zyx.com/x.x.x.124"; destination host is: "rm.zyx.com":8030; : java.io.EOFException; For more details see:  http://wiki.apache.org/hadoop/EOFException, while invoking ApplicationMasterProtocolPBClientImpl.allocate over rm543. Trying to failover immediately.
...
20/05/13 09:01:28 WARN AMRMClientImpl: ApplicationMaster is out of sync with ResourceManager, hence resyncing.
...

then the spark job would suddenly run into AllNodeBlacklisted state:
...
20/05/13 09:01:31 INFO ApplicationMaster: Final app status: FAILED, exitCode: 11, (reason: Due to executor failures all available nodes are blacklisted)
...

but actually there's no black listed nodes in currentBlacklistedYarnNodes, and I do not see any blacklisting message from:
https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocatorBlacklistTracker.scala#L119

We should only return isAllNodeBlacklisted =true when we see there are >0  numClusterNodes AND 'currentBlacklistedYarnNodes.size >= numClusterNodes'.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

A minor change. No changes on tests.

Closes #28606 from cnZach/false_AllNodeBlacklisted_when_RM_is_having_issue.

Authored-by: Yuexin Zhang <zach.yx.zhang@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>

2020-06-01 09:46:18 -05:00

src

[SPARK-29683][YARN] False report isAllNodeBlacklisted when RM is having issue

2020-06-01 09:46:18 -05:00

pom.xml

[SPARK-30950][BUILD] Setting version to 3.1.0-SNAPSHOT

2020-02-25 19:44:31 -08:00