DAOS-17111 cart: Fix csm_alive_count #15945
base: master
Ticket title is '[SWIM] Zombie Node Messes Up SWIM'
In swim, csm_alive_count may underflow because some cst->cst_state.sms_status changes in csm overlook the count. Moreover, not counting SUSPECT members seems to be a mistake. Consider a membership of three, {x, y, z}. If x enters a state where it can't receive any SWIM messages, and it picks y in the next period, then it will suspect y, causing csm_alive_count to drop from 3 to 2, which prevents x from declaring an "outage". (In the subsequent period, x will suspect z, causing csm_alive_count to drop from 2 to 1 quickly.) Since x keeps pinging SUSPECT members, it seems reasonable to count them in and expect them to send messages to x until they become DEAD.

This patch fixes the underflow, and counts SUSPECT members in addition to ALIVE members in csm_alive_count (renamed to csm_alive_or_suspect_count).

Signed-off-by: Li Wei <[email protected]>
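The three-member scenario above can be sketched in a few lines of C. This is only an illustration under stated assumptions, not the code from the patch: the `membership` struct, the `MEMBER_*` enum, and the helpers `counted`, `membership_init`, `set_status`, and `alive_only_count` are all hypothetical names. The idea is to route every status change through one helper so the counter cannot be overlooked, and to guard the decrement so it cannot underflow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NMEMBERS 3 /* the {x, y, z} membership from the commit message */

/* Hypothetical stand-ins for the SWIM member states named in the patch. */
enum member_status { MEMBER_ALIVE, MEMBER_SUSPECT, MEMBER_DEAD };

struct membership {
	enum member_status status[NMEMBERS];
	/* Plays the role of csm_alive_or_suspect_count (assumption). */
	uint32_t           alive_or_suspect;
};

/* A member is counted while it is ALIVE or SUSPECT. */
static bool counted(enum member_status s)
{
	return s == MEMBER_ALIVE || s == MEMBER_SUSPECT;
}

/* Start with every member ALIVE, so the counter equals NMEMBERS. */
static void membership_init(struct membership *m)
{
	for (int i = 0; i < NMEMBERS; i++)
		m->status[i] = MEMBER_ALIVE;
	m->alive_or_suspect = NMEMBERS;
}

/*
 * The only place where a status changes. The counter moves only when a
 * member crosses the counted/uncounted boundary, and the decrement is
 * guarded so it cannot wrap below zero.
 */
static void set_status(struct membership *m, int idx, enum member_status new_status)
{
	enum member_status old_status = m->status[idx];

	if (counted(old_status) && !counted(new_status)) {
		assert(m->alive_or_suspect > 0);
		m->alive_or_suspect--;
	} else if (!counted(old_status) && counted(new_status)) {
		m->alive_or_suspect++;
	}
	m->status[idx] = new_status;
}

/* The old counting rule, for comparison: ALIVE members only. */
static uint32_t alive_only_count(const struct membership *m)
{
	uint32_t n = 0;

	for (int i = 0; i < NMEMBERS; i++)
		if (m->status[i] == MEMBER_ALIVE)
			n++;
	return n;
}
```

With x at index 0, suspecting y and then z leaves `alive_or_suspect` at 3, while the ALIVE-only count falls from 3 to 2 to 1, which is the drop the commit message describes. Setting an already DEAD member to DEAD again leaves the counter unchanged.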
@@ -1057,7 +1070,7 @@ static int64_t crt_swim_progress_cb(crt_context_t crt_ctx, int64_t timeout_us, v
 	 * The max_delay should be less suspicion timeout to guarantee
 	 * the already suspected members will not be expired.
 	 */
-	if (csm->csm_alive_count > 2) {
+	if (csm->csm_alive_or_suspect_count > 2) {
Have you figured out why it's okay for us not to update/extend the suspicion timeout when alive_or_suspect_count is less than or equal to 2? I tend to think the extension is applicable no matter how many ranks there are.
LGTM, just a single pending question
Before requesting gatekeeper:
Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
Gatekeeper: