feat: Meta fencing #7022
Conversation
I will review it after #6937. Currently most of the code is from that PR, which makes it a little difficult to review.
src/meta/src/rpc/server.rs
Outdated
#[tokio::test]
async fn test_fencing_5() {
    for params in vec![(true, true), (true, false), (false, true)] {
Let me know if the sequential approach is too slow. As an alternative, I can also introduce three functions, one per combination, here.
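For comparison, a rough sketch of the alternative mentioned here, with one test function per combination so the test runner can execute them in parallel; run_fencing_scenario is a placeholder for the shared test body, not an actual helper in this PR:

```rust
// Hypothetical alternative: one #[tokio::test] per parameter combination,
// so the tests can run in parallel instead of sequentially in a single loop.
#[tokio::test]
async fn test_fencing_true_true() {
    run_fencing_scenario(true, true).await;
}

#[tokio::test]
async fn test_fencing_true_false() {
    run_fencing_scenario(true, false).await;
}

#[tokio::test]
async fn test_fencing_false_true() {
    run_fencing_scenario(false, true).await;
}
```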
src/meta/src/rpc/server.rs
Outdated
// TODO: Enabling this causes the recovery test to fail
// See https://buildkite.com/risingwavelabs/pull-request/builds/14686#01855869-b9d7-432b-9ea8-b835830f1a8e
// tracing::info!("Waiting, to give former leaders fencing mechanism time to trigger");
// sleep(Duration::from_millis(lease_interval_secs * 1000 + 500));
I would have liked to introduce a pause here to give the fencing mechanism time to kill the old leader before the new leader takes over. Unfortunately, this seems to break the recovery test. Should we still try to introduce a wait time here, @yezizp2012?
The recovery test failed due to a timeout, because sleeping here increases the overall test time.
I thought about it for a while. The ticker for manage_term (renew_lease) is set to Duration::from_millis(lease_time_sec * 500 + rng.gen_range(1..500)), and lease_time_sec is also used as the lease expiry time. If a follower gets elected, it means the old leader didn't renew the lease within the past lease_time_sec, which it would have done if it were alive. This works when there is no network isolation, but not when the old leader is directly network-isolated from etcd while using the wrapped transactional interface: there is retry logic inside, which will significantly increase the time of the renew request. During this time there will be a split-brain situation with two leaders.
Anyway, when integrating the official API, we should not have retry logic and need a reasonable timeout to avoid this. Cc @shanicky. For this PR, let's remove it for now.
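For illustration, a minimal sketch of the renewal timing described above, assuming the interval formula quoted from the code; the helper name renew_interval is hypothetical:

```rust
use rand::Rng;
use std::time::Duration;

// Hypothetical helper: the renewal ticker fires roughly every half of the
// lease time plus a small random jitter, so an alive leader normally renews
// its lease well before the lease expires after lease_time_sec seconds.
fn renew_interval(lease_time_sec: u64) -> Duration {
    Duration::from_millis(lease_time_sec * 500 + rand::thread_rng().gen_range(1..500))
}
```

If the renew request itself is stuck in retries (for example, when the old leader is isolated from etcd), this interval no longer bounds the time until the next successful renewal, which is exactly the split-brain window described above.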
Small correction to the above statement: we do not elect a new leader directly after the lease expires; we give the old leader a little extra time. The current implementation looks like this:
// delete lease and run new election if lease is expired for some time
let some_time = lease_time_sec / 2;
let (_, lease_info) = leader_lease_result.unwrap();
if lease_info.get_lease_expire_time() + some_time < since_epoch().as_secs() {
    tracing::warn!("Detected that leader is down");
src/meta/src/rpc/server.rs
Outdated
// Current implementation panics if the leader loses its lease.
// Alternative implementation that panics only if the leader is unable to re-acquire the lease:
// if was_leader && !is_leader {
if was_leader {
Why handle exit in the print-status loop? That turns this part of the code from pure display logic into mixed display and control logic. I suggest exiting directly inside the election.
Alright, I will go for that. As a note: with that we now have slightly different behaviour, since we can fail over from leader to follower if the loss is caused by election re-runs and not during the term.
Failing over from leader to follower is too dangerous, because the time needed to gracefully shut down the leader's services and sub-tasks is uncertain and likely to exceed the lease expiry time.
I will introduce another green thread that calls process::exit if it notices that the node lost leadership. Kind of like before, but separate from the node-status green thread.
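A minimal sketch of such a dedicated fencing task, assuming leadership changes are observed through a tokio watch channel; the channel, the function name, and the exit code are assumptions, not the actual implementation in this PR:

```rust
use tokio::sync::watch;

// Hypothetical fencing task: watches a boolean leadership flag and terminates
// the process as soon as leadership is lost, independently of the status loop.
fn spawn_fencing_task(mut leader_rx: watch::Receiver<bool>) {
    tokio::spawn(async move {
        let mut was_leader = *leader_rx.borrow();
        while leader_rx.changed().await.is_ok() {
            let is_leader = *leader_rx.borrow();
            if was_leader && !is_leader {
                tracing::error!("Lost leadership, exiting to fence this node");
                std::process::exit(1);
            }
            was_leader = is_leader;
        }
    });
}
```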
I pushed it in 671a5dd. The only downside now is that we have two places in the code that handle fencing. If we only want one place to handle fencing, we would need to use the previous approach of sending a dummy update or use your suggestion. Let me know which way you prefer.
If the leadership is taken over, we should exit the leader logic as soon as possible; it's better to exit immediately once leadership is lost.
Rest LGTM
Don't worry, the failure of the scaling test is a known issue that is being fixed. FYI.
I hereby agree to the terms of the Singularity Data, Inc. Contributor License Agreement.
What's changed and what's your intention?
#6786
Please only merge this after #6937 is merged
Checklist
./risedev check (or alias, ./risedev c)
Documentation
If your pull request contains user-facing changes, please specify the types of the changes, and create a release note. Otherwise, please feel free to remove this section.
Types of user-facing changes
Please keep the types that apply to your changes, and remove those that do not apply.
Release note
Please create a release note for your changes. In the release note, focus on the impact on users, and mention the environment or conditions where the impact may occur.
Refer to a related PR or issue link (optional)