feat: Meta fencing #7022
Conversation
I will review it after #6937. Currently most of the code is from that PR, which makes it a little difficult to review.
src/meta/src/rpc/server.rs
Outdated
#[tokio::test]
async fn test_fencing_5() {
    for params in vec![(true, true), (true, false), (false, true)] {
Let me know if the sequential approach is too slow. As an alternative, I can also introduce three functions, one per combination, here.
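For comparison, a rough sketch of the alternative mentioned here, with one test function per combination so the test runner can execute them in parallel; run_fencing_scenario is a placeholder for the shared test body, not an actual helper in this PR:

```rust
// Hypothetical alternative: one #[tokio::test] per parameter combination,
// so the tests can run in parallel instead of sequentially in a single loop.
#[tokio::test]
async fn test_fencing_true_true() {
    run_fencing_scenario(true, true).await;
}

#[tokio::test]
async fn test_fencing_true_false() {
    run_fencing_scenario(true, false).await;
}

#[tokio::test]
async fn test_fencing_false_true() {
    run_fencing_scenario(false, true).await;
}
```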
src/meta/src/rpc/server.rs
Outdated
// TODO: Enabling this causes the recovery test to fail
// See https://buildkite.com/risingwavelabs/pull-request/builds/14686#01855869-b9d7-432b-9ea8-b835830f1a8e
// tracing::info!("Waiting, to give former leaders fencing mechanism time to trigger");
// sleep(Duration::from_millis(lease_interval_secs * 1000 + 500));
I would have liked to introduce a pause here to give the fencing mechanism time to kill the old leader before the new leader takes over. Unfortunately, this seems to break the recovery test. Should we still try to introduce a wait time here, @yezizp2012?
The recovery test failed due to a timeout, because sleeping here increases the overall test time.
I thought about it for a while. The ticker for manage_term (renew_lease) is set to Duration::from_millis(lease_time_sec * 500 + rng.gen_range(1..500)), and lease_time_sec is also used as the lease expiry time. If a follower gets elected, it means the old leader didn't renew the lease within the past lease_time_sec, which it would have done if it were alive. This works when there is no network isolation, but not when the old leader is directly network-isolated from etcd while using the wrapped transactional interface: there is retry logic inside, which will significantly increase the time of the renew request. During this time there will be a split-brain situation with two leaders.
Anyway, when integrating the official API, we should not have retry logic and need a reasonable timeout to avoid this. Cc @shanicky. For this PR, let's remove it for now.
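For illustration, a minimal sketch of the renewal timing described above, assuming the interval formula quoted from the code; the helper name renew_interval is hypothetical:

```rust
use rand::Rng;
use std::time::Duration;

// Hypothetical helper: the renewal ticker fires roughly every half of the
// lease time plus a small random jitter, so an alive leader normally renews
// its lease well before the lease expires after lease_time_sec seconds.
fn renew_interval(lease_time_sec: u64) -> Duration {
    Duration::from_millis(lease_time_sec * 500 + rand::thread_rng().gen_range(1..500))
}
```

If the renew request itself is stuck in retries (for example, when the old leader is isolated from etcd), this interval no longer bounds the time until the next successful renewal, which is exactly the split-brain window described above.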
Small correction to the above statement: we do not elect a new leader directly after the lease expires; we give the old leader a little extra time. The current implementation looks like this:
// delete lease and run new election if lease is expired for some time
let some_time = lease_time_sec / 2;
let (_, lease_info) = leader_lease_result.unwrap();
if lease_info.get_lease_expire_time() + some_time < since_epoch().as_secs() {
    tracing::warn!("Detected that leader is down");
src/meta/src/rpc/server.rs
Outdated
// Current implementation panics if the leader loses its lease.
// Alternative implementation that panics only if the leader is unable to re-acquire the lease:
// if was_leader && !is_leader {
if was_leader {
Why handle exit in the print-status loop? That turns this part of the code from pure display logic into mixed display and control logic. I suggest exiting directly inside the election.
Alright, I will go for that. As a note: with that we now have slightly different behaviour, since we can fail over from leader to follower if the loss is caused by election re-runs and not during the term.
Failing over from leader to follower is too dangerous, because the time needed to gracefully shut down the leader's services and sub-tasks is uncertain and likely to exceed the lease expiry time.
I will introduce another green thread that calls process::exit if it notices that the node lost leadership. Kind of like before, but separate from the node-status green thread.
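A minimal sketch of such a dedicated fencing task, assuming leadership changes are observed through a tokio watch channel; the channel, the function name, and the exit code are assumptions, not the actual implementation in this PR:

```rust
use tokio::sync::watch;

// Hypothetical fencing task: watches a boolean leadership flag and terminates
// the process as soon as leadership is lost, independently of the status loop.
fn spawn_fencing_task(mut leader_rx: watch::Receiver<bool>) {
    tokio::spawn(async move {
        let mut was_leader = *leader_rx.borrow();
        while leader_rx.changed().await.is_ok() {
            let is_leader = *leader_rx.borrow();
            if was_leader && !is_leader {
                tracing::error!("Lost leadership, exiting to fence this node");
                std::process::exit(1);
            }
            was_leader = is_leader;
        }
    });
}
```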
I pushed it in 671a5dd. The only downside now is that we have two places in the code that handle fencing. If we only want one place to handle fencing, we would need to use the previous approach of sending a dummy update or use your suggestion. Let me know which way you prefer.
If the leadership is taken over, we should exit the leader logic as soon as possible; it's better to exit immediately once leadership is lost.
Rest LGTM
Don't worry, the failure of the scaling test is a known issue that is being fixed. FYI.
I hereby agree to the terms of the Singularity Data, Inc. Contributor License Agreement.
What's changed and what's your intention?
#6786
Please only merge this after #6937 is merged
Checklist
./risedev check (or alias, ./risedev c)
Documentation
If your pull request contains user-facing changes, please specify the types of the changes, and create a release note. Otherwise, please feel free to remove this section.
Types of user-facing changes
Please keep the types that apply to your changes, and remove those that do not apply.
Release note
Please create a release note for your changes. In the release note, focus on the impact on users, and mention the environment or conditions where the impact may occur.
Refer to a related PR or issue link (optional)