
feat(meta): Client failover #7248

Closed
wants to merge 173 commits into from

Conversation

@CAJan93 (Contributor) commented Jan 6, 2023

I hereby agree to the terms of the Singularity Data, Inc. Contributor License Agreement.

What's changed and what's your intention?

#6787

PR based on #7049. Only merge after that one is merged.

Checklist

  • Client can connect against leader on original connection
  • Client can handle leader failover
  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • Det Sim tests should also run with 3etcd-3meta-1cn-1fe
  • Det Sim tests should also kill meta node if using 3etcd-3meta-1cn-1fe

Documentation

If your pull request contains user-facing changes, please specify the types of the changes, and create a release note. Otherwise, please feel free to remove this section.

Types of user-facing changes

Please keep the types that apply to your changes, and remove those that do not apply.

  • Installation and deployment
  • Connector (sources & sinks)
  • SQL commands, functions, and operators
  • RisingWave cluster configuration changes
  • Other (please specify in the release note below)

Release note

Please create a release note for your changes. In the release note, focus on the impact on users, and mention the environment or conditions where the impact may occur.

Refer to a related PR or issue link (optional)

let current_leader = nl.unwrap();

let addr = format!(
"http://{}:{}",
@CAJan93 (Contributor, Author):
Are there any plans to switch to https? Do I have to consider that here?

return;
}

// Only print failure messages if the entire failover failed
@CAJan93 (Contributor, Author):

I am doing this to avoid spamming the logs. Printing immediately is very verbose, since the meta nodes keep sending stale leader information for quite some time.
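The buffering idea described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: `failover_with_quiet_retries` and `try_connect` are hypothetical names, and the connection logic is a stand-in. Per-attempt failures are collected and only surfaced if the whole failover fails.

```rust
// Hypothetical sketch: collect per-attempt failure messages and only emit
// them if the entire failover fails, so transient stale-leader errors do
// not spam the logs.
fn failover_with_quiet_retries(candidates: &[&str]) -> Result<String, String> {
    let mut buffered_errors: Vec<String> = Vec::new();
    for addr in candidates {
        match try_connect(addr) {
            // Success: discard the buffered noise entirely.
            Ok(leader) => return Ok(leader),
            Err(e) => buffered_errors.push(format!("{addr}: {e}")),
        }
    }
    // Only now do the individual failures reach the log output.
    Err(format!("failover failed: [{}]", buffered_errors.join(", ")))
}

// Stand-in for a real connection attempt (assumption for illustration).
fn try_connect(addr: &str) -> Result<String, String> {
    if addr.ends_with("5690") {
        Ok(addr.to_string())
    } else {
        Err("stale leader info".to_string())
    }
}
```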

@CAJan93 (Contributor, Author) commented Jan 27, 2023

Codecov is complaining that coverage decreased. I assume that this is OK, since we also relied on sim testing in feat: introduce ElectionClient trait for meta. Please correct me if I am wrong @shanicky @yezizp2012

@yezizp2012 (Member) commented:

Good job! No worries about the code coverage. @shanicky also submitted a PR #7389 to support service discovery (including client failover). After a quick look at both of your PRs, I believe you have implemented some common functionality, except for inconsistencies in where you update the meta leader address information.

  1. In a k8s deployment environment, the clients in this PR will fetch and refresh the meta leader address from the load-balancing service.
  2. In other deployment environments, feat: meta-client with new election mechanism #7389 has bypass logic to refresh the leader address from configured and cached meta members. This could cover some migration scenarios in non-k8s environments.

I think both implementations are necessary, as we discussed earlier, and we need a configuration parameter to determine which path to take, such as --meta-service-mode=lb/group. The first path is taken when the mode is lb, the second when it is group.

Besides, I think starting RisingWave via risedev belongs to non-k8s deployment, so I prefer not to introduce nginx in risedev; instead, configure --meta-service-mode=group and leave it to the implementation of the second approach.
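The proposed `--meta-service-mode=lb/group` switch could be modeled as a small enum. The flag name comes from the discussion above; the enum, its variants' semantics, and the parsing are assumptions for illustration, not code from either PR.

```rust
// Hypothetical modeling of the proposed --meta-service-mode flag.
#[derive(Debug, PartialEq)]
enum MetaServiceMode {
    // "lb": refresh the leader address via a load-balancing service (k8s path).
    LoadBalancer,
    // "group": refresh the leader address from configured/cached meta members.
    Group,
}

impl std::str::FromStr for MetaServiceMode {
    type Err = String;
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "lb" => Ok(MetaServiceMode::LoadBalancer),
            "group" => Ok(MetaServiceMode::Group),
            other => Err(format!("unknown meta service mode: {other}")),
        }
    }
}
```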

.map_err(RpcError::into)
.map_err(RpcError::into);

for retry in self.meta_client.get_retry_strategy() {
Member:

Failover is already covered in meta_rpc_client_method_impl, so I guess this is not necessary?


// Hold locks on all sub-clients, to update atomically
{
let mut leader_c = self.leader_client.as_ref().lock().await;
Member:

We'd better wrap a core struct around all the sub-clients and lock/unlock on it.
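The suggestion above can be sketched like this: instead of one lock per sub-client, keep all sub-clients in a single core struct behind one lock, so a failover swaps them atomically. All type and field names here are placeholders, not the PR's real types, and a real async client would use `tokio::sync::Mutex` rather than the std one used here for brevity.

```rust
use std::sync::Mutex;

// Placeholder core struct holding everything that must be updated together.
struct ClientCore {
    leader_addr: String,
    // ...other sub-clients (heartbeat, stream, hummock, ...) would live here.
}

struct MetaClient {
    core: Mutex<ClientCore>,
}

impl MetaClient {
    fn failover_to(&self, new_leader: &str) {
        // One lock acquisition updates everything at once: there is no window
        // where half the sub-clients still point at the old leader.
        let mut core = self.core.lock().unwrap();
        core.leader_addr = new_leader.to_string();
    }

    fn leader_addr(&self) -> String {
        self.core.lock().unwrap().leader_addr.clone()
    }
}
```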

}

// repeat request if we were connected against the wrong node
if self.do_failover_if_needed().await {
Member:

If an RPC call fails, the response code already tells us whether we should do failover. Also, not all RPC calls are retryable.
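The reviewer's point can be illustrated with a small decision table. Which status codes the meta service actually returns when a request hits a non-leader node is an assumption here; the enum and functions are illustrative, loosely modeled on gRPC-style codes.

```rust
// Hypothetical outcome classification for a meta RPC call.
#[derive(Debug, PartialEq)]
enum RpcOutcome {
    Ok,
    Unavailable,      // transport-level failure: the leader may have changed
    PermissionDenied, // e.g. we talked to a follower: trigger failover
    InvalidArgument,  // caller bug: never retry, never failover
}

// Decide failover purely from the response code, instead of probing
// the leader after every failed request.
fn should_failover(outcome: &RpcOutcome) -> bool {
    matches!(outcome, RpcOutcome::Unavailable | RpcOutcome::PermissionDenied)
}

// Not every RPC is safe to repeat; non-idempotent calls would need
// caller-side handling even when a failover succeeds.
fn is_retryable(outcome: &RpcOutcome) -> bool {
    !matches!(outcome, RpcOutcome::InvalidArgument | RpcOutcome::Ok)
}
```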

/// Execute the failover if it is needed. If no failover is needed, do nothing.
/// Returns true if failover was needed, else false.
pub async fn do_failover_if_needed(&self) -> bool {
let current_leader = self.try_get_leader_from_connected_node().await;
Member:

We don't have to issue a make_leader_request call for every failed request; failover should be made a singleton, so that this refresh process only happens once.
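One way to make the refresh a singleton, as suggested above, is a compare-and-swap guard: only the task that wins the exchange runs the leader refresh, and concurrent failures skip the duplicate work. This is a sketch with assumed names, not the mechanism either PR uses.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical guard ensuring at most one in-flight failover refresh.
struct FailoverGuard {
    in_progress: AtomicBool,
}

impl FailoverGuard {
    fn new() -> Self {
        Self { in_progress: AtomicBool::new(false) }
    }

    /// Returns true if the caller won the right to run the refresh;
    /// concurrent callers get false and should wait or just retry later.
    fn try_begin(&self) -> bool {
        self.in_progress
            .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
    }

    /// Release the guard once the refresh completes (or fails).
    fn finish(&self) {
        self.in_progress.store(false, Ordering::Release);
    }
}
```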

let channel = get_channel_with_defaults(addr).await?;

// Dummy address forces a client failover
let dummy_address = HostAddress {
Member:

Why not get the leader address directly here? Storing dummy info here is quite weird.

@CAJan93 (Contributor, Author) commented Jan 30, 2023

Thank you very much for the feedback. I will have a look at this on Wednesday after my vacation :)

@CAJan93 (Contributor, Author) commented Feb 1, 2023

Closing this PR as duplicate work 😞

@CAJan93 CAJan93 closed this Feb 1, 2023
@xxchan xxchan deleted the client_failover branch May 14, 2023 09:53