
feat(meta): Client failover #7248

Closed
wants to merge 173 commits into from

Conversation

@CAJan93 (Contributor) commented Jan 6, 2023

I hereby agree to the terms of the Singularity Data, Inc. Contributor License Agreement.

What's changed and what's your intention?

#6787

PR based on #7049. Only merge after that one is merged.

Checklist

  • Client can connect against leader on original connection
  • Client can handle leader failover
  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • Det Sim tests should also run with 3etcd-3meta-1cn-1fe
  • Det Sim tests should also kill meta node if using 3etcd-3meta-1cn-1fe

Documentation

If your pull request contains user-facing changes, please specify the types of the changes, and create a release note. Otherwise, please feel free to remove this section.

Types of user-facing changes

Please keep the types that apply to your changes, and remove those that do not apply.

  • Installation and deployment
  • Connector (sources & sinks)
  • SQL commands, functions, and operators
  • RisingWave cluster configuration changes
  • Other (please specify in the release note below)

Release note

Please create a release note for your changes. In the release note, focus on the impact on users, and mention the environment or conditions where the impact may occur.

Refer to a related PR or issue link (optional)

let current_leader = nl.unwrap();

let addr = format!(
"http://{}:{}",
@CAJan93 (Contributor, Author):
Are there any plans to switch to https? Do I have to consider that here?

return;
}

// Only print failure messages if the entire failover failed
@CAJan93 (Contributor, Author):

I am doing this to avoid spamming the logs. Printing immediately is very verbose, since the meta nodes keep sending stale leader information for quite some time.
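The buffering idea described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: `failover_with_quiet_retries` and `try_connect` are hypothetical names, and the connection logic is a stand-in. Per-attempt failures are collected and only surfaced if the whole failover fails.

```rust
// Hypothetical sketch: collect per-attempt failure messages and only emit
// them if the entire failover fails, so transient stale-leader errors do
// not spam the logs.
fn failover_with_quiet_retries(candidates: &[&str]) -> Result<String, String> {
    let mut buffered_errors: Vec<String> = Vec::new();
    for addr in candidates {
        match try_connect(addr) {
            // Success: discard the buffered noise entirely.
            Ok(leader) => return Ok(leader),
            Err(e) => buffered_errors.push(format!("{addr}: {e}")),
        }
    }
    // Only now do the individual failures reach the log output.
    Err(format!("failover failed: [{}]", buffered_errors.join(", ")))
}

// Stand-in for a real connection attempt (assumption for illustration).
fn try_connect(addr: &str) -> Result<String, String> {
    if addr.ends_with("5690") {
        Ok(addr.to_string())
    } else {
        Err("stale leader info".to_string())
    }
}
```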

@CAJan93 (Contributor, Author) commented Jan 27, 2023

Codecov is complaining that coverage decreased. I assume that this is OK, since we also relied on sim testing in feat: introduce ElectionClient trait for meta. Please correct me if I am wrong @shanicky @yezizp2012

@yezizp2012 (Member) commented:

Good job! No worries about the code coverage. @shanicky also submitted a PR #7389 to support service discovery (including client failover). After a quick look at both of your PRs, I believe you have implemented some common functionality, except for inconsistencies in where you update the meta leader address information.

  1. In a k8s deployment environment, the clients in this PR will fetch and refresh the meta leader address from the load-balancing service.
  2. In other deployment environments, feat: meta-client with new election mechanism #7389 has bypass logic to refresh the leader address from configured and cached meta members. This could cover some migration scenarios in non-k8s environments.

I think both implementations are necessary, as we discussed earlier, and we need a configuration parameter to determine which path to take, such as --meta-service-mode=lb/group. The first path is taken when the mode is lb, the second when it is group.

Besides, I think starting RisingWave via risedev belongs to non-k8s deployment, so I prefer not to introduce nginx in risedev; instead, configure --meta-service-mode=group and leave it to the implementation of the second approach.
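The proposed `--meta-service-mode=lb/group` switch could be modeled as a small enum. The flag name comes from the discussion above; the enum, its variants' semantics, and the parsing are assumptions for illustration, not code from either PR.

```rust
// Hypothetical modeling of the proposed --meta-service-mode flag.
#[derive(Debug, PartialEq)]
enum MetaServiceMode {
    // "lb": refresh the leader address via a load-balancing service (k8s path).
    LoadBalancer,
    // "group": refresh the leader address from configured/cached meta members.
    Group,
}

impl std::str::FromStr for MetaServiceMode {
    type Err = String;
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "lb" => Ok(MetaServiceMode::LoadBalancer),
            "group" => Ok(MetaServiceMode::Group),
            other => Err(format!("unknown meta service mode: {other}")),
        }
    }
}
```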

.map_err(RpcError::into)
.map_err(RpcError::into);

for retry in self.meta_client.get_retry_strategy() {
Member:

Failover is already covered in meta_rpc_client_method_impl, so I guess this is not necessary?


// Hold locks on all sub-clients, to update atomically
{
let mut leader_c = self.leader_client.as_ref().lock().await;
Member:

We'd better wrap a core struct around all the sub-clients and lock/unlock on it.
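The suggestion above can be sketched like this: instead of one lock per sub-client, keep all sub-clients in a single core struct behind one lock, so a failover swaps them atomically. All type and field names here are placeholders, not the PR's real types, and a real async client would use `tokio::sync::Mutex` rather than the std one used here for brevity.

```rust
use std::sync::Mutex;

// Placeholder core struct holding everything that must be updated together.
struct ClientCore {
    leader_addr: String,
    // ...other sub-clients (heartbeat, stream, hummock, ...) would live here.
}

struct MetaClient {
    core: Mutex<ClientCore>,
}

impl MetaClient {
    fn failover_to(&self, new_leader: &str) {
        // One lock acquisition updates everything at once: there is no window
        // where half the sub-clients still point at the old leader.
        let mut core = self.core.lock().unwrap();
        core.leader_addr = new_leader.to_string();
    }

    fn leader_addr(&self) -> String {
        self.core.lock().unwrap().leader_addr.clone()
    }
}
```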

}

// repeat request if we were connected against the wrong node
if self.do_failover_if_needed().await {
Member:

If an RPC call fails, the response code already tells us whether we should do failover. Also, not all RPC calls are retryable.
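The reviewer's point can be illustrated with a small decision table. Which status codes the meta service actually returns when a request hits a non-leader node is an assumption here; the enum and functions are illustrative, loosely modeled on gRPC-style codes.

```rust
// Hypothetical outcome classification for a meta RPC call.
#[derive(Debug, PartialEq)]
enum RpcOutcome {
    Ok,
    Unavailable,      // transport-level failure: the leader may have changed
    PermissionDenied, // e.g. we talked to a follower: trigger failover
    InvalidArgument,  // caller bug: never retry, never failover
}

// Decide failover purely from the response code, instead of probing
// the leader after every failed request.
fn should_failover(outcome: &RpcOutcome) -> bool {
    matches!(outcome, RpcOutcome::Unavailable | RpcOutcome::PermissionDenied)
}

// Not every RPC is safe to repeat; non-idempotent calls would need
// caller-side handling even when a failover succeeds.
fn is_retryable(outcome: &RpcOutcome) -> bool {
    !matches!(outcome, RpcOutcome::InvalidArgument | RpcOutcome::Ok)
}
```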

/// Execute the failover if it is needed. If no failover is needed, do nothing.
/// Returns true if failover was needed, else false.
pub async fn do_failover_if_needed(&self) -> bool {
let current_leader = self.try_get_leader_from_connected_node().await;
Member:

We don't have to issue a make_leader_request call for every failed request; failover should be made a singleton, so that this refresh process only happens once.
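One way to make the refresh a singleton, as suggested above, is a compare-and-swap guard: only the task that wins the exchange runs the leader refresh, and concurrent failures skip the duplicate work. This is a sketch with assumed names, not the mechanism either PR uses.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical guard ensuring at most one in-flight failover refresh.
struct FailoverGuard {
    in_progress: AtomicBool,
}

impl FailoverGuard {
    fn new() -> Self {
        Self { in_progress: AtomicBool::new(false) }
    }

    /// Returns true if the caller won the right to run the refresh;
    /// concurrent callers get false and should wait or just retry later.
    fn try_begin(&self) -> bool {
        self.in_progress
            .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
    }

    /// Release the guard once the refresh completes (or fails).
    fn finish(&self) {
        self.in_progress.store(false, Ordering::Release);
    }
}
```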

let channel = get_channel_with_defaults(addr).await?;

// Dummy address forces a client failover
let dummy_address = HostAddress {
Member:

Why not get the leader address directly here? Storing dummy info here is quite weird.

@CAJan93 (Contributor, Author) commented Jan 30, 2023

Thank you very much for the feedback. I will have a look at this on Wednesday after my vacation :)

@CAJan93 (Contributor, Author) commented Feb 1, 2023

Closing this PR as duplicate work 😞

@CAJan93 CAJan93 closed this Feb 1, 2023
@xxchan xxchan deleted the client_failover branch May 14, 2023 09:53