
Tracking: high availability for Meta service #5943

Closed
13 of 15 tasks
yezizp2012 opened this issue Oct 20, 2022 · 17 comments

Comments

CAJan93 (Contributor) commented Oct 20, 2022

Some helpful Resources. Please let me know if I forgot something.

CAJan93 (Contributor) commented Oct 20, 2022

Looking forward to working on this task. However, I am still working on a different kernel task and will be at KubeCon next week, so the earliest I can start on this is the beginning of November.

skyzh (Contributor) commented Nov 17, 2022

added a new task "support connecting to multiple meta-node on compute-node"

CAJan93 (Contributor) commented Nov 22, 2022

First draft version is ready: failover works, but Hummock crashes: #6534

CAJan93 (Contributor) commented Nov 23, 2022

> support connecting to multiple meta-node on compute-node

@yezizp2012 I do not understand this part. My understanding is that the meta HA setup is a single-leader system (see the design docs), so we only ever connect to one instance.
Is my understanding correct, @fuyufjh?

yezizp2012 (Member, Author) commented Nov 23, 2022

> support connecting to multiple meta-node on compute-node
>
> @yezizp2012 I do not understand this part. My understanding is that the meta HA setup is a single-leader system (see the design docs), so we only ever connect to one instance. Is my understanding correct, @fuyufjh?

IIUC, this refers to refactoring the meta client: it should be able to connect to multiple meta nodes and retrieve leader information from any of them, which requires a get_leader_addr interface in meta.

skyzh (Contributor) commented Nov 23, 2022

> IIUC, this refers to refactoring the meta client: it should be able to connect to multiple meta nodes and retrieve leader information from any of them, which requires a get_leader_addr interface in meta.

Yes. After that, we can modify risedev to provide multiple meta addresses to compute nodes and frontend nodes. Currently it only picks the first one (see the risedev warning).
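The client-side half of this refactoring can be sketched as follows. This is a minimal illustration only: `MetaNode` and its `get_leader_addr` method are hypothetical stand-ins, not the actual RisingWave client API.

```rust
// Hypothetical sketch: probe each configured meta node until one reports
// the current leader. `MetaNode` and `get_leader_addr` are illustrative
// stand-ins, not the actual RisingWave client API.
struct MetaNode {
    addr: String,
    // What this node currently believes the leader address is, if known.
    leader_addr: Option<String>,
}

impl MetaNode {
    // Stand-in for an RPC such as the proposed `get_leader_addr`.
    fn get_leader_addr(&self) -> Option<String> {
        self.leader_addr.clone()
    }
}

/// Ask each configured meta node in turn; return the first known leader.
fn discover_leader(nodes: &[MetaNode]) -> Option<String> {
    nodes.iter().find_map(|node| node.get_leader_addr())
}

fn main() {
    let nodes = vec![
        MetaNode { addr: "meta-0:5690".into(), leader_addr: None },
        MetaNode { addr: "meta-1:5690".into(), leader_addr: Some("meta-2:5690".into()) },
    ];
    println!("probing node {} first", nodes[0].addr);
    assert_eq!(discover_leader(&nodes), Some("meta-2:5690".to_string()));
}
```

With an interface like this, risedev could pass all meta addresses to compute and frontend nodes instead of only the first one.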

CAJan93 (Contributor) commented Nov 24, 2022

Thank you very much for clarifying. I would suggest that the client change and the meta change (enabling a follower service that supports get_leader_addr) be done in separate PRs.

CAJan93 (Contributor) commented Nov 24, 2022

First version of the failover handling is done. Please have a look: #6466

A few things are still missing from the PR:

  • e2e tests. I do not know how best to approach these; please let me know if you have suggestions.
  • Currently we busy-wait to run elections. etcd also provides an update API afaik. I expect supporting it will require quite a rewrite, which is why I would like to leave it as future work.
  • Meta crashes on failover because of #6534 (Meta failover crashes Hummock). This will be addressed in a future PR.
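The busy-wait election mentioned in the second bullet can be sketched like this. A mutex-guarded slot stands in for the shared etcd key, and all names are illustrative rather than the actual implementation:

```rust
// Minimal sketch of a busy-wait election: each candidate polls a shared
// leader slot until it is free, then claims it. A mutex-guarded Option
// stands in for the etcd key used in the real implementation.
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

type LeaderSlot = Arc<Mutex<Option<String>>>;

/// Busy-wait campaign: poll the slot until it is free, then claim it.
fn campaign(slot: &LeaderSlot, candidate: &str) {
    loop {
        let mut guard = slot.lock().unwrap();
        if guard.is_none() {
            *guard = Some(candidate.to_string()); // election won
            return;
        }
        drop(guard); // release the lock before sleeping
        thread::sleep(Duration::from_millis(10)); // poll interval
    }
}

fn main() {
    let slot: LeaderSlot = Arc::new(Mutex::new(None));
    campaign(&slot, "meta-0");
    assert_eq!(slot.lock().unwrap().as_deref(), Some("meta-0"));

    // A second candidate only wins after the current leader steps down.
    *slot.lock().unwrap() = None; // leader resigns (or its lease expires)
    campaign(&slot, "meta-1");
    assert_eq!(slot.lock().unwrap().as_deref(), Some("meta-1"));
}
```

Replacing the polling loop with a watch-style notification is what would require the larger rewrite mentioned above.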

CAJan93 (Contributor) commented Dec 7, 2022

> added a new task "support connecting to multiple meta-node on compute-node"

Is this still up to date? I do not see the subtask. @skyzh

yezizp2012 (Member, Author) commented

> added a new task "support connecting to multiple meta-node on compute-node"
>
> Is this still up to date? I do not see the subtask @skyzh

That should be a part of #6755.

CAJan93 (Contributor) commented Dec 7, 2022

Tests are still running, but my guess is that #6771 is ready for review. This is a minor task overall, but merging it would still simplify further steps. Let me know if you have any objections or suggestions on the PR.

CAJan93 (Contributor) commented Dec 13, 2022

I currently have 2 PRs that are ready for review.

CC @arkbriar

mergify bot pushed a commit that referenced this issue Dec 14, 2022
…6771)

This is also needed for #5943

Implementing the following TODO:

```rust
// TODO: Use tonic's serve_with_shutdown for a graceful shutdown. Now it does not work,
// as the graceful shutdown waits all connections to disconnect in order to finish the stop.
```


Approved-By: fuyufjh
Approved-By: yezizp2012
Approved-By: zwang28

Co-Authored-By: CAJan93 <[email protected]>
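The shutdown-signal pattern behind `serve_with_shutdown` can be illustrated with the standard library alone. This is a sketch of the idea, not the tonic-based fix in the PR: the serve loop checks a channel and stops accepting new work once the signal arrives.

```rust
// Stdlib sketch of a shutdown-signal pattern similar to what tonic's
// `serve_with_shutdown` provides: the serve loop stops accepting new
// work once a signal arrives on the channel. Illustrative only.
use std::sync::mpsc;

/// Serve queued requests until a shutdown signal is observed.
fn serve(requests: &[&str], shutdown_rx: &mpsc::Receiver<()>) -> usize {
    let mut handled = 0;
    for _request in requests {
        if shutdown_rx.try_recv().is_ok() {
            break; // graceful stop: take no new requests after the signal
        }
        handled += 1; // stand-in for actually handling the request
    }
    handled
}

fn main() {
    let (shutdown_tx, shutdown_rx) = mpsc::channel::<()>();

    // No signal yet: every request is served.
    assert_eq!(serve(&["a", "b", "c"], &shutdown_rx), 3);

    // Signal first: the loop exits before taking new work.
    shutdown_tx.send(()).unwrap();
    assert_eq!(serve(&["a", "b", "c"], &shutdown_rx), 0);
}
```

The difficulty noted in the TODO is that a fully graceful shutdown also has to wait for in-flight connections, which can block the stop indefinitely.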
@yezizp2012 yezizp2012 self-assigned this Dec 19, 2022
CAJan93 (Contributor) commented Dec 20, 2022

#6937 is ready for review. Please view the issue for features and limitations.

CAJan93 (Contributor) commented Dec 28, 2022

The fencing PR is ready for review and the CI pipeline is green. Also see #6786.

Maybe we can get this merged before New Year's :)
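The idea behind fencing can be sketched with a monotonically increasing token, assuming each newly elected leader receives a strictly larger token and the store rejects writes carrying a stale one. The names below are illustrative; the actual PR builds fencing on top of etcd.

```rust
// Sketch of token-based fencing: each newly elected leader gets a strictly
// increasing token, and the store rejects writes with a stale token so a
// deposed leader cannot corrupt state. Names are illustrative.
struct MetaStore {
    highest_token_seen: u64,
}

impl MetaStore {
    /// Accept a write only if its fencing token is current.
    fn write(&mut self, token: u64) -> Result<(), &'static str> {
        if token < self.highest_token_seen {
            return Err("fenced: write from a deposed leader rejected");
        }
        self.highest_token_seen = token;
        Ok(())
    }
}

fn main() {
    let mut store = MetaStore { highest_token_seen: 0 };
    assert!(store.write(1).is_ok());  // current leader writes normally
    assert!(store.write(2).is_ok());  // new leader elected after failover
    assert!(store.write(1).is_err()); // old leader wakes up and is fenced
}
```

This is what prevents a paused or partitioned old leader from issuing writes after a new leader has taken over.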

CAJan93 (Contributor) commented Dec 29, 2022

Fencing is merged. Thank you very much for your guidance and approval @yezizp2012

CAJan93 (Contributor) commented Jan 26, 2023

feat(meta): Client failover is ready for review.

  • The client is able to fail over to a new meta node.
  • Failover may take a few seconds; meta nodes initially report stale leader information until the election has finished.


5 participants