
Implement Leader Election for HA Mode #135

Open

aminmr opened this issue Oct 5, 2024 · 3 comments

aminmr (Contributor) commented Oct 5, 2024

Description

Currently, the k8s-cleaner operator is deployed with a single replica by default and does not support high availability (HA). If the number of replicas is increased, multiple pods may attempt to take actions simultaneously, which could lead to conflicts or redundant operations on the cluster.

To address this, I suggest implementing leader election for HA mode. Many open-source projects utilize Kubernetes' Lease mechanism for this purpose, allowing only one pod to act as the leader at any given time. This would prevent multiple instances from interfering with each other when multiple replicas are running.

Here are the relevant Kubernetes docs: Kubernetes Lease Mechanism.

Proposed Solution:

- Introduce leader election logic using Kubernetes Leases.
- Ensure that only the pod holding the active lease performs actions on the cluster, while the other replicas remain in standby mode (a minimal sketch follows below).
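
A minimal sketch of what Lease-based leader election could look like with client-go; the lease name, namespace, and timings below are placeholders rather than anything taken from k8s-cleaner:

```go
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
	"k8s.io/klog/v2"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		klog.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Each replica identifies itself by its pod name (hostname).
	id, err := os.Hostname()
	if err != nil {
		klog.Fatal(err)
	}

	// Lease-backed lock; name and namespace are placeholders.
	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Name:      "k8s-cleaner-leader-election",
			Namespace: "k8s-cleaner-system",
		},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:            lock,
		ReleaseOnCancel: true,
		LeaseDuration:   15 * time.Second,
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Only the elected leader starts the cleanup work.
				klog.Info("became leader, starting work")
			},
			OnStoppedLeading: func() {
				// Lost the lease: stop working so two pods never act at once.
				klog.Info("lost leadership, exiting")
				os.Exit(0)
			},
			OnNewLeader: func(identity string) {
				if identity != id {
					klog.Infof("current leader is %s", identity)
				}
			},
		},
	})
}
```

If the operator is built on controller-runtime, the same effect can usually be achieved by setting LeaderElection and LeaderElectionID in the manager options instead of wiring client-go directly.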

I am happy to volunteer to implement this feature for the project.

I am looking forward to your thoughts! Thanks!

gianlucam76 (Owner) commented

Thanks @aminmr. I am not sure about this. k8s-cleaner originally had leader election, but I removed it. The reason is that I believe with Cleaner we are more likely to hit scaling limits than availability limits (and, in the end, it does not have to respond to other services). When we do hit scaling limits, my plan is to introduce sharding, so that different Cleaner pods can process different Cleaner instances in parallel based on some annotation.
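
For illustration, a minimal sketch of what such annotation-based sharding could look like as a controller-runtime predicate; the annotation key, environment variable, and shard values are hypothetical, not an existing k8s-cleaner or Sveltos convention:

```go
package sharding

import (
	"os"

	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// shardAnnotation is a hypothetical annotation key used to assign a Cleaner
// instance to a shard; the real key would follow whatever convention the
// project settles on.
const shardAnnotation = "cleaner.example.io/shard"

// ShardMatchPredicate makes a cleaner pod reconcile only the Cleaner objects
// annotated with its own shard value, configured manually per deployment
// (here via a hypothetical CLEANER_SHARD environment variable).
func ShardMatchPredicate() predicate.Predicate {
	myShard := os.Getenv("CLEANER_SHARD")
	return predicate.NewPredicateFuncs(func(obj client.Object) bool {
		return obj.GetAnnotations()[shardAnnotation] == myShard
	})
}
```

Each shard would then be a separate deployment that registers the predicate with WithEventFilter when the Cleaner controller is built.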

Let's keep this open though. If we don't take that path, we will add leader election back.

Thank you again!

aminmr (Contributor) commented Oct 6, 2024

Thanks, @gianlucam76!

I appreciate your explanation, but I have a couple of questions regarding the future implementation and design decisions for the Cleaner.

1. What technologies are you planning to use for implementing sharding in the Cleaner? I'm curious how you plan to manage sharding so that multiple Cleaner instances can process resources in parallel.

2. Could you clarify why leader election isn't a good fit for this project? I understand your point about scaling being more of a concern than availability, but I'm still unclear on why leader election is considered less effective in this case. Is there a specific issue with leader election that conflicts with the overall goals of the project? And do you know of any other operator with a sharding feature?

Thanks again for your time, and I look forward to your thoughts on this!

gianlucam76 (Owner) commented Oct 7, 2024

Hi @aminmr. Regarding sharding, I am planning on using the same approach I used in Sveltos (here). It will require some manual configuration (as I don't have a shard controller like in Sveltos), but that is the idea.

In general, leader election is great (though it comes at the cost of running 3 pods instead of 1, with 2 pods doing nothing most of the time). But I see it as more valuable for a service that needs to respond to other services, where you cannot afford having it down for 30 seconds or so.
Cleaner has a configurable jitter window, so if Cleaner gets stuck it won't miss processing Cleaner instances that are due. This makes leader election less necessary.
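
A rough illustration of the catch-up behaviour such a jitter window implies; the names and logic are illustrative, not taken from k8s-cleaner's code:

```go
package schedule

import "time"

// isDue treats a Cleaner run as still due if "now" falls anywhere inside
// [nextRun, nextRun+jitterWindow], so a pod that was restarted or briefly
// stuck can still pick up a run it nominally missed.
func isDue(nextRun, now time.Time, jitterWindow time.Duration) bool {
	return !now.Before(nextRun) && now.Before(nextRun.Add(jitterWindow))
}
```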
