-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added leader election. #530
Conversation
pkg/leader/doc.go
Outdated
|
||
Both the Leader For Life and lease-based approaches to leader election are | ||
built on the concept that each candidate will attempt to create a resource with | ||
the same GVK, namespace and name. Whichever candidate succeeds becomes the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the same GVK, namespace and name
->
the same GVK, namespace, and name
? I believe we need an extra comma here to separate items in a list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Grammar-wise, it's optional. I'm happy to add it if that's the preferred style here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy to add it if that's the preferred style here.
if you can do that, it will be great.
pkg/leader/leader.go
Outdated
"k8s.io/apimachinery/pkg/util/wait" | ||
"k8s.io/client-go/rest" | ||
crclient "sigs.k8s.io/controller-runtime/pkg/client" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need new line here.
Usually, we group imports with following quideline:
- group all go internal imports first.
- group all current pkg imports second.
- group all external imports last.
pkg/leader/leader.go
Outdated
func myOwnerRef(ctx context.Context, client crclient.Client, ns string) (metav1.OwnerReference, error) { | ||
hostname, err := os.Hostname() | ||
if err != nil { | ||
return metav1.OwnerReference{}, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return nil, err
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cannot use nil as type "github.com/operator-framework/operator-sdk/vendor/k8s.io/apimachinery/pkg/apis/meta/v1".OwnerReference in return argument
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, the return arg is a type of metav1.OwnerReference
not *metav1.OwnerReference
? Maybe change to *metav1.OwnerReference
as the return type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a normal pattern either way, and there are examples of returning an error with both nil-pointer and zero-value in the standard library. I'm not sure if you had any reasoning in mind beyond aesthetics? In any case, since metav1 itself only ever returns a pointer to an OwnerReference, I'll go ahead and change this to match.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mhrivnak from my past experience with Go, almost all the code I have seen has return nil, err
pattern which is what we use in the operator-sdk codebase. i guess this is just a style nit to be consistent with rest of the codebase. Sorry I should have been more clear.
pkg/leader/leader.go
Outdated
err = client.Get(ctx, key, myPod) | ||
if err != nil { | ||
logrus.Error("failed to get pod") | ||
return metav1.OwnerReference{}, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return nil, err
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cannot use nil as type "github.com/operator-framework/operator-sdk/vendor/k8s.io/apimachinery/pkg/apis/meta/v1".OwnerReference in return argument
pkg/leader/leader.go
Outdated
// case where a namespace cannot be found for the current pod. This is useful | ||
// for a service that might run outside the cluster, for example an operator | ||
// being started with `operator-sdk up local`. | ||
func TryBecome(ctx context.Context, name string) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It maybe more intuitive if we call name
to either lockKey
or leaderKey
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking name
because the value will literally go into the name
field on the ConfigMap
. Maybe lockName
would be better?
I worry about using the term "key" in a scenario that involves a "lock", since we don't want anyone to think that it is the kind of key that could "unlock" something.
Thoughts on that? I'm open to suggestions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lockName
sounds better. The reason I think having a keyword lock
is useful because the usr now expects or have an idea that multiple instances having the same lockName
going compete and one will win.
pkg/leader/leader.go
Outdated
return err | ||
} | ||
|
||
cm := &corev1.ConfigMap{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe move cm
to line 121 so it is closer to its usage.
pkg/leader/leader.go
Outdated
logrus.Info("Not the leader. Waiting.") | ||
select { | ||
case <-time.After(wait.Jitter(time.Second*time.Duration(seconds), .2)): | ||
if seconds < 16 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe change 16
to a named variable e.g maxBackoffInterval
and also use backoff
if backoff < maxBackoffInterval {
backoff *= 2
}
pkg/leader/leader.go
Outdated
} | ||
|
||
// try to create a lock | ||
seconds := 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we have backoff := time.second
instead of seconds?
pkg/leader/leader.go
Outdated
case apierrors.IsAlreadyExists(err): | ||
logrus.Info("Not the leader. Waiting.") | ||
select { | ||
case <-time.After(wait.Jitter(time.Second*time.Duration(seconds), .2)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
time.After(wait.Jitter(time.Second*time.Duration(seconds), .2))
->
time.After(wait.Jitter(backoff, .2))
lgtm |
pkg/leader/leader.go
Outdated
// case where a namespace cannot be found for the current pod. This is useful | ||
// for a service that might run outside the cluster, for example an operator | ||
// being started with `operator-sdk up local`. | ||
func TryBecome(ctx context.Context, lockName string) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the operator-sdk up local
case all we want to do is detect if the operator is not running in a pod and log that we're skipping leader election. We should do that in Become()
. Otherwise users will have to keep changing their code between Become()
and TryBecome()
depending on how they run the operator.
So let's just remove TryBecome()
and change the following in Become()
:
ns, err := myNS()
if err != nil {
if err == ErrNoNS {
logrus.Info("Skipping leader election; not running in cluster")
return nil
}
return err
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I'm not quite sure what you mean by needing to "keep changing their code...", but I did write this for two uses cases, and you're right that the SDK will probably only care about one of them. Just for the sake of discussion, they are:
- My service might run off-cluster, so I want failure to detect the current namespace to be accepted as normal.
- My service always runs in-cluster, so failure to detect the current namespace should result in hard failure.
For our purposes, I'll go ahead with the change you suggested. Thanks!
pkg/leader/leader.go
Outdated
return ctx.Err() | ||
} | ||
default: | ||
logrus.Error("unknown error creating configmap") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Log the full error:
logrus.Errorf("unknown error creating configmap: %v", err)
pkg/leader/leader.go
Outdated
return "", err | ||
} | ||
ns := strings.TrimSpace(string(nsBytes)) | ||
logrus.Infof("found namespace: %s", ns) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably no need to log this. It's not saying much as this is normal behavior and users already know what namespace they're running the operator in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I'll change it to Debug level since it could be useful for troubleshooting.
pkg/leader/leader.go
Outdated
if err != nil { | ||
return nil, err | ||
} | ||
logrus.Infof("found hostname: %s", hostname) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same for this log. We can remove this.
pkg/leader/leader.go
Outdated
key := crclient.ObjectKey{Namespace: ns, Name: hostname} | ||
err = client.Get(ctx, key, myPod) | ||
if err != nil { | ||
logrus.Error("failed to get pod") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Log the complete error.
logrus.Errorf("failed to get self pod: %v", err)
|
||
// maxBackoffInterval defines the maximum amount of time to wait between | ||
// attempts to become the leader. | ||
const maxBackoffInterval = time.Second * 16 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any thoughts on making this configurable via an option?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hasbro17 i'd suggest we keep it a default value first. if user complains, then we make it configurable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I don't have any strong opinions on making it configurable in this PR. Just wanted to bring it up for discussion as something todo in the future as it might be a use case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this could be valuable as a configuration item in the future. I concluded it wasn't a priority right now, because it's hard to come up with a compelling use case in which changing it has appreciable value. But if someone brings us a good user story, it would be easy enough to add.
See doc/proposals/leader-for-life.md
300ad4c
to
67a2606
Compare
LGTM |
LGTM @mhrivnak Do you have any thoughts on how we can add an e2e test for this as a follow up?
And ensure that only 1 of the 3 pods become Ready(and is the leader). |
I'll open a new issue to track creating an e2e test. |
See doc/proposals/leader-for-life.md