Improve cluster creation success in Korea #1196
Please rebase the pull request.
Seems good. I'm not sure about never raising the error from the internet checker in the logs, though. It seems like it will always requeue, forever?
If I read this right, it will try 6 times, and then again after 1 hour.
Isn't it the case that a requeue on failure will be logged by the framework?
As discussed, we might want to change the waiting logic a bit
This is expected to significantly improve our cluster creation success in South Korea.
@@ -59,7 +59,9 @@ func (r *CheckerController) Reconcile(request ctrl.Request) (ctrl.Result, error)
	if thisErr != nil {
		// do all checks even if there is an error
		err = thisErr
Is this necessary? Can't we use err instead of thisErr? The only real error-generating statement in this function is c.Check(ctx), so one variable should be enough.
Good question. Why is this not enough?
for _, c := range r.checkers {
	err := c.Check(ctx)
	if err != nil {
		// do all checks even if there is an error
		if err != errRequeue {
			r.log.Errorf("checker %s failed with %v", c.Name(), err)
		}
	}
}
It's subtle. Consider the following case: checker 0 returns an error, and checker 1 returns nil. In that case, with your code, we would return foo, nil when we exit the function, but in fact we want to return foo, err.
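To make the case above concrete, here is a minimal, self-contained sketch. It is not the PR's code: the `checker` interface, `fakeChecker`, `sampleCheckers`, `runAll`, and `runAllShadowed` are all hypothetical names, and the context argument and the `errRequeue` handling are left out for brevity. `runAll` follows the `thisErr` pattern from the diff, while `runAllShadowed` follows the suggested single-variable loop, where `err := ...` declares a new variable inside the loop body.

```go
package main

import (
	"errors"
	"fmt"
)

// checker is a hypothetical stand-in for the checker interface used in the PR.
type checker interface {
	Name() string
	Check() error
}

// fakeChecker returns a fixed result so the behaviour is easy to follow.
type fakeChecker struct {
	name string
	err  error
}

func (f fakeChecker) Name() string { return f.name }
func (f fakeChecker) Check() error { return f.err }

// sampleCheckers reproduces the case from the review comment:
// checker 0 fails, checker 1 succeeds.
func sampleCheckers() []checker {
	return []checker{
		fakeChecker{name: "checker0", err: errors.New("boom")},
		fakeChecker{name: "checker1", err: nil},
	}
}

// runAll runs every checker and remembers the last non-nil error.
// thisErr is per-iteration, so a later nil result cannot erase an earlier failure.
func runAll(checkers []checker) error {
	var err error
	for _, c := range checkers {
		if thisErr := c.Check(); thisErr != nil {
			// do all checks even if there is an error
			err = thisErr
		}
	}
	return err
}

// runAllShadowed mirrors the suggested loop. The := inside the loop declares
// a new err that shadows the outer one, so the outer err is never assigned.
func runAllShadowed(checkers []checker) error {
	var err error
	for _, c := range checkers {
		err := c.Check()
		if err != nil {
			fmt.Printf("checker %s failed with %v\n", c.Name(), err)
		}
	}
	return err
}

func main() {
	fmt.Println("runAll reports failure:", runAll(sampleCheckers()) != nil)
	fmt.Println("runAllShadowed reports failure:", runAllShadowed(sampleCheckers()) != nil)
}
```

Writing `err = c.Check()` with a plain assignment would not fix the single-variable version either: checker 1's nil would then overwrite checker 0's error, which is the situation the comment describes.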
@@ -63,6 +52,24 @@ func (r *InternetChecker) Name() string {

// Reconcile will keep checking that the cluster can connect to essential services.
func (r *InternetChecker) Check(ctx context.Context) error {
	cli := &http.Client{
		Transport: &http.Transport{
			// We set DisableKeepAlives for two reasons:
Good explanation.
gcs.prod.monitoring.core.windows.net in Korea Central currently has a bad server. When our internet checker hits it, our cluster creations fail irrevocably due to golang/go#36026. This PR is a workaround.
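For readers unfamiliar with the setting in the diff, the following is a rough sketch of an HTTP client with `DisableKeepAlives` set, which per the `net/http` documentation makes the transport use each connection for a single request only. Everything other than that field is an assumption for illustration: the helper names `newCheckClient`, `checkURL`, and `runLocalCheck`, the 10-second timeout, and the use of a HEAD request are not taken from the PR, and the example probes a local `httptest` server rather than the real endpoints.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// newCheckClient builds a client whose transport does not reuse connections.
// The timeout value is an assumption, not taken from the PR.
func newCheckClient() *http.Client {
	return &http.Client{
		Timeout: 10 * time.Second,
		Transport: &http.Transport{
			// Each request gets a fresh connection instead of a pooled one.
			DisableKeepAlives: true,
		},
	}
}

// checkURL issues a single HEAD request and reports only transport-level
// failures; the status code is deliberately ignored in this sketch.
func checkURL(ctx context.Context, cli *http.Client, url string) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodHead, url, nil)
	if err != nil {
		return err
	}
	resp, err := cli.Do(req)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}

// runLocalCheck exercises checkURL against a throwaway local server.
func runLocalCheck() error {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()
	return checkURL(context.Background(), newCheckClient(), srv.URL)
}

func main() {
	fmt.Println("check error:", runLocalCheck())
}
```

The trade-off is an extra connection setup per request, which is usually acceptable for an infrequent connectivity probe.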