Make maximum delay of prober in its backoff configurable #1001

Merged
merged 1 commit on Dec 13, 2024
17 changes: 16 additions & 1 deletion pkg/status/status.go
@@ -23,8 +23,10 @@ import (
"net"
"net/http"
"net/url"
"os"
"path"
"reflect"
"strconv"
"sync"
"time"

@@ -56,6 +58,19 @@ const (
initialDelay = 200 * time.Millisecond
)

var (
// probeMaxRetryDelay defines the maximum delay between retries in the backoff of probing
probeMaxRetryDelay = 30 * time.Second
)

func init() {
if val, ok := os.LookupEnv("PROBE_MAX_RETRY_DELAY_SECONDS"); ok {
@skonto (Contributor) commented on Sep 5, 2024:
Should we document this? We could also add it to the controller deployment config (net-istio repo) after:

        - name: CONFIG_LOGGING_NAME
          value: config-logging
        - name: CONFIG_OBSERVABILITY_NAME
          value: config-observability
        - name: ENABLE_SECRET_INFORMER_FILTERING_BY_CERT_UID
          value: "false"

What about other ingresses? Are they equally affected? 🤔

Contributor:

Do you have any numbers about the correct value per number of services? That would also be interesting to document as a recommendation.

Contributor (Author):

I am not sure this number should be related to the number of services. The number of services, IMO, mainly affects how long Istio needs, in our case, to get the config synced to all Ingress gateways, and that in turn depends on things like how Istio is tuned (e.g. CPU for istiod), whether delta pushes are enabled for some parts or not, how large the mesh is, whether ambient mode is used or not, etc.

The number we look at here is the maximum duration of the back-off in net-istio. As with exponential back-offs in the context of retries, I assume this was chosen with the intent of not overloading things.

And that is exactly where we, at least, do not see an overload. Two components are involved. The first is the net-istio-controller itself, which originates the probe requests. If for whatever reason there were many KIngresses that it reconciles and then must probe, then yes, reducing the maximum probing delay increases the load, but in our experience that is rarely the case and has never come close to being a bottleneck. On the receiving side, the probe requests go to the Istio ingress gateway, and there the percentage of requests that are actually probing requests is something like 0.0...1 % (I have no exact number at hand, but it is so low that it has no relevance at all).
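To make the effect of the cap concrete, here is a minimal, hypothetical Go sketch (not part of this PR), assuming the per-item limiter's delay grows as base * 2^failures and is then clamped at the maximum. With the 50ms base the prober uses, lowering the cap from 30s to, say, 5s mainly shortens the worst-case gap between consecutive probes of the same Ingress once the backoff has saturated. The helper backoffDelays below is illustrative only, not the actual workqueue implementation:

package main

import (
	"fmt"
	"time"
)

// backoffDelays is a hypothetical helper that mirrors a per-item
// exponential failure backoff: the delay doubles on every failed
// probe, starting at base, and is clamped at maxDelay.
func backoffDelays(base, maxDelay time.Duration, attempts int) []time.Duration {
	delays := make([]time.Duration, 0, attempts)
	d := base
	for i := 0; i < attempts; i++ {
		if d >= maxDelay {
			delays = append(delays, maxDelay)
			continue
		}
		delays = append(delays, d)
		d *= 2
	}
	return delays
}

func main() {
	// Hardcoded 30s cap: the delay saturates after roughly ten failed probes.
	fmt.Println(backoffDelays(50*time.Millisecond, 30*time.Second, 12))
	// Hypothetical 5s cap (PROBE_MAX_RETRY_DELAY_SECONDS=5): the worst-case
	// wait between two probes of the same Ingress drops from 30s to 5s.
	fmt.Println(backoffDelays(50*time.Millisecond, 5*time.Second, 12))
}

In other words, once the backoff has saturated, the cap dominates how quickly an Ingress whose probes have been failing gets re-probed, and that is the latency this PR makes configurable.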

Contributor (Author):

What about other ingresses? Are they equally affected? 🤔

I have no experience with them.

And yes, adding it to the deployment YAML can be done as a follow-up PR.

if durationSeconds, err := strconv.Atoi(val); err == nil && durationSeconds > 0 {
probeMaxRetryDelay = time.Duration(durationSeconds) * time.Second
}
}
}

var dialContext = (&net.Dialer{Timeout: probeTimeout}).DialContext

// ingressState represents the probing state of an Ingress
@@ -144,7 +159,7 @@ func NewProber(
workQueue: workqueue.NewNamedRateLimitingQueue(
workqueue.NewMaxOfRateLimiter(
// Per item exponential backoff
-workqueue.NewItemExponentialFailureRateLimiter(50*time.Millisecond, 30*time.Second),
+workqueue.NewItemExponentialFailureRateLimiter(50*time.Millisecond, probeMaxRetryDelay),
// Global rate limiter
&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(50), 100)},
),
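For reference, here is a minimal standalone sketch mirroring the environment-variable handling added above; the helper resolveMaxRetryDelay and the main function are illustrative only (the PR does this once in init()). A value that is missing, non-numeric, or not a positive integer leaves the 30-second default in place.

package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// resolveMaxRetryDelay mirrors the logic of the init() function in the diff:
// PROBE_MAX_RETRY_DELAY_SECONDS overrides the 30s default only when it
// parses as a positive integer number of seconds.
func resolveMaxRetryDelay() time.Duration {
	delay := 30 * time.Second
	if val, ok := os.LookupEnv("PROBE_MAX_RETRY_DELAY_SECONDS"); ok {
		if seconds, err := strconv.Atoi(val); err == nil && seconds > 0 {
			delay = time.Duration(seconds) * time.Second
		}
	}
	return delay
}

func main() {
	os.Setenv("PROBE_MAX_RETRY_DELAY_SECONDS", "5")
	fmt.Println(resolveMaxRetryDelay()) // 5s

	os.Setenv("PROBE_MAX_RETRY_DELAY_SECONDS", "not-a-number")
	fmt.Println(resolveMaxRetryDelay()) // stays at 30s

	os.Unsetenv("PROBE_MAX_RETRY_DELAY_SECONDS")
	fmt.Println(resolveMaxRetryDelay()) // 30s
}

As discussed in the review thread, exposing this variable in the net-istio controller deployment manifest is left to a follow-up PR.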