
Ensuring the KIAM Endpoint is ready when a pod is scheduled #277

Open

nirnanaaa opened this issue Jul 31, 2019 · 3 comments

nirnanaaa (Contributor) commented Jul 31, 2019

Hey, so we're running KIAM 3 in our cluster. Lately we noticed that the AWS metadata endpoint is not immediately available after pod startup: it takes about 1-2s for the endpoint http://169.254.169.254/latest/meta-data/iam/security-credentials/<role_id> to be ready. For quick-starting pods that try to use this endpoint immediately after startup, this is an issue.

I thought this might be down to the pod cache taking some time (just a few ms) to get notified of the pod creation. What do you think?
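For illustration, here is a minimal sketch (not Kiam code, and the role name and retry budget are made-up placeholders) of how a quick-starting workload could tolerate that 1-2s window by retrying the credentials lookup with a short backoff:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// waitForCredentials polls the instance metadata credentials path until it
// answers 200, giving a freshly started container time for the metadata
// proxy to learn about the pod. Role name and retry budget are illustrative.
func waitForCredentials(role string, attempts int, delay time.Duration) ([]byte, error) {
	url := "http://169.254.169.254/latest/meta-data/iam/security-credentials/" + role
	client := &http.Client{Timeout: 2 * time.Second}

	var lastErr error
	for i := 0; i < attempts; i++ {
		resp, err := client.Get(url)
		if err == nil {
			body, readErr := io.ReadAll(resp.Body)
			resp.Body.Close()
			if readErr == nil && resp.StatusCode == http.StatusOK {
				return body, nil
			}
			lastErr = fmt.Errorf("status %d", resp.StatusCode)
		} else {
			lastErr = err
		}
		time.Sleep(delay)
	}
	return nil, fmt.Errorf("credentials not ready after %d attempts: %w", attempts, lastErr)
}

func main() {
	creds, err := waitForCredentials("my-app-role", 10, 500*time.Millisecond)
	if err != nil {
		panic(err)
	}
	fmt.Printf("credentials available: %d bytes\n", len(creds))
}
```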

/cc: @mseiwald @boynux

pingles (Contributor) commented Aug 12, 2019

The servers currently won't go healthy until the pod caches have been filled:

https://github.com/uswitch/kiam/blob/master/pkg/k8s/pod_cache.go#L170

Of course, once they're running it's possible that a pod would request credentials before the watcher has delivered the notification. However, Kiam deliberately prefetches credentials and tracks metadata as soon as Pods are pending and have an IP, which I'd hope is mostly before the container can execute and request credentials.
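For context, the "don't report healthy until the pod cache has synced" idea looks roughly like the client-go sketch below. This is only an illustration of the pattern, not Kiam's actual implementation, and the /healthz path and port are assumptions:

```go
package main

import (
	"net/http"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	// Watch pods and keep a local cache -- the same shape of machinery a
	// server needs to answer "which role does this pod IP map to?".
	factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
	podInformer := factory.Core().V1().Pods().Informer()

	stopCh := make(chan struct{})
	factory.Start(stopCh)

	// Only report healthy once the initial pod list has been delivered, so
	// nothing routes traffic to a process with an empty cache.
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		if podInformer.HasSynced() {
			w.WriteHeader(http.StatusOK)
			return
		}
		http.Error(w, "pod cache not synced", http.StatusServiceUnavailable)
	})
	http.ListenAndServe(":8080", nil)
}
```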

Maybe it's worth checking other metrics to ensure your API servers etc. aren't being overwhelmed or that Kiam servers aren't being throttled.

There are also metrics that track whether pods aren't found when requested, so it may be worth watching those to understand what's happening.

nirnanaaa (Contributor, Author) commented:

We've already added metrics and can't really see this issue anymore. It was just really obvious when you initialize a Node.js application and fetch something from AWS immediately after starting up.

We're not sure this is related to the server, since at the same time we also delayed pod scheduling, via taints, until the KIAM agent is ready on the node.
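A rough sketch of that taint/toleration idea, expressed with the Kubernetes API types (purely illustrative; the taint key below is a placeholder, not something Kiam itself defines): workload pods carry no matching toleration, so they stay Pending while the node still has the "agent not ready" taint, and only the agent DaemonSet tolerates it and starts first.

```go
package main

import (
	"encoding/json"
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Toleration that only the kiam agent DaemonSet would declare; ordinary
	// workloads omit it and therefore wait until the taint is removed.
	agentToleration := corev1.Toleration{
		Key:      "kiam.node/agent-not-ready", // placeholder taint key
		Operator: corev1.TolerationOpExists,
		Effect:   corev1.TaintEffectNoSchedule,
	}

	out, err := json.MarshalIndent(agentToleration, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```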

nirnanaaa (Contributor, Author) commented Oct 28, 2019

@pingles sorry for the late response; this got somewhat lost on our end. Since my last comment we've updated to v3.4, but we are still facing problems when rolling out both groups of server nodes and agent nodes simultaneously. Our servers run as Deployments, so the gRPC server's SIGTERM hook should be respected, right?
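For reference, the graceful-shutdown behaviour in question usually follows the standard Go pattern below (a generic sketch, not Kiam's code; the listen address is a placeholder and service registration is omitted): trap SIGTERM and call GracefulStop so in-flight credential requests can finish before the process exits.

```go
package main

import (
	"log"
	"net"
	"os"
	"os/signal"
	"syscall"

	"google.golang.org/grpc"
)

func main() {
	lis, err := net.Listen("tcp", ":443")
	if err != nil {
		log.Fatal(err)
	}
	srv := grpc.NewServer()
	// Service registration omitted; this sketch only shows signal handling.

	// On SIGTERM (sent by the kubelet when the Deployment rolls), stop
	// accepting new RPCs but let in-flight ones complete before exiting.
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGTERM, os.Interrupt)
	go func() {
		<-sigCh
		srv.GracefulStop()
	}()

	if err := srv.Serve(lis); err != nil {
		log.Fatal(err)
	}
}
```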

We've also found a significant increase in kiam_metadata_find_role_errors_total (going from zero to >50) during exactly the time of the rollout, which is also reflected in pods entering a crash-looping state.

Also, when not rolling out both groups at the same time, we can see loads of "context canceled" log messages inside the agents:

{"addr":"xxx.xxx.xxx.xxx:58608","level":"error","method":"GET","msg":"error processing request: rpc error: code = Canceled desc = context canceled","path":"/latest/meta-data/iam/security-credentials/","status":500,"time":"2019-10-28T07:23:34Z"}

{"addr":"xxx.xxx.xxx.xxx:58608","duration":1001,"headers":{"Content-Type":["text/plain; charset=utf-8"],"X-Content-Type-Options":["nosniff"]},"level":"info","method":"GET","msg":"processed request","path":"/latest/meta-data/iam/security-credentials/","status":500,"time":"2019-10-28T07:23:34Z"}

Do you have any suggestions for how to debug this further, or what measures could be taken to find the root cause of this problem?

This might also very well be related to #217
