
Ensuring the KIAM Endpoint is ready when a pod is scheduled #277

Open

nirnanaaa opened this issue Jul 31, 2019 · 3 comments

nirnanaaa (Contributor) commented Jul 31, 2019

Hey, so we're running KIAM 3 in our cluster. Lately we noticed that the AWS metadata endpoint is not immediately available after pod startup: it takes about 1-2s for the endpoint http://169.254.169.254/latest/meta-data/iam/security-credentials/<role_id> to be ready. For quick-starting pods that try to use this endpoint immediately after startup, this is an issue.

I thought this might be down to the pod cache taking some time (just a few ms) to get notified of the pod creation. What do you think?
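For illustration, here is a minimal sketch (not Kiam code, and the role name and retry budget are made-up placeholders) of how a quick-starting workload could tolerate that 1-2s window by retrying the credentials lookup with a short backoff:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// waitForCredentials polls the instance metadata credentials path until it
// answers 200, giving a freshly started container time for the metadata
// proxy to learn about the pod. Role name and retry budget are illustrative.
func waitForCredentials(role string, attempts int, delay time.Duration) ([]byte, error) {
	url := "http://169.254.169.254/latest/meta-data/iam/security-credentials/" + role
	client := &http.Client{Timeout: 2 * time.Second}

	var lastErr error
	for i := 0; i < attempts; i++ {
		resp, err := client.Get(url)
		if err == nil {
			body, readErr := io.ReadAll(resp.Body)
			resp.Body.Close()
			if readErr == nil && resp.StatusCode == http.StatusOK {
				return body, nil
			}
			lastErr = fmt.Errorf("status %d", resp.StatusCode)
		} else {
			lastErr = err
		}
		time.Sleep(delay)
	}
	return nil, fmt.Errorf("credentials not ready after %d attempts: %w", attempts, lastErr)
}

func main() {
	creds, err := waitForCredentials("my-app-role", 10, 500*time.Millisecond)
	if err != nil {
		panic(err)
	}
	fmt.Printf("credentials available: %d bytes\n", len(creds))
}
```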

/cc: @mseiwald @boynux

pingles (Contributor) commented Aug 12, 2019

The servers currently won't go healthy until the pod caches have been filled:

https://github.com/uswitch/kiam/blob/master/pkg/k8s/pod_cache.go#L170

Of course, once they're running it's possible that a pod would request credentials before the watcher has delivered the notification. However, Kiam deliberately prefetches credentials and tracks metadata as soon as Pods are pending and have an IP, which I'd hope is mostly before the container can execute and request credentials.
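For context, the "don't report healthy until the pod cache has synced" idea looks roughly like the client-go sketch below. This is only an illustration of the pattern, not Kiam's actual implementation, and the /healthz path and port are assumptions:

```go
package main

import (
	"net/http"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	// Watch pods and keep a local cache -- the same shape of machinery a
	// server needs to answer "which role does this pod IP map to?".
	factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
	podInformer := factory.Core().V1().Pods().Informer()

	stopCh := make(chan struct{})
	factory.Start(stopCh)

	// Only report healthy once the initial pod list has been delivered, so
	// nothing routes traffic to a process with an empty cache.
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		if podInformer.HasSynced() {
			w.WriteHeader(http.StatusOK)
			return
		}
		http.Error(w, "pod cache not synced", http.StatusServiceUnavailable)
	})
	http.ListenAndServe(":8080", nil)
}
```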

Maybe it's worth checking other metrics to ensure your API servers etc. aren't being overwhelmed or that Kiam servers aren't being throttled.

There are also metrics that track whether pods aren't found when requested, so it may be worth watching those to understand what's happening.

nirnanaaa (Contributor, Author) commented:

We've already added metrics and can't really see this issue anymore. It was just really obvious when you initialize a Node.js application and fetch something from AWS immediately after starting up.

We're not sure this is related to the server, since at the same time we also delayed pod scheduling, via taints, until the KIAM agent is ready on the node.
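A rough sketch of that taint/toleration idea, expressed with the Kubernetes API types (purely illustrative; the taint key below is a placeholder, not something Kiam itself defines): workload pods carry no matching toleration, so they stay Pending while the node still has the "agent not ready" taint, and only the agent DaemonSet tolerates it and starts first.

```go
package main

import (
	"encoding/json"
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Toleration that only the kiam agent DaemonSet would declare; ordinary
	// workloads omit it and therefore wait until the taint is removed.
	agentToleration := corev1.Toleration{
		Key:      "kiam.node/agent-not-ready", // placeholder taint key
		Operator: corev1.TolerationOpExists,
		Effect:   corev1.TaintEffectNoSchedule,
	}

	out, err := json.MarshalIndent(agentToleration, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```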

nirnanaaa (Contributor, Author) commented Oct 28, 2019

@pingles sorry for the late response; this got somewhat lost on our end. Since my last comment we've updated to v3.4, but we are still facing problems when rolling out both groups of server nodes and agent nodes simultaneously. Our servers run as Deployments, so the gRPC server's SIGTERM hook should be respected, right?
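For reference, the graceful-shutdown behaviour in question usually follows the standard Go pattern below (a generic sketch, not Kiam's code; the listen address is a placeholder and service registration is omitted): trap SIGTERM and call GracefulStop so in-flight credential requests can finish before the process exits.

```go
package main

import (
	"log"
	"net"
	"os"
	"os/signal"
	"syscall"

	"google.golang.org/grpc"
)

func main() {
	lis, err := net.Listen("tcp", ":443")
	if err != nil {
		log.Fatal(err)
	}
	srv := grpc.NewServer()
	// Service registration omitted; this sketch only shows signal handling.

	// On SIGTERM (sent by the kubelet when the Deployment rolls), stop
	// accepting new RPCs but let in-flight ones complete before exiting.
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGTERM, os.Interrupt)
	go func() {
		<-sigCh
		srv.GracefulStop()
	}()

	if err := srv.Serve(lis); err != nil {
		log.Fatal(err)
	}
}
```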

We've also found a significant increase in kiam_metadata_find_role_errors_total (going from zero to >50) during exactly the time of the rollout, which is also reflected in pods entering a crash-looping state.

Also, when not rolling out both groups at the same time, we can see loads of "context canceled" log messages inside the agents:

{"addr":"xxx.xxx.xxx.xxx:58608","level":"error","method":"GET","msg":"error processing request: rpc error: code = Canceled desc = context canceled","path":"/latest/meta-data/iam/security-credentials/","status":500,"time":"2019-10-28T07:23:34Z"}

{"addr":"xxx.xxx.xxx.xxx:58608","duration":1001,"headers":{"Content-Type":["text/plain; charset=utf-8"],"X-Content-Type-Options":["nosniff"]},"level":"info","method":"GET","msg":"processed request","path":"/latest/meta-data/iam/security-credentials/","status":500,"time":"2019-10-28T07:23:34Z"}

Do you have any suggestions for how to debug this further, or what measures could be taken to find the root cause of this problem?

This might also very well be related to #217
