Ensuring the KIAM Endpoint is ready when a pod is scheduled #277
The servers currently won't go healthy until the pod caches have been filled: https://github.com/uswitch/kiam/blob/master/pkg/k8s/pod_cache.go#L170

Of course, once they're running it's possible that a pod could request credentials before the watcher has delivered the notification, but Kiam deliberately prefetches credentials and tracks metadata as soon as Pods are pending and have an IP, which I'd hope is mostly before the container can execute and request credentials.

It may be worth checking other metrics to make sure your API servers etc. aren't being overwhelmed, or that the Kiam servers aren't being throttled. There are also metrics that track whether pods aren't found when requested, so it could be worth watching those to understand what's happening.
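To make the first point concrete, here is a minimal sketch of the general "don't report healthy until the pod cache has synced" pattern, using client-go informers. It only illustrates the approach; kiam's actual implementation is in the pod_cache.go file linked above.

```go
// Sketch: gate a health endpoint on the pod informer cache having completed
// its initial sync. Illustrative only; not kiam's actual code.
package main

import (
	"net/http"
	"sync/atomic"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	podInformer := factory.Core().V1().Pods().Informer()

	var ready atomic.Bool
	stopCh := make(chan struct{})
	factory.Start(stopCh)

	go func() {
		// Block until the pod cache has done its initial list/watch,
		// then start reporting healthy.
		cache.WaitForCacheSync(stopCh, podInformer.HasSynced)
		ready.Store(true)
	}()

	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		if ready.Load() {
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
	})
	http.ListenAndServe(":8080", nil)
}
```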
We've already added metrics and can't really see this issue anymore. It was just really obvious when a nodejs application initializes and fetches something from AWS immediately after starting up. We're not sure this is related to the server, since at the same time we also added taints to delay pod scheduling until the KIAM agent is ready on the node.
@pingles sorry for the late response. This got somewhat lost on our end. Since my last comment we've updated to v3.4, but we are still facing problems when rolling out both groups of server nodes and agent nodes simultaneously. Our servers are started as Deployments, so the gRPC server's SIGTERM hook should be respected, right? We've also found a significant increase in
Also, when not rolling out both groups at the same time, we can see loads
Do you have any suggestions on how to debug this further, or what steps we could take to find the root cause of this problem? This might also very well be related to #217.
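On the SIGTERM question, the usual pattern for a gRPC server is to trap the signal and call GracefulStop so that in-flight requests from agents can drain during a rollout. The sketch below shows that general shape only; it is not kiam's actual shutdown code.

```go
// Minimal sketch of SIGTERM-driven graceful shutdown for a gRPC server.
// Illustrates the general pattern being asked about, not kiam's implementation.
package main

import (
	"net"
	"os"
	"os/signal"
	"syscall"

	"google.golang.org/grpc"
)

func main() {
	lis, err := net.Listen("tcp", ":443")
	if err != nil {
		panic(err)
	}
	srv := grpc.NewServer()
	// Register services here...

	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGTERM, syscall.SIGINT)

	go func() {
		<-sigCh
		// Stop accepting new RPCs and wait for in-flight ones to finish,
		// so clients aren't cut off mid-request during a rollout.
		srv.GracefulStop()
	}()

	if err := srv.Serve(lis); err != nil {
		panic(err)
	}
}
```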
Hey, so we're running KIAM 3 in our cluster. Lately we noticed that the AWS metadata endpoint is not immediately available after pod startup (it takes about 1-2s for the endpoint
http://169.254.169.254/latest/meta-data/iam/security-credentials/<role_id>
to be ready). For quick-starting pods that try to use this endpoint immediately after startup, this is an issue. I thought this might be down to the pod cache taking some time (just a few ms) to get notified of the pod creation. What do you think?
/cc: @mseiwald @boynux
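As a stop-gap for quick-starting pods, one option is to retry the credentials path briefly at startup before the application proceeds. Below is a hedged sketch of that idea; the role name, attempt count, and delay are placeholders, not anything kiam-specific.

```go
// Sketch of a startup workaround: poll the instance-metadata credentials
// path for a short window to absorb the 1-2s gap described above.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func waitForCredentials(role string, attempts int, delay time.Duration) error {
	url := "http://169.254.169.254/latest/meta-data/iam/security-credentials/" + role
	client := &http.Client{Timeout: 2 * time.Second}
	for i := 0; i < attempts; i++ {
		resp, err := client.Get(url)
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil
			}
		}
		time.Sleep(delay)
	}
	return fmt.Errorf("credentials for role %q not available after %d attempts", role, attempts)
}

func main() {
	// "my-role" is a placeholder for the pod's annotated IAM role.
	if err := waitForCredentials("my-role", 10, 500*time.Millisecond); err != nil {
		panic(err)
	}
	fmt.Println("metadata endpoint ready; continuing startup")
}
```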