
kiam-agent race with dependent application pods on Node startup #395

jpugliesi opened this issue Apr 21, 2020 · 3 comments

@jpugliesi

jpugliesi commented Apr 21, 2020

kiam version: v3.6-rc1 (primarily to support IMDS v2 from #381)

We frequently autoscale node groups in our cluster. These autoscaled nodes are dedicated to running applications which depend upon kiam for AWS credentials.

We're seeing a race between the kiam-agent pod and application pods when a new node is launched. It appears the application pod makes an AWS credentials request before the kiam-agent pod is fully ready, which causes the application to throw a NoCredentialsError, e.g.:

botocore.exceptions.NoCredentialsError: Unable to locate credentials

We've reviewed the following similar issues, among others: #203, #358

We've made the following changes to mitigate issues on startup, but still haven't fully solved the problem (a combined manifest sketch follows the list):

  1. Add an initContainer to the kiam-agent pods to check that DNS can resolve the kiam-server service (per kiam-agent random errors on node startup #358)
  2. Set priorityClassName: system-cluster-critical on the kiam-agent daemonset template to prioritize scheduling the kiam-agent (per [Feature] Set priorityClassName on kiam-server and kiam-agent Pods #343)
  3. Explicitly allocate resources to the kiam-agent
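For reference, a minimal sketch of what those three changes might look like together on the kiam-agent DaemonSet pod spec. The init container image, the kiam-server service name, and the resource figures are illustrative assumptions, not values taken from our actual manifests:

```yaml
# Sketch of the three mitigations combined on the kiam-agent DaemonSet.
# Image tags, the service name, and resource figures are illustrative only.
spec:
  template:
    spec:
      priorityClassName: system-cluster-critical
      initContainers:
        - name: wait-for-kiam-server-dns
          image: busybox:1.31          # assumed utility image
          command:
            - sh
            - -c
            - until nslookup kiam-server; do echo "waiting for kiam-server DNS"; sleep 2; done
      containers:
        - name: kiam-agent
          image: quay.io/uswitch/kiam:v3.6-rc1   # assumed image/tag
          resources:                   # explicit allocation; numbers are placeholders
            requests:
              cpu: 50m
              memory: 64Mi
            limits:
              memory: 128Mi
```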

It seems the additional guidance on how to mitigate issues on node startup includes:

  1. Make your application pods sleep for some amount of time on startup to allow the kiam-agent to become ready (this feels like a pretty disgusting hack of a workaround) (per kiam agent init time and pod waiting for ready state #203)
  2. Somehow taint new nodes with something akin to kiam-not-ready, and then somehow remove that taint when kiam-agent is ready (see the toleration sketch after this list). This also feels fairly invasive because it would require amending other important daemonsets to tolerate this additional taint (per kiam agent init time and pod waiting for ready state #203).
    edit: I see the uswitch team created https://github.com/uswitch/nidhogg to address this - this seems worth documenting here in the kiam project
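For illustration, the taint approach looks roughly like the sketch below (this is roughly what nidhogg automates, using its own taint keys; the key kiam-not-ready here is just the placeholder name from the list above):

```yaml
# Hypothetical taint added to a freshly registered node and removed once the
# kiam-agent pod on that node is Ready:
#   kubectl taint nodes <node-name> kiam-not-ready=true:NoSchedule
#
# Every other daemonset that must run before kiam is ready then needs a
# matching toleration, which is what makes this feel invasive:
tolerations:
  - key: kiam-not-ready
    operator: Exists
    effect: NoSchedule
```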

edit: this issue of pod readiness priority on node startup is tracked in a KEP

Am I missing something here? Are there additional configurations/strategies to harden the kiam-agent deployment such that it is quickly and reliably available to all workloads on newly created cluster nodes? Appreciate any and all advice/help!

@jpugliesi jpugliesi changed the title kiam-agent boot race with dependent application pods on Node startup kiam-agent race with dependent application pods on Node startup Apr 21, 2020
@Joseph-Irving
Contributor

I don't think I have much to add to what you've put; those are all ways to help ensure kiam-agent is running before pods start up. Having critical daemonset pods that need to be running first is a generic problem in Kubernetes, as seen by that KEP trying to solve it (although sadly it seems to have stalled). Nidhogg has definitely helped us with it; we use it for both kiam and node-local-dns.
More broadly, you could consider how your applications handle AWS credential-related errors: if they don't have credentials, do they pass readiness and serve traffic? Can they recover from a lack of credentials? Do they have retries, etc.?
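One way to act on that last point, sketched here purely as an example: gate the application's own readiness on being able to fetch a role from the metadata path that kiam intercepts, so the pod only receives traffic once credentials are actually obtainable. The probe command and timings below are assumptions, not something kiam prescribes:

```yaml
# Hypothetical readinessProbe on an application pod: readiness fails until the
# kiam-proxied metadata endpoint returns a role, instead of serving traffic
# and throwing NoCredentialsError. Timings are placeholders.
readinessProbe:
  exec:
    command:
      - sh
      - -c
      - wget -q -O - http://169.254.169.254/latest/meta-data/iam/security-credentials/
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3
```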

As an aside:

Add an initContainer to the kiam-agent pods to check that DNS can resolve the kiam-server service (per #358)

I don't personally recommend doing this; to me it indicates the nodes are allowing pods to schedule before DNS is functioning, so I would fix that instead.

@eytan-avisror
Contributor

eytan-avisror commented Apr 22, 2020

I'm also hitting this problem.
It seems there needs to be an initial delay after the node joins the cluster to avoid these first few restarts. In my case this is happening because node networking is not ready yet; however, the kiam-agent pod is already scheduled and trying to create the gateway.
After two restarts it is able to do this and stays stable.

The couple of restarts I'm seeing have logs like:

{"level":"info","msg":"configuring iptables","time":"2020-04-22T22:08:24Z"}
{"level":"info","msg":"started prometheus metric listener 0.0.0.0:9620","time":"2020-04-22T22:08:24Z"}
{"level":"fatal","msg":"error creating server gateway: error dialing grpc server: context deadline exceeded","time":"2020-04-22T22:08:54Z"}

followed by some liveness probe failures.
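If the restarts are purely the gateway dial timing out while node networking settles, one blunt mitigation (a sketch only, not something kiam documents) is to give the agent's probes more headroom so the pod isn't killed during that first ~30s dial. Keep whatever probe handler the existing manifest defines; the path, port and numbers below are placeholders:

```yaml
# Sketch: relax the kiam-agent liveness probe so the agent has time to dial
# the kiam-server gateway on a freshly networked node. Handler and numbers
# are placeholders for whatever the existing manifest uses.
livenessProbe:
  httpGet:
    path: /ping        # placeholder handler
    port: 8181         # placeholder port
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 6
```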

@herikwebb

We're seeing this issue as well. We run a machine learning platform, and many of the workloads trigger significant scale-ups, causing new nodes and new kiam-agents to be spun up; oftentimes the applications attempt to fetch credentials before the agent is ready.
