
kiam-agent race with dependent application pods on Node startup #395

jpugliesi opened this issue Apr 21, 2020 · 3 comments

@jpugliesi

jpugliesi commented Apr 21, 2020

kiam version: v3.6-rc1 (primarily to support IMDS v2 from #381)

We frequently autoscale node groups in our cluster. These autoscaled nodes are dedicated to running applications which depend upon kiam for AWS credentials.

We're seeing a race between the kiam-agent pod and application pods when a new node is launched. It appears the application pod makes an AWS credentials request before the kiam-agent pod is fully ready, which causes the application to throw a NoCredentialsError, e.g.:

botocore.exceptions.NoCredentialsError: Unable to locate credentials

We've reviewed the following similar issues, among others: #203, #358

We've made the following changes to mitigate issues on startup, but still haven't fully solved the problem (a combined manifest sketch follows the list):

  1. Add an initContainer to the kiam-agent pods to check that DNS can resolve the kiam-server service (per kiam-agent random errors on node startup #358)
  2. Set priorityClassName: system-cluster-critical on the kiam-agent daemonset template to prioritize scheduling the kiam-agent (per [Feature] Set priorityClassName on kiam-server and kiam-agent Pods #343)
  3. Explicitly allocate resources to the kiam-agent
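For reference, a minimal sketch of what those three changes might look like together on the kiam-agent DaemonSet pod spec. The init container image, the kiam-server service name, and the resource figures are illustrative assumptions, not values taken from our actual manifests:

```yaml
# Sketch of the three mitigations combined on the kiam-agent DaemonSet.
# Image tags, the service name, and resource figures are illustrative only.
spec:
  template:
    spec:
      priorityClassName: system-cluster-critical
      initContainers:
        - name: wait-for-kiam-server-dns
          image: busybox:1.31          # assumed utility image
          command:
            - sh
            - -c
            - until nslookup kiam-server; do echo "waiting for kiam-server DNS"; sleep 2; done
      containers:
        - name: kiam-agent
          image: quay.io/uswitch/kiam:v3.6-rc1   # assumed image/tag
          resources:                   # explicit allocation; numbers are placeholders
            requests:
              cpu: 50m
              memory: 64Mi
            limits:
              memory: 128Mi
```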

It seems the additional guidance on how to mitigate issues on node startup includes:

  1. Make your application pods sleep for some amount of time on startup to allow the kiam-agent to become ready (this feels like a pretty disgusting hack of a workaround) (per kiam agent init time and pod waiting for ready state #203)
  2. Somehow taint new nodes with something akin to kiam-not-ready, and then somehow remove that taint when kiam-agent is ready (see the toleration sketch after this list). This also feels fairly invasive because it would require amending other important daemonsets to tolerate this additional taint (per kiam agent init time and pod waiting for ready state #203).
    edit: I see the uswitch team created https://github.com/uswitch/nidhogg to address this - this seems worth documenting here in the kiam project
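For illustration, the taint approach looks roughly like the sketch below (this is roughly what nidhogg automates, using its own taint keys; the key kiam-not-ready here is just the placeholder name from the list above):

```yaml
# Hypothetical taint added to a freshly registered node and removed once the
# kiam-agent pod on that node is Ready:
#   kubectl taint nodes <node-name> kiam-not-ready=true:NoSchedule
#
# Every other daemonset that must run before kiam is ready then needs a
# matching toleration, which is what makes this feel invasive:
tolerations:
  - key: kiam-not-ready
    operator: Exists
    effect: NoSchedule
```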

edit: this issue of pod readiness priority on node startup is tracked in a KEP

Am I missing something here? Are there additional configurations/strategies to harden the kiam-agent deployment such that it is quickly and reliably available to all workloads on newly created cluster nodes? Appreciate any and all advice/help!

@jpugliesi jpugliesi changed the title kiam-agent boot race with dependent application pods on Node startup kiam-agent race with dependent application pods on Node startup Apr 21, 2020
@Joseph-Irving
Contributor

I don't think I have much to add to what you've put; those are all ways to help ensure kiam-agent is running before pods start up. Having critical daemonset pods that need to be running first is a generic problem in Kubernetes, as seen by that KEP trying to solve it (although sadly it seems to have stalled). Nidhogg has definitely helped us with it; we use it for both kiam and node-local-dns.
More broadly, you could consider how your applications handle AWS credential-related errors: if they don't have credentials, do they pass readiness and serve traffic? Can they recover from a lack of credentials? Do they have retries, etc.?
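One way to act on that last point, sketched here purely as an example: gate the application's own readiness on being able to fetch a role from the metadata path that kiam intercepts, so the pod only receives traffic once credentials are actually obtainable. The probe command and timings below are assumptions, not something kiam prescribes:

```yaml
# Hypothetical readinessProbe on an application pod: readiness fails until the
# kiam-proxied metadata endpoint returns a role, instead of serving traffic
# and throwing NoCredentialsError. Timings are placeholders.
readinessProbe:
  exec:
    command:
      - sh
      - -c
      - wget -q -O - http://169.254.169.254/latest/meta-data/iam/security-credentials/
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3
```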

As an aside:

Add an initContainer to the kiam-agent pods to check that DNS can resolve the kiam-server service (per #358)

I don't personally recommend doing this; to me it indicates the nodes are allowing pods to schedule before DNS is functioning, so I would fix that instead.

@eytan-avisror
Contributor

eytan-avisror commented Apr 22, 2020

I'm also hitting this problem.
It seems there needs to be an initial delay after the node joins the cluster to avoid these first few restarts. In my case this is happening because node networking is not ready yet; however, the kiam-agent pod is already scheduled and trying to create the gateway.
After two restarts it is able to do this and stays stable.

The couple of restarts I'm seeing have logs like:

{"level":"info","msg":"configuring iptables","time":"2020-04-22T22:08:24Z"}
{"level":"info","msg":"started prometheus metric listener 0.0.0.0:9620","time":"2020-04-22T22:08:24Z"}
{"level":"fatal","msg":"error creating server gateway: error dialing grpc server: context deadline exceeded","time":"2020-04-22T22:08:54Z"}

followed by some liveness probe failures.
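If the restarts are purely the gateway dial timing out while node networking settles, one blunt mitigation (a sketch only, not something kiam documents) is to give the agent's probes more headroom so the pod isn't killed during that first ~30s dial. Keep whatever probe handler the existing manifest defines; the path, port and numbers below are placeholders:

```yaml
# Sketch: relax the kiam-agent liveness probe so the agent has time to dial
# the kiam-server gateway on a freshly networked node. Handler and numbers
# are placeholders for whatever the existing manifest uses.
livenessProbe:
  httpGet:
    path: /ping        # placeholder handler
    port: 8181         # placeholder port
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 6
```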

@herikwebb

We're seeing this issue as well. We run a machine learning platform, and many of the workloads trigger significant scale-ups, causing new nodes and new kiam-agents to be spun up; oftentimes the applications attempt to fetch credentials before the agent is ready.
