kiam-agent race with dependent application pods on Node startup #395
kiam version: v3.6-rc1 (primarily to support IMDS v2 from #381)

We frequently autoscale node groups in our cluster. These autoscaled nodes are dedicated to running applications which depend upon kiam for AWS credentials.

We're seeing a race between the `kiam-agent` pod and application pods when a new node is launched. It appears the application pod makes an AWS credentials request before the `kiam-agent` pod is fully ready, which causes the application to throw `AWSNoCredentialsError` errors.

We've reviewed the following similar issues, among others: #203, #358
We've made the following changes to mitigate issues on startup (combined in the sketch after this list), but still haven't fully solved the problem:

- Added an `initContainer` to the `kiam-agent` pods to check that DNS can resolve the `kiam-server` service (per "kiam-agent random errors on node startup" #358)
- Set `priorityClassName: system-cluster-critical` on the `kiam-agent` daemonset template to prioritize scheduling the `kiam-agent` (per "[Feature] Set priorityClassName on kiam-server and kiam-agent Pods" #343)
- Added `resources` to the `kiam-agent` daemonset
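For concreteness, here is a minimal, abridged sketch of what these three changes might look like together on the agent daemonset. The agent's command-line args, host networking, and iptables setup are omitted, and the namespace, service DNS name, image tag, and resource values are illustrative assumptions rather than values from this thread:

```yaml
# Hypothetical, abridged kiam-agent DaemonSet excerpt; adapt names to your cluster.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kiam-agent
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: kiam-agent
  template:
    metadata:
      labels:
        app: kiam-agent
    spec:
      # Per #343: schedule the agent ahead of ordinary workloads.
      priorityClassName: system-cluster-critical
      initContainers:
        # Per #358: block agent startup until the kiam-server service resolves.
        - name: wait-for-kiam-server-dns
          image: busybox:1.36
          command:
            - sh
            - -c
            - until nslookup kiam-server.kube-system.svc.cluster.local; do sleep 1; done
      containers:
        - name: kiam-agent
          image: quay.io/uswitch/kiam:v3.6-rc1
          # Explicit resources so the agent is not starved or evicted on a busy node.
          resources:
            requests:
              cpu: 50m
              memory: 64Mi
            limits:
              memory: 128Mi
```

Note that all three changes only make the agent start sooner and more reliably; none of them stops application pods from scheduling before it is ready, which is the race described above.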
It seems the additional guidance on how to mitigate issues on node startup includes (both approaches are sketched below):

- `sleep` for some amount of time on startup to allow the `kiam-agent` to become ready (this feels like a pretty disgusting hack of a workaround) (per "kiam agent init time and pod waiting for ready state" #203)
- Taint new nodes with something like `kiam-not-ready`, and then somehow remove that taint when `kiam-agent` is ready. This also feels fairly invasive because it would require amending other important daemonsets to tolerate this additional taint (per "kiam agent init time and pod waiting for ready state" #203).

edit: I see the uswitch team created https://github.com/uswitch/nidhogg to address this; this seems worth documenting here in the kiam project
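To make these two options concrete, here is a rough sketch. The taint key `kiam-not-ready` comes from the discussion above; the kubelet flag and toleration wiring are assumptions about one possible setup, and Nidhogg's actual taint keys and configuration differ (see its README):

```yaml
# The sleep hack from #203, applied to each *application* pod: delay the app
# and hope kiam-agent wins the race (hence "disgusting"):
#
#   initContainers:
#     - name: wait-for-kiam
#       image: busybox:1.36
#       command: ["sh", "-c", "sleep 30"]
#
# The taint approach instead keeps application pods off the node entirely:
# 1. Register new nodes with the taint, e.g. via a kubelet flag:
#      --register-with-taints=kiam-not-ready=true:NoSchedule
# 2. Remove the taint once kiam-agent on that node is Ready, by hand or via a
#    controller such as Nidhogg:
#      kubectl taint nodes <node> kiam-not-ready:NoSchedule-
# 3. The invasive part: every daemonset that must still run during startup
#    (CNI, kube-proxy, kiam-agent itself, ...) needs a toleration like this:
tolerations:
  - key: kiam-not-ready
    operator: Exists
    effect: NoSchedule
```

The trade-off: the taint fails closed (application pods cannot land until the taint is lifted), while the sleep fails open if the agent is slower than the timer, which is presumably why Nidhogg-style automation exists.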
edit: this issue of pod readiness priority on node startup is tracked in a KEP
Am I missing something here? Are there additional configurations/strategies to harden the `kiam-agent` deployment such that it is quickly and reliably available to all workloads on newly created cluster nodes? Appreciate any and all advice/help!

Comments

I don't think I have much to add to what you've put; those are all ways to help ensure kiam-agent is running before pods start up. Having critical daemonset pods that need to be running is a generic problem in Kubernetes, as seen by that KEP trying to solve it (although sadly it seems to have stalled). Nidhogg has definitely helped us with it; we use it for both kiam and node-local-dns. As an aside: I don't personally recommend the DNS-check `initContainer`; to me that indicates the nodes are allowing pods to schedule before DNS is functioning, so I would fix that instead.

I'm also hitting this problem. The couple of restarts I'm seeing have:

followed by some liveness probe failures.

We're seeing this issue as well. We run a machine learning platform, and many of the workloads trigger significant scale-ups, causing new nodes and new kiam-agents to be spun up; oftentimes the applications attempt to fetch credentials before the agent is ready.