IMDS query failed, exit code: 28 #4757
Hi @ravinayag, I also have a cluster currently emitting these errors. It's having trouble contacting its ACR via managed identity, though I'm unsure if that is related to this error. Did you manage to find any more info or side effects from this error since posting?
An error message like "IMDS query failed, exit code: 28\nConnection timed out after 24 seconds" only means that a Node Problem Detector (NPD) check could not complete because the IMDS query timed out. This may or may not indicate a real problem (it only says the query to the wireserver timed out); it does not mean there is an actual "PreemptScheduled" event. (Note: for unknown reasons I usually see this kind of event when using Spot VMs. It may never be resolved and I have more or less given up on it; I just ignore it most of the time, as it does not affect how kubelet works as long as the node is Ready.)
If you are using workload identity for authentication, your workflow does not involve IMDS at all. Technically, when IMDS is having issues it can affect how containerd pulls images, but as far as I know, once a credential is pulled successfully it stays valid for about 24 hours before it expires.
If you want to check whether the node currently has trouble contacting IMDS, try the following command on the node (or in a Pod with hostNetwork set to true):
(Note: remember to replace …) Result: …
Update: In my initial version I thought the credential was only valid for 60 minutes, but after checking, it is valid for 1 day. Not sure why it is 1 day instead of 60 minutes.
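The exact command in the comment above was lost in the page extraction. A minimal sketch of an equivalent probe, assuming the standard IMDS endpoint (the link-local address `169.254.169.254`, the required `Metadata:true` header, no proxy) plus a small helper that maps curl exit codes to what NPD reports:

```shell
#!/bin/sh
# Assumed IMDS probe (run on the node, or in a Pod with hostNetwork: true):
#
#   curl -s --noproxy '*' --connect-timeout 5 -H "Metadata:true" \
#     "http://169.254.169.254/metadata/instance?api-version=2021-02-01"
#
# curl's exit code 28 is CURLE_OPERATION_TIMEDOUT, which is exactly what NPD
# surfaces as "IMDS query failed, exit code: 28".
explain_curl_exit() {
  case "$1" in
    0)  echo "success" ;;
    7)  echo "failed to connect" ;;
    28) echo "operation timed out" ;;
    *)  echo "other curl error ($1)" ;;
  esac
}

explain_curl_exit 28   # the code NPD is reporting
```

The live curl call is left commented out because it only succeeds from inside an Azure VM; the helper shows how to read the exit code you get back.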
I agree with @JoeyC-Dev's comments; however, the error message gives a different impression, since the node status shows as Ready. @brk3: no side effects to date.
This metadata endpoint (https://learn.microsoft.com/en-us/azure/virtual-machines/windows/scheduled-events) isn't rock solid, and we often see this kind of query issue on clusters. Sometimes it correlates with node struggles, sometimes not. I would not use this query timeout alone to determine health; everything can be running fine on the node.
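A sketch of checking the Scheduled Events endpoint from the linked docs, with an assumed sample response embedded so the interpretation logic runs without cluster access:

```shell
#!/bin/sh
# On the node itself, the query (assumed per the Scheduled Events docs) is:
#
#   curl -s -H "Metadata:true" \
#     "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"
#
# A hypothetical sample response stands in here. An empty Events array means
# nothing like Freeze or Preempt is currently scheduled for the VM; a non-empty
# array would explain NPD events such as FreezeScheduled or PreemptScheduled.
sample='{"DocumentIncarnation":1,"Events":[]}'
case "$sample" in
  *'"Events":[]'*) echo "no scheduled events" ;;
  *)               echo "scheduled events pending" ;;
esac
```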
We have a similar issue with a recently launched AKS user node pool (but without spot instances). The problem always begins with multiple "IMDS query failed, exit code: 28" errors at 23:30. The workaround is to restart the node, after which everything runs for about 2 days.
Out of curiosity: did the node experience a VM freeze event sometime before the egress was blocked?
The node experienced the EgressBlocked event first, before the FreezeScheduled event.
@kaiwoe Given that your node was Ready at least once, it should have passed the initial network check (like the connection to …). (I realized that GPU nodes with an ephemeral disk do not have live migration, so "NoVMEventScheduled" is fine here; hence I rewrote the part below.)
(Somewhat off topic:) I don't use GPU nodes much for my work. Usually when I see this message on non-GPU nodes, it is either a simple timeout (a false alarm), or a real underlying failure that triggers immediate automatic migration. I have not actually encountered a node being down for 2 days and only becoming Ready again after a restart. If there is no "VMEventScheduled", the next step is to check whether the remediator is being triggered, because an unready node will be acted on repeatedly. If it is not, then either the remediator is broken, or your node returned to Ready after the event. Check whether your node has any taint like …. But anyway, if …
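A minimal sketch of the triage above, assuming standard kubectl output (the node name is a placeholder); a sample conditions listing is embedded so the decision logic runs as-is:

```shell
#!/bin/sh
# In a real cluster, the conditions and taints would come from:
#
#   kubectl get node "$NODE" \
#     -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
#   kubectl describe node "$NODE" | grep -i taint
#
# Assumed sample output for a healthy node:
conditions='NetworkUnavailable=False
MemoryPressure=False
DiskPressure=False
PIDPressure=False
Ready=True'

if printf '%s\n' "$conditions" | grep -q '^Ready=True$'; then
  echo "Ready: IMDS timeout events alone are likely benign"
else
  echo "NotReady: check taints and whether the remediator acted"
fi
```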
I have recently launched an AKS node pool with spot instances. The node gets launched, and I can see from `kubectl get nodes` that it is in Ready status. However, when I check with `kubectl describe node`, I see some errors in the events. I thought it might be a random error, so to check, I deleted the node and the autoscaler was able to launch a new node. But the errors still appeared in the events of the new node. Can I continue using this node for workloads? What is the impact of continuing with these errors? Any known issue or fix available?