
IMDS query failed, exit code: 28 #4757

Open
ravinayag opened this issue Jan 20, 2025 · 8 comments

Comments

@ravinayag

I recently launched an AKS node pool with spot instances. The node gets launched, and I can see it in Ready status from kubectl get nodes.
However, when I check with kubectl describe node, I see some errors in the events. I thought it might be a random error, so I deleted the node and the autoscaler launched a new one. But the errors still appeared in the new node's events.

Can I continue running workloads on this node? What is the impact of continuing with these errors?
Is there any known issue or fix available?

  Type     Reason                   Age                From                                                          Message
  ----     ------                   ----               ----                                                          -------
  Normal   Starting                 55s                kube-proxy
  Normal   NodeHasSufficientMemory  73s (x2 over 73s)  kubelet                                                       Node aks-apppoolspot-31828581-vmss000001 status is now: NodeHasSufficientMemory
  Warning  InvalidDiskCapacity      73s                kubelet                                                       invalid capacity 0 on image filesystem
  Normal   NodeHasNoDiskPressure    73s (x2 over 73s)  kubelet                                                       Node aks-apppoolspot-31828581-vmss000001 status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     73s (x2 over 73s)  kubelet                                                       Node aks-apppoolspot-31828581-vmss000001 status is now: NodeHasSufficientPID
  Normal   NodeAllocatableEnforced  73s                kubelet                                                       Updated Node Allocatable limit across pods
  Normal   Starting                 73s                kubelet                                                       Starting kubelet.
  Normal   NodeReady                72s                kubelet                                                       Node aks-apppoolspot-31828581-vmss000001 status is now: NodeReady
  Normal   RegisteredNode           68s                node-controller                                               Node aks-apppoolspot-31828581-vmss000001 event: Registered Node aks-apppoolspot-31828581-vmss000001 in Controller
  Warning  ContainerdStart          63s (x2 over 63s)  systemd-monitor                                               Starting containerd container runtime...
  Normal   NoVMEventScheduled       41s                custom-scheduledevents-consolidated-condition-plugin-monitor  Node condition VMEventScheduled is now: Unknown, reason: NoVMEventScheduled, message: "IMDS query failed, exit code: 28\nConnection timed out after 24 seconds."
  Warning  PreemptScheduled         41s                custom-scheduledevents-consolidated-preempt-plugin-monitor    IMDS query failed, exit code: 28
Connection timed out after 24 seconds.
  Normal  NoVMEventScheduled  41s  custom-scheduledevents-consolidated-condition-plugin-monitor  Node condition VMEventScheduled is now: False, reason: NoVMEventScheduled, message: "VM has no scheduled event"
@brk3

brk3 commented Feb 13, 2025

Hi @ravinayag, I also have a cluster currently emitting these errors. It's having trouble contacting its ACR via managed identity, though I'm unsure whether that is related to this error. Did you manage to find any more info or side effects from this error since posting?

@microsoft-github-policy-service microsoft-github-policy-service bot removed the stale Stale issue label Feb 13, 2025
@JoeyC-Dev

JoeyC-Dev commented Feb 20, 2025

An error message like "IMDS query failed, exit code: 28\nConnection timed out after 24 seconds" only implies that a Node Problem Detector (NPD) check could not complete because the IMDS query timed out. This may or may not indicate a real problem (it only says the query to the wireserver timed out), but it does not mean there is an actual "PreemptScheduled" event.

(Note: the exit code 28 here is curl's exit code.)
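For quick reference, the curl exit codes that show up in these NPD events can be decoded with a small helper. This is only an illustrative sketch (`curl_exit_meaning` is a made-up name); the code-to-meaning mapping follows the EXIT CODES section of `man curl`:

```shell
# Map curl exit codes commonly seen in these NPD events to their meaning.
# (See the EXIT CODES section of `man curl` for the full list.)
curl_exit_meaning() {
  case "$1" in
    6)  echo "could not resolve host" ;;
    7)  echo "failed to connect to host" ;;
    28) echo "operation timed out" ;;
    *)  echo "other error (see 'man curl')" ;;
  esac
}

curl_exit_meaning 28   # -> operation timed out
```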

For unknown reasons, I usually see this kind of event when using Spot VMs (this may never be resolved, and I have kind of given up on it), and I just ignore it most of the time, as it won't affect how kubelet works as long as the node is Ready. If you are using workload identity for authentication, your workflow does not even touch IMDS at all.

Technically, when IMDS is having issues, it may affect how containerd pulls images, but as far as I know, once it successfully pulls the credential, the credential remains valid for around 24 hours before it expires.
So unless this error message keeps popping up hour after hour, and you frequently need to pull new images, this should not pose an actual problem.

If you want to check if the node has problem on contacting IMDS now, try following command on the node (or Pod with hostNetwork set to true):

curl "http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https://management.azure.com/&client_id=<kubelet-client-id>" -H "Metadata: true"

(Note: remember to replace <kubelet-client-id>.)

Result:

{
  "access_token": "xxxxxxxx",
  "client_id": "00000000-0000-0000-0000-000000000000",
  "expires_in": "86399",
  "expires_on": "1740131856",
  "ext_expires_in": "86399",
  "not_before": "1740045156",
  "resource": "https://management.azure.com/",
  "token_type": "Bearer"
}
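The roughly 24-hour validity can be read straight off the expires_in field of that response. A minimal sketch against a canned sample (no live IMDS call, and the sed-based extraction is just one way to do it without jq):

```shell
# Parse expires_in out of a (sample) IMDS token response without jq.
token_json='{"access_token":"xxxxxxxx","expires_in":"86399","token_type":"Bearer"}'
expires_in=$(printf '%s' "$token_json" | sed -n 's/.*"expires_in":"\([0-9]*\)".*/\1/p')
echo "token valid for roughly $((expires_in / 3600)) hours"   # -> roughly 23 hours
```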

Update: in my initial version I thought the token was only valid for 60 minutes, but after checking, it is valid for 1 day. Not sure why it is 1 day instead of 60 minutes.
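To script the connectivity check above, here is a hedged sketch of a probe that distinguishes a timeout (curl exit 28, as in the events) from other failures. `probe_imds` is an illustrative name; the default URL is the standard IMDS instance endpoint:

```shell
# Probe IMDS and report how the query ended; 28 is curl's timeout code.
probe_imds() {
  url="${1:-http://169.254.169.254/metadata/instance?api-version=2021-02-01}"
  timeout="${2:-10}"
  curl -s -o /dev/null -H "Metadata: true" --max-time "$timeout" "$url"
  rc=$?
  case "$rc" in
    0)  echo "IMDS reachable" ;;
    28) echo "IMDS query failed, exit code: 28 (timed out after ${timeout}s)" ;;
    *)  echo "IMDS query failed, exit code: $rc" ;;
  esac
}
```

Run it on the node itself (or from a pod with hostNetwork: true), since IMDS is only reachable from inside the VM.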

@ravinayag-b3

I agree with @JoeyC-Dev's comments; however, the error message gives a different impression, since the node status shows as Ready and pods are migrated based on node status.
I also noticed the errors disappear after some time, probably due to retries as @JoeyC-Dev mentioned. So what is the retry period?

@brk3 no side effects to date.

@grzesuav

This metadata endpoint (https://learn.microsoft.com/en-us/azure/virtual-machines/windows/scheduled-events) isn't rock solid, and we often see this kind of query issue on our clusters. Sometimes it is related to node struggles, sometimes not. I would not use this query timeout alone to determine health, as everything can be running fine on the node.

@kaiwoe

kaiwoe commented Mar 19, 2025

We have a similar issue with a recently launched AKS user node pool (but without spot instances).

The problem always begins with multiple "IMDS query failed, exit code: 28" errors at 23:30
and then ends with EgressBlocked, so the pods on the node can no longer communicate
with other pods in the cluster or with external services.

The workaround is to restart the node; everything then runs for about 2 days
before the errors occur again.

Events:
  Type     Reason              Age                   From                                                Message
  ----     ------              ----                  ----                                                -------
  Warning  EgressBlocked       32m                   node-egress-monitor                                 Required endpoints are unreachable (curl: (28) Resolving timed out after 10000 milliseconds: https://acs-mirror.azureedge.net/acs-mirror/healthz ;curl: (28) Connection timed out after 10000 milliseconds: https://packages.aks.azure.com/acs-mirror/healthz ), aka.ms/AArpzy5 for more information.
  Warning  RebootScheduled     29m (x34 over 4h36m)  custom-scheduledevents-consolidated-plugin-monitor  Timeout when running plugin "/etc/node-problem-detector.d/plugin/check_reboot.sh": state - signal: killed. output - ""
  Normal   NodeNotSchedulable  18m                   kubelet                                             Node aks-np0-33124422-vmss000002 status is now: NodeNotSchedulable
  Warning  TerminateScheduled  15m (x30 over 4h35m)  custom-scheduledevents-consolidated-plugin-monitor  IMDS query failed, exit code: 28
Connection timed out after 54 seconds.
  Warning  RedeployScheduled   11m (x41 over 4h48m)     custom-scheduledevents-consolidated-plugin-monitor            Timeout when running plugin "/etc/node-problem-detector.d/plugin/check_redeploy.sh": state - signal: killed. output - ""
  Normal   NoVMEventScheduled  10m (x391 over 2d15h)    custom-scheduledevents-consolidated-condition-plugin-monitor  Node condition VMEventScheduled is now: False, reason: NoVMEventScheduled, message: "VM has no scheduled event"
  Normal   NoVMEventScheduled  5m15s (x398 over 2d15h)  custom-scheduledevents-consolidated-condition-plugin-monitor  Node condition VMEventScheduled is now: Unknown, reason: NoVMEventScheduled, message: "IMDS query failed, exit code: 28\nConnection timed out after 24 seconds."
  Warning  PreemptScheduled    4m41s (x431 over 2d15h)  custom-scheduledevents-consolidated-preempt-plugin-monitor    IMDS query failed, exit code: 28
Connection timed out after 24 seconds.
  Warning  EgressBlocked    2m33s                 node-egress-monitor                                 Required endpoints are unreachable (curl: (28) Failed to connect to packages.microsoft.com port 443 after 5208 ms: Connection timed out ;curl: (28) Resolving timed out after 10000 milliseconds: https://acs-mirror.azureedge.net/acs-mirror/healthz ;curl: (28) Resolving timed out after 10000 milliseconds: https://packages.aks.azure.com/acs-mirror/healthz ), aka.ms/AArpzy5 for more information.
  Warning  FreezeScheduled  64s (x32 over 4h31m)  custom-scheduledevents-consolidated-plugin-monitor  Timeout when running plugin "/etc/node-problem-detector.d/plugin/check_freeze.sh": state - signal: killed. output - ""
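The endpoints named in those EgressBlocked events can be re-checked by hand from the node. A sketch under the assumption that you want the same curl timeout behaviour the monitor reports (`check_egress` is an illustrative name; the example URLs mirror the ones in the events above):

```shell
# Try each required egress endpoint the way the node-egress-monitor does,
# printing curl's exit code on failure (28 = timeout, matching the events).
check_egress() {
  for url in "$@"; do
    if curl -fsS --max-time 10 -o /dev/null "$url" 2>/dev/null; then
      echo "OK   $url"
    else
      rc=$?
      echo "FAIL($rc) $url"
    fi
  done
}

# Example (run from the affected node):
# check_egress https://packages.aks.azure.com/acs-mirror/healthz https://packages.microsoft.com
```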

@grzesuav

Out of curiosity, did the node experience a vmfreeze event sometime before the egress was blocked?

@kaiwoe

kaiwoe commented Mar 19, 2025

Out of curiosity, did the node experience a vmfreeze event sometime before the egress was blocked?

The node experienced the EgressBlocked event before the FreezeScheduled event:

2025-03-19 10:12:31 aks-np0-33124422-vmss000002 PreemptScheduled IMDS query failed, exit code: 28 Connection timed out after 24 seconds.
2025-03-19 10:14:32 aks-np0-33124422-vmss000002 RedeployScheduled Timeout when running plugin "/etc/node-problem-detector.d/plugin/check_redeploy.sh": state - signal: killed. output - ""
2025-03-19 10:26:32 aks-np0-33124422-vmss000002 RebootScheduled Timeout when running plugin "/etc/node-problem-detector.d/plugin/check_reboot.sh": state - signal: killed. output - ""
2025-03-19 10:27:31 aks-np0-33124422-vmss000002 TerminateScheduled IMDS query failed, exit code: 28 Connection timed out after 54 seconds.
2025-03-19 10:29:53 aks-np0-33124422-vmss000002 EgressBlocked Required endpoints are unreachable (curl: (28) Resolving timed out after 10000 milliseconds: https://azwecxpaksprod-dns-3zw0vyxv.hcp.westeurope.azmk8s.io/healthz ), aka.ms/AArpzy5 for more information.
2025-03-19 10:31:32 aks-np0-33124422-vmss000002 FreezeScheduled Timeout when running plugin "/etc/node-problem-detector.d/plugin/check_freeze.sh": state - signal: killed. output - ""

@JoeyC-Dev

JoeyC-Dev commented Mar 19, 2025

@kaiwoe Given that your node was Ready at least once, it should have passed the initial network check (e.g. connecting to packages.aks.azure.com and downloading packages such as kubelet); otherwise you could not have posted these events here.
Considering that DNS timed out and the connection to the wireserver timed out, there might be something wrong with the underlying infrastructure.

(I realized that a GPU node with an ephemeral disk does not have live migration, so "NoVMEventScheduled" is fine here. Hence I rewrote the part below.)

However, if this were pointing to a VM failure, it would usually come with "VMEventScheduled" in the "reason" column, which I don't see in your message; not sure whether you didn't paste everything or it simply isn't there. (This line is no longer valid)

<out_of_topic>
Note that when "VMEventScheduled" appears in the "reason" (important!) column (related to: https://learn.microsoft.com/en-us/azure/virtual-machines/linux/scheduled-events#event-scheduling), make sure you also see other negative events [important!] in the event list; then you can guess that this might be related to a VM issue itself and should ask support for an RCA. (Usually when "VMEventScheduled" is there, any issue on your node is likely auto-resolved afterwards.)

I don't use GPU nodes much for my work. So usually when I see this message (on non-GPU nodes), it is either a simple timeout (false alarm) or a real underlying failure followed by immediate automatic migration. I have not encountered a situation where the node is down for 2 days and only becomes Ready after a restart.
</out_of_topic>

If there is no "VMEventScheduled", the next direction is to check whether the remediator is being triggered, because an unready node will be acted on repeatedly. If it is not, then either the remediator is broken, or your node returned to Ready after the event. Check whether your node constantly carries a taint like node.kubernetes.io/unreachable for 10 minutes.
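That taint check can be scripted. A self-contained sketch against a canned sample; in practice you would feed it the output of `kubectl get node <node> -o jsonpath='{.spec.taints}'` (the sample values below are made up):

```shell
# Report whether a node's taint listing contains the unreachable taint.
# Sample input stands in for: kubectl get node <node> -o jsonpath='{.spec.taints}'
taints='[{"effect":"NoExecute","key":"node.kubernetes.io/unreachable","timeAdded":"2025-03-19T10:30:00Z"}]'
if printf '%s' "$taints" | grep -q 'node.kubernetes.io/unreachable'; then
  echo "node is tainted unreachable"
else
  echo "no unreachable taint"
fi
```

If the taint keeps reappearing over a 10-minute window, that is the signal the comment above describes.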


But anyway, if node.kubernetes.io/unreachable or EgressBlocked appears repeatedly, you should open a support request, because this is not normal. Usually it is related to underlying maintenance.
