SignClaims : invalid memory address or nil pointer dereference #23758
Hi @the-nando! Sorry to hear that. I'll look into this and circle back ASAP.
Ok @the-nando, I've had a first pass through. First, a question:
Can you describe this a bit more? This crash is in the plan applier, so it happened on the leader. After the leader crashed, did the next leader also crash, and so on? Given what I'll describe below, it'd be helpful to know whether this was potentially an issue of persistent state.

Here's what I'm seeing in the code... although you're on Nomad Enterprise, all the code involved here is present in the CE version, so links below will point there.

- The panic is happening at
- The caller is in the plan applier
- The claims object comes from

Which pretty much leaves a nil
A non-nil default identity is created whenever we called
Prior to Nomad 1.7.0, there were no code paths where we could have a nil
In any case, the
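To make the failure mode concrete, here's a minimal Go sketch of the idea, with made-up, simplified types rather than Nomad's actual structs: canonicalization normally populates a default identity on the task, and a task that somehow reaches claims-building without one produces exactly this kind of nil pointer dereference.

```go
package main

import "fmt"

// Illustrative stand-in types only; these are not Nomad's real structs.
type WorkloadIdentity struct {
	Name string
}

type Task struct {
	Name     string
	Identity *WorkloadIdentity
}

// Canonicalize mimics the idea that a default identity is populated when
// a task is canonicalized (e.g. on job registration).
func (t *Task) Canonicalize() {
	if t.Identity == nil {
		t.Identity = &WorkloadIdentity{Name: "default"}
	}
}

// buildClaims stands in for the claims construction done before signing.
// Dereferencing a nil identity here is the kind of access that yields
// "invalid memory address or nil pointer dereference".
func buildClaims(t *Task) string {
	return fmt.Sprintf("sub=%s:%s", t.Name, t.Identity.Name)
}

func main() {
	good := &Task{Name: "web"}
	good.Canonicalize()
	fmt.Println(buildClaims(good)) // sub=web:default

	// A task that skipped canonicalization (the scenario under
	// discussion) panics here.
	bad := &Task{Name: "web"}
	_ = buildClaims(bad)
}
```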
Thanks for the detailed analysis @tgross 😄 The three servers crashed multiple times one after the other and entered a restart loop, driven by the init system we use to control the Nomad process:
They eventually all came back into service by themselves after 10-15 restarts, all of which were triggered by the same panic.
Ok, thanks @the-nando. That implies the task was invalid when written to Raft and then replicated to all the nodes, as opposed to something like state store corruption on a single node. That may help me track it down.
Sorry, one more question: were all servers involved currently on 1.8.2+ent, or were there servers on an older version as well? (Including any read-replicas with
All servers and clients are on 1.8.2+ent. In case it's helpful for your investigation: the upgrade path to get there was 1.6.9 -> 1.7.6 -> 1.8.1. We performed a downgrade from 1.7.6 to 1.6.9 in between, caused by a change in Nomad that stopped passing an explicit cgroup parent to Docker containers; getting that version to work on our systems required an upgrade of the init system we use.
Tasks have a default identity created during canonicalization. If this default identity is somehow missing, we'll hit panics when trying to create and sign the claims in the plan applier. Fall back to the default identity if it's missing from the task. This changeset will need a different implementation in 1.7.x+ent backports, as the constructor for identities was refactored significantly in #23708. The panic cannot occur in 1.6.x+ent. Fixes: #23758
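A rough sketch of the shape of that fallback, reusing the same simplified stand-in types as above (illustrative only, not the actual change in #23763):

```go
package main

import "fmt"

// Illustrative stand-in types; not Nomad's real structs.
type WorkloadIdentity struct{ Name string }

type Task struct {
	Name     string
	Identity *WorkloadIdentity
}

// identityForSigning shows the defensive pattern: fall back to a freshly
// built default identity instead of dereferencing a nil pointer.
func identityForSigning(t *Task) *WorkloadIdentity {
	if t.Identity != nil {
		return t.Identity
	}
	return &WorkloadIdentity{Name: "default"}
}

func main() {
	// A task that somehow lost its default identity no longer panics.
	t := &Task{Name: "web"}
	id := identityForSigning(t)
	fmt.Printf("sub=%s:%s\n", t.Name, id.Name) // sub=web:default
}
```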
Something I realized in terms of the canonicalization path is that when a scheduler submits a plan, the
That would explain both how we managed to have repeated crashes and how we managed to eventually recover without any kind of manual intervention. Especially if you don't run into this issue again in the near term, that points strongly to a state store corruption bug, because those are transient rather than persistent. Follow-ons:
@the-nando we've closed the immediate issue via #23763 and that'll ship in Nomad 1.8.3 (very soon now). I'll continue to follow up on the root cause of the state store corruption.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Issue
Nomad servers in one of our clusters started crashing out of the blue with:
We recovered it after multiple restarts of all three servers.
Reproduction steps
I'm not sure what triggered the issue; hopefully it's something someone has already seen.
For completeness, and possibly not relevant: this cluster is still using the old token-based Vault and Consul integrations, and there's an additional Vault cluster configured with the new integration that is currently not in use.