
SignClaims : invalid memory address or nil pointer dereference #23758

Closed
the-nando opened this issue Aug 7, 2024 · 9 comments · Fixed by #23763
Comments

@the-nando
Contributor

the-nando commented Aug 7, 2024

Nomad version

Nomad v1.8.2+ent
BuildDate 2024-07-16T09:08:42Z
Revision f675c14f9ea2fada1fd1e5f1f5e63ce5d07f8cd4

Issue

Nomad servers in one of our clusters started crashing out of the blue with:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x88 pc=0x1ff89b9]

goroutine 41361128 [running]:
github.com/hashicorp/nomad/nomad.(*Encrypter).SignClaims(0xc000ec54a0, 0x0)
	github.com/hashicorp/nomad/nomad/encrypter.go:206 +0x79
github.com/hashicorp/nomad/nomad.signAllocIdentities({0x3becc60, 0xc000ec54a0}, 0xc02f176008, {0xc02dc872b8, 0x1, 0xc000e86780?}, {0xdf7901?, 0xede4451aa?, 0x0?})
	github.com/hashicorp/nomad/nomad/plan_apply.go:424 +0x32d
github.com/hashicorp/nomad/nomad.(*planner).applyPlan(0xc000ef4340, 0xc02a6e8630, 0xc0336d4b00, 0xc00af3f5f0)
	github.com/hashicorp/nomad/nomad/plan_apply.go:281 +0x10ee
github.com/hashicorp/nomad/nomad.(*planner).planApply(0xc000ef4340)
	github.com/hashicorp/nomad/nomad/plan_apply.go:200 +0x7fa
created by github.com/hashicorp/nomad/nomad.(*Server).establishLeadership in goroutine 41361018
	github.com/hashicorp/nomad/nomad/leader.go:385 +0x19e

We've recovered it after multiple restarts of all three servers.

Reproduction steps

I'm not sure what triggered the issue; hopefully it's something someone has already seen.

For completeness, though possibly not relevant: this cluster is still using the old token-based Vault and Consul integrations, and there's an additional Vault cluster configured with the new integration but currently not in use.

@tgross
Member

tgross commented Aug 7, 2024

Hi @the-nando! Sorry to hear that. I'll look into this and circle back ASAP.

@tgross tgross self-assigned this Aug 7, 2024
@tgross tgross moved this from Needs Triage to Triaging in Nomad - Community Issues Triage Aug 7, 2024
@tgross tgross added this to the 1.8.3 milestone Aug 7, 2024
@tgross tgross moved this from Triaging to In Progress in Nomad - Community Issues Triage Aug 7, 2024
@tgross
Member

tgross commented Aug 7, 2024

Ok @the-nando, I've had a first pass through. First, a question:

We've recovered it after multiple restarts of all three servers.

Can you describe this a bit more? This crash is in the plan applier so it happened on the leader. After the leader crashed, did the next leader also crash, and so on? Given what I'll describe below, it'd be helpful to know if this was potentially an issue of persistent state.

Here's what I'm seeing in the code... although you're on Nomad Enterprise, all the code involved here is present in the CE version, so links below will point there.


The panic is happening at encrypter.go#L206. We know that the method receiver e is non-nil because it was referenced in the line above and several times before that, so it must be that the claims parameter is nil, which seems like a clear error. That lines up with what we see in the stack trace: github.com/hashicorp/nomad/nomad.(*Encrypter).SignClaims(0xc000ec54a0, 0x0), where that last 0x0 is the non-receiver parameter.
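To make the failure mode concrete, here's a minimal, self-contained Go sketch. The types here (Encrypter, IdentityClaims) are simplified, hypothetical stand-ins and not the real Nomad structs; the point is just that a non-nil receiver with a nil pointer argument produces exactly this kind of SIGSEGV.

```go
package main

import "fmt"

// Simplified, hypothetical stand-ins for the real Nomad types.
type IdentityClaims struct {
	Subject string
}

type Encrypter struct {
	keyID string
}

// SignClaims dereferences the claims argument without a nil check,
// mirroring the shape of the crash in the stack trace above.
func (e *Encrypter) SignClaims(claims *IdentityClaims) string {
	// e is non-nil here, but claims.Subject panics when claims is nil.
	return e.keyID + ":" + claims.Subject
}

func main() {
	e := &Encrypter{keyID: "k1"}
	fmt.Println(e.SignClaims(nil)) // panic: invalid memory address or nil pointer dereference
}
```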

The caller is in the plan applier plan_apply.go#L424, where we're creating the default workload identity that we sign for every workload.

The claims object comes from the NewIdentityClaims method call on the line above (ref structs.go#L11717-L11719). There are a few code paths where we could return nil there:

  1. If the task group for the allocation doesn't exist on the job. Impossible in this case because the caller just read this value at plan_apply.go#L416.
  2. If the *WorkloadIdentity pointer is nil.
  3. If the workload type is neither task nor service. For this caller, this can never happen because the default identity for any task has the workload type hard-coded to WorkloadTypeTask. (ref structs.go#L7893)
  4. If the workload type is WorkloadTypeTask, or it's WorkloadTypeService for a service with a task field, and the task doesn't exist in the task group. Impossible in this case because the caller just read this value at plan_apply.go#L417.

Which pretty much leaves a nil *WorkloadIdentity pointer in the Task.Identity field (ref structs.go#L7820-L7821).

A non-nil default identity is created whenever we call Task.Canonicalize, as of Nomad 1.7.0. So this should only happen if the job was submitted prior to Nomad 1.7.0 and then wasn't canonicalized afterwards. But we canonicalize a job when we upsert it in the state store via Raft (ref fsm.go#L607-L614), and when we restore from snapshot (ref fsm.go#L1587), which happens when you upgrade a server.
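As a rough sketch of that canonicalization behaviour, again with simplified, hypothetical types rather than the real Nomad structs: the idea is that Canonicalize fills in a default identity whenever none is set, so only a task that never went through it can carry a nil pointer.

```go
package main

import "fmt"

// Simplified, hypothetical stand-ins for the real Nomad structs.
type WorkloadIdentity struct{ Name string }

type Task struct {
	Name     string
	Identity *WorkloadIdentity
}

// Canonicalize fills in defaults, including a default workload identity,
// so any task that has gone through it carries a non-nil Identity pointer.
func (t *Task) Canonicalize() {
	if t.Identity == nil {
		t.Identity = &WorkloadIdentity{Name: "default"}
	}
}

func main() {
	t := &Task{Name: "web"} // no identity set, as for a pre-1.7.0 job
	t.Canonicalize()
	fmt.Println(t.Identity.Name) // "default"
}
```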

Prior to Nomad 1.7.0, there were no code paths where we could have a nil IdentityClaims (ref structs.go#L11168-L11195).

In any case, the NewIdentityClaims constructor should return errors in invalid cases, and/or the SignClaims method should return an error if it gets a nil claims object. Either would prevent the plan applier from accepting the allocation. If we're missing something during upgrades, that could be bad, because it'd prevent allocations from being upgraded, but at least it would prevent a panic. I'll start on a PR to do that, but I'll also keep investigating to try to find a code path where we miss calling Canonicalize.
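A minimal sketch of the defensive shape described above, using the same simplified, hypothetical types (the actual PR may differ): guard against a nil claims pointer and return an error instead of panicking.

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified, hypothetical stand-ins for the real Nomad types.
type IdentityClaims struct{ Subject string }

type Encrypter struct{ keyID string }

// Returning an error instead of dereferencing a nil claims pointer turns
// the panic into something the plan applier can reject cleanly.
func (e *Encrypter) SignClaims(claims *IdentityClaims) (string, error) {
	if claims == nil {
		return "", errors.New("cannot sign nil identity claims")
	}
	return e.keyID + ":" + claims.Subject, nil
}

func main() {
	e := &Encrypter{keyID: "k1"}
	if _, err := e.SignClaims(nil); err != nil {
		fmt.Println("rejected:", err) // no panic; the plan would be rejected instead
	}
}
```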

@the-nando
Contributor Author

the-nando commented Aug 7, 2024

Thanks for the detailed analysis @tgross 😄

The three servers crashed multiple times one after the other and entered a restart loop, driven by the init system we use to control the Nomad process:

first node:
    2024-08-06T17:17:07.197Z [ERROR] http: request failed: method=GET path="<omitted>" error="event not processed by enough 'filter' and 'sink' nodes" code=500
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x88 pc=0x1ff89b9]

second node:
    2024-08-06T17:17:36.356Z [ERROR] nomad.autopilot: Error when computing next state: error="context deadline exceeded"
    2024-08-06T17:17:36.356Z [INFO]  nomad: eval broker status modified: paused=false
    2024-08-06T17:17:36.356Z [INFO]  nomad: blocked evals status modified: paused=false
    2024-08-06T17:17:36.356Z [ERROR] nomad.raft: failed to heartbeat to: peer=<omitted>:4647 backoff time=80ms error="dial tcp <omitted>:4647: connect: connection refused"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x88 pc=0x1ff89b9]

third node:
    2024-08-06T17:17:47.629Z [INFO]  nomad: blocked evals status modified: paused=false
    2024-08-06T17:17:47.633Z [INFO]  nomad.reporting: reporting agent started
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x88 pc=0x1ff89b9]

They eventually all came back into service by themselves after 10-15 restarts, all triggered by the same panic.

@tgross
Member

tgross commented Aug 7, 2024

Ok, thanks @the-nando. That implies the task was invalid when written to Raft and then replicated to all the nodes, as opposed to something like state store corruption on a single node. That may help me track it down.

@tgross
Member

tgross commented Aug 7, 2024

Sorry, one more question: were all servers involved currently on 1.8.2+ent, or were there servers on an older version as well? (Including any read-replicas with num_schedulers != 0)

@the-nando
Contributor Author

the-nando commented Aug 7, 2024

All servers and clients are on 1.8.2+ent.

In case it's helpful for your investigation: the upgrade path to get there was 1.6.9 -> 1.7.6 -> 1.8.1. We performed a downgrade from 1.7.6 to 1.6.9 in between, prompted by changes in Nomad that stopped passing an explicit cgroup parent to Docker containers; getting that version to work on our systems required an upgrade of the init system we use.

tgross added a commit that referenced this issue Aug 7, 2024
Tasks have a default identity created during canonicalization. If this default
identity is somehow missing, we'll hit panics when trying to create and sign the
claims in the plan applier. Fall back to the default identity if it's missing
from the task.

This changeset will need a different implementation in 1.7.x+ent backports, as
the constructor for identities was refactored significantly in #23708. The panic
cannot occur in 1.6.x+ent.

Fixes: #23758
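
For illustration only, a fallback of the kind the commit message describes might look roughly like this, using simplified, hypothetical types and a hypothetical identityToSign helper rather than the actual code from #23763.

```go
package main

import "fmt"

// Simplified, hypothetical types; not the code from the referenced PR.
type WorkloadIdentity struct{ Name string }

type Task struct {
	Name     string
	Identity *WorkloadIdentity
}

// identityToSign returns the task's identity, falling back to a freshly
// built default when the pointer is unexpectedly nil.
func identityToSign(t *Task) *WorkloadIdentity {
	if t.Identity != nil {
		return t.Identity
	}
	return &WorkloadIdentity{Name: "default"}
}

func main() {
	t := &Task{Name: "web"}              // Identity left nil, as in this bug
	fmt.Println(identityToSign(t).Name)  // "default"
}
```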
@tgross
Member

tgross commented Aug 7, 2024

Something I realized about the canonicalization path is that when a scheduler submits a plan, the Job struct is attached to the plan we send from the scheduler node to the leader. The leader does not re-read the Job from the state store (presumably to ensure we're operating on the exact same version the scheduler did). So that gives us the following plausible series of events:

  • A scheduler on node A creates a plan, but the Job object is corrupt due to an as-yet-unknown bug (similar to scaling: fix state store corruption bug for job scaling events #23673).
  • The scheduler on node A submits the plan to the leader on node B. Node B crashes and restarts. Node C becomes the leader.
  • The scheduler on node A retries the plan to the leader on node C. Node C crashes and restarts. Node B becomes the leader.
  • (repeat until node A happens to become the leader)
  • The scheduler on node A retries the plan to itself and crashes.
  • At this point, node A recovers its state from disk and canonicalizes the Task again, so the cluster recovers.

That would explain both how we managed to have repeated crashes and how we managed to eventually recover without any manual intervention. Especially if you don't run into this issue again in the near term, that points strongly to a state store corruption bug, because those are transient rather than persistent.

Follow-ons:

  • identity: prevent missing task identity from panicking server #23763 should prevent the panic from happening in this code path. I'm going to expedite landing that so we can ship it in the upcoming Nomad 1.8.3 (with backport to Nomad Enterprise 1.7.x).
  • I'm going to keep hunting for the state store corruption bug, but...
  • State store corruption bugs are extremely difficult to hunt down, and they can lurk for years before getting triggered. I've worked on "hack week" projects in the past (e.g. checksumming state store implementation #16257) to try to fix this kind of thing more generally, but ultimately the problem is isomorphic to taint analysis, and that's not a 100% solvable problem. A longer-term solution is for everything we put in memdb to be immutable (e.g. flatbuffers), but that's also a very large project.
  • Not referring back to the state store for the Job feels to me like it makes upgrades a lot more risky than necessary, and that's a bigger design question to think about. Further discussion in plan submit should not send serialized job version #23766

@tgross
Member

tgross commented Aug 7, 2024

@the-nando we've closed the immediate issue via #23763 and that'll ship in Nomad 1.8.3 (very soon now). I'll continue to follow up on the root cause of the state store corruption.
