Instant Nomad Allocation Restart leads to Runner Memory Leak #602

mpass99 · 2024-05-30T10:40:27Z

Related to #591

In the case of runner 10-0331c7d8-03c1-11ef-b832-fa163e7afdf8, we see that the runner is started twice and destroyed directly after the second start.

time="2024-04-26T11:34:59.570546Z" level=debug msg="Runner started" package=runner runner_id=10-0331c7d8-03c1-11ef-b832-fa163e7afdf8 startupDuration=570.731307ms
time="2024-04-26T11:35:26.198631Z" level=debug msg="Runner started" package=runner runner_id=10-0331c7d8-03c1-11ef-b832-fa163e7afdf8 startupDuration=705.586839ms
time="2024-04-26T11:35:26.198780Z" level=debug msg="Destroying Runner" destroy_reason="the runner will be destroyed and replaced: the destruction should not cause external changes" package=runner runner_id=10-0331c7d8-03c1-11ef-b832-fa163e7afdf8
time="2024-04-26T11:35:26.198923Z" level=debug msg="Runner destroyed locally" destroy_reason="the runner will be destroyed and replaced: the destruction should not cause external changes" package=runner runner_id=10-0331c7d8-03c1-11ef-b832-fa163e7afdf8
time="2024-04-26T11:35:26.239789Z" level=debug msg="Runner stopped" package=runner runner_id=10-0331c7d8-03c1-11ef-b832-fa163e7afdf8

When inspecting the Nomad Allocation Events in InfluxDB, we see that the Nomad Allocation restart at 11:35:26.285 happens instantly: the new allocation is created within the same millisecond in which the previous one completes.

_measurement | _time | client_status | event_type | id | job_id | nomad_agent
poseidon_nomad_allocations | 2024-04-26T11:34:59.274Z | pending | creation | 84a734a1-5573-6116-5678-86060ce4c479 | 10-0331c7d8-03c1-11ef-b832-fa163e7afdf8 | nomad-agent-terraform-4
poseidon_nomad_allocations | 2024-04-26T11:35:00.275Z | running | creation | 84a734a1-5573-6116-5678-86060ce4c479 | 10-0331c7d8-03c1-11ef-b832-fa163e7afdf8 | nomad-agent-terraform-4
poseidon_nomad_allocations | 2024-04-26T11:35:25.279Z | running | creation | 84a734a1-5573-6116-5678-86060ce4c479 | 10-0331c7d8-03c1-11ef-b832-fa163e7afdf8 | nomad-agent-terraform-4
poseidon_nomad_allocations | 2024-04-26T11:35:26.285Z | running | creation | 28e08715-38a9-42b3-8f77-0a14ee68b482 | 10-0331c7d8-03c1-11ef-b832-fa163e7afdf8 | nomad-agent-terraform-1
poseidon_nomad_allocations | 2024-04-26T11:35:26.285Z | complete | deletion | 84a734a1-5573-6116-5678-86060ce4c479 | 10-0331c7d8-03c1-11ef-b832-fa163e7afdf8 | nomad-agent-terraform-4
poseidon_nomad_allocations | 2024-04-26T11:35:26.285Z | pending | creation | 28e08715-38a9-42b3-8f77-0a14ee68b482 | 10-0331c7d8-03c1-11ef-b832-fa163e7afdf8 | nomad-agent-terraform-1
poseidon_nomad_allocations | 2024-04-26T11:35:26.285Z | complete | creation | 84a734a1-5573-6116-5678-86060ce4c479 | 10-0331c7d8-03c1-11ef-b832-fa163e7afdf8 | nomad-agent-terraform-4

The Poseidon logs show that the creation event of the second allocation was handled first, and the stop of the previous allocation only afterwards.

This leads to (a) the first started runner never being destroyed, which causes a memory leak, and (b) one idle runner too few.
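The ordering problem can be illustrated with a small, self-contained Go sketch. It is a deliberately simplified model and not Poseidon's actual code: the `Event` and `Pool` types, the `orphaned` bookkeeping, and the abbreviated allocation IDs are assumptions made for illustration. The pool looks runners up by runner ID only, so the late stop of the first allocation hits whatever runner is currently registered under that ID.

```go
package main

import "fmt"

// Event is a hypothetical, stripped-down allocation event.
type Event struct {
	RunnerID string // equals the Nomad job ID
	AllocID  string // abbreviated allocation ID
	Status   string // "running" or "complete"
}

// Pool indexes idle runners by runner ID only (assumption for this sketch).
type Pool struct {
	idle     map[string]string // runner ID -> allocation ID backing the runner
	orphaned []string          // allocations whose runner was overwritten without cleanup
}

func NewPool() *Pool {
	return &Pool{idle: make(map[string]string)}
}

// Handle applies one event without comparing allocation IDs on stop.
func (p *Pool) Handle(e Event) {
	switch e.Status {
	case "running":
		if old, ok := p.idle[e.RunnerID]; ok && old != e.AllocID {
			// The previous runner is replaced but never cleaned up.
			p.orphaned = append(p.orphaned, old)
		}
		p.idle[e.RunnerID] = e.AllocID
	case "complete":
		// Removes whichever runner is registered, even a newer one.
		delete(p.idle, e.RunnerID)
	}
}

// Replay feeds the events in the order in which they were handled here.
func Replay() *Pool {
	const runnerID = "10-0331c7d8-03c1-11ef-b832-fa163e7afdf8"
	p := NewPool()
	for _, e := range []Event{
		{runnerID, "84a734a1", "running"},
		{runnerID, "28e08715", "running"},  // second allocation handled first
		{runnerID, "84a734a1", "complete"}, // stop of the first allocation arrives late
	} {
		p.Handle(e)
	}
	return p
}

func main() {
	p := Replay()
	fmt.Println("idle runners:", len(p.idle))
	fmt.Println("orphaned allocations:", p.orphaned)
}
```

In this model, the replay ends with no idle runner registered and one orphaned entry for the first allocation, which corresponds to effects (b) and (a) above.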

To fix this, we might tie the Runner more closely to the Allocation ID than to the Runner ID. This can be done either by (1) adding the allocation ID to the Nomad Runner Object or (2) extending the allocation ID handling in the Event Stream Handling.
Both have upsides and downsides. (1) might create the impression that multiple runners could have the same runner ID, which should not happen. (2) might increase the complexity of the already complex Event Stream Handling.
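As a rough illustration of option (1), the following Go sketch stores the allocation ID on the runner and ignores stop events that belong to a different allocation. All names (`Runner`, `Pool`, `OnRunning`, `OnStopped`) are hypothetical and do not reflect Poseidon's real types; it is a minimal sketch of the idea, not a proposed implementation.

```go
package main

import "fmt"

// Runner is a hypothetical runner object that remembers its allocation.
type Runner struct {
	ID      string
	AllocID string
}

// Pool still indexes runners by runner ID, so IDs stay unique.
type Pool struct {
	runners map[string]*Runner
}

func NewPool() *Pool {
	return &Pool{runners: make(map[string]*Runner)}
}

// OnRunning registers the runner for the given allocation. If a runner
// backed by a different allocation was registered before, it is returned
// so that the caller can destroy it explicitly.
func (p *Pool) OnRunning(runnerID, allocID string) *Runner {
	prev := p.runners[runnerID]
	p.runners[runnerID] = &Runner{ID: runnerID, AllocID: allocID}
	if prev != nil && prev.AllocID != allocID {
		return prev
	}
	return nil
}

// OnStopped removes the runner only if the stopped allocation is the one
// currently backing it. It reports whether a runner was removed.
func (p *Pool) OnStopped(runnerID, allocID string) bool {
	r, ok := p.runners[runnerID]
	if !ok || r.AllocID != allocID {
		return false // stale stop event of a previous allocation
	}
	delete(p.runners, runnerID)
	return true
}

func main() {
	p := NewPool()
	p.OnRunning("runner-10", "alloc-a")
	replaced := p.OnRunning("runner-10", "alloc-b")
	removed := p.OnStopped("runner-10", "alloc-a") // arrives late
	fmt.Println(replaced.AllocID, removed, len(p.runners))
}
```

With the event order from this issue, the late stop of the first allocation no longer removes the runner of the second one, and the replaced runner is handed back for cleanup instead of being lost.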


MrSerth · 2024-05-30T11:58:49Z

Thanks for identifying this issue and providing two suggestions on how to resolve it. What would be your recommendation? I see that both approaches have disadvantages, and I find it difficult to decide on one. My main concern with the first one is that it could lead to data inconsistencies (with the runner management), but I cannot say how difficult the changes to the Event Stream Handling would be.

mpass99 · 2024-08-20T20:56:58Z

A solution via the Event Stream Handling would be to track the node. Complete events would only be handled if they match the node that was used directly before.

However, another solution (at least for this specific situation) is to destroy the runner as soon as an allocation desires to stop. Previously, we removed the runner only once the allocation had stopped completely.
When checking the event history of this specific case, there was enough time between the first allocation announcing its stop and the second allocation fully running. Therefore, this change fixes the failure in this case.
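A minimal Go sketch of this idea follows. It is not the code of the actual change: the `Allocation` struct only borrows the desired/client status naming from Nomad allocations, and the `Handler` type with its `stopped` set is an assumption made so that the later complete event becomes a no-op.

```go
package main

import "fmt"

// Allocation is a hypothetical subset of a Nomad allocation.
type Allocation struct {
	ID            string
	JobID         string // equals the runner ID
	DesiredStatus string // "run" or "stop"
	ClientStatus  string // "pending", "running" or "complete"
}

// Handler keeps the idle runners and remembers handled stops.
type Handler struct {
	idle    map[string]bool // runner IDs that are currently available
	stopped map[string]bool // allocation IDs whose stop was already handled
}

func NewHandler() *Handler {
	return &Handler{idle: make(map[string]bool), stopped: make(map[string]bool)}
}

// Handle removes the runner as soon as the allocation desires to stop,
// instead of waiting until the client reports it as complete.
func (h *Handler) Handle(a Allocation) {
	stopping := a.DesiredStatus == "stop" || a.ClientStatus == "complete"
	switch {
	case stopping && !h.stopped[a.ID]:
		h.stopped[a.ID] = true
		delete(h.idle, a.JobID)
	case stopping:
		// Stop already handled: ignore the late complete event.
	case a.ClientStatus == "running":
		h.idle[a.JobID] = true
	}
}

func main() {
	h := NewHandler()
	for _, a := range []Allocation{
		{"alloc-a", "runner-10", "run", "running"},
		{"alloc-a", "runner-10", "stop", "running"},  // announces its stop early
		{"alloc-b", "runner-10", "run", "running"},   // replacement is running
		{"alloc-a", "runner-10", "stop", "complete"}, // late, must not remove alloc-b's runner
	} {
		h.Handle(a)
	}
	fmt.Println("idle runners:", len(h.idle))
}
```

The sketch relies on the same condition as stated above: the stop announcement of the first allocation has to be handled before the second allocation is reported as running.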

I decided to go with the second solution, to avoid introducing further complexity and to keep the code changes as small as possible. @MrSerth Do you agree with this solution?

Either way, we should observe the Nomad behavior closely in the near future.

MrSerth · 2024-08-20T23:09:36Z

Yes, I am fine with the solution you proposed, this makes sense -- thank you! Let's monitor Nomad, as you suggested (we need to do so anyway, I think, due to the vast number of changes).

MrSerth · 2024-09-04T11:37:16Z

We didn't notice any other occurrence. The memory footprint has improved, so we are closing this issue as completed.

mpass99 mentioned this issue May 30, 2024

Memory Threshold exceeded #591

Closed

mpass99 mentioned this issue Jun 6, 2024

Prewarming Pool Alert #587

Open

MrSerth added the bug label Jun 12, 2024

This was referenced Jun 13, 2024

Investigate leaking allocation storage data #615

Closed

Handle permanently dead Nomad jobs #612

Closed

mpass99 mentioned this issue Aug 7, 2024

Started Runner is already in use #597

Open

mpass99 mentioned this issue Aug 20, 2024

Fix/#602 instant restart #663

Merged

MrSerth closed this as completed Sep 4, 2024
