Instant Nomad Allocation Restart leads to Runner Memory Leak #602
Comments
Thanks for identifying this issue and providing two suggestions on how to resolve it. What would be your recommendation? I see that both approaches have disadvantages and find it difficult to decide on one. My main concern with the first one is that it could lead to data inconsistencies (with the runner management), but I cannot say how difficult the Event Stream Handling will be.
A solution via the Event Stream Handling would be to track the node: Complete events would only be handled if they match the node that was used directly before. However, another solution (at least for this specific situation) is to destroy the runner as soon as an allocation is desired to stop. Previously, we removed the runner only once the allocation had stopped completely. I decided to go with the second solution in order to introduce no further complexity and to keep the code changes to a minimum. @MrSerth Do you agree with this solution? Either way, we should observe the Nomad behavior closely in the near future.
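To illustrate the second solution, here is a minimal sketch, not Poseidon's actual code. The types `AllocUpdate` and `Pool` and the method `Handle` are hypothetical; only the status strings (`run`, `stop`, `running`, `complete`) correspond to values Nomad reports for an allocation's desired and client status. The sketch assumes the "desired stop" update of the old allocation is processed before the start of the new one, while the final "complete" update may arrive late.

```go
package main

import "fmt"

// Status strings as reported by Nomad; everything else here is hypothetical.
const (
	DesiredRun     = "run"
	DesiredStop    = "stop"
	ClientRunning  = "running"
	ClientComplete = "complete"
)

// AllocUpdate is a simplified, illustrative allocation event.
type AllocUpdate struct {
	AllocID       string
	RunnerID      string
	DesiredStatus string
	ClientStatus  string
}

// Pool tracks runners by runner ID only (no allocation ID involved).
type Pool struct {
	active    map[string]bool
	destroyed int
}

func NewPool() *Pool { return &Pool{active: make(map[string]bool)} }

// Handle destroys a runner as soon as its allocation is desired to stop.
// The later "complete" update is ignored, so it cannot remove a runner
// that was re-created under the same runner ID in the meantime.
func (p *Pool) Handle(u AllocUpdate) {
	switch {
	case u.ClientStatus == ClientComplete:
		// Nothing to do: the runner was already destroyed on "desired stop".
	case u.DesiredStatus == DesiredStop:
		if p.active[u.RunnerID] {
			delete(p.active, u.RunnerID)
			p.destroyed++
		}
	case u.DesiredStatus == DesiredRun && u.ClientStatus == ClientRunning:
		p.active[u.RunnerID] = true
	}
}

func main() {
	p := NewPool()
	p.Handle(AllocUpdate{"alloc-a", "runner-1", DesiredRun, ClientRunning})
	p.Handle(AllocUpdate{"alloc-a", "runner-1", DesiredStop, ClientRunning})
	p.Handle(AllocUpdate{"alloc-b", "runner-1", DesiredRun, ClientRunning})
	// The late "complete" update of the old allocation arrives last.
	p.Handle(AllocUpdate{"alloc-a", "runner-1", DesiredStop, ClientComplete})
	fmt.Println("runner-1 active:", p.active["runner-1"], "destroyed:", p.destroyed)
}
```

In this model the old runner is destroyed exactly once and the re-created runner survives the late "complete" update. If the new allocation's start were processed before the old allocation's "desired stop", the sketch would still misbehave, which is why monitoring remains advisable.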
Yes, I am fine with the solution you proposed; it makes sense. Thank you! Let's monitor Nomad, as you suggested (I think we need to do so anyway, given the vast number of changes).
We didn't notice any further occurrences. The memory footprint has improved, so we are closing this issue as completed.
Related to #591
In the case of runner `10-0331c7d8-03c1-11ef-b832-fa163e7afdf8`, we see that the runner is started twice and deleted directly after the second creation. When inspecting the InfluxDB Nomad Allocation Events, we see that the Nomad Allocation restart at `11:35:26.285` happens instantly. The Poseidon logs show that the second creation event was handled first, and only after that was the stop of the previous allocation handled.
This leads to (a) the first started runner never being destroyed, which causes a memory leak, and (b) one idle runner too few.
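The event ordering described above can be reproduced in a small model. This is a simplified sketch and not Poseidon's implementation: `Event`, `Registry`, and `Handle` are hypothetical names, and the registry is assumed to track runners by runner ID only.

```go
package main

import "fmt"

// EventKind is a hypothetical, simplified stand-in for Nomad allocation events.
type EventKind int

const (
	AllocStarted EventKind = iota
	AllocStopped
)

// Event carries the runner ID and the allocation ID (illustrative fields).
type Event struct {
	Kind     EventKind
	RunnerID string
	AllocID  string
}

// Registry tracks runners by runner ID only.
type Registry struct {
	runners map[string]string // runner ID -> allocation ID backing the runner
	leaked  []string          // allocations whose runner was overwritten, never destroyed
}

func NewRegistry() *Registry {
	return &Registry{runners: make(map[string]string)}
}

func (r *Registry) Handle(e Event) {
	switch e.Kind {
	case AllocStarted:
		if old, ok := r.runners[e.RunnerID]; ok {
			// The previous runner object is replaced without being destroyed.
			r.leaked = append(r.leaked, old)
		}
		r.runners[e.RunnerID] = e.AllocID
	case AllocStopped:
		// Removal by runner ID alone: this also hits a freshly re-created runner.
		delete(r.runners, e.RunnerID)
	}
}

func main() {
	r := NewRegistry()
	id := "10-0331c7d8-03c1-11ef-b832-fa163e7afdf8"
	r.Handle(Event{AllocStarted, id, "alloc-a"})
	// Instant restart: the second creation is handled before the first stop.
	r.Handle(Event{AllocStarted, id, "alloc-b"})
	r.Handle(Event{AllocStopped, id, "alloc-a"})
	fmt.Println("tracked runners:", len(r.runners), "leaked:", r.leaked)
}
```

In this model the runner of the first allocation ends up in `leaked` (effect a), and the registry no longer tracks any runner for that ID although the second allocation is running (effect b).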
In order to fix this, we might tie the runner more closely to the allocation ID than to the runner ID. This can be done either by (1) adding the allocation ID to the Nomad Runner Object or (2) extending the allocation ID handling in the Event Stream Handling.
Both have their upsides and downsides. (1) might create the impression that multiple runners could have the same runner ID, which should not happen. (2) might increase the complexity of the already complex Event Stream Handling.
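As a rough illustration of option (1), the following sketch stores the allocation ID on the runner and ignores stop events that refer to an older allocation. All names (`Runner`, `Manager`, `OnAllocStarted`, `OnAllocStopped`) are hypothetical and do not reflect Poseidon's actual types.

```go
package main

import "fmt"

// Runner is a hypothetical runner object that also records its allocation ID.
type Runner struct {
	ID        string
	AllocID   string
	Destroyed bool
}

// Manager keeps at most one runner per runner ID.
type Manager struct {
	runners map[string]*Runner
}

func NewManager() *Manager {
	return &Manager{runners: make(map[string]*Runner)}
}

// OnAllocStarted registers a runner. A runner that is superseded by a new
// allocation is destroyed explicitly instead of being silently overwritten.
func (m *Manager) OnAllocStarted(runnerID, allocID string) {
	if old, ok := m.runners[runnerID]; ok && old.AllocID != allocID {
		old.Destroyed = true
	}
	m.runners[runnerID] = &Runner{ID: runnerID, AllocID: allocID}
}

// OnAllocStopped removes the runner only if the stopped allocation is the one
// currently backing it. It reports whether a runner was removed.
func (m *Manager) OnAllocStopped(runnerID, allocID string) bool {
	r, ok := m.runners[runnerID]
	if !ok || r.AllocID != allocID {
		return false // stale event of a previous allocation
	}
	r.Destroyed = true
	delete(m.runners, runnerID)
	return true
}

func main() {
	m := NewManager()
	m.OnAllocStarted("runner-1", "alloc-a")
	m.OnAllocStarted("runner-1", "alloc-b")
	removed := m.OnAllocStopped("runner-1", "alloc-a")
	fmt.Println("stale stop removed a runner:", removed)
	fmt.Println("current allocation:", m.runners["runner-1"].AllocID)
}
```

The map is still keyed by runner ID, so there is never more than one runner per ID; the allocation ID only serves to tell a current event from a stale one.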