Task leak if alloc prerun hook fails after client restart #17102
Comments
Does this maybe have some parallels with #17079?
This comment was marked as off-topic.
@suikast42 it seems like you've mixed up two different issues. Can you move the drain discussion back over to the drain ticket and not this one?
@suikast42 There are some conceptual parallels, in that both cases result in stuff getting left behind unexpectedly. However, in #17079 your logs indicate that task "Killing" is starting, which is one of the things that currently does not happen under the specific circumstances that cause this issue. And one of the things that does happen appropriately in this case is running the task stop hooks, which include service deregistration (edit: and alloc postrun hooks for deregistering group-level services). So good job keeping watch, but these cases are definitely unrelated!
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If a prerun hook fails when restoring alloc state, as with a client agent restart, tasks don't get fully cleaned up and may leave orphan resources like a running container and network configuration (e.g. iptables rules).
This was pointed out in #13028 where specifically a CSI prerun hook fails, but it's an issue more generally with alloc runner prerun hooks.
I encountered it myself while investigating that issue, and as @ygersie put it,
Reproduction steps
I made a strange hook to be able to poison an alloc on disk, so that it succeeds on the first pass but fails after a client agent stop/start.
Expected Result
All of the failed task's resources are cleaned up.
Actual Result
The task is marked as failed and dead and gets replaced, but the old container remains running.
Also, if the task uses a static port, the replacement task will fail to start because the port is still held by the "failed" task.
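The static port symptom is ordinary socket behavior and can be shown without Nomad at all. The sketch below is only an illustration under that assumption: `bindTwice` is a hypothetical helper that opens a TCP listener (standing in for the leaked container) and then tries to listen on the same address again (standing in for the replacement task). The second attempt fails for as long as the first listener stays open.

```go
package main

import (
	"fmt"
	"net"
)

// bindTwice listens on an ephemeral local port, then tries to listen on the
// same address again while the first listener is still open. It returns the
// address used and the error from the second attempt.
func bindTwice() (string, error) {
	first, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err) // unexpected: no free port for the first listener
	}
	defer first.Close()

	addr := first.Addr().String()
	second, err := net.Listen("tcp", addr)
	if err == nil {
		second.Close()
	}
	return addr, err
}

func main() {
	addr, err := bindTwice()
	fmt.Printf("second listen on %s: %v\n", addr, err)
}
```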