Bob crash recovery #54

Open
lispyclouds opened this issue Sep 12, 2019 · 5 comments



lispyclouds commented Sep 12, 2019

  • How would Bob recover if one of the nodes goes down during a build?
  • What happens to the dangling build still in the running state?
  • If that's to be marked as failed, who should be doing that?
  • More questions?

lispyclouds commented Feb 1, 2025

This should work for apiserver nodes:

  • The state here is the retry logic with backoff
  • The node that has picked this up would only ack it once it has completed the backoff sleep
  • That way, if that node goes down, the message with the last known backoff is redelivered to another node, which can continue it (see the sketch after this list)
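
A minimal sketch of what such an apiserver consumer could look like with RabbitMQ via pika; the queue name, the `attempt` field carried in the message, and the backoff formula are illustrative assumptions, not Bob's actual implementation:

```python
import json
import time

import pika

RETRY_QUEUE = "bob.retries"  # assumed queue name

def on_retry(channel, method, properties, body):
    payload = json.loads(body)
    # The attempt count travels with the message, so whichever node picks
    # it up knows how long the backoff sleep should be.
    attempt = payload.get("attempt", 0)
    backoff = min(2 ** attempt, 300)  # capped exponential backoff, in seconds

    # Sleep *before* acking: if this node dies mid-sleep, the unacked
    # message is redelivered to another node with the same attempt count.
    time.sleep(backoff)

    try:
        handle(payload)  # hypothetical handler for the actual retried work
        channel.basic_ack(delivery_tag=method.delivery_tag)
    except Exception:
        # Failed again: re-publish with an incremented attempt count,
        # then ack the old delivery.
        payload["attempt"] = attempt + 1
        channel.basic_publish(exchange="", routing_key=RETRY_QUEUE,
                              body=json.dumps(payload))
        channel.basic_ack(delivery_tag=method.delivery_tag)

def handle(payload):
    ...  # placeholder for the real work being retried

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue=RETRY_QUEUE, durable=True)
channel.basic_qos(prefetch_count=1)
channel.basic_consume(queue=RETRY_QUEUE, on_message_callback=on_retry)
channel.start_consuming()
```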

Tradeoffs:

  • In the case of a long backoff, if the node goes down close to its completion, the next node picking it up needs to wait that long again, extending retry times

TODO: Hammock about runners.

lispyclouds commented:

Ideas for runners:

  • When the runner comes up, it has its own queue onto which messages are routed by the apiserver
  • This queue should not be auto-delete or exclusive, and should have a TTL of 30s (configurable? why?)
  • When the runner disconnects/goes down, the unacked message would be dead-lettered after the TTL
  • If the runner reconnects within the TTL, it should ack properly
  • If not, a new runner comes up and picks this up via the retry mechanism (see the sketch after this list)
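
A sketch of how such a runner queue might be declared with pika, using RabbitMQ's per-queue arguments for message TTL and dead-lettering; the queue naming scheme, DLX name, and routing key are assumptions for illustration:

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

runner_id = "runner-1"  # hypothetical identifier for this runner

# Not exclusive and not auto-delete, so the queue and any unacked message
# returned to it survive the runner's connection dropping.
channel.queue_declare(
    queue=f"bob.runner.{runner_id}",  # assumed naming scheme
    durable=True,
    exclusive=False,
    auto_delete=False,
    arguments={
        # Grace period: a message left behind by a dead runner expires
        # after 30s and is dead-lettered to the retry machinery.
        "x-message-ttl": 30_000,                     # 30s, possibly configurable
        "x-dead-letter-exchange": "bob.dlx",         # assumed DLX name
        "x-dead-letter-routing-key": "bob.retries",  # assumed routing key
    },
)
```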

Tradeoffs:

  • The build state of the faulty runner is lost
  • The new runner needs to redo the build, introducing issues from potential side effects of the last build
  • The old queue would be left empty and dangling, and as of now there is no way to auto-delete it, which adds the need to write a GC for it in Bob


lispyclouds commented Feb 1, 2025

It seems there is a way to do this without writing a GC:

  • set a message TTL
  • set a queue TTL
  • message TTL < queue TTL
  • when the runner disconnects before the ack, the message is marked as ready, meaning it's requeued and subject to the message TTL
  • if the node doesn't reconnect within the queue TTL, the message TTL takes effect first: on expiry the message is dead-lettered into the retry loop, and the queue is then deleted
  • the new node coming up should be delivered this message from the DLQ (see the sketch after this list)
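
A sketch of a per-runner queue declaration combining both TTLs, again with assumed names; `x-message-ttl` expires individual messages while `x-expires` deletes the queue itself after it has gone unused for that long:

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

channel.queue_declare(
    queue="bob.runner.runner-1",  # assumed per-runner queue name
    durable=True,
    arguments={
        # Message TTL < queue TTL, so a requeued message is dead-lettered
        # (and can be retried elsewhere) before the queue itself expires.
        "x-message-ttl": 30_000,                     # grace period for the runner to reconnect
        "x-expires": 60_000,                         # queue deleted after 60s of being unused
        "x-dead-letter-exchange": "bob.dlx",         # assumed DLX name
        "x-dead-letter-routing-key": "bob.retries",  # assumed routing key
    },
)
```

Note that `x-expires` only counts time while the queue is unused (no consumers, no redeclares), so a runner reconnecting within the window should reset the queue's expiry.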

Questions:

  • what should the gap between the message TTL and the queue TTL be?
  • the message must be dead-lettered before queue deletion, so what happens if the node reconnects between the message TTL and the queue TTL?
  • can the reconnecting node get the same message it had been running before?
  • is it just safer to have the GC of the hanging queues than setting an expires?

lispyclouds commented:

When an apiserver gets a retry message, it should check whether the status of that run ID is running. If it is, that implies the runner died and there's no need to retry; the apiserver can do something like add a log line saying the runner was lost and mark the run as failed. This is one solution (a sketch follows).
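
A rough sketch of that check on the apiserver side; the status values, DB accessors, and log API are hypothetical placeholders:

```python
# Hypothetical helpers; Bob's real storage/log APIs will differ.
def get_run_status(run_id): ...
def set_run_status(run_id, status): ...
def append_log(run_id, line): ...
def retry(run_id): ...

def handle_retry_message(run_id):
    """Sketch of the apiserver's check when a retry message arrives."""
    if get_run_status(run_id) == "running":
        # The run is still marked running but the message came back to us,
        # so the runner must have died: don't retry, record it and fail.
        append_log(run_id, "lost runner")
        set_run_status(run_id, "failed")
    else:
        retry(run_id)  # any other state goes through the normal retry path
```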


lispyclouds commented Mar 2, 2025

Better algo:

  • Only set a configurable message TTL, not a queue expiry
  • This denotes the grace period for the runner to try to reconnect after it has lost its connection/gone down
  • If the runner reconnects within that time, all is good
  • If not, the message is dead-lettered and the apiserver picks it up
  • If the apiserver finds the status set to running, it sets it to failed and there are no more retries
  • It will also delete the runner's queue (see the sketch after this list)
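
A sketch of the dead-letter consumer on the apiserver under this algorithm; it assumes the runner's queue name travels inside the message (which is exactly the first open question below), and the DB/log helpers are hypothetical:

```python
import json

import pika

# Hypothetical helpers; Bob's real storage/log APIs will differ.
def get_run_status(run_id): ...
def set_run_status(run_id, status): ...
def append_log(run_id, line): ...

def on_dead_letter(channel, method, properties, body):
    payload = json.loads(body)
    run_id = payload["run_id"]           # assumed message shape
    runner_queue = payload.get("queue")  # assumed: the runner's queue name rides along

    if get_run_status(run_id) == "running":
        # The runner never reconnected within the grace period.
        append_log(run_id, "lost runner")
        set_run_status(run_id, "failed")
        if runner_queue:
            channel.queue_delete(queue=runner_queue)
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="bob.dlq", durable=True)  # assumed DLQ name
channel.basic_consume(queue="bob.dlq", on_message_callback=on_dead_letter)
channel.start_consuming()
```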

Questions:

  • How does the apiserver get the name of the runner's queue to delete? What is the impact?
  • What happens when the runner reconnects after the grace period?
