Bob crash recovery #54

Open
lispyclouds opened this issue Sep 12, 2019 · 5 comments



lispyclouds commented Sep 12, 2019

  • How would Bob recover if one of the nodes goes down during a build?
  • What happens to the dangling build still in the running state?
  • If that's to be marked as failed, who should be doing that?
  • More questions?

lispyclouds commented Feb 1, 2025

This should work for apiserver nodes:

  • The state here is the retry logic with backoff
  • The node that has picked this up would only ack it once it has completed the backoff sleep
  • That way, if that node goes down, the message with the last known backoff is redelivered to another node, which can continue it (see the sketch after this list)
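
A minimal sketch of what such an apiserver consumer could look like with RabbitMQ via pika; the queue name, the `attempt` field carried in the message, and the backoff formula are illustrative assumptions, not Bob's actual implementation:

```python
import json
import time

import pika

RETRY_QUEUE = "bob.retries"  # assumed queue name

def on_retry(channel, method, properties, body):
    payload = json.loads(body)
    # The attempt count travels with the message, so whichever node picks
    # it up knows how long the backoff sleep should be.
    attempt = payload.get("attempt", 0)
    backoff = min(2 ** attempt, 300)  # capped exponential backoff, in seconds

    # Sleep *before* acking: if this node dies mid-sleep, the unacked
    # message is redelivered to another node with the same attempt count.
    time.sleep(backoff)

    try:
        handle(payload)  # hypothetical handler for the actual retried work
        channel.basic_ack(delivery_tag=method.delivery_tag)
    except Exception:
        # Failed again: re-publish with an incremented attempt count,
        # then ack the old delivery.
        payload["attempt"] = attempt + 1
        channel.basic_publish(exchange="", routing_key=RETRY_QUEUE,
                              body=json.dumps(payload))
        channel.basic_ack(delivery_tag=method.delivery_tag)

def handle(payload):
    ...  # placeholder for the real work being retried

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue=RETRY_QUEUE, durable=True)
channel.basic_qos(prefetch_count=1)
channel.basic_consume(queue=RETRY_QUEUE, on_message_callback=on_retry)
channel.start_consuming()
```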

Tradeoffs:

  • In the case of a long backoff, if the node goes down close to its completion, the next node picking it up needs to wait that long again, extending retry times

TODO: Hammock about runners.

lispyclouds commented:

Ideas for runners:

  • When the runner comes up, it has its own queue onto which messages are routed by the apiserver
  • This queue should not be auto-delete or exclusive, and should have a TTL of 30s (configurable? why?)
  • When the runner disconnects/goes down, the unacked message would be dead-lettered after the TTL
  • If the runner reconnects within the TTL, it should ack properly
  • If not, a new runner comes up and picks this up via the retry mechanism (see the sketch after this list)
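
A sketch of how such a runner queue might be declared with pika, using RabbitMQ's per-queue arguments for message TTL and dead-lettering; the queue naming scheme, DLX name, and routing key are assumptions for illustration:

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

runner_id = "runner-1"  # hypothetical identifier for this runner

# Not exclusive and not auto-delete, so the queue and any unacked message
# returned to it survive the runner's connection dropping.
channel.queue_declare(
    queue=f"bob.runner.{runner_id}",  # assumed naming scheme
    durable=True,
    exclusive=False,
    auto_delete=False,
    arguments={
        # Grace period: a message left behind by a dead runner expires
        # after 30s and is dead-lettered to the retry machinery.
        "x-message-ttl": 30_000,                     # 30s, possibly configurable
        "x-dead-letter-exchange": "bob.dlx",         # assumed DLX name
        "x-dead-letter-routing-key": "bob.retries",  # assumed routing key
    },
)
```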

Tradeoffs:

  • The build state of the faulty runner is lost
  • The new runner needs to redo the build, introducing issues from potential side effects of the last build
  • The old queue would be left empty and dangling, and as of now there is no way to auto-delete it, which adds the need to write a GC for it in Bob


lispyclouds commented Feb 1, 2025

It seems there is a way to do this without writing a GC:

  • set a message TTL
  • set a queue TTL
  • message TTL < queue TTL
  • when the runner disconnects before the ack, the message is marked as ready, meaning it's requeued and subject to the message TTL
  • if the node doesn't reconnect within the queue TTL, the message TTL takes effect first: on expiry the message is dead-lettered into the retry loop, and the queue is then deleted
  • the new node coming up should be delivered this message from the DLQ (see the sketch after this list)
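
A sketch of a per-runner queue declaration combining both TTLs, again with assumed names; `x-message-ttl` expires individual messages while `x-expires` deletes the queue itself after it has gone unused for that long:

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

channel.queue_declare(
    queue="bob.runner.runner-1",  # assumed per-runner queue name
    durable=True,
    arguments={
        # Message TTL < queue TTL, so a requeued message is dead-lettered
        # (and can be retried elsewhere) before the queue itself expires.
        "x-message-ttl": 30_000,                     # grace period for the runner to reconnect
        "x-expires": 60_000,                         # queue deleted after 60s of being unused
        "x-dead-letter-exchange": "bob.dlx",         # assumed DLX name
        "x-dead-letter-routing-key": "bob.retries",  # assumed routing key
    },
)
```

Note that `x-expires` only counts time while the queue is unused (no consumers, no redeclares), so a runner reconnecting within the window should reset the queue's expiry.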

Questions:

  • what should the gap between the message TTL and the queue TTL be?
  • the message must be dead-lettered before queue deletion, so what happens if the node reconnects between the message TTL and the queue TTL?
  • can the reconnecting node get the same message it had been running before?
  • is it just safer to have the GC of the hanging queues than setting an expires?

lispyclouds commented:

When an apiserver gets a retry message, it should check whether the status of that run ID is running. If it is, that implies the runner died and there's no need to retry; the apiserver can do something like add a log line saying the runner was lost and mark the run as failed. This is one solution (a sketch follows).
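
A rough sketch of that check on the apiserver side; the status values, DB accessors, and log API are hypothetical placeholders:

```python
# Hypothetical helpers; Bob's real storage/log APIs will differ.
def get_run_status(run_id): ...
def set_run_status(run_id, status): ...
def append_log(run_id, line): ...
def retry(run_id): ...

def handle_retry_message(run_id):
    """Sketch of the apiserver's check when a retry message arrives."""
    if get_run_status(run_id) == "running":
        # The run is still marked running but the message came back to us,
        # so the runner must have died: don't retry, record it and fail.
        append_log(run_id, "lost runner")
        set_run_status(run_id, "failed")
    else:
        retry(run_id)  # any other state goes through the normal retry path
```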


lispyclouds commented Mar 2, 2025

Better algo:

  • Only set a configurable message TTL, not a queue expiry
  • This denotes the grace period for the runner to try to reconnect after it has lost its connection/gone down
  • If the runner reconnects within that time, all is good
  • If not, the message is dead-lettered and the apiserver picks it up
  • If the apiserver finds the status set to running, it sets it to failed and there are no more retries
  • It will also delete the runner's queue (see the sketch after this list)
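
A sketch of the dead-letter consumer on the apiserver under this algorithm; it assumes the runner's queue name travels inside the message (which is exactly the first open question below), and the DB/log helpers are hypothetical:

```python
import json

import pika

# Hypothetical helpers; Bob's real storage/log APIs will differ.
def get_run_status(run_id): ...
def set_run_status(run_id, status): ...
def append_log(run_id, line): ...

def on_dead_letter(channel, method, properties, body):
    payload = json.loads(body)
    run_id = payload["run_id"]           # assumed message shape
    runner_queue = payload.get("queue")  # assumed: the runner's queue name rides along

    if get_run_status(run_id) == "running":
        # The runner never reconnected within the grace period.
        append_log(run_id, "lost runner")
        set_run_status(run_id, "failed")
        if runner_queue:
            channel.queue_delete(queue=runner_queue)
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="bob.dlq", durable=True)  # assumed DLQ name
channel.basic_consume(queue="bob.dlq", on_message_callback=on_dead_letter)
channel.start_consuming()
```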

Questions:

  • How does the apiserver get the name of the runner's queue to delete? What is the impact?
  • What happens when the runner reconnects after the grace period?
