Remove the timeout system #73
A watchdog mechanism is actually the only practical way to detect deadlocks or other kinds of malfunctions. I doubt there is a better way. If timeouts trigger when they should not, then they are likely too strict. They can be quite generous (and I think they are already), but if the node is stuck, you really want to have it restarted sooner rather than later.
I disagree here. (Most types of) deadlocks can be detected statically: you construct a bipartite graph with threads and resources (be it a Mutex or something else) and check for cycles (this is what dotgraph does, btw).
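The static check described above can be sketched as a plain cycle search over a wait-for graph. This is an illustrative sketch only; the thread and resource names are invented and the graph shape is not taken from dotgraph itself:

```rust
use std::collections::{HashMap, HashSet};

// A wait-for graph: an edge A -> B means "A waits for B".
// Threads wait on resources, resources are held by threads; a cycle
// through both kinds of nodes indicates a potential deadlock.
fn has_cycle(graph: &HashMap<&str, Vec<&str>>) -> bool {
    fn visit<'a>(
        node: &'a str,
        graph: &HashMap<&'a str, Vec<&'a str>>,
        visiting: &mut HashSet<&'a str>,
        done: &mut HashSet<&'a str>,
    ) -> bool {
        if done.contains(node) {
            return false;
        }
        if !visiting.insert(node) {
            return true; // back edge: `node` is already on the current path
        }
        for &next in graph.get(node).into_iter().flatten() {
            if visit(next, graph, visiting, done) {
                return true;
            }
        }
        visiting.remove(node);
        done.insert(node);
        false
    }

    let mut visiting = HashSet::new();
    let mut done = HashSet::new();
    graph.keys().any(|&n| visit(n, graph, &mut visiting, &mut done))
}

fn main() {
    // thread-a holds mutex-1 and waits for mutex-2; thread-b the reverse.
    let deadlocked: HashMap<&str, Vec<&str>> = HashMap::from([
        ("thread-a", vec!["mutex-2"]),
        ("mutex-2", vec!["thread-b"]),
        ("thread-b", vec!["mutex-1"]),
        ("mutex-1", vec!["thread-a"]),
    ]);
    assert!(has_cycle(&deadlocked));

    // No cycle: a simple chain of waits resolves on its own.
    let fine: HashMap<&str, Vec<&str>> = HashMap::from([
        ("thread-a", vec!["mutex-1"]),
        ("mutex-1", vec![]),
    ]);
    assert!(!has_cycle(&fine));
}
```

As the next comment notes, not every cycle found this way is an actual deadlock, which is the false-positive problem.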
Not every cycle is a deadlock, and not every malfunction is an actual deadlock. Having a fully generic solution would amount to solving the halting problem. Enhancing the architecture to avoid the need for cycles is certainly a worthwhile exploratory target, but it doesn't fully solve the issue.
The interesting part is that we have most of the pieces in place to implement the watchdog: all subsystems receive signals and need to process them, so we could most likely detect quite easily whether a subsystem stops processing signals and assert on that.
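A minimal sketch of such a signal-processing watchdog, assuming the subsystem bumps a shared counter for every signal it handles. The counter, the check interval, and the `watchdog` helper are all invented for illustration, not part of the actual orchestra code:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

// Hypothetical watchdog: returns once `processed` reaches `expected`,
// panicking if the counter stops advancing between checks.
fn watchdog(processed: &AtomicU64, expected: u64) {
    let mut last = 0;
    loop {
        thread::sleep(Duration::from_millis(100));
        let now = processed.load(Ordering::Relaxed);
        if now >= expected {
            return; // all signals processed, subsystem is healthy
        }
        assert!(now > last, "watchdog: subsystem stopped processing signals");
        last = now;
    }
}

fn main() {
    let processed = Arc::new(AtomicU64::new(0));

    // A healthy subsystem bumps the counter for every signal it handles.
    let counter = Arc::clone(&processed);
    let subsystem = thread::spawn(move || {
        for _signal in 0..5 {
            // ... handle the signal here ...
            counter.fetch_add(1, Ordering::Relaxed);
            thread::sleep(Duration::from_millis(1));
        }
    });

    watchdog(&processed, 5); // returns instead of panicking
    subsystem.join().unwrap();
    assert_eq!(processed.load(Ordering::Relaxed), 5);
}
```

In a real node the panic would instead trigger the graceful shutdown / restart path.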
Isn't this exactly what we already have? We send signals and queue them up to a certain point. Eventually the queue fills, we block and time out*). If the subsystem responds to signals in a reasonable time frame, we never block and therefore also never time out. And now with the separate

*) This is at least my understanding of how this works; it has been a while since I last read the code. If my understanding is off, then this is a different discussion.
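The queue-fills-then-timeout behaviour described here can be sketched with a bounded std channel. `send_with_timeout` is a hypothetical helper for illustration, not the actual orchestra API:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

// Hypothetical helper mirroring the described behaviour: try to enqueue a
// signal into a bounded channel; while the queue has room this returns
// immediately, and only a full queue can run into the timeout.
fn send_with_timeout<T>(
    tx: &mpsc::SyncSender<T>,
    mut msg: T,
    timeout: Duration,
) -> Result<(), &'static str> {
    let deadline = Instant::now() + timeout;
    loop {
        match tx.try_send(msg) {
            Ok(()) => return Ok(()),
            Err(mpsc::TrySendError::Full(returned)) => {
                if Instant::now() >= deadline {
                    // This is where the watchdog would restart the node.
                    return Err("timeout: subsystem is not draining its signal queue");
                }
                msg = returned;
                thread::sleep(Duration::from_millis(1));
            }
            Err(mpsc::TrySendError::Disconnected(_)) => return Err("subsystem is gone"),
        }
    }
}

fn main() {
    // Responsive subsystem: the queue never stays full, no send times out.
    let (tx, rx) = mpsc::sync_channel(2);
    let subsystem = thread::spawn(move || {
        for _ in 0..10 {
            let _ = rx.recv();
        }
    });
    for i in 0..10 {
        assert!(send_with_timeout(&tx, i, Duration::from_secs(1)).is_ok());
    }
    subsystem.join().unwrap();

    // Stuck subsystem: nobody drains the queue, so the third send times out.
    let (tx2, _rx2) = mpsc::sync_channel(2);
    assert!(send_with_timeout(&tx2, 0, Duration::from_millis(20)).is_ok());
    assert!(send_with_timeout(&tx2, 1, Duration::from_millis(20)).is_ok());
    assert!(send_with_timeout(&tx2, 2, Duration::from_millis(20)).is_err());
}
```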
The way I see it, the difference between signals and normal messages is that signals are few and they are prioritised in So that's why I think it is still reasonable to time out on signal sending, but not on messages.
I was also thinking of removing the signal queue and replacing it with only the latest state. The subsystems need to handle missing block imports anyway and walk down the chain to the latest known one using
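The "latest state only" idea can be sketched as a single overwritten slot, with semantics similar to tokio's `watch` channel (rebuilt here on std so the example is self-contained; all names are invented):

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

// A "latest value" slot: writers overwrite it, and a slow reader observes
// only the most recent state instead of a queue of every intermediate update.
struct Watch<T> {
    slot: Mutex<(u64, T)>, // (version, value)
    changed: Condvar,
}

impl<T: Clone> Watch<T> {
    fn new(initial: T) -> Self {
        Watch {
            slot: Mutex::new((0, initial)),
            changed: Condvar::new(),
        }
    }

    fn send(&self, value: T) {
        let mut guard = self.slot.lock().unwrap();
        guard.0 += 1;
        guard.1 = value;
        self.changed.notify_all();
    }

    // Blocks until the version advances past `seen`, then returns the latest.
    fn wait_newer(&self, seen: u64) -> (u64, T) {
        let mut guard = self.slot.lock().unwrap();
        while guard.0 <= seen {
            guard = self.changed.wait(guard).unwrap();
        }
        (guard.0, guard.1.clone())
    }
}

fn main() {
    let watch = Arc::new(Watch::new(0u32));

    // Writer publishes 100 "block heights"; intermediate ones may be skipped.
    let w = Arc::clone(&watch);
    let writer = thread::spawn(move || {
        for head in 1..=100u32 {
            w.send(head);
        }
    });
    writer.join().unwrap();

    // The reader only ever sees the latest state, never a backlog.
    let (version, value) = watch.wait_newer(0);
    assert_eq!(version, 100);
    assert_eq!(value, 100);
}
```

Note that skipping intermediate values is exactly why subsystems would need to walk the chain back to the last state they processed, as the comment above says.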
That is interesting and does make a lot of sense! We would lose the watchdog though 😬 But other than that, I think this could help with performance and might also lead to cleaner code in some places.
The watch channel will break the active leaves concept, so competing blocks will be difficult to model, iiuc. Finality is the easy one, there is no wrong answer for that, but tracking active leaves is different.
Not if we send all currently active leaves (our view). And it is bounded by a small constant, so it shouldn't be a problem, iiuc.
That shifts the actual update calculation to the individual subsystem, which is fair. Just trying to understand what implications would arise for gensyn and our current usage: off the top of my head it'd be feasible if really needed, although not necessarily desirable. Our numbers are much, much larger (ballpark 1000s at the lower end, likely going to be a few orders of magnitude higher), so deltas help us compress them a lot. Let me ponder it for a bit.

My hunch is still that a subsystem not able to process signals fast enough has a borked design and should be fixed. A watchdog is, by design, a last resort. If timeouts are not generous enough, that points to either too-small timeouts (i.e. CPU shares not taken into account, IO congestion, ..) or (again) a borked Subsystem architecture.

An improved deadlock detection mechanism is feasible: by limiting the enum variants a Subsystem can send rather than only a specific enum, we would improve a) the generated graph and b) compile-time errors.

tl;dr: not a fan, it'd make our life (a little) harder
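The variant-limiting idea at the end could look roughly like this: each subsystem gets its own outgoing message type that converts into the full enum, so sending a disallowed variant becomes a compile error and the generated dependency graph has fewer edges. All type names below are invented for illustration, not taken from the real overseer code:

```rust
// The full message enum, as the overseer would see it.
#[derive(Debug, PartialEq)]
enum AllMessages {
    CandidateValidation(String),
    Availability(String),
    Networking(String),
}

// A hypothetical backing subsystem may only talk to availability and
// networking; constructing a CandidateValidation message from here is
// impossible, so that edge never appears in the generated graph.
enum BackingOutgoing {
    Availability(String),
    Networking(String),
}

impl From<BackingOutgoing> for AllMessages {
    fn from(msg: BackingOutgoing) -> Self {
        match msg {
            BackingOutgoing::Availability(m) => AllMessages::Availability(m),
            BackingOutgoing::Networking(m) => AllMessages::Networking(m),
        }
    }
}

// The sender only accepts the subsystem's restricted type; the conversion
// into the full enum happens at the boundary.
fn send(msg: impl Into<AllMessages>) -> AllMessages {
    msg.into() // in the real system this would be routed by the overseer
}

fn main() {
    let routed = send(BackingOutgoing::Networking("hello".to_string()));
    assert_eq!(routed, AllMessages::Networking("hello".to_string()));
    // send(AllMessages::CandidateValidation(..)) from this subsystem
    // would simply not be constructible here.
}
```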
Originally posted by @ordian in paritytech/polkadot-sdk#1730 (comment)
The issue that timeouts were trying to solve: when a subsystem is either deadlocking or unreasonably slow, we should detect that and shut down the node gracefully.
IMHO we should not shut down on slowness (it might be caused by some other system blips), and I think we need a more robust deadlock-prevention mechanism. So I'm suggesting removing the timeouts completely once a better deadlock-prevention mechanism is in place.
The dotgraph was a useful feature, but it has false positives, since it doesn't distinguish between bounded and unbounded channels AFAIK, which is why we're getting these annoying warnings about cycles on the polkadot-sdk build.