queue drainage options at shutdown #6085
There are only a couple of plugins that initiate a shutdown, like the generator and ES inputs. The generator shouldn't be used in production scenarios and is in general only used for testing. The ES input can be changed to a streaming case. In general, exposing too many options will just confuse users and make it really hard to support. IMO, for the first iteration:
I'm in favor of not surprising the user for 5.1.0, which means using the in-memory queue (not the in-memory "PQ") and forcing the queue to drain before shutdown/reload. The only question for me is: should the drain semantic be implicit or explicit depending on the queue?
I personally like implicit, since it is in line with current behavior: with the in-memory queue we wait until it drains; with the PQ, we don't. That should be good for 5.1.0 and beyond. If we get concrete requests we can flip to being explicit and introduce a new setting. WDYT?
@suyograo I'm open to your proposal, and if we move forward with it, we'll want to document the likelihood of duplicates shipping to outputs -- shutdown without a drain will cause any output-but-not-acked-in-the-queue events to be replayed upon restart, causing dups sent to outputs, right?
A clean/normal shutdown (without drain) should not cause dups. The workers' shutdown signalling and behaviour has not changed, so upon a normal shutdown all workers have a chance to complete their duties, which results in correctly acking the "inflight" batches/events. Whatever is in the queue after the workers have stopped should be only unread events.
@colinsurprenant thanks for the clarification! I think I was misunderstanding. If I understand you correctly, a clean no-drain shutdown won't cause duplicates, since in-flight batches are acked by the workers before they stop and only unread events remain in the queue.
@jordansissel yes, you understood correctly. I am also +1 for the no-drain default. The question is also: should we expose an option for a yes-drain on the PQ? Or, eventually, it could be a UI or API option when we expose a shutdown API and/or shutdown button, to ask for an explicit yes-drain on that shutdown. The only problem I see in a no-drain situation is if someone uses the generator or stdin inputs with PQ enabled: it might shut down before the queue is emptied. These are unlikely use-cases, but they will become real when we switch to having PQ on by default with the getting-started use-cases...
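For example, the classic getting-started pipeline run with `queue.type: persisted` and a no-drain shutdown could stop on stdin EOF before the queue is empty. A minimal sketch (nothing special about the codec, it just makes the scenario concrete):

```
input  { stdin { } }                      # finite input: EOF initiates pipeline shutdown
output { stdout { codec => rubydebug } }  # may not see all typed events if the PQ is not drained
```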
I don't think this is needed for 5.1.0. We can re-evaluate if/when users ask for it.
It depends on how we frame/document the PQ feature. When someone switches to using a persistent, variable-length queue, they shouldn't expect all in-flight events to be drained upon shutdown. Otherwise, the use of a queue does not make sense. Just like when someone uses Kafka or Redis to buffer events, they don't expect all events to be drained upon shutdown. To me, the behavior of LS's built-in PQ should be congruent/consistent with how Logstash behaves when used with Kafka or Redis.
@suyograo it is more a question of the default setting. If/when PQ is enabled by default with a no-drain shutdown, the getting-started use-cases (or, for that matter, all types of one-off processing) will "fail" (from a user-expectation POV) because LS is likely to terminate before all events are processed. We will have to provide a command-line switch or something to enable queue drain on shutdown. Otherwise, to avoid this pitfall that all first-time LS users will fall into, we will have to keep the in-memory queueing by default. This is all about user expectations and experience. If the getting-started scenarios are important then we have to provide a set of defaults that make sense for these use cases. This is somewhat similar to the file input default behaviour.
@colinsurprenant @suyograo If downscaling means destroying the container without stopping LS first and waiting for LS to shut down, then PQ should not be used. If downscaling means stopping LS and waiting until LS is stopped before destroying the container, then the PQ needs to drain, and it also makes sense to have a small-sized PQ, say some multiple of the batch size, for this use case so the drain time is reduced/controlled. I got this from a customer call when we talked to them about the jdbc_static enhancement filter.
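A sketch of what that small, drain-friendly PQ sizing could look like in logstash.yml (the numbers are purely illustrative, not recommendations):

```
# logstash.yml
queue.type: persisted
queue.max_events: 500      # e.g. a few multiples of the batch size, to keep drain time bounded
pipeline.batch.size: 125
```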
@guyboertje good point. I rebooted the drain PR #6433 and would like it to move forward. /cc @suyograo |
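Assuming the drain option from #6433 lands as a simple boolean setting, usage would look roughly like this (a sketch of the idea, not necessarily the final naming):

```
# logstash.yml
queue.type: persisted
queue.drain: true   # block a clean shutdown until the persistent queue is empty
```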
#6433 has been merged. closing. |
Historically, logstash, for obvious reasons, always drained its internal in-memory queue(s) upon shutdown.
Now, with the persistent queue, this is no longer required: a clean shutdown can happen at any time, and the queue does not need to be drained, since persisted events will be picked up where they were left off when logstash restarts. This is the current implementation in #5958.
I think we should also support an option where logstash can wait until the queue is drained before completing its shutdown, similar to the legacy implementation. One of the problems is with the input plugins which can initiate a shutdown, like the generator input: for a given configuration, if filter+output is slower than the input, the input-triggered shutdown will leave unprocessed events in the queue, and this can be counter-intuitive in a "getting started" scenario or create weird problems in specs like in #6084.
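For example, a configuration along these lines, with the persistent queue enabled and no drain, will typically leave events behind in the queue, because the generator finishes and triggers shutdown while filter+output are still catching up (counts and sleep duration are only illustrative):

```
input {
  generator { count => 100000 }     # finite input: initiates shutdown when done
}
filter {
  sleep { time => "0.01" }          # make filter+output slower than the input
}
output {
  stdout { codec => dots }
}
```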
Thoughts?