
queue drainage options at shutdown #6085

Closed
colinsurprenant opened this issue Oct 18, 2016 · 13 comments

@colinsurprenant
Contributor

Historically, Logstash has always, for obvious reasons, drained its internal in-memory queue(s) upon shutdown.

Now with the persistent queue this is no longer required: a clean shutdown can happen at any time and the queue does not need to be drained, since persisted events will be picked up where they were left off upon restarting Logstash. This is the current implementation in #5958.

I think we should also support an option where Logstash can wait until the queue is drained before completing its shutdown, similar to the legacy implementation. One of the problems is with input plugins that can initiate a shutdown, like the generator input: for a given configuration, if filter+output is slower, the input-triggered shutdown will leave unprocessed events in the queue. This can be counterintuitive in a "getting started" scenario or create weird problems in specs, as in #6084.

  • What should be the default behaviour? In a "getting started" use case it is probably expected that all events will be processed before completing a shutdown.
  • In a production use case I believe a quick, clean shutdown is a better default, as the queue could/will grow to a large number of events and waiting for it to drain could be a problem.
  • If we ever switch the in-memory queuing to use the in-memory PQ backend instead of the current queuing implementation, this option should definitely be enabled to respect the in-memory queuing behaviour and not lose in-flight events upon shutdown.

Thoughts?

@suyograo
Contributor

suyograo commented Oct 18, 2016

There are only a couple of plugins that initiate a shutdown, like the generator and ES inputs. Generator shouldn't be used in production scenarios and is generally used only for testing. The ES input can be changed to a streaming case. In general, exposing too many options will just confuse users and make it really hard to support.

IMO, for the first iteration:

  • In-memory queue: on by default, so we stick to the same semantics. Drain the queue before shutting down.
  • Disk-backed queue: in the first iteration, do not support wait-until-drained during shutdown. This works nicely with the variable queue semantics. If a user is really interested in draining events in the queue, they have to wait. Later on, if enough users ask for it, we can expose a flag to force a drain before shutdown. If this workflow interferes with our testing, we can add a setting but not document/expose it to users.

@jsvd
Member

jsvd commented Oct 19, 2016

I'm in favor of not surprising the user for 5.1.0, which means using the in-memory queue (not the in-memory "PQ") and forcing the queue to drain before shutdown/reload.

The only question for me is: should the drain semantics be implicit or explicit depending on the queue?
If implicit, then using the in-memory queue means the drain must happen (and using the PQ means it will not); if explicit, then we need another setting like drain_on_shutdown which defaults to draining.
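
The explicit variant could look something like this minimal Ruby sketch. The `queue.drain_on_shutdown` key and the helper are hypothetical, not an actual Logstash setting or API; this only illustrates the decision being discussed:

```ruby
# Hypothetical sketch of the implicit-vs-explicit decision: if the user set
# an explicit flag, honour it; otherwise fall back to the implicit rule
# (in-memory queues drain on shutdown, the persistent queue does not).
def drain_on_shutdown?(settings)
  return settings["queue.drain_on_shutdown"] if settings.key?("queue.drain_on_shutdown")
  settings["queue.type"] == "memory"
end

drain_on_shutdown?("queue.type" => "memory")     # => true  (implicit: drain)
drain_on_shutdown?("queue.type" => "persisted")  # => false (implicit: no drain)
drain_on_shutdown?("queue.type" => "persisted",
                   "queue.drain_on_shutdown" => true) # => true (explicit override)
```

The implicit rule alone needs no new setting; the explicit override is what the `drain_on_shutdown` proposal would add on top.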

@suyograo
Contributor

> If it is implicit then using the in-memory means the drain must happen (and using the PQ means it will not happen), if it's explicit then we need another setting like drain_on_shutdown which defaults to draining.

I personally like implicit since it is in line with current behavior: with in-memory we wait for the drain; with the PQ we don't. That should be good for 5.1.0 and beyond. If we get concrete requests we can flip to being explicit and introduce a new setting.

WDYT?

@jordansissel
Contributor

@suyograo I'm open to your proposal, and if we move forward with it, we'll want to document the likelihood of duplicates shipping to outputs -- shutdown without a drain will cause any output-but-not-acked-in-the-queue events to be replayed upon restart, causing dups sent to outputs, right?

@colinsurprenant
Contributor Author

@jordansissel

> shutdown without a drain will cause any output-but-not-acked-in-the-queue events to be replayed upon restart, causing dups sent to outputs, right?

A clean/normal shutdown (without drain) should not cause dups. The workers' shutdown signalling and behaviour have not changed, so upon a normal shutdown all workers have a chance to complete their duties, which results in correctly acking the "inflight" batches/events. Whatever is in the queue after the workers have stopped should be only unread events.
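
The sequence described above can be sketched as a toy model. `ToyQueue` and `clean_shutdown` are hypothetical names for illustration, not Logstash's actual classes:

```ruby
# Toy model of a clean no-drain shutdown: the worker completes and acks the
# batch it already holds, so only unread events remain in the queue.
class ToyQueue
  attr_reader :unread, :acked

  def initialize(events)
    @unread = events.dup
    @acked  = []
  end

  # A worker takes a batch off the queue; these events are now "in flight".
  def read_batch(size)
    @unread.shift(size)
  end

  def ack(batch)
    @acked.concat(batch)
  end
end

# Clean shutdown: let the worker finish its in-flight batch and ack it, then
# stop. Nothing in flight is lost, and nothing acked is replayed on restart.
def clean_shutdown(queue, inflight_batch)
  inflight_batch.each { |event| event } # stand-in for filter + output work
  queue.ack(inflight_batch)
  queue.unread # whatever is left is strictly unread events
end

queue = ToyQueue.new((1..10).to_a)
batch = queue.read_batch(4)
clean_shutdown(queue, batch) # => [5, 6, 7, 8, 9, 10]; events 1..4 acked, no dups
```

On restart, only the unread tail would be processed, which is why a no-drain shutdown produces neither loss nor duplicates.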

@jordansissel
Contributor

@colinsurprenant thanks for the clarification! I think I was misunderstanding: I had assumed drain referred to what Logstash does today on a safe shutdown -- events currently being processed by pipeline workers are allowed to finish -- and that no drain meant the current unsafe shutdown.

If I understand you correctly, and drain in this discussion means fully draining the on-disk PQ, then I am +1 on not draining by default, since the PQ will resume on restart with no loss and no dups.

@colinsurprenant
Contributor Author

@jordansissel yes, you understood correctly. I am also +1 for the no-drain default.

The question is also: should we expose an option for a yes-drain on the PQ?

Or, eventually, it could be a UI or API option when we expose a shutdown API and/or shutdown button, to ask for an explicit yes-drain on that shutdown.

The only problem I see in a no-drain situation is that if someone uses the generator or stdin inputs with the PQ enabled, Logstash might shut down before the queue is emptied. These are unlikely use cases, but they will become real when we switch to having the PQ on by default for the getting-started use cases...

@suyograo
Contributor

> The question is also: should we expose an option for a yes-drain on the PQ?

I don't think this is needed for 5.1.0. We can re-evaluate if/when users ask for it.

@suyograo
Contributor

suyograo commented Oct 22, 2016

> The only problem I see in a no-drain situation is that if someone uses the generator or stdin inputs with the PQ enabled, Logstash might shut down before the queue is emptied. These are unlikely use cases, but they will become real when we switch to having the PQ on by default for the getting-started use cases...

It depends on how we frame/document the PQ feature. When someone switches to using a persistent, variable-length queue, they shouldn't expect all in-flight events to be drained upon shutdown; otherwise, using a queue does not make sense. Just as when someone uses Kafka or Redis to buffer events, they don't expect all events to be drained upon shutdown. To me, the behavior of LS's built-in PQ should be consistent with how Logstash behaves when used with Kafka or Redis.

@colinsurprenant
Contributor Author

@suyograo it is more a question of the default setting. If/when the PQ is enabled by default with a no-drain shutdown, the getting-started use cases (or, for that matter, all types of one-off processing) will "fail" (from a user-expectation POV) because LS is likely to terminate before all events are processed. We will have to provide a command-line switch or something to enable queue drain on shutdown. Otherwise, to avoid this pitfall that all first-time LS users will fall into, we will have to keep in-memory queuing as the default.

This is all about user expectations and experience. If the getting-started scenarios are important, then we have to provide a set of defaults that make sense for these use cases. This is somewhat similar to the file input's default behaviour.

@guyboertje
Contributor

guyboertje commented Feb 10, 2017

@colinsurprenant @suyograo
A real consideration here is autoscaling multiple LS instances that consume from a Kafka or RabbitMQ store, in particular the downscaling part.

If downscaling means destroying the container without stopping LS first and waiting for it to shut down, then the PQ should not be used.

If downscaling means stopping LS and waiting until it has stopped before destroying the container, then the PQ needs to drain, and it also makes sense to have a small PQ, say some multiple of the batch size, for this use case so that drain time is reduced/controlled.
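
As a rough illustration of why capping the PQ helps in that downscaling scenario, here is a back-of-envelope estimate. The throughput and sizing numbers are made up for illustration; only the 125-event default of `pipeline.batch.size` comes from Logstash 5.x:

```ruby
# Back-of-envelope drain-time estimate: a PQ held to a few batches drains in
# seconds, while a large backlog can block container teardown for hours.
def drain_seconds(queued_events, events_per_second)
  (queued_events.to_f / events_per_second).ceil
end

batch_size = 125                # default pipeline.batch.size in LS 5.x
capped_pq  = 8 * batch_size     # PQ sized to a small multiple of the batch size

drain_seconds(capped_pq, 500)   # => 2       (assuming ~500 events/s throughput)
drain_seconds(5_000_000, 500)   # => 10000   (~2.8 hours for a large backlog)
```

This is why bounding the PQ to a small multiple of the batch size keeps a drain-on-stop shutdown predictable.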

I got this from a customer call where we talked about the jdbc_static filter enhancement.

@colinsurprenant
Contributor Author

@guyboertje good point.

I rebooted the drain PR #6433 and would like it to move forward. /cc @suyograo

@colinsurprenant
Contributor Author

#6433 has been merged. Closing.
