queue drainage options at shutdown #6085
There are only a couple of plugins that initiate a shutdown, like the generator and ES inputs. The generator shouldn't be used in production scenarios and is in general only used for testing. The ES input can be changed to a streaming case. In general, exposing too many options will just confuse users and make it really hard to support. IMO, for the first iteration:
I'm in favor of not surprising the user for 5.1.0, which means using the in-memory queue (not the in-memory "PQ") and forcing the queue to drain before shutdown/reload. The only question for me is: should the drain semantic be implicit or explicit depending on the queue?
I personally like implicit, since it is in line with current behavior: with the in-memory queue we wait until it drains; with the PQ, we don't. That should be good for 5.1.0 and beyond. If we get concrete requests we can flip to being explicit and introduce a new setting. WDYT?
@suyograo I'm open to your proposal, and if we move forward with it, we'll want to document the likelihood of duplicates shipping to outputs -- shutdown without a drain will cause any output-but-not-acked-in-the-queue events to be replayed upon restart, causing dups sent to outputs, right?
A clean/normal shutdown (without drain) should not cause dups. The workers' shutdown signalling and behaviour has not changed, so upon a normal shutdown all workers have a chance to complete their duties, which results in correctly acking the "inflight" batches/events. Whatever is in the queue after the workers have stopped should be only unread events.
@colinsurprenant thanks for the clarification! I think I was misunderstanding. If I understand you correctly, a clean no-drain shutdown won't cause duplicates, since in-flight batches are acked by the workers before they stop and only unread events remain in the queue.
@jordansissel yes, you understood correctly. I am also +1 for the no-drain default. The question is also: should we expose an option for a yes-drain on the PQ? Or, eventually, it could be a UI or API option when we expose a shutdown API and/or shutdown button, to ask for an explicit yes-drain on that shutdown. The only problem I see in a no-drain situation is if someone uses the generator or stdin inputs with PQ enabled: it might shut down before the queue is emptied. These are unlikely use-cases, but they will become real when we switch to having PQ on by default with the getting-started use-cases...
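For example, the classic getting-started pipeline run with `queue.type: persisted` and a no-drain shutdown could stop on stdin EOF before the queue is empty. A minimal sketch (nothing special about the codec, it just makes the scenario concrete):

```
input  { stdin { } }                      # finite input: EOF initiates pipeline shutdown
output { stdout { codec => rubydebug } }  # may not see all typed events if the PQ is not drained
```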
I don't think this is needed for 5.1.0. We can re-evaluate if/when users ask for it.
It depends on how we frame/document the PQ feature. When someone switches to using a persistent, variable-length queue, they shouldn't expect all in-flight events to be drained upon shutdown. Otherwise, the use of a queue does not make sense. Just like when someone uses Kafka or Redis to buffer events, they don't expect all events to be drained upon shutdown. To me, the behavior of LS's built-in PQ should be congruent/consistent with how Logstash behaves when used with Kafka or Redis.
@suyograo it is more a question of the default setting. If/when PQ is enabled by default with a no-drain shutdown, the getting-started use-cases (or, for that matter, all types of one-off processing) will "fail" (from a user-expectation POV) because LS is likely to terminate before all events are processed. We will have to provide a command-line switch or something to enable queue drain on shutdown. Otherwise, to avoid this pitfall that all first-time LS users will fall into, we will have to keep the in-memory queueing by default. This is all about user expectations and experience. If the getting-started scenarios are important then we have to provide a set of defaults that make sense for these use cases. This is somewhat similar to the file input default behaviour.
@colinsurprenant @suyograo If downscaling means destroying the container without stopping LS first and waiting for LS to shut down, then PQ should not be used. If downscaling means stopping LS and waiting until LS is stopped before destroying the container, then the PQ needs to drain, and it also makes sense to have a small-sized PQ, say some multiple of the batch size, for this use case so the drain time is reduced/controlled. I got this from a customer call when we talked to them about the jdbc_static enhancement filter.
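A sketch of what that small, drain-friendly PQ sizing could look like in logstash.yml (the numbers are purely illustrative, not recommendations):

```
# logstash.yml
queue.type: persisted
queue.max_events: 500      # e.g. a few multiples of the batch size, to keep drain time bounded
pipeline.batch.size: 125
```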
@guyboertje good point. I rebooted the drain PR #6433 and would like it to move forward. /cc @suyograo |
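Assuming the drain option from #6433 lands as a simple boolean setting, usage would look roughly like this (a sketch of the idea, not necessarily the final naming):

```
# logstash.yml
queue.type: persisted
queue.drain: true   # block a clean shutdown until the persistent queue is empty
```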
#6433 has been merged. closing. |
Historically, logstash, for obvious reasons, always drained its internal in-memory queue(s) upon shutdown.
Now, with the persistent queue, this is no longer required: a clean shutdown can happen at any time, and the queue does not need to be drained, since persisted events will be picked up where they were left off when logstash restarts. This is the current implementation in #5958.
I think we should also support an option where logstash can wait until the queue is drained before completing its shutdown, similar to the legacy implementation. One of the problems is with the input plugins which can initiate a shutdown, like the generator input: for a given configuration, if filter+output is slower than the input, the input-triggered shutdown will leave unprocessed events in the queue, and this can be counter-intuitive in a "getting started" scenario or create weird problems in specs like in #6084.
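For example, a configuration along these lines, with the persistent queue enabled and no drain, will typically leave events behind in the queue, because the generator finishes and triggers shutdown while filter+output are still catching up (counts and sleep duration are only illustrative):

```
input {
  generator { count => 100000 }     # finite input: initiates shutdown when done
}
filter {
  sleep { time => "0.01" }          # make filter+output slower than the input
}
output {
  stdout { codec => dots }
}
```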
Thoughts?