
Options for Mitigating Input Downtime #68

Closed
todd opened this issue Dec 10, 2015 · 12 comments

Comments

@todd

todd commented Dec 10, 2015

We recently experienced an issue where our Redis input for Logstash went down and our app became unresponsive, a scenario outlined in the README. As noted there, we can bump up the values for the buffer configuration, but it doesn't seem that will prevent this issue from recurring during another significant logging-infrastructure outage.

There's the sync option, but, based on the documentation, I'm unclear on whether this would have prevented this issue from occurring.

It would be great if there was a way for the logger to flush the buffer if it receives a connection error. We'd much rather lose logs than take downtime. Is this something that would be possible? I'd be more than willing to work on a patch and submit a PR if you thought it was possible and worthwhile and could give a little direction.

@johnnonolan

👍

@dwbutler
Owner

I think it's a great idea to make this behavior configurable. Most people would rather drop some logs than experience downtime! :)

In order for this to work really well, LogStashLogger would need to never block and never raise an exception. This would require a thorough review of the code to make sure this works consistently everywhere.

I can see a few different options for this:

- Don't buffer messages; drop them on connection failure.
- Buffer messages, but drop them if the buffer gets full.
- Buffer messages and never drop them (i.e. block until the connection is re-established).

@blysik

blysik commented Jan 7, 2016

I think we experienced the same thing recently: our Redis system went down, which caused long pauses in our Rails application because of timeouts while sending logs to it.

Is there a way to make a very short timeout for logging?

@dwbutler
Owner

dwbutler commented Jan 8, 2016

The Ruby Redis client defaults to a 5 second timeout. You can override it by passing a different value for timeout and/or connect_timeout in your Redis configuration. Let me know if that helps with the issue.
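A sketch of what that configuration might look like. The `timeout` and `connect_timeout` option names come from the Ruby `redis` gem; the specific values below, and passing them through the logger's options hash, are assumptions for illustration.

```ruby
# Hypothetical configuration lowering the redis gem's default 5s timeouts
# so a dead Redis endpoint fails fast instead of stalling the app.
redis_options = {
  type: :redis,
  host: 'localhost',
  port: 6379,
  timeout: 1.0,         # read/write timeout, in seconds
  connect_timeout: 0.5  # connection timeout, in seconds
}

# logger = LogStashLogger.new(redis_options)
```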

@todd
Author

todd commented Feb 29, 2016

So we're not going to work on this due to time and resource constraints, but I did want to report back with the solution we went with.

We ended up removing this gem entirely. Instead, we're logging to files that are being tailed with Filebeat and shipping events directly to our collectors.

I regret that I won't be able to work on this. I'm going to leave this issue open for now as it's still an issue that I believe should be solved at some point.

@dwbutler
Owner

I agree that it should be solved. LogStashLogger is essentially an in-process log shipper, and should act in a well-defined, reliable way that does not interfere with normal operation of the application.

@DaveCollinsJr

We were bitten by this in production today also. Our ELK stack went down over the weekend and eventually, I believe, the inability to log caused our Sidekiq workers to hang. I'd +1 the option to simply lose the data when the buffer is full rather than have the application become unresponsive.

sauliusgrigaitis added a commit to necolt/logstash-logger that referenced this issue Apr 4, 2016
@DaveCollinsJr

Thanks much @sauliusgrigaitis for that fix!

@lucke84

lucke84 commented Jul 5, 2016

Hello there, yesterday we experienced the same issue (our Redis endpoint went down and the application quickly became unresponsive). @dwbutler, what needs to be done to introduce the ability to drop the data if the connection times out? I'd be happy to help if you give me a few pointers on what code to review (as you suggested earlier in this very issue).

dwbutler added a commit that referenced this issue Jul 6, 2016
LogStashLogger currently uses `Stud::Buffer` to implement buffering for connectable devices (such as TCP, Redis, etc.) When the remote service goes down, an exception is raised when a buffer flush is attempted. By default `Stud::Buffer` will retry sending the messages forever. Since a flush is triggered when a message is received, or on a regular timer, this will cause logging calls to block. See jordansissel/ruby-stud#28

`Stud::Buffer` allows callbacks to be fired when it encounters a flush error. This ties into that mechanism to abort the flush and re-enqueue the failed messages. This behavior is now enabled by default. To instead drop messages when there is a flush error, pass the new `drop_messages_on_flush_error` option to the logger.

Most applications will want to buffer messages and only drop them when the buffer fills up. This behavior has been implemented by tying into `Stud::Buffer`'s callback for the buffer full event. By default, when the buffer is full, `Stud::Buffer` will block when any new message comes in, until there is room in the buffer. If you want to discard messages in the buffer when this happens, pass the new `drop_messages_on_full_buffer` option to the logger.
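The "drop messages on full buffer" behavior described above can be sketched in pure Ruby. This is not the real `Stud::Buffer` API or LogStashLogger's implementation; it is a minimal, self-contained model of the policy decision: when the buffer is at capacity, discard the new message instead of blocking the caller.

```ruby
# Hypothetical bounded buffer modeling the drop_messages_on_full_buffer
# policy: receive() never blocks; overflow messages are counted and dropped.
class DroppingBuffer
  attr_reader :items, :dropped

  def initialize(max_items:)
    @max_items = max_items
    @items = []
    @dropped = 0
  end

  # Returns true if the message was buffered, false if it was dropped.
  def receive(message)
    if @items.size >= @max_items
      @dropped += 1
      return false # a blocking variant would wait for a flush instead
    end
    @items << message
    true
  end
end
```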

Fixes #68
@dwbutler
Owner

dwbutler commented Jul 6, 2016

I finally found some time to work on this. Please try the patch in #81. By default, the logger will no longer block when there is a connection error. If you want to drop messages when the buffer is full, add this new configuration option to your logger:

logger = LogStashLogger.new(type: :redis, drop_messages_on_full_buffer: true)

@lucke84

lucke84 commented Jul 10, 2016

Thanks @dwbutler for getting this done! Any chance you'll release a new version of the gem anytime soon?

@dwbutler
Owner

Yes, my goal is to release sometime this week.
