Journal: A "bad message" error results in an infinite hot-loop #4053
Comments
Maybe one of those two could help: https://github.com/coreos/go-systemd/blob/main/sdjournal/journal.go#L572-L606. What do you think? It's not available in the reader/follow interface, though, so we might need to fork it or contribute upstream to get access to the underlying journal.
My impression is that it won't, since Interestingly, I can run
@href please see #4066 and give us your thoughts.
I haven't been able to reproduce this in the past 24 hours either (before it repeatedly happened on multiple hosts over the course of a few days). What those hosts share is a recent hard reboot, so we suspect that this somehow introduced corrupt journal files. Though we were not able to confirm this theory with In any case, I had a look at your PR and to me it looks great. This would certainly defuse the situation and would allow us to go live with Thanks a lot for fixing this so quickly! I'm impressed ❤️
Awesome, thanks for reporting this @href and happy Promtailing 🎉
I was wondering, when will this be included in a release? I didn't see any mention of this in the 2.3.0 change log, so I'm unsure if this made it into it or not.
@href hhmm, doesn't seem like it was included in v2.3.0, for some reason.
Awesome, thank you!
Description
The systemd journal library may at times follow logs that it cannot process. When that happens, promtail enters an infinite loop that produces the following error messages as fast as it can (with newlines for clarity):
This is likely related to #2928, which introduced a loop over the journal follow logic.
Environment
Systemd 241 and 245
Theory
The intention of #2928 is to skip log entries when "bad message" occurs, but from my study of the source code involved, this is likely not what's happening:
The error clearly bubbles out of Follow in the following snippet:
loki/clients/pkg/promtail/targets/journal/journaltarget.go, lines 181 to 187 in b4086df
The error is logged and Follow is called again. For it to produce a differing result, the current line would have to be skipped at some point, or we are likely reading the same result over and over again (which is what we are observing).
Inside Follow, the Read function is called, which is the function that advances the current cursor here:
loki/vendor/github.com/coreos/go-systemd/sdjournal/read.go, lines 138 to 147 in babea82
Internally, this calls sd_journal_next, which in turn calls real_journal_next:
https://github.com/systemd/systemd/blob/91d0750dbf65e1ffa627fa880c50673a27758cf6/src/libsystemd/sd-journal/sd-journal.c#L816
Towards the end of that function, journal_file_move_to_object is called. This is where EBADMSG comes from! Unfortunately, reading a bad message does not advance the cursor, as an error forces the function to exit before that:
https://github.com/systemd/systemd/blob/91d0750dbf65e1ffa627fa880c50673a27758cf6/src/libsystemd/sd-journal/sd-journal.c#L861-L867
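To make the suspected failure mode concrete, here is a minimal, self-contained simulation of it. This is only a sketch under assumptions: fakeJournal, next, follow and retryFollow are hypothetical names made up for illustration, and none of this is the go-systemd or promtail code. It only models the behavior described above, where a bad entry returns EBADMSG before the cursor moves, so every retry lands on the same entry.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// fakeJournal is a hypothetical stand-in for the journal reader.
// It is NOT the go-systemd API; it only models the cursor behavior.
type fakeJournal struct {
	entries []string
	corrupt map[int]bool // indexes of entries that cannot be read
	cursor  int          // index of the next entry to read
}

// next mimics the behavior described in the issue: on a bad entry it
// returns EBADMSG before the cursor has been advanced.
func (j *fakeJournal) next() (string, error) {
	if j.cursor >= len(j.entries) {
		return "", errors.New("end of journal")
	}
	if j.corrupt[j.cursor] {
		return "", syscall.EBADMSG // early exit: cursor is unchanged
	}
	entry := j.entries[j.cursor]
	j.cursor++
	return entry, nil
}

// follow reads entries until the first error, like a single follow call.
func follow(j *fakeJournal) error {
	for {
		if _, err := j.next(); err != nil {
			return err
		}
	}
}

// retryFollow calls follow again after every EBADMSG, up to maxAttempts
// times. It returns how many attempts failed with EBADMSG and where the
// cursor ended up.
func retryFollow(j *fakeJournal, maxAttempts int) (badAttempts, cursor int) {
	for i := 0; i < maxAttempts; i++ {
		err := follow(j)
		if !errors.Is(err, syscall.EBADMSG) {
			break
		}
		badAttempts++
	}
	return badAttempts, j.cursor
}

func main() {
	j := &fakeJournal{
		entries: []string{"a", "b", "c"},
		corrupt: map[int]bool{1: true},
	}
	bad, cursor := retryFollow(j, 5)
	fmt.Printf("attempts that hit EBADMSG: %d, cursor stuck at %d\n", bad, cursor)
}
```

Without the maxAttempts bound, the retry loop in this sketch would never terminate, which matches the hot-loop we are observing.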
Expected Behavior
Ideally we would like to skip bad messages, but I'm not sure this is possible, as the skip functions systemd offers are really just "next" calls with a counter. It might make sense to treat these errors like end-of-file errors, since this seems to be what journald is doing internally:
https://unix.stackexchange.com/a/87862/413524