Corrupted user journal files #24150
Comments
I think I have traced it down to this block: systemd/src/libsystemd/sd-journal/sd-journal.c, lines 1280 to 1290 (at b622e95). But I've got no idea what would make …
Can you attach the corrupted journal file?
To be honest, I've got concerns that it might contain sensitive information; it's a user journal file and I cannot open it to check. Any way of sharing this non-publicly with you? I could mail (or DM) you a link to the file after uploading it somewhere (e.g. a NextCloud instance).
I am not heavily involved with journal development, so let's wait for someone who understands it, but it should be fine to share privately once that happens.
Hello, same issue here. After each rotation I get a corrupted journal message:
SYSTEMD_LOG_LEVEL=debug journalctl --verify
Failed to open journal file /var/log/journal/bed9c4a1d1e24617bb1d2977838f103c/user-1001@9a6fcf41ba5c4c31a60afbf94ecfd150-0000000000000000-0000000000000000.journal: Bad message
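For what it's worth, the check can also be pointed at a single journal file; a small sketch using journalctl's `--file=` option and the path from the error above:

```bash
# Verify just the affected archived file (path taken from the error above).
SYSTEMD_LOG_LEVEL=debug journalctl --verify \
  --file=/var/log/journal/bed9c4a1d1e24617bb1d2977838f103c/user-1001@9a6fcf41ba5c4c31a60afbf94ecfd150-0000000000000000-0000000000000000.journal
```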
Can we remove the 'needs-reporter-feedback' label? There is currently no unanswered question; the sample log file is uploaded and ready to be shared with any dev who wants to have a look. I'd like to avoid having people overlook this ticket because they think there's info missing.
So after an extensive round of … I'm pretty sure it doesn't contain sensitive information. Here it is: corrupted.journal.gz
Any more info I could provide to help resolve this issue? I'm seeing more and more machines affected by this; it seems the affected systemd version has hit the Debian repositories.
@lynix Can you send a journal file to my email (it's in my profile)? I'll need a corrupted one to figure out the problem.
@DaanDeMeyer Please do have a look at the comment before my last comment ;)
Yeah, something is probably going wrong way earlier; I can't figure it out just from the corrupted journal file. Are all the affected machines running btrfs? I'd need journald logs to have any hope of figuring this out. They'll be in dmesg, but you'll have to run systemd-journald with SYSTEMD_LOG_LEVEL=debug (use …)
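For anyone who wants to capture those journald debug logs: a minimal sketch of one way to do it with a drop-in (the drop-in path and restart step are standard systemd mechanics, assumed rather than taken from this thread):

```bash
# Create a drop-in that sets the debug log level for journald, then restart it.
sudo mkdir -p /etc/systemd/system/systemd-journald.service.d
sudo tee /etc/systemd/system/systemd-journald.service.d/debug.conf >/dev/null <<'EOF'
[Service]
Environment=SYSTEMD_LOG_LEVEL=debug
EOF
sudo systemctl daemon-reload
sudo systemctl restart systemd-journald
# journald's own debug output ends up in the kernel ring buffer, i.e. dmesg.
sudo dmesg | grep -i journald
```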
@cyayon please also describe your setup: architecture, distribution, kernel version, file system, systemd version.
Hi, |
Any plans to fix it? I have done …
Filesystem: …
Duplicate of #24116? Corrupted journal sent to @poettering and @DaanDeMeyer.
Some …
@yuwata @DaanDeMeyer Can the …
This is the first time in 12 years that I can say: …
Can you resend a corrupted journal? The previous links in the email you sent me expired.
If the source file is removed while it is being copied, then the resulting file may be broken. Let's check that the file still exists during and after copying. This is important when archiving a journal file on btrfs: archiving journal files is done asynchronously, so a journal file may be removed by the main thread of journald via journal_directory_vacuum() while another thread is copying the file. If that happens, a broken journal file is created. Fixes systemd#24150 and systemd#31222.
As the main thread may call journal_directory_vacuum() -> unlinkat_deallocate() while another thread is copying the file. Fixes systemd#24150 and systemd#31222.
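A rough shell analogue of the check described in these commits (the actual fix operates on the already-open file descriptor inside journald; `$src` and `$dst` below are placeholder paths, not names from the systemd source):

```bash
# Copy the source, then keep the result only if the source still exists
# after the copy; if it was vacuumed away mid-copy, the copy may be broken.
cp -- "$src" "$dst.tmp"
if [ -e "$src" ]; then
    mv -- "$dst.tmp" "$dst"
else
    rm -f -- "$dst.tmp"
fi
```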
As the main thread may call journal_directory_vacuum() -> unlinkat_deallocate() while another thread is copying the file. Fixes systemd#24150 and systemd#31222. (cherry picked from commit 18d4e47)
For issue systemd#24150 and systemd#31222. (cherry picked from commit 8cc4216)
As the main thread may call journal_directory_vacuum() -> unlinkat_deallocate() while another thread is copying the file. Fixes systemd#24150 and systemd#31222. (cherry picked from commit 18d4e47) (cherry picked from commit 0420956)
For issue systemd#24150 and systemd#31222. (cherry picked from commit 8cc4216) (cherry picked from commit d1a7ffc)
#### Description

According to the community, there are bugs in systemd that could corrupt the journal files or crash the log receiver: systemd/systemd#24320, systemd/systemd#24150. We've seen some issues reported to the Elastic/beats project: elastic/beats#39352, elastic/beats#32782, elastic/beats#34077. Unfortunately, the otelcol is not immune from these issues. When the journalctl process exits for any reason, log consumption from journald just stops. We've experienced this on some machines that have a high log volume. Currently we monitor the journalctl processes started by the otelcol and restart the otelcol when one of them is missing. IMO, the journald receiver itself should monitor the journalctl process it starts and do its best to keep it alive.

In this PR, we try to restart the journalctl process when it exits unexpectedly. As long as the journalctl cmd can be started (via `Cmd.Start()`) successfully, the journald_input will always try to restart the journalctl process if it exits. The error reporting behaviour changes a bit in this PR: before the PR, `operator.Start` waits up to 1 sec to capture any immediate error returned from journalctl; after the PR, the error won't be reported back even if journalctl exits immediately after start. Instead, the error will be logged and the process will be restarted. The fix is largely inspired by elastic/beats#40558.

#### Testing

Add a simple bash script that prints a line every second, and load it into systemd.

`log_every_second.sh`:
```bash
#!/bin/bash
while true; do
  echo "Log message: $(date)"
  sleep 1
done
```

`log.service`:
```
[Unit]
Description=Print logs to journald every second
After=network.target

[Service]
ExecStart=/usr/local/bin/log_every_second.sh
Restart=always
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target
```

Start the otelcol with the following config:
```yaml
service:
  telemetry:
    logs:
      level: debug
  pipelines:
    logs:
      receivers: [journald]
      processors: []
      exporters: [debug]
receivers:
  journald:
exporters:
  debug:
    verbosity: basic
    sampling_initial: 1
    sampling_thereafter: 1
```

Kill the journalctl process and observe the otelcol's behaviour. The journalctl process will be restarted after the backoff period (hardcoded to 2 sec):

```bash
2024-10-06T14:32:33.755Z info LogsExporter {"kind": "exporter", "data_type": "logs", "name": "debug", "resource logs": 1, "log records": 1}
2024-10-06T14:32:34.709Z error journald/input.go:98 journalctl command exited {"kind": "receiver", "name": "journald", "data_type": "logs", "operator_id": "journald_input", "operator_type": "journald_input", "error": "signal: terminated"}
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/operator/input/journald.(*Input).run
    github.com/open-telemetry/opentelemetry-collector-contrib/pkg/[email protected]/operator/input/journald/input.go:98
2024-10-06T14:32:36.712Z debug journald/input.go:94 Starting the journalctl command {"kind": "receiver", "name": "journald", "data_type": "logs", "operator_id": "journald_input", "operator_type": "journald_input"}
2024-10-06T14:32:36.756Z info LogsExporter {"kind": "exporter", "data_type": "logs", "name": "debug", "resource logs": 1, "log records": 10}
```

Signed-off-by: Mengnan Gong <[email protected]>
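For reference, one way to exercise the restart path described above on a test machine (a rough sketch; note that pkill matches every journalctl process on the host, not only the one spawned by the collector):

```bash
# Find running journalctl processes, then send SIGTERM to trigger the
# "journalctl command exited" -> restart sequence shown in the log output.
pgrep -af journalctl
sudo pkill -TERM -x journalctl
```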
systemd version the issue has been seen with
systemd-251.3-1
Used distribution
Arch Linux
Linux kernel version used
5.18.14.arch1-1
CPU architectures issue was seen on
x86_64
Component
journalctl, systemd-journald
Expected behaviour you didn't see
`journalctl --verify` does not show any errors.
Unexpected behaviour you saw
`SYSTEMD_LOG_LEVEL=debug journalctl --verify` shows the following error: …
Note: … (btrfs scrubbing)
Steps to reproduce the problem
None, actually; simply wait a couple of days (usually 1 or 2).
I've got the feeling it always happens when the user journal is rotated, but I'm not sure.
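For anyone who wants to try triggering this on demand instead of waiting for the next rotation, a small sketch (assuming a journalctl new enough to support --rotate, and root privileges to request the rotation):

```bash
# Ask journald to rotate all journal files immediately, then re-check them.
sudo journalctl --rotate
SYSTEMD_LOG_LEVEL=debug journalctl --verify
```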
Additional program output to the terminal or log subsystem illustrating the issue