Persistent queue fails to delete items from disk in some circumstances #8115

Closed
swiatekm opened this issue Jul 20, 2023 · 9 comments
Labels
bug Something isn't working


@swiatekm
Contributor

swiatekm commented Jul 20, 2023

Describe the bug
The persistent queue pulls items from the on-disk database, but fails to delete them after they've been successfully sent. As a result, the DB accumulates orphaned data that never gets deleted.

Steps to reproduce
I haven't been able to reproduce this myself, but a user sent me their DB file, and I confirmed it contained items with indices lower than the saved read index.

What version did you use?
0.77.0

What config did you use?
The relevant part is:

extensions:
  file_storage:
    directory: /var/lib/storage/otc
    timeout: 10s
    compaction:
      on_rebound: true
      directory: /tmp

Environment
The report I have is from AWS EKS 1.23 with the default storage driver.

Additional context
#7396 should technically help address similar problems, but the issue that change fixes should be fairly difficult to trigger, whereas this has happened quite consistently for this specific user under low disk space conditions.

I'm filing this in the hope of collecting more reports of this behaviour, verifying that it still occurs, and getting to the bottom of the issue.

swiatekm added the bug label on Jul 20, 2023
@swiatekm
Contributor Author

@jpkrohling @djaglowski @frzifus you asked to be included during yesterday's SIG meeting.

dmitryax pushed a commit that referenced this issue Nov 27, 2023
…eue (#8963)

**Description:**
Fixing a bug where the in-memory value of the persistent queue's write
index would be updated even if writing to the storage failed. This
normally wouldn't have any negative effect other than inflating the
queue size temporarily, as the read loop would simply skip over the
nonexistent record. However, in the case where the storage doesn't have
any available space, the in-memory and in-storage write index could
become significantly different, at which point a collector restart would
leave the queue in an inconsistent state.

Worth noting that the same issue affects reading from the queue, but in
that case the writes are very small, and in practice the storage will
almost always have enough space to carry them out.

**Link to tracking Issue:** #8115

**Testing:**
The `TestPersistentQueue_StorageFull` test actually only passed by
accident. Writing would leave one additional item in the put channel,
then the first read would fail (as there is not enough space to do the
read index and dispatched items writes), but subsequent reads would
succeed, so the bugs would cancel out. I modified this test to check for
the number of items in the queue after inserting them, and also to
expect one fewer item to be returned.
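
To make the ordering concrete, here is a minimal Go sketch of the invariant the fix above establishes. It is not the collector's actual code: storageClient, persistentQueue, and put are invented stand-ins, and the actual implementation details (batched writes, key encoding, sizing) differ.

```go
package main

import (
	"context"
	"encoding/binary"
	"errors"
	"fmt"
)

// storageClient stands in for the collector's storage client interface.
type storageClient interface {
	Set(ctx context.Context, key string, value []byte) error
}

// mapClient is an in-memory client used only to make the sketch runnable.
type mapClient struct {
	data map[string][]byte
	full bool // simulates "no space left on device"
}

func (m *mapClient) Set(_ context.Context, key string, value []byte) error {
	if m.full {
		return errors.New("no space left on device")
	}
	m.data[key] = value
	return nil
}

type persistentQueue struct {
	client     storageClient
	writeIndex uint64 // in-memory copy of the "wi" key
}

// put persists the item and the advanced write index first, and only then
// advances the in-memory index, so a failed write leaves both in sync.
func (pq *persistentQueue) put(ctx context.Context, item []byte) error {
	itemKey := fmt.Sprintf("%d", pq.writeIndex)
	if err := pq.client.Set(ctx, itemKey, item); err != nil {
		return err // e.g. disk full: writeIndex stays untouched
	}
	wi := make([]byte, 8)
	binary.LittleEndian.PutUint64(wi, pq.writeIndex+1)
	if err := pq.client.Set(ctx, "wi", wi); err != nil {
		return err
	}
	pq.writeIndex++
	return nil
}

func main() {
	client := &mapClient{data: map[string][]byte{}}
	pq := &persistentQueue{client: client}

	_ = pq.put(context.Background(), []byte("ok")) // succeeds, writeIndex -> 1
	client.full = true
	_ = pq.put(context.Background(), []byte("lost")) // fails, writeIndex stays 1
	fmt.Println("in-memory write index:", pq.writeIndex)
}
```
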
pantuza pushed a commit to pantuza/opentelemetry-collector that referenced this issue Dec 8, 2023
…eue (open-telemetry#8963)

@johngmyers

We're seeing something that looks like this with splunk-otel-collector 0.102.0, which claims to include the changes from opentelemetry-collector 0.102.0. The persistent queue file has grown past 100 GB, yet the Prometheus metrics show no backlog.

@swiatekm
Contributor Author

swiatekm commented Aug 20, 2024

@johngmyers do you have access to the queue file? It's possible to inspect it with bbolt's CLI and check if there are keys outside of the queue's normal range.

@johngmyers

johngmyers commented Aug 21, 2024

$ sudo ~/bbolt keys exporter_splunk_hec_platform_logs_logs default
254468
254469
254470
254471
254472
254473
[...]
263190
263191
263192
di
ri
wi
$ sudo ~/bbolt get -format hex exporter_splunk_hec_platform_logs_logs default di
0300000004e203000000000005e203000000000006e2030000000000
$ sudo ~/bbolt get -format hex exporter_splunk_hec_platform_logs_logs default ri
07e2030000000000
$ sudo ~/bbolt get -format hex exporter_splunk_hec_platform_logs_logs default wi
1904040000000000

@johngmyers

ri decodes to 254471, so there are three keys before that.

The numbers in di are 3, 254468, 0, 254469, 0, 254470, 0
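
For reference, the ri, wi, and di values above can be decoded with a few lines of Go. This decoder is not part of the collector; it assumes ri and wi are little-endian uint64 indices and that di is a uint32 count followed by that many uint64 indices of currently-dispatched items, which is consistent with the hex dump shown.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

func main() {
	ri, _ := hex.DecodeString("07e2030000000000")
	fmt.Println("ri:", binary.LittleEndian.Uint64(ri)) // 254471

	wi, _ := hex.DecodeString("1904040000000000")
	fmt.Println("wi:", binary.LittleEndian.Uint64(wi)) // 263193

	di, _ := hex.DecodeString("0300000004e203000000000005e203000000000006e2030000000000")
	count := binary.LittleEndian.Uint32(di[:4])
	for i := 0; i < int(count); i++ {
		idx := binary.LittleEndian.Uint64(di[4+i*8 : 12+i*8])
		fmt.Println("dispatched:", idx) // 254468, 254469, 254470
	}
}
```

Under that reading, wi (263193) is one past the highest listed key (263192), and the three keys below ri (254471) are exactly the indices recorded in di, matching the decoding above.
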

@johngmyers

This particular pod is in CrashLoopBackOff because the container exits with code 2. The log lines preceding the exit report that it is over the hard memory limit.

@johngmyers

I think the livenessProbe was configured too aggressively.

@johngmyers

After giving the liveness probe an initial delay, both the lowest and highest key numbers have advanced. It's probably fair to say I'm not seeing any problem with the queue itself.

@swiatekm
Contributor Author

I'm going to close this for now. The fundamental problem appears to be the inability to delete items from the queue if there's no disk space available. This is not easy to reproduce, as it depends on the implementation details of bbolt's memory management. We also can't do much about it in the queue implementation itself. Maybe the storage client can try to look at remaining disk space and reject writes pre-emptively? Ideally, the embedded db we're using would handle this on its own.
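
As a rough illustration of that pre-emptive check, a storage client could stat the filesystem holding the queue directory before accepting a write and fail fast when free space drops below some reserve, leaving headroom for deletes and compaction. This is only a sketch of the idea, not an existing collector feature; the directory path and the 64 MiB reserve are hypothetical, and the syscall used is Linux-specific.

```go
package main

import (
	"fmt"
	"syscall"
)

// freeBytes reports the space available to unprivileged users on the
// filesystem containing path (Linux-specific).
func freeBytes(path string) (uint64, error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		return 0, err
	}
	return st.Bavail * uint64(st.Bsize), nil
}

func main() {
	const minHeadroom = 64 << 20 // hypothetical 64 MiB reserve for deletes/compaction

	free, err := freeBytes("/var/lib/storage/otc")
	if err != nil {
		fmt.Println("statfs failed:", err)
		return
	}
	if free < minHeadroom {
		fmt.Println("refusing new writes: only", free, "bytes free")
		return
	}
	fmt.Println("ok to write:", free, "bytes free")
}
```
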
