Persistent queue fails to delete items from disk in some circumstances #8115

Closed
swiatekm opened this issue Jul 20, 2023 · 9 comments
Labels
bug Something isn't working


@swiatekm
Contributor

swiatekm commented Jul 20, 2023

Describe the bug
The persistent queue pulls items from the on-disk database, but fails to delete them after they've been successfully sent. As a result, the DB accumulates orphaned data that never gets deleted.

Steps to reproduce
I haven't been able to reproduce this myself, but a user sent me their DB file, and I confirmed it contained items with indices lower than the saved read index.

What version did you use?
0.77.0

What config did you use?
The relevant part is:

extensions:
  file_storage:
    directory: /var/lib/storage/otc
    timeout: 10s
    compaction:
      on_rebound: true
      directory: /tmp

Environment
The report I have is from AWS EKS 1.23 with the default storage driver.

Additional context
#7396 should technically help address similar problems, but the issue that change fixes should be fairly difficult to trigger, whereas this has happened quite consistently for this specific user under low disk space conditions.

I'm filing this in the hope of collecting more reports of this behaviour, verifying that it still occurs, and getting to the bottom of the issue.

swiatekm added the bug label on Jul 20, 2023
@swiatekm
Contributor Author

@jpkrohling @djaglowski @frzifus you asked to be included during yesterday's SIG meeting.

dmitryax pushed a commit that referenced this issue Nov 27, 2023
…eue (#8963)

**Description:**
Fixing a bug where the in-memory value of the persistent queue's write
index would be updated even if writing to the storage failed. This
normally wouldn't have any negative effect other than inflating the
queue size temporarily, as the read loop would simply skip over the
nonexistent record. However, in the case where the storage doesn't have
any available space, the in-memory and in-storage write index could
become significantly different, at which point a collector restart would
leave the queue in an inconsistent state.

Worth noting that the same issue affects reading from the queue, but in
that case the writes are very small, and in practice the storage will
almost always have enough space to carry them out.

**Link to tracking Issue:** #8115

**Testing:**
The `TestPersistentQueue_StorageFull` test actually only passed by
accident. Writing would leave one additional item in the put channel,
then the first read would fail (as there is not enough space to do the
read index and dispatched items writes), but subsequent reads would
succeed, so the bugs would cancel out. I modified this test to check for
the number of items in the queue after inserting them, and also to
expect one fewer item to be returned.
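
To make the ordering concrete, here is a minimal Go sketch of the invariant the fix above establishes. It is not the collector's actual code: storageClient, persistentQueue, and put are invented stand-ins, and the actual implementation details (batched writes, key encoding, sizing) differ.

```go
package main

import (
	"context"
	"encoding/binary"
	"errors"
	"fmt"
)

// storageClient stands in for the collector's storage client interface.
type storageClient interface {
	Set(ctx context.Context, key string, value []byte) error
}

// mapClient is an in-memory client used only to make the sketch runnable.
type mapClient struct {
	data map[string][]byte
	full bool // simulates "no space left on device"
}

func (m *mapClient) Set(_ context.Context, key string, value []byte) error {
	if m.full {
		return errors.New("no space left on device")
	}
	m.data[key] = value
	return nil
}

type persistentQueue struct {
	client     storageClient
	writeIndex uint64 // in-memory copy of the "wi" key
}

// put persists the item and the advanced write index first, and only then
// advances the in-memory index, so a failed write leaves both in sync.
func (pq *persistentQueue) put(ctx context.Context, item []byte) error {
	itemKey := fmt.Sprintf("%d", pq.writeIndex)
	if err := pq.client.Set(ctx, itemKey, item); err != nil {
		return err // e.g. disk full: writeIndex stays untouched
	}
	wi := make([]byte, 8)
	binary.LittleEndian.PutUint64(wi, pq.writeIndex+1)
	if err := pq.client.Set(ctx, "wi", wi); err != nil {
		return err
	}
	pq.writeIndex++
	return nil
}

func main() {
	client := &mapClient{data: map[string][]byte{}}
	pq := &persistentQueue{client: client}

	_ = pq.put(context.Background(), []byte("ok")) // succeeds, writeIndex -> 1
	client.full = true
	_ = pq.put(context.Background(), []byte("lost")) // fails, writeIndex stays 1
	fmt.Println("in-memory write index:", pq.writeIndex)
}
```
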
pantuza pushed a commit to pantuza/opentelemetry-collector that referenced this issue Dec 8, 2023
…eue (open-telemetry#8963)

@johngmyers

We're seeing something that looks like this with splunk-otel-collector 0.102.0, which claims to include the changes from opentelemetry-collector 0.102.0. The persistent queue file has grown past 100 GB, yet the Prometheus metrics show no backlog.

@swiatekm
Contributor Author

swiatekm commented Aug 20, 2024

@johngmyers do you have access to the queue file? It's possible to inspect it with bbolt's CLI and check if there are keys outside of the queue's normal range.

@johngmyers

johngmyers commented Aug 21, 2024

$ sudo ~/bbolt keys exporter_splunk_hec_platform_logs_logs default
254468
254469
254470
254471
254472
254473
[...]
263190
263191
263192
di
ri
wi
$ sudo ~/bbolt get -format hex exporter_splunk_hec_platform_logs_logs default di
0300000004e203000000000005e203000000000006e2030000000000
$ sudo ~/bbolt get -format hex exporter_splunk_hec_platform_logs_logs default ri
07e2030000000000
$ sudo ~/bbolt get -format hex exporter_splunk_hec_platform_logs_logs default wi
1904040000000000

@johngmyers

ri decodes to 254471, so there are three keys before that.

The numbers in di are 3, 254468, 0, 254469, 0, 254470, 0
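
For reference, the ri, wi, and di values above can be decoded with a few lines of Go. This decoder is not part of the collector; it assumes ri and wi are little-endian uint64 indices and that di is a uint32 count followed by that many uint64 indices of currently-dispatched items, which is consistent with the hex dump shown.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

func main() {
	ri, _ := hex.DecodeString("07e2030000000000")
	fmt.Println("ri:", binary.LittleEndian.Uint64(ri)) // 254471

	wi, _ := hex.DecodeString("1904040000000000")
	fmt.Println("wi:", binary.LittleEndian.Uint64(wi)) // 263193

	di, _ := hex.DecodeString("0300000004e203000000000005e203000000000006e2030000000000")
	count := binary.LittleEndian.Uint32(di[:4])
	for i := 0; i < int(count); i++ {
		idx := binary.LittleEndian.Uint64(di[4+i*8 : 12+i*8])
		fmt.Println("dispatched:", idx) // 254468, 254469, 254470
	}
}
```

Under that reading, wi (263193) is one past the highest listed key (263192), and the three keys below ri (254471) are exactly the indices recorded in di, matching the decoding above.
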

@johngmyers

This particular pod is in CrashLoopBackOff because the container exits with code 2. The log lines preceding the exit report that it is over the hard memory limit.

@johngmyers

I think the livenessProbe was configured too aggressively.

@johngmyers

After giving the liveness probe an initial delay, both the lowest and highest key numbers have advanced. It's probably fair to say I'm not seeing any problem with the queue itself.

@swiatekm
Contributor Author

I'm going to close this for now. The fundamental problem appears to be the inability to delete items from the queue if there's no disk space available. This is not easy to reproduce, as it depends on the implementation details of bbolt's memory management. We also can't do much about it in the queue implementation itself. Maybe the storage client can try to look at remaining disk space and reject writes pre-emptively? Ideally, the embedded db we're using would handle this on its own.
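
As a rough illustration of that pre-emptive check, a storage client could stat the filesystem holding the queue directory before accepting a write and fail fast when free space drops below some reserve, leaving headroom for deletes and compaction. This is only a sketch of the idea, not an existing collector feature; the directory path and the 64 MiB reserve are hypothetical, and the syscall used is Linux-specific.

```go
package main

import (
	"fmt"
	"syscall"
)

// freeBytes reports the space available to unprivileged users on the
// filesystem containing path (Linux-specific).
func freeBytes(path string) (uint64, error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		return 0, err
	}
	return st.Bavail * uint64(st.Bsize), nil
}

func main() {
	const minHeadroom = 64 << 20 // hypothetical 64 MiB reserve for deletes/compaction

	free, err := freeBytes("/var/lib/storage/otc")
	if err != nil {
		fmt.Println("statfs failed:", err)
		return
	}
	if free < minHeadroom {
		fmt.Println("refusing new writes: only", free, "bytes free")
		return
	}
	fmt.Println("ok to write:", free, "bytes free")
}
```
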
