Persistent queue fails to delete items from disk in some circumstances #8115
Comments
@jpkrohling @djaglowski @frzifus you asked to be included during yesterday's SIG meeting.
…eue (#8963)

**Description:** Fixing a bug where the in-memory value of the persistent queue's write index would be updated even if the write to storage failed. Normally this would have no negative effect other than temporarily inflating the queue size, as the read loop would simply skip over the nonexistent record. However, when the storage has no available space, the in-memory and in-storage write indices could diverge significantly, at which point a collector restart would leave the queue in an inconsistent state. Worth noting that the same issue affects reading from the queue, but in that case the writes are very small, and in practice the storage will almost always have enough space to carry them out.

**Link to tracking Issue:** #8115

**Testing:** The `TestPersistentQueue_StorageFull` test previously passed only by accident: writing would leave one additional item in the put channel, then the first read would fail (as there is not enough space to perform the read-index and dispatched-items writes), but subsequent reads would succeed, so the two bugs canceled out. I modified this test to check the number of items in the queue after inserting them, and to expect one fewer item to be returned.
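A minimal sketch of the ordering the fix describes, with hypothetical names (the real persistent queue code is more involved): the write index is only advanced after the storage write succeeds, so a full disk can no longer make the in-memory and on-disk views of the queue diverge.

```go
package main

import (
	"context"
	"strconv"
	"sync"
)

// storageClient mirrors the shape of the collector's storage client Set call.
type storageClient interface {
	Set(ctx context.Context, key string, value []byte) error
}

// persistentQueue is a sketch only; names and structure are illustrative,
// not the actual implementation.
type persistentQueue struct {
	mu         sync.Mutex
	writeIndex uint64
	client     storageClient
}

func (pq *persistentQueue) put(ctx context.Context, item []byte) error {
	pq.mu.Lock()
	defer pq.mu.Unlock()

	// Attempt the storage write first, keyed by the current write index.
	key := strconv.FormatUint(pq.writeIndex, 10)
	if err := pq.client.Set(ctx, key, item); err != nil {
		// Leave writeIndex untouched on failure, so the in-memory value
		// stays consistent with what is actually on disk across restarts.
		return err
	}
	// Advance the index only once the item is durably stored.
	pq.writeIndex++
	return nil
}
```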
We're seeing something that looks like this with splunk-otel-collector 0.102.0, which claims to have changes from opentelemetry-collector 0.102.0. The persistent queue file is upwards of 100 GB, yet the Prometheus metrics show no backlog.
@johngmyers do you have access to the queue file? It's possible to inspect it with bbolt's CLI and check whether there are keys outside of the queue's normal range.
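For reference, a typical inspection session with the bbolt CLI; the file path here is a placeholder, and the bucket name reflects my understanding of the file_storage extension's default (`default`):

```sh
# Install the CLI that ships with go.etcd.io/bbolt.
go install go.etcd.io/bbolt/cmd/bbolt@latest

# List the buckets in the queue file (path is an example).
bbolt buckets /var/lib/otelcol/file_storage/exporter_otlp_traces

# Dump the keys; numeric keys well below the stored read index would
# indicate orphaned records outside the queue's normal range.
bbolt keys /var/lib/otelcol/file_storage/exporter_otlp_traces default
```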
The numbers in …
This particular pod is in CrashLoopBackOff, from the container exiting with code 2. Log lines preceding the exit are about it being over the hard memory limit.
I think the …
After giving the liveness probe an initial delay, both the lowest and highest key numbers have advanced. It's probably fair to say I'm not seeing any problem with the queue itself.
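For anyone else hitting the same symptom: the tweak amounts to a standard probe delay. An illustrative snippet, assuming the health_check extension on its default port:

```yaml
# Illustrative only: give the collector time to replay a large queue
# file before liveness failures can restart it.
livenessProbe:
  httpGet:
    path: /
    port: 13133   # health_check extension default
  initialDelaySeconds: 300
```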
I'm going to close this for now. The fundamental problem appears to be the inability to delete items from the queue if there's no disk space available. This is not easy to reproduce, as it depends on the implementation details of bbolt's memory management. We also can't do much about it in the queue implementation itself. Maybe the storage client can try to look at remaining disk space and reject writes pre-emptively? Ideally, the embedded db we're using would handle this on its own.
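To make that last idea concrete, a sketch of the kind of pre-flight check a storage client could run; this is purely hypothetical (not something the collector does today) and assumes a Unix target with golang.org/x/sys/unix:

```go
package main

import "golang.org/x/sys/unix"

// hasFreeSpace reports whether the filesystem holding dir has at least
// need bytes available to unprivileged writers. Hypothetical helper,
// not part of the collector or the storage extension.
func hasFreeSpace(dir string, need uint64) (bool, error) {
	var st unix.Statfs_t
	if err := unix.Statfs(dir, &st); err != nil {
		return false, err
	}
	// Available blocks times block size gives usable bytes.
	return st.Bavail*uint64(st.Bsize) >= need, nil
}
```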
Describe the bug
The persistent queue pulls items from the on-disk database but fails to delete them after they've been successfully sent. As a result, the database accumulates orphaned data that is never cleaned up.
Steps to reproduce
I haven't been able to reproduce this, but I have had a user send me their DB file, and have confirmed it contained items with indices less than the saved read index.
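A rough sketch of that check, for anyone who wants to run it against their own DB file. The bucket and key layout assumed here (a `default` bucket, the read index under `ri` as a little-endian uint64, item keys as decimal-formatted indices) reflects my understanding of the file_storage extension and persistent queue internals and may not match every version:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"log"
	"os"
	"strconv"

	bolt "go.etcd.io/bbolt"
)

func main() {
	db, err := bolt.Open(os.Args[1], 0o600, &bolt.Options{ReadOnly: true})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	err = db.View(func(tx *bolt.Tx) error {
		b := tx.Bucket([]byte("default"))
		if b == nil {
			return fmt.Errorf("no %q bucket", "default")
		}
		riBytes := b.Get([]byte("ri")) // assumed read-index key
		if len(riBytes) != 8 {
			return fmt.Errorf("unexpected read index value: %x", riBytes)
		}
		ri := binary.LittleEndian.Uint64(riBytes)
		return b.ForEach(func(k, _ []byte) error {
			// Assumed layout: item keys are decimal-formatted indices.
			if idx, perr := strconv.ParseUint(string(k), 10, 64); perr == nil && idx < ri {
				fmt.Printf("orphaned item at index %d (read index is %d)\n", idx, ri)
			}
			return nil
		})
	})
	if err != nil {
		log.Fatal(err)
	}
}
```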
What version did you use?
0.77.0
What config did you use?
The relevant part is:
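(The snippet itself didn't make it into this copy of the issue; the following is an illustrative reconstruction of the kind of setup involved, i.e. an exporter's sending_queue pointed at the file_storage extension, with placeholder endpoint and path.)

```yaml
extensions:
  file_storage:
    directory: /var/lib/otelcol/file_storage

exporters:
  otlp:
    endpoint: example.com:4317
    sending_queue:
      enabled: true
      storage: file_storage

service:
  extensions: [file_storage]
```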
Environment
The report I have is from AWS EKS 1.23 with the default storage driver.
Additional context
#7396 should technically help address similar problems, but the issue that change fixes should be fairly difficult to trigger, whereas this has happened quite consistently for this specific user under low-disk-space conditions.
I'm filing this in the hope that I can get more reports of this behaviour to help verify that it still occurs, and hopefully get to the bottom of the issue.