
[exporterhelper] Fix invalid write index updates in the persistent queue #8963

Merged

Conversation

@swiatekm (Contributor) commented Nov 20, 2023

Description:
Fixing a bug where the in-memory value of the persistent queue's write index would be updated even if writing to the storage failed. This normally wouldn't have any negative effect other than inflating the queue size temporarily, as the read loop would simply skip over the nonexistent record. However, in the case where the storage doesn't have any available space, the in-memory and in-storage write index could become significantly different, at which point a collector restart would leave the queue in an inconsistent state.

Worth noting that the same issue affects reading from the queue, but in that case the writes are very small, and in practice the storage will almost always have enough space to carry them out.
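For context, a minimal sketch of the ordering the fix establishes, written against the storage API visible in the review thread below. The method and field names (putItem, pq.writeIndex) are placeholders for illustration, not the actual exporterhelper source; the key and operation names are taken from the review diff further down.

// Sketch only: hypothetical method and field names; writeIndexKey,
// itemIndexToBytes, and the storage.Operation/Batch API appear in the
// review diff below.
func (pq *persistentQueue) putItem(ctx context.Context, itemKey string, reqBuf []byte) error {
	newIndex := pq.writeIndex + 1

	// Persist the item and the new write index in one storage transaction.
	ops := []storage.Operation{
		storage.SetOperation(writeIndexKey, itemIndexToBytes(newIndex)),
		storage.SetOperation(itemKey, reqBuf),
	}
	if err := pq.client.Batch(ctx, ops...); err != nil {
		// The storage write failed (for example, the disk is full). Returning
		// here without touching pq.writeIndex keeps the in-memory index in
		// sync with what is actually persisted.
		return err
	}

	// Advance the in-memory index only after the storage write succeeded.
	pq.writeIndex = newIndex
	return nil
}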

Link to tracking Issue: #8115

Testing:
The TestPersistentQueue_StorageFull test actually only passed by accident. Writing would leave one additional item in the put channel, then the first read would fail (as there is not enough space to do the read index and dispatched items writes), but subsequent reads would succeed, so the bugs would cancel out. I modified this test to check for the number of items in the queue after inserting them, and also to expect one fewer item to be returned.
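Roughly, the adjusted assertions have the shape sketched below; the Offer/Size calls and helper names here are illustrative assumptions rather than the actual test utilities in the repository.

// Illustrative only: method and helper names are assumed for the sketch.
for i := 0; i < numItems; i++ {
	// With a full storage, some Offer calls fail, and after the fix they no
	// longer bump the in-memory write index.
	_ = pq.Offer(context.Background(), newFakeRequest())
}
// New assertion: the reported queue size must match what was actually
// persisted, not the number of Offer attempts.
assert.Equal(t, expectedStoredItems, pq.Size())
// Draining the queue is then expected to return one fewer item than the
// pre-fix version of the test assumed.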

@swiatekm swiatekm force-pushed the fix/persistentstorage/index-updates branch from 655830a to f6ae621 Compare November 20, 2023 18:12
codecov bot commented Nov 20, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (4f464ce) 91.53% compared to head (4c5ed96) 91.55%.
Report is 7 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8963      +/-   ##
==========================================
+ Coverage   91.53%   91.55%   +0.01%     
==========================================
  Files         316      316              
  Lines       17111    17115       +4     
==========================================
+ Hits        15663    15670       +7     
+ Misses       1152     1150       -2     
+ Partials      296      295       -1     


@dmitryax (Member) commented:

Thanks for the bug fix.

About the second change, I’m not 100% sure that the proposed mode of handling erroneous situations is always better than the existing one.

Can you please submit separate PRs?

@swiatekm swiatekm force-pushed the fix/persistentstorage/index-updates branch from f6ae621 to cd6904a Compare November 21, 2023 10:16
@swiatekm (Contributor, Author) commented:

> Thanks for the bug fix.
>
> About the second change, I’m not 100% sure that the proposed mode of handling erroneous situations is always better than the existing one.
>
> Can you please submit separate PRs?

Can do, but I need to modify the test for full storage in a somewhat unintuitive way. I've done this and left a comment explaining the reasoning; let me know if it's clear.

@swiatekm swiatekm force-pushed the fix/persistentstorage/index-updates branch from cd6904a to 725821b Compare November 21, 2023 10:36
@swiatekm swiatekm marked this pull request as ready for review November 21, 2023 10:36
@swiatekm swiatekm requested review from a team and djaglowski November 21, 2023 10:36
@swiatekm swiatekm force-pushed the fix/persistentstorage/index-updates branch from 725821b to 8f5cb99 Compare November 21, 2023 10:36
Comment on lines 215 to 220
// Carry out a transaction where we both add the item and update the write index
setWriteIndexOp := storage.SetOperation(writeIndexKey, itemIndexToBytes(newIndex))
setItemOp := storage.SetOperation(itemKey, reqBuf)
if err := pq.client.Batch(ctx, setWriteIndexOp, setItemOp); err != nil {
	return err
}
Member:
I understand returning the error, but why change the way the operations were constructed?

Contributor (Author):
I wanted to write this by assigning to err in the condition, which is more readable with a shorter statement imo. So I defined the operations separately. Do you think it's worse now?

Member:
I think it just complicates the review, especially understanding what changed. Writing the code as you suggested makes sense in general, but here I feel it complicates things a bit (especially since you still shadow err). I would maybe write it like:

ops := []storage.Operation{
	storage.SetOperation(writeIndexKey, itemIndexToBytes(newIndex)),
	storage.SetOperation(itemKey, reqBuf),
}
if err := pq.client.Batch(ctx, ops...); err != nil {
	return err
}

Contributor (Author):
Yeah, that looks better. Applied this change and also renamed the error variable to avoid shadowing.
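For reference, the applied version presumably ends up roughly like this; the non-shadowing error name used here is a guess, not the exact identifier from the PR.

ops := []storage.Operation{
	storage.SetOperation(writeIndexKey, itemIndexToBytes(newIndex)),
	storage.SetOperation(itemKey, reqBuf),
}
// Renamed from err so it no longer shadows the outer error variable.
if storageErr := pq.client.Batch(ctx, ops...); storageErr != nil {
	return storageErr
}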

@dmitryax dmitryax merged commit c0deae5 into open-telemetry:main Nov 27, 2023
31 checks passed
@github-actions github-actions bot added this to the next release milestone Nov 27, 2023
@swiatekm swiatekm deleted the fix/persistentstorage/index-updates branch November 27, 2023 10:36
pantuza pushed a commit to pantuza/opentelemetry-collector that referenced this pull request Dec 8, 2023