
[exporterhelper] Fix invalid write index updates in the persistent queue #8963

Merged

Conversation

@swiatekm (Contributor) commented Nov 20, 2023

Description:
Fixing a bug where the in-memory value of the persistent queue's write index would be updated even if writing to the storage failed. This normally wouldn't have any negative effect other than inflating the queue size temporarily, as the read loop would simply skip over the nonexistent record. However, in the case where the storage doesn't have any available space, the in-memory and in-storage write index could become significantly different, at which point a collector restart would leave the queue in an inconsistent state.

Worth noting that the same issue affects reading from the queue, but in that case the writes are very small, and in practice the storage will almost always have enough space to carry them out.
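For context, a minimal sketch of the ordering the fix establishes, written against the storage API visible in the review thread below. The method and field names (putItem, pq.writeIndex) are placeholders for illustration, not the actual exporterhelper source; the key and operation names are taken from the review diff further down.

// Sketch only: hypothetical method and field names; writeIndexKey,
// itemIndexToBytes, and the storage.Operation/Batch API appear in the
// review diff below.
func (pq *persistentQueue) putItem(ctx context.Context, itemKey string, reqBuf []byte) error {
	newIndex := pq.writeIndex + 1

	// Persist the item and the new write index in one storage transaction.
	ops := []storage.Operation{
		storage.SetOperation(writeIndexKey, itemIndexToBytes(newIndex)),
		storage.SetOperation(itemKey, reqBuf),
	}
	if err := pq.client.Batch(ctx, ops...); err != nil {
		// The storage write failed (for example, the disk is full). Returning
		// here without touching pq.writeIndex keeps the in-memory index in
		// sync with what is actually persisted.
		return err
	}

	// Advance the in-memory index only after the storage write succeeded.
	pq.writeIndex = newIndex
	return nil
}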

Link to tracking Issue: #8115

Testing:
The TestPersistentQueue_StorageFull test actually only passed by accident. Writing would leave one additional item in the put channel, then the first read would fail (as there is not enough space to do the read index and dispatched items writes), but subsequent reads would succeed, so the bugs would cancel out. I modified this test to check for the number of items in the queue after inserting them, and also to expect one fewer item to be returned.
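Roughly, the adjusted assertions have the shape sketched below; the Offer/Size calls and helper names here are illustrative assumptions rather than the actual test utilities in the repository.

// Illustrative only: method and helper names are assumed for the sketch.
for i := 0; i < numItems; i++ {
	// With a full storage, some Offer calls fail, and after the fix they no
	// longer bump the in-memory write index.
	_ = pq.Offer(context.Background(), newFakeRequest())
}
// New assertion: the reported queue size must match what was actually
// persisted, not the number of Offer attempts.
assert.Equal(t, expectedStoredItems, pq.Size())
// Draining the queue is then expected to return one fewer item than the
// pre-fix version of the test assumed.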

@swiatekm swiatekm force-pushed the fix/persistentstorage/index-updates branch from 655830a to f6ae621 Compare November 20, 2023 18:12
codecov bot commented Nov 20, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (4f464ce) 91.53% compared to head (4c5ed96) 91.55%.
Report is 7 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8963      +/-   ##
==========================================
+ Coverage   91.53%   91.55%   +0.01%     
==========================================
  Files         316      316              
  Lines       17111    17115       +4     
==========================================
+ Hits        15663    15670       +7     
+ Misses       1152     1150       -2     
+ Partials      296      295       -1     


@dmitryax (Member) commented:

Thanks for the bug fix.

About the second change, I’m not 100% sure that the proposed mode of handling erroneous situations is always better than the existing one.

Can you please submit separate PRs?

@swiatekm swiatekm force-pushed the fix/persistentstorage/index-updates branch from f6ae621 to cd6904a Compare November 21, 2023 10:16
@swiatekm (Contributor, Author) commented:

> Thanks for the bug fix.
>
> About the second change, I’m not 100% sure that the proposed mode of handling erroneous situations is always better than the existing one.
>
> Can you please submit separate PRs?

Can do, but I need to modify the test for full storage in a somewhat unintuitive way. I've done this and left a comment explaining the reasoning; let me know if it's clear.

@swiatekm swiatekm force-pushed the fix/persistentstorage/index-updates branch from cd6904a to 725821b Compare November 21, 2023 10:36
@swiatekm swiatekm marked this pull request as ready for review November 21, 2023 10:36
@swiatekm swiatekm requested review from a team and djaglowski November 21, 2023 10:36
@swiatekm swiatekm force-pushed the fix/persistentstorage/index-updates branch from 725821b to 8f5cb99 Compare November 21, 2023 10:36
Comment on lines 215 to 220
// Carry out a transaction where we both add the item and update the write index
setWriteIndexOp := storage.SetOperation(writeIndexKey, itemIndexToBytes(newIndex))
setItemOp := storage.SetOperation(itemKey, reqBuf)
if err := pq.client.Batch(ctx, setWriteIndexOp, setItemOp); err != nil {
	return err
}
Member:
I understand returning the error, but why change the way the operations were constructed?

Contributor (Author):
I wanted to write this by assigning to err in the condition, which is more readable with a shorter statement imo. So I defined the operations separately. Do you think it's worse now?

Member:
I think it just complicates the review, especially understanding what changed. Writing the code as you suggested makes sense in general, but here I feel it complicates things a bit (especially since you still shadow err). I would maybe write it like:

ops := []storage.Operation{
	storage.SetOperation(writeIndexKey, itemIndexToBytes(newIndex)),
	storage.SetOperation(itemKey, reqBuf),
}
if err := pq.client.Batch(ctx, ops...); err != nil {
	return err
}

Contributor (Author):
Yeah, that looks better. Applied this change and also renamed the error variable to avoid shadowing.
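For reference, the applied version presumably ends up roughly like this; the non-shadowing error name used here is a guess, not the exact identifier from the PR.

ops := []storage.Operation{
	storage.SetOperation(writeIndexKey, itemIndexToBytes(newIndex)),
	storage.SetOperation(itemKey, reqBuf),
}
// Renamed from err so it no longer shadows the outer error variable.
if storageErr := pq.client.Batch(ctx, ops...); storageErr != nil {
	return storageErr
}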

@dmitryax dmitryax merged commit c0deae5 into open-telemetry:main Nov 27, 2023
31 checks passed
@github-actions github-actions bot added this to the next release milestone Nov 27, 2023
@swiatekm swiatekm deleted the fix/persistentstorage/index-updates branch November 27, 2023 10:36
pantuza pushed a commit to pantuza/opentelemetry-collector that referenced this pull request Dec 8, 2023