return queue is full error if sized_channel is full #11063

Merged
merged 10 commits into open-telemetry:main from splunkericl:persistent_queue_deadlock on Sep 6, 2024

Conversation

splunkericl
Contributor

Description

This change fixes a potential deadlock bug for persistent queue.

There is a race condition in the persistent queue that causes `used` in `sizedChannel` to fall out of sync with `len(ch)`. Under a specific interleaving this deadlocks `Offer`. For example:

  1. Multiple consumers are calling Consume
  2. Multiple producers are calling Offer to insert into the queue
    a. All elements are taken by consumers; `ch` is empty
  3. One consumer completes Consume and calls onProcessingFinished
    a. Inside `sizedChannel`, syncSize is invoked and `used` is reset to 0 while other consumers are still waiting on the lock to consume
  4. More Offer calls insert elements -> `used` and `len(ch)` should now be equal
  5. As the consumers from step 3a complete, `used` is decreased -> `used` is now lower than `len(ch)`
    a. More Offer calls keep inserting because `used` is below capacity; however, `ch` is full.
    b. The goroutine calling Offer holds the mutex but cannot release it because the send on `ch` blocks.
    c. No consumer can acquire the mutex to complete the earlier onProcessingFinished.

This change makes `Offer` return an error when the channel is full instead of blocking until space frees up.
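
In essence, the offending send is made non-blocking so that a full channel surfaces an error instead of stalling the goroutine that holds the mutex. The following is a minimal sketch of the idea, not the collector's actual code: the names `sizedChannel`, `used`, and `ch` come from the description above, while `push`, `newSizedChannel`, and `errQueueIsFull` are assumed here for illustration.

```go
package queue

import (
	"errors"
	"sync"
)

// errQueueIsFull is what push returns when no slot is free (assumed name;
// the real collector defines its own "queue is full" error value).
var errQueueIsFull = errors.New("sending queue is full")

// sizedChannel is a simplified model of the structure described above:
// a buffered channel guarded by a mutex plus a separate size counter that
// can drift from len(ch) under the race described in this PR.
type sizedChannel[T any] struct {
	mu   sync.Mutex
	used int64
	cap  int64
	ch   chan T
}

func newSizedChannel[T any](capacity int64) *sizedChannel[T] {
	return &sizedChannel[T]{cap: capacity, ch: make(chan T, capacity)}
}

// push tries to enqueue el. Instead of blocking on a full channel while
// holding the mutex (the deadlock described above), it fails fast.
func (c *sizedChannel[T]) push(el T) error {
	c.mu.Lock()
	defer c.mu.Unlock()

	if c.used+1 > c.cap {
		return errQueueIsFull
	}

	select {
	case c.ch <- el:
		c.used++
		return nil
	default:
		// used and len(ch) have drifted apart; returning an error here is
		// what keeps the Offer goroutine from waiting forever while it
		// holds the mutex that consumers need.
		return errQueueIsFull
	}
}
```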

Link to tracking issue

Fixes #11015

Testing

  • Added a concurrent test for the persistent queue that can reproduce the problem (note: it may need to be re-run ~100 times because the race condition does not reproduce consistently); a simplified sketch of such a test is shown below.
  • Added a unit test for `sizedChannel`
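
For illustration only, here is a minimal sketch of this style of concurrent producer/consumer test, written against the simplified `sizedChannel` model from the sketch above rather than the actual persistent queue test added in this PR:

```go
package queue

import (
	"sync"
	"testing"
)

// TestPushDoesNotDeadlockWhenFull hammers the simplified sizedChannel with
// concurrent producers and consumers. With a blocking send inside push this
// pattern could hang; with the error-returning push it must always finish.
func TestPushDoesNotDeadlockWhenFull(t *testing.T) {
	c := newSizedChannel[int](10)

	var wg sync.WaitGroup
	for p := 0; p < 5; p++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				_ = c.push(i) // errQueueIsFull is an acceptable outcome
			}
		}()
	}
	for q := 0; q < 5; q++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				select {
				case <-c.ch:
					c.mu.Lock()
					c.used--
					c.mu.Unlock()
				default: // nothing to consume right now
				}
			}
		}()
	}
	wg.Wait()
}
```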

Documentation

Added a comment in the affected block explaining the behavior.

@splunkericl splunkericl requested review from a team and codeboten September 5, 2024 18:00

codecov bot commented Sep 5, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.21%. Comparing base (708fc1a) to head (4497374).
Report is 12 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #11063      +/-   ##
==========================================
- Coverage   92.23%   92.21%   -0.02%     
==========================================
  Files         405      414       +9     
  Lines       19089    19728     +639     
==========================================
+ Hits        17606    18193     +587     
- Misses       1123     1165      +42     
- Partials      360      370      +10     

☔ View full report in Codecov by Sentry.

@splunkericl splunkericl force-pushed the persistent_queue_deadlock branch from 498c230 to 678ef6a on September 5, 2024 20:19
@splunkericl splunkericl force-pushed the persistent_queue_deadlock branch from 678ef6a to 09ce9d3 on September 5, 2024 20:33
@splunkericl
Contributor Author

Fixed all the checks. @dmitryax @mx-psi can you guys take a look at this PR? The change is small and mostly just testing.

@splunkericl
Contributor Author

cc @sfc-gh-sili as well, since our Snowflake partner is helping out.

@mx-psi mx-psi requested a review from swiatekm September 6, 2024 09:44
Member

@dmitryax dmitryax left a comment


LGTM

@dmitryax dmitryax merged commit 2c22ed7 into open-telemetry:main Sep 6, 2024
49 checks passed
@github-actions github-actions bot added this to the next release milestone Sep 6, 2024
splunkericl added a commit to splunkericl/opentelemetry-collector that referenced this pull request Sep 6, 2024
splunkericl added a commit to splunkericl/opentelemetry-collector that referenced this pull request Oct 22, 2024