fix: prevent retention service from hanging #25121

Merged
merged 1 commit into from
Jul 1, 2024

Conversation

gwossum (Member) commented Jul 1, 2024

Fix issue that can cause the retention service to hang waiting on a Shard.Close call. When this occurs, no other shards will be deleted by the retention service. This is usually noticed as an increase in disk usage because old shards are not cleaned up.

The fix adds two new methods to `Store`: `SetShardNewReadersBlocked` and `InUse`. The retention service polls `InUse` to check whether a shard has active readers and skips in-use shards instead of hanging on them. `SetShardNewReadersBlocked` determines whether new read access may be granted to a shard. This is required to prevent race conditions between the `InUse` check and the deletion of the shard.
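
The interaction between the two methods can be sketched as follows. This is a minimal, self-contained illustration and not the code from this PR: the `shard` type, `Acquire`, `Release`, `TryDelete`, and `ErrNewReadersBlocked` are hypothetical names, and the real `Store` methods may have different signatures. The idea shown is that new readers are blocked before `InUse` is polled, so no reader can slip in between the check and the delete.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrNewReadersBlocked is a hypothetical error returned when read access is refused.
var ErrNewReadersBlocked = errors.New("new readers blocked")

// shard is a toy stand-in for a storage shard; it tracks only reader state.
type shard struct {
	mu      sync.Mutex
	readers int  // number of active readers
	blocked bool // when true, no new readers are admitted
}

// Acquire registers a new reader unless new readers are blocked.
func (s *shard) Acquire() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.blocked {
		return ErrNewReadersBlocked
	}
	s.readers++
	return nil
}

// Release unregisters a reader previously admitted by Acquire.
func (s *shard) Release() {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.readers > 0 {
		s.readers--
	}
}

// SetNewReadersBlocked controls whether Acquire may admit new readers.
func (s *shard) SetNewReadersBlocked(blocked bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.blocked = blocked
}

// InUse reports whether the shard currently has active readers.
func (s *shard) InUse() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.readers > 0
}

// TryDelete blocks new readers, then polls InUse. If the shard is busy it
// unblocks readers and reports false so the caller can retry later;
// otherwise it runs del and reports true.
func TryDelete(s *shard, del func()) bool {
	s.SetNewReadersBlocked(true)
	if s.InUse() {
		s.SetNewReadersBlocked(false)
		return false
	}
	del()
	return true
}

func main() {
	s := &shard{}
	_ = s.Acquire()
	fmt.Println(TryDelete(s, func() {})) // false: a reader is active
	s.Release()
	fmt.Println(TryDelete(s, func() {})) // true: no readers remain
}
```

Without the blocking step, a reader could be admitted after `InUse` returns false but before the shard is closed, which is the race the description refers to.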

If the retention service skips a shard because it is in use, the shard is checked again the next time the retention service runs and can be deleted then if it is no longer in use. If a shard is stuck in use, the retention service cannot delete it; this can be observed in the logs so that an operator can intervene manually. Other shards can still be deleted by the retention service even if one shard is stuck with readers.
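
The skip-and-retry behavior might look roughly like the loop below. This is a sketch under stated assumptions, not the retention service's actual code: `expiredShard`, `retentionPass`, and the log wording are hypothetical, and the in-use state is passed in as a plain field rather than polled from the store.

```go
package main

import "fmt"

// expiredShard is a hypothetical view of a shard past its retention period.
type expiredShard struct {
	ID    uint64
	InUse bool // would come from polling the store in a real service
}

// retentionPass deletes every expired shard that is not in use. Shards that
// are in use, or whose deletion fails, are skipped and returned so they can
// be retried on the next pass. One busy shard never blocks the others.
func retentionPass(shards []expiredShard, deleteFn func(id uint64) error) (deleted, skipped []uint64) {
	for _, sh := range shards {
		if sh.InUse {
			fmt.Printf("skipping shard %d: in use, will retry on next run\n", sh.ID)
			skipped = append(skipped, sh.ID)
			continue
		}
		if err := deleteFn(sh.ID); err != nil {
			fmt.Printf("failed to delete shard %d: %v\n", sh.ID, err)
			skipped = append(skipped, sh.ID)
			continue
		}
		deleted = append(deleted, sh.ID)
	}
	return deleted, skipped
}

func main() {
	shards := []expiredShard{{ID: 1}, {ID: 2, InUse: true}, {ID: 3}}
	deleted, skipped := retentionPass(shards, func(uint64) error { return nil })
	fmt.Println(deleted, skipped) // [1 3] [2]
}
```

Shard 2 is skipped and logged, while shards 1 and 3 are still deleted in the same pass.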

This is a port of ad68ec8 from master-1.x to main-2.x, then backported to 2.7 with a clean cherry-pick.

closes: #25118
(cherry picked from commit b4bd607) (cherry picked from commit cb8cfe3)

gwossum self-assigned this Jul 1, 2024
gwossum added the area/storage, area/2.x (OSS 2.0 related issues and PRs), and team/edge labels Jul 1, 2024
gwossum marked this pull request as ready for review July 1, 2024 18:52
davidby-influx (Contributor) left a comment

LGTM

gwossum merged commit e9e0f74 into 2.7 on Jul 1, 2024
25 checks passed