bug: barrier stuck for more than 40mins #10210

Closed
Tracked by #6640
hzxa21 opened this issue Jun 7, 2023 · 3 comments
Assignees
Labels
no-issue-activity type/bug Something isn't working
Milestone

Comments

@hzxa21
Collaborator

hzxa21 commented Jun 7, 2023

Describe the bug

>> Actor 1356
Actor 1356: `sink_xxxx` [2602.358s]
  Epoch <initial> [!!! 2602.308s]
    SinkExecutor 54C00000001 (actor 1356, executor 1) [!!! 2602.308s]
      expect_first_barrier [!!! 2602.308s]
        BackfillExecutor (actor 1356, executor 10015) [!!! 2602.308s]

Restarting CN/Meta doesn't fix the issue.

Backfill uses an epoch larger than the max committed epoch and an unbounded key range of the upstream mv to read from storage via the storage table (calling next only once, similar to LIMIT 1) in order to check whether the upstream mv is empty. This differs from other usages of the storage table and state table, but since it uses HummockReadEpoch::NoWait, it should theoretically work. However, this seems to be the cause of the stuck barrier.
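To make the shape of that emptiness check concrete, here is a minimal sketch in Python. It is not RisingWave code: `full_range_scan` and `is_upstream_empty` are hypothetical names, and a plain dict stands in for the storage table. It only illustrates the "unbounded range, call next once" pattern described above.

```python
from typing import Dict, Iterator, Tuple


def full_range_scan(table: Dict[bytes, bytes]) -> Iterator[Tuple[bytes, bytes]]:
    """Hypothetical stand-in for an unbounded key-range scan, in key order."""
    for key in sorted(table):
        yield key, table[key]


def is_upstream_empty(rows: Iterator[Tuple[bytes, bytes]]) -> bool:
    """Pull at most one row from the scan (like LIMIT 1) and report emptiness."""
    for _ in rows:
        return False  # the first row proves the table is non-empty
    return True


empty_result = is_upstream_empty(full_range_scan({}))
non_empty_result = is_upstream_empty(full_range_scan({b"k1": b"v1", b"k2": b"v2"}))
```

Even though only one row is consumed, the scan itself is opened over the whole key range, which is why the number of SST metas that may need to be consulted matters here.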

#10125 can prevent this issue from happening but the underlying root cause is still unknown.

We originally suspected that this was caused by slow I/O when reading the relevant SST meta. However, although we did have a large number of L0 SSTs (~1350, see the graph pasted below) due to #10209, the math doesn't add up. Assume the worst-case scenario: all SSTs overlap with each other so we need to fetch all of the L0 SSTs, each meta fetch hits the object store, and object store read latency is 500ms. Even then, it would take only ~700s to fetch all of them, far less than the ~2600s shown in the trace above. Note that this assumption is already far worse than the actual scenario in this issue, since L0->L0 compaction is not stuck and a large portion of L0 is in non-overlapping sub-levels. In this case, the number of SST metas needed is far smaller.
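The back-of-the-envelope estimate can be checked with a few lines of Python. This is only a sketch of the arithmetic: the SST count, the 500ms latency, and the stuck duration are taken from this issue, while the function name and the assumption that meta fetches happen strictly one after another are illustrative.

```python
def worst_case_meta_fetch_seconds(num_ssts: int, latency_s: float) -> float:
    """Upper bound if every SST meta is fetched one by one from the object store."""
    return num_ssts * latency_s


L0_SST_COUNT = 1350             # approximate L0 SST count reported in the issue
OBJECT_STORE_LATENCY_S = 0.5    # assumed 500ms per meta fetch
OBSERVED_STUCK_S = 2602.308     # duration shown in the actor trace

worst_case = worst_case_meta_fetch_seconds(L0_SST_COUNT, OBJECT_STORE_LATENCY_S)
ratio = OBSERVED_STUCK_S / worst_case

print(f"worst-case meta fetch time: {worst_case:.0f}s")
print(f"observed stuck time is {ratio:.1f}x the worst case")
```

The worst case comes out to 675s, which the issue rounds to ~700s. The observed stall is several times longer, so slow SST meta reads alone do not explain it.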

To Reproduce

No response

Expected behavior

No response

Additional context

[image: L0 SST count graph]

@hzxa21 hzxa21 added the type/bug Something isn't working label Jun 7, 2023
@github-actions github-actions bot added this to the release-0.20 milestone Jun 7, 2023
@hzxa21
Collaborator Author

hzxa21 commented Jun 7, 2023

cc @Li0k @Little-Wallace @wenym1

@github-actions
Contributor

github-actions bot commented Aug 7, 2023

This issue has been open for 60 days with no activity. Could you please update the status? Feel free to continue discussion or close as not planned.

@Li0k
Contributor

Li0k commented Aug 8, 2023

After #10584, no more stuck barriers have been observed, so closing the issue for now.

2 participants