bug: barrier stuck for more than 40mins #10210

hzxa21 · 2023-06-07T06:09:17Z

Describe the bug

>> Actor 1356
Actor 1356: `sink_xxxx` [2602.358s]
  Epoch <initial> [!!! 2602.308s]
    SinkExecutor 54C00000001 (actor 1356, executor 1) [!!! 2602.308s]
      expect_first_barrier [!!! 2602.308s]
        BackfillExecutor (actor 1356, executor 10015) [!!! 2602.308s]

Restarting CN/Meta doesn't fix the issue.

Backfill uses an epoch larger than the max committed epoch and an unbounded key range of the upstream mv to read from storage via the storage table (calling next only once, similar to LIMIT 1) to check whether the upstream mv is empty. This differs from other uses of the storage table and state table, but since it uses HummockReadEpoch::NoWait, it should theoretically work. However, this seems to be the cause of the stuck barrier.
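To make the "call next once" pattern concrete, here is a toy model in Python. It is not RisingWave code: `scan_unbounded` and `is_upstream_empty` are hypothetical names, and the lazy generator only stands in for a storage-table iterator over an unbounded key range. It shows that an emptiness check of this shape pulls at most one row, however large the range is.

```python
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[bytes, bytes]


def scan_unbounded(sorted_rows: Iterable[Row], reads: List[bytes]) -> Iterator[Row]:
    """Hypothetical stand-in for an iterator over an unbounded key range.

    Rows are produced lazily; every key actually pulled is recorded in `reads`.
    """
    for key, value in sorted_rows:
        reads.append(key)
        yield key, value


def is_upstream_empty(row_iter: Iterator[Row]) -> bool:
    """Pull at most one row (like LIMIT 1) and report whether none exists."""
    return next(row_iter, None) is None


# An empty upstream yields nothing on the first pull.
empty_reads: List[bytes] = []
print(is_upstream_empty(scan_unbounded([], empty_reads)))

# A non-empty upstream stops after the first row; the rest is never read.
reads: List[bytes] = []
rows = [(b"a", b"1"), (b"b", b"2"), (b"c", b"3")]
print(is_upstream_empty(scan_unbounded(rows, reads)), reads)
```

In this model the check costs one row read. In the real system, producing that first row may still require locating it across many overlapping SSTs, which is why the meta-fetch cost is the first suspect.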

#10125 can prevent this issue from happening but the underlying root cause is still unknown.

We originally suspected that this was caused by slow I/O when reading the relevant SST metas. However, although we did have a large number of L0 SSTs (~1350, see the graph pasted below) due to #10209, the math doesn't add up. Assume the worst case: all SSTs overlap with each other, so we need to fetch the metas of all L0 SSTs, every meta fetch hits the object store, and object store read latency is 500ms. Even then, it would take only ~700s to fetch all of them, far less than the 2600s+ stall shown in the trace above. Note that this assumption is already far worse than the actual scenario in this issue, since L0->L0 compaction is not stuck and a large portion of L0 is in non-overlapping sub-levels, so the number of SST metas needed is far smaller.
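The worst-case estimate can be checked with a few lines of Python. The inputs are the figures quoted in this issue (~1350 L0 SSTs, 500ms per object store read, the 2602.308s stall from the trace). The function name and the strictly sequential fetch model are assumptions made for illustration.

```python
def worst_case_meta_fetch_seconds(num_ssts: int, latency_s: float) -> float:
    """Upper bound: every SST meta is fetched one after another from object store."""
    return num_ssts * latency_s


L0_SST_COUNT = 1350            # approximate L0 SST count reported in the issue
OBJECT_STORE_LATENCY_S = 0.5   # assumed per-read latency (500ms)
OBSERVED_STALL_S = 2602.308    # stall duration from the await-tree dump

estimate = worst_case_meta_fetch_seconds(L0_SST_COUNT, OBJECT_STORE_LATENCY_S)
unexplained = OBSERVED_STALL_S - estimate

print(f"worst-case meta fetch time: {estimate:.0f}s")
print(f"observed stall:             {OBSERVED_STALL_S:.0f}s")
print(f"not explained by meta I/O:  {unexplained:.0f}s")
```

Under these assumptions the bound is 675s (rounded to ~700s above), which leaves most of the observed stall unexplained by SST meta fetches alone.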

To Reproduce

No response

Expected behavior

No response

Additional context


hzxa21 · 2023-06-07T06:11:48Z

cc @Li0k @Little-Wallace @wenym1

github-actions · 2023-08-07T01:53:19Z

This issue has been open for 60 days with no activity. Could you please update the status? Feel free to continue discussion or close as not planned.

Li0k · 2023-08-08T05:34:34Z

No more stuck barriers have been observed since #10584, so closing the issue for now.

hzxa21 added the type/bug Something isn't working label Jun 7, 2023

github-actions bot added this to the release-0.20 milestone Jun 7, 2023

hzxa21 assigned Li0k Jun 7, 2023

lmatz mentioned this issue Jun 7, 2023

Tracking: Critical Performance & Stability Issues #6640

Open

65 tasks

github-actions bot added the no-issue-activity label Aug 7, 2023

Li0k closed this as completed Aug 8, 2023

BugenZhao mentioned this issue Oct 10, 2023

fix(backfill): no need initial snapshot read #12740

Merged

8 tasks
