Describe the bug
Restarting CN/Meta doesn't fix the issue.
Backfill will use an epoch larger than the max committed epoch and an unbounded key range of the upstream mv to read from storage (but only call next once, similar to LIMIT 1) via the storage table, in order to check whether the upstream mv is empty. This is different from other usages of the storage table and state table, but since it uses HummockReadEpoch::NoWait, theoretically this should work. However, this read seems to be where the backfill gets stuck (a sketch of the check follows below).

#10125 can prevent this issue from happening, but the underlying root cause is still unknown.
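For illustration, here is a minimal sketch of the "call next once" emptiness check. The stream stands in for the storage-table iterator opened with HummockReadEpoch::NoWait over an unbounded key range; the function name and generic types are illustrative stand-ins, not RisingWave's actual API.

```rust
use futures::{pin_mut, Stream, StreamExt};

/// Decide whether the upstream mv is empty by polling the storage read
/// stream at most once (LIMIT 1 semantics). `Row` and `E` stand in for
/// the storage table's row and error types.
async fn upstream_is_empty<S, Row, E>(stream: S) -> Result<bool, E>
where
    S: Stream<Item = Result<Row, E>>,
{
    pin_mut!(stream);
    match stream.next().await {
        Some(Ok(_row)) => Ok(false), // got a row: the mv is not empty
        Some(Err(e)) => Err(e),      // propagate storage errors
        None => Ok(true),            // stream ended immediately: empty
    }
}
```

Since the read epoch is larger than the max committed epoch and NoWait skips waiting for that epoch to be committed, this single poll should return promptly; the report above is that in practice it does not.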
We originally suspected that this is caused by slow I/O when reading the relevant SST metadata. However, although we did have a large number of L0 SSTs (~1350, see the graph pasted below) due to #10209, the math doesn't match. Assume the worst-case scenario: all SSTs overlap with each other so we need to fetch the metadata of every L0 SST, every meta fetch hits the object store, and the object store read latency is 500ms. Even then it would only take ~700s to fetch all of them. Note that this assumption is already far worse than the actual scenario in this issue, since L0->L0 compaction is not stuck and a large portion of L0 is in non-overlapping sub-levels, in which case the number of SST metas needed is far smaller.
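As a back-of-the-envelope check of the ~700s bound (numbers taken from this issue; the snippet is purely illustrative):

```rust
fn main() {
    let l0_ssts: u64 = 1350;      // observed L0 SST count
    let meta_fetch_ms: u64 = 500; // assumed object store read latency per meta fetch
    // Sequential worst case: every SST meta fetch goes to the object store.
    let total_secs = l0_ssts * meta_fetch_ms / 1000;
    println!("worst-case total meta fetch time: ~{total_secs}s"); // ~675s
}
```

Even this deliberately pessimistic sequential bound is around 11 minutes, so slow meta I/O alone does not explain a read that stays stuck.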
To Reproduce
No response
Expected behavior
No response
Additional context