Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CDC panic: sinkManager: sink upperbound should not less than checkpoint ts #10613

Closed
fubinzh opened this issue Feb 19, 2024 · 3 comments · Fixed by #10740
Closed

CDC panic: sinkManager: sink upperbound should not less than checkpoint ts #10613

fubinzh opened this issue Feb 19, 2024 · 3 comments · Fixed by #10740
Assignees
Labels
affects-6.5 This bug affects the 6.5.x(LTS) versions. affects-7.1 This bug affects the 7.1.x(LTS) versions. affects-7.5 This bug affects the 7.5.x(LTS) versions. area/ticdc Issues or PRs related to TiCDC. severity/major type/bug The issue is confirmed as a bug.

Comments

@fubinzh
Copy link

fubinzh commented Feb 19, 2024

What did you do?

This panic is seen when there are 3 kafka changefeed running to sync 4000+ tables

What did you expect to see?

CDC no panic

What did you see instead?

cdc panic seen.

panic: sinkManager: sink upperbound should not less than checkpoint ts

goroutine 681 [running]:
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x5?, 0x5?, {0x0?, 0x0?, 0xc000f1a8e0?})
        go.uber.org/[email protected]/zapcore/entry.go:196 +0x54
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc00ba05930, {0xc026343540, 0x5, 0x5})
        go.uber.org/[email protected]/zapcore/entry.go:262 +0x3ec
go.uber.org/zap.(*Logger).Panic(0xc000980400?, {0x508262c?, 0x0?}, {0xc026343540, 0x5, 0x5})
        go.uber.org/[email protected]/logger.go:284 +0x51
github.com/pingcap/log.Panic({0x508262c?, 0xc?}, {0xc026343540?, 0xc02f915578?, 0x3d1c?})
        github.com/pingcap/[email protected]/global.go:54 +0x85
github.com/pingcap/tiflow/cdc/processor/sinkmanager.(*SinkManager).GetTableStats(0xc005e14b00, {0x3d1c, {0xc028d169a8, 0x12, 0x18}, {0xc028d169c0, 0x12, 0x18}})
        github.com/pingcap/tiflow/cdc/processor/sinkmanager/manager.go:1033 +0x103d
github.com/pingcap/tiflow/cdc/processor.(*processor).IsAddTableSpanFinished.func1()
        github.com/pingcap/tiflow/cdc/processor/processor.go:259 +0xc5
github.com/pingcap/tiflow/cdc/processor.(*processor).IsAddTableSpanFinished(0xc00347ea80, {0x3d1c, {0xc028d169a8, 0x12, 0x18}, {0xc028d169c0, 0x12, 0x18}}, 0x0)
        github.com/pingcap/tiflow/cdc/processor/processor.go:278 +0x185
github.com/pingcap/tiflow/cdc/scheduler/internal/v3/agent.(*tableSpan).handleAddTableTask(0xc01ada9a00, {0x7f4085d69018, 0xc024fc94e8})
        github.com/pingcap/tiflow/cdc/scheduler/internal/v3/agent/table.go:222 +0x722
github.com/pingcap/tiflow/cdc/scheduler/internal/v3/agent.(*tableSpan).poll(0xc01b2eaf00?, {0x7f4085d69018?, 0xc024fc94e8?})
        github.com/pingcap/tiflow/cdc/scheduler/internal/v3/agent/table.go:291 +0x3f
github.com/pingcap/tiflow/cdc/scheduler/internal/v3/agent.(*tableSpanManager).poll.func1({0x3d1c, {0xc02a922108, 0x12, 0x18}, {0xc02a922120, 0x12, 0x18}}, 0x0?)
        github.com/pingcap/tiflow/cdc/scheduler/internal/v3/agent/table.go:316 +0x7a
github.com/pingcap/tiflow/cdc/scheduler/internal/v3/agent.(*tableSpanManager).poll.(*BtreeMap[...]).Ascend.func2()
        github.com/pingcap/tiflow/pkg/spanz/btree_map.go:110 +0x5a
github.com/google/btree.(*node[...]).iterate(0x5c09160, 0xc025abd3b0, {{{0x0, {0x0, 0x0, 0x0}, {0x0, 0x0, 0x0}}, 0x0}, ...}, ...)
        github.com/google/[email protected]/btree_generic.go:522 +0x5de
github.com/google/btree.(*node[...]).iterate(0x5c09160, 0xc01886cb90, {{{0x0, {0x0, 0x0, 0x0}, {0x0, 0x0, 0x0}}, 0x0}, ...}, ...)
        github.com/google/[email protected]/btree_generic.go:510 +0x28c
github.com/google/btree.(*BTreeG[...]).Ascend(0x12?, 0xc026396a00?)
        github.com/google/[email protected]/btree_generic.go:779 +0x125
github.com/pingcap/tiflow/pkg/spanz.(*BtreeMap[...]).Ascend(...)
        github.com/pingcap/tiflow/pkg/spanz/btree_map.go:109
github.com/pingcap/tiflow/cdc/scheduler/internal/v3/agent.(*tableSpanManager).poll(0xc007ba4540, {0x7f4085d69018?, 0xc024fc94e8?})
        github.com/pingcap/tiflow/cdc/scheduler/internal/v3/agent/table.go:315 +0x126
github.com/pingcap/tiflow/cdc/scheduler/internal/v3/agent.(*agent).Tick(0xc0037ae7e0, {0x7f4085d69018, 0xc024fc94e8})
        github.com/pingcap/tiflow/cdc/scheduler/internal/v3/agent/agent.go:210 +0x7d
github.com/pingcap/tiflow/cdc/processor.(*processor).tick(0xc00347ea80, {0x5b7dc40?, 0xc024fc94e8})
        github.com/pingcap/tiflow/cdc/processor/processor.go:557 +0x89
github.com/pingcap/tiflow/cdc/processor.(*processor).Tick(0xc00347ea80, {0x5b7dc40, 0xc024fc94e8}, 0xa271ac0?, 0xc169167464f0a2c9?)
        github.com/pingcap/tiflow/cdc/processor/processor.go:514 +0x5b8
github.com/pingcap/tiflow/cdc/processor.(*managerImpl).Tick(0xc003421ae0, {0x7f4085784060?, 0xc0014a5420}, {0x5b50590?, 0xc001032850?})
        github.com/pingcap/tiflow/cdc/processor/manager.go:159 +0x5df
github.com/pingcap/tiflow/pkg/orchestrator.(*EtcdWorker).Run(0xc003462630, {0x7f4085784060?, 0xc0014a5420}, 0xc0023b5350, 0x2faf080, {0x4f22000, 0x9})
        github.com/pingcap/tiflow/pkg/orchestrator/etcd_worker.go:290 +0x1049
github.com/pingcap/tiflow/cdc/capture.(*captureImpl).runEtcdWorker(0xc0010e3b00, {0x5b7dab0?, 0xc0014a5420}, {0x7f4085ab4488, 0xc003421ae0}, {0x5b50590, 0xc001032850}, 0x0?, {0x4f22000, 0x9})
        github.com/pingcap/tiflow/cdc/capture/capture.go:596 +0x185
github.com/pingcap/tiflow/cdc/capture.(*captureImpl).run.func4()
        github.com/pingcap/tiflow/cdc/capture/capture.go:408 +0x42c
golang.org/x/sync/errgroup.(*Group).Go.func1()
        golang.org/x/[email protected]/errgroup/errgroup.go:75 +0x56
created by golang.org/x/sync/errgroup.(*Group).Go in goroutine 507
        golang.org/x/[email protected]/errgroup/errgroup.go:72 +0x96

Versions of the cluster

[release-version=v8.0.0-alpha] [git-hash=ed54e785188538740c69a74cdca9f4ae258bed06] [git-branch=heads/refs/tags/v8.0.0-alpha] [utc-build-time="2024-02-07 11:39:36"]

@fubinzh fubinzh added area/ticdc Issues or PRs related to TiCDC. type/bug The issue is confirmed as a bug. labels Feb 19, 2024
@fubinzh
Copy link
Author

fubinzh commented Feb 19, 2024

/severity major

@asddongmen
Copy link
Contributor

I try and can not reproduce this issue and it is hard to find the root cause by investigate the code, still need more efforts to solve it.

@CharlesCheung96
Copy link
Contributor

CharlesCheung96 commented Mar 7, 2024

cdc panic due to tableUpperbound less that the checkpoint. From the log, we found it only happens when starting a table, and the table is just finishing a two phase schedule:

# Phase 1 
[2024/03/07 14:11:37.254 +08:00] [INFO] [table.go:267] ["schedulerv3: table found new task"] [namespace=default] [changefeed=s3-sink] [tableSpan="\"{table_id:39731,start_key:748000000000009bff335f720000000000fa,end_key:748000000000009bff335f730000000000fa}\""] [task="{\"Span\":\"{table_id:39731,start_key:748000000000009bff335f720000000000fa,end_key:748000000000009bff335f730000000000fa}\",\"Checkpoint\":{\"checkpoint_ts\":448211686913474596,\"resolved_ts\":448211687070761065,\"last_synced_ts\":448192142990639116},\"IsRemove\":false,\"IsPrepare\":true,\"Epoch\":{}}"]
[2024/03/07 14:11:37.254 +08:00] [INFO] [manager.go:859] ["Add table sink"] [namespace=default] [changefeed=s3-sink] [span={table_id:39731,start_key:748000000000009bff335f720000000000fa,end_key:748000000000009bff335f730000000000fa}] [startTs=448211686913474596] [version=3]

[2024/03/07 14:11:37.345 +08:00] [INFO] [table.go:200] ["schedulerv3: table is prepared"] [namespace=default] [changefeed=s3-sink] [tableSpan="\"{table_id:39731,start_key:748000000000009bff335f720000000000fa,end_key:748000000000009bff335f730000000000fa}\""] [state=Prepared]

# Pahse 2
[2024/03/07 14:11:37.655 +08:00] [INFO] [table.go:267] ["schedulerv3: table found new task"] [namespace=default] [changefeed=s3-sink] [tableSpan="\"{table_id:39731,start_key:748000000000009bff335f720000000000fa,end_key:748000000000009bff335f730000000000fa}\""] [task="{\"Span\":\"{table_id:39731,start_key:748000000000009bff335f720000000000fa,end_key:748000000000009bff335f730000000000fa}\",\"Checkpoint\":{\"checkpoint_ts\":448211687057915916,\"resolved_ts\":448211687136297053,\"last_synced_ts\":448192142990639116},\"IsRemove\":false,\"IsPrepare\":false,\"Epoch\":{}}"]
[2024/03/07 14:11:37.655 +08:00] [INFO] [manager.go:869] ["Start table sink"] [namespace=default] [changefeed=s3-sink] [span={table_id:39731,start_key:748000000000009bff335f720000000000fa,end_key:748000000000009bff335f730000000000fa}] [startTs=448211687057915916]
[2024/03/07 14:11:37.655 +08:00] [INFO] [table_sink_wrapper.go:153] ["Sink is started"] [namespace=default] [changefeed=s3-sink] [span={table_id:39731,start_key:748000000000009bff335f720000000000fa,end_key:748000000000009bff335f730000000000fa}] [startTs=448211687057915916] [replicateTs=448211687214940178]

# Panic after staring table
[2024/03/07 14:11:37.655 +08:00] [PANIC] [manager.go:1038] ["sinkManager: sink upperbound should not less than checkpoint ts"] [namespace=default] [changefeed=s3-sink] [span={table_id:39731,start_key:748000000000009bff335f720000000000fa,end_key:748000000000009bff335f730000000000fa}] [upperbound=448211686913474596] [checkpointTs="{\"Mode\":0,\"Ts\":448211687057915916,\"BatchID\":18446744073709551615}"]

From start function, receivedSorterResolvedTs must be updated to startTS after a table started. So the smaller value, which was returned by getUpperBoundTs, must equal to barrierTs. barrierTs is affected by only two factors:

  1. The value received from owner node: the globalBarrierTS sent by owner must larger than barrierTs, because globalBarrierTS > table checkpointTs > barrierTS.
  2. The resolvedTs of schema storage.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects-6.5 This bug affects the 6.5.x(LTS) versions. affects-7.1 This bug affects the 7.1.x(LTS) versions. affects-7.5 This bug affects the 7.5.x(LTS) versions. area/ticdc Issues or PRs related to TiCDC. severity/major type/bug The issue is confirmed as a bug.
Projects
Development

Successfully merging a pull request may close this issue.

3 participants