CDC panic: sinkManager: sink upperbound should not less than checkpoint ts #10613

fubinzh · 2024-02-19T10:00:35Z

What did you do?

This panic is seen when there are 3 kafka changefeed running to sync 4000+ tables

What did you expect to see?

CDC no panic

What did you see instead?

cdc panic seen.

panic: sinkManager: sink upperbound should not less than checkpoint ts

goroutine 681 [running]:
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x5?, 0x5?, {0x0?, 0x0?, 0xc000f1a8e0?})
        go.uber.org/[email protected]/zapcore/entry.go:196 +0x54
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc00ba05930, {0xc026343540, 0x5, 0x5})
        go.uber.org/[email protected]/zapcore/entry.go:262 +0x3ec
go.uber.org/zap.(*Logger).Panic(0xc000980400?, {0x508262c?, 0x0?}, {0xc026343540, 0x5, 0x5})
        go.uber.org/[email protected]/logger.go:284 +0x51
github.com/pingcap/log.Panic({0x508262c?, 0xc?}, {0xc026343540?, 0xc02f915578?, 0x3d1c?})
        github.com/pingcap/[email protected]/global.go:54 +0x85
github.com/pingcap/tiflow/cdc/processor/sinkmanager.(*SinkManager).GetTableStats(0xc005e14b00, {0x3d1c, {0xc028d169a8, 0x12, 0x18}, {0xc028d169c0, 0x12, 0x18}})
        github.com/pingcap/tiflow/cdc/processor/sinkmanager/manager.go:1033 +0x103d
github.com/pingcap/tiflow/cdc/processor.(*processor).IsAddTableSpanFinished.func1()
        github.com/pingcap/tiflow/cdc/processor/processor.go:259 +0xc5
github.com/pingcap/tiflow/cdc/processor.(*processor).IsAddTableSpanFinished(0xc00347ea80, {0x3d1c, {0xc028d169a8, 0x12, 0x18}, {0xc028d169c0, 0x12, 0x18}}, 0x0)
        github.com/pingcap/tiflow/cdc/processor/processor.go:278 +0x185
github.com/pingcap/tiflow/cdc/scheduler/internal/v3/agent.(*tableSpan).handleAddTableTask(0xc01ada9a00, {0x7f4085d69018, 0xc024fc94e8})
        github.com/pingcap/tiflow/cdc/scheduler/internal/v3/agent/table.go:222 +0x722
github.com/pingcap/tiflow/cdc/scheduler/internal/v3/agent.(*tableSpan).poll(0xc01b2eaf00?, {0x7f4085d69018?, 0xc024fc94e8?})
        github.com/pingcap/tiflow/cdc/scheduler/internal/v3/agent/table.go:291 +0x3f
github.com/pingcap/tiflow/cdc/scheduler/internal/v3/agent.(*tableSpanManager).poll.func1({0x3d1c, {0xc02a922108, 0x12, 0x18}, {0xc02a922120, 0x12, 0x18}}, 0x0?)
        github.com/pingcap/tiflow/cdc/scheduler/internal/v3/agent/table.go:316 +0x7a
github.com/pingcap/tiflow/cdc/scheduler/internal/v3/agent.(*tableSpanManager).poll.(*BtreeMap[...]).Ascend.func2()
        github.com/pingcap/tiflow/pkg/spanz/btree_map.go:110 +0x5a
github.com/google/btree.(*node[...]).iterate(0x5c09160, 0xc025abd3b0, {{{0x0, {0x0, 0x0, 0x0}, {0x0, 0x0, 0x0}}, 0x0}, ...}, ...)
        github.com/google/[email protected]/btree_generic.go:522 +0x5de
github.com/google/btree.(*node[...]).iterate(0x5c09160, 0xc01886cb90, {{{0x0, {0x0, 0x0, 0x0}, {0x0, 0x0, 0x0}}, 0x0}, ...}, ...)
        github.com/google/[email protected]/btree_generic.go:510 +0x28c
github.com/google/btree.(*BTreeG[...]).Ascend(0x12?, 0xc026396a00?)
        github.com/google/[email protected]/btree_generic.go:779 +0x125
github.com/pingcap/tiflow/pkg/spanz.(*BtreeMap[...]).Ascend(...)
        github.com/pingcap/tiflow/pkg/spanz/btree_map.go:109
github.com/pingcap/tiflow/cdc/scheduler/internal/v3/agent.(*tableSpanManager).poll(0xc007ba4540, {0x7f4085d69018?, 0xc024fc94e8?})
        github.com/pingcap/tiflow/cdc/scheduler/internal/v3/agent/table.go:315 +0x126
github.com/pingcap/tiflow/cdc/scheduler/internal/v3/agent.(*agent).Tick(0xc0037ae7e0, {0x7f4085d69018, 0xc024fc94e8})
        github.com/pingcap/tiflow/cdc/scheduler/internal/v3/agent/agent.go:210 +0x7d
github.com/pingcap/tiflow/cdc/processor.(*processor).tick(0xc00347ea80, {0x5b7dc40?, 0xc024fc94e8})
        github.com/pingcap/tiflow/cdc/processor/processor.go:557 +0x89
github.com/pingcap/tiflow/cdc/processor.(*processor).Tick(0xc00347ea80, {0x5b7dc40, 0xc024fc94e8}, 0xa271ac0?, 0xc169167464f0a2c9?)
        github.com/pingcap/tiflow/cdc/processor/processor.go:514 +0x5b8
github.com/pingcap/tiflow/cdc/processor.(*managerImpl).Tick(0xc003421ae0, {0x7f4085784060?, 0xc0014a5420}, {0x5b50590?, 0xc001032850?})
        github.com/pingcap/tiflow/cdc/processor/manager.go:159 +0x5df
github.com/pingcap/tiflow/pkg/orchestrator.(*EtcdWorker).Run(0xc003462630, {0x7f4085784060?, 0xc0014a5420}, 0xc0023b5350, 0x2faf080, {0x4f22000, 0x9})
        github.com/pingcap/tiflow/pkg/orchestrator/etcd_worker.go:290 +0x1049
github.com/pingcap/tiflow/cdc/capture.(*captureImpl).runEtcdWorker(0xc0010e3b00, {0x5b7dab0?, 0xc0014a5420}, {0x7f4085ab4488, 0xc003421ae0}, {0x5b50590, 0xc001032850}, 0x0?, {0x4f22000, 0x9})
        github.com/pingcap/tiflow/cdc/capture/capture.go:596 +0x185
github.com/pingcap/tiflow/cdc/capture.(*captureImpl).run.func4()
        github.com/pingcap/tiflow/cdc/capture/capture.go:408 +0x42c
golang.org/x/sync/errgroup.(*Group).Go.func1()
        golang.org/x/[email protected]/errgroup/errgroup.go:75 +0x56
created by golang.org/x/sync/errgroup.(*Group).Go in goroutine 507
        golang.org/x/[email protected]/errgroup/errgroup.go:72 +0x96

Versions of the cluster

[release-version=v8.0.0-alpha] [git-hash=ed54e785188538740c69a74cdca9f4ae258bed06] [git-branch=heads/refs/tags/v8.0.0-alpha] [utc-build-time="2024-02-07 11:39:36"]

The text was updated successfully, but these errors were encountered:

fubinzh · 2024-02-19T10:00:53Z

/severity major

asddongmen · 2024-03-04T07:14:36Z

I try and can not reproduce this issue and it is hard to find the root cause by investigate the code, still need more efforts to solve it.

CharlesCheung96 · 2024-03-07T10:08:29Z

cdc panic due to tableUpperbound less that the checkpoint. From the log, we found it only happens when starting a table, and the table is just finishing a two phase schedule:

# Phase 1 
[2024/03/07 14:11:37.254 +08:00] [INFO] [table.go:267] ["schedulerv3: table found new task"] [namespace=default] [changefeed=s3-sink] [tableSpan="\"{table_id:39731,start_key:748000000000009bff335f720000000000fa,end_key:748000000000009bff335f730000000000fa}\""] [task="{\"Span\":\"{table_id:39731,start_key:748000000000009bff335f720000000000fa,end_key:748000000000009bff335f730000000000fa}\",\"Checkpoint\":{\"checkpoint_ts\":448211686913474596,\"resolved_ts\":448211687070761065,\"last_synced_ts\":448192142990639116},\"IsRemove\":false,\"IsPrepare\":true,\"Epoch\":{}}"]
[2024/03/07 14:11:37.254 +08:00] [INFO] [manager.go:859] ["Add table sink"] [namespace=default] [changefeed=s3-sink] [span={table_id:39731,start_key:748000000000009bff335f720000000000fa,end_key:748000000000009bff335f730000000000fa}] [startTs=448211686913474596] [version=3]

[2024/03/07 14:11:37.345 +08:00] [INFO] [table.go:200] ["schedulerv3: table is prepared"] [namespace=default] [changefeed=s3-sink] [tableSpan="\"{table_id:39731,start_key:748000000000009bff335f720000000000fa,end_key:748000000000009bff335f730000000000fa}\""] [state=Prepared]

# Pahse 2
[2024/03/07 14:11:37.655 +08:00] [INFO] [table.go:267] ["schedulerv3: table found new task"] [namespace=default] [changefeed=s3-sink] [tableSpan="\"{table_id:39731,start_key:748000000000009bff335f720000000000fa,end_key:748000000000009bff335f730000000000fa}\""] [task="{\"Span\":\"{table_id:39731,start_key:748000000000009bff335f720000000000fa,end_key:748000000000009bff335f730000000000fa}\",\"Checkpoint\":{\"checkpoint_ts\":448211687057915916,\"resolved_ts\":448211687136297053,\"last_synced_ts\":448192142990639116},\"IsRemove\":false,\"IsPrepare\":false,\"Epoch\":{}}"]
[2024/03/07 14:11:37.655 +08:00] [INFO] [manager.go:869] ["Start table sink"] [namespace=default] [changefeed=s3-sink] [span={table_id:39731,start_key:748000000000009bff335f720000000000fa,end_key:748000000000009bff335f730000000000fa}] [startTs=448211687057915916]
[2024/03/07 14:11:37.655 +08:00] [INFO] [table_sink_wrapper.go:153] ["Sink is started"] [namespace=default] [changefeed=s3-sink] [span={table_id:39731,start_key:748000000000009bff335f720000000000fa,end_key:748000000000009bff335f730000000000fa}] [startTs=448211687057915916] [replicateTs=448211687214940178]

# Panic after staring table
[2024/03/07 14:11:37.655 +08:00] [PANIC] [manager.go:1038] ["sinkManager: sink upperbound should not less than checkpoint ts"] [namespace=default] [changefeed=s3-sink] [span={table_id:39731,start_key:748000000000009bff335f720000000000fa,end_key:748000000000009bff335f730000000000fa}] [upperbound=448211686913474596] [checkpointTs="{\"Mode\":0,\"Ts\":448211687057915916,\"BatchID\":18446744073709551615}"]

From start function, receivedSorterResolvedTs must be updated to startTS after a table started. So the smaller value, which was returned by getUpperBoundTs, must equal to barrierTs. barrierTs is affected by only two factors:

The value received from owner node: the globalBarrierTS sent by owner must larger than barrierTs, because globalBarrierTS > table checkpointTs > barrierTS.
The resolvedTs of schema storage.

close #10613

close pingcap#10613

) close #10613

fubinzh added area/ticdc Issues or PRs related to TiCDC. type/bug The issue is confirmed as a bug. labels Feb 19, 2024

ti-chi-bot bot added severity/major may-affects-5.4 may-affects-6.1 may-affects-6.5 may-affects-7.1 may-affects-7.5 labels Feb 19, 2024

asddongmen self-assigned this Feb 23, 2024

asddongmen assigned CharlesCheung96 Mar 7, 2024

CharlesCheung96 mentioned this issue Mar 7, 2024

sinkManger(ticdc): update barrierTs when starting table #10740

Merged

CharlesCheung96 added affects-6.5 This bug affects the 6.5.x(LTS) versions. affects-7.1 This bug affects the 7.1.x(LTS) versions. affects-7.5 This bug affects the 7.5.x(LTS) versions. and removed may-affects-5.4 may-affects-6.1 may-affects-6.5 may-affects-7.1 may-affects-7.5 labels Mar 7, 2024

ti-chi-bot bot closed this as completed in #10740 Mar 9, 2024

ti-chi-bot bot pushed a commit that referenced this issue Mar 9, 2024

sinkManger(ticdc): update barrierTs when starting table (#10740)

e0bdf8c

close #10613

This was referenced Mar 15, 2024

sinkManger(ticdc): update barrierTs when starting table (#10740) #10793

Merged

sinkManger(ticdc): update barrierTs when starting table (#10740) #10794

Merged

sinkManger(ticdc): update barrierTs when starting table (#10740) #10795

Merged

CharlesCheung96 added a commit to ti-chi-bot/tiflow that referenced this issue Mar 28, 2024

sinkManger(ticdc): update barrierTs when starting table (pingcap#10740)

5e1caa4

close pingcap#10613

CharlesCheung96 added a commit to ti-chi-bot/tiflow that referenced this issue Mar 28, 2024

sinkManger(ticdc): update barrierTs when starting table (pingcap#10740)

62ae164

close pingcap#10613

ti-chi-bot bot pushed a commit that referenced this issue Mar 28, 2024

sinkManger(ticdc): update barrierTs when starting table (#10740) (#10793

6cf49f7

) close #10613

ti-chi-bot bot pushed a commit that referenced this issue Mar 29, 2024

sinkManger(ticdc): update barrierTs when starting table (#10740) (#10795

eb7a4ec

) close #10613

ti-chi-bot bot pushed a commit that referenced this issue Mar 29, 2024

sinkManger(ticdc): update barrierTs when starting table (#10740) (#10794

ca111ce

) close #10613

github-project-automation bot added this to Question and Bug Reports Aug 28, 2024

github-project-automation bot moved this to Done in Question and Bug Reports Aug 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CDC panic: sinkManager: sink upperbound should not less than checkpoint ts #10613

CDC panic: sinkManager: sink upperbound should not less than checkpoint ts #10613

fubinzh commented Feb 19, 2024

fubinzh commented Feb 19, 2024

asddongmen commented Mar 4, 2024

CharlesCheung96 commented Mar 7, 2024 •

edited

Loading

CDC panic: sinkManager: sink upperbound should not less than checkpoint ts #10613

CDC panic: sinkManager: sink upperbound should not less than checkpoint ts #10613

Comments

fubinzh commented Feb 19, 2024

What did you do?

What did you expect to see?

What did you see instead?

Versions of the cluster

fubinzh commented Feb 19, 2024

asddongmen commented Mar 4, 2024

CharlesCheung96 commented Mar 7, 2024 • edited Loading

CharlesCheung96 commented Mar 7, 2024 •

edited

Loading