
changefeed get stuck ...in test case "cdc_scale_sync" #4464

Closed
Tammyxia opened this issue Jan 25, 2022 · 11 comments · Fixed by #5115 or #5372
Labels

affects-5.0, affects-5.1, affects-5.2, affects-5.3, affects-5.4, affects-6.0, area/ticdc, found/automation, severity/major, type/bug

Comments

@Tammyxia

Tammyxia commented Jan 25, 2022

What did you do?

What did you expect to see?

No response

What did you see instead?

[screenshot]

Versions of the cluster

Upstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

main

TiCDC version (execute cdc version):

[release-version=v5.5.0-nightly] [git-hash=7a227b421dbfcdafee02148e787138798edadf31] [git-branch=heads/refs/tags/v5.5.0-nightly] [utc-build-time="2022-01-24 18:10:05"] [go-version="go version go1.16.4 linux/amd64"] [failpoint-build=false]
@Tammyxia Tammyxia added type/bug The issue is confirmed as a bug. area/ticdc Issues or PRs related to TiCDC. labels Jan 25, 2022
@Tammyxia Tammyxia added the found/automation Bugs found by automation cases label Jan 26, 2022
@Tammyxia
Author

Tammyxia commented Feb 7, 2022

Reproduced again in the main version.
About 2 hours after the cdc scale finished, while waiting for the sync task to complete, the changefeed got stuck.
TiKV seems to be the main cause, because the metrics show that TiKV no longer outputs any gRPC messages:
[metrics screenshots]

And the TiKV ERROR log:
[2022/02/06 20:36:48.101 +00:00] [ERROR] [transport.rs:86] ["failed to send significant msg"] [msg="CaptureChange { cmd: ChangeObserver { cdc_id: Some(ObserveHandle { id: ObserveID(1174), observing: true }), rts_id: None, region_id: 1228 }, region_epoch: conf_ver: 5 version: 99, callback: Callback::Read(..) }"]
[2022/02/06 20:36:48.101 +00:00] [ERROR] [endpoint.rs:637] ["cdc initialize fail: Request error message: "region 1228 not found" region_not_found { region_id: 1228 }"] [region_id=1228]
[2022/02/06 20:36:48.109 +00:00] [ERROR] [transport.rs:86] ["failed to send significant msg"] [msg="CaptureChange { cmd: ChangeObserver { cdc_id: Some(ObserveHandle { id: ObserveID(1177), observing: true }), rts_id: None, region_id: 2 }, region_epoch: conf_ver: 5 version: 101, callback: Callback::Read(..) }"]
[2022/02/06 20:36:48.109 +00:00] [ERROR] [endpoint.rs:637] ["cdc initialize fail: Request error message: "region 2 not found" region_not_found { region_id: 2 }"] [region_id=2]

@Tammyxia
Author

Reproduced again in v5.1.4.

@Tammyxia
Author

Reproduced this issue in v6.0.

@overvenus
Member

It should be fixed by tikv/tikv#12262. Let's keep it open and wait for another round of testing.

@Rustin170506
Member

Rustin170506 commented Apr 1, 2022

We have done the following analysis of the issue:

  1. The trigger is table scheduling: a table advances on one node for a while, is scheduled to another node, and is then scheduled back. At that point it uses a wrong checkpoint ts, so the checkpoint ts never advances, flow control eventually gets stuck, and the whole changefeed stalls.
  2. It uses the wrong checkpoint ts because we forgot to properly clean up the old values in the sink manager's tableCheckpointTsMap and the MySQL sink's tableCheckpointTs when creating the table sink. So when the table was dispatched back, it picked up the stale old value (see the sketch after the code excerpt below).
  3. The root cause of flow control getting stuck is that the wrong checkpoint ts is much larger than the resolved ts, so flushSink returns early in the logic below and flow control is not released for a long time.
// flushSink emits all rows in rowBuffer to the backend sink and flushes
// the backend sink.
func (n *sinkNode) flushSink(ctx context.Context, resolvedTs model.Ts) (err error) {
	defer func() {
		if err != nil {
			n.status.Store(TableStatusStopped)
			return
		}
		if n.checkpointTs >= n.targetTs {
			err = n.stop(ctx)
		}
	}()
	if resolvedTs > n.barrierTs {
		resolvedTs = n.barrierTs
	}
	if resolvedTs > n.targetTs {
		resolvedTs = n.targetTs
	}
	if resolvedTs <= n.checkpointTs {
		// Here checkpointTs is the stale (too large) value, so this branch
		// is taken on every flush and the function returns before flow
		// control is ever released.
		log.Warn("resolvedTs is less than or equal to checkpointTs",
			zap.Int64("tableID", n.tableID),
			zap.Uint64("resolvedTs", resolvedTs),
			zap.Uint64("checkpointTs", n.checkpointTs))
		return nil
	}
	// ... the rest of the flush logic is omitted from this excerpt ...
}
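
As a rough illustration of the fix described in point 2, here is a minimal sketch, assuming a sink manager that caches per-table checkpoints in a map keyed by table ID. The names (sinkManager, tableCheckpointTsMap, CreateTableSink) follow the wording above, but the code itself is illustrative, not the actual TiCDC implementation.

package sink

import "sync"

// sinkManager is a simplified stand-in for the real sink manager; only the
// per-table checkpoint cache that matters for this bug is modeled.
type sinkManager struct {
	// tableCheckpointTsMap maps tableID -> last flushed checkpoint ts.
	tableCheckpointTsMap sync.Map
}

// CreateTableSink is called when a table is (re)scheduled onto this capture.
// If an entry left over from a previous scheduling round survives here, the
// first flush after the table comes back reuses that stale checkpoint ts.
func (m *sinkManager) CreateTableSink(tableID int64, startTs uint64) {
	// The essence of the fix: overwrite any stale entry so the table
	// restarts from startTs instead of the old, too-large value.
	m.tableCheckpointTsMap.Store(tableID, startTs)
	// ... create and register the actual table sink here ...
}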

How did we locate the problem?

  1. We found that the barrier ts was stuck. At first we mistakenly thought the ddl puller was stuck, but after adding logs we found that it was not.
  2. We realized it was actually the checkpoint ts that was not advancing, and noticed in monitoring that the minimum checkpoint ts of one node was always held back by a particular table.
  3. After adding more logs, I found that the table had been scheduled multiple times, and that when it was last scheduled back, the checkpoints after the first flush used the checkpoint ts from when it was scheduled away, not the sink node's current checkpoint ts.
  4. Then we looked at the stack and found that flow control was stuck in ConsumeWithBlocking: its cond was never woken up, and the time it got stuck coincided with when the checkpoint ts stopped advancing (see the sketch after this list).
  5. Based on the above, I tried to reproduce the problem in tispace: I dispatched a particular table, let it synchronize for a while, repeated the dispatch operation several times, and finally reproduced the problem. The flow-control jam and the stack information also matched:
tiup cluster deploy  issue-4464 nightly topology.yaml -p (3 ticdc cluster)
tiup cluster start issue-4464
create database test;(in mysql)
tiup cdc:nightly cli changefeed create   --sink-uri="mysql://root:[email protected]:3306/" -c "test" --tz "Asia/Shanghai"
tiup cdc:nightly cli changefeed pause -c test
go-tpc tpcc -H 127.0.0.1 -P 4000  --warehouses 100 prepare -T 10
tiup cdc:nightly cli changefeed resume -c test
go-tpc tpcc -H 127.0.0.1 -P 4000  --warehouses 100 run -T 10
curl -X POST http://127.0.0.1:8300/capture/owner/move_table -d 'cf-id=test&target-cp-id=1f1c489f-c006-4f56-a50d-d509c4d140de&table-id=69'
curl -X POST http://127.0.0.1:8300/capture/owner/move_table -d 'cf-id=test&target-cp-id=81ce066d-5b78-44de-aa0c-ac596c3c83fd&table-id=69'
curl -X POST http://127.0.0.1:8300/capture/owner/move_table -d 'cf-id=test&target-cp-id=1f1c489f-c006-4f56-a50d-d509c4d140de&table-id=69'
  6. Analyzing the code, we realized we were not clearing the old values correctly in CreateTableSink, which caused the problem.
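
To make step 4 concrete, here is a minimal sketch of cond-based flow control in the style of ConsumeWithBlocking, assuming a simple per-table byte quota. It is an illustration only, not the actual TiCDC flow controller.

package flowcontrol

import "sync"

// tableFlowController models a per-table memory quota.
type tableFlowController struct {
	mu    sync.Mutex
	cond  *sync.Cond
	quota uint64 // bytes still available
}

func newTableFlowController(quota uint64) *tableFlowController {
	f := &tableFlowController{quota: quota}
	f.cond = sync.NewCond(&f.mu)
	return f
}

// ConsumeWithBlocking parks the caller until enough quota is available.
// If Release is never called, for example because a stale checkpoint ts
// keeps flushSink returning early so nothing is ever flushed and freed,
// callers wait on the cond forever, which is the stuck stack in step 4.
func (f *tableFlowController) ConsumeWithBlocking(bytes uint64) {
	f.mu.Lock()
	defer f.mu.Unlock()
	for f.quota < bytes {
		f.cond.Wait()
	}
	f.quota -= bytes
}

// Release returns quota once rows up to the new checkpoint ts are flushed.
func (f *tableFlowController) Release(bytes uint64) {
	f.mu.Lock()
	f.quota += bytes
	f.mu.Unlock()
	f.cond.Broadcast()
}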

Related questions that are not yet clear:

  1. Why would we push the checkpoint ts of a table to a huge value?

Many thanks to @liuzix and @overvenus for helping me solve this problem; without them I don't think I could have figured it out.

@Tammyxia
Author

Reproduced this issue in v5.2.4

  • /cdc version
    Release Version: v5.2.4
    Git Commit Hash: 2971ac3
    Git Branch: heads/refs/tags/v5.2.4
    UTC Build Time: 2022-04-18 02:52:07
    Go Version: go version go1.16.4 linux/amd64
    Failpoint Build: false

@Tammyxia
Author

Tammyxia commented May 6, 2022

Reproduced in v5.4.1 and v6.1.0-nightly 2022-05-05

@Rustin170506
Member

It happened again because it was affected by #5196. I will continue to investigate it.

@asddongmen
Contributor

@hi-rustin

Related questions that are not yet clear:

  1. Why would we push the checkpoint ts of a table to a huge value?

In my view, the reason seems to be that when a table is being scheduled, its resolvedTs is set to the global checkpointTs of all tables in the changefeed, which can be much smaller than the table's local checkpointTs kept in the sink.
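
A made-up numeric illustration of that ordering (values are not from this incident):

package main

import "fmt"

func main() {
	// The sink still holds the table's old checkpoint ts from before the
	// table was scheduled away, while the incoming resolved ts restarts
	// from the changefeed's global checkpoint ts, which lags behind it.
	staleCheckpointTs := uint64(432000000000000100)
	globalCheckpointTs := uint64(432000000000000001) // becomes the table's new resolvedTs
	// flushSink then always hits the resolvedTs <= checkpointTs early
	// return, so the table never flushes and flow control is never released.
	fmt.Println(globalCheckpointTs <= staleCheckpointTs) // true
}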

@Rustin170506
Member

Rustin170506 commented May 10, 2022

It happened again because it was affected by #5196. I will continue to investigate it.

It is indeed affected by this PR:

  1. Before we deleted the drawback chan, we waited for the buffered data in the bufferSink to be removed right before the Barrier, so the bufferSink would not write any new data to the MySQL sink while the Barrier was in progress, as expected.
  2. After we deleted it, we still removed the buffered data and then immediately performed the Barrier, but that part of the code may keep refreshing and updating the checkpoint in the bufferSink. At that point the MySQL sink may still be waiting for the Barrier to finish, so the bufferSink reads and keeps saving the old checkpoint, and we end up with the same problem as before.
  3. This problem also exists with the drawback mechanism, but it is hard to hit there: with no new data the Barrier finishes quickly, and even if the code mentioned above runs, it reads values that have already been cleaned up.

For now we just clean it up in the bufferSink; since the bufferSink is about to be removed, we do not need to add the drawback mechanism back. A sketch of this interim cleanup follows.
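
A minimal sketch of that interim cleanup, assuming the bufferSink caches per-table checkpoints in a plain map; the names are illustrative, not the real TiCDC types:

package sink

import "sync"

// bufferSinkState models only the piece of the bufferSink that matters here.
type bufferSinkState struct {
	mu                sync.Mutex
	tableCheckpointTs map[int64]uint64
}

// RemoveTableSink is called when a table is moved away. Clearing the cached
// checkpoint ts together with the buffered data means that a checkpoint
// refresh racing with the downstream Barrier can no longer re-publish a
// stale value after the table is scheduled back.
func (b *bufferSinkState) RemoveTableSink(tableID int64) {
	b.mu.Lock()
	defer b.mu.Unlock()
	// drop buffered rows for this table (omitted) ...
	delete(b.tableCheckpointTs, tableID)
}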

Test result: https://tcms.pingcap.net/dashboard/executions/plan/786269

Thanks @asddongmen for discussing this with me!

@nongfushanquan
Contributor

/label affects-5.4
