
changefeed get stuck ...in test case "cdc_scale_sync" #4464

Closed
Tammyxia opened this issue Jan 25, 2022 · 11 comments · Fixed by #5115 or #5372
Labels

affects-5.0, affects-5.1, affects-5.2, affects-5.3, affects-5.4, affects-6.0, area/ticdc, found/automation, severity/major, type/bug

Comments

@Tammyxia

Tammyxia commented Jan 25, 2022

What did you do?

What did you expect to see?

No response

What did you see instead?

[screenshot]

Versions of the cluster

Upstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

main

TiCDC version (execute cdc version):

[release-version=v5.5.0-nightly] [git-hash=7a227b421dbfcdafee02148e787138798edadf31] [git-branch=heads/refs/tags/v5.5.0-nightly] [utc-build-time="2022-01-24 18:10:05"] [go-version="go version go1.16.4 linux/amd64"] [failpoint-build=false]
@Tammyxia Tammyxia added type/bug The issue is confirmed as a bug. area/ticdc Issues or PRs related to TiCDC. labels Jan 25, 2022
@Tammyxia Tammyxia added the found/automation Bugs found by automation cases label Jan 26, 2022
@Tammyxia
Author

Tammyxia commented Feb 7, 2022

Reproduced again in the main version.
About 2 hours after the cdc scale finished, while waiting for the sync task to complete, the changefeed got stuck.
TiKV seems to be the main cause, because the metrics show that TiKV no longer outputs any gRPC messages:
[metrics screenshots]

And the TiKV ERROR log:
[2022/02/06 20:36:48.101 +00:00] [ERROR] [transport.rs:86] ["failed to send significant msg"] [msg="CaptureChange { cmd: ChangeObserver { cdc_id: Some(ObserveHandle { id: ObserveID(1174), observing: true }), rts_id: None, region_id: 1228 }, region_epoch: conf_ver: 5 version: 99, callback: Callback::Read(..) }"]
[2022/02/06 20:36:48.101 +00:00] [ERROR] [endpoint.rs:637] ["cdc initialize fail: Request error message: "region 1228 not found" region_not_found { region_id: 1228 }"] [region_id=1228]
[2022/02/06 20:36:48.109 +00:00] [ERROR] [transport.rs:86] ["failed to send significant msg"] [msg="CaptureChange { cmd: ChangeObserver { cdc_id: Some(ObserveHandle { id: ObserveID(1177), observing: true }), rts_id: None, region_id: 2 }, region_epoch: conf_ver: 5 version: 101, callback: Callback::Read(..) }"]
[2022/02/06 20:36:48.109 +00:00] [ERROR] [endpoint.rs:637] ["cdc initialize fail: Request error message: "region 2 not found" region_not_found { region_id: 2 }"] [region_id=2]

@Tammyxia
Author

Reproduced again in v5.1.4.

@Tammyxia
Author

Reproduced this issue in v6.0.

@overvenus
Member

It should be fixed by tikv/tikv#12262. Let's keep it open and wait for another round of testing.

@Rustin170506
Member

Rustin170506 commented Apr 1, 2022

We have done the following analysis of the issue:

  1. The trigger is table scheduling: a table advances on one node for a while, is scheduled to another node, and is then scheduled back. At that point it uses a wrong checkpoint ts, so the checkpoint ts never advances, flow control eventually gets stuck, and the whole changefeed stalls.
  2. It uses the wrong checkpoint ts because we forgot to properly clean up the old values in the sink manager's tableCheckpointTsMap and the MySQL sink's tableCheckpointTs when creating the table sink. So when the table was dispatched back, it picked up the stale old value (see the sketch after the code excerpt below).
  3. The root cause of flow control getting stuck is that the wrong checkpoint ts is much larger than the resolved ts, so flushSink returns early in the logic below and flow control is not released for a long time.
// flushSink emits all rows in rowBuffer to the backend sink and flushes
// the backend sink.
func (n *sinkNode) flushSink(ctx context.Context, resolvedTs model.Ts) (err error) {
	defer func() {
		if err != nil {
			n.status.Store(TableStatusStopped)
			return
		}
		if n.checkpointTs >= n.targetTs {
			err = n.stop(ctx)
		}
	}()
	if resolvedTs > n.barrierTs {
		resolvedTs = n.barrierTs
	}
	if resolvedTs > n.targetTs {
		resolvedTs = n.targetTs
	}
	if resolvedTs <= n.checkpointTs {
		// Here checkpointTs is the stale (too large) value, so this branch
		// is taken on every flush and the function returns before flow
		// control is ever released.
		log.Warn("resolvedTs is less than or equal to checkpointTs",
			zap.Int64("tableID", n.tableID),
			zap.Uint64("resolvedTs", resolvedTs),
			zap.Uint64("checkpointTs", n.checkpointTs))
		return nil
	}
	// ... the rest of the flush logic is omitted from this excerpt ...
}
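
As a rough illustration of the fix described in point 2, here is a minimal sketch, assuming a sink manager that caches per-table checkpoints in a map keyed by table ID. The names (sinkManager, tableCheckpointTsMap, CreateTableSink) follow the wording above, but the code itself is illustrative, not the actual TiCDC implementation.

package sink

import "sync"

// sinkManager is a simplified stand-in for the real sink manager; only the
// per-table checkpoint cache that matters for this bug is modeled.
type sinkManager struct {
	// tableCheckpointTsMap maps tableID -> last flushed checkpoint ts.
	tableCheckpointTsMap sync.Map
}

// CreateTableSink is called when a table is (re)scheduled onto this capture.
// If an entry left over from a previous scheduling round survives here, the
// first flush after the table comes back reuses that stale checkpoint ts.
func (m *sinkManager) CreateTableSink(tableID int64, startTs uint64) {
	// The essence of the fix: overwrite any stale entry so the table
	// restarts from startTs instead of the old, too-large value.
	m.tableCheckpointTsMap.Store(tableID, startTs)
	// ... create and register the actual table sink here ...
}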

How did we locate the problem?

  1. We found that the barrier ts was stuck. At first we mistakenly thought the ddl puller was stuck, but after adding logs we found that it was not.
  2. We realized it was actually the checkpoint ts that was not advancing, and noticed in monitoring that the minimum checkpoint ts of one node was always held back by a particular table.
  3. After adding more logs, I found that the table had been scheduled multiple times, and that when it was last scheduled back, the checkpoints after the first flush used the checkpoint ts from when it was scheduled away, not the sink node's current checkpoint ts.
  4. Then we looked at the stack and found that flow control was stuck in ConsumeWithBlocking: its cond was never woken up, and the time it got stuck coincided with when the checkpoint ts stopped advancing (see the sketch after this list).
  5. Based on the above, I tried to reproduce the problem in tispace: I dispatched a particular table, let it synchronize for a while, repeated the dispatch operation several times, and finally reproduced the problem. The flow-control jam and the stack information also matched:
tiup cluster deploy  issue-4464 nightly topology.yaml -p (3 ticdc cluster)
tiup cluster start issue-4464
create database test;(in mysql)
tiup cdc:nightly cli changefeed create   --sink-uri="mysql://root:[email protected]:3306/" -c "test" --tz "Asia/Shanghai"
tiup cdc:nightly cli changefeed pause -c test
go-tpc tpcc -H 127.0.0.1 -P 4000  --warehouses 100 prepare -T 10
tiup cdc:nightly cli changefeed resume -c test
go-tpc tpcc -H 127.0.0.1 -P 4000  --warehouses 100 run -T 10
curl -X POST http://127.0.0.1:8300/capture/owner/move_table -d 'cf-id=test&target-cp-id=1f1c489f-c006-4f56-a50d-d509c4d140de&table-id=69'
curl -X POST http://127.0.0.1:8300/capture/owner/move_table -d 'cf-id=test&target-cp-id=81ce066d-5b78-44de-aa0c-ac596c3c83fd&table-id=69'
curl -X POST http://127.0.0.1:8300/capture/owner/move_table -d 'cf-id=test&target-cp-id=1f1c489f-c006-4f56-a50d-d509c4d140de&table-id=69'
  6. Analyzing the code, we realized we were not clearing the old values correctly in CreateTableSink, which caused the problem.
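
To make step 4 concrete, here is a minimal sketch of cond-based flow control in the style of ConsumeWithBlocking, assuming a simple per-table byte quota. It is an illustration only, not the actual TiCDC flow controller.

package flowcontrol

import "sync"

// tableFlowController models a per-table memory quota.
type tableFlowController struct {
	mu    sync.Mutex
	cond  *sync.Cond
	quota uint64 // bytes still available
}

func newTableFlowController(quota uint64) *tableFlowController {
	f := &tableFlowController{quota: quota}
	f.cond = sync.NewCond(&f.mu)
	return f
}

// ConsumeWithBlocking parks the caller until enough quota is available.
// If Release is never called, for example because a stale checkpoint ts
// keeps flushSink returning early so nothing is ever flushed and freed,
// callers wait on the cond forever, which is the stuck stack in step 4.
func (f *tableFlowController) ConsumeWithBlocking(bytes uint64) {
	f.mu.Lock()
	defer f.mu.Unlock()
	for f.quota < bytes {
		f.cond.Wait()
	}
	f.quota -= bytes
}

// Release returns quota once rows up to the new checkpoint ts are flushed.
func (f *tableFlowController) Release(bytes uint64) {
	f.mu.Lock()
	f.quota += bytes
	f.mu.Unlock()
	f.cond.Broadcast()
}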

Related questions that are not yet clear:

  1. Why would we push the checkpoint ts of a table to a huge value?

Many thanks to @liuzix and @overvenus for helping me solve this problem; without them I don't think I could have figured it out.

@Tammyxia
Author

Reproduced this issue in v5.2.4

  • /cdc version
    Release Version: v5.2.4
    Git Commit Hash: 2971ac3
    Git Branch: heads/refs/tags/v5.2.4
    UTC Build Time: 2022-04-18 02:52:07
    Go Version: go version go1.16.4 linux/amd64
    Failpoint Build: false

@Tammyxia
Author

Tammyxia commented May 6, 2022

Reproduced in v5.4.1 and v6.1.0-nightly 2022-05-05

@Rustin170506
Member

It happened again because it was affected by #5196. I will continue to investigate it.

@asddongmen
Contributor

@hi-rustin

Related questions that are not yet clear:

  1. Why would we push the checkpoint ts of a table to a huge value?

In my view, the reason seems to be that when a table is being scheduled, its resolvedTs is set to the global checkpointTs of all tables in the changefeed, which can be much smaller than the table's local checkpointTs kept in the sink.
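
A made-up numeric illustration of that ordering (values are not from this incident):

package main

import "fmt"

func main() {
	// The sink still holds the table's old checkpoint ts from before the
	// table was scheduled away, while the incoming resolved ts restarts
	// from the changefeed's global checkpoint ts, which lags behind it.
	staleCheckpointTs := uint64(432000000000000100)
	globalCheckpointTs := uint64(432000000000000001) // becomes the table's new resolvedTs
	// flushSink then always hits the resolvedTs <= checkpointTs early
	// return, so the table never flushes and flow control is never released.
	fmt.Println(globalCheckpointTs <= staleCheckpointTs) // true
}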

@Rustin170506
Member

Rustin170506 commented May 10, 2022

It happened again because it was affected by #5196. I will continue to investigate it.

It is indeed affected by this PR:

  1. Before we deleted the drawback chan, we waited for the buffered data in the bufferSink to be removed right before the Barrier, so the bufferSink would not write any new data to the MySQL sink while the Barrier was in progress, as expected.
  2. After we deleted it, we still removed the buffered data and then immediately performed the Barrier, but that part of the code may keep refreshing and updating the checkpoint in the bufferSink. At that point the MySQL sink may still be waiting for the Barrier to finish, so the bufferSink reads and keeps saving the old checkpoint, and we end up with the same problem as before.
  3. This problem also exists with the drawback mechanism, but it is hard to hit there: with no new data the Barrier finishes quickly, and even if the code mentioned above runs, it reads values that have already been cleaned up.

For now we just clean it up in the bufferSink; since the bufferSink is about to be removed, we do not need to add the drawback mechanism back. A sketch of this interim cleanup follows.
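
A minimal sketch of that interim cleanup, assuming the bufferSink caches per-table checkpoints in a plain map; the names are illustrative, not the real TiCDC types:

package sink

import "sync"

// bufferSinkState models only the piece of the bufferSink that matters here.
type bufferSinkState struct {
	mu                sync.Mutex
	tableCheckpointTs map[int64]uint64
}

// RemoveTableSink is called when a table is moved away. Clearing the cached
// checkpoint ts together with the buffered data means that a checkpoint
// refresh racing with the downstream Barrier can no longer re-publish a
// stale value after the table is scheduled back.
func (b *bufferSinkState) RemoveTableSink(tableID int64) {
	b.mu.Lock()
	defer b.mu.Unlock()
	// drop buffered rows for this table (omitted) ...
	delete(b.tableCheckpointTs, tableID)
}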

Test result: https://tcms.pingcap.net/dashboard/executions/plan/786269

Thanks @asddongmen for discussing this with me!

@nongfushanquan
Contributor

/label affects-5.4
