TTL job hangs in running state after decreasing tidb_ttl_delete_worker_count #55561

Closed
lcwangchao opened this issue Aug 21, 2024 · 2 comments · Fixed by #55572
Labels
affects-7.1 This bug affects the 7.1.x(LTS) versions. affects-7.5 This bug affects the 7.5.x(LTS) versions. affects-8.1 This bug affects the 8.1.x(LTS) versions. severity/major sig/sql-infra SIG: SQL Infra type/bug The issue is confirmed as a bug.

Comments

@lcwangchao
Collaborator

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

set @@global.tidb_ttl_scan_batch_size=500;
set @@global.tidb_ttl_delete_batch_size=1;
set @@global.tidb_ttl_delete_rate_limit=1;
set @@global.tidb_ttl_delete_worker_count=2;
create table ttl1(t timestamp) TTL=`t`+interval 1 minute ttl_job_interval='1m';
insert into ttl1 values(now() - interval 1 day);
insert into ttl1 select * from ttl1; -- repeat many times

Then wait until the TTL job is running. Once it is running, execute:

set @@global.tidb_ttl_delete_worker_count=1;

Then wait for the TTL job to finish.

2. What did you expect to see? (Required)

The TTL job eventually finishes.

3. What did you see instead (Required)

The TTL job never finishes because one of its tasks stays in the running state:

> select * from mysql.tidb_ttl_task\G
***************************[ 1. row ]***************************
job_id             | 70a857793b0e4296975fe8f695cb05d5
table_id           | 27866
scan_id            | 0
scan_range_start   |
scan_range_end     |
expire_time        | 2024-08-21 15:01:47
owner_id           | af7d74fd-81d4-4e16-ac78-3a1f4fc6919f
owner_addr         | <null>
owner_hb_time      | 2024-08-21 15:10:08
status             | running
status_update_time | 2024-08-21 15:02:47
state              | {"total_rows":32766,"success_rows":32368,"error_rows":1,"scan_task_err":""}
created_time       | 2024-08-21 15:02:47
1 row in set

You can see that total_rows (32766) > success_rows (32368) + error_rows (1). The task is only seen as finished when the two sides are equal.
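To make the gap concrete, here is a small sketch that parses the state column shown above and computes how many rows are unaccounted for. The taskState type and the unaccountedRows helper are hypothetical names used only for illustration; they are not part of TiDB.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// taskState mirrors the JSON fields visible in the state column above.
// The type itself is an assumption made for this example.
type taskState struct {
	TotalRows   int64  `json:"total_rows"`
	SuccessRows int64  `json:"success_rows"`
	ErrorRows   int64  `json:"error_rows"`
	ScanTaskErr string `json:"scan_task_err"`
}

// unaccountedRows (hypothetical helper) returns
// total_rows - (success_rows + error_rows) for a raw state value.
func unaccountedRows(raw string) (int64, error) {
	var s taskState
	if err := json.Unmarshal([]byte(raw), &s); err != nil {
		return 0, err
	}
	return s.TotalRows - (s.SuccessRows + s.ErrorRows), nil
}

func main() {
	raw := `{"total_rows":32766,"success_rows":32368,"error_rows":1,"scan_task_err":""}`
	n, err := unaccountedRows(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(n) // prints 397
}
```

A non-zero result on a task that is no longer making progress matches the symptom reported here.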

4. What is your TiDB version? (Required)

@lcwangchao lcwangchao added type/bug The issue is confirmed as a bug. sig/sql-infra SIG: SQL Infra affects-7.1 This bug affects the 7.1.x(LTS) versions. affects-7.5 This bug affects the 7.5.x(LTS) versions. affects-8.1 This bug affects the 8.1.x(LTS) versions. labels Aug 21, 2024
@lcwangchao
Collaborator Author

lcwangchao commented Aug 21, 2024

One reason is that when the delete worker is being canceled, the statistics are updated:

tidb/ttl/ttlworker/del.go

Lines 118 to 121 in cf44157

if err = globalDelRateLimiter.Wait(ctx); err != nil {
	t.statistics.IncErrorRows(len(delBatch))
	return
}

However, this is not correct, because only the rows in delBatch are counted and the rows still in leftRows are missed.

tidb/ttl/ttlworker/del.go

Lines 98 to 104 in cf44157

if int64(len(leftRows)) < maxBatch {
	delBatch = leftRows
	leftRows = nil
} else {
	delBatch = leftRows[0:maxBatch]
	leftRows = leftRows[maxBatch:]
}

Then, in the following check:

func (t *runningScanTask) finished() bool {
	return t.result != nil && t.statistics.TotalRows.Load() == t.statistics.ErrorRows.Load()+t.statistics.SuccessRows.Load()
}

finished will always return false, because the rows left in leftRows are never added to either ErrorRows or SuccessRows.
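The behavior described above can be reproduced in isolation. The following is a simplified simulation, not the actual ttlworker code: the stats type, deleteRows, runCanceled, and the wait callback (standing in for globalDelRateLimiter.Wait) are all assumptions made for illustration, and the t.result != nil part of the real check is omitted.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
)

// stats is a hypothetical stand-in for the task statistics.
type stats struct {
	TotalRows, SuccessRows, ErrorRows atomic.Int64
}

// finished mirrors the counter comparison quoted above.
func (s *stats) finished() bool {
	return s.TotalRows.Load() == s.ErrorRows.Load()+s.SuccessRows.Load()
}

// deleteRows splits rows into batches as in the quoted snippet. When wait
// fails, only the current batch is counted as errors and leftRows is dropped.
func deleteRows(ctx context.Context, s *stats, rows []int, maxBatch int, wait func(context.Context) error) {
	leftRows := rows
	for len(leftRows) > 0 {
		var delBatch []int
		if len(leftRows) < maxBatch {
			delBatch = leftRows
			leftRows = nil
		} else {
			delBatch = leftRows[:maxBatch]
			leftRows = leftRows[maxBatch:]
		}
		if err := wait(ctx); err != nil {
			s.ErrorRows.Add(int64(len(delBatch))) // leftRows is not counted
			return
		}
		s.SuccessRows.Add(int64(len(delBatch)))
	}
}

// runCanceled deletes 10 rows in batches of 2 and cancels the context
// just before the third batch.
func runCanceled() *stats {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	s := &stats{}
	rows := make([]int, 10)
	s.TotalRows.Add(int64(len(rows)))
	calls := 0
	wait := func(ctx context.Context) error {
		calls++
		if calls == 3 {
			cancel() // simulate the delete worker being canceled
		}
		return ctx.Err()
	}
	deleteRows(ctx, s, rows, 2, wait)
	return s
}

func main() {
	s := runCanceled()
	fmt.Println(s.SuccessRows.Load(), s.ErrorRows.Load(), s.finished()) // prints 4 2 false
}
```

In this run, 4 of the 10 rows are never counted, so the two sides of the comparison can never become equal.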

@YangKeao
Member

> However, it is not correct. Because only delBatch are included and leftRows are missed.

If the jump statement were continue, it would be fine not to add the length of leftRows. But since it is return, would t.statistics.IncErrorRows(len(leftRows) + len(delBatch)) be better? These rows will not be touched anymore either 🤔.
(Or we could add a new SkippedRows counter for them, but I don't think it would make much difference.)
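A minimal sketch of the suggestion above, under the same simplifying assumptions as any toy model of this loop: counters, deleteRowsCountingLeft, and runCanceledWithFix are hypothetical names, and the wait callback stands in for the rate limiter. It only illustrates the idea in this comment and does not claim to be what the merged fix does.

```go
package main

import (
	"context"
	"fmt"
)

// counters is a hypothetical, single-goroutine stand-in for the statistics.
type counters struct {
	total, success, errors int64
}

// finished mirrors the counter comparison used to decide task completion.
func (c *counters) finished() bool {
	return c.total == c.errors+c.success
}

// deleteRowsCountingLeft counts both the current batch and the remaining
// rows as errors before returning, as suggested in the comment above.
func deleteRowsCountingLeft(ctx context.Context, c *counters, rows []int, maxBatch int, wait func(context.Context) error) {
	leftRows := rows
	for len(leftRows) > 0 {
		var delBatch []int
		if len(leftRows) < maxBatch {
			delBatch = leftRows
			leftRows = nil
		} else {
			delBatch = leftRows[:maxBatch]
			leftRows = leftRows[maxBatch:]
		}
		if err := wait(ctx); err != nil {
			c.errors += int64(len(delBatch) + len(leftRows))
			return
		}
		c.success += int64(len(delBatch))
	}
}

// runCanceledWithFix deletes 10 rows in batches of 2 and cancels the
// context just before the third batch.
func runCanceledWithFix() *counters {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	rows := make([]int, 10)
	c := &counters{total: int64(len(rows))}
	calls := 0
	wait := func(ctx context.Context) error {
		calls++
		if calls == 3 {
			cancel() // simulate the delete worker being canceled
		}
		return ctx.Err()
	}
	deleteRowsCountingLeft(ctx, c, rows, 2, wait)
	return c
}

func main() {
	c := runCanceledWithFix()
	fmt.Println(c.success, c.errors, c.finished()) // prints 4 6 true
}
```

With the remaining rows included, every row ends up in exactly one counter, so the completion check can succeed after cancellation.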

@ti-chi-bot ti-chi-bot bot added may-affects-5.4 This bug maybe affects 5.4.x versions. may-affects-6.1 may-affects-6.5 labels Aug 21, 2024
@lcwangchao lcwangchao removed may-affects-5.4 This bug maybe affects 5.4.x versions. may-affects-6.1 may-affects-6.5 labels Aug 21, 2024
@ti-chi-bot ti-chi-bot bot closed this as completed in 1bf01f4 Aug 23, 2024
ti-chi-bot bot pushed a commit that referenced this issue Aug 23, 2024
ti-chi-bot bot pushed a commit that referenced this issue Aug 23, 2024
ti-chi-bot bot pushed a commit that referenced this issue Aug 27, 2024