Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ddl: handle store closed in doDDLJob #18844

Merged
merged 31 commits into from
Sep 1, 2020
Merged

Conversation

lance6716
Copy link
Contributor

@lance6716 lance6716 commented Jul 29, 2020

What problem does this PR solve?

Issue Number: close #18714 , pingcap/dm#792

Problem Summary: add an exit branch in doDDLJob

What is changed and how it works?

What's Changed:

detected d.ctx.Done() in doDDLJob, which may caused by worker closed, store closed, ..., to avoid deadloop. Note now doDDLJob will fail immediately when worker closed, so changed some 5-year-old test

How it Works:

add an exit branch in doDDLJob, add a unit test to verify

Related changes

  • Need to cherry-pick to the release branch

Check List

Tests

  • Unit test

Side effects

Release note

  • No release note

@github-actions github-actions bot added the sig/sql-infra SIG: SQL Infra label Jul 29, 2020
@lance6716 lance6716 marked this pull request as draft July 29, 2020 05:56
@lance6716 lance6716 marked this pull request as ready for review July 31, 2020 05:44
@codecov
Copy link

codecov bot commented Jul 31, 2020

Codecov Report

Merging #18844 into master will decrease coverage by 0.0289%.
The diff coverage is 100.0000%.

@@               Coverage Diff                @@
##             master     #18844        +/-   ##
================================================
- Coverage   79.1739%   79.1450%   -0.0290%     
================================================
  Files           548        548                
  Lines        148545     147883       -662     
================================================
- Hits         117609     117042       -567     
+ Misses        21449      21370        -79     
+ Partials       9487       9471        -16     

@lance6716
Copy link
Contributor Author

PTAL @zimulala

@lance6716
Copy link
Contributor Author

PTAL @zimulala

@lance6716
Copy link
Contributor Author

PTAL @AilinKid @zimulala

Copy link
Contributor

@AilinKid AilinKid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rest LGTM

@AilinKid
Copy link
Contributor

/run-all-tests

Copy link
Contributor

@AilinKid AilinKid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ti-srebot ti-srebot added the status/LGT1 Indicates that a PR has LGTM 1. label Aug 13, 2020
@csuzhangxc
Copy link
Member

@zimulala PTAL, thanks!

c.Assert(s.getDDLSchemaVer(c, d), GreaterEqual, ver)
d.restartWorkers(context.Background())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why remove this test? Will deleting this test reduce test scenarios?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

d.Stop()will call d.cancel(), which will trigger doDDLJob quit

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can replace done <- d.doDDLJob(ctx, job) with checking the job in history. Instead of removing these test scenarios.


c.Assert(failpoint.Enable("github.com/pingcap/tidb/ddl/storeCloseInLoop", `return(2)`), IsNil)
go func() {
time.Sleep(time.Second)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think use Sleep and go will make the test unstable and "sleep 2s" make the test run a long test.

@lance6716
Copy link
Contributor Author

/run-all-tests

1 similar comment
@lance6716
Copy link
Contributor Author

/run-all-tests

@lance6716
Copy link
Contributor Author

seems an error irrelevant to this PR could be reproduce:

run test [alter_table1] err: sql:ALTER TABLE m1 ENABLE KEYS, ALGORITHM= INPLACE;: failed to run query \n\"ALTER TABLE m1 ENABLE KEYS, ALGORITHM= INPLACE;\" \n around line 317, \nwe need(158):\nALTER TABLE m1 ENABLE KEYS, ALGORITHM= INPLACE;\nALTER TABLE t1 DROP INDEX i1;\nALTER TABLE t1 DROP INDEX i2;\nALTER TABLE t1 DROP INDEX i3;\nALTER TABLE t1 DROP \nbut got(158):\nALTER TABLE m1 ENABLE KEYS, ALGORITHM= INPLACE;\nError 1846: ALGORITHM=INPLACE is not supported. Reason: Cannot alter table by INPLACE. Try ALGORITHM=INSTANT.\n\n

@lance6716
Copy link
Contributor Author

PTAL @zimulala

Copy link
Contributor

@AilinKid AilinKid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@lance6716
Copy link
Contributor Author

lance6716 commented Aug 28, 2020

LGTM

Sorry there're some concurrency problem still, I'll fix it when I have some spare time 😢

@lance6716 lance6716 changed the title ddl: handle store closed in doDDLJob [WIP] ddl: handle store closed in doDDLJob Aug 28, 2020
// It only starts the original workers.
func (d *ddl) restartWorkers(ctx context.Context) {
d.cancel()
d.wg.Wait()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added these two lines⬆️

@lance6716 lance6716 changed the title [WIP] ddl: handle store closed in doDDLJob ddl: handle store closed in doDDLJob Aug 28, 2020
@lance6716
Copy link
Contributor Author

now CI is failed because Sending interrupt signal to process. waiting for more advice on code @zimulala

@lance6716
Copy link
Contributor Author

/run-all-tests

@lance6716
Copy link
Contributor Author

Oh finally CI passed

Copy link
Contributor

@zimulala zimulala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ti-srebot ti-srebot removed the status/LGT1 Indicates that a PR has LGTM 1. label Sep 1, 2020
@ti-srebot ti-srebot added the status/LGT2 Indicates that a PR has LGTM 2. label Sep 1, 2020
@zimulala zimulala added the status/can-merge Indicates a PR has been approved by a committer. label Sep 1, 2020
@ti-srebot
Copy link
Contributor

/run-all-tests

@@ -488,9 +490,17 @@ func (d *ddl) doDDLJob(ctx sessionctx.Context, job *model.Job) error {
metrics.HandleJobHistogram.WithLabelValues(job.Type.String(), metrics.RetLabel(err)).Observe(time.Since(startTime).Seconds())
}()
for {
failpoint.Inject("storeCloseInLoop", func(_ failpoint.Value) {
d.cancel()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may better call d.Stop here; otherwise clean up jobs like for del range won't be triggered.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sig/sql-infra SIG: SQL Infra status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

doDDLJob may retry and enter dead loop
7 participants