
acceptance: make docker more resilient to timeout in ContainerStart #17800

Merged
merged 1 commit into cockroachdb:master from docker-resilience on Aug 23, 2017

Conversation

tbg
Member

@tbg tbg commented Aug 22, 2017

Docker likes to never respond to us, and we do not usually have cancellations
on the context (which would not help, after all, that would just fail the test
right there). Instead, try a few times.

The problem looks similar to

golang/go#16060
golang/go#5103

Another possibility mentioned in user groups is that some file descriptor limit
is hit. Since I've never seen this locally, perhaps that's the case on our
agent machines. Unfortunately, those are hard to SSH into.

This may not be a good idea (after all, perhaps Start() succeeded) and we'd
have to do something similar for ContainerWait. But, at least it should
give us an additional data point: do the retries also just block? Is the
container actually started when we retry?
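A minimal sketch of the shape this takes, assuming the wrapper embeds the stock Docker client (the `resilientDockerClient` name, the `cli.APIClient` call, and the 20-second timeout appear in the review below; the attempt count and the embedding are illustrative assumptions, not the actual diff):

    package cluster

    import (
        "context"
        "time"

        "github.com/docker/docker/api/types"
        "github.com/docker/docker/client"

        "github.com/cockroachdb/cockroach/pkg/util/log"
    )

    // resilientDockerClient wraps the stock Docker API client; the embedding
    // here is an assumption made for the sketch.
    type resilientDockerClient struct {
        client.APIClient
    }

    // ContainerStart retries when a single attempt blocks past its deadline,
    // as long as the caller's context is still live.
    func (cli resilientDockerClient) ContainerStart(
        clientCtx context.Context, id string, opts types.ContainerStartOptions,
    ) error {
        var err error
        for attempt := 0; attempt < 5; attempt++ { // bound is illustrative
            ctx, cancel := context.WithTimeout(clientCtx, 20*time.Second)
            err = cli.APIClient.ContainerStart(ctx, id, opts)
            cancel() // release this attempt's timer
            // Stop retrying unless this attempt timed out while the caller's
            // context is still live.
            if err != context.DeadlineExceeded || clientCtx.Err() != nil {
                return err
            }
            log.Warningf(clientCtx, "ContainerStart timed out, retrying")
        }
        return err
    }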

@cockroach-teamcity
Member

This change is Reviewable

@tbg tbg force-pushed the docker-resilience branch from bc69a7e to efe2cbc on August 22, 2017 15:53
@tamird
Contributor

tamird commented Aug 22, 2017

Seems worthwhile to me; thanks for doing this.


Reviewed 1 of 1 files at r1.
Review status: all files reviewed at latest revision, 1 unresolved discussion, some commit checks pending.


pkg/acceptance/cluster/docker.go, line 324 at r1 (raw file):

func (cli resilientDockerClient) ContainerStart(clientCtx context.Context, id string, opts types.ContainerStartOptions) error {
	ctx, _ := context.WithTimeout(clientCtx, 20*time.Second)

help me understand this unusual use of context:

  • why do you need to bind clientCtx separately from this ctx?
  • why not bind and defer cancel here?
  • where you compare err == context.DeadlineExceeded, why not if err == ctx.Err(); isn't that what you really want?

after re-reading the code, I think maybe you just want to ctx, cancel := context.WithTimeout(clientCtx, ...) inside the for loop, so that each ContainerStart call has a timeout? the way the code is written now, I believe retries will be handed a context which is already expired. What do you think?
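For reference, a sketch of the per-attempt-timeout pattern being suggested here, reusing the wrapper type and imports from the sketch above (the retry policy is left as a bare loop, since only the context handling is at issue): scoping the timeout to a closure lets the deferred cancel release each attempt's timer, and checking `clientCtx.Err()` separately distinguishes 'this attempt timed out' from 'the caller gave up'.

    func (cli resilientDockerClient) ContainerStart(
        clientCtx context.Context, id string, opts types.ContainerStartOptions,
    ) error {
        for {
            err := func() error {
                // Fresh deadline for every attempt; the deferred cancel runs
                // as soon as this attempt returns.
                ctx, cancel := context.WithTimeout(clientCtx, 20*time.Second)
                defer cancel()
                return cli.APIClient.ContainerStart(ctx, id, opts)
            }()
            // Retry only if this attempt hit its own deadline while the
            // caller's context is still live.
            if err == context.DeadlineExceeded && clientCtx.Err() == nil {
                log.Warningf(clientCtx, "ContainerStart timed out, retrying")
                continue
            }
            return err
        }
    }

Comparing against `clientCtx.Err()` rather than the per-attempt `ctx.Err()` matters because the per-attempt context is expected to expire; what the retry loop needs to know is whether the caller still wants the container started.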



@tbg
Member Author

tbg commented Aug 23, 2017

Review status: all files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.


pkg/acceptance/cluster/docker.go, line 324 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

help me understand this unusual use of context:

  • why do you need to bind clientCtx separately from this ctx?
  • why not bind and defer cancel here?
  • where you compare err == context.DeadlineExceeded, why not if err == ctx.Err(); isn't that what you really want?

after re-reading the code, I think maybe you just want to ctx, cancel := context.WithTimeout(clientCtx, ...) inside the for loop, so that each ContainerStart call has a timeout? the way the code is written now, I believe retries will be handed a context which is already expired. What do you think?

As you've pointed out, this code was total 💩; rewritten!



@tbg tbg force-pushed the docker-resilience branch 2 times, most recently from 553a3ff to 7ffe1b4 on August 23, 2017 04:27
@tamird
Contributor

tamird commented Aug 23, 2017

:lgtm:


Reviewed 1 of 1 files at r2.
Review status: all files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.


pkg/acceptance/cluster/docker.go, line 332 at r2 (raw file):

			return cli.APIClient.ContainerStart(ctx, id, opts)

remove this empty line?


pkg/acceptance/cluster/docker.go, line 335 at r2 (raw file):

		}()

		if err == nil {

I think you can remove this clause:

		// Keep going if client's context is up for it.
		if err == context.DeadlineExceeded && clientCtx.Err() == nil {
			log.Warningf(clientCtx, "ContainerStart timed out, retrying")
			continue
		}
		return err


@tbg tbg force-pushed the docker-resilience branch from 7ffe1b4 to 50554e9 on August 23, 2017 17:42
@tbg
Member Author

tbg commented Aug 23, 2017

TFTR!


Review status: all files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.


pkg/acceptance/cluster/docker.go, line 332 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

remove this empty line?

Done.


pkg/acceptance/cluster/docker.go, line 335 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

I think you can remove this clause:

		// Keep going if client's context is up for it.
		if err == context.DeadlineExceeded && clientCtx.Err() == nil {
			log.Warningf(clientCtx, "ContainerStart timed out, retrying")
			continue
		}
		return err

Done.



@tamird
Contributor

tamird commented Aug 23, 2017

Reviewed 1 of 1 files at r3.
Review status: all files reviewed at latest revision, all discussions resolved.



@tbg tbg merged commit bb0e4dc into cockroachdb:master Aug 23, 2017
@tbg tbg deleted the docker-resilience branch August 23, 2017 18:03