Deadlock when running with COMPOSE_PARALLEL_LIMIT #5864

Closed

bcoughlan opened this issue Apr 10, 2018 · 8 comments · May be fixed by ko10ok/compose#1 or ko10ok/compose#2

bcoughlan commented Apr 10, 2018

Description of the issue

Compose can hang when trying to run tasks with a low COMPOSE_PARALLEL_LIMIT.

Suppose I'm starting 10 containers with a parallel limit of 3:

The first 3 containers begin starting. Then, in service.py:_execute_convergence_create, another parallel task is kicked off to actually start the containers. However, because the thread pool is already full, this task never executes and the application hangs.
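
To make the failure mode concrete, here is a minimal sketch of the pattern (not Compose's actual implementation), using a bounded ThreadPoolExecutor to stand in for the parallel limit; the service names are illustrative:

from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=3)  # stands in for COMPOSE_PARALLEL_LIMIT=3

def converge(service):
    # Mirrors the nested task: the outer task submits more work to the same
    # pool and then waits for it, holding on to its worker the whole time.
    inner = pool.submit(lambda: f"{service} started")
    return inner.result()  # blocks this worker until the inner task runs

futures = [pool.submit(converge, f"redis{i}") for i in range(1, 10)]
# All 3 workers end up blocked in converge(), and the inner tasks are queued
# behind the remaining outer tasks, so nothing can ever make progress.
# [f.result() for f in futures]  # hangs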

I think either:

  1. The service.py code needs a separate thread pool (complicated).
  2. Multiple instances of the same service need to start sequentially (inefficient for certain deployments).
  3. The parallel logic could be contained in project.py by running a task for each instance of a service.

In cases where parallel_execute is passed an objects parameter of length 1, could it just execute it on the calling thread? That would at least limit the issue to services with scale > 1.
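
A rough sketch of that last suggestion, with a deliberately simplified signature (the real parallel_execute takes more arguments than shown here):

from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=3)

def parallel_execute(objects, func):
    if len(objects) == 1:
        # Run single-object batches inline on the calling thread so they
        # never consume a pool worker.
        return [func(objects[0])]
    futures = [_pool.submit(func, obj) for obj in objects]
    return [f.result() for f in futures]

Nested calls that pass more than one object could still deadlock, which is why this would only narrow the problem to services with scale > 1.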

Context information (for bug reports)

Tested on master (2975f06 at time of writing).

$ docker-compose --version
docker-compose version 1.21.0dev, build unknown

Steps to reproduce the issue

Below is a Compose file that starts 9 instances of Redis. Run with COMPOSE_PARALLEL_LIMIT=3 docker-compose up to observe the issue:

version: '2.3'

services:
  redis1:
    image: "redis:alpine"
    ports:
      - "6379:6379"
  redis2:
    image: "redis:alpine"
    ports:
      - "6380:6379"
  redis3:
    image: "redis:alpine"
    ports:
      - "6381:6379"
  redis4:
    image: "redis:alpine"
    ports:
      - "6382:6379"
  redis5:
    image: "redis:alpine"
    ports:
      - "6383:6379"
  redis6:
    image: "redis:alpine"
    ports:
      - "6384:6379"
  redis7:
    image: "redis:alpine"
    ports:
      - "6385:6379"
  redis8:
    image: "redis:alpine"
    ports:
      - "6386:6379"
  redis9:
    image: "redis:alpine"
    ports:
      - "6387:6379"

shin- commented Apr 13, 2018

Thanks for the report! It's something we can look into. Obviously, the simple workaround is to just set the parallel limit to a higher value.

bcoughlan (Author) commented

Thanks for the reply. In my case I'm starting about 15 Java containers that are CPU-heavy on startup, and I was experimenting with limiting concurrency to avoid maxing out the CPU. Bringing them all up in parallel takes much longer to reach healthy than starting them in sequence.

I'm guessing that is the purpose of the concurrency limit flag? Having thought about it more, I reckon the gap is in Docker rather than Compose: the throttling would need to be done by Compose and Swarm, but also by Docker itself at system boot time when it starts many containers with restart=always.


bjsee commented Aug 11, 2018

Hi, we have this issue too. We are starting about 20 Java containers and want to reduce the number of containers starting in parallel, because otherwise the system slows down in the same way bcoughlan mentioned. Today we work around this by defining depends_on chains, but that is really ugly because we cannot replace a single container without restarting all dependent containers.

So it would be really cool to get the COMPOSE_PARALLEL_LIMIT feature working. Is there anything new on this issue?

bcoughlan (Author) commented

@bjsee Even without this bug, COMPOSE_PARALLEL_LIMIT wouldn't help much, as it doesn't wait for Docker healthchecks. I think the responsibility lies with the main Docker project, because the same issue occurs when you reboot your server and Docker starts up all the containers in parallel.

As yet I haven't found a solution to the problem.


Alexhha commented Aug 29, 2018

As a temporary workaround, you can combine healthchecks and depends_on (with a condition). For example, to run 3 containers at a time, you can try something like the following:

version: '2.3'

services:
  redis1:
    image: "redis:alpine"
    ports:
      - "6379:6379"
    healthcheck:
      test: ["CMD", "..."]

  redis2:
    image: "redis:alpine"
    ports:
      - "6380:6379"
    healthcheck:
      test: ["CMD", "..."]

  redis3:
    image: "redis:alpine"
    ports:
      - "6381:6379"
    healthcheck:
      test: ["CMD", "..."]

  redis4:
    image: "redis:alpine"
    ports:
      - "6382:6379"
    healthcheck:
      test: ["CMD", "..."]
    depends_on:
      redis1:
        condition: service_healthy
      redis2:
        condition: service_healthy
      redis3:
        condition: service_healthy

  redis5:
    image: "redis:alpine"
    ports:
      - "6383:6379"
    healthcheck:
      test: ["CMD", "..."]
    depends_on:
      redis1:
        condition: service_healthy
      redis2:
        condition: service_healthy
      redis3:
        condition: service_healthy

  redis6:
    image: "redis:alpine"
    ports:
      - "6384:6379"
    healthcheck:
      test: ["CMD", "..."]
    depends_on:
      redis1:
        condition: service_healthy
      redis2:
        condition: service_healthy
      redis3:
        condition: service_healthy

Of course this method has disadvantages: you have to distribute the workload manually, and if you later want to start 4 containers in parallel you have to rework the whole Compose file. But I think it could be automated with simple bash scripts. It's up to you.
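
As a rough illustration of that kind of automation, here is a sketch in Python rather than bash; the service names, ports, batch size and the redis-cli ping healthcheck are placeholders, and it assumes PyYAML is installed:

import yaml  # PyYAML

def batched_compose(n_services=9, batch=3):
    # Each batch of `batch` services waits for the previous batch to become
    # healthy, which caps how many containers are starting at any one time.
    services = {}
    for i in range(1, n_services + 1):
        svc = {
            "image": "redis:alpine",
            "ports": [f"{6378 + i}:6379"],
            # Placeholder healthcheck; substitute whatever fits your service.
            "healthcheck": {"test": ["CMD", "redis-cli", "ping"]},
        }
        prev_start = ((i - 1) // batch - 1) * batch + 1
        if prev_start >= 1:
            svc["depends_on"] = {
                f"redis{j}": {"condition": "service_healthy"}
                for j in range(prev_start, prev_start + batch)
            }
        services[f"redis{i}"] = svc
    return yaml.safe_dump({"version": "2.3", "services": services}, sort_keys=False)

if __name__ == "__main__":
    print(batched_compose())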

electrofelix commented

Trying to use COMPOSE_PARALLEL_LIMIT to prevent a "device or resource busy" error that occasionally appears in our CI, and running into the same problem.


stale bot commented Oct 9, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Oct 9, 2019

stale bot commented Oct 16, 2019

This issue has been automatically closed because it has not had recent activity during the stale period.
