
[QST] How to use multiple threads per GPU worker? #109

Closed
randerzander opened this issue Aug 8, 2019 · 11 comments
Labels
inactive-30d, question (Further information is requested)

Comments

@randerzander
Contributor

While running a large job, I noticed with watch -n 1 nvidia-smi that my GPUs were relatively underutilized.

I attempted to give each GPU worker more threads on which to process tasks simultaneously by passing threads_per_worker=2:

from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster
import dask, dask_cudf

cluster = LocalCUDACluster(ip='0.0.0.0', threads_per_worker=2)
client = Client(cluster)
# print client info
client

The cluster starts up fine, and even begins processing tasks with twice as many streams, as expected. However, progress as reported by the Dask Dashboard locks up shortly afterwards on a DAG that completes successfully with the typical single thread per worker.

nvidia-smi shows plenty of memory remaining on each card, and Jupyter shows no errors or warnings.

Is this expected behavior? Any suggestions for how to diagnose the freeze?

@mrocklin
Contributor

mrocklin commented Aug 8, 2019

No, locking up is not expected behavior.

I don't personally have enough knowledge of GPUs or CUDA to help here. If you can replicate the failure with a normal thread pool, without Dask, I would do that and then report upstream.

from concurrent.futures import ThreadPoolExecutor

e = ThreadPoolExecutor(4)  # four threads

list(e.map(some_func, *args))  # maybe get this to lock up somehow?
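
A fuller version of that sketch, assuming cuDF is installed; some_work and its data below are placeholders standing in for whatever the real workflow runs, not the original workload:

import cudf
from concurrent.futures import ThreadPoolExecutor

def some_work(i):
    # Build a small GPU DataFrame and run a reduction on it.
    df = cudf.DataFrame({"x": list(range(100_000))})
    return float((df["x"] * i).sum())

# Four threads issuing cuDF calls against the same GPU, with no Dask involved.
e = ThreadPoolExecutor(4)
print(list(e.map(some_work, range(8))))  # if this hangs, the problem is below Dask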

@mrocklin
Contributor

mrocklin commented Aug 8, 2019

Note that even if we stopped things from locking up, I think that the GPU effectively serializes calls sent to it. In the future we could assign one CUDA stream per CPU thread, but currently there is no generic way to do this in Python. Every GPU library has its own streams API.
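
As one concrete illustration of the idea, here is what per-thread streams look like with a single library's API (CuPy in this sketch; this only shows one library's streams interface, not something Dask assigns automatically):

import threading
import cupy

def worker(n):
    # Each thread launches its kernels on its own non-default stream,
    # so work from different threads can overlap on the GPU.
    with cupy.cuda.Stream(non_blocking=True) as stream:
        x = cupy.arange(n)
        y = (x * 2).sum()
        stream.synchronize()
        print(int(y))

threads = [threading.Thread(target=worker, args=(1_000_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()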

@randerzander
Contributor Author

Conceptually that makes sense to me.

However, in running a 100GB workflow on 2 GV100s (32 GB mem each), reading files from local SSD, I'm seeing large performance improvements by increasing the number of worker processes using each GPU.

Test 1: LocalCUDACluster default setup, 1 worker process per GPU:
Wall time: 6-7 minutes, runtime varies significantly per run

Test 2: Two worker processes per GPU started via the dask-worker CLI
CUDA_VISIBLE_DEVICES=0 dask-worker localhost:8786 --nprocs 2 --nthreads 1 --memory-limit 0
CUDA_VISIBLE_DEVICES=1 dask-worker localhost:8786 --nprocs 2 --nthreads 1 --memory-limit 0

Wall time: 3 minutes 16 seconds

Test 3: Three worker processes per GPU started via the dask-worker CLI
CUDA_VISIBLE_DEVICES=0 dask-worker localhost:8786 --nprocs 3 --nthreads 1 --memory-limit 0
CUDA_VISIBLE_DEVICES=1 dask-worker localhost:8786 --nprocs 3 --nthreads 1 --memory-limit 0

Wall time: 2min 42s

I think this is telling me that, even if GPUs internally schedule tasks sequentially, Dask's scheduler latency is such that queuing multiple tasks per GPU can drastically improve utilization and throughput.

@mrocklin
Contributor

mrocklin commented Aug 9, 2019 via email

@randerzander
Contributor Author

randerzander commented Aug 9, 2019

Maybe there are also other non-GPU tasks that the workers
are spending time on.

Good point. The above tests were against gzipped files, so more processes reduced the overhead of host-side decompression.

When I switched to pre-decompressed data, some of the improvement dropped, but it is still significant, and the run-to-run variance shrank to about 10 seconds instead of about 1 minute.

2 processes per worker, chunksize='512 MiB':
Wall time: 4min 46s

3 processes per worker, chunksize='512 MiB':
Wall time: 3min 55s

3 processes per worker, chunksize='1024 MiB':
Wall time: 3min 35s - 3min 46s
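
For reference, a sketch of how the chunk size was varied in runs like these; the path is a placeholder, and depending on the dask_cudf version the argument is spelled chunksize or blocksize:

import dask_cudf

# Larger chunks mean fewer, bigger tasks per GPU, which lowers per-task
# scheduler overhead at the cost of more memory held per task.
ddf = dask_cudf.read_csv("/data/decompressed/*.csv", chunksize="1024 MiB")
print(len(ddf))  # force the read so timings are comparable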

@randerzander
Contributor Author

randerzander commented Aug 9, 2019

Despite the issue title, I don't have a preference for threads over processes. I'm mostly interested in how to improve throughput when my GPUs are underutilized.

@pentschev
Member

In my experience so far, GPUs tend to get underutilized due to communication. One of the main issues with using processes today is that all communication happens over TCP, and this hurts performance badly depending on the workflow (e.g., when workers require bits of data from chunks assigned to different workers). With the UCX work, this communication will eventually happen over InfiniBand or NVLink, and stalls due to communication should be reduced significantly. Spilling to host also takes a big toll, which likewise shows up as lower GPU utilization.

For now, I would recommend threads because communication within the same process can happen via host memory rather than TCP. Note that this is mostly an optimistic expectation, since communication between different GPUs (due to different worker processes) will still go through TCP.
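
A sketch of what that looks like with the UCX support that later dask-cuda releases expose (parameter names as documented there; NVLink and InfiniBand only help where the hardware exists):

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# UCX-based communication so inter-worker transfers can use NVLink
# (and InfiniBand where available) instead of TCP.
cluster = LocalCUDACluster(
    protocol="ucx",
    enable_tcp_over_ucx=True,
    enable_nvlink=True,
)
client = Client(cluster)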

@jakirkham
Member

Do we have a sense of what it would take to get multiple threads-per-worker to work? What things would need to change?

@jakirkham
Member

@jrhemstad asked here:

What does Dask do when you schedule more than one thread per worker? Does it give each thread its own pool? When you have multiple processes per GPU, is it setting pool sizes appropriately?

Pools are created per worker (not thread). So multiple threads on one worker would share the same pool.

We rely on the user to specify the size. If the user doesn't specify a pool size, we don't enable the pool.
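
A sketch of the per-worker pool setup, assuming a dask-cuda version that exposes rmm_pool_size (the 30GB figure is only an example and should be sized to the GPU):

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# One RMM pool is created per worker process; every thread on that worker
# allocates from the same pool, so size it for the whole worker's workload.
cluster = LocalCUDACluster(rmm_pool_size="30GB")
client = Client(cluster)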

@pentschev added the question (Further information is requested) label on Jan 8, 2021
@github-actions

This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.

@jakirkham
Member

Should add that Peter has been doing a lot of work adding support for PTDS (per-thread default streams), so that may be something worth trying out at some point.
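
As an illustration only, this is the pattern PTDS is meant to help; how PTDS actually gets enabled is library- and build-specific, so no particular switch is shown here:

import threading
import cupy

def worker(i):
    # With PTDS enabled, default-stream work submitted from this thread
    # lands on this thread's own stream rather than the shared legacy
    # default stream, so threads stop serializing on stream 0.
    x = cupy.arange(1_000_000) * i
    print(int(x.sum()))

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()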
