[QST] How to use multiple threads per GPU worker? #109
No, locking up is not expected behavior. I don't personally have enough knowledge of GPUs or CUDA to help here. If you can replicate the failure with a normal thread pool, without Dask, then I would try to do that and then report upstream.

from concurrent.futures import ThreadPoolExecutor

e = ThreadPoolExecutor(4)  # four threads
list(e.map(some_func, *args))  # maybe get this to lock up somehow?
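For example, a minimal standalone reproducer along those lines might look like the sketch below. The load_and_process function, the "key" column, and the file names are hypothetical stand-ins for whatever cuDF work the Dask tasks were actually doing; the point is only to drive one GPU from several Python threads without Dask in the loop.

import cudf
from concurrent.futures import ThreadPoolExecutor

paths = ["part-0.csv", "part-1.csv", "part-2.csv", "part-3.csv"]  # placeholder file names

def load_and_process(path):
    # hypothetical stand-in for the per-task GPU work
    df = cudf.read_csv(path)
    return df.groupby("key").agg("sum")  # assumes a "key" column exists

with ThreadPoolExecutor(4) as pool:  # four threads sharing one GPU
    results = list(pool.map(load_and_process, paths))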
Note that even if we stopped things from locking up, I think the GPU effectively serializes the calls sent to it. In the future we could assign one CUDA stream per CPU thread, but currently there is no generic way to do this in Python; every GPU library has its own streams API.
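As an illustration of the per-library situation described above, here is a rough sketch of what one stream per thread looks like with CuPy's stream API. This is only an example of one library's interface, not something Dask or dask-cuda sets up automatically.

import threading
import cupy

def worker(n):
    # each thread opens its own non-blocking CUDA stream;
    # kernels launched inside the context go onto that stream
    with cupy.cuda.Stream(non_blocking=True):
        x = cupy.random.random((n, n))
        y = x @ x
        cupy.cuda.get_current_stream().synchronize()

threads = [threading.Thread(target=worker, args=(1024,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()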
Conceptually that makes sense to me. However, in running a 100GB workflow on 2 GV100s (32 GB mem each), reading files from local SSD, I'm seeing large performance improvements by increasing the number of worker processes using each GPU.

Test 1: LocalCUDACluster default setup, 1 worker process per GPU:
Wall time: 6-7 minutes, runtime varies significantly per run

Test 2: Two processes per GPU started via the dask CLI:
CUDA_VISIBLE_DEVICES=0 dask-worker localhost:8786 --nprocs 2 --nthreads 1 --memory-limit 0
CUDA_VISIBLE_DEVICES=1 dask-worker localhost:8786 --nprocs 2 --nthreads 1 --memory-limit 0
Wall time: 3 minutes 16 seconds

Test 3: Three processes per GPU started via the dask CLI:
CUDA_VISIBLE_DEVICES=0 dask-worker localhost:8786 --nprocs 3 --nthreads 1 --memory-limit 0
CUDA_VISIBLE_DEVICES=1 dask-worker localhost:8786 --nprocs 3 --nthreads 1 --memory-limit 0
Wall time: 2min 42s

I think this is telling me that, even if GPUs internally schedule tasks sequentially, Dask's scheduler latency is such that queuing tasks per GPU can drastically improve utilization and throughput.
Sounds great. Maybe there are also other non-GPU tasks that the workers are spending time on. Having multiple threads going to keep the GPUs saturated sounds great. The challenge is that you'll probably be working on a few different chains of tasks at a time, so you'll want to keep the chunk size down-ish, but that's probably the case anyway.

I guess the thing to do then is to figure out why the GPU tasks are stalling out when run in multiple threads. I don't know how to push on this personally. Maybe @kkraus14 has thoughts?
Good point. The above tests were against gzipped files, so more processes reduced the overhead of host-side decompression. When I switched to pre-decompressed data, some of the improvement dropped, but it is still significant, and varies only by about 10 seconds per run instead of 1 minute.

2 processes per worker, chunksize='512 MiB':
3 processes per worker, chunksize='512 MiB':
3 processes per worker, chunksize='1024 MiB':
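For context, the chunk size in runs like these would typically be set when reading the data. A rough sketch of that pattern with dask_cudf is below; the path is a placeholder, and the argument name is an assumption (older dask_cudf releases used chunksize, newer ones call it blocksize).

import dask_cudf

# read CSV parts into one dask_cudf DataFrame, controlling how much
# data each partition (and therefore each task) handles
df = dask_cudf.read_csv("/data/transactions-*.csv", chunksize="512 MiB")  # placeholder path
print(df.npartitions)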
Despite the issue title, I don't have a preference for threads over processes. I'm mostly interested in how to improve throughput when my GPUs are underutilized.
In my experience so far, GPUs tend to get underutilized due to communication. One of the main issues with using processes today is that all communication happens over TCP, and this hurts performance badly depending on the workflow (for instance, if workers require bits of data from chunks assigned to different workers). With the UCX work, this communication will eventually happen over InfiniBand or NVLink, and stalls due to communication should be reduced significantly. Spilling to host also takes a big toll, which is likewise reflected in GPU utilization.

For now, I would recommend threads, because communication within the same process can happen via host memory rather than TCP. Note that this is mostly an optimistic expectation, since communication between different GPUs (i.e., between different worker processes) will still go through TCP.
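Once the UCX support mentioned above is available, enabling it would presumably look something like the sketch below. The protocol, enable_nvlink, and enable_infiniband arguments reflect dask_cuda options as I understand them; exact names and availability depend on the installed dask_cuda and UCX versions.

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# one worker per GPU, with inter-worker transfers over UCX
# (NVLink between GPUs on the same node, InfiniBand across nodes)
cluster = LocalCUDACluster(
    protocol="ucx",
    enable_nvlink=True,
    enable_infiniband=False,  # set True if IB hardware is present
)
client = Client(cluster)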
Do we have a sense of what it would take to get multiple threads per worker to work? What would need to change?
@jrhemstad asked here:
Pools are created per worker (not per thread), so multiple threads on one worker would share the same pool. We rely on the user to specify the size; if the user doesn't specify a pool size, we don't enable the pool.
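For reference, specifying the pool size per worker can be done either through dask_cuda or directly through RMM. The sketch below shows both patterns with placeholder sizes; the rmm_pool_size argument and rmm.reinitialize call are the interfaces I'm aware of, but check the versions you have installed.

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# option 1: let dask-cuda create an RMM pool on each worker at startup
cluster = LocalCUDACluster(rmm_pool_size="24GB")  # placeholder size
client = Client(cluster)

# option 2: initialize the pool explicitly on every worker
import rmm

def setup_rmm_pool():
    rmm.reinitialize(pool_allocator=True, initial_pool_size=24 * 2**30)

client.run(setup_rmm_pool)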
I should add that Peter has been doing a lot of work adding support for PTDS (per-thread default streams), so that may be something worth trying out at some point.
While running a large job, I noticed with watch -n 1 nvidia-smi that my GPUs were relatively underutilized. I attempted to give each GPU worker more threads on which to process tasks simultaneously with threads_per_worker=2.

The cluster starts up fine, and even begins processing tasks with twice as many streams, as expected. However, progress as reported by the Dask dashboard locks up shortly afterwards, on a DAG that completes successfully with the typical single thread per worker. nvidia-smi shows plenty of memory remaining per card, and Jupyter shows no errors or warnings.

Is this expected behavior? Any suggestions for how to diagnose the freeze?
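For completeness, the setup being described is presumably something like the sketch below. The original cluster-creation code was not included in the issue, so this is a guess at the pattern rather than the exact code that was run.

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# one worker per visible GPU, with two threads per worker instead of the default one
cluster = LocalCUDACluster(threads_per_worker=2)
client = Client(cluster)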