[QST] How to use multiple threads per GPU worker? #109
No, locking up is not expected behavior. I don't personally have enough knowledge of GPUs or CUDA to help here. If you can replicate the failure with a normal thread pool, without Dask, then I would try to do that and then report upstream.

from concurrent.futures import ThreadPoolExecutor

e = ThreadPoolExecutor(4)  # four threads
list(e.map(some_func, *args))  # maybe get this to lock up somehow?
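For example, a minimal standalone reproducer along those lines might look like the sketch below. The load_and_process function, the "key" column, and the file names are hypothetical stand-ins for whatever cuDF work the Dask tasks were actually doing; the point is only to drive one GPU from several Python threads without Dask in the loop.

import cudf
from concurrent.futures import ThreadPoolExecutor

paths = ["part-0.csv", "part-1.csv", "part-2.csv", "part-3.csv"]  # placeholder file names

def load_and_process(path):
    # hypothetical stand-in for the per-task GPU work
    df = cudf.read_csv(path)
    return df.groupby("key").agg("sum")  # assumes a "key" column exists

with ThreadPoolExecutor(4) as pool:  # four threads sharing one GPU
    results = list(pool.map(load_and_process, paths))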
Note that even if we stopped things from locking up, I think the GPU effectively serializes the calls sent to it. In the future we could assign one CUDA stream per CPU thread, but currently there is no generic way to do this in Python; every GPU library has its own streams API.
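As an illustration of the per-library situation described above, here is a rough sketch of what one stream per thread looks like with CuPy's stream API. This is only an example of one library's interface, not something Dask or dask-cuda sets up automatically.

import threading
import cupy

def worker(n):
    # each thread opens its own non-blocking CUDA stream;
    # kernels launched inside the context go onto that stream
    with cupy.cuda.Stream(non_blocking=True):
        x = cupy.random.random((n, n))
        y = x @ x
        cupy.cuda.get_current_stream().synchronize()

threads = [threading.Thread(target=worker, args=(1024,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()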
Conceptually that makes sense to me. However, in running a 100GB workflow on 2 GV100s (32 GB mem each), reading files from local SSD, I'm seeing large performance improvements by increasing the number of worker processes using each GPU.

Test 1: LocalCUDACluster default setup, 1 worker process per GPU:
Wall time: 6-7 minutes, runtime varies significantly per run

Test 2: Two processes per GPU started via the dask CLI:
CUDA_VISIBLE_DEVICES=0 dask-worker localhost:8786 --nprocs 2 --nthreads 1 --memory-limit 0
CUDA_VISIBLE_DEVICES=1 dask-worker localhost:8786 --nprocs 2 --nthreads 1 --memory-limit 0
Wall time: 3 minutes 16 seconds

Test 3: Three processes per GPU started via the dask CLI:
CUDA_VISIBLE_DEVICES=0 dask-worker localhost:8786 --nprocs 3 --nthreads 1 --memory-limit 0
CUDA_VISIBLE_DEVICES=1 dask-worker localhost:8786 --nprocs 3 --nthreads 1 --memory-limit 0
Wall time: 2min 42s

I think this is telling me that, even if GPUs internally schedule tasks sequentially, Dask's scheduler latency is such that queuing tasks per GPU can drastically improve utilization and throughput.
Sounds great. Maybe there are also other non-GPU tasks that the workers are spending time on. Having multiple threads going to keep the GPUs saturated sounds great. The challenge is that you'll probably be working on a few different chains of tasks at a time, so you'll want to keep the chunk size down-ish, but that's probably the case anyway.

I guess the thing to do then is to figure out why the GPU tasks are stalling out when run in multiple threads. I don't know how to push on this personally. Maybe @kkraus14 has thoughts?
Good point. The above tests were against gzipped files, so more processes reduced the overhead of host-side decompression. When I switched to pre-decompressed data, some of the improvement dropped, but it is still significant, and varies only by about 10 seconds per run instead of 1 minute.

2 processes per worker, chunksize='512 MiB':
3 processes per worker, chunksize='512 MiB':
3 processes per worker, chunksize='1024 MiB':
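For context, the chunk size in runs like these would typically be set when reading the data. A rough sketch of that pattern with dask_cudf is below; the path is a placeholder, and the argument name is an assumption (older dask_cudf releases used chunksize, newer ones call it blocksize).

import dask_cudf

# read CSV parts into one dask_cudf DataFrame, controlling how much
# data each partition (and therefore each task) handles
df = dask_cudf.read_csv("/data/transactions-*.csv", chunksize="512 MiB")  # placeholder path
print(df.npartitions)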
Despite the issue title, I don't have a preference for threads over processes. I'm mostly interested in how to improve throughput when my GPUs are underutilized.
In my experience so far, GPUs tend to get underutilized due to communication. One of the main issues with using processes today is that all communication happens over TCP, and this hurts performance badly depending on the workflow (for instance, if workers require bits of data from chunks assigned to different workers). With the UCX work, this communication will eventually happen over InfiniBand or NVLink, and stalls due to communication should be reduced significantly. Spilling to host also takes a big toll, which is likewise reflected in GPU utilization.

For now, I would recommend threads, because communication within the same process can happen via host memory rather than TCP. Note that this is mostly an optimistic expectation, since communication between different GPUs (i.e., between different worker processes) will still go through TCP.
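Once the UCX support mentioned above is available, enabling it would presumably look something like the sketch below. The protocol, enable_nvlink, and enable_infiniband arguments reflect dask_cuda options as I understand them; exact names and availability depend on the installed dask_cuda and UCX versions.

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# one worker per GPU, with inter-worker transfers over UCX
# (NVLink between GPUs on the same node, InfiniBand across nodes)
cluster = LocalCUDACluster(
    protocol="ucx",
    enable_nvlink=True,
    enable_infiniband=False,  # set True if IB hardware is present
)
client = Client(cluster)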
Do we have a sense of what it would take to get multiple threads per worker to work? What would need to change?
@jrhemstad asked here:
Pools are created per worker (not per thread), so multiple threads on one worker would share the same pool. We rely on the user to specify the size; if the user doesn't specify a pool size, we don't enable the pool.
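For reference, specifying the pool size per worker can be done either through dask_cuda or directly through RMM. The sketch below shows both patterns with placeholder sizes; the rmm_pool_size argument and rmm.reinitialize call are the interfaces I'm aware of, but check the versions you have installed.

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# option 1: let dask-cuda create an RMM pool on each worker at startup
cluster = LocalCUDACluster(rmm_pool_size="24GB")  # placeholder size
client = Client(cluster)

# option 2: initialize the pool explicitly on every worker
import rmm

def setup_rmm_pool():
    rmm.reinitialize(pool_allocator=True, initial_pool_size=24 * 2**30)

client.run(setup_rmm_pool)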
I should add that Peter has been doing a lot of work adding support for PTDS (per-thread default streams), so that may be something worth trying out at some point.
While running a large job, I noticed with watch -n 1 nvidia-smi that my GPUs were relatively underutilized. I attempted to give each GPU worker more threads on which to process tasks simultaneously with threads_per_worker=2.

The cluster starts up fine, and even begins processing tasks with twice as many streams, as expected. However, progress as reported by the Dask dashboard locks up shortly afterwards, on a DAG that completes successfully with the typical single thread per worker. nvidia-smi shows plenty of memory remaining per card, and Jupyter shows no errors or warnings.

Is this expected behavior? Any suggestions for how to diagnose the freeze?
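For completeness, the setup being described is presumably something like the sketch below. The original cluster-creation code was not included in the issue, so this is a guess at the pattern rather than the exact code that was run.

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# one worker per visible GPU, with two threads per worker instead of the default one
cluster = LocalCUDACluster(threads_per_worker=2)
client = Client(cluster)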