[BUG] thrust::system::system_error what(): for_each: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered #5934
Comments
This might be an OOM caused by the additional processing for JSON input with object rows.
I was able to process dataframes much, much larger than this until about a week ago. Also, each JSON file is pretty small (~250MB in size). I am running it on a T4 (16GB GPU memory), so I think there's enough GPU memory. I am also seeing this issue when processing fewer than 25 files (around 10-15).
I also see a lot of Dask-CUDA issues popping up; maybe this is stemming from one of those bugs? I am by no means an expert, but just saying.
Could be a processing bug, but I wonder why it would only happen with Dask. I agree that there should be enough memory, but it might still be an OOM issue because of unreasonably large overhead during reads.
So if you run the scripts from the issue description as is, you should see the same CUDA errors in the worker logs, given your environment is the same as the one I described. Are there any other logs I should be looking at? Would be happy to help, but I thought a minimal reproducer would be best for you guys to debug. P.S. I am using CUDA 10.2, and Python 3.7/3.8 both show the same errors.
My system doesn't have the same device memory capacity, so I would appreciate the logs if it's easy for you to get them.
No problem, I just sent them over. Thanks for digging into this!
If it is an OOM issue, it's possible this is related to an RMM/Dask-CUDA/Dask issue where device 0 is the only device being used even though multiple GPUs are requested.
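One quick way to check whether only device 0 is doing the work is to watch per-GPU memory while the workload runs. A minimal sketch, assuming the pynvml package is installed (this diagnostic is not part of the original reproducer):

```python
# Hypothetical diagnostic (not from the thread): poll per-GPU memory while the
# workload runs to check whether only device 0 is actually being used.
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU {i}: {mem.used / 2**20:.0f} / {mem.total / 2**20:.0f} MiB used")
pynvml.nvmlShutdown()
```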
I see this problem both when using multiple GPUs and when using a single GPU. But I do believe it has something to do with the issues you mentioned.
Just to add to this: in other words, this is an issue related to PR rapidsai/rmm#466. We are discussing this in other contexts as well.
Thanks for pointing this out, @jakirkham!
Filed an MRE here: rapidsai/dask-cuda#364.
@chinmaychandak, could you please try PR rapidsai/dask-cuda#363? Requires the very latest (like from minutes ago).
We went ahead and merged that.
Sorry I missed the earlier message. Great, thanks a lot! Will give it a shot soon.
@jakirkham I'm still seeing the same issues with the latest nightlies (0.16.0a200812). Can you try and reproduce them locally so that I can make sure I'm not doing anything differently?
I'm running the repro locally, will update once the script is done.
Update: If I start workers using […]
When using GPUs with Dask, the current working assumption is that there should be 1 worker and 1 thread per GPU. This is generally for proper CUDA context creation, but also for sensible resource management. We built dask-cuda to make this setup trivial for users.
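For reference, the one-worker/one-thread-per-GPU setup that dask-cuda automates looks roughly like the following. This is a generic sketch, not taken from the reproducer:

```python
# Minimal sketch of the recommended dask-cuda setup: one worker process
# (each with a single thread) per visible GPU.
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster()  # spawns one worker per GPU by default
client = Client(cluster)
print(client)
```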
I agree, but we have been using multiple Dask worker processes per GPU for high throughput in cuStreamz streaming pipelines, and it has been working flawlessly until recently.
@quasiben in this case they're using multiple processes and CUDA MPS in order to handle workloads that don't nicely saturate the entire GPU on their own. @chinmaychandak it seems like everything is working as long as you have a single thread per process, yes?
That may be true, but Ben is right: that's not expected to work currently. Not to say we are against changing this (and it is part of the reason for pushing for PTDS 😉).
Yes, @kkraus14, that's correct. I even tested it with the accelerated Kafka bit in one of the more complex cuStreamz pipelines, just to see if that works, and it does work fine as long as there's one thread per process. But for most pipelines we do use multiple threads per process, especially for benchmarking purposes. Again, we've been doing this for over a year, and it's never been a problem, so I'm not sure why I'm seeing this issue now. I think @vuule mentioned that he couldn't reproduce the issue locally, so maybe I'm doing something wrong here.
cc @harrism, as we're seeing a threading-related issue and there were substantial changes with regard to RMM and threading.
@kkraus14, an update: I'm now seeing the same issues with multiple processes too, but the issue only appeared about 10 minutes after starting a stream with a high input rate. That's probably why I couldn't see it with the minimal reproducer. I will try reading a larger number of JSON files to see if the multi-process setup fails there too.
Maybe retry with newer RMM packages (rapidsai/rmm#493)? Just to reiterate, I wouldn't expect Dask-CUDA to work with multiple threads per worker today (rapidsai/dask-cuda#109).
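For context on the memory-pool question that comes up below: RMM's pool allocator is typically configured once per worker process, and all threads in that process share the default memory resource. A minimal sketch, assuming the rmm Python package (the pool size here is purely illustrative):

```python
# Illustrative sketch: enabling RMM's pool allocator inside a worker process.
# Every thread in this process shares the default memory resource set here.
import rmm

rmm.reinitialize(
    pool_allocator=True,        # use a suballocating pool instead of raw cudaMalloc
    initial_pool_size=2 << 30,  # 2 GiB starting pool (illustrative value)
)
```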
Will give it a shot when the nightlies are out. Actually, they're out. Let me try. When I do, I would really appreciate it if someone could run the repro locally to see if they're seeing the same error as me.
Just did, still doesn't seem to work.
What does Dask do when you schedule more than one thread per worker? Does it give each thread its own pool? When you have multiple processes per GPU, is it setting pool sizes appropriately?
Let's move that discussion over here (rapidsai/dask-cuda#109), if that's ok 🙂. Edit: Answered in rapidsai/dask-cuda#109 (comment).
Does this reproduce with a plain ThreadPoolExecutor, without Dask?

```python
from concurrent.futures import ThreadPoolExecutor

import cudf

def func_json(batch):
    file = f"json_files/json-{batch}.txt"
    df = cudf.read_json(file, lines=True, engine="cudf")
    return len(df)

with ThreadPoolExecutor(max_workers=1) as executor:
    batch_arr = [i for i in range(1, 25)]
    res = executor.map(func_json, batch_arr)
    for e in res:
        print(e)
```

Edit: May be worth playing with …
Okay, so I thought of using CSV files instead of JSON, so I converted the existing JSON files to CSV and then updated the repro script to call read_csv instead. It seems to run fine with 2 processes and 2 threads. So this is specifically happening with the JSON reader?
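The conversion snippet itself isn't preserved in this copy of the thread. A minimal sketch of one way to do it with cuDF, reusing the json_files/json-<i>.txt naming from the reproducer above (the CSV file names are my own, not from the thread):

```python
# Hypothetical JSON-to-CSV conversion; not the original snippet from the thread.
import cudf

for i in range(1, 25):
    df = cudf.read_json(f"json_files/json-{i}.txt", lines=True, engine="cudf")
    df.to_csv(f"json_files/csv-{i}.csv", index=False)
```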
Got a local repro with multithreaded JSON reads; it reproduces fairly consistently.
I'm suspecting synchronization issue(s) that got exposed by GPU saturation from concurrent reads.
Were there any significant changes recently that could have caused this? Because I wonder why we weren't seeing these issues before.
I made a significant change to the JSON reader 2 weeks ago that could affect this.
Still got this randomly from tests that normally don't hit it:
I don't see any reference to cuDF there. It's possible that XGBoost is hitting this error in Thrust. Would need more information.
Original issue description: This script just reads randomly created JSON files using Dask, with no heavy processing.
Dask worker logs show errors like the one in the title, which eventually cause workers to restart frantically and lead to connection issues between the scheduler and workers.
NOTE: If I do not use Dask, the processing seems to go through without failures.
I used the following commands:

```
nohup dask-scheduler --host localhost &> scheduler.out &
CUDA_VISIBLE_DEVICES=0 nohup dask-worker localhost:8786 --nprocs 2 --nthreads 2 --memory-limit="16GB" --resources "process=1" >& worker.out &
```
Logs can be seen in scheduler.out and worker.out.
Random JSON files producer script:
Processing script:
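Neither script is reproduced in this copy of the issue. Purely as a rough illustration of the workload's shape (not the author's actual code), a client mapping cudf.read_json over the generated files through the scheduler started above might look like this; the file naming is carried over from the ThreadPoolExecutor snippet earlier in the thread:

```python
# Rough illustration only: map cudf.read_json over the generated JSON files
# through the dask-scheduler started above. Not the original processing script.
from dask.distributed import Client

import cudf

def read_one(batch):
    df = cudf.read_json(f"json_files/json-{batch}.txt", lines=True, engine="cudf")
    return len(df)

client = Client("localhost:8786")
futures = client.map(read_one, range(1, 25))
print(sum(client.gather(futures)))  # total row count across files
```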
Can someone please help? I've only started seeing this kind of failure within the last week.
I am using a fresh conda environment, with this being the only installation command:

```
conda install -y -c rapidsai-nightly -c nvidia -c conda-forge -c defaults custreamz python=3.7 cudatoolkit=10.2
```

I am using a T4 GPU with CUDA 10.2.
P.S. This seems similar to #5897.