[BUG] thrust::system::system_error what(): for_each: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered #5934

Closed
chinmaychandak opened this issue Aug 11, 2020 · 38 comments · Fixed by #6003
Assignees: vuule
Labels: bug (Something isn't working), cuIO (cuIO issue), dask (Dask issue), libcudf (Affects libcudf (C++/CUDA) code)

Comments

@chinmaychandak
Contributor

chinmaychandak commented Aug 11, 2020

This script just reads randomly created JSON files using Dask with no heavy processing.

Dask worker logs show errors like the ones below, which eventually cause the workers to restart repeatedly and lead to connection issues between the scheduler and the workers.

NOTE: If I do not use Dask, the processing seems to go through without failures.

Worker logs:

terminate called after throwing an instance of 'thrust::system::system_error' what():  for_each: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
terminate called recursively
distributed.nanny - INFO - Worker process 13050 was killed by signal 6

I used the following commands:

  1. Start Scheduler: nohup dask-scheduler --host localhost &> scheduler.out &
  2. Start Workers: CUDA_VISIBLE_DEVICES=0 nohup dask-worker localhost:8786 --nprocs 2 --nthreads 2 --memory-limit="16GB" --resources "process=1" >& worker.out &

Logs can be seen in scheduler.out and worker.out.

Random JSON file producer script:

# Creates 25 JSON files, 2*120MB each 

from random import randrange
import json
import math
import os

# Ensure the output directory exists
os.makedirs("json_files", exist_ok=True)

num_columns = 40

def column_names(size):
    base_cols = ["AppId{}", "LoggedTime{}", "timestamp{}"]
    cols = []
    mult = math.ceil(size/len(base_cols))
    for i in range(mult):
        for c in base_cols:
            cols.append(c.format(i))
            if(len(cols) == size): break
    return cols

def generate_json(num_columns):
    dict_out = {}
    cols = column_names(num_columns)
    for col in cols:
        if col.startswith("AppId"): dict_out[col] = randrange(1,50000)
        elif col.startswith("LoggedTime"): dict_out[col] = randrange(1,50000)
        else: dict_out[col] = randrange(1,50000)
    return json.dumps(dict_out)

for i in range(0, 25):
    with open("json_files/json-%i.txt" % i, "w") as f:
        for _ in range(2 * 150000):
            f.write(generate_json(num_columns) + "\n")

Processing script:

from distributed import Client, LocalCluster
import cudf

client = Client("localhost:8786")
client.get_versions(check=True)

def func_json(batch):
    file = f"json_files/json-{batch}.txt"
    df = cudf.read_json(file, lines=True, engine="cudf")
    return len(df)

batch_arr = [i for i in range(1,25)]
res = client.map(func_json, batch_arr)
print(client.gather(res))
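
For comparison, the non-Dask path mentioned above is just a serial loop over the same files; a minimal sketch (assuming the json_files/ layout produced by the script above):

import cudf

# Read each file serially in a single process/thread; this is the path that goes through without failures
for batch in range(1, 25):
    df = cudf.read_json(f"json_files/json-{batch}.txt", lines=True, engine="cudf")
    print(batch, len(df))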

Can someone please help? I've only been seeing this kind of failure for about a week.

I am using a fresh conda environment created with a single installation command:
conda install -y -c rapidsai-nightly -c nvidia -c conda-forge -c defaults custreamz python=3.7 cudatoolkit=10.2

I am using a T4 GPU with CUDA 10.2.

P.S. This seems similar to #5897.

@chinmaychandak added the Needs Triage and bug labels Aug 11, 2020
@kkraus14 added the cuIO and libcudf labels and removed the Needs Triage label Aug 12, 2020
@vuule self-assigned this Aug 12, 2020
@vuule
Contributor

vuule commented Aug 12, 2020

This might be an OOM caused by the additional processing for JSON input with object rows.

@chinmaychandak
Contributor Author

chinmaychandak commented Aug 12, 2020

I was able to process dataframes much, much larger than this until about a week ago. Also, each JSON file is pretty small (~250MB in size). I am running it on a T4 (16GB GPU Memory), so I think there's enough GPU memory. I am also seeing this issue when processing less than 25 files (around 10-15).

@chinmaychandak
Contributor Author

I also see a lot of Dask-CUDA issues popping up; maybe this stems from one of those bugs? I am by no means an expert, just saying.

@vuule
Contributor

vuule commented Aug 12, 2020

Could be a processing bug, but I wonder why it would only happen with Dask. I agree that there should be enough memory, but it might still be an OOM issue because of unreasonably large overhead during reads.
Can you please share the log with the CUDA issues?

@chinmaychandak
Contributor Author

chinmaychandak commented Aug 12, 2020

So if you run the scripts above as is, you should see the same CUDA errors in the worker logs, given your environment is the same as the one I described above. Are there any other logs I should be looking at? I'd be happy to help, but I thought a minimal reproducer would be best for you to debug with.

P.S. I am using CUDA 10.2, and both Python 3.7 and 3.8 show the same errors.

@vuule
Contributor

vuule commented Aug 12, 2020

My system doesn't have the same device memory capacity, so I would appreciate the logs if they're easy for you to get.

@chinmaychandak
Contributor Author

No problem, I just sent them over. Thanks for digging into this!

@quasiben
Member

If it is an OOM issue, it's possible this is related to an RMM/Dask-CUDA/Dask issue where device 0 is the only device being used even though multiple GPUs are requested.

@chinmaychandak
Contributor Author

I see this problem both with multiple GPUs and with a single GPU. But I do believe it has something to do with the issues you mentioned.

@jakirkham
Member

If it is an OOM issue, it's possible this is related to an RMM/Dask-CUDA/Dask issue where device 0 is the only device being used even though multiple GPUs are requested.

Just to add to this: in other words, this is an issue related to PR ( rapidsai/rmm#466 ). We are discussing this in other contexts as well.

@chinmaychandak
Contributor Author

Thanks for pointing this out, @jakirkham!

@jakirkham
Member

Filed an MRE here ( rapidsai/dask-cuda#364 ).

@jakirkham
Member

@chinmaychandak, could you please try PR ( rapidsai/dask-cuda#363 )? Requires the very latest (like from minutes ago) rmm installed as well.

@jakirkham
Member

We went ahead and merged that dask-cuda PR and nightlies have been produced. Please let us know if you still see issues with them.

@chinmaychandak
Contributor Author

Sorry I missed the earlier message. Great, thanks a lot! Will give it a shot soon.

@chinmaychandak
Contributor Author

@jakirkham I'm still seeing the same issues with the latest nightlies (0.16.0a200812). Can you try and reproduce them locally so that I can make sure I'm not doing anything differently?

@vuule
Contributor

vuule commented Aug 13, 2020

@jakirkham I'm still seeing the same issues with the latest nightlies (0.16.0a200812). Can you try and reproduce them locally so that I can make sure I'm not doing anything differently?

I'm running the repro locally, will update once the script is done.

@chinmaychandak
Contributor Author

Update:

If I start workers using --nprocs 2 --nthreads 1 or --nprocs 1 --nthreads 1, everything gets processed smoothly. I only see the issue when each process has multiple threads, so --nprocs 2 --nthreads 2 fails. This is interesting; I think it should give us some more insight into where this issue stems from.
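
For what it's worth, the same failing layout can also be set up in-process with distributed's LocalCluster; a minimal sketch (it omits the --memory-limit and --resources flags from the CLI command above):

from distributed import Client, LocalCluster
import cudf

def func_json(batch):
    df = cudf.read_json(f"json_files/json-{batch}.txt", lines=True, engine="cudf")
    return len(df)

# 2 worker processes with 2 threads each, mirroring --nprocs 2 --nthreads 2
cluster = LocalCluster(n_workers=2, threads_per_worker=2, processes=True)
client = Client(cluster)
print(client.gather(client.map(func_json, list(range(1, 25)))))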

@quasiben
Member

When using GPUs with Dask, the current working assumption is that there should be 1 worker and 1 thread per GPU. This is mainly for proper CUDA context creation, but it also helps with resource management. We built dask-cuda to make this setup trivial for users.
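
For reference, a minimal sketch of that one-worker, one-thread-per-GPU setup with dask-cuda (LocalCUDACluster starts one worker per visible GPU; the read function is reused from the scripts above):

from dask_cuda import LocalCUDACluster
from distributed import Client
import cudf

def func_json(batch):
    df = cudf.read_json(f"json_files/json-{batch}.txt", lines=True, engine="cudf")
    return len(df)

# One worker process with a single thread per visible GPU
cluster = LocalCUDACluster()
client = Client(cluster)
print(client.gather(client.map(func_json, list(range(1, 25)))))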

@chinmaychandak
Contributor Author

When using GPUs with Dask, the current working assumption is that there should be 1 worker and 1 thread per GPU. This is mainly for proper CUDA context creation, but it also helps with resource management. We built dask-cuda to make this setup trivial for users.

I agree, but we have been using multiple Dask worker processes per GPU for higher throughput in cuStreamz streaming pipelines, and it has been working flawlessly until recently.

@kkraus14
Collaborator

@quasiben in this case they're using multiple processes and CUDA MPS in order to handle workloads that don't nicely saturate the entire GPU on their own.

@chinmaychandak it seems like everything is working as long as you have a single thread per process, yes?

@jakirkham
Member

jakirkham commented Aug 13, 2020

That may be true, but Ben is right. That's not expected to work currently. Not to say we are against changing this (and it is part of the reason for pushing for PTDS 😉)

@chinmaychandak
Contributor Author

chinmaychandak commented Aug 13, 2020

it seems like everything is working as long as you have a single thread per process, yes?

Yes, @kkraus14, that's correct. I even tested the accelerated Kafka bit in one of the more complex cuStreamz pipelines, just to see if that works, and it runs fine as long as there's one thread per process. But for most pipelines we do use multiple threads per process, especially for benchmarking purposes.

Again, we've been doing this for over a year and it's never been a problem, so I'm not sure why I'm seeing this issue now. I think @vuule mentioned that he couldn't reproduce the issue locally, so maybe I'm doing something wrong here.

@kkraus14
Collaborator

cc @harrism, as we're seeing a threading-related issue and there were substantial changes to RMM with regard to threading.

@chinmaychandak
Contributor Author

@kkraus14, an update: I'm now seeing the same issues with multiple processes too, but the failure only happened about 10 minutes after starting a stream with a high input rate. That's probably why I couldn't see it with the minimal reproducer. I will try reading a larger number of JSON files to see whether the multiple-process setup fails there too.

@jakirkham
Member

Maybe retry with newer RMM packages ( rapidsai/rmm#493 )?

Just to reiterate, I wouldn't expect Dask-CUDA to work with multiple threads per worker today ( rapidsai/dask-cuda#109 ).

@chinmaychandak
Contributor Author

chinmaychandak commented Aug 13, 2020

Maybe retry with newer RMM packages ( rapidsai/rmm#493 )?

Will give it a shot when the nightlies are out. Actually, they're out. Let me try.

I wouldn't expect Dask-CUDA to work with multiple threads per worker today

When I do conda list dask-cuda, nothing shows up, so my reproducer only relies on RMM, not dask-cuda, I think. Nevertheless, as I said above, we've been running these multi-process, multi-thread Dask workers per GPU for the last year and they've never been a problem. We do need CUDA MPS, and there are multiple CUDA contexts created, but functionality-wise they've worked fine.

I would really appreciate it if someone can try to run the repro locally to see if they're seeing the same error as me.

@chinmaychandak
Contributor Author

chinmaychandak commented Aug 13, 2020

Maybe retry with newer RMM packages ( rapidsai/rmm#493 )?

Just did; it still doesn't seem to work.

@jrhemstad
Contributor

What does Dask do when you schedule more than one thread per worker? Does it give each thread its own pool? When you have multiple processes per GPU, is it setting pool sizes appropriately?
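
(For context, a sketch of how one might give each worker process an explicit pool via rmm.reinitialize; the pool size here is arbitrary, and note that both threads inside a worker process would still share that one pool:)

from distributed import Client
import rmm

def setup_pool():
    # Give the calling worker process its own 4 GiB RMM pool
    rmm.reinitialize(pool_allocator=True, initial_pool_size=4 * 1024**3)

client = Client("localhost:8786")
client.run(setup_pool)  # runs once in every worker process, not once per thread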

@jakirkham
Member

jakirkham commented Aug 13, 2020

What does Dask do when you schedule more than one thread per worker? Does it give each thread its own pool? When you have multiple processes per GPU, is it setting pool sizes appropriately?

Let's move that discussion over here ( rapidsai/dask-cuda#109 ) (if that's ok 🙂).

Edit: Answered in comment ( rapidsai/dask-cuda#109 (comment) ).

@jakirkham
Member

jakirkham commented Aug 14, 2020

Does this reproduce with a ThreadPoolExecutor? Maybe something like this:

from concurrent.futures import ThreadPoolExecutor
import cudf


def func_json(batch):
    file = f"json_files/json-{batch}.txt"
    df = cudf.read_json(file, lines=True, engine="cudf")
    return len(df)


with ThreadPoolExecutor(max_workers=1) as executor:
    batch_arr = [i for i in range(1, 25)]
    res = executor.map(func_json, batch_arr)
    for e in res:
        print(e)

Edit: May be worth playing with max_workers here.
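
A variant that mirrors the failing --nprocs 2 --nthreads 2 layout without Dask might look like the sketch below (the split of batches between the two processes is arbitrary, and "spawn" is used so each child process gets its own fresh CUDA context):

import multiprocessing
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
import cudf

def func_json(batch):
    df = cudf.read_json(f"json_files/json-{batch}.txt", lines=True, engine="cudf")
    return len(df)

def run_two_threads(batches):
    # Two threads inside one process, like --nthreads 2
    with ThreadPoolExecutor(max_workers=2) as pool:
        return list(pool.map(func_json, batches))

if __name__ == "__main__":
    batches = list(range(1, 25))
    # Two processes, like --nprocs 2
    ctx = multiprocessing.get_context("spawn")
    with ProcessPoolExecutor(max_workers=2, mp_context=ctx) as procs:
        print(list(procs.map(run_two_threads, [batches[::2], batches[1::2]])))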

@chinmaychandak
Contributor Author

chinmaychandak commented Aug 14, 2020

Okay, so I thought of trying CSV files instead of JSON. I used

import cudf
for i in range(0,20):
    file = f"json_files/json-{i}.txt"
    cudf.read_json(file, lines=True, engine="cudf").to_csv("csv_files/csv-"+str(i)+".csv")

to convert the existing JSON files to CSV, and then updated the repro script to call read_csv:

def func_csv(batch):
    file = f"csv_files/csv-{batch}.csv"
    df = cudf.read_csv(file)
    return len(df)

It seems to run fine with 2 processes and 2 threads. So this is specifically happening with the JSON reader?

@vuule
Contributor

vuule commented Aug 14, 2020

Got local repro with multithreaded JSON reads:

TEST_F(JsonReaderTest, Repro)
{
  auto read_all = [&]() {
    cudf_io::read_json_args in_args{cudf_io::source_info{""}};
    in_args.lines = true;
    for (int i = 0; i < 25; ++i) {
      in_args.source =
        cudf_io::source_info{"/home/vukasin/cudf/json-" + std::to_string(i) + ".txt"};
      auto df = cudf_io::read_json(in_args);
    }
  };

  auto th1 = std::async(std::launch::async, read_all);
  auto th2 = std::async(std::launch::async, read_all);
}

Reproduces fairly consistently.

@vuule
Contributor

vuule commented Aug 14, 2020

I suspect a synchronization issue (or issues) that got exposed by GPU saturation from the concurrent reads. Digging into the repro, I found a few places where the synchronization is iffy. I need to look into it some more to find the root cause.

@chinmaychandak
Contributor Author

chinmaychandak commented Aug 14, 2020

Were there any significant changes recently that could have caused this? I wonder why we weren't seeing these issues before.

@vuule
Contributor

vuule commented Aug 14, 2020

I made a significant change to the JSON reader two weeks ago that could affect this.

@pseudotensor

Still got this randomly from tests that normally don't hit it:

h2oaicore.systemutils.DAIFallBackError: Traceback (most recent call last):
  File "h2oaicore/models.py", line 4430, in h2oaicore.models.MainModel.predict_model_wrapper_internal
  File "h2oaicore/models.py", line 9011, in h2oaicore.models.XGBoostModel.predict
  File "h2oaicore/models.py", line 2830, in h2oaicore.models.MainModel.predict_simple_base
  File "h2oaicore/models.py", line 4579, in h2oaicore.models.MainModel.predict_simple
  File "h2oaicore/models.py", line 4701, in h2oaicore.models.MainModel.predict_batch
  File "/opt/h2oai/dai/cuda-11.2/lib/python3.8/site-packages/xgboost/sklearn.py", line 1314, in predict_proba
    class_probs = super().predict(
  File "/opt/h2oai/dai/cuda-11.2/lib/python3.8/site-packages/xgboost/sklearn.py", line 853, in predict
    return self.get_booster().predict(
  File "/opt/h2oai/dai/cuda-11.2/lib/python3.8/site-packages/xgboost/core.py", line 1804, in predict
    _check_call(
  File "/opt/h2oai/dai/cuda-11.2/lib/python3.8/site-packages/xgboost/core.py", line 214, in _check_call
    raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: uninitialized_fill_n: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

@harrism
Member

harrism commented Jun 8, 2021

I don't see any reference to cuDF there. It's possible that XGBoost is hitting this error in Thrust. Would need more information.
