[BUG] thrust::system::system_error what(): for_each: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered #5934

Closed
chinmaychandak opened this issue Aug 11, 2020 · 38 comments · Fixed by #6003
Assignees: vuule
Labels: bug (Something isn't working), cuIO (cuIO issue), dask (Dask issue), libcudf (Affects libcudf (C++/CUDA) code)

Comments

@chinmaychandak
Contributor

chinmaychandak commented Aug 11, 2020

This script just reads randomly created JSON files using Dask with no heavy processing.

Dask worker logs show errors like the ones below, which eventually cause the workers to restart repeatedly and lead to connection issues between the scheduler and the workers.

NOTE: If I do not use Dask, the processing seems to go through without failures.

Worker logs:

terminate called after throwing an instance of 'thrust::system::system_error' what():  for_each: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
terminate called recursively
distributed.nanny - INFO - Worker process 13050 was killed by signal 6

I used the following commands:

  1. Start Scheduler: nohup dask-scheduler --host localhost &> scheduler.out &
  2. Start Workers: CUDA_VISIBLE_DEVICES=0 nohup dask-worker localhost:8786 --nprocs 2 --nthreads 2 --memory-limit="16GB" --resources "process=1" >& worker.out &

Logs can be seen in scheduler.out and worker.out.

Random JSON file producer script:

# Creates 25 JSON files, 2*120MB each 

from random import randrange
import json
import math
import os

# Ensure the output directory exists
os.makedirs("json_files", exist_ok=True)

num_columns = 40

def column_names(size):
    base_cols = ["AppId{}", "LoggedTime{}", "timestamp{}"]
    cols = []
    mult = math.ceil(size/len(base_cols))
    for i in range(mult):
        for c in base_cols:
            cols.append(c.format(i))
            if(len(cols) == size): break
    return cols

def generate_json(num_columns):
    dict_out = {}
    cols = column_names(num_columns)
    for col in cols:
        if col.startswith("AppId"): dict_out[col] = randrange(1,50000)
        elif col.startswith("LoggedTime"): dict_out[col] = randrange(1,50000)
        else: dict_out[col] = randrange(1,50000)
    return json.dumps(dict_out)

for i in range(0, 25):
    with open("json_files/json-%i.txt" % i, "w") as f:
        for _ in range(2 * 150000):
            f.write(generate_json(num_columns) + "\n")

Processing script:

from distributed import Client, LocalCluster
import cudf

client = Client("localhost:8786")
client.get_versions(check=True)

def func_json(batch):
    file = f"json_files/json-{batch}.txt"
    df = cudf.read_json(file, lines=True, engine="cudf")
    return len(df)

batch_arr = [i for i in range(1,25)]
res = client.map(func_json, batch_arr)
print(client.gather(res))
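
For comparison, the non-Dask path mentioned above is just a serial loop over the same files; a minimal sketch (assuming the json_files/ layout produced by the script above):

import cudf

# Read each file serially in a single process/thread; this is the path that goes through without failures
for batch in range(1, 25):
    df = cudf.read_json(f"json_files/json-{batch}.txt", lines=True, engine="cudf")
    print(batch, len(df))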

Can someone please help? I've only been seeing this kind of failure for about a week.

I am using a fresh conda environment created with a single installation command:
conda install -y -c rapidsai-nightly -c nvidia -c conda-forge -c defaults custreamz python=3.7 cudatoolkit=10.2

I am using a T4 GPU with CUDA 10.2.

P.S. This seems similar to #5897.

@chinmaychandak added the Needs Triage and bug labels Aug 11, 2020
@kkraus14 added the cuIO and libcudf labels and removed the Needs Triage label Aug 12, 2020
@vuule self-assigned this Aug 12, 2020
@vuule
Contributor

vuule commented Aug 12, 2020

This might be an OOM caused by the additional processing for JSON input with object rows.

@chinmaychandak
Contributor Author

chinmaychandak commented Aug 12, 2020

I was able to process dataframes much, much larger than this until about a week ago. Also, each JSON file is pretty small (~250MB in size). I am running it on a T4 (16GB GPU Memory), so I think there's enough GPU memory. I am also seeing this issue when processing less than 25 files (around 10-15).

@chinmaychandak
Contributor Author

I also see a lot of Dask-CUDA issues popping up; maybe this stems from one of those bugs? I am by no means an expert, just saying.

@vuule
Contributor

vuule commented Aug 12, 2020

Could be a processing bug, but I wonder why it would only happen with Dask. I agree that there should be enough memory, but it might still be an OOM issue because of unreasonably large overhead during reads.
Can you please share the log with the CUDA issues?

@chinmaychandak
Contributor Author

chinmaychandak commented Aug 12, 2020

So if you run the scripts above as is, you should see the same CUDA errors in the worker logs, given your environment is the same as the one I described above. Are there any other logs I should be looking at? I'd be happy to help, but I thought a minimal reproducer would be best for you to debug with.

P.S. I am using CUDA 10.2, and both Python 3.7 and 3.8 show the same errors.

@vuule
Contributor

vuule commented Aug 12, 2020

My system doesn't have the same device memory capacity, so I would appreciate the logs if they're easy for you to get.

@chinmaychandak
Contributor Author

No problem, I just sent them over. Thanks for digging into this!

@quasiben
Member

If it is an OOM issue, it's possible this is related to an RMM/Dask-CUDA/Dask issue where device 0 is the only device being used even though multiple GPUs are requested.

@chinmaychandak
Contributor Author

I see this problem both with multiple GPUs and with a single GPU. But I do believe it has something to do with the issues you mentioned.

@jakirkham
Member

If it is an OOM issue, it's possible this is related to an RMM/Dask-CUDA/Dask issue where device 0 is the only device being used even though multiple GPUs are requested.

Just to add to this: in other words, this is an issue related to PR ( rapidsai/rmm#466 ). We are discussing this in other contexts as well.

@chinmaychandak
Contributor Author

Thanks for pointing this out, @jakirkham!

@jakirkham
Member

Filed an MRE here ( rapidsai/dask-cuda#364 ).

@jakirkham
Member

@chinmaychandak, could you please try PR ( rapidsai/dask-cuda#363 )? Requires the very latest (like from minutes ago) rmm installed as well.

@jakirkham
Member

We went ahead and merged that dask-cuda PR and nightlies have been produced. Please let us know if you still see issues with them.

@chinmaychandak
Contributor Author

Sorry I missed the earlier message. Great, thanks a lot! Will give it a shot soon.

@chinmaychandak
Contributor Author

@jakirkham I'm still seeing the same issues with the latest nightlies (0.16.0a200812). Can you try and reproduce them locally so that I can make sure I'm not doing anything differently?

@vuule
Contributor

vuule commented Aug 13, 2020

@jakirkham I'm still seeing the same issues with the latest nightlies (0.16.0a200812). Can you try and reproduce them locally so that I can make sure I'm not doing anything differently?

I'm running the repro locally, will update once the script is done.

@chinmaychandak
Contributor Author

Update:

If I start workers using --nprocs 2 --nthreads 1 or --nprocs 1 --nthreads 1, everything gets processed smoothly. I only see the issue when each process has multiple threads, so --nprocs 2 --nthreads 2 fails. This is interesting; I think it should give us some more insight into where this issue stems from.
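
For what it's worth, the same failing layout can also be set up in-process with distributed's LocalCluster; a minimal sketch (it omits the --memory-limit and --resources flags from the CLI command above):

from distributed import Client, LocalCluster
import cudf

def func_json(batch):
    df = cudf.read_json(f"json_files/json-{batch}.txt", lines=True, engine="cudf")
    return len(df)

# 2 worker processes with 2 threads each, mirroring --nprocs 2 --nthreads 2
cluster = LocalCluster(n_workers=2, threads_per_worker=2, processes=True)
client = Client(cluster)
print(client.gather(client.map(func_json, list(range(1, 25)))))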

@quasiben
Member

When using GPUs with Dask, the current working assumption is that there should be 1 worker and 1 thread per GPU. This is mainly for proper CUDA context creation, but it also helps with resource management. We built dask-cuda to make this setup trivial for users.
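
For reference, a minimal sketch of that one-worker, one-thread-per-GPU setup with dask-cuda (LocalCUDACluster starts one worker per visible GPU; the read function is reused from the scripts above):

from dask_cuda import LocalCUDACluster
from distributed import Client
import cudf

def func_json(batch):
    df = cudf.read_json(f"json_files/json-{batch}.txt", lines=True, engine="cudf")
    return len(df)

# One worker process with a single thread per visible GPU
cluster = LocalCUDACluster()
client = Client(cluster)
print(client.gather(client.map(func_json, list(range(1, 25)))))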

@chinmaychandak
Contributor Author

When using GPUs with Dask, the current working assumption is that there should be 1 worker and 1 thread per GPU. This is mainly for proper CUDA context creation, but it also helps with resource management. We built dask-cuda to make this setup trivial for users.

I agree, but we have been using multiple Dask worker processes per GPU for higher throughput in cuStreamz streaming pipelines, and it has been working flawlessly until recently.

@kkraus14
Collaborator

@quasiben in this case they're using multiple processes and CUDA MPS in order to handle workloads that don't nicely saturate the entire GPU on their own.

@chinmaychandak it seems like everything is working as long as you have a single thread per process, yes?

@jakirkham
Member

jakirkham commented Aug 13, 2020

That may be true, but Ben is right. That's not expected to work currently. Not to say we are against changing this (and it is part of the reason for pushing for PTDS 😉)

@chinmaychandak
Contributor Author

chinmaychandak commented Aug 13, 2020

it seems like everything is working as long as you have a single thread per process, yes?

Yes, @kkraus14, that's correct. I even tested the accelerated Kafka bit in one of the more complex cuStreamz pipelines, just to see if that works, and it runs fine as long as there's one thread per process. But for most pipelines we do use multiple threads per process, especially for benchmarking purposes.

Again, we've been doing this for over a year and it's never been a problem, so I'm not sure why I'm seeing this issue now. I think @vuule mentioned that he couldn't reproduce the issue locally, so maybe I'm doing something wrong here.

@kkraus14
Collaborator

cc @harrism, as we're seeing a threading-related issue and there were substantial changes to RMM with regard to threading.

@chinmaychandak
Contributor Author

@kkraus14, an update: I'm now seeing the same issues with multiple processes too, but the failure only happened about 10 minutes after starting a stream with a high input rate. That's probably why I couldn't see it with the minimal reproducer. I will try reading a larger number of JSON files to see whether the multiple-process setup fails there too.

@jakirkham
Member

Maybe retry with newer RMM packages ( rapidsai/rmm#493 )?

Just to reiterate, I wouldn't expect Dask-CUDA to work with multiple threads per worker today ( rapidsai/dask-cuda#109 ).

@chinmaychandak
Contributor Author

chinmaychandak commented Aug 13, 2020

Maybe retry with newer RMM packages ( rapidsai/rmm#493 )?

Will give it a shot when the nightlies are out. Actually, they're out. Let me try.

I wouldn't expect Dask-CUDA to work with multiple threads per worker today

When I do conda list dask-cuda, nothing shows up, so my reproducer only relies on RMM, not dask-cuda, I think. Nevertheless, as I said above, we've been running these multi-process, multi-thread Dask workers per GPU for the last year and they've never been a problem. We do need CUDA MPS, and there are multiple CUDA contexts created, but functionality-wise they've worked fine.

I would really appreciate it if someone can try to run the repro locally to see if they're seeing the same error as me.

@chinmaychandak
Contributor Author

chinmaychandak commented Aug 13, 2020

Maybe retry with newer RMM packages ( rapidsai/rmm#493 )?

Just did; it still doesn't seem to work.

@jrhemstad
Contributor

What does Dask do when you schedule more than one thread per worker? Does it give each thread its own pool? When you have multiple processes per GPU, is it setting pool sizes appropriately?
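
(For context, a sketch of how one might give each worker process an explicit pool via rmm.reinitialize; the pool size here is arbitrary, and note that both threads inside a worker process would still share that one pool:)

from distributed import Client
import rmm

def setup_pool():
    # Give the calling worker process its own 4 GiB RMM pool
    rmm.reinitialize(pool_allocator=True, initial_pool_size=4 * 1024**3)

client = Client("localhost:8786")
client.run(setup_pool)  # runs once in every worker process, not once per thread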

@jakirkham
Member

jakirkham commented Aug 13, 2020

What does Dask do when you schedule more than one thread per worker? Does it give each thread its own pool? When you have multiple processes per GPU, is it setting pool sizes appropriately?

Let's move that discussion over here ( rapidsai/dask-cuda#109 ) (if that's ok 🙂).

Edit: Answered in comment ( rapidsai/dask-cuda#109 (comment) ).

@jakirkham
Member

jakirkham commented Aug 14, 2020

Does this reproduce with a ThreadPoolExecutor? Maybe something like this:

from concurrent.futures import ThreadPoolExecutor
import cudf


def func_json(batch):
    file = f"json_files/json-{batch}.txt"
    df = cudf.read_json(file, lines=True, engine="cudf")
    return len(df)


with ThreadPoolExecutor(max_workers=1) as executor:
    batch_arr = [i for i in range(1, 25)]
    res = executor.map(func_json, batch_arr)
    for e in res:
        print(e)

Edit: May be worth playing with max_workers here.
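
A variant that mirrors the failing --nprocs 2 --nthreads 2 layout without Dask might look like the sketch below (the split of batches between the two processes is arbitrary, and "spawn" is used so each child process gets its own fresh CUDA context):

import multiprocessing
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
import cudf

def func_json(batch):
    df = cudf.read_json(f"json_files/json-{batch}.txt", lines=True, engine="cudf")
    return len(df)

def run_two_threads(batches):
    # Two threads inside one process, like --nthreads 2
    with ThreadPoolExecutor(max_workers=2) as pool:
        return list(pool.map(func_json, batches))

if __name__ == "__main__":
    batches = list(range(1, 25))
    # Two processes, like --nprocs 2
    ctx = multiprocessing.get_context("spawn")
    with ProcessPoolExecutor(max_workers=2, mp_context=ctx) as procs:
        print(list(procs.map(run_two_threads, [batches[::2], batches[1::2]])))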

@chinmaychandak
Contributor Author

chinmaychandak commented Aug 14, 2020

Okay, so I thought of trying CSV files instead of JSON. I used

import cudf
for i in range(0,20):
    file = f"json_files/json-{i}.txt"
    cudf.read_json(file, lines=True, engine="cudf").to_csv("csv_files/csv-"+str(i)+".csv")

to convert the existing JSON files to CSV, and then updated the repro script to call read_csv:

def func_csv(batch):
    file = f"csv_files/csv-{batch}.csv"
    df = cudf.read_csv(file)
    return len(df)

It seems to run fine with 2 processes and 2 threads. So this is specifically happening with the JSON reader?

@vuule
Contributor

vuule commented Aug 14, 2020

Got local repro with multithreaded JSON reads:

TEST_F(JsonReaderTest, Repro)
{
  auto read_all = [&]() {
    cudf_io::read_json_args in_args{cudf_io::source_info{""}};
    in_args.lines = true;
    for (int i = 0; i < 25; ++i) {
      in_args.source =
        cudf_io::source_info{"/home/vukasin/cudf/json-" + std::to_string(i) + ".txt"};
      auto df = cudf_io::read_json(in_args);
    }
  };

  auto th1 = std::async(std::launch::async, read_all);
  auto th2 = std::async(std::launch::async, read_all);
}

Reproduces fairly consistently.

@vuule
Contributor

vuule commented Aug 14, 2020

I suspect a synchronization issue (or issues) that got exposed by GPU saturation from the concurrent reads. Digging into the repro, I found a few places where the synchronization is iffy. I need to look into it some more to find the root cause.

@chinmaychandak
Contributor Author

chinmaychandak commented Aug 14, 2020

Were there any significant changes recently that could have caused this? I wonder why we weren't seeing these issues before.

@vuule
Contributor

vuule commented Aug 14, 2020

I made a significant change to the JSON reader two weeks ago that could affect this.

@pseudotensor

Still got this randomly from tests that normally don't hit it:

h2oaicore.systemutils.DAIFallBackError: Traceback (most recent call last):
  File "h2oaicore/models.py", line 4430, in h2oaicore.models.MainModel.predict_model_wrapper_internal
  File "h2oaicore/models.py", line 9011, in h2oaicore.models.XGBoostModel.predict
  File "h2oaicore/models.py", line 2830, in h2oaicore.models.MainModel.predict_simple_base
  File "h2oaicore/models.py", line 4579, in h2oaicore.models.MainModel.predict_simple
  File "h2oaicore/models.py", line 4701, in h2oaicore.models.MainModel.predict_batch
  File "/opt/h2oai/dai/cuda-11.2/lib/python3.8/site-packages/xgboost/sklearn.py", line 1314, in predict_proba
    class_probs = super().predict(
  File "/opt/h2oai/dai/cuda-11.2/lib/python3.8/site-packages/xgboost/sklearn.py", line 853, in predict
    return self.get_booster().predict(
  File "/opt/h2oai/dai/cuda-11.2/lib/python3.8/site-packages/xgboost/core.py", line 1804, in predict
    _check_call(
  File "/opt/h2oai/dai/cuda-11.2/lib/python3.8/site-packages/xgboost/core.py", line 214, in _check_call
    raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: uninitialized_fill_n: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

@harrism
Member

harrism commented Jun 8, 2021

I don't see any reference to cuDF there. It's possible that XGBoost is hitting this error in Thrust. Would need more information.
