
In DDP training, running ROC.compute() drives the GPUs to 100% usage and hangs the training process #112

Closed
vilon888 opened this issue Mar 20, 2021 · 11 comments
Labels
bug / fix Something isn't working distributed DDP, etc. help wanted Extra attention is needed Priority Critical task/issue
Milestone
0.3

Comments

@vilon888

vilon888 commented Mar 20, 2021

🐛 Bug

To Reproduce

Following the sample code at https://github.com/PyTorchLightning/metrics, we use:

metric = torchmetrics.ROC()
model.roc_metric = metric

In the test epoch we call:

metric.update(output, target)

and after the test epoch we run:

metric.compute()

This hangs the training process and leaves both GPUs at 100% usage.

Btw, using the metric code bundled in pytorch_lightning shows the same issue as the standalone package.
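
A minimal standalone sketch of the setup described above (not the reporter's exact code; the two-process spawn, NCCL backend, and random data stand in for the real model and test set):

```python
import os
import torch
import torch.distributed as dist
import torchmetrics


def run(rank: int, world_size: int):
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # one ROC metric instance per process, moved to that process' GPU
    metric = torchmetrics.ROC().to(rank)

    # "test epoch": every rank accumulates its own predictions and targets
    for _ in range(10):
        preds = torch.rand(32, device=rank)
        target = torch.randint(0, 2, (32,), device=rank)
        metric.update(preds, target)

    # compute() synchronizes the metric state across ranks;
    # this is the call that reportedly hangs with both GPUs at 100%
    fpr, tpr, thresholds = metric.compute()

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    torch.multiprocessing.spawn(run, args=(world_size,), nprocs=world_size)
```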

Environment

  • PyTorch Version: 1.7.0+cu101
  • OS: Linux (Ubuntu 18.04)
  • How you installed PyTorch: pip
  • torchmetrics Version: 0.2.0
  • Python version: 3.6
  • CUDA/cuDNN version: 10.1 / 7.6.5.32-1+cuda10.1
  • GPU models and configuration: two 2080 Ti

@vilon888 vilon888 added bug / fix Something isn't working help wanted Extra attention is needed labels Mar 20, 2021
@github-actions

Hi! Thanks for your contribution, great first issue!

@Borda Borda added this to the 0.3 milestone Mar 25, 2021
@calebclayreagor

calebclayreagor commented Mar 31, 2021

I'm running into the same issue with PrecisionRecallCurve when calling .compute() during or after any step when using DDP.

@Borda Borda added the Priority Critical task/issue label Apr 1, 2021
@maximsch2
Contributor

@calebclayreagor, how big is your data? PrecisionRecallCurve stores the entire dataset in memory, and on compute() it needs to consolidate it on a single rank. If your model is big and your dataset is big, then at some point the model plus all predictions/labels will not fit into GPU memory, which will lead to NCCL issues/GPU OOMs/hangs.

I have a solution for PrecisionRecall-based metrics in #128 by doing binning to make the compute constant-memory (as opposed to the O(dataset size) it is right now). You do trade off a bit of accuracy for it and have to specify the number of thresholds to use.
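
For illustration, a rough sketch of the binning idea (this is not the implementation in #128; the class, threshold grid, and update rule are made up for the example). The state is a few fixed-size count tensors, so memory stays O(num_thresholds) and cross-rank synchronization only has to sum counts:

```python
import torch


class BinnedPRState:
    """Constant-memory precision/recall statistics over fixed thresholds."""

    def __init__(self, num_thresholds: int = 100, device: str = "cpu"):
        self.thresholds = torch.linspace(0, 1, num_thresholds, device=device)
        self.tp = torch.zeros(num_thresholds, device=device)
        self.fp = torch.zeros(num_thresholds, device=device)
        self.fn = torch.zeros(num_thresholds, device=device)

    def update(self, preds: torch.Tensor, target: torch.Tensor) -> None:
        # (num_thresholds, batch) boolean matrix of "predicted positive"
        pred_pos = preds.unsqueeze(0) >= self.thresholds.unsqueeze(1)
        is_pos = target.bool().unsqueeze(0)
        self.tp += (pred_pos & is_pos).sum(dim=1)
        self.fp += (pred_pos & ~is_pos).sum(dim=1)
        self.fn += (~pred_pos & is_pos).sum(dim=1)

    def compute(self):
        precision = self.tp / (self.tp + self.fp).clamp(min=1)
        recall = self.tp / (self.tp + self.fn).clamp(min=1)
        return precision, recall, self.thresholds
```

The accuracy trade-off mentioned above comes from only evaluating the curve at the fixed threshold grid.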

@Borda
Member

Borda commented Apr 10, 2021

@SkafteNicki have you checked this issue? 🐰

@SkafteNicki
Member

@Borda I cannot debug this myself currently, as my local cluster is under maintenance.

@calebclayreagor

@maximsch2 my dataset is quite large (>6M examples) but it still fits in memory. My problem was actually due to ghost processes, and I solved the issue by doing kill -9 <pid> before starting training in DDP mode. Newbie error.

@Borda
Member

Borda commented Apr 19, 2021

@justusschock mind having a look?

@maximsch2
Contributor

@vilon888 , are you still seeing this issue? Can you check one thing - does your dataloader produce batches of the same size for all workers all the time?

@SkafteNicki, I've finally debugged a similar issue we've been having, and it's due to the handling of datasets that don't divide evenly into the full number of batches: the last batch is partial and has different lengths on different workers. That makes the preds/target tensors different shapes, which breaks gather_all_tensors. We probably need to be smarter there: first gather the max shape across all workers, then pad each tensor to that max shape, then truncate back (sketched below).
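
A sketch of that pad-and-truncate approach (not the actual fix that was later merged), assuming 1-D tensors and an already-initialized process group:

```python
import torch
import torch.distributed as dist


def gather_uneven(tensor: torch.Tensor) -> list:
    """All-gather 1-D tensors whose length may differ per rank."""
    world_size = dist.get_world_size()

    # 1) gather each rank's length
    local_len = torch.tensor([tensor.numel()], device=tensor.device)
    lens = [torch.zeros_like(local_len) for _ in range(world_size)]
    dist.all_gather(lens, local_len)
    max_len = int(torch.stack(lens).max())

    # 2) pad the local tensor to the max length
    padded = torch.zeros(max_len, dtype=tensor.dtype, device=tensor.device)
    padded[: tensor.numel()] = tensor

    # 3) all-gather the equally-sized padded tensors
    gathered = [torch.zeros_like(padded) for _ in range(world_size)]
    dist.all_gather(gathered, padded)

    # 4) truncate each result back to its true length
    return [g[: int(l)] for g, l in zip(gathered, lens)]
```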

@SkafteNicki
Member

@maximsch2 I agree that could be a problem.
Are you seeing this when using torchmetrics standalone, or only when it is used together with Lightning?

The reason I am asking is that Lightning by default uses PyTorch's DistributedSampler, which adds additional samples to make sure that all processes get an equal workload:
https://github.com/pytorch/pytorch/blob/87242d2393119990ebe9043e854317f02536bdff/torch/utils/data/distributed.py#L105-L114
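
For reference, a small standalone example of that padding behaviour (the dataset size and replica count are made up for illustration; Lightning attaches the sampler automatically by default):

```python
import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# 10 samples split across 3 ranks: the sampler repeats samples so every
# rank gets ceil(10 / 3) = 4 indices and the per-rank batch counts match.
dataset = TensorDataset(torch.arange(10))
sampler = DistributedSampler(dataset, num_replicas=3, rank=0, shuffle=False)
loader = DataLoader(dataset, batch_size=2, sampler=sampler)

print(list(sampler))  # [0, 3, 6, 9] -- 4 indices even though 10 / 3 is uneven
```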

@maximsch2
Contributor

DistributedSampler doesn't work for IterableDataset, which is what we usually have because we read from databases, so we never really use that sampler. This is the fix, btw: #220

@SkafteNicki
Member

Closing this, as it should have been solved by #220.
Please re-open if the error persists.

@Borda Borda added the distributed DDP, etc. label Aug 8, 2021