In DDP training, running ROC.compute() drives the GPUs to 100% usage and hangs the training process #112
Comments
Hi! Thanks for your contribution, great first issue!
I'm running into the same issue with PrecisionRecallCurve.
@calebclayreagor, how big is your data? PrecisionRecallCurve stores the entire dataset in memory, and on compute() it needs to consolidate it on a single rank. If both your model and your dataset are big, then at some point the model plus all predictions/labels will not fit into GPU memory, which will lead to NCCL issues, GPU OOMs, or hangs. I have a solution for PrecisionRecall-based metrics in #128 that uses binning to make compute() constant-memory (as opposed to O(dataset size) right now). You do trade off a bit of accuracy for it and have to specify the number of thresholds to use.
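For context, a minimal sketch of the binning idea in plain PyTorch (not the actual #128 implementation; the class and argument names here are made up): true/false positive and false negative counts are accumulated at a fixed threshold grid, so the metric state stays O(num_thresholds) instead of O(dataset size).

```python
import torch

class BinnedPRSketch:
    """Constant-memory precision/recall by counting at a fixed threshold grid."""

    def __init__(self, num_thresholds: int = 100, device: str = "cpu"):
        # Fixed grid of thresholds; state size is O(num_thresholds), not O(dataset size).
        self.thresholds = torch.linspace(0, 1, num_thresholds, device=device)
        self.tp = torch.zeros(num_thresholds, device=device)
        self.fp = torch.zeros(num_thresholds, device=device)
        self.fn = torch.zeros(num_thresholds, device=device)

    def update(self, preds: torch.Tensor, target: torch.Tensor) -> None:
        # preds: probabilities in [0, 1], target: {0, 1}; both of shape (N,)
        above = preds.unsqueeze(0) >= self.thresholds.unsqueeze(1)  # (T, N)
        pos = target.bool().unsqueeze(0)                            # (1, N)
        self.tp += (above & pos).sum(dim=1)
        self.fp += (above & ~pos).sum(dim=1)
        self.fn += (~above & pos).sum(dim=1)

    def compute(self):
        # clamp(min=1) avoids division by zero for empty bins in this sketch.
        precision = self.tp / (self.tp + self.fp).clamp(min=1)
        recall = self.tp / (self.tp + self.fn).clamp(min=1)
        return precision, recall
```

Because the state is just three fixed-size count tensors, syncing it across ranks becomes a cheap all-reduce rather than gathering the full set of predictions on one rank.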
@SkafteNicki have you checked this issue? 🐰
@Borda I cannot debug this myself currently, as my local cluster is under maintenance.
@maximsch2 my dataset is quite large (>6M examples) but it still fits in memory. My problem was actually due to ghost processes, and I solved the issue by cleaning those up.
@justusschock mind having a look?
@vilon888, are you still seeing this issue? Can you check one thing: does your dataloader produce batches of the same size for all workers all the time? @SkafteNicki, I've finally debugged a similar issue we've been having, and it's due to the handling of datasets that don't divide evenly into the full number of batches. The last batch ends up partial and of different lengths on different workers, so the distributed gather in compute() hangs.
@maximsch2 I agree that could be a problem. The reason I am asking is that Lightning by default uses PyTorch's DistributedSampler, which adds additional samples to make sure that all processes get an equal workload.
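To illustrate that padding behaviour, a small sketch (the dataset is made up; the sampler is constructed with explicit num_replicas/rank so it runs without an initialized process group):

```python
import torch
from torch.utils.data import TensorDataset, DistributedSampler

# A dataset of 10 samples split across 4 replicas: 10 does not divide evenly by 4.
dataset = TensorDataset(torch.arange(10))

for rank in range(4):
    sampler = DistributedSampler(dataset, num_replicas=4, rank=rank, shuffle=False)
    # Each rank gets ceil(10 / 4) = 3 indices; the shortfall is padded by
    # repeating samples from the start of the dataset.
    print(rank, list(sampler))
# rank 0 -> [0, 4, 8], rank 1 -> [1, 5, 9], rank 2 -> [2, 6, 0], rank 3 -> [3, 7, 1]
```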
DistributedSampler doesn't work for IterableDataset, which is what we usually have since we read from databases, so we never really use that sampler. This is the fix, btw: #220
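The general technique behind that kind of fix (a sketch of the idea, not necessarily the exact code in #220; the helper name is hypothetical): exchange the per-rank lengths first, pad every tensor to the maximum length so all_gather sees equal shapes, then trim the padding away.

```python
import torch
import torch.distributed as dist

def gather_uneven(tensor: torch.Tensor) -> list:
    """Gather 1-D tensors of different lengths from all ranks (sketch).

    Assumes torch.distributed is already initialized.
    """
    world_size = dist.get_world_size()

    # 1) Exchange lengths so every rank knows the maximum size.
    local_len = torch.tensor([tensor.numel()], device=tensor.device)
    lengths = [torch.zeros_like(local_len) for _ in range(world_size)]
    dist.all_gather(lengths, local_len)
    max_len = int(torch.stack(lengths).max())

    # 2) Pad to the maximum length so every rank contributes the same shape.
    padded = torch.zeros(max_len, dtype=tensor.dtype, device=tensor.device)
    padded[: tensor.numel()] = tensor

    # 3) Gather the padded tensors and trim each one back to its true length.
    gathered = [torch.zeros_like(padded) for _ in range(world_size)]
    dist.all_gather(gathered, padded)
    return [g[: int(l)] for g, l in zip(gathered, lengths)]
```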
Closing this, as it should have been solved by #220.
🐛 Bug
To Reproduce
Follow the sample code at https://github.com/PyTorchLightning/metrics. We use:
metric = torchmetrics.ROC()
model.roc_metric = metric
In each test step we call:
metric.update(output, target)
and after the test epoch we run:
metric.compute()
This hangs the training process and leaves both GPUs at 100% usage.
By the way, using the metric code shipped inside pytorch_lightning shows the same issue as the standalone package.
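For completeness, a self-contained sketch of the setup described above (module, layer sizes, and data are made up; it follows the Lightning 1.x / torchmetrics 0.2 style of API reported in the environment below):

```python
import torch
import torchmetrics
import pytorch_lightning as pl

class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(16, 1)              # made-up model
        self.roc_metric = torchmetrics.ROC(pos_label=1)  # metric attached to the module

    def test_step(self, batch, batch_idx):
        x, target = batch
        output = torch.sigmoid(self.layer(x)).squeeze(-1)
        # Accumulate predictions and targets on every rank.
        self.roc_metric.update(output, target)

    def test_epoch_end(self, outputs):
        # Under DDP with two GPUs, this call was observed to push both
        # GPUs to 100% utilization and hang.
        fpr, tpr, thresholds = self.roc_metric.compute()
        self.roc_metric.reset()
```

Running trainer.test(...) on two GPUs under DDP (e.g. Trainer(gpus=2, accelerator="ddp") in Lightning versions of that era) is where the reported hang appears, once ROC.compute() tries to sync its state across ranks.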
Environment
PyTorch version: 1.7.0+cu101
OS: Linux (Ubuntu 18.04)
How you installed PyTorch: pip
torchmetrics version: 0.2.0
Python version: 3.6
CUDA/cuDNN version: 10.1 / 7.6.5.32-1+cuda10.1
GPU models and configuration: 2× RTX 2080 Ti