This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[Bug] To Fix the Hang Problem in Training PSPNet #19056

Open
barry-jin opened this issue Sep 1, 2020 · 4 comments

@barry-jin
Contributor

barry-jin commented Sep 1, 2020

Error when training PSPNet on Cityscapes dataset using GluonCV #17439

Problem Description

When I train a PSPNet on the Cityscapes dataset with the GluonCV semantic segmentation library, training hangs right after it starts.

Debugging

After bisecting by date to find where the failure first appears, I found that the first bad commit is PR 13896, which introduced this problem.
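For readers unfamiliar with bisecting by date, the idea can be sketched as a binary search over an ordered list of builds. This is only an illustration, not the procedure actually used in this issue: the function name `first_bad_build`, the list of nightly dates, and the regression date are all hypothetical, and in practice `is_bad` would install the build for that date and run the training script with a timeout.

```python
from datetime import date, timedelta
from typing import Callable, Sequence


def first_bad_build(builds: Sequence[date], is_bad: Callable[[date], bool]) -> date:
    """Return the earliest build date for which is_bad() is True.

    Assumes builds is sorted ascending, builds[0] is good, builds[-1] is bad,
    and that every build after the first bad one is also bad.
    """
    lo, hi = 0, len(builds) - 1  # invariant: builds[lo] is good, builds[hi] is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(builds[mid]):
            hi = mid  # the first bad build is at mid or earlier
        else:
            lo = mid  # the first bad build is after mid
    return builds[hi]


# Hypothetical data: 30 consecutive nightly builds and a made-up regression date.
nightlies = [date(2020, 1, 1) + timedelta(days=i) for i in range(30)]
regression_day = date(2020, 1, 18)  # illustrative only, not taken from the issue
culprit = first_bad_build(nightlies, lambda d: d >= regression_day)
```

With N builds this needs about log2(N) test runs, which matters when each run means installing a build and waiting to see whether training hangs.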

Proposed solutions

None yet. This needs more investigation.

References

@barry-jin barry-jin added the RFC Post requesting for comments label Sep 1, 2020
@barry-jin barry-jin changed the title from "[RFC] Turn Off CuDNN When Training PSPNet" to "[RFC] Turn Off CuDNN in Dropout When Training PSPNet" Sep 1, 2020
@leezu
Contributor

leezu commented Sep 1, 2020

Why not fix the hang instead of disabling the feature?

@sxjscience
Member

This does not sound like a solution. Problems related to CuDNN Dropout have a very long history, and we should try to:

  • Fix CuDNN Dropout
  • Consider dropping CuDNN Dropout if we can accelerate our native dropout

In fact, we haven't used CUDA calls like curand4(curandStatePhilox4_32_10_t *state) when implementing the random operators.

@sxjscience
Member

In addition, my guess is that the root cause is related to multiprocessing + CuDNN Dropout. Thus, we will need a minimal reproducible code snippet first.

@barry-jin barry-jin changed the title from "[RFC] Turn Off CuDNN in Dropout When Training PSPNet" to "[RFC] To Fix the Hang Problem in Training PSPNet" Sep 1, 2020
@zhreshold
Member

+1 to @sxjscience. The segmentation model training adopts the DataParallel pipeline (https://github.com/dmlc/gluon-cv/blob/master/gluoncv/utils/parallel.py#L138), but it uses multithreading rather than multiprocessing.
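As a starting point for the minimal reproducer requested above, a thread-based harness could look roughly like the sketch below. It assumes nothing about MXNet: `run_with_hang_check`, `quick_step`, and `stuck_step` are hypothetical names, and the step functions are placeholders. In a real reproducer, `step` would presumably run a forward pass through a Dropout layer on one GPU per thread, which is not shown here because it depends on the MXNet build under test.

```python
import threading
import time
from typing import Callable, List


def run_with_hang_check(step: Callable[[int], None], num_workers: int,
                        timeout_s: float) -> List[int]:
    """Run step(worker_id) in one thread per worker.

    Returns the ids of workers that had not finished after timeout_s seconds.
    """
    threads = [
        threading.Thread(target=step, args=(i,), daemon=True)
        for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    deadline = time.monotonic() + timeout_s
    stuck = []
    for i, t in enumerate(threads):
        # Wait only for the time remaining until the shared deadline.
        t.join(max(0.0, deadline - time.monotonic()))
        if t.is_alive():
            stuck.append(i)
    return stuck


# Placeholder workloads (hypothetical; replace with the real per-device step).
def quick_step(worker_id: int) -> None:
    time.sleep(0.01)


_never = threading.Event()


def stuck_step(worker_id: int) -> None:
    if worker_id == 1:
        _never.wait()  # simulates a worker that hangs


healthy = run_with_hang_check(quick_step, num_workers=2, timeout_s=2.0)
hung = run_with_hang_check(stuck_step, num_workers=2, timeout_s=0.2)
_never.set()  # release the simulated hang so the thread can exit
```

Reporting which worker ids are stuck, rather than just blocking forever, makes it easier to tell whether the hang is tied to a particular thread or device.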

@barry-jin barry-jin changed the title from "[RFC] To Fix the Hang Problem in Training PSPNet" to "[Bug] To Fix the Hang Problem in Training PSPNet" Sep 1, 2020
@sxjscience sxjscience added Bug and removed RFC Post requesting for comments labels Sep 1, 2020