
Single-node multi-GPU training on V100s: all processes land on the same card #2010

Closed

dahu1 opened this issue Sep 13, 2023 · 7 comments

Comments

@dahu1

dahu1 commented Sep 13, 2023

[screenshots]

When training with multiple GPUs on V100s, I found that all processes get assigned to the same GPU. How should I handle this? I never ran into this when training on 2080s or 3090s.

Machine info:
Docker image: nvidia/cuda:11.7.1-devel-ubuntu20.04
Python 3.8
torch==1.13.0+cu117
torchaudio==0.13.0

@robin1001
Collaborator

@yuekaizhang have you run into this before?

@dahu1
Author

dahu1 commented Sep 14, 2023

[screenshots]

To add some context, git commit: 9804821

@MrSupW
Collaborator

MrSupW commented Sep 20, 2023

@dahu1 Hi, did you find a fix for this? I hit the same problem today.

@ziyu123

ziyu123 commented Sep 20, 2023

Add torch.cuda.set_device(args.rank) before model.cuda(); that solved it for me.
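The fix works because a plain .cuda() call moves tensors to the *current* device, which defaults to GPU 0 in every process, so all ranks pile onto one card. A minimal sketch of the ordering, assuming a per-process rank variable like args.rank (the exact argument name is illustrative, not WeNet's actual code):

```python
import torch
import torch.nn as nn

def place_model(model: nn.Module, rank: int) -> nn.Module:
    """Pin this process to GPU `rank` before moving the model there.

    Without set_device, .cuda() targets the current device, which
    defaults to GPU 0 in every process.
    """
    if torch.cuda.is_available():
        torch.cuda.set_device(rank)  # the fix: pin the device first
        model = model.cuda()         # now lands on GPU `rank`, not GPU 0
    return model

model = place_model(nn.Linear(4, 4), rank=0)
```

An equivalent alternative is to be explicit everywhere, e.g. `model.to(f"cuda:{rank}")`, which does not depend on the current-device state at all.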

@MrSupW
Collaborator

MrSupW commented Sep 20, 2023

@ziyu123 Thanks! It works on my end now too.

@dahu1
Author

dahu1 commented Sep 20, 2023

> Add torch.cuda.set_device(args.rank) before model.cuda(); that solved it for me.

Thanks!

dahu1 closed this as completed Sep 20, 2023
@robin1001
Collaborator

Alternatively, pull the latest code and run parallel training with torchrun; see https://github.com/wenet-e2e/wenet/pull/2020.
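With torchrun (e.g. `torchrun --standalone --nnodes=1 --nproc_per_node=4 train.py` for four GPUs on one node), each spawned worker receives its GPU index through the LOCAL_RANK environment variable. A minimal sketch of picking it up, not WeNet's actual training code:

```python
import os
import torch

# torchrun exports LOCAL_RANK for each spawned worker; fall back to 0
# when running as a single plain process.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
if torch.cuda.is_available():
    torch.cuda.set_device(local_rank)  # pin before any .cuda() / .to("cuda")
```

This makes the earlier fix automatic: each worker pins itself to its own GPU before any model placement happens.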
