
[Feature] Add collect_results support for Ascend NPU #1309

Merged
merged 3 commits into from
Aug 23, 2023

Conversation

xuuyangg
Contributor

@xuuyangg xuuyangg commented Aug 16, 2023

Thanks for your contribution; we appreciate it a lot. Following these instructions will make your pull request healthier and easier to get feedback on. If you do not understand some items, don't worry: just open the pull request and ask the maintainers for help.

Motivation

  1. The 'set_npu_compile' function can cause multiprocessing launches to fail on Ascend NPU. For details, see [Bug] Multiprocessing fails to launch on Ascend NPU device when import mmengine.dist #1307 and https://www.hiascend.com/document/detail/zh/canncommercial/63RC2/modeldevpt/ptmigr/ptmigr_0058.html
  2. Added collect_results support for Ascend NPU.

Modification

Add collect_results_npu.

Use case

"""test.py"""

import os
import torch
import torch_npu
import mmengine.dist as dist
import torch.distributed as torch_dist

torch.npu.set_device(int(os.environ['LOCAL_RANK']))
torch_dist.init_process_group(backend="hccl")


if dist.get_rank() == 0:
    data = [torch.tensor(1, dtype=torch.float32).npu(), torch.tensor(2,dtype=torch.float32).npu()]
else:
    data = [torch.tensor(3, dtype=torch.float32).npu(), torch.tensor(4,dtype=torch.float32).npu()]
size = 4
output = dist.collect_results(data, size, device='npu')

if dist.get_rank() == 0:
    print(output)

Command

torchrun --nproc_per_node 2 test.py
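To illustrate what the use case above computes, here is a minimal CPU-only sketch of how collect_results merges the gathered per-rank result lists back into dataset order. The helper name collect_parts is hypothetical; it only models the generic interleave-and-truncate step, not the actual device communication in mmengine.

```python
def collect_parts(part_lists, size):
    """Merge per-rank result chunks back into dataset order.

    With a distributed sampler, rank r processes samples r, r + world_size,
    r + 2 * world_size, ..., so the merged list interleaves one sample from
    each rank per round and is then truncated to the real dataset size
    (dropping any padding samples).
    """
    ordered = []
    for group in zip(*part_lists):  # one sample from each rank per round
        ordered.extend(group)
    return ordered[:size]


# Mirrors the two-rank use case above, with plain ints instead of tensors:
rank0 = [1, 2]
rank1 = [3, 4]
print(collect_parts([rank0, rank1], size=4))  # [1, 3, 2, 4]
```

Under this model, rank 0's [1, 2] and rank 1's [3, 4] interleave to [1, 3, 2, 4], which is the ordering the real collect_results restores on rank 0.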

Checklist

  1. Pre-commit or other linting tools are used to fix the potential lint issues.
  2. The modification is covered by complete unit tests. If not, please add more unit test to ensure the correctness.
  3. If the modification has potential influence on downstream projects, this PR should be tested with downstream projects, like MMDet or MMCls.
  4. The documentation has been modified accordingly, like docstring or example tutorials.

@CLAassistant

CLAassistant commented Aug 16, 2023

CLA assistant check
All committers have signed the CLA.

@xuuyangg xuuyangg force-pushed the main branch 2 times, most recently from 258b9ca to 4149910 Compare August 17, 2023 06:37
@xuuyangg xuuyangg requested a review from C1rN09 as a code owner August 17, 2023 06:37
@xuuyangg xuuyangg force-pushed the main branch 2 times, most recently from 486a1d2 to 293de5a Compare August 17, 2023 06:43
@xuuyangg xuuyangg changed the title [Feature]Add a NPU Hook [Fix]fix multiprocessing fails to launch on Ascend NPU Aug 17, 2023
@xuuyangg xuuyangg force-pushed the main branch 3 times, most recently from 2e69c78 to 1e972cf Compare August 17, 2023 08:58
@xuuyangg xuuyangg changed the title [Fix]fix multiprocessing fails to launch on Ascend NPU [Feature]Add collect_results for Ascend NPU Aug 17, 2023
@xuuyangg xuuyangg force-pushed the main branch 2 times, most recently from 7a7842f to f0469cd Compare August 17, 2023 09:10
@zhouzaida zhouzaida changed the title [Feature]Add collect_results for Ascend NPU [Feature] Add collect_results support for Ascend NPU Aug 22, 2023
@zhouzaida zhouzaida added this to the 0.8.5 milestone Aug 22, 2023
@zhouzaida zhouzaida merged commit e1c6079 into open-mmlab:main Aug 23, 2023
Development

Successfully merging this pull request may close these issues.

[Bug] Multiprocessing fails to launch on Ascend NPU device when import mmengine.dist