Initializing the default process group twice when integrating with DeepSpeed #536

Closed
wookjeHan opened this issue Jul 20, 2022 · 5 comments

@wookjeHan

System Info

- `Accelerate` version: 0.12.0.dev0
- Platform: Linux-4.15.0-151-generic-x86_64-with-glibc2.27
- Python version: 3.9.12
- Numpy version: 1.22.3
- PyTorch version (GPU?): 1.12.0 (True)
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: DEEPSPEED
        - mixed_precision: no
        - use_cpu: False
        - num_processes: 4
        - machine_rank: 0
        - num_machines: 1
        - main_process_ip: None
        - main_process_port: None
        - main_training_function: main
        - deepspeed_config: {'gradient_accumulation_steps': 4, 'offload_optimizer_device': 'cpu', 'offload_param_device': 'none', 'zero3_init_flag': True, 'zero3_save_16bit_model': False, 'zero_stage': 3}
        - fsdp_config: {}

If I run the following code:

```python
accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)
```

I get the error:

RuntimeError: trying to initialize the default process group twice!

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)
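
For context, here is a minimal self-contained sketch around the two lines above; the model, optimizer, and dummy dataloaders are illustrative stand-ins rather than the reporter's actual code, and the script is meant to be launched with `accelerate launch` under the DeepSpeed config from "System Info":

```python
# Illustrative sketch, not the reporter's actual script.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator
from transformers import AutoModelForSequenceClassification

accelerator = Accelerator()

# With zero3_init_flag: True, from_pretrained enters deepspeed.zero.Init,
# which is where the second (failing) process-group initialization happens
# on the affected DeepSpeed version (see the traceback later in this thread).
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", return_dict=True
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Dummy data just so there is something to pass to prepare().
dataset = TensorDataset(
    torch.randint(0, 1000, (64, 128)), torch.randint(0, 2, (64,))
)
train_dataloader = DataLoader(dataset, batch_size=16)
eval_dataloader = DataLoader(dataset, batch_size=16)

model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)
```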

Expected behavior

I expect no error when integrating with DeepSpeed.
wookjeHan added the bug (Something isn't working) label on Jul 20, 2022
@pacman100
Contributor

pacman100 commented Jul 20, 2022

Hello, I am running the official example complete_nlp_example.py with the config below, and everything is working as expected:

- `Accelerate` version: 0.12.0.dev0
- Platform: Linux-5.4.0-121-generic-x86_64-with-glibc2.29
- Python version: 3.8.10
- Numpy version: 1.23.0
- PyTorch version (GPU?): 1.12.0+cu102 (True)
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: DEEPSPEED
        - mixed_precision: no
        - use_cpu: False
        - num_processes: 2
        - machine_rank: 0
        - num_machines: 1
        - main_process_ip: None
        - main_process_port: None
        - main_training_function: main
        - deepspeed_config: {'gradient_accumulation_steps': 4, 'offload_optimizer_device': 'cpu', 'offload_param_device': 'none', 'zero3_init_flag': True, 'zero3_save_16bit_model': True, 'zero_stage': 3}
        - fsdp_config: {}

Output:

[2022-07-20 07:21:22,990] [INFO] [config.py:1063:print]   zero_enabled ................. True                                  
[2022-07-20 07:21:22,990] [INFO] [config.py:1063:print]   zero_optimization_stage ...... 3                                     
[2022-07-20 07:21:22,990] [INFO] [config.py:1065:print]   json = {                                                             
    "train_batch_size": 128,                                                                                                   
    "train_micro_batch_size_per_gpu": 16,                                                                                      
    "gradient_accumulation_steps": 4,                                                                                          
    "zero_optimization": {                                                                                                     
        "stage": 3,                                                                                                            
        "offload_optimizer": {                                                                                                 
            "device": "cpu"                                                                                                    
        },                                                                                                                     
        "offload_param": {                                                                                                     
            "device": "none"                                                                                                   
        },                                                                                                                     
        "stage3_gather_16bit_weights_on_model_save": true                                                                      
    },                                                                                                                         
    "steps_per_print": inf,                                                                                                    
    "zero_allow_untested_optimizer": true                                                                                      
}
Using /home/sourab/.cache/torch_extensions/py38_cu102 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.00020813941955566406 seconds
epoch 0: {'accuracy': 0.6985294117647058, 'f1': 0.8166915052160953}
epoch 1: {'accuracy': 0.7230392156862745, 'f1': 0.8264208909370201}
epoch 2: {'accuracy': 0.7573529411764706, 'f1': 0.840064620355412}
[2022-07-20 07:23:36,969] [INFO] [launch.py:210:main] Process 3679742 exits successfully.
[2022-07-20 07:23:36,969] [INFO] [launch.py:210:main] Process 3679741 exits successfully.

Please provide a minimal script so that we can replicate the behaviour.

@wookjeHan
Author

I'm wondering whether it also works when the DeepSpeed version is 0.6.7.

@pacman100
Contributor

Yup! I am getting the error with DeepSpeed version 0.6.7. For the time being, please use DeepSpeed version 0.6.5.

File "complete_nlp_example.py", line 128, in training_function
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", return_dict=True)
  File "/home/sourab/transformers/src/transformers/models/auto/auto_factory.py", line 446, in from_pretrained
    return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)
  File "/home/sourab/transformers/src/transformers/modeling_utils.py", line 2065, in from_pretrained
    init_contexts = [deepspeed.zero.Init(config_dict_or_path=deepspeed_config())] + init_contexts
  File "/home/sourab/dev/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 655, in __init__
    init_distributed()
  File "/home/sourab/dev/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 427, in init_distributed
    cdb = TorchBackend(dist_backend, timeout, init_method)
  File "/home/sourab/dev/lib/python3.8/site-packages/deepspeed/comm/torch.py", line 35, in __init__
    self.init_process_group(backend, timeout, init_method)
  File "/home/sourab/dev/lib/python3.8/site-packages/deepspeed/comm/torch.py", line 38, in init_process_group
    return torch.distributed.init_process_group(backend,
  File "/home/sourab/dev/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 563, in init_process_group
    raise RuntimeError("trying to initialize the default process group " "twice!")
RuntimeError: trying to initialize the default process group twice!
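
To make the temporary pin explicit, here is a small optional check one could add at the top of a training script; it is purely illustrative (the thread only reports 0.6.7 failing and 0.6.5 working) and assumes the `packaging` library is available:

```python
# Illustrative guard: fail fast if a DeepSpeed version newer than the known-good
# 0.6.5 is installed, until the double-initialization issue is fixed upstream.
# Pin with: pip install deepspeed==0.6.5
import deepspeed
from packaging import version

if version.parse(deepspeed.__version__) > version.parse("0.6.5"):
    raise RuntimeError(
        f"Found deepspeed {deepspeed.__version__}; please pin to 0.6.5 until the "
        "'default process group twice' issue is resolved."
    )
```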

@wookjeHan
Author

Thanks! It works for me!

@pacman100
Contributor

The merged PR referenced above should solve this issue, and folks can now use the latest DeepSpeed version without any problem.
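
For reference, double-initialization errors of this kind are typically avoided by guarding the call to `init_process_group` so it only runs when the default group does not already exist. A minimal sketch of that general pattern (not the actual code of the merged PR):

```python
# General guard pattern; an illustrative sketch, not the code of the merged PR.
import torch.distributed as dist

def ensure_process_group(backend: str = "nccl") -> None:
    """Create the default process group only if it does not already exist."""
    if dist.is_available() and not dist.is_initialized():
        # Relies on the usual env:// variables (RANK, WORLD_SIZE, MASTER_ADDR, ...)
        # set by the launcher, e.g. `accelerate launch` or `torchrun`.
        dist.init_process_group(backend=backend)

# Whichever side initializes first (Accelerate or DeepSpeed), a later call
# becomes a no-op instead of raising
# "trying to initialize the default process group twice!".
```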
