Initializing the default process group twice when integrating with DeepSpeed #536

Closed
wookjeHan opened this issue Jul 20, 2022 · 5 comments

@wookjeHan

System Info

- `Accelerate` version: 0.12.0.dev0
- Platform: Linux-4.15.0-151-generic-x86_64-with-glibc2.27
- Python version: 3.9.12
- Numpy version: 1.22.3
- PyTorch version (GPU?): 1.12.0 (True)
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: DEEPSPEED
        - mixed_precision: no
        - use_cpu: False
        - num_processes: 4
        - machine_rank: 0
        - num_machines: 1
        - main_process_ip: None
        - main_process_port: None
        - main_training_function: main
        - deepspeed_config: {'gradient_accumulation_steps': 4, 'offload_optimizer_device': 'cpu', 'offload_param_device': 'none', 'zero3_init_flag': True, 'zero3_save_16bit_model': False, 'zero_stage': 3}
        - fsdp_config: {}

If I run the following code:

```python
accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)
```

I get the error:

RuntimeError: trying to initialize the default process group twice!

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)
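
For context, here is a minimal self-contained sketch around the two lines above; the model, optimizer, and dummy dataloaders are illustrative stand-ins rather than the reporter's actual code, and the script is meant to be launched with `accelerate launch` under the DeepSpeed config from "System Info":

```python
# Illustrative sketch, not the reporter's actual script.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator
from transformers import AutoModelForSequenceClassification

accelerator = Accelerator()

# With zero3_init_flag: True, from_pretrained enters deepspeed.zero.Init,
# which is where the second (failing) process-group initialization happens
# on the affected DeepSpeed version (see the traceback later in this thread).
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", return_dict=True
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Dummy data just so there is something to pass to prepare().
dataset = TensorDataset(
    torch.randint(0, 1000, (64, 128)), torch.randint(0, 2, (64,))
)
train_dataloader = DataLoader(dataset, batch_size=16)
eval_dataloader = DataLoader(dataset, batch_size=16)

model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)
```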

Expected behavior

I expect no error when integrating with DeepSpeed.
wookjeHan added the bug (Something isn't working) label on Jul 20, 2022
@pacman100
Contributor

pacman100 commented Jul 20, 2022

Hello, I am running the official example complete_nlp_example.py with the config below, and everything is working as expected:

- `Accelerate` version: 0.12.0.dev0
- Platform: Linux-5.4.0-121-generic-x86_64-with-glibc2.29
- Python version: 3.8.10
- Numpy version: 1.23.0
- PyTorch version (GPU?): 1.12.0+cu102 (True)
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: DEEPSPEED
        - mixed_precision: no
        - use_cpu: False
        - num_processes: 2
        - machine_rank: 0
        - num_machines: 1
        - main_process_ip: None
        - main_process_port: None
        - main_training_function: main
        - deepspeed_config: {'gradient_accumulation_steps': 4, 'offload_optimizer_device': 'cpu', 'offload_param_device': 'none', 'zero3_init_flag': True, 'zero3_save_16bit_model': True, 'zero_stage': 3}
        - fsdp_config: {}

Output:

[2022-07-20 07:21:22,990] [INFO] [config.py:1063:print]   zero_enabled ................. True                                  
[2022-07-20 07:21:22,990] [INFO] [config.py:1063:print]   zero_optimization_stage ...... 3                                     
[2022-07-20 07:21:22,990] [INFO] [config.py:1065:print]   json = {                                                             
    "train_batch_size": 128,                                                                                                   
    "train_micro_batch_size_per_gpu": 16,                                                                                      
    "gradient_accumulation_steps": 4,                                                                                          
    "zero_optimization": {                                                                                                     
        "stage": 3,                                                                                                            
        "offload_optimizer": {                                                                                                 
            "device": "cpu"                                                                                                    
        },                                                                                                                     
        "offload_param": {                                                                                                     
            "device": "none"                                                                                                   
        },                                                                                                                     
        "stage3_gather_16bit_weights_on_model_save": true                                                                      
    },                                                                                                                         
    "steps_per_print": inf,                                                                                                    
    "zero_allow_untested_optimizer": true                                                                                      
}
Using /home/sourab/.cache/torch_extensions/py38_cu102 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.00020813941955566406 seconds
epoch 0: {'accuracy': 0.6985294117647058, 'f1': 0.8166915052160953}
epoch 1: {'accuracy': 0.7230392156862745, 'f1': 0.8264208909370201}
epoch 2: {'accuracy': 0.7573529411764706, 'f1': 0.840064620355412}
[2022-07-20 07:23:36,969] [INFO] [launch.py:210:main] Process 3679742 exits successfully.
[2022-07-20 07:23:36,969] [INFO] [launch.py:210:main] Process 3679741 exits successfully.

Please provide a minimal script so that we can replicate the behaviour.

@wookjeHan
Author

I'm wondering whether it also works when the DeepSpeed version is 0.6.7.

@pacman100
Contributor

Yup! I am getting the error with DeepSpeed version 0.6.7. For the time being, please use DeepSpeed version 0.6.5.

File "complete_nlp_example.py", line 128, in training_function
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", return_dict=True)
  File "/home/sourab/transformers/src/transformers/models/auto/auto_factory.py", line 446, in from_pretrained
    return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)
  File "/home/sourab/transformers/src/transformers/modeling_utils.py", line 2065, in from_pretrained
    init_contexts = [deepspeed.zero.Init(config_dict_or_path=deepspeed_config())] + init_contexts
  File "/home/sourab/dev/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 655, in __init__
    init_distributed()
  File "/home/sourab/dev/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 427, in init_distributed
    cdb = TorchBackend(dist_backend, timeout, init_method)
  File "/home/sourab/dev/lib/python3.8/site-packages/deepspeed/comm/torch.py", line 35, in __init__
    self.init_process_group(backend, timeout, init_method)
  File "/home/sourab/dev/lib/python3.8/site-packages/deepspeed/comm/torch.py", line 38, in init_process_group
    return torch.distributed.init_process_group(backend,
  File "/home/sourab/dev/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 563, in init_process_group
    raise RuntimeError("trying to initialize the default process group " "twice!")
RuntimeError: trying to initialize the default process group twice!
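
To make the temporary pin explicit, here is a small optional check one could add at the top of a training script; it is purely illustrative (the thread only reports 0.6.7 failing and 0.6.5 working) and assumes the `packaging` library is available:

```python
# Illustrative guard: fail fast if a DeepSpeed version newer than the known-good
# 0.6.5 is installed, until the double-initialization issue is fixed upstream.
# Pin with: pip install deepspeed==0.6.5
import deepspeed
from packaging import version

if version.parse(deepspeed.__version__) > version.parse("0.6.5"):
    raise RuntimeError(
        f"Found deepspeed {deepspeed.__version__}; please pin to 0.6.5 until the "
        "'default process group twice' issue is resolved."
    )
```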

@wookjeHan
Author

Thanks! It works for me!

@pacman100
Contributor

The merged PR referenced above should solve this issue, and folks can now use the latest DeepSpeed version without any problem.
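
For reference, double-initialization errors of this kind are typically avoided by guarding the call to `init_process_group` so it only runs when the default group does not already exist. A minimal sketch of that general pattern (not the actual code of the merged PR):

```python
# General guard pattern; an illustrative sketch, not the code of the merged PR.
import torch.distributed as dist

def ensure_process_group(backend: str = "nccl") -> None:
    """Create the default process group only if it does not already exist."""
    if dist.is_available() and not dist.is_initialized():
        # Relies on the usual env:// variables (RANK, WORLD_SIZE, MASTER_ADDR, ...)
        # set by the launcher, e.g. `accelerate launch` or `torchrun`.
        dist.init_process_group(backend=backend)

# Whichever side initializes first (Accelerate or DeepSpeed), a later call
# becomes a no-op instead of raising
# "trying to initialize the default process group twice!".
```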
