Add DeepSpeed Baseline #1
base: main
Conversation
# load DeepSpeed configuration file
self.ds_config = self.cfg.get('ds_config', None)
assert self.ds_config is not None, 'ds_config should be specified.'
Load the DeepSpeed configuration (`ds_config`).
self.ds_config = self.cfg.get('ds_config', None)
assert self.ds_config is not None, 'ds_config should be specified.'

self.check_ds_config(self.ds_config)
Some `ds_config` options should be ignored since MMEngine already supports them.
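For illustration, a minimal sketch of what `check_ds_config` could do, assuming a hand-picked list of keys (`optimizer`, `scheduler`, `gradient_accumulation_steps`) that overlap with features MMEngine already provides; the exact key list is an assumption, not taken from this PR:

import warnings

# Keys assumed to conflict with features MMEngine already provides;
# the list below is illustrative only.
_IGNORED_DS_KEYS = ('optimizer', 'scheduler', 'gradient_accumulation_steps')


def check_ds_config(ds_config: dict) -> dict:
    """Warn about and drop ds_config options that MMEngine already handles."""
    cleaned = dict(ds_config)
    for key in _IGNORED_DS_KEYS:
        if key in cleaned:
            warnings.warn(f'`{key}` in ds_config is ignored because '
                          'MMEngine already supports it.')
            cleaned.pop(key)
    return cleaned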
# initialize the model weights before wrapping it with deepspeed
self._weights_initialized = False
self._init_model_weights()
Model weights should be initialized before wrapping the model with `DeepSpeedEngine`.
Is there any documentation on the reason?
I couldn't find any documentation yet, but we couldn't reproduce the performance before changing it (the results were similar to training from scratch).
We could reproduce the performance after fixing it.
Maybe there is another way to handle this.
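For reference, a sketch of the ordering being described here, assuming an MMEngine-style model exposing `init_weights()`; apart from the names taken from the diff above (`_weights_initialized`, `deepspeed.initialize`), the details are illustrative:

def _init_model_weights(self):
    """Initialize weights exactly once, before DeepSpeed wraps the model."""
    if not self._weights_initialized and hasattr(self.model, 'init_weights'):
        self.model.init_weights()
        self._weights_initialized = True

# Intended call order: weights are initialized first, and only then is the
# model handed to deepspeed.initialize(), which partitions/wraps parameters.
# self._init_model_weights()
# self.model, optimizer, _, _ = deepspeed.initialize(...)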
if model_wrapper_cfg is None:
    # Model will be wrapped in `deepspeed.initialize`.
    pass
Wrapping the model with `DeepSpeedEngine` is done inside `deepspeed.initialize`. We may be able to factor this logic out by exposing `DeepSpeedEngine` as a `model_wrapper`.
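A rough sketch of that idea, assuming a hypothetical wrapper registered in MMEngine's `MODEL_WRAPPERS` registry that simply delegates to `deepspeed.initialize`; the class name and constructor arguments are not part of this PR:

import deepspeed
import torch.nn as nn
from mmengine.registry import MODEL_WRAPPERS


@MODEL_WRAPPERS.register_module()
class DeepSpeedEngineWrapper(nn.Module):
    """Hypothetical model wrapper that moves `deepspeed.initialize` out of
    the runner."""

    def __init__(self, module: nn.Module, optimizer, ds_config: dict):
        super().__init__()
        self.engine, self.optimizer, _, _ = deepspeed.initialize(
            model=module,
            optimizer=optimizer,
            model_parameters=module.parameters(),
            config=ds_config)

    def forward(self, *args, **kwargs):
        return self.engine(*args, **kwargs)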
# TODO: Model Sequentializing
# sequential_model = convert_to_sequential_model(model)
# model = PipelineModule(
#     layers=[model], num_stages=int(os.environ['WORLD_SIZE']))
raise NotImplementedError(
    'Pipeline Parallel is not implemented yet.')
Pipeline parallelism (PP) cannot be supported yet.
def consolidate_state_dict(self,
                           state_dict: Dict[str, Any],
                           to: int = 0) -> None:
    r"""Consolidate a list of ``state_dict`` s (one per rank) on the target
    rank.

    Arguments:
        to (int): the rank that receives the optimizer states (default: 0).

    Raises:
        RuntimeError: if ``overlap_with_ddp=True`` and this method is
            called before this :class:`ZeroRedundancyOptimizer` instance
            has been fully initialized, which happens once
            :class:`DistributedDataParallel` gradient buckets have been
            rebuilt.

    .. warning:: This needs to be called on all ranks.
    """
    from torch.distributed.optim.zero_redundancy_optimizer import (
        _broadcast_object, _recursive_copy_to_device)
Does `deepspeed` provide APIs to deal with this? Borrowing from another library may cause incompatibility in the future.
I'll try to check it!
I checked the above link. It provides similar functions but has limitations:
- It only supports ZeRO3.
- It only supports the model state.

I think this `consolidate` logic is general and can be used for other similar purposes. How about adding it to `mmengine/dist/utils`?
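If this went into `mmengine/dist/utils`, one option is to rely only on public `torch.distributed` collectives instead of the private ZeRO helpers; a minimal sketch under that assumption (the function name is hypothetical):

from typing import Any, Dict, List, Optional

import torch.distributed as dist


def gather_state_dicts(state_dict: Dict[str, Any],
                       dst: int = 0) -> Optional[List[Dict[str, Any]]]:
    """Gather one ``state_dict`` per rank onto ``dst``.

    Must be called on all ranks; returns the list of per-rank state dicts
    on ``dst`` and ``None`` on the other ranks.
    """
    if not dist.is_available() or not dist.is_initialized():
        return [state_dict]
    gathered = ([None] * dist.get_world_size()
                if dist.get_rank() == dst else None)
    dist.gather_object(state_dict, gathered, dst=dst)
    return gathered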
# initialize DeepSpeed Engine
self.model, optimizer, _, _ = deepspeed.initialize(
    model=self.model,
    optimizer=self.optim_wrapper.optimizer,
    model_parameters=self.model.parameters(),
    config=self.ds_config)
self.optim_wrapper.optimizer = optimizer
This is the problematic line.
Since the optimizer is wrapped in `DeepSpeedEngine`, update the optimizer held by `optim_wrapper`.
I suspect this update operation might be buggy in some special situations... Maybe it's better to build the optimizer first, add it to a `dict`, and then call `build_optim_wrapper`?
I think your idea is desirable. I'll check it.
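A sketch of the suggested ordering, reusing the `deepspeed.initialize` call from this PR and MMEngine's `OPTIM_WRAPPERS` registry; the helper name and the placeholder SGD optimizer are illustrative:

import deepspeed
from mmengine.registry import OPTIM_WRAPPERS
from torch.optim import SGD


def build_deepspeed_optim_wrapper(model, ds_config: dict):
    """Hypothetical helper: let DeepSpeed wrap the bare optimizer first,
    then build the optim wrapper around the already-wrapped optimizer,
    so ``optim_wrapper.optimizer`` never needs to be patched afterwards."""
    optimizer = SGD(model.parameters(), lr=0.01)  # placeholder optimizer
    engine, ds_optimizer, _, _ = deepspeed.initialize(
        model=model,
        optimizer=optimizer,
        model_parameters=model.parameters(),
        config=ds_config)
    optim_wrapper = OPTIM_WRAPPERS.build(
        dict(type='OptimWrapper', optimizer=ds_optimizer))
    return engine, optim_wrapper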
if self.model.zero_optimization_partition_weights():
    device = get_device()
    checkpoint = _load_checkpoint(filename, map_location=device)
When using ZeRO, first load the `state_dict` without changing it; e.g. the `module.` prefix must not be deleted in order to resume ZeRO training.
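A minimal sketch of what "without changing it" means in practice, assuming a checkpoint whose `state_dict` keys carry the `module.` prefix that `DeepSpeedEngine` adds; the helper name is hypothetical:

from collections import OrderedDict


def maybe_strip_module_prefix(state_dict, resume_zero: bool = False):
    """Drop the ``module.`` prefix only when NOT resuming a ZeRO run.

    DeepSpeedEngine holds the user model as ``self.module``, so its
    partitioned parameters are addressed with the prefixed names;
    stripping the prefix before ``load_state_dict`` would break resuming.
    """
    if resume_zero:
        # keep keys untouched, e.g. 'module.backbone.conv1.weight'
        return state_dict
    return OrderedDict(
        (k[len('module.'):] if k.startswith('module.') else k, v)
        for k, v in state_dict.items())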
if self.model.zero_optimization_partition_weights():
    # Prepare for checkpoint save by
    # ensuring all parameters are partitioned
    self.model.optimizer.checkpoint_event_prologue()

checkpoint = {
    'meta': meta,
    'message_hub': self.message_hub.state_dict(),
}
# save optimizer state dict to checkpoint
if save_optimizer:
    if not self.model.zero_optimization():
        checkpoint['optimizer'] = self.optim_wrapper.state_dict()
    else:
        self.consolidate_state_dict(self.optim_wrapper.state_dict())
        # Only the main process needs to load the optimizer's state.
        optim_state = self.get_zero_state_dict()
        checkpoint['optimizer'] = optim_state

# model state is stored after pulling optimizer state to handle ZeRO 3.
checkpoint['state_dict'] = weights_to_cpu(self.get_state_dict(model))

if self.model.zero_optimization_partition_weights():
    self.model.optimizer.checkpoint_event_epilogue()
if self.model.zero_optimization():
    self.optim_wrapper.load_state_dict(  # type: ignore
        checkpoint['optimizer'],
        load_from_fp32_weights=self.model.zero_load_from_fp32_weights())
This tries to stay compatible with https://github.com/microsoft/DeepSpeed/blob/211055216792cbb52ab6d355f698c194f9c55efb/deepspeed/runtime/zero/stage_1_and_2.py#L2228
if self.model.zero_optimization_partition_weights():
    optim_state = self.get_zero_state_dict()
    fp32_flat_groups = [
        torch.cat(optim_state[i][FP32_FLAT_GROUPS])
        for i in range(len(optim_state))
    ]
    param_shapes = self.model._get_zero_param_shapes()[0]
    param_shapes = OrderedDict(
        {'module.' + k: v
         for k, v in param_shapes.items()})

    model_state = _get_fp32_state_dict_from_zero3_checkpoint(
        world_size=self._world_size,
        param_shapes=[param_shapes],
        fp32_flat_groups=fp32_flat_groups,
        buffers={})
There should be special logic for ZeRO 3.
https://github.com/microsoft/DeepSpeed/blob/211055216792cbb52ab6d355f698c194f9c55efb/deepspeed/utils/zero_to_fp32.py#L100
In the case of ZeRO3, the `model_state` is saved inside the `optim_state`.
Check deepspeedai/DeepSpeed#2413
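For reference, DeepSpeed also ships a public helper, `get_fp32_state_dict_from_zero_checkpoint`, which consolidates a full fp32 `state_dict` from a saved ZeRO checkpoint directory. It works on an on-disk checkpoint rather than the in-memory `optim_state` used here, so it is an alternative path rather than a drop-in replacement; the directory below is a placeholder:

from deepspeed.utils.zero_to_fp32 import \
    get_fp32_state_dict_from_zero_checkpoint

# Rebuilds a single fp32 state_dict from the partitioned ZeRO shards saved
# under the checkpoint directory ('work_dirs/latest' is a placeholder).
state_dict = get_fp32_state_dict_from_zero_checkpoint('work_dirs/latest')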
# model state is stored after pulling optimizer state to handle ZeRO 3.
checkpoint['state_dict'] = weights_to_cpu(self.get_state_dict(model))
This ordering is important.
if not self.model.zero_optimization():
    checkpoint['optimizer'] = self.optim_wrapper.state_dict()
else:
    self.consolidate_state_dict(self.optim_wrapper.state_dict())
To save the `optimizer_state`, consolidate the `optim_state` from all ranks.
def load_state_dict(self, state_dict: dict, **kwargs) -> None:
    """A wrapper of ``Optimizer.load_state_dict``. load the state dict of
The `**kwargs` argument is newly added.
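A simplified stand-in showing why the extra keyword arguments are needed: DeepSpeed's ZeRO optimizers accept keywords such as `load_from_fp32_weights` (see the call above), which a plain `Optimizer.load_state_dict` would not; everything except that keyword is illustrative:

from typing import Any, Dict


class DeepSpeedOptimWrapperSketch:
    """Simplified stand-in for the PR's optim wrapper."""

    def __init__(self, optimizer):
        self.optimizer = optimizer

    def load_state_dict(self, state_dict: Dict[str, Any], **kwargs) -> None:
        # Forward extra keywords (e.g. load_from_fp32_weights) to the
        # wrapped DeepSpeed optimizer.
        self.optimizer.load_state_dict(state_dict, **kwargs)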
# Set logging level to remove duplicate training log from DeepSpeed
deepspeed_logger = logging.getLogger('DeepSpeed')
deepspeed_logger.setLevel(logging.WARNING)
Suppress `deepspeed`'s duplicate training logs.
def inject_basemodel_methods(self):
    """inject methods from ``BaseModel`` into ``DeepSpeedEngine`` to make
    ``DeepSpeedEngine`` support the ``train_step`` method appropriately.

    Without injecting, ``DeepSpeedOptimWrapper`` tries ``backward`` from
    ``BaseModel``, which should be in ``DeepSpeedEngine``.
    """

    def _train_step(self, data: Union[dict, tuple, list],
                    optim_wrapper) -> Dict[str, torch.Tensor]:
        with optim_wrapper.optim_context(self):
            data = self.data_preprocessor(data, True)
            losses = self._run_forward(data, mode='loss')  # type: ignore
        parsed_losses, log_vars = self.parse_losses(losses)  # type: ignore
        optim_wrapper.update_params(parsed_losses)
        return log_vars

    self.model.train_step = types.MethodType(_train_step, self.model)
Can we delete this by wrapping `DeepSpeedEngine` with `BaseModel`?
Really awesome work! Actually, I haven't finished my review yet, especially the "messy" logic related to checkpoint saving/loading. I'll refer to the DeepSpeed docs & example code and your comments later. Stay connected.
deepspeed_logger = logging.getLogger('DeepSpeed')
deepspeed_logger.setLevel(logging.WARNING)

self.inject_basemodel_methods()
There are some models that override the `train_step`, `val_step`, and `test_step` methods. Can we support them in `deepspeed_runner`?
Can I ask you to provide some links to those examples?
One of the difficulties in handling this is that `DeepSpeedEngine` fails to get some attributes from `self.module`. https://github.com/microsoft/DeepSpeed/blob/90ae6884424232870154b49967c3e61f0db550d6/deepspeed/runtime/engine.py#L461
I think it will be difficult to support them in the current implementation. Maybe we have to find another way to handle this.
Some low-level tasks need to rewrite `train_step`, such as GANs in mmediting. It is indeed very difficult to support them, so I think the current implementation is acceptable.
@contextmanager
def optim_context(self, model: nn.Module):
    """A Context for gradient accumulation and automatic mix precision
    training.

    Compared to the original method, this saves model information as
    a member variable in order to use in the training step.

    Args:
        model (nn.Module): The training model.
    """
    # During gradient accumulation process, the gradient synchronize
    # should only happen before updating parameters.
    self.model = model
    yield super().optim_context(model)
Seems a little tricky. As you've mentioned, maybe we should provide a `wrap_model_and_optimizer` API.
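A rough sketch of what such a `wrap_model_and_optimizer` step could look like on the runner side, so the optim wrapper never has to stash the model inside `optim_context`; the method name and the `optim_wrapper.model` attribute are assumptions taken from this comment, not an existing MMEngine API:

import deepspeed


def wrap_model_and_optimizer(self):
    """Hypothetical runner method: wrap model and optimizer in one place."""
    engine, optimizer, _, _ = deepspeed.initialize(
        model=self.model,
        optimizer=self.optim_wrapper.optimizer,
        model_parameters=self.model.parameters(),
        config=self.ds_config)
    self.model = engine
    self.optim_wrapper.optimizer = optimizer
    # Give the optim wrapper a handle to the engine so update_params() can
    # call engine.backward()/engine.step() without optim_context tricks.
    self.optim_wrapper.model = engine
    return engine, self.optim_wrapper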
This PR is for the discussion open-mmlab#749.
Co-authored-by: Saeyeol Lee [email protected]
Co-authored-by: Donggeun Yu [email protected]
Co-authored-by: Junhwa Song [email protected]
Co-authored-by: Younghwan Na [email protected]
Signed-off-by: Hakjin Lee [email protected]
Signed-off-by: Saeyeol Lee [email protected]
Signed-off-by: Donggeun Yu [email protected]
Signed-off-by: Junhwa Song [email protected]
Signed-off-by: Younghwan Na [email protected]