
Horovod all_gather function refactor #9695

Closed
four4fish opened this issue Sep 24, 2021 · 1 comment · Fixed by #9696
Labels: distributed (Generic distributed-related topic), let's do it! (approved to implement), refactor

Comments

@four4fish
Contributor

Proposed refactoring or deprecation

The Horovod training type plugin's collective function all_gather() currently calls Horovod's allgather() and converts the result to a list of tensors.

  1. Horovod supports allgather_object(), which returns a list of gathered objects, one per worker (see the sketch after this list):
    https://horovod.readthedocs.io/en/stable/_modules/horovod/torch/functions.html#allgather_object

  2. Revisit the use cases to decide whether we actually need to return a list of tensors here, since the training_type_plugin all_gather() API is currently defined to return a tensor.
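
For context, a minimal sketch contrasting the two Horovod gather flavors; it assumes hvd.init() has been called and the script runs under horovodrun with multiple workers:

    import torch
    import horovod.torch as hvd

    hvd.init()
    t = torch.tensor([float(hvd.rank())])

    # hvd.allgather concatenates tensors from all workers along dim 0:
    # with N workers, a shape (1,) input yields one tensor of shape (N,).
    gathered_tensor = hvd.allgather(t)

    # hvd.allgather_object pickles arbitrary Python objects and returns
    # a list with one entry per worker.
    gathered_list = hvd.allgather_object(t)

    assert gathered_tensor.shape[0] == hvd.size()
    assert len(gathered_list) == hvd.size()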

Motivation

Ensure correct and consistent collective behavior.

Pitch

    def all_gather(
        self, result: torch.Tensor, group: Optional[Any] = dist_group.WORLD, sync_grads: bool = False
    ) -> torch.Tensor:
        if group is not None and group != dist_group.WORLD:
            raise ValueError("Horovod does not support allgather using a subcommunicator at this time. Unset `group`.")

        if len(result.shape) == 0:
            # Convert scalars to single dimension tensors
            result = result.reshape(1)

        # sync and gather all
        self.join()
        gathered = hvd.allgather(result)
        # NOTE: this returns List[torch.Tensor], contradicting the annotated torch.Tensor return type
        gathered_result = list(gathered.split(1, dim=0))
        return gathered_result
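
For illustration, a hypothetical caller (plugin is a stand-in name) sees the mismatch at runtime:

    # Hypothetical caller-side view of the current behavior: the signature
    # promises torch.Tensor, but a list actually comes back.
    out = plugin.all_gather(torch.tensor(1.0))
    print(type(out))  # <class 'list'>, despite the -> torch.Tensor annotation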

[RFC]
Option 1

    def all_gather(
        self, result: torch.Tensor, group: Optional[Any] = dist_group.WORLD, sync_grads: bool = False
    ) -> List[torch.Tensor]:
        if group is not None and group != dist_group.WORLD:
            raise ValueError("Horovod does not support allgather using a subcommunicator at this time. Unset `group`.")

        if len(result.shape) == 0:
            # Convert scalars to single dimension tensors
            result = result.reshape(1)

        # sync and gather all
        self.join()
        # allgather_object pickles the input and returns a list with one entry per worker
        return hvd.allgather_object(result)

Option 2

    def all_gather(
        self, result: torch.Tensor, group: Optional[Any] = dist_group.WORLD, sync_grads: bool = False
    ) -> torch.Tensor:
        if group is not None and group != dist_group.WORLD:
            raise ValueError("Horovod does not support allgather using a subcommunicator at this time. Unset `group`.")

        if len(result.shape) == 0:
            # Convert scalars to single dimension tensors
            result = result.reshape(1)

        # sync and gather all
        self.join()
        # allgather concatenates along dim 0, so the result stays a single tensor
        return hvd.allgather(result)
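
Under Option 2 the same hypothetical call returns a single concatenated tensor, matching the annotation; with N workers a per-rank scalar gathers into shape (N,):

    # Hypothetical usage under Option 2: one tensor, leading dim == world size.
    gathered = plugin.all_gather(torch.tensor(0.5))
    assert gathered.shape[0] == hvd.size()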

Additional context



@ananthsub
Contributor

Given that all_gather is typed as accepting a torch.Tensor and returning a torch.Tensor here, I am in favor of Option 2, as it is the most direct translation and likely offers a performance win compared to calling hvd.allgather_object, which has to serialize (pickle) its input.
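
As a caller-side sketch: any code that still wants per-rank chunks can recover them from the tensor returned by Option 2, mirroring the split removed from the plugin:

    # Sketch: recover per-rank chunks from the concatenated tensor; this is
    # the same split(1, dim=0) the plugin previously performed internally.
    chunks = list(gathered.split(1, dim=0))  # one tensor of shape (1,) per worker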
