
Horovod all_gather function refactor #9695

Closed
four4fish opened this issue Sep 24, 2021 · 1 comment · Fixed by #9696
Labels: distributed (Generic distributed-related topic), let's do it! (approved to implement), refactor

Comments

@four4fish
Contributor

Proposed refactoring or deprecation

The Horovod training type plugin's collective function all_gather() currently calls Horovod's allgather() and converts the result to a list of tensors.

  1. Horovod supports allgather_object(), which returns a list of gathered objects, one per worker (see the sketch after this list):
    https://horovod.readthedocs.io/en/stable/_modules/horovod/torch/functions.html#allgather_object

  2. Revisit the use cases to decide whether we actually need to return a list of tensors here, since the training_type_plugin all_gather() API is currently defined to return a tensor.
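
For context, a minimal sketch contrasting the two Horovod gather flavors; it assumes hvd.init() has been called and the script runs under horovodrun with multiple workers:

    import torch
    import horovod.torch as hvd

    hvd.init()
    t = torch.tensor([float(hvd.rank())])

    # hvd.allgather concatenates tensors from all workers along dim 0:
    # with N workers, a shape (1,) input yields one tensor of shape (N,).
    gathered_tensor = hvd.allgather(t)

    # hvd.allgather_object pickles arbitrary Python objects and returns
    # a list with one entry per worker.
    gathered_list = hvd.allgather_object(t)

    assert gathered_tensor.shape[0] == hvd.size()
    assert len(gathered_list) == hvd.size()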

Motivation

Ensure correct and consistent collective behavior.

Pitch

    def all_gather(
        self, result: torch.Tensor, group: Optional[Any] = dist_group.WORLD, sync_grads: bool = False
    ) -> torch.Tensor:
        if group is not None and group != dist_group.WORLD:
            raise ValueError("Horovod does not support allgather using a subcommunicator at this time. Unset `group`.")

        if len(result.shape) == 0:
            # Convert scalars to single dimension tensors
            result = result.reshape(1)

        # sync and gather all
        self.join()
        gathered = hvd.allgather(result)
        # NOTE: this returns List[torch.Tensor], contradicting the annotated torch.Tensor return type
        gathered_result = list(gathered.split(1, dim=0))
        return gathered_result
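
For illustration, a hypothetical caller (plugin is a stand-in name) sees the mismatch at runtime:

    # Hypothetical caller-side view of the current behavior: the signature
    # promises torch.Tensor, but a list actually comes back.
    out = plugin.all_gather(torch.tensor(1.0))
    print(type(out))  # <class 'list'>, despite the -> torch.Tensor annotation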

[RFC]
Option 1

    def all_gather(
        self, result: torch.Tensor, group: Optional[Any] = dist_group.WORLD, sync_grads: bool = False
    ) -> List[torch.Tensor]:
        if group is not None and group != dist_group.WORLD:
            raise ValueError("Horovod does not support allgather using a subcommunicator at this time. Unset `group`.")

        if len(result.shape) == 0:
            # Convert scalars to single dimension tensors
            result = result.reshape(1)

        # sync and gather all
        self.join()
        # allgather_object pickles the input and returns a list with one entry per worker
        return hvd.allgather_object(result)

Option 2

    def all_gather(
        self, result: torch.Tensor, group: Optional[Any] = dist_group.WORLD, sync_grads: bool = False
    ) -> torch.Tensor:
        if group is not None and group != dist_group.WORLD:
            raise ValueError("Horovod does not support allgather using a subcommunicator at this time. Unset `group`.")

        if len(result.shape) == 0:
            # Convert scalars to single dimension tensors
            result = result.reshape(1)

        # sync and gather all
        self.join()
        # allgather concatenates along dim 0, so the result stays a single tensor
        return hvd.allgather(result)
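
Under Option 2 the same hypothetical call returns a single concatenated tensor, matching the annotation; with N workers a per-rank scalar gathers into shape (N,):

    # Hypothetical usage under Option 2: one tensor, leading dim == world size.
    gathered = plugin.all_gather(torch.tensor(0.5))
    assert gathered.shape[0] == hvd.size()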

Additional context



@ananthsub
Contributor

Given that all_gather is typed as accepting a torch.Tensor and returning a torch.Tensor here, I am in favor of Option 2, as it is the most direct translation and likely offers a performance win compared to calling hvd.allgather_object, which has to serialize (pickle) its input.
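
As a caller-side sketch: any code that still wants per-rank chunks can recover them from the tensor returned by Option 2, mirroring the split removed from the plugin:

    # Sketch: recover per-rank chunks from the concatenated tensor; this is
    # the same split(1, dim=0) the plugin previously performed internally.
    chunks = list(gathered.split(1, dim=0))  # one tensor of shape (1,) per worker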
