
Support different AMP & buffer configurations in one experiment, fix minor bugs #389

Merged: 20 commits merged from the fp16 branch into master on Sep 28, 2021

Conversation

justheuristic (Member):
New features:

  • CollaborativeOptimizer can now combine fp16=True and reuse_grad_buffers=True with a special scaler
  • CollaborativeOptimizer peers with reuse_grad_buffers=True and reuse_grad_buffers=False can now co-exist
  • CollaborativeOptimizer peers with and without AMP can now co-exist

The new behavior of CollaborativeOptimizer with fp16 is as follows (a usage sketch follows this list):

  • grad_scaler=None: regular fp32 behavior
  • reuse_grad_buffers=False with GradScaler: works as usual; un-scales each tensor independently before accumulation and does not affect the internal optimizer
  • reuse_grad_buffers=True with GradScaler: calling scaler.step(opt) raises an error explaining that this mode requires HivemindGradScaler
  • reuse_grad_buffers=False with HivemindGradScaler: applies unscale/update only around the global optimizer step

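A hedged usage sketch of the fp16 + reuse_grad_buffers combination. model, base_opt, dht and dataloader are placeholders, and the CollaborativeOptimizer arguments shown are illustrative rather than the complete or exact signature:

import torch
from hivemind import CollaborativeOptimizer
from hivemind.optim.grad_scaler import HivemindGradScaler

# model, base_opt (a torch.optim.Optimizer), dht and dataloader are assumed to exist
opt = CollaborativeOptimizer(
    opt=base_opt,
    dht=dht,
    prefix="my_experiment",
    target_batch_size=4096,
    reuse_grad_buffers=True,  # accumulate directly into param.grad
)
scaler = HivemindGradScaler()

for batch in dataloader:
    with torch.cuda.amp.autocast():
        loss = model(batch)
    scaler.scale(loss).backward()
    scaler.step(opt)   # with reuse_grad_buffers=True, a plain torch.cuda.amp.GradScaler would raise here
    scaler.update()

The loop mirrors the standard torch.cuda.amp pattern; the only change is swapping the scaler class.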

@torch.no_grad()
def apply_accumulated_grads_(self, scale_by: Optional[float] = None):
    if self.reuse_grad_buffers:
justheuristic (Member Author) commented on Sep 25, 2021:
This previously caused a bug where reuse=True peers would not be scaled by scale_by. As a result, if there was a mix of reuse=True and reuse=False peers, reuse=True would have larger gradients and dominate the reuse=False peers.
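For illustration, a minimal sketch of the fixed logic, assuming the helper names from the snippets in this thread (_grad_buffers, accumulated_grads); this is a paraphrase, not the exact diff:

@torch.no_grad()
def apply_accumulated_grads_(self, scale_by: Optional[float] = None):
    if not self.reuse_grad_buffers:
        # copy separately accumulated grads into the actual gradient buffers
        for grad_buf, grad_acc in zip(self._grad_buffers(), self.accumulated_grads()):
            grad_buf.copy_(grad_acc.to(grad_buf.device), non_blocking=True)
    if scale_by is not None:
        # scale reused and non-reused buffers alike so all peers agree on gradient magnitude
        for grad_buf in self._grad_buffers():
            grad_buf.mul_(scale_by)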

self._grads = [
    torch.zeros_like(grad, device=self.accumulate_grads_on) for grad in self._grad_buffers()
]
yield from self._grads
justheuristic (Member Author) commented:

This would actually be an error with reuse_grad_buffers=True, but it went unnoticed because no one ever asked for more than len(grad_buffers) elements.
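A hedged sketch of what the surrounding generator presumably looks like after the fix (the method name and structure are inferred from the snippet above, not quoted from the diff):

def accumulated_grads(self) -> Iterator[torch.Tensor]:
    """Yield gradient accumulators: the reused buffers themselves, or dedicated accumulator tensors."""
    if self.reuse_grad_buffers:
        # with reused buffers, the gradient buffers double as accumulators
        yield from self._grad_buffers()
        return
    if self._grads is None:
        with torch.no_grad():
            self._grads = [
                torch.zeros_like(grad, device=self.accumulate_grads_on) for grad in self._grad_buffers()
            ]
    yield from self._grads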

@justheuristic justheuristic changed the title FP16 support, a few patches from sahajbert Extended AMP support, a few patches from sahajbert Sep 25, 2021
codecov bot commented on Sep 25, 2021:

Codecov Report

Merging #389 (773a3bb) into master (4a9bc92) will decrease coverage by 0.74%.
The diff coverage is 23.65%.

@@            Coverage Diff             @@
##           master     #389      +/-   ##
==========================================
- Coverage   84.11%   83.37%   -0.75%     
==========================================
  Files          70       71       +1     
  Lines        6423     6497      +74     
==========================================
+ Hits         5403     5417      +14     
- Misses       1020     1080      +60     
Impacted Files                        Coverage Δ
hivemind/optim/collaborative.py       23.77% <7.89%> (-1.32%) ⬇️
hivemind/optim/grad_scaler.py         33.33% <33.33%> (ø)
hivemind/optim/__init__.py            100.00% <100.00%> (ø)
hivemind/dht/node.py                  91.44% <0.00%> (-1.19%) ⬇️
hivemind/averaging/matchmaking.py     83.75% <0.00%> (-0.32%) ⬇️

@justheuristic justheuristic changed the title Extended AMP support, a few patches from sahajbert Extended AMP support with reuse_grad_buffers, a few patches from sahajbert Sep 25, 2021
@mryab mryab changed the title Extended AMP support with reuse_grad_buffers, a few patches from sahajbert Extend AMP support with reuse_grad_buffers, improve cross-device averaging Sep 26, 2021
@mryab mryab changed the title Extend AMP support with reuse_grad_buffers, improve cross-device averaging Improve AMP support with reuse_grad_buffers and cross-device averaging Sep 26, 2021
@mryab mryab changed the title Improve AMP support with reuse_grad_buffers and cross-device averaging Support different AMP configurations in one experiment Sep 26, 2021
@justheuristic justheuristic changed the title Support different AMP configurations in one experiment Support different AMP & reuse configurations in one experiment, fix minor bugs Sep 27, 2021
@justheuristic justheuristic changed the title Support different AMP & reuse configurations in one experiment, fix minor bugs Support different AMP & buffer configurations in one experiment, fix minor bugs Sep 27, 2021
def _unscale_grads_(
    self, optimizer: Optimizer, inv_scale: torch.Tensor, found_inf: torch.Tensor, allow_fp16: bool
) -> Dict[torch.device, torch.Tensor]:
    return super()._unscale_grads_(optimizer, inv_scale, found_inf, allow_fp16=True)
A project member commented:
Why does it ignore the allow_fp16 value, always setting it to True?

justheuristic (Member Author) replied:

That's the same trick that fairscale uses to allow training without master fp32 weights:
https://github.com/facebookresearch/fairscale/blob/main/fairscale/optim/grad_scaler.py
(I was referred there by @TimDettmers.)

Added a quick comment explaining that.
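For context, a sketch of the pattern being discussed, with the explanatory comment written out (a simplified stand-in with a hypothetical class name, not the exact class from this PR):

from typing import Dict

import torch
from torch.cuda.amp import GradScaler
from torch.optim import Optimizer


class FP16FriendlyGradScaler(GradScaler):  # hypothetical name; the PR's class is HivemindGradScaler
    def _unscale_grads_(
        self, optimizer: Optimizer, inv_scale: torch.Tensor, found_inf: torch.Tensor, allow_fp16: bool
    ) -> Dict[torch.device, torch.Tensor]:
        # same trick as fairscale's grad_scaler.py (linked above): force allow_fp16=True so that
        # gradients of fp16 parameters (training without master fp32 weights) can still be unscaled
        return super()._unscale_grads_(optimizer, inv_scale, found_inf, allow_fp16=True)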

elif self._grads is None:
    with torch.no_grad():
        return
else:
A project member commented:
nit (suggestion): Since the if branch ends with return, we can remove else: and one indent level for the remaining code. This may make it look simpler.

justheuristic (Member Author) replied:

I respectfully disagree here: it does reduce indentation, but:

  • the code in the else clause is quite simple as it is
  • crucially, the last time around the code was wrong precisely because I had accidentally removed a return and didn't think about the ramifications

If you do insist, please say so explicitly and I'll remove the else clause anyway.

The reviewer replied:

I would say that else: after return is obviously redundant and looks like a bug, so I insist on choosing one of the two options:

  1. return without else: (preferred, since it means less indentation and more clarity)
  2. else: without return
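For clarity, the two shapes under discussion (placeholder bodies, not code from the diff):

# Option 1: return without else
if self._grads is None:
    ...  # handle this case
    return
...  # remaining code, one indentation level shallower

# Option 2: else without return
if self._grads is None:
    ...  # handle this case
else:
    ...  # remaining code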

@justheuristic justheuristic merged commit 1d862c9 into master Sep 28, 2021
@justheuristic justheuristic deleted the fp16 branch September 28, 2021 02:24