Hotfix: load_state_from_peers with offload_optimizer #417

Merged on Dec 2, 2021 (1 commit)

justheuristic (Member) commented on Dec 1, 2021

hivemind.optim.experimental.Optimizer with offload_optimizer=True behaved incorrectly when loading state from peers.

It would load the state into the local parameters and was then meant to write the new parameters into the offloaded optimizer, but it instead overwrote the newly loaded parameters with the old offloaded ones. This PR fixes that.
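
The ordering problem described above can be sketched in plain Python. This is a toy model under stated assumptions, not hivemind's code: `ToyParam`, `ToyAverager`, `load_state`, and the `fixed` flag are hypothetical names invented for illustration, and the exact place where the real code copies optimizer parameters into the main parameters is not shown in this thread.

```python
from itertools import chain


class ToyParam:
    """Hypothetical stand-in for a tensor: a mutable box holding one float."""

    def __init__(self, value):
        self.value = value


class ToyAverager:
    """Hypothetical, heavily simplified model of a state averager (not hivemind's class)."""

    def __init__(self, initial_values, offload_optimizer):
        self.offload_optimizer = offload_optimizer
        self.main_parameters = [ToyParam(v) for v in initial_values]
        self.extra_tensors = []
        # With offloading, the optimizer owns separate copies of the parameters;
        # without it, the optimizer references the main parameters directly.
        if offload_optimizer:
            opt_params = [ToyParam(p.value) for p in self.main_parameters]
        else:
            opt_params = self.main_parameters
        self.param_groups = [{"params": opt_params}]

    def opt_parameters(self):
        return tuple(p for group in self.param_groups for p in group["params"])

    def load_state(self, downloaded, fixed):
        # Choose which tensors receive the downloaded state.
        if fixed:
            targets = tuple(chain(self.opt_parameters(), self.extra_tensors))
        else:
            targets = tuple(chain(self.main_parameters, self.extra_tensors))
        for target, new_value in zip(targets, downloaded):
            target.value = new_value
        # Assumed sync step: copy optimizer parameters into the main parameters.
        if self.offload_optimizer:
            for main, opt in zip(self.main_parameters, self.opt_parameters()):
                main.value = opt.value


def as_list(params):
    return [p.value for p in params]


buggy = ToyAverager([0.0, 0.0], offload_optimizer=True)
buggy.load_state([1.0, 2.0], fixed=False)

patched = ToyAverager([0.0, 0.0], offload_optimizer=True)
patched.load_state([1.0, 2.0], fixed=True)

print("buggy main parameters:  ", as_list(buggy.main_parameters))
print("patched main parameters:", as_list(patched.main_parameters))
```

In this toy model, loading into the main parameters first means the later sync step discards the downloaded values, whereas loading into the optimizer's parameters lets the same sync step propagate them.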

Demonstration 1: peers with uneven performance are now handled correctly (same as in CollaborativeOptimizer)
image

Demonstration 2: peers that join late are now handled correctly even with offload_optimizer=True. In the demo below, the orange peer started late.
image

Before the fix, late peers with offload_optimizer=True had correct main parameters, but their offloaded optimizer held the wrong (old) parameters.
image

codecov bot commented Dec 1, 2021

Codecov Report

Merging #417 (dbaff7e) into master (a960438) will decrease coverage by 0.03%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #417      +/-   ##
==========================================
- Coverage   83.68%   83.65%   -0.04%     
==========================================
  Files          77       77              
  Lines        7785     7788       +3     
==========================================
  Hits         6515     6515              
- Misses       1270     1273       +3     
| Impacted Files | Coverage Δ |
|---|---|
| hivemind/optim/experimental/state_averager.py | 86.86% <100.00%> (+0.09%) ⬆️ |
| hivemind/dht/node.py | 91.44% <0.00%> (-1.19%) ⬇️ |
| hivemind/averaging/matchmaking.py | 88.48% <0.00%> (ø) |
| hivemind/utils/mpfuture.py | 95.00% <0.00%> (+0.90%) ⬆️ |

justheuristic changed the title from "Hotfix: offload_optimizer in load_state_from_peers" to "Hotfix: load_state_from_peers with offload_optimizer" on Dec 2, 2021
@@ -631,7 +631,8 @@ def load_state_from_peers(self, **kwargs):
         Attempt to download the latest optimizer state from peers and update trainer parameters/statistics.
         :returns: whether or the averager succeeded in loading parameters
         """
-        main_parameters_and_extras = tuple(chain(self.main_parameters, self.extra_tensors))
+        opt_parameters = tuple(param for param_group in self.optimizer.param_groups for param in param_group["params"])
+        main_parameters_and_extras = tuple(chain(opt_parameters, self.extra_tensors))
Member commented:

What does it mean? If self.main_parameters are invalid at this stage, shouldn't we replace them?

justheuristic (Member, Author) replied:

If offload_optimizer == False, opt_parameters ARE main parameters

If offload_optimizer == True, we update main parameters in L665
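
The distinction drawn in this reply can be checked with a minimal sketch. It assumes plain Python lists in place of tensors and hand-built dicts that mimic the shape of `optimizer.param_groups`; `flatten_param_groups` is a hypothetical helper wrapping the same comprehension as the diff, not a hivemind or PyTorch function.

```python
def flatten_param_groups(param_groups):
    """Flatten optimizer-style param groups into one tuple, preserving order."""
    return tuple(param for group in param_groups for param in group["params"])


# Plain one-element lists stand in for tensors (an assumption for illustration).
main_parameters = [[0.0], [1.0], [2.0]]

# offload_optimizer == False: the groups reference the main parameters themselves.
shared_groups = [{"params": main_parameters[:2]}, {"params": main_parameters[2:]}]

# offload_optimizer == True: the groups hold separate copies of the parameters.
offloaded_groups = [{"params": [list(p) for p in main_parameters]}]

shared = flatten_param_groups(shared_groups)
offloaded = flatten_param_groups(offloaded_groups)

# Without offloading, flattening yields the very same objects as the main parameters.
same_objects = all(a is b for a, b in zip(shared, main_parameters))

# With offloading, the values match but the objects are distinct.
distinct_copies = all(a is not b and a == b for a, b in zip(offloaded, main_parameters))

print("shared params are the main params:", same_objects)
print("offloaded params are separate copies:", distinct_copies)
```

Writing into a shared parameter is immediately visible through `main_parameters`, while writing into an offloaded copy is not, which is why a separate copy back into the main parameters is only needed in the offloaded case.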

justheuristic (Member, Author) commented:

@SeanNaren JFYI: I see that you've reacted to this, but just in case: you will need to switch to a newer hivemind version after this gets merged :)
