Hotfix: load_state_from_peers with offload_optimizer #417
Conversation
Codecov Report
@@ Coverage Diff @@
## master #417 +/- ##
==========================================
- Coverage 83.68% 83.65% -0.04%
==========================================
Files 77 77
Lines 7785 7788 +3
==========================================
Hits 6515 6515
- Misses 1270 1273 +3
@@ -631,7 +631,8 @@ def load_state_from_peers(self, **kwargs):
        Attempt to download the latest optimizer state from peers and update trainer parameters/statistics.
        :returns: whether or not the averager succeeded in loading parameters
        """
-       main_parameters_and_extras = tuple(chain(self.main_parameters, self.extra_tensors))
+       opt_parameters = tuple(param for param_group in self.optimizer.param_groups for param in param_group["params"])
+       main_parameters_and_extras = tuple(chain(opt_parameters, self.extra_tensors))
What does it mean? If self.main_parameters are invalid at this stage, shouldn't we replace them?
If offload_optimizer == False, opt_parameters ARE the main parameters.
If offload_optimizer == True, we update the main parameters in L665.
@SeanNaren JFYI: I see that you've reacted to this, but just in case: you will need to switch to a newer hivemind version after this gets merged :)
hivemind.optim.experimental.Optimizer with offload_optimizer=True behaved incorrectly when loading state from peers.
It would load the downloaded state into the local parameters, and was then meant to write the new parameters into the offloaded optimizer, but instead it overwrote the newly loaded parameters with the old offloaded ones. This PR fixes that.
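To make the failure mode concrete, here is a toy sketch of the issue. All names (`ToyOptimizer`, `load_state_buggy`, `load_state_fixed`) are hypothetical and stand in for the real hivemind code paths; parameters are modeled as plain Python lists rather than tensors. With offload_optimizer, the "main" parameters live on the training device while the optimizer steps on its own offloaded copies, so the order of the final sync matters:

```python
# Toy model of offload_optimizer (hypothetical names, not hivemind's API).
# The optimizer keeps offloaded copies of the parameters in param_groups.

class ToyOptimizer:
    def __init__(self, params):
        # offloaded copies that the optimizer actually steps on
        self.param_groups = [{"params": list(params)}]

def load_state_buggy(main_params, opt, downloaded):
    # bug: write the downloaded state into the main parameters...
    main_params[:] = downloaded
    # ...then sync main <- offloaded, which restores the STALE offloaded
    # copies and discards the state that was just loaded
    main_params[:] = opt.param_groups[0]["params"]

def load_state_fixed(main_params, opt, downloaded):
    # fix: load directly into the optimizer's own parameters
    # (cf. iterating optimizer.param_groups in the diff above),
    # then propagate them to the main parameters
    opt_params = opt.param_groups[0]["params"]
    opt_params[:] = downloaded
    main_params[:] = opt_params

stale = [0.0, 0.0]
downloaded = [1.0, 2.0]

main = list(stale)
load_state_buggy(main, ToyOptimizer(stale), downloaded)
print(main)  # [0.0, 0.0] -- the downloaded state was lost

main = list(stale)
load_state_fixed(main, ToyOptimizer(stale), downloaded)
print(main)  # [1.0, 2.0] -- the downloaded state survives
```

The buggy path ends with stale values in both the main parameters and the optimizer; the fixed path leaves both holding the freshly downloaded state.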
Demonstration 1: peers with uneven performance are now handled correctly (same as in CollaborativeOptimizer)

Demonstration 2: peers that join late are now handled correctly even with offload_optimizer=True. In the demo below, the orange peer started late.

Before fix, late peers with offload_optimizer would have correct main parameters, but their optimizer would actually have wrong parameters
