Hotfix: load_state_from_peers with offload_optimizer #417
Conversation
Codecov Report
@@ Coverage Diff @@
## master #417 +/- ##
==========================================
- Coverage 83.68% 83.65% -0.04%
==========================================
Files 77 77
Lines 7785 7788 +3
==========================================
Hits 6515 6515
- Misses 1270 1273 +3
@@ -631,7 +631,8 @@ def load_state_from_peers(self, **kwargs):
        Attempt to download the latest optimizer state from peers and update trainer parameters/statistics.
        :returns: whether or not the averager succeeded in loading parameters
        """
-       main_parameters_and_extras = tuple(chain(self.main_parameters, self.extra_tensors))
+       opt_parameters = tuple(param for param_group in self.optimizer.param_groups for param in param_group["params"])
+       main_parameters_and_extras = tuple(chain(opt_parameters, self.extra_tensors))
What does it mean? If self.main_parameters are invalid at this stage, shouldn't we replace them?
If offload_optimizer == False, opt_parameters ARE the main parameters.
If offload_optimizer == True, we update the main parameters in L665.
@SeanNaren JFYI: I see that you've reacted to this, but just in case: you will need to switch to a newer hivemind version after this gets merged :)
hivemind.optim.experimental.Optimizer with offload_optimizer=True behaved incorrectly when loading state from peers.
It would load the downloaded state into the local parameters, and was then meant to write the new parameters into the offloaded optimizer, but instead it overwrote the newly loaded parameters with the old offloaded ones. This PR fixes that.
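To make the failure mode concrete, here is a toy sketch of the issue. All names (`ToyOptimizer`, `load_state_buggy`, `load_state_fixed`) are hypothetical and stand in for the real hivemind code paths; parameters are modeled as plain Python lists rather than tensors. With offload_optimizer, the "main" parameters live on the training device while the optimizer steps on its own offloaded copies, so the order of the final sync matters:

```python
# Toy model of offload_optimizer (hypothetical names, not hivemind's API).
# The optimizer keeps offloaded copies of the parameters in param_groups.

class ToyOptimizer:
    def __init__(self, params):
        # offloaded copies that the optimizer actually steps on
        self.param_groups = [{"params": list(params)}]

def load_state_buggy(main_params, opt, downloaded):
    # bug: write the downloaded state into the main parameters...
    main_params[:] = downloaded
    # ...then sync main <- offloaded, which restores the STALE offloaded
    # copies and discards the state that was just loaded
    main_params[:] = opt.param_groups[0]["params"]

def load_state_fixed(main_params, opt, downloaded):
    # fix: load directly into the optimizer's own parameters
    # (cf. iterating optimizer.param_groups in the diff above),
    # then propagate them to the main parameters
    opt_params = opt.param_groups[0]["params"]
    opt_params[:] = downloaded
    main_params[:] = opt_params

stale = [0.0, 0.0]
downloaded = [1.0, 2.0]

main = list(stale)
load_state_buggy(main, ToyOptimizer(stale), downloaded)
print(main)  # [0.0, 0.0] -- the downloaded state was lost

main = list(stale)
load_state_fixed(main, ToyOptimizer(stale), downloaded)
print(main)  # [1.0, 2.0] -- the downloaded state survives
```

The buggy path ends with stale values in both the main parameters and the optimizer; the fixed path leaves both holding the freshly downloaded state.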
Demonstration 1: peers with uneven performance are now handled correctly (same as in CollaborativeOptimizer)

Demonstration 2: peers that join late are now handled correctly even with offload_optimizer=True. In the demo below, the orange peer started late.

Before fix, late peers with offload_optimizer would have correct main parameters, but their optimizer would actually have wrong parameters
