
Question about loading checkpoint #446

Open
finger92 opened this issue Jan 14, 2022 · 6 comments

@finger92
Contributor

Before hivemind 1.0.0, I was able to resume training by calling 'load_state_dict' only in the monitor peer.
The code looks like this:

# monitor peer
if load_from_pretrained:
  self.model.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu"), strict=False)
  ...
  self.collaborative_optimizer.load_state_dict(torch.load("optimizer.pt", map_location="cpu"))

The peers would then load the monitor's state after starting up.

However, in version 1.0.0 and on current master, 'load_state_dict' in the monitor does not seem to work.
My question is: am I using the wrong method, or should I load the checkpoint on the worker peer instead?

@justheuristic
Member

justheuristic commented Jan 14, 2022

Hi! Thanks for the report. I'd appreciate it if you could run a few tests to isolate the problem.

Note: there are quite a few of them. If you find any discrepancies, you don't need to run the rest of the tests. Also, feel free to reach out if you need any assistance with running these checks.

Q0: If possible, please specify the hivemind versions you used: both the old one and the new one.

Q1: In the new version, when you launch one monitor and one training peer, does the training peer print Downloading parameter from <long PeerID string>? (and not "Failed to load state" / "Cowardly refusing to load state")

Q2: Are you loading a checkpoint saved with an earlier hivemind version or with master? If the new hivemind fails to load an old checkpoint, does it load a checkpoint that was saved in the new version?

Q3: Please print model and optimizer checksums right after the state is loaded from file (on the monitor) or from a peer (on the trainer), respectively:

print("Local epoch:", self.collaborative_optimizer.local_epoch)
print("Params checksum:", sum(p.sum().item() for p in self.model.parameters()))
print("Optimizer checksum:", sum(v.data.numpy().sum() for k,v in self.collaborative_optimizer.state_dict().items()))

Then print the same values in the training peer right after it loads state for the first time.

  • Does local_epoch match the epoch that was used when you last saved state?
  • Do these values match? If they do not, did they match in the earlier version?

Q4: Can you please check whether all keys match successfully? (Print the output of whatever.load_state_dict(...): it contains a report of which keys matched and which did not.)
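
For reference, a minimal sketch of what this check looks like (not from the issue; it reuses the file name from the snippet in the issue description). Note that only the model's load_state_dict returns such a report when strict=False; the optimizer's load_state_dict returns None.

# strict=False makes nn.Module.load_state_dict return an object listing unmatched keys
result = self.model.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu"), strict=False)
print("Missing keys:", result.missing_keys)        # keys the model expects but the file lacks
print("Unexpected keys:", result.unexpected_keys)  # keys in the file that the model does not expect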

@finger92
Contributor Author

Q0: 0.10.0 vs 1.0.0
Q1: I found a warning:
[WARN] [hivemind.optim.state_averager.load_state_from_peers:669] Failed to load state from peer, received inconsistent number of optimizer statistics
Could this be the reason the checkpoint fails to load?

@borzunov
Member

Could this be the reason the checkpoint fails to load?

I think yes. Are you sure that the model and the optimizer are defined in the same way in the monitor and the trainer?

If possible, can you please send the code and the CLI args you're running it with, so we can reproduce the issue locally?

@justheuristic
Member

justheuristic commented Jan 15, 2022

@finger92 JFYI: the warning is thrown by this line: state_averager.py:669.

This warning is triggered by a StopIteration, which means that you received fewer tensors in the loaded state than expected:

  • either the peer that sent you state has a different model and/or optimizer configuration (e.g. number of layers, Adam vs Lamb or different options)
  • or there was a connection error - in which case you will see that connection error above the warning (e.g. TimeoutError, BrokenPipeError)

It would be great if you could physically check that the states have the same shape. On both the aux and GPU peers, run:

metadata, tensors, infos = self.collaborative_optimizer.state_averager.get_current_state()
print("Number of tensors in state:", len(tensors))

If they match, please also check print(metadata["optimizer_metadata"]) and see whether it has the same type and number of elements.

If either of the two mismatches between the trainer and the aux peer, then the two peers created the model/optimizer differently and we should look for the problem in the client code (as in "not in hivemind core"). If they match, then the state somehow got broken in transit, and we'll help you investigate that.
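
As an illustration, here is one way to extend the snippet above to compare the two peers tensor by tensor (a sketch under the same assumptions about get_current_state(); run it on both peers and diff the output):

metadata, tensors, infos = self.collaborative_optimizer.state_averager.get_current_state()
for i, tensor in enumerate(tensors):
    print(f"tensor {i}: shape={tuple(tensor.shape)}, dtype={tensor.dtype}")  # compare these lines across peers
print("optimizer_metadata:", metadata["optimizer_metadata"])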

@finger92
Contributor Author

I solved this!
In example/albert, the trainer peer used a scheduler while the monitor peer did not, which results in some differences in the optimizer state_dict of the two peers (the scheduler adds an 'initial_lr' entry to the optimizer's param groups). After adding a non-functional scheduler to the monitor peer, it works fine.
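
For reference, a minimal standalone repro of the mismatch (a toy model, not the albert code): constructing an LR scheduler with default arguments adds an 'initial_lr' key to each param group, so the two optimizers' state_dicts end up with different contents.

import torch

model = torch.nn.Linear(4, 4)
opt_plain = torch.optim.Adam(model.parameters(), lr=1e-3)
opt_sched = torch.optim.Adam(model.parameters(), lr=1e-3)
torch.optim.lr_scheduler.LambdaLR(opt_sched, lr_lambda=lambda step: 1.0)  # any scheduler triggers this

print(sorted(opt_plain.state_dict()["param_groups"][0]))  # no 'initial_lr'
print(sorted(opt_sched.state_dict()["param_groups"][0]))  # includes 'initial_lr'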

By the way, I changed the "prefix" of the state_averager in the monitor peer's code so that the trainer could download state from the monitor:

self.state_averager = TrainingStateAverager(
    dht=dht,
    optimizer=opt,
    prefix=f"{experiment_prefix}_state_averager",
    state_compression=hivemind.Float16Compression(),
    bandwidth=optimizer_args.bandwidth,
    client_mode=optimizer_args.client_mode,
    start=True,
    **asdict(averager_args),
)
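
For context, a hedged sketch of the matching trainer side (illustrative values, not from the issue): as far as I can tell, hivemind.Optimizer names its internal state averager f"{run_id}_state_averager", which is why the monitor's prefix above has to end in _state_averager.

collaborative_optimizer = hivemind.Optimizer(
    dht=dht,
    run_id=experiment_prefix,   # same base prefix as the monitor's state averager
    optimizer=opt,
    scheduler=scheduler,        # keep the scheduler consistent across peers (see above)
    target_batch_size=4096,     # illustrative value
    batch_size_per_step=32,     # illustrative value
    verbose=True,
)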

@justheuristic
Member

Hi! Awesome work!
Feel free to ping us if you encounter any more oddities :)

We'll incorporate your fixes into the example in the coming days (within a week or two at most) and write back to you with an update.

justheuristic added a commit that referenced this issue Jan 24, 2022
This PR fixes several minor issues found in #446:

- fix `prefix=...` in training monitor
- create scheduler in training monitor
- rename experiment_prefix -> run_id
- enable checkpoints on aux peer by default
- decouple total steps from scheduler max steps

Co-authored-by: Yi Zhou <[email protected]>
Co-authored-by: Alexander Borzunov <[email protected]>
Co-authored-by: Max Ryabinin <[email protected]>