
Question about loading checkpoint #446

Open
finger92 opened this issue Jan 14, 2022 · 6 comments

@finger92
Contributor

Before hivemind 1.0.0, I was able to resume training by calling 'load_state_dict' only in the monitor peer.
The code looks like this:

# monitor peer
if load_from_pretrained:
  self.model.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu"), strict=False)
  ...
  self.collaborative_optimizer.load_state_dict(torch.load("optimizer.pt", map_location="cpu"))

The peers would then load the monitor's state after starting up.

However, in version 1.0.0 and on current master, 'load_state_dict' in the monitor does not seem to work.
My question is: am I using the wrong method, or should I load the checkpoint on the worker peer instead?

@justheuristic
Member

justheuristic commented Jan 14, 2022

Hi! Thanks for the report. I'd appreciate it if you could run a few tests to isolate the problem.

Note: there are quite a few of them. If you find any discrepancies, you don't need to run the rest of the tests. Also, feel free to reach out if you need any assistance with running these checks.

Q0: If possible, please specify the hivemind versions you used: both the old one and the new one.

Q1: In the new version, when you launch one monitor and one training peer, does the training peer print Downloading parameter from <long PeerID string>? (and not "Failed to load state" / "Cowardly refusing to load state")

Q2: Are you loading a checkpoint saved with an earlier hivemind version or with master? If the new hivemind fails to load an old checkpoint, does it load a checkpoint that was saved in the new version?

Q3: Please print model and optimizer checksums right after the state is loaded from file (on the monitor) or from a peer (on the trainer), respectively:

print("Local epoch:", self.collaborative_optimizer.local_epoch)
print("Params checksum:", sum(p.sum().item() for p in self.model.parameters()))
print("Optimizer checksum:", sum(v.data.numpy().sum() for k,v in self.collaborative_optimizer.state_dict().items()))

Then print the same values in the training peer right after it loads state for the first time.

  • Does local_epoch match the epoch that was used when you last saved state?
  • Do these values match? If they do not, did they match in the earlier version?

Q4: Can you please check whether all keys match successfully? (Print the output of whatever.load_state_dict(...): it contains a report of which keys matched and which did not.)
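
For reference, a minimal sketch of what this check looks like (not from the issue; it reuses the file name from the snippet in the issue description). Note that only the model's load_state_dict returns such a report when strict=False; the optimizer's load_state_dict returns None.

# strict=False makes nn.Module.load_state_dict return an object listing unmatched keys
result = self.model.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu"), strict=False)
print("Missing keys:", result.missing_keys)        # keys the model expects but the file lacks
print("Unexpected keys:", result.unexpected_keys)  # keys in the file that the model does not expect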

@finger92
Contributor Author

Q0: 0.10.0 vs 1.0.0
Q1: I found a warning:
[WARN] [hivemind.optim.state_averager.load_state_from_peers:669] Failed to load state from peer, received inconsistent number of optimizer statistics
Could this be the reason the checkpoint fails to load?

@borzunov
Member

Could this be the reason the checkpoint fails to load?

I think yes. Are you sure that the model and the optimizer are defined in the same way in the monitor and the trainer?

If possible, can you please send the code and the CLI args you're running it with, so we can reproduce the issue locally?

@justheuristic
Member

justheuristic commented Jan 15, 2022

@finger92 JFYI: the warning is thrown by this line: state_averager.py:669.

This warning is triggered by a StopIteration, which means that you received fewer tensors in the loaded state than expected:

  • either the peer that sent you state has a different model and/or optimizer configuration (e.g. number of layers, Adam vs Lamb or different options)
  • or there was a connection error - in which case you will see that connection error above the warning (e.g. TimeoutError, BrokenPipeError)

It would be great if you could physically check that the states have the same shape. On both the aux and GPU peers, run:

metadata, tensors, infos = self.collaborative_optimizer.state_averager.get_current_state()
print("Number of tensors in state:", len(tensors))

If they match, please also check print(metadata["optimizer_metadata"]) and see whether it has the same type and number of elements.

If either of the two mismatches between the trainer and the aux peer, then the two peers created the model/optimizer differently and we should look for the problem in the client code (as in "not in hivemind core"). If they match, then the state somehow got broken in transit, and we'll help you investigate that.
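
As an illustration, here is one way to extend the snippet above to compare the two peers tensor by tensor (a sketch under the same assumptions about get_current_state(); run it on both peers and diff the output):

metadata, tensors, infos = self.collaborative_optimizer.state_averager.get_current_state()
for i, tensor in enumerate(tensors):
    print(f"tensor {i}: shape={tuple(tensor.shape)}, dtype={tensor.dtype}")  # compare these lines across peers
print("optimizer_metadata:", metadata["optimizer_metadata"])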

@finger92
Contributor Author

I solved this!
In example/albert, the trainer peer used a scheduler while the monitor peer did not, which results in some differences in the optimizer state_dict of the two peers (the scheduler adds an 'initial_lr' entry to the optimizer's param groups). After adding a non-functional scheduler to the monitor peer, it works fine.
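
For reference, a minimal standalone repro of the mismatch (a toy model, not the albert code): constructing an LR scheduler with default arguments adds an 'initial_lr' key to each param group, so the two optimizers' state_dicts end up with different contents.

import torch

model = torch.nn.Linear(4, 4)
opt_plain = torch.optim.Adam(model.parameters(), lr=1e-3)
opt_sched = torch.optim.Adam(model.parameters(), lr=1e-3)
torch.optim.lr_scheduler.LambdaLR(opt_sched, lr_lambda=lambda step: 1.0)  # any scheduler triggers this

print(sorted(opt_plain.state_dict()["param_groups"][0]))  # no 'initial_lr'
print(sorted(opt_sched.state_dict()["param_groups"][0]))  # includes 'initial_lr'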

By the way, I changed the "prefix" of the state_averager in the monitor peer's code so that the trainer could download state from the monitor:

self.state_averager = TrainingStateAverager(
    dht=dht,
    optimizer=opt,
    prefix=f"{experiment_prefix}_state_averager",
    state_compression=hivemind.Float16Compression(),
    bandwidth=optimizer_args.bandwidth,
    client_mode=optimizer_args.client_mode,
    start=True,
    **asdict(averager_args),
)
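
For context, a hedged sketch of the matching trainer side (illustrative values, not from the issue): as far as I can tell, hivemind.Optimizer names its internal state averager f"{run_id}_state_averager", which is why the monitor's prefix above has to end in _state_averager.

collaborative_optimizer = hivemind.Optimizer(
    dht=dht,
    run_id=experiment_prefix,   # same base prefix as the monitor's state averager
    optimizer=opt,
    scheduler=scheduler,        # keep the scheduler consistent across peers (see above)
    target_batch_size=4096,     # illustrative value
    batch_size_per_step=32,     # illustrative value
    verbose=True,
)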

@justheuristic
Member

Hi! Awesome work!
Feel free to ping us if you encounter any more oddities :)

We'll incorporate your fixes into the example in the coming days (within a week or two at most) and write back to you with an update.

justheuristic added a commit that referenced this issue Jan 24, 2022
This PR fixes several minor issues found in #446:

- fix `prefix=...` in training monitor
- create scheduler in training monitor
- rename experiment_prefix -> run_id
- enable checkpoints on aux peer by default
- decouple total steps from scheduler max steps

Co-authored-by: Yi Zhou <[email protected]>
Co-authored-by: Alexander Borzunov <[email protected]>
Co-authored-by: Max Ryabinin <[email protected]>