Question about loading checkpoint #446
Hi! Thanks for the report. I'd appreciate it if you could run a few tests to isolate the problem. Note that there are quite a few of them; if you find any discrepancies, you don't need to run the remaining tests. Also, feel free to reach out if you need any assistance with running these checks.

Q0: If possible, please specify the hivemind versions you used: both the old one and the new one.

Q1: In the new version, when you launch one monitor and one training peer, does the training peer still print the warning?

Q2: Are you loading a checkpoint saved in an earlier hivemind version or in master? If the new hivemind fails to load the old checkpoint, does it load a checkpoint that was saved in the new version?

Q3: Please print the model and optimizer checksums right after the parameters are loaded, i.e. from file on the monitor and from a peer on the trainer, respectively:

    print("Local epoch:", self.collaborative_optimizer.local_epoch)
    print("Params checksum:", sum(p.sum().item() for p in self.model.parameters()))
    print("Optimizer checksum:", sum(v.data.numpy().sum() for k, v in self.collaborative_optimizer.state_dict().items()))

Then print the same values on a training peer right after it loads state for the first time.

Q4: Can you please check whether all keys match successfully? (Print the output of whatever.load_state_dict(...); it contains a report of which keys matched and which did not.)
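Expanding on Q3: the one-line optimizer checksum above assumes that every value in state_dict() is a tensor, which does not hold for a plain PyTorch optimizer (its state_dict also contains param_groups and integer step counters). A slightly more defensive variant, assuming the collaborative optimizer wraps a standard torch.optim optimizer, could look like this sketch:

    # Illustrative helper, not from the original thread: checksum every tensor in a
    # standard torch.optim optimizer's state_dict, skipping non-tensor entries such
    # as hyperparameters and integer step counters.
    import torch

    def optimizer_checksum(opt: torch.optim.Optimizer) -> float:
        total = 0.0
        for param_state in opt.state_dict()["state"].values():
            for value in param_state.values():
                if torch.is_tensor(value):
                    total += value.double().sum().item()
        return total

The same helper printed on both peers right after loading should produce identical numbers if the states really match.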
Q0: 0.10.0 (old) vs. 1.0.0 (new).
I think yes. Are you sure that the model and the optimizer are defined in the same way in the monitor and the trainer? If possible, can you please send the code and the CLI args you're running it with, so we can reproduce the issue locally?
@finger92 JFYI: the warning is thrown by this line: state_averager.py:669. The warning is triggered by a StopIteration, which means that you received fewer tensors in the loaded state than you expected.

It would be great if you could physically check that the states have the same shape. On both the aux and GPU peers, run:

    metadata, tensors, infos = self.collaborative_optimizer.state_averager.get_current_state()
    print("Number of tensors in state:", len(tensors))

If they match, please also check print(metadata["optimizer_metadata"]) and see whether it has the same type/number of elements. If either of the two mismatches between the trainer and the aux peer, then the two peers created the model/optimizer differently and we should look for the problem in the client code (as in "not in hivemind core"). If they match, then the state somehow got broken in transit, and we'll help you investigate that.
I solved this! By the way, I changed the `prefix` of the state_averager in the monitor peer's code so that the trainer could download the state from the monitor.
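For anyone hitting the same problem, the sketch below illustrates the kind of alignment being described; the exact prefix string that hivemind derives from the run id is an assumption here, so verify it against the hivemind version you are using:

    # Hedged sketch, not the exact patch from this thread.
    # The trainer announces its state in the DHT under a prefix derived from its run id;
    # the monitor's state averager must use the same prefix, otherwise the trainer
    # will not discover the monitor's state when it tries to download it.
    RUN_ID = "my_experiment"                       # must be identical on every peer
    STATE_PREFIX = f"{RUN_ID}_state_averager"      # assumed naming convention
    # Pass STATE_PREFIX as prefix=... when constructing the monitor's state averager.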
Hi! Awesome work! We'll incorporate your fixes into the example in the coming days (within a week or two at most) and write back to you with an update.
This PR fixes several minor issues found in #446:
- fix `prefix=...` in training monitor
- create scheduler in training monitor
- rename experiment_prefix -> run_id
- enable checkpoints on aux peer by default
- decouple total steps from scheduler max steps

Co-authored-by: Yi Zhou <[email protected]>
Co-authored-by: Alexander Borzunov <[email protected]>
Co-authored-by: Max Ryabinin <[email protected]>
Before hivemind 1.0.0, I was able to resume training by calling `load_state_dict` only in the monitor peer.
The code looks like this:
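(The original snippet is not preserved in this thread; the sketch below is only an illustrative reconstruction of that kind of monitor-side restore, with hypothetical file name, checkpoint keys, and variable names.)

    # Illustrative reconstruction only; "checkpoint.pt", the checkpoint keys, and the
    # model/optimizer variables are placeholders, not the code from this issue.
    import torch

    checkpoint = torch.load("checkpoint.pt", map_location="cpu")
    model.load_state_dict(checkpoint["model"])          # restore model weights on the monitor
    optimizer.load_state_dict(checkpoint["optimizer"])  # restore optimizer state on the monitor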
The peers would load the monitor's state after startup.
However, in version 1.0.0 (or the master code), calling `load_state_dict` in the monitor no longer seems to work.
My question is: am I using the wrong method, or should I load the checkpoint on the worker peer instead?