32 slurm #36
Conversation
In the end, it turned out that naming the model artifacts was broken because this was only added for the wandb logger in version 1.9.0 of pytorch lightning (and indeed it seems no other way existed until this was added, see: Lightning-AI/pytorch-lightning#15990). I did try using the
Since our project dependency is 1.8.3.post1 and updating would be a big hassle (and in particular I recall things not working on Baskerville?), I took the route of adapting the changes to the wandb logger from 1.9.0 to 1.8.3.post1 and adding them to our project source code. This does work, and I ran a couple of very quick 100-step models to check that it did. Lmk if that's too hacky a solution!
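Roughly, the shape of the version gate involved (a toy sketch with hypothetical helper names, not the actual project code, which adapts the 1.9.0 `WandbLogger` checkpoint-naming changes into our source tree):

```python
def parse_version(v: str) -> tuple[int, int]:
    """Crude major.minor parse, good enough for this gate ("1.8.3.post1" -> (1, 8))."""
    parts = v.split(".")
    return int(parts[0]), int(parts[1])

def supports_artifact_naming(pl_version: str) -> bool:
    """Checkpoint-artifact naming landed in the wandb logger in pytorch-lightning 1.9.0."""
    return parse_version(pl_version) >= (1, 9)

# On our pinned 1.8.3.post1 we fall back to the backported logger code
assert not supports_artifact_naming("1.8.3.post1")
assert supports_artifact_naming("1.9.0")
```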
I've raised #37 for the versions issue
Looks good! Some minor changes requested. I haven't run a resulting script on Baskerville though, and will leave that to you to check.
I also think some of the tests are breaking right now, so it's worth pulling those fixes in from develop
(edited to be agnostic to how those changes are pulled in! merge vs rebase: https://www.atlassian.com/git/tutorials/merging-vs-rebasing)
(edit 2: actually, to be opinionated again, should be a merge?)
train_model(
    dm=dmpair.A,
    experiment_name=f"{experiment_pair_name}_A",
    trainer_config=trainer_config,
)
seed_everything(seed=dmpair_kwargs["seed"])
should the seeding happen before DMPair creation for control of val split?
As far as I can tell, no - passing `None` to a datamodule for the `seed` argument will cause an error when `.setup()` is called, even after running `seed_everything`. Looking at the source code, it seems the datamodule needs its own seed, and doesn't draw on a global seed at all.
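To illustrate the point (a toy stand-in, not our actual datamodule code): if the val split is drawn from a generator seeded with the module's own `seed` argument, then global seeding a la `seed_everything` has no effect on it, and an explicit seed is required:

```python
import random

class ToyDataModule:
    """Hypothetical datamodule, for illustration only."""

    def __init__(self, data, seed):
        if seed is None:
            # mirrors the error we saw: the module needs its own seed
            raise ValueError("datamodule needs an explicit seed for the val split")
        self.data = data
        self.seed = seed

    def setup(self):
        # local generator: deliberately ignores any global seeding
        rng = random.Random(self.seed)
        shuffled = rng.sample(self.data, len(self.data))
        cut = int(0.8 * len(shuffled))
        self.train, self.val = shuffled[:cut], shuffled[cut:]

random.seed(0)  # global seed, analogous to seed_everything
a = ToyDataModule(list(range(10)), seed=42)
a.setup()
random.seed(999)  # a different global seed...
b = ToyDataModule(list(range(10)), seed=42)
b.setup()
assert a.val == b.val  # ...same split, driven only by the module's own seed
```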
ah ok, that makes sense. Good stuff. Were you able to verify that `seed_everything` gives desired consistency for training runs?
@@ -1,5 +1,5 @@
 trainer_kwargs:
-  max_epochs: 100
+  max_epochs: 10
is this enough for convergence?
Yes - looking at the wandb plots val_loss seems to plateau after about 4-6 epochs at most. Could lower this now or leave for #26?
Cool. Happy to leave for 26
if hasattr(checkpoint_callback, k)
},
}
if _WANDB_GREATER_EQUAL_0_10_22
what are we doing if not? should this raise an error?
This is a c/p from the pl source code. My understanding is `metadata` becomes `None` if not (l. 79). Either way the metadata is passed as an argument to `wandb.Artifact()`. For this function `None` is the default argument, so I don't think it should be raising an error
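For context, the pattern in question is roughly this (a paraphrased sketch, not the verbatim pl source; `wandb_at_least_0_10_22` stands in for the real `_WANDB_GREATER_EQUAL_0_10_22` flag):

```python
def build_artifact_metadata(score, best_path, wandb_at_least_0_10_22):
    """Sketch of the version-gated metadata construction.

    Older wandb versions can't attach metadata to an Artifact, so the whole
    dict collapses to None - which is also wandb.Artifact's default, hence
    no error on either branch.
    """
    metadata = (
        {"score": score, "best_model_path": best_path}
        if wandb_at_least_0_10_22
        else None
    )
    return metadata

assert build_artifact_metadata(0.9, "ckpt.pt", False) is None
```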
ok happy to leave as is
I've merged in the fixes to the test - agreed that it looked safer than a rebase!
looks good! Approved with some very minor changes requested.
Could you also verify a training run (and consistency given by seeding) before merging?
I added an extra arg to
This PR resolves #32 for the case of training scripts, and contributes to #19. It adds the following:

- `generate_train_scripts.py` now uses templating to generate a slurm batch submission
- `train_all` will submit all batches for a given experiment group
- `seed_everything` call added
- `scripts/README` updated

What this PR doesn't do is #26, which is left for another PR.
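A minimal sketch of the templating approach for the batch scripts (using stdlib `string.Template`; the template text and field names here are illustrative, not the actual ones in `generate_train_scripts.py`):

```python
from string import Template

# Illustrative slurm template; the real one lives alongside generate_train_scripts.py
SLURM_TEMPLATE = Template("""\
#!/bin/bash
#SBATCH --job-name=$job_name
#SBATCH --time=$walltime
python train.py --experiment $experiment
""")

def render_batch_script(job_name: str, walltime: str, experiment: str) -> str:
    """Fill the template; Template.substitute raises KeyError on missing fields."""
    return SLURM_TEMPLATE.substitute(
        job_name=job_name, walltime=walltime, experiment=experiment
    )

script = render_batch_script("exp1_A", "01:00:00", "exp1_A")
assert "#SBATCH --job-name=exp1_A" in script
```

A `train_all`-style driver would then loop over the experiment group, write one rendered script per experiment, and `sbatch` each file.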