
32 slurm #36

Merged 18 commits from 32-slurm into develop, Mar 7, 2023

Conversation

philswatton (Contributor)

This PR resolves #32 for the case of training scripts, and contributes to #19. It adds the following:

  • generate_train_scripts.py now uses templating to generate a slurm batch submission script
  • train_all will submit all batches for a given experiment group
  • model checkpointing added, to save the final result
  • epochs reduced to a more sensible 20
  • PyTorch Lightning seed_everything call added
  • scripts/README updated

What this PR doesn't do is #26, which is left for another PR.
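For illustration, templated slurm generation along these lines is what the description above refers to; the template fields, SBATCH options, and function names below are assumptions, not the actual contents of generate_train_scripts.py.

# Hypothetical sketch of templated slurm script generation. Field names,
# paths, and SBATCH options here are illustrative assumptions only.
from pathlib import Path
from string import Template

TRAIN_TEMPLATE = Template(
    """#!/bin/bash
#SBATCH --job-name=$job_name
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00

python train.py --experiment-group $experiment_group --config $config_path
"""
)

def write_train_script(job_name: str, experiment_group: str, config_path: str, out_dir: Path) -> Path:
    """Render one sbatch submission script for a single experiment."""
    script = TRAIN_TEMPLATE.substitute(
        job_name=job_name, experiment_group=experiment_group, config_path=config_path
    )
    out_path = out_dir / f"{job_name}.sh"
    out_path.write_text(script)
    return out_path

A train_all script would then presumably loop over the generated files for an experiment group and call sbatch on each.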

@philswatton marked this pull request as ready for review, March 6, 2023 14:56
@philswatton (Contributor Author)

In the end, it turned out that naming the model artifacts was broken because this was only added to the wandb logger in version 1.9.0 of PyTorch Lightning (and indeed it seems no other way existed until it was added; see Lightning-AI/pytorch-lightning#15990). I did try using the filename argument in ModelCheckpoint, but that didn't work.

Since our project dependency is 1.8.3.post1 and updating would be a big hassle (in particular, I recall things not working on Baskerville?), I took the route of adapting the 1.9.0 changes to the wandb logger for 1.8.3.post1 and adding them to our project source code. This does work, and I ran a couple of very quick 100-step models to check that it did.

Lmk if that's too hacky a solution!
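Roughly, the behaviour being backported amounts to logging each saved checkpoint as a named wandb artifact, something like the sketch below; the artifact name format and metadata keys are assumptions, not the actual adapted logger code.

# Illustrative sketch only: log a saved checkpoint as a *named* wandb
# artifact, which is roughly what the adapted logger does.
import wandb

def log_checkpoint_artifact(run, ckpt_path: str, model_name: str) -> None:
    artifact = wandb.Artifact(
        name=f"model-{model_name}",  # the named artifact this PR enables
        type="model",
        metadata={"original_filename": ckpt_path},
    )
    artifact.add_file(ckpt_path, name="model.ckpt")
    run.log_artifact(artifact, aliases=["latest"])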

@philswatton requested a review from lannelin, March 6, 2023 15:00
@philswatton (Contributor Author)

I've raised #37 for the versions issue

@lannelin (Contributor) left a comment

Looks good! Some minor changes requested. I haven't run a resulting script on Baskerville, though, and will leave that to you to check.

I also think some of the tests are breaking right now, so it's worth pulling the fixes in from develop.
(Edited to be agnostic to how those changes are pulled in! Merge vs rebase: https://www.atlassian.com/git/tutorials/merging-vs-rebasing)
(Edit 2: actually, to be opinionated again, it should probably be a merge?)

train_model(
    dm=dmpair.A,
    experiment_name=f"{experiment_pair_name}_A",
    trainer_config=trainer_config,
)
seed_everything(seed=dmpair_kwargs["seed"])
Contributor
should the seeding happen before DMPair creation for control of val split?

Contributor Author

As far as I can tell, no - passing None to a datamodule for the seed argument will cause an error when .setup() is called, even after running seed_everything. Looking at the source code, it seems the datamodule needs its own seed and doesn't draw on a global seed at all.
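In other words, the construction looks roughly like the sketch below: the explicit seed goes to the datamodule pair for the val split, while seed_everything handles the rest. DMPair's exact signature here is assumed rather than copied from the repo.

# Illustrative only: the datamodule pair gets its own explicit seed for the
# val split; seed_everything alone is not enough. DMPair's constructor
# arguments are assumed here.
from pytorch_lightning import seed_everything

seed_everything(seed=dmpair_kwargs["seed"])  # global seeding for training randomness
dmpair = DMPair(**dmpair_kwargs)             # dmpair_kwargs includes the explicit seed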

Contributor

ah ok, that makes sense. Good stuff. Were you able to verify that seed_everything gives desired consistency for training runs?

src/modsim2/model/wandb_logger.py (outdated, resolved)
@@ -1,5 +1,5 @@
 trainer_kwargs:
-  max_epochs: 100
+  max_epochs: 10
Contributor

is this enough for convergence?

Contributor Author

Yes - looking at the wandb plots, val_loss seems to plateau after about 4-6 epochs at most. Could lower it now, or leave it for #26?

Contributor

Cool. Happy to leave it for #26.

scripts/train_all.sh (outdated, resolved)
scripts/templates/train-template.sh (outdated, resolved)
scripts/templates/train-template.sh (outdated, resolved)
scripts/templates/train-template.sh (outdated, resolved)
            if hasattr(checkpoint_callback, k)
        },
    }
    if _WANDB_GREATER_EQUAL_0_10_22
Contributor

what are we doing if not? should this raise an error?

Contributor Author (@philswatton, Mar 7, 2023)

This is a copy-paste from the PyTorch Lightning source code. My understanding is that metadata becomes None if not (line 79). Either way, the metadata is passed as an argument to wandb.Artifact(), for which None is the default argument, so I don't think it should raise an error.
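For reference, the surrounding upstream logic is roughly as sketched below (paraphrased, not a verbatim copy; the metadata keys and artifact name are approximations).

# Paraphrase of the upstream PyTorch Lightning logic being discussed:
# checkpoint metadata is only built when the installed wandb version
# supports artifact metadata; otherwise it is None, which is also the
# default for wandb.Artifact's metadata argument.
import wandb

def build_checkpoint_artifact(checkpoint_callback, ckpt_path, score, wandb_supports_metadata: bool):
    metadata = (
        {
            "score": score,
            "original_filename": ckpt_path,
            checkpoint_callback.__class__.__name__: {
                k: getattr(checkpoint_callback, k)
                for k in ("monitor", "mode", "save_last", "save_top_k")
                if hasattr(checkpoint_callback, k)
            },
        }
        if wandb_supports_metadata  # stands in for _WANDB_GREATER_EQUAL_0_10_22
        else None
    )
    return wandb.Artifact(name="model-checkpoint", type="model", metadata=metadata)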

Contributor

ok happy to leave as is

scripts/templates/train-template.sh (outdated, resolved)
scripts/templates/train-template.sh (outdated, resolved)
@philswatton (Contributor Author)

I've merged in the fixes to the tests - agreed that it looked safer than a rebase!

@philswatton requested a review from lannelin, March 7, 2023 11:14
@lannelin (Contributor) left a comment

looks good! Approved with some very minor changes requested.
Could you also verify a training run (and consistency given by seeding) before merging?

scripts/README.md (resolved)
scripts/README.md (resolved)
@philswatton (Contributor Author)

I added an extra arg to Trainer to make training deterministic given the seed - all the rest is done too!
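Concretely, the combination being verified is along the lines of the minimal sketch below; the seed value is illustrative, and the model/datamodule are placeholders rather than the project's actual classes.

# Minimal sketch of seeding plus deterministic training; the model and
# datamodule are placeholders, not the project's actual classes.
from pytorch_lightning import Trainer, seed_everything

seed_everything(seed=42, workers=True)  # fix python/numpy/torch RNGs (seed value illustrative)
trainer = Trainer(
    max_epochs=20,
    deterministic=True,  # the extra Trainer arg: force deterministic ops
)
# trainer.fit(model, datamodule=dm)  # repeated runs with the same seed should now match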

@philswatton merged commit 2586107 into develop, Mar 7, 2023
@philswatton deleted the 32-slurm branch, March 7, 2023 14:16
Development

Successfully merging this pull request may close these issues:

  • Replace bash script generation with slurm (#32)