
32 slurm #36

Merged 18 commits from 32-slurm into develop, Mar 7, 2023

Conversation

philswatton (Contributor)

This PR resolves #32 for the case of training scripts, and contributes to #19. It adds the following:

  • generate_train_scripts.py now uses templating to generate a slurm batch submission script
  • train_all will submit all batches for a given experiment group
  • model checkpointing added, to save the final result
  • epochs reduced to a more sensible 20
  • PyTorch Lightning seed_everything call added
  • scripts/README updated

What this PR doesn't do is #26, which is left for another PR.
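For illustration, templated slurm generation along these lines is what the description above refers to; the template fields, SBATCH options, and function names below are assumptions, not the actual contents of generate_train_scripts.py.

# Hypothetical sketch of templated slurm script generation. Field names,
# paths, and SBATCH options here are illustrative assumptions only.
from pathlib import Path
from string import Template

TRAIN_TEMPLATE = Template(
    """#!/bin/bash
#SBATCH --job-name=$job_name
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00

python train.py --experiment-group $experiment_group --config $config_path
"""
)

def write_train_script(job_name: str, experiment_group: str, config_path: str, out_dir: Path) -> Path:
    """Render one sbatch submission script for a single experiment."""
    script = TRAIN_TEMPLATE.substitute(
        job_name=job_name, experiment_group=experiment_group, config_path=config_path
    )
    out_path = out_dir / f"{job_name}.sh"
    out_path.write_text(script)
    return out_path

A train_all script would then presumably loop over the generated files for an experiment group and call sbatch on each.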

@philswatton marked this pull request as ready for review, March 6, 2023 14:56
@philswatton (Contributor Author)

In the end, it turned out that naming the model artifacts was broken because this was only added to the wandb logger in version 1.9.0 of PyTorch Lightning (and indeed it seems no other way existed until it was added; see Lightning-AI/pytorch-lightning#15990). I did try using the filename argument in ModelCheckpoint, but that didn't work.

Since our project dependency is 1.8.3.post1 and updating would be a big hassle (in particular, I recall things not working on Baskerville?), I took the route of adapting the 1.9.0 changes to the wandb logger for 1.8.3.post1 and adding them to our project source code. This does work, and I ran a couple of very quick 100-step models to check that it did.

Lmk if that's too hacky a solution!
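Roughly, the behaviour being backported amounts to logging each saved checkpoint as a named wandb artifact, something like the sketch below; the artifact name format and metadata keys are assumptions, not the actual adapted logger code.

# Illustrative sketch only: log a saved checkpoint as a *named* wandb
# artifact, which is roughly what the adapted logger does.
import wandb

def log_checkpoint_artifact(run, ckpt_path: str, model_name: str) -> None:
    artifact = wandb.Artifact(
        name=f"model-{model_name}",  # the named artifact this PR enables
        type="model",
        metadata={"original_filename": ckpt_path},
    )
    artifact.add_file(ckpt_path, name="model.ckpt")
    run.log_artifact(artifact, aliases=["latest"])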

@philswatton requested a review from lannelin, March 6, 2023 15:00
@philswatton (Contributor Author)

I've raised #37 for the versions issue

@lannelin (Contributor) left a comment

Looks good! Some minor changes requested. I haven't run a resulting script on Baskerville, though, and will leave that to you to check.

I also think some of the tests are breaking right now, so it's worth pulling the fixes in from develop.
(Edited to be agnostic to how those changes are pulled in! Merge vs rebase: https://www.atlassian.com/git/tutorials/merging-vs-rebasing)
(Edit 2: actually, to be opinionated again, it should probably be a merge?)

train_model(
    dm=dmpair.A,
    experiment_name=f"{experiment_pair_name}_A",
    trainer_config=trainer_config,
)
seed_everything(seed=dmpair_kwargs["seed"])
Contributor
should the seeding happen before DMPair creation for control of val split?

Contributor Author

As far as I can tell, no - passing None to a datamodule for the seed argument will cause an error when .setup() is called, even after running seed_everything. Looking at the source code, it seems the datamodule needs its own seed and doesn't draw on a global seed at all.
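In other words, the construction looks roughly like the sketch below: the explicit seed goes to the datamodule pair for the val split, while seed_everything handles the rest. DMPair's exact signature here is assumed rather than copied from the repo.

# Illustrative only: the datamodule pair gets its own explicit seed for the
# val split; seed_everything alone is not enough. DMPair's constructor
# arguments are assumed here.
from pytorch_lightning import seed_everything

seed_everything(seed=dmpair_kwargs["seed"])  # global seeding for training randomness
dmpair = DMPair(**dmpair_kwargs)             # dmpair_kwargs includes the explicit seed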

Contributor

ah ok, that makes sense. Good stuff. Were you able to verify that seed_everything gives desired consistency for training runs?

src/modsim2/model/wandb_logger.py (outdated, resolved)
@@ -1,5 +1,5 @@
 trainer_kwargs:
-  max_epochs: 100
+  max_epochs: 10
Contributor

is this enough for convergence?

Contributor Author

Yes - looking at the wandb plots, val_loss seems to plateau after about 4-6 epochs at most. Could lower it now, or leave it for #26?

Contributor

Cool. Happy to leave it for #26.

scripts/train_all.sh (outdated, resolved)
scripts/templates/train-template.sh (outdated, resolved)
scripts/templates/train-template.sh (outdated, resolved)
scripts/templates/train-template.sh (outdated, resolved)
            if hasattr(checkpoint_callback, k)
        },
    }
    if _WANDB_GREATER_EQUAL_0_10_22
Contributor

what are we doing if not? should this raise an error?

Contributor Author (@philswatton, Mar 7, 2023)

This is a copy-paste from the PyTorch Lightning source code. My understanding is that metadata becomes None if not (line 79). Either way, the metadata is passed as an argument to wandb.Artifact(), for which None is the default argument, so I don't think it should raise an error.
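For reference, the surrounding upstream logic is roughly as sketched below (paraphrased, not a verbatim copy; the metadata keys and artifact name are approximations).

# Paraphrase of the upstream PyTorch Lightning logic being discussed:
# checkpoint metadata is only built when the installed wandb version
# supports artifact metadata; otherwise it is None, which is also the
# default for wandb.Artifact's metadata argument.
import wandb

def build_checkpoint_artifact(checkpoint_callback, ckpt_path, score, wandb_supports_metadata: bool):
    metadata = (
        {
            "score": score,
            "original_filename": ckpt_path,
            checkpoint_callback.__class__.__name__: {
                k: getattr(checkpoint_callback, k)
                for k in ("monitor", "mode", "save_last", "save_top_k")
                if hasattr(checkpoint_callback, k)
            },
        }
        if wandb_supports_metadata  # stands in for _WANDB_GREATER_EQUAL_0_10_22
        else None
    )
    return wandb.Artifact(name="model-checkpoint", type="model", metadata=metadata)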

Contributor

ok happy to leave as is

scripts/templates/train-template.sh (outdated, resolved)
scripts/templates/train-template.sh (outdated, resolved)
@philswatton (Contributor Author)

I've merged in the fixes to the tests - agreed that it looked safer than a rebase!

@philswatton requested a review from lannelin, March 7, 2023 11:14
@lannelin (Contributor) left a comment

looks good! Approved with some very minor changes requested.
Could you also verify a training run (and consistency given by seeding) before merging?

scripts/README.md (resolved)
scripts/README.md (resolved)
@philswatton (Contributor Author)

I added an extra arg to Trainer to make training deterministic given the seed - all the rest is done too!
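Concretely, the combination being verified is along the lines of the minimal sketch below; the seed value is illustrative, and the model/datamodule are placeholders rather than the project's actual classes.

# Minimal sketch of seeding plus deterministic training; the model and
# datamodule are placeholders, not the project's actual classes.
from pytorch_lightning import Trainer, seed_everything

seed_everything(seed=42, workers=True)  # fix python/numpy/torch RNGs (seed value illustrative)
trainer = Trainer(
    max_epochs=20,
    deterministic=True,  # the extra Trainer arg: force deterministic ops
)
# trainer.fit(model, datamodule=dm)  # repeated runs with the same seed should now match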

@philswatton merged commit 2586107 into develop, Mar 7, 2023
@philswatton deleted the 32-slurm branch, March 7, 2023 14:16
Development

Successfully merging this pull request may close these issues:

  • Replace bash script generation with slurm (#32)