LOCAL_RANK not being set in slurm #6797
Comments
Adding the following to the …, but I would like to get some feedback from the developers before proposing this as a solution.
What's the value you see for …?
That appears to be correct, in … prints …
I think this is a fairly straightforward fix, and it makes sense to delegate this responsibility to the …
@awaelchli what do you think about setting the rank info on the cluster environment / training type plugin during the Trainer init? Currently we wait for …
You mean setting the …?
We can think of a way to set …
It's not that it needs to be set sooner, it's that it's not set correctly at all. As far as I can tell there's no machinery around getting the SLURM-assigned rank into that variable.
Also I'm a little concerned about that variable being called LOCAL_RANK, which seems to imply that rank 0 on every machine should be doing logging, checkpointing, and whatever else. It should only be the global rank zero that does those things.
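For context, a minimal sketch of the mismatch being described, assuming a standard SLURM setup; the variable handling here is purely illustrative and is not Lightning's actual code:

```python
import os

# SLURM exports these for every task it launches:
slurm_global_rank = int(os.environ.get("SLURM_PROCID", 0))   # rank across all nodes
slurm_local_rank = int(os.environ.get("SLURM_LOCALID", 0))   # rank within this node

# ...but the rank-zero helpers fall back to LOCAL_RANK, which SLURM never sets:
lightning_rank = int(os.environ.get("LOCAL_RANK", 0))

print(f"SLURM_PROCID={slurm_global_rank} "
      f"SLURM_LOCALID={slurm_local_rank} "
      f"LOCAL_RANK={lightning_rank}")
```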
Yes, it defaults to 0, but that's the only assumption you can make if you don't know where else to get the rank information from.
No, I think what you are experiencing is really just because it's not set sooner. I wish I had SLURM but I can't test it :(
In your post you mention wandb creating multiple runs. …
@awaelchli I'm not sure if just setting it earlier will work for all use cases, as the rank zero utilities are used outside of the trainer (e.g. in the loggers). A quickfix would be to cover the SLURM local rank when we set …
I'll check on that later today, it may be related. I updated recently (and I also reproduced this on the comet logger). I'll try pulling the latest and check without my patch.
I was checking it in the first call to the experiment property of the wandb logger and it was still 0 at that point, which is pretty late. As far as I could tell there's nothing that reads the env variable set by slurm and assigns it.
Exactly, which is why I think it makes sense to encapsulate all SLURM functionality in a SLURM class, and you can make another one for another cluster, I guess.
How do we feel about the local rank issue though? Am I interpreting it correctly that global 0, not local 0, should be doing the logging and stuff?
@awaelchli @Queuecumber I'll send out a quickfix to at least mitigate the issue. I agree that the cluster environment + subclasses should be the source of truth for this metadata.
Yes, the naming can be improved. In DDP we set this to be the global rank: https://github.com/PyTorchLightning/pytorch-lightning/blob/bb9ace43334ad50e3758d9cff08ad34216c7d4da/pytorch_lightning/plugins/training_type/ddp.py#L173-L174
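For readers following along, the behaviour being referenced can be sketched like this, assuming the rank_zero_only API of that era (a decorator that carries a module-level `.rank` attribute); this is a simplified illustration, not the exact linked lines:

```python
from pytorch_lightning.utilities import rank_zero_only

# The DDP plugin eventually overwrites the module-level rank with the
# process's global rank, so rank-zero-guarded functions key off global rank.
# A hard-coded value stands in for the real global rank here.
rank_zero_only.rank = 3

@rank_zero_only
def log_something():
    print("only runs when rank_zero_only.rank == 0")

log_something()  # prints nothing, since this (pretend) process is rank 3
```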
OK, it sounds like it's maybe a holdover from when the DDP spawn stuff was the only supported method of parallelism, in which case local 0 and global 0 are the same.
What did you have in mind for the quick fix? I'm happy to try to take care of it.
#6802 would be the quick fix - does this resolve the issues you're seeing?
Off the top of my head it looks right, but why not just let the environment classes set that field directly?
The environment classes could be initialized later than when the rank_zero_only module is used.
Might make sense to make sure that's one of the first things that happens |
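A sketch of the general direction such a quickfix could take: defaulting the rank-zero rank from whichever launcher-provided variable is present. This is an assumption about the approach, not the actual contents of #6802, and the helper name and candidate list are hypothetical:

```python
import os

def _guess_rank() -> int:
    """Hypothetical helper: best-effort rank from common launcher env vars."""
    for key in ("RANK", "SLURM_PROCID", "LOCAL_RANK"):
        if key in os.environ:
            return int(os.environ[key])
    return 0

# The rank-zero utilities could initialize their module-level default from
# this instead of reading LOCAL_RANK alone.
print("guessed rank:", _guess_rank())
```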
@awaelchli what do you think about adding a …? This can be optional so that the lightning/spawn environment doesn't need to handle it until the child processes are spawned. For torchelastic and SLURM, the global rank is available from the env vars. Then the cluster environment can be the source of truth, and we can go from cluster environment => training type plugin => accelerator => accelerator connector => `rank_zero_only.rank` setter in the trainer init itself.
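A rough sketch of that proposal on the SLURM side; the class and method names are hypothetical and do not reflect the existing ClusterEnvironment interface:

```python
import os

class SlurmEnvironmentSketch:
    """Hypothetical cluster-environment-style class that owns rank metadata,
    illustrating 'cluster environment as the source of truth'."""

    def global_rank(self) -> int:
        return int(os.environ.get("SLURM_PROCID", 0))

    def local_rank(self) -> int:
        return int(os.environ.get("SLURM_LOCALID", 0))

    def node_rank(self) -> int:
        return int(os.environ.get("SLURM_NODEID", 0))

# The Trainer init could then push env.global_rank() down to the
# rank-zero utilities before any logger is constructed.
```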
It's possible, but some plugins will need to change the global rank, and so the cluster env will no longer be the source of truth. Example: ddp2 sets … So this would require a setter too.
Shouldn't those then have corresponding environment classes that get initialized early? That would ensure nothing uses the wrong rank if these accelerators get initialized later |
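A setter-style variant, as discussed above, might look like the following; again the names are hypothetical, and it only illustrates how a plugin such as ddp2 could write a redefined rank back into the environment object:

```python
import os

class SlurmEnvironmentWithSetterSketch:
    """Hypothetical variant where the detected global rank can be overridden."""

    def __init__(self) -> None:
        self._global_rank = int(os.environ.get("SLURM_PROCID", 0))

    def global_rank(self) -> int:
        return self._global_rank

    def set_global_rank(self, rank: int) -> None:
        # A plugin that redefines rank semantics (e.g. ddp2) writes back
        # through this, keeping the environment object the source of truth.
        self._global_rank = rank
```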
🐛 Bug
A lot of the PTL tooling around multiprocessing depends on a specific environment variable, LOCAL_RANK, being set correctly. It seems that when running in SLURM this isn't set, causing it to return the default of 0 for all processes, which makes every process do things that should only be done on rank 0, like logging.
Also, I'm a little unclear about the name of that variable: if I have multiple nodes, only global rank 0, not local rank 0, should be logging and saving checkpoints, etc.
To Reproduce
Run in SLURM (can't really do it with Colab). A good way to easily see it is to use the Wandb logger: you'll see that each process creates a new run in the Wandb UI, which means @rank_zero_experiment didn't work properly. You can confirm this by printing LOCAL_RANK, which defaults to 0 if unset; it will always give back 0.
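A small script along these lines makes the behaviour easy to see under SLURM; the srun invocation is just an example:

```python
# repro.py -- launch with e.g.: srun --ntasks=4 python repro.py
import os
from pytorch_lightning.utilities import rank_zero_only

# Each SLURM task prints what Lightning thinks its rank is. With LOCAL_RANK
# unset, every task reports 0, so every task passes the rank-zero guards
# (and e.g. the wandb logger starts one run per task).
print("SLURM_PROCID =", os.environ.get("SLURM_PROCID"),
      "| LOCAL_RANK =", os.environ.get("LOCAL_RANK", "<unset>"),
      "| rank_zero_only.rank =", rank_zero_only.rank)
```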
Expected behavior
LOCAL_RANK is set correctly, or the rest of the tooling is aware of the global rank of the process.
Environment
Will update if it's really necessary