correct way to launch on slurm clusters? #16004
I have been using Lightning on a single node and it works very well. Thanks for the great project. I am not sure what the best practice is for launching jobs on SLURM clusters when Lightning is used in the codebase. I am also using Hydra and submitit. The Hydra submitit launcher plugin does not seem to work well, because the submitit launcher and Lightning both try to spawn subprocesses for the DDP workers. What do people find to be the best practice for multi-node training? Any comments or suggestions are greatly appreciated.
Replies: 1 comment 2 replies
Never mind. I found the doc that describes this. I guess using submitit is just not supported at the moment.
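For anyone landing here later, this is a rough sketch of the setup as I understand it from the Lightning SLURM docs: submit a plain sbatch script that starts one task per GPU with `srun`, and pass matching `num_nodes` and `devices` values to the `Trainer`. The helper names `build_sbatch_script` and `trainer_kwargs`, the script name `train.py`, and the time limit are my own placeholders, not part of Lightning or SLURM. Treat it as a starting point under those assumptions, not the official recipe.

```python
def build_sbatch_script(num_nodes: int, gpus_per_node: int,
                        train_script: str = "train.py",
                        time_limit: str = "0-02:00:00") -> str:
    """Return the text of an sbatch script (hypothetical helper).

    One SLURM task is requested per GPU, so --ntasks-per-node equals
    the number of GPUs per node.
    """
    if num_nodes < 1 or gpus_per_node < 1:
        raise ValueError("num_nodes and gpus_per_node must be >= 1")
    lines = [
        "#!/bin/bash -l",
        f"#SBATCH --nodes={num_nodes}",
        f"#SBATCH --gres=gpu:{gpus_per_node}",
        f"#SBATCH --ntasks-per-node={gpus_per_node}",
        f"#SBATCH --time={time_limit}",
        "",
        # srun starts every task; the training script itself does not
        # need to spawn the DDP workers.
        f"srun python {train_script}",
    ]
    return "\n".join(lines) + "\n"


def trainer_kwargs(num_nodes: int, gpus_per_node: int) -> dict:
    """Keyword arguments for the Lightning Trainer that mirror the
    sbatch settings above (hypothetical helper)."""
    return {
        "accelerator": "gpu",
        "devices": gpus_per_node,   # must match --ntasks-per-node
        "num_nodes": num_nodes,     # must match --nodes
        "strategy": "ddp",
    }


if __name__ == "__main__":
    print(build_sbatch_script(num_nodes=2, gpus_per_node=4))
    print(trainer_kwargs(num_nodes=2, gpus_per_node=4))
```

Inside `train.py` you would then build the trainer with something like `Trainer(**trainer_kwargs(2, 4))`. Deriving both the script and the trainer arguments from the same two numbers keeps them from drifting apart, which is the mismatch I would check first if a job hangs at startup.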