This runner could use the environment variables set in each process for self-identification and a shared filesystem for communication.
Each process can use SLURM_NODEID for its ID (a unique, monotonically increasing index assigned to each process) and SLURM_JOB_NUM_NODES to know the total number of processes.
The process with rank 0 assumes it is the scheduler and writes a scheduler file to the shared filesystem.
The process with rank 1 assumes it should run the client code; it waits for the scheduler file to exist and then continues running the contents of the context manager.
All processes with rank 2 and above assume they are workers; they wait for the scheduler file to exist and then start worker processes that connect to the scheduler.
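To make the proposed role split concrete, here is a minimal sketch of the rank-to-role mapping and of the wait on the scheduler file. The names `role_for_rank` and `wait_for_scheduler_file`, the timeout, and the polling interval are all hypothetical and chosen only for illustration; this is not an existing Dask API, and starting the actual scheduler and workers is left out.

```python
import os
import time
from pathlib import Path


def role_for_rank(rank: int) -> str:
    """Map a process rank to the role described in the proposal."""
    if rank < 0:
        raise ValueError(f"rank must be non-negative, got {rank}")
    if rank == 0:
        return "scheduler"  # writes the scheduler file
    if rank == 1:
        return "client"     # runs the body of the context manager
    return "worker"         # rank 2 and above connect to the scheduler


def wait_for_scheduler_file(path, timeout=60.0, poll_interval=0.1):
    """Block until `path` exists on the shared filesystem.

    The timeout and polling interval are assumptions for this sketch.
    """
    path = Path(path)
    deadline = time.monotonic() + timeout
    while not path.exists():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"scheduler file {path} did not appear")
        time.sleep(poll_interval)
    return path
```

In this sketch, ranks 1 and above would call `wait_for_scheduler_file` before doing anything else, so the order in which the processes start does not matter.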
I'd be interested in this functionality, could I help out? Is the project ready for contributions? I'm no Dask expert but I think I understand the relevant pieces here.
(As a minor point, I think the relevant Slurm variables to index processes are actually SLURM_PROCID and SLURM_NTASKS. Ref.)
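If the per-task variables suggested here are used, reading the rank and the total process count could look like the sketch below. The helper name `slurm_rank_and_size` and its optional `environ` argument are hypothetical, added only so the function can be exercised outside a Slurm job.

```python
import os


def slurm_rank_and_size(environ=None):
    """Return (rank, size) from SLURM_PROCID and SLURM_NTASKS.

    `environ` defaults to os.environ; passing a plain dict is a
    convenience for testing and is an assumption of this sketch.
    """
    env = os.environ if environ is None else environ
    rank = int(env["SLURM_PROCID"])  # index of this task within the job step
    size = int(env["SLURM_NTASKS"])  # total number of tasks
    if not 0 <= rank < size:
        raise ValueError(f"rank {rank} is outside the range [0, {size})")
    return rank, size
```

For example, `slurm_rank_and_size({"SLURM_PROCID": "3", "SLURM_NTASKS": "8"})` returns `(3, 8)`.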