Slurm: "start request repeated too quickly" #212

Open
vlj91 opened this issue Sep 29, 2016 · 1 comment

vlj91 commented Sep 29, 2016

Found the following error on a single compute node when launching 32 compute nodes at once:

Sep 29 20:40:20 flight-149 systemd[1]: clusterware-slurm-slurmd.service: control process exited, code=exited status=1
Sep 29 20:40:20 flight-149 systemd[1]: Failed to start Alces Clusterware Slurm compute node daemon.
Sep 29 20:40:20 flight-149 systemd[1]: Unit clusterware-slurm-slurmd.service entered failed state.
Sep 29 20:40:20 flight-149 systemd[1]: clusterware-slurm-slurmd.service failed.
Sep 29 20:40:21 flight-149 systemd[1]: clusterware-slurm-slurmd.service holdoff time over, scheduling restart.
Sep 29 20:40:21 flight-149 systemd[1]: start request repeated too quickly for clusterware-slurm-slurmd.service
Sep 29 20:40:21 flight-149 systemd[1]: Failed to start Alces Clusterware Slurm compute node daemon.
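
For triage across many nodes, a minimal Python sketch (not part of Clusterware) that scans log lines like the ones above and reports which units hit systemd's start rate limit. The function name `rate_limited_units` is hypothetical; the pattern is taken verbatim from the message in the log.

```python
import re

# Matches the systemd message seen in the log above and captures the unit name.
RATE_LIMIT_RE = re.compile(r"start request repeated too quickly for (\S+)")


def rate_limited_units(log_lines):
    """Return unit names that hit the start rate limit, in first-seen order."""
    seen = []
    for line in log_lines:
        match = RATE_LIMIT_RE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen
```

It takes any iterable of strings, so it can be fed a saved log file or captured journalctl output.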

Restarting the service manually fixes it.
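
A rough sketch of that manual workaround as a script, assuming it runs as root on the affected node. The `reset-failed` step is my assumption and may not be needed; it clears the failed state and the start rate counter before the restart. The helper names `recovery_commands` and `recover` are hypothetical.

```python
import subprocess

# Unit name taken from the log above.
UNIT = "clusterware-slurm-slurmd.service"


def recovery_commands(unit=UNIT):
    """Build the systemctl calls: clear the failed state, then restart."""
    return [
        ["systemctl", "reset-failed", unit],
        ["systemctl", "restart", unit],
    ]


def recover(unit=UNIT, run=subprocess.check_call):
    """Run each command in order; `run` is injectable to allow a dry run."""
    for cmd in recovery_commands(unit):
        run(cmd)
```

Passing `run=print` shows the commands without executing them. This only automates the workaround; it does not address why slurmd fails on first start.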

Steps to reproduce:

  • Start a cluster using the 2016.3rc6 template (professional edition)
  • Select the Slurm scheduler type
  • Launch 32 nodes
  • One or more nodes may appear in sinfo -N in an unknown state
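
To check the last step without reading 32 rows by eye, a small Python sketch that picks out nodes in an unknown state from `sinfo -N` output. It assumes the default column layout (NODELIST, NODES, PARTITION, STATE) with the state in the last column, and treats any state starting with "unk" as unknown; adjust if a custom format is in use. The function name `unknown_nodes` is hypothetical.

```python
def unknown_nodes(sinfo_output):
    """Return node names whose STATE column starts with 'unk'."""
    nodes = []
    for line in sinfo_output.splitlines():
        fields = line.split()
        # Skip blank or short lines and the header row.
        if len(fields) < 4 or fields[0] == "NODELIST":
            continue
        name, state = fields[0], fields[-1]
        if state.lower().startswith("unk") and name not in nodes:
            nodes.append(name)
    return nodes
```

One way to feed it is `unknown_nodes(subprocess.check_output(["sinfo", "-N"], text=True))` on the login node.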

vlj91 added the bug label Sep 29, 2016

vlj91 commented Sep 29, 2016

Output of journalctl -u clusterware-slurm-*:

Starting Alces Clusterware MUNGE daemon...
Starting Alces Clusterware Slurm compute node daemon...
Started Alces Clusterware MUNGE daemon.
clusterware-slurm-slurmd.service: control process exited, code=exited status=1
Failed to start Alces Clusterware Slurm compute node daemon.
Unit clusterware-slurm-slurmd.service entered failed state.
clusterware-slurm-slurmd.service failed.
clusterware-slurm-slurmd.service holdoff time over, scheduling restart.
Starting Alces Clusterware Slurm compute node daemon...
PID file /var/run/slurm/slurmd.pid not readable (yet?) after start.
Started Alces Clusterware Slurm compute node daemon.
fatal: Unable to process configuration file
clusterware-slurm-slurmd.service: main process exited, code=exited, status=1/FAILURE
Unit clusterware-slurm-slurmd.service entered failed state.
clusterware-slurm-slurmd.service failed.
clusterware-slurm-slurmd.service holdoff time over, scheduling restart.
Starting Alces Clusterware Slurm compute node daemon...
clusterware-slurm-slurmd.service: control process exited, code=exited status=1
Failed to start Alces Clusterware Slurm compute node daemon.
Unit clusterware-slurm-slurmd.service entered failed state.
clusterware-slurm-slurmd.service failed.
clusterware-slurm-slurmd.service holdoff time over, scheduling restart.
Starting Alces Clusterware Slurm compute node daemon...
clusterware-slurm-slurmd.service: control process exited, code=exited status=1
Failed to start Alces Clusterware Slurm compute node daemon.
Unit clusterware-slurm-slurmd.service entered failed state.
clusterware-slurm-slurmd.service failed.
clusterware-slurm-slurmd.service holdoff time over, scheduling restart.
Starting Alces Clusterware Slurm compute node daemon...
clusterware-slurm-slurmd.service: control process exited, code=exited status=1
Failed to start Alces Clusterware Slurm compute node daemon.
Unit clusterware-slurm-slurmd.service entered failed state.
clusterware-slurm-slurmd.service failed.
clusterware-slurm-slurmd.service holdoff time over, scheduling restart.
start request repeated too quickly for clusterware-slurm-slurmd.service
Failed to start Alces Clusterware Slurm compute node daemon.
Unit clusterware-slurm-slurmd.service entered failed state.
clusterware-slurm-slurmd.service failed.
Starting Alces Clusterware Slurm compute node daemon...
PID file /var/run/slurm/slurmd.pid not readable (yet?) after start.
Started Alces Clusterware Slurm compute node daemon.
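
A quick Python sketch for counting how many slurmd start attempts appear in journal output like the above before the rate limit trips. The message strings are copied from the journal; the function name `starts_before_limit` is hypothetical.

```python
START_MSG = "Starting Alces Clusterware Slurm compute node daemon..."
LIMIT_MSG = "start request repeated too quickly"


def starts_before_limit(journal_lines):
    """Count start attempts seen before the first rate-limit message.

    Returns None if the rate limit was never hit.
    """
    attempts = 0
    for line in journal_lines:
        if START_MSG in line:
            attempts += 1
        elif LIMIT_MSG in line:
            return attempts
    return None
```

Counting by hand, the journal above shows five slurmd start attempts before the limit message. That would be consistent with systemd's default start limit of 5 starts within 10 seconds, assuming the unit does not override it.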

mjtko modified the milestone: 1.6-fixes Sep 30, 2016
mjtko added the ready label Oct 3, 2016
mjtko modified the milestones: 1.6-fixes, 1.8-release Jan 3, 2017
mjtko modified the milestone: 1.8-release May 8, 2017
mjtko removed the ready label May 8, 2017