troubleshooting: diagnosing incorrect task status #697

Closed
hjoliver opened this issue Mar 12, 2024 · 4 comments

@hjoliver
Member

hjoliver commented Mar 12, 2024

Add to the new troubleshooting section once #638 is merged.


The Cylc UIs show the scheduler's current knowledge of task and job state. For active tasks, that involves interaction with the external world:

  • a task enters the "submitted" state if the job runner successfully returns a job ID on job submission
  • then it enters the "running" state if the submitted job returns a "started" status message
  • and finally, it enters the "succeeded" or "failed" state if the running job returns a corresponding status message
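As a rough illustration of the flow above, here is a toy state table in Python. It is only a sketch, not Cylc's actual implementation: the event names and the "preparing" starting state are assumptions made for the example.

```python
# Toy model of the status flow described above; NOT Cylc's implementation.
# The event names and the "preparing" start state are illustrative assumptions.
TRANSITIONS = {
    ("preparing", "job_id_returned"): "submitted",
    ("submitted", "started"): "running",
    ("running", "succeeded"): "succeeded",
    ("running", "failed"): "failed",
}


def apply_events(state, events):
    """Return the state reached after applying each event that arrives."""
    for event in events:
        # An event that does not match the current state leaves it unchanged.
        state = TRANSITIONS.get((state, event), state)
    return state


# All messages arrive: the scheduler sees the full lifecycle.
print(apply_events("preparing", ["job_id_returned", "started", "succeeded"]))
# Later messages are lost: the scheduler's view stays at "submitted".
print(apply_events("preparing", ["job_id_returned"]))
```

The second call shows how a task can appear stuck: the scheduler's view only advances when the corresponding external information actually reaches it.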

(Note: the above assumes TCP job status messaging; otherwise the scheduler periodically polls for job status.)

Tasks may get "stuck" in an incorrect state if anything blocks this external job status information. For instance, you may see a task that stays in the "submitted" state even though it actually ran and completed.

Polling the task, which makes the scheduler query the job runner and check the job.status file, will return the correct result, but you may still need to determine what went wrong.

Incorrect task status implies one of two things:

  • the job never sent the status message
    • this implies the job was hard-killed (SIGKILL) or the host went down
    • (a soft kill or job failure will cause a "failed" status message to be sent before exit)
  • or the job ran and completed but was unable to send status messages back
    • this implies network issues blocked the send
    • or the job could not find the Cylc package on the job host, which it needs in order to send the message

You can determine what happened by examining the job logs:

  • if the job finished (succeeded or failed), that will be recorded in the job.status file regardless of whether the message was sent
  • if the message send failed, the job.err log will record errors (this will not stop the job from completing, however)
  • if the job.status file does not record completion, and the job is no longer present in the job runner queue, then the job must have been hard-killed
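
The checks above can be sketched as a small Python helper. This is a hedged illustration only: it assumes job.status holds `KEY=VALUE` lines and that completion is recorded under a key named `CYLC_JOB_EXIT` (check a real job.status file to confirm, and pass a different `exit_key` if needed). The function names and the returned labels are hypothetical.

```python
def parse_job_status(text):
    """Parse KEY=VALUE lines (assumed job.status layout) into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines that contain no "="
            fields[key.strip()] = value.strip()
    return fields


def diagnose(status_text, job_err_text, in_runner_queue,
             exit_key="CYLC_JOB_EXIT"):  # key name is an assumption
    """Return a rough diagnosis label for a task showing the wrong status."""
    fields = parse_job_status(status_text)
    if exit_key in fields:
        # The job finished; job.status records this even if the send failed.
        if job_err_text.strip():
            return "finished: check job.err for message send errors"
        return "finished: job.err is empty"
    if in_runner_queue:
        # No completion recorded, but the job runner still knows the job.
        return "still in job runner queue"
    # No completion recorded and the job is gone from the queue.
    return "hard-killed"


print(diagnose("CYLC_JOB_ID=1234\n", "", in_runner_queue=False))
```

The caller supplies the contents of job.status and job.err plus whether the job runner still lists the job; how to query the job runner depends on the platform and is left out here.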
@hjoliver added this to the 8.3.x milestone Mar 12, 2024
@oliver-sanders
Member

oliver-sanders commented Mar 14, 2024

Closed by #638?

If not, push a commit onto upstream/troubleshotting.

@hjoliver
Member Author

I didn't think it was covered very well, but maybe I didn't look closely enough. I'll re-check and tweak it if necessary...

@oliver-sanders
Member

Here's the troubleshooting entry for job status not updating:

https://github.com/cylc/cylc-doc/pull/638/files#diff-3109576eee7d4e82c35cf79b3678f427036bd7be93134e9cea4cc866a63f8919R110-R164

@hjoliver
Member Author

OK cool, that's good enough. I'll close this.

@oliver-sanders removed this from the 8.3.x milestone Jun 5, 2024