data store: waiting jobs #4905

Closed
wxtim opened this issue Jun 8, 2022 · 4 comments · Fixed by #4941

wxtim commented Jun 8, 2022

Spotted in the wild, "waiting jobs":

{
  "data": {
    "workflows": [
      {
        "taskProxies": [
          {
            "id": "~tpilling/mi-bb964/run2//19600101T0000Z/coupled",
            "state": "waiting",
            "jobs": []
          },
          {
            "id": "~tpilling/mi-bb964/run2//19600101T0000Z/fcm_make2_drivers",
            "state": "waiting",
            "jobs": [
              {
                "id": "~tpilling/mi-bb964/run2//19600101T0000Z/fcm_make2_drivers/01",
                "state": "waiting"
              },
              {
                "id": "~tpilling/mi-bb964/run2//19600101T0000Z/fcm_make2_drivers/03",
                "state": "waiting"
              },
              {
                "id": "~tpilling/mi-bb964/run2//19600101T0000Z/fcm_make2_drivers/02",
                "state": "waiting"
              }
            ]
          },

The "waiting" state is a task state, not a job state. It shouldn't be possible to get a waiting job.
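A check along these lines could flag the problem in a query response. This is only a minimal sketch: the function name `find_invalid_jobs` and the `TASK_ONLY_STATES` set are hypothetical, and the set contains only "waiting" because that is the one state identified above as task-only. The sample data is trimmed from the response shown above.

```python
from typing import Any, Dict, List

# Assumption: only "waiting" is treated as a task-only state here,
# because that is the one state the report identifies as invalid for jobs.
TASK_ONLY_STATES = {"waiting"}


def find_invalid_jobs(response: Dict[str, Any]) -> List[str]:
    """Return the IDs of jobs whose state is a task-only state."""
    bad: List[str] = []
    for workflow in response.get("data", {}).get("workflows", []):
        for task in workflow.get("taskProxies", []):
            for job in task.get("jobs", []):
                if job.get("state") in TASK_ONLY_STATES:
                    bad.append(job["id"])
    return bad


# Sample trimmed from the response in this report.
PREFIX = "~tpilling/mi-bb964/run2//19600101T0000Z"
sample = {
    "data": {
        "workflows": [
            {
                "taskProxies": [
                    {"id": f"{PREFIX}/coupled", "state": "waiting", "jobs": []},
                    {
                        "id": f"{PREFIX}/fcm_make2_drivers",
                        "state": "waiting",
                        "jobs": [
                            {"id": f"{PREFIX}/fcm_make2_drivers/01", "state": "waiting"},
                            {"id": f"{PREFIX}/fcm_make2_drivers/03", "state": "waiting"},
                            {"id": f"{PREFIX}/fcm_make2_drivers/02", "state": "waiting"},
                        ],
                    },
                ]
            }
        ]
    }
}

for job_id in find_invalid_jobs(sample):
    print(job_id)
```

A task in the "waiting" state with no jobs (such as `coupled`) is not flagged; only job entries are inspected.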

Steps to reproduce:
Create a workflow where job submission fails after trying two hosts.

@wxtim wxtim added the bug Something is wrong :( label Jun 8, 2022
@oliver-sanders oliver-sanders changed the title from "Bug" to "data store: waiting jobs" Jun 8, 2022
@oliver-sanders oliver-sanders added this to the cylc-8.0rc5 milestone Jun 8, 2022
@hjoliver

Ping @dwsutherland

@dwsutherland

I have reproduced this:

[platforms]
    [[boss]]
        hosts = fdasfeds, fdawsf

    [[foo]]
        inherit = FAM3
        platform = boss
        submission retry delays = PT0S, PT30S, PT1M
        [[[meta]]]
            description = "some task foo"
        [[[environment]]]
            GREETING = "Hello from foo!"

query {
  workflows (ids: ["linear/run1"]) {
    id
    taskProxies {
      id
      state
      jobs (sort: {keys: ["submitNum"], reverse: true}) {
        id
        state
      }
    }
  }
}

{
  "data": {
    "workflows": [
      {
        "id": "~sutherlander/linear/run1",
        "taskProxies": [
          {
            "id": "~sutherlander/linear/run1//20210201T00/foo",
            "state": "waiting",
            "jobs": []
          },
          {
            "id": "~sutherlander/linear/run1//20210101T00/foo",
            "state": "submit-failed",
            "jobs": [
              {
                "id": "~sutherlander/linear/run1//20210101T00/foo/04",
                "state": "submit-failed"
              },
              {
                "id": "~sutherlander/linear/run1//20210101T00/foo/03",
                "state": "waiting"
              },
              {
                "id": "~sutherlander/linear/run1//20210101T00/foo/02",
                "state": "waiting"
              },
              {
                "id": "~sutherlander/linear/run1//20210101T00/foo/01",
                "state": "waiting"
              }
            ]
          },
          {
            "id": "~sutherlander/linear/run1//20210101T00/bar",
            "state": "waiting",
            "jobs": []
          }
        ]
      }
    ]
  }
}

I will look into this. I think it's because the jobs initially adopt the task's state. I suppose the job state should go preparing -> submit-failed?

dwsutherland commented Jun 29, 2022

Found the issue: the platform submission failure happens before the job is inserted into the data store (and DB):

2022-06-29T02:48:00Z CRITICAL - [20210101T00/foo preparing job:01 flows:1] submission failed
~sutherlander/linear/run1//20210101T00/foo/01 ...... submit-failed
JOB OR STATE NOT FOUND
JOB DOES NOT EXIST
2022-06-29T02:48:00Z INFO - [20210101T00/foo preparing job:01 flows:1] => waiting
2022-06-29T02:48:00Z WARNING - [20210101T00/foo waiting job:01 flows:1] retrying in P0Y (after 2022-06-29T02:48:00Z)
preparing
INSERTING: ~sutherlander/linear/run1//20210101T00/foo/01, STATUS: preparing

I just need to find out where this failure is getting triggered, and then I'll put up a fix.
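The ordering problem can be illustrated with a toy model. This is a rough sketch under stated assumptions, not Cylc's data store code: `ToyDataStore`, `insert_job`, `update_job`, and the two helper functions are all hypothetical names. It only shows that a status update applied before the job record exists is lost, so the job never reaches submit-failed. How the job later ends up reported as "waiting" is not modelled.

```python
from typing import Dict, List, Tuple

JOB_ID = "20210101T00/foo/01"


class ToyDataStore:
    """Hypothetical stand-in for a job store; not Cylc's actual API."""

    def __init__(self) -> None:
        self.jobs: Dict[str, str] = {}
        self.dropped: List[Tuple[str, str]] = []

    def insert_job(self, job_id: str, state: str) -> None:
        """Create the job record with an initial state."""
        self.jobs[job_id] = state

    def update_job(self, job_id: str, state: str) -> bool:
        """Update an existing job; an update for an unknown job is dropped."""
        if job_id not in self.jobs:
            self.dropped.append((job_id, state))
            return False
        self.jobs[job_id] = state
        return True


def run_buggy_order() -> str:
    """The failure is reported before the job has been inserted."""
    store = ToyDataStore()
    store.update_job(JOB_ID, "submit-failed")  # job unknown, update lost
    store.insert_job(JOB_ID, "preparing")
    return store.jobs[JOB_ID]


def run_fixed_order() -> str:
    """The job is inserted first, so the failure update is applied."""
    store = ToyDataStore()
    store.insert_job(JOB_ID, "preparing")
    store.update_job(JOB_ID, "submit-failed")
    return store.jobs[JOB_ID]


print("buggy order:", run_buggy_order())
print("fixed order:", run_fixed_order())
```

In the first ordering the submit-failed update has nothing to attach to, which matches the "JOB DOES NOT EXIST" message in the log above. Whether the actual fix reorders the insert or handles the early failure differently is not stated here.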

@dwsutherland

Fix is up.

@dwsutherland dwsutherland self-assigned this Jun 29, 2022
@hjoliver hjoliver modified the milestones: cylc-8.0rc4, cylc-8.0.0 Jul 5, 2022