Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate how execution timeout is set #4859

Closed
dpmatthews opened this issue May 5, 2022 · 5 comments · Fixed by #4863
Closed

Investigate how execution timeout is set #4859

dpmatthews opened this issue May 5, 2022 · 5 comments · Fixed by #4863
Milestone

Comments

@dpmatthews
Copy link
Contributor

As noted in #3706, execution timeout gets set if you define execution time limit but this doesn't appear to be documented and the way it is done probably doesn't make much sense.

@oliver-sanders
Copy link
Member

Execution time limit is a "feature" of the "job runner handler", it may or may not be implemented (though in practice it is).

For example here's the implementation for the "background" handler:

execution_time_limit = submit_opts.get("execution_time_limit")
timeout_str = ""
if execution_time_limit:
timeout_str = (
" timeout --signal=XCPU %d" % execution_time_limit)

And here's the PBS implementation:

if (job_conf["execution_time_limit"] and
directives.get("-l walltime") is None):
directives["-l walltime"] = "%d" % job_conf["execution_time_limit"]

@dpmatthews
Copy link
Contributor Author

Tested with this workflow:

[scheduling]
    [[graph]]
        R1 = neither & limit & timeout & limit-timeout
[runtime]
    [[root]]
        script = "echo Hello"
    [[neither]]
    [[limit]]
        execution time limit = PT3M
    [[timeout]]
        [[[events]]]
            execution timeout = PT2M
    [[limit-timeout]]
        execution time limit = PT3M
        [[[events]]]
            execution timeout = PT2M

Relevant log entries:

2022-05-06T16:20:26+01:00 INFO - [1/neither running job:01 flows:1] health: execution timeout=None, polling intervals=PT1H,...
2022-05-06T16:20:26+01:00 INFO - [1/timeout running job:01 flows:1] health: execution timeout=PT2M, polling intervals=PT1H,...

Both fine.

2022-05-06T16:20:26+01:00 INFO - [1/limit running job:01 flows:1] health: execution timeout=PT13M, polling intervals=PT4M,PT2M,PT7M,...

Undocumented behaviour.
The execution timeout is set to the execution time limit + the sum of the execution time limit polling intervals here:

timeout = time_limit + sum(time_limit_delays)

This doesn't make much sense, especially given that the last interval gets used repeatedly so the sum of the intervals doesn't really mean anything.

2022-05-06T16:20:26+01:00 INFO - [1/limit-timeout running job:01 flows:1] health: execution timeout=PT13M, polling intervals=PT4M,PT2M,PT7M,...

As above but in this case the configured timeout is being ignored which I think is a bug.

I propose we just remove the line shown above. Execution timeout will then only be set if configured and won't be affected by whether execution time limit is set.

@oliver-sanders
Copy link
Member

The execution timeout is set to the execution time limit + the sum of the execution time limit polling intervals

😨

I propose we just remove the line shown above

Yeah, that makes sense, but why was it there in the first place?!

@hjoliver
Copy link
Member

hjoliver commented May 12, 2022

The execution timeout is set to the execution time limit + the sum of the execution time limit polling intervals

fearful

I propose we just remove the line shown above

Yeah, that makes sense, but why was it there in the first place?!

git blames Matt for that, March 2018.

Maybe this was the logic:
If you set execution timelimit polling intervals, that suggests you think the task could take that long (to the end of polling) to finish, so it wouldn't make sense to time out before then. Did "the end of polling" used to be a thing for this, as opposed to repeated use of the final interval? That almost makes sense even with a repeated final interval so long as you are explicit up to that point (e.g. use PT10S, 5*PT30S rather than PT10S, PT30S knowing that PT30S will be reused) ... but we'd have to be very clear on the documentation of that. [And of course, if you have job runner that doesn't kill the job on the timelimit].

@dpmatthews
Copy link
Contributor Author

Regardless of the original motivation, this behaviour was never documented.
As noted in #4863 (comment), if there really if a requirement to set the execution timeout based on the time limit then I don't think using the execution polling intervals is the right way to do it.
So, I think it's better to remove this behaviour.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants