Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

comms timeout #4112

Open
oliver-sanders opened this issue Mar 5, 2021 · 10 comments
Open

comms timeout #4112

oliver-sanders opened this issue Mar 5, 2021 · 10 comments
Milestone

Comments

@oliver-sanders
Copy link
Member

oliver-sanders commented Mar 5, 2021

Edit (2020-11-19):

Cylc currently uses a PT5S "round journey" timeout for all requests. This can be overridden by the somewhat confusingly named but still functional --comms-timeout option. There is no immediate pressure to develop this timeout system further or to make it configurable at the moment. Keeping this issue open for now in case that changes.

See #4112 (comment)

The --comms-timeout option from Cylc7 is still present in Cylc8, however, it doesn't do quite what it says on the tin.

It is now a global timeout, it is implemented in the Python layer, however:

# this
$ cylc <cmd> --comms-timeout=<time>
# is roughly equivalent to this
$ timeout <time> cylc <cmd>

Current implementation:

# send message
msg = {'command': command, 'args': args}
msg.update(self.header)
LOG.debug('zmq:send %s', msg)
message = encode_(msg)
self.socket.send_string(message)
# receive response
if self.poller.poll(timeout):
res = await self.socket.recv()
else:
if callable(self.timeout_handler):
self.timeout_handler()
raise ClientTimeout(
'Timeout waiting for server response.'
' This could be due to network or server issues.'
' Check the suite log.'
)

Questions:

  1. Reconsider the case for timeouts and determine how they should look in the future
@oliver-sanders oliver-sanders added the question Flag this as a question for the next Cylc project meeting. label Mar 5, 2021
@oliver-sanders oliver-sanders added this to the cylc-8.0.0 milestone Mar 5, 2021
@hjoliver
Copy link
Member

hjoliver commented Mar 9, 2021

I guess we need a timeout for cylc commands executed by task jobs, to avoid hanging jobs unnecessarily. For user-run commands it doesn't matter so much.

@datamel datamel mentioned this issue Mar 19, 2021
7 tasks
@hjoliver
Copy link
Member

hjoliver commented Sep 10, 2021

The issue is, it's not actually a "comms timeout" - should we rename it, to just "timeout"?

Better check that there is a default timeout esp. for cylc message (doc says: for messaging see global config ...)

@hjoliver
Copy link
Member

But ... it does apply only to commands that contact a scheduler.

Decision: get rid of the option, replace it with a network timeout configured in global.cylc. PT30S default? Current default for messaging - PT30S.

@hjoliver hjoliver removed the question Flag this as a question for the next Cylc project meeting. label Sep 10, 2021
@oliver-sanders
Copy link
Member Author

oliver-sanders commented Sep 10, 2021

Current default for messaging - PT30S.

Current default for messaging - PT5S

(note this is a round-trip timeout, i.e. the time between sending the request and receiving the response)

DEFAULT_TIMEOUT = 5. # 5 seconds

@oliver-sanders
Copy link
Member Author

Decision:

  • Remove the --comms-timeout option (no use case0.
  • Make this timeout configurable.
  • Set the default to PT30S.

@dpmatthews
Copy link
Contributor

Note that the Cylc 7 [task messaging]connection timeout setting was removed in #3402.

@oliver-sanders
Copy link
Member Author

Slight problem with the above decision...

The new comms timeout global configuration will be loaded from the global config on the platform from which the request was made. This means that in order to set this configuration properly it must be set on all platforms.

The alternative being to add a field to the contact file to allow the value to be passed through to remote platforms.

Options:

  1. Require the comms timeoout to be set on all platforms.
  2. Copy the comms timeout into the contact file.
  3. Just do away with it or hardcode a sensible default.

BTW: We are currently doing (3) with a hardcoded default of PT5S. Haven't seen any issues yet.

@oliver-sanders oliver-sanders added the question Flag this as a question for the next Cylc project meeting. label Nov 17, 2021
@hjoliver
Copy link
Member

IMO a hard coded timeout is fine, as well as simpler, so long as it is long enough to handle reasonable latencies.

@datamel
Copy link
Contributor

datamel commented Nov 18, 2021

Discussed in VC 18/11/21. Decision remove this from rc1 and track use cases before proceeding with this issue.

@datamel datamel modified the milestones: cylc-8.0rc1, cylc-8.0.0 Nov 18, 2021
@hjoliver
Copy link
Member

(Also from the VC: if we do come back to this, having to configure it on remote platforms is not unreasonable given the functionatlity it affects).

@oliver-sanders oliver-sanders modified the milestones: cylc-8.0.0, cylc-8.x Nov 19, 2021
@oliver-sanders oliver-sanders removed the question Flag this as a question for the next Cylc project meeting. label Nov 19, 2021
@oliver-sanders oliver-sanders removed their assignment Feb 28, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants