Consistent http.Server timeout configurations #30248

Merged — 3 commits merged into master from jent/http_timeouts on Aug 11, 2023

Conversation

Contributor

@jentfoo jentfoo commented Aug 9, 2023

This PR builds on the work from #30151 in the following ways:

  • A couple of additional server configurations that were missing timeouts are now covered
  • Timeouts are now configured in a consistent way. This means:
    • Configuring ReadTimeout, which was not covered by setting only ReadHeaderTimeout
    • Setting ReadHeaderTimeout to the more aggressive (1 second) defaults.ReadHeadersTimeout
    • Setting a WriteTimeout in cases of potentially large responses (a minimal sketch of the combined configuration follows this list)
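
A minimal, self-contained sketch of the combined timeout configuration described above. This is not Teleport code: the constant names and values are illustrative stand-ins for `defaults.ReadHeadersTimeout` and `apidefaults.DefaultIOTimeout`.

```go
package main

import (
	"net/http"
	"time"
)

// Stand-in values; Teleport defines its own in the defaults packages
// (defaults.ReadHeadersTimeout, apidefaults.DefaultIOTimeout).
const (
	readHeadersTimeout = 10 * time.Second // time allowed to read request headers
	defaultIOTimeout   = 30 * time.Second // time allowed for full request read / response write
)

func newServer(handler http.Handler) *http.Server {
	return &http.Server{
		Handler:           handler,
		ReadHeaderTimeout: readHeadersTimeout, // bounds slow-header (Slowloris-style) clients
		ReadTimeout:       defaultIOTimeout,   // bounds reading the entire request body
		WriteTimeout:      defaultIOTimeout,   // bounds writing potentially large responses
	}
}

func main() {
	srv := newServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	}))
	srv.Addr = "127.0.0.1:8080"
	_ = srv.ListenAndServe()
}
```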

@@ -189,7 +189,9 @@ func NewTLSServer(cfg TLSServerConfig) (*TLSServer, error) {
 	cfg: cfg,
 	httpServer: &http.Server{
 		Handler:           tracingHandler,
-		ReadHeaderTimeout: apidefaults.DefaultIOTimeout,
+		ReadTimeout:       apidefaults.DefaultIOTimeout,
+		ReadHeaderTimeout: defaults.ReadHeadersTimeout,
Collaborator

I'm a bit concerned about the more aggressive timeout causing test flakiness, or worse, performance issues at scale. One second is almost always too short for our CI environment, where many test cases run in parallel.

What do you think about starting with 2 seconds, and leaving this in master for the v14 performance tests before backporting?

Contributor

+1. From experience, I'd much rather err on the side of caution by starting larger and slowly reducing the timeout over time.

Contributor Author

@jentfoo jentfoo Aug 10, 2023

> What do you think about starting with 2 seconds, and leaving this in master for the v14 performance tests before backporting?

Sounds like a great plan to me. In commit ff22401ecd6b5c95793e8c03cba3ea77e1c642e2 I updated the configured timeout to 10 seconds. I feel like 2 seconds is likely fine, but since we are coming from 30 seconds, even at 10 this is a notable net gain.

Maybe in Teleport 15 we go to 2 seconds (no concerns with moving slowly to make sure we don't cause any impact).

Contributor Author

I have seen two test failures; I worry this could be increasing flakiness.

The timeouts seem pretty reasonable though. Does it make sense to reduce test concurrency so that tests can complete faster? @zmb3 what are your thoughts?

Collaborator

Which tests? It's probably Go 1.21 and not your change.

Contributor Author

Seen twice:

 === Failed
=== FAIL: lib/auth TestAutoRotation (1.75s)
    tls_test.go:418: 
        	Error Trace:	/__w/teleport/teleport/lib/auth/tls_test.go:418
        	Error:      	Error "write tcp 127.0.0.1:47596->127.0.0.1:32991: write: broken pipe" does not contain "certificate"
        	Test:       	TestAutoRotation

I assumed it might be the write timeout causing the connection to be closed.
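
For illustration, here is a small self-contained sketch (not the Teleport test) of how a server-side `WriteTimeout` that is shorter than the handler's work abruptly closes the connection, which the peer then sees as an unexpected EOF, connection reset, or broken pipe instead of the error the test expects. The handler, timings, and use of `httptest` are all assumptions for the demo.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

func main() {
	// A handler that takes longer than the server's WriteTimeout to finish.
	slow := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("partial"))
		w.(http.Flusher).Flush()
		time.Sleep(200 * time.Millisecond) // exceed the write deadline
		w.Write([]byte("rest"))            // deadline has passed; the connection is torn down
	})

	ts := httptest.NewUnstartedServer(slow)
	ts.Config.WriteTimeout = 50 * time.Millisecond
	ts.Start()
	defer ts.Close()

	resp, err := http.Get(ts.URL)
	if err != nil {
		fmt.Println("request error:", err)
		return
	}
	defer resp.Body.Close()
	_, err = io.ReadAll(resp.Body)
	// The client observes an abrupt close mid-response rather than a clean error.
	fmt.Println("read error:", err)
}
```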

Contributor

@greedy52 greedy52 Aug 11, 2023

#30253 may or may not be related to your change (most likely not), but I don't think we have found the root cause for the above flaky test yet.

@@ -84,8 +86,15 @@ func run() error {
http.HandleFunc("/register/1", s.register1)
http.HandleFunc("/register/2", s.register2)

srv := &http.Server{
Collaborator

This is fine, but it doesn't really matter; this is just an example program and is not part of Teleport.

@jentfoo jentfoo force-pushed the jent/http_timeouts branch 2 times, most recently from ff22401 to d252012 on August 10, 2023 21:27
This commit builds on the work from #30151 in the following ways:
* A couple of additional server configurations that were missing timeouts are now covered
* Timeouts are now configured in a consistent way. This means:
  - Configuring `ReadTimeout`, which was not covered by setting only `ReadHeaderTimeout`
  - Setting `ReadHeaderTimeout` to the more aggressive (1 second) `defaults.ReadHeadersTimeout`
  - Setting a `WriteTimeout` in cases of potentially large responses
@jentfoo jentfoo force-pushed the jent/http_timeouts branch from e796c10 to addcb6a on August 11, 2023 14:20
@jentfoo jentfoo added this pull request to the merge queue Aug 11, 2023
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 11, 2023
@jentfoo jentfoo added this pull request to the merge queue Aug 11, 2023
Merged via the queue into master with commit b7f571b Aug 11, 2023
@jentfoo jentfoo deleted the jent/http_timeouts branch August 11, 2023 16:23
@tigrato tigrato mentioned this pull request Sep 15, 2023
tigrato added a commit that referenced this pull request Sep 15, 2023
This PR removes the `ReadTimeout` and `WriteTimeout` settings from
`kube/proxy.Server`.

The revert is required because both settings were terminating watch
streams prematurely and causing several issues when parsing the long-lived
data stream.

Signed-off-by: Tiago Silva <[email protected]>
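
For context, server-wide `ReadTimeout`/`WriteTimeout` apply to every connection, including long-lived watch streams, so a stream is cut once the deadline passes. Teleport's fix was to remove the settings; as a hedged alternative sketch (not Teleport's code), a streaming handler on Go 1.20+ can clear the per-request deadlines with `http.ResponseController` while leaving the server-wide defaults in place for ordinary requests. The handler and timeout values below are illustrative.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// watchHandler streams data indefinitely, so it clears the per-connection
// read/write deadlines that the server-wide timeouts would otherwise enforce.
func watchHandler(w http.ResponseWriter, r *http.Request) {
	rc := http.NewResponseController(w)
	// A zero time.Time means "no deadline" (Go 1.20+).
	if err := rc.SetReadDeadline(time.Time{}); err != nil {
		http.Error(w, "streaming unsupported", http.StatusInternalServerError)
		return
	}
	if err := rc.SetWriteDeadline(time.Time{}); err != nil {
		http.Error(w, "streaming unsupported", http.StatusInternalServerError)
		return
	}

	for i := 0; ; i++ {
		fmt.Fprintf(w, "event %d\n", i)
		if err := rc.Flush(); err != nil {
			return // client went away
		}
		time.Sleep(time.Second)
	}
}

func main() {
	srv := &http.Server{
		Addr:              "127.0.0.1:8080",
		Handler:           http.HandlerFunc(watchHandler),
		ReadHeaderTimeout: 10 * time.Second,
		ReadTimeout:       30 * time.Second, // fine for ordinary requests,
		WriteTimeout:      30 * time.Second, // but fatal for uncapped streams
	}
	_ = srv.ListenAndServe()
}
```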
github-merge-queue bot pushed a commit that referenced this pull request Sep 15, 2023
github-actions bot pushed a commit that referenced this pull request Sep 15, 2023
github-merge-queue bot pushed a commit that referenced this pull request Sep 18, 2023
@r0mant
Collaborator

r0mant commented Nov 21, 2023

@jentfoo FYI this change broke not just Kube Access, which @tigrato fixed in #31945, but application access as well (it took us a week to troubleshoot and pinpoint the cause). I removed all timeouts set by this PR in the application access request path in #34843.

I think we should carefully reevaluate all the other places where this PR introduced timeouts as well; for example, I see it also sets them in the local proxy, which I think may be prone to the same issue. TBH I'm tempted to just roll this back entirely.

@zmb3
Collaborator

zmb3 commented Nov 22, 2023

Looks like this was also the cause of #34201.

@tigrato
Contributor

tigrato commented Nov 22, 2023

@r0mant @zmb3 please check whether #33768 applies here as well.
