
HTTP2 doesn't retry the first stream on REFUSED_STREAM #44669

Closed
BrennanConroy opened this issue Nov 13, 2020 · 10 comments · Fixed by #48758

@BrennanConroy (Member) commented Nov 13, 2020

We have an HttpClient+Kestrel test that checks that the concurrent stream limit prevents requests from reaching the application delegate once the limit is hit.

The test is flaky. After investigating, it looks like the following happens: if a batch of streams is sent in parallel when the connection starts, and the stream limit is lower than the batch size, there is a race where the request that creates the connection is not the first stream actually sent on it. If that request ends up as, say, the 6th stream against a server limit of 5, it can get the REFUSED_STREAM response and then skip the retry logic, because it has the isNewConnection value set to true:

catch (HttpRequestException e) when (!isNewConnection && e.AllowRetry == RequestRetryType.RetryOnSameOrNextProxy)

If the SETTINGS frame (which carries the server's stream limit) has already been received, everything works fine, but because there is a big batch of requests at the beginning, all the streams can be in flight before the SETTINGS frame arrives.

Test issue dotnet/aspnetcore#27379

cc @Tratcher and @halter73 who helped me understand some of the HTTP2 spec when looking at this.

Our test code looks something like this (server code is just pseudo code here):

server.MaxConcurrentStreams = 5;
var blockRequests = new TaskCompletionSource(TaskCreationOptions.RunContinuationsAsynchronously);
server.Run(async (request) =>
{
    // unblock after the 5th request reaches the app
    // ...
    await blockRequests.Task;
});

// Fire off more parallel requests than the server's stream limit allows.
var tasks = new List<Task<HttpResponseMessage>>(10);
for (var i = 0; i < 10; i++)
{
    var requestTask = client.PostAsync(url, new StringContent("Hello World")).DefaultTimeout();
    tasks.Add(requestTask);
}

var responses = await Task.WhenAll(tasks).DefaultTimeout(); // throws sometimes

@Tratcher (Member) commented:

A) Why isn't the request that was used to create the connection the first request to be sent on that connection? I'd expect it to have priority.
B) Why can't the first request be retried? Because it's assumed to have failed for some more fatal reason like DNS?

@geoffkizer (Contributor) commented Nov 14, 2020

> Why isn't the request that was used to create the connection the first request to be sent on that connection? I'd expect it to have priority.

As soon as the connection is established, we make it available for use by any requests. So while the request that caused the connection to be created will immediately try to use it, other requests may also be racing to use it and there's no guarantee that the original request that created the connection will actually be the first one to send a request on it.

> Why can't the first request be retried? Because it's assumed to have failed for some more fatal reason like DNS?

The reasoning here is that we don't want to get into an infinite retry loop. So if the first request on the connection fails, we don't retry it -- otherwise we could end up just trying to create new connections endlessly.
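
For context, here is a minimal sketch of the loop shape being described. This is not the actual SocketsHttpHandler source; GetOrCreateConnectionAsync and connection.SendAsync are hypothetical stand-ins for the pool's connection lookup and send path:

// Sketch: the pool keeps resubmitting the request until it succeeds or the
// exception filter says the failure is not retryable.
while (true)
{
    var (connection, isNewConnection) = await GetOrCreateConnectionAsync(request, cancellationToken);
    try
    {
        return await connection.SendAsync(request, cancellationToken);
    }
    catch (HttpRequestException e) when (!isNewConnection && e.AllowRetry == RequestRetryType.RetryOnSameOrNextProxy)
    {
        // Only requests that did NOT create the connection are retried. That guard is
        // what prevents endless connection creation, but it is also what swallows the
        // retry for the refused request in this issue.
    }
}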

@geoffkizer (Contributor) commented:

That said, it does seem like the combination of all these factors leads to a situation where a REFUSED_STREAM is not retried even though it should be. Not sure how to fix that. Suggestions welcome.

@karelz (Member) commented Dec 2, 2020

Triage: We should verify that the 1st request on a new HTTP connection is assigned to stream 1.
If it is, then this is super low priority, as the failure should be fairly rare and exceptional.

This might explain several issues during our Stress testing.

karelz added the bug label and removed the untriaged label on Dec 2, 2020
karelz added this to the 6.0.0 milestone on Dec 2, 2020
@ManickaP (Member) commented Dec 2, 2020

I took a quick look at the code, and it can happen. We construct the connection, store it in _http2Connections, and only then initiate the request itself. The streamId is assigned in SendHeadersAsync, which happens further down the line, so other requests might have been assigned lower stream ids in the meantime.
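
Sketched as a timeline (helper names and calls here are illustrative, just to show the window):

// 1. The connection is created and published to the pool:
Http2Connection connection = await EstablishHttp2ConnectionAsync(cancellationToken); // hypothetical helper
_http2Connections.Add(connection);

// 2. The connection is now visible to every queued request; any of them can reach
//    SendHeadersAsync first and take stream ids 1, 3, 5, ...

// 3. Only later does the originating request get its own stream id assigned:
HttpResponseMessage response = await connection.SendAsync(originalRequest, cancellationToken); // may be stream 7, 9, ...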

@geoffkizer (Contributor) commented:

I think we should change the retry logic so that we only disallow retrying the request on stream 1 (for HTTP2). This would fix the original issue here while still addressing the "endless loop" problem.

That said, refactoring the logic around isNewConnection may be a little tricky.

@geoffkizer (Contributor) commented:

> That said, refactoring the logic around isNewConnection may be a little tricky.

Actually I think it's relatively straightforward. In HttpConnectionPool, we should always treat an HTTP2 (or HTTP3?) connection as a new connection for retry purposes -- i.e. isNewConnection is always true. But in the HTTP2 OnReset logic, where we determine whether a request can be retried or not, we should just disable this for StreamId == 1.
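
A rough sketch of what that check might look like in the HTTP/2 reset handling (method and enum names here are illustrative placeholders, not the actual internals):

void OnReset(int streamId, Http2ProtocolErrorCode errorCode)
{
    if (errorCode == Http2ProtocolErrorCode.RefusedStream && streamId != 1)
    {
        // Refused streams other than the very first one are safe to retry:
        // surface a retryable failure (AllowRetry == RetryOnSameOrNextProxy).
        FailRequestAsRetryable();
    }
    else
    {
        // Stream 1 keeps the old behavior: fail hard, preserving the protection
        // against an endless create-connection / refuse-stream loop.
        FailRequest();
    }
}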

@geoffkizer (Contributor) commented:

I also wonder if we should additionally just set a hard limit for retry attempts, say 5. We could make this configurable if desired.

As things stand, a badly behaved HTTP2 server could still cause us to loop infinitely if it just keeps resetting new streams that should be valid under the stream limit.

This would allow us to just rip out the isNewConnection stuff entirely, I think, which would be nice. And a hard limit ensures we have a backstop for any server shenanigans, HTTP2 or otherwise.
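
A backstop like that could be as simple as the following sketch (MaxRetryCount and SendOnPooledConnectionAsync are hypothetical, not existing APIs):

const int MaxRetryCount = 5; // hypothetical knob; could be made configurable

for (int attempt = 0; ; attempt++)
{
    try
    {
        return await SendOnPooledConnectionAsync(request, cancellationToken);
    }
    catch (HttpRequestException e) when (e.AllowRetry == RequestRetryType.RetryOnSameOrNextProxy
                                         && attempt < MaxRetryCount)
    {
        // Retry on another (possibly new) connection, but never more than
        // MaxRetryCount times, so a misbehaving server cannot loop us forever.
    }
}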

@geoffkizer (Contributor) commented:

@scalablecory Any thoughts here?
