
fix: broken reconnect on some HTTP2 error frames #1261

Merged
merged 6 commits into main from fix/stream-error-handling on Mar 6, 2025

Conversation

@toddbaert (Member) commented Mar 4, 2025

This PR fixes an issue where some HTTP2 error frames (for example RST_STREAM frames) cause our streams to die and not recover. The RST_STREAM in particular is sent by envoy/istio when upstream connections are lost, though the connection stays open on our end (envoyproxy/envoy#30149, grpc/grpc-go#8041). We are required to restart the stream in this case by reacting to the onCompleted stream handler, which we never did in the eventStream and we recently stopped doing in the syncStream (due to a regression).

Specifically, the PR:

  • handles onError and onCompleted stream messages by reconnecting the sync (in-process) and event (RPC) streams (this was the main issue)
  • adds specific unit tests for the above
  • does some renames and refactors
    • anything to do with the sync stream is now called "sync" instead of "grpc" to reduce ambiguity
    • anything to do with RPC mode is now called "rpc" or prefixed with that
    • the RPC eventStream now also uses a blocking queue to consume messages, just like the syncStream

We don't have e2e tests for this issue because we'd need some more features in the test-harness to do that, but I will create an issue for it.
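For illustration, here is a minimal sketch of the pattern described in the bullets above. It is not the provider's actual code: ReconnectingStreamSketch, Item, restartStream and handle are hypothetical names, and Resp stands in for the generated flagd response type. The idea is that both onError and onCompleted enqueue a terminal item onto a blocking queue, and the consumer loop treats any terminal item as a signal to re-establish the stream.

    import io.grpc.stub.StreamObserver;
    import java.util.Optional;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Hypothetical sketch only: <Resp> stands in for the generated flagd response type,
    // and restartStream()/handle() are placeholders for the provider's real logic.
    final class ReconnectingStreamSketch<Resp> {

        /** A queue item is either a payload or a terminal marker (error or completion). */
        record Item<R>(Optional<R> payload, Optional<Throwable> error, boolean terminal) {
            static <R> Item<R> data(R r) { return new Item<>(Optional.of(r), Optional.empty(), false); }
            static <R> Item<R> error(Throwable t) { return new Item<>(Optional.empty(), Optional.of(t), true); }
            static <R> Item<R> completed() { return new Item<>(Optional.empty(), Optional.empty(), true); }
        }

        private final BlockingQueue<Item<Resp>> incoming = new LinkedBlockingQueue<>();
        private volatile boolean running = true;

        /** Observer handed to the async stub; terminal callbacks now enqueue a marker. */
        StreamObserver<Resp> observer() {
            return new StreamObserver<Resp>() {
                @Override public void onNext(Resp value) { incoming.offer(Item.data(value)); }
                // RST_STREAM and similar HTTP/2 error frames can surface here...
                @Override public void onError(Throwable t) { incoming.offer(Item.error(t)); }
                // ...or end the stream "cleanly" here; both must lead to a reconnect.
                @Override public void onCompleted() { incoming.offer(Item.completed()); }
            };
        }

        /** Consumer loop: blocks on the queue; any terminal item triggers a restart. */
        void consume() throws InterruptedException {
            while (running) {
                Item<Resp> item = incoming.take();
                if (item.terminal()) {
                    restartStream(); // placeholder for the provider's (re)connect logic
                    continue;
                }
                handle(item.payload().orElseThrow());
            }
        }

        void restartStream() { /* re-open the gRPC stream, typically with backoff */ }

        void handle(Resp response) { /* process the flag change or sync payload */ }
    }

The important part is that onCompleted is no longer ignored: an error frame that ends the stream "cleanly" still results in a reconnect instead of a silently dead stream.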

@toddbaert toddbaert marked this pull request as draft March 5, 2025 00:17
@toddbaert toddbaert force-pushed the fix/stream-error-handling branch from f5773d7 to 2851ff4 Compare March 5, 2025 02:43
@aepfli (Member) left a comment


I know this is still work in progress, but I added some thoughts; if I find the time, I am more than willing to support here :) This is a great improvement, and I think normalizing our code base and reducing complexity has a big benefit. In the long run, maybe our RPC mode will also use a construct of flag store and connector (not part of this), so we have one unified approach to this :)

@guidobrei (Member) left a comment


I've noticed some Javadoc comments that are affected by the renaming of Grpc* classes to Rpc*. I did not mark all of them.

@toddbaert toddbaert force-pushed the fix/stream-error-handling branch 2 times, most recently from 9412098 to 35bdd6a Compare March 5, 2025 22:27
@toddbaert toddbaert changed the title fix: restore streams on error frames fix: broken reconnect on some HTTP2 error frames Mar 5, 2025
@toddbaert toddbaert marked this pull request as ready for review March 5, 2025 22:32
@toddbaert toddbaert requested a review from lukas-reining March 5, 2025 22:48
@aepfli (Member) left a comment


Thank you. I feel like we are getting closer and closer to an easy-to-use codebase in which we share the same approaches across all the providers. Changing to a queue for RPC as well normalizes things and improves the maintainability of our codebase in the long run.

try {
    metadataResponse = channelConnector.getBlockingStub().getMetadata(metadataRequest.build());
} catch (Exception metaEx) {
    log.error("Metadata exception: {}", metaEx.getMessage(), metaEx);
    context.cancel(metaEx); // cancelling the context surfaces an error on the stream listener
}
Contributor left a comment

Is it OK to proceed into the inner while loop with a canceled context, or should we try to restart (which might cause problems, because we might never reach the inner while loop simply because we cannot query metadata)?

@toddbaert (Member, Author) commented Mar 6, 2025

This shouldn't be different from the previous behavior, as context.cancel() will notify the stream listener and put an error event on it, so we will break out of the inner loop after 1 iteration.

However, this means we can get into a fairly tight loop, starting the stream over and over if the metadata call keeps failing (this is not new). I'm not sure what can be done to improve this: I really do not want to proceed without metadata in case an admin has injected context attributes and there are rules involving them, but I don't want to give up forever on one error either. We could perhaps add a sleep after the metadata error just to make sure the loop isn't super tight... but I don't want to build our own backoff here for this edge case.

I really dislike the fact that the metadata is a separate RPC. We need to improve that.
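Purely for illustration (this is not what the PR implements, and METADATA_RETRY_DELAY_MS is a hypothetical constant), such a pause after a failed metadata call could look roughly like this, reusing the names from the snippet above:

    // Hypothetical: pause briefly after a failed metadata call so the outer
    // reconnect loop doesn't spin tightly. Not part of this PR.
    private static final long METADATA_RETRY_DELAY_MS = 1000;

    try {
        metadataResponse = channelConnector.getBlockingStub().getMetadata(metadataRequest.build());
    } catch (Exception metaEx) {
        log.error("Metadata exception: {}", metaEx.getMessage(), metaEx);
        context.cancel(metaEx); // the stream listener receives the error and the inner loop exits
        try {
            Thread.sleep(METADATA_RETRY_DELAY_MS); // naive fixed delay; a shared backoff would be nicer
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
        }
    }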


@toddbaert toddbaert requested a review from chrfwow March 6, 2025 16:19
@toddbaert toddbaert force-pushed the fix/stream-error-handling branch from 710dc1a to c45bcbe Compare March 6, 2025 16:52
toddbaert and others added 6 commits March 6, 2025 12:50
…s/flagd/resolver/rpc/RpcResolver.java

Co-authored-by: Simon Schrottner <[email protected]>
Signed-off-by: Todd Baert <[email protected]>
Signed-off-by: Todd Baert <[email protected]>
Signed-off-by: Todd Baert <[email protected]>
@toddbaert toddbaert force-pushed the fix/stream-error-handling branch from 6e3c931 to 0ba935d Compare March 6, 2025 17:50
@toddbaert toddbaert merged commit 22d2a35 into main Mar 6, 2025
5 checks passed
@toddbaert toddbaert deleted the fix/stream-error-handling branch March 6, 2025 17:59