Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make cross-sentinel connection failures non-fatal #177

Merged
merged 1 commit into from
Sep 15, 2022

Conversation

wadagso-gertjaap
Copy link
Collaborator

Follow-up on #135 , #167 and #168

This sequence of updates now causes the connection between sentinels, used for requesting attestations from other sentinels, to silently fail. Even though the false return value from tcp_client::init is now considered a warning and the sentinel continues running, the connection is not working because the handler thread never gets started. So now, RPC calls from sentinel to sentinel fail at runtime.

The first sentinel in the startup sequence could potentially have no working connections and get stuck in endless retry loops.

Sentinel to sentinel communication is actually a situation where cluster_connect(endpoints, false) should be used even with a single endpoint, since it's fine for one or more sentinels to be unreachable temporarily (we'll just use another). But because m_server_endpoints.size() <= 1 is used as parameter to cluster_connect that behavior is impossible.

The reason sentinel->sentinel communication uses clusters of 1 is because the sentinel has to control which other sentinels are called to prevent getting more than one attestation from the same sentinel. So we can't just build a cluster of all other sentinels and use send_to_one()

My suggestion, through this PR, is to add an optional boolean overload to init() that allows you to set the error_fatal parameter of cluster_connect manually.

@pr4u4t
Copy link

pr4u4t commented Sep 13, 2022

I've also noticed that +1

Copy link
Member

@metalicjames metalicjames left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concept ACK. Just a question about the implementation.

@wadagso-gertjaap wadagso-gertjaap force-pushed the bugfix-sentinel-attest branch 2 times, most recently from 9f2a30f to 58d002d Compare September 14, 2022 11:13
Copy link
Collaborator

@HalosGhost HalosGhost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following your clarification (and the inclusion of that clarification in the documentation—thank you for that!), this looks good to me. Extends tcp_client::init to take an optional boolean allowing for a little more fine-grained control over when connection errors should be considered fatal.

@metalicjames I'd love your sign-off before merging in case I missed anything.

Copy link
Member

@metalicjames metalicjames left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@HalosGhost HalosGhost merged commit 95c25e2 into mit-dci:trunk Sep 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants