Re-resolve hostnames as fallback when control connection is broken #770
Merged
Conversation
wprzytula force-pushed the reresolve-dns branch 2 times, most recently from fc4af5c to a624165 on July 26, 2023 at 19:00
avelanarius reviewed Jul 27, 2023
Rebased on main.
wprzytula force-pushed the reresolve-dns branch from 914dbd9 to d1e2b92 on August 2, 2023 at 08:02
Rebased on …
piodul reviewed Aug 21, 2023
wprzytula force-pushed the reresolve-dns branch from d1e2b92 to 9659f3d on August 21, 2023 at 12:28
wprzytula force-pushed the reresolve-dns branch from 9659f3d to ac36ac5 on August 22, 2023 at 10:00
The new name highlights the fact that a `ContactPoint` is necessarily a resolved address, in contrast to `KnownNode`, which can contain an unresolved hostname. Additionally, missing docstrings are added for `ContactPoint` and `CloudEndpoint`, and the docstring for `KnownNode` is extended.
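For context, a minimal sketch of the distinction between the two types (variant and field names here are illustrative, not the driver's exact definitions):

```rust
use std::net::SocketAddr;

/// A node the user points the driver at; its address may still require DNS resolution.
/// (Illustrative shape only -- not the driver's exact definition.)
pub enum KnownNode {
    /// A hostname such as "db.example.com:9042", to be resolved before connecting.
    Hostname(String),
    /// An address that needs no resolution.
    Address(SocketAddr),
}

/// A contact point whose address has already been resolved,
/// so the driver can connect to it directly.
pub struct ResolvedContactPoint {
    pub address: SocketAddr,
    pub datacenter: Option<String>,
}
```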
This not only makes `Session::connect()` cleaner, but also enables reusing the same routine in the "fallback" introduced in the next commits.
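For illustration, a rough sketch of what such a shared resolution routine could look like (the function name and signature are hypothetical, not the driver's actual API), using tokio's `lookup_host`:

```rust
use std::net::SocketAddr;
use tokio::net::lookup_host;

/// Resolve "host:port" strings into socket addresses, skipping hosts that fail DNS.
/// Hypothetical helper illustrating the extracted routine; the driver's real code
/// works on `KnownNode`s and also handles cloud endpoints.
async fn resolve_contact_points(hostnames: &[String]) -> Vec<SocketAddr> {
    let mut resolved = Vec::new();
    for host in hostnames {
        match lookup_host(host.as_str()).await {
            Ok(addrs) => resolved.extend(addrs),
            Err(err) => eprintln!("DNS resolution of {host} failed: {err}"),
        }
    }
    resolved
}
```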
The `node` module is the most suitable home for the following structs:
- `KnownNode`
- `ResolvedContactPoint`
- `CloudEndpoint`
As hostname resolution is conceptually tightly connected to nodes, the `node` module suits those routines better than the already overloaded `session` module.
The initially known nodes (i.e., those known before hostname resolution) will be kept there to implement the "fallback" for the case when all known peers change their IPs but at least some of them remain discoverable under the same hostnames.
The code of `read_metadata()` is refactored to prepare for the addition of the fallback logic. A new debug-level log is emitted on a successful metadata fetch. Unfortunately, the new design requires cloning the peers to satisfy the borrow checker; luckily, the clone only happens if the first fetch attempt fails, and it is not particularly expensive.
This is the final step in boosting the driver's robustness in case of sudden IP changes in the cluster. If the control connection fails to fetch metadata and all known peers fail to be queried, the fallback phase is entered: the addresses of all initially known nodes are resolved again and fetching metadata from them is attempted. Until now, if all nodes in the cluster changed their IPs at once (which is particularly easy to hit in a single-node cluster), the driver was unable to contact the cluster ever again. Now, the driver reconnects the control connection on the next metadata fetch attempt (after at most 60 seconds, as configured in `ClusterWorker`), fetches the new topology, discovers all nodes' addresses and finally connects to them. The next commits shorten this downtime from 60 seconds down to roughly 1 second by reacting immediately to the control connection being broken.
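Roughly, the fallback can be pictured like this (a self-contained sketch with made-up names and stubbed-out fetching, not the driver's actual `MetadataReader` code): first try the currently known peers, and only if every one of them fails, re-resolve the initially provided hostnames and retry against the fresh addresses.

```rust
use std::net::SocketAddr;
use tokio::net::lookup_host;

// Placeholder types for the sketch.
struct Metadata;
struct FetchError;

/// Hypothetical reader state: the currently known peers plus the hostnames
/// ("host:port" strings here for simplicity) the driver was started with.
struct MetadataReaderSketch {
    known_peers: Vec<SocketAddr>,
    initial_hostnames: Vec<String>,
}

impl MetadataReaderSketch {
    /// Stub standing in for "open a connection and query the system tables".
    async fn fetch_from(&self, _addr: SocketAddr) -> Result<Metadata, FetchError> {
        Err(FetchError)
    }

    async fn read_metadata(&self) -> Result<Metadata, FetchError> {
        // First pass: the peers we already know about.
        for &peer in &self.known_peers {
            if let Ok(metadata) = self.fetch_from(peer).await {
                return Ok(metadata);
            }
        }
        // Fallback: every known peer failed, so their IPs may have changed.
        // Re-resolve the initially provided hostnames and try the fresh addresses.
        for host in &self.initial_hostnames {
            if let Ok(addrs) = lookup_host(host.as_str()).await {
                for addr in addrs {
                    if let Ok(metadata) = self.fetch_from(addr).await {
                        return Ok(metadata);
                    }
                }
            }
        }
        Err(FetchError)
    }
}
```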
If the control connection is faulty and hence a metadata fetch fails, it is advisable that further attempts to reconnect and fetch take place more frequently. The motivation: if the control connection fails, it is possible that the node has changed its IP, so we need to fetch new metadata ASAP to discover its new address. Therefore, `ClusterWorker`'s sleep time is changed from 60 seconds to 1 second once a metadata fetch fails, and is only reverted back to 60 seconds after a fetch succeeds. This is still not good enough: if all nodes change their IPs at once, we will discover them only after the next metadata fetch is issued, which may happen only after 60 seconds (if the previous fetch succeeded). Hence, the next commit introduces immediate signalling that the control connection is broken, so that `ClusterWorker` instantly begins its 1-second-retry phase.
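A sketch of the adaptive refresh interval described above (the 60-second and 1-second values come from the commit message; the loop structure and names are illustrative):

```rust
use std::time::Duration;
use tokio::time::sleep;

const NORMAL_REFRESH_INTERVAL: Duration = Duration::from_secs(60);
const FAILED_REFRESH_INTERVAL: Duration = Duration::from_secs(1);

/// Illustrative worker loop: after a failed metadata fetch, retry every second;
/// once a fetch succeeds, go back to the regular 60-second cadence.
async fn cluster_worker_loop(mut fetch_metadata: impl FnMut() -> bool) {
    let mut sleep_duration = NORMAL_REFRESH_INTERVAL;
    loop {
        sleep(sleep_duration).await;
        sleep_duration = if fetch_metadata() {
            NORMAL_REFRESH_INTERVAL
        } else {
            FAILED_REFRESH_INTERVAL
        };
    }
}
```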
wprzytula force-pushed the reresolve-dns branch from ac36ac5 to 94c50cc on August 28, 2023 at 07:38
@piodul Rebased on main.
Until now, if all nodes changed their IPs at once, we would discover them only after the next metadata fetch was issued, which might happen only after 60 seconds (if the previous fetch succeeded). This commit introduces immediate signalling that the control connection got broken, so that `ClusterWorker` instantly begins its 1-second-retry phase. In manual tests this proved to be very robust: immediately after the control connection is lost, the reconnect-attempt phase begins. As soon as any node (one known initially or from recently fetched metadata) becomes reachable, a control connection is opened and metadata is fetched successfully, so the whole cluster becomes discoverable.
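One way to picture the signalling (a sketch based on a tokio channel; the driver's actual mechanism may differ): the control-connection task sends a message the moment the connection breaks, and the worker's select loop reacts immediately instead of waiting out its sleep.

```rust
use std::time::Duration;
use tokio::sync::mpsc;
use tokio::time::sleep;

/// Sketch: the control-connection task owns the sending half and fires a message
/// the moment the connection breaks; the worker reacts at once instead of waiting
/// out the remainder of its refresh sleep.
async fn worker_loop(mut control_connection_broken: mpsc::Receiver<()>) {
    let mut refresh_interval = Duration::from_secs(60);
    loop {
        tokio::select! {
            _ = sleep(refresh_interval) => {
                // Regular, timer-driven refresh tick.
            }
            Some(()) = control_connection_broken.recv() => {
                // The control connection just broke: refresh right away and switch
                // to the 1-second retry cadence until a fetch succeeds again.
                refresh_interval = Duration::from_secs(1);
            }
        }
        // ... attempt a metadata fetch here; restore the 60-second interval on success ...
    }
}
```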
This is a refactor that reduces the number of parameters needed for `MetadataReader::new()`.
wprzytula force-pushed the reresolve-dns branch from 94c50cc to d1c4c48 on August 28, 2023 at 07:45
piodul approved these changes Aug 28, 2023
Motivation
See the referenced issue (#756).
What's done
Main matter
- A hostname re-resolution fallback is added to `read_metadata()`. This is the crucial step in boosting the driver's robustness in case of sudden IP changes in the cluster. If the control connection fails to fetch metadata and all known peers fail to be queried, the fallback phase is entered: the addresses of all initially known nodes are resolved again and fetching metadata from them is attempted. Until now, if all nodes in the cluster changed their IPs at once (which is particularly easy to hit in a single-node cluster), the driver was unable to contact the cluster ever again. Now, the driver reconnects the control connection on the next metadata fetch attempt (after at most 60 seconds, as configured in `ClusterWorker`), fetches the new topology, discovers all nodes' addresses and finally connects to them.
- If the control connection is faulty and hence a metadata fetch fails, it is advisable that further attempts to reconnect and fetch take place more frequently. The motivation: if the control connection fails, it is possible that the node has changed its IP, so we need to fetch new metadata ASAP to discover its new address. Therefore, `ClusterWorker`'s sleep time is changed from 60 seconds to 1 second once a metadata fetch fails, and is only reverted back to 60 seconds after a fetch succeeds.
- Until now, if the control connection node changed its IP, we would discover that node only after the next metadata fetch was issued, which would possibly happen only after 60 seconds (if the previous fetch succeeded). To make the rediscovery faster, immediate signalling that the control connection got broken is introduced, so that `ClusterWorker` instantly begins its 1-second-retry phase.

Bonus
- `ContactPoint` is renamed to `ResolvedContactPoint`, to stress its difference from `KnownNode` (which is essentially an unresolved contact point).
- The related structs (`ResolvedContactPoint`, `KnownNode`, `CloudEndpoint`) and hostname resolution-related routines are moved from the `cluster` and `session` modules to the `node` module. This contributes to making the `session` module cleaner, as well as keeping those entities in the most conceptually related module.

Evaluation
Manual tests were conducted. They were identical to the ones performed before merging the analogous enhancement to gocql. Both single- and multi-node clusters were tested.
The tests involved random connection breaks between arbitrary nodes and the driver, as well as simultaneous changes of the IPs of all the nodes in the cluster.
The implemented solution proved to be very robust in the tests: immediately after the control connection is lost, the reconnect-attempt phase begins. As soon as any node (either one known initially or one known from recently fetched metadata) becomes reachable, a control connection is opened and metadata is fetched successfully, so the whole cluster becomes discoverable. The driver's behaviour was stable, with no fluctuations over time.
Fixes: #756
Refs: #386
Pre-review checklist
- [ ] I added relevant tests for new features and bug fixes.
- [ ] I have provided docstrings for the public items that I want to introduce.
- [ ] I have adjusted the documentation in ./docs/source/.
- [ ] I added appropriate `Fixes:` annotations to PR description.