-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor to obtain timeout from OperationContext for server selection #1209
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems a little more invasive than I'd originally hoped.
Anyway to reduce / refactor some of that complexity out?
driver-core/src/main/com/mongodb/internal/connection/BaseCluster.java
Outdated
Show resolved
Hide resolved
@@ -331,8 +326,8 @@ private void logServerSelectionFailure(final ServerSelector serverSelector, | |||
|
|||
@Nullable | |||
private ServerTuple selectServer(final ServerSelector serverSelector, | |||
final ClusterDescription clusterDescription) { | |||
return selectServer(serverSelector, clusterDescription, this::getServer); | |||
final ClusterDescription clusterDescription, final OperationContext operationContext) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does getServer need a OpContext?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nm. LoadBalancerCluster does server selection for get server.
@@ -462,7 +457,7 @@ private void notifyWaitQueueHandler(final ServerSelectionRequest request) { | |||
waitQueue.add(request); | |||
|
|||
if (waitQueueHandler == null) { | |||
waitQueueHandler = new Thread(new WaitQueueHandler(), "cluster-" + clusterId.getValue()); | |||
waitQueueHandler = new Thread(new WaitQueueHandler(operationContext), "cluster-" + clusterId.getValue()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks wrong - as the waitQueueHandler only gets the first operationContext?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, I'll have a closer look. It seems this doesn't change the current logic, but it will once there's an active timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I rebased the original 3 commits, after merging master. This removed getDescription methods.)
It looks like there is a bug [edit: not actually a bug] in our server selection code. On master, the waitForSrv
method initiates a server selection timeout. However, this method is invoked by getServer, which is invoked from BaseCluster (handleServerSelectionRequest...), and ultimately this call chain comes to where a new ServerSelectionRequest is created. This creation has its own server selection start time (assigned to a field). It is this original server selection timeout that should be used by waitForSrv
, but instead, it creates its own, which could be significantly later than the original start.
While making these changes, I also removed the StartTime from timeout error messages (instead of threading through that startTime). This part of the message will be removed for standardized logging.
One moral is to be careful if one sees an OperationContext being used to create a timeout, but then being passed to downstream methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The situation above, with double timeouts, never arises because the affected getServer in LoadBalancedCluster is never invoked by BaseCluster (LBC not a subclass of BC).
(But the refactoring is still needed, to pass the operation context/timeout.)
# Conflicts: # driver-core/src/test/unit/com/mongodb/internal/connection/LoadBalancedClusterTest.java
@katcharov let me know if this is ready for re-review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
JAVA-5184
Refactoring in advance of JAVA-4065.
In JAVA-5175, we used a timeout, but obtained from the existing settings. Here, we obtain that timeout from the OperationContext added in JAVA-5170.
Also fixes flaky tests.