Refactor to obtain timeout from OperationContext for server selection #1209

katcharov · 2023-09-29T13:59:59Z

Refactoring in advance of JAVA-4065.

In JAVA-5175, we used a timeout, but obtained it from the existing settings. Here, we obtain that timeout from the OperationContext added in JAVA-5170.

Also fixes flaky tests.

rozza

Seems a little more invasive than I'd originally hoped.

Any way to reduce or refactor out some of that complexity?

rozza · 2023-10-02T11:39:07Z

driver-core/src/main/com/mongodb/internal/connection/BaseCluster.java

@@ -331,8 +326,8 @@ private void logServerSelectionFailure(final ServerSelector serverSelector,

    @Nullable
    private ServerTuple selectServer(final ServerSelector serverSelector,
-            final ClusterDescription clusterDescription) {
-        return selectServer(serverSelector, clusterDescription, this::getServer);
+            final ClusterDescription clusterDescription, final OperationContext operationContext) {


Does getServer need an OperationContext?

Never mind. LoadBalancedCluster does server selection in getServer.

rozza · 2023-10-02T12:52:44Z

driver-core/src/main/com/mongodb/internal/connection/BaseCluster.java

@@ -462,7 +457,7 @@ private void notifyWaitQueueHandler(final ServerSelectionRequest request) {
            waitQueue.add(request);

            if (waitQueueHandler == null) {
-                waitQueueHandler = new Thread(new WaitQueueHandler(), "cluster-" + clusterId.getValue());
+                waitQueueHandler = new Thread(new WaitQueueHandler(operationContext), "cluster-" + clusterId.getValue());


This looks wrong: the waitQueueHandler thread is only created once, so it only gets the first operationContext?

Good catch, I'll have a closer look. It seems this doesn't change the current logic, but it will once there's an active timeout.
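To illustrate the concern, here is a minimal, single-threaded sketch. It is not the driver's code: `OpContext`, `Request`, and both `drain` methods are hypothetical stand-ins invented for this example. It only shows why a handler that captures the context at construction time sees the first operation's context for every queued request, and one possible alternative in which each request carries its own context.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Illustration only: OpContext and Request are hypothetical stand-ins,
// not the driver's OperationContext or ServerSelectionRequest.
final class PerRequestContextSketch {
    static final class OpContext {
        final String id;
        OpContext(String id) { this.id = id; }
    }

    static final class Request {
        final OpContext operationContext;
        Request(OpContext operationContext) { this.operationContext = operationContext; }
    }

    // Handler constructed with one context: every queued request is processed
    // under the context of whichever operation happened to create the handler.
    static List<String> drainWithHandlerContext(Queue<Request> queue, OpContext handlerContext) {
        List<String> used = new ArrayList<>();
        while (queue.poll() != null) {
            used.add(handlerContext.id);
        }
        return used;
    }

    // Context carried on the request: each request is processed under its own context.
    static List<String> drainWithRequestContext(Queue<Request> queue) {
        List<String> used = new ArrayList<>();
        for (Request r = queue.poll(); r != null; r = queue.poll()) {
            used.add(r.operationContext.id);
        }
        return used;
    }

    // Builds a queue holding requests from two different operations.
    static Queue<Request> twoRequests() {
        Queue<Request> queue = new ArrayDeque<>();
        queue.add(new Request(new OpContext("op-1")));
        queue.add(new Request(new OpContext("op-2")));
        return queue;
    }

    public static void main(String[] args) {
        // prints [op-1, op-1]: the second request is handled with the first context
        System.out.println(drainWithHandlerContext(twoRequests(), new OpContext("op-1")));
        // prints [op-1, op-2]: each request keeps its own context
        System.out.println(drainWithRequestContext(twoRequests()));
    }
}
```

With no active timeout the two variants behave the same, which matches the observation that the current logic is unchanged; the difference would only matter once the context carries a per-operation deadline.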

(I rebased the original 3 commits, after merging master. This removed getDescription methods.)

It looks like there is a bug [edit: not actually a bug] in our server selection code. On master, the waitForSrv method starts its own server selection timeout. However, waitForSrv is invoked by getServer, which is invoked from BaseCluster (handleServerSelectionRequest...), and this call chain ultimately traces back to where a new ServerSelectionRequest is created. That request records its own server selection start time in a field. waitForSrv should use this original server selection timeout, but instead it creates its own, which could start significantly later than the original.

While making these changes, I also removed the startTime from timeout error messages (instead of threading startTime through). This part of the message will be removed for standardized logging.

One moral: be careful when an OperationContext is used to create a timeout and is then also passed to downstream methods, which may start a timeout of their own.
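The double-timeout hazard can be shown with a small sketch. This is not the driver's implementation: `SelectionDeadline` and both `downstreamBudget` methods are hypothetical names made up for this example, and time is passed in explicitly so the arithmetic is deterministic. It assumes a 30-second server selection timeout and a downstream call that is reached 20 seconds after selection started.

```java
import java.util.concurrent.TimeUnit;

final class DoubleTimeoutSketch {
    // Problematic pattern: the downstream method starts a fresh timeout,
    // so its budget ignores the time already spent upstream.
    static long downstreamBudgetRestarted(long nowNanos, long timeoutMs) {
        return SelectionDeadline.startingAt(nowNanos, timeoutMs).remainingMs(nowNanos);
    }

    // Safer pattern: the downstream method reuses the deadline that was
    // started once, when server selection began.
    static long downstreamBudgetShared(SelectionDeadline original, long nowNanos) {
        return original.remainingMs(nowNanos);
    }

    public static void main(String[] args) {
        long timeoutMs = 30_000;
        SelectionDeadline original = SelectionDeadline.startingAt(0L, timeoutMs);
        long later = TimeUnit.SECONDS.toNanos(20); // downstream call happens 20 s in

        System.out.println(downstreamBudgetRestarted(later, timeoutMs)); // prints 30000
        System.out.println(downstreamBudgetShared(original, later));     // prints 10000
    }
}

// Hypothetical stand-in for a timeout that is started once per operation;
// it is not the driver's TimeoutContext.
final class SelectionDeadline {
    private final long deadlineNanos;

    private SelectionDeadline(long deadlineNanos) {
        this.deadlineNanos = deadlineNanos;
    }

    static SelectionDeadline startingAt(long nowNanos, long timeoutMs) {
        return new SelectionDeadline(nowNanos + TimeUnit.MILLISECONDS.toNanos(timeoutMs));
    }

    // Remaining budget in milliseconds, never negative.
    long remainingMs(long nowNanos) {
        return Math.max(0, TimeUnit.NANOSECONDS.toMillis(deadlineNanos - nowNanos));
    }
}
```

In the restarted variant the operation as a whole could wait up to 50 seconds against a 30-second setting, which is the kind of drift described above.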

The situation above, with double timeouts, never arises, because the affected getServer in LoadBalancedCluster is never invoked by BaseCluster (LoadBalancedCluster is not a subclass of BaseCluster).

(But the refactoring is still needed, to pass the operation context/timeout.)

# Conflicts: # driver-core/src/test/unit/com/mongodb/internal/connection/LoadBalancedClusterTest.java

rozza · 2023-10-09T15:25:24Z

@katcharov let me know if this is ready for re-review

rozza

LGTM!

katcharov requested a review from rozza September 29, 2023 13:59

rozza reviewed Oct 2, 2023

katcharov added 3 commits October 4, 2023 16:57

Fix flaky timeout tests

a064b77

# Conflicts: # driver-core/src/test/unit/com/mongodb/internal/connection/LoadBalancedClusterTest.java

Pass operationContext through to startServerSelectionTimeout

eaa1e0e

Move start timeout methods to TimeoutContext

7259571

katcharov force-pushed the JAVA-5184 branch from ca8c79c to 7259571 Compare October 4, 2023 23:05

Fix duplicated timeout, remove startTime from messages

f8df257

katcharov requested a review from rozza October 9, 2023 22:44

rozza approved these changes Oct 10, 2023

katcharov merged commit 4b002c0 into mongodb:CSOT Oct 12, 2023

katcharov deleted the JAVA-5184 branch October 12, 2023 14:04
