Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor to obtain timeout from OperationContext for server selection #1209

Merged
merged 4 commits into from
Oct 12, 2023

Conversation

katcharov
Copy link
Contributor

JAVA-5184

Refactoring in advance of JAVA-4065.

In JAVA-5175, we used a timeout, but obtained from the existing settings. Here, we obtain that timeout from the OperationContext added in JAVA-5170.

Also fixes flaky tests.

@katcharov katcharov requested a review from rozza September 29, 2023 13:59
Copy link
Member

@rozza rozza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems a little more invasive than I'd originally hoped.

Anyway to reduce / refactor some of that complexity out?

@@ -331,8 +326,8 @@ private void logServerSelectionFailure(final ServerSelector serverSelector,

@Nullable
private ServerTuple selectServer(final ServerSelector serverSelector,
final ClusterDescription clusterDescription) {
return selectServer(serverSelector, clusterDescription, this::getServer);
final ClusterDescription clusterDescription, final OperationContext operationContext) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does getServer need a OpContext?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nm. LoadBalancerCluster does server selection for get server.

@@ -462,7 +457,7 @@ private void notifyWaitQueueHandler(final ServerSelectionRequest request) {
waitQueue.add(request);

if (waitQueueHandler == null) {
waitQueueHandler = new Thread(new WaitQueueHandler(), "cluster-" + clusterId.getValue());
waitQueueHandler = new Thread(new WaitQueueHandler(operationContext), "cluster-" + clusterId.getValue());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks wrong - as the waitQueueHandler only gets the first operationContext?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, I'll have a closer look. It seems this doesn't change the current logic, but it will once there's an active timeout.

Copy link
Contributor Author

@katcharov katcharov Oct 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I rebased the original 3 commits, after merging master. This removed getDescription methods.)

It looks like there is a bug [edit: not actually a bug] in our server selection code. On master, the waitForSrv method initiates a server selection timeout. However, this method is invoked by getServer, which is invoked from BaseCluster (handleServerSelectionRequest...), and ultimately this call chain comes to where a new ServerSelectionRequest is created. This creation has its own server selection start time (assigned to a field). It is this original server selection timeout that should be used by waitForSrv, but instead, it creates its own, which could be significantly later than the original start.

While making these changes, I also removed the StartTime from timeout error messages (instead of threading through that startTime). This part of the message will be removed for standardized logging.

One moral is to be careful if one sees an OperationContext being used to create a timeout, but then being passed to downstream methods.

Copy link
Contributor Author

@katcharov katcharov Oct 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The situation above, with double timeouts, never arises because the affected getServer in LoadBalancedCluster is never invoked by BaseCluster (LBC not a subclass of BC).

(But the refactoring is still needed, to pass the operation context/timeout.)

# Conflicts:
#	driver-core/src/test/unit/com/mongodb/internal/connection/LoadBalancedClusterTest.java
@rozza
Copy link
Member

rozza commented Oct 9, 2023

@katcharov let me know if this is ready for re-review

@katcharov katcharov requested a review from rozza October 9, 2023 22:44
Copy link
Member

@rozza rozza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@katcharov katcharov merged commit 4b002c0 into mongodb:CSOT Oct 12, 2023
@katcharov katcharov deleted the JAVA-5184 branch October 12, 2023 14:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants