GODRIVER-2335 Preemptively cancel in-progress operations when SDAM heartbeats time out. #1423
Conversation
API Change Report: ./event, compatible changes: PoolEvent.Interruption: added
	}
	threadOp := new(operation)
	if err := operationRaw.Unmarshal(threadOp); err != nil {
		return fmt.Errorf("error unmarshalling 'operation' argument: %v", err)
Suggested change:
-	return fmt.Errorf("error unmarshalling 'operation' argument: %v", err)
+	return fmt.Errorf("error unmarshaling 'operation' argument: %v", err)
That is what confuses me. I believe we've talked about its spelling in en-US vs. en-GB, but I still see a lot of "marshalling" in our code. Maybe it's worth an orthography PR.
	select {
	case <-ch:
		return nil
	case <-time.After(10 * time.Second):
What is the significance of 10 seconds here? Can we constantize this value, akin to waitForEventTimeout?
According to the specs:
If the waitForThread operation is not satisfied after 10 seconds, this operation MUST cause a test failure.
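For readability, that spec requirement could be captured in a named constant, akin to waitForEventTimeout. A minimal sketch, with a constant and helper name of my own choosing rather than the PR's actual code:

// waitForThreadTimeout mirrors the unified test format requirement that a
// waitForThread operation must fail the test after 10 seconds.
const waitForThreadTimeout = 10 * time.Second

// waitForThread blocks until the thread signals completion on ch or the
// spec-mandated timeout elapses.
func waitForThread(ch <-chan error) error {
	select {
	case err := <-ch:
		return err
	case <-time.After(waitForThreadTimeout):
		return fmt.Errorf("timed out after %v waiting for thread to complete", waitForThreadTimeout)
	}
}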
@@ -32,6 +32,10 @@ var (
	"Write commands with snapshot session do not affect snapshot reads": "Test fails frequently. See GODRIVER-2843",
	// TODO(GODRIVER-2943): Fix and unskip this test case.
	"Topology lifecycle": "Test times out. See GODRIVER-2943",
	"Connection pool clear uses interruptInUseConnections=true after monitor timeout": "Godriver clears after multiple timeout",
Can we add a comment above these tests with a TODO to resolve or a more concise explanation as to why we are skipping them?
Not running these tests avoids ever executing the runOnThread logic.
@qingyang-hu So is the reason we skip these that the cases don't apply to the Go Driver?
Correct.
@@ -568,6 +568,8 @@ func setClientOptionsFromURIOptions(clientOpts *options.ClientOptions, uriOpts b
	switch strings.ToLower(key) {
	case "appname":
		clientOpts.SetAppName(value.(string))
	case "connecttimeoutms":
		clientOpts.SetConnectTimeout(time.Duration(value.(int32)) * time.Microsecond)
Should this conversion be time.Millisecond? connectTimeoutMS is a millisecond value.
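For reference, the corrected case arm would presumably look like the excerpt below. This is a sketch only; it assumes the value arrives as an int32 count of milliseconds, as the option name suggests.

	case "connecttimeoutms":
		// connectTimeoutMS is expressed in milliseconds, so scale by
		// time.Millisecond rather than time.Microsecond.
		clientOpts.SetConnectTimeout(time.Duration(value.(int32)) * time.Millisecond)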
@@ -48,6 +48,8 @@ type connection struct {
	// - atomic bug: https://pkg.go.dev/sync/atomic#pkg-note-BUG
	// - suggested layout: https://go101.org/article/memory-layout.html
	state int64
	inUse bool
	err   error
Is there any concern about these values being written/read concurrently? Do we need to have an error here? The only place this error is set is in clearAll, where it is poolClearedError{err: fmt.Errorf("interrupted"), address: p.address}.
I don't see a concurrency issue in the current code. inUse is set in checkin and checkout, and the value is only read by the interruption path between checkin and checkout.
The err is optional, merely to indicate the ConnectionError from reading/writing more clearly.
The clear or clearAll method may be called concurrently with checkIn/checkOut, leading to a concurrent read/write of inUse.
Some possible solutions:
1. Make inUse an atomic.Bool (sketched below).
2. Keep a set of in-use connections, similar to how we keep a set of all connections and idle connections.
3. Close all connections if interruptInUseConnections=true (i.e. don't try to distinguish between in-use and idle).
(1) is the smallest change from the current implementation. However, I think we should seriously consider (3) because it doesn't seem like the runtime optimization of lazy-closing idle connections (when we're about to force-close in-use connections) is worth the significant increase in code complexity necessary to keep track of which connections are in use.
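A rough sketch of option (1), assuming Go 1.19+ for sync/atomic's Bool type; the helper names are illustrative, not the PR's actual code:

import "sync/atomic"

type connection struct {
	// ...existing fields...

	// inUse is written by checkOut/checkIn and read by the pool-clearing
	// path, so it must be safe for concurrent access.
	inUse atomic.Bool
}

// markInUse and isInUse are hypothetical helpers showing the call sites.
func (c *connection) markInUse(v bool) { c.inUse.Store(v) }
func (c *connection) isInUse() bool    { return c.inUse.Load() }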
x/mongo/driver/topology/pool.go
Outdated
	p.clearImpl(err, serviceID, nil)
}

func (p *pool) clearImpl(err error, serviceID *primitive.ObjectID, interruptionCallback func()) {
Is there a need to do this with a callback? There seems to be only one implementation of the callback, so I think it would be more robust to combine all of this logic into one function, which would also be closer to the logic in the specification:
clear(interruptInUseConnections: Optional<Boolean>): void;
Something like this:
func (p *pool) clear(err error, serviceID *primitive.ObjectID, interruptInUseConnections bool) {
	// (existing logic)
	p.removePerishedConns()
	if interruptInUseConnections {
		interuptConnections(p)
	}
	// (continue existing logic)
}
Where interuptConnections is this:
func interuptConnections(p *pool) {
	for _, conn := range p.conns {
		if !conn.inUse || !p.stale(conn) {
			continue
		}
		_ = conn.closeWithErr(poolClearedError{
			err:     fmt.Errorf("interrupted"),
			address: p.address,
		})
		_ = p.checkInWithCallback(conn, func() (reason, bool) {
			if mustLogPoolMessage(p) {
				keysAndValues := logger.KeyValues{
					logger.KeyDriverConnectionID, conn.driverConnectionID,
				}
				logPoolMessage(p, logger.ConnectionCheckedIn, keysAndValues...)
			}
			if p.monitor != nil {
				p.monitor.Event(&event.PoolEvent{
					Type:         event.ConnectionCheckedIn,
					ConnectionID: conn.driverConnectionID,
					Address:      conn.addr.String(),
				})
			}
			r, ok := connectionPerished(conn)
			if ok {
				r = reason{
					loggerConn: logger.ReasonConnClosedStale,
					event:      event.ReasonStale,
				}
			}
			return r, ok
		})
	}
}
You are right. We can avoid the callback for cleaner code. However, I'd like to keep both clear() and clearAll() to correspond with the optional interruptInUseConnections flag in the specs, considering the lack of an idiom for optional function parameters in Go. Moreover, interruption is only used a couple of times, while clear() already appears in many places, so it does not seem like a good idea to pass interruptInUseConnections everywhere clear() is called.
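For illustration, the shape being agreed on might look roughly like this, assuming clearImpl is changed to take a bool rather than a callback; the bodies are a sketch, not the final code:

// clear keeps its existing signature for the many current call sites.
func (p *pool) clear(err error, serviceID *primitive.ObjectID) {
	p.clearImpl(err, serviceID, false)
}

// clearAll additionally interrupts connections that are checked out,
// matching the spec's optional interruptInUseConnections flag.
func (p *pool) clearAll(err error, serviceID *primitive.ObjectID) {
	p.clearImpl(err, serviceID, true)
}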
@@ -32,6 +32,11 @@ var (
	"Write commands with snapshot session do not affect snapshot reads": "Test fails frequently. See GODRIVER-2843",
	// TODO(GODRIVER-2943): Fix and unskip this test case.
	"Topology lifecycle": "Test times out. See GODRIVER-2943",
	// The current logic, which was implemented with GODRIVER-2577, only clears pools and cancels in-progress ops if
	// the heartbeat fails twice. Therefore, we skip the following spec tests, which require canceling ops immediately.
	"Connection pool clear uses interruptInUseConnections=true after monitor timeout": "Godriver clears after multiple timeout",
Since we skip these spec tests, the new "cancel in-progress operations" feature seems untested by any new or existing tests. We should add tests that make sure the feature works.
I'm going to update the test cases for GODRIVER-2577 in server_test.go to sync up with the changes.
x/mongo/driver/topology/pool.go
Outdated
w.connOpts = append(w.connOpts, func(cfg *connectionConfig) {
	cfg.inUse = true
})
Setting inUse here via a connection option seems redundant with setting it in the defer func() above. Why do we need to do it in both places?
x/mongo/driver/topology/pool.go
Outdated
if conn.inUse && p.stale(conn) {
	_ = conn.closeWithErr(poolClearedError{
Optional: Consider inverting this logic and de-indenting the code below.
E.g.
if !conn.inUse || !p.stale(conn) {
	continue
}
_ = conn.closeWithErr(...)
...
@@ -825,12 +861,58 @@ func (p *pool) checkInNoEvent(conn *connection) error {
	return nil
}

// clearAll does the same as the "clear" method and interrupts all in-use connections as well.
func (p *pool) clearAll(err error, serviceID *primitive.ObjectID) {
Optional: Consider the more accurate name clearInUse.
Note: this recommendation only applies if we're actually trying to clear only in-use connections. If we decide to clear all connections, then this name makes sense.
c.closeConnectContext()
c.wait() // Make sure that the connection has finished connecting.
It's somewhat surprising that closeWithErr waits for the connection to finish establishing while close does not. Should we move these to close?
go func(op *operation) {
	err := op.execute(ctx, loopDone)
	ch <- err
}(threadOp)
This style of running operations on a "thread" doesn't guarantee that operations will be run in the specified order. The runOnThread operation in the unified test format spec fails to mention that property, but my understanding is that unified spec test runners must preserve the order of ops run in a "thread".
The legacy "runOnThread" implementation here uses a job queue per "thread". Consider using a similar approach and/or copying the code from the legacy spec test runner.
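A sketch of a per-"thread" job queue that preserves submission order, in the spirit of the legacy runner; the testThread type and its methods are hypothetical names, not code from this PR:

type testThread struct {
	jobs chan func()
	done chan struct{}
}

func newTestThread() *testThread {
	t := &testThread{
		jobs: make(chan func(), 16),
		done: make(chan struct{}),
	}
	go func() {
		defer close(t.done)
		// Run queued operations one at a time, in the order they were enqueued.
		for job := range t.jobs {
			job()
		}
	}()
	return t
}

// run enqueues an operation; operations on the same thread never interleave.
func (t *testThread) run(job func()) { t.jobs <- job }

// stop closes the queue and waits for all queued operations to finish.
func (t *testThread) stop() {
	close(t.jobs)
	<-t.done
}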
x/mongo/driver/topology/pool.go
Outdated
if atomic.LoadInt64(&p.generation.state) == generationDisconnected {
	return true
}
Optional: This logic isn't introduced here, but it is confusing. The generation state appears to be "disconnected" only when the pool is "closed", and it is only used in pool.stale. Consider replacing this line with a check for whether the pool is closed and removing the state field from poolGenerationMap.
E.g.
p.stateMu.RLock()
if p.state == poolClosed {
	p.stateMu.RUnlock()
	return true
}
p.stateMu.RUnlock()
The p.generation is a substitute for the pool state because p.generation.disconnect() is called in *pool.close(), and locking stateMu there could cause a deadlock.
x/mongo/driver/topology/pool.go
Outdated
	err:     fmt.Errorf("interrupted"),
	address: p.address,
})
_ = p.checkInWithCallback(conn, func() (reason, bool) {
Checking in the connection to close it can be dangerous. If there is ever a case where the logic to check for perished connections has a bug, we may check-in an in-use connection, which could lead to data corruption. We should close the connections directly here and not rely on check-in to close them.
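A rough sketch of that suggestion, closing interrupted connections directly instead of routing them through check-in; untrackInterrupted stands in for whatever bookkeeping the pool actually needs and is not a real helper in the driver:

for _, conn := range p.conns {
	if !conn.inUse || !p.stale(conn) {
		continue
	}
	// Close directly; never hand an in-use connection back through check-in.
	_ = conn.closeWithErr(poolClearedError{
		err:     fmt.Errorf("interrupted"),
		address: p.address,
	})
	// Hypothetical: remove the connection from the pool's tracking structures
	// here rather than relying on check-in's perished-connection path.
	p.untrackInterrupted(conn)
}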
x/mongo/driver/topology/pool.go
Outdated
	logPoolMessage(p, logger.ConnectionCheckedIn, keysAndValues...)
}

if p.monitor != nil {
	p.monitor.Event(&event.PoolEvent{
		Type: event.ConnectionCheckedIn,
I don't think the spec requires that check-in logs/events are emitted when closing in-use connections. If we stop using check-in to close connections (recommended above), we should also not emit these logs/events.
x/mongo/driver/topology/pool.go
Outdated
r, perished := connectionPerished(conn)
if !perished && conn.pool.getState() == poolClosed {
	perished = true
	r = reason{
		loggerConn: logger.ReasonConnClosedPoolClosed,
		event:      event.ReasonPoolClosed,
	}
}
return r, perished
I'm concerned about the amount of duplicate code the checkInWithCallback function creates. There are currently multiple sources of truth for check-in logging, events, and perished-connection behavior, depending on which function is called. We should find a way to remove the duplicated code.
x/mongo/driver/topology/pool.go
Outdated
event := &event.PoolEvent{
	Type:      event.PoolCleared,
	Address:   p.address.String(),
	ServiceID: serviceID,
	Error:     err,
}
if interruptAllConnections {
	event.Interruption = true
}
Optional: No need for a conditional block here.
Suggested change:
event := &event.PoolEvent{
	Type:         event.PoolCleared,
	Address:      p.address.String(),
	ServiceID:    serviceID,
	Error:        err,
	Interruption: interruptAllConnections,
}
generation, _ := server.pool.generation.getGeneration(&serviceID)
assert.Eventuallyf(t,
	func() bool {
		generation := server.pool.generation.getGeneration(&serviceID)
		generation, _ := server.pool.generation.getGeneration(&serviceID)
		numConns := server.pool.generation.getNumConns(&serviceID)
		return generation == wantGeneration && numConns == wantNumConns
	},
	100*time.Millisecond,
	1*time.Millisecond,
	"expected generation number %v, got %v; expected connection count %v, got %v",
	wantGeneration,
	server.pool.generation.getGeneration(&serviceID),
	generation,
Pre-fetching the generation number will lead to confusing test failure messages because we expect it to be updated concurrently while assert.Eventuallyf is running. We want the error message to show what the generation is when the assertion fails.
A possibly simpler approach is to log the state of the system when the assertion fails. For example:
assert.Eventuallyf(t,
	func() bool {
		generation, _ := server.pool.generation.getGeneration(&serviceID)
		numConns := server.pool.generation.getNumConns(&serviceID)
		match := generation == wantGeneration && numConns == wantNumConns
		if !match {
			t.Logf("Waiting for generation number %v, got %v", wantGeneration, generation)
			t.Logf("Waiting for connection count %v, got %v", wantNumConns, numConns)
		}
		return match
	},
	100*time.Millisecond,
	10*time.Millisecond,
	"expected generation number and connection count never matched")
Note that the example also changes the check interval from 1ms to 10ms to reduce log noise in the case there is a failure.
Looks good! 👍
GODRIVER-2335 Preemptively cancel in-progress operations when SDAM heartbeats time out. (mongodb#1423)
GODRIVER-2335
Summary
Preemptively cancel in-progress operations when SDAM heartbeats time out.
Background & Motivation
ClearAll is provided to interrupt any in-use connections as part of the clearing.
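As a usage illustration (not part of the PR), an application could observe the new PoolEvent.Interruption flag through a PoolMonitor roughly like this; the URI and logging are placeholders:

monitor := &event.PoolMonitor{
	Event: func(evt *event.PoolEvent) {
		// Interruption is set when the pool was cleared with in-use
		// connections interrupted, e.g. after a heartbeat timeout.
		if evt.Type == event.PoolCleared && evt.Interruption {
			log.Printf("pool %s cleared and in-use connections interrupted: %v", evt.Address, evt.Error)
		}
	},
}

client, err := mongo.Connect(context.Background(),
	options.Client().ApplyURI("mongodb://localhost:27017").SetPoolMonitor(monitor))
if err != nil {
	log.Fatal(err)
}
defer func() { _ = client.Disconnect(context.Background()) }()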