smartconnpool: do not allow connections to starve #17675
Conversation
Signed-off-by: Vicent Marti <[email protected]>
Force-pushed from eddc5d7 to e00df5f
Codecov Report
Attention: Patch coverage is …
Additional details and impacted files:
@@            Coverage Diff             @@
##             main   #17675      +/-   ##
==========================================
+ Coverage   67.75%   67.77%   +0.01%
==========================================
  Files        1587     1587
  Lines      255780   255806      +26
==========================================
+ Hits       173310   173369      +59
+ Misses      82470    82437      -33
☔ View full report in Codecov by Sentry.
Can confirm we're not seeing it in our CI either 👍. Thanks @vmg!
Signed-off-by: Dirkjan Bussink <[email protected]>
@@ -161,7 +161,6 @@ type ConnPool[C Connection] struct {
 // The pool must be ConnPool.Open before it can start giving out connections
 func NewPool[C Connection](config *Config[C]) *ConnPool[C] {
 	pool := &ConnPool[C]{}
-	pool.freshSettingsStack.Store(-1)
Was -1 being used as some kind of sentinel value? Or was this completely unnecessary?
It was a sentinel value to mark when no Setting connection had been returned to the pool yet, but it was not an interesting optimization, so I removed it.
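For context, here is a minimal sketch of the sentinel pattern being discussed. It is not the actual Vitess implementation: apart from freshSettingsStack, the names (pool, newPool, pickStack) are hypothetical, and the sketch only illustrates how an atomic index initialized to -1 can stand for "no Setting connection returned to the pool yet".

```go
package main

import (
	"fmt"
	"sync/atomic"
)

type pool struct {
	// Index of the stack that most recently received a connection with
	// Settings applied; -1 is the sentinel for "none returned yet".
	freshSettingsStack atomic.Int64
}

func newPool() *pool {
	p := &pool{}
	p.freshSettingsStack.Store(-1) // the initialization removed by this PR
	return p
}

func (p *pool) pickStack() int64 {
	if idx := p.freshSettingsStack.Load(); idx >= 0 {
		return idx // prefer the freshest stack
	}
	return 0 // sentinel hit: fall back to scanning from the first stack
}

func main() {
	p := newPool()
	fmt.Println("stack to try first:", p.pickStack())
}
```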
Description
This is a potential fix for the issue described in #17662. The bug report has a very comprehensive explanation of the bug (thank you!!), so I'll stick to explaining the fix here.
I considered introducing stronger synchronization to the pool to prevent the race, but decided against it because the race does not affect correctness: connections in the pool would only starve in situations where the service simply stops receiving new requests over time. As long as the pool periodically receives connection requests, no connection can starve, because the clients that raced with other clients and are stuck in the waitlist will wake up right away when new connections come in. I don't think persistently slowing down the pool to handle this corner case is worth it.
Hence, my proposed solution reuses the expiry loop (the periodic loop that checks whether the requests in the waitlist have expired naturally on the client side) to detect connections that could potentially be starving. This will not be a common occurrence, but when it happens, we can simply force a cycle of the pool (pull connections from the stacks and hand them over to the waiters), which will always wake up any starving connections.
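To make that concrete, here is a minimal sketch of the idea, not the actual ConnPool code: the idle and waiters channels, forceCycle, and expiryLoop are hypothetical stand-ins for the pool's stacks, waitlist, and existing expiry loop, with the 100ms ticker matching the frequency mentioned below.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

type conn struct{}

type pool struct {
	idle    chan *conn      // stand-in for the pool's connection stacks
	waiters chan chan *conn // stand-in for the waitlist (each waiter channel is buffered)
}

// forceCycle hands idle connections to waiters until one of the two runs out.
// This is the "force a cycle of the pool" step described above.
func (p *pool) forceCycle() {
	for {
		select {
		case w := <-p.waiters:
			select {
			case c := <-p.idle:
				w <- c // wake a potentially starving waiter
			default:
				p.waiters <- w // nothing idle after all; put the waiter back
				return
			}
		default:
			return // nobody is waiting
		}
	}
}

// expiryLoop stands in for the existing periodic loop that times out waiters;
// the fix reuses its ticker to also run the starvation check.
func (p *pool) expiryLoop(ctx context.Context) {
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// ... expire waiters whose deadline has passed (existing behavior) ...
			p.forceCycle()
		}
	}
}

func main() {
	p := &pool{
		idle:    make(chan *conn, 4),
		waiters: make(chan chan *conn, 4),
	}
	// One idle connection and one waiter that raced and got stuck.
	p.idle <- &conn{}
	w := make(chan *conn, 1)
	p.waiters <- w

	ctx, cancel := context.WithTimeout(context.Background(), 300*time.Millisecond)
	defer cancel()
	go p.expiryLoop(ctx)

	<-w // the next ticker cycle hands the idle connection to the waiter
	fmt.Println("waiter woke up")
}
```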
The pro of this approach is that it adds zero overhead to the normal get/put path for the pool.
The con is that a starving connection can remain starved for up to 100ms, which is the frequency I've set for the starvation check loop. Open to alternative takes for this fix, ideally ones without much overhead to the get/put path.
cc @arthurschreiber @mhamza15 @harshit-gangal @deepthi
Related Issue(s)
connection pool timed out errors when there is a spike in borrowed/waiting connections due to race condition #17662
Checklist
Deployment Notes