database/sql: connection pool was originally FIFO, is now random, but should be LIFO #31708
Hello @lizthegrey, thank you for filing this bug and welcome to the Go project! Randomization comes back to bite us here: interestingly, this is a case where we can perhaps draw lessons from load-balancing algorithms, e.g. in https://landing.google.com/sre/sre-book/chapters/load-balancing-datacenter/ -- a guide of which you, @lizthegrey, are an author :) In the bucket of ideas to try: perhaps the pool could be a heap where the keys on insertion are the time left before expiry, so that, by the definition of a heap, we can service requests in order of how close they are to expiring. Kindly paging @bradfitz and @rsc too for examining pooling mechanisms. As regards instrumentation of database/sql wrappers, @basvanbeek (and in small part me), courtesy of OpenCensus, worked on https://opencensus.io/integrations/sql/go_sql/, which provides metrics and traces for SQL calls and is most prominently used in the Go Cloud Development Kit https://gocloud.dev/.
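A minimal sketch of that heap idea, using hypothetical types rather than the actual database/sql internals: pending connection requests are keyed by their expiry time, so when a connection frees up it goes to the request closest to expiring.

```go
package main

import (
	"container/heap"
	"fmt"
	"time"
)

// pendingReq is a hypothetical stand-in for a blocked connection request.
type pendingReq struct {
	id     int
	expiry time.Time // deadline derived from the request's context
}

// reqHeap orders pending requests by how soon they expire (a min-heap on expiry).
type reqHeap []pendingReq

func (h reqHeap) Len() int            { return len(h) }
func (h reqHeap) Less(i, j int) bool  { return h[i].expiry.Before(h[j].expiry) }
func (h reqHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *reqHeap) Push(x interface{}) { *h = append(*h, x.(pendingReq)) }
func (h *reqHeap) Pop() interface{} {
	old := *h
	n := len(old)
	x := old[n-1]
	*h = old[:n-1]
	return x
}

func main() {
	now := time.Now()
	h := &reqHeap{
		{id: 1, expiry: now.Add(3 * time.Second)},
		{id: 2, expiry: now.Add(1 * time.Second)},
		{id: 3, expiry: now.Add(2 * time.Second)},
	}
	heap.Init(h)
	// Each time a connection frees up, hand it to the request closest to expiring.
	for h.Len() > 0 {
		req := heap.Pop(h).(pendingReq)
		fmt.Println("serve request", req.id, "expiring at", req.expiry)
	}
}
```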
My PR for the related issue #39471, https://golang.org/cl/237337, implements LIFO by swapping the fast delete for a slow ordered delete so insertion order is maintained, and by reusing connections from the end of the free list rather than the beginning.
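Roughly, that approach looks like the sketch below (simplified, hypothetical types; the real CL operates on the free-connection list inside database/sql): take the most recently returned connection from the end of the slice, and when an arbitrary element must be removed, shift instead of swapping with the last element so the order is preserved.

```go
package pool

import "time"

// conn is a hypothetical stand-in for a pooled driver connection.
type conn struct {
	returnedAt time.Time
}

type pool struct {
	free []*conn // ordered: oldest at the front, most recently returned at the end
}

// get reuses the most recently returned connection (LIFO) by taking it
// from the end of the free list instead of the beginning.
func (p *pool) get() *conn {
	n := len(p.free)
	if n == 0 {
		return nil
	}
	c := p.free[n-1]
	p.free = p.free[:n-1]
	return c
}

// remove deletes the connection at index i with an ordered (slower) delete,
// rather than the swap-with-last (fast) delete, so the slice stays ordered.
func (p *pool) remove(i int) {
	copy(p.free[i:], p.free[i+1:])
	p.free[len(p.free)-1] = nil
	p.free = p.free[:len(p.free)-1]
}
```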
@stevenh I think your PR is not related to this issue. This issue is about the fact that when the driver has a free connection, it picks a random pending request from the queue and hands the connection to it, which is usually not what you want: the more requests are pending, the more likely your oldest requests are to be starved waiting. Implementing FIFO seems a good way of doing it; LIFO has even more fairness issues.
The previous SQL connection pool behavior used a FIFO queue implemented as a slice, but in 4f6d4bb this was changed, as a side effect, to read the first entry in map iteration order (effectively random among the requests that had not yet timed out when execution started).
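One way to get ordered dequeue back without leaking cancelled requests is sketched below (hypothetical types, not the actual database/sql code): keep the map for O(1) removal when a request is cancelled, plus an ordered list of keys, and skip keys whose requests have already been removed.

```go
package pending

// request is a hypothetical stand-in for a blocked connection request.
type request struct {
	ready chan struct{} // signalled when a connection is handed over
}

type waitQueue struct {
	pending map[uint64]*request // O(1) delete when a request is cancelled
	order   []uint64            // FIFO order of request keys
	nextKey uint64
}

func newWaitQueue() *waitQueue {
	return &waitQueue{pending: make(map[uint64]*request)}
}

// add enqueues a request and returns its key for later cancellation.
func (q *waitQueue) add(r *request) uint64 {
	q.nextKey++
	q.pending[q.nextKey] = r
	q.order = append(q.order, q.nextKey)
	return q.nextKey
}

// cancel removes a request that timed out before being served.
// The stale key left in q.order is skipped by next.
func (q *waitQueue) cancel(key uint64) {
	delete(q.pending, key)
}

// next returns the oldest request still waiting, or nil if none.
func (q *waitQueue) next() *request {
	for len(q.order) > 0 {
		key := q.order[0]
		q.order = q.order[1:]
		if r, ok := q.pending[key]; ok {
			delete(q.pending, key)
			return r
		}
	}
	return nil
}
```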
I believe we can do much better than this -- we should go back to dequeuing the pending requests in an ordered fashion, while continuing to not leak cancelled requests. And we should be able to instrument to see whether people are getting better performance out of LIFO, FIFO, or random.
See also: #22697 talks about greater pool observability and configurability
See also: #18080 talks about observability as well (Honeycomb has implemented our version of passing contexts into wrapped SQL calls, but loses visibility once they reach the SQL layer)
What version of Go are you using (go version)?

Does this issue reproduce with the latest release?
Yes.
What operating system and processor architecture are you using (go env)?

What did you do?
We saturated our connection pool with incoming requests at a rate temporarily higher than it could handle after our database got slow. Almost no requests succeeded, because random requests were picked from the pool to work on, and many of them no longer had enough remaining time to complete without expiring once pulled off the pending map.
What did you expect to see?
If requests come in 10% faster than they can be fulfilled, 90% of them should succeed and only 10% should time out. And we should have a separate waitDuration counter for successful requests that were dequeued and started running, versus cancelled requests that timed out before ever being serviced.
What did you see instead?
~100% of connections timed out, because most of the connections in the pool became dead while being serviced and random selection wasn't trying first the requests most likely to succeed. And we couldn't tell why, because the waitDuration of the successful requests wasn't separated from that of the failed requests.
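For reference, the aggregate counters available today come from DB.Stats; a small sketch of reporting them is below (the driver and DSN are placeholders). It illustrates why a single WaitDuration can't distinguish waits that ended with a connection from waits that ended in a timeout: both are folded into the same counters.

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // any driver; placeholder import
)

func main() {
	db, err := sql.Open("postgres", "dsn-placeholder")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	for range time.Tick(10 * time.Second) {
		s := db.Stats()
		// WaitCount/WaitDuration aggregate every blocked request, whether it
		// eventually got a connection or gave up when its context expired.
		log.Printf("open=%d inUse=%d idle=%d waitCount=%d waitDuration=%s",
			s.OpenConnections, s.InUse, s.Idle, s.WaitCount, s.WaitDuration)
	}
}
```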