database/sql: connection pool was originally FIFO, is now random, but should be LIFO #31708
Hello @lizthegrey, thank you for filing this bug and welcome to the Go project! Randomization comes back to bite us here: interestingly, this is a case where we can perhaps draw lessons from load-balancing algorithms, e.g. in https://landing.google.com/sre/sre-book/chapters/load-balancing-datacenter/ -- a guide of which you, @lizthegrey, are an author :) In the bucket of ideas to try: perhaps the pool could be a heap where the keys on insertion are the time left before expiry, so that, by the definition of a heap, we can service requests in order of how close they are to expiring. Kindly paging @bradfitz and @rsc too for examining pooling mechanisms. As regards instrumentation of database/sql wrappers, @basvanbeek (and in small part me), courtesy of OpenCensus, worked on https://opencensus.io/integrations/sql/go_sql/, which provides metrics and traces for SQL calls and is most prominently used in the Go Cloud Development Kit https://gocloud.dev/.
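A minimal sketch of that heap idea, using hypothetical types rather than the actual database/sql internals: pending connection requests are keyed by their expiry time, so when a connection frees up it goes to the request closest to expiring.

```go
package main

import (
	"container/heap"
	"fmt"
	"time"
)

// pendingReq is a hypothetical stand-in for a blocked connection request.
type pendingReq struct {
	id     int
	expiry time.Time // deadline derived from the request's context
}

// reqHeap orders pending requests by how soon they expire (a min-heap on expiry).
type reqHeap []pendingReq

func (h reqHeap) Len() int            { return len(h) }
func (h reqHeap) Less(i, j int) bool  { return h[i].expiry.Before(h[j].expiry) }
func (h reqHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *reqHeap) Push(x interface{}) { *h = append(*h, x.(pendingReq)) }
func (h *reqHeap) Pop() interface{} {
	old := *h
	n := len(old)
	x := old[n-1]
	*h = old[:n-1]
	return x
}

func main() {
	now := time.Now()
	h := &reqHeap{
		{id: 1, expiry: now.Add(3 * time.Second)},
		{id: 2, expiry: now.Add(1 * time.Second)},
		{id: 3, expiry: now.Add(2 * time.Second)},
	}
	heap.Init(h)
	// Each time a connection frees up, hand it to the request closest to expiring.
	for h.Len() > 0 {
		req := heap.Pop(h).(pendingReq)
		fmt.Println("serve request", req.id, "expiring at", req.expiry)
	}
}
```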
My PR for the related issue #39471, https://golang.org/cl/237337, implements LIFO by swapping the fast delete for a slow ordered delete so insertion order is maintained, and by reusing connections from the end of the free list rather than the beginning.
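Roughly, that approach looks like the sketch below (simplified, hypothetical types; the real CL operates on the free-connection list inside database/sql): take the most recently returned connection from the end of the slice, and when an arbitrary element must be removed, shift instead of swapping with the last element so the order is preserved.

```go
package pool

import "time"

// conn is a hypothetical stand-in for a pooled driver connection.
type conn struct {
	returnedAt time.Time
}

type pool struct {
	free []*conn // ordered: oldest at the front, most recently returned at the end
}

// get reuses the most recently returned connection (LIFO) by taking it
// from the end of the free list instead of the beginning.
func (p *pool) get() *conn {
	n := len(p.free)
	if n == 0 {
		return nil
	}
	c := p.free[n-1]
	p.free = p.free[:n-1]
	return c
}

// remove deletes the connection at index i with an ordered (slower) delete,
// rather than the swap-with-last (fast) delete, so the slice stays ordered.
func (p *pool) remove(i int) {
	copy(p.free[i:], p.free[i+1:])
	p.free[len(p.free)-1] = nil
	p.free = p.free[:len(p.free)-1]
}
```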
@stevenh I think your PR is not related to this issue. This issue is about the fact that when the driver has a free connection, it picks a random pending request from the queue and hands the connection to it, which is usually not what you want: the more requests are pending, the more likely your oldest requests are to be starved waiting. Implementing FIFO seems a good way of doing it; LIFO has even more fairness issues.
The previous SQL connection pool behavior used a FIFO queue implemented as a slice, but in 4f6d4bb this was changed, as a side effect, to read the first entry in map iteration order (effectively random among the requests that had not yet timed out when execution started).
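One way to get ordered dequeue back without leaking cancelled requests is sketched below (hypothetical types, not the actual database/sql code): keep the map for O(1) removal when a request is cancelled, plus an ordered list of keys, and skip keys whose requests have already been removed.

```go
package pending

// request is a hypothetical stand-in for a blocked connection request.
type request struct {
	ready chan struct{} // signalled when a connection is handed over
}

type waitQueue struct {
	pending map[uint64]*request // O(1) delete when a request is cancelled
	order   []uint64            // FIFO order of request keys
	nextKey uint64
}

func newWaitQueue() *waitQueue {
	return &waitQueue{pending: make(map[uint64]*request)}
}

// add enqueues a request and returns its key for later cancellation.
func (q *waitQueue) add(r *request) uint64 {
	q.nextKey++
	q.pending[q.nextKey] = r
	q.order = append(q.order, q.nextKey)
	return q.nextKey
}

// cancel removes a request that timed out before being served.
// The stale key left in q.order is skipped by next.
func (q *waitQueue) cancel(key uint64) {
	delete(q.pending, key)
}

// next returns the oldest request still waiting, or nil if none.
func (q *waitQueue) next() *request {
	for len(q.order) > 0 {
		key := q.order[0]
		q.order = q.order[1:]
		if r, ok := q.pending[key]; ok {
			delete(q.pending, key)
			return r
		}
	}
	return nil
}
```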
I believe we can do much better than this -- we should go back to dequeuing the pending requests in an ordered fashion, while continuing to not leak cancelled requests. And we should be able to instrument to see whether people are getting better performance out of LIFO, FIFO, or random.
See also: #22697 talks about greater pool observability and configurability
See also: #18080 talks about observability as well (Honeycomb has implemented our version of passing contexts into wrapped SQL calls, but loses visibility once they reach the SQL layer)
What version of Go are you using (go version)?

Does this issue reproduce with the latest release?
Yes.
What operating system and processor architecture are you using (go env)?

What did you do?
We saturated our connection pool with incoming requests at a rate temporarily higher than it could handle after our database got slow. Almost no requests succeeded, because random requests were picked from the pool to work on, and many of them no longer had enough remaining time to complete without expiring once pulled off the pending map.
What did you expect to see?
If requests come in 10% faster than they can be fulfilled, 90% of them should succeed and only 10% should time out. And we should have a separate waitDuration counter for successful requests that were dequeued and started running, versus cancelled requests that timed out before ever being serviced.
What did you see instead?
~100% of connections timed out, because most of the connections in the pool became dead while being serviced and random selection wasn't trying first the requests most likely to succeed. And we couldn't tell why, because the waitDuration of the successful requests wasn't separated from that of the failed requests.
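For reference, the aggregate counters available today come from DB.Stats; a small sketch of reporting them is below (the driver and DSN are placeholders). It illustrates why a single WaitDuration can't distinguish waits that ended with a connection from waits that ended in a timeout: both are folded into the same counters.

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // any driver; placeholder import
)

func main() {
	db, err := sql.Open("postgres", "dsn-placeholder")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	for range time.Tick(10 * time.Second) {
		s := db.Stats()
		// WaitCount/WaitDuration aggregate every blocked request, whether it
		// eventually got a connection or gave up when its context expired.
		log.Printf("open=%d inUse=%d idle=%d waitCount=%d waitDuration=%s",
			s.OpenConnections, s.InUse, s.Idle, s.WaitCount, s.WaitDuration)
	}
}
```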