x/net/http2: random read deadlock #39812
Comments
Have you built and deployed a race-enabled version of your application?
Thanks for this link, @davecheney. I've just deployed a race-enabled binary on our primary proxy. No race detection so far. Side question: are race-enabled binaries supposed to be noticeably slower?
Yes, race detection does add overhead. I always recommend that people use
cc @fraenkel
If it's of any help: with Go 1.14.3 and a race-enabled build, 24 days later the issue popped up again. Same symptoms, same stack trace, no race condition reported. It's an unusually long interval (the average time between outages is around 10 days); I can't tell whether that's related to the Go upgrade, the race-enabled build, or a change in our load over this period. Or just pure randomness.
@Xfennec please follow up on one of the related issues. Thank you.
Hi!
What version of Go are you using (go version)?
(seen from at least go1.10.4)

Does this issue reproduce with the latest release?
Currently testing with 1.14.3

What operating system and processor architecture are you using (go env)?
go env Output

What did you do?
We use a chain of two HTTP proxies written in Go to route incoming traffic to our virtual machines:
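As a rough illustration only (not the actual proxy code), a minimal two-hop chain built on net/http/httputil could look like the sketch below; the listen addresses and internal hostnames are made up, and the real deployment speaks TLS and HTTP/2 between the two hops.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// newProxy returns a reverse proxy that forwards every request to target.
func newProxy(target string) *httputil.ReverseProxy {
	u, err := url.Parse(target)
	if err != nil {
		log.Fatal(err)
	}
	return httputil.NewSingleHostReverseProxy(u)
}

func main() {
	// Primary proxy: accepts incoming traffic, forwards to the secondary proxy.
	go func() {
		log.Fatal(http.ListenAndServe(":8080", newProxy("http://secondary.internal:8081")))
	}()

	// Secondary proxy: forwards to the backend VM.
	log.Fatal(http.ListenAndServe(":8081", newProxy("http://vm.internal:80")))
}
```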
What did you expect to see?
Stable service over time.
What did you see instead?
Every ~10 days, we detect a total stall of requests between our front-facing ("primary") proxy and one of the secondary proxies. New requests are accepted, sent to the secondary, then to the VM, but the response stalls somewhere between the secondary and the primary proxy.
All requests to this secondary proxy then wait forever (since all requests are multiplexed over the same HTTP/2 connection).
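(As an aside, the wait can at least be bounded on the caller side with a per-request deadline; a rough sketch, where the URL and the timeout value are purely illustrative:)

```go
package main

import (
	"context"
	"log"
	"net/http"
	"time"
)

func main() {
	// Give the request a hard deadline so a stalled HTTP/2 stream surfaces
	// as a timeout error instead of hanging forever.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, "http://secondary.internal:8081/health", nil)
	if err != nil {
		log.Fatal(err)
	}

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Printf("request failed (possibly a stalled connection): %v", err)
		return
	}
	defer resp.Body.Close()
	log.Println("status:", resp.Status)
}
```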
Restarting one of the proxies, or even killing the "faulty" HTTP/2 connection between our proxies (using ss -K), makes the service work perfectly again.
We've had this issue for months now, and I still fail to see any common pattern (time of day, server load, requests, …) that would be the trigger. I've tried to write a few tests/repros, without success either.
Here's a sample of the stack on the primary proxy during one of the outages:
At the same time, on the secondary proxy, I see this:
Transport level to mitigate this issue?
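For reference, golang.org/x/net/http2 does expose a Transport-level connection health check (ReadIdleTimeout and PingTimeout) that can detect and close a connection whose reads have gone silent, so new requests get a fresh connection instead of queuing behind a stuck one. A minimal sketch, with the timeout values chosen arbitrarily:

```go
package main

import (
	"log"
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

func main() {
	// Start from a standard transport.
	t1 := &http.Transport{}

	// ConfigureTransports wires HTTP/2 support into t1 and returns the
	// underlying http2.Transport so its settings can be tuned.
	t2, err := http2.ConfigureTransports(t1)
	if err != nil {
		log.Fatal(err)
	}

	// If no frame is received for ReadIdleTimeout, a PING is sent; if the
	// PING is not answered within PingTimeout, the connection is closed.
	t2.ReadIdleTimeout = 30 * time.Second
	t2.PingTimeout = 15 * time.Second

	client := &http.Client{Transport: t1}
	_ = client // use this client (or transport) in the proxy
}
```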