fluxd can overload and become unresponsive #714

squaremo · 2017-08-23T14:40:01Z

In our own dev environment, we have noticed from time to time that fluxd has stopped automatically deploying images. When we go to look at the pod, it shows up as using gigabytes of RAM and having hundreds of goroutines -- normally we'd expect maybe 50MB and tens of goroutines.

Luckily, I managed to capture a stack dump from fluxd this most recent time. Notable bit:

goroutine 18627648 [semacquire]:
sync.runtime_Semacquire(0x1faef64)
	/usr/local/go/src/runtime/sema.go:47 +0x30
sync.(*Mutex).Lock(0x1faef60)
	/usr/local/go/src/sync/mutex.go:85 +0xd0
sync.(*RWMutex).Lock(0x1faef60)
	/usr/local/go/src/sync/rwmutex.go:86 +0x31
syscall.forkExec(0xc44f06df40, 0xc, 0xc457fc2000, 0x6, 0x6, 0xc45d781310, 0x0, 0xc41fee25b8, 0xc44d288ea0)
	/usr/local/go/src/syscall/exec_unix.go:185 +0x1fd
syscall.StartProcess(0xc44f06df40, 0xc, 0xc457fc2000, 0x6, 0x6, 0xc45d781310, 0x2, 0x4, 0x20, 0x20)
	/usr/local/go/src/syscall/exec_unix.go:240 +0x64
os.startProcess(0xc44f06df40, 0xc, 0xc457fc2000, 0x6, 0x6, 0xc45d7814b8, 0x0, 0x15e6fe0, 0xc420dafd50)
	/usr/local/go/src/os/exec_posix.go:45 +0x1a3
os.StartProcess(0xc44f06df40, 0xc, 0xc457fc2000, 0x6, 0x6, 0xc45d7814b8, 0x0, 0x0, 0x6)
	/usr/local/go/src/os/doc.go:28 +0x64
os/exec.(*Cmd).Start(0xc42716edc0, 0xc457fc2001, 0xc420dafd50)
	/usr/local/go/src/os/exec/exec.go:358 +0x3c9
os/exec.(*Cmd).Run(0xc42716edc0, 0xc420dafd50, 0xc44f06df60)
	/usr/local/go/src/os/exec/exec.go:276 +0x2b
github.com/weaveworks/flux/git.execGitCmd(0xc4204f36e0, 0x20, 0x0, 0x0, 0x1f1de20, 0xc420dafce0, 0xc45d781618, 0x5, 0x5, 0x1f1ee60, ...)
	/home/ubuntu/src/github.com/weaveworks/flux/git/operations.go:206 +0x193
github.com/weaveworks/flux/git.getNote(0xc4204f36e0, 0x20, 0xc420832410, 0xf, 0xc442af4ec4, 0x28, 0x0, 0x0, 0x0)
	/home/ubuntu/src/github.com/weaveworks/flux/git/operations.go:109 +0x12c
github.com/weaveworks/flux/git.(*Checkout).GetNote(0xc420124240, 0xc442af4ec4, 0x28, 0x0, 0x0, 0x0)
	/home/ubuntu/src/github.com/weaveworks/flux/git/repo.go:170 +0xcb
github.com/weaveworks/flux/daemon.(*Daemon).JobStatus(0xc42067c600, 0xc43b2af4a0, 0x24, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/home/ubuntu/src/github.com/weaveworks/flux/daemon/daemon.go:326 +0x177
github.com/weaveworks/flux/daemon.(*Ref).JobStatus(0xc4202ce870, 0xc43b2af4a0, 0x24, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/home/ubuntu/src/github.com/weaveworks/flux/daemon/ref.go:65 +0x8c
github.com/weaveworks/flux/remote.(*ErrorLoggingPlatform).JobStatus(0xc4204f3580, 0xc43b2af4a0, 0x24, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/home/ubuntu/src/github.com/weaveworks/flux/remote/logging.go:77 +0xde
github.com/weaveworks/flux/remote/rpc.(*RPCServer).JobStatus(0xc437ee0f80, 0xc43b2af4a0, 0x24, 0xc422cd9140, 0x0, 0x0)
	/home/ubuntu/src/github.com/weaveworks/flux/remote/rpc/server.go:77 +0x62
reflect.Value.call(0xc4207a6b40, 0xc437cab728, 0x13, 0x16254fb, 0x4, 0xc45d781ee0, 0x3, 0x3, 0x414a59, 0xc4208640b8, ...)
	/usr/local/go/src/reflect/value.go:434 +0x5c8
reflect.Value.Call(0xc4207a6b40, 0xc437cab728, 0x13, 0xc45d781ee0, 0x3, 0x3, 0x0, 0x0, 0x0)
	/usr/local/go/src/reflect/value.go:302 +0xa4
net/rpc.(*service).call(0xc42f866580, 0xc42f866540, 0xc437ee1030, 0xc437eec380, 0xc42cae4f20, 0x13fd740, 0xc4239f5020, 0x198, 0x145ece0, 0xc422cd9140, ...)
	/usr/local/go/src/net/rpc/server.go:383 +0x148
created by net/rpc.(*Server).ServeCodec
	/usr/local/go/src/net/rpc/server.go:477 +0x421

There's about 600 of those, all stuck in the same place.

fluxd-dev-stackdump.txt

The text was updated successfully, but these errors were encountered:

squaremo · 2017-08-23T14:58:43Z

Here's one that is (I think) in the critical section the others are trying to enter:

goroutine 18715803 [runnable]:
sync.runtime_Semrelease(0x1faef64)
	/usr/local/go/src/runtime/sema.go:57 +0x2b
sync.(*Mutex).Unlock(0x1faef60)
	/usr/local/go/src/sync/mutex.go:124 +0xaa
sync.(*RWMutex).Unlock(0x1faef60)
	/usr/local/go/src/sync/rwmutex.go:125 +0x95
syscall.forkExec(0xc453206c60, 0xc, 0xc460542000, 0x6, 0x6, 0xc469683310, 0x0, 0xc41dfd63ca, 0xc46053ab60)
	/usr/local/go/src/syscall/exec_unix.go:198 +0x306
syscall.StartProcess(0xc453206c60, 0xc, 0xc460542000, 0x6, 0x6, 0xc469683310, 0x2, 0x4, 0x20, 0x20)
	/usr/local/go/src/syscall/exec_unix.go:240 +0x64
os.startProcess(0xc453206c60, 0xc, 0xc460542000, 0x6, 0x6, 0xc4696834b8, 0x0, 0x15e6fe0, 0xc4531fa930)
	/usr/local/go/src/os/exec_posix.go:45 +0x1a3
os.StartProcess(0xc453206c60, 0xc, 0xc460542000, 0x6, 0x6, 0xc4696834b8, 0x0, 0x0, 0x6)
	/usr/local/go/src/os/doc.go:28 +0x64
os/exec.(*Cmd).Start(0xc462e33a20, 0xc460542001, 0xc4531fa930)
	/usr/local/go/src/os/exec/exec.go:358 +0x3c9
os/exec.(*Cmd).Run(0xc462e33a20, 0xc4531fa930, 0xc453206c80)
	/usr/local/go/src/os/exec/exec.go:276 +0x2b
github.com/weaveworks/flux/git.execGitCmd(0xc4204f36e0, 0x20, 0x0, 0x0, 0x1f1de20, 0xc4531fa8c0, 0xc469683618, 0x5, 0x5, 0x1f1ee19, ...)
	/home/ubuntu/src/github.com/weaveworks/flux/git/operations.go:206 +0x193
github.com/weaveworks/flux/git.getNote(0xc4204f36e0, 0x20, 0xc420832410, 0xf, 0xc455f3185a, 0x28, 0x0, 0x0, 0x0)
	/home/ubuntu/src/github.com/weaveworks/flux/git/operations.go:109 +0x12c
github.com/weaveworks/flux/git.(*Checkout).GetNote(0xc420124240, 0xc455f3185a, 0x28, 0x0, 0x0, 0x0)
	/home/ubuntu/src/github.com/weaveworks/flux/git/repo.go:170 +0xcb
github.com/weaveworks/flux/daemon.(*Daemon).JobStatus(0xc42067c600, 0xc444295f50, 0x24, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/home/ubuntu/src/github.com/weaveworks/flux/daemon/daemon.go:326 +0x177
github.com/weaveworks/flux/daemon.(*Ref).JobStatus(0xc4202ce870, 0xc444295f50, 0x24, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/home/ubuntu/src/github.com/weaveworks/flux/daemon/ref.go:65 +0x8c
github.com/weaveworks/flux/remote.(*ErrorLoggingPlatform).JobStatus(0xc4204f3580, 0xc444295f50, 0x24, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/home/ubuntu/src/github.com/weaveworks/flux/remote/logging.go:77 +0xde
github.com/weaveworks/flux/remote/rpc.(*RPCServer).JobStatus(0xc437ee0f80, 0xc444295f50, 0x24, 0xc44bf50180, 0x0, 0x0)
	/home/ubuntu/src/github.com/weaveworks/flux/remote/rpc/server.go:77 +0x62
reflect.Value.call(0xc4207a6b40, 0xc437cab728, 0x13, 0x16254fb, 0x4, 0xc469683ee0, 0x3, 0x3, 0xc420b2a340, 0x4, ...)
	/usr/local/go/src/reflect/value.go:434 +0x5c8
reflect.Value.Call(0xc4207a6b40, 0xc437cab728, 0x13, 0xc469683ee0, 0x3, 0x3, 0x0, 0x0, 0x0)
	/usr/local/go/src/reflect/value.go:302 +0xa4
net/rpc.(*service).call(0xc42f866580, 0xc42f866540, 0xc437ee1030, 0xc437eec380, 0xc43415eba0, 0x13fd740, 0xc436108a30, 0x198, 0x145ece0, 0xc44bf50180, ...)
	/usr/local/go/src/net/rpc/server.go:383 +0x148
created by net/rpc.(*Server).ServeCodec
	/usr/local/go/src/net/rpc/server.go:477 +0x421

squaremo · 2017-08-23T15:09:23Z

Superficially, this looks like a bunch of git note ... os.exec's are queued up waiting to run, all in the service of answering daemon.JobStatus().

The JobStatus method responds to things like fluxctl release and the web UI. It will first look in its cache of recent jobs, and failing that, will go through all the commits from HEAD backwards to try and find one with a note representing the job in question. That latter step could of course launch as many execs as there are commits, if the job in question doesn't exist.

One possible mitigation would be to put positive and negative results back in the cache, so the next call will find them (until they get evicted); this will at least de-amplify the execs.

It would be good to figure out why they are piling up, though -- since they are exec'd serially, I would take 600 of them to indicate 600 outstanding JobStatus requests, rather than one request and 600 execs.

philwinder · 2017-08-24T12:53:45Z

All of these traces are blocking inside exec_unix. They can't get a lock to be able to fork the
process (source: https://golang.org/src/syscall/exec_unix.go?s=5442:5460).

Why it can't get a lock there is a mystery. I can't find any other references to any code that would have that lock.

So my guess is that something happened in the middle of one fork calls and for some reason it didn't release the lock.

After a search around I've seen various references to certain issues with intel processors, BSD and quite a few changes to the codebase. But I can't see a general theme or fix. But this does look quite low level.

What is even more strange is that we have our own lock that forces us to only perform one operation on a checkout at a time.

philwinder · 2017-08-24T13:40:35Z

Found a trace that is different to the rest of the 600 in the stack.

If you look closely, it is still a JobStatus->GetNote->execGitCmd request, but it is a little further on than the others.

It diverges from the others at this point:
All others: https://github.com/golang/go/blob/go1.7.1/src/os/exec/exec.go#L276
This one: https://github.com/golang/go/blob/go1.7.1/src/os/exec/exec.go#L279

So it has managed to start the process, but is now stuck waiting for the underlying OS process.

I think it is this individual goroutine that has got stuck and all others are blocked behind it.

goroutine 18631036 [syscall]:
syscall.Syscall6(0xf7, 0x1, 0xe2e2, 0xc4320ff980, 0x1000004, 0x0, 0x0, 0x149da00, 0xc46898d3b0, 0x43991e)
	/usr/local/go/src/syscall/asm_linux_amd64.s:44 +0x5
os.(*Process).blockUntilWaitable(0xc43abf5a10, 0x16fa1c0, 0xc42becadc0, 0xc43390a6e0)
	/usr/local/go/src/os/wait_waitid.go:28 +0xbc
os.(*Process).wait(0xc43abf5a10, 0x0, 0x415f5e, 0xc4339086b0)
	/usr/local/go/src/os/exec_unix.go:22 +0xab
os.(*Process).Wait(0xc43abf5a10, 0xc46898d4b8, 0x0, 0x0)
	/usr/local/go/src/os/doc.go:49 +0x2b
os/exec.(*Cmd).Wait(0xc42becadc0, 0x0, 0x0)
	/usr/local/go/src/os/exec/exec.go:434 +0x6d
os/exec.(*Cmd).Run(0xc42becadc0, 0xc44e7af3b0, 0xc433908650)
	/usr/local/go/src/os/exec/exec.go:279 +0x48
github.com/weaveworks/flux/git.execGitCmd(0xc4204f36e0, 0x20, 0x0, 0x0, 0x1f1de20, 0xc44e7af340, 0xc46898d618, 0x5, 0x5, 0x1f1ee60, ...)
	/home/ubuntu/src/github.com/weaveworks/flux/git/operations.go:206 +0x193
github.com/weaveworks/flux/git.getNote(0xc4204f36e0, 0x20, 0xc420832410, 0xf, 0xc44222bd3c, 0x28, 0x0, 0x0, 0x0)
	/home/ubuntu/src/github.com/weaveworks/flux/git/operations.go:109 +0x12c
github.com/weaveworks/flux/git.(*Checkout).GetNote(0xc420124240, 0xc44222bd3c, 0x28, 0x0, 0x0, 0x0)
	/home/ubuntu/src/github.com/weaveworks/flux/git/repo.go:170 +0xcb
github.com/weaveworks/flux/daemon.(*Daemon).JobStatus(0xc42067c600, 0xc437d31950, 0x24, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/home/ubuntu/src/github.com/weaveworks/flux/daemon/daemon.go:326 +0x177
github.com/weaveworks/flux/daemon.(*Ref).JobStatus(0xc4202ce870, 0xc437d31950, 0x24, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/home/ubuntu/src/github.com/weaveworks/flux/daemon/ref.go:65 +0x8c
github.com/weaveworks/flux/remote.(*ErrorLoggingPlatform).JobStatus(0xc4204f3580, 0xc437d31950, 0x24, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/home/ubuntu/src/github.com/weaveworks/flux/remote/logging.go:77 +0xde
github.com/weaveworks/flux/remote/rpc.(*RPCServer).JobStatus(0xc437ee0f80, 0xc437d31950, 0x24, 0xc43ce49b80, 0x0, 0x0)
	/home/ubuntu/src/github.com/weaveworks/flux/remote/rpc/server.go:77 +0x62
reflect.Value.call(0xc4207a6b40, 0xc437cab728, 0x13, 0x16254fb, 0x4, 0xc46898dee0, 0x3, 0x3, 0xc420ab9d40, 0x4, ...)
	/usr/local/go/src/reflect/value.go:434 +0x5c8
reflect.Value.Call(0xc4207a6b40, 0xc437cab728, 0x13, 0xc46898dee0, 0x3, 0x3, 0x0, 0x0, 0x0)
	/usr/local/go/src/reflect/value.go:302 +0xa4
net/rpc.(*service).call(0xc42f866580, 0xc42f866540, 0xc437ee1030, 0xc437eec380, 0xc433902780, 0x13fd740, 0xc439d54a30, 0x198, 0x145ece0, 0xc43ce49b80, ...)
	/usr/local/go/src/net/rpc/server.go:383 +0x148
created by net/rpc.(*Server).ServeCodec
	/usr/local/go/src/net/rpc/server.go:477 +0x421

philwinder · 2017-08-24T13:58:25Z

So one potential solution here is to use a CommandContext type instead of a Command, and pass a timeout that can be cancelled. This should avoid any blocks caused by an underlying issue with a forked process.

The only reason I can give for bypassing our lock is that these are separate checkouts. I.e. this represents 600 daemon loops that are all blocked by a fundamental inability to fork a process.

It doesn't explain why it happened, but I'm fairly happy that it wasn't caused by our code. But we should put a timeout in place to mitigate against it.

rade · 2017-08-24T14:02:15Z

Next time this happens, it would be worth checking the process tree, since that should allow us to determine the exact command that got stuck (if that is indeed what happens).

squaremo · 2017-08-24T14:13:22Z

The only reason I can give for bypassing our lock is that these are separate checkouts.

It's a RWLock, and we take a RLock to do GetNote and CommitsBetween etc.

squaremo · 2017-08-24T14:35:50Z

The ForkLock is taken and released when the process is started (inside os.StartProcess()), so it will have been released already by goroutine 18631036 [syscall], which is past that point.

To mitigate against any low level execution problems, add a context timeout to git commands so that if they fail, it won't cause subsequent requests to back up. Fixes #714

squaremo · 2017-08-25T12:29:05Z

Another mitigation: listing the notes first (one exec), and only calling git note if there's known to be one for a particular commit, would help reduce the number of execs.

In our own config repo, there's about 700 notes and 7000 commits (although we have only been using flux gitsync relatively recently).

This is an optimisation for #714. We perform a single gitCommandExec to get hashes for all commits with notes, place them into a map and use them to query whether a commit has a note or not. This prevents multiple calls to gitCommandExec just to see if there is a note attached.

squaremo · 2017-08-29T16:20:36Z

I suspect this isn't over.

* Add method to list revisions with notes This method uses git notes list to get all notes for a given note ref. We then turn the list into an array and take the second field from the results (which corresponds to the object reference - the commit id in our case). Finally, the result is placed in a map to make it easier to do "if note is in" type queries later. * Check if notes exist before requesting them This is an optimisation for #714. We perform a single gitCommandExec to get hashes for all commits with notes, place them into a map and use them to query whether a commit has a note or not. This prevents multiple calls to gitCommandExec just to see if there is a note attached. * Add extra check for error seen on ubuntu systems. * Add context.

squaremo · 2017-09-18T11:08:12Z

We've not seen the explosion in goroutines and memory since #721 and #726, so I'm tentatively closing this again -- I think the git note pile-up has at least been mitigated.

suprememoocow · 2017-12-14T18:58:24Z

I stumbled across this issue as we're having a very similar issue on our project. We're building a git RPC service for GitLab and, in production, each daemon process is spawning ~20 git commands per second. Occasionally we've been experiencing lockups, including health check ports not accepting. Like this issue, there appears to be contention around the ForkLock in exec_unix.go..

Seeing that you worked around this issue rather than completely solved it, I figured I would share our progress with you, in case you experience it again in future. 🙂

The issue is documented here: https://gitlab.com/gitlab-org/gitaly/issues/823.

We've investigated many different potential solutions, but we're focusing our debugging on a slowdown in fork performance related to process VMM size: see golang/go#5838 (comment)

Generally, right before the process becomes unresponsive we see an increase in VM size.

Go 1.9 switched to posix_spawn which does not copy memory to the child process and seems to have more stable performance over various VMM sizes.

We're planning on switching to a Go 1.9 binary to test if this solves the problem.

Hope this comes in handy at some point! 🙂

philwinder added the bug label Aug 24, 2017

philwinder pushed a commit that referenced this issue Aug 24, 2017

Add timeout to git command

73947c4

To mitigate against any low level execution problems, add a context timeout to git commands so that if they fail, it won't cause subsequent requests to back up. Fixes #714

philwinder mentioned this issue Aug 24, 2017

Add timeout to git command #719

Merged

philwinder self-assigned this Aug 24, 2017

philwinder mentioned this issue Aug 28, 2017

Optimisation - reduce git note gets #721

Merged

philwinder closed this as completed in 987c94c Aug 29, 2017

squaremo reopened this Aug 29, 2017

squaremo closed this as completed Sep 18, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fluxd can overload and become unresponsive #714

fluxd can overload and become unresponsive #714

squaremo commented Aug 23, 2017

squaremo commented Aug 23, 2017

squaremo commented Aug 23, 2017

philwinder commented Aug 24, 2017 •

edited

Loading

philwinder commented Aug 24, 2017

philwinder commented Aug 24, 2017 •

edited

Loading

rade commented Aug 24, 2017

squaremo commented Aug 24, 2017

squaremo commented Aug 24, 2017

squaremo commented Aug 25, 2017

squaremo commented Aug 29, 2017

squaremo commented Sep 18, 2017

suprememoocow commented Dec 14, 2017 •

edited

Loading

fluxd can overload and become unresponsive #714

fluxd can overload and become unresponsive #714

Comments

squaremo commented Aug 23, 2017

squaremo commented Aug 23, 2017

squaremo commented Aug 23, 2017

philwinder commented Aug 24, 2017 • edited Loading

philwinder commented Aug 24, 2017

philwinder commented Aug 24, 2017 • edited Loading

rade commented Aug 24, 2017

squaremo commented Aug 24, 2017

squaremo commented Aug 24, 2017

squaremo commented Aug 25, 2017

squaremo commented Aug 29, 2017

squaremo commented Sep 18, 2017

suprememoocow commented Dec 14, 2017 • edited Loading

philwinder commented Aug 24, 2017 •

edited

Loading

philwinder commented Aug 24, 2017 •

edited

Loading

suprememoocow commented Dec 14, 2017 •

edited

Loading