fluxd can overload and become unresponsive #714
Comments
Here's one that is (I think) in the critical section the others are trying to enter:
Superficially, this looks like a bunch of […]. The JobStatus method responds to things like […]. One possible mitigation would be to put both positive and negative results back in the cache, so the next call will find them (until they get evicted); this will at least de-amplify the execs. It would be good to figure out why they are piling up, though. Since they are exec'd serially, I would take 600 of them to indicate 600 outstanding JobStatus requests, rather than one request and 600 execs.
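To make the caching idea concrete, here is a minimal sketch of a cache that remembers "not found" results as well as hits, so a repeated lookup for the same revision does not trigger another exec. The names `noteCache`, `lookupResult` and `fetch` are hypothetical and are not taken from fluxd; eviction is left out for brevity.

```go
package main

import (
	"fmt"
	"sync"
)

// lookupResult records the outcome of one lookup, including "not found".
type lookupResult struct {
	note  string
	found bool
}

// noteCache is a hypothetical cache keyed by revision. It stores negative
// results too, so a miss is only paid for once per revision.
type noteCache struct {
	mu      sync.Mutex
	entries map[string]lookupResult
	fetch   func(rev string) (string, bool) // the expensive lookup, e.g. an exec of git
}

func newNoteCache(fetch func(rev string) (string, bool)) *noteCache {
	return &noteCache{entries: make(map[string]lookupResult), fetch: fetch}
}

// get returns the cached result if there is one, otherwise it calls fetch
// and caches whatever comes back, whether or not a note was found.
func (c *noteCache) get(rev string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if r, ok := c.entries[rev]; ok {
		return r.note, r.found
	}
	note, found := c.fetch(rev)
	c.entries[rev] = lookupResult{note: note, found: found}
	return note, found
}

func main() {
	execs := 0
	cache := newNoteCache(func(rev string) (string, bool) {
		execs++ // stands in for one exec of git
		return "", false
	})
	for i := 0; i < 3; i++ {
		cache.get("c0ffee1")
	}
	fmt.Println("lookups: 3, execs:", execs)
}
```

The lock is held across `fetch`, which serialises lookups; that matches the observation that the execs already run serially, but a real implementation would also need eviction and invalidation when new notes are pushed.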
All of these traces are blocking inside […]. Why it can't get a lock there is a mystery. I can't find any other references to any code that would hold that lock. So my guess is that something happened in the middle of one of the fork calls and for some reason it didn't release the lock. After a search around I've seen various references to certain issues with Intel processors, BSD and quite a few changes to the codebase. But I can't see a general theme or fix, and this does look quite low level. What is even stranger is that we have our own lock that forces us to perform only one operation on a checkout at a time.
Found a trace that is different from the rest of the 600 in the stack. If you look closely, it is still a […]. It diverges from the others at this point: […] So it has managed to start the process, but is now stuck waiting for the underlying OS process. I think it is this individual goroutine that has got stuck and all the others are blocked behind it.
So one potential solution here is to use a […]. The only explanation I can give for our lock being bypassed is that these are separate checkouts, i.e. this represents 600 daemon loops that are all blocked by a fundamental inability to fork a process. It doesn't explain why it happened, but I'm fairly happy that it wasn't caused by our code. We should still put a timeout in place to mitigate it.
Next time this happens, it would be worth checking the process tree, since that should allow us to determine the exact command that got stuck (if that is indeed what happens).
It's an RWLock, and we take an RLock to do […]
The |
To mitigate against any low level execution problems, add a context timeout to git commands so that if they fail, it won't cause subsequent requests to back up. Fixes #714
Another mitigation: listing the notes first (one exec), and only calling git note if there's known to be one for a particular commit, would help reduce the number of execs. In our own config repo, there's about 700 notes and 7000 commits (although we have only been using flux gitsync relatively recently).
This is an optimisation for #714. We perform a single gitCommandExec to get hashes for all commits with notes, place them into a map and use them to query whether a commit has a note or not. This prevents multiple calls to gitCommandExec just to see if there is a note attached.
I suspect this isn't over.
* Add method to list revisions with notes

  This method uses git notes list to get all notes for a given note ref. We then turn the list into an array and take the second field from the results (which corresponds to the object reference - the commit id in our case). Finally, the result is placed in a map to make it easier to do "if note is in" type queries later.

* Check if notes exist before requesting them

  This is an optimisation for #714. We perform a single gitCommandExec to get hashes for all commits with notes, place them into a map and use them to query whether a commit has a note or not. This prevents multiple calls to gitCommandExec just to see if there is a note attached.

* Add extra check for error seen on ubuntu systems.

* Add context.
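The parsing step described in the first commit could look roughly like the sketch below. `git notes list` prints one `<note object> <annotated object>` pair per line, so the second field is the commit id. The function name `parseNotesList` and the sample hashes are made up for this example; the exec of git itself is left out so the snippet runs on its own.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseNotesList turns the output of `git notes list` into a set of the
// annotated object ids (the second field on each line).
func parseNotesList(output string) map[string]struct{} {
	noted := make(map[string]struct{})
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) != 2 {
			continue // skip blank or unexpected lines
		}
		noted[fields[1]] = struct{}{}
	}
	return noted
}

func main() {
	// Shortened, made-up ids standing in for real output.
	sample := "aaa111 c0ffee1\nbbb222 c0ffee2\n"
	noted := parseNotesList(sample)

	for _, rev := range []string{"c0ffee1", "deadbee"} {
		_, ok := noted[rev]
		fmt.Printf("%s has note: %v\n", rev, ok)
	}
}
```

With the set built from a single exec, each per-commit check becomes a map lookup, and git only needs to be called again for commits that are known to carry a note.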
I stumbled across this issue as we're having a very similar problem on our project. We're building a git RPC service for GitLab and, in production, each daemon process is spawning ~20 […]. Seeing that you worked around this issue rather than completely solved it, I figured I would share our progress with you, in case you experience it again in future. 🙂 The issue is documented here: https://gitlab.com/gitlab-org/gitaly/issues/823. We've investigated many different potential solutions, but we're focusing our debugging on a slowdown in fork performance related to process VMM size: see golang/go#5838 (comment). Generally, right before the process becomes unresponsive we see an increase in VM size. Go 1.9 switched to […]. We're planning on switching to a Go 1.9 binary to test if this solves the problem. Hope this comes in handy at some point! 🙂
In our own dev environment, we have noticed from time to time that fluxd has stopped automatically deploying images. When we go to look at the pod, it shows up as using gigabytes of RAM and having hundreds of goroutines -- normally we'd expect maybe 50MB and tens of goroutines.
Luckily, I managed to capture a stack dump from fluxd this most recent time. Notable bit:
There's about 600 of those, all stuck in the same place.
fluxd-dev-stackdump.txt