-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
net: lookupIPDeadline leaks goroutines #8602
Labels
Milestone
Comments
More details on the issue and a proof-of-concept patch for the net package that hacks the general cgo_unix.go file to use the GNU extension getaddrinfo_a at: https://jira.mongodb.org/browse/MGO-41?focusedCommentId=719482 The patch avoid major changes by just enforcing the documented cap of 30 seconds. It works, but cannot be adopted as-is as the cgo_unix.go file is shared with systems that most likely do not support this extension. |
I'm sorry for not providing a more clear description. The patch is at: [1] http://paste.ubuntu.com/8354072/ and this is the section of the bug comment referring to it: 4. Improving the Go standard library Improving the Go standard library appropriately will take a little while as all the code paths leading to resolution will need to be refactored to provide a timeout, and different APIs will need to be used in different operating systems, and the current hack may need to moved closer to the ones that do not support a better API for this. With that said, here is an incipient patch for Linux [1], tested on Ubuntu 14.04. As you can see it makes use of the new getaddrinfo_a function, and enforces the documented capping of 30 seconds. The test suite passes with it, and my cooked failure examples pass as well. It obviously has seen little real world testing so far, so if you decide to use it I'd recommend playing with these edges a little more before the release. It's also important to observe that getaddrinfo_a is a GNU extension and the patch changes code that is common to a number of operating systems, including Mac OS. If the Go compiler is not linking with glibc, the file cgo_unix.go file will need to be duplicated so that the proposed changes only affect Linux, and equivalent changes are done using specific APIs for other relevant operating systems. |
The patch seems incomplete. It seems that it ought to pass down the deadline and use that. But I admit that is complex with the use of singleflight. You mention the documented cap of 30 seconds, but I think you are slightly misreading it. That is the maximum length of time the resolver will wait for a specific name server. If that name server does not respond, the resolver will try another. And then there is a number of times that the resolve will query a name server, which is attempts in resolve.conf. The default for attempts is 2 and the cap is 5. So I think that if you have three name servers in resolv.conf, the default time to wait is 3 * 5 (default timeout) * 2 (default attempts) == 45 seconds, and the maximum time to wait is 3 * 30 * 5 == 450 seconds. Imposing our own deadline of 30 seconds in all cases would mean ignoring the values set in resolv.conf in some cases. In the original bug report the core problem appears to be that getaddrinfo is hanging indefinitely, despite the timeouts in resolv.conf. That seems to be a bug in getaddrinfo. It's not obvious to me that using getaddrinfo_a will avoid that bug. getaddrinfo_a is a more complex code path. Do you know if anybody has tried to report the real bug? |
The patch is definitely suboptimal and incomplete both in that it does not handle multiple operating systems, and does not provide the real requested timeout down to the resolving function. That's why I did not submit the patch upstream, despite knowing how to send CLs. You're also right in that I misread the timeout option as being a total, rather than per server. That said, getaddrinfo_a does seem to avoid the bug, and the right fix (for Linux, at least) could build on it by providing the real timeout for the resolving function: "The call blocks until one of the following occurs: (...) The time interval specified in timeout elapses." I don't understand your question about reporting the "real bug". |
Indeed, but I'm not the right person to report it. I cannot reproduce the bug nor provide any information about the environment in which this is happening. I only have indirect evidence that this is true. Also, this seems like a real bug in Go itself too. Even if it is not indefinite, goroutines are piling up for as long as the underlying call hangs, and that can happen for several minutes as you point out. Not only that, but the logic that is in place today prevents follow up calls from proceeding until the previous call returns. |
I believe this is the libc getaddrinfo bug: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=722075 |
That bug is better known as https://sourceware.org/bugzilla/show_bug.cgi?id=15946 , and is reportedly fixed in glibc 2.20. It would be interesting to know which glibc version was in use by the person who observed this bug. Note that that bug is about having a very large number of file descriptors open simultaneously for DNS lookups. The reported problem is a large number of goroutines waiting to check a DNS timeout. You can have the latter without the former. |
Looking back at the original bug report, I think it's unlikely to be glibc bug 15946, because that bug only arises when all available file descriptors have been opened. The original bug report is saying that many goroutines are hanging waiting on the answer to a single getaddrinfo request that itself is hanging. The single getaddrinfo request will have a single file descriptor, so I don't see any reason to assume that large numbers of file descriptors are open. |
I'm no longer convinced that the Go code has to change. If getaddrinfo didn't hang, everything would work fine. If there were a portable mechanism we could use to add a timeout to getaddrinfo, it might be worth pursuing for better efficiency. But as far as I can tell we can only add a timeout on GNU/Linux; I don't see any mechanism available on other systems. And the core problem here, getaddrinfo hanging, appears to be due to a bug on GNU/Linux. So I think we need to understand the GNU/Linux bug before we start trying to work around it by following a different code path. |
If you make 100 concurrent requests for a specific DNS hostname with a timeout, and the DNS lookup is slow, then by definition you will have 100 goroutines waiting for that DNS lookup to complete. With the current implementation, if you set a deadline for the lookup, you will have an extra 100 goroutines waiting. That is inefficient but it's not crazily inefficient. You're right, of course, that if a DNS lookup is slow, other requests for the same hostname will wait for the first lookup to complete. And, in particular, if the program retries after a timeout it will get stuck waiting for the first lookup to complete. I think we can fix that problem. How does https://golang.org/cl/154610044 look to you? |
It looks like a good incremental improvement making retries work again. It should not be too hard to get rid of the leaking goroutines as well, by changing the algorithm so that the blocked cgo call is the one notifying waiters when it's done, instead of having one extra middle man for every call. |
CL https://golang.org/cl/154610044 mentions this issue. |
This issue was closed by revision 77595e4. Status changed to Fixed. |
CL https://golang.org/cl/166740043 mentions this issue. |
This issue was updated by revision 21a9141. LGTM=iant R=iant CC=golang-codereviews https://golang.org/cl/166740043 |
wheatman
pushed a commit
to wheatman/go-akaros
that referenced
this issue
Jun 25, 2018
Before this CL, if the system resolver does a very slow DNS lookup for a particular host, all subsequent requests for that host will hang waiting for that lookup to complete. That is more or less expected when Dial is called with no deadline. When Dial has a deadline, though, we can accumulate a large number of goroutines waiting for that slow DNS lookup. Try to avoid this problem by restarting the DNS lookup when it is redone after a deadline is passed. This CL also avoids creating an extra goroutine purely to handle the deadline. No test because we would have to simulate a slow DNS lookup followed by a fast DNS lookup. Fixes golang#8602. LGTM=bradfitz R=bradfitz, mikioh.mikioh CC=golang-codereviews, r, rsc https://golang.org/cl/154610044
wheatman
pushed a commit
to wheatman/go-akaros
that referenced
this issue
Jun 26, 2018
Before this CL, if the system resolver does a very slow DNS lookup for a particular host, all subsequent requests for that host will hang waiting for that lookup to complete. That is more or less expected when Dial is called with no deadline. When Dial has a deadline, though, we can accumulate a large number of goroutines waiting for that slow DNS lookup. Try to avoid this problem by restarting the DNS lookup when it is redone after a deadline is passed. This CL also avoids creating an extra goroutine purely to handle the deadline. No test because we would have to simulate a slow DNS lookup followed by a fast DNS lookup. Fixes golang#8602. LGTM=bradfitz R=bradfitz, mikioh.mikioh CC=golang-codereviews, r, rsc https://golang.org/cl/154610044
wheatman
pushed a commit
to wheatman/go-akaros
that referenced
this issue
Jul 9, 2018
Before this CL, if the system resolver does a very slow DNS lookup for a particular host, all subsequent requests for that host will hang waiting for that lookup to complete. That is more or less expected when Dial is called with no deadline. When Dial has a deadline, though, we can accumulate a large number of goroutines waiting for that slow DNS lookup. Try to avoid this problem by restarting the DNS lookup when it is redone after a deadline is passed. This CL also avoids creating an extra goroutine purely to handle the deadline. No test because we would have to simulate a slow DNS lookup followed by a fast DNS lookup. Fixes golang#8602. LGTM=bradfitz R=bradfitz, mikioh.mikioh CC=golang-codereviews, r, rsc https://golang.org/cl/154610044
wheatman
pushed a commit
to wheatman/go-akaros
that referenced
this issue
Jul 20, 2018
Before this CL, if the system resolver does a very slow DNS lookup for a particular host, all subsequent requests for that host will hang waiting for that lookup to complete. That is more or less expected when Dial is called with no deadline. When Dial has a deadline, though, we can accumulate a large number of goroutines waiting for that slow DNS lookup. Try to avoid this problem by restarting the DNS lookup when it is redone after a deadline is passed. This CL also avoids creating an extra goroutine purely to handle the deadline. No test because we would have to simulate a slow DNS lookup followed by a fast DNS lookup. Fixes golang#8602. LGTM=bradfitz R=bradfitz, mikioh.mikioh CC=golang-codereviews, r, rsc https://golang.org/cl/154610044
wheatman
pushed a commit
to wheatman/go-akaros
that referenced
this issue
Jul 30, 2018
Before this CL, if the system resolver does a very slow DNS lookup for a particular host, all subsequent requests for that host will hang waiting for that lookup to complete. That is more or less expected when Dial is called with no deadline. When Dial has a deadline, though, we can accumulate a large number of goroutines waiting for that slow DNS lookup. Try to avoid this problem by restarting the DNS lookup when it is redone after a deadline is passed. This CL also avoids creating an extra goroutine purely to handle the deadline. No test because we would have to simulate a slow DNS lookup followed by a fast DNS lookup. Fixes golang#8602. LGTM=bradfitz R=bradfitz, mikioh.mikioh CC=golang-codereviews, r, rsc https://golang.org/cl/154610044
This issue was closed.
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
The text was updated successfully, but these errors were encountered: