misc/cgo/test: morestack on g0 on Solaris under "unlimit stacksize" #12210
The hardware is an HP Z230, 3.5 GHz Intel Xeon E3-1270v3 quad core (8 threads with Hyper-Threading), 8 GB of RAM allocated to the zone.
For what it's worth, this works on Oracle Solaris, so there's likely something subtle here.
@fazalmajid do you have gcc installed? If so, what version?
FYI, the solaris builder is running smartos joyent_20141212T011737Z and the test passes.
Yes, GCC 5.2.0 built from source, defaulting to amd64. I build the bootstrap Go 1.4.2 using:
I can try to bisect the tests to find out which one is triggering the SIGTRAP, if you'd like.
Bisecting would be great. Thanks.
@fazalmajid is this failure in comparison to the release version of Go 1.4, or did this test pass previously at some point with Go 1.5 while it was in development? I ask because Go 1.5 is the first version of Go where cgo support (and thus the cgo test that's failing) was enabled for Solaris:
Also, for the record, why are you building Go with -std=c11? It isn't required, and it's unlikely that C11 or C++11 will work as expected on SmartOS. You should have been able to build Go 1.4 with gcc using -std=c89 or -std=gnu89 and without the other special flags you added.
Really strange: running the tests manually works:

@binarycrusader: I am passing those flags to the bootstrap Go 1.4.2. Without them, I get the error below (this is on an OpenIndiana oi151a1 machine, but I get the same results on SmartOS). If you use a version of GCC older than 5.2, you can probably do without them; -std=gnu11 is the default for GCC 5.x.
@fazalmajid right, I was suggesting that instead of building Go 1.4 with -std=gnu11, you build it with -std=gnu89. I know that gnu11/c11 is the default for GCC 5.2; I was suggesting you specify the older standard explicitly to see what the result was. Also, as I asked before, what version of Go did you successfully run all tests with before this failure? Was 1.4 the last version you tested?
Yes, 1.4.2 was the last version I tested. Go 1.4.2 builds successfully with gcc -std=gnu99, but misc/cgo/test has the same failure when building Go 1.5 using the Go 1.4.2 bootstrap built with -std=gnu89. As for running the test manually, I found out what was going wrong (or right): it was testing $GOROOT_FINAL/misc/cgo/test/cgo_test.go (which I had patched to comment out all cgo tests so I could install), not the unpatched one in the build directory, which fails. When I copy the unpatched one back to $GOROOT_FINAL, I can reproduce the error and will get back to bisecting the cgo tests.
@rsc any of these tests in misc/cgo/test/cgo_test.go will cause the failure on OI-151a1/GCC-5.2:
The trend there is pretty obvious: calls from Go -> C -> Go fail. |
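For readers unfamiliar with the pattern, here is a minimal sketch of such a Go -> C -> Go round trip. This is not one of the actual misc/cgo/test cases and the names are invented for illustration; the real failing tests also run additional C code, which consumes C stack before calling back into Go.

```go
package main

/*
extern void goCallback(int);
*/
import "C"

import "fmt"

//export goCallback
func goCallback(n C.int) {
	// Runs back on a Go stack after the C side calls in.
	fmt.Println("back in Go, n =", n)
}

func main() {
	// C.goCallback goes through the cgo-generated C wrapper, so this single
	// call already exercises the Go -> C -> Go transition.
	C.goCallback(1)
}
```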
Does the SIGTRAP always occur at the same PC value? Can you run nm on the binary to see what function that PC is in? |
The PC is different for each test. Not sure how to get to the executable, as it seems to be deleted when the test concludes. |
You can get the executable by "cd misc/cgo/test; go test -c". You can run it by "GOTRACEBACK=2 ./test.test". |
According to GDB, the problem is in runtime.morestack in asm_amd64.s line 302:
and that matches what the Go trace reports:
Thanks. Now I see that this means that runtime·morestack was called while running on the g0 stack. Unfortunately, I have no idea how that could happen. And the backtrace doesn't make much sense. |
@fazalmajid is it possible for you to try this with a different version of gcc such as 4.7.3? To do so, I believe you'd need to install that older version and then ensure it's first in $PATH both when you build go itself and when you run the tests. It'd be helpful to know which version of gcc the Solaris builder has installed and try with that same version. For the record, I'm using gcc 4.7.3 on a very recent build of Oracle Solaris without any issues.
The solaris builder (running smartos) is using gcc 4.7.3.
I tried with gcc 4.7.0, with the same results. Let me disable gcc 5.2 altogether and try again.
Nope, even with GCC 5.2 disabled altogether to force cgo to use GCC 4.7.0 to compile, I am still getting the same error. Let me try building GCC 4.7.3.
I also have the OpenSSL patch applied: https://www.openssl.org/~appro/values.c |
I tried again with GCC 4.7.3, and disabled SSP in GCC, to no avail. |
I gather that this passes on Solaris and fails on SmartOS. What is the difference between the two? The nature of the failure makes me suspect that something in the call from Go to C to Go is changing the value of the TLS variable g. That could be due to differences in the system linker. |
@minux: could you share the output of gcc -v on the solaris builder? |
The current solaris-amd64-smartos builder:
$ uname -v
joyent_20141212T011737Z
$ gcc -v
Using built-in specs.
COLLECT_GCC=/opt/local/gcc47/bin/gcc
COLLECT_LTO_WRAPPER=/opt/local/gcc47/libexec/gcc/x86_64-sun-solaris2.11/4.7.3/lto-wrapper
Target: x86_64-sun-solaris2.11
Configured with: ../gcc-4.7.3/configure --enable-languages='c obj-c++ objc
go fortran c++' --enable-shared --enable-long-long
--with-local-prefix=/opt/local --enable-libssp --enable-threads=posix
--with-boot-ldflags='-static-libstdc++ -static-libgcc -Wl,-R/opt/local/lib
' --disable-nls --enable-__cxa_atexit
--with-gxx-include-dir=/opt/local/gcc47/include/c++/ --without-gnu-ld
--with-ld=/usr/bin/ld --with-gnu-as --with-as=/opt/local/bin/gas
--prefix=/opt/local/gcc47 --build=x86_64-sun-solaris2.11
--host=x86_64-sun-solaris2.11 --infodir=/opt/local/gcc47/info
--mandir=/opt/local/gcc47/man
Thread model: posix
gcc version 4.7.3 (GCC)
@minux |
Yes, standard image, base-64 14.2.0, not sure how to get the UUID, but I run it on many image versions. The GCC is from pkgsrc.
I rebuilt using a zone with base-64/15.2.0, UUID 5c7d0d24-3475-11e5-8e67-27953a8b237e and the pkgin gcc (which is configured with /usr/bin/ld), and I am still experiencing the error. I also see it on an OpenIndiana oi151a1 machine, so it's not a regression introduced by one of the newer SmartOS/Illumos kernels. The zone has the stock /etc; I only modified /etc/{passwd,shadow,group,user_attr} and used crle to set the 64-bit default library search path.
I made some progress: I temporarily removed my .tcshrc and .login files to have a clean slate in terms of environment variables and other settings, and the test now passes. I need to figure out what exactly the root cause is: an environment variable, plimit, etc.
OK, I finally got it: my .login does (among other things) unlimit stacksize. With unlimited stacksize, the test crashes. Without the limit, it works fine. |
Thanks for the report. Just for further reference, gcc 5.2 does work:
Yes, I confirmed my initial build using GCC 5.2 works when a stacksize limit is set. |
Let me confirm: you are saying that the test works when there is a limit on stack size, and that it fails when there is no limit on stack size? How exactly are you changing the stack size limit? |
Yes. I change the stack size limit using "unlimit stacksize" in tcsh, which in turn uses setrlimit(2) to set RLIMIT_STACK to RLIM_INFINITY (which on Solaris is -3). According to Stevens this is available on pretty much any UNIX OS (XSI, Linux, FreeBSD, Mac OS X, Solaris).
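To make the comparison concrete, here is a small sketch (standard library only, written for Linux; on Solaris the same query would go through getrlimit(2), e.g. via golang.org/x/sys/unix) that prints what the process sees for RLIMIT_STACK, so the values can be compared before and after "unlimit stacksize" / "ulimit -s unlimited":

```go
package main

import (
	"fmt"
	"syscall"
)

func main() {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_STACK, &rl); err != nil {
		panic(err)
	}
	// With an unlimited stack the soft limit comes back as the platform's
	// RLIM_INFINITY sentinel (-3 on Solaris, per the comment above) rather
	// than a usable byte count.
	fmt.Printf("RLIMIT_STACK soft=%#x hard=%#x\n", rl.Cur, rl.Max)
}
```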
For what it's worth, executing "ulimit -s unlimited" on an amd64 Linux distribution and then running "go test" in the misc/cgo/test directory is successful, so this is likely related to some of the subtle differences in Solaris when you set stack size to be unlimited:
I was able to produce a similar failure on Linux only by setting the stack limit to a very small amount (64 kB). |
@fazalmajid I've been unable to reproduce this issue on a recent build of Oracle Solaris. That suggests this bug might be in SmartOS itself and older versions of Solaris, or that we still don't understand the real issue. By the way, Solaris doesn't generally allow memory overcommit; how much memory did you have free at the time you ran the tests?
As reported by top in the zone: I don't think it has to do with the actual setting; setting the stacksize limit to a ridiculously large value like 20 GB (the maximum you can apparently set before the value you supply is considered unlimited) does not cause the error.
@fazalmajid I assume this is still happening at the current development head? In your Aug 25 comment you ran and caught the crash under gdb, showing that it was in morestack. Can you run 'where' to see what called morestack? FWIW, although gdb pins the blame on that line of morestack, the problem is actually the previous line, an INT $3 instruction which causes the SIGTRAP. The processor behavior is to advance the PC during the trap, which is likely why you see the line after it being given in the trace. But either way the caller is what we want to know more about. The relevant code is:
and the problem is therefore that somehow morestack has been called on a system stack. The question is why. It would also be interesting to see the output of x/100xg $rsi at that point. Thanks.
Yes, it's still happening with the Git HEAD. Here is the output of "where" and "x/100xg $rsi" as requested:
Thanks for the extra information. I got access to a Solaris box and was able to reproduce this. It looks like when the stack is "unlimited", asking Solaris how big the stack is returns the current stack size (in this case, 0x3000 bytes), not its maximum size. Then Go tries to stay within that size and triggers the call to morestack on a cgo callback because the C code in the middle has taken up all of the original 0x3000 bytes and then some. Will send a CL making Go less gullible. |
CL https://golang.org/cl/17452 mentions this issue. |
POSIX defines ss_size to be the stack size (i.e. currently allocated, not maximum allocatable): It looks like when RLIMIT_STACK is set to a specific value, the stack size is set to that value (after all, this just allocates virtual address space, not actual memory, which only happens when a page fault occurs). When RLIMIT_STACK is unlimited, it obviously can't do that, and thus allocates the minimum (2*PAGESIZE, which should be 8 KB by default, but I'm guessing that by the time ld.so finishes its work it has grown by one extra page to give your 0x3000 value). I'm not sure what purpose knowing the maximum size serves for Go here. Go is in good company: Java had the same issue (note the lame response):
Go checks for impending stack overflow at the beginning of most functions. To do that on the system stacks, it needs to know how big the stack is allowed to be. Whatever POSIX happens to say, Solaris seems to be the only system that reports such a small size. It's fine. |
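The actual change is in CL 17452 above; purely to illustrate the "less gullible" idea, a sketch of the kind of sanity check involved might look like the following. The package, function name, constants, and thresholds are invented for this example and are not taken from the Go runtime.

```go
package stackguard // hypothetical package, for illustration only

// sanitizeSystemStackSize distrusts an implausibly small reported size for
// the system (g0) stack. When RLIMIT_STACK is unlimited, Solaris reports the
// currently allocated size (0x3000 bytes in this issue) instead of a maximum,
// so a tiny answer is treated as unknown and replaced with a conservative
// default rather than letting a cgo callback trip morestack on g0.
func sanitizeSystemStackSize(reported uintptr) uintptr {
	const (
		minBelievable = 64 << 10 // anything under 64 KB can't be a real bound
		fallback      = 8 << 20  // assume 8 MB when the OS answer is unusable
	)
	if reported < minBelievable {
		return fallback
	}
	return reported
}
```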
The only thing that concerns me is that I was never able to reproduce this on Solaris proper; this issue seems to be unique to Illumos. I don't think the workaround put in place will break anything, but I intend to look into why there might be a difference.
@binarycrusader Here are my results (compiled
As 64-bit: As 32-bit: |
@fazalmajid if I had to guess, I'd say the difference might be due to a fix Solaris has for stack_inbounds() being broken for the main thread. Historically, stack_inbounds() just checked that the given address was greater than or equal to the stack base and less than base + curthread->ul_stack.ss_size. We fixed libc_init() so that if the stack is unlimited, it tries reading the stack size from /proc/self/rmap. If that fails, we just set it to 8 Mbytes. That's a wild guess at the moment, so I think the Go fix that was put in place likely remains the right workaround for now.
Fair enough. Your results confirm the Oracle Solaris 11 behavior is different from the Illumos one (and presumably Solaris 10), which explains why the test wasn't failing on your machines. On an unrelated note, I am really impressed with Go's scalability on Solaris, despite how recently it has been fully supported as a platform - I had a nsq_to_http process running yesterday with 600+ LWPs on a 32-core 64-thread machine using the equivalent of 30 cores running flat out. |
At least on my SmartOS box running joyent_20150514T133314Z: