WKS::gc_heap::make_unused_array seg fault #711
This crash is GC heap corruption. It can be caused by a bug in the runtime; or a bug in your interop code.
Any chance you can share the crash dump with symbols for us to take a look? If it would be ok to share it with just me, my email is in my github profile. |
Hello Jan. Thanks, I've emailed you; happy to share the crash dump and symbols. |
Thank you for sharing the crash dumps offline. The crash is caused by a dead thread registered in I am still trying to find out what might have caused this thread to not remove itself from this list. It would be useful to know whether the missing call to |
Hello Jan, I've added the code you mentioned just at the start of the run and tried to set a break in gdb, I'm not as familiar with gdb as I am with debugging on windows so I might have made a mistake. My first attempt gets a Segmentation fault straight away before the code really gets going: (gdb) break ThreadStore::DetachCurrentThread Breakpoint 1 (ThreadStore::DetachCurrentThread) pending. Program received signal SIGSEGV, Segmentation fault. Then backtrace shows: #0 GetNext (this=0x7fffffffd418) |
This crash is a different symptom of the same problem (dead thread in active thread list). Could you please set a breakpoint at If |
To provide more context - the path to
We need to find out where things are getting derailed on this path. |
Thanks, just running it now with gdb told to: break TlsDestructionMonitor::~TlsDestructionMonitor I'm guessing this is the correct command syntax? |
Yes, that's the right way to set the breakpoint. |
If you are able to stop at |
Ok thanks I'll try and send an update. I've been on another project most of the day and getting my machine back to the state to be able to rerun the test is taking a while. Hopefully I'll get there soon! |
Didn't get a break point in break TlsDestructionMonitor::~TlsDestructionMonitor before the segmentation fault. Trying __nptl_deallocate_tsd next. |
Think I'm hitting my lack of gdb knowledge, after hitting __nptl_deallocate_tsd I tried break *0x7f3b5f649ca0, then I thought I'd have to continue until <__nptl_deallocate_tsd+144>: callq *%rdx and then single step after that 2nd breakpoint but I'm getting this: (gdb) break *0x7f3b5f649ca0 0x00007ffff58b5bd2 in __nptl_deallocate_tsd () from /lib64/libpthread.so.0 |
The address to set the breakpoint on will be different on your machine. You can find it from disassembly e.g. by running Here is the transcript of what I have executed on my test app that just creates threads. The transcript shows that the call always goes into NativeLibrary.so in my testapp:
|
I don't seem to be getting the breakpoint 2 hit: Breakpoint 1, 0x00007ffff58b5bd0 in __nptl_deallocate_tsd () Breakpoint 1, 0x00007ffff58b5bd0 in __nptl_deallocate_tsd () Breakpoint 1, 0x00007ffff58b5bd0 in __nptl_deallocate_tsd () Breakpoint 1, 0x00007ffff58b5bd0 in __nptl_deallocate_tsd () |
Run the test again from the start with the same result, after breaking on __nptl_deallocate_tsd I find the address of <__nptl_deallocate_tsd+144>: callq *%rdx. Then I do break *address and continue and it never hits the second breakpoint for __nptl_deallocate_tsd+144. I've hit continue over 10 times right up to the segmentation fault and breakpoint 2 is never hit. |
Pretty clean run, only had to continue 3 times before segmentation fault: Breakpoint 1, 0x00007ffff58b5bd0 in __nptl_deallocate_tsd () from /lib64/libpthread.so.0 Breakpoint 1, 0x00007ffff58b5bd0 in __nptl_deallocate_tsd () from /lib64/libpthread.so.0 Breakpoint 1, 0x00007ffff58b5bd0 in __nptl_deallocate_tsd () from /lib64/libpthread.so.0 Program received signal SIGSEGV, Segmentation fault. |
Could you please restart and try:
|
Looks like the jump at <__nptl_deallocate_tsd+142>: je 0x7ffff58b5c30 is kicking in, seems to be going in a loop back to <__nptl_deallocate_tsd+96>: Breakpoint 1, 0x00007ffff58b5bd0 in __nptl_deallocate_tsd () from /lib64/libpthread.so.0 |
Do you have any custom steps to link PalAttachThread in my test app (https://github.com/dotnet/runtimelab/tree/feature/NativeAOT/samples/NativeLibrary with minor modifications):
PalAttachThread from the crashdumps:
One calls |
Also, my test app does not have dependency on
I would like to know where the dependency on |
Hi Jan, No custom steps, it's built with: dotnet publish Arcontech.DotNetAPI.NetStandard.Native.csproj /p:NativeLib=Shared -r rhel-x64 -c release /p:SelfContained=true where the csproj is pretty standard, it has some company details in the copyright and otherwise is just a normal project (added spaces before and after < > otherwise the editor was flattening the text): < Project Sdk="Microsoft.NET.Sdk" > < PropertyGroup > < PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|AnyCPU'" > < PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|AnyCPU'" > < ItemGroup > < ItemGroup > < /Project > |
The natively compiled AOT .so is used by some C/C++ code calling into the .so entry points, then the natively compiled AOT .so also loads a different C++/C natively compiled .so that we use for TCP/IP communication. So the stack is: xxx executable (C/C++ classes that use the arcontech_capi.so) I'll see if I can work out how the reference to libstdc++.so.6 is getting in there. |
Hi Jan, would you know what namespaces to look for that'll cause the Native AOT to reference libstdc++.so over libc.so? I'm trying to figure it out but I can't see anything obvious yet. It might need breaking my project into smaller parts to eliminate but that might not be a quick job. Would you normally not expect libstdc++.so to be referenced? |
I would expect |
I do not think that the managed code makes a difference. I think it will be an environment problem, like the version of clang installed on the machine. Speaking of which - what is the version of clang that you got? The version that I got by running
|
Hello Jan, I've built the NativeLibrary as a Shared .so and got the same result: $ ldd NativeLibrary.so I've added the build output here showing the llvm version: $ /opt/ARCONTECH/.dotnet/dotnet publish /p:NativeLib=Shared -r rhel-x64 -c release /p:SelfContained=true Determining projects to restore... |
Could you please try |
Hello Jan, Reverted back to 3.4.2 but still the same issue: [arcon@Darren NativeLibrary]$ /opt/ARCONTECH/.dotnet/dotnet publish /p:NativeLib=Shared -r rhel-x64 -c release /p:SelfContained=true Determining projects to restore... |
I've done a -v diag build with NativeLibrary to see what is going on and I can see that -lstdc++ is being included, I'm just not sure how it gets there. clang "obj/release/net5.0/rhel-x64/native/NativeLibrary.o" -o "bin/release/net5.0/rhel-x64/native/NativeLibrary.so" -Wl,--version-script=obj/release/net5.0/rhel-x64/native/NativeLibrary.exports /opt/ARCONTECH/.nuget/packages/runtime.linux-x64.microsoft.dotnet.ilcompiler/6.0.0-preview.2.21125.1/sdk/libbootstrapperdll.a /opt/ARCONTECH/.nuget/packages/runtime.linux-x64.microsoft.dotnet.ilcompiler/6.0.0-preview.2.21125.1/sdk/libRuntime.a /opt/ARCONTECH/.nuget/packages/runtime.linux-x64.microsoft.dotnet.ilcompiler/6.0.0-preview.2.21125.1/framework/libSystem.Native.a /opt/ARCONTECH/.nuget/packages/runtime.linux-x64.microsoft.dotnet.ilcompiler/6.0.0-preview.2.21125.1/framework/libSystem.Globalization.Native.a /opt/ARCONTECH/.nuget/packages/runtime.linux-x64.microsoft.dotnet.ilcompiler/6.0.0-preview.2.21125.1/framework/libSystem.IO.Compression.Native.a /opt/ARCONTECH/.nuget/packages/runtime.linux-x64.microsoft.dotnet.ilcompiler/6.0.0-preview.2.21125.1/framework/libSystem.Net.Security.Native.a /opt/ARCONTECH/.nuget/packages/runtime.linux-x64.microsoft.dotnet.ilcompiler/6.0.0-preview.2.21125.1/framework/libSystem.Security.Cryptography.Native.OpenSsl.a -g -Wl,-rpath,'$ORIGIN' -Wl,--as-needed -pthread -lstdc++ -ldl -lm -lz -lgssapi_krb5 -lrt -lanl -shared -Wl,--require-defined,CoreRT_StaticInitialization -Wl,--discard-all -Wl,--gc-sections (TaskId:119) |
For some reason, it is causing To make sure that we are following the right trail, could you please try the following?
Does it run fine printing "Hello World" forever, or does it also crash? |
Another experiment to try: Compile empty C program with verbose linker output
|
Hello Jan, I've tried changing the add as you suggested and a.out just produces Hello World! for ever. It's been running for a few minutes and no crash yet. |
The dummy.c verbose compile produces the following matches with libstdc++ attempt to open /usr/bin/../lib/gcc/x86_64-redhat-linux/4.8.5/libstdc++.so succeeded It seems to succeed with the .so and not look for the .a. The /usr/bin/../lib/gcc/x86_64-redhat-linux/4.8.5/ directory (no .a): [arcon@Darren lib64]$ cd /usr/bin/../lib/gcc/x86_64-redhat-linux/4.8.5/ And then the directory the symlink points to: [arcon@Darren 4.8.5]$ cd ../../../../lib64/ |
Ok, it means that the dynamic linking against libstdc++.so.6 alone is not the root cause of the problem. I will keep digging in the crash dumps... |
Thanks. I'll see if I can work out what areas of the code might be provoking the issue. I can also try building on different Linux versions to see if there's any different behaviour. |
Hi Jan, we've tried the code as a fresh check out on a clean CentOS 7 VM then rebuilt and it still behaves the same. If we run the code using .Net Core it runs fine but as a Native AOT build we get the same error as above. We've also retested Native AOT on windows and that is stable. Seems to be a Native AOT linux only issue. |
Ok, I think I have figured it out:
I think either of these options will fix the problem - can you give it a try?
The standard .NET Core is compiled with statically linked C++ runtime (the second option above) and it is why it works fine. |
- Add note about -pthread option on Unix (see dotnet#711 for details) - Fix warnings
Hello Jan, I've tried the -pthread option and can confirm it's working. Thanks for looking into this issue, sorry it turned out to be a config issue but it wasn't an obvious problem to track down. I'll carry on testing Native AOT with the -pthread build. |
Thank you for your cooperation with tracking this down! We will know what the problem is next time somebody hits this. |
Hello, I am building a C POD dll/so for C# code using Native AOT. The C# code also then uses a C++/C dll/so. When I run the code under Windows for Native AOT/.NET Core and Framework it all works fine. Under Linux it works fine for .NET Core but I get a seg fault with AOT. I've built as debug and got a backtrace with GDB. I can get it to happen pretty quickly (within a few minutes' run). Backtrace is:
#0 0x00007ffff61604b7 in WKS::gc_heap::make_unused_array (
x=0x8a3e30 "\340\021\376\366\377\177", size=140737332173216, clearp=0,
resetp=)
at /__w/1/s/src/coreclr/nativeaot/Runtime/../../gc/gc.cpp:27805
#1 0x00007ffff61927f1 in fix_allocation_context (acontext=0x8a3ce0,
for_gc_p=, record_ac_p=1)
at /__w/1/s/src/coreclr/nativeaot/Runtime/../../gc/gc.cpp:6913
#2 WKS::GCHeap::FixAllocContext (this=, context=0x8a3ce0,
arg=0x1, heap=)
at /__w/1/s/src/coreclr/nativeaot/Runtime/../../gc/gc.cpp:40955
#3 0x00007ffff615224d in GCToEEInterface::GcEnumAllocContexts (
fn=0x7ffff61606c0 <WKS::fix_alloc_context(gc_alloc_context*, void*)>,
param=0x7fffbdf8e630)
at /__w/1/s/src/coreclr/nativeaot/Runtime/gcrhscan.cpp:104
#4 0x00007ffff6178a57 in fix_allocation_contexts (for_gc_p=1)
at /__w/1/s/src/coreclr/nativeaot/Runtime/../../gc/gc.cpp:6994
#5 WKS::gc_heap::garbage_collect (n=0)
at /__w/1/s/src/coreclr/nativeaot/Runtime/../../gc/gc.cpp:20036
#6 0x00007ffff6169c70 in WKS::GCHeap::GarbageCollectGeneration (
this=, gen=0, reason=reason_alloc_soh)
at /__w/1/s/src/coreclr/nativeaot/Runtime/../../gc/gc.cpp:41936
#7 0x00007ffff616be37 in WKS::gc_heap::try_allocate_more_space (
acontext=, size=, flags=,
gen_number=)
at /__w/1/s/src/coreclr/nativeaot/Runtime/../../gc/gc.cpp:15841
#8 0x00007ffff61924c0 in allocate_more_space (acontext=0x7fffb8000c10,
flags=0, alloc_generation_number=0, size=)
at /__w/1/s/src/coreclr/nativeaot/Runtime/../../gc/gc.cpp:16343
#9 allocate (jsize=48, acontext=0x7fffb8000c10, flags=0)
at /__w/1/s/src/coreclr/nativeaot/Runtime/../../gc/gc.cpp:16374
#10 WKS::GCHeap::Alloc (this=, context=0x7fffb8000c10, size=48,
flags=0) at /__w/1/s/src/coreclr/nativeaot/Runtime/../../gc/gc.cpp:40912
#11 0x00007ffff61aff28 in RhpNewObject ()
at /__w/1/s/src/coreclr/nativeaot/Runtime/unix/unixasmmacrosamd64.inc:435
#12 0x00007ffff626a8fd in arcontech_capi_Arcontech_DotNetAPI_RecordWatcher__OnUpdate (this=..., databook=..., databookStatus=DatabookUnavailable,
replaceAll=false, fields=...)
at /opt/ARCONTECH/svn/AKB-2675/Arcontech.DotNetAPI/RecordWatcher.cs:445
#13 0x00007ffff626c8b4 in arcontech_capi_Arcontech_DotNetAPI_RecordWatcher__Arcontech_DotNetAPI_FeedApi_IFeedInstrumentWatcher_Update (this=..., iItem=...)
at /opt/ARCONTECH/svn/AKB-2675/Arcontech.DotNetAPI/RecordWatcher.cs:919
#14 0x00007ffff62a5975 in arcontech_capi_Arcontech_DotNetAPI_FeedApi_FeedHandlerInstrumentWatcher__ProcessBatch (this=..., receptionCache=...)
at /opt/ARCONTECH/svn/AKB-2675/Arcontech.DotNetAPI/FeedApi/FeedHandlerInstrumentWatcher.cs:112
#15 0x00007ffff6612179 in __Arcontech_DotNetAPI_FeedApi_IFeedQueueItem_DispatchMessage (this=..., receptionCache=...)
at /opt/ARCONTECH/svn/AKB-2675/Arcontech.DotNetAPI/FeedApi/FeedHandlerBase.cs:924
#16 0x00007ffff62d926c in arcontech_capi_Arcontech_DotNetAPI_FeedApi_FeedHandlerQueueDispatcher_QueueItem__Dispatch (this=...)
at /opt/ARCONTECH/svn/AKB-2675/Arcontech.DotNetAPI/FeedApi/FeedHandlerQueueDispatcher.cs:38
#17 0x00007ffff62ab048 in arcontech_capi_Arcontech_DotNetAPI_FeedApi_FeedHandlerQueueDispatcher__ProcessQueue (this=...)
at /opt/ARCONTECH/svn/AKB-2675/Arcontech.DotNetAPI/FeedApi/FeedHandlerQueueDispatcher.cs:147
#18 0x00007ffff645bef7 in S_P_CoreLib_System_Threading_Thread_StartHelper__RunWorker (this=...)
at //src/libraries/System.Private.CoreLib/src/System/Threading/Thread.cs:68
#19 0x00007ffff645be78 in S_P_CoreLib_System_Threading_Thread_StartHelper__Run
(this=...)
at //src/libraries/System.Private.CoreLib/src/System/Threading/Thread.cs:54
#20 0x00007ffff6370947 in S_P_CoreLib_System_Threading_Thread__StartThread (
parameter=140737352128784)
at //src/coreclr/nativeaot/System.Private.CoreLib/src/System/Threading/Thread.CoreRT.cs:430
#21 0x00007ffff6370e80 in S_P_CoreLib_System_Threading_Thread__ThreadEntryPoint
(parameter=140737352128784)
at //src/coreclr/nativeaot/System.Private.CoreLib/src/System/Threading/Thread.CoreRT.Unix.cs:111
#22 0x00007ffff58b5e65 in start_thread () from /lib64/libpthread.so.0
#23 0x00007ffff70e888d in clone () from /lib64/libc.so.6