Memory Leak after upgrading to .NET Core 1.1 on Linux #7144
@agoretsky which memory column are you referring to? If it is VIRT, it is likely OK and there is no leak. If it is RES, that would be a problem. The virtual memory used by the application doesn't reflect the physical memory usage. We allocate up to 2GB of virtual memory on x64 as a range in which we then map physical memory as needed. The virtual address space is private to each process, and each process has about 128 terabytes of virtual address space available. |
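For anyone who wants to check the two columns from inside the process rather than from top, here is a minimal sketch. It assumes Linux, where /proc/self/status exposes VmSize (what top shows as VIRT) and VmRSS (what top shows as RES). `ProcStatus` and its methods are hypothetical helper names, not part of the repro app or of .NET.

```csharp
using System;
using System.IO;

// Hypothetical helper: reads the process's own VIRT/RES figures on Linux.
public static class ProcStatus
{
    // Parses a line such as "VmRSS:     123456 kB" and returns the value in kB,
    // or -1 if the field does not appear in the given text.
    public static long ParseFieldKb(string statusText, string field)
    {
        foreach (string line in statusText.Split('\n'))
        {
            if (!line.StartsWith(field + ":", StringComparison.Ordinal))
                continue;

            // Everything after "Field:" is whitespace, the number, then the unit.
            string[] parts = line.Substring(field.Length + 1)
                .Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
            return long.Parse(parts[0]);
        }
        return -1;
    }

    // Prints both values for the current process (Linux only).
    public static void PrintCurrent()
    {
        string text = File.ReadAllText("/proc/self/status");
        Console.WriteLine($"VIRT (VmSize): {ParseFieldKb(text, "VmSize")} kB");
        Console.WriteLine($"RES  (VmRSS):  {ParseFieldKb(text, "VmRSS")} kB");
    }
}
```

Calling `ProcStatus.PrintCurrent()` periodically from the app would show whether it is VmRSS that keeps growing, which is the case that matters here.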
@janvorli RES |
@agoretsky ok, I'll take a look. |
@agoretsky I have tried to run your app locally and I am not seeing that memory consumption:
I am getting exactly the same values for 1.1.0 and 1.0.1. I have verified that I am really running the 1.1.0 bits by attaching lldb to the process and running "target modules list" command. The excerpt from the output confirms I am: |
@janvorli,
After running the repro with these lines in a Docker container, memory did not start growing immediately and grew very slowly, but it also reached more than 1GB
docker logs output:
According to logs it can't connect to the server, I changed repro app to .NET Core 1.0.1 and got successfully connected logs:
So with .NET Core 1.1 the repro app can't connect to the server and consumes a lot of memory, but with .NET Core 1.0.1 it can connect |
@agoretsky I have added your new lines of code and I can repro the same exception - but with both 1.0.1 and 1.1.0. I've again attached lldb to verify that I am really running with 1.0.1 runtime. |
I am running it on a physical machine with Ubuntu 14.04 |
My first thought was that we might have dropped a TLS version from being supported in 1.1, but that doesn't seem to be the case (and if you're seeing it on 1.0.1, Jan, would be even less likely). Do we know what host:port is being talked to when the remote side drops the connection? |
I was on ubuntu 16.04.1 LTS and I pushed dockerfile which I also used to repro it |
Now on Ubuntu with .NET Core 1.1 I see only "Connecting" in the logs and growing memory; with 1.0.1 I get "Connected" after "Connecting" and everything is OK. I can send you my test Apple certificate file and password if you email me |
@agoretsky, @bartonjs - I can confirm that I can repro the problem. With the cert and password I have received from @agoretsky, the app compiled targeting 1.1.0 leaks and the one targeting 1.0.1 does not. In both cases, though, the output of the app is the same; I am not seeing any exceptions. But I think that could be because my native Linux box has 24GB of RAM. @agoretsky how much RAM does your machine have?
The memory consumption for 1.1.0 grows to about 2GB for me in about 30 seconds. In a couple of minutes, it has grown to about 6GB. Then it occasionally drops down to a value between 2 and 5GB, grows again, and keeps cycling that way. I have looked at the memory map and used sos commands to see if it is related to our EE heaps (the GC heap plus other heaps used by the CoreCLR runtime). The GC heap is about 230MB and the other heaps are just a couple of megabytes.
I have used the Linux memleax tool (after fixing a few issues it had with the fact that we map non-ELF files into memory as executables) to try to track down the source of the leaks. This tool attaches to a process, hooks malloc / free and reports the call stacks of all calls to malloc whose memory was not freed within a certain time interval. I've set it to 30 seconds, and it seems we are leaking a lot of memory allocated in PEM_ASN1_read_bio. The amount allocated per malloc is small (2..30 bytes), but the number of times it is allocated is huge, according to this loop. memleax reports 11 distinct call stacks, differing only in the native stack frames in libcrypto.so. The call stack doesn't contain the functions in pal_x509.cpp because the calls to libcrypto.so are compiled as tail calls, and PEM_ASN1_read_bio also ends up being a tail call. And memleax obviously cannot understand the managed code frames. But since the last managed call frame before the native code stack in libcrypto.so is always the same, I believe all the leaks stem from the same interop call.
Unfortunately, I am unable to run memleax on the app when it runs under lldb, so I will have to find a way to identify the problematic caller. I will update it here after I find it. |
@janvorli, there was 2GB for Ubuntu VM |
Based on the stack traces from the memleax and some additional debugging, I believe that the leaked allocations stem from PemReadBioX509Crl. |
@janvorli Nope. Unless the CRL was simply that big. We would have preceded the call with Is the CRL object leaking? Or something else?
|
@bartonjs what I can see is that just executing the CryptoNative_PemReadBioX509Crl enlarges the consumed native memory by 900MB. I mean, right before the call the memory consumption is 900MB less than right after the call. Then the CryptoNative_X509CrlDestroy gets called, which doesn't change anything on the consumption side and then CryptoNative_X509StoreDestory gets called and it also doesn't change anything. That's why I was thinking that there might be some callbacks that are invoked while the CryptoNative_PemReadBioX509Crl is executing. |
I have tried another memory allocation profiler to see where the allocations come from. This time it was the heaptrack tool (http://milianw.de/blog/heaptrack-a-heap-memory-profiler-for-linux). It generates somewhat more detailed output, with summaries of the memory allocations. I have left it running for the duration of the first request in the batch. Here is an excerpt from the report. The full report contains call stacks that result in calling those functions. And 99% of that memory stems from
The report also contains more detailed breakdown for each of those. One example:
This is a summary the tool provides:
Now look at the corresponding values from the 1.0.1 for the same duration of the app execution:
And the summary:
@bartonjs any ideas? It looks pretty weird, but the results are 100% reproducible and stable. |
@bartonjs so I have just tested the app with version 1.0.1 running under lldb and I have found something I didn't expect. On 1.0.1 it doesn't call CryptoNative_PemReadBioX509Crl at all! I have set a breakpoint on CryptoNative_PemReadBioX509Crl, and I can see it being resolved correctly, but it is never hit. |
https://github.com/dotnet/corefx/commits/master/src/System.Security.Cryptography.X509Certificates/src/Internal/Cryptography/Pal.Unix/CertificateAssetDownloader.cs hasn't changed since 1.0.0; so... something else changed. The effective flow:
Candidates:
Even with parts of X509Certificates that changed between 1.0 and 1.1, I can't see any that would affect this. And spot checking in other libraries hasn't turned up anything, either. |
@bartonjs - I have kept repeating the tests for 1.0.1 and 1.1.0 many times, each time deleting the obj and bin folders, modifying the project to refer to the specific version, doing dotnet restore and dotnet build. The behavior is always the same. I don't have the ~/.dotnet/corefx/cryptography/x509/crls, but I have ~/.dotnet/corefx/cryptography/crls. In there, I have three files, the newest one which probably is the one the test was using has date of 12/19 and it is large - about 110MB. |
Using gdb (so I could set follow-fork-mode child while using Scenario: Setup (approximate):
RES analysis:
X509_CRL_free log:
So we created and kept 0x8da068 and 0xbee048. We created and discarded 0x8a4cc8 and 0xf15d18. So really we'd need to know where (full stack trace, I guess) memory was allocated from the time that function was called until it ended (each time), while crossing off anything that was later freed. Then we could determine if OpenSSL has a memory leak, or if we were just using it poorly. |
@bartonjs for a simple test case, the allocation tracking tool memleax would likely find the leaks with their call stacks easily. Could you please provide me with the source of your testing app so that I can give it a try? At the same time, I can try to pass it the large 100MB CRL I have found in ~/.dotnet/corefx/cryptography/crls and see if the memory allocation corresponds to what I am seeing with the app from @agoretsky. But the really strange thing is why CryptoNative_PemReadBioX509Crl is not called with 1.0.1. Can you advise where I can set a breakpoint in the managed code where the flow you have described begins, so that I can step through the code to see why it didn't try to call CryptoNative_PemReadBioX509Crl? |
@janvorli System.Security.Cryptography.X509Certificates' Internal.Cryptography.Pal.OpenSslX509ChainProcessor.BuildChain If that doesn't get called at all on 1.0 that would be "the reason" (if this is using System.Net.Http, then System.Net.Http's System.Net.Http.CurlHandler.SslProvider.VerifyCertChain might be skipping the call to X509Chain.Build() if it can do a quicker test directly with native code). https://github.com/bartonjs/SampleProjects/blob/master/PrintChain/PrintChain.cs is the source, but it takes the certificate as input (not the CRL). If you've captured the host (and port) it was talking to you can grab the certificate via |
@bartonjs I have found the 1.0.1 version doesn't call the
|
I've forgotten to add that it calls the |
When running on 1.1.0, the |
The default has always been Online; so whoever requested the chain build used to set NoCheck, and now is doing something different. So... now we go to whoever built the X509Chain... or, at least, whoever called Build on it. |
I have investigated the call stacks leading to the In the first path common to both, the revocation mode stems from SslStream.AuthenticateAsClientAsync that is called directly by the @agoretsky's app with the
In the 1.1.0 case, the additional code path has a call to Here is the relevant part of the call stack at the
In 1.0.1, the call stack from BeginAuthenticateAsClient looks the same up to the Interop+OpenSsl.AllocateSslContext. But TLSCertificateExtensions.BuildNewChain is never called. And here is why. The call was added in 1.1.0: So that's the culprit. I am no SSL expert, so I cannot say if it is a bug to call TLSCertificateExtensions.BuildNewChain when AuthenticateAsClientAsync is called by the user's app with the Anyway, there is still a remaining question of why building the cert chain with the revocation mode set to "online" consumes that much memory in this specific case (I'll try the simple test from @bartonjs to see if the cert is just large and causes that memory consumption). And also why, on my machine, the memory consumption kept growing later, as if the certificate got loaded again and again and its memory was released only after a long time (since I was seeing that the memory consumption ended up being capped at about 6GB even after hundreds of requests, my guess is that it is not a leak) |
@bartonjs I have tried your testing app. The certificates that came from both the servers used by the app are fine, they didn't cause a significant memory consumption. GC.Collect();
GC.WaitForPendingFinalizers(); This is the output of your testing app:
I have tried to get some info on that CRL file using "openssl crl" command and I can confirm that it also consumed about 1GB of memory. So that's how much openssl really needs for the revocation list. But we should still be able to release that memory after we are done with it. It might be a bug in openssl though. @agoretsky can I share your cert / password with @bartonjs via private email so that he can try it locally? |
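The GC.Collect() / GC.WaitForPendingFinalizers() pair above is the usual way to rule out "managed objects simply not collected yet" before blaming native code. A minimal sketch of that check follows; `MemoryProbe` is a hypothetical helper name, and it assumes that `Process.WorkingSet64` reports the resident set on the platform in question.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper: compares the managed heap size with the resident set.
public static class MemoryProbe
{
    // Forces a full collection so that unreachable managed objects, and any
    // native handles released by their finalizers, no longer count as in use.
    public static void ForceFullCollection()
    {
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect(); // reclaim objects that were only waiting for finalization
    }

    // Returns { managedBytes, residentBytes } measured after a forced collection.
    public static long[] Snapshot()
    {
        ForceFullCollection();
        long managed = GC.GetTotalMemory(false);
        using (Process self = Process.GetCurrentProcess())
        {
            return new[] { managed, self.WorkingSet64 };
        }
    }
}
```

If the resident figure stays around 1GB while the managed figure is a small fraction of it, the difference is held by native allocations (in this thread, OpenSSL's parsed CRL) rather than by the GC heap.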
@janvorli, yes, you can share it via private email with anybody who may help |
Suggestion for a short term workaround is more than welcome.. |
@agoretsky In your example, if you pass the client certs in AuthenticateAsClientAsync instead of using the LocalCertificateCallback (so make the callback null), does that work around your problem? The callback version has to build the chain to present it to you, and that builds it with revocation checked. (The memory leak is still seemingly there, and bad, but maybe you can avoid the code) |
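A rough sketch of what that suggestion looks like in code, for reference. `DirectClientCert` is a hypothetical helper name; TLS 1.2 and the `false` revocation flag are assumptions made for the example, not values taken from the repro app.

```csharp
using System;
using System.IO;
using System.Net.Security;
using System.Security.Authentication;
using System.Security.Cryptography.X509Certificates;
using System.Threading.Tasks;

// Hypothetical helper: hands the client certificate to SslStream directly
// instead of supplying it through a LocalCertificateSelectionCallback.
public static class DirectClientCert
{
    // Wraps an already connected stream. Both callbacks are null, so SslStream
    // has no selection callback to build a chain for.
    public static SslStream Wrap(Stream inner)
    {
        return new SslStream(
            inner,
            false,  // leave inner stream open: no
            null,   // remote certificate validation callback (default validation)
            null);  // local certificate selection callback (none)
    }

    // Starts the client handshake with the certificate passed in the collection.
    public static Task AuthenticateAsync(SslStream ssl, string host, X509Certificate2 clientCert)
    {
        var certs = new X509CertificateCollection { clientCert };
        return ssl.AuthenticateAsClientAsync(
            host,
            certs,
            SslProtocols.Tls12, // assumption: the protocol is not stated in the thread
            false);             // assumption: no revocation check requested by the app
    }
}
```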
Changed:
To: Result is still same: Main process exited, code=killed, status=9/KILL |
Okay, I don't think there's a "leak", just... an unfortunate interaction of Apple's big CRL and Taking my sample program and replacing main with

    public static void Main(string[] args)
    {
        for (int i = 0; i < 10; i++)
        {
            RealMain(args);
            Console.WriteLine($"=== End pass {i+1}");
        }

        Console.WriteLine("Press enter to exit...");
        Console.ReadLine();
    }

does not appreciably change the final RES result from it running only once. The 900MB RES value is because I couldn't run valgrind on a corerun-launched process; but I did make a native version of the cert chain builder. It peaked RES at ~998MB, and after freeing everything was still at ~903MB. During a 10x internal run it kept bouncing between those two numbers. valgrind (when it was only 1x) says that it didn't appreciably leak; so the There's still the "why is this building a revocation-checked chain now and wasn't in 1.0?" question, and it looks like the fix for https://github.com/dotnet/corefx/issues/9142 builds the chain with revocation checked, which isn't necessary... so that just needs to be changed to build that chain with RevocationMode.NoCheck (since it cares only about pathing). |
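To illustrate what "build that chain with RevocationMode.NoCheck" means at the public API level, here is a minimal sketch. It is not the actual corefx change; `PathOnlyChain` is a hypothetical helper name.

```csharp
using System;
using System.Security.Cryptography.X509Certificates;

// Hypothetical helper: builds a chain for pathing only, with no CRL involved.
public static class PathOnlyChain
{
    // Creates a chain whose policy skips revocation checking, so Build() has
    // no reason to download or parse a CRL.
    public static X509Chain Create()
    {
        var chain = new X509Chain();
        chain.ChainPolicy.RevocationMode = X509RevocationMode.NoCheck;
        return chain;
    }

    // Returns the number of certificates in the path built for the leaf
    // (the leaf itself included).
    public static int CountPathElements(X509Certificate2 leaf)
    {
        using (X509Chain chain = Create())
        {
            chain.Build(leaf); // result ignored: only the path is of interest
            return chain.ChainElements.Count;
        }
    }
}
```

Without the explicit assignment, a new X509Chain uses X509RevocationMode.Online, which is what pulls the large CRL into memory in this scenario.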
One additional detail to the original comment from @agoretsky is that establishing an SslStream with Apple certificates doesn't work on 1.0.0, 1.0.1, 1.0.3, or 1.1.0. The results are the same in all 4 versions. The original comment mentions that the upgrade from 1.0.1 to 1.1.0 triggers this, but I am not able to confirm that 1.0.1 works. |
Fixed (for .NET Core 2.0) in dotnet/corefx#15432. |
Hi there!
After upgrading our project to .NET Core 1.1 we have run into a memory leak problem on Linux. I can't reproduce it on Windows, so I can't profile it :(
I found the code that causes this problem and published it here: https://github.com/agoretsky/memory_leak
To reproduce, you will need to provide a .p12 certificate file path and password in the Main method.
Steps to reproduce problem: