Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory Leak after upgrading to .NET Core 1.1 on Linux #7144

Closed
agoretsky opened this issue Dec 16, 2016 · 36 comments
Closed

Memory Leak after upgrading to .NET Core 1.1 on Linux #7144

agoretsky opened this issue Dec 16, 2016 · 36 comments
Milestone

Comments

@agoretsky
Copy link

Hi there!
After upgrading our project to .NET Core 1.1 we have faced with memory leak problem on Linux. I can't reproduce it on Windows, so I can't profile it :(
I found code which causes this problem and published it here: https://github.com/agoretsky/memory_leak
To reproduce you will need provide .p12 certificate file path and password in Main method.

Steps to reproduce problem:

  1. Take this repository https://github.com/agoretsky/memory_leak
  2. Provide path to .p12 certificate file path and password in Program.cs Main method
  3. Run it on Linux
  4. See consumed memory using top (I see about 100MB)
  5. Change Microsoft.NETCore.App to 1.1.0 and netcoreapp1.0 to netcoreapp1.1 in project.json
  6. Run it on Linux again
  7. See consumed memory using top (I see about 1GB)
@benaadams
Copy link
Member

/cc @CIPop @bartonjs

@janvorli
Copy link
Member

@agoretsky which memory column are you referring to? If it is VIRT, it is likely ok and there is no leak. If it was RES, it would be a problem. The virtual memory used by the application doesn't reflect the physical memory usage. We allocate upto 2GB of virtual memory range on x64 as a range in which we then map physical memory as needed. The virtual memory address space is private to each process and each process has about 128 terrabytes of virtual address space available.

@agoretsky
Copy link
Author

@janvorli RES

@janvorli
Copy link
Member

@agoretsky ok, I'll take a look.

@janvorli
Copy link
Member

@agoretsky I have tried to run your app locally and I am not seeing that memory consumption:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
56673 janvorli  20   0 3216148  35004  18468 R 100,0  0,1   0:09.34 dotnet

I am getting exactly the same values for 1.1.0 and 1.0.1. I have verified that I am really running the 1.1.0 bits by attaching lldb to the process and running "target modules list" command. The excerpt from the output confirms I am:
[ 11] 669633CA-4B45-8254-3E88-5C612C2D2612-21A6D2F0 /home/janvorli/dotnet/shared/Microsoft.NETCore.App/1.1.0/libcoreclr.so

@agoretsky
Copy link
Author

agoretsky commented Dec 16, 2016

@janvorli,
After your last message I already was not at work and I've tried to repro it on my Mac:
I've created docker container on my Mac with provided example and I was not able to reproduce it, then I decided to repeat my work env where I've seen this problem, so I installed ubuntu using virtual box and tried to reproduce problem and I was not able to see it again..
Ok, I have checked out our full app and run it and there is also no problems, but at work it was growing very fast after deploy. I made one request to our app and start waiting, after some minutes it started growing slowly, but reached 1GB. I closed our app. Next I run my repro and it reached 1GB very fast, I rebooted ubuntu, ran repro again and it also immediately reached 1GB.
Then I left ubuntu and returned to docker and decided to add some lines to repro to try stimulate leaks:

        var deviceToken = "ca1e9059ac913d0eecbb2484cfbe2d42943b4e031899901f40bf878d940ef0fd";
        var notification = new ApnsNotification(deviceToken, "");
        for(int i=0;i<100;i++)
        {
            Thread.Sleep(100);

            applePushNotificationService.QueueApnsNotification(notification);
        }

        Console.ReadKey();

After running repro with these lines in docker container it started growing is not immediately and very slow, but also reached more than 1GB

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
42 test 20 0 4728.5m 1.611g 20.7m S 0.0 82.6 0:22.72 dotnet

docker logs output:

42
20:15:53.666 DEBUG 1 ChangeScale Scaled Changed to: 1
20:16:04.744 INFO 5 SendNotificationsBatch APNS-Client[1]: Sending Batch ID=1, Count=100
20:16:04.774 INFO 5 Connect APNS-Client[1]: Connecting (Batch ID=1)
20:29:48.147 ERROR 4 MoveNext APNS-CLIENT[1]: Send Batch Error: Batch ID=1 System.IO.IOException: Authentication failed because the remote party has closed the transport stream.
at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.PartialFrameCallback(AsyncProtocolRequest asyncRequest)
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Net.Security.SslState.InternalEndProcessAuthentication(LazyAsyncResult lazyResult)
at System.Net.Security.SslState.EndProcessAuthentication(IAsyncResult result)
at System.Net.Security.SslStream.EndAuthenticateAsClient(IAsyncResult asyncResult)
at System.Threading.Tasks.TaskFactory1.FromAsyncCoreLogic(IAsyncResult iar, Func2 endFunction, Action1 endAction, Task1 promise, Boolean requiresSynchronization)
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at MemoryLeak.PushNotifications.Apple.ApnsConnection.d__25.MoveNext() in /home/test/app/proj/PushNotifications/Apple/ApnsConnection.cs:line 451
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at MemoryLeak.PushNotifications.Apple.ApnsConnection.d__21.MoveNext() in /home/test/app/proj/PushNotifications/Apple/ApnsConnection.cs:line 152
20:29:48.470 INFO 4 MoveNext APNS-Client[1]: Sent Batch, waiting for possible response...
20:29:49.305 INFO 5 MoveNext APNS-Client[1]: Received -1 bytes response...
20:29:49.305 INFO 5 MoveNext APNS-Client[1]: Batch (ID=1) completed with no error response...
20:29:49.305 INFO 5 MoveNext APNS-Client[1]: Done Reading for Batch ID=1, reseting batch timer...

According to logs it can't connect to the server, I changed repro app to .NET Core 1.0.1 and got successfully connected logs:

ChangeScale Scaled Changed to: 1
00:20:30.602 INFO 6 SendNotificationsBatch APNS-Client[1]: Sending Batch ID=1, Count=100
00:20:30.622 INFO 6 Connect APNS-Client[1]: Connecting (Batch ID=1)
00:20:31.899 INFO 8 MoveNext APNS-Client[1]: Connected (Batch ID=1)

So, using .NET Core 1.1 repro app can't connect to the server and consumes a lot of memory, but with .NET Core 1.0.1 can

@janvorli
Copy link
Member

janvorli commented Dec 16, 2016

@agoretsky I have added your new lines of code and I can repro the same exception - but with both 1.0.1 and 1.1.0. I've again attached lldb to verify that I am really running with 1.0.1 runtime.

@janvorli
Copy link
Member

I am running it on a physical machine with Ubuntu 14.04

@bartonjs
Copy link
Member

My first thought was that we might have dropped a TLS version from being supported in 1.1, but that doesn't seem to be the case (and if you're seeing it on 1.0.1, Jan, would be even less likely).

Do we know what host:port is being talked to when the remote side drops the connection? openssl s_client is a pretty good way to test on Unix systems (and with -debug gives pretty good indications of why things are doing what they are).

@agoretsky
Copy link
Author

agoretsky commented Dec 17, 2016

I was on ubuntu 16.04.1 LTS and I pushed dockerfile which I also used to repro it

@agoretsky
Copy link
Author

agoretsky commented Dec 17, 2016

Now on ubuntu with net 1.1 I am seeing only "Connecting" at logs and growing memory, with 1.0.1 I am getting "Connected" after "Connecting" and everything is ok. I can send you my test apple certificate file and password if you email me

@janvorli
Copy link
Member

@agoretsky, @bartonjs - I can confirm that I can repro the problem. With the cert and password I have received from @agoretsky, the app compiled targetting 1.1.0 leaks and targetting 1.1.0 does not.

In both cases though, the output of the app is the same, I am not seeing any exceptions. But I think that could be caused by the fact that my Linux native box has 24GB of RAM. @agoretsky how much RAM does your machine have?
For each of the requests, I get:

06:31:45.695   INFO    5 SendNotificationsBatch APNS-Client[1]: Sending Batch ID=1, Count=100
06:31:45.718   INFO    5 Connect APNS-Client[1]: Connecting (Batch ID=1)
06:31:46.166   INFO   11 MoveNext APNS-Client[1]: Connected (Batch ID=1)
06:31:46.166   INFO    6 MoveNext APNS-Client[1]: Sent Batch, waiting for possible response...
06:31:46.331   INFO   11 MoveNext APNS-Client[1]: Data Available...
06:31:46.331   INFO   11 MoveNext APNS-Client[1]: Finished Read.
06:31:46.331   INFO   11 MoveNext APNS-Client[1]: Received 6 bytes response...
06:31:46.331   INFO   11 MoveNext APNS-Client[1]: Batch (ID=1) completed with error response...
06:31:46.331   INFO   11 MoveNext APNS-Client[1]: Failing Notification 1
06:31:46.336   INFO   11 Disconnect APNS-Client[1]: Disconnecting (Batch ID=1)
06:31:46.336   INFO   11 Disconnect APNS-Client[1]: Disconnected (Batch ID=1)

The memory consumption for 1.1.0 grows to about 2GB for me in about 30 seconds. In a couple of minutes, it has grown to about 6GB. Then it occassionally drops down - to a value between 2..5GB, grows up and it keeps cycling that way. I have looked at the memory map and used sos commands to see if it is related to our EE heaps (GC heap plus other heaps used by the CoreCLR runtime). The GC heap is about 230MB and the other heaps just couple of megabytes.
Looking at the process map, I can see that the huge amount of memory is allocated in the native C heap.

I have used Linux memleax tool (after fixing few issues in it that it had with the fact that we map non-elf files into memory as executables) to try to track down the source of the leaks. This tool attaches to a process,hooks to malloc / free and reports call stacks of all calls to malloc that was not freed within certain time interval. I've set it to 30 seconds and it seems we are leaking memory allocated in the PEM_ASN1_read_bio a lot. The allocated amount per each malloc is small (2..30 bytes), but the number of times it is allocates is huge, according to this loop.

The memleax reports 11 distinct call stacks, differentiating only in the native stack frames in the libcrypto.so. The call stack doesn't contain the functions in the pal_x509.cpp due to the fact that the calls to the libcrypto.so are compiled as tail calls and also the PEM_ASN1_read_bio ends up being a tail call. And the memleax obviously cannot understand the managed code frames. But since the last managed call frame before the native code stack in libcrypto.so is always the same, I believe all the leaks stem from the same interop call.

Unfortunately, I am unable to run the memleax on the app when it runs under lldb, so I will have to find out a way to identify the problematic caller.

I will update it here after I find it.

@agoretsky
Copy link
Author

@janvorli, there was 2GB for Ubuntu VM

@janvorli
Copy link
Member

Based on the stack traces from the memleax and some additional debugging, I believe that the leaked allocations stem from PemReadBioX509Crl.
I have also found that when I set a breakpoint at CryptoNative_PemReadBioX509Crl and when it is hit, step out to the caller, the allocated native memory grows by about 900MB.
@bartonjs do you have any idea why it would eat so much memory? Does it call some managed callbacks that could call some other SSL functions that would allocate?

@bartonjs
Copy link
Member

@janvorli Nope. Unless the CRL was simply that big. We would have preceded the call with CryptoNative_DecodeX509Crl (length is the second argument) if it was a new download; otherwise we'd call CryptoNative_BioNewFile with the path of the file to read as the first argument (to see its on-disk (Base64/PEM) size).

Is the CRL object leaking? Or something else?

pcrl->references should be 1 when CryptoNative_PemReadBioX509Crl returns, bump to 2 after we call CryptoNative_X509StoreAddCrl, then immediately drop back to 1 because we call CryptoNative_X509CrlDestroy. When we later call CryptoNative_X509StoreDestroy the reference count should go to 0 (resulting in associated data being freed and then it not being reliable to watch that memory location anymore 😄).

@janvorli
Copy link
Member

@bartonjs what I can see is that just executing the CryptoNative_PemReadBioX509Crl enlarges the consumed native memory by 900MB. I mean, right before the call the memory consumption is 900MB less than right after the call. Then the CryptoNative_X509CrlDestroy gets called, which doesn't change anything on the consumption side and then CryptoNative_X509StoreDestory gets called and it also doesn't change anything. That's why I was thinking that there might be some callbacks that are invoked while the CryptoNative_PemReadBioX509Crl is executing.
But it is possible that another thread was running at the same time and caused that memory to grow.
When I do the same thing with the same app built against 1.0.1, the memory consumption doesn't grow at all.

@janvorli
Copy link
Member

I have tried another memory allocation profiler tool to see where the allocations come from. This time it was the heaptrack tool (http://milianw.de/blog/heaptrack-a-heap-memory-profiler-for-linux). It generates somehow more detailed output with summaries on memory allocation. I have left it running for the duration of the first request in the batch. Here is an excerpt from the report. The full report contains call stacks that result in calling those functions. And 99% of that memory stems from PEM_ASN1_read_bio call

56742719 calls to allocation functions with 1.42GB peak consumption from
CRYPTO_malloc

8473 calls to allocation functions with 293.66MB peak consumption from
CRYPTO_realloc

The report also contains more detailed breakdown for each of those. One example:

82.70MB consumed over 1 calls from:
    asn1_enc_save
      in /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
    ASN1_item_ex_d2i
      in /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
    0x7fbad79a0b71
      in /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
    0x7fbad79a0db7
      in /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
    ASN1_item_ex_d2i
      in /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
    ASN1_item_d2i
      in /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
    PEM_ASN1_read_bio
      in /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
    0x7fba661c98a2
      in ??

This is a summary the tool provides:

bytes allocated in total (ignoring deallocations): 2.69GB (198.59MB/s)
calls to allocation functions: 56875716 (4206472/s)
temporary memory allocations: 4035692 (298475/s)
peak heap memory consumption: 1.28GB
peak RSS (including heaptrack overhead): 6.30MB

Now look at the corresponding values from the 1.0.1 for the same duration of the app execution:

343886 calls to allocation functions with 2.30MB peak consumption from
CRYPTO_malloc

10013 calls to allocation functions with 101.73KB peak consumption from
CRYPTO_realloc

And the summary:

bytes allocated in total (ignoring deallocations): 54.34MB (843.73KB/s)
calls to allocation functions: 470693 (7308/s)
temporary memory allocations: 20071 (311/s)
peak heap memory consumption: 6.72MB

@bartonjs any ideas? It looks pretty weird, but the results are 100% reproducible and stable.

@janvorli
Copy link
Member

@bartonjs so I have just tested the app with version 1.0.1 running under lldb and I have found something I haven't expected. On 1.0.1 it doesn't call the CryptoNative_PemReadBioX509Crl at all! I have set a breakpoint to CryptoNative_PemReadBioX509Crl, I can see it being resolved correctly, but it is never hit.
It is called on 1.1.0 only. And all the huge memory allocation stems from that call.

@bartonjs
Copy link
Member

https://github.com/dotnet/corefx/commits/master/src/System.Security.Cryptography.X509Certificates/src/Internal/Cryptography/Pal.Unix/CertificateAssetDownloader.cs hasn't changed since 1.0.0; so... something else changed.

The effective flow:

  • If chain.RevocationMode == NoCheck, return.
  • Calculate the path where the applicable CRL would be
  • If a file exists at that path
    • Call CryptoNative_PemReadBioX509Crl
    • If that succeeds, add it to the context, return.
  • If chain.RevocationMode == Offline, return.
  • Lookup up the first http (not https, not ldap) CRL Distribution Point (CDP) record
  • If none, return
  • Download the resource
  • If failed, return
  • Call CryptoNative_DecodeX509Crl
  • If succeeded, add it to the context and save it to disk as PEM, return
  • Call CryptoNative_PemReadBioX509Crl
  • If succeeded, add it to the context and save it to disk as PEM, return

Candidates:

  • You did the 1.0.0 test with an empty ~/.dotnet/corefx/cryptography/x509/crls directory, then ran it again with 1.1:
    • The file didn't exist, so it was downloaded
    • The download succeeded and it parsed as DER (binary), it got saved to disk
  • Somewhere in the calling context something changed from being in RevocationMode.NoCheck to ... a different mode.

Even with parts of X509Certificates that changed between 1.0 and 1.1, I can't see any that would affect this. And spot checking in other libraries hasn't turned up anything, either.

@janvorli
Copy link
Member

@bartonjs - I have kept repeating the tests for 1.0.1 and 1.1.0 many times, each time deleting the obj and bin folders, modifying the project to refer to the specific version, doing dotnet restore and dotnet build. The behavior is always the same.

I don't have the ~/.dotnet/corefx/cryptography/x509/crls, but I have ~/.dotnet/corefx/cryptography/crls. In there, I have three files, the newest one which probably is the one the test was using has date of 12/19 and it is large - about 110MB.

@bartonjs
Copy link
Member

Using gdb (so I could set fork-follow-mode child while using dotnet run... also commands is nice.)

Scenario:
I have a program that builds an X509Chain in Online mode. It disposes the certs after the chain is built, so I have an "I'm done" marker (an X509_free call that doesn't come from X509_STORE_free)

Setup (approximate):

(gdb) b CryptoNative_PemReadBioX509Crl
(gdb) b X509_CRL_free
(gdb) commands
Type commands for breakpoint(s) 2, one per line.
End with a line saying just "end".
>x ($rdi+24)
>end
(gdb) b X509_verify_cert
(gdb) b CryptoNative_X509StoreCtxGetChain
(gdb) b X509_STORE_free
(gdb) b X509_free
(gdb) commands
Type commands for breakpoint(s) 6, one per line.
End with a line saying just "end".
>bt 4
>end
  • Every time it gets into X509_CRL_free we'll print (crl->references) (($rdi+24)).
  • When we get into X509_free I check bt 4 to find out if we're still in X509_STORE_free; if not we're done.
  • X509_verify_cert marks when we are done with setup and moving into execution. Every address of a CRL we created should have been seen in X509_CRL_free (we create it, add it (which bumps the reference), then free it (which drops the reference)).
  • CryptoNative_X509StoreCtxGetChain marks when X509_verify_cert finished.
  • X509_STORE_free is a marker of us being done building the chain. This should result in all the CRL objects being freed.

RES analysis:

  • Before first call to CryptoNative_PemReadBioX509Crl: 40120
  • finish that call, now it's 47248 (1.6MB CRL file, contains 34132 entries)
  • At the call to X509_verify_cert: 47348
  • At CryptoNative_X509StoreCtxGetChain: 48752
  • At the X509_free not from X509_STORE_free: 47584

X509_CRL_free log:

0x8da068:       0x00000002
0xbee048:       0x00000002
0x8a4cc8:       0x00000001
0xf15d18:       0x00000002
0xf15d18:       0x00000001
# X509_verify_cert is called here
0x8da068:       0x00000002
0x8da068:       0x00000003
0x8da068:       0x00000002
0x18:   Cannot access memory at address 0x18
0x18:   Cannot access memory at address 0x18
0x18:   Cannot access memory at address 0x18
0xbee048:       0x00000002
0xbee048:       0x00000003
0xbee048:       0x00000002
0x18:   Cannot access memory at address 0x18
0x18:   Cannot access memory at address 0x18
0x18:   Cannot access memory at address 0x18
0xbee048:       0x00000002
0xbee048:       0x00000003
0xbee048:       0x00000002
0x18:   Cannot access memory at address 0x18
0x18:   Cannot access memory at address 0x18
0x18:   Cannot access memory at address 0x18
# X509_STORE_free is called here
0x8da068:       0x00000001
0xbee048:       0x00000001

So we created and kept 0x8da068 and 0xbee048. We created and discarded 0x8a4cc8 and 0xf15d18.
X509_STORE_free passes a lot of nullptr's to X509_CRL_free. But we freed everything we had.

So really we'd need to know where (full stack trace, I guess) memory was allocated from once that function was called to until it ended (each time) but let it cross off anything which was later freed.

Then we could determine if OpenSSL has a memory leak, or if we were just using it poorly.

@janvorli
Copy link
Member

@bartonjs for a simple test case, the allocation tracking tool memleax would likely find the leaks with their call stacks easily. Could you please provide me with the source of your testing app so that I can give it a try? At the same time, I can try to pass it the large 100MB crl I have found in ~/.dotnet/corefx/cryptography/crls and see if the memory allocation would correspond to what I am seeing with the app from @agoretsky.

But the really strange thing is why the CryptoNative_PemReadBioX509Crl is not called with 1.0.1. Can you give me an advice where can I set a breakpoint in the managed code where the flow you have described begins so that I can step through the code to see why it didn't try to call the CryptoNative_PemReadBioX509Crl ?

@bartonjs
Copy link
Member

@janvorli System.Security.Cryptography.X509Certificates' Internal.Cryptography.Pal.OpenSslX509ChainProcessor.BuildChain

If that doesn't get called at all on 1.0 that would be "the reason" (if this is using System.Net.Http, then System.Net.Http's System.Net.Http.CurlHandler.SslProvider.VerifyCertChain might be skipping the call to X509Chain.Build() if it can do a quicker test directly with native code).

https://github.com/bartonjs/SampleProjects/blob/master/PrintChain/PrintChain.cs is the source, but it takes the certificate as input (not the CRL).

If you've captured the host (and port) it was talking to you can grab the certificate via openssl s_client -connect www.example.org:5112 (then Ctrl+D to close the socket) and save the --- BEGIN CERTIFICATE --- ... --- END CERTIFICATE --- (including the dashes and such) to www.example.org.cer.

@janvorli
Copy link
Member

@bartonjs I have found the 1.0.1 version doesn't call the CryptoNative_PemReadBioX509Crl because the revocationMode is equal to X509RevocationMode.NoCheck. I can see the object X509Chain on the stack, the revocation mode comes from the X509ChainPolicy referenced by its _chainPolicy member:

(lldb) dumpobj 00007fff58036d50
Name:        System.Security.Cryptography.X509Certificates.X509Chain
MethodTable: 00007fff7d4fa950
EEClass:     00007fff7d4c7c08
Size:        56(0x38) bytes
File:        /home/janvorli/dotnet/shared/Microsoft.NETCore.App/1.0.1/System.Security.Cryptography.X509Certificates.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007fff7d4fb168  4000048        8 ...s.X509ChainPolicy  0 instance 00007fff58036db8 _chainPolicy
0000000000000000  4000049       10              SZARRAY  0 instance 0000000000000000 _lazyChainStatus
00007fff7d4faca8  400004a       18 ...ElementCollection  0 instance 00007fff58036f10 _chainElements
00007fff7d4fc640  400004b       20 ...phy.Pal.IChainPal  0 instance 0000000000000000 _pal
00007fff7cc54910  400004c       28        System.Object  0 instance 00007fff58036d88 _syncRoot

(lldb) dumpobj 00007fff58036db8
Name:        System.Security.Cryptography.X509Certificates.X509ChainPolicy
MethodTable: 00007fff7d4fb168
EEClass:     00007fff7d4c7e08
Size:        72(0x48) bytes
File:        /home/janvorli/dotnet/shared/Microsoft.NETCore.App/1.0.1/System.Security.Cryptography.X509Certificates.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007fff7d4fb2e8  4000053        8 ...phy.OidCollection  0 instance 00007fff58036e00 <ApplicationPolicy>k__BackingField
00007fff7d4fb2e8  4000054       10 ...phy.OidCollection  0 instance 00007fff58036e58 <CertificatePolicy>k__BackingField
00007fff7cc56b98  4000055       30      System.DateTime  1 instance 00007fff58036de8 <VerificationTime>k__BackingField
00007fff7cc55f78  4000056       38      System.TimeSpan  1 instance 00007fff58036df0 <UrlRetrievalTimeout>k__BackingField
00007fff7d4fa2f0  4000057       18 ...ficate2Collection  0 instance 00007fff58036e98 <ExtraStore>k__BackingField
00007fff7d480638  4000058       20         System.Int32  1 instance                0 _revocationMode
00007fff7d4f9d60  4000059       24         System.Int32  1 instance                2 _revocationFlag
00007fff7d480788  400005a       28         System.Int32  1 instance                0 _verificationFlags

@janvorli
Copy link
Member

I've forgotten to add that it calls the OpenSslX509ChainProcessor.BuildChain, it checks the revocationMode parameter and since it is X509RevocationMode.NoCheck, it sets lookupCrl to false

@janvorli
Copy link
Member

When running on 1.1.0, the revocationMode is X509RevocationMode.Online instead. Any idea why it would be different between 1.0.1 and 1.1.0?

@bartonjs
Copy link
Member

The default has always been Online; so whomever requested the chain build used to set NoCheck, and now is doing something different. So... now we go to whomever built the X509Chain... or, at least, whomever called Build on it.

@janvorli
Copy link
Member

I have investigated the call stacks leading to the OpenSslX509ChainProcessor.BuildChain and while there is just one path in 1.0.1, in 1.1.0, there are two.

In the first path common to both, the revocation mode stems from SslStream.AuthenticateAsClientAsync that is called directly by the @agoretsky;s app with the checkCertificateRevocation argument set to false. The part of the call stack that calls the BuildChain looks like this:

Internal.Cryptography.Pal.OpenSslX509ChainProcessor.BuildChain(System.Security.Cryptography.X509Certificates.X509Certificate2, System.Collections.Generic.HashSet`1<System.Security.Cryptography.X509Certificates.X509Certificate2>, System.Collections.Generic.HashSet`1<System.Security.Cryptography.X509Certificates.X509Certificate2>, System.Collections.Generic.HashSet`1<System.Security.Cryptography.X509Certificates.X509Certificate2>, System.Security.Cryptography.OidCollection, System.Security.Cryptography.OidCollection, System.Security.Cryptography.X509Certificates.X509RevocationMode, System.Security.Cryptography.X509Certificates.X509RevocationFlag, System.DateTime, System.TimeSpan ByRef) + 87
Internal.Cryptography.Pal.ChainPal.BuildChain(Boolean, Internal.Cryptography.ICertificatePal, System.Security.Cryptography.X509Certificates.X509Certificate2Collection, System.Security.Cryptography.OidCollection, System.Security.Cryptography.OidCollection, System.Security.Cryptography.X509Certificates.X509RevocationMode, System.Security.Cryptography.X509Certificates.X509RevocationFlag, System.DateTime, System.TimeSpan) + 315
System.Security.Cryptography.X509Certificates.X509Chain.Build(System.Security.Cryptography.X509Certificates.X509Certificate2) + 263
System.Net.Security.CertificateValidation.BuildChainAndVerifyProperties(System.Security.Cryptography.X509Certificates.X509Chain, System.Security.Cryptography.X509Certificates.X509Certificate2, Boolean, System.String) + 52
System.Net.Security.SecureChannel.VerifyRemoteCertificate(System.Net.Security.RemoteCertValidationCallback) + 567
System.Net.Security.SslState.CompleteHandshake() + 85
System.Net.Security.SslState.CheckCompletionBeforeNextReceive(System.Net.Security.ProtocolToken, System.Net.AsyncProtocolRequest) + 169
System.Net.Security.SslState.StartSendBlob(Byte[], Int32, System.Net.AsyncProtocolRequest) + 322
System.Net.Security.SslState.ProcessReceivedBlob(Byte[], Int32, System.Net.AsyncProtocolRequest) + 436
System.Net.Security.SslState.ReadFrameCallback(System.Net.AsyncProtocolRequest) + 148

In the 1.1.0 case, the additional code path has a call to TLSCertificateExtensions.BuildNewChain which creates the X509Chain with the revocation mode left at the default, that means "online".

Here is the relevant part of the call stack at the TLSCertificateExtensions.BuildNewChain in 1.1.0:

Internal.Cryptography.Pal.OpenSslX509ChainProcessor.BuildChain(System.Security.Cryptography.X509Certificates.X509Certificate2, System.Collections.Generic.HashSet`1<System.Security.Cryptography.X509Certificates.X509Certificate2>, System.Collections.Generic.HashSet`1<System.Security.Cryptography.X509Certificates.X509Certificate2>, System.Collections.Generic.HashSet`1<System.Security.Cryptography.X509Certificates.X509Certificate2>, System.Security.Cryptography.OidCollection, System.Security.Cryptography.OidCollection, System.Security.Cryptography.X509Certificates.X509RevocationMode, System.Security.Cryptography.X509Certificates.X509RevocationFlag, System.DateTime, System.TimeSpan ByRef)
Internal.Cryptography.Pal.ChainPal.BuildChain(Boolean, Internal.Cryptography.ICertificatePal, System.Security.Cryptography.X509Certificates.X509Certificate2Collection, System.Security.Cryptography.OidCollection, System.Security.Cryptography.OidCollection, System.Security.Cryptography.X509Certificates.X509RevocationMode, System.Security.Cryptography.X509Certificates.X509RevocationFlag, System.DateTime, System.TimeSpan)
System.Security.Cryptography.X509Certificates.X509Chain.Build(System.Security.Cryptography.X509Certificates.X509Certificate2)
System.Net.Http.TLSCertificateExtensions.BuildNewChain(System.Security.Cryptography.X509Certificates.X509Certificate2, Boolean)
Interop+OpenSsl.AllocateSslContext(System.Security.Authentication.SslProtocols, Microsoft.Win32.SafeHandles.SafeX509Handle, System.Security.Cryptography.SafeEvpPKeyHandle, System.Net.Security.EncryptionPolicy, Boolean, Boolean)
System.Net.Security.SafeDeleteSslContext..ctor(System.Net.Security.SafeFreeSslCredentials, Boolean, Boolean)
System.Net.SslStreamPal.HandshakeInternal(System.Net.Security.SafeFreeCredentials, System.Net.Security.SafeDeleteContext ByRef, System.Net.Security.SecurityBuffer, System.Net.Security.SecurityBuffer, Boolean, Boolean)
System.Net.Security.SecureChannel.GenerateToken(Byte[], Int32, Int32, Byte[] ByRef)
System.Net.Security.SecureChannel.NextMessage(Byte[], Int32, Int32)
System.Net.Security.SslState.StartSendBlob(Byte[], Int32, System.Net.AsyncProtocolRequest)
System.Net.Security.SslState.ForceAuthentication(Boolean, Byte[], System.Net.AsyncProtocolRequest)
System.Net.Security.SslState.ProcessAuthentication(System.Net.LazyAsyncResult)
System.Net.Security.SslStream.BeginAuthenticateAsClient(System.String, System.Security.Cryptography.X509Certificates.X509CertificateCollection, System.Security.Authentication.SslProtocols, Boolean, System.AsyncCallback, System.Object)

In 1.0.1, the call stack from BeginAuthenticateAsClient looks the same upto the Interop+OpenSsl.AllocateSslContext. But the TLSCertificateExtensions.BuildNewChain is never called. And here is why. The call was added in 1.1.0:
dotnet/corefx@02ea17c

So that's the culprit. I am no SSL expert, so I cannot say if it is a bug to call TLSCertificateExtensions.BuildNewChain when the AuthenticateAsClientAsync is called by the user's app with the checkCertificateRevocation argument set to false or if rather the former state in 1.0.1 was buggy and so the new behavior is a fix.

Anyways, there is still a remaining question why building the cert chain with the check for revocation mode set to "online" consumes that much memory in this specific case (I'll try the simple test from @bartonjs to see if the cert is just large and causes that memory consumption). And also why on my machine, the memory consumption kept growing later, as if the certificate got loaded again and again and its memory was getting released only after a long time (Since I was seeing that the memory consumption ended up being capped at about 6GB even after hundreds of requests, my guess is it is not a leak)

@janvorli
Copy link
Member

@bartonjs I have tried your testing app. The certificates that came from both the servers used by the app are fine, they didn't cause a significant memory consumption.
But the certificate that I have received from @agoretsky is the one that causes the trouble. The generated .crl for this certificate is the one that is larger than 100MB and when I run your app on the cert, it consumes about 1GB of memory.
Moreover, the consumed memory doesn't go away after the following lines in your app (I've added Console.ReadKey() before and after these lines for my testing)

            GC.Collect();
            GC.WaitForPendingFinalizers();

This is the output of your testing app:

Validating Certificate: Apple Push Services: work.ayyo.must

Chain is NOT valid

Chain Element Status:
  C=US, O=Ayyo S.a r.l, OU=QH8V6J5HSG, CN=Apple Push Services: work.ayyo.must, userId=work.ayyo.must
    PartialChain (unable to get local issuer certificate)
    RevocationStatusUnknown (unable to get certificate CRL)

Chain Summary:
    PartialChain (unable to get local issuer certificate)
    RevocationStatusUnknown (unable to get certificate CRL)

I have tried to get some info on that CRL file using "openssl crl" command and I can confirm that it also consumed about 1GB of memory. So that's how much openssl really needs for the revocation list. But we should still be able to release that memory after we are done with it. It might be a bug in openssl though.
Btw, the revocation list contains more than 2 million entries!

@agoretsky can I share your cert / password with @bartonjs via private email so that he can try it locally?

@agoretsky
Copy link
Author

@janvorli, yes, you can share it via private email with anybody who may help

@vvdimov
Copy link

vvdimov commented Jan 4, 2017

Suggestion for a short term workaround is more than welcome..

@bartonjs
Copy link
Member

bartonjs commented Jan 4, 2017

@agoretsky In your example, if you pass the client certs in AuthenticateAsClientAsync instead of using the LocalCertificateCallback (so make the callback null), does that work around your problem?

The callback version has to build the chain to present it to you, and that builds it with revocation checked. (The memory leak is still seemingly there, and bad, but maybe you can avoid the code)

@vvdimov
Copy link

vvdimov commented Jan 4, 2017

Changed:

 this.stream = new SslStream(
                    this.client.GetStream(), 
                    false,
                    this.ValidateRemoteCertificate,
                    (sender, targetHost, localCerts, remoteCert, acceptableIssuers) => this.certificate);

To:
stream = new SslStream(client.GetStream(), false, ValidateRemoteCertificate, null);

Result is still same: Main process exited, code=killed, status=9/KILL

@bartonjs
Copy link
Member

bartonjs commented Jan 5, 2017

Okay, I don't think there's a "leak", just... an unfortunate interaction of Apple's big CRL and malloc/free.

Taking my sample program and replacing main with

        public static void Main(string[] args)
        {
            for (int i = 0; i < 10; i++)
            {
                RealMain(args);
                Console.WriteLine($"=== End pass {i+1}");
            }

            Console.WriteLine("Press enter to exit...");
            Console.ReadLine();
        }

does not appreciably change the final RES result from it running only once. The 900MB RES value is because free seems to never return partially allocated pages to the OS. The oodles of CRL entries created lots of partial pages which subsequent malloc calls will reuse.

I couldn't run valgrind on a corerun-launched process; but I did make a native version of the cert chain builder. It peaked RES at ~998MB, and after freeing everything was still at ~903MB. During a 10x internal run it kept bouncing between those two numbers. valgrind (when it was only 1x) says that it didn't appreciably leak; so the mallocs and frees were balanced (it took ~6GB of RAM to prove this, though, and took a while).

There's still the "why is this building a revocation-checked chain now and wasn't in 1.0?" question, and it looks the fix for https://github.com/dotnet/corefx/issues/9142 builds the chain with revocation checked, which isn't necessary... so that just needs to be changed to build that chain with RevocationMode.NoCheck (since it cares only about pathing).

@vvdimov
Copy link

vvdimov commented Jan 5, 2017

One additional detail to the original comment from @agoretsky is that establishing SSLStream with Apple certificates don't work 1.0.0, 1.0.1, 1.0.3, 1.1.0. Same results in all 4 versions. The original comment mentions the upgrade from 1.0.1 to 1.1.0 triggers this, but I am not able to confirm that 1.0.1 works.

@bartonjs
Copy link
Member

Fixed (for .NET Core 2.0) in dotnet/corefx#15432.

@bartonjs bartonjs removed their assignment Jan 25, 2017
@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the 2.0.0 milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 26, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants