Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SslStream tests: timeout #15692

Closed
stephentoub opened this issue Nov 12, 2015 · 56 comments · Fixed by dotnet/corefx#7937
Closed

SslStream tests: timeout #15692

stephentoub opened this issue Nov 12, 2015 · 56 comments · Fixed by dotnet/corefx#7937
Assignees
Labels
area-System.Net test-bug Problem in test source code (most likely)
Milestone

Comments

@stephentoub
Copy link
Member

http://dotnet-ci.cloudapp.net/job/dotnet_corefx_windows_release_prtest/6135/console

System.Net.Security.Tests.SslStreamStreamToStreamTest.SslStream_StreamToStream_Authentication_Success [FAIL]
        Handshake completed in the allotted time
        Expected: True
        Actual:   False
        Stack Trace:
           d:\j\workspace\dotnet_corefx_windows_release_prtest\src\System.Net.Security\tests\FunctionalTests\SslStreamStreamToStreamTest.cs(36,0): at System.Net.Security.Tests.SslStreamStreamToStreamTest.SslStream_StreamToStream_Authentication_Success()
     System.Net.Security.Tests.CertificateValidationClientServer.CertificateValidationClientServer_EndToEnd_Ok [FAIL]
        Client/Server Authentication timed out.
        Expected: True
        Actual:   False
        Stack Trace:
           d:\j\workspace\dotnet_corefx_windows_release_prtest\src\System.Net.Security\tests\FunctionalTests\CertificateValidationClientServer.cs(80,0): at System.Net.Security.Tests.CertificateValidationClientServer.<CertificateValidationClientServer_EndToEnd_Ok>d__0.MoveNext()
           --- End of stack trace from previous location where exception was thrown ---
              at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
              at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
           --- End of stack trace from previous location where exception was thrown ---
              at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
              at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
           --- End of stack trace from previous location where exception was thrown ---
              at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
              at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
     System.Net.Security.Tests.ClientAsyncAuthenticateTest.ClientAsyncAuthenticate_EachProtocol_Success [FAIL]
        Timed Out
        Expected: True
        Actual:   False
        Stack Trace:
           d:\j\workspace\dotnet_corefx_windows_release_prtest\src\System.Net.Security\tests\FunctionalTests\ClientAsyncAuthenticateTest.cs(252,0): at System.Net.Security.Tests.ClientAsyncAuthenticateTest.<ClientAsyncSslHelper>d__5c.MoveNext()
           --- End of stack trace from previous location where exception was thrown ---
              at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
              at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
           d:\j\workspace\dotnet_corefx_windows_release_prtest\src\System.Net.Security\tests\FunctionalTests\ClientAsyncAuthenticateTest.cs(108,0): at System.Net.Security.Tests.ClientAsyncAuthenticateTest.<ClientAsyncAuthenticate_EachProtocol_Success>d__16.MoveNext()
           --- End of stack trace from previous location where exception was thrown ---
              at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
              at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
           --- End of stack trace from previous location where exception was thrown ---
              at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
              at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
           --- End of stack trace from previous location where exception was thrown ---
              at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
              at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
@CIPop
Copy link
Member

CIPop commented Nov 12, 2015

/cc @markwilkie
We're already at 10s for these tests. I'll increase to 15s but, unless this is a product bug, this is simply showing that the CI system boxes are not scaling for the number of concurrent tests that they need to run.

@CIPop
Copy link
Member

CIPop commented Nov 12, 2015

For reference: on my system, the entire System.Net.Security test suite executes in about 1 second.

@stephentoub
Copy link
Member Author

@CIPop, I expect you're right, but should we also consider the possibility that there's a race condition / deadlock somewhere and these timeouts actually represent a real product issue? Or is that not possible?

@CIPop
Copy link
Member

CIPop commented Nov 13, 2015

consider the possibility that there's a race condition

I'm not disregarding that possibility. If this happens after I increase the timeouts, I'll see if I can repro in my own environment or if I can add memory-dump generation code to the tests (i.e. a call to MiniDumpWriteDump ).

CIPop referenced this issue in CIPop/corefx Nov 13, 2015
APIs will throw NotSupportedException if SSLv2 or v3 is used. This
behavior is different from .NET Desktop.
Tests have been changed to pin this new behavior.
Increased passing-test timeouts to 15s.

Fix #3114 (partial), #4467
@stephentoub
Copy link
Member Author

Same three failed here: dotnet/corefx#4478 (comment)

@stephentoub
Copy link
Member Author

@stephentoub
Copy link
Member Author

@stephentoub
Copy link
Member Author

CIPop referenced this issue in CIPop/corefx Nov 16, 2015
APIs will throw NotSupportedException if SSLv2 or v3 is used. This
behavior is different from .NET Desktop.
Tests have been changed to pin this new behavior.
Increased passing-test timeouts to 15s.

Fix #3114 (partial), #4467
@CIPop
Copy link
Member

CIPop commented Nov 16, 2015

Increased the timeouts to 15s.

If this is still noticed we'll need to try to repro on a system that can be debugged and ETW can be collected. #15570 is preventing us from using the CI failures for investigation.

@CIPop
Copy link
Member

CIPop commented Nov 16, 2015

Please reopen this issue if the errors still show up in CI.

It would be interesting to understand if the timeouts decreased the frequency of this failure.

@CIPop CIPop closed this as completed Nov 16, 2015
@stephentoub
Copy link
Member Author

Increased the timeouts to 15s.

When? This has been reproing fairly regularly for the last few days, in fact I'd say the regularity has actually increased.

Please reopen this issue if the errors still show up in CI.

They're not going to as I disabled them yesterday in dotnet/corefx#4526 due to how many PRs they were taking out.

@stephentoub stephentoub reopened this Nov 16, 2015
@CIPop
Copy link
Member

CIPop commented Nov 16, 2015

The last commit referenced on this issue contained the changes:

CIPop referenced this issue from a commit in CIPop/corefx 4 hours ago

@CIPop Removing SSLv2 and SSLv3 support from System.Net.Security. … 1f4deea

I've checked that there is no reference from source-code to issue dotnet/corefx#4467.

@stephentoub
Copy link
Member Author

I see, you just checked it in and removed the active issues I added yesterday. Ok.

@mellinoe mellinoe reopened this Nov 19, 2015
@stephentoub
Copy link
Member Author

And again here:
http://dotnet-ci.cloudapp.net/job/dotnet_corefx_windows_debug_prtest/6315/console

System.Net.Security.Tests.SslStreamStreamToStreamTest.SslStream_StreamToStream_Authentication_Success [FAIL]
        Handshake completed in the allotted time
        Expected: True
        Actual:   False

@mellinoe
Copy link
Contributor

@stephentoub
Copy link
Member Author

Failed again here:
http://dotnet-ci.cloudapp.net/job/dotnet_corefx_windows_debug_prtest/6441/console

System.Net.Security.Tests.SslStreamStreamToStreamTest.SslStream_StreamToStream_Authentication_Success [FAIL]
        Handshake completed in the allotted time
        Expected: True
        Actual:   False
        Stack Trace:
           d:\j\workspace\dotnet_corefx_windows_debug_prtest\src\System.Net.Security\tests\FunctionalTests\SslStreamStreamToStreamTest.cs(35,0): at System.Net.Security.Tests.SslStreamStreamToStreamTest.SslStream_StreamToStream_Authentication_Success()
     System.Net.Security.Tests.CertificateValidationClientServer.CertificateValidationClientServer_EndToEnd_Ok [FAIL]
        Client/Server Authentication timed out.
        Expected: True
        Actual:   False
        Stack Trace:
           d:\j\workspace\dotnet_corefx_windows_debug_prtest\src\System.Net.Security\tests\FunctionalTests\CertificateValidationClientServer.cs(80,0): at System.Net.Security.Tests.CertificateValidationClientServer.<CertificateValidationClientServer_EndToEnd_Ok>d__0.MoveNext()
           --- End of stack trace from previous location where exception was thrown ---
              at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
              at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
           --- End of stack trace from previous location where exception was thrown ---
              at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
              at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
           --- End of stack trace from previous location where exception was thrown ---
              at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
              at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)

@stephentoub
Copy link
Member Author

@stephentoub
Copy link
Member Author

@stephentoub
Copy link
Member Author

@CIPop, failed again:
dotnet/corefx#4740 (comment)

We need to re-disable those tests.

@stephentoub
Copy link
Member Author

@CIPop
Copy link
Member

CIPop commented Dec 4, 2015

We need to re-disable those tests.

Agreed. Disabling some of the tests via dotnet/corefx#4821.

@CIPop
Copy link
Member

CIPop commented Dec 9, 2015

Moving to RTM: at this point, this is most likely a test-only issue.
Further investigation based on CI failures is not possible (tracking the engineering enhancement with #15570).

I can only confirm by setting up an environment similar to the CI system and repro with logging enabled.

@stephentoub
Copy link
Member Author

http://dotnet-ci.cloudapp.net/job/dotnet_corefx/job/outerloop_opensuse13.2_release_tst/33/consoleFull

15:55:57    System.Net.Security.Tests.SslStreamStreamToStreamTest.SslStream_StreamToStream_LargeWrites_Sync_Success(randomizedData: False) [FAIL]
15:55:57       Handshake complete
15:55:57       Expected: True
15:55:57       Actual:   False
15:55:57       Stack Trace:
15:55:58             at System.Net.Security.Tests.SslStreamStreamToStreamTest.SslStream_StreamToStream_LargeWrites_Sync_Success(Boolean randomizedData)

@stephentoub
Copy link
Member Author

http://dotnet-ci.cloudapp.net/job/dotnet_corefx/job/outerloop_win10_release/137/consoleFull

12:21:07      System.Net.Security.Tests.SslStreamStreamToStreamTest.SslStream_StreamToStream_LargeWrites_Sync_Success(randomizedData: False) [FAIL]
12:21:07         Handshake complete
12:21:07         Expected: True
12:21:07         Actual:   False
12:21:07         Stack Trace:
12:21:07            d:\j\workspace\outerloop_win---b5478759\src\System.Net.Security\tests\FunctionalTests\SslStreamStreamToStreamTest.cs(107,0): at System.Net.Security.Tests.SslStreamStreamToStreamTest.SslStream_StreamToStream_LargeWrites_Sync_Success(Boolean randomizedData)
12:21:07      System.Net.Security.Tests.ClientAsyncAuthenticateTest.ClientAsyncAuthenticate_AllServerVsIndividualClientSupportedProtocols_Success(clientProtocol: Tls) [FAIL]
12:21:07         System.TimeoutException : The operation has timed out.
12:21:07         Stack Trace:
12:21:07            d:\j\workspace\outerloop_win---b5478759\src\Common\tests\System\Threading\Tasks\TaskTimeoutExtensions.cs(23,0): at System.Threading.Tasks.TaskTimeoutExtensions.<TimeoutAfter>d__0.MoveNext()
12:21:07   Discovered:  System.Net.WebSockets.Client.Tests
12:21:07            --- End of stack trace from previous location where exception was thrown ---
12:21:07               at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
12:21:07               at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
12:21:07            d:\j\workspace\outerloop_win---b5478759\src\System.Net.Security\tests\FunctionalTests\ClientAsyncAuthenticateTest.cs(142,0): at System.Net.Security.Tests.ClientAsyncAuthenticateTest.<ClientAsyncSslHelper>d__14.MoveNext()
12:21:07            --- End of stack trace from previous location where exception was thrown ---
12:21:07               at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
12:21:07               at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
12:21:07            d:\j\workspace\outerloop_win---b5478759\src\System.Net.Security\tests\FunctionalTests\ClientAsyncAuthenticateTest.cs(90,0): at System.Net.Security.Tests.ClientAsyncAuthenticateTest.<ClientAsyncAuthenticate_AllServerVsIndividualClientSupportedProtocols_Success>d__9.MoveNext()
12:21:07            --- End of stack trace from previous location where exception was thrown ---
12:21:07               at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
12:21:07               at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
12:21:07            --- End of stack trace from previous location where exception was thrown ---
12:21:07               at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
12:21:07               at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
12:21:07   Starting:    System.Net.WebSockets.Client.Tests
12:21:07            --- End of stack trace from previous location where exception was thrown ---
12:21:07               at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
12:21:07               at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)

@stephentoub
Copy link
Member Author

@stephentoub
Copy link
Member Author

@stephentoub
Copy link
Member Author

@stephentoub
Copy link
Member Author

@stephentoub
Copy link
Member Author

@stephentoub
Copy link
Member Author

@justinvp
Copy link
Contributor

justinvp commented Apr 5, 2016

@stephentoub
Copy link
Member Author

@stephentoub
Copy link
Member Author

@stephentoub
Copy link
Member Author

@justinvp
Copy link
Contributor

@stephentoub
Copy link
Member Author

@CIPop
Copy link
Member

CIPop commented Apr 14, 2016

I was able to reproduce these in my controlled environment, running only this test suite. The repro rates are:
30% with 10 parallel executions on a 2core/1GB RAM VM running Win10.
5% with 100 parallel executions on a 12core/16GB RAM machine running Win10 preview.

Now working on instrumenting the code to get better understanding of what's wrong. At this point it doesn't appear that changing the timeout values affect the outcome.

@ericeil @stephentoub @davidsh For similar investigations, I'm working on a script that automates stress execution of any test. The current code is available https://github.com/CIPop/corefx/blob/pe/src/Common/tests/Scripts/ParallelTestExecution.ps1

@justinvp
Copy link
Contributor

@stephentoub
Copy link
Member Author

@davidsh
Copy link
Contributor

davidsh commented Apr 17, 2016

Just a thought...but maybe with PR dotnet/corefx#7800, these timeouts might go away?

@stephentoub
Copy link
Member Author

Just a thought...but maybe with PR dotnet/corefx#7800, these timeouts might go away?

Certainly possible. It's been merged. We'll see if we continue to hit occurrences of this.

@CIPop
Copy link
Member

CIPop commented Apr 25, 2016

After all fixes I was not able to repro this in my environment.

On machines with low CPU resources, a smaller timeout will cause this issue. If that's the case, we'll need to increase the current setting of 1 minute timeout for passing tests.
The current timeout has been verified to work on a 12 CPU box with 100 parallel tests and 2/8CPU VMs with 20 parallel tests.

@CIPop
Copy link
Member

CIPop commented Dec 6, 2016

@karelz PTAL

@karelz
Copy link
Member

karelz commented Dec 6, 2016

Long thread, what's the summary I should look at?
What I am interested:

  • What is the symptom?
  • What is the root-cause? (problematic pattern)
  • How common is the problematic pattern?
  • What is the solution?
  • How is it applicable to other tests? Can we detect them upfront or do we have to wit for them to fail?

@CIPop
Copy link
Member

CIPop commented Dec 6, 2016

What is the symptom?

SslStream tests were intermittently failing.

What is the root-cause? (problematic pattern)

The tests are not written to support multiple executions in parallel. We should write proper stress/perf tests for that purpose.
On a normal dev station this is not observed unless the machine is very busy. In CI, I've been told that it's common for the machine to be busy executing multiple requests on the same system.

See my last comment above:

On machines with low CPU resources, a smaller timeout will cause this issue. If that's the case, we'll need to increase the current setting of 1 minute timeout for passing tests.
The current timeout has been verified to work on a 12 CPU box with 100 parallel tests and 2/8CPU VMs with 20 parallel tests.

How common is the problematic pattern?

After increasing the timeout I haven't seen failures.

What is the solution?

This has been fixed since April 25th.
The fix at that time was to increase the timeout. Another potential fix was to run the tests on a machine with less CPU/network stress.

How is it applicable to other tests?

This is applicable to most intermittent network timeouts logged by the CI system.

Can we detect them upfront or do we have to wit for them to fail?

It is theoretically possible but I wouldn't expect contributors to model the CI system and come up with the right timeout.

@karelz
Copy link
Member

karelz commented Dec 7, 2016

SslStream tests were intermittently failing.

What kind of failure(s)? Timeouts? Something else?

The tests are not written to support multiple executions in parallel.

What does it mean? Do they have shared state? Or do they have hardcoded timeouts which are too short for 'busy machines'?

On a normal dev station this is not observed unless the machine is very busy.

And that's a problem of the test having too aggressive expectations, it should have some room for busy machines / slow VMs / etc. Not unreasonably large timeouts, but still some room to accommodate for differences.
What about a general timeout expression like 15s + 10 * (test_duration_on_fast_machine). The constants are of course tweakable.

How common is the problematic pattern?

I meant: How common is the problematic pattern (too aggressive timeout) in all tests? Is it one off? Rare? Common? Majority?

The fix at that time was to increase the timeout. Another potential fix was to run the tests on a machine with less CPU/network stress.

Reasonable timeout is the right fix. We should not force every developer / CI system to run on exclusive CPU/network. Again, we should be reasonable with the timeout allowances.

This is applicable to most intermittent network timeouts logged by the CI system.

Is there some smart/easy way how to discover all such tests? Or change them all at once?

It is theoretically possible but I wouldn't expect contributors to model the CI system and come up with the right timeout.

I don't follow. It should be us doing the sweep and "cleaning up" our tests, right? I wasn't suggesting enlisting contributors to do our dirty laundry :)

@CIPop
Copy link
Member

CIPop commented Dec 7, 2016

@karelz It's probably simpler if you go through the issue and the PR to answer most of your questions or we have an offline discussion to clarify more.
I'll try to summarize here: SslStream was failing with timeouts (as the title suggests). This is expected in any service/application running on a machine running out of resources (CPU, RAM, I/O), as it reaches the threshold where the overall performance will start to decay if more operations are queued. Network, Windows services (lsass for TLS in this case), CPU and RAM are the shared resources. Another resource is Azure itself which has, by SLA, 99.9% availability (the VM stability itself).

I don't follow. It should be us doing the sweep and "cleaning up" our tests, right? I wasn't suggesting enlisting contributors to do our dirty laundry :)

Definitely not: what I'm saying is quite the opposite. We can't expect developers that don't have access to CI machines or specs to write a test that runs just fine on their dev machine but will fail in CI running out of resources. It's hard to write and test code that scales correctly. Fixing the test code to run in parallel with multiple instances of itself in a heavily constrained CI system would basically enlist contributors to do the heavy lifting for each test, even if it's not a stress/perf test expected to reach the limits of the system.

Reasonable timeout is the right fix.

I agree (see my comment in #15381).

One problem is that the "CI tuning" process was iterative and involved me running the tests with different values over and over again in CI and using my simulation script (which itself had to be tuned to simulate the CI system on my dev box): dotnet/corefx#7937. Another problem is that as we're adding new tests, the CI system is more likely to run out of resources as several builds/tests are running in parallel on the same VM. Because of this, the system will become loaded in a different way which the likelihood of time-sensitive tests to exhibit different failure-modes.

The next step or an alternative to make the system even more tolerant to timeouts would be a retry mechanism for everything involving network (#3697). There are downsides here:

  • In certain cases, under high load, retrying will just add more pressure on the system. This can be solved by an exponential back-off algorithm often found in networking protocols although that would also need some performance tuning to run well on the CI system.
  • Adding the mechanism will require changing the code to a more complicated version that becomes harder to read and harder to diagnose. This goes back to the non-goal of making it harder for external devs to write tests.

@karelz
Copy link
Member

karelz commented Dec 7, 2016

we have an offline discussion to clarify more

Sounds good to me

This is expected in any service/application running on a machine running out of resources (CPU, RAM, I/O), as it reaches the threshold where the overall performance will start to decay if more operations are queued.

Yep, and that's exactly why timeouts should not be too tight. They should expect other normal operations on the system.

Another resource is Azure itself which has, by SLA, 99.9% availability (the VM stability itself).

That is relevant only for tests hitting Azure servers. Is that the case here -- that is vital information and the reason why we need proper write up on the failures and root-causes and not just links to GH issues where the information is scattered across large discussions.

One problem is that the "CI tuning" process was iterative and involved me running the tests with different values over and over again in CI and using my simulation script (which itself had to be tuned to simulate the CI system on my dev box): dotnet/corefx#7937. Another problem is that as we're adding new tests, the CI system is more likely to run out of resources as several builds/tests are running in parallel on the same VM. Because of this, the system will become loaded in a different way which the likelihood of time-sensitive tests to exhibit different failure-modes.

Why can't we just time the test on normal machine (or guess) and then apply something like "15s + 10 * (test_duration_on_fast_machine)" to determine the timeout?

more tolerant to timeouts would be a retry mechanism for everything involving network

That sounds like a lot of work and it pollutes the tests if I understand it correctly. I would start with cheaper solutions first (buffering timeouts).

@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the 1.0.0-rtm milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Jan 4, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Net test-bug Problem in test source code (most likely)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants