SslStream tests: timeout #15692

stephentoub · 2015-11-12T13:59:28Z

http://dotnet-ci.cloudapp.net/job/dotnet_corefx_windows_release_prtest/6135/console

System.Net.Security.Tests.SslStreamStreamToStreamTest.SslStream_StreamToStream_Authentication_Success [FAIL]
        Handshake completed in the allotted time
        Expected: True
        Actual:   False
        Stack Trace:
           d:\j\workspace\dotnet_corefx_windows_release_prtest\src\System.Net.Security\tests\FunctionalTests\SslStreamStreamToStreamTest.cs(36,0): at System.Net.Security.Tests.SslStreamStreamToStreamTest.SslStream_StreamToStream_Authentication_Success()
     System.Net.Security.Tests.CertificateValidationClientServer.CertificateValidationClientServer_EndToEnd_Ok [FAIL]
        Client/Server Authentication timed out.
        Expected: True
        Actual:   False
        Stack Trace:
           d:\j\workspace\dotnet_corefx_windows_release_prtest\src\System.Net.Security\tests\FunctionalTests\CertificateValidationClientServer.cs(80,0): at System.Net.Security.Tests.CertificateValidationClientServer.<CertificateValidationClientServer_EndToEnd_Ok>d__0.MoveNext()
           --- End of stack trace from previous location where exception was thrown ---
              at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
              at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
           --- End of stack trace from previous location where exception was thrown ---
              at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
              at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
           --- End of stack trace from previous location where exception was thrown ---
              at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
              at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
     System.Net.Security.Tests.ClientAsyncAuthenticateTest.ClientAsyncAuthenticate_EachProtocol_Success [FAIL]
        Timed Out
        Expected: True
        Actual:   False
        Stack Trace:
           d:\j\workspace\dotnet_corefx_windows_release_prtest\src\System.Net.Security\tests\FunctionalTests\ClientAsyncAuthenticateTest.cs(252,0): at System.Net.Security.Tests.ClientAsyncAuthenticateTest.<ClientAsyncSslHelper>d__5c.MoveNext()
           --- End of stack trace from previous location where exception was thrown ---
              at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
              at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
           d:\j\workspace\dotnet_corefx_windows_release_prtest\src\System.Net.Security\tests\FunctionalTests\ClientAsyncAuthenticateTest.cs(108,0): at System.Net.Security.Tests.ClientAsyncAuthenticateTest.<ClientAsyncAuthenticate_EachProtocol_Success>d__16.MoveNext()
           --- End of stack trace from previous location where exception was thrown ---
              at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
              at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
           --- End of stack trace from previous location where exception was thrown ---
              at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
              at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
           --- End of stack trace from previous location where exception was thrown ---
              at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
              at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)

The text was updated successfully, but these errors were encountered:

CIPop · 2015-11-12T23:08:14Z

/cc @markwilkie
We're already at 10s for these tests. I'll increase to 15s but, unless this is a product bug, this is simply showing that the CI system boxes are not scaling for the number of concurrent tests that they need to run.

CIPop · 2015-11-12T23:08:52Z

For reference: on my system, the entire System.Net.Security test suite executes in about 1 second.

stephentoub · 2015-11-12T23:41:37Z

@CIPop, I expect you're right, but should we also consider the possibility that there's a race condition / deadlock somewhere and these timeouts actually represent a real product issue? Or is that not possible?

CIPop · 2015-11-13T00:32:54Z

consider the possibility that there's a race condition

I'm not disregarding that possibility. If this happens after I increase the timeouts, I'll see if I can repro in my own environment or if I can add memory-dump generation code to the tests (i.e. a call to MiniDumpWriteDump ).

APIs will throw NotSupportedException if SSLv2 or v3 is used. This behavior is different from .NET Desktop. Tests have been changed to pin this new behavior. Increased passing-test timeouts to 15s. Fix #3114 (partial), #4467

stephentoub · 2015-11-13T03:15:06Z

Same three failed here: dotnet/corefx#4478 (comment)

stephentoub · 2015-11-13T04:26:22Z

And again here: http://dotnet-ci.cloudapp.net/job/dotnet_corefx_windows_release_prtest/6163/

stephentoub · 2015-11-15T02:17:17Z

Again here:
http://dotnet-ci.cloudapp.net/job/dotnet_corefx_windows_debug/1777/console

stephentoub · 2015-11-16T00:21:43Z

Again here:
http://dotnet-ci.cloudapp.net/job/dotnet_corefx_windows_release_prtest/6224/

APIs will throw NotSupportedException if SSLv2 or v3 is used. This behavior is different from .NET Desktop. Tests have been changed to pin this new behavior. Increased passing-test timeouts to 15s. Fix #3114 (partial), #4467

CIPop · 2015-11-16T21:00:07Z

Increased the timeouts to 15s.

If this is still noticed we'll need to try to repro on a system that can be debugged and ETW can be collected. #15570 is preventing us from using the CI failures for investigation.

CIPop · 2015-11-16T21:01:45Z

Please reopen this issue if the errors still show up in CI.

It would be interesting to understand if the timeouts decreased the frequency of this failure.

stephentoub · 2015-11-16T23:26:49Z

Increased the timeouts to 15s.

When? This has been reproing fairly regularly for the last few days, in fact I'd say the regularity has actually increased.

Please reopen this issue if the errors still show up in CI.

They're not going to as I disabled them yesterday in dotnet/corefx#4526 due to how many PRs they were taking out.

CIPop · 2015-11-16T23:39:33Z

The last commit referenced on this issue contained the changes:

CIPop referenced this issue from a commit in CIPop/corefx 4 hours ago

 @CIPop Removing SSLv2 and SSLv3 support from System.Net.Security. … 1f4deea

I've checked that there is no reference from source-code to issue dotnet/corefx#4467.

stephentoub · 2015-11-16T23:44:41Z

I see, you just checked it in and removed the active issues I added yesterday. Ok.

mellinoe · 2015-11-19T00:43:29Z

I just hit two of the issues in a recent run, with what seems to be the same exception, message, etc.

http://dotnet-ci.cloudapp.net/job/dotnet_corefx_windows_release_prtest/6300/

stephentoub · 2015-11-19T13:24:49Z

And again here:
http://dotnet-ci.cloudapp.net/job/dotnet_corefx_windows_debug_prtest/6315/console

System.Net.Security.Tests.SslStreamStreamToStreamTest.SslStream_StreamToStream_Authentication_Success [FAIL]
        Handshake completed in the allotted time
        Expected: True
        Actual:   False

mellinoe · 2015-11-20T00:58:09Z

Another instance:

http://dotnet-ci.cloudapp.net/job/dotnet_corefx_windows_release_prtest/6341/testReport/

stephentoub · 2015-11-25T00:40:56Z

Failed again here:
http://dotnet-ci.cloudapp.net/job/dotnet_corefx_windows_debug_prtest/6441/console

System.Net.Security.Tests.SslStreamStreamToStreamTest.SslStream_StreamToStream_Authentication_Success [FAIL]
        Handshake completed in the allotted time
        Expected: True
        Actual:   False
        Stack Trace:
           d:\j\workspace\dotnet_corefx_windows_debug_prtest\src\System.Net.Security\tests\FunctionalTests\SslStreamStreamToStreamTest.cs(35,0): at System.Net.Security.Tests.SslStreamStreamToStreamTest.SslStream_StreamToStream_Authentication_Success()
     System.Net.Security.Tests.CertificateValidationClientServer.CertificateValidationClientServer_EndToEnd_Ok [FAIL]
        Client/Server Authentication timed out.
        Expected: True
        Actual:   False
        Stack Trace:
           d:\j\workspace\dotnet_corefx_windows_debug_prtest\src\System.Net.Security\tests\FunctionalTests\CertificateValidationClientServer.cs(80,0): at System.Net.Security.Tests.CertificateValidationClientServer.<CertificateValidationClientServer_EndToEnd_Ok>d__0.MoveNext()
           --- End of stack trace from previous location where exception was thrown ---
              at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
              at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
           --- End of stack trace from previous location where exception was thrown ---
              at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
              at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
           --- End of stack trace from previous location where exception was thrown ---
              at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
              at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)

stephentoub · 2015-11-28T23:11:18Z

Failed again here:
http://dotnet-ci.cloudapp.net/job/dotnet_corefx_windows_debug_prtest/6486/

stephentoub · 2015-11-30T20:18:22Z

@CIPop, failed again here:
http://dotnet-ci.cloudapp.net/job/dotnet_corefx_windows_release_prtest/6493/console

stephentoub · 2015-12-01T00:38:30Z

@CIPop, failed again:
dotnet/corefx#4740 (comment)

We need to re-disable those tests.

stephentoub · 2015-12-04T05:02:05Z

And again here:
http://dotnet-ci.cloudapp.net/job/dotnet_corefx_windows_debug_prtest/6608/console

CIPop · 2015-12-04T07:22:04Z

We need to re-disable those tests.

Agreed. Disabling some of the tests via dotnet/corefx#4821.

CIPop · 2015-12-09T19:48:59Z

Moving to RTM: at this point, this is most likely a test-only issue.
Further investigation based on CI failures is not possible (tracking the engineering enhancement with #15570).

I can only confirm by setting up an environment similar to the CI system and repro with logging enabled.

stephentoub · 2016-03-28T01:03:49Z

http://dotnet-ci.cloudapp.net/job/dotnet_corefx/job/outerloop_opensuse13.2_release_tst/33/consoleFull

15:55:57    System.Net.Security.Tests.SslStreamStreamToStreamTest.SslStream_StreamToStream_LargeWrites_Sync_Success(randomizedData: False) [FAIL]
15:55:57       Handshake complete
15:55:57       Expected: True
15:55:57       Actual:   False
15:55:57       Stack Trace:
15:55:58             at System.Net.Security.Tests.SslStreamStreamToStreamTest.SslStream_StreamToStream_LargeWrites_Sync_Success(Boolean randomizedData)

stephentoub · 2016-03-28T01:06:10Z

http://dotnet-ci.cloudapp.net/job/dotnet_corefx/job/outerloop_win10_release/137/consoleFull

12:21:07      System.Net.Security.Tests.SslStreamStreamToStreamTest.SslStream_StreamToStream_LargeWrites_Sync_Success(randomizedData: False) [FAIL]
12:21:07         Handshake complete
12:21:07         Expected: True
12:21:07         Actual:   False
12:21:07         Stack Trace:
12:21:07            d:\j\workspace\outerloop_win---b5478759\src\System.Net.Security\tests\FunctionalTests\SslStreamStreamToStreamTest.cs(107,0): at System.Net.Security.Tests.SslStreamStreamToStreamTest.SslStream_StreamToStream_LargeWrites_Sync_Success(Boolean randomizedData)
12:21:07      System.Net.Security.Tests.ClientAsyncAuthenticateTest.ClientAsyncAuthenticate_AllServerVsIndividualClientSupportedProtocols_Success(clientProtocol: Tls) [FAIL]
12:21:07         System.TimeoutException : The operation has timed out.
12:21:07         Stack Trace:
12:21:07            d:\j\workspace\outerloop_win---b5478759\src\Common\tests\System\Threading\Tasks\TaskTimeoutExtensions.cs(23,0): at System.Threading.Tasks.TaskTimeoutExtensions.<TimeoutAfter>d__0.MoveNext()
12:21:07   Discovered:  System.Net.WebSockets.Client.Tests
12:21:07            --- End of stack trace from previous location where exception was thrown ---
12:21:07               at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
12:21:07               at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
12:21:07            d:\j\workspace\outerloop_win---b5478759\src\System.Net.Security\tests\FunctionalTests\ClientAsyncAuthenticateTest.cs(142,0): at System.Net.Security.Tests.ClientAsyncAuthenticateTest.<ClientAsyncSslHelper>d__14.MoveNext()
12:21:07            --- End of stack trace from previous location where exception was thrown ---
12:21:07               at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
12:21:07               at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
12:21:07            d:\j\workspace\outerloop_win---b5478759\src\System.Net.Security\tests\FunctionalTests\ClientAsyncAuthenticateTest.cs(90,0): at System.Net.Security.Tests.ClientAsyncAuthenticateTest.<ClientAsyncAuthenticate_AllServerVsIndividualClientSupportedProtocols_Success>d__9.MoveNext()
12:21:07            --- End of stack trace from previous location where exception was thrown ---
12:21:07               at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
12:21:07               at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
12:21:07            --- End of stack trace from previous location where exception was thrown ---
12:21:07               at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
12:21:07               at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
12:21:07   Starting:    System.Net.WebSockets.Client.Tests
12:21:07            --- End of stack trace from previous location where exception was thrown ---
12:21:07               at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
12:21:07               at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)

stephentoub · 2016-03-31T03:42:36Z

Again here:
http://dotnet-ci.cloudapp.net/job/dotnet_corefx/job/outerloop_windows_nt_release/134/consoleFull

stephentoub · 2016-03-31T11:39:03Z

Again here:
http://dotnet-ci.cloudapp.net/job/dotnet_corefx/job/windows_nt_debug_prtest/4298/

stephentoub · 2016-03-31T11:40:09Z

Again here:
http://dotnet-ci.cloudapp.net/job/dotnet_corefx/job/windows_nt_release_prtest/4291/

stephentoub · 2016-03-31T12:55:33Z

Again here:
http://dotnet-ci.cloudapp.net/job/dotnet_corefx/job/windows_nt_debug_prtest/4308/

stephentoub · 2016-04-02T21:44:22Z

Again here:
http://dotnet-ci.cloudapp.net/job/dotnet_corefx/job/outerloop_windows_nt_debug/146/consoleFull

stephentoub · 2016-04-03T16:31:50Z

Again here:
http://dotnet-ci.cloudapp.net/job/dotnet_corefx/job/windows_nt_release_prtest/4412/

justinvp · 2016-04-05T14:49:11Z

Again here:
http://dotnet-ci.cloudapp.net/job/dotnet_corefx/job/windows_nt_debug_prtest/4495/

stephentoub · 2016-04-06T12:02:01Z

Again here:
http://dotnet-ci.cloudapp.net/job/dotnet_corefx/job/windows_nt_release_prtest/4539/consoleFull

stephentoub · 2016-04-07T14:55:21Z

Again here:
http://dotnet-ci.cloudapp.net/job/dotnet_corefx/job/outerloop_windows_nt_debug/152/

stephentoub · 2016-04-12T13:38:30Z

Again here:
http://dotnet-ci.cloudapp.net/job/dotnet_corefx/job/windows_nt_release_prtest/4776/

justinvp · 2016-04-13T01:47:44Z

Again here:
http://dotnet-ci.cloudapp.net/job/dotnet_corefx/job/windows_nt_release_prtest/4835/

stephentoub · 2016-04-14T02:00:11Z

Again here:
http://dotnet-ci.cloudapp.net/job/dotnet_corefx/job/windows_nt_debug/1260/

CIPop · 2016-04-14T16:53:18Z

I was able to reproduce these in my controlled environment, running only this test suite. The repro rates are:
30% with 10 parallel executions on a 2core/1GB RAM VM running Win10.
5% with 100 parallel executions on a 12core/16GB RAM machine running Win10 preview.

Now working on instrumenting the code to get better understanding of what's wrong. At this point it doesn't appear that changing the timeout values affect the outcome.

@ericeil @stephentoub @davidsh For similar investigations, I'm working on a script that automates stress execution of any test. The current code is available https://github.com/CIPop/corefx/blob/pe/src/Common/tests/Scripts/ParallelTestExecution.ps1

justinvp · 2016-04-15T01:30:28Z

Again here:
http://dotnet-ci.cloudapp.net/job/dotnet_corefx/job/windows_nt_debug_prtest/5002/

stephentoub · 2016-04-17T21:34:37Z

Again here:
http://dotnet-ci.cloudapp.net/job/dotnet_corefx/job/master/job/outerloop_win7_debug/3/

davidsh · 2016-04-17T21:38:12Z

Just a thought...but maybe with PR dotnet/corefx#7800, these timeouts might go away?

stephentoub · 2016-04-18T01:12:50Z

Just a thought...but maybe with PR dotnet/corefx#7800, these timeouts might go away?

Certainly possible. It's been merged. We'll see if we continue to hit occurrences of this.

CIPop · 2016-04-25T17:09:43Z

After all fixes I was not able to repro this in my environment.

On machines with low CPU resources, a smaller timeout will cause this issue. If that's the case, we'll need to increase the current setting of 1 minute timeout for passing tests.
The current timeout has been verified to work on a 12 CPU box with 100 parallel tests and 2/8CPU VMs with 20 parallel tests.

CIPop · 2016-12-06T22:22:44Z

@karelz PTAL

karelz · 2016-12-06T22:57:12Z

Long thread, what's the summary I should look at?
What I am interested:

What is the symptom?
What is the root-cause? (problematic pattern)
How common is the problematic pattern?
What is the solution?
How is it applicable to other tests? Can we detect them upfront or do we have to wit for them to fail?

CIPop · 2016-12-06T23:39:07Z

What is the symptom?

SslStream tests were intermittently failing.

What is the root-cause? (problematic pattern)

The tests are not written to support multiple executions in parallel. We should write proper stress/perf tests for that purpose.
On a normal dev station this is not observed unless the machine is very busy. In CI, I've been told that it's common for the machine to be busy executing multiple requests on the same system.

See my last comment above:

On machines with low CPU resources, a smaller timeout will cause this issue. If that's the case, we'll need to increase the current setting of 1 minute timeout for passing tests.
The current timeout has been verified to work on a 12 CPU box with 100 parallel tests and 2/8CPU VMs with 20 parallel tests.

How common is the problematic pattern?

After increasing the timeout I haven't seen failures.

What is the solution?

This has been fixed since April 25th.
The fix at that time was to increase the timeout. Another potential fix was to run the tests on a machine with less CPU/network stress.

How is it applicable to other tests?

This is applicable to most intermittent network timeouts logged by the CI system.

Can we detect them upfront or do we have to wit for them to fail?

It is theoretically possible but I wouldn't expect contributors to model the CI system and come up with the right timeout.

karelz · 2016-12-07T00:31:53Z

SslStream tests were intermittently failing.

What kind of failure(s)? Timeouts? Something else?

The tests are not written to support multiple executions in parallel.

What does it mean? Do they have shared state? Or do they have hardcoded timeouts which are too short for 'busy machines'?

On a normal dev station this is not observed unless the machine is very busy.

And that's a problem of the test having too aggressive expectations, it should have some room for busy machines / slow VMs / etc. Not unreasonably large timeouts, but still some room to accommodate for differences.
What about a general timeout expression like 15s + 10 * (test_duration_on_fast_machine). The constants are of course tweakable.

How common is the problematic pattern?

I meant: How common is the problematic pattern (too aggressive timeout) in all tests? Is it one off? Rare? Common? Majority?

The fix at that time was to increase the timeout. Another potential fix was to run the tests on a machine with less CPU/network stress.

Reasonable timeout is the right fix. We should not force every developer / CI system to run on exclusive CPU/network. Again, we should be reasonable with the timeout allowances.

This is applicable to most intermittent network timeouts logged by the CI system.

Is there some smart/easy way how to discover all such tests? Or change them all at once?

It is theoretically possible but I wouldn't expect contributors to model the CI system and come up with the right timeout.

I don't follow. It should be us doing the sweep and "cleaning up" our tests, right? I wasn't suggesting enlisting contributors to do our dirty laundry :)

CIPop · 2016-12-07T18:41:34Z

@karelz It's probably simpler if you go through the issue and the PR to answer most of your questions or we have an offline discussion to clarify more.
I'll try to summarize here: SslStream was failing with timeouts (as the title suggests). This is expected in any service/application running on a machine running out of resources (CPU, RAM, I/O), as it reaches the threshold where the overall performance will start to decay if more operations are queued. Network, Windows services (lsass for TLS in this case), CPU and RAM are the shared resources. Another resource is Azure itself which has, by SLA, 99.9% availability (the VM stability itself).

I don't follow. It should be us doing the sweep and "cleaning up" our tests, right? I wasn't suggesting enlisting contributors to do our dirty laundry :)

Definitely not: what I'm saying is quite the opposite. We can't expect developers that don't have access to CI machines or specs to write a test that runs just fine on their dev machine but will fail in CI running out of resources. It's hard to write and test code that scales correctly. Fixing the test code to run in parallel with multiple instances of itself in a heavily constrained CI system would basically enlist contributors to do the heavy lifting for each test, even if it's not a stress/perf test expected to reach the limits of the system.

Reasonable timeout is the right fix.

I agree (see my comment in #15381).

One problem is that the "CI tuning" process was iterative and involved me running the tests with different values over and over again in CI and using my simulation script (which itself had to be tuned to simulate the CI system on my dev box): dotnet/corefx#7937. Another problem is that as we're adding new tests, the CI system is more likely to run out of resources as several builds/tests are running in parallel on the same VM. Because of this, the system will become loaded in a different way which the likelihood of time-sensitive tests to exhibit different failure-modes.

The next step or an alternative to make the system even more tolerant to timeouts would be a retry mechanism for everything involving network (#3697). There are downsides here:

In certain cases, under high load, retrying will just add more pressure on the system. This can be solved by an exponential back-off algorithm often found in networking protocols although that would also need some performance tuning to run well on the CI system.
Adding the mechanism will require changing the code to a more complicated version that becomes harder to read and harder to diagnose. This goes back to the non-goal of making it harder for external devs to write tests.

karelz · 2016-12-07T20:18:47Z

we have an offline discussion to clarify more

Sounds good to me

This is expected in any service/application running on a machine running out of resources (CPU, RAM, I/O), as it reaches the threshold where the overall performance will start to decay if more operations are queued.

Yep, and that's exactly why timeouts should not be too tight. They should expect other normal operations on the system.

Another resource is Azure itself which has, by SLA, 99.9% availability (the VM stability itself).

That is relevant only for tests hitting Azure servers. Is that the case here -- that is vital information and the reason why we need proper write up on the failures and root-causes and not just links to GH issues where the information is scattered across large discussions.

One problem is that the "CI tuning" process was iterative and involved me running the tests with different values over and over again in CI and using my simulation script (which itself had to be tuned to simulate the CI system on my dev box): dotnet/corefx#7937. Another problem is that as we're adding new tests, the CI system is more likely to run out of resources as several builds/tests are running in parallel on the same VM. Because of this, the system will become loaded in a different way which the likelihood of time-sensitive tests to exhibit different failure-modes.

Why can't we just time the test on normal machine (or guess) and then apply something like "15s + 10 * (test_duration_on_fast_machine)" to determine the timeout?

more tolerant to timeouts would be a retry mechanism for everything involving network

That sounds like a lot of work and it pollutes the tests if I understand it correctly. I would start with cheaper solutions first (buffering timeouts).

stephentoub assigned CIPop Nov 12, 2015

CIPop closed this as completed Nov 16, 2015

stephentoub reopened this Nov 16, 2015

stephentoub closed this as completed Nov 16, 2015

mellinoe reopened this Nov 19, 2015

CIPop closed this as completed in dotnet/corefx#7937 Apr 22, 2016

msftgits transferred this issue from dotnet/corefx Jan 31, 2020

msftgits added this to the 1.0.0-rtm milestone Jan 31, 2020

ghost locked as resolved and limited conversation to collaborators Jan 4, 2021

SslStream tests: timeout #15692

SslStream tests: timeout #15692

Comments

stephentoub commented Nov 12, 2015

CIPop commented Nov 12, 2015

CIPop commented Nov 12, 2015

stephentoub commented Nov 12, 2015

CIPop commented Nov 13, 2015

stephentoub commented Nov 13, 2015

stephentoub commented Nov 13, 2015

stephentoub commented Nov 15, 2015

stephentoub commented Nov 16, 2015

CIPop commented Nov 16, 2015

CIPop commented Nov 16, 2015

stephentoub commented Nov 16, 2015

CIPop commented Nov 16, 2015

stephentoub commented Nov 16, 2015

mellinoe commented Nov 19, 2015

stephentoub commented Nov 19, 2015

mellinoe commented Nov 20, 2015

stephentoub commented Nov 25, 2015

stephentoub commented Nov 28, 2015

stephentoub commented Nov 30, 2015

stephentoub commented Dec 1, 2015

stephentoub commented Dec 4, 2015

CIPop commented Dec 4, 2015

CIPop commented Dec 9, 2015

stephentoub commented Mar 28, 2016

stephentoub commented Mar 28, 2016

stephentoub commented Mar 31, 2016

stephentoub commented Mar 31, 2016

stephentoub commented Mar 31, 2016

stephentoub commented Mar 31, 2016

stephentoub commented Apr 2, 2016

stephentoub commented Apr 3, 2016

justinvp commented Apr 5, 2016

stephentoub commented Apr 6, 2016

stephentoub commented Apr 7, 2016

stephentoub commented Apr 12, 2016

justinvp commented Apr 13, 2016

stephentoub commented Apr 14, 2016

CIPop commented Apr 14, 2016

justinvp commented Apr 15, 2016

stephentoub commented Apr 17, 2016

davidsh commented Apr 17, 2016

stephentoub commented Apr 18, 2016

CIPop commented Apr 25, 2016

CIPop commented Dec 6, 2016

karelz commented Dec 6, 2016

CIPop commented Dec 6, 2016

karelz commented Dec 7, 2016

CIPop commented Dec 7, 2016

karelz commented Dec 7, 2016