SslStream tests: timeout #15692
/cc @markwilkie
For reference: on my system, the entire System.Net.Security test suite executes in about 1 second.
@CIPop, I expect you're right, but should we also consider the possibility that there's a race condition / deadlock somewhere and these timeouts actually represent a real product issue? Or is that not possible?
I'm not disregarding that possibility. If this happens after I increase the timeouts, I'll see if I can repro in my own environment or if I can add memory-dump generation code to the tests (i.e. a call to MiniDumpWriteDump).
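For reference, capturing a dump from inside the test process generally means P/Invoking MiniDumpWriteDump from dbghelp.dll. A minimal sketch of that kind of instrumentation, with an illustrative wrapper (the type name, method name, and full-memory dump flag are choices made here, not code from the actual tests):

```csharp
using System;
using System.ComponentModel;
using System.Diagnostics;
using System.IO;
using System.Runtime.InteropServices;

internal static class MiniDump
{
    // dbghelp!MiniDumpWriteDump; 0x2 corresponds to MiniDumpWithFullMemory.
    [DllImport("dbghelp.dll", SetLastError = true)]
    private static extern bool MiniDumpWriteDump(
        IntPtr hProcess, uint processId, SafeHandle hFile, int dumpType,
        IntPtr exceptionParam, IntPtr userStreamParam, IntPtr callbackParam);

    // Writes a full-memory dump of the current process to the given path.
    public static void WriteFullDump(string path)
    {
        using (Process self = Process.GetCurrentProcess())
        using (FileStream fs = new FileStream(path, FileMode.Create, FileAccess.ReadWrite))
        {
            if (!MiniDumpWriteDump(self.Handle, (uint)self.Id, fs.SafeFileHandle,
                                   0x2 /* MiniDumpWithFullMemory */,
                                   IntPtr.Zero, IntPtr.Zero, IntPtr.Zero))
            {
                throw new Win32Exception(Marshal.GetLastWin32Error());
            }
        }
    }
}
```
A test could call MiniDump.WriteFullDump(...) from a timeout handler so a hang leaves an artifact that can be debugged offline.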
APIs will throw NotSupportedException if SSLv2 or v3 is used. This behavior is different from .NET Desktop. Tests have been changed to pin this new behavior. Increased passing-test timeouts to 15s. Fix #3114 (partial), #4467
Same three failed here: dotnet/corefx#4478 (comment)
Increased the timeouts to 15s. If this is still noticed, we'll need to try to repro on a system that can be debugged and where ETW can be collected. #15570 is preventing us from using the CI failures for investigation.
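To make the 15s figure concrete: in async test code, a passing-test timeout like this is typically enforced by racing the awaited work against a delay. A hedged sketch, with an illustrative helper name and the 15-second value from the change above (not the actual corefx helper):

```csharp
using System;
using System.Threading.Tasks;

internal static class TestTimeouts
{
    // Mirrors the 15-second passing-test timeout described above.
    public static readonly TimeSpan PassingTestTimeout = TimeSpan.FromSeconds(15);

    // Fails deterministically instead of hanging if the awaited work
    // (e.g. an SslStream handshake) never completes.
    public static async Task WithPassingTestTimeout(this Task task)
    {
        Task completed = await Task.WhenAny(task, Task.Delay(PassingTestTimeout));
        if (completed != task)
        {
            throw new TimeoutException($"Task did not complete within {PassingTestTimeout}.");
        }

        await task; // propagate exceptions from the original task
    }
}
```
Usage would look like `await Task.WhenAll(clientAuth, serverAuth).WithPassingTestTimeout();`, turning a hang into a clear timeout failure.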
Please reopen this issue if the errors still show up in CI. It would be interesting to understand whether the increased timeouts decreased the frequency of this failure.
When? This has been reproing fairly regularly for the last few days; in fact, I'd say the regularity has actually increased.
They're not going to, as I disabled them yesterday in dotnet/corefx#4526 due to how many PRs they were taking out.
The last commit referenced on this issue contained the changes:
I've checked that there is no reference from the source code to issue dotnet/corefx#4467.
I see, you just checked it in and removed the active issues I added yesterday. Ok.
I just hit two of the issues in a recent run, with what seems to be the same exception, message, etc. http://dotnet-ci.cloudapp.net/job/dotnet_corefx_windows_release_prtest/6300/
And again here:
Failed again here:
Failed again here:
@CIPop, failed again: We need to re-disable those tests.
Agreed. Disabling some of the tests via dotnet/corefx#4821.
Moving to RTM: at this point, this is most likely a test-only issue. I can only confirm by setting up an environment similar to the CI system and reproducing with logging enabled.
http://dotnet-ci.cloudapp.net/job/dotnet_corefx/job/outerloop_win10_release/137/consoleFull
I was able to reproduce these in my controlled environment, running only this test suite. The repro rates are:
Now working on instrumenting the code to get a better understanding of what's wrong. At this point it doesn't appear that changing the timeout values affects the outcome. @ericeil @stephentoub @davidsh
For similar investigations, I'm working on a script that automates stress execution of any test. The current code is available at https://github.com/CIPop/corefx/blob/pe/src/Common/tests/Scripts/ParallelTestExecution.ps1
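The script above is PowerShell; to illustrate the approach in the repo's main language, here is a hedged C# sketch of the same idea (launch several copies of a test command concurrently, repeat, and tally failing exit codes; the type and method names are made up):

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

internal static class ParallelTestRunner
{
    // Runs `parallelism` copies of the command concurrently, `iterations` times,
    // and returns how many runs exited with a non-zero code.
    public static async Task<int> RunAsync(string fileName, string arguments, int parallelism, int iterations)
    {
        int failures = 0;
        for (int i = 0; i < iterations; i++)
        {
            Task<int>[] runs = Enumerable.Range(0, parallelism)
                .Select(_ => Task.Run(() => RunOnce(fileName, arguments)))
                .ToArray();

            int[] exitCodes = await Task.WhenAll(runs);
            failures += exitCodes.Count(code => code != 0);
        }

        return failures;
    }

    private static int RunOnce(string fileName, string arguments)
    {
        using (Process p = Process.Start(new ProcessStartInfo(fileName, arguments) { UseShellExecute = false }))
        {
            p.WaitForExit();
            return p.ExitCode;
        }
    }
}
```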
Just a thought... but maybe with PR dotnet/corefx#7800, these timeouts might go away?
Certainly possible. It's been merged. We'll see if we continue to hit occurrences of this.
After all the fixes I was not able to repro this in my environment. On machines with low CPU resources, a smaller timeout will cause this issue; if that's the case, we'll need to increase the current 1-minute timeout for passing tests.
@karelz PTAL
Long thread, what's the summary I should look at?
SslStream tests were intermittently failing.
The tests are not written to support multiple executions in parallel. We should write proper stress/perf tests for that purpose. See my last comment above:
After increasing the timeout I haven't seen failures.
This has been fixed since April 25th.
This is applicable to most intermittent network timeouts logged by the CI system.
It is theoretically possible, but I wouldn't expect contributors to model the CI system and come up with the right timeout.
What kind of failure(s)? Timeouts? Something else?
What does it mean? Do they have shared state? Or do they have hardcoded timeouts which are too short for 'busy machines'?
And that's a problem of the test having too-aggressive expectations; it should have some room for busy machines / slow VMs / etc. Not unreasonably large timeouts, but still some room to accommodate differences.
I meant: How common is the problematic pattern (too-aggressive timeout) in all tests? Is it a one-off? Rare? Common? The majority?
A reasonable timeout is the right fix. We should not force every developer / CI system to run on exclusive CPU/network. Again, we should be reasonable with the timeout allowances.
Is there some smart/easy way to discover all such tests? Or to change them all at once?
I don't follow. It should be us doing the sweep and "cleaning up" our tests, right? I wasn't suggesting enlisting contributors to do our dirty laundry :)
@karelz It's probably simpler if you go through the issue and the PR to answer most of your questions, or we can have an offline discussion to clarify more.
Definitely not: what I'm saying is quite the opposite. We can't expect developers who don't have access to the CI machines or their specs to write a test that runs just fine on their dev machine but will fail in CI when it runs out of resources. It's hard to write and test code that scales correctly. Fixing the test code to run in parallel with multiple instances of itself in a heavily constrained CI system would basically enlist contributors to do the heavy lifting for each test, even if it's not a stress/perf test expected to reach the limits of the system.
I agree (see my comment in #15381). One problem is that the "CI tuning" process was iterative and involved me running the tests with different values over and over again in CI and using my simulation script (which itself had to be tuned to simulate the CI system on my dev box): dotnet/corefx#7937. Another problem is that as we're adding new tests, the CI system is more likely to run out of resources, as several builds/tests are running in parallel on the same VM. Because of this, the system will become loaded in a different way, which increases the likelihood that time-sensitive tests will exhibit different failure modes. The next step, or an alternative to make the system even more tolerant of timeouts, would be a retry mechanism for everything involving the network (#3697). There are downsides here:
Sounds good to me
Yep, and that's exactly why timeouts should not be too tight. They should allow for other normal operations on the system.
That is relevant only for tests hitting Azure servers. Is that the case here? That is vital information, and the reason why we need a proper write-up of the failures and root causes, not just links to GH issues where the information is scattered across large discussions.
Why can't we just time the test on a normal machine (or guess) and then apply something like "15s + 10 * (test_duration_on_fast_machine)" to determine the timeout? (A sketch of that heuristic follows below.)
That sounds like a lot of work and it pollutes the tests if I understand it correctly. I would start with cheaper solutions first (buffering timeouts).
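For what it's worth, the heuristic proposed above is cheap to express; a sketch using the example constants from that question (a 15-second base plus 10x the locally measured duration), which are illustrative rather than agreed-upon numbers:

```csharp
using System;

internal static class TimeoutHeuristic
{
    // 15s base + 10x the duration measured on a fast/idle machine.
    public static TimeSpan For(TimeSpan durationOnFastMachine) =>
        TimeSpan.FromSeconds(15) + TimeSpan.FromTicks(durationOnFastMachine.Ticks * 10);
}
```
For example, a test that takes about 1 second on an idle dev machine would get a 25-second CI timeout.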
http://dotnet-ci.cloudapp.net/job/dotnet_corefx_windows_release_prtest/6135/console