[Nix] adjust pytest retries #4558
Conversation
Compared commits 5ee7ea9 to dc33e28
@@ -243,6 +244,7 @@ def test_tls13_session_resumption_s2n_client(managed_process, cipher, curve, cert
     b'SSL_accept:SSLv3/TLS write certificate') == num_full_connections


+@pytest.mark.flaky(reruns=7, reruns_delay=2, condition=platform.machine().startswith("aarch"))
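The marker added in the diff comes from the pytest-rerunfailures plugin. A minimal sketch of how it is applied (the test body here is a placeholder, not from the PR): the `condition` keyword is evaluated once, so the extra reruns only take effect on aarch64 hosts while x86 runs get no reruns from this marker.

```python
import platform

import pytest


# Rerun a failing test up to 7 times with a 2-second pause, but only
# when the machine architecture starts with "aarch" (i.e. arm hosts).
@pytest.mark.flaky(reruns=7, reruns_delay=2,
                   condition=platform.machine().startswith("aarch"))
def test_example():
    # Placeholder body; the real tests exercise s2n handshakes.
    assert True
```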
What are the errors when these tests fail due to flakiness? It seems like the tests could be broken if they require 5 or 7 retries in order to succeed?
The current working theory is that this is the test framework's interaction with subprocesses. I bumped the timeout values from 5 seconds to 60, and then noticed that the retried test count exactly matches the number of tests running for 60 seconds. So these are being timed out by pytest; it's unclear why.
Interesting ok. I guess the concern is that if we make a change that causes one of these tests to become flaky, we likely won't notice it since they're retrying so many times. But I guess the extra retries only apply to arm, so as long as we continue running them in an environment with retries disabled we should notice a flaky regression.
To be fair, we're running these now on x86 with 2 retries, so there is no CI running these specific integrationv2 tests with 0 retries. I don't believe we're collecting retry metrics, which would be an interesting datapoint...
Resolved issues:
part of #3841
Description of changes:
Several tests were failing under Nix on arm (e.g. happy_path), but upon investigation, the pytest arguments for Nix were less forgiving than those of our other CI jobs.
This PR aligns the default pytest retry count globally, raising it from 0 to 2.
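As a sketch of what a global default looks like with pytest-rerunfailures (the exact config file and placement used by this repo are assumptions), the retry count can be set once via `addopts` rather than per invocation:

```ini
; pytest.ini (hypothetical placement): retry every failing test
; up to 2 times by default, matching the other CI jobs.
[pytest]
addopts = --reruns 2
```

Per-test `@pytest.mark.flaky` markers then override this default where individual tests need more reruns or a delay.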
These changes address the following test failures:
An ad-hoc job running just the above on both architectures is here
Call-outs:
Unfortunately, a few tests need more than 2 retries, or a pause between retries. Iterating through these failures, I've bumped up the retry count/delay using the pytest-rerunfailures flaky decorator, but only on specific tests while running on arm.
For renegotiate, I also observed that many of the retries came from hitting the subprocess timeout of 5 seconds. Experimenting with larger values led to 8 seconds as a more reliable timeout.
Testing:
How is this change tested (unit tests, fuzz tests, etc.)? Local runs and CI.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.