Enable test reruns on failed fragile tests #217
Conversation
Force-pushed 9496e53 to 67b59e4
Force-pushed c6cb7a3 to 0a0c6d8
It looks like this closes #189. Add this information to the commit message instead of "Part of tarantool/tarantool issue #5050".
Force-pushed f1cfe21 to 7869fe9
Ok, changed.
Force-pushed e76a809 to 358cf41
After some discussion with @Totktonada and @avtikhon, it was decided to try using JSON for the "fragile" configuration.
Useful links:
Force-pushed 358cf41 to a72b6cd
Implemented.
Force-pushed e40595e to 78a301d
As I said before, the transition to the JSON format should be implemented in the first patch of the patchset, and all new "fragile" options need to be added only to JSON. Introducing two new competing formats in the same patchset is a very strange decision.
Also, I think the format
fragile = <basename of the test> ; gh-<issue> md5sum:<checksum>
is strange (and potentially not valid), because:
- ; / # is used for adding a comment, so it looks like md5sum:<checksum> is a comment, but it is not.
- = / : is a separator for key/value, and mixing them in the same file is a strange decision.
I didn't look thoroughly, but I would highlight several points.
If you want my opinion about extracting checksums from comments: it is disgusting. Sorry. There are more points I would highlight about the patchset, but it does not make much sense if we get rid of checksums at all. The patchset modifies many different parts of test-run: too much for this small feature, IMHO.
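For illustration, a minimal sketch of the parsing hazard described above, assuming suite.ini is read with Python's configparser and inline comment prefixes are enabled (this is not test-run's actual parsing code):

```python
import configparser

# Hypothetical suite.ini fragment using the discussed single-line format.
INI_TEXT = """
[default]
fragile = bitset.test.lua ; gh-4095 md5sum:050af3a99561a724013995668a4bc71c
"""

# With ';' / '#' registered as inline comment prefixes, everything after
# ';' is silently dropped, so the issue and the checksum never reach the
# parsed value.
parser = configparser.ConfigParser(inline_comment_prefixes=(';', '#'))
parser.read_string(INI_TEXT)
print(repr(parser['default']['fragile']))  # -> 'bitset.test.lua'
```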
Force-pushed 339ed9b to 01d9865
Force-pushed 568344a to c7b1b6c
Added ability to use the fragile list in JSON format from 'suite.ini' files:
fragile = {
    "tests": {
        "bitset.test.lua": {
            "issues": [ "gh-4095" ]
        }
    }
}
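A minimal sketch of how such a value could be read, assuming suite.ini is parsed with Python's configparser and the 'fragile' value is then decoded as JSON (illustrative only, not test-run's actual code):

```python
import configparser
import json

config = configparser.ConfigParser()
config.read_string("""
[default]
fragile = { "tests": { "bitset.test.lua": { "issues": [ "gh-4095" ] } }}
""")

# Decode the JSON value of the 'fragile' option and list the tracked issues.
fragile = json.loads(config['default']['fragile'])
for test, info in fragile.get('tests', {}).items():
    print(test, info.get('issues', []))
```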
Force-pushed c7b1b6c to b6b7d41
Looks syntactically correct to me.
I pushed two branches into tarantool today (and reran failed jobs several times):
There are other fails, but I didn't look at whether the output was predictable (at least vinyl.ddl.test.lua; I don't remember the others). I know you're sure you can mark almost everything with checksums, but I see the opposite picture now. Maybe I'm just lucky. At least, please, leave the ability to rerun a test that is not marked with any checksum (I didn't look at the last patchset, maybe it is there already). Aside from this, the 'transient fail' statistics was nice: no need to scroll the screen to find out whether reruns occurred after full local testing.
I've run the same jobs twice an hour over 3 days and got a really large number of results. Having checked some of them, I got the same results as Alexander below, but that was expected. I've updated the table [1] with a new 'Checksum' column and set some of the results there: green for checksums that are available, red for tests with changing data. As I checked before, the same checksum issue affects only some tests: I found only 9 that print box info, plus some other tests printing other changing data.
Right, we have some of these fails, but they pass when the tests are run serially using the fragile list.
Right, I've marked this test as red in the table.
Right, we have some tests like this, but we have two ways to resolve it: fix the test or remove the box information printing if it is not really needed. I've prepared the initial patch with the checksums marked green in the table mentioned above, and the results are really good: only 1 new real fail was found [2]! [2]: https://gitlab.com/tarantool/tarantool/-/pipelines/193120097/builds
That was the main reason why I didn't want to have this test rerun feature without checksums.
Actually, after the checksums are written to the suite.ini files, these fails will be really rare.
It is maybe okay as a temporary workaround, but it is not so complex to fix the cause: bind httpd.py to port zero (to let the kernel choose a random one) and read the real port from its output (we already read heartbeats from it).
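A minimal sketch of the suggested fix, assuming a Python helper like httpd.py (the names below are illustrative, not the actual httpd.py code):

```python
import socket

# Bind to port 0 so the kernel picks a free port, then report the real
# port on stdout so the test harness can read it from the helper's output.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.bind(('127.0.0.1', 0))              # port 0: let the kernel choose
sock.listen(1)
port = sock.getsockname()[1]             # the port actually assigned
print('listening on port', port, flush=True)
```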
To be honest, I don't get the answer.
But I still want to know about this when it occurs, to verify whether it is introduced by a patch I'm working on. It is okay to postpone this feature, but it would be useful.
lib/worker.py
Outdated
        if is_fragile:
            testname = os.path.basename(task_id[0])
            fragile_checksums = self.suite.get_test_fragile_checksums(testname)
        if is_fragile and fragile_checksums:
Here you block the ability to rerun a test without a checksum. It is possible that a test cannot be verified just using a result file, but we want to allow reruns for it.
I remember the reason why you don't want to enable reruns by default when no checksums are provided: because it means that existing configs would let tests that are currently in a fragile list be retried.
However, this is not so until retries is set explicitly, so it seems we can do this without subtle behaviour changes. I mean, if we consider an empty (or non-existent) checksums list as 'rerun it anyway', everything will be good. But if you prefer an explicit '*' checksum, I don't mind.
You can also consider this as out of scope (not needed right now), but for me it looks quite natural that…
fragile = {
"retries": 2,
"tests": {
"bitset.test.lua": {
"issues": ["gh-4095"]
}
}
}
…means 'rerun bitset.test.lua two times'.
Anyway, it is up to you. I just noted that the problem you highlighted looks not applicable to our situation.
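A minimal sketch of the proposed relaxation (hypothetical names, not test-run's actual API): treat a missing or empty checksums list, or an explicit '*', as 'rerun anyway':

```python
def should_retry(is_fragile, checksums, reject_md5):
    """Decide whether a failed fragile test is eligible for a rerun."""
    if not is_fragile:
        return False
    # No checksums configured (or an explicit wildcard): rerun unconditionally.
    if not checksums or '*' in checksums:
        return True
    # Otherwise rerun only when the reject file matches a known fail.
    return reject_md5 in checksums
```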
I commented on it at the same time as you wrote the questions here:
#217 (comment)
In short: I agree with it, but as a standalone commit for this feature.
Status: it looks almost ready (considering the given decisions); I just want to discuss several points before we land the patchset.
Absolutely agree that running the tests like this is a temporary workaround and it needs to be fixed. We'll continue working on fixing it in the already existing standalone issues like:
I mean that a blind rerun of flaky tests is more dangerous than the manual rerun that we currently have. Checking the checksums is a way to make these reruns partly automated, to help the testing infrastructure avoid manual reruns. A standalone rerun would hide new flaky issues.
Right, let's implement it as a next step, not now.
Force-pushed b6b7d41 to cde0988
Added ability to set, per suite in the suite.ini configuration file, the 'retries' option, which sets the number of accepted reruns of failed tests from the 'fragile' list:
fragile = {
    "retries": 10,
    "tests": {
        "bitset.test.lua": {
            "issues": [ "gh-4095" ]
        }
    }
}
Part of #189
Added ability to check the results file checksum on a test fail and compare it with the checksums of the known issues mentioned in the fragile list. The fragile list should contain the results file checksums with their issues in the format:
fragile = {
    "retries": 10,
    "tests": {
        "bitset.test.lua": {
            "issues": [ "gh-4095" ],
            "checksums": [
                "050af3a99561a724013995668a4bc71c",
                "f34be60193cfe9221d3fe50df657e9d3"
            ]
        }
    }
}
Closes #189
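For illustration, a hypothetical sketch of the checksum comparison described above (not test-run's actual code; the reject file path is made up):

```python
import hashlib

def reject_checksum(reject_path):
    """Return the md5 checksum of a failed test's reject file."""
    with open(reject_path, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

known_checksums = [
    "050af3a99561a724013995668a4bc71c",
    "f34be60193cfe9221d3fe50df657e9d3",
]

# 'bitset.reject' stands in for the reject file of a failed run.
if reject_checksum('bitset.reject') in known_checksums:
    print('known flaky fail, eligible for a retry')
else:
    print('unknown fail, do not retry')
```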
Force-pushed cde0988 to f8a8135
LGTM.
Retry a failed test when it is marked as fragile (and several other conditions are met, see below).

test-run already allows setting a list of fragile tests. They are run one-by-one after all parallel ones in order to eliminate possible resource starvation and fit timings to the ones when the tests pass. See [1]. In practice this approach does not help much against our problem with flaky tests. We decided to retry failed tests when they are known to be fragile. See [2].

The core idea is to split responsibility: known flaky fails will not deflect the attention of a developer, but each fragile test will be marked explicitly, tracked in the issue tracker, and analyzed by the quality assurance team.

The default behaviour is not changed: each test from the fragile list will be run once after all parallel ones. But now it is possible to set the retries amount. Beware: the implementation does not allow to just set the retries count; it also requires providing an md5sum of a failed test's output (the so-called reject file). The idea here is to ensure that we retry the test only in case of a known fail, not some other fail within the test.

This approach has a limitation: in case of a fail a test may output information that varies from run to run or depends on the base directory. We should always verify the output before putting its checksum into the configuration file. Despite doubts regarding this approach, it looks simple and we decided to try it and revisit it if there is a need. See the configuration example in [3].

[1]: tarantool/test-run#187
[2]: tarantool/test-run#189
[3]: tarantool/test-run#217

Part of #5050
The test-run submodule is updated in tarantool in the following commits: 2.6.0-104-g43482eedc, 2.5.1-86-gc5bb549f6, 2.4.2-70-gef330c3b0, 1.10.7-35-g91260069c.
Removed the obsolete part of the RPM spec for Travis CI, since it is no longer in use.

---- Comments from @Totktonada ----

This change is a kind of reversion of commit d48406d ('test: add more tests to packaging testing'), which closed #4599. Here I describe the story: why the change was made and why it is reverted now.

We run testing during an RPM package build: it may catch some distribution-specific problems. We had reduced the quantity of tests and used single-thread test execution to keep the testing stable and not break package builds and deployment due to known fragile tests.

Our CI had to use Travis CI, but we were in transition to GitLab CI to use our own machines and not hit the Travis CI limit of five jobs running in parallel. We moved package builds to GitLab CI, but kept build+deploy jobs on Travis CI for a while: GitLab CI was new for us and we wanted to make this transition smooth for users of our APT / YUM repositories.

After enabling package building on GitLab CI, we wanted to enable more tests (to catch more problems) and enable parallel execution of tests to speed up testing (and reduce the amount of time a developer waits for results). We observed that if we enabled more tests and parallel execution on Travis CI, the testing results would become much less stable and so we would often have holes in deployed packages and red CI. So we decided to keep the old way of testing on Travis CI and perform all changes (more tests, more parallelism) only for GitLab CI. We had a guess that we have enough machine resources and would be able to do some load balancing to overcome flaky fails on our own machines, but in fact we picked up another approach later (see below). That's the whole story behind #4599.

What has changed since those days? We moved deployment jobs to GitLab CI[^1] and now we have completely disabled Travis CI (see #4410 and #4894). All jobs were moved either to GitLab CI or right to GitHub Actions[^2]. We revisited our approach to improving the stability of testing. Attempts to do some load balancing together with attempts to keep a not-so-large execution time failed. We should increase parallelism for speed, but decrease it for stability at the same time. There is no optimal balance. So we decided to track flaky fails in the issue tracker and restart a test after a known fail (see details in [1]). This way we don't need to exclude tests and disable parallelism in order to get stable and fast testing[^3]. At least in theory. We're on the way to verifying this guess, but hopefully we'll stick with some adequate defaults that will work everywhere[^4].

To sum up, there are several reasons to remove the old workaround, which was implemented in the scope of #4599: no Travis CI, and no foreseeable reasons to exclude tests and reduce parallelism depending on a CI provider.

Footnotes:

[^1]: This is a simplification. Travis CI deployment jobs were not moved as-is. GitLab CI jobs push packages to the new repositories backend (#3380). Travis CI jobs were disabled later (as part of #4947), after proofs that the new infrastructure works fine. However, this is another story.

[^2]: Now we're going to use GitHub Actions for all jobs, mainly because GitLab CI is poorly integrated with GitHub pull requests (when the source branch is in a forked repository).

[^3]: Some work in this direction still needs to be done. First, the 'replication' test suite is still excluded from testing under an RPM package build. It seems we should just enable it back; this is tracked by #4798. Second, there is the issue [2] to get rid of ancient traces of the old attempts to keep the testing stable (from the test-run side). It'll give us more parallelism in testing.

[^4]: Of course, we perform investigations of flaky fails and fix the code and testing problems they reveal to us. However, it appears to be a long-term activity.

References:

[1]: tarantool/test-run#217
[2]: https://github.com/tarantool/test-run/issues/251
Added ability to use the fragile list in JSON format from 'suite.ini'
files.
Added ability to set, per suite in the suite.ini configuration file,
the 'retries' option, which sets the number of accepted reruns of
failed tests from the 'fragile' list.
Added ability to check failed tests against the fragile list to be sure
that the current fail matches the issue mentioned in the fragile list.
Closes tarantool/test-run#189.