Improve behavior after exception in begin/end stream lumi #44624
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44624/39814
|
A new Pull Request was created by @wddgit for master. It involves the following packages:
@makortel, @Dr15Jones, @smuzaffar, @cmsbuild can you please review it and eventually sign? Thanks. cms-bot commands are listed here |
please test |
-1 Failed Tests: UnitTests. Unit Tests: I found 1 errors in the following unit tests: ---> test runtestPhysicsToolsPatAlgos had ERRORS. Comparison Summary:
|
enable threading |
please test
Rerunning tests with threading enabled. The failure that occurred is unrelated to this PR and is already occurring in the IBs. @makortel, is the 312.0 comparison failure one of the known ones? Is there an issue? The others are known. |
Those comparison differences are in |
From a first look
How about moving TestServiceOne and TestServiceTwo to FWCore/Integration/plugins?
I'll move them. That is a better place. Thanks.
abort test |
8bc74f7
to
2866525
Compare
Seems like a new non-reproducibility |
I opened an issue #44779 |
@cmsbuild, please test Let's try again |
-1 Failed Tests: UnitTests. Unit Tests: I found 1 errors in the following unit tests: ---> test runtestPhysicsToolsPatAlgos had ERRORS. Comparison Summary:
|
Now the comparisons show only #43790 |
+core |
This pull request is fully signed and it will be integrated in one of the next master IBs (but tests are reportedly failing). This pull request will now be reviewed by the release team before it's merged. @antoniovilela, @rappoccio, @sextonkennedy (and backports should be raised in the release meeting by the corresponding L2) |
@cmsbuild, ignore tests-rejected with ib-failure |
+1 |
merge |
PR description:
Improve the behavior of the Framework after stream begin/end lumi exceptions. This is the first in a series of PRs where we plan to make the behavior after exceptions more consistent in all the begin/end transitions.
The intent is that nothing in the output will change if there are not any exceptions.
This work was motivated by discussions related to Issues #43831 and #42501.
The core group has discussed this. The following is the behavior we plan to implement in all the begin/end transitions eventually. I think this is consistent with our previous discussions with some extra details added that were fleshed out as this PR was implemented:
For a stream begin transition (of any type: run, lumi, processBlock, or top level), the Framework tries to execute all modules (possibly concurrently) and continues executing them even if one or more modules throw. The same is true for stream end transitions, with one exception: if a module in a begin transition throws an exception in a pre module signal, the module itself, or a post module signal, then the corresponding end transition for that module will not execute. All of this also holds for global transitions, except that a module that depends (directly or indirectly) on a product produced by a module that threw an exception will not execute.
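The stream begin/end rule above can be illustrated with a small, sequential sketch. This is not Framework code: the `Module` class and the `run_begin` and `run_end` helpers are hypothetical names invented for the illustration, and the real Framework may run modules concurrently.

```python
class Module:
    """Hypothetical stand-in for a Framework module (illustration only)."""

    def __init__(self, name, fail_begin=False):
        self.name = name
        self.fail_begin = fail_begin
        self.calls = []  # records which transitions were attempted

    def begin(self):
        self.calls.append("begin")
        if self.fail_begin:
            raise RuntimeError(f"{self.name} failed in begin")

    def end(self):
        self.calls.append("end")


def run_begin(modules):
    """Try begin on every module, continuing after a failure."""
    succeeded, errors = [], []
    for m in modules:
        try:
            m.begin()
            succeeded.append(m)
        except Exception as exc:
            errors.append(exc)
    return succeeded, errors


def run_end(succeeded):
    """Run end only for modules whose begin succeeded."""
    errors = []
    for m in succeeded:
        try:
            m.end()
        except Exception as exc:
            errors.append(exc)
    return errors


modules = [Module("a"), Module("b", fail_begin=True), Module("c")]
succeeded, begin_errors = run_begin(modules)
end_errors = run_end(succeeded)
```

In this sketch module `b` throws in begin, so `a` and `c` still run both begin and end, while `b` never sees its end transition.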
For both stream and global transitions, an exception thrown during EventSetup prefetching could cause a module not to be executed.
The pre module and post module signals execute if and only if there is an attempt to execute the module. If the pre module signal throws, then the module does not execute, but the post module signal executes anyway. If the module throws, the post module signal still executes.
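A minimal sketch of the pre/post module signal rule follows. The function name `run_module_with_signals` and the three callables are hypothetical and exist only for this illustration; they are not the Framework's API. Keeping only the first exception is also an assumption of the sketch.

```python
def run_module_with_signals(pre_signal, module, post_signal):
    """Return the first exception raised, or None if nothing threw."""
    first_error = None
    try:
        pre_signal()
        module()  # skipped automatically if pre_signal raised
    except Exception as exc:
        first_error = exc
    # The post signal runs whether or not the pre signal or module threw.
    try:
        post_signal()
    except Exception as exc:
        if first_error is None:
            first_error = exc
    return first_error


log = []

def failing_pre():
    log.append("pre")
    raise RuntimeError("pre module signal failed")

def module_body():
    log.append("module")

def post():
    log.append("post")

error = run_module_with_signals(failing_pre, module_body, post)
```

Here the pre signal throws, so `log` ends up containing only the pre and post entries, and `error` holds the exception from the pre signal.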
If there is any attempt to run a transition at all, then all four non-module signals execute (pre and post begin, pre and post end). Note that streams might entirely skip the transitions associated with a specific lumi (and in the future this behavior might be extended to runs as well). If a stream skips a lumi (or run), then none of the signals execute and the modules are not run. This skipping might occur if another stream has processed the last event or thrown an exception. It is not guaranteed, though: the notification has to propagate to the right location before the stream starts the lumi or run, and once a stream starts one, it keeps going.
If the pre begin signal throws, the module begin and end functions are not executed. If the post begin signal throws, the module begin functions may have already run (it is too late to stop them). The Framework will attempt to run the module end functions for any modules whose begin functions (and begin module signals) succeeded. If the pre end signal throws, none of the module end functions are executed.
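The interaction between the four non-module signals and the module functions can be sketched as below. Everything here is hypothetical (`run_lumi_on_stream`, the `signals` dictionary keys, the `(name, begin, end)` tuples); it is a sequential toy model under the rules stated in this description, not the Framework's implementation.

```python
def run_lumi_on_stream(signals, modules):
    """signals: dict of four callables; modules: list of (name, begin, end).

    Returns (log of everything attempted, first exception or None).
    """
    log = []
    first_error = None

    def attempt(label, func):
        nonlocal first_error
        log.append(label)
        try:
            func()
            return True
        except Exception as exc:
            if first_error is None:
                first_error = exc
            return False

    begun = []  # modules whose begin succeeded
    if attempt("pre_begin", signals["pre_begin"]):
        for name, begin, end in modules:
            if attempt(f"{name}.begin", begin):
                begun.append((name, end))
    attempt("post_begin", signals["post_begin"])

    # Module end functions run only if the pre end signal did not throw.
    if attempt("pre_end", signals["pre_end"]):
        for name, end in begun:
            attempt(f"{name}.end", end)
    attempt("post_end", signals["post_end"])
    return log, first_error


def ok():
    pass

def boom():
    raise RuntimeError("boom")

quiet = {k: ok for k in ("pre_begin", "post_begin", "pre_end", "post_end")}
modules = [("a", ok, ok), ("b", boom, ok)]

log_ok, err_ok = run_lumi_on_stream(quiet, modules)
log_pre, err_pre = run_lumi_on_stream({**quiet, "pre_begin": boom}, modules)
```

In the first call module `b` fails in begin, so only `a.end` is attempted. In the second call the pre begin signal throws, so no module function is attempted, yet all four non-module signals still appear in the log.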
If an exception occurs, notification of it is propagated, and once it reaches the WaitingTaskHolder created in the processRuns function, new runs, lumis, and events will not be started. Before that propagation is complete, streams other than the one experiencing the exception might or might not start subsequent runs, lumis, or events. They could get ahead because they were already ahead when the exception was thrown, or simply because of the time the flag takes to propagate; in general this is not predictable or reproducible. Also note that if an exception occurs during stream end lumi, the stream where it occurs will already have started the next run and/or next lumi before the exception occurs.
At endStream and endJob, all exceptions are collected and printed. At other transitions, only the first exception is collected and printed. It is more likely in these cases that the first exception is interesting and the rest are just side effects of the first exception and cause confusion rather than aid debugging.
If one service throws while a signal is handled, the Framework continues trying to run all the other services and will report the first exception.
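The service rule above amounts to "call every slot, remember the first failure". A minimal sketch, assuming a signal is simply a list of callables; `emit_signal` and the three example services are hypothetical names, not the Framework's signal classes.

```python
def emit_signal(slots):
    """Call every slot even if some raise; return the first exception or None."""
    first_error = None
    for slot in slots:
        try:
            slot()
        except Exception as exc:
            if first_error is None:
                first_error = exc
    return first_error


called = []

def service_one():
    called.append("one")
    raise RuntimeError("service one failed")

def service_two():
    called.append("two")
    raise ValueError("service two failed")

def service_three():
    called.append("three")

reported = emit_signal([service_one, service_two, service_three])
```

All three services are called, and `reported` is the exception from `service_one`, the first one thrown.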
The global write transitions do not occur at all if the global begin transition did not succeed.
PR validation:
An existing unit test covering exceptions in different transitions is extended to cover the most salient cases. Additional manual testing of many different cases was also done. Existing unit tests pass.