Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve behavior after exception in begin/end stream lumi #44624

Merged
merged 1 commit into from
Apr 19, 2024

Conversation

wddgit
Copy link
Contributor

@wddgit wddgit commented Apr 4, 2024

PR description:

Improve the behavior of the Framework after stream begin/end lumi exceptions. This is the first in a series of PRs where we plan to make the behavior after exceptions more consistent in all the begin/end transitions.

The intent is that nothing in the output will change if there are not any exceptions.

This work was motivated by discussions related to Issues #43831 and #42501.

The core group has discussed this. The following is the behavior we plan to implement in all the begin/end transitions eventually. I think this is consistent with our previous discussions with some extra details added that were fleshed out as this PR was implemented:

  1. For a stream begin transition (of any type, run/lumi/processBlock/top level), the Framework will try to execute all modules (possibly concurrently) and continues executing them all even if one or more of the modules throws. The same is true for stream end transitions except if a module in a begin transition throws an exception in a pre module signal, the module itself, or a post module signal, then the corresponding end transition for that module will not execute. This is all also true for global transitions except that a module that depends on a product produced by a module that threw an exception (directly or indirectly) will not execute.

  2. For both stream and global transitions, EventSetup prefetching throwing an exception could cause a module to
    not be executed.

  3. If and only if there will be an attempt to execute the module, the pre module and post module signals will execute. If the pre module signal throws, then module does not execute but the post signal executes anyway. If the module throws, then the post signal executes anyway.

  4. If there is any attempt to run a transition at all, then all 4 of the non-module signals will execute (pre and post begin,
    pre and post end). Note that streams might entirely skip the transitions associated with a specific lumi (and in the future
    this behavior might be extended to runs also). If a stream skips a lumi (or run), then none of the signals is executed and
    the modules will not be run. This skipping might occur if another stream has process the last event or thrown an exception (although it might not skip because the notification of this has to propagate to the right location before the stream starts the lumi or run, once a stream starts it keeps going).

  5. If the pre begin signal throws, the modules begin and end functions are not executed. If the post begin signal throws, the module begin functions may have already run (too late to stop them). It will attempt to run the end module functions for any modules that succeeded with their begin functions (and begin module signals). If the pre end signal throws, none of the module end functions is executed.

  6. If an exception occurs, notification of that is propagated and when it gets to the WaitingTaskHolder created in the
    processRuns function, then new runs and lumis and new events will not be started. Before that propagation is complete,
    streams other than the one experiencing the exception might or might not start subsequent runs, subsequent lumis, or subsequent events (they could get ahead because the other streams are already ahead when the exception is thrown or just because of the time for the flag to propagate, in general this is not predictable or reproducible). Also note that if an exception occurs during stream end lumi, the stream where the exception occurs would have already started the next run and/or next lumi before the exception occurs.

  7. At endStream and endJob, all exceptions are collected and printed. At other transitions, only the first exception is collected and printed. It is more likely in these cases that the first exception is interesting and the rest are just side effects of the first exception and cause confusion rather than aid debugging.

  8. If one service throws while a signal is handled, the Framework continues trying to run all the other services and will report the first exception.

  9. The global write transitions do not occur at all if the global begin transition did not succeed.

PR validation:

An existing unit test covering exceptions in different transitions is extended to cover the most salient cases. Additional manual testing of many various cases was also done. Existing unit tests pass.

@cmsbuild
Copy link
Contributor

cmsbuild commented Apr 4, 2024

cms-bot internal usage

@cmsbuild
Copy link
Contributor

cmsbuild commented Apr 4, 2024

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44624/39814

@cmsbuild
Copy link
Contributor

cmsbuild commented Apr 4, 2024

A new Pull Request was created by @wddgit for master.

It involves the following packages:

  • FWCore/Framework (core)
  • FWCore/Integration (core)
  • FWCore/Services (core)
  • FWCore/Utilities (core)

@makortel, @Dr15Jones, @smuzaffar, @cmsbuild can you please review it and eventually sign? Thanks.
@makortel, @felicepantaleo, @fwyzard, @missirol this is something you requested to watch as well.
@antoniovilela, @sextonkennedy, @rappoccio you are the release manager for this.

cms-bot commands are listed here

@wddgit
Copy link
Contributor Author

wddgit commented Apr 4, 2024

please test

@cmsbuild
Copy link
Contributor

cmsbuild commented Apr 5, 2024

-1

Failed Tests: UnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-d2b23f/38610/summary.html
COMMIT: 8bc74f7
CMSSW: CMSSW_14_1_X_2024-04-04-1100/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/44624/38610/install.sh to create a dev area with all the needed externals and cmssw changes.

Unit Tests

I found 1 errors in the following unit tests:

---> test runtestPhysicsToolsPatAlgos had ERRORS

Comparison Summary

Summary:

@wddgit
Copy link
Contributor Author

wddgit commented Apr 5, 2024

enable threading

@wddgit
Copy link
Contributor Author

wddgit commented Apr 5, 2024

please test

Rerunning tests with threading enabled.

The failure that occurred is unrelated to this PR and is already occurring in the IBs.

@makortel is the 312.0 comparison failure one of the known ones? Is there an issue
for that one?

The others are known.
Non reproducibility in wf 136.793 #43293
Non-reproducibility in TriggerResults in 24834.0, 24834.911, and 25034.999 #43790

@makortel
Copy link
Contributor

makortel commented Apr 5, 2024

is the 312.0 comparison failure one of the known ones?

Those comparison differences are in MessageLogger directory, and are related to the logging of error(?) messages. They occur sporadically (also in other workflows) and can be ignored.

Copy link
Contributor

@makortel makortel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a first look

FWCore/Framework/interface/StreamSchedule.h Show resolved Hide resolved
FWCore/Framework/interface/maker/Worker.h Show resolved Hide resolved
FWCore/Framework/interface/maker/Worker.h Show resolved Hide resolved
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about the TestServiceOne and TestServiceTwo would be moved to FWCore/Integration/plugins?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll move them. That is a better place. Thanks.

@wddgit
Copy link
Contributor Author

wddgit commented Apr 5, 2024

abort test

@makortel
Copy link
Contributor

Seems like a new non-reproducibility

@makortel
Copy link
Contributor

I opened an issue #44779

@makortel
Copy link
Contributor

@cmsbuild, please test

Let's try again

@cmsbuild
Copy link
Contributor

-1

Failed Tests: UnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-d2b23f/38944/summary.html
COMMIT: 26b519a
CMSSW: CMSSW_14_1_X_2024-04-18-1100/el8_amd64_gcc12
Additional Tests: THREADING
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/44624/38944/install.sh to create a dev area with all the needed externals and cmssw changes.

Unit Tests

I found 1 errors in the following unit tests:

---> test runtestPhysicsToolsPatAlgos had ERRORS

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 50 differences found in the comparisons
  • DQMHistoTests: Total files compared: 48
  • DQMHistoTests: Total histograms compared: 3319599
  • DQMHistoTests: Total failures: 3
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3319576
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
  • Checked 202 log files, 165 edm output root files, 48 DQM output files
  • TriggerResults: no differences found

@makortel
Copy link
Contributor

Now the comparisons show only #43790

@makortel
Copy link
Contributor

+core

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (but tests are reportedly failing). This pull request will now be reviewed by the release team before it's merged. @antoniovilela, @rappoccio, @sextonkennedy (and backports should be raised in the release meeting by the corresponding L2)

@makortel
Copy link
Contributor

@cmsbuild, ignore tests-rejected with ib-failure

@rappoccio
Copy link
Contributor

+1

@rappoccio
Copy link
Contributor

merge

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants