Release ProcessDesc in main() to release some memory #42503

Merged 1 commit on Sep 28, 2023

Conversation

makortel
Contributor

@makortel makortel commented Aug 7, 2023

PR description:

Profiling the live memory of #40437 (comment) showed that the edm::ProcessDesc was taking about 40 MB
https://mkortela.web.cern.ch/mkortela/cgi-bin/navigator/issue40437/reco_07.5_live/496

The ProcessDesc is not really needed in main() after the EventProcessor is constructed. This PR therefore releases main()'s ownership of the ProcessDesc, so that the object is destructed in the EventProcessor constructor.

This change implies that modules may no longer hold the edm::ParameterSet that is given to their constructor by reference or pointer (while testing the change I came across #42502, but there may be more cases).
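For readers unfamiliar with the pattern, the ownership hand-off described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `ToyParameterSet`, `ToyProcessDesc`, `ToyModule`, `ToyProcessor` and `descReleasedAfterConstruction` are hypothetical stand-ins, and the use of `std::shared_ptr` with `std::move` is an assumption about how ownership might be passed.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-ins for illustration only; these are NOT the real
// edm::ParameterSet, edm::ProcessDesc or edm::EventProcessor classes.
struct ToyParameterSet {
  std::map<std::string, int> values;
  int get(const std::string& name) const { return values.at(name); }
};

struct ToyProcessDesc {
  ToyParameterSet pset;
};

// A module that copies the values it needs while it is being constructed,
// so it does not depend on how long the configuration object lives.
class ToyModule {
public:
  explicit ToyModule(const ToyParameterSet& pset) : threshold_(pset.get("threshold")) {}
  int threshold() const { return threshold_; }

private:
  int threshold_;  // a value, not a reference or pointer into the ParameterSet
};

class ToyProcessor {
public:
  // `desc` is taken by value. If the caller moves its pointer in, the last
  // owner is this parameter, and the description is destroyed once the
  // constructor call has completed.
  explicit ToyProcessor(std::shared_ptr<ToyProcessDesc> desc) : module_(checked(desc).pset) {}
  const ToyModule& module() const { return module_; }

private:
  static const ToyProcessDesc& checked(const std::shared_ptr<ToyProcessDesc>& desc) {
    assert(desc != nullptr);
    return *desc;
  }
  ToyModule module_;
};

// Plays the role of main(): builds the description, hands it over, and
// reports whether it was freed while the module kept a usable value.
bool descReleasedAfterConstruction() {
  auto desc = std::make_shared<ToyProcessDesc>();
  desc->pset.values["threshold"] = 5;
  std::weak_ptr<ToyProcessDesc> observer = desc;  // observes without owning
  ToyProcessor processor(std::move(desc));        // caller gives up ownership here
  return observer.expired() && processor.module().threshold() == 5;
}
```

In this sketch the module stays valid only because it stores a copy of the value. A module that kept `const ToyParameterSet&` as a member would be left with a dangling reference after the hand-off.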

Resolves cms-sw/framework-team#616

PR validation:

Workflow 11834.21 reco step runs with #42502

The IgProf MEM_LIVE in 11834.21 reco step in 13_2_0_pre3

There is a residual ~1.7 MB contribution from PyBind11ProcessDesc that looks like memory leaks in Python itself(?).

@cmsbuild
Contributor

cmsbuild commented Aug 7, 2023

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-42503/36509

  • This PR adds an extra 16KB to repository

@cmsbuild
Contributor

cmsbuild commented Aug 7, 2023

A new Pull Request was created by @makortel (Matti Kortelainen) for master.

It involves the following packages:

  • FWCore/Framework (core)

@cmsbuild, @smuzaffar, @Dr15Jones, @makortel can you please review it and eventually sign? Thanks.
@missirol, @wddgit this is something you requested to watch as well.
@perrotta, @dpiparo, @rappoccio you are the release manager for this.

cms-bot commands are listed here

@makortel
Contributor Author

makortel commented Aug 7, 2023

@cmsbuild, please test with #42502

@cmsbuild
Contributor

cmsbuild commented Aug 7, 2023

-1

Failed Tests: RelVals RelVals-INPUT AddOn
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-1d46d6/34146/summary.html
COMMIT: 94b5def
CMSSW: CMSSW_13_3_X_2023-08-07-1100/el8_amd64_gcc11
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/42503/34146/install.sh to create a dev area with all the needed externals and cmssw changes.

RelVals

----- Begin Fatal Exception 08-Aug-2023 00:00:12 CEST-----------------------
An exception of category 'Configuration' occurred while
   [0] Processing  Event run: 346512 lumi: 250 event: 243042266 stream: 0
   [1] Running path 'dqmoffline_step'
   [2] Calling method for module L1TMuonEndCapShowerProducer/'valEmtfStage2Showers'
Exception Message:
MissingParameter: Parameter 'enableOneLooseShower' not found.
----- End Fatal Exception -------------------------------------------------
  • 138.5 138.5_ExpressCollisions2021/step2_ExpressCollisions2021.log
  • 139.001 139.001_RunMinimumBias2021/step2_RunMinimumBias2021.log

RelVals-INPUT

  • 140.002 140.002_RunSingleMuon2022A/step2_RunSingleMuon2022A.log
  • 139.003 139.003_RunHLTPhy2021/step2_RunHLTPhy2021.log
  • 139.004 139.004_RunNoBPTX2021/step2_RunNoBPTX2021.log

AddOn Tests

[hlt_mc_GRun:1] cmsDriver.py TTbar_13TeV_TuneCUETP8M1_cfi -s GEN,SIM,DIGI,L1,DIGI2RAW --mc --scenario=pp -n 10 --conditions auto:run3_mc_GRun --relval 9000,50 --datatier "GEN-SIM-RAW" --eventcontent RAWSIM --customise=HLTrigger/Configuration/CustomConfigs.L1T --era Run3_2023 --fileout file:RelVal_Raw_GRun_MC.root : FAILED - elapsed time: 201 sec (ended on Tue Aug  8 00:01:42 2023) - exit: 35584
----- Begin Fatal Exception 08-Aug-2023 00:02:21 CEST-----------------------
An exception of category 'FileOpenError' occurred while
   [0] Constructing the EventProcessor
   [1] Constructing input source of type PoolSource
   [2] Calling RootInputFileSequence::initTheFile()
   Additional Info:
      [a] Input file file:RelVal_Raw_GRun_MC.root could not be opened.
      [b] Fatal Root Error: @SUB=TStorageFactoryFile::ReadBuffer
read from Storage::xread returned 256. Asked to read n bytes: 300 from offset: 0 with file size: 256

----- End Fatal Exception -------------------------------------------------
----- Begin Fatal Exception 08-Aug-2023 00:03:45 CEST-----------------------
An exception of category 'FileOpenError' occurred while
   [0] Constructing the EventProcessor
   [1] Constructing input source of type PoolSource
   [2] Calling RootInputFileSequence::initTheFile()
   Additional Info:
      [a] Input file file:RelVal_Raw_GRun_MC.root could not be opened.
      [b] Fatal Root Error: @SUB=TStorageFactoryFile::ReadBuffer
read from Storage::xread returned 256. Asked to read n bytes: 300 from offset: 0 with file size: 256

----- End Fatal Exception -------------------------------------------------

@makortel
Contributor Author

makortel commented Aug 8, 2023

Looks like this PR is one of those that require some cleanup to be done first. A quick git grep showed at least the following files holding a reference to an edm::ParameterSet:

  • CalibCalorimetry/HcalPlugins/src/HBHEDarkeningEP.h
  • DQM/SiPixelPhase1Common/interface/HistogramManager.h
  • DQM/TrackerRemapper/plugins/TrackerRemapper.cc
  • HLTriggerOffline/Egamma/interface/EmDQM.h
  • L1Trigger/L1TMuonEndCap/plugins/L1TMuonEndCapShowerProducer.h
  • L1Trigger/L1TMuonEndCap/plugins/L1TMuonEndCapTrackProducer.h
  • L1Trigger/L1TMuonOverlapPhase1/interface/Tools/CandidateSimMuonMatcher.h
  • PhysicsTools/UtilAlgos/interface/CachingVariable.h
  • RecoBTag/ImpactParameter/plugins/IPProducer.h
  • RecoPPS/Local/interface/RPixDetClusterizer.h
  • RecoPPS/Local/interface/TotemRPClusterProducerAlgorithm.h
  • SimG4Core/PhysicsLists/interface/CMSEmStandardPhysicsEMMT.h
  • SimMuon/MCTruth/interface/CSCTruthTest.h
  • SimPPS/RPDigiProducer/plugins/RPLinearChargeDivider.h
  • SimPPS/RPDigiProducer/plugins/RPVFATSimulator.h

@wddgit
Contributor

wddgit commented Aug 9, 2023

What exactly was the "git grep" command used to generate the list of modules?

I am wondering if we could miss items that we are referencing in this ParameterSet we want to delete. The getParameterSet and getParameterSetVector functions return references to internal vectors or ParameterSets.

Even if we find and remove all of them, is there anything to prevent people from adding new code that saves such references? These might be difficult to identify, possibly pointing to something stale but valid sometimes and then yielding unpredictable results.

Maybe the memory savings are worth the risk. It does concern me a little.
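The concern above can be illustrated with a small sketch. It assumes toy types (`ToyLeafPSet`, `ToyTopPSet`, `ToyAlgo`, `cutAfterConfigDestroyed`) that are hypothetical and only mimic the shape of an accessor like getParameterSet, which returns a reference into the parent's storage. The safe pattern shown is to copy the nested set during construction.

```cpp
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-ins, not the real edm::ParameterSet interface.
struct ToyLeafPSet {
  std::map<std::string, int> ints;
};

class ToyTopPSet {
public:
  void addPSet(const std::string& name, const ToyLeafPSet& pset) { psets_[name] = pset; }
  // Returns a reference to internal storage: valid only while *this is alive.
  const ToyLeafPSet& getParameterSet(const std::string& name) const { return psets_.at(name); }

private:
  std::map<std::string, ToyLeafPSet> psets_;
};

class ToyAlgo {
public:
  // Copies the nested set. Declaring the member as `const ToyLeafPSet& conf_;`
  // would compile as well, but would dangle once the top-level set is destroyed.
  explicit ToyAlgo(const ToyTopPSet& top) : conf_(top.getParameterSet("algo")) {}
  int cut() const { return conf_.ints.at("cut"); }

private:
  ToyLeafPSet conf_;
};

// Builds a configuration, constructs the algorithm, destroys the
// configuration, and only then reads the parameter.
int cutAfterConfigDestroyed() {
  auto top = std::make_unique<ToyTopPSet>();
  ToyLeafPSet leaf;
  leaf.ints["cut"] = 7;
  top->addPSet("algo", leaf);
  ToyAlgo algo(*top);
  top.reset();  // the configuration is gone; the algorithm's copy remains
  return algo.cut();
}
```

The reference-holding variant is the hard case to catch, because reading freed memory is undefined behavior and can appear to work.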

@makortel
Contributor Author

> What exactly was the "git grep" command used to generate the list of modules?

git grep -E "ParameterSet ?&" | fgrep -v \( | fgrep \; | fgrep -v \)  | fgrep -v =
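To make explicit what this pipeline keeps, its per-line logic can be restated as a small predicate. This is only a sketch of the filter chain as written in the command; the function name `looksLikeStoredPSetReference` is hypothetical, and it is applied to a single line of text rather than to real `git grep` output (which also carries a `filename:` prefix).

```cpp
#include <regex>
#include <string>

// Mirrors the filters of:
//   git grep -E "ParameterSet ?&" | fgrep -v \( | fgrep \; | fgrep -v \) | fgrep -v =
// A line is kept if it matches the regex, contains ';', and contains
// none of '(', ')' or '='.
bool looksLikeStoredPSetReference(const std::string& line) {
  static const std::regex pattern("ParameterSet ?&", std::regex::extended);
  auto contains = [&line](char c) { return line.find(c) != std::string::npos; };
  return std::regex_search(line, pattern) && !contains('(') && contains(';') && !contains(')') &&
         !contains('=');
}
```

Read this way, the filter targets member-like declarations such as `const edm::ParameterSet& pset_;` and drops function signatures and assignments. It does not match pointer members such as `const edm::ParameterSet* pset_;`, nor references to nested vectors, so such cases need a separate search.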

> I am wondering if we could miss items that we are referencing in this ParameterSet we want to delete. The getParameterSet and getParameterSetVector functions return references to internal vectors or ParameterSets.

I'd bet it missed something. I'm hoping the next round of testing (after all the cases identified in #42503 (comment) have been converted) would show more cases to migrate (if there are any).

> Even if we find and remove all of them, is there anything to prevent people from adding new code that saves such references?

Out of the box there would be nothing. But the situation is similar to e.g. Event and EventSetup, which are transient (facade) objects. I don't recall that we added any checks at the time the EventSetup was made a facade.

> These might be difficult to identify, possibly pointing to something stale but valid sometimes and then yielding unpredictable results.

I agree this is a valid concern. The change should definitely be announced (although I hope we have never made any promises about the lifetime of the process PSet). Maybe a static analyzer check would help as well (such that it is enforced in PR tests)?

@makortel
Contributor Author

PRs to address all the files in #42503 (comment) (some of them did not need action) have now been merged. I'd wait for the next IB before launching tests again.

@makortel
Contributor Author

@cmsbuild, please test

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-1d46d6/34503/summary.html
COMMIT: 92652cc
CMSSW: CMSSW_13_3_X_2023-08-28-1100/el8_amd64_gcc11
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/42503/34503/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially removed 18 lines from the logs
  • Reco comparison results: 4 differences found in the comparisons
  • DQMHistoTests: Total files compared: 48
  • DQMHistoTests: Total histograms compared: 3153095
  • DQMHistoTests: Total failures: 3
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3153070
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
  • Checked 207 log files, 159 edm output root files, 48 DQM output files
  • TriggerResults: no differences found

@makortel
Contributor Author

At this point it is probably best to wait for 13_4_X to be opened to avoid potentially breaking workflows (even if hopefully unlikely) towards the end of the release cycle, and to not disturb the code integration for HI data taking.

@makortel
Contributor Author

@cmsbuild, please test

To refresh.

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-1d46d6/34847/summary.html
COMMIT: 92652cc
CMSSW: CMSSW_13_3_X_2023-09-20-1100/el8_amd64_gcc11
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/42503/34847/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

There are some workflows for which there are errors in the baseline:
24834.5 step 3
The results for the comparisons for these workflows could be incomplete
This most likely means that the IB has errors in the relvals. The error does NOT come from this pull request.

Summary:

  • You potentially removed 3 lines from the logs
  • Reco comparison results: 6 differences found in the comparisons
  • DQMHistoTests: Total files compared: 50
  • DQMHistoTests: Total histograms compared: 3358044
  • DQMHistoTests: Total failures: 3
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3358019
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 49 files compared)
  • Checked 214 log files, 167 edm output root files, 50 DQM output files
  • TriggerResults: no differences found

@makortel
Contributor Author

@cmsbuild, plese test

Refresh again

@makortel
Contributor Author

@cmsbuild, please test

Refreshing without a typo

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-1d46d6/34936/summary.html
COMMIT: 92652cc
CMSSW: CMSSW_13_3_X_2023-09-27-1100/el8_amd64_gcc11
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/42503/34936/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially removed 6 lines from the logs
  • Reco comparison results: 8 differences found in the comparisons
  • DQMHistoTests: Total files compared: 50
  • DQMHistoTests: Total histograms compared: 3358320
  • DQMHistoTests: Total failures: 3
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3358295
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 49 files compared)
  • Checked 214 log files, 167 edm output root files, 50 DQM output files
  • TriggerResults: no differences found

@makortel
Contributor Author

+core

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @antoniovilela, @rappoccio, @sextonkennedy (and backports should be raised in the release meeting by the corresponding L2)

@antoniovilela
Contributor

+1

4 participants