Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Redesign of the association maps, multivector manager, HGCAL Rechits and Validation with significant speedup of Phase-2 workflows #45865

Merged

Conversation

felicepantaleo
Copy link
Contributor

@felicepantaleo felicepantaleo commented Sep 3, 2024

PR description:

In the context of Next Generation Triggers and HGCAL TICL reconstruction, fast AssociatorMaps between Reco and Sim, Sim to Sim, and Reco to Reco are required in order to develop new reconstruction algorithms and study their performance. They have also been included in the HGCalValidator and in the TICLDumper, but can be extended to many other detectors (MTD for instance, as it is the new main offender).
The HGCAL Rechit producer was optimized by avoiding calling virtual functions for every rechit and to enable support for the MultiVectorManager.
The MultiVectorManager was optimized and together with the HGCAL RecHit map was used to optimize the E/Gamma reconstruction and the associators.
The HGCAL Validation was redesigned to make effective usage of the new associators. The framework was completely redesigned to generate automatically validation plots for any new collection of TICL tracksters against SimTracksters from Simclusters and CaloParticles.

Performance measurements:

Performed with 14_1_0_pre7 with TTbar PU200, 10 events, single thread, single stream

14_1_0_pre7:

  • Total time: 536s
  • Total allocated memory: 90GB

14_1_0_pre7+PR

  • Total time: 171s
  • Total allocated memory: 56GB

Performance comparison

Full performance comparison table here:
https://docs.google.com/spreadsheets/d/1vdmDDJJ07tf9ekQ8EC24sMqqvtCtiU2Mh6FawsIRu24/edit?usp=sharing

Notable numbers:

  • Overall Phase-2 TTbar PU200 Reconstruction+Validation workflows: speedup 313% , allocated memory -38%

  • Overall E/Gamma Phase-2 Reconstruction: 1.5x speedup

  • ElectronSeedProducer ecalDrivenElectronSeeds: 3x speedup

  • HGCalValidator: 30x speedup

  • New associators: between two orders of magnitude and infinite speedup depending if you want to run on the CaloParticles from pileup as well. In this case the associators in release will crash for the amount of resources needed.

  • Job Memory usage step4: -22%
    image

@cms-sw/hgcal-dpg-l2 @cms-sw/egamma-pog-l2 @waredjeb @AuroraPerego @rovere

@cmsbuild
Copy link
Contributor

cmsbuild commented Sep 3, 2024

cms-bot internal usage

@cmsbuild
Copy link
Contributor

cmsbuild commented Sep 3, 2024

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-45865/41614

@cmsbuild
Copy link
Contributor

cmsbuild commented Sep 3, 2024

A new Pull Request was created by @felicepantaleo for master.

It involves the following packages:

  • CommonTools/RecoAlgos (reconstruction)
  • DataFormats/HGCRecHit (upgrade, reconstruction)
  • HLTrigger/Configuration (hlt)
  • RecoEgamma/EgammaElectronProducers (reconstruction)
  • RecoEgamma/EgammaHLTProducers (hlt)
  • RecoHGCal/Configuration (upgrade, reconstruction)
  • RecoHGCal/TICL (upgrade, reconstruction)
  • RecoLocalCalo/HGCalRecAlgos (upgrade, reconstruction)
  • RecoLocalCalo/HGCalRecProducers (upgrade, reconstruction)
  • SimCalorimetry/HGCalAssociatorProducers (upgrade, simulation)
  • SimDataFormats/Associations (simulation)
  • Validation/Configuration (dqm, simulation)
  • Validation/HGCalValidation (dqm)

@Martin-Grunewald, @antoniovagnerini, @civanch, @cmsbuild, @jfernan2, @mandrenguyen, @mdhildreth, @mmusich, @nothingface0, @rvenditti, @srimanob, @subirsarkar, @syuvivida, @tjavaid can you please review it and eventually sign? Thanks.
@Fedespring, @HuguesBrun, @Martin-Grunewald, @Prasant1993, @ReyerBand, @Sam-Harper, @SohamBhattacharya, @VourMa, @a-kapoor, @abbiendi, @afiqaize, @ahinzmann, @apsallid, @argiro, @bsunanda, @cericeci, @cseez, @edjtscott, @fabiocos, @forthommel, @gkasieczka, @hatakeyamak, @jainshilpi, @jdolen, @jhgoh, @lecriste, @lgray, @martinamalberti, @missirol, @mmusich, @pfs, @ram1123, @rappoccio, @rchatter, @rovere, @sameasy, @sethzenz, @silviodonato, @sobhatta, @thomreis, @valsdav, @vandreev11, @varuns23, @wang0jin, @youyingli this is something you requested to watch as well.
@antoniovilela, @mandrenguyen, @rappoccio, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@felicepantaleo
Copy link
Contributor Author

test parameters:

  • workflow_opts= -w upgrade
  • workflow = 29896.203

@felicepantaleo
Copy link
Contributor Author

@cmsbuild please test

@felicepantaleo felicepantaleo changed the title Redesign of the association maps, multivector manager with significant speedup of Phase-2 reconstruction workflows Redesign of the association maps, multivector manager, HGCAL Rechits and Validation with significant speedup of Phase-2 reconstruction workflows Sep 3, 2024
@cmsbuild
Copy link
Contributor

cmsbuild commented Sep 3, 2024

-1

Failed Tests: ClangBuild
Size: This PR adds an extra 476KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-232398/41255/summary.html
COMMIT: acff6a7
CMSSW: CMSSW_14_2_X_2024-09-03-1100/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/45865/41255/install.sh to create a dev area with all the needed externals and cmssw changes.

Clang Build

I found compilation error while trying to compile with clang. Command used:

USER_CUDA_FLAGS='--expt-relaxed-constexpr' USER_CXXFLAGS='-Wno-register -fsyntax-only' scram build -k -j 32 COMPILER='llvm compile'

>> Entering Package RecoLocalCalo/HGCalRecProducers
>> Entering Package SimCalorimetry/HGCalAssociatorProducers
>> Entering Package SimDataFormats/Associations
>> Entering Package Validation/HGCalValidation
>> Compile sequence completed for CMSSW CMSSW_14_2_X_2024-09-03-1100
gmake: *** [There are compilation/build errors. Please see the detail log above.] Error 1
+ eval scram build outputlog '&&' '(python3' /data/cmsbld/jenkins/workspace/ib-run-pr-tests/cms-bot/buildLogAnalyzer.py --logDir /data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_14_2_X_2024-09-03-1100/tmp/el8_amd64_gcc12/cache/log/src '||' 'true)'
++ scram build outputlog
>> Entering Package CommonTools/RecoAlgos
Entering library rule at src/CommonTools/RecoAlgos/plugins
>> Compiling edm plugin src/CommonTools/RecoAlgos/plugins/BasicClusterMerger.cc


@felicepantaleo
Copy link
Contributor Author

You can find the full details of the performance comparison here:
https://docs.google.com/spreadsheets/d/1vdmDDJJ07tf9ekQ8EC24sMqqvtCtiU2Mh6FawsIRu24/edit?usp=sharing

I'll add this link in the initial message as well.

@cmsbuild
Copy link
Contributor

cmsbuild commented Sep 3, 2024

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-45865/41617

@mmusich
Copy link
Contributor

mmusich commented Oct 2, 2024

enable hlt_p2_timing

@mmusich
Copy link
Contributor

mmusich commented Oct 2, 2024

@cmsbuild, please test

@mmusich
Copy link
Contributor

mmusich commented Oct 2, 2024

Based on the already available HLT phase-2 timing results, there is a visible reduction:

Baseline This PR
Screenshot from 2024-10-02 15-52-31 Screenshot from 2024-10-02 15-52-57

from 5437.9 ms/evt (baseline) to 5388.1 ms/evt (this PR) i.e. a -0.915% reduction in timing.

@cmsbuild
Copy link
Contributor

cmsbuild commented Oct 2, 2024

+1

Size: This PR adds an extra 20KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-232398/41880/summary.html
COMMIT: db17892
CMSSW: CMSSW_14_2_X_2024-10-02-1100/el8_amd64_gcc12
Additional Tests: HLT_P2_TIMING
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/45865/41880/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • ROOTFileChecks: Some differences in event products or their sizes found
  • Reco comparison results: 248 differences found in the comparisons
  • DQMHistoTests: Total files compared: 45
  • DQMHistoTests: Total histograms compared: 3425773
  • DQMHistoTests: Total failures: 15015
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3410738
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 2745.032 KiB( 44 files compared)
  • DQMHistoSizes: changed ( 24834.911,... ): 734.076 KiB HGCAL/HGCalValidator
  • DQMHistoSizes: changed ( 29896.203 ): -1659.424 KiB HGCAL/HGCalValidator
  • Checked 197 log files, 167 edm output root files, 45 DQM output files
  • TriggerResults: no differences found

@felicepantaleo
Copy link
Contributor Author

book-keeping I'd like to test it once again enabling the hlt phase-2 timing flag once we have an IB that contains #46185.

@mmusich thanks. Do you think it is worth also nesting hlt_p2_timing into profiling, so that when running with profiling enabled we also get the hlt_p2_timing enabled?

@mmusich
Copy link
Contributor

mmusich commented Oct 3, 2024

Do you think it is worth also nesting hlt_p2_timing into profiling, so that when running with profiling enabled we also get the hlt_p2_timing enabled?

no strong opinion, if that's fine with @cms-sw/reconstruction-l2 it's fine by me. OTOH I would reserve the option of running just the phase-2 HLT menu timing without the rest of the reco profiling.

@jfernan2
Copy link
Contributor

jfernan2 commented Oct 3, 2024

Do you think it is worth also nesting hlt_p2_timing into profiling, so that when running with profiling enabled we also get the hlt_p2_timing enabled?

no strong opinion, if that's fine with @cms-sw/reconstruction-l2 it's fine by me. OTOH I would reserve the option of running just the phase-2 HLT menu timing without the rest of the reco profiling.

Honestly, I'd rather prefer to keep the two tests decoupled for the moment, we are refurbishing the profiling test at present due to igprof problem. Thanks

@felicepantaleo
Copy link
Contributor Author

@cms-sw/dqm-l2 let me know if you have any more comments.

@mmusich
Copy link
Contributor

mmusich commented Oct 9, 2024

pinging again @cms-sw/dqm-l2 @antoniovagnerini, please review / sign, thank you.

std::vector<dqm::reco::MonitorElement*> h_denom_caloparticle_eta[numberOfValidationTypes_];
std::vector<dqm::reco::MonitorElement*> h_denom_caloparticle_phi[numberOfValidationTypes_];
std::vector<dqm::reco::MonitorElement*> h_denom_caloparticle_en[numberOfValidationTypes_];
std::vector<dqm::reco::MonitorElement*> h_denom_caloparticle_pt[numberOfValidationTypes_];

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you confirm that the doubling in numberOfValidationTypes_ from 2 to 4 is behind the expected increase in DQM Histogram size by 2.7MB?
DQMHistoSizes: Histogram memory added: 2745.032 KiB( 44 files compared)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep

@antoniovagnerini
Copy link

+1

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @rappoccio, @antoniovilela, @mandrenguyen, @sextonkennedy (and backports should be raised in the release meeting by the corresponding L2)

@mandrenguyen
Copy link
Contributor

+1

@cmsbuild cmsbuild merged commit 0b675bf into cms-sw:master Oct 15, 2024
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants