Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ECAL portable data formats, collections and conditions for alpaka #42930

Conversation

thomreis
Copy link
Contributor

@thomreis thomreis commented Oct 2, 2023

PR description:

This PR is the first in a planned series of three to migrate the ECAL local reconstruction for Run 3 and Phase 2 to alpaka.
It adds the conditions formats and portable collections as well as the data formats and portable collections.

Conditions:

  • Multifit conditions
  • Electronics mapping

SoA data formats:

  • Digis
  • Digis Phase 2
  • Uncalibrated RecHits

All developments of @valsdav , @Jakub-Gajownik , and @thomreis have been squashed into one commit for clarity.

Since this PR just adds collections for future use in the coming commits no changes are expected from this PR.

PRs 2 and 3 will be made to add the Phase 2 local reconstruction code and multifit algorithm migrations, respectively.

PR validation:

Passes GPU WFs 12434.512 and 24834.612

@thomreis
Copy link
Contributor Author

thomreis commented Oct 2, 2023

type ecal

@thomreis
Copy link
Contributor Author

thomreis commented Oct 2, 2023

enable gpu

@cmsbuild cmsbuild added the ecal label Oct 2, 2023
@cmsbuild
Copy link
Contributor

cmsbuild commented Oct 2, 2023

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-42930/37074

@cmsbuild
Copy link
Contributor

cmsbuild commented Oct 2, 2023

A new Pull Request was created by @thomreis (Thomas Reis) for master.

It involves the following packages:

  • CondFormats/DataRecord (db, alca)
  • CondFormats/EcalObjects (db, alca)
  • DataFormats/EcalDigi (simulation)
  • DataFormats/EcalRecHit (reconstruction)

@mandrenguyen, @civanch, @jfernan2, @cmsbuild, @mdhildreth, @francescobrivio, @saumyaphor4252, @perrotta, @consuegs can you please review it and eventually sign? Thanks.
@tocheng, @missirol, @mmusich, @seemasharmafnal, @rovere, @rchatter, @wang0jin, @thomreis, @argiro this is something you requested to watch as well.
@antoniovilela, @sextonkennedy, @rappoccio you are the release manager for this.

cms-bot commands are listed here

@thomreis
Copy link
Contributor Author

thomreis commented Oct 2, 2023

please test

@mmusich
Copy link
Contributor

mmusich commented Oct 2, 2023

assign heterogeneous

  • looks missing (might be wrong though)

@cmsbuild
Copy link
Contributor

cmsbuild commented Oct 2, 2023

New categories assigned: heterogeneous

@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

<use name="HeterogeneousCore/CUDACore"/>
<use name="HeterogeneousCore/CUDAUtilities"/>
<use name="boost"/>
<use name="boost_serialization"/>
<use name="rootmath"/>
<use name="clhep"/>
<use name="cuda"/>
<use name="HeterogeneousCore/AlpakaCore"/>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This dependence should not be needed in data format package.

<use name="HeterogeneousCore/CUDACore"/>
<use name="HeterogeneousCore/CUDAUtilities"/>
<use name="boost"/>
<use name="boost_serialization"/>
<use name="rootmath"/>
<use name="clhep"/>
<use name="cuda"/>
<use name="HeterogeneousCore/AlpakaCore"/>
<use name="HeterogeneousCore/AlpakaInterface"/>
<flags ALPAKA_BACKENDS="1"/>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be

Suggested change
<flags ALPAKA_BACKENDS="1"/>
<flags ALPAKA_BACKENDS="cuda rocm"/>

or (in a recent IB)

Suggested change
<flags ALPAKA_BACKENDS="1"/>
<flags ALPAKA_BACKENDS="!serial"/>

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rebased to yesterday's IB and implemented the second option.

<use name="FWCore/MessageLogger"/>
<use name="FWCore/Utilities"/>
<use name="HeterogeneousCore/AlpakaInterface"/>
<flags ALPAKA_BACKENDS="cuda rocm"/>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a recent IB could be also

Suggested change
<flags ALPAKA_BACKENDS="cuda rocm"/>
<flags ALPAKA_BACKENDS="!serial"/>


GENERATE_SOA_LAYOUT(EcalDigiPhase2SoALayout,
SOA_COLUMN(uint32_t, id),
SOA_COLUMN(DataArrayPhase2Struct, data),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sampleSize seems to be 16 (phase1) or 10 (phase2) So this column has 16-or-10 2-byte sub-elements per SoA-element (i.e. 32-or-20 bytes).

How is this data structure accessed by the consuming code? Does the consuming code use a GPU-thread per element of this SoA? If yes, then I'm wondering if it would be more performant to split this column into sampleSize arrays with Eigen vectors? In principle that should yield memory access pattern that is more coalesced than this one.

(as far as I'm concerned this optimization can be done later, it would anyway be better to study the impact on the performance rather than to rely on anything I wrote above on the top of my head)

(this pattern repeats in other SoAs as well, I'm not commenting on them in order to have one place to discuss about it)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be the other way around, in fact. 10 elements for Phase 1 and 16 for Phase 2.
At the moment one thread will use all 10 or 16 elements to calculate the weighted sum in the Phase 2 case.
I am not sure how the fitting in the multifit algorithm works but I think if also uses all samples to perform the fit in one thread.

targetClass="ecal::DigiHostCollection"
version="[1-]"
source="ecal::EcalDigiSoA layout_;"
target="buffer_"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be

Suggested change
target="buffer_"
target="buffer_,layout_,view_"

targetClass="ecal::DigiPhase2HostCollection"
version="[1-]"
source="ecal::EcalDigiPhase2SoA layout_;"
target="buffer_"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be

Suggested change
target="buffer_"
target="buffer_,layout_,view_"

targetClass="ecal::UncalibratedRecHitHostCollection"
version="[1-]"
source="ecal::EcalUncalibratedRecHitSoA layout_;"
target="buffer_"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be

Suggested change
target="buffer_"
target="buffer_,layout_,view_"

@cmsbuild
Copy link
Contributor

cmsbuild commented Oct 2, 2023

-1

Failed Tests: HeaderConsistency UnitTests RelVals-GPU
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a3482e/35011/summary.html
COMMIT: a9d393c
CMSSW: CMSSW_13_3_X_2023-10-02-1100/el8_amd64_gcc11
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/42930/35011/install.sh to create a dev area with all the needed externals and cmssw changes.

Unit Tests

I found 1 errors in the following unit tests:

---> test CondToolsLHCInfoNewPopConTest had ERRORS

RelVals-GPU

  • 12434.58712434.587_TTbar_14TeV+2023_Patatrack_AllTripletsGPU_Validation/step2_TTbar_14TeV+2023_Patatrack_AllTripletsGPU_Validation.log
  • 12434.58612434.586_TTbar_14TeV+2023_Patatrack_AllTripletsGPU/step2_TTbar_14TeV+2023_Patatrack_AllTripletsGPU.log

Comparison Summary

Summary:

  • You potentially added 6 lines to the logs
  • Reco comparison results: 8 differences found in the comparisons
  • DQMHistoTests: Total files compared: 50
  • DQMHistoTests: Total histograms compared: 3358320
  • DQMHistoTests: Total failures: 6
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3358292
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 49 files compared)
  • Checked 214 log files, 167 edm output root files, 50 DQM output files
  • TriggerResults: no differences found

@makortel
Copy link
Contributor

makortel commented Oct 2, 2023

RelVals-GPU

  • 12434.58712434.587_TTbar_14TeV+2023_Patatrack_AllTripletsGPU_Validation/step2_TTbar_14TeV+2023_Patatrack_AllTripletsGPU_Validation.log
  • 12434.58612434.586_TTbar_14TeV+2023_Patatrack_AllTripletsGPU/step2_TTbar_14TeV+2023_Patatrack_AllTripletsGPU.log

These failure should be fixed in the next IB that includes #42916

@thomreis thomreis force-pushed the ecal-reco-gpu-alpaka-migration-integration-133x branch from a9d393c to 5756e0f Compare October 3, 2023 09:23
@cmsbuild
Copy link
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a3482e/35386/summary.html
COMMIT: 438a156
CMSSW: CMSSW_13_3_X_2023-10-24-1100/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/42930/35386/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially removed 1 lines from the logs
  • Reco comparison results: 26 differences found in the comparisons
  • DQMHistoTests: Total files compared: 50
  • DQMHistoTests: Total histograms compared: 3357400
  • DQMHistoTests: Total failures: 1070
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3356308
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 49 files compared)
  • Checked 214 log files, 167 edm output root files, 50 DQM output files
  • TriggerResults: no differences found

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 1 differences found in the comparisons
  • DQMHistoTests: Total files compared: 3
  • DQMHistoTests: Total histograms compared: 39740
  • DQMHistoTests: Total failures: 420
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 39320
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 2 files compared)
  • Checked 8 log files, 10 edm output root files, 3 DQM output files
  • TriggerResults: no differences found

@mmusich
Copy link
Contributor

mmusich commented Oct 24, 2023

@perrotta

As you point out, persisted structures with fixed sizes may give problems in schema evolution if we decide at some point to modify that size. I agree that transient ones can be treated with more liberty. Nonetheless, my point is also that if we hard-code the size parameter of one structure which is defined in another class and package, then we can get problems nonetheless when we try to modify it in one place and need to propagate that change elsewhere. If the fixed size is defined centrally in the class that introduces it, instead, it can be picked everywhere else, and it has to get modified only once in case of need.

I was answering to the other Andrea and not to your review.

@thomreis
Copy link
Contributor Author

These are anyway not persisted, right? IIRC fixed-size containers persisted in the database cannot be schema-evolved (as of today) which can be problematic if this has to be changed down the line, but transient structures in memory can be whatever is most convenient.

As you point out, persisted structures with fixed sizes may give problems in schema evolution if we decide at some point to modify that size. I agree that transient ones can be treated with more liberty. Nonetheless, my point is also that if we hard-code the size parameter of one structure which is defined in another class and package, then we can get problems nonetheless when we try to modify it in one place and need to propagate that change elsewhere. If the fixed size is defined centrally in the class that introduces it, instead, it can be picked everywhere else, and it has to get modified only once in case of need.

Is this not the case here? The array sizes are defined in the conditions format SoA definitions and will be used in the other packages to do operations with the conditions. For example in the ESProducer inside the RecoLocalCalo/EcalRecProducers package to initialte a memcpy: https://github.com/thomreis/cmssw/blob/ecal-reco-gpu-alpaka-migration-multifit-133x/RecoLocalCalo/EcalRecProducers/plugins/alpaka/EcalMultifitConditionsHostESProducer.cc#L142
Or is there a better place where the array sizes could be defined?

@fwyzard
Copy link
Contributor

fwyzard commented Oct 25, 2023

+heterogeneous

@perrotta
Copy link
Contributor

+1

@thomreis
Copy link
Contributor Author

Hi @cms-sw/simulation-l2 @cms-sw/reconstruction-l2 do you have any more comments on this PR?

@civanch
Copy link
Contributor

civanch commented Oct 26, 2023

+1

@thomreis
Copy link
Contributor Author

Hi @cms-sw/reconstruction-l2 do you have any further comments?

@mandrenguyen
Copy link
Contributor

+1

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @sextonkennedy, @antoniovilela, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

@antoniovilela
Copy link
Contributor

+1

@cmsbuild cmsbuild merged commit d746f22 into cms-sw:master Nov 1, 2023
@thomreis thomreis deleted the ecal-reco-gpu-alpaka-migration-integration-133x branch February 2, 2024 14:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants