
HLT Farm crashes (PFRecHitSoAProducerHCAL@alpaka) when HCAL is out #44668

Closed
silviodonato opened this issue Apr 9, 2024 · 23 comments

@silviodonato
Contributor

silviodonato commented Apr 9, 2024

The HLT farm reported many errors in run 379174 after HCAL was removed from the global run.

The error is:

An exception of category 'StdException' occurred while
[0] Processing Event run: 379174 lumi: 1 event: 4626 stream: 12
[1] Running path 'DST_PFScouting_DatasetMuon_v1'
[2] Calling method for module PFRecHitSoAProducerHCAL@alpaka/'hltParticleFlowRecHitHBHESoA'
Exception Message:
A std::exception was thrown.
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_4-el8_amd64_gcc12/build/CMSSW_14_0_4-build/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/kernel/TaskKernelGpuUniformCudaHipRt.hpp(259) 'TApi::setDevice(queue.m_spQueueImpl->m_dev.getNativeHandle())' A previous API call (not this one) set the error : 'cudaErrorInvalidConfiguration': 'invalid configuration argument'!

I will add the recipes to reproduce this error as soon as the data from the run without HCAL is available.

In 379178 HCAL was added back and everything worked fine.

http://cmsonline.cern.ch/cms-elog/1209406

@cms-sw/hlt-l2 @cms-sw/heterogeneous-l2

@cmsbuild
Contributor

cmsbuild commented Apr 9, 2024

cms-bot internal usage

@cmsbuild
Contributor

cmsbuild commented Apr 9, 2024

A new Issue was created by @silviodonato.

@makortel, @rappoccio, @sextonkennedy, @Dr15Jones, @smuzaffar, @antoniovilela can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@mmusich
Contributor

mmusich commented Apr 9, 2024

assign hlt, heterogeneous

@cms-sw/pf-l2 FYI

@cmsbuild
Contributor

cmsbuild commented Apr 9, 2024

New categories assigned: hlt,heterogeneous

@Martin-Grunewald,@mmusich,@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@mzarucki
Contributor

mzarucki commented Apr 9, 2024

Not directly related to this particular issue with HCAL, but based on earlier cases in Run 3, I just wanted to add that this should also be checked for ECAL and Pixel (not sure whether this should prompt a separate ticket or a tag of the corresponding DPG contacts). From our (FOG) side, we could check whether there are recent runs available with either detector out.

@mmusich
Contributor

mmusich commented Apr 9, 2024

For the record, this can also be reproduced offline by preparing some 2024 RAW data without HCAL data in it, using this configuration snippet:

import FWCore.ParameterSet.Config as cms

process = cms.Process('TEST')

import FWCore.ParameterSet.VarParsing as VarParsing
options = VarParsing.VarParsing('analysis')
options.setDefault('inputFiles', [
    'root://eoscms.cern.ch//eos/cms/store/data/Run2024B/EphemeralHLTPhysics0/RAW/v1/000/379/075/00000/44f5f661-b536-49d9-b455-8e31371b2d86.root'
])
options.setDefault('maxEvents', 100)
options.parseArguments()

# set max number of input events
process.maxEvents.input = options.maxEvents

# initialize MessageLogger and output report
process.options.wantSummary = False
process.load('FWCore.MessageService.MessageLogger_cfi')
process.MessageLogger.cerr.FwkReport.reportEvery = 100 # only report every 100th event start
process.MessageLogger.cerr.enableStatistics = False # disable the end-of-job "MessageLogger Summary" message
process.MessageLogger.cerr.threshold = 'INFO' # change to 'WARNING' not to show INFO-level messages
## enable reporting of INFO-level messages (default is limit=0, i.e. no messages reported)
#process.MessageLogger.cerr.INFO = cms.untracked.PSet(
#    reportEvery = cms.untracked.int32(1), # every event!
#    limit = cms.untracked.int32(-1)       # no limit!
#)

###
### Source (input file)
###
process.source = cms.Source('PoolSource',
    fileNames = cms.untracked.vstring(options.inputFiles)
)
print('process.source.fileNames =', process.source.fileNames)

###
### Path (FEDRAWData producers)
###
_HCALFEDs = list(range(1100, 1199))

from EventFilter.Utilities.EvFFEDExcluder_cfi import EvFFEDExcluder as _EvFFEDExcluder
process.rawDataNOHCAL = _EvFFEDExcluder.clone(
    src = 'rawDataCollector',
    fedsToExclude = _HCALFEDs,
)

process.rawDataSelectionPath = cms.Path(
    process.rawDataNOHCAL
)

###
### EndPath (output file)
###
process.rawDataOutputModule = cms.OutputModule('PoolOutputModule',
    fileName = cms.untracked.string('file:tmp.root'),
    outputCommands = cms.untracked.vstring(
        'drop *',
        'keep FEDRawDataCollection_rawDataNOHCAL_*_*',
        'keep edmTriggerResults_*_*_*',
        'keep triggerTriggerEvent_*_*_*',
    )
)

process.outputEndPath = cms.EndPath( process.rawDataOutputModule )

and then running:

hltGetConfiguration /dev/CMSSW_14_0_0/GRun/V92 --globaltag 140X_dataRun3_HLT_v3 --data --unprescale --output minimal --max-events 100 --eras Run3 --input file:tmp.root > hltDataNoHCAL.py                                                       
sed -i 's/rawDataCollector/rawDataNOHCAL/g' hltDataNoHCAL.py
cmsRun hltDataNoHCAL.py >& hlt.log &

This (even on CPU) results in a crash, with log attached: hlt.log

@mmusich
Contributor

mmusich commented Apr 9, 2024

> Not directly related to this particular issue with HCAL, but based on earlier cases in Run 3, I just wanted to add the comment that this should also be checked for ECAL and Pixel (not sure if this should prompt a different ticket or a tag of corresponding DPG contacts). From our (FOG) side, we could check if there are recent runs available with either detectors out.

@mzarucki
as discussed elsewhere, applying the same recipe as above but adjusting the FED selection to exclude either Pixel or ECAL:

_siPixelFEDs = list(range(1200, 1349))
_ECALFEDs = list(range(600, 670))

one can also produce data without those FEDs. Running the same test as above doesn't produce a crash.

@mmusich
Contributor

mmusich commented Apr 9, 2024

type pf

@mmusich
Contributor

mmusich commented Apr 9, 2024

@jsamudio FYI

@cmsbuild cmsbuild added the pf label Apr 9, 2024
@jsamudio
Contributor

jsamudio commented Apr 9, 2024

I am taking a look

@jsamudio
Contributor

jsamudio commented Apr 9, 2024

I think the issue comes from using the rechit count to specify the number of blocks launched in the alpaka kernels. I am currently trying to avoid kernel launches in the case where there are 0 HCAL rechits (no HCAL).

@jsamudio
Contributor

jsamudio commented Apr 9, 2024

> I think the issue comes from using the rechit number to specify block launches in the alpaka kernels. I am currently trying to avoid kernel launches in the case where there are 0 hcal rechits (no HCAL).

Just have to make the fix a bit more elegant and I will get a branch together for further testing.

@jsamudio
Contributor

jsamudio commented Apr 9, 2024

Testing still, but here is the branch for others: https://github.com/jsamudio/cmssw/tree/dev_HCALoutFix based on CMSSW_14_0_4

@hatakeyamak
Contributor

Perhaps obvious, but
CMSSW_14_0_X...jsamudio:cmssw:dev_HCALoutFix
to highlight changes...

@fwyzard
Contributor

fwyzard commented Apr 10, 2024

👍🏻

@mmusich
Contributor

mmusich commented Apr 10, 2024

FWIW, I tested that jsamudio@8f5cee3 prevents the segmentation violation in PFClusterSoAProducer@alpaka:hltParticleFlowClusterHBHESoA using the "offline reproducer" recipe at #44668 (comment). However, I am now getting exceptions of the type:

----- Begin Fatal Exception 10-Apr-2024 08:54:38 CEST-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 379075 lumi: 340 event: 79432557 stream: 2
   [1] Running path 'AlCa_PFJet40_CPUOnly_v6'
   [2] Calling method for module LegacyPFRecHitProducer/'hltParticleFlowRecHitHBHECPUOnly'
Exception Message:
A std::exception was thrown.
unordered_map::at
----- End Fatal Exception -------------------------------------------------

@jsamudio
Contributor

> FWIW, I tested that jsamudio@8f5cee3 prevents the segmentation violation in PFClusterSoAProducer@alpaka:hltParticleFlowClusterHBHESoA using the "offline reproducer" recipe at #44668 (comment) though I am now getting exceptions of the type:
>
> ----- Begin Fatal Exception 10-Apr-2024 08:54:38 CEST-----------------------
> An exception of category 'StdException' occurred while
>    [0] Processing  Event run: 379075 lumi: 340 event: 79432557 stream: 2
>    [1] Running path 'AlCa_PFJet40_CPUOnly_v6'
>    [2] Calling method for module LegacyPFRecHitProducer/'hltParticleFlowRecHitHBHECPUOnly'
> Exception Message:
> A std::exception was thrown.
> unordered_map::at
> ----- End Fatal Exception -------------------------------------------------

@mmusich I may have missed something when cleaning up the commit. I will check on it again as I thought I had treated such exceptions.

@mmusich
Contributor

mmusich commented Apr 10, 2024

@jsamudio

> I may have missed something when cleaning up the commit. I will check on it again as I thought I had treated such exceptions.

Apologies, my bad: apparently I didn't fetch the whole commit. Retrying from scratch, I don't observe crashes anymore.
Please open a PR to master first and an immediate backport, so that we can have a better review (and possibly include the fix in the next release for online).

@mmusich
Contributor

mmusich commented Apr 10, 2024

@mmusich
Contributor

mmusich commented Apr 17, 2024

+hlt

@makortel
Contributor

+heterogeneous

@makortel
Contributor

@cmsbuild, please close

@cmsbuild
Contributor

This issue is fully signed and ready to be closed.
