HLT Farm crashes (PFRecHitSoAProducerHCAL@alpaka) when HCAL is out #44668
A new Issue was created by @silviodonato. @makortel, @rappoccio, @sextonkennedy, @Dr15Jones, @smuzaffar, @antoniovilela can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign hlt, heterogeneous @cms-sw/pf-l2 FYI |
New categories assigned: hlt,heterogeneous @Martin-Grunewald,@mmusich,@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks |
Not directly related to this particular issue with HCAL, but based on earlier cases in Run 3, I just wanted to add that this should also be checked for ECAL and Pixel (not sure whether this should prompt a different ticket or a tag of the corresponding DPG contacts). From our (FOG) side, we could check whether there are recent runs available with either detector out. |
For the record, this can be reproduced also offline by means of preparing some 2024 RAW data without HCAL data in it, using this configuration snippet:
import FWCore.ParameterSet.Config as cms
process = cms.Process('TEST')
import FWCore.ParameterSet.VarParsing as VarParsing
options = VarParsing.VarParsing('analysis')
options.setDefault('inputFiles', [
    'root://eoscms.cern.ch//eos/cms/store/data/Run2024B/EphemeralHLTPhysics0/RAW/v1/000/379/075/00000/44f5f661-b536-49d9-b455-8e31371b2d86.root'
])
options.setDefault('maxEvents', 100)
options.parseArguments()
# set max number of input events
process.maxEvents.input = options.maxEvents
# initialize MessageLogger and output report
process.options.wantSummary = False
process.load('FWCore.MessageService.MessageLogger_cfi')
process.MessageLogger.cerr.FwkReport.reportEvery = 100 # only report every 100th event start
process.MessageLogger.cerr.enableStatistics = False # enable "MessageLogger Summary" message
process.MessageLogger.cerr.threshold = 'INFO' # change to 'WARNING' not to show INFO-level messages
## enable reporting of INFO-level messages (default is limit=0, i.e. no messages reported)
#process.MessageLogger.cerr.INFO = cms.untracked.PSet(
# reportEvery = cms.untracked.int32(1), # every event!
# limit = cms.untracked.int32(-1) # no limit!
#)
###
### Source (input file)
###
process.source = cms.Source('PoolSource',
    fileNames = cms.untracked.vstring(options.inputFiles)
)
print('process.source.fileNames =', process.source.fileNames)
###
### Path (FEDRAWData producers)
###
_HCALFEDs = [foo for foo in range(1100, 1199)]
from EventFilter.Utilities.EvFFEDExcluder_cfi import EvFFEDExcluder as _EvFFEDExcluder
process.rawDataNOHCAL = _EvFFEDExcluder.clone(
    src = 'rawDataCollector',
    fedsToExclude = _HCALFEDs,
)
process.rawDataSelectionPath = cms.Path(
    process.rawDataNOHCAL
)
###
### EndPath (output file)
###
process.rawDataOutputModule = cms.OutputModule('PoolOutputModule',
    fileName = cms.untracked.string('file:tmp.root'),
    outputCommands = cms.untracked.vstring(
        'drop *',
        'keep FEDRawDataCollection_rawDataNOHCAL_*_*',
        'keep edmTriggerResults_*_*_*',
        'keep triggerTriggerEvent_*_*_*',
    )
)
process.outputEndPath = cms.EndPath( process.rawDataOutputModule )
and then running:
hltGetConfiguration /dev/CMSSW_14_0_0/GRun/V92 --globaltag 140X_dataRun3_HLT_v3 --data --unprescale --output minimal --max-events 100 --eras Run3 --input file:tmp.root > hltDataNoHCAL.py
sed -i 's/rawDataCollector/rawDataNOHCAL/g' hltDataNoHCAL.py
cmsRun hltDataNoHCAL.py >& hlt.log &
This (even on CPU) results in a crash, with log attached: hlt.log |
@mzarucki using
_siPixelFEDs = [foo for foo in range(1200, 1349)]
_ECALFEDs = [foo for foo in range(600, 670)]
one can also produce data without those FEDs. Running the same test as above doesn't produce a crash. |
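To repeat the same offline recipe for each subdetector, the FED ranges quoted in this thread could be kept in one place. What follows is a minimal sketch in plain Python, not part of CMSSW: `FED_RANGES` and `feds_to_exclude` are hypothetical names, and the ranges are copied verbatim from the snippets above (note that Python's `range` excludes its upper bound).

```python
# Hypothetical helper, not part of CMSSW.
# FED-ID ranges copied verbatim from the snippets in this thread;
# range() excludes the upper bound, exactly as in the originals.
FED_RANGES = {
    'HCAL': range(1100, 1199),
    'Pixel': range(1200, 1349),
    'ECAL': range(600, 670),
}


def feds_to_exclude(*detectors):
    """Return the sorted list of FED IDs belonging to the named detectors."""
    feds = set()
    for det in detectors:
        if det not in FED_RANGES:
            raise KeyError(f'unknown detector: {det!r}')
        feds.update(FED_RANGES[det])
    return sorted(feds)


if __name__ == '__main__':
    # Print how many FED IDs would be dropped for each single-detector test.
    for name in FED_RANGES:
        print(name, len(feds_to_exclude(name)))
```

The returned list could then be passed as `fedsToExclude` to the `EvFFEDExcluder` clone in the configuration above, one detector at a time.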
type pf |
@jsamudio FYI |
I am taking a look |
I think the issue comes from using the rechit number to specify block launches in the alpaka kernels. I am currently trying to avoid kernel launches in the case where there are 0 hcal rechits (no HCAL). |
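To illustrate the idea described in the comment above: if the number of blocks in a launch is derived from the number of rechits, an empty input yields a grid of zero blocks, and skipping the launch avoids that case entirely. This is a sketch in plain Python, not alpaka or CMSSW code; `blocks_for`, `launch`, and `process_rechits` are hypothetical names, the launch is simulated with nested loops, and the rejection of an empty grid is an assumption standing in for back-end behaviour.

```python
# Illustrative sketch only: plain Python, not alpaka/CMSSW code.
# All names are hypothetical.

def blocks_for(n_items, threads_per_block):
    """Number of blocks needed to cover n_items (ceiling division)."""
    return (n_items + threads_per_block - 1) // threads_per_block


def launch(n_blocks, threads_per_block, kernel, data):
    """Stand-in for a kernel launch; rejects an empty grid (assumption)."""
    if n_blocks <= 0:
        raise ValueError('invalid launch configuration: zero blocks')
    for block in range(n_blocks):
        for thread in range(threads_per_block):
            idx = block * threads_per_block + thread
            if idx < len(data):  # bounds check, as a real kernel would do
                kernel(data, idx)


def process_rechits(energies, threads_per_block=4):
    """Double each energy in place, skipping the launch for empty input."""
    if not energies:  # guard: zero rechits (e.g. HCAL out), nothing to launch
        return energies

    def kernel(data, idx):
        data[idx] *= 2.0

    n_blocks = blocks_for(len(energies), threads_per_block)
    launch(n_blocks, threads_per_block, kernel, energies)
    return energies
```

Without the early return in `process_rechits`, an empty list would reach `launch` with `n_blocks == 0` and fail; with the guard, the empty case is a no-op.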
Just have to make the fix a bit more elegant and I will get a branch together for further testing. |
Testing still, but here is the branch for others: https://github.com/jsamudio/cmssw/tree/dev_HCALoutFix based on |
Perhaps obvious, but |
👍🏻 |
FWIW, I tested that jsamudio@8f5cee3 prevents the segmentation violation in
|
@mmusich I may have missed something when cleaning up the commit. I will check on it again as I thought I had treated such exceptions. |
Apologies, my bad: apparently I didn't fetch the whole commit. Retrying from scratch, I don't observe crashes anymore. |
proposed fixes: |
+hlt
|
+heterogeneous |
@cmsbuild, please close |
This issue is fully signed and ready to be closed. |
The HLT farm got a lot of errors in run 379174, since HCAL was removed from the global run.
The error is:
I will add the recipes to reproduce this error as soon as the data from the run without HCAL is available.
In run 379178 HCAL was added back and everything worked fine.
http://cmsonline.cern.ch/cms-elog/1209406
@cms-sw/hlt-l2 @cms-sw/heterogeneous-l2