HLT crash after V1.4.1 menu deploy - 'cudaErrorIllegalAddress'/alpaka_serial_sync::PFRecHitSoAProducerHCAL (run 383830) #45595
Comments
cms-bot internal usage |
A new Issue was created by @vince502. @Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign hlt, heterogeneous, reconstruction |
New categories assigned: hlt,heterogeneous,reconstruction @Martin-Grunewald,@mmusich,@fwyzard,@jfernan2,@makortel,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks |
Investigating from PF side |
@cms-sw/pf-l2 FYI |
type pf |
For the serial sync crash: printing out the detId of the offending hit, I see that it is 0. I will check with the method that produces the CUDA errors on GPU, but generally I would suspect the same thing if there is a rechit with detId = 0. |
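To make the suspected failure mode concrete, here is a small standalone sketch (the mapping, array size and sentinel value are made up for illustration; this is not the actual CMSSW code): a rechit with detId = 0 has no valid dense index, so feeding the result of detId2denseId() straight into the noiseThreshold lookup reads out of bounds, which shows up as cudaErrorIllegalAddress on the GPU backend and as undefined behaviour in serial_sync.

```cpp
// Sketch only: stand-ins for HCAL::detId2denseId() and topology.noiseThreshold().
// Sizes, the mapping and the sentinel value are illustrative, not the real CMSSW ones.
#include <cstdint>
#include <cstdio>
#include <vector>

constexpr uint32_t kInvalidDenseId = 0xffffffffu;  // sentinel for "no dense index"

uint32_t detId2denseId(uint32_t detId) {
  if (detId == 0)
    return kInvalidDenseId;  // detId = 0 does not correspond to any HCAL channel
  return detId % 1024;       // placeholder mapping for this sketch
}

int main() {
  std::vector<float> noiseThreshold(1024, 0.8f);  // per-channel thresholds

  const uint32_t detId = 0;  // the kind of corrupted rechit seen in this issue
  const uint32_t denseId = detId2denseId(detId);

  // An unprotected lookup, as in the original kernel, would use denseId directly:
  // with denseId == 0xffffffff that is an out-of-bounds read (illegal address on GPU).
  if (denseId != kInvalidDenseId) {
    std::printf("threshold = %f\n", noiseThreshold[denseId]);
  } else {
    std::printf("detId %u has no dense index; the hit must be skipped\n", detId);
  }
  return 0;
}
```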
@kakwok FYI |
also @abdoulline @cms-sw/hcal-dpg-l2 |
FWIW I confirm that with this:

diff --git a/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc b/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc
index 40d7b2315c8..cd7f215abf1 100644
--- a/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc
+++ b/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc
@@ -59,6 +59,12 @@ namespace ALPAKA_ACCELERATOR_NAMESPACE {
const uint32_t detId = rh.detId();
const uint32_t depth = HCAL::getDepth(detId);
const uint32_t subdet = getSubdet(detId);
+
+ if (detId == 0) {
+ printf("Rechit with detId %u has subdetector %u and depth %u ! \n", detId, subdet, depth);
+ return false;
+ }
+
if (topology.cutsFromDB()) {
threshold = topology.noiseThreshold()[HCAL::detId2denseId(detId)];
} else {

the reproducer runs to completion, both when forcing the backend to be serial [1] and when forcing it to GPU [2].

[1] CPU only:

#!/bin/bash -ex
# CMSSW_14_0_12_MULTIARCHS
hltGetConfiguration run:383830 \
--globaltag 140X_dataRun3_HLT_v3 \
--data \
--no-prescale \
--no-output \
--max-events -1 \
--input /store/group/tsg/FOG/error_stream_root/run383830/run383830_ls0083_index000316_fu-c2b01-26-01_pid4060272.root > hlt.py
cat <<@EOF >> hlt.py
try:
del process.MessageLogger
process.load('FWCore.MessageLogger.MessageLogger_cfi')
process.MessageLogger.cerr.enableStatistics = False
except:
pass
process.source.skipEvents = cms.untracked.uint32( 74 )
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
process.options.accelerators = ['cpu']
@EOF
cmsRun hlt.py &> hlt.log

[2] GPU only:

#!/bin/bash -ex
# CMSSW_14_0_12_MULTIARCHS
hltGetConfiguration run:383830 \
--globaltag 140X_dataRun3_HLT_v3 \
--data \
--no-prescale \
--no-output \
--max-events -1 \
--input /store/group/tsg/FOG/error_stream_root/run383830/run383830_ls0083_index000316_fu-c2b01-26-01_pid4060272.root > hlt2.py
cat <<@EOF >> hlt2.py
try:
del process.MessageLogger
process.load('FWCore.MessageLogger.MessageLogger_cfi')
process.MessageLogger.cerr.enableStatistics = False
except:
pass
process.source.skipEvents = cms.untracked.uint32( 74 )
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
process.options.accelerators = ['gpu-nvidia']
rmPaths = set()
for pathName in process.paths_():
if 'DQM' in pathName:
rmPaths.add(pathName)
elif 'CPUOnly' in pathName:
rmPaths.add(pathName)
for rmPath in rmPaths:
process.__delattr__(rmPath)
process.hltAlCaPFJet40GPUxorCPUFilter.triggerConditions=cms.vstring( 'AlCa_PFJet40_v31' )
@EOF
cmsRun hlt2.py &> hlt2.log
|
A slightly more elegant version is:

diff --git a/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc b/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc
index 40d7b2315c8..576342bc16a 100644
--- a/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc
+++ b/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc
@@ -59,8 +59,14 @@ namespace ALPAKA_ACCELERATOR_NAMESPACE {
const uint32_t detId = rh.detId();
const uint32_t depth = HCAL::getDepth(detId);
const uint32_t subdet = getSubdet(detId);
+
if (topology.cutsFromDB()) {
- threshold = topology.noiseThreshold()[HCAL::detId2denseId(detId)];
+ const auto& denseId = HCAL::detId2denseId(detId);
+ if (denseId != HCAL::kInvalidDenseId) {
+ threshold = topology.noiseThreshold()[denseId];
+ } else {
+ return false;
+ }
} else {
if (subdet == HcalBarrel) {
threshold = params.energyThresholds()[depth - 1];

Of course, it remains to be understood what the origin of such rechits associated to detId = 0 is.
vs here:
|
One question (maybe just for my own understanding): in the HBHE reconstruction, bad rechits are flagged (chi2 < 0). Should they (and if so, are they) also be skipped in the PFRecHit+Cluster reconstruction in Alpaka (which starts from the HBHE RecHits in SoA format)? |
If I understand correctly, yes; those rechits with chi2<0 will be skipped in the subsequent PF reconstruction. |
If those bad hits exist in the SoA that is passed to the PF RecHit producer, then I see no explicit skip over them (e.g. over rechits with chi2 < 0). |
As a quick workaround for the crashes, would it make sense to add back the conversion to legacy, and from legacy to SoA ? This should filter away the bad hits and prevent the crashes, and could be implemented as a configuration-only change, while a better fix is worked on. |
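For illustration only, a minimal sketch of why such a round trip filters the problematic hits (the RecHit struct and the chi2 < 0 / detId == 0 validity criteria are assumptions for the sketch, not the actual converter code): whatever checks the legacy conversion applies, the rebuilt SoA only contains hits that survived them, so the PFRecHit producer never sees the invalid entries.

```cpp
// Sketch: a round trip through a "legacy"-style container that drops hits
// flagged as bad, so the rebuilt SoA-like collection no longer contains
// entries that would crash the downstream PFRecHit step.
#include <cstdint>
#include <vector>

struct RecHit {
  uint32_t detId;
  float energy;
  float chi2;  // convention assumed here: chi2 < 0 marks a bad hit
};

std::vector<RecHit> soaToLegacy(const std::vector<RecHit>& soa) {
  std::vector<RecHit> legacy;
  for (const auto& rh : soa)
    if (rh.chi2 >= 0 && rh.detId != 0)  // validity checks applied during conversion
      legacy.push_back(rh);
  return legacy;
}

std::vector<RecHit> legacyToSoa(const std::vector<RecHit>& legacy) {
  return legacy;  // in this sketch the two formats are identical; only the filtering matters
}

int main() {
  // One valid hit (made-up detId) and one corrupted hit with detId = 0, chi2 < 0.
  const std::vector<RecHit> soaIn = {{0x46000000u, 2.1f, 1.3f}, {0u, 0.f, -1.f}};
  const auto filtered = legacyToSoa(soaToLegacy(soaIn));
  return filtered.size() == 1 ? 0 : 1;  // only the valid hit survives the round trip
}
```

The attraction of the workaround is exactly this: the filtering happens in existing, already-validated modules, so only the configuration has to change while the C++ fix is being prepared.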
Maybe the question is not to me, but I think it's a good idea (I would prefer to fix the release without touching the menu, but in this case this should give the correct results, and reduce pressure on different fronts...). It's implemented in the menu below, and the latter does not crash on the reproducer of this issue.
@cms-sw/hlt-l2, if you agree, I will follow up with FOG in a CMSHLT ticket (and I will ask you there to double-check the menu). |
Is it costless? |
I will check that in parallel. :) |
OK. For posterity here's the diff of the proposed menu w.r.t. the latest online one. |
diff --git a/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc b/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc
index 40d7b2315c8..39f1948f73c 100644
--- a/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc
+++ b/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc
@@ -59,8 +59,17 @@ namespace ALPAKA_ACCELERATOR_NAMESPACE {
const uint32_t detId = rh.detId();
const uint32_t depth = HCAL::getDepth(detId);
const uint32_t subdet = getSubdet(detId);
+
+ if (rh.chi2() < 0)
+ return false;
+
if (topology.cutsFromDB()) {

this also prevents the crash in the reproducers of #45595 (comment) |
I guess the config workaround is fine (for online/FOG) while the C++ fix should be used as soon as possible after. |
Thanks @mmusich. IMO these protections against bad channels and invalid detIds are worth having. |
As a side note, for the one event that I checked amongst those causing this crash, one gets the following warning when running the legacy HBHERecHit producer (using an older HLT menu).
I don't know if this is the same hit as the one leading to the crash, but, just for my understanding, could HCAL experts explain what this warning means?
@cms-sw/hcal-dpg-l2 |
Hi @missirol
|
I printed out the input digis to the SoA converter from here and the digi data does NOT seem corrupted for that channel
|
The printout shows the ADC/TDC pattern, which indeed seems to be OK. This protection in the legacy producer code above was introduced in 2021, when there was an occurrence of a bad SOI which crashed the Prompt reco... |
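As an illustration of the kind of protection being described (a generic sketch; the function name and the handling are illustrative and not the actual legacy-producer code): a digi whose sample of interest (SOI) falls outside the readout window is treated as corrupted, and the corresponding hit is dropped or flagged rather than reconstructed.

```cpp
// Generic sketch of a sample-of-interest (SOI) sanity check; names and handling
// are illustrative, not the real HBHE legacy-producer code.
#include <cstdio>
#include <initializer_list>

bool soiIsValid(int soi, int nSamples) {
  // A valid SOI must point inside the digi's readout window [0, nSamples).
  return soi >= 0 && soi < nSamples;
}

int main() {
  const int nSamples = 8;
  for (int soi : {3, -1, 12}) {
    if (!soiIsValid(soi, nSamples))
      std::printf("bad SOI %d: drop/flag the hit instead of reconstructing it\n", soi);
    else
      std::printf("SOI %d is within the readout window\n", soi);
  }
  return 0;
}
```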
Let me explicitly involve @mariadalfonso and @igv4321 in the discussion from HCAL side. |
ah, good catch! Indeed the |
Thanks for the explanations, @abdoulline . |
proposed fixes:
|
+hlt
|
@cms-sw/reconstruction-l2 @cms-sw/heterogeneous-l2 please consider signing this if there is no other follow up from your area, such that we could close this issue. |
I would like to avoid producing the invalid channels at all, but since this is now tracked in #45651, we can close this issue. |
+heterogeneous |
+1 |
This issue is fully signed and ready to be closed. |
@cmsbuild, please close |
During the recent pp physics fills 9945 and 9947 we deployed the new HLT menu /cdaq/physics/Run2024/2e34/v1.4.1/HLT/V2, and we started to see crashes in processes during the run (elog). The crashes come from an illegal memory access.
In particular, we quote the errors from run 383830 in this issue.
So far, on the offline side, crashes were observed using the same error stream files.

From f3mon, the crashes in the process show up as follows.

Setting to produce crashes (not necessarily the same exact problem) in the streamers:

In the hlt.py, I have tried several settings to reproduce the same CUDA memory access error; the following will result in a crash by
Module: alpaka_serial_sync::PFRecHitSoAProducerHCAL:hltParticleFlowRecHitHBHESoASerialSync (crashed)
Output : log_method1.txt
AND removed the following paths from the process.schedule = cms.Schedule( ... ) block in hlt.py,
and run
cmsRun hlt.py 2>&1 | tee logForce_method2.txt
Output : logForce_method2.txt