
HLT crash after V1.4.1 menu deploy - 'cudaErrorIllegalAddress'/alpaka_serial_sync::PFRecHitSoAProducerHCAL (run 383830) #45595

Closed
vince502 opened this issue Jul 30, 2024 · 43 comments

Comments

@vince502

During the recent pp physics fills 9945 and 9947 we deployed the new HLT menu /cdaq/physics/Run2024/2e34/v1.4.1/HLT/V2, and we started to see crashes in HLT processes during the run (elog). The crashes come from an illegal memory access.

In particular, we quote the errors from run 383830 in this issue.

So far, the crashes have also been reproduced on the offline side using the same error-stream files.

From f3mon, the crashes show up as:

An exception of category 'StdException' occurred while
[0] Processing Event run: 383830 lumi: 83 event: 113500824 stream: 22
[1] Running path 'DST_PFScouting_DoubleMuon_v5'
[2] Calling method for module HcalDigisSoAProducer@alpaka/'hltHcalDigisSoA'
Exception Message:
A std::exception was thrown.
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_12_MULTIARCHS-el8_amd64_gcc12/build/CMSSW_14_0_12_MULTIARCHS-build/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/event/EventUniformCudaHipRt.hpp(143) 'ret = TApi::eventQuery(event.getNativeHandle())' returned error : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!

Setup to reproduce the crashes (not necessarily the exact same problem) from the streamer files:

# I first copied files from /eos/cms/store/group/tsg/FOG/error_stream_root/run383830/
# using gpu-c2a02-39-04
cmsrel CMSSW_14_0_12_MULTIARCHS
cd CMSSW_14_0_12_MULTIARCHS/src/
cmsenv
mkdir crashAlpakaFiles
scp lxplus.cern.ch:/eos/cms/store/group/tsg/FOG/error_stream_root/run383830/run383830_ls0083_index000316_fu-c2b01-26-01_pid4060272.root crashAlpakaFiles

hltGetConfiguration run:383830 \
--globaltag 140X_dataRun3_HLT_v3 \
--data \
--no-prescale \
--no-output \
--max-events -1 \
--input \
'file:crashAlpakaFiles/run383830_ls0083_index000316_fu-c2b01-26-01_pid4060272.root' \
> hlt.py

cat <<@EOF >> hlt.py
process.options.wantSummary = True
process.options.numberOfThreads = 1 # 8 for method 2 below
process.options.numberOfStreams = 0

@EOF

In hlt.py I have tried several settings to reproduce the same CUDA memory-access error:

1. Plain run
cmsRun hlt.py 2>&1 | tee log_method1.txt

will result in a crash in:
Module: alpaka_serial_sync::PFRecHitSoAProducerHCAL:hltParticleFlowRecHitHBHESoASerialSync (crashed)
Output: log_method1.txt

2. Force-enable the CUDA modules
echo "process.options.accelerators = ['gpu-nvidia']" >> hlt.py 

AND remove the following paths from the process.schedule = cms.Schedule( ... ) block in hlt.py:

process.DQM_PixelReconstruction_v11, 
process.DQM_EcalReconstruction_v11, 
process.DQM_HcalReconstruction_v9,
process.Dataset_DQMGPUvsCPU

then run cmsRun hlt.py 2>&1 | tee logForce_method2.txt

Output: logForce_method2.txt

@cmsbuild (Contributor) commented Jul 30, 2024

cms-bot internal usage

@cmsbuild (Contributor)

A new Issue was created by @vince502.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

vince502 changed the title: HLT crash after V1.4.1 menu deploy - alpaka_serial_sync::PFRecHitSoAProducerHCAL (run 383830) → HLT crash after V1.4.1 menu deploy - 'cudaErrorIllegalAddress'/alpaka_serial_sync::PFRecHitSoAProducerHCAL (run 383830) (Jul 30, 2024)
@mmusich (Contributor) commented Jul 30, 2024

assign hlt, heterogeneous, reconstruction

@cmsbuild (Contributor)

New categories assigned: hlt,heterogeneous,reconstruction

@Martin-Grunewald, @mmusich, @fwyzard, @jfernan2, @makortel, @mandrenguyen you have been requested to review this Pull request/Issue and eventually sign. Thanks.

@jsamudio (Contributor)

Investigating from PF side

@mmusich (Contributor) commented Jul 30, 2024

@cms-sw/pf-l2 FYI

@mmusich (Contributor) commented Jul 30, 2024

type pf

cmsbuild added the pf label Jul 30, 2024
@jsamudio (Contributor)

For the serial sync crash, with gdb I see:

Thread 1 "cmsRun" received signal SIGSEGV, Segmentation fault.
0x00007fff3463d43a in alpaka_serial_sync::PFRecHitProducerKernelConstruct<alpaka_serial_sync::particleFlowRecHitProducer::HCAL>::applyCuts (rh=..., params=..., topology=...)
    at src/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc:63
63            threshold = topology.noiseThreshold()[HCAL::detId2denseId(detId)];

Printing out the detId of the hit, I see that detId == 0, meaning denseId == HCAL::kInvalidDenseId (which is std::numeric_limits<uint32_t>::max()), and indexing noiseThreshold() with that value causes the segfault here.

I will check with the method that produces the CUDA errors on GPU, but generally I would suspect the same thing if there is a rechit with detId == 0.
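To make the failure mode concrete, here is a minimal standalone sketch (not the CMSSW code; the toy mapping and table size are illustrative assumptions) of what happens when a rechit with detId == 0 reaches the threshold lookup:

#include <cstdint>
#include <cstdio>
#include <limits>
#include <vector>

constexpr uint32_t kInvalidDenseId = std::numeric_limits<uint32_t>::max();

// Stand-in for HCAL::detId2denseId(): detId == 0 is not a valid HCAL cell,
// so it maps to the invalid sentinel.
uint32_t detId2denseId(uint32_t detId) {
  return detId == 0 ? kInvalidDenseId : detId % 1024;  // toy mapping
}

int main() {
  std::vector<float> noiseThreshold(1024, 0.8f);  // toy threshold table
  const uint32_t detId = 0;                       // the problematic rechit
  const uint32_t denseId = detId2denseId(detId);
  if (denseId == kInvalidDenseId) {
    std::printf("invalid denseId for detId %u, skipping hit\n", detId);
    return 0;  // guarded: what the fixes discussed below do instead of indexing
  }
  // Without the guard, this indexes ~4e9 elements past the end of the table:
  // a segfault on CPU, cudaErrorIllegalAddress on GPU.
  std::printf("threshold = %f\n", noiseThreshold[denseId]);
}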

@mmusich (Contributor) commented Jul 30, 2024

@kakwok FYI

@mmusich (Contributor) commented Jul 30, 2024

also @abdoulline @cms-sw/hcal-dpg-l2

@mmusich (Contributor) commented Jul 30, 2024

I will check with the method that produces the CUDA errors on GPU, but generally I would suspect the same thing if there is a rechit with detId == 0.

FWIW I confirm that with this:

diff --git a/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc b/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc
index 40d7b2315c8..cd7f215abf1 100644
--- a/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc
+++ b/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc
@@ -59,6 +59,12 @@ namespace ALPAKA_ACCELERATOR_NAMESPACE {
     const uint32_t detId = rh.detId();
     const uint32_t depth = HCAL::getDepth(detId);
     const uint32_t subdet = getSubdet(detId);
+
+    if (detId == 0) {
+      printf("Rechit with detId %u has subdetector %u and depth %u ! \n", detId, subdet, depth);
+      return false;
+    }
+
     if (topology.cutsFromDB()) {
       threshold = topology.noiseThreshold()[HCAL::detId2denseId(detId)];
     } else {

the reproducer runs to completion, whether forcing the backend to serial [1] or GPU [2]:

[1] CPU only
#!/bin/bash -ex

# CMSSW_14_0_12_MULTIARCHS

hltGetConfiguration run:383830 \
  --globaltag 140X_dataRun3_HLT_v3 \
  --data \
  --no-prescale \
  --no-output \
  --max-events -1 \
  --input /store/group/tsg/FOG/error_stream_root/run383830/run383830_ls0083_index000316_fu-c2b01-26-01_pid4060272.root > hlt.py

cat <<@EOF >> hlt.py
try:
  del process.MessageLogger
  process.load('FWCore.MessageLogger.MessageLogger_cfi')
  process.MessageLogger.cerr.enableStatistics = False
except:
  pass

process.source.skipEvents = cms.untracked.uint32( 74 )
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
process.options.accelerators = ['cpu']
@EOF

cmsRun hlt.py &> hlt.log
[2] GPU only
#!/bin/bash -ex

# CMSSW_14_0_12_MULTIARCHS

hltGetConfiguration run:383830 \
  --globaltag 140X_dataRun3_HLT_v3 \
  --data \
  --no-prescale \
  --no-output \
  --max-events -1 \
  --input /store/group/tsg/FOG/error_stream_root/run383830/run383830_ls0083_index000316_fu-c2b01-26-01_pid4060272.root > hlt2.py

cat <<@EOF >> hlt2.py
try:
  del process.MessageLogger
  process.load('FWCore.MessageLogger.MessageLogger_cfi')
  process.MessageLogger.cerr.enableStatistics = False
except:
  pass

process.source.skipEvents = cms.untracked.uint32( 74 )
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
process.options.accelerators = ['gpu-nvidia']

rmPaths = set()
for pathName in process.paths_():
  if 'DQM' in pathName:
    rmPaths.add(pathName)
  elif 'CPUOnly' in pathName:
    rmPaths.add(pathName)

for rmPath in rmPaths:
  process.__delattr__(rmPath)

process.hltAlCaPFJet40GPUxorCPUFilter.triggerConditions=cms.vstring( 'AlCa_PFJet40_v31' )  
@EOF

cmsRun hlt2.py &> hlt2.log

@mmusich (Contributor) commented Jul 30, 2024

A slightly more elegant version is:

diff --git a/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc b/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc
index 40d7b2315c8..576342bc16a 100644
--- a/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc
+++ b/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc
@@ -59,8 +59,14 @@ namespace ALPAKA_ACCELERATOR_NAMESPACE {
     const uint32_t detId = rh.detId();
     const uint32_t depth = HCAL::getDepth(detId);
     const uint32_t subdet = getSubdet(detId);
+
     if (topology.cutsFromDB()) {
-      threshold = topology.noiseThreshold()[HCAL::detId2denseId(detId)];
+      const auto& denseId = HCAL::detId2denseId(detId);
+      if (denseId != HCAL::kInvalidDenseId) {
+        threshold = topology.noiseThreshold()[denseId];
+      } else {
+        return false;
+      }
     } else {
       if (subdet == HcalBarrel) {
         threshold = params.energyThresholds()[depth - 1];

Of course, the origin of such rechits associated with detId = 0 remains to be understood.
By the way, I guess it would help if we emitted different printouts here

vs here:

@missirol (Contributor) commented Jul 30, 2024

@kakwok @jsamudio

One question (maybe just for my own understanding).

In HcalRecHitSoAToLegacy, hits in SoA format corresponding to "bad channels" (chi2 < 0) are skipped (not converted to legacy).

Should they also be skipped (and if so, are they) in the PFRecHit+Cluster reconstruction in Alpaka (which starts from the HBHE RecHits in SoA format)?
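For context, a paraphrased sketch of the skip in HcalRecHitSoAToLegacy referred to above; the loop structure and the names hitsView, convertToLegacy, and legacyRecHits are illustrative assumptions, not the verbatim CMSSW code:

for (int i = 0; i < hitsView.metadata().size(); ++i) {
  const auto hit = hitsView[i];
  if (hit.chi2() < 0)  // "bad channel" marker set upstream by the SoA reco
    continue;          // the hit is never converted to the legacy collection
  legacyRecHits.emplace_back(convertToLegacy(hit));
}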

@kakwok (Contributor) commented Jul 30, 2024

If I understand correctly, yes: those rechits with chi2 < 0 will be skipped in the subsequent PF reconstruction.
And it seems to make sense to me to skip those rechits.

@jsamudio (Contributor)

If those bad hits exist in the SoA that is passed to the PF RecHit producer, then I see no explicit skip of chi2 < 0 hits on our side; our first check is just the energy threshold.

@fwyzard (Contributor) commented Jul 31, 2024

As a quick workaround for the crashes, would it make sense to add back the conversion to legacy, and from legacy to SoA?

This should filter away the bad hits and prevent the crashes, and could be implemented as a configuration-only change, while a better fix is worked on.

@missirol (Contributor)

Maybe the question is not for me, but I think it's a good idea (I would prefer to fix the release without touching the menu, but in this case this should give the correct results, and reduce pressure on different fronts).

This is implemented in the menu below, which does not crash on the reproducer of this issue.

/cdaq/test/missirol/dev/CMSSW_14_0_0/tmp/240731_cmssw45595/Test01/HLT/V2

@cms-sw/hlt-l2, if you agree, I will follow up with FOG in a CMSHLT ticket (and I will ask you there to double-check the menu).

@mmusich (Contributor) commented Jul 31, 2024

if you agree, I will follow up with FOG in a CMSHLT ticket

Is it costless?

@missirol (Contributor)

I will check that in parallel. :)

@mmusich (Contributor) commented Jul 31, 2024

I will check that in parallel. :)

OK. For posterity, here's the diff of the proposed menu w.r.t. the latest online one.

@mmusich (Contributor) commented Jul 31, 2024

I would prefer to fix the release without touching the menu,

diff --git a/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc b/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc
index 40d7b2315c8..39f1948f73c 100644
--- a/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc
+++ b/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc
@@ -59,8 +59,17 @@ namespace ALPAKA_ACCELERATOR_NAMESPACE {
     const uint32_t detId = rh.detId();
     const uint32_t depth = HCAL::getDepth(detId);
     const uint32_t subdet = getSubdet(detId);
+
+    if (rh.chi2() < 0)
+      return false;
+
     if (topology.cutsFromDB()) {

this also prevents the crash in the reproducers #45595 (comment)

@Martin-Grunewald (Contributor)

I guess the config workaround is fine (for online/FOG), while the C++ fix should be deployed as soon as possible afterwards.

@jsamudio (Contributor)

Thanks @mmusich. IMO these protections against bad channels and invalid detIds should have been in place anyhow; the logic for the non-DB thresholds would have skipped such rechits as well. I apologize for not catching this sooner.

@missirol (Contributor)

Thanks @mmusich @jsamudio !

mmusich added a commit to mmusich/hltScripts that referenced this issue Jul 31, 2024
@missirol (Contributor)

As a side note, for the one event that I checked among those causing this crash, one gets the following warning when running the legacy HBHERecHit producer (using an older HLT menu).

Begin processing the 1st record. Run 383830, Event 113486368, LumiSection 83 on stream 0 at 31-Jul-2024 19:21:24.279 CEST
%MSG-w HBHEDigi:  HBHEPhase1Reconstructor:hltHbherecoLegacy  31-Jul-2024 19:21:24 CEST Run: 383830 Event: 113486368
 bad SOI/maxTS in cell (HB -8,59,1)
 expect maxTS >= 3 && soi > 0 && soi < maxTS - 1
 got maxTS = 8, SOI = -1
%MSG

I don't know if this is the same hit as the one leading to the crash, but, just for my understanding, could the HCAL experts explain:

  • what could cause this kind of hit to be present
  • whether or not it is expected to happen (or, it should never happen)
  • whether or not skipping this hit in downstream reconstruction modules is the correct action to take (I guess so, but just to make sure)

@cms-sw/hcal-dpg-l2

@abdoulline commented Jul 31, 2024

Hi @missirol
(not trying to wear HCAL expert's hat)

  • looks like a rare local (RM=readout module or fiber) data corruption, as HB/HE are configured to have SOI=3 (Sample Of Interest, the trigger TS) in the HCAL HB/HE Digi array of 8TS
  • is not expected to happen
  • skipping this hit is a necessity and it's done in the HCAL local reco

@kakwok (Contributor) commented Jul 31, 2024

I printed out the input digis to the SoA converter from here, and the digi data does NOT seem corrupted for that channel:

Begin processing the 1st record. Run 383830, Event 113486368, LumiSection 83 on stream 0 at 31-Jul-2024 22:32:04.622 CEST
[Alpaka digi input] (HB -8,59,1) digi = DetID=(HB -8,59,1) flavor=3 8 samples
  ADC=9 TDC=3 CAPID=1
  ADC=10 TDC=3 CAPID=2
  ADC=21 TDC=3 CAPID=3
  ADC=239 TDC=1 CAPID=0
  ADC=12 TDC=3 CAPID=1
  ADC=16 TDC=3 CAPID=2
  ADC=12 TDC=3 CAPID=3
  ADC=17 TDC=3 CAPID=0

@abdoulline commented Aug 1, 2024

I printed out the input digis to the SoA converter from here, and the digi data does NOT seem corrupted for that channel

The printout shows the ADC/TDC pattern, which indeed seems to be OK. It is, however, missing the SOI information (= digi.presamples()). If the SOI is incorrect (-1, as in the warning above provided by Marino), then the algo skips this hit the same way it does with bad channels listed in the DB (in HcalChannelQuality).

This protection in the legacy producer code above was introduced in 2021 (#35944), after an occurrence of a bad SOI crashed the Prompt reco.
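For reference, a sketch of that protection, paraphrased from the condition quoted in the warning above; the variable names are illustrative, not the verbatim CMSSW code:

const int maxTS = digi.samples();   // number of time samples in the digi
const int soi = digi.presamples();  // index of the Sample Of Interest
if (!(maxTS >= 3 && soi > 0 && soi < maxTS - 1)) {
  // emit the "bad SOI/maxTS in cell ..." warning and skip this digi,
  // the same way channels flagged bad in HcalChannelQuality are skipped
  return;
}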

@abdoulline

Let me explicitly involve @mariadalfonso and @igv4321 in the discussion from the HCAL side.

@kakwok (Contributor) commented Aug 1, 2024

Ah, good catch! Indeed the SOI is missing in the printout.
Which means that sample.soi() on this line is never true:
https://github.com/cms-sw/cmssw/blob/master/DataFormats/HcalDigi/interface/QIE11DataFrame.h#L44
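For reference, a paraphrase of QIE11DataFrame::presamples() (see the link above): the SOI is located by scanning the samples for the one whose soi() flag is set, so if no sample carries the flag, as for this corrupted digi, it returns -1, matching the "SOI = -1" in the legacy producer's warning:

int QIE11DataFrame::presamples() const {
  for (int i = 0; i < samples(); i++) {
    if ((*this)[i].soi())
      return i;  // index of the Sample Of Interest
  }
  return -1;  // no sample flagged as SOI -> corrupted digi
}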

@missirol (Contributor) commented Aug 1, 2024

  • looks like a rare local (RM=readout module or fiber) data corruption, as HB/HE are configured to have SOI=3 (Sample Of Interest, the trigger TS) in the HCAL HB/HE Digi array of 8TS
  • is not expected to happen
  • skipping this hit is a necessity and it's done in the HCAL local reco

Thanks for the explanations, @abdoulline .

@mmusich (Contributor) commented Aug 5, 2024

@mmusich (Contributor) commented Aug 6, 2024

+hlt

@mmusich (Contributor) commented Aug 19, 2024

@cms-sw/reconstruction-l2 @cms-sw/heterogeneous-l2 please consider signing this if there is no other follow-up from your area, so that we can close this issue.

@fwyzard (Contributor) commented Aug 19, 2024

I would like to avoid producing the invalid channels at all, but since this is now tracked in #45651, we can close this issue.

@fwyzard (Contributor) commented Aug 19, 2024

+heterogeneous

@jfernan2 (Contributor) commented Sep 4, 2024

+1

@cmsbuild (Contributor) commented Sep 4, 2024

This issue is fully signed and ready to be closed.

@makortel (Contributor) commented Sep 4, 2024

@cmsbuild, please close

cmsbuild closed this as completed Sep 4, 2024