
HLT crash after V1.4.1 menu deploy - 'cudaErrorIllegalAddress'/alpaka_serial_sync::PFRecHitSoAProducerHCAL (run 383830) #45595

Closed
vince502 opened this issue Jul 30, 2024 · 43 comments

Comments

@vince502

During the recent pp physics fills 9945 and 9947 we deployed the new HLT menu /cdaq/physics/Run2024/2e34/v1.4.1/HLT/V2, and we started to see crashes in HLT processes during the run (elog). The crashes come from an illegal memory access.

In particular, we quote the errors from run 383830 in this issue.

So far, the crashes have also been reproduced on the offline side using the same error-stream files.

From f3mon, the crashes show up as:

An exception of category 'StdException' occurred while
[0] Processing Event run: 383830 lumi: 83 event: 113500824 stream: 22
[1] Running path 'DST_PFScouting_DoubleMuon_v5'
[2] Calling method for module HcalDigisSoAProducer@alpaka/'hltHcalDigisSoA'
Exception Message:
A std::exception was thrown.
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_12_MULTIARCHS-el8_amd64_gcc12/build/CMSSW_14_0_12_MULTIARCHS-build/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/event/EventUniformCudaHipRt.hpp(143) 'ret = TApi::eventQuery(event.getNativeHandle())' returned error : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!

Setup to reproduce the crashes (not necessarily the exact same problem) from the streamer files:

# I first copied files from /eos/cms/store/group/tsg/FOG/error_stream_root/run383830/
# using gpu-c2a02-39-04
cmsrel CMSSW_14_0_12_MULTIARCHS
cd CMSSW_14_0_12_MULTIARCHS/src/
cmsenv
mkdir crashAlpakaFiles
scp lxplus.cern.ch:/eos/cms/store/group/tsg/FOG/error_stream_root/run383830/run383830_ls0083_index000316_fu-c2b01-26-01_pid4060272.root crashAlpakaFiles

hltGetConfiguration run:383830 \
--globaltag 140X_dataRun3_HLT_v3 \
--data \
--no-prescale \
--no-output \
--max-events -1 \
--input \
'file:crashAlpakaFiles/run383830_ls0083_index000316_fu-c2b01-26-01_pid4060272.root' \
> hlt.py

cat <<@EOF >> hlt.py
process.options.wantSummary = True
process.options.numberOfThreads = 1 # 8 for method 2 below
process.options.numberOfStreams = 0

@EOF

In hlt.py I have tried several settings to reproduce the same CUDA memory-access error:

1. Plain run
cmsRun hlt.py 2>&1 | tee log_method1.txt

will result in a crash in:
Module: alpaka_serial_sync::PFRecHitSoAProducerHCAL:hltParticleFlowRecHitHBHESoASerialSync (crashed)
Output: log_method1.txt

2. Force-enable the CUDA modules
echo "process.options.accelerators = ['gpu-nvidia']" >> hlt.py 

AND remove the following paths from the process.schedule = cms.Schedule( ... ) block in hlt.py:

process.DQM_PixelReconstruction_v11, 
process.DQM_EcalReconstruction_v11, 
process.DQM_HcalReconstruction_v9,
process.Dataset_DQMGPUvsCPU

then run cmsRun hlt.py 2>&1 | tee logForce_method2.txt

Output: logForce_method2.txt

@cmsbuild (Contributor) commented Jul 30, 2024

cms-bot internal usage

@cmsbuild (Contributor)

A new Issue was created by @vince502.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

vince502 changed the title: HLT crash after V1.4.1 menu deploy - alpaka_serial_sync::PFRecHitSoAProducerHCAL (run 383830) → HLT crash after V1.4.1 menu deploy - 'cudaErrorIllegalAddress'/alpaka_serial_sync::PFRecHitSoAProducerHCAL (run 383830) (Jul 30, 2024)
@mmusich (Contributor) commented Jul 30, 2024

assign hlt, heterogeneous, reconstruction

@cmsbuild (Contributor)

New categories assigned: hlt,heterogeneous,reconstruction

@Martin-Grunewald, @mmusich, @fwyzard, @jfernan2, @makortel, @mandrenguyen you have been requested to review this Pull request/Issue and eventually sign. Thanks.

@jsamudio (Contributor)

Investigating from PF side

@mmusich (Contributor) commented Jul 30, 2024

@cms-sw/pf-l2 FYI

@mmusich (Contributor) commented Jul 30, 2024

type pf

cmsbuild added the pf label Jul 30, 2024
@jsamudio (Contributor)

For the serial sync crash, with gdb I see:

Thread 1 "cmsRun" received signal SIGSEGV, Segmentation fault.
0x00007fff3463d43a in alpaka_serial_sync::PFRecHitProducerKernelConstruct<alpaka_serial_sync::particleFlowRecHitProducer::HCAL>::applyCuts (rh=..., params=..., topology=...)
    at src/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc:63
63            threshold = topology.noiseThreshold()[HCAL::detId2denseId(detId)];

Printing out the detId of the hit, I see that detId == 0, meaning denseId == HCAL::kInvalidDenseId (which is std::numeric_limits<uint32_t>::max()), and indexing noiseThreshold() with that value causes the segfault here.

I will check with the method that produces the CUDA errors on GPU, but generally I would suspect the same thing if there is a rechit with detId == 0.
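To make the failure mode concrete, here is a minimal standalone sketch (not the CMSSW code; the toy mapping and table size are illustrative assumptions) of what happens when a rechit with detId == 0 reaches the threshold lookup:

#include <cstdint>
#include <cstdio>
#include <limits>
#include <vector>

constexpr uint32_t kInvalidDenseId = std::numeric_limits<uint32_t>::max();

// Stand-in for HCAL::detId2denseId(): detId == 0 is not a valid HCAL cell,
// so it maps to the invalid sentinel.
uint32_t detId2denseId(uint32_t detId) {
  return detId == 0 ? kInvalidDenseId : detId % 1024;  // toy mapping
}

int main() {
  std::vector<float> noiseThreshold(1024, 0.8f);  // toy threshold table
  const uint32_t detId = 0;                       // the problematic rechit
  const uint32_t denseId = detId2denseId(detId);
  if (denseId == kInvalidDenseId) {
    std::printf("invalid denseId for detId %u, skipping hit\n", detId);
    return 0;  // guarded: what the fixes discussed below do instead of indexing
  }
  // Without the guard, this indexes ~4e9 elements past the end of the table:
  // a segfault on CPU, cudaErrorIllegalAddress on GPU.
  std::printf("threshold = %f\n", noiseThreshold[denseId]);
}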

@mmusich (Contributor) commented Jul 30, 2024

@kakwok FYI

@mmusich (Contributor) commented Jul 30, 2024

also @abdoulline @cms-sw/hcal-dpg-l2

@mmusich (Contributor) commented Jul 30, 2024

I will check with the method that produces the CUDA errors on GPU, but generally I would suspect the same thing if there is a rechit with detId == 0.

FWIW I confirm that with this:

diff --git a/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc b/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc
index 40d7b2315c8..cd7f215abf1 100644
--- a/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc
+++ b/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc
@@ -59,6 +59,12 @@ namespace ALPAKA_ACCELERATOR_NAMESPACE {
     const uint32_t detId = rh.detId();
     const uint32_t depth = HCAL::getDepth(detId);
     const uint32_t subdet = getSubdet(detId);
+
+    if (detId == 0) {
+      printf("Rechit with detId %u has subdetector %u and depth %u ! \n", detId, subdet, depth);
+      return false;
+    }
+
     if (topology.cutsFromDB()) {
       threshold = topology.noiseThreshold()[HCAL::detId2denseId(detId)];
     } else {

the reproducer runs to completion, whether forcing the backend to serial [1] or GPU [2]:

[1] CPU only
#!/bin/bash -ex

# CMSSW_14_0_12_MULTIARCHS

hltGetConfiguration run:383830 \
  --globaltag 140X_dataRun3_HLT_v3 \
  --data \
  --no-prescale \
  --no-output \
  --max-events -1 \
  --input /store/group/tsg/FOG/error_stream_root/run383830/run383830_ls0083_index000316_fu-c2b01-26-01_pid4060272.root > hlt.py

cat <<@EOF >> hlt.py
try:
  del process.MessageLogger
  process.load('FWCore.MessageLogger.MessageLogger_cfi')
  process.MessageLogger.cerr.enableStatistics = False
except:
  pass

process.source.skipEvents = cms.untracked.uint32( 74 )
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
process.options.accelerators = ['cpu']
@EOF

cmsRun hlt.py &> hlt.log
[2] GPU only
#!/bin/bash -ex

# CMSSW_14_0_12_MULTIARCHS

hltGetConfiguration run:383830 \
  --globaltag 140X_dataRun3_HLT_v3 \
  --data \
  --no-prescale \
  --no-output \
  --max-events -1 \
  --input /store/group/tsg/FOG/error_stream_root/run383830/run383830_ls0083_index000316_fu-c2b01-26-01_pid4060272.root > hlt2.py

cat <<@EOF >> hlt2.py
try:
  del process.MessageLogger
  process.load('FWCore.MessageLogger.MessageLogger_cfi')
  process.MessageLogger.cerr.enableStatistics = False
except:
  pass

process.source.skipEvents = cms.untracked.uint32( 74 )
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
process.options.accelerators = ['gpu-nvidia']

rmPaths = set()
for pathName in process.paths_():
  if 'DQM' in pathName:
    rmPaths.add(pathName)
  elif 'CPUOnly' in pathName:
    rmPaths.add(pathName)

for rmPath in rmPaths:
  process.__delattr__(rmPath)

process.hltAlCaPFJet40GPUxorCPUFilter.triggerConditions=cms.vstring( 'AlCa_PFJet40_v31' )  
@EOF

cmsRun hlt2.py &> hlt2.log

@mmusich (Contributor) commented Jul 30, 2024

A slightly more elegant version is:

diff --git a/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc b/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc
index 40d7b2315c8..576342bc16a 100644
--- a/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc
+++ b/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc
@@ -59,8 +59,14 @@ namespace ALPAKA_ACCELERATOR_NAMESPACE {
     const uint32_t detId = rh.detId();
     const uint32_t depth = HCAL::getDepth(detId);
     const uint32_t subdet = getSubdet(detId);
+
     if (topology.cutsFromDB()) {
-      threshold = topology.noiseThreshold()[HCAL::detId2denseId(detId)];
+      const auto& denseId = HCAL::detId2denseId(detId);
+      if (denseId != HCAL::kInvalidDenseId) {
+        threshold = topology.noiseThreshold()[denseId];
+      } else {
+        return false;
+      }
     } else {
       if (subdet == HcalBarrel) {
         threshold = params.energyThresholds()[depth - 1];

Of course, the origin of such rechits associated with detId = 0 remains to be understood.
By the way, I guess it would help if we emitted different printouts here

vs here:

@missirol (Contributor) commented Jul 30, 2024

@kakwok @jsamudio

One question (maybe just for my own understanding).

In HcalRecHitSoAToLegacy, hits in SoA format corresponding to "bad channels" (chi2 < 0) are skipped (not converted to legacy).

Should they also be skipped (and if so, are they) in the PFRecHit+Cluster reconstruction in Alpaka (which starts from the HBHE RecHits in SoA format)?
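For context, a paraphrased sketch of the skip in HcalRecHitSoAToLegacy referred to above; the loop structure and the names hitsView, convertToLegacy, and legacyRecHits are illustrative assumptions, not the verbatim CMSSW code:

for (int i = 0; i < hitsView.metadata().size(); ++i) {
  const auto hit = hitsView[i];
  if (hit.chi2() < 0)  // "bad channel" marker set upstream by the SoA reco
    continue;          // the hit is never converted to the legacy collection
  legacyRecHits.emplace_back(convertToLegacy(hit));
}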

@kakwok (Contributor) commented Jul 30, 2024

If I understand correctly, yes: those rechits with chi2 < 0 will be skipped in the subsequent PF reconstruction.
And it seems to make sense to me to skip those rechits.

@jsamudio (Contributor)

If those bad hits exist in the SoA that is passed to the PF RecHit producer, then I see no explicit skip of chi2 < 0 hits on our side; our first check is just the energy threshold.

@fwyzard (Contributor) commented Jul 31, 2024

As a quick workaround for the crashes, would it make sense to add back the conversion to legacy, and from legacy to SoA?

This should filter away the bad hits and prevent the crashes, and could be implemented as a configuration-only change, while a better fix is worked on.

@missirol (Contributor)

Maybe the question is not for me, but I think it's a good idea (I would prefer to fix the release without touching the menu, but in this case this should give the correct results, and reduce pressure on different fronts).

This is implemented in the menu below, which does not crash on the reproducer of this issue.

/cdaq/test/missirol/dev/CMSSW_14_0_0/tmp/240731_cmssw45595/Test01/HLT/V2

@cms-sw/hlt-l2, if you agree, I will follow up with FOG in a CMSHLT ticket (and I will ask you there to double-check the menu).

@mmusich (Contributor) commented Jul 31, 2024

if you agree, I will follow up with FOG in a CMSHLT ticket

Is it costless?

@missirol (Contributor)

I will check that in parallel. :)

@mmusich (Contributor) commented Jul 31, 2024

I will check that in parallel. :)

OK. For posterity, here's the diff of the proposed menu w.r.t. the latest online one.

@mmusich (Contributor) commented Jul 31, 2024

I would prefer to fix the release without touching the menu,

diff --git a/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc b/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc
index 40d7b2315c8..39f1948f73c 100644
--- a/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc
+++ b/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc
@@ -59,8 +59,17 @@ namespace ALPAKA_ACCELERATOR_NAMESPACE {
     const uint32_t detId = rh.detId();
     const uint32_t depth = HCAL::getDepth(detId);
     const uint32_t subdet = getSubdet(detId);
+
+    if (rh.chi2() < 0)
+      return false;
+
     if (topology.cutsFromDB()) {

this also prevents the crash in the reproducers #45595 (comment)

@Martin-Grunewald (Contributor)

I guess the config workaround is fine (for online/FOG), while the C++ fix should be deployed as soon as possible afterwards.

@jsamudio (Contributor)

Thanks @mmusich. IMO these protections against bad channels and invalid detIds should have been in place anyhow; the logic for the non-DB thresholds would have skipped such rechits as well. I apologize for not catching this sooner.

@missirol (Contributor)

Thanks @mmusich @jsamudio !

mmusich added a commit to mmusich/hltScripts that referenced this issue Jul 31, 2024
@missirol (Contributor)

As a side note, for the one event that I checked among those causing this crash, one gets the following warning when running the legacy HBHERecHit producer (using an older HLT menu).

Begin processing the 1st record. Run 383830, Event 113486368, LumiSection 83 on stream 0 at 31-Jul-2024 19:21:24.279 CEST
%MSG-w HBHEDigi:  HBHEPhase1Reconstructor:hltHbherecoLegacy  31-Jul-2024 19:21:24 CEST Run: 383830 Event: 113486368
 bad SOI/maxTS in cell (HB -8,59,1)
 expect maxTS >= 3 && soi > 0 && soi < maxTS - 1
 got maxTS = 8, SOI = -1
%MSG

I don't know if this is the same hit as the one leading to the crash, but, just for my understanding, could the HCAL experts explain:

  • what could cause this kind of hit to be present
  • whether or not it is expected to happen (or, it should never happen)
  • whether or not skipping this hit in downstream reconstruction modules is the correct action to take (I guess so, but just to make sure)

@cms-sw/hcal-dpg-l2

@abdoulline commented Jul 31, 2024

Hi @missirol
(not trying to wear HCAL expert's hat)

  • looks like a rare local (RM=readout module or fiber) data corruption, as HB/HE are configured to have SOI=3 (Sample Of Interest, the trigger TS) in the HCAL HB/HE Digi array of 8TS
  • is not expected to happen
  • skipping this hit is a necessity and it's done in the HCAL local reco

@kakwok (Contributor) commented Jul 31, 2024

I printed out the input digis to the SoA converter from here, and the digi data does NOT seem corrupted for that channel:

Begin processing the 1st record. Run 383830, Event 113486368, LumiSection 83 on stream 0 at 31-Jul-2024 22:32:04.622 CEST
[Alpaka digi input] (HB -8,59,1) digi = DetID=(HB -8,59,1) flavor=3 8 samples
  ADC=9 TDC=3 CAPID=1
  ADC=10 TDC=3 CAPID=2
  ADC=21 TDC=3 CAPID=3
  ADC=239 TDC=1 CAPID=0
  ADC=12 TDC=3 CAPID=1
  ADC=16 TDC=3 CAPID=2
  ADC=12 TDC=3 CAPID=3
  ADC=17 TDC=3 CAPID=0

@abdoulline commented Aug 1, 2024

I printed out the input digis to the SoA converter from here, and the digi data does NOT seem corrupted for that channel

The printout shows the ADC/TDC pattern, which indeed seems to be OK. It is, however, missing the SOI information (= digi.presamples()). If the SOI is incorrect (-1, as in the warning above provided by Marino), then the algo skips this hit the same way it does with bad channels listed in the DB (in HcalChannelQuality).

This protection in the legacy producer code above was introduced in 2021 (#35944), after an occurrence of a bad SOI crashed the Prompt reco.
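For reference, a sketch of that protection, paraphrased from the condition quoted in the warning above; the variable names are illustrative, not the verbatim CMSSW code:

const int maxTS = digi.samples();   // number of time samples in the digi
const int soi = digi.presamples();  // index of the Sample Of Interest
if (!(maxTS >= 3 && soi > 0 && soi < maxTS - 1)) {
  // emit the "bad SOI/maxTS in cell ..." warning and skip this digi,
  // the same way channels flagged bad in HcalChannelQuality are skipped
  return;
}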

@abdoulline

Let me explicitly involve @mariadalfonso and @igv4321 in the discussion from the HCAL side.

@kakwok (Contributor) commented Aug 1, 2024

Ah, good catch! Indeed the SOI is missing in the printout.
Which means that sample.soi() on this line is never true:
https://github.com/cms-sw/cmssw/blob/master/DataFormats/HcalDigi/interface/QIE11DataFrame.h#L44
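For reference, a paraphrase of QIE11DataFrame::presamples() (see the link above): the SOI is located by scanning the samples for the one whose soi() flag is set, so if no sample carries the flag, as for this corrupted digi, it returns -1, matching the "SOI = -1" in the legacy producer's warning:

int QIE11DataFrame::presamples() const {
  for (int i = 0; i < samples(); i++) {
    if ((*this)[i].soi())
      return i;  // index of the Sample Of Interest
  }
  return -1;  // no sample flagged as SOI -> corrupted digi
}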

@missirol (Contributor) commented Aug 1, 2024

  • looks like a rare local (RM=readout module or fiber) data corruption, as HB/HE are configured to have SOI=3 (Sample Of Interest, the trigger TS) in the HCAL HB/HE Digi array of 8TS
  • is not expected to happen
  • skipping this hit is a necessity and it's done in the HCAL local reco

Thanks for the explanations, @abdoulline .

@mmusich (Contributor) commented Aug 5, 2024

@mmusich (Contributor) commented Aug 6, 2024

+hlt

@mmusich (Contributor) commented Aug 19, 2024

@cms-sw/reconstruction-l2 @cms-sw/heterogeneous-l2 please consider signing this if there is no other follow-up from your area, so that we can close this issue.

@fwyzard (Contributor) commented Aug 19, 2024

I would like to avoid producing the invalid channels at all, but since this is now tracked in #45651, we can close this issue.

@fwyzard (Contributor) commented Aug 19, 2024

+heterogeneous

@jfernan2 (Contributor) commented Sep 4, 2024

+1

@cmsbuild (Contributor) commented Sep 4, 2024

This issue is fully signed and ready to be closed.

@makortel (Contributor) commented Sep 4, 2024

@cmsbuild, please close

cmsbuild closed this as completed Sep 4, 2024