
OneToManyAssoc assertion failure in HLT menu in CMSSW_14_0_15 #45834

Closed
mmusich opened this issue Aug 29, 2024 · 19 comments

@mmusich
Contributor

mmusich commented Aug 29, 2024

During the final steps of testing of CMSSW_14_0_15 TSG / FOG encountered this runtime assertion:

cmsRun: src/HeterogeneousCore/AlpakaInterface/interface/OneToManyAssoc.h:48: void cms::alpakatools::OneToManyAssocBase<I, ONES, SIZE>::initStorage(View) [with I = unsigned int; int ONES = 2561; int SIZE = -1]: Assertion `view.contentSize > 0' failed.

While FOG (@mzarucki @trocino @vince502) later produced a recipe to reproduce the exact crash, in the meantime it can be reproduced with the following script:

# cmsrel CMSSW_14_0_15
# cd CMSSW_14_0_15/src
# export USER_SCRAM_TARGET=x86-64-v3
# scram b enable-multi-targets
# cmsenv

#!/bin/bash

# List of run numbers
runs=(377893
      378039
      378113
      378366
      378369
      378906
      378940
      378981
      378985
      378993
      378994
      379154
      379174
      380115
      380360
      380399
      380466
      380513
      380531
      380624
      381067
      381147
      381443
      381479
      381543
      381544
      381549
      382250
      382461
      382580
      382594
      382617
      382654
      383034
      383155
      383162
      383219
      383363
      383368
      383377)

# Base directory for input files on EOS
base_dir="/store/group/tsg/FOG/error_stream_root/run"

# Global tag for the HLT configuration
global_tag="140X_dataRun3_HLT_v3"

# EOS command (adjust this if necessary for your environment)
eos_cmd="eos"

# Loop over each run number
for run in "${runs[@]}"; do
  # Set the MALLOC_CONF environment variable
  # export MALLOC_CONF=junk:true

  # Construct the input directory path
  input_dir="${base_dir}${run}"

  # Find all root files in the input directory on EOS
  root_files=$(${eos_cmd} find -f "/eos/cms${input_dir}" -name "*.root" | awk '{print "root://eoscms.cern.ch/" $0}' | paste -sd, -)

  # Check if there are any root files found
  if [ -z "${root_files}" ]; then
    echo "No root files found for run ${run} in directory ${input_dir}."
    continue
  fi

  # Create filenames for the HLT configuration and log file
  hlt_config_file="hlt_run${run}.py"
  hlt_log_file="hlt_run${run}.log"

  # Generate the HLT configuration file
  hltGetConfiguration /online/collisions/2024/2e34/v1.4/HLT/V2 \
    --globaltag ${global_tag} \
    --data \
    --eras Run3 \
    --l1-emulator uGT \
    --l1 L1Menu_Collisions2024_v1_3_0_xml \
    --no-prescale \
    --no-output \
    --max-events -1 \
    --input ${root_files} > ${hlt_config_file}

  # Append additional options to the configuration file
  cat <<@EOF >> ${hlt_config_file}
del process.MessageLogger
process.load('FWCore.MessageService.MessageLogger_cfi')  
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF

  # Run the HLT configuration with cmsRun and redirect output to the log file
  cmsRun ${hlt_config_file} &> ${hlt_log_file}

done

A few facts that I tested:

  • the crash happens only in presence of a GPU on the machine
  • it happens whether or not one chooses USER_SCRAM_TARGET=x86-64-v3 (so it doesn't seem to depend on the micro-architecture)
  • undoing locally the changes of https://github.com/cms-sw/cmssw/pull/45744/files doesn't solve the issue
  • @AdrianoDee tested in CMSSW_14_0_15 relval wf 12861.402 (SingleNuE10_2024_GenSim+Digi_Patatrack_PixelOnlyAlpaka_2024+RecoNano_Patatrack_PixelOnlyAlpaka_2024+HARVESTNano_Patatrack_PixelOnlyAlpaka_202) and reported a crash: this seems to indicate that the issue stems from a lack of pixel hits in the input events.
  • the issue persists in recent IBs, e.g. CMSSW_14_0_X_2024-08-28-2300
@mmusich
Contributor Author

mmusich commented Aug 29, 2024

assign hlt, heterogeneous, reconstruction

@cmsbuild
Contributor

New categories assigned: hlt,heterogeneous,reconstruction

@Martin-Grunewald, @mmusich, @fwyzard, @jfernan2, @makortel, @mandrenguyen you have been requested to review this Pull request/Issue and eventually sign. Thanks

@cmsbuild
Contributor

cmsbuild commented Aug 29, 2024

cms-bot internal usage

@cmsbuild
Contributor

A new Issue was created by @mmusich.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@mmusich
Contributor Author

mmusich commented Aug 29, 2024

Correction:

undoing locally the changes of https://github.com/cms-sw/cmssw/pull/45744/files doesn't solve the issue

It actually does:

diff --git a/DataFormats/TrackingRecHitSoA/interface/alpaka/TrackingRecHitsSoACollection.h b/DataFormats/TrackingRecHitSoA/interface/alpaka/TrackingRecHitsSoACollection.h
index da3c3bef392..14d0a2e1aa8 100644
--- a/DataFormats/TrackingRecHitSoA/interface/alpaka/TrackingRecHitsSoACollection.h
+++ b/DataFormats/TrackingRecHitSoA/interface/alpaka/TrackingRecHitsSoACollection.h
@@ -41,16 +41,6 @@ namespace cms::alpakatools {
       assert(deviceData.nHits() == hostData.nHits());
       assert(deviceData.offsetBPIX2() == hostData.offsetBPIX2());
 #endif
-      // Update the contents address of the phiBinner histo container after the copy from device happened
-      alpaka::wait(queue);
-      typename TrackingRecHitSoA<TrackerTraits>::PhiBinnerView pbv;
-      pbv.assoc = &(hostData.view().phiBinner());
-      pbv.offSize = -1;
-      pbv.offStorage = nullptr;
-      pbv.contentSize = hostData.nHits();
-      pbv.contentStorage = hostData.view().phiBinnerStorage();
-      hostData.view().phiBinner().initStorage(pbv);
-
       return hostData;
     }
   };

followed by git cms-checkdeps -a does solve the issue.

@mmusich
Contributor Author

mmusich commented Aug 29, 2024

With this change:

diff --git a/DataFormats/TrackingRecHitSoA/interface/alpaka/TrackingRecHitsSoACollection.h b/DataFormats/TrackingRecHitSoA/interface/alpaka/TrackingRecHitsSoACollection.h
index da3c3bef392..5bdb74fbb6e 100644
--- a/DataFormats/TrackingRecHitSoA/interface/alpaka/TrackingRecHitsSoACollection.h
+++ b/DataFormats/TrackingRecHitSoA/interface/alpaka/TrackingRecHitsSoACollection.h
@@ -34,6 +34,11 @@ namespace cms::alpakatools {
     template <typename TQueue>
     static auto copyAsync(TQueue& queue, TrackingRecHitDevice<TrackerTraits, TDevice> const& deviceData) {
       TrackingRecHitHost<TrackerTraits> hostData(queue, deviceData.view().metadata().size());
+
+      // Don't bother if zero hits
+      if (deviceData.view().metadata().size() == 0) 
+        return hostData;
+        
       alpaka::memcpy(queue, hostData.buffer(), deviceData.buffer());
 #ifdef GPU_DEBUG
       printf("TrackingRecHitsSoACollection: I'm copying to host.\n");

I tested successfully (on lxplus8-gpu):

I'll let the experts comment on whether that would be enough of a protection and whether more tests are needed.

@makortel
Contributor

@AdrianoDee Do the consumers of TrackingRecHitSoACollection check the view().metadata().size() before accessing any of the scalar data?

If the CopyHost::copyAsync() returns before the alpaka::memcpy(), the scalar data in the TrackingRecHitHost will stay uninitialized, which will likely result in random (well, undefined) behavior in code that does not check view().metadata().size().

(I'm not really arguing against avoiding the memcpy() if size() == 0, just want to make sure the consequences of avoiding the memcpy() when size() == 0 are understood)

One possibility to be foolproof would be to zero the memory in hostData in this case (with std::memset()).

@AdrianoDee
Contributor

@makortel so I went checking to be sure:

But (sorry for the long list above, I spotted this only after finishing the list) we do access offsetBPIX2() when allocating the needed memory, and this could lead to unexpected behavior. I imagine it would be good to add a check just before that (regardless of the fix here):

diff --git a/RecoTracker/PixelSeeding/plugins/alpaka/CAHitNtupletGenerator.cc b/RecoTracker/PixelSeeding/plugins/alpaka/CAHitNtupletGenerator.cc
index c6615c08d73..d21ed39afc6 100644
--- a/RecoTracker/PixelSeeding/plugins/alpaka/CAHitNtupletGenerator.cc
+++ b/RecoTracker/PixelSeeding/plugins/alpaka/CAHitNtupletGenerator.cc
@@ -298,6 +298,10 @@ namespace ALPAKA_ACCELERATOR_NAMESPACE {
     using GPUKernels = CAHitNtupletGeneratorKernels<TrackerTraits>;
 
     TrackSoA tracks(queue);
+    
+    // Don't bother if fewer than 2 hits
+    if (hits_d.view().metadata().size() < 2) 
+       return tracks;
 
     GPUKernels kernels(m_params, hits_d.view().metadata().size(), hits_d.offsetBPIX2(), queue);
 

@mmusich
Contributor Author

mmusich commented Aug 29, 2024

Another test that is somewhat worrisome is that if I feed the "alpaka-migrated" menu from CMSHLT-3284 into the relval machinery via:

cmsrel CMSSW_14_0_15
cd CMSSW_14_0_15/src
cmsenv
git cms-addpkg HLTrigger/Configuration
git cms-addpkg Configuration/PyReleaseValidation
hltGetConfiguration /users/soohwan/HLT_140X/Alpaka/HIonV173/V10 \
   --globaltag auto:phase1_2024_realistic \
   --mc \
   --unprescale \
   --cff > "${CMSSW_BASE}"/src/HLTrigger/Configuration/python/HLT_User_cff.py
scram b -j 20   

and then apply the following patch:

diff --git a/Configuration/PyReleaseValidation/python/upgradeWorkflowComponents.py b/Configuration/PyReleaseValidation/python/upgradeWorkflowComponents.py
index 8a70a74aa0c..f9dc0a0397f 100644
--- a/Configuration/PyReleaseValidation/python/upgradeWorkflowComponents.py
+++ b/Configuration/PyReleaseValidation/python/upgradeWorkflowComponents.py
@@ -2865,7 +2865,7 @@ upgradeProperties[2017] = {
     '2022HI' : {
         'Geom' : 'DB:Extended',
         'GT':'auto:phase1_2022_realistic_hi',
-        'HLTmenu': '@fake2',
+        'HLTmenu': 'User',
         'Era':'Run3_pp_on_PbPb',
         'BeamSpot': 'DBrealistic',
         'ScenToRun' : ['GenSim','Digi','RecoNano','HARVESTNano','ALCA'],
@@ -2873,7 +2873,7 @@ upgradeProperties[2017] = {
     '2022HIRP' : {
         'Geom' : 'DB:Extended',
         'GT':'auto:phase1_2022_realistic_hi',
-        'HLTmenu': '@fake2',
+        'HLTmenu': 'User',
         'Era':'Run3_pp_on_PbPb_approxSiStripClusters',
         'BeamSpot': 'DBrealistic',
         'ScenToRun' : ['GenSim','Digi','RecoNano','HARVESTNano','ALCA'],
@@ -2881,7 +2881,7 @@ upgradeProperties[2017] = {
     '2023HI' : {
         'Geom' : 'DB:Extended',
         'GT':'auto:phase1_2023_realistic_hi',
-        'HLTmenu': '@fake2',
+        'HLTmenu': 'User',
         'Era':'Run3_pp_on_PbPb',
         'BeamSpot': 'DBrealistic',
         'ScenToRun' : ['GenSim','Digi','RecoNano','HARVESTNano','ALCA'],
@@ -2889,7 +2889,7 @@ upgradeProperties[2017] = {
     '2023HIRP' : {
         'Geom' : 'DB:Extended',
         'GT':'auto:phase1_2023_realistic_hi',
-        'HLTmenu': '@fake2',
+        'HLTmenu': 'User',
         'Era':'Run3_pp_on_PbPb_approxSiStripClusters',
         'BeamSpot': 'DBrealistic',
         'ScenToRun' : ['GenSim','Digi','RecoNano','HARVESTNano','ALCA'],

in a release that I prepared with the "would-be fix" discussed above, and then run:

  • runTheMatrix.py --what upgrade -l 15261.0 --maxSteps 2 (neutrino gun input)
  • runTheMatrix.py --what upgrade -l 15224.0 --maxSteps 2 (TTbar input)

the neutrino gun test dies at the second event (killed, I suppose, due to excessive memory use or some other infinite loop), whereas if I use HIon (the CUDA-based menu that we currently have in the release), both tests run correctly.

@mmusich
Contributor Author

mmusich commented Aug 29, 2024

But (sorry for the long list above, I spotted this only after finishing the list) we do access offsetBPIX2() when allocating the needed memory, and this could lead to unexpected behavior. I imagine it would be good to add a check just before that (regardless of the fix here):

thanks @AdrianoDee , I confirm with the additional changes at #45834 (comment) the test at #45834 (comment) runs fine.

@AdrianoDee
Contributor

Great, you're welcome.

In general I think whoever uses this SoA must check for the size being non-zero and we could leave the copy as is (without zero-ing the data). But also I imagine the cost of the std::memset here would be negligible (?).

@makortel
Contributor

In general I think whoever uses this SoA must check for the size being non-zero and we could leave the copy as is (without zero-ing the data).

If we leave it to this option, I think it would be good to document this requirement in the TrackingRecHit{Host,Device,SoACollection}.h files.

But also I imagine the cost of the std::memset here would be negligible (?).

This would be my expectation as well, as the size of the buffer should be fairly small, and the calls should be fairly infrequent.

@fwyzard
Contributor

fwyzard commented Sep 2, 2024

In general I think whoever uses this SoA must check for the size being non-zero and we could leave the copy as is (without zero-ing the data). But also I imagine the cost of the std::memset here would be negligible (?).

I would prefer to find a way to make the TrackingRecHitsSoACollection-related classes self-contained, so that downstream users do not need to check whether they are valid before accessing them.

@mmusich
Contributor Author

mmusich commented Sep 6, 2024

+hlt

@makortel
Contributor

makortel commented Sep 6, 2024

+heterogeneous

@jfernan2
Contributor

jfernan2 commented Sep 6, 2024

+1

@cmsbuild
Contributor

cmsbuild commented Sep 6, 2024

This issue is fully signed and ready to be closed.

@makortel
Contributor

makortel commented Sep 6, 2024

@cmsbuild, please close

@cmsbuild cmsbuild closed this as completed Sep 6, 2024