
OneToManyAssoc assertion failure in HLT menu in CMSSW_14_0_15 #45834

Closed
mmusich opened this issue Aug 29, 2024 · 19 comments

@mmusich
Contributor

mmusich commented Aug 29, 2024

During the final steps of testing of CMSSW_14_0_15 TSG / FOG encountered this runtime assertion:

cmsRun: src/HeterogeneousCore/AlpakaInterface/interface/OneToManyAssoc.h:48: void cms::alpakatools::OneToManyAssocBase<I, ONES, SIZE>::initStorage(View) [with I = unsigned int; int ONES = 2561; int SIZE = -1]: Assertion `view.contentSize > 0' failed.

While FOG (@mzarucki @trocino @vince502) later produced a recipe to reproduce the exact crash, in the meantime it can be reproduced with the following script:

# cmsrel CMSSW_14_0_15
# cd CMSSW_14_0_15/src
# export USER_SCRAM_TARGET=x86-64-v3
# scram b enable-multi-targets
# cmsenv

#!/bin/bash

# List of run numbers
runs=(377893
      378039
      378113
      378366
      378369
      378906
      378940
      378981
      378985
      378993
      378994
      379154
      379174
      380115
      380360
      380399
      380466
      380513
      380531
      380624
      381067
      381147
      381443
      381479
      381543
      381544
      381549
      382250
      382461
      382580
      382594
      382617
      382654
      383034
      383155
      383162
      383219
      383363
      383368
      383377)

# Base directory for input files on EOS
base_dir="/store/group/tsg/FOG/error_stream_root/run"

# Global tag for the HLT configuration
global_tag="140X_dataRun3_HLT_v3"

# EOS command (adjust this if necessary for your environment)
eos_cmd="eos"

# Loop over each run number
for run in "${runs[@]}"; do
  # Set the MALLOC_CONF environment variable
  # export MALLOC_CONF=junk:true

  # Construct the input directory path
  input_dir="${base_dir}${run}"

  # Find all root files in the input directory on EOS
  root_files=$(${eos_cmd} find -f "/eos/cms${input_dir}" -name "*.root" | awk '{print "root://eoscms.cern.ch/" $0}' | paste -sd, -)

  # Check if there are any root files found
  if [ -z "${root_files}" ]; then
    echo "No root files found for run ${run} in directory ${input_dir}."
    continue
  fi

  # Create filenames for the HLT configuration and log file
  hlt_config_file="hlt_run${run}.py"
  hlt_log_file="hlt_run${run}.log"

  # Generate the HLT configuration file
  hltGetConfiguration /online/collisions/2024/2e34/v1.4/HLT/V2 \
    --globaltag ${global_tag} \
    --data \
    --eras Run3 \
    --l1-emulator uGT \
    --l1 L1Menu_Collisions2024_v1_3_0_xml \
    --no-prescale \
    --no-output \
    --max-events -1 \
    --input ${root_files} > ${hlt_config_file}

  # Append additional options to the configuration file
  cat <<@EOF >> ${hlt_config_file}
del process.MessageLogger
process.load('FWCore.MessageService.MessageLogger_cfi')  
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF

  # Run the HLT configuration with cmsRun and redirect output to the log file
  cmsRun ${hlt_config_file} &> ${hlt_log_file}

done

A few facts that I tested:

  • the crash happens only in presence of a GPU on the machine
  • it happens whether or not one chooses USER_SCRAM_TARGET=x86-64-v3 (so it doesn't seem to depend on the micro-architecture)
  • undoing locally the changes of https://github.com/cms-sw/cmssw/pull/45744/files doesn't solve the issue
  • @AdrianoDee tested in CMSSW_14_0_15 relval wf 12861.402 (SingleNuE10_2024_GenSim+Digi_Patatrack_PixelOnlyAlpaka_2024+RecoNano_Patatrack_PixelOnlyAlpaka_2024+HARVESTNano_Patatrack_PixelOnlyAlpaka_202) and reported a crash: this seems to indicate that the issue stems from a lack of pixel hits in the input events.
  • the issue persists in recent IBs, e.g. CMSSW_14_0_X_2024-08-28-2300
@mmusich
Contributor Author

mmusich commented Aug 29, 2024

assign hlt, heterogeneous, reconstruction

@cmsbuild
Contributor

New categories assigned: hlt,heterogeneous,reconstruction

@Martin-Grunewald, @mmusich, @fwyzard, @jfernan2, @makortel, @mandrenguyen you have been requested to review this Pull request/Issue and eventually sign. Thanks

@cmsbuild
Contributor

cmsbuild commented Aug 29, 2024

cms-bot internal usage

@cmsbuild
Contributor

A new Issue was created by @mmusich.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@mmusich
Contributor Author

mmusich commented Aug 29, 2024

Correction:

undoing locally the changes of https://github.com/cms-sw/cmssw/pull/45744/files doesn't solve the issue

It actually does:

diff --git a/DataFormats/TrackingRecHitSoA/interface/alpaka/TrackingRecHitsSoACollection.h b/DataFormats/TrackingRecHitSoA/interface/alpaka/TrackingRecHitsSoACollection.h
index da3c3bef392..14d0a2e1aa8 100644
--- a/DataFormats/TrackingRecHitSoA/interface/alpaka/TrackingRecHitsSoACollection.h
+++ b/DataFormats/TrackingRecHitSoA/interface/alpaka/TrackingRecHitsSoACollection.h
@@ -41,16 +41,6 @@ namespace cms::alpakatools {
       assert(deviceData.nHits() == hostData.nHits());
       assert(deviceData.offsetBPIX2() == hostData.offsetBPIX2());
 #endif
-      // Update the contents address of the phiBinner histo container after the copy from device happened
-      alpaka::wait(queue);
-      typename TrackingRecHitSoA<TrackerTraits>::PhiBinnerView pbv;
-      pbv.assoc = &(hostData.view().phiBinner());
-      pbv.offSize = -1;
-      pbv.offStorage = nullptr;
-      pbv.contentSize = hostData.nHits();
-      pbv.contentStorage = hostData.view().phiBinnerStorage();
-      hostData.view().phiBinner().initStorage(pbv);
-
       return hostData;
     }
   };

followed by git cms-checkdeps -a does solve the issue.

@mmusich
Contributor Author

mmusich commented Aug 29, 2024

With this change:

diff --git a/DataFormats/TrackingRecHitSoA/interface/alpaka/TrackingRecHitsSoACollection.h b/DataFormats/TrackingRecHitSoA/interface/alpaka/TrackingRecHitsSoACollection.h
index da3c3bef392..5bdb74fbb6e 100644
--- a/DataFormats/TrackingRecHitSoA/interface/alpaka/TrackingRecHitsSoACollection.h
+++ b/DataFormats/TrackingRecHitSoA/interface/alpaka/TrackingRecHitsSoACollection.h
@@ -34,6 +34,11 @@ namespace cms::alpakatools {
     template <typename TQueue>
     static auto copyAsync(TQueue& queue, TrackingRecHitDevice<TrackerTraits, TDevice> const& deviceData) {
       TrackingRecHitHost<TrackerTraits> hostData(queue, deviceData.view().metadata().size());
+
+      // Don't bother if zero hits
+      if (deviceData.view().metadata().size() == 0) 
+        return hostData;
+        
       alpaka::memcpy(queue, hostData.buffer(), deviceData.buffer());
 #ifdef GPU_DEBUG
       printf("TrackingRecHitsSoACollection: I'm copying to host.\n");

I tested successfully (on lxplus8-gpu):

I'll let the experts comment on whether that would be enough of a protection and whether more tests are needed.

@makortel
Contributor

@AdrianoDee Do the consumers of TrackingRecHitSoACollection check the view().metadata().size() before accessing any of the scalar data?

If the CopyHost::copyAsync() returns before the alpaka::memcpy(), the scalar data in the TrackingRecHitHost will stay uninitialized, which will likely result in random (well, undefined) behavior in code that does not check view().metadata().size().

(I'm not really arguing against avoiding the memcpy() if size() == 0, just want to make sure the consequences of avoiding the memcpy() when size() == 0 are understood)

One possibility to be foolproof would be to zero the memory in hostData in this case (with std::memset()).

@AdrianoDee
Contributor

@makortel so I went checking to be sure:

But (sorry for the long list above, I spotted this only after finishing the list) we do access offsetBPIX2() when allocating the needed memory, and this could lead to unexpected behavior. I imagine it would be good to add a check just before that (regardless of the fix here):

diff --git a/RecoTracker/PixelSeeding/plugins/alpaka/CAHitNtupletGenerator.cc b/RecoTracker/PixelSeeding/plugins/alpaka/CAHitNtupletGenerator.cc
index c6615c08d73..d21ed39afc6 100644
--- a/RecoTracker/PixelSeeding/plugins/alpaka/CAHitNtupletGenerator.cc
+++ b/RecoTracker/PixelSeeding/plugins/alpaka/CAHitNtupletGenerator.cc
@@ -298,6 +298,10 @@ namespace ALPAKA_ACCELERATOR_NAMESPACE {
     using GPUKernels = CAHitNtupletGeneratorKernels<TrackerTraits>;
 
     TrackSoA tracks(queue);
+    
+    // Don't bother if fewer than 2 hits
+    if (hits_d.view().metadata().size() < 2) 
+       return tracks;
 
     GPUKernels kernels(m_params, hits_d.view().metadata().size(), hits_d.offsetBPIX2(), queue);
 

@mmusich
Contributor Author

mmusich commented Aug 29, 2024

Another test that is somewhat worrisome is that if I feed the "alpaka-migrated" menu from CMSHLT-3284 into the relval machinery via:

cmsrel CMSSW_14_0_15
cd CMSSW_14_0_15/src
cmsenv
git cms-addpkg HLTrigger/Configuration
git cms-addpkg Configuration/PyReleaseValidation
hltGetConfiguration /users/soohwan/HLT_140X/Alpaka/HIonV173/V10 \
   --globaltag auto:phase1_2024_realistic \
   --mc \
   --unprescale \
   --cff > "${CMSSW_BASE}"/src/HLTrigger/Configuration/python/HLT_User_cff.py
scram b -j 20   

and then apply the following patch:

diff --git a/Configuration/PyReleaseValidation/python/upgradeWorkflowComponents.py b/Configuration/PyReleaseValidation/python/upgradeWorkflowComponents.py
index 8a70a74aa0c..f9dc0a0397f 100644
--- a/Configuration/PyReleaseValidation/python/upgradeWorkflowComponents.py
+++ b/Configuration/PyReleaseValidation/python/upgradeWorkflowComponents.py
@@ -2865,7 +2865,7 @@ upgradeProperties[2017] = {
     '2022HI' : {
         'Geom' : 'DB:Extended',
         'GT':'auto:phase1_2022_realistic_hi',
-        'HLTmenu': '@fake2',
+        'HLTmenu': 'User',
         'Era':'Run3_pp_on_PbPb',
         'BeamSpot': 'DBrealistic',
         'ScenToRun' : ['GenSim','Digi','RecoNano','HARVESTNano','ALCA'],
@@ -2873,7 +2873,7 @@ upgradeProperties[2017] = {
     '2022HIRP' : {
         'Geom' : 'DB:Extended',
         'GT':'auto:phase1_2022_realistic_hi',
-        'HLTmenu': '@fake2',
+        'HLTmenu': 'User',
         'Era':'Run3_pp_on_PbPb_approxSiStripClusters',
         'BeamSpot': 'DBrealistic',
         'ScenToRun' : ['GenSim','Digi','RecoNano','HARVESTNano','ALCA'],
@@ -2881,7 +2881,7 @@ upgradeProperties[2017] = {
     '2023HI' : {
         'Geom' : 'DB:Extended',
         'GT':'auto:phase1_2023_realistic_hi',
-        'HLTmenu': '@fake2',
+        'HLTmenu': 'User',
         'Era':'Run3_pp_on_PbPb',
         'BeamSpot': 'DBrealistic',
         'ScenToRun' : ['GenSim','Digi','RecoNano','HARVESTNano','ALCA'],
@@ -2889,7 +2889,7 @@ upgradeProperties[2017] = {
     '2023HIRP' : {
         'Geom' : 'DB:Extended',
         'GT':'auto:phase1_2023_realistic_hi',
-        'HLTmenu': '@fake2',
+        'HLTmenu': 'User',
         'Era':'Run3_pp_on_PbPb_approxSiStripClusters',
         'BeamSpot': 'DBrealistic',
         'ScenToRun' : ['GenSim','Digi','RecoNano','HARVESTNano','ALCA'],

in a release that I prepared with the "would-be fix" discussed above, and then run:

  • runTheMatrix.py --what upgrade -l 15261.0 --maxSteps 2 (neutrino gun input)
  • runTheMatrix.py --what upgrade -l 15224.0 --maxSteps 2 (TTbar input)

the neutrino gun test dies at the second event (killed, I suppose, due to excessive memory use or some other infinite loop), whereas if I use HIon (the CUDA-based menu that we currently have in the release), both tests run correctly.

@mmusich
Contributor Author

mmusich commented Aug 29, 2024

But (sorry for the long list above, I spotted this only after finishing the list) we do access offsetBPIX2() when allocating the needed memory, and this could lead to unexpected behavior. I imagine it would be good to add a check just before that (regardless of the fix here):

thanks @AdrianoDee , I confirm with the additional changes at #45834 (comment) the test at #45834 (comment) runs fine.

@AdrianoDee
Contributor

Great, you're welcome.

In general I think whoever uses this SoA must check for the size being non-zero and we could leave the copy as is (without zero-ing the data). But also I imagine the cost of the std::memset here would be negligible (?).

@makortel
Contributor

In general I think whoever uses this SoA must check for the size being non-zero and we could leave the copy as is (without zero-ing the data).

If we leave it to this option, I think it would be good to document this requirement in the TrackingRecHit{Host,Device,SoACollection}.h files.

But also I imagine the cost of the std::memset here would be negligible (?).

This would be my expectation as well, as the size of the buffer should be fairly small, and the calls should be fairly infrequent.

@fwyzard
Contributor

fwyzard commented Sep 2, 2024

In general I think whoever uses this SoA must check for the size being non-zero and we could leave the copy as is (without zero-ing the data). But also I imagine the cost of the std::memset here would be negligible (?).

I would prefer to find a way to make the TrackingRecHitsSoACollection-related classes self-contained, so that downstream users do not need to check whether they are valid before accessing them.

@mmusich
Contributor Author

mmusich commented Sep 6, 2024

+hlt

@makortel
Contributor

makortel commented Sep 6, 2024

+heterogeneous

@jfernan2
Contributor

jfernan2 commented Sep 6, 2024

+1

@cmsbuild
Contributor

cmsbuild commented Sep 6, 2024

This issue is fully signed and ready to be closed.

@makortel
Contributor

makortel commented Sep 6, 2024

@cmsbuild, please close

@cmsbuild cmsbuild closed this as completed Sep 6, 2024