OneToManyAssoc assertion failure in HLT menu in CMSSW_14_0_15
#45834
Comments
assign hlt, heterogeneous, reconstruction |
New categories assigned: hlt,heterogeneous,reconstruction @Martin-Grunewald,@mmusich,@fwyzard,@jfernan2,@makortel,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks |
cms-bot internal usage |
A new Issue was created by @mmusich. @Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
Errata corrige:
It actually does:
diff --git a/DataFormats/TrackingRecHitSoA/interface/alpaka/TrackingRecHitsSoACollection.h b/DataFormats/TrackingRecHitSoA/interface/alpaka/TrackingRecHitsSoACollection.h
index da3c3bef392..14d0a2e1aa8 100644
--- a/DataFormats/TrackingRecHitSoA/interface/alpaka/TrackingRecHitsSoACollection.h
+++ b/DataFormats/TrackingRecHitSoA/interface/alpaka/TrackingRecHitsSoACollection.h
@@ -41,16 +41,6 @@ namespace cms::alpakatools {
assert(deviceData.nHits() == hostData.nHits());
assert(deviceData.offsetBPIX2() == hostData.offsetBPIX2());
#endif
- // Update the contents address of the phiBinner histo container after the copy from device happened
- alpaka::wait(queue);
- typename TrackingRecHitSoA<TrackerTraits>::PhiBinnerView pbv;
- pbv.assoc = &(hostData.view().phiBinner());
- pbv.offSize = -1;
- pbv.offStorage = nullptr;
- pbv.contentSize = hostData.nHits();
- pbv.contentStorage = hostData.view().phiBinnerStorage();
- hostData.view().phiBinner().initStorage(pbv);
-
return hostData;
}
};
followed by |
With this change:
diff --git a/DataFormats/TrackingRecHitSoA/interface/alpaka/TrackingRecHitsSoACollection.h b/DataFormats/TrackingRecHitSoA/interface/alpaka/TrackingRecHitsSoACollection.h
index da3c3bef392..5bdb74fbb6e 100644
--- a/DataFormats/TrackingRecHitSoA/interface/alpaka/TrackingRecHitsSoACollection.h
+++ b/DataFormats/TrackingRecHitSoA/interface/alpaka/TrackingRecHitsSoACollection.h
@@ -34,6 +34,11 @@ namespace cms::alpakatools {
template <typename TQueue>
static auto copyAsync(TQueue& queue, TrackingRecHitDevice<TrackerTraits, TDevice> const& deviceData) {
TrackingRecHitHost<TrackerTraits> hostData(queue, deviceData.view().metadata().size());
+
+ // Don't bother if zero hits
+ if (deviceData.view().metadata().size() == 0)
+ return hostData;
+
alpaka::memcpy(queue, hostData.buffer(), deviceData.buffer());
#ifdef GPU_DEBUG
printf("TrackingRecHitsSoACollection: I'm copying to host.\n");
I tested successfully (on
I'll let the experts comment on whether that would be enough of a protection and whether more tests are needed |
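The early return proposed above can be illustrated outside CMSSW with a minimal sketch. Everything here is a hypothetical stand-in (`DeviceHitsSketch`, `HostHitsSketch`, `copyToHostSketch`, `binnerReady`); plain `std::vector` buffers replace the alpaka buffers and queue, so this only shows the control flow, not the real `copyAsync`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the device-side collection (not the CMSSW type).
struct DeviceHitsSketch {
  std::vector<float> buffer;  // plays the role of the device buffer
  std::size_t nHits = 0;
};

// Hypothetical stand-in for the host-side collection (not the CMSSW type).
struct HostHitsSketch {
  std::vector<float> buffer;
  std::size_t nHits = 0;
  bool binnerReady = false;  // set only when the post-copy fix-up has run
};

// Copy to host, skipping both the copy and the post-copy fix-up for empty input.
inline HostHitsSketch copyToHostSketch(DeviceHitsSketch const& device) {
  assert(device.nHits <= device.buffer.size());

  HostHitsSketch host;
  host.nHits = device.nHits;
  host.buffer.assign(device.buffer.size(), 0.f);  // host storage starts zeroed

  // Don't bother if zero hits: nothing to copy, nothing to re-point.
  if (device.nHits == 0)
    return host;

  std::copy(device.buffer.begin(), device.buffer.end(), host.buffer.begin());
  host.binnerReady = true;  // stands in for re-initialising the binner storage
  return host;
}
```

With this shape, an empty input yields a host object whose storage is zeroed and whose fix-up step never ran, which is the situation downstream consumers would then have to tolerate.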
@AdrianoDee Do the consumers of If the (I'm not really arguing against avoiding the One possibility to be foolproof would be to zero the memory in |
@makortel so I went checking to be sure:
But (sorry for the long list above; I spotted this when I had just finished the list) we do access
diff --git a/RecoTracker/PixelSeeding/plugins/alpaka/CAHitNtupletGenerator.cc b/RecoTracker/PixelSeeding/plugins/alpaka/CAHitNtupletGenerator.cc
index c6615c08d73..d21ed39afc6 100644
--- a/RecoTracker/PixelSeeding/plugins/alpaka/CAHitNtupletGenerator.cc
+++ b/RecoTracker/PixelSeeding/plugins/alpaka/CAHitNtupletGenerator.cc
@@ -298,6 +298,10 @@ namespace ALPAKA_ACCELERATOR_NAMESPACE {
using GPUKernels = CAHitNtupletGeneratorKernels<TrackerTraits>;
TrackSoA tracks(queue);
+
+ // Don't bother if fewer than 2 hits
+ if (hits_d.view().metadata().size() < 2)
+ return tracks;
GPUKernels kernels(m_params, hits_d.view().metadata().size(), hits_d.offsetBPIX2(), queue);
|
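As a rough illustration of why a consumer-side guard on fewer than two hits is natural, here is a self-contained sketch. `buildDoubletsSketch` and `DoubletSketch` are hypothetical names and the pairing of neighbouring indices is a deliberate simplification; it is not the logic of `CAHitNtupletGeneratorKernels`.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical: a "doublet" is just a pair of hit indices.
using DoubletSketch = std::pair<std::size_t, std::size_t>;

// Pair each hit with its successor; with fewer than 2 hits there is nothing to pair.
inline std::vector<DoubletSketch> buildDoubletsSketch(std::size_t nHits) {
  std::vector<DoubletSketch> doublets;

  // Early return: also keeps the unsigned `nHits - 1` below from wrapping around.
  if (nHits < 2)
    return doublets;

  doublets.reserve(nHits - 1);
  for (std::size_t i = 0; i + 1 < nHits; ++i)
    doublets.emplace_back(i, i + 1);

  assert(doublets.size() == nHits - 1);
  return doublets;
}
```

The guard makes the empty and single-hit cases return an empty result before any size arithmetic or downstream setup is attempted.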
Another test that is somewhat worrisome is that if I feed the "alpaka-migrated" menu from CMSHLT-3284 into the relval machinery via:
cmsrel CMSSW_14_0_15
cd CMSSW_14_0_15/src
cmsenv
git cms-addpkg HLTrigger/Configuration
git cms-addpkg Configuration/PyReleaseValidation
hltGetConfiguration /users/soohwan/HLT_140X/Alpaka/HIonV173/V10 \
--globaltag auto:phase1_2024_realistic \
--mc \
--unprescale \
--cff > "${CMSSW_BASE}"/src/HLTrigger/Configuration/python/HLT_User_cff.py
scram b -j 20
and then apply the following patch:
diff --git a/Configuration/PyReleaseValidation/python/upgradeWorkflowComponents.py b/Configuration/PyReleaseValidation/python/upgradeWorkflowComponents.py
index 8a70a74aa0c..f9dc0a0397f 100644
--- a/Configuration/PyReleaseValidation/python/upgradeWorkflowComponents.py
+++ b/Configuration/PyReleaseValidation/python/upgradeWorkflowComponents.py
@@ -2865,7 +2865,7 @@ upgradeProperties[2017] = {
'2022HI' : {
'Geom' : 'DB:Extended',
'GT':'auto:phase1_2022_realistic_hi',
- 'HLTmenu': '@fake2',
+ 'HLTmenu': 'User',
'Era':'Run3_pp_on_PbPb',
'BeamSpot': 'DBrealistic',
'ScenToRun' : ['GenSim','Digi','RecoNano','HARVESTNano','ALCA'],
@@ -2873,7 +2873,7 @@ upgradeProperties[2017] = {
'2022HIRP' : {
'Geom' : 'DB:Extended',
'GT':'auto:phase1_2022_realistic_hi',
- 'HLTmenu': '@fake2',
+ 'HLTmenu': 'User',
'Era':'Run3_pp_on_PbPb_approxSiStripClusters',
'BeamSpot': 'DBrealistic',
'ScenToRun' : ['GenSim','Digi','RecoNano','HARVESTNano','ALCA'],
@@ -2881,7 +2881,7 @@ upgradeProperties[2017] = {
'2023HI' : {
'Geom' : 'DB:Extended',
'GT':'auto:phase1_2023_realistic_hi',
- 'HLTmenu': '@fake2',
+ 'HLTmenu': 'User',
'Era':'Run3_pp_on_PbPb',
'BeamSpot': 'DBrealistic',
'ScenToRun' : ['GenSim','Digi','RecoNano','HARVESTNano','ALCA'],
@@ -2889,7 +2889,7 @@ upgradeProperties[2017] = {
'2023HIRP' : {
'Geom' : 'DB:Extended',
'GT':'auto:phase1_2023_realistic_hi',
- 'HLTmenu': '@fake2',
+ 'HLTmenu': 'User',
'Era':'Run3_pp_on_PbPb_approxSiStripClusters',
'BeamSpot': 'DBrealistic',
'ScenToRun' : ['GenSim','Digi','RecoNano','HARVESTNano','ALCA'],
in a release that I have prepared with the "would-be fix" that I discussed above, and then run:
the neutrino gun test dies at the second event (killed, I suppose, due to excessive memory use or some infinite loop), whereas if I use |
thanks @AdrianoDee , I confirm with the additional changes at #45834 (comment) the test at #45834 (comment) runs fine. |
Great, you're welcome. In general I think whoever uses this SoA must check that the size is non-zero, and we could leave the copy as is (without zeroing the data). But I also imagine the cost of the |
If we leave it to this option, I think it would be good to document this requirement in the
This would be my expectation as well, as the size of the buffer should be fairly small, and the calls should be fairly infrequent. |
I would prefer to find a way to make the |
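The two options discussed in the comments above (consumers check the size, or the storage is zeroed so that an empty collection is always self-consistent) can be contrasted in a small sketch. `AssocSketch` is a hypothetical, heavily simplified offsets-plus-content association, and `countEntriesChecked` is an illustrative consumer; neither reproduces the real `OneToManyAssoc` interface.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified one-to-many association:
// offsets[b] .. offsets[b + 1] delimit the slice of `content` belonging to bin b.
struct AssocSketch {
  std::vector<std::uint32_t> offsets;
  std::vector<std::uint32_t> content;

  // Zeroed offsets: every bin is empty and the structure is consistent from the start.
  explicit AssocSketch(std::size_t nBins) : offsets(nBins + 1, 0u) {}

  std::size_t nBins() const { return offsets.size() - 1; }

  std::size_t binSize(std::size_t b) const {
    assert(b + 1 < offsets.size());
    return offsets[b + 1] - offsets[b];
  }

  // Offsets must be non-decreasing and end exactly at the content size.
  bool consistent() const {
    return std::is_sorted(offsets.begin(), offsets.end()) && offsets.back() == content.size();
  }
};

// Consumer-side option: check the number of hits before touching the association.
inline std::size_t countEntriesChecked(AssocSketch const& assoc, std::size_t nHits) {
  if (nHits == 0)
    return 0;
  std::size_t total = 0;
  for (std::size_t b = 0; b < assoc.nBins(); ++b)
    total += assoc.binSize(b);
  return total;
}
```

In this sketch a zero-initialised association passes `consistent()` even when no hits were filled, so a consumer that forgets the size check still reads only empty bins; the consumer-side check avoids the read altogether.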
+hlt
|
+heterogeneous |
+1 |
This issue is fully signed and ready to be closed. |
@cmsbuild, please close |
During the final steps of testing of CMSSW_14_0_15, TSG / FOG encountered this runtime assertion:
While FOG (@mzarucki @trocino @vince502) could later produce a recipe to reproduce the exact crash, in the meantime it can be reproduced by using the following script:
A few facts that I tested:
USER_SCRAM_TARGET=x86-64-v3 OR NOT (so it doesn't seem to depend on the micro-architecture)
undoing locally the changes of https://github.com/cms-sw/cmssw/pull/45744/files doesn't solve the issue
CMSSW_14_0_15 relval wf 12861.402 (SingleNuE10_2024_GenSim+Digi_Patatrack_PixelOnlyAlpaka_2024+RecoNano_Patatrack_PixelOnlyAlpaka_2024+HARVESTNano_Patatrack_PixelOnlyAlpaka_202) and reported a crash: this seems to indicate the issue stems from a lack of pixel hits in the input events.
CMSSW_14_0_X_2024-08-28-2300