Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replay failures after #41198 introduced #41246

Closed
rappoccio opened this issue Mar 31, 2023 · 56 comments
Closed

Replay failures after #41198 introduced #41246

rappoccio opened this issue Mar 31, 2023 · 56 comments

Comments

@rappoccio
Copy link
Contributor

rappoccio commented Mar 31, 2023

There are currently T0 replay failures that have begun in 13_0_2:

An exception of category 'FatalRootError' occurred while
   [0] Calling InputSource::getNextItemType
   Additional Info:
      [a] Fatal Root Error: @SUB=TBufferFile::CheckByteCount
object of class vector<l1t::MuonShower> read too many bytes: 119 instead of 118

This is likely caused by the introduction of #41198

@dinyar please take a look, this is very urgent.

@rappoccio
Copy link
Contributor Author

urgent

@cmsbuild
Copy link
Contributor

A new Issue was created by @rappoccio .

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@rappoccio
Copy link
Contributor Author

assign @dinyar, @aloeliger, @cms-sw/l1-l2

@Dr15Jones
Copy link
Contributor

@rappoccio is the replay starting from a ROOT file or from a Streamer file? If the latter, than the there was a schema change to a data product and Streamers do not support such a change.

@rappoccio
Copy link
Contributor Author

assign l1

@cmsbuild
Copy link
Contributor

New categories assigned: l1

@epalencia,@aloeliger,@cecilecaillol you have been requested to review this Pull request/Issue and eventually sign? Thanks

@rappoccio
Copy link
Contributor Author

@Dr15Jones that's a good question. I will ask.

@Dr15Jones
Copy link
Contributor

Looking at the blame log for the class in question
https://github.com/cms-sw/cmssw/blame/master/DataFormats/L1Trigger/interface/MuonShower.h

it clearly shows that an additional member data of type bool was added to the class. That would account for why it tried to read 1 byte too many.

@aloeliger
Copy link
Contributor

@Dr15Jones Thanks for the quick check on that. Do these changes need to be reverted? This is part of a large series of changes introduced by @eyigitba @elfontan and @dinyar for muon shower triggering at L1.

@Dr15Jones
Copy link
Contributor

Do these changes need to be reverted?

Depends on what you want to do. Streamer files can only safely be read by the exact same version of CMSSW as was used to create them. That does not mean you can't change the online format, it only means you can't read older files after such a change. That means changes to data products used online must be coordinated between HLT and T0.

@davidlange6
Copy link
Contributor

davidlange6 commented Mar 31, 2023 via email

@aloeliger
Copy link
Contributor

Sorry, for my edification, is there an implied mismatch in CMSSW version between HLT and tier 0 right now?

@rappoccio
Copy link
Contributor Author

No, this is occurring in a replay. We cannot deploy 13_0_2 without a successful test at the T0.

@makortel
Copy link
Contributor

@rappoccio Could you point to the full details of the replay? I'm puzzled why this particular data format class is causing trouble (it is not part of RAW nor Scouting, as far as I can tell)

@makortel
Copy link
Contributor

For the record, the culprit PR in 13_0_X is #41198

@rappoccio rappoccio changed the title Replay failures after #41207 introduced Replay failures after #41198 introduced Mar 31, 2023
@aloeliger
Copy link
Contributor

Okay. Let me try to work this from the L1 side a bit and see if I can't locate Dinyar.

@dinyar
Copy link
Contributor

dinyar commented Mar 31, 2023

Hi,

Right, so from what I read in the thread this is expected for streamer files if a data member is added (though if I understood @makortel correctly it might actually not be entirely expected). The added data member is required for a new type of muon shower that some people would like to trigger on. If there's a way to add this without causing these issues I'm happy to implement them, but from my limited understanding also iorules wouldn't help here, correct?

Cheers,
Dinyar

@Dr15Jones
Copy link
Contributor

@dinyar

If there's a way to add this without causing these issues I'm happy to implement them, but from my limited understanding also iorules wouldn't help here, correct?

Correct, iorules would not work here as Streamer files can not make use of any scheme evaluation (either the automatic type nor the iorules) from ROOT.

@makortel
Copy link
Contributor

@dinyar Could you elaborate where the l1t::MuonShower class is intended to be used? Like, is it meant to be used transiently in HLT, transiently in prompt reco, persisted in RECO (really AOD) files, created in HLT and delivered to Tier0, something else?

@dinyar
Copy link
Contributor

dinyar commented Mar 31, 2023

Sure, we certainly need it in the DQM, similar for the HLT (the showers are used as an input to the uGT emulator which runs as part of the HLT). I'm not really sure whether it's used in prompt reco, but I think that it's persisted in AOD. The main use case from my perspective is in any case DQM and HLT, but I e.g. don't know whether HLT creates them and delivers them to T0 or whether they're unpacked from RAW at T0 a second time (I don't have much knowledge of how things work at T0 unfortunately).

@davidlange6
Copy link
Contributor

davidlange6 commented Mar 31, 2023 via email

@makortel
Copy link
Contributor

Sorry. I don't agree. Failures for streamer incompatibility are ok and should be worked around in the tier0 tests

I was replying along the same line. This is not the first time Tier0 replay fails because of data format changes, e.g. every change in Scouting data formats makes the replay (that uses Scouting streamer files) to fail. We have worked around those in the past. Is something preventing that now?

@francescobrivio
Copy link
Contributor

I agree with @davidlange6
I think this issue was simply due to the fact that streamer format changed between 13_0_0 and 13_0_2 and to replay old runs in the newer release Tier0 should just repack using the older release.
And that's what @germanfgv did already! 😄

@rappoccio
Copy link
Contributor Author

Sorry. I don't agree. Failures for streamer incompatibility are ok and should be worked around in the tier0 tests On Mar 31, 2023 3:36 PM, rappoccio @.> wrote: No, this is occurring in a replay. We cannot deploy 13_0_2 without a successful test at the T0. — Reply to this email directly, view it on GitHub<#41246 (comment)>, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ABGPFQ62JYZAPORKLYPW3NTW63MWHANCNFSM6AAAAAAWOUXWUQ. You are receiving this because you commented.Message ID: @.>

I've discussed with T0, we had a bit of a miscommunication. They used 13_0_0 ONLY for the repacker, not the full replay. The jobs are subsequently passing, so this should be fine. I will close the issue. Apologies for the miscommunication, it looks like we're okay now.

@makortel
Copy link
Contributor

makortel commented Apr 3, 2023

Do we have any production release with this issue? Or just prereleases?

The problem affects all releases after (and including) 13_0_0_pre3 (where we updated ROOT to 6.26, also the release the file read in #41246 (comment) was produced with), and unfortunately that includes 13_0_X production releases.

@smuzaffar (@cms-sw/externals-l2) we'd need to update ROOT at minimum to include root-project/root#11446, or to v6-26-08 or later in both 13_0_X and 13_1_X as soon as possible

@smuzaffar
Copy link
Contributor

@makortel , we are already using latest root 6-26-10 + fixes from the patches branch in 13.1.X IBs. The only changes we are missing are the 3 commits pushed today. If we see this issue in default 13.1.X IBs then may be root-project/root#11446 was not backported to 6.26 patches branch?

@smuzaffar
Copy link
Contributor

I see that root 6.26 patches branch is missing the fix in https://github.com/root-project/root/blob/v6-26-00-patches/io/io/src/TStreamerInfo.cxx . @pcanal , can you please confirm that root-project/root#11446 was not back ported to root 6.26 ?

@makortel
Copy link
Contributor

makortel commented Apr 3, 2023

Ah, good point, I was sloppy when checking the ROOT versions. But 13_0_0 still uses 6.26.07 (plus the patches), right?

@smuzaffar
Copy link
Contributor

yes, 13.0.X is using old version.

@smuzaffar
Copy link
Contributor

sorry my bad root-project/root#11446 is part of root 6.26 patches branch https://github.com/root-project/root/commits/v6-26-00-patches/io/io/src/TStreamerInfo.cxx .

So basically we just need to backport the root from 13.1.X in to 13.0.X

@makortel
Copy link
Contributor

makortel commented Apr 3, 2023

So basically we just need to backport the root from 13.1.X in to 13.0.X

Right

@smuzaffar
Copy link
Contributor

cms-sw/cmsdist#8421 backports root 6.26 changes from CMSSW_13_1_X to CMSSW_13_0_X

@francescobrivio
Copy link
Contributor

sorry my bad root-project/root#11446 is part of root 6.26 patches branch https://github.com/root-project/root/commits/v6-26-00-patches/io/io/src/TStreamerInfo.cxx .

So basically we just need to backport the root from 13.1.X in to 13.0.X

Hi @smuzaffar, but the problem reported by Marco in #41246 (comment) (and reproducible with Matti's recipe in #41246 (comment)) is also showing up in 13_1_X IB (also in the ROOT6 IBs)...what am I missing?

@smuzaffar
Copy link
Contributor

@francescobrivio , if I understand correctly the #41246 (comment) comment then the problem was how the file was writing. I think it was written by root version without the fix. Including newer root means new files will be written using fixed version of root.

@pcanal
Copy link
Contributor

pcanal commented Apr 4, 2023

With scanfile.C.txt (which of course can be renamed :))
you can use:

root.exe -b -l -q filename.root 'scanfile.C.txt(gFile) | sort -u

To get a list of the class (names) that do not have a StreamerInfo in the file but probably should. (i.e. would cause a problem if they schema evolve from the schema seen in those older CMSSW version with ROOT v6.27 and older).

@perrotta
Copy link
Contributor

perrotta commented Apr 7, 2023

@smuzaffar @makortel if I understand it correctly with the merge of cms-sw/cmsdist#8421 in CMSSW_13_0_X this issue can be considered solved, and therefore closed here in github (unless you want to keep it open for a follow up). Can you please confirm?

@makortel
Copy link
Contributor

makortel commented Apr 7, 2023

@perrotta The immediate issue seems to be resolved, but there will be some longer-term follow ups that would be useful to record in GitHub. I could open a new issue for those though.

@makortel
Copy link
Contributor

I opened a new issue #41348 where I'll collect more information of where we are or have been missing StreamerInfo. Given that the most urgent issue was solved, I think we can now close this issue.

@makortel
Copy link
Contributor

+core

@makortel
Copy link
Contributor

@cmsbuild, please close

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests