HLT crash in run-374803 (HcalUnpacker::unpackUTCA) #42960

Closed
mmusich opened this issue Oct 6, 2023 · 21 comments

Comments

@mmusich
Contributor

mmusich commented Oct 6, 2023

In run-374803 (HI collisions, release CMSSW_13_2_5_patch1), DAQ reported a CMSSW crash at HLT not seen previously, to my knowledge [link to HLT elog].
A piece of stack trace which is possibly relevant is in [1].
Once the corresponding error-stream files become available, we'll attempt to reproduce the crash offline.

FYI: @cms-sw/hlt-l2 @fwyzard @mzarucki @trtomei @trocino

[1]

A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.

Thu Oct 5 19:38:28 CEST 2023
Thread 19 (Thread 0x7f8268ffe700 (LWP 1202271) "cmsRun"):
#0 0x00007f8345c61a71 in poll () from /lib64/libc.so.6
#1 0x00007f833baf8d2f in full_read.constprop () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#2 0x00007f833bac075c in edm::service::InitRootHandlers::stacktraceFromThread() () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#3 0x00007f833bac11bb in sig_dostack_then_abort () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x00007f82eefd290a in HcalUHTRData::const_iterator::operator++() () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/libEventFilterHcalRawToDigi.so
#6 0x00007f82eefd7eac in HcalUnpacker::unpackUTCA(FEDRawData const&, HcalElectronicsMap const&, HcalUnpacker::Collections&, HcalUnpackerReport&, bool) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/libEventFilterHcalRawToDigi.so
#7 0x00007f82ca8f3709 in HcalRawToDigi::produce(edm::Event&, edm::EventSetup const&) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/pluginEventFilterHcalRawToDigiPlugins.so
#8 0x00007f834868b3ed in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/libFWCoreFramework.so
#9 0x00007f8348671b52 in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/libFWCoreFramework.so
#10 0x00007f83485fc5aa in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/libFWCoreFramework.so
#11 0x00007f83485fca58 in edm::Worker::RunModuleTask<edm::OccurrenceTraits::execute() () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/libFWCoreFramework.so
#12 0x00007f834856ea8f in tbb::detail::d1::function_task<edm::WaitingTaskHolder::doneWaiting(std::__exception_ptr::exception_ptr)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/libFWCoreFramework.so
#13 0x00007f8346def2e4 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::outermost_worker_waiter> (t=0x7f825fb15000, waiter=..., this=0x7f8340b7e200) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_2_0_pre3-el8_amd64_gcc11/build/CMSSW_13_2_0_pre3-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-7e31093a7b4a477d01bc3946dd0bf612/tbb-v2021.8.0/src/tbb/task_dispatcher.h:322
#14 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::outermost_worker_waiter> (t=0x0, waiter=..., this=0x7f8340b7e200) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_2_0_pre3-el8_amd64_gcc11/build/CMSSW_13_2_0_pre3-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-7e31093a7b4a477d01bc3946dd0bf612/tbb-v2021.8.0/src/tbb/task_dispatcher.h:458
#15 tbb::detail::r1::arena::process (tls=..., this=<optimized out>) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_2_0_pre3-el8_amd64_gcc11/build/CMSSW_13_2_0_pre3-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-7e31093a7b4a477d01bc3946dd0bf612/tbb-v2021.8.0/src/tbb/arena.cpp:137
#16 tbb::detail::r1::market::process (this=<optimized out>, j=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_2_0_pre3-el8_amd64_gcc11/build/CMSSW_13_2_0_pre3-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-7e31093a7b4a477d01bc3946dd0bf612/tbb-v2021.8.0/src/tbb/market.cpp:599
#17 0x00007f8346df14a6 in tbb::detail::r1::rml::private_worker::run (this=0x7f8340b73a80) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_2_0_pre3-el8_amd64_gcc11/build/CMSSW_13_2_0_pre3-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-7e31093a7b4a477d01bc3946dd0bf612/tbb-v2021.8.0/src/tbb/private_server.cpp:271
#18 tbb::detail::r1::rml::private_worker::thread_routine (arg=0x7f8340b73a80) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_2_0_pre3-el8_amd64_gcc11/build/CMSSW_13_2_0_pre3-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-7e31093a7b4a477d01bc3946dd0bf612/tbb-v2021.8.0/src/tbb/private_server.cpp:221
#19 0x00007f8345f3f17a in start_thread () from /lib64/libpthread.so.0
#20 0x00007f8345c6cdf3 in clone () from /lib64/libc.so.6
[ message truncated - showing only crashed thread ] 
@cmsbuild
Contributor

cmsbuild commented Oct 6, 2023

A new Issue was created by @mmusich Marco Musich.

@antoniovilela, @smuzaffar, @makortel, @rappoccio, @sextonkennedy, @Dr15Jones can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@mmusich
Contributor Author

mmusich commented Oct 6, 2023

type hcal

@cmsbuild cmsbuild added the hcal label Oct 6, 2023
@mmusich
Contributor Author

mmusich commented Oct 6, 2023

assign hlt

(I'll let others assign it to other groups, if needed.)

@cms-sw/hcal-dpg-l2 FYI

@cmsbuild
Contributor

cmsbuild commented Oct 6, 2023

New categories assigned: hlt

@Martin-Grunewald,@mmusich,@missirol you have been requested to review this Pull request/Issue and eventually sign? Thanks

@missirol
Contributor

missirol commented Oct 6, 2023

The crash appears to be reproducible. Below is a recipe to reproduce it on lxplus with CMSSW_13_2_5_patch1 (no GPUs required).

./run.sh -r 374803 -t 1 -i /eos/cms/store/group/dpg_trigger/comm_trigger/TriggerStudiesGroup/FOG/error_stream -f run374803_ls0160_index000038_fu-c2b05-42-01_pid1201197 -o tmp -w

Content of run.sh:

#!/bin/bash

# defaults
showHelpMsg=false
runNumKeyword=-1
numThreadsDefault=32
numThreads="${numThreadsDefault}"
numStreamsDefault=0
numStreams="${numStreamsDefault}"
errDirPathDefault=/store/error_stream
errDirPath="${errDirPathDefault}"
outDirPathDefault=tmp
outDirPath="${outDirPathDefault}"
outDirOverWriteDefault=false
outDirOverWrite="${outDirOverWriteDefault}"
extraFilePatternDefault=""
extraFilePattern="${extraFilePatternDefault}"
noCmsRunDefault=false
noCmsRun="${noCmsRunDefault}"

# help message
usage() {
  cat <<@EOF
Description:
  This script can be used to run the HLT menu of a given run on error-stream files in FEDRawData (FRD) format.
  One cmsRun job per file is executed. The log files of all jobs are saved in an output directory.
  If a given job fails, the name of the corresponding log file is added to a file named "failed.txt" in the output directory.
  For all the files of a given run, the script uses the same HLT menu as used online during that run.

Example:
  The example below runs on all the files matching "/store/error_stream/run3676*/*fu-c2b04-32-01*.raw".
  Each cmsRun job uses 32 threads and 24 CMSSW streams. The results are saved in a directory named "tmp".
  If the output directory already exists, it will be overwritten, since "-w" is specified.

  > ./rerun_hlt_on_error_stream.sh -r 3676 -t 32 -s 24 -i /store/error_stream -f fu-c2b04-32-01 -o tmp -w

Options:
  -h, --help          Show this help message

  -r, --runNumber     Run number (a wildcard is appended: for example,
                      if "-r 123" is used, all runs matching "123*" will be considered)

  -t, --threads       Number of threads                     [Optional] [Default: ${numThreadsDefault}]

  -s, --streams       Number of CMSSW streams               [Optional] [Default: ${numStreamsDefault}]

  -i, --input-dir     Path to error-stream directory        [Optional] [Default: ${errDirPathDefault}]
                      containing one sub-folder per run

  -o, --output-dir    Path to output directory              [Optional] [Default: ${outDirPathDefault}]

  -w, --overwrite     Overwrite output directory
                      (if it already exists)                [Optional] [Default: ${outDirOverWriteDefault}]

  -f, --file-pattern  String to be used to restrict to
                      a subset of input files               [Optional] [Default: ${extraFilePatternDefault}]

  -n, --no-cmsRun     Do not run cmsRun job(s)              [Optional] [Default: ${noCmsRunDefault}]

  If optional arguments are not specified, the corresponding default values will be used.

@EOF
}

# command-line interface
while [[ $# -gt 0 ]]; do
  case "$1" in
    -h|--help) showHelpMsg=true; shift;;
    -r|--runNumber) runNumKeyword=$2; shift; shift;;
    -t|--threads) numThreads=$2; shift; shift;;
    -s|--streams) numStreams=$2; shift; shift;;
    -i|--input-dir) errDirPath=$2; shift; shift;;
    -o|--output-dir) outDirPath=$2; shift; shift;;
    -w|--overwrite) outDirOverWrite=true; shift;;
    -f|--file-pattern) extraFilePattern=$2; shift; shift;;
    -n|--no-cmsRun) noCmsRun=true; shift;;
    *) shift;;
  esac
done

# print help message
if [ "${showHelpMsg}" == true ]; then
  usage
  exit 0
fi

posNumRegex='^[0-9]+$'
if ! [[ "${runNumKeyword}" =~ ${posNumRegex} ]] ; then
  printf "\n\033[31m\033[1m%s\033[0m%s\n\n" ">> ERROR" " -- invalid run number (must be a positive integer without sign) [-r]: ${runNumKeyword}"
  exit 1
elif [ "${runNumKeyword}" -le 0 ]; then
  printf "\n\033[31m\033[1m%s\033[0m%s\n\n" ">> ERROR" " -- invalid run number (must be a number higher than zero) [-r]: ${runNumKeyword}"
  exit 1
fi

if ! [[ "${numThreads}" =~ ${posNumRegex} ]] ; then
  printf "\n\033[31m\033[1m%s\033[0m%s\n\n" ">> ERROR" " -- invalid number of threads per job (must be a positive integer without sign) [-t]: ${numThreads}"
  exit 1
elif [ "${numThreads}" -le 0 ]; then
  printf "\n\033[31m\033[1m%s\033[0m%s\n\n" ">> ERROR" " -- invalid number of threads per job (must be a number higher than zero) [-t]: ${numThreads}"
  exit 1
fi

if ! [[ "${numStreams}" =~ ${posNumRegex} ]] ; then
  printf "\n\033[31m\033[1m%s\033[0m%s\n\n" ">> ERROR" " -- invalid number of CMSSW streams per job (must be a positive integer without sign) [-s]: ${numStreams}"
  exit 1
fi

if [ ! -d "${errDirPath}" ]; then
  printf "\n\033[31m\033[1m%s\033[0m%s\n\n" ">> ERROR" " -- target input directory does not exist [-i]: ${errDirPath}"
  exit 1
fi

if [ -z "${CMSSW_BASE}" ]; then
  printf "\n\033[31m\033[1m%s\033[0m%s" ">> ERROR" " -- environment variable CMSSW_BASE not found"
  printf "%s\n" ": it is necessary to first set up the CMSSW environment"
  printf "%s\n\n" "            (for example via \"source setup.sh -r CMSSW_X_Y_Z\")"
  exit 1
fi

errDirAbsPath=$(readlink -e "${errDirPath}")
runDirPrePath="${errDirAbsPath}"/run"${runNumKeyword}"

if [ $(ls -d "${runDirPrePath}"* 2> /dev/null | wc -l) -eq 0 ]; then
  printf "\n\033[31m\033[1m%s\033[0m%s\n\n" ">> ERROR" " -- no input directories found: ${runDirPrePath}*"
  exit 1
fi

[ "${outDirOverWrite}" != true ] || (rm -rf "${outDirPath}")

if [ -d "${outDirPath}" ]; then
  printf "\n\033[31m\033[1m%s\033[0m%s\n\n" ">> ERROR" " -- target output directory already exists [-o]: ${outDirPath}"
  exit 1
fi

mkdir -p "${outDirPath}"
cd "${outDirPath}"

for dirPath in $(ls -d "${runDirPrePath}"*); do
  runNumber="${dirPath: -6}"
  echo "--------------------------------------------------"
  echo " run: ${runNumber}"
  echo "--------------------------------------------------"
  hltGetCmd="hltConfigFromDB --runNumber ${runNumber}"
  echo "${hltGetCmd} ..."
  hltCfg=run"${runNumber}"_cfg.py
  ${hltGetCmd} > "${hltCfg}"
  hltCfg=$(readlink -e "${hltCfg}")
  cat <<EOF >> "${hltCfg}"
import sys
if len(sys.argv) < 3:
    raise RuntimeError("one command-line argument required: path to file in FEDRawData (FRD) format")

process.source.fileListMode = True
process.source.fileNames = [sys.argv[2]]

process.options.numberOfThreads = ${numThreads}
process.options.numberOfStreams = ${numStreams}

del process.PrescaleService

del process.MessageLogger
process.load('FWCore.MessageService.MessageLogger_cfi')

process.EvFDaqDirector.buBaseDir = "${errDirAbsPath}"
process.EvFDaqDirector.runNumber = ${runNumber}

process.hltDQMFileSaverPB.runNumber = ${runNumber}

if hasattr(process, "hltOnlineBeamSpotESProducer"):
    process.hltOnlineBeamSpotESProducer.timeThreshold = int(1e6)

# remove paths containing OutputModules
streamPaths = [pathName for pathName in process.finalpaths_()]
for foo in streamPaths:
    process.__delattr__(foo)
EOF
  # array of non-empty FRD files
  frdFiles=($(cd "${dirPath}" ; find . -maxdepth 1 -size +0 -name '*.raw'))
  for frdFile in "${frdFiles[@]}"; do
    frdFileBasename=$(basename "${frdFile}")
    if [[ "${frdFileBasename}" != *"${extraFilePattern}"* ]]; then
      continue
    fi
    jobTag="${frdFileBasename::-4}"
    hltLog="${jobTag}".log
    frdFileAbsPath=$(readlink -e "${dirPath}"/"${frdFileBasename}")
    echo -e "\n${jobTag} ..."
    echo -e "# cmsRun ${hltCfg} ${frdFileAbsPath}\n" > "${hltLog}"
    if [ "${noCmsRun}" != true ]; then
      rm -rf run"${runNumber}" && mkdir -p run"${runNumber}"
      cmsRun "${hltCfg}" "${frdFileAbsPath}" &>> "${hltLog}"
      exitCode=$?
      [ ${exitCode} -eq 0 ] || echo "${hltLog}" >> failed.txt
      echo "${jobTag} ... done (exit code: ${exitCode})"
    fi
  done
  rm -rf run"${runNumber}"
  unset frdFile frdFiles
  unset runNumber hltCfg
done
unset dirPath

@abdoulline

@missirol thanks a lot! We'll try to track it down.
At the moment the assumption is that the unpacker may not be sufficiently protected against some unusual RAW data corruption, signs of which were spotted in this run...

@mmusich
Contributor Author

mmusich commented Oct 6, 2023

In case it helps, compiling with debug symbols, the segmentation fault originates here

bool isHeader() const { return ((*m_ptr) & 0x8000) != 0; }

when called from

for (++i; i != iend && !i.isHeader(); ++i) {
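As a rough illustration only (not the actual patch), the loop above can be modelled in Python. The sketch assumes 16-bit words in which bit 15 marks a header, as in `isHeader()`, and assumes, as one possible explanation of the crash, that the iterator may advance by more than one word at a time. The function name `payload_after_header`, the `step` parameter and the sample words are hypothetical.

```python
HEADER_BIT = 0x8000  # bit 15 set marks a channel header, as in isHeader()


def payload_after_header(words, start, step=1):
    """Collect the words following the header at `start`, up to the next header.

    `step` models an iterator that may advance by more than one word
    (an assumption made for this sketch).
    """
    payload = []
    pos = start + 1
    # Check the bound with '<' before reading the word: a position that has
    # jumped past the end (step > 1) never equals len(words), so a '!='
    # comparison against the end would not stop the loop.
    while pos < len(words) and not (words[pos] & HEADER_BIT):
        payload.append(words[pos])
        pos += step
    return payload


# Hypothetical buffer: one header followed by three payload words.
words = [0x8001, 0x0010, 0x0020, 0x0030]
print(payload_after_header(words, 0, step=2))
```

With `step=2` the position goes 1, 3, 5, so it never lands exactly on the end of the four-word buffer. An equality test against the end would let the loop run on, whereas the `<` test stops it.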

@abdoulline

@mmusich very kind of you, thanks!

@mmusich
Contributor Author

mmusich commented Oct 11, 2023

This happened again in run-374970, see link to HLT e-log.
The relevant part of the stack trace appears to be similar [1].
I attach for completeness the full stack trace from F3Mon: f3mon_logtable_2023-10-11T06 40 08.534Z.txt

[1]

A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.

Wed Oct 11 06:42:19 CEST 2023
Thread 11 (Thread 0x7f8acb9ff700 (LWP 4193159) "cmsRun"):
#0 0x00007f8b6e9baa71 in poll () from /lib64/libc.so.6
#1 0x00007f8b67995d2f in full_read.constprop () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#2 0x00007f8b6795d75c in edm::service::InitRootHandlers::stacktraceFromThread() () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#3 0x00007f8b6795e1bb in sig_dostack_then_abort () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x00007f8b1852690a in HcalUHTRData::const_iterator::operator++() () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/libEventFilterHcalRawToDigi.so
#6 0x00007f8b1852bb7a in HcalUnpacker::unpackUTCA(FEDRawData const&, HcalElectronicsMap const&, HcalUnpacker::Collections&, HcalUnpackerReport&, bool) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/libEventFilterHcalRawToDigi.so
#7 0x00007f8af3e47709 in HcalRawToDigi::produce(edm::Event&, edm::EventSetup const&) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/pluginEventFilterHcalRawToDigiPlugins.so
#8 0x00007f8b713e43ed in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/libFWCoreFramework.so
#9 0x00007f8b713cab52 in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/libFWCoreFramework.so
#10 0x00007f8b713555aa in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/libFWCoreFramework.so
#11 0x00007f8b71355a58 in edm::Worker::RunModuleTask<edm::OccurrenceTraits::execute() () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/libFWCoreFramework.so
#12 0x00007f8b712c7a8f in tbb::detail::d1::function_task<edm::WaitingTaskHolder::doneWaiting(std::__exception_ptr::exception_ptr)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/libFWCoreFramework.so
#13 0x00007f8b6fb482e4 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::outermost_worker_waiter> (t=0x7f8a64b34300, waiter=..., this=0x7f8b698b1e00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_2_0_pre3-el8_amd64_gcc11/build/CMSSW_13_2_0_pre3-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-7e31093a7b4a477d01bc3946dd0bf612/tbb-v2021.8.0/src/tbb/task_dispatcher.h:322
#14 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::outermost_worker_waiter> (t=0x0, waiter=..., this=0x7f8b698b1e00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_2_0_pre3-el8_amd64_gcc11/build/CMSSW_13_2_0_pre3-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-7e31093a7b4a477d01bc3946dd0bf612/tbb-v2021.8.0/src/tbb/task_dispatcher.h:458
#15 tbb::detail::r1::arena::process (tls=..., this=<optimized out>) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_2_0_pre3-el8_amd64_gcc11/build/CMSSW_13_2_0_pre3-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-7e31093a7b4a477d01bc3946dd0bf612/tbb-v2021.8.0/src/tbb/arena.cpp:137
#16 tbb::detail::r1::market::process (this=<optimized out>, j=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_2_0_pre3-el8_amd64_gcc11/build/CMSSW_13_2_0_pre3-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-7e31093a7b4a477d01bc3946dd0bf612/tbb-v2021.8.0/src/tbb/market.cpp:599
#17 0x00007f8b6fb4a4a6 in tbb::detail::r1::rml::private_worker::run (this=0x7f8b698a7c80) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_2_0_pre3-el8_amd64_gcc11/build/CMSSW_13_2_0_pre3-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-7e31093a7b4a477d01bc3946dd0bf612/tbb-v2021.8.0/src/tbb/private_server.cpp:271
#18 tbb::detail::r1::rml::private_worker::thread_routine (arg=0x7f8b698a7c80) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_2_0_pre3-el8_amd64_gcc11/build/CMSSW_13_2_0_pre3-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-7e31093a7b4a477d01bc3946dd0bf612/tbb-v2021.8.0/src/tbb/private_server.cpp:221
#19 0x00007f8b6ec9817a in start_thread () from /lib64/libpthread.so.0
#20 0x00007f8b6e9c5df3 in clone () from /lib64/libc.so.6
[ message truncated - showing only crashed thread ] 

@cms-sw/hcal-dpg-l2 do you have news about this?

@unreglicious

Hi,

would it be possible to convert these error streams to the standard .root format and put them somewhere on EOS?

This would allow us (HCAL operations) to digest the FED RAW data itself for any clues.

Thanks and cheers,
Pasha

@mmusich
Contributor Author

mmusich commented Oct 11, 2023

Hi @unreglicious ,
you can find the EDM-converted .root file for the affected LS at:

/eos/cms/store/group/dpg_trigger/comm_trigger/TriggerStudiesGroup/debug/374803/converted_run374803_ls0160_index000038.root

For the record I obtained this by using:

cmsRun convertFromRawToEDM.py /eos/cms/store/group/dpg_trigger/comm_trigger/TriggerStudiesGroup/FOG/error_stream/run374803/run374803_ls0160_index000038_fu-c2b05-42-01_pid1201197.raw converted_run374803_ls0160_index000038.root

where convertFromRawToEDM.py is [*]

[*]

import FWCore.ParameterSet.Config as cms
import os
import sys
import glob

if sys.argv[0] == 'cmsRun':
  INPUTS = [sys.argv[2]]
  outputFilePath = sys.argv[3]
else:
  INPUTS = [sys.argv[1]]
  outputFilePath = sys.argv[2]

process = cms.Process('TEST')
process.options.wantSummary = False
process.maxEvents.input = -1

inputFilePaths = []
for inp_i in INPUTS:
  for inp_j in glob.glob(inp_i):
    inp_j2 = 'file:'+inp_j if os.path.isfile(inp_j) else inp_j
    inputFilePaths.append(inp_j2)
inputFilePaths = sorted(list(set(inputFilePaths)))
print(inputFilePaths)

process.EvFDaqDirector = cms.Service('EvFDaqDirector')

process.source = cms.Source('FedRawDataInputSource',
  fileListMode = cms.untracked.bool(True),
  fileNames = cms.untracked.vstring(inputFilePaths)
)

process.edmOutput = cms.OutputModule('PoolOutputModule',
  dataset = cms.untracked.PSet(
    dataTier = cms.untracked.string('RAW')
  ),
  fileName = cms.untracked.string('file:'+outputFilePath)
)

process.outputPath = cms.EndPath(process.edmOutput)

@unreglicious

Thanks a lot, @mmusich! Both .raw and recipe are very helpful!

@abdoulline

PRs addressing the issue (amendments suggested by Jeremy Mans) submitted
#43011 (master)
#43012 (132X)

@mmusich
Contributor Author

mmusich commented Oct 13, 2023

PRs addressing the issue (amendments suggested by Jeremy Mans) submitted

Thank you @abdoulline

@mmusich
Contributor Author

mmusich commented Oct 13, 2023

assign reconstruction

  • the proposed fixes lie in the reco area

@cmsbuild
Contributor

New categories assigned: reconstruction

@jfernan2,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks

@mmusich
Contributor Author

mmusich commented Oct 13, 2023

For the record there was yet another occurrence in run-375055, see link to HLT e-log.
The relevant part of the stack trace appears to be similar [1].
I attach for completeness the full stack trace from F3Mon: f3mon_logtable_2023-10-13T06 47 59.369Z.txt

[1]

A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.

Thu Oct 12 18:57:35 CEST 2023
Thread 12 (Thread 0x7f7efbdff700 (LWP 3485309) "cmsRun"):
#0 0x00007f7f7ba0ea71 in poll () from /lib64/libc.so.6
#1 0x00007f7f7096cd2f in full_read.constprop () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#2 0x00007f7f7093475c in edm::service::InitRootHandlers::stacktraceFromThread() () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#3 0x00007f7f709351bb in sig_dostack_then_abort () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x00007f7f2553b90a in HcalUHTRData::const_iterator::operator++() () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/libEventFilterHcalRawToDigi.so
#6 0x00007f7f25540ec9 in HcalUnpacker::unpackUTCA(FEDRawData const&, HcalElectronicsMap const&, HcalUnpacker::Collections&, HcalUnpackerReport&, bool) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/libEventFilterHcalRawToDigi.so
#7 0x00007f7f00e5c709 in HcalRawToDigi::produce(edm::Event&, edm::EventSetup const&) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/pluginEventFilterHcalRawToDigiPlugins.so
#8 0x00007f7f7e4383ed in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/libFWCoreFramework.so
#9 0x00007f7f7e41eb52 in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/libFWCoreFramework.so
#10 0x00007f7f7e3a95aa in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/libFWCoreFramework.so
#11 0x00007f7f7e3a9a58 in edm::Worker::RunModuleTask<edm::OccurrenceTraits::execute() () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/libFWCoreFramework.so
#12 0x00007f7f7e31ba8f in tbb::detail::d1::function_task<edm::WaitingTaskHolder::doneWaiting(std::__exception_ptr::exception_ptr)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/libFWCoreFramework.so
#13 0x00007f7f7cb9c2e4 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::outermost_worker_waiter> (t=0x7f7e71576900, waiter=..., this=0x7f7f7692f700) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_2_0_pre3-el8_amd64_gcc11/build/CMSSW_13_2_0_pre3-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-7e31093a7b4a477d01bc3946dd0bf612/tbb-v2021.8.0/src/tbb/task_dispatcher.h:322
#14 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::outermost_worker_waiter> (t=0x0, waiter=..., this=0x7f7f7692f700) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_2_0_pre3-el8_amd64_gcc11/build/CMSSW_13_2_0_pre3-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-7e31093a7b4a477d01bc3946dd0bf612/tbb-v2021.8.0/src/tbb/task_dispatcher.h:458
#15 tbb::detail::r1::arena::process (tls=..., this=<optimized out>) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_2_0_pre3-el8_amd64_gcc11/build/CMSSW_13_2_0_pre3-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-7e31093a7b4a477d01bc3946dd0bf612/tbb-v2021.8.0/src/tbb/arena.cpp:137
#16 tbb::detail::r1::market::process (this=<optimized out>, j=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_2_0_pre3-el8_amd64_gcc11/build/CMSSW_13_2_0_pre3-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-7e31093a7b4a477d01bc3946dd0bf612/tbb-v2021.8.0/src/tbb/market.cpp:599
#17 0x00007f7f7cb9e4a6 in tbb::detail::r1::rml::private_worker::run (this=0x7f7f76923e00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_2_0_pre3-el8_amd64_gcc11/build/CMSSW_13_2_0_pre3-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-7e31093a7b4a477d01bc3946dd0bf612/tbb-v2021.8.0/src/tbb/private_server.cpp:271
#18 tbb::detail::r1::rml::private_worker::thread_routine (arg=0x7f7f76923e00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_2_0_pre3-el8_amd64_gcc11/build/CMSSW_13_2_0_pre3-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-7e31093a7b4a477d01bc3946dd0bf612/tbb-v2021.8.0/src/tbb/private_server.cpp:221
#19 0x00007f7f7bcec17a in start_thread () from /lib64/libpthread.so.0
#20 0x00007f7f7ba19df3 in clone () from /lib64/libc.so.6
[ message truncated - showing only crashed thread ] 
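
Since the traces from the three runs are compared by eye in this thread, a small helper could make that comparison less error-prone. The following is a minimal sketch, assuming the gdb-style frame format shown above (`#N 0x... in symbol () from library`). The function name `crash_frames` is hypothetical, and the library paths in the sample are shortened for readability.

```python
import re


def crash_frames(trace, count=2):
    """Return the symbols of the first `count` frames below '<signal handler called>'."""
    frames = [ln.strip() for ln in trace.splitlines() if ln.lstrip().startswith("#")]
    for idx, frame in enumerate(frames):
        if "<signal handler called>" in frame:
            below = frames[idx + 1 : idx + 1 + count]
            # Drop the frame number and address, keep the symbol up to ' from '.
            return [
                re.sub(r"^#\d+\s+(0x[0-9a-f]+ in )?", "", f).split(" from ")[0]
                for f in below
            ]
    return []


# Frames copied from the trace above, with library paths shortened.
sample = """\
#3 0x00007f7f709351bb in sig_dostack_then_abort () from pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x00007f7f2553b90a in HcalUHTRData::const_iterator::operator++() () from libEventFilterHcalRawToDigi.so
#6 0x00007f7f25540ec9 in HcalUnpacker::unpackUTCA(FEDRawData const&, HcalElectronicsMap const&, HcalUnpacker::Collections&, HcalUnpackerReport&, bool) () from libEventFilterHcalRawToDigi.so
"""
for symbol in crash_frames(sample):
    print(symbol)
```

Running the same function over each saved trace and comparing the returned lists would show whether the crashes share the same origin, independently of the load addresses, which differ from process to process.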

@jfernan2
Contributor

+1
Fixed by #43011

@mmusich
Contributor Author

mmusich commented Oct 24, 2023

+hlt

@cmsbuild
Contributor

This issue is fully signed and ready to be closed.

@makortel
Contributor

@cmsbuild, please close
