HLT crash in run-374803 (`HcalUnpacker::unpackUTCA`) #42960

mmusich · 2023-10-06T07:50:26Z

In run-374803 (HI collisions, release CMSSW_13_2_5_patch1), DAQ reported a CMSSW crash at HLT not seen previously, to my knowledge [link to HLT elog].
A piece of stack trace which is possibly relevant is in [1].
Once the corresponding error-stream files become available, we'll attempt to reproduce the crash offline.

FYI: @cms-sw/hlt-l2 @fwyzard @mzarucki @trtomei @trocino

[1]

A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.

Thu Oct 5 19:38:28 CEST 2023
Thread 19 (Thread 0x7f8268ffe700 (LWP 1202271) "cmsRun"):
#0 0x00007f8345c61a71 in poll () from /lib64/libc.so.6
#1 0x00007f833baf8d2f in full_read.constprop () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#2 0x00007f833bac075c in edm::service::InitRootHandlers::stacktraceFromThread() () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#3 0x00007f833bac11bb in sig_dostack_then_abort () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x00007f82eefd290a in HcalUHTRData::const_iterator::operator++() () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/libEventFilterHcalRawToDigi.so
#6 0x00007f82eefd7eac in HcalUnpacker::unpackUTCA(FEDRawData const&, HcalElectronicsMap const&, HcalUnpacker::Collections&, HcalUnpackerReport&, bool) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/libEventFilterHcalRawToDigi.so
#7 0x00007f82ca8f3709 in HcalRawToDigi::produce(edm::Event&, edm::EventSetup const&) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/pluginEventFilterHcalRawToDigiPlugins.so
#8 0x00007f834868b3ed in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/libFWCoreFramework.so
#9 0x00007f8348671b52 in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/libFWCoreFramework.so
#10 0x00007f83485fc5aa in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/libFWCoreFramework.so
#11 0x00007f83485fca58 in edm::Worker::RunModuleTask<edm::OccurrenceTraits::execute() () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/libFWCoreFramework.so
#12 0x00007f834856ea8f in tbb::detail::d1::function_task<edm::WaitingTaskHolder::doneWaiting(std::__exception_ptr::exception_ptr)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/libFWCoreFramework.so
#13 0x00007f8346def2e4 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::outermost_worker_waiter> (t=0x7f825fb15000, waiter=..., this=0x7f8340b7e200) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_2_0_pre3-el8_amd64_gcc11/build/CMSSW_13_2_0_pre3-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-7e31093a7b4a477d01bc3946dd0bf612/tbb-v2021.8.0/src/tbb/task_dispatcher.h:322
#14 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::outermost_worker_waiter> (t=0x0, waiter=..., this=0x7f8340b7e200) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_2_0_pre3-el8_amd64_gcc11/build/CMSSW_13_2_0_pre3-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-7e31093a7b4a477d01bc3946dd0bf612/tbb-v2021.8.0/src/tbb/task_dispatcher.h:458
#15 tbb::detail::r1::arena::process (tls=..., this=<optimized out>) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_2_0_pre3-el8_amd64_gcc11/build/CMSSW_13_2_0_pre3-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-7e31093a7b4a477d01bc3946dd0bf612/tbb-v2021.8.0/src/tbb/arena.cpp:137
#16 tbb::detail::r1::market::process (this=<optimized out>, j=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_2_0_pre3-el8_amd64_gcc11/build/CMSSW_13_2_0_pre3-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-7e31093a7b4a477d01bc3946dd0bf612/tbb-v2021.8.0/src/tbb/market.cpp:599
#17 0x00007f8346df14a6 in tbb::detail::r1::rml::private_worker::run (this=0x7f8340b73a80) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_2_0_pre3-el8_amd64_gcc11/build/CMSSW_13_2_0_pre3-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-7e31093a7b4a477d01bc3946dd0bf612/tbb-v2021.8.0/src/tbb/private_server.cpp:271
#18 tbb::detail::r1::rml::private_worker::thread_routine (arg=0x7f8340b73a80) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_2_0_pre3-el8_amd64_gcc11/build/CMSSW_13_2_0_pre3-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-7e31093a7b4a477d01bc3946dd0bf612/tbb-v2021.8.0/src/tbb/private_server.cpp:221
#19 0x00007f8345f3f17a in start_thread () from /lib64/libpthread.so.0
#20 0x00007f8345c6cdf3 in clone () from /lib64/libc.so.6
[ message truncated - showing only crashed thread ]

cmsbuild · 2023-10-06T07:50:50Z

A new Issue was created by @mmusich Marco Musich.

@antoniovilela, @smuzaffar, @makortel, @rappoccio, @sextonkennedy, @Dr15Jones can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

mmusich · 2023-10-06T07:50:51Z

type hcal

mmusich · 2023-10-06T07:51:23Z

assign hlt

(I'll let others assign other groups, if needed.)

@cms-sw/hcal-dpg-l2 FYI

cmsbuild · 2023-10-06T07:51:45Z

New categories assigned: hlt

@Martin-Grunewald,@mmusich,@missirol you have been requested to review this Pull request/Issue and eventually sign? Thanks

missirol · 2023-10-06T10:57:49Z

The crash appears to be reproducible. Below is a recipe to reproduce it on lxplus with CMSSW_13_2_5_patch1 (no GPUs required).

./run.sh -r 374803 -t 1 -i /eos/cms/store/group/dpg_trigger/comm_trigger/TriggerStudiesGroup/FOG/error_stream -f run374803_ls0160_index000038_fu-c2b05-42-01_pid1201197 -o tmp -w

Content of run.sh:

#!/bin/bash

# defaults
showHelpMsg=false
runNumKeyword=-1
numThreadsDefault=32
numThreads="${numThreadsDefault}"
numStreamsDefault=0
numStreams="${numStreamsDefault}"
errDirPathDefault=/store/error_stream
errDirPath="${errDirPathDefault}"
outDirPathDefault=tmp
outDirPath="${outDirPathDefault}"
outDirOverWriteDefault=false
outDirOverWrite="${outDirOverWriteDefault}"
extraFilePatternDefault=""
extraFilePattern="${extraFilePatternDefault}"
noCmsRunDefault=false
noCmsRun="${noCmsRunDefault}"

# help message
usage() {
  cat <<@EOF
Description:
  This script can be used to run the HLT menu of a given run on error-stream files in FEDRawData (FRD) format.
  One cmsRun job per file is executed. The log files of all jobs are saved in an output directory.
  If a given job fails, the name of the corresponding log file is added to a file named "failed.txt" in the output directory.
  For all the files of a given run, the script uses the same HLT menu as used online during that run.

Example:
  The example below runs on all the files matching "/store/error_stream/run3676*/*fu-c2b04-32-01*.raw".
  Each cmsRun job uses 32 threads and 24 CMSSW streams. The results are saved in a directory named "tmp".
  If the output directory already exists, it will be overwritten, since "-w" is specified.

  > ./rerun_hlt_on_error_stream.sh -r 3676 -t 32 -s 24 -i /store/error_stream -f fu-c2b04-32-01 -o tmp -w

Options:
  -h, --help          Show this help message

  -r, --runNumber     Run number (a wildcard is appended: for example,
                      if "-r 123" is used, all runs matching "123*" will be considered)

  -t, --threads       Number of threads                     [Optional] [Default: ${numThreadsDefault}]

  -s, --streams       Number of CMSSW streams               [Optional] [Default: ${numStreamsDefault}]

  -i, --input-dir     Path to error-stream directory        [Optional] [Default: ${errDirPathDefault}]
                      containing one sub-folder per run

  -o, --output-dir    Path to output directory              [Optional] [Default: ${outDirPathDefault}]

  -w, --overwrite     Overwrite output directory
                      (if it already exists)                [Optional] [Default: ${outDirOverWriteDefault}]

  -f, --file-pattern  String to be used to restrict to
                      a subset of input files               [Optional] [Default: ${extraFilePatternDefault}]

  -n, --no-cmsRun     Do not run cmsRun job(s)              [Optional] [Default: ${noCmsRunDefault}]

  If optional arguments are not specified, the corresponding default values will be used.

@EOF
}

# command-line interface
while [[ $# -gt 0 ]]; do
  case "$1" in
    -h|--help) showHelpMsg=true; shift;;
    -r|--runNumber) runNumKeyword=$2; shift; shift;;
    -t|--threads) numThreads=$2; shift; shift;;
    -s|--streams) numStreams=$2; shift; shift;;
    -i|--input-dir) errDirPath=$2; shift; shift;;
    -o|--output-dir) outDirPath=$2; shift; shift;;
    -w|--overwrite) outDirOverWrite=true; shift;;
    -f|--file-pattern) extraFilePattern=$2; shift; shift;;
    -n|--no-cmsRun) noCmsRun=true; shift;;
    *) shift;;
  esac
done

# print help message
if [ "${showHelpMsg}" == true ]; then
  usage
  exit 0
fi

posNumRegex='^[0-9]+$'
if ! [[ "${runNumKeyword}" =~ ${posNumRegex} ]] ; then
  printf "\n\033[31m\033[1m%s\033[0m%s\n\n" ">> ERROR" " -- invalid run number (must be a positive integer without sign) [-r]: ${runNumKeyword}"
  exit 1
elif [ "${runNumKeyword}" -le 0 ]; then
  printf "\n\033[31m\033[1m%s\033[0m%s\n\n" ">> ERROR" " -- invalid run number (must be a number higher than zero) [-r]: ${runNumKeyword}"
  exit 1
fi

if ! [[ "${numThreads}" =~ ${posNumRegex} ]] ; then
  printf "\n\033[31m\033[1m%s\033[0m%s\n\n" ">> ERROR" " -- invalid number of threads per job (must be a positive integer without sign) [-t]: ${numThreads}"
  exit 1
elif [ "${numThreads}" -le 0 ]; then
  printf "\n\033[31m\033[1m%s\033[0m%s\n\n" ">> ERROR" " -- invalid number of threads per job (must be a number higher than zero) [-t]: ${numThreads}"
  exit 1
fi

if ! [[ "${numStreams}" =~ ${posNumRegex} ]] ; then
  printf "\n\033[31m\033[1m%s\033[0m%s\n\n" ">> ERROR" " -- invalid number of CMSSW streams per job (must be a positive integer without sign) [-s]: ${numStreams}"
  exit 1
fi

if [ ! -d "${errDirPath}" ]; then
  printf "\n\033[31m\033[1m%s\033[0m%s\n\n" ">> ERROR" " -- target input directory does not exist [-i]: ${errDirPath}"
  exit 1
fi

if [ -z "${CMSSW_BASE}" ]; then
  printf "\n\033[31m\033[1m%s\033[0m%s" ">> ERROR" " -- environment variable CMSSW_BASE not found"
  printf "%s\n" ": it is necessary to first set up the CMSSW environment"
  printf "%s\n\n" "            (for example via \"source setup.sh -r CMSSW_X_Y_Z\")"
  exit 1
fi

errDirAbsPath=$(readlink -e "${errDirPath}")
runDirPrePath="${errDirAbsPath}"/run"${runNumKeyword}"

if [ $(ls -d "${runDirPrePath}"* 2> /dev/null | wc -l) -eq 0 ]; then
  printf "\n\033[31m\033[1m%s\033[0m%s\n\n" ">> ERROR" " -- no input directories found: ${runDirPrePath}*"
  exit 1
fi

[ "${outDirOverWrite}" != true ] || (rm -rf "${outDirPath}")

if [ -d "${outDirPath}" ]; then
  printf "\n\033[31m\033[1m%s\033[0m%s\n\n" ">> ERROR" " -- target output directory already exists [-o]: ${outDirPath}"
  exit 1
fi

mkdir -p "${outDirPath}"
cd "${outDirPath}"

for dirPath in $(ls -d "${runDirPrePath}"*); do
  runNumber="${dirPath: -6}"
  echo "--------------------------------------------------"
  echo " run: ${runNumber}"
  echo "--------------------------------------------------"
  hltGetCmd="hltConfigFromDB --runNumber ${runNumber}"
  echo "${hltGetCmd} ..."
  hltCfg=run"${runNumber}"_cfg.py
  ${hltGetCmd} > "${hltCfg}"
  hltCfg=$(readlink -e "${hltCfg}")
  cat <<EOF >> "${hltCfg}"
import sys
if len(sys.argv) < 3:
    raise RuntimeError("one command-line argument required: path to file in FEDRawData (FRD) format")

process.source.fileListMode = True
process.source.fileNames = [sys.argv[2]]

process.options.numberOfThreads = ${numThreads}
process.options.numberOfStreams = ${numStreams}

del process.PrescaleService

del process.MessageLogger
process.load('FWCore.MessageService.MessageLogger_cfi')

process.EvFDaqDirector.buBaseDir = "${errDirAbsPath}"
process.EvFDaqDirector.runNumber = ${runNumber}

process.hltDQMFileSaverPB.runNumber = ${runNumber}

if hasattr(process, "hltOnlineBeamSpotESProducer"):
    process.hltOnlineBeamSpotESProducer.timeThreshold = int(1e6)

# remove paths containing OutputModules
streamPaths = [pathName for pathName in process.finalpaths_()]
for foo in streamPaths:
    process.__delattr__(foo)
EOF
  # array of non-empty FRD files
  frdFiles=($(cd "${dirPath}" ; find . -maxdepth 1 -type f -size +0 -name '*.raw'))
  for frdFile in "${frdFiles[@]}"; do
    frdFileBasename=$(basename "${frdFile}")
    if [[ "${frdFileBasename}" != *"${extraFilePattern}"* ]]; then
      continue
    fi
    jobTag="${frdFileBasename::-4}"
    hltLog="${jobTag}".log
    frdFileAbsPath=$(readlink -e "${dirPath}"/"${frdFileBasename}")
    echo -e "\n${jobTag} ..."
    echo -e "# cmsRun ${hltCfg} ${frdFileAbsPath}\n" > "${hltLog}"
    if [ "${noCmsRun}" != true ]; then
      rm -rf run"${runNumber}" && mkdir -p run"${runNumber}"
      cmsRun "${hltCfg}" "${frdFileAbsPath}" &>> "${hltLog}"
      exitCode=$?
      [ ${exitCode} -eq 0 ] || echo "${hltLog}" >> failed.txt
      echo "${jobTag} ... done (exit code: ${exitCode})"
    fi
  done
  rm -rf run"${runNumber}"
  unset frdFile frdFiles
  unset runNumber hltCfg
done
unset dirPath

abdoulline · 2023-10-06T11:29:01Z

@missirol thanks a lot! We'll try to track it down.
At the moment the assumption is that the unpacker may not be sufficiently protected against some unusual RAW data corruption, signs of which were spotted in this run.

mmusich · 2023-10-06T11:58:17Z

In case it helps, compiling with debug symbols, the segmentation fault originates here

cmssw/EventFilter/HcalRawToDigi/interface/HcalUHTRData.h

Line 47 in 6ba4223

bool isHeader() const { return ((*m_ptr) & 0x8000) != 0; }

when called from

cmssw/EventFilter/HcalRawToDigi/src/HcalUnpacker.cc

Line 758 in 6ba4223

for (++i; i != iend && !i.isHeader(); ++i) {
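Editorial sketch (Python, not CMSSW code): the loop above dereferences the iterator in `isHeader()` and relies on `i != iend` to stop. A minimal model of one way such a scan can run off the buffer, assuming (hypothetically) that the iterator may advance by more than one 16-bit word per step, so that it can jump past the end without ever being equal to it. The names `scan_channel` and `stride` are illustrative and do not come from the unpacker; only the `0x8000` mask is taken from the line quoted above. This is not the change made in the actual fix.

```python
HEADER_BIT = 0x8000  # same mask as isHeader() in the snippet quoted above


def is_header(word):
    """True if the 16-bit word has its header bit set."""
    return (word & HEADER_BIT) != 0


def scan_channel(words, start, stride=1):
    """Collect the payload words that follow the header at index `start`.

    Stops at the next header word or at the end of the buffer, whichever
    comes first. `stride` is a hypothetical per-step advance, used here
    only to model an iterator that can skip more than one word.
    Returns (payload, index_where_the_scan_stopped).
    """
    payload = []
    i = start + stride
    # '<' instead of '!=': an index that jumps past the end still terminates,
    # and words[i] is never read out of range.
    while i < len(words) and not is_header(words[i]):
        payload.append(words[i])
        i += stride
    return payload, min(i, len(words))


# Well-formed data: two channels, each starting with a header word.
good = [0x8001, 0x0012, 0x0034, 0x8002, 0x0056]
print(scan_channel(good, 0))

# Truncated data with a stride of 2: the index overshoots the end,
# but the bounded comparison stops the scan cleanly.
truncated = [0x8001, 0x0012, 0x0034]
print(scan_channel(truncated, 0, stride=2))
```

With an equality-only end check, the second call would keep reading past the buffer; the bounded comparison is the kind of extra protection against corrupted data that such a loop needs.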

abdoulline · 2023-10-06T12:01:01Z

@mmusich very kind of you, thanks!

mmusich · 2023-10-11T06:47:03Z

This happened again in run-374970, see link to HLT e-log.
The relevant part of the stack trace appears to be similar [1].
I attach for completeness the full stack trace from F3Mon: f3mon_logtable_2023-10-11T06 40 08.534Z.txt

[1]

A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.

Wed Oct 11 06:42:19 CEST 2023
Thread 11 (Thread 0x7f8acb9ff700 (LWP 4193159) "cmsRun"):
#0 0x00007f8b6e9baa71 in poll () from /lib64/libc.so.6
#1 0x00007f8b67995d2f in full_read.constprop () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#2 0x00007f8b6795d75c in edm::service::InitRootHandlers::stacktraceFromThread() () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#3 0x00007f8b6795e1bb in sig_dostack_then_abort () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x00007f8b1852690a in HcalUHTRData::const_iterator::operator++() () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/libEventFilterHcalRawToDigi.so
#6 0x00007f8b1852bb7a in HcalUnpacker::unpackUTCA(FEDRawData const&, HcalElectronicsMap const&, HcalUnpacker::Collections&, HcalUnpackerReport&, bool) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/libEventFilterHcalRawToDigi.so
#7 0x00007f8af3e47709 in HcalRawToDigi::produce(edm::Event&, edm::EventSetup const&) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/pluginEventFilterHcalRawToDigiPlugins.so
#8 0x00007f8b713e43ed in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/libFWCoreFramework.so
#9 0x00007f8b713cab52 in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/libFWCoreFramework.so
#10 0x00007f8b713555aa in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/libFWCoreFramework.so
#11 0x00007f8b71355a58 in edm::Worker::RunModuleTask<edm::OccurrenceTraits::execute() () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/libFWCoreFramework.so
#12 0x00007f8b712c7a8f in tbb::detail::d1::function_task<edm::WaitingTaskHolder::doneWaiting(std::__exception_ptr::exception_ptr)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/libFWCoreFramework.so
#13 0x00007f8b6fb482e4 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::outermost_worker_waiter> (t=0x7f8a64b34300, waiter=..., this=0x7f8b698b1e00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_2_0_pre3-el8_amd64_gcc11/build/CMSSW_13_2_0_pre3-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-7e31093a7b4a477d01bc3946dd0bf612/tbb-v2021.8.0/src/tbb/task_dispatcher.h:322
#14 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::outermost_worker_waiter> (t=0x0, waiter=..., this=0x7f8b698b1e00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_2_0_pre3-el8_amd64_gcc11/build/CMSSW_13_2_0_pre3-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-7e31093a7b4a477d01bc3946dd0bf612/tbb-v2021.8.0/src/tbb/task_dispatcher.h:458
#15 tbb::detail::r1::arena::process (tls=..., this=<optimized out>) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_2_0_pre3-el8_amd64_gcc11/build/CMSSW_13_2_0_pre3-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-7e31093a7b4a477d01bc3946dd0bf612/tbb-v2021.8.0/src/tbb/arena.cpp:137
#16 tbb::detail::r1::market::process (this=<optimized out>, j=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_2_0_pre3-el8_amd64_gcc11/build/CMSSW_13_2_0_pre3-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-7e31093a7b4a477d01bc3946dd0bf612/tbb-v2021.8.0/src/tbb/market.cpp:599
#17 0x00007f8b6fb4a4a6 in tbb::detail::r1::rml::private_worker::run (this=0x7f8b698a7c80) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_2_0_pre3-el8_amd64_gcc11/build/CMSSW_13_2_0_pre3-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-7e31093a7b4a477d01bc3946dd0bf612/tbb-v2021.8.0/src/tbb/private_server.cpp:271
#18 tbb::detail::r1::rml::private_worker::thread_routine (arg=0x7f8b698a7c80) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_2_0_pre3-el8_amd64_gcc11/build/CMSSW_13_2_0_pre3-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-7e31093a7b4a477d01bc3946dd0bf612/tbb-v2021.8.0/src/tbb/private_server.cpp:221
#19 0x00007f8b6ec9817a in start_thread () from /lib64/libpthread.so.0
#20 0x00007f8b6e9c5df3 in clone () from /lib64/libc.so.6
[ message truncated - showing only crashed thread ]

@cms-sw/hcal-dpg-l2 do you have news about this?

unreglicious · 2023-10-11T09:55:43Z

Hi,

would it be possible to convert these error streams to the standard .root format and put them somewhere on EOS?

This would allow us (HCAL operations) to digest the FED RAW data itself for any clues.

Thanks and cheers,
Pasha

mmusich · 2023-10-11T12:06:33Z

Hi @unreglicious ,
you can find the EDM-converted .root file for the affected LS at:

/eos/cms/store/group/dpg_trigger/comm_trigger/TriggerStudiesGroup/debug/374803/converted_run374803_ls0160_index000038.root

For the record I obtained this by using:

cmsRun convertFromRawToEDM.py /eos/cms/store/group/dpg_trigger/comm_trigger/TriggerStudiesGroup/FOG/error_stream/run374803/run374803_ls0160_index000038_fu-c2b05-42-01_pid1201197.raw converted_run374803_ls0160_index000038.root

where convertFromRawToEDM.py is [*]

[*]

import FWCore.ParameterSet.Config as cms
import os
import sys
import glob

if sys.argv[0] == 'cmsRun':
  INPUTS = [sys.argv[2]]
  outputFilePath = sys.argv[3]
else:
  INPUTS = [sys.argv[1]]
  outputFilePath = sys.argv[2]

process = cms.Process('TEST')
process.options.wantSummary = False
process.maxEvents.input = -1

inputFilePaths = []
for inp_i in INPUTS:
  for inp_j in glob.glob(inp_i):
    inp_j2 = 'file:'+inp_j if os.path.isfile(inp_j) else inp_j
    inputFilePaths.append(inp_j2)
inputFilePaths = sorted(list(set(inputFilePaths)))
print(inputFilePaths)

process.EvFDaqDirector = cms.Service('EvFDaqDirector')

process.source = cms.Source('FedRawDataInputSource',
  fileListMode = cms.untracked.bool(True),
  fileNames = cms.untracked.vstring(inputFilePaths)
)

process.edmOutput = cms.OutputModule('PoolOutputModule',
  dataset = cms.untracked.PSet(
    dataTier = cms.untracked.string('RAW')
  ),
  fileName = cms.untracked.string('file:'+outputFilePath)
)

process.outputPath = cms.EndPath(process.edmOutput)
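
Editorial sketch: the `sys.argv` branch at the top of the configuration above shifts the argument indices by one depending on whether `sys.argv[0]` is the literal string `'cmsRun'`. The same logic can be written as a small helper that also checks the argument count. The function name `split_cli_args` and the error message are hypothetical and are not part of the original script.

```python
def split_cli_args(argv):
    """Return (input_pattern, output_path) from an argv-style list.

    Mirrors the branch in the configuration above: if argv[0] is 'cmsRun',
    the config path sits at index 1 and the user arguments start at index 2;
    otherwise the user arguments start at index 1.
    """
    offset = 2 if argv and argv[0] == 'cmsRun' else 1
    if len(argv) < offset + 2:
        raise RuntimeError('two arguments required: input .raw path and output .root path')
    return argv[offset], argv[offset + 1]


# Both layouts yield the same pair of user arguments.
print(split_cli_args(['cmsRun', 'convertFromRawToEDM.py', 'in.raw', 'out.root']))
print(split_cli_args(['convertFromRawToEDM.py', 'in.raw', 'out.root']))
```

Failing early with a clear message avoids an `IndexError` when the output path is forgotten on the command line.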

unreglicious · 2023-10-11T12:30:46Z

Thanks a lot, @mmusich! Both .raw and recipe are very helpful!

abdoulline · 2023-10-13T05:04:24Z

PRs addressing the issue (amendments suggested by Jeremy Mans) submitted
#43011 (master)
#43012 (132X)

mmusich · 2023-10-13T05:22:33Z

PRs addressing the issue (amendments suggested by Jeremy Mans) submitted

Thank you @abdoulline

mmusich · 2023-10-13T05:23:55Z

assign reconstruction

the proposed fixes lie in the reco area

cmsbuild · 2023-10-13T05:24:17Z

New categories assigned: reconstruction

@jfernan2,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks

mmusich · 2023-10-13T06:48:44Z

For the record there was yet another occurrence in run-375055, see link to HLT e-log.
The relevant part of the stack trace appears to be similar [1].
I attach for completeness the full stack trace from F3Mon: f3mon_logtable_2023-10-13T06 47 59.369Z.txt

[1]

A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.

Thu Oct 12 18:57:35 CEST 2023
Thread 12 (Thread 0x7f7efbdff700 (LWP 3485309) "cmsRun"):
#0 0x00007f7f7ba0ea71 in poll () from /lib64/libc.so.6
#1 0x00007f7f7096cd2f in full_read.constprop () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#2 0x00007f7f7093475c in edm::service::InitRootHandlers::stacktraceFromThread() () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#3 0x00007f7f709351bb in sig_dostack_then_abort () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x00007f7f2553b90a in HcalUHTRData::const_iterator::operator++() () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/libEventFilterHcalRawToDigi.so
#6 0x00007f7f25540ec9 in HcalUnpacker::unpackUTCA(FEDRawData const&, HcalElectronicsMap const&, HcalUnpacker::Collections&, HcalUnpackerReport&, bool) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/libEventFilterHcalRawToDigi.so
#7 0x00007f7f00e5c709 in HcalRawToDigi::produce(edm::Event&, edm::EventSetup const&) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/pluginEventFilterHcalRawToDigiPlugins.so
#8 0x00007f7f7e4383ed in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/libFWCoreFramework.so
#9 0x00007f7f7e41eb52 in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/libFWCoreFramework.so
#10 0x00007f7f7e3a95aa in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/libFWCoreFramework.so
#11 0x00007f7f7e3a9a58 in edm::Worker::RunModuleTask<edm::OccurrenceTraits::execute() () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/libFWCoreFramework.so
#12 0x00007f7f7e31ba8f in tbb::detail::d1::function_task<edm::WaitingTaskHolder::doneWaiting(std::__exception_ptr::exception_ptr)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_5/lib/el8_amd64_gcc11/libFWCoreFramework.so
#13 0x00007f7f7cb9c2e4 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::outermost_worker_waiter> (t=0x7f7e71576900, waiter=..., this=0x7f7f7692f700) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_2_0_pre3-el8_amd64_gcc11/build/CMSSW_13_2_0_pre3-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-7e31093a7b4a477d01bc3946dd0bf612/tbb-v2021.8.0/src/tbb/task_dispatcher.h:322
#14 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::outermost_worker_waiter> (t=0x0, waiter=..., this=0x7f7f7692f700) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_2_0_pre3-el8_amd64_gcc11/build/CMSSW_13_2_0_pre3-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-7e31093a7b4a477d01bc3946dd0bf612/tbb-v2021.8.0/src/tbb/task_dispatcher.h:458
#15 tbb::detail::r1::arena::process (tls=..., this=<optimized out>) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_2_0_pre3-el8_amd64_gcc11/build/CMSSW_13_2_0_pre3-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-7e31093a7b4a477d01bc3946dd0bf612/tbb-v2021.8.0/src/tbb/arena.cpp:137
#16 tbb::detail::r1::market::process (this=<optimized out>, j=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_2_0_pre3-el8_amd64_gcc11/build/CMSSW_13_2_0_pre3-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-7e31093a7b4a477d01bc3946dd0bf612/tbb-v2021.8.0/src/tbb/market.cpp:599
#17 0x00007f7f7cb9e4a6 in tbb::detail::r1::rml::private_worker::run (this=0x7f7f76923e00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_2_0_pre3-el8_amd64_gcc11/build/CMSSW_13_2_0_pre3-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-7e31093a7b4a477d01bc3946dd0bf612/tbb-v2021.8.0/src/tbb/private_server.cpp:271
#18 tbb::detail::r1::rml::private_worker::thread_routine (arg=0x7f7f76923e00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_2_0_pre3-el8_amd64_gcc11/build/CMSSW_13_2_0_pre3-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-7e31093a7b4a477d01bc3946dd0bf612/tbb-v2021.8.0/src/tbb/private_server.cpp:221
#19 0x00007f7f7bcec17a in start_thread () from /lib64/libpthread.so.0
#20 0x00007f7f7ba19df3 in clone () from /lib64/libc.so.6
[ message truncated - showing only crashed thread ]

jfernan2 · 2023-10-17T17:32:53Z

+1
Fixed by #43011

mmusich · 2023-10-24T07:24:35Z

+hlt

- fixed in master at HCAL unpacker: adding more protection against corrupted data #43011
- fixed in 13.2.X at [13_2_X] HCAL unpacker: adding more protection against corrupted data #43012, subsequently included in CMSSW_13_2_6_patch2
- patch release deployed online (for HI collisions) from run 375186 (see e-log)
- no further crash of this type has been observed since

cmsbuild · 2023-10-24T07:24:57Z

This issue is fully signed and ready to be closed.

makortel · 2023-10-24T13:41:34Z

@cmsbuild, please close

cmsbuild added the pending-assignment label Oct 6, 2023

cmsbuild added the hcal label Oct 6, 2023

cmsbuild added hlt-pending pending-signatures and removed pending-assignment labels Oct 6, 2023

This was referenced Oct 13, 2023

HCAL unpacker: adding more protection against corrupted data #43011

Merged

[13_2_X] HCAL unpacker: adding more protection against corrupted data #43012

Merged

cmsbuild added the reconstruction-pending label Oct 13, 2023

cmsbuild added reconstruction-approved and removed reconstruction-pending labels Oct 17, 2023

cmsbuild removed the hlt-pending label Oct 24, 2023

cmsbuild added hlt-approved fully-signed and removed pending-signatures labels Oct 24, 2023

cmsbuild closed this as completed Oct 24, 2023
