Illegal instruction (core dumped) with some pretrained models (but not all) #1782

emericit · 2020-01-22T11:13:47Z

Hello,

I have a strange issue when loading some pretrained models. Some of them abort on loading with an "Illegal instruction" message, for example:

import torchvision
torchvision.models.squeezenet1_0(pretrained=True)

Illegal instruction (core dumped)

But with other models, everything runs smoothly. The following code

import torchvision
torchvision.models.alexnet(pretrained=True)

outputs a nicely loaded model:

AlexNet(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
    (1): ReLU(inplace=True)
    (2): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(64, 192, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (4): ReLU(inplace=True)
    (5): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (6): Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (7): ReLU(inplace=True)
    (8): Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (9): ReLU(inplace=True)
    (10): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace=True)
    (12): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(6, 6))
  (classifier): Sequential(
    (0): Dropout(p=0.5, inplace=False)
    (1): Linear(in_features=9216, out_features=4096, bias=True)
    (2): ReLU(inplace=True)
    (3): Dropout(p=0.5, inplace=False)
    (4): Linear(in_features=4096, out_features=4096, bias=True)
    (5): ReLU(inplace=True)
    (6): Linear(in_features=4096, out_features=1000, bias=True)
  )
)

I am running Python 3.7.5 on a CPU-only Ubuntu machine with the following versions:
PyTorch Version: 1.4.0+cpu
Torchvision Version: 0.5.0+cpu

The CPUs have the following properties:

processor : 7
vendor_id : GenuineIntel
cpu family : 6
model : 13
model name : QEMU Virtual CPU version (cpu64-rhel6)
stepping : 3
microcode : 0x1
cpu MHz : 2593.902
cache size : 4096 KB
physical id : 7
siblings : 1
core id : 0
cpu cores : 1
apicid : 7
initial apicid : 7
fpu : yes
fpu_exception : yes
cpuid level : 4
wp : yes
flags : fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pse36 clflush mmx fxsr sse sse2 syscall nx lm nopl cpuid pni cx16 hypervisor lahf_lm abm pti
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips : 5187.80
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:

I could not find help anywhere with this issue...

fmassa · 2020-01-22T20:46:25Z

Hi,

Can you run the code under gdb and paste the output?
To do that, save the following file

# tst.py
import torchvision
torchvision.models.squeezenet1_0(pretrained=True)

and then run from the command line

gdb python

once it starts, run

run tst.py

and when you get the segfault, do

bt

and paste the result here. It will help us identify the problem
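
The interactive steps above can also be collapsed into a single non-interactive gdb call. The sketch below builds and prints that command line. The helper names `gdb_backtrace_command` and `capture_backtrace` are made up for illustration, and it assumes that `gdb` is installed and that the script is saved as `tst.py`.

```python
import shlex
import subprocess
import sys


def gdb_backtrace_command(script, python=sys.executable):
    """Build a gdb invocation that runs `script` and prints a backtrace when it stops."""
    return [
        "gdb", "-batch",   # run the commands below, then exit
        "-ex", "run",      # start the program
        "-ex", "bt",       # print the backtrace once the program has stopped
        "--args", python, script,
    ]


def capture_backtrace(script):
    """Run the gdb command and return its combined output (needs gdb on PATH)."""
    result = subprocess.run(
        gdb_backtrace_command(script),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    return result.stdout


command = gdb_backtrace_command("tst.py", python="python")
# prints: gdb -batch -ex run -ex bt --args python tst.py
print(" ".join(shlex.quote(part) for part in command))
```

Calling `capture_backtrace("tst.py")` would return the text to paste into the issue, which is convenient on machines where an interactive session is awkward.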

bmanga · 2020-01-23T08:45:04Z

Hi, I had the same problem, and I solved it by compiling torch and torchvision myself.
This happened on an older CPU without AVX support. Was AVX enabled by default recently?
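
Whether a CPU advertises AVX can be read from the `flags` line of `/proc/cpuinfo`. The following is a minimal sketch of that check, applied to the flags line posted at the top of this thread. The function names are hypothetical, and `read_cpu_flags` assumes a Linux system.

```python
def missing_cpu_features(flags_line, required=("avx", "avx2")):
    """Return the required feature names that are absent from a 'flags' line."""
    # Drop the "flags :" label if present, then split into feature names.
    _, _, values = flags_line.rpartition(":")
    present = set(values.split())
    return [name for name in required if name not in present]


def read_cpu_flags(path="/proc/cpuinfo"):
    """Return the first 'flags' line of /proc/cpuinfo (Linux only)."""
    with open(path) as handle:
        for line in handle:
            if line.startswith("flags"):
                return line
    return ""


# Flags line copied from the CPU listing in the first post of this thread.
reported_flags = (
    "flags : fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pse36 "
    "clflush mmx fxsr sse sse2 syscall nx lm nopl cpuid pni cx16 hypervisor "
    "lahf_lm abm pti"
)

# prints: ['avx', 'avx2']
print(missing_cpu_features(reported_flags))
```

On the affected machine itself, `missing_cpu_features(read_cpu_flags())` performs the same check against the live CPU description.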

emericit · 2020-01-27T12:58:14Z

Hello,

Thank you for your response. I compiled torch and torchvision with SSE4 support disabled (-DENABLE_SSE4=0), but the problem remains.

The backtrace is below:

Thread 1 "python" received signal SIGILL, Illegal instruction.
0x00007ffff0772fcc in THFloatVector_normal_fill_AVX2 () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
(gdb) bt
#0 0x00007ffff0772fcc in THFloatVector_normal_fill_AVX2 () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#1 0x00007ffff0522269 in THFloatTensor_normal () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#2 0x00007ffff03fc29f in at::native::legacy::cpu::th_normal(at::Tensor&, double, double, at::Generator*) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#3 0x00007ffff0392e94 in at::CPUType::(anonymous namespace)::normal_(at::Tensor&, double, double, at::Generator*) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#4 0x00007ffff23537ae in torch::autograd::VariableType::(anonymous namespace)::normal_(at::Tensor&, double, double, at::Generator*) ()
from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#5 0x00007ffff568ee6a in torch::autograd::THPVariable_normal_(_object*, _object*, _object*) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#6 0x00005555556bca04 in _PyMethodDef_RawFastCallKeywords () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Objects/call.c:694
#7 0x00005555556d432f in _PyMethodDescr_FastCallKeywords () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Objects/descrobject.c:288
#8 0x0000555555728b1c in call_function (kwnames=0x0, oparg=3, pp_stack=) at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Python/ceval.c:4593
#9 _PyEval_EvalFrameDefault () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Python/ceval.c:3110
#10 0x00005555556bbf7b in function_code_fastcall (globals=, nargs=3, args=, co=)
at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Objects/call.c:283
#11 _PyFunction_FastCallKeywords () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Objects/call.c:408
#12 0x0000555555724156 in call_function (kwnames=0x0, oparg=, pp_stack=) at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Python/ceval.c:4616
#13 _PyEval_EvalFrameDefault () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Python/ceval.c:3124
#14 0x0000555555668729 in _PyEval_EvalCodeWithName () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Python/ceval.c:3930
#15 0x00005555556bc207 in _PyFunction_FastCallKeywords () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Objects/call.c:433
#16 0x000055555572521f in call_function (kwnames=0x7fffd9564460, oparg=, pp_stack=)
at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Python/ceval.c:4616
#17 _PyEval_EvalFrameDefault () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Python/ceval.c:3139
#18 0x0000555555668a0a in _PyEval_EvalCodeWithName () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Python/ceval.c:3930
#19 0x0000555555669865 in _PyFunction_FastCallDict () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Objects/call.c:376
#20 0x0000555555689313 in _PyObject_Call_Prepend () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Objects/call.c:908
#21 0x00005555556d372a in slot_tp_init () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Objects/typeobject.c:6636
#22 0x00005555556d4287 in type_call () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Objects/typeobject.c:971
#23 0x000055555567b06e in PyObject_Call () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Objects/call.c:245
#24 0x0000555555725c8f in do_call_core (kwdict=0x7fffd6c64fa0, callargs=0x7ffff6c3fa10, func=0x555557315f70) at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Python/ceval.c:4645
#25 _PyEval_EvalFrameDefault () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Python/ceval.c:3191
#26 0x0000555555668729 in _PyEval_EvalCodeWithName () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Python/ceval.c:3930
#27 0x0000555555669865 in _PyFunction_FastCallDict () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Objects/call.c:376
#28 0x0000555555725c8f in do_call_core (kwdict=0x7ffff6d8cdc0, callargs=0x7fffd6c5f370, func=0x7fffd955ddd0) at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Python/ceval.c:4645
#29 _PyEval_EvalFrameDefault () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Python/ceval.c:3191
#30 0x0000555555668729 in _PyEval_EvalCodeWithName () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Python/ceval.c:3930
#31 0x00005555556bc207 in _PyFunction_FastCallKeywords () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Objects/call.c:433
#32 0x00005555557289b9 in call_function (kwnames=0x0, oparg=, pp_stack=) at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Python/ceval.c:4616
#33 _PyEval_EvalFrameDefault () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Python/ceval.c:3093
#34 0x0000555555668729 in _PyEval_EvalCodeWithName () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Python/ceval.c:3930
#35 0x0000555555669654 in PyEval_EvalCodeEx () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Python/ceval.c:3959
#36 0x000055555566967c in PyEval_EvalCode (co=, globals=, locals=)
at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Python/ceval.c:524
#37 0x000055555577fcb4 in run_mod () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Python/pythonrun.c:1035
#38 0x000055555578a191 in PyRun_FileExFlags () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Python/pythonrun.c:988
#39 0x000055555578a383 in PyRun_SimpleFileExFlags () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Python/pythonrun.c:429
#40 0x000055555578b475 in pymain_run_file (p_cf=0x7fffffffdc10, filename=0x5555558c1700 L"test.py", fp=0x555555909950)
at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Modules/main.c:428
#41 pymain_run_filename (cf=0x7fffffffdc10, pymain=0x7fffffffdd20) at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Modules/main.c:1607
#42 pymain_run_python (pymain=0x7fffffffdd20) at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Modules/main.c:2868
#43 pymain_main () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Modules/main.c:3029
#44 0x000055555578b59c in _Py_UnixMain () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Modules/main.c:3064
#45 0x00007ffff77e6b97 in __libc_start_main (main=0x5555556492a0, argc=2, argv=0x7fffffffde78, init=, fini=, rtld_fini=, stack_end=0x7fffffffde68)
at ../csu/libc-start.c:310
#46 0x0000555555733b50 in _start () at ../sysdeps/x86_64/elf/start.S:103
(gdb)

fmassa · 2020-01-27T13:23:22Z

Hi @emericit

Thanks for the backtrace. This seems to be a problem with PyTorch itself, not with torchvision.
The crash is inside THFloatVector_normal_fill_AVX2, so it looks like your machine doesn't support the instructions that the PyTorch build you are using relies on.

Indeed, it seems that if you do something like

import torch
torch.randn(10)

you might get the same crash.

This is very close to being a duplicate of pytorch/pytorch#22338, so let's redirect the discussion there.

emericit · 2020-01-27T13:48:15Z

Hello @fmassa !
Thank you for your quick answer. It seems to be an issue with torch, you are right.
torch.randn works fine, though... (see screenshot below)
I suspect my processor doesn't support AVX. Do you know how to disable AVX when compiling Torch?

fmassa · 2020-01-27T14:28:48Z

The randn codepath for AVX might only get activated for large-enough inputs, see https://github.com/pytorch/pytorch/blob/c2c835dd95f192d1397877b94e615d13258126d9/aten/src/TH/vector/AVX2.cpp#L79 and https://github.com/pytorch/pytorch/blob/64de93d8e7c6ba085997b18bcf85681b330d9afb/aten/src/TH/generic/THTensorRandom.cpp#L88-L90, so maybe try with something like

torch.randn(1024)

I think it should crash this time.
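
To try several sizes without losing the interactive session on each crash, every call can be run in its own child interpreter. This is a rough sketch under a few assumptions: the helper names are invented for illustration, the platform is POSIX (where a negative return code means "killed by that signal"), and `torch` is importable in the child process.

```python
import signal
import subprocess
import sys


def run_isolated(code, timeout=300):
    """Run a Python snippet in a child interpreter; return (returncode, stderr)."""
    proc = subprocess.run(
        # -X faulthandler makes the child dump its Python traceback on fatal signals.
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stderr


def describe(returncode):
    """Translate a subprocess return code into a short status string."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, -N means the child was terminated by signal N.
        return "killed by " + signal.Signals(-returncode).name
    return "exited with code %d" % returncode


def probe_randn(sizes):
    """Call torch.randn(n) for each n in a separate process and collect the status."""
    results = {}
    for n in sizes:
        returncode, _ = run_isolated("import torch; torch.randn(%d)" % n)
        results[n] = describe(returncode)
    return results
```

If the guess about the size threshold is right, `probe_randn([10, 1024])` on the affected machine would be expected to report `ok` for 10 and `killed by SIGILL` for 1024, while the parent process keeps running.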

fmassa · 2020-01-27T14:29:15Z

As for compiling without AVX2, I'm not sure; it might be best to ask in the PyTorch issue I linked.

emericit · 2020-01-27T15:57:30Z

You're right, it crashes with size 1024.
Thank you very much; I will ask on the PyTorch issue directly.

fmassa added the "awaiting response" and "needs reproduction" labels on Jan 22, 2020

fmassa closed this as completed Jan 27, 2020

fmassa added the "duplicate" and "topic: binaries" labels and removed the "awaiting response" label on Jan 27, 2020

This was referenced Jan 27, 2020

Illegal instruction (core dumped) when running in qemu pytorch/pytorch#22338 (Open)

Illegal instruction pytorch/pytorch#29371 (Closed)
