Illegal instruction (core dumped) with some pretrained models (but not all) #1782

Closed
emericit opened this issue Jan 22, 2020 · 8 comments

@emericit

Hello,

I have a strange issue when loading some pretrained models: the process aborts with an "Illegal instruction" message. For example:

import torchvision
torchvision.models.squeezenet1_0(pretrained=True)

Illegal instruction (core dumped)

But with other models, everything runs smoothly. The following code

import torchvision
torchvision.models.alexnet(pretrained=True)

outputs a nicely loaded model:

AlexNet(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
    (1): ReLU(inplace=True)
    (2): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(64, 192, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (4): ReLU(inplace=True)
    (5): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (6): Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (7): ReLU(inplace=True)
    (8): Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (9): ReLU(inplace=True)
    (10): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace=True)
    (12): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(6, 6))
  (classifier): Sequential(
    (0): Dropout(p=0.5, inplace=False)
    (1): Linear(in_features=9216, out_features=4096, bias=True)
    (2): ReLU(inplace=True)
    (3): Dropout(p=0.5, inplace=False)
    (4): Linear(in_features=4096, out_features=4096, bias=True)
    (5): ReLU(inplace=True)
    (6): Linear(in_features=4096, out_features=1000, bias=True)
  )
)

I am running Python 3.7.5 on a CPU-only Ubuntu machine with the following versions:
PyTorch Version: 1.4.0+cpu
Torchvision Version: 0.5.0+cpu

The CPUs have the following properties:

processor : 7
vendor_id : GenuineIntel
cpu family : 6
model : 13
model name : QEMU Virtual CPU version (cpu64-rhel6)
stepping : 3
microcode : 0x1
cpu MHz : 2593.902
cache size : 4096 KB
physical id : 7
siblings : 1
core id : 0
cpu cores : 1
apicid : 7
initial apicid : 7
fpu : yes
fpu_exception : yes
cpuid level : 4
wp : yes
flags : fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pse36 clflush mmx fxsr sse sse2 syscall nx lm nopl cpuid pni cx16 hypervisor lahf_lm abm pti
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips : 5187.80
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:

I could not find help anywhere with this issue...

@fmassa
Member

fmassa commented Jan 22, 2020

Hi,

Can you run the code under gdb and paste the output?
To do that, save the following file

# tst.py
import torchvision
torchvision.models.squeezenet1_0(pretrained=True)

and then run from the command line

gdb python

once it starts, run

run tst.py

and when the crash occurs, run

bt

and paste the result here. It will help us identify the problem.
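
The interactive gdb steps above can also be run in one shot. The following is a minimal sketch, assuming gdb is installed and on the PATH; the helper names `build_gdb_command` and `capture_backtrace` are made up for illustration. It relies on gdb's `-batch`, `-ex` and `--args` options to start the script and print the backtrace without typing commands by hand:

```python
import subprocess
import sys


def build_gdb_command(script, python=sys.executable):
    """Return an argv list that runs `script` under gdb and prints a backtrace."""
    return [
        "gdb",
        "-batch",        # exit once the commands below have run
        "-ex", "run",    # start the program
        "-ex", "bt",     # print the backtrace after the program stops
        "--args", python, script,
    ]


def capture_backtrace(script):
    """Run gdb and return everything it printed (stdout and stderr combined)."""
    result = subprocess.run(
        build_gdb_command(script),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.stdout
```

Calling `print(capture_backtrace("tst.py"))` should then produce text that can be pasted into the issue directly. If the script exits normally, gdb has no stack to print, so the backtrace part will be empty.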

@bmanga
Contributor

bmanga commented Jan 23, 2020

Hi, I had the same problem, and I solved it by compiling torch and torchvision myself.
It happened on an older CPU without AVX support. Was AVX enabled by default recently?
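
Whether a CPU advertises AVX can be read from the `flags` line of `/proc/cpuinfo`, like the one posted in the first comment. The following is a rough sketch, assuming a Linux machine; the function names `parse_cpu_flags` and `missing_simd` and the list of extensions to check are illustrative choices, not part of PyTorch:

```python
def parse_cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_simd(flags, wanted=("sse4_1", "sse4_2", "avx", "avx2")):
    """List the SIMD extensions in `wanted` that the CPU does not advertise."""
    return [name for name in wanted if name not in flags]


# Flags line copied from the CPU description earlier in this thread.
SAMPLE = (
    "flags : fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pse36 "
    "clflush mmx fxsr sse sse2 syscall nx lm nopl cpuid pni cx16 hypervisor "
    "lahf_lm abm pti"
)

print(missing_simd(parse_cpu_flags(SAMPLE)))
```

On a live machine, the same functions can be applied to the contents of `/proc/cpuinfo`, for example `parse_cpu_flags(open("/proc/cpuinfo").read())`. For the QEMU virtual CPU described above, none of the four extensions is listed.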

@emericit
Author

Hello,

Thank you for your response. I compiled torch and torchvision with SSE4 support disabled (-DENABLE_SSE4=0), but the problem remains.

The backtrace is below:

Thread 1 "python" received signal SIGILL, Illegal instruction.
0x00007ffff0772fcc in THFloatVector_normal_fill_AVX2 () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
(gdb) bt
#0 0x00007ffff0772fcc in THFloatVector_normal_fill_AVX2 () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#1 0x00007ffff0522269 in THFloatTensor_normal () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#2 0x00007ffff03fc29f in at::native::legacy::cpu::th_normal(at::Tensor&, double, double, at::Generator*) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#3 0x00007ffff0392e94 in at::CPUType::(anonymous namespace)::normal_(at::Tensor&, double, double, at::Generator*) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#4 0x00007ffff23537ae in torch::autograd::VariableType::(anonymous namespace)::normal_(at::Tensor&, double, double, at::Generator*) ()
from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#5 0x00007ffff568ee6a in torch::autograd::THPVariable_normal_(_object*, _object*, _object*) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#6 0x00005555556bca04 in _PyMethodDef_RawFastCallKeywords () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Objects/call.c:694
#7 0x00005555556d432f in _PyMethodDescr_FastCallKeywords () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Objects/descrobject.c:288
#8 0x0000555555728b1c in call_function (kwnames=0x0, oparg=3, pp_stack=) at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Python/ceval.c:4593
#9 _PyEval_EvalFrameDefault () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Python/ceval.c:3110
#10 0x00005555556bbf7b in function_code_fastcall (globals=, nargs=3, args=, co=)
at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Objects/call.c:283
#11 _PyFunction_FastCallKeywords () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Objects/call.c:408
#12 0x0000555555724156 in call_function (kwnames=0x0, oparg=, pp_stack=) at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Python/ceval.c:4616
#13 _PyEval_EvalFrameDefault () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Python/ceval.c:3124
#14 0x0000555555668729 in _PyEval_EvalCodeWithName () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Python/ceval.c:3930
#15 0x00005555556bc207 in _PyFunction_FastCallKeywords () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Objects/call.c:433
#16 0x000055555572521f in call_function (kwnames=0x7fffd9564460, oparg=, pp_stack=)
at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Python/ceval.c:4616
#17 _PyEval_EvalFrameDefault () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Python/ceval.c:3139
#18 0x0000555555668a0a in _PyEval_EvalCodeWithName () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Python/ceval.c:3930
#19 0x0000555555669865 in _PyFunction_FastCallDict () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Objects/call.c:376
#20 0x0000555555689313 in _PyObject_Call_Prepend () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Objects/call.c:908
#21 0x00005555556d372a in slot_tp_init () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Objects/typeobject.c:6636
#22 0x00005555556d4287 in type_call () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Objects/typeobject.c:971
#23 0x000055555567b06e in PyObject_Call () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Objects/call.c:245
#24 0x0000555555725c8f in do_call_core (kwdict=0x7fffd6c64fa0, callargs=0x7ffff6c3fa10, func=0x555557315f70) at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Python/ceval.c:4645
#25 _PyEval_EvalFrameDefault () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Python/ceval.c:3191
#26 0x0000555555668729 in _PyEval_EvalCodeWithName () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Python/ceval.c:3930
#27 0x0000555555669865 in _PyFunction_FastCallDict () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Objects/call.c:376
#28 0x0000555555725c8f in do_call_core (kwdict=0x7ffff6d8cdc0, callargs=0x7fffd6c5f370, func=0x7fffd955ddd0) at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Python/ceval.c:4645
#29 _PyEval_EvalFrameDefault () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Python/ceval.c:3191
#30 0x0000555555668729 in _PyEval_EvalCodeWithName () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Python/ceval.c:3930
#31 0x00005555556bc207 in _PyFunction_FastCallKeywords () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Objects/call.c:433
#32 0x00005555557289b9 in call_function (kwnames=0x0, oparg=, pp_stack=) at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Python/ceval.c:4616
#33 _PyEval_EvalFrameDefault () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Python/ceval.c:3093
#34 0x0000555555668729 in _PyEval_EvalCodeWithName () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Python/ceval.c:3930
#35 0x0000555555669654 in PyEval_EvalCodeEx () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Python/ceval.c:3959
#36 0x000055555566967c in PyEval_EvalCode (co=, globals=, locals=)
at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Python/ceval.c:524
#37 0x000055555577fcb4 in run_mod () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Python/pythonrun.c:1035
#38 0x000055555578a191 in PyRun_FileExFlags () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Python/pythonrun.c:988
#39 0x000055555578a383 in PyRun_SimpleFileExFlags () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Python/pythonrun.c:429
#40 0x000055555578b475 in pymain_run_file (p_cf=0x7fffffffdc10, filename=0x5555558c1700 L"test.py", fp=0x555555909950)
at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Modules/main.c:428
#41 pymain_run_filename (cf=0x7fffffffdc10, pymain=0x7fffffffdd20) at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Modules/main.c:1607
#42 pymain_run_python (pymain=0x7fffffffdd20) at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Modules/main.c:2868
#43 pymain_main () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Modules/main.c:3029
#44 0x000055555578b59c in _Py_UnixMain () at /home/conda/feedstock_root/build_artifacts/python_1578433408510/work/Modules/main.c:3064
#45 0x00007ffff77e6b97 in __libc_start_main (main=0x5555556492a0, argc=2, argv=0x7fffffffde78, init=, fini=, rtld_fini=, stack_end=0x7fffffffde68)
at ../csu/libc-start.c:310
#46 0x0000555555733b50 in _start () at ../sysdeps/x86_64/elf/start.S:103
(gdb)

@fmassa
Member

fmassa commented Jan 27, 2020

Hi @emericit

Thanks for the backtrace. This seems to be a problem with PyTorch itself, not with torchvision.
It looks like your machine doesn't support the instruction set required by the PyTorch build you are using.

Indeed, it seems that if you do something like

import torch
torch.randn(10)

you might see the same crash.

This is actually very close to being a duplicate of pytorch/pytorch#22338, so let's redirect the discussion there.

@emericit
Author

Hello @fmassa !
Thank you for your quick answer. It seems to be an issue with torch, you are right.
torch.randn works fine, though... (see screenshot below)
I suspect my processor doesn't support AVX. Do you know how to disable AVX when compiling Torch?

[Screenshot: torch.randn running without a crash]

@fmassa
Member

fmassa commented Jan 27, 2020

The AVX code path for randn might only be taken for large enough inputs (see https://github.com/pytorch/pytorch/blob/c2c835dd95f192d1397877b94e615d13258126d9/aten/src/TH/vector/AVX2.cpp#L79 and https://github.com/pytorch/pytorch/blob/64de93d8e7c6ba085997b18bcf85681b330d9afb/aten/src/TH/generic/THTensorRandom.cpp#L88-L90), so maybe try something like

torch.randn(1024)

I think this should crash this time.
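
Since a crash of this kind kills the interpreter, one way to test several input sizes is to run each attempt in a child process and inspect its exit status. The following is a hedged sketch, assuming a POSIX system; the helper names and the `template` parameter are hypothetical and only serve the illustration:

```python
import signal
import subprocess
import sys


def run_isolated(code):
    """Run a Python snippet in a child process and return its exit status."""
    return subprocess.run([sys.executable, "-c", code]).returncode


def died_from_sigill(returncode):
    """On POSIX, subprocess reports death by signal N as return code -N."""
    return returncode == -signal.SIGILL


def first_crashing_size(sizes, template="import torch; torch.randn({n})"):
    """Return the first size whose child process is killed by SIGILL, else None."""
    for n in sizes:
        if died_from_sigill(run_isolated(template.format(n=n))):
            return n
    return None
```

For example, `first_crashing_size([10, 1024])` would try the two sizes discussed in this thread, each in its own process, so the parent interpreter survives even when a child is killed.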

@fmassa
Member

fmassa commented Jan 27, 2020

As for compiling without AVX2, I'm not sure; it might be best to ask in the PyTorch issue I linked.

@emericit
Author

You're right, it crashes with size 1024.
Thank you very much; I will ask on the PyTorch issue directly.
