
GPU example fails. #235

Open
sth1997 opened this issue Dec 6, 2018 · 4 comments

sth1997 commented Dec 6, 2018

My CUDA version is 9.0 and my cuDNN version is 3.7.5.
I can successfully run the Walkthrough.ipynb code on CPU, on a single node or on multiple nodes. But if I set device=DeviceType.GPU for db.ops.Histogram and run it on a single node or on multiple nodes, it fails (a sketch of the change follows the output). This is its output:
5%|██████████▊ | 1/19 [00:02<00:36, 2.01s/it, jobs=1, tasks=18, workers=1]
Segmentation fault
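For reference, this is roughly the graph I ran; it is a minimal sketch assuming the walkthrough's sources/sinks API, and the table name 'example' and output name 'example_hist' are placeholders, so exact names may differ in your version:

from scannerpy import Database, DeviceType, Job

db = Database()
frame = db.sources.FrameColumn()
# The only change from the CPU walkthrough: device=DeviceType.GPU
hist = db.ops.Histogram(frame=frame, device=DeviceType.GPU)
output = db.sinks.Column(columns={'histogram': hist})

job = Job(op_args={
    frame: db.table('example').column('frame'),  # placeholder table
    output: 'example_hist'                       # placeholder output name
})
db.run(output, [job], force=True)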
I checked the logs; there are no WARNING entries, just this INFO log:

Log file created at: 2018/12/06 16:30:00
Running on machine: gorgon4
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I1206 16:30:00.866015 107280 ingest.cpp:936] Writing database metadata
I1206 16:30:00.869004 107280 ingest.cpp:940] Writing table megafile
I1206 16:30:00.889670 107194 worker.cpp:480] Creating worker
I1206 16:30:00.889878 107194 worker.cpp:497] Create master stub
I1206 16:30:00.889976 107194 worker.cpp:500] Finish master stub
I1206 16:30:00.890017 107194 worker.cpp:507] Worker created.
I1206 16:30:00.890188 107194 worker.cpp:666] Worker try to register with master
I1206 16:30:00.891312 107194 worker.cpp:693] Worker registered with master with id 0
I1206 16:30:00.902165 107327 worker.cpp:548] Worker 0 received NewJob
I1206 16:30:00.902737 107326 worker.cpp:722] Worker 0 loading Op library: /home/sth/.local/lib/python3.6/site-packages/scannerpy/lib/libscanner_stdlib.so
I1206 16:30:00.905745 107326 worker.cpp:1254] Initial pipeline instances per node: -1
I1206 16:30:00.905762 107326 worker.cpp:1280] Kernel Group 0 Pipeline instances per node: 1
I1206 16:30:00.905768 107326 worker.cpp:1294] Pipeline instances per node: 1

After that, I also tried using the GPU in examples/apps/quickstart/main.py. I set device=DeviceType.GPU for db.ops.Resize, and it also failed (sketch of the change below). This is its output:
0%| | 0/7 [00:02<?, ?it/s, jobs=1, tasks=7, workers=1]
Segmentation fault
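The change against the stock quickstart is a single line; this is a sketch, and the width/height values here are illustrative rather than copied from main.py:

# quickstart main.py, with only the device argument changed
resized = db.ops.Resize(frame=frame, width=640, height=480,
                        device=DeviceType.GPU)  # was DeviceType.CPU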
I also checked the log; the INFO log is a little different from the previous one:

Log file created at: 2018/12/06 16:35:14
Running on machine: gorgon4
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I1206 16:35:14.334425 107522 ingest.cpp:936] Writing database metadata

Have you ever encountered this problem?
Let me know if you need more information.
@apoms @willcrichton

willcrichton (Member) commented:

Can you run it in gdb and give us the stack trace?

sth1997 (Author) commented Dec 11, 2018

Sorry, I only know how to run a pure C/C++ project in gdb (and I know how to set CMAKE_BUILD_TYPE when compiling), but I have never used gdb to debug C/C++ functions called from Python.
Could you please tell me how to run Scanner (like main.py) in gdb?
Thanks!

willcrichton (Member) commented:

@sth1997 sorry for the late reply. You can run gdb on Python like this:

$ gdb python3
(gdb) r main.py
...
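When it segfaults, gdb stops on the faulting thread, and you can print the full stack with local variables; these are standard gdb commands, nothing Scanner-specific:

(gdb) backtrace full
(gdb) info threads    # to inspect the other worker threads if needed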

sth1997 (Author) commented Dec 25, 2018

Thread 149 "python3" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffecdb4f700 (LWP 20718)]
0x00007fff8d2dd666 in ?? () from /usr/lib/nvidia-387/libnvcuvid.so.1
(gdb) backtrace full
#0  0x00007fff8d2dd666 in ?? () from /usr/lib/nvidia-387/libnvcuvid.so.1
No symbol table info available.
#1  0x00007fff8d2deac4 in ?? () from /usr/lib/nvidia-387/libnvcuvid.so.1
No symbol table info available.
#2  0x00007fff8d2772f1 in ?? () from /usr/lib/nvidia-387/libnvcuvid.so.1
No symbol table info available.
#3  0x00007fff8ed09a97 in scanner::internal::NVIDIAVideoDecoder::feed (this=0x7ffe98001c30, 
    encoded_buffer=0x7ffed476443a <error: Cannot access memory at address 0x7ffed476443a>, encoded_size=3926, 
    discontinuity=<optimized out>) at /home/sth/video/scanner/scanner/scanner/video/nvidia/nvidia_video_decoder.cpp:225
        cupkt = {flags = 0, payload_size = 3926, payload = 0x7ffed476443a <error: Cannot access memory at address 0x7ffed476443a>, 
          timestamp = 0}
        dummy = 0x555555e3ef10
#4  0x00007fff8ed06988 in scanner::internal::DecoderAutomata::feeder (this=<optimized out>)
    at /home/sth/video/scanner/scanner/scanner/video/decoder_automata.cpp:310
        encoded_buffer_size = <optimized out>
        encoded_packet_size = 3926
        encoded_buffer = <optimized out>
        encoded_packet = <optimized out>
        seen_metadata = <optimized out>
        frames_fed = <optimized out>
#5  0x00007fffea104c5c in std::execute_native_thread_routine_compat (__p=<optimized out>)
    at /opt/conda/conda-bld/compilers_linux-64_1520532893746/work/.build/src/gcc-7.2.0/libstdc++-v3/src/c++11/thread.cc:110
        __t = <optimized out>
        __local = {<std::__shared_ptr<std::thread::_Impl_base, (__gnu_cxx::_Lock_policy)2>> = {<std::__shared_ptr_access<std::thread::_Impl_base, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, _M_ptr = 0x7ffe98010bd0, _M_refcount = {
              _M_pi = 0x7ffe98010bc0}}, <No data fields>}
#6  0x00007ffff7bc16ba in start_thread (arg=0x7ffecdb4f700) at pthread_create.c:333
        __res = <optimized out>
        pd = 0x7ffecdb4f700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140732349609728, 4193268998478770737, 0, 140732505506479, 140732349610432, 0, 
                -4193730285998951887, -4193250823535971791}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {
              prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
        __PRETTY_FUNCTION__ = "start_thread"
#7  0x00007ffff78f741d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
No locals.
