Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SIGSEGV in RepReq's rep0 recv - use after free #1241

Closed
alexkornitzer opened this issue May 10, 2020 · 39 comments
Closed

SIGSEGV in RepReq's rep0 recv - use after free #1241

alexkornitzer opened this issue May 10, 2020 · 39 comments
Assignees
Labels

Comments

@alexkornitzer
Copy link

As briefly discussed in #1240 there appears to be a bug in the current implementation that results in a SIGSEGV when a socket is closed but the resend timeout fired at least once. Or that is my current theory on it at least.

SIGSEGV from C implementation:

https://gist.github.com/AlexKornitzer/371eef55e558e89d10850d53986dfb35

./reqrep server ipc:///tmp/abcd 
./reqrep client ipc:///tmp/abcd
panic: pthread_mutex_lock: Invalid argument
This message is indicative of a BUG.
Report this at https://github.com/nanomsg/nng/issues
/home/developer/Developer/scratch/nng/reqrep(+0x15bb9) [0x555555569bb9]
/home/developer/Developer/scratch/nng/reqrep(+0x1f7af) [0x5555555737af]
/home/developer/Developer/scratch/nng/reqrep(+0x1f958) [0x555555573958]
/home/developer/Developer/scratch/nng/reqrep(+0x1c826) [0x555555570826]
/home/developer/Developer/scratch/nng/reqrep(+0x1c562) [0x555555570562]
/home/developer/Developer/scratch/nng/reqrep(+0xf21f) [0x55555556321f]
/home/developer/Developer/scratch/nng/reqrep(+0x24617) [0x555555578617]
/home/developer/Developer/scratch/nng/reqrep(+0x15f4e) [0x555555569f4e]
/home/developer/Developer/scratch/nng/reqrep(+0xe60d) [0x55555556260d]
/home/developer/Developer/scratch/nng/reqrep(+0xec35) [0x555555562c35]
/home/developer/Developer/scratch/nng/reqrep(+0x18a00) [0x55555556ca00]
/home/developer/Developer/scratch/nng/reqrep(+0x739f) [0x55555555b39f]
/home/developer/Developer/scratch/nng/reqrep(+0x710d) [0x55555555b10d]
/home/developer/Developer/scratch/nng/reqrep(+0x6f29) [0x55555555af29]
/home/developer/Developer/scratch/nng/reqrep(+0x690f) [0x55555555a90f]
/home/developer/Developer/scratch/nng/reqrep(+0x6dac) [0x55555555adac]
/usr/lib/libc.so.6(__libc_start_main+0xf3) [0x7ffff7ded023]
/home/developer/Developer/scratch/nng/reqrep(+0x672e) [0x55555555a72e]

(gdb) bt
#0  0x00007ffff7e01ce5 in raise () from /usr/lib/libc.so.6
#1  0x00007ffff7deb857 in abort () from /usr/lib/libc.so.6
#2  0x00005555555733bf in nni_plat_abort ()
#3  0x0000555555569bbe in nni_panic ()
#4  0x00005555555737af in nni_pthread_mutex_lock ()
#5  0x0000555555573958 in nni_plat_mtx_lock ()
#6  0x0000555555570826 in nni_mtx_lock ()
#7  0x0000555555570562 in nni_task_dispatch ()
#8  0x000055555556321f in nni_aio_begin ()
#9  0x0000555555578617 in ipctran_pipe_recv ()
#10 0x0000555555569f4e in nni_pipe_recv ()
#11 0x000055555556260d in rep0_ctx_recv ()
#12 0x0000555555562c35 in rep0_sock_recv ()
#13 0x000055555556ca00 in nni_sock_recv ()
#14 0x000055555555b39f in nng_recv_aio ()
#15 0x000055555555b10d in nng_recvmsg ()
#16 0x000055555555af29 in nng_recv ()
#17 0x000055555555a90f in server ()
#18 0x000055555555adac in main ()

SIGSEGV from Rust version:

Thread 1 "scratch" received signal SIGSEGV, Segmentation fault.
0x0000555555577ce4 in nni_pipe_recv (p=0x0, aio=0x7fffe8001430)
    at /home/developer/.cargo/registry/src/github.jparrowsec.cn-1ecc6299db9ec823/nng-sys-1.2.4-rc.1/nng/src/core/pipe.c:141
141             p->p_tran_ops.p_recv(p->p_tran_data, aio);
(gdb) bt
#0  0x0000555555577ce4 in nni_pipe_recv (p=0x0, aio=0x7fffe8001430)
    at /home/developer/.cargo/registry/src/github.jparrowsec.cn-1ecc6299db9ec823/nng-sys-1.2.4-rc.1/nng/src/core/pipe.c:141
#1  0x000055555556dddf in rep0_ctx_recv (arg=0x5555555ead20, aio=0x5555555edcb0)
    at /home/developer/.cargo/registry/src/github.jparrowsec.cn-1ecc6299db9ec823/nng-sys-1.2.4-rc.1/nng/src/protocol/reqrep0/rep.c:504
#2  0x000055555556e361 in rep0_sock_recv (arg=0x5555555eac10, aio=0x5555555edcb0)
    at /home/developer/.cargo/registry/src/github.jparrowsec.cn-1ecc6299db9ec823/nng-sys-1.2.4-rc.1/nng/src/protocol/reqrep0/rep.c:670
#3  0x000055555557ade3 in nni_sock_recv (sock=0x5555555ea220, aio=0x5555555edcb0)
    at /home/developer/.cargo/registry/src/github.jparrowsec.cn-1ecc6299db9ec823/nng-sys-1.2.4-rc.1/nng/src/core/socket.c:847
#4  0x00005555555666e6 in nng_recv_aio (s=..., aio=0x5555555edcb0)
    at /home/developer/.cargo/registry/src/github.jparrowsec.cn-1ecc6299db9ec823/nng-sys-1.2.4-rc.1/nng/src/nng.c:214
#5  0x000055555556651d in nng_recvmsg (s=..., msgp=0x7fffffffcf50, flags=0)
    at /home/developer/.cargo/registry/src/github.jparrowsec.cn-1ecc6299db9ec823/nng-sys-1.2.4-rc.1/nng/src/nng.c:136
#6  0x0000555555564a84 in nng::socket::Socket::recv (self=0x7fffffffd018)
    at /home/developer/.cargo/git/checkouts/nng-rs-585a5382ec72d3dc/fc301c2/src/socket.rs:245
#7  0x00005555555600f1 in scratch::server () at src/main.rs:76
#8  0x0000555555560339 in scratch::main () at src/main.rs:87
@gdamore
Copy link
Contributor

gdamore commented May 18, 2020

Thanks, will investigate.

@gdamore gdamore self-assigned this May 18, 2020
@gdamore gdamore added the bug label May 18, 2020
@gdamore
Copy link
Contributor

gdamore commented May 18, 2020

Can you see if this is reproducible with tonight's latest commits? I think there's a chance that this might be another manifestation of a problem I resolved in TCP.

@alexkornitzer
Copy link
Author

Unfortunately I can, just tried with the latest master, and I am getting the same backtraces as above.

@gdamore
Copy link
Contributor

gdamore commented May 19, 2020

Thanks. I will look further into it.

@gdamore
Copy link
Contributor

gdamore commented May 24, 2020

There's at least one bug in your code -- a use-after-free -- in your program. It's not clear to me yet that there is a relationship with this particular crash though. I'll comment on your gist.

@gdamore
Copy link
Contributor

gdamore commented May 24, 2020

Actually I may have misread the code.

@gdamore
Copy link
Contributor

gdamore commented May 24, 2020

Your backtrace data above is not quite complete, since it doesn't include the arguments, which is unfortunate. It looks like this might be a race where the pipe is closed (and so members of the pipe itself are discarded) but the pipe still is on a per-socket recvpipes list. This might actually also be a bit of a race as we take it off that list. All of this is supposed to be done with the mutex for the socket held, so it should be safe. But something isn't right.

@gdamore
Copy link
Contributor

gdamore commented May 24, 2020

Yep. The rep0_pipe_close doesn't properly verify that the pipe isn't busy receiving.

@gdamore
Copy link
Contributor

gdamore commented May 24, 2020

Are you using NNG 1.2.4 or HEAD (which is 1.3.x?) The rust crash look like it is from 1.2.4. There are rather large changes in 1.3.x.

@alexkornitzer
Copy link
Author

Ah thought it might be something like that but the code was unfamiliar to me and I had a feeling you would identify the issue much faster.

On the rust front they currently point at 1.2.4 but I pointed nng-sys at HEAD to ensure that I was using the latest version before reporting that one. So unless I typoed both versions were tested against HEAD.

@gdamore
Copy link
Contributor

gdamore commented May 24, 2020

Is there any special trick to triggering the crash in your test program? I can't seem to reproduce it. Something here isn't adding up for me.

@alexkornitzer
Copy link
Author

alexkornitzer commented May 24, 2020

Being a race condition, I tend to find if it does not trigger first time i have to start the server again.

Its a bit of a gamble but i start the server, leave the client to run for about 3 seconds then kill it. After a few attempts this seems to trigger it. If you still cant trigger it i’ll try and script a loop to trigger it.

@gdamore
Copy link
Contributor

gdamore commented May 24, 2020

Are you killing the server, or the client?

@alexkornitzer
Copy link
Author

The client, which will then cause the server to crash. Hence why your reasoning above seems accurate.

@gdamore
Copy link
Contributor

gdamore commented May 24, 2020

Ok, I just got it.

@gdamore
Copy link
Contributor

gdamore commented May 24, 2020

Oh this is in a totally different place though.

@alexkornitzer
Copy link
Author

Hmm, different to both of the stack traces above, could that be due to recent changes in master? Would you like me to run again against the latest HEAD?

@gdamore
Copy link
Contributor

gdamore commented May 24, 2020

Yes please. I'm seeing it on the send side not the receive side.

@alexkornitzer
Copy link
Author

Okay will try it now and get back to you asap.

@alexkornitzer
Copy link
Author

Screenshot 2020-05-24 at 21 07 01

@gdamore
Copy link
Contributor

gdamore commented May 24, 2020

Interesting... different errors. Testing on Linux?

I'm actually getting SIGPIPE on send which I thought I should not, but I'm testing under WSL.

@gdamore
Copy link
Contributor

gdamore commented May 24, 2020

I wonder if the debugger or WSL are changing the signal disposition somehow. I'm also trying to test under the sanitizer.

@alexkornitzer
Copy link
Author

Yeah Linux is where i dev these sort of things, as they end up in docker, etc. Hmm, odd, I wonder if I would get the same issue if i ran under WSL?! Which sanitizer are you running, I could try with that over gdb, but WSL will be a bit of a pain as ill need to setup a windows dev vm.

@gdamore
Copy link
Contributor

gdamore commented May 24, 2020

I was using the address sanitizer. For whatever reason your code falls down hard with the memory sanitizer, and I wasn't able to convince it to generate output that would help me debug that.

@gdamore
Copy link
Contributor

gdamore commented May 24, 2020

Your crash indicates that the aio is garbage. I'm not seeing that at all on my side, but I think we're just seeing different variations of corruption.

@gdamore
Copy link
Contributor

gdamore commented May 24, 2020

I am passing MSG_NOSIGNAL, but it didn't get honored. Don't know why not. Argh.

@alexkornitzer
Copy link
Author

I am not getting anything from the address sanitizer but i am getting the strcmp error with the memory sanitizer

@gdamore
Copy link
Contributor

gdamore commented May 24, 2020

I couldn't figure out where the strcmp error was -- the sanitizer output wasn't very informative, and I was unable to set any useful break points (notably __msan_warning and __msan_warning_noreturn) in the debugger -- or rather said entry points did not fire.

@gdamore
Copy link
Contributor

gdamore commented May 24, 2020

I'm also stuck with WSL 1, because I'm not really prepared to run the insider's version on my primary desktop.

@alexkornitzer
Copy link
Author

Okay got some meaningful symbols for memorysanitizer:
Screenshot 2020-05-24 at 21 51 48

@gdamore
Copy link
Contributor

gdamore commented May 24, 2020

Meanwhile I'm firing up a HyperV guest with Ubuntu 20. We'll see if that goes any better.

@gdamore
Copy link
Contributor

gdamore commented May 24, 2020

And looking at the sanitizer output, I think it's wrong. I think it's confused because we don't do a string copy to initialize the url->u_scheme, but do a character by character copy. I'll probably change this just to silence it.

@alexkornitzer
Copy link
Author

Interesting, gonna make that change too and see if it gets further, just out of curiosity.

@alexkornitzer
Copy link
Author

Hmm, not getting anything with the memory sanitizer either, unless the sigabrt is doing something to intercept:
Screenshot 2020-05-24 at 22 32 18

@alexkornitzer
Copy link
Author

Trusty valgrind seems more useful :)
reqrep.log

@gdamore
Copy link
Contributor

gdamore commented May 24, 2020

Yep. that was quite helpful.

@gdamore
Copy link
Contributor

gdamore commented May 24, 2020

I think I have a fix. I also think respondent suffers the same problem.

@gdamore gdamore changed the title SIGSEGV in RepReq's rep0 recv, possibly due to resend SIGSEGV in RepReq's rep0 recv - use after free May 24, 2020
gdamore added a commit that referenced this issue May 24, 2020
This also affects the respondent protocol.  Examination of the other
protocols did not turn up any evidence of the same issue.
@gdamore
Copy link
Contributor

gdamore commented May 24, 2020

Please give this branch a try: https://github.com/nanomsg/nng/pull/1246/files

I think it probably solves the problem.

@alexkornitzer
Copy link
Author

That looks like it fixes it, just tried it ~10 times and could not get it to crash, tried it on the rust version too and could not get it to crash either. So Iooks like that has fixed it, thanks for sorting that out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants