SIGSEGV in RepReq's rep0 recv - use after free #1241

alexkornitzer · 2020-05-10T22:52:20Z

As briefly discussed in #1240 there appears to be a bug in the current implementation that results in a SIGSEGV when a socket is closed but the resend timeout fired at least once. Or that is my current theory on it at least.

SIGSEGV from C implementation:

https://gist.github.com/AlexKornitzer/371eef55e558e89d10850d53986dfb35

./reqrep server ipc:///tmp/abcd 
./reqrep client ipc:///tmp/abcd

panic: pthread_mutex_lock: Invalid argument
This message is indicative of a BUG.
Report this at https://github.com/nanomsg/nng/issues
/home/developer/Developer/scratch/nng/reqrep(+0x15bb9) [0x555555569bb9]
/home/developer/Developer/scratch/nng/reqrep(+0x1f7af) [0x5555555737af]
/home/developer/Developer/scratch/nng/reqrep(+0x1f958) [0x555555573958]
/home/developer/Developer/scratch/nng/reqrep(+0x1c826) [0x555555570826]
/home/developer/Developer/scratch/nng/reqrep(+0x1c562) [0x555555570562]
/home/developer/Developer/scratch/nng/reqrep(+0xf21f) [0x55555556321f]
/home/developer/Developer/scratch/nng/reqrep(+0x24617) [0x555555578617]
/home/developer/Developer/scratch/nng/reqrep(+0x15f4e) [0x555555569f4e]
/home/developer/Developer/scratch/nng/reqrep(+0xe60d) [0x55555556260d]
/home/developer/Developer/scratch/nng/reqrep(+0xec35) [0x555555562c35]
/home/developer/Developer/scratch/nng/reqrep(+0x18a00) [0x55555556ca00]
/home/developer/Developer/scratch/nng/reqrep(+0x739f) [0x55555555b39f]
/home/developer/Developer/scratch/nng/reqrep(+0x710d) [0x55555555b10d]
/home/developer/Developer/scratch/nng/reqrep(+0x6f29) [0x55555555af29]
/home/developer/Developer/scratch/nng/reqrep(+0x690f) [0x55555555a90f]
/home/developer/Developer/scratch/nng/reqrep(+0x6dac) [0x55555555adac]
/usr/lib/libc.so.6(__libc_start_main+0xf3) [0x7ffff7ded023]
/home/developer/Developer/scratch/nng/reqrep(+0x672e) [0x55555555a72e]

(gdb) bt
#0  0x00007ffff7e01ce5 in raise () from /usr/lib/libc.so.6
#1  0x00007ffff7deb857 in abort () from /usr/lib/libc.so.6
#2  0x00005555555733bf in nni_plat_abort ()
#3  0x0000555555569bbe in nni_panic ()
#4  0x00005555555737af in nni_pthread_mutex_lock ()
#5  0x0000555555573958 in nni_plat_mtx_lock ()
#6  0x0000555555570826 in nni_mtx_lock ()
#7  0x0000555555570562 in nni_task_dispatch ()
#8  0x000055555556321f in nni_aio_begin ()
#9  0x0000555555578617 in ipctran_pipe_recv ()
#10 0x0000555555569f4e in nni_pipe_recv ()
#11 0x000055555556260d in rep0_ctx_recv ()
#12 0x0000555555562c35 in rep0_sock_recv ()
#13 0x000055555556ca00 in nni_sock_recv ()
#14 0x000055555555b39f in nng_recv_aio ()
#15 0x000055555555b10d in nng_recvmsg ()
#16 0x000055555555af29 in nng_recv ()
#17 0x000055555555a90f in server ()
#18 0x000055555555adac in main ()

SIGSEGV from Rust version:

Thread 1 "scratch" received signal SIGSEGV, Segmentation fault.
0x0000555555577ce4 in nni_pipe_recv (p=0x0, aio=0x7fffe8001430)
    at /home/developer/.cargo/registry/src/github.jparrowsec.cn-1ecc6299db9ec823/nng-sys-1.2.4-rc.1/nng/src/core/pipe.c:141
141             p->p_tran_ops.p_recv(p->p_tran_data, aio);
(gdb) bt
#0  0x0000555555577ce4 in nni_pipe_recv (p=0x0, aio=0x7fffe8001430)
    at /home/developer/.cargo/registry/src/github.jparrowsec.cn-1ecc6299db9ec823/nng-sys-1.2.4-rc.1/nng/src/core/pipe.c:141
#1  0x000055555556dddf in rep0_ctx_recv (arg=0x5555555ead20, aio=0x5555555edcb0)
    at /home/developer/.cargo/registry/src/github.jparrowsec.cn-1ecc6299db9ec823/nng-sys-1.2.4-rc.1/nng/src/protocol/reqrep0/rep.c:504
#2  0x000055555556e361 in rep0_sock_recv (arg=0x5555555eac10, aio=0x5555555edcb0)
    at /home/developer/.cargo/registry/src/github.jparrowsec.cn-1ecc6299db9ec823/nng-sys-1.2.4-rc.1/nng/src/protocol/reqrep0/rep.c:670
#3  0x000055555557ade3 in nni_sock_recv (sock=0x5555555ea220, aio=0x5555555edcb0)
    at /home/developer/.cargo/registry/src/github.jparrowsec.cn-1ecc6299db9ec823/nng-sys-1.2.4-rc.1/nng/src/core/socket.c:847
#4  0x00005555555666e6 in nng_recv_aio (s=..., aio=0x5555555edcb0)
    at /home/developer/.cargo/registry/src/github.jparrowsec.cn-1ecc6299db9ec823/nng-sys-1.2.4-rc.1/nng/src/nng.c:214
#5  0x000055555556651d in nng_recvmsg (s=..., msgp=0x7fffffffcf50, flags=0)
    at /home/developer/.cargo/registry/src/github.jparrowsec.cn-1ecc6299db9ec823/nng-sys-1.2.4-rc.1/nng/src/nng.c:136
#6  0x0000555555564a84 in nng::socket::Socket::recv (self=0x7fffffffd018)
    at /home/developer/.cargo/git/checkouts/nng-rs-585a5382ec72d3dc/fc301c2/src/socket.rs:245
#7  0x00005555555600f1 in scratch::server () at src/main.rs:76
#8  0x0000555555560339 in scratch::main () at src/main.rs:87

The text was updated successfully, but these errors were encountered:

gdamore · 2020-05-18T00:41:29Z

Thanks, will investigate.

gdamore · 2020-05-18T06:04:39Z

Can you see if this is reproducible with tonight's latest commits? I think there's a chance that this might be another manifestation of a problem I resolved in TCP.

alexkornitzer · 2020-05-18T11:51:42Z

Unfortunately I can, just tried with the latest master, and I am getting the same backtraces as above.

gdamore · 2020-05-19T05:12:16Z

Thanks. I will look further into it.

gdamore · 2020-05-24T19:18:37Z

There's at least one bug in your code -- a use-after-free -- in your program. It's not clear to me yet that there is a relationship with this particular crash though. I'll comment on your gist.

gdamore · 2020-05-24T19:25:33Z

Actually I may have misread the code.

gdamore · 2020-05-24T19:32:37Z

Your backtrace data above is not quite complete, since it doesn't include the arguments, which is unfortunate. It looks like this might be a race where the pipe is closed (and so members of the pipe itself are discarded) but the pipe still is on a per-socket recvpipes list. This might actually also be a bit of a race as we take it off that list. All of this is supposed to be done with the mutex for the socket held, so it should be safe. But something isn't right.

gdamore · 2020-05-24T19:34:42Z

Yep. The rep0_pipe_close doesn't properly verify that the pipe isn't busy receiving.

gdamore · 2020-05-24T19:43:12Z

Are you using NNG 1.2.4 or HEAD (which is 1.3.x?) The rust crash look like it is from 1.2.4. There are rather large changes in 1.3.x.

alexkornitzer · 2020-05-24T19:46:19Z

Ah thought it might be something like that but the code was unfamiliar to me and I had a feeling you would identify the issue much faster.

On the rust front they currently point at 1.2.4 but I pointed nng-sys at HEAD to ensure that I was using the latest version before reporting that one. So unless I typoed both versions were tested against HEAD.

gdamore · 2020-05-24T19:50:00Z

Is there any special trick to triggering the crash in your test program? I can't seem to reproduce it. Something here isn't adding up for me.

alexkornitzer · 2020-05-24T19:53:12Z

Being a race condition, I tend to find if it does not trigger first time i have to start the server again.

Its a bit of a gamble but i start the server, leave the client to run for about 3 seconds then kill it. After a few attempts this seems to trigger it. If you still cant trigger it i’ll try and script a loop to trigger it.

gdamore · 2020-05-24T19:55:16Z

Are you killing the server, or the client?

alexkornitzer · 2020-05-24T19:56:26Z

The client, which will then cause the server to crash. Hence why your reasoning above seems accurate.

gdamore · 2020-05-24T19:56:56Z

Ok, I just got it.

gdamore · 2020-05-24T19:58:31Z

Oh this is in a totally different place though.

alexkornitzer · 2020-05-24T20:00:09Z

Hmm, different to both of the stack traces above, could that be due to recent changes in master? Would you like me to run again against the latest HEAD?

gdamore · 2020-05-24T20:00:39Z

Yes please. I'm seeing it on the send side not the receive side.

alexkornitzer · 2020-05-24T20:02:09Z

Okay will try it now and get back to you asap.

alexkornitzer · 2020-05-24T20:07:45Z

gdamore · 2020-05-24T20:25:16Z

Interesting... different errors. Testing on Linux?

I'm actually getting SIGPIPE on send which I thought I should not, but I'm testing under WSL.

gdamore · 2020-05-24T20:27:17Z

I wonder if the debugger or WSL are changing the signal disposition somehow. I'm also trying to test under the sanitizer.

alexkornitzer · 2020-05-24T20:29:27Z

Yeah Linux is where i dev these sort of things, as they end up in docker, etc. Hmm, odd, I wonder if I would get the same issue if i ran under WSL?! Which sanitizer are you running, I could try with that over gdb, but WSL will be a bit of a pain as ill need to setup a windows dev vm.

gdamore · 2020-05-24T20:30:50Z

I was using the address sanitizer. For whatever reason your code falls down hard with the memory sanitizer, and I wasn't able to convince it to generate output that would help me debug that.

gdamore · 2020-05-24T20:33:09Z

Your crash indicates that the aio is garbage. I'm not seeing that at all on my side, but I think we're just seeing different variations of corruption.

gdamore · 2020-05-24T20:41:36Z

I am passing MSG_NOSIGNAL, but it didn't get honored. Don't know why not. Argh.

alexkornitzer · 2020-05-24T20:44:08Z

I am not getting anything from the address sanitizer but i am getting the strcmp error with the memory sanitizer

gdamore · 2020-05-24T20:48:17Z

I couldn't figure out where the strcmp error was -- the sanitizer output wasn't very informative, and I was unable to set any useful break points (notably __msan_warning and __msan_warning_noreturn) in the debugger -- or rather said entry points did not fire.

gdamore · 2020-05-24T20:50:30Z

I'm also stuck with WSL 1, because I'm not really prepared to run the insider's version on my primary desktop.

alexkornitzer · 2020-05-24T20:52:34Z

Okay got some meaningful symbols for memorysanitizer:

gdamore · 2020-05-24T21:01:17Z

Meanwhile I'm firing up a HyperV guest with Ubuntu 20. We'll see if that goes any better.

gdamore · 2020-05-24T21:07:32Z

And looking at the sanitizer output, I think it's wrong. I think it's confused because we don't do a string copy to initialize the url->u_scheme, but do a character by character copy. I'll probably change this just to silence it.

alexkornitzer · 2020-05-24T21:13:33Z

Interesting, gonna make that change too and see if it gets further, just out of curiosity.

alexkornitzer · 2020-05-24T21:33:10Z

Hmm, not getting anything with the memory sanitizer either, unless the sigabrt is doing something to intercept:

alexkornitzer · 2020-05-24T21:40:32Z

Trusty valgrind seems more useful :)
reqrep.log

gdamore · 2020-05-24T22:52:27Z

Yep. that was quite helpful.

gdamore · 2020-05-24T23:10:01Z

I think I have a fix. I also think respondent suffers the same problem.

This also affects the respondent protocol. Examination of the other protocols did not turn up any evidence of the same issue.

gdamore · 2020-05-24T23:29:24Z

Please give this branch a try: https://github.com/nanomsg/nng/pull/1246/files

I think it probably solves the problem.

alexkornitzer · 2020-05-25T11:05:21Z

That looks like it fixes it, just tried it ~10 times and could not get it to crash, tried it on the rust version too and could not get it to crash either. So Iooks like that has fixed it, thanks for sorting that out.

gdamore self-assigned this May 18, 2020

gdamore added the bug label May 18, 2020

gdamore changed the title ~~SIGSEGV in RepReq's rep0 recv, possibly due to resend~~ SIGSEGV in RepReq's rep0 recv - use after free May 24, 2020

gdamore added a commit that referenced this issue May 24, 2020

fixes #1241 SIGSEGV in RepReq's rep0 recv - use after free

f5cfdc4

This also affects the respondent protocol. Examination of the other protocols did not turn up any evidence of the same issue.

gdamore closed this as completed in 79aed58 May 25, 2020

SIGSEGV in RepReq's rep0 recv - use after free #1241

SIGSEGV in RepReq's rep0 recv - use after free #1241

Comments

alexkornitzer commented May 10, 2020

gdamore commented May 18, 2020

gdamore commented May 18, 2020

alexkornitzer commented May 18, 2020

gdamore commented May 19, 2020

gdamore commented May 24, 2020

gdamore commented May 24, 2020

gdamore commented May 24, 2020

gdamore commented May 24, 2020

gdamore commented May 24, 2020

alexkornitzer commented May 24, 2020

gdamore commented May 24, 2020

alexkornitzer commented May 24, 2020 • edited Loading

gdamore commented May 24, 2020

alexkornitzer commented May 24, 2020

gdamore commented May 24, 2020

gdamore commented May 24, 2020

alexkornitzer commented May 24, 2020

gdamore commented May 24, 2020

alexkornitzer commented May 24, 2020

alexkornitzer commented May 24, 2020

gdamore commented May 24, 2020

gdamore commented May 24, 2020

alexkornitzer commented May 24, 2020

gdamore commented May 24, 2020

gdamore commented May 24, 2020

gdamore commented May 24, 2020

alexkornitzer commented May 24, 2020

gdamore commented May 24, 2020

gdamore commented May 24, 2020

alexkornitzer commented May 24, 2020

gdamore commented May 24, 2020

gdamore commented May 24, 2020

alexkornitzer commented May 24, 2020

alexkornitzer commented May 24, 2020

alexkornitzer commented May 24, 2020

gdamore commented May 24, 2020

gdamore commented May 24, 2020

gdamore commented May 24, 2020

alexkornitzer commented May 25, 2020

alexkornitzer commented May 24, 2020 •

edited

Loading