Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash when using Firefox with WebSockets #2349

Closed
danflu opened this issue Sep 2, 2020 · 25 comments · Fixed by #2355
Closed

Crash when using Firefox with WebSockets #2349

danflu opened this issue Sep 2, 2020 · 25 comments · Fixed by #2355

Comments

@danflu
Copy link

danflu commented Sep 2, 2020

Hello Janus Team,

I'm getting random server crashes when using Firefox 80.0.1 (64-bit) + Secure WebSockets as transport.

Using janus v0.10.5 (b26cbb1)

Pastebin link with janus log (level 7) + libasan output
https://pastebin.com/0NdfQ1J8

Pastebin link with core stack trace:
https://pastebin.com/FWw8gJLE

It appears to be something related with buffer overflow when realocating ws_client buffer.

The crash always happens consistently in the same place on file janus_websockets.c at line 904:

g_free(ws_client->buffer);

Please, let me know if there is anything else I can do to help debugging this issue.

Thanks a lot for your attention,
Daniel

@lminiero
Copy link
Member

lminiero commented Sep 3, 2020

The libasan trace talks of a heap buffer overflow in lws_write (line 1234), not the free gdb mentions (which may only be an issue detected later). You may want to open an issue on the libwebsockets repo as well, to see if the issue is there.

@atoppi
Copy link
Member

atoppi commented Sep 3, 2020

[Wed Sep 2 16:33:47 2020] [WSS-0x61700012a300] Re-allocating to 127409 bytes (was 293, response is 127393 bytes)

@danflu is a size of 127393 bytes expected from your client?
If you still have the core dump you might discover the content of the message.

Then I noticed that you are using the http/2 transport for websocket

    #8 0x7f646dec2da7 in lws_h2_frame_write (/opt/janus/lib/libwebsockets.so.15+0x2eda7)
    #9 0x7f646dec88ec in rops_write_role_protocol_h2 (/opt/janus/lib/libwebsockets.so.15+0x348ec)
    #10 0x7f646dea4df5 in lws_write (/opt/janus/lib/libwebsockets.so.15+0x10df5)

switching to http might be worth a try.

@danflu
Copy link
Author

danflu commented Sep 3, 2020

I will check what is this huge payload...

@danflu
Copy link
Author

danflu commented Sep 3, 2020

Hi @atoppi, you are rigth!
My app was sending a huge json 127k.
After reducing to a more manageable amount (7k) the crash stopped happening in firefox.
There is still a bug somewhere, but the culprit was the buffer being too big to websocket somehow.
Maybe it is still worth investigating what exactly is the cause.

@lws-team
Copy link

lws-team commented Sep 4, 2020

Tbh this allocation / reallocation and overflow seems coming from outside lws. You can get better insight usually running under valgrind, it often can pinpoint where it first went off the rails, long before the crash.

Generally if user code does buffer allocation and that breaks, you may see bad dereferences inside lws if it hands the broken pointers or lengths to lws_write, but it doesn't indicate the problem coming from there.

Just a guess, is Janus realloc logic taking care about LWS_PRE over-alloc behind the pointer passed to lws_write()?

@tmatth
Copy link
Contributor

tmatth commented Sep 4, 2020

Tbh this allocation / reallocation and overflow seems coming from outside lws. You can get better insight usually running under valgrind, it often can pinpoint where it first went off the rails, long before the crash.

Generally if user code does buffer allocation and that breaks, you may see bad dereferences inside lws if it hands the broken pointers or lengths to lws_write, but it doesn't indicate the problem coming from there.

Just a guess, is Janus realloc logic taking care about LWS_PRE over-alloc behind the pointer passed to lws_write()?

The buffer is at least LWS_SEND_BUFFER_PRE_PADDING + LWS_SEND_BUFFER_POST_PADDING, should LWS_PRE be added in addition to that? It wasn't immediately clear from the docs.

@lws-team
Copy link

lws-team commented Sep 4, 2020

You must 'hide' LWS_PRE allocated bytes behind the pointer passed to lws_write()... they can be uninitialized but must be writeable. So for a usable buffer p of length x, p = malloc(LWS_PRE + x); and then, eg, p1 = p + LWS_PRE;. You can use p1 normally up to length x, pass p1 to lws_wite() and so on. And eventually free p.

The reason is lws can then prepend ws or h2 framing headers without any memcpy.

@lws-team
Copy link

lws-team commented Sep 4, 2020

... the docs on lws_write() should make this very clear

 * IMPORTANT NOTICE!
 *
 * When sending with websocket protocol
 *
 * LWS_WRITE_TEXT,
 * LWS_WRITE_BINARY,
 * LWS_WRITE_CONTINUATION,
 * LWS_WRITE_PING,
 * LWS_WRITE_PONG,
 *
 * or sending on http/2,
 *
 * the send buffer has to have LWS_PRE bytes valid BEFORE the buffer pointer you
 * pass to lws_write().  Since you'll probably want to use http/2 before too
 * long, it's wise to just always do this with lws_write buffers... LWS_PRE is
 * typically 16 bytes it's not going to hurt usually.
 *
 * start of alloc       ptr passed to lws_write      end of allocation
 *       |                         |                         |
 *       v  <-- LWS_PRE bytes -->  v                         v
 *       [----------------  allocated memory  ---------------]
 *              (for lws use)      [====== user buffer ======]
 *
 * This allows us to add protocol info before and after the data, and send as
 * one packet on the network without payload copying, for maximum efficiency.
 *
 * So for example you need this kind of code to use lws_write with a
 * 128-byte payload
 *
 *   char buf[LWS_PRE + 128];
 *
 *   // fill your part of the buffer... for example here it's all zeros
 *   memset(&buf[LWS_PRE], 0, 128);
 *
 *   lws_write(wsi, &buf[LWS_PRE], 128, LWS_WRITE_TEXT);
 *
 * LWS_PRE is at least the frame nonce + 2 header + 8 length
 * LWS_SEND_BUFFER_POST_PADDING is deprecated, it's now 0 and can be left off.
 * The example apps no longer use it.
 *
 * Pad LWS_PRE to the CPU word size, so that word references
 * to the address immediately after the padding won't cause an unaligned access
 * error. Sometimes for performance reasons the recommended padding is even
 * larger than sizeof(void *).
 *
 *	In the case of sending using websocket protocol, be sure to allocate
 *	valid storage before and after buf as explained above.  This scheme
 *	allows maximum efficiency of sending data and protocol in a single
 *	packet while not burdening the user code with any protocol knowledge.

https://libwebsockets.org/git/libwebsockets/tree/include/libwebsockets/lws-write.h#n125

@lminiero
Copy link
Member

lminiero commented Sep 4, 2020

@lws-team thanks for the clarification! I thought we were doing the right thing already, because as Tristan said we allocate (same thing when we realloc) these amount of bytes:

int buflen = LWS_SEND_BUFFER_PRE_PADDING + strlen(response) + LWS_SEND_BUFFER_POST_PADDING;

and then pass buffer + LWS_SEND_BUFFER_PRE_PADDING to lws_write. IIRC I got this from some examples at the time (a long time ago!), but I guess something changed since then and I failed to align the code. Does this mean LWS_SEND_BUFFER_PRE_PADDING and LWS_SEND_BUFFER_POST_PADDING are now not needed anymore, and I should allocate a buffer of this size instead?

int buflen = LWS_PRE + strlen(response);

Do I need fences in the code to behave differently, e.g., for lws 3 or 4, or can we safely assume all lws versions will use that properly?

@lminiero
Copy link
Member

lminiero commented Sep 4, 2020

Ok, looks like I really failed to keep up to date 🙂
Checking the repo code, LWS_SEND_BUFFER_POST_PADDING has been deprecated 5 years ago (and is now defined to 0), when LWS_SEND_BUFFER_PRE_PADDING simply became an alias for LWS_PRE.

But this means that, despite the legacy syntax (that I should fix), I'm apparently already doing the right thing: we do allocate the buffer as LWS_PRE + buffer size:

LWS_SEND_BUFFER_PRE_PADDING + size + LWS_SEND_BUFFER_POST_PADDING = LWS_PRE + size + 0

and I'm passing the right buffer to lws_write as well.

@lws-team
Copy link

lws-team commented Sep 4, 2020

What was said was just

The buffer is at least LWS_SEND_BUFFER_PRE_PADDING + LWS_SEND_BUFFER_POST_PADDING, should LWS_PRE be added in addition to that? It wasn't immediately clear from the docs.

Six years ago or so LWS_SEND_BUFFER_PRE_PADDING got shortened to LWS_PRE. The post padding only existed in earlier drafts of the ws rfc, that has been fixed at 0 for a similar number of years and can be left off. So you just need LWS_PRE plus the length, and pass the pointer offset by LWS_PRE.

The combination of a "buffer overflow" reported and a blowup in free if something was reallocated there makes it sound like all may not well in that path... it could be some other problem, if lws still suspected try it against master lws and see if problems still coming.

@lminiero
Copy link
Member

lminiero commented Sep 4, 2020

Got it, thanks! Yes, you were absolutely right in pointing out we were using very old semantics, I'll fix it right away. I just wanted to make sure I wasn't doing something wrong in the send code despite that: the realloc code follows the same logic as the alloc code, so I think it should be fine, but I'll double check.

I think the cause of the issue here was the incredibly long buffer that was being passed to the transport, but we're waiting for more details on that. If this was the issue, I should be able to replicate it locally to investigate.

@atoppi
Copy link
Member

atoppi commented Sep 4, 2020

@lws-team @danflu I have managed to reproduce the issue by enabling HTTP2 in lws configuration.

Steps:

  • compile lws with -DLWS_WITH_HTTP2=ON
  • enable WSS server in janus
  • quick hack to send huge JSON -> edit janus.js like this to send very large transactions id
// Helper method to create random identifiers (e.g., transaction)
Janus.randomString = function(len) {
	var charSet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789';
	var randomString = '';
	for (var i = 0; i < 130000; i++) {
		var randomPoz = Math.floor(Math.random() * charSet.length);
		randomString += charSet.substring(randomPoz,randomPoz+1);
	}
	return randomString;
};
  • try a janus demo page with Firefox

Janus will crash, here the stack trace with libasan https://pastebin.com/2yTK4Zc9

One of the following will be a workaround:

  • don't use WSS on janus (use WS and a WSS frontend proxy)
  • don't use Firefox
  • disable H2 in lws

@lws-team should you need any further debugging just ask

@atoppi
Copy link
Member

atoppi commented Sep 4, 2020

Forgot to mention I'm on origin/v4.0-stable branch, commit c4327e7f.
I'll do a test with the most recent versions asap.

@atoppi
Copy link
Member

atoppi commented Sep 4, 2020

Crash with origin/v4.0-stable 5b852b56 -> https://pastebin.com/vBX2SdVM
Crash with master 008b3551 -> https://pastebin.com/hBbWKF64

@lws-team
Copy link

lws-team commented Sep 4, 2020

0x7f1e27a9a7fd is located 3 bytes to the left of 130111-byte region [0x7f1e27a9a800,0x7f1e27aba43f)

Can you try a quick hack on your user code to allocate and point at (LWS_PRE + 3) instead of LWS_PRE? I am not sure if I understand what asan is telling me but it sounds like it's complaining we are using 3 bytes behind.

In the mode you describe actually lws is doing RFC8441 encapsulation of ws inside h2 frames... the layout is

[h2 header][ws header]|[your payload]

by the time it is sent, where | is the pointer given to lws_write().

Normally the ws header is not huge but actually, it uses variable length integer coding for the bulk payload length... if it faces a monster payload it may overflow LWS_PRE with additional bytes used to encode the length...

@atoppi
Copy link
Member

atoppi commented Sep 4, 2020

I've replaced LWS_PRE with (LWS_PRE + 3) in our user code.
Janus did not crash but the WS setup failed.

Here's the log from origin/v4.0-stable 5b852b56: https://pastebin.com/t0jZEem8

@lws-team
Copy link

lws-team commented Sep 4, 2020

During the h2 negotiation Firefox leaves it to use the default h2 max frame size of 16384... it's trying to shove 130K down there in one frame, and the client should hang up on it for that which seems to be what happens.

So I think we understand what goes on... 1) 16 bytes LWS_PRE isn't enough if you have h2 + RFC8441 plus a huge length, 2) but RFC8441 huge length won't fly anyway in an h2 encapsulation framing limited to 16KB itself

In other words we can't support these huge ws fragments directly. I'll try to add something to complain and reject it, but there isn't a "fix" at lws level.

What should happen is if you want to forward a huge ws message, break it up into ideally ~2mtu (eg, 3KB) ws fragments using ws fragmentation arrangements and send those with one writeable / lws_write() each

https://tools.ietf.org/html/rfc6455#section-5.4

in lws_write(), there are flags to express the different fragmentation states, in recent lws there is a helper

https://libwebsockets.org/git/libwebsockets/tree/include/libwebsockets/lws-write.h#n221-246

all ws clients understand the native ws fragmentation and you can produce endless messages that way, without needing to know the whole message length when you get started.

You can either use that to compute the flags or refer to what it does to pick your own flags directly. Then it'll all hang together keeping LWS_PRE at 16, ws-over-h2 and huge ws messages (made up of not-huge ws fragments).

@danflu
Copy link
Author

danflu commented Sep 4, 2020

Hi @lws-team, thanks a lot for your clarifications!
Thanks @atoppi, @lminiero, @tmatth !
The most important thing is that we were able to understand the issue.
I can live with 16kb message size limitation and reframe bigger messages by myself, but certainly would be a good thing to Janus team implement the lws fragmentation scheme so the end user can use the transport layer in a more transparent way.
Thanks you all for your valuable support in this matter!
Daniel

@atoppi
Copy link
Member

atoppi commented Sep 4, 2020

@lws-team thanks for the detailed reply.

However I don't understand why this happens only in in WSS mode.
IIRC the RFC8441 encapsulation is not aware of the underneath TLS connection.

@lws-team
Copy link

lws-team commented Sep 4, 2020

You can only negotiate h2 by alpn, over tls... otherwise you never can meet this encapsulation.

@atoppi
Copy link
Member

atoppi commented Sep 7, 2020

@lws-team ok thanks again for clarifying and suggesting a solution.

@lminiero
Copy link
Member

lminiero commented Sep 7, 2020

@lws-team I'm looking into how to start using the fragmentation stuff, and it does seem relatively easy indeed. There's one thing that I'm a bit confused about though. When sending a packet, I can set in the flags that I'm sending the last fragment, e.g., if I'm sending all the data that's left: there's no guarantee all the data will actually be sent, though, as for instance lws_write might return a smaller number of bytes actually sent, which means I'd have to send at least one more fragment after that. Will the lws stack automatically "disable" the "last fragment" bit if it couldn't send everything?

@lws-team
Copy link

lws-team commented Sep 7, 2020

lws_write might return a smaller number of bytes actually sent

For a few years lws will handle partial writes transparently, it will only return a number less than you asked if the connection is dead.

It will malloc up a buffer for any remainder, and suppress WRITEABLE on the connection while sending it in the background. When it has flushed the remainder it will allow WRITEABLE to be seen by the connection again. It's not efficient to copy stuff around into buffers like that, but it will handle it. That's why there is the recommendation for 2 x mtu, this is the value that send() or write() will almost always accept on linux.

You don't have to worry about the flags in this partial case, the flags are not per tcp fragment (which you can't control) but per ws fragment, each of which can extend over a large number of tcp fragments. So if it sends the first part of a ws fragment with FIN on it, the peer won't process it as the end of the ws fragment until it receives the expected whole length of the fragment. When it sees it has that, and if it originally came with a FIN, if it's a browser, typically it will then present the whole ws message - all of the fragments it had been saving up - at once to the client.

@lminiero
Copy link
Member

lminiero commented Sep 7, 2020

Got it, thanks! Then I'll reuse the existing buffering approach we have, to send fragments instead of what it does right now. I'll try to come up with a PR people can test soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants