Tornado 6.0 and Jupyter notebook server issue. #2651
Nice work tracking this down! That patch looks right to me. Want to submit it as a PR? (Ideally with a small test case.) I wonder why this is happening for you but slipped past prior testing with jupyter and tornado 6.
Thanks @bdarnell
Sure! I'll open the PR now and write a test (though I may need help with this).
Yeah, this puzzled me too. I'm not sure why our tests in the notebook server pass. I'll try to investigate further and let you know if I learn anything.
Can you point me to the jupyter test that's running into this? I'm still having trouble seeing how the bug in `_maybe_add_error_listener` can lead to the "already reading" exception, and I want to make sure we've fully understood the problem.
Of course! Let me write up the steps to reproduce for you... give me a couple minutes. What's weird is that we don't have any tests failing in the notebook server. The failures happen in client-side tests (in JLab) that make requests to the notebook server, so there is some edge case that even our notebook server tests are not covering...
@bdarnell, here are steps to reproduce the issue.
You should see tornado throw the error above. The test that is failing can be found here.
By the way, thank you tremendously for helping me here.
OK, that's very helpful. If I'm reading this right, it looks like the test is passing, but it's logging this "already reading" assertion as an uncaught error. (Fortunately not a production problem, but we shouldn't be logging stack traces like this when there's not a real problem.)

With the help of tcpdump and some extra logging, I see that what's happening is that the client is closing the connection in the middle of the websocket handshake. Then it runs into a weakness in our handling of closed connections: we mostly don't have explicit "if closed: return" code anywhere; we just keep going until an operation on the closed connection fails. At this point in the websocket handshake, that actually happens twice: the HTTP layer starts reading for its next batch of headers, and the websocket layer starts reading for websocket messages (the closed connection breaks the HTTP layer out of its wait, but doesn't break it all the way out of the loop).

Then, because some checks are out of order, if you try to read on a closed connection, you'll set the "reading" flag before raising StreamClosedError. If you try to read a second time, you'll get the "already reading" error instead of the expected StreamClosedError. I think fixing this should be as simple as checking the closed status of the stream earlier in IOStream's read (and other) methods.
I'm not sure what this means for the proposed fix in #2652. I can't decide whether it's a good change on its own or if it's a bad idea because it could lead to `_signal_closed` being called multiple times in cases like this. Since we can't figure out how to trigger a problem that is specifically fixed by that change, I'm inclined to leave it out and only change the close checks in IOStream.
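To make the check-ordering problem described above concrete, here is a toy sketch; the class is a simplified stand-in, not Tornado's actual IOStream implementation:

```python
# A toy sketch of the check-ordering problem described above; this is a
# simplified stand-in, not Tornado's actual IOStream code.
from tornado.iostream import StreamClosedError


class ToyIOStream:
    def __init__(self) -> None:
        self._closed = False
        self._reading = False

    def close(self) -> None:
        self._closed = True

    def read_buggy(self) -> None:
        # Ordering as described above: the "reading" flag is set before the
        # closed check, so the first read on a closed stream raises
        # StreamClosedError but leaves the flag set.
        assert not self._reading, "Already reading"
        self._reading = True
        if self._closed:
            raise StreamClosedError()
        # ...normal read path would go here.

    def read_fixed(self) -> None:
        # The fix suggested above: check the closed status first, so every
        # read on a closed stream raises StreamClosedError consistently.
        if self._closed:
            raise StreamClosedError()
        assert not self._reading, "Already reading"
        self._reading = True
        # ...normal read path would go here.
```

With `read_buggy`, a second read on a closed stream trips the assertion instead of raising StreamClosedError again, which matches the "already reading" traceback being logged in Jupyter; `read_fixed` shows the reordering suggested above.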
Ah, that all makes sense. Thank you for working through those tests and gathering this information.
I totally agree. Would you like me to work on a new PR that tries to place some of these closed checks earlier in IOStream?
I think we may just need one additional close check. Here's the test I've been experimenting with:

```python
import asyncio
import contextlib
import socket
import textwrap

from tornado.iostream import IOStream
from tornado.testing import gen_test


# This test is meant to live inside an AsyncHTTPTestCase-style test class
# (e.g. tornado's websocket test suite), so that self.get_http_port()
# resolves to a running server with an /echo websocket handler.
@gen_test
async def test_incomplete_handshake(self):
    with contextlib.closing(IOStream(socket.socket())) as stream:
        await stream.connect(("127.0.0.1", self.get_http_port()))
        # Send only the start of a websocket handshake (no terminating
        # blank line), then hang up before the upgrade completes.
        await stream.write(
            textwrap.dedent(
                """
                GET /echo HTTP/1.1
                Connection: Upgrade
                Upgrade: WebSocket
                Host: localhost
                Sec-WebSocket-Version: 13
                Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
                """
            )
            .replace("\n", "\r\n")
            .encode("utf-8")
        )
        # resp = await stream.read_until(b"\r\n\r\n")
        # print(resp)
        stream.close()
        # Give the server a moment to react to the closed connection.
        await asyncio.sleep(1.0)
        # stream.close()
        # await asyncio.sleep(1.0)
```
@bdarnell sorry for the delayed reply here. I've been traveling all week and haven't had a chance to work on this. I'm hoping to return to this issue over the weekend. Thank you for pushing forward on it!
If I may throw in another backtrace here which may be related:
This is from tornado-5.1.1 on python-3.6. We have a couple thousand long-lived websocket connections to each websocket-server process (and they also serve a few plain HTTP requests per minute). This seems to be correlated with (it comes a few milliseconds after) cases where a websocket seems to already be closed when our (plain synchronous) code interacts with it.

I'm pretty sure this never happened with tornado-4.5.3. It's no big problem currently; it's probably the fault of some distant client in some very strange network conditions. It's hard for me to tell whether something like #2652 would fix this occasional exception for me, but if it seems like it should, I would be interested in trying to backport it to 5.1.
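If the synchronous call in question is a write to the websocket, one way to keep this from surfacing as a logged traceback is to catch WebSocketClosedError, which Tornado raises when writing to a connection the peer has already closed. A minimal, hypothetical sketch (the handler class and `push_update` method are invented, not taken from the report above):

```python
# Illustrative only: guarding a plain synchronous write against a websocket
# that the client has already closed. The handler class and push_update
# method are hypothetical.
from tornado.websocket import WebSocketClosedError, WebSocketHandler


class PushHandler(WebSocketHandler):
    def push_update(self, payload: str) -> None:
        try:
            self.write_message(payload)
        except WebSocketClosedError:
            # The client went away between our bookkeeping and this write;
            # drop the message rather than logging a stack trace.
            pass
```

This only quiets the write side; it wouldn't address the underlying IOStream bookkeeping discussed earlier in the thread.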
I'm hitting the same problem, but I'm not using a websocket handler. My code looks like:

Downgrading from v6.x to v5.1.1 fixes it.
@CzBiX, are you getting the same error?
@bdarnell I've been working on a test for this issue, but I can't recreate the behavior that we're seeing in JupyterLab (likely because I don't understand the HTTP and websocket layers well enough). I started with your design above and have the following steps:
I can verify the order of these steps with tcpdump, but I still get the expected StreamClosedError rather than the "already reading" assertion.
It shouldn't matter what "layer" the close is happening at, but the precise timing matters. JupyterLab's test is running in a separate process, which can result in timing patterns that are different from those we see inside tornado. It's sometimes difficult to reproduce the right patterns in a Tornado test, and it often requires fragile, non-idiomatic code (like a test that has to yield to the event loop an exact number of times).

Since neither of us has been able to reproduce the right patterns, I think it's time to give up on the idea of a test for this (especially since we have an indirect test via jupyterlab) and just add the check for closed status in IOStream.
@Zsailer @bdarnell
Should I create an issue against jupyterhub? Or against tornado? |
We're seeing tests fail inside of Jupyter(Lab) when using Tornado 6.x, forcing us to pin to Tornado 5.x. (See the relevant issue.)
Here's the traceback we see in the tests:
I'm hoping y'all might be able to provide some insight on the issue. I can't tell if we missed something in the jump to 6.x that you might be able to answer easily. 😄
We've tracked down the exact commit that broke our tests.
I dove a bit deeper into Tornado and found a single change that can toggle the break. Here's a summary of the change:

- `_maybe_add_error_listener` ensured that all futures+callbacks were done+closed if the stream was closed, by calling the `_maybe_run_close_callback()` method
- `_maybe_run_close_callback()` was renamed to `_signal_closed()`
- `_maybe_add_error_listener` no longer calls `_signal_closed()`/`_maybe_run_close_callback()`

Reverting to the previous behavior fixes the issue. Here's a patch showing the diff. I added this `_signal_closed()` call back to `_maybe_add_error_listener` and things seem to work again.

I'm not sure if this is a bug in Tornado, or if Jupyter's notebook server is using this API wrong (a missing `await` could be a likely culprit). Any thoughts? Let me know if I can provide more detail.
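The linked diff isn't reproduced here, but to make the described change concrete, here is a self-contained toy illustration; the class is entirely hypothetical and only borrows the method names mentioned above:

```python
# Toy illustration of the behavioral change described above; the class is
# hypothetical and only mirrors the method names mentioned in this issue.
class ToyStream:
    def __init__(self) -> None:
        self._closed = False
        self._close_callbacks = []

    def close(self) -> None:
        self._closed = True

    def closed(self) -> bool:
        return self._closed

    def _signal_closed(self) -> None:
        # Formerly _maybe_run_close_callback(): run any pending close
        # callbacks so nothing keeps waiting on a dead stream.
        while self._close_callbacks:
            self._close_callbacks.pop()()

    def _maybe_add_error_listener_pre_6(self) -> None:
        # Pre-6.0 behavior as described above: if the stream is already
        # closed, notify the close machinery.
        if self.closed():
            self._signal_closed()

    def _maybe_add_error_listener_6_0(self) -> None:
        # 6.0 behavior as described above: the _signal_closed() call is
        # gone, so callers relying on it are no longer notified here.
        pass
```

In the pre-6.0 shape, code waiting on the stream is notified via the close callbacks even when the stream is found to be already closed; in the 6.0 shape that notification no longer happens from this path, which is the difference the patch described above restores.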