Tornado 6.0 and Jupyter notebook server issue. #2651

Zsailer · 2019-04-29T18:35:40Z

We're seeing tests fail in inside of Jupyter(Lab) when using Tornado 6.x, forcing us to pin to Tornado 5.x. (See the relevant issue).

Here's the traceback we see in the tests:

    Traceback (most recent call last):
      File "~/dev/tornado/tornado/web.py", line 1699, in _execute
        result = await result
      File "~/dev/tornado/tornado/gen.py", line 736, in run
        yielded = self.gen.throw(*exc_info)  # type: ignore
      File "~/dev/notebook/notebook/services/kernels/handlers.py", line 239, in get
        yield super(ZMQChannelsHandler, self).get(kernel_id=kernel_id)
      File "~/dev/tornado/tornado/gen.py", line 729, in run
        value = future.result()
      File "~/dev/tornado/tornado/gen.py", line 736, in run
        yielded = self.gen.throw(*exc_info)  # type: ignore
      File "~/dev/notebook/notebook/base/zmqhandlers.py", line 293, in get
        yield maybe_future(res)
      File "~/dev/tornado/tornado/gen.py", line 729, in run
        value = future.result()
      File "~/dev/tornado/tornado/websocket.py", line 278, in get
        await self.ws_connection.accept_connection(self)
      File "~/dev/tornado/tornado/websocket.py", line 881, in accept_connection
        await self._accept_connection(handler)
      File "~/dev/tornado/tornado/websocket.py", line 964, in _accept_connection
        await self._receive_frame_loop()
      File "~/dev/tornado/tornado/websocket.py", line 1118, in _receive_frame_loop
        await self._receive_frame()
      File "~/dev/tornado/tornado/websocket.py", line 1130, in _receive_frame
        data = await self._read_bytes(2)
      File "~/dev/tornado/tornado/websocket.py", line 1124, in _read_bytes
        data = await self.stream.read_bytes(n)
      File "~/dev/tornado/tornado/iostream.py", line 436, in read_bytes
        future = self._start_read()
      File "~/dev/tornado/tornado/iostream.py", line 797, in _start_read
        assert self._read_future is None, "Already reading"

I'm hoping y'all might be able to provide some insight on the issue. I can't tell if we missed something in the jump to 6.x that you might be able to answer easily. 😄

We've tracked down the exact commit that broke our tests.

I dove a bit deeper into Tornado and found a single change that can toggle the break. Here's a summary of the change:

Before this commit, _maybe_add_error_listener ensured that all futures+callbacks were done+closed if the stream was closed by calling the_maybe_run_close_callback() method
_maybe_run_close_callback() was renamed to _signal_closed()
In this commit, the _maybe_add_error_listener no longer calls signal_closed/_maybe_run_close_callback()

Reverting to previous behavior fixes the issue. Here's a patch showing the diff. I add this _signal_closed() method back to _maybe_add_error_listener and things seem to work again.

I'm not sure if this is a bug in Tornado, or if Jupyter's notebook server is using this API wrong (possibly a missing await could be a likely culprit).

Any thoughts? Let me know if I can provide you more detail.

The text was updated successfully, but these errors were encountered:

bdarnell · 2019-04-30T01:45:10Z

Nice work tracking this down! That patch looks right to me. Want to submit it as a PR? (Ideally with a small test case)

I wonder why this is happening for you but slipped by past testing with jupyter and tornado 6.

I'm not sure if this is a bug in Tornado, or if Jupyter's notebook server is using this API wrong (possibly a missing await could be a likely culprit).

WebSocketHandler.get() was never intended to be overridden by subclasses (you're expected to use prepare() instead), but you appear to be doing the right thing by awaiting its result.

Zsailer · 2019-04-30T18:25:28Z

Thanks @bdarnell

Want to submit it as a PR? (Ideally with a small test case)

Sure! I'll open the PR now and write a test (though I may need help with this).

I wonder why this is happening for you but slipped by past testing with jupyter and tornado 6.

Yeah, this puzzled me too. I'm not sure why our tests in the notebook server pass. I'll try to investigate further and let you know if I learn anything.

bdarnell · 2019-05-05T20:36:02Z

Can you point me to the jupyter test that's running in to this? I'm still having trouble seeing how the bug in maybe_add_error_listener can lead to the "already reading" exception and want to make sure we've fully understood the problem.

Zsailer · 2019-05-05T21:07:42Z

Of course! Let me write up the steps to reproduce for you... give me a couple minutes.

What's weird is that we don't have any tests failing in the notebook server. The failing happens in client side tests (in JLab) that make requests to the notebook server. So there is some edge case that even our notebook server tests are not covering...

Zsailer · 2019-05-05T21:40:58Z

@bdarnell, here are steps to reproduce the issue.

Grab and install tornado 6.x. I've been using master.
Grab jupyterlab and unpin tornado. We've currently pinned master to <6.x. You can use my branch here. It's latest jlab and tornado is unpinned.
Move into JupyterLab directory.
Install Lupyterlab using pip install -e .
Build Jupyterlab's client-side tests. This takes a few minutes, unfortunately.
```
jlpm install
jlpm run build:test
```
Run specific test that's failing using the following command. This avoids running our whole test suite, which can take a few minutes.
```
cd tests/test-apputils
jlpm test --testNamePattern="restartKernel" 
```

You should see tornado throw the error above.

The test that is failing can be found here.

Zsailer · 2019-05-05T21:46:32Z

By the way, thank you tremendously for helping me here.

bdarnell · 2019-05-08T01:43:34Z

OK, that's very helpful. If I'm reading this right, it looks like the test is passing, but it's logging this "already reading" assertion as an uncaught error. (Fortunately not a production problem, but we shouldn't be logging stack traces like this when there's not a real problem)

With the help of tcpdump and some extra logging, I see that what's happening is that the client is closing the connection in the middle of the websocket handshake. Then it runs in to a weakness in our handling of closed connections: We mostly don't have explicit "if closed: return" code anywhere, we just keep going until an IOStream.read call raises a StreamClosedError.

At this point in the websocket handshake, that actually happens twice: the HTTP layer starts reading for its next batch of headers, and the websocket layer starts reading for websocket messages (the closed connection breaks the HTTP layer out of its wait, but doesn't break it all the way out of the loop). Then, because some checks are out of order, if you try to read on a closed connection, you'll set the "reading" flag before raising StreamClosedError. If you try to read a second time, you'll get the "already reading" error instead of the expected StreamClosedError.

I think fixing this should be as simple as checking the closed status of the stream earlier in IOStream read (and other) methods.

bdarnell · 2019-05-08T01:46:08Z

I think fixing this should be as simple as checking the closed status of the stream earlier in IOStream read (and other) methods.

I'm not sure what this means for the proposed fix in #2652. I can't decide whether it's a good change on its own or if it's a bad idea because it could lead to signal_closed being called multiple times in cases like this. Since we can't figure out how to trigger a problem that is specifically fixed by that change, I'm inclined to leave it out and only change the close checks in IOStream.

Zsailer · 2019-05-09T18:48:59Z

Ah, that all makes sense. Thank you for working through those tests and gathering this information.

Since we can't figure out how to trigger a problem that is specifically fixed by that change, I'm inclined to leave it out and only change the close checks in IOStream.

I totally agree. Would you like me to work on a new PR that tries to place some "checks for closed" status in helpful places?

bdarnell · 2019-05-11T01:01:06Z

I think we may just need one additional close check, in IOStream._start_read (this fixes the jupyterlab test). But we also need to confirm this with a test in tornado. I'm playing with variants of this but haven't been able to trigger the problem yet.

    @gen_test
    async def test_incomplete_handshake(self):
        with contextlib.closing(IOStream(socket.socket())) as stream:
            await stream.connect(("127.0.0.1", self.get_http_port()))
            await stream.write(
                textwrap.dedent(
                    """
                    GET /echo HTTP/1.1
                    Connection: Upgrade
                    Upgrade: WebSocket
                    Host: localhost
                    Sec-WebSocket-Version: 13
                    Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==

                    """
                )
                .replace("\n", "\r\n")
                .encode("utf-8")
            )
            # resp = await stream.read_until(b"\r\n\r\n")
            # print(resp)
            stream.close()
            await asyncio.sleep(1.0)
            # stream.close()
            # await asyncio.sleep(1.0)

Zsailer · 2019-05-16T06:00:07Z

@bdarnell sorry for the delayed reply here. I've been traveling all week and haven't had a chance to work on this. I'm hoping to return to this issue over the weekend. Thank you for pushing forward on it!

ploxiln · 2019-05-23T17:58:10Z

If I may throw in another backtrace here which may be related:

  File "/usr/local/lib/python3.6/dist-packages/tornado/http1connection.py", line 735, in _server_request_loop
    ret = yield conn.read_response(request_delegate)
  File "/usr/local/lib/python3.6/dist-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/usr/local/lib/python3.6/dist-packages/tornado/gen.py", line 326, in wrapper
    yielded = next(result)
  File "/usr/local/lib/python3.6/dist-packages/tornado/http1connection.py", line 160, in _read_message
    max_bytes=self.params.max_header_size)
  File "/usr/local/lib/python3.6/dist-packages/tornado/iostream.py", line 353, in read_until_regex
    future = self._set_read_callback(callback)
  File "/usr/local/lib/python3.6/dist-packages/tornado/iostream.py", line 860, in _set_read_callback
    assert self._read_future is None, "Already reading"
AssertionError: Already reading

This is from tornado-5.1.1 on python-3.6. We have a couple thousand long-lived websocket connections to each websocket-server process (and they also serve a few plain http requests per minute). This seems to be correlated with (comes a few milliseconds after) some cases where a websocket seems to already be closed when our (plain sychronous) open() method calls self.write_message("ACK"). With about 5k clients spread across 3 server processes, we see the close-during-open about 40 times a day, and this backtrace just a couple times a week.

I'm pretty sure this never happened with tornado-4.5.3. It's no big problem currently ... it's probably the fault of some distant client in some very strange network conditions. It's hard for me to tell whether something like #2652 would fix this occasional exception for me, but if it seems like it should, I would be interested in trying to backport it to 5.1.

CzBiX · 2019-05-30T09:06:21Z

I meet the same problem. AsyncHTTPClient will raise error when get access token from Github OAuth.
tornado.simple_httpclient.HTTPStreamClosedError: Stream closed

But I didn't use the websocket handler. My codes looks like:

class LoginHandler(RequestHandler):
  async def get(self):
    client = AsyncHTTPClient()
    resp = await client.fetch('github oauth access token url', method='POST', headers={'Accept': 'application/json'}, body=urlencode({'data': 'data'}))

Downgrade from v6.x to v5.1.1 fix it.

Zsailer · 2019-05-31T21:29:28Z

@CzBiX are you getting HTTPStreamClosedError as an error? If so, I think that's probably a separate issue (that you might want to open). Here, we're finding scenarios where StreamClosedErrors are not being raised when they should be...

Zsailer · 2019-05-31T22:00:37Z

@bdarnell I've been working on a test for this issue, but I can't recreate the behavior that we're seeing in JupyterLab (likely because I don't understand the HTTP and websocket layers well enough).

I started with your design above and have the following steps:

Connect to a socket
Send the handshake request
Close the connection before the handshake completes (by not awaiting the websocket request).
Read the websocket message after closing.

I can verify the order of these steps with tcpdump.

I get the expected StreamClosedError. I think the difference is that JupyterLab is closing at the HTTP layer while this test is closing the websocket layer directly? I'm I thinking about this correctly? If so, how do we mimic incorporate an HTTP layer that mimic's JupyterLab's behavior using tornado's test utils?

bdarnell · 2019-06-01T21:38:56Z

It shouldn't matter what "layer" the close is happening at, but the precise timing matters. JupyterLab's test is running in a separate process, which can result in timing patterns that are different from those we see inside tornado. It's sometimes difficult to reproduce the right patterns in a Tornado test and often requires fragile non-idiomatic code (like a test that has exactly two empty awaits in the right spot).

Since neither of us have been able to reproduce the right patterns, I think it's time to give up on the idea of a test for this (especially since we have an indirect test via jupyterlab) and just add the check for closed status in _start_read.

Zsailer · 2019-06-03T04:04:15Z

Okay, thanks @bdarnell for all your help on this issue. I've opened #2670 with the proposed fix.

datamove · 2019-07-27T14:10:40Z

@Zsailer @bdarnell
With
jupyterhub=1.0.0
tornado=6.0.3
I experience the same #2651 (comment)
and downgrade to tornado also 5.1.1 eliminates the error

[I 2019-07-27 13:15:36.767 JupyterHub log:174] 302 GET / -> /hub/ (@x.x.x.x) 0.92ms
[I 2019-07-27 13:15:36.930 JupyterHub log:174] 302 GET /hub/ -> /hub/login (@x.x.x.x) 0.75ms
[I 2019-07-27 13:15:37.537 JupyterHub log:174] 200 GET /hub/login (@x.x.x.x) 29.71ms
[I 2019-07-27 13:15:40.556 JupyterHub oauth2:100] OAuth redirect: 'https://hub.example.com:8000/hub/oauth_callback'
[I 2019-07-27 13:15:40.558 JupyterHub log:174] 302 GET /hub/oauth_login?next= -> https://github.com/login/oauth/authorize?respo...
[E 2019-07-27 13:15:41.276 JupyterHub web:1788] Uncaught exception GET /hub/oauth_callback?code=ea823585ae24250d5c58&state=eyJz...
     HTTPServerRequest(protocol='https', host='hub.example.com:8000', method='GET', uri='/hub/oauth_callback?code=ea823585...
     Traceback (most recent call last):
       File "/home/ubuntu/jupyterhub-env/lib/python3.6/site-packages/tornado/web.py", line 1699, in _execute
         result = await result
       File "/home/ubuntu/jupyterhub-env/lib/python3.6/site-packages/oauthenticator/oauth2.py", line 209, in get
         user = yield self.login_user()
         File "/home/ubuntu/jupyterhub-env/lib/python3.6/site-packages/jupyterhub/handlers/base.py", line 655, in login_user
          authenticated = await self.authenticate(data)
       File "/home/ubuntu/jupyterhub-env/lib/python3.6/site-packages/jupyterhub/auth.py", line 383, in get_authenticated_user
         authenticated = await maybe_future(self.authenticate(handler, data))
       File "/home/ubuntu/jupyterhub-env/lib/python3.6/site-packages/oauthenticator/github.py", line 116, in authenticate
         resp = yield http_client.fetch(req)
     tornado.simple_httpclient.HTTPStreamClosedError: Stream closed

Should I create an issue against jupyterhub? Or against tornado?

Zsailer mentioned this issue Apr 30, 2019

[WIP] Ensure all futures+signals are closed when stream is closed. #2652

Closed

Zsailer mentioned this issue Jun 3, 2019

check that stream is open before trying to read #2670

Merged

bdarnell closed this as completed in #2670 Jun 9, 2019

GeorgianaElena mentioned this issue Jul 30, 2019

Tornado 6.0 and JupyterHub issue jupyterhub/jupyterhub#2448

Closed

This was referenced Aug 1, 2019

IOStream read can fail if closed with an open read buffer #2717

Closed

Resolve pending reads on stream close #2719

Closed

bdarnell mentioned this issue Feb 1, 2020

iostream: Resolve pending reads on stream close #2805

Merged

blink1073 mentioned this issue Apr 4, 2020

Still seeing "Stream is Closed" Errors in 6.0.4 #2834

Closed

ms-O mentioned this issue Jan 8, 2021

iqsharp jupyter kernel connection fails due to tornado timeout error microsoft/iqsharp#387

Closed

This was referenced Jan 15, 2021

Tornado error: uncaght exception, kernel does not exist, stream is closed jupyter/notebook#5947

Closed

tornado error: uncaght exception, kernel does not exist, stream is closed jupyterlab/jupyterlab#9625

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tornado 6.0 and Jupyter notebook server issue. #2651

Tornado 6.0 and Jupyter notebook server issue. #2651

Zsailer commented Apr 29, 2019 •

edited

Loading

bdarnell commented Apr 30, 2019

Zsailer commented Apr 30, 2019

bdarnell commented May 5, 2019

Zsailer commented May 5, 2019

Zsailer commented May 5, 2019

Zsailer commented May 5, 2019

bdarnell commented May 8, 2019

bdarnell commented May 8, 2019

Zsailer commented May 9, 2019

bdarnell commented May 11, 2019

Zsailer commented May 16, 2019

ploxiln commented May 23, 2019 •

edited

Loading

CzBiX commented May 30, 2019

Zsailer commented May 31, 2019

Zsailer commented May 31, 2019

bdarnell commented Jun 1, 2019

Zsailer commented Jun 3, 2019

datamove commented Jul 27, 2019 •

edited

Loading

Tornado 6.0 and Jupyter notebook server issue. #2651

Tornado 6.0 and Jupyter notebook server issue. #2651

Comments

Zsailer commented Apr 29, 2019 • edited Loading

bdarnell commented Apr 30, 2019

Zsailer commented Apr 30, 2019

bdarnell commented May 5, 2019

Zsailer commented May 5, 2019

Zsailer commented May 5, 2019

Zsailer commented May 5, 2019

bdarnell commented May 8, 2019

bdarnell commented May 8, 2019

Zsailer commented May 9, 2019

bdarnell commented May 11, 2019

Zsailer commented May 16, 2019

ploxiln commented May 23, 2019 • edited Loading

CzBiX commented May 30, 2019

Zsailer commented May 31, 2019

Zsailer commented May 31, 2019

bdarnell commented Jun 1, 2019

Zsailer commented Jun 3, 2019

datamove commented Jul 27, 2019 • edited Loading

Zsailer commented Apr 29, 2019 •

edited

Loading

ploxiln commented May 23, 2019 •

edited

Loading

datamove commented Jul 27, 2019 •

edited

Loading