Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tornado 6.0 and Jupyter notebook server issue. #2651

Closed
Zsailer opened this issue Apr 29, 2019 · 18 comments · Fixed by #2670
Closed

Tornado 6.0 and Jupyter notebook server issue. #2651

Zsailer opened this issue Apr 29, 2019 · 18 comments · Fixed by #2670

Comments

@Zsailer
Copy link
Contributor

Zsailer commented Apr 29, 2019

We're seeing tests fail in inside of Jupyter(Lab) when using Tornado 6.x, forcing us to pin to Tornado 5.x. (See the relevant issue).

Here's the traceback we see in the tests:

    Traceback (most recent call last):
      File "~/dev/tornado/tornado/web.py", line 1699, in _execute
        result = await result
      File "~/dev/tornado/tornado/gen.py", line 736, in run
        yielded = self.gen.throw(*exc_info)  # type: ignore
      File "~/dev/notebook/notebook/services/kernels/handlers.py", line 239, in get
        yield super(ZMQChannelsHandler, self).get(kernel_id=kernel_id)
      File "~/dev/tornado/tornado/gen.py", line 729, in run
        value = future.result()
      File "~/dev/tornado/tornado/gen.py", line 736, in run
        yielded = self.gen.throw(*exc_info)  # type: ignore
      File "~/dev/notebook/notebook/base/zmqhandlers.py", line 293, in get
        yield maybe_future(res)
      File "~/dev/tornado/tornado/gen.py", line 729, in run
        value = future.result()
      File "~/dev/tornado/tornado/websocket.py", line 278, in get
        await self.ws_connection.accept_connection(self)
      File "~/dev/tornado/tornado/websocket.py", line 881, in accept_connection
        await self._accept_connection(handler)
      File "~/dev/tornado/tornado/websocket.py", line 964, in _accept_connection
        await self._receive_frame_loop()
      File "~/dev/tornado/tornado/websocket.py", line 1118, in _receive_frame_loop
        await self._receive_frame()
      File "~/dev/tornado/tornado/websocket.py", line 1130, in _receive_frame
        data = await self._read_bytes(2)
      File "~/dev/tornado/tornado/websocket.py", line 1124, in _read_bytes
        data = await self.stream.read_bytes(n)
      File "~/dev/tornado/tornado/iostream.py", line 436, in read_bytes
        future = self._start_read()
      File "~/dev/tornado/tornado/iostream.py", line 797, in _start_read
        assert self._read_future is None, "Already reading"

I'm hoping y'all might be able to provide some insight on the issue. I can't tell if we missed something in the jump to 6.x that you might be able to answer easily. 😄

We've tracked down the exact commit that broke our tests.

I dove a bit deeper into Tornado and found a single change that can toggle the break. Here's a summary of the change:

  • Before this commit, _maybe_add_error_listener ensured that all futures+callbacks were done+closed if the stream was closed by calling the_maybe_run_close_callback() method
  • _maybe_run_close_callback() was renamed to _signal_closed()
  • In this commit, the _maybe_add_error_listener no longer calls signal_closed/_maybe_run_close_callback()

Reverting to previous behavior fixes the issue. Here's a patch showing the diff. I add this _signal_closed() method back to _maybe_add_error_listener and things seem to work again.

I'm not sure if this is a bug in Tornado, or if Jupyter's notebook server is using this API wrong (possibly a missing await could be a likely culprit).

Any thoughts? Let me know if I can provide you more detail.

@bdarnell
Copy link
Member

Nice work tracking this down! That patch looks right to me. Want to submit it as a PR? (Ideally with a small test case)

I wonder why this is happening for you but slipped by past testing with jupyter and tornado 6.

I'm not sure if this is a bug in Tornado, or if Jupyter's notebook server is using this API wrong (possibly a missing await could be a likely culprit).

WebSocketHandler.get() was never intended to be overridden by subclasses (you're expected to use prepare() instead), but you appear to be doing the right thing by awaiting its result.

@Zsailer
Copy link
Contributor Author

Zsailer commented Apr 30, 2019

Thanks @bdarnell

Want to submit it as a PR? (Ideally with a small test case)

Sure! I'll open the PR now and write a test (though I may need help with this).

I wonder why this is happening for you but slipped by past testing with jupyter and tornado 6.

Yeah, this puzzled me too. I'm not sure why our tests in the notebook server pass. I'll try to investigate further and let you know if I learn anything.

@bdarnell
Copy link
Member

bdarnell commented May 5, 2019

Can you point me to the jupyter test that's running in to this? I'm still having trouble seeing how the bug in maybe_add_error_listener can lead to the "already reading" exception and want to make sure we've fully understood the problem.

@Zsailer
Copy link
Contributor Author

Zsailer commented May 5, 2019

Of course! Let me write up the steps to reproduce for you... give me a couple minutes.

What's weird is that we don't have any tests failing in the notebook server. The failing happens in client side tests (in JLab) that make requests to the notebook server. So there is some edge case that even our notebook server tests are not covering...

@Zsailer
Copy link
Contributor Author

Zsailer commented May 5, 2019

@bdarnell, here are steps to reproduce the issue.

  1. Grab and install tornado 6.x. I've been using master.
  2. Grab jupyterlab and unpin tornado. We've currently pinned master to <6.x. You can use my branch here. It's latest jlab and tornado is unpinned.
  3. Move into JupyterLab directory.
  4. Install Lupyterlab using pip install -e .
  5. Build Jupyterlab's client-side tests. This takes a few minutes, unfortunately.
    jlpm install
    jlpm run build:test
    
  6. Run specific test that's failing using the following command. This avoids running our whole test suite, which can take a few minutes.
    cd tests/test-apputils
    jlpm test --testNamePattern="restartKernel" 
    

You should see tornado throw the error above.

The test that is failing can be found here.

@Zsailer
Copy link
Contributor Author

Zsailer commented May 5, 2019

By the way, thank you tremendously for helping me here.

@bdarnell
Copy link
Member

bdarnell commented May 8, 2019

OK, that's very helpful. If I'm reading this right, it looks like the test is passing, but it's logging this "already reading" assertion as an uncaught error. (Fortunately not a production problem, but we shouldn't be logging stack traces like this when there's not a real problem)

With the help of tcpdump and some extra logging, I see that what's happening is that the client is closing the connection in the middle of the websocket handshake. Then it runs in to a weakness in our handling of closed connections: We mostly don't have explicit "if closed: return" code anywhere, we just keep going until an IOStream.read call raises a StreamClosedError.

At this point in the websocket handshake, that actually happens twice: the HTTP layer starts reading for its next batch of headers, and the websocket layer starts reading for websocket messages (the closed connection breaks the HTTP layer out of its wait, but doesn't break it all the way out of the loop). Then, because some checks are out of order, if you try to read on a closed connection, you'll set the "reading" flag before raising StreamClosedError. If you try to read a second time, you'll get the "already reading" error instead of the expected StreamClosedError.

I think fixing this should be as simple as checking the closed status of the stream earlier in IOStream read (and other) methods.

@bdarnell
Copy link
Member

bdarnell commented May 8, 2019

I think fixing this should be as simple as checking the closed status of the stream earlier in IOStream read (and other) methods.

I'm not sure what this means for the proposed fix in #2652. I can't decide whether it's a good change on its own or if it's a bad idea because it could lead to signal_closed being called multiple times in cases like this. Since we can't figure out how to trigger a problem that is specifically fixed by that change, I'm inclined to leave it out and only change the close checks in IOStream.

@Zsailer
Copy link
Contributor Author

Zsailer commented May 9, 2019

Ah, that all makes sense. Thank you for working through those tests and gathering this information.

Since we can't figure out how to trigger a problem that is specifically fixed by that change, I'm inclined to leave it out and only change the close checks in IOStream.

I totally agree. Would you like me to work on a new PR that tries to place some "checks for closed" status in helpful places?

@bdarnell
Copy link
Member

I think we may just need one additional close check, in IOStream._start_read (this fixes the jupyterlab test). But we also need to confirm this with a test in tornado. I'm playing with variants of this but haven't been able to trigger the problem yet.

    @gen_test
    async def test_incomplete_handshake(self):
        with contextlib.closing(IOStream(socket.socket())) as stream:
            await stream.connect(("127.0.0.1", self.get_http_port()))
            await stream.write(
                textwrap.dedent(
                    """
                    GET /echo HTTP/1.1
                    Connection: Upgrade
                    Upgrade: WebSocket
                    Host: localhost
                    Sec-WebSocket-Version: 13
                    Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==

                    """
                )
                .replace("\n", "\r\n")
                .encode("utf-8")
            )
            # resp = await stream.read_until(b"\r\n\r\n")
            # print(resp)
            stream.close()
            await asyncio.sleep(1.0)
            # stream.close()
            # await asyncio.sleep(1.0)

@Zsailer
Copy link
Contributor Author

Zsailer commented May 16, 2019

@bdarnell sorry for the delayed reply here. I've been traveling all week and haven't had a chance to work on this. I'm hoping to return to this issue over the weekend. Thank you for pushing forward on it!

@ploxiln
Copy link
Contributor

ploxiln commented May 23, 2019

If I may throw in another backtrace here which may be related:

  File "/usr/local/lib/python3.6/dist-packages/tornado/http1connection.py", line 735, in _server_request_loop
    ret = yield conn.read_response(request_delegate)
  File "/usr/local/lib/python3.6/dist-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/usr/local/lib/python3.6/dist-packages/tornado/gen.py", line 326, in wrapper
    yielded = next(result)
  File "/usr/local/lib/python3.6/dist-packages/tornado/http1connection.py", line 160, in _read_message
    max_bytes=self.params.max_header_size)
  File "/usr/local/lib/python3.6/dist-packages/tornado/iostream.py", line 353, in read_until_regex
    future = self._set_read_callback(callback)
  File "/usr/local/lib/python3.6/dist-packages/tornado/iostream.py", line 860, in _set_read_callback
    assert self._read_future is None, "Already reading"
AssertionError: Already reading

This is from tornado-5.1.1 on python-3.6. We have a couple thousand long-lived websocket connections to each websocket-server process (and they also serve a few plain http requests per minute). This seems to be correlated with (comes a few milliseconds after) some cases where a websocket seems to already be closed when our (plain sychronous) open() method calls self.write_message("ACK"). With about 5k clients spread across 3 server processes, we see the close-during-open about 40 times a day, and this backtrace just a couple times a week.

I'm pretty sure this never happened with tornado-4.5.3. It's no big problem currently ... it's probably the fault of some distant client in some very strange network conditions. It's hard for me to tell whether something like #2652 would fix this occasional exception for me, but if it seems like it should, I would be interested in trying to backport it to 5.1.

@CzBiX
Copy link

CzBiX commented May 30, 2019

I meet the same problem. AsyncHTTPClient will raise error when get access token from Github OAuth.
tornado.simple_httpclient.HTTPStreamClosedError: Stream closed

But I didn't use the websocket handler. My codes looks like:

class LoginHandler(RequestHandler):
  async def get(self):
    client = AsyncHTTPClient()
    resp = await client.fetch('github oauth access token url', method='POST', headers={'Accept': 'application/json'}, body=urlencode({'data': 'data'}))

Downgrade from v6.x to v5.1.1 fix it.

@Zsailer
Copy link
Contributor Author

Zsailer commented May 31, 2019

@CzBiX are you getting HTTPStreamClosedError as an error? If so, I think that's probably a separate issue (that you might want to open). Here, we're finding scenarios where StreamClosedErrors are not being raised when they should be...

@Zsailer
Copy link
Contributor Author

Zsailer commented May 31, 2019

@bdarnell I've been working on a test for this issue, but I can't recreate the behavior that we're seeing in JupyterLab (likely because I don't understand the HTTP and websocket layers well enough).

I started with your design above and have the following steps:

  1. Connect to a socket
  2. Send the handshake request
  3. Close the connection before the handshake completes (by not awaiting the websocket request).
  4. Read the websocket message after closing.

I can verify the order of these steps with tcpdump.

I get the expected StreamClosedError. I think the difference is that JupyterLab is closing at the HTTP layer while this test is closing the websocket layer directly? I'm I thinking about this correctly? If so, how do we mimic incorporate an HTTP layer that mimic's JupyterLab's behavior using tornado's test utils?

@bdarnell
Copy link
Member

bdarnell commented Jun 1, 2019

It shouldn't matter what "layer" the close is happening at, but the precise timing matters. JupyterLab's test is running in a separate process, which can result in timing patterns that are different from those we see inside tornado. It's sometimes difficult to reproduce the right patterns in a Tornado test and often requires fragile non-idiomatic code (like a test that has exactly two empty awaits in the right spot).

Since neither of us have been able to reproduce the right patterns, I think it's time to give up on the idea of a test for this (especially since we have an indirect test via jupyterlab) and just add the check for closed status in _start_read.

@Zsailer
Copy link
Contributor Author

Zsailer commented Jun 3, 2019

Okay, thanks @bdarnell for all your help on this issue. I've opened #2670 with the proposed fix.

@datamove
Copy link

datamove commented Jul 27, 2019

@Zsailer @bdarnell
With
jupyterhub=1.0.0
tornado=6.0.3
I experience the same #2651 (comment)
and downgrade to tornado also 5.1.1 eliminates the error

[I 2019-07-27 13:15:36.767 JupyterHub log:174] 302 GET / -> /hub/ (@x.x.x.x) 0.92ms
[I 2019-07-27 13:15:36.930 JupyterHub log:174] 302 GET /hub/ -> /hub/login (@x.x.x.x) 0.75ms
[I 2019-07-27 13:15:37.537 JupyterHub log:174] 200 GET /hub/login (@x.x.x.x) 29.71ms
[I 2019-07-27 13:15:40.556 JupyterHub oauth2:100] OAuth redirect: 'https://hub.example.com:8000/hub/oauth_callback'
[I 2019-07-27 13:15:40.558 JupyterHub log:174] 302 GET /hub/oauth_login?next= -> https://github.com/login/oauth/authorize?respo...
[E 2019-07-27 13:15:41.276 JupyterHub web:1788] Uncaught exception GET /hub/oauth_callback?code=ea823585ae24250d5c58&state=eyJz...
     HTTPServerRequest(protocol='https', host='hub.example.com:8000', method='GET', uri='/hub/oauth_callback?code=ea823585...
     Traceback (most recent call last):
       File "/home/ubuntu/jupyterhub-env/lib/python3.6/site-packages/tornado/web.py", line 1699, in _execute
         result = await result
       File "/home/ubuntu/jupyterhub-env/lib/python3.6/site-packages/oauthenticator/oauth2.py", line 209, in get
         user = yield self.login_user()
         File "/home/ubuntu/jupyterhub-env/lib/python3.6/site-packages/jupyterhub/handlers/base.py", line 655, in login_user
          authenticated = await self.authenticate(data)
       File "/home/ubuntu/jupyterhub-env/lib/python3.6/site-packages/jupyterhub/auth.py", line 383, in get_authenticated_user
         authenticated = await maybe_future(self.authenticate(handler, data))
       File "/home/ubuntu/jupyterhub-env/lib/python3.6/site-packages/oauthenticator/github.py", line 116, in authenticate
         resp = yield http_client.fetch(req)
     tornado.simple_httpclient.HTTPStreamClosedError: Stream closed

Should I create an issue against jupyterhub? Or against tornado?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants