Conn::new sometimes fails to make progress under load #94
@blackbeam I think this is unrelated to #65, but they may be the same underlying bug; I'm not sure. In either case, the new pool implementation in #92 seems to be working correctly.
Two additional observations:
Here's another: for runs of the test that succeed, after

@blackbeam I may need your knowledge of the protocol here to figure out what might be causing this hang. Can you replicate the hang on a computer you control?
Another interesting observation: I seem to only be able to replicate this when the total number of connections is close to the MySQL maximum number of connections. For example, I cannot replicate the hang with 200 in place of 500 in the test above, but if I replace the MySQL

As a side note, I do not think this is a MySQL bug, as I have also seen this when working against
I can also replicate with 200 and

This points me at two possible issues, both of which may be the case:
As an aside, replacing

Am I right in reading the code that
Ah, here's what I think is happening: when a connection is upgraded from (line 441 in 8021666)

We probably need to await the disconnection of the old (line 61 in 8021666). And then the connection will stick around until that spawned task is eventually run (which may take a while when the machine is under heavy load).
Filed #97 to fix ^. I'm not sure that fully solves the problem, though, as I was also seeing this with a server that did not support socket connections.
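To make the spawn-vs-await point concrete, here is a minimal, self-contained sketch (the `OldConn` type and `upgrade` function are hypothetical stand-ins, not the actual mysql_async internals or the contents of #97): a cleanup handed to `tokio::spawn` only runs whenever the scheduler gets around to it, while awaiting it inline guarantees the old TCP connection, and its server-side connection slot, is gone before the upgraded connection proceeds.

```rust
use std::time::Duration;

// Stand-in for the plain TCP connection that gets replaced by the socket
// connection after the upgrade. Purely illustrative.
struct OldConn;

impl OldConn {
    async fn disconnect(self) {
        // Pretend this sends the quit command and waits for the stream to close.
        tokio::time::sleep(Duration::from_millis(10)).await;
    }
}

async fn upgrade(old: OldConn) {
    // Fire-and-forget: the old connection is only torn down when the spawned
    // task eventually runs, which can be much later on a loaded machine.
    // tokio::spawn(old.disconnect());

    // Awaiting instead ensures the server-side connection slot is released
    // before we carry on with the upgraded socket connection.
    old.disconnect().await;
}

#[tokio::main]
async fn main() {
    upgrade(OldConn).await;
}
```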
Here's an even more self-contained test that demonstrates the problem:

```rust
#[tokio::test(basic_scheduler)]
async fn too_many() {
    use futures_util::stream::StreamExt;
    let n = 500;
    let futs = futures_util::stream::futures_unordered::FuturesUnordered::new();
    for _ in 0..n {
        futs.push(tokio::spawn(async move {
            let mut opts = get_opts();
            opts.prefer_socket(Some(false));
            eprintln!("GETCONN");
            let c = Conn::new(opts).await;
            eprintln!("GOTCONN");
            c.unwrap().close().await.unwrap();
        }));
    }
    // see that all the conns eventually complete
    assert_eq!(futs.fold(0, |a, _| async move { a + 1 }).await, n);
    eprintln!("DONE");
}
```

No additional load is needed. Just run it a few times (with

@blackbeam can you reproduce the behavior with the test above? Any idea why

EDIT: In fact, you don't even need the
If it's useful, I run the test repeatedly with

```console
$ while cargo t --release --lib conn::test::too_many -- --nocapture > run.log 2>&1; do date; done;
```
I also just managed to reproduce by setting

I also noticed that there are established TCP connections to MySQL just sort of hanging around when the hang happens. Not sure what that's about.
Hi. I was able to reproduce the issue using the test above. I've also tried the synchronous mysql driver, and it hangs after 5 minutes in my environment (arch+mariadb14 with max_connections=1). Direct UNIX socket connections (using

Looks like this issue goes away for both the synchronous and asynchronous drivers if keepalive is enabled using

Stack traces:

- stack trace for `mysql_async`
- stack trace for `mysql`
TBH I don't know why this is happening. I'll try to hang the C++ connector just to make sure that it's not on the MySQL side.

This seems strange, because I wasn't able to reproduce the issue with

Any thoughts?
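The exact keepalive setting referred to above is cut off in the thread. As a general illustration only (using the `socket2` crate, not necessarily the mechanism the drivers expose), TCP keepalive can be enabled on a socket before connecting, so a dead or stalled peer is eventually detected instead of a read hanging forever:

```rust
use std::time::Duration;
use socket2::{Domain, Protocol, Socket, TcpKeepalive, Type};

fn main() -> std::io::Result<()> {
    // Build a plain TCP socket and turn on keepalive probes before connecting.
    let socket = Socket::new(Domain::IPV4, Type::STREAM, Some(Protocol::TCP))?;
    let keepalive = TcpKeepalive::new().with_time(Duration::from_secs(60));
    socket.set_tcp_keepalive(&keepalive)?;

    // Connect to a local MySQL server (the address is just an example).
    let addr = std::net::SocketAddr::from(([127, 0, 0, 1], 3306));
    socket.connect(&addr.into())?;

    // From here the socket can be converted into a std::net::TcpStream
    // and handed to whatever client code needs it.
    let _stream: std::net::TcpStream = socket.into();
    Ok(())
}
```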
That is certainly weird and interesting. It would indeed be good to see whether it also occurs with a different connector (like the C++ one). It could be that the issue for
I've managed to hang the C connector on macOS + MySQL 8.0.16.

backtrace

source code:

```c
#include <mysql.h>
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

void *connect(void *vargp)
{
    MYSQL *conn;
    conn = mysql_init(NULL);
    if (conn == NULL) {
        printf("Error 1 %u %s\n", mysql_errno(conn), mysql_error(conn));
        return NULL;
    }
    if (mysql_real_connect(conn, "127.0.0.1", "root", "password", NULL, 3307, NULL, 0) == NULL) {
        printf("Error 2 %u: %s\n", mysql_errno(conn), mysql_error(conn));
        mysql_close(conn);
        return NULL;
    }
    if (mysql_query(conn, "DO 1")) {
        printf("Error 3 %u: %s\n", mysql_errno(conn), mysql_error(conn));
    }
    mysql_close(conn);
    return NULL;
}

int main() {
    const int n = 500;
    if (mysql_library_init(0, NULL, NULL)) {
        fprintf(stderr, "could not initialize MySQL client library\n");
        exit(1);
    }
    pthread_t handles[n];
    for (int i = 0; i < n; i++) {
        pthread_create(&handles[i], NULL, connect, NULL);
    }
    for (int i = 0; i < n; i++) {
        pthread_join(handles[i], NULL);
    }
    mysql_library_end();
    return 0;
}
```
Oh wow, so I suppose this is a MySQL bug then! Might be worth filing upstream.

What do you mean by system-wide keep-alive?

@toothbrush7777777, I meant the default value of 7200 for

Closing this for now. Feel free to reopen.
Try running the following test:

Against a local MySQL with `max_connections = 1000`, it passes just fine. Now, run some CPU-heavy program that takes up most of the cores on the machine, and run the test again. It will frequently hang between printing `MID` and printing `DONE`. Digging a bit deeper, you will see that `GETCONN` gets printed 500 times, but `GOTCONN` only `N < 500` times (the exact value of `N` varies).

Some further digging reveals that the issue lies in `handle_handshake`. Specifically, if you put a print statement above and below the call to `read_packet().await`, you'll see that the print before gets printed `2*500 - N` times, and the print after `2*500 - 2*N` times. The explanation for this is that every connection goes through `Conn::new` twice: once as a "plain" connection, and once as an "upgraded" socket connection. But some (`500 - N`) plain connections get "stuck" on that `.await`. They therefore do not get upgraded, so they fail to hit the second print twice, and the first print once (as a socket).

In any case, the issue for the stuck connections is that `read_packet().await` never returns for plain connections. Reading through the code, it appears that there are only three ways this can happen: `tokio::net::TcpStream` is broken (unlikely), `tokio_util::codec::Framed` is broken (possible), or `mysql_common::proto::codec::PacketCodec` either blocks (unlikely) or returns `None` when it should return `Some` (a likely culprit). How any of these are related to load, I have no idea. Why it is necessary to first spin up and drop 500 connections, I also have no idea. But that's what I'm seeing. Which sort of points at a tokio bug, but that seems weird too?

Well, I figured I should write up my findings no matter what, so that hopefully we can get to the bottom of it. I ran into this when running code on EC2, which is infamous for having a lot of scheduling noise similar to that of a loaded machine, so it's probably worth trying to fix.
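A sketch of the instrumentation described above, with hypothetical stand-ins (`FakeConn`, the stubbed `read_packet`) rather than the real mysql_async internals; the two counters make the "before" and "after" counts in the analysis easy to collect on a real run:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::time::Duration;

// Counters for the two print sites discussed above.
static BEFORE: AtomicUsize = AtomicUsize::new(0);
static AFTER: AtomicUsize = AtomicUsize::new(0);

// Stub connection type so the sketch compiles on its own; in the real driver
// this would be the connection performing the handshake.
struct FakeConn;

impl FakeConn {
    async fn read_packet(&mut self) -> Vec<u8> {
        // Stand-in for reading one protocol packet off the stream.
        tokio::time::sleep(Duration::from_millis(1)).await;
        vec![0u8]
    }
}

// Hypothetical stand-in for the handshake step: each Conn::new pass (plain,
// then upgraded) runs through this once, so comparing the two counters shows
// how many passes entered the packet read but never returned from the await.
async fn handle_handshake(conn: &mut FakeConn) -> Vec<u8> {
    BEFORE.fetch_add(1, Ordering::SeqCst);
    eprintln!("before read_packet");
    let packet = conn.read_packet().await; // stuck plain connections hang here
    AFTER.fetch_add(1, Ordering::SeqCst);
    eprintln!("after read_packet");
    packet
}

#[tokio::main]
async fn main() {
    let mut conn = FakeConn;
    handle_handshake(&mut conn).await;
    eprintln!(
        "before: {}, after: {}",
        BEFORE.load(Ordering::SeqCst),
        AFTER.load(Ordering::SeqCst)
    );
}
```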