Conn::new sometimes fails to make progress under load #94
@blackbeam I think this is unrelated to #65, but they may be the same underlying bug; I'm not sure. In either case, the new pool implementation in #92 seems to be working correctly.
Two additional observations:
Here's another: for runs of the test that succeed, after

@blackbeam I may need your knowledge of the protocol here to figure out what might be causing this hang. Can you replicate the hang on a computer you control?
Another interesting observation: I seem to only be able to replicate this when the total number of connections is close to the MySQL maximum number of connections. For example, I cannot replicate the hang with 200 in place of 500 in the test above, but if I replace the MySQL

As a side note, I do not think this is a MySQL bug, as I have also seen this when working against
I can also replicate with 200 and

This points me at two possible issues, both of which may be the case:
As an aside, replacing

Am I right in reading the code that
Ah, here's what I think is happening: when a connection is upgraded from (line 441 in 8021666)

We probably need to await the disconnection of the old (line 61 in 8021666). And then the connection will stick around until that spawned task is eventually run (which may take a while when the machine is under heavy load).
Filed #97 to fix ^. I'm not sure that fully solves the problem, though, as I was also seeing this with a server that did not support socket connections.
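To make the spawn-vs-await point concrete, here is a minimal, self-contained sketch (the `OldConn` type and `upgrade` function are hypothetical stand-ins, not the actual mysql_async internals or the contents of #97): a cleanup handed to `tokio::spawn` only runs whenever the scheduler gets around to it, while awaiting it inline guarantees the old TCP connection, and its server-side connection slot, is gone before the upgraded connection proceeds.

```rust
use std::time::Duration;

// Stand-in for the plain TCP connection that gets replaced by the socket
// connection after the upgrade. Purely illustrative.
struct OldConn;

impl OldConn {
    async fn disconnect(self) {
        // Pretend this sends the quit command and waits for the stream to close.
        tokio::time::sleep(Duration::from_millis(10)).await;
    }
}

async fn upgrade(old: OldConn) {
    // Fire-and-forget: the old connection is only torn down when the spawned
    // task eventually runs, which can be much later on a loaded machine.
    // tokio::spawn(old.disconnect());

    // Awaiting instead ensures the server-side connection slot is released
    // before we carry on with the upgraded socket connection.
    old.disconnect().await;
}

#[tokio::main]
async fn main() {
    upgrade(OldConn).await;
}
```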
Here's an even more self-contained test that demonstrates the problem:

```rust
#[tokio::test(basic_scheduler)]
async fn too_many() {
    use futures_util::stream::StreamExt;
    let n = 500;
    let futs = futures_util::stream::futures_unordered::FuturesUnordered::new();
    for _ in 0..n {
        futs.push(tokio::spawn(async move {
            let mut opts = get_opts();
            opts.prefer_socket(Some(false));
            eprintln!("GETCONN");
            let c = Conn::new(opts).await;
            eprintln!("GOTCONN");
            c.unwrap().close().await.unwrap();
        }));
    }
    // see that all the conns eventually complete
    assert_eq!(futs.fold(0, |a, _| async move { a + 1 }).await, n);
    eprintln!("DONE");
}
```

No additional load is needed. Just run it a few times (with

@blackbeam can you reproduce the behavior with the test above? Any idea why

EDIT: In fact, you don't even need the
If it's useful, I run the test repeatedly with

```console
$ while cargo t --release --lib conn::test::too_many -- --nocapture > run.log 2>&1; do date; done;
```
I also just managed to reproduce by setting

I also noticed that there are established TCP connections to MySQL just sort of hanging around when the hang happens. Not sure what that's about.
Hi. I was able to reproduce the issue using the test above. I've also tried the synchronous mysql driver, and it hangs after 5 minutes in my environment (arch+mariadb14 with max_connections=1). Direct UNIX socket connections (using

Looks like this issue goes away for both the synchronous and asynchronous drivers if keepalive is enabled using

Stack traces:

- stack trace for `mysql_async`
- stack trace for `mysql`
TBH I don't know why this is happening. I'll try to hang the C++ connector just to make sure that it's not on the MySQL side.

This seems strange, because I wasn't able to reproduce the issue with

Any thoughts?
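The exact keepalive setting referred to above is cut off in the thread. As a general illustration only (using the `socket2` crate, not necessarily the mechanism the drivers expose), TCP keepalive can be enabled on a socket before connecting, so a dead or stalled peer is eventually detected instead of a read hanging forever:

```rust
use std::time::Duration;
use socket2::{Domain, Protocol, Socket, TcpKeepalive, Type};

fn main() -> std::io::Result<()> {
    // Build a plain TCP socket and turn on keepalive probes before connecting.
    let socket = Socket::new(Domain::IPV4, Type::STREAM, Some(Protocol::TCP))?;
    let keepalive = TcpKeepalive::new().with_time(Duration::from_secs(60));
    socket.set_tcp_keepalive(&keepalive)?;

    // Connect to a local MySQL server (the address is just an example).
    let addr = std::net::SocketAddr::from(([127, 0, 0, 1], 3306));
    socket.connect(&addr.into())?;

    // From here the socket can be converted into a std::net::TcpStream
    // and handed to whatever client code needs it.
    let _stream: std::net::TcpStream = socket.into();
    Ok(())
}
```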
That is certainly weird and interesting. It would indeed be good to see whether it also occurs with a different connector (like the C++ one). It could be that the issue for
I've managed to hang the C connector on macOS + MySQL 8.0.16.

backtrace

source code:

```c
#include <mysql.h>
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

void *connect(void *vargp)
{
    MYSQL *conn;
    conn = mysql_init(NULL);
    if (conn == NULL) {
        printf("Error 1 %u %s\n", mysql_errno(conn), mysql_error(conn));
        return NULL;
    }
    if (mysql_real_connect(conn, "127.0.0.1", "root", "password", NULL, 3307, NULL, 0) == NULL) {
        printf("Error 2 %u: %s\n", mysql_errno(conn), mysql_error(conn));
        mysql_close(conn);
        return NULL;
    }
    if (mysql_query(conn, "DO 1")) {
        printf("Error 3 %u: %s\n", mysql_errno(conn), mysql_error(conn));
    }
    mysql_close(conn);
    return NULL;
}

int main() {
    const int n = 500;
    if (mysql_library_init(0, NULL, NULL)) {
        fprintf(stderr, "could not initialize MySQL client library\n");
        exit(1);
    }
    pthread_t handles[n];
    for (int i = 0; i < n; i++) {
        pthread_create(&handles[i], NULL, connect, NULL);
    }
    for (int i = 0; i < n; i++) {
        pthread_join(handles[i], NULL);
    }
    mysql_library_end();
    return 0;
}
```
Oh wow, so I suppose this is a MySQL bug then! Might be worth filing upstream.

What do you mean by system-wide keep-alive?

@toothbrush7777777, I meant the default value of 7200 for

Closing this for now. Feel free to reopen.
Try running the following test:

Against a local MySQL with `max_connections = 1000`, it passes just fine. Now, run some CPU-heavy program that takes up most of the cores on the machine, and run the test again. It will frequently hang between printing `MID` and printing `DONE`. Digging a bit deeper, you will see that `GETCONN` gets printed 500 times, but `GOTCONN` only `N < 500` times (the exact value of `N` varies).

Some further digging reveals that the issue lies in `handle_handshake`. Specifically, if you put a print statement above and below the call to `read_packet().await`, you'll see that the print before gets printed `2*500 - N` times, and the print after `2*500 - 2*N` times. The explanation for this is that every connection goes through `Conn::new` twice: once as a "plain" connection, and once as an "upgraded" socket connection. But some (`500 - N`) plain connections get "stuck" on that `.await`. They therefore do not get upgraded, so they fail to hit the second print twice, and the first print once (as a socket).

In any case, the issue for the stuck connections is that `read_packet().await` never returns for plain connections. Reading through the code, it appears that there are only three ways this can happen: `tokio::net::TcpStream` is broken (unlikely), `tokio_util::codec::Framed` is broken (possible), or `mysql_common::proto::codec::PacketCodec` either blocks (unlikely) or returns `None` when it should return `Some` (a likely culprit). How any of these are related to load, I have no idea. Why it is necessary to first spin up and drop 500 connections, I also have no idea. But that's what I'm seeing. Which sort of points at a tokio bug, but that seems weird too?

Well, I figured I should write up my findings no matter what, so that hopefully we can get to the bottom of it. I ran into this when running code on EC2, which is infamous for having a lot of scheduling noise similar to that of a loaded machine, so it's probably worth trying to fix.
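A sketch of the instrumentation described above, with hypothetical stand-ins (`FakeConn`, the stubbed `read_packet`) rather than the real mysql_async internals; the two counters make the "before" and "after" counts in the analysis easy to collect on a real run:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::time::Duration;

// Counters for the two print sites discussed above.
static BEFORE: AtomicUsize = AtomicUsize::new(0);
static AFTER: AtomicUsize = AtomicUsize::new(0);

// Stub connection type so the sketch compiles on its own; in the real driver
// this would be the connection performing the handshake.
struct FakeConn;

impl FakeConn {
    async fn read_packet(&mut self) -> Vec<u8> {
        // Stand-in for reading one protocol packet off the stream.
        tokio::time::sleep(Duration::from_millis(1)).await;
        vec![0u8]
    }
}

// Hypothetical stand-in for the handshake step: each Conn::new pass (plain,
// then upgraded) runs through this once, so comparing the two counters shows
// how many passes entered the packet read but never returned from the await.
async fn handle_handshake(conn: &mut FakeConn) -> Vec<u8> {
    BEFORE.fetch_add(1, Ordering::SeqCst);
    eprintln!("before read_packet");
    let packet = conn.read_packet().await; // stuck plain connections hang here
    AFTER.fetch_add(1, Ordering::SeqCst);
    eprintln!("after read_packet");
    packet
}

#[tokio::main]
async fn main() {
    let mut conn = FakeConn;
    handle_handshake(&mut conn).await;
    eprintln!(
        "before: {}, after: {}",
        BEFORE.load(Ordering::SeqCst),
        AFTER.load(Ordering::SeqCst)
    );
}
```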