db operation fails with "broken pipe" instead of reconnecting transparently after server restart #870
Comments
fho added a commit to fho/pq that referenced this issue on May 28, 2019
Since the commit "Don't return ErrBadConn on a network error", net.OpError no longer results in driver.ErrBadConn. This caused the sql package to not retry an operation in some situations where it should. E.g. when the postgresql server is restarted, a broken pipe error can happen for the query that is done after the server finished its startup (lib#870). With this commit, driver.ErrBadConn is returned for net errors when it is ensured that the server did not already execute the operation. This is the case when, for example, a net error occurs for the call that tries to send the message that initiates the query to the server.
fho added a commit to fho/pq that referenced this issue on May 28, 2019
In some situations the sql package does not retry a pq operation when it should. One of these situations is lib#870: when a postgresql server is restarted and, after the restart has finished, an operation is triggered on the same db handle, it fails with a broken pipe error in some circumstances. The sql package does not retry the operation and instead fails, because the pq driver does not return driver.ErrBadConn for network errors. The driver must not return ErrBadConn when the server might have already executed the operation; that would cause the sql package to retry it, and the operation would be run multiple times by the postgresql server. In some situations it is safe to return ErrBadConn on network errors. This is the case when it is ensured that the server did not receive the message that triggers the operation. This commit introduces a netErrorNoWrite error. This error should be used when network operations panic and it is safe to retry the operation. When errRecover() receives this error, it returns ErrBadConn and marks the connection as bad. A mustSendRetryable() function is introduced that wraps a net.OpError in a netErrorNoWrite when panicking. mustSendRetryable() is called in situations where the send that triggers the operation failed.
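For illustration, here is a rough Go sketch of the mechanism that commit message describes. The names `netErrorNoWrite`, `mustSendRetryable`, and `errRecover` come from the message itself, but the type definitions and method bodies below are assumptions made for the sketch, not the actual patch; lib/pq's real `conn` type and send path look different.

```go
package pq

import (
	"database/sql/driver"
	"net"
)

// netErrorNoWrite wraps a network error that occurred before the
// message initiating an operation reached the server, i.e. a case
// in which retrying is safe.
type netErrorNoWrite struct{ err error }

func (e *netErrorNoWrite) Error() string { return e.err.Error() }

// conn is a stripped-down stand-in for lib/pq's connection type.
type conn struct {
	bad bool
}

// send stands in for lib/pq's internal write path, which signals
// failures by panicking.
func (cn *conn) send(msg []byte) { /* elided */ }

// mustSendRetryable wraps send: if sending the message that initiates
// an operation panics with a net.Error, the server never received it,
// so the panic is rewrapped in a netErrorNoWrite to mark it retryable.
func (cn *conn) mustSendRetryable(msg []byte) {
	defer func() {
		if r := recover(); r != nil {
			if err, ok := r.(net.Error); ok {
				panic(&netErrorNoWrite{err: err})
			}
			panic(r)
		}
	}()
	cn.send(msg)
}

// errRecover, simplified: translate a netErrorNoWrite panic into
// driver.ErrBadConn and mark the connection as bad, so database/sql
// discards it and retries the operation on a fresh connection.
func (cn *conn) errRecover(err *error) {
	switch r := recover().(type) {
	case nil:
		// no panic: nothing to do
	case *netErrorNoWrite:
		cn.bad = true
		*err = driver.ErrBadConn
	default:
		panic(r)
	}
}
```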
Hi, I am also getting random broken pipe errors. Any update on this issue?
I'm getting this error on my localhost after leaving my webserver running overnight and then attempting to log in: as it hits the database for the first time that day, it gets a broken pipe error.
roylee17 added a commit to roylee17/sqlx that referenced this issue on Mar 21, 2021
I'm seeing "broken pipe" errors when working with CRDB using sqlx. The issue seemed to be that the TCP connections were disconnected while the db driver (pq) still held stale connections in its pool. It happens more often when the DB is behind a proxy; in our case, the pods were proxied by the envoy sidecar. There were other reports in the community of similar issues, which took different workarounds: sending periodic dummy queries in the app to mimic keepalive, lengthening the proxy idle timeout, or shortening the lifetime of db connections. This has been reported and fixed by lib/pq upstream in v1.9+: lib/pq#1013 lib/pq#723 lib/pq#897 lib/pq#870 grafana/grafana#29957
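As an aside, the workarounds mentioned above (periodic dummy queries as keepalive, shorter connection lifetimes) can be expressed with database/sql alone. A minimal sketch; the one-minute lifetime and 30-second ping interval are illustrative values to be tuned to the proxy's actual idle timeout, not recommendations:

```go
package main

import (
	"context"
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres", "postgresql://postgres@localhost?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}

	// Workaround 1: recycle pooled connections before the proxy's
	// idle timeout can silently kill them.
	db.SetConnMaxLifetime(time.Minute)

	// Workaround 2: periodic dummy round-trips that act as an
	// application-level keepalive.
	go func() {
		for range time.Tick(30 * time.Second) {
			ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
			if err := db.PingContext(ctx); err != nil {
				log.Println("keepalive ping:", err)
			}
			cancel()
		}
	}()

	// ... application logic ...
	select {} // block forever in this sketch
}
```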
rtfb added a commit to rtfb/rtfblog that referenced this issue on Oct 5, 2022
Presumably caused by this[1] problem, which was fixed a couple of years ago. [1] lib/pq#870
I'm using lib/pq v1.1.1, go 1.12.5, Linux 4.19.45-1-lts.
I have a db handle on which I run one or more operations, then restart the db server and wait until its startup has finished. The next query fails if all of the operating system's previous TCP connections to the database were closed in the meantime (`ss -na | grep 5432` shows nothing). The operation fails with:

`write tcp [::1]:45676->[::1]:5432: write: broken pipe`

If another query is issued after the failed one, it succeeds.
I expect the `db.Exec()` query to succeed after the postgresql restart has finished, with the sql package or pq driver retrying and reconnecting transparently in the background if needed. If the postgresql-server restart happens quickly, while there are still TCP connections in `FIN-WAIT-2` or another state, the db operations after the postgres restart succeed.

How to reproduce:
1. `docker run -p 5432:5432 postgres:latest`
2. start the Go test program (a minimal reconstruction is sketched below this list), passing `postgresql://postgres@localhost?sslmode=disable` as command-line argument
3. press `q` to run a sql query
4. stop the postgres server with `ctrl + c` in its terminal
5. run `watch -n 0.5 "sh -c 'ss -na | grep 5432'"` and wait until the TCP connections have vanished
6. `docker run -p 5432:5432 postgres:latest` again
7. press `q` in the terminal that runs the go program to trigger another query => it fails
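The report references a small Go test program that the page no longer shows. Below is a minimal reconstruction consistent with the steps above (connection string from `os.Args[1]`, one query per `q` keypress); it is an assumption of what that program looked like, not the author's original code.

```go
package main

import (
	"bufio"
	"database/sql"
	"fmt"
	"os"

	_ "github.com/lib/pq"
)

func main() {
	// Connection string is passed as the command-line argument,
	// e.g. postgresql://postgres@localhost?sslmode=disable
	db, err := sql.Open("postgres", os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, "open:", err)
		os.Exit(1)
	}

	fmt.Println("press q + enter to run a query")
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		if scanner.Text() != "q" {
			continue
		}
		// After the server restart, db.Exec picks the stale pooled
		// connection and the write fails with "broken pipe" instead
		// of being retried transparently on a new connection.
		if _, err := db.Exec("SELECT 1"); err != nil {
			fmt.Println("query failed:", err)
			continue
		}
		fmt.Println("query ok")
	}
}
```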
My first idea for a fix was to return `ErrBadConn` on broken pipe errors, but as discussed in #422 this has the issue that operations might be redone. The mysql driver seems to solve it by having a custom error type to indicate retryable connection errors.
If a `Write()` on the tcp socket fails before a whole SQL statement has been sent, it's safe to retry the operation. The caller of `conn.send()` could decide that and set the error to `ErrBadConn`.
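A hypothetical sketch of that idea: `execRetryable` and the stripped-down `conn` type below are invented names for illustration. The point is only that a failed `Write()` whose byte count shows the statement never fully reached the server can safely be mapped to `driver.ErrBadConn`.

```go
package pq

import (
	"database/sql/driver"
	"net"
)

// conn is a stripped-down stand-in for lib/pq's connection type.
type conn struct {
	c   net.Conn
	bad bool
}

// execRetryable writes the bytes of one SQL statement. If the write
// fails before the whole statement reached the server, the server
// cannot have executed it, so the connection is marked bad and
// driver.ErrBadConn is returned; database/sql then discards the
// connection and retries the operation transparently.
func (cn *conn) execRetryable(stmt []byte) error {
	n, err := cn.c.Write(stmt)
	if err != nil && n < len(stmt) {
		cn.bad = true
		return driver.ErrBadConn
	}
	return err
}
```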