Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: sending the wrong drain signal #22630

Closed
vivekmenezes opened this issue Feb 13, 2018 · 5 comments
Closed

sql: sending the wrong drain signal #22630

vivekmenezes opened this issue Feb 13, 2018 · 5 comments
Assignees
Milestone

Comments

@vivekmenezes
Copy link
Contributor

@andreimatei writes:

I believe @asubiotto you're looking at some draining issues where the integration with some balancer isn't working right; I've stumbled upon the signals we're sending clients on draining, and I think they're wrong. Maybe this explains why the balancer isn't playing nice.
TL;DR I think we're sending the wrong message type to notify clients of the draining.

The protocol docs says this:

It is possible for NoticeResponse messages to be generated due to outside activity; for example, if the database administrator commands a “fast” database shutdown, the backend will send a NoticeResponse indicating this fact before closing the connection. Accordingly, frontends should always be prepared to accept and display NoticeResponse messages, even when the connection is nominally idle.

However, our code is this:

if pgErr, ok := pgerror.GetPGCause(err); ok && pgErr.Code == pgerror.CodeAdminShutdownError {

We're not sending a "NoticeResponse"; we're sending a plain error packet. I believe they're different.
Besides potentially not talking the language that balancers and drivers understand, I think this might also break clients - in my testing of other stuff I've noticed that lib/pq, for one, freaks out when it receives an unexpected error (i.e. when it gets an error but it's not in the context of running the specific types of commands that permit errors). On the other hand, clients apparently need to accept NoticeResponses at all times.

Does this sound right?

@asubiotto
Copy link
Contributor

I think that's fine. The postgres docs also say that:

In rare cases (such as an administrator-commanded database shutdown) the backend might 
disconnect without any frontend request to do so. In such cases the backend will attempt to send 
an error or notice message giving the reason for the disconnection before it closes the 
connection.

So it seems that both are accepted. The decision to send back a CodeAdminShutdown was based on pgpool's failover behavior.

I think it might be good to support sending back NoticeResponse messages, but the error behavior seems correct to me.

Re the balancer issues, the issue isn't our implementation of the postgres protocol, but a mix of performing a rolling upgrade too quickly and not transferring raft leadership. There definitely might be issues down the line, but I don't think this is one of them.

@andreimatei
Copy link
Contributor

None of these two sources seem to suggest to me that sending an error to an unsuspecting client is fine.
I'd read the Postgres one as saying that you can return an error in the context of a query and a notice otherwise.
I'd read the pgpool one as saying that just closing the connection is fine when fail_over_on_backend_error is set, and a different mechanism (not specified which, but I guess generally a notice) is required when fail_over_on_backend_error is not set.

@asubiotto
Copy link
Contributor

asubiotto commented Feb 13, 2018

In the Asynchronous Operations section:

There are several cases in which the backend will send messages that are not specifically 
prompted by the frontend's command stream. Frontends must be prepared to deal with these 
messages at any time, even when not engaged in a query.

This CodeAdminShutdownError is one of those messages.

I don't want to talk too much about pgpool, because I don't think that it is a load balancer that we should be focusing on (I'm skeptical about whether it is actually used in production instead of e.g. AWS' ELB or haproxy) but when fail_over_on_backend_error is not set the administrative shutdown is still handled:

Note: It is recommended to turn on the backend health checking (see Section 5.8) when
 fail_over_on_backend_error is set to off. Note, however, that Pgpool-II still triggers the failover
 when it detects the administrative shutdown of PostgreSQL backend server.

@andreimatei
Copy link
Contributor

What we're talking about is not whether the CodeAdminShutdownError is good or not - it is.
What we're talking about is whether we should be sending it in an ErrorResponse envelope or a NoticeResponse envelope. That Asynchronous Operations section refers only to NoticeResponse.

@andreimatei
Copy link
Contributor

I've traced a Postgres 10 server and it would appear that it always sends ErrorResponse messages when the operator does pg_ctl stop -m fast. The server sends an ErrorResponse both when it's in the middle of processing a query and on idle connections. So this was all noise, sorry.

What we do when we're in the middle of a query is a bit different - we send a "context canceled" error with the internal error code and then we send the AdminShutdown error.

It's interesting that Postgres doesn't send squat when the operator does pg_ctl stop -m smart (with smart being the default).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants