sql: sending the wrong drain signal #22630

vivekmenezes · 2018-02-13T11:10:32Z

I believe @asubiotto you're looking at some draining issues where the integration with some balancer isn't working right; I've stumbled upon the signals we're sending clients on draining, and I think they're wrong. Maybe this explains why the balancer isn't playing nice.
TL;DR I think we're sending the wrong message type to notify clients of the draining.

The protocol docs says this:

It is possible for NoticeResponse messages to be generated due to outside activity; for example, if the database administrator commands a “fast” database shutdown, the backend will send a NoticeResponse indicating this fact before closing the connection. Accordingly, frontends should always be prepared to accept and display NoticeResponse messages, even when the connection is nominally idle.

However, our code is this:

cockroach/pkg/sql/pgwire/server.go

Line 429 in 679c565

    
           if pgErr, ok := pgerror.GetPGCause(err); ok && pgErr.Code == pgerror.CodeAdminShutdownError {

We're not sending a "NoticeResponse"; we're sending a plain error packet. I believe they're different.
Besides potentially not talking the language that balancers and drivers understand, I think this might also break clients - in my testing of other stuff I've noticed that lib/pq, for one, freaks out when it receives an unexpected error (i.e. when it gets an error but it's not in the context of running the specific types of commands that permit errors). On the other hand, clients apparently need to accept NoticeResponses at all times.

Does this sound right?

asubiotto · 2018-02-13T14:50:21Z

I think that's fine. The postgres docs also say that:

In rare cases (such as an administrator-commanded database shutdown) the backend might 
disconnect without any frontend request to do so. In such cases the backend will attempt to send 
an error or notice message giving the reason for the disconnection before it closes the 
connection.

So it seems that both are accepted. The decision to send back a CodeAdminShutdown was based on pgpool's failover behavior.

I think it might be good to support sending back NoticeResponse messages, but the error behavior seems correct to me.

Re the balancer issues, the issue isn't our implementation of the postgres protocol, but a mix of performing a rolling upgrade too quickly and not transferring raft leadership. There definitely might be issues down the line, but I don't think this is one of them.

andreimatei · 2018-02-13T15:01:21Z

None of these two sources seem to suggest to me that sending an error to an unsuspecting client is fine.
I'd read the Postgres one as saying that you can return an error in the context of a query and a notice otherwise.
I'd read the pgpool one as saying that just closing the connection is fine when fail_over_on_backend_error is set, and a different mechanism (not specified which, but I guess generally a notice) is required when fail_over_on_backend_error is not set.

asubiotto · 2018-02-13T15:19:10Z

In the Asynchronous Operations section:

There are several cases in which the backend will send messages that are not specifically 
prompted by the frontend's command stream. Frontends must be prepared to deal with these 
messages at any time, even when not engaged in a query.

This CodeAdminShutdownError is one of those messages.

I don't want to talk too much about pgpool, because I don't think that it is a load balancer that we should be focusing on (I'm skeptical about whether it is actually used in production instead of e.g. AWS' ELB or haproxy) but when fail_over_on_backend_error is not set the administrative shutdown is still handled:

Note: It is recommended to turn on the backend health checking (see Section 5.8) when
 fail_over_on_backend_error is set to off. Note, however, that Pgpool-II still triggers the failover
 when it detects the administrative shutdown of PostgreSQL backend server.

andreimatei · 2018-02-17T15:01:23Z

What we're talking about is not whether the CodeAdminShutdownError is good or not - it is.
What we're talking about is whether we should be sending it in an ErrorResponse envelope or a NoticeResponse envelope. That Asynchronous Operations section refers only to NoticeResponse.

andreimatei · 2018-02-18T16:38:08Z

I've traced a Postgres 10 server and it would appear that it always sends ErrorResponse messages when the operator does pg_ctl stop -m fast. The server sends an ErrorResponse both when it's in the middle of processing a query and on idle connections. So this was all noise, sorry.

What we do when we're in the middle of a query is a bit different - we send a "context canceled" error with the internal error code and then we send the AdminShutdown error.

It's interesting that Postgres doesn't send squat when the operator does pg_ctl stop -m smart (with smart being the default).

vivekmenezes assigned asubiotto Feb 13, 2018

vivekmenezes added this to the 2.0 milestone Feb 13, 2018

vivekmenezes mentioned this issue Feb 13, 2018

QPS Drops to Zero during rolling upgrade #22424

Closed

andreimatei mentioned this issue Feb 17, 2018

sql: introduce connExecutor, the query execution orchestrator #22277

Merged

andreimatei closed this as completed Feb 18, 2018

andreimatei mentioned this issue Mar 14, 2019

pgwire: pq: unexpected message 'E'; expected ReadyForQuery errors when killing processes on kubernetes #24149

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sql: sending the wrong drain signal #22630

sql: sending the wrong drain signal #22630

vivekmenezes commented Feb 13, 2018

asubiotto commented Feb 13, 2018

andreimatei commented Feb 13, 2018

asubiotto commented Feb 13, 2018 •

edited

Loading

andreimatei commented Feb 17, 2018

andreimatei commented Feb 18, 2018

sql: sending the wrong drain signal #22630

sql: sending the wrong drain signal #22630

Comments

vivekmenezes commented Feb 13, 2018

asubiotto commented Feb 13, 2018

andreimatei commented Feb 13, 2018

asubiotto commented Feb 13, 2018 • edited Loading

andreimatei commented Feb 17, 2018

andreimatei commented Feb 18, 2018

asubiotto commented Feb 13, 2018 •

edited

Loading