HA Postgres Support #2887

fantix · 2021-09-02T17:13:50Z

fantix
Sep 2, 2021
Maintainer

To support Postgres clusters that come with high availability, EdgeDB should be albe to automatically switch to the new Postgres master when a failover happens. For EdgeDB, HA Postgres clusters fall into 2 categories:

HA Postgres with EdgeDB-supported failover notification: For example, Stolon-managed Postgres in Support Stolon+Consul (acting as a Stolon proxy) #2580 falls into this category. EdgeDB will listen for failover notifications through a given URL.
Custom Managed HA Postgres: If the HA Postgres doesn't provide failover notification or the notification is not supported by EdgeDB - e.g. a custom HA solution that modifies DNS value on failover, EdgeDB would detect the failover by certain conditions and reconnect without DNS cache.

Design of Notification-based HA

EdgeDB server will have different implementations to support some selected Postgres HA backend like Stolon, through a new CLI parameter --ha-cluster. Based on its URL value, the corresponding implementation will subscribe to the given HA backend, and only use the advertised Postgres DSN from that backend.

When a failover notification is received, EdgeDB would bump the internal serial counter and cut off all current Postgres connections. Any new Postgres connection from now on will use the new master address from the notification, tagged with the new serial number. Any lingering connection with a lower serial number will be discarded.

The notification and the actual failover may have some difference in time - it is usual that when the old master fails, the HA backend takes 30~60 seconds to confirm the fact, and promoting a replica to be the new master also takes time, while the notification may arrive before the new master is ready. This means it is necessary for EdgeDB to keep trying to reconnect to Postgres, and only reopens when the backend is fully ready. This topic is expanded in a later section, but here I shall describe the current behavior in #2580, which is sufficient for Notification-based HA.

Currently, EdgeDB depends on the activeness of the Postgres connection to the system database (sys_pgcon in short). Once the sys_pgcon is broken, EdgeDB will set a server-wide "unavailable" tag and return retryable errors to all interactions. In the meantime, sys_pgcon will keep trying to reconnect as far as EdgeDB is still running. However if it is a regular pgcon in the pool that gets disconnected unexpectedly, the current operation will be aborted with a retryable error and the pgcon itself will be simply discarded without reconnection. If there is no ongoing operation on the pgcon, it will just silently ignore the disconnect error - it's only removed from the pool on the next time it's acquired. Though if the connection error happens while a pgcon tries to establish the connection, pgcon will retry 3 times before returning a retryable error.

A common timeline would look like this:

+- Clients A and B both issue a query
|
|   * Pooled pgcon A and B are acquired to run each query
V
+- Master Postgres Fails
|
|   * Pooled pgcon A and C are broken
|   * pgcon A is discarded from the pool
|   * Client A receives a retryable error and retries
|   * pgcon A' is created for Client A but fails to connect to Postgres in 3 attempts
|   * Client A receives another retryable error and retries again
|   * sys_pgcon is broken, starts trying to reconnect every 1 second
|   * Client A receives another retryable error and gives up propagating the error
V
+- HA Backend promotes a replica, and send failover notification
|
|   * pgcon B and C are terminated and discarded from the pool
|   * Client B receives a retryable error and retries
|   * Because sys_pgcon is not back yet, Client B receives another retryable error and retries again
|   * sys_pgcon connects to the new master
|   * pgcon B' is created for Client B to complete its query
|
+- Everything is normal again

Issues with AWS RDS & Custom HA Postgres

At first, RDS was tested as a "Custom HA Postgres" without failover notification, and it works fine with the AWS feature "Reboot with Failover", details in #2293. In short, the failover process is similar to the example in previous section, except for the failover notification and following termination of existing pgcons. AWS updates the DNS record, TCP keepalive kills all existing pgcons and new queries eventually connects to the new master.

In the real world, the HA backend may choose to do a failover while some pgcons are still active in EdgeDB, because the HA backend talks to Postgres from possibly a different network than the EdgeDB server, or any other unknown real-world reasons. In other words, when the EdgeDB server sees a still-working Postgres, the HA backend may have decided to do a failover. Without HA backend telling EdgeDB server about the failover, EdgeDB server could only come up with the best guess it can make, and tries its best to cut off lingering connections to the old master (do the switch) at the right moment, in order to talk to the real master consistently.

On the contrary, EdgeDB must not be too sensitive to switch, because it is also possible that the network between the EdgeDB server and the Postgres clusters is unstable. For example, we shouldn't actively drop all pgcons only because the sys_pgcon is broken. There should be some kind of a threshold beyond which EdgeDB should do the switch. At the same time, the switch should not happen very frequently: if we're killing all pgcons every second, there must be some kind of misunderstanding of the failover situation.

pgcon Pool Metrics

In order to evaluate the pgcon pool healthiness, I'll cover a little bit of how the pool works again here, and the metrics that may affect the switch-over evaluation.

In general, the EdgeDB pgcon pool is a lazy connection pool across multiple Postgres databases ("dbname"), specifically optimized for the quality of service. Some relevant characteristics:

It only establishes a new connection when there is a demand.
Blocks (connections to different databases) share the same total connection number limit.
EdgeDB will actively "transfer" connections between blocks - that is disconnecting from database A and reconnect to B.
The pool will run "garbage collect" regularly to discard idle connections.
If it fails to connect, it will retry 3 times before propagating the error.
The pool itself doesn't handle broken connections. If the server code acquired an unhealthy pgcon from the pool, or about to return an unhealthy pgcon back to the pool, the pgcon will be discarded.

Based on these, I think the healthiness evaluation could cover these metrics:

Unexpected disconnects: If some pgcons in the pool are disconnected from Postgres without e.g. calling transport.close(), then we know the master Postgres or the network is likely failing.
Failed connects: Because EdgeDB is constantly reconnecting (though the ratio is low), failures also mean the master Postgres is unhealthy.
Last active time: EdgeDB will keep track of the last timestamp that data is received from Postgres, and will use the sys_pgcon to poll when server is idle. So that, we'll know for how long the pgcons are freezing. This is I think less important for the evaluation than the other 2 because TCP keepalive will kill those frozen connections and lead to "unexpected disconnects".
Postgres shutdown notification: In some cases (like a normal Postgres shutdown), Postgres will send notification messages saying it's going down. This is a clear signal that EdgeDB should drop all pgcons and tries to reconnect, even though the HA backend may eventually choose not to failover but wait for the old master to restart, but an active cleanup just makes EdgeDB more responsive when the master is back.

Except for (4), we may likely experience any of (1)(2) or (3) in a poor network, and none of them is a strong proof that a failover is happening. Therefore, a practical design is to have a staging state.

Design of Supporting Custom HA Postgres

① `Unhealthy` -> `Healthy`

Successfully connected to a non-readonly Postgres
(or) Data received from any pgcon (including polling sys_pgcon) (?)

② `Unhealthy` -> `Failover`

More than 60% (?) of existing pgcons are "unexpectedly disconnected" (number of existing pgcons is captured at the moment we change to Unhealthy state)
(and) In Unhealthy state for more than 30 (?) seconds.
(and) sys_pgcon is down.
(or) Postgres shutdown notification received.

③ `Healthy` -> `Unhealthy`

Any unexpected disconnect.
(or) Failed to connect due to ConnectionError
(or) Last active time is greater than 10 seconds (depends on the sys_pgcon idle-poll interval).

④ `Healthy` -> `Failover`

Postgres shutdown notification received.

⑤ `Failover` -> `Healthy`

Successfully connected to a non-readonly Postgres.
(and) sys_pgcon is healthy.

Behaviors in Each State

Healthy
- Open to client queries
Unhealthy
- If the sys_pgcon is broken:
  - Client queries will raise retryable BackendUnavailableError
  - New client connects will raise retryable BackendUnavailableError
- Existing client connections will not be dropped
Failover
- All pgcons will be actively discarded when switching to Failover state
- Client queries will raise retryable BackendInFailoverError
- New client connects will directly raise retryable BackendInFailoverError
- Existing client connections will not be dropped
- sys_pgcon is the only retrying connection

Summary of Solutions

To support custom HA Postgres without notification, the design explained previously is a close-enough approximation that should work for most cases. More importantly, it allows us to add those metrics incrementally so that we could focus on the critical conditions first, and extend to other metrics later.

Implement state-based PG failover evaluation
Check if ~~a connection is read-only~~ Postgres is a hot standby
Identify and record unexpected disconnects
~~Identify failed connects due to ConnectionError~~ (This one doesn't make much sense, because failed connects cannot trigger a failover switch - we'll just keep reconnecting)
Record last active time and poll in sys_pgcon when idle
Reconnect immediately on unexpected disconnects in the pool
~~Check pg_replication_slots and raise error if connected to a replica~~ (not needed now)
Add HA backend support for RDS through AWS API

Also, it looks like we could easily blend the notification-based HA support into the same FSM - just switch to Failover whenever a failover notification is received.

elprans · 2021-09-02T18:32:35Z

elprans
Sep 2, 2021
Maintainer

This is a great summary, and all of it makes sense to me, thank you, @fantix.

Some notes:

through a new CLI parameter --ha-cluster

We'll likely name this --postgres-ha-cluster for consistency with --postgres-dsn. However, I also think that perhaps --enable-ha should be a boolean flag, and we should add support for custom DSNs in --postgres-dsn instead to avoid having several DSN-like parameters.

Unhealthy

* If the `sys_pgcon` is broken:
  
  * Client queries will raise retryable `BackendUnavailableError`
  * New client connects will raise retryable `BackendFailoverError`

Why is there a difference in existing and new connection? It seems to me that new connections should also raise BackendUnavailableError. BackendFailoverError should only be raised if EdgeDB made a transition to the Failover state.

Existing client connections will not be dropped

This behavior seems to be the same for Unhealthy and Failover. This makes sense to me, because we raise errors on queries, but I just wanted to make sure you intended this.

Check pg_replication_slots and raise error if connected to a replica

I think a more reliable way would be to check for a receipt of in_hot_standby ParameterStatus in PostgreSQL 14+, and poll pg_is_in_recovery() under earlier versions. pg_replication_slots should be mainly used to discover new replicas, and, as such, to discover new potential masters.

Also, it looks like we could easily blend the notification-based HA support into the same FSM - just switch to Failover whenever a failover notification is received.

Yes, a definitive notification from the HA system should override all other signals.

3 replies

fantix Sep 2, 2021
Maintainer Author

We'll likely name this --postgres-ha-cluster for consistency with --postgres-dsn. However, I also think that perhaps --enable-ha should be a boolean flag, and we should add support for custom DSNs in --postgres-dsn instead to avoid having several DSN-like parameters.

Right, sounds good, will fix!

Why is there a difference in existing and new connection? It seems to me that new connections should also raise BackendUnavailableError. BackendFailoverError should only be raised if EdgeDB made a transition to the Failover state.

Oh, good catch! That's a copy-paste typo, sorry about that.

Existing client connections will not be dropped

This behavior seems to be the same for Unhealthy and Failover. This makes sense to me, because we raise errors on queries, but I just wanted to make sure you intended this.

Right, yes - clients don't have to reconnect throughout the backend failover.

Check pg_replication_slots and raise error if connected to a replica

I think a more reliable way would be to check for a receipt of in_hot_standby ParameterStatus in PostgreSQL 14+, and poll pg_is_in_recovery() under earlier versions. pg_replication_slots should be mainly used to discover new replicas, and, as such, to discover new potential masters.

Oh that's a better way - thanks for sharing! RDS doesn't seems to be updating pg_replication_slots, let me try pg_is_in_recovery() on RDS.

1st1 Sep 3, 2021
Maintainer

We'll likely name this --postgres-ha-cluster for consistency with --postgres-dsn. However, I also think that perhaps --enable-ha should be a boolean flag, and we should add support for custom DSNs in --postgres-dsn instead to avoid having several DSN-like parameters.

We're a little inconsistent already. We have --postgres-dsn but we also have --max-backend-connections. And our error type names include Backend, not Postgres in their names (e.g. BackendUnavailableError.

Therefore, for consistency, I actually propose to rename --postgres-dsn to --backend-dsn, and call the new arg --ha-backend-cluster.

fantix Sep 3, 2021
Maintainer Author

I'll do the following:

Deprecate --postgres-dsn
Use --backend-dsn=postgresql://... instead
Support HA backend through the same parameter using custom DSN like --backend-dsn=stolon+consul+https://localhost:7500/my_cluster while passing other necessary Postgres connargs through env var like PGUSER=stolon_su

1st1 · 2021-09-03T03:19:53Z

1st1
Sep 3, 2021
Maintainer

Amazing and thorough overview, thank you @fantix.

BackendFailoverError

I'd rename to BackendInFailoverError, I think it's more descriptive that way.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HA Postgres Support #2887

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 2 comments 3 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

HA Postgres Support #2887

fantix Sep 2, 2021 Maintainer

Design of Notification-based HA

Issues with AWS RDS & Custom HA Postgres

pgcon Pool Metrics

Design of Supporting Custom HA Postgres

① Unhealthy -> Healthy

② Unhealthy -> Failover

③ Healthy -> Unhealthy

④ Healthy -> Failover

⑤ Failover -> Healthy

Behaviors in Each State

Summary of Solutions

Replies: 2 comments · 3 replies

elprans Sep 2, 2021 Maintainer

fantix Sep 2, 2021 Maintainer Author

1st1 Sep 3, 2021 Maintainer

fantix Sep 3, 2021 Maintainer Author

1st1 Sep 3, 2021 Maintainer

fantix
Sep 2, 2021
Maintainer

① `Unhealthy` -> `Healthy`

② `Unhealthy` -> `Failover`

③ `Healthy` -> `Unhealthy`

④ `Healthy` -> `Failover`

⑤ `Failover` -> `Healthy`

Replies: 2 comments 3 replies

elprans
Sep 2, 2021
Maintainer

fantix Sep 2, 2021
Maintainer Author

1st1 Sep 3, 2021
Maintainer

fantix Sep 3, 2021
Maintainer Author

1st1
Sep 3, 2021
Maintainer