This is a great summary, and all of it makes sense to me, thank you, @fantix. Some notes:
- We'll likely name this
- Why is there a difference between existing and new connections? It seems to me that new connections should also raise
- This behavior seems to be the same for
- I think a more reliable way would be to check for a receipt of
- Yes, a definitive notification from the HA system should override all other signals.
---
Amazing and thorough overview, thank you @fantix.
I'd rename to
---
Refs #2293, #2580
To support Postgres clusters that come with high availability, EdgeDB should be albe to automatically switch to the new Postgres master when a failover happens. For EdgeDB, HA Postgres clusters fall into 2 categories:
## Design of Notification-based HA
The EdgeDB server will have different implementations to support selected Postgres HA backends like Stolon, through a new CLI parameter `--ha-cluster`. Based on its URL value, the corresponding implementation will subscribe to the given HA backend, and only use the Postgres DSN advertised by that backend.

When a failover notification is received, EdgeDB will bump an internal serial counter and cut off all current Postgres connections. Any new Postgres connection from then on will use the new master address from the notification, tagged with the new serial number. Any lingering connection with a lower serial number will be discarded.
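As a minimal sketch of this serial-counter mechanism - `PGPool`, its method names, and the use of asyncpg are illustrative assumptions, not the actual EdgeDB internals:

```python
import asyncpg  # stand-in for EdgeDB's internal pgcon machinery


class PGPool:
    """Tags every connection with the serial of the master it was opened for."""

    def __init__(self, dsn: str):
        self._dsn = dsn          # currently advertised master DSN
        self._serial = 0         # bumped on every failover notification
        self._conns: list[tuple[int, asyncpg.Connection]] = []

    def on_failover(self, new_master_dsn: str) -> None:
        """Called when the HA backend advertises a new master."""
        self._serial += 1
        self._dsn = new_master_dsn
        # Cut off all lingering connections tagged with an older serial.
        for serial, conn in self._conns:
            if serial < self._serial:
                conn.terminate()
        self._conns = [(s, c) for s, c in self._conns if s == self._serial]

    async def acquire(self) -> asyncpg.Connection:
        # New connections always use the new master address, tagged with
        # the new serial number.
        conn = await asyncpg.connect(self._dsn)
        self._conns.append((self._serial, conn))
        return conn
```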
The notification and the actual failover may be separated in time: when the old master fails, the HA backend usually takes 30-60 seconds to confirm the fact, and promoting a replica to be the new master also takes time, so the notification may arrive before the new master is ready. This means EdgeDB must keep trying to reconnect to Postgres, and only resume service once the backend is fully ready. This topic is expanded in a later section; here I'll describe the current behavior in #2580, which is sufficient for notification-based HA.
Currently, EdgeDB depends on the liveness of the Postgres connection to the system database (`sys_pgcon` in short). Once the `sys_pgcon` is broken, EdgeDB will set a server-wide "unavailable" tag and return retryable errors to all interactions. In the meantime, `sys_pgcon` will keep trying to reconnect for as long as EdgeDB is still running. However, if a regular `pgcon` in the pool gets disconnected unexpectedly, the current operation is aborted with a retryable error and the `pgcon` itself is simply discarded without reconnection. If there is no ongoing operation on the `pgcon`, the disconnect error is silently ignored - the connection is only removed from the pool the next time it is acquired. If the connection error happens while a `pgcon` is being established, though, the `pgcon` will retry 3 times before returning a retryable error (see the sketch after the timeline below).

A common timeline would look like this:

1. The old master fails; `sys_pgcon` and any busy `pgcon`s hit connection errors, and ongoing operations abort with retryable errors.
2. The HA backend confirms the failure and sends a failover notification; EdgeDB bumps the serial counter and cuts off all remaining connections to the old master.
3. While the new master is being promoted, reconnection attempts keep failing and clients keep receiving retryable errors.
4. The new master becomes ready; new `pgcon`s (tagged with the new serial) succeed, and service resumes.
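The "retry 3 times" connect behavior could be sketched roughly like this; `connect_with_retry`, the backoff values, and the use of asyncpg are assumptions for illustration, with `BackendUnavailableError` named after the error class discussed later in this post:

```python
import asyncio

import asyncpg


class BackendUnavailableError(Exception):
    """Retryable error surfaced to clients."""


CONNECT_ATTEMPTS = 3  # matches the "retry 3 times" behavior described above


async def connect_with_retry(dsn: str) -> asyncpg.Connection:
    last_exc: Exception | None = None
    for attempt in range(CONNECT_ATTEMPTS):
        try:
            return await asyncpg.connect(dsn)
        except OSError as exc:  # TCP-level connection failures
            last_exc = exc
            await asyncio.sleep(0.5 * (attempt + 1))  # simple backoff
    raise BackendUnavailableError() from last_exc
```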
## Issues with AWS RDS & Custom HA Postgres
At first, RDS was tested as a "Custom HA Postgres" without failover notification, and it works fine with the AWS feature "Reboot with Failover"; details in #2293. In short, the failover process is similar to the example in the previous section, except that there is no failover notification and no resulting termination of existing pgcons. AWS updates the DNS record, TCP keepalive kills all existing pgcons, and new queries eventually connect to the new master.
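Since this path relies on TCP keepalive to reap connections to the dead master, the failover latency is bounded by the keepalive settings. A rough sketch of tuning them on a raw socket (Linux option names; the values here are arbitrary, not what EdgeDB or RDS actually uses):

```python
import socket


def enable_keepalive(sock: socket.socket) -> None:
    """Configure aggressive TCP keepalive so dead peers are detected quickly."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Seconds of idleness before the first keepalive probe is sent.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)
    # Seconds between subsequent probes.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)
    # Failed probes before the connection is considered dead.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 6)
```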
In the real world, the HA backend may choose to do a failover while some pgcons are still active in EdgeDB - the HA backend may observe Postgres from a different network than the EdgeDB server does, among other real-world reasons. In other words, even while the EdgeDB server sees a working Postgres, the HA backend may have already decided to fail over. Without the HA backend telling the EdgeDB server about the failover, the EdgeDB server can only make its best guess and cut off lingering connections to the old master (do the switch) at the right moment, in order to consistently talk to the real master.
Conversely, EdgeDB must not switch too eagerly, because it is also possible that the network between the EdgeDB server and the Postgres cluster is merely unstable. For example, we shouldn't actively drop all pgcons just because the `sys_pgcon` is broken. There should be some threshold beyond which EdgeDB does the switch. At the same time, the switch should not happen very frequently: if we're killing all pgcons every second, there must be some misunderstanding of the failover situation.

## pgcon Pool Metrics
In order to evaluate the health of the pgcon pool, I'll briefly cover how the pool works again here, along with the metrics that may affect the switch-over evaluation.
In general, the EdgeDB pgcon pool is a lazy connection pool across multiple Postgres databases ("dbname"), specifically optimized for quality of service. Some relevant characteristics:
Based on these, I think the health evaluation could cover these metrics:

1. Unexpected disconnects: when working pgcons suddenly get `transport.close()`, we know the master Postgres or the network is likely failing.
2. Failed connects: new pgcons fail to establish with a `ConnectionError`.
3. Frozen pgcons: use `sys_pgcon` to poll when the server is idle, so that we'll know for how long the pgcons have been freezing. This is, I think, less important for the evaluation than the other 2, because TCP keepalive will kill those frozen connections and lead to "unexpected disconnects".
4. Read-only connections: a connection turns out to be read-only because Postgres is a hot standby, meaning we are talking to a replica.

Except for (4), we may well experience any of (1), (2) or (3) in a poor network, and none of them is strong proof that a failover is happening. Therefore, a practical design is to have a staging state.
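As a sketch, the weak signals (1) and (2) could be accumulated in rolling windows and evaluated together; all names and thresholds below are placeholders, not the real design:

```python
import time
from collections import deque

WINDOW = 30.0  # seconds of history to keep; the threshold is a guess


class PoolHealthMetrics:
    """Rolling counters for the weak signals; signal (4) is a hard trigger."""

    def __init__(self) -> None:
        self.unexpected_disconnects: deque[float] = deque()
        self.failed_connects: deque[float] = deque()

    def _record(self, bucket: deque) -> None:
        now = time.monotonic()
        bucket.append(now)
        # Drop events that fell out of the window.
        while bucket and now - bucket[0] > WINDOW:
            bucket.popleft()

    def on_unexpected_disconnect(self) -> None:
        self._record(self.unexpected_disconnects)

    def on_failed_connect(self) -> None:
        self._record(self.failed_connects)

    def looks_unhealthy(self) -> bool:
        # Weak signals only: any recent disconnect, or repeated failed
        # connects, suggests trouble but does not prove a failover.
        return bool(self.unexpected_disconnects) or len(self.failed_connects) >= 3
```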
## Design of Supporting Custom HA Postgres
① `Unhealthy` -> `Healthy`: a successful new pgcon (including the polling `sys_pgcon`) (?)

② `Unhealthy` -> `Failover`: (evaluated while in the `Unhealthy` state) we've been in the `Unhealthy` state for more than 30 (?) seconds, and `sys_pgcon` is down.

③ `Healthy` -> `Unhealthy`: a `ConnectionError` occurs (checked within a `sys_pgcon` idle-poll interval).

④ `Healthy` -> `Failover`: a strong signal, like metric (4) above or an explicit failover notification (?)

⑤ `Failover` -> `Healthy`: `sys_pgcon` is healthy.
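Putting the transitions together, a minimal sketch of the FSM (state names follow the list above; the 30-second grace period keeps the (?) guess from transition ②, and the notification hook anticipates the blending idea at the end of this post):

```python
import enum
import time


class State(enum.Enum):
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"
    FAILOVER = "failover"


class HealthFSM:
    """Sketch of the three-state machine described above."""

    UNHEALTHY_GRACE = 30.0  # seconds; the 30 (?) guess from transition ②

    def __init__(self) -> None:
        self.state = State.HEALTHY
        self._unhealthy_since: float | None = None

    def on_connection_error(self) -> None:
        # Transition ③: a ConnectionError while Healthy.
        if self.state is State.HEALTHY:
            self.state = State.UNHEALTHY
            self._unhealthy_since = time.monotonic()

    def on_successful_pgcon(self) -> None:
        # Transitions ① and ⑤: a working pgcon (including the polling
        # sys_pgcon) brings us back to Healthy.
        self.state = State.HEALTHY
        self._unhealthy_since = None

    def on_tick(self, sys_pgcon_down: bool) -> None:
        # Transition ②: stuck in Unhealthy past the grace period while
        # sys_pgcon is down - assume a failover happened.
        if (
            self.state is State.UNHEALTHY
            and sys_pgcon_down
            and self._unhealthy_since is not None
            and time.monotonic() - self._unhealthy_since > self.UNHEALTHY_GRACE
        ):
            self.state = State.FAILOVER

    def on_failover_notification(self) -> None:
        # Transition ④ and the notification-based case: a definitive
        # signal jumps straight to Failover.
        self.state = State.FAILOVER
```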
## Behaviors in Each State

`Healthy`: everything works normally, as described above.

`Unhealthy`: when `sys_pgcon` is broken, return retryable `BackendUnavailableError`s server-wide as we do today; operations on broken pgcons also abort with retryable `BackendUnavailableError`s.

`Failover`: all pgcons are killed on entering the `Failover` state; both ongoing and new operations fail with `BackendInFailoverError`, and `sys_pgcon` is the only retrying connection.
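Reusing the `State` enum from the sketch above, the state-to-error mapping might look like this; the error class names come from this discussion, but the mapping itself is only an illustration:

```python
class BackendUnavailableError(Exception):
    """Retryable: the backend is temporarily unreachable."""


class BackendInFailoverError(Exception):
    """The backend is switching masters; retry once the failover completes."""


def check_state(state: State) -> None:
    """Raise before serving an operation, based on the current FSM state."""
    if state is State.UNHEALTHY:
        raise BackendUnavailableError()
    if state is State.FAILOVER:
        raise BackendInFailoverError()
    # State.HEALTHY: proceed normally.
```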
## Summary of Solutions

To support custom HA Postgres without notification, the design explained above is a close-enough approximation that should work for most cases. More importantly, it allows us to add those metrics incrementally, so that we can focus on the critical conditions first and extend to the other metrics later.
- Detect that a connection is read-only because Postgres is a hot standby (see the sketch after this list)
- Identify failed connects due to `ConnectionError` (this one doesn't make much sense, because failed connects cannot trigger a failover switch - we'll just keep reconnecting)
- Poll with `sys_pgcon` when idle
- Check `pg_replication_slots` and raise an error if connected to a replica (not needed now)
and raise error if connected to a replicaAlso, it looks like we could easily blend the notification-based HA support into the same FSM - just switch to
Failover
whenever a failover notification is received.Beta Was this translation helpful? Give feedback.