Add support for PostgreSQL node failover #2293

elprans · 2021-03-01T17:29:34Z

See #1859 for related discussion.

fantix · 2021-08-26T17:01:36Z

Local Test

A local testing for the DNS-based failover is done as follows: (on top of #2580)

Add visibility to the EdgeDB server:

diff --git a/edb/server/pgcon/pgcon.pyx b/edb/server/pgcon/pgcon.pyx
index 202c3b0e1..f46fa305b 100644
--- a/edb/server/pgcon/pgcon.pyx
+++ b/edb/server/pgcon/pgcon.pyx
@@ -207,6 +207,7 @@ async def _connect(connargs, dbname, ssl):
                 lambda: PGConnection(dbname, loop, connargs),
                 host=host, port=port)
             _set_tcp_keepalive(trans)
+            print("connected to", trans.get_extra_info("peername"))

     try:
         await pgcon.connect()

Setup local Postgres instance on port 5432.
Setup toxiproxy, forward traffic on port 6432 of any host to 127.0.0.1:5432.
Edit /etc/hosts, add 127.0.0.1 dbhost.
Start EdgeDB server with --postgres-dsn=postgresql://dbhost:6432 - we should see connected to ('127.0.0.1', 6432) in the server log.
Create a client and run it:

import asyncio

import edgedb


async def job(pool):
    while True:
        async with pool.acquire() as conn:
            async for tx in conn.with_retry_options(
                edgedb.RetryOptions(60, lambda x: 1)
            ).retrying_transaction():
                async with tx:
                    print(await tx.query("SELECT datetime_current()"))
                    await tx.execute("SELECT sys::_sleep(1)")


async def main():
    async with edgedb.create_async_pool(
        "_localdev", min_size=0, max_size=5
    ) as pool:
        try:
            jobs = [asyncio.create_task(job(pool)) for _ in range(5)]
            await asyncio.wait(jobs)
        finally:
            print('end')


asyncio.run(main())

Edit /etc/hosts, update to 192.168.2.22 dbhost (LAN IP). By now, there shouldn't be any 192.168.2.22 in the server log unless there are new connections made by other clients.
Add a break rule to toxiproxy. In the EdgeDB server log, we should see that all connections are discarded. And the client pauses retrying.
Remove the break rule. The EdgeDB server should have several connected to ('192.168.2.22', 6432), and the client resumes printing the DB timestamp.

Conclusion

EdgeDB could mostly handle DNS-based Postgres failover, because:

In Support Stolon+Consul (acting as a Stolon proxy) #2580 we implemented systemdb connection retry.
libuv/uvloop is not caching DNS query results.

The missing part is:

If only some - not all - of the pgcons are broken, EdgeDB server should actively cut off the remaining connections as soon as a DNS failover is detected. The tricky part is - like mentioned in the discussion post - to figure out if it is a failover or just DNS-based load-balancing. This is usually depending on the actual Postgres provider/HA system.

AWS RDS

Following basically the same steps above, AWS multi-az RDS failover with the AWS "Reboot With Failover" feature works fine with EdgeDB with #2580, with a small fix for macOS (requires Python 3.10rc1 for a TCP keepalive flag, bpo-34932); Linux should be fine. In short, "Reboot With Failover" simulates the master Postgres going down without sending any TCP RST packet on any alive connection. In this case TCP keepalive is necessary to capture the "failover" - it's double-quoted because of the same reason: this only means some connection to the master node is not healthy, we're not sure that there is a true failover. Though, it is sufficient for the test where all connections are down and we could cleanly failover to the replica.

One possible solution for RDS to detect a failover is to use AWS API, but this requires a lot more configuration.

In case of RDS failover (based on the Reboot With Failover test), TCP keepalive is needed to detect unhealthy connections to the failing master, and new connect attempts may raise TimeoutError before the new master is in position. This fixes the TCP keepalive feature on macOS and requires Python 3.10 refs bpo-34932. Refs geldata#2293.

In case of RDS failover (based on the Reboot With Failover test), TCP keepalive is needed to detect unhealthy connections to the failing master, and new connect attempts may raise TimeoutError before the new master is in position. This fixes the TCP keepalive feature on macOS and requires Python 3.10 refs bpo-34932. Refs #2293.

fantix · 2021-09-16T14:35:27Z

Fixed in #2920

elprans self-assigned this Mar 1, 2021

elprans added the integration label Mar 1, 2021

elprans assigned fantix and unassigned elprans Aug 23, 2021

fantix mentioned this issue Aug 26, 2021

Fix for AWS RDS multi-az failover #2866

Merged

fantix closed this as completed Sep 16, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for PostgreSQL node failover #2293

Add support for PostgreSQL node failover #2293

elprans commented Mar 1, 2021

fantix commented Aug 26, 2021 •

edited

Loading

fantix commented Sep 16, 2021

Add support for PostgreSQL node failover #2293

Add support for PostgreSQL node failover #2293

Comments

elprans commented Mar 1, 2021

fantix commented Aug 26, 2021 • edited Loading

Local Test

Conclusion

AWS RDS

fantix commented Sep 16, 2021

fantix commented Aug 26, 2021 •

edited

Loading