Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for PostgreSQL node failover #2293

Closed
elprans opened this issue Mar 1, 2021 · 2 comments
Closed

Add support for PostgreSQL node failover #2293

elprans opened this issue Mar 1, 2021 · 2 comments
Assignees

Comments

@elprans
Copy link
Member

elprans commented Mar 1, 2021

See #1859 for related discussion.

@elprans elprans self-assigned this Mar 1, 2021
@elprans elprans assigned fantix and unassigned elprans Aug 23, 2021
@fantix
Copy link
Member

fantix commented Aug 26, 2021

Local Test

A local testing for the DNS-based failover is done as follows: (on top of #2580)

  1. Add visibility to the EdgeDB server:
diff --git a/edb/server/pgcon/pgcon.pyx b/edb/server/pgcon/pgcon.pyx
index 202c3b0e1..f46fa305b 100644
--- a/edb/server/pgcon/pgcon.pyx
+++ b/edb/server/pgcon/pgcon.pyx
@@ -207,6 +207,7 @@ async def _connect(connargs, dbname, ssl):
                 lambda: PGConnection(dbname, loop, connargs),
                 host=host, port=port)
             _set_tcp_keepalive(trans)
+            print("connected to", trans.get_extra_info("peername"))

     try:
         await pgcon.connect()
  1. Setup local Postgres instance on port 5432.
  2. Setup toxiproxy, forward traffic on port 6432 of any host to 127.0.0.1:5432.
  3. Edit /etc/hosts, add 127.0.0.1 dbhost.
  4. Start EdgeDB server with --postgres-dsn=postgresql://dbhost:6432 - we should see connected to ('127.0.0.1', 6432) in the server log.
  5. Create a client and run it:
import asyncio

import edgedb


async def job(pool):
    while True:
        async with pool.acquire() as conn:
            async for tx in conn.with_retry_options(
                edgedb.RetryOptions(60, lambda x: 1)
            ).retrying_transaction():
                async with tx:
                    print(await tx.query("SELECT datetime_current()"))
                    await tx.execute("SELECT sys::_sleep(1)")


async def main():
    async with edgedb.create_async_pool(
        "_localdev", min_size=0, max_size=5
    ) as pool:
        try:
            jobs = [asyncio.create_task(job(pool)) for _ in range(5)]
            await asyncio.wait(jobs)
        finally:
            print('end')


asyncio.run(main())
  1. Edit /etc/hosts, update to 192.168.2.22 dbhost (LAN IP). By now, there shouldn't be any 192.168.2.22 in the server log unless there are new connections made by other clients.
  2. Add a break rule to toxiproxy. In the EdgeDB server log, we should see that all connections are discarded. And the client pauses retrying.
  3. Remove the break rule. The EdgeDB server should have several connected to ('192.168.2.22', 6432), and the client resumes printing the DB timestamp.

Conclusion

EdgeDB could mostly handle DNS-based Postgres failover, because:

  1. In Support Stolon+Consul (acting as a Stolon proxy) #2580 we implemented systemdb connection retry.
  2. libuv/uvloop is not caching DNS query results.

The missing part is:

  • If only some - not all - of the pgcons are broken, EdgeDB server should actively cut off the remaining connections as soon as a DNS failover is detected. The tricky part is - like mentioned in the discussion post - to figure out if it is a failover or just DNS-based load-balancing. This is usually depending on the actual Postgres provider/HA system.

AWS RDS

Following basically the same steps above, AWS multi-az RDS failover with the AWS "Reboot With Failover" feature works fine with EdgeDB with #2580, with a small fix for macOS (requires Python 3.10rc1 for a TCP keepalive flag, bpo-34932); Linux should be fine. In short, "Reboot With Failover" simulates the master Postgres going down without sending any TCP RST packet on any alive connection. In this case TCP keepalive is necessary to capture the "failover" - it's double-quoted because of the same reason: this only means some connection to the master node is not healthy, we're not sure that there is a true failover. Though, it is sufficient for the test where all connections are down and we could cleanly failover to the replica.

One possible solution for RDS to detect a failover is to use AWS API, but this requires a lot more configuration.

fantix added a commit to fantix/edgedb that referenced this issue Aug 27, 2021
In case of RDS failover (based on the Reboot With Failover test), TCP
keepalive is needed to detect unhealthy connections to the failing
master, and new connect attempts may raise TimeoutError before the new
master is in position.

This fixes the TCP keepalive feature on macOS and requires Python 3.10
refs bpo-34932.

Refs geldata#2293.
fantix added a commit that referenced this issue Aug 27, 2021
In case of RDS failover (based on the Reboot With Failover test), TCP
keepalive is needed to detect unhealthy connections to the failing
master, and new connect attempts may raise TimeoutError before the new
master is in position.

This fixes the TCP keepalive feature on macOS and requires Python 3.10
refs bpo-34932.

Refs #2293.
@fantix
Copy link
Member

fantix commented Sep 16, 2021

Fixed in #2920

@fantix fantix closed this as completed Sep 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants