Resuming sync is slow and utilises a large amount of disk space #267

Open
voyc-jean opened this issue Mar 22, 2022 · 5 comments


voyc-jean commented Mar 22, 2022

PGSync version: 2.1.11
Postgres version: 12.7
Elasticsearch version: 7.10
Redis version: 5.0.6
Python version: 3.8.7
Problem Description:

When resuming PGSync after it was stopped for a while, a query runs that looks like the following:

SELECT *
FROM PG_LOGICAL_SLOT_PEEK_CHANGES('voyc_app_voyc_app_es', NULL, NULL)
WHERE CAST(CAST(xid AS TEXT) AS BIGINT) >= 162324251
  AND CAST(CAST(xid AS TEXT) AS BIGINT) < 163810458

When a replication slot isn't synced for a while on a busy database and the WAL grows, the above query can take a very long time to complete (in my case, around 20 minutes). On top of the query being slow, it also utilises a large amount of disk space (via temp storage) on the Postgres server:

[image: Postgres server disk-space graph showing a sharp drop while the above query runs]

This large dip in disk space begins when the above query executes. In this case, it caused PGSync to crash with the following error:

sqlalchemy.exc.OperationalError: (psycopg2.errors.DiskFull) could not write to file "base/pgsql_tmp/pgsql_tmp19437.19": No space  left on device
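
As a side note, the temp-file spill that eats the disk can also be confirmed from pg_stat_database (a diagnostic sketch, not something PGSync runs itself):

-- cumulative temp file usage for the current database
SELECT datname, temp_files, pg_size_pretty(temp_bytes) AS temp_size
FROM pg_stat_database
WHERE datname = current_database()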

If I change the query to filter on a single xid, it still takes more than 20 minutes to return. My uneducated guess is that Postgres must read through the entirety of the WAL on disk in order to filter out the xids that the query requires.
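
For context, the amount of WAL the slot is holding back at that point can be estimated with something like this (a diagnostic sketch using standard Postgres catalog functions):

-- approximate WAL retained by each replication slot
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots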

In addition to the above, if the query does execute (after increasing the database disk space), PGSync might crash with the following exception, depending on how much data is sitting in the WAL:

Traceback (most recent call last):
  File "/usr/local/bin/pgsync", line 7, in <module>
    sync.main()
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/pgsync/sync.py", line 1240, in main
    sync.pull()
  File "/usr/local/lib/python3.8/site-packages/pgsync/sync.py", line 1084, in pull
    self.logical_slot_changes(txmin=txmin, txmax=txmax)
  File "/usr/local/lib/python3.8/site-packages/pgsync/sync.py", line 333, in logical_slot_changes
    rows: list = self.logical_slot_peek_changes(
  File "/usr/local/lib/python3.8/site-packages/pgsync/base.py", line 411, in logical_slot_peek_changes
    return self._logical_slot_changes(
  File "/usr/local/lib/python3.8/site-packages/pgsync/base.py", line 372, in _logical_slot_changes
    return self.fetchall(statement)
  File "/usr/local/lib/python3.8/site-packages/pgsync/base.py", line 813, in fetchall
    rows = conn.execute(statement).fetchall()
  File "/usr/local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1289, in execute
    return meth(self, multiparams, params, _EMPTY_EXECUTION_OPTS)
  File "/usr/local/lib/python3.8/site-packages/sqlalchemy/sql/elements.py", line 325, in _execute_on_connection
    return connection._execute_clauseelement(
  File "/usr/local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1481, in _execute_clauseelement
    ret = self._execute_context(
  File "/usr/local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1845, in _execute_context
    self._handle_dbapi_exception(
  File "/usr/local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 2026, in _handle_dbapi_exception
    util.raise_(
  File "/usr/local/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 207, in raise_
    raise exception
  File "/usr/local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1802, in _execute_context
    self.dialect.do_execute(
  File "/usr/local/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 732, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.DatabaseError: (psycopg2.DatabaseError) out of memory for query result

[SQL: SELECT xid, data 
FROM PG_LOGICAL_SLOT_PEEK_CHANGES(%(PG_LOGICAL_SLOT_PEEK_CHANGES_1)s, NULL, NULL) 
WHERE CAST(CAST(xid AS TEXT) AS BIGINT) >= %(param_1)s AND CAST(CAST(xid AS TEXT) AS BIGINT) < %(param_2)s]
[parameters: {'PG_LOGICAL_SLOT_PEEK_CHANGES_1': 'voyc_app_voyc_app_es', 'param_1': 172730985, 'param_2': 174377184}]
(Background on this error at: https://sqlalche.me/e/14/4xp6)

All 3.79GB of memory allocated to my PGSync container is exhausted as there are simply too many records returned at once.
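
For reference, a rough way to see how many changes fall inside that xid window (and hence how large a result set gets pulled into memory in one go) is something like the query below, though it of course suffers from the same slow full read of the slot:

SELECT COUNT(*)
FROM PG_LOGICAL_SLOT_PEEK_CHANGES('voyc_app_voyc_app_es', NULL, NULL)
WHERE CAST(CAST(xid AS TEXT) AS BIGINT) >= 172730985
  AND CAST(CAST(xid AS TEXT) AS BIGINT) < 174377184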

toluaina added a commit that referenced this issue Mar 23, 2022
@toluaina (Owner)

I have fixed this in the master branch so that we only fetch records from the logical replication slot in batches defined by PG_LOGICAL_SLOT_UPTO_NCHANGES, which defaults to 5000.
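
Roughly speaking, the batch size ends up as the upto_nchanges argument of the peek function, so each fetch looks something like:

-- peek at most 5000 changes from the slot per fetch
SELECT xid, data
FROM PG_LOGICAL_SLOT_PEEK_CHANGES('voyc_app_voyc_app_es', NULL, 5000)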


voyc-jean commented Mar 29, 2022

This fix does reduce the disk usage, though I'm not sure the upto_nchanges parameter of PG_LOGICAL_SLOT_PEEK_CHANGES is working as you expect.

This query returns numerous results:

SELECT *
FROM PG_LOGICAL_SLOT_PEEK_CHANGES('voyc_app_voyc_app_es', NULL, NULL)
WHERE CAST(CAST(xid AS TEXT) AS BIGINT) >= 183066650
  AND CAST(CAST(xid AS TEXT) AS BIGINT) < 183116650

Whereas this query returns no results:

SELECT xid, lsn
FROM PG_LOGICAL_SLOT_PEEK_CHANGES('voyc_app_voyc_app_es', NULL, 50000)
WHERE CAST(CAST(xid AS TEXT) AS BIGINT) >= 183066650
  AND CAST(CAST(xid AS TEXT) AS BIGINT) < 183116650

I believe this causes the while loop here to evaluate to false immediately: the upto_nchanges limit is applied inside the function, so the 50000 changes it returns are the oldest changes in the slot, and the outer WHERE clause then filters all of them out because their xids fall below the requested range.
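
As a quick check (a diagnostic sketch), inspecting the xid range of that first capped batch should show that it falls entirely below the requested txmin:

-- xid range of the first 50000 changes in the slot, before any filtering
SELECT MIN(CAST(CAST(xid AS TEXT) AS BIGINT)) AS min_xid,
       MAX(CAST(CAST(xid AS TEXT) AS BIGINT)) AS max_xid
FROM PG_LOGICAL_SLOT_PEEK_CHANGES('voyc_app_voyc_app_es', NULL, 50000)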

@toluaina (Owner)

I think you are right. I changed the logic on this one. Can you give it another try, please?

@voyc-jean (Author)

Hi @toluaina,

Sorry for the delay. I've tested, but I'm still having issues with this.

I believe the problem might be with Postgres versions < 14.0 (I have been testing with 12.7). From the release notes for PostgreSQL 14:

  • Allow logical decoding to be filtered by xid (Markus Wanner)

The xid filtering that PGSync applies to the logical slot seems to require reading through all WAL segments. I plan on upgrading my Postgres to 14 in the near future and will test and report back once done.

@toluaina (Owner)

Cool. Do let me know the outcome, thanks.
