Resuming sync is slow and utilises a large amount of disk space #267

Open
voyc-jean opened this issue Mar 22, 2022 · 5 comments


voyc-jean commented Mar 22, 2022

PGSync version: 2.1.11
Postgres version: 12.7
Elasticsearch version: 7.10
Redis version: 5.0.6
Python version: 3.8.7
Problem Description:

When resuming PGSync after it was stopped for a while, a query runs that looks like the following:

SELECT *
FROM PG_LOGICAL_SLOT_PEEK_CHANGES('voyc_app_voyc_app_es', NULL, NULL)
WHERE CAST(CAST(xid AS TEXT) AS BIGINT) >= 162324251
  AND CAST(CAST(xid AS TEXT) AS BIGINT) < 163810458

When a replication slot isn't synced for a while on a busy database and the WAL grows, the above query can take a very long time to complete (in my case, around 20 minutes). On top of the query being slow, it also utilises a large amount of disk space (via temp storage) on the Postgres server:

[image: Postgres server disk-space graph showing a sharp drop while the above query runs]

This large dip in disk space begins when the above query executes. In this case, it caused PGSync to crash with the following error:

sqlalchemy.exc.OperationalError: (psycopg2.errors.DiskFull) could not write to file "base/pgsql_tmp/pgsql_tmp19437.19": No space  left on device
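
As a side note, the temp-file spill that eats the disk can also be confirmed from pg_stat_database (a diagnostic sketch, not something PGSync runs itself):

-- cumulative temp file usage for the current database
SELECT datname, temp_files, pg_size_pretty(temp_bytes) AS temp_size
FROM pg_stat_database
WHERE datname = current_database()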

If I change the query to filter on a single xid, it still takes more than 20 minutes to return. My uneducated guess is that Postgres must read through the entirety of the WAL on disk in order to filter out the xids that the query requires.
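
For context, the amount of WAL the slot is holding back at that point can be estimated with something like this (a diagnostic sketch using standard Postgres catalog functions):

-- approximate WAL retained by each replication slot
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots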

In addition to the above, if the query does execute (after increasing the database disk space), PGSync might crash with the following exception, depending on how much data is sitting in the WAL:

Traceback (most recent call last):
  File "/usr/local/bin/pgsync", line 7, in <module>
    sync.main()
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/pgsync/sync.py", line 1240, in main
    sync.pull()
  File "/usr/local/lib/python3.8/site-packages/pgsync/sync.py", line 1084, in pull
    self.logical_slot_changes(txmin=txmin, txmax=txmax)
  File "/usr/local/lib/python3.8/site-packages/pgsync/sync.py", line 333, in logical_slot_changes
    rows: list = self.logical_slot_peek_changes(
  File "/usr/local/lib/python3.8/site-packages/pgsync/base.py", line 411, in logical_slot_peek_changes
    return self._logical_slot_changes(
  File "/usr/local/lib/python3.8/site-packages/pgsync/base.py", line 372, in _logical_slot_changes
    return self.fetchall(statement)
  File "/usr/local/lib/python3.8/site-packages/pgsync/base.py", line 813, in fetchall
    rows = conn.execute(statement).fetchall()
  File "/usr/local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1289, in execute
    return meth(self, multiparams, params, _EMPTY_EXECUTION_OPTS)
  File "/usr/local/lib/python3.8/site-packages/sqlalchemy/sql/elements.py", line 325, in _execute_on_connection
    return connection._execute_clauseelement(
  File "/usr/local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1481, in _execute_clauseelement
    ret = self._execute_context(
  File "/usr/local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1845, in _execute_context
    self._handle_dbapi_exception(
  File "/usr/local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 2026, in _handle_dbapi_exception
    util.raise_(
  File "/usr/local/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 207, in raise_
    raise exception
  File "/usr/local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1802, in _execute_context
    self.dialect.do_execute(
  File "/usr/local/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 732, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.DatabaseError: (psycopg2.DatabaseError) out of memory for query result

[SQL: SELECT xid, data 
FROM PG_LOGICAL_SLOT_PEEK_CHANGES(%(PG_LOGICAL_SLOT_PEEK_CHANGES_1)s, NULL, NULL) 
WHERE CAST(CAST(xid AS TEXT) AS BIGINT) >= %(param_1)s AND CAST(CAST(xid AS TEXT) AS BIGINT) < %(param_2)s]
[parameters: {'PG_LOGICAL_SLOT_PEEK_CHANGES_1': 'voyc_app_voyc_app_es', 'param_1': 172730985, 'param_2': 174377184}]
(Background on this error at: https://sqlalche.me/e/14/4xp6)

All 3.79GB of memory allocated to my PGSync container is exhausted as there are simply too many records returned at once.
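
For reference, a rough way to see how many changes fall inside that xid window (and hence how large a result set gets pulled into memory in one go) is something like the query below, though it of course suffers from the same slow full read of the slot:

SELECT COUNT(*)
FROM PG_LOGICAL_SLOT_PEEK_CHANGES('voyc_app_voyc_app_es', NULL, NULL)
WHERE CAST(CAST(xid AS TEXT) AS BIGINT) >= 172730985
  AND CAST(CAST(xid AS TEXT) AS BIGINT) < 174377184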

toluaina added a commit that referenced this issue Mar 23, 2022
@toluaina (Owner)

I have fixed this in the master branch so that we only fetch records from the logical replication slot in batches defined by PG_LOGICAL_SLOT_UPTO_NCHANGES, which defaults to 5000.
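
Roughly speaking, the batch size ends up as the upto_nchanges argument of the peek function, so each fetch looks something like:

-- peek at most 5000 changes from the slot per fetch
SELECT xid, data
FROM PG_LOGICAL_SLOT_PEEK_CHANGES('voyc_app_voyc_app_es', NULL, 5000)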


voyc-jean commented Mar 29, 2022

This fix does reduce the disk usage, though I'm not sure the upto_nchanges parameter of PG_LOGICAL_SLOT_PEEK_CHANGES is working as you expect.

This query returns numerous results:

SELECT *
FROM PG_LOGICAL_SLOT_PEEK_CHANGES('voyc_app_voyc_app_es', NULL, NULL)
WHERE CAST(CAST(xid AS TEXT) AS BIGINT) >= 183066650
  AND CAST(CAST(xid AS TEXT) AS BIGINT) < 183116650

Whereas this query returns no results:

SELECT xid, lsn
FROM PG_LOGICAL_SLOT_PEEK_CHANGES('voyc_app_voyc_app_es', NULL, 50000)
WHERE CAST(CAST(xid AS TEXT) AS BIGINT) >= 183066650
  AND CAST(CAST(xid AS TEXT) AS BIGINT) < 183116650

I believe this causes the while loop here to evaluate to false immediately: the upto_nchanges limit is applied inside the function, so the 50000 changes it returns are the oldest changes in the slot, and the outer WHERE clause then filters all of them out because their xids fall below the requested range.
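
As a quick check (a diagnostic sketch), inspecting the xid range of that first capped batch should show that it falls entirely below the requested txmin:

-- xid range of the first 50000 changes in the slot, before any filtering
SELECT MIN(CAST(CAST(xid AS TEXT) AS BIGINT)) AS min_xid,
       MAX(CAST(CAST(xid AS TEXT) AS BIGINT)) AS max_xid
FROM PG_LOGICAL_SLOT_PEEK_CHANGES('voyc_app_voyc_app_es', NULL, 50000)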

@toluaina (Owner)

I think you are right. I changed the logic on this one. Can you give it another try, please?

@voyc-jean (Author)

Hi @toluaina,

Sorry for the delay. I've tested, but I'm still having issues with this.

I believe the problem might be with Postgres versions < 14.0 (I have been testing with 12.7). From the release notes for PostgreSQL 14:

  • Allow logical decoding to be filtered by xid (Markus Wanner)

The xid filtering that PGSync applies to the logical slot seems to require reading through all WAL segments. I plan on upgrading my Postgres to 14 in the near future and will test and report back once done.

@toluaina (Owner)

Cool. Do let me know the outcome, thanks.
