feat(data-warehouse): Build a new postgres source #28660
Conversation
Hey @Gilbert09! 👋
PR Summary
This PR introduces significant changes to SQL data source handling and package dependencies, focusing on a new PostgreSQL implementation and memory management improvements.
- Added a new `postgres_source` function in `sql_v2.py` using the `psycopg` library with 10k batch fetching and scrollable cursors
- Implemented PyArrow memory pool management and debug logging through a new `PYARROW_DEBUG_LOGGING` environment variable (see the sketch after this comment)
- Updated core database packages, including psycopg (3.1.20 -> 3.2.4), SQLAlchemy (2.0.31 -> 2.0.38), and related dependencies
- Modified `TableLoader` execution options to improve memory usage with the `max_row_buffer` and `stream_results` parameters
- Introduced a `SourceResponse` dataclass to standardize data source operation responses

Note: Several issues need addressing before merge, including commented-out null column handling, debug print statements, and missing error handling in the new PostgreSQL implementation.
15 file(s) reviewed, 12 comment(s)
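The summary mentions a new PYARROW_DEBUG_LOGGING environment variable for memory debugging. A rough, hedged sketch of how such a flag could be wired up using PyArrow's built-in allocation logging (the variable name comes from the summary; the actual implementation in this PR may differ):

import os

import pyarrow as pa

# Hypothetical wiring for the PYARROW_DEBUG_LOGGING flag described above.
if os.environ.get("PYARROW_DEBUG_LOGGING") == "1":
    # Log every allocation and deallocation made through the default memory pool.
    pa.log_memory_allocations(True)

# pa.total_allocated_bytes() can then be sampled around batch processing
# to see how much memory the default pool currently holds.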
# TODO !!!!!
# pa_table = _handle_null_columns_with_definitions(pa_table, self._resource)
logic: This TODO needs to be resolved before merging. Removing null column handling without replacement could cause data integrity issues.
@dataclasses.dataclass
class SourceResponse:
    name: str
    items: Iterable[Any]
style: Using `Any` for the items type parameter reduces type safety. Consider using a more specific type or generic type parameter if possible.
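As a quick illustration of the generic approach suggested here (a sketch only; the type parameter and the pyarrow example in the comment are assumptions, not what the PR currently does):

import dataclasses
from collections.abc import Iterable
from typing import Generic, TypeVar

T = TypeVar("T")


@dataclasses.dataclass
class SourceResponse(Generic[T]):
    name: str
    # Parameterizing the payload keeps type checkers aware of what each source yields,
    # e.g. SourceResponse[pa.Table] for Arrow-based sources.
    items: Iterable[T]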
try:
    if len(table_data) == 0:
        return pa.Table.from_pylist(table_data)

    uuid_exists = any(isinstance(value, uuid.UUID) for value in table_data[0].values())
    if uuid_exists:
        return pa.Table.from_pylist(_convert_uuid_to_string(table_data))

    return pa.Table.from_pylist(table_data)
except:
style: Removing empty list check means empty tables will hit the exception handler unnecessarily. Consider keeping the optimization.
print("===================") # noqa: T201 | ||
print("USING NEW SOURCE!!!") # noqa: T201 | ||
print("===================") # noqa: T201 |
style: Debug print statements should be removed before production
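If some signal is still wanted while the new source rolls out, a hedged alternative sketch is standard logging instead of prints (the logger setup here is illustrative; the codebase may use a different logging helper):

import logging

logger = logging.getLogger(__name__)

# Same information as the prints above, but controllable via log configuration
# rather than requiring a code change to silence it.
logger.info("Using new postgres source (sql_v2)")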
def get_rows() -> Iterator[Any]:
    with psycopg.connect(
        f"postgresql://{user}:{password}@{host}:{port}/{database}?sslmode={sslmode}"
    ) as connection:
        with connection.cursor(name=f"posthog_{team_id}_{table_name}", scrollable=True) as cursor:
            cursor.itersize = 10_000

            query = sql.SQL("SELECT * FROM {}").format(sql.Identifier(table_name))
            cursor.execute(query)

            column_names = [column.name for column in cursor.description or []]

            while True:
                rows = cursor.fetchmany(10_000)
                if not rows:
                    break

                yield table_from_py_list([dict(zip(column_names, row)) for row in rows])
style: No error handling for database connection failures or query execution errors. Should wrap in try/catch
Suggested change:

def get_rows() -> Iterator[Any]:
    try:
        with psycopg.connect(
            f"postgresql://{user}:{password}@{host}:{port}/{database}?sslmode={sslmode}"
        ) as connection:
            with connection.cursor(name=f"posthog_{team_id}_{table_name}", scrollable=True) as cursor:
                cursor.itersize = 10_000
                query = sql.SQL("SELECT * FROM {}").format(sql.Identifier(table_name))
                cursor.execute(query)
                column_names = [column.name for column in cursor.description or []]
                while True:
                    rows = cursor.fetchmany(10_000)
                    if not rows:
                        break
                    yield table_from_py_list([dict(zip(column_names, row)) for row in rows])
    except psycopg.Error as e:
        raise RuntimeError(f"Database error: {str(e)}") from e
    db_incremental_field_last_value: Optional[Any],
    using_ssl: Optional[bool] = True,
    team_id: Optional[int] = None,
    incremental_field: Optional[str] = None,
    incremental_field_type: Optional[IncrementalFieldType] = None,
logic: Incremental field parameters are passed in but never used in the query logic
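A minimal sketch of how those parameters could feed into the query, reusing the psycopg sql composition from the snippet above (the comparison and ordering semantics here are illustrative assumptions, not what the PR implements):

# Hypothetical use of the incremental parameters inside get_rows().
if incremental_field is not None and db_incremental_field_last_value is not None:
    query = sql.SQL("SELECT * FROM {} WHERE {} > %s ORDER BY {} ASC").format(
        sql.Identifier(table_name),
        sql.Identifier(incremental_field),
        sql.Identifier(incremental_field),
    )
    cursor.execute(query, (db_incremental_field_last_value,))
else:
    cursor.execute(sql.SQL("SELECT * FROM {}").format(sql.Identifier(table_name)))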
@@ -1,12 +1,19 @@
import asyncio
Non blocking but seems like a bunch of this should be tested eventually
100% - planning on adding separate tests for all of this soon
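One rough shape those tests could take, as a heavily hedged sketch (the fixture names, seeded table, and the exact postgres_source call signature are placeholders, not PostHog's actual test setup):

import pyarrow as pa


def test_postgres_source_reads_all_rows(postgres_connection_details, seeded_table):
    # Assumes postgres_source returns a SourceResponse whose `items` yields
    # pyarrow tables, as described elsewhere in this PR.
    response = postgres_source(
        **postgres_connection_details,
        table_name=seeded_table.name,
        team_id=1,
        db_incremental_field_last_value=None,
    )

    tables = list(response.items)

    assert sum(t.num_rows for t in tables) == seeded_table.row_count
    assert all(isinstance(t, pa.Table) for t in tables)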
seconds_column = pa.array(
    [row.as_py().total_seconds() if row.as_py() is not None else None for row in column]
)
table = table.set_column(table.schema.get_field_index(column_name), column_name, seconds_column)
column = table.column(column_name)
nit: Every `row.as_py()` call allocates memory for the Python type (a `timedelta`, I assume in this case). We could save some memory here by using pyarrow operations:
Suggested change:

if column.type.unit == "s":
    factor = 1
elif column.type.unit == "ms":
    factor = 1_000
elif column.type.unit == "us":
    factor = 1_000_000
elif column.type.unit == "ns":
    factor = 1_000_000_000
else:
    # Should never get here as we have covered all possible units.
    # But there is no way to assert this at compile time.
    # See for possible units: https://arrow.apache.org/docs/python/generated/pyarrow.duration.html
    raise ValueError(f"Invalid unit: {column.type.unit}")
# Dividing by the factor gets us an Int64 array of whole seconds, which is the
# same as what we would get by creating an array using Python's int.
seconds_column = pc.divide(column.cast(pa.int64()), factor)
table = table.set_column(table.schema.get_field_index(column_name), column_name, seconds_column)
column = table.column(column_name)
Great knowledge - thank you
    return table_from_iterator(iter(table_data))


def _process_batch(table_data: list[dict], schema: Optional[pa.Schema] = None) -> pa.Table:
nit: Would be nice if we could refactor this to work with an iterator, thus saving the materialization in the caller. But it's probably a lengthy refactor that could be done later.
Yeah, it was initially meant to take an iterator - but some things started to seem very hard to do, one for later for sure
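For reference, a rough sketch of the iterator-based shape being discussed (a hypothetical _process_batch_iter, not part of this PR; it assumes callers could consume per-chunk record batches instead of a single table):

from collections.abc import Iterator
from itertools import islice
from typing import Optional

import pyarrow as pa


def _process_batch_iter(
    rows: Iterator[dict], schema: Optional[pa.Schema] = None, chunk_size: int = 10_000
) -> Iterator[pa.RecordBatch]:
    # Consume the row iterator lazily in fixed-size chunks so the caller
    # never has to materialize the full result as a Python list.
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield pa.RecordBatch.from_pylist(chunk, schema=schema)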
while True:
    rows = cursor.fetchmany(10_000)
    if not rows:
        break
nit: Just an alternative that I think would be cleaner:
Suggested change:

cursor.arraysize = 10_000
for rows in cursor:
Would have to test it, unsure if it works (but I think it should).
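One thing worth checking against the psycopg docs: iterating a cursor yields individual rows rather than fetchmany-sized batches, so keeping the per-batch processing would need a different loop. A compact alternative sketch that preserves the batching (names reused from the snippet above):

from functools import partial

# fetchmany returns an empty list once the cursor is exhausted,
# which acts as the sentinel that stops iter().
for rows in iter(partial(cursor.fetchmany, 10_000), []):
    yield table_from_py_list([dict(zip(column_names, row)) for row in rows])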
A few nits, feel free to ignore. Work looks good!
Problem

Changes
- Moved `sql_database/arrow_helpers.py` and other logic into `pipeline/utils.py`, out of `sql_database/*.py`, to be more central for sources to use
- Added `SourceResponse`, which is the beginning of some work I'm doing to abstract out how sources are built to be more centralised in a single place - this also helps us move away from `dlt` helpers too

Does this work well for both Cloud and self-hosted?
Yes

How did you test this code?
Unit tests are all passing for this new postgres source