Enable pgvector support for Postgres provider #34891

sunank200 · 2023-10-12T13:33:38Z

This PR is part of our larger effort, presented at the Airflow Summit, to add first-class integrations that support LLMOps. It specifically adds pgvector support to the Postgres provider. pgvector is a widely used open-source vector similarity search extension for Postgres. In this iteration, we are integrating with their Embeddings Model.

The primary objective of this Provider is to present users with an alternative embedding model. This allows them to generate vectors for their proprietary data, a pivotal step towards establishing integrations with LLM models like ChatGPT.

Example DAG:
The PgVectorIngestOperator can accept either a list of strings or a callable returning a list of strings.
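
The "list or callable" input described above can be sketched roughly as follows. This is only an illustration of how such an argument pair might be resolved, not the operator's actual code: the helper name `resolve_input_data` and the `input_data` parameter name are assumptions, while `input_callable`, `input_callable_args`, and `input_callable_kwargs` mirror the argument names shown in the review diff.

```python
from __future__ import annotations

from typing import Any, Callable, Mapping, Sequence


def resolve_input_data(
    input_data: Sequence[str] | None = None,
    input_callable: Callable[..., Sequence[str]] | None = None,
    input_callable_args: Sequence[Any] | None = None,
    input_callable_kwargs: Mapping[str, Any] | None = None,
) -> list[str]:
    """Return the strings to embed, from a literal list or from a callable.

    Hypothetical helper: exactly one of ``input_data`` / ``input_callable``
    must be given.
    """
    if (input_data is None) == (input_callable is None):
        raise ValueError("Provide exactly one of input_data or input_callable")

    if input_callable is not None:
        # Run the user's fetching code inside the task instead of reading XCom.
        data = input_callable(*(input_callable_args or ()), **(input_callable_kwargs or {}))
    else:
        data = input_data

    # A bare string is iterable, so reject it explicitly before converting.
    if isinstance(data, str):
        raise TypeError("Expected a list of strings, got a single string")
    result = list(data)
    if not all(isinstance(item, str) for item in result):
        raise TypeError("All input items must be strings")
    return result
```

Either form yields the same kind of value, so the rest of the task does not need to know which one the user picked.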


airflow/providers/postgres/provider.yaml

airflow/providers/postgres/operators/postgres.py

airflow/providers/postgres/hooks/postgres.py

Taragolis · 2023-10-17T23:08:00Z

airflow/providers/postgres/operators/postgres.py

+        input_callable: Callable[[Any], Any] | None = None,
+        input_callable_args: Collection[Any] | None = None,
+        input_callable_kwargs: Mapping[str, Any] | None = None,


I want to raise the same question I asked in #34921 (comment).

What benefit do these arguments provide? To me, they could be replaced by either of the following:

Make the input data a templated field and provide it through XCom from an upstream task.

A combination of TaskFlow (or PythonOperator) + PostgresHook:

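    from airflow.decorators import task
    from airflow.providers.postgres.hooks.postgres import PostgresHook

    @task
    def awesome_task(conn_id: str):
        hook = PostgresHook(postgres_conn_id=conn_id)
        ...
        # ingest_embedding is the hook method proposed in this PR
        hook.ingest_embedding(
            table="foo",
            input_data=...,
            vector_size=42,
        )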

One of the approaches we would like to support within the operator is getting data without using XComs. If I am right, there is no automatic cleanup of XComs, so users might not want to populate the database with a large number of XCom entries. Hence, we thought of supporting the callable approach within the operator.

cc: @jlaneve

yes, purpose here is to give a workaround to storing things in XComs. depending on how much data you're working with, passing a large number of vectors through XComs can be unideal (especially if you don't have a custom XCom backend). instead, giving the user the ability to execute the same data fetching code within the task means we don't pollute XComs.

this has disadvantages though, particularly around retries. hence why there are two different input methods to let the user decide which is right for them: (1) from XComs and (2) with a callable

> (2) with a callable

That is exactly what PythonOperator (and the TaskFlow decorators) already does, while also providing greater flexibility, e.g. access to the task context.

On the other hand, the current implementation is a combination of (1) a classic operator and (2) a restricted PythonOperator. The second part could be replaced by adding examples to the docs showing how to use the hook within TaskFlow.
That would reduce the complexity of the code and the number of required tests.
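
A rough sketch of what such a docs example might boil down to: the task body fetches its data and calls the hook directly, so nothing large passes through XCom. The function `fetch_and_ingest` and its parameters are hypothetical names used only for illustration, and `ingest_embedding` is the hook method proposed in this PR, called here with the keyword arguments shown earlier in this thread. The hook is passed in as an argument so the sketch runs without an Airflow installation. In a DAG, this would be the body of an `@task` function that first builds a `PostgresHook`.

```python
from typing import Any, Callable, Sequence


def fetch_and_ingest(
    hook: Any,
    fetch: Callable[[], Sequence[str]],
    table: str,
    vector_size: int,
) -> int:
    """Fetch data and ingest it in the same task; return only a small count.

    Hypothetical helper: ``hook`` is any object exposing the PR's proposed
    ``ingest_embedding(table=..., input_data=..., vector_size=...)`` method.
    """
    data = list(fetch())  # data is produced inside the task, not pulled from XCom
    hook.ingest_embedding(table=table, input_data=data, vector_size=vector_size)
    # Only this integer would be pushed to XCom as the task's return value.
    return len(data)
```

Because the fetch runs inside the task, a retry re-runs the fetch as well, which is the trade-off mentioned earlier in the thread.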

Why not just have an @task.pgvector_import() decorator which provides a cleaner UX for instantiating the hook and passing params? Conceptually like https://github.com/astronomer/ask-astro/blob/00cfbd7a48aafe5603b1ef49342f1dd68c148156/airflow/dags/ingestion/ask-astro-load-blogs.py#L167

airflow/providers/postgres/hooks/postgres.py

Co-authored-by: Tzu-ping Chung <[email protected]>

Taragolis · 2023-10-19T10:27:22Z

airflow/providers/postgres/hooks/postgres.py

+        from pgvector.psycopg import register_vector
+        from psycopg2 import sql
+
+        self.conn.execute("CREATE EXTENSION IF NOT EXISTS vector")


Suggested change

-        self.conn.execute("CREATE EXTENSION IF NOT EXISTS vector")

AFAIK, before Postgres 13, CREATE EXTENSION required superuser permissions. In PG 13+, a trusted extension can be installed by someone with the appropriate CREATE privilege; however, that is not the case for pgvector.

So let's leave creating extensions to DBAs or to someone with the appropriate permissions. We should not force users to use a superuser in their databases.
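
One way to act on this, sketched under assumptions: instead of running CREATE EXTENSION, the hook could check that the extension is already installed and fail with a clear message if it is not. The function name `ensure_vector_extension` is hypothetical and is not part of the PR. The query relies only on the standard `pg_extension` catalog and a DB-API cursor using psycopg2-style `%s` placeholders.

```python
from typing import Any


def ensure_vector_extension(cursor: Any) -> None:
    """Raise if the pgvector extension is missing; never try to create it.

    Hypothetical helper: ``cursor`` is a DB-API cursor (e.g. from psycopg2).
    Reading pg_extension does not require superuser permissions.
    """
    cursor.execute("SELECT 1 FROM pg_extension WHERE extname = %s", ("vector",))
    if cursor.fetchone() is None:
        raise RuntimeError(
            "The 'vector' extension is not installed in this database. "
            "Ask a DBA to run: CREATE EXTENSION vector"
        )
```

This keeps the privileged step with whoever administers the database, and the task only needs ordinary read access to confirm the prerequisite.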

pankajkoti · 2023-11-03T10:15:06Z

Closing this PR in favor of a separate provider PR: #35399

pankajastro added 3 commits October 6, 2023 02:16

Add method to insert vector in postgress

a0a0d31

Add method to insert vector in postgress

83561ac

Add operator

bfbbaf9

boring-cyborg bot added area:providers provider:postgres labels Oct 12, 2023

eladkal reviewed Oct 12, 2023

airflow/providers/postgres/provider.yaml Outdated

kaxil reviewed Oct 12, 2023

airflow/providers/postgres/operators/postgres.py

Taragolis reviewed Oct 12, 2023

airflow/providers/postgres/hooks/postgres.py Outdated

sunank200 force-pushed the enable_pgvector branch from 4601c62 to 90dfdba Compare October 12, 2023 16:29

Add the doc-strings, extra dependencies and fix sql injections

0af9e17

sunank200 force-pushed the enable_pgvector branch from 90dfdba to 0af9e17 Compare October 13, 2023 05:25

uranusjr reviewed Oct 13, 2023

airflow/providers/postgres/hooks/postgres.py Outdated

Taragolis reviewed Oct 13, 2023

airflow/providers/postgres/operators/postgres.py Outdated

pankajkoti added 2 commits October 17, 2023 15:25

Address review comments

d9bf6cd

Merge branch 'main' into enable_pgvector

a5181ea

pankajkoti reviewed Oct 17, 2023

airflow/providers/postgres/hooks/postgres.py Outdated

Update airflow/providers/postgres/hooks/postgres.py

9504734

Taragolis reviewed Oct 17, 2023

pankajkoti mentioned this pull request Oct 18, 2023

Add Cohere Provider #34921

Merged

uranusjr reviewed Oct 18, 2023

airflow/providers/postgres/hooks/postgres.py Outdated

pankajkoti and others added 2 commits October 18, 2023 10:54

Update airflow/providers/postgres/hooks/postgres.py

5ca4553

Co-authored-by: Tzu-ping Chung <[email protected]>

Merge branch 'main' into enable_pgvector

224e198

mpgreg mentioned this pull request Oct 19, 2023

Add OpenAI Provider #35023

Merged

Taragolis reviewed Oct 19, 2023

pankajkoti closed this Nov 3, 2023

Taragolis mentioned this pull request Nov 6, 2023

Add pgvector provider implementation #35399

Merged

sunank200 commented Oct 12, 2023 •

edited by phanikumv

Taragolis Oct 17, 2023

pankajkoti Oct 18, 2023

jlaneve Oct 18, 2023

Taragolis Oct 18, 2023

mpgreg Oct 19, 2023

Taragolis Oct 19, 2023

pankajkoti commented Nov 3, 2023
