examples: Make RAG examples a bit more generic and demoable #3085

antoniivanov · 2024-02-06T11:23:21Z

For confluence-reader:

- The recursive method for finding pages was crashing, so it was replaced with more CQL and the ancestor field: https://developer.atlassian.com/server/confluence/cql-field-reference/#ancestor
- Added passing a parent ID so we can take only a few pages, but a consistent set, for demo purposes.
- Noted bugs and issues in the code and added TODOs.
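
A minimal sketch of how a CQL query with the `ancestor` field can stand in for a recursive page walk. This is not the code from this PR: the function name `build_page_cql`, the space key, and the page ID are hypothetical, and the space key is assumed to contain no double quotes.

```python
from typing import Optional


def build_page_cql(space_key: str, parent_page_id: Optional[str] = None) -> str:
    """Build a CQL query for pages in a space, optionally under one parent page."""
    clauses = ["type = page", f'space = "{space_key}"']
    if parent_page_id is not None:
        # `ancestor` matches descendants of the given page at any depth,
        # so no client-side recursion over child pages is needed.
        clauses.append(f"ancestor = {parent_page_id}")
    return " and ".join(clauses)


# Hypothetical values, for illustration only.
print(build_page_cql("DEMO", "123456"))
```

Passing a parent ID narrows the result to one subtree, which keeps the demo data set small and repeatable.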

For embed-ingest-job-example:

- Parameterized the table names used in the job.
- Added clean-up of deleted rows (though afterwards I realised it's redundant for now, since we need to drop the table first because the Postgres ingestion does not support upserts (updates)).
- The embedding job is written in such a generic way that there's actually no need to tie it to Confluence at all. It would work for any dataset.
- Added multiple TODOs for missing features. The job could be generalized even further if our ingestion framework improves.
- Renamed embed-ingest-job-example to pgvector-embedder to better show its responsibilities.
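
A rough sketch of what parameterized table names, the drop-first workaround, and the deleted-rows clean-up could look like as plain SQL strings. None of this is taken from the job itself: the helper names, the table names, and the `id` column are assumptions made for illustration.

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def quoted(name: str) -> str:
    """Validate a configurable table name and return it double-quoted for SQL."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"unsafe table name: {name!r}")
    return f'"{name}"'


def drop_table_sql(table: str) -> str:
    # Without upsert support, the destination is dropped and fully re-ingested.
    return f"DROP TABLE IF EXISTS {quoted(table)}"


def delete_removed_rows_sql(destination_table: str, source_table: str) -> str:
    # Removes destination rows whose (hypothetical) `id` no longer exists in the
    # source. Redundant while the destination table is dropped before every run.
    return (
        f"DELETE FROM {quoted(destination_table)} "
        f"WHERE id NOT IN (SELECT id FROM {quoted(source_table)})"
    )


# Hypothetical table names, for illustration only.
print(drop_table_sql("doc_embeddings"))
print(delete_removed_rows_sql("doc_embeddings", "documents"))
```

Table names cannot be passed as ordinary query parameters, which is why the sketch validates them before interpolating. Note that double-quoted identifiers are case-sensitive in Postgres.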

duyguHsnHsn

Looks good to me! Should we create tickets for some of the TODOs?

yonitoo

Awesome generalization. LGTM.
A bit of a side note, but I recently read this SentenceTransformer issue suggesting that stop-word removal and lemmatization (part of our text cleaning) are not needed for transformer models; passing the original text is suggested. Maybe we can remove the whole cleaning part and embed directly (could be done here or as part of another story).

antoniivanov · 2024-02-06T13:01:01Z

Looks good to me! Should we create tickets for some of the TODOs?

Not yet. As we build a backlog for the next milestones, we will create the needed tickets.

antoniivanov · 2024-02-06T13:02:12Z

Awesome generalization. LGTM. A bit of a side note, but I recently read this SentenceTransformer issue suggesting that stop-word removal and lemmatization (part of our text cleaning) are not needed for transformer models; passing the original text is suggested. Maybe we can remove the whole cleaning part and embed directly (could be done here or as part of another story).

Ok. Maybe separately. But I also want to have some cleaning logic, because it's something you would expect to have in a pipeline and we need to figure out how to handle it properly.
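
One possible way to reconcile both points is to keep cleaning as an optional, pluggable step, so the original text is passed through unless cleaners are configured. This is only a sketch: `prepare_texts` and `collapse_whitespace` are hypothetical names, not functions from the job, and the embedding call itself is left out.

```python
import re
from typing import Callable, Iterable, List

Cleaner = Callable[[str], str]


def collapse_whitespace(text: str) -> str:
    """A light cleaner: squeeze runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def prepare_texts(texts: Iterable[str], cleaners: Iterable[Cleaner] = ()) -> List[str]:
    """Apply each configured cleaner in order; with none, return texts unchanged."""
    cleaners = list(cleaners)
    prepared = []
    for text in texts:
        for clean in cleaners:
            text = clean(text)
        prepared.append(text)
    return prepared


docs = ["  The cats   are\nrunning "]
raw = prepare_texts(docs)                            # original text, untouched
light = prepare_texts(docs, [collapse_whitespace])   # minimal cleaning only
```

The prepared strings would then be handed to the embedding model, so heavier steps such as stop-word removal or lemmatization can be added or dropped without touching the rest of the pipeline.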

vmwclabot added the cla-not-required label Feb 6, 2024

github-actions bot added the title needs formatting label Feb 6, 2024

duyguHsnHsn approved these changes Feb 6, 2024

yonitoo approved these changes Feb 6, 2024

antoniivanov added 4 commits February 7, 2024 02:35

examples: confluence-reader: fixes to be more demoable

d7436bd

- The recursive method for finding pages was crashing, so it was replaced with more CQL.
- Added passing a parent ID so we can take only a few pages for demo purposes.
- Noted bugs and issues in the code and added TODOs.

renamed embed-ingest-job-example to pgvector-embedder

182771c

a few more fixes

cceb5da

yonitoo mentioned this pull request Feb 7, 2024

Refine the text cleaning before embedding the documents in the RAG pipeline #3089

Closed

antoniivanov force-pushed the person/aivanov/rag branch from 6fbae4c to cceb5da Compare February 8, 2024 10:02

antoniivanov changed the title ~~Make RAG examples a bit more generic and demoable~~ examples: Make RAG examples a bit more generic and demoable Feb 8, 2024

github-actions bot removed the title needs formatting label Feb 8, 2024

antoniivanov merged commit 62961c7 into main Feb 8, 2024
8 of 10 checks passed

antoniivanov deleted the person/aivanov/rag branch February 8, 2024 13:22
