examples: Make RAG examples a bit more generic and demoable #3085
Looks good to me! Should we create tickets for some of the TODOs?
Awesome generalization. LGTM.
A bit of a side note, but I recently read this SentenceTransformer issue suggesting that stop-word removal and lemmatization (part of our text cleaning) are not needed for transformer models; passing the original text is recommended instead. Maybe we can remove the whole cleaning step and embed the text directly (this could be done here or as part of another story).
Not yet. We will create the needed tickets as we build a backlog for the next milestones.
Ok. Maybe separately. But I also want to keep some cleaning logic, because it's something you would expect to have in a pipeline, and we need to figure out how to handle it properly.
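One way to keep both options open is to make the cleaning step pluggable, so raw text is embedded by default and cleaning can be switched on where it helps. This is only a rough sketch: `prepare_texts` and `basic_clean` are hypothetical names, not functions from this PR, and the toy cleaner stands in for the real stop-word removal and lemmatization.

```python
from typing import Callable, Iterable, List, Optional


def basic_clean(text: str) -> str:
    """Toy cleaner (hypothetical): lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def prepare_texts(
    texts: Iterable[str],
    cleaner: Optional[Callable[[str], str]] = None,
) -> List[str]:
    """Return the texts to embed.

    With no cleaner the original text is passed through untouched,
    which is what the linked issue suggests for transformer models.
    """
    if cleaner is None:
        return list(texts)
    return [cleaner(text) for text in texts]
```

With this shape the embedding step does not need to know whether cleaning ran; it just receives a list of strings.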
- the recursive method for finding pages was crashing, so it was replaced with more CQL
- added passing a parent id so we can take only a few pages for demo purposes
- noted bugs and issues in the code and added TODOs
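For reviewers unfamiliar with CQL, the idea of limiting the crawl to the descendants of one parent page can be sketched roughly as below. The helper name `build_page_cql` and its parameters are assumptions for illustration, not the code in this PR; only the `space`, `type` and `ancestor` CQL fields come from the Atlassian field reference.

```python
from typing import Optional


def build_page_cql(space_key: str, parent_page_id: Optional[int] = None) -> str:
    """Build a CQL query for pages in a space (hypothetical helper).

    When parent_page_id is given, the `ancestor` field restricts the
    result to pages below that page, so no recursive walk is needed.
    """
    if '"' in space_key:
        # Keep the sketch simple: refuse keys that would break the quoting.
        raise ValueError("space key must not contain double quotes")
    clauses = [f'space = "{space_key}"', "type = page"]
    if parent_page_id is not None:
        clauses.append(f"ancestor = {int(parent_page_id)}")
    return " AND ".join(clauses)
```

Passing a parent id keeps the demo small, since only the pages under that one page are returned.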
- parameterize the table names used in the job
- add clean-up of deleted rows (though I realised it's redundant for now, as we need to drop the table first because the postgres ingestion does not support upserts (updates))
- the embedding job is written in such a generic way that there's actually no need to tie it to confluence at all. It would work for any dataset.
- added multiple TODOs for missing features. The job could be generalized even further if our ingestion framework improves
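The clean-up of deleted rows with parameterized table names could look something like the sketch below. It uses the standard-library `sqlite3` module purely so the example runs anywhere; the actual job targets postgres, and `delete_orphaned_rows`, the table names and the `id` column are all assumed names rather than what the PR implements.

```python
import re
import sqlite3

# Table and column names cannot be bound as SQL parameters,
# so they are validated before being formatted into the statement.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _checked(name: str) -> str:
    if not _IDENTIFIER.match(name):
        raise ValueError(f"invalid SQL identifier: {name!r}")
    return name


def delete_orphaned_rows(
    conn: sqlite3.Connection,
    source_table: str,
    target_table: str,
    id_column: str = "id",
) -> int:
    """Delete target rows whose id no longer exists in the source table.

    Returns the number of rows removed.
    """
    src, tgt, col = _checked(source_table), _checked(target_table), _checked(id_column)
    cursor = conn.execute(
        f"DELETE FROM {tgt} "
        f"WHERE NOT EXISTS (SELECT 1 FROM {src} WHERE {src}.{col} = {tgt}.{col})"
    )
    conn.commit()
    return cursor.rowcount
```

As noted above, this step only matters once the ingestion supports upserts; while the table is dropped and rebuilt on each run, no stale rows can survive.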
6fbae4c to cceb5da
For confluence-reader:
CQL and https://developer.atlassian.com/server/confluence/cql-field-reference/#ancestor
purposes
For embed-ingest-job-example:
we need to drop the table first as the postgres ingestion does not
support upserts (updates))
need to tie it to confluence at all. It would work for any dataset.
further generalized if our ingestion framework improves