
Pipeline: Ingestion pipeline #96

Merged · 181 commits · Jun 21, 2024
Conversation

yassinsws
Contributor

@yassinsws yassinsws commented Apr 7, 2024

The primary goal of introducing a vector database is to enhance Iris's ability to respond accurately based on the content of lecture slides. This PR implements the ingestion pipeline.
How to Test:
Requires an Artemis instructor account.
Requires the Artemis Lecture Ingestion Pipeline branch.

  • Try to delete a course that contains lecture units. Check whether the Pyris side receives the request from the Artemis side and sends proper status updates back.
  • Try to update a lecture unit. Check whether the Pyris side receives the request from the Artemis side and sends proper status updates back to the Artemis side.
    You can verify these steps either by running a debugger or by checking the logs on the Artemis side.

Summary by CodeRabbit

  • New Features

    • Added support for ingesting lecture data, including PDF handling, image interpretation, and data chunking.
    • Introduced citation handling in responses with a dedicated pipeline.
    • Enhanced search functionality with additional filters and course language retrieval.
  • Improvements

    • Refined thread handling and error logging in the Tutor Chat Pipeline.
    • Introduced retry logic with exponential backoff for OpenAI chat completions.
    • Updated prompt templates to use word count instead of character limit.
  • Configuration

    • Added Weaviate settings and defaults for Docker environment configuration.
    • Updated .gitignore for better Docker handling.
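The retry-with-exponential-backoff behavior mentioned above can be sketched in isolation. This is a minimal, generic sketch, not the PR's actual implementation; `retry_with_backoff` and `flaky_completion` are hypothetical names, and the PR's code wraps OpenAI chat completions rather than an arbitrary callable:

```python
import random
import time

def retry_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=30.0):
    """Call func(); on exception, retry with exponentially growing, jittered delays."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: propagate the last error
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay * random.uniform(0.5, 1.0))  # jitter spreads out retries

# Hypothetical flaky call: fails twice, then succeeds on the third attempt.
calls = {"n": 0}
def flaky_completion():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient API error")
    return "ok"

result = retry_with_backoff(flaky_completion, base_delay=0.01)
```

The jitter factor keeps concurrent clients from retrying in lockstep against a rate-limited API.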

MichaelOwenDyer and others added 30 commits February 14, 2024 17:27
Postpone the ingestion methods of the lectures for now until we get the format of the lectures,
First basic implementation of ingest and retrieve methods for the code
… a httpx version >= 0.26, and ollama needs a version >= 0.25.2 and < 0.26,

Finished ingesting and retrieval classes for the lectures.
Added hybrid search instead of normal semantic search.
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 9

Outside diff range and nitpick comments (8)
app/pipeline/prompts/choose_response_prompt.txt (2)

Line range hint 1-11: Review the instructions for selecting response paragraphs. Consider rephrasing "take into consideration" to "consider" for conciseness. Also, correct the grammatical mistake by adding "is" after "as it".

- To understand the full scope of the question, take into consideration the Chat History as it the necessary context for the question.
+ To understand the full scope of the question, consider the Chat History as it is the necessary context for the question.
Tools
LanguageTool

[grammar] ~9-~9: Two determiners in a row. Choose either “a” or “the”. (DT_DT)
Context: ...aphs": [0]}} Do not by any means return a the number of the response that has written...


Line range hint 1-11: Correct the typographical error by removing the extra determiner.

- Do not by any means return a the number of the response that has written programming code in it.
+ Do not by any means return the number of the response that has written programming code in it.

app/config.py (2)

Line range hint 27-27: Replace EnvironmentError with more specific OSError.

- raise EnvironmentError(
+ raise OSError(

Line range hint 33-33: Remove unnecessary open mode parameters.

- with open(file_path, "r") as file:
+ with open(file_path) as file:
app/pipeline/chat/lecture_chat_pipeline.py (1)

Line range hint 151-151: Rename unused loop control variable i to _.

- for i, chunk in enumerate(retrieved_lecture_chunks):
+ for _, chunk in enumerate(retrieved_lecture_chunks):
app/pipeline/chat/tutor_chat_pipeline.py (3)

Line range hint 124-124: Remove unnecessary open mode parameters for file operations.

-    with open(prompt_file_path, "r") as file:
+    with open(prompt_file_path) as file:

Line range hint 310-312: Use format specifiers instead of percent format for better readability and performance.

-    f"Here is the information if the build failed: {build_failed}\n"
-    "These are the build logs for the student's repository:\n%s"
-    ) % "\n".join(str(log) for log in build_logs)
+    f"Here is the information if the build failed: {build_failed}\n"
+    "These are the build logs for the student's repository:\n" +
+    "\n".join(str(log) for log in build_logs)

Line range hint 354-354: Avoid unused loop variables to enhance code clarity.

-    for i, chunk in enumerate(retrieved_lecture_chunks):
+    for _i, chunk in enumerate(retrieved_lecture_chunks):

coderabbitai[bot]
coderabbitai bot previously approved these changes Jun 15, 2024

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Outside diff range and nitpick comments (4)
app/pipeline/chat/tutor_chat_pipeline.py (4)

Line range hint 134-134: Remove unnecessary file open mode parameters.

- with open(prompt_file_path, "r") as file:
+ with open(prompt_file_path) as file:

This change addresses the static analysis hint about unnecessary open mode parameters. By default, files are opened in read mode, so specifying "r" is redundant.


Line range hint 324-326: Use format specifiers for string formatting.

- prompt = (
-    "These are the feedbacks for the student's repository:\n%s"
- ) % "\n---------\n".join(str(log) for log in feedbacks)
+ feedback_text = "\n---------\n".join(str(log) for log in feedbacks)
+ prompt = f"These are the feedbacks for the student's repository:\n{feedback_text}"

This change uses Python's f-string syntax, which is more readable than percent formatting. The joined string is built in a separate variable because backslashes are not allowed inside f-string expressions before Python 3.12.


Line range hint 368-368: Rename the unused loop control variable.

- for i, chunk in enumerate(retrieved_lecture_chunks):
+ for _i, chunk in enumerate(retrieved_lecture_chunks):

Prefixing the unused variable with an underscore (_i) follows Python's convention for indicating that it is intentionally unused.


Line range hint 163-233: Review the method _run_tutor_chat_pipeline for clarity and efficiency.

The method _run_tutor_chat_pipeline is quite long and handles multiple responsibilities. Consider breaking it down into smaller, more focused methods to improve readability and maintainability.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Outside diff range and nitpick comments (1)
app/pipeline/chat/lecture_chat_pipeline.py (1)

Line range hint 151-151: Unused loop control variable i should be renamed to _i to indicate it's intentionally unused.

- for i, chunk in enumerate(retrieved_lecture_chunks):
+ for _i, chunk in enumerate(retrieved_lecture_chunks):


Consider specifying exception types to improve error diagnostics and handling.

- except Exception as e:
+ except (WeaviateException, IOError) as e:

Committable suggestion was skipped due to low confidence.

Comment on lines +156 to +157
with batch_update_lock:
with self.collection.batch.rate_limit(requests_per_minute=600) as batch:

Combine nested with statements into a single statement for better readability.

- with batch_update_lock:
-     with self.collection.batch.rate_limit(requests_per_minute=600) as batch:
+ with batch_update_lock, self.collection.batch.rate_limit(requests_per_minute=600) as batch:
Tools
Ruff

156-157: Use a single with statement with multiple contexts instead of nested with statements (SIM117)

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Outside diff range and nitpick comments (3)
app/pipeline/chat/tutor_chat_pipeline.py (3)

Line range hint 134-134: Remove unnecessary open mode parameters.

The file is opened in read mode by default, so specifying the mode explicitly is redundant. Here's a simplified version:

- with open(prompt_file_path, "r") as file:
+ with open(prompt_file_path) as file:

Line range hint 324-326: Use format specifiers instead of percent format.

Format specifiers provide a more modern and readable way of formatting strings in Python. Here's how you can refactor the code:

- prompt = (
-     "These are the feedbacks for the student's repository:\n%s"
- ) % "\n---------\n".join(str(log) for log in feedbacks)
+ feedback_text = "\n---------\n".join(str(log) for log in feedbacks)
+ prompt = f"These are the feedbacks for the student's repository:\n{feedback_text}"

Line range hint 368-368: Rename unused loop control variable.

The variable i is not used within the loop body. It is a good practice to replace it with _ to indicate that it is intentionally unused:

- for i, chunk in enumerate(retrieved_lecture_chunks):
+ for _, chunk in enumerate(retrieved_lecture_chunks):

Comment on lines +378 to +394
    def should_execute_lecture_pipeline(self, course_id: int) -> bool:
        """
        Checks if the lecture pipeline should be executed
        :param course_id: The course ID
        :return: True if the lecture pipeline should be executed
        """
        if course_id:
            # Fetch the first object that matches the course ID with the language property
            result = self.db.lectures.query.fetch_objects(
                filters=Filter.by_property(LectureSchema.COURSE_ID.value).equal(
                    course_id
                ),
                limit=1,
                return_properties=[LectureSchema.COURSE_NAME.value],
            )
            return len(result.objects) > 0
        return False

Optimize the method should_execute_lecture_pipeline.

The method should_execute_lecture_pipeline can be made more concise by returning the condition directly. Wrapping the truthiness check in bool() preserves the declared -> bool return type (a bare course_id and … would leak course_id's falsy value, e.g. None or 0):

- if course_id:
-     result = self.db.lectures.query.fetch_objects(
-         filters=Filter.by_property(LectureSchema.COURSE_ID.value).equal(course_id),
-         limit=1,
-         return_properties=[LectureSchema.COURSE_NAME.value],
-     )
-     return len(result.objects) > 0
- return False
+ return bool(course_id) and len(self.db.lectures.query.fetch_objects(
+     filters=Filter.by_property(LectureSchema.COURSE_ID.value).equal(course_id),
+     limit=1,
+     return_properties=[LectureSchema.COURSE_NAME.value],
+ ).objects) > 0
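The truthiness pitfall with and-chaining is easy to demonstrate in isolation: a function annotated -> bool that returns `course_id and <condition>` leaks course_id's own falsy value instead of False. A minimal sketch (function names hypothetical, condition reduced to a constant):

```python
def should_execute_bare(course_id):
    # Mirrors `return course_id and <condition>`:
    # when course_id is None or 0, that value is returned, not False.
    return course_id and False

def should_execute_wrapped(course_id):
    # Wrapping in bool() keeps the declared -> bool contract.
    return bool(course_id and False)

bare = should_execute_bare(None)        # None, despite the -> bool annotation
wrapped = should_execute_wrapped(None)  # False
```

Callers comparing the result with `is False` or serializing it would see the difference.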

Comment on lines +91 to +110
    should_execute_lecture_pipeline = self.should_execute_lecture_pipeline(
        dto.course.id
    )
    self.lecture_chat_response = ""
    if should_execute_lecture_pipeline:
        execution_dto = LectureChatPipelineExecutionDTO(
            settings=dto.settings,
            course=dto.course,
            chatHistory=dto.chat_history,
        )
        lecture_chat_thread = threading.Thread(
            target=self._run_lecture_chat_pipeline(execution_dto), args=(dto,)
        )
        lecture_chat_thread.start()

    tutor_chat_thread = threading.Thread(
        target=self._run_tutor_chat_pipeline(dto),
        args=(dto, should_execute_lecture_pipeline),
    )
    tutor_chat_thread.start()

Ensure proper thread management and exception handling in threading logic.

The threading logic within the __call__ method does not handle exceptions or race conditions effectively. Note also that target=self._run_lecture_chat_pipeline(execution_dto) calls the pipeline immediately on the main thread and passes its return value as the thread target; the callable and its arguments should be passed separately. Consider using a thread pool to manage threads and surface worker exceptions, and ensure shared resources are accessed in a thread-safe manner. Here is a proposed change:

- lecture_chat_thread = threading.Thread(
-     target=self._run_lecture_chat_pipeline(execution_dto), args=(dto,)
- )
- lecture_chat_thread.start()
+ with ThreadPoolExecutor() as executor:
+     executor.submit(self._run_lecture_chat_pipeline, execution_dto, dto)
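The ThreadPoolExecutor pattern suggested above can be sketched standalone. This is a generic sketch, not the PR's code; `run_lecture_pipeline` is a hypothetical stand-in for `_run_lecture_chat_pipeline`, and the key point is that `future.result()` re-raises exceptions from the worker thread, which a bare `threading.Thread` silently swallows:

```python
from concurrent.futures import ThreadPoolExecutor

def run_lecture_pipeline(dto):
    """Hypothetical stand-in for _run_lecture_chat_pipeline."""
    if dto is None:
        raise ValueError("missing execution DTO")
    return f"lecture response for {dto}"

results, errors = [], []
# Leaving the `with` block joins all workers before continuing.
with ThreadPoolExecutor(max_workers=2) as executor:
    futures = [
        executor.submit(run_lecture_pipeline, dto) for dto in ("course-1", None)
    ]

for future in futures:
    try:
        results.append(future.result())  # re-raises any worker exception here
    except ValueError as exc:
        errors.append(exc)
```

Collecting futures and calling result() gives the caller one place to log or propagate pipeline failures instead of losing them inside a detached thread.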

@bassner bassner merged commit 83cd32a into main Jun 21, 2024
5 checks passed
@bassner bassner deleted the feature/Ingestion_pipeline branch June 21, 2024 10:02
isabellagessl pushed a commit that referenced this pull request Nov 11, 2024
This was referenced Jan 6, 2025
@coderabbitai coderabbitai bot mentioned this pull request Jan 18, 2025