
Pipeline: Ingestion pipeline #96

Merged · 181 commits · Jun 21, 2024
Conversation

yassinsws
Contributor

@yassinsws yassinsws commented Apr 7, 2024

The primary goal of introducing a vector database is to enhance Iris's ability to respond accurately based on the content of lecture slides. This PR implements the ingestion pipeline.
How to Test:
Requires an Artemis instructor account.
Requires the Artemis Lecture Ingestion Pipeline branch.

  • Try to delete a course that contains lecture units. Check whether the Pyris side receives the request from the Artemis side and sends proper status updates back.
  • Try to update a lecture unit. Check whether the Pyris side receives the request from the Artemis side and sends proper status updates back to the Artemis side.
    You can verify these steps either by running a debugger or by checking the logs on the Artemis side.

Summary by CodeRabbit

  • New Features

    • Added support for ingesting lecture data, including PDF handling, image interpretation, and data chunking.
    • Introduced citation handling in responses with a dedicated pipeline.
    • Enhanced search functionality with additional filters and course language retrieval.
  • Improvements

    • Refined thread handling and error logging in the Tutor Chat Pipeline.
    • Introduced retry logic with exponential backoff for OpenAI chat completions.
    • Updated prompt templates to use word count instead of character limit.
  • Configuration

    • Added Weaviate settings and defaults for Docker environment configuration.
    • Updated .gitignore for better Docker handling.
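The retry-with-exponential-backoff behavior mentioned above can be sketched in isolation. This is a minimal, generic sketch, not the PR's actual implementation; `retry_with_backoff` and `flaky_completion` are hypothetical names, and the PR's code wraps OpenAI chat completions rather than an arbitrary callable:

```python
import random
import time

def retry_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=30.0):
    """Call func(); on exception, retry with exponentially growing, jittered delays."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: propagate the last error
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay * random.uniform(0.5, 1.0))  # jitter spreads out retries

# Hypothetical flaky call: fails twice, then succeeds on the third attempt.
calls = {"n": 0}
def flaky_completion():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient API error")
    return "ok"

result = retry_with_backoff(flaky_completion, base_delay=0.01)
```

The jitter factor keeps concurrent clients from retrying in lockstep against a rate-limited API.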

MichaelOwenDyer and others added 30 commits February 14, 2024 17:27
Postpone the ingestion methods of the lectures for now until we get the format of the lectures,
First basic implementation of ingest and retrieve methods for the code
… a httpx version >= 0.26, and ollama needs a version >= 0.25.2 and < 0.26,

Finished ingesting and retrieval classes for the lectures.
Added hybrid search instead of normal semantic search.
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 9

Outside diff range and nitpick comments (8)
app/pipeline/prompts/choose_response_prompt.txt (2)

Line range hint 1-11: Review the instructions for selecting response paragraphs. Consider rephrasing "take into consideration" to "consider" for conciseness. Also, correct the grammatical mistake by adding "is" after "as it".

- To understand the full scope of the question, take into consideration the Chat History as it the necessary context for the question.
+ To understand the full scope of the question, consider the Chat History as it is the necessary context for the question.
Tools
LanguageTool

[grammar] ~9-~9: Two determiners in a row. Choose either “a” or “the”. (DT_DT)
Context: ...aphs": [0]}} Do not by any means return a the number of the response that has written...


Line range hint 1-11: Correct the typographical error by removing the extra determiner.

- Do not by any means return a the number of the response that has written programming code in it.
+ Do not by any means return the number of the response that has written programming code in it.

app/config.py (2)

Line range hint 27-27: Replace EnvironmentError with more specific OSError.

- raise EnvironmentError(
+ raise OSError(

Line range hint 33-33: Remove unnecessary open mode parameters.

- with open(file_path, "r") as file:
+ with open(file_path) as file:
app/pipeline/chat/lecture_chat_pipeline.py (1)

Line range hint 151-151: Rename unused loop control variable i to _.

- for i, chunk in enumerate(retrieved_lecture_chunks):
+ for _, chunk in enumerate(retrieved_lecture_chunks):
app/pipeline/chat/tutor_chat_pipeline.py (3)

Line range hint 124-124: Remove unnecessary open mode parameters for file operations.

-    with open(prompt_file_path, "r") as file:
+    with open(prompt_file_path) as file:

Line range hint 310-312: Use format specifiers instead of percent format for better readability and performance.

-    f"Here is the information if the build failed: {build_failed}\n"
-    "These are the build logs for the student's repository:\n%s"
-    ) % "\n".join(str(log) for log in build_logs)
+    f"Here is the information if the build failed: {build_failed}\n"
+    "These are the build logs for the student's repository:\n" +
+    "\n".join(str(log) for log in build_logs)

Line range hint 354-354: Avoid unused loop variables to enhance code clarity.

-    for i, chunk in enumerate(retrieved_lecture_chunks):
+    for _i, chunk in enumerate(retrieved_lecture_chunks):

coderabbitai[bot]
coderabbitai bot previously approved these changes Jun 15, 2024

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Outside diff range and nitpick comments (4)
app/pipeline/chat/tutor_chat_pipeline.py (4)

Line range hint 134-134: Remove unnecessary file open mode parameters.

- with open(prompt_file_path, "r") as file:
+ with open(prompt_file_path) as file:

This change addresses the static analysis hint about unnecessary open mode parameters. By default, files are opened in read mode, so specifying "r" is redundant.


Line range hint 324-326: Use format specifiers for string formatting.

- prompt = (
-    "These are the feedbacks for the student's repository:\n%s"
- ) % "\n---------\n".join(str(log) for log in feedbacks)
+ feedback_text = "\n---------\n".join(str(log) for log in feedbacks)
+ prompt = f"These are the feedbacks for the student's repository:\n{feedback_text}"

This change uses Python's f-string syntax, which is more readable than percent formatting. The joined string is built in a separate variable because backslashes are not allowed inside f-string expressions before Python 3.12.


Line range hint 368-368: Rename the unused loop control variable.

- for i, chunk in enumerate(retrieved_lecture_chunks):
+ for _i, chunk in enumerate(retrieved_lecture_chunks):

Prefixing the unused variable with an underscore (_i) follows Python's convention for indicating that it is intentionally unused.


Line range hint 163-233: Review the method _run_tutor_chat_pipeline for clarity and efficiency.

The method _run_tutor_chat_pipeline is quite long and handles multiple responsibilities. Consider breaking it down into smaller, more focused methods to improve readability and maintainability.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Outside diff range and nitpick comments (1)
app/pipeline/chat/lecture_chat_pipeline.py (1)

Line range hint 151-151: Unused loop control variable i should be renamed to _i to indicate it's intentionally unused.

- for i, chunk in enumerate(retrieved_lecture_chunks):
+ for _i, chunk in enumerate(retrieved_lecture_chunks):


Consider specifying exception types to improve error diagnostics and handling.

- except Exception as e:
+ except (WeaviateException, IOError) as e:

Committable suggestion was skipped due to low confidence.

Comment on lines +156 to +157
with batch_update_lock:
with self.collection.batch.rate_limit(requests_per_minute=600) as batch:

Combine nested with statements into a single statement for better readability.

- with batch_update_lock:
-     with self.collection.batch.rate_limit(requests_per_minute=600) as batch:
+ with batch_update_lock, self.collection.batch.rate_limit(requests_per_minute=600) as batch:
Tools
Ruff

156-157: Use a single with statement with multiple contexts instead of nested with statements (SIM117)

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Outside diff range and nitpick comments (3)
app/pipeline/chat/tutor_chat_pipeline.py (3)

Line range hint 134-134: Remove unnecessary open mode parameters.

The file is opened in read mode by default, so specifying the mode explicitly is redundant. Here's a simplified version:

- with open(prompt_file_path, "r") as file:
+ with open(prompt_file_path) as file:

Line range hint 324-326: Use format specifiers instead of percent format.

Format specifiers provide a more modern and readable way of formatting strings in Python. Here's how you can refactor the code:

- prompt = (
-     "These are the feedbacks for the student's repository:\n%s"
- ) % "\n---------\n".join(str(log) for log in feedbacks)
+ feedback_text = "\n---------\n".join(str(log) for log in feedbacks)
+ prompt = f"These are the feedbacks for the student's repository:\n{feedback_text}"

Line range hint 368-368: Rename unused loop control variable.

The variable i is not used within the loop body. It is a good practice to replace it with _ to indicate that it is intentionally unused:

- for i, chunk in enumerate(retrieved_lecture_chunks):
+ for _, chunk in enumerate(retrieved_lecture_chunks):

Comment on lines +378 to +394
    def should_execute_lecture_pipeline(self, course_id: int) -> bool:
        """
        Checks if the lecture pipeline should be executed
        :param course_id: The course ID
        :return: True if the lecture pipeline should be executed
        """
        if course_id:
            # Fetch the first object that matches the course ID with the language property
            result = self.db.lectures.query.fetch_objects(
                filters=Filter.by_property(LectureSchema.COURSE_ID.value).equal(
                    course_id
                ),
                limit=1,
                return_properties=[LectureSchema.COURSE_NAME.value],
            )
            return len(result.objects) > 0
        return False

Optimize the method should_execute_lecture_pipeline.

The method should_execute_lecture_pipeline can be made more concise by returning the condition directly. Wrapping the truthiness check in bool() preserves the declared -> bool return type (a bare course_id and … would leak course_id's falsy value, e.g. None or 0):

- if course_id:
-     result = self.db.lectures.query.fetch_objects(
-         filters=Filter.by_property(LectureSchema.COURSE_ID.value).equal(course_id),
-         limit=1,
-         return_properties=[LectureSchema.COURSE_NAME.value],
-     )
-     return len(result.objects) > 0
- return False
+ return bool(course_id) and len(self.db.lectures.query.fetch_objects(
+     filters=Filter.by_property(LectureSchema.COURSE_ID.value).equal(course_id),
+     limit=1,
+     return_properties=[LectureSchema.COURSE_NAME.value],
+ ).objects) > 0
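The truthiness pitfall with and-chaining is easy to demonstrate in isolation: a function annotated -> bool that returns `course_id and <condition>` leaks course_id's own falsy value instead of False. A minimal sketch (function names hypothetical, condition reduced to a constant):

```python
def should_execute_bare(course_id):
    # Mirrors `return course_id and <condition>`:
    # when course_id is None or 0, that value is returned, not False.
    return course_id and False

def should_execute_wrapped(course_id):
    # Wrapping in bool() keeps the declared -> bool contract.
    return bool(course_id and False)

bare = should_execute_bare(None)        # None, despite the -> bool annotation
wrapped = should_execute_wrapped(None)  # False
```

Callers comparing the result with `is False` or serializing it would see the difference.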

Comment on lines +91 to +110
    should_execute_lecture_pipeline = self.should_execute_lecture_pipeline(
        dto.course.id
    )
    self.lecture_chat_response = ""
    if should_execute_lecture_pipeline:
        execution_dto = LectureChatPipelineExecutionDTO(
            settings=dto.settings,
            course=dto.course,
            chatHistory=dto.chat_history,
        )
        lecture_chat_thread = threading.Thread(
            target=self._run_lecture_chat_pipeline(execution_dto), args=(dto,)
        )
        lecture_chat_thread.start()

    tutor_chat_thread = threading.Thread(
        target=self._run_tutor_chat_pipeline(dto),
        args=(dto, should_execute_lecture_pipeline),
    )
    tutor_chat_thread.start()

Ensure proper thread management and exception handling in threading logic.

The threading logic within the __call__ method does not handle exceptions or race conditions effectively. Note also that target=self._run_lecture_chat_pipeline(execution_dto) calls the pipeline immediately on the main thread and passes its return value as the thread target; the callable and its arguments should be passed separately. Consider using a thread pool to manage threads and surface worker exceptions, and ensure shared resources are accessed in a thread-safe manner. Here is a proposed change:

- lecture_chat_thread = threading.Thread(
-     target=self._run_lecture_chat_pipeline(execution_dto), args=(dto,)
- )
- lecture_chat_thread.start()
+ with ThreadPoolExecutor() as executor:
+     executor.submit(self._run_lecture_chat_pipeline, execution_dto, dto)
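The ThreadPoolExecutor pattern suggested above can be sketched standalone. This is a generic sketch, not the PR's code; `run_lecture_pipeline` is a hypothetical stand-in for `_run_lecture_chat_pipeline`, and the key point is that `future.result()` re-raises exceptions from the worker thread, which a bare `threading.Thread` silently swallows:

```python
from concurrent.futures import ThreadPoolExecutor

def run_lecture_pipeline(dto):
    """Hypothetical stand-in for _run_lecture_chat_pipeline."""
    if dto is None:
        raise ValueError("missing execution DTO")
    return f"lecture response for {dto}"

results, errors = [], []
# Leaving the `with` block joins all workers before continuing.
with ThreadPoolExecutor(max_workers=2) as executor:
    futures = [
        executor.submit(run_lecture_pipeline, dto) for dto in ("course-1", None)
    ]

for future in futures:
    try:
        results.append(future.result())  # re-raises any worker exception here
    except ValueError as exc:
        errors.append(exc)
```

Collecting futures and calling result() gives the caller one place to log or propagate pipeline failures instead of losing them inside a detached thread.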

@bassner bassner merged commit 83cd32a into main Jun 21, 2024
5 checks passed
@bassner bassner deleted the feature/Ingestion_pipeline branch June 21, 2024 10:02
isabellagessl pushed a commit that referenced this pull request Nov 11, 2024
This was referenced Jan 6, 2025
@coderabbitai coderabbitai bot mentioned this pull request Jan 18, 2025