-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pipeline
: Ingestion pipeline
#96
Conversation
Postpone the ingestion methods of the lectures for now until we get the format of the letures, first basic implementation of ingest and retrieve methods for the code
… a httpx version >= 0.26 and ollama needs a version >= 0.25.2 and 0.26<, Finished ingesting and retrieval classes for the lectures. Added hybrid search instead of normal semantic search.
fixed openai dalle class
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 9
Outside diff range and nitpick comments (8)
app/pipeline/prompts/choose_response_prompt.txt (2)
Line range hint
1-11
: Review the instructions for selecting response paragraphs. Consider rephrasing "take into consideration" to "consider" for conciseness. Also, correct the grammatical mistake by adding "is" after "as it".- To understand the full scope of the question, take into consideration the Chat History as it the necessary context for the question. + To understand the full scope of the question, consider the Chat History as it is the necessary context for the question.Tools
LanguageTool
[grammar] ~9-~9: Two determiners in a row. Choose either “a” or “the”. (DT_DT)
Context: ...aphs": [0]}} Do not by any means return a the number of the response that has written...
Line range hint
1-11
: Correct the typographical error by removing the extra determiner.- Do not by any means return a the number of the response that has written programming code in it. + Do not by any means return the number of the response that has written programming code in it.Tools
LanguageTool
[grammar] ~9-~9: Two determiners in a row. Choose either “a” or “the”. (DT_DT)
Context: ...aphs": [0]}} Do not by any means return a the number of the response that has written...app/config.py (2)
Line range hint
27-27
: ReplaceEnvironmentError
with more specificOSError
.- raise EnvironmentError( + raise OSError(
Line range hint
33-33
: Remove unnecessary open mode parameters.- with open(file_path, "r") as file: + with open(file_path) as file:app/pipeline/chat/lecture_chat_pipeline.py (1)
Line range hint
151-151
: Rename unused loop control variablei
to_
.- for i, chunk in enumerate(retrieved_lecture_chunks): + for _, chunk in enumerate(retrieved_lecture_chunks):app/pipeline/chat/tutor_chat_pipeline.py (3)
Line range hint
124-124
: Remove unnecessary open mode parameters for file operations.- with open(prompt_file_path, "r") as file: + with open(prompt_file_path) as file:
Line range hint
310-312
: Use format specifiers instead of percent format for better readability and performance.- f"Here is the information if the build failed: {build_failed}\n" - "These are the build logs for the student's repository:\n%s" - ) % "\n".join(str(log) for log in build_logs) + f"Here is the information if the build failed: {build_failed}\n" + "These are the build logs for the student's repository:\n" + + "\n".join(str(log) for log in build_logs)
Line range hint
354-354
: Avoid unused loop variables to enhance code clarity.- for i, chunk in enumerate(retrieved_lecture_chunks): + for _i, chunk in enumerate(retrieved_lecture_chunks):
… it was already ingested
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Outside diff range and nitpick comments (4)
app/pipeline/chat/tutor_chat_pipeline.py (4)
Line range hint
134-134
: Remove unnecessary file open mode parameters.- with open(prompt_file_path, "r") as file: + with open(prompt_file_path) as file:This change addresses the static analysis hint about unnecessary open mode parameters. By default, files are opened in read mode, so specifying
"r"
is redundant.
Line range hint
324-326
: Use format specifiers for string formatting.- prompt = ( - "These are the feedbacks for the student's repository:\n%s" - ) % "\n---------\n".join(str(log) for log in feedbacks) + prompt = f"These are the feedbacks for the student's repository:\n{'\n---------\n'.join(str(log) for log in feedbacks)}"This change uses Python's f-string format which is more readable and efficient than the percent format.
Line range hint
368-368
: Rename the unused loop control variable.- for i, chunk in enumerate(retrieved_lecture_chunks): + for _i, chunk in enumerate(retrieved_lecture_chunks):Renaming the unused variable
_i
follows Python's convention to indicate that this variable is intentionally unused.
Line range hint
163-233
: Review the method_run_tutor_chat_pipeline
for clarity and efficiency.The method
_run_tutor_chat_pipeline
is quite long and handles multiple responsibilities. Consider breaking it down into smaller, more focused methods to improve readability and maintainability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Outside diff range and nitpick comments (1)
app/pipeline/chat/lecture_chat_pipeline.py (1)
Line range hint
151-151
: Unused loop control variablei
should be renamed to_i
to indicate it's intentionally unused.- for i, chunk in enumerate(retrieved_lecture_chunks): + for _i, chunk in enumerate(retrieved_lecture_chunks):
logger.info(f"Response from lecture chat pipeline: {response}") | ||
return response | ||
return response_with_citation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider specifying exception types to improve error diagnostics and handling.
- except Exception as e:
+ except (WeaviateException, IOError) as e:
Committable suggestion was skipped due to low confidence.
with batch_update_lock: | ||
with self.collection.batch.rate_limit(requests_per_minute=600) as batch: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Combine nested with
statements into a single statement for better readability.
- with batch_update_lock:
- with self.collection.batch.rate_limit(requests_per_minute=600) as batch:
+ with batch_update_lock, self.collection.batch.rate_limit(requests_per_minute=600) as batch:
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
with batch_update_lock: | |
with self.collection.batch.rate_limit(requests_per_minute=600) as batch: | |
with batch_update_lock, self.collection.batch.rate_limit(requests_per_minute=600) as batch: |
Tools
Ruff
156-157: Use a single
with
statement with multiple contexts instead of nestedwith
statements (SIM117)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Outside diff range and nitpick comments (3)
app/pipeline/chat/tutor_chat_pipeline.py (3)
Line range hint
134-134
: Remove unnecessary open mode parameters.The file is opened in read mode by default, so specifying the mode explicitly is redundant. Here's a simplified version:
- with open(prompt_file_path, "r") as file: + with open(prompt_file_path) as file:
Line range hint
324-326
: Use format specifiers instead of percent format.Using format specifiers provides a more modern, readable, and error-proof way of formatting strings in Python. Here's how you can refactor the code:
- prompt = ( - "These are the feedbacks for the student's repository:\n%s" - ) % "\n---------\n".join(str(log) for log in feedbacks) + prompt = f"These are the feedbacks for the student's repository:\n{'\n---------\n'.join(str(log) for log in feedbacks)}"
Line range hint
368-368
: Rename unused loop control variable.The variable
i
is not used within the loop body. It is a good practice to replace it with_
to indicate that it is intentionally unused:- for i, chunk in enumerate(retrieved_lecture_chunks): + for _, chunk in enumerate(retrieved_lecture_chunks):
def should_execute_lecture_pipeline(self, course_id: int) -> bool: | ||
""" | ||
Checks if the lecture pipeline should be executed | ||
:param course_id: The course ID | ||
:return: True if the lecture pipeline should be executed | ||
""" | ||
if course_id: | ||
# Fetch the first object that matches the course ID with the language property | ||
result = self.db.lectures.query.fetch_objects( | ||
filters=Filter.by_property(LectureSchema.COURSE_ID.value).equal( | ||
course_id | ||
), | ||
limit=1, | ||
return_properties=[LectureSchema.COURSE_NAME.value], | ||
) | ||
return len(result.objects) > 0 | ||
return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optimize the method should_execute_lecture_pipeline
.
The method should_execute_lecture_pipeline
can be optimized by directly returning the condition in the if
statement. This change makes the function more concise and easier to read, as previously suggested:
- if course_id:
- result = self.db.lectures.query.fetch_objects(
- filters=Filter.by_property(LectureSchema.COURSE_ID.value).equal(course_id),
- limit=1,
- return_properties=[LectureSchema.COURSE_NAME.value],
- )
- return len(result.objects) > 0
- return False
+ return course_id and len(self.db.lectures.query.fetch_objects(
+ filters=Filter.by_property(LectureSchema.COURSE_ID.value).equal(course_id),
+ limit=1,
+ return_properties=[LectureSchema.COURSE_NAME.value],
+ ).objects) > 0
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def should_execute_lecture_pipeline(self, course_id: int) -> bool: | |
""" | |
Checks if the lecture pipeline should be executed | |
:param course_id: The course ID | |
:return: True if the lecture pipeline should be executed | |
""" | |
if course_id: | |
# Fetch the first object that matches the course ID with the language property | |
result = self.db.lectures.query.fetch_objects( | |
filters=Filter.by_property(LectureSchema.COURSE_ID.value).equal( | |
course_id | |
), | |
limit=1, | |
return_properties=[LectureSchema.COURSE_NAME.value], | |
) | |
return len(result.objects) > 0 | |
return False | |
def should_execute_lecture_pipeline(self, course_id: int) -> bool: | |
""" | |
Checks if the lecture pipeline should be executed | |
:param course_id: The course ID | |
:return: True if the lecture pipeline should be executed | |
""" | |
return course_id and len(self.db.lectures.query.fetch_objects( | |
filters=Filter.by_property(LectureSchema.COURSE_ID.value).equal( | |
course_id | |
), | |
limit=1, | |
return_properties=[LectureSchema.COURSE_NAME.value], | |
).objects) > 0 |
should_execute_lecture_pipeline = self.should_execute_lecture_pipeline( | ||
dto.course.id | ||
) | ||
self.lecture_chat_response = "" | ||
if should_execute_lecture_pipeline: | ||
execution_dto = LectureChatPipelineExecutionDTO( | ||
settings=dto.settings, | ||
course=dto.course, | ||
chatHistory=dto.chat_history, | ||
) | ||
lecture_chat_thread = threading.Thread( | ||
target=self._run_lecture_chat_pipeline(execution_dto), args=(dto,) | ||
) | ||
lecture_chat_thread.start() | ||
|
||
tutor_chat_thread = threading.Thread( | ||
target=self._run_tutor_chat_pipeline(dto), | ||
args=(dto, should_execute_lecture_pipeline), | ||
) | ||
tutor_chat_thread.start() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ensure proper thread management and exception handling in threading logic.
The threading logic within the __call__
method does not appear to handle exceptions or race conditions effectively. Consider implementing thread joining or using a thread pool to manage threads more safely. Additionally, ensure that shared resources are accessed in a thread-safe manner. Here is a proposed change:
- lecture_chat_thread = threading.Thread(
- target=self._run_lecture_chat_pipeline(execution_dto), args=(dto,)
- )
- lecture_chat_thread.start()
+ with ThreadPoolExecutor() as executor:
+ executor.submit(self._run_lecture_chat_pipeline, execution_dto, dto)
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
should_execute_lecture_pipeline = self.should_execute_lecture_pipeline( | |
dto.course.id | |
) | |
self.lecture_chat_response = "" | |
if should_execute_lecture_pipeline: | |
execution_dto = LectureChatPipelineExecutionDTO( | |
settings=dto.settings, | |
course=dto.course, | |
chatHistory=dto.chat_history, | |
) | |
lecture_chat_thread = threading.Thread( | |
target=self._run_lecture_chat_pipeline(execution_dto), args=(dto,) | |
) | |
lecture_chat_thread.start() | |
tutor_chat_thread = threading.Thread( | |
target=self._run_tutor_chat_pipeline(dto), | |
args=(dto, should_execute_lecture_pipeline), | |
) | |
tutor_chat_thread.start() | |
should_execute_lecture_pipeline = self.should_execute_lecture_pipeline( | |
dto.course.id | |
) | |
self.lecture_chat_response = "" | |
if should_execute_lecture_pipeline: | |
execution_dto = LectureChatPipelineExecutionDTO( | |
settings=dto.settings, | |
course=dto.course, | |
chatHistory=dto.chat_history, | |
) | |
with ThreadPoolExecutor() as executor: | |
executor.submit(self._run_lecture_chat_pipeline, execution_dto, dto) | |
tutor_chat_thread = threading.Thread( | |
target=self._run_tutor_chat_pipeline(dto), | |
args=(dto, should_execute_lecture_pipeline), | |
) | |
tutor_chat_thread.start() |
The primary goal of introducing a vector database is to enhance Iris's ability to respond accurately based on the content of lecture slides. This implementation sets the ingestion Pipeline.
How to Test:
Artemis instructor.
Artemis Lecture Ingestion Pipeline Branch.
You can verify the steps either by running a debugger or checking the logs on the artemis side.
Summary by CodeRabbit
New Features
Improvements
Configuration
.gitignore
for better Docker handling.