
Chain of density #135

Merged 16 commits on Nov 12, 2023
Conversation

ivanleomk
Collaborator

@ivanleomk ivanleomk commented Nov 4, 2023

Added a sample implementation of the Chain of Density summarization technique. Modified the pipeline slightly to use GPT-4 for critiquing and GPT-3.5 for generating output.

Summary by CodeRabbit

  • New Features

    • Introduced a new "Chain Of Density" summarization technique using GPT-3.5.
    • Added a new Python script for article summarization using Pydantic and OpenAI's GPT-4 model.
  • Documentation

    • Added a comprehensive guide on implementing the "Chain Of Density" summarization technique.
    • Provided instructions for installing dependencies, downloading datasets, generating examples for fine-tuning, and running a fine-tuning job.
  • Bug Fixes

    • Fixed a typo in the model parameter in the openai.ChatCompletion.create() function.
  • Chores

    • Updated Python version used in the workflow from 3.x to 3.10.
    • Added new dependencies: openai, pydantic, instructor, nltk, and rich.

Contributor

coderabbitai bot commented Nov 4, 2023

Walkthrough

The changes introduce a new "Chain of Density" summarization technique using GPT-3.5 and the Instructor library. The technique involves generating a chain of summaries with increasing density. The implementation includes data models for initial and rewritten summaries, validators, and a function to generate the summaries. The changes also include instructions for fine-tuning the model, evaluating the quality of summaries, and a blog post explaining the technique.
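The iterative densification loop the walkthrough describes can be sketched in plain Python. All names below are illustrative, not taken from the PR (the real implementation drives the rewrite step through OpenAI and Instructor, with Pydantic validators enforcing the invariant):

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SummaryStep:
    summary: str
    entities: set  # entities mentioned so far

def chain_of_density(
    article_entities: List[str],
    initial: SummaryStep,
    rewrite: Callable[[SummaryStep, str], SummaryStep],
    rounds: int = 3,
) -> List[SummaryStep]:
    """Iteratively fold missing entities into the summary, never dropping any."""
    chain = [initial]
    for _ in range(rounds):
        prev = chain[-1]
        missing = [e for e in article_entities if e not in prev.entities]
        if not missing:
            break
        new = rewrite(prev, missing[0])
        # The core invariant of Chain of Density: entities only accumulate.
        assert prev.entities <= new.entities, "rewrite dropped an entity"
        chain.append(new)
    return chain

def stub_rewrite(prev: SummaryStep, entity: str) -> SummaryStep:
    # Stand-in for the GPT-3.5 rewrite call: append the entity, keep prior ones.
    return SummaryStep(prev.summary + " " + entity, prev.entities | {entity})

chain = chain_of_density(
    ["Alice", "Bob"],
    SummaryStep("This article discusses a meeting.", set()),
    stub_rewrite,
)
```

The stub makes the shape of the technique visible: each round produces a denser summary whose entity set is a superset of the previous one.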

Changes

File Summary
examples/chain-of-density/Readme.md Added a new document with instructions for performing Chain Of Density summarization using GPT-3.5.
examples/chain-of-density/chain_of_density.py Introduced new classes and functions for the Chain Of Density summarization technique.
examples/chain-of-density/download.py Added code to load and process datasets for the summarization task.
examples/chain-of-density/finetune.py Introduced a script for fine-tuning the model using the generated summaries.
examples/chain-of-density/run.py Introduced a script to generate a chain of summaries for an article.
docs/blog/posts/chain-of-density.md Added a blog post explaining the Chain Of Density summarization technique.
instructor/dsl/... Fixed a typo in the model parameter and added checks in the extract_json method.
examples/chain-of-density/requirements.txt Added new dependencies required for the summarization task.

Poem

🐇 Hopping through the code, making changes bright, 🌟

Summarizing articles, from morning till night. 🌞🌛

With GPT-3.5, we rewrite, ✍️

Making summaries dense and tight. 📚

As the leaves fall, we fine-tune, 🍂

Under the light of the autumn moon. 🌙

Celebrating changes, big and small, 🎉

CodeRabbit is here, hopping through fall. 🐇🍁


Tips

Chat with CodeRabbit Bot (@coderabbitai)

  • If you reply to a review comment from CodeRabbit, the bot will automatically respond.
  • To engage with CodeRabbit bot directly around the specific lines of code in the PR, mention @coderabbitai in your review comment
  • Note: Review comments are made on code diffs or files, not on the PR overview.
  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.

CodeRabbit Commands (invoked as PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger a review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai help to get help.

Note: For conversation with the bot, please use the review comments on code diffs or files.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • The JSON schema for the configuration file is available here.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/coderabbit-overrides.v2.json

Comment on lines 23 to 27
```python
class GeneratedSummary(BaseModel):
    summary: str


@instructions.distil
def distil_summarization(text: str) -> GeneratedSummary:
    summary_chain: List[str] = summarize_article(text)
    print(summary_chain)
    return GeneratedSummary(summary=summary_chain[-1])
```
Contributor


The GeneratedSummary class and distil_summarization function are well-defined. However, there is no error handling for the case where summarize_article(text) returns an empty list. This would cause an IndexError at line 31. Consider adding a check for this.

```diff
+    if not summary_chain:
+        raise ValueError("No summaries generated for the provided text.")
```


Comment on lines 34 to 35
```python
# Read in the csv file we have
with open("output.csv", "r") as file:
    reader = csv.reader(file)

    for article, summary in islice(reader, 1, 10):
        distil_summarization(article)
```
Contributor


The CSV file is read and processed correctly. However, there is no error handling for the case where the file does not exist or cannot be opened. Consider adding a try-except block to handle potential FileNotFoundError or IOError.

```diff
+try:
     with open("output.csv", "r") as file:
         reader = csv.reader(file)
         for article, summary in islice(reader, 1, 10):
             distil_summarization(article)
+except (FileNotFoundError, IOError):
+    logging.error("Failed to open output.csv")
```


Comment on lines 12 to 10
```python
class MissingEntity(BaseModel):
    """
    An entity is a real-world object that's assigned a name - for example, a person, country a product or a book title.

    A missing entity is:
    - relevant to the main story,
    - specific yet concise (5 words or fewer),
    - novel (not in the previous summary),
    - faithful (present in the article),
    - anywhere (can be located anywhere in the article).
    """

    entity_name: str = Field(
        ...,
        description="This is the associated name with the entity that exists in the text",
    )
    reason: str = Field(
        ...,
        description="This is a short sentence which describes why we should include this new entity in the rewritten abstract",
    )
```

Contributor


The MissingEntity class is well defined with clear comments and field descriptions. However, consider adding type hints for the fields to improve code readability and maintainability.

```diff
-    entity_name: str = Field(
+    entity_name: str = Field[str](
         ...,
         description="This is the associated name with the entity that exists in the text",
     )
-    reason: str = Field(
+    reason: str = Field[str](
         ...,
         description="This is a short sentence which describes why we should include this new entity in the rewritten abstract",
     )
```


Comment on lines 34 to 21
```python
class OmittedEntity(BaseModel):
    """
    An entity is a real-world object that's assigned a name - for example, a person, country a product or a book title.
    """

    entity_name: str = Field(
        ...,
        description="This is an entity which was present in the previous summary and not in the newly generated summary",
    )
```

Contributor


The OmittedEntity class is well defined with clear comments and field descriptions. However, consider adding type hints for the fields to improve code readability and maintainability.

```diff
-    entity_name: str = Field(
+    entity_name: str = Field[str](
         ...,
         description="This is an entity which was present in the previous summary and not in the newly generated summary",
     )
```


Comment on lines 45 to 52
```python
class MaybeOmittedEntities(BaseModel):
    """
    This represents whether the new summary has omitted any entities that were present in the previous summary provided.
    """

    omitted_entities: Optional[List[OmittedEntity]] = Field(default=[])
    message: Optional[str] = Field(default=None)
```

Contributor


The MaybeOmittedEntities class is well defined with clear comments and field descriptions. However, consider adding type hints for the fields to improve code readability and maintainability.

```diff
-    omitted_entities: Optional[List[OmittedEntity]] = Field(default=[])
+    omitted_entities: Optional[List[OmittedEntity]] = Field[Optional[List[OmittedEntity]]](default=[])
-    message: Optional[str] = Field(default=None)
+    message: Optional[str] = Field[Optional[str]](default=None)
```


Comment on lines 63 to 20
```python
class InitialSummary(BaseModel):
    """
    This is an initial summary which should be long ( 4-5 sentences, ~80 words) yet highly non-specific, containing little information beyond the entities marked as missing. Use overly verbose languages and fillers (Eg. This article discusses) to reach ~80 words.
    """

    summary: str = Field(
        ...,
        description="This is a summary of the article provided which is overly verbose and has fillers to reach ~80 words",
    )
```

Contributor


The InitialSummary class is well defined with clear comments and field descriptions. However, consider adding type hints for the fields to improve code readability and maintainability.

```diff
-    summary: str = Field(
+    summary: str = Field[str](
         ...,
         description="This is a summary of the article provided which is overly verbose and has fillers to reach ~80 words",
     )
```


Comment on lines 122 to 174
```python
def rewrite_summary(
    article: str,
    existing_summary: str,
    entity_ctx: str,
    error_msgs: List[str] = [],
    remaining_retries=3,
):
    # We then perform a new summary and validate that the entity density has
    # increased (we have not lost any entities)
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-16k",
        functions=[RewrittenSummary.openai_schema],
        function_call={"name": RewrittenSummary.openai_schema["name"]},
        max_retries=2,
        messages=[
            {
                "role": "system",
                "content": "You are about to be given an article, an existing summary of the article and some new entities. Please use the information to rewrite the summary to make it denser and more concise so that it covers every entity and detail from the previous summary plus the missing entities",
            },
            {"role": "user", "content": f"Here is the article : {article}"},
            {
                "role": "user",
                "content": f"Here is the most recent article : {existing_summary}",
            },
            {
                "role": "user",
                "content": f"Here is some information on entities you should include in the rewritten summary: {entity_ctx}",
            },
            *error_msgs,
        ],
    )
    try:
        new_summary = RewrittenSummary.from_response(
            completion, validation_context={"prev_summary": existing_summary}
        )
        return new_summary
    except (ValidationError, JSONDecodeError) as e:
        if remaining_retries == 0:
            raise e
        error_msgs = []
        error_msgs.append(dict(**completion.choices[0].message))
        error_msgs.append(
            {
                "role": "user",
                "content": f"Recall the function correctly, exceptions found\n{e}",
            }
        )
        return rewrite_summary(
            article,
            existing_summary,
            entity_ctx,
            error_msgs,
            remaining_retries=remaining_retries - 1,
        )
```
Contributor


The rewrite_summary function is well defined with clear comments. It uses the OpenAI API to generate a new summary and validates that the entity density has increased. It also handles exceptions properly. However, consider adding type hints for the parameters and the return type to improve code readability and maintainability.

```diff
 def rewrite_summary(
     article: str,
     existing_summary: str,
     entity_ctx: str,
     error_msgs: List[str] = [],
     remaining_retries=3,
-):
+) -> Union[RewrittenSummary, None]:
```


Contributor

@coderabbitai coderabbitai bot left a comment


Review Status

Actionable comments generated: 0

Configuration used: CodeRabbit UI

Commits reviewed: files that changed from the base of the PR, between 6e68f11 and 0e08af2.
Files selected for processing (1)
  • examples/chain-of-density/run.py (1 hunks)
Files skipped from review due to trivial changes (1)
  • examples/chain-of-density/run.py

Collaborator

@jxnl jxnl left a comment


How close are the prompts here to the original prompts ?

  • examples/chain-of-density/chain_of_density.py: 8 review threads, all resolved (7 outdated)
@ivanleomk
Collaborator Author

How close are the prompts here to the original prompts ?

The prompts here are based on the original prompts from the paper, with slight modifications. A number of them I came up with on my own, but here are the rough annotations from the original paper:

(Screenshot "CleanShot 2023-11-04 at 23 22 49": annotated prompts from the paper)

Figure 2 displays the prompt along with an example output. Rather than be prescriptive about the types of entities, we simply define a Missing Entity as:
• Relevant: to the main story.
• Specific: descriptive yet concise (5 words or fewer).
• Novel: not in the previous summary.
• Faithful: present in the Article.
• Anywhere: located anywhere in the Article.

Initial Summary comes from the line about

The first summary should be long ( 4 - 5 sentences, ~80 words ) yet highly non-specific, containing little information beyond the entities marked as missing. Use overly verbose language and fillers (e.g., "this article discusses") to reach ~80 words.

Rewritten Prompts come from the few paragraph points about

Guidelines:

  • Make every word count: re-write the previous summary to improve flow and make space for additional entities.
  • Make space with fusion, compression, and removal of uninformative phrases like "the article discusses"
  • The summaries should become highly dense and concise yet self -contained, e.g., easily understood without the Article.
  • Missing entities can appear anywhere in the new summary.
  • Never drop entities from the previous summary. If space cannot be made, add fewer new entities

I rephrased the third point about the highly dense and concise summaries and used it as the description for the Pydantic class

Entity Definition comes from spaCY

An entity is a real-world object that's assigned a name - for example, a person, country a product or a book title.

Missing Entities Task comes from the step 1 of the prompt

Identify 1-3 informative entities (";" delimited) from the article which are missing from the previously generated summary
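Taken together, the paper's criteria for a missing entity reduce to a few mechanical checks. A rough plain-Python sketch of those checks (a hypothetical helper, not part of the PR, which enforces the same criteria through Pydantic validators instead):

```python
def is_valid_missing_entity(entity: str, article: str, prev_summary: str) -> bool:
    """Mechanical version of the paper's criteria for a missing entity."""
    words = entity.split()
    concise = 0 < len(words) <= 5                       # specific yet concise
    novel = entity.lower() not in prev_summary.lower()  # not in previous summary
    faithful = entity.lower() in article.lower()        # present in the article
    return concise and novel and faithful
```

The substring checks are deliberately naive; the PR delegates this judgement to the critique model, which handles paraphrase and coreference that string matching cannot.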

Contributor

@coderabbitai coderabbitai bot left a comment


Review Status

Actionable comments generated: 9

Configuration used: CodeRabbit UI

Commits reviewed: files that changed from the base of the PR, between 0e08af2 and dfe380f.
Files ignored due to filter (1)
  • docs/blog/posts/img/chain-of-density.png
Files selected for processing (10)
  • .github/workflows/mkdocs.yml (1 hunks)
  • docs/blog/posts/chain-of-density.md (1 hunks)
  • examples/chain-of-density/Readme.md (1 hunks)
  • examples/chain-of-density/chain_of_density.py (1 hunks)
  • examples/chain-of-density/chain_of_density.txt (1 hunks)
  • examples/chain-of-density/finetune.py (1 hunks)
  • examples/chain-of-density/run.py (1 hunks)
  • instructor/dsl/citation.py (1 hunks)
  • instructor/dsl/multitask.py (1 hunks)
  • instructor/function_calls.py (1 hunks)
Files skipped from review due to trivial changes (4)
  • .github/workflows/mkdocs.yml
  • examples/chain-of-density/run.py
  • instructor/dsl/citation.py
  • instructor/function_calls.py
Additional comments: 14
examples/chain-of-density/chain_of_density.txt (1)
  • 1-9: Ensure that all these dependencies are compatible with each other and with the existing dependencies in your project. Also, make sure to update your project's documentation to reflect these new dependencies.
examples/chain-of-density/chain_of_density.py (2)
  • 1-7: The new imports look fine. Ensure that these packages are included in your project's dependencies.

  • 26-53: The InitialSummary and RewrittenSummary classes look well-structured. The docstrings and field descriptions provide clear explanations of their purpose and usage.

examples/chain-of-density/finetune.py (4)
  • 1-7: The imports look fine. Ensure that all the imported modules are used in the code and that they are installed in the environment where this script will run.

  • 9-20: The instructor patch and logging setup look fine. Ensure that the logging level is appropriate for your use case and that the log file "generated.jsonl" is being written to the correct location.

  • 23-31: The GeneratedSummary class and distil_summarization function are well defined. Ensure that the summarize_article function returns a list as expected and that the last element of this list is the final summary.

  • 35-52: The CSV file reading and summarization process look fine. Ensure that the "output.csv" file exists and is in the correct format. Also, ensure that the compute_metrics function returns the correct metrics. The division operation at line 53 should be safe as long as ttl_tokens is not zero. Consider adding a check to prevent division by zero.

```diff
-print(f"FINAL ET: {ttl_entities/ttl_tokens}")
+if ttl_tokens > 0:
+    print(f"FINAL ET: {ttl_entities/ttl_tokens}")
+else:
+    print("No tokens found.")
```
examples/chain-of-density/Readme.md (1)
  • 35-35: The TODO comment should be addressed before merging the pull request. If it's not feasible to implement this feature at the moment, consider creating an issue in the repository to track this task.
docs/blog/posts/chain-of-density.md (6)
  • 1-11: The metadata of the blog post looks fine. The authors and tags are correctly set.

  • 109-136: The Pydantic models InitialSummary and RewrittenSummary are well defined with appropriate descriptions and fields.

  • 147-174: The validators for the fields summary, missing, and absent are correctly implemented. They ensure that the summary has the right length, missing entities are identified, and no entities are absent from the new summary.

  • 184-251: The summarize_article function is well implemented. It generates an initial summary and then iteratively rewrites the summary to include missing entities. The use of OpenAI's ChatCompletion API is correct and the response models are correctly set. The function also handles retries and token limits.

  • 274-308: The script for fine-tuning the model is well implemented. It uses the Instructor library to generate a .jsonl file for fine-tuning. The use of the distil decorator and the Instruction object is correct. The script also correctly reads the articles from a CSV file and generates summaries for them.

  • 343-347: The update to the distil_summarization function to use the fine-tuned model is correctly done. The model id is correctly set in the distil decorator.
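For the entity-per-token metric the review keeps returning to, a guarded version of the final computation might look like this. It is a simplified sketch: the PR derives the counts with spaCy and NLTK, which are assumed here as plain integer lists:

```python
from typing import Sequence

def entity_density(entity_counts: Sequence[int], token_counts: Sequence[int]) -> float:
    """Entities per token across all summaries, guarded against empty input."""
    ttl_entities = sum(entity_counts)
    ttl_tokens = sum(token_counts)
    if ttl_tokens == 0:  # the division-by-zero guard the review suggests
        return 0.0
    return ttl_entities / ttl_tokens
```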

Comment on lines 1 to 35
# Introduction

This is a simple example which shows how to perform Chain Of Density summarization using GPT-3.5 and utilise the generated output to fine-tune a 3.5 model for production usage.

## Instructions

1. First, install all of the required dependencies by running the command below. We recommend using a virtual environment to install these so that it does not affect your system installation.

> To evaluate the quality of our summaries, we use spaCy and NLTK. You'll need to download the spaCy `en_core_web_trf` package and the NLTK `punkt` package to compute the token metrics.

```
pip3 install -r chain_of_density.txt
```



2. Download the dataset using `download.py`. We're using the `griffin/chain_of_density` dataset for this example so no worries if you don't have a dataset of your own. This should generate a new `.csv` file in the folder called `output.csv`

```
python3 download.py
```

3. We now need some examples to fine-tune our `3.5` model on. We provide an existing `.jsonl` file to use, or you can generate new ones from the dataset using `finetune.py`

> Don't forget to set the `OPENAI_API_KEY` environment variable in your shell if you wish to regenerate the examples. You can do so with `export OPENAI_API_KEY=<api key>`. We'll use it later for the fine-tuning step too.

4. Now that we have a `.jsonl` file with a bunch of examples, let's now run a simple finetuning job

```
instructor jobs create-from-file summarization.jsonl
```

Voila! Now you've got a new GPT3.5 model that's capable of summarizing text fine-tuned with Chain Of Density.

TODO: Evaluate the quality of the improved summaries using Spacy's Entity counter ( So we can calculate entity / tokens )
Contributor


The instructions are clear and concise. However, it would be helpful to include a brief explanation of what the Chain Of Density summarization technique is and why it's beneficial. This would provide context for users who are unfamiliar with the technique.

```diff
+ ## What is Chain Of Density Summarization?
+
+ Chain Of Density Summarization is a technique that...
```

4. Now that we have a `.jsonl` file with a bunch of examples, let's now run a simple finetuning job

```
instructor jobs create-from-file summarization.jsonl
```

The command instructor jobs create-from-file summarization.jsonl seems to be incorrect. It should be instruct instead of instructor.

- instructor jobs create-from-file summarization.jsonl 
+ instruct jobs create-from-file summarization.jsonl 


Comment on lines 31 to 39
```diff
 @staticmethod
 def extract_json(completion):
     for chunk in completion:
-        delta = chunk["choices"][0]["delta"]
-        if "function_call" in delta:
-            yield delta["function_call"]["arguments"]
+        if chunk["choices"]:
+            delta = chunk["choices"][0]["delta"]
+            if "function_call" in delta:
+                if "arguments" in delta["function_call"]:
+                    yield delta["function_call"]["arguments"]
```


The code assumes that chunk["choices"] exists and is not None. If it doesn't exist or is None, this will raise a KeyError or TypeError respectively. To avoid this, use the get method which returns None if the key doesn't exist.

- if chunk["choices"]:
+ if chunk.get("choices"):

Also, the code assumes that chunk["choices"] is a list and has at least one element. If it's not a list or is an empty list, this will raise an IndexError. To avoid this, check if chunk["choices"] is a list and has at least one element.

- if chunk.get("choices"):
+ if isinstance(chunk.get("choices"), list) and len(chunk.get("choices")) > 0:


@ivanleomk
Collaborator Author

Migrated and updated with benchmarks that I calculated, and created a Hugging Face data card here - https://huggingface.co/datasets/ivanleomk/gpt4-chain-of-density.

I will tidy up a Colab notebook so that people can reproduce my benchmarks from the data (just the visualisations) and double-check my figures if they want.

Still pending: a library update for the new OpenAI client and changes to the README to add the Hugging Face dataset links.


@coderabbitai coderabbitai bot left a comment


Review Status

Actionable comments generated: 10

Configuration used: CodeRabbit UI

Files that changed from the base of the PR and between dfe380f and 717b40f.
Files selected for processing (4)
  • docs/blog/posts/chain-of-density.md (1 hunks)
  • examples/chain-of-density/chain_of_density.py (1 hunks)
  • examples/chain-of-density/finetune.py (1 hunks)
  • examples/chain-of-density/requirements.txt (1 hunks)
Files skipped from review due to trivial changes (1)
  • examples/chain-of-density/requirements.txt
Additional comments: 11
examples/chain-of-density/chain_of_density.py (3)
  • 1-7: The import statements are well-organized and only necessary modules are imported. Good practice.

  • 8-8: The OpenAI client is patched using the instructor module. Ensure that the instructor module is correctly implemented and does not introduce any security vulnerabilities.

  • 11-32: The InitialSummary and RewrittenSummary classes are well-defined with clear docstrings and field descriptions. Good use of Pydantic for data validation.

docs/blog/posts/chain-of-density.md (3)
  • 12-13: The order of authors is being discussed. If the order matters, please verify and adjust accordingly.

  • 394-394: The note about the benefits of fine-tuning is insightful and well written.

  • 402-407: The suggestions for further improvements are well thought out and clearly explained.

examples/chain-of-density/finetune.py (5)
  • 1-8: Imports and setup look good. Ensure that all the imported modules are used in the code.

  • 9-9: The instructor.patch() function is used to patch the OpenAI client. Ensure that the instructor library is compatible with the OpenAI library and that the patching process doesn't introduce any unexpected behavior.

  • 13-17: The Instructions object is created with a name, format, and log handlers. Ensure that the log file path is correct and that the file has write permissions.

  • 20-22: The GeneratedSummary class is defined with a single attribute summary. This class is used to return the summary from the distil_summarization function. The use of Pydantic's BaseModel ensures that the data is validated and serialized/deserialized correctly.

  • 24-27: The distil_summarization function is decorated with @instructions.distil and takes a string input. It calls the summarize_article function and returns a GeneratedSummary object. Ensure that the summarize_article function is correctly implemented and that it returns a list of strings.

Comment on lines 96 to 105
class InitialSummary(BaseModel):
"""
This is an initial summary which should be long ( 4-5 sentences, ~80 words) yet highly non-specific, containing little information beyond the entities marked as missing. Use overly verbose languages and fillers (Eg. This article discusses) to reach ~80 words.
"""

summary: str = Field(
...,
description="This is a summary of the article provided which is overly verbose and uses fillers. It should be roughly 80 words in length",
)
```

The InitialSummary class is well defined with clear docstrings and field descriptions. However, consider adding a validation to ensure the summary is approximately 80 words long.
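The suggested word-count check can be sketched as a standalone function first; in a Pydantic model it would typically sit behind a validator on the `summary` field so a failing summary raises `ValueError` and can trigger a retry. The 60-100 word band around the ~80-word target is an illustrative choice, not from the paper.

```python
# Sketch of the suggested length validation as a plain function. The tolerance
# band (60-100 words around the ~80-word target) is an assumption for
# illustration.

def check_summary_length(summary: str, min_words: int = 60, max_words: int = 100) -> str:
    """Raise ValueError unless the summary is roughly 80 words long."""
    n_words = len(summary.split())
    if not min_words <= n_words <= max_words:
        raise ValueError(
            f"Summary has {n_words} words, expected roughly 80 "
            f"(between {min_words} and {max_words})."
        )
    return summary

ok = " ".join(["word"] * 80)
print(check_summary_length(ok) == ok)  # True
```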

Comment on lines 113 to 140
class RewrittenSummary(BaseModel):
"""
This is a new, denser summary of identical length which covers every entity and detail from the previous summary plus the Missing Entities.

Guidelines
- Make every word count : Rewrite the previous summary to improve flow and make space for additional entities
- Never drop entities from the previous summary. If space cannot be made, add fewer new entities.
- The new summary should be highly dense and concise yet self-contained, eg., easily understood without the Article.
- Make space with fusion, compression, and removal of uninformative phrases like "the article discusses"
- Missing entities can appear anywhere in the new summary

An Entity is a real-world object that's assigned a name - for example, a person, country a product or a book title.
"""

summary: str = Field(
...,
description="This is a new, denser summary of identical length which covers every entity and detail from the previous summary plus the Missing Entities. It should have the same length ( ~ 80 words ) as the previous summary and should be easily understood without the Article",
)
absent: List[str] = Field(
...,
default_factory=list,
description="this is a list of Entities found absent from the new summary that were present in the previous summary",
)
missing: List[str] = Field(
default_factory=list,
description="This is a list of 1-3 informative Entities from the Article that are missing from the new summary which should be included in the next generated summary.",
)
```

The RewrittenSummary class is well defined with clear docstrings and field descriptions. However, consider adding validations to ensure the summary is approximately 80 words long, the absent list is empty, and the missing list contains 1-3 entities.
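The two extra checks the comment suggests can likewise be sketched as standalone functions; in a Pydantic model they would hang off validators on the `absent` and `missing` fields. Raising `ValueError` is what lets a retrying caller feed the error message back to the model as a correction. Function names and messages here are illustrative.

```python
# Standalone sketches of the suggested checks for RewrittenSummary: the rewrite
# must not drop entities, and must propose 1-3 new entities for the next pass.

def check_no_absent_entities(absent: list[str]) -> list[str]:
    """Fail if the rewrite dropped entities that the previous summary had."""
    if absent:
        raise ValueError(
            f"Do not drop these entities from the previous summary: {', '.join(absent)}"
        )
    return absent

def check_missing_entities(missing: list[str]) -> list[str]:
    """Fail unless 1-3 new entities are proposed for the next rewrite."""
    if not 1 <= len(missing) <= 3:
        raise ValueError(f"Expected 1-3 missing entities, got {len(missing)}.")
    return missing

print(check_missing_entities(["Apple"]))  # ['Apple']
```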

Comment on lines 279 to 314
from typing import List
from chain_of_density import summarize_article #(1)!
import csv
import logging
import instructor
from itertools import islice
from pydantic import BaseModel

instructor.patch() #(2)!

logging.basicConfig(level=logging.INFO)

instructions = instructor.Instructions( #(3)!
name="Chain Of Density",
finetune_format="messages",
# log handler is used to save the data to a file
# you can imagine saving it to a database or other storage
# based on your needs!
log_handlers=[logging.FileHandler("generated.jsonl")],
)

class GeneratedSummary(BaseModel):
summary: str

@instructions.distil #(4)!
def distil_summarization(text: str) -> GeneratedSummary:
summary_chain: List[str] = summarize_article(text)
return GeneratedSummary(summary=summary_chain[-1]) #(5)!

with open("train.csv", "r") as file:
reader = csv.reader(file)
next(reader) # Skip the header
for index, (article, summary) in enumerate(reader):
# Run Distillisation to generate the values
distil_summarization(article)
```

The script for generating the training set is well implemented. However, consider adding error handling for potential issues with the OpenAI API calls and the CSV file reading.
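One way the suggested error handling could look: skip unreadable rows and retry transient API failures instead of letting one bad article kill the whole run. The `summarize` parameter stands in for the OpenAI-backed `distil_summarization` call; the function name and retry policy are illustrative.

```python
# Sketch of defensive CSV reading plus retried API calls. `summarize` is a
# stand-in for the OpenAI-backed call; here it is any callable taking the
# article text, so the control flow can be shown without an API key.
import csv
import io
import time

def summarize_all(csv_text: str, summarize, retries: int = 2) -> list:
    summaries = []
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header; the None default guards an empty file
    for row in reader:
        if not row:  # tolerate blank/malformed rows
            continue
        article = row[0]
        for attempt in range(retries + 1):
            try:
                summaries.append(summarize(article))
                break
            except Exception:
                if attempt == retries:
                    break  # give up on this row, keep processing the rest
                time.sleep(2 ** attempt)  # simple exponential backoff
    return summaries

demo_csv = "article,summary\nfirst article,ignored\nsecond article,ignored\n"
print(summarize_all(demo_csv, lambda text: text.upper()))  # ['FIRST ARTICLE', 'SECOND ARTICLE']
```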

Comment on lines 349 to 353
@instructions.distil(model='gpt-3.5-turbo:finetuned-123', mode="dispatch") #(1)!
def distil_summarization(text: str) -> GeneratedSummary:
summary_chain: List[str] = summarize_article(text)
return GeneratedSummary(summary=summary_chain[-1])
```

The distil_summarization function is well implemented. However, consider adding error handling for potential issues with the OpenAI API calls.


@coderabbitai coderabbitai bot left a comment


Review Status

Actionable comments generated: 1

Configuration used: CodeRabbit UI

Files that changed from the base of the PR and between 717b40f and e75c3a7.
Files selected for processing (1)
  • examples/chain-of-density/finetune.py (1 hunks)
Additional comments: 3
examples/chain-of-density/finetune.py (3)
  • 1-8: Imports and client setup look good.

  • 20-22: The GeneratedSummary class is well defined.

  • 24-27: The distil_summarization function is correctly defined and uses the summarize_article function to generate a summary. It returns the last summary in the chain.


@coderabbitai coderabbitai bot left a comment


Review Status

Actionable comments generated: 11

Configuration used: CodeRabbit UI

Files that changed from the base of the PR and between 877f57a and f4068a9.
Files ignored due to filter (2)
  • docs/blog/posts/img/chain-of-density.png
  • examples/chain-of-density/output.csv
Files selected for processing (5)
  • docs/blog/posts/chain-of-density.md (1 hunks)
  • examples/chain-of-density/Readme.md (1 hunks)
  • examples/chain-of-density/chain_of_density.py (1 hunks)
  • examples/chain-of-density/finetune.py (1 hunks)
  • examples/chain-of-density/requirements.txt (1 hunks)
Files skipped from review due to trivial changes (2)
  • examples/chain-of-density/Readme.md
  • examples/chain-of-density/requirements.txt
Additional comments: 5
examples/chain-of-density/finetune.py (5)
  • 1-8: The imports are well organized and only necessary modules are imported. Good job!

  • 9-9: The OpenAI client is patched with the instructor library. Ensure that the instructor library is compatible with the OpenAI client.

  • 13-17: The Instructions object is well defined with appropriate parameters. Ensure that the log file summarization.jsonl has write permissions.

  • 20-22: The GeneratedSummary class is well defined using Pydantic for data validation. Good job!

  • 24-27: The distil_summarization function is well defined and uses the @instructions.distil decorator. It returns a GeneratedSummary object with the last summary in the chain. Good job!

Comment on lines 30 to 35
# Read in the csv file we have
with open("test.csv", "r") as file:
reader = csv.reader(file)
next(reader) # Skip the header
for article, summary in reader:
distil_summarization(article)

The script reads from a CSV file and calls the distil_summarization function for each article. Ensure that the CSV file exists, is in the correct format, and that the file has read permissions. Also, the result of the distil_summarization function is not stored or used. If the result is needed, consider storing it in a variable or data structure.

-    for article, summary in reader:
-        distil_summarization(article)
+    summaries = []
+    for article, _ in reader:
+        summaries.append(distil_summarization(article))


Comment on lines 79 to 145
def summarize_article(article: str, summary_steps: int = 3):
summary_chain = []
# We first generate an initial summary
summary: InitialSummary = openai.chat.completions.create(
model="gpt-4-0613",
response_model=InitialSummary,
messages=[
{
"role": "system",
"content": "Write a summary about the article that is long (4-5 sentences) yet highly non-specific. Use overly, verbose language and fillers(eg.,'this article discusses') to reach ~80 words",
},
{"role": "user", "content": f"Here is the Article: {article}"},
{
"role": "user",
"content": "The generated summary should be about 80 words.",
},
],
max_retries=2,
)
prev_summary = None
summary_chain.append(summary.summary)
for i in range(summary_steps):
missing_entity_message = (
[]
if prev_summary is None
else [
{
"role": "user",
"content": f"Please include these Missing Entities: {','.join(prev_summary.missing)}",
},
]
)
new_summary: RewrittenSummary = openai.chat.completions.create(
model="gpt-4-0613",
messages=[
{
"role": "system",
"content": """
You are going to generate an increasingly concise,entity-dense summary of the following article.

Perform the following two tasks
- Identify 1-3 informative entities from the following article which is missing from the previous summary
- Write a new denser summary of identical length which covers every entity and detail from the previous summary plus the Missing Entities

Guidelines
- Make every word count: re-write the previous summary to improve flow and make space for additional entities
- Make space with fusion, compression, and removal of uninformative phrases like "the article discusses".
- The summaries should become highly dense and concise yet self-contained, e.g., easily understood without the Article.
- Missing entities can appear anywhere in the new summary
- Never drop entities from the previous summary. If space cannot be made, add fewer new entities.
""",
},
{"role": "user", "content": f"Here is the Article: {article}"},
{
"role": "user",
"content": f"Here is the previous summary: {summary_chain[-1]}",
},
*missing_entity_message,
],
max_retries=3,
max_tokens=1000,
response_model=RewrittenSummary,
)
summary_chain.append(new_summary.summary)
prev_summary = new_summary

return summary_chain

The summarize_article function is still quite long and complex. Consider breaking it down into smaller helper functions to improve readability and maintainability. For example, you could create separate functions for generating the initial summary and the rewritten summaries.
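One possible shape for that refactor, with the two OpenAI calls stubbed out so only the control flow is shown. All helper names here are illustrative, not from the repository, and the stub bodies would be replaced by the real `openai.chat.completions.create` calls.

```python
# Sketch of the suggested decomposition: each stage of the chain becomes its
# own helper, and summarize_article just orchestrates them. The `_call_*`
# functions are trivial stand-ins for the real API calls so the sketch runs
# without an API key.

def _generate_initial_summary(article: str) -> str:
    return _call_initial_model(article)  # wraps the first (verbose-summary) API call

def _rewrite_summary(article: str, previous: str, missing: list[str]) -> tuple[str, list[str]]:
    return _call_rewrite_model(article, previous, missing)  # wraps the densifying API call

def summarize_article(article: str, summary_steps: int = 3) -> list[str]:
    chain = [_generate_initial_summary(article)]
    missing: list[str] = []
    for _ in range(summary_steps):
        summary, missing = _rewrite_summary(article, chain[-1], missing)
        chain.append(summary)
    return chain

# Trivial stand-ins so the sketch is runnable:
def _call_initial_model(article: str) -> str:
    return f"verbose summary of: {article}"

def _call_rewrite_model(article: str, previous: str, missing: list[str]) -> tuple[str, list[str]]:
    return f"denser({previous})", ["entity"]

print(summarize_article("demo", summary_steps=2))
```

Each helper can then be unit-tested in isolation by swapping the stub for a mock, which is much harder with a single 60-line function.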

- chain of density
- finetuneing
- gpt-3.5-turbo
- distilation

+1 whoops


@jxnl Thank you for acknowledging the correction. Please proceed to commit the suggested change to fix the typographical error.


# Implementing Chain Of Density

By the end of this article, you'll have a fine-tuned GPT 3.5 model which can take in source text and output a summary which mimics a chain-of-density summarization using `Instructor`'s new jobs cli feature. As usual, all of the code is avaliable for reference under our `examples/chain-of-density` folder. We've also uploaded all our generated data to Hugging Face [here](https://huggingface.co/datasets/ivanleomk/gpt4-chain-of-density) for you to use if you'd like to try reproducing these experiments.

Typographical error.

- As usual, all of the code is avaliable for reference under our `examples/chain-of-density` folder.
+ As usual, all of the code is available for reference under our `examples/chain-of-density` folder.



## Implementation

Note that our implementation uses a validator to ensure that the rewritten summary has a minimum length rather than a prompt. As a result, we match the original paper on entity count but not entity density.

The sentence is not clear. It would be better to clarify what is meant by "entity count" and "entity density".

Let's first segregate our train and test set so that we don't have any sort of contamination - this corresponds to our `train.csv` and `test.csv` in our [Hugging Face Dataset](https://huggingface.co/datasets/ivanleomk/gpt4-chain-of-density). Now, we just need to import the `Instructions` module from the `Instructor` package which allows you to generate a nicely formatted `.jsonl` file to be used for fine-tuning



Typographical error.

- from chain_of_density import summarize_article #(1)!
+ from chain_of_density import summarize_article  # (1)!



!!! note

Using a fine-tuned model was able to reduce the inference time by almost 20-40x while keeping entity density relatively constant. At the same time, our costs dropped by almost 10x when compared against `GPT-4 Turbo` and by almost 150x when compared against GPT-4. This is a strong nod to the power of fine-tuning and generating small models to perform niche tasks.

@coderabbitai coderabbitai bot Nov 9, 2023


Typographical error.

- Using a fine-tuned model was able to reduce the inference time by almost 20-40x while keeping entity density relatively constant. At the same time, our costs dropped by almost 10x when compared against `GPT-4 Turbo` and by almost 150x when compared against GPT-4. This is a strong nod to the power of fine-tuning and generating small models to perform niche tasks.
+ Using a fine-tuned model, we were able to reduce the inference time by almost 20-40x while keeping entity density relatively constant. At the same time, our costs dropped by almost 10x when compared against `GPT-4 Turbo` and by almost 150x when compared against GPT-4. This is a strong nod to the power of fine-tuning and generating small models to perform niche tasks.



+1


@jxnl Thank you for your approval. I will proceed to make the suggested change.


## Conclusion

There's a large area of improvement that you could take on to improve the performance of this fine-tuned model and the quality of the dataset. We chose the first 100 rows as training data for convinience but there's really a whole variety of other ways to improve it

Typographical error.

- There's a large area of improvement that you could take on to improve the performance of this fine-tuned model and the quality of the dataset. We chose the first 100 rows as training data for convinience but there's really a whole variety of other ways to improve it
+ There's a large area of improvement that you could take on to improve the performance of this fine-tuned model and the quality of the dataset. We chose the first 100 rows as training data for convenience but there's really a whole variety of other ways to improve it.



@coderabbitai coderabbitai bot left a comment


Review Status

Actionable comments generated: 1

Configuration used: CodeRabbit UI

Files that changed from the base of the PR and between f4068a9 and 1d688d6.
Files selected for processing (1)
  • examples/chain-of-density/finetune.py (1 hunks)
Additional comments: 2
examples/chain-of-density/finetune.py (2)
  • 27-32: The previous comment about storing the result of the distil_summarization function is still valid. Also, ensure that the CSV file "test.csv" exists and is in the correct format.
-    for article, summary in reader:
-        distil_summarization(article)
+    summaries = []
+    for article, _ in reader:
+        summaries.append(distil_summarization(article))
  • 21-24: The distil_summarization function is correctly defined and returns a GeneratedSummary object. However, ensure that the summarize_article function from the chain_of_density module is correctly implemented and tested.

Comment on lines 1 to 14
from typing import List
from chain_of_density import summarize_article
import csv
import logging
import instructor
from pydantic import BaseModel

logging.basicConfig(level=logging.INFO)

instructions = instructor.Instructions(
name="Chain Of Density",
finetune_format="messages",
log_handlers=[logging.FileHandler("summarization.jsonl")],
)

The Instructions object is created but not used anywhere in the code. If it's not used, consider removing it to avoid confusion.

- instructions = instructor.Instructions(
-     name="Chain Of Density",
-     finetune_format="messages",
-     log_handlers=[logging.FileHandler("summarization.jsonl")],
- )


- chain of density
- finetuneing
- gpt-3.5-turbo
- distilation
Collaborator

+1 whoops

@coderabbitai bot left a comment

Review Status

Actionable comments generated: 3

Configuration used: CodeRabbit UI

Commits: files that changed from the base of the PR, between 1d688d6 and 332a0b4.
Files selected for processing (3)
  • docs/blog/posts/chain-of-density.md (1 hunks)
  • examples/chain-of-density/Readme.md (1 hunks)
  • examples/chain-of-density/chain_of_density.py (1 hunks)
Files skipped from review due to trivial changes (1)
  • examples/chain-of-density/Readme.md
Additional comments: 11
examples/chain-of-density/chain_of_density.py (3)
  • 1-6: Ensure that all the imported modules are used in the code. Unused imports can lead to confusion and unnecessary overhead.

  • 7-7: The instructor.patch(OpenAI()) seems to be a custom function. Ensure that it is defined and imported correctly.

  • 78-144: The summarize_article function seems to be well-structured and logically sound. However, ensure that the OpenAI API calls are correctly implemented and that the API keys are securely stored and accessed.

docs/blog/posts/chain-of-density.md (8)
  • 98-107: The Pydantic model InitialSummary is well defined with a clear docstring and field description. The use of Pydantic's Field function to provide a description for the summary field is a good practice for code readability and maintainability.

  • 115-142: The Pydantic model RewrittenSummary is well defined with a clear docstring and field descriptions. The use of Pydantic's Field function to provide descriptions for the summary, absent, and missing fields is a good practice for code readability and maintainability.

  • 153-180: The use of Pydantic validators to enforce business rules on the summary, missing, and absent fields is a good practice. It ensures that the data adheres to the expected format and constraints, improving the robustness of the code.

  • 190-262: The summarize_article function is well implemented. It uses the OpenAI API to generate summaries and uses the Pydantic models and validators to ensure the summaries adhere to the expected format and constraints. The use of a loop to generate a chain of summaries is a good practice. The code is readable and maintainable.

  • 292-346: The use of the instructor library to generate a .jsonl file for fine-tuning is a good practice. It simplifies the process of preparing data for fine-tuning and ensures the data is in the correct format. The use of the distil decorator to automatically capture the input and output of the function is a good practice. It simplifies the process of generating the .jsonl file.

  • 359-363: The use of the distil decorator with the model and mode parameters to use the fine-tuned model is a good practice. It simplifies the process of using the fine-tuned model in the function.

  • 384-400: The benchmarking results are well presented in a table format, making it easy to compare the performance of the different models. The use of metrics such as mean latency, mean entity count, mean entity density, and total cost provides a comprehensive view of the performance of the models.

  • 413-424: The conclusion provides a good summary of the article and offers suggestions for further improvements. This is a good practice as it encourages readers to explore the topic further.
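As an aside on the `.jsonl` file mentioned above: each line is a chat-format fine-tuning record. A minimal sketch of the standard OpenAI "messages" shape (the exact metadata `instructor`'s `distil` decorator emits may differ, and the prompt strings here are illustrative only):

```python
import json

# One fine-tuning record: a single JSON object holding a "messages" array,
# serialized onto one line of the .jsonl file.
record = {
    "messages": [
        {"role": "system", "content": "Generate a concise, entity-dense summary."},
        {"role": "user", "content": "Here is the Article: ..."},
        {"role": "assistant", "content": "A dense, self-contained summary..."},
    ]
}

line = json.dumps(record)  # json.dumps produces no newlines, so this is one .jsonl line
print(line[:60])
```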

examples/chain-of-density/chain_of_density.py (outdated; thread resolved)
Comment on lines +2 to +14
draft: False
date: 2023-11-05
tags:
- pydantic
- validation
- chain of density
- finetuneing
- gpt-3.5-turbo
- distilation
authors:
- ivanleomk
- jxnl
---
Contributor

The metadata of the blog post seems fine. However, there are a couple of typos in the tags. "finetuneing" should be "finetuning" and "distilation" should be "distillation".

-  - finetuneing
+  - finetuning
-  - distilation
+  - distillation



# Implementing Chain Of Density

By the end of this article, you'll have a fine-tuned GPT 3.5 model which can take in source text and output a summary which mimics a chain-of-density summarization using `Instructor`'s new jobs cli feature. As usual, all of the code is avaliable for reference under our `examples/chain-of-density` folder.
Contributor

There's a typo in the word "available". It should be corrected.

- all of the code is avaliable for reference under our `examples/chain-of-density` folder.
+ all of the code is available for reference under our `examples/chain-of-density` folder.


@coderabbitai bot left a comment

Review Status

Actionable comments generated: 13

Configuration used: CodeRabbit UI

Commits: files that changed from the base of the PR, between 332a0b4 and ae14920.
Files ignored due to filter (1)
  • mkdocs.yml
Files selected for processing (1)
  • docs/blog/posts/chain-of-density.md (1 hunks)
Additional comments: 1
docs/blog/posts/chain-of-density.md (1)
  • 370-370: Typo in "@instructions.distil(model='gpt-3.5-turbo:finetuned-123', mode="dispatch") #(1)!". It should be "@instructions.distil(model='gpt-3.5-turbo:finetuned-123', mode="dispatch") # (1)!".

- pydantic
- validation
- chain of density
- finetuneing
Contributor

Typo in the tag "finetuneing". It should be "finetuning".

-  - finetuneing
+  - finetuning


- chain of density
- finetuneing
- gpt-3.5-turbo
- distilation
Contributor

Typo in the tag "distilation". It should be "distillation".

-  - distilation
+  - distillation



# Better Summaries by Finetuning Chain of Density

> Discover how to distil an interative method like chain of density into a single finetune.
Contributor

Typo in "interative". It should be "iterative".

- > Discover how to distil an interative method like chain of density into a single finetune.
+ > Discover how to distil an iterative method like chain of density into a single finetune.



> Discover how to distil an interative method like chain of density into a single finetune.

In this article, we'll guide you through implementing the original Chain of Density method using Instructor, then show how to distile a GPT 3.5 model to match GPT-4's iterative summarization capabilities. Using these methods were able to increase latency by 40x, reduce costs by 10x and maintain entity density. Showing massive efficiency gains by finetuning and distiling capabilities into specialized models.
Contributor

Typo in "were able to increase latency". It should be "were able to decrease latency".

- Using these methods were able to increase latency by 40x, reduce costs by 10x and maintain entity density.
+ Using these methods were able to decrease latency by 40x, reduce costs by 10x and maintain entity density.



### Original Prompt

We can implement the original prompt using `pip install instructor` by breaking down the entire process into smaller api calls. This allows us to introduce validation at each step to ensure that we're getting the results that we want.
Contributor

Typo in "pip install instructor". It should be "pip install instructor".

- We can implement the original prompt using `pip install instructor` by breaking down the entire process into smaller api calls.
+ We can implement the original prompt using `pip install instructor` by breaking down the entire process into smaller API calls.



logging.basicConfig(level=logging.INFO)

instructions = instructor.Instructions( #(2)!
Contributor

Typo in "instructions = instructor.Instructions( #(2)!". It should be "instructions = instructor.Instructions() # (2)!".

- instructions = instructor.Instructions( #(2)!
+ instructions = instructor.Instructions() # (2)!


class GeneratedSummary(BaseModel):
    summary: str

@instructions.distil #(3)!
Contributor

Typo in "@instructions.distil #(3)!". It should be "@instructions.distil # (3)!".

- @instructions.distil #(3)!
+ @instructions.distil # (3)!



Now that we have our models and the rough flow figured out, let's implement a function to summarize a piece of text using `Chain Of Density` summarization.

```py hl_lines="4 9-24 38-68"
Contributor

Typo in "py hl_lines="4 9-24 38-68"". It should be "python hl_lines="4 9-24 38-68"".

- ```py hl_lines="4 9-24 38-68"
+ ```python hl_lines="4 9-24 38-68"


!!! warning "Rate Limiting"

We recommend running this script on a small subset of the dataset first to test you've got everything configured nicely.
Contributor

Typo in "We recommend running this script on a small subset of the dataset first to test you've got everything configured nicely.". It should be "We recommend running this script on a small subset of the dataset first to test if you've got everything configured nicely.".

- We recommend running this script on a small subset of the dataset first to test you've got everything configured nicely.
+ We recommend running this script on a small subset of the dataset first to test if you've got everything configured nicely.



Once we run this script, we'll have a new file called `generated.jsonl` in our local repository. Now all that's left is to run the command below to start fine-tuning your first model!

```sh
Contributor

Typo in "sh". It should be "bash".

- ```sh
+ ```bash

@coderabbitai bot left a comment

Review Status

Actionable comments generated: 2

Configuration used: CodeRabbit UI

Commits: files that changed from the base of the PR, between ae14920 and 77104bd.
Files ignored due to filter (1)
  • docs/blog/posts/img/chain-of-density.png
Files selected for processing (3)
  • docs/blog/posts/chain-of-density.md (1 hunks)
  • examples/chain-of-density/chain_of_density.py (1 hunks)
  • examples/chain-of-density/finetune.py (1 hunks)
Files not reviewed due to errors (1)
  • docs/blog/posts/chain-of-density.md (Error: diff too large)
Additional comments: 13
examples/chain-of-density/finetune.py (6)
  • 1-6: Imports are correctly placed and organized according to PEP8 guidelines.

  • 8-8: Logging level is set to INFO. Ensure that this level of logging is appropriate for your use case.

  • 10-17: The Instructions object is correctly initialized with appropriate parameters.

  • 20-35: The GeneratedSummary class is well-documented and correctly uses Pydantic for data validation.

  • 38-41: The distil_summarization function is correctly decorated and returns a GeneratedSummary object. Ensure that the summarize_article function returns a list of strings.

  • 44-48: The script reads from a CSV file and processes each article. Ensure that the CSV file is correctly formatted and contains the necessary data.

examples/chain-of-density/chain_of_density.py (7)
  • 1-7: Ensure that all the imported modules are used in the code. Unused imports can lead to confusion and unnecessary dependencies.

  • 12-23: The InitialSummary class is well defined with clear documentation and field descriptions.

  • 24-49: The RewrittenSummary class is well defined with clear documentation and field descriptions. It also includes field validators to ensure the quality of the summary.

  • 51-67: The min_entity_density validator ensures that the summary has a minimum entity density. This is a good practice to maintain the quality of the summary.

  • 69-77: The min_length validator ensures that the summary has a minimum length. This is a good practice to maintain the quality of the summary.

  • 79-85: The has_missing_entities validator ensures that there are missing entities identified for the next summary. This is a good practice to maintain the quality of the summary.

  • 87-95: The has_no_absent_entities validator ensures that no entities from the previous summary are absent in the new summary. This is a good practice to maintain the quality of the summary.
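The floor that `min_entity_density` enforces is simply entities divided by tokens. A minimal sketch of that computation (whitespace splitting stands in for the real tokenizer, e.g. nltk's `word_tokenize`, used in the actual validator):

```python
def entity_density(summary: str, entity_count: int) -> float:
    """Entities per token; a validator can reject summaries below a floor such as 0.08."""
    tokens = summary.split()  # stand-in for nltk.word_tokenize
    return entity_count / len(tokens) if tokens else 0.0

# 4 entities spread over a 50-token summary:
print(entity_density(" ".join(["token"] * 50), 4))  # 0.08
```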

examples/chain-of-density/chain_of_density.py (thread resolved)
Comment on lines +98 to +151
def summarize_article(article: str, summary_steps: int = 3):
summary_chain = []
# We first generate an initial summary
summary: InitialSummary = client.chat.completions.create(
model="gpt-4-0613",
response_model=InitialSummary,
messages=[
{
"role": "system",
"content": "Write a summary about the article that is long (4-5 sentences) yet highly non-specific. Use overly, verbose language and fillers(eg.,'this article discusses') to reach ~80 words. ",
},
{"role": "user", "content": f"Here is the Article: {article}"},
{
"role": "user",
"content": "The generated summary should be about 80 words.",
},
],
max_retries=2,
)
summary_chain.append(summary.summary)
for i in range(summary_steps):
new_summary: RewrittenSummary = client.chat.completions.create(
model="gpt-4-0613",
messages=[
{
"role": "system",
"content": f"""
Article: {article}
You are going to generate an increasingly concise,entity-dense summary of the following article.

Perform the following two tasks
- Identify 1-3 informative entities from the following article which is missing from the previous summary
- Write a new denser summary of identical length which covers every entity and detail from the previous summary plus the Missing Entities

Guidelines
- Make every word count: re-write the previous summary to improve flow and make space for additional entities
- Make space with fusion, compression, and removal of uninformative phrases like "the article discusses".
- The summaries should become highly dense and concise yet self-contained, e.g., easily understood without the Article.
- Missing entities can appear anywhere in the new summary
- Never drop entities from the previous summary. If space cannot be made, add fewer new entities.
""",
},
{
"role": "user",
"content": f"Here is the previous summary: {summary_chain[-1]}",
},
],
max_retries=5,
max_tokens=1000,
response_model=RewrittenSummary,
)
summary_chain.append(new_summary.summary)

return summary_chain
Contributor

The summarize_article function is well defined and uses the OpenAI API to generate initial and rewritten summaries. It also includes error handling with retries. However, the function could be broken down into smaller functions for better readability and maintainability.
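One way to act on this suggestion is to pull each API call into its own helper and keep `summarize_article` as just the loop. A rough sketch with hypothetical helper names (prompts abridged; `response_model`, `max_retries`, and validation details omitted — with instructor's patched client and a `response_model`, the call returns the Pydantic object, so `.summary` is available):

```python
def get_initial_summary(client, article: str) -> str:
    """First pass: a deliberately verbose, non-specific ~80-word summary."""
    response = client.chat.completions.create(
        model="gpt-4-0613",
        messages=[
            {"role": "system", "content": "Write a long (4-5 sentence) but non-specific summary."},
            {"role": "user", "content": f"Here is the Article: {article}"},
        ],
    )
    return response.summary


def densify(client, article: str, previous_summary: str) -> str:
    """One rewrite step: identical length, more entities, no dropped entities."""
    response = client.chat.completions.create(
        model="gpt-4-0613",
        messages=[
            {"role": "system", "content": f"Article: {article}\nRewrite the summary to be denser."},
            {"role": "user", "content": f"Here is the previous summary: {previous_summary}"},
        ],
    )
    return response.summary


def summarize_article(client, article: str, summary_steps: int = 3) -> list:
    """Build the chain: one initial summary followed by `summary_steps` rewrites."""
    chain = [get_initial_summary(client, article)]
    for _ in range(summary_steps):
        chain.append(densify(client, article, chain[-1]))
    return chain
```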

@jxnl jxnl merged commit 9139857 into main Nov 12, 2023
@jxnl jxnl deleted the chain-of-density branch November 12, 2023 15:12

## Part 1) Chain of Density

Summarizing extensive texts with AI can be challenging, often relying on inconsistent techniques. Salesforce AI Research's novel method, chain of density, enhances AI-based text summarization, outperforming human-generated summaries.


[NIT] "chain of density" -> "Chain of Density" for consistency.

Collaborator Author

Fixed up


Let's start by walking through some of the data models that we'll be using as the `response_model` for our open ai function calls

Firstly, we'll need a data model for the initial summary that we will be generating. We'll take the description of this class straight from the original prompt. Its important to note that these docstrings serve a purpose, they are directly used by the LLM when generating the outputs.


[NIT] "Its" -> "It's"


Its important to note that these docstrings serve a purpose, they are directly used by the LLM when generating the outputs.

Make it clearer why the docstrings are important. I think it's used to validate that initial and rewritten summaries contain the expected Pydantic fields (summary, absent, missing). If so, then do we need such extensive docstrings? Also, the docstrings seem duplicative of the system content in the examples below.

Please correct my understanding if it's wrong. Either way, I think it's an opportunity to educate users on how docstrings come into play in Instructor.

Collaborator Author

To my understanding, the docstrings are used in the function call parameters (e.g. below), so a more descriptive docstring helps guide the eventual output by cleanly specifying what you want.

```json
{
  "functions": [
    {
      "name": "GeneratedSummary",
      "description": "This represents a highly concise summary that includes as many entities as possible from the original source article.\n\nAn Entity is a real-world object that's assigned a name - for example, a person, country a product or a book title.\n\nGuidelines\n- Make every word count\n- The new summary should be highly dense and concise yet self-contained, eg., easily understood without the Article.\n- Make space with fusion, compression, and removal of uninformative phrases like \"the article discusses\"",
      "parameters": {
        "properties": {
          "summary": {
            "description": "This represents the final summary generated that captures the meaning of the original article which is as concise as possible. ",
            "title": "Summary",
            "type": "string"
          }
        },
        "required": ["summary"],
        "type": "object"
      }
    }
  ]
}
```
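The underlying mechanism can be seen with the standard library alone: the class docstring is the raw material for the schema's "description" field above (a simplified illustration; instructor and Pydantic do additional processing on the way to the final schema):

```python
import inspect

class GeneratedSummary:
    """This represents a highly concise summary that includes as many
    entities as possible from the original source article."""

# inspect.getdoc dedents the docstring; this string is what feeds the
# "description" field of the function schema sent to OpenAI.
print(inspect.getdoc(GeneratedSummary))
```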

import instructor
from pydantic import BaseModel

logging.basicConfig(level=logging.INFO) #(2)!


This feels like an unnecessary detail that the user has to be careful not to trip over—perhaps instructor.Instructions can have its own logging handler so the user doesn't have to even be aware of it?

Collaborator Author

@jxnl thoughts?

| Model | Mean Latency (s) | Mean Entity Count | Mean Entity Density | Tokens |
| ------------------- | ---------------- | ----------------- | ------------------- | ------ |
| GPT-4 (COD) | 49.5 | 11.3 | 0.138 | 81.65 |
| GPT-3 (COD) | 145.94 | 11.05 | 0.105 | 105.7 |


Is this gpt-3 or gpt-3.5? If the former, I wonder if we can add a benchmark for gpt-3.5? If the latter, make it clearer.

Also, we can make the benefits of COD + distillation more convincing by adding benchmarks for gpt-4 and gpt-3.5 without COD—how much lift does this additional effort buy us? This will help users prioritize between different approaches to improve their summaries.

Collaborator Author

Ah, this is GPT-3.5, my bad on this. I took this out and replaced it with the benchmarks I calculated for a Vanilla summary using GPT 3.5 that just asked for a concise summary.

| ------------------- | ---------------- | ----------------- | ------------------- | ------ |
| GPT-4 (COD) | 49.5 | 11.3 | 0.138 | 81.65 |
| GPT-3 (COD) | 145.94 | 11.05 | 0.105 | 105.7 |
| 3.5 Finetuned (20) | 2.25 | 14.7 | 0.154 | 95.45 |


Comparing finetuning on 20 summaries to the rows below, it seems that finetuning only on 20 summaries had the highest absolute entity count and density? Hmm, why is that? Might want to add a few hypotheses even if you don't have the answers.

Collaborator Author

Added it in a new branch! But my hypotheses are that

  1. Model might not be benefitting from the higher number of examples due to the epochs ( 20, 50 and 76 are all trained with 4 epochs which is the default number provided by OpenAI)
  2. Larger variety of examples might cause the model to optimize for different objectives - not just for token density. The COD summarization method tends to produce more abstract summaries with each rewrite, so it might be learning other metrics under the hood to optimize for
