Chain of density #135
Conversation
Walkthrough

The changes introduce a new "Chain of Density" summarization technique using GPT-3.5 and the Instructor library. The technique involves generating a chain of summaries with increasing density. The implementation includes data models for initial and rewritten summaries, validators, and a function to generate the summaries. The changes also include instructions for fine-tuning the model, evaluating the quality of summaries, and a blog post explaining the technique.
```python
class GeneratedSummary(BaseModel):
    summary: str


@instructions.distil
def distil_summarization(text: str) -> GeneratedSummary:
    summary_chain: List[str] = summarize_article(text)
    print(summary_chain)
    return GeneratedSummary(summary=summary_chain[-1])
```
The `GeneratedSummary` class and `distil_summarization` function are well-defined. However, there is no error handling for the case where `summarize_article(text)` returns an empty list. This would cause an `IndexError` at line 31. Consider adding a check for this.

```diff
+    if not summary_chain:
+        raise ValueError("No summaries generated for the provided text.")
```
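For context, the guarded flow can be sketched end-to-end. This is a sketch only: `summarize_article` is stubbed and a plain dataclass stands in for the Pydantic model, so it runs without the real API or library dependencies.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GeneratedSummaryStub:
    summary: str


def summarize_article_stub(text: str) -> List[str]:
    # Stub: the real function returns a chain of increasingly dense summaries.
    return [] if not text.strip() else [text]


def distil_summarization_stub(text: str) -> GeneratedSummaryStub:
    summary_chain = summarize_article_stub(text)
    # Guard before indexing the last element of the chain.
    if not summary_chain:
        raise ValueError("No summaries generated for the provided text.")
    return GeneratedSummaryStub(summary=summary_chain[-1])
```

With the guard in place, an empty chain surfaces as a clear `ValueError` rather than an `IndexError` deep inside the indexing expression.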
```python
# Read in the csv file we have
with open("output.csv", "r") as file:
    reader = csv.reader(file)

    for article, summary in islice(reader, 1, 10):
        distil_summarization(article)
```
The CSV file is read and processed correctly. However, there is no error handling for the case where the file does not exist or cannot be opened. Consider adding a try-except block to handle potential `FileNotFoundError` or `IOError`.

```diff
+try:
     with open("output.csv", "r") as file:
         reader = csv.reader(file)
         for article, summary in islice(reader, 1, 10):
             distil_summarization(article)
+except (FileNotFoundError, IOError):
+    logging.error("Failed to open output.csv")
```
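The guarded read can be sketched as a small self-contained function. The helper name and the `limit` parameter are illustrative, not part of the PR; the `distil_summarization` call is replaced by a counter so the sketch runs standalone.

```python
import csv
import logging
from itertools import islice


def process_articles(path: str, limit: int = 10) -> int:
    """Read (article, summary) rows with basic I/O error handling."""
    processed = 0
    try:
        with open(path, "r", newline="") as file:
            reader = csv.reader(file)
            # islice(reader, 1, limit) skips the header row.
            for article, summary in islice(reader, 1, limit):
                processed += 1  # distil_summarization(article) would run here
    except (FileNotFoundError, IOError):
        logging.error("Failed to open %s", path)
    return processed
```

A missing file then logs an error and yields zero processed rows instead of crashing the script.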
```python
class MissingEntity(BaseModel):
    """
    An entity is a real-world object that's assigned a name - for example, a person, a country, a product or a book title.

    A missing entity is:
    - relevant to the main story,
    - specific yet concise (5 words or fewer),
    - novel (not in the previous summary),
    - faithful (present in the article),
    - anywhere (can be located anywhere in the article).
    """

    entity_name: str = Field(
        ...,
        description="This is the associated name with the entity that exists in the text",
    )
    reason: str = Field(
        ...,
        description="This is a short sentence which describes why we should include this new entity in the rewritten abstract",
    )
```
The `MissingEntity` class is well defined with clear comments and field descriptions. However, consider adding type hints for the fields to improve code readability and maintainability.

```diff
-    entity_name: str = Field(
+    entity_name: str = Field[str](
         ...,
         description="This is the associated name with the entity that exists in the text",
     )
-    reason: str = Field(
+    reason: str = Field[str](
         ...,
         description="This is a short sentence which describes why we should include this new entity in the rewritten abstract",
     )
```
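One caveat worth noting about the suggested `Field[str]` syntax: in released Pydantic versions, `Field` is a plain function rather than a generic, so subscripting it raises a `TypeError`. The type information already lives in the annotation on the left-hand side, so a runnable version of the model keeps the original form:

```python
from pydantic import BaseModel, Field


class MissingEntity(BaseModel):
    """An entity present in the article but missing from the previous summary."""

    # The annotation (str) is the type hint; Field() itself only carries metadata.
    entity_name: str = Field(
        ...,
        description="This is the associated name with the entity that exists in the text",
    )
    reason: str = Field(
        ...,
        description="This is a short sentence which describes why we should include this new entity in the rewritten abstract",
    )
```

Static checkers and Pydantic's own validation both read the annotation, so no readability is lost by leaving `Field(...)` unparameterized.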
```python
class OmittedEntity(BaseModel):
    """
    An entity is a real-world object that's assigned a name - for example, a person, a country, a product or a book title.
    """

    entity_name: str = Field(
        ...,
        description="This is an entity which was present in the previous summary and not in the newly generated summary",
    )
```
The `OmittedEntity` class is well defined with clear comments and field descriptions. However, consider adding type hints for the fields to improve code readability and maintainability.

```diff
-    entity_name: str = Field(
+    entity_name: str = Field[str](
         ...,
         description="This is an entity which was present in the previous summary and not in the newly generated summary",
     )
```
```python
class MaybeOmittedEntities(BaseModel):
    """
    This represents whether the new summary has omitted any entities that were present in the previous summary provided.
    """

    omitted_entities: Optional[List[OmittedEntity]] = Field(default=[])
    message: Optional[str] = Field(default=None)
```
The `MaybeOmittedEntities` class is well defined with clear comments and field descriptions. However, consider adding type hints for the fields to improve code readability and maintainability.

```diff
-    omitted_entities: Optional[List[OmittedEntity]] = Field(default=[])
+    omitted_entities: Optional[List[OmittedEntity]] = Field[Optional[List[OmittedEntity]]](default=[])
-    message: Optional[str] = Field(default=None)
+    message: Optional[str] = Field[Optional[str]](default=None)
```
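A small optional hardening, offered as a suggestion rather than something the PR requires: for mutable defaults like `[]`, `default_factory=list` makes the per-instance copy explicit. A sketch of that variant:

```python
from typing import List, Optional

from pydantic import BaseModel, Field


class OmittedEntity(BaseModel):
    entity_name: str


class MaybeOmittedEntities(BaseModel):
    # default_factory creates a fresh list for each instance,
    # so instances never share mutable state.
    omitted_entities: List[OmittedEntity] = Field(default_factory=list)
    message: Optional[str] = Field(default=None)
```

Pydantic already copies plain defaults per instance, so this is a readability choice more than a bug fix.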
```python
class InitialSummary(BaseModel):
    """
    This is an initial summary which should be long (4-5 sentences, ~80 words) yet highly non-specific, containing little information beyond the entities marked as missing. Use overly verbose language and fillers (e.g. "This article discusses") to reach ~80 words.
    """

    summary: str = Field(
        ...,
        description="This is a summary of the article provided which is overly verbose and has fillers to reach ~80 words",
    )
```
The `InitialSummary` class is well defined with clear comments and field descriptions. However, consider adding type hints for the fields to improve code readability and maintainability.

```diff
-    summary: str = Field(
+    summary: str = Field[str](
         ...,
         description="This is a summary of the article provided which is overly verbose and has fillers to reach ~80 words",
     )
```
```python
def rewrite_summary(
    article: str,
    existing_summary: str,
    entity_ctx: str,
    error_msgs: List[str] = [],
    remaining_retries=3,
):
    # We then perform a new summary and validate that the entity density
    # has increased (we have not lost any entities)
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-16k",
        functions=[RewrittenSummary.openai_schema],
        function_call={"name": RewrittenSummary.openai_schema["name"]},
        max_retries=2,
        messages=[
            {
                "role": "system",
                "content": "You are about to be given an article, an existing summary of the article and some new entities. Please use the information to rewrite the summary to make it denser and more concise so that it covers every entity and detail from the previous summary plus the missing entities",
            },
            {"role": "user", "content": f"Here is the article : {article}"},
            {
                "role": "user",
                "content": f"Here is the most recent article : {existing_summary}",
            },
            {
                "role": "user",
                "content": f"Here is some information on entities you should include in the rewritten summary: {entity_ctx}",
            },
            *error_msgs,
        ],
    )
    try:
        new_summary = RewrittenSummary.from_response(
            completion, validation_context={"prev_summary": existing_summary}
        )
        return new_summary
    except (ValidationError, JSONDecodeError) as e:
        if remaining_retries == 0:
            raise e
        error_msgs = []
        error_msgs.append(dict(**completion.choices[0].message))
        error_msgs.append(
            {
                "role": "user",
                "content": f"Recall the function correctly, exceptions found\n{e}",
            }
        )
        return rewrite_summary(
            article,
            existing_summary,
            entity_ctx,
            error_msgs,
            remaining_retries=remaining_retries - 1,
        )
```
The `rewrite_summary` function is well defined with clear comments. It uses the OpenAI API to generate a new summary and validates that the entity density has increased. It also handles exceptions properly. However, consider adding type hints for the parameters and the return type to improve code readability and maintainability.

```diff
 def rewrite_summary(
     article: str,
     existing_summary: str,
     entity_ctx: str,
     error_msgs: List[str] = [],
     remaining_retries=3,
-):
+) -> Union[RewrittenSummary, None]:
```
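Separately from the type-hint point, the retry logic in `rewrite_summary` can be isolated as a small reusable pattern. This sketch assumes generation and parsing are injectable callables (names and signatures here are illustrative, not from the PR), and unlike the original it accumulates every error message across retries rather than keeping only the latest:

```python
from typing import Callable, Dict, List


def call_with_retries(
    generate: Callable[[List[dict]], str],
    parse: Callable[[str], Dict],
    messages: List[dict],
    remaining_retries: int = 3,
) -> Dict:
    """Generate, parse, and on failure feed the error back as extra messages."""
    raw = generate(messages)
    try:
        return parse(raw)
    except ValueError as e:
        if remaining_retries == 0:
            raise
        # Append the failed output plus a correction request, then recurse.
        retry_messages = messages + [
            {"role": "assistant", "content": raw},
            {"role": "user", "content": f"Recall the function correctly, exceptions found\n{e}"},
        ]
        return call_with_retries(generate, parse, retry_messages, remaining_retries - 1)
```

The recursion bottoms out by re-raising once the retry budget is exhausted, mirroring the `remaining_retries == 0` branch in the original.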
Review Status
Actionable comments generated: 0
Configuration used: CodeRabbit UI
Files selected for processing (1)
- examples/chain-of-density/run.py (1 hunks)
Files skipped from review due to trivial changes (1)
- examples/chain-of-density/run.py
How close are the prompts here to the original prompts?

The prompts here are based on the original prompts from the paper and are modified slightly. There are a bunch of them which I came up with on my own, but here are the rough annotations from the original paper:
- Initial Summary comes from the line about
- Rewritten Prompts come from the few paragraph points about
- I rephrased the third point about the highly dense and concise summaries and used it as the description for the Pydantic class
- Entity Definition comes from spaCy
- Missing Entities Task comes from step 1 of the prompt
Review Status
Actionable comments generated: 9
Configuration used: CodeRabbit UI
Files ignored due to filter (1)
- docs/blog/posts/img/chain-of-density.png
Files selected for processing (10)
- .github/workflows/mkdocs.yml (1 hunks)
- docs/blog/posts/chain-of-density.md (1 hunks)
- examples/chain-of-density/Readme.md (1 hunks)
- examples/chain-of-density/chain_of_density.py (1 hunks)
- examples/chain-of-density/chain_of_density.txt (1 hunks)
- examples/chain-of-density/finetune.py (1 hunks)
- examples/chain-of-density/run.py (1 hunks)
- instructor/dsl/citation.py (1 hunks)
- instructor/dsl/multitask.py (1 hunks)
- instructor/function_calls.py (1 hunks)
Files skipped from review due to trivial changes (4)
- .github/workflows/mkdocs.yml
- examples/chain-of-density/run.py
- instructor/dsl/citation.py
- instructor/function_calls.py
Additional comments: 14
examples/chain-of-density/chain_of_density.txt (1)
- 1-9: Ensure that all these dependencies are compatible with each other and with the existing dependencies in your project. Also, make sure to update your project's documentation to reflect these new dependencies.

examples/chain-of-density/chain_of_density.py (2)
- 1-7: The new imports look fine. Ensure that these packages are included in your project's dependencies.
- 26-53: The `InitialSummary` and `RewrittenSummary` classes look well-structured. The docstrings and field descriptions provide clear explanations of their purpose and usage.

examples/chain-of-density/finetune.py (4)
- 1-7: The imports look fine. Ensure that all the imported modules are used in the code and that they are installed in the environment where this script will run.
- 9-20: The instructor patch and logging setup look fine. Ensure that the logging level is appropriate for your use case and that the log file "generated.jsonl" is being written to the correct location.
- 23-31: The `GeneratedSummary` class and `distil_summarization` function are well defined. Ensure that the `summarize_article` function returns a list as expected and that the last element of this list is the final summary.
- 35-52: The CSV file reading and summarization process look fine. Ensure that the "output.csv" file exists and is in the correct format. Also, ensure that the `compute_metrics` function returns the correct metrics. The division operation at line 53 should be safe as long as `ttl_tokens` is not zero. Consider adding a check to prevent division by zero.

```diff
- print(f"FINAL ET: {ttl_entities/ttl_tokens}")
+ if ttl_tokens > 0:
+     print(f"FINAL ET: {ttl_entities/ttl_tokens}")
+ else:
+     print("No tokens found.")
```

examples/chain-of-density/Readme.md (1)
- 35-35: The TODO comment should be addressed before merging the pull request. If it's not feasible to implement this feature at the moment, consider creating an issue in the repository to track this task.

docs/blog/posts/chain-of-density.md (6)
- 1-11: The metadata of the blog post looks fine. The authors and tags are correctly set.
- 109-136: The Pydantic models `InitialSummary` and `RewrittenSummary` are well defined with appropriate descriptions and fields.
- 147-174: The validators for the fields `summary`, `missing`, and `absent` are correctly implemented. They ensure that the summary has the right length, missing entities are identified, and no entities are absent from the new summary.
- 184-251: The `summarize_article` function is well implemented. It generates an initial summary and then iteratively rewrites the summary to include missing entities. The use of OpenAI's ChatCompletion API is correct and the response models are correctly set. The function also handles retries and token limits.
- 274-308: The script for fine-tuning the model is well implemented. It uses the `Instructor` library to generate a `.jsonl` file for fine-tuning. The use of the `distil` decorator and the `Instruction` object is correct. The script also correctly reads the articles from a CSV file and generates summaries for them.
- 343-347: The update to the `distil_summarization` function to use the fine-tuned model is correctly done. The model id is correctly set in the `distil` decorator.
examples/chain-of-density/Readme.md (Outdated)
# Introduction

This is a simple example which shows how to perform Chain Of Density summarization using GPT-3.5 and utilise the generated output to fine-tune a 3.5 model for production usage.

## Instructions

1. First, install all of the required dependencies by running the command below. We recommend using a virtual environment to install these so that it does not affect your system installation.

> To evaluate the quality of our summaries, we use spaCy and NLTK. You'll need to download the spaCy `en_core_web_trf` package and the nltk `punkt` package to compute the token metrics.

```
pip3 install -r chain_of_density.txt
```

2. Download the dataset using `download.py`. We're using the `griffin/chain_of_density` dataset for this example, so no worries if you don't have a dataset of your own. This should generate a new `.csv` file in the folder called `output.csv`.

```
python3 download.py
```

3. We now need some examples to fine-tune our `3.5` model on. We provide an existing `.jsonl` file to use, or you can generate new ones from the dataset using `finetune.py`.

> Don't forget to set an environment variable `OPENAI_API_KEY` in your shell if you wish to regenerate the examples. You can do so using the command `export OPENAI_API_KEY=<api key>`. We'll use it subsequently down the line for our finetuning step too.

4. Now that we have a `.jsonl` file with a bunch of examples, let's now run a simple finetuning job.

```
instructor jobs create-from-file summarization.jsonl
```

Voila! Now you've got a new GPT-3.5 model that's capable of summarizing text, fine-tuned with Chain Of Density.

TODO: Evaluate the quality of the improved summaries using spaCy's entity counter (so we can calculate entities/tokens).
The instructions are clear and concise. However, it would be helpful to include a brief explanation of what the Chain Of Density summarization technique is and why it's beneficial. This would provide context for users who are unfamiliar with the technique.

```diff
+ ## What is Chain Of Density Summarization?
+
+ Chain Of Density Summarization is a technique that...
```
4. Now that we have a `.jsonl` file with a bunch of examples, let's now run a simple finetuning job

```
instructor jobs create-from-file summarization.jsonl
```
The command `instructor jobs create-from-file summarization.jsonl` seems to be incorrect. It should be `instruct` instead of `instructor`.

```diff
- instructor jobs create-from-file summarization.jsonl
+ instruct jobs create-from-file summarization.jsonl
```
instructor/dsl/multitask.py (Outdated)
```diff
 @staticmethod
 def extract_json(completion):
     for chunk in completion:
-        delta = chunk["choices"][0]["delta"]
-        if "function_call" in delta:
-            yield delta["function_call"]["arguments"]
+        if chunk["choices"]:
+            delta = chunk["choices"][0]["delta"]
+            if "function_call" in delta:
+                if "arguments" in delta["function_call"]:
+                    yield delta["function_call"]["arguments"]
```
The code assumes that `chunk["choices"]` exists and is not None. If it doesn't exist or is None, this will raise a KeyError or TypeError respectively. To avoid this, use the `get` method, which returns None if the key doesn't exist.
- if chunk["choices"]:
+ if chunk.get("choices"):
Also, the code assumes that `chunk["choices"]` is a list and has at least one element. If it's not a list or is an empty list, this will raise an IndexError. To avoid this, check that `chunk["choices"]` is a list and has at least one element.
- if chunk.get("choices"):
+ if isinstance(chunk.get("choices"), list) and len(chunk.get("choices")) > 0:
Migrated and updated with benchmarks that I calculated, and created a Hugging Face data card here - https://huggingface.co/datasets/ivanleomk/gpt4-chain-of-density. I will tidy up a Colab notebook so that people can reproduce my benchmarks using the data (just the visualisations), so you can double-check my figures if you want. Still pending: some library updates with the upgrade to the OpenAI library, and changes to the README to add the Hugging Face dataset links.
Review Status
Actionable comments generated: 10
Configuration used: CodeRabbit UI
Files selected for processing (4)
- docs/blog/posts/chain-of-density.md (1 hunks)
- examples/chain-of-density/chain_of_density.py (1 hunks)
- examples/chain-of-density/finetune.py (1 hunks)
- examples/chain-of-density/requirements.txt (1 hunks)
Files skipped from review due to trivial changes (1)
- examples/chain-of-density/requirements.txt
Additional comments: 11
examples/chain-of-density/chain_of_density.py (3)
- 1-7: The import statements are well-organized and only necessary modules are imported. Good practice.
- 8-8: The OpenAI client is patched using the `instructor` module. Ensure that the `instructor` module is correctly implemented and does not introduce any security vulnerabilities.
- 11-32: The `InitialSummary` and `RewrittenSummary` classes are well-defined with clear docstrings and field descriptions. Good use of Pydantic for data validation.

docs/blog/posts/chain-of-density.md (3)
- 12-13: The order of authors is being discussed. If the order matters, please verify and adjust accordingly.
- 394-394: The note about the benefits of fine-tuning is insightful and well written.
- 402-407: The suggestions for further improvements are well thought out and clearly explained.

examples/chain-of-density/finetune.py (5)
- 1-8: Imports and setup look good. Ensure that all the imported modules are used in the code.
- 9-9: The `instructor.patch()` function is used to patch the OpenAI client. Ensure that the `instructor` library is compatible with the OpenAI library and that the patching process doesn't introduce any unexpected behavior.
- 13-17: The `Instructions` object is created with a name, format, and log handlers. Ensure that the log file path is correct and that the file has write permissions.
- 20-22: The `GeneratedSummary` class is defined with a single attribute `summary`. This class is used to return the summary from the `distil_summarization` function. The use of Pydantic's `BaseModel` ensures that the data is validated and serialized/deserialized correctly.
- 24-27: The `distil_summarization` function is decorated with `@instructions.distil` and takes a string input. It calls the `summarize_article` function and returns a `GeneratedSummary` object. Ensure that the `summarize_article` function is correctly implemented and that it returns a list of strings.
docs/blog/posts/chain-of-density.md
Outdated
```py
class InitialSummary(BaseModel):
    """
    This is an initial summary which should be long ( 4-5 sentences, ~80 words) yet highly non-specific, containing little information beyond the entities marked as missing. Use overly verbose languages and fillers (Eg. This article discusses) to reach ~80 words.
    """

    summary: str = Field(
        ...,
        description="This is a summary of the article provided which is overly verbose and uses fillers. It should be roughly 80 words in length",
    )
```
The `InitialSummary` class is well defined with clear docstrings and field descriptions. However, consider adding a validation to ensure the summary is approximately 80 words long.
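A word-count check along those lines might look like the following. The ~80-word target and the tolerance are assumptions, and the helper could be wired into a Pydantic `field_validator` on `summary`:

```python
def check_summary_length(summary: str, target_words: int = 80, tolerance: int = 30) -> str:
    """Raise if the summary's word count strays too far from the target."""
    word_count = len(summary.split())
    if abs(word_count - target_words) > tolerance:
        raise ValueError(
            f"Summary has {word_count} words; expected roughly {target_words}."
        )
    return summary
```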
docs/blog/posts/chain-of-density.md
Outdated
```py
class RewrittenSummary(BaseModel):
    """
    This is a new, denser summary of identical length which covers every entity and detail from the previous summary plus the Missing Entities.

    Guidelines
    - Make every word count : Rewrite the previous summary to improve flow and make space for additional entities
    - Never drop entities from the previous summary. If space cannot be made, add fewer new entities.
    - The new summary should be highly dense and concise yet self-contained, eg., easily understood without the Article.
    - Make space with fusion, compression, and removal of uninformative phrases like "the article discusses"
    - Missing entities can appear anywhere in the new summary

    An Entity is a real-world object that's assigned a name - for example, a person, country a product or a book title.
    """

    summary: str = Field(
        ...,
        description="This is a new, denser summary of identical length which covers every entity and detail from the previous summary plus the Missing Entities. It should have the same length ( ~ 80 words ) as the previous summary and should be easily understood without the Article",
    )
    absent: List[str] = Field(
        ...,
        default_factory=list,
        description="this is a list of Entities found absent from the new summary that were present in the previous summary",
    )
    missing: List[str] = Field(
        default_factory=list,
        description="This is a list of 1-3 informative Entities from the Article that are missing from the new summary which should be included in the next generated summary.",
    )
```
The `RewrittenSummary` class is well defined with clear docstrings and field descriptions. However, consider adding validations to ensure the summary is approximately 80 words long, the `absent` list is empty, and the `missing` list contains 1-3 entities.
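Plain-Python sketches of those three checks (the thresholds are assumptions; in the real model they would live in Pydantic validators for `summary`, `absent`, and `missing`):

```python
from typing import List


def check_min_length(summary: str, min_words: int = 60) -> str:
    """Reject rewrites that got shorter instead of denser."""
    if len(summary.split()) < min_words:
        raise ValueError(
            f"Summary has {len(summary.split())} words; expected at least {min_words}."
        )
    return summary


def check_no_absent_entities(absent: List[str]) -> List[str]:
    """Entities from the previous summary must never be dropped."""
    if absent:
        raise ValueError(f"Summary dropped entities: {', '.join(absent)}")
    return absent


def check_missing_count(missing: List[str]) -> List[str]:
    """Each rewrite should fold in 1-3 newly identified entities."""
    if not 1 <= len(missing) <= 3:
        raise ValueError(f"Expected 1-3 missing entities, got {len(missing)}.")
    return missing
```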
docs/blog/posts/chain-of-density.md
Outdated
```py
from typing import List
from chain_of_density import summarize_article #(1)!
import csv
import logging
import instructor
from itertools import islice
from pydantic import BaseModel

instructor.patch() #(2)!

logging.basicConfig(level=logging.INFO)

instructions = instructor.Instructions( #(3)!
    name="Chain Of Density",
    finetune_format="messages",
    # log handler is used to save the data to a file
    # you can imagine saving it to a database or other storage
    # based on your needs!
    log_handlers=[logging.FileHandler("generated.jsonl")],
)

class GeneratedSummary(BaseModel):
    summary: str

@instructions.distil #(4)!
def distil_summarization(text: str) -> GeneratedSummary:
    summary_chain: List[str] = summarize_article(text)
    return GeneratedSummary(summary=summary_chain[-1]) #(5)!

with open("train.csv", "r") as file:
    reader = csv.reader(file)
    next(reader)  # Skip the header
    for index, (article, summary) in enumerate(reader):
        # Run Distillisation to generate the values
        distil_summarization(article)
```
The script for generating the training set is well implemented. However, consider adding error handling for potential issues with the OpenAI API calls and the CSV file reading.
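One way to add that error handling, sketched as a hypothetical wrapper (the function and file names here are illustrative, not from the PR): read rows defensively and skip articles whose summarization call fails, rather than aborting the whole run.

```python
import csv
import logging

logger = logging.getLogger(__name__)


def summarize_rows(path: str, summarize_fn) -> list:
    """Summarize the article column of an (article, summary) CSV, skipping failures."""
    results = []
    try:
        with open(path, newline="") as f:
            reader = csv.reader(f)
            next(reader, None)  # skip the header; default avoids StopIteration on empty files
            for row in reader:
                if not row:
                    continue
                try:
                    results.append(summarize_fn(row[0]))
                except Exception as exc:  # e.g. transient API errors
                    logger.warning("Skipping row: %s", exc)
    except FileNotFoundError:
        logger.error("Input file not found: %s", path)
    return results
```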
docs/blog/posts/chain-of-density.md
Outdated
```py
@instructions.distil(model='gpt-3.5-turbo:finetuned-123', mode="dispatch") #(1)!
def distil_summarization(text: str) -> GeneratedSummary:
    summary_chain: List[str] = summarize_article(text)
    return GeneratedSummary(summary=summary_chain[-1])
```
The `distil_summarization` function is well implemented. However, consider adding error handling for potential issues with the OpenAI API calls.
Review Status
Actionable comments generated: 1
Configuration used: CodeRabbit UI
Files selected for processing (1)
- examples/chain-of-density/finetune.py (1 hunks)
Additional comments: 3
examples/chain-of-density/finetune.py (3)
- 1-8: Imports and client setup look good.
- 20-22: The `GeneratedSummary` class is well defined.
- 24-27: The `distil_summarization` function is correctly defined and uses the `summarize_article` function to generate a summary. It returns the last summary in the chain.
e75c3a7 to f4068a9 — Compare
Review Status
Actionable comments generated: 11
Configuration used: CodeRabbit UI
Files ignored due to filter (2)
- docs/blog/posts/img/chain-of-density.png
- examples/chain-of-density/output.csv
Files selected for processing (5)
- docs/blog/posts/chain-of-density.md (1 hunks)
- examples/chain-of-density/Readme.md (1 hunks)
- examples/chain-of-density/chain_of_density.py (1 hunks)
- examples/chain-of-density/finetune.py (1 hunks)
- examples/chain-of-density/requirements.txt (1 hunks)
Files skipped from review due to trivial changes (2)
- examples/chain-of-density/Readme.md
- examples/chain-of-density/requirements.txt
Additional comments: 5
examples/chain-of-density/finetune.py (5)
- 1-8: The imports are well organized and only necessary modules are imported. Good job!
- 9-9: The OpenAI client is patched with the `instructor` library. Ensure that the `instructor` library is compatible with the OpenAI client.
- 13-17: The `Instructions` object is well defined with appropriate parameters. Ensure that the log file `summarization.jsonl` has write permissions.
- 20-22: The `GeneratedSummary` class is well defined using Pydantic for data validation. Good job!
- 24-27: The `distil_summarization` function is well defined and uses the `@instructions.distil` decorator. It returns a `GeneratedSummary` object with the last summary in the chain. Good job!
```py
# Read in the csv file we have
with open("test.csv", "r") as file:
    reader = csv.reader(file)
    next(reader)  # Skip the header
    for article, summary in reader:
        distil_summarization(article)
```
The script reads from a CSV file and calls the `distil_summarization` function for each article. Ensure that the CSV file exists, is in the correct format, and that the file has read permissions. Also, the result of the `distil_summarization` function is not stored or used. If the result is needed, consider storing it in a variable or data structure.
- for article, summary in reader:
- distil_summarization(article)
+ summaries = []
+ for article, _ in reader:
+ summaries.append(distil_summarization(article))
```py
def summarize_article(article: str, summary_steps: int = 3):
    summary_chain = []
    # We first generate an initial summary
    summary: InitialSummary = openai.chat.completions.create(
        model="gpt-4-0613",
        response_model=InitialSummary,
        messages=[
            {
                "role": "system",
                "content": "Write a summary about the article that is long (4-5 sentences) yet highly non-specific. Use overly, verbose language and fillers(eg.,'this article discusses') to reach ~80 words",
            },
            {"role": "user", "content": f"Here is the Article: {article}"},
            {
                "role": "user",
                "content": "The generated summary should be about 80 words.",
            },
        ],
        max_retries=2,
    )
    prev_summary = None
    summary_chain.append(summary.summary)
    for i in range(summary_steps):
        missing_entity_message = (
            []
            if prev_summary is None
            else [
                {
                    "role": "user",
                    "content": f"Please include these Missing Entities: {','.join(prev_summary.missing)}",
                },
            ]
        )
        new_summary: RewrittenSummary = openai.chat.completions.create(
            model="gpt-4-0613",
            messages=[
                {
                    "role": "system",
                    "content": """
                You are going to generate an increasingly concise,entity-dense summary of the following article.

                Perform the following two tasks
                - Identify 1-3 informative entities from the following article which is missing from the previous summary
                - Write a new denser summary of identical length which covers every entity and detail from the previous summary plus the Missing Entities

                Guidelines
                - Make every word count: re-write the previous summary to improve flow and make space for additional entities
                - Make space with fusion, compression, and removal of uninformative phrases like "the article discusses".
                - The summaries should become highly dense and concise yet self-contained, e.g., easily understood without the Article.
                - Missing entities can appear anywhere in the new summary
                - Never drop entities from the previous summary. If space cannot be made, add fewer new entities.
                """,
                },
                {"role": "user", "content": f"Here is the Article: {article}"},
                {
                    "role": "user",
                    "content": f"Here is the previous summary: {summary_chain[-1]}",
                },
                *missing_entity_message,
            ],
            max_retries=3,
            max_tokens=1000,
            response_model=RewrittenSummary,
        )
        summary_chain.append(new_summary.summary)
        prev_summary = new_summary

    return summary_chain
```
The `summarize_article` function is still quite long and complex. Consider breaking it down into smaller helper functions to improve readability and maintainability. For example, you could create separate functions for generating the initial summary and the rewritten summaries.
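A sketch of one possible split. Nothing here is from the PR: the helper names are invented, and the `complete` callable stands in for the `openai.chat.completions.create` call so the structure is visible (and testable) without the API:

```python
from typing import Callable, List


def initial_messages(article: str) -> list:
    """Prompt for the verbose, non-specific first-pass summary."""
    return [
        {"role": "system", "content": "Write a long (~80 words), highly non-specific summary."},
        {"role": "user", "content": f"Here is the Article: {article}"},
    ]


def rewrite_messages(article: str, prev_summary: str, missing: List[str]) -> list:
    """Prompt for a denser rewrite that folds in the missing entities."""
    messages = [
        {"role": "system", "content": "Rewrite the summary to be denser at identical length."},
        {"role": "user", "content": f"Here is the Article: {article}"},
        {"role": "user", "content": f"Here is the previous summary: {prev_summary}"},
    ]
    if missing:
        messages.append(
            {"role": "user", "content": f"Please include these Missing Entities: {','.join(missing)}"}
        )
    return messages


def summarize_article(article: str, complete: Callable[[list], str], steps: int = 3) -> List[str]:
    """Drive the chain: one initial summary, then `steps` denser rewrites."""
    chain = [complete(initial_messages(article))]
    for _ in range(steps):
        chain.append(complete(rewrite_messages(article, chain[-1], missing=[])))
    return chain
```

In the real implementation, `complete` would also carry the `response_model` and retry settings, and the `missing` list would come from the previous `RewrittenSummary` rather than being passed empty.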
- chain of density
- finetuneing
- gpt-3.5-turbo
- distilation
+1 whoops
@jxnl Thank you for acknowledging the correction. Please proceed to commit the suggested change to fix the typographical error.
docs/blog/posts/chain-of-density.md
Outdated
# Implementing Chain Of Density

By the end of this article, you'll have a fine-tuned GPT 3.5 model which can take in source text and output a summary which mimics a chain-of-density summarization using `Instructor`'s new jobs cli feature. As usual, all of the code is avaliable for reference under our `examples/chain-of-density` folder. We've also uploaded all our generated data to Hugging Face [here](https://huggingface.co/datasets/ivanleomk/gpt4-chain-of-density) for you to use if you'd like to try reproducing these experiments.
Typographical error.
- As usual, all of the code is avaliable for reference under our `examples/chain-of-density` folder.
+ As usual, all of the code is available for reference under our `examples/chain-of-density` folder.
docs/blog/posts/chain-of-density.md
Outdated
## Implementation

Note that our implementation uses a validator to ensure that the rewritten summary has a minimum length rather than a prompt. As a result, we match the original paper on entity count but not entity density.
The sentence is not clear. It would be better to clarify what is meant by "entity count" and "entity density".
docs/blog/posts/chain-of-density.md
Outdated
Let's first segregate our train and test set so that we don't have any sort of contamination - this corresponds to our `train.csv` and `test.csv` in our [Hugging Face Dataset](https://huggingface.co/datasets/ivanleomk/gpt4-chain-of-density). Now, we just need to import the `Instructions` module from the `Instructor` package which allows you to generate a nicely formatted `.jsonl` file to be used for fine-tuning

`` ```py hl_lines="2 9 13-20 25 28" ``
Typographical error.
- from chain_of_density import summarize_article #(1)!
+ from chain_of_density import summarize_article # (1)!
docs/blog/posts/chain-of-density.md
Outdated
!!! note

    Using a fine-tuned model was able to reduce the inference time by almost 20-40x while keeping entity density relatively constant. At the same time, our costs dropped by almost 10x when compared against `GPT-4 Turbo` and by almost 150x when compared against GPT-4. This is a strong nod to the power of fine-tuning and generating small models to perform niche tasks.
Typographical error.
- Using a fine-tuned model was able to reduce the inference time by almost 20-40x while keeping entity density relatively constant. At the same time, our costs dropped by almost 10x when compared against `GPT-4 Turbo` and by almost 150x when compared against GPT-4. This is a strong nod to the power of fine-tuning and generating small models to perform niche tasks.
+ Using a fine-tuned model, we were able to reduce the inference time by almost 20-40x while keeping entity density relatively constant. At the same time, our costs dropped by almost 10x when compared against `GPT-4 Turbo` and by almost 150x when compared against GPT-4. This is a strong nod to the power of fine-tuning and generating small models to perform niche tasks.
+1
@jxnl Thank you for your approval. I will proceed to make the suggested change.
docs/blog/posts/chain-of-density.md
Outdated
## Conclusion

There's a large area of improvement that you could take on to improve the performance of this fine-tuned model and the quality of the dataset. We chose the first 100 rows as training data for convinience but there's really a whole variety of other ways to improve it
Typographical error.
- There's a large area of improvement that you could take on to improve the performance of this fine-tuned model and the quality of the dataset. We chose the first 100 rows as training data for convinience but there's really a whole variety of other ways to improve it
+ There's a large area of improvement that you could take on to improve the performance of this fine-tuned model and the quality of the dataset. We chose the first 100 rows as training data for convenience but there's really a whole variety of other ways to improve it.
Review Status
Actionable comments generated: 1
Configuration used: CodeRabbit UI
Files selected for processing (1)
- examples/chain-of-density/finetune.py (1 hunks)
Additional comments: 2
examples/chain-of-density/finetune.py (2)
- 27-32: The previous comment about storing the result of the `distil_summarization` function is still valid. Also, ensure that the CSV file "test.csv" exists and is in the correct format.

```diff
- for article, summary in reader:
-     distil_summarization(article)
+ summaries = []
+ for article, _ in reader:
+     summaries.append(distil_summarization(article))
```

- 21-24: The `distil_summarization` function is correctly defined and returns a `GeneratedSummary` object. However, ensure that the `summarize_article` function from the `chain_of_density` module is correctly implemented and tested.
```py
from typing import List
from chain_of_density import summarize_article
import csv
import logging
import instructor
from pydantic import BaseModel

logging.basicConfig(level=logging.INFO)

instructions = instructor.Instructions(
    name="Chain Of Density",
    finetune_format="messages",
    log_handlers=[logging.FileHandler("summarization.jsonl")],
)
```
The `Instructions` object is created but not used anywhere in the code. If it's not used, consider removing it to avoid confusion.
- instructions = instructor.Instructions(
- name="Chain Of Density",
- finetune_format="messages",
- log_handlers=[logging.FileHandler("summarization.jsonl")],
- )
- chain of density
- finetuneing
- gpt-3.5-turbo
- distilation
+1 whoops
Review Status
Actionable comments generated: 3
Configuration used: CodeRabbit UI
Files selected for processing (3)
- docs/blog/posts/chain-of-density.md (1 hunks)
- examples/chain-of-density/Readme.md (1 hunks)
- examples/chain-of-density/chain_of_density.py (1 hunks)
Files skipped from review due to trivial changes (1)
- examples/chain-of-density/Readme.md
Additional comments: 11
examples/chain-of-density/chain_of_density.py (3)
- 1-6: Ensure that all the imported modules are used in the code. Unused imports can lead to confusion and unnecessary overhead.
- 7-7: The `instructor.patch(OpenAI())` seems to be a custom function. Ensure that it is defined and imported correctly.
- 78-144: The `summarize_article` function seems to be well-structured and logically sound. However, ensure that the OpenAI API calls are correctly implemented and that the API keys are securely stored and accessed.

docs/blog/posts/chain-of-density.md (8)
- 98-107: The Pydantic model `InitialSummary` is well defined with a clear docstring and field description. The use of Pydantic's `Field` function to provide a description for the `summary` field is a good practice for code readability and maintainability.
- 115-142: The Pydantic model `RewrittenSummary` is well defined with a clear docstring and field descriptions. The use of Pydantic's `Field` function to provide descriptions for the `summary`, `absent`, and `missing` fields is a good practice for code readability and maintainability.
- 153-180: The use of Pydantic validators to enforce business rules on the `summary`, `missing`, and `absent` fields is a good practice. It ensures that the data adheres to the expected format and constraints, improving the robustness of the code.
- 190-262: The `summarize_article` function is well implemented. It uses the OpenAI API to generate summaries and uses the Pydantic models and validators to ensure the summaries adhere to the expected format and constraints. The use of a loop to generate a chain of summaries is a good practice. The code is readable and maintainable.
- 292-346: The use of the `instructor` library to generate a `.jsonl` file for fine-tuning is a good practice. It simplifies the process of preparing data for fine-tuning and ensures the data is in the correct format. The use of the `distil` decorator to automatically capture the input and output of the function is a good practice. It simplifies the process of generating the `.jsonl` file.
- 359-363: The use of the `distil` decorator with the `model` and `mode` parameters to use the fine-tuned model is a good practice. It simplifies the process of using the fine-tuned model in the function.
- 384-400: The benchmarking results are well presented in a table format, making it easy to compare the performance of the different models. The use of metrics such as mean latency, mean entity count, mean entity density, and total cost provides a comprehensive view of the performance of the models.
- 413-424: The conclusion provides a good summary of the article and offers suggestions for further improvements. This is a good practice as it encourages readers to explore the topic further.
```yaml
draft: False
date: 2023-11-05
tags:
  - pydantic
  - validation
  - chain of density
  - finetuneing
  - gpt-3.5-turbo
  - distilation
authors:
  - ivanleomk
  - jxnl
---
```
The metadata of the blog post seems fine. However, there are a couple of typos in the tags. "finetuneing" should be "finetuning" and "distilation" should be "distillation".
- - finetuneing
+ - finetuning
- - distilation
+ - distillation
docs/blog/posts/chain-of-density.md
Outdated
# Implementing Chain Of Density

By the end of this article, you'll have a fine-tuned GPT 3.5 model which can take in source text and output a summary which mimics a chain-of-density summarization using `Instructor`'s new jobs cli feature. As usual, all of the code is avaliable for reference under our `examples/chain-of-density` folder.
There's a typo in the word "available". It should be corrected.
- all of the code is avaliable for reference under our `examples/chain-of-density` folder.
+ all of the code is available for reference under our `examples/chain-of-density` folder.
Edits to Chain of Density work
Review Status
Actionable comments generated: 13
Configuration used: CodeRabbit UI
Files ignored due to filter (1)
- mkdocs.yml
Files selected for processing (1)
- docs/blog/posts/chain-of-density.md (1 hunks)
Additional comments: 1
docs/blog/posts/chain-of-density.md (1)
- 370-370: Typo in "@instructions.distil(model='gpt-3.5-turbo:finetuned-123', mode="dispatch") #(1)!". It should be "@instructions.distil(model='gpt-3.5-turbo:finetuned-123', mode="dispatch") # (
- pydantic
- validation
- chain of density
- finetuneing
Typo in the tag "finetuneing". It should be "finetuning".
- - finetuneing
+ - finetuning
- chain of density
- finetuneing
- gpt-3.5-turbo
- distilation
Typo in the tag "distilation". It should be "distillation".
- - distilation
+ - distillation
# Better Summaries by Finetuning Chain of Density

> Discover how to distil an interative method like chain of density into a single finetune.
Typo in "interative". It should be "iterative".
- > Discover how to distil an interative method like chain of density into a single finetune.
+ > Discover how to distil an iterative method like chain of density into a single finetune.
docs/blog/posts/chain-of-density.md
Outdated
> Discover how to distil an interative method like chain of density into a single finetune.

In this article, we'll guide you through implementing the original Chain of Density method using Instructor, then show how to distile a GPT 3.5 model to match GPT-4's iterative summarization capabilities. Using these methods were able to increase latency by 40x, reduce costs by 10x and maintain entity density. Showing massive efficiency gains by finetuning and distiling capabilities into specialized models.
Typo in "were able to increase latency". It should be "were able to decrease latency".
- Using these methods were able to increase latency by 40x, reduce costs by 10x and maintain entity density.
+ Using these methods were able to decrease latency by 40x, reduce costs by 10x and maintain entity density.
### Original Prompt

We can implement the original prompt using `pip install instructor` by breaking down the entire process into smaller api calls. This allows us to introduce validation at each step to ensure that we're getting the results that we want.
Typo in "smaller api calls". It should be "smaller API calls".
- We can implement the original prompt using `pip install instructor` by breaking down the entire process into smaller api calls.
+ We can implement the original prompt using `pip install instructor` by breaking down the entire process into smaller API calls.
docs/blog/posts/chain-of-density.md
Outdated
logging.basicConfig(level=logging.INFO)

instructions = instructor.Instructions( #(2)!
Typo in "instructions = instructor.Instructions( #(2)!". It should be "instructions = instructor.Instructions() # (2)!".
- instructions = instructor.Instructions( #(2)!
+ instructions = instructor.Instructions() # (2)!
docs/blog/posts/chain-of-density.md
Outdated
class GeneratedSummary(BaseModel):
    summary: str

@instructions.distil #(3)!
Typo in "@instructions.distil #(3)!". It should be "@instructions.distil # (3)!".
- @instructions.distil #(3)!
+ @instructions.distil # (3)!
Now that we have our models and the rough flow figured out, let's implement a function to summarize a piece of text using `Chain Of Density` summarization.

```py hl_lines="4 9-24 38-68"
Typo in the code fence info string: `py hl_lines="4 9-24 38-68"` should be `python hl_lines="4 9-24 38-68"`.
- ```py hl_lines="4 9-24 38-68"
+ ```python hl_lines="4 9-24 38-68"
!!! warning "Rate Limiting"

    We recommend running this script on a small subset of the dataset first to test you've got everything configured nicely.
Typo in "We recommend running this script on a small subset of the dataset first to test you've got everything configured nicely.". It should be "We recommend running this script on a small subset of the dataset first to test if you've got everything configured nicely.".
- We recommend running this script on a small subset of the dataset first to test you've got everything configured nicely.
+ We recommend running this script on a small subset of the dataset first to test if you've got everything configured nicely.
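On the rate-limiting point, a generic way to make a batch summarization script more robust is exponential backoff around each call. The helper below is an illustrative sketch, not code from this PR; the retry counts and delays are arbitrary:

```python
import time

def with_retries(fn, max_retries: int = 3, base_delay: float = 1.0, sleep=time.sleep):
    """Retry fn with exponential backoff, re-raising after the last attempt."""
    def wrapper(*args, **kwargs):
        for attempt in range(max_retries):
            try:
                return fn(*args, **kwargs)
            except Exception:
                if attempt == max_retries - 1:
                    raise
                sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...
    return wrapper
```

Injecting `sleep` as a parameter keeps the helper testable without actually waiting between attempts.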
Once we run this script, we'll have a new file called `generated.jsonl` in our local repository. Now all that's left is to run the command below to start fine-tuning your first model!

```sh
Typo in the code fence language: `sh` should be `bash`.
- ```sh
+ ```bash
Review Status
Actionable comments generated: 2
Configuration used: CodeRabbit UI
Files ignored due to filter (1)
- docs/blog/posts/img/chain-of-density.png
Files selected for processing (3)
- docs/blog/posts/chain-of-density.md (1 hunks)
- examples/chain-of-density/chain_of_density.py (1 hunks)
- examples/chain-of-density/finetune.py (1 hunks)
Files not reviewed due to errors (1)
- docs/blog/posts/chain-of-density.md (Error: diff too large)
Additional comments: 13
examples/chain-of-density/finetune.py (6)
- 1-6: Imports are correctly placed and organized according to PEP8 guidelines.
- 8-8: Logging level is set to INFO. Ensure that this level of logging is appropriate for your use case.
- 10-17: The `Instructions` object is correctly initialized with appropriate parameters.
- 20-35: The `GeneratedSummary` class is well-documented and correctly uses Pydantic for data validation.
- 38-41: The `distil_summarization` function is correctly decorated and returns a `GeneratedSummary` object. Ensure that the `summarize_article` function returns a list of strings.
- 44-48: The script reads from a CSV file and processes each article. Ensure that the CSV file is correctly formatted and contains the necessary data.
examples/chain-of-density/chain_of_density.py (7)
- 1-7: Ensure that all the imported modules are used in the code. Unused imports can lead to confusion and unnecessary dependencies.
- 12-23: The `InitialSummary` class is well defined with clear documentation and field descriptions.
- 24-49: The `RewrittenSummary` class is well defined with clear documentation and field descriptions. It also includes field validators to ensure the quality of the summary.
- 51-67: The `min_entity_density` validator ensures that the summary has a minimum entity density. This is a good practice to maintain the quality of the summary.
- 69-77: The `min_length` validator ensures that the summary has a minimum length. This is a good practice to maintain the quality of the summary.
- 79-85: The `has_missing_entities` validator ensures that there are missing entities identified for the next summary. This is a good practice to maintain the quality of the summary.
- 87-95: The `has_no_absent_entities` validator ensures that no entities from the previous summary are absent in the new summary. This is a good practice to maintain the quality of the summary.
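Stripped of the Pydantic `field_validator` machinery, the logic of these four validators comes down to a few small checks. The sketch below is a library-free approximation: the whitespace split is a stand-in for real tokenization, and the thresholds are illustrative rather than taken from the PR:

```python
def entity_density(summary: str, entities: list) -> float:
    # Entities per token; a whitespace split stands in for real tokenization.
    return len(entities) / max(len(summary.split()), 1)

def passes_min_entity_density(summary: str, entities: list, minimum: float = 0.08) -> bool:
    return entity_density(summary, entities) >= minimum

def passes_min_length(summary: str, min_words: int = 60) -> bool:
    return len(summary.split()) >= min_words

def has_missing_entities(missing: list) -> bool:
    # Each rewrite must name 1-3 new entities to fold into the next summary.
    return len(missing) > 0

def has_no_absent_entities(absent: list) -> bool:
    # No entity from the previous summary may be dropped in the rewrite.
    return len(absent) == 0
```

In the real code these checks raise `ValueError` inside validators, which is what lets `max_retries` re-ask the model until a rewrite passes.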
def summarize_article(article: str, summary_steps: int = 3):
    summary_chain = []
    # We first generate an initial summary
    summary: InitialSummary = client.chat.completions.create(
        model="gpt-4-0613",
        response_model=InitialSummary,
        messages=[
            {
                "role": "system",
                "content": "Write a summary about the article that is long (4-5 sentences) yet highly non-specific. Use overly, verbose language and fillers(eg.,'this article discusses') to reach ~80 words. ",
            },
            {"role": "user", "content": f"Here is the Article: {article}"},
            {
                "role": "user",
                "content": "The generated summary should be about 80 words.",
            },
        ],
        max_retries=2,
    )
    summary_chain.append(summary.summary)
    for i in range(summary_steps):
        new_summary: RewrittenSummary = client.chat.completions.create(
            model="gpt-4-0613",
            messages=[
                {
                    "role": "system",
                    "content": f"""
                Article: {article}
                You are going to generate an increasingly concise,entity-dense summary of the following article.

                Perform the following two tasks
                - Identify 1-3 informative entities from the following article which is missing from the previous summary
                - Write a new denser summary of identical length which covers every entity and detail from the previous summary plus the Missing Entities

                Guidelines
                - Make every word count: re-write the previous summary to improve flow and make space for additional entities
                - Make space with fusion, compression, and removal of uninformative phrases like "the article discusses".
                - The summaries should become highly dense and concise yet self-contained, e.g., easily understood without the Article.
                - Missing entities can appear anywhere in the new summary
                - Never drop entities from the previous summary. If space cannot be made, add fewer new entities.
                """,
                },
                {
                    "role": "user",
                    "content": f"Here is the previous summary: {summary_chain[-1]}",
                },
            ],
            max_retries=5,
            max_tokens=1000,
            response_model=RewrittenSummary,
        )
        summary_chain.append(new_summary.summary)

    return summary_chain
The `summarize_article` function is well defined and uses the OpenAI API to generate initial and rewritten summaries. It also includes error handling with retries. However, the function could be broken down into smaller functions for better readability and maintainability.
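One way to act on that suggestion is to pull the two API calls out into injected callables, leaving a small, pure orchestration loop. This is a hypothetical refactor (the function and parameter names below are not from the PR), which also makes the chaining logic testable without network access:

```python
from typing import Callable, List

def summarize_with_chain(
    article: str,
    first_pass: Callable[[str], str],      # would wrap the InitialSummary API call
    rewrite: Callable[[str, str], str],    # would wrap the RewrittenSummary API call
    summary_steps: int = 3,
) -> List[str]:
    """Pure orchestration: one initial summary, then N denser rewrites."""
    chain = [first_pass(article)]
    for _ in range(summary_steps):
        chain.append(rewrite(article, chain[-1]))
    return chain
```

With stub callables the loop can be unit-tested; in production the two callables would hold the `client.chat.completions.create` calls shown above.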
## Part 1) Chain of Density

Summarizing extensive texts with AI can be challenging, often relying on inconsistent techniques. Salesforce AI Research's novel method, chain of density, enhances AI-based text summarization, outperforming human-generated summaries.
[NIT] "chain of density" -> "Chain of Density" for consistency.
Fixed up
Let's start by walking through some of the data models that we'll be using as the `response_model` for our open ai function calls

Firstly, we'll need a data model for the initial summary that we will be generating. We'll take the description of this class straight from the original prompt. Its important to note that these docstrings serve a purpose, they are directly used by the LLM when generating the outputs.
[NIT] "Its" -> "It's"
Its important to note that these docstrings serve a purpose, they are directly used by the LLM when generating the outputs.
Make it clearer why the docstrings are important. I think it's used to validate that initial and rewritten summaries contain the expected Pydantic fields (summary, absent, missing). If so, then do we need such extensive docstrings? Also, the docstrings seem duplicative of the system content in the examples below.
Please correct my understanding if it's wrong. Either way, I think it's an opportunity to educate users on how docstrings come into play in `Instructor`.
To my understanding, the docstrings are used in the function call parameters (e.g. below), so a more descriptive docstring helps guide the eventual output by cleanly specifying what you want.
{
"functions": [
{
"name": "GeneratedSummary",
"description": "This represents a highly concise summary that includes as many entities as possible from the original source article.\n\nAn Entity is a real-world object that's assigned a name - for example, a person, country a product or a book title.\n\nGuidelines\n- Make every word count\n- The new summary should be highly dense and concise yet self-contained, eg., easily understood without the Article.\n- Make space with fusion, compression, and removal of uninformative phrases like \"the article discusses\"",
"parameters": {
"properties": {
"summary": {
"description": "This represents the final summary generated that captures the meaning of the original article which is as concise as possible. ",
"title": "Summary",
"type": "string"
}
},
"required": [
"summary"
],
"type": "object"
}
}
]
}
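That reply matches how the schema above is assembled: the class docstring is lifted into the function schema's `description`. A library-free sketch of that lifting (a simplification; the real library derives richer parameter schemas from the Pydantic field types and descriptions):

```python
import inspect

def to_function_schema(cls) -> dict:
    """Build an OpenAI-style function schema from a plain class: the class
    docstring becomes the schema's description, so it doubles as prompt text."""
    properties = {
        name: {"title": name.capitalize(), "type": "string"}  # types simplified to strings
        for name in getattr(cls, "__annotations__", {})
    }
    return {
        "name": cls.__name__,
        "description": inspect.getdoc(cls) or "",
        "parameters": {
            "properties": properties,
            "required": list(properties),
            "type": "object",
        },
    }

class GeneratedSummary:
    """This represents a highly concise summary that includes as many
    entities as possible from the original source article."""
    summary: str
```

Because the description is sent to the model with every call, a more precise docstring directly steers the generated output.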
import instructor
from pydantic import BaseModel

logging.basicConfig(level=logging.INFO) #(2)!
This feels like an unnecessary detail that the user has to be careful not to trip over—perhaps `instructor.Instructions` can have its own logging handler so the user doesn't have to even be aware of it?
@jxnl thoughts?
| Model | Mean Latency (s) | Mean Entity Count | Mean Entity Density | Tokens |
| ------------------- | ---------------- | ----------------- | ------------------- | ------ |
| GPT-4 (COD) | 49.5 | 11.3 | 0.138 | 81.65 |
| GPT-3 (COD) | 145.94 | 11.05 | 0.105 | 105.7 |
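The Mean Entity Density column is consistent with mean entity count divided by mean token count, e.g. 11.3 / 81.65 ≈ 0.138. A quick check against the rows as printed:

```python
def entity_density(mean_entities: float, mean_tokens: float) -> float:
    # Mean Entity Density = Mean Entity Count / Tokens
    return mean_entities / mean_tokens

# Figures copied from the table above (the "GPT-3" label is questioned below).
rows = {
    "GPT-4 (COD)": (11.3, 81.65),
    "GPT-3 (COD)": (11.05, 105.7),
}
densities = {m: round(entity_density(e, t), 3) for m, (e, t) in rows.items()}
```

Both computed values reproduce the table's density column to three decimal places.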
Is this gpt-3 or gpt-3.5? If the former, I wonder if we can add a benchmark for gpt-3.5? If the latter, make it clearer.
Also, we can make the benefits of COD + distillation more convincing by adding benchmarks for gpt-4 and gpt-3.5 without COD—how much lift does this additional effort buy us? This will help users prioritize between different approaches to improve their summaries.
Ah, this is GPT-3.5, my bad on this. I took this out and replaced it with the benchmarks I calculated for a Vanilla summary using GPT 3.5 that just asked for a concise summary.
| ------------------- | ---------------- | ----------------- | ------------------- | ------ |
| GPT-4 (COD) | 49.5 | 11.3 | 0.138 | 81.65 |
| GPT-3 (COD) | 145.94 | 11.05 | 0.105 | 105.7 |
| 3.5 Finetuned (20) | 2.25 | 14.7 | 0.154 | 95.45 |
Comparing finetuning on 20 summaries to the rows below, it seems that finetuning only on 20 summaries had the highest absolute entity count and density? Hmm, why is that? Might want to add a few hypotheses even if you don't have the answers.
Added it in a new branch! But my hypotheses are that
- Model might not be benefiting from the higher number of examples due to the epochs (20, 50 and 76 are all trained with 4 epochs, which is the default number provided by OpenAI)
- Larger variety of examples might cause the model to optimize for different objectives - not just for token density. The COD summarization method tends to produce more abstract summaries with each rewrite, so it might be learning other metrics under the hood to optimize for
Added a sample implementation of the Chain of Density technique. Modified the pipeline slightly to use a mix of GPT-4 for critiquing and GPT-3.5 for generation of output.
Summary by CodeRabbit
New Features
Documentation
Bug Fixes
`openai.ChatCompletion.create()` function.
Chores