
Chain of density #135

Merged 16 commits on Nov 12, 2023
Conversation

ivanleomk
Collaborator

@ivanleomk ivanleomk commented Nov 4, 2023

Added a sample implementation of the Chain of Density summarization technique. Modified the pipeline slightly to use GPT-4 for critiquing and GPT-3.5 for generating output.

Summary by CodeRabbit

  • New Features

    • Introduced a new "Chain Of Density" summarization technique using GPT-3.5.
    • Added a new Python script for article summarization using Pydantic and OpenAI's GPT-4 model.
  • Documentation

    • Added a comprehensive guide on implementing the "Chain Of Density" summarization technique.
    • Provided instructions for installing dependencies, downloading datasets, generating examples for fine-tuning, and running a fine-tuning job.
  • Bug Fixes

    • Fixed a typo in the model parameter in the openai.ChatCompletion.create() function.
  • Chores

    • Updated Python version used in the workflow from 3.x to 3.10.
    • Added new dependencies: openai, pydantic, instructor, nltk, and rich.

Contributor

coderabbitai bot commented Nov 4, 2023

Walkthrough

The changes introduce a new "Chain of Density" summarization technique using GPT-3.5 and the Instructor library. The technique involves generating a chain of summaries with increasing density. The implementation includes data models for initial and rewritten summaries, validators, and a function to generate the summaries. The changes also include instructions for fine-tuning the model, evaluating the quality of summaries, and a blog post explaining the technique.
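The iterative densification loop the walkthrough describes can be sketched in plain Python. All names below are illustrative, not taken from the PR (the real implementation drives the rewrite step through OpenAI and Instructor, with Pydantic validators enforcing the invariant):

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SummaryStep:
    summary: str
    entities: set  # entities mentioned so far

def chain_of_density(
    article_entities: List[str],
    initial: SummaryStep,
    rewrite: Callable[[SummaryStep, str], SummaryStep],
    rounds: int = 3,
) -> List[SummaryStep]:
    """Iteratively fold missing entities into the summary, never dropping any."""
    chain = [initial]
    for _ in range(rounds):
        prev = chain[-1]
        missing = [e for e in article_entities if e not in prev.entities]
        if not missing:
            break
        new = rewrite(prev, missing[0])
        # The core invariant of Chain of Density: entities only accumulate.
        assert prev.entities <= new.entities, "rewrite dropped an entity"
        chain.append(new)
    return chain

def stub_rewrite(prev: SummaryStep, entity: str) -> SummaryStep:
    # Stand-in for the GPT-3.5 rewrite call: append the entity, keep prior ones.
    return SummaryStep(prev.summary + " " + entity, prev.entities | {entity})

chain = chain_of_density(
    ["Alice", "Bob"],
    SummaryStep("This article discusses a meeting.", set()),
    stub_rewrite,
)
```

The stub makes the shape of the technique visible: each round produces a denser summary whose entity set is a superset of the previous one.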

Changes

File Summary
examples/chain-of-density/Readme.md Added a new document with instructions for performing Chain Of Density summarization using GPT-3.5.
examples/chain-of-density/chain_of_density.py Introduced new classes and functions for the Chain Of Density summarization technique.
examples/chain-of-density/download.py Added code to load and process datasets for the summarization task.
examples/chain-of-density/finetune.py Introduced a script for fine-tuning the model using the generated summaries.
examples/chain-of-density/run.py Introduced a script to generate a chain of summaries for an article.
docs/blog/posts/chain-of-density.md Added a blog post explaining the Chain Of Density summarization technique.
instructor/dsl/... Fixed a typo in the model parameter and added checks in the extract_json method.
examples/chain-of-density/requirements.txt Added new dependencies required for the summarization task.

Poem

🐇 Hopping through the code, making changes bright, 🌟

Summarizing articles, from morning till night. 🌞🌛

With GPT-3.5, we rewrite, ✍️

Making summaries dense and tight. 📚

As the leaves fall, we fine-tune, 🍂

Under the light of the autumn moon. 🌙

Celebrating changes, big and small, 🎉

CodeRabbit is here, hopping through fall. 🐇🍁


Tips

Chat with CodeRabbit Bot (@coderabbitai)

  • If you reply to a review comment from CodeRabbit, the bot will automatically respond.
  • To engage with CodeRabbit bot directly around the specific lines of code in the PR, mention @coderabbitai in your review comment
  • Note: Review comments are made on code diffs or files, not on the PR overview.
  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.

CodeRabbit Commands (invoked as PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger a review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai help to get help.

Note: For conversation with the bot, please use the review comments on code diffs or files.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • The JSON schema for the configuration file is available here.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/coderabbit-overrides.v2.json

Comment on lines 23 to 27
```python
class GeneratedSummary(BaseModel):
    summary: str


@instructions.distil
def distil_summarization(text: str) -> GeneratedSummary:
    summary_chain: List[str] = summarize_article(text)
    print(summary_chain)
    return GeneratedSummary(summary=summary_chain[-1])
```
Contributor


The GeneratedSummary class and distil_summarization function are well-defined. However, there is no error handling for the case where summarize_article(text) returns an empty list. This would cause an IndexError at line 31. Consider adding a check for this.

```diff
+    if not summary_chain:
+        raise ValueError("No summaries generated for the provided text.")
```


Comment on lines 34 to 35
```python
# Read in the csv file we have
with open("output.csv", "r") as file:
    reader = csv.reader(file)

    for article, summary in islice(reader, 1, 10):
        distil_summarization(article)
```
Contributor


The CSV file is read and processed correctly. However, there is no error handling for the case where the file does not exist or cannot be opened. Consider adding a try-except block to handle potential FileNotFoundError or IOError.

```diff
+try:
     with open("output.csv", "r") as file:
         reader = csv.reader(file)
         for article, summary in islice(reader, 1, 10):
             distil_summarization(article)
+except (FileNotFoundError, IOError):
+    logging.error("Failed to open output.csv")
```


Comment on lines 12 to 10
```python
class MissingEntity(BaseModel):
    """
    An entity is a real-world object that's assigned a name - for example, a person, country a product or a book title.

    A missing entity is:
    - relevant to the main story,
    - specific yet concise (5 words or fewer),
    - novel (not in the previous summary),
    - faithful (present in the article),
    - anywhere (can be located anywhere in the article).
    """

    entity_name: str = Field(
        ...,
        description="This is the associated name with the entity that exists in the text",
    )
    reason: str = Field(
        ...,
        description="This is a short sentence which describes why we should include this new entity in the rewritten abstract",
    )
```

Contributor


The MissingEntity class is well defined with clear comments and field descriptions. However, consider adding type hints for the fields to improve code readability and maintainability.

```diff
-    entity_name: str = Field(
+    entity_name: str = Field[str](
         ...,
         description="This is the associated name with the entity that exists in the text",
     )
-    reason: str = Field(
+    reason: str = Field[str](
         ...,
         description="This is a short sentence which describes why we should include this new entity in the rewritten abstract",
     )
```


Comment on lines 34 to 21
```python
class OmittedEntity(BaseModel):
    """
    An entity is a real-world object that's assigned a name - for example, a person, country a product or a book title.
    """

    entity_name: str = Field(
        ...,
        description="This is an entity which was present in the previous summary and not in the newly generated summary",
    )
```

Contributor


The OmittedEntity class is well defined with clear comments and field descriptions. However, consider adding type hints for the fields to improve code readability and maintainability.

```diff
-    entity_name: str = Field(
+    entity_name: str = Field[str](
         ...,
         description="This is an entity which was present in the previous summary and not in the newly generated summary",
     )
```


Comment on lines 45 to 52
```python
class MaybeOmittedEntities(BaseModel):
    """
    This represents whether the new summary has omitted any entities that were present in the previous summary provided.
    """

    omitted_entities: Optional[List[OmittedEntity]] = Field(default=[])
    message: Optional[str] = Field(default=None)
```

Contributor


The MaybeOmittedEntities class is well defined with clear comments and field descriptions. However, consider adding type hints for the fields to improve code readability and maintainability.

```diff
-    omitted_entities: Optional[List[OmittedEntity]] = Field(default=[])
+    omitted_entities: Optional[List[OmittedEntity]] = Field[Optional[List[OmittedEntity]]](default=[])
-    message: Optional[str] = Field(default=None)
+    message: Optional[str] = Field[Optional[str]](default=None)
```


Comment on lines 63 to 20
```python
class InitialSummary(BaseModel):
    """
    This is an initial summary which should be long ( 4-5 sentences, ~80 words) yet highly non-specific, containing little information beyond the entities marked as missing. Use overly verbose languages and fillers (Eg. This article discusses) to reach ~80 words.
    """

    summary: str = Field(
        ...,
        description="This is a summary of the article provided which is overly verbose and has fillers to reach ~80 words",
    )
```

Contributor


The InitialSummary class is well defined with clear comments and field descriptions. However, consider adding type hints for the fields to improve code readability and maintainability.

```diff
-    summary: str = Field(
+    summary: str = Field[str](
         ...,
         description="This is a summary of the article provided which is overly verbose and has fillers to reach ~80 words",
     )
```


Comment on lines 122 to 174
```python
def rewrite_summary(
    article: str,
    existing_summary: str,
    entity_ctx: str,
    error_msgs: List[str] = [],
    remaining_retries=3,
):
    # We then perform a new summary and validate that the entity density has
    # increased (we have not lost any entities)
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-16k",
        functions=[RewrittenSummary.openai_schema],
        function_call={"name": RewrittenSummary.openai_schema["name"]},
        max_retries=2,
        messages=[
            {
                "role": "system",
                "content": "You are about to be given an article, an existing summary of the article and some new entities. Please use the information to rewrite the summary to make it denser and more concise so that it covers every entity and detail from the previous summary plus the missing entities",
            },
            {"role": "user", "content": f"Here is the article : {article}"},
            {
                "role": "user",
                "content": f"Here is the most recent article : {existing_summary}",
            },
            {
                "role": "user",
                "content": f"Here is some information on entities you should include in the rewritten summary: {entity_ctx}",
            },
            *error_msgs,
        ],
    )
    try:
        new_summary = RewrittenSummary.from_response(
            completion, validation_context={"prev_summary": existing_summary}
        )
        return new_summary
    except (ValidationError, JSONDecodeError) as e:
        if remaining_retries == 0:
            raise e
        error_msgs = []
        error_msgs.append(dict(**completion.choices[0].message))
        error_msgs.append(
            {
                "role": "user",
                "content": f"Recall the function correctly, exceptions found\n{e}",
            }
        )
        return rewrite_summary(
            article,
            existing_summary,
            entity_ctx,
            error_msgs,
            remaining_retries=remaining_retries - 1,
        )
```
Contributor


The rewrite_summary function is well defined with clear comments. It uses the OpenAI API to generate a new summary and validates that the entity density has increased. It also handles exceptions properly. However, consider adding type hints for the parameters and the return type to improve code readability and maintainability.

```diff
 def rewrite_summary(
     article: str,
     existing_summary: str,
     entity_ctx: str,
     error_msgs: List[str] = [],
     remaining_retries=3,
-):
+) -> Union[RewrittenSummary, None]:
```


Contributor

@coderabbitai coderabbitai bot left a comment


Review Status

Actionable comments generated: 0

Configuration used: CodeRabbit UI

Commits reviewed: files that changed from the base of the PR, between 6e68f11 and 0e08af2.
Files selected for processing (1)
  • examples/chain-of-density/run.py (1 hunks)
Files skipped from review due to trivial changes (1)
  • examples/chain-of-density/run.py

Collaborator

@jxnl jxnl left a comment


How close are the prompts here to the original prompts ?

  • examples/chain-of-density/chain_of_density.py: 8 review threads, all resolved (7 outdated)
@ivanleomk
Collaborator Author

How close are the prompts here to the original prompts ?

The prompts here are based on the original prompts from the paper, with slight modifications. A number of them I came up with on my own, but here are the rough annotations from the original paper:

(Screenshot "CleanShot 2023-11-04 at 23 22 49": annotated prompts from the paper)

Figure 2 displays the prompt along with an example output. Rather than be prescriptive about the types of entities, we simply define a Missing Entity as:
• Relevant: to the main story.
• Specific: descriptive yet concise (5 words or fewer).
• Novel: not in the previous summary.
• Faithful: present in the Article.
• Anywhere: located anywhere in the Article.

Initial Summary comes from the line about

The first summary should be long ( 4 - 5 sentences, ~80 words ) yet highly non-specific, containing little information beyond the entities marked as missing. Use overly verbose language and fillers (e.g., "this article discusses") to reach ~80 words.

Rewritten Prompts come from the few paragraph points about

Guidelines:

  • Make every word count: re-write the previous summary to improve flow and make space for additional entities.
  • Make space with fusion, compression, and removal of uninformative phrases like "the article discusses"
  • The summaries should become highly dense and concise yet self -contained, e.g., easily understood without the Article.
  • Missing entities can appear anywhere in the new summary.
  • Never drop entities from the previous summary. If space cannot be made, add fewer new entities

I rephrased the third point about the highly dense and concise summaries and used it as the description for the Pydantic class

Entity Definition comes from spaCY

An entity is a real-world object that's assigned a name - for example, a person, country a product or a book title.

Missing Entities Task comes from the step 1 of the prompt

Identify 1-3 informative entities (";" delimited) from the article which are missing from the previously generated summary
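Taken together, the paper's criteria for a missing entity reduce to a few mechanical checks. A rough plain-Python sketch of those checks (a hypothetical helper, not part of the PR, which enforces the same criteria through Pydantic validators instead):

```python
def is_valid_missing_entity(entity: str, article: str, prev_summary: str) -> bool:
    """Mechanical version of the paper's criteria for a missing entity."""
    words = entity.split()
    concise = 0 < len(words) <= 5                       # specific yet concise
    novel = entity.lower() not in prev_summary.lower()  # not in previous summary
    faithful = entity.lower() in article.lower()        # present in the article
    return concise and novel and faithful
```

The substring checks are deliberately naive; the PR delegates this judgement to the critique model, which handles paraphrase and coreference that string matching cannot.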

Contributor

@coderabbitai coderabbitai bot left a comment


Review Status

Actionable comments generated: 9

Configuration used: CodeRabbit UI

Commits reviewed: files that changed from the base of the PR, between 0e08af2 and dfe380f.
Files ignored due to filter (1)
  • docs/blog/posts/img/chain-of-density.png
Files selected for processing (10)
  • .github/workflows/mkdocs.yml (1 hunks)
  • docs/blog/posts/chain-of-density.md (1 hunks)
  • examples/chain-of-density/Readme.md (1 hunks)
  • examples/chain-of-density/chain_of_density.py (1 hunks)
  • examples/chain-of-density/chain_of_density.txt (1 hunks)
  • examples/chain-of-density/finetune.py (1 hunks)
  • examples/chain-of-density/run.py (1 hunks)
  • instructor/dsl/citation.py (1 hunks)
  • instructor/dsl/multitask.py (1 hunks)
  • instructor/function_calls.py (1 hunks)
Files skipped from review due to trivial changes (4)
  • .github/workflows/mkdocs.yml
  • examples/chain-of-density/run.py
  • instructor/dsl/citation.py
  • instructor/function_calls.py
Additional comments: 14
examples/chain-of-density/chain_of_density.txt (1)
  • 1-9: Ensure that all these dependencies are compatible with each other and with the existing dependencies in your project. Also, make sure to update your project's documentation to reflect these new dependencies.
examples/chain-of-density/chain_of_density.py (2)
  • 1-7: The new imports look fine. Ensure that these packages are included in your project's dependencies.

  • 26-53: The InitialSummary and RewrittenSummary classes look well-structured. The docstrings and field descriptions provide clear explanations of their purpose and usage.

examples/chain-of-density/finetune.py (4)
  • 1-7: The imports look fine. Ensure that all the imported modules are used in the code and that they are installed in the environment where this script will run.

  • 9-20: The instructor patch and logging setup look fine. Ensure that the logging level is appropriate for your use case and that the log file "generated.jsonl" is being written to the correct location.

  • 23-31: The GeneratedSummary class and distil_summarization function are well defined. Ensure that the summarize_article function returns a list as expected and that the last element of this list is the final summary.

  • 35-52: The CSV file reading and summarization process look fine. Ensure that the "output.csv" file exists and is in the correct format. Also, ensure that the compute_metrics function returns the correct metrics. The division operation at line 53 should be safe as long as ttl_tokens is not zero. Consider adding a check to prevent division by zero.

```diff
-print(f"FINAL ET: {ttl_entities/ttl_tokens}")
+if ttl_tokens > 0:
+    print(f"FINAL ET: {ttl_entities/ttl_tokens}")
+else:
+    print("No tokens found.")
```
examples/chain-of-density/Readme.md (1)
  • 35-35: The TODO comment should be addressed before merging the pull request. If it's not feasible to implement this feature at the moment, consider creating an issue in the repository to track this task.
docs/blog/posts/chain-of-density.md (6)
  • 1-11: The metadata of the blog post looks fine. The authors and tags are correctly set.

  • 109-136: The Pydantic models InitialSummary and RewrittenSummary are well defined with appropriate descriptions and fields.

  • 147-174: The validators for the fields summary, missing, and absent are correctly implemented. They ensure that the summary has the right length, missing entities are identified, and no entities are absent from the new summary.

  • 184-251: The summarize_article function is well implemented. It generates an initial summary and then iteratively rewrites the summary to include missing entities. The use of OpenAI's ChatCompletion API is correct and the response models are correctly set. The function also handles retries and token limits.

  • 274-308: The script for fine-tuning the model is well implemented. It uses the Instructor library to generate a .jsonl file for fine-tuning. The use of the distil decorator and the Instruction object is correct. The script also correctly reads the articles from a CSV file and generates summaries for them.

  • 343-347: The update to the distil_summarization function to use the fine-tuned model is correctly done. The model id is correctly set in the distil decorator.
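For the entity-per-token metric the review keeps returning to, a guarded version of the final computation might look like this. It is a simplified sketch: the PR derives the counts with spaCy and NLTK, which are assumed here as plain integer lists:

```python
from typing import Sequence

def entity_density(entity_counts: Sequence[int], token_counts: Sequence[int]) -> float:
    """Entities per token across all summaries, guarded against empty input."""
    ttl_entities = sum(entity_counts)
    ttl_tokens = sum(token_counts)
    if ttl_tokens == 0:  # the division-by-zero guard the review suggests
        return 0.0
    return ttl_entities / ttl_tokens
```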

Comment on lines 1 to 35
# Introduction

This is a simple example which shows how to perform Chain Of Density summarization using GPT-3.5 and utilise the generated output to fine-tune a 3.5 model for production usage.

## Instructions

1. First, install all of the required dependencies by running the command below. We recommend using a virtual environment to install these so that it does not affect your system installation.

> To evaluate the quality of our summaries, we use spaCy and NLTK. You'll need to download the spaCy `en_core_web_trf` package and the NLTK `punkt` package to compute the token metrics.

```
pip3 install -r chain_of_density.txt
```



2. Download the dataset using `download.py`. We're using the `griffin/chain_of_density` dataset for this example so no worries if you don't have a dataset of your own. This should generate a new `.csv` file in the folder called `output.csv`

```
python3 download.py
```

3. We now need some examples to fine-tune our `3.5` model on. We provide an existing `.jsonl` file to use, or you can generate new ones from the dataset using `finetune.py`

> Don't forget to set the `OPENAI_API_KEY` environment variable in your shell if you wish to regenerate the examples. You can do so with `export OPENAI_API_KEY=<api key>`. We'll use it later for the fine-tuning step too.

4. Now that we have a `.jsonl` file with a bunch of examples, let's now run a simple finetuning job

```
instructor jobs create-from-file summarization.jsonl
```

Voila! Now you've got a new GPT3.5 model that's capable of summarizing text fine-tuned with Chain Of Density.

TODO: Evaluate the quality of the improved summaries using Spacy's Entity counter ( So we can calculate entity / tokens )
Contributor


The instructions are clear and concise. However, it would be helpful to include a brief explanation of what the Chain Of Density summarization technique is and why it's beneficial. This would provide context for users who are unfamiliar with the technique.

```diff
+ ## What is Chain Of Density Summarization?
+
+ Chain Of Density Summarization is a technique that...
```

4. Now that we have a `.jsonl` file with a bunch of examples, let's now run a simple finetuning job

```
instructor jobs create-from-file summarization.jsonl
```

The command instructor jobs create-from-file summarization.jsonl seems to be incorrect. It should be instruct instead of instructor.

- instructor jobs create-from-file summarization.jsonl 
+ instruct jobs create-from-file summarization.jsonl 


Comment on lines 31 to 39
```diff
 @staticmethod
 def extract_json(completion):
     for chunk in completion:
-        delta = chunk["choices"][0]["delta"]
-        if "function_call" in delta:
-            yield delta["function_call"]["arguments"]
+        if chunk["choices"]:
+            delta = chunk["choices"][0]["delta"]
+            if "function_call" in delta:
+                if "arguments" in delta["function_call"]:
+                    yield delta["function_call"]["arguments"]
```


The code assumes that chunk["choices"] exists and is not None. If it doesn't exist or is None, this will raise a KeyError or TypeError respectively. To avoid this, use the get method which returns None if the key doesn't exist.

- if chunk["choices"]:
+ if chunk.get("choices"):

Also, the code assumes that chunk["choices"] is a list and has at least one element. If it's not a list or is an empty list, this will raise an IndexError. To avoid this, check if chunk["choices"] is a list and has at least one element.

- if chunk.get("choices"):
+ if isinstance(chunk.get("choices"), list) and len(chunk.get("choices")) > 0:


@ivanleomk
Collaborator Author

Migrated and updated with benchmarks that I calculated, and created a Hugging Face data card here - https://huggingface.co/datasets/ivanleomk/gpt4-chain-of-density.

I will tidy up a Colab notebook so that people can reproduce my benchmarks from the data (just the visualisations) and double-check my figures if they want.

Still pending: a library update for the new OpenAI client and changes to the README to add the Hugging Face dataset links.


@coderabbitai coderabbitai bot left a comment


Review Status

Actionable comments generated: 10

Configuration used: CodeRabbit UI

Files that changed from the base of the PR and between dfe380f and 717b40f.
Files selected for processing (4)
  • docs/blog/posts/chain-of-density.md (1 hunks)
  • examples/chain-of-density/chain_of_density.py (1 hunks)
  • examples/chain-of-density/finetune.py (1 hunks)
  • examples/chain-of-density/requirements.txt (1 hunks)
Files skipped from review due to trivial changes (1)
  • examples/chain-of-density/requirements.txt
Additional comments: 11
examples/chain-of-density/chain_of_density.py (3)
  • 1-7: The import statements are well-organized and only necessary modules are imported. Good practice.

  • 8-8: The OpenAI client is patched using the instructor module. Ensure that the instructor module is correctly implemented and does not introduce any security vulnerabilities.

  • 11-32: The InitialSummary and RewrittenSummary classes are well-defined with clear docstrings and field descriptions. Good use of Pydantic for data validation.

docs/blog/posts/chain-of-density.md (3)
  • 12-13: The order of authors is being discussed. If the order matters, please verify and adjust accordingly.

  • 394-394: The note about the benefits of fine-tuning is insightful and well written.

  • 402-407: The suggestions for further improvements are well thought out and clearly explained.

examples/chain-of-density/finetune.py (5)
  • 1-8: Imports and setup look good. Ensure that all the imported modules are used in the code.

  • 9-9: The instructor.patch() function is used to patch the OpenAI client. Ensure that the instructor library is compatible with the OpenAI library and that the patching process doesn't introduce any unexpected behavior.

  • 13-17: The Instructions object is created with a name, format, and log handlers. Ensure that the log file path is correct and that the file has write permissions.

  • 20-22: The GeneratedSummary class is defined with a single attribute summary. This class is used to return the summary from the distil_summarization function. The use of Pydantic's BaseModel ensures that the data is validated and serialized/deserialized correctly.

  • 24-27: The distil_summarization function is decorated with @instructions.distil and takes a string input. It calls the summarize_article function and returns a GeneratedSummary object. Ensure that the summarize_article function is correctly implemented and that it returns a list of strings.

Comment on lines 96 to 105
class InitialSummary(BaseModel):
"""
This is an initial summary which should be long ( 4-5 sentences, ~80 words) yet highly non-specific, containing little information beyond the entities marked as missing. Use overly verbose languages and fillers (Eg. This article discusses) to reach ~80 words.
"""

summary: str = Field(
...,
description="This is a summary of the article provided which is overly verbose and uses fillers. It should be roughly 80 words in length",
)
```

The InitialSummary class is well defined with clear docstrings and field descriptions. However, consider adding a validation to ensure the summary is approximately 80 words long.
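The suggested word-count check can be sketched as a standalone function first; in a Pydantic model it would typically sit behind a validator on the `summary` field so a failing summary raises `ValueError` and can trigger a retry. The 60-100 word band around the ~80-word target is an illustrative choice, not from the paper.

```python
# Sketch of the suggested length validation as a plain function. The tolerance
# band (60-100 words around the ~80-word target) is an assumption for
# illustration.

def check_summary_length(summary: str, min_words: int = 60, max_words: int = 100) -> str:
    """Raise ValueError unless the summary is roughly 80 words long."""
    n_words = len(summary.split())
    if not min_words <= n_words <= max_words:
        raise ValueError(
            f"Summary has {n_words} words, expected roughly 80 "
            f"(between {min_words} and {max_words})."
        )
    return summary

ok = " ".join(["word"] * 80)
print(check_summary_length(ok) == ok)  # True
```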

Comment on lines 113 to 140
class RewrittenSummary(BaseModel):
"""
This is a new, denser summary of identical length which covers every entity and detail from the previous summary plus the Missing Entities.

Guidelines
- Make every word count : Rewrite the previous summary to improve flow and make space for additional entities
- Never drop entities from the previous summary. If space cannot be made, add fewer new entities.
- The new summary should be highly dense and concise yet self-contained, eg., easily understood without the Article.
- Make space with fusion, compression, and removal of uninformative phrases like "the article discusses"
- Missing entities can appear anywhere in the new summary

An Entity is a real-world object that's assigned a name - for example, a person, country a product or a book title.
"""

summary: str = Field(
...,
description="This is a new, denser summary of identical length which covers every entity and detail from the previous summary plus the Missing Entities. It should have the same length ( ~ 80 words ) as the previous summary and should be easily understood without the Article",
)
absent: List[str] = Field(
...,
default_factory=list,
description="this is a list of Entities found absent from the new summary that were present in the previous summary",
)
missing: List[str] = Field(
default_factory=list,
description="This is a list of 1-3 informative Entities from the Article that are missing from the new summary which should be included in the next generated summary.",
)
```

The RewrittenSummary class is well defined with clear docstrings and field descriptions. However, consider adding validations to ensure the summary is approximately 80 words long, the absent list is empty, and the missing list contains 1-3 entities.
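The two extra checks the comment suggests can likewise be sketched as standalone functions; in a Pydantic model they would hang off validators on the `absent` and `missing` fields. Raising `ValueError` is what lets a retrying caller feed the error message back to the model as a correction. Function names and messages here are illustrative.

```python
# Standalone sketches of the suggested checks for RewrittenSummary: the rewrite
# must not drop entities, and must propose 1-3 new entities for the next pass.

def check_no_absent_entities(absent: list[str]) -> list[str]:
    """Fail if the rewrite dropped entities that the previous summary had."""
    if absent:
        raise ValueError(
            f"Do not drop these entities from the previous summary: {', '.join(absent)}"
        )
    return absent

def check_missing_entities(missing: list[str]) -> list[str]:
    """Fail unless 1-3 new entities are proposed for the next rewrite."""
    if not 1 <= len(missing) <= 3:
        raise ValueError(f"Expected 1-3 missing entities, got {len(missing)}.")
    return missing

print(check_missing_entities(["Apple"]))  # ['Apple']
```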

Comment on lines 279 to 314
from typing import List
from chain_of_density import summarize_article #(1)!
import csv
import logging
import instructor
from itertools import islice
from pydantic import BaseModel

instructor.patch() #(2)!

logging.basicConfig(level=logging.INFO)

instructions = instructor.Instructions( #(3)!
name="Chain Of Density",
finetune_format="messages",
# log handler is used to save the data to a file
# you can imagine saving it to a database or other storage
# based on your needs!
log_handlers=[logging.FileHandler("generated.jsonl")],
)

class GeneratedSummary(BaseModel):
summary: str

@instructions.distil #(4)!
def distil_summarization(text: str) -> GeneratedSummary:
summary_chain: List[str] = summarize_article(text)
return GeneratedSummary(summary=summary_chain[-1]) #(5)!

with open("train.csv", "r") as file:
reader = csv.reader(file)
next(reader) # Skip the header
for index, (article, summary) in enumerate(reader):
# Run Distillisation to generate the values
distil_summarization(article)
```

The script for generating the training set is well implemented. However, consider adding error handling for potential issues with the OpenAI API calls and the CSV file reading.
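One way the suggested error handling could look: skip unreadable rows and retry transient API failures instead of letting one bad article kill the whole run. The `summarize` parameter stands in for the OpenAI-backed `distil_summarization` call; the function name and retry policy are illustrative.

```python
# Sketch of defensive CSV reading plus retried API calls. `summarize` is a
# stand-in for the OpenAI-backed call; here it is any callable taking the
# article text, so the control flow can be shown without an API key.
import csv
import io
import time

def summarize_all(csv_text: str, summarize, retries: int = 2) -> list:
    summaries = []
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header; the None default guards an empty file
    for row in reader:
        if not row:  # tolerate blank/malformed rows
            continue
        article = row[0]
        for attempt in range(retries + 1):
            try:
                summaries.append(summarize(article))
                break
            except Exception:
                if attempt == retries:
                    break  # give up on this row, keep processing the rest
                time.sleep(2 ** attempt)  # simple exponential backoff
    return summaries

demo_csv = "article,summary\nfirst article,ignored\nsecond article,ignored\n"
print(summarize_all(demo_csv, lambda text: text.upper()))  # ['FIRST ARTICLE', 'SECOND ARTICLE']
```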

Comment on lines 349 to 353
@instructions.distil(model='gpt-3.5-turbo:finetuned-123', mode="dispatch") #(1)!
def distil_summarization(text: str) -> GeneratedSummary:
summary_chain: List[str] = summarize_article(text)
return GeneratedSummary(summary=summary_chain[-1])
```

The distil_summarization function is well implemented. However, consider adding error handling for potential issues with the OpenAI API calls.


@coderabbitai coderabbitai bot left a comment


Review Status

Actionable comments generated: 1

Configuration used: CodeRabbit UI

Files that changed from the base of the PR and between 717b40f and e75c3a7.
Files selected for processing (1)
  • examples/chain-of-density/finetune.py (1 hunks)
Additional comments: 3
examples/chain-of-density/finetune.py (3)
  • 1-8: Imports and client setup look good.

  • 20-22: The GeneratedSummary class is well defined.

  • 24-27: The distil_summarization function is correctly defined and uses the summarize_article function to generate a summary. It returns the last summary in the chain.


@coderabbitai coderabbitai bot left a comment


Review Status

Actionable comments generated: 11

Configuration used: CodeRabbit UI

Files that changed from the base of the PR and between 877f57a and f4068a9.
Files ignored due to filter (2)
  • docs/blog/posts/img/chain-of-density.png
  • examples/chain-of-density/output.csv
Files selected for processing (5)
  • docs/blog/posts/chain-of-density.md (1 hunks)
  • examples/chain-of-density/Readme.md (1 hunks)
  • examples/chain-of-density/chain_of_density.py (1 hunks)
  • examples/chain-of-density/finetune.py (1 hunks)
  • examples/chain-of-density/requirements.txt (1 hunks)
Files skipped from review due to trivial changes (2)
  • examples/chain-of-density/Readme.md
  • examples/chain-of-density/requirements.txt
Additional comments: 5
examples/chain-of-density/finetune.py (5)
  • 1-8: The imports are well organized and only necessary modules are imported. Good job!

  • 9-9: The OpenAI client is patched with the instructor library. Ensure that the instructor library is compatible with the OpenAI client.

  • 13-17: The Instructions object is well defined with appropriate parameters. Ensure that the log file summarization.jsonl has write permissions.

  • 20-22: The GeneratedSummary class is well defined using Pydantic for data validation. Good job!

  • 24-27: The distil_summarization function is well defined and uses the @instructions.distil decorator. It returns a GeneratedSummary object with the last summary in the chain. Good job!

Comment on lines 30 to 35
# Read in the csv file we have
with open("test.csv", "r") as file:
reader = csv.reader(file)
next(reader) # Skip the header
for article, summary in reader:
distil_summarization(article)

The script reads from a CSV file and calls the distil_summarization function for each article. Ensure that the CSV file exists, is in the correct format, and that the file has read permissions. Also, the result of the distil_summarization function is not stored or used. If the result is needed, consider storing it in a variable or data structure.

-    for article, summary in reader:
-        distil_summarization(article)
+    summaries = []
+    for article, _ in reader:
+        summaries.append(distil_summarization(article))


Comment on lines 79 to 145
def summarize_article(article: str, summary_steps: int = 3):
summary_chain = []
# We first generate an initial summary
summary: InitialSummary = openai.chat.completions.create(
model="gpt-4-0613",
response_model=InitialSummary,
messages=[
{
"role": "system",
"content": "Write a summary about the article that is long (4-5 sentences) yet highly non-specific. Use overly, verbose language and fillers(eg.,'this article discusses') to reach ~80 words",
},
{"role": "user", "content": f"Here is the Article: {article}"},
{
"role": "user",
"content": "The generated summary should be about 80 words.",
},
],
max_retries=2,
)
prev_summary = None
summary_chain.append(summary.summary)
for i in range(summary_steps):
missing_entity_message = (
[]
if prev_summary is None
else [
{
"role": "user",
"content": f"Please include these Missing Entities: {','.join(prev_summary.missing)}",
},
]
)
new_summary: RewrittenSummary = openai.chat.completions.create(
model="gpt-4-0613",
messages=[
{
"role": "system",
"content": """
You are going to generate an increasingly concise,entity-dense summary of the following article.

Perform the following two tasks
- Identify 1-3 informative entities from the following article which is missing from the previous summary
- Write a new denser summary of identical length which covers every entity and detail from the previous summary plus the Missing Entities

Guidelines
- Make every word count: re-write the previous summary to improve flow and make space for additional entities
- Make space with fusion, compression, and removal of uninformative phrases like "the article discusses".
- The summaries should become highly dense and concise yet self-contained, e.g., easily understood without the Article.
- Missing entities can appear anywhere in the new summary
- Never drop entities from the previous summary. If space cannot be made, add fewer new entities.
""",
},
{"role": "user", "content": f"Here is the Article: {article}"},
{
"role": "user",
"content": f"Here is the previous summary: {summary_chain[-1]}",
},
*missing_entity_message,
],
max_retries=3,
max_tokens=1000,
response_model=RewrittenSummary,
)
summary_chain.append(new_summary.summary)
prev_summary = new_summary

return summary_chain

The summarize_article function is still quite long and complex. Consider breaking it down into smaller helper functions to improve readability and maintainability. For example, you could create separate functions for generating the initial summary and the rewritten summaries.
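One possible shape for that refactor, with the two OpenAI calls stubbed out so only the control flow is shown. All helper names here are illustrative, not from the repository, and the stub bodies would be replaced by the real `openai.chat.completions.create` calls.

```python
# Sketch of the suggested decomposition: each stage of the chain becomes its
# own helper, and summarize_article just orchestrates them. The `_call_*`
# functions are trivial stand-ins for the real API calls so the sketch runs
# without an API key.

def _generate_initial_summary(article: str) -> str:
    return _call_initial_model(article)  # wraps the first (verbose-summary) API call

def _rewrite_summary(article: str, previous: str, missing: list[str]) -> tuple[str, list[str]]:
    return _call_rewrite_model(article, previous, missing)  # wraps the densifying API call

def summarize_article(article: str, summary_steps: int = 3) -> list[str]:
    chain = [_generate_initial_summary(article)]
    missing: list[str] = []
    for _ in range(summary_steps):
        summary, missing = _rewrite_summary(article, chain[-1], missing)
        chain.append(summary)
    return chain

# Trivial stand-ins so the sketch is runnable:
def _call_initial_model(article: str) -> str:
    return f"verbose summary of: {article}"

def _call_rewrite_model(article: str, previous: str, missing: list[str]) -> tuple[str, list[str]]:
    return f"denser({previous})", ["entity"]

print(summarize_article("demo", summary_steps=2))
```

Each helper can then be unit-tested in isolation by swapping the stub for a mock, which is much harder with a single 60-line function.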

- chain of density
- finetuneing
- gpt-3.5-turbo
- distilation

+1 whoops


@jxnl Thank you for acknowledging the correction. Please proceed to commit the suggested change to fix the typographical error.


# Implementing Chain Of Density

By the end of this article, you'll have a fine-tuned GPT 3.5 model which can take in source text and output a summary which mimics a chain-of-density summarization using `Instructor`'s new jobs cli feature. As usual, all of the code is avaliable for reference under our `examples/chain-of-density` folder. We've also uploaded all our generated data to Hugging Face [here](https://huggingface.co/datasets/ivanleomk/gpt4-chain-of-density) for you to use if you'd like to try reproducing these experiments.

Typographical error.

- As usual, all of the code is avaliable for reference under our `examples/chain-of-density` folder.
+ As usual, all of the code is available for reference under our `examples/chain-of-density` folder.



## Implementation

Note that our implementation uses a validator to ensure that the rewritten summary has a minimum length rather than a prompt. As a result, we match the original paper on entity count but not entity density.

The sentence is not clear. It would be better to clarify what is meant by "entity count" and "entity density".

Let's first segregate our train and test set so that we don't have any sort of contamination - this corresponds to our `train.csv` and `test.csv` in our [Hugging Face Dataset](https://huggingface.co/datasets/ivanleomk/gpt4-chain-of-density). Now, we just need to import the `Instructions` module from the `Instructor` package which allows you to generate a nicely formatted `.jsonl` file to be used for fine-tuning



Typographical error.

- from chain_of_density import summarize_article #(1)!
+ from chain_of_density import summarize_article  # (1)!



!!! note

Using a fine-tuned model was able to reduce the inference time by almost 20-40x while keeping entity density relatively constant. At the same time, our costs dropped by almost 10x when compared against `GPT-4 Turbo` and by almost 150x when compared against GPT-4. This is a strong nod to the power of fine-tuning and generating small models to perform niche tasks.

@coderabbitai coderabbitai bot Nov 9, 2023


Typographical error.

- Using a fine-tuned model was able to reduce the inference time by almost 20-40x while keeping entity density relatively constant. At the same time, our costs dropped by almost 10x when compared against `GPT-4 Turbo` and by almost 150x when compared against GPT-4. This is a strong nod to the power of fine-tuning and generating small models to perform niche tasks.
+ Using a fine-tuned model, we were able to reduce the inference time by almost 20-40x while keeping entity density relatively constant. At the same time, our costs dropped by almost 10x when compared against `GPT-4 Turbo` and by almost 150x when compared against GPT-4. This is a strong nod to the power of fine-tuning and generating small models to perform niche tasks.



+1


@jxnl Thank you for your approval. I will proceed to make the suggested change.


## Conclusion

There's a large area of improvement that you could take on to improve the performance of this fine-tuned model and the quality of the dataset. We chose the first 100 rows as training data for convinience but there's really a whole variety of other ways to improve it

Typographical error.

- There's a large area of improvement that you could take on to improve the performance of this fine-tuned model and the quality of the dataset. We chose the first 100 rows as training data for convinience but there's really a whole variety of other ways to improve it
+ There's a large area of improvement that you could take on to improve the performance of this fine-tuned model and the quality of the dataset. We chose the first 100 rows as training data for convenience but there's really a whole variety of other ways to improve it.



@coderabbitai coderabbitai bot left a comment


Review Status

Actionable comments generated: 1

Configuration used: CodeRabbit UI

Files that changed from the base of the PR and between f4068a9 and 1d688d6.
Files selected for processing (1)
  • examples/chain-of-density/finetune.py (1 hunks)
Additional comments: 2
examples/chain-of-density/finetune.py (2)
  • 27-32: The previous comment about storing the result of the distil_summarization function is still valid. Also, ensure that the CSV file "test.csv" exists and is in the correct format.
-    for article, summary in reader:
-        distil_summarization(article)
+    summaries = []
+    for article, _ in reader:
+        summaries.append(distil_summarization(article))
  • 21-24: The distil_summarization function is correctly defined and returns a GeneratedSummary object. However, ensure that the summarize_article function from the chain_of_density module is correctly implemented and tested.

Comment on lines 1 to 14
from typing import List
from chain_of_density import summarize_article
import csv
import logging
import instructor
from pydantic import BaseModel

logging.basicConfig(level=logging.INFO)

instructions = instructor.Instructions(
name="Chain Of Density",
finetune_format="messages",
log_handlers=[logging.FileHandler("summarization.jsonl")],
)

The Instructions object is created but not used anywhere in the code. If it's not used, consider removing it to avoid confusion.

- instructions = instructor.Instructions(
-     name="Chain Of Density",
-     finetune_format="messages",
-     log_handlers=[logging.FileHandler("summarization.jsonl")],
- )


- chain of density
- finetuneing
- gpt-3.5-turbo
- distilation
Collaborator

+1 whoops

@coderabbitai bot left a comment

Review Status

Actionable comments generated: 3

Configuration used: CodeRabbit UI

Commits: files that changed from the base of the PR, between 1d688d6 and 332a0b4.
Files selected for processing (3)
  • docs/blog/posts/chain-of-density.md (1 hunks)
  • examples/chain-of-density/Readme.md (1 hunks)
  • examples/chain-of-density/chain_of_density.py (1 hunks)
Files skipped from review due to trivial changes (1)
  • examples/chain-of-density/Readme.md
Additional comments: 11
examples/chain-of-density/chain_of_density.py (3)
  • 1-6: Ensure that all the imported modules are used in the code. Unused imports can lead to confusion and unnecessary overhead.

  • 7-7: The instructor.patch(OpenAI()) seems to be a custom function. Ensure that it is defined and imported correctly.

  • 78-144: The summarize_article function seems to be well-structured and logically sound. However, ensure that the OpenAI API calls are correctly implemented and that the API keys are securely stored and accessed.

docs/blog/posts/chain-of-density.md (8)
  • 98-107: The Pydantic model InitialSummary is well defined with a clear docstring and field description. The use of Pydantic's Field function to provide a description for the summary field is a good practice for code readability and maintainability.

  • 115-142: The Pydantic model RewrittenSummary is well defined with a clear docstring and field descriptions. The use of Pydantic's Field function to provide descriptions for the summary, absent, and missing fields is a good practice for code readability and maintainability.

  • 153-180: The use of Pydantic validators to enforce business rules on the summary, missing, and absent fields is a good practice. It ensures that the data adheres to the expected format and constraints, improving the robustness of the code.

  • 190-262: The summarize_article function is well implemented. It uses the OpenAI API to generate summaries and uses the Pydantic models and validators to ensure the summaries adhere to the expected format and constraints. The use of a loop to generate a chain of summaries is a good practice. The code is readable and maintainable.

  • 292-346: The use of the instructor library to generate a .jsonl file for fine-tuning is a good practice. It simplifies the process of preparing data for fine-tuning and ensures the data is in the correct format. The use of the distil decorator to automatically capture the input and output of the function is a good practice. It simplifies the process of generating the .jsonl file.

  • 359-363: The use of the distil decorator with the model and mode parameters to use the fine-tuned model is a good practice. It simplifies the process of using the fine-tuned model in the function.

  • 384-400: The benchmarking results are well presented in a table format, making it easy to compare the performance of the different models. The use of metrics such as mean latency, mean entity count, mean entity density, and total cost provides a comprehensive view of the performance of the models.

  • 413-424: The conclusion provides a good summary of the article and offers suggestions for further improvements. This is a good practice as it encourages readers to explore the topic further.
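As an aside on the `.jsonl` file mentioned above: each line is a chat-format fine-tuning record. A minimal sketch of the standard OpenAI "messages" shape (the exact metadata `instructor`'s `distil` decorator emits may differ, and the prompt strings here are illustrative only):

```python
import json

# One fine-tuning record: a single JSON object holding a "messages" array,
# serialized onto one line of the .jsonl file.
record = {
    "messages": [
        {"role": "system", "content": "Generate a concise, entity-dense summary."},
        {"role": "user", "content": "Here is the Article: ..."},
        {"role": "assistant", "content": "A dense, self-contained summary..."},
    ]
}

line = json.dumps(record)  # json.dumps produces no newlines, so this is one .jsonl line
print(line[:60])
```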

examples/chain-of-density/chain_of_density.py (outdated; thread resolved)
Comment on lines +2 to +14
draft: False
date: 2023-11-05
tags:
- pydantic
- validation
- chain of density
- finetuneing
- gpt-3.5-turbo
- distilation
authors:
- ivanleomk
- jxnl
---
Contributor

The metadata of the blog post seems fine. However, there are a couple of typos in the tags. "finetuneing" should be "finetuning" and "distilation" should be "distillation".

-  - finetuneing
+  - finetuning
-  - distilation
+  - distillation



# Implementing Chain Of Density

By the end of this article, you'll have a fine-tuned GPT 3.5 model which can take in source text and output a summary which mimics a chain-of-density summarization using `Instructor`'s new jobs cli feature. As usual, all of the code is avaliable for reference under our `examples/chain-of-density` folder.
Contributor

There's a typo in the word "available". It should be corrected.

- all of the code is avaliable for reference under our `examples/chain-of-density` folder.
+ all of the code is available for reference under our `examples/chain-of-density` folder.


@coderabbitai bot left a comment

Review Status

Actionable comments generated: 13

Configuration used: CodeRabbit UI

Commits: files that changed from the base of the PR, between 332a0b4 and ae14920.
Files ignored due to filter (1)
  • mkdocs.yml
Files selected for processing (1)
  • docs/blog/posts/chain-of-density.md (1 hunks)
Additional comments: 1
docs/blog/posts/chain-of-density.md (1)
  • 370-370: Typo in "@instructions.distil(model='gpt-3.5-turbo:finetuned-123', mode="dispatch") #(1)!". It should be "@instructions.distil(model='gpt-3.5-turbo:finetuned-123', mode="dispatch") # (1)!".

- pydantic
- validation
- chain of density
- finetuneing
Contributor

Typo in the tag "finetuneing". It should be "finetuning".

-  - finetuneing
+  - finetuning


- chain of density
- finetuneing
- gpt-3.5-turbo
- distilation
Contributor

Typo in the tag "distilation". It should be "distillation".

-  - distilation
+  - distillation



# Better Summaries by Finetuning Chain of Density

> Discover how to distil an interative method like chain of density into a single finetune.
Contributor

Typo in "interative". It should be "iterative".

- > Discover how to distil an interative method like chain of density into a single finetune.
+ > Discover how to distil an iterative method like chain of density into a single finetune.



> Discover how to distil an interative method like chain of density into a single finetune.

In this article, we'll guide you through implementing the original Chain of Density method using Instructor, then show how to distile a GPT 3.5 model to match GPT-4's iterative summarization capabilities. Using these methods were able to increase latency by 40x, reduce costs by 10x and maintain entity density. Showing massive efficiency gains by finetuning and distiling capabilities into specialized models.
Contributor

Typo in "were able to increase latency". It should be "were able to decrease latency".

- Using these methods were able to increase latency by 40x, reduce costs by 10x and maintain entity density.
+ Using these methods were able to decrease latency by 40x, reduce costs by 10x and maintain entity density.



### Original Prompt

We can implement the original prompt using `pip install instructor` by breaking down the entire process into smaller api calls. This allows us to introduce validation at each step to ensure that we're getting the results that we want.
Contributor

Typo in "pip install instructor". It should be "pip install instructor".

- We can implement the original prompt using `pip install instructor` by breaking down the entire process into smaller api calls.
+ We can implement the original prompt using `pip install instructor` by breaking down the entire process into smaller API calls.



logging.basicConfig(level=logging.INFO)

instructions = instructor.Instructions( #(2)!
Contributor

Typo in "instructions = instructor.Instructions( #(2)!". It should be "instructions = instructor.Instructions() # (2)!".

- instructions = instructor.Instructions( #(2)!
+ instructions = instructor.Instructions() # (2)!


class GeneratedSummary(BaseModel):
    summary: str

@instructions.distil #(3)!
Contributor

Typo in "@instructions.distil #(3)!". It should be "@instructions.distil # (3)!".

- @instructions.distil #(3)!
+ @instructions.distil # (3)!



Now that we have our models and the rough flow figured out, let's implement a function to summarize a piece of text using `Chain Of Density` summarization.

```py hl_lines="4 9-24 38-68"
Contributor

Typo in "py hl_lines="4 9-24 38-68"". It should be "python hl_lines="4 9-24 38-68"".

- ```py hl_lines="4 9-24 38-68"
+ ```python hl_lines="4 9-24 38-68"


!!! warning "Rate Limiting"

We recommend running this script on a small subset of the dataset first to test you've got everything configured nicely.
Contributor

Typo in "We recommend running this script on a small subset of the dataset first to test you've got everything configured nicely.". It should be "We recommend running this script on a small subset of the dataset first to test if you've got everything configured nicely.".

- We recommend running this script on a small subset of the dataset first to test you've got everything configured nicely.
+ We recommend running this script on a small subset of the dataset first to test if you've got everything configured nicely.



Once we run this script, we'll have a new file called `generated.jsonl` in our local repository. Now all that's left is to run the command below to start fine-tuning your first model!

```sh
Contributor

Typo in "sh". It should be "bash".

- ```sh
+ ```bash

@coderabbitai bot left a comment

Review Status

Actionable comments generated: 2

Configuration used: CodeRabbit UI

Commits: files that changed from the base of the PR, between ae14920 and 77104bd.
Files ignored due to filter (1)
  • docs/blog/posts/img/chain-of-density.png
Files selected for processing (3)
  • docs/blog/posts/chain-of-density.md (1 hunks)
  • examples/chain-of-density/chain_of_density.py (1 hunks)
  • examples/chain-of-density/finetune.py (1 hunks)
Files not reviewed due to errors (1)
  • docs/blog/posts/chain-of-density.md (Error: diff too large)
Additional comments: 13
examples/chain-of-density/finetune.py (6)
  • 1-6: Imports are correctly placed and organized according to PEP8 guidelines.

  • 8-8: Logging level is set to INFO. Ensure that this level of logging is appropriate for your use case.

  • 10-17: The Instructions object is correctly initialized with appropriate parameters.

  • 20-35: The GeneratedSummary class is well-documented and correctly uses Pydantic for data validation.

  • 38-41: The distil_summarization function is correctly decorated and returns a GeneratedSummary object. Ensure that the summarize_article function returns a list of strings.

  • 44-48: The script reads from a CSV file and processes each article. Ensure that the CSV file is correctly formatted and contains the necessary data.

examples/chain-of-density/chain_of_density.py (7)
  • 1-7: Ensure that all the imported modules are used in the code. Unused imports can lead to confusion and unnecessary dependencies.

  • 12-23: The InitialSummary class is well defined with clear documentation and field descriptions.

  • 24-49: The RewrittenSummary class is well defined with clear documentation and field descriptions. It also includes field validators to ensure the quality of the summary.

  • 51-67: The min_entity_density validator ensures that the summary has a minimum entity density. This is a good practice to maintain the quality of the summary.

  • 69-77: The min_length validator ensures that the summary has a minimum length. This is a good practice to maintain the quality of the summary.

  • 79-85: The has_missing_entities validator ensures that there are missing entities identified for the next summary. This is a good practice to maintain the quality of the summary.

  • 87-95: The has_no_absent_entities validator ensures that no entities from the previous summary are absent in the new summary. This is a good practice to maintain the quality of the summary.
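The floor that `min_entity_density` enforces is simply entities divided by tokens. A minimal sketch of that computation (whitespace splitting stands in for the real tokenizer, e.g. nltk's `word_tokenize`, used in the actual validator):

```python
def entity_density(summary: str, entity_count: int) -> float:
    """Entities per token; a validator can reject summaries below a floor such as 0.08."""
    tokens = summary.split()  # stand-in for nltk.word_tokenize
    return entity_count / len(tokens) if tokens else 0.0

# 4 entities spread over a 50-token summary:
print(entity_density(" ".join(["token"] * 50), 4))  # 0.08
```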

examples/chain-of-density/chain_of_density.py (thread resolved)
Comment on lines +98 to +151
def summarize_article(article: str, summary_steps: int = 3):
summary_chain = []
# We first generate an initial summary
summary: InitialSummary = client.chat.completions.create(
model="gpt-4-0613",
response_model=InitialSummary,
messages=[
{
"role": "system",
"content": "Write a summary about the article that is long (4-5 sentences) yet highly non-specific. Use overly, verbose language and fillers(eg.,'this article discusses') to reach ~80 words. ",
},
{"role": "user", "content": f"Here is the Article: {article}"},
{
"role": "user",
"content": "The generated summary should be about 80 words.",
},
],
max_retries=2,
)
summary_chain.append(summary.summary)
for i in range(summary_steps):
new_summary: RewrittenSummary = client.chat.completions.create(
model="gpt-4-0613",
messages=[
{
"role": "system",
"content": f"""
Article: {article}
You are going to generate an increasingly concise,entity-dense summary of the following article.

Perform the following two tasks
- Identify 1-3 informative entities from the following article which is missing from the previous summary
- Write a new denser summary of identical length which covers every entity and detail from the previous summary plus the Missing Entities

Guidelines
- Make every word count: re-write the previous summary to improve flow and make space for additional entities
- Make space with fusion, compression, and removal of uninformative phrases like "the article discusses".
- The summaries should become highly dense and concise yet self-contained, e.g., easily understood without the Article.
- Missing entities can appear anywhere in the new summary
- Never drop entities from the previous summary. If space cannot be made, add fewer new entities.
""",
},
{
"role": "user",
"content": f"Here is the previous summary: {summary_chain[-1]}",
},
],
max_retries=5,
max_tokens=1000,
response_model=RewrittenSummary,
)
summary_chain.append(new_summary.summary)

return summary_chain
Contributor

The summarize_article function is well defined and uses the OpenAI API to generate initial and rewritten summaries. It also includes error handling with retries. However, the function could be broken down into smaller functions for better readability and maintainability.
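One way to act on this suggestion is to pull each API call into its own helper and keep `summarize_article` as just the loop. A rough sketch with hypothetical helper names (prompts abridged; `response_model`, `max_retries`, and validation details omitted — with instructor's patched client and a `response_model`, the call returns the Pydantic object, so `.summary` is available):

```python
def get_initial_summary(client, article: str) -> str:
    """First pass: a deliberately verbose, non-specific ~80-word summary."""
    response = client.chat.completions.create(
        model="gpt-4-0613",
        messages=[
            {"role": "system", "content": "Write a long (4-5 sentence) but non-specific summary."},
            {"role": "user", "content": f"Here is the Article: {article}"},
        ],
    )
    return response.summary


def densify(client, article: str, previous_summary: str) -> str:
    """One rewrite step: identical length, more entities, no dropped entities."""
    response = client.chat.completions.create(
        model="gpt-4-0613",
        messages=[
            {"role": "system", "content": f"Article: {article}\nRewrite the summary to be denser."},
            {"role": "user", "content": f"Here is the previous summary: {previous_summary}"},
        ],
    )
    return response.summary


def summarize_article(client, article: str, summary_steps: int = 3) -> list:
    """Build the chain: one initial summary followed by `summary_steps` rewrites."""
    chain = [get_initial_summary(client, article)]
    for _ in range(summary_steps):
        chain.append(densify(client, article, chain[-1]))
    return chain
```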

@jxnl jxnl merged commit 9139857 into main Nov 12, 2023
@jxnl jxnl deleted the chain-of-density branch November 12, 2023 15:12

## Part 1) Chain of Density

Summarizing extensive texts with AI can be challenging, often relying on inconsistent techniques. Salesforce AI Research's novel method, chain of density, enhances AI-based text summarization, outperforming human-generated summaries.


[NIT] "chain of density" -> "Chain of Density" for consistency.

Collaborator Author

Fixed up


Let's start by walking through some of the data models that we'll be using as the `response_model` for our open ai function calls

Firstly, we'll need a data model for the initial summary that we will be generating. We'll take the description of this class straight from the original prompt. Its important to note that these docstrings serve a purpose, they are directly used by the LLM when generating the outputs.


[NIT] "Its" -> "It's"


Its important to note that these docstrings serve a purpose, they are directly used by the LLM when generating the outputs.

Make it clearer why the docstrings are important. I think it's used to validate that initial and rewritten summaries contain the expected Pydantic fields (summary, absent, missing). If so, then do we need such extensive docstrings? Also, the docstrings seem duplicative of the system content in the examples below.

Please correct my understanding if it's wrong. Either way, I think it's an opportunity to educate users on how docstrings come into play in Instructor.

Collaborator Author

To my understanding, the docstrings are used in the function call parameters (e.g. below), so a more descriptive docstring helps guide the eventual output by cleanly specifying what you want.

```json
{
  "functions": [
    {
      "name": "GeneratedSummary",
      "description": "This represents a highly concise summary that includes as many entities as possible from the original source article.\n\nAn Entity is a real-world object that's assigned a name - for example, a person, country a product or a book title.\n\nGuidelines\n- Make every word count\n- The new summary should be highly dense and concise yet self-contained, eg., easily understood without the Article.\n- Make space with fusion, compression, and removal of uninformative phrases like \"the article discusses\"",
      "parameters": {
        "properties": {
          "summary": {
            "description": "This represents the final summary generated that captures the meaning of the original article which is as concise as possible. ",
            "title": "Summary",
            "type": "string"
          }
        },
        "required": ["summary"],
        "type": "object"
      }
    }
  ]
}
```
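The underlying mechanism can be seen with the standard library alone: the class docstring is the raw material for the schema's "description" field above (a simplified illustration; instructor and Pydantic do additional processing on the way to the final schema):

```python
import inspect

class GeneratedSummary:
    """This represents a highly concise summary that includes as many
    entities as possible from the original source article."""

# inspect.getdoc dedents the docstring; this string is what feeds the
# "description" field of the function schema sent to OpenAI.
print(inspect.getdoc(GeneratedSummary))
```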

import instructor
from pydantic import BaseModel

logging.basicConfig(level=logging.INFO) #(2)!


This feels like an unnecessary detail that the user has to be careful not to trip over—perhaps instructor.Instructions can have its own logging handler so the user doesn't have to even be aware of it?

Collaborator Author

@jxnl thoughts?

| Model | Mean Latency (s) | Mean Entity Count | Mean Entity Density | Tokens |
| ------------------- | ---------------- | ----------------- | ------------------- | ------ |
| GPT-4 (COD) | 49.5 | 11.3 | 0.138 | 81.65 |
| GPT-3 (COD) | 145.94 | 11.05 | 0.105 | 105.7 |


Is this gpt-3 or gpt-3.5? If the former, I wonder if we can add a benchmark for gpt-3.5? If the latter, make it clearer.

Also, we can make the benefits of COD + distillation more convincing by adding benchmarks for gpt-4 and gpt-3.5 without COD—how much lift does this additional effort buy us? This will help users prioritize between different approaches to improve their summaries.

Collaborator Author

Ah, this is GPT-3.5, my bad on this. I took this out and replaced it with the benchmarks I calculated for a Vanilla summary using GPT 3.5 that just asked for a concise summary.

| ------------------- | ---------------- | ----------------- | ------------------- | ------ |
| GPT-4 (COD) | 49.5 | 11.3 | 0.138 | 81.65 |
| GPT-3 (COD) | 145.94 | 11.05 | 0.105 | 105.7 |
| 3.5 Finetuned (20) | 2.25 | 14.7 | 0.154 | 95.45 |


Comparing finetuning on 20 summaries to the rows below, it seems that finetuning only on 20 summaries had the highest absolute entity count and density? Hmm, why is that? Might want to add a few hypotheses even if you don't have the answers.

Collaborator Author

Added it in a new branch! But my hypotheses are that

  1. Model might not be benefitting from the higher number of examples due to the epochs ( 20, 50 and 76 are all trained with 4 epochs which is the default number provided by OpenAI)
  2. Larger variety of examples might cause the model to optimize for different objectives - not just for token density. The COD summarization method tends to produce more abstract summaries with each rewrite, so it might be learning other metrics under the hood to optimize for
