Chain of density #135

Merged 16 commits on Nov 12, 2023
408 changes: 408 additions & 0 deletions docs/blog/posts/chain-of-density.md

Large diffs are not rendered by default.

Binary file added docs/blog/posts/img/chain-of-density.png
35 changes: 35 additions & 0 deletions examples/chain-of-density/Readme.md
@@ -0,0 +1,35 @@
# Introduction

This is a simple example that shows how to perform Chain Of Density summarization and use the generated output to fine-tune a GPT-3.5 model for production usage.

## Instructions

1. First, install all of the required dependencies by running the command below. We recommend installing them in a virtual environment so that they don't affect your system installation.

> To evaluate the quality of our summaries, we use spaCy and NLTK. You'll need to download the spaCy `en_core_web_trf` model and the NLTK `punkt` tokenizer to compute the token metrics.

```
pip3 install -r chain_of_density.txt
```
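   One way to fetch both after the dependencies are installed (these are the standard spaCy and NLTK download commands for the packages named in the note above):

```
python3 -m spacy download en_core_web_trf
python3 -c "import nltk; nltk.download('punkt')"
```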



2. Download the dataset using `download.py`. We're using the `griffin/chain_of_density` dataset for this example, so no worries if you don't have a dataset of your own. This should generate a new `.csv` file in the folder called `output.csv`.

```
python3 download.py
```
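   If you're curious what `download.py` roughly does, a minimal sketch might look like this (the split name and column handling here are assumptions, not the script's actual contents):

```
from datasets import load_dataset
import csv

# Hypothetical sketch: fetch the dataset from the Hugging Face Hub
# and dump a split to output.csv.
dataset = load_dataset("griffin/chain_of_density", split="train")

with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(dataset.column_names)
    for row in dataset:
        writer.writerow([row[col] for col in dataset.column_names])
```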

3. We now need some examples to fine-tune our `3.5` model on. We provide an existing `.jsonl` file to use, or you can generate new examples from the dataset using `finetune.py`.

> Don't forget to set the `OPENAI_API_KEY` environment variable in your shell if you wish to regenerate the examples. You can do so with `export OPENAI_API_KEY=<api key>`. We'll use it for the subsequent fine-tuning step too.

4. Now that we have a `.jsonl` file with a bunch of examples, let's run a simple fine-tuning job:

```
instructor jobs create-from-file summarization.jsonl
```

Contributor:

The command `instructor jobs create-from-file summarization.jsonl` seems to be incorrect. It should be `instruct` instead of `instructor`.

```diff
- instructor jobs create-from-file summarization.jsonl
+ instruct jobs create-from-file summarization.jsonl
```

Voila! You've now got a new GPT-3.5 model, fine-tuned with Chain Of Density, that's capable of summarizing text.
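To use the distilled model, point a regular chat-completions call at the model id that the fine-tuning job reports when it finishes. A minimal sketch (the model id and article text below are placeholders):

```
from openai import OpenAI

client = OpenAI()

article = "..."  # placeholder: the text you want summarized

# "ft:gpt-3.5-turbo-1106:your-org::abc123" is a placeholder; substitute
# the model id reported by your fine-tuning job.
response = client.chat.completions.create(
    model="ft:gpt-3.5-turbo-1106:your-org::abc123",
    messages=[{"role": "user", "content": f"Summarize this article: {article}"}],
)
print(response.choices[0].message.content)
```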

TODO: Evaluate the quality of the improved summaries using spaCy's entity counter (so we can calculate entities per token).
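As a starting point, a small sketch of that metric using the spaCy and NLTK packages from step 1 (the helper name is ours):

```
import nltk
import spacy

nlp = spacy.load("en_core_web_trf")

def entity_density(summary: str) -> float:
    # Named entities per token: higher means a denser summary.
    num_entities = len(nlp(summary).ents)
    num_tokens = len(nltk.word_tokenize(summary))
    return num_entities / num_tokens
```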
Contributor:
The instructions are clear and concise. However, it would be helpful to include a brief explanation of what the Chain Of Density summarization technique is and why it's beneficial. This would provide context for users who are unfamiliar with the technique.

```diff
+ ## What is Chain Of Density Summarization?
+
+ Chain Of Density Summarization is a technique that...
```


145 changes: 145 additions & 0 deletions examples/chain-of-density/chain_of_density.py
@@ -0,0 +1,145 @@
from pydantic import BaseModel, Field, field_validator
from typing import List
import instructor
import nltk
from openai import OpenAI

# Patch the OpenAI client so that chat completions accept a response_model.
client = instructor.patch(OpenAI())


class InitialSummary(BaseModel):
    """
    This is an initial summary which should be long (4-5 sentences, ~80 words) yet highly non-specific, containing little information beyond the entities marked as missing. Use overly verbose language and fillers (e.g. "this article discusses") to reach ~80 words.
    """

    summary: str = Field(
        ...,
        description="This is a summary of the article provided which is overly verbose and uses fillers. It should be roughly 80 words in length",
    )


class RewrittenSummary(BaseModel):
    """
    This is a new, denser summary of identical length which covers every entity and detail from the previous summary plus the Missing Entities.

    Guidelines
    - Make every word count: rewrite the previous summary to improve flow and make space for additional entities
    - Never drop entities from the previous summary. If space cannot be made, add fewer new entities.
    - The new summary should be highly dense and concise yet self-contained, e.g., easily understood without the Article.
    - Make space with fusion, compression, and removal of uninformative phrases like "the article discusses"
    - Missing entities can appear anywhere in the new summary

    An Entity is a real-world object that's assigned a name - for example, a person, a country, a product, or a book title.
    """

    summary: str = Field(
        ...,
        description="This is a new, denser summary of identical length which covers every entity and detail from the previous summary plus the Missing Entities. It should have the same length (~80 words) as the previous summary and should be easily understood without the Article",
    )
    absent: List[str] = Field(
        default_factory=list,
        description="This is a list of Entities found absent from the new summary that were present in the previous summary",
    )
    missing: List[str] = Field(
        default_factory=list,
        description="This is a list of 1-3 informative Entities from the Article that are missing from the new summary which should be included in the next generated summary.",
    )

    @field_validator("summary")
    def min_length(cls, v: str):
        # Enforce the ~80-word target by counting NLTK tokens.
        tokens = nltk.word_tokenize(v)
        num_tokens = len(tokens)
        if num_tokens < 75:
            raise ValueError(
                "The current summary is too short. Please make sure that you generate a new summary that is around 80 words long."
            )
        return v

    @field_validator("missing")
    def has_missing_entities(cls, missing_entities: List[str]):
        # Each rewrite must name 1-3 new entities to fold into the next pass.
        if len(missing_entities) == 0:
            raise ValueError(
                "You must identify 1-3 informative Entities from the Article which are missing from the previously generated summary to be used in a new summary"
            )
        return missing_entities

    @field_validator("absent")
    def has_no_absent_entities(cls, absent_entities: List[str]):
        # Reject any rewrite that dropped entities from the previous summary.
        absent_entity_string = ",".join(absent_entities)
        if len(absent_entities) > 0:
            print(f"Detected absent entities of {absent_entity_string}")
            raise ValueError(
                f"Do not omit the following Entities {absent_entity_string} from the new summary"
            )
        return absent_entities


def summarize_article(article: str, summary_steps: int = 3):
    summary_chain = []
    # We first generate an initial, deliberately sparse summary
    summary: InitialSummary = client.chat.completions.create(
        model="gpt-4-0613",
        response_model=InitialSummary,
        messages=[
            {
                "role": "system",
                "content": "Write a summary about the article that is long (4-5 sentences) yet highly non-specific. Use overly verbose language and fillers (e.g., 'this article discusses') to reach ~80 words",
            },
            {"role": "user", "content": f"Here is the Article: {article}"},
            {
                "role": "user",
                "content": "The generated summary should be about 80 words.",
            },
        ],
        max_retries=2,
    )
    prev_summary = None
    summary_chain.append(summary.summary)
    for i in range(summary_steps):
        # Only ask for the missing entities once a previous rewrite exists.
        missing_entity_message = (
            []
            if prev_summary is None
            else [
                {
                    "role": "user",
                    "content": f"Please include these Missing Entities: {','.join(prev_summary.missing)}",
                },
            ]
        )
        new_summary: RewrittenSummary = client.chat.completions.create(
            model="gpt-4-0613",
            messages=[
                {
                    "role": "system",
                    "content": """
                You are going to generate an increasingly concise, entity-dense summary of the following article.

                Perform the following two tasks
                - Identify 1-3 informative entities from the following article which are missing from the previous summary
                - Write a new denser summary of identical length which covers every entity and detail from the previous summary plus the Missing Entities

                Guidelines
                - Make every word count: re-write the previous summary to improve flow and make space for additional entities
                - Make space with fusion, compression, and removal of uninformative phrases like "the article discusses".
                - The summaries should become highly dense and concise yet self-contained, e.g., easily understood without the Article.
                - Missing entities can appear anywhere in the new summary
                - Never drop entities from the previous summary. If space cannot be made, add fewer new entities.
                """,
                },
                {"role": "user", "content": f"Here is the Article: {article}"},
                {
                    "role": "user",
                    "content": f"Here is the previous summary: {summary_chain[-1]}",
                },
                *missing_entity_message,
            ],
            max_retries=3,
            max_tokens=1000,
            response_model=RewrittenSummary,
        )
        summary_chain.append(new_summary.summary)
        prev_summary = new_summary

    return summary_chain
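For reference, a quick sketch of how the function above might be invoked (the article text is a placeholder):

```python
if __name__ == "__main__":
    article = "..."  # placeholder: any article text to summarize
    chain = summarize_article(article, summary_steps=3)
    for i, s in enumerate(chain):
        print(f"Summary {i}: {s}")
```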
Contributor:
The summarize_article function is still quite long and complex. Consider breaking it down into smaller helper functions to improve readability and maintainability. For example, you could create separate functions for generating the initial summary and the rewritten summaries.
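A minimal sketch of the shape such a refactor might take (the helper names are ours, and the API calls from the file above are elided):

```python
from typing import List

def generate_initial_summary(article: str) -> InitialSummary:
    # Would wrap the first client.chat.completions.create call above.
    ...

def densify_summary(article: str, previous: str, missing: List[str]) -> RewrittenSummary:
    # Would wrap the rewrite call, threading through the missing entities.
    ...

def summarize_article(article: str, summary_steps: int = 3) -> List[str]:
    summary_chain = [generate_initial_summary(article).summary]
    missing: List[str] = []
    for _ in range(summary_steps):
        rewritten = densify_summary(article, summary_chain[-1], missing)
        summary_chain.append(rewritten.summary)
        missing = rewritten.missing
    return summary_chain
```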

35 changes: 35 additions & 0 deletions examples/chain-of-density/finetune.py
@@ -0,0 +1,35 @@
from typing import List
from chain_of_density import summarize_article
import csv
import logging
import instructor
from pydantic import BaseModel
from openai import OpenAI

client = instructor.patch(OpenAI())

logging.basicConfig(level=logging.INFO)

instructions = instructor.Instructions(
    name="Chain Of Density",
    finetune_format="messages",
    log_handlers=[logging.FileHandler("summarization.jsonl")],
)


class GeneratedSummary(BaseModel):
    summary: str


# The distil decorator captures each call's inputs and validated output as
# fine-tuning messages, logged to summarization.jsonl via the handler above.
@instructions.distil
def distil_summarization(text: str) -> GeneratedSummary:
    summary_chain: List[str] = summarize_article(text)
    return GeneratedSummary(summary=summary_chain[-1])


# Read in the csv file we have
with open("test.csv", "r") as file:
    reader = csv.reader(file)
    next(reader)  # Skip the header
    for article, summary in reader:
        distil_summarization(article)
Contributor:
The script reads from a CSV file and calls the distil_summarization function for each article. Ensure that the CSV file exists, is in the correct format, and that the file has read permissions. Also, the result of the distil_summarization function is not stored or used. If the result is needed, consider storing it in a variable or data structure.

```diff
-    for article, summary in reader:
-        distil_summarization(article)
+    summaries = []
+    for article, _ in reader:
+        summaries.append(distil_summarization(article))
```
