Chain of density #135

Merged (16 commits) on Nov 12, 2023
417 changes: 417 additions & 0 deletions docs/blog/posts/chain-of-density.md

Large diffs are not rendered by default.

Binary file added docs/blog/posts/img/chain-of-density.png
31 changes: 31 additions & 0 deletions examples/chain-of-density/Readme.md
@@ -0,0 +1,31 @@
# Introduction

This is a simple example showing how to perform Chain of Density summarization with GPT-4 and use the generated output to fine-tune a GPT-3.5 model for production usage. All of the data referenced in this file is available [here](https://huggingface.co/datasets/ivanleomk/gpt4-chain-of-density) on Hugging Face.

Check out our blog post [here](https://jxnl.github.io/instructor/blog/2023/11/05/implementing-chain-of-density/) where we have a detailed explanation of the code and a [colab notebook](https://colab.research.google.com/drive/1iBkrEh2G5U8yh8RmI8EkWxjLq6zIIuVm?usp=sharing) walking you through how we perform our calculations.

## Instructions

1. First, install all of the required dependencies by running the command below. We recommend using a virtual environment so that these packages do not affect your system installation.

> We use NLTK to ensure that our summaries are of a certain token length. To do so, you'll need to download the `punkt` package, which is used to compute the token metrics. You can do so by running `nltk.download('punkt')`.

```
pip3 install -r requirements.txt
```

2. Download the `test.csv` file and the `summarization.jsonl` file that you want to use for fine-tuning. We provide versions with `20`, `50`, and `100` examples for testing. Let's now run a simple fine-tuning job with the following command.

> Don't forget to set your `OPENAI_API_KEY` as an environment variable in your shell before running these commands

```
instructor jobs create-from-file summarization.jsonl
```

> **Review comment (Contributor):** The command `instructor jobs create-from-file summarization.jsonl` seems to be incorrect. It should be `instruct` instead of `instructor`.
>
> ```diff
> - instructor jobs create-from-file summarization.jsonl
> + instruct jobs create-from-file summarization.jsonl
> ```

3. Once the job is complete, you'll end up with a new GPT-3.5 model that's capable of producing high-quality summaries with high entity density. You can use it by changing the `instructions.distil` decorator in our `finetune.py` file as follows:

```
@instructions.distil(model=<your finetuned model>, mode="dispatch")
def distil_summarization(text: str) -> GeneratedSummary:
    # rest of the code goes here
```
144 changes: 144 additions & 0 deletions examples/chain-of-density/chain_of_density.py
@@ -0,0 +1,144 @@
from pydantic import BaseModel, Field, field_validator
from typing import List
import instructor
import nltk
from openai import OpenAI

client = instructor.patch(OpenAI())


class InitialSummary(BaseModel):
    """
    This is an initial summary which should be long (4-5 sentences, ~80 words) yet highly non-specific, containing little information beyond the entities marked as missing. Use overly verbose language and fillers (e.g., "This article discusses") to reach ~80 words.
    """

    summary: str = Field(
        ...,
        description="This is a summary of the article provided which is overly verbose and uses fillers. It should be roughly 80 words in length",
    )


class RewrittenSummary(BaseModel):
    """
    This is a new, denser summary of identical length which covers every entity and detail from the previous summary plus the Missing Entities.

    Guidelines
    - Make every word count: rewrite the previous summary to improve flow and make space for additional entities
    - Never drop entities from the previous summary. If space cannot be made, add fewer new entities.
    - The new summary should be highly dense and concise yet self-contained, i.e., easily understood without the Article.
    - Make space with fusion, compression, and removal of uninformative phrases like "the article discusses"
    - Missing entities can appear anywhere in the new summary

    An Entity is a real-world object that is assigned a name - for example, a person, a country, a product, or a book title.
    """

    summary: str = Field(
        ...,
        description="This is a new, denser summary of identical length which covers every entity and detail from the previous summary plus the Missing Entities. It should have the same length (~80 words) as the previous summary and should be easily understood without the Article",
    )
    absent: List[str] = Field(
        default_factory=list,
        description="This is a list of Entities found absent from the new summary that were present in the previous summary",
    )
    missing: List[str] = Field(
        default_factory=list,
        description="This is a list of 1-3 informative Entities from the Article that are missing from the new summary which should be included in the next generated summary.",
    )

    @field_validator("summary")
    def min_length(cls, v: str):
        tokens = nltk.word_tokenize(v)
        num_tokens = len(tokens)
        if num_tokens < 75:
            raise ValueError(
                "The current summary is too short. Please make sure that you generate a new summary that is around 80 words long."
            )
        return v

    @field_validator("missing")
    def has_missing_entities(cls, missing_entities: List[str]):
        if len(missing_entities) == 0:
            raise ValueError(
                "You must identify 1-3 informative Entities from the Article which are missing from the previously generated summary to be used in a new summary"
            )
        return missing_entities

    @field_validator("absent")
    def has_no_absent_entities(cls, absent_entities: List[str]):
        absent_entity_string = ",".join(absent_entities)
        if len(absent_entities) > 0:
            print(f"Detected absent entities of {absent_entity_string}")
            raise ValueError(
                f"Do not omit the following Entities {absent_entity_string} from the new summary"
            )
        return absent_entities


def summarize_article(article: str, summary_steps: int = 3):
    summary_chain = []
    # We first generate an initial summary
    summary: InitialSummary = client.chat.completions.create(
        model="gpt-4-0613",
        response_model=InitialSummary,
        messages=[
            {
                "role": "system",
                "content": "Write a summary about the article that is long (4-5 sentences) yet highly non-specific. Use overly verbose language and fillers (e.g., 'this article discusses') to reach ~80 words",
            },
            {"role": "user", "content": f"Here is the Article: {article}"},
            {
                "role": "user",
                "content": "The generated summary should be about 80 words.",
            },
        ],
        max_retries=2,
    )
    prev_summary = None
    summary_chain.append(summary.summary)
    for i in range(summary_steps):
        missing_entity_message = (
            []
            if prev_summary is None
            else [
                {
                    "role": "user",
                    "content": f"Please include these Missing Entities: {','.join(prev_summary.missing)}",
                },
            ]
        )
        new_summary: RewrittenSummary = client.chat.completions.create(
            model="gpt-4-0613",
            messages=[
                {
                    "role": "system",
                    "content": """
                You are going to generate an increasingly concise, entity-dense summary of the following article.

                Perform the following two tasks
                - Identify 1-3 informative entities from the following article which are missing from the previous summary
                - Write a new, denser summary of identical length which covers every entity and detail from the previous summary plus the Missing Entities

                Guidelines
                - Make every word count: rewrite the previous summary to improve flow and make space for additional entities
                - Make space with fusion, compression, and removal of uninformative phrases like "the article discusses".
                - The summaries should become highly dense and concise yet self-contained, e.g., easily understood without the Article.
                - Missing entities can appear anywhere in the new summary
                - Never drop entities from the previous summary. If space cannot be made, add fewer new entities.
                """,
                },
                {"role": "user", "content": f"Here is the Article: {article}"},
                {
                    "role": "user",
                    "content": f"Here is the previous summary: {summary_chain[-1]}",
                },
                *missing_entity_message,
            ],
            max_retries=3,
            max_tokens=1000,
            response_model=RewrittenSummary,
        )
        summary_chain.append(new_summary.summary)
        prev_summary = new_summary

    return summary_chain
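The validator-driven retry pattern above can be exercised standalone. Here is a minimal sketch using a hypothetical `MiniSummary` model (a simplified stand-in: a plain whitespace split replaces NLTK tokenization, and the length threshold is lowered for illustration):

```python
from typing import List

from pydantic import BaseModel, Field, ValidationError, field_validator


class MiniSummary(BaseModel):
    """Simplified stand-in for RewrittenSummary, for illustration only."""

    summary: str
    absent: List[str] = Field(default_factory=list)
    missing: List[str] = Field(default_factory=list)

    @field_validator("summary")
    def min_length(cls, v: str):
        # Rough token count via whitespace split; the real code uses nltk.word_tokenize.
        if len(v.split()) < 5:
            raise ValueError("Summary too short, regenerate with ~80 words.")
        return v

    @field_validator("absent")
    def has_no_absent_entities(cls, v: List[str]):
        # Any dropped entity fails validation, prompting a retry upstream.
        if v:
            raise ValueError(f"Do not omit: {','.join(v)}")
        return v


# A compliant payload validates cleanly...
ok = MiniSummary(
    summary="Entity dense summary covering all prior entities", missing=["NASA"]
)

# ...while a payload that drops entities raises, which instructor's max_retries
# mechanism would feed back to the model as a correction prompt.
try:
    MiniSummary(
        summary="Entity dense summary covering all prior entities", absent=["NASA"]
    )
except ValidationError:
    print("validation failed as expected")
```

The design point is that the error messages double as instructions to the model on the retry, which is why they are written as imperatives rather than terse diagnostics.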
32 changes: 32 additions & 0 deletions examples/chain-of-density/finetune.py
@@ -0,0 +1,32 @@
from typing import List
from chain_of_density import summarize_article
import csv
import logging
import instructor
from pydantic import BaseModel

logging.basicConfig(level=logging.INFO)

instructions = instructor.Instructions(
    name="Chain Of Density",
    finetune_format="messages",
    log_handlers=[logging.FileHandler("summarization.jsonl")],
)
> **Review comment (Contributor):** The `Instructions` object is created but not used anywhere in the code. If it's not used, consider removing it to avoid confusion.
>
> ```diff
> - instructions = instructor.Instructions(
> -     name="Chain Of Density",
> -     finetune_format="messages",
> -     log_handlers=[logging.FileHandler("summarization.jsonl")],
> - )
> ```
class GeneratedSummary(BaseModel):
    summary: str


@instructions.distil
def distil_summarization(text: str) -> GeneratedSummary:
    summary_chain: List[str] = summarize_article(text)
    return GeneratedSummary(summary=summary_chain[-1])


# Read in the csv file we have
with open("test.csv", "r") as file:
    reader = csv.reader(file)
    next(reader)  # Skip the header
    for article, summary in reader:
        distil_summarization(article)
> **Review comment (Contributor):** The script reads from a CSV file and calls the `distil_summarization` function for each article. Ensure that the CSV file exists, is in the correct format, and that the file has read permissions. Also, the result of the `distil_summarization` function is not stored or used. If the result is needed, consider storing it in a variable or data structure.
>
> ```diff
> -    for article, summary in reader:
> -        distil_summarization(article)
> +    summaries = []
> +    for article, _ in reader:
> +        summaries.append(distil_summarization(article))
> ```
5 changes: 5 additions & 0 deletions examples/chain-of-density/requirements.txt
@@ -0,0 +1,5 @@
openai
pydantic
instructor
nltk
rich