Chain of density #135
@@ -0,0 +1,31 @@
# Introduction

This is a simple example which shows how to perform Chain of Density summarization using GPT-3.5 and utilise the generated output to fine-tune a GPT-3.5 model for production usage. All of the data referenced in this file is located [here](https://huggingface.co/datasets/ivanleomk/gpt4-chain-of-density) on Hugging Face.

Check out our blog post [here](https://jxnl.github.io/instructor/blog/2023/11/05/implementing-chain-of-density/) for a detailed explanation of the code, and a [colab notebook](https://colab.research.google.com/drive/1iBkrEh2G5U8yh8RmI8EkWxjLq6zIIuVm?usp=sharing) walking you through how we perform our calculations.
## Instructions

1. First, install all of the required dependencies by running the command below. We recommend using a virtual environment so that these packages do not affect your system installation.

> We use NLTK to ensure that our summaries are of a certain token length. To do so, you'll need to download the `punkt` package, which is used to compute the token metrics. You can do so by running `nltk.download('punkt')`, as shown in the sketch after the install command.
```
pip3 install -r requirements.txt
```
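For reference, here is a minimal sketch of the one-time `punkt` download and a quick token count (the sample sentence is arbitrary):

```
import nltk

nltk.download('punkt')  # one-time download of the tokenizer models

# word_tokenize is what the validators use to measure summary length
tokens = nltk.word_tokenize("Chain of Density produces entity-dense summaries.")
print(len(tokens))
```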
2. Download the `test.csv` file and the `summarization.jsonl` file that you want to use for fine-tuning. We provide versions with `20`, `50` and `100` examples to be used for testing. Let's now run a simple fine-tuning job with the following command.

> Don't forget to set your `OPENAI_API_KEY` as an environment variable in your shell before running these commands (see the sketch after the command below).
```
instructor jobs create-from-file summarization.jsonl
```
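If the key is not already configured, a typical POSIX shell setup looks like this (the key value is a placeholder):

```
export OPENAI_API_KEY="sk-..."  # placeholder; substitute your own key
```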
3. Once the job is complete, you'll end up with a new GPT-3.5 model that's capable of producing high-quality summaries with a high entity density. You can use it by simply changing the `instructions.distil` decorator in our `finetune.py` file as follows.
```
@instructions.distil(model="<your finetuned model>", mode="dispatch")
def distil_summarization(text: str) -> GeneratedSummary:
    # rest of code goes here
```
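Once the decorator is switched to dispatch mode, calling the function should route straight to the fine-tuned model instead of re-running the whole chain. A hypothetical usage sketch (the input text is a placeholder):

```
summary = distil_summarization("Full text of the article to summarize...")
print(summary.summary)  # a GeneratedSummary produced by the fine-tuned model
```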
@@ -0,0 +1,144 @@
from pydantic import BaseModel, Field, field_validator
from typing import List
import instructor
import nltk
from openai import OpenAI

# Patch the OpenAI client so chat completions accept response_model and retry on validation errors
client = instructor.patch(OpenAI())


class InitialSummary(BaseModel):
    """
    This is an initial summary which should be long (4-5 sentences, ~80 words) yet highly non-specific, containing little information beyond the entities marked as missing. Use overly verbose language and fillers (e.g. "this article discusses") to reach ~80 words.
    """

    summary: str = Field(
        ...,
        description="This is a summary of the article provided which is overly verbose and uses fillers. It should be roughly 80 words in length",
    )

class RewrittenSummary(BaseModel):
    """
    This is a new, denser summary of identical length which covers every entity and detail from the previous summary plus the Missing Entities.

    Guidelines
    - Make every word count: rewrite the previous summary to improve flow and make space for additional entities
    - Never drop entities from the previous summary. If space cannot be made, add fewer new entities.
    - The new summary should be highly dense and concise yet self-contained, e.g., easily understood without the Article.
    - Make space with fusion, compression, and removal of uninformative phrases like "the article discusses"
    - Missing entities can appear anywhere in the new summary

    An Entity is a real-world object that's assigned a name - for example, a person, a country, a product or a book title.
    """

    summary: str = Field(
        ...,
        description="This is a new, denser summary of identical length which covers every entity and detail from the previous summary plus the Missing Entities. It should have the same length (~80 words) as the previous summary and should be easily understood without the Article",
    )
    absent: List[str] = Field(
        default_factory=list,
        description="This is a list of Entities found absent from the new summary that were present in the previous summary",
    )
    missing: List[str] = Field(
        default_factory=list,
        description="This is a list of 1-3 informative Entities from the Article that are missing from the new summary which should be included in the next generated summary.",
    )

    @field_validator("summary")
    def min_length(cls, v: str):
        # Reject summaries that fall short of ~80 words so the model is re-asked
        tokens = nltk.word_tokenize(v)
        num_tokens = len(tokens)
        if num_tokens < 75:
            raise ValueError(
                "The current summary is too short. Please make sure that you generate a new summary that is around 80 words long."
            )
        return v

    @field_validator("missing")
    def has_missing_entities(cls, missing_entities: List[str]):
        # Each rewrite must name 1-3 new entities to fold into the next summary
        if len(missing_entities) == 0:
            raise ValueError(
                "You must identify 1-3 informative Entities from the Article which are missing from the previously generated summary to be used in a new summary"
            )
        return missing_entities

    @field_validator("absent")
    def has_no_absent_entities(cls, absent_entities: List[str]):
        # Entities from the previous summary must never be dropped
        absent_entity_string = ",".join(absent_entities)
        if len(absent_entities) > 0:
            print(f"Detected absent entities of {absent_entity_string}")
            raise ValueError(
                f"Do not omit the following Entities {absent_entity_string} from the new summary"
            )
        return absent_entities


def summarize_article(article: str, summary_steps: int = 3):
    summary_chain = []
    # We first generate an initial summary
    summary: InitialSummary = client.chat.completions.create(
        model="gpt-4-0613",
        response_model=InitialSummary,
        messages=[
            {
                "role": "system",
                "content": "Write a summary about the article that is long (4-5 sentences) yet highly non-specific. Use overly verbose language and fillers (e.g. 'this article discusses') to reach ~80 words",
            },
            {"role": "user", "content": f"Here is the Article: {article}"},
            {
                "role": "user",
                "content": "The generated summary should be about 80 words.",
            },
        ],
        max_retries=2,
    )
    prev_summary = None
    summary_chain.append(summary.summary)
    for i in range(summary_steps):
        missing_entity_message = (
            []
            if prev_summary is None
            else [
                {
                    "role": "user",
                    "content": f"Please include these Missing Entities: {','.join(prev_summary.missing)}",
                },
            ]
        )
        new_summary: RewrittenSummary = client.chat.completions.create(
            model="gpt-4-0613",
            messages=[
                {
                    "role": "system",
                    "content": """
                You are going to generate an increasingly concise, entity-dense summary of the following article.

                Perform the following two tasks
                - Identify 1-3 informative entities from the following article which are missing from the previous summary
                - Write a new denser summary of identical length which covers every entity and detail from the previous summary plus the Missing Entities

                Guidelines
                - Make every word count: re-write the previous summary to improve flow and make space for additional entities
                - Make space with fusion, compression, and removal of uninformative phrases like "the article discusses".
                - The summaries should become highly dense and concise yet self-contained, e.g., easily understood without the Article.
                - Missing entities can appear anywhere in the new summary
                - Never drop entities from the previous summary. If space cannot be made, add fewer new entities.
                """,
                },
                {"role": "user", "content": f"Here is the Article: {article}"},
                {
                    "role": "user",
                    "content": f"Here is the previous summary: {summary_chain[-1]}",
                },
                *missing_entity_message,
            ],
            max_retries=3,
            max_tokens=1000,
            response_model=RewrittenSummary,
        )
        summary_chain.append(new_summary.summary)
        prev_summary = new_summary

    return summary_chain
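As a quick illustration of how this module might be driven (a sketch; `article.txt` is a hypothetical input file): each call returns the full chain of summaries, and the validators above cause `instructor` to re-prompt the model, up to `max_retries` times, whenever a summary is too short, drops an entity, or names no missing entities.

```
# Minimal usage sketch (assumes OPENAI_API_KEY is set and `punkt` is downloaded)
from chain_of_density import summarize_article

with open("article.txt") as f:  # hypothetical input file
    article = f.read()

chain = summarize_article(article, summary_steps=3)
for i, summary in enumerate(chain):
    print(f"--- Summary {i} ---")
    print(summary)

# chain[0] is the verbose initial summary; chain[-1] is the densest rewrite
```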
@@ -0,0 +1,32 @@
from typing import List
from chain_of_density import summarize_article
import csv
import logging
import instructor
from pydantic import BaseModel

logging.basicConfig(level=logging.INFO)

# Log every distilled call as fine-tuning data in messages format
instructions = instructor.Instructions(
    name="Chain Of Density",
    finetune_format="messages",
    log_handlers=[logging.FileHandler("summarization.jsonl")],
)


class GeneratedSummary(BaseModel):
    summary: str


@instructions.distil
def distil_summarization(text: str) -> GeneratedSummary:
    # Run the full Chain of Density pipeline and keep only the final, densest summary
    summary_chain: List[str] = summarize_article(text)
    return GeneratedSummary(summary=summary_chain[-1])


# Read in the csv file we have
with open("test.csv", "r") as file:
    reader = csv.reader(file)
    next(reader)  # Skip the header
    for article, summary in reader:
        distil_summarization(article)
> Review comment: the loop calls `distil_summarization(article)` but discards the return value. If the generated summaries are needed downstream, collect them, e.g. `summaries.append(distil_summarization(article))`; here the call is made primarily for its side effect of logging fine-tuning examples to `summarization.jsonl`.
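To reproduce the pipeline end to end, the steps in the README boil down to the following (a sketch; file names as described above):

```
python3 finetune.py  # summarizes each article in test.csv, logging examples to summarization.jsonl
instructor jobs create-from-file summarization.jsonl  # kicks off the fine-tuning job
```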
@@ -0,0 +1,5 @@
openai
pydantic
instructor
nltk
rich
> Review comment: the command `instructor jobs create-from-file summarization.jsonl` seems to be incorrect; it should be `instruct` instead of `instructor`.