diff --git a/docs/blog/posts/chain-of-density.md b/docs/blog/posts/chain-of-density.md
new file mode 100644
index 000000000..5e07816c1
--- /dev/null
+++ b/docs/blog/posts/chain-of-density.md
@@ -0,0 +1,467 @@
+---
+draft: False
+date: 2023-11-05
+tags:
+  - pydantic
+  - validation
+  - chain of density
+  - finetuning
+  - gpt-3.5-turbo
+  - distillation
+authors:
+  - ivanleomk
+  - jxnl
+---
+
+# Better Summaries by Finetuning Chain of Density
+
+> Discover how to distill an iterative method like Chain of Density into a single finetune.
+
+In this article, we'll guide you through implementing the original Chain of Density method using Instructor, then show how to distill a GPT-3.5 model to match GPT-4's iterative summarization capabilities. Using these methods, we were able to decrease latency by 20x and reduce costs by 50x while maintaining entity density.
+
+By the end, you'll have a GPT-3.5 model (fine-tuned using Instructor's tooling) capable of producing summaries that rival the effectiveness of Chain of Density. As always, all code is readily available in the `examples/chain-of-density` folder of our repo for your reference.
+
+??? abstract "Datasets and Colab Notebook"
+
+    We've uploaded all our generated data to Hugging Face [here](https://huggingface.co/datasets/ivanleomk/gpt4-chain-of-density) for you to use if you'd like to try reproducing these experiments. We've also added a [Colab Instance](https://colab.research.google.com/drive/1iBkrEh2G5U8yh8RmI8EkWxjLq6zIIuVm?usp=sharing) for you to check our generated values.
+
+## Part 1) Chain of Density
+
+Summarizing extensive texts with AI can be challenging, often relying on inconsistent techniques. Chain of Density, a novel method from Salesforce AI Research, enhances AI-based text summarization and outperforms human-generated summaries.
+
+Initially, an AI produces a summary, then refines it through multiple iterations, adding missing article entities at each pass while keeping the length consistent. The result is an entity-dense, informative summary known as a Chain of Density summary.
+
+The method was first introduced by Salesforce's AI Research wing in their paper - [From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting](https://arxiv.org/abs/2309.04269). The team found that it consistently beats similar summaries written by human annotators.
+
+??? info "Implementation Details"
+
+    Note that our implementation uses a validator, rather than a prompt, to ensure that the rewritten summary has a minimum length. We also perform just 3 rounds of rewrites instead of 5, resulting in a lower final entity density.
+
+### Original Prompt
+
+We can implement the original prompt using `pip install instructor` by breaking the entire process down into smaller API calls. This allows us to introduce validation at each step to ensure that we're getting the results that we want.
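+
+The snippets below assume a small amount of setup: alongside `instructor`, our validators rely on spaCy's `en_core_web_sm` model and NLTK's `punkt` tokenizer, both of which need to be downloaded once.
+
+```sh
+pip install instructor spacy nltk
+# Small English model used for entity extraction
+python -m spacy download en_core_web_sm
+# Tokenizer data used to count words in summaries
+python -c "import nltk; nltk.download('punkt')"
+```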
+
+??? note "Original Chain of Density Prompt"
+
+    ```
+    Article: {{ARTICLE}}
+
+    You will generate increasingly concise, entity-dense summaries of the
+    above Article.
+
+    Repeat the following 2 steps 5 times.
+
+    Step 1. Identify 1-3 informative Entities (";" delimited) from the
+    Article which are missing from the previously generated summary.
+    Step 2. Write a new, denser summary of identical length which covers
+    every entity and detail from the previous summary plus the Missing
+    Entities.
+
+    A Missing Entity is:
+    - Relevant: to the main story.
+    - Specific: descriptive yet concise (5 words or fewer).
+    - Novel: not in the previous summary.
+    - Faithful: present in the Article.
+    - Anywhere: located anywhere in the Article.
+
+    Guidelines:
+    - The first summary should be long (4-5 sentences, ~80 words) yet
+    highly non-specific, containing little information beyond the
+    entities marked as missing. Use overly verbose language and fillers
+    (e.g., "this article discusses") to reach ~80 words.
+    - Make every word count: re-write the previous summary to improve
+    flow and make space for additional entities.
+    - Make space with fusion, compression, and removal of uninformative
+    phrases like "the article discusses"
+    - The summaries should become highly dense and concise yet
+    self-contained, e.g., easily understood without the Article.
+    - Missing entities can appear anywhere in the new summary.
+    - Never drop entities from the previous summary. If space cannot be
+    made, add fewer new entities.
+
+    Remember, use the exact same number of words for each summary.
+
+    Answer in JSON. The JSON should be a list (length 5) of dictionaries
+    whose keys are "Missing_Entities" and "Denser_Summary"
+    ```
+
+<figure markdown>
+ ![RAG](img/chain-of-density.png) +
Improved process with Instructor
+
+
+### Data Modelling
+
+#### Initial Summary
+
+Let's start by walking through some of the data models that we'll be using as the `response_model` for our OpenAI function calls.
+
+Firstly, we'll need a data model for the initial summary that we will be generating. We'll take the description of this class straight from the original prompt. It's important to note that these docstrings serve a purpose: they are directly used by the LLM when generating the outputs.
+
+```py
+class InitialSummary(BaseModel):
+    """
+    This is an initial summary which should be long (4-5 sentences, ~80 words)
+    yet highly non-specific, containing little information beyond the entities marked as missing.
+    Use overly verbose language and fillers (e.g. "This article discusses") to reach ~80 words.
+    """
+
+    summary: str = Field(
+        ...,
+        description="This is a summary of the article provided which is overly verbose and uses fillers. It should be roughly 80 words in length",
+    )
+```
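+
+To see why these docstrings matter, you can inspect the JSON schema that `Pydantic` generates for the model, since that schema is what ultimately gets sent to OpenAI as the function definition. A quick sanity check along these lines:
+
+```py
+import json
+
+# The class docstring becomes the schema's top-level description,
+# and each Field's description is attached to its property.
+schema = InitialSummary.model_json_schema()
+print(schema["description"])
+#> This is an initial summary which should be long (4-5 sentences, ~80 words)...
+print(json.dumps(schema["properties"]["summary"], indent=2))
+```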
+
+#### Rewritten Summary
+
+We'll also need an additional class to help model the rewritten summary.
+
+```py
+class RewrittenSummary(BaseModel):
+    """
+    This is a new, denser summary of identical length which covers every entity
+    and detail from the previous summary plus the Missing Entities.
+
+    Guidelines
+    - Make every word count: rewrite the previous summary to improve flow and make space for additional entities
+    - Never drop entities from the previous summary. If space cannot be made, add fewer new entities.
+    - The new summary should be highly dense and concise yet self-contained, e.g., easily understood without the Article.
+    - Make space with fusion, compression, and removal of uninformative phrases like "the article discusses"
+    - Missing entities can appear anywhere in the new summary
+
+    An Entity is a real-world object that's assigned a name - for example, a person, a country, a product or a book title.
+    """
+
+    summary: str = Field(
+        ...,
+        description="This is a new, denser summary of identical length which covers every entity and detail from the previous summary plus the Missing Entities. It should have the same length (~80 words) as the previous summary and should be easily understood without the Article",
+    )
+    absent: List[str] = Field(
+        default_factory=list,
+        description="This is a list of Entities found absent from the new summary that were present in the previous summary",
+    )
+    missing: List[str] = Field(
+        default_factory=list,
+        description="This is a list of 1-3 informative Entities from the Article that are missing from the new summary which should be included in the next generated summary.",
+    )
+```
+
+!!! tip "Using Pydantic Validators with Instructor"
+
+    For a more in-depth walkthrough on how to use `Pydantic` validators with the `Instructor`
+    library, we recommend checking out our previous article on LLM
+    validation - [Good LLM Validation is just Good Validation](/instructor/blog/2023/10/23/good-llm-validation-is-just-good-validation/)
+
+Ideally, we'd like `missing` to have a length between 1 and 3, `absent` to be an empty list, and our rewritten summaries to maintain a minimum entity density. With `Instructor`, we can implement this logic using native `Pydantic` validators that are simply declared as part of the class itself.
+
+```py hl_lines="8 40 44"
+import nltk
+import spacy
+
+nlp = spacy.load("en_core_web_sm")
+
+@field_validator("summary")
+def min_length(cls, v: str):
+    tokens = nltk.word_tokenize(v)  # (1)!
+    num_tokens = len(tokens)
+    if num_tokens < 60:
+        raise ValueError(
+            "The current summary is too short. Please make sure that you generate a new summary that is around 80 words long."
+        )
+    return v
+
+@field_validator("missing")
+def has_missing_entities(cls, missing_entities: List[str]):
+    if len(missing_entities) == 0:
+        raise ValueError(
+            "You must identify 1-3 informative Entities from the Article which are missing from the previously generated summary to be used in a new summary"
+        )
+    return missing_entities
+
+@field_validator("absent")
+def has_no_absent_entities(cls, absent_entities: List[str]):
+    absent_entity_string = ",".join(absent_entities)
+    if len(absent_entities) > 0:
+        print(f"Detected absent entities of {absent_entity_string}")
+        raise ValueError(
+            f"Do not omit the following Entities {absent_entity_string} from the new summary"
+        )
+    return absent_entities
+
+@field_validator("summary")
+def min_entity_density(cls, v: str):
+    tokens = nltk.word_tokenize(v)
+    num_tokens = len(tokens)
+
+    # Extract Entities
+    doc = nlp(v)  # (2)!
+    num_entities = len(doc.ents)
+
+    density = num_entities / num_tokens
+    if density < 0.08:  # (3)!
+        raise ValueError(
+            f"The summary of {v} has too few entities. Please regenerate a new summary with more new entities added to it. Remember that new entities can be added at any point of the summary."
+        )
+
+    return v
+```
+
+1. Similar to the original paper, we utilize the `NLTK` word tokenizer to count the number of tokens within our generated sentences.
+   We aim for at least 60 tokens in our generated summary so that we don't lose information.
+
+2. We also use the `spaCy` library to calculate the entity density of the generated summary.
+
+3. We also enforce a minimum entity density so that we stay within a given range. 0.08 is arbitrarily chosen in this case.
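+
+To get a feel for what these validators do at runtime, here's a small illustrative check, assuming the validators above are declared inside `RewrittenSummary`. A summary that is far too short fails the `min_length` check, and it is exactly this error message that `Instructor` feeds back to the model on a retry:
+
+```py
+from pydantic import ValidationError
+
+try:
+    # Well under the 60-token minimum, so min_length raises
+    RewrittenSummary(
+        summary="Pacquiao will fight Mayweather in Las Vegas.",
+        absent=[],
+        missing=["Manny Pacquiao"],
+    )
+except ValidationError as e:
+    print(e)
+#> ... The current summary is too short. Please make sure that you generate a new summary ...
+```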
+
+### Putting it all Together
+
+Now that we have our models and the rough flow figured out, let's implement a function to summarize a piece of text using `Chain Of Density` summarization.
+
+```py hl_lines="4 9-24 38-68"
+from openai import OpenAI
+import instructor
+
+client = instructor.patch(OpenAI())  # (1)!
+
+def summarize_article(article: str, summary_steps: int = 3):
+    summary_chain = []
+    # We first generate an initial summary
+    summary: InitialSummary = client.chat.completions.create(  # (2)!
+        model="gpt-4-0613",
+        response_model=InitialSummary,
+        messages=[
+            {
+                "role": "system",
+                "content": "Write a summary about the article that is long (4-5 sentences) yet highly non-specific. Use overly verbose language and fillers (e.g., 'this article discusses') to reach ~80 words",
+            },
+            {"role": "user", "content": f"Here is the Article: {article}"},
+            {
+                "role": "user",
+                "content": "The generated summary should be about 80 words.",
+            },
+        ],
+        max_retries=2,
+    )
+    prev_summary = None
+    summary_chain.append(summary.summary)
+    for i in range(summary_steps):
+        missing_entity_message = (
+            []
+            if prev_summary is None
+            else [
+                {
+                    "role": "user",
+                    "content": f"Please include these Missing Entities: {','.join(prev_summary.missing)}",
+                },
+            ]
+        )
+        new_summary: RewrittenSummary = client.chat.completions.create(  # (3)!
+            model="gpt-4-0613",
+            messages=[
+                {
+                    "role": "system",
+                    "content": """
+                You are going to generate an increasingly concise, entity-dense summary of the following article.
+
+                Perform the following two tasks
+                - Identify 1-3 informative entities from the following article which are missing from the previous summary
+                - Write a new denser summary of identical length which covers every entity and detail from the previous summary plus the Missing Entities
+
+                Guidelines
+                - Make every word count: re-write the previous summary to improve flow and make space for additional entities
+                - Make space with fusion, compression, and removal of uninformative phrases like "the article discusses".
+                - The summaries should become highly dense and concise yet self-contained, e.g., easily understood without the Article.
+                - Missing entities can appear anywhere in the new summary
+                - Never drop entities from the previous summary. If space cannot be made, add fewer new entities.
+                """,
+                },
+                {"role": "user", "content": f"Here is the Article: {article}"},
+                {
+                    "role": "user",
+                    "content": f"Here is the previous summary: {summary_chain[-1]}",
+                },
+                *missing_entity_message,
+            ],
+            max_retries=3,  # (4)!
+            max_tokens=1000,
+            response_model=RewrittenSummary,
+        )
+        summary_chain.append(new_summary.summary)
+        prev_summary = new_summary
+
+    return summary_chain
+```
+
+1. We need to apply a `patch` function on the `OpenAI` client for us to get all
+   of the benefits that `Instructor` provides. With a simple `patch`, we can get
+   **automatic type coercion of our outputs and automatic retries for invalid outputs**
+   out of the box!
+
+2. We first generate an initial summary. Note here that we explicitly ask for a summary that has
+   80 words and is lengthy with overly verbose fillers in the system prompt.
+
+3. We slightly modify the system prompt used in the original paper to perform a rewrite of the summary.
+   Using `Instructor`, we also get validation of the generated output with the `field_validator`s that we defined above.
+
+4. If you've chosen a density threshold larger than 0.08, make sure to increase `max_retries` in case you need to do multiple rewrites.
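+
+To generate a chain like the ones shown below, call the function directly on any long-form text. A usage sketch, where `article.txt` is a stand-in for whatever you want to summarize:
+
+```py
+with open("article.txt") as f:
+    article = f.read()
+
+# Returns the initial verbose summary followed by each denser rewrite
+chain = summarize_article(article, summary_steps=3)
+for i, summary in enumerate(chain):
+    print(f"Iteration {i}: {summary}")
+```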
+
+This summarization function yields a result which triples the number of entities while maintaining the same number of tokens. We can also see that, stylistically, the summary is a lot more natural.
+
+**First Iteration**
+
+> This article discusses the highly-anticipated boxing match between Manny Pacquiao and Floyd Mayweather. The article revolves around Manny Pacquiao's statements about his upcoming fight and his preparations for the same. A portion of the article provides details about the financial stipulations of the match and its significance in the sporting arena. Quotes from Pacquiao illustrating his determination and his battle strategy are highlighted. The tone of the article is largely centered around creating a build-up to the upcoming mega event.
+
+**Final Iteration**
+
+> Manny Pacquiao, the Filipino boxer, anticipates the forthcoming May 2 showdown at the MGM Grand as the fight of his life, against the undefeated American Floyd Mayweather, in a $300m bout. Despite being seen as the underdog in this high-stakes Las Vegas match, Pacquiao is confident, promising a warrior's spirit and assuring the fans who have been awaiting this encounter for a decade, that it will indeed be the biggest sporting spectacle in history worthy of their anticipation
+
+## Part 2) Fine-Tuning
+
+In this section, we'll look at how to fine-tune a GPT-3.5 model so that it performs at an equivalent level to a GPT-4 model. We'll then compare the performance of our model against that of `GPT-4` and `GPT-4-Turbo` to see how it stacks up.
+
+### Creating a Training Set
+
+Let's first segregate our train and test sets so that we don't have any sort of contamination - this corresponds to `train.csv` and `test.csv` in our [Hugging Face Dataset](https://huggingface.co/datasets/ivanleomk/gpt4-chain-of-density). Now, we just need to import the `Instructions` module from the `Instructor` package, which allows you to generate a nicely formatted `.jsonl` file to be used for fine-tuning.
+
+```py hl_lines="2 9 11-18 37 40"
+from typing import List
+from chain_of_density import summarize_article  # (1)!
+import csv
+import logging
+import instructor
+from pydantic import BaseModel, Field
+
+logging.basicConfig(level=logging.INFO)  # (2)!
+
+instructions = instructor.Instructions(  # (3)!
+    name="Chain Of Density",
+    finetune_format="messages",
+    # log handler is used to save the data to a file
+    # you can imagine saving it to a database or other storage
+    # based on your needs!
+    log_handlers=[logging.FileHandler("generated.jsonl")],
+)
+
+class GeneratedSummary(BaseModel):
+    """
+    This represents a highly concise summary that includes as many entities as possible from the original source article.
+
+    An Entity is a real-world object that's assigned a name - for example, a person, a country, a product or a book title.
+
+    Guidelines
+    - Make every word count
+    - The new summary should be highly dense and concise yet self-contained, e.g., easily understood without the Article.
+    - Make space with fusion, compression, and removal of uninformative phrases like "the article discusses"
+    """
+
+    summary: str = Field(
+        ...,
+        description="This represents the final summary generated that captures the meaning of the original article which is as concise as possible. ",
+    )
+
+@instructions.distil  # (4)!
+def distil_summarization(text: str) -> GeneratedSummary:
+    summary_chain: List[str] = summarize_article(text)
+    return GeneratedSummary(summary=summary_chain[-1])  # (5)!
+
+with open("train.csv", "r") as file:
+    reader = csv.reader(file)
+    next(reader)  # Skip the header
+    for article, summary in reader:
+        # Run distillation to generate the values
+        distil_summarization(article)
+```
+
+1. In this example, we're using the `summarize_article` function that we defined up above. We saved it in a local file called `chain_of_density.py`,
+   hence the import.
+
+2. We also need to configure logging at the `INFO` level. This is very important: if this is not configured, your output will not be generated.
+
+3. We instantiate an `Instructions` object which will help us handle the conversion of our function calls into a valid `.jsonl` file. We also define
+   the name of the `.jsonl` file in the `log_handlers` parameter.
+
+4. We add an `instructions.distil` decorator so that we automatically capture the input and output of the function we'd like to
+   fine-tune our model to reproduce.
+
+5. We return a `Pydantic` object which matches the annotation that we use on our function. Note that we must specify a `Pydantic` object to
+   be returned when using the `instructions.distil` decorator.
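+
+Before kicking off the full run, it's worth smoke-testing the pipeline on a handful of rows to confirm that logging is configured and the validators behave as expected. One way to cap the row count with `islice`:
+
+```py
+import csv
+from itertools import islice
+
+with open("train.csv", "r") as file:
+    reader = csv.reader(file)
+    next(reader)  # Skip the header
+    # Only distil the first 3 articles as a smoke test
+    for article, summary in islice(reader, 3):
+        distil_summarization(article)
+```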
+
+!!! warning "Rate Limiting"
+
+    We recommend running this script on a small subset of the dataset first (as sketched above) to test that you've got everything configured correctly.
+    Don't forget to add rate-limit error handling with `tenacity` and to set the `OPENAI_API_KEY` shell environment variable
+    before running any subsequent commands.
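+
+One way to add that rate-limit handling is to wrap the distillation call in a `tenacity` backoff decorator. A sketch, where the wrapper name and backoff parameters are our own choices:
+
+```py
+from openai import RateLimitError
+from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential
+
+@retry(
+    retry=retry_if_exception_type(RateLimitError),
+    wait=wait_exponential(multiplier=1, min=4, max=60),
+    stop=stop_after_attempt(5),
+)
+def distil_with_backoff(article: str) -> GeneratedSummary:
+    # Delegate to the decorated function so successful calls are still logged
+    return distil_summarization(article)
+```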
+
+### Creating Fine-Tuning Jobs
+
+Once we run this script, we'll have a new file called `generated.jsonl` in our local repository. Now all that's left is to run the command below to start fine-tuning your first model!
+
+```sh
+instructor jobs create-from-file generated.jsonl
+```
+
+??? note "Finetuning Reference"
+
+    Check out our [Finetuning CLI](/instructor/cli/finetune/) to learn about other hyperparameters that you can tune to improve your model's performance.
+
+Once the job is complete, all we need to do is change the decorator on `distil_summarization` in our original file above to start using our new model.
+
+```py
+@instructions.distil(model='gpt-3.5-turbo:finetuned-123', mode="dispatch")  # (1)!
+def distil_summarization(text: str) -> GeneratedSummary:
+    summary_chain: List[str] = summarize_article(text)
+    return GeneratedSummary(summary=summary_chain[-1])
+```
+
+1. Don't forget to replace this with your new model id. OpenAI identifies fine-tuned models with an id of the form
+   `ft:gpt-3.5-turbo-0613:personal::` under the Fine-tuning tab on their dashboard.
+
+With that, you've now got your own fine-tuned model ready to go and serve data in production. We've seen how Instructor can make your life easier, from fine-tuning to distillation.
+
+## Results and Benchmarks
+
+We fine-tuned a total of 3 different models, giving them 20, 50 and 76 samples respectively, to see if more data improved the models. We then compared the output of these fine-tuned models to GPT-4 and GPT-3 summaries that were generated using the Chain of Density method.
+
+We'll be comparing these models in three main ways:
+
+- Entity Density: entities per token; the higher, the denser the summary (a helper that reproduces this metric is sketched below)
+- Latency: time to the last generated token, in seconds
+- Cost: how much the entire experiment cost
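+
+Entity density is computed the same way as in our validators: spaCy entities per NLTK word token. A small helper along these lines (the function name is ours) reproduces the metric:
+
+```py
+import nltk
+import spacy
+
+nlp = spacy.load("en_core_web_sm")
+
+def entity_density(summary: str) -> float:
+    """Number of spaCy-detected entities per NLTK word token."""
+    tokens = nltk.word_tokenize(summary)
+    entities = nlp(summary).ents
+    return len(entities) / len(tokens)
+
+print(entity_density("Manny Pacquiao fights Floyd Mayweather at the MGM Grand on May 2."))
+```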
+
+We used a total of 20 articles as a validation set which our fine-tuned models had not seen before. This was the overall performance that we observed.
+
+| Model               | Mean Latency (s) | Mean Entity Count | Mean Entity Density | Tokens |
+| ------------------- | ---------------- | ----------------- | ------------------- | ------ |
+| GPT-4 (COD)         | 49.5             | 11.3              | 0.138               | 81.65  |
+| GPT-3 (COD)         | 145.94           | 11.05             | 0.105               | 105.7  |
+| 3.5 Finetuned (20)  | 2.25             | 14.7              | 0.154               | 95.45  |
+| 3.5 Finetuned (50)  | 2.09             | 12.4              | 0.140               | 88.35  |
+| 3.5 Finetuned (76)  | 2.17             | 11.65             | 0.142               | 82.05  |
+
+??? note "Finetuning Datasets"
+
+    For our fine-tuned models, we made a few optimisations to raise performance.
+
+    We only included summaries that had a minimum density of 0.15 in the dataset, took the summary in the entire chain with the highest density as the final one, forced every regenerated summary to have a minimum density of 0.12, and regenerated summaries up to three times if they didn't meet these requirements. **This is a much more expensive strategy that can cost 2.5x or more than what we do in this tutorial.**
+
+    This resulted in a total cost of $63.46 to generate just 75 examples due to the stringent requirements, translating to about $0.85 per generated summary example.
+
+Using the OpenAI Usage Dashboard, we can calculate the cost of generating 20 summaries, as seen below.
+
+| Model               | Training Cost ($) | Inference Cost ($) | Tokens Used | Total Cost ($) |
+| ------------------- | ----------------- | ------------------ | ----------- | -------------- |
+| 3.5 Finetuned (20)  | 0.664             | 0.207              | 56,573      | 0.817          |
+| 3.5 Finetuned (50)  | 1.368             | 0.165              | 49,057      | 1.266          |
+| 3.5 Finetuned (76)  | 1.824             | 0.174              | 51,583      | 2.481          |
+| GPT-4 (COD)         | -                 | 12.9               | 409,062     | 12.9           |
+| GPT-3 (COD)         | -                 | 0.45               | 290,164     | 0.45           |
+
+Here, we can see that `GPT-4` has an approximate inference cost of `0.65` per summary while our fine-tuned models have an inference cost of `0.0091` per summary, which is ~72x cheaper.
+
+## Conclusions
+
+Fine-tuning this iterative method was 20-40x faster while improving overall performance, resulting in massive efficiency gains from fine-tuning and distilling the capability into a specialized model.
+
+We've seen how `Instructor` can make your life easier, from data modeling to distillation and fine-tuning. If you enjoy the content or want to try out `instructor`, check out the [github](https://github.com/jxnl/instructor) and don't forget to give us a star!
diff --git a/docs/blog/posts/img/chain-of-density.png b/docs/blog/posts/img/chain-of-density.png
new file mode 100644
index 000000000..75e361a00
Binary files /dev/null and b/docs/blog/posts/img/chain-of-density.png differ
diff --git a/examples/chain-of-density/Readme.md b/examples/chain-of-density/Readme.md
new file mode 100644
index 000000000..6aac9998c
--- /dev/null
+++ b/examples/chain-of-density/Readme.md
@@ -0,0 +1,31 @@
+# Introduction
+
+This is a simple example which shows how to perform Chain of Density summarization and use the generated output to fine-tune a GPT-3.5 model for production usage. All of the data referenced in this file is located [here](https://huggingface.co/datasets/ivanleomk/gpt4-chain-of-density) on Hugging Face.
+
+Check out our blog post [here](https://jxnl.github.io/instructor/blog/2023/11/05/implementing-chain-of-density/) where we have a detailed explanation of the code and a [colab notebook](https://colab.research.google.com/drive/1iBkrEh2G5U8yh8RmI8EkWxjLq6zIIuVm?usp=sharing) walking you through how we perform our calculations.
+
+## Instructions
+
+1. First, install all of the required dependencies by running the command below. We recommend using a virtual environment to install these so that it does not affect your system installation.
+
+> We use NLTK to ensure that our summaries are of a certain token length. In order to do so, you'll need to download the `punkt` package to compute the token metrics. You can do so by running the command `nltk.download('punkt')`
+
+```
+pip3 install -r requirements.txt
+```
+
+2. Download the `test.csv` file and the `summarization.jsonl` file that you want to use for finetuning. We provide versions with `20`, `50` and `100` examples to be used for testing. Let's now run a simple finetuning job with the following command.
+
+> Don't forget to set your `OPENAI_API_KEY` as an environment variable in your shell before running these commands
+
+```
+instructor jobs create-from-file summarization.jsonl
+```
+
+3. Once the job is complete, you'll end up with a new GPT-3.5 model that's capable of producing high-quality summaries with a high entity density. You can run it by simply changing the `instructions.distil` decorator in our `finetune.py` file, as below.
+
+```
+@instructions.distil(model="<your-finetuned-model-id>", mode="dispatch")
+def distil_summarization(text: str) -> GeneratedSummary:
+    # rest of code goes here
+```
\ No newline at end of file
diff --git a/examples/chain-of-density/chain_of_density.py b/examples/chain-of-density/chain_of_density.py
new file mode 100644
index 000000000..706373a5e
--- /dev/null
+++ b/examples/chain-of-density/chain_of_density.py
@@ -0,0 +1,151 @@
+from pydantic import BaseModel, Field, field_validator
+from typing import List
+import instructor
+import nltk
+from openai import OpenAI
+import spacy
+
+client = instructor.patch(OpenAI())
+nlp = spacy.load("en_core_web_sm")
+
+
+class InitialSummary(BaseModel):
+    """
+    This is an initial summary which should be long (4-5 sentences, ~80 words) yet highly non-specific, containing little information beyond the entities marked as missing. Use overly verbose language and fillers (e.g. "This article discusses") to reach ~80 words.
+    """
+
+    summary: str = Field(
+        ...,
+        description="This is a summary of the article provided which is overly verbose and uses fillers. It should be roughly 80 words in length",
+    )
+ ) + + return v + + @field_validator("summary") + def min_length(cls, v: str): + tokens = nltk.word_tokenize(v) + num_tokens = len(tokens) + if num_tokens < 60: + raise ValueError( + "The current summary is too short. Please make sure that you generate a new summary that is around 80 words long." + ) + return v + + @field_validator("missing") + def has_missing_entities(cls, missing_entities: List[str]): + if len(missing_entities) == 0: + raise ValueError( + "You must identify 1-3 informative Entities from the Article which are missing from the previously generated summary to be used in a new summary" + ) + return missing_entities + + @field_validator("absent") + def has_no_absent_entities(cls, absent_entities: List[str]): + absent_entity_string = ",".join(absent_entities) + if len(absent_entities) > 0: + print(f"Detected absent entities of {absent_entity_string}") + raise ValueError( + f"Do not omit the following Entities {absent_entity_string} from the new summary" + ) + return absent_entities + + +def summarize_article(article: str, summary_steps: int = 3): + summary_chain = [] + # We first generate an initial summary + summary: InitialSummary = client.chat.completions.create( + model="gpt-4-0613", + response_model=InitialSummary, + messages=[ + { + "role": "system", + "content": "Write a summary about the article that is long (4-5 sentences) yet highly non-specific. Use overly, verbose language and fillers(eg.,'this article discusses') to reach ~80 words. ", + }, + {"role": "user", "content": f"Here is the Article: {article}"}, + { + "role": "user", + "content": "The generated summary should be about 80 words.", + }, + ], + max_retries=2, + ) + summary_chain.append(summary.summary) + for i in range(summary_steps): + new_summary: RewrittenSummary = client.chat.completions.create( + model="gpt-4-0613", + messages=[ + { + "role": "system", + "content": f""" + Article: {article} + You are going to generate an increasingly concise,entity-dense summary of the following article. + + Perform the following two tasks + - Identify 1-3 informative entities from the following article which is missing from the previous summary + - Write a new denser summary of identical length which covers every entity and detail from the previous summary plus the Missing Entities + + Guidelines + - Make every word count: re-write the previous summary to improve flow and make space for additional entities + - Make space with fusion, compression, and removal of uninformative phrases like "the article discusses". + - The summaries should become highly dense and concise yet self-contained, e.g., easily understood without the Article. + - Missing entities can appear anywhere in the new summary + - Never drop entities from the previous summary. If space cannot be made, add fewer new entities. 
+ """, + }, + { + "role": "user", + "content": f"Here is the previous summary: {summary_chain[-1]}", + }, + ], + max_retries=5, + max_tokens=1000, + response_model=RewrittenSummary, + ) + summary_chain.append(new_summary.summary) + + return summary_chain diff --git a/examples/chain-of-density/finetune.py b/examples/chain-of-density/finetune.py new file mode 100644 index 000000000..45d509c3b --- /dev/null +++ b/examples/chain-of-density/finetune.py @@ -0,0 +1,48 @@ +from typing import List +from chain_of_density import summarize_article +import csv +import logging +import instructor +from pydantic import BaseModel, Field + +logging.basicConfig(level=logging.INFO) + +instructions = instructor.Instructions( + name="Chain Of Density", + finetune_format="messages", + # log handler is used to save the data to a file + # you can imagine saving it to a database or other storage + # based on your needs! + log_handlers=[logging.FileHandler("generated.jsonl")], +) + + +class GeneratedSummary(BaseModel): + """ + This represents a highly concise summary that includes as many entities as possible from the original source article. + + An Entity is a real-world object that's assigned a name - for example, a person, country a product or a book title. + + Guidelines + - Make every word count + - The new summary should be highly dense and concise yet self-contained, eg., easily understood without the Article. + - Make space with fusion, compression, and removal of uninformative phrases like "the article discusses" + """ + + summary: str = Field( + ..., + description="This represents the final summary generated that captures the meaning of the original article which is as concise as possible. ", + ) + + +@instructions.distil +def distil_summarization(text: str) -> GeneratedSummary: + summary_chain: List[str] = summarize_article(text) + return GeneratedSummary(summary=summary_chain[-1]) + + +with open("test.csv", "r") as file: + reader = csv.reader(file) + next(reader) # Skip the header + for article, summary in reader: + distil_summarization(article) diff --git a/examples/chain-of-density/requirements.txt b/examples/chain-of-density/requirements.txt new file mode 100644 index 000000000..8cc8d88f6 --- /dev/null +++ b/examples/chain-of-density/requirements.txt @@ -0,0 +1,5 @@ +openai +pydantic +instructor +nltk +rich \ No newline at end of file diff --git a/examples/chain-of-density/run.py b/examples/chain-of-density/run.py deleted file mode 100644 index 0dfbf801e..000000000 --- a/examples/chain-of-density/run.py +++ /dev/null @@ -1,230 +0,0 @@ -import instructor -from openai import OpenAI - -from pydantic import BaseModel, Field - -from pprint import pprint -from typing import List - -client = instructor.patch(OpenAI()) - - -class Summary(BaseModel): - """Represents a summary entry in the list. - - Guidelines: - - The first summary should be long (4-5 sentences, ~80 words) yet highly non-specific, - containing little information beyond the entities marked as missing. Use overly verbose - language and fillers (e.g., "this article discusses") to reach ~80 words. - - Make every word count: rewrite the previous summary to improve flow and make space for - additional entities. - - Make space with fusion, compression, and removal of uninformative phrases like "the article discusses." - - The summaries should become highly dense and concise yet self-contained, i.e., easily understood - without the article. - - Missing entities can appear anywhere in the new summary. - - Never drop entities from the previous summary. 
If space cannot be made, add fewer new entities. - """ - - index: int = Field(..., description="Index of the summary in the chain.") - denser_summary: str = Field(..., description="Concise yet self-contained summary.") - included_entities: List[str] = Field( - ..., description="Correct list of Entities found in the summary." - ) - missing_entities: List[str] = Field( - ..., - description="Correct list of Entities found absent from the summary that should be included in the next summary attempt.", - ) - - -# This multitask helper will be used to generate a chain of summaries. -# Allows us to extract data via streaming to see resuls faster -ChainOfDenseSummaries = instructor.MultiTask( - Summary, - name="chain-of-dense-summaries", - description=""" - Repeat the following 2 steps 5 times. - - Step 1. Identify 1-3 informative entities (";" delimited) from the article which are missing from the previously generated summary. - - Step 2. Write a new, denser summary of identical length which covers every entity and detail from the previous summary plus the missing entities. - - A missing entity is: - - - relevant to the main story, - - specific yet concise (5 words or fewer), - - novel (not in the previous summary), - - faithful (present in the article), - - anywhere (can be located anywhere in the article). - - Remember, use the exact same number of words for each summary.""", -) - - -def summarize_article(article: str, n_summaries: int = 5, stream: bool = True): - completion = client.chat.completions.create( - model="gpt-3.5-turbo-16k", - stream=stream, - messages=[ - { - "role": "system", - "content": """Summarize the following article with {n_summary} chain of summaries with increasing density:""", - }, - {"role": "user", "content": article}, - ], - functions=[ChainOfDenseSummaries.openai_schema], - function_call={"name": ChainOfDenseSummaries.openai_schema["name"]}, - ) - if stream: - return ChainOfDenseSummaries.from_streaming_response(completion) - return ChainOfDenseSummaries.from_response(completion) - - -if __name__ == "__main__": - example = { - "text": "The people of the State of California do enact as follows:\n\n\nSECTION 1.\nSection 10295.35 is added to the Public Contract Code, to read:\n10295.35.\n(a) (1) Notwithstanding any other law, a state agency shall not enter into any contract for the acquisition of goods or services in the amount of one hundred thousand dollars ($100,000) or more with a contractor that, in the provision of benefits, discriminates between employees on the basis of an employee’s or dependent’s actual or perceived gender identity, including, but not limited to, the employee’s or dependent’s identification as transgender.\n(2) For purposes of this section, “contract” includes contracts with a cumulative amount of one hundred thousand dollars ($100,000) or more per contractor in each fiscal year.\n(3) For purposes of this section, an employee health plan is discriminatory if the plan is not consistent with Section 1365.5 of the Health and Safety Code and Section 10140 of the Insurance Code.\n(4) The requirements of this section shall apply only to those portions of a contractor’s operations that occur under any of the following conditions:\n(A) Within the state.\n(B) On real property outside the state if the property is owned by the state or if the state has a right to occupy the property, and if the contractor’s presence at that location is connected to a contract with the state.\n(C) Elsewhere in the United States where work related to a state contract is 
being performed.\n(b) Contractors shall treat as confidential, to the maximum extent allowed by law or by the requirement of the contractor’s insurance provider, any request by an employee or applicant for employment benefits or any documentation of eligibility for benefits submitted by an employee or applicant for employment.\n(c) After taking all reasonable measures to find a contractor that complies with this section, as determined by the state agency, the requirements of this section may be waived under any of the following circumstances:\n(1) There is only one prospective contractor willing to enter into a specific contract with the state agency.\n(2) The contract is necessary to respond to an emergency, as determined by the state agency, that endangers the public health, welfare, or safety, or the contract is necessary for the provision of essential services, and no entity that complies with the requirements of this section capable of responding to the emergency is immediately available.\n(3) The requirements of this section violate, or are inconsistent with, the terms or conditions of a grant, subvention, or agreement, if the agency has made a good faith attempt to change the terms or conditions of any grant, subvention, or agreement to authorize application of this section.\n(4) The contractor is providing wholesale or bulk water, power, or natural gas, the conveyance or transmission of the same, or ancillary services, as required for ensuring reliable services in accordance with good utility practice, if the purchase of the same cannot practically be accomplished through the standard competitive bidding procedures and the contractor is not providing direct retail services to end users.\n(d) (1) A contractor shall not be deemed to discriminate in the provision of benefits if the contractor, in providing the benefits, pays the actual costs incurred in obtaining the benefit.\n(2) If a contractor is unable to provide a certain benefit, despite taking reasonable measures to do so, the contractor shall not be deemed to discriminate in the provision of benefits.\n(e) (1) Every contract subject to this chapter shall contain a statement by which the contractor certifies that the contractor is in compliance with this section.\n(2) The department or other contracting agency shall enforce this section pursuant to its existing enforcement powers.\n(3) (A) If a contractor falsely certifies that it is in compliance with this section, the contract with that contractor shall be subject to Article 9 (commencing with Section 10420), unless, within a time period specified by the department or other contracting agency, the contractor provides to the department or agency proof that it has complied, or is in the process of complying, with this section.\n(B) The application of the remedies or penalties contained in Article 9 (commencing with Section 10420) to a contract subject to this chapter shall not preclude the application of any existing remedies otherwise available to the department or other contracting agency under its existing enforcement powers.\n(f) Nothing in this section is intended to regulate the contracting practices of any local jurisdiction.\n(g) This section shall be construed so as not to conflict with applicable federal laws, rules, or regulations. 
In the event that a court or agency of competent jurisdiction holds that federal law, rule, or regulation invalidates any clause, sentence, paragraph, or section of this code or the application thereof to any person or circumstances, it is the intent of the state that the court or agency sever that clause, sentence, paragraph, or section so that the remainder of this section shall remain in effect.\nSEC. 2.\nSection 10295.35 of the Public Contract Code shall not be construed to create any new enforcement authority or responsibility in the Department of General Services or any other contracting agency.\nSEC. 3.\nNo reimbursement is required by this act pursuant to Section 6 of Article XIII\u2009B of the California Constitution because the only costs that may be incurred by a local agency or school district will be incurred because this act creates a new crime or infraction, eliminates a crime or infraction, or changes the penalty for a crime or infraction, within the meaning of Section 17556 of the Government Code, or changes the definition of a crime within the meaning of Section 6 of Article XIII\u2009B of the California Constitution.", - } - - # Generate a chain of summaries, however we can also stream the results - # to see the results faster - for summary in summarize_article(example["text"]): - pprint(summary.model_dump()) - - """ - {'denser_summary': 'State agencies in California cannot enter into contracts ' - 'worth $100,000 or more with contractors that discriminate ' - 'in benefits based on gender identity. The requirement ' - 'applies to contractors operating within the state, on ' - 'state-owned or occupied property outside the state, and ' - 'elsewhere in the United States where work related to a ' - 'state contract is being performed. Contractors must treat ' - 'employee benefit requests and eligibility documentation as ' - 'confidential. Exceptions to the requirement can be made in ' - 'certain circumstances. Contractors can avoid being seen as ' - 'discriminatory if they pay the actual costs of benefits or ' - 'if they are unable to provide certain benefits despite ' - 'reasonable efforts. Contracts must include a certification ' - 'of compliance with the requirement.', - 'included_entities': ['California', - 'contracts', - 'discrimination', - 'benefits', - 'gender identity', - 'state agencies', - 'state-owned property', - 'confidential', - 'exceptions'], - 'index': 0, - 'missing_entities': []} - {'denser_summary': 'State agencies in California cannot enter into contracts ' - 'worth $100,000 or more with contractors that discriminate ' - 'in benefits based on gender identity. The requirement ' - 'applies to contractors operating within the state, on ' - 'state-owned or occupied property outside the state, and ' - 'elsewhere in the United States where work related to a ' - 'state contract is being performed. Contractors must treat ' - 'employee benefit requests and eligibility documentation as ' - 'confidential. Exceptions to the requirement can be made in ' - 'certain circumstances, such as when there is only one ' - 'prospective contractor available or when the contract is ' - 'necessary to respond to an emergency. Contractors can ' - 'avoid being seen as discriminatory if they pay the actual ' - 'costs of benefits or if they are unable to provide certain ' - 'benefits despite reasonable efforts. 
Contracts must ' - 'include a certification of compliance with the ' - 'requirement, and false certification can result in ' - 'penalties.', - 'included_entities': ['California', - 'contracts', - 'discrimination', - 'benefits', - 'gender identity', - 'state agencies', - 'state-owned property', - 'confidential', - 'exceptions', - 'prospective contractor', - 'emergency', - 'actual costs', - 'penalties'], - 'index': 1, - 'missing_entities': ['availability', 'false certification']} - {'denser_summary': 'State agencies in California are prohibited from entering ' - 'into contracts worth $100,000 or more with contractors ' - 'that discriminate in benefits based on gender identity. ' - 'This requirement applies to contractors operating within ' - 'the state, on state-owned or occupied property outside the ' - 'state, and elsewhere in the United States where work ' - 'related to a state contract is being performed. ' - 'Contractors must keep employee benefit requests and ' - 'eligibility documentation confidential. There are ' - 'exceptions to this requirement, such as when there is only ' - 'one available contractor or when an emergency situation ' - 'requires immediate contracting. Contractors can avoid ' - 'being seen as discriminatory by paying the actual costs of ' - 'benefits or if they are unable to provide certain benefits ' - 'despite reasonable efforts. Contracts must include a ' - 'certification of compliance with this requirement, and ' - 'false certification can lead to penalties and the ' - 'application of other existing remedies.', - 'included_entities': ['California', - 'contracts', - 'discrimination', - 'benefits', - 'gender identity', - 'state agencies', - 'state-owned property', - 'confidential', - 'exceptions', - 'contractors', - 'availability', - 'emergency', - 'actual costs', - 'false certification', - 'penalties'], - 'index': 2, - 'missing_entities': ['contracting practices', 'federal laws']} - {'denser_summary': 'State agencies in California are prohibited from entering ' - 'into contracts worth $100,000 or more with contractors ' - 'that discriminate in benefits based on gender identity. ' - 'This requirement applies to contractors operating within ' - 'the state, on state-owned or occupied property outside the ' - 'state, and elsewhere in the United States where work ' - 'related to a state contract is being performed. ' - 'Contractors must keep employee benefit requests and ' - 'eligibility documentation confidential. There are ' - 'exceptions to this requirement, such as when there is only ' - 'one available contractor or when an emergency situation ' - 'requires immediate contracting. Contractors can avoid ' - 'being seen as discriminatory by paying the actual costs of ' - 'benefits or if they are unable to provide certain benefits ' - 'despite reasonable efforts. Contracts must include a ' - 'certification of compliance with this requirement, and ' - 'false certification can lead to penalties and the ' - 'application of other existing remedies. 
This section of ' - 'the Public Contract Code does not regulate the contracting ' - 'practices of local jurisdictions, and it is intended to be ' - 'consistent with applicable federal laws, rules, and ' - 'regulations.', - 'included_entities': ['California', - 'contracts', - 'discrimination', - 'benefits', - 'gender identity', - 'state agencies', - 'state-owned property', - 'confidential', - 'exceptions', - 'contractors', - 'availability', - 'emergency', - 'actual costs', - 'false certification', - 'penalties', - 'Public Contract Code', - 'local jurisdictions', - 'federal laws', - 'federal rules', - 'federal regulations'], - 'index': 3, - 'missing_entities': []} - """ diff --git a/mkdocs.yml b/mkdocs.yml index b6de72ec6..036bfe8ed 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -12,6 +12,20 @@ theme: repo: fontawesome/brands/github edit: material/pencil view: material/eye + theme: + admonition: + note: octicons/tag-16 + abstract: octicons/checklist-16 + info: octicons/info-16 + tip: octicons/squirrel-16 + success: octicons/check-16 + question: octicons/question-16 + warning: octicons/alert-16 + failure: octicons/x-circle-16 + danger: octicons/zap-16 + bug: octicons/bug-16 + example: octicons/beaker-16 + quote: octicons/quote-16 features: - announce.dismiss - content.action.edit @@ -59,6 +73,7 @@ theme: markdown_extensions: - abbr - admonition + - pymdownx.details - attr_list - def_list - footnotes