example: Table extraction w/ Pydantic (#288)
Showing 4 changed files with 252 additions and 1 deletion.
@@ -0,0 +1,129 @@
This guide shows how to use type annotations to extract markdown tables from images with OpenAI's new vision model. It includes the code snippets, explanations, and a practical example.

## Introduction

This post demonstrates how to use Python's type annotations and OpenAI's new vision model to extract tables from images and convert them into markdown format. This method is particularly useful for data analysis and automation tasks.

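## Building the Custom Type for Markdown Tables

First, we define a custom type, `MarkdownDataFrame`, to handle pandas DataFrames formatted in markdown. The type is built with Python's `Annotated` and Pydantic's `InstanceOf`, plus the annotation markers `BeforeValidator`, `PlainSerializer`, and `WithJsonSchema`. The validator parses a markdown string into a DataFrame, the serializer writes the DataFrame back out as markdown, and the JSON schema advertises the field to the model as a markdown string.
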
```python
from io import StringIO
from typing import Annotated, Any
from pydantic import BaseModel, Field, BeforeValidator, PlainSerializer, InstanceOf, WithJsonSchema
import pandas as pd


def md_to_df(data: Any) -> Any:
    # Convert a markdown table string to a DataFrame; pass other inputs through
    if isinstance(data, str):
        return (
            pd.read_csv(
                StringIO(data),  # Read the markdown text as a "|"-separated file
                sep="|",
                index_col=1,  # Column 0 is the empty text before the leading "|"
            )
            .dropna(axis=1, how="all")  # Drop the empty columns left by the outer pipes
            .iloc[1:]  # Drop the |---|---| separator row
            .applymap(lambda x: x.strip())  # Trim cell padding (use .map on pandas >= 2.1)
        )
    return data


MarkdownDataFrame = Annotated[
    InstanceOf[pd.DataFrame],
    BeforeValidator(md_to_df),
    PlainSerializer(lambda df: df.to_markdown()),
    WithJsonSchema(
        {
            "type": "string",
            "description": "The markdown representation of the table, each one should be tidy, do not try to join tables that should be separate",
        }
    ),
]
```
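
To make the parsing step in `md_to_df` concrete without pandas, the same idea can be sketched with the standard library: split each line on `|`, skip the separator row, and strip the padding from every cell. `parse_markdown_table` and the sample table are hypothetical names and data used only for illustration, and the sketch assumes a simple table with no empty edge cells or escaped pipes.

```python
def parse_markdown_table(md: str) -> list[dict[str, str]]:
    """Parse a simple pipe-delimited markdown table into a list of row dicts."""
    lines = [line.strip() for line in md.strip().splitlines() if line.strip()]
    # Remove the outer pipes, split on the inner ones, and trim each cell
    rows = [[cell.strip() for cell in line.strip("|").split("|")] for line in lines]
    header, body = rows[0], rows[2:]  # rows[1] is the |---|---| separator
    return [dict(zip(header, row)) for row in body]


sample = """
| Rank | App Name | Category |
|------|----------|----------|
| 1    | Disney+  | Entertainment |
| 2    | Roblox   | Games |
"""

records = parse_markdown_table(sample)
print(records[0])
# {'Rank': '1', 'App Name': 'Disney+', 'Category': 'Entertainment'}
```

The pandas version does the same work in bulk: `sep="|"` performs the split, `.iloc[1:]` discards the separator row, and the final strip trims the cells.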

## Defining the Table Class

The `Table` class holds one extracted table: a caption and a dataframe, which is validated from and serialized to a markdown table. Since most of the complexity is handled by the `MarkdownDataFrame` type, the `Table` class is straightforward!

```python
class Table(BaseModel):
    caption: str
    dataframe: MarkdownDataFrame
```
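
When a `Table` is dumped, `PlainSerializer` runs in the other direction and writes the DataFrame back out with `DataFrame.to_markdown()`, which relies on the optional `tabulate` package. As a rough picture of the shape of that output, here is a stdlib-only sketch. `rows_to_markdown` is a hypothetical helper, and pandas' real output pads and aligns the columns differently.

```python
def rows_to_markdown(header: list[str], rows: list[list[str]]) -> str:
    """Render rows as a pipe-delimited markdown table (no column alignment)."""
    lines = [
        "| " + " | ".join(header) + " |",
        "|" + "|".join("---" for _ in header) + "|",  # separator row
    ]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)


md = rows_to_markdown(["Rank", "App Name"], [["1", "Disney+"], ["2", "Roblox"]])
print(md)
```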

## Extracting Tables from Images

The `extract_table` function uses OpenAI's vision model to process an image URL and extract tables in markdown format. We use the `instructor` library to patch the OpenAI client so that `chat.completions.create` accepts a `response_model`.

```python
from typing import Iterable

import instructor
from openai import OpenAI

# Apply the patch to the OpenAI client to support response_model
# Also use MD_JSON mode since the vision model does not support any special structured output mode
client = instructor.patch(OpenAI(), mode=instructor.function_calls.Mode.MD_JSON)


def extract_table(url: str) -> Iterable[Table]:
    return client.chat.completions.create(
        model="gpt-4-vision-preview",
        response_model=Iterable[Table],
        max_tokens=1800,
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Extract table from image."},
                    {"type": "image_url", "image_url": {"url": url}},
                ],
            }
        ],
    )
```
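
`MD_JSON` mode exists because the vision model has no dedicated structured-output mode: roughly, the model is asked to answer with JSON inside a markdown code fence, and that JSON is then parsed and validated against `response_model`. The sketch below illustrates the idea with the standard library only. `extract_json_block` and the sample reply are hypothetical, and this is not `instructor`'s actual implementation.

```python
import json
import re

FENCE = "`" * 3  # built dynamically so this example can sit inside a markdown fence


def extract_json_block(reply: str):
    """Return the parsed JSON from the first fenced block in a model reply."""
    pattern = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)
    match = pattern.search(reply)
    if match is None:
        raise ValueError("no fenced JSON block found in reply")
    return json.loads(match.group(1))


# Hypothetical model reply, shaped like a list of Table objects
reply = (
    "Here are the tables:\n"
    + FENCE + "json\n"
    + '[{"caption": "Top apps", "dataframe": "| Rank | App |\\n|---|---|\\n| 1 | Roblox |"}]\n'
    + FENCE
)

parsed = extract_json_block(reply)
print(parsed[0]["caption"])
```

In the real pipeline, each parsed item would then go through `Table` validation, where `md_to_df` turns the `dataframe` string into a DataFrame.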

## Practical Example

In this example, we apply the method to extract data from an image showing the top grossing apps in Ireland for October 2023.

```python
url = "https://a.storyblok.com/f/47007/2400x2000/bf383abc3c/231031_uk-ireland-in-three-charts_table_v01_b.png"
tables = extract_table(url)
for table in tables:
    print(table.caption, end="\n")
    print(table.dataframe)
```

??? Note "Expand to see the output"

    ![Top 10 Grossing Apps in October 2023 for Ireland](https://a.storyblok.com/f/47007/2400x2000/bf383abc3c/231031_uk-ireland-in-three-charts_table_v01_b.png)

    ### Top 10 Grossing Apps in October 2023 (Ireland) for Android Platforms

    | Rank | App Name                         | Category           |
    |------|----------------------------------|--------------------|
    | 1    | Google One                       | Productivity       |
    | 2    | Disney+                          | Entertainment      |
    | 3    | TikTok - Videos, Music & LIVE    | Entertainment      |
    | 4    | Candy Crush Saga                 | Games              |
    | 5    | Tinder: Dating, Chat & Friends   | Social networking  |
    | 6    | Coin Master                      | Games              |
    | 7    | Roblox                           | Games              |
    | 8    | Bumble - Dating & Make Friends   | Dating             |
    | 9    | Royal Match                      | Games              |
    | 10   | Spotify: Music and Podcasts      | Music & Audio      |

    ### Top 10 Grossing Apps in October 2023 (Ireland) for iOS Platforms

    | Rank | App Name                         | Category           |
    |------|----------------------------------|--------------------|
    | 1    | Tinder: Dating, Chat & Friends   | Social networking  |
    | 2    | Disney+                          | Entertainment      |
    | 3    | YouTube: Watch, Listen, Stream   | Entertainment      |
    | 4    | Audible: Audio Entertainment     | Entertainment      |
    | 5    | Candy Crush Saga                 | Games              |
    | 6    | TikTok - Videos, Music & LIVE    | Entertainment      |
    | 7    | Bumble - Dating & Make Friends   | Dating             |
    | 8    | Roblox                           | Games              |
    | 9    | LinkedIn: Job Search & News      | Business           |
    | 10   | Duolingo - Language Lessons      | Education          |
@@ -0,0 +1,119 @@
from io import StringIO
from typing import Annotated, Any, Iterable
from openai import OpenAI
from pydantic import (
    BaseModel,
    BeforeValidator,
    PlainSerializer,
    InstanceOf,
    WithJsonSchema,
)
import pandas as pd
import instructor


client = instructor.patch(OpenAI(), mode=instructor.function_calls.Mode.MD_JSON)


def to_markdown(df: pd.DataFrame) -> str:
    return df.to_markdown()


def md_to_df(data: Any) -> Any:
    if isinstance(data, str):
        return (
            pd.read_csv(
                StringIO(data),
                sep="|",
                index_col=1,
            )
            .dropna(axis=1, how="all")
            .iloc[1:]
            .map(lambda x: x.strip())  # Get rid of surrounding whitespace
        )
    return data


MarkdownDataFrame = Annotated[
    InstanceOf[pd.DataFrame],
    BeforeValidator(md_to_df),
    PlainSerializer(to_markdown),
    WithJsonSchema(
        {
            "type": "string",
            "description": """
                The markdown representation of the table,
                each one should be tidy, do not try to join tables
                that should be separate""",
        }
    ),
]


class Table(BaseModel):
    caption: str
    dataframe: MarkdownDataFrame


def extract_table(url: str) -> Iterable[Table]:
    return client.chat.completions.create(
        model="gpt-4-vision-preview",
        response_model=Iterable[Table],
        max_tokens=1800,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": """Extract the table from the image, and describe it.
                        Each table should be tidy, do not try to join tables that
                        should be separately described.""",
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": url},
                    },
                ],
            }
        ],
    )


if __name__ == "__main__":
    url = "https://a.storyblok.com/f/47007/2400x2000/bf383abc3c/231031_uk-ireland-in-three-charts_table_v01_b.png"
    tables = extract_table(url)
    for tbl in tables:
        print(tbl.caption, end="\n")
        print(tbl.dataframe)
    """
    Top 10 grossing apps in October 2023 (Ireland) for Android platforms, listing the rank, app name, and category.
    App Name Category
    Rank
    1 Google One Productivity
    2 Disney+ Entertainment
    3 TikTok - Videos, Music & LIVE Entertainment
    4 Candy Crush Saga Games
    5 Tinder: Dating, Chat & Friends Social networking
    6 Coin Master Games
    7 Roblox Games
    8 Bumble - Dating & Make Friends Dating
    9 Royal Match Games
    10 Spotify: Music and Podcasts Music & Audio
    Top 10 grossing apps in October 2023 (Ireland) for iOS platforms, listing the rank, app name, and category.
    App Name Category
    Rank
    1 Tinder: Dating, Chat & Friends Social networking
    2 Disney+ Entertainment
    3 YouTube: Watch, Listen, Stream Entertainment
    4 Audible: Audio Entertainment Entertainment
    5 Candy Crush Saga Games
    6 TikTok - Videos, Music & LIVE Entertainment
    7 Bumble - Dating & Make Friends Dating
    8 Roblox Games
    9 LinkedIn: Job Search & News Business
    10 Duolingo - Language Lessons Education
    """