example: Table extraction w/ Pydantic (#288)
jxnl authored Dec 18, 2023
1 parent 098f43a commit 8510af2
Showing 4 changed files with 252 additions and 1 deletion.
129 changes: 129 additions & 0 deletions docs/examples/extracting_tables.md
@@ -0,0 +1,129 @@
This guide shows how to use type annotations to extract markdown tables from images with OpenAI's vision model. It includes the necessary code snippets, explanations, and a practical example.

## Introduction

This post demonstrates how to use Python's type annotations and OpenAI's new vision model to extract tables from images and convert them into markdown format. This method is particularly useful for data analysis and automation tasks.

## Building the Custom Type for Markdown Tables

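First, we define a custom type, `MarkdownDataFrame`, for pandas DataFrames that are exchanged as markdown. It combines Python's `Annotated` with Pydantic's `InstanceOf`, `BeforeValidator`, `PlainSerializer`, and `WithJsonSchema`. These are annotation metadata rather than decorators: the validator parses an incoming markdown string into a DataFrame, the serializer renders the DataFrame back to markdown, and the JSON schema tells the model to return the table as a string.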

```python
from io import StringIO
from typing import Annotated, Any

import pandas as pd
from pydantic import (
    BaseModel,
    BeforeValidator,
    InstanceOf,
    PlainSerializer,
    WithJsonSchema,
)


def md_to_df(data: Any) -> Any:
    # Convert a markdown table string into a DataFrame; pass other values through
    if isinstance(data, str):
        return (
            pd.read_csv(
                StringIO(data),
                sep="|",
                index_col=1,  # column 0 is the empty field before the leading pipe
            )
            .dropna(axis=1, how="all")  # drop the empty columns created by the outer pipes
            .iloc[1:]  # drop the |---|---| separator row
            .applymap(lambda x: x.strip())  # trim cell padding
        )
    return data


MarkdownDataFrame = Annotated[
    InstanceOf[pd.DataFrame],
    BeforeValidator(md_to_df),
    PlainSerializer(lambda df: df.to_markdown()),
    WithJsonSchema(
        {
            "type": "string",
            "description": "The markdown representation of the table, each one should be tidy, do not try to join tables that should be separate",
        }
    ),
]
```
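To make the parsing steps concrete, here is a minimal sketch that performs the same kind of transformation using only the standard library. It is a stand-in for illustration, not the pandas-based `md_to_df` above: the `parse_markdown_table` helper and the sample table are hypothetical, and the sketch assumes a simple table with outer pipes and no escaped pipes inside cells.

```python
from typing import Dict, List


def parse_markdown_table(markdown: str) -> List[Dict[str, str]]:
    """Parse a simple pipe-delimited markdown table into a list of row dicts."""
    lines = [line.strip() for line in markdown.strip().splitlines() if line.strip()]
    # Split each line on pipes after removing the outer pipes, then trim cell padding.
    rows = [[cell.strip() for cell in line.strip("|").split("|")] for line in lines]
    header, body = rows[0], rows[2:]  # rows[1] is the |---|---| separator row
    return [dict(zip(header, row)) for row in body]


# Hypothetical sample input, modelled on the tables shown later in this guide.
sample = """
| Rank | App Name   | Category      |
|------|------------|---------------|
| 1    | Google One | Productivity  |
| 2    | Disney+    | Entertainment |
"""

records = parse_markdown_table(sample)
print(records[0])
```

The same three steps appear in `md_to_df`: split on pipes, discard the separator row, and strip whitespace from each cell.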

## Defining the Table Class

The `Table` class is essential for organizing the extracted data. It includes a caption and a dataframe, processed as a markdown table. Since most of the complexity is handled by the `MarkdownDataFrame` type, the `Table` class is straightforward!

```python
class Table(BaseModel):
    caption: str
    dataframe: MarkdownDataFrame
```

## Extracting Tables from Images

The `extract_table` function uses OpenAI's vision model to process an image URL and extract tables in markdown format. We utilize the `instructor` library to patch the OpenAI client for this purpose.

```python
from typing import Iterable

import instructor
from openai import OpenAI

# Patch the OpenAI client so that it accepts response_model.
# Use MD_JSON mode because the vision model does not support any special structured output mode.
client = instructor.patch(OpenAI(), mode=instructor.function_calls.Mode.MD_JSON)


def extract_table(url: str) -> Iterable[Table]:
    return client.chat.completions.create(
        model="gpt-4-vision-preview",
        response_model=Iterable[Table],
        max_tokens=1800,
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Extract table from image."},
                    {"type": "image_url", "image_url": {"url": url}},
                ],
            }
        ],
    )
```
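As a rough illustration of the idea behind a markdown-JSON mode, the sketch below pulls a fenced `json` block out of a free-text reply and parses it. This is an assumption about the general approach, not instructor's actual implementation; the `extract_json_block` helper and the sample reply are made up for this example.

```python
import json
import re

# Built programmatically so the fence characters do not end this code block.
FENCE = "`" * 3


def extract_json_block(reply: str):
    """Return the parsed contents of the first fenced json block in a reply."""
    pattern = re.escape(FENCE) + r"json\s*(.*?)" + re.escape(FENCE)
    match = re.search(pattern, reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("no fenced json block found in the reply")
    return json.loads(match.group(1))


# Hypothetical model reply, made up for illustration.
reply = (
    "Here is the table:\n"
    + FENCE + "json\n"
    + '{"caption": "Top apps", "dataframe": "| Rank | App |\\n|---|---|\\n| 1 | Google One |"}\n'
    + FENCE
)

payload = extract_json_block(reply)
print(payload["caption"])
```

Once the JSON is parsed, the `dataframe` string would go through the `BeforeValidator` on `MarkdownDataFrame`, which is where the markdown becomes a DataFrame.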

## Practical Example

In this example, we apply the method to extract data from an image showing the top grossing apps in Ireland for October 2023.

```python
url = "https://a.storyblok.com/f/47007/2400x2000/bf383abc3c/231031_uk-ireland-in-three-charts_table_v01_b.png"
tables = extract_table(url)
for table in tables:
    print(table.caption, end="\n")
    print(table.dataframe)
```

??? Note "Expand to see the output"

![Top 10 Grossing Apps in October 2023 for Ireland](https://a.storyblok.com/f/47007/2400x2000/bf383abc3c/231031_uk-ireland-in-three-charts_table_v01_b.png)

### Top 10 Grossing Apps in October 2023 (Ireland) for Android Platforms

| Rank | App Name | Category |
|------|----------------------------------|--------------------|
| 1 | Google One | Productivity |
| 2 | Disney+ | Entertainment |
| 3 | TikTok - Videos, Music & LIVE | Entertainment |
| 4 | Candy Crush Saga | Games |
| 5 | Tinder: Dating, Chat & Friends | Social networking |
| 6 | Coin Master | Games |
| 7 | Roblox | Games |
| 8 | Bumble - Dating & Make Friends | Dating |
| 9 | Royal Match | Games |
| 10 | Spotify: Music and Podcasts | Music & Audio |

### Top 10 Grossing Apps in October 2023 (Ireland) for iOS Platforms

| Rank | App Name | Category |
|------|----------------------------------|--------------------|
| 1 | Tinder: Dating, Chat & Friends | Social networking |
| 2 | Disney+ | Entertainment |
| 3 | YouTube: Watch, Listen, Stream | Entertainment |
| 4 | Audible: Audio Entertainment | Entertainment |
| 5 | Candy Crush Saga | Games |
| 6 | TikTok - Videos, Music & LIVE | Entertainment |
| 7 | Bumble - Dating & Make Friends | Dating |
| 8 | Roblox | Games |
| 9 | LinkedIn: Job Search & News | Business |
| 10 | Duolingo - Language Lessons | Education |
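
The example above passes a public image URL. If the image is only available locally, one option is to send it as a base64 data URL in the same `image_url` field, which the OpenAI vision API has accepted to the best of our knowledge. The sketch below is an assumption-laden helper, not part of this commit: `image_bytes_to_data_url` and the `table.png` filename are hypothetical.

```python
import base64
import mimetypes


def image_bytes_to_data_url(data: bytes, filename: str) -> str:
    """Encode raw image bytes as a base64 data URL, guessing the MIME type from the filename."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None:
        # Fall back to a generic type when the extension is not recognised.
        mime_type = "application/octet-stream"
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# The eight-byte PNG file signature, used here as stand-in image data.
png_signature = b"\x89PNG\r\n\x1a\n"
data_url = image_bytes_to_data_url(png_signature, "table.png")
print(data_url)
```

With a real file, the bytes would come from reading the image in binary mode, and the resulting string would be passed as the `url` argument of `extract_table`.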
3 changes: 2 additions & 1 deletion docs/examples/index.md
@@ -15,5 +15,6 @@
11. [How is Personally Identifiable Information sanitized from documents?](pii.md)
12. [How are action items and dependencies generated from transcripts?](action_items.md)
13. [How to enable OpenAI's moderation](moderation.md)
14. [How to extract tables from images](extracting_tables.md)

Explore more!
Explore more!
119 changes: 119 additions & 0 deletions examples/vision/run_table.py
@@ -0,0 +1,119 @@
from io import StringIO
from typing import Annotated, Any, Iterable
from openai import OpenAI
from pydantic import (
    BaseModel,
    BeforeValidator,
    PlainSerializer,
    InstanceOf,
    WithJsonSchema,
)
import pandas as pd
import instructor


client = instructor.patch(OpenAI(), mode=instructor.function_calls.Mode.MD_JSON)


def to_markdown(df: pd.DataFrame) -> str:
    return df.to_markdown()


def md_to_df(data: Any) -> Any:
    if isinstance(data, str):
        return (
            pd.read_csv(
                StringIO(data),  # Get rid of whitespaces
                sep="|",
                index_col=1,
            )
            .dropna(axis=1, how="all")
            .iloc[1:]
            .map(lambda x: x.strip())
        )
    return data


MarkdownDataFrame = Annotated[
    InstanceOf[pd.DataFrame],
    BeforeValidator(md_to_df),
    PlainSerializer(to_markdown),
    WithJsonSchema(
        {
            "type": "string",
            "description": """
                The markdown representation of the table,
                each one should be tidy, do not try to join tables
                that should be separate""",
        }
    ),
]


class Table(BaseModel):
    caption: str
    dataframe: MarkdownDataFrame


def extract_table(url: str) -> Iterable[Table]:
    return client.chat.completions.create(
        model="gpt-4-vision-preview",
        response_model=Iterable[Table],
        max_tokens=1800,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": """Extract the table from the image, and describe it.
                        Each table should be tidy, do not try to join tables that
                        should be separately described.""",
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": url},
                    },
                ],
            }
        ],
    )


if __name__ == "__main__":
    url = "https://a.storyblok.com/f/47007/2400x2000/bf383abc3c/231031_uk-ireland-in-three-charts_table_v01_b.png"
    tables = extract_table(url)
    for tbl in tables:
        print(tbl.caption, end="\n")
        print(tbl.dataframe)
    """
Top 10 grossing apps in October 2023 (Ireland) for Android platforms, listing the rank, app name, and category.
App Name Category
Rank
1 Google One Productivity
2 Disney+ Entertainment
3 TikTok - Videos, Music & LIVE Entertainment
4 Candy Crush Saga Games
5 Tinder: Dating, Chat & Friends Social networking
6 Coin Master Games
7 Roblox Games
8 Bumble - Dating & Make Friends Dating
9 Royal Match Games
10 Spotify: Music and Podcasts Music & Audio
Top 10 grossing apps in October 2023 (Ireland) for iOS platforms, listing the rank, app name, and category.
App Name Category
Rank
1 Tinder: Dating, Chat & Friends Social networking
2 Disney+ Entertainment
3 YouTube: Watch, Listen, Stream Entertainment
4 Audible: Audio Entertainment Entertainment
5 Candy Crush Saga Games
6 TikTok - Videos, Music & LIVE Entertainment
7 Bumble - Dating & Make Friends Dating
8 Roblox Games
9 LinkedIn: Job Search & News Business
10 Duolingo - Language Lessons Education
"""
2 changes: 2 additions & 0 deletions mkdocs.yml
@@ -146,6 +146,8 @@ nav:
- Overview: 'examples/index.md'
- Text Classification: 'examples/classification.md'
- Self Critique: 'examples/self_critique.md'
- Image Extracting Tables: 'examples/extracting_tables.md'
- Moderation: 'examples/moderation.md'
- Citations: 'examples/exact_citations.md'
- Knowledge Graph: 'examples/knowledge_graph.md'
- Entity Resolution: 'examples/entity_resolution.md'