example: Table extraction w/ Pydantic #288
Walkthrough
The recent updates introduce a guide and tools for extracting markdown tables from images using OpenAI's new vision model.
Top 10 Grossing Apps in October 2023 (Ireland) for Android Platforms
Top 10 Grossing Apps in October 2023 (Ireland) for iOS Platforms
Review Status
Actionable comments generated: 6
Configuration used: CodeRabbit UI
Files ignored due to filter (1)
- mkdocs.yml
Files selected for processing (3)
- docs/examples/extracting_tables.md (1 hunks)
- docs/examples/index.md (1 hunks)
- examples/vision/run_table.py (1 hunks)
Files skipped from review due to trivial changes (1)
- docs/examples/index.md
Additional comments: 1
examples/vision/run_table.py (1)
- 37-50: Verify the usage of `InstanceOf`, `BeforeValidator`, and `PlainSerializer`, as they are not standard Pydantic validators. Ensure that these are custom extensions that are correctly implemented and used.
def md_to_df(data: Any) -> Any:
    # Convert markdown to DataFrame
    if isinstance(data, str):
        return (
            pd.read_csv(
                StringIO(data),  # Process data
                sep="|",
                index_col=1,
            )
            .dropna(axis=1, how="all")
            .iloc[1:]
            .applymap(lambda x: x.strip())
        )
    return data
Consider adding error handling for the `md_to_df` function to manage cases where the markdown data is not in the expected format or `pd.read_csv` throws an exception.
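One way this suggestion could look is sketched below. The function name `md_to_df_safe` and the choice to re-raise as `ValueError` are assumptions made for illustration, not part of the PR; the parsing steps mirror the hunk above, and cells are stripped column by column so the sketch does not depend on a particular pandas version.

```python
from io import StringIO
from typing import Any

import pandas as pd


def md_to_df_safe(data: Any) -> Any:
    """Hypothetical variant of md_to_df that fails with a clear message."""
    if not isinstance(data, str):
        return data  # not a markdown string: pass through unchanged
    try:
        df = (
            pd.read_csv(StringIO(data), sep="|", index_col=1)
            .dropna(axis=1, how="all")  # drop the empty edge columns
            .iloc[1:]  # drop the |---|---| separator row
        )
    except (ValueError, IndexError) as exc:
        # pandas' ParserError and EmptyDataError are ValueError subclasses
        raise ValueError(f"Could not parse markdown table: {exc}") from exc
    if df.empty:
        raise ValueError("Markdown table has no data rows")
    # Strip padding from string cells; leave any other values untouched.
    return df.apply(
        lambda col: col.map(lambda x: x.strip() if isinstance(x, str) else x)
    )


SAMPLE_MD = "\n".join(
    [
        "| Rank | App Name | Category |",
        "|------|----------|----------|",
        "| 1 | Google One | Productivity |",
        "| 2 | Disney+ | Entertainment |",
    ]
)

parsed = md_to_df_safe(SAMPLE_MD)
```

Raising a single exception type keeps the calling code simple; whether a validator should raise or return the raw string instead is a design choice left open here.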
def extract_table(url: str) -> Iterable[Table]:
    return client.chat.completions.create(
        model="gpt-4-vision-preview",
        response_model=Iterable[Table],
        max_tokens=1800,
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Extract table from image."},
                    {"type": "image_url", "image_url": {"url": url}}
                ],
            }
        ],
    )
Add error handling to the `extract_table` function to manage potential failures or unexpected responses from the OpenAI API call.
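A minimal sketch of what such handling might look like follows. It wraps any zero-argument callable rather than the OpenAI client itself, so it runs without network access; the helper name `call_with_retries`, its parameters, and the `flaky` stand-in are all hypothetical. In the script, the callable would presumably be something like `lambda: list(extract_table(url))`, with `retry_on` narrowed to the client's own error types.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    attempts: int = 3,
    base_delay: float = 1.0,
) -> T:
    """Call fn(), retrying on the given exception types with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on as exc:
            if attempt == attempts:
                raise RuntimeError(
                    f"table extraction failed after {attempts} attempts"
                ) from exc
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


# Stand-in for the API call: fails twice, then succeeds.
calls = {"n": 0}


def flaky() -> list:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return ["table"]


result = call_with_retries(flaky, retry_on=(ConnectionError,), base_delay=0.0)
```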
def md_to_df(data: Any) -> Any:
    if isinstance(data, str):
        return (
            pd.read_csv(
                StringIO(data),  # Get rid of whitespaces
                sep="|",
                index_col=1,
            )
            .dropna(axis=1, how="all")
            .iloc[1:]
            .map(lambda x: x.strip())
        )
The `.map(lambda x: x.strip())` call will raise an `AttributeError` because it is applied to a DataFrame rather than a Series. This needs to be corrected to apply the strip operation to each element in the DataFrame.
- .map(lambda x: x.strip())
+ .applymap(lambda x: x.strip() if isinstance(x, str) else x)
Committable suggestion
❗ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
def md_to_df(data: Any) -> Any:
    if isinstance(data, str):
        return (
            pd.read_csv(
                StringIO(data),  # Get rid of whitespaces
                sep="|",
                index_col=1,
            )
            .dropna(axis=1, how="all")
            .iloc[1:]
            .applymap(lambda x: x.strip() if isinstance(x, str) else x)
        )
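Whether the original `.map` call fails depends on the installed pandas version: `DataFrame.map` exists from pandas 2.1 onward, the same release that deprecated `applymap`. A small sketch of a helper that works on either side of that change is shown below; the name `strip_cells` is an assumption for illustration and does not appear in the PR.

```python
import pandas as pd


def _strip(value):
    """Strip whitespace from strings; return anything else unchanged."""
    return value.strip() if isinstance(value, str) else value


def strip_cells(df: pd.DataFrame) -> pd.DataFrame:
    """Apply _strip to every cell, using whichever elementwise method exists."""
    # Check the class, not the instance, so a column named "map" cannot interfere.
    if hasattr(pd.DataFrame, "map"):
        return df.map(_strip)
    return df.applymap(_strip)


frame = pd.DataFrame({"a": [" x ", 1], "b": [" y", "z "]})
cleaned = strip_cells(frame)
```

The `isinstance` guard matters independently of the method name: without it, any non-string cell would raise on `.strip()`.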
def extract_table(url: str) -> Iterable[Table]:
    return client.chat.completions.create(
        model="gpt-4-vision-preview",
        response_model=Iterable[Table],
        max_tokens=1800,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": """Extract the table from the image, and describe it.
                        Each table should be tidy, do not try to join tables that
                        should be seperately described.""",
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": url},
                    },
                ],
            }
        ],
    )
The `response_model` parameter in the `client.chat.completions.create` call is set to `Iterable[Table]`, which is not a valid argument. This should be removed or replaced with a valid parameter.
- response_model=Iterable[Table],
Committable suggestion
❗ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
def extract_table(url: str) -> Iterable[Table]:
    return client.chat.completions.create(
        model="gpt-4-vision-preview",
        max_tokens=1800,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": """Extract the table from the image, and describe it.
                        Each table should be tidy, do not try to join tables that
                        should be seperately described.""",
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": url},
                    },
                ],
            }
        ],
    )
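Whether `response_model` is valid depends on how `client` was constructed, which this hunk does not show. If the client is wrapped by a layer that adds that keyword (an assumption here), removing it would make the function return a raw completion instead of `Iterable[Table]`. The sketch below illustrates the wrapping idea only, using the standard library; `add_response_model`, `TableStub`, and `fake_create` are hypothetical names, and this is not the implementation of any particular library.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable, Optional, Type


@dataclass
class TableStub:
    """Hypothetical stand-in for the PR's Table model."""

    caption: str
    markdown: str


def add_response_model(create: Callable[..., str]) -> Callable[..., Any]:
    """Wrap `create` so it accepts `response_model` and returns parsed objects."""

    def wrapped(*, response_model: Optional[Type[Any]] = None, **kwargs: Any) -> Any:
        raw = create(**kwargs)  # the wrapped callable never sees response_model
        if response_model is None:
            return raw
        return [response_model(**item) for item in json.loads(raw)]

    return wrapped


def fake_create(**kwargs: Any) -> str:
    """Stub for a chat-completion call; returns a JSON array of tables."""
    return json.dumps([{"caption": "Android", "markdown": "| Rank |"}])


create = add_response_model(fake_create)
tables = create(response_model=TableStub, model="stub-model")
```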
if __name__ == "__main__":
    url = "https://a.storyblok.com/f/47007/2400x2000/bf383abc3c/231031_uk-ireland-in-three-charts_table_v01_b.png"
    tables = extract_table(url)
    for tbl in tables:
        print(tbl.caption, end="\n")
        print(tbl.dataframe)
    """
    Top 10 grossing apps in October 2023 (Ireland) for Android platforms, listing the rank, app name, and category.

          App Name                        Category
    Rank
    1     Google One                      Productivity
    2     Disney+                         Entertainment
    3     TikTok - Videos, Music & LIVE   Entertainment
    4     Candy Crush Saga                Games
    5     Tinder: Dating, Chat & Friends  Social networking
    6     Coin Master                     Games
    7     Roblox                          Games
    8     Bumble - Dating & Make Friends  Dating
    9     Royal Match                     Games
    10    Spotify: Music and Podcasts     Music & Audio

    Top 10 grossing apps in October 2023 (Ireland) for iOS platforms, listing the rank, app name, and category.

          App Name                        Category
    Rank
    1     Tinder: Dating, Chat & Friends  Social networking
    2     Disney+                         Entertainment
    3     YouTube: Watch, Listen, Stream  Entertainment
    4     Audible: Audio Entertainment    Entertainment
    5     Candy Crush Saga                Games
    6     TikTok - Videos, Music & LIVE   Entertainment
    7     Bumble - Dating & Make Friends  Dating
    8     Roblox                          Games
    9     LinkedIn: Job Search & News     Business
    10    Duolingo - Language Lessons     Education
    """
Consider parameterizing the URL and encapsulating the demonstration code in a function to improve reusability and maintainability.
The hardcoded data at the end of the file should be removed or commented out if it's meant for documentation purposes. It's not clear why it's included in the script.
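The two comments above (parameterize the URL, keep sample output out of the executable script) could be addressed together along the lines of the sketch below. The names `build_parser`, `run_demo`, and `stub_extract` are hypothetical; the extraction function is passed in as an argument so the sketch runs on its own, whereas the real script would presumably pass `extract_table`.

```python
import argparse
from types import SimpleNamespace
from typing import Any, Callable, Iterable, List, Optional

DEFAULT_URL = (
    "https://a.storyblok.com/f/47007/2400x2000/bf383abc3c/"
    "231031_uk-ireland-in-three-charts_table_v01_b.png"
)


def build_parser() -> argparse.ArgumentParser:
    """Command-line interface with the image URL as an optional argument."""
    parser = argparse.ArgumentParser(description="Extract tables from an image URL.")
    parser.add_argument("url", nargs="?", default=DEFAULT_URL, help="image URL")
    return parser


def run_demo(
    extract: Callable[[str], Iterable[Any]],
    argv: Optional[List[str]] = None,
) -> int:
    """Print each extracted table and return how many were found."""
    args = build_parser().parse_args(argv)
    count = 0
    for tbl in extract(args.url):
        print(tbl.caption)
        print(tbl.dataframe)
        count += 1
    return count


def stub_extract(url: str) -> List[SimpleNamespace]:
    """Stand-in for extract_table so the sketch needs no API access."""
    return [SimpleNamespace(caption=f"tables from {url}", dataframe="<DataFrame>")]


count = run_demo(stub_extract, ["https://example.com/table.png"])
```

Sample output would then live in the documentation page rather than in a string literal at the end of the script.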
This is a peculiar example, and I'm unsure how to work with types in this context.
Here's the code I have:
When I use the OpenAI vision model to extract tables from a given URL:
The table displays the top 10 grossing Android apps in Ireland for October 2023, provided by SensorTower. The rankings are based on app revenue, and the categories indicate the primary function of each app.
The table lists the top 10 grossing iOS apps in Ireland for October 2023, provided by SensorTower. The rankings are based on app revenue, and the categories indicate the primary function of each app.
It would be useful to provide users or instructors with some helper annotations. For example:
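As a hedged illustration of such a helper annotation (not necessarily the snippet originally posted here), the sketch below assumes Pydantic v2, where `InstanceOf`, `BeforeValidator`, `PlainSerializer`, and `WithJsonSchema` can be imported from `pydantic` directly. The alias name `MarkdownDataFrame` and the schema description text are assumptions; the field names follow the `tbl.caption` and `tbl.dataframe` usage in `run_table.py`.

```python
from io import StringIO
from typing import Annotated, Any

import pandas as pd
from pydantic import (
    BaseModel,
    BeforeValidator,
    InstanceOf,
    PlainSerializer,
    WithJsonSchema,
)


def md_to_df(data: Any) -> Any:
    """Convert a markdown table string to a DataFrame; pass other values through."""
    if isinstance(data, str):
        return (
            pd.read_csv(StringIO(data), sep="|", index_col=1)
            .dropna(axis=1, how="all")
            .iloc[1:]
            .apply(lambda col: col.map(lambda x: x.strip() if isinstance(x, str) else x))
        )
    return data


# Hypothetical alias: validated as a DataFrame, advertised to the model as a string.
MarkdownDataFrame = Annotated[
    InstanceOf[pd.DataFrame],
    BeforeValidator(md_to_df),
    # to_markdown() needs the optional `tabulate` package; it only runs on serialization.
    PlainSerializer(lambda df: df.to_markdown()),
    WithJsonSchema(
        {"type": "string", "description": "The markdown representation of the table."}
    ),
]


class Table(BaseModel):
    caption: str
    dataframe: MarkdownDataFrame


SAMPLE_MD = "\n".join(
    [
        "| Rank | App Name | Category |",
        "|------|----------|----------|",
        "| 1 | Google One | Productivity |",
        "| 2 | Disney+ | Entertainment |",
    ]
)

table = Table(caption="demo", dataframe=SAMPLE_MD)
```

With this shape, the JSON schema sent to the model describes a plain string, while code that receives a validated `Table` works with a DataFrame.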
By handling the prompt and validation properly:
- The `table` attribute should have a JSON schema type of `str` and a descriptive prompt.
- `Table(...).table` becomes a DataFrame.
Summary by CodeRabbit
New Features
- `Table` class for better table data management.
Documentation
New Examples