Support Kosmos-2.5 #31711
base: main
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,126 @@ | ||
<!--Copyright 2024 The HuggingFace Team. All rights reserved. | ||
|
||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with | ||
the License. You may obtain a copy of the License at | ||
|
||
http://www.apache.org/licenses/LICENSE-2.0 | ||
|
||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on | ||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the | ||
specific language governing permissions and limitations under the License. | ||
|
||
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be | ||
rendered properly in your Markdown viewer. | ||
|
||
--> | ||
|
||
# KOSMOS-2.5 | ||
|
||
## Overview | ||
|
||
Kosmos-2.5 is a multimodal literate model for machine reading of text-intensive images. Pre-trained on large-scale text-intensive images, it handles two distinct yet cooperative transcription tasks: (1) generating spatially-aware text blocks, where each block of text is assigned its spatial coordinates within the image, and (2) producing structured text output that captures styles and structures in Markdown format. Both tasks share a single decoder-only auto-regressive Transformer and are selected with task-specific prompts (`<ocr>` and `<md>` in the example below). The model is evaluated on end-to-end document-level text recognition and image-to-markdown text generation, and it can be adapted to other text-intensive image understanding tasks with different prompts through supervised fine-tuning. | ||
|
||
|
||
The abstract from the paper is the following: | ||
|
||
*We present Kosmos-2.5, a multimodal literate model for machine reading of text-intensive images. Pre-trained on large-scale text-intensive images, Kosmos-2.5 excels in two distinct yet cooperative transcription tasks: (1) generating spatially-aware text blocks, where each block of text is assigned its spatial coordinates within the image, and (2) producing structured text output that captures styles and structures into the markdown format. This unified multimodal literate capability is achieved through a shared Transformer architecture, task-specific prompts, and flexible text representations. We evaluate Kosmos-2.5 on end-to-end document-level text recognition and image-to-markdown text generation. Furthermore, the model can be readily adapted for any text-intensive image understanding task with different prompts through supervised fine-tuning, making it a general-purpose tool for real-world applications involving text-rich images. This work also paves the way for the future scaling of multimodal large language models.* | ||
|
||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/kosmos2_5_ocr.png" | ||
alt="drawing" width="600"/> | ||
|
||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/kosmos2_5_md.png" | ||
alt="drawing" width="600"/> | ||
|
||
<small> Overview of tasks that KOSMOS-2.5 can handle. Taken from the <a href="https://arxiv.org/abs/2309.11419">original paper</a>. </small> | ||
|
||
## Example | ||
|
||
```python
import re

import requests
import torch
from PIL import Image
from transformers import AutoProcessor, Kosmos2_5ForConditionalGeneration

repo = "microsoft/kosmos-2.5"
device = "cuda:0"
dtype = torch.bfloat16
model = Kosmos2_5ForConditionalGeneration.from_pretrained(repo, device_map=device, torch_dtype=dtype)
processor = AutoProcessor.from_pretrained(repo)

url = "https://huggingface.co/kirp/kosmos2_5/resolve/main/receipt_00008.png"
image = Image.open(requests.get(url, stream=True).raw)

prompt = "<ocr>"  # use "<md>" for image-to-markdown output
inputs = processor(text=prompt, images=image, return_tensors="pt")

# `height` and `width` are returned by the processor; the ratios below map
# predicted boxes back to the original image size.
height, width = inputs.pop("height"), inputs.pop("width")
raw_width, raw_height = image.size
scale_height = raw_height / height
scale_width = raw_width / width

inputs = {k: v.to(device) if v is not None else None for k, v in inputs.items()}
inputs["flattened_patches"] = inputs["flattened_patches"].to(dtype)

generated_ids = model.generate(
    **inputs,
    max_new_tokens=1024,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)


def postprocess(y, scale_height, scale_width):
    y = y.replace(prompt, "")
    if "<md>" in prompt:
        # Markdown output carries no box tokens, so return it unchanged.
        return y
    pattern = r"<bbox><x_\d+><y_\d+><x_\d+><y_\d+></bbox>"
    bboxs_raw = re.findall(pattern, y)
    lines = re.split(pattern, y)[1:]
    bboxs = [re.findall(r"\d+", i) for i in bboxs_raw]
    bboxs = [[int(j) for j in i] for i in bboxs]
    info = ""
    for i in range(len(lines)):
        box = bboxs[i]
        x0, y0, x1, y1 = box
        # Skip degenerate boxes; rescale the rest to original-image pixels.
        if not (x0 >= x1 or y0 >= y1):
            x0 = int(x0 * scale_width)
            y0 = int(y0 * scale_height)
            x1 = int(x1 * scale_width)
            y1 = int(y1 * scale_height)
            info += f"{x0},{y0},{x1},{y0},{x1},{y1},{x0},{y1},{lines[i]}"
    return info


output_text = postprocess(generated_text[0], scale_height, scale_width)
print(output_text)
```

Review thread on this example:

- nit (not necessary): It might be nice / interesting to refer to https://github.com/microsoft/unilm/blob/master/kosmos-2.5/draw_bbox.py and attach a screenshot of the output images.
- This one is easier to use.
- Yes. But if I understand correctly, this only gives the string, whereas people are more interested in seeing the final images with bounding boxes or the structured MD layout. I am not saying to use [...]. If you have any reason not to mention it, I am OK with not having it here.
- Do you mean something like "How to use"?
```text
55,595,71,595,71,629,55,629,1
82,595,481,595,481,635,82,635,[REG] BLACK SAKURA
716,590,841,590,841,629,716,629,45,455
55,637,71,637,71,672,55,672,1
82,637,486,637,486,675,82,675,COOKIE DOH SAUCES
818,632,843,632,843,668,818,668,0
51,683,71,683,71,719,51,719,1
82,683,371,683,371,719,82,719,NATA DE COCO
820,677,845,677,845,713,820,713,0
32,770,851,770,851,811,32,811,Sub Total 45,455
28,811,853,811,853,858,28,858,PB1 (10%) 4,545
28,857,855,857,855,905,28,905,Rounding 0
24,905,858,905,858,956,24,956,Total 50,000
17,1096,868,1096,868,1150,17,1150,Card Payment 50,000
```
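To see what the post-processing step does without loading the model, the following sketch runs the same bounding-box regex on a hand-written string. The sample text, the token values, the scale factors, and the `parse_ocr_output` helper are invented for illustration; only the `<bbox><x_..><y_..><x_..><y_..></bbox>` pattern and the degenerate-box check are taken from the example.

```python
import re

# Pattern copied from the example: one group of box tokens precedes each text line.
BBOX_PATTERN = r"<bbox><x_\d+><y_\d+><x_\d+><y_\d+></bbox>"


def parse_ocr_output(raw, scale_height=1.0, scale_width=1.0):
    """Return a list of (x0, y0, x1, y1, text) tuples in original-image pixels."""
    boxes = [[int(n) for n in re.findall(r"\d+", m)] for m in re.findall(BBOX_PATTERN, raw)]
    texts = re.split(BBOX_PATTERN, raw)[1:]  # the text that follows each box
    results = []
    for (x0, y0, x1, y1), text in zip(boxes, texts):
        if x0 >= x1 or y0 >= y1:
            continue  # skip degenerate boxes, as the example does
        results.append(
            (
                int(x0 * scale_width),
                int(y0 * scale_height),
                int(x1 * scale_width),
                int(y1 * scale_height),
                text.strip(),
            )
        )
    return results


# Hypothetical raw generation for two lines of text (all values invented).
sample = (
    "<bbox><x_10><y_20><x_110><y_40></bbox>Sub Total 45,455\n"
    "<bbox><x_10><y_50><x_110><y_70></bbox>Total 50,000\n"
)
parsed = parse_ocr_output(sample, scale_height=2.0, scale_width=1.5)
for row in parsed:
    print(row)
```

Returning tuples instead of one concatenated string is a design choice of this sketch; it keeps the coordinates and text separate for later use.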
|
||
|
||
|
||
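Each line of the `<ocr>` output shown earlier is eight comma-separated integers (the four corners of a box) followed by the recognized text, which may itself contain commas (for example `45,455`). A minimal sketch of reading such lines back and outlining the boxes on the image follows. The helper names `parse_quad_line` and `draw_quads` are hypothetical, and this is not the `draw_bbox.py` script from the upstream repository.

```python
def parse_quad_line(line):
    """Split 'x0,y0,x1,y0,x1,y1,x0,y1,text' into (corner points, text)."""
    parts = line.split(",", 8)  # at most 8 splits, so commas inside the text survive
    coords = [int(p) for p in parts[:8]]
    text = parts[8] if len(parts) > 8 else ""
    points = list(zip(coords[0::2], coords[1::2]))
    return points, text


def draw_quads(image, lines, color="red"):
    """Outline each box on a copy of a PIL image (requires Pillow)."""
    from PIL import ImageDraw  # imported lazily so parsing works without Pillow

    canvas = image.copy()
    draw = ImageDraw.Draw(canvas)
    for line in lines:
        points, _ = parse_quad_line(line)
        draw.polygon(points, outline=color)
    return canvas


points, text = parse_quad_line("716,590,841,590,841,629,716,629,45,455")
print(points, text)
```

With the objects from the example, `draw_quads(image, output_text.splitlines())` would return an annotated copy of the receipt image, assuming every non-empty line follows the format above.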
## Kosmos2_5Config | ||
|
||
[[autodoc]] Kosmos2_5Config | ||
|
||
## Kosmos2_5ImageProcessor | ||
|
||
[[autodoc]] Kosmos2_5ImageProcessor | ||
|
||
## Kosmos2_5Processor | ||
|
||
[[autodoc]] Kosmos2_5Processor | ||
- __call__ | ||
|
||
## Kosmos2_5Model | ||
|
||
[[autodoc]] Kosmos2_5Model | ||
- forward | ||
|
||
## Kosmos2_5ForConditionalGeneration | ||
|
||
[[autodoc]] Kosmos2_5ForConditionalGeneration | ||
- forward |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -486,6 +486,11 @@ | |
        "Kosmos2Config", | ||
        "Kosmos2Processor", | ||
    ], | ||
    "models.kosmos2_5": [ | ||
        "Kosmos2_5Config", | ||
        "Kosmos2_5ImageProcessor", | ||
        "Kosmos2_5Processor", | ||
    ], | ||
    "models.layoutlm": [ | ||
        "LayoutLMConfig", | ||
        "LayoutLMTokenizer", | ||
|
@@ -1149,6 +1154,7 @@ | |
    _import_structure["models.idefics2"].extend(["Idefics2ImageProcessor"]) | ||
    _import_structure["models.imagegpt"].extend(["ImageGPTFeatureExtractor", "ImageGPTImageProcessor"]) | ||
    _import_structure["models.instructblipvideo"].extend(["InstructBlipVideoImageProcessor"]) | ||
    _import_structure["models.kosmos2_5"].extend(["Kosmos2_5ImageProcessor", "Kosmos2_5Processor"]) | ||
Review comment on the line above (marked resolved by ydshieh): Where should I add it?
|
||
    _import_structure["models.layoutlmv2"].extend(["LayoutLMv2FeatureExtractor", "LayoutLMv2ImageProcessor"]) | ||
    _import_structure["models.layoutlmv3"].extend(["LayoutLMv3FeatureExtractor", "LayoutLMv3ImageProcessor"]) | ||
    _import_structure["models.levit"].extend(["LevitFeatureExtractor", "LevitImageProcessor"]) | ||
|
@@ -2372,6 +2378,13 @@ | |
            "Kosmos2PreTrainedModel", | ||
        ] | ||
    ) | ||
    _import_structure["models.kosmos2_5"].extend( | ||
        [ | ||
            "Kosmos2_5ForConditionalGeneration", | ||
            "Kosmos2_5Model", | ||
            "Kosmos2_5PreTrainedModel", | ||
        ] | ||
    ) | ||
    _import_structure["models.layoutlm"].extend( | ||
        [ | ||
            "LayoutLMForMaskedLM", | ||
|
@@ -5129,6 +5142,11 @@ | |
        Kosmos2Config, | ||
        Kosmos2Processor, | ||
    ) | ||
    from .models.kosmos2_5 import ( | ||
        Kosmos2_5Config, | ||
        Kosmos2_5ImageProcessor, | ||
        Kosmos2_5Processor, | ||
    ) | ||
    from .models.layoutlm import ( | ||
        LayoutLMConfig, | ||
        LayoutLMTokenizer, | ||
|
@@ -5821,6 +5839,7 @@ | |
        from .models.idefics2 import Idefics2ImageProcessor | ||
        from .models.imagegpt import ImageGPTFeatureExtractor, ImageGPTImageProcessor | ||
        from .models.instructblipvideo import InstructBlipVideoImageProcessor | ||
        from .models.kosmos2_5 import Kosmos2_5ImageProcessor, Kosmos2_5Processor | ||
Review comment on the line above (marked resolved by ydshieh): Where?
|
||
        from .models.layoutlmv2 import ( | ||
            LayoutLMv2FeatureExtractor, | ||
            LayoutLMv2ImageProcessor, | ||
|
@@ -6852,6 +6871,11 @@ | |
            Kosmos2Model, | ||
            Kosmos2PreTrainedModel, | ||
        ) | ||
        from .models.kosmos2_5 import ( | ||
            Kosmos2_5ForConditionalGeneration, | ||
            Kosmos2_5Model, | ||
            Kosmos2_5PreTrainedModel, | ||
        ) | ||
        from .models.layoutlm import ( | ||
            LayoutLMForMaskedLM, | ||
            LayoutLMForQuestionAnswering, | ||
|
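The hunks above register the Kosmos-2.5 names in several places: a base `_import_structure` entry and later `.extend(...)` calls for the image processor, processor and model classes. The sketch below illustrates the general idea of registering some names only when an optional backend is importable, in the way the model's own `__init__.py` checks `is_torch_available()`. The `backend_available` and `build_import_structure` helpers are hypothetical stand-ins built on the standard library, not the transformers implementation.

```python
import importlib.util


def backend_available(name):
    """Hypothetical stand-in for helpers such as is_torch_available()."""
    return importlib.util.find_spec(name) is not None


def build_import_structure(has_torch):
    # Names that need no optional backend are always registered.
    structure = {
        "models.kosmos2_5": ["Kosmos2_5Config", "Kosmos2_5ImageProcessor", "Kosmos2_5Processor"],
    }
    # Model classes are registered only when the backend can be imported.
    if has_torch:
        structure["models.kosmos2_5"].extend(
            ["Kosmos2_5ForConditionalGeneration", "Kosmos2_5Model", "Kosmos2_5PreTrainedModel"]
        )
    return structure


structure = build_import_structure(has_torch=backend_available("torch"))
print(structure["models.kosmos2_5"])
```

Passing the availability flag as an argument keeps the sketch easy to exercise on machines with or without the backend installed.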
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -116,6 +116,7 @@ | |
    jamba, | ||
    jetmoe, | ||
    kosmos2, | ||
    kosmos2_5, | ||
    layoutlm, | ||
    layoutlmv2, | ||
    layoutlmv3, | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,64 @@ | ||
# coding=utf-8 | ||
# Copyright 2024 Microsoft Research and The HuggingFace Inc. team. All rights reserved. | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
from typing import TYPE_CHECKING | ||
|
||
from ...utils import ( | ||
    OptionalDependencyNotAvailable, | ||
    _LazyModule, | ||
    is_torch_available, | ||
    is_vision_available, | ||
) | ||
|
||
|
||
_import_structure = { | ||
    "configuration_kosmos2_5": ["Kosmos2_5Config"], | ||
    "image_processing_kosmos2_5": ["Kosmos2_5ImageProcessor"], | ||
    "processing_kosmos2_5": ["Kosmos2_5Processor"], | ||
} | ||
|
||
try: | ||
    if not is_torch_available(): | ||
        raise OptionalDependencyNotAvailable() | ||
except OptionalDependencyNotAvailable: | ||
    pass | ||
else: | ||
    _import_structure["modeling_kosmos2_5"] = [ | ||
        "Kosmos2_5ForConditionalGeneration", | ||
        "Kosmos2_5Model", | ||
        "Kosmos2_5PreTrainedModel", | ||
    ] | ||
|
||
|
||
if TYPE_CHECKING: | ||
    from .configuration_kosmos2_5 import Kosmos2_5Config | ||
    from .image_processing_kosmos2_5 import Kosmos2_5ImageProcessor | ||
    from .processing_kosmos2_5 import Kosmos2_5Processor | ||
|
||
    try: | ||
        if not is_torch_available(): | ||
            raise OptionalDependencyNotAvailable() | ||
    except OptionalDependencyNotAvailable: | ||
        pass | ||
    else: | ||
        from .modeling_kosmos2_5 import ( | ||
            Kosmos2_5ForConditionalGeneration, | ||
            Kosmos2_5Model, | ||
            Kosmos2_5PreTrainedModel, | ||
        ) | ||
|
||
else: | ||
    import sys | ||
|
||
    sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure) | ||
|
Review comment: (for me): I will revert this
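The new `__init__.py` replaces the package in `sys.modules` with a `_LazyModule` built from `_import_structure`, so submodules are imported only when one of their names is first accessed. A simplified sketch of that idea follows. The `LazyModule` class below is a hypothetical illustration with a different constructor from transformers' `_LazyModule`, and it maps names to standard-library modules purely so that it runs anywhere.

```python
import importlib
import types


class LazyModule(types.ModuleType):
    """Simplified stand-in: import the owning module on first attribute access."""

    def __init__(self, name, import_structure):
        super().__init__(name)
        # Invert {module: [names]} into {name: module} for direct lookup.
        self._name_to_module = {
            attr: module for module, attrs in import_structure.items() for attr in attrs
        }
        self.loaded = []  # modules imported through this object (for illustration)

    def __getattr__(self, attr):
        # Only called when normal attribute lookup fails.
        if attr.startswith("_") or attr not in self._name_to_module:
            raise AttributeError(f"module {self.__name__!r} has no attribute {attr!r}")
        module_name = self._name_to_module[attr]
        module = importlib.import_module(module_name)
        self.loaded.append(module_name)
        value = getattr(module, attr)
        setattr(self, attr, value)  # cache so __getattr__ is not hit again
        return value


lazy = LazyModule("demo", {"json": ["dumps"], "math": ["sqrt"]})
print(lazy.loaded)     # nothing has been imported through the lazy module yet
print(lazy.sqrt(9.0))  # triggers the import of `math`
print(lazy.loaded)
```

Caching the resolved attribute with `setattr` means later accesses bypass `__getattr__` entirely, which is the main benefit of deferring the import.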