Support Kosmos-2.5 #31711

Open
wants to merge 341 commits into
base: main
Changes from 92 commits
7e810e2
upload doc images
ydshieh Jul 25, 2024
2fe1f94
[ydshieh] update eager/sdpa ocr expected outputs
ydshieh Jul 25, 2024
ec82032
[ydshieh] update FA2 ocr expected outputs
ydshieh Jul 25, 2024
8066ee7
[ydshieh] require_flash_attn
ydshieh Jul 25, 2024
9c1539a
[ydshieh] no need eval()
ydshieh Jul 25, 2024
4eca23c
[ydshieh] cuda_compute_capability_major_version
ydshieh Jul 25, 2024
b574b09
[ydshieh] fix FA2 deco
ydshieh Jul 25, 2024
d2c57cc
[ydshieh] [ydshieh] update eager ocr expected outputs
ydshieh Jul 25, 2024
93b291f
[ydshieh] update FA2 md expected outputs
ydshieh Jul 25, 2024
b7be077
[ydshieh] fix
ydshieh Jul 25, 2024
d577c90
remove add_special_tokens
Jul 26, 2024
2537140
without grad when generating
Jul 26, 2024
24961cd
Update src/transformers/models/kosmos2_5/configuration_kosmos2_5.py
tic-top Jul 26, 2024
6eb0683
Update src/transformers/models/kosmos2_5/convert_kosmos2_5.py
tic-top Jul 26, 2024
ca57f47
Update src/transformers/models/kosmos2_5/configuration_kosmos2_5.py
tic-top Jul 26, 2024
c23a8dd
Update src/transformers/models/kosmos2_5/configuration_kosmos2_5.py
tic-top Jul 27, 2024
452b23d
add batch test
Jul 28, 2024
4308a40
fix document in ks25 config
Jul 28, 2024
2db6b88
Merge branch 'main' of https://github.com/tic-top/transformers into main
Jul 28, 2024
1776f31
fix foc in ks25 processor
Jul 28, 2024
188adbf
add comment to ks25 image processor
Jul 28, 2024
3cebe13
update copyright
Jul 28, 2024
c54f9a8
Update src/transformers/models/kosmos2_5/convert_kosmos2_5.py
tic-top Jul 28, 2024
54a632e
Update src/transformers/models/kosmos2_5/convert_kosmos2_5.py
tic-top Jul 28, 2024
5b3a6f7
fix doc in ks25 cfg
Jul 28, 2024
e9e56d0
simplify ks25 image procrssor
Jul 28, 2024
8b27f80
Merge branch 'main' of https://github.com/tic-top/transformers into main
Jul 28, 2024
5ba6d84
simplify ks25 image processor
Jul 28, 2024
25e3260
[ydshieh] update repo name in doc
ydshieh Jul 29, 2024
fbbf151
[ydshieh] images, width, height, rows, cols = ...
ydshieh Jul 29, 2024
28b58ff
remove unnecessary comment
Jul 29, 2024
06c52ae
copied from comment added
Jul 29, 2024
99f0d99
add meaningful comment
Jul 29, 2024
2a782f0
Merge branch 'main' of https://github.com/tic-top/transformers into main
Jul 29, 2024
da45edd
ks25 image processor test added
Jul 30, 2024
0ddfe76
add more ks25 processor test
Jul 30, 2024
9dcacfc
fix style
Jul 30, 2024
0d166de
[ydshieh] 2024
ydshieh Jul 30, 2024
32df418
[ydshieh] better skip
ydshieh Jul 30, 2024
9fca9ca
[ydshieh] num_image_tokens
ydshieh Jul 30, 2024
87ccbc7
Merge remote-tracking branch 'upstream/main' into main
Jul 30, 2024
ed50bbd
refractor FA2
Jul 30, 2024
c027a98
fix error
Jul 30, 2024
64f915e
fix ans
Jul 30, 2024
26fb969
[ydshieh] test_sdpa
ydshieh Jul 30, 2024
6b82ce0
[ydshieh] better skip
ydshieh Jul 30, 2024
482e5e1
[ydshieh] better skip
ydshieh Jul 30, 2024
bd76555
fix format
Jul 30, 2024
09d8b29
make style
Jul 30, 2024
cfaa28f
test_model_input_names need torch
Jul 30, 2024
ab546cc
[ydshieh] remove
ydshieh Jul 30, 2024
6cae0b6
[ydshieh] add copied
ydshieh Jul 30, 2024
9e0c277
[ydshieh] style
ydshieh Jul 30, 2024
cc17791
[ydshieh] Kosmos2_5ForConditionalGeneration
ydshieh Jul 30, 2024
865fc2f
[ydshieh] docstring
ydshieh Jul 30, 2024
162f569
[ydshieh] copied
ydshieh Jul 30, 2024
889d9da
[ydshieh] copied
ydshieh Jul 30, 2024
40dc555
[ydshieh] copied
ydshieh Jul 30, 2024
7e5a91c
[ydshieh] copied
ydshieh Jul 30, 2024
7dfd145
[ydshieh] copied
ydshieh Jul 30, 2024
d0e4fb7
[ydshieh] copied
ydshieh Jul 30, 2024
60240f2
[ydshieh] copied
ydshieh Jul 30, 2024
2b2fe1c
[ydshieh] copied
ydshieh Aug 2, 2024
267e1d6
[ydshieh] copied
ydshieh Aug 2, 2024
2ea4d4f
[ydshieh] fix
ydshieh Aug 2, 2024
18fa43b
[ydshieh] fix
ydshieh Aug 2, 2024
2157f31
[ydshieh] fix
ydshieh Aug 2, 2024
ac1968b
fix bug
Aug 3, 2024
29d272b
[kirp] make style
Aug 3, 2024
70d85cd
[ydshieh] copied
ydshieh Aug 5, 2024
1424e07
[ydshieh] copied
ydshieh Aug 5, 2024
6f8b2e6
[ydshieh] _init_weights
ydshieh Aug 5, 2024
2cdb62a
[ydshieh] _init_weights
ydshieh Aug 5, 2024
f2b61c2
[ydshieh] _init_weights
ydshieh Aug 5, 2024
3681119
[yilinjia] fix doc in config
Aug 7, 2024
7df3000
[ydshieh] update vision model class inheritance
ydshieh Aug 12, 2024
de6d842
[ydshieh] copied statement for vision model
ydshieh Aug 12, 2024
e09217e
[ydshieh] update _init_weights
ydshieh Aug 13, 2024
210ccb1
[ydshieh] update _init_weights
ydshieh Aug 13, 2024
4e709e5
[ydshieh] update _init_weights
ydshieh Aug 13, 2024
e62993c
[ydshieh] copied statement for Kosmos2_5TextModel
ydshieh Aug 13, 2024
e6fe2ae
[ydshieh] Kosmos2TextForCausalLM
ydshieh Aug 13, 2024
703ccfd
[ydshieh] tiny tweak
ydshieh Aug 13, 2024
e41b875
[ydshieh] tests
ydshieh Aug 13, 2024
9822d00
[ydshieh] tests
ydshieh Aug 13, 2024
1e175ba
[ydshieh] tests
ydshieh Aug 13, 2024
e583cd4
[ydshieh] tests
ydshieh Aug 13, 2024
bb4c247
[ydshieh] stye
ydshieh Aug 13, 2024
139e834
[ydshieh] revert
ydshieh Aug 13, 2024
66af73d
remove old url
Aug 14, 2024
6659897
[ydshieh] fix
ydshieh Aug 14, 2024
720a8ab
[ydshieh] fix
ydshieh Aug 14, 2024
8ee2aa9
[ydshieh] fix
ydshieh Aug 14, 2024
9d7363f
[ydshieh] update value
ydshieh Aug 21, 2024
1bd02b2
[ydshieh] add to toctree
ydshieh Aug 21, 2024
06cbb5d
[kirp] update the example part in readme
Aug 27, 2024
f4c73b3
[kirp] remove zero bias
Sep 2, 2024
0ae49e0
[kirp] iterate over the images only once
Sep 2, 2024
ef6754c
[kirp] remove cross attention
Sep 2, 2024
9a01f8f
[kirp] reformat
Sep 2, 2024
eb116ab
[kirp] use string
Sep 2, 2024
e1ab413
[kirp] remove creating mask in the layer
Sep 2, 2024
fe418d0
[kirp] remove cache
Sep 2, 2024
cc7d28f
Revert "[kirp] remove creating mask in the layer"
Sep 2, 2024
e5ffaee
[kirp] fix typo in processor
Sep 3, 2024
b5ebf09
[kirp] remove head mask
Sep 3, 2024
dd12798
[kirp] remove test file
Sep 3, 2024
15feaea
[kirp] cache for eager
Sep 30, 2024
ab687f5
[kirp] sdpa cache
Sep 30, 2024
87ab935
[kirp] move attention_mask maker to vision encoder
Sep 30, 2024
54b1984
[kirp] cache sdpa and format
Sep 30, 2024
5e5a9e9
[kirp] fix format
Sep 30, 2024
0ed8541
[kirp] fix format
Sep 30, 2024
df9d3ad
[kirp] use update_causal_mask
Sep 30, 2024
55cb12d
[kirp] check copies
Sep 30, 2024
d99934d
[kirp] regroup the init
Sep 30, 2024
c705049
[kirp] make style
Sep 30, 2024
806ca1b
[run-slow] kosmos2_5
Sep 30, 2024
9e620b6
[run-slow] fix checkpoint bug
Sep 30, 2024
65490b4
[run-slow] fix checkpoint bug
Sep 30, 2024
d0bf57e
Merge remote-tracking branch 'upstream/main' into main
Oct 2, 2024
f5d4439
[run-slow] kosmos2_5
Oct 2, 2024
40ff015
[run-slow] kosmos2_5
Oct 2, 2024
63603d6
[kirp] remove cross_attn in textblock
tic-top Oct 10, 2024
f8497ce
[run-slow] kosmos2_5
tic-top Oct 10, 2024
eab8e69
[run-slow] kosmos2_5
tic-top Oct 10, 2024
a6154db
[run-slow] kosmos2_5
tic-top Oct 11, 2024
94cc6d2
[ydshieh] update loop
ydshieh Oct 22, 2024
968b033
[ydshieh] remove duplication in init file
ydshieh Oct 25, 2024
142604d
[ydshieh] tokenizer class
ydshieh Oct 25, 2024
4b7bc95
[ydshieh] remove copied from
ydshieh Oct 25, 2024
6f2bd73
[ydshieh] skip
ydshieh Oct 25, 2024
08e1cb0
[ydshieh] move
ydshieh Oct 29, 2024
f2dae0d
Merge branch 'main' into kosmos25
ydshieh Oct 29, 2024
fcc095f
[ydshieh] fix copie
ydshieh Oct 29, 2024
f66c6ee
[ydshieh] remove
ydshieh Oct 29, 2024
9a8479d
[ydshieh] Add to MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES
ydshieh Oct 29, 2024
830671b
[ydshieh] new init
ydshieh Oct 29, 2024
1c58c8f
[ydshieh] fix
ydshieh Oct 29, 2024
0153a08
[ydshieh] remove
ydshieh Oct 31, 2024
ac94b57
[ydshieh] add ProcessorTesterMixin
ydshieh Oct 31, 2024
52788cc
[ydshieh] add GenerationTesterMixin
ydshieh Oct 31, 2024
0b9e5ad
Merge branch 'main' into kosmos25
ydshieh Dec 6, 2024
925e14a
Merge branch 'main' into main
ydshieh Dec 6, 2024
6ed504d
fix
ydshieh Dec 6, 2024
9a841ad
fix
ydshieh Dec 6, 2024
dcced48
fix
ydshieh Dec 13, 2024
91fa383
fix
ydshieh Dec 13, 2024
e3802f4
fix
ydshieh Dec 13, 2024
85da449
fix
ydshieh Dec 13, 2024
b1db4f2
fix
ydshieh Dec 13, 2024
f8c98d6
it's Friday night, let cross finger
ydshieh Dec 13, 2024
fbb3e59
it's Friday night, let cross finger
ydshieh Dec 13, 2024
ce3a6b0
it's Friday night, let cross finger
ydshieh Dec 13, 2024
90c4fcc
it's Friday night, let cross finger
ydshieh Dec 13, 2024
00e324d
it's Friday night, let cross finger
ydshieh Dec 13, 2024
9c8aff7
it's Friday night, let cross finger
ydshieh Dec 13, 2024
2c47915
it's Friday night, let cross finger
ydshieh Dec 13, 2024
395a636
it's Monday let's go
ydshieh Dec 16, 2024
8a058d9
it's Monday let's go
ydshieh Dec 16, 2024
c639eeb
it's Monday let's go
ydshieh Dec 16, 2024
b688c4f
Merge branch 'ca03842c' into kosmos25
ydshieh Dec 16, 2024
d1c52f4
temp
ydshieh Dec 17, 2024
3a58742
temp
ydshieh Dec 17, 2024
d5b8349
temp
ydshieh Dec 17, 2024
9ddc86b
temp
ydshieh Dec 17, 2024
39dc6ef
temp
ydshieh Dec 17, 2024
b2c3db2
temp
ydshieh Dec 17, 2024
c356a36
temp
ydshieh Dec 17, 2024
55944fc
temp
ydshieh Dec 17, 2024
83d600e
temp
ydshieh Dec 17, 2024
2d4cbba
temp
ydshieh Dec 17, 2024
6b2f7d7
temp
ydshieh Dec 17, 2024
5f731a9
temp
ydshieh Dec 17, 2024
0ec499a
temp
ydshieh Dec 17, 2024
7f0d26c
temp
ydshieh Dec 17, 2024
db865db
temp
ydshieh Dec 17, 2024
bf14c4b
temp
ydshieh Dec 17, 2024
9b29aac
temp
ydshieh Dec 17, 2024
ce222a6
temp
ydshieh Dec 17, 2024
876cb6b
temp
ydshieh Dec 17, 2024
a3638ea
temp
ydshieh Dec 17, 2024
30f927a
temp
ydshieh Dec 17, 2024
a65a9b1
temp
ydshieh Dec 17, 2024
7c99fd0
temp
ydshieh Dec 17, 2024
ec9ea0c
fix
ydshieh Dec 17, 2024
8fc9699
fix
ydshieh Dec 17, 2024
22cb70d
fix
ydshieh Dec 18, 2024
001fd70
fix
ydshieh Dec 18, 2024
d1116f5
fix
ydshieh Dec 18, 2024
6f09a51
fix
ydshieh Dec 18, 2024
7d0b827
Merge branch 'main' into main
ydshieh Dec 18, 2024
cd018b0
Merge branch 'main' into kosmos25
ydshieh Jan 10, 2025
a5b23f8
Merge branch 'temp' into kosmos25
ydshieh Jan 21, 2025
d1debcc
no more copied
ydshieh Jan 21, 2025
1279316
fix
ydshieh Jan 21, 2025
69aec2e
Apply suggestions from code review
ydshieh Jan 21, 2025
8c579a9
fix default values in docstrings
ydshieh Jan 21, 2025
af813ce
update doc
ydshieh Jan 21, 2025
ca60142
Merge branch 'main_b5aaf875' into kosmos25
ydshieh Jan 24, 2025
19da4a2
[update] Kosmos2_5TextTransformer.forward
ydshieh Jan 24, 2025
777a3e2
Update Kosmos2_5TextBlock.forward # Need to update `self.self_attn` i…
ydshieh Jan 24, 2025
1ace3d1
Don't return past_key_value # need further changes
ydshieh Jan 24, 2025
8d2e51f
fix import issues
ydshieh Jan 24, 2025
7b65626
Fix Kosmos2_5ImageToTextProjection.forward: remove `_` when calling `…
ydshieh Jan 24, 2025
89c6901
Add eager_attention_forward
ydshieh Jan 24, 2025
59700f9
temp. update KOSMOS2_5_TEXT_ATTENTION_CLASSES # Need to remove this v…
ydshieh Jan 24, 2025
4988c47
Add `self.config = config` to `Kosmos2_5TextAttention.__init__`
ydshieh Jan 24, 2025
3a411a3
fix: change self.attention_dropout to self.dropout
ydshieh Jan 24, 2025
4036920
fix: remove the redudant ` * self.scaling`
ydshieh Jan 24, 2025
59c21c9
debug: partial revert
ydshieh Jan 24, 2025
48c6965
ugly fix for numerical issue
ydshieh Jan 27, 2025
f36ef6f
back to the clean version with the scaling issue fixed
ydshieh Jan 27, 2025
d812476
fix missing comma
ydshieh Jan 27, 2025
f339e50
add comment about sdpa: currently some tests failing because we use e…
ydshieh Jan 27, 2025
f81256a
Merge branch 'main' into kosmos25
ydshieh Jan 31, 2025
beb281c
comment
ydshieh Jan 31, 2025
a6ff4d2
Use ALL_ATTENTION_FUNCTIONS in `Kosmos2_5TextAttention`
ydshieh Jan 31, 2025
fb62fd6
Remove other attn impl. and KOSMOS2_5_TEXT_ATTENTION_CLASSES
ydshieh Jan 31, 2025
9a90c54
Update Kosmos2_5ImageToTextProjection
ydshieh Jan 31, 2025
cf804ac
Remove output_attentions: bool = False
ydshieh Jan 31, 2025
8fc5a31
remove if not output_attentions:
ydshieh Jan 31, 2025
0593308
Deal with vision part
ydshieh Jan 31, 2025
6a55353
Fix scaling
ydshieh Jan 31, 2025
2050bc3
Merge branch 'main' into kosmos25
ydshieh Feb 3, 2025
9cdfdf3
clean up
ydshieh Feb 3, 2025
c49d565
remove test_torchscript
ydshieh Feb 3, 2025
a809295
remove test_torchscript = False
ydshieh Feb 3, 2025
95dc35c
✅✅✅ finally green CI
ydshieh Feb 3, 2025
bcaf808
ruff fix
ydshieh Feb 3, 2025
917dcc8
ruff format
ydshieh Feb 3, 2025
42c3216
remove copied
ydshieh Feb 3, 2025
cd35c34
ruff format
ydshieh Feb 3, 2025
185f370
Merge branch 'main' into kosmos25
ydshieh Feb 4, 2025
0c0c485
update
ydshieh Feb 5, 2025
0216ac8
Merge branch 'main' into main
ydshieh Feb 5, 2025
6b288c3
lm loss update
ydshieh Feb 7, 2025
94e563a
Merge branch 'main' into kosmos25
ydshieh Feb 11, 2025
ad6ded5
update
ydshieh Feb 11, 2025
17b78dd
add width and height
ydshieh Feb 12, 2025
b9fc031
remove pop in the test
ydshieh Feb 13, 2025
0ece5c7
remove from prepare_inputs_for_generation
ydshieh Feb 13, 2025
fba70ba
width and height for base model
ydshieh Feb 13, 2025
4281ff3
update docstring
ydshieh Feb 13, 2025
6e071c7
update docstring
ydshieh Feb 13, 2025
a5318da
update docstring
ydshieh Feb 13, 2025
ead54fd
style
ydshieh Feb 13, 2025
6c320a6
fix copie
ydshieh Feb 13, 2025
783877d
rename doc
ydshieh Feb 13, 2025
7399d8a
add fast image processor
ydshieh Feb 13, 2025
4 changes: 2 additions & 2 deletions .github/workflows/self-pr-slow-ci.yml
Collaborator

(for me): I will revert this

@@ -65,8 +65,8 @@ jobs:
fail-fast: false
matrix:
folders: ${{ fromJson(needs.find_models_to_run.outputs.models) }}
- machine_type: [single-gpu, multi-gpu]
- runs-on: ['${{ matrix.machine_type }}', nvidia-gpu, t4, ci]
+ machine_type: [single-gpu]
+ runs-on: ['${{ matrix.machine_type }}', nvidia-gpu, a10, ci]
container:
image: huggingface/transformers-all-latest-gpu
options: --gpus all --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/
1 change: 1 addition & 0 deletions docs/source/en/index.md
@@ -171,6 +171,7 @@ Flax), PyTorch, and/or TensorFlow.
| [JetMoe](model_doc/jetmoe) | ✅ | ❌ | ❌ |
| [Jukebox](model_doc/jukebox) | ✅ | ❌ | ❌ |
| [KOSMOS-2](model_doc/kosmos-2) | ✅ | ❌ | ❌ |
| [KOSMOS-2.5](model_doc/kosmos-2.5) | ✅ | ❌ | ❌ |
| [LayoutLM](model_doc/layoutlm) | ✅ | ✅ | ❌ |
| [LayoutLMv2](model_doc/layoutlmv2) | ✅ | ❌ | ❌ |
| [LayoutLMv3](model_doc/layoutlmv3) | ✅ | ✅ | ❌ |
126 changes: 126 additions & 0 deletions docs/source/en/model_doc/kosmos-2.5.md
@@ -0,0 +1,126 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# KOSMOS-2.5

## Overview

Kosmos-2.5 is a multimodal literate model for machine reading of text-intensive images, built on a shared decoder-only auto-regressive Transformer. A task-specific prompt selects one of two transcription tasks: `<ocr>` generates spatially-aware text blocks, where each block of text is assigned its bounding-box coordinates within the image, and `<md>` produces structured text output that captures styles and structures in markdown format. The model can also be adapted to other text-intensive image understanding tasks with different prompts through supervised fine-tuning.
ydshieh marked this conversation as resolved.

The abstract from the paper is the following:

*We present Kosmos-2.5, a multimodal literate model for machine reading of text-intensive images. Pre-trained on large-scale text-intensive images, Kosmos-2.5 excels in two distinct yet cooperative transcription tasks: (1) generating spatially-aware text blocks, where each block of text is assigned its spatial coordinates within the image, and (2) producing structured text output that captures styles and structures into the markdown format. This unified multimodal literate capability is achieved through a shared Transformer architecture, task-specific prompts, and flexible text representations. We evaluate Kosmos-2.5 on end-to-end document-level text recognition and image-to-markdown text generation. Furthermore, the model can be readily adapted for any text-intensive image understanding task with different prompts through supervised fine-tuning, making it a general-purpose tool for real-world applications involving text-rich images. This work also paves the way for the future scaling of multimodal large language models.*

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/kosmos2_5_ocr.png"
alt="drawing" width="600"/>

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/kosmos2_5_md.png"
alt="drawing" width="600"/>

<small> Overview of tasks that KOSMOS-2.5 can handle. Taken from the <a href="https://arxiv.org/abs/2309.11419">original paper</a>. </small>

## Example

```python
import re

import requests
import torch
from PIL import Image
from transformers import AutoProcessor, Kosmos2_5ForConditionalGeneration

repo = "microsoft/kosmos-2.5"
device = "cuda:0"
dtype = torch.bfloat16
model = Kosmos2_5ForConditionalGeneration.from_pretrained(repo, device_map=device, torch_dtype=dtype)
processor = AutoProcessor.from_pretrained(repo)

url = "https://huggingface.co/kirp/kosmos2_5/resolve/main/receipt_00008.png"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "<ocr>"  # or "<md>"

inputs = processor(text=prompt, images=image, return_tensors="pt")
# `height`/`width` come from the processor; the ratios below map boxes back to the original image size.
height, width = inputs.pop("height"), inputs.pop("width")
raw_width, raw_height = image.size
scale_height = raw_height / height
scale_width = raw_width / width

inputs = {k: v.to(device) if v is not None else None for k, v in inputs.items()}
inputs["flattened_patches"] = inputs["flattened_patches"].to(dtype)
generated_ids = model.generate(
    **inputs,
    max_new_tokens=1024,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)


def postprocess(y, scale_height, scale_width):
    y = y.replace(prompt, "")
    if "<md>" in prompt:
        return y
    # Each OCR line is preceded by a <bbox>...</bbox> group holding four coordinates.
    pattern = r"<bbox><x_\d+><y_\d+><x_\d+><y_\d+></bbox>"
    bboxs_raw = re.findall(pattern, y)
    lines = re.split(pattern, y)[1:]
    bboxs = [re.findall(r"\d+", i) for i in bboxs_raw]
    bboxs = [[int(j) for j in i] for i in bboxs]
    info = ""
    for i in range(len(lines)):
        box = bboxs[i]
        x0, y0, x1, y1 = box
        # Skip degenerate boxes.
        if not (x0 >= x1 or y0 >= y1):
            x0 = int(x0 * scale_width)
            y0 = int(y0 * scale_height)
            x1 = int(x1 * scale_width)
            y1 = int(y1 * scale_height)
            info += f"{x0},{y0},{x1},{y0},{x1},{y1},{x0},{y1},{lines[i]}"
    return info


output_text = postprocess(generated_text[0], scale_height, scale_width)
print(output_text)
Collaborator

nit: (not necessary)

Might be nice / interesting to refer to

https://github.com/microsoft/unilm/blob/master/kosmos-2.5/draw_bbox.py

and attach a screenshot of the output images.

Author: @tic-top, Jul 22, 2024

This one is easier to use.
The Python file above needs to convert the str to JSON first, then draw.

Collaborator

Yes. But if I understand correctly, this only gives the string, whereas people are more interested in seeing the final images with bounding boxes or the structured MD layout.

I am not saying to use draw_bbox.py in this documentation. Just mention that there is such a file to draw things and give the link as a reference.

If you have any reason not to mention it, I am OK not to have it here.

Author

Do you mean something like How to use?

```
```text
55,595,71,595,71,629,55,629,1
82,595,481,595,481,635,82,635,[REG] BLACK SAKURA
716,590,841,590,841,629,716,629,45,455
55,637,71,637,71,672,55,672,1
82,637,486,637,486,675,82,675,COOKIE DOH SAUCES
818,632,843,632,843,668,818,668,0
51,683,71,683,71,719,51,719,1
82,683,371,683,371,719,82,719,NATA DE COCO
820,677,845,677,845,713,820,713,0
32,770,851,770,851,811,32,811,Sub Total 45,455
28,811,853,811,853,858,28,858,PB1 (10%) 4,545
28,857,855,857,855,905,28,905,Rounding 0
24,905,858,905,858,956,24,956,Total 50,000
17,1096,868,1096,868,1150,17,1150,Card Payment 50,000
```
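
Following the review discussion above about visualizing results, the OCR string can be turned back into boxes. This is a minimal sketch, assuming each output line has the form `x0,y0,x1,y0,x1,y1,x0,y1,text` as in the sample output; `parse_ocr_lines` and `draw_boxes` are hypothetical helper names that are not part of this PR, and `draw_boxes` assumes Pillow is installed.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]


def parse_ocr_lines(output_text: str) -> List[Tuple[Box, str]]:
    """Parse lines of the form `x0,y0,x1,y0,x1,y1,x0,y1,text` into (box, text) pairs."""
    results = []
    for line in output_text.splitlines():
        # The text itself may contain commas (e.g. "45,455"), so split at most 8 times.
        parts = line.split(",", 8)
        if len(parts) < 9:
            continue  # skip blank or malformed lines
        try:
            coords = [int(p) for p in parts[:8]]
        except ValueError:
            continue  # skip lines whose first eight fields are not integers
        # Corners are listed clockwise from top-left; the 1st and 3rd corners define the box.
        x0, y0, x1, y1 = coords[0], coords[1], coords[4], coords[5]
        results.append(((x0, y0, x1, y1), parts[8]))
    return results


def draw_boxes(image, output_text: str):
    """Draw each parsed box on a copy of a PIL image (requires Pillow)."""
    from PIL import ImageDraw  # imported lazily so parsing works without Pillow

    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    for box, _text in parse_ocr_lines(output_text):
        draw.rectangle(box, outline="red", width=2)
    return annotated


sample = "716,590,841,590,841,629,716,629,45,455\n82,683,371,683,371,719,82,719,NATA DE COCO"
for box, text in parse_ocr_lines(sample):
    print(box, text)
```

Splitting with a maximum of eight separators keeps prices such as `45,455` intact in the text field. With the variables from the example above, `draw_boxes(image, output_text).save("annotated.png")` would write an annotated copy of the receipt.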



## Kosmos2_5Config

[[autodoc]] Kosmos2_5Config

## Kosmos2_5ImageProcessor

[[autodoc]] Kosmos2_5ImageProcessor

## Kosmos2_5Processor

[[autodoc]] Kosmos2_5Processor
- __call__

## Kosmos2_5Model

[[autodoc]] Kosmos2_5Model
- forward

## Kosmos2_5ForConditionalGeneration

[[autodoc]] Kosmos2_5ForConditionalGeneration
- forward
2 changes: 2 additions & 0 deletions docs/source/en/perf_infer_gpu_one.md
@@ -53,6 +53,7 @@ FlashAttention-2 is currently supported for the following architectures:
* [Falcon](https://huggingface.co/docs/transformers/model_doc/falcon#transformers.FalconModel)
* [JetMoe](https://huggingface.co/docs/transformers/model_doc/jetmoe#transformers.JetMoeModel)
* [Jamba](https://huggingface.co/docs/transformers/model_doc/jamba#transformers.JambaModel)
* [Kosmos-2.5](https://huggingface.co/docs/transformers/model_doc/kosmos2_5#transformers.Kosmos2_5Model)
* [Llama](https://huggingface.co/docs/transformers/model_doc/llama#transformers.LlamaModel)
* [Llava](https://huggingface.co/docs/transformers/model_doc/llava)
* [Llava-NeXT](https://huggingface.co/docs/transformers/model_doc/llava_next)
@@ -209,6 +210,7 @@ For now, Transformers supports SDPA inference and training for the following arc
* [GPTNeoX](https://huggingface.co/docs/transformers/model_doc/gpt_neox#transformers.GPTNeoXModel)
* [JetMoe](https://huggingface.co/docs/transformers/model_doc/jetmoe#transformers.JetMoeModel)
* [Jamba](https://huggingface.co/docs/transformers/model_doc/jamba#transformers.JambaModel)
* [Kosmos-2.5](https://huggingface.co/docs/transformers/model_doc/kosmos2_5#transformers.Kosmos2_5Model)
* [Llama](https://huggingface.co/docs/transformers/model_doc/llama#transformers.LlamaModel)
* [OLMo](https://huggingface.co/docs/transformers/model_doc/olmo#transformers.OlmoModel)
* [PaliGemma](https://huggingface.co/docs/transformers/model_doc/paligemma#transformers.PaliGemmaForConditionalGeneration)
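
Models on these lists can be loaded with a specific attention backend through the `attn_implementation` argument of `from_pretrained`. The sketch below assumes that an importable `flash_attn` package is a good enough signal for FlashAttention-2 availability; `pick_attn_implementation` is a hypothetical helper, and the loading call is left commented out because it needs a GPU and downloads the checkpoint.

```python
import importlib.util


def pick_attn_implementation(prefer_flash: bool = True) -> str:
    """Return "flash_attention_2" if the flash_attn package is importable, else "sdpa"."""
    if prefer_flash and importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


attn = pick_attn_implementation()
print(attn)

# Usage with the model from this PR (requires a GPU and downloads weights):
# import torch
# from transformers import Kosmos2_5ForConditionalGeneration
# model = Kosmos2_5ForConditionalGeneration.from_pretrained(
#     "microsoft/kosmos-2.5", torch_dtype=torch.bfloat16, attn_implementation=attn
# )
```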
24 changes: 24 additions & 0 deletions src/transformers/__init__.py
@@ -486,6 +486,11 @@
"Kosmos2Config",
"Kosmos2Processor",
],
"models.kosmos2_5": [
"Kosmos2_5Config",
"Kosmos2_5ImageProcessor",
"Kosmos2_5Processor",
],
"models.layoutlm": [
"LayoutLMConfig",
"LayoutLMTokenizer",
@@ -1149,6 +1154,7 @@
_import_structure["models.idefics2"].extend(["Idefics2ImageProcessor"])
_import_structure["models.imagegpt"].extend(["ImageGPTFeatureExtractor", "ImageGPTImageProcessor"])
_import_structure["models.instructblipvideo"].extend(["InstructBlipVideoImageProcessor"])
_import_structure["models.kosmos2_5"].extend(["Kosmos2_5ImageProcessor", "Kosmos2_5Processor"])
Author

Where should I add it?

ydshieh marked this conversation as resolved.
_import_structure["models.layoutlmv2"].extend(["LayoutLMv2FeatureExtractor", "LayoutLMv2ImageProcessor"])
_import_structure["models.layoutlmv3"].extend(["LayoutLMv3FeatureExtractor", "LayoutLMv3ImageProcessor"])
_import_structure["models.levit"].extend(["LevitFeatureExtractor", "LevitImageProcessor"])
@@ -2372,6 +2378,13 @@
"Kosmos2PreTrainedModel",
]
)
_import_structure["models.kosmos2_5"].extend(
[
"Kosmos2_5ForConditionalGeneration",
"Kosmos2_5Model",
"Kosmos2_5PreTrainedModel",
]
)
_import_structure["models.layoutlm"].extend(
[
"LayoutLMForMaskedLM",
@@ -5129,6 +5142,11 @@
Kosmos2Config,
Kosmos2Processor,
)
from .models.kosmos2_5 import (
Kosmos2_5Config,
Kosmos2_5ImageProcessor,
Kosmos2_5Processor,
)
from .models.layoutlm import (
LayoutLMConfig,
LayoutLMTokenizer,
@@ -5821,6 +5839,7 @@
from .models.idefics2 import Idefics2ImageProcessor
from .models.imagegpt import ImageGPTFeatureExtractor, ImageGPTImageProcessor
from .models.instructblipvideo import InstructBlipVideoImageProcessor
from .models.kosmos2_5 import Kosmos2_5ImageProcessor, Kosmos2_5Processor
Author

where?

ydshieh marked this conversation as resolved.
from .models.layoutlmv2 import (
LayoutLMv2FeatureExtractor,
LayoutLMv2ImageProcessor,
@@ -6852,6 +6871,11 @@
Kosmos2Model,
Kosmos2PreTrainedModel,
)
from .models.kosmos2_5 import (
Kosmos2_5ForConditionalGeneration,
Kosmos2_5Model,
Kosmos2_5PreTrainedModel,
)
from .models.layoutlm import (
LayoutLMForMaskedLM,
LayoutLMForQuestionAnswering,
1 change: 1 addition & 0 deletions src/transformers/models/__init__.py
@@ -116,6 +116,7 @@
jamba,
jetmoe,
kosmos2,
kosmos2_5,
layoutlm,
layoutlmv2,
layoutlmv3,
3 changes: 3 additions & 0 deletions src/transformers/models/auto/configuration_auto.py
@@ -134,6 +134,7 @@
("jetmoe", "JetMoeConfig"),
("jukebox", "JukeboxConfig"),
("kosmos-2", "Kosmos2Config"),
("kosmos-2.5", "Kosmos2_5Config"),
("layoutlm", "LayoutLMConfig"),
("layoutlmv2", "LayoutLMv2Config"),
("layoutlmv3", "LayoutLMv3Config"),
@@ -413,6 +414,7 @@
("jetmoe", "JetMoe"),
("jukebox", "Jukebox"),
("kosmos-2", "KOSMOS-2"),
("kosmos-2.5", "KOSMOS-2.5"),
("layoutlm", "LayoutLM"),
("layoutlmv2", "LayoutLMv2"),
("layoutlmv3", "LayoutLMv3"),
@@ -628,6 +630,7 @@
("data2vec-vision", "data2vec"),
("donut-swin", "donut"),
("kosmos-2", "kosmos2"),
("kosmos-2.5", "kosmos2_5"),
("maskformer-swin", "maskformer"),
("xclip", "x_clip"),
("clip_vision_model", "clip"),
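
The last hunk above adds `("kosmos-2.5", "kosmos2_5")` to a table of special cases, next to `("kosmos-2", "kosmos2")`. Below is a simplified sketch of why such an entry is needed, assuming the default rule only swaps hyphens for underscores; `SPECIAL_CASES` and `module_name_for` are illustrative names, not the library's actual identifiers.

```python
# Illustrative subset of the special-case table shown in the diff above.
SPECIAL_CASES = {
    "kosmos-2": "kosmos2",
    "kosmos-2.5": "kosmos2_5",
    "xclip": "x_clip",
}


def module_name_for(model_type: str) -> str:
    """Map a model type string to a Python module name (simplified, assumed rule)."""
    if model_type in SPECIAL_CASES:
        return SPECIAL_CASES[model_type]
    # Assumed default: hyphens are not valid in module names, so replace them.
    return model_type.replace("-", "_")


print(module_name_for("kosmos-2.5"))  # special case: kosmos2_5
print("kosmos-2.5".replace("-", "_"))  # default rule alone: kosmos_2.5 (the dot makes it unusable as a module name)
```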
1 change: 1 addition & 0 deletions src/transformers/models/auto/image_processing_auto.py
@@ -91,6 +91,7 @@
("instructblip", ("BlipImageProcessor",)),
("instructblipvideo", ("InstructBlipVideoImageProcessor",)),
("kosmos-2", ("CLIPImageProcessor",)),
("kosmos-2.5", ("Kosmos2_5ImageProcessor",)),
("layoutlmv2", ("LayoutLMv2ImageProcessor",)),
("layoutlmv3", ("LayoutLMv3ImageProcessor",)),
("levit", ("LevitImageProcessor",)),
2 changes: 2 additions & 0 deletions src/transformers/models/auto/modeling_auto.py
@@ -129,6 +129,7 @@
("jetmoe", "JetMoeModel"),
("jukebox", "JukeboxModel"),
("kosmos-2", "Kosmos2Model"),
("kosmos-2.5", "Kosmos2_5Model"),
("layoutlm", "LayoutLMModel"),
("layoutlmv2", "LayoutLMv2Model"),
("layoutlmv3", "LayoutLMv3Model"),
@@ -702,6 +703,7 @@
("instructblip", "InstructBlipForConditionalGeneration"),
("instructblipvideo", "InstructBlipVideoForConditionalGeneration"),
("kosmos-2", "Kosmos2ForConditionalGeneration"),
("kosmos-2.5", "Kosmos2_5ForConditionalGeneration"),
("llava", "LlavaForConditionalGeneration"),
("llava-next-video", "LlavaNextVideoForConditionalGeneration"),
("llava_next", "LlavaNextForConditionalGeneration"),
1 change: 1 addition & 0 deletions src/transformers/models/auto/processing_auto.py
@@ -66,6 +66,7 @@
("instructblip", "InstructBlipProcessor"),
("instructblipvideo", "InstructBlipVideoProcessor"),
("kosmos-2", "Kosmos2Processor"),
("kosmos-2.5", "Kosmos2_5Processor"),
("layoutlmv2", "LayoutLMv2Processor"),
("layoutlmv3", "LayoutLMv3Processor"),
("llava", "LlavaProcessor"),
4 changes: 4 additions & 0 deletions src/transformers/models/auto/tokenization_auto.py
@@ -235,6 +235,10 @@
"XLMRobertaTokenizerFast" if is_tokenizers_available() else None,
),
),
(
"kosmos-2.5",
("PreTrainedTokenizerFast", None),
),
ydshieh marked this conversation as resolved.
("layoutlm", ("LayoutLMTokenizer", "LayoutLMTokenizerFast" if is_tokenizers_available() else None)),
("layoutlmv2", ("LayoutLMv2Tokenizer", "LayoutLMv2TokenizerFast" if is_tokenizers_available() else None)),
("layoutlmv3", ("LayoutLMv3Tokenizer", "LayoutLMv3TokenizerFast" if is_tokenizers_available() else None)),
64 changes: 64 additions & 0 deletions src/transformers/models/kosmos2_5/__init__.py
@@ -0,0 +1,64 @@
# coding=utf-8
# Copyright 2024 Microsoft Research and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import (
    OptionalDependencyNotAvailable,
    _LazyModule,
    is_torch_available,
    is_vision_available,
)


_import_structure = {
    "configuration_kosmos2_5": ["Kosmos2_5Config"],
    "image_processing_kosmos2_5": ["Kosmos2_5ImageProcessor"],
    "processing_kosmos2_5": ["Kosmos2_5Processor"],
}

try:
    if not is_torch_available():
        raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
    pass
else:
    _import_structure["modeling_kosmos2_5"] = [
        "Kosmos2_5ForConditionalGeneration",
        "Kosmos2_5Model",
        "Kosmos2_5PreTrainedModel",
    ]


if TYPE_CHECKING:
    from .configuration_kosmos2_5 import Kosmos2_5Config
    from .image_processing_kosmos2_5 import Kosmos2_5ImageProcessor
    from .processing_kosmos2_5 import Kosmos2_5Processor

    try:
        if not is_torch_available():
            raise OptionalDependencyNotAvailable()
    except OptionalDependencyNotAvailable:
        pass
    else:
        from .modeling_kosmos2_5 import (
            Kosmos2_5ForConditionalGeneration,
            Kosmos2_5Model,
            Kosmos2_5PreTrainedModel,
        )

else:
    import sys

    sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure)
ydshieh marked this conversation as resolved.
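
The `__init__.py` above registers names in `_import_structure` and hands them to `_LazyModule`, so submodules are imported only when one of their names is first accessed. The sketch below illustrates that idea with the standard library only. It is not the actual `_LazyModule` implementation: the `LazyModule` class is hypothetical, and it maps names to stdlib modules purely for demonstration.

```python
import importlib
import types


class LazyModule(types.ModuleType):
    """Resolve attributes on first access from an import structure (illustrative only)."""

    def __init__(self, name, import_structure):
        super().__init__(name)
        # Invert {module: [names]} into {name: module} for quick lookup.
        self._name_to_module = {
            attr: module for module, attrs in import_structure.items() for attr in attrs
        }

    def __getattr__(self, attr):
        # Called only when normal lookup fails, i.e. before `attr` has been resolved and cached.
        if attr.startswith("_") or attr not in self._name_to_module:
            raise AttributeError(f"module {self.__name__!r} has no attribute {attr!r}")
        module = importlib.import_module(self._name_to_module[attr])
        value = getattr(module, attr)
        setattr(self, attr, value)  # cache so later lookups skip __getattr__
        return value


# Same shape as `_import_structure`: module name -> list of exported names.
lazy = LazyModule("demo", {"math": ["sqrt"], "json": ["dumps"]})
print(lazy.sqrt(9.0))  # resolved on first access: 3.0
```

Caching the resolved attribute on the module object means the import cost is paid once per name, which is the main reason for using this pattern in a package with many optional dependencies.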