Add ColPali to 🤗 transformers #33736

Merged on Dec 17, 2024 (137 commits)

Commits
0f5c6a7
feat: run `add-new-model-like`
tonywu71 Sep 18, 2024
726f156
feat: add paligemma code with "copied from"
tonywu71 Sep 19, 2024
9a88bf1
feat: add ColPaliProcessor
tonywu71 Sep 19, 2024
fab4e46
feat: add ColPaliModel
tonywu71 Sep 19, 2024
66656f6
feat: add ColPaliConfig
tonywu71 Sep 19, 2024
a377d60
feat: rename `ColPaliForConditionalGeneration` to `ColPaliModel`
tonywu71 Sep 19, 2024
0addcab
fixup modeling colpali
tonywu71 Sep 19, 2024
e8979b9
fix: fix root import shortcuts
tonywu71 Sep 19, 2024
49fb8ba
fix: fix `modeling_auto` dict
tonywu71 Sep 19, 2024
88b0212
feat: comment out ColPali test file
tonywu71 Sep 19, 2024
cbd781b
fix: fix typos from `add-new-model-like`
tonywu71 Sep 19, 2024
44fcd04
feat: explicit the forward input args
tonywu71 Sep 26, 2024
a6ca45a
feat: move everything to `modular_colpali.py`
tonywu71 Sep 26, 2024
af9ca36
fix: put back ColPaliProcesor
tonywu71 Sep 26, 2024
087870b
feat: add auto-generated files
tonywu71 Sep 26, 2024
cc11ef8
fix: run `fix-copies`
tonywu71 Sep 26, 2024
f69ee9b
fix: remove DOCStRING constants to make modular converter work
tonywu71 Sep 26, 2024
fbe5665
fix: fix typo + modular converter
tonywu71 Sep 26, 2024
e58794c
fix: add missing imports
tonywu71 Sep 26, 2024
2dd5218
feat: no more errors when loading ColPaliModel
tonywu71 Sep 26, 2024
e05ea43
fix: remove unused args in forward + tweak doc
tonywu71 Sep 26, 2024
bda6916
feat: rename `ColPaliModel` to `ColPaliForRetrieval`
tonywu71 Sep 26, 2024
bfff564
fix: apply `fix-copies`
tonywu71 Sep 26, 2024
da4c566
feat: add ColPaliProcessor to `modular_colpali`
tonywu71 Sep 26, 2024
ae37f18
fix: run make quality + make style
tonywu71 Sep 26, 2024
38f0d8c
fix: remove duplicate line in configuration_auto
tonywu71 Sep 27, 2024
c63a302
feat: make ColPaliModel inehrit from PaliGemmaForConditionalGeneration
tonywu71 Sep 27, 2024
d66606e
fix: tweak and use ColPaliConfig
tonywu71 Sep 27, 2024
7f750d3
feat: rename `score` to `post_process_retrieval`
tonywu71 Sep 27, 2024
41dbbb8
build: run modular formatter + make style
tonywu71 Sep 27, 2024
28592c9
feat: convert colpali weights + fixes
tonywu71 Sep 27, 2024
84763a3
feat: remove old weight converter file
tonywu71 Sep 27, 2024
672bdb2
feat: add and validate tests
tonywu71 Sep 27, 2024
f7ce9b1
feat: replace harcoded path to "vidore/colpali-v1.2-hf" in tests
tonywu71 Sep 27, 2024
3789a6e
fix: add bfloat16 conversion in weight converter
tonywu71 Sep 27, 2024
5e09645
feat: replace pytest with unittest in modeling colpali test
tonywu71 Sep 28, 2024
8ea8273
feat: add sanity check for weight conversion (doesn't work yet)
tonywu71 Sep 28, 2024
d100779
feat: add shape sanity check in weigth converter
tonywu71 Sep 28, 2024
e6bdf40
feat: make ColPaliProcessor args explicit
tonywu71 Sep 28, 2024
abe3232
doc: add doc for ColPali
tonywu71 Sep 28, 2024
6ae178c
fix: trying to fix output mismatch
tonywu71 Sep 28, 2024
6d35b27
feat: tweaks
tonywu71 Sep 28, 2024
0653340
fix: ColPaliModelOutput inherits from ModelOutput instead of PaliGemm…
tonywu71 Sep 30, 2024
97a6468
fix: address comments on PR
tonywu71 Oct 2, 2024
8212717
fix: adapt tests to the Hf norm
tonywu71 Oct 2, 2024
a7b297a
wip: try things
tonywu71 Oct 7, 2024
592e716
feat: add `__call__` method to `ColPaliProcessor`
tonywu71 Oct 13, 2024
f50a979
feat: remove need for dummy image in `process_queries`
tonywu71 Oct 13, 2024
25eb21b
build: run new modular converter
tonywu71 Oct 16, 2024
3ed7627
fix: fix incorrect method override
tonywu71 Oct 16, 2024
9038ead
Fix tests, processing, modular, convert
yonigozlan Oct 16, 2024
cb7e301
fix tokenization auto
yonigozlan Oct 16, 2024
3f118ca
hotfix: manually fix processor -> fixme once convert modular is fixed
tonywu71 Oct 19, 2024
3aa11a6
fix: convert weights working
tonywu71 Oct 19, 2024
8ff8962
feat: rename and improve convert weight script
tonywu71 Oct 19, 2024
7a54fec
feat: tweaks
tonywu71 Oct 19, 2024
2c94eaa
fest: remove `device` input for `post_process_retrieval`
tonywu71 Oct 21, 2024
2d7e96f
refactor: remove unused `get_torch_device`
tonywu71 Oct 21, 2024
1189340
Fix all tests
yonigozlan Oct 21, 2024
246b67e
docs: update ColPali model doc
tonywu71 Oct 21, 2024
4a5bc0c
wip: fix convert weights to hf
tonywu71 Oct 21, 2024
afbbc98
fix logging modular
yonigozlan Oct 21, 2024
9db013d
docs: add acknowledgements in model doc
tonywu71 Oct 22, 2024
c4e156c
docs: add missing docstring to ColPaliProcessor
tonywu71 Oct 25, 2024
0b4e089
docs: tweak
tonywu71 Oct 25, 2024
d6a0bde
docs: add doc for `ColPaliForRetrievalOutput.forward`
tonywu71 Oct 25, 2024
1f115f9
feat: add modifications from colpali-engine v0.3.2 in ColPaliProcessor
tonywu71 Oct 25, 2024
20d1927
fix: fix and upload colapli hf weights
tonywu71 Oct 29, 2024
5ef48fb
refactor: rename `post_process_retrieval` to `score_retrieval`
tonywu71 Oct 29, 2024
5ae2bac
fix: fix wrong typing for `score_retrieval`
tonywu71 Oct 29, 2024
ffe894a
test: add integration test for ColPali
tonywu71 Oct 29, 2024
b0e33be
chore: rerun convert modular
tonywu71 Oct 29, 2024
f052927
build: fix root imports
tonywu71 Oct 29, 2024
ad09d67
Update docs/source/en/index.md
tonywu71 Oct 30, 2024
0dd1524
fix: address PR comments
tonywu71 Oct 30, 2024
b647788
wip: reduce the prediction gap in weight conversion
tonywu71 Oct 30, 2024
153f339
docs: add comment in weight conversion script
tonywu71 Oct 30, 2024
97b3a24
docs: add example for `ColPaliForRetrieval.forward`
tonywu71 Oct 30, 2024
a711fa7
tests: change dataset path to the new one in hf-internal
tonywu71 Oct 30, 2024
e9035d9
fix: colpali weight conversion works
tonywu71 Oct 30, 2024
9f7299b
test: add fine-grained check for ColPali integration test
tonywu71 Oct 31, 2024
43274d2
fix: fix typos in convert weight script
tonywu71 Oct 31, 2024
f6e3155
docs: move input docstring in a variable
tonywu71 Oct 31, 2024
da03264
fix: remove hardcoded torch device in test
tonywu71 Oct 31, 2024
930f91a
fix: run the new modular refactor
tonywu71 Nov 1, 2024
db37344
docs: fix python example for ColPali
tonywu71 Nov 2, 2024
e72c379
feat: add option to choose `score_retrieval`'s output dtype and device
tonywu71 Nov 2, 2024
5b11870
docs: update doc for `score_retrieval`
tonywu71 Nov 8, 2024
c53ffcb
feat: add `patch_size` property in ColPali model
tonywu71 Nov 10, 2024
5346292
chore: run `make fix-copies`
tonywu71 Nov 15, 2024
b102738
docs: update description for ColPali cookbooks
tonywu71 Nov 18, 2024
1d24773
fix: remove `ignore_index` methods
tonywu71 Nov 20, 2024
73d607a
feat: remove non-transformers specific methods
tonywu71 Nov 20, 2024
1db4c6c
feat: update `__init__.py` to new hf format
tonywu71 Nov 20, 2024
da05b70
fix: fix root imports in transformers
tonywu71 Nov 20, 2024
1b4f8f3
feat: remove ColPali's inheritance from PaliGemma
tonywu71 Nov 20, 2024
f100888
Fix CI issues
yonigozlan Nov 21, 2024
38210dc
nit remove prints
yonigozlan Nov 21, 2024
aee8d7c
feat: remove ColPali config and model from `modular_colpali.py`
tonywu71 Nov 24, 2024
f53ae20
feat: add `ColPaliPreTrainedModel` and update modeling and configurat…
tonywu71 Nov 24, 2024
b93c76b
fix: fix auto-removed imports in root `__init__.py`
tonywu71 Nov 24, 2024
87a16fd
fix: various fixes
tonywu71 Nov 24, 2024
fba3b77
fix: fix `_init_weight`
tonywu71 Nov 24, 2024
1e6c4ab
temp: comment `AutoModel.from_config` for experiments
tonywu71 Nov 24, 2024
6d20088
fix: add missing `output_attentions` arg in ColPali's forward
tonywu71 Nov 24, 2024
be6a0bd
fix: fix `resize_token_embeddings`
tonywu71 Nov 24, 2024
ecc7982
fix: make `input_ids` optional in forward
tonywu71 Nov 24, 2024
b1a25ce
feat: rename `projection_layer` to `embedding_proj_layer`
tonywu71 Nov 24, 2024
84fefad
wip: fix convert colpali weight script
tonywu71 Nov 24, 2024
836dc97
fix tests and convert weights from original repo
yonigozlan Nov 26, 2024
1eaa3d3
fix unprotected import
yonigozlan Nov 26, 2024
f187bc0
fix unprotected torch import
yonigozlan Nov 26, 2024
3646790
fix style
yonigozlan Nov 26, 2024
c8efb8a
change vlm_backbone_config to vlm_config
yonigozlan Nov 26, 2024
a30a74d
fix unprotected import in modular this time
yonigozlan Nov 26, 2024
c42c61b
fix: load config from Hub + tweaks in convert weight script
tonywu71 Nov 28, 2024
e981b71
docs: move example usage from model docstring to model markdown
tonywu71 Nov 28, 2024
2ce28f5
docs: fix input docstring for ColPali's forward method
tonywu71 Nov 28, 2024
a582f48
fix: use `sub_configs` for ColPaliConfig
tonywu71 Nov 28, 2024
9f34d80
fix: remove non-needed sanity checks in weight conversion script + tw…
tonywu71 Nov 28, 2024
05c29da
fix: fix issue with `replace_return_docstrings` in ColPali's `forward`
tonywu71 Nov 28, 2024
f67e217
docs: update docstring for `ColPaliConfig`
tonywu71 Nov 28, 2024
2ed868c
test: change model path in ColPali test
tonywu71 Nov 28, 2024
2aa5e9d
fix: fix ColPaliConfig
tonywu71 Nov 28, 2024
e6944ad
fix: fix weight conversion script
tonywu71 Nov 28, 2024
337a0a0
test: fix expected weights for ColPali model
tonywu71 Nov 28, 2024
c10e760
docs: update ColPali markdown
tonywu71 Nov 28, 2024
69d01fc
docs: fix minor typo in ColPaliProcessor
tonywu71 Nov 28, 2024
8061469
Fix tests and add _no_split_modules
yonigozlan Nov 29, 2024
7dce43f
add text_config to colpali config
yonigozlan Dec 5, 2024
855f139
[run slow] colpali
yonigozlan Dec 5, 2024
603e9e4
Merge branch 'main' into add-colpali
yonigozlan Dec 5, 2024
c41bad4
move inputs to torch_device in integration test
yonigozlan Dec 5, 2024
21c1309
skip test_model_parallelism
yonigozlan Dec 5, 2024
505ad9e
docs: clarify quickstart snippet in ColPali's model card
tonywu71 Dec 9, 2024
655bac7
docs: update ColPali's model card
tonywu71 Dec 10, 2024
e9af3a5
Merge remote-tracking branch 'upstream/main' into add-colpali
yonigozlan Dec 16, 2024
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -834,6 +834,8 @@
title: CLIPSeg
- local: model_doc/clvp
title: CLVP
- local: model_doc/colpali
title: ColPali
- local: model_doc/data2vec
title: Data2Vec
- local: model_doc/deplot
1 change: 1 addition & 0 deletions docs/source/en/index.md
@@ -100,6 +100,7 @@ Flax), PyTorch, and/or TensorFlow.
| [CodeLlama](model_doc/code_llama) | ✅ | ❌ | ✅ |
| [Cohere](model_doc/cohere) | ✅ | ❌ | ❌ |
| [Cohere2](model_doc/cohere2) | ✅ | ❌ | ❌ |
| [ColPali](model_doc/colpali) | ✅ | ❌ | ❌ |
| [Conditional DETR](model_doc/conditional_detr) | βœ… | ❌ | ❌ |
| [ConvBERT](model_doc/convbert) | βœ… | βœ… | ❌ |
| [ConvNeXT](model_doc/convnext) | βœ… | βœ… | ❌ |
95 changes: 95 additions & 0 deletions docs/source/en/model_doc/colpali.md
@@ -0,0 +1,95 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# ColPali

## Overview

The ColPali model was proposed in [ColPali: Efficient Document Retrieval with Vision Language Models](https://doi.org/10.48550/arXiv.2407.01449) by **Manuel Faysse***, **Hugues Sibille***, **Tony Wu***, Bilel Omrani, Gautier Viaud, Céline Hudelot, Pierre Colombo (* denotes equal contribution).

With our new model *ColPali*, we propose to leverage VLMs to construct efficient multi-vector embeddings in the visual space for document retrieval. By feeding the ViT output patches from PaliGemma-3B to a linear projection, we create a multi-vector representation of documents. We train the model to maximize the similarity between these document embeddings and the query embeddings, following the ColBERT method.
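
The late-interaction scoring mentioned above can be illustrated with a small, dependency-free sketch. It assumes the ColBERT-style "MaxSim" rule: each query token vector is matched to its best-scoring document patch vector, and the maxima are summed. The function names and the toy 2-dimensional vectors are made up for illustration; this is not the library's implementation.

```python
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def late_interaction_score(query_vecs: List[Vector], doc_vecs: List[Vector]) -> float:
    """MaxSim: for each query vector keep the best-matching document vector,
    then sum those maxima over all query vectors."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy embeddings (illustrative values only, not real model outputs).
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
page_b = [[0.1, 0.1], [0.3, 0.2]]

score_a = late_interaction_score(query, page_a)
score_b = late_interaction_score(query, page_b)
```

In this toy setup `page_a` scores higher than `page_b` because each query vector finds a closely aligned patch vector in it. Real ColPali embeddings have many more vectors per page and a larger embedding dimension.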

Using ColPali removes the need for potentially complex and brittle layout recognition and OCR pipelines with a single model that can take into account both the textual and visual content (layout, charts, ...) of a document. ColPali is also highly interpretable: similarity maps can be obtained between patches and query tokens. These maps highlight ColPali’s strong OCR capabilities and chart understanding.

**Paper abstract:**

> Documents are visually rich structures that convey information through text, but also figures, page layouts, tables, or even fonts. Since modern retrieval systems mainly rely on the textual information they extract from document pages to index documents -often through lengthy and brittle processes-, they struggle to exploit key visual cues efficiently. This limits their capabilities in many practical document retrieval applications such as Retrieval Augmented Generation (RAG). To benchmark current systems on visually rich document retrieval, we introduce the Visual Document Retrieval Benchmark *ViDoRe*, composed of various page-level retrieval tasks spanning multiple domains, languages, and practical settings. The inherent complexity and performance shortcomings of modern systems motivate a new concept; doing document retrieval by directly embedding the images of the document pages. We release *ColPali*, a Vision Language Model trained to produce high-quality multi-vector embeddings from images of document pages. Combined with a late interaction matching mechanism, *ColPali* largely outperforms modern document retrieval pipelines while being drastically simpler, faster and end-to-end trainable.
>
> We release models, data, code and benchmarks under open licenses at [https://huggingface.co/vidore](https://huggingface.co/vidore).

## Resources

- The official blog post detailing ColPali can be found [here](https://huggingface.co/blog/manu/colpali). 📝
- The original model implementation code for the ColPali model and for the `colpali-engine` package can be found [here](https://github.com/illuin-tech/colpali). 🌎
- Cookbooks for learning to use the transformers-native version of ColPali, fine-tuning, and similarity maps generation can be found [here](https://github.com/tonywu71/colpali-cookbooks). 📚

This model was contributed by [@tonywu71](https://huggingface.co/tonywu71) and [@yonigozlan](https://huggingface.co/yonigozlan).

## Usage

This example demonstrates how to use ColPali to embed both queries and images, calculate their similarity scores, and identify the most relevant matches. For a specific query, you can retrieve the top-k most similar images by selecting the ones with the highest similarity scores.

```python
import torch
from PIL import Image

from transformers import ColPaliForRetrieval, ColPaliProcessor

model_name = "vidore/colpali-v1.2-hf"

model = ColPaliForRetrieval.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",  # or "mps" if on Apple Silicon
).eval()

processor = ColPaliProcessor.from_pretrained(model_name)

# Your inputs (replace dummy images with screenshots of your documents)
images = [
    Image.new("RGB", (32, 32), color="white"),
    Image.new("RGB", (16, 16), color="black"),
]
queries = [
    "What is the organizational structure for our R&D department?",
    "Can you provide a breakdown of last year’s financial performance?",
]

# Process the inputs
batch_images = processor(images=images).to(model.device)
batch_queries = processor(text=queries).to(model.device)

# Forward pass
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

# Score the queries against the images
scores = processor.score_retrieval(query_embeddings, image_embeddings)
```
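
The top-k selection described at the start of this section can be sketched as follows. The sketch assumes the scores form a matrix with one row per query and one column per image, which is an assumption about the output layout rather than something shown in this diff. It uses a plain nested list; `top_k_per_query` and the score values are hypothetical.

```python
from typing import List, Tuple


def top_k_per_query(scores: List[List[float]], k: int) -> List[List[Tuple[int, float]]]:
    """For each query row, return the k (image_index, score) pairs with the highest scores."""
    results = []
    for row in scores:
        # Sort image indices by descending score and keep the first k.
        ranked = sorted(enumerate(row), key=lambda pair: pair[1], reverse=True)
        results.append(ranked[:k])
    return results


# Hypothetical score matrix: 2 queries x 3 images.
toy_scores = [
    [12.5, 3.0, 9.1],
    [1.2, 15.4, 7.7],
]
top2 = top_k_per_query(toy_scores, k=2)
```

If the scores are held in a `torch.Tensor`, converting with `.tolist()` first should give a nested list that this helper accepts.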

## ColPaliConfig

[[autodoc]] ColPaliConfig

## ColPaliProcessor

[[autodoc]] ColPaliProcessor

## ColPaliForRetrieval

[[autodoc]] ColPaliForRetrieval
- forward
20 changes: 20 additions & 0 deletions src/transformers/__init__.py
@@ -306,6 +306,10 @@
],
"models.cohere": ["CohereConfig"],
"models.cohere2": ["Cohere2Config"],
"models.colpali": [
"ColPaliConfig",
"ColPaliProcessor",
],
"models.conditional_detr": ["ConditionalDetrConfig"],
"models.convbert": [
"ConvBertConfig",
@@ -1468,6 +1472,7 @@
"MODEL_FOR_OBJECT_DETECTION_MAPPING",
"MODEL_FOR_PRETRAINING_MAPPING",
"MODEL_FOR_QUESTION_ANSWERING_MAPPING",
"MODEL_FOR_RETRIEVAL_MAPPING",
"MODEL_FOR_SEMANTIC_SEGMENTATION_MAPPING",
"MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING",
"MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING",
@@ -1789,6 +1794,12 @@
)
_import_structure["models.cohere"].extend(["CohereForCausalLM", "CohereModel", "CoherePreTrainedModel"])
_import_structure["models.cohere2"].extend(["Cohere2ForCausalLM", "Cohere2Model", "Cohere2PreTrainedModel"])
_import_structure["models.colpali"].extend(
[
"ColPaliForRetrieval",
"ColPaliPreTrainedModel",
]
)
_import_structure["models.conditional_detr"].extend(
[
"ConditionalDetrForObjectDetection",
@@ -5207,6 +5218,10 @@
)
from .models.cohere import CohereConfig
from .models.cohere2 import Cohere2Config
from .models.colpali import (
ColPaliConfig,
ColPaliProcessor,
)
from .models.conditional_detr import (
ConditionalDetrConfig,
)
@@ -6413,6 +6428,7 @@
MODEL_FOR_OBJECT_DETECTION_MAPPING,
MODEL_FOR_PRETRAINING_MAPPING,
MODEL_FOR_QUESTION_ANSWERING_MAPPING,
MODEL_FOR_RETRIEVAL_MAPPING,
MODEL_FOR_SEMANTIC_SEGMENTATION_MAPPING,
MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING,
MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING,
@@ -6689,6 +6705,10 @@
Cohere2Model,
Cohere2PreTrainedModel,
)
from .models.colpali import (
ColPaliForRetrieval,
ColPaliPreTrainedModel,
)
from .models.conditional_detr import (
ConditionalDetrForObjectDetection,
ConditionalDetrForSegmentation,
1 change: 1 addition & 0 deletions src/transformers/models/__init__.py
@@ -53,6 +53,7 @@
codegen,
cohere,
cohere2,
colpali,
conditional_detr,
convbert,
convnext,
2 changes: 2 additions & 0 deletions src/transformers/models/auto/__init__.py
@@ -74,6 +74,7 @@
"MODEL_FOR_UNIVERSAL_SEGMENTATION_MAPPING",
"MODEL_FOR_VIDEO_CLASSIFICATION_MAPPING",
"MODEL_FOR_VISION_2_SEQ_MAPPING",
"MODEL_FOR_RETRIEVAL_MAPPING",
"MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING",
"MODEL_FOR_VISUAL_QUESTION_ANSWERING_MAPPING",
"MODEL_MAPPING",
@@ -252,6 +253,7 @@
MODEL_FOR_OBJECT_DETECTION_MAPPING,
MODEL_FOR_PRETRAINING_MAPPING,
MODEL_FOR_QUESTION_ANSWERING_MAPPING,
MODEL_FOR_RETRIEVAL_MAPPING,
MODEL_FOR_SEMANTIC_SEGMENTATION_MAPPING,
MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING,
MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING,
2 changes: 2 additions & 0 deletions src/transformers/models/auto/configuration_auto.py
@@ -70,6 +70,7 @@
("codegen", "CodeGenConfig"),
("cohere", "CohereConfig"),
("cohere2", "Cohere2Config"),
("colpali", "ColPaliConfig"),
("conditional_detr", "ConditionalDetrConfig"),
("convbert", "ConvBertConfig"),
("convnext", "ConvNextConfig"),
@@ -373,6 +374,7 @@
("codegen", "CodeGen"),
("cohere", "Cohere"),
("cohere2", "Cohere2"),
("colpali", "ColPali"),
("conditional_detr", "Conditional DETR"),
("convbert", "ConvBERT"),
("convnext", "ConvNeXT"),
8 changes: 8 additions & 0 deletions src/transformers/models/auto/modeling_auto.py
@@ -306,6 +306,7 @@
("big_bird", "BigBirdForPreTraining"),
("bloom", "BloomForCausalLM"),
("camembert", "CamembertForMaskedLM"),
("colpali", "ColPaliForRetrieval"),
("ctrl", "CTRLLMHeadModel"),
("data2vec-text", "Data2VecTextForMaskedLM"),
("deberta", "DebertaForMaskedLM"),
@@ -775,6 +776,12 @@
]
)

MODEL_FOR_RETRIEVAL_MAPPING_NAMES = OrderedDict(
    [
        ("colpali", "ColPaliForRetrieval"),
    ]
)

MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES = OrderedDict(
[
("aria", "AriaForConditionalGeneration"),
@@ -1473,6 +1480,7 @@
MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING = _LazyAutoMapping(
CONFIG_MAPPING_NAMES, MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES
)
MODEL_FOR_RETRIEVAL_MAPPING = _LazyAutoMapping(CONFIG_MAPPING_NAMES, MODEL_FOR_RETRIEVAL_MAPPING_NAMES)
MODEL_FOR_VISUAL_QUESTION_ANSWERING_MAPPING = _LazyAutoMapping(
CONFIG_MAPPING_NAMES, MODEL_FOR_VISUAL_QUESTION_ANSWERING_MAPPING_NAMES
)
1 change: 1 addition & 0 deletions src/transformers/models/auto/processing_auto.py
@@ -58,6 +58,7 @@
("clip", "CLIPProcessor"),
("clipseg", "CLIPSegProcessor"),
("clvp", "ClvpProcessor"),
("colpali", "ColPaliProcessor"),
("flava", "FlavaProcessor"),
("fuyu", "FuyuProcessor"),
("git", "GitProcessor"),
1 change: 1 addition & 0 deletions src/transformers/models/auto/tokenization_auto.py
@@ -148,6 +148,7 @@
("codegen", ("CodeGenTokenizer", "CodeGenTokenizerFast" if is_tokenizers_available() else None)),
("cohere", (None, "CohereTokenizerFast" if is_tokenizers_available() else None)),
("cohere2", (None, "CohereTokenizerFast" if is_tokenizers_available() else None)),
("colpali", ("LlamaTokenizer", "LlamaTokenizerFast" if is_tokenizers_available() else None)),
("convbert", ("ConvBertTokenizer", "ConvBertTokenizerFast" if is_tokenizers_available() else None)),
(
"cpm",
28 changes: 28 additions & 0 deletions src/transformers/models/colpali/__init__.py
@@ -0,0 +1,28 @@
# Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import _LazyModule
from ...utils.import_utils import define_import_structure


if TYPE_CHECKING:
    from .configuration_colpali import *
    from .modeling_colpali import *
    from .processing_colpali import *
else:
    import sys

    _file = globals()["__file__"]
    sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)
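
The `__init__.py` above swaps the package for a lazily loading module object. The general idea can be sketched with the standard library alone. The `LazyModule` class below is a hypothetical stand-in written for illustration, not the `_LazyModule` used by transformers, and the attribute-to-module table is made up.

```python
import importlib
import types


class LazyModule(types.ModuleType):
    """Hypothetical stand-in: imports the module that owns an attribute on first access."""

    def __init__(self, name: str, attr_to_module: dict):
        super().__init__(name)
        self._attr_to_module = attr_to_module
        self.loaded = []  # names of source modules imported so far

    def __getattr__(self, attr: str):
        # Called only when normal lookup fails, i.e. before the value has been cached.
        if attr not in self._attr_to_module:
            raise AttributeError(f"module {self.__name__!r} has no attribute {attr!r}")
        module_name = self._attr_to_module[attr]
        module = importlib.import_module(module_name)
        self.loaded.append(module_name)
        value = getattr(module, attr)
        setattr(self, attr, value)  # cache so later lookups bypass __getattr__
        return value


lazy = LazyModule("lazy_demo", {"sqrt": "math", "dumps": "json"})
nothing_loaded_yet = list(lazy.loaded)
root = lazy.sqrt(16.0)
root_again = lazy.sqrt(9.0)  # served from the cached attribute
```

Nothing is imported until `lazy.sqrt` is first touched, and `json` is never imported because `dumps` is never accessed. Deferring imports this way keeps the cost of importing a large package low when only a few names are used.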
106 changes: 106 additions & 0 deletions src/transformers/models/colpali/configuration_colpali.py
@@ -0,0 +1,106 @@
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""ColPali model configuration"""

import logging
from copy import deepcopy

from ...configuration_utils import PretrainedConfig
from ..auto import CONFIG_MAPPING, AutoConfig


logger = logging.getLogger(__name__)


class ColPaliConfig(PretrainedConfig):
    r"""
    Configuration class to store the configuration of a [`ColPaliForRetrieval`]. It is used to instantiate an instance
    of `ColPaliForRetrieval` according to the specified arguments, defining the model architecture following the methodology
    from the "ColPali: Efficient Document Retrieval with Vision Language Models" paper.

    Creating a configuration with the default settings will result in a configuration where the VLM backbone is set to the
    default PaliGemma configuration, i.e. the one from [vidore/colpali-v1.2](https://huggingface.co/vidore/colpali-v1.2).

    The ColPali config is very similar to [`PaliGemmaConfig`], but with an extra attribute defining the embedding dimension.

    Note that contrary to what the class name suggests (actually the name refers to the ColPali **methodology**), you can
    use a different VLM backbone model than PaliGemma by passing the corresponding VLM configuration to the class constructor.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        vlm_config (`PretrainedConfig`, *optional*):
            Configuration of the VLM backbone model.
        text_config (`PretrainedConfig`, *optional*):
            Configuration of the text backbone model. Overrides the `text_config` attribute of the `vlm_config` if provided.
        embedding_dim (`int`, *optional*, defaults to 128):
            Dimension of the multi-vector embeddings produced by the model.

    Example:

    ```python
    from transformers.models.colpali import ColPaliConfig, ColPaliForRetrieval

    config = ColPaliConfig()
    model = ColPaliForRetrieval(config)
    ```
    """

    model_type = "colpali"
    sub_configs = {"vlm_config": PretrainedConfig, "text_config": AutoConfig}

    def __init__(
        self,
        vlm_config=None,
        text_config=None,
        embedding_dim: int = 128,
        **kwargs,
    ):
        if vlm_config is None:
            vlm_config = CONFIG_MAPPING["paligemma"]()
            logger.info(
                "`vlm_config` is `None`. Initializing `vlm_config` with the `PaliGemmaConfig` with default values."
            )
        elif isinstance(vlm_config, dict):
            vlm_config = deepcopy(vlm_config)
            if "model_type" not in vlm_config:
                raise KeyError(
                    "The `model_type` key is missing in the `vlm_config` dictionary. Please provide the model type."
                )
            elif vlm_config["model_type"] not in CONFIG_MAPPING:
                raise ValueError(
                    f"The model type `{vlm_config['model_type']}` is not supported. Please provide a valid model type."
                )
            vlm_config = CONFIG_MAPPING[vlm_config["model_type"]](**vlm_config)
        elif isinstance(vlm_config, PretrainedConfig):
            vlm_config = vlm_config
        else:
            raise TypeError(
                f"Invalid type for `vlm_config`. Expected `PretrainedConfig`, `dict`, or `None`, but got {type(vlm_config)}."
            )

        self.vlm_config = vlm_config
        self.text_config = text_config = text_config if text_config is not None else vlm_config.text_config
        if isinstance(self.text_config, dict):
            text_config["model_type"] = text_config["model_type"] if "model_type" in text_config else "gemma"
            self.text_config = CONFIG_MAPPING[text_config["model_type"]](**text_config)

        self.embedding_dim = embedding_dim

        super().__init__(**kwargs)


__all__ = ["ColPaliConfig"]
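
The constructor above accepts `vlm_config` as `None`, a `dict`, or a config object. That three-way dispatch can be tried in isolation with the sketch below. `BaseConfig`, `ToyBackboneConfig`, `REGISTRY` and `resolve_backbone_config` are hypothetical stand-ins for `PretrainedConfig`, a backbone config class, `CONFIG_MAPPING` and the branch logic respectively; none of them exist in transformers.

```python
from copy import deepcopy


class BaseConfig:
    """Hypothetical stand-in for a config base class."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs


class ToyBackboneConfig(BaseConfig):
    model_type = "toy_backbone"


# Hypothetical registry playing the role of a model-type -> config-class mapping.
REGISTRY = {"toy_backbone": ToyBackboneConfig}


def resolve_backbone_config(config=None):
    """Accept None, a dict with a `model_type` key, or a ready-made config object."""
    if config is None:
        return ToyBackboneConfig()  # fall back to the default backbone
    if isinstance(config, dict):
        config = deepcopy(config)  # avoid mutating the caller's dict
        if "model_type" not in config:
            raise KeyError("The `model_type` key is missing.")
        if config["model_type"] not in REGISTRY:
            raise ValueError(f"Unsupported model type: {config['model_type']}")
        return REGISTRY[config["model_type"]](**config)
    if isinstance(config, BaseConfig):
        return config  # already a config object, use as-is
    raise TypeError(f"Expected a config object, dict, or None, got {type(config)}.")


default_cfg = resolve_backbone_config()
from_dict = resolve_backbone_config({"model_type": "toy_backbone", "hidden_size": 8})
```

Validating the `model_type` key before the registry lookup gives callers a specific error for each failure mode instead of a bare `KeyError` from the mapping.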