
feat: [Experimental] New VLM Pipeline leveraging vision models #708

Closed · wants to merge 34 commits from mly/smol-docling-integration

Conversation

@maxmnemonic (Contributor) commented Jan 8, 2025

Preliminary integration with the SmolDocling model and a new VLM Pipeline:

  • SmolDocling inference model
  • New VLM Pipeline that uses the SmolDocling model
  • Assembly code that builds a DoclingDocument from the DocTags format predicted by SmolDocling
  • Usage example (see the sketch below)
  • Rudimentary speed-measurement logging
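
A minimal usage sketch, assuming the pipeline is exposed as VlmPipeline and selected through PdfFormatOption; the names follow the later public Docling API and may differ from the exact code in this PR, and report.pdf is a placeholder input:

```python
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

# Route PDF inputs through the VLM pipeline instead of the default PDF pipeline.
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_cls=VlmPipeline)}
)

result = converter.convert("report.pdf")  # placeholder input file
doc = result.document                     # a DoclingDocument
print(doc.export_to_markdown())
```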

Checklist:

  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.


mergify bot commented Jan 8, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
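
For illustration, the rule can be checked locally with Python's re module, assuming mergify's ~= operator behaves like a standard regex match:

```python
import re

# Conventional-commit rule enforced by the mergify bot above.
TITLE_RE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)

print(bool(TITLE_RE.match("feat: [Experimental] New VLM Pipeline leveraging vision models")))  # True
print(bool(TITLE_RE.match("WIP: Integration of SmolDocling")))  # False, hence the later renames
```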

@maxmnemonic changed the title from "WIP: Integration of SmolDocling pipeline" to "WIP: Integration of SmolDocling" on Jan 8, 2025
@maxmnemonic force-pushed the mly/smol-docling-integration branch from e4a60ae to 48faf18 on January 8, 2025 at 15:12
@cau-git changed the title from "WIP: Integration of SmolDocling" to "feat: [WIP] Integration of SmolDocling" on Jan 10, 2025
@maxmnemonic force-pushed the mly/smol-docling-integration branch from 64e854e to 354c90a on January 16, 2025 at 15:52
@maxmnemonic (Contributor, Author) commented:

Merged all the proposed changes.

@maxmnemonic marked this pull request as ready for review on February 12, 2025 at 17:55
@maxmnemonic changed the title from "feat: [WIP] Integration of SmolDocling" to "feat: Integration of SmolDocling" on Feb 12, 2025
@dolfim-ibm (Contributor) commented:

Initial comments submitted

@dolfim-ibm (Contributor) commented:

I'm summarizing the target of this PR here; I will submit code proposals later.

VlmPipeline

Specs of the new pipeline

  • Input: (PDF) Document
  • Processing: using a vision language model
  • Output: DoclingDocument

Implementations

SmolDocling

Here the model will produce accurate DocTags which are converted (in the assemble step) to a DoclingDocument.
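
A hedged sketch of that assemble step, assuming the DocTags helpers that later landed in docling-core (DocTagsDocument and DoclingDocument.load_from_doctags); the PR may implement the conversion with its own assembly code, and the DocTags string below is illustrative, not verbatim model output:

```python
from PIL import Image

from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument

# Illustrative DocTags prediction for a single page.
doctags = "<doctag><text><loc_10><loc_10><loc_90><loc_20>Hello world</text></doctag>"
page_image = Image.new("RGB", (100, 100), "white")  # stand-in for the real page image

# Pair the predicted tags with their page image, then assemble the document.
tags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [page_image])
doc = DoclingDocument.load_from_doctags(tags_doc, document_name="sample")
print(doc.export_to_markdown())
```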

Other DocTags models

In the future we expect more models producing DocTags, which would go through the same assembling step as SmolDocling.

Other intermediate outputs

The pipeline will also support VLMs that produce a different intermediate representation. For example, for models producing Markdown output, we internally reuse the Markdown backend to create the DoclingDocument.
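
A sketch of that fallback path, assuming the existing MarkdownDocumentBackend can be fed the raw VLM output as an in-memory stream; the actual wiring inside the pipeline may differ:

```python
from io import BytesIO

from docling.backend.md_backend import MarkdownDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.document import InputDocument

md_from_vlm = "# Title\n\nA paragraph predicted by the VLM."  # illustrative output
stream = BytesIO(md_from_vlm.encode("utf-8"))

# Wrap the raw Markdown as an in-memory input document and reuse the MD backend.
in_doc = InputDocument(
    path_or_stream=stream,
    format=InputFormat.MD,
    backend=MarkdownDocumentBackend,
    filename="page.md",
)
backend = MarkdownDocumentBackend(in_doc=in_doc, path_or_stream=stream)
doc = backend.convert()  # a DoclingDocument
```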

Wrap up

We definitely don't have to implement more than what is already nicely done in the PR, but some of the naming (especially in the options) could be tuned to be ready for the next steps.

My suggestion is to use vlm_options as the discriminator which, in the future, will decide things like 1) which model to call, and 2) which type of internal assembly to run.

I would at least introduce the kind field in the options from the beginning (sketched below).
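
A sketch of what that could look like, assuming pydantic models with a literal kind field acting as the discriminator; all class and field names here are illustrative, not the final API:

```python
from typing import Literal, Union

from pydantic import BaseModel, Field

class SmolDoclingVlmOptions(BaseModel):
    kind: Literal["smoldocling"] = "smoldocling"
    repo_id: str = "ds4sd/SmolDocling-256M-preview"  # assumed default checkpoint
    question: str = "Convert this page to docling."

class MarkdownVlmOptions(BaseModel):
    kind: Literal["markdown_vlm"] = "markdown_vlm"  # hypothetical future kind
    repo_id: str
    prompt: str

class VlmPipelineOptions(BaseModel):
    # `kind` decides which model to call and which assemble step to run.
    vlm_options: Union[SmolDoclingVlmOptions, MarkdownVlmOptions] = Field(
        default_factory=SmolDoclingVlmOptions, discriminator="kind"
    )
```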

```
@@ -229,6 +229,13 @@ def repo_cache_folder(self) -> str:
     )


 class SmolDoclingOptions(BaseModel):
     question: str = "Convert this page to docling."  # "Perform Layout Analysis."
```
Contributor commented:
I'm reading this as a wish for experimenting. What I'm really asking is whether experimenting wouldn't also imply using a different model?

Meaning, wouldn't it be better to expose

  • either both question and repo_id,
  • or neither of those?

@maxmnemonic (Contributor, Author) replied:

The same model can be instructed differently to perform different tasks; the default option is conversion to Docling DocTags.

Contributor replied:

This currently allows a free-form question, but only a few very specific prompts will produce output that the pipeline can process.

If we ever consider making a nice wrapper for running interesting prompts with SmolDocling, I think it would deserve its own place in docling-ibm-models, not in the model class for the processing pipeline.

Contributor replied:

We should either allow both question and repo_id, or neither of the two.

@dolfim-ibm changed the title from "feat: Integration of SmolDocling" to "feat: New VLM Pipeline leveraging vision models" on Feb 14, 2025
cau-git and others added 11 commits February 24, 2025 11:46
Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Maksym Lysak <[email protected]>
…e assembly code, example included.

Signed-off-by: Maksym Lysak <[email protected]>
…s in VLM pipeline. This enables correct figure extraction and page numbers in provenances

Signed-off-by: Maksym Lysak <[email protected]>
…easurement in smol_docling models

Signed-off-by: Maksym Lysak <[email protected]>
Maksym Lysak and others added 19 commits February 24, 2025 13:12
…query and artifacts path) via client code, see example in minimal_smol_docling. Provisioning for other potential vlm all-in-one models.

Signed-off-by: Maksym Lysak <[email protected]>
…ng of doctags, updated logging

Signed-off-by: Maksym Lysak <[email protected]>
… provenance definition for all elements

Signed-off-by: Maksym Lysak <[email protected]>
Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Maksym Lysak <[email protected]>
…lated VLM pipeline option, few other minor things

Signed-off-by: Maksym Lysak <[email protected]>
…recated in the pipelines)

Signed-off-by: Maksym Lysak <[email protected]>
@maxmnemonic force-pushed the mly/smol-docling-integration branch from 61fce90 to a7a1f32 on February 24, 2025 at 13:45
Maksym Lysak added 4 commits February 24, 2025 15:13
Signed-off-by: Maksym Lysak <[email protected]>
Signed-off-by: Maksym Lysak <[email protected]>
@dolfim-ibm changed the title from "feat: New VLM Pipeline leveraging vision models" to "feat: [Experimental] New VLM Pipeline leveraging vision models" on Feb 25, 2025
@cau-git (Contributor) commented Feb 26, 2025

This was merged in a derived PR.

@cau-git closed this on Feb 26, 2025