feat: Code and equation model for PDF and code blocks in markdown #752

Merged
merged 19 commits into main from mao1/code_equation_model
Jan 24, 2025

Conversation

Matteo-Omenetti
Contributor

@Matteo-Omenetti Matteo-Omenetti commented Jan 15, 2025

  • Use the new add_code() method in the markdown backend (with typing fixes)
  • Add the new Code and Formula model for PDFs


mergify bot commented Jan 15, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

🟢 Require two reviewer for test updates

Wonderful, this rule succeeded.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2
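
The conventional-commit rule above can be tried locally against the titles this pull request went through. This is only a sketch: it assumes the `~=` operator behaves like Python's `re.search`, and the helper name `is_conventional` is made up for illustration.

```python
import re

# Pattern copied verbatim from the "Enforce conventional commit" rule.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    # Assumption: the rule is a regex search anchored by the leading ^.
    return CONVENTIONAL_TITLE.search(title) is not None


# Titles taken from the rename events in this conversation.
titles = [
    "Mao1/code equation model",
    "feat: code equation model",
    "feat: Code and equation model for PDF and code blocks in markdown",
]
results = {title: is_conventional(title) for title in titles}

for title, ok in results.items():
    print(f"{'pass' if ok else 'fail'}: {title}")
```

Under that assumption, only the original branch-style title fails the check, which is consistent with the rename to a `feat:` title on Jan 15.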

@cau-git cau-git changed the title Mao1/code equation model feat: code equation model Jan 15, 2025
@cau-git cau-git changed the title feat: code equation model feat: Preparation for code equation model Jan 15, 2025
@Matteo-Omenetti Matteo-Omenetti force-pushed the mao1/code_equation_model branch from 156d38b to aa221c7 Compare January 15, 2025 09:24
Matteo-Omenetti and others added 4 commits January 21, 2025 09:36
@Matteo-Omenetti Matteo-Omenetti force-pushed the mao1/code_equation_model branch from fe04026 to bfccc6e Compare January 21, 2025 14:38
@dolfim-ibm dolfim-ibm changed the title feat: Preparation for code equation model feat: Code and equation model for PDF and code blocks in markdown Jan 24, 2025
@dolfim-ibm dolfim-ibm marked this pull request as ready for review January 24, 2025 15:24
Contributor

@dolfim-ibm dolfim-ibm left a comment

Great!

Contributor

@PeterStaar-IBM PeterStaar-IBM left a comment

LGTM!

@dolfim-ibm dolfim-ibm merged commit 3213b24 into main Jan 24, 2025
9 checks passed
@dolfim-ibm dolfim-ibm deleted the mao1/code_equation_model branch January 24, 2025 15:54
vancura pushed a commit to vancura/docling that referenced this pull request Feb 6, 2025
…4SD#752)

* propagated changes for new CodeItem class

Signed-off-by: Matteo Omenetti <[email protected]>

* Rebased branch on latest main. changes for CodeItem

Signed-off-by: Matteo Omenetti <[email protected]>

* removed unused files

Signed-off-by: Matteo Omenetti <[email protected]>

* chore: update lockfile

Signed-off-by: Christoph Auer <[email protected]>

* pin latest docling-core

Signed-off-by: Michele Dolfi <[email protected]>

* update docling-core pinning

Signed-off-by: Michele Dolfi <[email protected]>

* pin docling-core

Signed-off-by: Michele Dolfi <[email protected]>

* use new add_code in backends and update typing in MD backend

Signed-off-by: Michele Dolfi <[email protected]>

* added if statement for backend

Signed-off-by: Matteo Omenetti <[email protected]>

* removed unused import

Signed-off-by: Matteo Omenetti <[email protected]>

* removed print statements

Signed-off-by: Matteo Omenetti <[email protected]>

* gt for new pdf

Signed-off-by: Matteo Omenetti <[email protected]>

* Update docling/pipeline/standard_pdf_pipeline.py

Co-authored-by: Michele Dolfi <[email protected]>
Signed-off-by: Matteo <[email protected]>

* fixed doc comment of __call__ function of code_formula_model

Signed-off-by: Matteo Omenetti <[email protected]>

* fix artifacts_path type

Signed-off-by: Michele Dolfi <[email protected]>

* move imports

Signed-off-by: Michele Dolfi <[email protected]>

* move expansion_factor to base class

Signed-off-by: Michele Dolfi <[email protected]>

---------

Signed-off-by: Matteo Omenetti <[email protected]>
Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Michele Dolfi <[email protected]>
Signed-off-by: Matteo <[email protected]>
Co-authored-by: Christoph Auer <[email protected]>
Co-authored-by: Michele Dolfi <[email protected]>
Co-authored-by: Michele Dolfi <[email protected]>
Signed-off-by: Václav Vančura <[email protected]>
@ShayanTalaei

Thanks for your awesome work! Could you please provide an example/detailed documentation of how to use this feature?

@Matteo-Omenetti
Contributor Author

Hello @ShayanTalaei
The code and formula models are turned off by default. To turn them on, set the corresponding flags on a PdfPipelineOptions object and then pass that object to the DocumentConverter.

from docling.datamodel.pipeline_options import PdfPipelineOptions

pipeline_options = PdfPipelineOptions()
pipeline_options.generate_page_images = True
pipeline_options.do_code_enrichment = True
pipeline_options.do_formula_enrichment = True
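
The snippet above stops before the options reach the converter. A minimal sketch of that last step follows. The three flags come from the comment itself; `InputFormat`, `PdfFormatOption`, the `format_options` argument, `convert()`, and `export_to_markdown()` are assumptions based on docling's usual usage pattern, and `build_converter` and `example.pdf` are hypothetical names.

```python
from typing import Dict

# Flags named in the comment above; both enrichments are off by default.
ENRICHMENT_FLAGS: Dict[str, bool] = {
    "generate_page_images": True,
    "do_code_enrichment": True,
    "do_formula_enrichment": True,
}


def build_converter(flags: Dict[str, bool] = ENRICHMENT_FLAGS):
    """Build a DocumentConverter with code and formula enrichment enabled."""
    # Imported lazily so this module still loads where docling is not installed.
    # Assumption: these import paths match the installed docling version.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions()
    for name, value in flags.items():
        setattr(pipeline_options, name, value)

    # Attach the options to the PDF format only.
    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )


if __name__ == "__main__":
    converter = build_converter()
    result = converter.convert("example.pdf")  # hypothetical input path
    print(result.document.export_to_markdown())
```

Only the PDF format option is overridden here, since the comment describes the enrichment flags as PDF pipeline settings.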

5 participants