feat(ocr): added support for PaddleOCR engine #393

Swaymaw · 2024-11-20T09:47:03Z

Added the PaddleOCR model as an OCR engine option.
Added options for configuring the PaddleOCR model during document conversion via pipeline options.
Updated documentation, added tests, and updated dependencies (extras) to reflect the added engine support.
Updated examples to demonstrate the use of PaddleOcrOptions.

This change lets users work with the PaddleOCR engine, which can provide higher accuracy and performance in use cases that involve complex PDF files.

Checklist:

Commit Message Formatting: Commit titles and messages follow the guidelines in Conventional Commits.
Documentation has been updated, if necessary.
Examples have been added, if necessary.
Tests have been added, if necessary.

mergify · 2024-11-20T10:01:39Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?:
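The title rule above can be tried locally before opening a pull request. The following is a minimal sketch, assuming that mergify's `~=` operator behaves like a regular-expression search; the helper name `is_conventional_title` is hypothetical and not part of any tool mentioned here.

```python
import re

# Pattern copied verbatim from the merge-protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?:"
)


def is_conventional_title(title: str) -> bool:
    """Return True when the title starts with a conventional-commit prefix.

    Hypothetical helper: it only mirrors the regex, not mergify itself.
    """
    return CONVENTIONAL_TITLE.search(title) is not None


# The title of this pull request passes the check.
print(is_conventional_title("feat(ocr): added support for PaddleOCR engine"))
# A plain title such as one of the commit titles in this PR does not.
print(is_conventional_title("Update README.md"))
```

The optional `(?:\(.+\))?` group is what allows a scope such as `(ocr)` between the type and the colon.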

docling/datamodel/pipeline_options.py

docling/models/paddle_ocr_model.py

dolfim-ibm · 2024-11-20T11:55:36Z

pyproject.toml

@@ -95,6 +95,7 @@ torchvision = [

 [tool.poetry.extras]
 tesserocr = ["tesserocr"]
+paddleocr = ["paddlepaddle", "paddleocr"]


When adding extras, the packages should also be listed in the main dependencies with the optional = true flag.
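To illustrate the review comment, here is a small sketch that checks whether every package named in an extra is also declared as an optional dependency. It works on the dict form of a `[tool.poetry]` table, as a TOML parser would return it. The `"*"` version constraints and the helper name `extras_missing_optional` are assumptions made for illustration, not taken from the actual pyproject.toml.

```python
from __future__ import annotations

# Dict form of a hypothetical [tool.poetry] table. In TOML the first entry
# would read:  paddlepaddle = { version = "*", optional = true }
# The "*" constraints are placeholders, not the project's real pins.
poetry_cfg = {
    "dependencies": {
        "tesserocr": {"version": "*", "optional": True},
        "paddlepaddle": {"version": "*", "optional": True},
        "paddleocr": {"version": "*", "optional": True},
    },
    "extras": {
        "tesserocr": ["tesserocr"],
        "paddleocr": ["paddlepaddle", "paddleocr"],
    },
}


def extras_missing_optional(cfg: dict) -> list[str]:
    """List packages used in extras that lack an optional=True declaration."""
    deps = cfg.get("dependencies", {})
    missing = []
    for packages in cfg.get("extras", {}).values():
        for name in packages:
            spec = deps.get(name)
            # A bare version string (or no entry at all) is not optional.
            if not (isinstance(spec, dict) and spec.get("optional") is True):
                missing.append(name)
    return missing


print(extras_missing_optional(poetry_cfg))
```

With the configuration above the result is an empty list; removing the two paddle entries from `dependencies` while keeping the extra would report both package names.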

Swaymaw · 2024-11-20

Hey, I am getting this numpy version issue. However, in my environment I have:

numpy == 1.26.4
deepsearch-glm == 0.26.1
paddlepaddle == 2.6.2
paddleocr == 2.9.1

The library seems to be working fine, and all the tests pass as well. Can you please help me resolve this?

dolfim-ibm · 2024-11-20

Which Python version?

Can you please try to rebase onto the latest main branch? We just merged something about numpy conflicts as well.

Swaymaw · 2024-11-20

These are the lines I added in pyproject.toml to add the necessary libraries for paddle_ocr:

This is the error I am getting:

I have rebased onto the latest main branch, but I am still facing the numpy version conflict between paddleocr and deepsearch-glm. For now, I have removed paddlepaddle and paddleocr from the requirements and the extras in pyproject.toml. A simple workaround is to install these libraries manually whenever someone wants to use the PaddleOCR engine. installation.md has instructions for this, and if someone tries to use the engine without the libraries installed, we raise an import error that includes the installation instructions.
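The workaround described above (a manual install plus an import error that carries instructions) can be sketched as follows. This is only an illustration of the pattern, not the code from this pull request: the helper name `require_optional` and the exact wording of the install hint are assumptions.

```python
import importlib


def require_optional(module_name: str, install_hint: str):
    """Import an optional dependency or fail with installation instructions.

    Hypothetical helper; the PR's actual model class may do this differently.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is not installed. Install it with: {install_hint}"
        ) from exc


# Assumed usage inside an OCR model's constructor, run only when the
# PaddleOCR engine is selected:
#   paddleocr = require_optional("paddleocr", "pip install paddlepaddle paddleocr")
```

Deferring the import to the point of use keeps the base install free of the numpy conflict, at the cost of surfacing a missing dependency only at runtime.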

docs/installation.md

Glider95 · 2024-11-22T00:15:17Z

Hello,

Would a RapidOCR implementation be possible too? (It is a wrapper of PaddleOCR and a lot easier to install.)

PeterStaar-IBM · 2024-11-22T08:37:28Z

> Hello,
>
> Would a RapidOCR implementation be possible too? (It is a wrapper of PaddleOCR and a lot easier to install.)

What is the added delta with RapidOCR compared to PaddleOCR?

Swaymaw · 2024-11-22T09:36:12Z

> > Hello,
> > Would a RapidOCR implementation be possible too? (It is a wrapper of PaddleOCR and a lot easier to install.)
>
> What is the added delta with RapidOCR compared to PaddleOCR?

It is just the poetry.lock file; nothing much has changed code-wise.

dolfim-ibm · 2024-11-25T07:56:24Z

@Swaymaw we will check if adding the packages as extras works for us. Meanwhile, can you please make sure to add those manual dependencies in the CI tests?

ezscode · 2024-11-26T08:22:34Z

Hope this merges successfully!

dolfim-ibm · 2024-11-26T09:04:07Z

> Hope this merges successfully!

@ezscode I think this PR will be superseded by #415

Swaymaw · 2024-11-26T09:33:39Z

@dolfim-ibm Should I close this pull request to avoid any confusion?

dolfim-ibm · 2024-11-26T11:17:49Z

> @dolfim-ibm Should I close this pull request to avoid any confusion?

@Swaymaw yes, I'm closing as discussed in #415.

Swaymaw and others added 6 commits November 20, 2024 15:36

Update README.md (9ffd3d9)
Signed-off-by: Swaymaw <[email protected]>

integrated paddleocr model for performing accurate ocr when using docling document converter (fc0523b)
Signed-off-by: Swaymaw <[email protected]>

original readme for pull request (476affe)
Signed-off-by: Swaymaw <[email protected]>

original readme for pull request (93f50a1)
Signed-off-by: Swaymaw <[email protected]>

added documentation, tests and updated dependencies to support paddleocr (db14192)
Signed-off-by: Swaymaw <[email protected]>

fix: propagate document limits to converter (#388) (a308821)
Signed-off-by: Michele Dolfi <[email protected]>
Signed-off-by: Swaymaw <[email protected]>

PeterStaar-IBM requested review from dolfim-ibm, vagenas, cau-git and PeterStaar-IBM November 20, 2024 11:52

dolfim-ibm requested changes Nov 20, 2024

syncing with latest commit on original branch (86d9a2c)

fixing styling issues (a00940f)
Signed-off-by: Swaymaw <[email protected]>

dolfim-ibm closed this Nov 26, 2024
