how to refine OCR result and choose custom model? #806

Closed
geoHeil opened this issue Jan 24, 2025 · 8 comments
Labels
ocr question Further information is requested

Comments

@geoHeil

geoHeil commented Jan 24, 2025

Question

  • I observe the best result is obtained by using RapidOCR
  • RapidOCR is also not perfect
  • How can I choose a different model in RapidOCR?
  • I am trying to set
ocr_model = "en_PP-OCRv4_rec"
pipeline_options.ocr_options = RapidOcrOptions(
    det_model_path=ocr_model,
    cls_model_path=ocr_model,
    rec_model_path=ocr_model
)

and observe that:
The model is not downloaded from Hugging Face; the normal (default) model is downloaded instead.

This leads me to the 2nd question: can this default behaviour be kept, so that the model is still downloaded, while also refining the OCR result?

@geoHeil geoHeil added the question Further information is requested label Jan 24, 2025
@cau-git
Contributor

cau-git commented Jan 29, 2025

@geoHeil can you please check with docling==2.17.0? A fix was merged here: #786

@geoHeil
Author

geoHeil commented Jan 29, 2025

Just updated; still the same failure: FileNotFoundError: en_PP-OCRv4_rec does not exists. The model is NOT downloaded. The default model is usually downloaded; where can the configuration accept a Hugging Face model name?

@vagenas vagenas added the enhancement New feature or request label Jan 30, 2025
@vagenas vagenas removed the enhancement New feature or request label Jan 30, 2025
@nikos-livathinos nikos-livathinos self-assigned this Jan 30, 2025
@nikos-livathinos
Collaborator

nikos-livathinos commented Jan 30, 2025

After some investigation, this is the current situation:

  1. The rapidocr-onnxruntime module does not download any model files.
  2. The model files for Chinese are included inside the Python package. Specifically, the following ONNX files are used by default: ch_PP-OCRv4_det_infer.onnx, ch_PP-OCRv4_rec_infer.onnx, ch_ppocr_mobile_v2.0_cls_infer.onnx.
  3. The user can download alternative model files from the HF RapidOCR repo and pass the file paths as parameters to RapidOcrOptions.

For example, assuming that the ONNX files have been downloaded from HF into the local dir rapidocr_models_root, the following code works and produces quite good results (checked with the PDF file from the issue linked by @geoHeil):

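    import os
    from pathlib import Path

    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.document import ConversionResult
    from docling.datamodel.pipeline_options import PdfPipelineOptions, RapidOcrOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    # rapidocr_models_root must be set to the local dir holding the ONNX files downloaded from HF
    det_model_path = os.path.join(rapidocr_models_root, "en_PP-OCRv3_det_infer.onnx")
    rec_model_path = os.path.join(rapidocr_models_root, "ch_PP-OCRv4_rec_server_infer.onnx")
    cls_model_path = os.path.join(rapidocr_models_root, "ch_ppocr_mobile_v2.0_cls_train.onnx")

    # Pass explicit file paths; RapidOCR does not resolve bare model names
    ocr_options = RapidOcrOptions(
        det_model_path=det_model_path,
        rec_model_path=rec_model_path,
        cls_model_path=cls_model_path,
    )
    pipeline_options = PdfPipelineOptions(
        ocr_options=ocr_options
    )

    # Convert the document
    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options
            ),
        },
    )
    pdf_file = Path("WO2021041671A1-small.pdf")
    conversion_result: ConversionResult = converter.convert(source=pdf_file)
    doc = conversion_result.document
    md = doc.export_to_markdown()
    print(md)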
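Since the three options must be paths to files that exist on disk, a bare model name such as en_PP-OCRv4_rec fails with FileNotFoundError once OCR starts. A small pre-flight check can surface that earlier and name the offending argument. The helper below is only a sketch; check_onnx_paths is a hypothetical name and not part of Docling or RapidOCR.

```python
from pathlib import Path


def check_onnx_paths(**model_paths: str) -> dict:
    """Return the given paths unchanged if each one is an existing .onnx file.

    Raises FileNotFoundError naming every offending argument, so a bare
    model name is rejected before any OCR work begins.
    (Hypothetical helper, not a Docling or RapidOCR API.)
    """
    problems = []
    for name, value in model_paths.items():
        path = Path(value)
        # A usable entry has the .onnx suffix and points at a real file.
        if path.suffix != ".onnx" or not path.is_file():
            problems.append(f"{name}={value!r}")
    if problems:
        raise FileNotFoundError(
            "expected existing .onnx files, got: " + ", ".join(problems)
        )
    return dict(model_paths)
```

The returned dict uses the same keyword names as the options, so it could be unpacked directly, for example RapidOcrOptions(**check_onnx_paths(det_model_path=..., rec_model_path=..., cls_model_path=...)).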

The problem is that RapidOCR does not support any language parameter.

Therefore, although Docling has a lang parameter in RapidOcrOptions, it is not actually used.

@cau-git , @dolfim-ibm We could assist Docling users by implementing the following approach inside Docling:

  1. Check if the lang option is "english" AND there are no explicit model file-paths provided by the user, then:
  2. Manually download the English ONNX files from the RapidOCR HF repo (as I did manually in my example code above).
    • However, this introduces the complexity of deciding which combination of det/rec/cls files should be used for the selected language.
    • We can implement some rule-based logic that scans the RapidOCR HF repo and gets the most recent files available for the selected language, or falls back to Chinese, which is the standard language for PaddleOCR.
  3. Pass the downloaded model files as parameters to the underlying RapidOCR model.
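The rule-based selection described in step 2 could look roughly like the sketch below. It assumes the repo listing is already available as a flat list of file names that follow the pattern seen in this thread (language prefix, a version number after "v", then det/rec/cls). pick_models and the regular expression are hypothetical and exist in neither Docling nor RapidOCR.

```python
import re
from typing import Dict, Iterable, Optional, Tuple

# Assumed naming pattern, e.g. "en_PP-OCRv3_det_infer.onnx" or
# "ch_ppocr_mobile_v2.0_cls_infer.onnx": <lang>_..v<version>_<kind>_...
_NAME = re.compile(
    r"^(?P<lang>[a-z]+)_.*?v(?P<ver>\d+(?:\.\d+)?)_(?P<kind>det|rec|cls)_"
)


def pick_models(
    filenames: Iterable[str], lang: str, fallback: str = "ch"
) -> Dict[str, Optional[str]]:
    """Pick the highest-versioned det/rec/cls file for `lang`.

    Falls back to `fallback` (Chinese by default) for any kind that has no
    file in the requested language. Ties on version are broken by file name,
    which is an arbitrary choice made only to keep the result deterministic.
    """
    best: Dict[Tuple[str, str], Tuple[float, str]] = {}
    for name in filenames:
        match = _NAME.match(name)
        if match is None or not name.endswith(".onnx"):
            continue  # ignore files that do not follow the assumed pattern
        key = (match["lang"], match["kind"])
        candidate = (float(match["ver"]), name)
        if key not in best or candidate > best[key]:
            best[key] = candidate

    chosen: Dict[str, Optional[str]] = {}
    for kind in ("det", "rec", "cls"):
        hit = best.get((lang, kind)) or best.get((fallback, kind))
        chosen[kind] = hit[1] if hit else None
    return chosen
```

Comparing version numbers across model families (PP-OCR versus ppocr_mobile) is a simplification; a real rule set would probably rank families separately.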

@dolfim-ibm
Contributor

Very good analysis and findings. I propose the following approach to take as a new enhancement.

  1. We create and maintain a mapping of "language" to "onnx" files (maybe only English in the beginning)
  2. In the __init__ of the model class we apply a similar approach as for the other models with HF weights, i.e. we accept an artifacts_path: Optional[Union[Path, str]]=None argument, and if it is None the weights are downloaded for the languages in the options
  3. The mapping in 1. should point to an exact rev hash on HF
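A minimal sketch of these three points, under stated assumptions: OnnxModelSet, LANG_TO_MODELS and resolve_model_paths are hypothetical names, the revision string is a placeholder rather than a real commit hash, and the file combination is simply the one from the example earlier in this thread. The download step is injected as a callable so the sketch runs without network access; in practice it could wrap huggingface_hub.snapshot_download, which accepts repo_id and revision arguments.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Dict, Optional, Union


@dataclass(frozen=True)
class OnnxModelSet:
    revision: str  # exact commit hash on HF (placeholder below)
    det: str
    rec: str
    cls: str


# Hypothetical mapping; only English to begin with.
LANG_TO_MODELS: Dict[str, OnnxModelSet] = {
    "english": OnnxModelSet(
        revision="<pinned-commit-hash>",  # placeholder, not a real hash
        det="en_PP-OCRv3_det_infer.onnx",
        rec="ch_PP-OCRv4_rec_server_infer.onnx",
        cls="ch_ppocr_mobile_v2.0_cls_train.onnx",
    ),
}


def resolve_model_paths(
    lang: str,
    download: Callable[[str], Union[Path, str]],
    artifacts_path: Optional[Union[Path, str]] = None,
) -> Dict[str, str]:
    """Return det/rec/cls paths for `lang`.

    If `artifacts_path` is given, the files are expected there and nothing
    is downloaded. Otherwise `download(revision)` must fetch the pinned
    revision and return the local directory containing the files.
    """
    models = LANG_TO_MODELS[lang]
    if artifacts_path is not None:
        root = Path(artifacts_path)
    else:
        root = Path(download(models.revision))
    return {
        "det_model_path": str(root / models.det),
        "rec_model_path": str(root / models.rec),
        "cls_model_path": str(root / models.cls),
    }
```

Pinning the revision in the mapping keeps conversions reproducible even if files in the HF repo are later renamed or replaced.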

@nikos-livathinos
Collaborator

nikos-livathinos commented Jan 30, 2025

@dolfim-ibm regarding passing the artifacts_path as an __init__ parameter, I think in the case of RapidOcrModel there is a collision with RapidOcrOptions, which already provides path options for the PaddleOCR models: det_model_path, cls_model_path, rec_model_path (check here).

Maybe instead of introducing the artifacts_path, we can implement the logic you propose using the existing paths of RapidOcrOptions.
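One way to read this suggestion: any path the user sets explicitly wins, and only the unset ones are filled from the per-language defaults. The sketch below illustrates just that merge step; merge_model_paths is a hypothetical helper, and plain dicts stand in for the options object.

```python
from typing import Dict, Mapping, Optional

PATH_FIELDS = ("det_model_path", "cls_model_path", "rec_model_path")


def merge_model_paths(
    user_paths: Mapping[str, Optional[str]],
    defaults: Mapping[str, str],
) -> Dict[str, str]:
    """Keep every path the user set; fill unset (None or empty) ones from defaults."""
    return {field: user_paths.get(field) or defaults[field] for field in PATH_FIELDS}
```

With this rule a user could override only the recognition model and still get the default detection and classification models for the selected language.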

@geoHeil
Author

geoHeil commented Jan 31, 2025

Many thanks @nikos-livathinos for supporting and suggesting a further refinement. You mention:

and produces quite good results

Given the particular file from the issue (RapidAI/RapidOCR#330)

396 |                 |
397 | 8173.3 >16666.7 |

the columns are still confused even with the custom English model. Is there any option I can set so that the table is identified correctly?

Is this also the case for you?

@nikos-livathinos
Collaborator

@geoHeil the quality of the end-to-end conversion depends on many factors and I feel this goes beyond the scope of this issue.

However, this is the input PDF: WO2021041671A1-small.pdf
And here is the converted markdown file: test1_WO2021041671A1-small.md.

@geoHeil
Author

geoHeil commented Feb 2, 2025

Let me close this issue and ask for more details on this file in another one: #866

@geoHeil geoHeil closed this as completed Feb 2, 2025