how to refine OCR result and choose custom model? #806

geoHeil · 2025-01-24T17:58:33Z

Question

I observe that the best result is obtained with RapidOCR, but RapidOCR is not perfect either.
How can I choose a different model in RapidOCR?
- See the example of how to fix the OCR: RapidAI/RapidOCR#330
I am trying to set:

ocr_model = "en_PP-OCRv4_rec"
pipeline_options.ocr_options = RapidOcrOptions(
    det_model_path=ocr_model,
    cls_model_path=ocr_model,
    rec_model_path=ocr_model
)

and observe that the model is not downloaded from Hugging Face; only the default model is downloaded.

This leads me to the 2nd question: can this default behaviour be kept, so that the model is still downloaded, while additionally refining the OCR result?

cau-git · 2025-01-29T12:43:44Z

@geoHeil can you please check with docling==2.17.0? A fix was merged here: #786

geoHeil · 2025-01-29T15:19:08Z

Just updated. Still the same failure: FileNotFoundError: en_PP-OCRv4_rec does not exists. The model is NOT downloaded. The default model is usually downloaded, so where can the configuration accept a Hugging Face model name?

nikos-livathinos · 2025-01-30T11:57:59Z

After some investigation, this is the current situation:

- The rapidocr-onnxruntime module does not download any model files.
- The model files for Chinese are included inside the Python package. Specifically, the following ONNX files are used by default: ch_PP-OCRv4_det_infer.onnx, ch_PP-OCRv4_rec_infer.onnx, ch_ppocr_mobile_v2.0_cls_infer.onnx.
- The user can download alternative model files from the HF RapidOCR repo and pass the file paths as parameters to RapidOcrOptions.
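
A minimal sketch of that download-then-check step, assuming the `huggingface_hub` package is installed. The repo id `SWHL/RapidOCR` is an assumption (the thread only says "HF RapidOCR repo"), and the helper names `download_rapidocr_models` and `require_model_files` are hypothetical. Checking the paths up front replaces the bare FileNotFoundError reported above with a message that lists exactly which files are missing.

```python
from pathlib import Path
from typing import Dict, Iterable


def download_rapidocr_models(repo_id: str = "SWHL/RapidOCR") -> Path:
    """Download a snapshot of the model repo and return its local directory.

    Assumption: `repo_id` is where the RapidOCR ONNX files are published.
    """
    # Imported lazily so the path check below works without huggingface_hub.
    from huggingface_hub import snapshot_download

    return Path(snapshot_download(repo_id=repo_id))


def require_model_files(root: Path, filenames: Iterable[str]) -> Dict[str, Path]:
    """Map each filename to its path under `root`; raise if any file is missing."""
    paths = {name: Path(root) / name for name in filenames}
    missing = sorted(name for name, path in paths.items() if not path.is_file())
    if missing:
        raise FileNotFoundError(
            f"Missing ONNX files under {root}: {', '.join(missing)}"
        )
    return paths
```

The returned paths can then be converted with `str(...)` and passed as `det_model_path`, `rec_model_path` and `cls_model_path`.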

For example, assuming that the ONNX files have been downloaded from HF into the local dir rapidocr_models_root, the following code works and produces quite good results (checked with the PDF file from the issue linked by @geoHeil):

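    import os
    from pathlib import Path

    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.document import ConversionResult
    from docling.datamodel.pipeline_options import PdfPipelineOptions, RapidOcrOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    # Placeholder: local directory that holds the downloaded ONNX files
    rapidocr_models_root = "rapidocr_models_root"

    det_model_path = os.path.join(rapidocr_models_root, "en_PP-OCRv3_det_infer.onnx")
    rec_model_path = os.path.join(rapidocr_models_root, "ch_PP-OCRv4_rec_server_infer.onnx")
    cls_model_path = os.path.join(rapidocr_models_root, "ch_ppocr_mobile_v2.0_cls_train.onnx")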
    ocr_options = RapidOcrOptions(
        det_model_path=det_model_path,
        rec_model_path=rec_model_path,
        cls_model_path=cls_model_path,
    )
    pipeline_options = PdfPipelineOptions(
        ocr_options=ocr_options
    )

    # Convert the document
    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options
            ),
        },
    )
    pdf_file = Path("WO2021041671A1-small.pdf")
    conversion_result: ConversionResult = converter.convert(source=pdf_file)
    doc = conversion_result.document
    md = doc.export_to_markdown()
    print(md)

The problem is that RapidOCR does not support any language parameter.

Therefore, although Docling has a lang parameter in RapidOcrOptions, it is not actually used.

@cau-git , @dolfim-ibm We could assist Docling users by implementing the following approach inside Docling:

1. Check if the lang option is "english" AND there are no explicit model file paths provided by the user. If so:
2. Download the English ONNX files from the RapidOCR HF repo (as I did manually for my example code above).
   - However, this introduces the complexity of deciding which combination of det/rec/cls files should be used for the selected language.
   - We can implement some rule-based logic that scans the RapidOCR HF repo and gets the most recent files available for the selected language, or falls back to Chinese, which is the standard language for PaddleOCR.
3. Pass the downloaded model files as parameters to the underlying RapidOCR model.
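
The rule-based selection described above could look roughly like the sketch below. It assumes the filenames follow the pattern seen in this thread (for example `en_PP-OCRv3_det_infer.onnx`); the function names `pick_latest` and `select_models` are hypothetical, and nothing here is Docling or RapidOCR API. The cls model is left out because its filenames in this thread follow a different naming pattern.

```python
import re
from typing import Dict, Iterable, Optional

# Matches names such as "en_PP-OCRv3_det_infer.onnx" or
# "ch_PP-OCRv4_rec_server_infer.onnx": language prefix, version, then role.
_MODEL_NAME = re.compile(
    r"^(?P<lang>[a-z]+)_PP-OCRv(?P<version>\d+)_(?P<role>det|rec)_.*\.onnx$"
)


def pick_latest(filenames: Iterable[str], lang: str, role: str) -> Optional[str]:
    """Return the highest-version file for (lang, role), or None if there is none."""
    best: Optional[str] = None
    best_version = -1
    # Sorting makes the result deterministic when two files share a version.
    for name in sorted(filenames):
        match = _MODEL_NAME.match(name)
        if match and match["lang"] == lang and match["role"] == role:
            version = int(match["version"])
            if version > best_version:
                best, best_version = name, version
    return best


def select_models(
    filenames: Iterable[str], lang: str, fallback: str = "ch"
) -> Dict[str, str]:
    """Choose det and rec files for `lang`, falling back to `fallback` per role."""
    names = list(filenames)
    selected: Dict[str, str] = {}
    for role in ("det", "rec"):
        choice = pick_latest(names, lang, role) or pick_latest(names, fallback, role)
        if choice is None:
            raise LookupError(f"No {role} model found for {lang!r} or {fallback!r}")
        selected[role] = choice
    return selected
```

The fallback is applied per role, so a language with only a detection model still gets a Chinese recognition model, which mirrors the mixed en/ch combination used in the example code above.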

dolfim-ibm · 2025-01-30T13:26:20Z

Very good analysis and findings. I propose the following approach to take as a new enhancement.

1. We create and maintain a mapping of "language" to "onnx" files (maybe only English in the beginning).
2. In the __init__ of the model class we apply a similar approach as the other models with HF weights, i.e. we accept an artifacts_path: Optional[Union[Path, str]]=None argument, and if it is None the weights are downloaded for the languages in the options.
3. The mapping in 1. should point to an exact rev hash on HF.
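
A rough sketch of this proposal, not Docling's actual implementation. The names `OcrModelSet`, `MODEL_SETS` and `resolve_artifacts` are hypothetical, the repo id `SWHL/RapidOCR` is an assumption, and the revision is a placeholder that would have to be replaced with a real commit hash. The file names are the ones used in the example code earlier in this thread.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, Optional, Union


@dataclass(frozen=True)
class OcrModelSet:
    """ONNX files for one language, pinned to a single repo revision."""

    repo_id: str
    revision: str
    det: str
    rec: str
    cls: str


# Assumed values: the repo id is unverified and the revision is a placeholder.
MODEL_SETS: Dict[str, OcrModelSet] = {
    "english": OcrModelSet(
        repo_id="SWHL/RapidOCR",
        revision="<pinned-commit-hash>",
        det="en_PP-OCRv3_det_infer.onnx",
        rec="ch_PP-OCRv4_rec_server_infer.onnx",
        cls="ch_ppocr_mobile_v2.0_cls_train.onnx",
    ),
}


def resolve_artifacts(
    lang: str, artifacts_path: Optional[Union[Path, str]] = None
) -> Dict[str, Path]:
    """Return det/rec/cls paths, downloading the pinned revision if no path is given."""
    model_set = MODEL_SETS[lang]  # raises KeyError for an unmapped language
    if artifacts_path is None:
        # Imported lazily so that local-only use does not need huggingface_hub.
        from huggingface_hub import snapshot_download

        artifacts_path = snapshot_download(
            repo_id=model_set.repo_id, revision=model_set.revision
        )
    root = Path(artifacts_path)
    return {
        "det": root / model_set.det,
        "rec": root / model_set.rec,
        "cls": root / model_set.cls,
    }
```

Pinning the revision keeps conversions reproducible even if the files in the upstream repo are later replaced.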

nikos-livathinos · 2025-01-30T17:17:43Z

@dolfim-ibm regarding passing the artifacts_path as an __init__ parameter, I think in the case of RapidOcrModel there is a collision with RapidOcrOptions, which already provides path options for the PaddleOCR models: det_model_path, cls_model_path, rec_model_path (check here).

Maybe instead of introducing the artifacts_path, we can implement the logic you propose using the existing paths of RapidOcrOptions.
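
One way to read this suggestion, sketched under assumptions: explicit paths from the user always win, and only the fields left unset are filled from downloaded defaults. `OcrPathOptions` is a stand-in for the three path fields of RapidOcrOptions, not the real class, and `fill_missing_paths` is a hypothetical helper.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class OcrPathOptions:
    """Stand-in for the three path fields of RapidOcrOptions (not the real class)."""

    det_model_path: Optional[str] = None
    cls_model_path: Optional[str] = None
    rec_model_path: Optional[str] = None


def fill_missing_paths(
    options: OcrPathOptions, defaults: OcrPathOptions
) -> OcrPathOptions:
    """Return a copy of `options` with every unset path taken from `defaults`."""
    updates = {
        field: getattr(defaults, field)
        for field in ("det_model_path", "cls_model_path", "rec_model_path")
        if getattr(options, field) is None
    }
    return replace(options, **updates)
```

With this shape no new artifacts_path argument is needed: a user who sets only rec_model_path keeps that file and receives default det and cls paths.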

geoHeil · 2025-01-31T07:52:12Z

Many thanks @nikos-livathinos for supporting and suggesting a further refinement. You mention:

and produces quite good results

Given the particular file from the issue (RapidAI/RapidOCR#330)

396 |                 |
397 | 8173.3 >16666.7 |

The columns are still confused even with the custom English model. Is there any option I can set to identify the table correctly?

Is this also the case for you?

nikos-livathinos · 2025-01-31T13:43:32Z

@geoHeil the quality of the end-to-end conversion depends on many factors and I feel this goes beyond the scope of this issue.

However, this is the input PDF: WO2021041671A1-small.pdf
And here is the converted markdown file: test1_WO2021041671A1-small.md.

geoHeil · 2025-02-02T17:43:50Z

Let me close this issue and ask for more details on this file in another one: #866

geoHeil added the question Further information is requested label Jan 24, 2025

vagenas added the enhancement New feature or request label Jan 30, 2025

nikos-livathinos added the ocr label Jan 30, 2025

vagenas removed the enhancement New feature or request label Jan 30, 2025

nikos-livathinos self-assigned this Jan 30, 2025

geoHeil mentioned this issue Feb 2, 2025

refine quality of OCR for tables #866

Open

geoHeil closed this as completed Feb 2, 2025
