Explore the docs
View Demo
·
Report Bug
·
Request Feature
The HICMA Dataset Benchmarking Tool is a powerful utility designed to assess the performance of Optical Character Recognition (OCR) models on the HICMA Dataset. This tool is intended for researchers, developers, and data scientists working with OCR technologies, providing them with valuable insights into the accuracy and efficiency of their OCR models.
- Dataset Evaluation: The OCR Benchmarking Tool allows users to easily evaluate the performance of their OCR models on the HICMA Dataset. By providing a standardized and consistent evaluation environment, users can compare different OCR systems objectively.
- Metrics and Statistics: The tool offers a comprehensive set of evaluation metrics, including character error rate (CER), word error rate (WER), and Levenshtein ratio (see the sketch after this list). It also provides statistical summaries to understand model performance across different images and fonts.
- Configurable Parameters: Users can customize evaluation parameters to suit their specific requirements. These parameters include character recognition confidence thresholds, choice of metrics, and image pre-processing options.
- Visualizations: The HICMA OCR Benchmarking Tool generates interactive visualizations and charts to help users visualize the results more intuitively. These visualizations aid in identifying patterns and areas for improvement in the OCR models (TODO).
- Easy Integration: The tool is designed to integrate seamlessly with popular OCR frameworks and libraries (Tesseract OCR, Kraken, EasyOCR, etc.), making it convenient for users to benchmark their existing models without significant code modifications.
- Reproducibility: The tool ensures reproducibility by saving evaluation results and summaries. This allows users to review and share their experiments with others easily.
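For reference, below is a minimal Python sketch of how these metrics are commonly defined; it is an illustration only, and the tool's own implementation (tokenization, normalization, and the exact Levenshtein-ratio formula) may differ.

def levenshtein(ref, hyp):
    """Edit distance between two sequences (characters or word tokens)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: character edit distance over reference length."""
    return levenshtein(list(ref), list(hyp)) / max(len(ref), 1)

def wer(ref, hyp):
    """Word error rate: the same edit distance over whitespace-split words."""
    ref_words, hyp_words = ref.split(), hyp.split()
    return levenshtein(ref_words, hyp_words) / max(len(ref_words), 1)

def levenshtein_ratio(ref, hyp):
    """Similarity in [0, 1]; exact normalization varies between libraries."""
    longest = max(len(ref), len(hyp), 1)
    return 1.0 - levenshtein(list(ref), list(hyp)) / longest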
For more information regarding the dataset and its benchmarking tools, check out the paper accepted at ArabicNLP 2023 (co-located with EMNLP 2023), found here.
To get a local copy up and running follow these simple steps:
Depending on the OCR model you are benchmarking, make sure to set up the environment according to that model's documentation:
- Tesseract OCR: Set up Tesseract OCR on your machine by following the documentation here, and provide the link to the pretrained model you would like to use. Links to pretrained Tesseract OCR models on Arabic text used in this benchmark can be found here. (A quick sanity check for the Tesseract installation follows this list.)
- Kraken: Provide the link to the pretrained model you would like to use. Links to pretrained Kraken models on Arabic text used in this benchmark can be found here.
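Before editing any configuration, it can help to confirm that the Tesseract executable is actually reachable. The snippet below is a small, hypothetical Python check (it is not part of the benchmarking tool); Kraken and EasyOCR are installed as Python packages, so importing them is usually check enough there.

import shutil
import subprocess

# Hypothetical sanity check: locate the Tesseract executable and print its version.
tesseract = shutil.which("tesseract")
if tesseract is None:
    print("Tesseract not found on PATH; install it or note its full path for TesseractOCR_path.")
else:
    result = subprocess.run([tesseract, "--version"], capture_output=True, text=True)
    # Older Tesseract builds print the version banner to stderr instead of stdout.
    print(f"Found {tesseract}: {(result.stdout or result.stderr or 'unknown version').splitlines()[0]}")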
- Clone the repo
git clone https://github.com/anisdismail/HICMA-benchmark
- Change to the project repository:
cd HICMA-benchmark
- Run the following command to install the required packages into the environment:
pip install -r requirements.txt
- To start the benchmark script, run the following command:
python benchmark.py --config config.json
You can modify the experiment parameters in the config.json file:
{
"general": {
"save_dir": "",
"metrics": [],
"model": "",
"data_dir": ""
},
"image_processing": {
"image_processing_width": null,
"image_processing_height": null,
"binarize": true,
"grayscale": true
}
}
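As a rough illustration only, the sketch below shows what the image_processing options typically correspond to, using Pillow; the binarization threshold (128) and the order of operations are assumptions, and the tool's own preprocessing pipeline may differ.

from PIL import Image  # Pillow

def preprocess(path, width=None, height=None, grayscale=True, binarize=True):
    """Approximate the config's image_processing options on a single image."""
    img = Image.open(path)
    if width and height:
        img = img.resize((width, height))  # image_processing_width / image_processing_height
    if grayscale:
        img = img.convert("L")             # grayscale: true
    if binarize:
        # Simple global threshold at 128 (assumed value); yields a 1-bit image.
        img = img.convert("L").point(lambda p: 255 if p > 128 else 0, mode="1")
    return img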
Please note that the benchmark tool works with only one model at a time, so choose exactly one of the following model blocks and add it to config.json (a sketch of a complete config follows the blocks below). If you provide more than one model in the same config file, the benchmarking script will only use the first.
"TesseractOCR": {
"tesseract_psm": 13,
"tesseract_oem": 1,
"trained_model_url": ,
"TesseractOCR_path":
},
"Kraken": {
"kraken_url": [],
"text_direction": "horizontal-lr"
},
"EasyOCR": {
"easyocr_detail": 0
}
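To make the "one model at a time" rule concrete, here is a minimal Python sketch that assembles and writes a complete config.json for a Tesseract OCR run; the directory paths, the empty URL, and the executable path are placeholders to replace with your own setup, and the model name is assumed to match the block name above.

import json

config = {
    "general": {
        "save_dir": "results/tesseract_run",  # placeholder output directory
        "metrics": ["CER", "WER", "Levenshtein_ratio"],
        "model": "TesseractOCR",              # exactly one model block below
        "data_dir": "data/HICMA",             # placeholder path to the dataset
    },
    "image_processing": {
        "image_processing_width": None,
        "image_processing_height": None,
        "binarize": True,
        "grayscale": True,
    },
    "TesseractOCR": {
        "tesseract_psm": 13,
        "tesseract_oem": 1,
        "trained_model_url": "",   # URL of the pretrained Arabic model you chose
        "TesseractOCR_path": "",   # path to the tesseract executable
    },
}

with open("config.json", "w", encoding="utf-8") as f:
    json.dump(config, f, indent=2)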
- Alternatively, you can pass the parameters directly as command-line arguments, which the script parses in the same way. For example:
python benchmark.py --save_dir save_dir --metrics "CER" "WER" "Levenshtein_ratio" --model "model" --data_dir "" --image_processing_width null --image_processing_height null --binarize true --grayscale true --tesseract_psm 13 --tesseract_oem 1 --trained_model_url "" --TesseractOCR_path "" --kraken_url "" --text_direction "horizontal-lr" --easyocr_detail 0
Please note that you should specify only one model at a time on the command line. If you provide more than one model, the benchmarking script will only use the first.
- You can also check all the possible parameters with their corresponding description using the following command:
python benchmark.py --help
It will generate the following output:
options:
--config CONFIG
Configuration file path
--save_dir SAVE_DIR
Save directory
--metrics METRICS [METRICS ...]
List of metrics for benchmarking (e.g.,
WER, CER, Levenshtein ratio)
--model MODEL
Model to be evaluated (e.g., Tesseract OCR,
KrakenOCR, EasyOCR)
--data_dir DATA_DIR
Path to the data directory containing the dataset
--tesseract_psm TESSERACT_PSM
Tesseract OCR page segmentation mode (PSM)
--tesseract_oem TESSERACT_OEM
Tesseract OCR engine mode (OEM)
--trained_model_url TRAINED_MODEL_URL
Tesseract OCR trained model url
--TesseractOCR_path TESSERACTOCR_PATH
Tesseract OCR executable path
--kraken_url KRAKEN_URL
Kraken OCR trained model url
--text_direction TEXT_DIRECTION
Text direction for Kraken OCR (e.g. horizontal-lr)
--easyocr_detail EASYOCR_DETAIL
EasyOCR detail level
--image_processing_width IMAGE_PROCESSING_WIDTH
Width for image processing
--image_processing_height IMAGE_PROCESSING_HEIGHT
Height for image processing
--binarize Binarize image
--grayscale Convert image to grayscale
Visualizations: The HICMA OCR Benchmarking Tool should generate interactive visualizations and charts to help users visualize the results more intuitively. These visualizations aid in identifying patterns and areas for improvement in the OCR models. See the open issues for an extended list of proposed features (and known issues).
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
- Fork the Project
- Create your Feature Branch (git checkout -b feature/AmazingFeature)
- Commit your Changes (git commit -m 'Add some AmazingFeature')
- Push to the Branch (git push origin feature/AmazingFeature)
- Open a Pull Request
The HICMA dataset and its benchmark are publicly available and are published for research purposes under the Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. See LICENSE for more information.