[Bug]: crashes with tesseract 5.4.0 #1328

Closed
2 tasks done
mplx opened this issue Jun 8, 2024 · 8 comments
@mplx

mplx commented Jun 8, 2024

What were you trying to do?

With tesseract 5.4.0 (released 2 days ago), ocrmypdf crashes with SubprocessOutputError. I tried multiple PDFs with the same result. After downgrading to tesseract 5.3.4, everything works again.

Where are you installing/running from?

PyPI (pip, poetry, pipx, etc.); see https://aur.archlinux.org/cgit/aur.git/tree/PKGBUILD?h=ocrmypdf

OCRmyPDF version

16.3.1

What operating system are you working on?

Linux

Operating system details and version

Archlinux, Kernel 6.9.3-arch1-1

Simple sanity checks

  • Operating system is currently supported by its vendor (not end of life)
  • Python version is compatible with OCRmyPDF

Relevant log output

DEBUG ocrmypdf - ocrmypdf 16.3.1                                                                       __main__.py:59
  DEBUG ocrmypdf.subprocess - Running: ['tesseract', '--version']                                       __init__.py:133
  DEBUG ocrmypdf.subprocess - Found tesseract 5.4.0                                                     __init__.py:343
  DEBUG ocrmypdf.subprocess - Running: ['tesseract', '--version']                                       __init__.py:133
  DEBUG ocrmypdf.subprocess - Running: ['gs', '--version']                                              __init__.py:133
  DEBUG ocrmypdf.subprocess - Found gs 10.3.1                                                           __init__.py:343
  DEBUG ocrmypdf.subprocess - Running: ['gs', '--version']                                              __init__.py:133
  DEBUG ocrmypdf.subprocess - Running: ['tesseract', '--list-langs']                                    __init__.py:133
  DEBUG ocrmypdf.subprocess.tesseract - stdout/stderr = List of available languages in                   __init__.py:73
"/usr/share/tessdata/" (4):                                                                                            
deu                                                                                                                    
deu_frak                                                                                                               
eng                                                                                                                    
osd                                                                                                                    
                                                                                                                       
  DEBUG ocrmypdf.helpers - pikepdf mmap enabled                                                          helpers.py:326
  DEBUG ocrmypdf.helpers - os.symlink(scan2.pdf, /tmp/ocrmypdf.io.tx8_jr40/origin)                       helpers.py:179
  DEBUG ocrmypdf.helpers - os.symlink(/tmp/ocrmypdf.io.tx8_jr40/origin,                                  helpers.py:179
/tmp/ocrmypdf.io.tx8_jr40/origin.pdf)                                                                                  
  DEBUG root - Gathering info with 1 thread workers                                                         info.py:778
  DEBUG ocrmypdf.helpers - pikepdf mmap enabled                                                          helpers.py:326
Scanning contents     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
  DEBUG ocrmypdf.builtin_plugins.tesseract_ocr - Using Tesseract OpenMP thread limit 3             tesseract_ocr.py:184
  DEBUG ocrmypdf.helpers - pikepdf mmap enabled                                                          helpers.py:326
  DEBUG ocrmypdf._pipeline -    1  Rasterize with png16m, rotation 0                                   _pipeline.py:539
  DEBUG ocrmypdf.subprocess -    1  Running: ['gs', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE',       __init__.py:133
'-dInterpolateControl=-1', '-sDEVICE=png16m', '-dFirstPage=1', '-dLastPage=1',                                         
'-r300.000000x300.000000', '-dPDFSTOPONERROR', '-o', '-', '-sstdout=%stderr',                                          
'-dAutoRotatePages=/None', '-f', '/tmp/ocrmypdf.io.tx8_jr40/origin.pdf']                                               
  DEBUG PIL.PngImagePlugin -    1  STREAM b'IHDR' 16 13                                           PngImagePlugin.py:191
  DEBUG PIL.PngImagePlugin -    1  STREAM b'sRGB' 41 1                                            PngImagePlugin.py:191
  DEBUG PIL.PngImagePlugin -    1  STREAM b'pHYs' 54 9                                            PngImagePlugin.py:191
  DEBUG PIL.PngImagePlugin -    1  STREAM b'tEXt' 75 32                                           PngImagePlugin.py:191
  DEBUG PIL.PngImagePlugin -    1  STREAM b'IDAT' 119 8192                                        PngImagePlugin.py:191
  DEBUG ocrmypdf._exec.ghostscript -    1  Rotating output by 0                                      ghostscript.py:149
  DEBUG PIL.PngImagePlugin -    1  STREAM b'IHDR' 16 13                                           PngImagePlugin.py:191
  DEBUG PIL.PngImagePlugin -    1  STREAM b'pHYs' 41 9                                            PngImagePlugin.py:191
  DEBUG PIL.PngImagePlugin -    1  STREAM b'IDAT' 62 65536                                        PngImagePlugin.py:191
  DEBUG ocrmypdf._pipeline -    1  resolution (299.9994, 299.9994)                                     _pipeline.py:618
  DEBUG ocrmypdf.subprocess -    1  Running: ['tesseract', '-l', 'eng',                                 __init__.py:133
'/tmp/ocrmypdf.io.tx8_jr40/000001_ocr.png', '/tmp/ocrmypdf.io.tx8_jr40/000001_ocr_hocr', 'hocr', 'txt']                
OCR                   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/1 -:--:--
  ERROR ocrmypdf._pipelines._common - ExitCodeException                                                  _common.py:259
Traceback (most recent call last):                                                                                     
  File "/usr/lib/python3.12/site-packages/ocrmypdf/_exec/tesseract.py", line 313, in generate_hocr                     
    p = run(args_tesseract, stdout=PIPE, stderr=STDOUT, timeout=timeout, check=True)                                   
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                   
  File "/usr/lib/python3.12/site-packages/ocrmypdf/subprocess/__init__.py", line 62, in run                            
    proc = subprocess_run(args, env=env, check=check, **kwargs)                                                        
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                        
  File "/usr/lib/python3.12/subprocess.py", line 571, in run                                                           
    raise CalledProcessError(retcode, process.args,                                                                    
subprocess.CalledProcessError: Command '['tesseract', '-l', 'eng',                                                     
'/tmp/ocrmypdf.io.tx8_jr40/000001_ocr.png', '/tmp/ocrmypdf.io.tx8_jr40/000001_ocr_hocr', 'hocr', 'txt']'               
died with <Signals.SIGFPE: 8>.                                                                                         
                                                                                                                       
The above exception was the direct cause of the following exception:                                                   
                                                                                                                       
Traceback (most recent call last):                                                                                     
  File "/usr/lib/python3.12/site-packages/ocrmypdf/_pipelines/_common.py", line 249, in                                
cli_exception_handler                                                                                                  
    return fn(options, plugin_manager)                                                                                 
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                 
  File "/usr/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 190, in _run_pipeline                      
    optimize_messages = exec_concurrent(context, executor)                                                             
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                             
  File "/usr/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 117, in exec_concurrent                    
    executor(                                                                                                          
  File "/usr/lib/python3.12/site-packages/ocrmypdf/_concurrent.py", line 78, in __call__                               
    self._execute(                                                                                                     
  File "/usr/lib/python3.12/site-packages/ocrmypdf/builtin_plugins/concurrency.py", line 144, in                       
_execute                                                                                                               
    result = future.result()                                                                                           
             ^^^^^^^^^^^^^^^                                                                                           
  File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result                                          
    return self.__get_result()                                                                                         
           ^^^^^^^^^^^^^^^^^^^                                                                                         
  File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result                                    
    raise self._exception                                                                                              
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run                                             
    result = self.fn(*self.args, **self.kwargs)                                                                        
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                        
  File "/usr/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 81, in _exec_page_sync                     
    ocr_out, text_out = _image_to_ocr_text(page_context, ocr_image_out)                                                
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                
  File "/usr/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 62, in _image_to_ocr_text                  
    hocr_out, text_out = ocr_engine_hocr(ocr_image_out, page_context)                                                  
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                  
  File "/usr/lib/python3.12/site-packages/ocrmypdf/_pipeline.py", line 665, in ocr_engine_hocr                         
    ocr_engine.generate_hocr(                                                                                          
  File "/usr/lib/python3.12/site-packages/ocrmypdf/builtin_plugins/tesseract_ocr.py", line 253, in                     
generate_hocr                                                                                                          
    tesseract.generate_hocr(                                                                                           
  File "/usr/lib/python3.12/site-packages/ocrmypdf/_exec/tesseract.py", line 327, in generate_hocr                     
    raise SubprocessOutputError() from e                                                                               
ocrmypdf.exceptions.SubprocessOutputError
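
The `died with <Signals.SIGFPE: 8>` line in the traceback above means the tesseract child process was killed by a signal rather than exiting with an error status. As a minimal sketch (the helper name `describe_returncode` is hypothetical and not part of OCRmyPDF), Python's `subprocess` module reports such a death on POSIX as a negative return code, which can be decoded like this:

```python
import signal


def describe_returncode(returncode: int) -> str:
    """Turn a subprocess return code into a readable description.

    Hypothetical helper for illustration. On POSIX, subprocess reports
    a child killed by signal N as returncode -N.
    """
    if returncode < 0:
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            # Signal number not known to this platform's signal module.
            name = f"unknown signal {-returncode}"
        return f"killed by {name}"
    if returncode == 0:
        return "exited normally"
    return f"exited with status {returncode}"


# Signal 8 is SIGFPE on Linux, matching the traceback above.
print(describe_returncode(-8))
```

A SIGFPE typically indicates an arithmetic fault (such as an integer division by zero) inside the child process itself, which is consistent with the crash being in tesseract rather than in the Python code that launched it.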
@mplx mplx added the bug label Jun 8, 2024
@jbarlow83
Collaborator

This is a tesseract issue - we will need to wait for them to resolve it
tesseract-ocr/tesseract#4257

@jbarlow83
Collaborator

Since the issue is with Tesseract itself, downgrading is the only option at the moment

@amitdo

amitdo commented Jun 10, 2024

The bug is in the legacy engine.

> Since the issue is with Tesseract itself, downgrading is the only option at the moment

It's not the only option, unless you want Tesseract to use the legacy engine.

You can bypass this bug by using a model from the tessdata_fast repo or by using oem 1.
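
As a rough sketch of the `oem 1` workaround, the snippet below builds a tesseract command line that passes `--oem 1`, so that only the LSTM engine is requested. The helper name `build_tesseract_args` and the file paths are hypothetical. The argument layout mirrors the command shown in the log above. It only covers plain OCR and assumes no step that depends on the legacy engine is needed.

```python
from pathlib import Path


def build_tesseract_args(image, output_base, lang="eng", oem=1):
    """Build a tesseract command that requests a specific OCR engine mode.

    Hypothetical helper for illustration; --oem 1 asks tesseract to use
    the LSTM engine only, avoiding the legacy engine.
    """
    return [
        "tesseract",
        "--oem", str(oem),
        "-l", lang,
        str(image),
        str(output_base),
        "hocr",
        "txt",
    ]


# Example paths are placeholders, not taken from the report.
args = build_tesseract_args(Path("page.png"), Path("page_out"))
print(args)
```

The resulting list can be handed to `subprocess.run(args, check=True)` on a machine where tesseract and the requested language data are installed.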

@jbarlow83
Collaborator

@amitdo ocrmypdf uses orientation and script detection (osd.traineddata) which currently only has the legacy option even in tessdata_fast. Your workaround will help people looking to get tesseract 5.4.0 working on OCR (without using any feature that requires page orientation detection) but it's not a full solution.

For maintainers looking for a full solution that passes the test suite, unfortunately ocrmypdf with tesseract 5.4.0 is not workable and will have to wait for 5.4.1.

@amitdo

amitdo commented Jun 10, 2024

Yeah, I forgot about OSD.

@kmille

kmille commented Jun 12, 2024

I just updated tesseract to version 5.4.1-1 on Arch Linux and the problem is gone.

@mplx
Author

mplx commented Jun 13, 2024

Works for me as well... Arch Linux w/ ocrmypdf 16.3.1-1 + tesseract 5.4.1-1

@jbarlow83
Collaborator

As of OCRmyPDF 16.4.0, we refuse to use tesseract 5.4.0. Tesseract 5.4.1 works with any OCRmyPDF version.
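
A check along those lines could look roughly like the following. This is a minimal sketch, not OCRmyPDF's actual implementation: the function names and the `KNOWN_BAD_VERSIONS` set are assumptions for illustration, and it assumes the first line of `tesseract --version` output looks like `tesseract 5.4.0`.

```python
import re

# Assumption for illustration: versions known to crash, per this thread.
KNOWN_BAD_VERSIONS = {(5, 4, 0)}


def parse_tesseract_version(version_output: str) -> tuple:
    """Extract (major, minor, patch) from `tesseract --version` output."""
    first_line = version_output.splitlines()[0] if version_output else ""
    match = re.search(r"tesseract\s+v?(\d+)\.(\d+)\.(\d+)", first_line)
    if not match:
        raise ValueError(f"unrecognized version string: {first_line!r}")
    return tuple(int(part) for part in match.groups())


def is_usable(version_output: str) -> bool:
    """Return False when the reported version is on the known-bad list."""
    return parse_tesseract_version(version_output) not in KNOWN_BAD_VERSIONS


print(is_usable("tesseract 5.4.0"))  # False
print(is_usable("tesseract 5.4.1"))  # True
```

Matching on an exact version tuple, rather than a range, keeps both the older 5.3.4 and the fixed 5.4.1 usable.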
