uuencode-generated text is OCRed with many mistakes #4197

Closed
yurivict opened this issue Mar 6, 2024 · 2 comments

yurivict commented Mar 6, 2024

Current Behavior

This text, generated by the UNIX uuencode command (shown in the attached image), is OCRed incorrectly:

M=N[-?3NGO6HU7W; 7SF#W; INU]6\UXMU<@:4]NKJ5Q -#I5>KO0+:!>=[UUZ73
MVTV*-0=[TU=[W>]Q[=U!3WMPS5M[#VO[ 'KWO/>NA>G%1T$*>SW=Z\P]=Y%T]"
M8-$(( )B9 )A,3333(:INI@TT#3333%#0,F0::33TT ">@T:4])Z8)HTP)A,

Tesseract inserted spaces that are not in the image and missed clearly visible backquotes, among other errors.
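Both symptoms can be flagged mechanically, because uuencoded lines have a rigid shape: every character lies in the ASCII range from space (0x20) to backquote (0x60), and the first character encodes how many bytes the line carries, which in turn fixes the line length. Below is a minimal Python sketch of such a check. The function name `check_uu_line` is made up for illustration, and the sketch assumes full-length lines whose trailing padding has not been stripped.

```python
def check_uu_line(line: str) -> list[str]:
    """Return a list of problems found in one uuencoded data line.

    Hypothetical helper, not part of Tesseract or any uuencode tool.
    """
    if not line:
        return ["empty line"]

    problems = []

    # uuencode maps each 6-bit value to chr(0x20 + value); a zero value is
    # commonly written as a backquote (0x60) instead of a space.
    bad = sorted({c for c in line if not 0x20 <= ord(c) <= 0x60})
    if bad:
        problems.append(f"characters outside the uuencode alphabet: {bad}")

    # The first character encodes the number of payload bytes on the line.
    n_bytes = (ord(line[0]) - 0x20) & 0x3F
    # Every 3 bytes become 4 characters, plus the leading length character.
    expected = 1 + 4 * ((n_bytes + 2) // 3)
    if len(line) != expected:
        problems.append(f"length {len(line)}, expected {expected}")

    return problems


if __name__ == "__main__":
    # 45 zero bytes, written by hand in backquote style.
    print(check_uu_line("M" + "`" * 60))
    # The same line with one extra character, as an inserted space would cause.
    print(check_uu_line("M" + "`" * 30 + " " + "`" * 30))
```

A line starting with `M` carries 45 bytes and should therefore be 61 characters long, so an inserted space shows up as a length mismatch even though the space itself is inside the alphabet.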

Versions: tesseract-5.3.4, tesseract-data-4.1.0
FreeBSD 14.0

Expected Behavior

n/a

Suggested Fix

n/a

tesseract -v

tesseract 5.3.4
leptonica-1.82.0
libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 3.0.1) : libpng 1.6.40 : libtiff 4.4.0 : zlib 1.3 : libwebp 1.3.2
Found SSE4.1
Found OpenMP 201811
Found libarchive 3.7.2 zlib/1.3 liblzma/5.4.4 bz2lib/1.0.8 liblz4/1.9.4 libzstd/1.5.5
Found libcurl/8.5.0 OpenSSL/3.0.12 zlib/1.3 libpsl/0.21.2 (+libidn2/2.3.4) libssh2/1.11.0 nghttp2/1.58.0

Operating System

No response

Other Operating System

FreeBSD 14.0

uname -a

FreeBSD xx.xx.xx 14.0-STABLE FreeBSD 14.0-STABLE #1 stable/14-n266076-2001d7f6a272: Sat Dec 30 13:33:21 PST 2023 [email protected]:/disk-samsung/obj/disk-samsung/freebsd-src/amd64.amd64/sys/GENERIC amd64

Compiler

No response

CPU

n/a

Virtualization / Containers

n/a

Other Information

n/a

stweil (Member) commented Mar 6, 2024

With -l Latin, the result includes backquotes:

M=N [ - ?3NGO6HU7W ; 7ZSF#W; INU] 6\UXMU<@:4]NKJ5Q`-#I5>K0+: !>=[UUZ73
MVTV* -0=[TU=[W>]Q[=U!3WMP5M[#V9 ['KW0O/>NA>G%1T$*>SW=Z\P]=Y%T] `
M8-$((`)B9`)A,3333(:I^I@QTT#3333$#0,F0::33TT`">@T:4])Z8)HTP)A,


stweil commented Mar 6, 2024

The recognition quality is not a Tesseract issue; it depends on the neural network model that is used. Most models were trained on texts in languages that require a space after a comma, for example, so such models are expected to insert those spaces.

If you want to recognize uuencoded text, training a model on such data would help. Alternatively, try whitelisting the characters that can occur and blacklisting those that cannot.
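As a rough illustration of the whitelist idea, the Python sketch below builds the uuencode alphabet and a matching `tesseract` command line that sets the `tessedit_char_whitelist` variable through `-c`. The helper names and the image file name `uuencoded.png` are made up for illustration, and the choice of `--psm 6` (a single uniform block of text) is an assumption about the page layout. Whether this actually improves the result for the image in this issue has not been tested, and a whitelist alone may not stop the model from inserting spaces.

```python
import subprocess


def uu_alphabet() -> str:
    """Characters that can occur in backquote-style uuencoded data lines.

    Covers '!' (0x21) through '`' (0x60); the space is left out on the
    assumption that the encoder writes zero values as backquotes.
    """
    return "".join(chr(code) for code in range(0x21, 0x61))


def tesseract_command(image_path: str) -> list[str]:
    """Build an argument list that restricts output to the uuencode alphabet."""
    return [
        "tesseract",
        image_path,
        "stdout",            # write the recognized text to standard output
        "--psm", "6",        # assumption: the image is one uniform text block
        "-c", f"tessedit_char_whitelist={uu_alphabet()}",
    ]


if __name__ == "__main__":
    cmd = tesseract_command("uuencoded.png")  # hypothetical file name
    # Passing a list avoids shell quoting problems with characters
    # such as quotes, backslashes and backquotes in the whitelist.
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(result.stdout)
```

The command is passed as an argument list instead of a shell string because the whitelist itself contains quotes, backslashes and backquotes.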

@stweil stweil closed this as completed Mar 6, 2024
@stweil stweil added the question label Mar 6, 2024