uuencode-generated text is OCRed with many mistakes #4197

yurivict · 2024-03-06T09:35:00Z

Current Behavior

This text, generated by the UNIX uuencode command:

is OCRed incorrectly:

M=N[-?3NGO6HU7W; 7SF#W; INU]6\UXMU<@:4]NKJ5Q -#I5>KO0+:!>=[UUZ73
MVTV*-0=[TU=[W>]Q[=U!3WMPS5M[#VO[ 'KWO/>NA>G%1T$*>SW=Z\P]=Y%T]"
M8-$(( )B9 )A,3333(:INI@TT#3333%#0,F0::33TT ">@T:4])Z8)HTP)A,

Tesseract inserted spaces that are not present in the source and missed clearly visible backquotes, among other errors.
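
One way to spot such errors automatically is to use the structure of the format itself: in traditional uuencode output, the first character of each body line encodes the number of payload bytes, which fixes how many characters the rest of the line must have. The sketch below is only an illustration under that assumption; `check_uu_line` is a hypothetical helper name, and it does not handle the `begin`/`end` lines or the base64 variant (`uuencode -m`).

```python
import binascii


def check_uu_line(line: str) -> bool:
    """Return True if `line` has the shape of a traditional uuencode body line.

    Assumption: the line has no trailing newline, and the first character
    encodes the payload byte count as (count + 0x20), with a backquote
    standing for zero.
    """
    if not line:
        return False
    # Every character must lie in the uuencode range, space (0x20) to backquote (0x60).
    if not all(0x20 <= ord(c) <= 0x60 for c in line):
        return False
    byte_count = (ord(line[0]) - 0x20) & 0x3F
    # Each group of up to 3 payload bytes becomes 4 encoded characters.
    expected_body_len = 4 * ((byte_count + 2) // 3)
    return len(line) - 1 == expected_body_len


# Usage: build a known-good line with the standard library, then damage it
# the way an OCR engine might, by inserting a stray space.
good = binascii.b2a_uu(b"x" * 45).decode("ascii").rstrip("\n")
damaged = good[:10] + " " + good[10:]
print(check_uu_line(good), check_uu_line(damaged))
```

A length check like this flags inserted or dropped characters, but it cannot detect a substitution of one valid character for another.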

Versions: tesseract-5.3.4, tesseract-data-4.1.0
FreeBSD 14.0

Expected Behavior

n/a

Suggested Fix

n/a

tesseract -v

tesseract 5.3.4
leptonica-1.82.0
libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 3.0.1) : libpng 1.6.40 : libtiff 4.4.0 : zlib 1.3 : libwebp 1.3.2
Found SSE4.1
Found OpenMP 201811
Found libarchive 3.7.2 zlib/1.3 liblzma/5.4.4 bz2lib/1.0.8 liblz4/1.9.4 libzstd/1.5.5
Found libcurl/8.5.0 OpenSSL/3.0.12 zlib/1.3 libpsl/0.21.2 (+libidn2/2.3.4) libssh2/1.11.0 nghttp2/1.58.0

Operating System

No response

Other Operating System

FreeBSD 14.0

uname -a

FreeBSD xx.xx.xx 14.0-STABLE FreeBSD 14.0-STABLE #1 stable/14-n266076-2001d7f6a272: Sat Dec 30 13:33:21 PST 2023 [email protected]:/disk-samsung/obj/disk-samsung/freebsd-src/amd64.amd64/sys/GENERIC amd64

Compiler

No response

CPU

n/a

Virtualization / Containers

n/a

Other Information

n/a

stweil · 2024-03-06T10:55:11Z

With -l Latin, the result includes backquotes:

M=N [ - ?3NGO6HU7W ; 7ZSF#W; INU] 6\UXMU<@:4]NKJ5Q`-#I5>K0+: !>=[UUZ73
MVTV* -0=[TU=[W>]Q[=U!3WMP5M[#V9 ['KW0O/>NA>G%1T$*>SW=Z\P]=Y%T] `
M8-$((`)B9`)A,3333(:I^I@QTT#3333$#0,F0::33TT`">@T:4])Z8)HTP)A,

stweil · 2024-03-06T11:03:59Z

The recognition quality is not an issue in Tesseract itself. It depends on the neural network model that is used. In this case, most models were trained on texts in languages that require a space after a comma, for example, so such models are expected to insert those spaces.

If you want to recognize uuencoded text, training a model on such data would help. Alternatively, try whitelisting the characters that can occur and blacklisting the ones that cannot.
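
As a rough sketch of the whitelisting idea, the snippet below builds the set of characters that can appear in a traditional uuencode body line and passes it to Tesseract through the `tessedit_char_whitelist` parameter. Several things here are assumptions rather than part of this thread: the helper names, the image name `page.png`, the choice of `--psm 6` (a single uniform block of text), and the decision to leave the space character out of the whitelist because most uuencode implementations emit a backquote instead. The `begin` and `end` lines contain lowercase letters and would not survive this whitelist, so it only suits the body lines. Whether the whitelist alone removes the inserted spaces depends on the model.

```python
from typing import List


def uuencode_alphabet(include_space: bool = False) -> str:
    """Characters that may appear in a traditional uuencode body line.

    The encoding maps 6-bit values to ASCII 0x20-0x5F; a backquote (0x60)
    is commonly written in place of the space for the value zero.
    """
    start = 0x20 if include_space else 0x21
    return "".join(chr(code) for code in range(start, 0x61))


def tesseract_command(image: str, out_base: str = "stdout") -> List[str]:
    """Build an argument list for the tesseract CLI (hypothetical helper).

    The list is meant for subprocess.run(), which passes each argument
    as-is, so quotes and backslashes in the whitelist need no shell escaping.
    """
    whitelist = uuencode_alphabet()
    return [
        "tesseract", image, out_base,
        "--psm", "6",  # assumption: the page is one uniform block of text
        "-c", f"tessedit_char_whitelist={whitelist}",
    ]


# Usage (not executed here, since it needs tesseract and an input image):
#   import subprocess
#   result = subprocess.run(tesseract_command("page.png"),
#                           capture_output=True, text=True, check=True)
#   print(result.stdout)
cmd = tesseract_command("page.png")
```

Combining this with a per-line length check on the output can help catch the cases where the model still inserts or drops characters.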

stweil closed this as completed Mar 6, 2024

stweil added the question label Mar 6, 2024
