PNG images being reencoded unnecessarily in PDFs #3092

ericrechlin · 2020-09-07T04:10:24Z

Tesseract Version: 4.1.1 (also verified with latest 5.0 code)
Commit Number: 5fe1390 and later
Platform: Linux Desktop 4.4.0-19041-Microsoft #1-Microsoft Fri Dec 06 14:06:00 PST 2019 x86_64 x86_64 x86_64 GNU/Linux (Ubuntu 18.04 under WSL)

Current Behavior:

After upgrading my version of tesseract, I noticed that the PDF files it generates from PNGs are often larger, sometimes much larger, than they used to be with past versions.

For example, consider the following 262 KB PNG file:

Run it with the following command line to generate an OCR'd PDF:

tesseract sample.png sample pdf

With versions of tesseract-ocr prior to 4.0-rc1, this generates a 269 KB PDF. However, with versions beginning with 4.0-rc1, this results in a 402 KB PDF, because the image is being needlessly reencoded, rather than being placed into the PDF unchanged.
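Since the later comments in this thread trace the size increase to colormap handling, it can help to confirm whether a given PNG is palette-based before comparing PDF sizes. The following is a minimal sketch that reads the IHDR chunk directly, relying on the PNG format's color type 3 meaning "indexed color". `png_header` is a hypothetical helper name, not part of tesseract or leptonica.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_header(data: bytes) -> dict:
    """Parse the IHDR chunk from the first 26 bytes of a PNG file.

    Layout: 8-byte signature, 4-byte chunk length, 4-byte chunk type
    ("IHDR"), then width, height, bit depth and color type.
    """
    if data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file (missing signature or IHDR chunk)")
    width, height, bit_depth, color_type = struct.unpack(">IIBB", data[16:26])
    return {
        "width": width,
        "height": height,
        "bit_depth": bit_depth,
        "color_type": color_type,
        # Color type 3 is indexed color, i.e. the image carries a palette.
        "palette": color_type == 3,
    }
```

To use it on the file from this report, read the first 26 bytes of `sample.png` in binary mode and pass them to `png_header`; a `palette` value of `True` means the image is colormapped.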

I have narrowed this bug down to the fix for issue #1914, applied with commit 5fe1390. Specifically, the changes that commit made to src/api/pdfrenderer.cpp cause the problem above.

Expected Behavior:

The generated PDF should be similar in size to the original PNG instead of much larger (269 KB in this case instead of 402 KB).

Suggested Fix:

Changes need to be made to the imageToPDFObj method in src/api/pdfrenderer.cpp, either reverting the changes made in #1914 or modifying them to avoid reencoding the image.


zdenop · 2022-06-10T12:40:25Z

Actually, the problem is in src/ccmain/thresholder.cpp:
Image tmp = pixRemoveColormap(src, REMOVE_CMAP_BASED_ON_SRC);

The reported image depth (Image src) is 8, but after the colormap is removed the depth increases to 32, so the image is larger...
@DanBloomberg: is this the correct behavior?
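A rough sketch of what the depth change means for the uncompressed raster, assuming leptonica's convention of padding each row to a 32-bit word boundary. `raster_bytes` is a hypothetical helper for illustration, not a leptonica or tesseract function, and the image dimensions are made up.

```python
def raster_bytes(width: int, height: int, depth: int) -> int:
    """Uncompressed raster size in bytes for a width x height image at
    `depth` bits per pixel, with each row padded to a 32-bit word."""
    words_per_line = (width * depth + 31) // 32
    return 4 * words_per_line * height


# Hypothetical page dimensions, chosen only for illustration.
width, height = 1000, 1400
indexed = raster_bytes(width, height, 8)    # colormapped, 8 bpp
expanded = raster_bytes(width, height, 32)  # after colormap removal, 32 bpp
print(indexed, expanded, expanded / indexed)
```

The raw pixel data is four times larger at 32 bpp, so the encoder has more data to compress even though the visible image is unchanged.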

zdenop · 2022-06-15T17:21:03Z

@DanBloomberg: Did you have a chance to look at the behaviour of pixRemoveColormap(src, REMOVE_CMAP_BASED_ON_SRC)?

DanBloomberg · 2022-06-15T17:22:39Z

It's on my 'radar' for this week.

DanBloomberg · 2022-06-17T00:07:40Z

For many image processing operations, it is necessary to remove a colormap, because the actual pixel values are in the colormap, not in the data array.

However, for generating a pdf, as noticed 2 years ago, it is often more efficient to use FLATE encoding on the colormapped image than on an RGB image.

I have done some experiments in leptonica, and it preserves the colormap when encoding either a png file or a colormapped pix into pdf. You do not want to remove the colormap. The problem is the line identified above, line 174 in ccmain/thresholder.cpp.
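The FLATE point can be illustrated with a small experiment using Python's `zlib` (which implements the same DEFLATE compression). This is only a sketch on synthetic data: the palette and pixel indices are pseudo-random, which is a harsh case and not representative of scanned pages, and the palette's own storage is ignored. `expand_to_rgb` and `flate_sizes` are hypothetical helper names.

```python
import random
import zlib


def expand_to_rgb(indices: bytes, palette: list) -> bytes:
    """Replace every palette index with its 3-byte RGB value, which is
    roughly what removing a colormap does to the pixel data."""
    out = bytearray()
    for index in indices:
        out.extend(palette[index])
    return bytes(out)


def flate_sizes(indices: bytes, palette: list) -> tuple:
    """Return (compressed indexed size, compressed RGB size) in bytes."""
    rgb = expand_to_rgb(indices, palette)
    return len(zlib.compress(indices, 9)), len(zlib.compress(rgb, 9))


# Synthetic 256 x 256 image with a 256-entry palette (assumed data).
rng = random.Random(0)
palette = [
    (rng.randrange(256), rng.randrange(256), rng.randrange(256))
    for _ in range(256)
]
indices = bytes(rng.randrange(256) for _ in range(256 * 256))

indexed_size, rgb_size = flate_sizes(indices, palette)
print("indexed:", indexed_size, "bytes; RGB:", rgb_size, "bytes")
```

Both streams describe the same picture, but the expanded one is three times longer before compression, and DEFLATE does not recover all of that redundancy. Real images will show different ratios.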

stweil added the bug label Sep 8, 2020

stweil added this to the 5.0.0 milestone Sep 8, 2020

amitdo added the PDF label Dec 17, 2020

zdenop mentioned this issue Jun 10, 2022

Plans for tesseract 5.x.y #3673

Open

zdenop added a commit that referenced this issue Jun 23, 2022

fix issue #3092 - skip removing colormap

18fb5aa

zdenop closed this as completed Jun 23, 2022

This was referenced Oct 14, 2022

Regression: gif not scraping in 5.2 that was OK in 5.1 #3940

Closed

fix issue #3940 - remove colormap before thresholding #3942

Merged
