PNG images being reencoded unnecessarily in PDFs #3092

Closed
ericrechlin opened this issue Sep 7, 2020 · 4 comments
@ericrechlin

  • Tesseract Version: 4.1.1 (also verified with latest 5.0 code)
  • Commit Number: 5fe1390 and later
  • Platform: Linux Desktop 4.4.0-19041-Microsoft #1-Microsoft Fri Dec 06 14:06:00 PST 2019 x86_64 x86_64 x86_64 GNU/Linux (Ubuntu 18.04 under WSL)

Current Behavior:

After upgrading my version of tesseract, I noticed that the PDF files it generates from PNGs are often larger, sometimes much larger, than they used to be with past versions.

For example, consider the following 262 KB PNG file:

[image: sample]

Run it with the following command line to generate an OCR'd PDF:

tesseract sample.png sample pdf

With versions of tesseract-ocr prior to 4.0-rc1, this generates a 269 KB PDF. However, with versions beginning with 4.0-rc1, this results in a 402 KB PDF, because the image is being needlessly reencoded, rather than being placed into the PDF unchanged.

I have narrowed this bug down to the fix applied for issue #1914, applied with commit 5fe1390. Specifically, the changes made to src/api/pdfrenderer.cpp in this commit are what are causing the above problem.

Expected Behavior:

The generated PDF should be similar in size to the original PNG instead of much larger (269 KB in this case instead of 402 KB).

Suggested Fix:

Changes need to be made to the imageToPDFObj method in src/api/pdfrenderer.cpp, either reverting the changes made in #1914 or modifying them to avoid reencoding the image.

@stweil stweil added the bug label Sep 8, 2020
@stweil stweil added this to the 5.0.0 milestone Sep 8, 2020
@amitdo amitdo added the PDF label Dec 17, 2020
@zdenop

zdenop commented Jun 10, 2022

Actually, the problem is in src/ccmain/thresholder.cpp:
Image tmp = pixRemoveColormap(src, REMOVE_CMAP_BASED_ON_SRC);

The reported image depth (Image src) is 8, but after removing the colormap the depth increases to 32, so the image becomes larger.
@DanBloomberg: is this the correct behavior?
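
To put rough numbers on the depth change described above, here is a minimal sketch that computes the uncompressed size of a raster at a given bit depth. The helper name `rawImageBytes` and the 1000 x 1000 dimensions in the note below are hypothetical and are not taken from Tesseract or Leptonica. The sketch also ignores any row padding a real image library may add.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of Tesseract or Leptonica):
// uncompressed size in bytes of a width x height raster,
// with each row rounded up to a whole number of bytes.
std::size_t rawImageBytes(std::size_t width, std::size_t height,
                          unsigned bitsPerPixel) {
    assert(bitsPerPixel > 0);
    const std::size_t bitsPerRow = width * bitsPerPixel;
    const std::size_t bytesPerRow = (bitsPerRow + 7) / 8;  // round up
    return bytesPerRow * height;
}
```

For a 1000 x 1000 image, `rawImageBytes(1000, 1000, 8)` gives 1,000,000 bytes while `rawImageBytes(1000, 1000, 32)` gives 4,000,000 bytes. The PDF sizes reported in this issue (269 KB versus 402 KB) are sizes after compression, so the growth observed in practice is smaller than the fourfold increase in raw data.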

@zdenop

zdenop commented Jun 15, 2022

@DanBloomberg: Did you have a chance to look at the pixRemoveColormap(src, REMOVE_CMAP_BASED_ON_SRC) behaviour?

@DanBloomberg

It's on my 'radar' for this week.

@DanBloomberg

For many image processing operations, it is necessary to remove a colormap, because the actual pixel values are in the colormap, not in the data array.

However, when generating a PDF, as was noticed 2 years ago, it is often more efficient to use FLATE encoding on the colormapped image than on an RGB image.

I have done some experiments in leptonica, and it preserves the colormap when encoding into pdf either a png file or a colormapped pix. You do not want to remove the colormap. The problem is the identified line 174 in ccmain/thresholder.cpp.
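
The point that pixel values live in the colormap rather than in the data array can be illustrated with a small self-contained model. This is only a sketch: `Rgb`, `IndexedImage`, and `expandToRgb32` are hypothetical names, the 0xRRGGBB00 word layout is chosen for illustration, and none of it is Leptonica's actual implementation of pixRemoveColormap.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Rgb {
    std::uint8_t r, g, b;
};

// Hypothetical model of an 8 bpp colormapped image:
// each pixel stores an index, and the palette stores the colors.
struct IndexedImage {
    std::vector<Rgb> palette;           // the "colormap"
    std::vector<std::uint8_t> indices;  // one byte per pixel
};

// Expand every index into a 32-bit word (0xRRGGBB00 here),
// which is roughly what removing a colormap amounts to.
std::vector<std::uint32_t> expandToRgb32(const IndexedImage& img) {
    std::vector<std::uint32_t> out;
    out.reserve(img.indices.size());
    for (std::uint8_t idx : img.indices) {
        assert(idx < img.palette.size());
        const Rgb& c = img.palette[idx];
        out.push_back((std::uint32_t{c.r} << 24) |
                      (std::uint32_t{c.g} << 16) |
                      (std::uint32_t{c.b} << 8));
    }
    return out;
}
```

In this model the expanded buffer always occupies four bytes per pixel where the indexed data occupied one, and the compressor then sees repeated 32-bit color words instead of repeated one-byte indices. That is consistent with the observation above that keeping the colormap is usually the better choice when the image is only being embedded in a PDF.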
