PNG images being reencoded unnecessarily in PDFs #3092
Comments
Actually, the problem is in src/ccmain/thresholder.cpp: Reported image depth ( |
@DanBloomberg: Did you have a chance to have a look at |
It's on my 'radar' for this week.
For many image processing operations, it is necessary to remove a colormap, because the actual pixel values are in the colormap, not in the data array. However, for generating a PDF, as noticed 2 years ago, it is often more efficient to use FLATE encoding on the colormapped image than on an RGB image. I have done some experiments in leptonica, and it preserves the colormap when encoding either a PNG file or a colormapped pix into a PDF. You do not want to remove the colormap. The problem is the line already identified, line 174 in ccmain/thresholder.cpp.
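The size argument in the comment above can be illustrated with a small sketch. It is not leptonica or tesseract code: it compresses a synthetic 16-colour image with Python's zlib (the same DEFLATE algorithm that FLATE uses), once as one-byte palette indices and once expanded to RGB. The helper names, the palette size, and the random test image are assumptions made for illustration, and PNG filters and PDF predictors are ignored.

```python
import random
import zlib


def expand_palette(indices, palette):
    """Replace each one-byte palette index with its 3-byte RGB entry."""
    out = bytearray()
    for i in indices:
        out += palette[i]
    return bytes(out)


def flate_sizes(indices, palette):
    """Return (indexed_size, rgb_size) in bytes after DEFLATE compression.

    The indexed size includes the uncompressed palette, since a
    colormapped image has to carry its palette with it.
    """
    rgb = expand_palette(indices, palette)
    palette_bytes = b"".join(palette)
    indexed_size = len(zlib.compress(indices, 9)) + len(palette_bytes)
    rgb_size = len(zlib.compress(rgb, 9))
    return indexed_size, rgb_size


# Hypothetical test image: 256 x 256 pixels drawn from a 16-colour palette.
# A fixed seed keeps the run reproducible.
rng = random.Random(0)
palette = [bytes(rng.randrange(256) for _ in range(3)) for _ in range(16)]
indices = bytes(rng.randrange(16) for _ in range(256 * 256))

indexed_size, rgb_size = flate_sizes(indices, palette)
print("indexed + palette:", indexed_size, "bytes")
print("RGB:", rgb_size, "bytes")
```

On this synthetic input the indexed stream compresses to fewer bytes than the RGB expansion. Real scans behave differently in detail, so the numbers only indicate the direction of the effect.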
Current Behavior:
After upgrading my version of tesseract, I noticed that the PDF files it generates from PNGs are often larger, sometimes much larger, than they used to be with past versions.
For example, consider the following 262 KB PNG file:
Run tesseract on it with the following command line to generate an OCR'd PDF:
tesseract sample.png sample pdf
With versions of tesseract-ocr prior to 4.0-rc1, this generates a 269 KB PDF. However, with versions beginning with 4.0-rc1, this results in a 402 KB PDF, because the image is being needlessly reencoded, rather than being placed into the PDF unchanged.
I have narrowed this bug down to the fix for issue #1914, applied in commit 5fe1390. Specifically, the changes made to src/api/pdfrenderer.cpp in that commit are what cause the above problem.
Expected Behavior:
The generated PDF should be similar in size to the original PNG instead of much larger (269 KB in this case instead of 402 KB).
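One rough way to check for this regression is to compare the size of the generated PDF with the source PNG. The sketch below is a hypothetical helper, not part of tesseract. The 1.10 tolerance is an arbitrary assumption, and the file names in the usage note are the ones from the command line above.

```python
import os


def size_ratio(source_path, output_path):
    """Return the output file size divided by the source file size."""
    source_size = os.path.getsize(source_path)
    if source_size == 0:
        raise ValueError("source file is empty")
    return os.path.getsize(output_path) / source_size


def looks_reencoded(source_path, output_path, tolerance=1.10):
    """Flag an output that is much larger than its source.

    The default tolerance of 1.10 is an illustrative threshold only.
    """
    return size_ratio(source_path, output_path) > tolerance


# Ratios for the sizes reported in this issue (KB):
print(round(269 / 262, 2))  # 1.03, versions prior to 4.0-rc1
print(round(402 / 262, 2))  # 1.53, versions beginning with 4.0-rc1
```

After running the command above, `looks_reencoded("sample.png", "sample.pdf")` would return True for an output like the 402 KB PDF and False for one like the 269 KB PDF.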
Suggested Fix:
Changes need to be made to the imageToPDFObj method in src/api/pdfrenderer.cpp, either reverting the changes made in #1914 or modifying them to avoid reencoding the image.