We've been using this library for a long time to extract text from PDFs. However, since we switched to a different method of generating the input PDFs, the extracted text is nothing but gibberish. To me it looks like an interesting encoding problem :)
I've compared the extracted text with the output of another PDF library, PdfPig, which extracts the text as expected. However, for performance reasons, iTextSharp is still the preferred option.
To debug the problem, I've looked at how the tokenization code differs between the two libraries. Here you can find it in the alternative implementation. Certain parts look very familiar, but I noticed that PdfPig also corrects for endianness. I've reimplemented that correction in the PrTokenizer, but it didn't solve the problem.
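As an illustration of what an endianness correction involves, here is a minimal sketch in Python. It is not the PdfPig or PrTokenizer code: the helper name `decode_utf16_with_bom` is hypothetical, and falling back to big-endian when no byte-order mark is present is an assumption made for this example. It only shows that the same two-byte data decodes to readable text or to unrelated characters depending on the byte order used.

```python
import codecs


def decode_utf16_with_bom(raw: bytes) -> str:
    """Decode two-byte text, honoring a leading byte-order mark if present.

    Hypothetical helper for illustration only.
    """
    if raw.startswith(codecs.BOM_UTF16_BE):
        return raw[2:].decode("utf-16-be")
    if raw.startswith(codecs.BOM_UTF16_LE):
        return raw[2:].decode("utf-16-le")
    # Assumption: without a byte-order mark, treat the data as big-endian.
    return raw.decode("utf-16-be")


# A big-endian string with a byte-order mark in front.
sample = codecs.BOM_UTF16_BE + "More text.".encode("utf-16-be")

# Honoring the byte-order mark gives back the readable text.
correct = decode_utf16_with_bom(sample)

# Reading the same bytes with the wrong byte order yields unrelated characters.
swapped = sample[2:].decode("utf-16-le")

print(correct)
print(correct == swapped)
```

If swapping the byte order does not change the result for a given file, as was the case here, the gibberish most likely comes from somewhere other than byte order.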
We would really appreciate it if you could help us find a solution. Thanks!
Environment
The in-use version: 1.5.1
Operating system: Windows
IDE: VS2019
Example code/Steps to reproduce:
sample-original.pdf: This is the original file, for which text extraction works with both libraries.
sample-recostar.pdf: This is the same file after it has gone through OCR software (I think it uses Ghostscript to generate the PDF).
Notice that using iTextSharp with sample-recostar.pdf results in nonsense, while PdfPig extracts the expected text from the same file.
ITEXT
OUTPUT sample-original.pdf:
A Simple PDF File This is a small demonstration .pdf file - just for use in the Virtual Mechanics tutorials. More text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. Boring, zzzzz. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. Even more. Continued on page 2 ...
OUTPUT sample-recostar.pdf:
� � � � � � �
� � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � �
PDFPIG
OUTPUT sample-original.pdf:
A Simple PDF File This is a small demonstration .pdf file - just for use in the Virtual Mechanics tutorials. More text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. Boring, zzzzz. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. Even more. Continued on page 2 ...
OUTPUT sample-recostar.pdf:
A Simple PDF File This is a small demonstration .pdf file - just for use in the Virtual Mechanics tutorials. More text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. Boring, zzzzz. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. Even more. Continued on page 2...
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related problems.
lockbot
locked as resolved and limited conversation to collaborators
Jan 18, 2020