Encoding problem with extracted text from Ghostscript generated pdf #42

Closed
tincann opened this issue Aug 16, 2019 · 3 comments

tincann commented Aug 16, 2019

Summary of the issue

Hi there,

We've been using this library for a long time now in order to extract text from pdfs. However, since we've switched methods for generating the input pdfs, the resulting extracted text is nothing but gibberish. To me it looks like an interesting encoding problem :)

I've compared the extracted text with the output of another PDF library, PdfPig, which extracts the text as expected. However, for performance reasons, iTextSharp is still the preferred option.

To debug the problem, I compared the tokenization code of the two libraries; the alternative implementation can be found here. Certain parts look very familiar, but I noticed that PdfPig also corrects for endianness. I reimplemented that in PrTokeniser, but it didn't solve the problem.
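As an illustration of where byte order matters in PDF strings: when a font uses multi-byte character codes, each code is read high-order byte first. The sketch below is hypothetical (`CodeSplitSketch` and `ToTwoByteCodes` are made-up names, and neither library's code is reproduced here) and assumes every code is exactly 2 bytes wide. The resulting numbers are character codes, not text, so a byte-order fix alone would not be expected to produce readable output.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of iTextSharp or PdfPig.
public static class CodeSplitSketch
{
    // Splits the raw bytes of a PDF string into 2-byte character codes,
    // reading the high-order byte first (big-endian).
    // Assumption: every code is exactly 2 bytes; a trailing odd byte is ignored.
    public static List<int> ToTwoByteCodes(byte[] raw)
    {
        var codes = new List<int>(raw.Length / 2);
        for (int i = 0; i + 1 < raw.Length; i += 2)
        {
            codes.Add((raw[i] << 8) | raw[i + 1]);
        }
        return codes;
    }
}
```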

We would really appreciate it if you could help us find a solution. Thanks!

Environment

The in-use version: 1.5.1
Operating system: Windows
IDE: VS2019

Example code/Steps to reproduce:

  • sample-original.pdf This is the original file, where text extraction works in both methods.
  • sample-recostar.pdf This is the file after it's gone through OCR software (I think it uses Ghostscript to generate the pdf).
using System;
using System.Collections.Generic;
using System.IO;

class Program
{
    static void Main(string[] args)
    {
        var fileNames = new[]
        {
            "sample-original.pdf",
            "sample-recostar.pdf"
        };

        ExtractWithItext(fileNames);
        ExtractWithPdfPig(fileNames);

        Console.ReadLine();
    }

    // Prints every string token found in the raw content stream of page 1.
    public static void ExtractWithItext(IEnumerable<string> fileNames)
    {
        Console.WriteLine("ITEXT");
        foreach (var fileName in fileNames)
        {
            var reader = new iTextSharp.text.pdf.PdfReader(fileName);

            Console.WriteLine($"OUTPUT {fileName}: ");

            var content = reader.GetPageContent(1);

            var tokenizer = new iTextSharp.text.pdf.PrTokeniser(new iTextSharp.text.pdf.RandomAccessFileOrArray(content));
            while (tokenizer.NextToken())
            {
                if (tokenizer.TokenType == iTextSharp.text.pdf.PrTokeniser.TK_STRING)
                {
                    Console.Write(tokenizer.StringValue);
                }
            }

            Console.WriteLine();
            reader.Close();
        }
    }

    // Prints the words PdfPig extracts from page 1, joined by spaces.
    private static void ExtractWithPdfPig(IEnumerable<string> fileNames)
    {
        Console.WriteLine("PDFPIG");
        foreach (var fileName in fileNames)
        {
            Console.WriteLine($"OUTPUT {fileName}: ");
            using (var stream = File.OpenRead(fileName))
            using (UglyToad.PdfPig.PdfDocument document = UglyToad.PdfPig.PdfDocument.Open(stream))
            {
                var page = document.GetPage(1);

                string fullText = string.Join(" ", page.GetWords());

                Console.WriteLine(fullText);
                Console.WriteLine();
            }
        }
    }
}

Output:

Notice that iTextSharp produces nonsense for sample-recostar.pdf, while PdfPig extracts the expected text from the same file.

ITEXT
OUTPUT sample-original.pdf:
 A Simple PDF File  This is a small demonstration .pdf file -  just for use in the Virtual Mechanics tutorials. More text. And more  text. And more text. And more text. And more text.  And more text. And more text. And more text. And more text. And more  text. And more text. Boring, zzzzz. And more text. And more text. And  more text. And more text. And more text. And more text. And more text.  And more text. And more text.  And more text. And more text. And more text. And more text. And more  text. And more text. And more text. Even more. Continued on page 2 ...
OUTPUT sample-recostar.pdf:
 � � � � � �  �
 � � � � � � � � � � � � �  � � �  � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � �


PDFPIG
OUTPUT sample-original.pdf:
A Simple PDF File This is a small demonstration .pdf file - just for use in the Virtual Mechanics tutorials. More text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. Boring, zzzzz. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. Even more. Continued on page 2 ...

OUTPUT sample-recostar.pdf:
A Simple PDF File This is a small demonstration .pdf file - just for use in the Virtual Mechanics tutorials. More text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. Boring, zzzzz. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. Even more. Continued on page 2...

Owner

VahidN commented Aug 17, 2019

To be able to process this PDF, we would need the CMapAwareDocumentFont class and its Decode method. Those are part of v5.x. Sorry!
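For readers unfamiliar with the mechanism: a font in a PDF can carry a ToUnicode CMap that maps the character codes used in content-stream strings to Unicode text, which is the kind of lookup a decode step performs. The following is a minimal sketch of that idea only, not the CMapAwareDocumentFont implementation. `ToUnicodeSketch`, `ParseBfChars` and `Decode` are hypothetical names, and the sketch assumes 2-byte codes and `bfchar` entries only (no `codespacerange` or `bfrange` handling).

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of iTextSharp or PdfPig.
public static class ToUnicodeSketch
{
    // Matches one bfchar entry such as "<0024> <0041>".
    private static readonly Regex BfCharEntry =
        new Regex(@"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>");

    // Collects code -> Unicode mappings from every beginbfchar/endbfchar section.
    public static Dictionary<int, string> ParseBfChars(string cmap)
    {
        var map = new Dictionary<int, string>();
        int pos = 0;
        while (true)
        {
            int start = cmap.IndexOf("beginbfchar", pos, StringComparison.Ordinal);
            if (start < 0) break;
            int end = cmap.IndexOf("endbfchar", start, StringComparison.Ordinal);
            if (end < 0) break;

            string section = cmap.Substring(start, end - start);
            foreach (Match m in BfCharEntry.Matches(section))
            {
                int code = int.Parse(m.Groups[1].Value, NumberStyles.HexNumber,
                                     CultureInfo.InvariantCulture);
                map[code] = HexToUtf16(m.Groups[2].Value);
            }
            pos = end + "endbfchar".Length;
        }
        return map;
    }

    // The destination of a mapping is UTF-16BE, written as hex digits.
    private static string HexToUtf16(string hex)
    {
        var bytes = new byte[hex.Length / 2];
        for (int i = 0; i < bytes.Length; i++)
        {
            bytes[i] = Convert.ToByte(hex.Substring(2 * i, 2), 16);
        }
        return Encoding.BigEndianUnicode.GetString(bytes);
    }

    // Decodes raw string bytes as big-endian 2-byte codes (an assumption)
    // and looks each code up; unmapped codes become U+FFFD.
    public static string Decode(byte[] raw, IDictionary<int, string> map)
    {
        var sb = new StringBuilder();
        for (int i = 0; i + 1 < raw.Length; i += 2)
        {
            int code = (raw[i] << 8) | raw[i + 1];
            sb.Append(map.TryGetValue(code, out var text) ? text : "\uFFFD");
        }
        return sb.ToString();
    }
}
```

Printing `StringValue` for each string token, as the repro code does, skips this lookup entirely, which would explain gibberish whenever the font's codes are not plain character values.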

Author

tincann commented Aug 18, 2019

Ah, that's too bad! Thank you anyway.

tincann closed this as completed Aug 18, 2019

lock bot commented Jan 18, 2020

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related problems.

lock bot locked as resolved and limited conversation to collaborators Jan 18, 2020