Encoding problem with extracted text from Ghostscript generated pdf #42

tincann · 2019-08-16T16:21:14Z

Summary of the issue

Hi there,

We've been using this library for a long time now in order to extract text from pdfs. However, since we've switched methods for generating the input pdfs, the resulting extracted text is nothing but gibberish. To me it looks like an interesting encoding problem :)

I've compared the extracted text with another pdf library called PdfPig, which extracts the text as expected. However, for performance reasons, iTextSharp is still the preferred option.

To debug the problem, I've looked at the differences in the tokenization code. Here you can find it in the alternative implementation. Certain parts look very familiar, but I noticed that PdfPig also corrects for endianness. I've reimplemented that in the PrTokeniser, but that didn't solve the problem.

We would really appreciate it if you could help us find a solution. Thanks!

Environment

The in-use version: 1.5.1
Operating system: Windows
IDE: VS2019

Example code/Steps to reproduce:

sample-original.pdf This is the original file, where text extraction works in both methods.
sample-recostar.pdf This is the file after it's gone through OCR software (I think it uses Ghostscript to generate the pdf).

using System;
using System.Collections.Generic;
using System.IO;

class Program
{
    static void Main(string[] args)
    {
        var fileNames = new[]
        {
            "sample-original.pdf",
            "sample-recostar.pdf"
        };

        ExtractWithItext(fileNames);
        ExtractWithPdfPig(fileNames);

        Console.ReadLine();
    }

    // Prints every string token found in the raw content stream of page 1.
    // No font decoding is applied; the string operands are written as-is.
    public static void ExtractWithItext(IEnumerable<string> fileNames)
    {
        Console.WriteLine("ITEXT");
        foreach (var fileName in fileNames)
        {
            var reader = new iTextSharp.text.pdf.PdfReader(fileName);

            Console.WriteLine($"OUTPUT {fileName}: ");

            var content = reader.GetPageContent(1);

            var tokenizer = new iTextSharp.text.pdf.PrTokeniser(new iTextSharp.text.pdf.RandomAccessFileOrArray(content));
            while (tokenizer.NextToken())
            {
                if (tokenizer.TokenType == iTextSharp.text.pdf.PrTokeniser.TK_STRING)
                {
                    Console.Write(tokenizer.StringValue);
                }
            }

            reader.Close();

            Console.WriteLine();
        }
    }

    // Prints the words PdfPig extracts from page 1, separated by spaces.
    private static void ExtractWithPdfPig(IEnumerable<string> fileNames)
    {
        Console.WriteLine("PDFPIG");
        foreach (var fileName in fileNames)
        {
            Console.WriteLine($"OUTPUT {fileName}: ");
            using (var stream = File.OpenRead(fileName))
            using (UglyToad.PdfPig.PdfDocument document = UglyToad.PdfPig.PdfDocument.Open(stream))
            {
                var page = document.GetPage(1);

                string fullText = string.Join(" ", page.GetWords());

                Console.WriteLine(fullText);
                Console.WriteLine();
            }
        }
    }
}

Output:

Notice that using iTextSharp in combination with sample-recostar.pdf results in nonsense, while the same file with PdfPig results in the expected text.

ITEXT
OUTPUT sample-original.pdf:
 A Simple PDF File  This is a small demonstration .pdf file -  just for use in the Virtual Mechanics tutorials. More text. And more  text. And more text. And more text. And more text.  And more text. And more text. And more text. And more text. And more  text. And more text. Boring, zzzzz. And more text. And more text. And  more text. And more text. And more text. And more text. And more text.  And more text. And more text.  And more text. And more text. And more text. And more text. And more  text. And more text. And more text. Even more. Continued on page 2 ...
OUTPUT sample-recostar.pdf:
 � � � � � �  �
 � � � � � � � � � � � � �  � � �  � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � �


PDFPIG
OUTPUT sample-original.pdf:
A Simple PDF File This is a small demonstration .pdf file - just for use in the Virtual Mechanics tutorials. More text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. Boring, zzzzz. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. Even more. Continued on page 2 ...

OUTPUT sample-recostar.pdf:
A Simple PDF File This is a small demonstration .pdf file - just for use in the Virtual Mechanics tutorials. More text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. Boring, zzzzz. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. Even more. Continued on page 2...

VahidN · 2019-08-17T12:47:18Z

To process this PDF, we would need the CMapAwareDocumentFont class and its Decode method. That's part of V5.x. Sorry!
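
For readers wondering why the raw string tokens come out as gibberish: string operands in a PDF content stream hold character codes for the current font, not text, so they have to be mapped through the font's encoding or ToUnicode CMap before they are readable. The sketch below only illustrates that idea. It is not the CMapAwareDocumentFont API, and it assumes 2-byte big-endian codes plus a made-up mapping table (`SampleToUnicode`), neither of which has been checked against sample-recostar.pdf.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class ToUnicodeSketch
{
    // Hypothetical table standing in for the entries of a font's ToUnicode CMap.
    // A real CMap is parsed from the PDF; these values are invented for illustration.
    public static readonly Dictionary<int, string> SampleToUnicode = new Dictionary<int, string>
    {
        { 0x0001, "A" },
        { 0x0002, " " },
        { 0x0003, "S" },
    };

    // Decodes a string operand made of 2-byte big-endian character codes.
    // Codes missing from the table become U+FFFD, the Unicode replacement character.
    public static string Decode(byte[] operand, IDictionary<int, string> toUnicode)
    {
        var sb = new StringBuilder();
        for (int i = 0; i + 1 < operand.Length; i += 2)
        {
            int code = (operand[i] << 8) | operand[i + 1];
            sb.Append(toUnicode.TryGetValue(code, out var text) ? text : "\uFFFD");
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // Three 2-byte codes: 0x0001, 0x0002, 0x0003.
        byte[] operand = { 0x00, 0x01, 0x00, 0x02, 0x00, 0x03 };
        Console.WriteLine(Decode(operand, SampleToUnicode)); // prints "A S"
    }
}
```

Printing the same bytes without the lookup would show control characters rather than letters, which matches the kind of output seen for sample-recostar.pdf above.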

tincann · 2019-08-18T10:40:11Z

Ah, that's too bad! Thank you anyway.

lock · 2020-01-18T07:47:40Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related problems.

tincann closed this as completed Aug 18, 2019

lock bot locked as resolved and limited conversation to collaborators Jan 18, 2020
