Incorrect extraction of negative/minus (-) sign #289

RobertHardwick · 2019-09-04T09:38:26Z

First of all, thanks for pdfminer.six!

I am using pdfminer.six (specifically pdf2txt.py) to get access to x y z coordinate data tables held in pdf files from academic papers.

When a negative coordinate is given (e.g. -10), I'm finding that there is quite some variability in the way the negative/minus sign (-) is presented in the extracted output.

Sometimes it is extracted correctly (e.g. Anguera2007.pdf)
In other examples I have found it to be replaced with one of the following symbols:
")" - see Abreu2012.pdf
"≥" - see Agnew2008.pdf
"2" - see Chaminade2002.pdf
"ÿ" - see Costantini2005.pdf
"¡" - see Cross2010.pdf
" " (apparent blank space) - see Casile2010.pdf

Many of these are easy to fix with simple text substitution (I don't care about the rest of the text, just these coordinates).

However, some of these (particularly the replacement with the number 2, or what appears to be a blank space) are much more problematic. In these cases, if the original text was "-4" it could be converted to "24" ("-" replaced with "2") or "4" ("-" replaced with a blank space). While I can check the coordinates manually, I plan to do this with several hundred documents, so would prefer a better solution!
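For the unambiguous symbols, the substitution can be sketched as below. This is a minimal sketch, not part of pdfminer.six: the helper name `restore_minus` and the rule "a stand-in counts only when it directly precedes a digit and does not directly follow a letter or digit" are assumptions for illustration. The "2" and blank-space cases are deliberately left out because they cannot be told apart from real text.

```python
import re

# Symbols reported in this thread as stand-ins for "-".
# "2" and " " are excluded on purpose: they are ambiguous.
MINUS_STAND_INS = ")≥ÿ¡"

# Assumed rule: the stand-in must directly precede a digit and must not
# directly follow an ASCII letter or digit (so "f(x)2" is left alone).
_PATTERN = re.compile(
    r"(?<![0-9A-Za-z])[" + re.escape(MINUS_STAND_INS) + r"](?=\d)"
)


def restore_minus(text: str) -> str:
    """Replace known stand-in symbols before a number with a hyphen-minus."""
    return _PATTERN.sub("-", text)
```

The output would still need a spot check, since a closing parenthesis that legitimately precedes a digit after a space would also be rewritten.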

Any help on this issue would be greatly appreciated (my best guess is that it may have something to do with the specific fonts used in these examples?).

Thanks in advance!

pietermarsman · 2019-09-09T18:39:06Z

I can replicate this issue (at least for Abreu2012.pdf). I also tried to figure out what is going wrong, but that is a lot more difficult.

First of all, the ")" in Abreu2012.pdf is not totally weird. If you open the PDF with Adobe Acrobat Reader, it shows a "-" in the x, y, z table, for example "-16". But if you copy this text into a text editor, you get ")16". This indicates there is some relation between the ")" and the "-".

I investigated the different places where such a mapping/relation is defined and parsed by pdfminer, but I could not find anything that was wrong. There are multiple Type 1 Fonts in Abreu2012.pdf. A Type 1 Font can define Encoding differences or Unicode mappings that change the characters, but there is no mapping from "(" to "-". It does define that the "(" should be mapped to a "(", which is rather odd, and I guess there is a bug in here somewhere. But I cannot find it.

pietermarsman · 2019-09-09T19:29:44Z

I've also checked whether this is a recent bug, but it's not. It's been here for as far back as I can go. The earliest version in which I could get pdf2txt.py to work is 1b47bed (May 2015), and that commit also outputs ")" instead of "-".

pietermarsman · 2019-09-09T20:18:56Z

The same mistake is made by PyPDF2

pietermarsman · 2019-09-09T20:42:18Z

poppler's pdftotext command makes the same mistake.

pietermarsman · 2019-09-09T20:50:54Z

Last observation for today: the "-" character in the x, y, z table is a different character than the other "-" characters in the Abreu2012.pdf.

My current best guess is that "(" is (in some fonts) replaced by a polygon-like structure, that looks like a "-" character, but is in fact a series of coordinates. If you look closely at the dashes that are displayed in the table, you see that they are much wider than the dashes that are used in other places.

pietermarsman · 2019-09-09T21:13:41Z

I also checked how LaTeX saves PDFs when using different hyphens, dashes, or minus signs. All of them end up as the same Unicode character (if you search for one of them, you find them all).

\documentclass{paper}

\usepackage{textcomp}
\usepackage[utf8]{inputenc}
\usepackage{amsmath}
\usepackage{siunitx}

\DeclareUnicodeCharacter{2212}{\textminus}% requires a unicode capable editor

\begin{document}

\mathchardef\mhyphen="2D

\noindent
$hello \mhyphen world$ \\
hello - world \\ 
hello -- world \\
hello --- world \\
hello \textminus world (textcomp minus)\\
hello − world (unicode minus)\\% requires a unicode capable editor
hello - world (normal text minus)\\
hello $-$ world (all math mode)\\    
hello \num{-1} (siunitx textmode)\\
hello $\num{-1}$ (siunitx mathmode)

\end{document}

pietermarsman · 2019-09-09T21:16:06Z

There are also no differences between the different dashes if I create a PDF file with Microsoft Word.

RobertHardwick · 2019-09-24T15:13:46Z

Thanks for following up on this.
For my purposes the most problematic errors are the replacement of the minus signs with blank spaces or the number 2, as in Chaminade2002.pdf and Casile2010.pdf above.
These two seem the trickiest to solve - the rest can be worked around with simple character replacement (at least for my purposes).
Might it be possible to repeat your analyses with these two papers to see if anything specific to them comes up?

I also have a huge list of papers that I'm processing using pdfminer.six. Very happy to provide a full list of papers and their associated errors if it might help to make a pattern emerge.

pietermarsman · 2019-09-24T18:46:35Z

You can easily check this yourself. If you copy the minus sign to another text editor, you can see what the result is. If I copy the minus from Chaminade2002.pdf using Adobe Acrobat Reader DC, I get a "2". For Casile2010.pdf I also get an apparent blank space.

I don't think there is much we can do here since Adobe Acrobat Reader DC is making the same error.

My hypothesis is that the error is caused by a replacement of the character id by a polygon. I don't know how to confirm this hypothesis.

I guess most of the PDFs you are analyzing were created with LaTeX. Is there any chance of getting the source files for those PDFs? This might help to reproduce the problem in a much more isolated way.

RobertHardwick · 2019-09-24T19:24:02Z

Thanks for the suggestion. I'm able to replicate the error, but have found something notable.

From Chaminade2002.pdf I copied '-18' and got '218'.
However, when I copied the text by selecting 'copy with formatting', it took a lot longer, but I got the correct value of '-18'.

So, is there a way to tell pdfminer.six to copy with formatting?

(I'm afraid it's unlikely that I can get access to the original LaTeX files.)

pietermarsman · 2019-09-24T19:26:21Z

What application do you use to do "copy with formatting"?

RobertHardwick · 2019-09-24T19:55:15Z

Adobe Acrobat Pro. It doesn't seem to be a typical option (I can't see it when I open the PDF with Google Chrome, for example).

RobertHardwick · 2019-09-24T20:07:10Z

Also, when I select 'edit text', then copy the '-18', I get '􀀀18'.

pietermarsman · 2019-09-25T07:26:11Z

And what happens if you copy the minus sign once again, for example to notepad?

RobertHardwick · 2019-09-25T13:30:08Z

When I copy the text into notepad, a small open square appears in place of the minus sign. When I copy it here it gives the 􀀀 symbol.
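To see which character was actually copied, the code points can be listed with Python's standard `unicodedata` module. This is only a sketch: the helper names are made up here, and the sample string assumes the pasted symbol is U+100000 (a private-use code point), which has not been verified against the PDF.

```python
import unicodedata


def describe_chars(text: str) -> list:
    """Return (code point, Unicode category, name) for each character."""
    return [
        (
            f"U+{ord(ch):04X}",
            unicodedata.category(ch),
            unicodedata.name(ch, "<unnamed>"),  # private-use chars have no name
        )
        for ch in text
    ]


def has_private_use(text: str) -> bool:
    """True if any character is in a private-use area (category 'Co')."""
    return any(unicodedata.category(ch) == "Co" for ch in text)


# Assumption: the symbol pasted from 'edit text' is U+100000.
sample = "\U00100000" + "18"
for row in describe_chars(sample):
    print(row)
```

A private-use code point has no standard meaning, which would explain why notepad shows an empty square: the editor's font has no glyph for it.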

RobertHardwick · 2019-10-07T11:50:00Z

Does the information above help, or is there any other information I could provide that might help? If there might be a way to resolve the 2/blank issue then, at least for my purposes, all is good to proceed.

pietermarsman · 2019-10-15T17:06:42Z

My hypothesis is that the error is caused by a replacement of the character id by a polygon. But I don't know how to confirm this hypothesis.

If the hypothesis is true, I can't think of a way to solve this.

pietermarsman · 2022-03-20T12:07:29Z

I know a bit more about the conversion to readable text now.

PDF separately represents glyphs (what you see) and a ToUnicode map that is a universal representation of the meaning of those glyphs. When you copy text (using any PDF reader) you get the value of the ToUnicode map for each of the glyphs. If there is an error in the ToUnicode map, e.g. missing or unaligned items in the map, you get the wrong output.

That is the case with your PDFs. If you copy the text with any PDF reader, you will see that you get the same wrong result. In that case there is nothing that pdfminer.six can do to get you the correct output.
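The split between glyphs and the ToUnicode map can be illustrated with a toy model. This is not pdfminer.six code and not how a PDF stores these tables; the character codes, glyph names, and both maps are invented for illustration, with one deliberately wrong entry.

```python
# Toy model only. All codes and mappings below are made up.

# What the font draws for each character code (glyph names).
GLYPHS = {0x00: "minus", 0x31: "one", 0x38: "eight"}

# What a text extractor reports for each code.
# The entry for 0x00 is wrong on purpose: it maps the minus glyph to "2".
TO_UNICODE_BROKEN = {0x00: "2", 0x31: "1", 0x38: "8"}
TO_UNICODE_CORRECT = {0x00: "\u2212", 0x31: "1", 0x38: "8"}


def render(codes):
    """What a viewer shows: one glyph per character code."""
    return [GLYPHS[c] for c in codes]


def extract(codes, to_unicode):
    """What copy/paste or a text extractor returns for the same codes."""
    # Codes missing from the map become U+FFFD (replacement character).
    return "".join(to_unicode.get(c, "\ufffd") for c in codes)


codes = [0x00, 0x31, 0x38]  # drawn on the page as "minus one eight"
print(render(codes))
print(extract(codes, TO_UNICODE_BROKEN))
```

In this model the page still displays a minus sign, yet every extractor that trusts the broken map returns "218", which matches what was observed for Chaminade2002.pdf.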

FYI, when this PR is accepted pdfminer.six will show more verbosely that something is wrong with the ToUnicode map: #731

pietermarsman added the type: bug and type: new feature labels Oct 12, 2019

pietermarsman removed the type: new feature label Mar 20, 2022

pietermarsman closed this as completed Mar 20, 2022
