
Incorrect extraction of negative/minus (-) sign #289

Closed
RobertHardwick opened this issue Sep 4, 2019 · 18 comments

Comments

@RobertHardwick

First of all, thanks for pdfminer.six!

I am using pdfminer.six (specifically pdf2txt.py) to get access to x y z coordinate data tables held in pdf files from academic papers.

When a negative coordinate is given (e.g. -10), I'm finding that there is quite some variability in the way the negative/minus sign (-) is presented in the extracted output.

Sometimes it is extracted correctly (e.g. Anguera2007.pdf)
In other examples I have found it to be replaced with one of the following symbols:
")" - see Abreu2012.pdf
"≥" - see Agnew2008.pdf
"2" - see Chaminade2002.pdf
"ÿ" - see Costantini2005.pdf
"¡" - see Cross2010.pdf
" " (apparent blank space) - see Casile2010.pdf

Many of these are easy to fix with simple text substitution (I don't care about the rest of the text, just these coordinates).
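For the easy cases, the substitution could look roughly like the sketch below. It is only an illustration: the function name `restore_minus` and the choice of stand-in characters are assumptions based on the symbols listed above, and it should only be run on text that is known to be a coordinate table, since ")" or "≥" before a digit can be legitimate elsewhere in a paper.

```python
import re

# Stand-ins for "-" reported in this thread that rarely occur directly
# before a digit. "2" and " " are deliberately left out: they cannot be
# told apart from genuine characters by substitution alone.
STAND_INS = ")≥ÿ¡"

# Replace a stand-in only when it is followed by a digit and is NOT
# preceded by a word character or "(" (e.g. the ")" in "(see Table 1)2").
_PATTERN = re.compile(r"(?<![\w(])[" + re.escape(STAND_INS) + r"](?=\d)")


def restore_minus(text):
    """Return text with known stand-in symbols before digits turned into '-'."""
    return _PATTERN.sub("-", text)
```

For example, `restore_minus(")16 ≥4 ÿ22 ¡8")` gives `"-16 -4 -22 -8"`, while the ")" in `"(see Table 1)2"` is left alone.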

However, some of these (particularly the replacement with the number 2, or what appears to be a blank space) are much more problematic. In these cases, if the original text was "-4" it could be converted to "24" ("-" replaced with "2") or "4" ("-" replaced with a blank space). While I can check the coordinates manually, I plan to do this with several hundred documents, so would prefer a better solution!

Any help on this issue would be greatly appreciated (my best guess is that it may have something to do with the specific fonts used in these examples?).

Thanks in advance!

@pietermarsman
Member

pietermarsman commented Sep 9, 2019

I can replicate this issue (at least for Abreu2012.pdf). I also tried to figure out what is going wrong, but that is a lot more difficult.

First of all, the ")" in Abreu2012.pdf is not totally weird. If you open the PDF with Adobe Acrobat Reader it shows a "-" in the x, y, z table, for example "-16". But if you copy this text into a text editor, you get ")16". This indicates there is some relation between the ")" and the "-".

I investigated the different places where such a mapping/relation is defined and parsed by pdfminer, but I could not find anything that was wrong. There are multiple Type 1 Fonts in Abreu2012.pdf. A Type 1 Font can define Encoding differences or Unicode mappings that change the characters, but there is no mapping from "(" to "-". It does define that the "(" should be mapped to a "(", which is rather odd, and I guess there is a bug in here somewhere. But I cannot find it.

@pietermarsman
Member

I've also checked whether this is a recent bug, but it's not. It's been here for as long as I can go back. The earliest version for which I could get pdf2txt.py to work is 1b47bed (May 2015), and that commit also outputs ")" instead of "-".

@pietermarsman
Member

The same mistake is made by PyPDF2

@pietermarsman
Member

Just like poppler's pdftotext command

@pietermarsman
Member

Last observation for today: the "-" character in the x, y, z table is a different character than the other "-" characters in the Abreu2012.pdf.

My current best guess is that "(" is (in some fonts) replaced by a polygon-like structure, that looks like a "-" character, but is in fact a series of coordinates. If you look closely at the dashes that are displayed in the table, you see that they are much wider than the dashes that are used in other places.
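One way to look into this is to list which font each suspicious character comes from. The sketch below is an assumption-laden illustration, not something that was run in this thread: it uses pdfminer.six's `extract_pages` and the `fontname` attribute of `LTChar`, and the helper names `chars_with_fonts` and `fonts_before_digits` are made up for this example. If the stand-in characters in the table all come from one font that the ordinary dashes do not use, that would point at that font's encoding.

```python
from collections import Counter


def chars_with_fonts(pdf_path):
    """Yield (text, fontname) for every character pdfminer.six lays out."""
    # Imported lazily so the counting helper below works without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTContainer

    def walk(item):
        if isinstance(item, LTChar):
            yield item.get_text(), item.fontname
        elif isinstance(item, LTContainer):
            for child in item:
                yield from walk(child)

    for page in extract_pages(pdf_path):
        yield from walk(page)


def fonts_before_digits(pairs, suspects=")≥2ÿ¡ "):
    """Count (character, font) pairs where a suspect character precedes a digit."""
    pairs = list(pairs)
    suspect_set = set(suspects)
    counts = Counter()
    for (text, font), (next_text, _) in zip(pairs, pairs[1:]):
        if text in suspect_set and next_text.isdigit():
            counts[(text, font)] += 1
    return counts
```

Usage would be along the lines of `fonts_before_digits(chars_with_fonts("Abreu2012.pdf"))`. Genuine digits such as the "2" in "24" are counted too, so the interesting signal is a suspect character whose font differs from that of the digits around it.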

@pietermarsman
Member

I also checked how LaTeX saves PDFs when using different hyphens, dashes or minus signs. All of them are the same Unicode character (e.g. if you search for one of them, you find them all).

\documentclass{paper}

\usepackage{textcomp}
\usepackage[utf8]{inputenc}
\usepackage{amsmath}
\usepackage{siunitx}

\DeclareUnicodeCharacter{2212}{\textminus}% requires a unicode capable editor

\begin{document}

\mathchardef\mhyphen="2D

\noindent
$hello \mhyphen world$ \\
hello - world \\ 
hello -- world \\
hello --- world \\
hello \textminus world (textcomp minus)\\
hello − world (unicode minus)\\% requires a unicode capable editor
hello - world (normal text minus)\\
hello $-$ world (all math mode)\\    
hello \num{-1} (siunitx textmode)\\
hello $\num{-1}$ (siunitx mathmode)

\end{document}

@pietermarsman
Member

And also no differences between different dashes if I create a pdf file with Microsoft Word.

@RobertHardwick
Author

Thanks for following up on this.
For my purposes the most problematic errors are the replacement of the minus signs with blank spaces or the number 2, as in Chaminade2002.pdf and Casile2010.pdf above.
These two seem the trickiest to solve - the rest can be worked around with simple character replacement (at least for my purposes).
Might it be possible to repeat your analyses with these two papers to see if anything specific to them comes up?

I also have a huge list of papers that I'm processing using pdfminer.six. Very happy to provide a full list of papers and their associated errors if it might help to make a pattern emerge.

@pietermarsman
Member

You can easily check this yourself. If you copy the minus sign into a text editor you can see what the result is. If I copy the minus from Chaminade2002.pdf using Adobe Acrobat Reader DC I get a "2". For Casile2010.pdf I also get an apparent blank space.

I don't think there is much we can do here since Adobe Acrobat Reader DC is making the same error.

My hypothesis is that the error is caused by a replacement of the character id by a polygon. I don't know how to confirm this hypothesis.

I guess most of the PDFs you are analyzing were created with LaTeX. Do you have any chance of getting the source files for those PDFs? This might help to reproduce the problem in a much more isolated way.

@RobertHardwick
Author

Thanks for the suggestion, I'm able to replicate the error, but have found something notable.

From Chaminade2002.pdf I copied '-18' and got '218'.
However, when I tried copying the text and selecting 'copy with formatting', it took a lot longer, but I get the correct value of '-18'.

So, is there a way to tell pdfminer.six to copy with formatting?

(It's unlikely that I can get access to the original LaTeX files, I'm afraid.)

@pietermarsman
Member

What application do you use to do "copy with formatting"?

@RobertHardwick
Author

Adobe Acrobat Pro. It doesn't seem to be a typical option (I can't see it when I open the PDF with Google Chrome, for example).

@RobertHardwick
Author

Also, when I select 'edit text' and then copy the '-18', I get '􀀀18'.

@pietermarsman
Member

And what happens if you copy the minus sign once again, for example to notepad?

@RobertHardwick
Author

When I copy the text into notepad, a small open square appears in place of the minus sign. When I copy it here it gives the 􀀀 symbol.

@RobertHardwick
Author

Does the information above help, or is there any other information I could provide that might help? If there might be a way to resolve the 2/blank issue then, at least for my purposes, all is good to proceed.

@pietermarsman
Member

My hypothesis is that the error is caused by a replacement of the character id by a polygon. But I don't know how to confirm this hypothesis.

If the hypothesis is true, I can't think of a way how this can be solved.

@pietermarsman
Member

I know a bit more about the conversion to readable text now.

PDF separately represents glyphs (what you see) and a ToUnicode map that is a universal representation of the meaning of those glyphs. When you copy text (using any PDF reader) you get the value of the ToUnicode map for each of the glyphs. If there is an error in the ToUnicode map, e.g. missing or unaligned items in the map, you get the wrong output.
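To make that concrete, here is a toy model of the lookup. It is not pdfminer.six's actual code: the glyph codes, the two maps and the function name `extract_text` are all invented for illustration. It only shows how a single wrong ToUnicode entry turns a rendered "-18" into an extracted "218", as seen with Chaminade2002.pdf.

```python
def extract_text(glyph_codes, to_unicode):
    """Map each glyph code to text the way a copy/extract operation would."""
    # A code that is missing from the map becomes U+FFFD (replacement character).
    return "".join(to_unicode.get(code, "\ufffd") for code in glyph_codes)


# Hypothetical glyph codes for what is drawn on the page as "-18".
glyph_codes = [1, 2, 3]

# What the map should say: code 1 is a minus sign (U+2212).
correct_map = {1: "\u2212", 2: "1", 3: "8"}

# A misaligned map: the glyph still looks like a minus, but code 1 maps to "2".
broken_map = {1: "2", 2: "1", 3: "8"}

print(extract_text(glyph_codes, correct_map))
print(extract_text(glyph_codes, broken_map))
```

The page looks identical in both cases because rendering only uses the glyph shapes; only the extracted text differs.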

That is the case with your PDFs. If you copy the text with any PDF reader you will see that you get the same wrong result. In that case there is nothing that pdfminer.six can do to get you the correct output.

FYI, when this PR is accepted pdfminer.six will show more verbosely that something is wrong with the ToUnicode map: #731
