layout algorithm merges non-adjacent text lines #658

Closed
0xabu opened this issue Aug 16, 2021 · 2 comments

@0xabu
Contributor

0xabu commented Aug 16, 2021

Running pdf2txt on the attached PDF (which came to me from @joelostblom via 0xabu/pdfannots#24) produces output in the wrong order.

Expected output (from Poppler pdftotext):

Heading
Link to heading that is working with vim-pandoc.
Link to heading “that is” not working with vim-pandoc.

Subheading
Some “more text”

1

Actual output (from pdf2txt.py with defaults):

Link to heading that is working with vim-pandoc.

Link to heading “that is” not working with vim-pandoc.

Heading

Subheading

Some “more text”

1

It is possible to get the correct result by disabling the default layout detection algorithm. After applying #657, we can use pdf2txt.py --boxes-flow=disabled and get the output in the expected order.

@0xabu
Contributor Author

0xabu commented Aug 16, 2021

Since I spent some time digging into this, I thought it might be useful to document what goes wrong in the default case. The relevant code in LTLayoutContainer.group_textboxes iteratively merges text lines that are close together into text groups, until everything is in one sorted group. The lines to merge are chosen using a "distance" function (dist), defined as the area of the bounding box containing both text lines minus the areas of the two text lines themselves. By minimising this distance, the algorithm tries to select lines that are "close" together.
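For illustration, the distance function described above can be sketched roughly as follows. This is a simplified stand-in and not the library's code: `Box` is a hypothetical tuple of bounding-box coordinates rather than a pdfminer layout object, and the coordinates are made up.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Hypothetical axis-aligned bounding box (not a pdfminer class)."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def area(self) -> float:
        return (self.x1 - self.x0) * (self.y1 - self.y0)


def dist(a: Box, b: Box) -> float:
    """Area of the box enclosing a and b, minus the areas of a and b."""
    enclosing = Box(min(a.x0, b.x0), min(a.y0, b.y0),
                    max(a.x1, b.x1), max(a.y1, b.y1))
    return enclosing.area - a.area - b.area


# Two stacked lines of equal width with a 2pt gap between them (made-up numbers).
upper = Box(100, 680, 480, 692)
lower = Box(100, 666, 480, 678)
print(dist(upper, lower))  # only the empty strip between the lines contributes
```

For two lines of equal width, the result is just the area of the gap between them, which is why tightly stacked lines of similar width get merged first.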

On this page, the lines are:

  1. Heading (I'll call this short)
  2. Link to heading that... (I'll call this long)
  3. Link to heading "that is" ... (long)
  4. Subheading (short)
  5. Some "more text" (short)
  6. The final page number "1"

The algorithm first:

  1. group short lines 4 and 5
  2. group long lines 2 and 3

Now we have:

  • Line 1 ("Heading") which is short
  • The group of long lines 2 and 3
  • The group of short lines 4 and 5
  • The page number line

So far so good, but this is where the problem arises. Because line 1 is short, its distance (the area difference defined above) from the group of short lines 4 and 5 is smaller than its distance from the long lines that actually follow it, so it is incorrectly merged with the short group, producing the final order:

  • The group of long lines 2 and 3
  • A merged group of "Heading" (line 1) followed by lines 4 and 5
  • The page number line
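To make the bad merge concrete, here is a small worked example of the area-difference distance. The coordinates are invented to resemble this page (they are not taken from the PDF), and plain tuples stand in for pdfminer's layout objects.

```python
def area(box):
    x0, y0, x1, y1 = box
    return (x1 - x0) * (y1 - y0)


def union(a, b):
    """Smallest box enclosing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]),
            max(a[2], b[2]), max(a[3], b[3]))


def dist(a, b):
    """Area of the enclosing box minus the areas of a and b."""
    return area(union(a, b)) - area(a) - area(b)


# Made-up (x0, y0, x1, y1) coordinates in points, y increasing upwards.
heading = (100, 700, 160, 714)   # line 1, short
long_2 = (100, 680, 480, 692)    # line 2, long
long_3 = (100, 666, 480, 678)    # line 3, long
short_4 = (100, 640, 170, 652)   # line 4, short
short_5 = (100, 626, 185, 638)   # line 5, short

long_group = union(long_2, long_3)
short_group = union(short_4, short_5)

to_long = dist(heading, long_group)
to_short = dist(heading, short_group)
print(to_long, to_short)
```

With these numbers the heading is at distance 7520 from the long group directly below it, but only 4430 from the short group further down, so a purely distance-based merge picks the short group.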

It looks like a mistake to group lines purely based on distance, particularly when there are other objects in between the two (the "long" group, in this case). Indeed, the helper function isany can detect such objects, but it is never called because skip_isany is never True. I suspect #315 may have regressed this.
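As a rough sketch of what an "anything in between" check could look like: the function below reports every other object that overlaps the bounding box enclosing two merge candidates. It is an assumption-laden simplification, not pdfminer's isany. The name `objects_between` is hypothetical, it does a linear scan over plain tuples, and the coordinates are invented.

```python
def overlaps(box, region):
    """True if box and region share a non-empty area (touching edges do not count)."""
    return not (box[2] <= region[0] or region[2] <= box[0]
                or box[3] <= region[1] or region[3] <= box[1])


def objects_between(a, b, all_objects):
    """Objects other than a and b that overlap the box enclosing a and b."""
    region = (min(a[0], b[0]), min(a[1], b[1]),
              max(a[2], b[2]), max(a[3], b[3]))
    return [obj for obj in all_objects
            if obj not in (a, b) and overlaps(obj, region)]


# Made-up (x0, y0, x1, y1) coordinates in points, y increasing upwards.
heading = (100, 700, 160, 714)
long_group = (100, 666, 480, 692)
short_group = (100, 626, 185, 652)
page_number = (300, 100, 306, 112)
everything = [heading, long_group, short_group, page_number]

blockers = objects_between(heading, short_group, everything)
print(blockers)
```

Here the long group lies inside the box spanning the heading and the short group, so a check like this would flag that merge as suspect, while the heading and the long group have nothing between them.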

@0xabu
Contributor Author

0xabu commented Aug 16, 2021

This appears to be a regression from #315 or 2bee7d8. The commit immediately prior (6cc78ee) produces the correct output.
