layout algorithm merges non-adjacent text lines #658

0xabu · 2021-08-16T16:37:46Z

Running pdf2txt on the attached PDF (which came to me from @joelostblom via 0xabu/pdfannots#24) produces output in the wrong order.

Expected output (from Poppler pdftotext):

Heading
Link to heading that is working with vim-pandoc.
Link to heading “that is” not working with vim-pandoc.

Subheading
Some “more text”

1

Actual output (from pdf2txt.py with defaults):

Link to heading that is working with vim-pandoc.

Link to heading “that is” not working with vim-pandoc.

Heading

Subheading

Some “more text”

1

It is possible to get the correct result by disabling the default layout detection algorithm. After applying #657, we can use pdf2txt.py --boxes-flow=disabled and get the output in the expected order.


0xabu · 2021-08-16T16:55:26Z

Since I spent some time digging into this, I thought it might be useful to document here what goes wrong in the default case. The relevant code in LTLayoutContainer.group_textboxes iteratively merges text lines that are close together into text groups, until everything is in one sorted group. Lines to merge are chosen using a "distance" function (dist), defined as the area of the bounding box containing both text lines minus the areas of the two lines themselves. By minimising this distance, the algorithm tries to select lines that are "close" together.
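To make the distance definition concrete, here is a minimal sketch of it. The `Box` type and its field names are hypothetical stand-ins for a text line's bounding box, not pdfminer's own classes. Only the formula follows the description above.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box (hypothetical stand-in for a text line)."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def area(self) -> float:
        return (self.x1 - self.x0) * (self.y1 - self.y0)


def dist(a: Box, b: Box) -> float:
    """Area of the box enclosing a and b, minus the areas of a and b."""
    enclosing = Box(
        min(a.x0, b.x0), min(a.y0, b.y0),
        max(a.x1, b.x1), max(a.y1, b.y1),
    )
    return enclosing.area - a.area - b.area


# Two equally wide lines stacked with no gap leave no "wasted" area.
line_a = Box(0, 10, 100, 20)
line_b = Box(0, 0, 100, 10)
# A third line further down the page leaves a large empty band in between.
line_c = Box(0, -30, 100, -20)

print(dist(line_a, line_b))
print(dist(line_a, line_c))
```

Note that the metric measures wasted area, not vertical gap, so the widths of the two lines matter as much as how far apart they are.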

On this page, the lines are:

1. Heading (I'll call this short)
2. Link to heading that... (I'll call this long)
3. Link to heading "that is" ... (long)
4. Subheading (short)
5. Some "more text" (short)
6. The final page number "1"

The algorithm first:

group short lines 4 and 5
group long lines 2 and 3

Now we have:

Line 1 ("Heading") which is short
The group of long lines 2 and 3
The group of short lines 4 and 5
The page number line

So far so good, but here the problem arises. Because line 1 is short, its distance (the area difference defined above) from the group of short lines 4 and 5 is smaller than its distance from the long lines that actually follow it, so it is merged incorrectly with the short group, producing the final order:

The group of long lines 2 and 3
A merged group of "Heading" (line 1) followed by lines 4 and 5
The page number line
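The wrong merge can be reproduced with numbers. The sketch below uses made-up coordinates (not measured from the attached PDF) for the heading, the long group and the short group, and applies the area-difference distance described earlier. The `Box` type and all names are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Hypothetical bounding box; y grows upward, as in PDF coordinates."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def area(self) -> float:
        return (self.x1 - self.x0) * (self.y1 - self.y0)


def dist(a: Box, b: Box) -> float:
    """Area of the enclosing box minus the areas of a and b."""
    enclosing = Box(
        min(a.x0, b.x0), min(a.y0, b.y0),
        max(a.x1, b.x1), max(a.y1, b.y1),
    )
    return enclosing.area - a.area - b.area


# Illustrative coordinates only.
heading = Box(100, 700, 160, 715)      # line 1, short
long_group = Box(100, 660, 400, 690)   # lines 2 and 3, long
short_group = Box(100, 600, 180, 640)  # lines 4 and 5, short

d_long = dist(heading, long_group)     # 6600 with these coordinates
d_short = dist(heading, short_group)   # 5100 with these coordinates

# The greedy step picks the pair with the smallest distance.
closest = min(
    [("long_group", d_long), ("short_group", d_short)],
    key=lambda pair: pair[1],
)[0]
print(d_long, d_short, closest)
```

With these numbers, the heading's nearest neighbour is the short group, even though the long group sits directly beneath it. The wide enclosing box needed to cover the long lines makes that pairing look more expensive.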

It looks like a mistake to group lines purely based on distance, in particular when there are objects in between the two (the "long" group, in this case). Indeed, the helper function isany can detect such objects, but it is never called because skip_isany is never True. I suspect #315 may have regressed this.
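As an illustration of the kind of check that would catch this case, here is a simplified sketch that asks whether any other object overlaps the box enclosing two merge candidates. It is not pdfminer's isany implementation. The function names, the `Box` type and the coordinates are all assumptions made for the example.

```python
from typing import Iterable, NamedTuple


class Box(NamedTuple):
    """Hypothetical bounding box; y grows upward, as in PDF coordinates."""
    x0: float
    y0: float
    x1: float
    y1: float


def enclosing(a: Box, b: Box) -> Box:
    """Smallest box containing both a and b."""
    return Box(
        min(a.x0, b.x0), min(a.y0, b.y0),
        max(a.x1, b.x1), max(a.y1, b.y1),
    )


def overlaps(a: Box, b: Box) -> bool:
    """True if the two boxes share some interior area."""
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1


def has_object_between(a: Box, b: Box, all_boxes: Iterable[Box]) -> bool:
    """True if a third box intrudes into the region spanned by a and b."""
    region = enclosing(a, b)
    return any(
        other is not a and other is not b and overlaps(region, other)
        for other in all_boxes
    )


# Illustrative coordinates only.
heading = Box(100, 700, 160, 715)
long_group = Box(100, 660, 400, 690)
short_group = Box(100, 600, 180, 640)
boxes = [heading, long_group, short_group]

# The long group lies between the heading and the short group.
print(has_object_between(heading, short_group, boxes))
# Nothing lies between the heading and the long group.
print(has_object_between(heading, long_group, boxes))
```

A merge step that deferred or rejected candidate pairs for which such a check returns True would not join "Heading" to the short group ahead of the long lines in this example.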

0xabu · 2021-08-16T17:02:33Z

This appears to be a regression from #315 or 2bee7d8. The commit immediately prior (6cc78ee) produces the correct output.

0xabu mentioned this issue Aug 17, 2021

Fix regression in page layout that sometimes returned text lines out of order #659

Merged

0xabu closed this as completed Jan 26, 2022
