layout algorithm merges non-adjacent text lines #658
Comments
Since I spent some time digging into this, I thought it might be useful to document here what goes wrong in the default case. The relevant code in LTLayoutContainer.group_textboxes iteratively merges text lines that are close together into text groups, until everything is in one sorted group. Lines to merge into groups are chosen using a "distance" function (an area difference).

On this page, the lines are:
The algorithm first:
Now we have:
So far so good, but here the problem arises. Because line 1 is short, its distance (defined above as an area difference) from the second group of short lines is smaller than its distance from the long lines that actually follow it, so it is merged incorrectly with the short group, producing the final order:
It looks like a mistake to group lines purely based on distance, particularly when there are other objects between the two (the "long" group, in this case). Indeed, the helper function
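To make the failure mode concrete, here is a minimal sketch in plain Python. It assumes the "distance" between two boxes is the area of their common bounding box minus the areas of the two boxes themselves, which is my reading of "area difference" above. The names Box, area_distance, hull and overlaps are hypothetical and are not pdfminer APIs, and the coordinates are made up rather than taken from the attached PDF.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box (hypothetical stand-in for a layout object)."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def area(self) -> float:
        return (self.x1 - self.x0) * (self.y1 - self.y0)


def hull(a: Box, b: Box) -> Box:
    """Smallest box enclosing both a and b."""
    return Box(min(a.x0, b.x0), min(a.y0, b.y0),
               max(a.x1, b.x1), max(a.y1, b.y1))


def area_distance(a: Box, b: Box) -> float:
    """Assumed distance: area of the enclosing box not covered by a or b."""
    return hull(a, b).area - a.area - b.area


def overlaps(a: Box, b: Box) -> bool:
    """True if the two boxes intersect with non-zero area."""
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1


# Made-up page: a short line on top, long lines right below it,
# and a group of short lines further down.
short_line = Box(0, 100, 50, 110)
long_group = Box(0, 60, 400, 98)
short_group = Box(0, 20, 60, 58)

d_long = area_distance(short_line, long_group)
d_short = area_distance(short_line, short_group)
print(f"distance to adjacent long lines: {d_long}")
print(f"distance to far short group:     {d_short}")

# The long lines sit inside the region spanned by the "closest" pair.
blocked = overlaps(hull(short_line, short_group), long_group)
print(f"long lines lie between the merged pair: {blocked}")
```

With these numbers the short line is "closer" to the distant short group than to the long lines directly below it, even though the long lines sit between the two. A check along the lines of overlaps is one way such a merge could be rejected.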
Running pdf2txt on the attached PDF (which came to me from @joelostblom via 0xabu/pdfannots#24) produces output in the wrong order.
Expected output (from Poppler pdftotext):

Actual output (from pdf2txt.py with defaults):

It is possible to get the correct result by disabling the default layout detection algorithm. After applying #657, we can use pdf2txt.py --boxes-flow=disabled and get the output in the expected order.