Add feature to remove duplicate header rows #76

Closed
jstockwin opened this issue May 5, 2020 · 2 comments · Fixed by #89

Comments

@jstockwin
Owner

It is often the case that if a table goes over a page break then the header is repeated on the new page.

Even though it's not strictly a PDF parsing thing, it might be nice to add a util to handle this case (similar to how we already handle adding the header to the table, even though that's not strictly PDF parsing either).

Essentially, if the header row repeats then the repeat should be removed. I think all rows would be checked against the header row, and any row that matches it in both text and font would be removed from the table. We'd have to keep track of the removed rows so that the checks still pass (we have checks to ensure the correct number of elements were detected).

There should be a parameter to enable this behaviour and it should default to False.
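A minimal sketch of the behaviour described above, not the library's actual implementation. The `Cell` type, the function name `remove_duplicate_header_rows` and the `enabled` parameter are all hypothetical names chosen for illustration. It treats the first row as the header, drops later rows that equal it in both text and font, and returns the removed rows so that an element-count check can still account for them.

```python
from typing import List, NamedTuple, Optional, Tuple


class Cell(NamedTuple):
    """Hypothetical stand-in for a parsed PDF element."""
    text: str
    font: str


# A row is a list of cells; None marks a gap in the table.
Row = List[Optional[Cell]]


def remove_duplicate_header_rows(
    table: List[Row], enabled: bool = False
) -> Tuple[List[Row], List[Row]]:
    """Return (cleaned table, removed rows).

    Disabled by default, in which case the table is returned unchanged.
    """
    if not enabled or not table:
        return table, []

    header = table[0]
    kept: List[Row] = [header]
    removed: List[Row] = []
    for row in table[1:]:
        # NamedTuple equality compares text and font; list equality is
        # positional, so gaps (None) must line up as well.
        if row == header:
            removed.append(row)
        else:
            kept.append(row)
    return kept, removed
```

Returning the removed rows alongside the cleaned table is one way to keep the element-count checks passing: the caller can verify that the kept and removed rows together still cover every detected element.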

@paulopaixaoamaral
Collaborator

What should happen when we want to remove duplicate header rows by calling extract_table on tables with gaps?

For example:

| header_elem_1 | header_elem_2
| elem_1        | elem_2
| header_elem_3 |               | header_elem_4

If we call extract_table() on the above table, we will get:

[
 [header_elem_1, header_elem_2, None], 
 [elem_1, elem_2, None], 
 [header_elem_3, None, header_elem_4]
]

Now assuming that header_elem_1 == header_elem_3 and header_elem_2 == header_elem_4, should we consider that the last row is a duplicate header row (despite the fact that in the final result they are different rows)?

@jstockwin
Owner Author

jstockwin commented May 21, 2020

@paulopaixaoamaral No, I don't think so.

I think a header ["a", "b", None] should match ["a", "b", None] and only that. The gaps are positional, so ["a", "b", None] and ["a", None, "b"] are very different.

So yeah, I think a row should only be removed if (a) all the gaps are in the same places as in the header, and (b) the remaining elements are in the right places and have (i) the same text and (ii) the same font.
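To make the matching rule concrete, here is a rough sketch of such a check. It assumes a hypothetical cell representation of `(text, font)` tuples with `None` for gaps, and `is_duplicate_header` is an illustrative name rather than anything in the library.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical cell representation: (text, font), or None for a gap.
Cell = Optional[Tuple[str, str]]


def is_duplicate_header(header: Sequence[Cell], row: Sequence[Cell]) -> bool:
    """Return True only if row matches header position by position."""
    if len(header) != len(row):
        return False
    for head_cell, cell in zip(header, row):
        # (a) Gaps must be in the same place in both rows.
        if (head_cell is None) != (cell is None):
            return False
        if head_cell is None or cell is None:
            continue
        head_text, head_font = head_cell
        text, font = cell
        # (b) Remaining elements need (i) the same text and (ii) the same font.
        if head_text != text or head_font != font:
            return False
    return True


header = [("a", "Bold"), ("b", "Bold"), None]
print(is_duplicate_header(header, [("a", "Bold"), ("b", "Bold"), None]))  # True
print(is_duplicate_header(header, [("a", "Bold"), None, ("b", "Bold")]))  # False
```

The second call is the case from the question above: the same elements with a gap in a different column are not treated as a repeat of the header.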
