It is often the case that if a table goes over a page break then the header is repeated on the new page.
Even though it's not strictly a pdf parsing thing, it might be nice to add a util to handle this case (similar to the fact we handle adding the header to the table even though that's not strictly pdf parsing).
Essentially, if the header row repeats then it should be removed. I think all rows would be checked against the header row, and any row that matches it in both text and font would be removed from the table. We'd have to keep track of the removed rows somehow so that the checks still pass (we have checks to ensure the correct number of elements were detected).
There should be a parameter to enable this behaviour and it should default to False.
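A rough sketch of what this could look like, not tied to the library's actual API. The function name `drop_repeated_headers`, the parameter name `remove_duplicate_header_rows`, and the `(text, font)` tuple used for cells are all assumptions made for illustration. It returns the removed rows alongside the kept ones so that an element-count check can still account for everything that was detected.

```python
from typing import List, Optional, Tuple

# Hypothetical cell representation: a (text, font) pair, or None for a gap.
Cell = Optional[Tuple[str, str]]
Row = List[Cell]


def drop_repeated_headers(
    table: List[Row], remove_duplicate_header_rows: bool = False
) -> Tuple[List[Row], List[Row]]:
    """Return (kept_rows, removed_rows); the first row is treated as the header."""
    # Off by default: the table is returned unchanged and nothing is removed.
    if not remove_duplicate_header_rows or not table:
        return list(table), []

    header = table[0]
    kept: List[Row] = [header]
    removed: List[Row] = []
    for row in table[1:]:
        # A row counts as a repeat only if every cell equals the header's
        # cell in the same position (same text and same font).
        if row == header:
            removed.append(row)
        else:
            kept.append(row)
    return kept, removed
```

With this shape, `len(kept) + len(removed)` always equals the original row count, which is one way the existing element-count checks could be kept passing.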
Now assuming that header_elem_1 == header_elem_3 and header_elem_2 == header_elem_4, should we consider that the last row is a duplicate header row (despite the fact that in the final result they are different rows)?
I think a header ["a", "b", None] should match ["a", "b", None] and only that. The gaps are positional, so ["a", "b", None] and ["a", None, "b"] are very different.
So yeah, I think it should only be removed if (a) all the gaps are in the same place, and (b) the remaining elements are in the right places and have (i) the same text and (ii) the same font.
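To make the rule concrete, here is a minimal sketch of the comparison. `Elem` and `is_duplicate_header` are hypothetical names made up for this example (the real element type will differ), and a gap is assumed to be represented by `None`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Elem:
    """Hypothetical stand-in for a table element: just text and font."""
    text: str
    font: str


def is_duplicate_header(
    header: List[Optional[Elem]], row: List[Optional[Elem]]
) -> bool:
    """True only if gaps line up and every element matches in text and font."""
    if len(header) != len(row):
        return False
    for h, r in zip(header, row):
        if h is None or r is None:
            # (a) Gaps are positional: a gap must be matched by a gap.
            if h is not r:
                return False
        elif h.text != r.text or h.font != r.font:
            # (b) Same position, but the text or the font differs.
            return False
    return True


a = Elem("a", "Bold")
b = Elem("b", "Bold")
print(is_duplicate_header([a, b, None], [a, b, None]))  # True
print(is_duplicate_header([a, b, None], [a, None, b]))  # False
```

Comparing position by position, rather than comparing the non-gap elements as a list, is what keeps `["a", "b", None]` from matching `["a", None, "b"]`.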