Add feature to remove duplicate header rows #76

jstockwin · 2020-05-05T09:53:50Z

It is often the case that if a table goes over a page break then the header is repeated on the new page.

Even though it's not strictly a PDF parsing thing, it might be nice to add a util to handle this case (similar to how we already handle adding the header to the table, even though that isn't strictly PDF parsing either).

Essentially, if the header row repeats then it should be removed. I think every row would be checked against the header row, and any row that matches it in both text and font would be removed from the table. We'd have to keep track of the removed rows so that the existing checks still pass (we have checks to ensure the correct number of elements were detected).

There should be a parameter to enable this behaviour and it should default to False.
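
A minimal sketch of the idea above, not the library's actual implementation. `Cell`, `drop_repeated_headers` and the `remove_duplicate_header_rows` flag name are hypothetical, chosen only for illustration; the first row is assumed to be the header, and removed rows are returned so that any element-count check can still account for them.

```python
from typing import List, NamedTuple, Optional, Tuple


class Cell(NamedTuple):
    """Hypothetical stand-in for a parsed PDF element (not a library type)."""
    text: str
    font: str


Row = List[Optional[Cell]]


def drop_repeated_headers(
    table: List[Row], remove_duplicate_header_rows: bool = False
) -> Tuple[List[Row], List[Row]]:
    """Return (kept_rows, removed_rows); the first row is treated as the header."""
    # The behaviour is opt-in: by default the table is returned unchanged.
    if not remove_duplicate_header_rows or not table:
        return table, []

    header = table[0]
    kept: List[Row] = [header]
    removed: List[Row] = []
    for row in table[1:]:
        # NamedTuple equality compares both text and font, cell by cell.
        if row == header:
            removed.append(row)
        else:
            kept.append(row)
    return kept, removed
```

Because the removed rows are returned rather than discarded, a caller can verify that `len(kept) + len(removed)` still equals the number of rows originally detected.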

paulopaixaoamaral · 2020-05-21T15:49:54Z

What should happen when we want to remove duplicate header rows by calling extract_table on tables with gaps?

For example:

| header_elem_1 | header_elem_2
| elem_1        | elem_2
| header_elem_3 |               | header_elem_4

If we call extract_table() on the above table, we will get:

[
 [header_elem_1, header_elem_2, None], 
 [elem_1, elem_2, None], 
 [header_elem_3, None, header_elem_4]
]

Now assuming that header_elem_1 == header_elem_3 and header_elem_2 == header_elem_4, should we consider that the last row is a duplicate header row (despite the fact that in the final result they are different rows)?

jstockwin · 2020-05-21T16:06:21Z

@paulopaixaoamaral No, I don't think so.

I think a header ["a", "b", None] should match ["a", "b", None] and only that. The gaps are positional, so ["a", "b", None] and ["a", None, "b"] are very different.

So yeah, I think it should only be removed if (a) All the gaps are in the same place, and (b) the remaining elements are in the right places and have (i) the same text and (ii) the same font.
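
The rule above could be sketched roughly as follows. This is only an illustration under assumptions: cells are modelled as hypothetical `(text, font)` tuples with `None` for a gap, and `is_duplicate_header` is a made-up name, not an existing function.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical cell representation: (text, font), or None for a gap.
Cell = Optional[Tuple[str, str]]


def is_duplicate_header(header: Sequence[Cell], row: Sequence[Cell]) -> bool:
    """True only if `row` repeats `header` position by position."""
    if len(header) != len(row):
        return False
    for head_cell, cell in zip(header, row):
        # (a) Gaps must be in the same places.
        if (head_cell is None) != (cell is None):
            return False
        # (b) Remaining cells must have the same text and the same font.
        if head_cell is not None and head_cell != cell:
            return False
    return True


header = [("a", "Bold"), ("b", "Bold"), None]
same = is_duplicate_header(header, [("a", "Bold"), ("b", "Bold"), None])
shifted = is_duplicate_header(header, [("a", "Bold"), None, ("b", "Bold")])
```

Here `same` is `True` and `shifted` is `False`, since the gap has moved to a different column.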

jstockwin added the labels "priority: low", "difficulty: medium", "component: tables" and "enhancement" on May 5, 2020

paulopaixaoamaral mentioned this issue May 22, 2020

[tables] Add flag to remove duplicate header rows #89

Merged

6 tasks

paulopaixaoamaral closed this as completed in #89 May 22, 2020
