-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[tables] Update extract simple table with option to allow gaps #57
Conversation
e0585aa
to
9749243
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a few questions (see comments) and an oversight. Could you also test the behaviour of extract_simple_table
when passing a reference_element
.
if reference_element is None: | ||
reference_element = elements[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we test that reference_element
is actually in elements
(if it is not None) or do you think it would be unnecessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, interesting point.
Since writing this I've realised that I can still check that the number of elements passed equals the number of elements in the returned table (in commit "[tables] Fix bugs in extract simple table"):
table_size = sum(
len([element for element in row if element is not None]) for row in table
)
if table_size != len(elements):
raise TableExtractionError(
f"Number of elements in table ({table_size}) does not match number of "
f"elements passed ({len(elements)}). Perhaps try extract_table instead of "
"extract_simple_table, or change you reference element."
)
I think you can still get away with passing a reference element which isn't in the list, but perhaps if you do this it's intended. I think it's fine not to check, and then if for some reason you don't want your reference element included in the final result then that's fine? I think it would be hard to do without noticing (i.e. if someone does it then it is probably intentional).
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. I think it is fine not to check.
8951171
to
ea59c1d
Compare
@jean-garret I've updated this PR |
This is because extract_simple_table is much faster than extract_table, and we often have cases where only a few columns or rows don't have data in every cell. Initially we wanted this to only be used on full table, since it can easily miss cells. We still achieve this when allow_gaps=False (default), but you can say that you're expecting gaps.
ea59c1d
to
65950fd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Nice that you spotted the bug with extract_simple_table
when the table is across different pages!
if reference_element is None: | ||
reference_element = elements[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. I think it is fine not to check.
This is because extract_simple_table is much faster than extract_table,
and we often have cases where only a few columns or rows don't have data
in every cell.
Initially we wanted this to only be used on full table, since it can
easily miss cells. We still achieve this when allow_gaps=False
(default), but you can say that you're expecting gaps.
Closes #58