Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[tables] Update extract simple table with option to allow gaps #57

Merged
merged 5 commits into from
Apr 14, 2020

Conversation

jstockwin
Copy link
Owner

@jstockwin jstockwin commented Apr 7, 2020

This is because extract_simple_table is much faster than extract_table,
and we often have cases where only a few columns or rows don't have data
in every cell.

Initially we wanted this to only be used on full table, since it can
easily miss cells. We still achieve this when allow_gaps=False
(default), but you can say that you're expecting gaps.

Closes #58

Copy link
Collaborator

@jean-garret jean-garret left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a few questions (see comments) and an oversight. Could you also test the behaviour of extract_simple_table when passing a reference_element.

py_pdf_parser/tables.py Outdated Show resolved Hide resolved
py_pdf_parser/tables.py Outdated Show resolved Hide resolved
py_pdf_parser/tables.py Outdated Show resolved Hide resolved
Comment on lines +76 to +77
if reference_element is None:
reference_element = elements[0]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we test that reference_element is actually in elements (if it is not None) or do you think it would be unnecessary?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, interesting point.

Since writing this I've realised that I can still check that the number of elements passed equals the number of elements in the returned table (in commit "[tables] Fix bugs in extract simple table"):

    table_size = sum(
        len([element for element in row if element is not None]) for row in table
    )
    if table_size != len(elements):
        raise TableExtractionError(
            f"Number of elements in table ({table_size}) does not match number of "
            f"elements passed ({len(elements)}). Perhaps try extract_table instead of "
            "extract_simple_table, or change you reference element."
        )

I think you can still get away with passing a reference element which isn't in the list, but perhaps if you do this it's intended. I think it's fine not to check, and then if for some reason you don't want your reference element included in the final result then that's fine? I think it would be hard to do without noticing (i.e. if someone does it then it is probably intentional).

What do you think?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. I think it is fine not to check.

@jstockwin jstockwin force-pushed the simple-table-update branch from 8951171 to ea59c1d Compare April 8, 2020 10:21
@jstockwin jstockwin requested a review from jean-garret April 8, 2020 10:21
@jstockwin
Copy link
Owner Author

@jean-garret I've updated this PR

This is because extract_simple_table is much faster than extract_table,
and we often have cases where only a few columns or rows don't have data
in every cell.

Initially we wanted this to only be used on full table, since it can
easily miss cells. We still achieve this when allow_gaps=False
(default), but you can say that you're expecting gaps.
@jstockwin jstockwin force-pushed the simple-table-update branch from ea59c1d to 65950fd Compare April 8, 2020 13:20
Copy link
Collaborator

@jean-garret jean-garret left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Nice that you spotted the bug with extract_simple_table when the table is across different pages!

Comment on lines +76 to +77
if reference_element is None:
reference_element = elements[0]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. I think it is fine not to check.

@jstockwin jstockwin merged commit ce27e6b into master Apr 14, 2020
@jstockwin jstockwin deleted the simple-table-update branch April 14, 2020 10:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow some gaps in the table for extract_simple_table
2 participants