-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[tables] Add flag to remove duplicate header rows #89
Conversation
6247868
to
d75001f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @paulopaixaoamaral - looks generally good to me.
I've added two suggestions. Also:
- A test where the element text (but not the font) matches, and also one where the font (but not the text) matches might be good, just to ensure these elements don't get removed. (I think I could e.g. remove font comparison from your function and still have tests pass).
- However, maybe it's cleaner to pull your inner function outside, since then you can test that in isolation?
@@ -6,6 +6,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 | |||
|
|||
## [Unreleased] | |||
### Changed | |||
- Added flag to `extract_simple_table` and `extract_table` functions to remove duplicate header rows. ([#89](https://github.com/jstockwin/py-pdf-parser/pull/89)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should probably be in a section called ### Added
py_pdf_parser/tables.py
Outdated
if (elem_1 is None or elem_2 is None) or ( | ||
elem_2 is None and elem_1 is not None | ||
): | ||
return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the above check, we know they're not BOTH None
, so I think this can be simplified to if elem1 is None or elem2 is None
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I did that originally, but I thought it would be better to be explicit for readability purposes. But I guess you're right, it will be readable enough, I am going to change it 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
Description
This will add a flag to the table extraction functions to enable the removal of rows which are duplicats of the header row.
Linked issues
Closes #76
Testing
Tests have been added to the table extraction tests to make sure the duplicate header rows are removed.
Checklist
docker-compose run --rm lint
)CHANGELOG.md
, following the format fromKeep a Changelog.