[MRG] add support for file_bytes argument with managed_file_context() #270

Open: cscanlin wants to merge 3 commits into master


@cscanlin cscanlin commented Oct 8, 2021

Fixes #170 and #245

@@ -24,19 +27,33 @@ class PDFHandler(object):

Parameters
----------
filepath : str
Filepath or URL of the PDF file.
filepath : str | pathlib.Path, optional (default: None)
Author

@cscanlin cscanlin Oct 8, 2021

I elected to keep the arguments separate instead of combining them as pandas.read_csv (and similar readers) does, mostly to preserve the existing API's keyword arguments.

That is, I did not want to rename this argument to file_path_or_bytes.
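
As a rough illustration of this choice (a sketch, not the PR's code): adding `file_bytes` as a separate keyword leaves existing calls that pass `filepath` untouched. The function name `read_pdf_sketch` and the exactly-one-source check are assumptions made for the example; `filepath`, `pages`, and `password` mirror the existing `__init__` signature.

```python
def read_pdf_sketch(filepath=None, pages="1", password=None, file_bytes=None):
    """Hypothetical entry point; only the argument handling is shown."""
    # Assumption: exactly one of the two sources must be supplied.
    if (filepath is None) == (file_bytes is None):
        raise ValueError("pass exactly one of filepath or file_bytes")
    kind = "filepath" if filepath is not None else "file_bytes"
    return kind, pages


# Existing positional and keyword calls keep working unchanged.
read_pdf_sketch("report.pdf")
read_pdf_sketch(filepath="report.pdf", pages="1-3")
# New callers opt in through the extra keyword.
read_pdf_sketch(file_bytes=b"%PDF-1.4 ...")
```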

@@ -49,6 +66,28 @@ def __init__(self, filepath, pages="1", password=None):
self.password = self.password.encode("ascii")
self.pages = self._get_pages(pages)

@contextmanager
Author

This is the meat of it. It either opens a file handle or passes the bytes through, depending on which argument was supplied.
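
A minimal sketch of that idea, assuming `file_bytes` is a seekable binary stream such as `io.BytesIO`. The class name `HandlerSketch` is hypothetical, and the PR's actual method may differ in detail:

```python
import io
from contextlib import contextmanager


class HandlerSketch:
    """Hypothetical stand-in for PDFHandler, reduced to the file handling."""

    def __init__(self, filepath=None, file_bytes=None):
        self.filepath = filepath
        self.file_bytes = file_bytes  # assumed: seekable binary stream

    @contextmanager
    def managed_file_context(self):
        if self.file_bytes is not None:
            # The caller owns the stream: rewind it, pass it through,
            # and leave it open on exit.
            self.file_bytes.seek(0)
            yield self.file_bytes
        else:
            # We own the handle: open it here and close it on exit.
            with open(self.filepath, "rb") as fileobj:
                yield fileobj
```

Either way, callers write `with handler.managed_file_context() as fileobj:` and read from `fileobj` without caring where the data came from.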

-def download_url(url):
-    """Download file from specified URL.
+def get_url_bytes(url):
+    """Get a stream of bytes for url

Parameters
----------
url : str or unicode

Returns
-------
Author

This change is only loosely related to my feature, but the existing code is an anti-pattern in my opinion:

with tempfile.NamedTemporaryFile("wb", delete=False) as f:
    ...
filepath = os.path.join(os.path.dirname(f.name), filename)
shutil.move(f.name, filepath)

Keeping this file alive outside the context manager provided by tempfile is, at best, not how the module is intended to be used. BytesIO will use somewhat more memory in these cases, but I could easily see the existing implementation causing bugs, either now or in the future.
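
A hedged sketch of the in-memory alternative, using only the standard library. The function body is an assumption for illustration, not the PR's implementation, which may also validate the response or set request headers:

```python
import io
from urllib.request import urlopen


def get_url_bytes(url):
    """Fetch `url` and return its body as an in-memory binary stream."""
    # The response is closed when the `with` block exits; nothing is
    # written to disk, so there is no temporary file to clean up.
    with urlopen(url) as response:
        return io.BytesIO(response.read())
```

The trade-off is the one noted above: the whole download is held in memory instead of on disk.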

@@ -107,7 +146,7 @@ def _save_page(self, filepath, page, temp):
Tmp directory.

"""
-with open(filepath, "rb") as fileobj:
+with self.managed_file_context() as fileobj:
Author

This looks like a bug

@chris-decker

I needed this functionality, and since the original author hasn't merged it yet, I cloned the repo and merged it locally. It seems to be working as intended; thanks for contributing.

@clcarver1130

Is this still stalled? Would love this functionality if still possible

@cscanlin
Author

Unfortunately, this repo looks somewhat abandoned: @vinayak-mehta has not merged any code since July 2021.

It's a really useful tool, so it would be a shame to let it decay. Does anybody have interest in being a maintainer?

@ramSeraph

Consider starting a discussion on gitter?

Collaborator

@bosd bosd left a comment

Thanks for this contribution!!

Quick code review, without functionally testing it.
Would love to see if the tests are green.

@bosd
Collaborator

bosd commented Jul 15, 2023

@foarsitter Can you trigger the tests? 🙏

@sisrfeng

Any update?

@foarsitter
Collaborator

@cscanlin, by any chance, do you have time to rebase this and run black/isort?

@MartinThoma
Collaborator

Hey!

As camelot is dead, we are trying to build a maintained fork at pypdf_table_extraction.

Do you want to open the PR against that branch so that we can merge your improvement?

@Johnmaras

@MartinThoma Hi. Has anyone merged, or plan to do so, this PR on your fork? I could use this feature. I guess I could clone it and open the PR in your project.

@bosd
Collaborator

bosd commented Mar 22, 2024

> @MartinThoma Hi. Has anyone merged, or plan to do so, this PR on your fork? I could use this feature. I guess I could clone it and open the PR in your project.

@Johnmaras Please go ahead and open a PR there. 🙂

@Johnmaras

Hi @bosd.
I can open a PR but it looks like there are many conflicts between cscanlin:file-bytes-support and pypdf_table_extraction:main.
If I open the PR do you think a contributor of pypdf_table_extraction could work on resolving the conflicts? I can't currently work on it myself.

Let me know.
Thank you

@bosd
Collaborator

bosd commented Mar 26, 2024

> Hi @bosd.
> I can open a PR but it looks like there are many conflicts between cscanlin:file-bytes-support and pypdf_table_extraction:main.
> If I open the PR do you think a contributor of pypdf_table_extraction could work on resolving the conflicts? I can't currently work on it myself.
>
> Let me know.
> Thank you

I'm a contributor/maintainer there. I can take a look at resolving the conflicts. (It may take some time; I'm kind of busy lately.)

@Johnmaras

Understood. I'll proceed with the PR as soon as possible.

Successfully merging this pull request may close these issues.

Camelot functionality has read_pdf from file but no option read from bytes
10 participants