Use PLAYA and PAVÉS instead of pdfminer.six #108

Open · 5 commits to merge into base: main

dhdaines

As the subject says! This helped me fix a very annoying bug, so I'm glad I did it.

It's very marginally faster, and you should also get extra robustness to broken PDFs.

Getting it to support the parallel processing that PLAYA can do is a bit more work, but I might give it a try another day...

@dhdaines
Author

Also note that there are a number of deprecation warnings; that's what the TODO comments are about. Making the required changes is probably pretty straightforward and should simplify the code, so let me know if you want me to do that (though I may also move the deprecated APIs into PAVÉS so that it can emulate pdfminer.six more faithfully).

@0xabu
Owner

0xabu commented Mar 2, 2025

Thanks for the PR! This is certainly interesting, given that it's unclear whether anyone is maintaining pdfminer, but I won't rush it. I'll try this out for my own use in the next month to get some confidence with it.

Personally I'm not super interested in parallel analysis -- most of the PDFs I use this with are 10-15 pages long.

@dhdaines
Author

dhdaines commented Mar 2, 2025

> Thanks for the PR! This is certainly interesting, given that it's unclear whether anyone is maintaining pdfminer, but I won't rush it. I'll try this out for my own use in the next month to get some confidence with it.

Sure, take your time! I'm glad I did this anyway, as it helped me find and fix some bugs. If you find some PDFs that fail with pdfminer it would be interesting to try them here.

> Personally I'm not super interested in parallel analysis -- most of the PDFs I use this with are 10-15 pages long.

Yeah, after looking at it I realized that the use case of pdfannots isn't at all the kind of huge PDFs that I'm used to dealing with! If somebody had manually annotated hundreds of pages I would be really surprised :-)

I notice there are some build failures, which are probably just a configuration issue in the workflows: the dependencies are not being pulled in properly for mypy.


2 participants