Use PLAYA and PAVÉS instead of pdfminer.six #108

dhdaines · 2025-02-28T13:39:02Z

As the subject says! This helped me fix a very annoying bug, so I'm glad I did it.

It's very marginally faster, and you should also get extra robustness to broken PDFs.

Getting it to support the parallel processing that PLAYA can do is a bit more work, but I might give it a try another day...

dhdaines · 2025-02-28T13:49:51Z

Also note that there are a number of deprecation warnings; that's what the TODO comments are about. Making the required changes is probably pretty straightforward and should simplify the code, so let me know if you want me to do that (but I may also move the deprecated APIs into PAVÉS so that it can more properly emulate pdfminer.six).

0xabu · 2025-03-02T18:57:53Z

Thanks for the PR! This is certainly interesting, given that it's unclear whether anyone is maintaining pdfminer, but I won't rush it. I'll try this out for my own use in the next month to get some confidence with it.

Personally I'm not super interested in parallel analysis -- most of the PDFs I use this with are 10-15 pages long.

dhdaines · 2025-03-02T19:23:13Z

> Thanks for the PR! This is certainly interesting, given that it's unclear whether anyone is maintaining pdfminer, but I won't rush it. I'll try this out for my own use in the next month to get some confidence with it.

Sure, take your time! I'm glad I did this anyway, as it helped me find and fix some bugs. If you find some PDFs that fail with pdfminer it would be interesting to try them here.

> Personally I'm not super interested in parallel analysis -- most of the PDFs I use this with are 10-15 pages long.

Yeah, after looking at it I realized that pdfannots' use case doesn't involve the kind of huge PDFs that I'm used to dealing with! If somebody had manually annotated hundreds of pages I would be really surprised :-)

I notice there are some build failures, which are probably just a configuration issue in the workflows: the dependencies are not being pulled in properly for mypy.

dhdaines added 4 commits February 28, 2025 07:32

feat: use PLAYA and PAVÉS instead of pdfminer.six

fc78bc8

fix: use only explicit page labels contrary to PLAYA

598f095

fix: use PLAYA 0.3.1 which fixes bad bugs

d9ee454

docs: blame PLAYA and not pdfminer.six for any PDF extraction errors

58fd4e4

fix(deps): just import stuff from paves as much as possible

fd042a9
