Problems with big files (300 mb +) #128

Closed
ivan985 opened this issue Feb 28, 2018 · 1 comment

ivan985 commented Feb 28, 2018

Hello. I have a problem when working with big files (300 MB+). pdfminer is slow on them, and the processing time does not even scale linearly with file size (for example, on my machine: 167 seconds for a 350 MB file and 565 seconds for a 526 MB file). Much worse, the extracted text comes out without spaces, so each line is one solid block. Changing the margin parameters did not affect the result. What could the problem be, and how can I fix it?

One of the files that shows this problem (340 MB): https://yadi.sk/d/uOc1dLko3Ss5HM
P.S. I tried other tools, and xpdf extracts the text correctly. I am sorry if these are simple questions; I have little programming experience.

pietermarsman (Member) commented

> The PDF-miner works slowly on them, and the speed, depending on the size, is not even linear

Hi @ivan985, the speed of pdfminer.six depends heavily on the content of the PDF, not just on its size. For example, the layout algorithm is much faster when it can group characters into words early on.
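To put the reported timings in perspective, here is a small back-of-the-envelope sketch. It uses only the two measurements quoted in the report; the helper names `throughput` and `scaling_exponent` are made up for illustration. Two data points are far too few to characterise the algorithm, so the result only indicates that the slowdown is well beyond linear.

```python
import math

# Timings quoted in the report: (file size in MB, extraction time in seconds).
measurements = [(350, 167), (526, 565)]


def throughput(size_mb, seconds):
    """Average extraction speed in MB per second."""
    return size_mb / seconds


def scaling_exponent(a, b):
    """Exponent k for which time ~ size**k passes through both points."""
    (size_a, time_a), (size_b, time_b) = a, b
    return math.log(time_b / time_a) / math.log(size_b / size_a)


for size_mb, seconds in measurements:
    print(f"{size_mb} MB: {throughput(size_mb, seconds):.2f} MB/s")

# An exponent of 1 would mean linear scaling; larger values mean worse.
print(f"implied exponent: {scaling_exponent(*measurements):.2f}")
```

For these two files the implied exponent comes out close to 3. That is consistent with the point above: the cost seems to be driven by how much layout grouping each file requires rather than by file size alone.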

> But, which is much worse, the text of them goes without spaces, each line goes like one solid block.

Try tweaking the char_margin, word_margin and line_margin layout parameters to improve the output. I suspect that this will also influence the parsing speed.
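A minimal sketch of how such tweaking could be done systematically, assuming a recent pdfminer.six release that provides `pdfminer.high_level.extract_text` and `pdfminer.layout.LAParams`. The helpers `margin_grid`, `extract_with_margins` and `space_ratio` are hypothetical names for this example, and the candidate values are guesses around the library defaults (char_margin=2.0, word_margin=0.1, line_margin=0.5), not recommended settings. Only the first few pages are parsed so that each trial stays cheap on a very large file.

```python
from itertools import product


def margin_grid(char_margins=(1.0, 2.0, 4.0),
                word_margins=(0.05, 0.1, 0.2),
                line_margins=(0.5,)):
    """Return every combination of the candidate margins as keyword dicts."""
    return [
        {"char_margin": c, "word_margin": w, "line_margin": l}
        for c, w, l in product(char_margins, word_margins, line_margins)
    ]


def extract_with_margins(pdf_path, margins, first_pages=3):
    """Extract text from the first few pages with the given layout margins."""
    # Imported lazily so the helpers above work without pdfminer.six installed.
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    return extract_text(pdf_path, laparams=LAParams(**margins),
                        maxpages=first_pages)


def space_ratio(text):
    """Fraction of characters that are spaces; near zero means 'solid block'."""
    stripped = text.strip()
    return stripped.count(" ") / len(stripped) if stripped else 0.0


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:  # usage: python tune_margins.py some_file.pdf
        for margins in margin_grid():
            text = extract_with_margins(sys.argv[1], margins)
            print(margins, f"space ratio: {space_ratio(text):.3f}")
```

A lower word_margin makes pdfminer.six insert spaces between characters more readily, so that is probably the first value to vary when words run together. If no combination restores the spaces, the cause may lie in how the PDF encodes its text rather than in the layout parameters.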
