Problems with big files (300 mb +) #128

ivan985 · 2018-02-28T12:39:51Z

Hello. I have a problem when working with big files (300 MB+). PDFMiner is slow on them, and the run time does not even scale linearly with file size (for example, on my machine: 167 seconds for a 350 MB file and 565 seconds for a 526 MB file). Much worse, though, the extracted text comes out without spaces: each line is one solid block. Changing the margin parameters did not affect the result. Where might the problem be, and how can I fix it?

One of the files on which I have this problem (340 MB): https://yadi.sk/d/uOc1dLko3Ss5HM
P.S. I tried other tools; xpdf extracts the text correctly. I am sorry if I am asking something simple, I have little programming experience.

pietermarsman · 2019-11-10T13:28:32Z

> The PDF-miner works slowly on them, and the speed, depending on the size, is not even linear

Hi @ivan985, the speed of pdfminer.six depends heavily on the content of the PDF, not only on its size. For example, the layout algorithm is much faster when it can group characters into words early on.
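To put a number on the non-linearity, here is a rough sketch that fits a power law, time ∝ size^k, to the two timings reported above. The function name `scaling_exponent` is made up for illustration. Two measurements on two different files are thin evidence, and since speed depends on content, the result is only an indication.

```python
import math


def scaling_exponent(size_a, time_a, size_b, time_b):
    """Estimate k in time ~ size**k from two (size, time) measurements."""
    return math.log(time_b / time_a) / math.log(size_b / size_a)


# Timings from the report: 167 s for 350 MB, 565 s for 526 MB.
k = scaling_exponent(350, 167, 526, 565)
print(f"estimated exponent k = {k:.2f}")

# Throughput in MB/s for each file; a linear algorithm would keep this
# roughly constant across file sizes.
for size_mb, seconds in [(350, 167), (526, 565)]:
    print(f"{size_mb} MB: {size_mb / seconds:.2f} MB/s")
```

For these two points the exponent comes out close to 3, so the slowdown is well beyond linear, but it may reflect different content in the two files as much as their size.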

> But, which is much worse, the text of them goes without spaces, each line goes like one solid block.

Try tweaking char_margin, word_margin and line_margin to improve the output. I suspect this will also influence the parsing speed.
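A minimal sketch of how such tweaking could be done, assuming a recent pdfminer.six that provides `pdfminer.high_level.extract_text` and `pdfminer.layout.LAParams`. The helper names (`space_ratio`, `margin_grid`, `try_margins`) and the candidate values are only illustrative, not recommended settings. It runs a small grid of margins over the first few pages and reports how many spaces each setting produces and how long it took.

```python
import time
from itertools import product


def space_ratio(text):
    """Fraction of characters that are spaces; near 0 means words ran together."""
    if not text:
        return 0.0
    return text.count(" ") / len(text)


def margin_grid(char_margins=(1.0, 2.0),
                word_margins=(0.05, 0.1),
                line_margins=(0.3, 0.5)):
    """Build every combination of the candidate margin values (illustrative values)."""
    return [
        {"char_margin": c, "word_margin": w, "line_margin": l}
        for c, w, l in product(char_margins, word_margins, line_margins)
    ]


def try_margins(pdf_path, maxpages=5):
    """Extract the first pages with each setting; return (params, space ratio, seconds)."""
    # Imported here so the helpers above work without pdfminer.six installed.
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    results = []
    for params in margin_grid():
        start = time.perf_counter()
        text = extract_text(pdf_path, maxpages=maxpages, laparams=LAParams(**params))
        elapsed = time.perf_counter() - start
        results.append((params, space_ratio(text), elapsed))
    return results
```

Calling `try_margins("big.pdf")` on a local file (the path is a placeholder) and sorting the results by space ratio should show whether any setting restores the spaces. Limiting `maxpages` keeps each trial short on a 300 MB+ file. If the ratio stays near zero for every setting, the margins are probably not the cause.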

timb07 mentioned this issue Mar 29, 2018

Speed up handling of PDFs with large images #133

Merged

timb07 mentioned this issue Apr 11, 2018

Speed up layout of text boxes #141

Merged

pietermarsman added the type: bug label Oct 13, 2019

pietermarsman added type: question and removed type: bug labels Nov 10, 2019

pietermarsman closed this as completed Nov 10, 2019
