Hello. I have a problem when working with large files (300 MB and up). PDFMiner is slow on them, and the run time does not even scale linearly with file size: on my machine, a 350 MB file takes 167 seconds and a 526 MB file takes 565 seconds. Much worse, the extracted text comes out without spaces, so each line is one solid block. Changing the margin parameters did not affect the result. What could be causing this, and how can I fix it?
One of the files on which I have this problem (340 MB): https://yadi.sk/d/uOc1dLko3Ss5HM
P.S. I tried other tools, and xpdf extracts the text correctly. Sorry if I am asking something simple; I have little programming experience.
The PDF-miner works slowly on them, and the speed, depending on the size, is not even linear
Hi @ivan985, the speed of pdfminer.six depends heavily on the content of the PDF. For example, the layout algorithm is much faster when it can group characters into words early on.
But, which is much worse, the text of them goes without spaces, each line goes like one solid block.
Try tweaking char_margin, word_margin and line_margin to improve the extracted text. I suspect these settings will also affect parsing speed.
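One way to check whether the margins make any difference is to sweep a single parameter and measure how many spaces end up in the output. The sketch below is only an illustration, not a confirmed fix for this file: it assumes the pdfminer.six high-level API (`extract_text` with an `LAParams` object), and the helper names `space_ratio` and `sweep_word_margin`, the three-page limit, and the candidate values are all made up for the example. As far as I understand the layout code, a lower `word_margin` makes it insert spaces between characters more readily.

```python
from typing import Dict, Iterable


def space_ratio(text: str) -> float:
    """Return the share of characters in `text` that are spaces (0.0 if empty)."""
    if not text:
        return 0.0
    return text.count(" ") / len(text)


def sweep_word_margin(
    pdf_path: str,
    word_margins: Iterable[float],
    max_pages: int = 3,
) -> Dict[float, float]:
    """Extract the first pages once per word_margin and report the space ratio.

    Hypothetical helper for experimentation; only a few pages are parsed so
    that each run stays short even on a very large file.
    """
    # Imported here so space_ratio() can be used without pdfminer.six installed.
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    results: Dict[float, float] = {}
    for wm in word_margins:
        # char_margin and line_margin are kept at the pdfminer.six defaults.
        laparams = LAParams(char_margin=2.0, word_margin=wm, line_margin=0.5)
        text = extract_text(pdf_path, maxpages=max_pages, laparams=laparams)
        results[wm] = space_ratio(text)
    return results
```

Calling `sweep_word_margin("big.pdf", [0.01, 0.05, 0.1, 0.2])` (with `big.pdf` standing in for your own file) returns one ratio per setting. If every ratio stays near zero, the margins are probably not the cause and the problem lies elsewhere in how the file encodes its text.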