[loaders] Loaders accept LTTextLines as top-level pdfminer elements, which breaks things #154

jstockwin · 2021-01-15T10:49:53Z

In the loaders, we do

elements = [element for element in page if isinstance(element, LTTextContainer)]

However, I have an example PDF (which I can't share) where one of these top-level elements is an LTTextLine.

LTTextBox and LTTextLine are both LTTextContainers.

We're expecting to get LTTextBoxes, containing LTTextLines, which in turn contain LTChars. In this case, we're only getting an LTTextLine containing an LTChar. This breaks the code, for example when trying to find the font: we iterate through the second level, which should be an LTTextLine but is now an LTChar, and an LTChar is not iterable.
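A minimal sketch of the failure mode described above. The classes here are simplified stand-ins that only mirror the inheritance relationships named in this issue (they are not the real pdfminer.six classes), and `first_font` is a hypothetical helper, not the project's actual font lookup.

```python
# Stand-in classes mirroring the layout hierarchy described in this issue.
# Simplified for illustration; these are NOT the real pdfminer.six classes.
class LTTextContainer(list):
    """Iterable container of child layout elements."""


class LTTextBox(LTTextContainer):
    """Expected top level: contains LTTextLines."""


class LTTextLine(LTTextContainer):
    """Expected second level: contains LTChars."""


class LTChar:
    """A single character; deliberately not iterable."""

    def __init__(self, fontname):
        self.fontname = fontname


def first_font(element):
    """Hypothetical font lookup that assumes box -> line -> char nesting."""
    for line in element:
        for char in line:
            return char.fontname
    return None


box = LTTextBox([LTTextLine([LTChar("Helvetica")])])
stray_line = LTTextLine([LTChar("Helvetica")])

# Both pass the current isinstance check, since both subclass LTTextContainer.
page = [box, stray_line]
elements = [e for e in page if isinstance(e, LTTextContainer)]

font_from_box = first_font(box)  # box -> line -> char, as expected

try:
    # line -> char, then the inner loop tries to iterate the char itself
    first_font(stray_line)
    error = None
except TypeError as exc:
    error = exc
```

With the stray top-level line, the inner loop receives an `LTChar` where it expects an `LTTextLine`, so iteration raises a `TypeError`.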

I think the fix here is to search for instances of LTTextBox instead of LTTextContainer.
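The proposed fix could look roughly like the sketch below. It again uses simplified stand-in classes rather than the real pdfminer.six ones, and `text_boxes` is a hypothetical name; the actual change is in the linked pull request.

```python
# Stand-in classes mirroring the inheritance described in this issue.
# Simplified for illustration; these are NOT the real pdfminer.six classes.
class LTTextContainer(list):
    pass


class LTTextBox(LTTextContainer):
    pass


class LTTextLine(LTTextContainer):
    pass


def text_boxes(page):
    """Keep only top-level text boxes (hypothetical helper)."""
    return [element for element in page if isinstance(element, LTTextBox)]


# A page with one proper box, one stray top-level line, and a non-text item.
page = [LTTextBox(), LTTextLine(), "not a text element"]
kept = text_boxes(page)
```

Note that under this sketch a stray top-level LTTextLine is simply dropped, so any text it holds would not be loaded.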

jstockwin added the bug label Jan 15, 2021

jstockwin self-assigned this Jan 15, 2021

jstockwin added a commit that referenced this issue Jan 15, 2021

[loaders] Only accept LTTextBoxes from pdfminer.six

2283086

Closes #154

jstockwin mentioned this issue Jan 15, 2021

[loaders] Only accept LTTextBoxes from pdfminer.six #155

Merged

6 tasks

jstockwin added a commit that referenced this issue Jan 15, 2021

[loaders] Only accept LTTextBoxes from pdfminer.six

b97b245

Closes #154

jstockwin closed this as completed in #155 Jan 15, 2021
