KeyError: 'ID' when running pdf2txt.py #470

alexkillen · 2020-08-14T10:27:12Z

Unfortunately I cannot include the PDF as it is a bank statement, but hopefully the details below are enough.

The error is as follows:

DEBUG:pdfminer.pdfdocument:trailer={'Size': 70, 'Root': <PDFObjRef:69>, 'Info': <PDFObjRef:3>, 'Encrypt': <PDFObjRef:2>}
INFO:pdfminer.pdfdocument:trailer: {'Size': 70, 'Root': <PDFObjRef:69>, 'Info': <PDFObjRef:3>, 'Encrypt': <PDFObjRef:2>}
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 193, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/alexk/Development/playground/pdfminer_six/pdfminer.six/tools/pdf2txt.py", line 195, in <module>
    sys.exit(main())
  File "/home/alexk/Development/playground/pdfminer_six/pdfminer.six/tools/pdf2txt.py", line 189, in main
    outfp = extract_text(**vars(A))
  File "/home/alexk/Development/playground/pdfminer_six/pdfminer.six/tools/pdf2txt.py", line 57, in extract_text
    pdfminer.high_level.extract_text_to_fp(fp, **locals())
  File "/home/alexk/Development/playground/pdfminer_six/pdfminer.six/pdfminer/high_level.py", line 79, in extract_text_to_fp
    for page in PDFPage.get_pages(inf,
  File "/home/alexk/Development/playground/pdfminer_six/pdfminer.six/pdfminer/pdfpage.py", line 128, in get_pages
    doc = PDFDocument(parser, password=password, caching=caching)
  File "/home/alexk/Development/playground/pdfminer_six/pdfminer.six/pdfminer/pdfdocument.py", line 589, in __init__
    self.encryption = (list_value(trailer['ID']),
KeyError: 'ID'

The error occurs when attempting to access the 'ID' property of the File Trailer, but as can be seen in the DEBUG line in the above output, 'ID' is not in the trailer. Note that 'ID' is listed as optional in the PDF spec: https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/pdf_reference_archives/PDFReference.pdf#page=88.

I managed to work around the issue by making the following change in pdfminer/pdfdocument.py (lines 588-590):

        if 'Encrypt' in trailer:
            # self.encryption = (list_value(trailer['ID']),
            self.encryption = (
                list_value(trailer['ID']) if 'ID' in trailer
                else [''.encode('utf-8'), ''.encode('utf-8')],
                dict_value(trailer['Encrypt']))

This simply provides empty utf-8 encoded strings as the ID. I'm not sure if this would be the right "fix" but it appeared to work in my case.

pietermarsman · 2020-09-13T10:40:03Z

Since it is

(Optional, but strongly recommended; PDF 1.1)

we should indeed make this more robust by assuming the value can be missing.

It looks like there is no sensible default, so using a list of two empty byte strings is OK.

I suggest using trailer.get('ID', [b'', b'']).
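As a rough illustration of that suggestion, here is a minimal sketch that uses a plain dict in place of the parsed trailer. The helper name `encryption_params` and the sample trailer values are hypothetical; in pdfminer itself the values would still go through `list_value` and `dict_value`.

```python
from typing import Any, Dict, List, Tuple

# Fallback used when the trailer has no 'ID' entry (the spec marks it optional).
EMPTY_ID: List[bytes] = [b"", b""]


def encryption_params(trailer: Dict[str, Any]) -> Tuple[List[bytes], Dict[str, Any]]:
    """Hypothetical helper: return (ID, Encrypt), tolerating a missing 'ID'."""
    # dict.get returns the default instead of raising KeyError.
    doc_id = list(trailer.get("ID", EMPTY_ID))
    return doc_id, dict(trailer["Encrypt"])


# Illustrative trailer resembling the one in the DEBUG output: 'Encrypt' but no 'ID'.
trailer_without_id = {"Size": 70, "Encrypt": {"V": 1}}
print(encryption_params(trailer_without_id))
```

Copying the default with `list(...)` keeps callers from mutating the shared `EMPTY_ID` constant.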

pietermarsman added the type:anomaly Errors caused by deviations from the PDF Reference label Sep 13, 2020

pietermarsman added component:document Related to PDFDocument type: bug and removed type:anomaly Errors caused by deviations from the PDF Reference labels Sep 13, 2020

datatalking mentioned this issue Jul 20, 2022

Type Error during extracting pages in some pdfs #720

Closed

pietermarsman added the status: accepted label Aug 7, 2022
