KeyError: 'ID' when running pdf2txt.py #470

Open
alexkillen opened this issue Aug 14, 2020 · 1 comment

@alexkillen

Unfortunately I cannot include the PDF as it is a bank statement, but hopefully the details below are enough.

The error is as follows:

DEBUG:pdfminer.pdfdocument:trailer={'Size': 70, 'Root': <PDFObjRef:69>, 'Info': <PDFObjRef:3>, 'Encrypt': <PDFObjRef:2>}
INFO:pdfminer.pdfdocument:trailer: {'Size': 70, 'Root': <PDFObjRef:69>, 'Info': <PDFObjRef:3>, 'Encrypt': <PDFObjRef:2>}
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 193, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/alexk/Development/playground/pdfminer_six/pdfminer.six/tools/pdf2txt.py", line 195, in <module>
    sys.exit(main())
  File "/home/alexk/Development/playground/pdfminer_six/pdfminer.six/tools/pdf2txt.py", line 189, in main
    outfp = extract_text(**vars(A))
  File "/home/alexk/Development/playground/pdfminer_six/pdfminer.six/tools/pdf2txt.py", line 57, in extract_text
    pdfminer.high_level.extract_text_to_fp(fp, **locals())
  File "/home/alexk/Development/playground/pdfminer_six/pdfminer.six/pdfminer/high_level.py", line 79, in extract_text_to_fp
    for page in PDFPage.get_pages(inf,
  File "/home/alexk/Development/playground/pdfminer_six/pdfminer.six/pdfminer/pdfpage.py", line 128, in get_pages
    doc = PDFDocument(parser, password=password, caching=caching)
  File "/home/alexk/Development/playground/pdfminer_six/pdfminer.six/pdfminer/pdfdocument.py", line 589, in __init__
    self.encryption = (list_value(trailer['ID']),
KeyError: 'ID'

The error occurs when attempting to access the 'ID' property of the File Trailer, but as can be seen in the DEBUG line in the above output, 'ID' is not in the trailer. Note that 'ID' is listed as optional in the PDF spec: https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/pdf_reference_archives/PDFReference.pdf#page=88.
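The failure mode can be reproduced without the PDF. The following is a minimal sketch, assuming a plain dict stands in for the parsed trailer and placeholder strings stand in for the PDFObjRef values; `read_id_strict` is a hypothetical helper, not part of pdfminer.

```python
# Stand-in for the trailer shown in the DEBUG line. The real values are
# PDFObjRef objects; placeholder strings are used here (an assumption).
trailer = {
    'Size': 70,
    'Root': '<PDFObjRef:69>',
    'Info': '<PDFObjRef:3>',
    'Encrypt': '<PDFObjRef:2>',
}


def read_id_strict(trailer):
    """Mimic the unguarded lookup from the traceback (hypothetical helper)."""
    return trailer['ID']


# 'Encrypt' is present, so the encryption branch is taken, but 'ID' is not.
try:
    read_id_strict(trailer)
    error = None
except KeyError as exc:
    error = exc

print(repr(error))  # KeyError('ID')
```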

I managed to work around the issue by making the following change in pdfminer/pdfdocument.py (lines 588-590):

        if 'Encrypt' in trailer:
            # Original: self.encryption = (list_value(trailer['ID']),
            self.encryption = (
                list_value(trailer['ID']) if 'ID' in trailer
                else [''.encode('utf-8'), ''.encode('utf-8')],
                dict_value(trailer['Encrypt']))

This simply provides empty utf-8 encoded strings as the ID. I'm not sure if this would be the right "fix" but it appeared to work in my case.
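As a quick sanity check on the fallback value, the snippet below (a sketch, with variable names chosen only for illustration) shows that encoding an empty string as UTF-8 yields an empty bytes object, so the fallback is a list of two empty byte strings.

```python
# The fallback used in the workaround above.
fallback = [''.encode('utf-8'), ''.encode('utf-8')]

# Encoding an empty str gives an empty bytes object, so this is the same
# value as a list of two empty bytes literals.
print(fallback)                 # [b'', b'']
print(fallback == [b'', b''])   # True
```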

@pietermarsman added the "type:anomaly" label (Errors caused by deviations from the PDF Reference) on Sep 13, 2020
@pietermarsman
Member

Since the spec describes it as "(Optional, but strongly recommended; PDF 1.1)", we should indeed make this more robust by assuming the value can be missing.

It looks like there is no sensible default, so using a pair of empty byte strings is OK.

I suggest using trailer.get('ID', [b'', b'']).
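A rough sketch of how that lookup could behave, assuming plain dicts in place of the parsed trailer. `encryption_params` and the sample trailers are hypothetical; in pdfminer the values would still be passed through `list_value()` and `dict_value()` as in the snippet quoted earlier in the thread.

```python
def encryption_params(trailer):
    """Return (doc_id, encrypt), or None if the trailer has no 'Encrypt'.

    Hypothetical helper for illustration only; it skips the list_value()
    and dict_value() wrapping that pdfminer applies to these entries.
    """
    if 'Encrypt' not in trailer:
        return None
    # Fall back to two empty byte strings when the optional 'ID' is absent.
    doc_id = trailer.get('ID', [b'', b''])
    return (doc_id, trailer['Encrypt'])


# Made-up sample trailers; the 'Encrypt' contents are placeholders.
with_id = {'ID': [b'abc', b'def'], 'Encrypt': {'V': 2}}
without_id = {'Encrypt': {'V': 2}}
not_encrypted = {'Size': 70}

print(encryption_params(with_id))        # ([b'abc', b'def'], {'V': 2})
print(encryption_params(without_id))     # ([b'', b''], {'V': 2})
print(encryption_params(not_encrypted))  # None
```

Using `dict.get` keeps the lookup to a single expression and leaves trailers that do carry an 'ID' entry unaffected.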

@pietermarsman added the "component:document" (Related to PDFDocument) and "type: bug" labels and removed the "type:anomaly" label (Errors caused by deviations from the PDF Reference) on Sep 13, 2020