Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add FAQ about special characters #829

Merged
merged 3 commits into from
Nov 5, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).

- Output converter for the hOCR format ([#651](https://github.com/pdfminer/pdfminer.six/pull/651))
- Font name aliases for Arial, Courier New and Times New Roman ([#790](https://github.com/pdfminer/pdfminer.six/pull/790))
- Documentation on why special characters can sometimes not be extracted ([#829](https://github.com/pdfminer/pdfminer.six/pull/829))

### Fixed

Expand Down
27 changes: 27 additions & 0 deletions docs/source/faq.rst
Original file line number Diff line number Diff line change
Expand Up @@ -39,3 +39,30 @@ improves pdfminer.
Since 2020, the original pdfminer is `dormant
<https://github.com/euske/pdfminer#pdfminer>`_, and pdfminer.six is the fork
which Euske recommends if you need an actively maintained version of pdfminer.

Why are there `(cid:x)` values in the textual output?
=====================================================

One of the most common issues with pdfminer.six is that the textual output
contains raw character id's `(cid:x)`. This is often experienced as confusing
because the text is shown fine in a PDF viewer and other text from the same
PDF is extracted properly.

The underlying problem is that a PDF has two different representations
of each character. Each character is mapped to a glyph that determines
how the character is shown in a PDF viewer. And each character is also
mapped to its unicode value that is used when copy-pasting the character.
Some PDF's have incomplete unicode mappings and therefore it is impossible
to convert the character to unicode. In these cases pdfminer.six defaults
to showing the raw character id `(cid:x)`

A quick test to see if pdfminer.six should be able to do better is to
copy-paste the text from a PDF viewer to a text editor. If the result
is proper text, pdfminer.six should also be able to extract proper text.
If the result is gibberish, pdfminer.six will also not be able to convert
the characters to unicode.

References:

#. `Chapter 5: Text, PDF Reference 1.7 <https://opensource.adobe.com/dc-acrobat-sdk-docs/pdflsdk/index.html#pdf-reference>`_
#. `Text: PDF, Wikipedia <https://en.wikipedia.org/wiki/PDF#Text>`_