pdfminer · KunalGehlot · Aug 23, 2022 · Aug 23, 2022 · Aug 23, 2022 · Aug 24, 2022
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -9,6 +9,8 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 
 - Output converter for the hOCR format ([#651](https://github.com/pdfminer/pdfminer.six/pull/651))
 - Font name aliases for Arial, Courier New and Times New Roman ([#790](https://github.com/pdfminer/pdfminer.six/pull/790))
+- Documentation updates: FAQ (Unable to extract special characters), and other small changes
+- Documentation: how to extract coordinates and font information
 
 ### Fixed
 
@@ -19,6 +21,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 - Installing typing-extensions on Python 3.6 and 3.7 ([#775](https://github.com/pdfminer/pdfminer.six/pull/775))
 - `TypeError` in cmapdb.py when parsing null characters ([#768](https://github.com/pdfminer/pdfminer.six/pull/768))
 - Color "convenience operators" now (per spec) also set color space ([#779](https://github.com/pdfminer/pdfminer.six/issues/779))
+- `ValueError` when extracting images, due to breaking changes in Pillow ([#795](https://github.com/pdfminer/pdfminer.six/issues/795))
 
 ### Deprecated
 

diff --git a/docs/source/faq.rst b/docs/source/faq.rst
@@ -7,11 +7,11 @@ Why is it called pdfminer.six?
 ==============================
 
 Pdfminer.six is a fork of the `original pdfminer created by Euske
-<https://github.com/euske>`_. Almost all of the code and architecture is in
-fact created by Euske. But, for a long time this original pdfminer did not
+<https://github.com/euske>`_. Almost all of the code and architecture are in
+fact created by Euske. But, for a long time, this original pdfminer did not
 support Python 3. Until 2020 the original pdfminer only supported Python 2.
 The original goal of pdfminer.six was to add support for Python 3. This was
-done with the six package. The six package helps to write code that is
+done with the `six` package. The `six` package helps to write code that is
 compatible with both Python 2 and Python 3. Hence, pdfminer.six.
 
 As of 2020, pdfminer.six dropped the support for Python 2 because it was
@@ -27,15 +27,60 @@ also equal to six feet.
 How does pdfminer.six compare to other forks of pdfminer?
 ==========================================================
 
-Pdfminer.six is now an independent and community maintained package for
-extracting text from PDF's with Python. We actively fix bugs (also for PDF's
+Pdfminer.six is now an independent and community-maintained package for
+extracting text from PDFs with Python. We actively fix bugs (also for PDFs
 that don't strictly follow the PDF Reference), add new features and improve
 the usability of pdfminer.six. This community separates pdfminer.six from the
 other forks of the original pdfminer. PDF as a format is very diverse and
 there are countless deviations from the official format. The only way to
-support all the PDF's out there is to have a community that actively uses and
+support all the PDFs out there is to have a community that actively uses and
 improves pdfminer.
 
 Since 2020, the original pdfminer is `dormant
 <https://github.com/euske/pdfminer#pdfminer>`_, and pdfminer.six is the fork
 which Euske recommends if you need an actively maintained version of pdfminer.
+
+Unable to extract special characters from PDF
+=============================================
+
+a.k.a.
+
+"Why does pdfminer.six not extract special characters from PDFs?"
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+or
+
+"Getting weird characters (cid\:20) instead of special characters"
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Pdfminer.six extracts the text that is encoded in a PDF. It cannot always
+recover the text of the document the PDF was created from: the PDF format
+is very complex, and no tool can extract that text with 100% accuracy.
+
+One of the most commonly encountered issues is that **the PDF does not contain
+a mapping for its glyphs** (i.e. the characters displayed on the screen).
+This is common with PDFs that are generated by scanning documents. In this
+case, pdfminer.six extracts the **CID** (character identifier) instead of
+the actual character, for example ``(cid:20)``. The CID is not a character
+itself: it is a number that refers to a character defined in the PDF's
+font. The PDF's font maps each CID to the actual character, so without
+that mapping the character cannot be recovered.
+
+One of the easiest ways to understand it is to think of the CID as a
+reference to a character in a dictionary. The dictionary is the PDF's font.
+The CID is the key in the dictionary. The actual character is the value in
+the dictionary.
+
+A quick way to check whether the PDF contains a mapping for its glyphs is to
+select some text in a PDF viewer and paste it into a text editor. If the
+pasted text is correct, the PDF contains a mapping for its glyphs. If it is
+garbled or missing, the PDF most likely does not contain such a mapping.
+
+Copy-pasting from such a PDF gives gibberish because the viewer has no
+mapping to work with either. Pdfminer.six does the same thing, automated,
+and gets the same result (gibberish).
+
+References: 
+
+#. `Chapter 5: Text, PDF Reference 1.7 <https://opensource.adobe.com/dc-acrobat-sdk-docs/pdflsdk/index.html#pdf-reference>`_
+#. `Text: PDF, Wikipedia <https://en.wikipedia.org/wiki/PDF#Text>`_
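The FAQ entry above describes unmapped glyphs showing up as `(cid:N)` placeholders in the extracted text. A minimal sketch of how such placeholders could be detected after extraction, assuming the text has already been obtained as a plain string; `CID_PATTERN` and `find_unmapped_cids` are hypothetical names for illustration, not part of pdfminer.six:

```python
import re
from typing import List

# Pattern for the "(cid:N)" placeholder that appears in extracted text
# when a glyph has no character mapping.
CID_PATTERN = re.compile(r"\(cid:(\d+)\)")


def find_unmapped_cids(text: str) -> List[int]:
    """Return the CID numbers of all "(cid:N)" placeholders found in text."""
    return [int(number) for number in CID_PATTERN.findall(text)]


sample = "Caf(cid:20) au lait (cid:3)"
print(find_unmapped_cids(sample))  # [20, 3]
```

An empty result suggests the extracted text contains no unmapped glyphs; a long list suggests the fonts in the PDF lack the mapping described in the FAQ.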
diff --git a/docs/source/howto/acro_forms.rst b/docs/source/howto/acro_forms.rst
@@ -65,7 +65,7 @@ Only AcroForm interactive forms are supported, XFA forms are not supported.
 
             print(name, values)
 
-This code snippet will print all the fields name and value and save them in the "data" dictionary.
+This code snippet will print all the fields' names and values and save them in the "data" dictionary.
 
 
 How it works:
@@ -77,9 +77,9 @@ How it works:
     parser = PDFParser(fp)
     doc = PDFDocument(parser)
 
-- Get the catalog
+- Get the Catalog
 
-  (the catalog contains references to other objects defining the document structure, see section 7.7.2 of PDF 32000-1:2008 specs: https://www.adobe.com/devnet/pdf/pdf_reference.html)
+  (the catalog contains references to other objects defining the document structure, see section 7.7.2 of PDF 32000-1:2008 specs: https://opensource.adobe.com/dc-acrobat-sdk-docs/pdflsdk/index.html#pdf-reference)
 
 .. code-block:: python
 
@@ -122,7 +122,7 @@ How it works:
 
 - Call the value(s) decoding method as needed
 
-  (a single field can hold multiple values, for example a combo box can hold more than one value at time)
+  (a single field can hold multiple values, for example, a combo box can hold more than one value at a time)
 
 .. code-block:: python
 
@@ -131,7 +131,7 @@ How it works:
     else:
         values = decode_value(values)
 
-(the decode_value method takes care of decoding the fields value returning a string)
+(the decode_value method takes care of decoding the field's value, returning a string)
 
 - Decode PSLiteral and PSKeyword field values
 

diff --git a/docs/source/howto/coordinates.rst b/docs/source/howto/coordinates.rst
@@ -0,0 +1,94 @@
+.. _tutorial_coordinates:
+
+How to extract text, text coordinates and font information from a PDF
+************************************************************************
+
+The high-level API can be used to extract text, text coordinates and font information from a PDF.
+
+pdfminer.six uses a layout analysis algorithm that returns a hierarchical
+structure of the elements on each page. The following example shows how to
+traverse this tree to extract information.
+
+For more information on the Layout analysis algorithm, please refer to the
+:ref:`topic_pdf_to_text_layout` section.
+
+.. code-block:: python
+
+    from pathlib import Path
+    from typing import Iterable, Any
+
+    from pdfminer.high_level import extract_pages
+
+    def show_ltitem_hierarchy(o: Any, depth=0):
+        """Show location and text of LTItem and all its descendants"""
+        if depth == 0:
+            print('element                         x1  y1  x2'
+                '  y2  fontinfo             text')
+            print('------------------------------ --- --- --- '
+                '--- -------------------- -----')
+
+        print(
+            f'{get_indented_name(o, depth):<30.30s} '
+            f'{get_optional_bbox(o)} '
+            f'{get_optional_fontinfo(o):<20.20s} '
+            f'{get_optional_text(o)}'
+        )
+
+        if isinstance(o, Iterable):
+            for i in o:
+                show_ltitem_hierarchy(i, depth=depth + 1)
+
+
+    def get_indented_name(o: Any, depth: int) -> str:
+        """Indented name of class"""
+        return '  ' * depth + o.__class__.__name__
+
+
+    def get_optional_fontinfo(o: Any) -> str:
+        """Font info of LTChar if available, otherwise empty string"""
+        if hasattr(o, 'fontname') and hasattr(o, 'size'):
+            return f'{o.fontname} {round(o.size)}pt'
+        return ''
+
+    def get_optional_bbox(o: Any) -> str:
+        """Bounding box of LTItem if available, otherwise empty string"""
+        if hasattr(o, 'bbox'):
+            return ''.join(f'{i:<4.0f}' for i in o.bbox)
+        return ''
+
+    def get_optional_text(o: Any) -> str:
+        """Text of LTItem if available, otherwise empty string"""
+        if hasattr(o, 'get_text'):
+            return o.get_text().strip()
+        return ''
+
+    path = Path('~/Downloads/simple1.pdf').expanduser()
+    pages = extract_pages(path)
+    show_ltitem_hierarchy(pages)
+
+You will get the following output:
+
+.. code-block:: text
+
+    element                         x1  y1  x2  y2  fontinfo             text
+    ------------------------------ --- --- --- ---- -------------------- -----
+    generator                                            
+      LTPage                       0   0   612 792                       
+        LTTextBoxHorizontal        100 695 161 719                       Hello
+          LTTextLineHorizontal     100 695 161 719                       Hello
+            LTChar                 100 695 117 719  Helvetica 24pt       H
+            LTChar                 117 695 131 719  Helvetica 24pt       e
+            LTChar                 131 695 136 719  Helvetica 24pt       l
+            LTChar                 136 695 141 719  Helvetica 24pt       l
+            LTChar                 141 695 155 719  Helvetica 24pt       o
+            LTChar                 155 695 161 719  Helvetica 24pt       
+            LTAnno                                       
+        LTTextBoxHorizontal        261 695 324 719                       World
+          LTTextLineHorizontal     261 695 324 719                       World
+            LTChar                 261 695 284 719  Helvetica 24pt       W
+            LTChar                 284 695 297 719  Helvetica 24pt       o
+            LTChar                 297 695 305 719  Helvetica 24pt       r
+            LTChar                 305 695 311 719  Helvetica 24pt       l
+            LTChar                 311 695 324 719  Helvetica 24pt       d
+            LTAnno  
+    ...
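The same traversal can aggregate information instead of printing it. A minimal sketch, assuming only that character objects expose `fontname` and `size` as in the how-to above; `iter_chars` and `font_usage` are hypothetical helper names, not part of pdfminer.six:

```python
from collections import Counter
from collections.abc import Iterable
from typing import Any, Iterator


def iter_chars(o: Any) -> Iterator[Any]:
    """Yield every descendant of o that has both a fontname and a size."""
    if hasattr(o, "fontname") and hasattr(o, "size"):
        yield o
    elif isinstance(o, Iterable) and not isinstance(o, (str, bytes)):
        # Containers (pages, text boxes, text lines) are walked recursively.
        for child in o:
            yield from iter_chars(child)


def font_usage(root: Any) -> Counter:
    """Count characters per (font name, rounded size) pair."""
    return Counter((c.fontname, round(c.size)) for c in iter_chars(root))


# With pdfminer.six installed, the result of extract_pages() can be passed in:
#     from pdfminer.high_level import extract_pages
#     print(font_usage(extract_pages("simple1.pdf")))
```

Duck typing on `fontname` and `size` keeps the sketch consistent with `get_optional_fontinfo` in the how-to; checking `isinstance(o, LTChar)` would be a stricter alternative.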
diff --git a/docs/source/howto/index.rst b/docs/source/howto/index.rst
@@ -10,3 +10,4 @@ How-to guides help you to solve specific problems with pdfminer.six.
 
     images
     acro_forms
+    coordinates
diff --git a/docs/source/topic/converting_pdf_to_text.rst b/docs/source/topic/converting_pdf_to_text.rst
@@ -3,7 +3,7 @@
 Converting a PDF file to text
 *****************************
 
-Most PDF files look like they contain well structured text. But the reality  is
+Most PDF files look like they contain well-structured text. But the reality is
 that a PDF file does not contain anything that resembles paragraphs,
 sentences or even words. When it comes to text, a PDF file is only aware of
 the characters and their placement.
@@ -14,7 +14,7 @@ compose the table, the page footer or the description of a figure. Unlike
 other document formats, like a `.txt` file or a word document, the PDF format
 does not contain a stream of text.
 
-A PDF document does consists of a collection of objects that together describe
+A PDF document consists of a collection of objects that together describe
 the appearance of one or more pages, possibly accompanied by additional
 interactive elements and higher-level application data. A PDF file contains
 the objects making up a PDF document along with associated structural
@@ -53,7 +53,7 @@ uses these bounding boxes to decide which characters belong together.
 
 Characters that are both horizontally and vertically close are grouped onto
 one line. How close they should be is determined by the `char_margin`
-(M in figure) and the `line_overlap` (not in figure) parameter. The horizontal
+(M in the figure) and the `line_overlap` (not in the figure) parameters. The horizontal
 *distance* between the bounding boxes of two characters should be smaller than
 the `char_margin` and the vertical *overlap* between the bounding boxes should
 be smaller than the `line_overlap`.
@@ -76,7 +76,7 @@ be separated by a space.
 
 The result of this stage is a list of lines. Each line consists of a list of
 characters. These characters are either original `LTChar` characters that
-originate from the PDF file, or inserted `LTAnno` characters that
+originate from the PDF file or inserted `LTAnno` characters that
 represent spaces between words or newlines at the end of each line.
 
 Grouping lines into boxes
@@ -91,7 +91,7 @@ Lines that are both horizontally overlapping and vertically close are grouped.
 How vertically close the lines should be is determined by the `line_margin`.
 This margin is specified relative to the height of the bounding box. Lines
 are close if the gap between the tops (see L :sub:`1` in the figure) and bottoms
-(see L :sub:`2`) in the figure) of the bounding boxes is closer together
+(see L :sub:`2` in the figure) of the bounding boxes is smaller
 than the absolute line margin, i.e. the `line_margin` multiplied by the
 height of the bounding box.
 
@@ -120,7 +120,7 @@ Working with rotated characters
 
 The algorithm described above assumes that all characters have the same
 orientation. However, any writing direction is possible in a PDF. To
-accommodate for this, pdfminer.six allows to detect vertical writing with the
+accommodate this, pdfminer.six can detect vertical writing with the
 `detect_vertical` parameter. This will apply all the grouping steps as if the
 pdf was rotated 90 (or 270) degrees
 

diff --git a/pdfminer/high_level.py b/pdfminer/high_level.py
@@ -185,7 +185,9 @@ def extract_pages(
     caching: bool = True,
     laparams: Optional[LAParams] = None,
 ) -> Iterator[LTPage]:
-    """Extract and yield LTPage objects
+    """Extract and yield LTPage objects which can be further iterated to get
+    sub-elements. This is the most powerful method of extracting data from a
+    PDF.
 
     :param pdf_file: Either a file path or a file-like object for the PDF file
         to be worked on.
@@ -195,7 +197,7 @@ def extract_pages(
     :param caching: If resources should be cached
     :param laparams: An LAParams object from pdfminer.layout. If None, uses
         some default settings that often work well.
-    :return:
+    :return: LTPage objects
     """
     if laparams is None:
         laparams = LAParams()

diff --git a/pdfminer/image.py b/pdfminer/image.py
@@ -225,20 +225,24 @@ def _save_bytes(self, image: LTImage) -> str:
         with open(path, "wb") as fp:
             try:
                 from PIL import Image  # type: ignore[import]
+                from PIL import ImageOps
             except ImportError:
                 raise ImportError(PIL_ERROR_MESSAGE)
 
-            mode: Literal["1", "8", "RGB", "CMYK"]
+            mode: Literal["1", "L", "RGB", "CMYK"]
             if image.bits == 1:
                 mode = "1"
             elif image.bits == 8 and channels == 1:
-                mode = "8"
+                mode = "L"
             elif image.bits == 8 and channels == 3:
                 mode = "RGB"
             elif image.bits == 8 and channels == 4:
                 mode = "CMYK"
 
             img = Image.frombytes(mode, image.srcsize, image.stream.get_data(), "raw")
+            if mode == "L":
+                img = ImageOps.invert(img)
+
             img.save(fp)
 
         return name
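The mode selection in `_save_bytes` can be isolated as a small pure function, which makes the fix easy to check without Pillow installed. A minimal sketch; `pil_mode` is a hypothetical helper, not part of pdfminer.six, and raising `ValueError` for unsupported combinations is an assumption (the diff leaves `mode` unset in that case):

```python
def pil_mode(bits: int, channels: int) -> str:
    """Return the Pillow mode string for raw image data."""
    if bits == 1:
        return "1"  # 1-bit black and white
    if bits == 8 and channels == 1:
        return "L"  # 8-bit grayscale; "8" is not a Pillow mode
    if bits == 8 and channels == 3:
        return "RGB"
    if bits == 8 and channels == 4:
        return "CMYK"
    # Assumption: fail loudly instead of leaving the mode undefined.
    raise ValueError(f"unsupported image: {bits} bits, {channels} channels")


print(pil_mode(8, 1))  # L
```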