[components] Add element_ordering argument to PDFDocument #95

Merged 2 commits on Jun 22, 2020
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -8,6 +8,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ### Added
 - Added `__len__` and `__repr__` functions to the Section class. ([#90](https://github.com/jstockwin/py-pdf-parser/pull/90))
 - Added flag to `extract_simple_table` and `extract_table` functions to remove duplicate header rows. ([#89](https://github.com/jstockwin/py-pdf-parser/pull/89))
+- You can now specify `element_ordering` when instantiating a PDFDocument. This defaults to the old behaviour of left to right, top to bottom. ([#95](https://github.com/jstockwin/py-pdf-parser/pull/95))
+
 ### Changed
 - Advanced layout analysis is now disabled by default. ([#88](https://github.com/jstockwin/py-pdf-parser/pull/88))

Binary file added docs/source/example_files/columns.pdf
Binary file not shown.
Binary file added docs/source/example_files/grid.pdf
Binary file not shown.
163 changes: 163 additions & 0 deletions docs/source/examples/element_ordering.rst
@@ -0,0 +1,163 @@
.. _element-ordering:

Element Ordering
----------------

In this example, we see how to specify a custom ordering for the elements.

For this we will use a simple pdf, which has a single element in each corner of the
page. You can :download:`download the example here </example_files/grid.pdf>`.


Default
.......

The default element ordering is left to right, top to bottom.

.. code-block:: python

    from py_pdf_parser.loaders import load_file

    file_path = "grid.pdf"

    # Default - left to right, top to bottom
    document = load_file(file_path)
    print([element.text() for element in document.elements])

This results in
::

    ['Top Left', 'Top Right', 'Bottom Left', 'Bottom Right']

Presets
.......

There are also preset orderings for ``right to left, top to bottom``,
``top to bottom, left to right``, and ``top to bottom, right to left``. You can use
these by importing the :class:`~py_pdf_parser.components.ElementOrdering` class from
:py:mod:`py_pdf_parser.components` and passing these as the ``element_ordering``
argument to :class:`~py_pdf_parser.components.PDFDocument`. Note that keyword arguments
to :meth:`~py_pdf_parser.loaders.load` and :meth:`~py_pdf_parser.loaders.load_file` get
passed through to the :class:`~py_pdf_parser.components.PDFDocument`.

.. code-block:: python

    from py_pdf_parser.loaders import load_file
    from py_pdf_parser.components import ElementOrdering

    # Preset - right to left, top to bottom
    document = load_file(
        file_path, element_ordering=ElementOrdering.RIGHT_TO_LEFT_TOP_TO_BOTTOM
    )
    print([element.text() for element in document.elements])

    # Preset - top to bottom, left to right
    document = load_file(
        file_path, element_ordering=ElementOrdering.TOP_TO_BOTTOM_LEFT_TO_RIGHT
    )
    print([element.text() for element in document.elements])

    # Preset - top to bottom, right to left
    document = load_file(
        file_path, element_ordering=ElementOrdering.TOP_TO_BOTTOM_RIGHT_TO_LEFT
    )
    print([element.text() for element in document.elements])

which results in

::

    ['Top Right', 'Top Left', 'Bottom Right', 'Bottom Left']
    ['Bottom Left', 'Top Left', 'Bottom Right', 'Top Right']
    ['Top Right', 'Bottom Right', 'Top Left', 'Bottom Left']
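
Each preset is a plain sort on the ``x0`` and ``y0`` of the elements (see ``_ELEMENT_ORDERING_FUNCTIONS`` in ``components.py`` in this PR). The following is a minimal sketch of those sort keys, not the library code itself: ``Box``, ``PRESET_KEYS`` and ``order_names`` are hypothetical names used only for illustration, and the coordinates are the idealised ones from this PR's unit test rather than values read from ``grid.pdf``.

```python
from collections import namedtuple

# Hypothetical stand-in for a PDFMiner element: only the attributes the keys use.
Box = namedtuple("Box", ["name", "x0", "y0"])

# Idealised 2x2 grid. The y axis grows upwards, so the top row has the larger y0.
boxes = [
    Box("Top Left", 0, 6),
    Box("Top Right", 6, 6),
    Box("Bottom Left", 0, 0),
    Box("Bottom Right", 6, 0),
]

# The same sort keys as the presets added in this PR, keyed by preset name.
PRESET_KEYS = {
    "LEFT_TO_RIGHT_TOP_TO_BOTTOM": lambda b: (-b.y0, b.x0),
    "RIGHT_TO_LEFT_TOP_TO_BOTTOM": lambda b: (-b.y0, -b.x0),
    "TOP_TO_BOTTOM_LEFT_TO_RIGHT": lambda b: (b.x0, -b.y0),
    "TOP_TO_BOTTOM_RIGHT_TO_LEFT": lambda b: (-b.x0, -b.y0),
}


def order_names(preset_name):
    """Return the element names in the order produced by the given preset key."""
    return [b.name for b in sorted(boxes, key=PRESET_KEYS[preset_name])]


for preset_name in PRESET_KEYS:
    print(preset_name, order_names(preset_name))
```

In this idealised grid the elements in each column share exactly the same ``x0``, so the column-first presets list each column from top to bottom. In a real PDF, elements that look vertically aligned may have slightly different ``x0`` values (for example, centred text of different widths), in which case ``x0`` alone decides the order within a column.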

Custom Ordering
...............

If none of the presets give an ordering you are looking for, you can also pass a
callable as the ``element_ordering`` argument of
:class:`~py_pdf_parser.components.PDFDocument`. This callable will be given a list of
elements for each page, and should return a list of the same elements, in the desired
order.

.. important::

    The elements passed to your function will be PDFMiner.six elements, and NOT
    instances of :class:`~py_pdf_parser.components.PDFElement`. You can access
    ``x0``, ``x1``, ``y0`` and ``y1`` directly, and extract the text using
    ``get_text()``. Other options are available: please familiarise yourself with the
    PDFMiner.six documentation.

.. note::

    Your function will be called multiple times, once for each page of the document.
    Elements will always be considered in order of increasing page number; your
    function only controls the ordering within each page.
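
To make the two notes above concrete, here is a minimal sketch of how a per-page ordering function is applied. It does not use py-pdf-parser or PDFMiner.six: ``FakeElement`` and ``apply_ordering`` are hypothetical names for illustration, and the stand-in class exposes only ``x0``, ``y0`` and ``get_text()``.

```python
class FakeElement:
    """Hypothetical stand-in exposing the PDFMiner-style attributes used below."""

    def __init__(self, text, x0, y0):
        self._text = text
        self.x0 = x0
        self.y0 = y0

    def get_text(self):
        return self._text


def bottom_to_top(elements):
    """Order the elements of a single page by increasing y0 (bottom first)."""
    return sorted(elements, key=lambda elem: elem.y0)


def apply_ordering(pages, ordering_function):
    """Call ordering_function once per page, visiting pages by page number."""
    ordered = []
    for _page_number, elements in sorted(pages.items()):
        ordered.extend(ordering_function(elements))
    return ordered


# Pages are deliberately listed out of order to show that page number wins.
pages = {
    2: [FakeElement("p2 top", 0, 10), FakeElement("p2 bottom", 0, 0)],
    1: [FakeElement("p1 top", 0, 10), FakeElement("p1 bottom", 0, 0)],
}
texts = [elem.get_text() for elem in apply_ordering(pages, bottom_to_top)]
print(texts)
```

Every element on page 1 comes before every element on page 2, whatever the ordering function does within a page.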

For example, if we wanted to implement an ordering which is bottom to top, left to right
then we can do this as follows:

.. code-block:: python

    from py_pdf_parser.loaders import load_file

    # Custom - bottom to top, left to right
    def ordering_function(elements):
        """
        Note: Elements will be PDFMiner.six elements. The x axis is positive as you go left
        to right, and the y axis is positive as you go bottom to top, and hence we can
        simply sort according to this.
        """
        return sorted(elements, key=lambda elem: (elem.x0, elem.y0))


    document = load_file(file_path, element_ordering=ordering_function)
    print([element.text() for element in document.elements])

which results in

::

    ['Bottom Left', 'Top Left', 'Bottom Right', 'Top Right']
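
A custom function does not have to be an exact coordinate sort. As a sketch of one possible variation (this is not something py-pdf-parser provides), the hypothetical helper ``make_row_tolerant_ordering`` below orders elements left to right, top to bottom, but treats elements whose ``y0`` values fall in the same band as being on the same row. ``Box`` is again a stand-in for a PDFMiner.six element, and the coordinates are invented.

```python
from collections import namedtuple

# Hypothetical stand-in for a PDFMiner element (only x0 and y0 are used here).
Box = namedtuple("Box", ["name", "x0", "y0"])


def make_row_tolerant_ordering(tolerance):
    """
    Return an ordering function (left to right, top to bottom) which treats
    elements whose y0 values round to the same band of height `tolerance`
    as being on the same row.
    """

    def ordering_function(elements):
        return sorted(
            elements,
            key=lambda elem: (-round(elem.y0 / tolerance), elem.x0),
        )

    return ordering_function


boxes = [
    Box("right cell", 300, 100.4),  # sits a fraction higher than its neighbour
    Box("left cell", 10, 99.8),
    Box("next row", 10, 50.0),
]

# An exact sort on y0 puts the slightly higher right-hand cell first.
strict = [b.name for b in sorted(boxes, key=lambda b: (-b.y0, b.x0))]
# Banding y0 keeps both cells on one row, so x0 decides their order.
tolerant = [b.name for b in make_row_tolerant_ordering(5)(boxes)]
print(strict)
print(tolerant)
```

Banding is a crude approach: two elements that are very close together can still land either side of a band boundary, so the tolerance needs choosing with the document's line spacing in mind.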

Multiple Columns
................

Finally, suppose our PDF has multiple columns, like
:download:`this example </example_files/columns.pdf>`.

If we don't specify an ``element_ordering``, the elements will be extracted in the
following order:

::

    ['Column 1 Title', 'Column 2 Title', 'Here is some column 1 text.', 'Here is some column 2 text.', 'Col 1 left', 'Col 1 right', 'Col 2 left', 'Col 2 right']

If we visualise this document
(see the :ref:`simple-memo` example if you don't know how to do this), then we can see
that the column divider is at an ``x`` value of about 300. Using this information, we
can specify a custom ordering function which will order the elements left to right,
top to bottom, but in each column individually.

.. code-block:: python

    from py_pdf_parser.loaders import load_file

    file_path = "columns.pdf"

    def column_ordering_function(elements):
        """
        The first entry in the key is False for column 1, and True for column 2. The
        second and third entries just give left to right, top to bottom.
        """
        return sorted(elements, key=lambda elem: (elem.x0 > 300, -elem.y0, elem.x0))


    document = load_file(file_path, element_ordering=column_ordering_function)
    print([element.text() for element in document.elements])

which returns the elements in the correct order:

::

    ['Column 1 Title', 'Here is some column 1 text.', 'Col 1 left', 'Col 1 right', 'Column 2 Title', 'Here is some column 2 text.', 'Col 2 left', 'Col 2 right']
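
The same idea extends to pages with more than two columns. The following is a minimal sketch, assuming the ``x`` positions of the column dividers are already known: ``make_column_ordering`` and ``Box`` are hypothetical names, and the coordinates are invented for illustration rather than measured from ``columns.pdf``.

```python
from bisect import bisect_left
from collections import namedtuple

# Hypothetical stand-in for a PDFMiner element (only x0 and y0 are used here).
Box = namedtuple("Box", ["name", "x0", "y0"])


def make_column_ordering(dividers):
    """
    Build an ordering function for a page split into columns at the given x
    values. Within each column, elements go left to right, top to bottom.
    """
    dividers = sorted(dividers)

    def ordering_function(elements):
        # bisect_left counts the dividers strictly less than x0, which is the
        # column index; for a single divider it matches `elem.x0 > divider`.
        return sorted(
            elements,
            key=lambda elem: (bisect_left(dividers, elem.x0), -elem.y0, elem.x0),
        )

    return ordering_function


boxes = [
    Box("Column 1 Title", 50, 700),
    Box("Column 2 Title", 350, 700),
    Box("Col 1 left", 50, 600),
    Box("Col 1 right", 200, 600),
    Box("Col 2 left", 350, 600),
    Box("Col 2 right", 500, 600),
]

two_columns = make_column_ordering([300])
names = [box.name for box in two_columns(boxes)]
print(names)
```

The returned function takes a list of elements and returns a list, so it has the shape expected by the ``element_ordering`` argument; a three-column page would simply pass two dividers.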
2 changes: 2 additions & 0 deletions docs/source/examples/index.rst
@@ -6,10 +6,12 @@ Below you can find links to the following examples:
 - The :ref:`simple-memo` example shows the very basics of using py-pdf-parser. You will see how to load a pdf document, start filtering the elements, and extract text from certain elements in the document.
 - The :ref:`order-summary` example explains how to use font mappings, sections, and how to extract simple tables.
 - The :ref:`more-tables` example explains tables in more detail, showing how to extract more complex tables.
+- The :ref:`element-ordering` example shows how to specify different orderings for the elements on a page.

 .. toctree::

     simple_memo
     order_summary
     more_tables
+    element_ordering

47 changes: 44 additions & 3 deletions py_pdf_parser/components.py
@@ -1,7 +1,8 @@
-from typing import Dict, List, Set, Optional, Union, TYPE_CHECKING
+from typing import Callable, Dict, List, Set, Optional, Union, TYPE_CHECKING

 import re
 from collections import Counter, defaultdict
+from enum import Enum, auto
 from itertools import chain

 from .common import BoundingBox
@@ -14,6 +15,33 @@
     from pdfminer.layout import LTComponent


+class ElementOrdering(Enum):
+    """
+    A class enumerating the available presets for element_ordering.
+    """
+
+    LEFT_TO_RIGHT_TOP_TO_BOTTOM = auto()
+    RIGHT_TO_LEFT_TOP_TO_BOTTOM = auto()
+    TOP_TO_BOTTOM_LEFT_TO_RIGHT = auto()
+    TOP_TO_BOTTOM_RIGHT_TO_LEFT = auto()
+
+
+_ELEMENT_ORDERING_FUNCTIONS: Dict[ElementOrdering, Callable[[List], List]] = {
+    ElementOrdering.LEFT_TO_RIGHT_TOP_TO_BOTTOM: lambda elements: sorted(
+        elements, key=lambda elem: (-elem.y0, elem.x0)
+    ),
+    ElementOrdering.RIGHT_TO_LEFT_TOP_TO_BOTTOM: lambda elements: sorted(
+        elements, key=lambda elem: (-elem.y0, -elem.x0)
+    ),
+    ElementOrdering.TOP_TO_BOTTOM_LEFT_TO_RIGHT: lambda elements: sorted(
+        elements, key=lambda elem: (elem.x0, -elem.y0)
+    ),
+    ElementOrdering.TOP_TO_BOTTOM_RIGHT_TO_LEFT: lambda elements: sorted(
+        elements, key=lambda elem: (-elem.x0, -elem.y0)
+    ),
+}
+
+
 class PDFPage:
     """
     A representation of a page within the `PDFDocument`.
@@ -325,6 +353,11 @@ class PDFDocument:
             Default: 0.
         font_size_precision (int): How much rounding to apply to the font size. The font
             size will be rounded to this many decimal places.
+        element_ordering (ElementOrdering or callable, optional): An ordering function
+            for the elements. Either a member of the ElementOrdering Enum, or a callable
+            which takes a list of elements and returns an ordered list of elements. This
+            will be called separately for each page. Note that the elements in this case
+            will be PDFMiner elements, and not PDFElements from this package.

     Attributes:
         pages (list): A list of all `PDFPages` in the document.
@@ -337,7 +370,8 @@ class PDFDocument:
     number_of_pages: int
     page_numbers: List[int]
     sectioning: "Sectioning"
-    # _element_list will contain all elements, sorted from top to bottom, left to right.
+    # _element_list will contain all elements, sorted according to element_ordering
+    # (default left to right, top to bottom).
     _element_list: List[PDFElement]
     # _element_indexes_by_font will be a caching of fonts to elements indexes but it
     # will be built as needed (while filtering by fonts), not on document load.
@@ -357,6 +391,9 @@ def __init__(
         font_mapping_is_regex: bool = False,
         regex_flags: Union[int, re.RegexFlag] = 0,
         font_size_precision: int = 1,
+        element_ordering: Union[
+            ElementOrdering, Callable[[List], List]
+        ] = ElementOrdering.LEFT_TO_RIGHT_TOP_TO_BOTTOM,
     ):
         self.sectioning = Sectioning(self)
         self._element_list = []
@@ -369,7 +406,11 @@ def __init__(
         idx = 0
         for page_number, page in sorted(pages.items()):
             first_element = None
-            for element in sorted(page.elements, key=lambda elem: (-elem.y0, elem.x0)):
+            if isinstance(element_ordering, ElementOrdering):
+                sort_func = _ELEMENT_ORDERING_FUNCTIONS[element_ordering]
+            else:
+                sort_func = element_ordering
+            for element in sort_func(page.elements):
                 pdf_element = PDFElement(
                     document=self,
                     element=element,
9 changes: 6 additions & 3 deletions py_pdf_parser/filtering.py
@@ -684,7 +684,8 @@ def before(self, element: "PDFElement", inclusive: bool = False) -> "ElementList
         Returns all elements before the specified element.

         By before, we mean preceding elements according to their index. The PDFDocument
-        will order elements left to right, top to bottom (as you would normally read).
+        will order elements according to the specified element_ordering (which defaults
+        to left to right, top to bottom).

         Args:
             element (PDFElement): The element in question.
@@ -704,7 +705,8 @@ def after(self, element: "PDFElement", inclusive: bool = False) -> "ElementList"
         Returns all elements after the specified element.

         By after, we mean succeeding elements according to their index. The PDFDocument
-        will order elements left to right, top to bottom (as you would normally read).
+        will order elements according to the specified element_ordering (which defaults
+        to left to right, top to bottom).

         Args:
             element (PDFElement): The element in question.
@@ -729,7 +731,8 @@ def between(
         Returns all elements between the start and end elements.

         This is done according to the element indexes. The PDFDocument will order
-        elements left to right, top to bottom (as you would normally read).
+        elements according to the specified element_ordering (which defaults
+        to left to right, top to bottom).

         This is the same as applying `before` with `start_element` and `after` with
         `end_element`.
2 changes: 1 addition & 1 deletion tests/base.py
@@ -37,7 +37,7 @@ def assert_original_element_list_list_equal(
     def assert_original_element_list_equal(
         self,
         original_element_list: List[Optional["LTComponent"]],
-        element_list: List[Optional["PDFElement"]],
+        element_list: Union[List[Optional["PDFElement"]], "ElementList"],
     ):
         self.assertEqual(len(original_element_list), len(element_list))
         for original_element, element in zip(original_element_list, element_list):
57 changes: 55 additions & 2 deletions tests/test_components.py
@@ -3,13 +3,13 @@
 from ddt import ddt, data

 from py_pdf_parser.common import BoundingBox
-from py_pdf_parser.components import PDFDocument
+from py_pdf_parser.components import PDFDocument, ElementOrdering
 from py_pdf_parser.filtering import ElementList
 from py_pdf_parser.loaders import Page
 from py_pdf_parser.exceptions import NoElementsOnPageError, PageNotFoundError

 from .base import BaseTestCase
-from .utils import create_pdf_element, FakePDFMinerTextElement
+from .utils import create_pdf_element, create_pdf_document, FakePDFMinerTextElement


 @ddt
@@ -286,3 +286,56 @@ def test_document(self):
     def test_document_with_blank_page(self):
         with self.assertRaises(NoElementsOnPageError):
             PDFDocument(pages={1: Page(elements=[], width=100, height=100)})
+
+    def test_element_ordering(self):
+        # elem_1 elem_2
+        # elem_3 elem_4
+        elem_1 = FakePDFMinerTextElement(bounding_box=BoundingBox(0, 5, 6, 10))
+        elem_2 = FakePDFMinerTextElement(bounding_box=BoundingBox(6, 10, 6, 10))
+        elem_3 = FakePDFMinerTextElement(bounding_box=BoundingBox(0, 5, 0, 5))
+        elem_4 = FakePDFMinerTextElement(bounding_box=BoundingBox(6, 10, 0, 5))
+
+        # Check default: left to right, top to bottom
+        document = create_pdf_document(elements=[elem_1, elem_2, elem_3, elem_4])
+        self.assert_original_element_list_equal(
+            [elem_1, elem_2, elem_3, elem_4], document.elements
+        )
+
+        # Check other presets
+        document = create_pdf_document(
+            elements=[elem_1, elem_2, elem_3, elem_4],
+            element_ordering=ElementOrdering.RIGHT_TO_LEFT_TOP_TO_BOTTOM,
+        )
+        self.assert_original_element_list_equal(
+            [elem_2, elem_1, elem_4, elem_3], document.elements
+        )
+
+        document = create_pdf_document(
+            elements=[elem_1, elem_2, elem_3, elem_4],
+            element_ordering=ElementOrdering.TOP_TO_BOTTOM_LEFT_TO_RIGHT,
+        )
+        self.assert_original_element_list_equal(
+            [elem_1, elem_3, elem_2, elem_4], document.elements
+        )
+
+        document = create_pdf_document(
+            elements=[elem_1, elem_2, elem_3, elem_4],
+            element_ordering=ElementOrdering.TOP_TO_BOTTOM_RIGHT_TO_LEFT,
+        )
+        self.assert_original_element_list_equal(
+            [elem_2, elem_4, elem_1, elem_3], document.elements
+        )
+
+        # Check custom function
+        document = create_pdf_document(
+            elements=[elem_1, elem_2, elem_3, elem_4],
+            element_ordering=lambda elements: [
+                elements[0],
+                elements[3],
+                elements[1],
+                elements[2],
+            ],
+        )
+        self.assert_original_element_list_equal(
+            [elem_1, elem_4, elem_2, elem_3], document.elements
+        )