[loaders] Include text from figures when all_texts=True #99

Merged 1 commit on Jun 24, 2020
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -5,6 +5,8 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]
### Changed
- When the layout parameter all_texts is True, the text inside figures is now also returned as elements in the document. ([#99](https://github.com/jstockwin/py-pdf-parser/pull/99))

## [0.4.0]
### Added
Binary file added docs/source/example_files/figure.pdf
Binary file not shown.
45 changes: 45 additions & 0 deletions docs/source/examples/extracting_text_from_figures.rst
@@ -0,0 +1,45 @@
.. _extracting-text-from-figures:

Extracting Text From Figures
----------------------------
PDFs are structured documents and can contain figures. By default, PDFMiner.six, and
hence py-pdf-parser, does not extract text from figures.

You can :download:`download an example here </example_files/figure.pdf>`. In the
example, there is a figure which contains a red square and some text. Below the figure
there is some more text.

By default, the text in the figure will not be included:

.. code-block:: python

    from py_pdf_parser import load_file
    document = load_file("figure.pdf")
    print([element.text() for element in document.elements])

which results in:

::

    ["Here is some text outside of an image"]

To include the text inside the figure, we must pass the ``all_texts`` layout parameter.
This is documented in the PDFMiner.six documentation, `here
<https://pdfminersix.readthedocs.io/en/latest/reference/composable.html#laparams>`_.

The layout parameters can be passed to both :meth:`~py_pdf_parser.loaders.load` and
:meth:`~py_pdf_parser.loaders.load_file` as a dictionary to the ``la_params`` argument.

In our case:

.. code-block:: python

    from py_pdf_parser import load_file
    document = load_file("figure.pdf", la_params={"all_texts": True})
    print([element.text() for element in document.elements])

which results in:

::

    ["This is some text in an image", "Here is some text outside of an image"]
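
A note on the `la_params` dictionary used above: the loader in this PR passes it on as `LAParams(**la_params)`, so each key becomes a keyword argument. The following is a minimal sketch of that pattern only. `LayoutParams` is a hypothetical stand-in with two fields so the snippet runs without PDFMiner.six installed, and `build_layout_params` is an illustrative name, not a py-pdf-parser function.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class LayoutParams:
    """Hypothetical stand-in for pdfminer.layout.LAParams (two fields only)."""

    all_texts: bool = False
    line_margin: float = 0.5


def build_layout_params(la_params: Optional[Dict[str, Any]] = None) -> LayoutParams:
    # Treat a missing dictionary as "use every default".
    la_params = la_params or {}
    # Each key becomes a keyword argument; an unknown key raises TypeError.
    return LayoutParams(**la_params)


print(build_layout_params())
print(build_layout_params({"all_texts": True}))
```

One consequence of keyword unpacking is that a misspelled key such as `"all_text"` fails loudly with a `TypeError` instead of being silently ignored.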
2 changes: 2 additions & 0 deletions docs/source/examples/index.rst
@@ -7,11 +7,13 @@ Below you can find links to the following examples:
- The :ref:`order-summary` example explains how to use font mappings, sections, and how to extract simple tables.
- The :ref:`more-tables` example explains tables in more detail, showing how to extract more complex tables.
- The :ref:`element-ordering` example shows how to specify different orderings for the elements on a page.
- The :ref:`extracting-text-from-figures` example shows how to extract text from figures.

.. toctree::

    simple_memo
    order_summary
    more_tables
    element_ordering
    extracting_text_from_figures

13 changes: 12 additions & 1 deletion py_pdf_parser/loaders.py
@@ -3,7 +3,7 @@
import logging

from pdfminer.high_level import extract_pages
-from pdfminer.layout import LTTextContainer, LAParams
+from pdfminer.layout import LTTextContainer, LAParams, LTFigure

from .components import PDFDocument

@@ -74,6 +74,17 @@ def load(
    pages: Dict[int, Page] = {}
    for page in extract_pages(pdf_file, laparams=LAParams(**la_params)):
        elements = [element for element in page if isinstance(element, LTTextContainer)]

        # If all_texts=True then we may get some text from inside figures
        if la_params.get("all_texts"):
            figures = (element for element in page if isinstance(element, LTFigure))
            for figure in figures:
                elements += [
                    element
                    for element in figure
                    if isinstance(element, LTTextContainer)
                ]

        if not elements:
            logger.warning(
                f"No elements detected on page {page.pageid}, skipping this page."
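
For readers who want to see the new branch in `load` in isolation, here is a minimal sketch of the same collection logic. `TextBox` and `Figure` are hypothetical stand-ins for PDFMiner.six's `LTTextContainer` and `LTFigure` (not the real classes), and `collect_text_elements` is an illustrative name that does not exist in py-pdf-parser.

```python
from typing import Iterable, Iterator, List


class TextBox:
    """Hypothetical stand-in for pdfminer's LTTextContainer."""

    def __init__(self, text: str) -> None:
        self.text = text


class Figure:
    """Hypothetical stand-in for pdfminer's LTFigure: an iterable of children."""

    def __init__(self, children: Iterable[object]) -> None:
        self.children = list(children)

    def __iter__(self) -> Iterator[object]:
        return iter(self.children)


def collect_text_elements(page: Iterable[object], all_texts: bool) -> List[TextBox]:
    items = list(page)  # the page is scanned twice, so materialise it once
    # Top-level text elements are always kept.
    elements = [el for el in items if isinstance(el, TextBox)]
    # Text that is a direct child of a figure is appended only when all_texts is set.
    if all_texts:
        for figure in (el for el in items if isinstance(el, Figure)):
            elements += [el for el in figure if isinstance(el, TextBox)]
    return elements


page = [Figure([TextBox("in figure"), object()]), TextBox("outside figure")]
print([el.text for el in collect_text_elements(page, all_texts=False)])
# ['outside figure']
print([el.text for el in collect_text_elements(page, all_texts=True)])
# ['outside figure', 'in figure']
```

As in the diff, only text that is a direct child of a figure is picked up; text inside a figure nested in another figure would be missed. The sketch returns elements in collection order and does not model any ordering applied later when the document is built.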
Binary file added tests/data/image.pdf
Binary file not shown.
24 changes: 24 additions & 0 deletions tests/test_doc_examples/test_extracting_text_from_figures.py
@@ -0,0 +1,24 @@
import os

from py_pdf_parser.loaders import load_file
from tests.base import BaseTestCase


class TestExtractingTextFromFigures(BaseTestCase):
    def test_output_is_correct(self):
        file_path = os.path.join(
            os.path.dirname(__file__), "../../docs/source/example_files/figure.pdf"
        )

        # Without all_texts
        document = load_file(file_path)
        self.assertListEqual(
            [element.text() for element in document.elements],
            ["Here is some text outside of an image"],
        )

        # With all_texts
        document = load_file(file_path, la_params={"all_texts": True})
        self.assertListEqual(
            [element.text() for element in document.elements],
            ["This is some text in an image", "Here is some text outside of an image"],
        )
18 changes: 18 additions & 0 deletions tests/test_loaders.py
@@ -17,3 +17,21 @@ def test_load(self):
        with open(file_path, "rb") as in_file:
            document = load(in_file)
        self.assertIsInstance(document, PDFDocument)

    def test_load_with_text_in_image(self):
        file_path = os.path.join(os.path.dirname(__file__), "data", "image.pdf")
        with open(file_path, "rb") as in_file:
            document = load(in_file)
        self.assertIsInstance(document, PDFDocument)
        self.assertEqual(len(document.elements), 1)

        with open(file_path, "rb") as in_file:
            document = load(in_file, la_params={"all_texts": True})
        self.assertIsInstance(document, PDFDocument)
        self.assertEqual(len(document.elements), 2)

    def test_load_file_with_text_in_image(self):
        file_path = os.path.join(os.path.dirname(__file__), "data", "image.pdf")
        document = load_file(file_path, la_params={"all_texts": True})
        self.assertIsInstance(document, PDFDocument)
        self.assertEqual(len(document.elements), 2)