[loaders] Include text from figures when all_texts=True
Closes #98
Showing 7 changed files with 101 additions and 1 deletion.
Binary file not shown.
@@ -0,0 +1,45 @@
.. _extracting-text-from-figures:

Extracting Text From Figures
----------------------------
PDFs are structured documents and can contain figures. By default, PDFMiner.six, and
hence py-pdf-parser, does not extract text from figures.

You can :download:`download an example here </example_files/figure.pdf>`. In the
example, there is a figure which contains a red square and some text. Below the figure
there is some more text.

By default, the text in the figure will not be included:

.. code-block:: python

   from py_pdf_parser.loaders import load_file
   document = load_file("figure.pdf")
   print([element.text() for element in document.elements])

which results in:

::

   ["Here is some text outside of an image"]

To include the text inside the figure, we must pass the ``all_texts`` layout parameter.
This is documented in the PDFMiner.six documentation, `here
<https://pdfminersix.readthedocs.io/en/latest/reference/composable.html#laparams>`_.

The layout parameters can be passed to both :meth:`~py_pdf_parser.loaders.load` and
:meth:`~py_pdf_parser.loaders.load_file` as a dictionary to the ``la_params`` argument.

In our case:

.. code-block:: python

   from py_pdf_parser.loaders import load_file
   document = load_file("figure.pdf", la_params={"all_texts": True})
   print([element.text() for element in document.elements])

which results in:

::

   ["This is some text in an image", "Here is some text outside of an image"]
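The text above notes that ``load`` accepts ``la_params`` as well as ``load_file``. As a minimal sketch, the same dictionary could be passed alongside an already-open file. The helper names ``make_la_params`` and ``element_texts`` are hypothetical and not part of py-pdf-parser, and the sketch assumes ``load`` takes a binary file object as its first argument.

```python
from typing import Any, Dict, List, Optional


def make_la_params(
    include_figure_text: bool, extra: Optional[Dict[str, Any]] = None
) -> Dict[str, Any]:
    """Build the dictionary to pass as ``la_params`` (hypothetical helper)."""
    params = dict(extra or {})  # copy so the caller's dictionary is not mutated
    if include_figure_text:
        params["all_texts"] = True
    return params


def element_texts(path: str, include_figure_text: bool = False) -> List[str]:
    """Return the text of every element in the PDF at ``path``."""
    # Imported lazily so the helper above can be used without the library installed.
    from py_pdf_parser.loaders import load

    # Assumption: ``load`` accepts an open binary file object as its first argument.
    with open(path, "rb") as pdf_file:
        document = load(pdf_file, la_params=make_la_params(include_figure_text))
        return [element.text() for element in document.elements]
```

Calling ``element_texts("figure.pdf", include_figure_text=True)`` should then behave like the ``load_file`` example above, provided the assumption about ``load`` holds.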
Binary file not shown.
24 changes: 24 additions & 0 deletions
tests/test_doc_examples/test_extracting_text_from_figures.py
@@ -0,0 +1,24 @@
import os

from py_pdf_parser.loaders import load_file
from tests.base import BaseTestCase


class TestExtractingTextFromFigures(BaseTestCase):
    def test_output_is_correct(self):
        file_path = os.path.join(
            os.path.dirname(__file__), "../../docs/source/example_files/figure.pdf"
        )

        # Without all_texts
        document = load_file(file_path)
        self.assertListEqual(
            [element.text() for element in document.elements],
            ["Here is some text outside of an image"]
        )

        document = load_file(file_path, la_params={"all_texts": True})
        self.assertListEqual(
            [element.text() for element in document.elements],
            ["This is some text in an image", "Here is some text outside of an image"]
        )
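The two assertions in the test above differ only in the text that sits inside the figure. As a rough illustration, that difference can be computed from the two lists directly. The helper ``figure_only_texts`` is hypothetical (it is not part of py-pdf-parser or of this commit), and it assumes that enabling ``all_texts`` only adds elements and leaves the other texts unchanged.

```python
from collections import Counter
from typing import List


def figure_only_texts(default_texts: List[str], all_texts: List[str]) -> List[str]:
    """Return texts that appear only when all_texts=True, in document order."""
    remaining = Counter(default_texts)
    figure_texts = []
    for text in all_texts:
        if remaining[text] > 0:
            remaining[text] -= 1  # matches an element that is extracted by default
        else:
            figure_texts.append(text)  # only present when figures are analysed
    return figure_texts


# The two lists asserted in the test above.
without_figures = ["Here is some text outside of an image"]
with_figures = ["This is some text in an image", "Here is some text outside of an image"]
print(figure_only_texts(without_figures, with_figures))
# prints ['This is some text in an image']
```

A ``Counter`` is used rather than a set so that repeated strings, such as a caption that also appears in the body text, are matched one occurrence at a time.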