Fix regression in page layout that sometimes returned text lines out of order (#659)

* add a test

* fix the bug

* rewrap long lines

* update CHANGELOG

* re-merge CHANGELOG

Co-authored-by: Pieter Marsman <[email protected]>
0xabu and pietermarsman authored Jan 26, 2022
1 parent 9a644aa commit 95dee8d
Showing 4 changed files with 16 additions and 1 deletion.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -11,6 +11,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).

 ### Fixed
 - Hande decompression error due to CRC checksum error ([#637](https://github.com/pdfminer/pdfminer.six/pull/637))
+- Regression (since 20191107) in `LTLayoutContainer.group_textboxes` that returned some text lines out of order ([#659](https://github.com/pdfminer/pdfminer.six/pull/659))
 - Add handling of JPXDecode filter to enable extraction of images for some pdfs ([#645](https://github.com/pdfminer/pdfminer.six/pull/645))
 - Fix extraction of jbig2 files, which was producing invalid files ([#652](https://github.com/pdfminer/pdfminer.six/pull/653))
 - Crash in `pdf2txt.py --boxes-flow=disabled` ([#682](https://github.com/pdfminer/pdfminer.six/pull/682))
2 changes: 1 addition & 1 deletion pdfminer/layout.py
@@ -889,7 +889,7 @@ def isany(obj1: ElementT, obj2: ElementT) -> Set[ElementT]:
             (skip_isany, d, id1, id2, obj1, obj2) = heapq.heappop(dists)
             # Skip objects that are already merged
             if (id1 not in done) and (id2 not in done):
-                if skip_isany and isany(obj1, obj2):
+                if not skip_isany and isany(obj1, obj2):
                     heapq.heappush(dists, (True, d, id1, id2, obj1, obj2))
                     continue
                 if isinstance(obj1, (LTTextBoxVertical, LTTextGroupTBRL)) or \
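The effect of the one-line change in `pdfminer/layout.py` can be illustrated with a simplified model of the heap loop. This is only a sketch, not the library's code: `accept_order` and `is_blocked` are hypothetical names, `is_blocked` stands in for `isany`, and the sketch assumes every candidate pair is first queued with the flag set to `False` (that part is not shown in the hunk). It also leaves out the actual merging and the `done` bookkeeping.

```python
import heapq
from typing import Callable, List, Tuple


def accept_order(
    pairs: List[Tuple[float, str]],
    is_blocked: Callable[[str], bool],
    fixed: bool = True,
) -> List[str]:
    """Return the order in which candidate pairs are accepted.

    pairs: (distance, name) candidates; a hypothetical stand-in for the
        (d, id1, id2, obj1, obj2) entries in the real heap.
    is_blocked: hypothetical stand-in for isany(); True when some other
        object lies between the two members of the pair.
    fixed: True uses the condition after the commit, False the one before.
    """
    # Assumption: every candidate starts with the "already deferred" flag False.
    heap = [(False, d, name) for d, name in pairs]
    heapq.heapify(heap)
    accepted = []
    while heap:
        skip_isany, d, name = heapq.heappop(heap)
        should_check = (not skip_isany) if fixed else skip_isany
        if should_check and is_blocked(name):
            # Defer once: (True, ...) sorts after every (False, ...) entry.
            heapq.heappush(heap, (True, d, name))
            continue
        accepted.append(name)
    return accepted


pairs = [(1.0, "a-c"), (2.0, "a-b"), (3.0, "b-c")]
blocked = {"a-c"}  # pretend "b" lies between "a" and "c"

after_fix = accept_order(pairs, blocked.__contains__, fixed=True)
before_fix = accept_order(pairs, blocked.__contains__, fixed=False)
```

In this model the old condition can never be true for a freshly queued pair, so the blocked pair `a-c` is accepted first. With the new condition it is pushed back once with the flag set and is only accepted after the unblocked pairs.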
Binary file added samples/simple5.pdf
Binary file not shown.
14 changes: 14 additions & 0 deletions tests/test_highlevel_extracttext.py
@@ -31,6 +31,10 @@ def run_with_file(sample_path):
     "simple3.pdf": "Hello\n\nHello\n\n\n\n\n\n\n\n\n\n\n"
                    "World\n\nWorld\n\n\f",
     "simple4.pdf": "Text1\nText2\nText3\n\n\f",
+    "simple5.pdf": "Heading\n\n"
+                   "Link to heading that is working with vim-pandoc.\n\n"
+                   "Link to heading “that is” not working with vim-pandoc.\n\n"
+                   "Subheading\n\nSome “more text”\n\n1\n\n\f",
     "zen_of_python_corrupted.pdf": "Mai 30, 18 13:27\n\nzen_of_python.txt",
     "contrib/issue_566_test_1.pdf": "ISSUE Date:2019-4-25 Buyer:黎荣",
     "contrib/issue_566_test_2.pdf": "甲方:中国饮料有限公司(盖章)",
@@ -64,6 +68,11 @@ def test_simple4_with_string(self):
         s = run_with_string(test_file)
         self.assertEqual(s, test_strings[test_file])

+    def test_simple5_with_string(self):
+        test_file = "simple5.pdf"
+        s = run_with_string(test_file)
+        self.assertEqual(s, test_strings[test_file])
+
     def test_simple1_with_file(self):
         test_file = "simple1.pdf"
         s = run_with_file(test_file)
@@ -84,6 +93,11 @@ def test_simple4_with_file(self):
         s = run_with_file(test_file)
         self.assertEqual(s, test_strings[test_file])

+    def test_simple5_with_file(self):
+        test_file = "simple5.pdf"
+        s = run_with_file(test_file)
+        self.assertEqual(s, test_strings[test_file])
+
     def test_zlib_corrupted(self):
         test_file = "zen_of_python_corrupted.pdf"
         s = run_with_file(test_file)
