jbig2 extractor does not seem to be working correctly #652

fgregg · 2021-07-31T18:39:29Z

The jb2 file that is extracted from a pdf with jbig2 encoded images does not seem correct. This includes the sample file

> pdf2txt.py --output-dir=. samples/contrib/pdf-with-jbig2.pdf
> ls *.jb2
XIPLAYER0.jb2
> jbig2dec -t pbm XIPLAYER0.jb2
jbig2dec WARNING OOB obtained when decoding symbol instance T coordinate (segment 0x03)
jbig2dec WARNING failed to decode text region image data (segment 0x03)
jbig2dec WARNING failed to decode; treating as end of file (segment 0x03)
> pnmtopng XIPLAYER0.pbm > XIPLAYER0.png

This produces a blank PNG file: XIPLAYER0.png

The original pdf file is okay, and we can extract the image using pdfimages.

> pdfimages -jbig2 samples/contrib/pdf-with-jbig2.pdf XIPLAYER0
> ls XIPLAYER0-000.*
XIPLAYER0-000.jb2e  XIPLAYER0-000.jb2g
> jbig2dec XIPLAYER0-000.jb2g XIPLAYER0-000.jb2e -t pnm
> pnmtopng XIPLAYER0-000.pbm > XIPLAYER0-000.png

This problem is not due to a mistranslation of @side2k's original PR.

I checked out their original Python 2.7 branch, and the jb2 file that they produce is exactly the same as what's on the HEAD of pdfminer.six's develop branch.

fgregg · 2021-07-31T20:45:42Z

Unfortunately, pdfimages (as part of poppler) is licensed under GPL 3, so we cannot read and translate their code and maintain this project's MIT license.

fgregg · 2021-08-01T17:49:19Z

Comparing the output of pdfimages, it looks like XIPLAYER0-000.jb2e and XIPLAYER0-000.jb2g are direct extractions from the pdf. This would be simple to reproduce.

For jbig2 images, we would produce two files, a jb2g and a jb2e. Then a user could use jbig2dec to convert these files, together, into a pbm or png file.

Obviously, it would be better if pdfminer could output a single jb2 file, but I don't know how to construct a valid file.

It seems better to output two files that can be used, versus one file that is invalid, as is the status quo.
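A minimal sketch of the two-file idea, assuming the globals and embedded stream bytes have already been pulled out of the pdf. The helper names `write_jbig2_pair` and `jbig2dec_command` are hypothetical and not part of pdfminer.six; the command simply mirrors the pdfimages example earlier in this thread.

```python
import tempfile
from pathlib import Path


def write_jbig2_pair(out_dir, name, globals_data, embedded_data):
    """Write the globals (.jb2g) and embedded (.jb2e) streams side by side."""
    out = Path(out_dir)
    globals_path = out / f"{name}.jb2g"
    embedded_path = out / f"{name}.jb2e"
    globals_path.write_bytes(globals_data)
    embedded_path.write_bytes(embedded_data)
    return globals_path, embedded_path


def jbig2dec_command(globals_path, embedded_path):
    """Build the jbig2dec call used in this thread: globals first, then page data."""
    return ["jbig2dec", str(globals_path), str(embedded_path), "-t", "pnm"]


# Placeholder bytes stand in for real stream data (assumption for illustration).
demo_dir = tempfile.mkdtemp()
g, e = write_jbig2_pair(demo_dir, "XIPLAYER0", b"globals", b"page")
print(jbig2dec_command(g, e))
```

The command list could be passed to `subprocess.run` on a machine where jbig2dec is installed.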

Thoughts @pietermarsman?

fgregg · 2021-08-02T03:03:26Z

I ended up figuring out how to put everything together in a single jb2 file, and have submitted a pull request. #653
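For readers wondering what "putting everything together" can involve, here is a rough sketch based on the JBIG2 file layout, not necessarily what #653 does. An embedded stream in a pdf lacks the file header and the end-of-page and end-of-file segments, so a standalone file can be assembled by adding them around the globals and page data. The function names are hypothetical. `next_segment` is an assumption: it must be a segment number not already used in the data, which a real writer would find by parsing the existing segment headers.

```python
import struct

# 8-byte ID string that opens a standalone JBIG2 file.
JBIG2_FILE_ID = b"\x97JB2\r\n\x1a\n"
FLAG_SEQUENTIAL = 0x01  # bit 0 set: sequential organisation; bit 1 clear: page count known
SEG_TYPE_END_OF_PAGE = 49
SEG_TYPE_END_OF_FILE = 51


def empty_segment(number, seg_type, page):
    """Header of a zero-length segment that refers to no other segments."""
    # Fields: segment number (4 bytes), flags holding the type (1),
    # referred-to segment count (1), page association (1), data length (4).
    return struct.pack(">LBBBL", number, seg_type, 0, page, 0)


def wrap_embedded_jbig2(globals_data, page_data, next_segment):
    """Assemble header + globals + page data + end-of-page + end-of-file."""
    # File header: ID string, flags byte, then a 4-byte page count of 1.
    header = JBIG2_FILE_ID + struct.pack(">BL", FLAG_SEQUENTIAL, 1)
    trailer = (
        empty_segment(next_segment, SEG_TYPE_END_OF_PAGE, 1)
        + empty_segment(next_segment + 1, SEG_TYPE_END_OF_FILE, 0)
    )
    return header + globals_data + page_data + trailer
```

The result could then be checked the same way as in the original report, with `jbig2dec -t pbm` on the written file.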

pietermarsman · 2022-01-23T20:26:41Z

@fgregg Thanks for figuring this out! Will review / merge the PR.

fgregg added a commit to datamade/pdfminer.six that referenced this issue Aug 2, 2021

Fixes jbig2 writer to write valid jb2 files

6c7ee43

- closes pdfminer#652

fgregg mentioned this issue Aug 2, 2021

Fixes jbig2 writer to write valid jb2 files #653

Closed

pietermarsman closed this as completed in aa5dec2 Jan 23, 2022
