jbig2 extractor does not seem to be working correctly #652

Closed
fgregg opened this issue Jul 31, 2021 · 4 comments

fgregg commented Jul 31, 2021

The jb2 file extracted from a PDF with JBIG2-encoded images does not seem to be correct. This includes the sample file:

> pdf2txt.py --output-dir=. samples/contrib/pdf-with-jbig2.pdf
> ls *.jb2
XIPLAYER0.jb2
> jbig2dec -t pbm XIPLAYER0.jb2
jbig2dec WARNING OOB obtained when decoding symbol instance T coordinate (segment 0x03)
jbig2dec WARNING failed to decode text region image data (segment 0x03)
jbig2dec WARNING failed to decode; treating as end of file (segment 0x03)
> pnmtopng XIPLAYER0.pbm > XIPLAYER0.png

This produces a blank PNG file: XIPLAYER0.png

The original pdf file is okay, and we can extract the image using pdfimages.

> pdfimages -jbig2 samples/contrib/pdf-with-jbig2.pdf XIPLAYER0
> ls XIPLAYER0-000.*
XIPLAYER0-000.jb2e  XIPLAYER0-000.jb2g
> jbig2dec XIPLAYER0-000.jb2g XIPLAYER0-000.jb2e -t pnm
> pnmtopng XIPLAYER0-000.pbm > XIPLAYER0-000.png

[image: XIPLAYER0-000]

This problem is not due to a mistranslation of @side2k's original PR.

I checked out their original Python 2.7 branch, and the jb2 file it produces is exactly the same as the one produced by the HEAD of pdfminer.six's develop branch.

fgregg commented Jul 31, 2021

Unfortunately, pdfimages (as part of poppler) is licensed under GPL 3, so we cannot read and translate their code and maintain this project's MIT license.

fgregg commented Aug 1, 2021

Comparing the output of pdfimages, it looks like XIPLAYER0-000.jb2e and XIPLAYER0-000.jb2g are direct extractions from the pdf. This would be simple to reproduce.

For jbig2 images, we would produce two files, a jb2g and a jb2e. A user could then pass both files together to jbig2dec to convert them into a pbm or png file.

Obviously, it would be better if pdfminer could output a single jb2 file, but I don't know how to construct a valid file.

It seems better to output two files that can be used, versus one file that is invalid, as is the status quo.

Thoughts @pietermarsman?

fgregg added a commit to datamade/pdfminer.six that referenced this issue Aug 2, 2021

fgregg commented Aug 2, 2021

I ended up figuring out how to put everything together in a single jb2 file, and have submitted a pull request: #653
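
For anyone following along, here is a rough sketch of what "putting everything together" can look like. It is not the code from #653. It assumes the sequential JBIG2 file layout as I understand it from the spec: a file header, then the global segments, then the page segments, then empty end-of-page and end-of-file segments. The names `build_jb2` and `empty_segment` are made up for illustration, and the caller has to supply a segment number that is larger than any number already used in the two streams (finding that number means parsing the segment headers, which is left out here).

```python
import struct

# 8-byte ID string that starts a standalone JBIG2 file.
JBIG2_FILE_ID = b"\x97JB2\r\n\x1a\n"

# File header flags: bit 0 set = sequential organisation,
# bit 1 clear = the number of pages is known and follows the flags.
FLAG_SEQUENTIAL = 0x01

SEG_TYPE_END_OF_PAGE = 49
SEG_TYPE_END_OF_FILE = 51


def empty_segment(number, seg_type, page):
    """Header of a zero-length segment that refers to no other segments.

    Layout: segment number (4 bytes), flags/type (1 byte),
    referred-to segment count (1 byte), page association (1 byte),
    data length (4 bytes).
    """
    return struct.pack(">IBBBI", number, seg_type, 0, page, 0)


def build_jb2(globals_data, page_data, next_segment_number):
    """Wrap a PDF's JBIG2Globals stream and image stream in a file header.

    Assumption: next_segment_number is greater than every segment
    number that appears in globals_data and page_data.
    """
    # ID string, flags, and a page count of 1.
    header = JBIG2_FILE_ID + struct.pack(">BI", FLAG_SEQUENTIAL, 1)
    end_of_page = empty_segment(next_segment_number, SEG_TYPE_END_OF_PAGE, page=1)
    end_of_file = empty_segment(next_segment_number + 1, SEG_TYPE_END_OF_FILE, page=0)
    return header + globals_data + page_data + end_of_page + end_of_file
```

With the two files from pdfimages, `globals_data` would be the bytes of XIPLAYER0-000.jb2g and `page_data` the bytes of XIPLAYER0-000.jb2e. I have not checked this sketch against every decoder, so treat it as an illustration of the layout rather than a tested writer.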

pietermarsman commented

@fgregg Thanks for figuring this out! Will review / merge the PR.
