Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make PyMuPDF always log to stderr #877

Closed
apyrgio opened this issue Jul 25, 2024 · 0 comments · Fixed by #879
Closed

Make PyMuPDF always log to stderr #877

apyrgio opened this issue Jul 25, 2024 · 0 comments · Fixed by #879
Labels
bug Something isn't working container
Milestone

Comments

@apyrgio
Copy link
Contributor

apyrgio commented Jul 25, 2024

Problem

PyMuPDF has made the design decision to log all messages by default to stdout (see pymupdf/PyMuPDF#3135). This is a problem for us for two reasons:

First, in the document to pixels phase, we assume that the stdout of the conversion process contains only the pixels of the rasterized document (plus some page/size info). If a library that we use also logs to stdout, then this will corrupt our pixel stream, and our conversion will fail.

This is actually biting us in practice. We have a malformed file (no-pages-types.pdf) in our large test set that causes PyMuPDF to log this error to stdout:

MuPDF error: format error: non-page object in page tree

Tip

You can reproduce it with:

$ python3 -c 'import fitz; list(fitz.Document("no-pages-types.pdf").pages())' 2>/dev/null   

The error above is mixed with the pixel stream, and Dangerzone fails.

Second, in the pixels to PDF phase, we had a case (#700) where PyMuPDF was logging to stdout, from which we expect to read conversion progress reports in JSON format. We circumvented this using Python's contextlib.redirect_stdout(), but we lose error messages this way.

Solution

A solution to the above problem that works for all configurations is to take advantage of the environment variables that PyMuPDF devs have added to control logging. An important detail here is that we have to use these environment variables in our dangerzone.conversion.* modules, before we load the fitz module, which is ugly, but doable.

@apyrgio apyrgio added bug Something isn't working container labels Jul 25, 2024
apyrgio added a commit that referenced this issue Jul 25, 2024
PyMUPDF logs to stdout by default, which is problematic because we use
the stdout of the conversion process to read the pixel stream of a
document.

Make PyMuPDF always log to stderr, by setting the following environment
variables: PYMUPDF_MESSAGE and PYMUPDF_LOG.

Fixes #877
@eloquence eloquence added this to the 0.8.0 milestone Jul 25, 2024
@apyrgio apyrgio closed this as completed in 3f86e7b Aug 9, 2024
apyrgio added a commit that referenced this issue Oct 22, 2024
Add a doc that contains an MP4 video in it, which has an audio and video
stream. This type of document could not be converted with the latest
Dangerzone releases, because PyMuPDF threw this error in the container's
stdout:

    MuPDF error: unsupported error: cannot create appearance stream for
    Screen annotations

This error message was treated literally by our client code, which
parsed the first few bytes in order to find out the page height/width.
This resulted to a misleading Dangerzone error, e.g.:

    A page exceeded the maximum height

This issue started occurring since 0.6.0, which added streaming support,
and was fixed by commit 3f86e7b. That
fix was not accompanied by a test document that would ensure we would
not have this regression from now on, so we add it in this
commit.

Refs #877
Closes #917
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working container
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

2 participants