Make PyMuPDF always log to stderr #877

apyrgio · 2024-07-25T08:08:59Z

Problem

PyMuPDF has made the design decision to log all messages by default to stdout (see pymupdf/PyMuPDF#3135). This is a problem for us for two reasons:

First, in the document to pixels phase, we assume that the stdout of the conversion process contains only the pixels of the rasterized document (plus some page/size info). If a library that we use also logs to stdout, then this will corrupt our pixel stream, and our conversion will fail.

This is actually biting us in practice. We have a malformed file (no-pages-types.pdf) in our large test set that causes PyMuPDF to log this error to stdout:

MuPDF error: format error: non-page object in page tree

Tip

You can reproduce it with:

$ python3 -c 'import fitz; list(fitz.Document("no-pages-types.pdf").pages())' 2>/dev/null

The error above is mixed with the pixel stream, and Dangerzone fails.

Second, in the pixels to PDF phase, we had a case (#700) where PyMuPDF was logging to stdout, from which we expect to read conversion progress reports in JSON format. We circumvented this using Python's contextlib.redirect_stdout(), but we lose error messages this way.

Solution

A solution to the above problem that works for all configurations is to take advantage of the environment variables that PyMuPDF devs have added to control logging. An important detail here is that we have to use these environment variables in our dangerzone.conversion.* modules, before we load the fitz module, which is ugly, but doable.

The text was updated successfully, but these errors were encountered:

PyMUPDF logs to stdout by default, which is problematic because we use the stdout of the conversion process to read the pixel stream of a document. Make PyMuPDF always log to stderr, by setting the following environment variables: PYMUPDF_MESSAGE and PYMUPDF_LOG. Fixes #877

Add a doc that contains an MP4 video in it, which has an audio and video stream. This type of document could not be converted with the latest Dangerzone releases, because PyMuPDF threw this error in the container's stdout: MuPDF error: unsupported error: cannot create appearance stream for Screen annotations This error message was treated literally by our client code, which parsed the first few bytes in order to find out the page height/width. This resulted to a misleading Dangerzone error, e.g.: A page exceeded the maximum height This issue started occurring since 0.6.0, which added streaming support, and was fixed by commit 3f86e7b. That fix was not accompanied by a test document that would ensure we would not have this regression from now on, so we add it in this commit. Refs #877 Closes #917

apyrgio added bug Something isn't working container labels Jul 25, 2024

apyrgio mentioned this issue Jul 25, 2024

Make PyMuPDF always log to stderr #879

Merged

eloquence added this to the 0.8.0 milestone Jul 25, 2024

apyrgio closed this as completed in 3f86e7b Aug 9, 2024

apyrgio mentioned this issue Oct 22, 2024

tests: Add a doc with multimedia elements #966

Merged

apyrgio mentioned this issue Oct 30, 2024

Prepare the CHANGELOG for 0.8.0 #976

Merged

harrislapiroff added this to Dangerzone ✨ Nov 20, 2024

harrislapiroff moved this to Done in Dangerzone ✨ Nov 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make PyMuPDF always log to stderr #877

Make PyMuPDF always log to stderr #877

apyrgio commented Jul 25, 2024

Make PyMuPDF always log to stderr #877

Make PyMuPDF always log to stderr #877

Comments

apyrgio commented Jul 25, 2024

Problem

Solution