Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Basic via-CLI mutli-document support #216

Merged
merged 10 commits into from
Nov 14, 2022
Merged

Basic via-CLI mutli-document support #216

merged 10 commits into from
Nov 14, 2022

Conversation

deeplow
Copy link
Contributor

@deeplow deeplow commented Sep 27, 2022

Depends on PR #208

Adds basic bulk document support to the CLI and makes it so the GUI can be started with multiple files from the terminal.

Usage: dangerzone-cli [OPTIONS] doc1.pdf doc2.pdf

  • bulk conversion in CLI implemented via thread (similar to the GUI implementation) and number of threads is bound by the number of CPUs (2*CPU +1)
  • Security: disallow interspersed args (to prevent maliciously named files from being interpreted as options)
  • Add test cases for bulk document conversion
  • Deduplicates stdout_callback code (the one called when a container outputs a json line)
  • Log document id (based on arg order)
    So we can distinguish the output of each document:
     $ dangerzone-cli document-a.pdf document-b.pdf
 
     [doc 1] 3% Separating document into pages
     [doc 2] 0% Converting to PDF using LibreOffice
     [doc 1] 5% Converting page 1/4 to pixels

@deeplow deeplow marked this pull request as draft September 27, 2022 11:00
@deeplow deeplow self-assigned this Sep 27, 2022
@deeplow deeplow force-pushed the 209-bulk-doc-cli-support branch 3 times, most recently from 296662e to bdc8aee Compare October 3, 2022 18:06
@deeplow deeplow force-pushed the 209-bulk-doc-cli-support branch 5 times, most recently from a899d13 to df6963e Compare October 13, 2022 14:01
dangerzone/cli.py Outdated Show resolved Hide resolved
@deeplow
Copy link
Contributor Author

deeplow commented Oct 18, 2022

The one missing thing is to increase the document conversion timeout proportional to the number of documents. But doing that will be much after all the timeout variables are centralized in one. This is already done with #167. We just have to merge it.

@deeplow deeplow force-pushed the 209-bulk-doc-cli-support branch 6 times, most recently from 67e0d00 to 385ddf4 Compare October 27, 2022 13:25
@deeplow deeplow marked this pull request as ready for review October 27, 2022 13:26
@deeplow deeplow requested a review from apyrgio October 27, 2022 13:26
@deeplow
Copy link
Contributor Author

deeplow commented Oct 27, 2022

Now that #208 is merged, this is ready for review @apyrgio.

dangerzone/logic.py Outdated Show resolved Hide resolved
dangerzone/cli.py Outdated Show resolved Hide resolved
dangerzone/document.py Outdated Show resolved Hide resolved
dangerzone/document.py Outdated Show resolved Hide resolved
@apyrgio
Copy link
Contributor

apyrgio commented Oct 31, 2022

One more suggestion. I think we should update the Changelog, to reflect that we now have support for multiple doc conversion via the CLI.

@apyrgio
Copy link
Contributor

apyrgio commented Oct 31, 2022

@deeplow I just finished my first round of comments. The code looks fine, tests pass, and I gave it a whirl as well. Hit me up with your feedback on the comments, once you have time.

dangerzone/args.py Outdated Show resolved Hide resolved
dangerzone/args.py Outdated Show resolved Hide resolved
dangerzone/document.py Outdated Show resolved Hide resolved
Comment on lines 82 to 85
# set the default output filename as soon as we know the input filename
self.output_filename = (
f"{os.path.splitext(self.input_filename)[0]}{SAFE_EXTENSION}"
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This scares me a bit because if we do self.input_filename = ... after self.output_filename = ..., we will overwrite the output filename. It's best to have this as part of def output_filename() instead:

      @property
      def output_filename(self) -> str:
          if self._output_filename is None:
              if self._input_filename is not None:
                  # Basically repurpose set_default_output_filename()
                  return self._default_output_filename()
              else:
                  raise errors.NotSetOutputFilenameException()
          else:
              return self._output_filename

dangerzone/document.py Outdated Show resolved Hide resolved
tests/test_cli.py Outdated Show resolved Hide resolved
@apyrgio
Copy link
Contributor

apyrgio commented Nov 7, 2022

@deeplow: Done with the second round of review. I have some minor to moderate review comments, which I can try implementing myself, if you don't have the time. Else, I'll wait for you to respond.

@apyrgio
Copy link
Contributor

apyrgio commented Nov 8, 2022

Actually, I'll just go ahead and implement some of the comments in a separate branch, and you can cherry-pick what you prefer.

@apyrgio
Copy link
Contributor

apyrgio commented Nov 8, 2022

Actually, I'll just go ahead and implement some of the comments in a separate branch, and you can cherry-pick what you prefer.

I did some crude fixes in 209-bulk-doc-cli-support-2. Cherry-pick to your branch whatever your prefer.

@deeplow deeplow force-pushed the 209-bulk-doc-cli-support branch from 0f4d498 to 90896d3 Compare November 9, 2022 18:59
@deeplow
Copy link
Contributor Author

deeplow commented Nov 9, 2022

Pushed your changes here from the 209-bulk-doc-cli-support-2 branch. Everything is mostly fine and I just wanted to add a small comment or two.

@deeplow deeplow force-pushed the 209-bulk-doc-cli-support branch from 90896d3 to e2aa7e6 Compare November 9, 2022 19:50
msg = "Security: Detected CLI options that are also"

@contextlib.contextmanager
def temp_file(file):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is there a need for this complex temp_file creation logic and not something as simple as:

file_path = tmp_path / "--help"
file_path.touch()

Since tmp_path is already a contextlib-managed directory, I assume everything under it will go away after the test. Or am I overlooking something?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea here is that we create some files in a directory, we run the CLI command, and we expect it to fail with a "Security: ..." message. There are various cases that we want to cover, so we do this multiple times.

If we don't remove the temp files before the next run_cli() invocation, the next command will fail, but we can't be sure if the files from the previous invocation triggered the failure.

I agree that the context manager is a bit heavy on the eyes, so we could just create a subdir for each run_cli() invocation instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll push a fixup commit for that, so you can evaluate it.

Copy link
Contributor Author

@deeplow deeplow Nov 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about splitting these these into separate tests? That would remove the need for the context manager

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe a pytest fixture that returns a pair: a tempdir and a file within it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would add a bit more boilerplate stuff, but I don't mind much.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I sent a fixup for that as well.

dangerzone/cli.py Outdated Show resolved Hide resolved
@deeplow
Copy link
Contributor Author

deeplow commented Nov 10, 2022

All seems ready. Shall I merge it?

@apyrgio
Copy link
Contributor

apyrgio commented Nov 10, 2022

Sure, go ahead!

Basic implementation of bulk document support in dangerzone-cli.

Usage: dangerzone-cli [OPTIONS] doc1.pdf doc2.pdf
Wildcard arguments like `*` can lead to security vulnerabilities
if files are maliciously named as would-be parameters. In the following
scenario if a file in the current directory was named '--help', running
the following command would show the help.

  $ dangerzone-cli *

By checking if parameters also happen to be files, we mitigate this
risk and have a chance to warn the user.
Initial parallel document conversion: creates a pool of N threads
defined by the setting 'parallel_conversions'. Each thread calls
convert() on a document.
The container output logging logic was in both the CLI and the GUI.
This change moves the core parsing logic to container.py.

Since the code was largely the same, now cli does need to specify
a stdout_callback since all the necessary logging already happens.

The GUI now only adds an stdout_callback to detect if there was an
error during the conversion process.
With multiple input documents it is possible only one of them has
issues. Mentioning the document id can help debug.
The document's state update is better update in the convert() function.
This is because this function is always called for the conversion
progress regardless of the frontend.
All filename-related exceptions were of class DocumentFilenameException.
This made it difficult to disambiguate them. Specializing them makes it
it easier for tests to detect which exception in particular we want to
verify.
Checking if files were writeable created files in the process. In the
case where someone adds a list of N files to dangerzone but exits before
converting, they would be left with N 0-byte files for the -safe
version. Now they don't.

Fixes #214
@deeplow deeplow force-pushed the 209-bulk-doc-cli-support branch from 407e163 to 0b738ba Compare November 14, 2022 11:01
@deeplow
Copy link
Contributor Author

deeplow commented Nov 14, 2022

Rebased and squashed FIXUP commits. Merging now with main.

@deeplow deeplow merged commit 0b738ba into main Nov 14, 2022
@deeplow deeplow deleted the 209-bulk-doc-cli-support branch May 11, 2023 13:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants