Basic via-CLI multi-document support #216

deeplow · 2022-09-27T11:00:18Z

Depends on PR #208

Adds basic bulk document support to the CLI and makes it so the GUI can be started with multiple files from the terminal.

Usage: dangerzone-cli [OPTIONS] doc1.pdf doc2.pdf

Bulk conversion in the CLI is implemented via threads (similar to the GUI implementation); the number of threads is bounded by the number of CPUs (2*CPU + 1)
Security: disallow interspersed args (to prevent maliciously named files from being interpreted as options)
Adds test cases for bulk document conversion
Deduplicates the stdout_callback code (the callback invoked when a container outputs a JSON line)
Logs the document id (based on argument order), so we can distinguish the output of each document:

     $ dangerzone-cli document-a.pdf document-b.pdf
 
     [doc 1] 3% Separating document into pages
     [doc 2] 0% Converting to PDF using LibreOffice
     [doc 1] 5% Converting page 1/4 to pixels
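A minimal sketch of the approach described above, not the PR's actual code: a thread pool capped at 2*CPU + 1 workers, with each output line prefixed by a document id taken from argument order. The names `convert_one` and `bulk_convert` are hypothetical, and the conversion itself is stubbed out.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import List


def convert_one(doc_id: int, path: str) -> str:
    """Hypothetical stand-in for a real conversion; returns one log line."""
    return f"[doc {doc_id}] 100% converted {path}"


def bulk_convert(paths: List[str]) -> List[str]:
    # Upper bound on worker threads, as described above: 2 * CPU + 1.
    max_workers = 2 * (os.cpu_count() or 1) + 1
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Document ids follow argument order, starting at 1.
        doc_ids = range(1, len(paths) + 1)
        # Executor.map() returns results in the order the inputs were given.
        return list(pool.map(convert_one, doc_ids, paths))


if __name__ == "__main__":
    for line in bulk_convert(["document-a.pdf", "document-b.pdf"]):
        print(line)
```

Because the id is derived from the position in the argument list, it stays stable regardless of which thread finishes first.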

deeplow · 2022-10-18T10:24:00Z

The one missing thing is to increase the document conversion timeout in proportion to the number of documents. But doing that will be much easier after all the timeout variables are centralized in one place. This is already done with #167. We just have to merge it.

deeplow · 2022-10-27T13:26:58Z

Now that #208 is merged, this is ready for review @apyrgio.

apyrgio · 2022-10-31T14:02:06Z

One more suggestion. I think we should update the Changelog, to reflect that we now have support for multiple doc conversion via the CLI.

apyrgio · 2022-10-31T14:04:39Z

@deeplow I just finished my first round of comments. The code looks fine, tests pass, and I gave it a whirl as well. Hit me up with your feedback on the comments, once you have time.

apyrgio · 2022-11-07T14:00:59Z

dangerzone/document.py

+        # set the default output filename as soon as we know the input filename
+        self.output_filename = (
+            f"{os.path.splitext(self.input_filename)[0]}{SAFE_EXTENSION}"
+        )


This scares me a bit because if we do self.input_filename = ... after self.output_filename = ..., we will overwrite the output filename. It's best to have this as part of def output_filename() instead:

     @property
     def output_filename(self) -> str:
         if self._output_filename is None:
             if self._input_filename is not None:
                 # Basically repurpose set_default_output_filename()
                 return self._default_output_filename()
             else:
                 raise errors.NotSetOutputFilenameException()
         else:
             return self._output_filename
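For illustration, the property-based suggestion above can be fleshed out into a small self-contained class. This is a sketch under assumptions, not the code that was merged: the value of `SAFE_EXTENSION` is guessed (only the constant's name appears in the diff), the exception is defined locally rather than in an `errors` module, and the setters are hypothetical.

```python
import os
from typing import Optional

# Assumed value; only the constant's name appears in the diff above.
SAFE_EXTENSION = "-safe.pdf"


class NotSetOutputFilenameException(Exception):
    """Raised when neither an output nor an input filename is known."""


class Document:
    def __init__(self, input_filename: Optional[str] = None) -> None:
        self._input_filename = input_filename
        self._output_filename: Optional[str] = None

    @property
    def input_filename(self) -> Optional[str]:
        return self._input_filename

    @input_filename.setter
    def input_filename(self, filename: str) -> None:
        # Setting the input never touches an explicitly chosen output.
        self._input_filename = filename

    def _default_output_filename(self) -> str:
        stem = os.path.splitext(self._input_filename)[0]
        return f"{stem}{SAFE_EXTENSION}"

    @property
    def output_filename(self) -> str:
        if self._output_filename is None:
            if self._input_filename is not None:
                # Derive the default lazily, on every read.
                return self._default_output_filename()
            raise NotSetOutputFilenameException()
        return self._output_filename

    @output_filename.setter
    def output_filename(self, filename: str) -> None:
        self._output_filename = filename
```

With the default computed on read, the order in which the two filenames are assigned no longer matters: an explicit output filename always wins, and the derived default tracks the current input filename.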

apyrgio · 2022-11-07T14:52:25Z

@deeplow: Done with the second round of review. I have some minor to moderate review comments, which I can try implementing myself, if you don't have the time. Else, I'll wait for you to respond.

apyrgio · 2022-11-08T14:37:13Z

Actually, I'll just go ahead and implement some of the comments in a separate branch, and you can cherry-pick what you prefer.

apyrgio · 2022-11-08T17:47:14Z

I did some crude fixes in 209-bulk-doc-cli-support-2. Cherry-pick to your branch whatever you prefer.

deeplow · 2022-11-09T19:02:55Z

Pushed your changes here from the 209-bulk-doc-cli-support-2 branch. Everything is mostly fine and I just wanted to add a small comment or two.

deeplow · 2022-11-09T19:05:49Z

tests/test_cli.py

+        msg = "Security: Detected CLI options that are also"
+
+        @contextlib.contextmanager
+        def temp_file(file):


Why is there a need for this complex temp_file creation logic and not something as simple as:

     file_path = tmp_path / "--help"
     file_path.touch()

Since tmp_path is already a contextlib-managed directory, I assume everything under it will go away after the test. Or am I overlooking something?

The idea here is that we create some files in a directory, we run the CLI command, and we expect it to fail with a "Security: ..." message. There are various cases that we want to cover, so we do this multiple times.

If we don't remove the temp files before the next run_cli() invocation, the next command will fail, but we can't be sure if the files from the previous invocation triggered the failure.

I agree that the context manager is a bit heavy on the eyes, so we could just create a subdir for each run_cli() invocation instead.

I'll push a fixup commit for that, so you can evaluate it.

How about splitting these into separate tests? That would remove the need for the context manager.

Or maybe a pytest fixture that returns a pair: a tempdir and a file within it

It would add a bit more boilerplate stuff, but I don't mind much.

I sent a fixup for that as well.
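The "one subdirectory per invocation" idea from this thread could look roughly like the following. It is only a sketch using the standard library: `make_case_dir` is a hypothetical helper, the real tests rely on pytest's `tmp_path` and on `run_cli()`, neither of which is reproduced here, and `-x` is an arbitrary option-like name.

```python
import tempfile
from pathlib import Path


def make_case_dir(base: Path, case_name: str, filename: str) -> Path:
    """Create base/case_name/ holding a single empty file; return the directory."""
    case_dir = base / case_name
    case_dir.mkdir()
    (case_dir / filename).touch()
    return case_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        # Each case gets its own directory, so a file left over from one
        # case cannot be the reason the next invocation fails.
        for index, name in enumerate(["--help", "-x"], start=1):
            case_dir = make_case_dir(base, f"case-{index}", name)
            print(case_dir, [p.name for p in case_dir.iterdir()])
```

Isolating each case this way keeps the cleanup implicit, since everything disappears together with the parent temporary directory.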

deeplow · 2022-11-10T14:59:15Z

All seems ready. Shall I merge it?

apyrgio · 2022-11-10T15:03:43Z

Sure, go ahead!

Basic implementation of bulk document support in dangerzone-cli. Usage: dangerzone-cli [OPTIONS] doc1.pdf doc2.pdf

Wildcard arguments like `*` can lead to security vulnerabilities if files are maliciously named as would-be parameters. For example, if a file in the current directory was named '--help', running the following command would show the help:

     $ dangerzone-cli *

By checking if parameters also happen to be files, we mitigate this risk and have a chance to warn the user.
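As a rough illustration of that check (a sketch, not the project's implementation), one can scan the raw arguments for anything that looks like an option and also exists as a file. The function name `option_like_files` and the `cwd` parameter are hypothetical.

```python
import os
import tempfile
from typing import List


def option_like_files(args: List[str], cwd: str = ".") -> List[str]:
    """Return the arguments that start with '-' and also name an existing file."""
    return [
        arg
        for arg in args
        if arg.startswith("-") and os.path.isfile(os.path.join(cwd, arg))
    ]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Simulate a maliciously named file that a shell wildcard would expand.
        open(os.path.join(tmp, "--help"), "w").close()
        suspicious = option_like_files(["--help", "doc1.pdf"], cwd=tmp)
        if suspicious:
            print(f"Refusing to continue, options that are also files: {suspicious}")
```

A caller would typically abort with a warning when the returned list is non-empty, since it cannot tell whether the user meant the option or the file.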

Initial parallel document conversion: creates a pool of N threads defined by the setting 'parallel_conversions'. Each thread calls convert() on a document.

The container output logging logic was duplicated in the CLI and the GUI. This change moves the core parsing logic to container.py. Since the code was largely the same, the CLI no longer needs to specify a stdout_callback, because all the necessary logging already happens. The GUI now only adds a stdout_callback to detect if there was an error during the conversion process.
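A sketch of what such shared parsing might look like follows. The JSON field names (`error`, `text`, `percentage`), the callback signature, and the function name `parse_progress` are all assumptions made for illustration; the actual format emitted by the container is not shown in this conversation.

```python
import json
from typing import Callable, Optional


def parse_progress(
    doc_id: int,
    line: str,
    stdout_callback: Optional[Callable[[bool, str, int], None]] = None,
) -> str:
    """Parse one JSON line from the container and return a log message."""
    try:
        status = json.loads(line)
        # Assumed schema: {"error": bool, "text": str, "percentage": int}
        error = bool(status["error"])
        text = str(status["text"])
        percentage = int(status["percentage"])
    except (ValueError, KeyError, TypeError):
        return f"[doc {doc_id}] invalid JSON returned from container: {line!r}"

    # Frontends that need more than logging (e.g. error detection) hook in here.
    if stdout_callback is not None:
        stdout_callback(error, text, percentage)

    prefix = "ERROR " if error else ""
    return f"[doc {doc_id}] {prefix}{percentage}% {text}"
```

In this arrangement a frontend that only needs log output passes no callback at all, while one that must react to errors supplies a small function.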

With multiple input documents it is possible only one of them has issues. Mentioning the document id can help debug.

The document's state is better updated in the convert() function, because this function is always called during conversion regardless of the frontend.

All filename-related exceptions were of class DocumentFilenameException. This made it difficult to disambiguate them. Specializing them makes it easier for tests to detect which exception in particular we want to verify.
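To illustrate the idea only: in the sketch below, `DocumentFilenameException` and `NotSetOutputFilenameException` are names that appear in this conversation, while `InputFileNotFoundException` and `check_input` are hypothetical and exist purely for the example.

```python
import os


class DocumentFilenameException(Exception):
    """Base class for filename-related errors."""


class NotSetOutputFilenameException(DocumentFilenameException):
    """No output filename is known."""


class InputFileNotFoundException(DocumentFilenameException):
    """Hypothetical subclass: the input file does not exist."""


def check_input(path: str) -> None:
    # Raise the most specific exception so callers and tests can tell cases apart.
    if not os.path.isfile(path):
        raise InputFileNotFoundException(path)
```

A test can then assert on the precise subclass, while existing code that catches the base class keeps working.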

Checking if files were writeable created files in the process. In the case where someone adds a list of N files to dangerzone but exits before converting, they would be left with N 0-byte files for the -safe version. The check no longer creates files. Fixes #214
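One way to test writability without leaving empty files behind is to check permissions on the target directory instead of opening the output file. This is a sketch of that general technique, not necessarily what the fix does; `output_dir_writable` is a hypothetical name.

```python
import os
import tempfile


def output_dir_writable(output_filename: str) -> bool:
    """Check that the output's parent directory is writable, creating nothing."""
    parent = os.path.dirname(os.path.abspath(output_filename))
    return os.access(parent, os.W_OK)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "document-safe.pdf")
        print(output_dir_writable(target))  # permission check only
        print(os.path.exists(target))       # the file was never created
```

Note that `os.access` is only advisory: permissions can change between the check and the actual write, so the write itself still needs error handling.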

deeplow · 2022-11-14T11:19:51Z

Rebased and squashed FIXUP commits. Merging now with main.

deeplow marked this pull request as draft September 27, 2022 11:00

deeplow self-assigned this Sep 27, 2022

deeplow force-pushed the 209-bulk-doc-cli-support branch 3 times, most recently from 296662e to bdc8aee, October 3, 2022 18:06

deeplow mentioned this pull request Oct 6, 2022

rename "common" to "document" & simplify "new window" window logic #208

Merged

deeplow force-pushed the 209-bulk-doc-cli-support branch 5 times, most recently from a899d13 to df6963e, October 13, 2022 14:01

deeplow commented Oct 17, 2022

dangerzone/cli.py

deeplow force-pushed the 209-bulk-doc-cli-support branch 6 times, most recently from 67e0d00 to 385ddf4, October 27, 2022 13:25

deeplow marked this pull request as ready for review October 27, 2022 13:26

deeplow requested a review from apyrgio October 27, 2022 13:26

apyrgio reviewed Oct 27, 2022

dangerzone/logic.py

apyrgio reviewed Oct 27, 2022

dangerzone/cli.py

apyrgio reviewed Oct 27, 2022

dangerzone/document.py

apyrgio reviewed Oct 31, 2022

dangerzone/logic.py

apyrgio reviewed Oct 31, 2022

dangerzone/logic.py

apyrgio reviewed Oct 31, 2022

dangerzone/document.py