Basic via-CLI multi-document support #216
296662e to bdc8aee
a899d13 to df6963e
The one missing thing is to increase the document conversion timeout in proportion to the number of documents. But doing that will be much easier once all the timeout variables are centralized in one place. This is already done in #167. We just have to merge it.
67e0d00 to 385ddf4
One more suggestion. I think we should update the Changelog to reflect that we now have support for multiple doc conversion via the CLI.
@deeplow I just finished my first round of comments. The code looks fine, tests pass, and I gave it a whirl as well. Hit me up with your feedback on the comments, once you have time.
dangerzone/document.py
Outdated
# set the default output filename as soon as we know the input filename
self.output_filename = (
    f"{os.path.splitext(self.input_filename)[0]}{SAFE_EXTENSION}"
)
This scares me a bit because if we do self.input_filename = ... after self.output_filename = ..., we will overwrite the output filename. It's best to have this as part of def output_filename() instead:
@property
def output_filename(self) -> str:
    if self._output_filename is None:
        if self._input_filename is not None:
            # Basically repurpose set_default_output_filename()
            return self._default_output_filename()
        else:
            raise errors.NotSetOutputFilenameException()
    else:
        return self._output_filename
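To make the ordering concern concrete, here is a minimal, self-contained sketch of the lazy-default idea. It is not the code from dangerzone/document.py: the Document class below is a stripped-down stand-in, the value of SAFE_EXTENSION is assumed for illustration, and NotSetOutputFilenameException is redefined locally instead of coming from the errors module.

```python
import os
from typing import Optional

SAFE_EXTENSION = "-safe.pdf"  # assumed value, for illustration only


class NotSetOutputFilenameException(Exception):
    """Local stand-in for errors.NotSetOutputFilenameException."""


class Document:
    """Stripped-down stand-in for the real Document class."""

    def __init__(self, input_filename: Optional[str] = None) -> None:
        self._input_filename = input_filename
        self._output_filename: Optional[str] = None

    @property
    def input_filename(self) -> Optional[str]:
        return self._input_filename

    @input_filename.setter
    def input_filename(self, filename: str) -> None:
        # Setting the input never touches an explicitly chosen output.
        self._input_filename = filename

    def _default_output_filename(self) -> str:
        base = os.path.splitext(self._input_filename)[0]
        return f"{base}{SAFE_EXTENSION}"

    @property
    def output_filename(self) -> str:
        # An explicit output wins; otherwise derive it lazily from the input.
        if self._output_filename is not None:
            return self._output_filename
        if self._input_filename is not None:
            return self._default_output_filename()
        raise NotSetOutputFilenameException()

    @output_filename.setter
    def output_filename(self, filename: str) -> None:
        self._output_filename = filename
```

Because the default is computed on read, the order in which the two setters are called no longer matters: an explicitly set output filename survives a later change of the input filename.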
@deeplow: Done with the second round of review. I have some minor to moderate review comments, which I can try implementing myself, if you don't have the time. Else, I'll wait for you to respond.
Actually, I'll just go ahead and implement some of the comments in a separate branch, and you can cherry-pick what you prefer.
I did some crude fixes in
0f4d498 to 90896d3
Pushed your changes here from the
90896d3 to e2aa7e6
tests/test_cli.py
Outdated
msg = "Security: Detected CLI options that are also"


@contextlib.contextmanager
def temp_file(file):
Why is there a need for this complex temp_file creation logic and not something as simple as:
file_path = tmp_path / "--help"
file_path.touch()
Since tmp_path is already a contextlib-managed directory, I assume everything under it will go away after the test. Or am I overlooking something?
The idea here is that we create some files in a directory, we run the CLI command, and we expect it to fail with a "Security: ..." message. There are various cases that we want to cover, so we do this multiple times.
If we don't remove the temp files before the next run_cli() invocation, the next command will fail, but we can't be sure if the files from the previous invocation triggered the failure.
I agree that the context manager is a bit heavy on the eyes, so we could just create a subdir for each run_cli() invocation instead.
I'll push a fixup commit for that, so you can evaluate it.
How about splitting these into separate tests? That would remove the need for the context manager.
Or maybe a pytest fixture that returns a pair: a tempdir and a file within it
It would add a bit more boilerplate stuff, but I don't mind much.
I sent a fixup for that as well.
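For readers following this thread, the "one subdirectory per case" idea could look roughly like the sketch below. It is not the fixup that was pushed: make_isolated_file is a hypothetical helper name, and it uses only the standard library so it runs outside pytest.

```python
import tempfile
from pathlib import Path
from typing import Tuple


def make_isolated_file(base_dir: Path, filename: str) -> Tuple[Path, Path]:
    """Create `filename` inside a fresh subdirectory of `base_dir`.

    Hypothetical helper: every call gets its own directory, so files left
    over from an earlier case can never influence a later CLI invocation.
    """
    case_dir = Path(tempfile.mkdtemp(dir=base_dir))
    file_path = case_dir / filename
    file_path.touch()
    return case_dir, file_path
```

In a pytest suite, the same helper could be wrapped in a fixture that passes tmp_path as base_dir and returns the (directory, file) pair suggested above. Each test case would then run the CLI from its own directory, and cleanup is left to whatever manages base_dir.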
All seems ready. Shall I merge it?
Sure, go ahead!
Basic implementation of bulk document support in dangerzone-cli. Usage: dangerzone-cli [OPTIONS] doc1.pdf doc2.pdf
Wildcard arguments like `*` can lead to security vulnerabilities if files are maliciously named as would-be parameters. In the following scenario, if a file in the current directory was named '--help', running the following command would show the help:
    $ dangerzone-cli *
By checking if parameters also happen to be files, we mitigate this risk and have a chance to warn the user.
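A rough sketch of the kind of check this commit message describes is shown below. It is an illustration under assumptions, not the code in this PR: options_that_are_files is a hypothetical name, and the real check may inspect the arguments differently.

```python
import os
from typing import List, Sequence


def options_that_are_files(args: Sequence[str], directory: str = ".") -> List[str]:
    """Return the option-looking arguments that also exist as files.

    Hypothetical helper: an argument such as '--help' that matches a file in
    `directory` was probably produced by shell wildcard expansion.
    """
    return [
        arg
        for arg in args
        if arg.startswith("-") and os.path.exists(os.path.join(directory, arg))
    ]
```

A caller could abort with a "Security: ..." message whenever the returned list is non-empty, which gives the user a chance to rerun the command with explicit file names.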
Initial parallel document conversion: creates a pool of N threads defined by the setting 'parallel_conversions'. Each thread calls convert() on a document.
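The pool-of-N-threads approach can be sketched with the standard library as follows. This is a simplified illustration: convert_all is a hypothetical function, documents are plain strings here, and convert stands in for whatever callable performs a single conversion.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def convert_all(
    documents: List[str],
    convert: Callable[[str], bool],
    parallel_conversions: int = 2,
) -> List[bool]:
    """Run `convert` on every document with at most N worker threads.

    Results come back in the same order as `documents`, regardless of
    which conversion finishes first.
    """
    with ThreadPoolExecutor(max_workers=parallel_conversions) as pool:
        return list(pool.map(convert, documents))
```

Threads are a reasonable fit if each conversion mostly waits on an external process, since the waiting threads do not compete for the interpreter.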
The container output logging logic was in both the CLI and the GUI. This change moves the core parsing logic to container.py. Since the code was largely the same, the CLI no longer needs to specify a stdout_callback, since all the necessary logging already happens. The GUI now only adds a stdout_callback to detect if there was an error during the conversion process.
With multiple input documents it is possible only one of them has issues. Mentioning the document id can help debug.
The document's state is better updated in the convert() function. This is because this function is always called for the conversion process, regardless of the frontend.
All filename-related exceptions were of class DocumentFilenameException. This made it difficult to disambiguate them. Specializing them makes it easier for tests to detect which exception in particular we want to verify.
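As an illustration of the specialization described here, the sketch below keeps DocumentFilenameException as a common base and adds subclasses under it. The subclass names and the validate_input_filename helper are hypothetical and may not match the ones introduced by this commit.

```python
import os


class DocumentFilenameException(Exception):
    """Base class for all filename-related errors."""


class InputFileNotFoundException(DocumentFilenameException):
    """Hypothetical subclass: the input path does not exist."""


class InputFileNotReadableException(DocumentFilenameException):
    """Hypothetical subclass: the input path exists but cannot be read."""


def validate_input_filename(filename: str) -> None:
    """Raise the most specific exception that applies to `filename`."""
    if not os.path.exists(filename):
        raise InputFileNotFoundException(filename)
    if not os.access(filename, os.R_OK):
        raise InputFileNotReadableException(filename)
```

A test can then assert on the precise subclass, while existing callers that catch DocumentFilenameException keep working unchanged.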
Checking if files were writeable created files in the process. In the case where someone adds a list of N files to dangerzone but exits before converting, they would be left with N 0-byte files for the -safe version. Now they don't. Fixes #214
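One way to check writability without leaving files behind is sketched below. This is an assumption about the general technique, not necessarily how the fix for #214 is implemented; output_is_writable is a hypothetical helper.

```python
import os


def output_is_writable(output_filename: str) -> bool:
    """Check whether `output_filename` could be written, without creating it.

    If the file already exists, test the file itself; otherwise test whether
    its parent directory exists and accepts new entries.
    """
    if os.path.exists(output_filename):
        return os.access(output_filename, os.W_OK)
    parent = os.path.dirname(os.path.abspath(output_filename))
    return os.path.isdir(parent) and os.access(parent, os.W_OK)
```

Note that os.access is only advisory: permissions can change between the check and the actual write, so the conversion step still has to handle a failed write.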
407e163 to 0b738ba
Rebased and squashed FIXUP commits. Merging now with main.
Depends on PR #208
Adds basic bulk document support to the CLI and makes it so the GUI can be started with multiple files from the terminal.
Usage:
dangerzone-cli [OPTIONS] doc1.pdf doc2.pdf
stdout_callback code (the one called when a container outputs a JSON line). So we can distinguish the output of each document: