Add support for HWP/HWPX document formats #460
I think the commits `Add h2orestart on Dockerfile` and `Add hwp hwpx support` can just be squashed. |
Closes #243 |
Thank you very much for this contribution @OctopusET! You've done lots of research which makes the problem area much clearer.
I have commented on the code, but I also have some general comments:
- From the links you sent in the PR, it seems that the `.hwp{x}` files may not be initially present in the system. The real problem is `.lnk` files, which execute arbitrary code and may open benign `.hwp{x}` files as a decoy. In this PR, we will protect users against directly opening a malicious `.hwp{x}` document, correct? Did you perhaps have a suggestion for tackling `.lnk` files (aside from "please, don't open them" :-))?
- Is there any LibreOffice hardening that people employ when opening an `.hwp{x}` file? Some sort of extension sandboxing, for instance?
Also, thanks a lot for providing sample files. I'll test your implementation soon and try to trace the source of errors.
Thanks as well for the contribution, this seems like it could be impactful indeed. However, I was unable to use this extension to convert the demo file you provided. It failed in the
@OctopusET if you run the same command, does it work for you? |
This command doesn't work for me either; even the .pdf file doesn't work. @deeplow |
Strange. At least our results are consistent. But if I'm not mistaken, that's the command that Dangerzone is running under the hood to do the conversion. To make this work, I'd say we first must be able to call that command successfully. Later I will try manually installing LibreOffice and that extension in a disposable VM and see if I can export the file as PDF through the graphical version of LibreOffice. |
My bad. There were several issues with that command. One was a typo in the command: it was creating a directory |
Oh, I missed that too. Thank you! @deeplow This is the command I tried, and the conversion works great.
|
I have figured it out 🥳. Basically the extension appears to be relying on the file extension and not the mime type. So in our case, the input file is in
To fix this, all we have to do is to rename the file as
Applying this patch to the code will make the full conversion work. But this is a temporary hack; we have to think of a more permanent solution, potentially opening an issue upstream to use MIME types instead.
--- a/dangerzone/conversion/doc_to_pixels.py
+++ b/dangerzone/conversion/doc_to_pixels.py
@@ -142,6 +142,7 @@ class DocumentToPixels(DangerzoneConverter):
pdf_filename = "/tmp/input_file"
elif conversion["type"] == "libreoffice":
self.update_progress("Converting to PDF using LibreOffice")
+ shutil.copy('/tmp/input_file', '/tmp/input_file.hwp')
args = [
"libreoffice",
"--headless",
@@ -150,8 +151,9 @@ class DocumentToPixels(DangerzoneConverter):
"pdf",
"--outdir",
"/tmp",
- "/tmp/input_file",
+ "/tmp/input_file.hwp",
] |
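The workaround in the patch above can be sketched in isolation as follows. This is a minimal sketch under stated assumptions, not Dangerzone's actual code: the helper names `copy_with_extension` and `libreoffice_pdf_args` are hypothetical, and `--convert-to` is assumed to be the flag sitting in the part of the argument list that the diff elides.

```python
import shutil
from pathlib import Path


def copy_with_extension(src: str, extension: str) -> Path:
    """Copy `src` to a sibling path that ends in `extension` (e.g. ".hwp").

    The extension appears to pick its import filter from the file name,
    so the copy gives it the hint that the extension-less original lacks.
    """
    dst = Path(src + extension)
    shutil.copy(src, dst)
    return dst


def libreoffice_pdf_args(input_path: Path, outdir: str) -> list:
    """Build a headless LibreOffice command line that converts to PDF."""
    return [
        "libreoffice",
        "--headless",
        "--convert-to",  # assumption: hidden by the elided diff context
        "pdf",
        "--outdir",
        outdir,
        str(input_path),
    ]
```

The command list would then be handed to something like `subprocess.run(args, check=True)` inside the container; that call is left out here so the sketch has no dependency on LibreOffice being installed.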
As a security-focused project, this requires some adversarial skepticism. One assumption I have is the possibility of a malicious I have now implemented a proof of concept of dynamically loading libreoffice extensions. This reduces the damage that a potentially malicious |
Great work! Thank you so much. I think your work should also be included in this PR. I'm writing an answer to @apyrgio's first comment. I am looking for some cases and will comment soon. Before that, @deeplow, what do you think about the binary header? I'm not sure about hwpx, but hwp has a unique format header. Should it also be used for detecting hwp files? I will start writing a PR for h2orestart soon. |
I just found out that ClamAV also supports hwp/hwpx (including very old versions). https://blog.clamav.net/2016/03/clamav-0991-hangul-word-processor-hwp.html This doesn't seem to rely on Hancom's published format documentation (I'm not sure about its license); it seems to be Cisco's own reverse-engineering work. I guess this could also be used for better file detection. |
Nice! I have opened an issue upstream ebandal/H2Orestart#7
That's what mime is for 🙂. The mime type should be detected by the mime library (at least in my case it was). If for some reason that doesn't happen, then that's an upstream bug in either https://github.com/file/file or https://github.com/python/cpython/blob/3.11/Lib/mimetypes.py, the two mimetype libraries that we use. From your comment here you seem to imply that there are issues with detecting the file types, but I'm skeptical about adding any extra dependencies or writing custom code just for this specific bit. I think solutions for better file detection have to be made upstream in the mimetype detection libs, as I stated above. |
OK. If you don't mind, I'll push it here. |
Great, go ahead |
The other concern (which is a big one) is the file size of this dependency. Since we ship the container image, including another 80MB package adds that much size for everyone who uses Dangerzone. @apyrgio may have an idea how to solve this. |
Yeah I checked that too, thank you.
Cool, I agree. Still, I want to mention that the hwp/hwpx formats are quite messed up: they use some 'official' MIME types even though these are not standard MIME types. I think this won't help that much, but here's a Python library for hwp. It also supports PDF conversion, but it's unstable and hasn't been maintained for a long time. So I think sticking with h2orestart would be better. |
@deeplow Hmm, I think font support is necessary. When I test with the latest main branch of dangerzone, all I get is tofu. Have there been any issues reported by users of non-Latin scripts before? I think the best solution would be shipping only the Regular variant of the font, or adding a download option. If that's possible, I think the OCR data could be downloaded the same way.
'hwp' files (Hancom Office) are read in LibreOffice through the H2Orestart extension. However, the extension doesn't guess the file type based on the file's contents, so it has to infer it from the file extension [1]. An upstream bug has been reported. [1]: freedomofpress#460 (comment)
Apparently I can't push to this branch. @OctopusET is there an option that you can check for allowing contributions from maintainers on the branch? It should have shown up when you created the PR, I think. Either that, or merge this branch onto yours: https://github.com/deeplow/dangerzone/tree/hwp-support-dynamic-extension |
We haven't heard of any issues, but it could be because we may not have many users in the region due to the lack of support. So I'm interested in tackling this issue. But I want to go about it in a way that scales to supporting more languages without significantly increasing the container size. I'll have to read up some more on font support. But I'd say it's its own issue. Note: I'm out of time this week, so I'll be back on Monday to continue this conversation |
Hmm, I did it. I think GitHub thinks you don't have write permission for this repo.
Okay, I will investigate further and reply to the first comment this weekend. |
I just pushed a lint fix, but I'd say overall the only thing left is changing it to also work on Qubes. I can work on this over the next few days. Anything else that you think is missing, @OctopusET? UPDATE: and also some tests |
And I need to clean up the MIME types. I'm working on it right now. |
If you're fine with doing it. If not, we can polish that prior to merging.
Ideally yes, but I don't understand very much how those libraries are chained. But if it does, then it's good :) if not, at least we have something that can detect that edge-case. As a matter of fact, this is not the first time that an office file is simply detected as |
@OctopusET for the mimetypes that you just commented, can you please add there as a comment a link to this discussion? That way in the future it can be immediately apparent why that is commented out. |
That's absolutely right. I'll add it. |
No worries, feel free to edit as is the most comfortable to you. We can always polish it on our end after you're done with all the changes. |
@OctopusET just decided to push the candidate release one week, so we'll be doing feature freeze on monday instead. So you still have some time to work on this if you want it included on the next release. Sorry for the change of plans. |
@deeplow No problem, I'm glad to have more time to improve the commit messages. I don't think any more feature updates are needed, but I'll look for more test cases if there are any. I will soon post the answer to the first comment that I have been writing, and I'll tell you when I think it's done. Thank you for all your great work! |
In the upstream, I will update that too. Still, I will left the There's might be a chance that it will be detected as I'm glad that it changed before the new version of Related issue: https://bugs.astron.com/view.php?id=467 |
Known MIME types of HWP/HWPX
HWP
HWPX
file
MIME types that the `file` command actually uses in upstream file/file@1fc9175
HWP
HWPX
|
Nice to see this. Thanks for keeping an eye on it. @OctopusET what's the current status of this PR. Is it ready for final review and merge? |
Yes! I think it's ready to be reviewed. |
H2Orestart is a LibreOffice extension which adds Hancom HWP/HWPX (Hangul Word Processor) support to LibreOffice. This format is widely used in South Korea.

Version: v0.5.7
Extension Repository: https://github.com/ebandal/H2Orestart/releases
hwp/hwpx has several custom MIME types:

.hwp:
- application/x-hwp
- application/haansofthwp
- application/vnd.hancom.hwp

.hwpx:
- application/haansofthwpx
- application/vnd.hancom.hwpx
- application/hwp+zip

Fixes #243
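One way to register these types is an explicit lookup table, which avoids wildcard matching such as `application/vnd.hancom.*`. The following is a minimal sketch, not the project's actual table: the `CONVERSIONS` name, the `libreoffice_ext` key and the `h2orestart.oxt` file name are assumptions made for illustration. Only the `"type": "libreoffice"` value is taken from the patch discussed earlier in this thread.

```python
from typing import Optional

HWP_MIME_TYPES = (
    "application/x-hwp",
    "application/haansofthwp",
    "application/vnd.hancom.hwp",
)
HWPX_MIME_TYPES = (
    "application/haansofthwpx",
    "application/vnd.hancom.hwpx",
    "application/hwp+zip",
)

# Hypothetical table: every listed MIME type maps to the same conversion
# settings. Each type is spelled out, so nothing is matched by wildcard.
CONVERSIONS = {
    mime: {"type": "libreoffice", "libreoffice_ext": "h2orestart.oxt"}
    for mime in HWP_MIME_TYPES + HWPX_MIME_TYPES
}


def conversion_for(mime_type: str) -> Optional[dict]:
    """Return the conversion settings for a MIME type, or None if unknown."""
    return CONVERSIONS.get(mime_type)
```

With this shape, an unlisted type such as `application/vnd.hancom.other` is simply not supported, rather than being accepted by a pattern.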
Only load the LibreOffice extension for opening hwp/hwpx when it is actually needed. Adding an extension to LibreOffice may allow it to run arbitrary code. Loading it on demand makes trust more scalable, since each LibreOffice extension is trusted only for the file types it targets.

Reasoning
---------
Assuming a malicious `.oxt` extension, the extension has arbitrary code execution in the container. While this is not an existential threat in itself, we should not expose every Dangerzone user to it. This is achieved by dynamically loading the extension at runtime only when needed. This ensures that a compromised extension will, in its least malicious form, be able to modify the visual content of any Hancom Office files but not *every file*. In the more malicious version, if the code execution manages to do a container escape, this will only affect users that have converted a Hancom Office file.
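The on-demand idea can be sketched as a function that yields an install command only for conversions that name an extension. This is a minimal sketch under stated assumptions: the `libreoffice_ext` key, the `/libreoffice_ext` directory and the function name are hypothetical, and installing through LibreOffice's `unopkg add --shared` is one possible mechanism, not necessarily the one this PR uses.

```python
from typing import List, Optional


def extension_install_command(
    conversion: dict, ext_dir: str = "/libreoffice_ext"
) -> Optional[List[str]]:
    """Return the command that installs the extension a conversion needs.

    Conversions without a "libreoffice_ext" entry (hypothetical key) get
    None, so the extension code is never loaded for other file types.
    """
    ext = conversion.get("libreoffice_ext")
    if ext is None:
        return None
    # Assumption: the .oxt file was shipped to `ext_dir` but not installed
    # when the container image was built.
    return ["unopkg", "add", "--shared", f"{ext_dir}/{ext}"]
```

For example, a plain `{"type": "libreoffice"}` conversion yields `None`, so only users converting a Hancom Office file ever run the extension's code.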
The HWPX MIME type is recognized as 'application/zip' by the current version of the file command (file-5.44). It will be recognized as 'application/hwp+zip' when a new version of file is released. As a temporary fix, when the MIME type of a file is 'application/zip', check the file type again (without the MIME option), and then check whether it is 'Zip data (MIME type "application/hwp+zip"?)' or not.
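The fallback described in this commit message can be sketched as a pure function over the two strings that `file` reports. This is a minimal sketch: the function name is hypothetical, and how the MIME type and the plain description are obtained (for example through a libmagic binding) is left to the caller.

```python
HWPX_DESCRIPTION = 'Zip data (MIME type "application/hwp+zip"?)'


def refine_zip_mime(mime_type: str, description: str) -> str:
    """Promote a generic zip MIME type to HWPX when the description says so.

    `mime_type` is what `file` reports with the MIME option, `description`
    is what it reports without it.
    """
    if mime_type == "application/zip" and description == HWPX_DESCRIPTION:
        return "application/hwp+zip"
    return mime_type
```

Any other MIME type, and any zip file with a different description, passes through unchanged, so the workaround only affects the one case the message describes.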
Use the MIME types actually used by the `file` command, which was recently changed for the detection of the HWPX format [1]:

application/hwp+zip -> application/x-hwp+zip

But the HWPX format includes a 'mimetype' file, which contains the MIME type string "application/hwp+zip", so that type was kept as well, because a file may still be detected as "application/hwp+zip".

[1]: file/file@ceef7ea
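The 'mimetype' entry mentioned here can be read with Python's standard `zipfile` module. This is a minimal sketch for illustration only, with a hypothetical function name; it is not how `file` or this PR performs detection.

```python
import io
import zipfile
from typing import Optional


def read_zip_mimetype_entry(data: bytes) -> Optional[str]:
    """Return the text of a zip archive's 'mimetype' entry, if it has one.

    Returns None when `data` is not a zip archive or has no such entry.
    """
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            raw = archive.read("mimetype")
    except (zipfile.BadZipFile, KeyError):
        return None
    return raw.decode("ascii", errors="replace").strip()
```

For an HWPX document the function would be expected to return `application/hwp+zip`, which is why that string is kept in the list even though `file` now reports `application/x-hwp+zip`.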
Add extra files and base64 encode externally contributed docs. This prevents the accidental opening of such documents, since they couldn't be rebuilt by the Dangerzone developers to ensure their safety.
Just rebased, polished some commit messages and reordered some commits (squashing some other unneeded ones). |
Looks good to me as well. Thanks a lot @OctopusET for this contribution :-).
I wrote this around the end of July 2023 and have only changed parts of it, so there will be some errors. But I'm leaving it here in case it is helpful to someone, and because I said I would answer the first comment. This is the answer to the first comment from @apyrgio #460 (review)
Hello @apyrgio, thank you again for the comment.
Honestly, I'm not sure. But since there are now some security improvements around the extension, I think it's better now, though it should still be verified. FYI, I didn't mention this article because it's in Korean, but recently there was an attack using hwp on macOS targeting a North Korea human rights activist in South Korea (https://www.genians.co.kr/blog/threat_intelligence_report_macos), using a method similar to the one you mentioned. You might be able to translate the PDF file to English. If you can't, or you don't want to use a proprietary PDF translator, I can translate it for you.
Unfortunately, AFAIK, there is no established hardening solution. This extension is very new and not popular yet, also because LibreOffice is not that popular in South Korea; people just use Hancom Office (or a pirated version) or other web solutions. There is almost no discussion on hardening HWP documents, since (almost) all hwp viewers and editors are proprietary software. There is also a lack of concern for security, as it's a format used almost exclusively in South Korea. Still, opening hwp documents in a browser could benefit from the browser sandbox.
It's not a web version, but still worth mentioning:
|
Thanks a lot for the detailed answer @OctopusET. It reinforces the point that Dangerzone will be helpful to South Korean users 👍 . |
HWP
HWP is the most popular document format in South Korea, aside from the controversy over its closed nature. Almost all South Korean government documents are written in hwp/hwpx format.
Test with these files
Sample hwp attachments are available here (On the attachments table)
https://bugs.documentfoundation.org/show_bug.cgi?id=144747
H2Orestart
This extension helps to convert the hwp/hwpx document files to other formats supported by LibreOffice.
The good news is that it was recently open-sourced. (That's why it's called 'restart')
It's a Java based LibreOffice plugin, so it needs a JRE. Fortunately, the dangerzone container image contains the java8 JRE, so there is no need to include another package in the image.
Problems with this PR
Errors
Update: Fixed! #460 (comment)
MIME types
HWP and HWPX use custom MIME types that are not registered with IANA. And each format has multiple MIME types, so they all need to be added. Some recommend `application/vnd.hancom.*`. But wildcards may not be supported in this code base and may lead to security problems.
- `application/x-hwp`, `application/haansofthwp`, `application/vnd.hancom.hwp`
- `application/haansofthwpx`, `application/vnd.hancom.hwpx`
Reference (in Korean)
CJK Fonts
I tried to convert some documents and all the Korean characters were rendered as 'tofu'. I installed `font-noto-cjk` and the tofu is all gone. I think it would be better to install `font-noto-cjk-extra` as well, just in case. https://pkgs.alpinelinux.org/package/edge/community/x86/font-noto-cjk-extra
Test needed on other systems
I haven't tested on macOS or Windows.
I think this support would be very helpful not only for journalists but also for many people who use hwp formats.
Additional information
HWPX
It's one of the South Korean standards, KS X 6101, Archive.
The actual name is OWPML (Open Word-Processor Markup Language); HWPX is Hancom's branding.
HWPX is quite a new format and is now being adopted. Hancom changed its default document format to .hwpx.
References (in Korean):
Security attack increase
Security attacks using HWP/HWPX formats have been around for a long time. They continue to grow and become more complex.
News:
Actually...
LibreOffice does support the hwp format, but only the very old HWP 3.0 version (released in 1997).
Other References
Fixes #468