Not all images exported when using "--extract-media" with some .docx files #7786

Closed
maxklebron opened this issue Dec 30, 2021 · 2 comments

@maxklebron

Issue Description
When testing pandoc v2.16.2 on Windows, I observed that some .docx files do not have all of their images exported when using the --extract-media command-line option (when converting to Markdown). Only a subset of the images is exported. For example, for a simple test .docx file (attached to this issue) only 1 image is exported, although the document contains several.

Details to recreate
This can be recreated by running the following command using v2.16.2 of pandoc on Windows Server 2012 R2:

pandoc.exe -s --extract-media . -o out.md test.docx

The sample .docx file to recreate this is available here: test.docx.zip

Further details

  • Original .docx (saved from MS Word v16.56 on OS X)

    [Screenshot 2021-12-30 at 23 27 28]
  • Generated .md file (from pandoc)

    [Screenshot 2021-12-30 at 23 33 23]
  • Unzipped .docx file directory structure (with 10 media files visible)

    [Screenshot 2021-12-30 at 23 32 10]
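
For anyone who wants to check the media count without unzipping by hand: a .docx file is a zip archive, and embedded images normally live under word/media/. The sketch below lists those entries using only the Python standard library. The helper names `list_docx_media` and `build_fake_docx` are made up for illustration, and the demo archive is synthetic rather than the attached test.docx.

```python
import io
import zipfile


def list_docx_media(docx):
    """Return the sorted names of files stored under word/media/ in a .docx.

    `docx` may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as archive:
        return sorted(
            name
            for name in archive.namelist()
            if name.startswith("word/media/") and not name.endswith("/")
        )


def build_fake_docx(image_count):
    """Build an in-memory zip that mimics a .docx layout (synthetic data)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", "<document/>")
        for index in range(1, image_count + 1):
            archive.writestr(f"word/media/image{index}.png", b"placeholder")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    media = list_docx_media(build_fake_docx(10))
    print(len(media), "media files in the archive")
```

Pointing `list_docx_media` at the real file and comparing the result with the files pandoc wrote to the extraction directory should show which images were skipped.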
maxklebron added the bug label Dec 30, 2021
@jgm
Owner

jgm commented Dec 31, 2021

I think the issue here is that there are several pic:pic elements inside one w:drawing element; our parser is just taking the first...
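
To illustrate the difference between taking the first pic:pic and taking all of them, here is a minimal Python sketch using the standard library's ElementTree. It is not pandoc's reader (which is written in Haskell). The XML fragment is hand-written and heavily simplified (real documents nest pic:pic more deeply inside w:drawing), and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by WordprocessingML / DrawingML.
NS = {
    "w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
    "pic": "http://schemas.openxmlformats.org/drawingml/2006/picture",
    "r": "http://schemas.openxmlformats.org/officeDocument/2006/relationships",
}
R_EMBED = f"{{{NS['r']}}}embed"  # fully qualified name of the r:embed attribute

# Simplified, hand-written fragment: one w:drawing holding three pictures.
SAMPLE = f"""
<w:drawing xmlns:w="{NS['w']}" xmlns:a="{NS['a']}"
           xmlns:pic="{NS['pic']}" xmlns:r="{NS['r']}">
  <pic:pic><pic:blipFill><a:blip r:embed="rId4"/></pic:blipFill></pic:pic>
  <pic:pic><pic:blipFill><a:blip r:embed="rId5"/></pic:blipFill></pic:pic>
  <pic:pic><pic:blipFill><a:blip r:embed="rId6"/></pic:blipFill></pic:pic>
</w:drawing>
"""


def first_picture_id(drawing):
    """Return the relationship id of the first picture only, or None."""
    pic = drawing.find(".//pic:pic", NS)
    if pic is None:
        return None
    blip = pic.find(".//a:blip", NS)
    return None if blip is None else blip.get(R_EMBED)


def all_picture_ids(drawing):
    """Return the relationship ids of every picture inside the drawing."""
    ids = []
    for pic in drawing.iterfind(".//pic:pic", NS):
        blip = pic.find(".//a:blip", NS)
        if blip is not None and blip.get(R_EMBED) is not None:
            ids.append(blip.get(R_EMBED))
    return ids


if __name__ == "__main__":
    drawing = ET.fromstring(SAMPLE.strip())
    print("first only:", first_picture_id(drawing))
    print("all:", all_picture_ids(drawing))
```

A reader that behaves like `first_picture_id` would report one image per drawing, which matches the symptom described above.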

@maxklebron
Author

@jgm Thanks for the pointer on this.

I'm not very familiar with the underlying .docx markup, but from the WYSIWYG frontend of MS Word this seems like a fairly normal thing to do, so it could be a common scenario for .docx files.

How feasible do you think it might be to extend the parser to cope with multiple pic:pic elements?

jgm added a commit that referenced this issue Dec 31, 2021
...instead of ParPart.

Also remove NullParPart constructor, as it is no longer
needed.

This will allow us to handle elements that contain multiple
ParParts, e.g. w:drawing elements with multiple pic:pic.

See #7786.
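
The design idea in the commit message, as far as it can be read from the text above, is that a conversion step may yield several parts per element, and that an empty result can stand in for a dedicated "null" constructor. The Python sketch below is only a loose analogy of that idea, not pandoc's Haskell code. `Element`, `Picture` and `elem_to_parts` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Element:
    """Toy XML-like node (hypothetical, for illustration only)."""
    tag: str
    children: List["Element"] = field(default_factory=list)
    target: str = ""


@dataclass
class Picture:
    """Stands in for one parsed paragraph part."""
    target: str


def elem_to_parts(elem: Element) -> List[Picture]:
    """Convert an element into zero or more parts.

    Returning a list means a container with several pictures yields all
    of them, and an unsupported element yields [] instead of a special
    "null" value that callers would have to filter out.
    """
    if elem.tag == "pic":
        return [Picture(elem.target)]
    if elem.tag == "drawing":
        parts: List[Picture] = []
        for child in elem.children:
            parts.extend(elem_to_parts(child))
        return parts
    return []


if __name__ == "__main__":
    drawing = Element(
        "drawing",
        [Element("pic", target="a.png"),
         Element("unknown"),
         Element("pic", target="b.png")],
    )
    print(elem_to_parts(drawing))
```

Because every branch returns a list, callers can concatenate results without special-casing the "nothing here" case.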
jgm closed this as completed in 7ff1b79 Dec 31, 2021