DOCX reader discards figure caption (regression) #9610

Closed
frederik opened this issue Mar 27, 2024 · 4 comments

@frederik

Problem description

The latest pandoc release, 3.1.12.3 (and possibly 3.1.12.2), seems to drop figure captions when importing docx files.

pandoc captions.docx -o test.json

Latest known version where it works: 3.1.12.1

Reproduction

I am attaching a docx file and two JSON outputs, one from pandoc 3.1.12.1 and one from 3.1.12.3.

Here's the diff (only the caption is missing):

115,151d114
<         },
<         {
<             "t": "Para",
<             "c": [
<                 {
<                     "t": "Str",
<                     "c": "Figure"
<                 },
<                 {
<                     "t": "Space"
<                 },
<                 {
<                     "t": "Str",
<                     "c": "1"
<                 },
<                 {
<                     "t": "Space"
<                 },
<                 {
<                     "t": "Str",
<                     "c": "A"
<                 },
<                 {
<                     "t": "Space"
<                 },
<                 {
<                     "t": "Str",
<                     "c": "figure"
<                 },
<                 {
<                     "t": "Space"
<                 },
<                 {
<                     "t": "Str",
<                     "c": "caption"
<                 }
<             ]

caption.docx
test-latest.json
test-3-1-11.json
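
For anyone who wants to check their own documents, the comparison can be scripted instead of read off a raw diff. This is a minimal sketch, assuming two files produced with `pandoc -t json` by different pandoc versions. The helper names (`inline_text`, `para_texts`, `missing_paragraphs`, `compare_files`) are made up for illustration, and only top-level Para blocks with Str, Space and SoftBreak inlines are considered.

```python
import json

def inline_text(inlines):
    """Flatten pandoc inline nodes to plain text (Str, Space, SoftBreak only)."""
    parts = []
    for node in inlines:
        kind = node.get("t")
        if kind == "Str":
            parts.append(node["c"])
        elif kind in ("Space", "SoftBreak"):
            parts.append(" ")
        # Other inlines (Image, Emph, ...) are ignored in this sketch.
    return "".join(parts)

def para_texts(doc):
    """Text of every top-level Para block in a pandoc JSON document."""
    return [inline_text(b["c"]) for b in doc.get("blocks", []) if b.get("t") == "Para"]

def missing_paragraphs(old_doc, new_doc):
    """Paragraph texts present in old_doc but absent from new_doc."""
    new = set(para_texts(new_doc))
    return [text for text in para_texts(old_doc) if text not in new]

def compare_files(old_path, new_path):
    """Load two pandoc JSON files and list paragraphs lost in the newer one."""
    with open(old_path) as f_old, open(new_path) as f_new:
        return missing_paragraphs(json.load(f_old), json.load(f_new))
```

Calling `compare_files` on the JSON from the older and newer pandoc versions should list the caption paragraph if it is the only thing that disappeared.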

frederik added the bug label Mar 27, 2024
@jgm
Owner

jgm commented Mar 27, 2024

Probably due to one of these items from the .2 changelog:

  * Docx reader:

    + Ensure that table captions are counted (#9518).
    + Detect caption by style name not id (#9518).
      The styleId can change depending on the localization.
    + Avoid emitting empty paragraph where caption was.

@jgm
Owner

jgm commented Mar 27, 2024

Here's what I'm seeing:

% pandoc ~/Downloads/caption.docx -t native
[ Para
    [ Image
        ( ""
        , []
        , [ ( "width" , "6.268055555555556in" )
          , ( "height" , "2.5944444444444446in" )
          ]
        )
        [ Str "A"
        , Space
        , Str "screenshot"
        , Space
        , Str "of"
        , Space
        , Str "a"
        , Space
        , Str "web"
        , Space
        , Str "page"
        , SoftBreak
        , Str "Description"
        , Space
        , Str "automatically"
        , Space
        , Str "generated"
        ]
        ( "media/image1.png" , "" )
    ]
]

This is not being parsed as a Figure at all, which is a separate issue perhaps.

@jgm
Owner

jgm commented Mar 27, 2024

caption.docx has

<w:pStyle w:val="Caption"/>

then in doc.styles:

<w:style w:type="paragraph" w:styleId="Caption"><w:name w:val="caption"/>

so the style name is caption (lowercase), while the styleId is Caption.
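
The styleId-to-name mapping can be checked for any docx with a few lines of Python. This is a rough sketch, assuming the styles part lives at `word/styles.xml` inside the docx zip (the standard location); the function names `style_names` and `style_names_from_docx` are hypothetical helpers, not part of pandoc.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def style_names(styles_xml):
    """Map each w:styleId to the w:val of its w:name child."""
    root = ET.fromstring(styles_xml)
    names = {}
    for style in root.iter(f"{{{W}}}style"):
        style_id = style.get(f"{{{W}}}styleId")
        name = style.find(f"{{{W}}}name")
        if style_id is not None and name is not None:
            names[style_id] = name.get(f"{{{W}}}val")
    return names

def style_names_from_docx(path):
    """Read word/styles.xml from a docx file and map styleId to style name."""
    with zipfile.ZipFile(path) as docx:
        return style_names(docx.read("word/styles.xml"))

# Sample mirroring the style element quoted from caption.docx
SAMPLE = f'''<w:styles xmlns:w="{W}">
  <w:style w:type="paragraph" w:styleId="Caption"><w:name w:val="caption"/></w:style>
</w:styles>'''
print(style_names(SAMPLE))  # {'Caption': 'caption'}
```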

Note that changing the styleId to "ImageCaption" allows the caption to be parsed as a regular paragraph.

So, I think what is going on is this: Pandoc identifies paragraphs with style 'caption' as table captions. They are not emitted as regular paragraphs, but because we do not at this point have special handling for figures with captions, the result is that it gets dropped altogether.
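
To make the suspected failure mode concrete, here is a toy model of the behavior described above. It is not pandoc's code (the real reader is written in Haskell and structured differently); the tuple representation and the function name `convert` are invented for illustration, and only captions that precede a table are modeled.

```python
def convert(items):
    """Toy reader. Each input item is (kind, style_name, text)."""
    out = []
    pending_caption = None
    for kind, style, text in items:
        if kind == "para" and style == "caption":
            # Held back as a presumed table caption, not emitted as a Para
            pending_caption = text
        elif kind == "table":
            # A following table consumes the held caption
            out.append(("Table", pending_caption, text))
            pending_caption = None
        else:
            out.append(("Para", text))
    # A caption never claimed by a table is silently lost here
    return out

figure_doc = [("para", "Normal", "<image>"),
              ("para", "caption", "Figure 1 A figure caption")]
print(convert(figure_doc))  # [('Para', '<image>')]
```

In this model the caption only survives when a table follows it, which matches the symptom reported for image captions.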

Obviously not a great situation, but the fix would involve proper support for captioned images as Figure elements, which we've never had.

@jgm
Owner

jgm commented Jun 12, 2024

Notes:

see also #9391

Word represents captions with a p element either before or after the image or table. The caption paragraph has pPr <w:pStyle w:val="Caption"/>, and if it's before the item it captions, also <w:keepNext/>. Currently the reader seems to assume that all captions are table captions; that needs to be changed. In addition we need to fix the way the reader associates captions with Tables and Figures (see #9358).
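
The markup pattern described in these notes can be sketched as a small scan over `word/document.xml`. This is only an illustration under stated assumptions: the function name `caption_paragraphs` is hypothetical, the match is on the styleId "Caption" (which, per the changelog quoted earlier, can vary with localization, so a robust version would resolve the style name via styles.xml), and any `w:keepNext` element is treated as enabled without inspecting a possible `w:val` attribute.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def caption_paragraphs(document_xml, caption_style_id="Caption"):
    """Return (text, has_keep_next) for each paragraph using the caption style."""
    root = ET.fromstring(document_xml)
    results = []
    for p in root.iter(f"{{{W}}}p"):
        ppr = p.find(f"{{{W}}}pPr")
        if ppr is None:
            continue
        pstyle = ppr.find(f"{{{W}}}pStyle")
        if pstyle is None or pstyle.get(f"{{{W}}}val") != caption_style_id:
            continue
        # Concatenate all text runs in the paragraph
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        # keepNext present suggests the caption sits before the item it captions
        has_keep_next = ppr.find(f"{{{W}}}keepNext") is not None
        results.append((text, has_keep_next))
    return results

SAMPLE = f'''<w:document xmlns:w="{W}"><w:body>
  <w:p><w:pPr><w:pStyle w:val="Caption"/><w:keepNext/></w:pPr><w:r><w:t>Table 1 Before</w:t></w:r></w:p>
  <w:p><w:r><w:t>Body text</w:t></w:r></w:p>
  <w:p><w:pPr><w:pStyle w:val="Caption"/></w:pPr><w:r><w:t>Figure 1 After</w:t></w:r></w:p>
</w:body></w:document>'''
print(caption_paragraphs(SAMPLE))
# [('Table 1 Before', True), ('Figure 1 After', False)]
```

The flag alone does not say whether the captioned item is a table or an image; that still requires looking at the neighboring element.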

Pandoc's own docx writer uses ImageCaption and TableCaption classes. We should probably just use Caption to be more like Word.

@jgm jgm closed this as completed in 94975a4 Jun 13, 2024