Feature Request: Support for East Asian Language Tags in DOCX Output #9817

TomBener · 2024-05-29T09:39:13Z

Description

When converting Markdown to DOCX using Pandoc, East Asian text (including Chinese, Japanese, and Korean) does not receive the XML tags that indicate its language. This leads to typographical issues, especially with punctuation marks such as quotes, which do not appear in full-width form as expected in East Asian text. For instance, Simplified Chinese quotes (“ ” ‘ ’) share the same Unicode code points as their Western counterparts but need to be displayed as full-width characters to align properly with Chinese text, as shown in the attached screenshot.
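The ambiguity described above can be checked with Python's standard `unicodedata` module. The sketch below only inspects the Unicode character database; the helper name `describe` is made up for illustration. The four curly quotes carry the East Asian Width class "A" (ambiguous), which is why a renderer needs an external hint such as a language tag or font hint to decide between narrow and full-width glyphs.

```python
import unicodedata

# The four curly quotes shared by Western and Simplified Chinese text.
QUOTES = "\u201c\u201d\u2018\u2019"


def describe(ch):
    """Return (code point, Unicode name, East Asian Width class) for one character."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.east_asian_width(ch),
    )


for ch in QUOTES:
    # Every quote reports width class "A" (ambiguous).
    print(describe(ch))

# A CJK ideograph, by contrast, is unambiguously wide ("W").
print(describe("中"))
```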

MS Word handles this with specific XML tags that mark East Asian text, as shown below:

<w:r>
  <w:rPr>
    <w:rFonts w:hint="eastAsia"/>
    <w:lang w:eastAsia="zh-CN"/>
  </w:rPr>
  <w:t>这是“中文”</w:t>
</w:r>

For English text, the run is generally as follows:

<w:r>
  <w:t>This is an English sentence</w:t>
</w:r>
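To make the difference between the two runs concrete, here is a minimal Python sketch that produces either form. `make_run` is a hypothetical helper written for this illustration, not part of Pandoc or of any DOCX library, and it only emits the fragment shown above rather than a complete document.

```python
from xml.sax.saxutils import escape, quoteattr


def make_run(text, east_asia_lang=None):
    """Build a <w:r> fragment.

    When east_asia_lang is given (for example "zh-CN"), the run gets the
    eastAsia font hint and language tag; otherwise it is a plain run.
    """
    body = f"<w:t>{escape(text)}</w:t>"
    if east_asia_lang is None:
        return f"<w:r>{body}</w:r>"
    props = (
        '<w:rPr><w:rFonts w:hint="eastAsia"/>'
        f"<w:lang w:eastAsia={quoteattr(east_asia_lang)}/></w:rPr>"
    )
    return f"<w:r>{props}{body}</w:r>"


print(make_run("This is an English sentence"))
print(make_run("这是“中文”", "zh-CN"))
```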

Current Workaround

To address the specific issue regarding quotation marks, I have written a Lua filter that converts Chinese corner-bracket quotes (「 」 and 『 』) to Pandoc's Quoted elements (DoubleQuote and SingleQuote), and then applies the necessary XML tags to these Quoted elements in the DOCX output.

Here's the Lua filter:

-- Lua Filter to Apply XML Tags to Chinese Quotes in DOCX Output

-- Check if the text contains Chinese characters
function is_chinese(text)
    return text:find("[\228-\233][\128-\191][\128-\191]")
end

-- Parse quotes in the text, handling nested quotes
function parse_quotes(text)
    local elements = {}
    local pos = 1

    while pos <= #text do
        local double_start, double_end, double_quoted = text:find("「(.-)」", pos)
        local single_start, single_end, single_quoted = text:find("『(.-)』", pos)

        if double_start and (not single_start or double_start < single_start) then
            if double_start > pos then
                table.insert(elements, pandoc.Str(text:sub(pos, double_start - 1)))
            end
            table.insert(elements, pandoc.Quoted(pandoc.DoubleQuote, parse_quotes(double_quoted)))
            pos = double_end + 1
        elseif single_start then
            if single_start > pos then
                table.insert(elements, pandoc.Str(text:sub(pos, single_start - 1)))
            end
            table.insert(elements, pandoc.Quoted(pandoc.SingleQuote, parse_quotes(single_quoted)))
            pos = single_end + 1
        else
            table.insert(elements, pandoc.Str(text:sub(pos)))
            break
        end
    end

    return elements
end

-- Apply custom XML tags to quotes, including nested quotes
function apply_custom_tags(element)
    if element.t == "Quoted" then
        local has_chinese = false
        for _, inner_element in ipairs(element.content) do
            if inner_element.t == "Str" and is_chinese(inner_element.text) then
                has_chinese = true
                break
            end
        end

        if has_chinese then
            local quote_type = element.quotetype == pandoc.DoubleQuote and "“" or "‘"
            local closing_quote = element.quotetype == pandoc.DoubleQuote and "”" or "’"
            local result = pandoc.List({
                pandoc.RawInline("openxml",
                    string.format(
                        '<w:r><w:rPr><w:rFonts w:hint="eastAsia"/><w:lang w:eastAsia="zh-CN"/></w:rPr><w:t>%s</w:t></w:r>',
                        quote_type))
            })
            for _, inner_element in ipairs(element.content) do
                local nested_elements = apply_custom_tags(inner_element)
                for _, nested_element in ipairs(nested_elements) do
                    result:insert(nested_element)
                end
            end
            result:insert(pandoc.RawInline("openxml",
                string.format(
                    '<w:r><w:rPr><w:rFonts w:hint="eastAsia"/><w:lang w:eastAsia="zh-CN"/></w:rPr><w:t>%s</w:t></w:r>',
                    closing_quote)))
            return result
        end
    end
    return pandoc.List({ element })
end

-- Process each string to convert quotes and apply custom XML tags
function Str(str)
    local parsed_elements = parse_quotes(str.text)
    local new_elements = pandoc.List({})
    for _, parsed_element in ipairs(parsed_elements) do
        new_elements:insert(parsed_element)
    end

    local result = pandoc.List({})
    for _, element in ipairs(new_elements) do
        local processed_elements = apply_custom_tags(element)
        for _, processed_element in ipairs(processed_elements) do
            result:insert(processed_element)
        end
    end

    return result
end
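A note on the `is_chinese` pattern in the filter above: it works on raw UTF-8 bytes, so it is worth checking which characters it actually covers. The Python sketch below reproduces the same byte ranges (`matches_lua_pattern` is a name invented here). Lead bytes 228 to 233 (0xE4 to 0xE9) followed by two continuation bytes correspond to U+4000 through U+9FFF, which includes the main CJK Unified Ideographs block but not kana or Hangul.

```python
import re

# Same byte ranges as the Lua pattern "[\228-\233][\128-\191][\128-\191]".
PATTERN = re.compile(rb"[\xe4-\xe9][\x80-\xbf][\x80-\xbf]")


def matches_lua_pattern(text):
    """Return True if the UTF-8 encoding of text matches the filter's byte pattern."""
    return PATTERN.search(text.encode("utf-8")) is not None


samples = {
    "中文": "Chinese ideographs",
    "あ": "Japanese hiragana",
    "한": "Korean Hangul",
    "\u201c": "left double quotation mark",
    "abc": "ASCII",
}
for text, label in samples.items():
    print(label, matches_lua_pattern(text))
```

So the check is adequate for Chinese text, but a filter meant for Japanese or Korean documents would need wider ranges.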

Proposed Solution

I propose that Pandoc automatically add the appropriate XML tags for East Asian languages when converting documents to DOCX format, regardless of the lang option. This could be based on detecting the presence of East Asian characters in the text. Additionally, support for bidirectional (Bidi) languages could be included to ensure proper formatting.

Benefits

Ensures correct typographical display of East Asian texts in DOCX documents.
Improves the user experience for documents containing a mix of Western and East Asian texts.
Removes the need for custom Lua filters for basic functionality.

Thank you for considering this feature request. I believe it will significantly enhance Pandoc's functionality and usability for users dealing with multilingual documents.

Related: #7022


TomBener · 2024-05-30T13:47:47Z

@jgm Many thanks! I have tested the nightly build and it works like magic! The current implementation is really great, but it would be even better if we could specify the CJK language (zh-CN, zh-HK, zh-TW, ja-JP, ko-KR) and have it passed to the w:lang field.

jgm · 2024-05-30T15:09:37Z

Have you tried

[这是“中文”]{lang=zh-CN}

TomBener · 2024-05-31T02:16:31Z

I've tried this:

[这是“中文”]{lang=zh-CN}

And it works perfectly! Thank you!

TomBener · 2024-06-22T06:47:27Z

Hi @jgm, I found issues with the eastAsia font hints in the docx writer. Consider the following Markdown example:

---
title: Test eastAsia document
reference-section-title: References
csl: test.csl
references:
- author:
  - family: Ma
    given: Jinguan
    suffix: 马經觀
  container-title: 國聞周報
  id: majingguan1949
  issued: 1949
  original-title: China's tomorrow and the day after tomorrow
  page: 1
  publisher-place: Shanghai
  title: 中國的明天和後天
  type: article-newspaper
---

This is a test document for CJK in docx with Pandoc 3.2 later
[@majingguan1949].

This is the test.csl.

By running pandoc -C test.md -o test.docx, I got the following docx:

<w:body>
    <w:p>
      <w:pPr>
        <w:pStyle w:val="Title" />
      </w:pPr>
      <w:r>
        <w:t xml:space="preserve">Test eastAsia document</w:t>
      </w:r>
    </w:p>
    <w:p>
      <w:pPr>
        <w:pStyle w:val="FirstParagraph" />
      </w:pPr>
      <w:r>
        <w:t xml:space="preserve">This is a test document for CJK in docx with Pandoc 3.2 later</w:t>
      </w:r>
      <w:r>
        <w:t xml:space="preserve"></w:t>
      </w:r>
      <w:r>
        <w:t xml:space="preserve">(Ma, 1949)</w:t>
      </w:r>
      <w:r>
        <w:t xml:space="preserve">.</w:t>
      </w:r>
    </w:p>
    <w:bookmarkStart w:id="22" w:name="bibliography" />
    <w:p>
      <w:pPr>
        <w:pStyle w:val="Heading1" />
      </w:pPr>
      <w:r>
        <w:t xml:space="preserve">References</w:t>
      </w:r>
    </w:p>
    <w:bookmarkStart w:id="21" w:name="refs" />
    <w:bookmarkStart w:id="20" w:name="ref-majingguan1949" />
    <w:p>
      <w:pPr>
        <w:pStyle w:val="Bibliography" />
      </w:pPr>
      <w:r>
        <w:rPr>
          <w:rFonts w:hint="eastAsia" />
        </w:rPr>
        <w:t xml:space="preserve">MA JINGUAN 马經觀 (1949)</w:t>
      </w:r>
      <w:r>
        <w:t xml:space="preserve"></w:t>
      </w:r>
      <w:r>
        <w:rPr>
          <w:rFonts w:hint="eastAsia" />
        </w:rPr>
        <w:t xml:space="preserve">“中國的明天和後天”</w:t>
      </w:r>
      <w:r>
        <w:t xml:space="preserve"></w:t>
      </w:r>
      <w:r>
        <w:rPr>
          <w:rFonts w:hint="eastAsia" />
        </w:rPr>
        <w:t xml:space="preserve">(China’s tomorrow and the day after tomorrow). 國聞周報.</w:t>
      </w:r>
    </w:p>
    <w:bookmarkEnd w:id="20" />
    <w:bookmarkEnd w:id="21" />
    <w:bookmarkEnd w:id="22" />
    <w:sectPr />
  </w:body>

Here the eastAsia font hint was added to the last run as a whole, so the English apostrophe is displayed full-width, as shown in the attached screenshot.

This unexpected behavior is largely caused by the mix of Chinese and English in a single bibliography entry, which is common in some English academic journals. It seems Pandoc needs to detect CJK and English more precisely and split them into smaller runs. If this is hard to implement, would it be possible to add an option to toggle the eastAsia font hint on or off for docx output? Thanks!

jgm · 2024-06-23T01:55:29Z

Not sure what to do here. The attribute has to go onto the w:r element, so we'd need to inspect text runs and break them up into separate w:r elements in cases like this.
Maybe something like:

If the run contains CJK characters, then:
break it up into chunks on spaces, and put each chunk in a w:r, adding the font hint if the chunk contains CJK characters
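The chunking idea above can be sketched in Python as follows. This is only an illustration of the proposal, not Pandoc's actual implementation (which is in Haskell). The `CJK` character ranges and the names `has_cjk` and `split_run` are assumptions made for this example.

```python
import re

# Assumption: "CJK" means these ranges (radicals, kana, ideographs, Hangul,
# compatibility ideographs, full-width forms). Pandoc's own test may differ.
CJK = re.compile(
    "[\u2e80-\u2fdf\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff"
    "\uac00-\ud7af\uf900-\ufaff\uff00-\uffef]"
)


def has_cjk(text):
    """Return True if text contains at least one character in the CJK ranges."""
    return CJK.search(text) is not None


def split_run(text):
    """Return (chunk, needs_hint) pairs, one per future w:r element.

    A run without CJK stays whole. Otherwise it is split on spaces, and only
    chunks that contain CJK are flagged for the eastAsia font hint.
    """
    if not has_cjk(text):
        return [(text, False)]
    # The capturing group keeps the spaces as their own chunks, so joining
    # all chunks reproduces the original text exactly.
    chunks = [c for c in re.split(r"( +)", text) if c]
    return [(c, has_cjk(c)) for c in chunks]


run = "(China\u2019s tomorrow and the day after tomorrow). 國聞周報."
for chunk, hinted in split_run(run):
    print(repr(chunk), hinted)
```

With the bibliography run from the report, only the chunk holding the journal title is flagged, so the apostrophe in "China’s" ends up in an unhinted run. A real writer would probably merge adjacent chunks that share the same flag to avoid emitting one w:r per word.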

jgm · 2024-06-23T04:38:23Z

OK, try this.

TomBener · 2024-06-23T09:13:25Z

Thanks. I have tested the new nightly build, and it solved the issue above. However, I found another issue when CJK characters are hyperlinked. For example:

---
title: Test eastAsia document
reference-section-title: References
csl: test.csl
references:
- author:
  - family: Ma
    given: Jinguan
    suffix: 马經觀
  container-title: 國聞周報
  id: majingguan1949
  issued: 1949
  original-title: China's tomorrow and the day after tomorrow
  page: 1
  publisher-place: Shanghai
  title: 中國的明天和後天
  type: article-newspaper
  doi: 10.1234/5678
- author:
  - family: Ma
    given: Jinguan
    suffix: 马經觀
  container-title: 國聞周報
  id: majingguan1949a
  issued: 1949
  original-title: China's tomorrow and the day after tomorrow
  page: 1
  publisher-place: Shanghai
  title: 中國的明天和後天
  type: article-newspaper
---

This is a test document for CJK in docx with Pandoc 3.2 later
[@majingguan1949; @majingguan1949a].

The first citation contains a DOI and is rendered as a hyperlink with test.csl, but the quotation marks are not full-width, as shown in the attached screenshot.

The related XML is as follows:

<w:r>
  <w:t xml:space="preserve">“</w:t>
</w:r>
<w:hyperlink r:id="rId20">
  <w:r>
    <w:rPr>
      <w:rStyle w:val="Hyperlink" />
      <w:rFonts w:hint="eastAsia" />
    </w:rPr>
    <w:t xml:space="preserve">中國的明天和後天</w:t>
  </w:r>
</w:hyperlink>
<w:r>
  <w:t xml:space="preserve">”</w:t>
</w:r>
<w:r>
  <w:t xml:space="preserve"></w:t>
</w:r>
<w:r>
  <w:t xml:space="preserve">(China’s tomorrow and the day after tomorrow). </w:t>
</w:r>
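One way to confirm which runs carry the hint is to parse the XML and list each run's text next to a flag. The sketch below uses Python's standard `xml.etree.ElementTree` on a shortened copy of the fragment above, wrapped in a `w:p` element with namespace declarations so that it parses on its own. `run_hints` is a name invented for this example.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"

# Shortened copy of the reported fragment, wrapped so it is well-formed.
FRAGMENT = f"""<w:p xmlns:w="{W}" xmlns:r="{R}">
<w:r><w:t xml:space="preserve">“</w:t></w:r>
<w:hyperlink r:id="rId20">
  <w:r>
    <w:rPr><w:rStyle w:val="Hyperlink"/><w:rFonts w:hint="eastAsia"/></w:rPr>
    <w:t xml:space="preserve">中國的明天和後天</w:t>
  </w:r>
</w:hyperlink>
<w:r><w:t xml:space="preserve">”</w:t></w:r>
</w:p>"""


def run_hints(xml_text):
    """Return (text, has_east_asia_hint) for every w:r, in document order."""
    root = ET.fromstring(xml_text)
    result = []
    for run in root.iter(f"{{{W}}}r"):
        text = "".join(t.text or "" for t in run.iter(f"{{{W}}}t"))
        fonts = run.find(f"{{{W}}}rPr/{{{W}}}rFonts")
        hinted = fonts is not None and fonts.get(f"{{{W}}}hint") == "eastAsia"
        result.append((text, hinted))
    return result


for text, hinted in run_hints(FRAGMENT):
    print(repr(text), hinted)
```

The two quote runs sit outside the hyperlink and come back unhinted, while the linked title is hinted, which matches the rendering described in the report.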

In my field, we cite Chinese sources in articles, but they make up a relatively small part of the whole document. English journals therefore expect the typesetting to follow English conventions rather than Chinese ones, particularly for quotation marks. In this context, could you please provide an option to disable the eastAsia font attribute?

jgm · 2024-06-23T17:00:57Z

Please create a new issue for the hyperlink issue, and another one requesting a way to turn off the feature.

TomBener · 2024-06-24T01:03:36Z

@jgm I have submitted two issues. Could you please consider them? Thanks very much!

TomBener added the enhancement label May 29, 2024

jgm closed this as completed in a6c3945 May 29, 2024

jgm reopened this Jun 23, 2024

jgm closed this as completed in b07c05a Jun 23, 2024

This was referenced Jun 24, 2024

Issues with East Asian Language Tags with Hyperlinks and Quotes in DOCX Output #9909

Closed

Support to disable East Asian font hints in docx output #9910

Open

jgm mentioned this issue Sep 20, 2024

Trivial custom docx writer doesn't work for reference doc with header #10201

Open
