Feature Request: Support for East Asian Language Tags in DOCX Output #9817

Closed
TomBener opened this issue May 29, 2024 · 9 comments

@TomBener
Contributor

Description

When converting Markdown to DOCX with Pandoc, East Asian text (including Chinese, Japanese, and Korean) does not receive the XML tags that indicate its language. This causes typographical problems, especially with punctuation such as quotation marks, which do not appear full-width as expected in East Asian text. For instance, Simplified Chinese quotation marks (“ ” ‘ ’) share the same Unicode code points as their Western counterparts but need to be displayed full-width to align properly with Chinese text, as the screenshot below shows.

[Screenshot]

MS Word handles this with specific XML tags that mark East Asian text, as shown below:

<w:r>
  <w:rPr>
    <w:rFonts w:hint="eastAsia"/>
    <w:lang w:eastAsia="zh-CN"/>
  </w:rPr>
  <w:t>这是“中文”</w:t>
</w:r>

For English text, the run generally looks like this:

<w:r>
  <w:t>This is an English sentence</w:t>
</w:r>
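
As a rough illustration of the difference between the two runs above, here is a minimal Python sketch that builds either form as a string. The function name make_run and its east_asia_lang parameter are hypothetical and chosen only for this example; this is not how Pandoc or Word generates the XML.

```python
from xml.sax.saxutils import escape, quoteattr


def make_run(text, east_asia_lang=None):
    """Build a WordprocessingML run (w:r) as a string.

    make_run and east_asia_lang are hypothetical names for this sketch.
    When east_asia_lang is given, the run gets the eastAsia font hint and
    a w:lang element, mirroring the Chinese example above.
    """
    props = ""
    if east_asia_lang is not None:
        props = (
            '<w:rPr><w:rFonts w:hint="eastAsia"/>'
            f"<w:lang w:eastAsia={quoteattr(east_asia_lang)}/></w:rPr>"
        )
    # escape() protects &, < and > in the run text.
    return f"<w:r>{props}<w:t>{escape(text)}</w:t></w:r>"


if __name__ == "__main__":
    print(make_run("这是“中文”", "zh-CN"))
    print(make_run("This is an English sentence"))
```

The only difference between the two outputs is the w:rPr block, which is what tells Word to treat the ambiguous quotation marks as East Asian punctuation.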

Current Workaround

To address the specific issue regarding quotation marks, I have written a Lua filter that converts straight Chinese quotes to Pandoc's Quoted elements (DoubleQuote and SingleQuote), and then applies the necessary XML tags to these Quoted elements in the DOCX output.

Here's the Lua filter:

-- Lua Filter to Apply XML Tags to Chinese Quotes in DOCX Output

-- Check if the text contains Chinese characters
function is_chinese(text)
    return text:find("[\228-\233][\128-\191][\128-\191]")
end

-- Parse quotes in the text, handling nested quotes
function parse_quotes(text)
    local elements = {}
    local pos = 1

    while pos <= #text do
        local double_start, double_end, double_quoted = text:find("「(.-)」", pos)
        local single_start, single_end, single_quoted = text:find("『(.-)』", pos)

        if double_start and (not single_start or double_start < single_start) then
            if double_start > pos then
                table.insert(elements, pandoc.Str(text:sub(pos, double_start - 1)))
            end
            table.insert(elements, pandoc.Quoted(pandoc.DoubleQuote, parse_quotes(double_quoted)))
            pos = double_end + 1
        elseif single_start then
            if single_start > pos then
                table.insert(elements, pandoc.Str(text:sub(pos, single_start - 1)))
            end
            table.insert(elements, pandoc.Quoted(pandoc.SingleQuote, parse_quotes(single_quoted)))
            pos = single_end + 1
        else
            table.insert(elements, pandoc.Str(text:sub(pos)))
            break
        end
    end

    return elements
end

-- Apply custom XML tags to quotes, including nested quotes
function apply_custom_tags(element)
    if element.t == "Quoted" then
        local has_chinese = false
        for _, inner_element in ipairs(element.content) do
            if inner_element.t == "Str" and is_chinese(inner_element.text) then
                has_chinese = true
                break
            end
        end

        if has_chinese then
            local quote_type = element.quotetype == pandoc.DoubleQuote and "“" or "‘"
            local closing_quote = element.quotetype == pandoc.DoubleQuote and "”" or "’"
            local result = pandoc.List({
                pandoc.RawInline("openxml",
                    string.format(
                        '<w:r><w:rPr><w:rFonts w:hint="eastAsia"/><w:lang w:eastAsia="zh-CN"/></w:rPr><w:t>%s</w:t></w:r>',
                        quote_type))
            })
            for _, inner_element in ipairs(element.content) do
                local nested_elements = apply_custom_tags(inner_element)
                for _, nested_element in ipairs(nested_elements) do
                    result:insert(nested_element)
                end
            end
            result:insert(pandoc.RawInline("openxml",
                string.format(
                    '<w:r><w:rPr><w:rFonts w:hint="eastAsia"/><w:lang w:eastAsia="zh-CN"/></w:rPr><w:t>%s</w:t></w:r>',
                    closing_quote)))
            return result
        end
    end
    return pandoc.List({ element })
end

-- Process each string to convert quotes and apply custom XML tags
function Str(str)
    local parsed_elements = parse_quotes(str.text)
    local new_elements = pandoc.List({})
    for _, parsed_element in ipairs(parsed_elements) do
        new_elements:insert(parsed_element)
    end

    local result = pandoc.List({})
    for _, element in ipairs(new_elements) do
        local processed_elements = apply_custom_tags(element)
        for _, processed_element in ipairs(processed_elements) do
            result:insert(processed_element)
        end
    end

    return result
end
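
For readers wondering what the byte pattern in is_chinese actually covers, the following Python sketch (not part of the filter) reproduces it on UTF-8 bytes and compares it with a code point range. The function names are hypothetical. On valid UTF-8, lead bytes 228 to 233 (0xE4 to 0xE9) start three-byte sequences for U+4000 through U+9FFF, so the pattern catches Han ideographs but not kana, Hangul, or CJK punctuation.

```python
import re

# Byte-level equivalent of the Lua pattern "[\228-\233][\128-\191][\128-\191]".
LUA_PATTERN = re.compile(rb"[\xe4-\xe9][\x80-\xbf][\x80-\xbf]")


def matches_lua_pattern(text):
    """True if the UTF-8 encoding of text matches the filter's byte pattern."""
    return LUA_PATTERN.search(text.encode("utf-8")) is not None


def in_equivalent_range(text):
    """Same check expressed on code points: any character in U+4000..U+9FFF."""
    return any(0x4000 <= ord(ch) <= 0x9FFF for ch in text)


if __name__ == "__main__":
    for sample in ["这是“中文”", "“”", "あ", "한", "English"]:
        print(repr(sample), matches_lua_pattern(sample), in_equivalent_range(sample))
```

A filter meant to cover Japanese or Korean text as well would need a wider test than this one.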

Proposed Solution

I propose that Pandoc automatically add the appropriate XML tags for East Asian languages when converting documents to DOCX format, regardless of the lang option. This could be based on detecting the presence of East Asian characters in the text. Additionally, support for bidirectional (Bidi) languages could be included to ensure proper formatting.

Benefits

  • Ensures correct typographical display of East Asian texts in DOCX documents.
  • Improves the user experience for documents containing a mix of Western and East Asian texts.
  • Removes the need for custom Lua filters for basic functionality.

Thank you for considering this feature request. I believe it will significantly enhance Pandoc's functionality and usability for users dealing with multilingual documents.

Related: #7022

@TomBener
Contributor Author

TomBener commented May 30, 2024

@jgm Many thanks! I have tested the nightly build and it works like magic! The current implementation is already great, but it would be even better if we could specify the CJK language (zh-CN, zh-HK, zh-TW, ja-JP, ko-KR) and pass it to the w:lang field.

@jgm
Owner

jgm commented May 30, 2024

Have you tried this?

[这是“中文”]{lang=zh-CN}

@TomBener
Contributor Author

I've tried this:

[这是“中文”]{lang=zh-CN}

And it works perfectly! Thank you!

@TomBener
Contributor Author

TomBener commented Jun 22, 2024

Hi @jgm, I found an issue with the eastAsia font hint in the docx writer. Consider the following Markdown example:

---
title: Test eastAsia document
reference-section-title: References
csl: test.csl
references:
- author:
  - family: Ma
    given: Jinguan
    suffix: 马經觀
  container-title: 國聞周報
  id: majingguan1949
  issued: 1949
  original-title: China's tomorrow and the day after tomorrow
  page: 1
  publisher-place: Shanghai
  title: 中國的明天和後天
  type: article-newspaper
---

This is a test document for CJK in docx with Pandoc 3.2 later
[@majingguan1949].

This is the test.csl.

By running pandoc -C test.md -o test.docx, I got the following docx:

<w:body>
    <w:p>
      <w:pPr>
        <w:pStyle w:val="Title" />
      </w:pPr>
      <w:r>
        <w:t xml:space="preserve">Test eastAsia document</w:t>
      </w:r>
    </w:p>
    <w:p>
      <w:pPr>
        <w:pStyle w:val="FirstParagraph" />
      </w:pPr>
      <w:r>
        <w:t xml:space="preserve">This is a test document for CJK in docx with Pandoc 3.2 later</w:t>
      </w:r>
      <w:r>
        <w:t xml:space="preserve"></w:t>
      </w:r>
      <w:r>
        <w:t xml:space="preserve">(Ma, 1949)</w:t>
      </w:r>
      <w:r>
        <w:t xml:space="preserve">.</w:t>
      </w:r>
    </w:p>
    <w:bookmarkStart w:id="22" w:name="bibliography" />
    <w:p>
      <w:pPr>
        <w:pStyle w:val="Heading1" />
      </w:pPr>
      <w:r>
        <w:t xml:space="preserve">References</w:t>
      </w:r>
    </w:p>
    <w:bookmarkStart w:id="21" w:name="refs" />
    <w:bookmarkStart w:id="20" w:name="ref-majingguan1949" />
    <w:p>
      <w:pPr>
        <w:pStyle w:val="Bibliography" />
      </w:pPr>
      <w:r>
        <w:rPr>
          <w:rFonts w:hint="eastAsia" />
        </w:rPr>
        <w:t xml:space="preserve">MA JINGUAN 马經觀 (1949)</w:t>
      </w:r>
      <w:r>
        <w:t xml:space="preserve"></w:t>
      </w:r>
      <w:r>
        <w:rPr>
          <w:rFonts w:hint="eastAsia" />
        </w:rPr>
        <w:t xml:space="preserve">“中國的明天和後天”</w:t>
      </w:r>
      <w:r>
        <w:t xml:space="preserve"></w:t>
      </w:r>
      <w:r>
        <w:rPr>
          <w:rFonts w:hint="eastAsia" />
        </w:rPr>
        <w:t xml:space="preserve">(China’s tomorrow and the day after tomorrow). 國聞周報.</w:t>
      </w:r>
    </w:p>
    <w:bookmarkEnd w:id="20" />
    <w:bookmarkEnd w:id="21" />
    <w:bookmarkEnd w:id="22" />
    <w:sectPr />
  </w:body>

Here the last run received the eastAsia font hint, so the English apostrophe in it is displayed full-width, as shown in the screenshot below.

[Screenshot]
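
One way to check output like this without opening Word is to list each run together with whether it carries the hint. The sketch below uses Python's standard xml.etree.ElementTree. The function name hinted_runs and the shortened SAMPLE fragment are made up for illustration, and the fragment declares the w namespace itself because the excerpt above omits the document root.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, bound to the w: prefix in document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def hinted_runs(body_xml):
    """Return (text, has_east_asia_hint) for every w:r in the fragment."""
    root = ET.fromstring(body_xml)
    result = []
    for run in root.iter(f"{{{W}}}r"):
        text = "".join(t.text or "" for t in run.findall("w:t", NS))
        fonts = run.find("w:rPr/w:rFonts", NS)
        hinted = fonts is not None and fonts.get(f"{{{W}}}hint") == "eastAsia"
        result.append((text, hinted))
    return result


# Shortened, hypothetical fragment modeled on the bibliography entry above.
SAMPLE = f"""<w:body xmlns:w="{W}">
  <w:p>
    <w:r>
      <w:rPr><w:rFonts w:hint="eastAsia" /></w:rPr>
      <w:t xml:space="preserve">(China’s tomorrow). 國聞周報.</w:t>
    </w:r>
    <w:r>
      <w:t xml:space="preserve">.</w:t>
    </w:r>
  </w:p>
</w:body>"""

if __name__ == "__main__":
    for text, hinted in hinted_runs(SAMPLE):
        print(hinted, repr(text))
```

For a real file, the XML can be read with zipfile.ZipFile("test.docx").read("word/document.xml") and passed to the same function.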

This unexpected behavior is largely caused by the mix of Chinese and English in one single bibliography entry, which is common in some English academic journals. It seems Pandoc needs to detect CJK and English more precisely and separate them into smaller parts. If this is hard to implement, is it possible to add an option to toggle on or toggle off the eastAsia font option for docx output? Thanks!

jgm reopened this Jun 23, 2024
@jgm
Owner

jgm commented Jun 23, 2024

Not sure what to do here. The attribute has to go onto the w:r element, so we'd need to inspect text runs and break them up into separate w:r elements in cases like this.
Maybe something like:

If the run contains CJK characters, then:
break it up into chunks on spaces, and put each chunk in a w:r, adding the font hint if the chunk contains CJK characters
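
The idea described in this comment can be sketched in Python roughly as follows. This is only an illustration of the approach, not the code in Pandoc's docx writer: the names has_cjk and split_runs, the list of Unicode ranges, and the choice to merge neighbouring chunks with the same hint status are all assumptions made for this example.

```python
import re

# Illustrative ranges only (an assumption, not Pandoc's actual list).
CJK_RANGES = [
    (0x3000, 0x303F),  # CJK symbols and punctuation
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7A3),  # Hangul syllables
    (0xFF00, 0xFFEF),  # Halfwidth and fullwidth forms
]


def has_cjk(text):
    """True if any character falls in one of the ranges above."""
    return any(lo <= ord(ch) <= hi for ch in text for lo, hi in CJK_RANGES)


def split_runs(text):
    """Split text on whitespace and decide per chunk whether it needs the hint.

    Returns (chunk, needs_hint) pairs. Adjacent chunks with the same status
    are merged so the output has as few runs as possible.
    """
    runs = []
    for chunk in re.findall(r"\s+|\S+", text):
        hint = has_cjk(chunk)
        if runs and runs[-1][1] == hint:
            runs[-1] = (runs[-1][0] + chunk, hint)
        else:
            runs.append((chunk, hint))
    return runs


if __name__ == "__main__":
    for chunk, hint in split_runs("(China’s tomorrow and the day after tomorrow). 國聞周報."):
        print(hint, repr(chunk))
```

With this splitting, the English part of the bibliography entry ends up in a run without the hint, and only the chunk containing 國聞周報 would get it.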

jgm closed this as completed in b07c05a Jun 23, 2024
@jgm
Owner

jgm commented Jun 23, 2024

OK, try this.

@TomBener
Contributor Author

Thanks. I have tested the new nightly build, and it solves the issue above. But I found another issue when CJK characters are hyperlinked. For example:

---
title: Test eastAsia document
reference-section-title: References
csl: test.csl
references:
- author:
  - family: Ma
    given: Jinguan
    suffix: 马經觀
  container-title: 國聞周報
  id: majingguan1949
  issued: 1949
  original-title: China's tomorrow and the day after tomorrow
  page: 1
  publisher-place: Shanghai
  title: 中國的明天和後天
  type: article-newspaper
  doi: 10.1234/5678
- author:
  - family: Ma
    given: Jinguan
    suffix: 马經觀
  container-title: 國聞周報
  id: majingguan1949a
  issued: 1949
  original-title: China's tomorrow and the day after tomorrow
  page: 1
  publisher-place: Shanghai
  title: 中國的明天和後天
  type: article-newspaper
---

This is a test document for CJK in docx with Pandoc 3.2 later
[@majingguan1949; @majingguan1949a].

The first citation contains a DOI, so with test.csl its title is rendered as a hyperlink, but the surrounding quotation marks are not full-width, as shown in the screenshot below.

[Screenshot]

The related XML is as follows:

<w:r>
  <w:t xml:space="preserve">“</w:t>
</w:r>
<w:hyperlink r:id="rId20">
  <w:r>
    <w:rPr>
      <w:rStyle w:val="Hyperlink" />
      <w:rFonts w:hint="eastAsia" />
    </w:rPr>
    <w:t xml:space="preserve">中國的明天和後天</w:t>
  </w:r>
</w:hyperlink>
<w:r>
  <w:t xml:space="preserve">”</w:t>
</w:r>
<w:r>
  <w:t xml:space="preserve"></w:t>
</w:r>
<w:r>
  <w:t xml:space="preserve">(China’s tomorrow and the day after tomorrow). </w:t>
</w:r>

In my field, we often cite Chinese sources, but they make up a relatively small part of the whole document. English journals therefore expect the typesetting, particularly of quotation marks, to follow English rather than Chinese conventions. In this context, could you please provide an option to disable the eastAsia font attribute?

@jgm
Owner

jgm commented Jun 23, 2024

Please create a new issue for the hyperlink issue, and another one requesting a way to turn off the feature.

@TomBener
Contributor Author

@jgm I have submitted two issues. Could you please consider them? Thanks very much!
