Feature Request: Support for East Asian Language Tags in DOCX Output #9817
Comments
@jgm Many thanks! I have tested the nightly build and it works like magic! It would be better if we could specify the specific CJK language ( |
Have you tried
|
I've tried this:
And it works perfectly! Thank you! |
Hi @jgm, I found issues with the
---
title: Test eastAsia document
reference-section-title: References
csl: test.csl
references:
- author:
  - family: Ma
    given: Jinguan
    suffix: 马經觀
  container-title: 國聞周報
  id: majingguan1949
  issued: 1949
  original-title: China's tomorrow and the day after tomorrow
  page: 1
  publisher-place: Shanghai
  title: 中國的明天和後天
  type: article-newspaper
---
This is a test document for CJK in docx with Pandoc 3.2 later
[@majingguan1949].
This is the test.csl. By running
<w:body>
<w:p>
<w:pPr>
<w:pStyle w:val="Title" />
</w:pPr>
<w:r>
<w:t xml:space="preserve">Test eastAsia document</w:t>
</w:r>
</w:p>
<w:p>
<w:pPr>
<w:pStyle w:val="FirstParagraph" />
</w:pPr>
<w:r>
<w:t xml:space="preserve">This is a test document for CJK in docx with Pandoc 3.2 later</w:t>
</w:r>
<w:r>
<w:t xml:space="preserve"></w:t>
</w:r>
<w:r>
<w:t xml:space="preserve">(Ma, 1949)</w:t>
</w:r>
<w:r>
<w:t xml:space="preserve">.</w:t>
</w:r>
</w:p>
<w:bookmarkStart w:id="22" w:name="bibliography" />
<w:p>
<w:pPr>
<w:pStyle w:val="Heading1" />
</w:pPr>
<w:r>
<w:t xml:space="preserve">References</w:t>
</w:r>
</w:p>
<w:bookmarkStart w:id="21" w:name="refs" />
<w:bookmarkStart w:id="20" w:name="ref-majingguan1949" />
<w:p>
<w:pPr>
<w:pStyle w:val="Bibliography" />
</w:pPr>
<w:r>
<w:rPr>
<w:rFonts w:hint="eastAsia" />
</w:rPr>
<w:t xml:space="preserve">MA JINGUAN 马經觀 (1949)</w:t>
</w:r>
<w:r>
<w:t xml:space="preserve"></w:t>
</w:r>
<w:r>
<w:rPr>
<w:rFonts w:hint="eastAsia" />
</w:rPr>
<w:t xml:space="preserve">“中國的明天和後天”</w:t>
</w:r>
<w:r>
<w:t xml:space="preserve"></w:t>
</w:r>
<w:r>
<w:rPr>
<w:rFonts w:hint="eastAsia" />
</w:rPr>
<w:t xml:space="preserve">(China’s tomorrow and the day after tomorrow). 國聞周報.</w:t>
</w:r>
</w:p>
<w:bookmarkEnd w:id="20" />
<w:bookmarkEnd w:id="21" />
<w:bookmarkEnd w:id="22" />
<w:sectPr />
</w:body>
Where the last sentence was added the
This unexpected behavior is largely caused by the mix of Chinese and English in a single bibliography entry, which is common in some English academic journals. It seems Pandoc needs to detect CJK and English text more precisely and separate them into smaller parts. If this is hard to implement, is it possible to add an option to toggle on or toggle off the |
Not sure what to do here. The attribute has to go onto the w:r element, so we'd need to inspect text runs and break them up into separate w:r elements in cases like this. If the run contains CJK characters, then: |
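The run-splitting idea described above can be sketched roughly as follows. This is a minimal illustration in Python, not Pandoc's actual implementation: the `CJK_RANGES` table, the helper names `is_cjk`, `split_runs`, and `to_runs_xml`, and the choice of Unicode blocks that count as CJK are all assumptions made for this example. Only the `<w:rFonts w:hint="eastAsia" />` markup is taken from the XML quoted in this thread.

```python
from itertools import groupby
from xml.sax.saxutils import escape

# Assumed set of Unicode blocks treated as CJK; illustrative only,
# not the detection logic Pandoc uses.
CJK_RANGES = [
    (0x3000, 0x303F),  # CJK Symbols and Punctuation
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
    (0xFF00, 0xFFEF),  # Halfwidth and Fullwidth Forms
]


def is_cjk(ch):
    """Return True if the character falls in one of the assumed CJK blocks."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)


def split_runs(text):
    """Split text into maximal (is_cjk, segment) pieces, preserving order."""
    return [(cjk, "".join(chars)) for cjk, chars in groupby(text, key=is_cjk)]


def to_runs_xml(text):
    """Emit one w:r per segment; only CJK segments get the eastAsia hint."""
    out = []
    for cjk, segment in split_runs(text):
        rpr = '<w:rPr><w:rFonts w:hint="eastAsia" /></w:rPr>' if cjk else ""
        out.append(
            f'<w:r>{rpr}<w:t xml:space="preserve">{escape(segment)}</w:t></w:r>'
        )
    return "".join(out)


print(split_runs("MA JINGUAN 马經觀 (1949)"))
# [(False, 'MA JINGUAN '), (True, '马經觀'), (False, ' (1949)')]
```

With this split, the Latin-script parts of a mixed bibliography entry stay in plain runs, and only the CJK segment carries the hint. Quotation marks such as “ and ” are not in the assumed ranges, so under this sketch they would stay in unhinted runs.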
OK, try this. |
Thanks. I have tested the new nightly build. It solved the issue above. But I found another issue when CJK characters are hyperlinked. For example:
---
title: Test eastAsia document
reference-section-title: References
csl: test.csl
references:
- author:
  - family: Ma
    given: Jinguan
    suffix: 马經觀
  container-title: 國聞周報
  id: majingguan1949
  issued: 1949
  original-title: China's tomorrow and the day after tomorrow
  page: 1
  publisher-place: Shanghai
  title: 中國的明天和後天
  type: article-newspaper
  doi: 10.1234/5678
- author:
  - family: Ma
    given: Jinguan
    suffix: 马經觀
  container-title: 國聞周報
  id: majingguan1949a
  issued: 1949
  original-title: China's tomorrow and the day after tomorrow
  page: 1
  publisher-place: Shanghai
  title: 中國的明天和後天
  type: article-newspaper
---
This is a test document for CJK in docx with Pandoc 3.2 later
[@majingguan1949; @majingguan1949a].
The first citation contained a DOI and was rendered as a hyperlink with
The related XML is as follows:
<w:r>
<w:t xml:space="preserve">“</w:t>
</w:r>
<w:hyperlink r:id="rId20">
<w:r>
<w:rPr>
<w:rStyle w:val="Hyperlink" />
<w:rFonts w:hint="eastAsia" />
</w:rPr>
<w:t xml:space="preserve">中國的明天和後天</w:t>
</w:r>
</w:hyperlink>
<w:r>
<w:t xml:space="preserve">”</w:t>
</w:r>
<w:r>
<w:t xml:space="preserve"></w:t>
</w:r>
<w:r>
<w:t xml:space="preserve">(China’s tomorrow and the day after tomorrow). </w:t>
</w:r>
In my field, we tend to cite Chinese sources in articles, but they make up a relatively small part of the entire document. English journals therefore expect the typesetting to follow English conventions rather than Chinese ones, particularly for quotation marks. In this context, could you please provide an option to disable |
Please create a new issue for the hyperlink issue, and another one requesting a way to turn off the feature. |
@jgm I have submitted two issues. Could you please consider them? Thanks very much! |
Description
When converting Markdown to DOCX using Pandoc, East Asian text (including Chinese, Japanese, and Korean) does not receive the XML tags that indicate its language. This leads to typographical issues, especially with punctuation marks such as quotes, which do not appear in full-width form as expected in East Asian text. For instance, Simplified Chinese quotes (“ ” ‘ ’) share the same Unicode code points as their Western counterparts but need to be displayed as full-width characters to align properly with Chinese text, as the screenshot shows below.
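The shared code points can be checked directly. The short Python sketch below uses only the standard `unicodedata` module; the helper name `describe` is made up for this illustration. It prints the code point, the Unicode East Asian Width class, and the character name for the four quotes, a Chinese character, and a Latin letter. The width class is shown only as supporting context: it does not by itself determine how Word picks a font.

```python
import unicodedata


def describe(ch):
    """Return (code point, East Asian Width class, Unicode name) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.east_asian_width(ch), unicodedata.name(ch))


for ch in "“”‘’中A":
    print(*describe(ch))
# U+201C A LEFT DOUBLE QUOTATION MARK
# U+201D A RIGHT DOUBLE QUOTATION MARK
# U+2018 A LEFT SINGLE QUOTATION MARK
# U+2019 A RIGHT SINGLE QUOTATION MARK
# U+4E2D W CJK UNIFIED IDEOGRAPH-4E2D
# U+0041 Na LATIN CAPITAL LETTER A
```

The quotes are classed as "A" (ambiguous width), unlike "W" (wide) for 中 and "Na" (narrow) for A, so the characters alone do not say whether they should render full-width. That is consistent with the need for extra markup on the surrounding run.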
To handle this, MS Word uses specific XML tags to denote East Asian text, as shown below:
For English text, it is generally as follows:
Current Workaround
To address the specific issue regarding quotation marks, I have written a Lua filter that converts straight Chinese quotes to Pandoc's Quoted elements (DoubleQuote and SingleQuote), and then applies the necessary XML tags to these Quoted elements in the DOCX output.
Here's the Lua filter:
Proposed Solution
I propose that Pandoc automatically add the appropriate XML tags for East Asian languages when converting documents to DOCX format, regardless of the lang option. This could be based on detecting the presence of East Asian characters in the text. Additionally, support for bidirectional (Bidi) languages could be included to ensure proper formatting.
Benefits
Thank you for considering this feature request. I believe it will significantly enhance Pandoc's functionality and usability for users dealing with multilingual documents.
Related: #7022