Extract text more carefully in mdbook-xgettext
#318
Comments
When doing this, it's critically important that we run the same transformations on the existing |
This is a very simple work-around for us extracting lots of code to the `messages.pot` file. The comments in the code can be translated, but Cloud Translate doesn’t know about it: instead it translates everything, keywords and all. The course text uses `;` only very sparingly: I found a single page which uses the character. So we are not losing much by skipping these messages. Long-term, we should extract code blocks as a unit and we should mark them in the `messages.pot` file (google#318). That will allow us to be more selective about what we translate.
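The work-around described in that commit message amounts to a one-line heuristic. A minimal sketch, assuming messages are plain strings; `looks_like_code` and `filter_messages` are hypothetical names for illustration and are not taken from `mdbook-xgettext`:

```python
def looks_like_code(message: str) -> bool:
    """Heuristic from the commit message: any message containing ';' is treated as code."""
    return ";" in message


def filter_messages(messages: list[str]) -> list[str]:
    """Keep only the messages that would still be extracted for translation."""
    return [m for m in messages if not looks_like_code(m)]


messages = [
    "Welcome to the course.",
    'fn main() {\n    println!("Hello world!");\n}',
]
kept = filter_messages(messages)
print(kept)
```

The trade-off is the one stated in the commit message: prose that happens to contain `;` is skipped as well.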
why not use the markdown parser used by mdbook |
@moutikabdessabour we should definitely use |
I'd like to work on this, if you don't mind assigning it to me. I can see how to replace the existing
Did you have something "easy" in mind for this? My thinking was that this would be a kind of half-automated process, where with some iteration I could find a one-off way to translate all of the old msgid's to new msgid's, and then apply those to the |
While working on the Korean translation I found that keeping the Markdown syntax (i.e. bullets) was helpful because it gave me the freedom to do whatever fits better in the target language, like splitting a single bullet into two when necessary. It's probably the same reason why mgeisler@ thought trimming links would be a poor idea. |
That makes a lot of sense. I think we could adjust the chunk-extraction to collapse adjacent list-item chunks into a single chunk. |
Yes, that was also roughly my idea. Basically that the new extraction functionality can be accessed from some temporary tool which will iterate over pairs of I'm thinking this should be done in smaller steps and that each step should be carried out on the Perhaps we can start by teaching Next, I imagine it would be easy to extract fenced code blocks as a unit, and probably also easy to strip away I've been dabbling a bit with this myself and I think the biggest trouble will be to parse all of the different
I knew that the current system gives us that freedom, but I didn't know the freedom was used 😄 Can you tell us more about where you had to do this? My gut feeling is that we should try to improve the original English text in those cases. |
I got a start on this today in #449. I think this can get pretty close to producing the existing set of messages. This is probably a good place to start, and then update the .po files where they differ (number of newlines, maybe some funny business around |
On experimenting a bit, I think we should leave lists as a unit for translation. The reason is, otherwise indentation is very hard to get right. For example, given
we get
If the translation goes onto multiple lines, it's not at all obvious to the translator that this must be
in order to keep the indentation correct. So, I will include lists in their entirety. |
Also, I don't think there's any automated way to re-break these messages. Some lists were broken into multiple messages by having |
Long-term, I would like to unwrap such paragraphs. So

* This is
  a single
  list item.

  Second paragraph
  in first item.

becomes two messages in the
Indentation and wrapping have been taken away. When translating the original text, we end up with
This should work as long as there are no new The goal (for me) is to remove the possibility of errors in the translations, and also to make the translations robust against changes in the formatting. I would like to hear from @jooyunghan, @jiyongp, @rastringer, @hugojacob, and @ronaldfw if this is a good goal? |
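The unwrapping proposed in this comment (one list item with two soft-wrapped paragraphs turning into two messages) can be sketched as follows. This is an illustration under simplifying assumptions, not the extraction code discussed in the thread: `unwrap_list_item` is a hypothetical helper, and it ignores nested lists and code blocks.

```python
import re


def unwrap_list_item(item: str) -> list[str]:
    """Split one list item into paragraphs and unfold soft-wrapped lines."""
    # Drop the leading bullet marker ("* ", "- " or "+ ").
    body = re.sub(r"^\s*[*+-]\s+", "", item)
    messages = []
    # Paragraphs inside the item are separated by blank lines.
    for paragraph in re.split(r"\n\s*\n", body):
        # Splitting and re-joining on whitespace removes indentation and line breaks.
        text = " ".join(paragraph.split())
        if text:
            messages.append(text)
    return messages


item = (
    "* This is\n"
    "  a single\n"
    "  list item.\n"
    "\n"
    "  Second paragraph\n"
    "  in first item.\n"
)
print(unwrap_list_item(item))
```

Because the messages no longer carry indentation or line breaks, re-wrapping the source text would not change them.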
I think it's okay to not unwrap softly wrapped text. It is sometimes even useful especially when translating a code fragment having translatable comments. What is annoying with po is that it doesn't support multi-line strings. Ideally, I wish the following. Not sure po file format supports it (but we could preprocess if not). Markdown:
po file:
|
The PO format uses C-style string and C-style string concatenation. So msgid ""
"f"
"o"
"o" is a When using I don't understand how having support for newlines in the strings in the PO file helps you here? |
I know that. But there are a few problems here:
A translated text entered in poedit
becomes
|
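As a small illustration of the C-style string concatenation mentioned above: adjacent string literals in a PO entry are joined into one value. The sketch below is not a PO parser. `po_string_value` is a hypothetical helper, and it relies on Python's string-literal rules as a stand-in for C escapes, which is only an approximation that covers common cases such as `\n`, `\"` and `\\`.

```python
import ast
import re

# A double-quoted literal, allowing backslash escapes inside it.
STRING_LITERAL = re.compile(r'"(?:[^"\\]|\\.)*"')


def po_string_value(lines: list[str]) -> str:
    """Concatenate all string literals found on the lines of one msgid or msgstr."""
    parts = []
    for line in lines:
        for literal in STRING_LITERAL.findall(line):
            # Assumption: Python escape rules are close enough to C's here.
            parts.append(ast.literal_eval(literal))
    return "".join(parts)


print(po_string_value(['msgid ""', '"f"', '"o"', '"o"']))
```

A line break in the message only appears where a literal contains an explicit `\n`; the physical line breaks between literals in the file carry no meaning.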
IMO, the problem with working on the PO file directly is that we need to handle a stack of two encodings: the PO file's C-style string literals (with escaping) on top of the Markdown text. I think that's why @jiyong wished the PO file supported raw text. My workflow now is that
|
Thanks, I see what you mean now! For that use case, I would suggest writing a tiny tool which transforms a

- msgid: '# Running the Course'
  msgstr: '# 강의 진행 방식'
- msgid: '> This page is for the course instructor.'
  msgstr: '> 강사를 위한 안내 페이지입니다.'
- msgid: |-
    Here is a bit of background information about how we've been running the course
    internally at Google.
  msgstr: 다음은 구글 내부에서 이 과정을 어떤식으로 운영해왔는지에 대한 배경 정보입니다.
- msgid: 'To run the course, you need to:'
  msgstr: '강의를 실행하기 위한 준비:'
- msgid: |-
    1. Make yourself familiar with the course material. We've included speaker notes
    on some of the pages to help highlight the key points (please help us by
    contributing more speaker notes!). You should make sure to open the speaker
    notes in a popup (click the link with a little arrow next to "Speaker
    Notes"). This way you have a clean screen to present to the class.
  msgstr: 1. 강의 자료를 숙지합니다. 주요 요점을 강조하기 위해 일부 페이지에 강의 참조노트를 포함하였습니다. (추가적인 노트를 작성하여 제공해 주시면 감사하겠습니다.) 강의 참조 노트의 링크를 누르면 강의노트가 별도의 팝업으로 분리가 되며, 메인 화면에서는 사라집니다.

As you can see, multi-line inputs end up as multi-line literal blocks in the YAML file, ready to be edited using your favorite tool 😄 If you think this is useful, then we can probably put it somewhere. |
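A rough sketch of what the PO-to-YAML direction of such a tool could look like. This is not the tool from the comment above: `yaml_scalar` and `to_yaml` are hypothetical names, the YAML is emitted by hand rather than with a YAML library, and it assumes messages have no trailing newline and no leading spaces on their first line.

```python
def yaml_scalar(text: str, indent: str) -> str:
    """Render text as a literal block if it spans several lines, else single-quoted."""
    if "\n" in text:
        body = "\n".join(indent + line if line else "" for line in text.split("\n"))
        return "|-\n" + body
    # Inside single quotes, YAML escapes a quote by doubling it.
    return "'" + text.replace("'", "''") + "'"


def to_yaml(entries: list[tuple[str, str]]) -> str:
    """Emit one list item per (msgid, msgstr) pair."""
    chunks = []
    for msgid, msgstr in entries:
        chunks.append(f"- msgid: {yaml_scalar(msgid, '    ')}")
        chunks.append(f"  msgstr: {yaml_scalar(msgstr, '    ')}")
    return "\n".join(chunks) + "\n"


entries = [
    ("# Running the Course", "# 강의 진행 방식"),
    ("First line of a paragraph\nsecond line of the paragraph.", "번역"),
]
print(to_yaml(entries))
```

Unlike the output shown above, this sketch always quotes single-line values, which is still valid YAML. A real tool would more likely read and write both formats with existing libraries.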
@mgeisler Yes, that looks great. I'd use it. One question though: which file will be the source of truth? yaml, or po? |
I was thinking that you would generate the YAML file whenever you want locally and then export back to We would need the YAML-to-PO conversion as well, but that should be trivial — we need the |
ack! |
Right now, we simply split the text on `\n\n+`, but this leads to a number of problems:

In general, it would be awesome if we could strip `#` from headers and `*` from bullet points.

So Markdown like

should result in these messages

* `This is a heading` (heading type is stripped)
* `A _little_ paragraph.` (softwrapped lines are unfolded)
* `fn main() {\n println!("Hello world!");\n}` (info string is stripped)
* `First` (bullet point extracted individually)
* `Second`

You could imagine doing something nice with links too: `foo [bar](https://example.net) baz` could be stored as `foo [bar] baz`. This might be a poor idea, though: it means that the translator cannot change the destination URL.
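To make the wished-for behaviour concrete, here is a minimal line-based sketch that produces the messages listed above. It is an illustration only: `extract_messages` is a hypothetical function, the sample input is an assumed reconstruction of the Markdown in the issue, and a real implementation would presumably build on a proper Markdown parser, as suggested in the comments. Links are left untouched.

```python
import re

FENCE = "`" * 3  # written this way so the example can itself sit in a fenced block


def extract_messages(markdown: str) -> list[str]:
    """Extract translatable messages with Markdown markers stripped."""
    messages = []
    paragraph = []   # lines of the paragraph currently being collected
    code = []        # lines of the currently open fenced code block
    in_code = False

    def flush():
        # Unfold soft-wrapped lines into a single message.
        if paragraph:
            messages.append(" ".join(paragraph))
            paragraph.clear()

    for line in markdown.splitlines():
        if in_code:
            if line.startswith(FENCE):
                messages.append("\n".join(code))  # the code block is one unit
                in_code = False
            else:
                code.append(line)
            continue
        if line.startswith(FENCE):  # opening fence: the info string is dropped
            flush()
            in_code, code = True, []
            continue
        stripped = line.strip()
        if not stripped:
            flush()
            continue
        marker = re.match(r"(?:#{1,6}|[*+-])\s+(.*)", stripped)
        if marker:  # heading or bullet: keep only the text after the marker
            flush()
            messages.append(marker.group(1))
        else:
            paragraph.append(stripped)
    flush()
    return messages


sample = "\n".join([
    "# This is a heading",
    "",
    "A _little_",
    "paragraph.",
    "",
    FENCE + "rust,editable",
    "fn main() {",
    '    println!("Hello world!");',
    "}",
    FENCE,
    "",
    "* First",
    "* Second",
])
for message in extract_messages(sample):
    print(repr(message))
```

Unlike splitting on blank lines, a code block that contains blank lines stays together as one message here.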