Converting kyujitai #604
Comments
Oh, nice idea. I'm not very familiar with kyujitai. Would #66 help at all with this?
Yes, I believe so! JMdict often doesn't include kyujitai forms, so if Rikai could do that, it'd be swell.
There is a limited set of old character forms, so it should be possible to make a hard-coded look-up table. https://en.wikipedia.org/wiki/Ky%C5%ABjitai#Ky%C5%ABjitai_vs._Shinjitai
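As a rough illustration of the hard-coded look-up idea, here is a minimal sketch. The table name `KYUJITAI_TO_SHINJITAI` and the function `toShinjitai` are hypothetical, and the four pairs are only a sample; a real table would be built from the full list of old/new pairs.

```typescript
// Hypothetical, deliberately tiny sample table (old form -> new form).
// A real table would contain a few hundred pairs.
const KYUJITAI_TO_SHINJITAI: ReadonlyMap<string, string> = new Map([
  ['國', '国'],
  ['學', '学'],
  ['戶', '戸'],
  ['內', '内'],
]);

// Replace every character that has an entry in the table and leave
// everything else untouched. Array.from iterates by code point, so
// characters outside the BMP are not split into surrogate halves.
function toShinjitai(text: string): string {
  return Array.from(text, (ch) => KYUJITAI_TO_SHINJITAI.get(ch) ?? ch).join('');
}

console.log(toShinjitai('國學'));
```

Because the mapping is one character to one character, the converted string keeps the same character count, which makes it easy to map match positions back onto the original text.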
That looks very tractable. I'll try to get to it next week. Thank you!
The statement "In particular, all Unicode normalization methods merge the old characters with the new ones." sounds especially interesting. I haven't managed to trigger this automatic conversion in JS, but I only made a naive attempt and probably did something wrong, because I don't know enough about how Unicode normalization works.
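One possible reason a naive attempt shows no effect: normalization only folds characters encoded as CJK compatibility ideographs onto their unified counterparts. Old forms that have their own unified code point, such as 國 (U+570B), are left alone. The sketch below uses the standard `String.prototype.normalize`; the sample characters are chosen only for illustration.

```typescript
// U+FA30 is a CJK compatibility ideograph: the old form of 侮 (U+4FAE).
const compatForm = '\uFA30';
const normalizedCompat = compatForm.normalize('NFC');

// 國 (U+570B) is an ordinary unified ideograph, not a compatibility one.
const unifiedOldForm = '\u570B';
const normalizedUnified = unifiedOldForm.normalize('NFC');

// Helper to make the otherwise near-identical glyphs distinguishable.
function codePoints(s: string): string[] {
  return Array.from(s).map(
    (c) => 'U+' + c.codePointAt(0)!.toString(16).toUpperCase()
  );
}

console.log(codePoints(compatForm), '->', codePoints(normalizedCompat));
console.log(codePoints(unifiedOldForm), '->', codePoints(normalizedUnified));
```

So normalization can cover the compatibility-ideograph subset, but a look-up table is still needed for pairs like 國 / 国.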
I dug into this. After extracting the various kyuujitai from the Wikipedia article and removing duplicates, there are 418 pairs remaining. A few of those are represented as Unicode variation sequences. For example, 逸︁ is simply 逸 followed by a variation selector. Unicode suggests it might be appropriate to drop variation selectors when searching and, from checking a dump of the JMdict words dictionary, I don't see any occurrences of kyuujitai using variation selectors.
Ignoring variation sequences, there are 351 pairs of kyuujitai / shinjitai remaining.
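Dropping variation selectors before a search could look roughly like the sketch below. The function name `stripVariationSelectors` is hypothetical; the ranges are the two Unicode variation selector blocks, U+FE00 to U+FE0F and U+E0100 to U+E01EF.

```typescript
// Matches the Variation Selectors block (U+FE00-U+FE0F) and the
// Variation Selectors Supplement block (U+E0100-U+E01EF).
// The `u` flag is needed so that \u{...} escapes work.
const VARIATION_SELECTORS = /[\uFE00-\uFE0F\u{E0100}-\u{E01EF}]/gu;

// Remove all variation selectors, leaving only the base characters.
function stripVariationSelectors(text: string): string {
  return text.replace(VARIATION_SELECTORS, '');
}

// 逸 (U+9038) followed by VS2 (U+FE01) collapses to plain 逸.
console.log(stripVariationSelectors('\u9038\uFE01') === '\u9038');
```

Note that this also strips U+FE0E and U+FE0F (the text/emoji presentation selectors), which is harmless for dictionary look-up but worth knowing if the result is displayed.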
Quite a few of these occur in JMdict, which typically includes both the kyuujitai and the shinjitai as headwords. Therefore it seems better to first look up the original input string and then, if it contains any kyuujitai, try again with all kyuujitai replaced with shinjitai and use that result instead if it gives a longer match. I've done that in 4acdd7d. Do you have any texts I can test it out on?
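The two-pass look-up described above can be sketched as follows. This is not the code from the referenced commit; the toy headword set, the two-entry conversion table, and all function names are assumptions made for illustration.

```typescript
// Hypothetical stand-ins: a toy headword set instead of JMdict and a
// two-entry conversion table instead of the full list of pairs.
const TOY_HEADWORDS: ReadonlySet<string> = new Set(['国学', '學', '学']);
const OLD_TO_NEW: ReadonlyMap<string, string> = new Map([
  ['國', '国'],
  ['學', '学'],
]);

function replaceKyujitai(text: string): string {
  return Array.from(text, (ch) => OLD_TO_NEW.get(ch) ?? ch).join('');
}

// Longest headword that is a prefix of `text`, or null if there is none.
function longestPrefixMatch(text: string): string | null {
  const chars = Array.from(text);
  for (let len = chars.length; len > 0; len--) {
    const candidate = chars.slice(0, len).join('');
    if (TOY_HEADWORDS.has(candidate)) return candidate;
  }
  return null;
}

interface LookupResult {
  match: string | null;
  usedConversion: boolean;
}

function lookupWithFallback(input: string): LookupResult {
  // Pass 1: look up the input exactly as written.
  const original = longestPrefixMatch(input);

  // Pass 2 only runs if the input actually contains old forms.
  const convertedInput = replaceKyujitai(input);
  if (convertedInput === input) return { match: original, usedConversion: false };
  const converted = longestPrefixMatch(convertedInput);

  const originalLength = original === null ? 0 : Array.from(original).length;
  const convertedLength = converted === null ? 0 : Array.from(converted).length;

  // Prefer the converted result only when it is strictly longer, so that
  // on a tie the original (old-form) headword is the one reported.
  return convertedLength > originalLength
    ? { match: converted, usedConversion: true }
    : { match: original, usedConversion: false };
}
```

With this toy data, `lookupWithFallback('國學を')` falls back to the converted text and matches 国学, while `lookupWithFallback('學ぶ')` keeps the original 學 because both passes find a match of the same length.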
Could there be any cases where this swallows matches or loses detail?
I'm not sure. I guess the case we'd be concerned about is if JMdict has entries such as Entry A: ○○薗○ (but NO ○○園○) and Entry B: ○○園○○, and the input text was ○○薗○○. In that case we'd say that the 新字体 version produced a longer maximum match, so we'd display the entries that match on it. As a result we'd show entry B and not entry A, which would possibly be a regression. However, I don't think JMdict ever has entries with 旧字体 headwords where the 新字体 headword is not also present (although I believe the opposite is common), so maybe it's OK? As for losing detail, if the matches we find using the 旧字体 have the same maximum length as the converted 新字体 version, we'll stick with the original 旧字体 match so that the 旧字体 headword is highlighted in the pop-up.
Yeah, that's what I was thinking about. You're probably right, it should be rather rare.
Great, thanks for checking. I'll close this out for now then.
FYI, I just noticed 戶 doesn't get parsed as 戸.
As reported here birchill/10ten-ja-reader#604 (comment)
Thanks! Looks like that one's not in the list at https://en.wikipedia.org/wiki/Kyūjitai. I've updated the library we use for this upstream, so this should be fixed by the next release.
You might also want to add 內, which became 内. See https://en.wiktionary.org/wiki/%E5%BA%84%E5%86%85
As reported here: birchill/10ten-ja-reader#604 (comment)
Thanks! I've added that one too now.
It'd be cool if Rikai could automatically convert kyujitai into shinjitai and parse words from old documents.