Converting kyujitai #604

Closed
nicolasmaia opened this issue May 19, 2021 · 14 comments

Comments

@nicolasmaia

It'd be cool if Rikai could automatically convert kyujitai into shinjitai and parse words from old documents.

@birtles
Member

birtles commented May 20, 2021

Oh nice idea. I'm not very familiar with kyujitai. Would #66 help at all with this?

@nicolasmaia
Author

Yes, I believe so!

JMdict often doesn't include kyujitai forms, so if Rikai could do that, it'd be swell.

@Tomalak

Tomalak commented May 21, 2021

There is a limited set of old character forms, it should be possible to make a hard-coded look-up.

https://en.wikipedia.org/wiki/Ky%C5%ABjitai#Ky%C5%ABjitai_vs._Shinjitai

@birtles
Member

birtles commented May 22, 2021

> There is a limited set of old character forms, it should be possible to make a hard-coded look-up.
>
> https://en.wikipedia.org/wiki/Ky%C5%ABjitai#Ky%C5%ABjitai_vs._Shinjitai

That looks very tractable. I'll try to get to it next week. Thank you!

@Tomalak

Tomalak commented May 22, 2021

Especially the statement "In particular, all Unicode normalization methods merge the old characters with the new ones." sounds interesting. I've not managed to trigger this automatic conversion in JS, but I only gave it a naive attempt and probably did something wrong because I don't know enough about how Unicode normalization works.
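A quick check in JavaScript may explain why a naive attempt appears to do nothing. As far as Unicode itself goes, normalization only folds CJK compatibility ideographs (for example U+FA19, an old form of 神) into their unified counterparts. Old forms that have their own unified code point, such as 國 (U+570B), have no decomposition and pass through unchanged. The sample characters below are chosen for illustration and are not taken from the thread.

```javascript
// U+FA19 is a CJK compatibility ideograph whose canonical
// decomposition is U+795E (神), so every normalization form maps it.
const compat = '\uFA19';
console.log(compat.normalize('NFC') === '\u795E');

// 國 (U+570B) and 国 (U+56FD) are separate unified ideographs.
// Neither has a decomposition, so normalization leaves 國 as it is.
const oldForm = '\u570B';
console.log(oldForm.normalize('NFKC') === oldForm);
```

So normalization alone would cover only a small part of the mapping, and a look-up table is still needed for the rest.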

birtles added a commit that referenced this issue May 25, 2021
@birtles
Member

birtles commented May 25, 2021

I dug into this and after extracting the various kyuujitai from the Wikipedia article and removing duplicates there are 418 pairs remaining.

Of those, 67 are represented as Unicode variation sequences.

For example 逸︁ is simply 0x9038 (逸) followed by 0xfe01. Javascript APIs like '逸︁'.length or even the more Unicode-aware [...'逸︁'].length will return a length of 2 and '逸︁'.codePointAt(0) and '逸︁'.codePointAt(1) will return 36920 (0x9038) and 65025 (0xfe01) respectively.

Unicode suggests it might be appropriate to drop variation selectors when searching, and from checking a dump of the JMdict words dictionary, I don't see any occurrences of 0xfe00 or 0xfe01 there, so I've updated the normalization routine to simply drop these characters.
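As a rough illustration of that step, here is a minimal sketch. The function name `stripVariationSelectors` is made up for this example and is not the routine from the commit; it removes only U+FE00 and U+FE01, the two selectors mentioned above.

```javascript
// Hypothetical helper: the real normalization routine may differ.
// Drops only U+FE00 and U+FE01, leaving every other character alone.
function stripVariationSelectors(text) {
  return text.replace(/[\uFE00\uFE01]/g, '');
}

// 逸 (U+9038) followed by U+FE01, written with escapes so the
// invisible selector cannot get lost when copying the example.
const withSelector = '\u9038\uFE01';
console.log(withSelector.length); // 2
console.log(stripVariationSelectors(withSelector).length); // 1
console.log(stripVariationSelectors(withSelector) === '\u9038'); // true
```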

Kyuujitai using variation selectors
逸︁ = 9038+fe01
謁︀ = 8b01+fe00
禍︀ = 798d+fe00
悔︀ = 6094+fe00
海︀ = 6d77+fe00
慨︀ = 6168+fe00
喝︀ = 559d+fe00
褐︀ = 8910+fe00
漢︀ = 6f22+fe00
器︀ = 5668+fe00
既︀ = 65e2+fe00
祈︀ = 7948+fe00
響︀ = 97ff+fe00
勤︀ = 52e4+fe00
謹︀ = 8b39+fe00
穀︀ = 7a40+fe00
殺︀ = 6bba+fe00
祉︀ = 7949+fe00
視︀ = 8996+fe00
煮︀ = 716e+fe00
社︀ = 793e+fe00
者︀ = 8005+fe00
臭︀ = 81ed+fe00
祝︀ = 795d+fe00
暑︀ = 6691+fe00
署︀ = 7f72+fe00
諸︀ = 8af8+fe00
祥︀ = 7965+fe00
神︀ = 795e+fe00
節︀ = 7bc0+fe00
祖︀ = 7956+fe00
僧︀ = 50e7+fe00
層︀ = 5c64+fe00
憎︀ = 618e+fe00
贈︀ = 8d08+fe00
嘆︀ = 5606+fe00
著︀ = 8457+fe00
懲︀ = 61f2+fe00
塚︀ = 585a+fe00
都︀ = 90fd+fe00
突︀ = 7a81+fe00
難︀ = 96e3+fe00
梅︀ = 6885+fe00
繁︀ = 7e41+fe00
卑︀ = 5351+fe00
碑︀ = 7891+fe00
賓︀ = 8cd3+fe00
頻︀ = 983b+fe00
敏︀ = 654f+fe00
侮︀ = 4fae+fe00
福︀ = 798f+fe00
塀︀ = 5840+fe00
勉︀ = 52c9+fe00
墨︀ = 58a8+fe00
免︀ = 514d+fe00
欄︀ = 6b04+fe00
隆︀ = 9686+fe00
虜︀ = 865c+fe00
類︀ = 985e+fe00
練︁ = 7df4+fe01
廊︀ = 5eca+fe00
朗︀ = 6717+fe00
渚︀ = 6e1a+fe00
猪︀ = 732a+fe00
琢︀ = 7422+fe00
祐︀ = 7950+fe00
禎︀ = 798e+fe00

Ignoring variation sequences, there are 351 pairs of kyuujitai / shinjitai remaining.

Kyuujitai without variation selectors
薗 → 園
駈 → 駆
曾 → 曽
瀧 → 滝
嶋 → 島
燈 → 灯
埜 → 野
盃 → 杯
冨 → 富
峯 → 峰
龍 → 竜
乘 → 乗
亂 → 乱
豫 → 予
亞 → 亜
佛 → 仏
來 → 来
假 → 仮
會 → 会
傳 → 伝
僞 → 偽
價 → 価
儉 → 倹
兒 → 児
兩 → 両
凉 → 涼
處 → 処
剩 → 剰
劍 → 剣
劑 → 剤
辨 → 弁
瓣 → 弁
辯 → 弁
勞 → 労
勳 → 勲
勵 → 励
勸 → 勧
區 → 区
卷 → 巻
參 → 参
雙 → 双
單 → 単
營 → 営
嚴 → 厳
囑 → 嘱
圈 → 圏
國 → 国
圍 → 囲
圓 → 円
團 → 団
圖 → 図
壞 → 壊
墮 → 堕
壓 → 圧
壘 → 塁
壤 → 壌
壯 → 壮
壹 → 壱
壽 → 寿
奧 → 奥
奬 → 奨
孃 → 嬢
學 → 学
實 → 実
寢 → 寝
寫 → 写
寶 → 宝
將 → 将
專 → 専
對 → 対
屆 → 届
屬 → 属
峽 → 峡
嶽 → 岳
帶 → 帯
廣 → 広
廢 → 廃
廳 → 庁
彈 → 弾
彌 → 弥
徑 → 径
從 → 従
恆 → 恒
惡 → 悪
惠 → 恵
惱 → 悩
愼 → 慎
慘 → 惨
應 → 応
懷 → 懐
戀 → 恋
戰 → 戦
戲 → 戯
拔 → 抜
擔 → 担
拜 → 拝
拂 → 払
挾 → 挟
搜 → 捜
插 → 挿
搖 → 揺
攝 → 摂
據 → 拠
擇 → 択
擧 → 挙
擴 → 拡
收 → 収
效 → 効
敕 → 勅
敍 → 叙
數 → 数
變 → 変
斷 → 断
晝 → 昼
曉 → 暁
霸 → 覇
條 → 条
棧 → 桟
榮 → 栄
樂 → 楽
權 → 権
樞 → 枢
樣 → 様
樓 → 楼
檢 → 検
櫻 → 桜
盜 → 盗
歐 → 欧
歡 → 歓
歸 → 帰
殘 → 残
殼 → 殻
毆 → 殴
氣 → 気
淨 → 浄
淺 → 浅
滿 → 満
溪 → 渓
滯 → 滞
澁 → 渋
潛 → 潜
澤 → 沢
濟 → 済
濕 → 湿
濱 → 浜
灣 → 湾
燒 → 焼
爐 → 炉
爭 → 争
爲 → 為
犧 → 犠
狹 → 狭
獎 → 奨
默 → 黙
獨 → 独
獸 → 獣
獵 → 猟
獻 → 献
畫 → 画
當 → 当
疊 → 畳
癡 → 痴
發 → 発
盡 → 尽
眞 → 真
碎 → 砕
祕 → 秘
齋 → 斎
禪 → 禅
禮 → 礼
稱 → 称
稻 → 稲
穗 → 穂
穩 → 穏
竊 → 窃
竝 → 並
粹 → 粋
絲 → 糸
經 → 経
總 → 総
縣 → 県
縱 → 縦
繪 → 絵
繩 → 縄
繼 → 継
續 → 続
纖 → 繊
缺 → 欠
罐 → 缶
飜 → 翻
聲 → 声
聽 → 聴
肅 → 粛
腦 → 脳
膽 → 胆
臟 → 臓
臺 → 台
與 → 与
舊 → 旧
艷 → 艶
莖 → 茎
莊 → 荘
萬 → 万
藏 → 蔵
藝 → 芸
藥 → 薬
號 → 号
螢 → 蛍
蟲 → 虫
蠶 → 蚕
蠻 → 蛮
衞 → 衛
裝 → 装
襃 → 褒
覺 → 覚
覽 → 覧
觀 → 観
觸 → 触
謠 → 謡
證 → 証
譯 → 訳
譽 → 誉
讀 → 読
讓 → 譲
豐 → 豊
貳 → 弐
賣 → 売
贊 → 賛
踐 → 践
輕 → 軽
轉 → 転
辭 → 辞
遞 → 逓
隨 → 随
遲 → 遅
邊 → 辺
醉 → 酔
醫 → 医
釀 → 醸
釋 → 釈
錢 → 銭
鎭 → 鎮
鐵 → 鉄
鑄 → 鋳
鑛 → 鉱
關 → 関
陷 → 陥
險 → 険
隱 → 隠
雜 → 雑
靈 → 霊
靜 → 静
顯 → 顕
餘 → 余
餠 → 餅
騷 → 騒
驅 → 駆
驛 → 駅
驗 → 験
髓 → 髄
體 → 体
髮 → 髪
鷄 → 鶏
鹽 → 塩
麥 → 麦
點 → 点
黨 → 党
齊 → 斉
齒 → 歯
齡 → 齢
龜 → 亀
增 → 増
寬 → 寛
德 → 徳
橫 → 横
瀨 → 瀬
甁 → 瓶
綠 → 緑
緖 → 緒
薰 → 薫
賴 → 頼
郞 → 郎
鄕 → 郷
黑 → 黒
倂 → 併
卽 → 即
巢 → 巣
徵 → 徴
戾 → 戻
揭 → 掲
擊 → 撃
晚 → 晩
曆 → 暦
槪 → 概
步 → 歩
歷 → 歴
每 → 毎
涉 → 渉
淚 → 涙
渴 → 渇
溫 → 温
狀 → 状
瘦 → 痩
硏 → 研
緣 → 縁
虛 → 虚
錄 → 録
鍊 → 錬
鬭 → 闘
麵 → 麺
黃 → 黄
亙 → 亘
凛 → 凜
堯 → 尭
巖 → 巌
晄 → 晃
檜 → 桧
槇 → 槙
禰 → 祢
禱 → 祷
祿 → 禄
穰 → 穣
萠 → 萌
遙 → 遥
豔 → 艶
啞 → 唖
穎 → 頴
鷗 → 鴎
軀 → 躯
攪 → 撹
麴 → 麹
鹼 → 鹸
嚙 → 噛
繡 → 繍
蔣 → 蒋
醬 → 醤
搔 → 掻
屛 → 屏
幷 → 并
濾 → 沪
蘆 → 芦
蠟 → 蝋
彎 → 弯
焰 → 焔
礦 → 砿
讚 → 讃
顚 → 顛
醱 → 醗
潑 → 溌
輛 → 輌
繫 → 繋

Quite a few of these occur in JMdict, which typically includes both the kyuujitai and the shinjitai as headwords.

Therefore it seems better to first try looking up using the original input string and then, if there are kyuujitai found, trying again with all kyuujitai replaced with shinjitai and, if we get a longer result, using that instead.
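A minimal sketch of that two-pass idea, under some loud assumptions: the table is a four-entry excerpt of the list above, `longestMatch` is a toy stand-in for the real dictionary lookup, and all names (`toShinjitai`, `longestMatch`, `lookup`) are invented for this example rather than taken from the commit.

```javascript
// Small excerpt of the kyuujitai → shinjitai pairs listed above.
const KYUUJITAI_TO_SHINJITAI = new Map([
  ['國', '国'],
  ['學', '学'],
  ['體', '体'],
  ['薗', '園'],
]);

// Replace every known kyuujitai with its shinjitai, character by character.
function toShinjitai(text) {
  return [...text].map((ch) => KYUUJITAI_TO_SHINJITAI.get(ch) ?? ch).join('');
}

// Toy dictionary lookup (an assumption): the longest prefix of `text`
// that appears in the `headwords` Set, or '' if nothing matches.
function longestMatch(text, headwords) {
  for (let len = text.length; len > 0; len--) {
    const candidate = text.slice(0, len);
    if (headwords.has(candidate)) return candidate;
  }
  return '';
}

// First try the input as written. Only if it contains kyuujitai, try
// again with them replaced, and use that result only if it is longer.
function lookup(text, headwords) {
  const original = longestMatch(text, headwords);
  const converted = toShinjitai(text);
  if (converted === text) return original;
  const fallback = longestMatch(converted, headwords);
  return fallback.length > original.length ? fallback : original;
}

const headwords = new Set(['国', '国語', '學', '学']);
console.log(lookup('國語を', headwords)); // 国語
console.log(lookup('學', headwords)); // 學
```

In the second call both passes match one character, so the original kyuujitai headword is kept.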

I've done that in 4acdd7d.

Do you have any texts I can test it out on?

@Tomalak

Tomalak commented May 25, 2021

> if there are kyuujitai found, trying again with all kyuujitai replaced with shinjitai and, if we get a longer result, using that instead.

Could there be any cases where this swallows matches/loses detail?

@birtles
Member

birtles commented May 25, 2021

> Could there be any cases where this swallows matches/loses detail?

I'm not sure. I guess the case we'd be concerned about is if JMdict has entries such as:

Entry A: ○○薗○ (but NO ○○園○)
Entry B: ○○園○○

and the input text was ○○薗○○.

In that case we'd say that the 新字体 version produced a longer maximum match so we'd display the entries that match on it. As a result we'd show entry B and not entry A so that would possibly be a regression. However, I don't think JMdict ever has entries with 旧字体 headwords where the 新字体 headword is not also present (although I believe the opposite is common) so maybe it's ok?

As for losing detail, if the matches we find using the 旧字体 have the same maximum length as the converted 新字体 version, we'll stick with the original 旧字体 match so that in the pop-up the 旧字体 headword will be highlighted.
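The two cases can be worked through with just the selection rule. This is a sketch only: `pickMatch` is a hypothetical name, and it takes the two already-found matches as plain strings instead of doing a dictionary lookup.

```javascript
// Hypothetical selection rule: keep the match on the original text
// unless the match on the converted text is strictly longer.
function pickMatch(originalMatch, convertedMatch) {
  return convertedMatch.length > originalMatch.length
    ? convertedMatch
    : originalMatch;
}

// Possible regression: input ○○薗○○ matches entry A (○○薗○) as written
// and entry B (○○園○○) after conversion. B is longer, so A is not shown.
console.log(pickMatch('○○薗○', '○○園○○')); // ○○園○○

// Tie: both forms are headwords of equal length, so the 旧字体 match
// is kept and that headword gets highlighted.
console.log(pickMatch('國語', '国語')); // 國語
```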

@Tomalak

Tomalak commented May 26, 2021

Yeah that's what I was thinking about. You're probably right, it should be rather rare.

@birtles
Member

birtles commented May 26, 2021

Great, thanks for checking. I'll close this out for now then.

@birtles birtles closed this as completed May 26, 2021
@nicolasmaia
Author

FYI, I just noticed 戶 doesn't get parsed as 戸.

Cf. https://en.wiktionary.org/wiki/%E6%88%B6#Japanese

birtles added a commit to birchill/normal-jp that referenced this issue Oct 3, 2024
birtles added a commit to birchill/normal-jp that referenced this issue Oct 3, 2024
@birtles
Member

birtles commented Oct 3, 2024

> FYI, I just noticed 戶 doesn't get parsed as 戸.
>
> Cf. https://en.wiktionary.org/wiki/%E6%88%B6#Japanese

Thanks! Looks like that one's not in the list at https://en.wikipedia.org/wiki/Kyūjitai

I've updated the library we use for this upstream so this should be fixed by the next release.

@nicolasmaia
Author

You might also want to add 內, which became 内. See https://en.wiktionary.org/wiki/%E5%BA%84%E5%86%85

birtles added a commit to birchill/normal-jp that referenced this issue Oct 4, 2024
@birtles
Member

birtles commented Oct 4, 2024

> You might also want to add 內, which became 内. See https://en.wiktionary.org/wiki/%E5%BA%84%E5%86%85

Thanks! I've added that one too now.


3 participants