Converting kyujitai #604

nicolasmaia · 2021-05-19T03:54:41Z

It'd be cool if Rikai could automatically convert kyujitai into shinjitai and parse words from old documents.

birtles · 2021-05-20T01:19:46Z

Oh nice idea. I'm not very familiar with kyujitai. Would #66 help at all with this?

nicolasmaia · 2021-05-20T03:04:49Z

Yes, I believe so!

JMdict often doesn't include kyujitai forms, so if Rikai could do that, it'd be swell.

Tomalak · 2021-05-21T08:39:11Z

There is a limited set of old character forms, it should be possible to make a hard-coded look-up.

https://en.wikipedia.org/wiki/Ky%C5%ABjitai#Ky%C5%ABjitai_vs._Shinjitai
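A hard-coded look-up of this kind could be as small as a map from old form to new form. The following is only a minimal sketch: the names `KYUJITAI_TO_SHINJITAI` and `toShinjitai` are made up for illustration, and the map holds just four well-known pairs rather than the full table.

```javascript
// Illustrative subset only; a real table would hold every pair.
const KYUJITAI_TO_SHINJITAI = new Map([
  ['國', '国'],
  ['學', '学'],
  ['體', '体'],
  ['龍', '竜'],
]);

// Hypothetical helper: swaps each known old form for its new form and
// leaves every other character untouched.
function toShinjitai(text) {
  // Spread iterates by code point, so surrogate pairs are not split.
  return [...text]
    .map((ch) => KYUJITAI_TO_SHINJITAI.get(ch) ?? ch)
    .join('');
}

const converted = toShinjitai('國學の本');
console.log(converted); // 国学の本
```

Because the replacement is one character for one character, the converted string keeps the same length as the input, which makes it easy to map a match back onto the original text.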

birtles · 2021-05-22T05:21:23Z

> There is a limited set of old character forms, it should be possible to make a hard-coded look-up.
>
> https://en.wikipedia.org/wiki/Ky%C5%ABjitai#Ky%C5%ABjitai_vs._Shinjitai

That looks very tractable. I'll try to get to it next week. Thank you!

Tomalak · 2021-05-22T08:41:42Z

Especially the statement "In particular, all Unicode normalization methods merge the old characters with the new ones." sounds interesting. I've not managed to trigger this automatic conversion in JS, but I only gave it a naive attempt and probably did something wrong because I don't know enough about how Unicode normalization works.
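One likely reason a naive attempt shows nothing: normalization only folds the CJK compatibility ideographs (the U+F900 block) into their unified counterparts. Old forms that have their own unified code point, such as 國, have no decomposition mapping and pass through unchanged. A small sketch using the standard `String.prototype.normalize`; the variable names are just for illustration.

```javascript
// U+FA19 is the compatibility ideograph for the old form of 神 (U+795E).
const compat = '\uFA19';
const forms = ['NFC', 'NFD', 'NFKC', 'NFKD'];

// Every normalization form maps the compatibility ideograph to U+795E.
const results = forms.map((form) => compat.normalize(form));
const allMerged = results.every((r) => r === '\u795E');

// 國 (U+570B) is a separate unified ideograph, so normalization
// does not turn it into 国 (U+56FD).
const unchanged = '國'.normalize('NFKC') === '國';

console.log(allMerged, unchanged); // true true
```

So normalization helps for the compatibility characters but cannot replace a look-up table for pairs like 國 / 国.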


birtles · 2021-05-25T03:34:12Z

I dug into this and, after extracting the various kyuujitai from the Wikipedia article and removing duplicates, there are 418 pairs remaining.

Of those, 67 are represented as Unicode variation sequences.

For example, 逸︁ is simply 0x9038 (逸) followed by 0xfe01. JavaScript APIs like '逸︁'.length, or even the more Unicode-aware [...'逸︁'].length, return a length of 2, and '逸︁'.codePointAt(0) and '逸︁'.codePointAt(1) return 36920 (0x9038) and 65025 (0xfe01) respectively.

Unicode suggests it might be appropriate to drop variation selectors when searching. Having checked a dump of the JMdict words dictionary, I don't see any occurrences of 0xfe00 or 0xfe01 there, so I've updated the normalization routine to simply drop these characters.
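As a rough illustration of that step (not the actual code from the normalization routine), dropping the two selectors mentioned above can be done with a single regular expression. The names `VARIATION_SELECTORS` and `stripVariationSelectors` are hypothetical.

```javascript
// Matches only VS1 (U+FE00) and VS2 (U+FE01), the two selectors
// discussed here. Widening the range to U+FE0F would be an assumption.
const VARIATION_SELECTORS = /[\uFE00\uFE01]/g;

// Hypothetical helper: removes the selectors and keeps the base characters.
function stripVariationSelectors(text) {
  return text.replace(VARIATION_SELECTORS, '');
}

// 逸 (U+9038) followed by VS2, written with escapes so the invisible
// selector cannot get lost when copying the snippet.
const withSelector = '\u9038\uFE01';
const stripped = stripVariationSelectors(withSelector);

console.log(withSelector.length); // 2
console.log(stripped.length); // 1
console.log(stripped.codePointAt(0).toString(16)); // 9038
```

Note that stripping changes the string length, so any code that maps match offsets back onto the original text has to account for the removed code units.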

Kyuujitai using variation selectors

逸︁ = 9038+fe01
謁︀ = 8b01+fe00
禍︀ = 798d+fe00
悔︀ = 6094+fe00
海︀ = 6d77+fe00
慨︀ = 6168+fe00
喝︀ = 559d+fe00
褐︀ = 8910+fe00
漢︀ = 6f22+fe00
器︀ = 5668+fe00
既︀ = 65e2+fe00
祈︀ = 7948+fe00
響︀ = 97ff+fe00
勤︀ = 52e4+fe00
謹︀ = 8b39+fe00
穀︀ = 7a40+fe00
殺︀ = 6bba+fe00
祉︀ = 7949+fe00
視︀ = 8996+fe00
煮︀ = 716e+fe00
社︀ = 793e+fe00
者︀ = 8005+fe00
臭︀ = 81ed+fe00
祝︀ = 795d+fe00
暑︀ = 6691+fe00
署︀ = 7f72+fe00
諸︀ = 8af8+fe00
祥︀ = 7965+fe00
神︀ = 795e+fe00
節︀ = 7bc0+fe00
祖︀ = 7956+fe00
僧︀ = 50e7+fe00
層︀ = 5c64+fe00
憎︀ = 618e+fe00
贈︀ = 8d08+fe00
嘆︀ = 5606+fe00
著︀ = 8457+fe00
懲︀ = 61f2+fe00
塚︀ = 585a+fe00
都︀ = 90fd+fe00
突︀ = 7a81+fe00
難︀ = 96e3+fe00
梅︀ = 6885+fe00
繁︀ = 7e41+fe00
卑︀ = 5351+fe00
碑︀ = 7891+fe00
賓︀ = 8cd3+fe00
頻︀ = 983b+fe00
敏︀ = 654f+fe00
侮︀ = 4fae+fe00
福︀ = 798f+fe00
塀︀ = 5840+fe00
勉︀ = 52c9+fe00
墨︀ = 58a8+fe00
免︀ = 514d+fe00
欄︀ = 6b04+fe00
隆︀ = 9686+fe00
虜︀ = 865c+fe00
類︀ = 985e+fe00
練︁ = 7df4+fe01
廊︀ = 5eca+fe00
朗︀ = 6717+fe00
渚︀ = 6e1a+fe00
猪︀ = 732a+fe00
琢︀ = 7422+fe00
祐︀ = 7950+fe00
禎︀ = 798e+fe00

Ignoring variation sequences, there are 351 pairs of kyuujitai / shinjitai remaining.

Kyuujitai without variation selectors

薗 → 園
駈 → 駆
曾 → 曽
瀧 → 滝
嶋 → 島
燈 → 灯
埜 → 野
盃 → 杯
冨 → 富
峯 → 峰
龍 → 竜
乘 → 乗
亂 → 乱
豫 → 予
亞 → 亜
佛 → 仏
來 → 来
假 → 仮
會 → 会
傳 → 伝
僞 → 偽
價 → 価
儉 → 倹
兒 → 児
兩 → 両
凉 → 涼
處 → 処
剩 → 剰
劍 → 剣
劑 → 剤
辨 → 弁
瓣 → 弁
辯 → 弁
勞 → 労
勳 → 勲
勵 → 励
勸 → 勧
區 → 区
卷 → 巻
參 → 参
雙 → 双
單 → 単
營 → 営
嚴 → 厳
囑 → 嘱
圈 → 圏
國 → 国
圍 → 囲
圓 → 円
團 → 団
圖 → 図
壞 → 壊
墮 → 堕
壓 → 圧
壘 → 塁
壤 → 壌
壯 → 壮
壹 → 壱
壽 → 寿
奧 → 奥
奬 → 奨
孃 → 嬢
學 → 学
實 → 実
寢 → 寝
寫 → 写
寶 → 宝
將 → 将
專 → 専
對 → 対
屆 → 届
屬 → 属
峽 → 峡
嶽 → 岳
帶 → 帯
廣 → 広
廢 → 廃
廳 → 庁
彈 → 弾
彌 → 弥
徑 → 径
從 → 従
恆 → 恒
惡 → 悪
惠 → 恵
惱 → 悩
愼 → 慎
慘 → 惨
應 → 応
懷 → 懐
戀 → 恋
戰 → 戦
戲 → 戯
拔 → 抜
擔 → 担
拜 → 拝
拂 → 払
挾 → 挟
搜 → 捜
插 → 挿
搖 → 揺
攝 → 摂
據 → 拠
擇 → 択
擧 → 挙
擴 → 拡
收 → 収
效 → 効
敕 → 勅
敍 → 叙
數 → 数
變 → 変
斷 → 断
晝 → 昼
曉 → 暁
霸 → 覇
條 → 条
棧 → 桟
榮 → 栄
樂 → 楽
權 → 権
樞 → 枢
樣 → 様
樓 → 楼
檢 → 検
櫻 → 桜
盜 → 盗
歐 → 欧
歡 → 歓
歸 → 帰
殘 → 残
殼 → 殻
毆 → 殴
氣 → 気
淨 → 浄
淺 → 浅
滿 → 満
溪 → 渓
滯 → 滞
澁 → 渋
潛 → 潜
澤 → 沢
濟 → 済
濕 → 湿
濱 → 浜
灣 → 湾
燒 → 焼
爐 → 炉
爭 → 争
爲 → 為
犧 → 犠
狹 → 狭
獎 → 奨
默 → 黙
獨 → 独
獸 → 獣
獵 → 猟
獻 → 献
畫 → 画
當 → 当
疊 → 畳
癡 → 痴
發 → 発
盡 → 尽
眞 → 真
碎 → 砕
祕 → 秘
齋 → 斎
禪 → 禅
禮 → 礼
稱 → 称
稻 → 稲
穗 → 穂
穩 → 穏
竊 → 窃
竝 → 並
粹 → 粋
絲 → 糸
經 → 経
總 → 総
縣 → 県
縱 → 縦
繪 → 絵
繩 → 縄
繼 → 継
續 → 続
纖 → 繊
缺 → 欠
罐 → 缶
飜 → 翻
聲 → 声
聽 → 聴
肅 → 粛
腦 → 脳
膽 → 胆
臟 → 臓
臺 → 台
與 → 与
舊 → 旧
艷 → 艶
莖 → 茎
莊 → 荘
萬 → 万
藏 → 蔵
藝 → 芸
藥 → 薬
號 → 号
螢 → 蛍
蟲 → 虫
蠶 → 蚕
蠻 → 蛮
衞 → 衛
裝 → 装
襃 → 褒
覺 → 覚
覽 → 覧
觀 → 観
觸 → 触
謠 → 謡
證 → 証
譯 → 訳
譽 → 誉
讀 → 読
讓 → 譲
豐 → 豊
貳 → 弐
賣 → 売
贊 → 賛
踐 → 践
輕 → 軽
轉 → 転
辭 → 辞
遞 → 逓
隨 → 随
遲 → 遅
邊 → 辺
醉 → 酔
醫 → 医
釀 → 醸
釋 → 釈
錢 → 銭
鎭 → 鎮
鐵 → 鉄
鑄 → 鋳
鑛 → 鉱
關 → 関
陷 → 陥
險 → 険
隱 → 隠
雜 → 雑
靈 → 霊
靜 → 静
顯 → 顕
餘 → 余
餠 → 餅
騷 → 騒
驅 → 駆
驛 → 駅
驗 → 験
髓 → 髄
體 → 体
髮 → 髪
鷄 → 鶏
鹽 → 塩
麥 → 麦
點 → 点
黨 → 党
齊 → 斉
齒 → 歯
齡 → 齢
龜 → 亀
增 → 増
寬 → 寛
德 → 徳
橫 → 横
瀨 → 瀬
甁 → 瓶
綠 → 緑
緖 → 緒
薰 → 薫
賴 → 頼
郞 → 郎
鄕 → 郷
黑 → 黒
倂 → 併
卽 → 即
巢 → 巣
徵 → 徴
戾 → 戻
揭 → 掲
擊 → 撃
晚 → 晩
曆 → 暦
槪 → 概
步 → 歩
歷 → 歴
每 → 毎
涉 → 渉
淚 → 涙
渴 → 渇
溫 → 温
狀 → 状
瘦 → 痩
硏 → 研
緣 → 縁
虛 → 虚
錄 → 録
鍊 → 錬
鬭 → 闘
麵 → 麺
黃 → 黄
亙 → 亘
凛 → 凜
堯 → 尭
巖 → 巌
晄 → 晃
檜 → 桧
槇 → 槙
禰 → 祢
禱 → 祷
祿 → 禄
穰 → 穣
萠 → 萌
遙 → 遥
豔 → 艶
啞 → 唖
穎 → 頴
鷗 → 鴎
軀 → 躯
攪 → 撹
麴 → 麹
鹼 → 鹸
嚙 → 噛
繡 → 繍
蔣 → 蒋
醬 → 醤
搔 → 掻
屛 → 屏
幷 → 并
濾 → 沪
蘆 → 芦
蠟 → 蝋
彎 → 弯
焰 → 焔
礦 → 砿
讚 → 讃
顚 → 顛
醱 → 醗
潑 → 溌
輛 → 輌
繫 → 繋

Quite a few of these occur in JMdict, which typically includes both the kyuujitai and the shinjitai as headwords.

It therefore seems better to look up the original input string first. Then, if the input contains kyuujitai, we try again with all kyuujitai replaced with shinjitai and, if that produces a longer match, use that result instead.

I've done that in 4acdd7d.
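For readers following along, the two-pass idea can be sketched roughly as below. This is not the code from that commit: `longestMatch`, `lookup`, the tiny conversion map, and the `Set` of headwords standing in for the dictionary are all assumptions made for illustration.

```javascript
// Tiny illustrative subset of the conversion table.
const KYUJITAI_TO_SHINJITAI = new Map([['國', '国'], ['學', '学']]);

function toShinjitai(text) {
  return [...text].map((ch) => KYUJITAI_TO_SHINJITAI.get(ch) ?? ch).join('');
}

// Hypothetical stand-in for the dictionary search: returns the longest
// prefix of `text` that is a known headword, or '' if there is none.
// Slicing by code unit is fine here because all characters are in the BMP.
function longestMatch(text, headwords) {
  for (let len = text.length; len > 0; len--) {
    const candidate = text.slice(0, len);
    if (headwords.has(candidate)) return candidate;
  }
  return '';
}

function lookup(text, headwords) {
  // Pass 1: the input exactly as written.
  const original = longestMatch(text, headwords);

  // Pass 2: only when the input actually contains kyuujitai.
  const converted = toShinjitai(text);
  if (converted === text) return original;
  const fallback = longestMatch(converted, headwords);

  // Prefer the converted result only when it is strictly longer.
  return fallback.length > original.length ? fallback : original;
}

const headwords = new Set(['国', '国学']);
console.log(lookup('國學', headwords)); // 国学
```

With only the original input, nothing would match here; the second pass is what finds 国学.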

Do you have any texts I can test it out on?

Tomalak · 2021-05-25T06:02:39Z

> if there are kyuujitai found, trying again with all kyuujitai replaced with shinjitai and, if we get a longer result, using that instead.

Could there be any cases where this swallows matches/loses detail?

birtles · 2021-05-25T06:11:31Z

> Could there be any cases where this swallows matches/loses detail?

I'm not sure. I guess the case we'd be concerned about is if JMdict has entries such as:

Entry A: ○○薗○ (but NO ○○園○)
Entry B: ○○園○○

and the input text was ○○薗○○.

In that case we'd say that the 新字体 (shinjitai) version produced a longer maximum match, so we'd display the entries that match on it. As a result we'd show entry B and not entry A, which would possibly be a regression. However, I don't think JMdict ever has entries with 旧字体 (kyuujitai) headwords where the 新字体 headword is not also present (although I believe the opposite is common), so maybe it's ok?

As for losing detail, if the matches we find using the 旧字体 have the same maximum length as the converted 新字体 version, we'll stick with the original 旧字体 match so that the 旧字体 headword is highlighted in the pop-up.
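Both cases can be played through with a toy dictionary. This is only a sketch under assumptions: the headwords are made-up strings standing in for the ○○薗○ and ○○園○○ placeholders, and `longestMatch` and `pick` are hypothetical helpers, not the extension's real API.

```javascript
// A single pair is enough for this scenario.
const toShinjitai = (text) => text.replace(/薗/g, '園');

// Longest prefix of `text` that is a headword, or '' if none.
const longestMatch = (text, headwords) => {
  for (let len = text.length; len > 0; len--) {
    if (headwords.has(text.slice(0, len))) return text.slice(0, len);
  }
  return '';
};

// Use the converted match only when it is strictly longer.
const pick = (input, headwords) => {
  const original = longestMatch(input, headwords);
  const converted = longestMatch(toShinjitai(input), headwords);
  return converted.length > original.length ? converted : original;
};

// Possible regression: entry A exists only with 薗, entry B is longer
// and exists only with 園. Entry B wins and entry A is not shown.
const regression = pick('あい薗うえ', new Set(['あい薗う', 'あい園うえ']));
console.log(regression); // あい園うえ

// Tie: both spellings are headwords of equal length, so the original
// spelling is kept and would be the one highlighted.
const tie = pick('あい薗う', new Set(['あい薗う', 'あい園う']));
console.log(tie); // あい薗う
```

The first case only arises when a headword exists in the old form without its new-form counterpart, which is the situation believed to be rare in JMdict.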

Tomalak · 2021-05-26T07:14:53Z

Yeah that's what I was thinking about. You're probably right, it should be rather rare.

birtles · 2021-05-26T09:56:33Z

Great, thanks for checking. I'll close this out for now then.

nicolasmaia · 2024-10-01T06:43:25Z

FYI, I just noticed 戶 doesn't get parsed as 戸.

Cf. https://en.wiktionary.org/wiki/%E6%88%B6#Japanese


birtles · 2024-10-03T05:49:06Z

> FYI, I just noticed 戶 doesn't get parsed as 戸.
>
> Cf. https://en.wiktionary.org/wiki/%E6%88%B6#Japanese

Thanks! Looks like that one's not in the list at https://en.wikipedia.org/wiki/Kyūjitai

I've updated the library we use for this upstream so this should be fixed by the next release.

nicolasmaia · 2024-10-03T11:23:57Z

You might also want to add 內, which became 内. See https://en.wiktionary.org/wiki/%E5%BA%84%E5%86%85


birtles · 2024-10-04T03:35:49Z

> You might also want to add 內, which became 内. See https://en.wiktionary.org/wiki/%E5%BA%84%E5%86%85

Thanks! I've added that one too now.

birtles added a commit that referenced this issue May 25, 2021

feat: Add support for looking up kyuujitai

4acdd7d

See #604.

birtles closed this as completed May 26, 2021

birtles added a commit to birchill/normal-jp that referenced this issue Oct 3, 2024

fix: add 戶 kyūjitai

ba2c0fc

As reported here birchill/10ten-ja-reader#604 (comment)

birtles mentioned this issue Oct 3, 2024

fix: add 戶 kyūjitai birchill/normal-jp#18

Merged

birtles added a commit to birchill/normal-jp that referenced this issue Oct 3, 2024

fix: add 戶 kyūjitai

f59171b

As reported here birchill/10ten-ja-reader#604 (comment)

birtles added a commit to birchill/normal-jp that referenced this issue Oct 4, 2024

fix: add 內 kyūjitai

73ab423

As reported here: birchill/10ten-ja-reader#604 (comment)
