Contemplate how to handle words with non-Chinese characters #42

danielat998 · 2017-06-21T22:23:04Z

No description provided.

james-s-w-clark · 2017-06-22T05:48:26Z

Could try a rule:
If (string unicode isn't han) {
Try looking up substring+next character(s) (which may be Chinese)
}
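
One way to read this rule is as a greedy longest-match lookup that ignores script entirely, so that a run of Latin letters followed by a hanzi (such as AA制) is tried as a single candidate before falling back to shorter pieces. The sketch below is only an illustration under that assumption: `MixedScriptSegmenter`, `segment`, and `maxWordLength` are hypothetical names, not taken from the project, and the code indexes by UTF-16 `char`, so it assumes headwords stay within the Basic Multilingual Plane.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: greedy longest-match lookup that does not depend on script. */
class MixedScriptSegmenter {

    /**
     * Splits text into dictionary words, always preferring the longest match
     * that starts at the current position. A character that starts no
     * dictionary word is emitted as a single-character token.
     */
    static List<String> segment(String text, Set<String> dictionary, int maxWordLength) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxWordLength);
            // Shrink the candidate until it is a dictionary word or a single character.
            while (end > i + 1 && !dictionary.contains(text.substring(i, end))) {
                end--;
            }
            tokens.add(text.substring(i, end));
            i = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Tiny stand-in dictionary; the real one would be loaded from the dictionary file.
        Set<String> dictionary = new HashSet<>(Arrays.asList("我们", "AA制", "吧"));
        System.out.println(segment("我们AA制吧", dictionary, 4));
    }
}
```

With this approach no special case for non-Han characters is needed at lookup time, provided they have not already been stripped out earlier in the pipeline.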

danielat998 · 2017-06-22T09:59:30Z

@james-clark-5 A valid idea. But we will have to consider where in the code this is done, as I suspect we may extract non-Chinese characters earlier in the process. Also, there are words that start with non-Chinese characters.

james-s-w-clark · 2017-09-22T03:16:59Z

What examples do we have? Should help get the ball rolling.
AA制 (to go 50/50)
.......

danielat998 · 2017-09-22T08:15:38Z

That's the example that came to mind too, though if you look at the beginning of the file, you will find more examples. I think the first step would be to find out whether there are any words that *end* with non-Hanzi characters, as this might make the implementation a little (though not much) more complicated.


james-s-w-clark · 2017-09-22T09:18:53Z

I guess we'll be able to find out about potential words ending with non-hanzi characters by reusing the recursive "try-all-character-combinations" code. That would also bring up words starting with non-hanzi characters.

danielat998 · 2017-09-22T09:53:25Z

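Finding them would be very easy, something like:

    // words: the set of all dictionary entries
    for (Word word : words) {
        for (char c : word.getSimplifiedChinese().toCharArray()) {
            if (Character.UnicodeScript.of(c) != Character.UnicodeScript.HAN) {
                System.out.println(word.getSimplifiedChinese());
                break; // print each word once, then move on to the next word
            }
        }
    }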
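
To also answer the earlier question of whether any words *end* with non-Hanzi characters, the loop above could report where the non-Han characters sit rather than only whether they occur. The following is a minimal sketch under that assumption; `NonHanPosition` and `classify` are hypothetical names, and it works on plain headword strings rather than the project's `Word` type.

```java
import java.util.Arrays;

/** Hypothetical helper: reports where non-Han characters sit in a headword. */
class NonHanPosition {

    static boolean isHan(int codePoint) {
        return Character.UnicodeScript.of(codePoint) == Character.UnicodeScript.HAN;
    }

    /** Returns "ALL_HAN", "START", "END", "START_AND_END", or "MIDDLE". */
    static String classify(String word) {
        int[] codePoints = word.codePoints().toArray();
        if (Arrays.stream(codePoints).allMatch(NonHanPosition::isHan)) {
            return "ALL_HAN"; // also covers the empty string
        }
        boolean nonHanStart = !isHan(codePoints[0]);
        boolean nonHanEnd = !isHan(codePoints[codePoints.length - 1]);
        if (nonHanStart && nonHanEnd) return "START_AND_END";
        if (nonHanStart) return "START";
        if (nonHanEnd) return "END";
        return "MIDDLE"; // non-Han characters only in the interior
    }

    public static void main(String[] args) {
        // Sample headwords; a real run would iterate over every dictionary entry.
        for (String word : Arrays.asList("AA制", "卡拉OK", "哆啦A梦", "3C")) {
            System.out.println(word + " -> " + classify(word));
        }
    }
}
```

Counting the words in each bucket would show whether the "END" and "MIDDLE" cases are common enough to need dedicated handling.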


danielt998 · 2017-09-22T11:26:23Z

less cedict_ts.u8 | cut -d ' ' -f 2 | grep '[A-Z]'
edit: less cedict_ts.u8 | cut -d ' ' -f 2 | grep '[A-Za-z]'
3C 3P 3Q A AA制 AB制 ACG A咖 A圈儿 A片 A菜 A货 B B型超声 B超 CP C盘 C罗 DNA鉴定 E仔 G友 G弦裤 G点 H桥 K人 K仔 K他命 K房 K书 K歌 K粉 K线 K线图 M巾 N挡 OK绷 OK镜 OS O型腿 P P图 P挡 P民 Q T TA T字帐 T字裤 T恤 T裇 USB手指 USB记忆棒 U凸内裤 U型枕 U形转弯 U盘 V沟 X光 三C 三K党 三P 事儿B 来M 傻B 傻X 动L 卡拉OK 哆啦A梦 唱K 大V 巴比Q 拉K 柯P 牛B 异维A酸 装B 阿Q 阿Q正传 齐B小短裙 齐B短裙

james-s-w-clark · 2017-09-22T16:01:33Z

I'm curious what this would pull up:
less cedict_ts.u8 | cut -d ' ' -f 2 | grep '[a-z]'

james-s-w-clark · 2017-09-22T16:02:33Z

I'm curious what this would pull up:
less cedict_ts.u8 | cut -d ' ' -f 2 | grep '[0-9]'

danielt998 · 2017-09-22T16:54:19Z

圈a
and

21三体综合症
3C
3P
3Q
502胶
美国51区

and bear in mind there are entries with other characters too, such as this:
□ □ [ging1] /uptight/obstinate/to awkwardly force oneself to do sth/(Taiwanese, POJ pr. [gēng], often written as ㄍㄧㄥ, no generally accepted hanzi form)/
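
Entries like this suggest that checking only for Latin letters and digits would miss some cases. A more general check could list the Unicode scripts that occur in each headword and flag anything that is not purely Han. The sketch below illustrates that idea; `HeadwordScripts` and `scriptsOf` are hypothetical names, and it assumes the placeholder glyph in the entry above is U+25A1 (WHITE SQUARE).

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical helper: collects the Unicode scripts that occur in a headword. */
class HeadwordScripts {

    static Set<Character.UnicodeScript> scriptsOf(String headword) {
        Set<Character.UnicodeScript> scripts = EnumSet.noneOf(Character.UnicodeScript.class);
        headword.codePoints().forEach(cp -> scripts.add(Character.UnicodeScript.of(cp)));
        return scripts;
    }

    public static void main(String[] args) {
        // "\u25A1" is the white square assumed to be the placeholder headword above;
        // the last sample is the Bopomofo spelling quoted in the same entry.
        for (String headword : Arrays.asList("AA制", "502胶", "\u25A1", "ㄍㄧㄥ")) {
            System.out.println(headword + " -> " + scriptsOf(headword));
        }
    }
}
```

Digits and symbols such as the white square report as `COMMON` rather than `LATIN`, so a filter written as "anything other than `HAN`" would catch them, while a filter written as "contains `LATIN`" would not.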

danielat998 added the Thinking point label Jun 21, 2017
