Contemplate how to handle words with non-Chinese characters #42

Open
danielat998 opened this issue Jun 21, 2017 · 10 comments

Comments

@danielat998
Collaborator

No description provided.

@james-s-w-clark
Collaborator

We could try a rule:
if (the current substring contains no Han characters) {
    try looking up substring + the next character(s), which may be Chinese
}
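
A rough sketch of that rule in Python, for illustration only. The headword set `DICTIONARY`, the helper names, and the simplified `is_han` test (basic CJK Unified Ideographs block, U+4E00 to U+9FFF, only) are assumptions and are not taken from the project's code. It covers only words that begin with a non-Han character.

```python
# Hypothetical headword set; a real run would load these from the dictionary file.
DICTIONARY = {"A", "AA制", "T恤", "我们", "穿"}

MAX_WORD_LEN = 8  # assumed upper bound on headword length


def is_han(ch: str) -> bool:
    """Simplified check: basic CJK Unified Ideographs block only."""
    return "\u4e00" <= ch <= "\u9fff"


def extend_non_han(text: str, start: int):
    """If text[start] is not Han, look up progressively longer substrings
    that take in the following characters (which may be Han).

    Returns the longest dictionary match, or None if the character is Han
    or nothing matches.
    """
    if is_han(text[start]):
        return None
    best = None
    last_end = min(len(text), start + MAX_WORD_LEN)
    for end in range(start + 1, last_end + 1):
        candidate = text[start:end]
        if candidate in DICTIONARY:
            best = candidate  # keep going: a longer match may follow
    return best
```

Calling `extend_non_han("我们AA制吧", 2)` with the sample set above picks up the whole of AA制 rather than stopping at the single letter A.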

@danielat998
Collaborator Author

@james-clark-5 A valid idea, but we will have to consider where in the code this is done, as I suspect we may extract non-Chinese characters earlier in the process. Also, there are words that start with non-Chinese characters.

@james-s-w-clark
Collaborator

What examples do we have? Should help get the ball rolling.
AA制 (to go 50/50)
.......

@danielat998
Collaborator Author

danielat998 commented Sep 22, 2017 via email

@james-s-w-clark
Collaborator

I guess we'll be able to find potential words ending with non-hanzi characters by reusing the recursive "try-all-character-combinations" code. That would also bring up words starting with non-hanzi characters.
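
One way to picture that reuse, as a sketch only. The project's "try-all-character-combinations" code is not shown in this thread, so the plain iterative substring enumeration below stands in for it, and `HEADWORDS`, `is_hanzi`, and `mixed_words` are hypothetical names.

```python
# Hypothetical headword set standing in for the real dictionary.
HEADWORDS = {"卡拉OK", "唱K", "T恤", "唱", "歌"}


def is_hanzi(ch: str) -> bool:
    """Simplified check: basic CJK Unified Ideographs block only."""
    return "\u4e00" <= ch <= "\u9fff"


def mixed_words(text: str, max_len: int = 6) -> list:
    """Try every substring up to max_len characters and keep those that are
    headwords containing at least one non-hanzi character."""
    found = []
    for start in range(len(text)):
        last_end = min(len(text), start + max_len)
        for end in range(start + 1, last_end + 1):
            candidate = text[start:end]
            if candidate in HEADWORDS and not all(is_hanzi(c) for c in candidate):
                found.append(candidate)
    return found
```

Because every start and end position is tried, the same pass finds words that end in a non-hanzi character (唱K, 卡拉OK) and words that start with one (T恤).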

@danielat998
Collaborator Author

danielat998 commented Sep 22, 2017 via email

@danielt998
Owner

danielt998 commented Sep 22, 2017

less cedict_ts.u8 | cut -d ' ' -f 2 | grep -E '[A-Z]|[A-Z]'
edit: less cedict_ts.u8 | cut -d ' ' -f 2 | grep -E '[A-Z]|[a-z]'
3C 3P 3Q A AA制 AB制 ACG A咖 A圈儿 A片 A菜 A货 B B型超声 B超 CP C盘 C罗 DNA鉴定 E仔 G友 G弦裤 G点 H桥 K人 K仔 K他命 K房 K书 K歌 K粉 K线 K线图 M巾 N挡 OK绷 OK镜 OS O型腿 P P图 P挡 P民 Q T TA T字帐 T字裤 T恤 T裇 USB手指 USB记忆棒 U凸内裤 U型枕 U形转弯 U盘 V沟 X光 三C 三K党 三P 事儿B 来M 傻B 傻X 动L 卡拉OK 哆啦A梦 唱K 大V 巴比Q 拉K 柯P 牛B 异维A酸 装B 阿Q 阿Q正传 齐B小短裙 齐B短裙
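
For anyone who would rather do this from a script, here is a rough Python equivalent of the pipeline above. It assumes the usual CC-CEDICT line layout (traditional form, simplified form, bracketed pinyin, slash-separated glosses) and takes the second space-separated field, as `cut -f 2` does. Unlike the shell version it skips `#` comment lines. The function names are hypothetical, and the sample lines use placeholder pinyin and glosses rather than real dictionary text.

```python
import re


def simplified_headwords(lines):
    """Yield the second space-separated field of each entry line."""
    for line in lines:
        if line.startswith("#"):
            continue  # header/comment line, not an entry
        fields = line.split(" ")
        if len(fields) >= 2:
            yield fields[1]


def matching(lines, pattern):
    """Return headwords whose second field matches the regular expression."""
    regex = re.compile(pattern)
    return [word for word in simplified_headwords(lines) if regex.search(word)]


# Placeholder entries in the assumed layout; pinyin and glosses are dummies.
SAMPLE = [
    "# sample data",
    "AA制 AA制 [pinyin] /gloss/",
    "卡拉OK 卡拉OK [pinyin] /gloss/",
    "圈a 圈a [pinyin] /gloss/",
    "502膠 502胶 [pinyin] /gloss/",
    "中文 中文 [pinyin] /gloss/",
]

print(matching(SAMPLE, r"[A-Z]"))
```

To run it against the real file, pass an open file object instead of `SAMPLE`, for example one opened on `cedict_ts.u8` with `encoding="utf-8"`.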

@james-s-w-clark
Collaborator

I'm curious what this would pull up:
less cedict_ts.u8 | cut -d ' ' -f 2 | grep -E '[a-z]|[a-z]'

@james-s-w-clark
Collaborator

I'm curious what this would pull up:
less cedict_ts.u8 | cut -d ' ' -f 2 | grep -E '[0-9]|[0-9]'

@danielt998
Owner

The lowercase search gives:
圈a
and the digit search gives:

21三体综合症
3C
3P
3Q
502胶
美国51区

Bear in mind that there are entries with other characters too, such as this:
□ □ [ging1] /uptight/obstinate/to awkwardly force oneself to do sth/(Taiwanese, POJ pr. [gēng], often written as ㄍㄧㄥ, no generally accepted hanzi form)/
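
To get a feel for how many kinds of characters turn up in headwords, something like the following sketch could bucket each character. The four class names and the Han test (basic CJK Unified Ideographs block only) are simplifying assumptions; characters in extension blocks, bopomofo, and symbols such as □ all land in "other".

```python
def char_class(ch: str) -> str:
    """Bucket a single character into a coarse class."""
    if "\u4e00" <= ch <= "\u9fff":
        return "han"
    if ch.isascii() and ch.isalpha():
        return "latin"
    if ch.isascii() and ch.isdigit():
        return "digit"
    return "other"


def headword_classes(word: str) -> set:
    """Return the set of character classes that appear in a headword."""
    return {char_class(ch) for ch in word}


for word in ["AA制", "美国51区", "3Q", "□"]:
    print(word, sorted(headword_classes(word)))
```

Any headword whose class set is not exactly `{"han"}` is one that the handling discussed in this issue would need to cover.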
