Add CJK IPA Tokenizer #367

gkielian opened this issue Jan 19, 2025 · 0 comments
Currently making some modifications to the Japanese-to-IPA tokenizer, and noticing there are still a few hiragana types that we'll need to map:

[Image: hiragana types that still need IPA mappings]
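For illustration, these are the kinds of digraph entries such a mapping needs (the actual unmapped types are the ones in the screenshot; the dict name and examples here are hypothetical):

```python
# Hypothetical examples of hiragana digraphs that a naive per-character
# mapping misses; the real unmapped types are shown in the screenshot above.
HIRAGANA_TO_IPA = {
    "きゃ": "kʲa",
    "しゅ": "ɕɯ",
    "ちょ": "tɕo",
}
```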

Also switching from the csv library to pandas, since the former appears to hit its field-size limit when parsing longer TSV fields (it might also be that some rows aren't being recognized, possibly because they're not in the expected format).
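For reference, a minimal sketch of both sides of that switch, assuming a TSV input (the filename is hypothetical): Python's csv reader enforces a per-field size limit, while pandas does not.

```python
import csv

# The stdlib csv reader raises "field larger than field limit (131072)"
# on long fields unless the limit is raised manually:
csv.field_size_limit(10_000_000)

# pandas has no such per-field limit:
import pandas as pd

df = pd.read_csv("ja_ipa.tsv", sep="\t", dtype=str)  # hypothetical filename
```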

In any case, these were less than 10% of the file, so as a temporary fix I'll be adding a sed to delete the lines which weren't successfully processed; then we can focus on marking the rows whose data isn't being converted to IPA by the present scripts.
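A sketch of that temporary pruning, assuming the failures are the rows pandas can't parse; `on_bad_lines="skip"` drops them, which has the same effect as deleting those lines with sed:

```python
import pandas as pd

# Drop the (<10%) rows that don't parse, then write the rest back out.
df = pd.read_csv("ja_ipa.tsv", sep="\t", dtype=str, on_bad_lines="skip")
df.to_csv("ja_ipa_pruned.tsv", sep="\t", index=False)
```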

So currently working on:

  1. A PR with the above changes (pruning the rows not parsed by pandas)
  2. Marking (with [[[[[ ]]]]]) the sections of words that weren't parsed
  3. Creating an IPA token list that excludes the above words (a sed will remove the bracketed spans), then starting to collect IPA symbols for the tokenized dataset (a sketch of steps 2 and 3 follows this list).
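A rough sketch of steps 2 and 3; the converter stub, marker regex, and sed equivalent are all illustrative:

```python
import re

# Hypothetical stand-in for the real converter; returns None on failure.
SAMPLE_IPA = {"はい": "hai"}

def to_ipa(word):
    return SAMPLE_IPA.get(word)

def mark_unparsed(words):
    # Step 2: wrap words that didn't convert in [[[[[ ]]]]].
    return [to_ipa(w) or f"[[[[[{w}]]]]]" for w in words]

def collect_ipa_symbols(lines):
    # Step 3: strip the bracketed words (sed equivalent:
    # sed 's/\[\[\[\[\[[^]]*\]\]\]\]\]//g') and collect the remaining symbols.
    symbols = set()
    for line in lines:
        cleaned = re.sub(r"\[{5}[^\]]*\]{5}", "", line)
        symbols.update(cleaned.split())
    return symbols
```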

Note: if the [[[[[ ]]]]] words turn out to be very common, we can still use the combined IPA tokens and utilize byte fallback for the remainder.
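For reference, a minimal sketch of byte fallback in the SentencePiece style, where any word outside the IPA vocabulary is emitted as per-byte tokens so the mapping stays lossless:

```python
def encode_with_byte_fallback(word, ipa_vocab):
    # Use the IPA token when available; otherwise emit one <0xNN>
    # token per UTF-8 byte so nothing is dropped.
    if word in ipa_vocab:
        return [word]
    return [f"<0x{b:02X}>" for b in word.encode("utf-8")]

# Example: a word missing from the vocab becomes byte tokens.
print(encode_with_byte_fallback("犬", set()))
# -> ['<0xE7>', '<0x8A>', '<0xAC>']
```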

Semantic Factorization

I also wanted to mention that I'm looking forward to adding a learned semantic/position encoding, and here we might be able to add a parallel dataset of the hiragana and kanji types for each of the phonemes.

So the embeddings will look like:

  1. language embedding
  2. (if ja) hiragana
  3. (if ja) kanji
  4. (if zh) tone (finally a way to incorporate tone!)
  5. (if zh) hanzi character (?) or radical × position embeddings
  6. (if ko) hangul glyph (will likely speed up processing of particles)

Doing this should theoretically make the mapping reversible without adding specialized numeric tokens, while still factorizing the targets (reducing overhead for the multi-category shadow).
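A minimal PyTorch sketch of the factorized-embedding idea, with summed factors standing in for items 1-6 above; the class, names, and dimensions are all assumptions:

```python
import torch
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """Sum of factor embeddings; index 0 can mean 'not applicable'."""

    def __init__(self, n_ipa, n_lang, n_script, d_model):
        super().__init__()
        self.ipa = nn.Embedding(n_ipa, d_model)        # base IPA token
        self.lang = nn.Embedding(n_lang, d_model)      # 1. language
        self.script = nn.Embedding(n_script, d_model)  # 2-6. hiragana/kanji/tone/hanzi/hangul

    def forward(self, ipa_ids, lang_ids, script_ids):
        # Summing keeps each factor recoverable from the token's metadata
        # while sharing parameters across categories.
        return self.ipa(ipa_ids) + self.lang(lang_ids) + self.script(script_ids)

# Usage: emb = FactorizedEmbedding(128, 4, 512, 384)
# x = emb(ipa_ids, lang_ids, script_ids)  # LongTensors of shape (B, T)
```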
