Add CJK IPA Tokenizer #367

gkielian opened this issue Jan 19, 2025 · 0 comments
Currently making some modifications to the Japanese-to-IPA tokenizer, and noticing there are still a few hiragana types that we'll need to map:

[Image: hiragana types that still need IPA mappings]
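For illustration, these are the kinds of digraph entries such a mapping needs (the actual unmapped types are the ones in the screenshot; the dict name and examples here are hypothetical):

```python
# Hypothetical examples of hiragana digraphs that a naive per-character
# mapping misses; the real unmapped types are shown in the screenshot above.
HIRAGANA_TO_IPA = {
    "きゃ": "kʲa",
    "しゅ": "ɕɯ",
    "ちょ": "tɕo",
}
```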

Also switching from the csv library to pandas, since the former appears to hit its field-size limit when parsing longer TSV fields (it might also be that some rows aren't being recognized, possibly because they're not in the expected format).
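For reference, a minimal sketch of both sides of that switch, assuming a TSV input (the filename is hypothetical): Python's csv reader enforces a per-field size limit, while pandas does not.

```python
import csv

# The stdlib csv reader raises "field larger than field limit (131072)"
# on long fields unless the limit is raised manually:
csv.field_size_limit(10_000_000)

# pandas has no such per-field limit:
import pandas as pd

df = pd.read_csv("ja_ipa.tsv", sep="\t", dtype=str)  # hypothetical filename
```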

In any case, these were less than 10% of the file, so as a temporary fix I'll be adding a sed to delete the lines which weren't successfully processed; then we can focus on marking the rows whose data isn't being converted to IPA by the present scripts.
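A sketch of that temporary pruning, assuming the failures are the rows pandas can't parse; `on_bad_lines="skip"` drops them, which has the same effect as deleting those lines with sed:

```python
import pandas as pd

# Drop the (<10%) rows that don't parse, then write the rest back out.
df = pd.read_csv("ja_ipa.tsv", sep="\t", dtype=str, on_bad_lines="skip")
df.to_csv("ja_ipa_pruned.tsv", sep="\t", index=False)
```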

So currently working on:

  1. A PR with the above changes (pruning the rows not parsed by pandas)
  2. Marking (with [[[[[ ]]]]]) the sections of words that weren't parsed
  3. Creating an IPA token list that excludes the above words (a sed will remove the bracketed spans), then starting to collect IPA symbols for the tokenized dataset (a sketch of steps 2 and 3 follows this list).
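A rough sketch of steps 2 and 3; the converter stub, marker regex, and sed equivalent are all illustrative:

```python
import re

# Hypothetical stand-in for the real converter; returns None on failure.
SAMPLE_IPA = {"はい": "hai"}

def to_ipa(word):
    return SAMPLE_IPA.get(word)

def mark_unparsed(words):
    # Step 2: wrap words that didn't convert in [[[[[ ]]]]].
    return [to_ipa(w) or f"[[[[[{w}]]]]]" for w in words]

def collect_ipa_symbols(lines):
    # Step 3: strip the bracketed words (sed equivalent:
    # sed 's/\[\[\[\[\[[^]]*\]\]\]\]\]//g') and collect the remaining symbols.
    symbols = set()
    for line in lines:
        cleaned = re.sub(r"\[{5}[^\]]*\]{5}", "", line)
        symbols.update(cleaned.split())
    return symbols
```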

Note: if the [[[[[ ]]]]] words turn out to be very common, we can still use the combined IPA tokens and utilize byte fallback for the remainder.
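For reference, a minimal sketch of byte fallback in the SentencePiece style, where any word outside the IPA vocabulary is emitted as per-byte tokens so the mapping stays lossless:

```python
def encode_with_byte_fallback(word, ipa_vocab):
    # Use the IPA token when available; otherwise emit one <0xNN>
    # token per UTF-8 byte so nothing is dropped.
    if word in ipa_vocab:
        return [word]
    return [f"<0x{b:02X}>" for b in word.encode("utf-8")]

# Example: a word missing from the vocab becomes byte tokens.
print(encode_with_byte_fallback("犬", set()))
# -> ['<0xE7>', '<0x8A>', '<0xAC>']
```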

Semantic Factorization

I also wanted to mention that I'm looking forward to adding a learned semantic/position encoding, and here we might be able to add a parallel dataset of the hiragana and kanji types for each of the phonemes.

So the embeddings will look like:

  1. language embedding
  2. (if ja) hiragana
  3. (if ja) kanji
  4. (if zh) tone (finally a way to incorporate tone!)
  5. (if zh) hanzi character (?) or radical × position embeddings
  6. (if ko) hangul glyph (will likely speed up processing of particles)

Doing this should theoretically make the mapping reversible without adding specialized numeric tokens, while still factorizing the targets (reducing overhead for the multi-category shadow).
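A minimal PyTorch sketch of the factorized-embedding idea, with summed factors standing in for items 1-6 above; the class, names, and dimensions are all assumptions:

```python
import torch
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """Sum of factor embeddings; index 0 can mean 'not applicable'."""

    def __init__(self, n_ipa, n_lang, n_script, d_model):
        super().__init__()
        self.ipa = nn.Embedding(n_ipa, d_model)        # base IPA token
        self.lang = nn.Embedding(n_lang, d_model)      # 1. language
        self.script = nn.Embedding(n_script, d_model)  # 2-6. hiragana/kanji/tone/hanzi/hangul

    def forward(self, ipa_ids, lang_ids, script_ids):
        # Summing keeps each factor recoverable from the token's metadata
        # while sharing parameters across categories.
        return self.ipa(ipa_ids) + self.lang(lang_ids) + self.script(script_ids)

# Usage: emb = FactorizedEmbedding(128, 4, 512, 384)
# x = emb(ipa_ids, lang_ids, script_ids)  # LongTensors of shape (B, T)
```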
