Currently making some modifications to the Japanese-to-IPA tokenizer; there are still a few hiragana types that we'll need to map.
Also switching from the csv library to pandas, since csv appears to hit its limit when parsing longer TSV fields (it may also be that some rows aren't being recognized because they're not in the expected format).
In any case, the affected lines were fewer than 10% of the file, so as a temporary fix I'll add a `sed` pass to delete the lines that weren't successfully processed; then we can focus on marking the rows whose data isn't being converted to IPA by the current scripts.
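A minimal sketch of the pandas-based load; the file path and column names are placeholders, not the actual dataset schema:

```python
import pandas as pd

# Read the TSV with pandas instead of csv; skip rows it can't parse
# (the temporary ~<10% pruning mentioned above).
df = pd.read_csv(
    "ja_lexicon.tsv",          # hypothetical input path
    sep="\t",
    quoting=3,                 # csv.QUOTE_NONE: keep stray quotes from merging fields
    on_bad_lines="skip",       # drop malformed rows instead of raising
    dtype=str,
)

# Rows that parsed but are missing the fields we need get pruned for now too.
df = df.dropna(subset=["surface", "reading"])   # assumed column names
df.to_csv("ja_lexicon.pruned.tsv", sep="\t", index=False)
```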
So currently working on:
- A PR with the above changes (pruning the rows pandas can't parse)
- Marking (with `[[[[[ ]]]]]`) the sections of words that weren't parsed, roughly as in the sketch after this list
- Creating an IPA token list that excludes the words marked above (a `sed` pass will remove those brackets), and starting to collect IPA symbols for the tokenized dataset
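A rough sketch of the bracket-marking step; `to_ipa` and the segment split are hypothetical stand-ins for whatever the current scripts do:

```python
import re

def mark_unparsed(segments, to_ipa):
    """Replace segments with their IPA form, or wrap them in [[[[[ ]]]]] if unmapped."""
    out = []
    for seg in segments:
        ipa = to_ipa(seg)                     # hypothetical lookup; None if unmapped
        out.append(ipa if ipa is not None else f"[[[[[{seg}]]]]]")
    return " ".join(out)

# Later, the marked spans can be stripped before building the IPA token list,
# roughly equivalent to a sed pass like: sed -E 's/\[{5}[^]]*\]{5}//g'
UNPARSED = re.compile(r"\[{5}.*?\]{5}")

def strip_marked(line: str) -> str:
    return UNPARSED.sub("", line)
```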
Note: if the `[[[[[ ]]]]]`-marked words turn out to be very common, we can still use the combined IPA tokens and fall back to byte tokens for the remainder.
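A minimal sketch of that byte fallback, assuming a hypothetical `to_ipa` lookup and SentencePiece-style `<0xNN>` byte tokens:

```python
def tokenize_with_fallback(word, ipa_vocab, to_ipa):
    """Return IPA tokens when the word is fully covered, else UTF-8 byte tokens."""
    ipa = to_ipa(word)                      # hypothetical lookup; None if unmapped
    if ipa is not None and all(tok in ipa_vocab for tok in ipa):
        return ipa
    # Byte fallback: nothing is unrepresentable, at the cost of longer sequences.
    return [f"<0x{b:02X}>" for b in word.encode("utf-8")]
```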
Semantic Factorization
Wanted to mention that I'm looking forward to adding a learned semantic/position encoding, and here we might be able to add a parallel dataset of the hiragana and kanji types for each of the phonemes.
So the embeddings will look like:
- language embedding
- (if ja) hiragana
- (if ja) kanji
- (if zh) tone (finally, a way to incorporate tone!)
- (if zh) hanzi character (?) or radical × position embeddings
- (if ko) hangul glyph (will likely speed up processing of particles)
Doing this should, in theory, give us a reversible mapping without adding specialized numeric tokens, while still factorizing the targets (reducing overhead for the multi-category shadow); a rough sketch of the factorized lookup is below.
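A sketch of what the factorized embedding could look like in PyTorch; the vocabulary sizes, field names, and the additive combination are all assumptions, not decided design:

```python
import torch.nn as nn

class FactorizedPhonemeEmbedding(nn.Module):
    """Final vector = phoneme embedding + language embedding + language-conditional
    script/tone embeddings (index 0 = "not applicable" for other languages)."""
    def __init__(self, d_model, n_phonemes, n_langs,
                 n_hiragana, n_kanji, n_tones, n_hangul):
        super().__init__()
        self.phoneme  = nn.Embedding(n_phonemes, d_model)
        self.lang     = nn.Embedding(n_langs, d_model)
        # padding_idx=0 keeps the "not applicable" slot fixed at zero
        self.hiragana = nn.Embedding(n_hiragana, d_model, padding_idx=0)
        self.kanji    = nn.Embedding(n_kanji, d_model, padding_idx=0)
        self.tone     = nn.Embedding(n_tones, d_model, padding_idx=0)
        self.hangul   = nn.Embedding(n_hangul, d_model, padding_idx=0)

    def forward(self, phoneme_id, lang_id, hiragana_id, kanji_id, tone_id, hangul_id):
        return (self.phoneme(phoneme_id) + self.lang(lang_id)
                + self.hiragana(hiragana_id) + self.kanji(kanji_id)
                + self.tone(tone_id) + self.hangul(hangul_id))
```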