How to add a new BERT tokenizer model
We assume the Bling Fire tools are already compiled and that they are available on your PATH.
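For example, a minimal setup might look like this (the build output path is illustrative and depends on how and where you compiled the tools):

```sh
# add the directory containing the compiled Bling Fire tools to PATH
# (adjust the path to your actual build output location)
export PATH="$PATH:$HOME/BlingFire/Release"
```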
- Create a new directory under ldbsrc
```
cd ldbsrc
mkdir bert_chinese
```
- Copy content of an existing model similar to yours into the new directory:
```
cp bert_base_tok/* bert_chinese
```
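You can verify the copy worked; the files edited in the rest of this tutorial should now be present (the list below names only the files this tutorial references, the directory may contain more):

```sh
ls bert_chinese
# among others, expect to see:
# options.small  ldb.conf.small  charmap.utf8  vocab.txt
```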
- Modify options.small to use a new output name for your bin file:
```
OUTPUT = bert_chinese.bin
USE_CHARMAP = 1
opt_build_wbd = --dict-root=. --full-unicode
opt_pack_wbd_fsa = --alg=triv --type=moore-dfa --remap-iws --use-iwia
opt_pack_wbd_mmap = --alg=triv --type=mmap
opt_pack_charmap = --alg=fixed --type=mmap --imp-mmap
resources = \
    $(tmpdir)/wbd.fsa.$(mode).dump \
    $(tmpdir)/wbd.mmap.$(mode).dump \
    $(tmpdir)/charmap.mmap.$(mode).dump \
```
If you don't want to use character normalization, such as case folding and accent removal, then you need to remove the charmap.utf8 compilation from the options.small file and from ldb.conf.small:
```
OUTPUT = bert_chinese_no_normalization.bin
opt_build_wbd = --dict-root=. --full-unicode
opt_pack_wbd_fsa = --alg=triv --type=moore-dfa --remap-iws --use-iwia
opt_pack_wbd_mmap = --alg=triv --type=mmap
resources = \
    $(tmpdir)/wbd.fsa.$(mode).dump \
    $(tmpdir)/wbd.mmap.$(mode).dump \
```
And in ldb.conf.small, comment out the charmap resource line:

```
[wbd]
max-depth 4
xword 2
seg 3
ignore 4
fsm 1
multi-map-mode triv-dump
multi-map 2
# charmap 3
```
If you need normalization such as case folding, dropping of accents, or something else, you can generate your own charmap.utf8 file. The format of the file is shown in the example below: each mapping line contains the source codepoint followed by the codepoint it maps to, in \xXXXX notation (lines starting with # are comments).
Example:
```
# A --> a
\x0041 \x0061
# B --> b
\x0042 \x0062
# C --> c
\x0043 \x0063
# D --> d
\x0044 \x0064
# E --> e
\x0045 \x0065
# F --> f
\x0046 \x0066
# G --> g
\x0047 \x0067
# H --> h
\x0048 \x0068
```
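Accent-removal entries would follow the same one-to-one pattern; for instance (these specific entries are illustrative, not taken from the repository):

```
# À --> a
\x00C0 \x0061
# É --> e
\x00C9 \x0065
```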
It is easy to use a script to generate the charmap you need. For BERT case-folded models we use this command line:
```
python gen_charmap.py > charmap.utf8
```
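For illustration, here is a minimal sketch of what such a generator could look like, using Python's unicodedata for case folding and accent removal; the actual gen_charmap.py in the repository may differ in details:

```python
# Illustrative sketch only; the repository's gen_charmap.py may differ.
# Emits one "\xSRC \xDST" mapping per character whose normalized form
# (lower-cased, combining accents removed) is a single different character.
import unicodedata

def normalize(ch):
    folded = ch.lower()                                 # case folding
    decomposed = unicodedata.normalize("NFD", folded)   # split base + accents
    return "".join(c for c in decomposed
                   if unicodedata.category(c) != "Mn")  # drop combining marks

for cp in range(0x10000):                               # BMP only, for brevity
    if 0xD800 <= cp <= 0xDFFF:                          # skip surrogates
        continue
    src = chr(cp)
    dst = normalize(src)
    if dst != src and len(dst) == 1:                    # keep 1:1 mappings only
        print("# %s --> %s" % (src, dst))
        print("\\x%04X \\x%04X" % (cp, ord(dst)))
```

Redirecting its output to a file produces a charmap.utf8 in the same format as the example above.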
After charmap.utf8 is created, you need to make sure options.small and ldb.conf contain the compilation options and the resource reference for charmap.utf8 as before (see the bert_base_tok directory).
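For reference, assuming the same ldb.conf.small as shown earlier, the normalization-enabled variant differs only in the uncommented charmap line:

```
[wbd]
max-depth 4
xword 2
seg 3
ignore 4
fsm 1
multi-map-mode triv-dump
multi-map 2
charmap 3
```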
If your model uses a different vocab.txt file (computation of vocab.txt is outside the scope of this tutorial), then you need to convert it into the "fa_lex" format. We have a simple helper script for this:

```
python vocab_to_fa_lex.py
```
Note the script will create a new wbd.tagset.txt file and a vocab.falex file. vocab.falex contains all words converted into fa_lex rules; these rules are applied once the tokenizer finds a full token. Each rule maps a token to a unique tag value, which is the same as the ID in the original vocab.txt file. This way we marry the tokenizer and the dictionary lookup into one finite-state machine. For example, the beginning of the generated vocab.falex looks like this:
```
< ^ [\[][U][N][K][\]] > --> WORD_ID_100
< ^ [\[][C][L][S][\]] > --> WORD_ID_101
< ^ [\[][S][E][P][\]] > --> WORD_ID_102
< ^ [\[][M][A][S][K][\]] > --> WORD_ID_103
< ^ [<][S][>] > --> WORD_ID_104
< ^ [<][T][>] > --> WORD_ID_105
< ^ [!] > --> WORD_ID_106
< ^ ["] > --> WORD_ID_107
< ^ [#] > --> WORD_ID_108
< ^ [$] > --> WORD_ID_109
< ^ [%] > --> WORD_ID_110
< ^ [&] > --> WORD_ID_111
< ^ ['] > --> WORD_ID_112
< ^ [(] > --> WORD_ID_113
< ^ [)] > --> WORD_ID_114
< ^ [*] > --> WORD_ID_115
< ^ [+] > --> WORD_ID_116
< ^ [,] > --> WORD_ID_117
< ^ [\-] > --> WORD_ID_118
< ^ [.] > --> WORD_ID_119
< ^ [/] > --> WORD_ID_120
< ^ [0] > --> WORD_ID_121
< ^ [1] > --> WORD_ID_122
< ^ [2] > --> WORD_ID_123
< ^ [3] > --> WORD_ID_124
< ^ [4] > --> WORD_ID_125
< ^ [5] > --> WORD_ID_126
< ^ [6] > --> WORD_ID_127
< ^ [7] > --> WORD_ID_128
< ^ [8] > --> WORD_ID_129
< ^ [9] > --> WORD_ID_130
< ^ [:] > --> WORD_ID_131
< ^ [;] > --> WORD_ID_132
< ^ [<] > --> WORD_ID_133
< ^ [=] > --> WORD_ID_134
< ^ [>] > --> WORD_ID_135
< ^ [?] > --> WORD_ID_136
< ^ [@] > --> WORD_ID_137
< ^ [\[] > --> WORD_ID_138
< ^ [\x5C] > --> WORD_ID_139
```
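For illustration, here is a minimal sketch of how such a conversion could work. This is hypothetical: the repository's vocab_to_fa_lex.py also generates wbd.tagset.txt and handles details (such as "##" continuation pieces) not shown here.

```python
# Hypothetical sketch of the vocab.txt --> fa_lex conversion; the real
# vocab_to_fa_lex.py in the repository may differ in details.

def escape(ch):
    """Wrap one character into a fa_lex character class, escaping the
    characters that the sample rules above show escaped."""
    if ch == "\\":
        return "[\\x5C]"         # backslash is written as its codepoint
    if ch in "[]-":
        return "[\\" + ch + "]"  # '[', ']' and '-' are backslash-escaped
    return "[" + ch + "]"

with open("vocab.txt", encoding="utf-8") as vocab:
    for word_id, line in enumerate(vocab):   # line number == token ID
        token = line.rstrip("\n")
        if not token:
            continue
        pattern = "".join(escape(c) for c in token)
        # each rule maps a full token to a tag equal to its vocab ID
        print("< ^ %s > --> WORD_ID_%d" % (pattern, word_id))
```

Run against a BERT vocab.txt, this reproduces rules of the shape shown above, e.g. [UNK] on line 101 (ID 100) becomes `< ^ [\[][U][N][K][\]] > --> WORD_ID_100`.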