How to add a new BERT tokenizer model
We assume the Bling Fire tools are already compiled and the PATH is set.
- Create a new directory under ldbsrc
cd ldbsrc
mkdir bert_chinese
- Copy content of an existing model similar to yours into the new directory:
cp bert_base_tok/* bert_chinese
- Modify options.small to use new output name for your bin file:
OUTPUT = bert_chinese.bin

USE_CHARMAP = 1

opt_build_wbd = --dict-root=. --full-unicode
opt_pack_wbd_fsa = --alg=triv --type=moore-dfa --remap-iws --use-iwia
opt_pack_wbd_mmap = --alg=triv --type=mmap
opt_pack_charmap = --alg=fixed --type=mmap --imp-mmap

resources = \
	$(tmpdir)/wbd.fsa.$(mode).dump \
	$(tmpdir)/wbd.mmap.$(mode).dump \
	$(tmpdir)/charmap.mmap.$(mode).dump \
If you don't want character normalization such as case folding and accent removal, remove the charmap.utf8 compilation from both the options.small file and ldb.conf.small:
OUTPUT = bert_chinese_no_normalization.bin

opt_build_wbd = --dict-root=. --full-unicode
opt_pack_wbd_fsa = --alg=triv --type=moore-dfa --remap-iws --use-iwia
opt_pack_wbd_mmap = --alg=triv --type=mmap

resources = \
	$(tmpdir)/wbd.fsa.$(mode).dump \
	$(tmpdir)/wbd.mmap.$(mode).dump \
[wbd]
max-depth 4
xword 2
seg 3
ignore 4
fsm 1
multi-map-mode triv-dump
multi-map 2
# charmap 3
If you need normalization such as case folding, accent removal, or something else, you can generate your own charmap.utf8 file. The format of the file is:
# A --> a
\x0041 \x0061
# B --> b
\x0042 \x0062
# C --> c
\x0043 \x0063
# D --> d
\x0044 \x0064
# E --> e
\x0045 \x0065
# F --> f
\x0046 \x0066
# G --> g
\x0047 \x0067
# H --> h
\x0048 \x0068
It is easy to use a script to generate a charmap you need. For BERT casefolded models we use this command line:
python gen_charmap.py > charmap.utf8
After charmap.utf8 is created, make sure that options.small and ldb.conf.small contain the compilation options and the resource reference for charmap.utf8, as before (see the bert_base_tok directory).
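To illustrate the file format, here is a minimal sketch of a charmap generator. It is not the actual gen_charmap.py; the helper names `fold` and `charmap_lines` are made up for this example, and it assumes that each source character maps to exactly one target character.

```python
import unicodedata


def fold(ch):
    """Lowercase ch and strip combining accents (Unicode category Mn)."""
    decomposed = unicodedata.normalize("NFD", ch.lower())
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


def charmap_lines(codepoints):
    """Yield charmap.utf8 lines for every code point whose folded form differs."""
    for cp in codepoints:
        src = chr(cp)
        dst = fold(src)
        # Assumption: only single-character targets are emitted.
        if len(dst) != 1 or dst == src:
            continue
        yield f"# {src} --> {dst}"
        yield f"\\x{cp:04X} \\x{ord(dst):04X}"


# Print the entries for A, B and C in the format shown above.
print("\n".join(charmap_lines(range(0x41, 0x44))))
```

Passing a wider range, for example `range(0x41, 0x180)`, would also cover accented Latin letters such as É, which this sketch maps to e.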
If your model uses a different vocab.txt file, you need to convert it into "fa_lex" format. (Computing vocab.txt is outside the scope of this tutorial.) We have a simple helper script for this:
python vocab_to_fa_lex.py
Note that the script creates a new wbd.tagset.txt file and a vocab.falex file. vocab.falex contains all words converted into fa_lex rules; these rules are applied once the tokenizer finds a full token. Each rule maps a token to a unique tag value, which is the same as its ID in the original vocab.txt file. This way we marry a tokenizer and a dictionary lookup into one finite-state machine.
< ^ [\[][U][N][K][\]] > --> WORD_ID_100
< ^ [\[][C][L][S][\]] > --> WORD_ID_101
< ^ [\[][S][E][P][\]] > --> WORD_ID_102
< ^ [\[][M][A][S][K][\]] > --> WORD_ID_103
< ^ [<][S][>] > --> WORD_ID_104
< ^ [<][T][>] > --> WORD_ID_105
< ^ [!] > --> WORD_ID_106
< ^ ["] > --> WORD_ID_107
< ^ [#] > --> WORD_ID_108
< ^ [$] > --> WORD_ID_109
< ^ [%] > --> WORD_ID_110
< ^ [&] > --> WORD_ID_111
< ^ ['] > --> WORD_ID_112
< ^ [(] > --> WORD_ID_113
< ^ [)] > --> WORD_ID_114
< ^ [*] > --> WORD_ID_115
< ^ [+] > --> WORD_ID_116
< ^ [,] > --> WORD_ID_117
< ^ [\-] > --> WORD_ID_118
< ^ [.] > --> WORD_ID_119
< ^ [/] > --> WORD_ID_120
< ^ [0] > --> WORD_ID_121
< ^ [1] > --> WORD_ID_122
< ^ [2] > --> WORD_ID_123
< ^ [3] > --> WORD_ID_124
< ^ [4] > --> WORD_ID_125
< ^ [5] > --> WORD_ID_126
< ^ [6] > --> WORD_ID_127
< ^ [7] > --> WORD_ID_128
< ^ [8] > --> WORD_ID_129
< ^ [9] > --> WORD_ID_130
< ^ [:] > --> WORD_ID_131
< ^ [;] > --> WORD_ID_132
< ^ [<] > --> WORD_ID_133
< ^ [=] > --> WORD_ID_134
< ^ [>] > --> WORD_ID_135
< ^ [?] > --> WORD_ID_136
< ^ [@] > --> WORD_ID_137
< ^ [\[] > --> WORD_ID_138
< ^ [\x5C] > --> WORD_ID_139
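The mapping from a vocab entry to a rule can be sketched as follows. This is not the real vocab_to_fa_lex.py: `token_to_rule` and `rules_from_vocab` are hypothetical names, the escape table covers only the escapes visible in the rules above, and `##` continuation pieces are skipped because their rule form is not shown here. The ID is assumed to be the zero-based line index in vocab.txt.

```python
# Escapes that appear in the rules above; other characters are assumed safe.
ESCAPES = {"[": "\\[", "]": "\\]", "-": "\\-", "\\": "\\x5C"}


def token_to_rule(token, token_id):
    """Build a rule for a whole-word vocab entry in the style shown above."""
    chars = "".join(f"[{ESCAPES.get(ch, ch)}]" for ch in token)
    return f"< ^ {chars} > --> WORD_ID_{token_id}"


def rules_from_vocab(vocab_lines):
    """Yield one rule per whole-word entry; the ID is the zero-based line index."""
    for token_id, line in enumerate(vocab_lines):
        token = line.rstrip("\n")
        # Assumption: "##" pieces need a different rule form, so skip them here.
        if not token or token.startswith("##"):
            continue
        yield token_to_rule(token, token_id)


print(token_to_rule("[UNK]", 100))
```

The printed rule matches the first line of the listing above, which is a quick way to check that the escaping agrees with the generated file.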
Now make sure that the new vocab.falex file is included in your main tokenization grammar file, wbd.lex.utf8. Note that the path for the _include starts at ldbsrc, so after updating the path you should see something like this in your wbd.lex.utf8:
...
_function FnTokWord
_include bert_chinese/vocab.falex
_end
Assuming you are in the ldbsrc directory, type this:
make -f Makefile.gnu lang=bert_chinese all
Because the machine is quite complex, compilation can take a while: 2-4 hours for the existing BERT files. During compilation, make sure that no "ERROR:" messages are printed. If you encounter any, do not use the bin file, even if one was created.
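Since the build runs for hours, it can help to capture its output and scan it for the "ERROR:" marker afterwards. The sketch below wraps the make command from this tutorial; the function names are made up for illustration, and it assumes that both stdout and stderr should be checked.

```python
import subprocess


def error_lines(log_text):
    """Return every log line that contains the "ERROR:" marker."""
    return [line for line in log_text.splitlines() if "ERROR:" in line]


def build(lang):
    """Run the make command from this tutorial; return (exit code, ERROR lines)."""
    proc = subprocess.run(
        ["make", "-f", "Makefile.gnu", f"lang={lang}", "all"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so no message is missed
        text=True,
    )
    return proc.returncode, error_lines(proc.stdout)
```

Calling `build("bert_chinese")` from the ldbsrc directory and getting back an empty list of ERROR lines would be the signal that the bin file can be used.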
Sometimes you need to be able to tell why you are getting these IDs and not those. You can use the fa_lex command line tool to see how the text was segmented, where the main words are, and what the sub-tokens are for each word.
Try this:
printf 'Heung-Yeung "Harry" Shum (Chinese: 沈向洋; born in October 1966) is a computer scientist of Chinese origin.' | fa_lex --ldb=ldb/bert_chinese.bin --tagset=bert_chinese/wbd.tagset.txt --normalize-input
You should get:
heung/WORD he/WORD_ID_9245 ung/WORD_ID_9112 -/WORD -/WORD_ID_118 yeung/WORD y/WORD_ID_167 e/WORD_ID_8154 ung/WORD_ID_9112 "/WORD "/WORD_ID_107 harry/WORD harry/WORD_ID_12296 "/WORD "/WORD_ID_107 shum/WORD sh/WORD_ID_11167 um/WORD_ID_8545 (/WORD (/WORD_ID_113 chinese/WORD chinese/WORD_ID_10101 :/WORD :/WORD_ID_131 沈/WORD 沈/WORD_ID_3755 向/WORD 向/WORD_ID_1403 洋/WORD 洋/WORD_ID_3817 ;/WORD ;/WORD_ID_132 born/WORD bo/WORD_ID_11059 rn/WORD_ID_9256 in/WORD in/WORD_ID_8217 october/WORD october/WORD_ID_9548 1966/WORD 1966/WORD_ID_9093 )/WORD )/WORD_ID_114 is/WORD is/WORD_ID_8310 a/WORD a/WORD_ID_143 computer/WORD com/WORD_ID_8134 put/WORD_ID_11300 er/WORD_ID_8196 scientist/WORD sci/WORD_ID_11776 ent/WORD_ID_8936 ist/WORD_ID_9527 of/WORD of/WORD_ID_8205 chinese/WORD chinese/WORD_ID_10101 origin/WORD or/WORD_ID_8549 ig/WORD_ID_11421 in/WORD_ID_8277 ./WORD ./WORD_ID_119
The format is self-explanatory: each WORD is followed by its sub-tokens with a WORD_ID_NNNN tag, where NNNN is the ID that the API will return.
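When the output is long, grouping the sub-token IDs under their word makes it easier to read. Here is a small sketch of such a parser; `group_subtokens` is a hypothetical helper, and it assumes the whitespace-separated text/TAG layout shown above.

```python
def group_subtokens(fa_lex_output):
    """Return a list of (word, [sub-token ids]) pairs from fa_lex output text."""
    groups = []
    for item in fa_lex_output.split():
        if "/" not in item:
            continue  # not a text/TAG pair
        # Split on the last "/" so a token that is itself "/" still parses.
        text, tag = item.rsplit("/", 1)
        if tag == "WORD":
            groups.append((text, []))
        elif tag.startswith("WORD_ID_") and groups:
            groups[-1][1].append(int(tag[len("WORD_ID_"):]))
    return groups


sample = "heung/WORD he/WORD_ID_9245 ung/WORD_ID_9112 -/WORD -/WORD_ID_118"
print(group_subtokens(sample))
```

For the sample, this groups 9245 and 9112 under "heung" and 118 under "-".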
If the problem you see is in the main word tokenization, you can comment out the subtoken rules, then change and recompile the grammar much faster (in minutes) until the error is fixed. After that, uncomment the subtoken logic and do a full final recompile.
...
_function FnTokWord
# comment subtoken rules for faster compilation
# _include bert_chinese/vocab.falex
_end
You can run your model on a large text (should be in UTF-8 encoding) as follows:
printf 'Heung-Yeung "Harry" Shum (Chinese: 沈向洋; born in October 1966) is a computer scientist of Chinese origin.' | python test_bling.py -m bert_chinese.bin
Heung-Yeung "Harry" Shum (Chinese: 沈向洋; born in October 1966) is a computer scientist of Chinese origin.
[ 9245 9112 118 167 8154 9112 107 12296 107 11167 8545 113 10101 131 3755 1403 3817 132 11059 9256 8217 9548 9093 114 8310 143 8134 11300 8196 11776 8936 9527 8205 10101 8549 11421 8277 119 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
For a text file:
cat test.txt | python test_bling.py -m ldb/bert_chinese.bin > test.bling.chinese.txt
You can compare the output to that of the original BERT tokenizer code as follows:
cat test.txt | python test_bert.py > test.bert.multi_cased.txt
Please make sure that test_bert.py uses the correct vocab.txt file and drop_case setting.
Then just diff them with your favorite diff program.
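If no diff program is at hand, a few lines of Python can list the first mismatching lines. This is a minimal sketch; `first_differences` is a made-up helper, and it assumes both files hold one tokenized document per line in the same format.

```python
import itertools


def first_differences(lines_a, lines_b, limit=10):
    """Return up to `limit` (line number, a, b) triples where the outputs differ."""
    diffs = []
    pairs = itertools.zip_longest(lines_a, lines_b, fillvalue="")
    for number, (a, b) in enumerate(pairs, start=1):
        # Compare whitespace-normalized lines so spacing differences are ignored.
        if a.split() != b.split():
            diffs.append((number, a.rstrip("\n"), b.rstrip("\n")))
            if len(diffs) == limit:
                break
    return diffs


print(first_differences(["1 2 3\n", "4 5\n"], ["1  2 3\n", "4 6\n"]))
```

For the two files from this tutorial, the inputs would be the lines read from test.bling.chinese.txt and test.bert.multi_cased.txt.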
ProcTime = Min({Times No Output}) - Min({Times No Process})
ProcTime = 1.12 s
ProcSpeed = DataSize / ProcTime
ProcSpeed = 3.75 MB/s
(venv) sergeio@Semi-Structure-Store-Sergei-VM:~/BlingFire/BlingFire/Release$ time -p cat test.txt | python test_bling.py -m bert_chinese.bin -s 1
real 1.38
user 2.66
sys 0.99
(venv) sergeio@Semi-Structure-Store-Sergei-VM:~/BlingFire/BlingFire/Release$ time -p cat test.txt | python test_bling.py -m bert_chinese.bin -s 1
real 1.31
user 2.42
sys 0.85
(venv) sergeio@Semi-Structure-Store-Sergei-VM:~/BlingFire/BlingFire/Release$ time -p cat test.txt | python test_bling.py -m bert_chinese.bin -s 1
real 1.36
user 2.46
sys 0.82
(venv) sergeio@Semi-Structure-Store-Sergei-VM:~/BlingFire/BlingFire/Release$ time -p cat test.txt | python test_bling.py -m bert_chinese.bin -s 1 -n 1
real 0.19
user 1.22
sys 0.89
(venv) sergeio@Semi-Structure-Store-Sergei-VM:~/BlingFire/BlingFire/Release$ time -p cat test.txt | python test_bling.py -m bert_chinese.bin -s 1 -n 1
real 0.19
user 1.22
sys 0.84
(venv) sergeio@Semi-Structure-Store-Sergei-VM:~/BlingFire/BlingFire/Release$ time -p cat test.txt | python test_bling.py -m bert_chinese.bin -s 1 -n 1
real 0.19
user 1.23
sys 0.88
(venv) sergeio@Semi-Structure-Store-Sergei-VM:~/BlingFire/BlingFire/Release$ ls -lh test.txt
-rw-rw-r-- 1 sergeio sergeio 4.2M Jul 12 20:08 test.txt
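The two figures can be reproduced from the timings in the session. The sketch below reads the runs with -s 1 as the "no output" timings and the runs with -s 1 -n 1 as the "no process" timings, which is an interpretation of the flags rather than something stated here.

```python
times_no_output = [1.38, 1.31, 1.36]   # "real" seconds, runs with -s 1
times_no_process = [0.19, 0.19, 0.19]  # "real" seconds, runs with -s 1 -n 1
data_size_mb = 4.2                     # test.txt size reported by ls -lh

# Subtract the fixed cost (startup, model load, reading input) from the full run.
proc_time = min(times_no_output) - min(times_no_process)
proc_speed = data_size_mb / proc_time

print(f"ProcTime = {proc_time:.2f} s")
print(f"ProcSpeed = {proc_speed:.2f} MB/s")
```

Taking the minimum of each group filters out noise from other processes on the machine.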
Since the code is written in C++ and is not subject to the Global Interpreter Lock, you can process your text in parallel. The models are thread-safe, so you don't need to keep a pool of them. In a production setting we observed latency below 1 ms per document when used with a parallel for loop (called from C++).
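From Python, one shared model and a thread pool is a possible way to try this out. The sketch below keeps the tokenizer as a parameter so it runs without a model file. `tokenize_all` is a hypothetical helper, and the blingfire wiring in the comment is an assumption, including the maximum length of 128 and the unknown-token ID of 100. Whether threads give a real speedup from Python depends on the binding releasing the GIL during the native call, which is not confirmed here.

```python
from concurrent.futures import ThreadPoolExecutor


def tokenize_all(docs, tokenize, workers=4):
    """Apply `tokenize` to every document on a thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(tokenize, docs))


# Hypothetical wiring with the blingfire Python package (not executed here):
#   import blingfire
#   h = blingfire.load_model("ldb/bert_chinese.bin")  # one shared model handle
#   ids = tokenize_all(docs, lambda s: blingfire.text_to_ids(h, s, 128, 100))
#   blingfire.free_model(h)

# Stand-in tokenizer so the sketch runs without a model file.
print(tokenize_all(["a b", "c"], str.split, workers=2))
```

A single handle is shared by all workers, in line with the note above that the models are thread-safe.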