How to add a new BERT tokenizer model

SergeiAlonichau edited this page Aug 16, 2019 · 21 revisions

We assume the Bling Fire tools are already compiled and the PATH is set.

Initial Steps

  1. Create a new directory under ldbsrc:

cd ldbsrc
mkdir bert_chinese

  2. Copy the content of an existing model similar to yours into the new directory:

cp bert_base_tok/* bert_chinese

  3. Modify options.small to use a new output name for your bin file:

OUTPUT = bert_chinese.bin

USE_CHARMAP = 1

opt_build_wbd = --dict-root=. --full-unicode

opt_pack_wbd_fsa = --alg=triv --type=moore-dfa --remap-iws --use-iwia
opt_pack_wbd_mmap = --alg=triv --type=mmap
opt_pack_charmap = --alg=fixed --type=mmap --imp-mmap

resources = \
        $(tmpdir)/wbd.fsa.$(mode).dump \
        $(tmpdir)/wbd.mmap.$(mode).dump \
        $(tmpdir)/charmap.mmap.$(mode).dump \

Disable Normalization

If you don't want to use character normalization such as case folding and accent removal, then you need to remove the charmap.utf8 compilation from the options.small and ldb.conf.small files:

options.small

OUTPUT = bert_chinese_no_normalization.bin

opt_build_wbd = --dict-root=. --full-unicode

opt_pack_wbd_fsa = --alg=triv --type=moore-dfa --remap-iws --use-iwia
opt_pack_wbd_mmap = --alg=triv --type=mmap

resources = \
        $(tmpdir)/wbd.fsa.$(mode).dump \
        $(tmpdir)/wbd.mmap.$(mode).dump \

ldb.conf.small

[wbd]
max-depth 4
xword 2
seg 3
ignore 4
fsm 1
multi-map-mode triv-dump
multi-map 2
# charmap 3

Enable Normalization

If you need normalization such as case folding, dropping of accents, or something else, you can generate your own charmap.utf8 file. Each line of the file has the format: an input character, a space, and the output sequence. The output sequence is 0 or more characters long, usually 1. If there is no entry found for a character, it remains unchanged. If the output sequence length is 0 (an empty string), the character is deleted.

Example:

# A --> a
\x0041 \x0061

# B --> b
\x0042 \x0062

# C --> c
\x0043 \x0063

# D --> d
\x0044 \x0064

# E --> e
\x0045 \x0065

# F --> f
\x0046 \x0066

# G --> g
\x0047 \x0067

# H --> h
\x0048 \x0068
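To illustrate the format, here is a hedged sketch in Python of how such a table could be parsed and applied. These are hypothetical helper functions for explanation only; the actual tools compile charmap.utf8 into a binary mmap resource.

```python
# Sketch: read "<input> <output>" pairs in \xXXXX notation, skipping
# '#' comment lines and blank lines, then apply the mapping per character.

def _to_chars(tok):
    # "\x0041" -> "A"; "" -> "" (an empty output means deletion)
    return "".join(chr(int(cp, 16)) for cp in tok.replace("\\x", " ").split())

def parse_charmap(lines):
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(" ", 1)
        src = _to_chars(parts[0])
        dst = _to_chars(parts[1]) if len(parts) > 1 else ""
        table[src] = dst
    return table

def apply_charmap(table, text):
    # Characters without an entry remain unchanged.
    return "".join(table.get(ch, ch) for ch in text)

charmap = parse_charmap([r"\x0041 \x0061", r"\x0042 \x0062", r"# comment", r"\x0043"])
print(apply_charmap(charmap, "ABCabc"))  # prints "ababc": C has an empty output and is deleted
```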

It is easy to generate the charmap you need with a script. For BERT case-folded models we use this command line:

python gen_charmap.py > charmap.utf8
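gen_charmap.py is part of the repository; as a rough illustration, a script with a similar effect could use Python's unicodedata to lowercase and strip combining accents, emitting pairs in the format described above. This is a hypothetical sketch, not the actual gen_charmap.py.

```python
# Sketch: emit case-folding + accent-removal mappings in charmap.utf8 format.
import sys
import unicodedata

def fold(ch):
    # Lowercase, then drop the combining marks left by NFD decomposition.
    decomposed = unicodedata.normalize("NFD", ch.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def emit(out=sys.stdout):
    for cp in range(0x0041, 0x0250):  # sketch: Latin ranges only
        ch = chr(cp)
        folded = fold(ch)
        if folded != ch:
            src = r"\x%04X" % cp
            dst = " ".join(r"\x%04X" % ord(c) for c in folded)
            out.write("# %s --> %s\n%s %s\n" % (ch, folded, src, dst))

if __name__ == "__main__":
    emit()
```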

After charmap.utf8 is created, make sure options.small and ldb.conf contain the compilation options and the resource reference for charmap.utf8 as before (see the bert_base_tok directory).

Add a New vocab.txt

If your model uses a different vocab.txt file, you need to convert it into the "fa_lex" format. (Computation of vocab.txt itself is outside the scope of this tutorial.) We have a simple helper script for this: python vocab_to_fa_lex.py

Note that the script will create a new wbd.tagset.txt file and a vocab.falex file. vocab.falex contains all words converted into fa_lex rules; these rules are applied once the tokenizer finds a full token. Each rule maps a token to a unique tag value, which is the same as the ID in the original vocab.txt file. This way we marry the tokenizer and the dictionary lookup into one finite-state machine.
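The conversion can be sketched as follows. This is an illustrative sketch only; the real vocab_to_fa_lex.py may escape more characters (e.g. it writes the backslash as \x5C) and handle special cases such as ##-prefixed word pieces differently.

```python
# Sketch: turn vocab.txt entries into fa_lex rules, one character class
# per character, with an assumed set of characters escaped inside [ ].
SPECIAL = set(r"\^$.[]|()?*+-<>")  # assumption: escaped inside brackets

def char_class(ch):
    return "[" + ("\\" + ch if ch in SPECIAL else ch) + "]"

def vocab_to_rules(vocab, first_id=100):
    # first_id mirrors the example below, where [UNK] maps to WORD_ID_100
    rules = []
    for i, token in enumerate(vocab):
        body = "".join(char_class(c) for c in token)
        rules.append(" < ^ %s > --> WORD_ID_%d" % (body, first_id + i))
    return rules

for rule in vocab_to_rules(["[UNK]", "[CLS]", "!"]):
    print(rule)
```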

 < ^ [\[][U][N][K][\]] > --> WORD_ID_100
 < ^ [\[][C][L][S][\]] > --> WORD_ID_101
 < ^ [\[][S][E][P][\]] > --> WORD_ID_102
 < ^ [\[][M][A][S][K][\]] > --> WORD_ID_103
 < ^ [<][S][>] > --> WORD_ID_104
 < ^ [<][T][>] > --> WORD_ID_105
 < ^ [!] > --> WORD_ID_106
 < ^ ["] > --> WORD_ID_107
 < ^ [#] > --> WORD_ID_108
 < ^ [$] > --> WORD_ID_109
 < ^ [%] > --> WORD_ID_110
 < ^ [&] > --> WORD_ID_111
 < ^ ['] > --> WORD_ID_112
 < ^ [(] > --> WORD_ID_113
 < ^ [)] > --> WORD_ID_114
 < ^ [*] > --> WORD_ID_115
 < ^ [+] > --> WORD_ID_116
 < ^ [,] > --> WORD_ID_117
 < ^ [\-] > --> WORD_ID_118
 < ^ [.] > --> WORD_ID_119
 < ^ [/] > --> WORD_ID_120
 < ^ [0] > --> WORD_ID_121
 < ^ [1] > --> WORD_ID_122
 < ^ [2] > --> WORD_ID_123
 < ^ [3] > --> WORD_ID_124
 < ^ [4] > --> WORD_ID_125
 < ^ [5] > --> WORD_ID_126
 < ^ [6] > --> WORD_ID_127
 < ^ [7] > --> WORD_ID_128
 < ^ [8] > --> WORD_ID_129
 < ^ [9] > --> WORD_ID_130
 < ^ [:] > --> WORD_ID_131
 < ^ [;] > --> WORD_ID_132
 < ^ [<] > --> WORD_ID_133
 < ^ [=] > --> WORD_ID_134
 < ^ [>] > --> WORD_ID_135
 < ^ [?] > --> WORD_ID_136
 < ^ [@] > --> WORD_ID_137
 < ^ [\[] > --> WORD_ID_138
 < ^ [\x5C] > --> WORD_ID_139