🚨🚨 🚨🚨 [Tokenizer] attemp to fix add_token issues🚨🚨 🚨🚨 #23909

Merged: 268 commits, Sep 18, 2023

Commits (268)
4aea714
fix test for bart. Order is correct now let's skip BPEs
ArthurZucker Jun 2, 2023
0df212f
ouf
ArthurZucker Jun 2, 2023
c04bd58
styling
ArthurZucker Jun 2, 2023
4a54920
fix bert....
ArthurZucker Jun 2, 2023
189b259
slow refactoring
ArthurZucker Jun 3, 2023
f1c362a
current updates
ArthurZucker Jun 14, 2023
f5b178a
Merge branch 'main' of https://github.com/huggingface/transformers in…
ArthurZucker Jun 15, 2023
0047eb9
massive refactoring
ArthurZucker Jun 16, 2023
0e34205
update
ArthurZucker Jun 16, 2023
81e73d7
NICE!
ArthurZucker Jun 19, 2023
1f3ff93
update to see where I am at
ArthurZucker Jun 19, 2023
3701e26
updates
ArthurZucker Jun 20, 2023
0369a51
update
ArthurZucker Jun 24, 2023
b621c96
update
ArthurZucker Jun 25, 2023
83feef4
Merge branch 'main' of https://github.com/huggingface/transformers in…
ArthurZucker Jun 25, 2023
2e7b733
revert
ArthurZucker Jun 26, 2023
582bd26
Merge branch 'main' of https://github.com/huggingface/transformers in…
ArthurZucker Jul 26, 2023
4c78792
updates
ArthurZucker Jul 26, 2023
25b9836
updates
ArthurZucker Jul 26, 2023
7759304
start supporting legacy_save
ArthurZucker Jul 26, 2023
53b29e5
styling
ArthurZucker Jul 27, 2023
a7ab994
big update
ArthurZucker Jul 27, 2023
6fb6b31
revert some changes
ArthurZucker Jul 27, 2023
9b2e138
nits
ArthurZucker Jul 27, 2023
dffd6d8
nniiiiiice
ArthurZucker Jul 27, 2023
b86bb3a
small fixes
ArthurZucker Jul 27, 2023
b9bb598
kinda fix t5 with new behaviour
ArthurZucker Jul 27, 2023
55ea2f5
major update
ArthurZucker Jul 27, 2023
ff13cb1
fixup
ArthurZucker Jul 27, 2023
5b46dc9
fix copies
ArthurZucker Jul 27, 2023
70252e5
today's updates
ArthurZucker Jul 28, 2023
d76b414
fix byt5
ArthurZucker Jul 28, 2023
496679b
upfate
ArthurZucker Jul 28, 2023
521907d
update
ArthurZucker Jul 28, 2023
1d4e947
update
ArthurZucker Jul 28, 2023
6ae5e51
updates
ArthurZucker Jul 28, 2023
c42800e
update vocab size test
ArthurZucker Jul 31, 2023
754b219
Barthez does not use not need the fairseq offset ids
ArthurZucker Jul 31, 2023
80ff47a
super calll must be after
ArthurZucker Jul 31, 2023
14726cc
calll super
ArthurZucker Jul 31, 2023
1f24347
move all super init
ArthurZucker Jul 31, 2023
d2ed9c6
move other super init
ArthurZucker Aug 3, 2023
74b22e1
fixup
ArthurZucker Aug 3, 2023
162e61d
nits
ArthurZucker Aug 3, 2023
1643432
more fixes
ArthurZucker Aug 3, 2023
5db9604
nits
ArthurZucker Aug 3, 2023
957cb54
more fixes
ArthurZucker Aug 3, 2023
29e32f6
Merge branch 'main' of https://github.com/huggingface/transformers in…
ArthurZucker Aug 3, 2023
f0de438
nits
ArthurZucker Aug 3, 2023
d31acd2
more fix
ArthurZucker Aug 3, 2023
a6f0094
remove useless files
ArthurZucker Aug 3, 2023
38b8942
ouch all of them are affected
ArthurZucker Aug 3, 2023
8096378
and more!
ArthurZucker Aug 3, 2023
2d55c90
small imporvements
ArthurZucker Aug 3, 2023
31ac52d
no more sanitize token
ArthurZucker Aug 3, 2023
da42a86
more changes around unique no split tokens
ArthurZucker Aug 3, 2023
e34c6ea
partially fix more things
ArthurZucker Aug 4, 2023
becb4a5
keep legacy save but add warning
ArthurZucker Aug 4, 2023
2ddcc28
so... more fixes
ArthurZucker Aug 4, 2023
7bd0b15
updates
ArthurZucker Aug 4, 2023
365f815
guess deberta tokenizer could be nuked
ArthurZucker Aug 4, 2023
5ff1e04
fixup
ArthurZucker Aug 4, 2023
9284bbf
fixup did some bad things
ArthurZucker Aug 4, 2023
dbdafeb
nuke it if it breaks
ArthurZucker Aug 4, 2023
83310c9
remove prints and pretrain fast from slow with new format.
ArthurZucker Aug 4, 2023
9a461f5
fixups
ArthurZucker Aug 4, 2023
015e796
Apply suggestions from code review
ArthurZucker Aug 4, 2023
6e17f4e
fiou
ArthurZucker Aug 4, 2023
05e1c56
Merge branch 'fix-add-tokens' of https://github.com/ArthurZucker/tran…
ArthurZucker Aug 4, 2023
181c112
nit
ArthurZucker Aug 4, 2023
e9887e8
by default specials should not be normalized?
ArthurZucker Aug 4, 2023
df035b7
update
ArthurZucker Aug 4, 2023
d7a2458
remove brakpoint
ArthurZucker Aug 4, 2023
f5c8f2c
updates
ArthurZucker Aug 7, 2023
64237fc
a lot of updates
ArthurZucker Aug 7, 2023
fc01d58
fixup
ArthurZucker Aug 7, 2023
f101206
fixes revert some changes to match fast
ArthurZucker Aug 7, 2023
4834e64
small nits
ArthurZucker Aug 7, 2023
1acbe48
that makes it cleaner
ArthurZucker Aug 8, 2023
4c9a61e
fix camembert accordingly
ArthurZucker Aug 8, 2023
62b98b0
update
ArthurZucker Aug 8, 2023
62cedf3
some lest breaking changes
ArthurZucker Aug 8, 2023
395259a
update
ArthurZucker Aug 8, 2023
f4b3c85
fixup
ArthurZucker Aug 8, 2023
e6c7b28
fix byt5 and whisper mostly
ArthurZucker Aug 8, 2023
8626851
some more fixes, canine's byte vocab
ArthurZucker Aug 8, 2023
0d4e360
fix gpt2
ArthurZucker Aug 8, 2023
d639ea6
fix most of the perceiver tests (4 left)
ArthurZucker Aug 8, 2023
54a7ac8
fix layout lmv3
ArthurZucker Aug 8, 2023
e2b04c5
fixup
ArthurZucker Aug 8, 2023
70ca39b
fix copies for gpt2 style
ArthurZucker Aug 8, 2023
cd248db
make sure to only warn once
ArthurZucker Aug 8, 2023
02185cf
fix perciever and gpt2 tests
ArthurZucker Aug 8, 2023
7105eea
some more backward compatibility: also read special tokens map becaus…
ArthurZucker Aug 8, 2023
8f5d96b
fixup
ArthurZucker Aug 8, 2023
9f4f1e4
add else when reading
ArthurZucker Aug 8, 2023
f0c41a1
nits
ArthurZucker Aug 8, 2023
2812c70
fresh updates
ArthurZucker Aug 8, 2023
984a307
fix copies
ArthurZucker Aug 8, 2023
2fc2349
will this make everything faster?
ArthurZucker Aug 8, 2023
3861d1d
fixes
ArthurZucker Aug 8, 2023
0e20614
more fixes
ArthurZucker Aug 8, 2023
6dbb814
update
ArthurZucker Aug 8, 2023
40270d8
more fixes
ArthurZucker Aug 8, 2023
7b92445
fixup
ArthurZucker Aug 8, 2023
7b9d373
is the source of truth right?
ArthurZucker Aug 9, 2023
6631d42
Merge branch 'main' of https://github.com/huggingface/transformers in…
ArthurZucker Aug 9, 2023
58abfe7
sorry camembert for the troubles
ArthurZucker Aug 9, 2023
fd4d7f8
current updates
ArthurZucker Aug 9, 2023
8e1225f
fixup
ArthurZucker Aug 9, 2023
98c869f
update led
ArthurZucker Aug 9, 2023
da0414c
update
ArthurZucker Aug 9, 2023
b1b2779
fix regression
ArthurZucker Aug 9, 2023
161e09b
fix single word
ArthurZucker Aug 9, 2023
0ca9c8a
more model specific fixes
ArthurZucker Aug 9, 2023
87bc36d
fix t5 tests
ArthurZucker Aug 9, 2023
63dce31
fixup
ArthurZucker Aug 9, 2023
6c6f70a
more comments
ArthurZucker Aug 9, 2023
f5252a1
update
ArthurZucker Aug 9, 2023
c7c8ebe
Merge branch 'main' of https://github.com/huggingface/transformers in…
ArthurZucker Aug 9, 2023
38a013f
fix nllb
ArthurZucker Aug 9, 2023
3026e49
rstrip removed
ArthurZucker Aug 9, 2023
d3c9024
small fixes
ArthurZucker Aug 9, 2023
5242be7
better handle additional_special_tokens and vocab sizes
ArthurZucker Aug 9, 2023
769e215
fixing
ArthurZucker Aug 9, 2023
bf2cb8a
styling
ArthurZucker Aug 9, 2023
d840605
fix 4 / 21
ArthurZucker Aug 9, 2023
3bf7591
fixup
ArthurZucker Aug 9, 2023
40d55f7
fix nlbb's tests
ArthurZucker Aug 9, 2023
0b65a21
some fixes
ArthurZucker Aug 9, 2023
7a706a6
fix t5
ArthurZucker Aug 9, 2023
53d4b13
fixes
ArthurZucker Aug 9, 2023
448cfd4
style
ArthurZucker Aug 9, 2023
f521b60
fix canine tests
ArthurZucker Aug 10, 2023
384ea05
damn this is nice
ArthurZucker Aug 10, 2023
5c05f4a
nits
ArthurZucker Aug 10, 2023
1b1aad0
m2m100 nit
ArthurZucker Aug 10, 2023
6961910
fixups
ArthurZucker Aug 11, 2023
969d46e
fixes!
ArthurZucker Aug 11, 2023
d957860
fixup
ArthurZucker Aug 11, 2023
c9732fb
stash
ArthurZucker Aug 16, 2023
dd99191
Merge branch 'main' of https://github.com/huggingface/transformers in…
ArthurZucker Aug 18, 2023
463cbb4
fix merge
ArthurZucker Aug 18, 2023
965bca1
Merge branch 'main' of https://github.com/huggingface/transformers in…
ArthurZucker Aug 18, 2023
3a79633
revert bad change
ArthurZucker Aug 21, 2023
bf7ba15
Merge branch 'main' of https://github.com/huggingface/transformers in…
ArthurZucker Aug 31, 2023
717b4b9
fixup
ArthurZucker Aug 31, 2023
3d10fb0
correct order for code Llama
ArthurZucker Aug 31, 2023
78d4215
fix speecht5 post merge
ArthurZucker Aug 31, 2023
3605c75
styling
ArthurZucker Aug 31, 2023
abfe019
revert source of 11 fails
ArthurZucker Aug 31, 2023
677fc8e
small nits
ArthurZucker Aug 31, 2023
548e062
all changes in one go
ArthurZucker Sep 1, 2023
07e4d1e
fnet hack
ArthurZucker Sep 1, 2023
b871e7e
fix 2 more tests
ArthurZucker Sep 1, 2023
1520d14
Merge branch 'main' of https://github.com/huggingface/transformers in…
ArthurZucker Sep 1, 2023
136c877
update based on main branch of tokenizers
ArthurZucker Sep 5, 2023
b6a6ec7
Merge branch 'main' of https://github.com/huggingface/transformers in…
ArthurZucker Sep 5, 2023
a304a3c
fixup
ArthurZucker Sep 5, 2023
a9f47e2
fix VITS issues
ArthurZucker Sep 5, 2023
47387ec
more fixes
ArthurZucker Sep 5, 2023
c8acd2c
Merge branch 'main' of https://github.com/huggingface/transformers in…
ArthurZucker Sep 5, 2023
0c584d7
fix mgp test
ArthurZucker Sep 5, 2023
de0df2f
fix camembert issues
ArthurZucker Sep 5, 2023
fa2c424
oups camembert still has 2 failing tests
ArthurZucker Sep 5, 2023
4498ad8
Merge branch 'main' of https://github.com/huggingface/transformers in…
ArthurZucker Sep 5, 2023
d4de85c
mluke fixes
ArthurZucker Sep 5, 2023
6df2fe3
decode fixes
ArthurZucker Sep 5, 2023
4167a74
small nits
ArthurZucker Sep 5, 2023
d3e1bf1
nits
ArthurZucker Sep 5, 2023
0d65b83
fix llama and vits
ArthurZucker Sep 6, 2023
a52bb94
fix camembert
ArthurZucker Sep 6, 2023
bf58e4e
smal nits
ArthurZucker Sep 6, 2023
25adcb8
more fixes when initialising a fast from a slow and etc
ArthurZucker Sep 6, 2023
fc799f1
fix one of the last test
ArthurZucker Sep 6, 2023
205a6ed
Merge branch 'main' of https://github.com/huggingface/transformers in…
ArthurZucker Sep 6, 2023
ecbee31
fix CPM tokenizer test
ArthurZucker Sep 6, 2023
6170200
fixups
ArthurZucker Sep 6, 2023
e384d13
fix pop2piano
ArthurZucker Sep 6, 2023
b69895c
fixup
ArthurZucker Sep 6, 2023
c0c77a5
⚠️ Change tokenizers required version ⚠️
ArthurZucker Sep 7, 2023
957329f
⚠️ Change tokenizers required version ⚠️
ArthurZucker Sep 7, 2023
ae48fb2
"tokenizers>=0.14,<0.15", don't forget smaller than
ArthurZucker Sep 7, 2023
8e22a05
Merge branch 'main' of https://github.com/huggingface/transformers in…
ArthurZucker Sep 7, 2023
2b69acb
fix musicgen tests and pretraiendtokenizerfast
ArthurZucker Sep 7, 2023
abefa05
fix owlvit and all
ArthurZucker Sep 7, 2023
52c2f56
update t5
ArthurZucker Sep 7, 2023
ddb4a2b
fix 800 red
ArthurZucker Sep 7, 2023
6029ee3
fix tests
ArthurZucker Sep 7, 2023
3bebdc1
fix the fix of the fix of t5
ArthurZucker Sep 7, 2023
b4aa11f
styling
ArthurZucker Sep 7, 2023
0f7b941
documentation nits
ArthurZucker Sep 11, 2023
ca33d35
Skip failing whisper test by Merge branch 'main' of https://github.co…
ArthurZucker Sep 11, 2023
9ae3380
cache _added_tokens_encoder
ArthurZucker Sep 11, 2023
f7d9a6f
fixups
ArthurZucker Sep 11, 2023
5136647
Nit
ArthurZucker Sep 11, 2023
ca4835c
fix red tests
ArthurZucker Sep 11, 2023
c03e549
one last nit!
ArthurZucker Sep 11, 2023
8569508
make eveything a lot simpler
ArthurZucker Sep 11, 2023
a5dcdab
Now it's over :wink:
ArthurZucker Sep 11, 2023
f8638b3
few small nits
ArthurZucker Sep 11, 2023
68b5f54
Apply suggestions from code review
ArthurZucker Sep 11, 2023
de40bf2
updates that work for now
ArthurZucker Sep 11, 2023
9c6e48f
Merge branch 'fix-add-tokens' of https://github.com/arthurzucker/tran…
ArthurZucker Sep 11, 2023
0c661ec
tests that should no be skipped / changed and fixed next
ArthurZucker Sep 11, 2023
3bbe7f0
fixup
ArthurZucker Sep 11, 2023
9b22a80
i am ashamed
ArthurZucker Sep 11, 2023
f75d9bd
pushe the fix
ArthurZucker Sep 12, 2023
23d08be
update
ArthurZucker Sep 12, 2023
b68d937
fixups
ArthurZucker Sep 12, 2023
39ec344
nits
ArthurZucker Sep 12, 2023
bdb36a1
fix added_tokens_encoder
ArthurZucker Sep 12, 2023
60f7b70
fix canine test
ArthurZucker Sep 12, 2023
8ef282f
fix pegasus vocab
ArthurZucker Sep 12, 2023
3b82eac
fix transfoXL
ArthurZucker Sep 12, 2023
719ce61
fixup
ArthurZucker Sep 12, 2023
3555fc6
whisper needs to be fixed for train new
ArthurZucker Sep 12, 2023
f206fd1
pegasus nits
ArthurZucker Sep 12, 2023
cf42578
more pegasus fixes
ArthurZucker Sep 12, 2023
03de2ec
minor update
ArthurZucker Sep 12, 2023
a21d611
better error message in failed test
ArthurZucker Sep 12, 2023
604a677
fix whisper failing test
ArthurZucker Sep 12, 2023
f062446
fix whisper failing test
ArthurZucker Sep 12, 2023
d2367f1
fix pegasus
ArthurZucker Sep 12, 2023
04de249
fixup
ArthurZucker Sep 12, 2023
d1ccd62
fix **** pegasus
ArthurZucker Sep 12, 2023
ea83357
reset things
ArthurZucker Sep 12, 2023
cfd2e82
remove another file
ArthurZucker Sep 12, 2023
650d403
attempts to fix the strange custome encoder and offset
ArthurZucker Sep 12, 2023
6896a75
nits here and there
ArthurZucker Sep 12, 2023
d73fac1
update
ArthurZucker Sep 12, 2023
55fa70c
fixup
ArthurZucker Sep 12, 2023
a8cd1f0
nit
ArthurZucker Sep 12, 2023
a95460c
fix the whisper test
ArthurZucker Sep 13, 2023
5292ea6
nits nits
ArthurZucker Sep 13, 2023
6e6a0d6
Apply suggestions from code review
ArthurZucker Sep 14, 2023
d519be5
updates based on review
ArthurZucker Sep 14, 2023
589be23
some small update to potentially remove
ArthurZucker Sep 14, 2023
d20e8d7
Merge branch 'fix-add-tokens' of github.com:ArthurZucker/transformers…
ArthurZucker Sep 14, 2023
1f007bf
nits
ArthurZucker Sep 14, 2023
459b082
Merge branch 'main' of github.com:huggingface/transformers into fix-a…
ArthurZucker Sep 14, 2023
1155d3c
import rlu cache
ArthurZucker Sep 14, 2023
8516d1c
Update src/transformers/tokenization_utils_base.py
ArthurZucker Sep 18, 2023
db19f2b
move warning to `from_pretrained`
ArthurZucker Sep 18, 2023
70f303b
Merge branch 'main' of https://github.com/huggingface/transformers in…
ArthurZucker Sep 18, 2023
51eda03
update tests results now that the special tokens are always added
ArthurZucker Sep 18, 2023
24ac840
Merge branch 'main' of https://github.com/huggingface/transformers in…
ArthurZucker Sep 18, 2023
04bef33
Merge branch 'main' of github.com:huggingface/transformers into fix-a…
ArthurZucker Sep 18, 2023
7b86b04
Merge branch 'main' of https://github.com/huggingface/transformers in…
ArthurZucker Sep 18, 2023
9424fdf
Merge branch 'fix-add-tokens' of https://github.com/ArthurZucker/tran…
ArthurZucker Sep 18, 2023
Files changed
2 changes: 1 addition & 1 deletion .gitignore

@@ -166,4 +166,4 @@ tags
 .DS_Store
 
 # ruff
-.ruff_cache
\ No newline at end of file
+.ruff_cache
2 changes: 1 addition & 1 deletion setup.py

@@ -172,7 +172,7 @@
     "tf2onnx",
     "timeout-decorator",
     "timm",
-    "tokenizers>=0.11.1,!=0.11.3,<0.14",
+    "tokenizers>=0.14,<0.15",
     "torch>=1.10,!=1.12.0",
     "torchaudio",
     "torchvision",
2 changes: 1 addition & 1 deletion src/transformers/dependency_versions_table.py

@@ -78,7 +78,7 @@
    "tf2onnx": "tf2onnx",
    "timeout-decorator": "timeout-decorator",
    "timm": "timm",
-    "tokenizers": "tokenizers>=0.11.1,!=0.11.3,<0.14",
+    "tokenizers": "tokenizers>=0.14,<0.15",
    "torch": "torch>=1.10,!=1.12.0",
    "torchaudio": "torchaudio",
    "torchvision": "torchvision",
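The ⚠️ commits in the log above flag this bump as a breaking requirement: this release of transformers needs the tokenizers 0.14 line. Below is a minimal sketch of how a downstream project might assert compatibility before importing; the version bounds come from the diff above, the check itself is illustrative and not part of the PR:

from importlib.metadata import version

from packaging.specifiers import SpecifierSet

# Bounds copied from the setup.py diff above.
REQUIRED = SpecifierSet(">=0.14,<0.15")

installed = version("tokenizers")
if installed not in REQUIRED:
    raise RuntimeError(
        f"tokenizers {installed} is installed, but this transformers "
        "release requires tokenizers>=0.14,<0.15"
    )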
18 changes: 10 additions & 8 deletions src/transformers/models/albert/tokenization_albert.py

@@ -159,6 +159,14 @@ def __init__(
 
         self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
 
+        self.do_lower_case = do_lower_case
+        self.remove_space = remove_space
+        self.keep_accents = keep_accents
+        self.vocab_file = vocab_file
+
+        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
+        self.sp_model.Load(vocab_file)
+
         super().__init__(
             do_lower_case=do_lower_case,
             remove_space=remove_space,
@@ -174,14 +182,6 @@ def __init__(
             **kwargs,
         )
 
-        self.do_lower_case = do_lower_case
-        self.remove_space = remove_space
-        self.keep_accents = keep_accents
-        self.vocab_file = vocab_file
-
-        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
-        self.sp_model.Load(vocab_file)
-
     @property
     def vocab_size(self) -> int:
         return len(self.sp_model)
@@ -228,6 +228,8 @@ def _tokenize(self, text: str) -> List[str]:
         new_pieces = []
         for piece in pieces:
             if len(piece) > 1 and piece[-1] == str(",") and piece[-2].isdigit():
+                # Logic to handle special cases see https://github.com/google-research/bert/blob/master/README.md#tokenization
+                # `9,9` -> ['▁9', ',', '9'] instead of [`_9,`, '9']
                 cur_pieces = self.sp_model.EncodeAsPieces(piece[:-1].replace(SPIECE_UNDERLINE, ""))
                 if piece[0] != SPIECE_UNDERLINE and cur_pieces[0][0] == SPIECE_UNDERLINE:
                     if len(cur_pieces[0]) == 1:
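This reordering is the core mechanical change the PR applies across the slow tokenizers: the base PreTrainedTokenizer.__init__ can now touch the vocabulary (for example when registering special tokens as added tokens), so everything it depends on must exist before the super() call. A minimal sketch of the constraint, using hypothetical names rather than any real tokenizer in the PR:

import sentencepiece as spm

from transformers import PreTrainedTokenizer

class MySentencePieceTokenizer(PreTrainedTokenizer):
    # Hypothetical subclass illustrating the required ordering.
    def __init__(self, vocab_file, unk_token="<unk>", **kwargs):
        # 1. Load the vocabulary first: the base __init__ may call
        #    vocab-dependent methods while wiring up special tokens.
        self.vocab_file = vocab_file
        self.sp_model = spm.SentencePieceProcessor()
        self.sp_model.Load(vocab_file)
        # 2. Only then hand control to the base class.
        super().__init__(unk_token=unk_token, **kwargs)

    @property
    def vocab_size(self):
        return len(self.sp_model)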
28 changes: 15 additions & 13 deletions src/transformers/models/bart/tokenization_bart.py

@@ -204,21 +204,10 @@ def __init__(
         pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token
 
         # Mask token behave like a normal word, i.e. include the space before it
+        # TODO seems like both slow and fast actually don't strip left and right soooooooo yeah. See `test_embeded_special_tokens`
+        # Also this not only will strip the spaces but any punctuation
         mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token
 
-        super().__init__(
-            errors=errors,
-            bos_token=bos_token,
-            eos_token=eos_token,
-            unk_token=unk_token,
-            sep_token=sep_token,
-            cls_token=cls_token,
-            pad_token=pad_token,
-            mask_token=mask_token,
-            add_prefix_space=add_prefix_space,
-            **kwargs,
-        )
-
         with open(vocab_file, encoding="utf-8") as vocab_handle:
             self.encoder = json.load(vocab_handle)
         self.decoder = {v: k for k, v in self.encoder.items()}
@@ -235,6 +224,19 @@ def __init__(
         # Should have added re.IGNORECASE so BPE merges can happen for capitalized versions of contractions
         self.pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
 
+        super().__init__(
+            errors=errors,
+            bos_token=bos_token,
+            eos_token=eos_token,
+            unk_token=unk_token,
+            sep_token=sep_token,
+            cls_token=cls_token,
+            pad_token=pad_token,
+            mask_token=mask_token,
+            add_prefix_space=add_prefix_space,
+            **kwargs,
+        )
+
     @property
     def vocab_size(self):
         return len(self.encoder)
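The two TODO lines record a genuinely open question, but the intent of lstrip=True is documented behavior of the tokenizers library: the mask token absorbs the whitespace to its left, matching how sequences were masked during BART pretraining. A hedged illustration (the example strings are indicative, not verified against a specific checkpoint):

from tokenizers import AddedToken

# lstrip=True: in "Hello <mask>", the space to the left of "<mask>" is
# consumed as part of the mask token's match rather than kept as text.
# rstrip=False: whitespace to the right is left for the following token.
mask_token = AddedToken("<mask>", lstrip=True, rstrip=False)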
1 change: 1 addition & 0 deletions src/transformers/models/bart/tokenization_bart_fast.py

@@ -170,6 +170,7 @@ def __init__(
         trim_offsets=True,
         **kwargs,
     ):
+        mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token
         super().__init__(
             vocab_file,
             merges_file,
22 changes: 6 additions & 16 deletions src/transformers/models/barthez/tokenization_barthez.py

@@ -47,6 +47,8 @@
 
 SPIECE_UNDERLINE = "▁"
 
+# TODO this class is useless. This is the most standard sentencpiece model. Let's find which one is closest and nuke this.
+
 
 class BarthezTokenizer(PreTrainedTokenizer):
     """
@@ -141,6 +143,9 @@ def __init__(
 
         self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
 
+        self.vocab_file = vocab_file
+        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
+        self.sp_model.Load(str(vocab_file))
         super().__init__(
             bos_token=bos_token,
             eos_token=eos_token,
@@ -153,15 +158,6 @@ def __init__(
             **kwargs,
         )
 
-        self.vocab_file = vocab_file
-        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
-        self.sp_model.Load(str(vocab_file))
-
-        self.fairseq_tokens_to_ids = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3}
-
-        self.fairseq_tokens_to_ids["<mask>"] = len(self.sp_model) - 1
-        self.fairseq_ids_to_tokens = {v: k for k, v in self.fairseq_tokens_to_ids.items()}
-
     def build_inputs_with_special_tokens(
         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
     ) -> List[int]:
@@ -251,16 +247,10 @@ def _tokenize(self, text: str) -> List[str]:
 
     def _convert_token_to_id(self, token):
         """Converts a token (str) in an id using the vocab."""
-        if token in self.fairseq_tokens_to_ids:
-            return self.fairseq_tokens_to_ids[token]
-        spm_id = self.sp_model.PieceToId(token)
-
-        return spm_id if spm_id else self.unk_token_id
+        return self.sp_model.PieceToId(token)
 
     def _convert_id_to_token(self, index):
         """Converts an index (integer) in a token (str) using the vocab."""
-        if index in self.fairseq_ids_to_tokens:
-            return self.fairseq_ids_to_tokens[index]
         return self.sp_model.IdToPiece(index)
 
     def convert_tokens_to_string(self, tokens):
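Dropping the fairseq maps works because the special tokens are now resolved by the shared added-tokens machinery before _convert_token_to_id is ever reached. A sketch of the before/after lookup as plain functions, assuming (as for BARThez) the special pieces already exist in the SentencePiece vocabulary:

def convert_token_to_id_before(sp_model, fairseq_map, token, unk_token_id):
    # Old path: hand-maintained override table first, then SentencePiece.
    if token in fairseq_map:
        return fairseq_map[token]
    spm_id = sp_model.PieceToId(token)
    return spm_id if spm_id else unk_token_id  # PieceToId returns 0 for unknown pieces

def convert_token_to_id_after(sp_model, token):
    # New path: specials are handled by the added-tokens encoder earlier,
    # so a plain SentencePiece lookup suffices.
    return sp_model.PieceToId(token)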
24 changes: 12 additions & 12 deletions src/transformers/models/bartpho/tokenization_bartpho.py

@@ -139,18 +139,6 @@ def __init__(
 
         self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
 
-        super().__init__(
-            bos_token=bos_token,
-            eos_token=eos_token,
-            unk_token=unk_token,
-            sep_token=sep_token,
-            cls_token=cls_token,
-            pad_token=pad_token,
-            mask_token=mask_token,
-            sp_model_kwargs=self.sp_model_kwargs,
-            **kwargs,
-        )
-
         self.vocab_file = vocab_file
         self.monolingual_vocab_file = monolingual_vocab_file
         self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
@@ -174,6 +162,18 @@ def __init__(
 
         self.fairseq_ids_to_tokens = {v: k for k, v in self.fairseq_tokens_to_ids.items()}
 
+        super().__init__(
+            bos_token=bos_token,
+            eos_token=eos_token,
+            unk_token=unk_token,
+            sep_token=sep_token,
+            cls_token=cls_token,
+            pad_token=pad_token,
+            mask_token=mask_token,
+            sp_model_kwargs=self.sp_model_kwargs,
+            **kwargs,
+        )
+
     def __getstate__(self):
         state = self.__dict__.copy()
         state["sp_model"] = None
31 changes: 16 additions & 15 deletions src/transformers/models/bert/tokenization_bert.py

@@ -196,20 +196,6 @@ def __init__(
         strip_accents=None,
         **kwargs,
     ):
-        super().__init__(
-            do_lower_case=do_lower_case,
-            do_basic_tokenize=do_basic_tokenize,
-            never_split=never_split,
-            unk_token=unk_token,
-            sep_token=sep_token,
-            pad_token=pad_token,
-            cls_token=cls_token,
-            mask_token=mask_token,
-            tokenize_chinese_chars=tokenize_chinese_chars,
-            strip_accents=strip_accents,
-            **kwargs,
-        )
-
         if not os.path.isfile(vocab_file):
             raise ValueError(
                 f"Can't find a vocabulary file at path '{vocab_file}'. To load the vocabulary from a Google pretrained"
@@ -225,7 +211,22 @@ def __init__(
                 tokenize_chinese_chars=tokenize_chinese_chars,
                 strip_accents=strip_accents,
             )
-        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=self.unk_token)
+
+        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=str(unk_token))
+
+        super().__init__(
+            do_lower_case=do_lower_case,
+            do_basic_tokenize=do_basic_tokenize,
+            never_split=never_split,
+            unk_token=unk_token,
+            sep_token=sep_token,
+            pad_token=pad_token,
+            cls_token=cls_token,
+            mask_token=mask_token,
+            tokenize_chinese_chars=tokenize_chinese_chars,
+            strip_accents=strip_accents,
+            **kwargs,
+        )
 
     @property
     def do_lower_case(self):
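Note the switch from self.unk_token to str(unk_token) when building the WordpieceTokenizer: it is forced by the new ordering, since self.unk_token is only set inside super().__init__, which now runs last, and the argument itself may arrive as an AddedToken rather than a plain string. A small sketch of the pitfall:

from tokenizers import AddedToken

unk_token = AddedToken("[UNK]", lstrip=False, rstrip=False)

# str() is safe whether unk_token arrived as a str or an AddedToken.
assert str(unk_token) == "[UNK]"

# Reading self.unk_token at this point in __init__ would instead fail:
# the base class, which defines that attribute, has not been called yet.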
10 changes: 5 additions & 5 deletions src/transformers/models/bert_generation/tokenization_bert_generation.py

@@ -96,6 +96,11 @@
     ) -> None:
         self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
 
+        self.vocab_file = vocab_file
+
+        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
+        self.sp_model.Load(vocab_file)
+
         # Add extra_ids to the special token list
         super().__init__(
             bos_token=bos_token,
@@ -107,11 +112,6 @@
             **kwargs,
         )
 
-        self.vocab_file = vocab_file
-
-        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
-        self.sp_model.Load(vocab_file)
-
     @property
     def vocab_size(self):
         return self.sp_model.get_piece_size()
43 changes: 21 additions & 22 deletions src/transformers/models/bert_japanese/tokenization_bert_japanese.py

@@ -160,25 +160,6 @@ def __init__(
         jumanpp_kwargs=None,
         **kwargs,
     ):
-        super().__init__(
-            spm_file=spm_file,
-            unk_token=unk_token,
-            sep_token=sep_token,
-            pad_token=pad_token,
-            cls_token=cls_token,
-            mask_token=mask_token,
-            do_lower_case=do_lower_case,
-            do_word_tokenize=do_word_tokenize,
-            do_subword_tokenize=do_subword_tokenize,
-            word_tokenizer_type=word_tokenizer_type,
-            subword_tokenizer_type=subword_tokenizer_type,
-            never_split=never_split,
-            mecab_kwargs=mecab_kwargs,
-            sudachi_kwargs=sudachi_kwargs,
-            jumanpp_kwargs=jumanpp_kwargs,
-            **kwargs,
-        )
-
         if subword_tokenizer_type == "sentencepiece":
             if not os.path.isfile(spm_file):
                 raise ValueError(
@@ -226,13 +207,31 @@ def __init__(
         self.subword_tokenizer_type = subword_tokenizer_type
         if do_subword_tokenize:
             if subword_tokenizer_type == "wordpiece":
-                self.subword_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=self.unk_token)
+                self.subword_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=str(unk_token))
             elif subword_tokenizer_type == "character":
-                self.subword_tokenizer = CharacterTokenizer(vocab=self.vocab, unk_token=self.unk_token)
+                self.subword_tokenizer = CharacterTokenizer(vocab=self.vocab, unk_token=str(unk_token))
             elif subword_tokenizer_type == "sentencepiece":
-                self.subword_tokenizer = SentencepieceTokenizer(vocab=self.spm_file, unk_token=self.unk_token)
+                self.subword_tokenizer = SentencepieceTokenizer(vocab=self.spm_file, unk_token=str(unk_token))
             else:
                 raise ValueError(f"Invalid subword_tokenizer_type '{subword_tokenizer_type}' is specified.")
+        super().__init__(
+            spm_file=spm_file,
+            unk_token=unk_token,
+            sep_token=sep_token,
+            pad_token=pad_token,
+            cls_token=cls_token,
+            mask_token=mask_token,
+            do_lower_case=do_lower_case,
+            do_word_tokenize=do_word_tokenize,
+            do_subword_tokenize=do_subword_tokenize,
+            word_tokenizer_type=word_tokenizer_type,
+            subword_tokenizer_type=subword_tokenizer_type,
+            never_split=never_split,
+            mecab_kwargs=mecab_kwargs,
+            sudachi_kwargs=sudachi_kwargs,
+            jumanpp_kwargs=jumanpp_kwargs,
+            **kwargs,
+        )
 
     @property
     def do_lower_case(self):
33 changes: 16 additions & 17 deletions src/transformers/models/bertweet/tokenization_bertweet.py

@@ -134,18 +134,6 @@ def __init__(
         mask_token="<mask>",
         **kwargs,
     ):
-        super().__init__(
-            normalization=normalization,
-            bos_token=bos_token,
-            eos_token=eos_token,
-            sep_token=sep_token,
-            cls_token=cls_token,
-            unk_token=unk_token,
-            pad_token=pad_token,
-            mask_token=mask_token,
-            **kwargs,
-        )
-
         try:
             from emoji import demojize
 
@@ -161,10 +149,10 @@ def __init__(
         self.merges_file = merges_file
 
         self.encoder = {}
-        self.encoder[self.bos_token] = 0
-        self.encoder[self.pad_token] = 1
-        self.encoder[self.eos_token] = 2
-        self.encoder[self.unk_token] = 3
+        self.encoder[bos_token] = 0
+        self.encoder[pad_token] = 1
+        self.encoder[eos_token] = 2
+        self.encoder[unk_token] = 3
 
         self.add_from_file(vocab_file)
 
@@ -178,9 +166,20 @@ def __init__(
 
         self.normalization = normalization
         self.tweetPreprocessor = TweetTokenizer()
-
         self.special_puncts = {"’": "'", "…": "..."}
 
+        super().__init__(
+            normalization=normalization,
+            bos_token=bos_token,
+            eos_token=eos_token,
+            sep_token=sep_token,
+            cls_token=cls_token,
+            unk_token=unk_token,
+            pad_token=pad_token,
+            mask_token=mask_token,
+            **kwargs,
+        )
+
     def build_inputs_with_special_tokens(
         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
     ) -> List[int]: