🚨🚨 🚨🚨 [Tokenizer] attempt to fix add_token issues 🚨🚨 🚨🚨 #23909

Merged Sep 18, 2023 (268 commits)
Changes from 17 commits
Commits
4aea714
fix test for bart. Order is correct now let's skip BPEs
ArthurZucker Jun 2, 2023
0df212f
ouf
ArthurZucker Jun 2, 2023
c04bd58
styling
ArthurZucker Jun 2, 2023
4a54920
fix bert....
ArthurZucker Jun 2, 2023
189b259
slow refactoring
ArthurZucker Jun 3, 2023
f1c362a
current updates
ArthurZucker Jun 14, 2023
f5b178a
Merge branch 'main' of https://github.com/huggingface/transformers in…
ArthurZucker Jun 15, 2023
0047eb9
massive refactoring
ArthurZucker Jun 16, 2023
0e34205
update
ArthurZucker Jun 16, 2023
81e73d7
NICE!
ArthurZucker Jun 19, 2023
1f3ff93
update to see where I am at
ArthurZucker Jun 19, 2023
3701e26
updates
ArthurZucker Jun 20, 2023
0369a51
update
ArthurZucker Jun 24, 2023
b621c96
update
ArthurZucker Jun 25, 2023
83feef4
Merge branch 'main' of https://github.com/huggingface/transformers in…
ArthurZucker Jun 25, 2023
2e7b733
revert
ArthurZucker Jun 26, 2023
582bd26
Merge branch 'main' of https://github.com/huggingface/transformers in…
ArthurZucker Jul 26, 2023
4c78792
updates
ArthurZucker Jul 26, 2023
25b9836
updates
ArthurZucker Jul 26, 2023
7759304
start supporting legacy_save
ArthurZucker Jul 26, 2023
53b29e5
styling
ArthurZucker Jul 27, 2023
a7ab994
big update
ArthurZucker Jul 27, 2023
6fb6b31
revert some changes
ArthurZucker Jul 27, 2023
9b2e138
nits
ArthurZucker Jul 27, 2023
dffd6d8
nniiiiiice
ArthurZucker Jul 27, 2023
b86bb3a
small fixes
ArthurZucker Jul 27, 2023
b9bb598
kinda fix t5 with new behaviour
ArthurZucker Jul 27, 2023
55ea2f5
major update
ArthurZucker Jul 27, 2023
ff13cb1
fixup
ArthurZucker Jul 27, 2023
5b46dc9
fix copies
ArthurZucker Jul 27, 2023
70252e5
today's updates
ArthurZucker Jul 28, 2023
d76b414
fix byt5
ArthurZucker Jul 28, 2023
496679b
upfate
ArthurZucker Jul 28, 2023
521907d
update
ArthurZucker Jul 28, 2023
1d4e947
update
ArthurZucker Jul 28, 2023
6ae5e51
updates
ArthurZucker Jul 28, 2023
c42800e
update vocab size test
ArthurZucker Jul 31, 2023
754b219
Barthez does not use not need the fairseq offset ids
ArthurZucker Jul 31, 2023
80ff47a
super calll must be after
ArthurZucker Jul 31, 2023
14726cc
calll super
ArthurZucker Jul 31, 2023
1f24347
move all super init
ArthurZucker Jul 31, 2023
d2ed9c6
move other super init
ArthurZucker Aug 3, 2023
74b22e1
fixup
ArthurZucker Aug 3, 2023
162e61d
nits
ArthurZucker Aug 3, 2023
1643432
more fixes
ArthurZucker Aug 3, 2023
5db9604
nits
ArthurZucker Aug 3, 2023
957cb54
more fixes
ArthurZucker Aug 3, 2023
29e32f6
Merge branch 'main' of https://github.com/huggingface/transformers in…
ArthurZucker Aug 3, 2023
f0de438
nits
ArthurZucker Aug 3, 2023
d31acd2
more fix
ArthurZucker Aug 3, 2023
a6f0094
remove useless files
ArthurZucker Aug 3, 2023
38b8942
ouch all of them are affected
ArthurZucker Aug 3, 2023
8096378
and more!
ArthurZucker Aug 3, 2023
2d55c90
small imporvements
ArthurZucker Aug 3, 2023
31ac52d
no more sanitize token
ArthurZucker Aug 3, 2023
da42a86
more changes around unique no split tokens
ArthurZucker Aug 3, 2023
e34c6ea
partially fix more things
ArthurZucker Aug 4, 2023
becb4a5
keep legacy save but add warning
ArthurZucker Aug 4, 2023
2ddcc28
so... more fixes
ArthurZucker Aug 4, 2023
7bd0b15
updates
ArthurZucker Aug 4, 2023
365f815
guess deberta tokenizer could be nuked
ArthurZucker Aug 4, 2023
5ff1e04
fixup
ArthurZucker Aug 4, 2023
9284bbf
fixup did some bad things
ArthurZucker Aug 4, 2023
dbdafeb
nuke it if it breaks
ArthurZucker Aug 4, 2023
83310c9
remove prints and pretrain fast from slow with new format.
ArthurZucker Aug 4, 2023
9a461f5
fixups
ArthurZucker Aug 4, 2023
015e796
Apply suggestions from code review
ArthurZucker Aug 4, 2023
6e17f4e
fiou
ArthurZucker Aug 4, 2023
05e1c56
Merge branch 'fix-add-tokens' of https://github.com/ArthurZucker/tran…
ArthurZucker Aug 4, 2023
181c112
nit
ArthurZucker Aug 4, 2023
e9887e8
by default specials should not be normalized?
ArthurZucker Aug 4, 2023
df035b7
update
ArthurZucker Aug 4, 2023
d7a2458
remove brakpoint
ArthurZucker Aug 4, 2023
f5c8f2c
updates
ArthurZucker Aug 7, 2023
64237fc
a lot of updates
ArthurZucker Aug 7, 2023
fc01d58
fixup
ArthurZucker Aug 7, 2023
f101206
fixes revert some changes to match fast
ArthurZucker Aug 7, 2023
4834e64
small nits
ArthurZucker Aug 7, 2023
1acbe48
that makes it cleaner
ArthurZucker Aug 8, 2023
4c9a61e
fix camembert accordingly
ArthurZucker Aug 8, 2023
62b98b0
update
ArthurZucker Aug 8, 2023
62cedf3
some lest breaking changes
ArthurZucker Aug 8, 2023
395259a
update
ArthurZucker Aug 8, 2023
f4b3c85
fixup
ArthurZucker Aug 8, 2023
e6c7b28
fix byt5 and whisper mostly
ArthurZucker Aug 8, 2023
8626851
some more fixes, canine's byte vocab
ArthurZucker Aug 8, 2023
0d4e360
fix gpt2
ArthurZucker Aug 8, 2023
d639ea6
fix most of the perceiver tests (4 left)
ArthurZucker Aug 8, 2023
54a7ac8
fix layout lmv3
ArthurZucker Aug 8, 2023
e2b04c5
fixup
ArthurZucker Aug 8, 2023
70ca39b
fix copies for gpt2 style
ArthurZucker Aug 8, 2023
cd248db
make sure to only warn once
ArthurZucker Aug 8, 2023
02185cf
fix perciever and gpt2 tests
ArthurZucker Aug 8, 2023
7105eea
some more backward compatibility: also read special tokens map becaus…
ArthurZucker Aug 8, 2023
8f5d96b
fixup
ArthurZucker Aug 8, 2023
9f4f1e4
add else when reading
ArthurZucker Aug 8, 2023
f0c41a1
nits
ArthurZucker Aug 8, 2023
2812c70
fresh updates
ArthurZucker Aug 8, 2023
984a307
fix copies
ArthurZucker Aug 8, 2023
2fc2349
will this make everything faster?
ArthurZucker Aug 8, 2023
3861d1d
fixes
ArthurZucker Aug 8, 2023
0e20614
more fixes
ArthurZucker Aug 8, 2023
6dbb814
update
ArthurZucker Aug 8, 2023
40270d8
more fixes
ArthurZucker Aug 8, 2023
7b92445
fixup
ArthurZucker Aug 8, 2023
7b9d373
is the source of truth right?
ArthurZucker Aug 9, 2023
6631d42
Merge branch 'main' of https://github.com/huggingface/transformers in…
ArthurZucker Aug 9, 2023
58abfe7
sorry camembert for the troubles
ArthurZucker Aug 9, 2023
fd4d7f8
current updates
ArthurZucker Aug 9, 2023
8e1225f
fixup
ArthurZucker Aug 9, 2023
98c869f
update led
ArthurZucker Aug 9, 2023
da0414c
update
ArthurZucker Aug 9, 2023
b1b2779
fix regression
ArthurZucker Aug 9, 2023
161e09b
fix single word
ArthurZucker Aug 9, 2023
0ca9c8a
more model specific fixes
ArthurZucker Aug 9, 2023
87bc36d
fix t5 tests
ArthurZucker Aug 9, 2023
63dce31
fixup
ArthurZucker Aug 9, 2023
6c6f70a
more comments
ArthurZucker Aug 9, 2023
f5252a1
update
ArthurZucker Aug 9, 2023
c7c8ebe
Merge branch 'main' of https://github.com/huggingface/transformers in…
ArthurZucker Aug 9, 2023
38a013f
fix nllb
ArthurZucker Aug 9, 2023
3026e49
rstrip removed
ArthurZucker Aug 9, 2023
d3c9024
small fixes
ArthurZucker Aug 9, 2023
5242be7
better handle additional_special_tokens and vocab sizes
ArthurZucker Aug 9, 2023
769e215
fixing
ArthurZucker Aug 9, 2023
bf2cb8a
styling
ArthurZucker Aug 9, 2023
d840605
fix 4 / 21
ArthurZucker Aug 9, 2023
3bf7591
fixup
ArthurZucker Aug 9, 2023
40d55f7
fix nlbb's tests
ArthurZucker Aug 9, 2023
0b65a21
some fixes
ArthurZucker Aug 9, 2023
7a706a6
fix t5
ArthurZucker Aug 9, 2023
53d4b13
fixes
ArthurZucker Aug 9, 2023
448cfd4
style
ArthurZucker Aug 9, 2023
f521b60
fix canine tests
ArthurZucker Aug 10, 2023
384ea05
damn this is nice
ArthurZucker Aug 10, 2023
5c05f4a
nits
ArthurZucker Aug 10, 2023
1b1aad0
m2m100 nit
ArthurZucker Aug 10, 2023
6961910
fixups
ArthurZucker Aug 11, 2023
969d46e
fixes!
ArthurZucker Aug 11, 2023
d957860
fixup
ArthurZucker Aug 11, 2023
c9732fb
stash
ArthurZucker Aug 16, 2023
dd99191
Merge branch 'main' of https://github.com/huggingface/transformers in…
ArthurZucker Aug 18, 2023
463cbb4
fix merge
ArthurZucker Aug 18, 2023
965bca1
Merge branch 'main' of https://github.com/huggingface/transformers in…
ArthurZucker Aug 18, 2023
3a79633
revert bad change
ArthurZucker Aug 21, 2023
bf7ba15
Merge branch 'main' of https://github.com/huggingface/transformers in…
ArthurZucker Aug 31, 2023
717b4b9
fixup
ArthurZucker Aug 31, 2023
3d10fb0
correct order for code Llama
ArthurZucker Aug 31, 2023
78d4215
fix speecht5 post merge
ArthurZucker Aug 31, 2023
3605c75
styling
ArthurZucker Aug 31, 2023
abfe019
revert source of 11 fails
ArthurZucker Aug 31, 2023
677fc8e
small nits
ArthurZucker Aug 31, 2023
548e062
all changes in one go
ArthurZucker Sep 1, 2023
07e4d1e
fnet hack
ArthurZucker Sep 1, 2023
b871e7e
fix 2 more tests
ArthurZucker Sep 1, 2023
1520d14
Merge branch 'main' of https://github.com/huggingface/transformers in…
ArthurZucker Sep 1, 2023
136c877
update based on main branch of tokenizers
ArthurZucker Sep 5, 2023
b6a6ec7
Merge branch 'main' of https://github.com/huggingface/transformers in…
ArthurZucker Sep 5, 2023
a304a3c
fixup
ArthurZucker Sep 5, 2023
a9f47e2
fix VITS issues
ArthurZucker Sep 5, 2023
47387ec
more fixes
ArthurZucker Sep 5, 2023
c8acd2c
Merge branch 'main' of https://github.com/huggingface/transformers in…
ArthurZucker Sep 5, 2023
0c584d7
fix mgp test
ArthurZucker Sep 5, 2023
de0df2f
fix camembert issues
ArthurZucker Sep 5, 2023
fa2c424
oups camembert still has 2 failing tests
ArthurZucker Sep 5, 2023
4498ad8
Merge branch 'main' of https://github.com/huggingface/transformers in…
ArthurZucker Sep 5, 2023
d4de85c
mluke fixes
ArthurZucker Sep 5, 2023
6df2fe3
decode fixes
ArthurZucker Sep 5, 2023
4167a74
small nits
ArthurZucker Sep 5, 2023
d3e1bf1
nits
ArthurZucker Sep 5, 2023
0d65b83
fix llama and vits
ArthurZucker Sep 6, 2023
a52bb94
fix camembert
ArthurZucker Sep 6, 2023
bf58e4e
smal nits
ArthurZucker Sep 6, 2023
25adcb8
more fixes when initialising a fast from a slow and etc
ArthurZucker Sep 6, 2023
fc799f1
fix one of the last test
ArthurZucker Sep 6, 2023
205a6ed
Merge branch 'main' of https://github.com/huggingface/transformers in…
ArthurZucker Sep 6, 2023
ecbee31
fix CPM tokenizer test
ArthurZucker Sep 6, 2023
6170200
fixups
ArthurZucker Sep 6, 2023
e384d13
fix pop2piano
ArthurZucker Sep 6, 2023
b69895c
fixup
ArthurZucker Sep 6, 2023
c0c77a5
⚠️ Change tokenizers required version ⚠️
ArthurZucker Sep 7, 2023
957329f
⚠️ Change tokenizers required version ⚠️
ArthurZucker Sep 7, 2023
ae48fb2
"tokenizers>=0.14,<0.15", don't forget smaller than
ArthurZucker Sep 7, 2023
8e22a05
Merge branch 'main' of https://github.com/huggingface/transformers in…
ArthurZucker Sep 7, 2023
2b69acb
fix musicgen tests and pretraiendtokenizerfast
ArthurZucker Sep 7, 2023
abefa05
fix owlvit and all
ArthurZucker Sep 7, 2023
52c2f56
update t5
ArthurZucker Sep 7, 2023
ddb4a2b
fix 800 red
ArthurZucker Sep 7, 2023
6029ee3
fix tests
ArthurZucker Sep 7, 2023
3bebdc1
fix the fix of the fix of t5
ArthurZucker Sep 7, 2023
b4aa11f
styling
ArthurZucker Sep 7, 2023
0f7b941
documentation nits
ArthurZucker Sep 11, 2023
ca33d35
Skip failing whisper test by Merge branch 'main' of https://github.co…
ArthurZucker Sep 11, 2023
9ae3380
cache _added_tokens_encoder
ArthurZucker Sep 11, 2023
f7d9a6f
fixups
ArthurZucker Sep 11, 2023
5136647
Nit
ArthurZucker Sep 11, 2023
ca4835c
fix red tests
ArthurZucker Sep 11, 2023
c03e549
one last nit!
ArthurZucker Sep 11, 2023
8569508
make eveything a lot simpler
ArthurZucker Sep 11, 2023
a5dcdab
Now it's over :wink:
ArthurZucker Sep 11, 2023
f8638b3
few small nits
ArthurZucker Sep 11, 2023
68b5f54
Apply suggestions from code review
ArthurZucker Sep 11, 2023
de40bf2
updates that work for now
ArthurZucker Sep 11, 2023
9c6e48f
Merge branch 'fix-add-tokens' of https://github.com/arthurzucker/tran…
ArthurZucker Sep 11, 2023
0c661ec
tests that should no be skipped / changed and fixed next
ArthurZucker Sep 11, 2023
3bbe7f0
fixup
ArthurZucker Sep 11, 2023
9b22a80
i am ashamed
ArthurZucker Sep 11, 2023
f75d9bd
pushe the fix
ArthurZucker Sep 12, 2023
23d08be
update
ArthurZucker Sep 12, 2023
b68d937
fixups
ArthurZucker Sep 12, 2023
39ec344
nits
ArthurZucker Sep 12, 2023
bdb36a1
fix added_tokens_encoder
ArthurZucker Sep 12, 2023
60f7b70
fix canine test
ArthurZucker Sep 12, 2023
8ef282f
fix pegasus vocab
ArthurZucker Sep 12, 2023
3b82eac
fix transfoXL
ArthurZucker Sep 12, 2023
719ce61
fixup
ArthurZucker Sep 12, 2023
3555fc6
whisper needs to be fixed for train new
ArthurZucker Sep 12, 2023
f206fd1
pegasus nits
ArthurZucker Sep 12, 2023
cf42578
more pegasus fixes
ArthurZucker Sep 12, 2023
03de2ec
minor update
ArthurZucker Sep 12, 2023
a21d611
better error message in failed test
ArthurZucker Sep 12, 2023
604a677
fix whisper failing test
ArthurZucker Sep 12, 2023
f062446
fix whisper failing test
ArthurZucker Sep 12, 2023
d2367f1
fix pegasus
ArthurZucker Sep 12, 2023
04de249
fixup
ArthurZucker Sep 12, 2023
d1ccd62
fix **** pegasus
ArthurZucker Sep 12, 2023
ea83357
reset things
ArthurZucker Sep 12, 2023
cfd2e82
remove another file
ArthurZucker Sep 12, 2023
650d403
attempts to fix the strange custome encoder and offset
ArthurZucker Sep 12, 2023
6896a75
nits here and there
ArthurZucker Sep 12, 2023
d73fac1
update
ArthurZucker Sep 12, 2023
55fa70c
fixup
ArthurZucker Sep 12, 2023
a8cd1f0
nit
ArthurZucker Sep 12, 2023
a95460c
fix the whisper test
ArthurZucker Sep 13, 2023
5292ea6
nits nits
ArthurZucker Sep 13, 2023
6e6a0d6
Apply suggestions from code review
ArthurZucker Sep 14, 2023
d519be5
updates based on review
ArthurZucker Sep 14, 2023
589be23
some small update to potentially remove
ArthurZucker Sep 14, 2023
d20e8d7
Merge branch 'fix-add-tokens' of github.com:ArthurZucker/transformers…
ArthurZucker Sep 14, 2023
1f007bf
nits
ArthurZucker Sep 14, 2023
459b082
Merge branch 'main' of github.com:huggingface/transformers into fix-a…
ArthurZucker Sep 14, 2023
1155d3c
import rlu cache
ArthurZucker Sep 14, 2023
8516d1c
Update src/transformers/tokenization_utils_base.py
ArthurZucker Sep 18, 2023
db19f2b
move warning to `from_pretrained`
ArthurZucker Sep 18, 2023
70f303b
Merge branch 'main' of https://github.com/huggingface/transformers in…
ArthurZucker Sep 18, 2023
51eda03
update tests results now that the special tokens are always added
ArthurZucker Sep 18, 2023
24ac840
Merge branch 'main' of https://github.com/huggingface/transformers in…
ArthurZucker Sep 18, 2023
04bef33
Merge branch 'main' of github.com:huggingface/transformers into fix-a…
ArthurZucker Sep 18, 2023
7b86b04
Merge branch 'main' of https://github.com/huggingface/transformers in…
ArthurZucker Sep 18, 2023
9424fdf
Merge branch 'fix-add-tokens' of https://github.com/ArthurZucker/tran…
ArthurZucker Sep 18, 2023
10 changes: 10 additions & 0 deletions src/transformers/models/llama/tokenization_llama_fast.py
@@ -33,6 +33,15 @@
 logger = logging.get_logger(__name__)
 VOCAB_FILES_NAMES = {"vocab_file": "tokenizer.model", "tokenizer_file": "tokenizer.json"}

+PRETRAINED_VOCAB_FILES_MAP = {
+    "vocab_file": {
+        "hf-internal-testing/llama-tokenizer": "https://huggingface.co/hf-internal-testing/llama-tokenizer/resolve/main/tokenizer.model",
+    },
+    "tokenizer_file": {
+        "hf-internal-testing/llama-tokenizer": "https://huggingface.co/hf-internal-testing/llama-tokenizer/resolve/main/tokenizer_config.json",
+    },
+}
+

 class LlamaTokenizerFast(PreTrainedTokenizerFast):
     """
@@ -75,6 +84,7 @@ class LlamaTokenizerFast(PreTrainedTokenizerFast):
     """

     vocab_files_names = VOCAB_FILES_NAMES
+    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
     slow_tokenizer_class = LlamaTokenizer
     padding_side = "left"

19 changes: 13 additions & 6 deletions src/transformers/tokenization_utils.py
@@ -408,10 +408,10 @@ def _add_tokens(self, new_tokens: Union[List[str], List[AddedToken]], special_to
         # Note: resize_token_embeddings expects to receive the full size of the new vocabulary, i.e. the length of the tokenizer.
         model.resize_token_embeddings(len(tokenizer))
         ```"""
-        new_tokens = [str(tok) for tok in new_tokens]
+        token_contents = [str(tok) for tok in new_tokens]

         tokens_to_add = []
-        for token in new_tokens:
+        for i, token in enumerate(token_contents):
             if not isinstance(token, str):
                 raise TypeError(f"Token {token} is not a string but a {type(token)}.")
             if not special_tokens and hasattr(self, "do_lower_case") and self.do_lower_case:
@@ -422,6 +422,9 @@ def _add_tokens(self, new_tokens: Union[List[str], List[AddedToken]], special_to
                 and token not in tokens_to_add
             ):
                 tokens_to_add.append(token)
+                if isinstance(new_tokens[i], AddedToken) or special_tokens:
+                    # tokens that are added using AddedToken are special tokens.
+                    self._additional_special_tokens.append(new_tokens[i])
                 if self.verbose:
                     logger.info(f"Adding {token} to the vocabulary")

@@ -430,12 +433,12 @@ def _add_tokens(self, new_tokens: Union[List[str], List[AddedToken]], special_to
         self.added_tokens_encoder.update(added_tok_encoder)
         self.added_tokens_decoder.update(added_tok_decoder)

-        # Make sure we don't split on any special tokens (even they were already in the vocab before e.g. for Albert)
+        # Make sure we don't split on any special tokens (even if they were already in the vocab before e.g. for Albert)
         if special_tokens:
-            if len(new_tokens) == 1:
-                _insert_one_token_to_ordered_list(self.unique_no_split_tokens, new_tokens[0])
+            if len(token_contents) == 1:
+                _insert_one_token_to_ordered_list(self.unique_no_split_tokens, token_contents[0])
             else:
-                self.unique_no_split_tokens = sorted(set(self.unique_no_split_tokens).union(set(new_tokens)))
+                self.unique_no_split_tokens = sorted(set(self.unique_no_split_tokens).union(set(token_contents)))
         else:
             # Or on the newly added tokens
             if len(tokens_to_add) == 1:
@@ -530,6 +533,10 @@ def tokenize(self, text: TextInput, **kwargs) -> List[str]:
                     if tok_extended.lstrip and left:
                         tokens[i - 1] = left.rstrip()  # Opposite here
                 else:
+                    # there should be a list of additional tokens that are not special. These have to be in no split
+                    # but they are not special. By default any added token should have right and left strip to True
+                    # Apparently. We need to keep this behaviour
+
                     # We strip left and right by default
                     if right:
                         tokens[i + 1] = right.lstrip()
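The `_add_tokens` hunks above keep the original token objects in `new_tokens` and index back into them, instead of overwriting the list with plain strings. A minimal sketch of why that matters follows. It assumes a hypothetical `FakeAddedToken` stand-in and a hypothetical `known_vocab` set in place of the tokenizer's real vocabulary lookup; it is an illustration, not the library's implementation.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple, Union


@dataclass(frozen=True)
class FakeAddedToken:
    """Hypothetical stand-in for AddedToken: content plus strip flags."""

    content: str
    lstrip: bool = False
    rstrip: bool = False

    def __str__(self) -> str:
        return self.content


Token = Union[str, FakeAddedToken]


def split_new_tokens(
    new_tokens: List[Token],
    known_vocab: Set[str],  # assumption: replaces the unk-id check in the diff
    special_tokens: bool = False,
) -> Tuple[List[str], List[Token]]:
    """Return (string contents to add, original objects to track as special)."""
    # Keep the string view separate so the original objects stay available.
    token_contents = [str(tok) for tok in new_tokens]

    tokens_to_add: List[str] = []
    additional_special: List[Token] = []
    for i, content in enumerate(token_contents):
        if content in known_vocab or content in tokens_to_add:
            continue  # already known, nothing to add
        tokens_to_add.append(content)
        # Index back into new_tokens: the lstrip/rstrip flags survive here,
        # whereas they would be lost if new_tokens had been overwritten.
        if isinstance(new_tokens[i], FakeAddedToken) or special_tokens:
            additional_special.append(new_tokens[i])
    return tokens_to_add, additional_special


added, special = split_new_tokens(
    ["hello", FakeAddedToken("<new>", lstrip=True), "world"],
    known_vocab={"hello"},
)
print(added)
print(special)
```

Under these assumptions, `"hello"` is skipped because it is already known, `"<new>"` and `"world"` are added, and only the `FakeAddedToken` instance is tracked as special, with its `lstrip=True` flag intact.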
5 changes: 4 additions & 1 deletion tests/models/llama/test_tokenization_llama.py
@@ -51,7 +51,10 @@
 @require_tokenizers
 class LlamaTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
     tokenizer_class = LlamaTokenizer
-    test_rust_tokenizer = False
+    rust_tokenizer_class = LlamaTokenizerFast
+
+    # FIXME this does not work, support should come
+    # test_rust_tokenizer = True
     test_sentencepiece = True
     from_pretrained_kwargs = {}

65 changes: 65 additions & 0 deletions tests/test_tokenization_common.py
@@ -2110,6 +2110,71 @@ def test_batch_encode_plus_batch_sequence_length(self):
                         encoded_sequences_batch_padded_2[key],
                     )

+    def test_added_token_are_never_split(self):
+        if not self.test_slow_tokenizer:
+            self.skipTest("Currently this test is only for slow tokenizers")
+            return
+        model_ids = self.tokenizer_class.pretrained_vocab_files_map[self.from_pretrained_vocab_key].keys()
+        tokenizer = self.tokenizer_class.from_pretrained(list(model_ids)[0])
+        new_tokens = []
+        new_tokens.append(AddedToken("<lstrip=False, rstrip=False>", lstrip=False, rstrip=False))
+        new_tokens.append(AddedToken("<lstrip=True, rstrip=False>", lstrip=True, rstrip=False))
+        new_tokens.append(AddedToken("<lstrip=False, rstrip=True>", lstrip=False, rstrip=True))
+        new_tokens.append(AddedToken("<lstrip=True, rstrip=True>", lstrip=True, rstrip=True))
+
+        for token in new_tokens:
[Review comment (Collaborator, Author): Will rewrite this]
+            with self.subTest(f"testing with {token.content[1:-1]}"):
+                space = tokenizer.tokenize(" ")[0]
+                if len(space) > 1:
+                    # BPE adds a spiece underline
+                    space = space[-1]
+
+                tokenizer.add_tokens([token])
+                tokens = tokenizer.tokenize(f"This sentence is{token}a test")
+                self.assertIn(token.content, tokens)
+
+                tokens = tokenizer.tokenize(f"This sentence is {token}a test")
+                self.assertIn(token.content, tokens)
+
+                if not token.rstrip:
+                    idx = tokens.index(token.content)
+                    self.assertIn(space, tokens[idx - 1])
+                else:
+                    idx = tokens.index(token.content)
+                    self.assertNotIn(space, tokens[idx - 1])
+
+                tokens = tokenizer.tokenize(f"This sentence is{token} a test")
+                self.assertIn(token.content, tokens)
+                idx = tokens.index(token.content)
+
+                if not token.lstrip:
+                    idx = tokens.index(token.content)
+                    self.assertIn(space, tokens[idx + 1])
+                else:
+                    idx = tokens.index(token.content)
+                    self.assertNotIn(space, tokens[idx - 1])
+
+                tokens = tokenizer.tokenize(f"This sentence is {token} a test")
+                self.assertIn(token.content, tokens)
+
+                idx = tokens.index(token.content)
+
+                if not token.lstrip:
+                    idx = tokens.index(token.content)
+                    self.assertIn(space, tokens[idx + 1])
+                else:
+                    idx = tokens.index(token.content)
+                    self.assertNotIn(space, tokens[idx + 1])
+
+                if not token.rstrip:
+                    idx = tokens.index(token.content)
+                    self.assertIn(space, tokens[idx - 1])
+                else:
+                    idx = tokens.index(token.content)
+                    self.assertNotIn(space, tokens[idx - 1])
+
+        # for non BPE based tokenizers we need to test that lstrip and rstrip are respected
+
     @require_tokenizers
     def test_added_token_are_matched_longest_first(self):
         if not self.test_slow_tokenizer:
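The whitespace expectations in `test_added_token_are_never_split` depend on the stripping branch shown in the `tokenize` hunk of `tokenization_utils.py`. The sketch below reproduces that branch on an already-split list of text pieces, without a real tokenizer. `StripRule` and `strip_around_added_tokens` are hypothetical names introduced for illustration, and the split itself (done by a trie in the library) is assumed to have happened already.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass(frozen=True)
class StripRule:
    """Hypothetical stand-in for the lstrip/rstrip flags of an AddedToken."""

    lstrip: bool = False
    rstrip: bool = False


def strip_around_added_tokens(
    pieces: List[str],
    no_split: Set[str],
    rules: Dict[str, StripRule],
) -> List[str]:
    """Strip whitespace next to no-split tokens in a pre-split list of pieces."""
    out = list(pieces)
    for i, piece in enumerate(out):
        if piece not in no_split:
            continue
        rule: Optional[StripRule] = rules.get(piece)
        has_left = i > 0
        has_right = i < len(out) - 1
        if rule is not None:
            # rstrip: the token swallows whitespace on its right, so the
            # left edge of the *next* piece is stripped.
            if rule.rstrip and has_right:
                out[i + 1] = out[i + 1].lstrip()
            # lstrip: the opposite, applied to the previous piece.
            if rule.lstrip and has_left:
                out[i - 1] = out[i - 1].rstrip()
        else:
            # No explicit rule: strip both sides by default.
            if has_right:
                out[i + 1] = out[i + 1].lstrip()
            if has_left:
                out[i - 1] = out[i - 1].rstrip()
    return out


pieces = ["This sentence is ", "<tok>", " a test"]
with_rstrip = strip_around_added_tokens(pieces, {"<tok>"}, {"<tok>": StripRule(rstrip=True)})
with_default = strip_around_added_tokens(pieces, {"<tok>"}, {})
print(with_rstrip)   # ['This sentence is ', '<tok>', 'a test']
print(with_default)  # ['This sentence is', '<tok>', 'a test']
```

With `rstrip=True` only the piece to the right of the token loses its leading space; with no rule at all, both neighbours are stripped, which is the default the added comment in the diff says must be kept.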