
[Core tokenization] add_dummy_prefix_space option to help with latest issues #28010

Merged Feb 20, 2024 (40 commits)

Conversation

ArthurZucker
Collaborator

@ArthurZucker ArthurZucker commented Dec 13, 2023

What does this PR do?

Allows users to control the addition of a prefix space when calling tokenizer.tokenize. Let's also update the fast tokenizer!

fixes #28622
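A toy sketch of what the flag controls — not the transformers implementation, just the SentencePiece-style "dummy prefix" normalization that add_prefix_space toggles (the function name here is illustrative):

```python
# Toy illustration (assumed behavior, not the transformers code): SentencePiece's
# add_dummy_prefix optionally prepends a space before mapping every space to the
# meta symbol '▁'. This PR exposes that switch as `add_prefix_space`.

def sp_normalize(text: str, add_prefix_space: bool = True) -> str:
    """Optionally prepend a space, then replace spaces with '▁'."""
    if add_prefix_space:
        text = " " + text
    return text.replace(" ", "▁")

print(sp_normalize("Hello world"))                          # ▁Hello▁world
print(sp_normalize("Hello world", add_prefix_space=False))  # Hello▁world
```

With the prefix space on, "Hello" tokenizes the same whether it starts a sentence or follows a space; with it off, the raw text is segmented as-is.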

@huggingface huggingface deleted a comment from github-actions bot Jan 15, 2024

@ArthurZucker ArthurZucker mentioned this pull request Jan 16, 2024
@ArthurZucker ArthurZucker marked this pull request as ready for review January 18, 2024 11:05
@gabegrand

Just wanted to say this would be hugely helpful for us over at https://github.com/probcomp/hfppl !

@haileyschoelkopf
Contributor

Likewise, the ability to not include an extra SPIECE_UNDERLINE / Llama token 29871 when encoding a word with a leading space ( <word>) would be huge for https://github.com/EleutherAI/lm-evaluation-harness !
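A hypothetical sketch of why that extra token shows up: the dummy prefix turns " word" into "▁▁word", and if the vocabulary only holds "▁word", a lone "▁" (Llama id 29871) is emitted first. The toy vocabulary and the id 1734 below are illustrative, not real Llama entries:

```python
# Illustrative sketch, not the real Llama vocabulary or matcher. Only the id
# 29871 for a bare '▁' comes from the discussion above; 1734 is made up.

VOCAB = {"▁": 29871, "▁word": 1734}

def greedy_pieces(text: str) -> list[str]:
    """Greedy longest-match segmentation over the toy vocabulary."""
    pieces, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                pieces.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary piece matches at offset {i}")
    return pieces

print(greedy_pieces("▁▁word"))  # ['▁', '▁word'] — the extra 29871 piece
print(greedy_pieces("▁word"))   # ['▁word'] — what add_prefix_space=False yields
```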

@LysandreJik LysandreJik self-requested a review February 20, 2024 10:51
Collaborator Author

@ArthurZucker ArthurZucker left a comment


I'll let @Lysandre decide, but instead of following what we do with Bloom, I'd rather we convert from slow. A bit slower, but at least we are sure we use the correct logic.
This is done with a warning.

src/transformers/models/llama/tokenization_llama_fast.py (4 review comments, outdated/resolved)
@ArthurZucker
Collaborator Author

Failing test is unrelated 😉

Member

@LysandreJik LysandreJik left a comment


Ok, this looks good to me

Comment on lines +127 to +131
if add_prefix_space is not None:
    logger.warning_once(
        "You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers"
    )
    kwargs["from_slow"] = True
Member


How long does it take to convert the tokenizer from slow? If it's quick we can move it to info

Collaborator Author


Around 10 seconds I believe!
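The snippet under review above boils down to a small kwargs dispatch; a minimal sketch, with a hypothetical helper name, of how an explicit add_prefix_space forces the from_slow conversion:

```python
# Minimal sketch of the dispatch in the reviewed snippet. `resolve_kwargs` is a
# hypothetical name; in the PR this logic lives in the fast tokenizer's __init__.

def resolve_kwargs(add_prefix_space=None, **kwargs):
    """If the caller sets add_prefix_space, fall back to converting from the
    slow tokenizer so the sentencepiece normalizer is rebuilt correctly."""
    if add_prefix_space is not None:
        kwargs["from_slow"] = True
        kwargs["add_prefix_space"] = add_prefix_space
    return kwargs

print(resolve_kwargs(add_prefix_space=False))
# {'from_slow': True, 'add_prefix_space': False}
print(resolve_kwargs())
# {} — no flag set, the prebuilt fast tokenizer is used as-is
```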

@casper-hansen
Copy link

casper-hansen commented Sep 9, 2024

@ArthurZucker I know this is old at this point, but this PR seems to introduce an unintended side effect for Mistral v0.1 tokenizers. I figured out that I can fix it by always initializing with add_prefix_space=True. Not sure if this is on your radar or if it is fixed in newer versions, but would totally appreciate if this can be fixed for the Mistral model series.

Tokens before (4.37.2):

SLICED: ['1: <s>', '330: A', '28747: :', '28705: ', '28740: 1', '28783: 8', '13: <0x0A>']

Tokens after (this PR, add_prefix_space not set):

  • The "A" vanishes
SLICED: ['1: <s>', '28747: :', '28705:  ', '28740: 1', '28783: 8', '13: <0x0A>']

Tokens after (this PR, add_prefix_space=True):

  • To be clear, this is the expected default behavior.
SLICED: ['1: <s>', '330: A', '28747: :', '28705: ', '28740: 1', '28783: 8', '13: <0x0A>']


https://www.diffchecker.com/GwA54pMf/

Successfully merging this pull request may close these issues.

Can LlamaTokenizerFast support the argument add_prefix_space = False