Deal with `_encode_pair()` / Llama token 29871 / `SPIECE_UNDERLINE` better #1322

haileyschoelkopf · 2024-01-19T18:23:02Z

This PR is intended to stamp out a lingering edge case in `_encode_pair()` (#1053, #1297) where the target continuation ends up with no tokens assigned to it because the target gets folded entirely into the last token of the context.
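To make the edge case concrete, here is a minimal sketch. It assumes a pair-encoding routine that tokenizes `context + continuation` together and slices off the context's token count. `toy_encode`, `TOY_VOCAB`, and `encode_pair_sketch` are hypothetical stand-ins for illustration, not the harness's actual code or a real tokenizer:

```python
from typing import List, Tuple

# Hypothetical toy vocabulary: "=4" exists as a single merged token.
TOY_VOCAB = {"2": 0, "+": 1, "=": 2, "4": 3, "=4": 4}


def toy_encode(text: str) -> List[int]:
    """Greedy longest-match tokenizer over TOY_VOCAB (illustration only)."""
    ids = []
    pos = 0
    while pos < len(text):
        # Pick the longest vocabulary piece that matches at this position.
        piece = max(
            (p for p in TOY_VOCAB if text.startswith(p, pos)),
            key=len,
            default=None,
        )
        if piece is None:
            raise ValueError(f"no token for text at position {pos}")
        ids.append(TOY_VOCAB[piece])
        pos += len(piece)
    return ids


def encode_pair_sketch(context: str, continuation: str) -> Tuple[List[int], List[int]]:
    """Encode the pair jointly, then slice off the context's token count."""
    # Move trailing whitespace from the context onto the continuation.
    n_spaces = len(context) - len(context.rstrip())
    if n_spaces > 0:
        continuation = context[-n_spaces:] + continuation
        context = context[:-n_spaces]
    whole_ids = toy_encode(context + continuation)
    context_ids = toy_encode(context)
    return context_ids, whole_ids[len(context_ids):]


context_ids, continuation_ids = encode_pair_sketch("2+2=", "4")
print(context_ids)       # four tokens: "2", "+", "2", "="
print(continuation_ids)  # empty: "4" was merged into the "=4" token
```

Because the joint encoding has the same number of tokens as the context alone, the slice leaves nothing for the continuation, which is the zero-length case described above.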

We originally introduced `_encode_pair()` to avoid a behavior of Llama's tokenizer: if you pass `" <word>"` into it, it returns `<BOS if added> 29871 <actual token for word, where word is beginning of a word in sentencepiece>` rather than `<BOS if added> <actual token for word, where word is beginning of a word in sentencepiece>`.

Token 29871 in the Llama tokenizer is the `SPIECE_UNDERLINE` character, which we don't want to be tokenized standalone in the middle of a context.
See huggingface/transformers#26273, oobabooga/text-generation-webui#2606, and ggml-org/llama.cpp#3664 for further reference on this issue.
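As a rough illustration of the two token sequences contrasted above, the sketch below removes a standalone underline token that directly follows an optional BOS token. Only the ID 29871 comes from this description. The function name `drop_standalone_underline`, the BOS ID, and the word ID are hypothetical placeholders, and this is not necessarily the approach taken in this PR:

```python
from typing import List, Optional

SPIECE_UNDERLINE_ID = 29871  # ID stated in this PR description for Llama


def drop_standalone_underline(
    token_ids: List[int],
    bos_id: Optional[int] = None,
    underline_id: int = SPIECE_UNDERLINE_ID,
) -> List[int]:
    """Drop a standalone underline token found right after an optional BOS."""
    # Skip over the BOS token only if the sequence actually starts with it.
    start = 1 if bos_id is not None and token_ids[:1] == [bos_id] else 0
    if token_ids[start:start + 1] == [underline_id]:
        return token_ids[:start] + token_ids[start + 1:]
    return list(token_ids)


# Placeholder IDs for illustration only (assumptions, not real vocabulary lookups).
BOS_ID = 1
WORD_ID = 1000

with_underline = [BOS_ID, SPIECE_UNDERLINE_ID, WORD_ID]
print(drop_standalone_underline(with_underline, bos_id=BOS_ID))
```

Underline tokens elsewhere in the sequence are left alone, since only the leading one is the artifact described here.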

Opening a PR so that I do not forget this needs merging, but it's probably blocked on getting huggingface/transformers#28010 merged, at which point it may be possible to handle this cleanly rather than resorting to hacks.

In an ideal world we may also be able to remove `_encode_pair()` (or keep it, but only use it when a user passes a flag requesting the legacy behavior) and keep more complicated hacks from compounding. However, we should only do so if it seems that this will not affect the context + continuation behavior in the vast majority of cases.

Closes #1297, #1053.

Will also rebase so as not to be dependent on #1287.


daniel-furman and others added 30 commits January 6, 2024 18:50

first stab at wrap_chat_template

3824828

first stab at wrap_chat_template, strip error fix

a784417

first stab at wrap_chat_template, rfind continuation fix

53c68db

first stab at wrap_chat_template, formatting in function

3e27f9d

first stab at wrap_chat_template, print statements in loglikelihood f…

87dff8b

…or testing

first stab at wrap_chat_template, remove system for now

5c4d9c7

first stab at wrap_chat_template, remove special chars from continuation

e689727

first stab at wrap_chat_template, remove special chars tab indenting …

337c084

…style fix

Merge branch 'EleutherAI:main' into main

6c68fd1

first stab at wrap_chat_template, various

34b32f7

first stab at wrap_chat_template, various

59e3b17

first stab at wrap_chat_template, arc conversation test

7191904

first stab at wrap_chat_template, arc conversation test

9949e4f

first stab at wrap_chat_template, remove arc experiment

2d3c835

first stab at wrap_chat_template, various

49f43f9

llama test

021232b

llama test

b6c75ed

llama test

047dde8

llama test

c38b9d2

llama test

1ea8470

llama test

2e27053

llama test

43dee06

llama test

39a11d0

remove system

bbcdffb

Merge branch 'main' into add-chat-templating

2b40017

update Instance.args setter

c47de8b

clean up wrap_chat_template + add TODOs

6ca8ab1

Merge branch 'main' into add-chat-templating

b8bda47

push most recent code

68c30aa

add the hack (works for Mistral/Llama, destroys performance for GPT2

d03c9fd

haileyschoelkopf added the bug Something isn't working. label Jan 19, 2024

haileyschoelkopf added 2 commits January 19, 2024 18:31

add the hack (works for Mistral/Llama, destroys performance for GPT2

42d54f8

Merge branch 'fix-len0-continuations' of https://github.com/EleutherA…

787c99e

…I/lm-evaluation-harness into fix-len0-continuations

baberabb mentioned this pull request Nov 22, 2024

mlx Model (loglikelihood & generate_until) #1902

Open
