resize_token_embeddings in NLLB leading to empty outputs #32948

Closed
bhavitvyamalik opened this issue Aug 22, 2024 · 2 comments · Fixed by #33325
Labels: bug, Good Difficult Issue, Good Second Issue

Comments

bhavitvyamalik (Contributor) commented Aug 22, 2024

System Info

  • transformers version: 4.42.3
  • Platform: Linux-4.18.0-513.11.1.el8_9.x86_64-x86_64-with-glibc2.28
  • Python version: 3.10.14
  • Huggingface_hub version: 0.23.4
  • Safetensors version: 0.4.3
  • Accelerate version: 0.33.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.3.1+cu121 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: yes

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", additional_special_tokens=[f"code_{i}" for i in range(18)], use_fast=True)
model.resize_token_embeddings(len(tokenizer))

After resizing, generation using an official example:

article = "Şeful ONU spune că nu există o soluţie militară în Siria"
inputs = tokenizer(article, return_tensors="pt")

translated_tokens = model.generate(
    **inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("deu_Latn"), max_length=30
)
tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]

Output is: 't t t t t t t t t t'

Expected behavior

Generation should work without any errors. One interesting thing to note here is that if I add just 2 new tokens, it works fine.

@ArthurZucker (Collaborator)

Hey! 🤗 This is not a bug and is expected. It's a duplicate of #29899 (there are many others). https://nlp.stanford.edu/~johnhew/vocab-expansion.html is the solution you are looking for.

cc @LysandreJik I think it would make sense to do this by default no?
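
For reference, below is a minimal sketch of the mean-initialization workaround described in the linked vocab-expansion post, applied to the reproduction above. It is an illustration of the idea, not the fix that later landed in #33325, and the variable names are only illustrative.

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M",
    additional_special_tokens=[f"code_{i}" for i in range(18)],
    use_fast=True,
)

# Average of the pre-existing embeddings, computed before resizing.
with torch.no_grad():
    old_input_mean = model.get_input_embeddings().weight.mean(dim=0)
    old_output_mean = model.get_output_embeddings().weight.mean(dim=0)

old_vocab_size = model.get_input_embeddings().weight.shape[0]
model.resize_token_embeddings(len(tokenizer))

# Overwrite the randomly initialized rows for the new tokens with the mean
# of the old embeddings, so the new tokens do not dominate the softmax
# before any fine-tuning. If the input and output embeddings are tied,
# the second assignment is redundant but harmless.
with torch.no_grad():
    model.get_input_embeddings().weight[old_vocab_size:] = old_input_mean
    model.get_output_embeddings().weight[old_vocab_size:] = old_output_mean

With the new rows initialized this way, generation on the Romanian example above should produce a normal translation rather than the repeated 't' tokens, and the added tokens can then be trained as usual.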

@LysandreJik (Member)

Yes, if this error comes up often we should find a better way to handle it. I'll mark this as a good second issue in case a community member wants to take an initial stab at it.

@LysandreJik added the Good Second Issue and Good Difficult Issue labels Aug 30, 2024