resize_token_embeddings in NLLB leading to empty outputs #32948

Closed
bhavitvyamalik opened this issue Aug 22, 2024 · 2 comments · Fixed by #33325
Labels: bug, Good Difficult Issue, Good Second Issue

Comments

bhavitvyamalik (Contributor) commented Aug 22, 2024

System Info

  • transformers version: 4.42.3
  • Platform: Linux-4.18.0-513.11.1.el8_9.x86_64-x86_64-with-glibc2.28
  • Python version: 3.10.14
  • Huggingface_hub version: 0.23.4
  • Safetensors version: 0.4.3
  • Accelerate version: 0.33.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.3.1+cu121 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: yes

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", additional_special_tokens=[f"code_{i}" for i in range(18)], use_fast=True)
model.resize_token_embeddings(len(tokenizer))

After resizing, generation using an official example:

article = "Şeful ONU spune că nu există o soluţie militară în Siria"
inputs = tokenizer(article, return_tensors="pt")

translated_tokens = model.generate(
    **inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("deu_Latn"), max_length=30
)
tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]

Output is: 't t t t t t t t t t'

Expected behavior

Generation should work without any errors. One interesting thing to note here is that if I add just 2 new tokens, it works fine.

@ArthurZucker (Collaborator)

Hey! 🤗 This is not a bug and is expected. It's a duplicate of #29899 (there are many others). https://nlp.stanford.edu/~johnhew/vocab-expansion.html is the solution you are looking for.

cc @LysandreJik I think it would make sense to do this by default no?
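
For reference, below is a minimal sketch of the mean-initialization workaround described in the linked vocab-expansion post, applied to the reproduction above. It is an illustration of the idea, not the fix that later landed in #33325, and the variable names are only illustrative.

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M",
    additional_special_tokens=[f"code_{i}" for i in range(18)],
    use_fast=True,
)

# Average of the pre-existing embeddings, computed before resizing.
with torch.no_grad():
    old_input_mean = model.get_input_embeddings().weight.mean(dim=0)
    old_output_mean = model.get_output_embeddings().weight.mean(dim=0)

old_vocab_size = model.get_input_embeddings().weight.shape[0]
model.resize_token_embeddings(len(tokenizer))

# Overwrite the randomly initialized rows for the new tokens with the mean
# of the old embeddings, so the new tokens do not dominate the softmax
# before any fine-tuning. If the input and output embeddings are tied,
# the second assignment is redundant but harmless.
with torch.no_grad():
    model.get_input_embeddings().weight[old_vocab_size:] = old_input_mean
    model.get_output_embeddings().weight[old_vocab_size:] = old_output_mean

With the new rows initialized this way, generation on the Romanian example above should produce a normal translation rather than the repeated 't' tokens, and the added tokens can then be trained as usual.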

@LysandreJik (Member)

Yes, if this error comes up often we should find a better way to handle it. I'll mark this as a good second issue in case a community member wants to take an initial stab at it.

@LysandreJik added the Good Second Issue and Good Difficult Issue labels Aug 30, 2024