Does the 'bad_words_ids' argument in the "generate function" works? #14206

Closed
alvinwatner opened this issue Oct 29, 2021 · 13 comments
@alvinwatner

Environment info

  • transformers version: 4.12.0
  • Platform: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.12
  • PyTorch version (GPU?): 1.9.0+cu111 (False)
  • Tensorflow version (GPU?): 2.6.0 (False)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help

Information

I attempted to evaluate whether the bad_words_ids argument available in the generate() function works. However, based on the steps described in the section below, it does not work.

To reproduce

Below are the steps I used to evaluate it:

  1. Run the script without bad_words_ids specified, calling set_seed to get a deterministic output.
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModelForSeq2SeqLM, set_seed

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

set_seed(0)

input_context = "My cute dog"
input_ids = tokenizer(input_context, return_tensors="pt").input_ids
outputs = model.generate(input_ids=input_ids, max_length=20, do_sample=True)
print("Generated:", tokenizer.decode(outputs[0], skip_special_tokens=True))

Output:
Generated: My cute dog, when it died, had taken my entire life to save the life that had been

  2. Re-run the script, this time with bad_words_ids specified. I selected the words "entire" and "save" from the previously generated sequence. However, both words still appear in the output, which is identical to the previous one. Below is the script and its output.
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModelForSeq2SeqLM, set_seed

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

set_seed(0)

input_context = "My cute dog"
# get tokens of words that should not be generated
bad_words_ids = [tokenizer(bad_word).input_ids for bad_word in ["entire", "save"]]
# encode input context
input_ids = tokenizer(input_context, return_tensors="pt").input_ids
# generate sequences without allowing bad_words to be generated
outputs = model.generate(input_ids=input_ids, max_length=20, do_sample=True, bad_words_ids=bad_words_ids)
print("Generated:", tokenizer.decode(outputs[0], skip_special_tokens=True))

Output:
Generated: My cute dog, when it died, had taken my entire life to save the life that had been

To reproduce in Google Colab:

https://colab.research.google.com/drive/1P4ruLhFstbal1qqXbjuv-kM7yMYY-S1E?usp=sharing

Expected behavior

I expect the words "entire" and "save" not to appear in the output sequence after running step (2) in the section above.

@alvinwatner alvinwatner changed the title Does the 'bad_words_ids' argument in generate **function** works? Does the 'bad_words_ids' argument in the "generate function" works? Oct 30, 2021
@qqaatw
Contributor

qqaatw commented Oct 30, 2021

Hey @alvinwatner,

To prevent bad words from occurring in the middle of generated texts, you'll need to add a prefix space to every bad word so that a tokenized bad word, e.g. save, becomes ['Ġsave'] instead of ['save'], which matches GPT2's outputs.

This can be done by setting add_prefix_space=True in the kwargs of from_pretrained.

from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed

model = AutoModelForCausalLM.from_pretrained("gpt2", return_dict_in_generate=True)
tokenizer = AutoTokenizer.from_pretrained("gpt2", add_prefix_space=True)

set_seed(0)

input_context = "My cute dog"
# get tokens of words that should not be generated
bad_words_ids = tokenizer(["entire", "save"]).input_ids
# encode input context
input_ids = tokenizer(input_context, return_tensors="pt").input_ids
# generate sequences without allowing bad_words to be generated
outputs = model.generate(input_ids=input_ids, max_length=20, do_sample=True, bad_words_ids=bad_words_ids)
print("Generated:", tokenizer.decode(outputs["sequences"][0], skip_special_tokens=True))

Output:

Generated:  My cute dog, when it died, had taken my hand out of my pants and said "I
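
For reference, here's a quick way to see the tokenization difference described above (a minimal check; the printed pieces are whatever GPT-2's BPE actually produces, so the comments are only indicative):

from transformers import AutoTokenizer

# Default GPT-2 tokenizer: a word at the start of a string has no leading-space marker.
plain = AutoTokenizer.from_pretrained("gpt2")
print(plain.tokenize("save"))      # e.g. ['save']

# With add_prefix_space=True the word is tokenized the way it appears mid-sentence.
prefixed = AutoTokenizer.from_pretrained("gpt2", add_prefix_space=True)
print(prefixed.tokenize("save"))   # e.g. ['Ġsave']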

@alvinwatner
Author

alvinwatner commented Oct 30, 2021

Thank you @qqaatw for pointing that out. Just to note that this example script doesn't work and is outdated.

@giladpn

giladpn commented Nov 15, 2021

Hi @qqaatw

Thanks in advance: I am trying to do something very similar but with T5 (either t5-base or t5-large) as the model instead of GPT2. My "bad words" are simply being ignored so it's a very similar problem. Can you advise? Am I missing some configuration that would be relevant for T5?

I am running code similar to the above but using T5ForConditionalGeneration with no luck. Any help appreciated!

@alvinwatner
Author

Hi @giladpn and @qqaatw. I found something with this bad_words functionality, and I'm not sure whether this is normal behaviour or not.

For a word that is tokenized into multiple tokens, the generate function only replaces the final token, while the earlier tokens still remain in the output sequence.

For example, the word " tester", with a prefix space, is tokenized into ["Ġt", "ester"] with ids [256, 7834]; the output sequence keeps the earlier token (256) and only replaces the final token (7834). In another instance, the word " traceroute" with a prefix space is tokenized into ['Ġtr', 'acer', 'oute'] with ids [491, 11736, 13192]; the output sequence keeps the earlier tokens (491, 11736) and only replaces the final token (13192).
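
As a quick way to check which words split into multiple pieces before passing them as bad words (a minimal sketch; the printed pieces and ids depend on the tokenizer):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2", add_prefix_space=True)

# Words that split into several BPE pieces are the problematic case:
# only the last piece ends up being blocked during generation.
for word in ["entire", "tester", "traceroute"]:
    ids = tokenizer(word, add_special_tokens=False).input_ids
    print(word, "->", tokenizer.convert_ids_to_tokens(ids), ids)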

@alvinwatner alvinwatner reopened this Nov 16, 2021
@qqaatw
Contributor

qqaatw commented Nov 19, 2021

Hi @giladpn,

Can you provide a minimal but reproducible code so that I can see where the problem is?

Thanks.

@qqaatw
Contributor

qqaatw commented Nov 19, 2021

Edited: Indeed, if a word is tokenized into multiple tokens, the first token will still be present in the generated sequence. I'll take some time to deal with it.

@alvinwatner, what's the input text that you supply to the model?
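
For anyone curious why only the final piece disappears: generate() applies a bad-words logits processor that, for a multi-token bad word, masks just the last piece, and only once the earlier pieces have already been generated. A minimal sketch of that behaviour (the class name and import below reflect recent transformers versions, so treat them as an assumption):

import torch
from transformers import AutoTokenizer, NoBadWordsLogitsProcessor

tokenizer = AutoTokenizer.from_pretrained("gpt2", add_prefix_space=True)

# " tester" typically splits into two BPE pieces (e.g. ['Ġt', 'ester']).
bad_word = tokenizer("tester", add_special_tokens=False).input_ids
processor = NoBadWordsLogitsProcessor([bad_word], eos_token_id=tokenizer.eos_token_id)

scores = torch.zeros(1, len(tokenizer))  # dummy next-token scores

# Prefix does not end with the first piece: nothing is masked, so the
# first piece of the bad word can still be generated.
prefix = torch.tensor([[tokenizer.bos_token_id]])
print(processor(prefix, scores.clone())[0, bad_word].tolist())

# Prefix already ends with the first piece: only the *last* piece is
# masked to -inf, which matches the behaviour described above.
prefix = torch.tensor([[tokenizer.bos_token_id, bad_word[0]]])
print(processor(prefix, scores.clone())[0, bad_word].tolist())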

@giladpn

giladpn commented Nov 19, 2021

Hi @qqaatw

I am trying to use T5 instead of GPT-2 in your example. Here is the code I am using, which is copy-pasted from your code example above with a few minimal changes:

  • changed gpt2 to t5-base
  • changed AutoModelForCausalLM to T5ForConditionalGeneration

The code now generates a sentence successfully but ignores the "bad word" I put in ("dude"). The generated sentence is:

"My cute cat is the sweetest little dude in the world. My cute dog is"

Here is the code, what am I doing wrong? Thank you!

from transformers import AutoTokenizer, AutoModelForCausalLM, T5ForConditionalGeneration, set_seed
model = T5ForConditionalGeneration.from_pretrained("t5-base", return_dict_in_generate=True)
tokenizer = AutoTokenizer.from_pretrained("t5-base", add_prefix_space=True)

set_seed(0)

input_context = "My cute dog"

# get tokens of words that should not be generated
bad_words_ids = tokenizer(["dude"]).input_ids

# encode input context
input_ids = tokenizer(input_context, return_tensors="pt").input_ids
# generate sequences without allowing bad_words to be generated
outputs = model.generate(input_ids=input_ids, max_length=20, do_sample=True, bad_words_ids=bad_words_ids)
print("Generated:", tokenizer.decode(outputs["sequences"][0], skip_special_tokens=True))

@qqaatw
Contributor

qqaatw commented Nov 19, 2021

@giladpn, thanks for providing the code. Can you add add_special_tokens=False to tokenizer.__call__() and see if the problem is solved? Like so:

bad_words_ids = tokenizer(["dude"], add_special_tokens=False).input_ids
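
As far as I can tell, the reason is that T5's tokenizer appends an end-of-sequence token to every encoded string, so the encoded "bad word" carries a trailing </s> id and never lines up with what the model emits mid-sentence. A quick way to see the difference (the exact ids depend on the vocab):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-base")

# With special tokens the banned "word" ends with the </s> id...
print(tokenizer(["dude"]).input_ids)
# ...without them it is just the bare sentencepiece ids of the word.
print(tokenizer(["dude"], add_special_tokens=False).input_ids)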

@giladpn

giladpn commented Nov 19, 2021

@qqaatw Yes! It works now. Many thanks! Much appreciated.

@alvinwatner
Author

alvinwatner commented Nov 22, 2021

Edited: Indeed, if a word is tokenized into multiple tokens, the first token will still be present in the generated sequence. I'll take some time to deal with it.

@alvinwatner, what's the input text that you supply to the model?

Hi, sorry for the late reply. I have been busy working on my paper lately. I eventually created my own script; it's not well optimized, but it seems able to deal with these issues.

  • Here is generation_banned_words.py if you want to take a look: link.

  • Unfortunately, due to time constraints I only managed to attach it to greedy_search. Here is how it looks: link. Also, since my script only requires 'input_ids' and 'next_tokens' (which exist in every sampling method) and 'sorted_next_token_indices' (which is just the topk of the next_tokens_scores), I assume it should not be too difficult to embed this into other sampling methods. Why do we need 'sorted_next_token_indices'? I could explain further, but in short: at every timestep, if the chosen token (initially the argmax) matches the banned_words ids, it is replaced by the token with the next highest probability (e.g., with sorted_next_token_indices = [5, 9, ..., vocab_size] and banned_words_ids = [5], we choose the next highest after 5, which is 9). A stripped-down sketch of this idea appears at the end of this comment.

  • Here is a glimpse of the usage I made in Colab: link

PS: sorry for the spaghetti code :(
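
For anyone who just wants the gist without reading the script, here is a stripped-down sketch of the single-token case (illustrative only; the names below are made up for the example and this is not the actual generation_banned_words.py):

import torch

def pick_allowed_token(next_token_scores: torch.Tensor, banned_ids: set) -> int:
    """Greedy step that skips banned (single) token ids.

    next_token_scores: 1-D tensor of scores over the vocabulary.
    Returns the highest-scoring token id that is not banned.
    """
    # Candidates from highest to lowest score
    # (this plays the role of 'sorted_next_token_indices' above).
    sorted_next_token_indices = torch.argsort(next_token_scores, descending=True)
    for token_id in sorted_next_token_indices.tolist():
        if token_id not in banned_ids:
            return token_id
    # Only reachable if every id in the vocabulary is banned.
    return int(sorted_next_token_indices[0])

Handling multi-token banned words would additionally require comparing the tail of input_ids against each banned sequence before accepting a candidate, which is the part the full script tries to address.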

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@musitafa0032

It seems like this function does not work for Chinese BART. Chinese BART uses the BERT tokenizer rather than the BART tokenizer; I don't know whether that affects it. Does anyone know how to make it work with Chinese BART? Thank you.

@musitafa0032

Interesting, I just figured it out. For Chinese BART you only need the single token id to make it work, because Chinese characters have no subword suffix. If you use the tokenizer to get the bad word ids, it will return something like [[101, 704, 102]], but 101 and 102 represent [CLS] and [SEP]; you only need the 704 id.
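
A small illustration of that (using bert-base-chinese purely as an example of a BERT-style tokenizer; the exact ids depend on the vocabulary):

from transformers import AutoTokenizer

# Any BERT-style Chinese tokenizer shows the same pattern; the checkpoint
# name here is only an example.
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")

bad_word = "中"
print(tokenizer(bad_word).input_ids)                            # includes the [CLS]/[SEP] ids
print(tokenizer(bad_word, add_special_tokens=False).input_ids)  # just the character's own id

# Pass only the bare ids to generate():
# bad_words_ids = [tokenizer(bad_word, add_special_tokens=False).input_ids]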
