Does the `bad_words_ids` argument in the `generate()` function work? #14206
Comments
Hey @alvinwatner, To prevent bad words from occurring in the middle of generated texts, you'll need to add a prefix space to every bad word, so that the tokenized bad words match the tokens the model produces mid-sentence (GPT-2's byte-level BPE encodes " entire" differently from "entire"). This can be done by setting `add_prefix_space=True` on the tokenizer.
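A short sketch of that fix (the bad words here are the ones from the issue below; the exact ids depend on the checkpoint's vocabulary):

```python
from transformers import GPT2Tokenizer

# With add_prefix_space=True the tokenizer encodes each bad word the way
# it appears mid-sentence, i.e. with a leading space.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2", add_prefix_space=True)
bad_words_ids = tokenizer(["entire", "save"], add_special_tokens=False).input_ids
# These ids can then be passed to model.generate(..., bad_words_ids=bad_words_ids).
```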
Thank you @qqaatw for pointing that out. Just to note that this example script doesn't work and is outdated.
Hi @qqaatw, thanks in advance: I am trying to do something very similar but with T5. I am running code similar to the above, but using the T5 tokenizer and model, and the bad word still appears in the output.
Hi @giladpn and @qqaatw. I found something with this bad_words functionality and I'm not sure if this is normal behaviour or not.
For example, the word " tester" (with a prefix space) is tokenized into ["Ġt", "ester"], with ids [256, 7834]; the output sequence keeps the earlier token (256) and only replaces the final token (7834). In another instance, the word " traceroute" (with a prefix space) is tokenized into ["Ġtr", "acer", "oute"], with ids [491, 11736, 13192]; the output sequence keeps the earlier tokens (491, 11736) and only replaces the final token (13192).
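This is easy to inspect directly (a minimal sketch; the ids in the comments are the ones reported above and depend on the vocabulary):

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# A bad word that spans several BPE tokens: bad_words_ids only blocks the
# sequence at its final token, after the earlier subwords were already
# generated, so those prefix tokens remain in the text.
print(tokenizer.tokenize(" tester"))  # reported above as ['Ġt', 'ester']
print(tokenizer.encode(" tester"))    # reported above as [256, 7834]
```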
Hi @giladpn, can you provide a minimal but reproducible code example so that I can see where the problem is? Thanks.
Edited: Indeed, if a word is tokenized into multiple tokens, the leading tokens will still be present in the generated sequence. I'll take some time to deal with it.
Hi @qqaatw I am trying to use T5 instead of GPT-2 in your example. The code I am using is copy-pasted from your code example above with a few minimal changes (see the sketch after this comment).
The code now generates a sentence successfully but ignores the "bad word" I put in ("dude"). The generated sentence is: "My cute cat is the sweetest little dude in the world. My cute dog is". What am I doing wrong? Thank you!
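A minimal sketch along those lines (the original snippet isn't shown, so the model size and exact prompt are assumptions; it reproduces the problem):

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Problem: by default the tokenizer appends the </s> special token to each
# bad word, so the resulting ids never match what the model generates.
bad_words_ids = tokenizer(["dude"]).input_ids

input_ids = tokenizer("My cute cat is", return_tensors="pt").input_ids
outputs = model.generate(input_ids, bad_words_ids=bad_words_ids, max_length=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```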
@giladpn, thanks for providing the code. Can you add `add_special_tokens=False` to the tokenizer call that builds the bad word ids, so that the `</s>` token isn't appended to each bad word, and see whether it works?
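Concretely, assuming the setup sketched above:

```python
# add_special_tokens=False stops the tokenizer from appending </s> (id 1)
# to each bad word, so the ids can actually match the generated tokens.
bad_words_ids = tokenizer(["dude"], add_special_tokens=False).input_ids
```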
@qqaatw Yes! It works now. Many thanks! Much appreciated.
Hi, sorry for the late reply. I have been busy working on my paper lately. I eventually created my own script; it's not well optimized, but it seems able to deal with those issues.
ps: sorry for the spaghetti code :(
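The script itself isn't shown here; as a rough illustration of one way to handle multi-token bad words, a custom logits processor can ban the first token of each bad word outright (this over-blocks any other word starting with that subword, which is the trade-off):

```python
from transformers import LogitsProcessor, LogitsProcessorList

class BanFirstTokenProcessor(LogitsProcessor):
    """Bans the first token of every bad word, so no prefix of a
    multi-token bad word can ever appear in the output."""

    def __init__(self, bad_words_ids):
        self.first_ids = sorted({ids[0] for ids in bad_words_ids if ids})

    def __call__(self, input_ids, scores):
        scores[:, self.first_ids] = -float("inf")
        return scores

# Usage (names assumed from the snippets above):
# processors = LogitsProcessorList([BanFirstTokenProcessor(bad_words_ids)])
# model.generate(input_ids, logits_processor=processors, max_length=20)
```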
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored. |
It seems like this function does not work for Chinese BART. Chinese BART uses the BERT tokenizer rather than the BART tokenizer; I don't know if that affects it. Does anyone know how to make it work with Chinese BART? Thank you
Interesting, I just figured it out. For Chinese BART you only need the single token id to make it work, because Chinese characters are not split into subword pieces. If you use the tokenizer to get the bad word ids, it will return something like [[101, 704, 102]], but 101 and 102 represent [CLS] and [SEP]; you only need the 704.
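A sketch of that, assuming a BERT-style tokenizer for Chinese BART (the checkpoint name and the example character are placeholders):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("fnlp/bart-base-chinese")

# add_special_tokens=False drops the [CLS] (101) and [SEP] (102) ids that
# would otherwise wrap every bad word, leaving just the character's own id.
bad_words_ids = tokenizer(["中"], add_special_tokens=False).input_ids
```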
Environment info
`transformers` version: 4.12.0
Who can help
Information
I attempt to evaluate whether the `bad_words_ids` argument available in the `generate()` function works or not. However, based on the steps described in the section below, it doesn't work.
To reproduce
Below are the steps I used to evaluate:
1. Run `generate()` without `bad_words_ids` being specified, using `set_seed` to get deterministic output.
Output:
Generated: My cute dog, when it died, had taken my entire life to save the life that had been
2. Run `generate()` with `bad_words_ids` specified. I select the words "entire" and "save", taken from the previously generated sequence. However, both words still appear in the output sequence, with no difference from the previous one (a sketch of the script follows the outputs below).
Output:
Generated: My cute dog, when it died, had taken my entire life to save the life that had been
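A sketch of the two steps (the original scripts aren't shown; the checkpoint and generation settings are assumptions, so the exact text will differ):

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer, set_seed

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
input_ids = tokenizer("My cute dog", return_tensors="pt").input_ids

# Step 1: generate without bad_words_ids.
set_seed(42)
out = model.generate(input_ids, do_sample=True, max_length=20)
print("Generated:", tokenizer.decode(out[0], skip_special_tokens=True))

# Step 2: ban "entire" and "save". Tokenized without a prefix space, the
# ids don't match those words as they occur mid-sentence, which is why
# the output is unchanged (see the first reply above).
bad_words_ids = tokenizer(["entire", "save"]).input_ids
set_seed(42)
out = model.generate(input_ids, do_sample=True, max_length=20, bad_words_ids=bad_words_ids)
print("Generated:", tokenizer.decode(out[0], skip_special_tokens=True))
```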
To reproduce in Google Colab:
https://colab.research.google.com/drive/1P4ruLhFstbal1qqXbjuv-kM7yMYY-S1E?usp=sharing
Expected behavior
I expect the words "entire" and "save" not to be included in the output sequence after I run step (2) in the section above.