T5Tokenizer: decode does not show special tokens #8109

Closed · 2 tasks
jsrozner opened this issue Oct 28, 2020 · 3 comments
jsrozner commented Oct 28, 2020

Environment info

  • transformers version: 3.4.0
  • Platform: macOS-10.15.7-x86_64-i386-64bit
  • Python version: 3.8.5
  • PyTorch version (GPU?): 1.7.0 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: N/A
  • Using distributed or parallel set-up in script?: N/A

Who can help

examples/seq2seq: @sshleifer

Information

Model I am using (Bert, XLNet ...): T5Tokenizer

The problem arises when using:

  • [ ] the official example scripts: (give details below)
  • [x] my own modified scripts: (give details below)

The task I am working on is:

  • [ ] an official GLUE/SQuAD task: (give the name)
  • [x] my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

from transformers import T5Tokenizer

# A string that already contains T5's special tokens (pad, unk, EOS).
text = "word  <pad> <unk> </s> </s>"
t5tokenizer = T5Tokenizer.from_pretrained('t5-small')
# Encode the string, then decode it back without skipping special tokens.
tokenized = t5tokenizer.batch_encode_plus([text], max_length=10, padding="longest", return_tensors="pt").input_ids
print(t5tokenizer.batch_decode(tokenized, skip_special_tokens=False, clean_up_tokenization_spaces=False))

Encoded tokens: ▁word <pad> <unk> </s> </s>
Decoded output: word ⁇

Expected behavior

All of the special tokens should appear in the decoded output, but everything except the unknown token is dropped (no pad or EOS).
Calling convert_ids_to_tokens followed by convert_tokens_to_string drops them as well (see the sketch below).
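
As a minimal sketch of where the tokens disappear (assuming the t5tokenizer and tokenized objects from the snippet above, on transformers 3.4.0): the IDs still map back to the special tokens, so the drop happens inside convert_tokens_to_string, not in the id-to-token mapping.

# Diagnostic sketch: inspect the two halves of decoding separately.
tokens = t5tokenizer.convert_ids_to_tokens(tokenized[0].tolist())
print(tokens)
# ['▁word', '<pad>', '<unk>', '</s>', '</s>']  <- special tokens survive the
#                                                 id -> token step
print(t5tokenizer.convert_tokens_to_string(tokens))
# 'word ⁇'  <- <pad> and </s> vanish here; <unk> is rendered as ' ⁇ '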

sshleifer commented Oct 28, 2020

T5: @patrickvonplaten I think you need to set _additional_special_tokens.
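
For anyone needing the special tokens back before a fix lands, here is a rough workaround sketch, not the library fix (it assumes the t5tokenizer and tokenized objects from the reproduction, and decode_keep_special is just an illustrative helper name): since only convert_tokens_to_string loses the special tokens, pass anything the tokenizer reports as special straight through and run convert_tokens_to_string on the rest.

# Workaround sketch: route special tokens around convert_tokens_to_string.
def decode_keep_special(tokenizer, ids):
    pieces = []
    for tok in tokenizer.convert_ids_to_tokens(ids):
        if tok in tokenizer.all_special_tokens:
            pieces.append(tok)  # keep special tokens verbatim
        else:
            pieces.append(tokenizer.convert_tokens_to_string([tok]))
    return " ".join(pieces)

print(decode_keep_special(t5tokenizer, tokenized[0].tolist()))
# 'word <pad> <unk> </s> </s>'  (spacing is approximate)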

sshleifer commented:
@jsrozner want to try to fix?

patrickvonplaten commented:
This is a duplicate of #5142 and will be fixed by the PR linked below. Thanks for reporting it; it seems multiple people were running into this issue!
