
Wrong offsets_mapping in T5TokenizerFast #9633

Closed
zorikg opened this issue Jan 16, 2021 · 8 comments


zorikg commented Jan 16, 2021

Environment info

  • transformers version: 4.2.1
  • Platform: Linux-4.9.0-14-amd64-x86_64-with-debian-9.13
  • Python version: 3.6.10
  • PyTorch version (GPU?): 1.7.0 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help: @patrickvonplaten, @mfuntowicz

Information

Model I am using: T5

To reproduce

See comments in the code snippet.

from transformers import T5TokenizerFast


def test_offset_mapping():
    """This test fails and therefore we know that there is a bug in offset_mapping mechanism.
        We try to tokenize the sentence 'This is a test sentence' and notice to issues:

        1. The tokenizer tokenizes it to ['This', 'is', '', 'a', 'test', 'sentence']
            which means that it has redundant empty string in position 2.
        2. The offset mapping maps to ['This', 'is', 'a', 'a', 'test', 'sentence']
            replacing the empty string with redundant 'a'.

    """
    tokenizer = T5TokenizerFast.from_pretrained('google/t5-v1_1-base')

    s = "This is a test sentence"
    tokenized = tokenizer(s, return_offsets_mapping=True)
    
    decoded_tokens, tokens_from_offset_mapping = [], []
    for token_index, offset_mapping in enumerate(tokenized['offset_mapping']):
        decoded_token = tokenizer.decode(tokenized['input_ids'][token_index])
        if decoded_token != tokenizer.eos_token:
            decoded_tokens.append(decoded_token)
            tokens_from_offset_mapping.append(s[offset_mapping[0]:offset_mapping[1]])

    error_msg = f"Wrong offset mapping for '{s}'! \n" \
                f"Maps to:          {tokens_from_offset_mapping}\n" \
                f"Instead of:       {decoded_tokens}"
    assert decoded_tokens == tokens_from_offset_mapping, error_msg


if __name__ == "__main__":
    test_offset_mapping()

Expected behavior

AssertionError: Wrong offset mapping for 'This is a test sentence'! 
Maps to:          ['This', 'is', 'a', 'a', 'test', 'sentence']
Instead of:       ['This', 'is', '', 'a', 'test', 'sentence']

LysandreJik (Member) commented:

@patrickvonplaten @n1t0 do you have any advice on this? The T5 tokenizer tokenizes the sentence as follows:

['▁This', '▁is', '▁', 'a', '▁test', '▁sentence']

Unfortunately, the offset mappings point to both '▁' and 'a' being at (8, 9), as the following shows:

'offset_mapping': [(0, 4), (5, 7), (8, 9), (8, 9), (10, 14), (15, 23), (0, 0)]
                                    ^---- & ^---- here 

How should one map this encoding back to the initial sequence?
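
For reference, here is a minimal sketch showing how the tokens and offsets above can be printed side by side (assuming the same 'google/t5-v1_1-base' checkpoint as in the report):

from transformers import T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained('google/t5-v1_1-base')
s = "This is a test sentence"
encoded = tokenizer(s, return_offsets_mapping=True)

# Print each token piece next to the span of `s` its offset mapping points to.
# With the buggy fast tokenizer, both '▁' and 'a' report the span (8, 9).
tokens = tokenizer.convert_ids_to_tokens(encoded['input_ids'])
for token, (start, end) in zip(tokens, encoded['offset_mapping']):
    print(f"{token!r:>12} ({start}, {end}) -> {s[start:end]!r}")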

zorikg (Author) commented Feb 1, 2021

@patrickvonplaten @n1t0 - did you have a chance to look at this?
Thanks!

n1t0 (Member) commented Feb 3, 2021

Hi @zorikg! Thank you for reporting this issue. This is related to #9637, which concerns the offset-mapping bug.

The fix for this bug is tricky to deploy, but we are working on it, and I expect it to be available in the coming weeks.

zorikg (Author) commented Mar 6, 2021

Thanks @n1t0, I was wondering whether there has been any progress on this? Any estimate for when the fix will be available? Thanks!

github-actions commented:

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

n1t0 (Member) commented Apr 15, 2021

@zorikg With recent versions of transformers, you can instantiate your tokenizer as follows:

tokenizer = T5TokenizerFast.from_pretrained('google/t5-v1_1-base', from_slow=True)

This will force the conversion from the slow tokenizer, thus using the fixed version.
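
A quick way to verify the workaround against the original repro (a sketch, assuming a transformers version recent enough to support from_slow):

from transformers import T5TokenizerFast

# Force conversion from the slow (SentencePiece-based) tokenizer.
tokenizer = T5TokenizerFast.from_pretrained('google/t5-v1_1-base', from_slow=True)
s = "This is a test sentence"
encoded = tokenizer(s, return_offsets_mapping=True)

# With the converted tokenizer, each token should map to its own span of `s`,
# rather than two tokens sharing (8, 9).
print(encoded['offset_mapping'])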

github-actions commented:

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.


Oxi84 commented Nov 29, 2021

I am getting some differences between these two tokenizers. Has this been solved?
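
In case it helps, a small sketch for comparing the two tokenizers directly (this assumes the comparison is between the slow T5Tokenizer and the fast T5TokenizerFast for the same checkpoint):

from transformers import T5Tokenizer, T5TokenizerFast

slow = T5Tokenizer.from_pretrained('google/t5-v1_1-base')
fast = T5TokenizerFast.from_pretrained('google/t5-v1_1-base')

s = "This is a test sentence"
# Compare the token pieces produced by each tokenizer.
print(slow.tokenize(s))
print(fast.tokenize(s))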
