
Fix XLM-RoBERTa detokenize() #1289

Merged 1 commit into keras-team:master on Oct 28, 2023

Conversation

abheesht17 (Collaborator) commented on Oct 28, 2023

Resolves #1282

Tested manually:

import io

import sentencepiece
import tensorflow as tf

from keras_nlp.models.xlm_roberta.xlm_roberta_tokenizer import (
    XLMRobertaTokenizer,
)

# Train a tiny word-level SentencePiece model in memory to serve as the
# tokenizer's proto.
bytes_io = io.BytesIO()
vocab_data = tf.data.Dataset.from_tensor_slices(
    ["the quick brown fox", "the earth is round"]
)
sentencepiece.SentencePieceTrainer.train(
    sentence_iterator=vocab_data.as_numpy_iterator(),
    model_writer=bytes_io,
    vocab_size=10,
    model_type="WORD",
    unk_id=0,
    bos_id=1,
    eos_id=2,
)
proto = bytes_io.getvalue()

tokenizer = XLMRobertaTokenizer(proto=proto)

# Detokenize a batch of token ids back into text.
input_data = tf.constant([[4, 9, 5, 7]])
output = tokenizer.detokenize(input_data)
print(output)

Output:

<tf.Tensor: shape=(1,), dtype=string, numpy=array([b'the quick brown fox'], dtype=object)>
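
For background on the bug (the mechanism and the numbers below are an assumption based on how the tokenizer works, not a copy of this PR's diff): XLMRobertaTokenizer reserves low ids for its own special tokens and shifts the underlying SentencePiece ids up during tokenize(), so a correct detokenize() has to undo that shift and guard against out-of-range ids before handing the ids back to SentencePiece. A minimal sketch of that kind of re-mapping, with the helper name, offset, and special-token handling all hypothetical:

import tensorflow as tf

def remap_for_detokenize(token_ids, offset=1, unk_id=0):
    # Hypothetical helper (name, offset, and unk handling are assumptions):
    # undo the +1 shift applied at tokenize() time so the ids line up with
    # the underlying SentencePiece vocabulary again.
    token_ids = tf.convert_to_tensor(token_ids)
    shifted = token_ids - offset
    # Ids that fall below the SentencePiece range (i.e. the tokenizer's own
    # special tokens) are mapped to the unknown token instead of being
    # passed through, which would otherwise produce garbage or an error.
    return tf.where(shifted < 0, tf.fill(tf.shape(shifted), unk_id), shifted)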

abheesht17 (Collaborator, Author) commented:

/gcbrun

mattdangerw (Member) commented:

Looks good! I will add back the detokenize tests while we are at it.
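
Something like the following could serve as that regression test (a hypothetical sketch, not the follow-up commit itself); it assumes a tokenizer built from the same toy SentencePiece proto as in the snippet above:

import tensorflow as tf

def test_detokenize(tokenizer):
    # Hypothetical round-trip check: tokenize a string, detokenize the
    # resulting ids, and expect the original text back.
    token_ids = tokenizer(["the quick brown fox"])
    output = tokenizer.detokenize(token_ids)
    tf.debugging.assert_equal(output, tf.constant(["the quick brown fox"]))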

mattdangerw merged commit 6b66ad8 into keras-team:master on Oct 28, 2023
Successfully merging this pull request may close these issues.

XLMRobertaTokenizer.detokenize method is not working (#1282)