
Fix XLM-RoBERTa detokenize() #1289

Merged 1 commit into keras-team:master on Oct 28, 2023

Conversation

abheesht17 (Collaborator) commented on Oct 28, 2023

Resolves #1282

Tested manually:

import io

import sentencepiece
import tensorflow as tf

from keras_nlp.models.xlm_roberta.xlm_roberta_tokenizer import (
    XLMRobertaTokenizer,
)

# Train a tiny word-level SentencePiece model in memory to serve as the
# tokenizer's proto.
bytes_io = io.BytesIO()
vocab_data = tf.data.Dataset.from_tensor_slices(
    ["the quick brown fox", "the earth is round"]
)
sentencepiece.SentencePieceTrainer.train(
    sentence_iterator=vocab_data.as_numpy_iterator(),
    model_writer=bytes_io,
    vocab_size=10,
    model_type="WORD",
    unk_id=0,
    bos_id=1,
    eos_id=2,
)
proto = bytes_io.getvalue()

tokenizer = XLMRobertaTokenizer(proto=proto)

# Detokenize a batch of token ids back into text.
input_data = tf.constant([[4, 9, 5, 7]])
output = tokenizer.detokenize(input_data)
print(output)

Output:

<tf.Tensor: shape=(1,), dtype=string, numpy=array([b'the quick brown fox'], dtype=object)>
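
For background on the bug (the mechanism and the numbers below are an assumption based on how the tokenizer works, not a copy of this PR's diff): XLMRobertaTokenizer reserves low ids for its own special tokens and shifts the underlying SentencePiece ids up during tokenize(), so a correct detokenize() has to undo that shift and guard against out-of-range ids before handing the ids back to SentencePiece. A minimal sketch of that kind of re-mapping, with the helper name, offset, and special-token handling all hypothetical:

import tensorflow as tf

def remap_for_detokenize(token_ids, offset=1, unk_id=0):
    # Hypothetical helper (name, offset, and unk handling are assumptions):
    # undo the +1 shift applied at tokenize() time so the ids line up with
    # the underlying SentencePiece vocabulary again.
    token_ids = tf.convert_to_tensor(token_ids)
    shifted = token_ids - offset
    # Ids that fall below the SentencePiece range (i.e. the tokenizer's own
    # special tokens) are mapped to the unknown token instead of being
    # passed through, which would otherwise produce garbage or an error.
    return tf.where(shifted < 0, tf.fill(tf.shape(shifted), unk_id), shifted)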

abheesht17 (Collaborator, Author) commented:

/gcbrun

mattdangerw (Member) commented:

Looks good! I will add back the detokenize tests while we are at it.
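
Something like the following could serve as that regression test (a hypothetical sketch, not the follow-up commit itself); it assumes a tokenizer built from the same toy SentencePiece proto as in the snippet above:

import tensorflow as tf

def test_detokenize(tokenizer):
    # Hypothetical round-trip check: tokenize a string, detokenize the
    # resulting ids, and expect the original text back.
    token_ids = tokenizer(["the quick brown fox"])
    output = tokenizer.detokenize(token_ids)
    tf.debugging.assert_equal(output, tf.constant(["the quick brown fox"]))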

mattdangerw merged commit 6b66ad8 into keras-team:master on Oct 28, 2023
Successfully merging this pull request may close these issues.

XLMRobertaTokenizer.detokenize method is not working (#1282)