
How can we tokenize sentence pair with 'PretrainedTransformerTokenizer' #4532

Closed
wlhgtc opened this issue Aug 4, 2020 · 7 comments

wlhgtc (Contributor) commented Aug 4, 2020

@dirkgr It seems the new version of the code removes @ZhaofengWu's changes that supported sentence pairs in #3868.
How can we tokenize sentence pairs in 1.1.0?

wlhgtc (Contributor, Author) commented Aug 4, 2020

@ZhaofengWu I read your discussion in #3868, but I'm still confused by the change.
Which change dropped support for sentence pairs, and is there another way to do it? Maybe you could help me?

dirkgr (Member) commented Aug 4, 2020

Use PretrainedTransformerTokenizer.add_special_tokens(). Make sure that the inputs you pass to that function don't have special tokens already.

The reason I split it this way is because we were on the path of creating a single tokenize() call that does everything, like huggingface has. I don't think that's a good design. I think we need to have separate functions for tokenization, adding special tokens, splitting long sequences, and so on. This was one step in that direction. Sadly, for backwards compatibility, we didn't fully commit to it, so all we have is the add_special_tokens() call. I am hoping to rectify that situation later.
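For concreteness, here is a minimal sketch of that pattern, assuming AllenNLP 1.1 and "bert-base-uncased" as an example model name: construct the tokenizer with add_special_tokens=False so that tokenize() doesn't insert [CLS]/[SEP] on its own, then combine the pair with add_special_tokens().

from allennlp.data.tokenizers import PretrainedTransformerTokenizer

# Tokenize each sentence without special tokens, then let the tokenizer
# add [CLS]/[SEP] (and the right type ids) for the pair in one place.
tokenizer = PretrainedTransformerTokenizer("bert-base-uncased", add_special_tokens=False)
tokens_a = tokenizer.tokenize("The first sentence.")
tokens_b = tokenizer.tokenize("The second sentence.")
pair_tokens = tokenizer.add_special_tokens(tokens_a, tokens_b)
print([t.text for t in pair_tokens])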

wlhgtc (Contributor, Author) commented Aug 4, 2020

@dirkgr Thanks for your quick reply.

Suppose the two parts are tokens1 and tokens2. We need to set add_special_tokens to False on the tokenizer,
and then the code would look as follows?

def text_to_instance(self, text1: str, text2: str) -> Instance:
    ...
    tokens1 = self._tokenizer.tokenize(text1)
    tokens2 = self._tokenizer.tokenize(text2)
    concat_tokens = self._tokenizer.add_special_tokens(tokens1, tokens2)
    fields["tokens"] = TextField(concat_tokens, self._token_indexers)
    ...

dirkgr (Member) commented Aug 4, 2020

That is correct!
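
For completeness, a self-contained version of that pattern might look like the sketch below (still assuming "bert-base-uncased"; the helper name build_pair_instance is illustrative only), with the matching PretrainedTransformerIndexer on the indexer side.

from allennlp.data import Instance
from allennlp.data.fields import TextField
from allennlp.data.token_indexers import PretrainedTransformerIndexer
from allennlp.data.tokenizers import PretrainedTransformerTokenizer

model_name = "bert-base-uncased"  # example model; swap in your own
tokenizer = PretrainedTransformerTokenizer(model_name, add_special_tokens=False)
token_indexers = {"tokens": PretrainedTransformerIndexer(model_name)}

def build_pair_instance(text1: str, text2: str) -> Instance:
    # Tokenize each segment separately, without [CLS]/[SEP].
    tokens1 = tokenizer.tokenize(text1)
    tokens2 = tokenizer.tokenize(text2)
    # Add the model's special tokens (and type ids) for the pair once.
    concat_tokens = tokenizer.add_special_tokens(tokens1, tokens2)
    return Instance({"tokens": TextField(concat_tokens, token_indexers)})

instance = build_pair_instance("The first sentence.", "The second sentence.")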

github-actions bot commented Sep 4, 2020

@dirkgr this is just a friendly ping to make sure you haven't forgotten about this issue 😜

(1 similar comment)

wlhgtc (Contributor, Author) commented Sep 5, 2020

Case solved, closing this issue.

wlhgtc closed this as completed Sep 5, 2020