
[FEATURE] Need to detokenize a BertTokenizer output #117

Closed
daden-ms opened this issue Jun 21, 2019 · 5 comments
Labels
enhancement New feature or request

Comments

@daden-ms
Contributor

Description

Currently the output of the NER prediction contains subwords, but the end user cares about the original words, not the subwords.

For example, 'call Qingxiong Daisy'
tokenizer.tokenize([text]) -> [['call', 'Qing', '##xi', '##ong', 'Daisy']]
output label [['O', 'PersonName', 'X', 'X', 'X', 'PersonName', 'X', 'X']]

Expected behavior with the suggested feature

The desired output should be:
'Qingxiong Daisy' -> PersonName

It would also be helpful to provide the position of the entity.
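
For illustration, here is a minimal sketch (not this repo's implementation) of the kind of merging being requested. It assumes WordPiece continuation pieces are marked with a '##' prefix and that the label list is aligned one label per token, with 'X' marking continuation positions.

```python
# Minimal sketch of the requested detokenization (an assumption, not the
# repo's implementation). Continuation pieces start with '##'; their 'X'
# labels are dropped and the word keeps the label of its first piece.
def merge_subword_predictions(tokens, labels):
    words, word_labels = [], []
    for token, label in zip(tokens, labels):
        if token.startswith("##") and words:
            # Continuation piece: glue onto the previous word, keep its label.
            words[-1] += token[2:]
        else:
            words.append(token)
            word_labels.append(label)
    return list(zip(words, word_labels))


print(merge_subword_predictions(
    ["call", "Qing", "##xi", "##ong", "Daisy"],
    ["O", "PersonName", "X", "X", "PersonName"],
))
# -> [('call', 'O'), ('Qingxiong', 'PersonName'), ('Daisy', 'PersonName')]
```

Entity positions could be recovered the same way by also carrying each word's character offset through the merge.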


@daden-ms daden-ms added the enhancement New feature or request label Jun 21, 2019
@daden-ms
Contributor Author

There is a similar request for pytorch-pretrained-BERT: huggingface/transformers#36


@hlums
Collaborator

hlums commented Jul 5, 2019

@daden Can we consider this issue as resolved?

@miguelgfierro
Member

hey @daden-ms, I think Hong linked another person lol

@daden-ms
Contributor Author

daden-ms commented Aug 6, 2019

Yes, we can close this issue now. For the record, we changed the input of tokenize_ner to take a list of tokens instead of a string.
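
For context, a rough sketch of the idea behind that change (a hypothetical helper, not the actual tokenize_ner signature, assuming the Hugging Face BertTokenizer): when the input is already a list of words, the mapping from pieces back to words can be recorded up front, so predicted labels can be read back at the word level.

```python
from transformers import BertTokenizer  # assumption: Hugging Face tokenizer


# Hypothetical helper illustrating why pre-tokenized input helps: each
# WordPiece piece remembers the index of the word it came from.
def tokenize_pretokenized(words, tokenizer):
    pieces, word_ids = [], []
    for i, word in enumerate(words):
        sub_pieces = tokenizer.tokenize(word)
        pieces.extend(sub_pieces)
        word_ids.extend([i] * len(sub_pieces))
    return pieces, word_ids


tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenize_pretokenized(["call", "Qingxiong", "Daisy"], tokenizer))
# roughly (['call', 'qing', '##xi', '##ong', 'daisy'], [0, 1, 1, 1, 2])
```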

@daden-ms daden-ms closed this as completed Aug 6, 2019