
[FEATURE] Need to detokenize a BertTokenizer output #117

Closed
daden-ms opened this issue Jun 21, 2019 · 5 comments
Labels
enhancement New feature or request

Comments

@daden-ms
Contributor

Description

Currently the output of the NER prediction contains subwords, but the end user cares about the original words, not the subwords.

For example, 'call Qingxiong Daisy'
tokenizer.tokenize([text]) -> [['call', 'Qing', '##xi', '##ong', 'Daisy']]
output label [['O', 'PersonName', 'X', 'X', 'X', 'PersonName', 'X', 'X']]

Expected behavior with the suggested feature

The desired output should be:
'Qingxiong Daisy' -> PersonName

It would also be helpful to provide the position of the entity.
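
For illustration, here is a minimal sketch (not this repo's implementation) of the kind of merging being requested. It assumes WordPiece continuation pieces are marked with a '##' prefix and that the label list is aligned one label per token, with 'X' marking continuation positions.

```python
# Minimal sketch of the requested detokenization (an assumption, not the
# repo's implementation). Continuation pieces start with '##'; their 'X'
# labels are dropped and the word keeps the label of its first piece.
def merge_subword_predictions(tokens, labels):
    words, word_labels = [], []
    for token, label in zip(tokens, labels):
        if token.startswith("##") and words:
            # Continuation piece: glue onto the previous word, keep its label.
            words[-1] += token[2:]
        else:
            words.append(token)
            word_labels.append(label)
    return list(zip(words, word_labels))


print(merge_subword_predictions(
    ["call", "Qing", "##xi", "##ong", "Daisy"],
    ["O", "PersonName", "X", "X", "PersonName"],
))
# -> [('call', 'O'), ('Qingxiong', 'PersonName'), ('Daisy', 'PersonName')]
```

Entity positions could be recovered the same way by also carrying each word's character offset through the merge.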


@daden-ms daden-ms added the enhancement New feature or request label Jun 21, 2019
@daden-ms
Contributor Author

There is a similar request for pytorch-pretrained-BERT: huggingface/transformers#36


@hlums
Collaborator

hlums commented Jul 5, 2019

@daden Can we consider this issue as resolved?

@miguelgfierro
Member

hey @daden-ms, I think Hong linked another person lol

@daden-ms
Contributor Author

daden-ms commented Aug 6, 2019

Yes, we can close this issue now. For the record, we changed the input of tokenize_ner to take a list of tokens instead of a string.
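
For context, a rough sketch of the idea behind that change (a hypothetical helper, not the actual tokenize_ner signature, assuming the Hugging Face BertTokenizer): when the input is already a list of words, the mapping from pieces back to words can be recorded up front, so predicted labels can be read back at the word level.

```python
from transformers import BertTokenizer  # assumption: Hugging Face tokenizer


# Hypothetical helper illustrating why pre-tokenized input helps: each
# WordPiece piece remembers the index of the word it came from.
def tokenize_pretokenized(words, tokenizer):
    pieces, word_ids = [], []
    for i, word in enumerate(words):
        sub_pieces = tokenizer.tokenize(word)
        pieces.extend(sub_pieces)
        word_ids.extend([i] * len(sub_pieces))
    return pieces, word_ids


tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenize_pretokenized(["call", "Qingxiong", "Daisy"], tokenizer))
# roughly (['call', 'qing', '##xi', '##ong', 'daisy'], [0, 1, 1, 1, 2])
```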

@daden-ms daden-ms closed this as completed Aug 6, 2019