
Can PretrainedTransformerTokenizer track character offset like WordTokenizer? #3458

Closed
yyHaker opened this issue Nov 16, 2019 · 11 comments

@yyHaker

yyHaker commented Nov 16, 2019

Question

  • Can PretrainedTransformerTokenizer track character offsets like WordTokenizer does?
    Character offsets are needed to recover answer spans after wordpiece tokenization, as illustrated in the sketch below.
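
For illustration, a minimal sketch (not AllenNLP API; the passage and answer strings are made up) of how character offsets from a huggingface "fast" tokenizer can recover a wordpiece answer span:

```python
# A minimal sketch (not AllenNLP code): map a character-level answer span onto
# wordpiece indices using the offsets from a huggingface "fast" tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

passage = "AllenNLP is an open-source NLP research library."
answer = "open-source"
answer_start = passage.index(answer)
answer_end = answer_start + len(answer)  # exclusive character end

encoding = tokenizer(passage, return_offsets_mapping=True)
offsets = encoding["offset_mapping"]  # one (char_start, char_end) pair per wordpiece

# A wordpiece belongs to the answer if its character range overlaps the answer span.
# Special tokens like [CLS] and [SEP] report (0, 0) and are skipped.
answer_token_indices = [
    i for i, (start, end) in enumerate(offsets)
    if start < answer_end and end > answer_start and (start, end) != (0, 0)
]
print(answer_token_indices[0], answer_token_indices[-1])  # inclusive token span
```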
@matt-gardner
Contributor

This is a TODO in the code. I know that the huggingface repo has code to train SQuAD models, so there must be a way to do this calculation in that repo, but I haven't looked at the code to figure it out. Contributions welcome!

@matt-gardner matt-gardner added the Contributions welcome and Good First Issue labels Nov 17, 2019
@nadgeri14
Contributor

@matt-gardner, if this issue is still open, I would love to take it up. I might need your assistance, as I am relatively new to the code base; a list of the relevant TODOs would really help me.

@matt-gardner
Contributor

The new tokenizers library from huggingface tracks character offsets, so we don't need to add this ourselves. We have someone here who's going to be fixing this very soon. I'd recommend against picking up this issue.
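
For reference, a small sketch of that offset tracking in the standalone tokenizers library; the vocab file path is an assumption (any BERT-style vocab.txt works):

```python
# Sketch of offset tracking in huggingface's standalone `tokenizers` library.
# The vocab file path below is an assumption; substitute any BERT-style vocab.txt.
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)
encoding = tokenizer.encode("Character offsets come for free.")

# Each wordpiece carries the (start, end) character span it was cut from.
for token, (start, end) in zip(encoding.tokens, encoding.offsets):
    print(token, start, end)
```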

@matt-gardner matt-gardner added the Under Development label and removed the Contributions welcome and Good First Issue labels Jan 15, 2020
@matt-gardner matt-gardner added this to the 1.0.0 milestone Jan 15, 2020
@nadgeri14
Contributor

@matt-gardner Oh, thanks for the update.

@dirkgr dirkgr self-assigned this Jan 17, 2020
@dirkgr
Member

dirkgr commented Feb 13, 2020

A few weeks ago I added a parameter to PretrainedTransformerTokenizer that attempts to calculate offsets after the fact. It does so imperfectly, but it might get you going if you need this right away.
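
A hedged sketch of what using that interim workaround might look like; the calculate_character_offsets parameter name and the way the token's idx attribute is populated are assumptions, not confirmed API:

```python
# Hedged sketch: `calculate_character_offsets` and the way `idx` is populated
# below are assumptions about the interim change, not confirmed API.
from allennlp.data.tokenizers import PretrainedTransformerTokenizer

tokenizer = PretrainedTransformerTokenizer(
    "bert-base-uncased",
    calculate_character_offsets=True,  # assumed name of the opt-in parameter
)

for token in tokenizer.tokenize("Offsets are recovered after the fact."):
    # `idx` would hold the character offset the wordpiece starts at,
    # or None where the heuristic cannot place it.
    print(token.text, token.idx)
```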

@matt-gardner
Contributor

I have no context or intuition about what time label this one should get; @dirkgr, any ideas?

@dirkgr dirkgr added the Day label Mar 23, 2020
@dirkgr
Member

dirkgr commented Mar 23, 2020

We already have the code for this (#3868), so this task is to integrate the new huggingface tokenizers whenever those remaining bugs are fixed, and bring that PR up to date. I'll say that's a day's worth of work.

@matt-gardner
Contributor

Just noting that #4018 integrated new huggingface tokenizers, so updating #3868 should be unblocked at this point.

@dirkgr
Member

dirkgr commented Apr 6, 2020

I'm aware.

@dirkgr dirkgr removed the Day label Apr 14, 2020
@dirkgr
Member

dirkgr commented Apr 14, 2020

New huggingface tokenizers are still broken. I'm moving this to the bottom of the stack for 1.0. Maybe we'll bump it to 1.1.

@dirkgr
Member

dirkgr commented May 12, 2020

Finally done!

@dirkgr dirkgr closed this as completed May 12, 2020