
Can PretrainedTransformerTokenizer track character offset like WordTokenizer? #3458

Closed
yyHaker opened this issue Nov 16, 2019 · 11 comments

@yyHaker

yyHaker commented Nov 16, 2019

Question

  • Can PretrainedTransformerTokenizer track character offsets like WordTokenizer does?
    Character offsets are needed to recover answer spans after wordpiece tokenization, as illustrated in the sketch below.
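
For illustration, a minimal sketch (not AllenNLP API; the passage and answer strings are made up) of how character offsets from a huggingface "fast" tokenizer can recover a wordpiece answer span:

```python
# A minimal sketch (not AllenNLP code): map a character-level answer span onto
# wordpiece indices using the offsets from a huggingface "fast" tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

passage = "AllenNLP is an open-source NLP research library."
answer = "open-source"
answer_start = passage.index(answer)
answer_end = answer_start + len(answer)  # exclusive character end

encoding = tokenizer(passage, return_offsets_mapping=True)
offsets = encoding["offset_mapping"]  # one (char_start, char_end) pair per wordpiece

# A wordpiece belongs to the answer if its character range overlaps the answer span.
# Special tokens like [CLS] and [SEP] report (0, 0) and are skipped.
answer_token_indices = [
    i for i, (start, end) in enumerate(offsets)
    if start < answer_end and end > answer_start and (start, end) != (0, 0)
]
print(answer_token_indices[0], answer_token_indices[-1])  # inclusive token span
```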
@matt-gardner
Contributor

This is a TODO in the code. I know that the huggingface repo has code to train SQuAD models, so there must be a way to do this calculation in that repo, but I haven't looked at the code to figure it out. Contributions welcome!

@matt-gardner matt-gardner added the Contributions welcome and Good First Issue labels Nov 17, 2019
@nadgeri14
Contributor

@matt-gardner, if this issue is still open, I would love to take it up. I might need your assistance, as I am relatively new to the code base; a list of the relevant TODOs would really help me.

@matt-gardner
Contributor

The new tokenizers library from huggingface tracks character offsets, so we don't need to add this ourselves. We have someone here who's going to be fixing this very soon. I'd recommend against picking up this issue.
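
For reference, a small sketch of that offset tracking in the standalone tokenizers library; the vocab file path is an assumption (any BERT-style vocab.txt works):

```python
# Sketch of offset tracking in huggingface's standalone `tokenizers` library.
# The vocab file path below is an assumption; substitute any BERT-style vocab.txt.
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)
encoding = tokenizer.encode("Character offsets come for free.")

# Each wordpiece carries the (start, end) character span it was cut from.
for token, (start, end) in zip(encoding.tokens, encoding.offsets):
    print(token, start, end)
```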

@matt-gardner matt-gardner added the Under Development label and removed the Contributions welcome and Good First Issue labels Jan 15, 2020
@matt-gardner matt-gardner added this to the 1.0.0 milestone Jan 15, 2020
@nadgeri14
Contributor

@matt-gardner Oh, thanks for the update.

@dirkgr dirkgr self-assigned this Jan 17, 2020
@dirkgr
Member

dirkgr commented Feb 13, 2020

A few weeks ago I added a parameter to PretrainedTransformerTokenizer that attempts to calculate offsets after the fact. It does so imperfectly, but it might get you going if you need this right away.
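
A hedged sketch of what using that interim workaround might look like; the calculate_character_offsets parameter name and the way the token's idx attribute is populated are assumptions, not confirmed API:

```python
# Hedged sketch: `calculate_character_offsets` and the way `idx` is populated
# below are assumptions about the interim change, not confirmed API.
from allennlp.data.tokenizers import PretrainedTransformerTokenizer

tokenizer = PretrainedTransformerTokenizer(
    "bert-base-uncased",
    calculate_character_offsets=True,  # assumed name of the opt-in parameter
)

for token in tokenizer.tokenize("Offsets are recovered after the fact."):
    # `idx` would hold the character offset the wordpiece starts at,
    # or None where the heuristic cannot place it.
    print(token.text, token.idx)
```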

@matt-gardner
Contributor

I have no context or intuition about what time label this one should get; @dirkgr, any ideas?

@dirkgr dirkgr added the Day label Mar 23, 2020
@dirkgr
Member

dirkgr commented Mar 23, 2020

We already have the code for this (#3868), so this task is to integrate the new huggingface tokenizers whenever those remaining bugs are fixed, and bring that PR up to date. I'll say that's a day's worth of work.

@matt-gardner
Contributor

Just noting that #4018 integrated new huggingface tokenizers, so updating #3868 should be unblocked at this point.

@dirkgr
Member

dirkgr commented Apr 6, 2020

I'm aware.

@dirkgr dirkgr removed the Day label Apr 14, 2020
@dirkgr
Member

dirkgr commented Apr 14, 2020

New huggingface tokenizers are still broken. I'm moving this to the bottom of the stack for 1.0. Maybe we'll bump it to 1.1.

@dirkgr
Member

dirkgr commented May 12, 2020

Finally done!

@dirkgr dirkgr closed this as completed May 12, 2020