Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Export to HF and special tokens #2

Closed
JonasGeiping opened this issue Aug 29, 2024 · 3 comments
Closed

Export to HF and special tokens #2

JonasGeiping opened this issue Aug 29, 2024 · 3 comments

Comments

@JonasGeiping
Copy link

Hi, nice repo!

I'm getting an error with the tokenizer.export_to_huggingface_format conversion. The exact problem is that in convert_tiktoken_to_huggingface, the code expects len(merged) ==2, but for special tokens like <s>, the partial bpe function returns three parts:

(Pdb) merged
(b'<', b's', b'>')

which breaks the assertion.

P.S: This is unrelated, but is there an easy way to create a "block-list" of merges that I don't want to appear in the final vocab? For example, if I think a particular token is a glitch that is overrepresented in the tokenization dataset, but not representative of real text?

@gautierdag
Copy link
Owner

You're totally right, special tokens should have been skipped when building the merges. I've added a test to catch this and pushed the fix in version 0.1.4.

P.S: This is unrelated, but is there an easy way to create a "block-list" of merges that I don't want to appear in the final vocab? For example, if I think a particular token is a glitch that is overrepresented in the tokenization dataset, but not representative of real text?

That's a good question, there is no way to do this right now since this library very bare-bones. It would be possible if you wanted to fork the repo and mess with the rust code. Probably wouldn't be too hard to hack (probably around L300 in lib.rs) - could be just a check to see if a proposed new token is in a blocklist and if it is then skip .

Though a maybe even simpler solution could just be to clean your data 😅

@JonasGeiping
Copy link
Author

Thanks!

Regarding the blocklist, I'll have to think which one of messing with the implementation and cleaning the data is easier to do ...

@gautierdag
Copy link
Owner

Also should add that I'd be happy to review a PR if you wanted to merge back such a feature into main 🙂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants