Export to HF and special tokens #2
Comments
You're totally right, special tokens should have been skipped when building the merges. I've added a test to catch this and pushed the fix in version 0.1.4.
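Roughly speaking, the fix boils down to something like this (just a sketch with made-up names, not the actual implementation):

```python
# Sketch only; names here are illustrative, not this library's code.
# Special tokens are inserted into the vocab directly, so they must be
# excluded when reconstructing the (left, right) merge pairs for export.
SPECIAL_TOKENS = {"<s>", "</s>"}  # illustrative set

def iter_merge_candidates(vocab):
    """Yield only tokens that should decompose into exactly two merged parts."""
    for token_str, token_bytes in vocab.items():
        if token_str in SPECIAL_TOKENS:
            continue  # never formed by a merge; would break len(parts) == 2
        if len(token_bytes) < 2:
            continue  # single-byte base tokens have no merge either
        yield token_str, token_bytes
```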
That's a good question; there is no way to do this right now since this library is very bare-bones. It would be possible if you wanted to fork the repo and mess with the Rust code. Probably wouldn't be too hard to hack (probably around L300 in …). Though maybe an even simpler solution could just be to clean your data 😅
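If you go the data-cleaning route, something as blunt as this would do it (just a sketch, assuming your corpus is a list of strings and you know the offending substrings up front):

```python
# Sketch, not part of this library: strip known glitch strings from the corpus
# before training, so the corresponding merges are never learned at all.
GLITCH_STRINGS = ["SolidGoldMagikarp"]  # hypothetical example substring

def scrub_corpus(texts):
    """Return the corpus with glitch substrings replaced by a space."""
    cleaned = []
    for text in texts:
        for glitch in GLITCH_STRINGS:
            text = text.replace(glitch, " ")
        cleaned.append(text)
    return cleaned
```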
Thanks! Regarding the blocklist, I'll have to think about which is easier: messing with the implementation or cleaning the data ...
Also, I should add that I'd be happy to review a PR if you wanted to merge such a feature back into main 🙂
Hi, nice repo!

I'm getting an error with the `tokenizer.export_to_huggingface_format` conversion. The exact problem is that in `convert_tiktoken_to_huggingface`, the code expects `len(merged) == 2`, but for special tokens like `<s>`, the partial BPE function returns three parts, which breaks the assertion.
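For reference, here is a minimal sketch of what I think is going on (the helper below is my own illustration, not the repo's actual code): recovering a merge pair by re-running BPE over a token's bytes works for ordinary tokens, but a special token like `<s>` was never produced by a merge, so the decomposition falls through to its individual bytes:

```python
# Illustrative only: a tiktoken-style "recover the merge pair" helper and a toy
# rank table; names and details are assumptions, not this repo's actual code.
def recover_merge(token_bytes, ranks):
    """Re-run BPE over token_bytes, skipping the token's own rank so the
    process stops one merge short and (ideally) leaves exactly two parts."""
    parts = [bytes([b]) for b in token_bytes]
    while True:
        best = None
        for i in range(len(parts) - 1):
            pair = parts[i] + parts[i + 1]
            if pair == token_bytes:
                continue  # skip the final merge so we stop one step early
            rank = ranks.get(pair)
            if rank is not None and (best is None or rank < best[1]):
                best = (i, rank)
        if best is None:
            return parts
        i = best[0]
        parts = parts[:i] + [parts[i] + parts[i + 1]] + parts[i + 2:]

ranks = {b"h": 0, b"i": 1, b"hi": 2, b"<s>": 3}  # toy vocab with one special token
print(recover_merge(b"hi", ranks))   # [b'h', b'i']        -> a valid (left, right) pair
print(recover_merge(b"<s>", ranks))  # [b'<', b's', b'>']  -> three parts, assertion fails
```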
P.S: This is unrelated, but is there an easy way to create a "block-list" of merges that I don't want to appear in the final vocab? For example, if I think a particular token is a glitch that is overrepresented in the tokenization dataset, but not representative of real text?