Export to HF and special tokens #2

JonasGeiping · 2024-08-29T14:05:24Z

Hi, nice repo!

I'm getting an error with the tokenizer.export_to_huggingface_format conversion. The exact problem is that convert_tiktoken_to_huggingface expects len(merged) == 2, but for special tokens like <s>, the partial BPE function returns three parts:

(Pdb) merged
(b'<', b's', b'>')

which breaks the assertion.
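For readers following along, here is a minimal sketch of why a special token ends up in three parts. It is not the library's actual code: split_into_parts, recover_merges, and toy_ranks are hypothetical names, and the logic only mirrors the common way of recovering merges from a tiktoken-style rank table. A special token such as <s> was never produced by a merge, so none of its sub-pairs exist in the table and it stays split into single bytes.

```python
def split_into_parts(ranks, token, max_rank):
    """Re-run BPE on `token`, using only merges ranked below `max_rank`."""
    parts = [bytes([b]) for b in token]
    while True:
        best_idx, best_rank = None, None
        for i, (left, right) in enumerate(zip(parts[:-1], parts[1:])):
            rank = ranks.get(left + right)
            if rank is not None and (best_rank is None or rank < best_rank):
                best_idx, best_rank = i, rank
        # Stop when no pair is mergeable, or only later merges remain.
        if best_rank is None or best_rank >= max_rank:
            return parts
        merged = parts[best_idx] + parts[best_idx + 1]
        parts = parts[:best_idx] + [merged] + parts[best_idx + 2:]


def recover_merges(ranks, special_tokens=frozenset()):
    """Recover (left, right) merges; special tokens are skipped (assumed fix)."""
    merges = []
    for token, rank in sorted(ranks.items(), key=lambda kv: kv[1]):
        if len(token) == 1 or token in special_tokens:
            continue
        parts = split_into_parts(ranks, token, max_rank=rank)
        assert len(parts) == 2, parts  # the assertion that failed above
        merges.append(tuple(parts))
    return merges


# Toy rank table (made up for illustration): five bytes, one merge, one special token.
toy_ranks = {b"<": 0, b"s": 1, b">": 2, b"a": 3, b"b": 4, b"ab": 5, b"<s>": 6}

print(split_into_parts(toy_ranks, b"<s>", max_rank=6))
print(recover_merges(toy_ranks, special_tokens={b"<s>"}))
```

Calling recover_merges(toy_ranks) without the special-token set trips the same assertion, since b"<s>" cannot be reduced to two parts.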

P.S: This is unrelated, but is there an easy way to create a "block-list" of merges that I don't want to appear in the final vocab? For example, if I think a particular token is a glitch that is overrepresented in the tokenization dataset, but not representative of real text?


gautierdag · 2024-08-29T16:24:42Z

You're totally right, special tokens should have been skipped when building the merges. I've added a test to catch this and pushed the fix in version 0.1.4.

> P.S: This is unrelated, but is there an easy way to create a "block-list" of merges that I don't want to appear in the final vocab? For example, if I think a particular token is a glitch that is overrepresented in the tokenization dataset, but not representative of real text?

That's a good question. There is no way to do this right now, since this library is very bare-bones. It would be possible if you wanted to fork the repo and mess with the Rust code. It probably wouldn't be too hard to hack (probably around L300 in lib.rs) - it could be just a check to see if a proposed new token is in a blocklist and, if it is, skip it.
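The blocklist check described above can be sketched in Python with a toy trainer. This is an illustration only, not the library's Rust implementation: train_bpe, merge_pair, and the blocklist parameter are hypothetical names, and the trainer is deliberately naive. The only point is that a pair whose merged bytes are blocklisted is never counted as a merge candidate.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with its concatenation."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges, blocklist=frozenset()):
    """Naive byte-level BPE training that never creates a blocklisted token."""
    corpus = Counter(tuple(bytes([b]) for b in w.encode("utf-8")) for w in words)
    merges = []
    while len(merges) < num_merges:
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for left, right in zip(symbols, symbols[1:]):
                # The blocklist check: skip pairs whose merge is unwanted.
                if left + right not in blocklist:
                    pair_counts[(left, right)] += freq
        if not pair_counts:
            break  # nothing left that is allowed to merge
        # Most frequent pair wins; ties are broken by byte order for determinism.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


words = ["ab"] * 5 + ["cd"] * 2
print(train_bpe(words, num_merges=1))
print(train_bpe(words, num_merges=1, blocklist={b"ab"}))
```

In this toy corpus, "ab" is the most frequent pair, so it is merged first unless it is blocklisted, in which case the trainer falls through to "cd". One design caveat: a blocklisted token's pieces can still take part in other merges, so longer tokens containing the glitch may need to be listed as well.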

Though maybe an even simpler solution could just be to clean your data 😅

JonasGeiping · 2024-08-29T19:25:54Z

Thanks!

Regarding the blocklist, I'll have to think about which is easier to do: messing with the implementation or cleaning the data ...

gautierdag · 2024-08-30T10:05:03Z

Also should add that I'd be happy to review a PR if you wanted to merge back such a feature into main 🙂

JonasGeiping closed this as completed Aug 29, 2024
