@dxoigmn - Based on the `bad_token_ids` implementation, it currently identifies tokens as "bad" if they are non-printable or non-ASCII. These bad tokens are then banned from the optimization by passing them as the `ignored_values: Tensor` argument, i.e. `ignored_values=tokenizer.bad_token_ids`.
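For reference, here is a minimal sketch of that static policy as I understand it; the helper name and loop below are my paraphrase, not the actual code in `src/llmart/tokenizer.py`:

```python
import torch
from transformers import AutoTokenizer

def static_bad_token_ids(tokenizer) -> list[int]:
    """Paraphrase of the current policy: a token is 'bad' when its
    decoded text is non-printable or contains non-ASCII characters."""
    bad = []
    for token_id in range(len(tokenizer)):
        text = tokenizer.decode([token_id])
        if not text.isprintable() or not text.isascii():
            bad.append(token_id)
    return bad

tokenizer = AutoTokenizer.from_pretrained("gpt2")
# Banned from the adversarial optimization via ignored_values.
ignored_values = torch.tensor(static_bad_token_ids(tokenizer))
```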
Based on the above understanding, I have a couple of questions:
Would you prefer a configurable policy that allows end users to define what constitutes a bad token?
How would end users configure or customize this policy to identify bad tokens for non-ASCII languages? Would this be done via CLI arguments specifying a set of allowed non-ASCII characters? (One possible shape is sketched below.)
I would appreciate more clarification on this issue.
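To make the second question concrete, one hypothetical shape for such a policy would be a user-supplied per-character predicate; every name below is illustrative and not part of the existing LLMart API:

```python
from collections.abc import Callable
from transformers import AutoTokenizer

def configurable_bad_token_ids(
    tokenizer, is_allowed_char: Callable[[str], bool]
) -> list[int]:
    # A token is "bad" unless its decoded text is printable and every
    # character passes the user-supplied predicate.
    bad = []
    for token_id in range(len(tokenizer)):
        text = tokenizer.decode([token_id])
        if not text.isprintable() or not all(is_allowed_char(c) for c in text):
            bad.append(token_id)
    return bad

def allow_ascii_or_cyrillic(c: str) -> bool:
    # Example policy: keep printable ASCII plus the Cyrillic block
    # (U+0400..U+04FF), e.g. for optimizing over Russian text.
    return c.isascii() or "\u0400" <= c <= "\u04FF"

tokenizer = AutoTokenizer.from_pretrained("gpt2")
bad_ids = configurable_bad_token_ids(tokenizer, allow_ascii_or_cyrillic)
```

A predicate like this could then be selected or parameterized from the CLI, which is what the question above is getting at.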
`llmart` has the capability of banning "bad" tokens from the adversarial optimization. Right now `bad_token_ids` implements a static policy for what is considered a "bad" token (non-printability, ASCII-only): see `LLMart/src/llmart/tokenizer.py`, lines 428 to 444 at `c7bbef3`.

Being able to add configurable policies would help with non-ASCII languages. Additionally, being able to ban an explicit set of tokens would also be beneficial.
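As a sketch of the second suggestion, an explicit ban list could be unioned with whatever the policy produces. The helper below is hypothetical, not existing LLMart code; it relies only on standard Hugging Face tokenizer methods:

```python
from transformers import AutoTokenizer

def banned_token_ids(tokenizer, banned_strings=(), banned_ids=()) -> list[int]:
    """Hypothetical helper: union an explicit user-supplied ban list of
    strings and/or raw token IDs into a single sorted ID list."""
    banned = set(banned_ids)
    for s in banned_strings:
        # encode() catches strings that map to one or more sub-tokens,
        # so multi-token phrases are banned in full.
        banned.update(tokenizer.encode(s, add_special_tokens=False))
    return sorted(banned)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
extra = banned_token_ids(
    tokenizer, banned_strings=["sudo", "rm -rf"], banned_ids=[50256]
)
```

Accepting both strings and raw IDs keeps the interface usable whether the user thinks in terms of surface text or of vocabulary entries; the result could then be merged with the policy-derived `bad_token_ids` before being handed to `ignored_values`.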