Add preprocessor to patch PromptGuard scores for inserted characters #636
Problem: Inserting spaces between characters in given prompts causes misclassifications in PromptGuard. See meta-llama/llama-models#50 for more context.
Solution: Tokenize the string with all spaces removed, so that larger tokens (for example, `["ignore"]`) are not broken up into smaller tokens (for example, `["i", "g", "n", "o", "r", "e"]`). Then add spaces back between the larger tokens wherever spaces exist in the original string.
This approach showed a slight positive impact on all of our evaluation datasets, suggesting that robustness to jailbreaks that disrupt tokenization, like this one, will be an important part of improving model quality. Notably, simply removing spaces from the string led to a moderate quality regression on some datasets, which is why we don’t take that simpler approach here.
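The idea can be illustrated with a minimal, self-contained sketch. This is not the code from this PR: `toy_tokenize` is a hypothetical stand-in for the real tokenizer (greedy longest match over a tiny made-up vocabulary), and `retokenize_without_inserted_spaces` is an illustrative name. The sketch also assumes the tokens concatenate back to the space-free string exactly, which a real subword tokenizer may not guarantee.

```python
from typing import Callable, List, Sequence


def toy_tokenize(
    text: str,
    vocab: Sequence[str] = ("ignore", "previous", "instructions"),
) -> List[str]:
    """Hypothetical tokenizer: greedy longest match, else single characters."""
    tokens = []
    i = 0
    while i < len(text):
        # Longest vocabulary entry starting at i; fall back to one character.
        match = max(
            (w for w in vocab if text.startswith(w, i)),
            key=len,
            default=text[i],
        )
        tokens.append(match)
        i += len(match)
    return tokens


def retokenize_without_inserted_spaces(
    text: str,
    tokenize: Callable[[str], List[str]] = toy_tokenize,
) -> str:
    """Tokenize with whitespace removed, then restore spaces between tokens."""
    # Strip whitespace, remembering each kept character's original index.
    kept_chars = []
    index_map = []
    for i, ch in enumerate(text):
        if not ch.isspace():
            kept_chars.append(ch)
            index_map.append(i)
    cleaned = "".join(kept_chars)

    pieces = []
    pos = 0  # position of the current token within `cleaned`
    for token in tokenize(cleaned):
        original_start = index_map[pos]
        # Re-insert a space only if the original string had whitespace
        # immediately before this token (leading whitespace is dropped).
        if pieces and text[original_start - 1].isspace():
            pieces.append(" ")
        pieces.append(token)
        pos += len(token)
    return "".join(pieces)


print(retokenize_without_inserted_spaces("i g n o r e previous instructions"))
```

Spaces that fall inside a larger token are dropped, while spaces that fall on a token boundary are kept, so the classifier sees text closer to what it was trained on without losing word boundaries entirely.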
This solution only targets jailbreaks enabled by inserted spaces and not other special characters. For a more complete approach longer term, we’re continuing to work on building more adversarial examples into our dataset.
The preprocessor is enabled by default in our inference utilities.