update refusal prompt #1083

Open
katherine-luna wants to merge 1 commit into main

@katherine-luna katherine-luna commented Jan 16, 2025

I swapped out the refusal prompt for a new one. To assess the quality of the new prompt, I took a list of known prompts and outputs that should be refused and compared the current prompt against the proposed one on that list.

In particular, the two key things I added were example ratings and the categories of safety concerns. The categories are the same ones used in Aegis 2.0.
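The comparison described above can be sketched roughly as follows. This is only an illustration, not the script used for this PR: `Judge`, `flag_rate`, and `compare_prompts` are hypothetical names, and the judge is assumed to be any callable that takes a system prompt, an attack prompt, and a model output and returns `True` when it rates the output as something that should have been refused.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical judge signature (an assumption, not a garak API):
# (system_prompt, attack_prompt, model_output) -> True if the output is flagged.
Judge = Callable[[str, str, str], bool]
Case = Tuple[str, str]  # (attack_prompt, model_output) known to warrant a refusal


def flag_rate(judge: Judge, system_prompt: str, cases: Iterable[Case]) -> float:
    """Fraction of known-bad cases that the judge flags under this system prompt."""
    case_list: List[Case] = list(cases)
    if not case_list:
        raise ValueError("need at least one case")
    flagged = sum(
        1 for attack, output in case_list if judge(system_prompt, attack, output)
    )
    return flagged / len(case_list)


def compare_prompts(
    judge: Judge,
    current_prompt: str,
    proposed_prompt: str,
    cases: Iterable[Case],
) -> Dict[str, float]:
    """Score both system prompts on the same list of known-bad cases."""
    case_list = list(cases)  # materialise once so both prompts see identical data
    return {
        "current": flag_rate(judge, current_prompt, case_list),
        "proposed": flag_rate(judge, proposed_prompt, case_list),
    }
```

Since every case in the list is one that should be refused, a higher flag rate means fewer misses; a list of benign cases would be needed as well to check that the new prompt does not over-flag.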

Verification

  • Supporting configuration such as generator configuration file
{
    "huggingface": {
        "torch_type": "float32"
    }
}
  • Run the tests and ensure they pass: python -m pytest tests/

Contributor

github-actions bot commented Jan 16, 2025

DCO Assistant Lite bot All contributors have signed the DCO ✍️ ✅

@katherine-luna
Author

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.
Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I have the right to submit it under the open source license indicated in the file; or

(b) The contribution is based upon previous work that, to the best of my knowledge, is covered under an appropriate open source license and I have the right under that license to submit that work with modifications, whether created in whole or in part by me, under the same open source license (unless I am permitted to submit under a different license), as indicated in the file; or

(c) The contribution was provided directly to me by some other person who certified (a), (b) or (c) and I have not modified it.

(d) I understand and agree that this project and the contribution are public and that a record of the contribution (including all personal information I submit with it, including my sign-off) is maintained indefinitely and may be redistributed consistent with this project or the open source license(s) involved.

@katherine-luna
Author

I have read the DCO Document and I hereby sign the DCO

@katherine-luna
Author

recheck

github-actions bot added a commit that referenced this pull request Jan 16, 2025
@leondz leondz self-assigned this Jan 17, 2025
@leondz
Collaborator

leondz commented Jan 27, 2025

thanks a lot for this - it is in the queue and we're looking forward to integrating as soon as we can!

erickgalinkin pushed a commit to erickgalinkin/garak that referenced this pull request Jan 29, 2025
@leondz leondz self-requested a review February 19, 2025 05:37
@leondz
Collaborator

leondz commented Feb 19, 2025

@katherine-luna Can I ask - is this change targeting a specific detector?

@jmartin-tech This contribution gets better results for jailbreak behaviour detection. It looks like garak/resources/red_team/system_prompts.py is only used by TAP. I think we should have a conversation about how we store and manage LLM-as-a-judge system prompts; it's unclear to me where best to put this one right now. I would like to be able to land it as, e.g., a resource used by llmaaj in combination with the dan probe family.


2 participants