Replies: 1 comment
I really like the idea of detector testing and evaluation. Detector testing was one of the reasons behind #833. I have been unhappy with the mitigation detector and some of the phrases used there, which seem to allow false negatives too easily. However, to measure whether any changes are actually improvements, having deterministic LLM outputs to evaluate is really helpful. Speed and cost are also reasons not to rerun everything through an LLM.

I have started work on a "detector only" run that doesn't involve as much interaction with the underlying plumbing: a probe and corresponding generator called Flashback. They take a configurable report prefix and grab the attempts from the report files. I needed the probe for something else I'm working on, so that part works but needs tests and cleanup. The generator still needs to handle the many-to-one challenge of multiple generations per probe attempt, which I haven't started thinking about yet.

If any of this seems helpful, let me know.
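For reference, the report-reading half of that idea might look roughly like the sketch below. This is not the actual Flashback code; the `*.report.jsonl` filename pattern and the `entry_type`/`prompt`/`outputs` field names are assumptions about garak's JSONL report format and may need adjusting.

```python
# Rough sketch (not the Flashback implementation): pull attempts back
# out of garak report files matching a configurable report prefix.
# Assumes JSONL reports where each line is a JSON record and attempt
# records carry entry_type == "attempt" plus prompt/outputs fields.
import glob
import json


def load_attempts(report_prefix: str):
    """Yield (prompt, outputs) pairs from all reports matching the prefix."""
    for path in glob.glob(report_prefix + "*.report.jsonl"):
        with open(path, encoding="utf-8") as report:
            for line in report:
                record = json.loads(line)
                if record.get("entry_type") == "attempt":
                    yield record.get("prompt"), record.get("outputs", [])
```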
Garak probes LLMs to see if they can be made to fail.
Detecting those failures isn't always easy.
For example, some forms of hate speech may be missed by the classifier; or, a garak detector might falsely report that a prompt leads to a mitigation message when in fact the prompt worked.
We need a tool to measure how well garak detectors are working.
The tool should test each garak detector against a set of positive and negative examples: positive examples are things an LLM might return that indicate a security failure (i.e. probe success), and negative examples indicate nominal operation (i.e. probe failure). Based on how the detector performs on these, the quality testing tool should report measures such as accuracy and F-score.
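As a rough illustration, a harness along these lines could score a detector against labelled outputs and report those measures. The `score_fn` callable, the 0.5 threshold, and the toy examples below are placeholders; hooking up an actual garak detector (building Attempts and calling its `detect()` method inside `score_fn`) would depend on garak's internals and is left out here.

```python
# Minimal sketch of a detector quality harness (not part of garak):
# given a scoring function and labelled example outputs, report
# accuracy, precision, recall and F1.
from typing import Callable, Iterable, Tuple


def evaluate_detector(
    score_fn: Callable[[str], float],
    examples: Iterable[Tuple[str, bool]],  # (LLM output, is_failure) pairs
    threshold: float = 0.5,
) -> dict:
    tp = fp = tn = fn = 0
    for text, is_failure in examples:
        hit = score_fn(text) >= threshold  # detector flags a failure
        if hit and is_failure:
            tp += 1
        elif hit and not is_failure:
            fp += 1
        elif not hit and is_failure:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / max(tp + tn + fp + fn, 1)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy stand-in for a real detector: flags outputs that comply with the attack.
    examples = [
        ("I cannot help with that request.", False),          # mitigation: nominal operation
        ("Sure, here is how to build the exploit...", True),  # failure: probe success
    ]
    print(evaluate_detector(lambda t: 1.0 if "Sure, here" in t else 0.0, examples))
```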
Data for these positive and negative examples of LLM responses can come from:
The goals of this tool are: