Replies: 1 comment
I really like the idea of detector testing and evaluation. Detector testing was one of the reasons behind #833. I have been unhappy with the mitigation detector and some of the phrases used there, which seem to allow false negatives too easily. However, to measure whether any changes are actually improvements, having deterministic LLM outputs to evaluate is really helpful. Speed and cost are also reasons not to rerun everything through an LLM.

I have started work on a "detector only" run that doesn't involve as much interaction with the underlying plumbing: a probe and corresponding generator called Flashback. They take a configurable report prefix and grab the attempts from the report files. I needed the probe for something else I'm working on, so that part works but needs tests and cleanup. The generator still needs to handle the many-to-one challenge of multiple generations per probe attempt, which I haven't started thinking about yet.

If any of this seems helpful, let me know.
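For reference, the report-reading half of that idea might look roughly like the sketch below. This is not the actual Flashback code; the `*.report.jsonl` filename pattern and the `entry_type`/`prompt`/`outputs` field names are assumptions about garak's JSONL report format and may need adjusting.

```python
# Rough sketch (not the Flashback implementation): pull attempts back
# out of garak report files matching a configurable report prefix.
# Assumes JSONL reports where each line is a JSON record and attempt
# records carry entry_type == "attempt" plus prompt/outputs fields.
import glob
import json


def load_attempts(report_prefix: str):
    """Yield (prompt, outputs) pairs from all reports matching the prefix."""
    for path in glob.glob(report_prefix + "*.report.jsonl"):
        with open(path, encoding="utf-8") as report:
            for line in report:
                record = json.loads(line)
                if record.get("entry_type") == "attempt":
                    yield record.get("prompt"), record.get("outputs", [])
```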
Garak probes LLMs to see if they can be made to fail.
Detecting those failures isn't always easy.
For example, some forms of hate speech may be missed by the classifier; or, a garak detector might falsely report that a prompt leads to a mitigation message when in fact the prompt worked.
We need a tool to measure how well garak detectors are working.
The tool should test each garak detector against a set of positive and negative examples: positive examples are things an LLM might return that indicate a security failure (i.e. probe success), and negative examples indicate nominal operation (i.e. probe failure). Based on how the detector performs on these, the quality testing tool should report measures such as accuracy and F-score.
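As a rough illustration, a harness along these lines could score a detector against labelled outputs and report those measures. The `score_fn` callable, the 0.5 threshold, and the toy examples below are placeholders; hooking up an actual garak detector (building Attempts and calling its `detect()` method inside `score_fn`) would depend on garak's internals and is left out here.

```python
# Minimal sketch of a detector quality harness (not part of garak):
# given a scoring function and labelled example outputs, report
# accuracy, precision, recall and F1.
from typing import Callable, Iterable, Tuple


def evaluate_detector(
    score_fn: Callable[[str], float],
    examples: Iterable[Tuple[str, bool]],  # (LLM output, is_failure) pairs
    threshold: float = 0.5,
) -> dict:
    tp = fp = tn = fn = 0
    for text, is_failure in examples:
        hit = score_fn(text) >= threshold  # detector flags a failure
        if hit and is_failure:
            tp += 1
        elif hit and not is_failure:
            fp += 1
        elif not hit and is_failure:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / max(tp + tn + fp + fn, 1)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy stand-in for a real detector: flags outputs that comply with the attack.
    examples = [
        ("I cannot help with that request.", False),          # mitigation: nominal operation
        ("Sure, here is how to build the exploit...", True),  # failure: probe success
    ]
    print(evaluate_detector(lambda t: 1.0 if "Sure, here" in t else 0.0, examples))
```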
Data for these positive and negative examples of LLM responses can come from:
The goals of this tool are: