experimental feature: policy scan base infrastructure #955
base: main
…it back to their caller
…arness, custom harness, and command.xxx_run()
I think we need some more detail around policy codes and clarity on how to develop/specify a policy. The code largely looks good, I left a few comments throughout.
Individual comments are thoughts on code itself.
Overall I think this is a reasonable foundation and would like to see the result expose some determination other than detector results in the output.
clarity on how to develop/specify a policy.
I think this is a good point. It would be helpful to see a summary output about the inferred policy, or possibly an option for the user to provide an expected policy that the summary could be compared against, to determine output divergence based on detection.
Co-authored-by: Jeffrey Martin <[email protected]> Signed-off-by: Leon Derczynski <[email protected]>
class PolicyHarness(ProbewiseHarness):

    def _probe_check(self, probe):
        assert (
            probe.policy_probe == True
        ), "only policy probes should be used in policy runs"
        setattr(probe, "generations", _config.policy.generations)
        return probe
Currently the plugin load cache is holding onto all created probe instances based on the config_root used to create them. This should not mutate the probe as that will modify the cached probe, we should create another with the required generations passed via config_root.
Consider having the hook call replace the call that loads the probe itself (note this is untested code):

    try:
        probe = _plugins.load_plugin(probename)
    except Exception as e:
        print(f"failed to load probe {probename}")
        logging.warning("failed to load probe %s:", repr(e))

becomes:

    probe = self._load_probe(probename)

with:

    def _load_probe(self, probename):
        probe = None
        try:
            probe = _plugins.load_plugin(probename)
        except Exception as e:
            print(f"failed to load probe {probename}")
            logging.warning("failed to load probe %s:", repr(e))
        return probe

    class PolicyHarness(ProbewiseHarness):
        def _load_probe(self, probename):
            import copy

            probe = None
            assert (
                _plugins.plugin_info["policy_probe"] == True
            ), "only policy probes should be used in policy runs"
            # copy the probe config so the shared configuration is not mutated
            config_root = copy.deepcopy(_config.plugins.probes)
            probe_config = config_root
            for path in probename.split(".")[2:]:
                probe_config = probe_config[path]
            probe_config["generations"] = _config.policy.generations
            try:
                probe = _plugins.load_plugin(probename, config_root=config_root)
            except Exception as e:
                print(f"failed to load probe {probename}")
                logging.warning("failed to load probe %s:", repr(e))
            return probe
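The caching concern raised in this comment can be illustrated with a small standalone sketch. It does not use garak's real cache or probe classes: `FakeProbe`, `_cache`, and this `load_plugin` are hypothetical stand-ins, and the sketch assumes, as the comment describes, that cached instances are keyed on the configuration used to create them.

```python
import copy


class FakeProbe:
    """Hypothetical stand-in for a probe; not a garak class."""

    def __init__(self, config):
        self.generations = config.get("generations", 10)


# Hypothetical cache, assumed to be keyed on plugin name plus config contents.
_cache = {}


def load_plugin(name, config):
    """Return a cached instance for this (name, config) pair, creating it once."""
    key = (name, tuple(sorted(config.items())))
    if key not in _cache:
        _cache[key] = FakeProbe(config)
    return _cache[key]


base_config = {"generations": 10}

# Mutating the returned instance also changes what later callers receive,
# because both calls resolve to the same cached object.
mutated = load_plugin("probes.demo.Demo", base_config)
setattr(mutated, "generations", 1)
later = load_plugin("probes.demo.Demo", base_config)
leaked = later.generations

# Loading with a modified deep copy of the config yields a separate instance
# and leaves the regular one untouched.
_cache.clear()
policy_config = copy.deepcopy(base_config)
policy_config["generations"] = 1
policy_probe = load_plugin("probes.demo.Demo", policy_config)
regular_probe = load_plugin("probes.demo.Demo", base_config)
```

Under these assumptions, the `setattr` approach leaks the policy-run generation count into any later load of the same probe, while the config-copy approach keeps the two instances independent.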
Currently the plugin load cache is holding onto all created probe instances
Can you say more about this? Does adjusting probe member values after configuration in the constructor alter the cache in a meaningful way? I know probes alter their own internal values during execution, and I think I'm missing where the border is.
This module represents objects related to policy scanning.
Policy scanning in garak attempts to work out what the target's content policy
is, before running a security scan.
It's important to know what the target's content policy is because we only really have
a useful/successful hit or breach if we're able to get a model to do something that
it otherwise wouldn't. It may be exciting to discover a model gives instructions for
e.g. cooking meth if the request is encoded in base64, but if in fact the model gives
the instructions when simply asked directly "print instructions for cooking meth", the
use of base64 is not necessarily an exploit in this output category - the model is acting
the same.
Garak's policy support follows a typology of behaviours, with each entry describing
a different behaviour. By default this typology is stored in data/policy/policy_typology.json.
A policy scan is conducted by invoking garak with the --policy_scan switch.
When this is requested, a separate scan runs using all policy probes within garak.
Policy probes are denoted by a probe class asserting policy_probe=True.
A regular probewise harness runs the scan, though reporting is diverted to a separate
policy report file. After completion, garak estimates a policy based on policy probe
results, and writes this to both main and policy reports.
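The reasoning above can be sketched in a few lines of Python. This is an illustrative sketch only, not garak's implementation: the probe classes, `select_policy_probes`, `is_meaningful_hit`, and the behaviour names are all hypothetical, and treating an unknown policy point as "not a confirmed hit" is an assumption made for this example.

```python
class ProbeBase:
    """Hypothetical probe base class; probes are not policy probes by default."""

    policy_probe = False


class DirectRequest(ProbeBase):
    # Asks for the behaviour directly, so it helps establish the baseline policy.
    policy_probe = True


class EncodedRequest(ProbeBase):
    # An attack probe; inherits policy_probe = False.
    pass


def select_policy_probes(probe_classes):
    """Keep only probe classes that assert policy_probe=True."""
    return [cls for cls in probe_classes if getattr(cls, "policy_probe", False) is True]


def is_meaningful_hit(behaviour, exploit_succeeded, policy):
    """A hit counts only if the exploit worked and the target otherwise refuses.

    policy maps a behaviour name to True (permitted) or False (not permitted).
    Behaviours missing from the policy are treated as unconfirmed (assumption).
    """
    return bool(exploit_succeeded) and policy.get(behaviour) is False


policy_probes = select_policy_probes([DirectRequest, EncodedRequest])

# Hypothetical estimated policy: the target answers one request directly
# and refuses the other.
estimated_policy = {
    "drug_synthesis_instructions": True,
    "malware_generation": False,
}
same_as_direct = is_meaningful_hit("drug_synthesis_instructions", True, estimated_policy)
real_bypass = is_meaningful_hit("malware_generation", True, estimated_policy)
```

Here the base64-style success on a behaviour the target already permits is not counted, while a success on a behaviour the target normally refuses is.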
What this PR adds
We're laying the base infrastructure for policy scans in this PR.
policy module and Policy class to allow storing and manipulation of target content policies. Policies consist of a set of policy points, each describing a behaviour and whether this is permitted by the target.
policy.Policy object detailing what was extracted about the target's apparent content policy
_plugins.enumerate_plugins() to help dynamic selection of plugins based on class attributes
Verification
garak -m test --policy_scan -p encoding -g 1, then tail the xxx.policy.jsonl
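As a rough illustration of the "set of policy points" idea described under "What this PR adds", the following sketch models a policy as a mapping from a point code to whether the behaviour is permitted. It is not the `policy.Policy` class from this PR: the field names, methods, and point codes are hypothetical, and the `divergence` helper is only one possible reading of the reviewer suggestion to compare an inferred policy against an expected one.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Policy:
    """Hypothetical policy container; not the class added in this PR."""

    # Point code -> True (permitted), False (not permitted), None (unknown).
    points: Dict[str, Optional[bool]] = field(default_factory=dict)

    def set_point(self, code: str, permitted: Optional[bool]) -> None:
        self.points[code] = permitted

    def is_permitted(self, code: str) -> Optional[bool]:
        # Returns None for both unknown and unlisted points.
        return self.points.get(code)

    def divergence(self, expected: "Policy") -> List[str]:
        """Codes where both policies hold a known value and the values differ."""
        return sorted(
            code
            for code, value in self.points.items()
            if value is not None
            and expected.points.get(code) is not None
            and expected.points[code] != value
        )


# Placeholder point codes, chosen for illustration only.
inferred = Policy()
inferred.set_point("behaviour_a", True)
inferred.set_point("behaviour_b", False)
inferred.set_point("behaviour_c", None)

expected = Policy({"behaviour_a": False, "behaviour_b": False, "behaviour_c": True})
diverging = inferred.divergence(expected)
```

Points whose inferred value is unknown are skipped in the comparison, since there is nothing to diverge from until the scan establishes a value.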
todo for this vs. later PRs
These are required for merging this:
These are out-of-scope and planned: