Streamline translation use case
* refactor for encapsulation
* set default config entries for early load support
  * adds docs for new entries
  * initializes `lang_spec` to `en`
  * initializes `translators` to an empty list
* revised test config formats
* only maintain prompts in one language
* limit to only translated prompts
* treat "$" as a special line bypassing translation attempt
* NLI detector should support None response
* check translation required carefully
  * strings that do not contain "words" should bypass translation
  * tests of translation configuration require `lang_spec`
  * add remote translator specific class to tests
* clarify base model prefix as opus-mt-*
* detectors need translators in config
* always return a translator even if just target to target
  * always have a translator
  * only attempt to translate output that is not None
* force garbage collection after translator tests
* validate probe trigger type during translation
* translation needs lists of strings
  * support for nested lists is added for existing probes content
* remove direct _config access in plugins
  * remove access to _config.run from `probe` classes
  * adjust goodside translations to not retain original prompts
* refactor probe translation tests for unit testing
  * In the interest of reasonable execution time test probe
    call translation instead of executing translation.
  * probe translation tests as unit testing only
* Translation actions are tested with their own tests.
* remove side-effects for internal translation methods
* latentinjection init adjustment
  * Remove extra call for translator
  * Ensure `_build_prompts_triggers` is called only once during init
    for all implemented classes.
* bugfix - goodside instance instead of class attributes
* remote test case corrections
* extract translator base config restrictions
  * ENV var needs are handled by `remote` module
  * adjust docs for each class
  * match extending class method signature
* consolidate nltk overrides in resources.api
* remove no longer used "only_translate_word"
* remove lang_list references, support a single target language
* use pythonic code-style, adjust inline comments
* refactor report file to rely on global fixture
* rename base class to `Translator`
  * rename `SimpleTranslator` to `Translator`
  * source and target language determined via translator held values
* update translation configuration docs

Signed-off-by: Jeffrey Martin <[email protected]>
jmartin-tech committed Feb 13, 2025
1 parent e53e7d2 commit e5a08c7
Showing 35 changed files with 1,456 additions and 1,120 deletions.
2 changes: 2 additions & 0 deletions docs/source/configurable.rst
@@ -103,6 +103,8 @@ such as ``show_100_pass_modules``.
* ``seed`` - An optional random seed
* ``eval_threshold`` - At what point in the 0..1 range output by detectors does a result count as a successful attack / hit
* ``user_agent`` - What HTTP user agent string should garak use? ``{version}`` can be used to signify where garak version ID should go
* ``lang_spec`` - A single bcp47 value the target application for LLM accepts as prompt and output
* ``translators`` - A list of configurations representing translators for converting from probe bcp47 language to the ``lang_spec`` target bcp47 language

``plugins`` config items
""""""""""""""""""""""""
166 changes: 93 additions & 73 deletions docs/source/translator.rst
@@ -1,58 +1,53 @@
The `translator.py` module in the Garak framework is designed to handle text translation tasks using various translation services and models.
It provides several classes, each implementing different translation strategies and models, including both cloud-based services like DeepL and NIM, and local models like m2m100 from Hugging Face.
The ``translator.py`` module in the Garak framework is designed to handle text translation tasks using various translation services and models.
It provides several classes, each implementing different translation strategies and models, including both cloud-based services,
like `DeepL <https://www.deepl.com/>`_ and `NVIDIA Riva <https://build.nvidia.com/nvidia/megatron-1b-nmt>`_, and local models like facebook/m2m100 available on `Hugging Face <https://huggingface.co/>`_.

garak.translator
=============
================

.. automodule:: garak.translator
:members:
:undoc-members:
:show-inheritance:

Multilingual support
====================
Translation support
===================

This feature adds multilingual probes and detector keywords and triggers.
You can check the model vulnerability for multilingual languages.
This module adds translation support for probe and detector keywords and triggers.
This allows testing of models that accept and produce text in languages other than the language the plugin was written for.

* limitation:
- This function only supports for `bcp47` code is "en".
- Reverse translation using for Huggingface detector model and snowball probes.
- Huggingface detector only supports English. You need to bring the target language NLI model for the detector.
- If you fail to load probes or detectors, you need to choose a smaller translation model.
* limitations:
- This functionality is strongly coupled to ``bcp47`` code "en" for sentence detection and structure at this time.
- Reverse translation is required for snowball probes and Huggingface detectors due to model load formats.
- Huggingface detectors primarily load English models, so a target language NLI model is required for the detector.
- If probes or detectors fail to load, you may need to choose a smaller local translation model or utilize a remote service.
- Translation may add significant execution time to the run depending on resources available.

pre-requirements
----------------

.. code-block:: bash
pip install nvidia-riva-client==2.16.0
Support translation service
---------------------------
Supported translation services
------------------------------

- Huggingface
- This code uses the following translation models:
- `Helsinki-NLP/opus-mt-en-{lang} <https://huggingface.co/docs/transformers/model_doc/marian>`_
- This project supports usage of the following translation models:
- `Helsinki-NLP/opus-mt-{<source_lang>-<target_lang>} <https://huggingface.co/docs/transformers/model_doc/marian>`_
- `facebook/m2m100_418M <https://huggingface.co/facebook/m2m100_418M>`_
- `facebook/m2m100_1.2B <https://huggingface.co/facebook/m2m100_1.2B>`_
- `DeepL <https://www.deepl.com/docs-api>`_
- `NIM <https://build.nvidia.com/nvidia/megatron-1b-nmt>`_
- `NVIDIA Riva <https://build.nvidia.com/nvidia/megatron-1b-nmt>`_

API KEY
-------
API KEY Requirements
--------------------

You can use DeepL API or NIM API to translate probe and detector keywords and triggers.
To use the DeepL API or Riva API to translate probe and detector keywords and triggers from cloud services, an API key must be supplied.

You need an API key for the preferred service.
API keys for the preferred service can be obtained at the following locations:
- `DeepL <https://www.deepl.com/en/pro-api>`_
- `NIM <https://build.nvidia.com/nvidia/megatron-1b-nmt>`_
- `Riva <https://build.nvidia.com/nvidia/megatron-1b-nmt>`_

Supported languages:
Supported languages for remote services:
- `DeepL <https://developers.deepl.com/docs/resources/supported-languages>`_
- `NIM <https://build.nvidia.com/nvidia/megatron-1b-nmt/modelcard>`_
- `Riva <https://docs.nvidia.com/nim/riva/nmt/latest/getting-started.html#supported-languages>`_

Set up the API key with the following command:
API keys can be stored in environment variables with the following commands:

DeepL
~~~~~
@@ -61,52 +56,71 @@ DeepL
export DEEPL_API_KEY=xxxx
NIM
RIVA
~~~

.. code-block:: bash
export NIM_API_KEY=xxxx
export RIVA_API_KEY=xxxx
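
A minimal sketch, in Python, of how a wrapper script might confirm that one of these variables is set before a run starts. The helper name ``require_api_key`` is hypothetical and is not part of garak; only the variable names ``DEEPL_API_KEY`` and ``RIVA_API_KEY`` come from the text above.

```python
import os


def require_api_key(env_var: str) -> str:
    """Return the value of env_var, or fail early with a readable message.

    Hypothetical helper for illustration; garak's own handling may differ.
    """
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"{env_var} is not set; export it before starting the run")
    return key
```

Calling ``require_api_key("DEEPL_API_KEY")`` at the top of a script surfaces a missing key before any model is loaded, rather than at the first translation request.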
Configuration file
------------------

Translation is configured in the `run` section of a configuration with the following keys:

lang_spec - A single `bcp47` entry designating the language of the target under test. "ja", "fr", "jap" etc.
translators - A list of language pair designated translator configurations.

* Note: The `Helsinki-NLP/opus-mt-{source}-{target}` case uses different language formats. The language codes used to name models are inconsistent.
Two-letter codes can usually be found `here <https://developers.google.com/admin-sdk/directory/v1/languages>`_, while three-letter codes require
a search such as "language code {code}". More details can be found `here <https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models>`_.

config file
-----------
A translator configuration is provided using the project's configurable pattern with the following required keys:

You can pass the translation service, source language, and target language by the argument.
* ``language`` - A `-` separated pair of `bcp47` entries describing the translation direction provided by the configuration
* ``model_type`` - the module and optional instance class to be instantiated. local, remote, remote.DeeplTranslator etc.
* ``model_name`` - (optional) the model name loaded for translation, required for ``local`` translator model_type

- translation_service: "nim" or "deepl", "local"
- lang_spec: "ja", "ja,fr" etc. (you can set multiple language codes)
(Optional) Model specific parameters defined by the translator model type may exist.
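
The key rules above can be sketched as a small check in Python. This is an illustration only: ``split_language_pair`` and ``check_translator_entry`` are hypothetical names, not garak functions, and the sketch assumes each language code contains no ``-`` of its own.

```python
from typing import Tuple


def split_language_pair(language: str) -> Tuple[str, str]:
    """Split a 'source-target' pair such as 'en-ja' into its two codes.

    Assumes neither code contains a '-' itself.
    """
    parts = language.split("-")
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"expected '<source>-<target>', got {language!r}")
    return parts[0], parts[1]


def check_translator_entry(entry: dict) -> None:
    """Raise ValueError if a translator entry lacks a required key."""
    for key in ("language", "model_type"):
        if key not in entry:
            raise ValueError(f"translator entry is missing required key {key!r}")
    split_language_pair(entry["language"])
    # model_name is optional in general but required for the `local` module
    module = entry["model_type"].split(".")[0]
    if module == "local" and not entry.get("model_name"):
        raise ValueError("model_name is required when model_type is 'local'")


# Entries shaped like the ones described in this section
check_translator_entry(
    {"language": "en-ja", "model_type": "local", "model_name": "facebook/m2m100_418M"}
)
check_translator_entry({"language": "ja-en", "model_type": "remote.DeeplTranslator"})
```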

* Note: The `Helsinki-NLP/opus-mt-en-{lang}` case uses different language formats. The language codes used to name models are inconsistent. Two-digit codes can usually be found here, while three-digit codes require a search such as “language code {code}". More details can be found `here <https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models>`_.
* Note: local translation support loads a model and is not designed to support crossing the multi-processing boundary.

The translator config writes to a file and the path passed, with
You can also configure this via a config file:
is given in `Translator Config with yaml <translator_with_yaml>`_ below.
The translator configuration can be written to a file and the path passed with the ``--config`` CLI option.

An example template is provided below.

.. code-block:: yaml
run:
translation:
translation_service: {translation service}
api_key: {your API key}
lang_spec: {language code}
model_spec:
lang_spec: {target language code}
translators:
- language: {source language code}-{target language code}
api_key: {your API key}
model_type: {translator module or module.classname}
model_name: {huggingface model name}
- language: {target language code}-{source language code}
api_key: {your API key}
model_type: {translator module or module.classname}
model_name: {huggingface model name}

* Note: each translator is configured for a single translation pair and specification is required in each direction for a run to proceed.
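
The both-directions requirement in the note above can be checked before a run with a short Python sketch. The function ``missing_directions`` is hypothetical, and the default probe language of ``en`` is an assumption made for the example.

```python
from typing import Dict, List


def missing_directions(
    lang_spec: str, translators: List[Dict[str, str]], probe_lang: str = "en"
) -> List[str]:
    """Return the language pairs a run would need that are not configured.

    Assumes probes are written in probe_lang and the target accepts lang_spec.
    """
    if lang_spec == probe_lang:
        # No translation is needed when source and target match
        return []
    configured = {t["language"] for t in translators}
    needed = [f"{probe_lang}-{lang_spec}", f"{lang_spec}-{probe_lang}"]
    return [pair for pair in needed if pair not in configured]


# Only the forward direction is configured, so the reverse pair is reported
print(missing_directions("ja", [{"language": "en-ja", "model_type": "local"}]))
```

An empty list means both directions are covered for the chosen ``lang_spec``.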

Examples for multilingual
-------------------------
Examples for translation configuration
--------------------------------------

DeepL
~~~~~

To use the translation option for garak, run the following command:
To use DeepL translation in garak, run the following command:
You use the following yaml config.

.. code-block:: yaml
run:
translation:
translation_service: deepl
api_key: {your API key}
lang_spec: ja
lang_spec: {target language code}
translators:
- language: {source language code}-{target language code}
model_type: remote.DeeplTranslator
- language: {target language code}-{source language code}
model_type: remote.DeeplTranslator


.. code-block:: bash
@@ -115,24 +129,25 @@ run:
python3 -m garak --model_type nim --model_name meta/llama-3.1-8b-instruct --probes encoding --config {path to your yaml config file}
NIM
~~~
Riva
~~~~

For NIM, run the following command:
For Riva, run the following command:
You use the following yaml config.

.. code-block:: yaml
run:
translation:
translation_service: nim
api_key: {your API key}
lang_spec: ja
- language: {source language code}-{target language code}
model_type: remote
- language: {target language code}-{source language code}
model_type: remote


.. code-block:: bash
export NIM_API_KEY=xxxx
export RIVA_API_KEY=xxxx
python3 -m garak --model_type nim --model_name meta/llama-3.1-8b-instruct --probes encoding --config {path to your yaml config file}
@@ -144,11 +159,14 @@ You use the following yaml config.

.. code-block:: yaml
run:
translation:
translation_service: local
lang_spec: ja
model_spec:
model_name: facebook/m2m100_418M
lang_spec: ja
translators:
- language: en-ja
model_type: local
model_name: facebook/m2m100_418M
- language: ja-en
model_type: local
model_name: facebook/m2m100_418M


.. code-block:: bash
@@ -158,12 +176,14 @@ run:
.. code-block:: yaml
run:
translation:
translation_service: local
lang_spec: jap
model_spec:
model_name: Helsinki-NLP/opus-mt-en-{}

lang_spec: jap
translators:
- language: en-jap
model_type: local
model_name: Helsinki-NLP/opus-mt-{}
- language: jap-en
model_type: local
model_name: Helsinki-NLP/opus-mt-{}

.. code-block:: bash
2 changes: 2 additions & 0 deletions garak/_config.py
@@ -113,6 +113,8 @@ def _nested_dict():

# this is so popular, let's set a default. what other defaults are worth setting? what's the policy?
run.seed = None
run.lang_spec = "en"
run.translators = []

# placeholder
# generator, probe, detector, buff = {}, {}, {}, {}
14 changes: 9 additions & 5 deletions garak/attempt.py
@@ -72,7 +72,7 @@ def __init__(
detector_results=None,
goal=None,
seq=-1,
lang_type=None,
bcp47=None, # language code for prompt as sent to the target
reverse_translator_outputs=None,
) -> None:
self.uuid = uuid.uuid4()
@@ -88,8 +88,10 @@ def __init__(
self.seq = seq
if prompt is not None:
self.prompt = prompt
self.lang_type = lang_type
self.reverse_translator_outputs = {} if reverse_translator_outputs is None else reverse_translator_outputs
self.bcp47 = bcp47
self.reverse_translator_outputs = (
{} if reverse_translator_outputs is None else reverse_translator_outputs
)

def as_dict(self) -> dict:
"""Converts the attempt to a dictionary."""
@@ -107,8 +109,10 @@ def as_dict(self) -> dict:
"notes": self.notes,
"goal": self.goal,
"messages": self.messages,
"lang_type": self.lang_type,
"reverse_translator_outputs": {k: list(v) for k, v in self.reverse_translator_outputs.items()},
"bcp47": self.bcp47,
"reverse_translator_outputs": {
k: list(v) for k, v in self.reverse_translator_outputs.items()
},
}

@property