Streamline translation use case

* refactor for encapsulation
* set default config entries for early load support
* adds docs for new entries
* initializes `lang_spec` to `en`
* initializes `translators` to an empty list
* revised test config formats
* only maintain prompts in one language
* limit to only translated prompts
* treat "$" as a special line bypassing translation attempt
* NLI detector should support None response
* check translation required carefully
* strings that do not contain "words" should bypass translation
* tests of translation configuration require `lang_spec`
* add remote translator specific class to tests
* clarify base model prefix as opus-mt-*
* detectors need translators in config
* always return a translator even if just target to target
* always have a translator
* only attempt to translate output that is not None
* force garbage collection after translator tests
* validate probe trigger type during translation
* translation needs lists of strings
* support for nested lists is added for existing probes content
* remove direct _config access in plugins
* remove access to _config.run from `probe` classes
* adjust goodside translations to not retain original prompts
* refactor probe translation tests for unit testing
* In the interest of reasonable execution time, test probe call translation instead of executing translation.
* probe translation tests as unit testing only
* Translation actions are tested with their own tests.
* remove side-effects for internal translation methods
* latentinjection init adjustment
* Remove extra call for translator
* Ensure `_build_prompts_triggers` is called only once during init for all implemented classes.
* bugfix - goodside instance instead of class attributes
* remote test case corrections
* extract translator base config restrictions
* ENV var needs are handled by `remote` module
* adjust docs for each class
* match extending class method signature
* consolidate nltk overrides in resources.api
* remove no longer used "only_translate_word"
* remove lang_list references, support a single target language
* use pythonic code-style, adjust inline comments
* refactor report file to rely on global fixture
* rename base class to `Translator`
* rename `SimpleTranslator` to `Translator`
* source and target language determined via translator held values
* update translation configuration docs

Signed-off-by: Jeffrey Martin <[email protected]>
SnowMasaya · Feb 13, 2025 · e5a08c7
1 parent e53e7d2
commit e5a08c7
Showing 35 changed files with 1,456 additions and 1,120 deletions.
diff --git a/docs/source/configurable.rst b/docs/source/configurable.rst
@@ -103,6 +103,8 @@ such as ``show_100_pass_modules``.
 * ``seed`` - An optional random seed
 * ``eval_threshold`` - At what point in the 0..1 range output by detectors does a result count as a successful attack / hit
 * ``user_agent`` - What HTTP user agent string should garak use? ``{version}`` can be used to signify where garak version ID should go
+* ``lang_spec`` - A single bcp47 value for the language the target LLM application accepts as prompt and produces as output
+* ``translators`` - A list of configurations representing translators for converting from the probe bcp47 language to the ``lang_spec`` target bcp47 language
 
 ``plugins`` config items
 """"""""""""""""""""""""

diff --git a/docs/source/translator.rst b/docs/source/translator.rst
@@ -1,58 +1,53 @@
-The `translator.py` module in the Garak framework is designed to handle text translation tasks using various translation services and models. 
-It provides several classes, each implementing different translation strategies and models, including both cloud-based services like DeepL and NIM, and local models like m2m100 from Hugging Face.
+The ``translator.py`` module in the Garak framework is designed to handle text translation tasks using various translation services and models. 
+It provides several classes, each implementing different translation strategies and models, including both cloud-based services, 
+like `DeepL <https://www.deepl.com/>`_ and `NVIDIA Riva <https://build.nvidia.com/nvidia/megatron-1b-nmt>`_, and local models like facebook/m2m100 available on `Hugging Face <https://huggingface.co/>`_.
 
 garak.translator
-=============
+================
 
 .. automodule:: garak.translator
    :members:
    :undoc-members:
    :show-inheritance:   
 
-Multilingual support
-====================
+Translation support
+===================
 
-This feature adds multilingual probes and detector keywords and triggers.
-You can check the model vulnerability for multilingual languages.
+This module adds translation support for probe and detector keywords and triggers,
+allowing testing of models that accept and produce text in languages other than the language the plugin was written for.
 
-* limitation:
-  - This function only supports for `bcp47` code is "en".
-  - Reverse translation using for Huggingface detector model and snowball probes.
-  - Huggingface detector only supports English. You need to bring the target language NLI model for the detector.
-  - If you fail to load probes or detectors, you need to choose a smaller translation model.
+* limitations:
+  - This functionality is strongly coupled to ``bcp47`` code "en" for sentence detection and structure at this time.
+  - Reverse translation is required for snowball probes and Huggingface detectors due to model load formats.
+  - Huggingface detectors primarily load English models, so detection in another language requires a target language NLI model for the detector.
+  - If probes or detectors fail to load, you may need to choose a smaller local translation model or utilize a remote service.
+  - Translation may add significant execution time to the run depending on resources available.
 
-pre-requirements
-----------------
-
-.. code-block:: bash
-
-    pip install nvidia-riva-client==2.16.0 
-
-Support translation service
----------------------------
+Supported translation services
+------------------------------
 
 - Huggingface
-  - This code uses the following translation models:
-    - `Helsinki-NLP/opus-mt-en-{lang} <https://huggingface.co/docs/transformers/model_doc/marian>`_
+  - This project supports usage of the following translation models:
+    - `Helsinki-NLP/opus-mt-{<source_lang>-<target_lang>} <https://huggingface.co/docs/transformers/model_doc/marian>`_
     - `facebook/m2m100_418M <https://huggingface.co/facebook/m2m100_418M>`_
     - `facebook/m2m100_1.2B <https://huggingface.co/facebook/m2m100_1.2B>`_
 - `DeepL <https://www.deepl.com/docs-api>`_
-- `NIM <https://build.nvidia.com/nvidia/megatron-1b-nmt>`_
+- `NVIDIA Riva <https://build.nvidia.com/nvidia/megatron-1b-nmt>`_
 
-API KEY
--------
+API KEY Requirements
+--------------------
 
-You can use DeepL API or NIM API to translate probe and detector keywords and triggers.
+To use the DeepL API or Riva API to translate probe and detector keywords and triggers via cloud services, an API key must be supplied.
 
-You need an API key for the preferred service.
+API keys for the preferred service can be obtained at the following locations:
 - `DeepL <https://www.deepl.com/en/pro-api>`_
-- `NIM <https://build.nvidia.com/nvidia/megatron-1b-nmt>`_
+- `Riva <https://build.nvidia.com/nvidia/megatron-1b-nmt>`_
 
-Supported languages:
+Supported languages for remote services:
 - `DeepL <https://developers.deepl.com/docs/resources/supported-languages>`_
-- `NIM <https://build.nvidia.com/nvidia/megatron-1b-nmt/modelcard>`_
+- `Riva <https://docs.nvidia.com/nim/riva/nmt/latest/getting-started.html#supported-languages>`_
 
-Set up the API key with the following command:
+API keys can be stored in environment variables with the following commands:
 
 DeepL
 ~~~~~
@@ -61,52 +56,71 @@ DeepL
 
     export DEEPL_API_KEY=xxxx
 
-NIM
+RIVA
 ~~~
 
 .. code-block:: bash
 
-    export NIM_API_KEY=xxxx
+    export RIVA_API_KEY=xxxx
+
+Configuration file
+------------------
+
+Translation is configured in the ``run`` section of a configuration with the following keys:
+
+* ``lang_spec``   - A single ``bcp47`` entry designating the language of the target under test, e.g. "ja", "fr", "jap".
+* ``translators`` - A list of translator configurations, one per language pair.
+
+* Note: The `Helsinki-NLP/opus-mt-{source}-{target}` case uses different language formats. The language codes used to name models are inconsistent.
+Two-letter codes can usually be found `here <https://developers.google.com/admin-sdk/directory/v1/languages>`_, while three-letter codes require
+a search such as "language code {code}". More details can be found `here <https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models>`_.
 
-config file
------------
+A translator configuration is provided using the project's configurable pattern with the following required keys:
 
-You can pass the translation service, source language, and target language by the argument.
+* ``language``   - A ``-`` separated pair of ``bcp47`` entries describing the source and target languages handled by the configuration
+* ``model_type`` - The module and optional instance class to be instantiated, e.g. ``local``, ``remote``, ``remote.DeeplTranslator``
+* ``model_name`` - (optional) The model name loaded for translation; required for the ``local`` translator ``model_type``
 
-- translation_service: "nim" or "deepl", "local"
-- lang_spec: "ja", "ja,fr" etc. (you can set multiple language codes)
+(Optional) Model specific parameters defined by the translator model type may exist.
 
-* Note: The `Helsinki-NLP/opus-mt-en-{lang}` case uses different language formats. The language codes used to name models are inconsistent. Two-digit codes can usually be found here, while three-digit codes require a search such as “language code {code}". More details can be found `here <https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models>`_.
+* Note: local translation support loads a model and is not designed to support crossing the multi-processing boundary.
 
-The translator config writes to a file and the path passed, with 
-You can also configure this via a config file:
-is given in `Translator Config with yaml <translator_with_yaml>`_ below.
+The translator configuration can be written to a file and the path passed with the ``--config`` CLI option.
+
+An example template is provided below.
 
 .. code-block:: yaml 
 run:
-  translation:
-    translation_service: {translation service}
-    api_key: {your API key}
-    lang_spec: {language code} 
-    model_spec:
+  lang_spec: {target language code}
+  translators:
+    - language: {source language code}-{target language code}
+      api_key: {your API key}
+      model_type: {translator module or module.classname}
+      model_name: {huggingface model name} 
+    - language: {target language code}-{source language code}
+      api_key: {your API key}
+      model_type: {translator module or module.classname}
       model_name: {huggingface model name} 
 
+* Note: each translator is configured for a single translation pair and specification is required in each direction for a run to proceed.
 
-Examples for multilingual
--------------------------
+Examples for translation configuration
+--------------------------------------
 
 DeepL
 ~~~~~
 
-To use the translation option for garak, run the following command:
+To use DeepL translation in garak, run the following command:
 You use the following yaml config.
 
 .. code-block:: yaml 
 run:
-  translation:
-    translation_service: deepl
-    api_key: {your API key}
-    lang_spec: ja
+  lang_spec: {target language code}
+  translators:
+    - language: {source language code}-{target language code}
+      model_type: remote.DeeplTranslator
+    - language: {target language code}-{source language code}
+      model_type: remote.DeeplTranslator
 
 
 .. code-block:: bash
@@ -115,24 +129,25 @@ run:
     python3 -m garak --model_type nim --model_name meta/llama-3.1-8b-instruct --probes encoding --config {path to your yaml config file} 
 
 
-NIM
-~~~
+Riva
+~~~~
 
-For NIM, run the following command:
+For Riva, run the following command:
 You use the following yaml config.
 
 .. code-block:: yaml 
 
 run:
   translation:
-    translation_service: nim
-    api_key: {your API key}
-    lang_spec: ja
+    - language: {source language code}-{target language code}
+      model_type: remote
+    - language: {target language code}-{source language code}
+      model_type: remote
 
 
 .. code-block:: bash
 
-    export NIM_API_KEY=xxxx
+    export RIVA_API_KEY=xxxx
     python3 -m garak --model_type nim --model_name meta/llama-3.1-8b-instruct --probes encoding --config {path to your yaml config file} 
 
 
@@ -144,11 +159,14 @@ You use the following yaml config.
 
 .. code-block:: yaml 
 run:
-  translation:
-    translation_service: local
-    lang_spec: ja 
-    model_spec:
-      model_name: facebook/m2m100_418M 
+  lang_spec: ja
+  translators:
+    - language: en-ja
+      model_type: local
+      model_name: facebook/m2m100_418M
+    - language: ja-en
+      model_type: local
+      model_name: facebook/m2m100_418M
 
 
 .. code-block:: bash
@@ -158,12 +176,14 @@ run:
 
 .. code-block:: yaml 
 run:
-  translation:
-    translation_service: local
-    lang_spec: jap 
-    model_spec:
-      model_name: Helsinki-NLP/opus-mt-en-{}
-
+  lang_spec: jap
+  translators:
+    - language: en-jap
+      model_type: local
+      model_name: Helsinki-NLP/opus-mt-{}
+    - language: jap-en
+      model_type: local
+      model_name: Helsinki-NLP/opus-mt-{}
 
 .. code-block:: bash
 

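The revised docs state that each translator entry covers a single language pair and that a run needs an entry for each direction. A minimal sketch of that check follows, written against a plain dictionary instead of garak's own configuration objects. The helper name `missing_directions` and the assumption that probes are written in `en` are illustrative only and are not part of garak.

```python
def missing_directions(run_config: dict, source_lang: str = "en") -> list[str]:
    """Return the language pairs that have no translator entry yet.

    Hypothetical helper: assumes probes are written in `source_lang` and that
    `run_config` mirrors the `run` section shown in the docs above.
    """
    target = run_config.get("lang_spec", source_lang)
    if target == source_lang:
        # In this sketch, matching languages need no translator entries.
        return []
    configured = {entry["language"] for entry in run_config.get("translators", [])}
    required = [f"{source_lang}-{target}", f"{target}-{source_lang}"]
    return [pair for pair in required if pair not in configured]


example = {
    "lang_spec": "ja",
    "translators": [
        {
            "language": "en-ja",
            "model_type": "local",
            "model_name": "facebook/m2m100_418M",
        },
    ],
}
print(missing_directions(example))  # ['ja-en']
```

A check like this could run before any model is loaded, so a missing reverse pair surfaces before translation time is spent.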
diff --git a/garak/_config.py b/garak/_config.py
@@ -113,6 +113,8 @@ def _nested_dict():
 
 # this is so popular, let's set a default. what other defaults are worth setting? what's the policy?
 run.seed = None
+run.lang_spec = "en"
+run.translators = []
 
 # placeholder
 # generator, probe, detector, buff = {}, {}, {}, {}

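The two defaults added to `garak/_config.py` mean that code reading `run.lang_spec` or `run.translators` before a configuration file is loaded sees usable values instead of missing attributes. A small sketch of that effect follows, using a `SimpleNamespace` stand-in for the real config object. The `translator_entry` helper is hypothetical and only illustrates a lookup over the list.

```python
from types import SimpleNamespace

# Stand-in for the `run` config object, holding the defaults from the diff.
run = SimpleNamespace(seed=None, lang_spec="en", translators=[])


def translator_entry(run_cfg, source: str, target: str):
    """Return the configured entry for `source-target`, or None if absent.

    Hypothetical helper, not a garak API.
    """
    pair = f"{source}-{target}"
    for entry in run_cfg.translators:
        if entry.get("language") == pair:
            return entry
    return None


# With the defaults in place the lookup finds nothing and does not raise.
print(translator_entry(run, "en", "ja"))  # None
```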
diff --git a/garak/attempt.py b/garak/attempt.py
@@ -72,7 +72,7 @@ def __init__(
         detector_results=None,
         goal=None,
         seq=-1,
-        lang_type=None,
+        bcp47=None,  # language code for prompt as sent to the target
         reverse_translator_outputs=None,
     ) -> None:
         self.uuid = uuid.uuid4()
@@ -88,8 +88,10 @@ def __init__(
         self.seq = seq
         if prompt is not None:
             self.prompt = prompt
-        self.lang_type = lang_type
-        self.reverse_translator_outputs = {} if reverse_translator_outputs is None else reverse_translator_outputs
+        self.bcp47 = bcp47
+        self.reverse_translator_outputs = (
+            {} if reverse_translator_outputs is None else reverse_translator_outputs
+        )
 
     def as_dict(self) -> dict:
         """Converts the attempt to a dictionary."""
@@ -107,8 +109,10 @@ def as_dict(self) -> dict:
             "notes": self.notes,
             "goal": self.goal,
             "messages": self.messages,
-            "lang_type": self.lang_type,
-            "reverse_translator_outputs": {k: list(v) for k, v in self.reverse_translator_outputs.items()},
+            "bcp47": self.bcp47,
+            "reverse_translator_outputs": {
+                k: list(v) for k, v in self.reverse_translator_outputs.items()
+            },
         }
 
     @property