From 4337bd01945edfeffb38e1d832cfaf87e824a9fe Mon Sep 17 00:00:00 2001 From: Omri Mendels Date: Thu, 26 Dec 2024 18:26:42 +0200 Subject: [PATCH 1/5] updates to docs --- docs/analyzer/adding_recognizers.md | 19 +- docs/analyzer/customizing_nlp_models.md | 2 +- docs/analyzer/developing_recognizers.md | 10 +- docs/analyzer/index.md | 107 +++++ docs/analyzer/nlp_engines/transformers.md | 14 +- docs/anonymizer/adding_operators.md | 17 +- docs/anonymizer/index.md | 81 +++- docs/api.md | 1 + docs/api/analyzer_python.md | 69 ++- docs/api/anonymizer_python.md | 2 - docs/api/structured_python.md | 4 + docs/community.md | 22 +- docs/evaluation/index.md | 49 ++ docs/getting_started.md | 157 +------ .../getting_started/getting_started_images.md | 89 ++++ .../getting_started_structured.md | 42 ++ docs/getting_started/getting_started_text.md | 155 +++++++ docs/image-redactor/index.md | 27 +- docs/index.md | 10 +- docs/learn_presidio/concepts.md | 39 ++ docs/learn_presidio/index.md | 33 ++ docs/requirements-docs.txt | 1 + docs/samples/deployments/app-service/index.md | 2 +- ...ata-factory-template-gallery-databricks.md | 10 +- ...idio-data-factory-template-gallery-http.md | 4 +- .../data-factory/presidio-data-factory.md | 7 +- docs/samples/deployments/spark/index.md | 6 +- .../deployments/spark/notebooks/00_setup.py | 4 +- docs/samples/index.md | 4 +- .../python/ner_model_configuration.ipynb | 408 +++++++++++++++++ docs/samples/python/no_code_config.ipynb | 431 ++++++++++++++++++ ...omyzation.ipynb => pseudonymization.ipynb} | 0 docs/structured/index.md | 2 + docs/tutorial/08_no_code.md | 236 ++++++++-- mkdocs.yml | 233 ++++++---- 35 files changed, 1961 insertions(+), 336 deletions(-) create mode 100644 docs/api/structured_python.md create mode 100644 docs/evaluation/index.md create mode 100644 docs/getting_started/getting_started_images.md create mode 100644 docs/getting_started/getting_started_structured.md create mode 100644 docs/getting_started/getting_started_text.md create mode 100644 docs/learn_presidio/concepts.md create mode 100644 docs/learn_presidio/index.md create mode 100644 docs/samples/python/ner_model_configuration.ipynb create mode 100644 docs/samples/python/no_code_config.ipynb rename docs/samples/python/{pseudonomyzation.ipynb => pseudonymization.ipynb} (100%) diff --git a/docs/analyzer/adding_recognizers.md b/docs/analyzer/adding_recognizers.md index 4ca8fe187..13552c483 100644 --- a/docs/analyzer/adding_recognizers.md +++ b/docs/analyzer/adding_recognizers.md @@ -150,7 +150,7 @@ To add a recognizer to the list of pre-defined recognizers: 1. Clone the repo. 2. Create a file containing the new recognizer Python class. -3. Add the recognizer to the `recognizers` in the [`default_recognizers`](../../presidio-analyzer/presidio_analyzer/conf/default_recognizers.yaml) config. Details of recognizer paramers are given [Here](./recognizer_registry_provider.md#the-recognizer-parameters). +3. Add the recognizer to the `recognizers` in the [`default_recognizers`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/conf/default_recognizers.yaml) config. Details of recognizer parameters are given [Here](./recognizer_registry_provider.md#the-recognizer-parameters). 4. Optional: Update documentation (e.g., the [supported entities list](../supported_entities.md)). 
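To illustrate step 3 above, an entry in the `default_recognizers` configuration might look like the sketch below. This is a hedged example: `MyNewRecognizer` is a hypothetical class name, and the exact set of fields is defined in the [recognizer parameters documentation](./recognizer_registry_provider.md#the-recognizer-parameters):

```yaml
# Illustrative sketch only; MyNewRecognizer is the hypothetical class from step 2,
# and the field names follow the recognizer registry schema linked above.
recognizers:
  - name: MyNewRecognizer
    type: predefined
    supported_languages:
      - en
```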
### Azure AI Language recognizer

Additional examples can be found in the [OpenAPI spec](../api-docs/api-docs.html).

### Reading pattern recognizers from YAML

Recognizers can be loaded from a YAML file, which allows users to add recognition logic without writing code.
An example YAML file can be found [here](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/conf/default_recognizers.yaml).

Once the YAML file is created, it can be loaded into the `RecognizerRegistry` instance.

This example creates a `RecognizerRegistry` holding only the recognizers in the YAML file:

``` python
from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.recognizer_registry import RecognizerRegistryProvider

recognizer_registry_conf_file = "./analyzer/recognizers-config.yml"
provider = RecognizerRegistryProvider(
    conf_file=recognizer_registry_conf_file
)
registry = provider.create_recognizer_registry()

analyzer = AnalyzerEngine(registry=registry)

results = analyzer.analyze(text="My name is Morris", language="en")
print(results)
```

This example adds the new recognizers to the predefined recognizers in Presidio:

diff --git a/docs/analyzer/customizing_nlp_models.md b/docs/analyzer/customizing_nlp_models.md
index abc6e6f7f..33c14326e 100644
--- a/docs/analyzer/customizing_nlp_models.md
+++ b/docs/analyzer/customizing_nlp_models.md
Configuration can be done in two ways:

## Leverage frameworks other than spaCy, Stanza and transformers for ML based PII detection

In addition to the built-in spaCy/Stanza/transformers capabilities, it is possible to create new recognizers which serve as interfaces to other models.
For more information:

- [Remote recognizer documentation](adding_recognizers.md#creating-a-remote-recognizer) and [samples](../samples/python/integrating_with_external_services.ipynb).

diff --git a/docs/analyzer/developing_recognizers.md b/docs/analyzer/developing_recognizers.md
index 3772867ce..5e3fe0235 100644
--- a/docs/analyzer/developing_recognizers.md
+++ b/docs/analyzer/developing_recognizers.md
Recognizers define the logic for detection, as well as the confidence a prediction should receive.

### Accuracy

Each recognizer, regardless of its complexity, could have false positives and false negatives. When adding new recognizers, we try to balance the effect of each recognizer on the entire system.
A recognizer with many false positives would affect the system's usability, while a recognizer with many false negatives might require more work before it can be integrated. For reproducibility purposes, it is best to note how the recognizer's accuracy was tested, and on which datasets.
For tools and documentation on evaluating and analyzing recognizers, refer to the [presidio-research Github repository](https://github.com/microsoft/presidio-research).

Make sure your recognizer doesn't take too long to process text. Anything above 100ms per request with 100 tokens is probably not good enough.

### Environment

When adding new recognizers that have 3rd party dependencies, make sure that the new dependencies don't interfere with Presidio's dependencies.
In the case of a conflict, one can create an isolated model environment (outside the main presidio-analyzer process) and implement a [`RemoteRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/remote_recognizer.py) on the presidio-analyzer side to interact with the model's endpoint.

## Recognizer Types

Generally speaking, there are three types of recognizers:

A deny list is a list of words that should be removed during text analysis. For example, it can include a list of titles (`["Mr.", "Mrs.", "Ms.", "Dr."]`) to detect a "Title" entity.

See [this documentation](adding_recognizers.md) on adding a new recognizer. The [`PatternRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/pattern_recognizer.py) class has built-in support for a deny-list input.

### Pattern Based

Presidio currently uses [spaCy](https://spacy.io/) as a framework for text analysis and Named Entity Recognition (NER).
`spaCy` provides decent results compared to state-of-the-art NER models, but with much better computational performance.
`spaCy`, `stanza` and `transformers` models could be trained from scratch, used in combination with pre-trained embeddings, or be fine-tuned.

In addition to those, it is also possible to use other ML models. In that case, a new `EntityRecognizer` should be created.
See an example using [Flair here](https://github.com/microsoft/presidio/blob/main/docs/samples/python/flair_recognizer.py).

#### Apply Custom Logic

In some cases, rule-based logic provides reasonable ways for detecting entities.
The Presidio `EntityRecognizer` API allows you to use `spaCy` extracted features like lemmas, part of speech, dependencies and more to create your logic.
When integrating such logic into Presidio, a class inheriting from the [`EntityRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/entity_recognizer.py) should be created.

!!! attention "Considerations for selecting one option over another"

diff --git a/docs/analyzer/index.md b/docs/analyzer/index.md
index 096a46b62..14c0d6e28 100644
--- a/docs/analyzer/index.md
+++ b/docs/analyzer/index.md
see [Installing Presidio](../installation.md).
curl -d '{"text":"John Smith drivers license is AC432223", "language":"en"}' -H "Content-Type: application/json" -X POST http://localhost:3000/analyze ``` +## Main concepts + +Presidio analyzer is a set of tools that are used to detect entities in text. The main object in Presidio Analyzer is the `AnalyzerEngine`. In the following section we'll describe the main concepts in Presidio Analyzer. + +This simplified class diagram shows the main classes in Presidio Analyzer: + +```mermaid +classDiagram + direction LR + class RecognizerResult { + +str entity_type + +float score + +int start + +int end + } + + class EntityRecognizer { + +str name + +int version + +List[str] supported_entities + +analyze(text, entities) List[RecognizerResult] + } + + + class RecognizerRegistry { + +add_recognizer(recognizer) None + +remove_recognizer(recognizer) None + +load_predefined_recognizers() None + +get_recognizers() List[EntityRecognizer] + + + } + + class NlpEngine { + +process_text(text, language) NlpArtifacts + +process_batch(texts, language) Iterator[NlpArtifacts] + } + + class ContextAwareEnhancer { + +enhance_using_context(text, recognizer_results) List[RecognizerResult] + } + + + class AnalyzerEngine { + +NlpEngine nlp_engine + +RecognizerRegistry registry + +ContextAwareEnhancer context_aware_enhancer + +analyze(text: str, language) List[RecognizerResult] + + } + + NlpEngine <|-- SpacyNlpEngine + NlpEngine <|-- TransformersNlpEngine + NlpEngine <|-- StanzaNlpEngine + AnalyzerEngine *-- RecognizerRegistry + AnalyzerEngine *-- NlpEngine + AnalyzerEngine *-- ContextAwareEnhancer + RecognizerRegistry o-- "0..*" EntityRecognizer + ContextAwareEnhancer <|-- LemmaContextAwareEnhancer + + %% Defining styles + style RecognizerRegistry fill:#E6F7FF,stroke:#005BAC,stroke-width:2px + style NlpEngine fill:#FFF5E6,stroke:#FFA500,stroke-width:2px + style SpacyNlpEngine fill:#FFF5E6,stroke:#FFA500,stroke-width:2px + style YourNlpEngine fill:#FFF5E6,stroke:#FFA500,stroke-width:2px + style TransformersNlpEngine fill:#FFF5E6,stroke:#FFA500,stroke-width:2px + style StanzaNlpEngine fill:#FFF5E6,stroke:#FFA500,stroke-width:2px + style ContextAwareEnhancer fill:#E6FFE6,stroke:#008000,stroke-width:2px + style LemmaContextAwareEnhancer fill:#E6FFE6,stroke:#008000,stroke-width:2px + style EntityRecognizer fill:#F5F5DC,stroke:#8B4513,stroke-width:2px + style YourEntityRecognizer fill:#F5F5DC,stroke:#8B4513,stroke-width:2px + style RecognizerResult fill:#FFF0F5,stroke:#FF69B4,stroke-width:2px +``` + +### `RecognizerResult` + +A `RecognizerResult` holds the type and span of a PII entity. + +### `EntityRecognizer` + +An entity recognizer is an object in Presidio that is responsible for detecting entities in text. An entity recognizer can be a rule-based recognizer, a machine learning model, or a combination of both. + +### `PatternRecognizer` + +A `PatternRecognizer` is a type of entity recognizer that uses regular expressions to detect entities in text. One can create new `PatternRecognizer` objects by providing a list of regular expressions, context words, validation and invalidation logic and additional parameters that facilitate the detection of entities. + +### `AnalyzerEngine` + +The `AnalyzerEngine` is the main object in Presidio Analyzer that is responsible for detecting entities in text. The `AnalyzerEngine` can be configured in various ways to fit the specific needs of the user. 
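To make this concrete, here is a minimal sketch (using the same calls shown throughout these docs) of creating an `AnalyzerEngine` and inspecting the `RecognizerResult` objects it returns:

```python
from presidio_analyzer import AnalyzerEngine

# Loads the default NLP engine (spaCy-based) and the predefined recognizers
analyzer = AnalyzerEngine()

results = analyzer.analyze(text="My phone number is 212-555-5555", language="en")
for result in results:
    # Each RecognizerResult holds the entity type, span, and confidence score
    print(result.entity_type, result.start, result.end, result.score)
```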
+ +### `RecognizerRegistry` + +The `RecognizerRegistry` is a registry that contains all the entity recognizers that are available in Presidio. The `AnalyzerEngine` uses the `RecognizerRegistry` to detect entities in text. + +### `NlpEngine` + +An NLP Engine is an object that holds the NLP model that is used by the `AnalyzerEngine` to parse the input text and extract different features from it, such as tokens, lemmas, entities, and more. Note that Named Entity Recognition (NER) models can be added in two ways to Presidio: One is through the `NlpEngine` object, and the other is through a new `EntityRecognizer` object. By creating a Named Entity Recognition model through the `NlpEngine`, the named entities will be available to the different modules in Presidio. Furthermore, the `NlpEngine` object supports a batch mode (i.e., processing multiple texts at once) which allows for faster processing of large amounts of text. +It is possible to mix multiple NER models in Presidio, for instance, one model as the `NlpEngine` and others as additional `EntityRecognizer` objects. + +Presidio has an off-the-shelf support for multiple NLP packages, such as spaCy, stanza, and huggingface. The simplest way to integrate a model from these packages is through the `NlpEngine`. More information on this [can be found in the NlpEngine documentation](customizing_nlp_models.md). The samples gallery has several examples of leveraging NER models as new `EntityRecognizer` objects. For example, [flair](../samples/python/flair_recognizer.py) and [spanmarker](../samples/python/span_marker_recognizer.py). +For a detailed flow of Named Entities within presidio, see the diagram [in this document](nlp_engines/transformers.md#how-ner-results-flow-within-presidio). + +### `Context Aware Enhancer` + +The `ContextAwareEnhancer` is a module that enhances the detection of entities by using the context of the text. The `ContextAwareEnhancer` can be used to improve the detection of entities that are dependent on the context of the text, such as dates, locations, and more. The default implementation is the `LemmaContextAwareEnhancer` which uses the lemmas of the tokens in the text to enhance the detection of entities. Note that it's possible (and sometimes recommended) to create custom `ContextAwareEnhancer` objects to fit the specific needs of the user, for example if the context should support more than one word, which is currently not supported by the default Lemma based enhancer. +More information on this can be found [in this sample](../samples/python/customizing_presidio_analyzer.ipynb). + ## Creating PII recognizers Presidio analyzer can be easily extended to support additional PII entities. diff --git a/docs/analyzer/nlp_engines/transformers.md b/docs/analyzer/nlp_engines/transformers.md index 12f737fda..f8e8b8352 100644 --- a/docs/analyzer/nlp_engines/transformers.md +++ b/docs/analyzer/nlp_engines/transformers.md @@ -8,7 +8,9 @@ Presidio leverages other types of information from spaCy such as tokens, lemmas Therefore the pipeline returns both the NER model results as well as results from other pipeline components. ## How NER results flow within Presidio + This diagram describes the flow of NER results within Presidio, and the relationship between the `TransformersNlpEngine` component and the `TransformersRecognizer` component: + ```mermaid sequenceDiagram AnalyzerEngine->>TransformersNlpEngine: Call engine.process_text(text)
to get model results @@ -55,7 +57,6 @@ Then, also download a spaCy pipeline/model: python -m spacy download en_core_web_sm ``` - ### Configuring the NER pipeline Once the models are downloaded, one option to configure them is to create a YAML configuration file. @@ -193,7 +194,7 @@ Once the configuration file is created, it can be used to create a new `Transfor print(results_english) ``` -#### Explaning the configuration options +#### Explaining the configuration options - `model_name.spacy` is a name of a spaCy model/pipeline, which would wrap the transformers NER model. For example, `en_core_web_sm`. - The `model_name.transformers` is the full path for a huggingface model. Models can be found on [HuggingFace Models Hub](https://huggingface.co/models?pipeline_tag=token-classification). For example, `obi/deid_roberta_i2b2` @@ -208,19 +209,16 @@ The `ner_model_configuration` section contains the following parameters: - `low_confidence_score_multiplier`: A multiplier to apply to the score of entities with low confidence. - `low_score_entity_names`: A list of entity types to apply the low confidence score multiplier to. - !!! note "Defining the entity mapping" - To be able to create the `model_to_presidio_entity_mapping` dictionary, it is advised to check which classes the model is able to predict. - This can be found on the huggingface hub site for the model in some cases. In other, one can check the model's `config.json` uner `id2label`. - For example, for `bert-base-NER-uncased`, it can be found here: https://huggingface.co/dslim/bert-base-NER-uncased/blob/main/config.json. + To be able to create the `model_to_presidio_entity_mapping` dictionary, it is advised to check which classes the model is able to predict. + This can be found on the huggingface hub site for the model in some cases. In other, one can check the model's `config.json` under `id2label`. + For example, for `bert-base-NER-uncased`, it can be found here: . Note that most NER models add a prefix to the class (e.g. `B-PER` for class `PER`). When creating the mapping, do not add the prefix. - See more information on parameters on the [spacy-huggingface-pipelines Github repo](https://github.com/explosion/spacy-huggingface-pipelines#token-classification). Once created, see [the NLP configuration documentation](../customizing_nlp_models.md#Configure-Presidio-to-use-the-new-model) for more information. - ### Training your own model !!! note "Note" diff --git a/docs/anonymizer/adding_operators.md b/docs/anonymizer/adding_operators.md index 5e696ef42..49c5922ff 100644 --- a/docs/anonymizer/adding_operators.md +++ b/docs/anonymizer/adding_operators.md @@ -3,21 +3,20 @@ Operators are the presidio-anonymizer actions over the text. There are two types of operators: -- Anonymize (hash, replace, redact, encrypt, mask) -- Deanonymize (decrypt) -Presidio anonymizer can be easily extended to support additional anonymization and deanonymization methods. +- Anonymize (e.g., hash, replace, redact, encrypt, mask) +- Deanonymize (e.g., decrypt) -## Extending presidio-anonymizer for additional PII operators: +Presidio anonymizer can be easily extended to support additional anonymization and deanonymization methods (called Operators). -1. Under the path presidio_anonymizer/operators create new python class implementing the abstract [Operator](https://github.com/microsoft/presidio/blob/main/presidio-anonymizer/presidio_anonymizer/operators/operator.py) class +## Extending presidio-anonymizer for additional PII operators + +1. 
Create new python class implementing the abstract [Operator](https://github.com/microsoft/presidio/blob/main/presidio-anonymizer/presidio_anonymizer/operators/operator.py) class. 2. Implement the methods: - `operate` - gets the data and returns a new text expected to replace the old one. - `validate` - validate the parameters entered for the anonymizer exists and valid. - `operator_name` - this method helps to automatically load the existing anonymizers. - `operator_type` - either Anonymize or Deanonymize. Will be mapped to the proper engine. -3. Add the class to presidio_anonymizer/operators/__init__.py. -4. Restart the anonymizer. +3. Call the `AnonymizerEngine.add_anonymizer` method to add a new operator to the anonymizer. Alternatively, call the `DeanonymizeEngine.add_deanonymizer` method to add a new deanonymizer. -!!! note "Note" - The list of operators is being loaded dynamically each time Presidio Anonymizer is started. +See a detailed example [here](../samples/python/pseudonymization.ipynb). diff --git a/docs/anonymizer/index.md b/docs/anonymizer/index.md index b0c272a34..4170763ed 100644 --- a/docs/anonymizer/index.md +++ b/docs/anonymizer/index.md @@ -155,6 +155,85 @@ see [Installing Presidio](../installation.md). ]} ``` +## Main concepts + +The following class diagram shows a simplified view of the main classes in Presidio Anonymizer: + +```mermaid +classDiagram + direction LR + + class RecognizerResult { + +entity_type: str + +start: int + +end: int + +score: float + } + + class AnonymizerEngine { + +anonymize(text: str, analyzer_results: List[RecognizerResult], operators: Dict[str, OperatorConfig], ...) EngineResult + +add_anonymizer(anonymizer_cls: Type[Operator]) None + +remove_anonymizer(deanonymizer_cls: Type[Operator]) None + } + class DeanonymizeEngine { + +deanonymize(text: str, entities: List[OperatorResult], operators: Dict[str, OperatorConfig]) EngineResult + +get_deanonymizers() List[str] + +add_deanonymizer(deanonymizer_cls: Type[Operator]) None + +remove_deanonymizer(deanonymizer_cls: Type[Operator]) None + } + class Operator { + +operate(text: str, params: Dict) str + } + class OperatorConfig { + +operator_name: str + +params: Dict + } + + class EngineResult { + +text: str + +items: List[OperatorResult] + } + + class OperatorResult { + +start: int + +end: int + +entity_type: str + +text: str + +operator: str + } + + + RecognizerResult <-- AnonymizerEngine + RecognizerResult <-- DeanonymizeEngine + AnonymizerEngine o-- "1..*" Operator + AnonymizerEngine --o OperatorConfig + DeanonymizeEngine o-- "1..*" Operator + DeanonymizeEngine --o OperatorConfig + EngineBase --|> DeanonymizeEngine + EngineBase --|> AnonymizerEngine + EngineResult --o OperatorResult + + + %% Defining styles + style Operator fill:#E6F7FF,stroke:#005BAC,stroke-width:2px + style AnonymizerEngine fill:#FFF5E6,stroke:#FFA500,stroke-width:2px + style DeanonymizeEngine fill:#FFF5E6,stroke:#FFA500,stroke-width:2px + style EngineBase fill:#FFF5E6,stroke:#FFA500,stroke-width:2px + style OperatorConfig fill:#E6FFE6,stroke:#008000,stroke-width:2px + style EngineResult fill:#FFF0F5,stroke:#FF69B4,stroke-width:2px + style OperatorResult fill:#FFF0F5,stroke:#FF69B4,stroke-width:2px + +note for RecognizerResult "RecognizerResults +are the output +of the AnalyzerEngine" + +``` + +- The **AnonymizerEngine** is the main class in Presidio that is responsible for anonymizing PII entities in text. It uses the results from the **AnalyzerEngine** to perform the anonymization. 
- The **DeanonymizerEngine** is a class in Presidio that is responsible for deanonymizing text that has been anonymized by the **AnonymizerEngine**, given that the operation is reversible (e.g. encryption).
- An **Operator** is an object in Presidio that is responsible for performing the anonymization operation on a PII entity. Presidio provides several built-in operators, such as **Replace**, **Redact**, and **Encrypt**, and allows users to create custom operators.
- The **BatchAnonymizerEngine** is a class in Presidio that is responsible for anonymizing PII entities in a batch of texts. It uses the **AnonymizerEngine** to perform the anonymization on each text in the batch ([see more here](../samples/python/batch_processing.ipynb)).

## Built-in operators

| Operator type | Operator name | Description | Parameters |

anonymization scenarios:

- **No overlap (single PII)**: When there is no overlap in spans of entities,
  Presidio Anonymizer uses a given or default anonymization operator to anonymize
  and replace the PII text entity.
- **Full overlap of PII entity spans**: When entities have overlapping substrings,
  the PII with the higher score will be taken.
  Between PIIs with identical scores, the selection is arbitrary.
- **One PII is contained in another**: Presidio Anonymizer will use the PII with the larger text even if its score is lower.

diff --git a/docs/api.md b/docs/api.md
index f06adba1a..86002a37d 100644
--- a/docs/api.md
+++ b/docs/api.md
API reference for Presidio's main Python modules

- [Presidio analyzer Python API reference](api/analyzer_python.md)
- [Presidio anonymizer Python API reference](api/anonymizer_python.md)
- [Presidio image redactor Python API reference](api/image_redactor_python.md)
- [Presidio structured Python API reference](api/structured_python.md)

diff --git a/docs/api/analyzer_python.md b/docs/api/analyzer_python.md
index 9e0665a22..72d551a65 100644
--- a/docs/api/analyzer_python.md
+++ b/docs/api/analyzer_python.md
# Presidio Analyzer API Reference

## Objects at the top of the presidio-analyzer package

::: presidio_analyzer.AnalyzerEngine
    handler: python

::: presidio_analyzer.analyzer_engine_provider.AnalyzerEngineProvider
    handler: python

::: presidio_analyzer.analysis_explanation.AnalysisExplanation
    handler: python

::: presidio_analyzer.recognizer_result.RecognizerResult
    handler: python

## Batch

::: presidio_analyzer.batch_analyzer_engine.BatchAnalyzerEngine
    handler: python

::: presidio_analyzer.dict_analyzer_result.DictAnalyzerResult
    handler: python

## Recognizers and patterns

::: presidio_analyzer.entity_recognizer.EntityRecognizer
    handler: python

::: presidio_analyzer.local_recognizer.LocalRecognizer
    handler: python

::: presidio_analyzer.pattern.Pattern
    handler: python

::: presidio_analyzer.pattern_recognizer.PatternRecognizer
    handler: python

::: presidio_analyzer.remote_recognizer.RemoteRecognizer
    handler: python

## Misc

::: presidio_analyzer.analyzer_request.AnalyzerRequest
    handler: python

::: presidio_analyzer.analyzer_utils.PresidioAnalyzerUtils
    handler: python

## Recognizer registry

::: presidio_analyzer.recognizer_registry.RecognizerRegistry
    handler: python

::: presidio_analyzer.recognizer_registry.RecognizerRegistryProvider
    handler: python

## Context awareness

:::
presidio_analyzer.context_aware_enhancers

## NLP Engine classes

::: presidio_analyzer.nlp_engine
    handler: python

## Predefined Recognizers

::: presidio_analyzer.predefined_recognizers

diff --git a/docs/api/anonymizer_python.md b/docs/api/anonymizer_python.md
index f59ee1255..93ecce6bf 100644
--- a/docs/api/anonymizer_python.md
+++ b/docs/api/anonymizer_python.md
# Presidio Anonymizer API Reference

::: presidio_anonymizer
    handler: python

diff --git a/docs/api/structured_python.md b/docs/api/structured_python.md
new file mode 100644
index 000000000..70ed02454
--- /dev/null
+++ b/docs/api/structured_python.md
# Presidio Structured API Reference

::: presidio_structured
    handler: python

diff --git a/docs/community.md b/docs/community.md
index e7810ef84..51a0aa488 100644
--- a/docs/community.md
+++ b/docs/community.md
# Presidio eco-system

This section collects different resources developed with Presidio.

## Resources

| Resource | Description |
| ------ | ------ |
| [HashiCorp Vault Operator](https://github.com/sahajsoft/presidio-vault) | A library that allows integrating Presidio with HashiCorp Vault for anonymization and deanonymization. |
| [Rasa bot framework](https://rasa.com/docs/rasa/pii-management/) | Use Presidio to de-identify chat bot messages in Rasa. |
| [LangChain](https://python.langchain.com/v0.1/docs/guides/productionization/safety/presidio_data_anonymization/) | De-identification and reversible anonymization within LangChain. |
| [LlamaIndex](https://python.langchain.com/v0.1/docs/guides/productionization/safety/presidio_data_anonymization) | De-identification and reversible anonymization within LlamaIndex. |
| [LiteLLM](https://docs.litellm.ai/docs/proxy/guardrails/pii_masking_v2) | Integrate Presidio into LiteLLM. |
| [Guardrails-ai](https://www.guardrailsai.com/docs/examples/check_for_pii) | Use Presidio as an LLM guardrail using the Guardrails AI suite. |
| [Presidio in LLMGuard](https://llm-guard.com/input_scanners/anonymize) | Integrate Presidio into LLM Guard - The Security Toolkit for LLM Interactions. |
| [Privy](https://blog.px.dev/detect-pii/) | Integrate Presidio into Privy. |
| [Huggingface](https://huggingface.co/blog/presidio-pii-detection) | Automatic PII detection on Huggingface datasets. |
| [KNIME](https://hub.knime.com/knime/extensions/org.knime.python.features.presidio/latest/org.knime.python3.nodes.extension.ExtensionNodeSetFactory$DynamicExtensionNodeFactory:290c90e1) | Use Presidio within the KNIME framework. |
| [OpenMetadata](https://docs.open-metadata.org/latest/how-to-guides/data-quality-observability/profiler/auto-pii-tagging) | Auto PII tagging for Sensitive/NonSensitive at the column level. |
| [dataiku](https://doc.dataiku.com/dss/latest/generative-ai/pii-detection.html) | PII detection in the LLM Mesh can detect various forms of PII in your prompts and queries, and either block or redact the queries. |
| [Obsei](https://github.com/obsei/obsei) | Obsei is an open-source, low-code, AI-powered automation tool. |
| [data-describe](https://github.com/data-describe/data-describe) | data-describe is a Python toolkit for Exploratory Data Analysis (EDA). It aims to accelerate data exploration and analysis by providing automated and polished analysis widgets. |
| [Azure Search Power Skills](https://github.com/Azure-Samples/azure-search-power-skills) | Power Skills are a collection of useful functions to be deployed as custom skills for Azure Cognitive Search. The skills can be used as templates or starting points for your own custom skills, or they can be deployed and used as they are if they happen to meet your requirements. |
| [DataOps for the Modern Data Warehouse](https://github.com/Azure-Samples/modern-data-warehouse-dataops) | Contains numerous code samples and artifacts on how to apply DevOps principles to data pipelines built according to the [Modern Data Warehouse (MDW)](https://azure.microsoft.com/en-au/solutions/architecture/modern-data-warehouse/) architectural pattern on [Microsoft Azure](https://azure.microsoft.com/en-au/). |
| [Extending Power BI with Python and R](https://github.com/PacktPublishing/Extending-Power-BI-with-Python-and-R) | Code repository for [Extending Power BI with Python and R](https://www.packtpub.com/product/extending-power-bi-with-python-and-r/9781801078207), published by Packt. |
| [HebSafeHarbor](https://github.com/8400TheHealthNetwork/HebSafeHarbor) | Clinical notes anonymization in Hebrew. |
| [Presidio Github Action](https://github.com/marketplace/actions/presidio-action) | Github Action that analyzes text for PII entities with Microsoft's Presidio framework. |

* Please create a PR if you're interested in adding your tool to this list.
diff --git a/docs/evaluation/index.md b/docs/evaluation/index.md
new file mode 100644
index 000000000..33480adc0
--- /dev/null
+++ b/docs/evaluation/index.md
# Evaluating PII detection with Presidio

## Why evaluate PII detection?

No de-identification system is perfect.
It is important to evaluate the performance of a PII detection system for your specific use case.
This evaluation can help you understand where the system makes mistakes and how to iteratively improve the detection mechanisms, which recognizers and models to use, and how to configure them.

## Common evaluation metrics

The most common evaluation metrics are [precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall), as well as the Fβ score, which combines the two.
These metrics are calculated based on the number of true positives, false positives, and false negatives.
For every use case, the false positive and false negative rates should be balanced to achieve the desired level of accuracy.

- Precision measures the proportion of true positive results among the positive results: `TP / (TP + FP)`.
- Recall measures the proportion of true positive results among the actual positives: `TP / (TP + FN)`.
- Fβ score is a weighted harmonic mean of precision and recall: `(1 + β^2) * (precision * recall) / (β^2 * precision + recall)`.

[Click here for more definitions](https://en.wikipedia.org/wiki/Precision_and_recall#Definition).

!!! note "Note"
    In PII detection, recall is often more important than precision, as we'd like to avoid missing any PII.
    In such cases, we recommend using the β=2 score, which gives more importance to recall.

## How to evaluate PII detection with Presidio

Presidio provides a set of tools to evaluate the performance of the PII detection system.
In addition, it provides simple data generation tools to help you create a dataset for evaluation.

### Evaluating the Presidio Analyzer using Presidio-Research

Presidio-Research is a Python package with a set of tools that help you evaluate the performance of the Presidio Analyzer.
To get started, follow the instructions in the [Presidio-Research repository](https://github.com/microsoft/presidio-research).

The easiest way to get started is by reviewing the notebooks:

- [Notebook 1](https://github.com/microsoft/presidio-research/blob/master/notebooks/1_Generate_data.ipynb): Shows how to use the PII data generator.
- [Notebook 2](https://github.com/microsoft/presidio-research/blob/master/notebooks/2_PII_EDA.ipynb): Shows a simple analysis of the PII dataset.
- [Notebook 3](https://github.com/microsoft/presidio-research/blob/master/notebooks/3_Split_by_pattern_number.ipynb): Provides tools to split the dataset into train/test/validation sets while avoiding leakage due to the same pattern appearing in multiple folds (only applicable for synthetically generated data).
- [Notebook 4](https://github.com/microsoft/presidio-research/blob/master/notebooks/4_Evaluate_Presidio_Analyzer.ipynb): Shows how to use the evaluation tools to evaluate how well Presidio detects PII. Note that this is using the vanilla Presidio, and the results aren't very accurate.
- [Notebook 5](https://github.com/microsoft/presidio-research/blob/master/notebooks/5_Evaluate_Custom_Presidio_Analyzer.ipynb): Shows how one can configure Presidio to detect PII much more accurately, and boost the F score by ~30%.

For more information and advanced usage, refer to the [Presidio-Research repository](https://github.com/microsoft/presidio-research).
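As a minimal, self-contained sketch of the metric definitions above (illustrative only, not part of the Presidio-Research API):

```python
def precision_recall_fbeta(tp: int, fp: int, fn: int, beta: float = 2.0):
    """Compute precision, recall and F-beta from raw counts; beta=2 favors recall."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denominator = beta**2 * precision + recall
    fbeta = (1 + beta**2) * precision * recall / denominator if denominator > 0 else 0.0
    return precision, recall, fbeta


# Example: 90 PII spans detected correctly, 10 false alarms, 5 misses
print(precision_recall_fbeta(tp=90, fp=10, fn=5))
```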
+ +### Evaluating DICOM redaction with Presidio Image Redactor + +See [Evaluating DICOM redaction](../image-redactor/evaluating_dicom_redaction.md) for more information. +For a full demonstration, see the [evaluation notebook](../samples/python/example_dicom_redactor_evaluation.ipynb). diff --git a/docs/getting_started.md b/docs/getting_started.md index 2339bc79c..d8543564c 100644 --- a/docs/getting_started.md +++ b/docs/getting_started.md @@ -1,154 +1,7 @@ -# Getting started with Presidio +# Getting started with Microsoft Presidio -## Simple flow +The core functionality in Presidio is to detect PII in text. Presidio further contains a set of tools that build on top of text PII detection, for example in images, structured data, JSON and more. -Using Presidio's modules as Python packages to get started: - -===+ "Anonymize PII in text (Default spaCy model)" - - - 1. Install Presidio - - ```sh - pip install presidio-analyzer - pip install presidio-anonymizer - python -m spacy download en_core_web_lg - ``` - - 2. Analyze + Anonymize - - ```py - from presidio_analyzer import AnalyzerEngine - from presidio_anonymizer import AnonymizerEngine - - text="My phone number is 212-555-5555" - - # Set up the engine, loads the NLP module (spaCy model by default) - # and other PII recognizers - analyzer = AnalyzerEngine() - - # Call analyzer to get results - results = analyzer.analyze(text=text, - entities=["PHONE_NUMBER"], - language='en') - print(results) - - # Analyzer results are passed to the AnonymizerEngine for anonymization - - anonymizer = AnonymizerEngine() - - anonymized_text = anonymizer.anonymize(text=text,analyzer_results=results) - - print(anonymized_text) - ``` - -=== "Anonymize PII in text (transformers)" - - 1. Install Presidio - - ```sh - pip install "presidio-analyzer[transformers]" - pip install presidio-anonymizer - python -m spacy download en_core_web_sm - ``` - - 2. Analyze + Anonymize - - ```py - from presidio_analyzer import AnalyzerEngine - from presidio_analyzer.nlp_engine import TransformersNlpEngine - from presidio_anonymizer import AnonymizerEngine - - text = "My name is Don and my phone number is 212-555-5555" - - # Define which transformers model to use - model_config = [{"lang_code": "en", "model_name": { - "spacy": "en_core_web_sm", # use a small spaCy model for lemmas, tokens etc. - "transformers": "dslim/bert-base-NER" - } - }] - - nlp_engine = TransformersNlpEngine(models=model_config) - - # Set up the engine, loads the NLP module (spaCy model by default) - # and other PII recognizers - analyzer = AnalyzerEngine(nlp_engine=nlp_engine) - - # Call analyzer to get results - results = analyzer.analyze(text=text, language='en') - print(results) - - # Analyzer results are passed to the AnonymizerEngine for anonymization - - anonymizer = AnonymizerEngine() - - anonymized_text = anonymizer.anonymize(text=text, analyzer_results=results) - - print(anonymized_text) - - ``` - !!! tip "Tip: Downloading models" - If not available, the transformers model and the spacy model would be downloaded on the first call to the `AnalyzerEngine`. To pre-download, see [this doc](./analyzer/nlp_engines/transformers.md#downloading-a-pre-trained-model). - -## Simple flow: Images - -=== "Anonymize PII in images" - - 1. Install presidio-image-redactor - - ```sh - pip install presidio-image-redactor - ``` - - 2. 
Redact PII from image - - ```py - from presidio_image_redactor import ImageRedactorEngine - from PIL import Image - - image = Image.open(path_to_image_file) - - redactor = ImageRedactorEngine() - redactor.redact(image=image) - ``` - -=== "Redact text PII in DICOM images" - - 1. Install presidio-image-redactor - - ```sh - pip install presidio-image-redactor - ``` - - 2. Redact text PII from DICOM image - - ```py - import pydicom - from presidio_image_redactor import DicomImageRedactorEngine - - # Set input and output paths - input_path = "path/to/your/dicom/file.dcm" - output_dir = "./output" - - # Initialize the engine - engine = DicomImageRedactorEngine() - - # Option 1: Redact from a loaded DICOM image - dicom_image = pydicom.dcmread(input_path) - redacted_dicom_image = engine.redact(dicom_image, fill="contrast") - - # Option 2: Redact from DICOM file - engine.redact_from_file(input_path, output_dir, padding_width=25, fill="contrast") - - # Option 3: Redact from directory - engine.redact_from_directory("path/to/your/dicom", output_dir, padding_width=25, fill="contrast") - ``` ---- - -## Read more - -- [Installing Presidio](installation.md) -- [PII detection in text](analyzer/index.md) -- [PII anonymization in text](anonymizer/index.md) -- [PII redaction in images](image-redactor/index.md) -- [Discussion board](https://github.com/microsoft/presidio/discussions) +- For a quickstart for PII detection and de-identification in text [click here](getting_started/getting_started_text.md). +- For a quickstart for PII detection and de-identification in images [click here](getting_started/getting_started_images.md). +- For a quickstart for PII detection and de-identification in structured and semi-structured data [click here](getting_started/getting_started_structured.md). diff --git a/docs/getting_started/getting_started_images.md b/docs/getting_started/getting_started_images.md new file mode 100644 index 000000000..69a6419c5 --- /dev/null +++ b/docs/getting_started/getting_started_images.md @@ -0,0 +1,89 @@ +# Getting started with image de-identification with Presidio + +Presidio provides a simple way to de-identify image data by detecting and anonymizing personally identifiable information (PII). This guide shows you how to get started with image de-identification using Presidio's Python packages. + +Presidio has two main modules for image de-identification: General purpose, and specifically for DICOM (medical) images. + +## Simple flow - Python package + +=== "Anonymize PII in images" + + 1. Install presidio-image-redactor + + ```sh + pip install presidio-image-redactor + ``` + + 2. Redact PII from image + + ```py + from presidio_image_redactor import ImageRedactorEngine + from PIL import Image + + image = Image.open(path_to_image_file) + + redactor = ImageRedactorEngine() + redactor.redact(image=image) + ``` + +=== "Redact text PII in DICOM images" + + 1. Install presidio-image-redactor + + ```sh + pip install presidio-image-redactor + ``` + + 2. 
Redact text PII from DICOM image + + ```py + import pydicom + from presidio_image_redactor import DicomImageRedactorEngine + + # Set input and output paths + input_path = "path/to/your/dicom/file.dcm" + output_dir = "./output" + + # Initialize the engine + engine = DicomImageRedactorEngine() + + # Option 1: Redact from a loaded DICOM image + dicom_image = pydicom.dcmread(input_path) + redacted_dicom_image = engine.redact(dicom_image, fill="contrast") + + # Option 2: Redact from DICOM file + engine.redact_from_file(input_path, output_dir, padding_width=25, fill="contrast") + + # Option 3: Redact from directory + engine.redact_from_directory("path/to/your/dicom", output_dir, padding_width=25, fill="contrast") + ``` +--- + +## Simple flow - Docker container + +Presidio provides a Docker containers that you can use to de-identify image data. + +1. Download Docker image + +```sh +docker pull mcr.microsoft.com/presidio-image-redactor +``` + +2. Run container + +```sh +docker run -d -p 5003:3000 mcr.microsoft.com/presidio-image-redactor +``` + +3. Use the API + +```sh +curl -XPOST "http://localhost:5003/redact" -H "content-type: multipart/form-data" -F "image=@img.png" -F "data=\"{'color_fill':'255'}\"" > out.png +``` + +## Read more + +- [Installing Presidio](../installation.md) +- [PII detection in images](../image-redactor/index.md) +- [Samples](../samples/index.md) +- [Python API reference - Image Redactor](../api/image_redactor_python.md) diff --git a/docs/getting_started/getting_started_structured.md b/docs/getting_started/getting_started_structured.md new file mode 100644 index 000000000..b02341390 --- /dev/null +++ b/docs/getting_started/getting_started_structured.md @@ -0,0 +1,42 @@ +# Getting started with structured and semi-structured de-identification with Presidio + +Presidio-structured is a package built on top of Presidio that provides a simple way to de-identify structured and semi-structured data by detecting and anonymizing personally identifiable information (PII). + +Presidio-structured supports the detection and anonymization of PII in tables (e.g. Pandas DataFrames or SQL tables) and semi-structured data (e.g. JSON). + +!!! warning "Warning" + **Alpha**: This package is currently in alpha, meaning it is in its early stages of development. Features and functionality may change as the project evolves. + +## Simple flow - structured data + +```python +import pandas as pd +from presidio_structured import StructuredEngine, PandasAnalysisBuilder +from presidio_anonymizer.entities import OperatorConfig +from faker import Faker # optionally using faker as an example + +# Initialize the engine with a Pandas data processor (default) +pandas_engine = StructuredEngine() + +# Create a sample DataFrame +sample_df = pd.DataFrame({'name': ['John Doe', 'Jane Smith'], 'email': ['john.doe@example.com', 'jane.smith@example.com']}) + +# Generate a tabular analysis which detects the PII entities in the DataFrame. 
+tabular_analysis = PandasAnalysisBuilder().generate_analysis(sample_df) + +# Define anonymization operators +fake = Faker() +operators = { + "PERSON": OperatorConfig("replace", {"new_value": "REDACTED"}), + "EMAIL_ADDRESS": OperatorConfig("custom", {"lambda": lambda x: fake.safe_email()}) +} + +# Anonymize DataFrame +anonymized_df = pandas_engine.anonymize(sample_df, tabular_analysis, operators=operators) +print(anonymized_df) +``` + +## Read more + +- [Presidio structured documentation](../structured/index.md) +- [Presidio structured sample notebook](../samples/python/example_structured.ipynb) diff --git a/docs/getting_started/getting_started_text.md b/docs/getting_started/getting_started_text.md new file mode 100644 index 000000000..181644375 --- /dev/null +++ b/docs/getting_started/getting_started_text.md @@ -0,0 +1,155 @@ +# Getting started with text de-identification with Presidio + +Presidio provides a simple way to de-identify text data by detecting and anonymizing personally identifiable information (PII). This guide shows you how to get started with text de-identification using Presidio's Python packages. + +Note that Presidio can leverage different NLP packages to analyze text data. The default engine is based on `spaCy`, but you can [also use others](../analyzer/customizing_nlp_models.md). This guide shows two examples: one using `spaCy` and the other using `transformers`. + +## Simple flow - Python package + +Using Presidio's modules as Python packages to get started: + +===+ "Anonymize PII in text (Default spaCy model)" + + 1. Install Presidio + + ```sh + pip install presidio-analyzer + pip install presidio-anonymizer + python -m spacy download en_core_web_lg + ``` + + 2. Analyze + Anonymize + + ```py + from presidio_analyzer import AnalyzerEngine + from presidio_anonymizer import AnonymizerEngine + + text="My phone number is 212-555-5555" + + # Set up the engine, loads the NLP module (spaCy model by default) + # and other PII recognizers + analyzer = AnalyzerEngine() + + # Call analyzer to get results + results = analyzer.analyze(text=text, + entities=["PHONE_NUMBER"], + language='en') + print(results) + + # Analyzer results are passed to the AnonymizerEngine for anonymization + + anonymizer = AnonymizerEngine() + + anonymized_text = anonymizer.anonymize(text=text,analyzer_results=results) + + print(anonymized_text) + ``` + +=== "Anonymize PII in text (transformers)" + + 1. Install Presidio + + ```sh + pip install "presidio-analyzer[transformers]" + pip install presidio-anonymizer + python -m spacy download en_core_web_sm + ``` + + 2. Analyze + Anonymize + + ```py + from presidio_analyzer import AnalyzerEngine + from presidio_analyzer.nlp_engine import TransformersNlpEngine + from presidio_anonymizer import AnonymizerEngine + + text = "My name is Don and my phone number is 212-555-5555" + + # Define which transformers model to use + model_config = [{"lang_code": "en", "model_name": { + "spacy": "en_core_web_sm", # use a small spaCy model for lemmas, tokens etc. 
+ "transformers": "dslim/bert-base-NER" + } + }] + + nlp_engine = TransformersNlpEngine(models=model_config) + + # Set up the engine, loads the NLP module (spaCy model by default) + # and other PII recognizers + analyzer = AnalyzerEngine(nlp_engine=nlp_engine) + + # Call analyzer to get results + results = analyzer.analyze(text=text, language='en') + print(results) + + # Analyzer results are passed to the AnonymizerEngine for anonymization + + anonymizer = AnonymizerEngine() + + anonymized_text = anonymizer.anonymize(text=text, analyzer_results=results) + + print(anonymized_text) + + ``` + !!! tip "Tip: Downloading models" + If not available, the transformers model and the spacy model would be downloaded on the first call to the `AnalyzerEngine`. To pre-download, see [this doc](../analyzer/nlp_engines/transformers.md#downloading-a-pre-trained-model). + +## Simple flow - Docker container + +Presidio provides Docker containers that you can use to de-identify text data. Each module, analyzer, and anonymizer, has its own Docker container. The containers are available on Docker Hub. + +1. Download Docker images + +```sh +docker pull mcr.microsoft.com/presidio-analyzer +docker pull mcr.microsoft.com/presidio-anonymizer +``` + +2. Run containers + +```sh +docker run -d -p 5002:3000 mcr.microsoft.com/presidio-analyzer:latest + +docker run -d -p 5001:3000 mcr.microsoft.com/presidio-anonymizer:latest +``` + +3. Use the API + +```sh +curl -X POST http://localhost:5002/analyze \ +-H "Content-Type: application/json" \ +-d '{ + "text": "My phone number is 555-123-4567.", + "language": "en" +}' + + +curl -X POST http://localhost:5001/anonymize -H "Content-Type: application/json" -d ' + { + "text": "My phone number is 555-123-4567", + "anonymizers": { + "PHONE_NUMBER": { + "type": "replace", + "new_value": "--Redacted phone number--" + } + }, + "analyzer_results": [ + { + "start": 19, + "end": 31, + "score": 0.95, + "entity_type": "PHONE_NUMBER" + } + ]}' + +``` + +## Read more + +- [Installing Presidio](../installation.md) +- [PII detection in text](../analyzer/index.md) +- [PII anonymization in text](../anonymizer/index.md) +- [Tutorial](../tutorial/index.md) +- [Samples](../samples/index.md) +- [Python API reference - Analyzer](../api/analyzer_python.md) +- [Python API reference - Anonymizer](../api/anonymizer_python.md) +- [REST API reference](../api-docs/api-docs.html) diff --git a/docs/image-redactor/index.md b/docs/image-redactor/index.md index 5045d67a1..ebb96a96e 100644 --- a/docs/image-redactor/index.md +++ b/docs/image-redactor/index.md @@ -145,6 +145,7 @@ Python script example can be found under: ocr_kwargs = {"ocr_threshold": 50} engine.redact_from_directory("path/to/your/dicom", output_dir, fill="background", save_bboxes=True, ocr_kwargs=ocr_kwargs) ``` + ## Getting started using the document intelligence OCR engine Presidio offers two engines for OCR based PII removal. The first is the default engine which uses Tesseract OCR. The second is the Document Intelligence OCR engine which uses Azure's Document Intelligence service, which requires an Azure subscription. The following sections describe how to setup and use the Document Intelligence OCR engine. @@ -152,38 +153,42 @@ Presidio offers two engines for OCR based PII removal. The first is the default You will need to register with Azure to get an API key and endpoint. 
Perform the steps in the "Prerequisites" section of [this page](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/quickstarts/get-started-sdks-rest-api). Once your resource deploys, copy your endpoint and key values and save them for the next step.

The most basic usage of the engine can be set up like the following in Python:

```
di_ocr = DocumentIntelligenceOCR(endpoint="<your_endpoint>", key="<your_key>")
```

The DocumentIntelligenceOCR can also attempt to pull your endpoint and key values from environment variables.

```
export DOCUMENT_INTELLIGENCE_ENDPOINT=<your_endpoint>
export DOCUMENT_INTELLIGENCE_KEY=<your_key>
```

### Document Intelligence Model Support

There are numerous document processing models available, and currently we only support the most basic usage of the model. For an overview of the functionalities offered by Document Intelligence, see [this page](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/concept-model-overview). Presidio offers only word-level processing on the result for PII redaction purposes, as all prebuilt document models support this interface. Different models add structured support for tables, paragraphs, key-value pairs, fields and other types of metadata in the response.

Additional metadata can be sent to the Document Intelligence API call, such as pages, locale, and features, which are documented [here](https://learn.microsoft.com/en-us/python/api/azure-ai-formrecognizer/azure.ai.formrecognizer.documentanalysisclient?view=azure-python#azure-ai-formrecognizer-documentanalysisclient-begin-analyze-document). You are encouraged to test each model to see which fits your use case best.

#### Creating an image redactor engine in Python

```
di_ocr = DocumentIntelligenceOCR()
ia_engine = ImageAnalyzerEngine(ocr=di_ocr)
my_engine = ImageRedactorEngine(image_analyzer_engine=ia_engine)
```

#### Testing Document Intelligence

Follow the steps of [running the tests](../development.md#running-tests).

The test suite has a series of tests which are only exercised when the appropriate environment variables are populated. To run the test suite against the DocumentIntelligenceOCR engine, set the environment variables and call the tests like this:

```
export DOCUMENT_INTELLIGENCE_ENDPOINT=<your_endpoint>
export DOCUMENT_INTELLIGENCE_KEY=<your_key>
pytest
```

### Evaluating de-identification performance

If you are interested in evaluating the performance of the DICOM de-identification approach, see [Evaluating DICOM redaction](evaluating_dicom_redaction.md).

### Side note for Windows

If you are using a Windows machine, you may run into issues if file paths are too long. Unfortunately, this is not rare when working with DICOM images that are often nested in directories with descriptive names.
+If you are using a Windows machine, you may run into issues if file paths are too long. Unfortunately, this is not rare when working with DICOM images that are often nested in directories with descriptive names.
 
 To avoid errors where the code may not recognize a path as existing due to the length of the characters in the file path, please [enable long paths on your system](https://learn.microsoft.com/en-us/answers/questions/293227/longpathsenabled.html).
diff --git a/docs/index.md b/docs/index.md
index 7a1ebd3ac..84a883b17 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -28,18 +28,22 @@ bitcoin wallets, US phone numbers, financial data and more.
 !!! warning "Warning"
 Presidio can help identify sensitive/PII data in un/structured text. However, because it is using automated detection mechanisms, there is no guarantee that Presidio will find all sensitive information. Consequently, additional systems and protections should be employed.
 
-## [Demo](https://aka.ms/presidio-demo) | [Frequently Asked Questions](faq.md)
+## Demo
 
-## Are you using Presidio? We'd love to know how
+Link to demo: <https://aka.ms/presidio-demo>
 
-Please help us improve by taking [this short anonymous survey](https://forms.office.com/Pages/ResponsePage.aspx?id=v4j5cvGGr0GRqy180BHbR9LagCGNW01LpMix2pnFWFJUQjJDTVkwSlJYRkFPSUNNVlVRRVRWVDVNSy4u).
+
+## Provide feedback
+
+Are you using Presidio? We'd love to know how! Please help us improve by taking [this short anonymous survey](https://forms.office.com/Pages/ResponsePage.aspx?id=v4j5cvGGr0GRqy180BHbR9LagCGNW01LpMix2pnFWFJUQjJDTVkwSlJYRkFPSUNNVlVRRVRWVDVNSy4u).
 
 ## Presidio's modules
 
 1. [Presidio analyzer](analyzer/index.md): PII identification in text
 2. [Presidio anonymizer](anonymizer/index.md): De-identify detected PII entities using different operators
 3. [Presidio image redactor](image-redactor/index.md): Redact PII entities from images using OCR and PII identification
+4. [Presidio structured](structured/index.md): PII identification in structured/semi-structured data
 
 ## Installing Presidio
diff --git a/docs/learn_presidio/concepts.md b/docs/learn_presidio/concepts.md
new file mode 100644
index 000000000..3953ce76a
--- /dev/null
+++ b/docs/learn_presidio/concepts.md
@@ -0,0 +1,39 @@
+# Concepts in Microsoft Presidio
+
+## High-level concepts
+
+| Concept | Definition | Learn More |
+|--------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------|
+| **Entity** | An *entity* is a span of text that can be used to directly identify an individual. For example, a phone number, email address, or social security number. In Presidio, an entity is represented by a **RecognizerResult** object. | [Analyzer concepts](../analyzer/index.md#main-concepts) |
+| **Context** | *Context* is defined as the surrounding text of an entity. Context can be used to provide additional information about the entity which can be used to improve the detection accuracy. | [Analyzer concepts](../analyzer/index.md#main-concepts) |
+| **Recognizer** | A *recognizer* is an object that is responsible for detecting entities in text. Recognizers can be rule-based, machine learning-based, or a combination of both. The Presidio Analyzer orchestrates multiple recognizers to detect PII entities in text. The main objects in Presidio that implement PII detection logic are the **EntityRecognizer** and **PatternRecognizer**. 
| [Analyzer concepts](../analyzer/index.md#main-concepts) | +| **Analyzer** | The Presidio `AnalyzerEngine` is responsible for orchestrating the PII detection using various recognizers.| [Analyzer concepts](../analyzer/index.md#main-concepts) | +| **Predefined recognizer** | A recognizer that already exists in Presidio | [Predefined recognizers](../supported_entities.md) | +| **Custom recognizer** | A recognizer that is added by the user | [Adding recognizers](../analyzer/adding_recognizers.md) | +| **ad-hoc recognizer** | A recognizer that is added to the request itself, rather than to the list of recognizers loaded within Presidio | [ad-hoc recognizers](../analyzer/adding_recognizers.md#creating-ad-hoc-recognizers) | +| **Deny list** | A list of terms that should always be identified as PII | [denylist tutorial](../tutorial/01_deny_list.md) | +| **Allow list** | A list of terms that should not be identified as PII | [allowlist tutorial](../tutorial/13_allow_list.md) | + +## Main objects in Presidio + +| Concept | Definition | Learn More | +|--------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------| +| **EntityRecognizer** | An **EntityRecognizer** is an object in Presidio that is responsible for detecting entities in text. An entity recognizer can be rule-based, a machine learning model, or a combination of both. | [Analyzer concepts](../analyzer/index.md#main-concepts) | +| **RecognizerResult** | A **RecognizerResult** holds the type and span of a PII entity. | [Analyzer concepts](../analyzer/index.md#main-concepts) | +| **RecognizerRegistry** | The **RecognizerRegistry** is a class in Presidio that is responsible for holding the various recognizers used by the **AnalyzerEngine**. | [link](../analyzer/index.md#main-concepts) | +| **NlpEngine** | The **NlpEngine** is an interface that defines the methods for processing text. Presidio provides several implementations of the **NlpEngine**, such as **SpacyNlpEngine**, **TransformersNlpEngine**, and **StanzaNlpEngine**. | [Analyzer concepts](../analyzer/index.md#main-concepts) | +| **AnalyzerEngine** | The **AnalyzerEngine** is the main class in Presidio that is responsible for orchestrating the PII detection in text. It uses an **NlpEngine** to process the text and a **RecognizerRegistry** to hold the different recognizers. | [Analyzer concepts](../analyzer/index.md#main-concepts) | +| **BatchAnalyzerEngine** | The **BatchAnalyzerEngine** is a class in Presidio that is responsible for detecting PII entities in a batch of texts. It uses the **AnalyzerEngine** to process each text in the batch. | [Batch processing sample](../samples/python/batch_processing.ipynb) | +| **AnonymizerEngine** | The **AnonymizerEngine** is the main class in Presidio that is responsible for anonymizing PII entities in text. It uses the results from the **AnalyzerEngine** to perform the anonymization. | [Anonymizer concepts](../anonymizer/index.md#main-concepts) | +| **DeanonymizerEngine** | The **DeanonymizerEngine** is a class in Presidio that is responsible for deanonymizing text that has been anonymized by the **AnonymizerEngine**, given that the operation is reversible (e.g. encryption). 
| [Anonymizer concepts](../anonymizer/index.md#main-concepts) |
+| **Operator** | An **Operator** is an object in Presidio that is responsible for performing the anonymization operation on a PII entity. Presidio provides several built-in operators, such as **Replace**, **Redact**, and **Encrypt**, and allows users to create custom operators. | [Anonymizer concepts](../anonymizer/index.md#main-concepts) |
+| **BatchAnonymizerEngine** | The **BatchAnonymizerEngine** is a class in Presidio that is responsible for anonymizing PII entities in a batch of texts. It uses the **AnonymizerEngine** to perform the anonymization on each text in the batch. | [Sample](../samples/python/batch_processing.ipynb) |
+| **ImageRedactorEngine** | The **ImageRedactorEngine** is a class in Presidio that is responsible for redacting PII entities in images. It leverages the **AnalyzerEngine** to detect PII entities in the text extracted from the images. | [Image redaction docs](../image-redactor/index.md) |
+| **StructuredEngine** | The **StructuredEngine** is a class in Presidio that is responsible for detecting PII entities in structured data. It uses the **AnalyzerEngine** to detect PII entities in the text fields of the structured data. | [Structured docs](../structured/index.md) |
+
+## Evaluation concepts
+
+| Concept | Definition | Learn More |
+|--------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------|
+| **Precision** | **Precision** is a metric that measures the proportion of true positive results among the positive results. In the context of PII detection, precision measures the proportion of correctly identified PII entities among all the entities identified by the system. | [Evaluation docs](../evaluation/index.md) |
+| **Recall** | **Recall** is a metric that measures the proportion of true positive results among the actual positive results. In the context of PII detection, recall measures the proportion of correctly identified PII entities among all the PII entities present in the text.| [Evaluation docs](../evaluation/index.md) |
diff --git a/docs/learn_presidio/index.md b/docs/learn_presidio/index.md
new file mode 100644
index 000000000..9402ff055
--- /dev/null
+++ b/docs/learn_presidio/index.md
@@ -0,0 +1,33 @@
+# Learn Presidio
+
+Presidio is a suite of tools for detecting and de-identifying PII in text, images, and structured data.
+
+The recommended place to start is to follow the [tutorial](../tutorial/index.md) which will guide you through the process of setting up and using Presidio.
+To learn about the different concepts in Presidio, visit the [concepts page](concepts.md).
+
+To go deeper into each component, visit the relevant docs (a minimal usage sketch follows this list):
+
+- For the Presidio Analyzer, visit the [Analyzer docs](../analyzer/index.md).
+- For Presidio Anonymizer, visit the [Anonymizer docs](../anonymizer/index.md).
+- For Presidio Image Redactor, visit the [Image Redactor docs](../image-redactor/index.md).
+- For Presidio structured, visit the [Structured docs](../structured/index.md).
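+
+As a quick orientation, here is a minimal sketch of how the two main engines fit together, mirroring the getting-started flow elsewhere in these docs (it assumes `presidio-analyzer` and `presidio-anonymizer` are installed):
+
+```python
+from presidio_analyzer import AnalyzerEngine
+from presidio_anonymizer import AnonymizerEngine
+
+text = "My name is Morris"
+
+# The AnalyzerEngine orchestrates the recognizers and returns RecognizerResult objects
+analyzer = AnalyzerEngine()
+results = analyzer.analyze(text=text, language="en")
+
+# The AnonymizerEngine applies an Operator (replace, by default) to each detected entity
+anonymizer = AnonymizerEngine()
+print(anonymizer.anonymize(text=text, analyzer_results=results))
+```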
+
+The following diagrams provide a high-level understanding of the Presidio components:
+
+## Analyzer
+
+[![Analyzer Design](../assets/analyzer-design.png)](../analyzer/index.md)
+
+## Anonymizer
+
+[![Anonymizer Design](../assets/anonymizer-design.png)](../anonymizer/index.md)
+
+## Image Redactor
+
+### Standard Image Types
+
+[![Image Redactor Design](../assets/image-redactor-design.png)](../image-redactor/index.md)
+
+### DICOM Images
+
+[![DICOM Image Redactor Design](../assets/dicom-image-redactor-design.png)](../image-redactor/index.md)
diff --git a/docs/requirements-docs.txt b/docs/requirements-docs.txt
index c5131d5d2..fd2527de9 100644
--- a/docs/requirements-docs.txt
+++ b/docs/requirements-docs.txt
@@ -7,4 +7,5 @@ mkdocstrings-python
 presidio_analyzer
 presidio_anonymizer
 presidio_image_redactor
+presidio_structured
 pygments>=2.10
diff --git a/docs/samples/deployments/app-service/index.md b/docs/samples/deployments/app-service/index.md
index d7f43281e..252c9a46e 100644
--- a/docs/samples/deployments/app-service/index.md
+++ b/docs/samples/deployments/app-service/index.md
@@ -83,7 +83,7 @@ $APP_SERVICE_ID --logs '[{"category": "AppServicePlatformLogs","enabled": true
 
 ## Using an ARM template
 
-Alternatlively, you can use the provided ARM template which can deploy either both or any of the presidio services.
+Alternatively, you can use the provided ARM template, which can deploy both or either of the Presidio services.
 Note that while Log Analytics integration with Azure App Service is in preview, the ARM template deployment will not create a Log Analytics resource or configure the diagnostics settings from the App Service to a Log Analytics workspace.
 To deploy the app services using the provided ARM template, fill in the provided values.json file with the required values and run the following script.
diff --git a/docs/samples/deployments/data-factory/presidio-data-factory-template-gallery-databricks.md b/docs/samples/deployments/data-factory/presidio-data-factory-template-gallery-databricks.md
index fddd6c5ed..ebf43b1d0 100644
--- a/docs/samples/deployments/data-factory/presidio-data-factory-template-gallery-databricks.md
+++ b/docs/samples/deployments/data-factory/presidio-data-factory-template-gallery-databricks.md
@@ -1,15 +1,15 @@
 # Anonymize PII entities in datasets using Azure Data Factory template and Presidio on Databricks
 
-This sample uses the built in [data anonymization template](https://github.com/Azure/Azure-DataFactory/tree/main/templates/Data%20Anonymization%20with%20Presidio%20on%20Databricks) of Azure Data Factory (which is a part of the Template Gallery) to copy a csv dataset from one location to another, while anonymizing PII data from a text column in the dataset. It leverages the code for using [Presidio on Azure Databricks](../spark/index.md) to call Presidio as a Databricks notebook job in the Azure Data Factory (ADF) pipeline to transform the input dataset before mergine the results to an Azure Blob Storage.
+This sample uses the built-in [data anonymization template](https://github.com/Azure/Azure-DataFactory/tree/main/templates/Data%20Anonymization%20with%20Presidio%20on%20Databricks) of Azure Data Factory (which is a part of the Template Gallery) to copy a csv dataset from one location to another, while anonymizing PII data from a text column in the dataset. 
It leverages the code for using [Presidio on Azure Databricks](../spark/index.md) to call Presidio as a Databricks notebook job in the Azure Data Factory (ADF) pipeline to transform the input dataset before merging the results to an Azure Blob Storage.
 
-**Note that** this solution is capabale of transforming large datasets. For smaller, text based input you may want to work with the [Data Anonymization with Presidio as an HTTP service](./presidio-data-factory-template-gallery-http.md) template which offers an easier deployment for Presidio.
+**Note that** this solution is capable of transforming large datasets. For smaller, text-based input you may want to work with the [Data Anonymization with Presidio as an HTTP service](./presidio-data-factory-template-gallery-http.md) template which offers an easier deployment for Presidio.
 
 The sample deploys the following Azure Services:
 
 * Azure Storage - The target storage account where data will be persisted.
 * Azure Databricks - Host presidio to anonymize the data.
 
-Additionaly you should already have an instance of Azure Data Factory which hosts and orchestrates the transformation pipeline and a storage account which holds the source files.
+Additionally, you should already have an instance of Azure Data Factory which hosts and orchestrates the transformation pipeline and a storage account which holds the source files.
 
 ## About this Solution Template
 
@@ -34,7 +34,7 @@ To use this template you should first setup the required infrastructure for the
 
 ### Setup Presidio
 
-Provision and setup the datbricks cluster by following the Deploy and Setup steps in [presidio-spark sample](../spark/index.md#Pre-requisites).
+Provision and set up the Databricks cluster by following the Deploy and Setup steps in [presidio-spark sample](../spark/index.md#Pre-requisites).
 Take a note of the authentication token and do not follow the "Running the sample" steps.
 
 ### Setup Azure Data Factory
 
@@ -42,7 +42,7 @@ Take a note of the authentication token and do not follow the "Running the sampl
 1. Go to the Data anonymization with Presidio on Databricks template. Select the AnonymizedCSV connection (Azure Storage) and select "New" from the drop down menu.
 ![ADF-Template-Load](images/data-anonymization-databricks-01.png)
-2. Name the service "PresidioStorage" and select the storage account that was created in the previous steps from your subscription. Note that Target source was also selecte as the sample uses the same storage account for both source and target.
+2. Name the service "PresidioStorage" and select the storage account that was created in the previous steps from your subscription. Note that Target source was also selected, as the sample uses the same storage account for both source and target.
 ![ADF-Storage-Link](images/data-anonymization-databricks-02.png)
 3. Select the Anonymize Source connection (Databricks) and select "New" from the drop down menu. 

diff --git a/docs/samples/deployments/data-factory/presidio-data-factory-template-gallery-http.md b/docs/samples/deployments/data-factory/presidio-data-factory-template-gallery-http.md
index 3189844b5..88d3c3889 100644
--- a/docs/samples/deployments/data-factory/presidio-data-factory-template-gallery-http.md
+++ b/docs/samples/deployments/data-factory/presidio-data-factory-template-gallery-http.md
@@ -3,7 +3,7 @@
 This sample uses the built in [data anonymization template](https://github.com/Azure/Azure-DataFactory/tree/main/templates/Data%20Anonymization%20with%20Presidio%20as%20an%20HTTP%20service) of Azure Data Factory which is a part of the Template Gallery to move a set of text files from one location to another while anonymizing their content. It leverages the code for using [Presidio on Azure App Service](../app-service/index.md) to call Presidio as an HTTP REST endpoint in the Azure Data Factory (ADF) pipeline while parsing and storing each file as an Azure Blob Storage.
 
 **Note that** given the solution architecture which call presidio services using HTTP, this sample should be used for up to 5000 files, each up to 200KB in size.
-The restrictions are based on ADF lookup-activity which is used to iterate the files in the storage container (up to 5000 records), and having Presidio as an HTTP endpoint with text being sent over network to be anonymized.
+The restrictions are based on the ADF lookup activity, which is used to iterate the files in the storage container (up to 5000 records), and on having Presidio as an HTTP endpoint with text being sent over the network to be anonymized. 
 For larger sets please work with the [Data Anonymization with Presidio on Databricks](./presidio-data-factory-template-gallery-databricks.md) template.
 
 The sample deploys the following Azure Services:
 
@@ -12,7 +12,7 @@
 * Azure Storage - The target storage account where data will be persisted.
 * Azure App Service - Host presidio to anonymize the data.
 
-Additionaly you should already have an instance of Azure Data Factory which host and orchestrate the transformation pipeline and a storage account which holds the source files.
+Additionally, you should already have an instance of Azure Data Factory which hosts and orchestrates the transformation pipeline and a storage account which holds the source files.
 
 ## About this Solution Template
+The second sample leverages the code for using [Presidio on spark](../spark/index.md) to run over a set of files on an Azure Blob Storage to anonymize their content, in the case of a large data set that requires the scale of Databricks.
 
 The samples deploy and use the following Azure Services:
 
@@ -25,7 +25,6 @@ Create the Azure App Service and the ADF pipeline by clicking the Deploy-to-Azur
 
 [![Deploy to Azure](https://aka.ms/deploytoazurebutton)](https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2Fmicrosoft%2Fpresidio%2Fmain%2Fdocs%2Fsamples%2Fdeployments%2Fdata-factory%2Farm-templates%2Fazure-deploy-adf-app-service.json)
 
-
 ```bash
 RESOURCE_GROUP=[Name of resource group]
 LOCATION=[location of resources]
@@ -55,13 +54,13 @@ The template contains seven activities:
 
 ## Option 2: Presidio on Azure Databricks
 
-By using Presidio as a Notebook step in ADF, we allow Databricks to scale presidio according to the cluster capabilities and the input dataset. Using presidio as a native python package in pyspark can unlock more analysis and de-identifiaction scenarios.
+By using Presidio as a Notebook step in ADF, we allow Databricks to scale presidio according to the cluster capabilities and the input dataset. Using presidio as a native python package in pyspark can unlock more analysis and de-identification scenarios.
 
 ![ADF-Databricks](images/adf-databricks-screenshot.png)
 
 ### Pre-requisite - Deploy Azure Databricks
 
-Provision and setup the datbricks cluster by following the steps in [presidio-spark sample](../spark/index.md#Deploy-Infrastructure).
+Provision and set up the Databricks cluster by following the steps in [presidio-spark sample](../spark/index.md#Deploy-Infrastructure).
 Note the output key and export it as DATABRICKS_TOKEN environment variable.
 
 ### Deploy the ARM template
diff --git a/docs/samples/deployments/spark/index.md b/docs/samples/deployments/spark/index.md
index fbd6e458d..d0844327f 100644
--- a/docs/samples/deployments/spark/index.md
+++ b/docs/samples/deployments/spark/index.md
@@ -50,9 +50,9 @@ anonymized_df = input_df.withColumn(
 
 ## Pre-requisites
 
-If you do not have an instance of Azure Databricks, follow through with the following steps to provision and setup the required infrastrucutre.
+If you do not have an instance of Azure Databricks, follow the steps below to provision and set up the required infrastructure.
 
-If you do have a Databricks workspace and a cluster you wish to configure to run Presidio, jump over to the [Configure an existing cluster](#Configure-an-existing-cluster) section.
+If you do have a Databricks workspace and a cluster you wish to configure to run Presidio, jump to the [Configure an existing cluster](#configure-an-existing-cluster) section.
 
 ### Deploy Infrastructure
 
@@ -152,7 +152,7 @@ Run the first code-cell and note the following parameters on the top end of the
 
 ### Run the notebook
 
-Upload a text file to the blob storage input folder, using any preferd method ([Azure Portal](https://docs.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-portal), [Azure Storage Explorer](https://docs.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-storage-explorer), [Azure CLI](https://docs.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-cli)). 
+Upload a text file to the blob storage input folder, using any preferred method ([Azure Portal](https://docs.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-portal), [Azure Storage Explorer](https://docs.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-storage-explorer), [Azure CLI](https://docs.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-cli)). ```bash az storage blob upload --account-name $STORAGE_ACCOUNT_NAME --container $STORAGE_CONTAINER_NAME --file ./[file name] --name input/[file name] diff --git a/docs/samples/deployments/spark/notebooks/00_setup.py b/docs/samples/deployments/spark/notebooks/00_setup.py index ef92ed1f2..54c34d383 100644 --- a/docs/samples/deployments/spark/notebooks/00_setup.py +++ b/docs/samples/deployments/spark/notebooks/00_setup.py @@ -4,7 +4,7 @@ # MAGIC # MAGIC
Mount an Azure Storage blob container to a databricks cluster. # MAGIC -# MAGIC
This sciprt requires the following environment variables to be set. +# MAGIC
This script requires the following environment variables to be set. # MAGIC # MAGIC
    # MAGIC
  1. STORAGE_MOUNT_NAME - Name of mount which will be used by notebooks accessing the mount point.
  2. @@ -12,7 +12,7 @@ # MAGIC
  3. STORAGE_CONTAINER_NAME - Blob container name
  4. # MAGIC
# MAGIC -# MAGIC
Additionaly, the following secrets are used. +# MAGIC
Additionally, the following secrets are used. # MAGIC # MAGIC
    # MAGIC
  1. storage_account_access_key under scope storage_scope - storage account key.
  2. 
diff --git a/docs/samples/index.md b/docs/samples/index.md
index bb5563f70..d1de81152 100644
--- a/docs/samples/index.md
+++ b/docs/samples/index.md
@@ -21,10 +21,12 @@
 | Usage | Text | Python | [Using Flair as an external PII model](https://github.com/microsoft/presidio/blob/main/docs/samples/python/flair_recognizer.py)|
 | Usage | Text | Python file | [Using Span Marker as an external PII model](https://github.com/microsoft/presidio/blob/main/docs/samples/python/span_marker_recognizer.py)|
 | Usage | Text | Python file | [Using Transformers as an external PII model](python/transformers_recognizer/index.md)|
-| Usage | Text | Python file | [Pseudonomization (replace PII values using mappings)](python/pseudonomyzation.ipynb)|
+| Usage | Text | Python file | [Pseudonymization (replace PII values using mappings)](python/pseudonymization.ipynb)|
 | Usage | Text | Python file | [Passing a lambda as a Presidio anonymizer using Faker](python/example_custom_lambda_anonymizer.py)|
 | Usage | Text | Python file | [Synthetic data generation with OpenAI](python/synth_data_with_openai.ipynb)|
+| Usage | Text | Python file | [Keeping some entities from being anonymized](python/keep_entities.ipynb)|
 | Usage | Text | LiteLLM Proxy | [PII Masking LLM calls across Anthropic/Gemini/Bedrock/Azure, etc.](docker/litellm.md)|
+| Usage | Text | Python Notebook | [YAML based no-code configuration](python/no_code_config.ipynb) |
 | Usage | | REST API (postman) | [Presidio as a REST endpoint](docker/index.md)|
 | Deployment | | App Service | [Presidio with App Service](deployments/app-service/index.md)|
 | Deployment | | Kubernetes | [Presidio with Kubernetes](deployments/k8s/index.md)|
diff --git a/docs/samples/python/ner_model_configuration.ipynb b/docs/samples/python/ner_model_configuration.ipynb
new file mode 100644
index 000000000..ef1092a26
--- /dev/null
+++ b/docs/samples/python/ner_model_configuration.ipynb
@@ -0,0 +1,408 @@
+{
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c5688685-cfb9-41e5-b0bb-27ac87757ada",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# download presidio\n",
+ "!pip install presidio_analyzer presidio_anonymizer"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "cb8e0bdb-3138-44ad-8d87-d9d549c51ce5",
+ "metadata": {},
+ "source": [
+ "###### Path to notebook: [https://www.github.com/microsoft/presidio/blob/main/docs/samples/python/ner_model_configuration.ipynb](https://www.github.com/microsoft/presidio/blob/main/docs/samples/python/ner_model_configuration.ipynb)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4b7961b8-34ab-40fd-8672-2f988f115c17",
+ "metadata": {},
+ "source": [
+ "# Configuring the NER model\n",
+ "\n",
+ "This notebook contains a few examples to customize and configure the NER model through code.\n",
+ "Examples:\n",
+ "1. Changing the default model's parameters\n",
+ "2. Using Stanza as the NER engine\n",
+ "3. Using transformers as the NER engine\n",
+ "4. Supporting multiple languages\n",
+ "\n",
+ "This notebook complements the documentation, which primarily focuses on reading the NER configuration from file."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "bbe0e30e-b040-4c72-9743-eba443529084",
+ "metadata": {},
+ "source": [
+ "### 1. Changing the default model's parameters\n",
+ "\n",
+ "In this example, we'll change the models' default confidence score (spaCy models do not generally output confidence per prediction, so we add a default score). In addition, we'll change the types of PII entities the model returns."
+ ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "130eb964-e141-4ebd-999f-ccce659d2adb", + "metadata": {}, + "outputs": [], + "source": [ + "from presidio_analyzer import AnalyzerEngine\n", + "from presidio_analyzer.nlp_engine import NlpEngine, SpacyNlpEngine, NerModelConfiguration" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "dabc2b20-6e83-48e5-a2c9-1398aec8c456", + "metadata": {}, + "outputs": [], + "source": [ + "# Define which model to use\n", + "model_config = [{\"lang_code\": \"en\", \"model_name\": \"en_core_web_lg\"}]\n", + "\n", + "# Define which entities the model returns and how they map to Presidio's\n", + "entity_mapping = dict(\n", + " PER=\"PERSON\",\n", + " LOC= \"LOCATION\",\n", + " GPE=\"LOCATION\",\n", + " ORG=\"ORGANIZATION\"\n", + ")\n", + "\n", + "ner_model_configuration = NerModelConfiguration(default_score = 0.6, \n", + " model_to_presidio_entity_mapping=entity_mapping)\n", + "\n", + "# Create the NLP Engine based on this configuration\n", + "spacy_nlp_engine = SpacyNlpEngine(models= model_config, ner_model_configuration=ner_model_configuration)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "7e11eee9-4ac4-4533-960c-8c18509fb7e3", + "metadata": {}, + "outputs": [], + "source": [ + "# Helper method to use the NLP Engine as part of Presidio Analyzer, and print configuration+results\n", + "\n", + "def call_analyzer_and_print_results(nlp_engine: NlpEngine,\n", + " language: str = \"en\",\n", + " text: str = \"Bill Clinton used to be the president of the United States\") -> None:\n", + " \"\"\"\n", + " Instantiate the AnalyzerEngine with the provided nlp_engine and return output.\n", + "\n", + " This method creates an AnalyzerEngine instance with the provided NlpEngine, and three supported languages (en, es, de)\n", + " Then, it calls the analyze method to return identified PII.\n", + "\n", + " :param nlp_engine: The NlpEngine instance as configured by the user\n", + " :param language: the language the request should support (in contrast to the AnalyzerEngine which can support multiple)\n", + " :param text: The text to look for PII entities in.\n", + "\n", + " \"\"\"\n", + " \n", + " print(f\"Input text:\\n\\t{text}\\n\")\n", + " \n", + " # Initialize the AnalyzerEngine with the configured Nlp Engine:\n", + " analyzer = AnalyzerEngine(nlp_engine=nlp_engine, \n", + " supported_languages=[\"en\", \"de\", \"es\"])\n", + "\n", + " # Print the NLP Engine's configuration\n", + " print(f\"NLP Engine configuration:\\n\\tLoaded NLP engine: {analyzer.nlp_engine.__class__.__name__}\")\n", + " print(f\"\\tSupported entities: {analyzer.nlp_engine.get_supported_entities()}\")\n", + " print(f\"\\tSupported languages: {analyzer.nlp_engine.get_supported_languages()}\")\n", + " print()\n", + " \n", + " # Call the analyzer.analyze to detect PII entities (from the NLP engine + all other recognizers)\n", + " results = analyzer.analyze(text=text, \n", + " language=language, \n", + " return_decision_process=True)\n", + "\n", + " # sort results\n", + " results = sorted(results, key= lambda x: x.start)\n", + " \n", + " # Print results\n", + " print(\"Returning full results, including the decision process:\")\n", + " for i, result in enumerate(results):\n", + " print(f\"\\tResult {i}: {result}\")\n", + " print(f\"\\tDetected text: {text[result.start: result.end]}\")\n", + " print(f\"\\t{result.analysis_explanation.textual_explanation}\")\n", + " print(\"\")" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": 
"1004a788-8111-4793-b7b5-e7759358ab2d", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Input text:\n", + "\tBill Clinton used to be the president of the United States\n", + "\n", + "NLP Engine configuration:\n", + "\tLoaded NLP engine: SpacyNlpEngine\n", + "\tSupported entities: ['LOCATION', 'PERSON', 'ORGANIZATION']\n", + "\tSupported languages: ['en']\n", + "\n", + "Returning full results, including the decision process:\n", + "\tResult 0: type: PERSON, start: 0, end: 12, score: 0.6\n", + "\tDetected text: Bill Clinton\n", + "\tIdentified as PERSON by Spacy's Named Entity Recognition\n", + "\n", + "\tResult 1: type: LOCATION, start: 41, end: 58, score: 0.6\n", + "\tDetected text: the United States\n", + "\tIdentified as LOCATION by Spacy's Named Entity Recognition\n", + "\n" + ] + } + ], + "source": [ + "# Run it as part of Presidio's AnalyzerEngine\n", + "call_analyzer_and_print_results(spacy_nlp_engine)" + ] + }, + { + "cell_type": "markdown", + "id": "394f8f23-15e7-4767-8298-17170fa2d316", + "metadata": {}, + "source": [ + "## 2. Using Stanza" + ] + }, + { + "cell_type": "markdown", + "id": "6491fa00-4229-4792-9120-ffe601a1cb2f", + "metadata": {}, + "source": [ + "Stanza is an NLP package by Stanford. More details on Stanza can be found here: https://stanfordnlp.github.io/stanza/\n", + "Loading Stanza instead of spaCy is straightforward. Just use `StanzaNlpEngine` instead of `SpacyNlpEngine` and define a model name supported by stanza (for example, `en` instead of `en_core_web_lg`)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9e660475-0043-4a2e-a198-9eca1cb307a5", + "metadata": {}, + "outputs": [], + "source": [ + "from presidio_analyzer.nlp_engine import StanzaNlpEngine, NerModelConfiguration" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1a29cbd9-8298-494f-812b-fb5d3a460ed0", + "metadata": {}, + "outputs": [], + "source": [ + "# Define which model to use\n", + "model_config = [{\"lang_code\": \"en\", \"model_name\": \"en\"}]\n", + "\n", + "# Define which entities the model returns and how they map to Presidio's\n", + "entity_mapping = dict(\n", + " PER=\"PERSON\",\n", + " LOC= \"LOCATION\",\n", + " GPE=\"LOCATION\",\n", + " ORG=\"ORGANIZATION\"\n", + ")\n", + "\n", + "ner_model_configuration = NerModelConfiguration(model_to_presidio_entity_mapping=entity_mapping)\n", + "\n", + "# Create the Stanza NLP Engine based on this configuration\n", + "stanza_nlp_engine = StanzaNlpEngine(models= model_config, ner_model_configuration=ner_model_configuration)\n", + "\n", + "# Run it as part of Presidio's AnalyzerEngine\n", + "call_analyzer_and_print_results(stanza_nlp_engine)" + ] + }, + { + "cell_type": "markdown", + "id": "c118badd-c924-4b3b-ad6a-24cad1c5ffec", + "metadata": {}, + "source": [ + "## 3. Using transformers as the NLP engine" + ] + }, + { + "cell_type": "markdown", + "id": "5c27a1f8-5eb6-4b98-bc67-de7470f5786c", + "metadata": {}, + "source": [ + "A third option is to use a model based on the `transformers` package. Note that in this case, we use both spaCy and transformers. The actual PII entities are detected using a transformers model, but additional text features such as lemmas and others, are extracted from a spaCy pipeline. We use a small spaCy model as it's faster and more memory efficient." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "78e9e9a1-743b-4764-add1-8963b6116460", + "metadata": {}, + "outputs": [], + "source": [ + "from presidio_analyzer.nlp_engine import TransformersNlpEngine, NerModelConfiguration" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bf14cce2-f593-43f2-9d3e-12a99f0cbbbe", + "metadata": {}, + "outputs": [], + "source": [ + "# Define which model to use\n", + "model_config = [{\n", + " \"lang_code\":\"en\",\n", + " \"model_name\":{\n", + " \"spacy\":\"en_core_web_sm\",\n", + " \"transformers\":\"obi/deid_roberta_i2b2\"\n", + " }\n", + "}]\n", + "\n", + "# Map transformers model labels to Presidio's\n", + "model_to_presidio_entity_mapping = dict(\n", + " PER=\"PERSON\",\n", + " PERSON=\"PERSON\",\n", + " LOC= \"LOCATION\",\n", + " LOCATION= \"LOCATION\",\n", + " GPE=\"LOCATION\",\n", + " ORG=\"ORGANIZATION\",\n", + " ORGANIZATION=\"ORGANIZATION\",\n", + " NORP=\"NRP\",\n", + " AGE=\"AGE\",\n", + " ID=\"ID\",\n", + " EMAIL=\"EMAIL\",\n", + " PATIENT=\"PERSON\",\n", + " STAFF=\"PERSON\",\n", + " HOSP=\"ORGANIZATION\",\n", + " PATORG=\"ORGANIZATION\",\n", + " DATE=\"DATE_TIME\",\n", + " TIME=\"DATE_TIME\",\n", + " PHONE=\"PHONE_NUMBER\",\n", + " HCW=\"PERSON\",\n", + " HOSPITAL=\"ORGANIZATION\",\n", + " FACILITY=\"LOCATION\",\n", + ")\n", + "\n", + "ner_model_configuration = NerModelConfiguration(model_to_presidio_entity_mapping=model_to_presidio_entity_mapping, \n", + " aggregation_strategy=\"simple\",\n", + " stride=14)\n", + "\n", + "transformers_nlp_engine = TransformersNlpEngine(models=model_config,\n", + " ner_model_configuration=ner_model_configuration)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5ef5566d-90d9-484e-9e09-7625ba60cffb", + "metadata": {}, + "outputs": [], + "source": [ + "# Run it as part of Presidio's AnalyzerEngine\n", + "call_analyzer_and_print_results(transformers_nlp_engine)" + ] + }, + { + "cell_type": "markdown", + "id": "13dd60ab-a44e-4951-9159-e593d5a17d27", + "metadata": {}, + "source": [ + "## 4. 
Supporting multiple languages\n", + "Presidio allows the user to create a model per language:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2d52af66-5764-464b-9476-2868cf7f6851", + "metadata": {}, + "outputs": [], + "source": [ + "from presidio_analyzer.nlp_engine import TransformersNlpEngine, NerModelConfiguration" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "65e059da-5d10-491f-99da-beaf3e2e58d5", + "metadata": {}, + "outputs": [], + "source": [ + "# Define which model to use\n", + "model_config = [{\n", + " \"lang_code\":\"en\",\n", + " \"model_name\":{\n", + " \"spacy\":\"en_core_web_sm\",\n", + " \"transformers\":\"obi/deid_roberta_i2b2\"\n", + " }\n", + "},\n", + "{\n", + " \"lang_code\":\"es\",\n", + " \"model_name\":{\n", + " \"spacy\":\"es_core_news_sm\",\n", + " \"transformers\":\"PlanTL-GOB-ES/roberta-large-bne-capitel-ner\"\n", + " }\n", + "}]\n", + "\n", + "transformers_nlp_engine = TransformersNlpEngine(models=model_config,\n", + " ner_model_configuration=ner_model_configuration)\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8f2020e8-adab-4ff8-8f16-0d6200dbae71", + "metadata": {}, + "outputs": [], + "source": [ + "# Call in English\n", + "call_analyzer_and_print_results(transformers_nlp_engine, \n", + " language=\"en\", \n", + " text = \"Bill Clinton was the president of the United States\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e9fe8668-ae84-4b68-b048-fe1b346726aa", + "metadata": {}, + "outputs": [], + "source": [ + "# Call in Spanish\n", + "call_analyzer_and_print_results(transformers_nlp_engine, \n", + " language=\"es\", \n", + " text = \"Bill Clinton solía ser el presidente de los Estados Unidos.\")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "presidio_e2e", + "language": "python", + "name": "presidio" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.13" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/docs/samples/python/no_code_config.ipynb b/docs/samples/python/no_code_config.ipynb new file mode 100644 index 000000000..6c76a67c1 --- /dev/null +++ b/docs/samples/python/no_code_config.ipynb @@ -0,0 +1,431 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "d521fcdd-39f6-4c5d-9466-8e6f7fb8ab2e", + "metadata": {}, + "source": [ + "# No code configuration" + ] + }, + { + "cell_type": "markdown", + "id": "a533b0a0-7acc-4164-ad2c-cb64b3d25aca", + "metadata": {}, + "source": [ + "No-code configuration can be helpful in three scenarios:\n", + "\n", + "1. There's an existing set of regular expressions / deny-lists that should be leveraged within Presidio.\n", + "2. As a simple way to configure which recognizers to enable and disable, and how to configure the NLP engine.\n", + "3. For team members interested in changing the configuration without writing code.\n", + "\n", + "In this example, we'll show how to create a no-code configuration in Presidio.\n", + "We start by creating YAML configuration files that are based on the default ones. 
\n",
+ "The default configuration files for Presidio can be found here:\n",
+ "- [Analyzer configuration](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/conf/default_analyzer.yaml)\n",
+ "- [Recognizer registry configuration](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/conf/default_recognizers.yaml)\n",
+ "- [NLP engine configuration](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/conf/default.yaml)\n",
+ "\n",
+ "Alternatively, one can create one configuration file for all three components.\n",
+ "In this example, we'll tweak the configuration to reduce the number of predefined recognizers to only a few, and add a new custom one. We'll also adjust the context words to support the detection of a different language (Spanish).\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "id": "6cd78f11-d3b3-43f9-8cc5-6a08b52e2403",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import yaml\n",
+ "import json\n",
+ "import tempfile\n",
+ "import warnings\n",
+ "from pprint import pprint\n",
+ "from presidio_analyzer import AnalyzerEngineProvider\n",
+ "\n",
+ "warnings.filterwarnings(\"ignore\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4b864e1b-a2a3-4ed8-be43-092f02e56d55",
+ "metadata": {},
+ "source": [
+ "In this example we're going to create the YAML as a string for illustration purposes, but the more common scenario is to create these YAML files and load them into the `AnalyzerEngineProvider`."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9894a09f-9df2-4afa-8c3f-3b2a0f72f270",
+ "metadata": {},
+ "source": [
+ "### General Analyzer parameters\n",
+ "([default file](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/conf/default_analyzer.yaml))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "id": "2f1d8b01-f9d2-4827-a1d1-ba4d6ca70daf",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "analyzer_config_yaml = \"\"\"\n",
+ "supported_languages: \n",
+ " - en\n",
+ " - es\n",
+ "default_score_threshold: 0.4\n",
+ "\"\"\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4f423e92-1eb4-4f85-9427-578aef7dc25f",
+ "metadata": {},
+ "source": [
+ "### Recognizer Registry parameters\n",
+ "([default file](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/conf/default_recognizers.yaml))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "id": "4c3b6a93-79a5-467e-b08d-d59f7ea461e9",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "\n",
+ "recognizer_registry_config_yaml = \"\"\"\n",
+ "recognizer_registry:\n",
+ " supported_languages: \n",
+ " - en\n",
+ " - es\n",
+ " global_regex_flags: 26\n",
+ "\n",
+ " recognizers:\n",
+ " - name: CreditCardRecognizer\n",
+ " supported_languages:\n",
+ " - language: en\n",
+ " context: [credit, card, visa, mastercard, cc, amex, discover, jcb, diners, maestro, instapayment]\n",
+ " - language: es\n",
+ " context: [tarjeta, credito, visa, mastercard, cc, amex, discover, jcb, diners, maestro, instapayment]\n",
+ " type: predefined\n",
+ " \n",
+ " - name: DateRecognizer\n",
+ " supported_languages:\n",
+ " - language: en\n",
+ " context: [date, time, birthday, birthdate, dob]\n",
+ " - language: es\n",
+ " context: [fecha, tiempo, hora, nacimiento, dob]\n",
+ " type: predefined\n",
+ "\n",
+ " - name: EmailRecognizer\n",
+ " supported_languages:\n",
+ " - language: en\n",
+ " context: [email, mail, address]\n",
+ " - language: es\n",
+ " context: [correo, electrónico, email]\n", + " type: predefined\n", + " \n", + " - name: PhoneRecognizer\n", + " type: predefined\n", + " supported_languages:\n", + " - language: en\n", + " context: [phone, number, telephone, fax]\n", + " - language: es\n", + " context: [teléfono, número, fax]\n", + " \n", + " - name: \"Titles recognizer (en)\"\n", + " supported_language: \"en\"\n", + " supported_entity: \"TITLE\"\n", + " deny_list:\n", + " - Mr.\n", + " - Mrs.\n", + " - Ms.\n", + " - Miss\n", + " - Dr.\n", + " - Prof.\n", + " - Doctor\n", + " - Professor\n", + " - name: \"Titles recognizer (es)\"\n", + " supported_language: \"es\"\n", + " supported_entity: \"TITLE\"\n", + " deny_list:\n", + " - Sr.\n", + " - Señor\n", + " - Sra.\n", + " - Señora\n", + " - Srta.\n", + " - Señorita\n", + " - Dr.\n", + " - Doctor\n", + " - Doctora\n", + " - Prof.\n", + " - Profesor\n", + " - Profesora\n", + "\"\"\"" + ] + }, + { + "cell_type": "markdown", + "id": "49d6b8d7-dec7-4d0f-932b-5795e1b665bf", + "metadata": {}, + "source": [ + "### NLP Engine parameters\n", + "([default file](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/conf/default.yaml))" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "daef9a20-e126-483f-9f1f-7d29f0b5f6f7", + "metadata": {}, + "outputs": [], + "source": [ + "nlp_engine_yaml = \"\"\"\n", + "nlp_configuration:\n", + " nlp_engine_name: transformers\n", + " models:\n", + " -\n", + " lang_code: en\n", + " model_name:\n", + " spacy: en_core_web_sm\n", + " transformers: StanfordAIMI/stanford-deidentifier-base\n", + " -\n", + " lang_code: es\n", + " model_name:\n", + " spacy: es_core_news_sm\n", + " transformers: MMG/xlm-roberta-large-ner-spanish \n", + " ner_model_configuration:\n", + " labels_to_ignore:\n", + " - O\n", + " aggregation_strategy: first # \"simple\", \"first\", \"average\", \"max\"\n", + " stride: 16\n", + " alignment_mode: expand # \"strict\", \"contract\", \"expand\"\n", + " model_to_presidio_entity_mapping:\n", + " PER: PERSON\n", + " PERSON: PERSON\n", + " LOC: LOCATION\n", + " LOCATION: LOCATION\n", + " GPE: LOCATION\n", + " ORG: ORGANIZATION\n", + " ORGANIZATION: ORGANIZATION\n", + " NORP: NRP\n", + " AGE: AGE\n", + " ID: ID\n", + " EMAIL: EMAIL\n", + " PATIENT: PERSON\n", + " STAFF: PERSON\n", + " HOSP: ORGANIZATION\n", + " PATORG: ORGANIZATION\n", + " DATE: DATE_TIME\n", + " TIME: DATE_TIME\n", + " PHONE: PHONE_NUMBER\n", + " HCW: PERSON\n", + " HOSPITAL: LOCATION\n", + " FACILITY: LOCATION\n", + " VENDOR: ORGANIZATION\n", + " MISC: ID\n", + " \n", + " low_confidence_score_multiplier: 0.4\n", + " low_score_entity_names:\n", + " - ID\n", + "\"\"\"" + ] + }, + { + "cell_type": "markdown", + "id": "a827c70c-c39c-4399-97a4-0bd41040fb6c", + "metadata": {}, + "source": [ + "Create a unified YAML file and save it as a temp file" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "9acd2c0d-793a-4a33-8c9e-622d01057f79", + "metadata": {}, + "outputs": [], + "source": [ + "full_config = f\"{analyzer_config_yaml}\\n{recognizer_registry_config_yaml}\\n{nlp_engine_yaml}\"\n", + "\n", + "with tempfile.NamedTemporaryFile(mode='w+', delete=False, suffix='.yaml') as temp_file:\n", + " # Write the YAML string to the temp file\n", + " temp_file.write(full_config)\n", + " temp_file_path = temp_file.name\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "f3957de3-88b8-4aa8-844a-e31f84c26711", + "metadata": {}, + "source": [ + "Pass the YAML file to `AnalyzerEngineProvider` to create an 
`AnalyzerEngine` instance" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "b25df9cd-483f-473a-be0a-3c7b0e1b4e4f", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Device set to use cpu\n", + "Device set to use cpu\n" + ] + } + ], + "source": [ + "analyzer_engine = AnalyzerEngineProvider(analyzer_engine_conf_file=temp_file_path).create_engine()\n" + ] + }, + { + "cell_type": "markdown", + "id": "06fbe6f4-cefe-4875-9ddf-db8b50056392", + "metadata": {}, + "source": [ + "Print the loaded configuration for both languages" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "ee8cb346-5dc1-4b08-9ee4-6a71308c10d1", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "'Supported entities for en:'\n", + "\n", + "\n", + "['ORGANIZATION', 'EMAIL', 'EMAIL_ADDRESS', 'CREDIT_CARD', 'AGE', 'LOCATION',\n", + " 'PERSON', 'NRP', 'PHONE_NUMBER', 'TITLE', 'DATE_TIME', 'ID']\n", + "\n", + "Loaded recognizers for en:\n", + "['CreditCardRecognizer', 'DateRecognizer', 'EmailRecognizer', 'PhoneRecognizer',\n", + " 'Titles recognizer (en)', 'TransformersRecognizer']\n", + "\n", + "\n", + "'Supported entities for es:'\n", + "\n", + "\n", + "['ORGANIZATION', 'EMAIL', 'EMAIL_ADDRESS', 'CREDIT_CARD', 'AGE', 'LOCATION',\n", + " 'PERSON', 'NRP', 'PHONE_NUMBER', 'TITLE', 'DATE_TIME', 'ID']\n", + "\n", + "Loaded recognizers for es:\n", + "['CreditCardRecognizer', 'DateRecognizer', 'EmailRecognizer', 'PhoneRecognizer',\n", + " 'Titles recognizer (es)', 'TransformersRecognizer']\n", + "\n", + "\n", + "\n", + "Loaded NER models:\n", + "[{'lang_code': 'en',\n", + " 'model_name': {'spacy': 'en_core_web_sm',\n", + " 'transformers': 'StanfordAIMI/stanford-deidentifier-base'}},\n", + " {'lang_code': 'es',\n", + " 'model_name': {'spacy': 'es_core_news_sm',\n", + " 'transformers': 'MMG/xlm-roberta-large-ner-spanish'}}]\n" + ] + } + ], + "source": [ + "for lang in (\"en\", \"es\"):\n", + " pprint(f\"Supported entities for {lang}:\")\n", + " print(\"\\n\")\n", + " pprint(analyzer_engine.get_supported_entities(lang), compact=True)\n", + " \n", + " print(f\"\\nLoaded recognizers for {lang}:\")\n", + " pprint([rec.name for rec in analyzer_engine.registry.get_recognizers(lang, all_fields=True)], compact=True)\n", + " print(\"\\n\")\n", + " \n", + "print(f\"\\nLoaded NER models:\")\n", + "pprint(analyzer_engine.nlp_engine.models)" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "c6a50606-a68a-4984-995b-36c9ccdccb81", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[type: CREDIT_CARD, start: 98, end: 114, score: 1.0,\n", + " type: PERSON, start: 15, end: 28, score: 0.9991055727005005,\n", + " type: LOCATION, start: 52, end: 62, score: 0.9933345317840576]" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "es_text = \"Hola, me llamo David Johnson y soy originalmente de Liverpool. 
Mi número de tarjeta de crédito es 4095260993934932\"\n", + "analyzer_engine.analyze(es_text, language=\"es\")" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "ea24b036-c65f-4376-b545-76089e7dddef", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[type: CREDIT_CARD, start: 89, end: 105, score: 1.0,\n", + " type: LOCATION, start: 53, end: 62, score: 0.9989457726478577,\n", + " type: PERSON, start: 15, end: 28, score: 0.9727346897125244]" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "en_text = \"Hi, my name is David Johnson and I'm originally from Liverpool. My credit card number is 4095260993934932\"\n", + "analyzer_engine.analyze(en_text, language=\"en\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "681d2542-0f22-4841-9217-a64f72a12c84", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "presidio_e2e", + "language": "python", + "name": "presidio" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.13" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/docs/samples/python/pseudonomyzation.ipynb b/docs/samples/python/pseudonymization.ipynb similarity index 100% rename from docs/samples/python/pseudonomyzation.ipynb rename to docs/samples/python/pseudonymization.ipynb diff --git a/docs/structured/index.md b/docs/structured/index.md index 34b5d4ac6..41a56638e 100644 --- a/docs/structured/index.md +++ b/docs/structured/index.md @@ -147,5 +147,7 @@ Contributions are welcome! Please refer to the [Contributing Guide](https://gith #### More information +- [API documentation](../api/structured_python.md) +- [Sample code](../samples/python/example_structured.ipynb) - [Join the discussion](https://github.com/microsoft/presidio/discussions?discussions_q=structured) - [Relevant issues on Github](https://github.com/microsoft/presidio/issues?q=is%3Aissue+label%3Astructured-data) diff --git a/docs/tutorial/08_no_code.md b/docs/tutorial/08_no_code.md index 0defa0604..27c584536 100644 --- a/docs/tutorial/08_no_code.md +++ b/docs/tutorial/08_no_code.md @@ -1,46 +1,226 @@ -# Example 8: Creating no-code pattern recognizers +# No code configuration -No-code pattern recognizers can be helpful in two scenarios: +No-code configuration can be helpful in three scenarios: -1. There's an existing set of regular expressions / deny-lists which needs to be added to Presidio. -2. Non-technical team members who require adding logic without writing code. +1. There's an existing set of regular expressions / deny-lists that should be leveraged within Presidio. +2. As a simple way to configure which recognizers to enable and disable, and how to configure the NLP engine. +3. For team members interested in changing the configuration without writing code. -Regular expression or deny-list based recognizers can be written in a YAML file, and added to the list of recognizers in Presidio. +In this example, we'll show how to create a no-code configuration in Presidio. +We start by creating YAML configuration files that are based on the default ones. 
+
+The default configuration files for Presidio can be found here:
 
-An example YAML file can be found [here](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/conf/example_recognizers.yaml).
+- [Analyzer configuration](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/conf/default_analyzer.yaml)
+- [Recognizer registry configuration](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/conf/default_recognizers.yaml)
+- [NLP engine configuration](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/conf/default.yaml)
 
-For more information on the schema, see the `PatternRecognizer` definition on the [API Docs](https://microsoft.github.io/presidio/api-docs/api-docs.html#tag/Analyzer)).
+Alternatively, one can create one configuration file for all three components.
+In this example, we'll tweak the configuration to reduce the number of predefined recognizers to only a few, and add a new custom one. We'll also adjust the context words to support the detection of a different language (Spanish).
 
-Once the YAML file is created, it can be loaded into the `RecognizerRegistry` instance.
+```python
+import yaml
+import json
+import tempfile
+from pprint import pprint
+from presidio_analyzer import AnalyzerEngineProvider
+```
+
+In this example we're going to create the YAML as a string for illustration purposes, but the more common scenario is to create these YAML files and load them into the `AnalyzerEngineProvider`.
 
-This example creates a `RecognizerRegistry` holding only the recognizers in the YAML file:
+## Defining the configuration in YAML format
 
-
-``` python
-from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
+### General Analyzer parameters
 
-yaml_file = "recognizers.yaml"
-registry = RecognizerRegistry()
-registry.add_recognizers_from_yaml(yaml_file)
+([default file](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/conf/default_analyzer.yaml))
 
-analyzer = AnalyzerEngine(registry=registry)
-analyzer.analyze(text="Mr. and Mrs. 
Smith", language="en") +```python +analyzer_config_yaml = """ +supported_languages: + - en + - es +default_score_threshold: 0.4 +""" ``` -This example adds the new recognizers to the predefined recognizers in Presidio: +### Recognizer Registry parameters + +([default file](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/conf/default_recognizers.yaml)) + +```python + +recognizer_registry_config_yaml = """ +recognizer_registry: + supported_languages: + - en + - es + global_regex_flags: 26 + + recognizers: + - name: CreditCardRecognizer + supported_languages: + - language: en + context: [credit, card, visa, mastercard, cc, amex, discover, jcb, diners, maestro, instapayment] + - language: es + context: [tarjeta, credito, visa, mastercard, cc, amex, discover, jcb, diners, maestro, instapayment] + type: predefined + + - name: DateRecognizer + supported_languages: + - language: en + context: [date, time, birthday, birthdate, dob] + - language: es + context: [fecha, tiempo, hora, nacimiento, dob] + type: predefined + + - name: EmailRecognizer + supported_languages: + - language: en + context: [email, mail, address] + - language: es + context: [correo, electrónico, email] + type: predefined + + - name: PhoneRecognizer + type: predefined + supported_languages: + - language: en + context: [phone, number, telephone, fax] + - language: es + context: [teléfono, número, fax] + + - name: "Titles recognizer (en)" + supported_language: "en" + supported_entity: "TITLE" + deny_list: + - Mr. + - Mrs. + - Ms. + - Miss + - Dr. + - Prof. + - Doctor + - Professor + - name: "Titles recognizer (es)" + supported_language: "es" + supported_entity: "TITLE" + deny_list: + - Sr. + - Señor + - Sra. + - Señora + - Srta. + - Señorita + - Dr. + - Doctor + - Doctora + - Prof. + - Profesor + - Profesora +""" +``` + +### NLP Engine parameters + +([default file](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/conf/default.yaml)) + +```python +nlp_engine_yaml = """ +nlp_configuration: + nlp_engine_name: transformers + models: + - + lang_code: en + model_name: + spacy: en_core_web_sm + transformers: StanfordAIMI/stanford-deidentifier-base + - + lang_code: es + model_name: + spacy: es_core_news_sm + transformers: MMG/xlm-roberta-large-ner-spanish + ner_model_configuration: + labels_to_ignore: + - O + aggregation_strategy: first # "simple", "first", "average", "max" + stride: 16 + alignment_mode: expand # "strict", "contract", "expand" + model_to_presidio_entity_mapping: + PER: PERSON + PERSON: PERSON + LOC: LOCATION + LOCATION: LOCATION + GPE: LOCATION + ORG: ORGANIZATION + ORGANIZATION: ORGANIZATION + NORP: NRP + AGE: AGE + ID: ID + EMAIL: EMAIL + PATIENT: PERSON + STAFF: PERSON + HOSP: ORGANIZATION + PATORG: ORGANIZATION + DATE: DATE_TIME + TIME: DATE_TIME + PHONE: PHONE_NUMBER + HCW: PERSON + HOSPITAL: LOCATION + FACILITY: LOCATION + VENDOR: ORGANIZATION + MISC: ID + + low_confidence_score_multiplier: 0.4 + low_score_entity_names: + - ID +""" +``` - -``` python -from presidio_analyzer import AnalyzerEngine, RecognizerRegistry +## Creating the analyzer engine and running it -yaml_file = "recognizers.yaml" # path to YAML file -registry = RecognizerRegistry() -registry.load_predefined_recognizers() # Loads all the predefined recognizers (Credit card, phone number etc.) 
+### Create a unified YAML file and save it as a temp file
+
+```python
+full_config = f"{analyzer_config_yaml}\n{recognizer_registry_config_yaml}\n{nlp_engine_yaml}"
+
+with tempfile.NamedTemporaryFile(mode='w+', delete=False, suffix='.yaml') as temp_file:
+    # Write the YAML string to the temp file
+    temp_file.write(full_config)
+    temp_file_path = temp_file.name
 
-registry.add_recognizers_from_yaml(yaml_file)
-analyzer = AnalyzerEngine(registry=registry)
-analyzer.analyze(text="Mr. Plum wrote a book", language="en")
 ```
 
-Finally, for initializing and customizing recognizer registry from file see the following [section](../analyzer/recognizer_registry_provider.md).
+### Pass the YAML file to `AnalyzerEngineProvider` to create an `AnalyzerEngine` instance
+
+```python
+analyzer_engine = AnalyzerEngineProvider(analyzer_engine_conf_file=temp_file_path).create_engine()
+```
+
+### Print the loaded configuration for both languages
+
+```python
+for lang in ("en", "es"):
+    pprint(f"Supported entities for {lang}:")
+    print("\n")
+    pprint(analyzer_engine.get_supported_entities(lang), compact=True)
+
+    print(f"\nLoaded recognizers for {lang}:")
+    pprint([rec.name for rec in analyzer_engine.registry.get_recognizers(lang, all_fields=True)], compact=True)
+    print("\n")
+
+print("\nLoaded NER models:")
+pprint(analyzer_engine.nlp_engine.models)
+```
+
+## Run two requests, one in English and one in Spanish
+
+```python
+es_text = "Hola, me llamo David Johnson y soy originalmente de Liverpool. Mi número de tarjeta de crédito es 4095260993934932"
+analyzer_engine.analyze(es_text, language="es")
+```
+
+```python
+en_text = "Hi, my name is David Johnson and I'm originally from Liverpool. My credit card number is 4095260993934932"
+analyzer_engine.analyze(en_text, language="en")
+```
diff --git a/mkdocs.yml b/mkdocs.yml
index f5969b087..7559368bc 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -1,84 +1,142 @@
 site_name: Microsoft Presidio
 site_url: https://microsoft.github.io/presidio
-site_description: PII anonymization for text and images.
+site_description: PII anonymization for text, images, and structured data. 
site_author: Microsoft repo_url: https://github.com/microsoft/presidio/ edit_uri: "" nav: - - Home: index.md - - Installation: installation.md - - Quickstart: getting_started.md - - Handling text: - - Home: text_anonymization.md - - Step by step tutorial: - - Home: tutorial/index.md - - Getting started: tutorial/00_getting_started.md - - Deny-list recognizers: tutorial/01_deny_list.md - - Regex recognizers: tutorial/02_regex.md - - Rule-based recognizers: tutorial/03_rule_based.md - - Additional models/languages: tutorial/05_languages.md - - External services: tutorial/04_external_services.md - - Context enhancement: tutorial/06_context.md - - Decision process: tutorial/07_decision_process.md - - No-code recognizers: tutorial/08_no_code.md - - Ad-hoc recognizers: tutorial/09_ad_hoc.md - - Simple anonymization: tutorial/10_simple_anonymization.md - - Custom anonymization: tutorial/11_custom_anonymization.md - - Encryption/Decryption: tutorial/12_encryption.md - - Allow-lists: tutorial/13_allow_list.md - - - Presidio Analyzer: - - Home: analyzer/index.md - - Developing PII recognizers: - - Tutorial: analyzer/adding_recognizers.md - - Best practices in developing recognizers: analyzer/developing_recognizers.md - - Customizing recognizer registry from file: analyzer/recognizer_registry_provider.md - - Multi-language support: analyzer/languages.md - - Customizing the NLP model: - - Home: analyzer/customizing_nlp_models.md - - Spacy/Stanza: analyzer/nlp_engines/spacy_stanza.md - - Transformers: analyzer/nlp_engines/transformers.md - - Tracing the decision process: analyzer/decision_process.md - - Configuring the Analyzer Engine from file: analyzer/analyzer_engine_provider.md - - Presidio Anonymizer: - - Home: anonymizer/index.md - - Developing PII operators: anonymizer/adding_operators.md - - Handling images: - - Home: image-redactor/index.md - - Evaluating DICOM redaction: image-redactor/evaluating_dicom_redaction.md - - Handling structured data: - - Home: structured/index.md - - Samples: - - Home: samples/index.md - - Selected samples: - - Customizing Presidio Analyzer: samples/python/customizing_presidio_analyzer.ipynb - - NER model configuration: samples/python/ner_model_configuration.ipynb - - Presidio Structured Basic Usage: samples/python/example_structured.ipynb - - Using an allow list with image redaction: samples/python/image_redaction_allow_list_approach.ipynb - - Redacting Text PII from DICOM images: samples/python/example_dicom_image_redactor.ipynb - - Annotating PII in a PDF: samples/python/example_pdf_annotation.ipynb - - Pseudonomization: samples/python/pseudonomyzation.ipynb - - Encrypting and Decrypting: samples/python/encrypt_decrypt.ipynb - - - General: - - Supported entities: supported_entities.md - - Development and design: - - Design: design.md + - Presidio: + - Home: index.md + - Installation: installation.md + - FAQ: faq.md + - Quick start: + - Home: getting_started.md + - Text: getting_started/getting_started_text.md + - Images: getting_started/getting_started_images.md + - Semi/Structured data: getting_started/getting_started_structured.md + - Learn Presidio: + - Home: learn_presidio/index.md + - Concepts: learn_presidio/concepts.md + - Tutorial: + - Home: tutorial/index.md + - Getting started: tutorial/00_getting_started.md + - Deny-list recognizers: tutorial/01_deny_list.md + - Regex recognizers: tutorial/02_regex.md + - Rule-based recognizers: tutorial/03_rule_based.md + - Additional models/languages: tutorial/05_languages.md + - External services: 
tutorial/04_external_services.md + - Context enhancement: tutorial/06_context.md + - Decision process: tutorial/07_decision_process.md + - No-code recognizers: tutorial/08_no_code.md + - Ad-hoc recognizers: tutorial/09_ad_hoc.md + - Simple anonymization: tutorial/10_simple_anonymization.md + - Custom anonymization: tutorial/11_custom_anonymization.md + - Encryption/Decryption: tutorial/12_encryption.md + - Allow-lists: tutorial/13_allow_list.md + - Text de-identification: + - Home: text_anonymization.md + - Presidio Analyzer: + - Home: analyzer/index.md + - Developing PII recognizers: + - Tutorial: analyzer/adding_recognizers.md + - Best practices: analyzer/developing_recognizers.md + - Recognizer registry from file: analyzer/recognizer_registry_provider.md + - Multi-language support: analyzer/languages.md + - Customizing the NLP model: + - Home: analyzer/customizing_nlp_models.md + - Spacy/Stanza: analyzer/nlp_engines/spacy_stanza.md + - Transformers: analyzer/nlp_engines/transformers.md + - Tracing the decision process: analyzer/decision_process.md + - Configure from file: analyzer/analyzer_engine_provider.md + - Presidio Anonymizer: + - Home: anonymizer/index.md + - Developing PII anonymization operators: anonymizer/adding_operators.md + - Image de-identification: + - Home: image-redactor/index.md + - Evaluating DICOM redaction: image-redactor/evaluating_dicom_redaction.md + - Structured and Semi-structured: + - Home: structured/index.md + - PII detection evaluation: evaluation/index.md + - Resources: + - Supported entities: supported_entities.md + - Community: community.md + - Change log: https://github.com/microsoft/presidio/blob/main/CHANGELOG.md - Setting up a development environment: development.md - Build and release process: build_release.md - Changes from V1 to V2: presidio_V2.md - Python API reference: - - Home: api.md - - Presidio Analyzer Python API: api/analyzer_python.md - - Presidio Anonymizer Python API: api/anonymizer_python.md - - Presidio Image Redactor Python API: api/image_redactor_python.md + - Home: api.md + - Presidio Analyzer Python API: api/analyzer_python.md + - Presidio Anonymizer Python API: api/anonymizer_python.md + - Presidio Image Redactor Python API: api/image_redactor_python.md + - Presidio Structured Python API: api/structured_python.md - REST API reference: https://microsoft.github.io/presidio/api-docs/api-docs.html" target="_blank + - Samples: + - Usage: + - Home: samples/index.md + - Text: + - Presidio Basic Usage Notebook: samples/python/presidio_notebook.ipynb + - Customizing Presidio Analyzer: samples/python/customizing_presidio_analyzer.ipynb + - Configuring The NLP engine: samples/python/ner_model_configuration.ipynb + - Encrypting and Decrypting identified entities: samples/python/encrypt_decrypt.ipynb + - Getting the identified entity value using a custom Operator: samples/python/getting_entity_values.ipynb + - Anonymizing known values: samples/python/Anonymizing known values.ipynb + - Keeping some entities from being anonymized: samples/python/keep_entities.ipynb + - Integrating with external services: samples/python/integrating_with_external_services.ipynb + - Remote Recognizer: samples/python/example_remote_recognizer.py + - Azure AI Language as a Remote Recognizer: samples/python/text_analytics/index.md + - Using Flair as an external PII model: samples/python/flair_recognizer.py + - Using Span Marker as an external PII model: samples/python/span_marker_recognizer.py + - Using Transformers as an external PII model: 
samples/python/transformers_recognizer/index.md + - Pseudonymization (replace PII values using mappings): samples/python/pseudonymization.ipynb + - Passing a lambda as a Presidio anonymizer using Faker: samples/python/example_custom_lambda_anonymizer.py + - Synthetic data generation with OpenAI: samples/python/synth_data_with_openai.ipynb + - YAML based no-code configuration: samples/python/no_code_config.ipynb + - Data: + - Analyzing structured / semi-structured data in batch: samples/python/batch_processing.ipynb + - Presidio Structured Basic Usage Notebook: samples/python/example_structured.ipynb + - Analyze and Anonymize CSV file: samples/python/process_csv_file.py + - Images: + - Redacting Text PII from DICOM images: samples/python/example_dicom_image_redactor.ipynb + - Using an allow list with image redaction: samples/python/image_redaction_allow_list_approach.ipynb + - Plot custom bounding boxes: samples/python/plot_custom_bboxes.ipynb + - Example DICOM redaction evaluation: samples/python/example_dicom_redactor_evaluation.ipynb + - PDF: + - Annotating PII in a PDF: samples/python/example_pdf_annotation.ipynb + - Deployment: + - Presidio with App Service: samples/deployments/app-service/index.md + - Presidio with Kubernetes: samples/deployments/k8s/index.md + - Presidio with Spark: samples/deployments/spark/index.md + - Azure Data Factory: + - ETL using AppService/Databricks: samples/deployments/data-factory/presidio-data-factory.md + - Add Presidio as an HTTP service to your Azure Data Factory: samples/deployments/data-factory/presidio-data-factory-template-gallery-http.md + - Add Presidio on Databricks to your Azure Data Factory: samples/deployments/data-factory/presidio-data-factory-template-gallery-databricks.md + - PII Masking LLM calls using LiteLLM proxy: samples/docker/litellm.md + - Demo: + - Create a simple demo app using Streamlit: samples/python/streamlit/index.md +not_in_nav : | + design.md + samples/deployments/index.md + samples/deployments/data-factory/index.md + samples/deployments/spark/notebooks/00_setup.py + samples/deployments/spark/notebooks/01_transform_presidio.py + samples/docker/index.md + samples/python/custom_presidio.py + samples/python/simple_anonymization_example.py + samples/python/streamlit/azure_ai_language_wrapper.py + samples/python/streamlit/flair_recognizer.py + samples/python/streamlit/openai_fake_data_generator.py + samples/python/streamlit/presidio_helpers.py + samples/python/streamlit/presidio_nlp_engine_config.py + samples/python/streamlit/presidio_streamlit.py + samples/python/streamlit/test_streamlit.py + samples/python/text_analytics/__init__.py + samples/python/transformers_recognizer/__init__.py + samples/python/transformers_recognizer/configuration.py + samples/python/transformers_recognizer/transformer_recognizer.py - - Community: community.md - - FAQ: faq.md - - Demo: https://huggingface.co/spaces/presidio/presidio_demo" target="_blank theme: name: material custom_dir: overrides @@ -96,18 +154,37 @@ theme: features: - navigation.instant - content.tabs.link - # - navigation.sections - # - navigation.tabs - # - navigation.tabs.sticky + - navigation.tabs + - navigation.tabs.sticky plugins: -- search -- mkdocstrings: - handlers: - python: - options: - docstring_style: sphinx -- mkdocs-jupyter: - ignore_h1_titles: True + - search + - mkdocstrings: + handlers: + python: + options: + docstring_style: sphinx + docstring_section_style: spacy + show_root_heading: true + show_submodules: true + show_bases: true + merge_init_into_class: false + 
group_by_category: false
+            inherited_members: true
+            members_order: source
+            show_signature: true
+            line_length: 80
+            separate_signature: true
+            show_signature_annotations: true
+            show_docstring_examples: true
+            summary:
+              attributes: false
+              functions: true
+              modules: false
+            filters:
+              - "!^_"
+              - "^__"
+  - mkdocs-jupyter:
+      ignore_h1_titles: True
 
 extra:
   social:

From 2711233e11be1b42d4cef74feb1c0ead66d49134 Mon Sep 17 00:00:00 2001
From: Omri Mendels
Date: Thu, 26 Dec 2024 18:33:50 +0200
Subject: [PATCH 2/5] docstring updates

---
 CONTRIBUTING.md                               | 21 ++++++++--------
 .../presidio_analyzer/analyzer_engine.py      | 20 ++++++++-------
 .../batch_analyzer_engine.py                  |  2 +-
 .../recognizer_registry.py                    |  4 ++-
 .../dicom_image_redactor_engine.py            | 25 ++++++++++---------
 .../document_intelligence_ocr.py              |  8 +++---
 .../image_analyzer_engine.py                  |  1 -
 .../image_processing_engine.py                |  5 ++--
 8 files changed, 45 insertions(+), 41 deletions(-)

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 37849776c..e6a6b398b 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -13,11 +13,12 @@ Presidio is both a framework and a system. It's a framework in a sense that you
 When contributing to presidio, it's important to keep this in mind, as some "framework" contributions might not be suitable for a deployment, or vice-versa.
 
 ### PR guidelines
+
 Commit message should be clear, explaining the committed changes.
 
 Update CHANGELOG.md:
 
-Under Unreleased section, use the category which is most suitable for your change (changed/removed/deprecated).
+Under Unreleased section, use the category which is most suitable for your change (changed/removed/deprecated).
 Document the change with simple readable text and push it as part of the commit.
 
 Next release, the change will be documented under the new version.
@@ -30,12 +31,6 @@ For more details follow the [Build and Release documentation](docs/build_release
 
 To get started, refer to the documentation for [setting up a development environment](docs/development.md).
 
-### How can I contribute?
-
-- [Testing](#how-to-test)
-- [Adding new recognizers for new PII types](#adding-new-recognizers-for-new-pii-types)
-- [Fixing Bugs and improving the code](#fixing-bugs-and-improving-the-code)
-
 ### How to test?
 
 For Python, Presidio leverages `pytest` and `ruff`. See [this tutorial](docs/development.md#testing) on more information on testing presidio modules.
@@ -50,14 +45,20 @@ Best practices for developing recognizers [are described here](docs/analyzer/dev
 
 Please review the open [issues on Github](https://github.com/microsoft/presidio/issues) for known bugs and feature requests. We sometimes add 'good first issue' labels on those we believe are simpler, and 'advanced' labels on those which require more work or multiple changes across the solution.
 
+### Adding samples
+
+We would love to see more samples demonstrating how to use Presidio in different scenarios. If you have a sample that you think would be useful for others, please consider contributing it. You can find the samples in the [samples folder](docs/samples/).
+
+When contributing a sample, make sure it is self-contained (e.g. external dependencies are documented), add it [to the index](docs/samples/index.md), and to the [mkdocs.yml](mkdocs.yml) file.
+
 ## Contacting Us
 
-For any questions, please email presidio@microsoft.com.
+For any questions, please email <presidio@microsoft.com>.
 
 ## Contribution guidelines
 
-This project welcomes contributions and suggestions. 
Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.
+This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit <https://cla.microsoft.com>.
 
 When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
 
-This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
+This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact <opencode@microsoft.com> with any additional questions or comments.

diff --git a/presidio-analyzer/presidio_analyzer/analyzer_engine.py b/presidio-analyzer/presidio_analyzer/analyzer_engine.py
index e7e0b7c52..5f2e0c429 100644
--- a/presidio-analyzer/presidio_analyzer/analyzer_engine.py
+++ b/presidio-analyzer/presidio_analyzer/analyzer_engine.py
@@ -187,16 +187,18 @@ def analyze(
 
         :example:
 
-        >>> from presidio_analyzer import AnalyzerEngine
+        .. code-block:: python
+
+            from presidio_analyzer import AnalyzerEngine
+
+            # Set up the engine, loads the NLP module (spaCy model by default)
+            # and other PII recognizers
+            analyzer = AnalyzerEngine()
+
+            # Call analyzer to get results
+            results = analyzer.analyze(text='My phone number is 212-555-5555', entities=['PHONE_NUMBER'], language='en') # noqa D501
+            print(results)
 
-        >>> # Set up the engine, loads the NLP module (spaCy model by default)
-        >>> # and other PII recognizers
-        >>> analyzer = AnalyzerEngine()
-
-        >>> # Call analyzer to get results
-        >>> results = analyzer.analyze(text='My phone number is 212-555-5555', entities=['PHONE_NUMBER'], language='en') # noqa D501
-        >>> print(results)
-        [type: PHONE_NUMBER, start: 19, end: 31, score: 0.85]
 
         """ # noqa: E501
 
         all_fields = not entities

diff --git a/presidio-analyzer/presidio_analyzer/batch_analyzer_engine.py b/presidio-analyzer/presidio_analyzer/batch_analyzer_engine.py
index cb5b7cca4..aba6b9096 100644
--- a/presidio-analyzer/presidio_analyzer/batch_analyzer_engine.py
+++ b/presidio-analyzer/presidio_analyzer/batch_analyzer_engine.py
@@ -14,7 +14,7 @@ class BatchAnalyzerEngine:
     Wrapper class to run Presidio Analyzer Engine on multiple values,
     either lists/iterators of strings, or dictionaries.
 
-    :param: analyzer_engine: AnalyzerEngine instance to use
+    :param analyzer_engine: AnalyzerEngine instance to use
     for handling the values in those collections.
     """
 
diff --git a/presidio-analyzer/presidio_analyzer/recognizer_registry/recognizer_registry.py b/presidio-analyzer/presidio_analyzer/recognizer_registry/recognizer_registry.py
index 991e5fed5..f8a697969 100644
--- a/presidio-analyzer/presidio_analyzer/recognizer_registry/recognizer_registry.py
+++ b/presidio-analyzer/presidio_analyzer/recognizer_registry/recognizer_registry.py
@@ -53,7 +53,9 @@ def __init__(
         )
 
     def _create_nlp_recognizer(
-        self, nlp_engine: NlpEngine = None, supported_language: str = None
+        self,
+        nlp_engine: Optional[NlpEngine] = None,
+        supported_language: Optional[str] = None
     ) -> SpacyRecognizer:
         nlp_recognizer = self._get_nlp_recognizer(nlp_engine)
 
diff --git a/presidio-image-redactor/presidio_image_redactor/dicom_image_redactor_engine.py b/presidio-image-redactor/presidio_image_redactor/dicom_image_redactor_engine.py
index 09347d3f4..5ebcd053b 100644
--- a/presidio-image-redactor/presidio_image_redactor/dicom_image_redactor_engine.py
+++ b/presidio-image-redactor/presidio_image_redactor/dicom_image_redactor_engine.py
@@ -23,10 +23,7 @@
 
 
 class DicomImageRedactorEngine(ImageRedactorEngine):
-    """Performs OCR + PII detection + bounding box redaction.
-
-    :param image_analyzer_engine: Engine which performs OCR + PII detection.
-    """
+    """Performs OCR + PII detection + bounding box redaction."""
 
     def redact_and_return_bbox(
         self,
@@ -160,12 +157,11 @@ def redact_from_file(
     ) -> None:
         """Redact method to redact from a given file.
 
-        Please notice, this method duplicates the file, creates
-        new instance and manipulate them.
-
         :param input_dicom_path: String path to DICOM image.
         :param output_dir: String path to parent output directory.
         :param padding_width : Padding width to use when running OCR.
+        :param crop_ratio: Portion of image to consider when selecting
+        most common pixel value as the background color value.
         :param fill: Color setting to use for redaction box
         ("contrast" or "background").
         :param use_metadata: Whether to redact text in the image that
@@ -177,6 +173,10 @@
         for ad-hoc recognizer.
         :param text_analyzer_kwargs: Additional values for the analyze method
         in AnalyzerEngine.
+
+        Please note that this method duplicates the file, creates a
+        new instance, and manipulates it.
+
         """
         # Verify the given paths
         if Path(input_dicom_path).is_dir() is True:
@@ -226,9 +226,6 @@ def redact_from_directory(
     ) -> None:
         """Redact method to redact from a directory of files.
 
-        Please notice, this method duplicates the files, creates
-        new instances and manipulate them.
-
         :param input_dicom_path: String path to directory of DICOM images.
         :param output_dir: String path to parent output directory.
         :param padding_width : Padding width to use when running OCR.
         :param crop_ratio: Portion of image to consider when selecting
         most common pixel value as the background color value.
         :param fill: Color setting to use for redaction box
         ("contrast" or "background").
         :param use_metadata: Whether to redact text in the image that
         are present in the metadata.
-        :param save_bboxes: True if we want to save boundings boxes.
+        :param save_bboxes: True if we want to save bounding boxes.
         :param ocr_kwargs: Additional params for OCR methods.
         :param ad_hoc_recognizers: List of PatternRecognizer objects to use
         for ad-hoc recognizer.
         :param text_analyzer_kwargs: Additional values for the analyze method
         in AnalyzerEngine.
+
+        Please note that this method duplicates the files, creates
+        new instances, and manipulates them. 
+ """ # Verify the given paths if Path(input_dicom_path).is_dir() is False: @@ -641,7 +642,7 @@ def _get_text_metadata( def augment_word(word: str, case_sensitive: bool = False) -> list: """Apply multiple types of casing to the provided string. - :param words: String containing the word or term of interest. + :param word: String containing the word or term of interest. :param case_sensitive: True if we want to preserve casing. :return: List of the same string with different casings and spacing. diff --git a/presidio-image-redactor/presidio_image_redactor/document_intelligence_ocr.py b/presidio-image-redactor/presidio_image_redactor/document_intelligence_ocr.py index b348ed112..7692a276b 100755 --- a/presidio-image-redactor/presidio_image_redactor/document_intelligence_ocr.py +++ b/presidio-image-redactor/presidio_image_redactor/document_intelligence_ocr.py @@ -71,7 +71,7 @@ def _polygon_to_bbox(polygon: Sequence[Point]) -> tuple: :param polygon: A sequence of points - :return a tuple of left/top/width/height in pixel dimensions + :return: a tuple of left/top/width/height in pixel dimensions """ # We need at least two points for a valid bounding box. @@ -104,7 +104,7 @@ def _page_to_bboxes(page: DocumentPage) -> dict: :param page: The documentpage object from the DI client library - :return dictionary in the expected format for presidio + :return: dictionary in the expected format for presidio """ bounds = [ DocumentIntelligenceOCR._polygon_to_bbox(word.polygon) @@ -125,7 +125,7 @@ def get_imgbytes(self, image: Union[bytes, np.ndarray, Image.Image]) -> bytes: :param image: Any of bytes/numpy array /PIL image object - :return raw image bytes + :return: raw image bytes """ if isinstance(image, bytes): return image @@ -150,7 +150,7 @@ def analyze_document(self, imgbytes: bytes, **kwargs) -> AnalyzedDocument: :param imgbytes: The bytes to send to the API endpoint :param kwargs: additional arguments for begin_analyze_document - :return the result of the poller, an AnalyzedDocument object. + :return: the result of the poller, an AnalyzedDocument object. """ poller = self.client.begin_analyze_document(self.model_id, imgbytes, **kwargs) return poller.result() diff --git a/presidio-image-redactor/presidio_image_redactor/image_analyzer_engine.py b/presidio-image-redactor/presidio_image_redactor/image_analyzer_engine.py index 5b76f70bf..5833e6c4d 100644 --- a/presidio-image-redactor/presidio_image_redactor/image_analyzer_engine.py +++ b/presidio-image-redactor/presidio_image_redactor/image_analyzer_engine.py @@ -378,7 +378,6 @@ def add_custom_bboxes( :param image: Standard image of DICOM pixels. :param bboxes: List of bounding boxes to display (with is_PII field). - :param gt_bboxes: Ground truth bboxes (list of dictionaries). :param show_text_annotation: True if you want text annotation for PHI status to display. :param use_greyscale_cmap: Use greyscale color map. diff --git a/presidio-image-redactor/presidio_image_redactor/image_processing_engine.py b/presidio-image-redactor/presidio_image_redactor/image_processing_engine.py index f687a75bb..ab73ecf4e 100644 --- a/presidio-image-redactor/presidio_image_redactor/image_processing_engine.py +++ b/presidio-image-redactor/presidio_image_redactor/image_processing_engine.py @@ -33,7 +33,6 @@ def convert_image_to_array(self, image: Image.Image) -> np.ndarray: """Convert PIL image to numpy array. :param image: Loaded PIL image. - :param convert_to_greyscale: Whether to convert the image to greyscale. :return: image pixels as a numpy array. 
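As a concrete companion to the image-redactor docstrings edited in this patch, a minimal usage sketch of `DicomImageRedactorEngine.redact_from_file` might look as follows. This is illustrative only and not part of the patch; it assumes `presidio-image-redactor` and an OCR engine such as Tesseract are installed, and the paths are placeholders rather than files from this repository:

```python
# Minimal sketch (assumptions noted above); paths below are placeholders.
from presidio_image_redactor import DicomImageRedactorEngine

engine = DicomImageRedactorEngine()

# Redact PII text burned into the pixels of a single DICOM file,
# using the parameters described in the redact_from_file docstring.
engine.redact_from_file(
    input_dicom_path="./dicom_in/sample.dcm",  # placeholder input path
    output_dir="./dicom_out",                  # placeholder output directory
    padding_width=25,     # padding used when running OCR
    fill="contrast",      # "contrast" or "background", per the docstring
    use_metadata=True,    # also redact values found in the DICOM metadata
)
```

As the docstring notes, the method duplicates the input file and manipulates the copy, so the original DICOM file is left untouched.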
@@ -167,8 +166,8 @@ def __init__( :param block_size: Size of the neighborhood area for threshold calculation. :param contrast_threshold: Threshold for low contrast images. - :param C_low_contrast: Constant added to the mean for low contrast images. - :param C_high_contrast: Constant added to the mean for high contrast images. + :param c_low_contrast: Constant added to the mean for low contrast images. + :param c_high_contrast: Constant added to the mean for high contrast images. :param bg_threshold: Threshold for background color. """ From 7853b105d244faf6f9cddeeef71b0dd3bb16784f Mon Sep 17 00:00:00 2001 From: Omri Mendels Date: Fri, 27 Dec 2024 13:12:09 +0200 Subject: [PATCH 3/5] more updates to docs --- docs/faq.md | 20 +++-- docs/installation.md | 4 +- docs/requirements-docs.txt | 1 + docs/samples/python/no_code_config.ipynb | 88 ++----------------- mkdocs.yml | 3 +- .../presidio_analyzer/analyzer_engine.py | 26 +++--- .../in_voter_recognizer.py | 3 - .../recognizer_registry.py | 3 +- .../presidio_anonymizer/operators/decrypt.py | 2 +- .../presidio_anonymizer/operators/encrypt.py | 2 +- .../dicom_image_redactor_engine.py | 4 +- 11 files changed, 43 insertions(+), 113 deletions(-) diff --git a/docs/faq.md b/docs/faq.md index 113230ad4..33bc7ac8b 100644 --- a/docs/faq.md +++ b/docs/faq.md @@ -4,7 +4,7 @@ - [What is Presidio?](#what-is-presidio) - [Why did Microsoft create Presidio?](#why-did-microsoft-create-presidio) - [Is Microsoft Presidio an official Microsoft product?](#is-microsoft-presidio-an-official-microsoft-product) - - [What is the difference between Presidio and different PII detection services like Azure Text Analytics and Amazon Comprehend?](#what-is-the-difference-between-presidio-and-different-pii-detection-services-like-azure-text-analytics-and-amazon-comprehend) + - [What is the difference between Presidio and different PII detection services like Azure AI Language and Amazon Comprehend?](#what-is-the-difference-between-presidio-and-different-pii-detection-services-like-azure-ai-language-and-amazon-comprehend) - [Using Presidio](#using-presidio) - [How can I start using Presidio?](#how-can-i-start-using-presidio) - [What are the main building blocks in Presidio?](#what-are-the-main-building-blocks-in-presidio) @@ -33,7 +33,7 @@ Presidio (Origin from Latin praesidium ‘protection, garrison’) helps to ensu Presidio is a library or SDK rather than a service. It is meant to be customized to the user's or organization's specific needs. !!! warning "Warning" - Presidio can help identify sensitive/PII data in un/structured text. However, because Presidio is using trained ML models, there is no guarantee that Presidio will find all sensitive information. Consequently, additional systems and protections should be employed. + Presidio can help identify sensitive/PII data in un/structured text. However, because it is using automated detection mechanisms, there is no guarantee that Presidio will find all sensitive information. Consequently, additional systems and protections should be employed. ### Why did Microsoft create Presidio? @@ -50,11 +50,11 @@ The authors and maintainers of Presidio come from the [Industry Solutions Engine !!! note "Note" Microsoft Presidio is not an official Microsoft product. Usage terms are defined in the [repository's license](https://github.com/microsoft/presidio/blob/main/LICENSE). -### What is the difference between Presidio and different PII detection services like Azure Text Analytics and Amazon Comprehend? 
+### What is the difference between Presidio and different PII detection services like Azure AI Language and Amazon Comprehend?
 
 In a nutshell, Presidio is a library which is meant to be customized, whereas different SaaS tools for PII detection have less customization capabilities. Most of these SaaS offerings use dedicated ML models and other logic for PII detection and often have better entity coverage or accuracy than Presidio.
 
-Based on our internal research, leveraging Presidio in parallel to 3rd party PII detection services like Azure Text Analytics can bring optimal results mainly when the data in hand has entity types or values not supported by the 3rd party service. ([see example here](https://microsoft.github.io/presidio/samples/python/text_analytics/)).
+Based on our internal research, leveraging Presidio in parallel to 3rd party PII detection services like Azure AI Language can bring optimal results mainly when the data in hand has entity types or values not supported by the 3rd party service. ([see example here](https://microsoft.github.io/presidio/samples/python/text_analytics/)).
 
 ## Using Presidio
 
@@ -71,7 +71,8 @@ Presidio is a suite built of several packages and building blocks:
 
 1. [Presidio Analyzer](https://microsoft.github.io/presidio/analyzer/): a package for detecting PII entities in natural language.
 2. [Presidio Anonymizer](https://microsoft.github.io/presidio/anonymizer/): a package for manipulating PII entities in text (e.g. remove, redact, hash, encrypt).
 3. [Presidio Image Redactor](https://microsoft.github.io/presidio/image-redactor/): A package for detecting PII entities in image using OCR.
-4. A set of sample deployments as Python packages or Docker containers for Kubernetes, Azure Data Factory, Spark and more.
+4. [Presidio Structured](https://microsoft.github.io/presidio/structured/): A package for detecting PII entities in structured/semi-structured data.
+5. A set of sample deployments as Python packages or Docker containers for Kubernetes, Azure Data Factory, Spark and more.
 
 ## Customizing Presidio
 
@@ -99,17 +100,20 @@ Pseudonymization is a de-identification technique in which the real data is repl
 
 ### Does Presidio work on structured/tabular data?
 
-This is an area we are actively looking into. We have an [example implementation](https://microsoft.github.io/presidio/samples/python/batch_processing/) of using Presidio on structured/semi-structured data. Also see the different discussions on this topic on the [Discussions](https://github.com/microsoft/presidio/discussions) section. If you have a question, suggestion, or a contribution in this area, please reach out by opening an issue, starting a discussion or reaching us directly at <presidio@microsoft.com>
+[Presidio-structured](https://microsoft.github.io/presidio/structured/) is a new capability in Presidio for detecting PII entities in structured/semi-structured data, and is still in alpha. If you have a question, suggestion, or a contribution in this area, please reach out by opening an issue, starting a discussion or reaching us directly at <presidio@microsoft.com>
 
 ## Improving detection accuracy
 
 ### What can I do if Presidio does not detect some of the PII entities in my data (False Negatives)?
 
-Presidio comes loaded with several PII recognizers (see [list here](https://microsoft.github.io/presidio/supported_entities/)), however its main strength lies in its customization capabilities to new entities, specific datasets, languages or use cases. 
For a recommended process for improving detection accuracy, see [these guidelines](https://github.com/microsoft/presidio/discussions/767#discussion-3567223). +Presidio comes loaded with several PII recognizers (see [list here](https://microsoft.github.io/presidio/supported_entities/)), +however its main strength lies in its customization capabilities to new entities, specific datasets, file types, languages or use cases. ### What can I do if Presidio falsely detects text as PII entities (False Positives)? -Some PII recognizers are less specific than others. A driver's license number, for example, could be any 9-digit number. While Presidio leverages context words and other logic to improve the detection quality, it could still falsely detect non-entity values as PII entities. +Some PII recognizers are less specific than others. A driver's license number, for example, could be any 9-digit number. +While Presidio leverages context words and other logic to improve the detection quality, +it could still falsely detect non-entity values as PII entities. In order to avoid false positives, one could try to: diff --git a/docs/installation.md b/docs/installation.md index 1f0898365..52e27d008 100644 --- a/docs/installation.md +++ b/docs/installation.md @@ -75,7 +75,7 @@ For PII redaction in images python -m spacy download en_core_web_lg ``` -2. Install an OCR engine. The default version uses the [Tesseract OCR Engine](https://github.com/tesseract-ocr/tesseract). +2. Install an OCR engine. The default version uses the [Tesseract OCR Engine](https://github.com/tesseract-ocr/tesseract). More information on installation can be found [here](image-redactor/index.md#installation). ## Using Docker @@ -137,7 +137,7 @@ git clone git@github.com:microsoft/presidio.git Then, build the containers locally. !!! note "Note" - Presidio uses [docker-compose](https://docs.docker.com/compose/) to manage the different Presidio containers. + Presidio uses [docker-compose](https://docs.docker.com/compose/) to manage the different Presidio containers. 
From the root folder of the repo: diff --git a/docs/requirements-docs.txt b/docs/requirements-docs.txt index fd2527de9..34d0d2800 100644 --- a/docs/requirements-docs.txt +++ b/docs/requirements-docs.txt @@ -9,3 +9,4 @@ presidio_anonymizer presidio_image_redactor presidio_structured pygments>=2.10 +black \ No newline at end of file diff --git a/docs/samples/python/no_code_config.ipynb b/docs/samples/python/no_code_config.ipynb index 6c76a67c1..6425c3dd0 100644 --- a/docs/samples/python/no_code_config.ipynb +++ b/docs/samples/python/no_code_config.ipynb @@ -268,19 +268,10 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": null, "id": "b25df9cd-483f-473a-be0a-3c7b0e1b4e4f", "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Device set to use cpu\n", - "Device set to use cpu\n" - ] - } - ], + "outputs": [], "source": [ "analyzer_engine = AnalyzerEngineProvider(analyzer_engine_conf_file=temp_file_path).create_engine()\n" ] @@ -295,47 +286,10 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": null, "id": "ee8cb346-5dc1-4b08-9ee4-6a71308c10d1", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "'Supported entities for en:'\n", - "\n", - "\n", - "['ORGANIZATION', 'EMAIL', 'EMAIL_ADDRESS', 'CREDIT_CARD', 'AGE', 'LOCATION',\n", - " 'PERSON', 'NRP', 'PHONE_NUMBER', 'TITLE', 'DATE_TIME', 'ID']\n", - "\n", - "Loaded recognizers for en:\n", - "['CreditCardRecognizer', 'DateRecognizer', 'EmailRecognizer', 'PhoneRecognizer',\n", - " 'Titles recognizer (en)', 'TransformersRecognizer']\n", - "\n", - "\n", - "'Supported entities for es:'\n", - "\n", - "\n", - "['ORGANIZATION', 'EMAIL', 'EMAIL_ADDRESS', 'CREDIT_CARD', 'AGE', 'LOCATION',\n", - " 'PERSON', 'NRP', 'PHONE_NUMBER', 'TITLE', 'DATE_TIME', 'ID']\n", - "\n", - "Loaded recognizers for es:\n", - "['CreditCardRecognizer', 'DateRecognizer', 'EmailRecognizer', 'PhoneRecognizer',\n", - " 'Titles recognizer (es)', 'TransformersRecognizer']\n", - "\n", - "\n", - "\n", - "Loaded NER models:\n", - "[{'lang_code': 'en',\n", - " 'model_name': {'spacy': 'en_core_web_sm',\n", - " 'transformers': 'StanfordAIMI/stanford-deidentifier-base'}},\n", - " {'lang_code': 'es',\n", - " 'model_name': {'spacy': 'es_core_news_sm',\n", - " 'transformers': 'MMG/xlm-roberta-large-ner-spanish'}}]\n" - ] - } - ], + "outputs": [], "source": [ "for lang in (\"en\", \"es\"):\n", " pprint(f\"Supported entities for {lang}:\")\n", @@ -352,23 +306,10 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": null, "id": "c6a50606-a68a-4984-995b-36c9ccdccb81", "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[type: CREDIT_CARD, start: 98, end: 114, score: 1.0,\n", - " type: PERSON, start: 15, end: 28, score: 0.9991055727005005,\n", - " type: LOCATION, start: 52, end: 62, score: 0.9933345317840576]" - ] - }, - "execution_count": 8, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "es_text = \"Hola, me llamo David Johnson y soy originalmente de Liverpool. 
Mi número de tarjeta de crédito es 4095260993934932\"\n",
    "analyzer_engine.analyze(es_text, language=\"es\")"
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 9,
+   "execution_count": null,
   "id": "ea24b036-c65f-4376-b545-76089e7dddef",
   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "[type: CREDIT_CARD, start: 89, end: 105, score: 1.0,\n",
-       " type: LOCATION, start: 53, end: 62, score: 0.9989457726478577,\n",
-       " type: PERSON, start: 15, end: 28, score: 0.9727346897125244]"
-      ]
-     },
-     "execution_count": 9,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
+   "outputs": [],
   "source": [
    "en_text = \"Hi, my name is David Johnson and I'm originally from Liverpool. My credit card number is 4095260993934932\"\n",
    "analyzer_engine.analyze(en_text, language=\"en\")"
diff --git a/mkdocs.yml b/mkdocs.yml
index 7559368bc..0b0d06665 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -181,8 +181,7 @@ plugins:
             functions: true
             modules: false
           filters:
-            - "!^_"
-            - "^__"
+            - "!^_"
   - mkdocs-jupyter:
       ignore_h1_titles: True
 
diff --git a/presidio-analyzer/presidio_analyzer/analyzer_engine.py b/presidio-analyzer/presidio_analyzer/analyzer_engine.py
index 5f2e0c429..ee83d3cfc 100644
--- a/presidio-analyzer/presidio_analyzer/analyzer_engine.py
+++ b/presidio-analyzer/presidio_analyzer/analyzer_engine.py
@@ -185,19 +185,19 @@ def analyze(
         :param nlp_artifacts: precomputed NlpArtifacts
         :return: an array of the found entities in the text
 
-        :example:
-
-        .. code-block:: python
-
-            from presidio_analyzer import AnalyzerEngine
-
-            # Set up the engine, loads the NLP module (spaCy model by default)
-            # and other PII recognizers
-            analyzer = AnalyzerEngine()
-
-            # Call analyzer to get results
-            results = analyzer.analyze(text='My phone number is 212-555-5555', entities=['PHONE_NUMBER'], language='en') # noqa D501
-            print(results)
+        :Example:
+
+        ```python
+        from presidio_analyzer import AnalyzerEngine
+
+        # Set up the engine, loads the NLP module (spaCy model by default)
+        # and other PII recognizers
+        analyzer = AnalyzerEngine()
+
+        # Call analyzer to get results
+        results = analyzer.analyze(text='My phone number is 212-555-5555', entities=['PHONE_NUMBER'], language='en')
+        print(results)
+        ```
 
         """ # noqa: E501
 
diff --git a/presidio-analyzer/presidio_analyzer/predefined_recognizers/in_voter_recognizer.py b/presidio-analyzer/presidio_analyzer/predefined_recognizers/in_voter_recognizer.py
index c5ab05e34..d4ebaf468 100644
--- a/presidio-analyzer/presidio_analyzer/predefined_recognizers/in_voter_recognizer.py
+++ b/presidio-analyzer/presidio_analyzer/predefined_recognizers/in_voter_recognizer.py
@@ -16,9 +16,6 @@ class InVoterRecognizer(PatternRecognizer):
     :param context: List of context words to increase confidence in detection
     :param supported_language: Language this recognizer supports
     :param supported_entity: The entity this recognizer can detect
-    :param replacement_pairs: List of tuples with potential replacement values
-    for different strings to be used during pattern matching.
-    This can allow a greater variety in input, for example by removing dashes or spaces.
     """
 
     PATTERNS = [

diff --git a/presidio-analyzer/presidio_analyzer/recognizer_registry/recognizer_registry.py b/presidio-analyzer/presidio_analyzer/recognizer_registry/recognizer_registry.py
index f8a697969..97c64b307 100644
--- a/presidio-analyzer/presidio_analyzer/recognizer_registry/recognizer_registry.py
+++ b/presidio-analyzer/presidio_analyzer/recognizer_registry/recognizer_registry.py
@@ -32,8 +32,9 @@ class RecognizerRegistry:
 
     :param recognizers: An optional list of recognizers, that will be available
     instead of the predefined recognizers
-    :param global_regex_flags : regex flags to be used in regex matching,
+    :param global_regex_flags: regex flags to be used in regex matching,
     including deny-lists
+    :param supported_languages: List of languages supported by this registry.
 
     """
 
diff --git a/presidio-anonymizer/presidio_anonymizer/operators/decrypt.py b/presidio-anonymizer/presidio_anonymizer/operators/decrypt.py
index 5dbd6f912..0b2f6d7a6 100644
--- a/presidio-anonymizer/presidio_anonymizer/operators/decrypt.py
+++ b/presidio-anonymizer/presidio_anonymizer/operators/decrypt.py
@@ -32,7 +32,7 @@ def validate(self, params: Dict = None) -> None:
         :param params:
             * *key* The key supplied by the user for the encryption.
             Should be a string of 128, 192 or 256 bits length.
-        :raises InvalidParamException in case on an invalid parameter.
+        :raises InvalidParamException: in case of an invalid parameter.
         """
         Encrypt().validate(params)
 
diff --git a/presidio-anonymizer/presidio_anonymizer/operators/encrypt.py b/presidio-anonymizer/presidio_anonymizer/operators/encrypt.py
index 9cd11d4ae..0b1120ade 100644
--- a/presidio-anonymizer/presidio_anonymizer/operators/encrypt.py
+++ b/presidio-anonymizer/presidio_anonymizer/operators/encrypt.py
@@ -33,7 +33,7 @@ def validate(self, params: Dict = None) -> None:
         :param params:
             * *key* The key supplied by the user for the encryption.
             Should be a string of 128, 192 or 256 bits length.
-        :raises InvalidParamException in case on an invalid parameter.
+        :raises InvalidParamException: in case of an invalid parameter.
         """
         key = params.get(self.KEY)
         if isinstance(key, str):
 
diff --git a/presidio-image-redactor/presidio_image_redactor/dicom_image_redactor_engine.py b/presidio-image-redactor/presidio_image_redactor/dicom_image_redactor_engine.py
index 5ebcd053b..698abcba7 100644
--- a/presidio-image-redactor/presidio_image_redactor/dicom_image_redactor_engine.py
+++ b/presidio-image-redactor/presidio_image_redactor/dicom_image_redactor_engine.py
@@ -159,7 +159,7 @@ def redact_from_file(
 
     :param input_dicom_path: String path to DICOM image.
     :param output_dir: String path to parent output directory.
-    :param padding_width : Padding width to use when running OCR.
+    :param padding_width: Padding width to use when running OCR.
     :param crop_ratio: Portion of image to consider when selecting
     most common pixel value as the background color value.
     :param fill: Color setting to use for redaction box
@@ -228,7 +228,7 @@ def redact_from_directory(
 
     :param input_dicom_path: String path to directory of DICOM images.
     :param output_dir: String path to parent output directory.
-    :param padding_width : Padding width to use when running OCR.
+    :param padding_width: Padding width to use when running OCR.
     :param crop_ratio: Portion of image to consider when selecting
     most common pixel value as the background color value. 
:param fill: Color setting to use for redaction box From 495cdafe10b303014665f581899b3cb34693926a Mon Sep 17 00:00:00 2001 From: Omri Mendels Date: Fri, 27 Dec 2024 13:28:27 +0200 Subject: [PATCH 4/5] py files to repo --- mkdocs.yml | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/mkdocs.yml b/mkdocs.yml index 0b0d06665..ab90bc0c8 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -85,19 +85,19 @@ nav: - Anonymizing known values: samples/python/Anonymizing known values.ipynb - Keeping some entities from being anonymized: samples/python/keep_entities.ipynb - Integrating with external services: samples/python/integrating_with_external_services.ipynb - - Remote Recognizer: samples/python/example_remote_recognizer.py + - Remote Recognizer: https://github.com/microsoft/presidio/blob/main/docs/samples/python/example_remote_recognizer.py - Azure AI Language as a Remote Recognizer: samples/python/text_analytics/index.md - - Using Flair as an external PII model: samples/python/flair_recognizer.py + - Using Flair as an external PII model: https://github.com/microsoft/presidio/blob/main/docs/samples/python/flair_recognizer.py - Using Span Marker as an external PII model: samples/python/span_marker_recognizer.py - Using Transformers as an external PII model: samples/python/transformers_recognizer/index.md - Pseudonymization (replace PII values using mappings): samples/python/pseudonymization.ipynb - - Passing a lambda as a Presidio anonymizer using Faker: samples/python/example_custom_lambda_anonymizer.py + - Passing a lambda as a Presidio anonymizer using Faker: https://github.com/microsoft/presidio/blob/main/docs/samples/python/example_custom_lambda_anonymizer.py - Synthetic data generation with OpenAI: samples/python/synth_data_with_openai.ipynb - YAML based no-code configuration: samples/python/no_code_config.ipynb - Data: - Analyzing structured / semi-structured data in batch: samples/python/batch_processing.ipynb - Presidio Structured Basic Usage Notebook: samples/python/example_structured.ipynb - - Analyze and Anonymize CSV file: samples/python/process_csv_file.py + - Analyze and Anonymize CSV file: https://github.com/microsoft/presidio/blob/main/docs/samples/python/process_csv_file.py - Images: - Redacting Text PII from DICOM images: samples/python/example_dicom_image_redactor.ipynb - Using an allow list with image redaction: samples/python/image_redaction_allow_list_approach.ipynb From 70fcb880d36c7877cec4d450100a6cea58211a58 Mon Sep 17 00:00:00 2001 From: Omri Mendels Date: Fri, 27 Dec 2024 13:39:26 +0200 Subject: [PATCH 5/5] spanmarker.py to repo --- mkdocs.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mkdocs.yml b/mkdocs.yml index ab90bc0c8..bf0f18a93 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -88,7 +88,7 @@ nav: - Remote Recognizer: https://github.com/microsoft/presidio/blob/main/docs/samples/python/example_remote_recognizer.py - Azure AI Language as a Remote Recognizer: samples/python/text_analytics/index.md - Using Flair as an external PII model: https://github.com/microsoft/presidio/blob/main/docs/samples/python/flair_recognizer.py - - Using Span Marker as an external PII model: samples/python/span_marker_recognizer.py + - Using Span Marker as an external PII model: https://github.com/microsoft/presidio/blob/main/docs/samples/python/span_marker_recognizer.py - Using Transformers as an external PII model: samples/python/transformers_recognizer/index.md - Pseudonymization (replace PII values using mappings): samples/python/pseudonymization.ipynb 
- Passing a lambda as a Presidio anonymizer using Faker: https://github.com/microsoft/presidio/blob/main/docs/samples/python/example_custom_lambda_anonymizer.py
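For readers who don't follow the links, the "lambda as a Presidio anonymizer" sample referenced in the nav entry above demonstrates a pattern that looks roughly like this — a minimal sketch, assuming `presidio-anonymizer` and `faker` are installed, with the entity span hardcoded purely for illustration:

```python
# Minimal sketch of the lambda-based anonymizer pattern (assumptions above).
from faker import Faker
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig, RecognizerResult

fake = Faker()
engine = AnonymizerEngine()

result = engine.anonymize(
    text="My name is David Johnson",
    # Normally produced by AnalyzerEngine.analyze(); hardcoded here.
    analyzer_results=[RecognizerResult("PERSON", start=11, end=24, score=0.85)],
    # The "custom" operator applies the supplied callable to each detected value.
    operators={"PERSON": OperatorConfig("custom", {"lambda": lambda _: fake.name()})},
)
print(result.text)  # e.g., "My name is <a Faker-generated name>"
```

Because the `custom` operator simply invokes the supplied callable on each matched entity value, any single-argument function can stand in for the Faker call.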