From 213e0c7eb8ec25d943ee4917faf051c4963107dc Mon Sep 17 00:00:00 2001
From: ian-cho <42691703+ian-cho@users.noreply.github.com>
Date: Thu, 3 Oct 2024 20:24:17 +0900
Subject: [PATCH 1/3] Update README.md

Updated the HAP README.
---
 transforms/universal/hap/python/README.md | 36 +++++++++++++++--------
 1 file changed, 24 insertions(+), 12 deletions(-)

diff --git a/transforms/universal/hap/python/README.md b/transforms/universal/hap/python/README.md
index 23be7084c8..347fa86ae9 100644
--- a/transforms/universal/hap/python/README.md
+++ b/transforms/universal/hap/python/README.md
@@ -1,14 +1,14 @@
-# HAP Annotation
+# Hate, Abuse, and Profanity (HAP) Annotation
 Please see the set of [transform project conventions](https://github.com/ian-cho/data-prep-kit/blob/dev/transforms/README.md) for details on general project conventions, transform configuration, testing and IDE set up.

 ## Prerequisite
-This repo needs NLTK and please refer to `requirements.txt`.
+This repository requires [NLTK](https://www.nltk.org/); please refer to `requirements.txt`.

 ## Summary
 The hap transform maps a non-empty input table to an output table with an added `hap_score` column. Each row in the table represents a document, and the hap transform performs the following three steps to calculate the hap score for each document:

 * Sentence splitting: we use NLTK to split the document into sentence pieces.
-* Hap annotation: each sentence is assigned a hap score between 0 and 1, where 1 represents hap and 0 represents non-hap.
+* HAP annotation: each sentence is assigned a hap score between 0 and 1, where 1 represents hap and 0 represents non-hap.
 * Aggregation: the document hap score is determined by selecting the maximum hap score among its sentences.

 The set of dictionary keys holding [HAPTransformConfiguration](src/hap_transform.py) configuration for values are as follows:

-* --model_name_or_path - specifies HAP model which should be compatable with HuggingFace's `AutoModelForSequenceClassification`
-* --batch_size - modify it based on the infrastructure capacity.
-* --max_length - the maximum length for the tokenizer.
-
-
+* --model_name_or_path - specifies the HAP model, which must be compatible with Hugging Face's `AutoModelForSequenceClassification`. Defaults to IBM's open-source toxicity classifier `ibm-granite/granite-guardian-hap-38m`.
+* --batch_size - the batch size; adjust it based on the infrastructure capacity. Defaults to `128`.
+* --max_length - the maximum length for the tokenizer. Defaults to `512`.
+* --doc_text_column - the column name containing the document text in the input .parquet file. Defaults to `contents`.
+* --annotation_column - the column name containing the HAP (toxicity) score in the output .parquet file. Defaults to `hap_score`.
+
 ## Input format
 The input is in .parquet format and contains the following columns:

-| doc_id | doc_text |
-|:------|:------|
+| doc_id | contents |
+|:------:|:------:|
 | 1 | GSC is very much a little Swiss Army knife for... |
 | 2 | Here are only a few examples. And no, I'm not ... |

 ## Output format
 The output is in .parquet format and includes an additional column, in addition to those in the input:

-| doc_id | doc_text | hap_score |
-|:------|:------|:-------------|
+| doc_id | contents | hap_score |
+|:------:|:------:|:-------------:|
 | 1 | GSC is very much a little Swiss Army knife for... | 0.002463 |
 | 2 | Here are only a few examples. And no, I'm not ... | 0.989713 |

 python hap_local_python.py
 You will obtain the output file `test1.parquet` in the output directory.

+## Throughput
+The table below shows the throughput (tokens per second) of the HAP transform module, which primarily includes sentence splitting, HAP annotation, and HAP score aggregation. We compare two models:
+
+* 4-layer lightweight toxicity classifier [ibm-granite/granite-guardian-hap-38m](https://huggingface.co/ibm-granite/granite-guardian-hap-38m)
+* 12-layer toxicity classifier [ibm-granite/granite-guardian-hap-125m](https://huggingface.co/ibm-granite/granite-guardian-hap-125m)
+
+We report the average throughput on CPU over three runs.
+| Model used in HAP transform module | throughput (tokens per second) |
+|:------:|:------:|
+| granite-guardian-hap-38m | 6.16 k |
+| granite-guardian-hap-125m | 1.14 k |
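For reference, the three-step scoring that PATCH 1/3 documents — NLTK sentence splitting, per-sentence HAP scoring with a Hugging Face `AutoModelForSequenceClassification` model, and max aggregation — can be sketched roughly as follows. This is an illustrative sketch only, not the actual implementation in `src/hap_transform.py`; the model name and column names simply follow the defaults documented above, and label index 1 is assumed to be the HAP class.

```python
# Illustrative sketch of the three HAP scoring steps; not the HAPTransform code itself.
import nltk
import pandas as pd
import torch
from nltk.tokenize import sent_tokenize
from transformers import AutoModelForSequenceClassification, AutoTokenizer

nltk.download("punkt", quiet=True)      # NLTK sentence-splitter data
nltk.download("punkt_tab", quiet=True)  # needed by newer NLTK releases

MODEL_NAME = "ibm-granite/granite-guardian-hap-38m"  # --model_name_or_path default
MAX_LENGTH = 512                                     # --max_length default

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def document_hap_score(text: str) -> float:
    # 1. Sentence splitting: break the document into sentence pieces with NLTK.
    sentences = sent_tokenize(text) or [text]
    # 2. HAP annotation: score each sentence between 0 (non-hap) and 1 (hap).
    enc = tokenizer(sentences, padding=True, truncation=True,
                    max_length=MAX_LENGTH, return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(model(**enc).logits, dim=-1)
    sentence_scores = probs[:, 1]  # assumes label index 1 is the HAP class
    # 3. Aggregation: the document score is the maximum sentence score.
    return float(sentence_scores.max())

# Annotate a table shaped like the "Input format" example above.
df = pd.read_parquet("test1.parquet")                     # --doc_text_column: contents
df["hap_score"] = df["contents"].map(document_hap_score)  # --annotation_column
df.to_parquet("test1_hap.parquet")                        # example output path
```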
From 2971c730b0a580a7731ec70f7e2bb8e75a299cb0 Mon Sep 17 00:00:00 2001
From: ian-cho <42691703+ian-cho@users.noreply.github.com>
Date: Thu, 3 Oct 2024 21:27:54 +0900
Subject: [PATCH 2/3] Update README.md

---
 transforms/universal/hap/python/README.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/transforms/universal/hap/python/README.md b/transforms/universal/hap/python/README.md
index 347fa86ae9..29d54d999a 100644
--- a/transforms/universal/hap/python/README.md
+++ b/transforms/universal/hap/python/README.md
@@ -54,7 +54,8 @@ The table below shows the throughput (tokens per second) of the HAP transform mo
 * 4-layer lightweight toxicity classifier [ibm-granite/granite-guardian-hap-38m](https://huggingface.co/ibm-granite/granite-guardian-hap-38m)
 * 12-layer toxicity classifier [ibm-granite/granite-guardian-hap-125m](https://huggingface.co/ibm-granite/granite-guardian-hap-125m)

-We report the average throughput on CPU over three runs.
+We processed 6,000 documents (12 MB in Parquet file size) using the HAP transform module and reported the average CPU throughput over three trials.
+
 | Model used in HAP transform module | throughput (tokens per second) |
 |:------:|:------:|
 | granite-guardian-hap-38m | 6.16 k |
 | granite-guardian-hap-125m | 1.14 k |

From 9ad002ad738e6cc68bb83f0fc97623c7cee12c4d Mon Sep 17 00:00:00 2001
From: Shahrokh Daijavad
Date: Thu, 3 Oct 2024 11:16:26 -0700
Subject: [PATCH 3/3] Update README.md

Added HAP to the table in the top-level README.
---
 README.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/README.md b/README.md
index ade3bed689..aeec4ef704 100644
--- a/README.md
+++ b/README.md
@@ -139,6 +139,7 @@ The matrix below shows the combination of modules and supported runtimes. Al
 | [Filter on annotations](transforms/universal/filter/python/README.md) | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
 | [Profiler](transforms/universal/profiler/ray/README.md) | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
 | [Resize](transforms/universal/resize/python/README.md) | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
+| [HAP](transforms/universal/hap/python/README.md) | :white_check_mark: | | | |
 | [Tokenizer](transforms/universal/tokenization/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
 | **Language-only** | | | | |
 | [Language identification](transforms/language/lang_id/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
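On the throughput numbers above: the patches report tokens per second averaged over three CPU runs but do not include the benchmark script. A measurement of that shape could be sketched as below; the helper name, the use of the model's own tokenizer for token counting, and the scoring function (`document_hap_score` from the earlier sketch) are assumptions rather than the project's actual benchmark.

```python
# Hypothetical throughput measurement: tokens per second, averaged over several runs.
import time

import pandas as pd
from transformers import AutoTokenizer

def average_tokens_per_second(parquet_path: str, score_fn, model_name: str, runs: int = 3) -> float:
    df = pd.read_parquet(parquet_path)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Count tokens once, using the same tokenizer the HAP model uses.
    total_tokens = sum(len(tokenizer.encode(text)) for text in df["contents"])
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        df["contents"].map(score_fn)  # sentence split + per-sentence HAP score + max
        rates.append(total_tokens / (time.perf_counter() - start))
    return sum(rates) / len(rates)

# Example (using the `document_hap_score` sketch shown after PATCH 1/3):
# average_tokens_per_second("test1.parquet", document_hap_score,
#                           "ibm-granite/granite-guardian-hap-38m")
```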