From 213e0c7eb8ec25d943ee4917faf051c4963107dc Mon Sep 17 00:00:00 2001
From: ian-cho <42691703+ian-cho@users.noreply.github.com>
Date: Thu, 3 Oct 2024 20:24:17 +0900
Subject: [PATCH 1/3] Update README.md

Updated the HAP README.
---
 transforms/universal/hap/python/README.md | 36 +++++++++++++++--------
 1 file changed, 24 insertions(+), 12 deletions(-)

diff --git a/transforms/universal/hap/python/README.md b/transforms/universal/hap/python/README.md
index 23be7084c8..347fa86ae9 100644
--- a/transforms/universal/hap/python/README.md
+++ b/transforms/universal/hap/python/README.md
@@ -1,14 +1,14 @@
-# HAP Annotation
+# Hate, Abuse, and Profanity (HAP) Annotation
 Please see the set of [transform project conventions](https://github.com/ian-cho/data-prep-kit/blob/dev/transforms/README.md) for details on general project conventions, transform configuration, testing and IDE set up.

 ## Prerequisite
-This repo needs NLTK and please refer to `requirements.txt`.
+This repository requires [NLTK](https://www.nltk.org/); please refer to `requirements.txt`.

 ## Summary
 The hap transform maps a non-empty input table to an output table with an added `hap_score` column. Each row in the table represents a document, and the hap transform performs the following three steps to calculate the hap score for each document:

 * Sentence splitting: we use NLTK to split the document into sentence pieces.
-* Hap annotation: each sentence is assigned a hap score between 0 and 1, where 1 represents hap and 0 represents non-hap.
+* HAP annotation: each sentence is assigned a hap score between 0 and 1, where 1 represents hap and 0 represents non-hap.
 * Aggregation: the document hap score is determined by selecting the maximum hap score among its sentences.

 The set of dictionary keys holding [HAPTransformConfiguration](src/hap_transform.py) configuration for values are as follows:

-* --model_name_or_path - specifies HAP model which should be compatable with HuggingFace's `AutoModelForSequenceClassification`
-* --batch_size - modify it based on the infrastructure capacity.
-* --max_length - the maximum length for the tokenizer.
-
-
+* --model_name_or_path - specifies the HAP model, which must be compatible with Hugging Face's `AutoModelForSequenceClassification`. Defaults to IBM's open-source toxicity classifier `ibm-granite/granite-guardian-hap-38m`.
+* --batch_size - the batch size; adjust it based on the infrastructure capacity. Defaults to `128`.
+* --max_length - the maximum length for the tokenizer. Defaults to `512`.
+* --doc_text_column - the column name containing the document text in the input .parquet file. Defaults to `contents`.
+* --annotation_column - the column name containing the HAP (toxicity) score in the output .parquet file. Defaults to `hap_score`.
+
 ## Input format
 The input is in .parquet format and contains the following columns:

-| doc_id | doc_text |
-|:------|:------|
+| doc_id | contents |
+|:------:|:------:|
 | 1 | GSC is very much a little Swiss Army knife for... |
 | 2 | Here are only a few examples. And no, I'm not ... |

 ## Output format
 The output is in .parquet format and includes an additional column, in addition to those in the input:

-| doc_id | doc_text | hap_score |
-|:------|:------|:-------------|
+| doc_id | contents | hap_score |
+|:------:|:------:|:-------------:|
 | 1 | GSC is very much a little Swiss Army knife for... | 0.002463 |
 | 2 | Here are only a few examples. And no, I'm not ... | 0.989713 |

 python hap_local_python.py
 You will obtain the output file `test1.parquet` in the output directory.

+## Throughput
+The table below shows the throughput (tokens per second) of the HAP transform module, which primarily includes sentence splitting, HAP annotation, and HAP score aggregation. We compare two models:
+
+* 4-layer lightweight toxicity classifier [ibm-granite/granite-guardian-hap-38m](https://huggingface.co/ibm-granite/granite-guardian-hap-38m)
+* 12-layer toxicity classifier [ibm-granite/granite-guardian-hap-125m](https://huggingface.co/ibm-granite/granite-guardian-hap-125m)
+
+We report the average throughput on CPU over three runs.
+| Model used in HAP transform module | throughput (tokens per second) |
+|:------:|:------:|
+| granite-guardian-hap-38m | 6.16 k |
+| granite-guardian-hap-125m | 1.14 k |
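For reference, the three-step scoring that PATCH 1/3 documents — NLTK sentence splitting, per-sentence HAP scoring with a Hugging Face `AutoModelForSequenceClassification` model, and max aggregation — can be sketched roughly as follows. This is an illustrative sketch only, not the actual implementation in `src/hap_transform.py`; the model name and column names simply follow the defaults documented above, and label index 1 is assumed to be the HAP class.

```python
# Illustrative sketch of the three HAP scoring steps; not the HAPTransform code itself.
import nltk
import pandas as pd
import torch
from nltk.tokenize import sent_tokenize
from transformers import AutoModelForSequenceClassification, AutoTokenizer

nltk.download("punkt", quiet=True)      # NLTK sentence-splitter data
nltk.download("punkt_tab", quiet=True)  # needed by newer NLTK releases

MODEL_NAME = "ibm-granite/granite-guardian-hap-38m"  # --model_name_or_path default
MAX_LENGTH = 512                                     # --max_length default

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def document_hap_score(text: str) -> float:
    # 1. Sentence splitting: break the document into sentence pieces with NLTK.
    sentences = sent_tokenize(text) or [text]
    # 2. HAP annotation: score each sentence between 0 (non-hap) and 1 (hap).
    enc = tokenizer(sentences, padding=True, truncation=True,
                    max_length=MAX_LENGTH, return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(model(**enc).logits, dim=-1)
    sentence_scores = probs[:, 1]  # assumes label index 1 is the HAP class
    # 3. Aggregation: the document score is the maximum sentence score.
    return float(sentence_scores.max())

# Annotate a table shaped like the "Input format" example above.
df = pd.read_parquet("test1.parquet")                     # --doc_text_column: contents
df["hap_score"] = df["contents"].map(document_hap_score)  # --annotation_column
df.to_parquet("test1_hap.parquet")                        # example output path
```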
From 2971c730b0a580a7731ec70f7e2bb8e75a299cb0 Mon Sep 17 00:00:00 2001
From: ian-cho <42691703+ian-cho@users.noreply.github.com>
Date: Thu, 3 Oct 2024 21:27:54 +0900
Subject: [PATCH 2/3] Update README.md

---
 transforms/universal/hap/python/README.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/transforms/universal/hap/python/README.md b/transforms/universal/hap/python/README.md
index 347fa86ae9..29d54d999a 100644
--- a/transforms/universal/hap/python/README.md
+++ b/transforms/universal/hap/python/README.md
@@ -54,7 +54,8 @@ The table below shows the throughput (tokens per second) of the HAP transform mo
 * 4-layer lightweight toxicity classifier [ibm-granite/granite-guardian-hap-38m](https://huggingface.co/ibm-granite/granite-guardian-hap-38m)
 * 12-layer toxicity classifier [ibm-granite/granite-guardian-hap-125m](https://huggingface.co/ibm-granite/granite-guardian-hap-125m)

-We report the average throughput on CPU over three runs.
+We processed 6,000 documents (12 MB in Parquet file size) using the HAP transform module and reported the average CPU throughput over three trials.
+
 | Model used in HAP transform module | throughput (tokens per second) |
 |:------:|:------:|
 | granite-guardian-hap-38m | 6.16 k |
 | granite-guardian-hap-125m | 1.14 k |

From 9ad002ad738e6cc68bb83f0fc97623c7cee12c4d Mon Sep 17 00:00:00 2001
From: Shahrokh Daijavad
Date: Thu, 3 Oct 2024 11:16:26 -0700
Subject: [PATCH 3/3] Update README.md

Added HAP to the table in the top-level README.
---
 README.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/README.md b/README.md
index ade3bed689..aeec4ef704 100644
--- a/README.md
+++ b/README.md
@@ -139,6 +139,7 @@ The matrix below shows the combination of modules and supported runtimes. Al
 | [Filter on annotations](transforms/universal/filter/python/README.md) | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
 | [Profiler](transforms/universal/profiler/ray/README.md) | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
 | [Resize](transforms/universal/resize/python/README.md) | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
+| [HAP](transforms/universal/hap/python/README.md) | :white_check_mark: | | | |
 | [Tokenizer](transforms/universal/tokenization/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
 | **Language-only** | | | | |
 | [Language identification](transforms/language/lang_id/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
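On the throughput numbers above: the patches report tokens per second averaged over three CPU runs but do not include the benchmark script. A measurement of that shape could be sketched as below; the helper name, the use of the model's own tokenizer for token counting, and the scoring function (`document_hap_score` from the earlier sketch) are assumptions rather than the project's actual benchmark.

```python
# Hypothetical throughput measurement: tokens per second, averaged over several runs.
import time

import pandas as pd
from transformers import AutoTokenizer

def average_tokens_per_second(parquet_path: str, score_fn, model_name: str, runs: int = 3) -> float:
    df = pd.read_parquet(parquet_path)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Count tokens once, using the same tokenizer the HAP model uses.
    total_tokens = sum(len(tokenizer.encode(text)) for text in df["contents"])
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        df["contents"].map(score_fn)  # sentence split + per-sentence HAP score + max
        rates.append(total_tokens / (time.perf_counter() - start))
    return sum(rates) / len(rates)

# Example (using the `document_hap_score` sketch shown after PATCH 1/3):
# average_tokens_per_second("test1.parquet", document_hap_score,
#                           "ibm-granite/granite-guardian-hap-38m")
```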