Update HAP README.md #661

Merged 3 commits on Oct 3, 2024
1 change: 1 addition & 0 deletions README.md
@@ -139,6 +139,7 @@ The matrix below shows the combination of modules and supported runtimes. Al
| [Filter on annotations](transforms/universal/filter/python/README.md) | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| [Profiler](transforms/universal/profiler/ray/README.md) | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| [Resize](transforms/universal/resize/python/README.md) | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| [HAP](transforms/universal/hap/python/README.md) | :white_check_mark: | | | |
| [Tokenizer](transforms/universal/tokenization/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
| **Language-only** | | | | |
| [Language identification](transforms/language/lang_id/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
37 changes: 25 additions & 12 deletions transforms/universal/hap/python/README.md
@@ -1,40 +1,41 @@
# Hate, Abuse, and Profanity (HAP) Annotation
Please see the set of [transform project conventions](https://github.com/ian-cho/data-prep-kit/blob/dev/transforms/README.md) for details on general project conventions, transform configuration, testing and IDE set up.

## Prerequisite
This repository requires [NLTK](https://www.nltk.org/); please refer to `requirements.txt` for the full list of dependencies.

## Summary
The HAP transform maps a non-empty input table to an output table with an added `hap_score` column. Each row in the table represents a document, and the HAP transform performs the following three steps to calculate the HAP score for each document:

* Sentence splitting: we use NLTK to split the document into sentence pieces.
* HAP annotation: each sentence is assigned a HAP score between 0 and 1, where 1 represents HAP and 0 represents non-HAP.
* Aggregation: the document HAP score is determined by selecting the maximum HAP score among its sentences.
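The three steps above can be sketched in Python. This is only an illustrative outline: `score_sentence` is a stub standing in for the real HAP classifier, and the naive period-based splitter stands in for NLTK's sentence tokenizer.

```python
# Sketch of the split -> annotate -> aggregate flow described above.
# Both helpers are hypothetical stand-ins, not the actual implementation.

def split_sentences(document: str) -> list[str]:
    # The real transform uses NLTK; a naive period split stands in here.
    return [s.strip() for s in document.split(".") if s.strip()]

def score_sentence(sentence: str) -> float:
    # Stub classifier: returns a toxicity score in [0, 1].
    return 1.0 if "hate" in sentence.lower() else 0.0

def doc_hap_score(document: str) -> float:
    # Aggregation: the document score is the max over its sentence scores.
    scores = [score_sentence(s) for s in split_sentences(document)]
    return max(scores, default=0.0)
```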


## Configuration and command line options
The set of dictionary keys holding [HAPTransformConfiguration](src/hap_transform.py)
configuration for values is as follows:

* --model_name_or_path - specifies the HAP model, which should be compatible with HuggingFace's `AutoModelForSequenceClassification`. Defaults to IBM's open-source toxicity classifier `ibm-granite/granite-guardian-hap-38m`.
* --batch_size - the batch size for inference; adjust it based on the infrastructure capacity. Defaults to `128`.
* --max_length - the maximum length for the tokenizer. Defaults to `512`.
* --doc_text_column - the column name containing the document text in the input .parquet file. Defaults to `contents`.
* --annotation_column - the column name containing the HAP (toxicity) score in the output .parquet file. Defaults to `hap_score`.
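The flags above map one-to-one onto values. As a sketch of that mapping (the dict below is just a convenient way to render a command line, not a real API of the transform):

```python
# Illustrative mapping of the CLI flags above to their documented defaults.
hap_params = {
    "model_name_or_path": "ibm-granite/granite-guardian-hap-38m",
    "batch_size": 128,
    "max_length": 512,
    "doc_text_column": "contents",
    "annotation_column": "hap_score",
}

# Render the equivalent command-line arguments.
cli_args = [f"--{key}={value}" for key, value in hap_params.items()]
```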


## Input format
The input is in .parquet format and contains the following columns:

| doc_id | contents |
|:------:|:------:|
| 1 | GSC is very much a little Swiss Army knife for... |
| 2 | Here are only a few examples. And no, I'm not ... |
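A minimal input file in this schema can be built with pandas; the rows mirror the example table above (serializing to parquet additionally assumes a parquet engine such as pyarrow is installed):

```python
import pandas as pd

# Two example documents in the expected input schema (doc_id, contents).
df = pd.DataFrame(
    {
        "doc_id": [1, 2],
        "contents": [
            "GSC is very much a little Swiss Army knife for...",
            "Here are only a few examples. And no, I'm not ...",
        ],
    }
)

# Serialize to parquet (requires a parquet engine such as pyarrow):
# df.to_parquet("input.parquet")
```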

## Output format
The output is in .parquet format and includes an additional `hap_score` column alongside the input columns:

| doc_id | contents | hap_score |
|:------:|:------:|:-------------:|
| 1 | GSC is very much a little Swiss Army knife for... | 0.002463 |
| 2 | Here are only a few examples. And no, I'm not ... | 0.989713 |
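A common downstream use of the annotated output is threshold filtering; the 0.5 cutoff below is an illustrative choice, not a project default:

```python
import pandas as pd

# Example output rows, matching the table above.
out = pd.DataFrame(
    {
        "doc_id": [1, 2],
        "contents": [
            "GSC is very much a little Swiss Army knife for...",
            "Here are only a few examples. And no, I'm not ...",
        ],
        "hap_score": [0.002463, 0.989713],
    }
)

# Keep only documents below an (illustrative) toxicity threshold.
clean = out[out["hap_score"] < 0.5]
```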

@@ -47,6 +48,18 @@ python hap_local_python.py

You will obtain the output file `test1.parquet` in the output directory.

## Throughput
The table below shows the throughput (tokens per second) of the HAP transform module, which primarily includes sentence splitting, HAP annotation, and HAP score aggregation. We compare two models:

* 4-layer lightweight toxicity classifier [ibm-granite/granite-guardian-hap-38m](https://huggingface.co/ibm-granite/granite-guardian-hap-38m)
* 12-layer toxicity classifier [ibm-granite/granite-guardian-hap-125m](https://huggingface.co/ibm-granite/granite-guardian-hap-125m)

We processed 6,000 documents (12 MB in Parquet file size) using the HAP transform module and reported the average CPU throughput over three trials.

| Model used in HAP transform module | throughput (tokens per second) |
|:------:|:------:|
| granite-guardian-hap-38m | 6.16 k |
| granite-guardian-hap-125m | 1.14 k |
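From the reported figures, the relative speedup of the smaller model follows directly:

```python
# Reported average CPU throughputs from the table above (tokens per second).
throughput_38m = 6160.0
throughput_125m = 1140.0

# The 38m model processes roughly 5.4x more tokens per second
# than the 125m model on this workload.
speedup = throughput_38m / throughput_125m
```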


