gneissweb_classification #974

Merged
merged 12 commits into from
Jan 30, 2025
Changes from 6 commits
133 changes: 133 additions & 0 deletions .github/workflows/test-language-gneissweb_classification.yml
@@ -0,0 +1,133 @@
#
# DO NOT EDIT THIS FILE: it is generated from test-transform.template, Edit there and run make to change these files
#
name: Test - transforms/language/gneissweb_classification

on:
    workflow_dispatch:
    push:
        branches:
            - "dev"
            - "releases/**"
        tags:
            - "*"
        paths:
            - ".make.*"
            - "transforms/.make.transforms"
            - "transforms/language/gneissweb_classification/**"
            - "data-processing-lib/**"
            - "!transforms/language/gneissweb_classification/**/kfp_ray/**" # This is/will be tested in separate workflow
            - "!data-processing-lib/**/test/**"
            - "!data-processing-lib/**/test-data/**"
            - "!**.md"
            - "!**/doc/**"
            - "!**/images/**"
            - "!**.gitignore"
    pull_request:
        branches:
            - "dev"
            - "releases/**"
        paths:
            - ".make.*"
            - "transforms/.make.transforms"
            - "transforms/language/gneissweb_classification/**"
            - "data-processing-lib/**"
            - "!transforms/language/gneissweb_classification/**/kfp_ray/**" # This is/will be tested in separate workflow
            - "!data-processing-lib/**/test/**"
            - "!data-processing-lib/**/test-data/**"
            - "!**.md"
            - "!**/doc/**"
            - "!**/images/**"
            - "!**.gitignore"

# Taken from https://stackoverflow.com/questions/66335225/how-to-cancel-previous-runs-in-the-pr-when-you-push-new-commitsupdate-the-curre
concurrency:
    group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
    cancel-in-progress: true

jobs:
    check_if_push_image:
        # check whether the Docker images should be pushed to the remote repository
        # The images are pushed if it is a merge to dev branch or a new tag is created.
        # The latter being part of the release process.
        # The images tag is derived from the value of the DOCKER_IMAGE_VERSION variable set in the .make.versions file.
        runs-on: ubuntu-22.04
        outputs:
            publish_images: ${{ steps.version.outputs.publish_images }}
        steps:
            - id: version
              run: |
                  publish_images='false'
                  if [[ ${GITHUB_REF} == refs/heads/dev && ${GITHUB_EVENT_NAME} != 'pull_request' && ${GITHUB_REPOSITORY} == IBM/data-prep-kit ]] ;
                  then
                      publish_images='true'
                  fi
                  if [[ ${GITHUB_REF} == refs/tags/* && ${GITHUB_REPOSITORY} == IBM/data-prep-kit ]] ;
                  then
                      publish_images='true'
                  fi
                  echo "publish_images=$publish_images" >> "$GITHUB_OUTPUT"
    test-src:
        runs-on: ubuntu-22.04
        steps:
            - name: Checkout
              uses: actions/checkout@v4
            - name: Free up space in github runner
              # Free space as indicated here : https://github.com/actions/runner-images/issues/2840#issuecomment-790492173
              run: |
                  df -h
                  sudo rm -rf "/usr/local/share/boost"
                  sudo rm -rf "$AGENT_TOOLSDIRECTORY"
                  sudo rm -rf /usr/share/dotnet /opt/ghc /usr/local/lib/android /usr/local/share/powershell /usr/share/swift /usr/local/.ghcup
                  sudo docker rmi $(docker image ls -aq) >/dev/null 2>&1 || true
                  df -h
            - name: Test transform source in transforms/language/gneissweb_classification
              run: |
                  if [ -e "transforms/language/gneissweb_classification/Makefile" ]; then
                      make -C transforms/language/gneissweb_classification DOCKER=docker test-src
                  else
                      echo "transforms/language/gneissweb_classification/Makefile not found - source testing disabled for this transform."
                  fi
    test-image:
        needs: [check_if_push_image]
        runs-on: ubuntu-22.04
        timeout-minutes: 120
        env:
            DOCKER_REGISTRY_USER: ${{ secrets.DOCKER_REGISTRY_USER }}
            DOCKER_REGISTRY_KEY: ${{ secrets.DOCKER_REGISTRY_KEY }}
        steps:
            - name: Checkout
              uses: actions/checkout@v4
            - name: Free up space in github runner
              # Free space as indicated here : https://github.com/actions/runner-images/issues/2840#issuecomment-790492173
              run: |
                  df -h
                  sudo rm -rf /opt/ghc
                  sudo rm -rf "/usr/local/share/boost"
                  sudo rm -rf "$AGENT_TOOLSDIRECTORY"
                  sudo rm -rf /usr/share/dotnet /opt/ghc /usr/local/lib/android /usr/local/share/powershell /usr/share/swift /usr/lib/jvm /usr/local/.ghcup
                  sudo docker rmi $(docker image ls -aq) >/dev/null 2>&1 || true
                  df -h
            - name: Test transform image in transforms/language/gneissweb_classification
              run: |
                  if [ -e "transforms/language/gneissweb_classification/Makefile" ]; then
                      if [ -d "transforms/language/gneissweb_classification/spark" ]; then
                          make -C data-processing-lib/spark DOCKER=docker image
                      fi
                      make -C transforms/language/gneissweb_classification DOCKER=docker test-image
                  else
                      echo "transforms/language/gneissweb_classification/Makefile not found - testing disabled for this transform."
                  fi
            - name: Print space
              # Free space as indicated here : https://github.com/actions/runner-images/issues/2840#issuecomment-790492173
              run: |
                  df -h
                  docker images
            - name: Publish images
              if: needs.check_if_push_image.outputs.publish_images == 'true'
              run: |
                  if [ -e "transforms/language/gneissweb_classification/Makefile" ]; then
                      make -C transforms/language/gneissweb_classification publish
                  else
                      echo "transforms/language/gneissweb_classification/Makefile not found - publishing disabled for this transform."
                  fi
46 changes: 46 additions & 0 deletions transforms/language/gneissweb_classification/Dockerfile.python
@@ -0,0 +1,46 @@
FROM docker.io/python:3.11.11-slim-bullseye

RUN pip install --upgrade --no-cache-dir pip

# install pytest
RUN pip install --no-cache-dir pytest

# Create a user and use it to run the transform
RUN useradd -ms /bin/bash dpk
USER dpk
WORKDIR /home/dpk
ARG DPK_WHEEL_FILE_NAME

# Copy and install data processing libraries
# These are expected to be placed in the docker context before this is run (see the make image).
COPY --chown=dpk:root data-processing-dist/ data-processing-dist/
RUN pip install data-processing-dist/${DPK_WHEEL_FILE_NAME}

# END OF STEPS destined for a data-prep-kit base image

# set up environment required to install and use huggingface and fasttext
USER root
RUN apt update && apt install gcc g++ -y
RUN mkdir -p /home/dpk/.cache/huggingface/hub && chmod -R 777 /home/dpk/.cache/huggingface/hub
USER dpk

COPY --chown=dpk:root dpk_gneissweb_classification/ dpk_gneissweb_classification/
COPY --chown=dpk:root requirements.txt requirements.txt
RUN pip install --no-cache-dir -r requirements.txt

# clean up apt
USER root
RUN apt-get remove gcc g++ -y \
&& apt clean \
&& rm -rf /var/cache/apt/archives/* /var/lib/apt/lists/*
USER dpk


# Set environment
ENV PYTHONPATH=/home/dpk

# Put these at the end since they seem to upset the docker cache.
ARG BUILD_DATE
ARG GIT_COMMIT
LABEL build-date=$BUILD_DATE
LABEL git-commit=$GIT_COMMIT
46 changes: 46 additions & 0 deletions transforms/language/gneissweb_classification/Dockerfile.ray
@@ -0,0 +1,46 @@
ARG BASE_IMAGE=docker.io/rayproject/ray:2.24.0-py311

FROM ${BASE_IMAGE}

# see https://docs.openshift.com/container-platform/4.17/openshift_images/create-images.html#use-uid_create-images
USER root
RUN chown ray:root /home/ray && chmod 775 /home/ray
USER ray

RUN pip install --upgrade --no-cache-dir pip

# install pytest
RUN pip install --no-cache-dir pytest
ARG DPK_WHEEL_FILE_NAME

# set up environment required to install and use huggingface and fasttext
USER root
RUN sudo apt update && sudo apt install gcc g++ -y
RUN mkdir -p /home/ray/.cache/huggingface/hub && chmod -R 777 /home/ray/.cache/huggingface/hub
USER ray

# Copy and install data processing libraries
# These are expected to be placed in the docker context before this is run (see the make image).
COPY --chmod=775 --chown=ray:root data-processing-dist data-processing-dist
RUN pip install data-processing-dist/${DPK_WHEEL_FILE_NAME}[ray]


COPY --chmod=775 --chown=ray:root dpk_gneissweb_classification/ dpk_gneissweb_classification/
COPY --chmod=775 --chown=ray:root requirements.txt requirements.txt
RUN pip install --no-cache-dir -r requirements.txt

# clean up apt
USER root
RUN sudo apt remove gcc g++ -y \
&& sudo apt clean \
&& sudo rm -rf /var/cache/apt/archives/* /var/lib/apt/lists/*
USER ray

# Set environment
ENV PYTHONPATH=/home/ray

# Put these at the end since they seem to upset the docker cache.
ARG BUILD_DATE
ARG GIT_COMMIT
LABEL build-date=$BUILD_DATE
LABEL git-commit=$GIT_COMMIT
36 changes: 36 additions & 0 deletions transforms/language/gneissweb_classification/Makefile
@@ -0,0 +1,36 @@
REPOROOT=../../..
# Use make help, to see the available rules
include $(REPOROOT)/transforms/.make.cicd.targets

#
# This is intended to be included across the Makefiles provided within
# a given transform's directory tree, so must use compatible syntax.
#
################################################################################
# This defines the name of the transform and is used to match against
# expected files and is used to define the transform's image name.
TRANSFORM_NAME=$(shell basename `pwd`)

################################################################################



run-cli-sample:
	make venv
	source venv/bin/activate && \
	$(PYTHON) -m dpk_$(TRANSFORM_NAME).transform_python \
		--data_local_config "{ 'input_folder' : 'test-data/input', 'output_folder' : 'output'}" \
		--gcls_model_credential "PUT YOUR OWN HUGGINGFACE CREDENTIAL" \
		--gcls_model_file_name "model.bin" \
		--gcls_model_url "facebook/fasttext-language-identification" \
		--gcls_content_column_name "text"

run-cli-ray-sample:
	make venv
	source venv/bin/activate && \
	$(PYTHON) -m dpk_$(TRANSFORM_NAME).ray.transform \
		--run_locally True --data_local_config "{ 'input_folder' : 'test-data/input', 'output_folder' : 'output'}" \
		--gcls_model_credential "PUT YOUR OWN HUGGINGFACE CREDENTIAL" \
		--gcls_model_file_name "model.bin" \
		--gcls_model_url "facebook/fasttext-language-identification" \
		--gcls_content_column_name "text"
79 changes: 79 additions & 0 deletions transforms/language/gneissweb_classification/README.md
@@ -0,0 +1,79 @@
# Gneissweb Classification Transform
The Gneissweb Classification transform is an exemplar that demonstrates how to develop
a simple 1:1 transform.
Please see the set of [transform project conventions](../../README.md#transform-project-conventions) for details on general project conventions, transform configuration, testing and IDE set up.

## Summary
This transform classifies each document with a fastText classification model, such as [facebook/fasttext-language-identification](https://huggingface.co/facebook/fasttext-language-identification), and records the predicted label together with its confidence score.
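
As a rough illustration of what a label with a confidence score looks like: fastText's `predict` returns label strings that carry a `__label__` prefix, alongside their probabilities. The sketch below uses a hypothetical helper, `parse_prediction`, and hand-written sample values rather than a real model call, so it only shows how such a result could be reduced to one label and one score. It is not the transform's actual code.

```python
from typing import Sequence, Tuple

FASTTEXT_LABEL_PREFIX = "__label__"  # prefix fastText puts on every label


def parse_prediction(labels: Sequence[str], scores: Sequence[float]) -> Tuple[str, float]:
    """Return the top (label, score) pair from a fastText-style prediction."""
    top_label = labels[0]
    if top_label.startswith(FASTTEXT_LABEL_PREFIX):
        top_label = top_label[len(FASTTEXT_LABEL_PREFIX):]
    return top_label, float(scores[0])


# Hand-written sample shaped like a prediction result; not real model output.
sample_labels = ("__label__eng_Latn",)
sample_scores = (0.98,)
label, score = parse_prediction(sample_labels, sample_scores)
print(label, score)
```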

## Configuration and command line Options

The [ClassificationTransform](dpk_gneissweb_classification/transform.py)
configuration is a dictionary with the following keys:

| Key name | Default | Description |
|------------|----------|--------------|
| _model_credential_ | _unset_ | Credential used to download the model. For Hugging Face models this is an access token; see the [guide to Hugging Face tokens](https://huggingface.co/docs/hub/security-tokens). |
| _model_file_name_ | _unset_ | File name of the model to download, for example `model.bin`. |
| _model_url_ | _unset_ | Location of the model. For fastText, this is the model's repository name, for example `facebook/fasttext-language-identification`. |
| _content_column_name_ | `contents` | Name of the column containing the documents. |
| _output_label_column_name_ | `label` | Name of the output column that holds the predicted class. |
| _output_score_column_name_ | `score` | Name of the output column that holds the prediction score. |
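
A minimal sketch of how these keys might be assembled into a configuration dictionary. The column defaults come from the table above. The helper `build_config`, the placeholder token, and the exact key spellings are assumptions, as is treating the three unset keys as required. Check `transform.py` for the authoritative names.

```python
from typing import Any, Dict

# Defaults taken from the table above; the remaining keys have no default.
DEFAULTS: Dict[str, Any] = {
    "content_column_name": "contents",
    "output_label_column_name": "label",  # assumed spelling of the label key
    "output_score_column_name": "score",
}
# Assumption: keys documented as "unset" must be supplied by the caller.
REQUIRED = ("model_credential", "model_file_name", "model_url")


def build_config(**overrides: Any) -> Dict[str, Any]:
    """Merge user values over the documented defaults and check required keys."""
    config = {**DEFAULTS, **overrides}
    missing = [key for key in REQUIRED if not config.get(key)]
    if missing:
        raise ValueError(f"missing required configuration keys: {missing}")
    return config


config = build_config(
    model_credential="hf_xxx",  # placeholder, not a real token
    model_file_name="model.bin",
    model_url="facebook/fasttext-language-identification",
    content_column_name="text",
)
print(config)
```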

## Running

### Launched Command Line Options
The following command line arguments are available in addition to
the options provided by
the [launcher](../../../data-processing-lib/doc/launcher-options.md).
The prefix `gcls` is short for Gneissweb CLaSsification.
```
  --gcls_model_credential GCLS_MODEL_CREDENTIAL   the credential used to get the model. This will be a Hugging Face token.
  --gcls_model_file_name GCLS_MODEL_FILE_NAME   file name of the model to get, like `model.bin`
  --gcls_model_url GCLS_MODEL_URL   url where the model is located. For fasttext, this will be the repo name of the model, like `facebook/fasttext-language-identification`
  --gcls_content_column_name GCLS_CONTENT_COLUMN_NAME   name of the column containing documents
  --gcls_output_label_column_name GCLS_OUTPUT_LABEL_COLUMN_NAME   column name to store classification results
  --gcls_output_score_column_name GCLS_OUTPUT_SCORE_COLUMN_NAME   column name to store the score of prediction
```
These correspond to the configuration keys described above.
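
To illustrate the prefix convention, the sketch below builds an argument list by prepending `gcls_` to each configuration key. The helper `to_cli_args` is hypothetical and shows only the naming convention; it does not invoke the launcher. The sample values are the ones used in the Makefile's `run-cli-sample` target.

```python
from typing import Dict, List

CLI_PREFIX = "gcls_"  # short for Gneissweb CLaSsification


def to_cli_args(config: Dict[str, str]) -> List[str]:
    """Map each configuration key to a '--gcls_<key>' flag followed by its value."""
    args: List[str] = []
    for key, value in config.items():
        args.extend([f"--{CLI_PREFIX}{key}", str(value)])
    return args


args = to_cli_args({
    "model_file_name": "model.bin",
    "model_url": "facebook/fasttext-language-identification",
    "content_column_name": "text",
})
print(" ".join(args))
```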

### Code example
Here is a sample [notebook](gneissweb_classification.ipynb)

## Troubleshooting guide

For M1 Mac users: if you see the error `error: command '/usr/bin/clang' failed with exit code 1` while running a make command, follow [these steps](https://freeman.vc/notes/installing-fasttext-on-an-m1-mac).


### Transforming data using the transform image

To use the transform image to transform your data, please refer to the
[running images quickstart](../../../doc/quick-start/run-transform-image.md),
substituting the name of this transform image and runtime as appropriate.

# Gneissweb Classification Ray Transform
Please see the set of
[transform project conventions](../../README.md#transform-project-conventions)
for details on general project conventions, transform configuration,
testing and IDE set up.

## Summary
This project wraps the gneissweb classification transform with a Ray runtime.

## Configuration and command line Options

Gneissweb Classification configuration and command line options are the same as for the base python transform.

### Launched Command Line Options
In addition to those available to the transform as defined here,
the set of
[launcher options](../../../data-processing-lib/doc/launcher-options.md) are available.

### Code example (Ray version)
Here is a sample [notebook](gneissweb_classification-ray.ipynb)

### Transforming data using the transform image

To use the transform image to transform your data, please refer to the
[running images quickstart](../../../doc/quick-start/run-transform-image.md),
substituting the name of this transform image and runtime as appropriate.