Merge branch 'dev' into python3.12
roytman committed Sep 27, 2024
2 parents 683197a + bb31e4c commit 73f695e
Showing 30 changed files with 109 additions and 58 deletions.
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
#
# DO NOT EDIT THIS FILE: it is generated from test-transform.template, Edit there and run make to change these files
#
name: Test - transforms/universal/html2parquet
name: Test - transforms/language/html2parquet

on:
workflow_dispatch:
@@ -12,9 +12,9 @@ on:
tags:
- "*"
paths:
- "transforms/universal/html2parquet/**"
- "transforms/language/html2parquet/**"
- "data-processing-lib/**"
- "!transforms/universal/html2parquet/**/kfp_ray/**" # This is/will be tested in separate workflow
- "!transforms/language/html2parquet/**/kfp_ray/**" # This is/will be tested in separate workflow
- "!data-processing-lib/**/test/**"
- "!data-processing-lib/**/test-data/**"
- "!**.md"
@@ -26,9 +26,9 @@ on:
- "dev"
- "releases/**"
paths:
- "transforms/universal/html2parquet/**"
- "transforms/language/html2parquet/**"
- "data-processing-lib/**"
- "!transforms/universal/html2parquet/**/kfp_ray/**" # This is/will be tested in separate workflow
- "!transforms/language/html2parquet/**/kfp_ray/**" # This is/will be tested in separate workflow
- "!data-processing-lib/**/test/**"
- "!data-processing-lib/**/test-data/**"
- "!**.md"
@@ -72,12 +72,12 @@ jobs:
sudo rm -rf /usr/share/dotnet /opt/ghc /usr/local/lib/android /usr/local/share/powershell /usr/share/swift /usr/local/.ghcup
sudo docker rmi $(docker image ls -aq) >/dev/null 2>&1 || true
df -h
- name: Test transform source in transforms/universal/html2parquet
- name: Test transform source in transforms/language/html2parquet
run: |
if [ -e "transforms/universal/html2parquet/Makefile" ]; then
make -C transforms/universal/html2parquet DOCKER=docker test-src
if [ -e "transforms/language/html2parquet/Makefile" ]; then
make -C transforms/language/html2parquet DOCKER=docker test-src
else
echo "transforms/universal/html2parquet/Makefile not found - source testing disabled for this transform."
echo "transforms/language/html2parquet/Makefile not found - source testing disabled for this transform."
fi
test-image:
needs: [check_if_push_image]
@@ -99,15 +99,15 @@ jobs:
sudo rm -rf /usr/share/dotnet /opt/ghc /usr/local/lib/android /usr/local/share/powershell /usr/share/swift /usr/lib/jvm /usr/local/.ghcup
sudo docker rmi $(docker image ls -aq) >/dev/null 2>&1 || true
df -h
- name: Test transform image in transforms/universal/html2parquet
- name: Test transform image in transforms/language/html2parquet
run: |
if [ -e "transforms/universal/html2parquet/Makefile" ]; then
if [ -d "transforms/universal/html2parquet/spark" ]; then
if [ -e "transforms/language/html2parquet/Makefile" ]; then
if [ -d "transforms/language/html2parquet/spark" ]; then
make -C data-processing-lib/spark DOCKER=docker image
fi
make -C transforms/universal/html2parquet DOCKER=docker test-image
make -C transforms/language/html2parquet DOCKER=docker test-image
else
echo "transforms/universal/html2parquet/Makefile not found - testing disabled for this transform."
echo "transforms/language/html2parquet/Makefile not found - testing disabled for this transform."
fi
- name: Print space
# Free space as indicated here : https://github.com/actions/runner-images/issues/2840#issuecomment-790492173
@@ -117,8 +117,8 @@ jobs:
- name: Publish images
if: needs.check_if_push_image.outputs.publish_images == 'true'
run: |
if [ -e "transforms/universal/html2parquet/Makefile" ]; then
make -C transforms/universal/html2parquet publish
if [ -e "transforms/language/html2parquet/Makefile" ]; then
make -C transforms/language/html2parquet publish
else
echo "transforms/universal/html2parquet/Makefile not found - publishing disabled for this transform."
echo "transforms/language/html2parquet/Makefile not found - publishing disabled for this transform."
fi
46 changes: 27 additions & 19 deletions RELEASE.md
@@ -1,12 +1,12 @@
# Release Management

## Overview
Release are created from the main repository branch using the version
Releases are created from the main repository branch using the version
numbers, including an intermediate version suffix,
defined in `.make.versions`.
The following points are important:

1. In general, common a version number is used for all published pypi wheels and docker images.
1. In general, a common version number is used for all published pypi wheels and docker images.
1. `.make.versions` contains the version to be used when publishing the **next** release.
1. Whenever `.make.versions` is changed, `make set-versions` should be run from the top of the repo.
1. Corollary: `make set-versions` should ONLY be used from the top of the repo when `.make.versions` changes.
@@ -20,29 +20,35 @@ allows intermediate publishing from the main branch using version X.Y.Z.dev\<N\>
## Cutting the release
Creating the release involves

1. Creating a release branch and tag and updating the main branch versions.
1. Creating a github release from the release branch and tag.
1. Edit the `release-notes.md` to list major/minor changes
1. Creating a release branch and updating the main branch versions (using `release-branch.sh`).
1. Creating a github release and tag from the release branch.
1. Building and publishing pypi library wheels and docker registry image.

Each is discussed below.

### Creating release branch and tag
### Editing release-notes.md
Make a dummy release on github (see below) to get a listing of all commits.
Use this to come up with the items.
Commit this to the main branch so it is ready for including in the release branch.

### Creating release branch
The `scripts/release-branch.sh` is currently run manually to create the branch and tags as follows:

1. Creates the `releases/vX.Y.Z` from the main branch where `X.Y.Z` are defined in .make.versions
1. Creates the `vX.Y.Z` branch for PR'ing back into the `releases/vX.Y.Z` branch.
1. In the new `vX.Y.Z` branch
1. Nulls out the version suffix in the new branch's `.make.version` file.
1. Applies the unsuffixed versions to the artifacts published from the repo using `make set-versions`..
1. Commits and pushes branch and tag
1. Commits and pushes branch
1. Creates the `pending-version-change/vX.Y.Z` branch for PR'ing back into the main branch.
1. In the `pending-version-change/vX.Y.Z` branch
1. Increments the minor version (i.e. Z+1) and resets the suffix to `dev0` in `.make.versions`.
1. Commits and pushes branch

To double-check the version that will be published from the release,
```
git checkout releasing/vX.Y.Z
git checkout vX.Y.Z
make show-version
```
This will print for example, 1.2.3.
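The version bump applied to the `pending-version-change/vX.Y.Z` branch (increment Z, reset the suffix to `dev0`) can be sketched as follows. This is a minimal Python illustration, not the `release-branch.sh` script itself: the function name `next_dev_version` and the plain `X.Y.Z` string input are assumptions made for the example, whereas the real script edits `.make.versions`.

```python
def next_dev_version(released: str) -> str:
    """Return the next development version after a release 'X.Y.Z'.

    Hypothetical helper for illustration only: it increments Z and
    appends the 'dev0' suffix, mirroring the steps described above.
    """
    major, minor, patch = (int(part) for part in released.split("."))
    return f"{major}.{minor}.{patch + 1}.dev0"


print(next_dev_version("0.2.1"))  # 0.2.2.dev0
```

Under these assumptions, releasing 0.2.1 leaves the main branch at 0.2.2.dev0, which matches the `data-prep-toolkit==0.2.2.dev0` pins elsewhere in this commit.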
@@ -58,20 +64,22 @@ After running the script, you should
2. Use the github web UI to create a git release and tag of the `releases/vX.Y.Z` branch
3. Create a pull request from branch `pending-version-change/vX.Y.Z` into the main branch, and merge.

### Github release
### Creating the Github Release
After running the `release-branch.sh` script, to create tag `vX.Y.Z` and branch `releases/vX.Y.Z`
and PRing/merging `vX.Y.Z` into `releases/vX.Y.Z`.
1. Go to the [releases page](https://github.com/IBM/data-prep-kit/releases).
2. Select `Draft a new release`
3. Select `Choose a tag -> vX.Y.Z`
4. Press `Generate release notes`
5. Add a title (e.g., Release X.Y.Z)
6. Add any additional relese notes.
7. Press `Publish release`

### Publishing wheels and images
After creating the release branch and tag using the `scripts/release-branch.sh` script:

1. Switch to a release branch (e.g. releases/v1.2.3) created by the `release-branch.sh` script
1. Select `Draft a new release`
1. Select target branch `releases/vX.Y.Z`
1. Select `Choose a tag`, type in vX.Y.Z, click `Create tag`
1. Press `Generate release notes`
1. Add a title (e.g., Release X.Y.Z)
1. Add any additional release notes.
1. Press `Publish release`

### Building and Publishing Wheels and Images
After creating the release and tag on github:

1. Switch to a release branch (e.g. releases/v1.2.3).
1. Be sure you're at the top of the repository (`.../data-prep-kit`)
1. Optionally, `make show-version` to see the version that will be published
1. Running the following, either manually or in a git action
1 change: 0 additions & 1 deletion data-processing-lib/spark/Makefile
@@ -4,7 +4,6 @@ include $(REPOROOT)/.make.defaults
SPARK_VERSION=3.5.2
DOCKER_IMAGE_NAME=data-prep-kit-spark-$(SPARK_VERSION)
DOCKER_IMAGE_LIB_NAME=data-prep-kit-spark
DOCKER_IMAGE_VERSION := latest


.check-env::
31 changes: 31 additions & 0 deletions release-notes.md
@@ -1,5 +1,36 @@
# Data Prep Kit Release notes

## Release 0.2.1 - 9/24/2024

### General
1. Bug fixes across the repo
1. Added AI Alliance RAG demo, tutorials and notebooks and tips for running on google colab
1. Added new transforms and single package for transforms published to pypi
1. Improved CI/CD with targeted workflow triggered on specific changes to specific modules
1. New enhancements for cutting a release


### data-prep-toolkit libraries (python, ray, spark)

1. Restructure the repository to distinguish/separate runtime libraries
1. Split data-processing-lib/ray into python and ray
1. Spark runtime
1. Updated pyarrow version
1. Define required transform() method as abstract to AbstractTableTransform
1. Enables configuration of makefile to use src or pypi for data-prep-kit library dependencies


### KFP Workloads

1. Add a configurable timeout before destroying the deployed Ray cluster.

### Transforms

1. Added 7 new transforms including: language identification, profiler, repo level ordering, doc quality, pdf2parquet, HTML2Parquet and PII Transform
1. Added ededup python implementation and incremental ededup
1. Added fuzzy floating point comparison


## Release 0.2.0 - 6/27/2024

### General
File renamed without changes.
@@ -19,6 +19,7 @@ RUN cd data-processing-lib-python && pip install --no-cache-dir -e .

COPY --chown=dpk:root src/ src/
COPY --chown=dpk:root pyproject.toml pyproject.toml
COPY --chown=dpk:root requirements.txt requirements.txt
RUN pip install --no-cache-dir -e .

# copy transform main() entry point to the image
@@ -9,16 +9,15 @@ authors = [
{ name = "Sungeun An", email = "[email protected]" },
{ name = "Syed Zawad", email = "[email protected]" },
]
dependencies = [
"data-prep-toolkit==0.2.2.dev0",
"trafilatura==1.12.0"

]
dynamic = ["dependencies"]

[build-system]
requires = ["setuptools>=68.0.0", "wheel", "setuptools_scm[toml]>=7.1.0"]
build-backend = "setuptools.build_meta"

[tool.setuptools.dynamic]
dependencies = {file = ["requirements.txt"]}

[project.optional-dependencies]
dev = [
"twine",
2 changes: 2 additions & 0 deletions transforms/language/html2parquet/python/requirements.txt
@@ -0,0 +1,2 @@
data-prep-toolkit==0.2.2.dev0
trafilatura==1.12.0
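With `dynamic = ["dependencies"]` and the `[tool.setuptools.dynamic]` table, setuptools reads the dependency list from `requirements.txt` at build time, which is presumably why the Dockerfile now copies that file before running `pip install -e .`. The sketch below only approximates that reading step. `read_requirements` is a hypothetical helper, not the setuptools implementation, and it assumes a simple file with one requirement per line.

```python
def read_requirements(text: str) -> list[str]:
    """Return requirement strings, skipping blank lines and '#' comments."""
    requirements = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            requirements.append(line)
    return requirements


# Contents of the new requirements.txt shown above.
sample = "data-prep-toolkit==0.2.2.dev0\ntrafilatura==1.12.0\n"
print(read_requirements(sample))
```

Keeping the pins in `requirements.txt` lets the same file serve both `pip install -r requirements.txt` and the package metadata.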
@@ -33,7 +33,6 @@ def get_test_transform_fixtures(self) -> list[tuple]:
"html2parquet_output_format": "markdown",
}
# this is added as a fixture to remove these columns from comparison
ignore_columns = ["date_acquired", "document_id", "pdf_convert_time", "hash"]
ignore_columns = ["date_acquired"]

fixtures = []
1 change: 1 addition & 0 deletions transforms/language/pdf2parquet/python/Dockerfile
@@ -26,6 +26,7 @@ RUN cd data-processing-lib-python && pip install --no-cache-dir -e .
# END OF STEPS destined for a data-prep-kit base image

COPY --chown=dpk:root pyproject.toml pyproject.toml
COPY --chown=dpk:root requirements.txt requirements.txt
RUN pip install ${PIP_INSTALL_EXTRA_ARGS} --no-cache-dir -e .

# Download models
12 changes: 4 additions & 8 deletions transforms/language/pdf2parquet/python/pyproject.toml
@@ -9,19 +9,15 @@ authors = [
{ name = "Michele Dolfi", email = "[email protected]" },
{ name = "Christoph Auer", email = "[email protected]" },
]
dependencies = [
"data-prep-toolkit==0.2.2.dev0",
"docling-core==1.2.0",
"docling-ibm-models==1.1.7",
"deepsearch-glm==0.21.0",
"docling==1.11.0",
"filetype >=1.2.0, <2.0.0",
]
dynamic = ["dependencies"]

[build-system]
requires = ["setuptools>=68.0.0", "wheel", "setuptools_scm[toml]>=7.1.0"]
build-backend = "setuptools.build_meta"

[tool.setuptools.dynamic]
dependencies = {file = ["requirements.txt"]}

[project.optional-dependencies]
dev = [
"twine",
6 changes: 6 additions & 0 deletions transforms/language/pdf2parquet/python/requirements.txt
@@ -0,0 +1,6 @@
data-prep-toolkit==0.2.2.dev0
docling-core==1.3.0
docling-ibm-models==1.1.7
deepsearch-glm==0.21.0
docling==1.11.0
filetype >=1.2.0, <2.0.0
1 change: 1 addition & 0 deletions transforms/language/pdf2parquet/ray/Dockerfile
@@ -27,6 +27,7 @@ RUN cd python-transform && pip install ${PIP_INSTALL_EXTRA_ARGS} --no-cache-dir


COPY --chown=ray:users pyproject.toml pyproject.toml
COPY --chown=ray:users requirements.txt requirements.txt
RUN pip install ${PIP_INSTALL_EXTRA_ARGS} --no-cache-dir -e .

# Download models
9 changes: 5 additions & 4 deletions transforms/language/pdf2parquet/ray/pyproject.toml
@@ -9,15 +9,16 @@ authors = [
{ name = "Michele Dolfi", email = "[email protected]" },
{ name = "Christoph Auer", email = "[email protected]" },
]
dependencies = [
"dpk-pdf2parquet-transform-python==0.2.2.dev0",
"data-prep-toolkit-ray==0.2.2.dev0",
]

dynamic = ["dependencies"]

[build-system]
requires = ["setuptools>=68.0.0", "wheel", "setuptools_scm[toml]>=7.1.0"]
build-backend = "setuptools.build_meta"

[tool.setuptools.dynamic]
dependencies = {file = ["requirements.txt"]}

[project.optional-dependencies]
dev = [
"twine",
7 changes: 7 additions & 0 deletions transforms/language/pdf2parquet/ray/requirements.txt
@@ -0,0 +1,7 @@
dpk-pdf2parquet-transform-python==0.2.2.dev0
data-prep-toolkit-ray==0.2.2.dev0
docling-core==1.3.0
docling-ibm-models==1.1.7
deepsearch-glm==0.21.0
docling==1.11.0
filetype >=1.2.0, <2.0.0
2 changes: 1 addition & 1 deletion transforms/universal/doc_id/spark/Dockerfile
@@ -1,4 +1,4 @@
ARG BASE_IMAGE=quay.io/dataprep1/data-prep-kit/data-prep-kit-spark-3.5.2:0.2.1.dev0
ARG BASE_IMAGE=quay.io/dataprep1/data-prep-kit/data-prep-kit-spark-3.5.2:latest
FROM ${BASE_IMAGE}

USER root
2 changes: 1 addition & 1 deletion transforms/universal/filter/spark/Dockerfile
@@ -1,4 +1,4 @@
ARG BASE_IMAGE=quay.io/dataprep1/data-prep-kit/data-prep-kit-spark-3.5.2:0.2.1.dev0
ARG BASE_IMAGE=quay.io/dataprep1/data-prep-kit/data-prep-kit-spark-3.5.2:latest
FROM ${BASE_IMAGE}

USER root
2 changes: 1 addition & 1 deletion transforms/universal/noop/spark/Dockerfile
@@ -1,4 +1,4 @@
ARG BASE_IMAGE=quay.io/dataprep1/data-prep-kit/data-prep-kit-spark-3.5.2:0.2.1.dev0
ARG BASE_IMAGE=quay.io/dataprep1/data-prep-kit/data-prep-kit-spark-3.5.2:latest
FROM ${BASE_IMAGE}

USER root