Merge pull request #927 from matouma/missing-ray-notebook
added missing ray notebooks for doc_quality and filter
touma-I authored Jan 14, 2025
2 parents ae2575f + f977479 commit 4d361b0
Showing 16 changed files with 9,557 additions and 4,433 deletions.
6 changes: 6 additions & 0 deletions transforms/README-list.md
@@ -22,6 +22,7 @@ Note: This list includes the transforms that were part of the release starting w
* [header_cleanser (Not available on MacOS)](https://github.com/IBM/data-prep-kit/blob/dev/transforms/code/header_cleanser/python/README.md)
* [code_quality](https://github.com/IBM/data-prep-kit/blob/dev/transforms/code/code_quality/python/README.md)
* [proglang_select](https://github.com/IBM/data-prep-kit/blob/dev/transforms/code/proglang_select/python/README.md)
* [code_profiler](https://github.com/IBM/data-prep-kit/blob/dev/transforms/code/code_profiler/README.md)
* language
* [doc_chunk](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/doc_chunk/README.md)
* [doc_quality](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/doc_quality/README.md)
@@ -40,6 +41,11 @@ Note: This list includes the transforms that were part of the release starting w

## Release notes:

### 1.0.0.a4
Added missing Ray implementations for lang_id, doc_quality, tokenization, and filter
Added Ray notebooks for lang_id, doc_quality, tokenization, and filter
### 1.0.0.a3
Added code_profiler
### 1.0.0.a2
Relaxed the dependency on pandas (use the latest, or whatever is installed by the application)
Relaxed the dependency on requests (use the latest, or whatever is installed by the application)
9 changes: 4 additions & 5 deletions transforms/language/doc_quality/README.md
@@ -91,11 +91,6 @@ To see results of the transform.

[notebook](./doc_quality.ipynb)

### Transforming data using the transform image

To use the transform image to transform your data, please refer to the
[running images quickstart](../../../doc/quick-start/run-transform-image.md),
substituting the name of this transform image and runtime as appropriate.

## Testing

@@ -161,6 +156,10 @@ ls output
```
To see results of the transform.

### Code example (Ray)

[notebook](./doc_quality-ray.ipynb)


#### Transforming data using the transform image

140 changes: 140 additions & 0 deletions transforms/language/doc_quality/doc_quality-ray.ipynb
@@ -0,0 +1,140 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "afd55886-5f5b-4794-838e-ef8179fb0394",
"metadata": {},
"source": [
"##### **** These pip installs need to be adapted to use the appropriate release level. Alternatively, the venv running the jupyter lab could be pre-configured with a requirements file that includes the right release. Example for transform developers working from a git clone:\n",
"```\n",
"make venv \n",
"source venv/bin/activate \n",
"pip install jupyterlab\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4c45c3c6-e4d7-4e61-8de6-32d61f2ce695",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"%%capture\n",
"## This is here as a reference only\n",
"# Users and application developers must use the right tag for the latest from pypi\n",
"%pip install \"data-prep-toolkit-transforms[ray,doc_quality]==1.0.0a4\""
]
},
{
"cell_type": "markdown",
"id": "407fd4e4-265d-4ec7-bbc9-b43158f5f1f3",
"metadata": {
"jp-MarkdownHeadingCollapsed": true
},
"source": [
"##### **** Configure the transform parameters. The dictionary keys holding the DocQualityTransform configuration values are as follows: \n",
"* text_lang - the language of the text content. By default, \"en\" is used.\n",
"* doc_content_column - the name of the column containing the document text. By default, \"contents\" is used.\n",
"* bad_word_filepath - the path to a local file or directory of bad words. This parameter can be omitted if bad-word scoring is not needed.\n",
"#####"
]
},
{
"cell_type": "markdown",
"id": "ebf1f782-0e61-485c-8670-81066beb734c",
"metadata": {},
"source": [
"##### **** Import the required classes and modules"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c2a12abc-9460-4e45-8961-873b48a9ab19",
"metadata": {},
"outputs": [],
"source": [
"from dpk_doc_quality.ray.transform import DocQuality\n",
"from data_processing.utils import GB"
]
},
{
"cell_type": "markdown",
"id": "7234563c-2924-4150-8a31-4aec98c1bf33",
"metadata": {},
"source": [
"##### **** Set up runtime parameters and invoke the transform"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "95737436",
"metadata": {},
"outputs": [],
"source": [
"%%capture\n",
"DocQuality(input_folder='test-data/input',\n",
" output_folder= 'output',\n",
" run_locally= True,\n",
" num_cpus= 0.8,\n",
" memory= 2 * GB,\n",
" runtime_num_workers = 3,\n",
" runtime_creation_delay = 0,\n",
" docq_text_lang = \"en\",\n",
" docq_doc_content_column =\"contents\").transform()"
]
},
{
"cell_type": "markdown",
"id": "c3df5adf-4717-4a03-864d-9151cd3f134b",
"metadata": {},
"source": [
"##### **** The specified folder will include the transformed parquet files."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7276fe84-6512-4605-ab65-747351e13a7c",
"metadata": {},
"outputs": [],
"source": [
"import glob\n",
"glob.glob(\"output/*\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "845a75cf-f4a9-467d-87fa-ccbac1c9beb8",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
44 changes: 5 additions & 39 deletions transforms/language/doc_quality/doc_quality.ipynb
@@ -23,16 +23,13 @@
"%%capture\n",
"## This is here as a reference only\n",
"# Users and application developers must use the right tag for the latest from pypi\n",
-"%pip install data-prep-toolkit\n",
-"%pip install data-prep-toolkit-transforms[doc_quality]"
+"%pip install \"data-prep-toolkit-transforms[doc_quality]==1.0.0a4\""
]
},
{
"cell_type": "markdown",
"id": "407fd4e4-265d-4ec7-bbc9-b43158f5f1f3",
-"metadata": {
-"jp-MarkdownHeadingCollapsed": true
-},
+"metadata": {},
"source": [
"##### **** Configure the transform parameters. The dictionary keys holding the DocQualityTransform configuration values are as follows: \n",
"* text_lang - the language of the text content. By default, \"en\" is used.\n",
@@ -72,27 +69,7 @@
"execution_count": null,
"id": "95737436",
"metadata": {},
-"outputs": [
-{
-"name": "stderr",
-"output_type": "stream",
-"text": [
-"11:54:20 INFO - doc_quality parameters are : {'text_lang': 'en', 'doc_content_column': 'contents', 'bad_word_filepath': '/Users/touma/data-prep-kit-pkg/transforms/language/doc_quality/dpk_doc_quality/ldnoobw/en', 's3_cred': None, 'docq_data_factory': <data_processing.data_access.data_access_factory.DataAccessFactory object at 0x10b1d1c50>}\n",
-"11:54:20 INFO - pipeline id pipeline_id\n",
-"11:54:20 INFO - code location None\n",
-"11:54:20 INFO - data factory data_ is using local data access: input_folder - test-data/input output_folder - output\n",
-"11:54:20 INFO - data factory data_ max_files -1, n_sample -1\n",
-"11:54:20 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n",
-"11:54:20 INFO - orchestrator docq started at 2024-12-04 11:54:20\n",
-"11:54:20 INFO - Number of files is 1, source profile {'max_file_size': 0.0009870529174804688, 'min_file_size': 0.0009870529174804688, 'total_file_size': 0.0009870529174804688}\n",
-"11:54:20 INFO - Load badwords found locally from /Users/touma/data-prep-kit-pkg/transforms/language/doc_quality/dpk_doc_quality/ldnoobw/en\n",
-"11:54:20 INFO - Completed 1 files (100.0%) in 0.002 min\n",
-"11:54:20 INFO - Done processing 1 files, waiting for flush() completion.\n",
-"11:54:20 INFO - done flushing in 0.0 sec\n",
-"11:54:20 INFO - Completed execution in 0.003 min, execution result 0\n"
-]
-}
-],
+"outputs": [],
"source": [
"%%capture\n",
"DocQuality(input_folder='test-data/input',\n",
@@ -111,21 +88,10 @@
},
{
"cell_type": "code",
-"execution_count": 4,
+"execution_count": null,
"id": "7276fe84-6512-4605-ab65-747351e13a7c",
"metadata": {},
-"outputs": [
-{
-"data": {
-"text/plain": [
-"['output/metadata.json', 'output/test1.parquet']"
-]
-},
-"execution_count": 4,
-"metadata": {},
-"output_type": "execute_result"
-}
-],
+"outputs": [],
"source": [
"import glob\n",
"glob.glob(\"output/*\")"
47 changes: 44 additions & 3 deletions transforms/language/doc_quality/dpk_doc_quality/ray/transform.py
@@ -10,14 +10,18 @@
# limitations under the License.
################################################################################

-import pyarrow as pa
-from data_processing.utils import get_logger
+import sys, os
+from data_processing.utils import ParamsUtils, get_logger
from data_processing_ray.runtime.ray import RayTransformLauncher
from data_processing_ray.runtime.ray.runtime_configuration import (
    RayTransformRuntimeConfiguration,
)
-from dpk_doc_quality.transform import DocQualityTransformConfiguration
+
+from dpk_doc_quality.transform import (
+    DocQualityTransformConfiguration,
+    bad_word_filepath_cli_param,
+    text_lang_cli_param,
+)

logger = get_logger(__name__)

@@ -37,6 +41,43 @@ def __init__(self):
        super().__init__(transform_config=DocQualityTransformConfiguration())


+# Convenience class used by the notebooks to configure and run the transform
+class DocQuality:
+    def __init__(self, **kwargs):
+        self.params = {}
+        for key in kwargs:
+            self.params[key] = kwargs[key]
+        # if input_folder and output_folder are specified, assume they represent data_local_config
+        try:
+            local_conf = {k: self.params[k] for k in ("input_folder", "output_folder")}
+            self.params["data_local_config"] = ParamsUtils.convert_to_ast(local_conf)
+            del self.params["input_folder"]
+            del self.params["output_folder"]
+        except KeyError:
+            pass
+        # if num_cpus and memory are specified, fold them into runtime_worker_options
+        try:
+            worker_options = {k: self.params[k] for k in ("num_cpus", "memory")}
+            self.params["runtime_worker_options"] = ParamsUtils.convert_to_ast(worker_options)
+            del self.params["num_cpus"]
+            del self.params["memory"]
+        except KeyError:
+            pass
+
+        if text_lang_cli_param not in self.params:
+            self.params[text_lang_cli_param] = "en"
+        if bad_word_filepath_cli_param not in self.params:
+            self.params[bad_word_filepath_cli_param] = os.path.abspath(
+                os.path.join(os.path.dirname(__file__), "../ldnoobw", self.params[text_lang_cli_param])
+            )
+
+    def transform(self):
+        sys.argv = ParamsUtils.dict_to_req(d=self.params)
+        launcher = RayTransformLauncher(DocQualityRayTransformConfiguration())
+        return_code = launcher.launch()
+        return return_code
+
+
if __name__ == "__main__":
    launcher = RayTransformLauncher(DocQualityRayTransformConfiguration())
    logger.info("Launching doc_quality transform")
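The `DocQuality` wrapper above folds notebook-style keyword arguments into the flat parameter namespace the launcher expects: `input_folder`/`output_folder` collapse into `data_local_config`, and `num_cpus`/`memory` into `runtime_worker_options`. A minimal, dependency-free sketch of that folding step (using `repr()` as a hypothetical stand-in for `ParamsUtils.convert_to_ast`, and an assumed `GB` constant) looks like this:

```python
def fold_params(**kwargs):
    """Fold notebook-style kwargs into launcher-style parameters (sketch)."""
    params = dict(kwargs)
    # Read both keys first (raises KeyError if either is missing), then delete;
    # this leaves params untouched when only one key of the pair is present.
    try:
        local_conf = {k: params[k] for k in ("input_folder", "output_folder")}
        params["data_local_config"] = repr(local_conf)  # stand-in for convert_to_ast
        del params["input_folder"]
        del params["output_folder"]
    except KeyError:
        pass
    try:
        worker_options = {k: params[k] for k in ("num_cpus", "memory")}
        params["runtime_worker_options"] = repr(worker_options)
        del params["num_cpus"]
        del params["memory"]
    except KeyError:
        pass
    return params

GB = 1024**3  # assumed value of data_processing.utils.GB

params = fold_params(input_folder="test-data/input",
                     output_folder="output",
                     num_cpus=0.8,
                     memory=2 * GB,
                     docq_text_lang="en")
print(sorted(params))
# -> ['data_local_config', 'docq_text_lang', 'runtime_worker_options']
```

Reading both keys of a pair before deleting either keeps the fold atomic: passing only one of `input_folder`/`output_folder` leaves the params dict unchanged rather than half-folded.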
34 changes: 32 additions & 2 deletions transforms/language/lang_id/dpk_lang_id/ray/transform.py
@@ -10,8 +10,8 @@
# limitations under the License.
################################################################################

-import pyarrow as pa
-from data_processing.utils import get_logger
+import sys
+from data_processing.utils import ParamsUtils, get_logger
from data_processing_ray.runtime.ray import RayTransformLauncher
from data_processing_ray.runtime.ray.runtime_configuration import (
    RayTransformRuntimeConfiguration,
@@ -36,6 +36,36 @@ def __init__(self):
        """
        super().__init__(transform_config=LangIdentificationTransformConfiguration())

+# Convenience class used by the notebooks to configure and run the transform
+class LangId:
+    def __init__(self, **kwargs):
+        self.params = {}
+        for key in kwargs:
+            self.params[key] = kwargs[key]
+        # if input_folder and output_folder are specified, assume they represent data_local_config
+        try:
+            local_conf = {k: self.params[k] for k in ("input_folder", "output_folder")}
+            self.params["data_local_config"] = ParamsUtils.convert_to_ast(local_conf)
+            del self.params["input_folder"]
+            del self.params["output_folder"]
+        except KeyError:
+            pass
+        # if num_cpus and memory are specified, fold them into runtime_worker_options
+        try:
+            worker_options = {k: self.params[k] for k in ("num_cpus", "memory")}
+            self.params["runtime_worker_options"] = ParamsUtils.convert_to_ast(worker_options)
+            del self.params["num_cpus"]
+            del self.params["memory"]
+        except KeyError:
+            pass
+
+    def transform(self):
+        sys.argv = ParamsUtils.dict_to_req(d=self.params)
+        # create the launcher
+        launcher = RayTransformLauncher(LangIdentificationRayTransformConfiguration())
+        # launch the transform
+        return_code = launcher.launch()
+        return return_code
+
+
if __name__ == "__main__":
    launcher = RayTransformLauncher(LangIdentificationRayTransformConfiguration())
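Both wrappers hand control to the launcher by rewriting `sys.argv` from the params dict via `ParamsUtils.dict_to_req`. The exact output format of that helper is not shown in this diff, so the following is only a rough, illustrative approximation of the dict-to-argv conversion (function name and `--key=value` formatting are assumptions):

```python
# Hypothetical stand-in for ParamsUtils.dict_to_req: turn a params dict into
# an argv-style list of "--key=value" strings that the launcher's argument
# parsing can consume. Illustrative only; not the real helper.
def dict_to_argv(params, prog="transform"):
    argv = [prog]
    for key, value in params.items():
        argv.append(f"--{key}={value}")
    return argv

print(dict_to_argv({"docq_text_lang": "en", "runtime_num_workers": 3}))
# -> ['transform', '--docq_text_lang=en', '--runtime_num_workers=3']
```

This is why the wrapper classes can accept arbitrary `**kwargs`: every remaining key in the dict simply becomes a CLI flag for the underlying `RayTransformLauncher`.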