Merge pull request #927 from matouma/missing-ray-notebook
added missing ray notebooks for doc_quality and filter
Showing 16 changed files with 9,557 additions and 4,433 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,140 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"id": "afd55886-5f5b-4794-838e-ef8179fb0394", | ||
"metadata": {}, | ||
"source": [ | ||
"##### **** These pip installs need to be adapted to use the appropriate release level. Alternatively, The venv running the jupyter lab could be pre-configured with a requirement file that includes the right release. Example for transform developers working from git clone:\n", | ||
"```\n", | ||
"make venv \n", | ||
"source venv/bin/activate \n", | ||
"pip install jupyterlab\n", | ||
"```" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "4c45c3c6-e4d7-4e61-8de6-32d61f2ce695", | ||
"metadata": { | ||
"scrolled": true | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"%%capture\n", | ||
"## This is here as a reference only\n", | ||
"# Users and application developers must use the right tag for the latest from pypi\n", | ||
"%pip install \"data-prep-toolkit-transforms[ray,doc_quality]==1.0.0a4\"" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "407fd4e4-265d-4ec7-bbc9-b43158f5f1f3", | ||
"metadata": { | ||
"jp-MarkdownHeadingCollapsed": true | ||
}, | ||
"source": [ | ||
"##### **** Configure the transform parameters. The set of dictionary keys holding DocQualityTransform configuration for values are as follows: \n", | ||
"* text_lang - specifies language used in the text content. By default, \"en\" is used.\n", | ||
"* doc_content_column - specifies column name that contains document text. By default, \"contents\" is used.\n", | ||
"* bad_word_filepath - specifies a path to bad word file: local folder (file or directory) that points to bad word file. You don't have to set this parameter if you don't need to set bad words.\n", | ||
"#####" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "ebf1f782-0e61-485c-8670-81066beb734c", | ||
"metadata": {}, | ||
"source": [ | ||
"##### ***** Import required classes and modules" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "c2a12abc-9460-4e45-8961-873b48a9ab19", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from dpk_doc_quality.ray.transform import DocQuality\n", | ||
"from data_processing.utils import GB" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "7234563c-2924-4150-8a31-4aec98c1bf33", | ||
"metadata": {}, | ||
"source": [ | ||
"##### ***** Setup runtime parameters and invoke the transform" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "95737436", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"%%capture\n", | ||
"DocQuality(input_folder='test-data/input',\n", | ||
" output_folder= 'output',\n", | ||
" run_locally= True,\n", | ||
" num_cpus= 0.8,\n", | ||
" memory= 2 * GB,\n", | ||
" runtime_num_workers = 3,\n", | ||
" runtime_creation_delay = 0,\n", | ||
" docq_text_lang = \"en\",\n", | ||
" docq_doc_content_column =\"contents\").transform()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "c3df5adf-4717-4a03-864d-9151cd3f134b", | ||
"metadata": {}, | ||
"source": [ | ||
"##### **** The specified folder will include the transformed parquet files." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "7276fe84-6512-4605-ab65-747351e13a7c", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import glob\n", | ||
"glob.glob(\"output/*\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "845a75cf-f4a9-467d-87fa-ccbac1c9beb8", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3 (ipykernel)", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.11.10" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |
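
The notebook lists the configuration keys in short form (`text_lang`, `doc_content_column`, `bad_word_filepath`), while the `DocQuality(...)` call passes them with a `docq_` prefix. A minimal sketch of that mapping follows. The helper `build_docq_kwargs` and the constants around it are hypothetical and not part of data-prep-toolkit. The sketch also assumes that `bad_word_filepath` takes the same `docq_` prefix as the two keys shown in the call, which the notebook does not confirm.

```python
# Hypothetical helper, not part of data-prep-toolkit: it maps the short
# configuration keys listed in the notebook to prefixed keyword arguments.
DOCQ_PREFIX = "docq_"  # prefix seen in the notebook's DocQuality(...) call

# Defaults as stated in the notebook's parameter list.
DOCQ_DEFAULTS = {"text_lang": "en", "doc_content_column": "contents"}

# Short keys documented in the notebook. Prefixing bad_word_filepath the
# same way as the other two is an assumption made for this sketch.
DOCQ_KEYS = {"text_lang", "doc_content_column", "bad_word_filepath"}


def build_docq_kwargs(**overrides):
    """Merge overrides into the defaults and return docq_-prefixed kwargs."""
    unknown = set(overrides) - DOCQ_KEYS
    if unknown:
        raise ValueError(f"unknown doc_quality keys: {sorted(unknown)}")
    merged = {**DOCQ_DEFAULTS, **overrides}
    # Keys explicitly set to None are dropped, so optional parameters
    # such as bad_word_filepath can be left out entirely.
    return {DOCQ_PREFIX + key: value
            for key, value in merged.items() if value is not None}


if __name__ == "__main__":
    print(build_docq_kwargs(doc_content_column="text"))
```

Under those assumptions, the result could be unpacked into the notebook's call alongside the runtime parameters, for example `DocQuality(input_folder='test-data/input', output_folder='output', **build_docq_kwargs())`.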