Commit

use data-prep-toolkit-transforms==0.2.2.dev3
Signed-off-by: Daiki Tsuzuku <[email protected]>
dtsuzuku-ibm committed Nov 25, 2024
1 parent cf13388 commit edb605b
Showing 1 changed file with 20 additions and 21 deletions.
transforms/language/doc_quality/doc_quality.ipynb
@@ -15,7 +15,7 @@
},
{
"cell_type": "code",
- "execution_count": 7,
+ "execution_count": 1,
"id": "4c45c3c6-e4d7-4e61-8de6-32d61f2ce695",
"metadata": {},
"outputs": [],
@@ -24,8 +24,7 @@
"## This is here as a reference only\n",
"# Users and application developers must use the right tag for the latest from pypi\n",
"%pip install data-prep-toolkit\n",
- "%pip install data-prep-toolkit-transforms\n",
- "%pip install dpk-doc-quality-transform-python"
+ "%pip install data-prep-toolkit-transforms==0.2.2.dev3"
]
},
{
@@ -52,7 +51,7 @@
},
{
"cell_type": "code",
- "execution_count": 8,
+ "execution_count": 2,
"id": "c2a12abc-9460-4e45-8961-873b48a9ab19",
"metadata": {},
"outputs": [],
@@ -76,7 +75,7 @@
},
{
"cell_type": "code",
- "execution_count": 9,
+ "execution_count": 3,
"id": "e90a853e-412f-45d7-af3d-959e755aeebb",
"metadata": {},
"outputs": [],
@@ -114,27 +113,27 @@
},
{
"cell_type": "code",
- "execution_count": 10,
+ "execution_count": 4,
"id": "0775e400-7469-49a6-8998-bd4772931459",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
- "10:38:40 INFO - doc_quality parameters are : {'text_lang': 'en', 'doc_content_column': 'contents', 'bad_word_filepath': 'python/ldnoobw/en', 's3_cred': None, 'docq_data_factory': <data_processing.data_access.data_access_factory.DataAccessFactory object at 0x11206e010>}\n",
- "10:38:40 INFO - pipeline id pipeline_id\n",
- "10:38:40 INFO - code location {'github': 'github', 'commit_hash': '12345', 'path': 'path'}\n",
- "10:38:40 INFO - data factory data_ is using local data access: input_folder - python/test-data/input output_folder - python/output\n",
- "10:38:40 INFO - data factory data_ max_files -1, n_sample -1\n",
- "10:38:40 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n",
- "10:38:40 INFO - orchestrator docq started at 2024-11-22 10:38:40\n",
- "10:38:40 INFO - Number of files is 1, source profile {'max_file_size': 0.0009870529174804688, 'min_file_size': 0.0009870529174804688, 'total_file_size': 0.0009870529174804688}\n",
- "10:38:40 INFO - Load badwords found locally from python/ldnoobw/en\n",
- "10:38:49 INFO - Completed 1 files (100.0%) in 0.146 min\n",
- "10:38:49 INFO - Done processing 1 files, waiting for flush() completion.\n",
- "10:38:49 INFO - done flushing in 0.0 sec\n",
- "10:38:49 INFO - Completed execution in 0.146 min, execution result 0\n"
+ "12:39:07 INFO - doc_quality parameters are : {'text_lang': 'en', 'doc_content_column': 'contents', 'bad_word_filepath': 'python/ldnoobw/en', 's3_cred': None, 'docq_data_factory': <data_processing.data_access.data_access_factory.DataAccessFactory object at 0x12ae67650>}\n",
+ "12:39:07 INFO - pipeline id pipeline_id\n",
+ "12:39:07 INFO - code location {'github': 'github', 'commit_hash': '12345', 'path': 'path'}\n",
+ "12:39:07 INFO - data factory data_ is using local data access: input_folder - python/test-data/input output_folder - python/output\n",
+ "12:39:07 INFO - data factory data_ max_files -1, n_sample -1\n",
+ "12:39:07 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n",
+ "12:39:07 INFO - orchestrator docq started at 2024-11-25 12:39:07\n",
+ "12:39:07 INFO - Number of files is 1, source profile {'max_file_size': 0.0009870529174804688, 'min_file_size': 0.0009870529174804688, 'total_file_size': 0.0009870529174804688}\n",
+ "12:39:07 INFO - Load badwords found locally from python/ldnoobw/en\n",
+ "12:39:09 INFO - Completed 1 files (100.0%) in 0.033 min\n",
+ "12:39:09 INFO - Done processing 1 files, waiting for flush() completion.\n",
+ "12:39:09 INFO - done flushing in 0.0 sec\n",
+ "12:39:09 INFO - Completed execution in 0.033 min, execution result 0\n"
]
}
],
@@ -155,7 +154,7 @@
},
{
"cell_type": "code",
- "execution_count": 11,
+ "execution_count": 5,
"id": "7276fe84-6512-4605-ab65-747351e13a7c",
"metadata": {},
"outputs": [
@@ -165,7 +164,7 @@
"['python/output/metadata.json', 'python/output/test1.parquet']"
]
},
- "execution_count": 11,
+ "execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}