Update docs for refactoring #110

Merged · 4 commits · May 10, 2024
2 changes: 1 addition & 1 deletion data-processing-lib/doc/advanced-transform-tutorial.md
@@ -280,5 +280,5 @@ python ededup_transform.py --hash_cpu 0.5 --num_hashes 2 --doc_column "contents"
--s3_config "{'input_folder': 'cos-optimal-llm-pile/test/david/input/', 'output_folder': 'cos-optimal-llm-pile/test/david/output/'}"
```
This is a minimal set of options to run locally.
-See the [launcher options](launcher-options.md) for a complete list of
+See the [launcher options](ray-launcher-options.md) for a complete list of
transform-independent command line options.
Binary file removed data-processing-lib/doc/logo-ibm-dark.png
Binary file not shown.
Binary file removed data-processing-lib/doc/logo-ibm.png
Binary file not shown.
11 changes: 6 additions & 5 deletions data-processing-lib/doc/overview.md
@@ -13,15 +13,16 @@ developers of data transformation are:
* [Transformation](../src/data_processing/transform/table_transform.py) - a simple, easily-implemented interface defines
the specifics of a given data transformation.
* [Transform Configuration](../src/data_processing/runtime/ray/transform_runtime.py) - defines
-the transform implementation and runtime classes, the
-command line arguments specific to transform, and the short name for the transform.
-* [Transformation Runtime](../src/data_processing/runtime/ray/transform_runtime.py) - allows for customization of the Ray environment for the transformer.
-This might include provisioning of shared memory objects or creation of additional actors.
+the transform short name, its implementation class, and command line configuration
+parameters.

To learn more consider the following:

* [Transform Tutorials](transform-tutorials.md)
-* [Testing transformers with S3](using_s3_transformers.md)
+* [Transform Runtimes](transform-runtimes.md)
+* [Transform Examples](transform-tutorial-examples.md)
+* [Testing Transforms](transform-testing.md)
+* [Utilities](transformer-utilities.md)
* [Architecture Deep Dive](architecture.md)
* [Transform project root readme](../../transforms/README.md)
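
To make the refactored description concrete, here is a minimal sketch of the transformation interface, assuming the pyarrow Table-in/Table-out shape suggested by the linked table_transform.py; the class name and config handling are illustrative assumptions, not the library's exact API.

```python
# A minimal sketch, not the library's exact API: it assumes a transform
# receives a pyarrow Table and returns transformed table(s) plus metadata.
import pyarrow as pa


class MyTransform:
    """Specifics of one data transformation: table(s) in, table(s) out."""

    def __init__(self, config: dict):
        # Transform-specific configuration, e.g. parsed command line values.
        self.column = config.get("doc_column", "contents")

    def transform(self, table: pa.Table) -> tuple[list[pa.Table], dict]:
        # A no-op style example: pass the table through and report row counts.
        return [table], {"rows processed": table.num_rows}
```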

60 changes: 60 additions & 0 deletions data-processing-lib/doc/python-launcher-options.md
@@ -0,0 +1,60 @@
# Python Launcher Command Line Options
A number of command line options are available when launching a transform.

The following is a current --help output (a work in progress) for
the `NOOPTransform` (note the --noop_sleep_sec and --noop_pwd options):

```
usage: noop_python_runtime.py [-h] [--noop_sleep_sec NOOP_SLEEP_SEC] [--noop_pwd NOOP_PWD] [--data_s3_cred DATA_S3_CRED] [--data_s3_config DATA_S3_CONFIG] [--data_local_config DATA_LOCAL_CONFIG] [--data_max_files DATA_MAX_FILES]
[--data_checkpointing DATA_CHECKPOINTING] [--data_data_sets DATA_DATA_SETS] [--data_files_to_use DATA_FILES_TO_USE] [--data_num_samples DATA_NUM_SAMPLES] [--runtime_pipeline_id RUNTIME_PIPELINE_ID]
[--runtime_job_id RUNTIME_JOB_ID] [--runtime_code_location RUNTIME_CODE_LOCATION]

Driver for noop processing

options:
-h, --help show this help message and exit
--noop_sleep_sec NOOP_SLEEP_SEC
Sleep actor for a number of seconds while processing the data frame, before writing the file to COS
--noop_pwd NOOP_PWD A dummy password which should be filtered out of the metadata
--data_s3_cred DATA_S3_CRED
AST string of options for s3 credentials. Only required for S3 data access.
access_key: access key help text
secret_key: secret key help text
url: optional s3 url
region: optional s3 region
Example: { 'access_key': 'access', 'secret_key': 'secret',
'url': 'https://s3.us-east.cloud-object-storage.appdomain.cloud',
'region': 'us-east-1' }
--data_s3_config DATA_S3_CONFIG
AST string containing input/output paths.
input_folder: Path to input folder of files to be processed
output_folder: Path to output folder of processed files
Example: { 'input_folder': 's3-path/your-input-bucket',
'output_folder': 's3-path/your-output-bucket' }
--data_local_config DATA_LOCAL_CONFIG
ast string containing input/output folders using local fs.
input_folder: Path to input folder of files to be processed
output_folder: Path to output folder of processed files
Example: { 'input_folder': './input', 'output_folder': '/tmp/output' }
--data_max_files DATA_MAX_FILES
Max amount of files to process
--data_checkpointing DATA_CHECKPOINTING
checkpointing flag
--data_data_sets DATA_DATA_SETS
List of sub-directories of input directory to use for input. For example, ['dir1', 'dir2']
--data_files_to_use DATA_FILES_TO_USE
list of file extensions to choose for input.
--data_num_samples DATA_NUM_SAMPLES
number of random input files to process
--runtime_pipeline_id RUNTIME_PIPELINE_ID
pipeline id
--runtime_job_id RUNTIME_JOB_ID
job id
--runtime_code_location RUNTIME_CODE_LOCATION
AST string containing code location
github: Github repository URL.
commit_hash: github commit hash
path: Path within the repository
Example: { 'github': 'https://github.com/somerepo', 'commit_hash': '1324',
'path': 'transforms/universal/code' }
```
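
To sketch how these options are consumed, the snippet below drives the launcher programmatically by setting sys.argv before launching; `PythonTransformLauncher` comes from the python-runtime page below, while the import paths and the `NOOPTransformConfiguration` name are assumptions for illustration.

```python
# A hedged sketch: the import paths and NOOPTransformConfiguration are
# assumptions; only the command line flags mirror the --help output above.
import sys

from data_processing.runtime.pure_python import PythonTransformLauncher  # assumed path
from noop_transform import NOOPTransformConfiguration  # assumed module/class

sys.argv = [
    "noop_python_runtime.py",
    "--noop_sleep_sec", "1",
    "--data_local_config",
    "{'input_folder': './input', 'output_folder': '/tmp/output'}",
    "--data_max_files", "10",
]
launcher = PythonTransformLauncher(NOOPTransformConfiguration())
launcher.launch()
```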
12 changes: 12 additions & 0 deletions data-processing-lib/doc/python-runtime.md
@@ -0,0 +1,12 @@
## Python Runtime
The Python runtime provides a simple mechanism to run a transform on a set of input data to produce
a set of output data, all within the Python execution environment.

A `PythonTransformLauncher` class is provided to run the transform. For example,

```python
launcher = PythonTransformLauncher(YourTransformConfiguration())
launcher.launch()
```
The `YourTransformConfiguration` class configures your transform.
More details can be found in the [transform tutorial](transform-tutorials.md).
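
The configuration object ties together the transform's short name, its implementation class, and its command line parameters. Below is a hedged sketch of what `YourTransformConfiguration` might look like; the method names and the absence of a library base class are assumptions patterned on the overview above, not the exact API.

```python
from argparse import ArgumentParser, Namespace


# A hedged sketch of a transform configuration; in real code this would
# extend a library base class (assumed), and method names may differ.
class YourTransformConfiguration:
    def __init__(self):
        self.name = "yourtransform"   # short name, used to prefix CLI options
        self.params = {}

    def add_input_params(self, parser: ArgumentParser) -> None:
        # Transform-specific options, conventionally prefixed by the short name.
        parser.add_argument("--yourtransform_some_param", type=int, default=1)

    def apply_input_params(self, args: Namespace) -> bool:
        # Capture parsed values for the transform to use at runtime.
        self.params = {"some_param": args.yourtransform_some_param}
        return True
```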
data-processing-lib/doc/launcher-options.md → data-processing-lib/doc/ray-launcher-options.md
@@ -1,65 +1,57 @@
-# Launcher Command Line Options
+# Ray Launcher Command Line Options
A number of command line options are available when launching a transform.

The following is a current --help output (a work in progress) for
-the `NOOPTransform` (note the --noop_sleep_sec option):
+the `NOOPTransform` (note the --noop_sleep_sec and --noop_pwd options):

```
-usage: noop_transform.py [-h]
-[--run_locally RUN_LOCALLY]
-[--noop_sleep_sec NOOP_SLEEP_SEC]
-[--data_s3_cred DATA_S3_CRED]
-[--data_s3_config DATA_S3_CONFIG]
-[--data_local_config DATA_LOCAL_CONFIG]
-[--data_max_files DATA_MAX_FILES]
-[--data_checkpointing DATA_CHECKPOINTING]
-[--data_data_sets DATA_DATA_SETS]
-[--data_max_files MAX_FILES]
-[--data_files_to_use DATA_FILES_TO_USE]
-[--data_num_samples DATA_NUM_SAMPLES]
-[--runtime_num_workers NUM_WORKERS]
-[--runtime_worker_options WORKER_OPTIONS]
-[--runtime_pipeline_id PIPELINE_ID] [--job_id JOB_ID]
-[--runtime_creation_delay CREATION_DELAY]
-[--runtime_code_location CODE_LOCATION]
+usage: noop_transform.py [-h] [--run_locally RUN_LOCALLY] [--noop_sleep_sec NOOP_SLEEP_SEC] [--noop_pwd NOOP_PWD] [--data_s3_cred DATA_S3_CRED] [--data_s3_config DATA_S3_CONFIG] [--data_local_config DATA_LOCAL_CONFIG]
+[--data_max_files DATA_MAX_FILES] [--data_checkpointing DATA_CHECKPOINTING] [--data_data_sets DATA_DATA_SETS] [--data_files_to_use DATA_FILES_TO_USE] [--data_num_samples DATA_NUM_SAMPLES]
+[--runtime_num_workers RUNTIME_NUM_WORKERS] [--runtime_worker_options RUNTIME_WORKER_OPTIONS] [--runtime_creation_delay RUNTIME_CREATION_DELAY] [--runtime_pipeline_id RUNTIME_PIPELINE_ID]
+[--runtime_job_id RUNTIME_JOB_ID] [--runtime_code_location RUNTIME_CODE_LOCATION]

-Driver for NOOP processing
+Driver for noop processing

options:
-h, --help show this help message and exit
--run_locally RUN_LOCALLY
running ray local flag
--noop_sleep_sec NOOP_SLEEP_SEC
Sleep actor for a number of seconds while processing the data frame, before writing the file to COS
---data_s3_cred S3_CRED
-AST string of options for cos credentials. Only required for COS or Lakehouse.
+--noop_pwd NOOP_PWD A dummy password which should be filtered out of the metadata
+--data_s3_cred DATA_S3_CRED
+AST string of options for s3 credentials. Only required for S3 data access.
access_key: access key help text
secret_key: secret key help text
-cos_url: COS url
-Example: { 'access_key': 'access', 'secret_key': 'secret', 's3_url': 'https://s3.us-east.cloud-object-storage.appdomain.cloud' }
---data_s3_config S3_CONFIG
+url: optional s3 url
+region: optional s3 region
+Example: { 'access_key': 'access', 'secret_key': 'secret',
+'url': 'https://s3.us-east.cloud-object-storage.appdomain.cloud',
+'region': 'us-east-1' }
+--data_s3_config DATA_S3_CONFIG
AST string containing input/output paths.
input_folder: Path to input folder of files to be processed
output_folder: Path to output folder of processed files
-Example: { 'input_folder': 'your input folder', 'output_folder ': 'your output folder' }
---data_local_config LOCAL_CONFIG
+Example: { 'input_folder': 's3-path/your-input-bucket',
+'output_folder': 's3-path/your-output-bucket' }
+--data_local_config DATA_LOCAL_CONFIG
ast string containing input/output folders using local fs.
input_folder: Path to input folder of files to be processed
output_folder: Path to output folder of processed files
Example: { 'input_folder': './input', 'output_folder': '/tmp/output' }
---data_max_files MAX_FILES
+--data_max_files DATA_MAX_FILES
Max amount of files to process
---data_checkpointing CHECKPOINTING
+--data_checkpointing DATA_CHECKPOINTING
checkpointing flag
---data_data_sets DATA_SETS
-List of data sets
+--data_data_sets DATA_DATA_SETS
+List of sub-directories of input directory to use for input. For example, ['dir1', 'dir2']
--data_files_to_use DATA_FILES_TO_USE
-files extensions to use, default .parquet
+list of file extensions to choose for input.
--data_num_samples DATA_NUM_SAMPLES
-number of randomply picked files to use
---runtime_num_workers NUM_WORKERS
+number of random input files to process
+--runtime_num_workers RUNTIME_NUM_WORKERS
number of workers
---runtime_worker_options WORKER_OPTIONS
+--runtime_worker_options RUNTIME_WORKER_OPTIONS
AST string defining worker resource requirements.
num_cpus: Required number of CPUs.
num_gpus: Required number of GPUs
@@ -69,16 +61,19 @@ options:
placement_group_bundle_index, placement_group_capture_child_tasks, resources, runtime_env,
scheduling_strategy, _metadata, concurrency_groups, lifetime, max_concurrency, max_restarts,
max_task_retries, max_pending_calls, namespace, get_if_exists
-Example: { 'num_cpus': '8', 'num_gpus': '1', 'resources': '{"special_hardware": 1, "custom_label": 1}' }
---runtime_pipeline_id PIPELINE_ID
-pipeline id
---runtime_job_id JOB_ID job id
---runtime_creation_delay CREATION_DELAY
+Example: { 'num_cpus': '8', 'num_gpus': '1',
+'resources': '{"special_hardware": 1, "custom_label": 1}' }
+--runtime_creation_delay RUNTIME_CREATION_DELAY
delay between actor' creation
---runtime_code_location CODE_LOCATION
+--runtime_pipeline_id RUNTIME_PIPELINE_ID
+pipeline id
+--runtime_job_id RUNTIME_JOB_ID
+job id
+--runtime_code_location RUNTIME_CODE_LOCATION
AST string containing code location
github: Github repository URL.
commit_hash: github commit hash
path: Path within the repository
-Example: { 'github': 'https://github.com/somerepo', 'commit_hash': '13241231asdfaed', 'path': 'transforms/universal/ededup' }
+Example: { 'github': 'https://github.com/somerepo', 'commit_hash': '1324',
+'path': 'transforms/universal/code' }
```
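
For reference, a local run combining a few of the options above might look like the following; the script name and flags come directly from the --help output, while the values and folder paths are placeholders.

```bash
# A hypothetical local run; flags mirror the --help output above.
python noop_transform.py --run_locally True \
    --noop_sleep_sec 1 \
    --runtime_num_workers 2 \
    --data_local_config "{'input_folder': './input', 'output_folder': '/tmp/output'}"
```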