Update docs for refactoring #110

Merged · 4 commits · May 10, 2024
2 changes: 1 addition & 1 deletion data-processing-lib/doc/advanced-transform-tutorial.md
@@ -280,5 +280,5 @@ python ededup_transform.py --hash_cpu 0.5 --num_hashes 2 --doc_column "contents"
--s3_config "{'input_folder': 'cos-optimal-llm-pile/test/david/input/', 'output_folder': 'cos-optimal-llm-pile/test/david/output/'}"
```
This is a minimal set of options to run locally.
-See the [launcher options](launcher-options.md) for a complete list of
+See the [launcher options](ray-launcher-options.md) for a complete list of
transform-independent command line options.
Binary file removed data-processing-lib/doc/logo-ibm-dark.png
Binary file not shown.
Binary file removed data-processing-lib/doc/logo-ibm.png
Binary file not shown.
11 changes: 6 additions & 5 deletions data-processing-lib/doc/overview.md
@@ -13,15 +13,16 @@ developers of data transformation are:
* [Transformation](../src/data_processing/transform/table_transform.py) - a simple, easily-implemented interface defines
the specifics of a given data transformation.
* [Transform Configuration](../src/data_processing/runtime/ray/transform_runtime.py) - defines
-the transform implementation and runtime classes, the
-command line arguments specific to transform, and the short name for the transform.
-* [Transformation Runtime](../src/data_processing/runtime/ray/transform_runtime.py) - allows for customization of the Ray environment for the transformer.
-This might include provisioning of shared memory objects or creation of additional actors.
+the transform short name, its implementation class, and command line configuration
+parameters.

To learn more consider the following:

* [Transform Tutorials](transform-tutorials.md)
-* [Testing transformers with S3](using_s3_transformers.md)
+* [Transform Runtimes](transform-runtimes.md)
+* [Transform Examples](transform-tutorial-examples.md)
+* [Testing Transforms](transform-testing.md)
+* [Utilities](transformer-utilities.md)
* [Architecture Deep Dive](architecture.md)
* [Transform project root readme](../../transforms/README.md)
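
To make the refactored description concrete, here is a minimal sketch of the transformation interface, assuming the pyarrow Table-in/Table-out shape suggested by the linked table_transform.py; the class name and config handling are illustrative assumptions, not the library's exact API.

```python
# A minimal sketch, not the library's exact API: it assumes a transform
# receives a pyarrow Table and returns transformed table(s) plus metadata.
import pyarrow as pa


class MyTransform:
    """Specifics of one data transformation: table(s) in, table(s) out."""

    def __init__(self, config: dict):
        # Transform-specific configuration, e.g. parsed command line values.
        self.column = config.get("doc_column", "contents")

    def transform(self, table: pa.Table) -> tuple[list[pa.Table], dict]:
        # A no-op style example: pass the table through and report row counts.
        return [table], {"rows processed": table.num_rows}
```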

60 changes: 60 additions & 0 deletions data-processing-lib/doc/python-launcher-options.md
@@ -0,0 +1,60 @@
# Python Launcher Command Line Options
A number of command line options are available when launching a transform.

The following is a current --help output (a work in progress) for
the `NOOPTransform` (note the --noop_sleep_sec and --noop_pwd options):

```
usage: noop_python_runtime.py [-h] [--noop_sleep_sec NOOP_SLEEP_SEC] [--noop_pwd NOOP_PWD] [--data_s3_cred DATA_S3_CRED] [--data_s3_config DATA_S3_CONFIG] [--data_local_config DATA_LOCAL_CONFIG] [--data_max_files DATA_MAX_FILES]
[--data_checkpointing DATA_CHECKPOINTING] [--data_data_sets DATA_DATA_SETS] [--data_files_to_use DATA_FILES_TO_USE] [--data_num_samples DATA_NUM_SAMPLES] [--runtime_pipeline_id RUNTIME_PIPELINE_ID]
[--runtime_job_id RUNTIME_JOB_ID] [--runtime_code_location RUNTIME_CODE_LOCATION]

Driver for noop processing

options:
-h, --help show this help message and exit
--noop_sleep_sec NOOP_SLEEP_SEC
Sleep actor for a number of seconds while processing the data frame, before writing the file to COS
--noop_pwd NOOP_PWD A dummy password which should be filtered out of the metadata
--data_s3_cred DATA_S3_CRED
AST string of options for s3 credentials. Only required for S3 data access.
access_key: access key help text
secret_key: secret key help text
url: optional s3 url
region: optional s3 region
Example: { 'access_key': 'access', 'secret_key': 'secret',
'url': 'https://s3.us-east.cloud-object-storage.appdomain.cloud',
'region': 'us-east-1' }
--data_s3_config DATA_S3_CONFIG
AST string containing input/output paths.
input_folder: Path to input folder of files to be processed
output_folder: Path to output folder of processed files
Example: { 'input_folder': 's3-path/your-input-bucket',
'output_folder': 's3-path/your-output-bucket' }
--data_local_config DATA_LOCAL_CONFIG
ast string containing input/output folders using local fs.
input_folder: Path to input folder of files to be processed
output_folder: Path to output folder of processed files
Example: { 'input_folder': './input', 'output_folder': '/tmp/output' }
--data_max_files DATA_MAX_FILES
Max amount of files to process
--data_checkpointing DATA_CHECKPOINTING
checkpointing flag
--data_data_sets DATA_DATA_SETS
List of sub-directories of input directory to use for input. For example, ['dir1', 'dir2']
--data_files_to_use DATA_FILES_TO_USE
list of file extensions to choose for input.
--data_num_samples DATA_NUM_SAMPLES
number of random input files to process
--runtime_pipeline_id RUNTIME_PIPELINE_ID
pipeline id
--runtime_job_id RUNTIME_JOB_ID
job id
--runtime_code_location RUNTIME_CODE_LOCATION
AST string containing code location
github: Github repository URL.
commit_hash: github commit hash
path: Path within the repository
Example: { 'github': 'https://github.com/somerepo', 'commit_hash': '1324',
'path': 'transforms/universal/code' }
```
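
To sketch how these options are consumed, the snippet below drives the launcher programmatically by setting sys.argv before launching; `PythonTransformLauncher` comes from the python-runtime page below, while the import paths and the `NOOPTransformConfiguration` name are assumptions for illustration.

```python
# A hedged sketch: the import paths and NOOPTransformConfiguration are
# assumptions; only the command line flags mirror the --help output above.
import sys

from data_processing.runtime.pure_python import PythonTransformLauncher  # assumed path
from noop_transform import NOOPTransformConfiguration  # assumed module/class

sys.argv = [
    "noop_python_runtime.py",
    "--noop_sleep_sec", "1",
    "--data_local_config",
    "{'input_folder': './input', 'output_folder': '/tmp/output'}",
    "--data_max_files", "10",
]
launcher = PythonTransformLauncher(NOOPTransformConfiguration())
launcher.launch()
```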
12 changes: 12 additions & 0 deletions data-processing-lib/doc/python-runtime.md
@@ -0,0 +1,12 @@
## Python Runtime
The Python runtime provides a simple mechanism to run a transform on a set of input data to produce
a set of output data, all within the Python execution environment.

A `PythonTransformLauncher` class is provided to run the transform. For example,

```python
launcher = PythonTransformLauncher(YourTransformConfiguration())
launcher.launch()
```
The `YourTransformConfiguration` class configures your transform.
More details can be found in the [transform tutorial](transform-tutorials.md).
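
The configuration object ties together the transform's short name, its implementation class, and its command line parameters. Below is a hedged sketch of what `YourTransformConfiguration` might look like; the method names and the absence of a library base class are assumptions patterned on the overview above, not the exact API.

```python
from argparse import ArgumentParser, Namespace


# A hedged sketch of a transform configuration; in real code this would
# extend a library base class (assumed), and method names may differ.
class YourTransformConfiguration:
    def __init__(self):
        self.name = "yourtransform"   # short name, used to prefix CLI options
        self.params = {}

    def add_input_params(self, parser: ArgumentParser) -> None:
        # Transform-specific options, conventionally prefixed by the short name.
        parser.add_argument("--yourtransform_some_param", type=int, default=1)

    def apply_input_params(self, args: Namespace) -> bool:
        # Capture parsed values for the transform to use at runtime.
        self.params = {"some_param": args.yourtransform_some_param}
        return True
```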
data-processing-lib/doc/launcher-options.md → data-processing-lib/doc/ray-launcher-options.md
@@ -1,65 +1,57 @@
-# Launcher Command Line Options
+# Ray Launcher Command Line Options
A number of command line options are available when launching a transform.

The following is a current --help output (a work in progress) for
-the `NOOPTransform` (note the --noop_sleep_sec option):
+the `NOOPTransform` (note the --noop_sleep_sec and --noop_pwd options):

```
-usage: noop_transform.py [-h]
-[--run_locally RUN_LOCALLY]
-[--noop_sleep_sec NOOP_SLEEP_SEC]
-[--data_s3_cred DATA_S3_CRED]
-[--data_s3_config DATA_S3_CONFIG]
-[--data_local_config DATA_LOCAL_CONFIG]
-[--data_max_files DATA_MAX_FILES]
-[--data_checkpointing DATA_CHECKPOINTING]
-[--data_data_sets DATA_DATA_SETS]
-[--data_max_files MAX_FILES]
-[--data_files_to_use DATA_FILES_TO_USE]
-[--data_num_samples DATA_NUM_SAMPLES]
-[--runtime_num_workers NUM_WORKERS]
-[--runtime_worker_options WORKER_OPTIONS]
-[--runtime_pipeline_id PIPELINE_ID] [--job_id JOB_ID]
-[--runtime_creation_delay CREATION_DELAY]
-[--runtime_code_location CODE_LOCATION]
+usage: noop_transform.py [-h] [--run_locally RUN_LOCALLY] [--noop_sleep_sec NOOP_SLEEP_SEC] [--noop_pwd NOOP_PWD] [--data_s3_cred DATA_S3_CRED] [--data_s3_config DATA_S3_CONFIG] [--data_local_config DATA_LOCAL_CONFIG]
+[--data_max_files DATA_MAX_FILES] [--data_checkpointing DATA_CHECKPOINTING] [--data_data_sets DATA_DATA_SETS] [--data_files_to_use DATA_FILES_TO_USE] [--data_num_samples DATA_NUM_SAMPLES]
+[--runtime_num_workers RUNTIME_NUM_WORKERS] [--runtime_worker_options RUNTIME_WORKER_OPTIONS] [--runtime_creation_delay RUNTIME_CREATION_DELAY] [--runtime_pipeline_id RUNTIME_PIPELINE_ID]
+[--runtime_job_id RUNTIME_JOB_ID] [--runtime_code_location RUNTIME_CODE_LOCATION]

-Driver for NOOP processing
+Driver for noop processing

options:
-h, --help show this help message and exit
--run_locally RUN_LOCALLY
running ray local flag
--noop_sleep_sec NOOP_SLEEP_SEC
Sleep actor for a number of seconds while processing the data frame, before writing the file to COS
---data_s3_cred S3_CRED
-AST string of options for cos credentials. Only required for COS or Lakehouse.
+--noop_pwd NOOP_PWD A dummy password which should be filtered out of the metadata
+--data_s3_cred DATA_S3_CRED
+AST string of options for s3 credentials. Only required for S3 data access.
access_key: access key help text
secret_key: secret key help text
-cos_url: COS url
-Example: { 'access_key': 'access', 'secret_key': 'secret', 's3_url': 'https://s3.us-east.cloud-object-storage.appdomain.cloud' }
---data_s3_config S3_CONFIG
+url: optional s3 url
+region: optional s3 region
+Example: { 'access_key': 'access', 'secret_key': 'secret',
+'url': 'https://s3.us-east.cloud-object-storage.appdomain.cloud',
+'region': 'us-east-1' }
+--data_s3_config DATA_S3_CONFIG
AST string containing input/output paths.
input_folder: Path to input folder of files to be processed
output_folder: Path to output folder of processed files
-Example: { 'input_folder': 'your input folder', 'output_folder ': 'your output folder' }
---data_local_config LOCAL_CONFIG
+Example: { 'input_folder': 's3-path/your-input-bucket',
+'output_folder': 's3-path/your-output-bucket' }
+--data_local_config DATA_LOCAL_CONFIG
ast string containing input/output folders using local fs.
input_folder: Path to input folder of files to be processed
output_folder: Path to output folder of processed files
Example: { 'input_folder': './input', 'output_folder': '/tmp/output' }
---data_max_files MAX_FILES
+--data_max_files DATA_MAX_FILES
Max amount of files to process
---data_checkpointing CHECKPOINTING
+--data_checkpointing DATA_CHECKPOINTING
checkpointing flag
---data_data_sets DATA_SETS
-List of data sets
+--data_data_sets DATA_DATA_SETS
+List of sub-directories of input directory to use for input. For example, ['dir1', 'dir2']
--data_files_to_use DATA_FILES_TO_USE
-files extensions to use, default .parquet
+list of file extensions to choose for input.
--data_num_samples DATA_NUM_SAMPLES
-number of randomply picked files to use
---runtime_num_workers NUM_WORKERS
+number of random input files to process
+--runtime_num_workers RUNTIME_NUM_WORKERS
number of workers
---runtime_worker_options WORKER_OPTIONS
+--runtime_worker_options RUNTIME_WORKER_OPTIONS
AST string defining worker resource requirements.
num_cpus: Required number of CPUs.
num_gpus: Required number of GPUs
@@ -69,16 +61,19 @@ options:
placement_group_bundle_index, placement_group_capture_child_tasks, resources, runtime_env,
scheduling_strategy, _metadata, concurrency_groups, lifetime, max_concurrency, max_restarts,
max_task_retries, max_pending_calls, namespace, get_if_exists
-Example: { 'num_cpus': '8', 'num_gpus': '1', 'resources': '{"special_hardware": 1, "custom_label": 1}' }
---runtime_pipeline_id PIPELINE_ID
-pipeline id
---runtime_job_id JOB_ID job id
---runtime_creation_delay CREATION_DELAY
+Example: { 'num_cpus': '8', 'num_gpus': '1',
+'resources': '{"special_hardware": 1, "custom_label": 1}' }
+--runtime_creation_delay RUNTIME_CREATION_DELAY
delay between actor' creation
---runtime_code_location CODE_LOCATION
+--runtime_pipeline_id RUNTIME_PIPELINE_ID
+pipeline id
+--runtime_job_id RUNTIME_JOB_ID
+job id
+--runtime_code_location RUNTIME_CODE_LOCATION
AST string containing code location
github: Github repository URL.
commit_hash: github commit hash
path: Path within the repository
-Example: { 'github': 'https://github.com/somerepo', 'commit_hash': '13241231asdfaed', 'path': 'transforms/universal/ededup' }
+Example: { 'github': 'https://github.com/somerepo', 'commit_hash': '1324',
+'path': 'transforms/universal/code' }
```
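
For reference, a local run combining a few of the options above might look like the following; the script name and flags come directly from the --help output, while the values and folder paths are placeholders.

```bash
# A hypothetical local run; flags mirror the --help output above.
python noop_transform.py --run_locally True \
    --noop_sleep_sec 1 \
    --runtime_num_workers 2 \
    --data_local_config "{'input_folder': './input', 'output_folder': '/tmp/output'}"
```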