forked from aws/amazon-sagemaker-examples
Commit message:
* Rework README to point directly to framework pages
* WIP
* WIP
* Rename documentation to docs
* Updated sagemaker.md
* Move README.md to top-level and delete old
* Add 'how-to-use' to tensorflow.md
1 parent f8661a8 · commit eb48aba
Showing 11 changed files with 283 additions and 258 deletions.
@@ -1,32 +1,107 @@
# SageMaker Debugger

- [Overview](#overview)
- [Examples](#example-sagemaker-zero-code-change)
- [How It Works](#how-it-works)

## Overview
SageMaker Debugger is an AWS service that automatically debugs your machine learning training process.
It helps you develop better, faster, and cheaper models by catching common errors quickly. It supports
TensorFlow, PyTorch, MXNet, and XGBoost on Python 3.6+.
- Zero-code-change experience on SageMaker and AWS Deep Learning Containers.
- Automated anomaly detection and state assertions.
- Real-time training job monitoring and visibility into any tensor value.
- Distributed training and TensorBoard support.

There are two ways to use it: automatic mode and configurable mode.

- Automatic mode: no changes to your training script. Specify the rules you want and launch a SageMaker Estimator job.
- Configurable mode: more powerful; lets you specify exactly which tensors and collections to save. Use the Python API within your script.
## Example: SageMaker Zero-Code-Change
This example uses the zero-script-change experience, where you can use your training script as-is.
See the [example notebooks](https://link.com) for more details.

```python
import sagemaker
from sagemaker.debugger import rule_configs, Rule, CollectionConfig

# Choose a built-in rule to monitor your training job
rule = Rule.sagemaker(
    rule_configs.exploding_tensor(),
    rule_parameters={
        "tensor_regex": ".*"
    },
    collections_to_save=[
        CollectionConfig(name="weights"),
        CollectionConfig(name="losses"),
    ],
)

# Pass the rule to the estimator
sagemaker_simple_estimator = sagemaker.tensorflow.TensorFlow(
    entry_point="script.py",
    role=sagemaker.get_execution_role(),
    framework_version="1.15",
    py_version="py3",
    rules=[rule],
)

sagemaker_simple_estimator.fit()
```
That's it! SageMaker will automatically monitor your training job for you and create a CloudWatch
event if you run into exploding tensor values.

If you want greater configuration and control, we offer that too; see the configurable mode described in the overview.
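Under the hood, the `exploding_tensor` rule flags non-finite or runaway tensor values. A minimal pure-Python sketch of that kind of check (illustrative only, not the library's actual implementation; the threshold here is an assumption):

```python
import math

def has_exploding_values(tensor_values, magnitude_threshold=1e6):
    """Return True if any value is NaN, infinite, or suspiciously large.

    Simplified stand-in for what an exploding-tensor rule checks; the
    real rule operates on tensors saved from the training job.
    """
    for v in tensor_values:
        if math.isnan(v) or math.isinf(v):
            return True
        if abs(v) > magnitude_threshold:
            return True
    return False

# A healthy batch of weights passes; a diverging one trips the check.
print(has_exploding_values([0.1, -0.5, 2.0]))          # False
print(has_exploding_values([1.0, float("inf"), 0.2]))  # True
```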
## Example: Running Locally
This requires Python 3.6+, and the example uses tf.keras. Run

```
pip install smdebug
```

To use SageMaker Debugger, add a callback hook:

```python
import tensorflow as tf
import smdebug.tensorflow as smd

hook = smd.KerasHook(out_dir=args.out_dir)

model = tf.keras.models.Sequential([ ... ])
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
)

# Add the hook as a callback
model.fit(x_train, y_train, epochs=args.epochs, callbacks=[hook])
model.evaluate(x_test, y_test, callbacks=[hook])

# Create a trial to inspect the saved tensors
trial = smd.create_trial(out_dir=args.out_dir)
print(f"Saved tensor values for {trial.tensors()}")
print(f"Loss values were {trial.tensor('CrossEntropyLoss:0')}")
```
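Once tensor values are loaded from a trial, simple checks can be run over them. As an illustration (plain Python, not the smdebug rule implementation; window size and tolerance are assumptions), a loss-not-decreasing check might compare the mean of recent losses against earlier ones:

```python
def loss_not_decreasing(losses, window=3, min_improvement=0.0):
    """Return True if the average loss over the last `window` steps
    failed to improve on the average of the preceding `window` steps.

    Illustrative sketch only; the built-in rule has its own parameters.
    """
    if len(losses) < 2 * window:
        return False  # not enough history to judge
    earlier = sum(losses[-2 * window:-window]) / window
    recent = sum(losses[-window:]) / window
    return recent >= earlier - min_improvement

print(loss_not_decreasing([2.0, 1.5, 1.2, 1.0, 0.8, 0.7]))  # False: still improving
print(loss_not_decreasing([1.0, 1.0, 1.0, 1.1, 1.0, 1.1]))  # True: plateaued
```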
## How It Works
SageMaker Debugger uses a `hook` to store the values of tensors throughout the training process. Another process, called a `rule` job,
simultaneously monitors and validates these outputs to ensure that training is progressing as expected.
A rule might check for vanishing gradients, exploding tensor values, or poor weight initialization.
If a rule is triggered, it will raise a CloudWatch event and stop the training job, saving you time
and money.

SageMaker Debugger can be used inside or outside of SageMaker. There are three main use cases:
- SageMaker Zero-Script-Change: specify which rules to use when setting up the estimator and run your existing script, no changes needed. See the first example above.
- SageMaker Bring-Your-Own-Container: specify the rules to use, and modify your training script.
- Non-SageMaker: write custom rules (or manually analyze the tensors) and modify your training script. See the second example above.

The reason for the different setups is that SageMaker Zero-Script-Change uses custom framework forks of TensorFlow, PyTorch, MXNet, and XGBoost that save tensors automatically.
These forks are not available in custom containers or non-SageMaker environments, so you must modify your training script in those environments.

See the [SageMaker page](https://link.com) for details on the SageMaker Zero-Script-Change and BYOC experiences.
See the framework pages for details on modifying the training script:
- [TensorFlow](https://link.com)
- [PyTorch](https://link.com)
- [MXNet](https://link.com)
- [XGBoost](https://link.com)
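The hook/rule split described above can be sketched in plain Python: a hook records tensor statistics at each step, and a rule process checks them and raises when training goes wrong. All class and method names below are invented for illustration; they are not the smdebug API:

```python
class RecordingHook:
    """Records per-step tensor statistics, standing in for the smdebug hook."""
    def __init__(self):
        self.history = {}  # step -> {tensor_name: max absolute value}

    def save(self, step, tensors):
        self.history[step] = {name: max(abs(v) for v in vals)
                              for name, vals in tensors.items()}

class ExplodingTensorRule:
    """Checks recorded statistics, standing in for a rule job."""
    def __init__(self, threshold=1e6):
        self.threshold = threshold

    def invoke_at_step(self, hook, step):
        for name, max_abs in hook.history[step].items():
            if max_abs > self.threshold:
                raise RuntimeError(f"Rule triggered at step {step}: {name} exploded")

hook = RecordingHook()
rule = ExplodingTensorRule()
hook.save(0, {"weights": [0.1, -0.2]})
rule.invoke_at_step(hook, 0)  # passes silently
hook.save(1, {"weights": [1e9, 0.0]})
# rule.invoke_at_step(hook, 1) would raise RuntimeError here
```

In the real service the two roles run as separate jobs, which is why a triggered rule can stop training from outside the training process.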
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
@@ -0,0 +1,138 @@
# SageMaker

There are two cases for SageMaker:
- Zero-Script-Change (ZSC): specify which rules to use, and run your existing script.
  - Supported in Deep Learning Containers: `TensorFlow==1.15, PyTorch==1.3, MXNet==1.6`
- Bring-Your-Own-Container (BYOC): specify the rules to use, and modify your training script.
  - Supported with `TensorFlow==1.13/1.14/1.15, PyTorch==1.2/1.3, MXNet==1.4/1.5/1.6`

Table of Contents
- [Configuration Details](#configuration-details)
- [Using a Custom Container](#using-a-custom-container)

## Configuration Details
The `DebuggerHookConfig` is the main configuration object; rules, TensorBoard output, and collections are specified alongside it with the signatures below.
```python
rule = sagemaker.debugger.Rule.sagemaker(
    base_config: dict,  # Use an import, e.g. sagemaker.debugger.rule_configs.exploding_tensor()
    name: str=None,
    instance_type: str=None,
    container_local_path: str=None,
    volume_size_in_gb: int=None,
    other_trials_s3_input_paths: str=None,
    rule_parameters: dict=None,
    collections_to_save: list[sagemaker.debugger.CollectionConfig]=None,
)
```

```python
hook_config = sagemaker.debugger.DebuggerHookConfig(
    s3_output_path: str,
    container_local_path: str=None,
    hook_parameters: dict=None,
    collection_configs: list[sagemaker.debugger.CollectionConfig]=None,
)
```

```python
tb_config = sagemaker.debugger.TensorBoardOutputConfig(
    s3_output_path: str,
    container_local_path: str=None,
)
```

```python
collection_config = sagemaker.debugger.CollectionConfig(
    name: str,
    parameters: dict,
)
```
A full example script is below:
```python
import sagemaker
from sagemaker.debugger import rule_configs, Rule, DebuggerHookConfig, TensorBoardOutputConfig, CollectionConfig

hook_parameters = {
    "include_regex": "my_regex,another_regex",  # comma-separated string of regexes
    "save_interval": 100,
    "save_steps": "1,2,3,4",  # comma-separated string of steps to save
    "start_step": 1,
    "end_step": 2000,
    "reductions": "min,max,mean,std,abs_variance,abs_sum,abs_l2_norm",
}
weights_config = CollectionConfig("weights")
biases_config = CollectionConfig("biases")
losses_config = CollectionConfig("losses")
tb_config = TensorBoardOutputConfig(s3_output_path="s3://my-bucket/tensorboard")

hook_config = DebuggerHookConfig(
    s3_output_path="s3://my-bucket/smdebug",
    hook_parameters=hook_parameters,
    collection_configs=[weights_config, biases_config, losses_config],
)

exploding_tensor_rule = Rule.sagemaker(
    base_config=rule_configs.exploding_tensor(),
    rule_parameters={
        "tensor_regex": ".*",
    },
    collections_to_save=[weights_config, losses_config],
)
vanishing_gradient_rule = Rule.sagemaker(base_config=rule_configs.vanishing_gradient())

# Or use sagemaker.pytorch.PyTorch or sagemaker.mxnet.MXNet
sagemaker_simple_estimator = sagemaker.tensorflow.TensorFlow(
    entry_point=simple_entry_point_script,
    role=sagemaker.get_execution_role(),
    base_job_name=args.job_name,
    train_instance_count=1,
    train_instance_type="ml.m4.xlarge",
    framework_version="1.15",
    py_version="py3",
    # smdebug-specific arguments below
    rules=[exploding_tensor_rule, vanishing_gradient_rule],
    debugger_hook_config=hook_config,
    tensorboard_output_config=tb_config,
)

sagemaker_simple_estimator.fit()
```
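The `hook_parameters` above control when tensors are saved. A rough pure-Python sketch of that step-selection logic, as one reading of these parameters (the real hook's precedence rules may differ):

```python
def should_save_step(step, save_interval=100, save_steps=None,
                     start_step=0, end_step=None):
    """Decide whether a given training step's tensors get saved.

    Mirrors the spirit of hook_parameters: an explicit save_steps list
    wins; otherwise save every save_interval steps within the
    [start_step, end_step) window. Illustrative only.
    """
    if step < start_step or (end_step is not None and step >= end_step):
        return False
    if save_steps is not None:
        return step in save_steps
    return step % save_interval == 0

print(should_save_step(200))                         # True: multiple of 100
print(should_save_step(201))                         # False
print(should_save_step(3, save_steps=[1, 2, 3, 4]))  # True: explicitly listed
```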
## Using a Custom Container
To use a custom container (without the framework forks), you must modify your training script.
Use the same sagemaker Estimator setup shown above, and in your script, call

```python
hook = smd.{hook_class}.create_from_json_file()
```

and modify the rest of your script as shown in the API docs. Click on your desired framework below.
- [TensorFlow](https://link.com)
- [PyTorch](https://link.com)
- [MXNet](https://link.com)
- [XGBoost](https://link.com)
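`create_from_json_file` builds the hook from a JSON configuration that SageMaker places inside the container. As a rough illustration of that pattern only: the file layout, field names, and helper below are hypothetical, not the real smdebug schema:

```python
import json
import os
import tempfile

# Hypothetical hook configuration; the real JSON schema is defined by SageMaker.
config = {
    "local_path": "/opt/ml/output/tensors",
    "hook_parameters": {"save_interval": "100"},
    "collection_configs": [{"name": "weights"}, {"name": "losses"}],
}

def create_hook_from_json(path):
    """Read a hook configuration file and return its parsed settings."""
    with open(path) as f:
        cfg = json.load(f)
    return cfg["local_path"], [c["name"] for c in cfg["collection_configs"]]

# Simulate the config file SageMaker would drop into the container.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(config, f)
    config_path = f.name

out_dir, collections = create_hook_from_json(config_path)
print(out_dir)      # /opt/ml/output/tensors
print(collections)  # ['weights', 'losses']
os.unlink(config_path)
```

The point of this indirection is that the same training script works with whatever rules and collections the Estimator was configured with, without hard-coding them.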
## Comprehensive Rule List
The full list of built-in rules is:

| Rule Name | Behavior |
| --- | --- |
| `vanishing_gradient` | Detects a vanishing gradient. |
| `all_zero` | ??? |
| `check_input_images` | ??? |
| `similar_across_runs` | ??? |
| `weight_update_ratio` | ??? |
| `exploding_tensor` | ??? |
| `unchanged_tensor` | ??? |
| `loss_not_decreasing` | ??? |
| `dead_relu` | ??? |
| `confusion` | ??? |
| `overfit` | ??? |
| `tree_depth` | ??? |
| `tensor_variance` | ??? |
| `overtraining` | ??? |
| `poor_weight_initialization` | ??? |
| `saturated_activation` | ??? |
| `nlp_sequence_ratio` | ??? |
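As an example of what a rule like `vanishing_gradient` looks for, here is a pure-Python sketch with an assumed threshold (not the built-in rule's implementation, which defines its own parameters):

```python
def gradients_vanishing(gradients, threshold=1e-7):
    """Return True if the mean absolute gradient has collapsed toward zero.

    Sketch only: the built-in vanishing_gradient rule operates on
    gradient tensors saved by the hook and has its own threshold parameter.
    """
    mean_abs = sum(abs(g) for g in gradients) / len(gradients)
    return mean_abs < threshold

print(gradients_vanishing([1e-3, -2e-3, 5e-4]))   # False: healthy gradients
print(gradients_vanishing([1e-9, -3e-10, 2e-9]))  # True: effectively zero
```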
File renamed without changes.