Commit eb48aba: Docs (aws#60)
* Rework README to point directly to framework pages

* WIP

* WIP

* Rename documentation to docs

* Updated sagemaker.md

* Move README.md to top-level and delete old

* Add 'how-to-use' to tensorflow.md
jarednielsen authored and rahul003 committed Nov 28, 2019
1 parent f8661a8 commit eb48aba
Showing 11 changed files with 283 additions and 258 deletions.
121 changes: 98 additions & 23 deletions README.md
@@ -1,32 +1,107 @@
# SageMaker Debugger

- [Overview](#overview)
- [Examples](#example-sagemaker-zero-code-change)
- [How It Works](#how-it-works)

## Overview
SageMaker Debugger is an AWS service to automatically debug your machine learning training process.
It helps you develop better, faster, cheaper models by catching common errors quickly. It supports
TensorFlow, PyTorch, MXNet, and XGBoost on Python 3.6+.

- Zero-code-change experience on SageMaker and AWS Deep Learning Containers.
- Automated anomaly detection and state assertions.
- Real-time training job monitoring and visibility into any tensor value.
- Distributed training and TensorBoard support.

There are two ways to use it: automatic mode and configurable mode.

- Automatic mode: No changes to your training script. Specify the rules you want and launch a SageMaker Estimator job.
- Configurable mode: More powerful; lets you specify exactly which tensors and collections to save. Use the Python API within your script.

## Example: SageMaker Zero-Code-Change
This example uses the zero-script-change experience, where you can use your training script as-is.
See the [example notebooks](https://link.com) for more details.
```python
import sagemaker
from sagemaker.debugger import rule_configs, Rule, CollectionConfig

# Choose a built-in rule to monitor your training job
rule = Rule.sagemaker(
    rule_configs.exploding_tensor(),
    rule_parameters={
        "tensor_regex": ".*"
    },
    collections_to_save=[
        CollectionConfig(name="weights"),
        CollectionConfig(name="losses"),
    ],
)

# Pass the rule to the estimator
sagemaker_simple_estimator = sagemaker.tensorflow.TensorFlow(
    entry_point="script.py",
    role=sagemaker.get_execution_role(),
    framework_version="1.15",
    py_version="py3",
    rules=[rule],
)

sagemaker_simple_estimator.fit()
```

That's it! SageMaker will automatically monitor your training job for you and create a CloudWatch
event if you run into exploding tensor values.

If you want greater configuration and control, we offer that too; simply use the Python API, as shown in the next example.

## Example: Running Locally
Requires Python 3.6+; this example uses tf.keras. First install the package:
```
pip install smdebug
```

To use SageMaker Debugger, simply add a callback hook:
```python
import tensorflow as tf
import smdebug.tensorflow as smd

# args.out_dir is your output directory of choice
hook = smd.KerasHook(out_dir=args.out_dir)

model = tf.keras.models.Sequential([ ... ])
model.compile(
optimizer='adam',
loss='sparse_categorical_crossentropy',
)

# Add the hook as a callback
model.fit(x_train, y_train, epochs=args.epochs, callbacks=[hook])
model.evaluate(x_test, y_test, callbacks=[hook])

# Create a trial to inspect the saved tensors
trial = smd.create_trial(out_dir=args.out_dir)
print(f"Saved tensor values for {trial.tensors()}")
print(f"Loss values were {trial.tensor('CrossEntropyLoss:0')}")
```

## How It Works
SageMaker Debugger uses a `hook` to store the values of tensors throughout the training process. Another process called a `rule` job
simultaneously monitors and validates these outputs to ensure that training is progressing as expected.
A rule might check for vanishing gradients, or exploding tensor values, or poor weight initialization.
If a rule is triggered, it will raise a CloudWatch event and stop the training job, saving you time
and money.

SageMaker Debugger can be used inside or outside of SageMaker. There are three main use cases:
- SageMaker Zero-Script-Change: Here you specify which rules to use when setting up the estimator and run your existing script, no changes needed. See the first example above.
- SageMaker Bring-Your-Own-Container: Here you specify the rules to use, and modify your training script.
- Non-SageMaker: Here you write custom rules (or manually analyze the tensors) and modify your training script. See the second example above and the custom-rule sketch below.

The reason for different setups is that SageMaker Zero-Script-Change uses custom framework forks of TensorFlow, PyTorch, MXNet, and XGBoost to save tensors automatically.
These framework forks are not available in custom containers or non-SM environments, so you must modify your training script in these environments.
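To make the non-SageMaker path concrete, here is a minimal sketch of a custom rule. It assumes the `Rule` base class in `smdebug.rules.rule`, the `RuleEvaluationConditionMet` exception in `smdebug.exceptions`, and the trial's `tensor_names`/`value` methods; the class name and threshold are illustrative, not part of the official docs.

```python
from smdebug.rules.rule import Rule
from smdebug.exceptions import RuleEvaluationConditionMet

class LossAboveThreshold(Rule):
    """Illustrative custom rule: fire if any saved loss exceeds a threshold."""

    def __init__(self, base_trial, threshold=10.0):
        super().__init__(base_trial)
        self.threshold = float(threshold)

    def invoke_at_step(self, step):
        # base_trial is the Trial for the training job being monitored
        for name in self.base_trial.tensor_names(collection="losses"):
            value = self.base_trial.tensor(name).value(step)
            if value > self.threshold:
                # Raising this exception is how a rule signals it has triggered
                raise RuleEvaluationConditionMet("LossAboveThreshold", step)
```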

See the [SageMaker page](https://link.com) for details on the SageMaker Zero-Script-Change and BYOC experiences.\
See the frameworks pages for details on modifying the training script:
- [TensorFlow](https://link.com)
- [PyTorch](https://link.com)
- [MXNet](https://link.com)
- [XGBoost](https://link.com)
63 changes: 20 additions & 43 deletions documentation/API.md → docs/API.md
@@ -10,55 +10,32 @@ These objects exist across all frameworks.
- [SaveConfig](#saveconfig)
- [ReductionConfig](#reductionconfig)

---
## Glossary

The imports assume `import smdebug.{tensorflow,pytorch,mxnet,xgboost} as smd`.

**Hook**: The main interface used during training. This object can be passed as a model hook/callback
in TensorFlow and Keras. It keeps track of collections and writes output files at each step.
- `hook = smd.Hook(out_dir="/tmp/mnist_job")`

**Mode**: One of "train", "eval", "predict", or "global". Helpful for segmenting data based on the phase
you're in. Defaults to "global".
- `train_mode = smd.modes.TRAIN`

**Collection**: A group of tensors. Each collection contains its own save configuration and regexes for
tensors to include/exclude.
- `collection = hook.get_collection("losses")`

**SaveConfig**: A Python dict specifying how often to save losses and tensors.
- `save_config = smd.SaveConfig(save_interval=10)`

**ReductionConfig**: Allows you to save a reduction, such as 'mean' or 'l1 norm', instead of the full tensor.
- `reduction_config = smd.ReductionConfig(reductions=['min', 'max', 'mean'], norms=['l1'])`

**Trial**: The main interface to use when analyzing a completed training job. Access collections and tensors. See the [trials documentation](https://link.com).
- `trial = smd.create_trial(out_dir="/tmp/mnist_job")`

**Rule**: A condition that will trigger an exception and terminate the training job early, for example a vanishing gradient. See the [rules documentation](https://link.com).
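As a rough sketch of how these objects compose (the `KerasHook`, output directory, and collection names here are illustrative assumptions, not prescribed values):

```python
import smdebug.tensorflow as smd

# Save min/max/mean reductions instead of full tensor values...
reduction_config = smd.ReductionConfig(reductions=["min", "max", "mean"])
# ...once every 10 steps.
save_config = smd.SaveConfig(save_interval=10)

hook = smd.KerasHook(
    out_dir="/tmp/mnist_job",
    include_collections=["weights", "losses"],
    save_config=save_config,
    reduction_config=reduction_config,
)
hook.set_mode(smd.modes.TRAIN)  # tag the saved data with the training phase
```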


---
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
138 changes: 138 additions & 0 deletions docs/sagemaker.md
@@ -0,0 +1,138 @@
# SageMaker

There are two cases for SageMaker:
- Zero-Script-Change (ZSC): Here you specify which rules to use, and run your existing script.
- Supported in Deep Learning Containers: `TensorFlow==1.15, PyTorch==1.3, MXNet==1.6`
- Bring-Your-Own-Container (BYOC): Here you specify the rules to use, and modify your training script.
- Supported with `TensorFlow==1.13/1.14/1.15, PyTorch==1.2/1.3, MXNet==1.4/1.5/1.6`

Table of Contents
- [Configuration Details](#configuration-details)
- [Using a Custom Container](#using-a-custom-container)

## Configuration Details
The main configuration objects and their parameters are shown below.

```python
rule = sagemaker.debugger.Rule.sagemaker(
base_config: dict, # Use an import, e.g. sagemaker.debugger.rule_configs.exploding_tensor()
name: str=None,
instance_type: str=None,
container_local_path: str=None,
volume_size_in_gb: int=None,
other_trials_s3_input_paths: str=None,
rule_parameters: dict=None,
collections_to_save: list[sagemaker.debugger.CollectionConfig]=None,
)
```

```python
hook_config = sagemaker.debugger.DebuggerHookConfig(
s3_output_path: str,
container_local_path: str=None,
hook_parameters: dict=None,
collection_configs: list[sagemaker.debugger.CollectionConfig]=None,
)
```

```python
tb_config = sagemaker.debugger.TensorBoardOutputConfig(
s3_output_path: str,
container_local_path: str=None,
)
```

```python
collection_config = sagemaker.debugger.CollectionConfig(
name: str,
parameters: dict,
)
```

A full example script is below:
```python
import sagemaker
from sagemaker.debugger import rule_configs, Rule, DebuggerHookConfig, TensorBoardOutputConfig, CollectionConfig

hook_parameters = {
"include_regex": "my_regex,another_regex", # comma-separated string of regexes
"save_interval": 100,
"save_steps": "1,2,3,4", # comma-separated string of steps to save
"start_step": 1,
"end_step": 2000,
"reductions": "min,max,mean,std,abs_variance,abs_sum,abs_l2_norm",
}
weights_config = CollectionConfig(name="weights")
biases_config = CollectionConfig(name="biases")
losses_config = CollectionConfig(name="losses")
tb_config = TensorBoardOutputConfig(s3_output_path="s3://my-bucket/tensorboard")

hook_config = DebuggerHookConfig(
s3_output_path="s3://my-bucket/smdebug",
hook_parameters=hook_parameters,
collection_configs=[weights_config, biases_config, losses_config],
)

exploding_tensor_rule = Rule.sagemaker(
base_config=rule_configs.exploding_tensor(),
rule_parameters={
"tensor_regex": ".*",
},
collections_to_save=[weights_config, losses_config],
)
vanishing_gradient_rule = Rule.sagemaker(base_config=rule_configs.vanishing_gradient())

# Or use sagemaker.pytorch.PyTorch or sagemaker.mxnet.MXNet
sagemaker_simple_estimator = sagemaker.tensorflow.TensorFlow(
entry_point=simple_entry_point_script,
role=sagemaker.get_execution_role(),
base_job_name=args.job_name,
train_instance_count=1,
train_instance_type="ml.m4.xlarge",
framework_version="1.15",
py_version="py3",
# smdebug-specific arguments below
rules=[exploding_tensor_rule, vanishing_gradient_rule],
debugger_hook_config=hook_config,
tensorboard_output_config=tb_config,
)

sagemaker_simple_estimator.fit()
```

## Using a Custom Container
To use a custom container (without the framework forks), you should modify your script.
Use the same SageMaker Estimator setup as shown above, and in your script, call

```python
hook = smd.{hook_class}.create_from_json_file()
```

and modify the rest of your script as shown in the API docs. Click on your desired framework below; a minimal Keras sketch follows the list.
- [TensorFlow](https://link.com)
- [PyTorch](https://link.com)
- [MXNet](https://link.com)
- [XGBoost](https://link.com)
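For example, a BYOC training script for Keras might look roughly like this (a sketch; the model, layer sizes, and dummy data are placeholders, not part of the official docs):

```python
import numpy as np
import tensorflow as tf
import smdebug.tensorflow as smd

# Inside a SageMaker container, the hook configuration is provided as a
# JSON file, so no paths or parameters need to be hardcoded here.
hook = smd.KerasHook.create_from_json_file()

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(10, activation="softmax", input_shape=(784,)),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Placeholder data so the sketch is self-contained.
x_train = np.random.rand(256, 784).astype("float32")
y_train = np.random.randint(0, 10, size=(256,))

# Register the hook as a callback so tensors are saved during training.
model.fit(x_train, y_train, epochs=1, callbacks=[hook])
```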


## Comprehensive Rule List
The full list of built-in rules:
| Rule Name | Behavior |
| --- | --- |
| `vanishing_gradient` | Detects a vanishing gradient. |
| `all_zero` | ??? |
| `check_input_images` | ??? |
| `similar_across_runs` | ??? |
| `weight_update_ratio` | ??? |
| `exploding_tensor` | ??? |
| `unchanged_tensor` | ??? |
| `loss_not_decreasing` | ??? |
| `dead_relu` | ??? |
| `confusion` | ??? |
| `overfit` | ??? |
| `tree_depth` | ??? |
| `tensor_variance` | ??? |
| `overtraining` | ??? |
| `poor_weight_initialization` | ??? |
| `saturated_activation` | ??? |
| `nlp_sequence_ratio` | ??? |
27 changes: 27 additions & 0 deletions documentation/tensorflow.md → docs/tensorflow.md
@@ -3,12 +3,23 @@
SageMaker Zero-Code-Change supported container: TensorFlow 1.15. See the [AWS Docs](https://link.com) for details.\
Python API supported versions: TensorFlow 1.13, 1.14, 1.15; Keras 2.3.



## Contents
- [How to Use](#how-to-use)
- [Keras Example](#keras-example)
- [MonitoredSession Example](#monitored-session-example)
- [Estimator Example](#estimator-example)
- [Full API](#full-api)

## How to Use
1. `import smdebug.tensorflow as smd`
2. Instantiate a hook: `smd.{hook_class}.create_from_json_file()` in a SageMaker environment, or `smd.{hook_class}()` otherwise.
3. Pass the hook to the model as a callback.
4. If using a custom container or outside of SageMaker, wrap the optimizer with `optimizer = hook.wrap_optimizer(optimizer)`.

(Optional): Configure collections. See the [Common API](https://link.com) page for details on how to do this.

## tf.keras Example
```python
import smdebug.tensorflow as smd
# ... (remainder of this example is collapsed in the diff view)
```

@@ -140,3 +151,19 @@
```python
wrap_optimizer(
)
```
Adds functionality to the optimizer object to log gradients. Returns the original optimizer and doesn't change the optimization process.
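For instance, in a session-based TF 1.x script (a sketch assuming `smd.SessionHook`; the output directory and learning rate are illustrative):

```python
import tensorflow as tf
import smdebug.tensorflow as smd

hook = smd.SessionHook(out_dir="/tmp/my_job")
optimizer = tf.train.AdamOptimizer(learning_rate=1e-3)
# The wrapped optimizer behaves identically; the hook can now log gradients.
optimizer = hook.wrap_optimizer(optimizer)
```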

## Concepts
The steps to use sagemaker-debugger in any framework are:

1. Create a `hook`.
2. Register your model and optimizer with the hook.
3. Specify the `rule` to be used.
4. After training, create a `trial` to manually analyze the tensors.

See the [API page](https://link.com) for more details.

## Detailed Links
- [Full API](https://link.com)
- [Rules and Trials](https://link.com)
- [Distributed Training](https://link.com)
- [TensorBoard](https://link.com)
File renamed without changes.