forked from aws/amazon-sagemaker-examples
Commit message:
* Rework README to point directly to framework pages
* WIP
* WIP
* Rename documentation to docs
* Updated sagemaker.md
* Move README.md to top-level and delete old
* Add 'how-to-use' to tensorflow.md
1 parent f8661a8 · commit eb48aba
Showing 11 changed files with 283 additions and 258 deletions.
@@ -1,32 +1,107 @@
# SageMaker Debugger

- [Overview](#overview)
- [Examples](#example-sagemaker-zero-code-change)
- [How It Works](#how-it-works)

## Overview
SageMaker Debugger is an AWS service that automatically debugs your machine learning training process.
It helps you develop better, faster, and cheaper models by catching common errors quickly. It supports
TensorFlow, PyTorch, MXNet, and XGBoost on Python 3.6+.
- Zero-code-change experience on SageMaker and AWS Deep Learning Containers.
- Automated anomaly detection and state assertions.
- Real-time training job monitoring and visibility into any tensor value.
- Distributed training and TensorBoard support.

There are two ways to use it: automatic mode and configurable mode.

- Automatic mode: no changes to your training script. Specify the rules you want and launch a SageMaker Estimator job.
- Configurable mode: more powerful; lets you specify exactly which tensors and collections to save. Use the Python API within your script.
## Example: SageMaker Zero-Code-Change
This example uses the zero-script-change experience, where you can use your training script as-is.
See the [example notebooks](https://link.com) for more details.

```python
import sagemaker
from sagemaker.debugger import rule_configs, Rule, CollectionConfig

# Choose a built-in rule to monitor your training job
rule = Rule.sagemaker(
    rule_configs.exploding_tensor(),
    rule_parameters={
        "tensor_regex": ".*"
    },
    collections_to_save=[
        CollectionConfig(name="weights"),
        CollectionConfig(name="losses"),
    ],
)

# Pass the rule to the estimator
sagemaker_simple_estimator = sagemaker.tensorflow.TensorFlow(
    entry_point="script.py",
    role=sagemaker.get_execution_role(),
    framework_version="1.15",
    py_version="py3",
    rules=[rule],
)

sagemaker_simple_estimator.fit()
```
That's it! SageMaker will automatically monitor your training job for you and create a CloudWatch
event if you run into exploding tensor values.

If you want greater configuration and control, we offer that too; see the configurable mode described in the overview.
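Under the hood, the `exploding_tensor` rule flags non-finite or runaway tensor values. A minimal pure-Python sketch of that kind of check (illustrative only, not the library's actual implementation; the threshold here is an assumption):

```python
import math

def has_exploding_values(tensor_values, magnitude_threshold=1e6):
    """Return True if any value is NaN, infinite, or suspiciously large.

    Simplified stand-in for what an exploding-tensor rule checks; the
    real rule operates on tensors saved from the training job.
    """
    for v in tensor_values:
        if math.isnan(v) or math.isinf(v):
            return True
        if abs(v) > magnitude_threshold:
            return True
    return False

# A healthy batch of weights passes; a diverging one trips the check.
print(has_exploding_values([0.1, -0.5, 2.0]))          # False
print(has_exploding_values([1.0, float("inf"), 0.2]))  # True
```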
## Example: Running Locally
This requires Python 3.6+, and the example uses tf.keras. Run

```
pip install smdebug
```

To use SageMaker Debugger, add a callback hook:

```python
import tensorflow as tf
import smdebug.tensorflow as smd

hook = smd.KerasHook(out_dir=args.out_dir)

model = tf.keras.models.Sequential([ ... ])
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
)

# Add the hook as a callback
model.fit(x_train, y_train, epochs=args.epochs, callbacks=[hook])
model.evaluate(x_test, y_test, callbacks=[hook])

# Create a trial to inspect the saved tensors
trial = smd.create_trial(out_dir=args.out_dir)
print(f"Saved tensor values for {trial.tensors()}")
print(f"Loss values were {trial.tensor('CrossEntropyLoss:0')}")
```
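Once tensor values are loaded from a trial, simple checks can be run over them. As an illustration (plain Python, not the smdebug rule implementation; window size and tolerance are assumptions), a loss-not-decreasing check might compare the mean of recent losses against earlier ones:

```python
def loss_not_decreasing(losses, window=3, min_improvement=0.0):
    """Return True if the average loss over the last `window` steps
    failed to improve on the average of the preceding `window` steps.

    Illustrative sketch only; the built-in rule has its own parameters.
    """
    if len(losses) < 2 * window:
        return False  # not enough history to judge
    earlier = sum(losses[-2 * window:-window]) / window
    recent = sum(losses[-window:]) / window
    return recent >= earlier - min_improvement

print(loss_not_decreasing([2.0, 1.5, 1.2, 1.0, 0.8, 0.7]))  # False: still improving
print(loss_not_decreasing([1.0, 1.0, 1.0, 1.1, 1.0, 1.1]))  # True: plateaued
```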
## How It Works
SageMaker Debugger uses a `hook` to store the values of tensors throughout the training process. Another process, called a `rule` job,
simultaneously monitors and validates these outputs to ensure that training is progressing as expected.
A rule might check for vanishing gradients, exploding tensor values, or poor weight initialization.
If a rule is triggered, it will raise a CloudWatch event and stop the training job, saving you time
and money.

SageMaker Debugger can be used inside or outside of SageMaker. There are three main use cases:
- SageMaker Zero-Script-Change: specify which rules to use when setting up the estimator and run your existing script, no changes needed. See the first example above.
- SageMaker Bring-Your-Own-Container: specify the rules to use, and modify your training script.
- Non-SageMaker: write custom rules (or manually analyze the tensors) and modify your training script. See the second example above.

The reason for the different setups is that SageMaker Zero-Script-Change uses custom framework forks of TensorFlow, PyTorch, MXNet, and XGBoost that save tensors automatically.
These forks are not available in custom containers or non-SageMaker environments, so you must modify your training script in those environments.

See the [SageMaker page](https://link.com) for details on the SageMaker Zero-Script-Change and BYOC experiences.
See the framework pages for details on modifying the training script:
- [TensorFlow](https://link.com)
- [PyTorch](https://link.com)
- [MXNet](https://link.com)
- [XGBoost](https://link.com)
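The hook/rule split described above can be sketched in plain Python: a hook records tensor statistics at each step, and a rule process checks them and raises when training goes wrong. All class and method names below are invented for illustration; they are not the smdebug API:

```python
class RecordingHook:
    """Records per-step tensor statistics, standing in for the smdebug hook."""
    def __init__(self):
        self.history = {}  # step -> {tensor_name: max absolute value}

    def save(self, step, tensors):
        self.history[step] = {name: max(abs(v) for v in vals)
                              for name, vals in tensors.items()}

class ExplodingTensorRule:
    """Checks recorded statistics, standing in for a rule job."""
    def __init__(self, threshold=1e6):
        self.threshold = threshold

    def invoke_at_step(self, hook, step):
        for name, max_abs in hook.history[step].items():
            if max_abs > self.threshold:
                raise RuntimeError(f"Rule triggered at step {step}: {name} exploded")

hook = RecordingHook()
rule = ExplodingTensorRule()
hook.save(0, {"weights": [0.1, -0.2]})
rule.invoke_at_step(hook, 0)  # passes silently
hook.save(1, {"weights": [1e9, 0.0]})
# rule.invoke_at_step(hook, 1) would raise RuntimeError here
```

In the real service the two roles run as separate jobs, which is why a triggered rule can stop training from outside the training process.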
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
@@ -0,0 +1,138 @@
# SageMaker

There are two cases for SageMaker:
- Zero-Script-Change (ZSC): specify which rules to use, and run your existing script.
  - Supported in Deep Learning Containers: `TensorFlow==1.15, PyTorch==1.3, MXNet==1.6`
- Bring-Your-Own-Container (BYOC): specify the rules to use, and modify your training script.
  - Supported with `TensorFlow==1.13/1.14/1.15, PyTorch==1.2/1.3, MXNet==1.4/1.5/1.6`

Table of Contents
- [Configuration Details](#configuration-details)
- [Using a Custom Container](#using-a-custom-container)

## Configuration Details
The `DebuggerHookConfig` is the main configuration object; rules, TensorBoard output, and collections are specified alongside it with the signatures below.
```python
rule = sagemaker.debugger.Rule.sagemaker(
    base_config: dict,  # Use an import, e.g. sagemaker.debugger.rule_configs.exploding_tensor()
    name: str=None,
    instance_type: str=None,
    container_local_path: str=None,
    volume_size_in_gb: int=None,
    other_trials_s3_input_paths: str=None,
    rule_parameters: dict=None,
    collections_to_save: list[sagemaker.debugger.CollectionConfig]=None,
)
```

```python
hook_config = sagemaker.debugger.DebuggerHookConfig(
    s3_output_path: str,
    container_local_path: str=None,
    hook_parameters: dict=None,
    collection_configs: list[sagemaker.debugger.CollectionConfig]=None,
)
```

```python
tb_config = sagemaker.debugger.TensorBoardOutputConfig(
    s3_output_path: str,
    container_local_path: str=None,
)
```

```python
collection_config = sagemaker.debugger.CollectionConfig(
    name: str,
    parameters: dict,
)
```
A full example script is below:
```python
import sagemaker
from sagemaker.debugger import rule_configs, Rule, DebuggerHookConfig, TensorBoardOutputConfig, CollectionConfig

hook_parameters = {
    "include_regex": "my_regex,another_regex",  # comma-separated string of regexes
    "save_interval": 100,
    "save_steps": "1,2,3,4",  # comma-separated string of steps to save
    "start_step": 1,
    "end_step": 2000,
    "reductions": "min,max,mean,std,abs_variance,abs_sum,abs_l2_norm",
}
weights_config = CollectionConfig("weights")
biases_config = CollectionConfig("biases")
losses_config = CollectionConfig("losses")
tb_config = TensorBoardOutputConfig(s3_output_path="s3://my-bucket/tensorboard")

hook_config = DebuggerHookConfig(
    s3_output_path="s3://my-bucket/smdebug",
    hook_parameters=hook_parameters,
    collection_configs=[weights_config, biases_config, losses_config],
)

exploding_tensor_rule = Rule.sagemaker(
    base_config=rule_configs.exploding_tensor(),
    rule_parameters={
        "tensor_regex": ".*",
    },
    collections_to_save=[weights_config, losses_config],
)
vanishing_gradient_rule = Rule.sagemaker(base_config=rule_configs.vanishing_gradient())

# Or use sagemaker.pytorch.PyTorch or sagemaker.mxnet.MXNet
sagemaker_simple_estimator = sagemaker.tensorflow.TensorFlow(
    entry_point=simple_entry_point_script,
    role=sagemaker.get_execution_role(),
    base_job_name=args.job_name,
    train_instance_count=1,
    train_instance_type="ml.m4.xlarge",
    framework_version="1.15",
    py_version="py3",
    # smdebug-specific arguments below
    rules=[exploding_tensor_rule, vanishing_gradient_rule],
    debugger_hook_config=hook_config,
    tensorboard_output_config=tb_config,
)

sagemaker_simple_estimator.fit()
```
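The `hook_parameters` above control when tensors are saved. A rough pure-Python sketch of that step-selection logic, as one reading of these parameters (the real hook's precedence rules may differ):

```python
def should_save_step(step, save_interval=100, save_steps=None,
                     start_step=0, end_step=None):
    """Decide whether a given training step's tensors get saved.

    Mirrors the spirit of hook_parameters: an explicit save_steps list
    wins; otherwise save every save_interval steps within the
    [start_step, end_step) window. Illustrative only.
    """
    if step < start_step or (end_step is not None and step >= end_step):
        return False
    if save_steps is not None:
        return step in save_steps
    return step % save_interval == 0

print(should_save_step(200))                         # True: multiple of 100
print(should_save_step(201))                         # False
print(should_save_step(3, save_steps=[1, 2, 3, 4]))  # True: explicitly listed
```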
## Using a Custom Container
To use a custom container (without the framework forks), you must modify your training script.
Use the same sagemaker Estimator setup shown above, and in your script, call

```python
hook = smd.{hook_class}.create_from_json_file()
```

and modify the rest of your script as shown in the API docs. Click on your desired framework below.
- [TensorFlow](https://link.com)
- [PyTorch](https://link.com)
- [MXNet](https://link.com)
- [XGBoost](https://link.com)
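`create_from_json_file` builds the hook from a JSON configuration that SageMaker places inside the container. As a rough illustration of that pattern only: the file layout, field names, and helper below are hypothetical, not the real smdebug schema:

```python
import json
import os
import tempfile

# Hypothetical hook configuration; the real JSON schema is defined by SageMaker.
config = {
    "local_path": "/opt/ml/output/tensors",
    "hook_parameters": {"save_interval": "100"},
    "collection_configs": [{"name": "weights"}, {"name": "losses"}],
}

def create_hook_from_json(path):
    """Read a hook configuration file and return its parsed settings."""
    with open(path) as f:
        cfg = json.load(f)
    return cfg["local_path"], [c["name"] for c in cfg["collection_configs"]]

# Simulate the config file SageMaker would drop into the container.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(config, f)
    config_path = f.name

out_dir, collections = create_hook_from_json(config_path)
print(out_dir)      # /opt/ml/output/tensors
print(collections)  # ['weights', 'losses']
os.unlink(config_path)
```

The point of this indirection is that the same training script works with whatever rules and collections the Estimator was configured with, without hard-coding them.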
## Comprehensive Rule List
The full list of built-in rules is:

| Rule Name | Behavior |
| --- | --- |
| `vanishing_gradient` | Detects a vanishing gradient. |
| `all_zero` | ??? |
| `check_input_images` | ??? |
| `similar_across_runs` | ??? |
| `weight_update_ratio` | ??? |
| `exploding_tensor` | ??? |
| `unchanged_tensor` | ??? |
| `loss_not_decreasing` | ??? |
| `dead_relu` | ??? |
| `confusion` | ??? |
| `overfit` | ??? |
| `tree_depth` | ??? |
| `tensor_variance` | ??? |
| `overtraining` | ??? |
| `poor_weight_initialization` | ??? |
| `saturated_activation` | ??? |
| `nlp_sequence_ratio` | ??? |
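As an example of what a rule like `vanishing_gradient` looks for, here is a pure-Python sketch with an assumed threshold (not the built-in rule's implementation, which defines its own parameters):

```python
def gradients_vanishing(gradients, threshold=1e-7):
    """Return True if the mean absolute gradient has collapsed toward zero.

    Sketch only: the built-in vanishing_gradient rule operates on
    gradient tensors saved by the hook and has its own threshold parameter.
    """
    mean_abs = sum(abs(g) for g in gradients) / len(gradients)
    return mean_abs < threshold

print(gradients_vanishing([1e-3, -2e-3, 5e-4]))   # False: healthy gradients
print(gradients_vanishing([1e-9, -3e-10, 2e-9]))  # True: effectively zero
```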
File renamed without changes.