From eb48aba014a6d9c9f998321a6d3ebce0adb47343 Mon Sep 17 00:00:00 2001
From: Jared T Nielsen
Date: Wed, 27 Nov 2019 16:08:09 -0800
Subject: [PATCH] Docs (#60)

* Rework README to point directly to framework pages

* WIP

* WIP

* Rename documentation to docs

* Updated sagemaker.md

* Move README.md to top-level and delete old

* Add 'how-to-use' to tensorflow.md
---
 README.md                             | 121 ++++++++++++---
 {documentation => docs}/API.md        |  63 +++-----
 {documentation => docs}/analysis.md   |   0
 .../distributed_training.md           |   0
 {documentation => docs}/mxnet.md      |   0
 {documentation => docs}/pytorch.md    |   0
 docs/sagemaker.md                     | 138 ++++++++++++++++++
 {documentation => docs}/tensorflow.md |  27 ++++
 {documentation => docs}/xgboost.md    |   0
 documentation/README.md               | 129 ----------------
 documentation/sagemaker.md            |  63 --------
 11 files changed, 283 insertions(+), 258 deletions(-)
 rename {documentation => docs}/API.md (86%)
 rename {documentation => docs}/analysis.md (100%)
 rename {documentation => docs}/distributed_training.md (100%)
 rename {documentation => docs}/mxnet.md (100%)
 rename {documentation => docs}/pytorch.md (100%)
 create mode 100644 docs/sagemaker.md
 rename {documentation => docs}/tensorflow.md (83%)
 rename {documentation => docs}/xgboost.md (100%)
 delete mode 100644 documentation/README.md
 delete mode 100644 documentation/sagemaker.md

diff --git a/README.md b/README.md
index f1eea4552f..070bfb60fb 100644
--- a/README.md
+++ b/README.md
@@ -1,32 +1,107 @@
-## Tornasole
+# SageMaker Debugger
-Tornasole is an upcoming AWS service designed to be a debugger
-for machine learning models. It lets you go beyond just looking
-at scalars like losses and accuracies during training and
-gives you full visibility into all tensors 'flowing through the graph'
-during training or inference.
+- [Overview](#overview)
+- [Examples](#example-sagemaker-zero-script-change)
+- [How It Works](#how-it-works)
-Using Tornasole is a two step process:
+## Overview
+SageMaker Debugger is an AWS service that automatically debugs your machine learning training process.
+It helps you develop better, faster, cheaper models by catching common errors quickly. It supports
+TensorFlow, PyTorch, MXNet, and XGBoost on Python 3.6+.
-### Saving tensors
+- Zero-script-change experience on SageMaker and AWS Deep Learning containers.
+- Automated anomaly detection and state assertions.
+- Real-time training job monitoring and visibility into any tensor value.
+- Distributed training and TensorBoard support.
-This needs the `tornasole` package built for the appropriate framework.
-It allows you to collect the tensors you want at the frequency
-that you want, and save them for analysis.
-Please follow the appropriate Readme page to install the correct version.
+There are two ways to use it: automatic mode and configurable mode.
+- Automatic mode: No changes to your training script. Specify the rules you want and launch a SageMaker Estimator job.
+- Configurable mode: More powerful. Lets you specify exactly which tensors and collections to save, using the Python API within your script (see the sketch below).
-#### [Tornasole TensorFlow](docs/tensorflow/README.md)
-#### [Tornasole MXNet](docs/mxnet/README.md)
-#### [Tornasole PyTorch](docs/pytorch/README.md)
-#### [Tornasole XGBoost](docs/xgboost/README.md)
-### Analysis
-Please refer **[this page](docs/rules/README.md)** for more details about how to analyze.
-The analysis of these tensors can be done on a separate machine in parallel with the training job.
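+
+As a taste of configurable mode, here is a minimal sketch of creating a hook in your own script
+(the argument values are illustrative; the hook API is described in docs/API.md):
+```python
+import smdebug.tensorflow as smd
+
+hook = smd.KerasHook(
+    out_dir="/tmp/run",                             # where tensor files are written
+    save_config=smd.SaveConfig(save_interval=100),  # save every 100 steps
+    include_collections=["weights", "losses"],      # which tensor groups to save
+)
+```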
+## Example: SageMaker Zero-Script-Change
+This example uses the zero-script-change experience, where you can use your training script as-is.
+See the [example notebooks](https://link.com) for more details.
+```python
+import sagemaker
+from sagemaker.debugger import rule_configs, Rule, CollectionConfig
-## ContactUs
-We would like to hear from you. If you have any question or feedback, please reach out to us tornasole-users@amazon.com
+
+# Choose a built-in rule to monitor your training job
+rule = Rule.sagemaker(
+    rule_configs.exploding_tensor(),
+    rule_parameters={
+        "tensor_regex": ".*"
+    },
+    collections_to_save=[
+        CollectionConfig(name="weights"),
+        CollectionConfig(name="losses"),
+    ],
+)
-## License
-This library is licensed under the Apache 2.0 License.
+
+# Pass the rule to the estimator
+sagemaker_simple_estimator = sagemaker.tensorflow.TensorFlow(
+    entry_point="script.py",
+    role=sagemaker.get_execution_role(),
+    framework_version="1.15",
+    py_version="py3",
+    rules=[rule],
+)
+
+sagemaker_simple_estimator.fit()
+```
+
+That's it! SageMaker will automatically monitor your training job for you and create a CloudWatch
+event if you run into exploding tensor values.
+
+If you want greater configuration and control, we offer that too. Simply pass a `DebuggerHookConfig`
+to your estimator, as described on the [SageMaker page](https://link.com).
+
+## Example: Running Locally
+This example uses tf.keras and requires Python 3.6+. First install the `smdebug` library:
+```
+pip install smdebug
+```
+
+To use SageMaker Debugger, simply add a callback hook:
+```python
+import tensorflow as tf
+import smdebug.tensorflow as smd
+
+hook = smd.KerasHook(out_dir=args.out_dir)
+
+model = tf.keras.models.Sequential([ ... ])
+model.compile(
+    optimizer='adam',
+    loss='sparse_categorical_crossentropy',
+)
+
+# Add the hook as a callback
+model.fit(x_train, y_train, epochs=args.epochs, callbacks=[hook])
+model.evaluate(x_test, y_test, callbacks=[hook])
+
+# Create a trial to inspect the saved tensors
+trial = smd.create_trial(out_dir=args.out_dir)
+print(f"Saved tensor values for {trial.tensors()}")
+print(f"Loss values were {trial.tensor('CrossEntropyLoss:0')}")
+```
+
+## How It Works
+SageMaker Debugger uses a `hook` to store the values of tensors throughout the training process. Another process, called a `rule` job,
+simultaneously monitors and validates these outputs to ensure that training is progressing as expected.
+A rule might check for vanishing gradients, exploding tensor values, or poor weight initialization.
+If a rule is triggered, it raises a CloudWatch event and stops the training job, saving you time
+and money.
+
+SageMaker Debugger can be used inside or outside of SageMaker. There are three main use cases:
+- SageMaker Zero-Script-Change: Here you specify which rules to use when setting up the estimator and run your existing script, no changes needed. See the first example above.
+- SageMaker Bring-Your-Own-Container: Here you specify the rules to use, and modify your training script.
+- Non-SageMaker: Here you write custom rules (or manually analyze the tensors) and modify your training script. See the second example above.
+
+The reason for the different setups is that SageMaker Zero-Script-Change uses custom framework forks of TensorFlow, PyTorch, MXNet, and XGBoost that save tensors automatically.
+These framework forks are not available in custom containers or non-SageMaker environments, so you must modify your training script in those environments.
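+
+Concretely, the required modification is just creating the hook yourself and registering it with
+your model. A minimal tf.keras sketch is below (the one-layer model and random data are purely
+illustrative; `wrap_optimizer` and `create_from_json_file` follow the APIs described on the
+TensorFlow and SageMaker pages):
+```python
+import numpy as np
+import tensorflow as tf
+import smdebug.tensorflow as smd
+
+# In a SageMaker custom container, build the hook from the JSON config the estimator writes.
+# Outside SageMaker, construct it directly, e.g. smd.KerasHook(out_dir="/tmp/run").
+hook = smd.KerasHook.create_from_json_file()
+
+model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
+opt = hook.wrap_optimizer(tf.keras.optimizers.Adam())  # lets the hook see gradients
+model.compile(optimizer=opt, loss="mse")
+
+x = np.random.rand(16, 4).astype("float32")
+y = np.random.rand(16, 1).astype("float32")
+model.fit(x, y, epochs=1, callbacks=[hook])  # the hook saves tensors during training
+```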
+
+See the [SageMaker page](https://link.com) for details on SageMaker Zero-Script-Change and the BYOC experience.\
+See the framework pages for details on modifying the training script:
+- [TensorFlow](https://link.com)
+- [PyTorch](https://link.com)
+- [MXNet](https://link.com)
+- [XGBoost](https://link.com)
diff --git a/documentation/API.md b/docs/API.md
similarity index 86%
rename from documentation/API.md
rename to docs/API.md
index 73848bb33a..401dc720be 100644
--- a/documentation/API.md
+++ b/docs/API.md
@@ -10,55 +10,32 @@ These objects exist across all frameworks.
 - [SaveConfig](#saveconfig)
 - [ReductionConfig](#reductionconfig)
----
-## SageMaker Zero-Code-Change vs. Python API
+## Glossary
-There are two ways to use sagemaker-debugger: SageMaker Zero-Code-Change or Python API.
+The imports assume `import smdebug.{tensorflow,pytorch,mxnet,xgboost} as smd`.
-SageMaker Zero-Code-Change will use a custom framework fork to automatically instantiate the hook, register tensors, and create collections.
-All you need to do is decide which built-in rules to use. Further documentation is available on [AWS Docs](https://link.com).
-```python
-import sagemaker
-from sagemaker.debugger import rule_configs, Rule, CollectionConfig, DebuggerHookConfig, TensorBoardOutputConfig
-
-hook_config = DebuggerHookConfig(
-    s3_output_path = args.s3_path,
-    container_local_path = args.local_path,
-    hook_parameters = {
-        "save_steps": "0,20,40,60,80"
-    },
-    collection_configs = {
-        { "CollectionName": "weights" },
-        { "CollectionName": "biases" },
-    },
-)
+**Hook**: The main interface to the training process. This object can be passed as a model hook/callback
+in TensorFlow and Keras. It keeps track of collections and writes output files at each step.
+- `hook = smd.Hook(out_dir="/tmp/mnist_job")`
-rule = Rule.sagemaker(
-    rule_configs.exploding_tensor(),
-    rule_parameters={
-        "tensor_regex": ".*"
-    },
-    collections_to_save=[
-        CollectionConfig(name="weights"),
-        CollectionConfig(name="losses"),
-    ],
-)
+**Mode**: One of "train", "eval", "predict", or "global". Helpful for segmenting data based on the phase
+you're in. Defaults to "global".
+- `train_mode = smd.modes.TRAIN`
-sagemaker_simple_estimator = sagemaker.tensorflow.TensorFlow(
-    entry_point="script.py",
-    role=sagemaker.get_execution_role(),
-    framework_version="1.15",
-    py_version="py3",
-    rules=[rule],
-    debugger_hook_config=hook_config,
-)
+**Collection**: A group of tensors. Each collection contains its own save configuration and regexes for
+tensors to include/exclude.
+- `collection = hook.get_collection("losses")`
-sagemaker_simple_estimator.fit()
-```
+**SaveConfig**: A configuration object specifying how often to save losses and tensors.
+- `save_config = smd.SaveConfig(save_interval=10)`
+
+**ReductionConfig**: Allows you to save a reduction, such as 'mean' or 'l1 norm', instead of the full tensor.
+- `reduction_config = smd.ReductionConfig(reductions=['min', 'max', 'mean'], norms=['l1'])`
+
+**Trial**: The main interface to use when analyzing a completed training job. Access collections and tensors. See [trials documentation](https://link.com).
+- `trial = smd.create_trial(out_dir="/tmp/mnist_job")`
-The Python API requires more configuration but is also more flexible. You must write your own custom rules
-instead of using SageMaker's built-in rules, but you can use it with a custom container in SageMaker or in your own
-environment. It is described further below.
+**Rule**: A condition that will trigger an exception and terminate the training job early, for example a vanishing gradient. See [rules documentation](https://link.com).
 ---
diff --git a/documentation/analysis.md b/docs/analysis.md
similarity index 100%
rename from documentation/analysis.md
rename to docs/analysis.md
diff --git a/documentation/distributed_training.md b/docs/distributed_training.md
similarity index 100%
rename from documentation/distributed_training.md
rename to docs/distributed_training.md
diff --git a/documentation/mxnet.md b/docs/mxnet.md
similarity index 100%
rename from documentation/mxnet.md
rename to docs/mxnet.md
diff --git a/documentation/pytorch.md b/docs/pytorch.md
similarity index 100%
rename from documentation/pytorch.md
rename to docs/pytorch.md
diff --git a/docs/sagemaker.md b/docs/sagemaker.md
new file mode 100644
index 0000000000..440cbd0ff6
--- /dev/null
+++ b/docs/sagemaker.md
@@ -0,0 +1,138 @@
+# SageMaker
+
+There are two cases for SageMaker:
+- Zero-Script-Change (ZSC): Here you specify which rules to use, and run your existing script.
+  - Supported in Deep Learning Containers: `TensorFlow==1.15, PyTorch==1.3, MXNet==1.6`
+- Bring-Your-Own-Container (BYOC): Here you specify the rules to use, and modify your training script.
+  - Supported with `TensorFlow==1.13/1.14/1.15, PyTorch==1.2/1.3, MXNet==1.4/1.5/1.6`
+
+Table of Contents
+- [Configuration Details](#configuration-details)
+- [Using a Custom Container](#using-a-custom-container)
+- [Comprehensive Rule List](#comprehensive-rule-list)
+
+## Configuration Details
+The `DebuggerHookConfig` is the main configuration object; the related `Rule`, `TensorBoardOutputConfig`,
+and `CollectionConfig` objects are shown alongside it below.
+
+```python
+rule = sagemaker.debugger.Rule.sagemaker(
+    base_config: dict, # Use an import, e.g. sagemaker.debugger.rule_configs.exploding_tensor()
+    name: str=None,
+    instance_type: str=None,
+    container_local_path: str=None,
+    volume_size_in_gb: int=None,
+    other_trials_s3_input_paths: str=None,
+    rule_parameters: dict=None,
+    collections_to_save: list[sagemaker.debugger.CollectionConfig]=None,
+)
+```
+
+```python
+hook_config = sagemaker.debugger.DebuggerHookConfig(
+    s3_output_path: str,
+    container_local_path: str=None,
+    hook_parameters: dict=None,
+    collection_configs: list[sagemaker.debugger.CollectionConfig]=None,
+)
+```
+
+```python
+tb_config = sagemaker.debugger.TensorBoardOutputConfig(
+    s3_output_path: str,
+    container_local_path: str=None,
+)
+```
+
+```python
+collection_config = sagemaker.debugger.CollectionConfig(
+    name: str,
+    parameters: dict,
+)
+```
+
+A full example script is below:
+```python
+import sagemaker
+from sagemaker.debugger import rule_configs, Rule, DebuggerHookConfig, TensorBoardOutputConfig, CollectionConfig
+
+hook_parameters = {
+    "include_regex": "my_regex,another_regex", # comma-separated string of regexes
+    "save_interval": 100,
+    "save_steps": "1,2,3,4", # comma-separated string of steps to save
+    "start_step": 1,
+    "end_step": 2000,
+    "reductions": "min,max,mean,std,abs_variance,abs_sum,abs_l2_norm",
+}
+weights_config = CollectionConfig(name="weights")
+biases_config = CollectionConfig(name="biases")
+losses_config = CollectionConfig(name="losses")
+tb_config = TensorBoardOutputConfig(s3_output_path="s3://my-bucket/tensorboard")
+
+hook_config = DebuggerHookConfig(
+    s3_output_path="s3://my-bucket/smdebug",
+    hook_parameters=hook_parameters,
+    collection_configs=[weights_config, biases_config, losses_config],
+)
+
+exploding_tensor_rule = Rule.sagemaker(
+    base_config=rule_configs.exploding_tensor(),
+    rule_parameters={
+        "tensor_regex": ".*",
+    },
+    collections_to_save=[weights_config, losses_config],
+)
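+
+# Each rule attached to the estimator runs in its own rule job alongside the training job,
+# reading the tensors that training emits; several rules can watch the same run.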
+vanishing_gradient_rule = Rule.sagemaker(base_config=rule_configs.vanishing_gradient())
+
+# Or use sagemaker.pytorch.PyTorch or sagemaker.mxnet.MXNet
+sagemaker_simple_estimator = sagemaker.tensorflow.TensorFlow(
+    entry_point=simple_entry_point_script,
+    role=sagemaker.get_execution_role(),
+    base_job_name=args.job_name,
+    train_instance_count=1,
+    train_instance_type="ml.m4.xlarge",
+    framework_version="1.15",
+    py_version="py3",
+    # smdebug-specific arguments below
+    rules=[exploding_tensor_rule, vanishing_gradient_rule],
+    debugger_hook_config=hook_config,
+    tensorboard_output_config=tb_config,
+)
+
+sagemaker_simple_estimator.fit()
+```
+
+## Using a Custom Container
+To use a custom container (without the framework forks), you should modify your script.
+Use the same SageMaker Estimator setup as shown above, and in your script, call
+
+```python
+hook = smd.{hook_class}.create_from_json_file()
+```
+
+and modify the rest of your script as shown in the API docs. Click on your desired framework below.
+- [TensorFlow](https://link.com)
+- [PyTorch](https://link.com)
+- [MXNet](https://link.com)
+- [XGBoost](https://link.com)
+
+
+## Comprehensive Rule List
+The full list of rules is:
+
+| Rule Name | Behavior |
+| --- | --- |
+| `vanishing_gradient` | Detects a vanishing gradient. |
+| `all_zero` | ??? |
+| `check_input_images` | ??? |
+| `similar_across_runs` | ??? |
+| `weight_update_ratio` | ??? |
+| `exploding_tensor` | ??? |
+| `unchanged_tensor` | ??? |
+| `loss_not_decreasing` | ??? |
+| `dead_relu` | ??? |
+| `confusion` | ??? |
+| `overfit` | ??? |
+| `tree_depth` | ??? |
+| `tensor_variance` | ??? |
+| `overtraining` | ??? |
+| `poor_weight_initialization` | ??? |
+| `saturated_activation` | ??? |
+| `nlp_sequence_ratio` | ??? |
diff --git a/documentation/tensorflow.md b/docs/tensorflow.md
similarity index 83%
rename from documentation/tensorflow.md
rename to docs/tensorflow.md
index 3cf1704562..c91e8ab679 100644
--- a/documentation/tensorflow.md
+++ b/docs/tensorflow.md
@@ -3,12 +3,23 @@ SageMaker Zero-Code-Change supported container: TensorFlow 1.15.
 See the [AWS Docs](https://link.com) for details.\
 Python API supported versions: Tensorflow 1.13, 1.14, 1.15. Keras 2.3.
+
+
 ## Contents
+- [How to Use](#how-to-use)
 - [Keras Example](#keras-example)
 - [MonitoredSession Example](#monitored-session-example)
 - [Estimator Example](#estimator-example)
 - [Full API](#full-api)
+## How to Use
+1. `import smdebug.tensorflow as smd`
+2. Instantiate a hook: `smd.{hook_class}.create_from_json_file()` in a SageMaker environment, or `smd.{hook_class}()` elsewhere.
+3. Pass the hook to the model as a callback.
+4. If using a custom container or running outside of SageMaker, wrap the optimizer with `optimizer = hook.wrap_optimizer(optimizer)`.
+
+(Optional) Configure collections. See the [Common API](https://link.com) page for details on how to do this.
+
 ## tf.keras Example
 ```python
 import smdebug.tensorflow as smd
@@ -140,3 +151,19 @@ wrap_optimizer(
 )
 ```
 Adds functionality to the optimizer object to log gradients. Returns the original optimizer and doesn't change the optimization process.
+
+## Concepts
+The steps to use smdebug in any framework are:
+
+1. Create a `hook`.
+2. Register your model and optimizer with the hook.
+3. Specify the `rule` to be used.
+4. After training, create a `trial` to manually analyze the tensors.
+
+See the [API page](https://link.com) for more details.
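+
+A minimal sketch of those steps with tf.keras (the tiny model, random data, and output path are
+illustrative; the rule itself is configured on the SageMaker estimator rather than in the script):
+```python
+import numpy as np
+import tensorflow as tf
+import smdebug.tensorflow as smd
+
+hook = smd.KerasHook(out_dir="/tmp/run")                      # 1. create a hook
+
+model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
+model.compile(optimizer="adam", loss="mse")
+
+x = np.random.rand(16, 4).astype("float32")
+y = np.random.rand(16, 1).astype("float32")
+model.fit(x, y, epochs=1, callbacks=[hook])                   # 2. register the hook
+# 3. rules are specified when launching the job, not here
+
+trial = smd.create_trial(out_dir="/tmp/run")                  # 4. analyze the saved tensors
+print(trial.tensors())
+```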
+
+## Detailed Links
+- [Full API](https://link.com)
+- [Rules and Trials](https://link.com)
+- [Distributed Training](https://link.com)
+- [TensorBoard](https://link.com)
diff --git a/documentation/xgboost.md b/docs/xgboost.md
similarity index 100%
rename from documentation/xgboost.md
rename to docs/xgboost.md
diff --git a/documentation/README.md b/documentation/README.md
deleted file mode 100644
index a778e9cfdf..0000000000
--- a/documentation/README.md
+++ /dev/null
@@ -1,129 +0,0 @@
-# Sagemaker Debugger
-
-- [Overview](#overview)
-- [SageMaker Example](#sagemaker-example)
-- [Python Example](#python-example)
-- [Concepts](#concepts)
-- [Glossary](#glossary)
-- [Detailed Links](#detailed-links)
-
-## Overview
-Sagemaker Debugger is an AWS service to automatically debug your machine learning training process.
-It helps you develop better, faster, cheaper models by catching common errors quickly. It supports
-TensorFlow, PyTorch, MXNet, and XGBoost on Python 3.6+.
-
-- Zero-code-change experience on SageMaker and AWS Deep Learning containers.
-- Automated anomaly detection and state assertions.
-- Realtime training job monitoring and visibility into any tensor value.
-- Distributed training and TensorBoard support.
-
-## SageMaker Example
-This example uses a zero-code-change experience, where you can use your training script as-is.\
-See the [sagemaker](https://link.com) page for more details.
-```python
-import sagemaker
-from sagemaker.debugger import rule_configs, Rule, CollectionConfig
-
-rule = Rule.sagemaker(
-    rule_configs.exploding_tensor(),
-    rule_parameters={
-        "tensor_regex": ".*"
-    },
-    collections_to_save=[
-        CollectionConfig(name="weights"),
-        CollectionConfig(name="losses"),
-    ],
-)
-
-sagemaker_simple_estimator = sagemaker.tensorflow.TensorFlow(
-    entry_point="script.py",
-    role=sagemaker.get_execution_role(),
-    framework_version="1.15",
-    py_version="py3",
-    rules=[rule],
-)
-
-sagemaker_simple_estimator.fit()
-```
-
-
-## Python Example
-Requires Python 3.6+. Run
-```
-pip install smdebug
-```
-
-This example uses tf.keras. Say your training code looks like this:
-```python
-model = tf.keras.models.Sequential([ ... ])
-model.compile(
-    optimizer='adam',
-    loss='sparse_categorical_crossentropy',
-)
-model.fit(x_train, y_train, epochs=args.epochs)
-model.evaluate(x_test, y_test)
-```
-
-To use Sagemaker Debugger, simply add a callback hook:
-```python
-import smdebug.tensorflow as smd
-hook = smd.KerasHook(out_dir=args.out_dir)
-
-model = tf.keras.models.Sequential([ ... ])
-model.compile(
-    optimizer='adam',
-    loss='sparse_categorical_crossentropy',
-)
-model.fit(x_train, y_train, epochs=args.epochs, callbacks=[hook])
-model.evaluate(x_test, y_test, callbacks=[hook])
-```
-
-To analyze the result of the training run, create a trial and inspect the tensors.
-```python
-trial = smd.create_trial(out_dir=args.out_dir)
-print(f"Saved tensor values for {trial.tensors()}")
-print(f"Loss values were {trial.tensor('CrossEntropyLoss:0')}")
-```
-
-## Concepts
-The steps to use Tornasole in any framework are:
-
-1. Create a `hook`.
-2. Register your model and optimizer with the hook.
-3. Specify the `rule` to be used.
-4. After training, create a `trial` to manually analyze the tensors.
-
-See the [API page](https://link.com) for more details.
-
-## Glossary
-
-The imports assume `import smdebug.{tensorflow,pytorch,mxnet,xgboost} as smd`.
-
-**Hook**: The main interface to use training. This object can be passed as a model hook/callback
-in Tensorflow and Keras. It keeps track of collections and writes output files at each step.
-- `hook = smd.Hook(out_dir="/tmp/mnist_job")`
-
-**Mode**: One of "train", "eval", "predict", or "global". Helpful for segmenting data based on the phase
-you're in. Defaults to "global".
-- `train_mode = smd.modes.TRAIN`
-
-**Collection**: A group of tensors. Each collection contains its own save configuration and regexes for
-tensors to include/exclude.
-- `collection = hook.get_collection("losses")`
-
-**SaveConfig**: A Python dict specifying how often to save losses and tensors.
-- `save_config = smd.SaveConfig(save_interval=10)`
-
-**ReductionConfig**: Allows you to save a reduction, such as 'mean' or 'l1 norm', instead of the full tensor.
-- `reduction_config = smd.ReductionConfig(reductions=['min', 'max', 'mean'], norms=['l1'])`
-
-**Trial**: The main interface to use when analyzing a completed training job. Access collections and tensors. See [trials documentation](https://link.com).
-- `trial = smd.create_trial(out_dir="/tmp/mnist_job")`
-
-**Rule**: A condition that will trigger an exception and terminate the training job early, for example a vanishing gradient. See [rules documentation](https://link.com).
-
-## Detailed Links
-- [Full API](https://link.com)
-- [Rules and Trials](https://link.com)
-- [Distributed Training](https://link.com)
-- [TensorBoard](https://link.com)
diff --git a/documentation/sagemaker.md b/documentation/sagemaker.md
deleted file mode 100644
index c6ff49a42a..0000000000
--- a/documentation/sagemaker.md
+++ /dev/null
@@ -1,63 +0,0 @@
-# SageMaker Examples
-
-There are two cases for using SageMaker: fully managed or bring-your-own-container (BYOC).
-In fully managed mode, SageMaker will automatically inject hooks into your training script - no code
-change necessary! This is supported for TensorFlow 1.15, PyTorch 1.3, and MXNet 1.6.
-
-In BYOC mode, you will need to instantiate the hook and use it yourself. Built-in rules will not be
-available, but you can write custom rules and use those.
-
-## Example Usage (Sagemaker Fully Managed)
-This setup will work for any script without code changes. This example shows Tensorflow 1.15.
-See the [JSON specification](https://link.com) section of API.md for details on the JSON configuration.
-
-This example uses TensorFlow.
-To use PyTorch or MXNet, simply call `sagemaker.pytorch.PyTorch` or `sagemaker.mxnet.MXNet`.
-```python
-import sagemaker
-from sagemaker.debugger import Rule, rule_configs, DebuggerHookConfig, TensorBoardOutputConfig, CollectionConfig
-
-hook_config = DebuggerHookConfig(
-    s3_output_path = "s3://my-bucket/debugger-logs",
-    hook_parameters = {
-        "save_steps": "0,20,40,60,80"
-    },
-    collection_configs = {
-        { "CollectionName": "weights" },
-        { "CollectionName": "biases" },
-    },
-)
-
-
-rule = Rule.sagemaker(
-    rule_configs.exploding_tensor(),
-    rule_parameters={
-        "tensor_regex": ".*"
-    },
-    collections_to_save=[
-        CollectionConfig(name="weights", parameters={}),
-        CollectionConfig(name="losses", parameters={}),
-    ],
-)
-
-sagemaker_simple_estimator = sagemaker.tensorflow.TensorFlow(
-    entry_point=simple_entry_point_script,
-    role=sagemaker.get_execution_role(),
-    base_job_name=args.job_name,
-    train_instance_count=1,
-    train_instance_type="ml.m4.xlarge",
-    framework_version="1.15",
-    py_version="py3",
-    debugger_hook_config=hook_config,
-    rules=[rule],
-)
-
-sagemaker_simple_estimator.fit()
-```
-
-When a rule triggers, it will create a CloudWatch event.
-
-## Example Usage (SageMaker BYOC)
-Use the same script as fully managed. In the script, call
-`hook = smd.{hook_class}.create_from_json_file()`
-to get the hook and then use it as described in the rest of the API docs.