Use this repository to enable a MAX model for training with Watson Machine Learning on IBM Cloud. This repository contains the MAX model training skeleton files. To create a training-enabled model repository, complete the following tasks:
- Task 1: Clone the model repository
- Task 2: Clone the training skeleton repository
- Task 3: Install and customize the training skeleton
- Task 4: Prepare the model training code
- Task 5: Update the Dockerfile
- Task 6: Configure your environment for testing
- Task 7: Verify that model training works as expected
- Task 8: Customize the model-serving code as required
- Task 9: Add data preparation instructions
- Task 10: Upload pre-trained model assets to the production COS
- Task 11: Customize the model training README
- Task 12: Update the model README
- Clone the desired MAX model repository, e.g. https://github.com/IBM/MAX-Object-Detector, into a temporary location.

  ```
  $ pwd
  my-working-directory
  $ git clone https://github.com/IBM/MAX-...
  ...
  ```

  The subsequent tasks will modify the content of this cloned directory.
- Take note of the cloned directory location.

  ```
  $ ls *
  MAX-...
  ```

  The instructions in this document refer to this location as `$MODEL_REPO_HOME`. (Note that no such environment variable is actually defined; it is used here only as a placeholder.) The subsequent tasks will modify the content of this directory to enable the model for training with custom data.
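  For convenience, you can define such a variable yourself while working through these tasks. A minimal sketch (the path is based on the Object Detector example and must be adjusted for your model):

  ```
  $ export MODEL_REPO_HOME=$(pwd)/MAX-Object-Detector
  ```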
The training skeleton repository contains the MAX model-training framework code, which is model agnostic.
- Clone the training skeleton repository into a temporary location.

  ```
  $ pwd
  <my-working-directory>
  $ git clone https://github.com/IBM/MAX-Training-Skeleton
  $ cd MAX-Training-Skeleton
  $ pwd
  <my-working-directory>/MAX-Training-Skeleton
  ```

  The instructions in this document refer to this location as `$SKELETON_HOME`. (Again, no such environment variable is actually defined; it is used here only as a placeholder.)
The skeleton installer requires Python 3.6 or above. It is recommended that you start with a clean Python environment.
- In `$SKELETON_HOME`, launch the training skeleton installation program. The `install_training_skeleton.sh` shell script installs the training skeleton Python dependencies and launches the installer.

  ```
  $ ./install_training_skeleton.sh
  ```

  To run only the installer:

  ```
  $ python install_training_skeleton.py
  ```

  The installer copies the training skeleton files to the `$MODEL_REPO_HOME/training` directory and creates a model-specific training configuration YAML file in that location, using the information you enter when prompted. The installer aborts if `$MODEL_REPO_HOME/training` already exists. (Delete the directory to start over.)
- Follow the prompts. Whenever possible, a meaningful default value is provided. Hit `ENTER` to accept the default.

  ```
  Model name [MAX Object Detector]: <ENTER>
  Model identifier [max-object-detector]: <ENTER>
  Model description [Localize and identify multiple objects in a single image.]: <ENTER>
  ```

  If no meaningful default is displayed, open an issue. (The default values are derived from information in `$MODEL_REPO_HOME`.)

  You will be prompted to select the model's deep learning framework and serialization format. (If you don't know them, check how the Docker-based microservice loads the model.)

  ```
  Deep learning framework:
   1. TensorFlow
   2. Keras
  Pick one: 1

  tensorflow model serialization format:
   1. Checkpoint
   2. Frozen graph definition
   3. Saved model
  Pick one: 1
  ```

  If none of the displayed options applies to the model, STOP and open an issue. We will update the training skeleton to add support, as documented here.
  You will be prompted to specify an optional training data key prefix.

  ```
  Training data key name prefix: <ENTER>
  ```

  By default, WML automatically makes training data available in a local directory that is identified by the `DATA_DIR` environment variable, e.g. `/some/random/path/`. You can optionally specify a subdirectory name by defining a key prefix. For example, if your training code expects training data to be stored in a subdirectory named `data`, set the key prefix to `data/`. Your code can then access the files in `$DATA_DIR/data`.
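  For illustration, a shell snippet inside your training wrapper script could resolve the data location like this (a minimal sketch; `data` is the hypothetical key prefix from the example above):

  ```
  # DATA_DIR is set by WML; "data" is the key prefix chosen during setup
  TRAINING_DATA_DIR="${DATA_DIR}/data"
  ls "${TRAINING_DATA_DIR}"
  ```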
Upon completion of this task, the model repository contains the model training framework along with its configuration files:

```
$ ls $MODEL_REPO_HOME/training
train_max_model.py
setup_max_model_training.py
requirements.txt
max-...-training-config.yaml
sample_training_data/
training_code/
data_preparation/
```
- `train_max_model.py` - The training script that the end user runs to train the model. This is the core of the training framework.
- `setup_max_model_training.py` - The end user must run this script to configure model training prior to running `train_max_model.py`.
- `requirements.txt` - Identifies the prerequisite packages.
- `max-...-training-config.yaml` - The model-specific configuration file that `train_max_model.py` uses to perform model training. The end user updates this file using `setup_max_model_training.py`.
- `sample_training_data/` - This directory optionally contains a small training data set that the end user can use to train the model (for example, to understand the training flow).
- `training_code/` - This directory will contain the model-specific training code that YOU provide.
- `data_preparation/` - This directory will contain the model-specific data preparation instructions and code that YOU provide.
Note: The installer also updates `.gitignore` and `.dockerignore` in `$MODEL_REPO_HOME`.
The model training code must produce artifacts that the model-serving code in the Docker image can consume. Most likely you will have to customize or generalize the existing model-serving code and the model-building code to find the right balance. Typical issues:

- The model-serving code loads a model from the `my-saved-state.pb` file, but your original training code might save the model in `my-state.pb`. You need to pick one or the other. Continue reading.
- The model-serving code loads a model from the `my-saved-state-05-05-2018.pb` file, which contains a date (or version or data set name) in the name. Your model-building code should not try to save the model under this name because the date, version, or data set name has lost its meaning (the model was trained on a different date, using a different data set, etc.). In this case you should pick a general name that is semantically correct for both the pre-trained model and a custom-trained model. (This requires changes to the artifacts stored on COS, but that's okay.)
- The model-serving code extracts the model artifacts from a compressed archive (e.g. if model-serving supports multiple model flavors). Therefore the custom model training/packaging code must produce a compressed archive with the same name.
How model-serving has been implemented has a major impact on the model training code and the model packaging code. At a minimum you must collect the following information (one quick way to gather it is shown after this list):

- Does the model-serving code (not the `Dockerfile`!) process one or more compressed archive(s)?
  - Note the names.
  - Document the archive structure.
- Does the model-serving code (not the `Dockerfile`!) process serialized model files?
  - Note the names.
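One quick way to gather this information is to search the model-serving code for references to model artifact files. This is only a suggestion; the file extensions and search scope below are illustrative and depend on the model:

```
$ cd $MODEL_REPO_HOME
$ grep -rn -e "\.pb" -e "\.h5" -e "\.tar" --include="*.py" .
```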
- Place the model training code that WML will run into the `$MODEL_REPO_HOME/training/training_code/` directory.
- Customize `$MODEL_REPO_HOME/training/training_code/train-max-model.sh` by following the embedded instructions (a hypothetical sketch follows this list).
- Add packages that your model training code requires to `$MODEL_REPO_HOME/training/training_code/training_requirements.txt`. `$MODEL_REPO_HOME/training/training_code/train-max-model.sh` installs the listed requirements when WML starts the training run.
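For orientation only, a hypothetical sketch of what the customized portions of `train-max-model.sh` might do; the training entry point `train_my_model.py` and its flag are placeholders you would replace with your own code (the actual script contains the authoritative instructions):

```
# install the model-specific prerequisites listed in training_requirements.txt
pip install -r training_requirements.txt

# launch the model-specific training entry point (placeholder name and flag);
# DATA_DIR points to the training data that WML makes available
python train_my_model.py --data_dir "${DATA_DIR}/data"
```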
DO:

- Add a small ready-to-use sample data set to `$MODEL_REPO_HOME/training/sample_training_data/`. The training framework uploads the content of this directory to the training data bucket in Cloud Object Storage if that bucket is empty. Note that the training data must reside in the `sample_training_data/data` subdirectory (see the example layout below).
- Update README files with model-specific information that a user might find useful.
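For example, a sample data set might be laid out as follows (the file names are illustrative; only the `data/` subdirectory is required):

```
sample_training_data/
└── data/
    ├── annotations.csv
    ├── image_1.jpg
    └── image_2.jpg
```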
DO NOT:

- Modify any source code in `$MODEL_REPO_HOME/training/` (aside from files in `$MODEL_REPO_HOME/training/training_code/`).
Review the training-enabled `Dockerfile` recipe and apply the relevant parts to the model's original `Dockerfile`.
You should always test in a clean environment (e.g. by creating a brand-new conda environment that does not have any of the prerequisites pre-installed).
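For example (the environment name and Python version are illustrative):

```
$ conda create --name max-training-test python=3.6
$ conda activate max-training-test
```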
- Install the training framework/setup prerequisites and run the setup program.

  ```
  $ pwd
  $MODEL_REPO_HOME/training
  $ pip install -r requirements.txt
  ...
  $ python setup_max_model_training.py <max-model-name-training-config>.yaml
  ...
  ```

  If `setup_max_model_training.py` or `train_max_model.py` fails at any time due to a missing package, open an issue. We will update the `requirements.txt` in the training skeleton as necessary.

- Upon completion, `setup_max_model_training.py` should display a set of environment variables that you need to define.
- Launch the training utility `train_max_model.py` in `$MODEL_REPO_HOME/training/` to review its options.

  ```
  $ pwd
  $MODEL_REPO_HOME/training
  $ python train_max_model.py
  ```

- Prepare for training. This step creates the required buckets in COS, uploads the training data (if required), and creates the WML model-building code archive.

  ```
  $ python train_max_model.py <max-model-name-training-config>.yaml prepare
  ```

- Inspect the content of the generated model-building code archive `<max-model-name>-model-building-code.zip` (see the example after this list). This archive is sent to WML when the training is started.

- Train the model and package the artifacts.

  ```
  $ python train_max_model.py <max-model-name-training-config>.yaml package
  ```

- Review the output and the downloaded training log file `training-log.txt`. No errors should be reported.
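One way to inspect the generated model-building code archive mentioned above:

```
$ unzip -l <max-model-name>-model-building-code.zip
```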
The model-serving code in `$MODEL_REPO_HOME/` might require some changes based on how your model was trained and packaged. If appropriate, change the model-serving code to accommodate the new requirements. If necessary, re-package the originally published model assets.
Verify that the model-serving code works as expected for the pre-trained model assets and the custom-trained model assets.
- Run `docker build` using the `--build-arg use_pre_trained_model=true` build argument (see the example after this list).
- Launch the Docker image and verify that model-serving yields the expected results.
- Run `docker build` using the `--build-arg use_pre_trained_model=false` build argument.
- Launch the Docker image and verify that model-serving yields the expected results.
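For example (the image tag is illustrative, and the exposed port depends on the model's `Dockerfile`; MAX models typically serve on port 5000):

```
$ docker build -t max-object-detector --build-arg use_pre_trained_model=true .
$ docker run -it -p 5000:5000 max-object-detector
```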
The `$MODEL_REPO_HOME/training/data_preparation` directory includes a README template file. Update the template and add code or examples as required.
OPTIONAL: To easily support pre-trained and custom-trained models, it might sometimes be necessary to replace the original model assets on COS. For example, a model asset might be named `model-0506-2018.pb`. Because it wouldn't make semantic sense for model training to produce a file named `model-0506-2018.pb` (0506-2018 representing a date), the model-serving code would have to be enabled to use different names (one for the pre-trained model and one for the custom-trained model). To keep things simple, you might therefore decide to re-package the original artifacts.
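For illustration, re-packaging the example asset under a general name might look like this (whether an archive is needed, and its name, depend on what the model-serving code expects):

```
$ mv model-0506-2018.pb model.pb
$ tar -czvf model-assets.tar.gz model.pb
```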
After testing has completed, upload the re-packaged model assets to the production COS.
Customize the pre-configured `training/README.md` file by following the steps listed in the file's header.
Make sure the model's `README.md` file includes a "How to train this model" section, as shown in https://github.com/IBM/MAX-Skeleton/blob/master/README-template.md.