Date/Revision: February 23, 2024
This page contains step-by-step guides for installing the infrastructure and all necessary components for the PDK environment, covering different Kubernetes platforms.
The first step is to provision an environment with the PDK components. If you don't have an environment available, follow the links below for deployment information. If you already have access to an environment, you can go directly to the Creating the PDK Dogs-and-Cats Assets section for PDK-specific instructions.
Click on your platform of choice to access the specific deployment guide for that platform.
Regardless of the selected platform, do not proceed until you have a fully functioning cluster, with MLDM, MLDE and KServe.
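A quick sanity check can help before moving on. The command below is a minimal sketch that assumes the default component names (pachd for MLDM, determined for MLDE, kserve for KServe); pod and namespace names depend on how each component was installed, so adjust the pattern as needed.

# List the core PDK component pods across all namespaces (names are assumptions)
kubectl get pods --all-namespaces | grep -Ei 'pachd|determined|kserve'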
The diagram below illustrates how the PDK flow will work:
- A new project with 2 pipelines will be created in MLDM
- Data (a collection of files) will be uploaded to the MLDM repository
- This repository will be the input for the 'Train' pipeline, which will start automatically to create a new Experiment in MLDE
- To generate a new Experiment, the pipeline will need to download the assets (configuration + code) from GitHub
- Technically speaking, these assets can be stored anywhere, but GitHub is the easiest way to maintain the code
- Once the experiment is completed, MLDM will register the top checkpoint in MLDE and create a configuration file with information about the model, which will be used to deploy it
- This configuration file will be stored in the repository that will serve as the input for the 'Deploy' pipeline, which will download the checkpoint from MLDE and deploy the model to KServe, using the configuration file generated by the 'Train' pipeline.
- Each pipeline will pull a specific container image from a registry. You will find instructions in this repository about how you can create your own images and push them to your own registry.
- The container images have the logic to initiate the MLDE experiment (Train pipeline) and deploy the model to KServe (Deploy pipeline). You can study the code by looking through the example folders.
- Sensitive data, like server URLs and passwords, will be stored in a secret (that you created as part of the platform setup) and mapped to environment variables at runtime; the MLDM pipeline is then able to pass those values forward to MLDE (see the sketch below)
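For reference, that secret is typically created during the platform setup. The command below is only a hypothetical sketch: the secret name (pipeline-secret), the namespace placeholder, and the key names are assumptions, so use the exact values from the setup guide for your platform.

# Hypothetical example: store sensitive values in a Kubernetes secret that the
# MLDM pipelines can map to environment variables at runtime
kubectl create secret generic pipeline-secret \
  --namespace <mldm-namespace> \
  --from-literal=det_master="<mlde-host>:<mlde-port>" \
  --from-literal=det_user="admin" \
  --from-literal=det_password="<mlde-password>" \
  --from-literal=pac_token="<mldm-auth-token>"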
This repository includes an Examples folder with a number of sample PDK projects. Each PDK example will have 3 main components:
- MLDE Experiment: includes the code and other assets that will be needed to train the model inside MLDE. This code will be pushed to GitHub, where it will be downloaded by the MLDM pipeline.
- Docker Images: the 'Train' and 'Deploy' images described above. Since the same training image can be used with all models, it will be located in a separate folder. As part of this document, we will walk through the steps of building and pushing the images to the registry. Optionally, you can use the hosted images from the provided example (if you don't want to build and push your own container images).
- Pipeline definitions: these are JSON files that will create the 'Train' and 'Deploy' pipelines, assigning the docker images that will be used by each.
In this guide, we will deploy one of the example projects (Dogs and Cats), to ensure that all PDK components are working properly. For each example, you will find a brief description of how to set it up and run the PDK flow, as well as sample data to test the inference service.
If you are planning on creating your own images or changing the experiment settings, the easiest way is to fork the repository, clone it locally, and make the changes:
git clone https://github.com/determined-ai/pdk.git .
Once you clone the repository, go to the `examples/dog-cat` folder, which contains all the necessary assets.
If you've followed the setup instructions provided in this repository, you now have a working cluster with a config map that contains a number of environment variables. Use the commands below to load them. If you did not follow the instructions to create your environment, some of these variables will still be required to set up PDK; make sure to assign the proper values to them.
export AZ_REGION=$(kubectl get cm pdk-config -o=jsonpath='{.data.region}') && echo $AZ_REGION
export MLDM_BUCKET_NAME=$(kubectl get cm pdk-config -o=jsonpath='{.data.mldm_bucket_name}') && echo $MLDM_BUCKET_NAME
export MLDM_HOST=$(kubectl get cm pdk-config -o=jsonpath='{.data.mldm_host}') && echo $MLDM_HOST
export MLDM_PORT=$(kubectl get cm pdk-config -o=jsonpath='{.data.mldm_port}') && echo $MLDM_PORT
export MLDM_URL=$(kubectl get cm pdk-config -o=jsonpath='{.data.mldm_url}') && echo $MLDM_URL
export MLDE_BUCKET_NAME=$(kubectl get cm pdk-config -o=jsonpath='{.data.mlde_bucket_name}') && echo $MLDE_BUCKET_NAME
export MLDE_HOST=$(kubectl get cm pdk-config -o=jsonpath='{.data.mlde_host}') && echo $MLDE_HOST
export MLDE_PORT=$(kubectl get cm pdk-config -o=jsonpath='{.data.mlde_port}') && echo $MLDE_PORT
export MLDE_URL=$(kubectl get cm pdk-config -o=jsonpath='{.data.mlde_url}') && echo $MLDE_URL
export MODEL_ASSETS_BUCKET_NAME=$(kubectl get cm pdk-config -o=jsonpath='{.data.model_assets_bucket_name}') && echo $MODEL_ASSETS_BUCKET_NAME
export KSERVE_MODELS_NAMESPACE=$(kubectl get cm pdk-config -o=jsonpath='{.data.kserve_model_namespace}') && echo $KSERVE_MODELS_NAMESPACE
export INGRESS_HOST=$(kubectl get cm pdk-config -o=jsonpath='{.data.kserve_ingress_host}') && echo $INGRESS_HOST
export INGRESS_PORT=$(kubectl get cm pdk-config -o=jsonpath='{.data.kserve_ingress_port}') && echo $INGRESS_PORT
export DB_CONNECTION_URL=$(kubectl get cm pdk-config -o=jsonpath='{.data.db_connection_string}') && echo $DB_CONNECTION_URL
export REGISTRY_URI=$(kubectl get cm pdk-config -o=jsonpath='{.data.registry_uri}') && echo $REGISTRY_URI
export NAME=$(kubectl get cm pdk-config -o=jsonpath='{.data.pdk_name}') && echo $NAME
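Optionally, you can verify that none of the variables came back empty. This is a minimal sketch; add or remove variable names to match your environment.

# Print a warning for every expected variable that is empty or unset
for v in AZ_REGION MLDM_BUCKET_NAME MLDM_HOST MLDM_PORT MLDM_URL MLDE_BUCKET_NAME MLDE_HOST \
         MLDE_PORT MLDE_URL MODEL_ASSETS_BUCKET_NAME KSERVE_MODELS_NAMESPACE INGRESS_HOST \
         INGRESS_PORT DB_CONNECTION_URL REGISTRY_URI NAME; do
  [ -z "${!v}" ] && echo "WARNING: $v is not set"
done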
Create the following folders in the storage bucket:
- dogs-and-cats
- dogs-and-cats/config
- dogs-and-cats/model-store
Check the Useful Commands section in the AWS and GCP deployment pages for help.
PS: This step can be skipped if you are not using storage buckets in your PDK environment.
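If you are using cloud storage, the folders can also be created with the respective CLI. The commands below are a sketch that assumes the model assets bucket from the config map; adjust the bucket name if your environment stores these assets elsewhere.

# AWS S3: zero-byte objects with a trailing slash show up as folders in the console
aws s3api put-object --bucket ${MODEL_ASSETS_BUCKET_NAME} --key dogs-and-cats/
aws s3api put-object --bucket ${MODEL_ASSETS_BUCKET_NAME} --key dogs-and-cats/config/
aws s3api put-object --bucket ${MODEL_ASSETS_BUCKET_NAME} --key dogs-and-cats/model-store/

# GCS: prefixes act as folders; uploading a small placeholder object creates them
echo -n | gsutil cp - gs://${MODEL_ASSETS_BUCKET_NAME}/dogs-and-cats/config/.keep
echo -n | gsutil cp - gs://${MODEL_ASSETS_BUCKET_NAME}/dogs-and-cats/model-store/.keep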
In the `dog-cat` folder, go to the `experiment` folder.
We'll take a look at some of the files, but the only one you should (optionally) change is the `const.yaml` file, which contains the configuration for the MLDE experiment.
In this file, we can see the MLDM parameters that will be used by MLDE. They have empty values because they will be assigned at runtime (through Kubernetes secrets mapped as environment variables). Keep in mind that the pipeline runs in one container, which has direct access to the input images, while the MLDE experiment will run in a different container (inside the GPU node), which does not. For that reason, the training code will connect to MLDM and download the input images so it can train the model.
Also, a Workspace and Project were configured for this experiment. You can change the name of both:
Don't forget to create a Workspace and a Project in MLDE with the same name as configured in the file; otherwise, the experiment will fail to run. This can be done in the Workspaces page in the UI.
The workspace and project can also be created through the command line:
export DET_MASTER=${MLDE_HOST}:${MLDE_PORT}
det u login admin
det w create "PDK Demos"
det p create "PDK Demos" pdk-dogs-and-cats
A brief description of the Experiment files:
- `data.py`: contains the logic to retrieve and structure the training images from the MLDM repository. Study the `download_pach_repo` function to understand how the client is pulling the files. The unique ID of each commit is sent through the environment variables (and can be seen in the logs).
- `model_def.py`: this is the script that controls model training. It uses the PyTorchTrial API to provide out-of-the-box capabilities like distributed training, checkpointing, hyperparameter search, etc., without the need for additional coding.
- `startup_hook.sh`: this file will be executed for every experiment, before the Python script. It's a good place to run any routines required to prepare the container for the execution of the Python code.
The experiment files don't need to be modified, except for the Workspace and Project name in the `const.yaml` file. Do keep in mind that, at runtime, the pipeline will pull this code from GitHub, so any changes to any of the files need to be uploaded to your repository.
In this step, we'll set up the Train and Deploy images. There's no need to change any of the code, though we will review some key parts of it.
In the `examples/training_container` folder, you will find the files for the Train image. If you wish to test this flow as-is, there will be no need to rebuild or push new images to the repository. However, assuming that you want to make changes to it (or adapt this code to a different type of model), we'll review the necessary steps.
Taking a closer look at the `train.py` file, we can see that a number of input arguments are being parsed:
...
def parse_args():
parser = argparse.ArgumentParser(
description="Determined AI Experiment Runner"
)
parser.add_argument(
"--config",
type=str,
help="Determined's experiment configuration file",
)
parser.add_argument(
"--git-url",
type=str,
help="Git URL of the repository containing the model code",
)
parser.add_argument(
"--git-ref",
type=str,
help="Git Commit/Tag/Branch to use",
)
...
These arguments are configured in the pipeline definition. Depending on how your PDK environment is set up, you may need to configure additional attributes.
Then, in a different function, the MLDM information is mapped to variables that will be sent to MLDE:
def setup_config(config_file, repo, pipeline, job_id, project):
config = read_config(config_file)
config["data"]["pachyderm"]["host"] = os.getenv("PACHD_LB_SERVICE_HOST")
config["data"]["pachyderm"]["port"] = os.getenv("PACHD_LB_SERVICE_PORT")
config["data"]["pachyderm"]["repo"] = repo
config["data"]["pachyderm"]["branch"] = job_id
config["data"]["pachyderm"]["token"] = os.getenv("PAC_TOKEN")
config["data"]["pachyderm"]["project"] = project
config["labels"] = [repo, job_id, pipeline]
return config
The environment variables will be mapped from Kubernetes secrets. We will see this mapping in the pipeline definition file.
You should, of course, study the entire code. The goal here was to show how data in Kubernetes secrets can be mapped as environment variables and used inside MLDM pipelines, which will then send those values over to MLDE for model training.
If you are not planning on building your own images, you can skip this section. The pipelines are configured by default with public images you can use for testing.
Before continuing, make sure Docker Desktop is running.
The first step will be to build and push the Train image. There's no need to make changes to any files.
PS: If you're running this on macOS, additional settings are needed to build the image for Linux (otherwise it will fail to run). They are included below.
Go to the `/examples/training_container` folder and run the commands below to build, tag, and push the Train image. Don't forget to rename the images.
export DOCKER_DEFAULT_PLATFORM=linux/amd64
docker buildx build --pull --no-cache --platform linux/amd64 -t ${REGISTRY_URI}/<your_name>_cats_dogs_train:1.0 .
# IF YOU ARE USING ECR, YOU MUST CREATE THE REPOSITORY FIRST
## Execute these commands only for AWS ECR
export REGISTRY_URL=<the value of REGISTRY_URI without the repository name>
aws ecr get-login-password --region ${AZ_REGION} | docker login --username AWS --password-stdin ${REGISTRY_URL}
aws ecr create-repository --repository-name=${NAME}/<your_name>_cats_dogs_train --region ${AZ_REGION}
##
docker push ${REGISTRY_URI}/<your_name>_cats_dogs_train:1.0
The build process can take several minutes. PS: If you do need to rebuild this image for whatever reason, make sure to change the version number (and update the pipeline JSON file with the new version number). This will force the container to pull the new version of the image, instead of using the cached one.
Check your registry to make sure the image was pushed successfully. Review the command output for EOF or other error messages and retry as needed.
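For example, if you are using ECR, you can confirm the pushed tag with the AWS CLI. This is a sketch; the repository name must match the one you created above.

# List the image tags stored in ECR for the training repository
aws ecr describe-images \
  --repository-name ${NAME}/<your_name>_cats_dogs_train \
  --region ${AZ_REGION} \
  --query 'imageDetails[].imageTags'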
Go to the `examples/dog-cat/container/deploy` folder. The code for deploy is more complicated, since it involves KServe as well. Study the code to understand how the process is being handled (the `common.py` file contains utility functions).
Run these commands to build, tag and push the Deploy image:
cd ../deploy
docker buildx build --pull --no-cache --platform linux/amd64 -t ${REGISTRY_URI}/<your_name>_cats_dogs_deploy:1.0 .
# IF YOU ARE USING ECR, YOU MUST CREATE THE REPOSITORY FIRST
## Execute these commands only for AWS ECR
aws ecr get-login-password --region ${AZ_REGION} | docker login --username AWS --password-stdin ${REGISTRY_URL}
aws ecr create-repository --repository-name=${NAME}/<your_name>_cats_dogs_deploy --region ${AZ_REGION}
##
docker push ${REGISTRY_URI}/<your_name>_cats_dogs_deploy:1.0
This can take a long time, because of the dependencies needed to build the image.
If you made any changes to any of the files, make sure to push them to your GitHub repo.
PS: if you're using a Mac, delete the .DS_Store files before committing (or add them to `.gitignore`).
find . -name '.DS_Store' -type f -delete
git add .
git status
git commit -m 'changed experiment files'
git remote add origin https://github.com/YOUR_GIT_USERNAME/pdk.git
git push -u origin main
First, create the project and repo in MLDM:
pachctl connect ${MLDM_URL}
pachctl config set active-context ${MLDM_URL}
pachctl create project pdk-dogs-and-cats
pachctl config update context --project pdk-dogs-and-cats
pachctl create repo dogs-and-cats-data
pachctl list repo
Next, go to the `pipelines` folder.
In this folder, there are 2 sets of pipeline definition files, one for on-prem (shared folders) and another for environments that use cloud buckets. The differences between them are:
- On-prem environments must mount the shared folder into the containers where the pipelines will run, so the code has access to the files there. This can be configured through the `pod_patch` parameter, which is applied as a JSON Patch. Within this parameter, set the path to your shared folders (in our deployment example, we use the `/mnt/efs/shared_fs` path). More information about the `pod_patch` configuration can be found in the Documentation page.
- In on-prem environments, the pipeline containers must run as root to avoid permission errors in the shared folder.
- In on-prem environments, a service account parameter must be set to allow the deployment code to access the MLDM repository through the S3 interface. For environments that use cloud storage, these permissions are granted through service account permission mapping.
If you have a cloud-based environment, use the `training-pipeline.json` and `deployment-pipeline.json` files.
If you have an on-prem environment with shared folders, use the `_onprem_training-pipeline.json` and `_onprem_deployment-pipeline.json` files.
In the Training pipeline file, change the command line to point to your GitHub repo (if you want to run your own code), and the image name to match the image you just pushed. You can leave the default values if you did not create an image or make any changes to the experiment code.
"stdin": [
"python train.py --git-url https://[email protected]:/determined-ai/pdk.git --git-ref main --sub-dir examples/dog-cat/experiment --config const.yaml --repo dogs-and-cats-data --model dogs-and-cats --project pdk-dogs-and-cats"
],
"image": "pachyderm/pdk:train-v0.0.3",
Now we're ready to create the pipelines.
Go back to the `pipelines` folder and create the pipeline (make sure to use the right file for your environment):
pachctl create pipeline -f training-pipeline.json
pachctl list pipelines
The MLDM UI will show the new Project, the repository and the pipeline:
Each new pipeline will create a pod in the `${MLDM_NAMESPACE}` namespace. With the cluster defaults in place, the pod will be deleted if there are no active workloads to process. Check the status of the pod before continuing. An `ImagePullBackOff` status means the cluster was unable to pull the image from your registry; other errors might indicate a lack of permissions, etc.
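A quick way to check is with kubectl. For example, list the pods in the MLDM namespace and describe any pod that is not in the Running state:

# List the pipeline pods and inspect the events of a problematic pod
kubectl -n ${MLDM_NAMESPACE} get pods
kubectl -n ${MLDM_NAMESPACE} describe pod <pipeline-pod-name>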
Next, create the deployment pipeline:
The deployment pipeline will deploy the trained model for inference (KServe). To recap, it completes the following steps:
- Gets triggered when a new checkpoint is stored in the `dogs-and-cats-model` repo.
- Pulls the checkpoint from MLDE and loads the trial/model.
- Saves the model as a ScriptModule.
- Creates a `.mar` file from the ScriptModule and the custom TorchServe handler.
- Creates the `config.properties` file for the model.
- Uploads the `.mar` file and the `config.properties` file to the storage bucket.
- Connects to the K8s cluster and creates the InferenceService. If the pipeline runs for the first time, it will create a brand new InferenceService. If an older version of the InferenceService already exists, it will do a rolling update of the InferenceService using the updated model.
- Waits for the InferenceService to be available and provides the URL.
As before, select the correct JSON file based on your environment and update the image name and the arguments.
If you have a cloud environment, make sure to set the following parameters in the command line:
- `cloud-model-host`: aws or gcp
- `cloud-model-bucket`: the bucket used to store models for KServe (${MODEL_ASSETS_BUCKET_NAME})
For on-prem, these attributes are not necessary, and the service account configured in the file is correct.
Also, replace the path to your image, or use the default value.
"stdin": [
"python deploy.py --deployment-name dog-cat --cloud-model-host gcp --cloud-model-bucket pdk-repo-models --resource-requests cpu=2,memory=8Gi --resource-limits cpu=10,memory=8Gi"
],
"image": "pachyderm/pdk:dog-cat-deploy-v0.0.3",
Create the deploy pipeline:
pachctl create pipeline -f deployment-pipeline.json
pachctl list pipelines
The MLDM UI should now display both pipelines, connected (since the output of the Train pipeline is the input for the Deploy pipeline).
It will take a few minutes for the pipeline to pull the image from the registry. The status will change to 'Success' in the UI once the pipeline is up and running.
Our environment should now be ready to receive and process data.
As mentioned before, the pipelines automatically run when new data is committed to the input repository `dogs-and-cats-data`.
Some sample images of dogs and cats can be found in the `sample-data` folder. Unzip the `dataset-dog-cat.zip` file to obtain a sample dataset that can be used to train the model.
With the command below, you can commit all images in the `dog-cat` directory on your machine to the folder `data1` in the MLDM repository `dogs-and-cats-data`.

The folder `data1` will be created as part of the commit process; make sure to increment the number if you need to re-upload this folder (otherwise it won't be considered as new data by MLDM).
IMPORTANT: While the folder `data1` can have any name, do not use the words "dog(s)" or "cat(s)", as it will impact the labeling of the images in the data pre-processing code.
PS: If you're using macOS and browsed through the images, delete all `.DS_Store` files before uploading, as they can break the pipeline (the code doesn't handle that exception).
find ./dog-cat/ -name '.DS_Store' -type f -delete
pachctl put file dogs-and-cats-data@master:/data1 -f ./dog-cat -r
Once the uploads are complete, MLDM will start the training pipeline. At this stage, check the MLDE UI to see the experiment run. Once it completes, also check the MLDE Model Registry to see a new model registered. The Model Version name will be equal to the MLDM Commit ID.
The new experiment will appear in the project inside your Workspace:
The experiment might take a minute to start, as it's preparing the environment. If there are no GPUs available, a new node will be provisioned automatically.
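You can also follow the run from the command line with the det CLI configured earlier. This is a minimal sketch; the new experiment and model version should appear once the Train pipeline has submitted them.

# List recent experiments and the models in the MLDE Model Registry
det experiment list | tail -n 5
det model list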
Once the training is complete, the deployment pipeline will be executed. You can look at the logs of the pipeline execution by clicking on `Pipeline`, then clicking on `Subjob - Running`. You should see a message in the logs about the model being deployed to KServe.
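The same logs can be followed from the command line with pachctl; replace the placeholder with the pipeline name defined in your deployment-pipeline.json file.

# Stream the logs of the deployment pipeline (pipeline name is a placeholder)
pachctl logs --pipeline=<deploy-pipeline-name>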
Once the pipeline execution completes, you should have a new InferenceService called `dog-cat` in the `models` namespace. You can validate that with this command:
kubectl -n ${KSERVE_MODELS_NAMESPACE} get inferenceservices
This is the expected output of this command:
kubectl -n ${KSERVE_MODELS_NAMESPACE} get inferenceservices
NAME URL READY PREV LATEST PREVROLLEDOUTREVISION LATESTREADYREVISION AGE
dog-cat http://dog-cat.models.example.com True 100 dog-cat-predictor-00001 2m5s
sklearn-iris http://sklearn-iris.models.example.com True 100 sklearn-iris-predictor-00001 120m
It might take a minute for the inference service to go from `Unknown` to `True`.
With everything ready to go, it is time to make a prediction with the `dog-cat` InferenceService.
KServe expects data to be submitted in JSON format. For this simple test, you can find cat.json and dog.json in the sample-data directory.
If you want to convert your own images to JSON, you can use the img2bytearray.py Python script in the internal GitHub repo.
Once the JSON files are ready, we can make a call to the inference service.
To make a prediction, you can use the curl command below. First, let's submit the `cat.json` file. Replace the IP with your `istio-ingressgateway` external IP address and execute the command.
curl -v \
-H "Content-Type: application/json" \
-H "Host: dog-cat.models.example.com" \
http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/dogs-and-cats:predict \
-d @./cat.json
Then, make a prediction for `dog.json` by replacing the IP with your `istio-ingressgateway` external IP address and executing the command.
curl -v \
-H "Content-Type: application/json" \
-H "Host: dog-cat.models.example.com" \
http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/dogs-and-cats:predict \
-d @./dog.json
If all goes well, you should get the predictions returned for both the `cat.json` and `dog.json` examples with HTTP status 200 (OK).
For `cat.json` the response should be a class `1` prediction, and for `dog.json` it should be a class `0` prediction.
If this works, you have successfully deployed the Pachyderm-Determined-KServe (PDK) environment.
Pending work that needs to be done:
- None
Known issues:
- For the dogs-and-cats use case, if the committed folder has dogs or cats in the name, the images will be incorrectly labeled for training.