**`bring-your-own-model/readme.md`**
# PDK - Pachyderm | Determined | KServe
## Bringing Your Model to PDK
**Date/Revision:** January 02, 2024
In this section, we will train and deploy a simple customer churn model on PDK.
* Additionally, if the original experiment had a training length specified in number of epochs, it may be convenient to **define training length in number of batches instead** (the same applies to **min_validation_period**).
* Indeed, the number of samples in the training set will vary as new data gets committed to the MLDM repository, and that number must be known in order to define training length in number of epochs.
* Note that the training pipeline image could be modified to deal with that issue, but specifying the training length in batches is a simpler solution.
* Depending on the organization of the MLDE cluster where these automatically triggered experiments are expected to run, it may also be a good idea to **edit the workspace and project fields accordingly**.
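The points above can be sketched as an excerpt of the MLDE experiment configuration. This is a hypothetical sketch: the field names follow the MLDE experiment config, but the numbers, workspace, and project values are placeholders to adapt to your own setup.

```yaml
# Hypothetical experiment config excerpt; values are placeholders.
min_validation_period:
  batches: 100          # validate every 100 batches instead of every epoch
searcher:
  name: single
  metric: validation_loss
  max_length:
    batches: 1000       # training length in batches, not epochs
workspace: PDK Demos
project: pdk-customer-churn
```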
### Step 1-2: Add code to download data from MLDM
* In **startup-hook.sh**, install `pachyderm-sdk`.
* In **data.py**, add the imports (`os`, `shutil`, `python-pachyderm`) required to define the two new functions: `safe_open_wb` and `download_pach_repo`. The latter is used to download data from the MLDM repository.
* **Note:** In this example, `download_pach_repo` will only download the files corresponding to the difference between the current and last commit on the MLDM repository. It won't redownload and retrain on the initial *data_part1* if *data_part2* has been committed afterwards. You can change that behaviour by editing the `download_pach_repo` function.
* In **model_def.py**:
  * Add `os`, `logging` and `download_pach_repo` as imports.
  * In `__init__`, check whether the model is expected to be trained or not (training requires downloading data from the MLDM repository and building the training and validation sets).
  * Add the `download_data` function, which calls `download_pach_repo` to download files from the MLDM repository and returns the list of those files.
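As a sketch of the helper shape described above (the exact PDK implementation may differ), `safe_open_wb` opens a file for binary writing, creating any missing parent directories first:

```python
import os

def safe_open_wb(path):
    """Open `path` for binary writing, creating parent directories first."""
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    return open(path, "wb")

# Hypothetical usage: write a downloaded file into a nested data directory.
with safe_open_wb("data/train/part1.csv") as f:
    f.write(b"customer_id,churned\n1,0\n")
```

This avoids the `FileNotFoundError` that a plain `open(path, "wb")` raises when the target directory does not yet exist, which is the common case when mirroring an MLDM repository's file tree locally.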
### Step 1-3: Make sure the code handles the output of the `download_data` function
The original code may not handle a list of files, as output by the `download_data` function. In this example, the base experiment expected a single CSV data file, while the PDK experiment can receive a list of files. Depending on your original code, and on how you expect your data to be committed to MLDM, this may or may not require changes.
In this example, the `get_train_and_validation_datasets` function from **data.py** has been changed to concatenate a list of CSV files into a single pandas DataFrame.
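A minimal sketch of that concatenation step (the function and column names here are illustrative, not the exact PDK code):

```python
import io
import pandas as pd

def concat_csv_files(files):
    """Concatenate a list of CSV files (paths or file-like objects) into one DataFrame."""
    frames = [pd.read_csv(f) for f in files]
    return pd.concat(frames, ignore_index=True)

# Hypothetical usage with two small in-memory CSV "parts":
part1 = io.StringIO("customer_id,churned\n1,0\n2,1\n")
part2 = io.StringIO("customer_id,churned\n3,0\n")
df = concat_csv_files([part1, part2])
print(len(df))  # → 3
```

`ignore_index=True` rebuilds the row index so rows from later parts don't repeat the index values of earlier ones.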
## Step 2: Preparing MLDM and MLDE
```bash
det p create "PDK Demos" pdk-customer-churn
```
### Step 2-3: Create the storage bucket folders

Create the following folder structure in the storage bucket (this step can be skipped for vanilla Kubernetes deployments):

```bash
customer-churn
customer-churn/config
customer-churn/model-store
```
## Step 3: Create the training pipeline
* Name this MLDM pipeline by changing `pipeline.name`.
* Make sure the input repo matches the MLDM repository where data is expected to be committed.
* Under `transform`:
  * Define the image to be used. The current image configured in the pipeline should work well as it is.
  * The `stdin` command will be run when the pipeline is triggered. Make sure to change all the relevant options, in particular:
    * `--git-url`, to point to the Git URL containing the model code, since you will probably want to change details in the experiment files.
    * `--sub-dir`, if the file structure of your Git repository differs from this one.
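Putting the fields above together, a minimal MLDM pipeline definition might look like the following sketch. The pipeline name, repo, image, and URL shown here are placeholders, not the actual PDK defaults:

```json
{
  "pipeline": {
    "name": "customer-churn-train"
  },
  "input": {
    "pfs": {
      "repo": "customer-churn-data",
      "branch": "master",
      "glob": "/"
    }
  },
  "transform": {
    "image": "example/pdk-train:latest",
    "stdin": [
      "python run.py --git-url https://github.com/your-org/your-repo.git --sub-dir experiment"
    ]
  }
}
```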
---

**`deploy/README.md`**
Don't forget to create a Workspace and a Project in MLDE with the same name as configured in the file; otherwise, the experiment will fail to run. This can be done in the Workspaces page in the UI.
![alt text][github_03_workspaces]
The experiment files don't need to be modified, except for the Workspace and Project name in the `const.yaml` file. Do keep in mind that, at runtime, the pipeline will pull this code from GitHub; any changes to any of the files need to be uploaded to your repository.
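For instance, the relevant lines in `const.yaml` might look like this sketch, using the Workspace and Project names from earlier in this guide (replace them with your own):

```yaml
# Excerpt sketch; set these to the Workspace/Project you created in MLDE.
workspace: PDK Demos
project: pdk-customer-churn
```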
### MLDM Images
```bash
# The prompt will freeze as it loads the pod. Wait for the message "If you don't see a command prompt, try pressing enter".
# Then, type the password and press enter.

postgres=> CREATE DATABASE pachyderm;
```
As of MLDM version 2.8.1, a single Helm chart can be used to deploy both MLDM and MLDE.
Because we're using AWS buckets, there are two service accounts that will need access to S3: the main MLDM service account and the `worker` MLDM service account, which runs the pipeline code.
The EKS installation command created the necessary roles with the right permissions; all we need to do is configure the service accounts to leverage those roles. Run these commands to set the proper ARNs for the roles:
**Note:** If you used the default size for the CPU nodes, the new pipelines may fail at first due to a lack of available CPUs. In this case, the autoscaler should automatically add a new node to the CPU node group. Once the new CPUs are available, the pipeline should start automatically.
At this time, you should see the OpenCV project and pipeline in the MLDM UI:
This secret will be used by the pipelines to map the variables into the MLDE experiments:
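As a generic sketch only (the actual key names and values are defined elsewhere in this guide and are not shown in this excerpt), such a secret takes the standard Kubernetes `Secret` form:

```yaml
# Hypothetical sketch; key names and values below are placeholders.
apiVersion: v1
kind: Secret
metadata:
  name: pipeline-secret
type: Opaque
stringData:
  det_master: "determined.example.com:8080"
  det_user: "admin"
```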