docs: Update docs to make isvc focal
Signed-off-by: Paul Van Eck <[email protected]>
pvaneck committed Jul 18, 2022
1 parent 4739d38 commit e623d96
Showing 36 changed files with 735 additions and 406 deletions.
27 changes: 27 additions & 0 deletions config/example-isvcs/example-keras-mnist-isvc.yaml
@@ -0,0 +1,27 @@
# Copyright 2022 IBM Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: example-keras-mnist
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: keras
      storage:
        key: localMinIO
        path: keras/mnist.h5
27 changes: 27 additions & 0 deletions config/example-isvcs/example-lightgbm-mushroom-isvc.yaml
@@ -0,0 +1,27 @@
# Copyright 2022 IBM Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: example-lightgbm-mushroom
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: lightgbm
      storage:
        key: localMinIO
        path: lightgbm/mushroom.bst
27 changes: 27 additions & 0 deletions config/example-isvcs/example-mlserver-sklearn-mnist-isvc.yaml
@@ -0,0 +1,27 @@
# Copyright 2022 IBM Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: example-sklearn-mnist-svm
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storage:
        key: localMinIO
        path: sklearn/mnist-svm.joblib
27 changes: 27 additions & 0 deletions config/example-isvcs/example-onnx-mnist-isvc.yaml
@@ -0,0 +1,27 @@
# Copyright 2022 IBM Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: example-onnx-mnist
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: onnx
      storage:
        key: localMinIO
        path: onnx/mnist.onnx
27 changes: 27 additions & 0 deletions config/example-isvcs/example-pytorch-cifar-isvc.yaml
@@ -0,0 +1,27 @@
# Copyright 2022 IBM Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: example-pytorch-cifar
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: pytorch
      storage:
        key: localMinIO
        path: pytorch/cifar
27 changes: 27 additions & 0 deletions config/example-isvcs/example-tensorflow-mnist-isvc.yaml
@@ -0,0 +1,27 @@
# Copyright 2022 IBM Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: example-tensorflow-mnist
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: tensorflow
      storage:
        key: localMinIO
        path: tensorflow/mnist.savedmodel
27 changes: 27 additions & 0 deletions config/example-isvcs/example-xgboost-mushroom-isvc.yaml
@@ -0,0 +1,27 @@
# Copyright 2022 IBM Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: example-xgboost-mushroom
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: xgboost
      storage:
        key: localMinIO
        path: xgboost/mushroom.json
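All of the example InferenceServices above reference a `storage.key` of `localMinIO`. That key is expected to name an entry in ModelMesh Serving's storage credentials Secret (conventionally called `storage-config`); a minimal sketch of what such an entry might look like is shown below, where the Secret name, JSON field names, endpoint, bucket, and credential values are illustrative assumptions rather than content of this commit.

```
# Sketch of a storage-config Secret entry that the `localMinIO` storage key
# could resolve to. All names and values here are illustrative assumptions;
# adjust them to the actual storage instance.
apiVersion: v1
kind: Secret
metadata:
  name: storage-config
type: Opaque
stringData:
  localMinIO: |
    {
      "type": "s3",
      "access_key_id": "<minio-access-key>",
      "secret_access_key": "<minio-secret-key>",
      "endpoint_url": "http://minio:9000",
      "default_bucket": "modelmesh-example-models",
      "region": "us-east-1"
    }
```

With such an entry in place, applying one of the example manifests (for example `kubectl apply -f config/example-isvcs/example-xgboost-mushroom-isvc.yaml`) would let the controller pull the model from the referenced bucket path.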
6 changes: 3 additions & 3 deletions docs/README.md
@@ -2,7 +2,7 @@

ModelMesh Serving is a Kubernetes-based platform for realtime serving of ML/DL models, optimized for high volume/density use cases. Utilization of available system resources is maximized via intelligent management of in-memory model data across clusters of deployed Pods, based on usage of those models over time.

Leveraging existing third-party model servers, a number of standard ML/DL [model formats](model-types/) are supported out-of-the box with more to follow: TensorFlow, PyTorch ScriptModule, ONNX, scikit-learn, XGBoost, LightGBM, OpenVINO IR. It's also possible to extend with custom runtimes to support arbitrary model formats.
Leveraging existing third-party model servers, a number of standard ML/DL [model formats](model-formats/) are supported out of the box, with more to follow: TensorFlow, PyTorch ScriptModule, ONNX, scikit-learn, XGBoost, LightGBM, OpenVINO IR. It's also possible to extend with custom runtimes to support arbitrary model formats.

The architecture comprises a controller Pod that orchestrates one or more Kubernetes "model runtime" Deployments which load/serve the models, and a Service that accepts inferencing requests. A routing layer spanning the runtime pods ensures that models are loaded in the right places at the right times and handles forwarding of those requests.

@@ -11,9 +11,9 @@ The model data itself is pulled from one or more external [storage instances](pr
ModelMesh Serving makes use of two core Kubernetes Custom Resource types:

- `ServingRuntime` - Templates for Pods that can serve one or more particular model formats. There are three "built-in" runtimes that cover the out-of-the-box model types (Triton, MLServer, and OpenVINO Model Server (OVMS)); [custom runtimes](runtimes/) can be defined by creating additional ones.
- [`Predictor`](predictors/) - This represents a logical endpoint for serving predictions using a particular model. The Predictor spec specifies the model type, the storage in which it resides and the path to the model within that storage. The corresponding endpoint is "stable" and will seamlessly transition between different model versions or types when the spec is updated.
- [`InferenceService`](predictors/) - This is the main interface KServe uses for managing models on Kubernetes. ModelMesh Serving can be used for deploying `InferenceService` predictors, which represent a logical endpoint for serving predictions using a particular model. The `InferenceService` predictor spec specifies the model format, the storage location in which the model resides, and other optional configuration. The corresponding endpoint is "stable" and will seamlessly transition between different model versions or types when the spec is updated. Note that many features like transformers, explainers, and canary rollouts do not currently apply to, or do not fully work with, InferenceServices that have `deploymentMode` set to `ModelMesh`. In addition, `PodSpec` fields that are set in the `InferenceService` predictor spec will be ignored.

The Pods that correspond to a particular `ServingRuntime` are started only if/when there are one or more defined `Predictor`s that require them.
The Pods that correspond to a particular `ServingRuntime` are started only if/when there are one or more defined `InferenceService`s that require them.

We have standardized on the [KServe v2 data plane API](inference/ks-v2-grpc.md) for inferencing; it is supported for all of the built-in model types. Both the gRPC and REST versions of this API are now supported. REST is not supported for custom runtimes, but they are free to use any gRPC Service APIs for inferencing, including the KSv2 API.

8 changes: 4 additions & 4 deletions docs/architecture/README.md
@@ -1,6 +1,6 @@
# Architecture Overview

The central component of ModelMesh Serving is a Kubernetes controller responsible for reconciling the `ServingRuntime` and `Predictor` custom resource types.
The central component of ModelMesh Serving is a Kubernetes controller responsible for reconciling the `ServingRuntime` and `InferenceService` custom resource types.

### Serving Runtime Deployments

@@ -16,11 +16,11 @@ A single Kubernetes `Service` points to all Pods across all Deployments. Externa

![ModelMeshServing high-level architecture](../images/0.2.0-highlevel.png)

### Predictors
### InferenceService

For each defined `Predictor`, the controller registers a "VModel" (virtual model) in model-mesh of the same name, as well as a concrete model whose name incorporates a hash of the Predictor's current Spec. The VModel represents a stable endpoint which will resolve to the most recent successfully loaded concrete model. Logical CRUD operations on these model-mesh entities are performed via its gRPC-based model-management interface during `Predictor` reconciliation.
For each defined `InferenceService` predictor, the controller registers a "VModel" (virtual model) in model-mesh of the same name, as well as a concrete model whose name incorporates a hash of the InferenceService's current predictor Spec. The VModel represents a stable endpoint which will resolve to the most recent successfully loaded concrete model. Logical CRUD operations on these model-mesh entities are performed via its gRPC-based model-management interface during `InferenceService` reconciliation.

A central etcd is used "internally" by the model-mesh cluster to keep track of active Pods, registered models/vmodels, and which models currently reside on which runtimes. Model-mesh does not currently provide a way of listening events when the state of its managed models/vmodels change, and so in a small violation of encapsulation the controller also watches model-mesh's internal datastructures in etcd directly, but in a read-only manner and just for the purpose of reacting to state change events. The status information subsequently returned by model-mesh's model management gRPC API requests is used by the controller to update `Predictor`s' Statuses.
A central etcd is used "internally" by the model-mesh cluster to keep track of active Pods, registered models/vmodels, and which models currently reside on which runtimes. Model-mesh does not currently provide a way of listening for events when the state of its managed models/vmodels changes, so in a small violation of encapsulation the controller also watches model-mesh's internal data structures in etcd directly, but in a read-only manner and only for the purpose of reacting to state change events. The status information subsequently returned by model-mesh's model management gRPC API requests is used by the controller to update `InferenceService`s' Statuses.

### Model Server Integration Options

16 changes: 9 additions & 7 deletions docs/architecture/isolation.md
@@ -1,6 +1,6 @@
# Isolation

ModelMesh Serving exposes two main concepts through the Kubernetes resource API: Serving Runtimes which provide technology specific model serving capabilities and Predictors which represent the deployment of an individual model.
ModelMesh Serving exposes two main concepts through the Kubernetes resource API: Serving Runtimes, which provide technology-specific model serving capabilities, and InferenceServices, which represent the deployment of an individual model.

This guide explains how the associated resources (such as the pods) are created and the isolation concerns which should be considered.

@@ -25,16 +25,18 @@ Once these resources are created and the controller processes them, a pod will b

Although the runtime pod provides some defense, serving runtimes should only be deployed when the associated container images are trusted. In addition, by deploying a Network Policy, the interactions with a given runtime can be more closely controlled (a sketch is shown below).

### Predictors
### InferenceServices

When a Predictor is deployed, it is assigned to an available runtime by evaluating the modelType found in the spec and cross referencing that against the available runtimes. For example, this model is an sklearn model:
When an InferenceService is deployed, it is assigned to an available runtime by evaluating the model format found in the spec and cross-referencing it against the available runtimes. For example, this model is an sklearn model:

```
 spec:
-  modelType:
-    name: sklearn
+  predictor:
+    model:
+      modelFormat:
+        name: sklearn
```

This Predictor would likely be matched against the serving runtime referenced previously. Once assigned to the runtime, the model is subject to loading on demand. A load request would cause the model data to be extracted to the runtime pods local disk, and the server process would be notified by the associated adapter process to load the model data.
This InferenceService would likely be matched against the serving runtime referenced previously. Once assigned to the runtime, the model is subject to loading on demand. A load request would cause the model data to be extracted to the runtime pod's local disk, and the server process would be notified by the associated adapter process to load the model data.

Since the same container and pod are processing all of the predictors with the same model type, there is no pod isolation between predictors of a given model type.
Since the same container and pod are processing all of the InferenceService predictors with the same model format, there is no pod isolation between InferenceServices of a given model format.
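As noted above, a Network Policy can limit which clients may reach a given runtime's Pods, which partially compensates for the lack of pod isolation between InferenceServices of the same model format. A minimal sketch follows; the pod labels, client selector, and port are assumptions chosen for illustration and should be matched to the actual runtime Deployments.

```
# Sketch of a NetworkPolicy restricting ingress to ModelMesh runtime Pods.
# The label selectors and port below are illustrative assumptions.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-modelmesh-runtime-ingress
spec:
  podSelector:
    matchLabels:
      modelmesh-service: modelmesh-serving   # assumed label on runtime Pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/part-of: modelmesh-serving   # assumed client label
      ports:
        - protocol: TCP
          port: 8033   # assumed model-mesh gRPC port
```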
4 changes: 2 additions & 2 deletions docs/configuration/README.md
@@ -45,8 +45,8 @@ The following parameters are currently supported. _Note_ the keys are expressed
| `metrics.port` | Port on which to serve metrics via the `/metrics` endpoint | `2112` |
| `metrics.scheme` | Scheme to use for the `/metrics` endpoint (`http` or `https`) | `https` |
| `metrics.disablePrometheusOperatorSupport` | Disable the support of Prometheus operator for metrics only if `metrics.enabled` is true | `false` |
| `scaleToZero.enabled` | Whether to scale down Serving Runtimes that have no Predictors | `true` |
| `scaleToZero.gracePeriodSeconds` | The number of seconds to wait after Predictors are deleted before scaling to zero | `60` |
| `scaleToZero.enabled` | Whether to scale down Serving Runtimes that have no InferenceServices | `true` |
| `scaleToZero.gracePeriodSeconds` | The number of seconds to wait after InferenceServices are deleted before scaling to zero | `60` |
| `grpcMaxMessageSizeBytes` | The max number of bytes for the gRPC request payloads (\*\*\*\* see below) | `16777216` (16MiB) |
| `restProxy.enabled` | Enables deployment of the provided REST proxy container in each ServingRuntime deployment | `true` |
| `restProxy.port` | Port on which the REST proxy serves REST requests | `8008` |
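The parameters listed above, including the `scaleToZero.*` keys, are supplied through ModelMesh Serving's user configuration ConfigMap. The sketch below assumes the conventional `model-serving-config` name and a nested `config.yaml` layout that mirrors the dot-notation keys; the values shown are the documented defaults.

```
# Sketch of a user configuration ConfigMap for ModelMesh Serving.
# The ConfigMap name and nesting are assumed to follow the conventional
# model-serving-config / config.yaml layout; include only keys you override.
apiVersion: v1
kind: ConfigMap
metadata:
  name: model-serving-config
data:
  config.yaml: |
    metrics:
      enabled: true
      port: 2112
    scaleToZero:
      enabled: true
      gracePeriodSeconds: 60
    restProxy:
      enabled: true
      port: 8008
    grpcMaxMessageSizeBytes: 16777216
```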
4 changes: 2 additions & 2 deletions docs/configuration/built-in-runtimes.md
@@ -6,7 +6,7 @@ ModelMesh Serving currently supports three built-in `ServingRuntime`s:
- MLServer - lightgbm, sklearn, xgboost
- OpenVINO Model Server OVMS - OpenVINO's Intermediate Representation (IR) format, and onnx models.

When a `Predictor`s using one of these model types is deployed, Pods corresponding to the supporting `ServingRuntime` will be started if they aren't already running. Most of the built-in `ServingRuntime` fields should not be modified, but some can be changed to customize the details of the corresponding Pods:
When an `InferenceService` using one of these model types is deployed, Pods corresponding to the supporting `ServingRuntime` will be started if they aren't already running. Most of the built-in `ServingRuntime` fields should not be modified, but some can be changed to customize the details of the corresponding Pods:

- `containers[ ].resources` - resource allocation of the model server container.
- Note that adjusting the resource allocations along with the replicas will affect the model serving capacity and performance. If there is insufficient memory to hold all models, the least recently used ones will not remain loaded, which will impact latency if/when they are subsequently used.
@@ -39,7 +39,7 @@ When a `Predictor`s using one of these model types is deployed, Pods correspondi
memory: 1Gi
```
- `replicas` - if not set, the value defaults to the global config parameter `podsPerRuntime`, which has a value of 2.
- Remember that if [`ScaleToZero`](../production-use/scaling.md#scale-to-zero) is enabled which it is by default, runtimes will have 0 replicas until a Predictor is created that uses that runtime. Once a Predictor is assigned, the runtime pods will scale up to this number.
- Remember that if [`ScaleToZero`](../production-use/scaling.md#scale-to-zero) is enabled (which it is by default), runtimes will have 0 replicas until an InferenceService is created that uses that runtime. Once an InferenceService is assigned, the runtime pods will scale up to this number.
- `containers[ ].imagePullPolicy` - defaults to `IfNotPresent`
- `nodeSelector`
- `affinity`
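The customizable fields called out above (`containers[ ].resources`, `replicas`, `containers[ ].imagePullPolicy`, `nodeSelector`, `affinity`) are set directly on the built-in `ServingRuntime` resource. A minimal sketch of such a customization follows; only the overridden fields are shown, and the runtime name, container name, and resource values are illustrative assumptions, so check the installed `ServingRuntime` resources for the real names before editing.

```
# Sketch of customizing a built-in ServingRuntime (fragment showing only the
# customized fields). metadata.name and the container name are assumptions
# for illustration; the replica count and resource values are examples only.
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: mlserver-0.x
spec:
  replicas: 1            # overrides the global podsPerRuntime setting
  containers:
    - name: mlserver
      imagePullPolicy: IfNotPresent
      resources:
        requests:
          cpu: 500m
          memory: 1Gi
        limits:
          cpu: "2"
          memory: 2Gi
  nodeSelector:
    kubernetes.io/arch: amd64
```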