MLOps stands for Machine Learning Operations. MLOps is a core function of Machine Learning engineering, focused on streamlining the process of taking machine learning models to production, and then maintaining and monitoring them.
The necessity of MLOps can be summarized as follows:
- ML models rely on huge amounts of data that are difficult for a single person to keep track of.
- It is difficult to keep track of the parameters we tweak in ML models; small changes can lead to enormous differences in the results.
- We have to keep track of the features the model works with; feature engineering is a separate task that contributes significantly to model accuracy.
- Monitoring an ML model isn't like monitoring deployed software or a web app.
- Debugging an ML model is an extremely complicated art.
- Models rely on real-world data for their predictions; as real-world data changes, so should the model. This means we have to keep track of new data changes and make sure the model learns accordingly.
Data Ingestion - Collecting data using various systems, frameworks, and formats, such as internal/external databases, data marts, OLAP cubes, data warehouses, OLTP systems, Spark, HDFS, etc. This step may also include synthetic data generation or data enrichment.
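A minimal ingestion sketch, assuming a hypothetical PostgreSQL connection string and table name; in practice this step is usually handled by a framework such as Spark or an orchestration tool.

```python
# Minimal data-ingestion sketch (hypothetical connection string, table, and column names).
# Pulls a table from an OLTP database and lands it as Parquet for downstream steps.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@db-host:5432/sales")  # assumption: credentials/URL
orders = pd.read_sql("SELECT * FROM orders WHERE order_date >= '2024-01-01'", engine)

# Light enrichment before handing the data to feature engineering.
orders["order_value"] = orders["quantity"] * orders["unit_price"]

orders.to_parquet("data/raw/orders.parquet", index=False)
```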
Service Category | Service Description | Available Implementations |
---|---|---|
End-to-end Machine Learning Operations (MLOps) platforms | End-to-end MLOps platforms provide a unified ecosystem that streamlines the entire ML workflow, from data preparation and model development to deployment and monitoring. | DataRobot, Azure ML, Databricks, Domino, Amazon SageMaker, Metaflow, Vertex AI, Weights & Biases, Qwak, Kubeflow, Valohai |
Experiment tracking, model metadata storage and management | Experiment tracking and model metadata management tools let you track experiment parameters, metrics, and visualizations, ensuring reproducibility and facilitating collaboration. | |
Dataset labeling and annotation | Dataset labeling and annotation tools are a critical component of ML systems, enabling you to prepare high-quality training data for your models. They provide a streamlined workflow for annotating data, ensuring accurate and consistent labels that fuel model training and evaluation. | Labelbox, Scale AI, Snorkel Flow, AWS SageMaker Ground Truth, Kili, SuperAnnotate, Encord Annotate |
Data storage and versioning | You need data storage and versioning tools to maintain data integrity, enable collaboration, facilitate the reproducibility of experiments and analyses, and ensure accurate ML model development and deployment. Versioning lets you trace and compare different iterations of a dataset. | DVC, Dolt, Delta Lake, Pachyderm, LakeFS |
Data quality monitoring and management | You may want to continuously observe data quality, consistency, and distribution to identify anomalies or shifts that may impact model performance. | Monte Carlo, Metaplane, Talend Data Quality, Great Expectations, Databand, Soda Core |
Feature stores | Feature stores provide a centralized repository for storing, managing, and serving ML features, enabling you to find and share feature values for both model training and serving. | Feast, Databricks, Tecton, Vertex AI, Hopsworks, Featureform |
Model hubs | Model hubs provide a centralized platform for managing, sharing, and deploying ML models. They help you streamline model management, foster collaboration, and accelerate deployment. | |
Hyperparameter Optimization | Hyperparameter optimization tools automate the search for the hyperparameter configuration that yields the best model performance. | |
Model quality testing | Model quality testing tools provide features to ensure the reliability, robustness, and accuracy of ML models. | |
Workflow orchestration and pipelining tools | Workflow orchestration and pipelining tools are essential for streamlining and automating complex ML workflows. | |
Model deployment and serving | Model deployment and serving tools enable you to deploy trained models into production environments and serve predictions to end users or downstream systems. | BentoML, OctoML, NVIDIA TensorRT, NVIDIA Triton Inference Server, Seldon Core |
Model observability | Model observability tools give you insight into the behavior, performance, and health of your deployed ML models. | |
Responsible AI | You can use responsible AI tools to deploy ML models through ethical, fair, and accountable techniques. | |
Compute and infrastructure | The compute and infrastructure component provides the resources and environment needed to train, deploy, and run ML models at scale. | |
GPU Cloud Servers and Serverless GPUs | GPU cloud vendors have exploded in popularity in 2023. Their offerings fall into two classes: GPU cloud servers and serverless GPUs. | |
Vector databases and data retrieval | Vector databases are a new category of database management system designed to search across images, video, text, audio, and other forms of unstructured data via their content rather than human-generated labels or tags. | |
LLMOps and foundation model training frameworks | Frameworks and tooling for training, fine-tuning, and operating large language models and other foundation models. | |
ML Platform | Provisioned as our opinionated preference for ML workflows running on a highly scalable software infrastructure. | |
ML Frameworks | Select your machine learning and deep learning frameworks, toolkits, and libraries. | |
Storage Volume Management | Choose from software and tools for storage that meet your high-performance ML needs. | Local FS, AWS EFS, AWS EBS, Ceph (block and object), MinIO, NFS, HDFS |
Container Image Governance | Choose from software and tools that register, secure, and manage the distribution of container images. | |
Workflow Engine | Provisioned by default to govern scheduling and coordination of jobs. | |
Model Training | Include collaboration tooling and iterative model training as part of your template. | |
Model Serving | Pick the tool to expose trained models to business applications. | |
Model Validation | Set by default; models are evaluated against test data as part of your ML pipeline. | |
Data Storage Services | Choose from storage options befitting the performance of other ML services. | |
Data Preparation and Processing | Select your tooling to manage the data processing stage of your ML pipeline. | |
Infrastructure Monitoring | Choose the reporting and dashboarding tool that gives you the best visibility into your stack's performance. | |
Model Monitoring | Find and choose the appropriate tool to watch model accuracy over time. | |
Load Balancing & Ingress | Determine the appropriate tool to expose cluster services broadly to other application services. | |
Security | Find the right tooling to manage certificates, passwords, and secrets, tuned for RBAC across all hybrid-cloud environments. | |
Log Management | Make logging easier by choosing pre-integrated tools for ingest, analysis, and reporting. | |
Also check @github/awesome-production-ml and the LF AI & Data MLOps landscape chart.
Model Packaging - The process of exporting the final ML model into a specific format (e.g. PMML, PFA, or ONNX) that describes the model so it can be consumed by the business application. PyTorch, for example, exports models in its own TorchScript format as a .pt file, which can then be served from a C++ application. A short export sketch follows the table below.
Format | Open Format | Vendor | File Extension | License | ML Tools & Platforms Support | Human-readable | Compression |
---|---|---|---|---|---|---|---|
Amalgamation | − | − | − | − | − | − | ✓ |
PMML | ✓ | | .pmml | AGPL | R, Python, Spark | ✓ (XML) | ✘ |
PFA | ✓ | DMG | JSON | | PFA-enabled runtime | ✓ (JSON) | ✘ |
ONNX | ✓ | SIG, LFAI | .onnx | | TF, CNTK, Core ML, MXNet, ML.NET | − | ✓ |
TF Serving Format | ✓ | | .pf | | TensorFlow | ✘ | g-zip |
Pickle Format | ✓ | | .pkl | | scikit-learn | ✘ | g-zip |
JAR / POJO | ✓ | | .jar | | H2O | ✘ | ✓ |
HDF | ✓ | | .h5 | | Keras | ✘ | ✓ |
MLEAP | ✓ | | .jar / .zip | | Spark, TF, scikit-learn | ✘ | ✓ |
Torch Script | ✘ | | .pt | | PyTorch | ✘ | ✓ |
Apple .mlmodel | ✘ | Apple | .mlmodel | | TensorFlow, scikit-learn, Core ML | − | ✓ |
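As a concrete illustration of the packaging step, the sketch below exports a small PyTorch model both as TorchScript (.pt) and as ONNX; the toy model and file names are placeholders.

```python
# Packaging sketch: export a toy PyTorch model as TorchScript and ONNX.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
model.eval()
example_input = torch.randn(1, 4)

# TorchScript: self-contained .pt file loadable from Python or the C++ libtorch API.
scripted = torch.jit.trace(model, example_input)
scripted.save("model.pt")

# ONNX: interoperable format consumable by ONNX Runtime and other serving stacks.
torch.onnx.export(model, example_input, "model.onnx",
                  input_names=["features"], output_names=["logits"])
```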
There are two ways in which we perform ML model training:

- Offline learning (aka batch or static learning): The model is trained on a set of already collected data. After deployment to the production environment, the ML model remains constant until it is re-trained. Because the model sees a lot of real-world data over time, it becomes stale; this phenomenon is called 'model decay' and should be carefully monitored.
- Online learning (aka dynamic learning): The model is regularly re-trained as new data arrives, e.g. as data streams. This is usually the case for ML systems that use time-series data, such as sensor or stock-trading data, to accommodate temporal effects in the ML model. A minimal online-learning sketch follows the list.
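A minimal online-learning sketch using scikit-learn's `partial_fit`, with a synthetic data stream standing in for real arriving data.

```python
# Online-learning sketch: incrementally update a model as mini-batches "arrive".
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()
classes = np.array([0, 1])  # the full label set must be declared for partial_fit

rng = np.random.default_rng(0)
for step in range(100):                      # each iteration simulates a new batch from a stream
    X_batch = rng.normal(size=(32, 5))
    y_batch = (X_batch[:, 0] + X_batch[:, 1] > 0).astype(int)
    model.partial_fit(X_batch, y_batch, classes=classes)

print(model.predict(rng.normal(size=(3, 5))))
```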
End-to-end MLOps solutions: These are fully managed services that give developers and data scientists the ability to build, train, and deploy ML models quickly. The top commercial suites are listed below (open-source components such as Ray Serve can also cover the serving piece):

- Amazon SageMaker: a suite of tools to build, train, deploy, and monitor machine learning models.
- Microsoft Azure MLOps suite: Azure Machine Learning to build, train, and validate reproducible ML pipelines; Azure Pipelines to automate ML deployments; Azure Monitor to track and analyze metrics; Azure Kubernetes Service and other additional tools.
- Google Cloud MLOps suite: Dataflow to extract, validate, and transform data as well as to evaluate models; AI Platform Notebooks to develop and train models; Cloud Build to build and test machine learning pipelines; TFX to deploy ML pipelines; Kubeflow Pipelines to arrange ML deployments on top of Google Kubernetes Engine (GKE).
Five patterns to put the ML model in production: Model-as-Service, Model-as-Dependency, Precompute, Model-on-Demand, and Hybrid-Serving.
Hybrid-Serving (Federated Learning): Federated learning, also known as hybrid-serving, is another way of serving a model to users. It is unique in that there is not just one model predicting the outcome; instead there are many, one per user device, in addition to the model held on a server.
The big benefit is that the training and test data, which are highly personal, never leave the users' devices, while all of the available data is still captured. This makes it possible to train highly accurate models without having to store tons of (probably personal) data in the cloud. Tool: TensorFlow Federated (TFF). A conceptual federated-averaging sketch is shown below.
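The sketch below is a conceptual illustration of the federated-averaging idea (not the TFF API): each simulated client trains locally on its own data, and only the weights, never the raw data, are sent back and averaged on the server.

```python
# Conceptual federated-averaging sketch with NumPy; each "client" keeps its raw data local.
import numpy as np

rng = np.random.default_rng(42)
true_w = np.array([2.0, -1.0])

# Per-client private datasets (these never leave the client in a real deployment).
clients = []
for _ in range(3):
    X = rng.normal(size=(50, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=50)
    clients.append((X, y))

global_w = np.zeros(2)
for round_ in range(20):
    local_weights = []
    for X, y in clients:
        w = global_w.copy()
        for _ in range(5):                      # a few local gradient steps per round
            grad = 2 * X.T @ (X @ w - y) / len(y)
            w -= 0.05 * grad
        local_weights.append(w)
    global_w = np.mean(local_weights, axis=0)   # server averages the weights only

print(global_w)  # approaches true_w without the server ever seeing raw client data
```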
MLOps tools: Project Jupyter, nbdev, Airflow, Kubeflow, MLflow, Optuna, AutoML tools.
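Since experiment tracking comes up repeatedly above, here is a minimal MLflow tracking sketch (local file-based tracking; the experiment, parameter, and metric names are illustrative).

```python
# Minimal MLflow experiment-tracking sketch: log params, metrics, and a model artifact.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

mlflow.set_experiment("demo-experiment")
with mlflow.start_run():
    n_estimators = 100
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=0).fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))

    mlflow.log_param("n_estimators", n_estimators)
    mlflow.log_metric("accuracy", acc)
    mlflow.sklearn.log_model(model, "model")   # stores the trained model as a run artifact
```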
Continuous Integration (CI) is no longer only about testing and validating code and components, but also testing and validating data, data schemas, and models.
Continuous Deployment (CD) is no longer about a single software package or service, but a system (an ML training pipeline) that should automatically deploy another service (model prediction service) or roll back changes from a model.
Continuous Training (CT) is a new property, unique to ML systems, that is concerned with automatically retraining and serving the models.
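A sketch of what "testing data and models" can look like in CI: a couple of pytest checks that validate the data schema and gate a candidate model on a minimum accuracy. The file path, expected columns, and threshold are assumptions for illustration.

```python
# test_ml_pipeline.py -- illustrative CI checks for data and model quality (run with pytest).
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

EXPECTED_COLUMNS = {"mean radius", "mean texture"}   # assumption: a subset of the expected schema
MIN_ACCURACY = 0.90                                  # assumption: quality gate for deployment

def _load_data():
    data = load_breast_cancer(as_frame=True)
    return data.frame, data.data, data.target

def test_schema_has_expected_columns():
    frame, _, _ = _load_data()
    assert EXPECTED_COLUMNS.issubset(frame.columns)

def test_no_missing_values():
    frame, _, _ = _load_data()
    assert not frame.isnull().any().any()

def test_model_meets_accuracy_gate():
    _, X, y = _load_data()
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
    assert model.score(X_test, y_test) >= MIN_ACCURACY
```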
Does training large language models (LLMOps) differ from traditional MLOps? While many of the concepts of MLOps still apply, there are other considerations when training large language models:
Computational Resources: Training and fine-tuning large language models typically involves performing orders of magnitude more calculations on large data sets. To speed this process up, specialized hardware like GPUs are used for much faster data-parallel operations. Having access to these specialized compute resources becomes essential for both training and deploying large language models. The cost of inference can also make model compression and distillation techniques important.
Transfer Learning: Unlike many traditional ML models that are created or trained from scratch, many large language models start from a foundation model and are fine-tuned with new data to improve performance in a more specific domain. Fine-tuning allows state-of-the-art performance for specific applications using less data and fewer compute resources.
Human Feedback: One of the major improvements in training large language models has come through reinforcement learning from human feedback (RLHF). More generally, since LLM tasks are often very open-ended, human feedback from your application’s end users is often critical for evaluating LLM performance. Integrating this feedback loop within your LLMOps pipelines can often increase the performance of your trained large language model.
Hyperparameter Tuning: In classical ML, hyperparameter tuning often centers around improving accuracy or other metrics. For LLMs, tuning also becomes important for reducing the cost and computational power requirements of training and inference. For example, tweaking batch sizes and learning rates can dramatically change the speed and cost of training. Thus, both classical ML and LLMs benefit from tracking and optimizing the tuning process, but with different emphases.
Performance Metrics: Traditional ML models have very clearly defined performance metrics, such as accuracy, AUC, and F1 score, which are fairly straightforward to calculate. When it comes to evaluating LLMs, however, a whole different set of standard metrics and scores applies, such as bilingual evaluation understudy (BLEU) and Recall-Oriented Understudy for Gisting Evaluation (ROUGE), which require some extra consideration when implementing.
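For illustration, the sketch below computes BLEU (via NLTK) and ROUGE (via the `rouge-score` package) for a single reference/candidate pair; the example sentences are made up.

```python
# LLM evaluation sketch: BLEU with NLTK and ROUGE with the rouge-score package.
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
from rouge_score import rouge_scorer

reference = "the model was deployed to production after validation"
candidate = "the model was deployed to production following validation"

# BLEU expects tokenized input: a list of reference token lists plus a candidate token list.
bleu = sentence_bleu([reference.split()], candidate.split(),
                     smoothing_function=SmoothingFunction().method1)

# ROUGE-1 and ROUGE-L F-measures on the raw strings.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

print(f"BLEU: {bleu:.3f}")
print(f"ROUGE-1 F1: {rouge['rouge1'].fmeasure:.3f}, ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")
```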
Distributed Machine Learning (DML) involves training machine learning models across multiple machines or nodes to leverage distributed computing resources, reduce training time, and handle large-scale data. This approach is crucial in scenarios where the dataset is too large to fit into the memory of a single machine, or when model training requires substantial computational power.
DML can be broadly categorized into two architectural paradigms:

- Data Parallelism: The dataset is partitioned across different nodes, with each node independently training a model replica on its data subset. The gradients from each node are then averaged or summed to update the global model. Tools like Horovod and PyTorch Distributed excel in this paradigm.
- Model Parallelism: The model itself is split across different nodes. Each node processes a part of the model, and the nodes collaborate to complete the forward and backward passes. This paradigm is beneficial for training extremely large models, such as GPT-3. Mesh-TensorFlow and DeepSpeed are popular in this space. A minimal model-parallelism sketch follows the list.
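A minimal model-parallelism sketch in PyTorch: the two halves of a toy network live on different GPUs and the activations are moved between them by hand (assumes a machine with at least two CUDA devices).

```python
# Manual model-parallelism sketch: split one model's layers across two GPUs.
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Linear(1024, 512), nn.ReLU()).to("cuda:0")
        self.stage2 = nn.Linear(512, 10).to("cuda:1")

    def forward(self, x):
        x = self.stage1(x.to("cuda:0"))
        return self.stage2(x.to("cuda:1"))   # activations hop devices; each GPU holds only its stage

model = TwoGPUModel()
logits = model(torch.randn(8, 1024))
print(logits.device)  # cuda:1
```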
Horovod [code]

- Overview: Developed by Uber, Horovod is a popular open-source library designed to make distributed deep learning fast and easy to use. It extends TensorFlow, Keras, PyTorch, and Apache MXNet with efficient inter-GPU and inter-node communication using NCCL (NVIDIA Collective Communications Library) or MPI (Message Passing Interface).
- Key Features:
  - Ring-Allreduce: Implements a ring-allreduce algorithm for gradient averaging, reducing communication overhead.
  - Elastic Training: Supports dynamic scaling of worker processes, making it resilient to node failures and dynamic resource availability.
  - Mixed Precision Training: Supports mixed precision training to accelerate computation and reduce memory footprint.
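A minimal Horovod + PyTorch data-parallel training sketch (launched with something like `horovodrun -np 4 python train.py`); the model and data are placeholders.

```python
# Horovod data-parallel sketch for PyTorch (run with: horovodrun -np 4 python train.py).
import torch
import torch.nn as nn
import horovod.torch as hvd

hvd.init()                                       # one process per GPU/worker
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(20, 2).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())  # scale LR with worker count

# Make sure every worker starts from the same weights and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

# Wrap the optimizer so gradients are ring-allreduced across workers each step.
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

loss_fn = nn.CrossEntropyLoss()
for step in range(100):
    x = torch.randn(32, 20, device=device)       # placeholder batch
    y = torch.randint(0, 2, (32,), device=device)
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    optimizer.step()
```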
PyTorch Distributed [code]

- Overview: PyTorch Distributed provides native support for distributed training, including both data and model parallelism. It offers seamless integration with existing PyTorch codebases, making it highly versatile for research and production.
- Key Features:
  - Distributed Data Parallel (DDP): Optimized module that synchronizes gradients across multiple nodes, offering better performance than traditional data parallelism.
  - RPC-Based Parallelism: Supports Remote Procedure Call (RPC) for model parallelism, enabling distributed execution of arbitrary PyTorch code.
  - Fully Sharded Data Parallel (FSDP): Efficient memory management for large-scale models, allowing layers to be sharded across GPUs.
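A minimal DistributedDataParallel (DDP) sketch, meant to be launched with `torchrun --nproc_per_node=4 train_ddp.py`; the model and data are placeholders.

```python
# PyTorch DDP sketch (launch with: torchrun --nproc_per_node=4 train_ddp.py).
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
    local_rank = int(os.environ["LOCAL_RANK"])     # set by torchrun
    device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")

    model = nn.Linear(20, 2).to(device)
    ddp_model = DDP(model, device_ids=[local_rank] if torch.cuda.is_available() else None)
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(100):
        x = torch.randn(32, 20, device=device)     # in practice: shard data with a DistributedSampler
        y = torch.randint(0, 2, (32,), device=device)
        optimizer.zero_grad()
        loss_fn(ddp_model(x), y).backward()        # DDP all-reduces gradients during backward
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```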
TensorFlow Distributed [code]

- Overview: TensorFlow Distributed is a comprehensive framework within TensorFlow for distributing training across multiple devices or machines. It supports a wide range of strategies, making it suitable for different scales of distributed training.
- Key Features:
  - MultiWorkerMirroredStrategy: Distributes the dataset across multiple workers while keeping model replicas synchronized using all-reduce algorithms.
  - ParameterServerStrategy: Useful for training large models by distributing the variables across parameter servers and the computation across workers.
  - TPU Support: Native support for distributing training on Google's Tensor Processing Units (TPUs) for enhanced performance.
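A minimal `tf.distribute.MultiWorkerMirroredStrategy` sketch; each worker runs the same script with its own `TF_CONFIG` environment variable describing the cluster (omitted here), and the dataset is synthetic.

```python
# TensorFlow multi-worker data-parallel sketch; each worker needs a TF_CONFIG env var.
import tensorflow as tf

strategy = tf.distribute.MultiWorkerMirroredStrategy()

# Variables created inside the scope are mirrored and kept in sync via all-reduce.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(20,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(2),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

# Synthetic stand-in data; a real job would shard a tf.data pipeline across workers.
x = tf.random.normal((1024, 20))
y = tf.random.uniform((1024,), maxval=2, dtype=tf.int32)
model.fit(tf.data.Dataset.from_tensor_slices((x, y)).batch(64), epochs=2)
```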
DeepSpeed [code]

- Overview: DeepSpeed is an open-source deep learning optimization library that enables distributed training of large-scale models. Developed by Microsoft, it is specifically designed for training trillion-parameter models.
- Key Features:
  - ZeRO (Zero Redundancy Optimizer): A memory optimization technique that reduces the memory footprint by partitioning model states across data-parallel processes.
  - Sparse Attention: Optimizes memory and compute for models with attention mechanisms, enabling the training of extremely large transformers.
  - Hybrid Parallelism: Combines data, pipeline, and model parallelism, offering flexibility in scaling models across different hardware configurations.
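A minimal DeepSpeed sketch with ZeRO stage 2 enabled via an inline config (normally kept in a `ds_config.json`), launched with `deepspeed train.py` on a GPU machine; the model, data, and config values are placeholders.

```python
# DeepSpeed sketch with ZeRO stage-2 optimizer-state/gradient partitioning
# (launch with: deepspeed train.py).
import torch
import torch.nn as nn
import deepspeed

ds_config = {
    "train_batch_size": 64,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},          # partition optimizer states and gradients
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10))

# deepspeed.initialize wraps the model and builds the distributed optimizer from the config.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

loss_fn = nn.CrossEntropyLoss()
for step in range(100):
    x = torch.randn(64, 1024, device=model_engine.device, dtype=torch.half)  # fp16 inputs
    y = torch.randint(0, 10, (64,), device=model_engine.device)
    loss = loss_fn(model_engine(x), y)
    model_engine.backward(loss)                 # handles loss scaling and gradient partitioning
    model_engine.step()
```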
Ray and Ray Train [code]

- Overview: Ray is a distributed execution framework that simplifies scaling Python workloads, including machine learning. Ray Train, built on top of Ray, is designed for distributed ML training.
- Key Features:
  - Fault Tolerance: Automatically handles task failures and retries, ensuring robust distributed training.
  - Hyperparameter Tuning: Integrates with Ray Tune for distributed hyperparameter search, optimizing model performance across multiple configurations.
  - Scalable Inference: Ray Serve allows for scalable model serving, ensuring that models can be deployed and scaled efficiently.
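A minimal Ray Train sketch using `TorchTrainer` (Ray 2.x API); the training loop body and configuration values are placeholders.

```python
# Ray Train sketch: run a PyTorch training loop across 4 workers (Ray 2.x API).
import torch
import torch.nn as nn
import ray.train.torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    # Ray wraps the model in DDP and moves it to the right device for this worker.
    model = ray.train.torch.prepare_model(nn.Linear(20, 2))
    optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])
    loss_fn = nn.CrossEntropyLoss()
    for step in range(100):
        x = torch.randn(32, 20)                 # placeholder batch
        y = torch.randint(0, 2, (32,))
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()

trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"lr": 0.01},
    scaling_config=ScalingConfig(num_workers=4, use_gpu=False),
)
result = trainer.fit()
```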
Mesh-TensorFlow [code]

- Overview: Mesh-TensorFlow is an extension of TensorFlow designed to support model parallelism by splitting tensors and computations across a "mesh" of devices. It is particularly useful for training large-scale transformer models.
- Key Features:
  - Automatic Partitioning: Automatically partitions the model and tensor operations across multiple devices, optimizing for communication and computation.
  - Scalability: Scales efficiently across thousands of devices, making it suitable for training models with billions of parameters.
  - Flexibility: Supports various parallelism strategies, including data parallelism, model parallelism, and pipeline parallelism.
resources : ml-ops, gcloud-mlops, three levels of mlops, state of mlops, landscape.lfai.foundation/, @github/awesome-production-ml, MLOps Landscape in 2023: Top Tools and Platforms, A Gentle Introduction to MLOps, google cloud services for mlops, book : Practitioners guide to MLOps: A framework for continuous delivery and automation of machine learning., mlops on vertex ai, mlops on gcp, courses : Machine Learning Engineering for Production (MLOps) Specialization, Machine Learning Operations (MLOps): Getting Started, ML Operations with Vertex AI, Guide to File Formats for Machine Learning: Columnar, Training, Inferencing, and the Feature Store, MLOps Course (ZenML) – Build Machine Learning Production Grade Projects.