.. _overview:

========================
Overview
========================

SkyPilot combines your cloud infra --- Kubernetes clusters, clouds and regions
for VMs, and existing machines --- into a unified compute pool optimized for
running AI workloads.

.. image:: images/skypilot-abstractions-long-2.png
   :width: 90%
   :align: center

You can run AI workloads on this pool through a unified interface, using these core abstractions:

- Clusters
- Jobs
- Services

These abstractions support all use cases in the AI lifecycle:
batch processing, development, (pre)training, finetuning, hyperparameter sweeps, batch inference, and online serving.

Using SkyPilot to run workloads offers these benefits:

.. dropdown:: Unified execution on any cloud, region, and cluster

   Regardless of how many clouds, regions, and clusters you have, you can use a unified interface
   to submit, run, and manage workloads on them.

   You focus on the workload, and SkyPilot alleviates the burden of
   dealing with cloud infra details and differences.

.. dropdown:: Cost and capacity optimization

   When launching a workload, SkyPilot automatically chooses the cheapest and most available infra option in your search space.

.. dropdown:: Auto-failover across infra choices

   When launching a workload, you can give SkyPilot a search space of infra
   choices --- as unrestricted or as specific as you like.

   If an infra choice has no capacity,
   SkyPilot automatically falls back to the next best choice in your infra search space.

.. dropdown:: No cloud lock-in

   Should you add infra choices (e.g., a new cloud, region, or cluster) in the future, your existing workloads can easily run on them.
   No complex migration or workflow changes are needed.
   See the underlying :ref:`Sky Computing <sky-computing>` vision.

.. _concept-dev-clusters:

Clusters
------------

A *cluster* is SkyPilot's core resource unit: one or more VMs or Kubernetes pods in the same location.

You can use ``sky launch`` to launch a cluster:

.. tab-set::

   .. tab-item:: CLI
      :sync: cli

      .. code-block:: console

         $ sky launch
         $ sky launch --gpus L4:8
         $ sky launch --num-nodes 10 --cpus 32+
         $ sky launch --down cluster.yaml
         $ sky launch --help  # See all flags.

   .. tab-item:: Python
      :sync: python

      .. code-block:: python

         import sky

         task = sky.Task().set_resources(sky.Resources(accelerators='L4:8'))
         sky.launch(task, cluster_name='my-cluster')

You can do the following with a cluster:

- SSH into any node
- Connect VSCode/IDE to it
- Submit and queue many jobs on it
- Have it automatically shut down or stop to save costs (see the example below)
- Easily launch and use many virtual, ephemeral clusters

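For example, the cost-saving bullet above maps to a few lifecycle commands; a minimal sketch, assuming a cluster named ``my-cluster`` is already up:

.. code-block:: console

   $ sky autostop my-cluster -i 60   # Stop automatically after 60 idle minutes.
   $ sky stop my-cluster             # Stop now (keeps the disk); restart later with `sky start`.
   $ sky down my-cluster             # Terminate the cluster entirely.
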
Optionally, you can bring your custom Docker or VM image when launching, or use SkyPilot's sane defaults, which configure the correct CUDA versions for different GPUs.

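As a rough sketch of the custom-image option (the Docker image tag below is only an example; check the docs of your SkyPilot version for the exact ``image_id`` syntax):

.. code-block:: yaml

   # cluster.yaml -- launch with: sky launch -c my-cluster cluster.yaml
   resources:
     accelerators: L4:1
     # Run inside a custom Docker image instead of SkyPilot's default image.
     image_id: docker:pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
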
Note that a SkyPilot cluster is a *virtual* collection of either cloud instances or pods
launched on the *physical* clusters you bring to SkyPilot (:ref:`Kubernetes
clusters <concept-kubernetes-clusters>` or :ref:`existing machines
<concept-existing-machines>`).

See :ref:`quickstart` and :ref:`dev-cluster` to get started.

.. _concept-jobs:

Jobs
------------

A *job* is a program you want to run. Two types of jobs are supported:

.. list-table::
   :widths: 50 50
   :header-rows: 1
   :align: center

   * - **Jobs on Clusters**
     - **Managed Jobs**
   * - Usage: ``sky exec``
     - Usage: ``sky jobs launch``
   * - Jobs are submitted to an existing cluster and reuse that cluster's setup.
     - Each job runs in its own temporary cluster, with auto-recovery.
   * - Ideal for interactive development and debugging on an existing cluster.
     - Ideal for jobs requiring recovery (e.g., spot instances) or scaling out to many parallel jobs.

A job can contain one or :ref:`more <pipeline>` tasks. In most cases, a job has just one task; we'll refer to the two interchangeably.

.. _concept-jobs-on-dev-cluster:

Jobs on clusters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can use ``sky exec`` to queue and run jobs on an existing cluster.
This is ideal for interactive development, since jobs reuse the cluster's setup.

See :ref:`job-queue` to get started.

.. tab-set::

   .. tab-item:: CLI
      :sync: cli

      .. code-block:: bash

         sky exec my-cluster --gpus L4:1 --workdir=. -- python train.py
         sky exec my-cluster train.yaml  # Specify everything in a YAML.

         # Fractional GPUs are also supported.
         sky exec my-cluster --gpus L4:0.5 -- python eval.py

         # Multi-node jobs are also supported.
         sky exec my-cluster --num-nodes 2 -- hostname

   .. tab-item:: Python
      :sync: python

      .. code-block:: python

         import sky

         # Assume 'my-cluster' has already been launched.
         # Queue a job requesting 1 GPU.
         train = sky.Task(run='python train.py').set_resources(
             sky.Resources(accelerators='L4:1'))
         train = sky.Task.from_yaml('train.yaml')  # Or load from a YAML.
         sky.exec(train, cluster_name='my-cluster', detach_run=True)

         # Queue a job requesting 0.5 GPU.
         eval = sky.Task(run='python eval.py').set_resources(
             sky.Resources(accelerators='L4:0.5'))
         sky.exec(eval, cluster_name='my-cluster', detach_run=True)

.. _concept-managed-jobs:

Managed jobs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

*Managed jobs* automatically provision a temporary cluster for each job and handle
auto-recovery. A lightweight jobs controller provides hands-off monitoring and recovery.
You can use ``sky jobs launch`` to launch managed jobs.

Managed jobs are especially well suited to running jobs on preemptible spot instances (e.g.,
finetuning, batch inference), where spot GPUs typically cut costs by 3--6x. They are also
ideal for scaling out to many parallel jobs.

Suggested pattern: use clusters to interactively develop and debug your code first, and then
use managed jobs to run it at scale.

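For instance, a managed job can be launched from the same task YAML you developed against; a minimal sketch (the name, flags, and YAML file are illustrative):

.. code-block:: console

   $ sky jobs launch -n finetune --use-spot train.yaml   # Each job gets its own temporary cluster.
   $ sky jobs queue                                      # Check the status of managed jobs.
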
See :ref:`managed-jobs` and :ref:`many-jobs` to get started.

.. _concept-services:

Services
--------

A *service* is for serving AI models.
A service can have one or more replicas, potentially spanning locations (regions, clouds, clusters), pricing models (on-demand, spot, etc.), or even GPU types.

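As a rough sketch of what a service definition looks like (field values and the serving command below are illustrative; see the linked guide for the exact YAML schema):

.. code-block:: yaml

   # service.yaml -- deploy with: sky serve up service.yaml
   service:
     readiness_probe: /health   # Endpoint used to check replica health.
     replicas: 2                # Number of replicas behind the service endpoint.

   resources:
     accelerators: L4:1         # Each replica gets one L4 GPU.
     ports: 8080

   run: python -m my_server --port 8080   # Hypothetical serving command.
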
See :ref:`sky-serve` to get started.

Bringing your infra
-------------------------------------------------------------------

SkyPilot easily connects to your existing infra --- clouds, Kubernetes
clusters, or on-prem machines --- using each infra's native authentication
(cloud credentials, kubeconfig, SSH).

Cloud VMs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

SkyPilot can launch VMs on the clouds and regions you have access to.
Run ``sky check`` to verify access.

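For example (passing specific cloud names is supported in recent SkyPilot versions; the clouds shown are just examples):

.. code-block:: console

   $ sky check            # Check credentials for all supported infra.
   $ sky check aws gcp    # Or check specific clouds only.
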
SkyPilot supports most major cloud providers. See :ref:`cloud-account-setup` for details.

.. raw:: html

   <p align="center">
   <picture>
   <img class="only-light" alt="SkyPilot Supported Clouds" src="https://raw.githubusercontent.com/skypilot-org/skypilot/master/docs/source/images/cloud-logos-light.png" width=85%>
   <img class="only-dark" alt="SkyPilot Supported Clouds" src="https://raw.githubusercontent.com/skypilot-org/skypilot/master/docs/source/images/cloud-logos-dark.png" width=85%>
   </picture>
   </p>

By default, SkyPilot reuses your existing cloud authentication methods. Optionally, you can also :ref:`set up <cloud-permissions>` specific roles, permissions, or service accounts for SkyPilot to use.

.. _concept-kubernetes-clusters:

Kubernetes clusters
~~~~~~~~~~~~~~~~~~~~~

You can bring existing Kubernetes clusters, whether managed (e.g.,
EKS, GKE, AKS) or on-prem, into SkyPilot. Auto-failover
across multiple clusters is also supported.

.. image:: images/k8s-skypilot-architecture-light.png
   :width: 45%
   :align: center
   :class: no-scaled-link, only-light

.. image:: images/k8s-skypilot-architecture-dark.png
   :width: 45%
   :align: center
   :class: no-scaled-link, only-dark

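For instance, once your kubeconfig points at a cluster, using it from SkyPilot looks roughly like this (the GPU type is illustrative):

.. code-block:: console

   $ sky check kubernetes                                     # Verify the cluster is reachable.
   $ sky launch --cloud kubernetes --gpus L4:1 -- nvidia-smi  # Launch a pod on the cluster.
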
See :ref:`kubernetes-overview`.

.. _concept-existing-machines:

Existing machines
~~~~~~~~~~~~~~~~~~~~~

If you have existing machines, i.e., a list of IP addresses you can SSH into, you can bring them into SkyPilot.

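Onboarding looks roughly like the sketch below (the file name, SSH user, and key path are illustrative; check the linked guide for the exact flags in your SkyPilot version):

.. code-block:: console

   $ # ips.txt contains one IP address per line.
   $ sky local up --ips ips.txt --ssh-user ubuntu --ssh-key-path ~/.ssh/id_rsa
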
.. figure:: images/sky-existing-infra-workflow-light.png
   :width: 85%
   :align: center
   :alt: Deploying SkyPilot on existing machines
   :class: no-scaled-link, only-light

.. figure:: images/sky-existing-infra-workflow-dark.png
   :width: 85%
   :align: center
   :alt: Deploying SkyPilot on existing machines
   :class: no-scaled-link, only-dark

See :ref:`Using Existing Machines <existing-machines>`.

SkyPilot's cost and capacity optimization
-------------------------------------------------------------------

Whenever new compute is needed for a cluster, job, or service,
SkyPilot's provisioner natively optimizes for cost and capacity, choosing the infra option that is cheapest and available.

For example, if you want to launch a cluster with 8 A100 GPUs, SkyPilot will try all infra
options in the given search space in "cheapest and available" order,
with auto-failover:

.. figure:: https://blog.skypilot.co/ai-on-kubernetes/images/failover.png
   :width: 95%
   :align: center
   :alt: SkyPilot auto-failover
   :class: no-scaled-link

As a result, you no longer need to worry about specific infra details, manual retries, or manual setup.
Workloads also benefit from higher GPU availability and lower costs.

You can make each workload's search space as flexible or as specific as desired. Example search spaces (see the sketch after this list):

- Use the cheapest and available GPUs out of a set, e.g., ``{A10g:8, A10:8, L4:8, A100:8}``
- Use my Kubernetes cluster or any accessible cloud (pictured above)
- Use either a spot or an on-demand H100 GPU
- Use AWS's five European regions only
- Use a specific zone, region, or cloud

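For instance, the first search space above could be written in a task YAML roughly as follows (``any_of`` lists alternative resource choices; the exact schema may vary across SkyPilot versions):

.. code-block:: yaml

   resources:
     # Pick the cheapest available option among these GPU choices.
     any_of:
       - accelerators: A10g:8
       - accelerators: A10:8
       - accelerators: L4:8
       - accelerators: A100:8
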
Optimization is performed within the search space.
See :ref:`auto-failover` for details.