[Docs] Overview page (#4622)
* WIP: Overview page

* img

* Updates

* updates

* rewords

* remove comments

* updates

* Rewords
concretevitamin authored Feb 3, 2025
1 parent fbe6d03 commit d4b1b43
Showing 4 changed files with 303 additions and 3 deletions.
6 changes: 4 additions & 2 deletions docs/source/docs/index.rst
@@ -32,7 +32,7 @@ SkyPilot is a framework for running AI and batch workloads on any infra, offering

SkyPilot **abstracts away infra burdens**:

- Launch :ref:`dev clusters <dev-cluster>`, :ref:`jobs <managed-jobs>`, and :ref:`serving <sky-serve>` on any infra
- Launch :ref:`clusters <dev-cluster>`, :ref:`jobs <managed-jobs>`, and :ref:`serving <sky-serve>` on any infra
- Easy job management: queue, run, and auto-recover many jobs

SkyPilot **supports multiple clusters, clouds, and hardware** (`the Sky <https://arxiv.org/abs/2205.07147>`_):
@@ -48,6 +48,7 @@ SkyPilot **cuts your cloud costs & maximizes GPU availability**:

SkyPilot supports your existing GPU, TPU, and CPU workloads, with no code changes.


:ref:`Currently supported infra <installation>` (Kubernetes; AWS, GCP, Azure, OCI, Lambda Cloud, Fluidstack, RunPod, Cudo, Paperspace, Cloudflare, Samsung, IBM, VMware vSphere):

.. raw:: html
@@ -62,7 +63,7 @@ SkyPilot supports your existing GPU, TPU, and CPU workloads, with no code changes
Ready to get started?
----------------------

:ref:`Install SkyPilot <installation>` in 1 minute. Then, launch your first dev cluster in 2 minutes in :ref:`Quickstart <quickstart>`.
:ref:`Install SkyPilot <installation>` in 1 minute. Then, launch your first cluster in 2 minutes in :ref:`Quickstart <quickstart>`.

SkyPilot is BYOC: Everything is launched within your cloud accounts, VPCs, and clusters.

@@ -132,6 +133,7 @@ Read the research:
:maxdepth: 1
:caption: Getting Started

../overview
../getting-started/installation
../getting-started/quickstart
../examples/interactive-development
(Binary file not shown.)
298 changes: 298 additions & 0 deletions docs/source/overview.rst
@@ -0,0 +1,298 @@
.. _overview:

========================
Overview
========================

SkyPilot combines your cloud infra --- Kubernetes clusters, VMs across clouds and regions,
and existing machines --- into a unified compute pool optimized for running AI workloads.

.. image:: images/skypilot-abstractions-long-2.png
:width: 90%
:align: center


You can run AI workloads on this pool through a unified interface, using these core abstractions:

- Clusters
- Jobs
- Services

These abstractions support all use cases in the AI lifecycle:
batch processing, development, (pre)training, finetuning, hyperparameter sweeps, batch inference, and online serving.

Using SkyPilot to run workloads offers these benefits:

.. dropdown:: Unified execution on any cloud, region, and cluster

Regardless of how many clouds, regions, and clusters you have, you can use a unified interface
to submit, run, and manage workloads on them.

You focus on the workload, and SkyPilot alleviates the burden of
dealing with cloud infra details and differences.

.. dropdown:: Cost and capacity optimization

When launching a workload, SkyPilot will automatically choose the cheapest and most available infra choice in your search space.

.. dropdown:: Auto-failover across infra choices

When launching a workload, you can give SkyPilot a search space of infra
choices --- as unrestricted or as specific as you like.

If an infra choice has no capacity,
SkyPilot automatically falls back to the next best choice in your infra search space.

.. dropdown:: No cloud lock-in

Should you add infra choices (e.g., a new cloud, region, or cluster) in the future, your existing workloads can easily run on them.
No complex migration or workflow changes.
See the underlying :ref:`Sky Computing <sky-computing>` vision.

.. _concept-dev-clusters:

Clusters
------------

A *cluster* is SkyPilot's core resource unit: one or more VMs or Kubernetes pods in the same location.

You can use ``sky launch`` to launch a cluster:

.. tab-set::

   .. tab-item:: CLI
      :sync: cli

      .. code-block:: console

         $ sky launch
         $ sky launch --gpus L4:8
         $ sky launch --num-nodes 10 --cpus 32+
         $ sky launch --down cluster.yaml
         $ sky launch --help  # See all flags.

   .. tab-item:: Python
      :sync: python

      .. code-block:: python

         import sky
         task = sky.Task().set_resources(sky.Resources(accelerators='L4:8'))
         sky.launch(task, cluster_name='my-cluster')

You can do the following with a cluster (a few common commands are sketched after the list):

- SSH into any node
- Connect VSCode/IDE to it
- Submit and queue many jobs on it
- Have it automatically shut down or stop to save costs
- Easily launch and use many virtual, ephemeral clusters
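
For instance, common operations on a launched cluster look roughly like this (``my-cluster`` is a placeholder name):

.. code-block:: console

   $ ssh my-cluster                 # SSH in; SkyPilot sets up the SSH config entry.
   $ sky autostop my-cluster -i 60  # Autostop after 60 idle minutes.
   $ sky stop my-cluster            # Stop the cluster; disks are kept.
   $ sky down my-cluster            # Terminate the cluster.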


Optionally, you can bring your custom Docker or VM image when launching, or use SkyPilot's sane defaults, which configure the correct CUDA versions for different GPUs.
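
For example, bringing a custom Docker image might look like the following sketch (the image name here is just an illustration):

.. code-block:: console

   $ sky launch --gpus L4:1 --image-id docker:pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime -- nvidia-smi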

Note that a SkyPilot cluster is a *virtual* collection of either cloud instances, or pods
launched on the *physical* clusters you bring to SkyPilot (:ref:`Kubernetes
clusters <concept-kubernetes-clusters>` or :ref:`existing machines
<concept-existing-machines>`).

See :ref:`quickstart` and :ref:`dev-cluster` to get started.

.. _concept-jobs:

Jobs
------------

A *job* is a program you want to run. Two types of jobs are supported:

.. list-table::
:widths: 50 50
:header-rows: 1
:align: center

* - **Jobs on Clusters**
- **Managed Jobs**
* - Usage: ``sky exec``
- Usage: ``sky jobs launch``
* - Jobs are submitted to an existing cluster and reuse that cluster's setup.
- Each job runs in its own temporary cluster, with auto-recovery.
* - Ideal for interactive development and debugging on an existing cluster.
- Ideal for jobs requiring recovery (e.g., spot instances) or scaling to many parallel jobs.



A job can contain one or :ref:`more <pipeline>` tasks. In most cases, a job has just one task; we'll refer to them interchangeably.



.. _concept-jobs-on-dev-cluster:

Jobs on clusters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can use ``sky exec`` to queue and run jobs on an existing cluster.
This is ideal for interactive development, reusing a cluster's setup.

See :ref:`job-queue` to get started.

.. tab-set::

   .. tab-item:: CLI
      :sync: cli

      .. code-block:: bash

         sky exec my-cluster --gpus L4:1 --workdir=. -- python train.py
         sky exec my-cluster train.yaml  # Specify everything in a YAML.

         # Fractional GPUs are also supported.
         sky exec my-cluster --gpus L4:0.5 -- python eval.py

         # Multi-node jobs are also supported.
         sky exec my-cluster --num-nodes 2 -- hostname

   .. tab-item:: Python
      :sync: python

      .. code-block:: python

         import sky

         # Assume you have 'my-cluster' already launched.
         # Queue a job requesting 1 GPU.
         train = sky.Task(run='python train.py').set_resources(
             sky.Resources(accelerators='L4:1'))
         train = sky.Task.from_yaml('train.yaml')  # Or load from a YAML.
         sky.exec(train, cluster_name='my-cluster', detach_run=True)

         # Queue a job requesting 0.5 GPU.
         eval = sky.Task(run='python eval.py').set_resources(
             sky.Resources(accelerators='L4:0.5'))
         sky.exec(eval, cluster_name='my-cluster', detach_run=True)

.. _concept-managed-jobs:

Managed jobs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


*Managed jobs* automatically provision a temporary cluster for each job and handle
auto-recovery. A lightweight jobs controller provides hands-off monitoring and recovery.
You can use ``sky jobs launch`` to launch managed jobs.

Managed jobs are especially well suited to running jobs on preemptible spot instances (e.g.,
finetuning, batch inference); spot GPUs typically cut costs by 3--6x. They are also
ideal for scaling out to many parallel jobs.
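
As a rough sketch (``train.yaml`` here is a hypothetical task YAML), launching and monitoring a managed spot job might look like:

.. code-block:: console

   $ sky jobs launch --use-spot train.yaml  # Provision a temporary cluster and run with auto-recovery.
   $ sky jobs queue                         # Check the status of managed jobs.
   $ sky jobs logs 1                        # Stream the logs of job 1.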

Suggested pattern: Use clusters to interactively develop and debug your code first, and then
use managed jobs to run it at scale.

See :ref:`managed-jobs` and :ref:`many-jobs` to get started.

.. _concept-services:

Services
--------

A *service* is for AI model serving.
A service can have one or more replicas, potentially spanning locations (regions, clouds, clusters), pricing models (on-demand, spot, etc.), or even GPU types.
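
As a rough sketch (``service.yaml`` and ``my-service`` are hypothetical names for a SkyServe YAML and its service), the serving workflow looks like:

.. code-block:: console

   $ sky serve up -n my-service service.yaml  # Launch the service and its replicas.
   $ sky serve status                         # Check the endpoint and replica status.
   $ sky serve down my-service                # Tear down the service.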

See :ref:`sky-serve` to get started.

Bringing your infra
-------------------------------------------------------------------

SkyPilot easily connects to your existing infra---clouds, Kubernetes
clusters, or on-prem machines---using each infra's native authentication
(cloud credentials, kubeconfig, SSH).

Cloud VMs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

SkyPilot can launch VMs on the clouds and regions you have access to.
Run ``sky check`` to check access.

SkyPilot supports most major cloud providers. See :ref:`cloud-account-setup` for details.

.. raw:: html

<p align="center">
<picture>
<img class="only-light" alt="SkyPilot Supported Clouds" src="https://raw.githubusercontent.com/skypilot-org/skypilot/master/docs/source/images/cloud-logos-light.png" width=85%>
<img class="only-dark" alt="SkyPilot Supported Clouds" src="https://raw.githubusercontent.com/skypilot-org/skypilot/master/docs/source/images/cloud-logos-dark.png" width=85%>
</picture>
</p>

By default, SkyPilot reuses your existing cloud authentication methods. Optionally, you can also :ref:`set up <cloud-permissions>` specific roles, permissions, or service accounts for SkyPilot to use.

.. _concept-kubernetes-clusters:

Kubernetes clusters
~~~~~~~~~~~~~~~~~~~~~

You can bring existing Kubernetes clusters, including managed clusters (e.g.,
EKS, GKE, AKS) or on-prem ones, into SkyPilot. Auto-failover
between multiple clusters is also supported.
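
As a rough sketch (assuming your kubeconfig already points at a cluster), you can verify access and target Kubernetes explicitly:

.. code-block:: console

   $ sky check kubernetes                                     # Verify SkyPilot can access the cluster.
   $ sky launch --cloud kubernetes --gpus L4:1 -- nvidia-smi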

.. image:: images/k8s-skypilot-architecture-light.png
:width: 45%
:align: center
:class: no-scaled-link, only-light

.. image:: images/k8s-skypilot-architecture-dark.png
:width: 45%
:align: center
:class: no-scaled-link, only-dark

See :ref:`kubernetes-overview`.

.. _concept-existing-machines:

Existing machines
~~~~~~~~~~~~~~~~~~~~~

If you have existing machines, i.e., a list of IP addresses you can SSH into, you can bring them into SkyPilot.

.. figure:: images/sky-existing-infra-workflow-light.png
:width: 85%
:align: center
:alt: Deploying SkyPilot on existing machines
:class: no-scaled-link, only-light

.. figure:: images/sky-existing-infra-workflow-dark.png
:width: 85%
:align: center
:alt: Deploying SkyPilot on existing machines
:class: no-scaled-link, only-dark

See :ref:`Using Existing Machines <existing-machines>`.

SkyPilot's cost and capacity optimization
-------------------------------------------------------------------

Whenever new compute is needed for a cluster, job, or service,
SkyPilot's provisioner natively optimizes for cost and capacity, choosing the cheapest infra option that has available capacity.

For example, if you want to launch a cluster with 8 A100 GPUs, SkyPilot will try all infra
options in the given search space in the "cheapest and available" order,
with auto-failover:

.. figure:: https://blog.skypilot.co/ai-on-kubernetes/images/failover.png
:width: 95%
:align: center
:alt: SkyPilot auto-failover
:class: no-scaled-link

As such, SkyPilot users no longer need to worry about specific infra details, manual retries, or manual setup.
Workloads also benefit from higher GPU availability and cost savings.

Users can specify each workload's search space, which can be as flexible or as specific as desired. Example search spaces (a CLI sketch follows this list):

- Use the cheapest and available GPUs out of a set, ``{A10g:8, A10:8, L4:8, A100:8}``
- Use my Kubernetes cluster or any accessible clouds (pictured above)
- Use either a spot or on-demand H100 GPU
- Use AWS's five European regions only
- Use a specific zone, region, or cloud
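
As a rough sketch, some of these constraints map directly to CLI flags; richer search spaces (e.g., a set of acceptable GPU types) are typically written in the task YAML's ``resources`` section:

.. code-block:: console

   $ sky launch --gpus H100:1 --use-spot                       # A spot H100 anywhere in the search space.
   $ sky launch --gpus A100:8 --cloud aws --region eu-west-1   # Restrict to a specific cloud and region.
   $ sky launch --gpus L4:8 --cloud gcp --zone us-central1-a   # Restrict to a specific zone.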

Optimization is performed within the search space.
See :ref:`auto-failover` for details.
2 changes: 1 addition & 1 deletion docs/source/sky-computing.rst
@@ -100,7 +100,7 @@ Just like autonomous driving has different levels of autonomy (e.g., Level 1-5),
**For users on a fixed cluster** (e.g., Kubernetes, Slurm), SkyPilot provides:

- A simple interface to submit and manage AI workloads, tailored to AI users' ergonomics.
- Support for dev clusters, jobs, and serving on your cluster.
- Support for clusters, jobs, and serving on your cluster.
- Cost savings: Autostop, queueing, and higher hardware utilization.
- Future-proofing: No retooling when you add other clusters or clouds in the future.

