diff --git a/docs/source/docs/index.rst b/docs/source/docs/index.rst index 4b371484589..f74cc726fcd 100644 --- a/docs/source/docs/index.rst +++ b/docs/source/docs/index.rst @@ -32,7 +32,7 @@ SkyPilot is a framework for running AI and batch workloads on any infra, offerin SkyPilot **abstracts away infra burdens**: -- Launch :ref:`dev clusters `, :ref:`jobs `, and :ref:`serving ` on any infra +- Launch :ref:`clusters `, :ref:`jobs `, and :ref:`serving ` on any infra - Easy job management: queue, run, and auto-recover many jobs SkyPilot **supports multiple clusters, clouds, and hardware** (`the Sky `_): @@ -48,6 +48,7 @@ SkyPilot **cuts your cloud costs & maximizes GPU availability**: SkyPilot supports your existing GPU, TPU, and CPU workloads, with no code changes. + :ref:`Current supported infra ` (Kubernetes; AWS, GCP, Azure, OCI, Lambda Cloud, Fluidstack, RunPod, Cudo, Paperspace, Cloudflare, Samsung, IBM, VMware vSphere): .. raw:: html @@ -62,7 +63,7 @@ SkyPilot supports your existing GPU, TPU, and CPU workloads, with no code change Ready to get started? ---------------------- -:ref:`Install SkyPilot ` in 1 minute. Then, launch your first dev cluster in 2 minutes in :ref:`Quickstart `. +:ref:`Install SkyPilot ` in 1 minute. Then, launch your first cluster in 2 minutes in :ref:`Quickstart `. SkyPilot is BYOC: Everything is launched within your cloud accounts, VPCs, and clusters. 
@@ -132,6 +133,7 @@ Read the research: :maxdepth: 1 :caption: Getting Started + ../overview ../getting-started/installation ../getting-started/quickstart ../examples/interactive-development diff --git a/docs/source/images/skypilot-abstractions-long-2.png b/docs/source/images/skypilot-abstractions-long-2.png new file mode 100644 index 00000000000..0dfa4dcaa86 Binary files /dev/null and b/docs/source/images/skypilot-abstractions-long-2.png differ diff --git a/docs/source/overview.rst b/docs/source/overview.rst new file mode 100644 index 00000000000..67bf070166a --- /dev/null +++ b/docs/source/overview.rst @@ -0,0 +1,298 @@ +.. _overview: + +======================== +Overview +======================== + +SkyPilot combines your cloud infra --- Kubernetes +clusters, clouds and regions for VMs, and existing machines --- into a unified compute pool optimized for running AI workloads. + +.. image:: images/skypilot-abstractions-long-2.png + :width: 90% + :align: center + + +You can run AI workloads on this pool through a unified interface, using these core abstractions: + +- Clusters +- Jobs +- Services + +These abstractions support all use cases in the AI lifecycle: +Batch processing, development, (pre)training, finetuning, hyperparameter sweeps, batch inference, and online serving. + +Using SkyPilot to run workloads offers these benefits: + +.. dropdown:: Unified execution on any cloud, region, and cluster + + Regardless of how many clouds, regions, and clusters you have, you can use a unified interface + to submit, run, and manage workloads on them. + + You focus on the workload, and SkyPilot alleviates the burden of + dealing with cloud infra details and differences. + +.. dropdown:: Cost and capacity optimization + + When launching a workload, SkyPilot automatically chooses the cheapest and most available infra option in your search space. + +.. 
dropdown:: Auto-failover across infra choices + + When launching a workload, you can give SkyPilot a search space of infra + choices --- as unrestricted or as specific as you like. + + If an infra choice has no capacity, + SkyPilot automatically falls back to the next best choice in your infra search space. + +.. dropdown:: No cloud lock-in + + Should you add infra choices (e.g., a new cloud, region, or cluster) in the future, your existing workloads can easily run on them. + No complex migration or workflow changes. + See the underlying :ref:`Sky Computing ` vision. + +.. _concept-dev-clusters: + +Clusters +------------ + +A *cluster* is SkyPilot's core resource unit: one or more VMs or Kubernetes pods in the same location. + +You can use ``sky launch`` to launch a cluster: + +.. tab-set:: + + .. tab-item:: CLI + :sync: cli + + .. code-block:: console + + $ sky launch + $ sky launch --gpus L4:8 + $ sky launch --num-nodes 10 --cpus 32+ + $ sky launch --down cluster.yaml + $ sky launch --help # See all flags. + + .. tab-item:: Python + :sync: python + + .. code-block:: python + + import sky + task = sky.Task().set_resources(sky.Resources(accelerators='L4:8')) + sky.launch(task, cluster_name='my-cluster') + +You can do the following with a cluster: + +- SSH into any node +- Connect VSCode/IDE to it +- Submit and queue many jobs on it +- Have it automatically shut down or stop to save costs +- Easily launch and use many virtual, ephemeral clusters + + +Optionally, you can bring your custom Docker or VM image when launching, or use SkyPilot's sane defaults, which configure the correct CUDA versions for different GPUs. + +Note that a SkyPilot cluster is a *virtual* collection of either cloud instances, or pods +launched on the *physical* clusters you bring to SkyPilot (:ref:`Kubernetes +clusters ` or :ref:`existing machines +`). + +See :ref:`quickstart` and :ref:`dev-cluster` to get started. + +.. 
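The ``cluster.yaml`` passed to ``sky launch --down`` above is an ordinary SkyPilot task YAML. A minimal sketch (the resource values and commands here are illustrative assumptions, not defaults):

```yaml
# cluster.yaml -- illustrative sketch; adjust to your workload.
resources:
  accelerators: L4:8   # 8 L4 GPUs per node
  cpus: 32+            # at least 32 vCPUs per node

num_nodes: 2           # a 2-node cluster

# Runs once when the cluster is first provisioned.
setup: |
  pip install -r requirements.txt

# Runs as the initial job on the cluster.
run: |
  python train.py
```

With the ``--down`` flag, the cluster is automatically torn down after the job finishes.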
_concept-jobs: + +Jobs +------------ + +A *job* is a program you want to run. Two types of jobs are supported: + +.. list-table:: + :widths: 50 50 + :header-rows: 1 + :align: center + + * - **Jobs on Clusters** + - **Managed Jobs** + * - Usage: ``sky exec`` + - Usage: ``sky jobs launch`` + * - Jobs are submitted to an existing cluster and reuse that cluster's setup. + - Each job runs in its own temporary cluster, with auto-recovery. + * - Ideal for interactive development and debugging on an existing cluster. + - Ideal for jobs requiring recovery (e.g., spot instances) or scaling to many parallel jobs. + + + +A job can contain one or :ref:`more ` tasks. In most cases, a job has just one task; we'll refer to them interchangeably. + + + +.. _concept-jobs-on-dev-cluster: + +Jobs on clusters +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +You can use ``sky exec`` to queue and run jobs on an existing cluster. +This is ideal for interactive development, reusing a cluster's setup. + +See :ref:`job-queue` to get started. + +.. tab-set:: + + .. tab-item:: CLI + :sync: cli + + .. code-block:: bash + + sky exec my-cluster --gpus L4:1 --workdir=. -- python train.py + sky exec my-cluster train.yaml # Specify everything in a YAML. + + # Fractional GPUs are also supported. + sky exec my-cluster --gpus L4:0.5 -- python eval.py + + # Multi-node jobs are also supported. + sky exec my-cluster --num-nodes 2 -- hostname + + .. tab-item:: Python + :sync: python + + .. code-block:: python + + # Assume you have 'my-cluster' already launched. + + # Queue a job requesting 1 GPU. + train = sky.Task(run='python train.py').set_resources( + sky.Resources(accelerators='L4:1')) + train = sky.Task.from_yaml('train.yaml') # Or load from a YAML. + sky.exec(train, cluster_name='my-cluster', detach_run=True) + + # Queue a job requesting 0.5 GPU. + eval = sky.Task(run='python eval.py').set_resources( + sky.Resources(accelerators='L4:0.5')) + sky.exec(eval, cluster_name='my-cluster', detach_run=True) + + +.. 
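The ``train.yaml`` referenced in the ``sky exec`` examples above could look like the following sketch (the resource values and commands are assumptions for illustration):

```yaml
# train.yaml -- illustrative sketch for `sky exec my-cluster train.yaml`.
resources:
  accelerators: L4:1   # request 1 GPU from the cluster's job queue

workdir: .             # sync the current directory to the cluster

run: |
  python train.py
```

Because ``sky exec`` reuses the existing cluster's setup, only the ``workdir`` sync and the ``run`` commands are executed.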
_concept-managed-jobs: + + Managed jobs + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + + + *Managed jobs* automatically provision a temporary cluster for each job and handle + auto-recovery. A lightweight jobs controller provides hands-off monitoring and recovery. + You can use ``sky jobs launch`` to launch managed jobs. + + Managed jobs are ideal for running jobs on preemptible spot instances (e.g., + finetuning, batch inference). Spot GPUs typically cut costs by 3--6x. They are also + ideal for scaling to many parallel jobs. + + Suggested pattern: Use clusters to interactively develop and debug your code first, and then + use managed jobs to run it at scale. + + See :ref:`managed-jobs` and :ref:`many-jobs` to get started. + + .. _concept-services: + + Services + -------- + + A *service* is for AI model serving. + A service can have one or more replicas, potentially spanning locations (regions, clouds, clusters), pricing models (on-demand, spot, etc.), or even GPU types. + + See :ref:`sky-serve` to get started. + + Bringing your infra + ------------------------------------------------------------------- + + SkyPilot easily connects to your existing infra---clouds, Kubernetes + clusters, or on-prem machines---using each infra's native authentication + (cloud credentials, kubeconfig, SSH). + + Cloud VMs + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + + SkyPilot can launch VMs on the clouds and regions you have access to. + Run ``sky check`` to check access. + + SkyPilot supports most major cloud providers. See :ref:`cloud-account-setup` for details. + + .. raw:: html + +

+ + <!-- image residue: SkyPilot supported clouds logos (alt: "SkyPilot Supported Clouds") --> + +

+ +By default, SkyPilot reuses your existing cloud authentication methods. Optionally, you can also :ref:`set up ` specific roles, permissions, or service accounts for SkyPilot to use. + +.. _concept-kubernetes-clusters: + +Kubernetes clusters +~~~~~~~~~~~~~~~~~~~~~ + +You can bring existing Kubernetes clusters, including managed clusters (e.g., +EKS, GKE, AKS) or on-prem ones, into SkyPilot. Auto-failover +between multiple clusters is also supported. + +.. image:: images/k8s-skypilot-architecture-light.png + :width: 45% + :align: center + :class: no-scaled-link, only-light + +.. image:: images/k8s-skypilot-architecture-dark.png + :width: 45% + :align: center + :class: no-scaled-link, only-dark + +See :ref:`kubernetes-overview`. + +.. _concept-existing-machines: + +Existing machines +~~~~~~~~~~~~~~~~~~~~~ + +If you have existing machines, i.e., a list of IP addresses you can SSH into, you can bring them into SkyPilot. + +.. figure:: images/sky-existing-infra-workflow-light.png + :width: 85% + :align: center + :alt: Deploying SkyPilot on existing machines + :class: no-scaled-link, only-light + +.. figure:: images/sky-existing-infra-workflow-dark.png + :width: 85% + :align: center + :alt: Deploying SkyPilot on existing machines + :class: no-scaled-link, only-dark + +See :ref:`Using Existing Machines `. + +SkyPilot's cost and capacity optimization +------------------------------------------------------------------- + +Whenever new compute is needed for a cluster, job, or service, +SkyPilot's provisioner natively optimizes for cost and capacity, choosing the infra option that is the cheapest and available. + +For example, if you want to launch a cluster with 8 A100 GPUs, SkyPilot will try all infra +options in the given search space in the "cheapest and available" order, +with auto-failover: + +.. 
figure:: https://blog.skypilot.co/ai-on-kubernetes/images/failover.png + :width: 95% + :align: center + :alt: SkyPilot auto-failover + :class: no-scaled-link + +As such, SkyPilot users no longer need to worry about specific infra details, manual retries, or manual setup. +Workloads also enjoy higher GPU availability and lower costs. + +Users can specify each workload's search space, as flexible or as specific as desired. Example search spaces: + +- Use the cheapest and available GPUs out of a set, ``{A10g:8, A10:8, L4:8, A100:8}`` +- Use my Kubernetes cluster or any accessible clouds (pictured above) +- Use either a spot or on-demand H100 GPU +- Use AWS's five European regions only +- Use a specific zone, region, or cloud + +Optimization is performed within the search space. +See :ref:`auto-failover` for details. diff --git a/docs/source/sky-computing.rst b/docs/source/sky-computing.rst index 15134204ebf..8f36fd2527d 100644 --- a/docs/source/sky-computing.rst +++ b/docs/source/sky-computing.rst @@ -100,7 +100,7 @@ Just like autonomous driving has different levels of autonomy (e.g., Level 1-5), **For users on a fixed cluster** (e.g., Kubernetes, Slurm), SkyPilot provides: - A simple interface to submit and manage AI workloads, tailored to AI users' ergonomics. -- Support for dev clusters, jobs, and serving on your cluster. +- Support for clusters, jobs, and serving on your cluster. - Cost savings: Autostop, queueing, and higher hardware utilization. - Future-proofness: No retooling when you add other clusters or clouds in the future.
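The example search spaces listed in overview.rst above map directly to fields in a task's ``resources`` spec. A hedged sketch (the values are illustrative):

```yaml
resources:
  # Use the cheapest and available option out of a set of GPUs.
  accelerators: {A10g:8, A10:8, L4:8, A100:8}

  # Optionally narrow the search space further, e.g.:
  # use_spot: true     # allow spot instances
  # cloud: aws         # restrict to one cloud
  # region: eu-west-1  # or to a specific region
```

Leaving fields such as ``cloud`` and ``region`` unset keeps the search space open to all enabled infra, with auto-failover in the "cheapest and available" order.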