From 299f02a138d2e8ff0fb1156022ac09ea08144f62 Mon Sep 17 00:00:00 2001 From: DarkLight1337 Date: Thu, 9 Jan 2025 12:33:46 +0000 Subject: [PATCH 1/2] Move Community and API Reference to the bottom Signed-off-by: DarkLight1337 --- README.md | 2 +- .../source/design/automatic_prefix_caching.md | 2 +- docs/source/index.md | 62 ++++++++++++------- 3 files changed, 40 insertions(+), 26 deletions(-) diff --git a/README.md b/README.md index 253a0bb913e37..993fd6801fa35 100644 --- a/README.md +++ b/README.md @@ -41,7 +41,7 @@ vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: - State-of-the-art serving throughput -- Efficient management of attention key and value memory with **PagedAttention** +- Efficient management of attention key and value memory with [**PagedAttention**](https://vllm.ai) - Continuous batching of incoming requests - Fast model execution with CUDA/HIP graph - Quantizations: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), INT4, INT8, and FP8. diff --git a/docs/source/design/automatic_prefix_caching.md b/docs/source/design/automatic_prefix_caching.md index 4398536b2b4ad..69498fe6c6be5 100644 --- a/docs/source/design/automatic_prefix_caching.md +++ b/docs/source/design/automatic_prefix_caching.md @@ -2,7 +2,7 @@ # Automatic Prefix Caching -The core idea of [PagedAttention](#design-paged-attention) is to partition the KV cache of each request into KV Blocks. Each block contains the attention keys and values for a fixed number of tokens. The PagedAttention algorithm allows these blocks to be stored in non-contiguous physical memory so that we can eliminate memory fragmentation by allocating the memory on demand. +The core idea of [PagedAttention](https://vllm.ai) is to partition the KV cache of each request into KV Blocks. Each block contains the attention keys and values for a fixed number of tokens. The PagedAttention algorithm allows these blocks to be stored in non-contiguous physical memory so that we can eliminate memory fragmentation by allocating the memory on demand. To automatically cache the KV cache, we utilize the following key observation: Each KV block can be uniquely identified by the tokens within the block and the tokens in the prefix before the block. diff --git a/docs/source/index.md b/docs/source/index.md index 23e4304fe29d9..e1f13774959bf 100644 --- a/docs/source/index.md +++ b/docs/source/index.md @@ -26,7 +26,7 @@ vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: - State-of-the-art serving throughput -- Efficient management of attention key and value memory with **PagedAttention** +- Efficient management of attention key and value memory with [**PagedAttention**](https://vllm.ai) - Continuous batching of incoming requests - Fast model execution with CUDA/HIP graph - Quantization: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), INT4, INT8, and FP8 @@ -54,6 +54,8 @@ For more information, check out the following: ## Documentation +% How to start using vLLM? + ```{toctree} :caption: Getting Started :maxdepth: 1 @@ -65,6 +67,8 @@ getting_started/troubleshooting getting_started/faq ``` +% What does vLLM support? 
+ ```{toctree} :caption: Models :maxdepth: 1 @@ -75,6 +79,8 @@ models/supported_models models/extensions/index ``` +% Additional capabilities + ```{toctree} :caption: Features :maxdepth: 1 @@ -89,6 +95,8 @@ features/spec_decode features/compatibility_matrix ``` +% Details about running vLLM + ```{toctree} :caption: Inference and Serving :maxdepth: 1 @@ -104,6 +112,8 @@ serving/usage_stats serving/integrations/index ``` +% Scaling up vLLM for production + ```{toctree} :caption: Deployment :maxdepth: 1 @@ -115,6 +125,8 @@ deployment/frameworks/index deployment/integrations/index ``` +% Making the most out of vLLM + ```{toctree} :caption: Performance :maxdepth: 1 @@ -123,28 +135,7 @@ performance/optimization performance/benchmarks ``` -% Community: User community resources - -```{toctree} -:caption: Community -:maxdepth: 1 - -community/meetups -community/sponsors -``` - -```{toctree} -:caption: API Reference -:maxdepth: 2 - -api/offline_inference/index -api/engine/index -api/inference_params -api/multimodal/index -api/model/index -``` - -% Design Documents: Details about vLLM internals +% Explanation of vLLM internals ```{toctree} :caption: Design Documents @@ -159,7 +150,7 @@ design/automatic_prefix_caching design/multiprocessing ``` -% Developer Guide: How to contribute to the vLLM project +% How to contribute to the vLLM project ```{toctree} :caption: Developer Guide @@ -172,6 +163,29 @@ contributing/model/index contributing/vulnerability_management ``` +% Technical API specifications + +```{toctree} +:caption: API Reference +:maxdepth: 2 + +api/offline_inference/index +api/engine/index +api/inference_params +api/multimodal/index +api/model/index +``` + +% Latest news and acknowledgements + +```{toctree} +:caption: Community +:maxdepth: 1 + +community/meetups +community/sponsors +``` + # Indices and tables - {ref}`genindex` From 88caecc63b20e51c6291371b4c626a7ff2b2ad0a Mon Sep 17 00:00:00 2001 From: Cyrus Leung Date: Fri, 10 Jan 2025 00:19:57 +0800 Subject: [PATCH 2/2] Apply suggestions from code review Signed-off-by: DarkLight1337 Co-authored-by: Simon Mo --- README.md | 2 +- docs/source/design/automatic_prefix_caching.md | 2 +- docs/source/index.md | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index 993fd6801fa35..67c557bfe13a9 100644 --- a/README.md +++ b/README.md @@ -41,7 +41,7 @@ vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: - State-of-the-art serving throughput -- Efficient management of attention key and value memory with [**PagedAttention**](https://vllm.ai) +- Efficient management of attention key and value memory with [**PagedAttention**](https://blog.vllm.ai/2023/06/20/vllm.html) - Continuous batching of incoming requests - Fast model execution with CUDA/HIP graph - Quantizations: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), INT4, INT8, and FP8. diff --git a/docs/source/design/automatic_prefix_caching.md b/docs/source/design/automatic_prefix_caching.md index 69498fe6c6be5..6d3dd056e6a60 100644 --- a/docs/source/design/automatic_prefix_caching.md +++ b/docs/source/design/automatic_prefix_caching.md @@ -2,7 +2,7 @@ # Automatic Prefix Caching -The core idea of [PagedAttention](https://vllm.ai) is to partition the KV cache of each request into KV Blocks. Each block contains the attention keys and values for a fixed number of tokens. 
The PagedAttention algorithm allows these blocks to be stored in non-contiguous physical memory so that we can eliminate memory fragmentation by allocating the memory on demand. +The core idea of [PagedAttention](https://blog.vllm.ai/2023/06/20/vllm.html) is to partition the KV cache of each request into KV Blocks. Each block contains the attention keys and values for a fixed number of tokens. The PagedAttention algorithm allows these blocks to be stored in non-contiguous physical memory so that we can eliminate memory fragmentation by allocating the memory on demand. To automatically cache the KV cache, we utilize the following key observation: Each KV block can be uniquely identified by the tokens within the block and the tokens in the prefix before the block. diff --git a/docs/source/index.md b/docs/source/index.md index e1f13774959bf..356fa4b7fd573 100644 --- a/docs/source/index.md +++ b/docs/source/index.md @@ -26,7 +26,7 @@ vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: - State-of-the-art serving throughput -- Efficient management of attention key and value memory with [**PagedAttention**](https://vllm.ai) +- Efficient management of attention key and value memory with [**PagedAttention**](https://blog.vllm.ai/2023/06/20/vllm.html) - Continuous batching of incoming requests - Fast model execution with CUDA/HIP graph - Quantization: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), INT4, INT8, and FP8
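
A note on the `automatic_prefix_caching.md` hunks above: the key observation there is that each KV block is uniquely identified by the tokens within the block together with all tokens in the prefix before it. The snippet below is a minimal illustrative sketch of that idea, assuming a per-block hash chained through the previous block's hash; `BLOCK_SIZE`, `block_hash`, and `hash_prompt` are hypothetical names for this sketch, not vLLM's actual implementation.

```python
# Sketch only: keying a KV block by its own tokens plus everything before it,
# as described in docs/source/design/automatic_prefix_caching.md.
from typing import Optional

BLOCK_SIZE = 16  # tokens per KV block (hypothetical value)


def block_hash(prefix_hash: Optional[int], block_tokens: tuple[int, ...]) -> int:
    """Identify a block by its tokens and the prefix before it."""
    # Chaining the previous block's hash captures the entire prefix.
    return hash((prefix_hash, block_tokens))


def hash_prompt(token_ids: list[int]) -> list[int]:
    """Compute a cache key for every full block of a prompt."""
    hashes, prev = [], None
    full_len = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, full_len, BLOCK_SIZE):
        prev = block_hash(prev, tuple(token_ids[start:start + BLOCK_SIZE]))
        hashes.append(prev)
    return hashes


# Two prompts sharing a 32-token prefix get identical keys for their first
# two blocks, so the KV cache for that prefix can be reused.
a = hash_prompt(list(range(48)))
b = hash_prompt(list(range(32)) + [999] * 16)
assert a[:2] == b[:2] and a[2] != b[2]
```

Because each key chains through the previous one, prompts that share a prefix map to the same leading block keys, which is what makes the cached KV blocks for that prefix reusable across requests.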