Upstream merge 25 01 27 #391

Merged: 109 commits, Jan 28, 2025

Commits
7bd3630
[Misc] Update CODEOWNERS (#12229)
ywang96 Jan 20, 2025
af69a6a
fix: update platform detection for M-series arm based MacBook process…
isikhi Jan 20, 2025
da75122
[misc] add cuda runtime version to usage data (#12190)
youkaichao Jan 21, 2025
06a760d
[bugfix] catch xgrammar unsupported array constraints (#12210)
Jason-CKY Jan 21, 2025
750f4ca
[Kernel] optimize moe_align_block_size for cuda graph and large num_e…
jinzhen-lin Jan 21, 2025
ecf6781
Add quantization and guided decoding CODEOWNERS (#12228)
mgoin Jan 21, 2025
d4b62d4
[AMD][Build] Porting dockerfiles from the ROCm/vllm fork (#11777)
gshtras Jan 21, 2025
5fe6bf2
[BugFix] Fix GGUF tp>1 when vocab_size is not divisible by 64 (#12230)
NickLucche Jan 21, 2025
2fc6944
[ci/build] disable failed and flaky tests (#12240)
youkaichao Jan 21, 2025
9691255
[Misc] Rename `MultiModalInputsV2 -> MultiModalInputs` (#12244)
DarkLight1337 Jan 21, 2025
1f1542a
[Misc]Add BNB quantization for PaliGemmaForConditionalGeneration (#1…
jeejeelee Jan 21, 2025
f2e9f2a
[Misc] Remove redundant TypeVar from base model (#12248)
DarkLight1337 Jan 21, 2025
a94eee4
[Bugfix] Fix mm_limits access for merged multi-modal processor (#12252)
DarkLight1337 Jan 21, 2025
c81081f
[torch.compile] transparent compilation with more logging (#12246)
youkaichao Jan 21, 2025
b197a5c
[V1][Bugfix] Fix data item ordering in mixed-modality inference (#12259)
ywang96 Jan 21, 2025
9a7c3a0
Remove pytorch comments for outlines + compressed-tensors (#12260)
tdoublep Jan 21, 2025
c646128
[Platform] improve platforms getattr (#12264)
MengqingCao Jan 21, 2025
3aec49e
[ci/build] update nightly torch for gh200 test (#12270)
youkaichao Jan 21, 2025
9705b90
[Bugfix] fix race condition that leads to wrong order of token return…
joennlae Jan 21, 2025
1e60f87
[Kernel] fix moe_align_block_size error condition (#12239)
jinzhen-lin Jan 21, 2025
132a132
[v1][stats][1/n] Add RequestStatsUpdate and RequestStats types (#10907)
rickyyx Jan 21, 2025
18fd4a8
[Bugfix] Multi-sequence broken (#11898)
andylolu2 Jan 21, 2025
347eeeb
[Misc] Remove experimental dep from tracing.py (#12007)
codefromthecrypt Jan 21, 2025
fa9ee08
[Misc] Set default backend to SDPA for get_vit_attn_backend (#12235)
wangxiyuan Jan 21, 2025
9c485d9
[Core] Free CPU pinned memory on environment cleanup (#10477)
janimo Jan 21, 2025
2acba47
[bugfix] moe tuning. rm is_navi() (#12273)
divakar-amd Jan 21, 2025
69196a9
[BUGFIX] When skip_tokenize_init and multistep are set, execution cra…
maleksan85 Jan 21, 2025
09ccc9c
[Documentation][AMD] Add information about prebuilt ROCm vLLM docker …
hongxiayang Jan 21, 2025
df76e5a
[VLM] Simplify post-processing of replacement info (#12269)
DarkLight1337 Jan 22, 2025
64ea24d
[ci/lint] Add back default arg for pre-commit (#12279)
khluu Jan 22, 2025
016e367
[CI] add docker volume prune to neuron CI (#12291)
liangfu Jan 22, 2025
cbdc4ad
[Ci/Build] Fix mypy errors on main (#12296)
DarkLight1337 Jan 22, 2025
222a9dc
[Benchmark] More accurate TPOT calc in `benchmark_serving.py` (#12288)
njhill Jan 22, 2025
66818e5
[core] separate builder init and builder prepare for each batch (#12253)
youkaichao Jan 22, 2025
4004f14
[Build] update requirements of no-device (#12299)
MengqingCao Jan 22, 2025
68ad4e3
[Core] Support fully transparent sleep mode (#11743)
youkaichao Jan 22, 2025
cd7b6f0
[VLM] Avoid unnecessary tokenization (#12310)
DarkLight1337 Jan 22, 2025
528dbca
[Model][Bugfix]: correct Aria model output (#12309)
xffxff Jan 22, 2025
16366ee
[Bugfix][VLM] Fix mixed-modality inference backward compatibility for…
ywang96 Jan 22, 2025
6609cdf
[Doc] Add docs for prompt replacement (#12318)
DarkLight1337 Jan 22, 2025
fc66dee
[Misc] Fix the error in the tip for the --lora-modules parameter (#12…
WangErXiao Jan 22, 2025
84bee4b
[Misc] Improve the readability of BNB error messages (#12320)
jeejeelee Jan 22, 2025
96f6a75
[Bugfix] Fix HPU multiprocessing executor (#12167)
kzawora-intel Jan 22, 2025
7206ce4
[Core] Support `reset_prefix_cache` (#12284)
comaniac Jan 22, 2025
aea9436
[Frontend][V1] Online serving performance improvements (#12287)
njhill Jan 22, 2025
68c4421
[AMD][Quantization] Add TritonScaledMMLinearKernel since int8 is brok…
rasmith Jan 23, 2025
8d7aa9d
[Bugfix] Fixing AMD LoRA CI test. (#12329)
Alexei-V-Ivanov-AMD Jan 23, 2025
01a5594
[Docs] Update FP8 KV Cache documentation (#12238)
mgoin Jan 23, 2025
7551a34
[Docs] Document vulnerability disclosure process (#12326)
russellb Jan 23, 2025
f0ef372
[V1] Add `uncache_blocks` (#12333)
comaniac Jan 23, 2025
5116274
[doc] explain common errors around torch.compile (#12340)
youkaichao Jan 23, 2025
8ae5ff2
[Hardware][Gaudi][BugFix] Fix dataclass error due to triton package u…
zhenwei-intel Jan 23, 2025
c5b4b11
[Bugfix] Fix k_proj's bias for whisper self attention (#12342)
Isotr0py Jan 23, 2025
978b45f
[Kernel] Flash Attention 3 Support (#12093)
LucasWilkinson Jan 23, 2025
d07efb3
[Doc] Troubleshooting errors during model inspection (#12351)
DarkLight1337 Jan 23, 2025
99d01a5
[V1] Simplify M-RoPE (#12352)
ywang96 Jan 23, 2025
8c01b80
[Bugfix] Fix broken internvl2 inference with v1 (#12360)
Isotr0py Jan 23, 2025
3f50c14
[core] add wake_up doc and some sanity check (#12361)
youkaichao Jan 23, 2025
6e650f5
[torch.compile] decouple compile sizes and cudagraph sizes (#12243)
youkaichao Jan 23, 2025
e97f802
[FP8][Kernel] Dynamic kv cache scaling factors computation (#11906)
gshtras Jan 23, 2025
2c85529
[TPU] Update TPU CI to use torchxla nightly on 20250122 (#12334)
lsy323 Jan 23, 2025
2cbeeda
[Docs] Document Phi-4 support (#12362)
Isotr0py Jan 23, 2025
eb5cb5e
[BugFix] Fix parameter names and `process_after_weight_loading` for W…
dsikka Jan 23, 2025
9726ad6
[Misc] Fix OpenAI API Compatibility Issues in Benchmark Script (#12357)
jsato8094 Jan 23, 2025
682b55b
[Docs] Add meetup slides (#12345)
WoosukKwon Jan 23, 2025
c5cffcd
[Docs] Update spec decode + structured output in compat matrix (#12373)
russellb Jan 24, 2025
24b0205
[V1][Frontend] Coalesce bunched `RequestOutput`s (#12298)
njhill Jan 24, 2025
d3d6bb1
Set weights_only=True when using torch.load() (#12366)
russellb Jan 24, 2025
5e5630a
[Bugfix] Path join when building local path for S3 clone (#12353)
omer-dayan Jan 24, 2025
55ef66e
Update compressed-tensors version (#12367)
dsikka Jan 24, 2025
0e74d79
[V1] Increase default batch size for H100/H200 (#12369)
WoosukKwon Jan 24, 2025
6dd94db
[perf] fix perf regression from #12253 (#12380)
youkaichao Jan 24, 2025
3c818bd
[Misc] Use VisionArena Dataset for VLM Benchmarking (#12389)
ywang96 Jan 24, 2025
c7c9851
[ci/build] fix wheel size check (#12396)
youkaichao Jan 24, 2025
9a0f3bd
[Hardware][Gaudi][Doc] Add missing step in setup instructions (#12382)
MohitIntel Jan 24, 2025
e784c6b
[ci/build] sync default value for wheel size (#12398)
youkaichao Jan 24, 2025
3bb8e2c
[Misc] Enable proxy support in benchmark script (#12356)
jsato8094 Jan 24, 2025
ab5bbf5
[Bugfix][Kernel] Fix CUDA 11.8 being broken by FA3 build (#12375)
LucasWilkinson Jan 24, 2025
df5dafa
[Misc] Remove deprecated code (#12383)
DarkLight1337 Jan 24, 2025
3132a93
[Bugfix][Kernel] FA3 Fix - RuntimeError: This flash attention build o…
LucasWilkinson Jan 24, 2025
221d388
[Bugfix][Kernel] Fix moe align block issue for mixtral (#12413)
ElizaWszola Jan 25, 2025
fb30ee9
[Bugfix] Fix BLIP-2 processing (#12412)
DarkLight1337 Jan 25, 2025
bf21481
[ROCm][MoE] MI300 tuned configs Mixtral-8x(7B,22B) | fp16, fp8 (#12408)
divakar-amd Jan 25, 2025
f1fc051
[Misc] Add FA2 support to ViT MHA layer (#12355)
Isotr0py Jan 25, 2025
324960a
[TPU][CI] Update torchxla version in requirement-tpu.txt (#12422)
lsy323 Jan 25, 2025
2a0309a
[Misc][Bugfix] FA3 support to ViT MHA layer (#12435)
ywang96 Jan 26, 2025
fa63e71
[V1][Perf] Reduce scheduling overhead in model runner after cuda sync…
youngkent Jan 26, 2025
0ee349b
[V1][Bugfix] Fix assertion when mm hashing is turned off (#12439)
ywang96 Jan 26, 2025
a525527
[Misc] Revert FA on ViT #12355 and #12435 (#12445)
ywang96 Jan 26, 2025
9ddc352
[Frontend] generation_config.json for maximum tokens(#12242)
mhendrey Jan 26, 2025
aa2cd2c
[Bugfix] Disable w16a16 2of4 sparse CompressedTensors24 (#12417)
tlrmchlsmth Jan 26, 2025
72f4880
[Bugfix/CI] Fix broken kernels/test_mha.py (#12450)
tlrmchlsmth Jan 26, 2025
68f1114
[Bugfix][Kernel] Fix perf regression caused by PR #12405 (#12434)
LucasWilkinson Jan 26, 2025
72bac73
[Build/CI] Fix libcuda.so linkage (#12424)
tlrmchlsmth Jan 26, 2025
0034b09
[Frontend] Rerank API (Jina- and Cohere-compatible API) (#12376)
K-Mistele Jan 27, 2025
582cf78
[DOC] Add link to vLLM blog (#12460)
terrytangyuan Jan 27, 2025
28e0750
[V1] Avoid list creation in input preparation (#12457)
WoosukKwon Jan 27, 2025
0cc6b38
[Frontend] Support scores endpoint in run_batch (#12430)
pooyadavoodi Jan 27, 2025
5204ff5
[Bugfix] Fix Granite 3.0 MoE model loading (#12446)
DarkLight1337 Jan 27, 2025
372bf08
[Bugfix] Fix missing seq_start_loc in xformers prefill metadata (#12464)
Isotr0py Jan 27, 2025
624a1e4
[V1][Minor] Minor optimizations for update_from_output (#12454)
WoosukKwon Jan 27, 2025
ce69f7f
[Bugfix] Fix gpt2 GGUF inference (#12467)
Isotr0py Jan 27, 2025
103bd17
[Build] Only build 9.0a for scaled_mm and sparse kernels (#12339)
LucasWilkinson Jan 27, 2025
01ba927
[V1][Metrics] Add initial Prometheus logger (#12416)
markmc Jan 27, 2025
3f1fc74
[V1][CI/Test] Do basic test for top-p & top-k sampling (#12469)
WoosukKwon Jan 27, 2025
2bc3fbb
[FlashInfer] Upgrade to 0.2.0 (#11194)
abmfy Jan 27, 2025
8e6d987
Merge remote-tracking branch 'upstream/main'
gshtras Jan 27, 2025
a892ecc
Merge remote-tracking branch 'origin/main' into upstream_merge_25_01_27
gshtras Jan 28, 2025
c8b8654
Direct call on ROCm
gshtras Jan 28, 2025
Files changed
7 changes: 5 additions & 2 deletions .buildkite/check-wheel-size.py
@@ -2,8 +2,11 @@
import sys
import zipfile

# Read the VLLM_MAX_SIZE_MB environment variable, defaulting to 250 MB
VLLM_MAX_SIZE_MB = int(os.environ.get('VLLM_MAX_SIZE_MB', 250))
# Read the VLLM_MAX_SIZE_MB environment variable, defaulting to 300 MiB
# Note that we have 400 MiB quota, please use it wisely.
# See https://github.com/pypi/support/issues/3792 .
# Please also sync the value with the one in Dockerfile.
VLLM_MAX_SIZE_MB = int(os.environ.get('VLLM_MAX_SIZE_MB', 300))


def print_top_10_largest_files(zip_file):
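As a usage note for the change above: a minimal, hedged sketch of running the size check locally with the limit it now reads from the environment. Only the VLLM_MAX_SIZE_MB variable and its 300 MB default come from this diff; the wheel-path argument form is an assumption about the script's CLI.

# Sketch: run the check with an explicit limit (300 is also the new default).
# The positional argument is assumed, not taken from this diff.
VLLM_MAX_SIZE_MB=300 python3 .buildkite/check-wheel-size.py dist/vllm-*.whl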
5 changes: 4 additions & 1 deletion .buildkite/run-neuron-test.sh
@@ -25,8 +25,11 @@ if [ -f /tmp/neuron-docker-build-timestamp ]; then
last_build=$(cat /tmp/neuron-docker-build-timestamp)
current_time=$(date +%s)
if [ $((current_time - last_build)) -gt 86400 ]; then
# Remove dangling images (those that are not tagged and not used by any container)
docker image prune -f
docker system prune -f
# Remove unused volumes / force the system prune for old images as well.
docker volume prune -f && docker system prune -f
# Remove huggingface model artifacts and compiler cache
rm -rf "${HF_MOUNT:?}/*"
rm -rf "${NEURON_COMPILE_CACHE_MOUNT:?}/*"
echo "$current_time" > /tmp/neuron-docker-build-timestamp
21 changes: 18 additions & 3 deletions .buildkite/test-pipeline.yaml
@@ -76,7 +76,9 @@ steps:
- tests/basic_correctness/test_basic_correctness
- tests/basic_correctness/test_cpu_offload
- tests/basic_correctness/test_preemption
- tests/basic_correctness/test_cumem.py
commands:
- pytest -v -s basic_correctness/test_cumem.py
- pytest -v -s basic_correctness/test_basic_correctness.py
- pytest -v -s basic_correctness/test_cpu_offload.py
- VLLM_TEST_ENABLE_ARTIFICIAL_PREEMPT=1 pytest -v -s basic_correctness/test_preemption.py
@@ -181,7 +183,16 @@ steps:
- vllm/
- tests/v1
commands:
- VLLM_USE_V1=1 pytest -v -s v1
# split the test to avoid interference
- VLLM_USE_V1=1 pytest -v -s v1/core
- VLLM_USE_V1=1 pytest -v -s v1/engine
- VLLM_USE_V1=1 pytest -v -s v1/sample
- VLLM_USE_V1=1 pytest -v -s v1/worker
- VLLM_USE_V1=1 pytest -v -s v1/test_stats.py
- VLLM_USE_V1=1 pytest -v -s v1/test_utils.py
# TODO: accuracy does not match, whether setting
# VLLM_USE_FLASHINFER_SAMPLER or not on H100.
- VLLM_USE_V1=1 pytest -v -s v1/e2e

- label: Examples Test # 25min
working_dir: "/vllm-workspace/examples"
@@ -477,7 +488,9 @@ steps:
- pytest models/encoder_decoder/language/test_bart.py -v -s -m 'distributed(num_gpus=2)'
- pytest models/encoder_decoder/vision_language/test_broadcast.py -v -s -m 'distributed(num_gpus=2)'
- pytest models/decoder_only/vision_language/test_models.py -v -s -m 'distributed(num_gpus=2)'
- pytest -v -s spec_decode/e2e/test_integration_dist_tp2.py
# this test fails consistently.
# TODO: investigate and fix
# - pytest -v -s spec_decode/e2e/test_integration_dist_tp2.py
- CUDA_VISIBLE_DEVICES=0,1 pytest -v -s test_sharded_state_loader.py
- CUDA_VISIBLE_DEVICES=0,1 pytest -v -s kv_transfer/disagg_test.py

@@ -515,7 +528,9 @@ steps:
- vllm/engine
- tests/multi_step
commands:
- pytest -v -s multi_step/test_correctness_async_llm.py
# this test is quite flaky
# TODO: investigate and fix.
# - pytest -v -s multi_step/test_correctness_async_llm.py
- pytest -v -s multi_step/test_correctness_llm.py

- label: Pipeline Parallelism Test # 45min
27 changes: 15 additions & 12 deletions .github/CODEOWNERS
@@ -2,32 +2,35 @@
# for more info about CODEOWNERS file

# This lists cover the "core" components of vLLM that require careful review
/vllm/attention/backends/abstract.py @WoosukKwon @zhuohan123 @youkaichao @alexm-neuralmagic @comaniac @njhill
/vllm/core @zhuohan123 @youkaichao @alexm-neuralmagic @comaniac @njhill
/vllm/engine/llm_engine.py @zhuohan123 @youkaichao @alexm-neuralmagic @comaniac @njhill
/vllm/executor/executor_base.py @zhuohan123 @youkaichao @alexm-neuralmagic @comaniac @njhill
/vllm/worker/worker_base.py @zhuohan123 @youkaichao @alexm-neuralmagic @comaniac @njhill
/vllm/worker/worker.py @zhuohan123 @youkaichao @alexm-neuralmagic @comaniac @njhill
/vllm/model_executor/layers/sampler.py @zhuohan123 @youkaichao @alexm-neuralmagic @comaniac @njhill
/vllm/attention/backends/abstract.py @WoosukKwon @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill
/vllm/core @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill
/vllm/engine/llm_engine.py @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill
/vllm/executor/executor_base.py @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill
/vllm/worker/worker_base.py @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill
/vllm/worker/worker.py @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill
/vllm/model_executor/layers/sampler.py @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill
/vllm/model_executor/layers/quantization @mgoin @robertgshaw2-redhat @tlrmchlsmth
/vllm/model_executor/guided_decoding @mgoin
/vllm/multimodal @DarkLight1337 @ywang96
CMakeLists.txt @tlrmchlsmth

# vLLM V1
/vllm/v1 @WoosukKwon @robertgshaw2-neuralmagic @njhill @ywang96 @comaniac @alexm-neuralmagic
/vllm/v1 @WoosukKwon @robertgshaw2-redhat @njhill @ywang96 @comaniac @alexm-redhat

# Test ownership
/tests/async_engine @njhill @robertgshaw2-neuralmagic @simon-mo
/tests/async_engine @njhill @robertgshaw2-redhat @simon-mo
/tests/test_inputs.py @DarkLight1337 @ywang96
/tests/entrypoints @DarkLight1337 @robertgshaw2-neuralmagic @simon-mo
/tests/entrypoints @DarkLight1337 @robertgshaw2-redhat @simon-mo
/tests/models @DarkLight1337 @ywang96
/tests/multimodal @DarkLight1337 @ywang96
/tests/prefix_caching @comaniac @KuntaiDu
/tests/spec_decode @njhill @LiuXiaoxuanPKU
/tests/kernels @tlrmchlsmth @WoosukKwon
/tests/quantization @mgoin @robertgshaw2-neuralmagic
/tests/quantization @mgoin @robertgshaw2-redhat
/.buildkite/lm-eval-harness @mgoin @simon-mo
/tests/distributed/test_multi_node_assignment.py @youkaichao
/tests/distributed/test_pipeline_parallel.py @youkaichao
/tests/distributed/test_same_node.py @youkaichao
/tests/multi_step @alexm-neuralmagic @comaniac
/tests/multi_step @alexm-redhat @comaniac
/tests/weight_loading @mgoin @youkaichao
/tests/basic_correctness/test_chunked_prefill @rkooo567 @comaniac
82 changes: 53 additions & 29 deletions CMakeLists.txt
100644 → 100755
@@ -24,9 +24,6 @@ include(${CMAKE_CURRENT_LIST_DIR}/cmake/utils.cmake)
# Suppress potential warnings about unused manually-specified variables
set(ignoreMe "${VLLM_PYTHON_PATH}")

# Prevent installation of dependencies (cutlass) by default.
install(CODE "set(CMAKE_INSTALL_LOCAL_ONLY TRUE)" ALL_COMPONENTS)

#
# Supported python versions. These versions will be searched in order, the
# first match will be selected. These should be kept in sync with setup.py.
@@ -215,6 +212,31 @@ endif()
# Define extension targets
#

#
# cumem_allocator extension
#

set(VLLM_CUMEM_EXT_SRC
"csrc/cumem_allocator.cpp")

set_gencode_flags_for_srcs(
SRCS "${VLLM_CUMEM_EXT_SRC}"
CUDA_ARCHS "${CUDA_ARCHS}")

if(VLLM_GPU_LANG STREQUAL "CUDA")
message(STATUS "Enabling cumem allocator extension.")
# link against cuda driver library
list(APPEND CUMEM_LIBS cuda)
define_gpu_extension_target(
cumem_allocator
DESTINATION vllm
LANGUAGE CXX
SOURCES ${VLLM_CUMEM_EXT_SRC}
LIBRARIES ${CUMEM_LIBS}
USE_SABI 3.8
WITH_SOABI)
endif()

#
# _C extension
#
@@ -287,7 +309,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
# Only build Marlin kernels if we are building for at least some compatible archs.
# Keep building Marlin for 9.0 as there are some group sizes and shapes that
# are not supported by Machete yet.
cuda_archs_loose_intersection(MARLIN_ARCHS "8.0;8.6;8.7;8.9;9.0" ${CUDA_ARCHS})
cuda_archs_loose_intersection(MARLIN_ARCHS "8.0;8.6;8.7;8.9;9.0" "${CUDA_ARCHS}")
if (MARLIN_ARCHS)
set(MARLIN_SRCS
"csrc/quantization/fp8/fp8_marlin.cu"
@@ -308,8 +330,8 @@
endif()

# The cutlass_scaled_mm kernels for Hopper (c3x, i.e. CUTLASS 3.x) require
# CUDA 12.0 or later (and only work on Hopper, 9.0/9.0a for now).
cuda_archs_loose_intersection(SCALED_MM_3X_ARCHS "9.0;9.0a" "${CUDA_ARCHS}")
# CUDA 12.0 or later (and only work on Hopper, 9.0a for now).
cuda_archs_loose_intersection(SCALED_MM_3X_ARCHS "9.0a" "${CUDA_ARCHS}")
if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.0 AND SCALED_MM_3X_ARCHS)
set(SRCS "csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu")
set_gencode_flags_for_srcs(
@@ -363,7 +385,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
# 2:4 Sparse Kernels

# The 2:4 sparse kernels cutlass_scaled_sparse_mm and cutlass_compressor
# require CUDA 12.2 or later (and only work on Hopper, 9.0/9.0a for now).
# require CUDA 12.2 or later (and only work on Hopper, 9.0a for now).
if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.2 AND SCALED_MM_3X_ARCHS)
set(SRCS "csrc/sparse/cutlass/sparse_compressor_c3x.cu"
"csrc/sparse/cutlass/sparse_scaled_mm_c3x.cu")
@@ -463,6 +485,9 @@ if(VLLM_GPU_LANG STREQUAL "HIP")
endif()

message(STATUS "Enabling C extension.")
if(VLLM_GPU_LANG STREQUAL "CUDA")
list(APPEND VLLM_C_LIBS cuda)
endif()
define_gpu_extension_target(
_C
DESTINATION vllm
@@ -471,6 +496,7 @@ define_gpu_extension_target(
COMPILE_FLAGS ${VLLM_GPU_FLAGS}
ARCHITECTURES ${VLLM_GPU_ARCHES}
INCLUDE_DIRECTORIES ${CUTLASS_INCLUDE_DIR};${CUTLASS_TOOLS_UTIL_INCLUDE_DIR}
LIBRARIES ${VLLM_C_LIBS}
USE_SABI 3
WITH_SOABI)

@@ -570,7 +596,7 @@ if(VLLM_GPU_LANG STREQUAL "HIP")
endif()

# vllm-flash-attn currently only supported on CUDA
if (NOT VLLM_TARGET_DEVICE STREQUAL "cuda")
if (NOT VLLM_GPU_LANG STREQUAL "CUDA")
return()
endif ()

@@ -593,7 +619,7 @@ endif()
# They should be identical but if they aren't, this is a massive footgun.
#
# The vllm-flash-attn install rules are nested under vllm to make sure the library gets installed in the correct place.
# To only install vllm-flash-attn, use --component vllm_flash_attn_c.
# To only install vllm-flash-attn, use --component _vllm_fa2_C (for FA2) or --component _vllm_fa3_C (for FA3).
# If no component is specified, vllm-flash-attn is still installed.

# If VLLM_FLASH_ATTN_SRC_DIR is set, vllm-flash-attn is installed from that directory instead of downloading.
@@ -605,43 +631,41 @@ if (DEFINED ENV{VLLM_FLASH_ATTN_SRC_DIR})
endif()

if(VLLM_FLASH_ATTN_SRC_DIR)
FetchContent_Declare(vllm-flash-attn SOURCE_DIR ${VLLM_FLASH_ATTN_SRC_DIR})
FetchContent_Declare(
vllm-flash-attn SOURCE_DIR
${VLLM_FLASH_ATTN_SRC_DIR}
BINARY_DIR ${CMAKE_BINARY_DIR}/vllm-flash-attn
)
else()
FetchContent_Declare(
vllm-flash-attn
GIT_REPOSITORY https://github.com/vllm-project/flash-attention.git
GIT_TAG 96266b1111111f3d11aabefaf3bacbab6a89d03c
GIT_TAG d4e09037abf588af1ec47d0e966b237ee376876c
GIT_PROGRESS TRUE
# Don't share the vllm-flash-attn build between build types
BINARY_DIR ${CMAKE_BINARY_DIR}/vllm-flash-attn
)
endif()

# Set the parent build flag so that the vllm-flash-attn library does not redo compile flag and arch initialization.
set(VLLM_PARENT_BUILD ON)

# Ensure the vllm/vllm_flash_attn directory exists before installation
install(CODE "file(MAKE_DIRECTORY \"\${CMAKE_INSTALL_PREFIX}/vllm/vllm_flash_attn\")" COMPONENT vllm_flash_attn_c)

# Make sure vllm-flash-attn install rules are nested under vllm/
install(CODE "set(CMAKE_INSTALL_LOCAL_ONLY FALSE)" COMPONENT vllm_flash_attn_c)
install(CODE "set(OLD_CMAKE_INSTALL_PREFIX \"\${CMAKE_INSTALL_PREFIX}\")" COMPONENT vllm_flash_attn_c)
install(CODE "set(CMAKE_INSTALL_PREFIX \"\${CMAKE_INSTALL_PREFIX}/vllm/\")" COMPONENT vllm_flash_attn_c)

# Fetch the vllm-flash-attn library
FetchContent_MakeAvailable(vllm-flash-attn)
message(STATUS "vllm-flash-attn is available at ${vllm-flash-attn_SOURCE_DIR}")

# Restore the install prefix
install(CODE "set(CMAKE_INSTALL_PREFIX \"\${OLD_CMAKE_INSTALL_PREFIX}\")" COMPONENT vllm_flash_attn_c)
install(CODE "set(CMAKE_INSTALL_LOCAL_ONLY TRUE)" COMPONENT vllm_flash_attn_c)
# Copy over the vllm-flash-attn python files (duplicated for fa2 and fa3, in
# case only one is built, in the case both are built redundant work is done)
install(
DIRECTORY ${vllm-flash-attn_SOURCE_DIR}/vllm_flash_attn/
DESTINATION vllm_flash_attn
COMPONENT _vllm_fa2_C
FILES_MATCHING PATTERN "*.py"
)

# Copy over the vllm-flash-attn python files
install(
DIRECTORY ${vllm-flash-attn_SOURCE_DIR}/vllm_flash_attn/
DESTINATION vllm/vllm_flash_attn
COMPONENT vllm_flash_attn_c
FILES_MATCHING PATTERN "*.py"
DIRECTORY ${vllm-flash-attn_SOURCE_DIR}/vllm_flash_attn/
DESTINATION vllm_flash_attn
COMPONENT _vllm_fa3_C
FILES_MATCHING PATTERN "*.py"
)

# Nothing after vllm-flash-attn, see comment about macros above
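As a usage note on the renamed install components: a hedged sketch of installing just one of them from a configured build tree. The component names _vllm_fa2_C and _vllm_fa3_C come from the comment in this diff; the build directory and prefix below are assumptions.

# Install only the FA2 flash-attention files into an assumed prefix
cmake --install build --component _vllm_fa2_C --prefix /tmp/vllm-install
# Or only the FA3 files
cmake --install build --component _vllm_fa3_C --prefix /tmp/vllm-install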
29 changes: 24 additions & 5 deletions Dockerfile
@@ -52,7 +52,7 @@ WORKDIR /workspace
# after this step
RUN --mount=type=cache,target=/root/.cache/pip \
if [ "$TARGETPLATFORM" = "linux/arm64" ]; then \
python3 -m pip install --index-url https://download.pytorch.org/whl/nightly/cu124 "torch==2.6.0.dev20241210+cu124" "torchvision==0.22.0.dev20241215"; \
python3 -m pip install --index-url https://download.pytorch.org/whl/nightly/cu126 "torch==2.7.0.dev20250121+cu126" "torchvision==0.22.0.dev20250121"; \
fi

COPY requirements-common.txt requirements-common.txt
@@ -126,8 +126,8 @@ RUN --mount=type=cache,target=/root/.cache/ccache \

# Check the size of the wheel if RUN_WHEEL_CHECK is true
COPY .buildkite/check-wheel-size.py check-wheel-size.py
# Default max size of the wheel is 250MB
ARG VLLM_MAX_SIZE_MB=250
# sync the default value with .buildkite/check-wheel-size.py
ARG VLLM_MAX_SIZE_MB=300
ENV VLLM_MAX_SIZE_MB=$VLLM_MAX_SIZE_MB
ARG RUN_WHEEL_CHECK=true
RUN if [ "$RUN_WHEEL_CHECK" = "true" ]; then \
@@ -149,7 +149,8 @@ RUN --mount=type=cache,target=/root/.cache/pip \

#################### vLLM installation IMAGE ####################
# image with vLLM installed
FROM nvidia/cuda:${CUDA_VERSION}-base-ubuntu22.04 AS vllm-base
# TODO: Restore to base image after FlashInfer AOT wheel fixed
FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu22.04 AS vllm-base
ARG CUDA_VERSION=12.4.1
ARG PYTHON_VERSION=3.12
WORKDIR /vllm-workspace
@@ -194,12 +195,30 @@ RUN --mount=type=bind,from=build,src=/workspace/dist,target=/vllm-workspace/dist
--mount=type=cache,target=/root/.cache/pip \
python3 -m pip install dist/*.whl --verbose

# How to build this FlashInfer wheel:
# $ export FLASHINFER_ENABLE_AOT=1
# $ # Note we remove 7.0 from the arch list compared to the list below, since FlashInfer only supports sm75+
# $ export TORCH_CUDA_ARCH_LIST='7.5 8.0 8.6 8.9 9.0+PTX'
# $ git clone https://github.com/flashinfer-ai/flashinfer.git --recursive
# $ cd flashinfer
# $ git checkout 524304395bd1d8cd7d07db083859523fcaa246a4
# $ python3 setup.py bdist_wheel --dist-dir=dist --verbose

RUN --mount=type=cache,target=/root/.cache/pip \
. /etc/environment && \
if [ "$TARGETPLATFORM" != "linux/arm64" ]; then \
python3 -m pip install https://github.com/flashinfer-ai/flashinfer/releases/download/v0.1.6/flashinfer-0.1.6+cu121torch2.4-cp${PYTHON_VERSION_STR}-cp${PYTHON_VERSION_STR}-linux_x86_64.whl; \
python3 -m pip install https://wheels.vllm.ai/flashinfer/524304395bd1d8cd7d07db083859523fcaa246a4/flashinfer_python-0.2.0.post1-cp${PYTHON_VERSION_STR}-cp${PYTHON_VERSION_STR}-linux_x86_64.whl; \
fi
COPY examples examples

# Although we build Flashinfer with AOT mode, there's still
# some issues w.r.t. JIT compilation. Therefore we need to
# install build dependencies for JIT compilation.
# TODO: Remove this once FlashInfer AOT wheel is fixed
COPY requirements-build.txt requirements-build.txt
RUN --mount=type=cache,target=/root/.cache/pip \
python3 -m pip install -r requirements-build.txt

#################### vLLM installation IMAGE ####################

#################### TEST IMAGE ####################
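For reference, a hedged sketch of a local image build exercising the build arguments touched in this Dockerfile diff. CUDA_VERSION, VLLM_MAX_SIZE_MB, RUN_WHEEL_CHECK and the vllm-base stage name come from the diff; the image tag is an assumption.

# Sketch: build the vllm-base stage with the updated wheel-size settings
DOCKER_BUILDKIT=1 docker build . \
  --target vllm-base \
  --build-arg CUDA_VERSION=12.4.1 \
  --build-arg VLLM_MAX_SIZE_MB=300 \
  --build-arg RUN_WHEEL_CHECK=true \
  -t vllm-local:upstream-merge-25-01-27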