From 87b3c5664cbab08f0a165416dab0e13bfd2167ee Mon Sep 17 00:00:00 2001
From: TJian
Date: Thu, 6 Feb 2025 00:28:26 +0800
Subject: [PATCH] [Bug Fix] Missing vllm.envs (#405)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* [Model] Initialize support for Deepseek-VL2 models (#11578)
  Signed-off-by: Isotr0py <2037008807@qq.com>
  Co-authored-by: Cyrus Leung
* [Hardware][CPU] Multi-LoRA implementation for the CPU backend (#11100)
  Signed-off-by: Akshat Tripathi
  Signed-off-by: Oleg Mosalov
  Signed-off-by: Jee Jee Li
  Co-authored-by: Oleg Mosalov
  Co-authored-by: Jee Jee Li
  Co-authored-by: Isotr0py <2037008807@qq.com>
* [Hardware][TPU] workaround fix for MoE on TPU (#11764)
* [V1][Core][1/n] Logging and Metrics (#11962)
  Signed-off-by: rshaw@neuralmagic.com
* [Model] Support GGUF models newly added in `transformers` 4.46.0 (#9685)
  Signed-off-by: Isotr0py <2037008807@qq.com>
  Co-authored-by: Cyrus Leung
* [V1] [2/n] Logging and Metrics - `OutputProcessor` Abstraction (#11973)
  Signed-off-by: rshaw@neuralmagic.com
* [MISC] fix typo in kv transfer send recv test (#11983)
* [Bug] Fix usage of `.transpose()` and `.view()` consecutively. (#11979)
* [CI][Spec Decode] fix: broken test for EAGLE model (#11972)
  Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
* [Misc] Fix Deepseek V2 fp8 kv-scale remapping (#11947)
  Signed-off-by: Yida Wu
* [Misc] Minor Changes about Worker (#11555)
  Signed-off-by: Chenguang Li <757486878@qq.com>
* [platform] add ray_device_key (#11948)
  Signed-off-by: youkaichao
* Fix Max Token ID for Qwen-VL-Chat (#11980)
  Signed-off-by: Alex-Brooks
* [Kernel] unified_attention for Attention.forward (#11967)
  Signed-off-by: Chen Zhang
* [Doc][V1] Update model implementation guide for V1 support (#11998)
  Signed-off-by: Roger Wang
  Co-authored-by: Cyrus Leung
* [Doc] Organise installation documentation into categories and tabs (#11935)
  Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [platform] add device_control env var (#12009)
  Signed-off-by: youkaichao
* [Platform] Move get_punica_wrapper() function to Platform (#11516)
  Signed-off-by: Shanshan Shen <467638484@qq.com>
* bugfix: Fix signature mismatch in benchmark's `get_tokenizer` function (#11982)
  Signed-off-by: elijah
* [Doc] Fix build from source and installation link in README.md (#12013)
  Signed-off-by: Yikun
* Using list
* [Bugfix] Fix deepseekv3 gate bias error (#12002)
  Signed-off-by: mgoin
  Co-authored-by: mgoin
* Revert "[misc] improve memory profiling (#11809)"
  This reverts commit 889e662eae19fe8f30469883c6854ee4df4315a9.
* Multi-lingual P3L (#356)
  * Committing the *multilingual* P3L test.
  * Created a *multi-lingual* P3L test.
  * Making ruff happy.
  * .
  * Added a reference to the language-scripture Confluence table.
  * Typo fixing.
  * Harmonizing naming.
  * Fixing comments in the header.
  ---------
  Co-authored-by: Alexei V. Ivanov
  Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>
* Trying to make scales work with compileable attention
* [Docs] Add Sky Computing Lab to project intro (#12019)
  Signed-off-by: Woosuk Kwon
* [HPU][Bugfix] set_forward_context and CI test execution (#12014)
  Signed-off-by: Konrad Zawora
* [Doc] Update Quantization Hardware Support Documentation (#12025)
  Signed-off-by: tjtanaa
  Co-authored-by: tjtanaa
* [HPU][misc] add comments for explanation (#12034)
  Signed-off-by: youkaichao
* [Bugfix] Fix various bugs in multi-modal processor (#12031)
  Signed-off-by: DarkLight1337
* [Kernel] Revert the API change of Attention.forward (#12038)
  Signed-off-by: Chen Zhang
* [Platform] Add output for Attention Backend (#11981)
  Signed-off-by: wangxiyuan
* [Bugfix][Kernel] Give unique name to BlockSparseFlashAttention (#12040)
  Signed-off-by: Chen Zhang
* Explain where the engine args go when using Docker (#12041)
  Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* Docs lint
* [Doc]: Update the Json Example of the `Engine Arguments` document (#12045)
* [Misc] Merge bitsandbytes_stacked_params_mapping and packed_modules_mapping (#11924)
  Signed-off-by: Jee Jee Li
* [Kernel] Support MulAndSilu (#11624)
  Signed-off-by: Jee Jee Li
* [HPU][Bugfix] Don't use /dev/accel/accel0 for HPU autodetection in setup.py (#12046)
  Signed-off-by: Konrad Zawora
* [Platform] move current_memory_usage() into platform (#11369)
  Signed-off-by: Shanshan Shen <467638484@qq.com>
* [V1][BugFix] Fix edge case in VLM scheduling (#12065)
  Signed-off-by: Woosuk Kwon
* [Misc] Add multistep chunked-prefill support for FlashInfer (#10467)
* [core] Turn off GPU communication overlap for Ray executor (#12051)
  Signed-off-by: Rui Qiao
* [core] platform agnostic executor via collective_rpc (#11256)
  Signed-off-by: youkaichao
* [Doc] Update examples to remove SparseAutoModelForCausalLM (#12062)
  Signed-off-by: Kyle Sayers
* [V1][Prefix Cache] Move the logic of num_computed_tokens into KVCacheManager (#12003)
* Fix: cases with empty sparsity config (#12057)
  Signed-off-by: Rahul Tuli
* Type-fix: make execute_model output type optional (#12020)
* [Platform] Do not raise error if _Backend is not found (#12023)
  Signed-off-by: wangxiyuan
  Signed-off-by: Mengqing Cao
  Co-authored-by: Mengqing Cao
* [Model]: Support internlm3 (#12037)
* Misc: allow to use proxy in `HTTPConnection` (#12042)
  Signed-off-by: Yuan Zhou
* [Misc][Quark] Upstream Quark format to VLLM (#10765)
  Signed-off-by: kewang-xlnx
  Signed-off-by: kewang2
  Co-authored-by: kewang2
  Co-authored-by: Michael Goin
* [Doc]: Update `OpenAI-Compatible Server` documents (#12082)
* [Bugfix] use right truncation for non-generative tasks (#12050)
  Signed-off-by: Joe Runde
* [V1][Core] Autotune encoder cache budget (#11895)
  Signed-off-by: Roger Wang
* [Bugfix] Fix _get_lora_device for HQQ marlin (#12090)
  Signed-off-by: Varun Sundar Rabindranath
  Co-authored-by: Varun Sundar Rabindranath
* Allow hip sources to be directly included when compiling for rocm. (#12087)
* [Core] Default to using per_token quantization for fp8 when cutlass is supported. (#8651)
  Signed-off-by: mgoin
  Co-authored-by: Michael Goin
  Co-authored-by: mgoin
* [Doc] Add documentation for specifying model architecture (#12105)
* Various cosmetic/comment fixes (#12089)
  Signed-off-by: mgoin
* [Bugfix] Remove hardcoded `head_size=256` for Deepseek v2 and v3 (#12067)
  Signed-off-by: Isotr0py <2037008807@qq.com>
* Support torchrun and SPMD-style offline inference (#12071)
  Signed-off-by: youkaichao
* [core] LLM.collective_rpc interface and RLHF example (#12084)
  Signed-off-by: youkaichao
* [Bugfix] Fix max image feature size for Llava-one-vision (#12104)
  Signed-off-by: Roger Wang
* Enable user marker for vllm profiling (#357)
  * Enable user marker for vllm profiling
  ---------
  Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>
* [misc] Add LoRA kernel micro benchmarks (#11579)
* [Model] Add support for deepseek-vl2-tiny model (#12068)
  Signed-off-by: Isotr0py <2037008807@qq.com>
* Deepseek V3 support (#364)
  * Changing the hard coded datatype to see if it's enough for the model to work
  * Picking the upstream moe kernel version
  * make upstream fix for v3 also works for rocm v2
  * Conditional fnuz dtype
  * Requantizing from fn to fnuz
  * Requantizing moe as well
  * Actually requantizing moe weights
  * Conditional requantization and assert on padding in block quant
  * Format
  ---------
  Co-authored-by: charlifu
* [Bugfix] Set enforce_eager automatically for mllama (#12127)
  Signed-off-by: Chen Zhang
* [Bugfix] Fix a path bug in disaggregated prefill example script. (#12121)
  Signed-off-by: Kuntai Du
* [CI] add genai-perf benchmark in nightly benchmark (#10704)
  Signed-off-by: Kunshang Ji
* [Doc] Add instructions on using Podman when SELinux is active (#12136)
  Signed-off-by: Yuan Tang
* [Bugfix] Fix issues in CPU build Dockerfile (#12135)
  Signed-off-by: Yuan Tang
* [BugFix] add more `is not None` check in VllmConfig.__post_init__ (#12138)
  Signed-off-by: Chen Zhang
* [Misc] Add deepseek_vl2 chat template (#12143)
  Signed-off-by: Isotr0py <2037008807@qq.com>
* [ROCm][MoE] moe tuning support for rocm (#12049)
  Signed-off-by: Divakar Verma
* [V1] Move more control of kv cache initialization from model_executor to EngineCore (#11960)
  Signed-off-by: Chen Zhang
  Co-authored-by: Cody Yu
* [Misc][LoRA] Improve the readability of LoRA error messages (#12102)
  Signed-off-by: Jee Jee Li
* [CI/Build][CPU][Bugfix] Fix CPU CI (#12150)
  Signed-off-by: jiang1.li
* [core] allow callable in collective_rpc (#12151)
  Signed-off-by: youkaichao
* [Bugfix] Fix score api for missing max_model_len validation (#12119)
  Signed-off-by: Wallas Santos
* [Bugfix] Mistral tokenizer encode accept list of str (#12149)
  Signed-off-by: Kunshang Ji
* [AMD][FP8] Using MI300 FP8 format on ROCm for block_quant (#12134)
  Signed-off-by: Gregory Shtrasberg
* [torch.compile] disable logging when cache is disabled (#12043)
  Signed-off-by: youkaichao
* [misc] fix cross-node TP (#12166)
  Signed-off-by: youkaichao
* [AMD][CI/Build][Bugfix] use pytorch stale wheel (#12172)
  Signed-off-by: hongxyan
* [core] further polish memory profiling (#12126)
  Signed-off-by: youkaichao
* [Docs] Fix broken link in SECURITY.md (#12175)
  Signed-off-by: Russell Bryant
* [Model] Port deepseek-vl2 processor, remove dependency (#12169)
  Signed-off-by: Isotr0py <2037008807@qq.com>
* [core] clean up executor class hierarchy between v1 and v0 (#12171)
  Signed-off-by: youkaichao
* [Misc] Support register quantization method out-of-tree (#11969)
* [V1] Collect env var for usage stats (#12115)
* [BUGFIX] Move scores to float32 in case of running xgrammar on cpu (#12152)
  Signed-off-by: Michal Adamczyk
* [Bugfix] Fix multi-modal processors for transformers 4.48 (#12187)
* [torch.compile] store inductor compiled Python file (#12182)
  Signed-off-by: youkaichao
* benchmark_serving support --served-model-name param (#12109)
  Signed-off-by: zibai
  Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
* [Misc] Add BNB support to GLM4-V model (#12184)
  Signed-off-by: Isotr0py <2037008807@qq.com>
* [V1] Add V1 support of Qwen2-VL (#12128)
  Signed-off-by: Roger Wang
  Signed-off-by: DarkLight1337
  Co-authored-by: imkero
  Co-authored-by: DarkLight1337
* [Model] Support for fairseq2 Llama (#11442)
  Signed-off-by: Martin Gleize
  Co-authored-by: mgleize user
* [Bugfix] Fix num_heads value for simple connector when tp enabled (#12074)
  Signed-off-by: Shangming Cai
* [torch.compile] fix sym_tensor_indices (#12191)
  Signed-off-by: youkaichao
* Move linting to `pre-commit` (#11975)
  Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [DOC] Fix typo in docstring and assert message (#12194)
  Signed-off-by: Yuan Tang
* [DOC] Add missing docstring in LLMEngine.add_request() (#12195)
  Signed-off-by: Yuan Tang
* [Bugfix] Fix incorrect types in LayerwiseProfileResults (#12196)
  Signed-off-by: Yuan Tang
* [Model] Add Qwen2 PRM model support (#12202)
  Signed-off-by: Isotr0py <2037008807@qq.com>
* [Core] Interface for accessing model from `VllmRunner` (#10353)
  Signed-off-by: DarkLight1337
* [misc] add placeholder format.sh (#12206)
  Signed-off-by: youkaichao
* [CI/Build] Remove dummy CI steps (#12208)
  Signed-off-by: DarkLight1337
* [CI/Build] Make pre-commit faster (#12212)
  Signed-off-by: DarkLight1337
* [Model] Upgrade Aria to transformers 4.48 (#12203)
  Signed-off-by: DarkLight1337
* [misc] print a message to suggest how to bypass commit hooks (#12217)
  Signed-off-by: youkaichao
* [core][bugfix] configure env var during import vllm (#12209)
  Signed-off-by: youkaichao
* [V1] Remove `_get_cache_block_size` (#12214)
  Signed-off-by: Chen Zhang
* [Misc] Pass `attention` to impl backend (#12218)
  Signed-off-by: wangxiyuan
* [Bugfix] Fix `HfExampleModels.find_hf_info` (#12223)
  Signed-off-by: DarkLight1337
* [CI] Pass local python version explicitly to pre-commit mypy.sh (#12224)
  Signed-off-by: Chen Zhang
* Using ROCm6.3.1 base docker and building hipblas-common (#366)
* [Misc] Update CODEOWNERS (#12229)
* fix: update platform detection for M-series arm based MacBook processors (#12227)
  Signed-off-by: isikhi
* [misc] add cuda runtime version to usage data (#12190)
  Signed-off-by: youkaichao
  Co-authored-by: Roger Wang
* [bugfix] catch xgrammar unsupported array constraints (#12210)
  Signed-off-by: Jason Cheng
* [Kernel] optimize moe_align_block_size for cuda graph and large num_experts (e.g. DeepSeek-V3) (#12222)
  Signed-off-by: Jinzhen Lin
  Co-authored-by: Michael Goin
  Co-authored-by: Tyler Michael Smith
* Add quantization and guided decoding CODEOWNERS (#12228)
  Signed-off-by: mgoin
* [AMD][Build] Porting dockerfiles from the ROCm/vllm fork (#11777)
  Signed-off-by: Gregory Shtrasberg
* [BugFix] Fix GGUF tp>1 when vocab_size is not divisible by 64 (#12230)
  Signed-off-by: NickLucche
* [ci/build] disable failed and flaky tests (#12240)
  Signed-off-by: youkaichao
* [Misc] Rename `MultiModalInputsV2 -> MultiModalInputs` (#12244)
  Signed-off-by: DarkLight1337
* [Misc] Add BNB quantization for PaliGemmaForConditionalGeneration (#12237)
  Signed-off-by: Jee Jee Li
* [Misc] Remove redundant TypeVar from base model (#12248)
  Signed-off-by: DarkLight1337
* [Bugfix] Fix mm_limits access for merged multi-modal processor (#12252)
  Signed-off-by: DarkLight1337
* [torch.compile] transparent compilation with more logging (#12246)
  Signed-off-by: youkaichao
* [V1][Bugfix] Fix data item ordering in mixed-modality inference (#12259)
  Signed-off-by: Roger Wang
* Remove pytorch comments for outlines + compressed-tensors (#12260)
  Signed-off-by: Thomas Parnell
* [Platform] improve platforms getattr (#12264)
  Signed-off-by: Mengqing Cao
* [ci/build] update nightly torch for gh200 test (#12270)
  Signed-off-by: youkaichao
* [Bugfix] fix race condition that leads to wrong order of token returned (#10802)
  Signed-off-by: Jannis Schönleber
* [Kernel] fix moe_align_block_size error condition (#12239)
  Signed-off-by: Jinzhen Lin
* [v1][stats][1/n] Add RequestStatsUpdate and RequestStats types (#10907)
  Signed-off-by: rickyx
* [Bugfix] Multi-sequence broken (#11898)
  Signed-off-by: Andy Lo
* [Misc] Remove experimental dep from tracing.py (#12007)
  Signed-off-by: Adrian Cole
* [Misc] Set default backend to SDPA for get_vit_attn_backend (#12235)
  Signed-off-by: wangxiyuan
* [Core] Free CPU pinned memory on environment cleanup (#10477)
* Update pre-commit.yml (#374)
  * Update pre-commit.yml
  * Reapplying missing format
  * New codespell exclude location
  ---------
  Co-authored-by: Kevin H. Luu
* [bugfix] moe tuning. rm is_navi() (#12273)
  Signed-off-by: Divakar Verma
* [BUGFIX] When skip_tokenize_init and multistep are set, execution crashes (#12277)
  Signed-off-by: maleksan85
  Co-authored-by: maleksan85
* [Documentation][AMD] Add information about prebuilt ROCm vLLM docker for perf validation purpose (#12281)
  Signed-off-by: Hongxia Yang
* [VLM] Simplify post-processing of replacement info (#12269)
  Signed-off-by: DarkLight1337
* [ci/lint] Add back default arg for pre-commit (#12279)
  Signed-off-by: kevin
* [CI] add docker volume prune to neuron CI (#12291)
  Signed-off-by: Liangfu Chen
* [Ci/Build] Fix mypy errors on main (#12296)
  Signed-off-by: DarkLight1337
* [Benchmark] More accurate TPOT calc in `benchmark_serving.py` (#12288)
  Signed-off-by: Nick Hill
* [core] separate builder init and builder prepare for each batch (#12253)
  Signed-off-by: youkaichao
* [Build] update requirements of no-device (#12299)
  Signed-off-by: Mengqing Cao
* [Core] Support fully transparent sleep mode (#11743)
  Signed-off-by: youkaichao
* [VLM] Avoid unnecessary tokenization (#12310)
  Signed-off-by: DarkLight1337
* [Model][Bugfix]: correct Aria model output (#12309)
  Signed-off-by: xffxff <1247714429@qq.com>
* [Bugfix][VLM] Fix mixed-modality inference backward compatibility for V0 (#12313)
  Signed-off-by: Roger Wang
* [Doc] Add docs for prompt replacement (#12318)
  Signed-off-by: DarkLight1337
* [Misc] Fix the error in the tip for the --lora-modules parameter (#12319)
  Signed-off-by: wangerxiao <863579016@qq.com>
* [Misc] Improve the readability of BNB error messages (#12320)
  Signed-off-by: Jee Jee Li
* Skip tokenize/detokenize when it is disabled by arg --skip-tokenizer-init (#367)
  * switching detokenize flag to be False
  * detokenize = False for benchmarks
  * restoring default in main vllm code for detokenize
  * removing extra spaces
  * moving detokenize to flag
  * adding support for token ids
  ---------
  Co-authored-by: maleksan85
* [Bugfix] Fix HPU multiprocessing executor (#12167)
  Signed-off-by: Konrad Zawora
* [Core] Support `reset_prefix_cache` (#12284)
* [Frontend][V1] Online serving performance improvements (#12287)
* [AMD][Quantization] Add TritonScaledMMLinearKernel since int8 is broken for AMD (#12282)
  Signed-off-by: Randall Smith
* FP8 FA fixes (#381)
  * FP8 FA fixes
    Summary: Add missing clamp and fix reciprocal scale computation.
  * linter
* Returning the use of the proper stream in allreduce (#382)
* [Bugfix] Fixing AMD LoRA CI test. (#12329)
  Signed-off-by: Alexei V. Ivanov
* [Docs] Update FP8 KV Cache documentation (#12238)
  Signed-off-by: mgoin
  Co-authored-by: Cyrus Leung
* [Docs] Document vulnerability disclosure process (#12326)
  Signed-off-by: Russell Bryant
* [V1] Add `uncache_blocks` (#12333)
* [doc] explain common errors around torch.compile (#12340)
  Signed-off-by: youkaichao
* [Hardware][Gaudi][BugFix] Fix dataclass error due to triton package update (#12338)
  Signed-off-by: zhenwei
* [Bugfix] Fix k_proj's bias for whisper self attention (#12342)
  Signed-off-by: Isotr0py <2037008807@qq.com>
* [Kernel] Flash Attention 3 Support (#12093)
  Signed-off-by: Lucas Wilkinson
* [Doc] Troubleshooting errors during model inspection (#12351)
  Signed-off-by: DarkLight1337
* [V1] Simplify M-RoPE (#12352)
  Signed-off-by: Roger Wang
  Co-authored-by: imkero
* [Bugfix] Fix broken internvl2 inference with v1 (#12360)
  Signed-off-by: Isotr0py <2037008807@qq.com>
* [core] add wake_up doc and some sanity check (#12361)
  Signed-off-by: youkaichao
* [torch.compile] decouple compile sizes and cudagraph sizes (#12243)
  Signed-off-by: youkaichao
* [FP8][Kernel] Dynamic kv cache scaling factors computation (#11906)
  Signed-off-by: Gregory Shtrasberg
  Co-authored-by: Micah Williamson
* [TPU] Update TPU CI to use torchxla nightly on 20250122 (#12334)
  Signed-off-by: Siyuan Liu
* [Docs] Document Phi-4 support (#12362)
  Signed-off-by: Isotr0py <2037008807@qq.com>
* [BugFix] Fix parameter names and `process_after_weight_loading` for W4A16 MoE Group Act Order (#11528)
  Signed-off-by: ElizaWszola
  Co-authored-by: ElizaWszola
  Co-authored-by: Michael Goin
* [Misc] Fix OpenAI API Compatibility Issues in Benchmark Script (#12357)
  Signed-off-by: Junichi Sato
* [Docs] Add meetup slides (#12345)
  Signed-off-by: Woosuk Kwon
* Using pytorch commit past the point when rowwise PR (https://github.com/pytorch/pytorch/pull/144432) was merged (#384)
* [Docs] Update spec decode + structured output in compat matrix (#12373)
  Signed-off-by: Russell Bryant
* [V1][Frontend] Coalesce bunched `RequestOutput`s (#12298)
  Signed-off-by: Nick Hill
  Co-authored-by: Robert Shaw
* Set weights_only=True when using torch.load() (#12366)
  Signed-off-by: Russell Bryant
* [Bugfix] Path join when building local path for S3 clone (#12353)
  Signed-off-by: Omer Dayan (SW-GPU)
* Update compressed-tensors version (#12367)
* [V1] Increase default batch size for H100/H200 (#12369)
  Signed-off-by: Woosuk Kwon
* [perf] fix perf regression from #12253 (#12380)
  Signed-off-by: youkaichao
* [Misc] Use VisionArena Dataset for VLM Benchmarking (#12389)
  Signed-off-by: Roger Wang
* [ci/build] fix wheel size check (#12396)
  Signed-off-by: youkaichao
* [Hardware][Gaudi][Doc] Add missing step in setup instructions (#12382)
* [ci/build] sync default value for wheel size (#12398)
  Signed-off-by: youkaichao
* [Misc] Enable proxy support in benchmark script (#12356)
  Signed-off-by: Junichi Sato
* [Bugfix][Kernel] Fix CUDA 11.8 being broken by FA3 build (#12375)
  Signed-off-by: Lucas Wilkinson
* Applying scales rename to fp8 config (#387)
* [Misc] Remove deprecated code (#12383)
  Signed-off-by: DarkLight1337
* [Bugfix][Kernel] FA3 Fix - RuntimeError: This flash attention build only supports pack_gqa (for build size reasons). (#12405)
  Signed-off-by: Lucas Wilkinson
* Dev-docker Documentation Updates (#378)
  * Dev-docker Documentation Updates
    Minor updates to several sections, with links to other documents where appropriate.
  * Fix formatting of GEMM filename
  * README cleanup
    - Reorder some sections of the README to make them easier to follow
    - Improve formatting of bash commands
    - Prefer use of huggingface model names instead of hard-coded directories
    - Clean up wording
  * Expanded sample commands for Latency and Throughput
  * Fix markdown links
  * Fix pre-commit errors
  * Updates from review
    Initial updates to incorporate feedback from a review session held with @t-parry
  * Update script args to match current recommendations
  * Remove recommended max-num-seqs values for now
  ---------
  Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>
* [Bugfix][Kernel] Fix moe align block issue for mixtral (#12413)
* [Bugfix] Fix BLIP-2 processing (#12412)
  Signed-off-by: DarkLight1337
* [ROCm][MoE] MI300 tuned configs Mixtral-8x(7B,22B) | fp16, fp8 (#12408)
  Signed-off-by: Divakar Verma
* [Misc] Add FA2 support to ViT MHA layer (#12355)
  Signed-off-by: Isotr0py <2037008807@qq.com>
* [TPU][CI] Update torchxla version in requirement-tpu.txt (#12422)
  Signed-off-by: Siyuan Liu
* [Misc][Bugfix] FA3 support to ViT MHA layer (#12435)
  Signed-off-by: Roger Wang
  Signed-off-by: Isotr0py <2037008807@qq.com>
  Co-authored-by: Isotr0py <2037008807@qq.com>
* [V1][Perf] Reduce scheduling overhead in model runner after cuda sync (#12094)
  Signed-off-by: Keyun Tong
* [V1][Bugfix] Fix assertion when mm hashing is turned off (#12439)
  Signed-off-by: Roger Wang
* [Misc] Revert FA on ViT #12355 and #12435 (#12445)
* [Frontend] generation_config.json for maximum tokens (#12242)
  Signed-off-by: Matthew Hendrey
  Signed-off-by: Shangming Cai
  Signed-off-by: youkaichao
  Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
  Signed-off-by: Yuan Tang
  Signed-off-by: Isotr0py <2037008807@qq.com>
  Signed-off-by: DarkLight1337
  Signed-off-by: Chen Zhang
  Signed-off-by: wangxiyuan
  Co-authored-by: shangmingc
  Co-authored-by: youkaichao
  Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
  Co-authored-by: Yuan Tang
  Co-authored-by: Isotr0py
  Co-authored-by: Cyrus Leung
  Co-authored-by: Chen Zhang
  Co-authored-by: wangxiyuan
* [Bugfix] Disable w16a16 2of4 sparse CompressedTensors24 (#12417)
  Signed-off-by: Tyler Michael Smith
  Co-authored-by: mgoin
* [Bugfix/CI] Fix broken kernels/test_mha.py (#12450)
* [Bugfix][Kernel] Fix perf regression caused by PR #12405 (#12434)
  Signed-off-by: Lucas Wilkinson
* [Build/CI] Fix libcuda.so linkage (#12424)
  Signed-off-by: Tyler Michael Smith
* [Frontend] Rerank API (Jina- and Cohere-compatible API) (#12376)
  Signed-off-by: Kyle Mistele
* [DOC] Add link to vLLM blog (#12460)
  Signed-off-by: Yuan Tang
* [V1] Avoid list creation in input preparation (#12457)
  Signed-off-by: Woosuk Kwon
* [Frontend] Support scores endpoint in run_batch (#12430)
  Signed-off-by: Pooya Davoodi
* [Bugfix] Fix Granite 3.0 MoE model loading (#12446)
  Signed-off-by: DarkLight1337
* [Bugfix] Fix missing seq_start_loc in xformers prefill metadata (#12464)
  Signed-off-by: Isotr0py <2037008807@qq.com>
* [V1][Minor] Minor optimizations for update_from_output (#12454)
  Signed-off-by: Woosuk Kwon
* [Bugfix] Fix gpt2 GGUF inference (#12467)
  Signed-off-by: Isotr0py <2037008807@qq.com>
* [Build] Only build 9.0a for scaled_mm and sparse kernels (#12339)
  Signed-off-by: Lucas Wilkinson
* [V1][Metrics] Add initial Prometheus logger (#12416)
  Signed-off-by: Mark McLoughlin
* [V1][CI/Test] Do basic test for top-p & top-k sampling (#12469)
  Signed-off-by: Woosuk Kwon
* [FlashInfer] Upgrade to 0.2.0 (#11194)
  Signed-off-by: Bowen Wang
  Signed-off-by: youkaichao
  Co-authored-by: youkaichao
* Support FP8 FA from Quark format (#388)
  * Support FP8 FA from Quark format
  * Support FP8 FA from Quark format
  * nit: update comment
  * Direct call on ROCm
* 20250127 docs update (#392)
  * updating code blocks
  * typo
  * updated manifest
  * Including feedback
  * whitespace
  * Deepseek instructions
  * hyperlink fix
  * hyperlink fix
  * updating what is new
  * cpx update
  * typo
  * whitespace
  * whitespace
* Faster Custom Paged Attention kernels (#372)
  * integrate new cpa kernel, update tests and benchmark
  * added comments to mfma4 kernel
  * further comments for mfma16 kernel
  * clang-format
  * Lint
  * add flag for logits rtz conversion and disable by default
  * lint
* [Bugfix]: Fix paged attention unit tests of https://github.com/ROCm/vllm/pull/372 (#389)
  * [Bugfix]: fix paged attention tests based on the updated kernels in
    `csrc/attention/paged_attention_v1.cu`, `csrc/attention/paged_attention_v2.cu`
    and `csrc/rocm/attention.cu`.
  * improve code documentation.
  * lint
  ---------
  Co-authored-by: vllmellm
  ---------
  Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>
  Co-authored-by: Gregory Shtrasberg
  Co-authored-by: Joe Shajrawi <17753158+shajrawi@users.noreply.github.com>
  Co-authored-by: TJian
  Co-authored-by: vllmellm
* Using a more precise profiling on ROCm to properly account for weights padding (#394)
* Update Dockerfile.rocm
* [Bugfix]: include the env variables required for running FastSyncLLM
  Signed-off-by: vllmellm
* fix pre-commit lint
  Signed-off-by: vllmellm
* [Bugfix] included missing environment variable
  Signed-off-by: vllmellm

---------

Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: Akshat Tripathi
Signed-off-by: Oleg Mosalov
Signed-off-by: Jee Jee Li
Signed-off-by: rshaw@neuralmagic.com
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
Signed-off-by: Yida Wu
Signed-off-by: Chenguang Li <757486878@qq.com>
Signed-off-by: youkaichao
Signed-off-by: Alex-Brooks
Signed-off-by: Chen Zhang
Signed-off-by: Roger Wang
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Shanshan Shen <467638484@qq.com>
Signed-off-by: elijah
Signed-off-by: Yikun
Signed-off-by: mgoin
Signed-off-by: Woosuk Kwon
Signed-off-by: Konrad Zawora
Signed-off-by: tjtanaa
Signed-off-by: DarkLight1337
Signed-off-by: wangxiyuan
Signed-off-by: yisheng
Signed-off-by: Abatom
Signed-off-by: Liangfu Chen
Signed-off-by: Russell Bryant
Signed-off-by: Yuan Zhou
Signed-off-by: Sourashis Roy
Signed-off-by: Nishidha Panpaliya
Signed-off-by: Ilya Lavrenov
Signed-off-by: simon-mo
Signed-off-by: Wallas Santos
Signed-off-by: jiang1.li
Signed-off-by: yan ma
Signed-off-by: Randall Smith
Signed-off-by: Max de Bayser
Signed-off-by: Maxime Fournioux <55544262+mfournioux@users.noreply.github.com>
Signed-off-by: Ye Qi
Signed-off-by: Mengqing Cao
Signed-off-by: Joe Runde
Signed-off-by: Kunshang Ji
Signed-off-by: Kuntai Du
Signed-off-by: Ren MinMin
Signed-off-by: Travis Johnson
Signed-off-by: Fred Reiss
Signed-off-by: shaochangxu.scx
Signed-off-by: NickLucche
Signed-off-by: Rafael Vasquez
Signed-off-by: Rui Qiao
Signed-off-by: Kyle Sayers
Signed-off-by: Rahul Tuli
Signed-off-by: kewang-xlnx
Signed-off-by: kewang2
Signed-off-by: Varun Sundar Rabindranath
Signed-off-by: Yuan Tang
Signed-off-by: Divakar Verma
Signed-off-by: Gregory Shtrasberg
Signed-off-by: hongxyan
Signed-off-by: Michal Adamczyk
Signed-off-by: zibai
Signed-off-by: Martin Gleize
Signed-off-by: Shangming Cai
Signed-off-by: isikhi
Signed-off-by: Jason Cheng
Signed-off-by: Jinzhen Lin
Signed-off-by: Thomas Parnell
Signed-off-by: Jannis Schönleber
Signed-off-by: rickyx
Signed-off-by: Andy Lo
Signed-off-by: Adrian Cole
Signed-off-by: maleksan85
Signed-off-by: Hongxia Yang
Signed-off-by: kevin
Signed-off-by: Nick Hill
Signed-off-by: xffxff <1247714429@qq.com>
Signed-off-by: wangerxiao <863579016@qq.com>
Signed-off-by: Alexei V. Ivanov
Signed-off-by: zhenwei
Signed-off-by: Lucas Wilkinson
Signed-off-by: Siyuan Liu
Signed-off-by: ElizaWszola
Signed-off-by: Junichi Sato
Signed-off-by: Omer Dayan (SW-GPU)
Signed-off-by: Keyun Tong
Signed-off-by: Matthew Hendrey
Signed-off-by: Tyler Michael Smith
Signed-off-by: Kyle Mistele
Signed-off-by: Pooya Davoodi
Signed-off-by: Mark McLoughlin
Signed-off-by: Bowen Wang
Signed-off-by: vllmellm
Co-authored-by: Isotr0py
Co-authored-by: Cyrus Leung
Co-authored-by: Akshat Tripathi
Co-authored-by: Oleg Mosalov
Co-authored-by: Jee Jee Li
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Avshalom Manevich <12231371+avshalomman@users.noreply.github.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
Co-authored-by: Yangcheng Li
Co-authored-by: Siyuan Li <94890248+liaoyanqing666@users.noreply.github.com>
Co-authored-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
Co-authored-by: Concurrensee
Co-authored-by: Chenguang Li <757486878@qq.com>
Co-authored-by: youkaichao
Co-authored-by: Alex Brooks
Co-authored-by: Chen Zhang
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: Cyrus Leung
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Shanshan Shen <467638484@qq.com>
Co-authored-by: elijah <30852919+e1ijah1@users.noreply.github.com>
Co-authored-by: Yikun Jiang
Co-authored-by: Gregory Shtrasberg
Co-authored-by: Steve Luo <36296769+SunflowerAries@users.noreply.github.com>
Co-authored-by: mgoin
Co-authored-by: Alexei-V-Ivanov-AMD <156011006+Alexei-V-Ivanov-AMD@users.noreply.github.com>
Co-authored-by: Alexei V. Ivanov
Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>
Co-authored-by: Woosuk Kwon
Co-authored-by: Konrad Zawora
Co-authored-by: wangxiyuan
Co-authored-by: maang-h <55082429+maang-h@users.noreply.github.com>
Co-authored-by: YiSheng5
Co-authored-by: Zhonghua Deng
Co-authored-by: Liangfu Chen
Co-authored-by: XiaobingZhang
Co-authored-by: Russell Bryant
Co-authored-by: Yuan
Co-authored-by: jiangjiadi <34134495+jiangjiadi@users.noreply.github.com>
Co-authored-by: jiadi.jjd
Co-authored-by: sroy745 <142070531+sroy745@users.noreply.github.com>
Co-authored-by: Jie Fu (傅杰)
Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com>
Co-authored-by: WangErXiao <863579016@qq.com>
Co-authored-by: Nishidha
Co-authored-by: Ilya Lavrenov
Co-authored-by: Simon Mo
Co-authored-by: Wallas Henrique
Co-authored-by: Li, Jiang
Co-authored-by: Yan Ma
Co-authored-by: rasmith
Co-authored-by: Tyler Michael Smith
Co-authored-by: Maximilien de Bayser
Co-authored-by: Maxime Fournioux <55544262+mfournioux@users.noreply.github.com>
Co-authored-by: Guspan Tanadi <36249910+guspan-tanadi@users.noreply.github.com>
Co-authored-by: Ye (Charlotte) Qi
Co-authored-by: yeq
Co-authored-by: Mengqing Cao
Co-authored-by: Charles Frye
Co-authored-by: Joe Runde
Co-authored-by: Kunshang Ji
Co-authored-by: cennn <61925104+cennn@users.noreply.github.com>
Co-authored-by: Kuntai Du
Co-authored-by: minmin
Co-authored-by: Ren MinMin
Co-authored-by: Travis Johnson
Co-authored-by: Fred Reiss
Co-authored-by: shaochangxu <85155497+shaochangxu@users.noreply.github.com>
Co-authored-by: shaochangxu.scx
Co-authored-by: Nicolò Lucchesi
Co-authored-by: sixgod
Co-authored-by: Rafael Vasquez
Co-authored-by: Elfie Guo <164945471+elfiegg@users.noreply.github.com>
Co-authored-by: Rui Qiao <161574667+ruisearch42@users.noreply.github.com>
Co-authored-by: Kyle Sayers
Co-authored-by: Rahul Tuli
Co-authored-by: Keyun Tong
Co-authored-by: RunningLeon
Co-authored-by: kewang-xlnx <73578509+kewang-xlnx@users.noreply.github.com>
Co-authored-by: kewang2
Co-authored-by: Varun Sundar Rabindranath
Co-authored-by: Varun Sundar Rabindranath
Co-authored-by: tvirolai-amd
Co-authored-by: Michael Goin
Co-authored-by: Zhaoyi Li <36555117+Lzy17@users.noreply.github.com>
Co-authored-by: charlifu
Co-authored-by: Yuan Tang
Co-authored-by: Cody Yu
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com>
Co-authored-by: yancong <32220263+ice-tong@users.noreply.github.com>
Co-authored-by: Michal Adamczyk
Co-authored-by: gujing <925973396@qq.com>
Co-authored-by: imkero
Co-authored-by: Martin Gleize
Co-authored-by: mgleize user
Co-authored-by: shangmingc
Co-authored-by: Işık <41375111+isikhi@users.noreply.github.com>
Co-authored-by: Roger Wang
Co-authored-by: Cheng Kuan Yong Jason
Co-authored-by: Jinzhen Lin
Co-authored-by: Thomas Parnell
Co-authored-by: Jannis Schönleber
Co-authored-by: Ricky Xu
Co-authored-by: Andy Lo
Co-authored-by: Adrian Cole <64215+codefromthecrypt@users.noreply.github.com>
Co-authored-by: Jani Monoses
Co-authored-by: Kevin H. Luu
Co-authored-by: Aleksandr Malyshev <164964928+maleksan85@users.noreply.github.com>
Co-authored-by: maleksan85
Co-authored-by: Nick Hill
Co-authored-by: zhou fan <1247714429@qq.com>
Co-authored-by: ilia-cher <30845429+ilia-cher@users.noreply.github.com>
Co-authored-by: liuzhenwei
Co-authored-by: Lucas Wilkinson
Co-authored-by: Micah Williamson
Co-authored-by: Siyuan Liu
Co-authored-by: Dipika Sikka
Co-authored-by: ElizaWszola
Co-authored-by: Junichi Sato
Co-authored-by: Robert Shaw
Co-authored-by: omer-dayan
Co-authored-by: Mohit Deopujari
Co-authored-by: Jeremy Arnold <103538711+JArnoldAMD@users.noreply.github.com>
Co-authored-by: Matthew Hendrey
Co-authored-by: Kyle Mistele
Co-authored-by: Pooya Davoodi
Co-authored-by: Mark McLoughlin
Co-authored-by: Bowen Wang
Co-authored-by: Bowen Bao
Co-authored-by: arakowsk-amd <182798202+arakowsk-amd@users.noreply.github.com>
Co-authored-by: sanyalington
Co-authored-by: Joe Shajrawi <17753158+shajrawi@users.noreply.github.com>
Co-authored-by: vllmellm
---
 vllm/envs.py | 47 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)

diff --git a/vllm/envs.py b/vllm/envs.py
index 0445447dd9df0..c40f7e47097ca 100644
--- a/vllm/envs.py
+++ b/vllm/envs.py
@@ -92,6 +92,10 @@
     V_SCALE_CONSTANT: int = 10
     VLLM_SERVER_DEV_MODE: bool = False
     VLLM_V1_OUTPUT_PROC_CHUNK_SIZE: int = 128
+    VLLM_MLA_DISABLE: bool = False
+    VLLM_MLA_PERFORM_MATRIX_ABSORPTION: bool = True
+    VLLM_MLA_DISABLE_REQUANTIZATION: bool = False
+    VLLM_ENABLE_MOE_ALIGN_BLOCK_SIZE_TRITON: bool = False


 def get_default_cache_root():
@@ -580,6 +584,49 @@ def maybe_convert_int(value: Optional[str]) -> Optional[int]:
     lambda: float(os.getenv("VLLM_LOG_BATCHSIZE_INTERVAL", "-1")),
     "VLLM_DISABLE_COMPILE_CACHE":
     lambda: bool(int(os.getenv("VLLM_DISABLE_COMPILE_CACHE", "0"))),
+
+    # If set, vllm will run in development mode, which will enable
+    # some additional endpoints for developing and debugging,
+    # e.g. `/reset_prefix_cache`
+    "VLLM_SERVER_DEV_MODE":
+    lambda: bool(int(os.getenv("VLLM_SERVER_DEV_MODE", "0"))),
+
+    # Controls the maximum number of requests to handle in a
+    # single asyncio task when processing per-token outputs in the
+    # V1 AsyncLLM interface. It is applicable when handling a high
+    # concurrency of streaming requests.
+    # Setting this too high can result in a higher variance of
+    # inter-message latencies. Setting it too low can negatively impact
+    # TTFT and overall throughput.
+    "VLLM_V1_OUTPUT_PROC_CHUNK_SIZE":
+    lambda: int(os.getenv("VLLM_V1_OUTPUT_PROC_CHUNK_SIZE", "128")),
+
+    # If set, vLLM will disable the MLA attention optimizations.
+    "VLLM_MLA_DISABLE":
+    lambda: bool(int(os.getenv("VLLM_MLA_DISABLE", "0"))),
+
+    # Flag that controls whether we perform matrix absorption for MLA decode,
+    # i.e. absorb W_UK into W_Q/W_UK and W_UV into W_O. Absorbing the matrices
+    # reduces the runtime FLOPs needed to compute MLA, but requires storing
+    # more weights (W_Q_UK and W_UV_O), so it can increase memory usage.
+    # This is enabled by default.
+    "VLLM_MLA_PERFORM_MATRIX_ABSORPTION":
+    lambda: bool(int(os.getenv("VLLM_MLA_PERFORM_MATRIX_ABSORPTION", "1"))),
+
+    # When running MLA with matrix absorption enabled and fp8-quantized
+    # weights, we perform the matrix absorption in float32 and then requantize
+    # the absorbed weights back to fp8. This flag disables the requantization
+    # step, instead converting the absorbed matrices to match the activation
+    # type. This can lead to higher memory and compute usage, but better
+    # preserves the accuracy of the original model.
+    "VLLM_MLA_DISABLE_REQUANTIZATION":
+    lambda: bool(int(os.getenv("VLLM_MLA_DISABLE_REQUANTIZATION", "0"))),
+
+    # If set, vLLM will use the Triton implementation of moe_align_block_size,
+    # i.e. moe_align_block_size_triton in fused_moe.py.
+    "VLLM_ENABLE_MOE_ALIGN_BLOCK_SIZE_TRITON":
+    lambda: bool(int(os.getenv("VLLM_ENABLE_MOE_ALIGN_BLOCK_SIZE_TRITON", "0"))
+    ),
 }
 # end-env-vars-definition
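
Reviewer note (not part of the applied diff): as I understand it, vllm/envs.py resolves these variables lazily — each entry maps an env var name to a zero-argument parser lambda, and a module-level __getattr__ (PEP 562) evaluates the lambda when the attribute is read, so a value like envs.VLLM_MLA_DISABLE always reflects the current process environment. Below is a minimal, self-contained sketch of that pattern under those assumptions; the names (_ENV_SKETCH) are stand-ins, not vLLM's actual module.

    import os

    # Stand-in for the environment_variables mapping in vllm/envs.py:
    # each env var name maps to a zero-arg lambda that parses os.environ.
    _ENV_SKETCH = {
        "VLLM_MLA_DISABLE":
        lambda: bool(int(os.getenv("VLLM_MLA_DISABLE", "0"))),
        "VLLM_MLA_PERFORM_MATRIX_ABSORPTION":
        lambda: bool(int(os.getenv("VLLM_MLA_PERFORM_MATRIX_ABSORPTION", "1"))),
    }

    def __getattr__(name: str):
        # PEP 562 module-level __getattr__: re-evaluates the lambda on every
        # attribute access, so the result tracks the current environment.
        if name in _ENV_SKETCH:
            return _ENV_SKETCH[name]()
        raise AttributeError(f"module has no attribute {name!r}")

    if __name__ == "__main__":
        import sys
        os.environ["VLLM_MLA_DISABLE"] = "1"
        this_module = sys.modules[__name__]
        print(this_module.VLLM_MLA_DISABLE)  # -> True

In practice these flags are toggled at launch time, e.g. `VLLM_MLA_DISABLE=1` in the server's environment before starting vLLM.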