From 87b3c5664cbab08f0a165416dab0e13bfd2167ee Mon Sep 17 00:00:00 2001
From: TJian
Date: Thu, 6 Feb 2025 00:28:26 +0800
Subject: [PATCH] [Bug Fix] Missing vllm.envs (#405)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* [Model] Initialize support for Deepseek-VL2 models (#11578)
  Signed-off-by: Isotr0py <2037008807@qq.com>
  Co-authored-by: Cyrus Leung
* [Hardware][CPU] Multi-LoRA implementation for the CPU backend (#11100)
  Signed-off-by: Akshat Tripathi
  Signed-off-by: Oleg Mosalov
  Signed-off-by: Jee Jee Li
  Co-authored-by: Oleg Mosalov
  Co-authored-by: Jee Jee Li
  Co-authored-by: Isotr0py <2037008807@qq.com>
* [Hardware][TPU] workaround fix for MoE on TPU (#11764)
* [V1][Core][1/n] Logging and Metrics (#11962)
  Signed-off-by: rshaw@neuralmagic.com
* [Model] Support GGUF models newly added in `transformers` 4.46.0 (#9685)
  Signed-off-by: Isotr0py <2037008807@qq.com>
  Co-authored-by: Cyrus Leung
* [V1] [2/n] Logging and Metrics - `OutputProcessor` Abstraction (#11973)
  Signed-off-by: rshaw@neuralmagic.com
* [MISC] fix typo in kv transfer send recv test (#11983)
* [Bug] Fix usage of `.transpose()` and `.view()` consecutively. (#11979)
* [CI][Spec Decode] fix: broken test for EAGLE model (#11972)
  Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
* [Misc] Fix Deepseek V2 fp8 kv-scale remapping (#11947)
  Signed-off-by: Yida Wu
* [Misc] Minor Changes about Worker (#11555)
  Signed-off-by: Chenguang Li <757486878@qq.com>
* [platform] add ray_device_key (#11948)
  Signed-off-by: youkaichao
* Fix Max Token ID for Qwen-VL-Chat (#11980)
  Signed-off-by: Alex-Brooks
* [Kernel] unified_attention for Attention.forward (#11967)
  Signed-off-by: Chen Zhang
* [Doc][V1] Update model implementation guide for V1 support (#11998)
  Signed-off-by: Roger Wang
  Co-authored-by: Cyrus Leung
* [Doc] Organise installation documentation into categories and tabs (#11935)
  Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [platform] add device_control env var (#12009)
  Signed-off-by: youkaichao
* [Platform] Move get_punica_wrapper() function to Platform (#11516)
  Signed-off-by: Shanshan Shen <467638484@qq.com>
* bugfix: Fix signature mismatch in benchmark's `get_tokenizer` function (#11982)
  Signed-off-by: elijah
* [Doc] Fix build from source and installation link in README.md (#12013)
  Signed-off-by: Yikun
* Using list
* [Bugfix] Fix deepseekv3 gate bias error (#12002)
  Signed-off-by: mgoin
  Co-authored-by: mgoin
* Revert "[misc] improve memory profiling (#11809)"
  This reverts commit 889e662eae19fe8f30469883c6854ee4df4315a9.
* Multi-lingual P3L (#356)
  * Committing the *multilingual* P3L test.
  * Created a *multi-lingual* P3L test.
  * Making ruff happy.
  * .
  * Added a reference to the language-scripture Confluence table.
  * Typo fixing.
  * Harmonizing naming.
  * Fixing comments in the header.
  ---------
  Co-authored-by: Alexei V. Ivanov
  Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>
* Trying to make scales work with compileable attention
* [Docs] Add Sky Computing Lab to project intro (#12019)
  Signed-off-by: Woosuk Kwon
* [HPU][Bugfix] set_forward_context and CI test execution (#12014)
  Signed-off-by: Konrad Zawora
* [Doc] Update Quantization Hardware Support Documentation (#12025)
  Signed-off-by: tjtanaa
  Co-authored-by: tjtanaa
* [HPU][misc] add comments for explanation (#12034)
  Signed-off-by: youkaichao
* [Bugfix] Fix various bugs in multi-modal processor (#12031)
  Signed-off-by: DarkLight1337
* [Kernel] Revert the API change of Attention.forward (#12038)
  Signed-off-by: Chen Zhang
* [Platform] Add output for Attention Backend (#11981)
  Signed-off-by: wangxiyuan
* [Bugfix][Kernel] Give unique name to BlockSparseFlashAttention (#12040)
  Signed-off-by: Chen Zhang
* Explain where the engine args go when using Docker (#12041)
  Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* Docs lint
* [Doc]: Update the Json Example of the `Engine Arguments` document (#12045)
* [Misc] Merge bitsandbytes_stacked_params_mapping and packed_modules_mapping (#11924)
  Signed-off-by: Jee Jee Li
* [Kernel] Support MulAndSilu (#11624)
  Signed-off-by: Jee Jee Li
* [HPU][Bugfix] Don't use /dev/accel/accel0 for HPU autodetection in setup.py (#12046)
  Signed-off-by: Konrad Zawora
* [Platform] move current_memory_usage() into platform (#11369)
  Signed-off-by: Shanshan Shen <467638484@qq.com>
* [V1][BugFix] Fix edge case in VLM scheduling (#12065)
  Signed-off-by: Woosuk Kwon
* [Misc] Add multistep chunked-prefill support for FlashInfer (#10467)
* [core] Turn off GPU communication overlap for Ray executor (#12051)
  Signed-off-by: Rui Qiao
* [core] platform agnostic executor via collective_rpc (#11256)
  Signed-off-by: youkaichao
* [Doc] Update examples to remove SparseAutoModelForCausalLM (#12062)
  Signed-off-by: Kyle Sayers
* [V1][Prefix Cache] Move the logic of num_computed_tokens into KVCacheManager (#12003)
* Fix: cases with empty sparsity config (#12057)
  Signed-off-by: Rahul Tuli
* Type-fix: make execute_model output type optional (#12020)
* [Platform] Do not raise error if _Backend is not found (#12023)
  Signed-off-by: wangxiyuan
  Signed-off-by: Mengqing Cao
  Co-authored-by: Mengqing Cao
* [Model]: Support internlm3 (#12037)
* Misc: allow to use proxy in `HTTPConnection` (#12042)
  Signed-off-by: Yuan Zhou
* [Misc][Quark] Upstream Quark format to VLLM (#10765)
  Signed-off-by: kewang-xlnx
  Signed-off-by: kewang2
  Co-authored-by: kewang2
  Co-authored-by: Michael Goin
* [Doc]: Update `OpenAI-Compatible Server` documents (#12082)
* [Bugfix] use right truncation for non-generative tasks (#12050)
  Signed-off-by: Joe Runde
* [V1][Core] Autotune encoder cache budget (#11895)
  Signed-off-by: Roger Wang
* [Bugfix] Fix _get_lora_device for HQQ marlin (#12090)
  Signed-off-by: Varun Sundar Rabindranath
  Co-authored-by: Varun Sundar Rabindranath
* Allow hip sources to be directly included when compiling for rocm. (#12087)
* [Core] Default to using per_token quantization for fp8 when cutlass is supported. (#8651)
  Signed-off-by: mgoin
  Co-authored-by: Michael Goin
  Co-authored-by: mgoin
* [Doc] Add documentation for specifying model architecture (#12105)
* Various cosmetic/comment fixes (#12089)
  Signed-off-by: mgoin
* [Bugfix] Remove hardcoded `head_size=256` for Deepseek v2 and v3 (#12067)
  Signed-off-by: Isotr0py <2037008807@qq.com>
* Support torchrun and SPMD-style offline inference (#12071)
  Signed-off-by: youkaichao
* [core] LLM.collective_rpc interface and RLHF example (#12084)
  Signed-off-by: youkaichao
* [Bugfix] Fix max image feature size for Llava-one-vision (#12104)
  Signed-off-by: Roger Wang
* Enable user marker for vllm profiling (#357)
  * Enable user marker for vllm profiling
  ---------
  Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>
* [misc] Add LoRA kernel micro benchmarks (#11579)
* [Model] Add support for deepseek-vl2-tiny model (#12068)
  Signed-off-by: Isotr0py <2037008807@qq.com>
* Deepseek V3 support (#364)
  * Changing the hard coded datatype to see if it's enough for the model to work
  * Picking the upstream moe kernel version
  * make upstream fix for v3 also works for rocm v2
  * Conditional fnuz dtype
  * Requantizing from fn to fnuz
  * Requantizing moe as well
  * Actually requantizing moe weights
  * Conditional requantization and assert on padding in block quant
  * Format
  ---------
  Co-authored-by: charlifu
* [Bugfix] Set enforce_eager automatically for mllama (#12127)
  Signed-off-by: Chen Zhang
* [Bugfix] Fix a path bug in disaggregated prefill example script. (#12121)
  Signed-off-by: Kuntai Du
* [CI] add genai-perf benchmark in nightly benchmark (#10704)
  Signed-off-by: Kunshang Ji
* [Doc] Add instructions on using Podman when SELinux is active (#12136)
  Signed-off-by: Yuan Tang
* [Bugfix] Fix issues in CPU build Dockerfile (#12135)
  Signed-off-by: Yuan Tang
* [BugFix] add more `is not None` check in VllmConfig.__post_init__ (#12138)
  Signed-off-by: Chen Zhang
* [Misc] Add deepseek_vl2 chat template (#12143)
  Signed-off-by: Isotr0py <2037008807@qq.com>
* [ROCm][MoE] moe tuning support for rocm (#12049)
  Signed-off-by: Divakar Verma
* [V1] Move more control of kv cache initialization from model_executor to EngineCore (#11960)
  Signed-off-by: Chen Zhang
  Co-authored-by: Cody Yu
* [Misc][LoRA] Improve the readability of LoRA error messages (#12102)
  Signed-off-by: Jee Jee Li
* [CI/Build][CPU][Bugfix] Fix CPU CI (#12150)
  Signed-off-by: jiang1.li
* [core] allow callable in collective_rpc (#12151)
  Signed-off-by: youkaichao
* [Bugfix] Fix score api for missing max_model_len validation (#12119)
  Signed-off-by: Wallas Santos
* [Bugfix] Mistral tokenizer encode accept list of str (#12149)
  Signed-off-by: Kunshang Ji
* [AMD][FP8] Using MI300 FP8 format on ROCm for block_quant (#12134)
  Signed-off-by: Gregory Shtrasberg
* [torch.compile] disable logging when cache is disabled (#12043)
  Signed-off-by: youkaichao
* [misc] fix cross-node TP (#12166)
  Signed-off-by: youkaichao
* [AMD][CI/Build][Bugfix] use pytorch stale wheel (#12172)
  Signed-off-by: hongxyan
* [core] further polish memory profiling (#12126)
  Signed-off-by: youkaichao
* [Docs] Fix broken link in SECURITY.md (#12175)
  Signed-off-by: Russell Bryant
* [Model] Port deepseek-vl2 processor, remove dependency (#12169)
  Signed-off-by: Isotr0py <2037008807@qq.com>
* [core] clean up executor class hierarchy between v1 and v0 (#12171)
  Signed-off-by: youkaichao
* [Misc] Support register quantization method out-of-tree (#11969)
* [V1] Collect env var for usage stats (#12115)
* [BUGFIX] Move scores to float32 in case of running xgrammar on cpu (#12152)
  Signed-off-by: Michal Adamczyk
* [Bugfix] Fix multi-modal processors for transformers 4.48 (#12187)
* [torch.compile] store inductor compiled Python file (#12182)
  Signed-off-by: youkaichao
* benchmark_serving support --served-model-name param (#12109)
  Signed-off-by: zibai
  Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
* [Misc] Add BNB support to GLM4-V model (#12184)
  Signed-off-by: Isotr0py <2037008807@qq.com>
* [V1] Add V1 support of Qwen2-VL (#12128)
  Signed-off-by: Roger Wang
  Signed-off-by: DarkLight1337
  Co-authored-by: imkero
  Co-authored-by: DarkLight1337
* [Model] Support for fairseq2 Llama (#11442)
  Signed-off-by: Martin Gleize
  Co-authored-by: mgleize user
* [Bugfix] Fix num_heads value for simple connector when tp enabled (#12074)
  Signed-off-by: Shangming Cai
* [torch.compile] fix sym_tensor_indices (#12191)
  Signed-off-by: youkaichao
* Move linting to `pre-commit` (#11975)
  Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [DOC] Fix typo in docstring and assert message (#12194)
  Signed-off-by: Yuan Tang
* [DOC] Add missing docstring in LLMEngine.add_request() (#12195)
  Signed-off-by: Yuan Tang
* [Bugfix] Fix incorrect types in LayerwiseProfileResults (#12196)
  Signed-off-by: Yuan Tang
* [Model] Add Qwen2 PRM model support (#12202)
  Signed-off-by: Isotr0py <2037008807@qq.com>
* [Core] Interface for accessing model from `VllmRunner` (#10353)
  Signed-off-by: DarkLight1337
* [misc] add placeholder format.sh (#12206)
  Signed-off-by: youkaichao
* [CI/Build] Remove dummy CI steps (#12208)
  Signed-off-by: DarkLight1337
* [CI/Build] Make pre-commit faster (#12212)
  Signed-off-by: DarkLight1337
* [Model] Upgrade Aria to transformers 4.48 (#12203)
  Signed-off-by: DarkLight1337
* [misc] print a message to suggest how to bypass commit hooks (#12217)
  Signed-off-by: youkaichao
* [core][bugfix] configure env var during import vllm (#12209)
  Signed-off-by: youkaichao
* [V1] Remove `_get_cache_block_size` (#12214)
  Signed-off-by: Chen Zhang
* [Misc] Pass `attention` to impl backend (#12218)
  Signed-off-by: wangxiyuan
* [Bugfix] Fix `HfExampleModels.find_hf_info` (#12223)
  Signed-off-by: DarkLight1337
* [CI] Pass local python version explicitly to pre-commit mypy.sh (#12224)
  Signed-off-by: Chen Zhang
* Using ROCm6.3.1 base docker and building hipblas-common (#366)
* [Misc] Update CODEOWNERS (#12229)
* fix: update platform detection for M-series arm based MacBook processors (#12227)
  Signed-off-by: isikhi
* [misc] add cuda runtime version to usage data (#12190)
  Signed-off-by: youkaichao
  Co-authored-by: Roger Wang
* [bugfix] catch xgrammar unsupported array constraints (#12210)
  Signed-off-by: Jason Cheng
* [Kernel] optimize moe_align_block_size for cuda graph and large num_experts (e.g. DeepSeek-V3) (#12222)
  Signed-off-by: Jinzhen Lin
  Co-authored-by: Michael Goin
  Co-authored-by: Tyler Michael Smith
* Add quantization and guided decoding CODEOWNERS (#12228)
  Signed-off-by: mgoin
* [AMD][Build] Porting dockerfiles from the ROCm/vllm fork (#11777)
  Signed-off-by: Gregory Shtrasberg
* [BugFix] Fix GGUF tp>1 when vocab_size is not divisible by 64 (#12230)
  Signed-off-by: NickLucche
* [ci/build] disable failed and flaky tests (#12240)
  Signed-off-by: youkaichao
* [Misc] Rename `MultiModalInputsV2 -> MultiModalInputs` (#12244)
  Signed-off-by: DarkLight1337
* [Misc] Add BNB quantization for PaliGemmaForConditionalGeneration (#12237)
  Signed-off-by: Jee Jee Li
* [Misc] Remove redundant TypeVar from base model (#12248)
  Signed-off-by: DarkLight1337
* [Bugfix] Fix mm_limits access for merged multi-modal processor (#12252)
  Signed-off-by: DarkLight1337
* [torch.compile] transparent compilation with more logging (#12246)
  Signed-off-by: youkaichao
* [V1][Bugfix] Fix data item ordering in mixed-modality inference (#12259)
  Signed-off-by: Roger Wang
* Remove pytorch comments for outlines + compressed-tensors (#12260)
  Signed-off-by: Thomas Parnell
* [Platform] improve platforms getattr (#12264)
  Signed-off-by: Mengqing Cao
* [ci/build] update nightly torch for gh200 test (#12270)
  Signed-off-by: youkaichao
* [Bugfix] fix race condition that leads to wrong order of token returned (#10802)
  Signed-off-by: Jannis Schönleber
* [Kernel] fix moe_align_block_size error condition (#12239)
  Signed-off-by: Jinzhen Lin
* [v1][stats][1/n] Add RequestStatsUpdate and RequestStats types (#10907)
  Signed-off-by: rickyx
* [Bugfix] Multi-sequence broken (#11898)
  Signed-off-by: Andy Lo
* [Misc] Remove experimental dep from tracing.py (#12007)
  Signed-off-by: Adrian Cole
* [Misc] Set default backend to SDPA for get_vit_attn_backend (#12235)
  Signed-off-by: wangxiyuan
* [Core] Free CPU pinned memory on environment cleanup (#10477)
* Update pre-commit.yml (#374)
  * Update pre-commit.yml
  * Reapplying missing format
  * New codespell exclude location
  ---------
  Co-authored-by: Kevin H. Luu
* [bugfix] moe tuning. rm is_navi() (#12273)
  Signed-off-by: Divakar Verma
* [BUGFIX] When skip_tokenize_init and multistep are set, execution crashes (#12277)
  Signed-off-by: maleksan85
  Co-authored-by: maleksan85
* [Documentation][AMD] Add information about prebuilt ROCm vLLM docker for perf validation purpose (#12281)
  Signed-off-by: Hongxia Yang
* [VLM] Simplify post-processing of replacement info (#12269)
  Signed-off-by: DarkLight1337
* [ci/lint] Add back default arg for pre-commit (#12279)
  Signed-off-by: kevin
* [CI] add docker volume prune to neuron CI (#12291)
  Signed-off-by: Liangfu Chen
* [Ci/Build] Fix mypy errors on main (#12296)
  Signed-off-by: DarkLight1337
* [Benchmark] More accurate TPOT calc in `benchmark_serving.py` (#12288)
  Signed-off-by: Nick Hill
* [core] separate builder init and builder prepare for each batch (#12253)
  Signed-off-by: youkaichao
* [Build] update requirements of no-device (#12299)
  Signed-off-by: Mengqing Cao
* [Core] Support fully transparent sleep mode (#11743)
  Signed-off-by: youkaichao
* [VLM] Avoid unnecessary tokenization (#12310)
  Signed-off-by: DarkLight1337
* [Model][Bugfix]: correct Aria model output (#12309)
  Signed-off-by: xffxff <1247714429@qq.com>
* [Bugfix][VLM] Fix mixed-modality inference backward compatibility for V0 (#12313)
  Signed-off-by: Roger Wang
* [Doc] Add docs for prompt replacement (#12318)
  Signed-off-by: DarkLight1337
* [Misc] Fix the error in the tip for the --lora-modules parameter (#12319)
  Signed-off-by: wangerxiao <863579016@qq.com>
* [Misc] Improve the readability of BNB error messages (#12320)
  Signed-off-by: Jee Jee Li
* Skip tokenize/detokenize when it is disabled by arg --skip-tokenizer-init (#367)
  * switching detokenize flag to be False
  * detokenize = False for benchmarks
  * restoring default in main vllm code for detokenize
  * removing extra spaces
  * moving detokenize to flag
  * adding support for token ids
  ---------
  Co-authored-by: maleksan85
* [Bugfix] Fix HPU multiprocessing executor (#12167)
  Signed-off-by: Konrad Zawora
* [Core] Support `reset_prefix_cache` (#12284)
* [Frontend][V1] Online serving performance improvements (#12287)
* [AMD][Quantization] Add TritonScaledMMLinearKernel since int8 is broken for AMD (#12282)
  Signed-off-by: Randall Smith
* FP8 FA fixes (#381)
  * FP8 FA fixes
    Summary: Add missing clamp and fix reciprocal scale computation.
  * linter
* Returning the use of the proper stream in allreduce (#382)
* [Bugfix] Fixing AMD LoRA CI test. (#12329)
  Signed-off-by: Alexei V. Ivanov
* [Docs] Update FP8 KV Cache documentation (#12238)
  Signed-off-by: mgoin
  Co-authored-by: Cyrus Leung
* [Docs] Document vulnerability disclosure process (#12326)
  Signed-off-by: Russell Bryant
* [V1] Add `uncache_blocks` (#12333)
* [doc] explain common errors around torch.compile (#12340)
  Signed-off-by: youkaichao
* [Hardware][Gaudi][BugFix] Fix dataclass error due to triton package update (#12338)
  Signed-off-by: zhenwei
* [Bugfix] Fix k_proj's bias for whisper self attention (#12342)
  Signed-off-by: Isotr0py <2037008807@qq.com>
* [Kernel] Flash Attention 3 Support (#12093)
  Signed-off-by: Lucas Wilkinson
* [Doc] Troubleshooting errors during model inspection (#12351)
  Signed-off-by: DarkLight1337
* [V1] Simplify M-RoPE (#12352)
  Signed-off-by: Roger Wang
  Co-authored-by: imkero
* [Bugfix] Fix broken internvl2 inference with v1 (#12360)
  Signed-off-by: Isotr0py <2037008807@qq.com>
* [core] add wake_up doc and some sanity check (#12361)
  Signed-off-by: youkaichao
* [torch.compile] decouple compile sizes and cudagraph sizes (#12243)
  Signed-off-by: youkaichao
* [FP8][Kernel] Dynamic kv cache scaling factors computation (#11906)
  Signed-off-by: Gregory Shtrasberg
  Co-authored-by: Micah Williamson
* [TPU] Update TPU CI to use torchxla nightly on 20250122 (#12334)
  Signed-off-by: Siyuan Liu
* [Docs] Document Phi-4 support (#12362)
  Signed-off-by: Isotr0py <2037008807@qq.com>
* [BugFix] Fix parameter names and `process_after_weight_loading` for W4A16 MoE Group Act Order (#11528)
  Signed-off-by: ElizaWszola
  Co-authored-by: ElizaWszola
  Co-authored-by: Michael Goin
* [Misc] Fix OpenAI API Compatibility Issues in Benchmark Script (#12357)
  Signed-off-by: Junichi Sato
* [Docs] Add meetup slides (#12345)
  Signed-off-by: Woosuk Kwon
* Using pytorch commit past the point when rowwise PR (https://github.com/pytorch/pytorch/pull/144432) was merged (#384)
* [Docs] Update spec decode + structured output in compat matrix (#12373)
  Signed-off-by: Russell Bryant
* [V1][Frontend] Coalesce bunched `RequestOutput`s (#12298)
  Signed-off-by: Nick Hill
  Co-authored-by: Robert Shaw
* Set weights_only=True when using torch.load() (#12366)
  Signed-off-by: Russell Bryant
* [Bugfix] Path join when building local path for S3 clone (#12353)
  Signed-off-by: Omer Dayan (SW-GPU)
* Update compressed-tensors version (#12367)
* [V1] Increase default batch size for H100/H200 (#12369)
  Signed-off-by: Woosuk Kwon
* [perf] fix perf regression from #12253 (#12380)
  Signed-off-by: youkaichao
* [Misc] Use VisionArena Dataset for VLM Benchmarking (#12389)
  Signed-off-by: Roger Wang
* [ci/build] fix wheel size check (#12396)
  Signed-off-by: youkaichao
* [Hardware][Gaudi][Doc] Add missing step in setup instructions (#12382)
* [ci/build] sync default value for wheel size (#12398)
  Signed-off-by: youkaichao
* [Misc] Enable proxy support in benchmark script (#12356)
  Signed-off-by: Junichi Sato
* [Bugfix][Kernel] Fix CUDA 11.8 being broken by FA3 build (#12375)
  Signed-off-by: Lucas Wilkinson
* Applying scales rename to fp8 config (#387)
* [Misc] Remove deprecated code (#12383)
  Signed-off-by: DarkLight1337
* [Bugfix][Kernel] FA3 Fix - RuntimeError: This flash attention build only supports pack_gqa (for build size reasons). (#12405)
  Signed-off-by: Lucas Wilkinson
* Dev-docker Documentation Updates (#378)
  * Dev-docker Documentation Updates
    Minor updates to several sections, with links to other documents where appropriate.
  * Fix formatting of GEMM filename
  * README cleanup
    - Reorder some sections of the README to make them easier to follow
    - Improve formatting of bash commands
    - Prefer use of huggingface model names instead of hard-coded directories
    - Clean up wording
  * Expanded sample commands for Latency and Throughput
  * Fix markdown links
  * Fix pre-commit errors
  * Updates from review
    Initial updates to incorporate feedback from a review session held with @t-parry
  * Update script args to match current recommendations
  * Remove recommended max-num-seqs values for now
  ---------
  Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>
* [Bugfix][Kernel] Fix moe align block issue for mixtral (#12413)
* [Bugfix] Fix BLIP-2 processing (#12412)
  Signed-off-by: DarkLight1337
* [ROCm][MoE] MI300 tuned configs Mixtral-8x(7B,22B) | fp16, fp8 (#12408)
  Signed-off-by: Divakar Verma
* [Misc] Add FA2 support to ViT MHA layer (#12355)
  Signed-off-by: Isotr0py <2037008807@qq.com>
* [TPU][CI] Update torchxla version in requirement-tpu.txt (#12422)
  Signed-off-by: Siyuan Liu
* [Misc][Bugfix] FA3 support to ViT MHA layer (#12435)
  Signed-off-by: Roger Wang
  Signed-off-by: Isotr0py <2037008807@qq.com>
  Co-authored-by: Isotr0py <2037008807@qq.com>
* [V1][Perf] Reduce scheduling overhead in model runner after cuda sync (#12094)
  Signed-off-by: Keyun Tong
* [V1][Bugfix] Fix assertion when mm hashing is turned off (#12439)
  Signed-off-by: Roger Wang
* [Misc] Revert FA on ViT #12355 and #12435 (#12445)
* [Frontend] generation_config.json for maximum tokens (#12242)
  Signed-off-by: Matthew Hendrey
  Signed-off-by: Shangming Cai
  Signed-off-by: youkaichao
  Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
  Signed-off-by: Yuan Tang
  Signed-off-by: Isotr0py <2037008807@qq.com>
  Signed-off-by: DarkLight1337
  Signed-off-by: Chen Zhang
  Signed-off-by: wangxiyuan
  Co-authored-by: shangmingc
  Co-authored-by: youkaichao
  Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
  Co-authored-by: Yuan Tang
  Co-authored-by: Isotr0py
  Co-authored-by: Cyrus Leung
  Co-authored-by: Chen Zhang
  Co-authored-by: wangxiyuan
* [Bugfix] Disable w16a16 2of4 sparse CompressedTensors24 (#12417)
  Signed-off-by: Tyler Michael Smith
  Co-authored-by: mgoin
* [Bugfix/CI] Fix broken kernels/test_mha.py (#12450)
* [Bugfix][Kernel] Fix perf regression caused by PR #12405 (#12434)
  Signed-off-by: Lucas Wilkinson
* [Build/CI] Fix libcuda.so linkage (#12424)
  Signed-off-by: Tyler Michael Smith
* [Frontend] Rerank API (Jina- and Cohere-compatible API) (#12376)
  Signed-off-by: Kyle Mistele
* [DOC] Add link to vLLM blog (#12460)
  Signed-off-by: Yuan Tang
* [V1] Avoid list creation in input preparation (#12457)
  Signed-off-by: Woosuk Kwon
* [Frontend] Support scores endpoint in run_batch (#12430)
  Signed-off-by: Pooya Davoodi
* [Bugfix] Fix Granite 3.0 MoE model loading (#12446)
  Signed-off-by: DarkLight1337
* [Bugfix] Fix missing seq_start_loc in xformers prefill metadata (#12464)
  Signed-off-by: Isotr0py <2037008807@qq.com>
* [V1][Minor] Minor optimizations for update_from_output (#12454)
  Signed-off-by: Woosuk Kwon
* [Bugfix] Fix gpt2 GGUF inference (#12467)
  Signed-off-by: Isotr0py <2037008807@qq.com>
* [Build] Only build 9.0a for scaled_mm and sparse kernels (#12339)
  Signed-off-by: Lucas Wilkinson
* [V1][Metrics] Add initial Prometheus logger (#12416)
  Signed-off-by: Mark McLoughlin
* [V1][CI/Test] Do basic test for top-p & top-k sampling (#12469)
  Signed-off-by: Woosuk Kwon
* [FlashInfer] Upgrade to 0.2.0 (#11194)
  Signed-off-by: Bowen Wang
  Signed-off-by: youkaichao
  Co-authored-by: youkaichao
* Support FP8 FA from Quark format (#388)
  * Support FP8 FA from Quark format
  * Support FP8 FA from Quark format
  * nit: update comment
  * Direct call on ROCm
* 20250127 docs update (#392)
  * updating code blocks
  * typo
  * updated manifest
  * Including feedback
  * whitespace
  * Deepseek instructions
  * hyperlink fix
  * hyperlink fix
  * updating what is new
  * cpx update
  * typo
  * whitespace
  * whitespace
* Faster Custom Paged Attention kernels (#372)
  * integrate new cpa kernel, update tests and benchmark
  * added comments to mfma4 kernel
  * further comments for mfma16 kernel
  * clang-format
  * Lint
  * add flag for logits rtz conversion and disable by default
  * lint
* [Bugfix]: Fix paged attention unit tests of https://github.com/ROCm/vllm/pull/372 (#389)
  * [Bugfix]: fix paged attention tests based on the updated kernels in
    `csrc/attention/paged_attention_v1.cu`, `csrc/attention/paged_attention_v2.cu`
    and `csrc/rocm/attention.cu`.
  * improve code documentation.
  * lint
  ---------
  Co-authored-by: vllmellm
  ---------
  Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>
  Co-authored-by: Gregory Shtrasberg
  Co-authored-by: Joe Shajrawi <17753158+shajrawi@users.noreply.github.com>
  Co-authored-by: TJian
  Co-authored-by: vllmellm
* Using a more precise profiling on ROCm to properly account for weights padding (#394)
* Update Dockerfile.rocm
* [Bugfix]: include the env variables required for running FastSyncLLM
  Signed-off-by: vllmellm
* fix pre-commit lint
  Signed-off-by: vllmellm
* [Bugfix] included missing environment variable
  Signed-off-by: vllmellm

---------

Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: Akshat Tripathi
Signed-off-by: Oleg Mosalov
Signed-off-by: Jee Jee Li
Signed-off-by: rshaw@neuralmagic.com
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
Signed-off-by: Yida Wu
Signed-off-by: Chenguang Li <757486878@qq.com>
Signed-off-by: youkaichao
Signed-off-by: Alex-Brooks
Signed-off-by: Chen Zhang
Signed-off-by: Roger Wang
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Shanshan Shen <467638484@qq.com>
Signed-off-by: elijah
Signed-off-by: Yikun
Signed-off-by: mgoin
Signed-off-by: Woosuk Kwon
Signed-off-by: Konrad Zawora
Signed-off-by: tjtanaa
Signed-off-by: DarkLight1337
Signed-off-by: wangxiyuan
Signed-off-by: yisheng
Signed-off-by: Abatom
Signed-off-by: Liangfu Chen
Signed-off-by: Russell Bryant
Signed-off-by: Yuan Zhou
Signed-off-by: Sourashis Roy
Signed-off-by: Nishidha Panpaliya
Signed-off-by: Ilya Lavrenov
Signed-off-by: simon-mo
Signed-off-by: Wallas Santos
Signed-off-by: jiang1.li
Signed-off-by: yan ma
Signed-off-by: Randall Smith
Signed-off-by: Max de Bayser
Signed-off-by: Maxime Fournioux <55544262+mfournioux@users.noreply.github.com>
Signed-off-by: Ye Qi
Signed-off-by: Mengqing Cao
Signed-off-by: Joe Runde
Signed-off-by: Kunshang Ji
Signed-off-by: Kuntai Du
Signed-off-by: Ren MinMin
Signed-off-by: Travis Johnson
Signed-off-by: Fred Reiss
Signed-off-by: shaochangxu.scx
Signed-off-by: NickLucche
Signed-off-by: Rafael Vasquez
Signed-off-by: Rui Qiao
Signed-off-by: Kyle Sayers
Signed-off-by: Rahul Tuli
Signed-off-by: kewang-xlnx
Signed-off-by: kewang2
Signed-off-by: Varun Sundar Rabindranath
Signed-off-by: Yuan Tang
Signed-off-by: Divakar Verma
Signed-off-by: Gregory Shtrasberg
Signed-off-by: hongxyan
Signed-off-by: Michal Adamczyk
Signed-off-by: zibai
Signed-off-by: Martin Gleize
Signed-off-by: Shangming Cai
Signed-off-by: isikhi
Signed-off-by: Jason Cheng
Signed-off-by: Jinzhen Lin
Signed-off-by: Thomas Parnell
Signed-off-by: Jannis Schönleber
Signed-off-by: rickyx
Signed-off-by: Andy Lo
Signed-off-by: Adrian Cole
Signed-off-by: maleksan85
Signed-off-by: Hongxia Yang
Signed-off-by: kevin
Signed-off-by: Nick Hill
Signed-off-by: xffxff <1247714429@qq.com>
Signed-off-by: wangerxiao <863579016@qq.com>
Signed-off-by: Alexei V. Ivanov
Signed-off-by: zhenwei
Signed-off-by: Lucas Wilkinson
Signed-off-by: Siyuan Liu
Signed-off-by: ElizaWszola
Signed-off-by: Junichi Sato
Signed-off-by: Omer Dayan (SW-GPU)
Signed-off-by: Keyun Tong
Signed-off-by: Matthew Hendrey
Signed-off-by: Tyler Michael Smith
Signed-off-by: Kyle Mistele
Signed-off-by: Pooya Davoodi
Signed-off-by: Mark McLoughlin
Signed-off-by: Bowen Wang
Signed-off-by: vllmellm
Co-authored-by: Isotr0py
Co-authored-by: Cyrus Leung
Co-authored-by: Akshat Tripathi
Co-authored-by: Oleg Mosalov
Co-authored-by: Jee Jee Li
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Avshalom Manevich <12231371+avshalomman@users.noreply.github.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
Co-authored-by: Yangcheng Li
Co-authored-by: Siyuan Li <94890248+liaoyanqing666@users.noreply.github.com>
Co-authored-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
Co-authored-by: Concurrensee
Co-authored-by: Chenguang Li <757486878@qq.com>
Co-authored-by: youkaichao
Co-authored-by: Alex Brooks
Co-authored-by: Chen Zhang
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: Cyrus Leung
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Shanshan Shen <467638484@qq.com>
Co-authored-by: elijah <30852919+e1ijah1@users.noreply.github.com>
Co-authored-by: Yikun Jiang
Co-authored-by: Gregory Shtrasberg
Co-authored-by: Steve Luo <36296769+SunflowerAries@users.noreply.github.com>
Co-authored-by: mgoin
Co-authored-by: Alexei-V-Ivanov-AMD <156011006+Alexei-V-Ivanov-AMD@users.noreply.github.com>
Co-authored-by: Alexei V. Ivanov
Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>
Co-authored-by: Woosuk Kwon
Co-authored-by: Konrad Zawora
Co-authored-by: wangxiyuan
Co-authored-by: maang-h <55082429+maang-h@users.noreply.github.com>
Co-authored-by: YiSheng5
Co-authored-by: Zhonghua Deng
Co-authored-by: Liangfu Chen
Co-authored-by: XiaobingZhang
Co-authored-by: Russell Bryant
Co-authored-by: Yuan
Co-authored-by: jiangjiadi <34134495+jiangjiadi@users.noreply.github.com>
Co-authored-by: jiadi.jjd
Co-authored-by: sroy745 <142070531+sroy745@users.noreply.github.com>
Co-authored-by: Jie Fu (傅杰)
Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com>
Co-authored-by: WangErXiao <863579016@qq.com>
Co-authored-by: Nishidha
Co-authored-by: Ilya Lavrenov
Co-authored-by: Simon Mo
Co-authored-by: Wallas Henrique
Co-authored-by: Li, Jiang
Co-authored-by: Yan Ma
Co-authored-by: rasmith
Co-authored-by: Tyler Michael Smith
Co-authored-by: Maximilien de Bayser
Co-authored-by: Maxime Fournioux <55544262+mfournioux@users.noreply.github.com>
Co-authored-by: Guspan Tanadi <36249910+guspan-tanadi@users.noreply.github.com>
Co-authored-by: Ye (Charlotte) Qi
Co-authored-by: yeq
Co-authored-by: Mengqing Cao
Co-authored-by: Charles Frye
Co-authored-by: Joe Runde
Co-authored-by: Kunshang Ji
Co-authored-by: cennn <61925104+cennn@users.noreply.github.com>
Co-authored-by: Kuntai Du
Co-authored-by: minmin
Co-authored-by: Ren MinMin
Co-authored-by: Travis Johnson
Co-authored-by: Fred Reiss
Co-authored-by: shaochangxu <85155497+shaochangxu@users.noreply.github.com>
Co-authored-by: shaochangxu.scx
Co-authored-by: Nicolò Lucchesi
Co-authored-by: sixgod
Co-authored-by: Rafael Vasquez
Co-authored-by: Elfie Guo <164945471+elfiegg@users.noreply.github.com>
Co-authored-by: Rui Qiao <161574667+ruisearch42@users.noreply.github.com>
Co-authored-by: Kyle Sayers
Co-authored-by: Rahul Tuli
Co-authored-by: Keyun Tong
Co-authored-by: RunningLeon
Co-authored-by: kewang-xlnx <73578509+kewang-xlnx@users.noreply.github.com>
Co-authored-by: kewang2
Co-authored-by: Varun Sundar Rabindranath
Co-authored-by: Varun Sundar Rabindranath
Co-authored-by: tvirolai-amd
Co-authored-by: Michael Goin
Co-authored-by: Zhaoyi Li <36555117+Lzy17@users.noreply.github.com>
Co-authored-by: charlifu
Co-authored-by: Yuan Tang
Co-authored-by: Cody Yu
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com>
Co-authored-by: yancong <32220263+ice-tong@users.noreply.github.com>
Co-authored-by: Michal Adamczyk
Co-authored-by: gujing <925973396@qq.com>
Co-authored-by: imkero
Co-authored-by: Martin Gleize
Co-authored-by: mgleize user
Co-authored-by: shangmingc
Co-authored-by: Işık <41375111+isikhi@users.noreply.github.com>
Co-authored-by: Roger Wang
Co-authored-by: Cheng Kuan Yong Jason
Co-authored-by: Jinzhen Lin
Co-authored-by: Thomas Parnell
Co-authored-by: Jannis Schönleber
Co-authored-by: Ricky Xu
Co-authored-by: Andy Lo
Co-authored-by: Adrian Cole <64215+codefromthecrypt@users.noreply.github.com>
Co-authored-by: Jani Monoses
Co-authored-by: Kevin H. Luu
Co-authored-by: Aleksandr Malyshev <164964928+maleksan85@users.noreply.github.com>
Co-authored-by: maleksan85
Co-authored-by: Nick Hill
Co-authored-by: zhou fan <1247714429@qq.com>
Co-authored-by: ilia-cher <30845429+ilia-cher@users.noreply.github.com>
Co-authored-by: liuzhenwei
Co-authored-by: Lucas Wilkinson
Co-authored-by: Micah Williamson
Co-authored-by: Siyuan Liu
Co-authored-by: Dipika Sikka
Co-authored-by: ElizaWszola
Co-authored-by: Junichi Sato
Co-authored-by: Robert Shaw
Co-authored-by: omer-dayan
Co-authored-by: Mohit Deopujari
Co-authored-by: Jeremy Arnold <103538711+JArnoldAMD@users.noreply.github.com>
Co-authored-by: Matthew Hendrey
Co-authored-by: Kyle Mistele
Co-authored-by: Pooya Davoodi
Co-authored-by: Mark McLoughlin
Co-authored-by: Bowen Wang
Co-authored-by: Bowen Bao
Co-authored-by: arakowsk-amd <182798202+arakowsk-amd@users.noreply.github.com>
Co-authored-by: sanyalington
Co-authored-by: Joe Shajrawi <17753158+shajrawi@users.noreply.github.com>
Co-authored-by: vllmellm
---
 vllm/envs.py | 47 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)

diff --git a/vllm/envs.py b/vllm/envs.py
index 0445447dd9df0..c40f7e47097ca 100644
--- a/vllm/envs.py
+++ b/vllm/envs.py
@@ -92,6 +92,10 @@
     V_SCALE_CONSTANT: int = 10
     VLLM_SERVER_DEV_MODE: bool = False
     VLLM_V1_OUTPUT_PROC_CHUNK_SIZE: int = 128
+    VLLM_MLA_DISABLE: bool = False
+    VLLM_MLA_PERFORM_MATRIX_ABSORPTION: bool = True
+    VLLM_MLA_DISABLE_REQUANTIZATION: bool = False
+    VLLM_ENABLE_MOE_ALIGN_BLOCK_SIZE_TRITON: bool = False


 def get_default_cache_root():
@@ -580,6 +584,49 @@ def maybe_convert_int(value: Optional[str]) -> Optional[int]:
     lambda: float(os.getenv("VLLM_LOG_BATCHSIZE_INTERVAL", "-1")),
     "VLLM_DISABLE_COMPILE_CACHE":
     lambda: bool(int(os.getenv("VLLM_DISABLE_COMPILE_CACHE", "0"))),
+
+    # If set, vllm will run in development mode, which will enable
+    # some additional endpoints for developing and debugging,
+    # e.g. `/reset_prefix_cache`
+    "VLLM_SERVER_DEV_MODE":
+    lambda: bool(int(os.getenv("VLLM_SERVER_DEV_MODE", "0"))),
+
+    # Controls the maximum number of requests to handle in a
+    # single asyncio task when processing per-token outputs in the
+    # V1 AsyncLLM interface. It is applicable when handling a high
+    # concurrency of streaming requests.
+    # Setting this too high can result in a higher variance of
+    # inter-message latencies. Setting it too low can negatively impact
+    # TTFT and overall throughput.
+    "VLLM_V1_OUTPUT_PROC_CHUNK_SIZE":
+    lambda: int(os.getenv("VLLM_V1_OUTPUT_PROC_CHUNK_SIZE", "128")),
+
+    # If set, vLLM will disable the MLA attention optimizations.
+    "VLLM_MLA_DISABLE":
+    lambda: bool(int(os.getenv("VLLM_MLA_DISABLE", "0"))),
+
+    # Flag that controls whether we perform matrix absorption for MLA decode,
+    # i.e. absorb W_UK into W_Q/W_UK and W_UV into W_O. Absorbing the matrices
+    # reduces the runtime FLOPs needed to compute MLA, but requires storing
+    # more weights (W_Q_UK and W_UV_O), so it can increase memory usage.
+    # This is enabled by default.
+    "VLLM_MLA_PERFORM_MATRIX_ABSORPTION":
+    lambda: bool(int(os.getenv("VLLM_MLA_PERFORM_MATRIX_ABSORPTION", "1"))),
+
+    # When running MLA with matrix absorption enabled and fp8-quantized
+    # weights, we perform the matrix absorption in float32 and then requantize
+    # the absorbed weights back to fp8. This flag disables the requantization
+    # step, instead converting the absorbed matrices to match the activation
+    # type. This can lead to higher memory and compute usage, but better
+    # preserves the accuracy of the original model.
+    "VLLM_MLA_DISABLE_REQUANTIZATION":
+    lambda: bool(int(os.getenv("VLLM_MLA_DISABLE_REQUANTIZATION", "0"))),
+
+    # If set, vLLM will use the Triton implementation of moe_align_block_size,
+    # i.e. moe_align_block_size_triton in fused_moe.py.
+    "VLLM_ENABLE_MOE_ALIGN_BLOCK_SIZE_TRITON":
+    lambda: bool(int(os.getenv("VLLM_ENABLE_MOE_ALIGN_BLOCK_SIZE_TRITON", "0"))
+    ),
 }
 # end-env-vars-definition
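
Reviewer note (not part of the applied diff): as I understand it, vllm/envs.py resolves these variables lazily — each entry maps an env var name to a zero-argument parser lambda, and a module-level __getattr__ (PEP 562) evaluates the lambda when the attribute is read, so a value like envs.VLLM_MLA_DISABLE always reflects the current process environment. Below is a minimal, self-contained sketch of that pattern under those assumptions; the names (_ENV_SKETCH) are stand-ins, not vLLM's actual module.

    import os

    # Stand-in for the environment_variables mapping in vllm/envs.py:
    # each env var name maps to a zero-arg lambda that parses os.environ.
    _ENV_SKETCH = {
        "VLLM_MLA_DISABLE":
        lambda: bool(int(os.getenv("VLLM_MLA_DISABLE", "0"))),
        "VLLM_MLA_PERFORM_MATRIX_ABSORPTION":
        lambda: bool(int(os.getenv("VLLM_MLA_PERFORM_MATRIX_ABSORPTION", "1"))),
    }

    def __getattr__(name: str):
        # PEP 562 module-level __getattr__: re-evaluates the lambda on every
        # attribute access, so the result tracks the current environment.
        if name in _ENV_SKETCH:
            return _ENV_SKETCH[name]()
        raise AttributeError(f"module has no attribute {name!r}")

    if __name__ == "__main__":
        import sys
        os.environ["VLLM_MLA_DISABLE"] = "1"
        this_module = sys.modules[__name__]
        print(this_module.VLLM_MLA_DISABLE)  # -> True

In practice these flags are toggled at launch time, e.g. `VLLM_MLA_DISABLE=1` in the server's environment before starting vLLM.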