This repository has been archived by the owner on Oct 11, 2024. It is now read-only.

Upstream sync 2024 03 24 #143

Merged
merged 197 commits on Mar 26, 2024
Changes from all commits
Commits
197 commits
d7f3964
Update comment (#2934)
ronensc Feb 22, 2024
5574081
Added early stopping to completion APIs (#2939)
Maxusmusti Feb 22, 2024
344020c
Migrate MistralForCausalLM to LlamaForCausalLM (#2868)
esmeetu Feb 22, 2024
95529e3
Use Llama RMSNorm custom op for Gemma (#2974)
WoosukKwon Feb 22, 2024
93dc5a2
chore(vllm): codespell for spell checking (#2820)
mspronesti Feb 22, 2024
fd5dcc5
Optimize GeGLU layer in Gemma (#2975)
WoosukKwon Feb 22, 2024
c530e2c
[FIX] Fix a bug in initializing Yarn RoPE (#2983)
44670 Feb 22, 2024
6f32cdd
Remove Flash Attention in test env (#2982)
WoosukKwon Feb 22, 2024
4caf704
Include tokens from prompt phase in `counter_generation_tokens` (#2802)
ronensc Feb 22, 2024
57f0449
Fix nvcc not found in vllm-openai image (#2781)
zhaoyang-star Feb 22, 2024
f7c1234
[Fix] Fix assertion on YaRN model len (#2984)
WoosukKwon Feb 23, 2024
ef978fe
Port metrics from `aioprometheus` to `prometheus_client` (#2730)
hmellor Feb 25, 2024
70f3e8e
Add LogProbs for Chat Completions in OpenAI (#2918)
jlcmoore Feb 26, 2024
cfc15a1
Optimize Triton MoE Kernel (#2979)
pcmoritz Feb 26, 2024
d6e4a13
[Minor] Remove gather_cached_kv kernel (#3043)
WoosukKwon Feb 26, 2024
d9f726c
[Minor] Remove unused config files (#3039)
esmeetu Feb 27, 2024
c1c0d00
Don't use cupy when `enforce_eager=True` (#3037)
esmeetu Feb 27, 2024
4dd6416
Fix stablelm (#3038)
esmeetu Feb 27, 2024
48a8f4a
Support Orion model (#2539)
dachengai Feb 27, 2024
2410e32
fix `get_ip` error in pure ipv6 environment (#2931)
Jingru Feb 27, 2024
4bd18ec
[Minor] Fix type annotation in fused moe (#3045)
WoosukKwon Feb 27, 2024
e0ade06
Support logit bias for OpenAI API (#3027)
dylanwhawk Feb 27, 2024
8b430d7
[Minor] Fix StableLMEpochForCausalLM -> StableLmForCausalLM (#3046)
WoosukKwon Feb 27, 2024
71bcaf9
Enable GQA support in the prefix prefill kernels (#3007)
sighingnow Feb 27, 2024
a868310
multi-lora documentation fix (#3064)
ElefHead Feb 28, 2024
e46fa5d
Restrict prometheus_client >= 0.18.0 to prevent errors when importing…
AllenDou Feb 28, 2024
3b7178c
[Neuron] Support inference with transformers-neuronx (#2569)
liangfu Feb 28, 2024
929b4f2
Add LoRA support for Gemma (#3050)
WoosukKwon Feb 28, 2024
01a5d18
Add Support for 2/3/8-bit GPTQ Quantization Models (#2330)
chu-tianxiang Feb 29, 2024
a6d471c
Fix: `AttributeError` in OpenAI-compatible server (#3018)
jaywonchung Feb 29, 2024
9289e57
add cache_config's info to prometheus metrics. (#3100)
AllenDou Feb 29, 2024
bfdcfa6
Support starcoder2 architecture (#3089)
sh0416 Feb 29, 2024
2c08ff2
Fix building from source on WSL (#3112)
aliencaocao Feb 29, 2024
29a8d6a
[Fix] Don't deep-copy LogitsProcessors when copying SamplingParams (#…
njhill Feb 29, 2024
703e42e
Add guided decoding for OpenAI API server (#2819)
felixzhu555 Feb 29, 2024
54d3544
Fix: Output text is always truncated in some models (#3016)
HyperdriveHustle Mar 1, 2024
27ca23d
Remove exclude_unset in streaming response (#3143)
sh0416 Mar 1, 2024
49d849b
docs: Add tutorial on deploying vLLM model with KServe (#2586)
terrytangyuan Mar 1, 2024
90fbf12
fix relative import path of protocol.py (#3134)
Huarong Mar 1, 2024
c0c2335
Integrate Marlin Kernels for Int4 GPTQ inference (#2497)
robertgshaw2-redhat Mar 1, 2024
82091b8
Bump up to v0.3.3 (#3129)
WoosukKwon Mar 1, 2024
29e70e3
allow user to choose log level by --log-level instead of fixed 'info'. (#…
AllenDou Mar 1, 2024
baee28c
Reorder kv dtype check to avoid nvcc not found error on AMD platform …
cloudhan Mar 2, 2024
ce4f5a2
Add Automatic Prefix Caching (#2762)
SageMoore Mar 2, 2024
d65fac2
Add vLLM version info to logs and openai API server (#3161)
jasonacox Mar 3, 2024
996d095
[FIX] Fix styles in automatic prefix caching & add a automatic prefix…
zhuohan123 Mar 3, 2024
17c3103
Make it easy to profile workers with nsight (#3162)
pcmoritz Mar 4, 2024
d0fae88
[DOC] add setup document to support neuron backend (#2777)
liangfu Mar 4, 2024
901cf4c
[Minor Fix] Remove unused code in benchmark_prefix_caching.py (#3171)
gty111 Mar 4, 2024
27a7b07
Add document for vllm paged attention kernel. (#2978)
pian13131 Mar 4, 2024
9cbc7e5
enable --gpu-memory-utilization in benchmark_throughput.py (#3175)
AllenDou Mar 4, 2024
76e8a70
[Minor fix] The domain dns.google may cause a socket.gaierror excepti…
ttbachyinsda Mar 4, 2024
22de452
Push logprob generation to LLMEngine (#3065)
Yard1 Mar 4, 2024
ff578ca
Add health check, make async Engine more robust (#3015)
Yard1 Mar 4, 2024
9a4548b
Fix the openai benchmarking requests to work with latest OpenAI apis …
wangchen615 Mar 4, 2024
05af6da
[ROCm] enable cupy in order to enable cudagraph mode for AMD GPUs (#…
hongxiayang Mar 5, 2024
8999ec3
Store `eos_token_id` in `Sequence` for easy access (#3166)
njhill Mar 5, 2024
2efce05
[Fix] Avoid pickling entire LLMEngine for Ray workers (#3207)
njhill Mar 6, 2024
24aecf4
[Tests] Add block manager and scheduler tests (#3108)
rkooo567 Mar 6, 2024
a33ce60
[Testing] Fix core tests (#3224)
cadedaniel Mar 6, 2024
4cb3b92
Add tqdm `dynamic_ncols=True` (#3242)
chujiezheng Mar 6, 2024
d3c04b6
Add GPTQ support for Gemma (#3200)
TechxGenus Mar 7, 2024
cbf4c05
Update requirements-dev.txt to include package for benchmarking scrip…
wangchen615 Mar 7, 2024
2daf23a
Separate attention backends (#3005)
WoosukKwon Mar 7, 2024
385da2d
Measure model memory usage (#3120)
mgoin Mar 7, 2024
8cbba46
Possible fix for conflict between Automated Prefix Caching (#2762) an…
jacobthebanana Mar 7, 2024
b35cc93
Fix auto prefix bug (#3239)
ElizaWszola Mar 8, 2024
d2339d6
Connect engine healthcheck to openai server (#3260)
njhill Mar 8, 2024
c59e120
Feature add lora support for Qwen2 (#3177)
whyiug Mar 8, 2024
1ece1ae
[Minor Fix] Fix comments in benchmark_serving (#3252)
gty111 Mar 8, 2024
99c3cfb
[Docs] Fix Unmocked Imports (#3275)
ywang96 Mar 8, 2024
1cb0cc2
[FIX] Make `flash_attn` optional (#3269)
WoosukKwon Mar 8, 2024
c2c5e09
Move model filelocks from `/tmp/` to `~/.cache/vllm/locks/` dir (#3241)
mgoin Mar 8, 2024
f48c679
[FIX] Fix prefix test error on main (#3286)
zhuohan123 Mar 9, 2024
8437bae
[Speculative decoding 3/9] Worker which speculates, scores, and appli…
cadedaniel Mar 9, 2024
0bba88d
Enhance lora tests with more layer and rank variations (#3243)
tterrysun Mar 10, 2024
e4a28e5
[ROCM] Fix blockReduceSum to use correct warp counts for ROCm and CUD…
dllehr-amd Mar 10, 2024
9e8744a
[BugFix] Fix get tokenizer when using ray (#3301)
esmeetu Mar 11, 2024
4b59f00
[Fix] Fix best_of behavior when n=1 (#3298)
njhill Mar 11, 2024
2f8844b
Re-enable the 80 char line width limit (#3305)
zhuohan123 Mar 11, 2024
657061f
[docs] Add LoRA support information for models (#3299)
pcmoritz Mar 11, 2024
4c92270
Add distributed model executor abstraction (#3191)
zhuohan123 Mar 11, 2024
c9415c1
[ROCm] Fix warp and lane calculation in blockReduceSum (#3321)
kliuae Mar 11, 2024
654865e
Support Mistral Model Inference with transformers-neuronx (#3153)
DAIZHENWEI Mar 11, 2024
b0925b3
docs: Add BentoML deployment doc (#3336)
Sherlock113 Mar 12, 2024
49a3c86
Fixes #1556 double free (#3347)
br3no Mar 13, 2024
602358f
Add kernel for GeGLU with approximate GELU (#3337)
WoosukKwon Mar 13, 2024
b167109
[Fix] Fix quantization="gptq" when using Marlin (#3319)
DreamTeamWangbowen Mar 13, 2024
e221910
add hf_transfer to requirements.txt (#3031)
RonanKMcGovern Mar 13, 2024
ba8dc95
[Minor] Fix bias in if to remove ambiguity (#3259)
hliuca Mar 13, 2024
739c350
[Minor Fix] Use cupy-cuda11x in CUDA 11.8 build (#3256)
chenxu2048 Mar 13, 2024
ae0ccb4
Add missing kernel for CodeLlama-34B on A/H100 (no tensor parallelism…
orsharir Mar 13, 2024
7e9bd08
Add batched RoPE kernel (#3095)
tterrysun Mar 13, 2024
c33afd8
Fix lint (#3388)
Yard1 Mar 13, 2024
eeab52a
[FIX] Simpler fix for async engine running on ray (#3371)
zhuohan123 Mar 13, 2024
81653d9
[Hotfix] [Debug] test_openai_server.py::test_guided_regex_completion …
simon-mo Mar 14, 2024
a37415c
allow user to choose which of vLLM's metrics to display in grafana (#3393)
AllenDou Mar 14, 2024
8fe8386
[Kernel] change benchmark script so that result can be directly used;…
youkaichao Mar 14, 2024
06ec486
Install `flash_attn` in Docker image (#3396)
tdoublep Mar 14, 2024
c17ca8e
Add args for mTLS support (#3410)
declark1 Mar 14, 2024
dfc7740
[issue templates] add some issue templates (#3412)
youkaichao Mar 14, 2024
54be8a0
Fix assertion failure in Qwen 1.5 with prefix caching enabled (#3373)
chenxu2048 Mar 14, 2024
b983ba3
fix marlin config repr (#3414)
qeternity Mar 14, 2024
78b6c48
Dynamically configure shared memory size for moe_align_block_size_ker…
akhoroshev Mar 15, 2024
b522c44
[Misc] add HOST_IP env var (#3419)
youkaichao Mar 15, 2024
21539e6
Add chat templates for Falcon (#3420)
Dinghow Mar 15, 2024
253a980
Add chat templates for ChatGLM (#3418)
Dinghow Mar 15, 2024
429284d
Fix `dist.broadcast` stall without group argument (#3408)
GindaChen Mar 15, 2024
a7c8716
Fix tie_word_embeddings for Qwen2. (#3344)
fyabc Mar 15, 2024
03d37f2
[Fix] Add args for mTLS support (#3430)
declark1 Mar 15, 2024
14b8ae0
Fixes the misuse/mixuse of time.time()/time.monotonic() (#3220)
sighingnow Mar 15, 2024
604f235
[Misc] add error message in non linux platform (#3438)
youkaichao Mar 15, 2024
a7af453
Fix issue templates (#3436)
hmellor Mar 15, 2024
8fa7357
fix document error for value and v_vec illustration (#3421)
laneeeee Mar 15, 2024
fb96c1e
Asynchronous tokenization (#2879)
Yard1 Mar 15, 2024
10585e0
Removed Extraneous Print Message From OAI Server (#3440)
robertgshaw2-redhat Mar 16, 2024
413366e
[Misc] PR templates (#3413)
youkaichao Mar 16, 2024
3123f15
Fixes the incorrect argument in the prefix-prefill test cases (#3246)
sighingnow Mar 16, 2024
14e3f9a
Replace `lstrip()` with `removeprefix()` to fix Ruff linter warning (…
ronensc Mar 16, 2024
cf6ff18
Fix Baichuan chat template (#3340)
Dinghow Mar 16, 2024
ad50bf4
fix lint
simon-mo Mar 16, 2024
8e67598
[Misc] fix line length for entire codebase (#3444)
simon-mo Mar 16, 2024
120157f
Support arbitrary json_object in OpenAI and Context Free Grammar (#3211)
simon-mo Mar 16, 2024
6b78837
Fix setup.py neuron-ls issue (#2671)
simon-mo Mar 16, 2024
abfc4f3
[Misc] Use dataclass for InputMetadata (#3452)
WoosukKwon Mar 17, 2024
93348d9
[CI] Shard tests for LoRA and Kernels to speed up (#3445)
simon-mo Mar 17, 2024
9101d83
[Bugfix] Make moe_align_block_size AMD-compatible (#3470)
WoosukKwon Mar 18, 2024
8c654c0
CI: Add ROCm Docker Build (#2886)
simon-mo Mar 18, 2024
482b0ad
[Testing] Add test_config.py to CI (#3437)
cadedaniel Mar 18, 2024
097aa0e
[CI/Build] Fix Bad Import In Test (#3473)
robertgshaw2-redhat Mar 18, 2024
c0c17d4
[Misc] Fix PR Template (#3478)
zhuohan123 Mar 18, 2024
9fdf3de
Cmake based build system (#2830)
bnellnm Mar 18, 2024
49eedea
[Core] Zero-copy asdict for InputMetadata (#3475)
Yard1 Mar 18, 2024
b30880a
[Misc] Update README for the Third vLLM Meetup (#3479)
zhuohan123 Mar 18, 2024
b37cdce
[Core] Cache some utils (#3474)
Yard1 Mar 19, 2024
6a9c583
[Core] print error before deadlock (#3459)
youkaichao Mar 19, 2024
ef65dcf
[Doc] Add docs about OpenAI compatible server (#3288)
simon-mo Mar 19, 2024
7341c77
[BugFix] Avoid initializing CUDA too early (#3487)
njhill Mar 19, 2024
c614cfe
Update dockerfile with ModelScope support (#3429)
ifsheldon Mar 19, 2024
2a60c9b
[Doc] minor fix to neuron-installation.rst (#3505)
jimburtoft Mar 19, 2024
cc63d03
Revert "[Core] Cache some utils" (#3507)
simon-mo Mar 19, 2024
63e8b28
[Doc] minor fix of spelling in amd-installation.rst (#3506)
jimburtoft Mar 19, 2024
20478c4
Use lru_cache for some environment detection utils (#3508)
simon-mo Mar 19, 2024
9474e89
[PREFIX CACHING FOLLOW UP] A bunch of fixes to block allocator perfor…
ElizaWszola Mar 20, 2024
4ad521d
[Core] Add generic typing to `LRUCache` (#3511)
njhill Mar 20, 2024
5ee1449
[Misc] Remove cache stream and cache events (#3461)
WoosukKwon Mar 20, 2024
84eaa68
Abort when nvcc command is not found in the PATH (#3527)
AllenDou Mar 20, 2024
ba8ae1d
Check for _is_cuda() in compute_num_jobs (#3481)
bnellnm Mar 20, 2024
80e2548
[Bugfix] Fix ROCm support in CMakeLists.txt (#3534)
jamestwhedbee Mar 20, 2024
426ec4e
[1/n] Triton sampling kernel (#3186)
Yard1 Mar 20, 2024
6e435de
[1/n][Chunked Prefill] Refactor input query shapes (#3236)
rkooo567 Mar 20, 2024
f1c0fc3
Migrate `logits` computation and gather to `model_runner` (#3233)
esmeetu Mar 20, 2024
523e30e
[BugFix] Hot fix in setup.py for neuron build (#3537)
zhuohan123 Mar 21, 2024
6ebd02b
[PREFIX CACHING FOLLOW UP] OrderedDict-based evictor (#3431)
ElizaWszola Mar 21, 2024
3bbff9e
Fix 1D query issue from `_prune_hidden_states` (#3539)
rkooo567 Mar 21, 2024
4c07dd2
[🚀 Ready to be merged] Added support for Jais models (#3183)
grandiose-pizza Mar 21, 2024
8657323
[Misc][Log] Add log for tokenizer length not equal to vocabulary size…
esmeetu Mar 21, 2024
c188ecb
[Misc] Bump up transformers to v4.39.0 & Remove StarCoder2Config (#3551)
WoosukKwon Mar 21, 2024
b7050ca
[BugFix] gemma loading after quantization or LoRA. (#3553)
taeminlee Mar 21, 2024
ea5f14e
[Bugfix][Model] Fix Qwen2 (#3554)
esmeetu Mar 22, 2024
e90fc21
[Hardware][Neuron] Refactor neuron support (#3471)
zhuohan123 Mar 22, 2024
f721096
[BugFix] Some fixes for custom allreduce kernels (#2760)
hanzhi713 Mar 22, 2024
cf2f084
Dynamic scheduler delay to improve ITL performance (#3279)
tdoublep Mar 22, 2024
bfdb1ba
[Core] Improve detokenization performance for prefill (#3469)
Yard1 Mar 22, 2024
743a0b7
[Bugfix] use SoftLockFile instead of LockFile (#3578)
kota-iizuka Mar 23, 2024
3c5ab9b
[Misc] Fix BLOOM copyright notice (#3591)
WoosukKwon Mar 24, 2024
f8a12ec
[Misc] Bump transformers version (#3592)
ywang96 Mar 24, 2024
af9e534
[BugFix] Fix Falcon tied embeddings (#3590)
WoosukKwon Mar 24, 2024
17ac306
Merge branch 'upstream-main' into upstream-sync-2024-03-24
afeldman-nm Mar 24, 2024
d3c6ea8
initial merge
afeldman-nm Mar 24, 2024
a828ef3
cleanup benchmark_prefix caching
afeldman-nm Mar 24, 2024
6f6ab1c
cleanup pybind
afeldman-nm Mar 24, 2024
03b78a4
cleanup requirements-dev.txt
afeldman-nm Mar 24, 2024
8c96a1c
cleanup test skip comments
afeldman-nm Mar 24, 2024
119bd05
cleanup model comments
afeldman-nm Mar 24, 2024
018c902
cleanup sampler
afeldman-nm Mar 24, 2024
6844a99
cleanup config
afeldman-nm Mar 24, 2024
474ccb7
fixed block allocator to match upstream (bad merge)
afeldman-nm Mar 24, 2024
ab76a09
cleanup engine args
afeldman-nm Mar 24, 2024
519c6fa
cleanup llm-engine
afeldman-nm Mar 24, 2024
767bf23
cleanup LLM front end
afeldman-nm Mar 24, 2024
8788f27
minor cleanups
afeldman-nm Mar 24, 2024
acd2876
linear
afeldman-nm Mar 24, 2024
23e29a9
various cleanups
afeldman-nm Mar 24, 2024
d6bd5dc
fixed Neuron
afeldman-nm Mar 24, 2024
fa7482a
removed neuron models
afeldman-nm Mar 24, 2024
571bbf7
starcoder tmp fix
afeldman-nm Mar 24, 2024
281e3c5
final neuron fixes
afeldman-nm Mar 24, 2024
2ec44fd
small cleanups
afeldman-nm Mar 24, 2024
a1f583d
fixed BlockSpaceManager
afeldman-nm Mar 24, 2024
4265468
yapf / ruff
afeldman-nm Mar 24, 2024
d696d74
ruff 2
afeldman-nm Mar 24, 2024
a102e13
format
afeldman-nm Mar 24, 2024
476798e
fixed basic correctness failure by running with --forked
afeldman-nm Mar 24, 2024
e973135
fixed tests for nightly
robertgshaw2-redhat Mar 25, 2024
4ce1f87
add nvcc_threads to gha
Mar 26, 2024
8ddab6a
Remove Gi_per_thread arg to nm-build-vllm action
Mar 26, 2024
38 changes: 38 additions & 0 deletions .buildkite/run-amd-test.sh
@@ -0,0 +1,38 @@
# This script builds the ROCm docker image and runs the API server inside the container.
# It serves as a sanity check for compilation and basic model usage.
set -ex

# Print ROCm version
rocminfo

# Try building the docker image
docker build -t rocm -f Dockerfile.rocm .

# Setup cleanup
remove_docker_container() { docker rm -f rocm || true; }
trap remove_docker_container EXIT
remove_docker_container

# Run the image
docker run --device /dev/kfd --device /dev/dri --network host --name rocm rocm python3 -m vllm.entrypoints.api_server &

# Wait for the server to start
wait_for_server_to_start() {
timeout=300
counter=0

while [ "$(curl -s -o /dev/null -w ''%{http_code}'' localhost:8000/health)" != "200" ]; do
sleep 1
counter=$((counter + 1))
if [ $counter -ge $timeout ]; then
echo "Timeout after $timeout seconds"
break
fi
done
}
wait_for_server_to_start

# Test a simple prompt
curl -X POST -H "Content-Type: application/json" \
localhost:8000/generate \
-d '{"prompt": "San Francisco is a"}'
9 changes: 6 additions & 3 deletions .buildkite/test-pipeline.yaml
@@ -28,7 +28,7 @@ steps:
num_gpus: 2 # only support 1 or 2 for now.

- label: Engine Test
command: pytest -v -s engine tokenization test_sequence.py
command: pytest -v -s engine tokenization test_sequence.py test_config.py

- label: Entrypoints Test
command: pytest -v -s entrypoints
@@ -47,7 +47,10 @@ steps:
- pytest -v -s prefix_caching

- label: Samplers Test
command: pytest -v -s samplers --forked
command: pytest -v -s samplers

- label: LogitsProcessor Test
command: pytest -v -s test_logits_processor.py

- label: Worker Test
command: pytest -v -s worker
@@ -56,7 +59,7 @@
command: pytest -v -s spec_decode

- label: LoRA Test %N
command: pytest -v -s lora --forked --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT
command: pytest -v -s lora --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT
parallelism: 4

- label: Metrics Test
5 changes: 5 additions & 0 deletions .buildkite/test-template.j2
@@ -3,6 +3,11 @@
{% set default_working_dir = "/vllm-workspace/tests" %}

steps:
- label: "AMD Test"
agents:
queue: amd
command: bash .buildkite/run-amd-test.sh

- label: ":docker: build image"
commands:
- "docker build --build-arg max_jobs=16 --tag {{ docker_image }} --target test --progress plain ."
14 changes: 9 additions & 5 deletions .github/PULL_REQUEST_TEMPLATE.md
@@ -1,6 +1,14 @@
FILL IN THE PR DESCRIPTION HERE

FIX #xxxx (*link existing issues this PR will resolve*)

**BEFORE SUBMITTING, PLEASE READ THE CHECKLIST BELOW AND FILL IN THE DESCRIPTION ABOVE**

---

<details>
<!-- inside this <details> section, markdown rendering does not work, so we use raw html here. -->
<summary><b> PR Checklist (Click to expand. Please read before submitting.) </b></summary>
<summary><b> PR Checklist (Click to Expand) </b></summary>

<p>Thank you for your contribution to vLLM! Before submitting the pull request, please ensure the PR meets the following criteria. This helps vLLM maintain the code quality and improve the efficiency of the review process.</p>

@@ -53,8 +61,4 @@

</details>

---

Please provide a brief explanation of the motivation behind the PR and the changes it introduces. This helps reviewers understand the context and rationale for the contribution. If possible, please link existing issues this PR will resolve.


3 changes: 0 additions & 3 deletions .github/actions/nm-build-vllm/action.yml
@@ -1,9 +1,6 @@
name: build nm-vllm
description: 'build nm-vllm'
inputs:
Gi_per_thread:
description: 'requested GiB to reserve per thread'
required: true
python:
description: 'python version, e.g. 3.10.12'
required: true
5 changes: 5 additions & 0 deletions .github/actions/nm-set-env/action.yml
@@ -7,6 +7,10 @@ inputs:
Gi_per_thread:
description: 'requested GiB to reserve per thread'
required: true
nvcc_threads:
description: "number of threads nvcc build threads"
type: string
required: true
runs:
using: composite
steps:
@@ -16,6 +20,7 @@ runs:
echo "HF_HOME=/EFS/hf_home" >> $GITHUB_ENV
NUM_THREADS=$(./.github/scripts/determine-threading -G ${{ inputs.Gi_per_thread }})
echo "MAX_JOBS=${NUM_THREADS}" >> $GITHUB_ENV
echo "NVCC_THREADS=${{ inputs.nvcc_threads }}" >> $GITHUB_ENV
echo "VLLM_INSTALL_PUNICA_KERNELS=1" >> $GITHUB_ENV
echo "NCCL_IGNORE_DISABLED_P2P=1" >> $GITHUB_ENV
echo "PYENV_ROOT=/usr/local/apps/pyenv" >> $GITHUB_ENV
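
For context on the new `nvcc_threads` input: the action above only exports `NVCC_THREADS` alongside `MAX_JOBS`; how the build consumes them is not shown in this diff. The sketch below illustrates the usual relationship under that assumption, where `MAX_JOBS` bounds parallel compile jobs and `NVCC_THREADS` is forwarded to `nvcc --threads`.

# Hypothetical sketch, not taken from this PR: estimate total compiler
# threads so MAX_JOBS * NVCC_THREADS stays within the runner's core count.
MAX_JOBS=${MAX_JOBS:-4}
NVCC_THREADS=${NVCC_THREADS:-1}
echo "approx. total compiler threads: $((MAX_JOBS * NVCC_THREADS))"
# Gi_per_thread plays a similar role for memory: the action's
# determine-threading script uses it to derive how many jobs fit in RAM.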
2 changes: 2 additions & 0 deletions .github/scripts/run-tests
@@ -100,6 +100,8 @@ do
coverage run --data-file=.coverage-$(basename ${TEST}) -m pytest --forked --junitxml=${RESULT_XML} ${TEST} || LOCAL_SUCCESS=$?
elif [[ "${TEST}" == *"models_logprobs"* ]]; then
coverage run --data-file=.coverage-$(basename ${TEST}) -m pytest --forked --junitxml=${RESULT_XML} ${TEST} || LOCAL_SUCCESS=$?
elif [[ "${TEST}" == *"basic_correctness"* ]]; then
coverage run --data-file=.coverage-$(basename ${TEST}) -m pytest --forked --junitxml=${RESULT_XML} ${TEST} || LOCAL_SUCCESS=$?
else
coverage run --data-file=.coverage-$(basename ${TEST}) -m pytest --junitxml=${RESULT_XML} ${TEST} || LOCAL_SUCCESS=$?
fi
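
The new `basic_correctness` branch reuses the same `--forked` invocation as the model tests; a rough local equivalent is sketched below, assuming the `--forked` flag comes from the `pytest-forked` plugin and that the tests live under a `tests/basic_correctness` directory (both are assumptions, not read from this diff).

# Hypothetical local reproduction of the CI behavior: run each test in its
# own forked subprocess so CUDA state from one test cannot leak into the next.
pip install pytest-forked
python -m pytest --forked -v tests/basic_correctness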
10 changes: 9 additions & 1 deletion .github/workflows/build-test.yml
@@ -19,6 +19,10 @@ on:
description: 'requested GiB to reserve per thread'
type: string
required: true
nvcc_threads:
description: "number of threads nvcc build threads"
type: string
required: true
python:
description: "python version, e.g. 3.10.12"
type: string
@@ -47,6 +51,10 @@ on:
description: 'requested GiB to reserve per thread'
type: string
required: true
nvcc_threads:
description: "number of threads nvcc build threads"
type: string
required: true
python:
description: "python version, e.g. 3.10.12"
type: string
@@ -79,6 +87,7 @@ jobs:
with:
hf_token: ${{ secrets.NM_HF_TOKEN }}
Gi_per_thread: ${{ inputs.Gi_per_thread }}
nvcc_threads: ${{ inputs.nvcc_threads }}

- name: set python
id: set_python
@@ -111,7 +120,6 @@
id: build
uses: ./.github/actions/nm-build-vllm/
with:
Gi_per_thread: ${{ inputs.Gi_per_thread }}
python: ${{ inputs.python }}
venv: TEST
pypi: ${{ secrets.NM_PRIVATE_PYPI_LOCATION }}
10 changes: 9 additions & 1 deletion .github/workflows/build-whl.yml
@@ -19,6 +19,10 @@ on:
description: 'requested GiB to reserve per thread'
type: string
required: true
nvcc_threads:
description: "number of threads nvcc build threads"
type: string
required: true
python:
description: "python version, e.g. 3.10.12"
type: string
@@ -43,6 +47,10 @@ on:
description: 'requested GiB to reserve per thread'
type: string
required: true
nvcc_threads:
description: "number of threads nvcc build threads"
type: string
required: true
python:
description: "python version, e.g. 3.10.12"
type: string
@@ -76,6 +84,7 @@ jobs:
with:
hf_token: ${{ secrets.NM_HF_TOKEN }}
Gi_per_thread: ${{ inputs.Gi_per_thread }}
nvcc_threads: ${{ inputs.nvcc_threads }}

- name: set python
id: set_python
@@ -101,7 +110,6 @@
id: build
uses: ./.github/actions/nm-build-vllm/
with:
Gi_per_thread: ${{ inputs.Gi_per_thread }}
python: ${{ inputs.python }}
venv: ${{ env.VENV_BUILD_BASE }}
pypi: ${{ secrets.NM_PRIVATE_PYPI_LOCATION }}
1 change: 1 addition & 0 deletions .github/workflows/gen-whl.yml
@@ -20,5 +20,6 @@ jobs:
timeout: 30
gitref: ${{ inputs.gitref }}
Gi_per_thread: 4
nvcc_threads: 8
python: ${{ matrix.python }}
secrets: inherit
5 changes: 5 additions & 0 deletions .github/workflows/nightly.yml
@@ -24,6 +24,7 @@ jobs:
timeout: 240
gitref: ${{ github.ref }}
Gi_per_thread: 4
nvcc_threads: 8
python: 3.10.12
test_skip_list:
secrets: inherit
@@ -35,6 +36,7 @@ jobs:
timeout: 300
gitref: ${{ github.ref }}
Gi_per_thread: 12
nvcc_threads: 1
python: 3.11.4
test_skip_list:
secrets: inherit
@@ -48,6 +50,7 @@
# timeout: 480
# gitref: '${{ github.ref }}'
# Gi_per_thread: 4
# nvcc_threads: 8
# python: "3.10.12"
# # Always push if it is a scheduled job
# push_benchmark_results_to_gh_pages: "${{ github.event_name == 'schedule' || inputs.push_benchmark_results_to_gh_pages }}"
@@ -62,6 +65,7 @@ jobs:
timeout: 720
gitref: '${{ github.ref }}'
Gi_per_thread: 12
nvcc_threads: 1
python: "3.10.12"
# Always push if it is a scheduled job
push_benchmark_results_to_gh_pages: "${{ github.event_name == 'schedule' || inputs.push_benchmark_results_to_gh_pages }}"
@@ -75,5 +79,6 @@ jobs:
timeout: 60
gitref: '${{ github.ref }}'
Gi_per_thread: 12
nvcc_threads: 1
python: "3.10.12"
secrets: inherit
10 changes: 9 additions & 1 deletion .github/workflows/nm-benchmark.yml
@@ -23,6 +23,10 @@ on:
description: 'requested GiB to reserve per thread'
type: string
required: true
nvcc_threads:
description: "number of threads nvcc build threads"
type: string
required: true
python:
description: "python version, e.g. 3.10.12"
type: string
@@ -55,6 +59,10 @@ on:
description: 'requested GiB to reserve per thread'
type: string
required: true
nvcc_threads:
description: "number of threads nvcc build threads"
type: string
required: true
python:
description: "python version, e.g. 3.10.12"
type: string
@@ -89,6 +97,7 @@ jobs:
with:
hf_token: ${{ secrets.NM_HF_TOKEN }}
Gi_per_thread: ${{ inputs.Gi_per_thread }}
nvcc_threads: ${{ inputs.nvcc_threads }}

- name: set python
id: set_python
@@ -107,7 +116,6 @@
id: build
uses: ./.github/actions/nm-build-vllm/
with:
Gi_per_thread: ${{ inputs.Gi_per_thread }}
python: ${{ inputs.python }}
venv: TEST
pypi: ${{ secrets.NM_PRIVATE_PYPI_LOCATION }}
10 changes: 9 additions & 1 deletion .github/workflows/nm-lm-eval-accuracy.yml
@@ -19,6 +19,10 @@ on:
description: 'requested GiB to reserve per thread'
type: string
required: true
nvcc_threads:
description: "number of threads nvcc build threads"
type: string
required: true
python:
description: "python version, e.g. 3.10.12"
type: string
@@ -43,6 +47,10 @@ on:
description: 'requested GiB to reserve per thread'
type: string
required: true
nvcc_threads:
description: "number of threads nvcc build threads"
type: string
required: true
python:
description: "python version, e.g. 3.10.12"
type: string
@@ -68,6 +76,7 @@ jobs:
with:
hf_token: ${{ secrets.NM_HF_TOKEN }}
Gi_per_thread: ${{ inputs.Gi_per_thread }}
nvcc_threads: ${{ inputs.nvcc_threads }}

- name: set python
id: set_python
@@ -86,7 +95,6 @@
id: build
uses: ./.github/actions/nm-build-vllm/
with:
Gi_per_thread: ${{ inputs.Gi_per_thread }}
python: ${{ inputs.python }}
venv: TEST
pypi: ${{ secrets.NM_PRIVATE_PYPI_LOCATION }}
2 changes: 2 additions & 0 deletions .github/workflows/remote-push.yml
@@ -24,6 +24,7 @@ jobs:
timeout: 240
gitref: '${{ github.ref }}'
Gi_per_thread: 4
nvcc_threads: 8
python: ${{ matrix.python }}
test_skip_list: neuralmagic/tests/skip-for-remote-push.txt
secrets: inherit
@@ -37,6 +38,7 @@ jobs:
# timeout: 60
# gitref: '${{ github.ref }}'
# Gi_per_thread: 12
# nvcc_threads: 1
# python: "3.10.12"
# push_benchmark_results_to_gh_pages: "false"
# secrets: inherit