[Frontend] Bad words sampling parameter #5986

Closed. Wants to merge 1,467 commits.
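
The feature proposed here adds a bad-words option to the sampling parameters so that listed phrases are suppressed during generation. A minimal usage sketch follows; the parameter name `bad_words` and its exact semantics are assumptions taken from the PR title, not a description of the merged API:

    # Hypothetical sketch of the proposed bad-words sampling parameter.
    # `bad_words` is assumed here; check SamplingParams in the merged code
    # for the final name and behaviour.
    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
    params = SamplingParams(
        temperature=0.8,
        max_tokens=64,
        bad_words=["forbidden phrase", "another banned phrase"],
    )
    outputs = llm.generate(["Write a short product description."], params)
    print(outputs[0].outputs[0].text)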

Commits (1,467):
083c5a4
[misc] add forward context for attention (#9029)
youkaichao Oct 3, 2024
6a54e28
Fix failing spec decode test (#9054)
sroy745 Oct 3, 2024
f7607ce
[Bugfix] Weight loading fix for OPT model (#9042)
domenVres Oct 3, 2024
5b8f494
[Frontend][Feature] support tool calling for internlm/internlm2_5-7b-…
sydnash Oct 4, 2024
47a230c
[CI/Build] Per file CUDA Archs (improve wheel size and dev build time…
LucasWilkinson Oct 4, 2024
0b6f003
[Misc] Enable multi-step output streaming by default (#9047)
mgoin Oct 4, 2024
1e9547f
[Models] Add remaining model PP support (#7168)
andoorve Oct 4, 2024
8298842
[Misc] Move registry to its own file (#9064)
DarkLight1337 Oct 4, 2024
6a1970b
[Bugfix] Reshape the dimensions of the input image embeddings in Qwen…
whyiug Oct 4, 2024
5e3731c
[Bugfix] Flash attention arches not getting set properly (#9062)
LucasWilkinson Oct 4, 2024
e8f8446
[Model] add a bunch of supported lora modules for mixtral (#9008)
prashantgupta24 Oct 4, 2024
69239fd
Remove AMD Ray Summit Banner (#9075)
simon-mo Oct 4, 2024
8877bd8
[Hardware][PowerPC] Make oneDNN dependency optional for Power (#9039)
varad-ahirwadkar Oct 4, 2024
3959b30
[Core][VLM] Test registration for OOT multimodal models (#8717)
ywang96 Oct 4, 2024
ec7933f
Adds truncate_prompt_tokens param for embeddings creation (#8999)
flaviabeo Oct 4, 2024
35e10d6
[Kernel] Zero point support in fused MarlinMoE kernel + AWQ Fused MoE…
ElizaWszola Oct 4, 2024
07c4528
[CI] Update performance benchmark: upgrade trt-llm to r24.07, and add…
KuntaiDu Oct 4, 2024
94909d4
[Misc] Improved prefix cache example (#9077)
Imss27 Oct 4, 2024
55fe153
[Misc] Add random seed for prefix cache benchmark (#9081)
Imss27 Oct 4, 2024
57c0a9b
[Misc] Fix CI lint (#9085)
comaniac Oct 4, 2024
a9679ac
[Hardware][Neuron] Add on-device sampling support for Neuron (#8746)
chongmni-aws Oct 4, 2024
935804e
[torch.compile] improve allreduce registration (#9061)
youkaichao Oct 4, 2024
034bd03
[Doc] Update README.md with Ray summit slides (#9088)
zhuohan123 Oct 5, 2024
9ed62db
[Bugfix] use blockmanagerv1 for encoder-decoder (#9084)
heheda12345 Oct 5, 2024
a37ac05
[Bugfix] Fixes Phi3v & Ultravox Multimodal EmbeddingInputs (#8979)
hhzhang16 Oct 5, 2024
74c8360
[Model] Support Gemma2 embedding model (#9004)
xyang16 Oct 5, 2024
1677a09
[Bugfix] Deprecate registration of custom configs to huggingface (#9083)
heheda12345 Oct 5, 2024
deae5a4
[Bugfix] Fix order of arguments matters in config.yaml (#8960)
Imss27 Oct 5, 2024
5dda0ee
[core] use forward context for flash infer (#9097)
youkaichao Oct 6, 2024
b3861aa
[Bugfix] Fix try-catch conditions to import correct Flash Attention B…
tjtanaa Oct 6, 2024
28835b8
[Frontend] API support for beam search (#9087)
LunrEclipse Oct 6, 2024
e098267
[Misc] Remove user-facing error for removed VLM args (#9104)
DarkLight1337 Oct 6, 2024
eada6ba
[Model] PP support for embedding models and update docs (#9090)
DarkLight1337 Oct 6, 2024
9c90c1f
[Bugfix] fix tool_parser error handling when serve a model not suppor…
liuyanyi Oct 6, 2024
9db54b6
[Bugfix] Fix incorrect updates to num_computed_tokens in multi-step s…
varun-sundar-rabindranath Oct 6, 2024
2eaac16
[Bugfix][Hardware][CPU] Fix CPU model input for decode (#9044)
Isotr0py Oct 7, 2024
38e22b1
[BugFix][Core] Fix BlockManagerV2 when Encoder Input is None (#9103)
sroy745 Oct 7, 2024
8864bee
[core] remove beam search from the core (#9105)
youkaichao Oct 7, 2024
ccd9a4d
[Model] Explicit interface for vLLM models and support OOT embedding …
DarkLight1337 Oct 7, 2024
7ac3a00
[Hardware][CPU] Cross-attention and Encoder-Decoder models support on…
Isotr0py Oct 7, 2024
b69ab78
[Core] Refactor GGUF parameters packing and forwarding (#8859)
Isotr0py Oct 7, 2024
0a4cef2
[Model] Support NVLM-D and fix QK Norm in InternViT (#9045)
DarkLight1337 Oct 7, 2024
b190ee3
[Doc]: Add deploying_with_k8s guide (#8451)
haitwang-cloud Oct 7, 2024
17322bb
[CI/Build] Add linting for github actions workflows (#7876)
russellb Oct 7, 2024
81c070e
[Doc] Include performance benchmark in README (#9135)
KuntaiDu Oct 7, 2024
63a757d
[misc] fix comment and variable name (#9139)
youkaichao Oct 7, 2024
580e214
Add Slack to README (#9137)
simon-mo Oct 8, 2024
e5df072
[misc] update utils to support comparing multiple settings (#9140)
youkaichao Oct 8, 2024
57ffa4f
[Intel GPU] Fix xpu decode input (#9145)
jikunshang Oct 8, 2024
fc325b8
[misc] improve ux on readme (#9147)
youkaichao Oct 8, 2024
57c0d8a
[Frontend] API support for beam search for MQLLMEngine (#9117)
LunrEclipse Oct 8, 2024
2d2b3e5
[Core][Frontend] Add Support for Inference Time mm_processor_kwargs (…
alex-jw-brooks Oct 8, 2024
9ec7338
[Frontend] Add Early Validation For Chat Template / Tool Call Parser …
alex-jw-brooks Oct 8, 2024
c5fb47a
[CI/Build] Add examples folder into Docker image so that we can lever…
panpan0000 Oct 8, 2024
06442fd
[Bugfix] fix OpenAI API server startup with --disable-frontend-multip…
dtrifiro Oct 8, 2024
0dde95e
[Doc] Update vlm.rst to include an example on videos (#9155)
sayakpaul Oct 8, 2024
7034bd6
[Doc] Improve contributing and installation documentation (#9132)
rafvasq Oct 8, 2024
6b1336f
[Bugfix] Try to handle older versions of pytorch (#9086)
bnellnm Oct 8, 2024
71d567d
mypy: check additional directories (#9162)
russellb Oct 8, 2024
5e51395
Add `lm-eval` directly to requirements-test.txt (#9161)
mgoin Oct 9, 2024
c14dc6b
support bitsandbytes quantization with more models (#9148)
chenqianfzh Oct 9, 2024
11a4694
Add classifiers in setup.py (#9171)
terrytangyuan Oct 9, 2024
c24ede9
Update link to KServe deployment guide (#9173)
terrytangyuan Oct 9, 2024
3c3dd61
[Misc] Improve validation errors around best_of and n (#9167)
tjohnson31415 Oct 9, 2024
3ce9791
[Bugfix][Doc] Report neuron error in output (#9159)
joerowell Oct 9, 2024
cd9d007
[Model] Remap FP8 kv_scale in CommandR and DBRX (#9174)
hliuca Oct 9, 2024
2922569
[Frontend] Log the maximum supported concurrency (#8831)
AlpinDale Oct 9, 2024
87f0017
[Bugfix] fix composite weight loading and EAGLE weight loading (#9160)
DarkLight1337 Oct 9, 2024
42a0732
[ci][test] use load dummy for testing (#9165)
youkaichao Oct 9, 2024
af76c79
[Doc] Fix VLM prompt placeholder sample bug (#9170)
ycool Oct 9, 2024
0bbb127
[Bugfix] Fix lora loading for Compressed Tensors in #9120 (#9179)
fahadh4ilyas Oct 9, 2024
a28a3ff
[Bugfix] Access `get_vocab` instead of `vocab` in tool parsers (#9188)
DarkLight1337 Oct 9, 2024
f56c6e4
Add Dependabot configuration for GitHub Actions updates (#1217)
EwoutH Oct 9, 2024
bd30384
[Hardware][CPU] Support AWQ for CPU backend (#7515)
bigPYJ1151 Oct 9, 2024
63e38f1
[CI/Build] mypy: check vllm/entrypoints (#9194)
russellb Oct 9, 2024
7140021
[CI/Build] Update Dockerfile install+deploy image to ubuntu 22.04 (#9…
mgoin Oct 9, 2024
59a1710
[Core] Fix invalid args to _process_request (#9201)
russellb Oct 10, 2024
bd094b4
[misc] improve model support check in another process (#9208)
youkaichao Oct 10, 2024
6a91dbb
[Bugfix] Fix Weight Loading Multiple GPU Test - Large Models (#9213)
mgoin Oct 10, 2024
701c483
[Bugfix] Machete garbage results for some models (large K dim) (#9212)
LucasWilkinson Oct 10, 2024
793c702
[Core] Add an environment variable which needs to be set explicitly t…
sroy745 Oct 10, 2024
dc1b2fa
[Bugfix] Fix lm_head weights tying with lora for llama (#9227)
Isotr0py Oct 10, 2024
c325b93
[Model] support input image embedding for minicpmv (#9237)
whyiug Oct 10, 2024
4a6e7da
[OpenVINO] Use torch 2.4.0 and newer optimim version (#9121)
ilya-lavrenov Oct 10, 2024
d8ce551
[Bugfix] Fix Machete unittests failing with `NotImplementedError` (#9…
LucasWilkinson Oct 10, 2024
7035233
[Doc] Improve debugging documentation (#9204)
rafvasq Oct 10, 2024
56e08c7
[CI/Build] Make the `Dockerfile.cpu` file's `PIP_EXTRA_INDEX_URL` Co…
jyono Oct 10, 2024
aeb4ab6
Suggest codeowners for the core componenets (#9210)
simon-mo Oct 10, 2024
3c12c6b
[torch.compile] integration with compilation control (#9058)
youkaichao Oct 10, 2024
c9187c1
Bump actions/github-script from 6 to 7 (#9197)
dependabot[bot] Oct 10, 2024
1342ebe
Bump actions/checkout from 3 to 4 (#9196)
dependabot[bot] Oct 10, 2024
8f7d16c
Bump actions/setup-python from 3 to 5 (#9195)
dependabot[bot] Oct 10, 2024
fc849bb
[ci/build] Add placeholder command for custom models test (#9262)
khluu Oct 10, 2024
b6696eb
[torch.compile] generic decorators (#9258)
youkaichao Oct 10, 2024
72a0356
[Doc][Neuron] add note to neuron documentation about resolving triton…
omrishiv Oct 10, 2024
88e7b33
[Misc] Fix sampling from sonnet for long context case (#9235)
Imss27 Oct 11, 2024
10bc4f7
[misc] hide best_of from engine (#9261)
youkaichao Oct 11, 2024
36dffa3
[Misc] Collect model support info in a single process per model (#9233)
DarkLight1337 Oct 11, 2024
9a01341
[Misc][LoRA] Support loading LoRA weights for target_modules in reg f…
jeejeelee Oct 11, 2024
bd742a2
[Bugfix] Fix priority in multiprocessing engine (#9277)
schoennenbeck Oct 11, 2024
77e7923
[Model] Support Mamba (#6484)
tlrmchlsmth Oct 11, 2024
6ead5ae
[Kernel] adding fused moe kernel config for L40S TP4 (#9245)
bringlein Oct 11, 2024
840537a
[Model] Add GLM-4v support and meet vllm==0.6.2 (#9242)
sixsixcoder Oct 11, 2024
2a1d1d0
[Doc] Remove outdated comment to avoid misunderstanding (#9287)
homeffjy Oct 11, 2024
da7dbe4
[Doc] Compatibility matrix for mutual exclusive features (#8512)
wallashss Oct 11, 2024
8a6a9f5
[Bugfix][CI/Build] Fix docker build where CUDA archs < 7.0 are being …
LucasWilkinson Oct 11, 2024
5946c10
[Bugfix] Sets `is_first_step_output` for TPUModelRunner (#9202)
Oct 11, 2024
329630a
[bugfix] fix f-string for error (#9295)
prashantgupta24 Oct 12, 2024
d6a429e
[BugFix] Fix tool call finish reason in streaming case (#9209)
maxdebayser Oct 12, 2024
55040d3
[SpecDec] Remove Batch Expansion (2/3) (#9298)
LiuXiaoxuanPKU Oct 12, 2024
dd2357a
[Bugfix] Fix bug of xformer prefill for encoder-decoder (#9026)
xiangxu-google Oct 12, 2024
530b840
[Misc][Installation] Improve source installation script and doc (#9309)
cermeng Oct 12, 2024
ab33c25
[Bugfix]Fix MiniCPM's LoRA bug (#9286)
jeejeelee Oct 12, 2024
4da6255
[CI] Fix merge conflict (#9317)
LiuXiaoxuanPKU Oct 13, 2024
c1beb48
[Bugfix] Bandaid fix for speculative decoding tests (#9327)
tlrmchlsmth Oct 13, 2024
62bb3a0
[Model] Molmo vLLM Integration (#9016)
mrsalehi Oct 14, 2024
16fcf7a
[Hardware][intel GPU] add async output process for xpu (#8897)
jikunshang Oct 14, 2024
d40342a
[CI/Build] setuptools-scm fixes (#8900)
dtrifiro Oct 14, 2024
cb96c1c
[Docs] Remove PDF build from Readtehdocs (#9347)
simon-mo Oct 14, 2024
a26a8ed
[TPU] Fix TPU SMEM OOM by Pallas paged attention kernel (#9350)
WoosukKwon Oct 14, 2024
33ab19d
[Frontend] merge beam search implementations (#9296)
LunrEclipse Oct 14, 2024
e41c099
[Model] Make llama3.2 support multiple and interleaved images (#9095)
xiangxu-google Oct 14, 2024
33df278
[Bugfix] Clean up some cruft in mamba.py (#9343)
tlrmchlsmth Oct 15, 2024
503b30f
[Frontend] Clarify model_type error messages (#9345)
stevegrubb Oct 15, 2024
3d11424
[Doc] Fix code formatting in spec_decode.rst (#9348)
mgoin Oct 15, 2024
deabda1
[Bugfix] Update InternVL input mapper to support image embeds (#9351)
hhzhang16 Oct 15, 2024
35714af
[BugFix] Fix chat API continuous usage stats (#9357)
njhill Oct 15, 2024
5c477d2
pass ignore_eos parameter to all benchmark_serving calls (#9349)
gracehonv Oct 15, 2024
3525b76
[Misc] Directly use compressed-tensors for checkpoint definitions (#8…
mgoin Oct 15, 2024
16b9ec5
[Bugfix] Fix vLLM UsageInfo and logprobs None AssertionError with emp…
CatherineSue Oct 15, 2024
58a4984
[Bugfix][CI/Build] Fix CUDA 11.8 Build (#9386)
LucasWilkinson Oct 16, 2024
dea7d1a
[Bugfix] Molmo text-only input bug fix (#9397)
mrsalehi Oct 16, 2024
5aad8ae
[Misc] Standardize RoPE handling for Qwen2-VL (#9250)
DarkLight1337 Oct 16, 2024
c913237
[Model] VLM2Vec, the first multimodal embedding model in vLLM (#9303)
DarkLight1337 Oct 16, 2024
c4e2202
[CI/Build] Test VLM embeddings (#9406)
DarkLight1337 Oct 16, 2024
f384d7b
[Core] Rename input data types (#8688)
DarkLight1337 Oct 16, 2024
b2c2008
[Misc] Consolidate example usage of OpenAI client for multimodal mode…
ywang96 Oct 16, 2024
77e7cb4
[Model] Support SDPA attention for Molmo vision backbone (#9410)
Isotr0py Oct 16, 2024
1da1504
Support mistral interleaved attn (#9414)
patrickvonplaten Oct 16, 2024
988f9c1
[Kernel][Model] Improve continuous batching for Jamba and Mamba (#9189)
mzusman Oct 16, 2024
c522a75
[Model][Bugfix] Add FATReLU activation and support for openbmb/MiniCP…
0xjunhao Oct 16, 2024
86678fd
[Performance][Spec Decode] Optimize ngram lookup performance (#9333)
LiuXiaoxuanPKU Oct 16, 2024
23eeec0
[CI/Build] mypy: Resolve some errors from checking vllm/engine (#9267)
russellb Oct 16, 2024
a73c22d
[Bugfix][Kernel] Prevent integer overflow in fp8 dynamic per-token qu…
tlrmchlsmth Oct 16, 2024
2745502
[BugFix] [Kernel] Fix GPU SEGV occurring in int8 kernels (#9391)
rasmith Oct 17, 2024
009001c
Add notes on the use of Slack (#9442)
terrytangyuan Oct 17, 2024
04a88d0
[Kernel] Add Exllama as a backend for compressed-tensors (#9395)
LucasWilkinson Oct 17, 2024
7b5afb1
[Misc] Print stack trace using `logger.exception` (#9461)
DarkLight1337 Oct 17, 2024
1936afd
[misc] CUDA Time Layerwise Profiler (#8337)
LucasWilkinson Oct 17, 2024
a926b02
[Bugfix] Allow prefill of assistant response when using `mistral_comm…
sasha0552 Oct 17, 2024
c2ab3eb
[TPU] Call torch._sync(param) during weight loading (#9437)
WoosukKwon Oct 17, 2024
0819068
[Hardware][CPU] compressed-tensor INT8 W8A8 AZP support (#9344)
bigPYJ1151 Oct 17, 2024
7cd2f07
[Core] Deprecating block manager v1 and make block manager v2 default…
KuntaiDu Oct 17, 2024
8ff70a3
[CI/Build] remove .github from .dockerignore, add dirty repo check (#…
dtrifiro Oct 17, 2024
c67cb17
[Misc] Remove commit id file (#9470)
DarkLight1337 Oct 17, 2024
3ce27ae
[torch.compile] Fine-grained CustomOp enabling mechanism (#9300)
ProExpertProg Oct 17, 2024
304e1dc
[Bugfix] Fix support for dimension like integers and ScalarType (#9299)
bnellnm Oct 17, 2024
b2ff7e3
[Bugfix] Add random_seed to sample_hf_requests in benchmark_serving s…
wukaixingxp Oct 17, 2024
a304e17
[Bugfix] Print warnings related to `mistral_common` tokenizer only on…
sasha0552 Oct 17, 2024
da4769f
[Hardwware][Neuron] Simplify model load for transformers-neuronx libr…
sssrijan-amazon Oct 17, 2024
eaba6f9
Support `BERTModel` (first `encoder-only` embedding model) (#9056)
robertgshaw2-redhat Oct 17, 2024
1caa710
[BugFix] Stop silent failures on compressed-tensors parsing (#9381)
dsikka Oct 18, 2024
41df4b8
[Bugfix][Core] Use torch.cuda.memory_stats() to profile peak memory u…
joerunde Oct 18, 2024
0e060f7
[Qwen2.5] Support bnb quant for Qwen2.5 (#9467)
blueyo0 Oct 18, 2024
6c05266
[CI/Build] Use commit hash references for github actions (#9430)
russellb Oct 18, 2024
daaa21b
[BugFix] Typing fixes to RequestOutput.prompt and beam search (#9473)
njhill Oct 18, 2024
e1bc0b4
[Frontend][Feature] Add jamba tool parser (#9154)
tomeras91 Oct 18, 2024
78c7fc9
[BugFix] Fix and simplify completion API usage streaming (#9475)
njhill Oct 18, 2024
ca487f8
[CI/Build] Fix lint errors in mistral tokenizer (#9504)
DarkLight1337 Oct 18, 2024
a885623
[Bugfix] Fix offline_inference_with_prefix.py (#9505)
tlrmchlsmth Oct 18, 2024
689ba4d
[Misc] benchmark: Add option to set max concurrency (#9390)
russellb Oct 18, 2024
3e03206
[Model] Add user-configurable task for models that support both gener…
DarkLight1337 Oct 18, 2024
fd02b2f
[CI/Build] Add error matching config for mypy (#9512)
russellb Oct 18, 2024
0fa30c6
[Model] Support Pixtral models in the HF Transformers format (#9036)
mgoin Oct 18, 2024
c69a58f
[MISC] Add lora requests to metrics (#9477)
coolkp Oct 18, 2024
fe73700
[MISC] Consolidate cleanup() and refactor offline_inference_with_pref…
comaniac Oct 18, 2024
799f0ee
[Kernel] Add env variable to force flashinfer backend to enable tenso…
tdoublep Oct 19, 2024
1fd083c
[Bugfix] Fix offline mode when using `mistral_common` (#9457)
sasha0552 Oct 19, 2024
2c79351
:bug: fix torch memory profiling (#9516)
joerunde Oct 19, 2024
8a565e9
[Frontend] Avoid creating guided decoding LogitsProcessor unnecessari…
njhill Oct 19, 2024
04376dc
[Doc] update gpu-memory-utilization flag docs (#9507)
joerunde Oct 19, 2024
a9d3d0f
[CI/Build] Add error matching for ruff output (#9513)
russellb Oct 19, 2024
dd26b14
[CI/Build] Configure matcher for actionlint workflow (#9511)
russellb Oct 19, 2024
87c6e2f
[Frontend] Support simpler image input format (#9478)
yue-anyscale Oct 19, 2024
300c884
[Bugfix] Fix missing task for speculative decoding (#9524)
DarkLight1337 Oct 19, 2024
0657139
[Model][Pixtral] Optimizations for input_processor_for_pixtral_hf (#9…
mgoin Oct 19, 2024
a727e6f
[Bugfix] Pass json-schema to GuidedDecodingParams and make test stron…
heheda12345 Oct 20, 2024
1a76a1b
[Model][Pixtral] Use memory_efficient_attention for PixtralHFVision (…
mgoin Oct 20, 2024
61da264
[Kernel] Support sliding window in flash attention backend (#9403)
heheda12345 Oct 20, 2024
89569ba
[Frontend][Misc] Goodput metric support (#9338)
Imss27 Oct 20, 2024
6b01fdf
[CI/Build] Split up decoder-only LM tests (#9488)
DarkLight1337 Oct 21, 2024
564e35b
[Doc] Consistent naming of attention backends (#9498)
tdoublep Oct 21, 2024
e174d8c
[Model] FalconMamba Support (#9325)
dhiaEddineRhaiem Oct 21, 2024
7fee2fa
[Bugfix][Misc]: fix graph capture for decoder (#9549)
yudian0504 Oct 21, 2024
fe64fa6
[BugFix] Use correct python3 binary in Docker.ppc64le entrypoint (#9492)
varad-ahirwadkar Oct 21, 2024
d92f7b8
[Model][Bugfix] Fix batching with multi-image in PixtralHF (#9518)
mgoin Oct 21, 2024
0d38423
[Frontend] Reduce frequency of client cancellation checking (#7959)
njhill Oct 21, 2024
03cb564
[doc] fix format (#9562)
youkaichao Oct 21, 2024
a0a8204
[BugFix] Update draft model TP size check to allow matching target TP…
njhill Oct 21, 2024
9ce062e
[Frontend] Don't log duplicate error stacktrace for every request in …
wallashss Oct 21, 2024
298443f
[CI] Make format checker error message more user-friendly by using em…
KuntaiDu Oct 21, 2024
744b0a1
:bug: Fixup more test failures from memory profiling (#9563)
joerunde Oct 22, 2024
abdff1d
[core] move parallel sampling out from vllm core (#9302)
youkaichao Oct 22, 2024
4e945e9
[Bugfix]: serialize config by value for --trust-remote-code (#6751)
tjohnson31415 Oct 22, 2024
a2753fa
[CI/Build] Remove unnecessary `fork_new_process` (#9484)
DarkLight1337 Oct 22, 2024
082e415
[Bugfix][OpenVINO] fix_dockerfile_openvino (#9552)
ngrozae Oct 22, 2024
1a7d23b
[Bugfix]: phi.py get rope_theta from config file (#9503)
Falko1 Oct 22, 2024
4e54685
[CI/Build] Replaced some models on tests for smaller ones (#9570)
wallashss Oct 22, 2024
43b51fb
[Core] Remove evictor_v1 (#9572)
KuntaiDu Oct 22, 2024
0bdbb81
[Doc] Use shell code-blocks and fix section headers (#9508)
rafvasq Oct 22, 2024
96441ea
support TP in qwen2 bnb (#9574)
chenqianfzh Oct 22, 2024
a6db150
[Hardware][CPU] using current_platform.is_cpu (#9536)
wangshuai09 Oct 22, 2024
60052ed
[V1] Implement vLLM V1 [1/N] (#9289)
WoosukKwon Oct 22, 2024
81ea641
[CI/Build][LoRA] Temporarily fix long context failure issue (#9579)
jeejeelee Oct 22, 2024
eccd891
[Neuron] [Bugfix] Fix neuron startup (#9374)
xendo Oct 22, 2024
2591a30
[Model][VLM] Initialize support for Mono-InternVL model (#9528)
Isotr0py Oct 22, 2024
65c761d
[Bugfix] Eagle: change config name for fc bias (#9580)
gopalsarda Oct 22, 2024
a3fe53d
[Hardware][Intel CPU][DOC] Update docs for CPU backend (#6212)
zhouyuan Oct 22, 2024
08344af
[Frontend] Support custom request_id from request (#9550)
guoyuhong Oct 22, 2024
44f801b
[BugFix] Prevent exporting duplicate OpenTelemetry spans (#9017)
ronensc Oct 22, 2024
1ff6757
[torch.compile] auto infer dynamic_arg_dims from type annotation (#9589)
youkaichao Oct 22, 2024
43b79c7
[Bugfix] fix detokenizer shallow copy (#5919)
aurickq Oct 22, 2024
1da7c87
[Misc] Make benchmarks use EngineArgs (#9529)
JArnoldAMD Oct 22, 2024
e99007c
[Bugfix] Fix spurious "No compiled cutlass_scaled_mm ..." for W8A8 on…
LucasWilkinson Oct 22, 2024
2fa6b8d
[BugFix] Fix metrics error for --num-scheduler-steps > 1 (#8234)
yuleil Oct 22, 2024
17ab832
[Doc]: Update tensorizer docs to include vllm[tensorizer] (#7889)
sethkimmel3 Oct 22, 2024
8d277f2
[Bugfix] Generate exactly input_len tokens in benchmark_throughput (#…
heheda12345 Oct 23, 2024
9e8fbb6
[Misc] Add an env var VLLM_LOGGING_PREFIX, if set, it will be prepend…
sfc-gh-zhwang Oct 23, 2024
847db2c
[Model] Support E5-V (#9576)
DarkLight1337 Oct 23, 2024
1f9fe33
[Build] Fix `FetchContent` multiple build issue (#9596)
ProExpertProg Oct 23, 2024
462b88a
[Hardware][XPU] using current_platform.is_xpu (#9605)
MengqingCao Oct 23, 2024
3c82f66
[Model] Initialize Florence-2 language backbone support (#9555)
Isotr0py Oct 23, 2024
45142e3
[VLM] Enable overriding whether post layernorm is used in vision enco…
DarkLight1337 Oct 23, 2024
715c37a
[Model] Add min_pixels / max_pixels to Qwen2VL as mm_processor_kwargs…
alex-jw-brooks Oct 23, 2024
81ac865
[Bugfix] Fix `_init_vision_model` in NVLM_D model (#9611)
DarkLight1337 Oct 23, 2024
05e8bd1
[misc] comment to avoid future confusion about baichuan (#9620)
youkaichao Oct 23, 2024
c6e8f26
[Bugfix] Fix divide by zero when serving Mamba models (#9617)
tlrmchlsmth Oct 23, 2024
c2a264e
[Misc] Separate total and output tokens in benchmark_throughput.py (#…
mgoin Oct 23, 2024
358a773
[torch.compile] Adding torch compile annotations to some models (#9614)
CRZbulabula Oct 23, 2024
4c8aa4d
[Frontend] Enable Online Multi-image Support for MLlama (#9393)
alex-jw-brooks Oct 23, 2024
c3293da
[Model] Add Qwen2-Audio model support (#9248)
faychu Oct 23, 2024
7d9fb11
[CI/Build] Add bot to close stale issues and PRs (#9436)
russellb Oct 23, 2024
3775ce6
[Bugfix][Model] Fix Mllama SDPA illegal memory access for batched mul…
mgoin Oct 24, 2024
cb6364d
[Bugfix] Use "vision_model" prefix for MllamaVisionModel (#9628)
mgoin Oct 24, 2024
1b6de73
[Bugfix]: Make chat content text allow type content (#9358)
vrdn-23 Oct 24, 2024
7f1a962
[XPU] avoid triton import for xpu (#9440)
yma11 Oct 24, 2024
a0ac193
[Bugfix] Fix PP for ChatGLM and Molmo (#9422)
DarkLight1337 Oct 24, 2024
fa75d40
[V1][Bugfix] Clean up requests when aborted (#9629)
WoosukKwon Oct 24, 2024
50cd76a
[core] simplify seq group code (#9569)
youkaichao Oct 24, 2024
72d8a53
fix code style
Alvant Oct 24, 2024

Files changed (diff from all commits):

35 changes: 21 additions & 14 deletions .buildkite/check-wheel-size.py
@@ -1,36 +1,43 @@
 import os
 import sys
 import zipfile
 
-MAX_SIZE_MB = 200
+# Read the VLLM_MAX_SIZE_MB environment variable, defaulting to 250 MB
+VLLM_MAX_SIZE_MB = int(os.environ.get('VLLM_MAX_SIZE_MB', 250))
 
 
 def print_top_10_largest_files(zip_file):
     """Print the top 10 largest files in the given zip file."""
     with zipfile.ZipFile(zip_file, 'r') as z:
         file_sizes = [(f, z.getinfo(f).file_size) for f in z.namelist()]
         file_sizes.sort(key=lambda x: x[1], reverse=True)
         for f, size in file_sizes[:10]:
-            print(f"{f}: {size/(1024*1024)} MBs uncompressed.")
+            print(f"{f}: {size / (1024 * 1024):.2f} MBs uncompressed.")
 
 
 def check_wheel_size(directory):
     """Check the size of .whl files in the given directory."""
     for root, _, files in os.walk(directory):
-        for f in files:
-            if f.endswith(".whl"):
-                wheel_path = os.path.join(root, f)
-                wheel_size = os.path.getsize(wheel_path)
-                wheel_size_mb = wheel_size / (1024 * 1024)
-                if wheel_size_mb > MAX_SIZE_MB:
-                    print(
-                        f"Wheel {wheel_path} is too large ({wheel_size_mb} MB) "
-                        f"compare to the allowed size ({MAX_SIZE_MB} MB).")
+        for file_name in files:
+            if file_name.endswith(".whl"):
+                wheel_path = os.path.join(root, file_name)
+                wheel_size_mb = os.path.getsize(wheel_path) / (1024 * 1024)
+                if wheel_size_mb > VLLM_MAX_SIZE_MB:
+                    print(f"Not allowed: Wheel {wheel_path} is larger "
+                          f"({wheel_size_mb:.2f} MB) than the limit "
+                          f"({VLLM_MAX_SIZE_MB} MB).")
                     print_top_10_largest_files(wheel_path)
                     return 1
                 else:
                     print(f"Wheel {wheel_path} is within the allowed size "
-                          f"({wheel_size_mb} MB).")
+                          f"({wheel_size_mb:.2f} MB).")
     return 0
 
 
 if __name__ == "__main__":
-    import sys
-    sys.exit(check_wheel_size(sys.argv[1]))
+    if len(sys.argv) < 2:
+        print("Usage: python check-wheel-size.py <directory>")
+        sys.exit(1)
+
+    directory = sys.argv[1]
+    sys.exit(check_wheel_size(directory))
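
With this change the wheel-size limit is configurable at run time through VLLM_MAX_SIZE_MB. A rough local usage sketch (the dist/ path is just an example directory of built wheels):

    # Check wheels in ./dist against a 300 MB limit instead of the 250 MB default.
    VLLM_MAX_SIZE_MB=300 python .buildkite/check-wheel-size.py dist/

If a wheel exceeds the limit, the script prints its ten largest members and exits with status 1.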
18 changes: 0 additions & 18 deletions .buildkite/download-images.sh

This file was deleted.

12 changes: 12 additions & 0 deletions .buildkite/lm-eval-harness/configs/DeepSeek-V2-Lite-Chat.yaml
@@ -0,0 +1,12 @@
# bash ./run-lm-eval-gsm-vllm-baseline.sh -m deepseek-ai/DeepSeek-V2-Lite-Chat -b "auto" -l 1000 -f 5 -t 2
model_name: "deepseek-ai/DeepSeek-V2-Lite-Chat"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.671
  - name: "exact_match,flexible-extract"
    value: 0.664
limit: 1000
num_fewshot: 5
trust_remote_code: True
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m nm-testing/Meta-Llama-3-70B-Instruct-FBGEMM-nonuniform -b auto -l 1000 -f 5
model_name: "nm-testing/Meta-Llama-3-70B-Instruct-FBGEMM-nonuniform"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.905
  - name: "exact_match,flexible-extract"
    value: 0.905
limit: 1000
num_fewshot: 5
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m meta-llama/Meta-Llama-3-70B-Instruct -b 32 -l 250 -f 5
model_name: "meta-llama/Meta-Llama-3-70B-Instruct"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.892
  - name: "exact_match,flexible-extract"
    value: 0.892
limit: 250
num_fewshot: 5
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-W8A8-FP8-Channelwise-compressed-tensors -b auto -l 1000 -f 5 -t 1
model_name: "nm-testing/Meta-Llama-3-8B-Instruct-W8A8-FP8-Channelwise-compressed-tensors"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.752
  - name: "exact_match,flexible-extract"
    value: 0.754
limit: 1000
num_fewshot: 5
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-FBGEMM-nonuniform -b auto -l 1000 -f 5 -t 1
model_name: "nm-testing/Meta-Llama-3-8B-Instruct-FBGEMM-nonuniform"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.753
  - name: "exact_match,flexible-extract"
    value: 0.753
limit: 1000
num_fewshot: 5
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-FP8-compressed-tensors-test -b 32 -l 1000 -f 5 -t 1
model_name: "nm-testing/Meta-Llama-3-8B-FP8-compressed-tensors-test"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.755
  - name: "exact_match,flexible-extract"
    value: 0.755
limit: 1000
num_fewshot: 5
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Meta-Llama-3-8B-Instruct-FP8 -b 32 -l 250 -f 5 -t 1
model_name: "neuralmagic/Meta-Llama-3-8B-Instruct-FP8"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.753
  - name: "exact_match,flexible-extract"
    value: 0.753
limit: 1000
num_fewshot: 5
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Asym-Per-Token-Test -b "auto" -l 250 -f 5 -t 1
model_name: "nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Asym-Per-Token-Test"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.764
  - name: "exact_match,flexible-extract"
    value: 0.764
limit: 250
num_fewshot: 5
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Per-Token-Test -b "auto" -l 250 -f 5 -t 1
model_name: "nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Per-Token-Test"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.728
  - name: "exact_match,flexible-extract"
    value: 0.728
limit: 250
num_fewshot: 5
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-nonuniform-test -b auto -l 1000 -f 5 -t 1
model_name: "nm-testing/Meta-Llama-3-8B-Instruct-nonuniform-test"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.758
  - name: "exact_match,flexible-extract"
    value: 0.759
limit: 1000
num_fewshot: 5
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m meta-llama/Meta-Llama-3-8B-Instruct -b 32 -l 250 -f 5 -t 1
model_name: "meta-llama/Meta-Llama-3-8B-Instruct"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.756
  - name: "exact_match,flexible-extract"
    value: 0.752
limit: 250
num_fewshot: 5
11 changes: 11 additions & 0 deletions .buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-QQQ.yaml
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m HandH1998/QQQ-Llama-3-8b-g128 -b 32 -l 1000 -f 5 -t 1
model_name: "HandH1998/QQQ-Llama-3-8b-g128"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.419
  - name: "exact_match,flexible-extract"
    value: 0.416
limit: 1000
num_fewshot: 5
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8 -b "auto" -l 1000 -f 5 -t 1
model_name: "neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.356
  - name: "exact_match,flexible-extract"
    value: 0.358
limit: 1000
num_fewshot: 5
11 changes: 11 additions & 0 deletions .buildkite/lm-eval-harness/configs/Minitron-4B-Base-FP8.yaml
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m mgoin/Minitron-4B-Base-FP8 -b auto -l 1000 -f 5 -t 1
model_name: "mgoin/Minitron-4B-Base-FP8"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.233
  - name: "exact_match,flexible-extract"
    value: 0.236
limit: 1000
num_fewshot: 5
@@ -0,0 +1,11 @@
# bash ./run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Mixtral-8x22B-Instruct-v0.1-FP8-dynamic -b "auto" -l 250 -f 5 -t 8
model_name: "neuralmagic/Mixtral-8x22B-Instruct-v0.1-FP8-dynamic"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.86
  - name: "exact_match,flexible-extract"
    value: 0.86
limit: 250
num_fewshot: 5
@@ -0,0 +1,11 @@
# bash ./run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8 -b "auto" -l 250 -f 5 -t 4
model_name: "neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.624
  - name: "exact_match,flexible-extract"
    value: 0.624
limit: 250
num_fewshot: 5
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m neuralmagic/Mixtral-8x7B-Instruct-v0.1 -b 32 -l 250 -f 5 -t 4
model_name: "mistralai/Mixtral-8x7B-Instruct-v0.1"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.616
  - name: "exact_match,flexible-extract"
    value: 0.632
limit: 250
num_fewshot: 5
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Qwen2-1.5B-Instruct-FP8W8 -b auto -l 1000 -f 5 -t 1
model_name: "nm-testing/Qwen2-1.5B-Instruct-FP8W8"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.578
  - name: "exact_match,flexible-extract"
    value: 0.585
limit: 1000
num_fewshot: 5
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Qwen2-1.5B-Instruct-quantized.w8a8 -b "auto" -l 1000 -f 5 -t 1
model_name: "neuralmagic/Qwen2-1.5B-Instruct-quantized.w8a8"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.593
  - name: "exact_match,flexible-extract"
    value: 0.588
limit: 1000
num_fewshot: 5
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Qwen2-1.5B-Instruct-W8A16-Channelwise -b "auto" -l 1000 -f 5 -t 1
model_name: "nm-testing/Qwen2-1.5B-Instruct-W8A16-Channelwise"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.595
  - name: "exact_match,flexible-extract"
    value: 0.582
limit: 1000
num_fewshot: 5
11 changes: 11 additions & 0 deletions .buildkite/lm-eval-harness/configs/Qwen2-57B-A14-Instruct.yaml
@@ -0,0 +1,11 @@
# bash ./run-lm-eval-gsm-vllm-baseline.sh -m Qwen/Qwen2-57B-A14B-Instruct -b "auto" -l 250 -f 5 -t 4
model_name: "Qwen/Qwen2-57B-A14B-Instruct"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.792
  - name: "exact_match,flexible-extract"
    value: 0.824
limit: 250
num_fewshot: 5
5 changes: 5 additions & 0 deletions .buildkite/lm-eval-harness/configs/models-large.txt
@@ -0,0 +1,5 @@
Meta-Llama-3-70B-Instruct-FBGEMM-nonuniform.yaml
Meta-Llama-3-70B-Instruct.yaml
Mixtral-8x7B-Instruct-v0.1.yaml
Qwen2-57B-A14-Instruct.yaml
DeepSeek-V2-Lite-Chat.yaml
10 changes: 10 additions & 0 deletions .buildkite/lm-eval-harness/configs/models-small.txt
@@ -0,0 +1,10 @@
Meta-Llama-3-8B-Instruct.yaml
Meta-Llama-3-8B-Instruct-FP8-compressed-tensors.yaml
Meta-Llama-3.2-1B-Instruct-INT8-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-INT8-compressed-tensors-asym.yaml
Meta-Llama-3-8B-Instruct-nonuniform-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-Channelwise-compressed-tensors.yaml
Minitron-4B-Base-FP8.yaml
Qwen2-1.5B-Instruct-INT8-compressed-tensors.yaml
Qwen2-1.5B-Instruct-FP8W8.yaml
Meta-Llama-3-8B-QQQ.yaml
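
Each config above records the expected GSM8k scores for one checkpoint, and the two model lists select which configs run in the small and large lm-eval CI jobs. As a sketch of how such a config might be consumed, here is a hypothetical checker that compares freshly measured lm-eval results against the recorded values within an assumed tolerance (the function, file path, and RTOL below are illustrative, not the repository's actual test harness):

    # Hypothetical checker for the lm-eval baseline configs shown above.
    import yaml

    RTOL = 0.05  # assumed relative tolerance

    def check_config(path, measured):
        """measured maps metric name -> freshly measured value."""
        with open(path) as f:
            cfg = yaml.safe_load(f)
        for task in cfg["tasks"]:
            for metric in task["metrics"]:
                expected = metric["value"]
                got = measured[metric["name"]]
                assert abs(got - expected) <= RTOL * expected, (
                    f"{cfg['model_name']} {task['name']}/{metric['name']}: "
                    f"expected {expected}, got {got}")

    # Example with the values recorded for Meta-Llama-3-8B-Instruct above.
    check_config("Meta-Llama-3-8B-Instruct.yaml",
                 {"exact_match,strict-match": 0.756,
                  "exact_match,flexible-extract": 0.752})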
46 changes: 46 additions & 0 deletions .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh
@@ -0,0 +1,46 @@
#!/bin/bash
# We can use this script to compute baseline accuracy on GSM for transformers.
#
# Make sure you have lm-eval-harness installed:
# pip install lm-eval==0.4.4

usage() {
    echo
    echo "Runs lm eval harness on GSM8k using huggingface transformers."
    echo "This pathway is intended to be used to create baselines for "
    echo "our automated nm-test-accuracy workflow"
    echo
    echo "usage: ${0} <options>"
    echo
    echo "  -m    - huggingface stub or local directory of the model"
    echo "  -b    - batch size to run the evaluation at"
    echo "  -l    - limit number of samples to run"
    echo "  -f    - number of fewshot samples to use"
    echo
}

while getopts "m:b:l:f:" OPT; do
  case ${OPT} in
    m )
        MODEL="$OPTARG"
        ;;
    b )
        BATCH_SIZE="$OPTARG"
        ;;
    l )
        LIMIT="$OPTARG"
        ;;
    f )
        FEWSHOT="$OPTARG"
        ;;
    \? )
        usage
        exit 1
        ;;
  esac
done

lm_eval --model hf \
  --model_args pretrained=$MODEL,parallelize=True \
  --tasks gsm8k --num_fewshot $FEWSHOT --limit $LIMIT \
  --batch_size $BATCH_SIZE
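
For reference, the Meta-Llama-3-8B-Instruct baseline above was produced by this kind of invocation, mirroring the comment at the top of its config (the -t tensor-parallel flag seen in some config comments belongs to the vLLM variant of the script; this HF baseline script does not accept it):

    bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh \
      -m meta-llama/Meta-Llama-3-8B-Instruct -b 32 -l 250 -f 5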