[pull] main from vllm-project:main #19
Conversation
Co-authored-by: Lei Wen <[email protected]>
Signed-off-by: Prashant Gupta <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
Co-authored-by: Philipp Moritz <[email protected]>
Co-authored-by: Woosuk Kwon <[email protected]>
Co-authored-by: mgoin <[email protected]>
Co-authored-by: Tyler Michael Smith <[email protected]>
Co-authored-by: Cody Yu <[email protected]>
…int (#3467) Co-authored-by: Lily Liu <[email protected]> Co-authored-by: Cyrus Leung <[email protected]>
…#4494) Co-authored-by: Simon Mo <[email protected]>
… obtain the CUDA version. (#4173) Signed-off-by: AnyISalIn <[email protected]>
Signed-off-by: Travis Johnson <[email protected]>
Co-authored-by: Lei Wen <[email protected]>
Co-authored-by: Lei Wen <[email protected]>
Co-authored-by: Sage Moore <[email protected]>
This PR updates the tuning script for the fused_moe kernel to support FP8 and also adds configurations for TP4. Note that for the configuration I removed num_warps and num_stages for small batch sizes, since that improved performance and brought the small-batch numbers back on par with the previous configuration, making this a strict improvement over the status quo.

All the numbers below are for mistralai/Mixtral-8x7B-Instruct-v0.1, 1000 input and 50 output tokens, with static activation scaling.

| QPS | ITL before (ms) | E2E latency before (s) | ITL after (ms) | E2E latency after (s) |
|-----|-----------------|------------------------|----------------|-----------------------|
| 1   | 9.8             | 0.49                   | 9.8            | 0.49                  |
| 2   | 9.7             | 0.49                   | 9.7            | 0.49                  |
| 4   | 10.1            | 0.52                   | 10.2           | 0.53                  |
| 6   | 11.9            | 0.59                   | 11.9           | 0.59                  |
| 8   | 14.0            | 0.70                   | 11.9           | 0.59                  |
| 10  | 15.7            | 0.79                   | 12.1           | 0.61                  |
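For reference, a minimal sketch of what a fused_moe tuning config entry looks like, assuming the usual JSON layout keyed by batch size M (the values below are made up for illustration, not the ones added in this PR):

```python
# Illustrative fused_moe config entries, keyed by batch size M.
# For small M the num_warps / num_stages overrides are omitted, so the kernel
# falls back to Triton's defaults -- the change described above.
fused_moe_config = {
    "1":  {"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 32, "BLOCK_SIZE_K": 64, "GROUP_SIZE_M": 1},
    "16": {"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 64, "BLOCK_SIZE_K": 64, "GROUP_SIZE_M": 1},
    "1024": {
        "BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 64,
        "GROUP_SIZE_M": 16, "num_warps": 8, "num_stages": 4,
    },
}
```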
Remove the device="cuda" declarations in mixtral as promised in #4343
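A hedged before/after sketch of that kind of change (identifiers are illustrative, not the exact Mixtral code): dropping the hard-coded device lets the tensor be created on whatever device the model is placed on.

```python
import torch

num_experts, hidden_size = 8, 4096

# before: the buffer was pinned to CUDA at construction time
# w = torch.empty(num_experts, hidden_size, device="cuda")

# after: create on the default/current device and move it with the module
w = torch.empty(num_experts, hidden_size)
```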
…n is not 1 and max_tokens is large & Add tests for preemption (#4451)
Co-authored-by: Cade Daniel <[email protected]>
Co-authored-by: Michael Goin <[email protected]>
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Hi @pull[bot]. Thanks for your PR. I'm waiting for an opendatahub-io member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/ok-to-test
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: pull[bot], z103cb. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Update dockerfile.ubi to build vllm using wheels! I had to update some `__init__.py` files since we need those packages to be picked up when building the wheel for vllm.

### Integration tests

https://v3.travis.ibm.com/github/ai-foundation/fmaas-inference-server/builds/17962397

Image pushed to quay for testing:

```
quay.io/wxpe/tgis-vllm:release-vllm-wheel.eec7a7b
```

<img width="1020" alt="Screenshot 2024-04-23 at 12 18 00" src="https://github.com/IBM/vllm/assets/9909241/f261bc38-d1f9-4d1a-a5d6-9db14aa362a6">

Useful configuration for building the above tests:

```
env:
  global:
    - REMOTE_INTEGRATION_TESTS=true
    - REMOTE_INTEGRATION_TEST_IMAGE=quay.io/wxpe/tgis-vllm:release-vllm-wheel.eec7a7b
    - REMOTE_INTEGRATION_TEST_CONFIG=product.vllm
```

---

<details>
<!-- inside this <details> section, markdown rendering does not work, so we use raw html here. -->
<summary><b> PR Checklist (Click to Expand) </b></summary>

<p>Thank you for your contribution to vLLM! Before submitting the pull request, please ensure the PR meets the following criteria. This helps vLLM maintain the code quality and improve the efficiency of the review process.</p>

<h3>PR Title and Classification</h3>
<p>Only specific types of PRs will be reviewed. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:</p>
<ul>
<li><code>[Bugfix]</code> for bug fixes.</li>
<li><code>[CI/Build]</code> for build or continuous integration improvements.</li>
<li><code>[Doc]</code> for documentation fixes and improvements.</li>
<li><code>[Model]</code> for adding a new model or improving an existing model. The model name should appear in the title.</li>
<li><code>[Frontend]</code> for changes to the vLLM frontend (e.g., OpenAI API server, <code>LLM</code> class, etc.).</li>
<li><code>[Kernel]</code> for changes affecting CUDA kernels or other compute kernels.</li>
<li><code>[Core]</code> for changes in the core vLLM logic (e.g., <code>LLMEngine</code>, <code>AsyncLLMEngine</code>, <code>Scheduler</code>, etc.).</li>
<li><code>[Hardware][Vendor]</code> for hardware-specific changes. The vendor name should appear in the prefix (e.g., <code>[Hardware][AMD]</code>).</li>
<li><code>[Misc]</code> for PRs that do not fit the above categories. Please use this sparingly.</li>
</ul>
<p><strong>Note:</strong> If the PR spans more than one category, please include all relevant prefixes.</p>

<h3>Code Quality</h3>
<p>The PR needs to meet the following code quality standards:</p>
<ul>
<li>We adhere to the <a href="https://google.github.io/styleguide/pyguide.html">Google Python style guide</a> and <a href="https://google.github.io/styleguide/cppguide.html">Google C++ style guide</a>.</li>
<li>Pass all linter checks. Please use <a href="https://github.com/vllm-project/vllm/blob/main/format.sh"><code>format.sh</code></a> to format your code.</li>
<li>The code needs to be well-documented to ensure future contributors can easily understand it.</li>
<li>Include sufficient tests to ensure the project stays correct and robust. This includes both unit tests and integration tests.</li>
<li>Please add documentation to <code>docs/source/</code> if the PR modifies the user-facing behavior of vLLM. It helps vLLM users understand and utilize the new features or changes.</li>
</ul>

<h3>Notes for Large Changes</h3>
<p>Please keep the changes as concise as possible. For major architectural changes (>500 LOC excluding kernel/data/config/test), we would expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag it with <code>rfc-required</code> and might not go through the PR.</p>

<h3>What to Expect for the Reviews</h3>
<p>The goal of the vLLM team is to be a <i>transparent reviewing machine</i>. We would like to make the review process transparent and efficient and make sure no contributor feels confused or frustrated. However, the vLLM team is small, so we need to prioritize some PRs over others. Here is what you can expect from the review process:</p>
<ul>
<li>After the PR is submitted, the PR will be assigned to a reviewer. Every reviewer will pick up the PRs based on their expertise and availability.</li>
<li>After the PR is assigned, the reviewer will provide status updates every 2-3 days. If the PR is not reviewed within 7 days, please feel free to ping the reviewer or the vLLM team.</li>
<li>After the review, the reviewer will put an <code>action-required</code> label on the PR if there are changes required. The contributor should address the comments and ping the reviewer to re-review the PR.</li>
<li>Please respond to all comments within a reasonable time frame. If a comment isn't clear or you disagree with a suggestion, feel free to ask for clarification or discuss the suggestion.</li>
</ul>

<h3>Thank You</h3>
<p>Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM. Your contributions make vLLM a great tool for everyone!</p>

</details>

---------

Signed-off-by: Prashant Gupta <[email protected]>
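On the `__init__.py` point above: a minimal sketch of why those files matter for the wheel build, assuming the usual setuptools `find_packages()` flow (the excluded paths below are illustrative, not the exact ones changed in this PR):

```python
# find_packages() only includes directories that contain an __init__.py, so a
# subpackage without one would be silently left out of the built vllm wheel.
from setuptools import find_packages

packages = find_packages(exclude=("tests", "tests.*"))
print(packages)  # verify every runtime subpackage (e.g. new config dirs) shows up
```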
See Commits and Changes for more details.
Created by pull[bot]. Can you help keep this open source service alive? 💖 Please sponsor : )