Merge QKV into one linear layer #15

Merged
merged 4 commits into main from qkv_combined on Apr 2, 2023

Conversation

@zhuohan123 zhuohan123 (Member) commented Mar 30, 2023

@WoosukKwon Feel free to merge this after your review.
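
For context on the change itself: fusing the separate Q, K, and V projections into one linear layer replaces three matrix multiplications with a single larger one per attention layer. A minimal sketch is below; the class and parameter names (hidden_size, num_heads) are illustrative assumptions, not the vLLM code in this PR.

```python
# Minimal sketch (not the vLLM implementation) of fusing the Q, K, and V
# projections into a single nn.Linear so three GEMMs become one.
import torch
import torch.nn as nn


class FusedQKVProjection(nn.Module):
    """Illustrative module; hidden_size / num_heads names are assumptions."""

    def __init__(self, hidden_size: int, num_heads: int) -> None:
        super().__init__()
        assert hidden_size % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        # One weight of shape (3 * hidden_size, hidden_size) replaces the
        # three separate q_proj / k_proj / v_proj weights.
        self.qkv_proj = nn.Linear(hidden_size, 3 * hidden_size, bias=False)

    def forward(self, hidden_states: torch.Tensor):
        qkv = self.qkv_proj(hidden_states)  # (..., 3 * hidden_size)
        q, k, v = qkv.chunk(3, dim=-1)      # each (..., hidden_size)
        return q, k, v


# Usage (hypothetical shapes):
# proj = FusedQKVProjection(hidden_size=4096, num_heads=32)
# q, k, v = proj(torch.randn(2, 16, 4096))
```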

@zhuohan123 zhuohan123 requested a review from WoosukKwon March 30, 2023 17:04
@WoosukKwon WoosukKwon (Collaborator) left a comment

Thanks for your effort. Please check my comments.

@WoosukKwon WoosukKwon mentioned this pull request Apr 2, 2023
@WoosukKwon WoosukKwon (Collaborator) commented

The performance regression problem in this PR is fixed in #20. I will merge the two PRs together when PR #20 is approved.

@WoosukKwon WoosukKwon self-requested a review April 2, 2023 07:23
@WoosukKwon WoosukKwon merged commit 1f01a18 into main Apr 2, 2023
@zhuohan123 zhuohan123 deleted the qkv_combined branch June 18, 2023 07:22
bigPYJ1151 added a commit to bigPYJ1151/vllm that referenced this pull request Sep 12, 2023
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024
slyalin pushed a commit to slyalin/vllm that referenced this pull request Mar 26, 2024
…no-model-executor-opt

[CPU] Avoid copy result and force allocation
z103cb referenced this pull request in z103cb/opendatahub_vllm Apr 22, 2024
This PR updates our grpc_server to add TGIS-style logs similar to
https://github.com/IBM/text-generation-inference/blob/main/router/src/grpc_server.rs#L504-L512

This also disables the vLLM per-request logging so that we don't
double-log each request.

The timing info collected here is pretty rough: it doesn't plumb into
the LLMEngine; it just times the generators to get the total time spent
in the engine. We could do better, but this is a start.

Example logs:

```
INFO 04-09 21:51:01 logs.py:43] generate_stream{input=[b'This is the story of Obama ridin...'] prefix_id= input_chars=[70] params=sampling { } stopping { max_new_tokens: 200 min_new_tokens: 16 } response { } decoding { } tokenization_time=0.45ms queue_and_inference_time=1096.67ms time_per_token=5.48ms total_time=1097.12ms input_toks=16}: Streaming response generated 200 tokens before NOT_FINISHED, output 848 chars: b' California. The story is told i...'
INFO 04-09 21:51:08 logs.py:43] generate{input=[b'Lorem ipsum dolor sit amet, cons...', b'foooood man where is it'] prefix_id= input_chars=[469] params=sampling { } stopping { max_new_tokens: 20 min_new_tokens: 16 } response { } decoding { } tokenization_time=2.03ms queue_and_inference_time=122.23ms time_per_token=6.11ms total_time=124.26ms input_toks=124}: Sub-request 0 from batch of 2 generated 20 tokens before MAX_TOKENS, output 25 chars: b'?\\n\\n<!--\\n<!--\\n<!--\\n<!--\\n<!'
INFO 04-09 21:51:08 logs.py:43] generate{input=[b'Lorem ipsum dolor sit amet, cons...', b'foooood man where is it'] prefix_id= input_chars=[469] params=sampling { } stopping { max_new_tokens: 20 min_new_tokens: 16 } response { } decoding { } tokenization_time=2.07ms queue_and_inference_time=122.22ms time_per_token=6.11ms total_time=124.29ms input_toks=7}: Sub-request 1 from batch of 2 generated 20 tokens before MAX_TOKENS, output 70 chars: b"?\\nI don't know.\\nI don't know.\\nI ..."
```

---------

Signed-off-by: Joe Runde <[email protected]>
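
As a side note on the timing approach described in the commit message above (timing the generators rather than instrumenting the LLMEngine), a rough wrapper could look like the sketch below. The wrapper name, its arguments, and treating each yielded item as one token are assumptions for illustration, not code from that commit.

```python
# Rough sketch: time an async streaming generator end to end instead of
# instrumenting the engine internals. Each yielded item is treated as one
# token for this rough estimate; all names here are illustrative only.
import time
from typing import Any, AsyncIterator, Callable


async def timed_stream(
    stream: AsyncIterator[Any],
    log: Callable[[str], None] = print,
) -> AsyncIterator[Any]:
    """Yield everything from `stream`, then log total and per-token timing."""
    start = time.monotonic()
    count = 0
    async for item in stream:
        count += 1
        yield item
    total_ms = (time.monotonic() - start) * 1000.0
    log(
        f"total_time={total_ms:.2f}ms "
        f"time_per_token={total_ms / max(count, 1):.2f}ms"
    )


# Usage (hypothetical engine interface):
# async for out in timed_stream(engine.generate(prompt, params, request_id)):
#     ...
```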
fxmarty pushed a commit to fxmarty/vllm-public that referenced this pull request May 31, 2024
Correctly calculating the same value for the required cache blocks num for all torchrun processes
ykim362 pushed a commit to ykim362/vllm that referenced this pull request Jun 17, 2024
…-wenxh/fp8-on-a100-v5-pr

Revert "0612 kernel of FP8 on A100"
@alixiaodi alixiaodi mentioned this pull request Aug 2, 2024