[model] Reduce medusa weight #10454
Conversation
Signed-off-by: skylee-01 <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
original_lm_head is only available in the Medusa model. This strategy reduces HBM usage considerably; I think it is a usability optimization that also improves the length of Medusa prediction. The original author only considered the qps=1 case, which is not consistent with the vLLM scenario. In the scenarios I've worked with, sharing lm_head is a good strategy. If necessary, I can run some experiments to measure its effect.
For future reference, can you link an example HF repo which uses this config field?
I also really agree with your point. |
Here is a config example. In Medusa, if your weights were not trained with lm_head, you can set original_lm_head=true so that all heads share the original lm_head, reducing HBM usage.
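The linked example config itself is not reproduced here; as a hypothetical sketch only (original_lm_head is the field from this PR, while the other field names and values are assumptions for illustration), a Medusa draft-model config.json might look like:

```json
{
  "architectures": ["MedusaModel"],
  "hidden_size": 4096,
  "vocab_size": 128256,
  "medusa_num_heads": 4,
  "medusa_num_layers": 1,
  "original_lm_head": true
}
```

With original_lm_head set to true, the loader skips per-head lm_head weights and points every head at the base model's lm_head instead.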
Sorry for the false ping.
Thanks for adding this optimization!
Signed-off-by: skylee-01 <[email protected]>
Signed-off-by: skylee-01 <[email protected]>
Signed-off-by: Tyler Michael Smith <[email protected]>
Signed-off-by: skylee-01 <[email protected]>
Signed-off-by: skylee-01 <[email protected]>
Medusa predicts N tokens in speculative decoding and trains N lm_heads. In actual deployments, usually only the ResidualBlock is trained, not lm_head. So I just keep a single copy of lm_head and share it across the different heads. In practice, every lm_head copy avoided saves about 1 GB of HBM, which is crucial on graphics cards such as the RTX 4090. At the same time, Medusa can predict further ahead.
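To see where the "~1 GB per lm_head" figure comes from, here is a back-of-the-envelope sketch. The model sizes below are assumptions chosen for illustration (a Llama-3-like vocabulary of 128256 and hidden size 4096 in fp16), not values taken from the PR:

```python
# Hypothetical model dimensions (assumptions, not from the PR).
vocab_size = 128256       # Llama-3-like vocabulary
hidden_size = 4096
bytes_per_param = 2       # fp16
num_medusa_heads = 4

# One lm_head is a [vocab_size, hidden_size] projection matrix.
lm_head_bytes = vocab_size * hidden_size * bytes_per_param

# Without sharing, each Medusa head stores its own lm_head copy;
# with original_lm_head=True, all heads reuse the base model's lm_head,
# so every per-head copy is eliminated.
saved_bytes = num_medusa_heads * lm_head_bytes

print(f"per-head lm_head: {lm_head_bytes / 2**30:.2f} GiB")
print(f"HBM saved with {num_medusa_heads} heads: {saved_bytes / 2**30:.2f} GiB")
```

So each shared head saves roughly 1 GiB at these sizes, which is why the saving matters most on 24 GB cards like the 4090, and why it frees room for more Medusa heads.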