Refactor flashinfer logic for deepseek v3 and fix accuracy bug #3785

Fridge003 · 2025-02-22T09:15:49Z

Motivation

flashinfer_backend.py for attention is too complex, this PR extract the logic of MLA and creates a new flashinfer_mla_backend.py

Also, #3716 #3751 reports an accuracy bug when enabling flashinfer mla. This PR solves this bug by correctly handling rope scaling with yarn, with the help of @yzh119.

Modifications

Define FlashInferMLAAttnBackend in flashinfer_mla_backend.py by removing codes irrelevant to MLA in flashinfer_backend.py
Remove magic numbers in code so Deepseek v2 can also be supported
Simplify the code in forward of MLA
Change flash attention backend to auto so the newest fa3 backend of flashinfer can be used
Fix accuracy bug with yarn rope scaling

Accuracy Test

The baseline results of not enabling flashinfer mla can be referred to #3486.

Server

python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-R1 --tp 8 --trust-remote-code --enable-flashinfer-mla

gsm8k

python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319

Accuracy: 0.960
Invalid: 0.000
Latency: 101.807 s
Output throughput: 1329.111 token/s

mmlu

bash benchmark/mmlu/download_data.sh
python3 benchmark/mmlu/bench_sglang.py --nsub 100 --ntrain 5 --parallel 2000

Total latency: 159.003
Average accuracy: 0.871

Checklist

Format your code according to the Code Formatting with Pre-Commit.
Add unit tests as outlined in the Running Unit Tests.
Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.

python/sglang/srt/layers/attention/flashinfer_mla_backend.py

python/sglang/srt/configs/model_config.py

Fridge003 requested review from merrymercy, Ying1123, hnyls2002, zhyncs, ispobock and ByronHsu as code owners February 22, 2025 09:15

Fridge003 changed the title ~~[] Add flashinfer mla backend for deepseek v3~~ Refactor flashinfer logic for deepseek v3 Feb 22, 2025

zhyncs self-assigned this Feb 22, 2025

zhyncs added the high priority label Feb 22, 2025

Fridge003 force-pushed the deepseek branch from 834ad75 to 44cab67 Compare February 23, 2025 00:19

yzh119 reviewed Feb 23, 2025

View reviewed changes

python/sglang/srt/layers/attention/flashinfer_mla_backend.py Outdated Show resolved Hide resolved

Fridge003 force-pushed the deepseek branch from 44cab67 to 0e42f7a Compare February 23, 2025 19:14

yzh119 reviewed Feb 23, 2025

View reviewed changes

Fridge003 changed the title ~~Refactor flashinfer logic for deepseek v3~~ Refactor flashinfer logic for deepseek v3 and fix accuracy bug Feb 23, 2025

Fridge003 mentioned this pull request Feb 23, 2025

[Bug] MLA slower than default for small context long outputs *and* generating bad output reproducibly #3716

Open

5 tasks

zhyncs reviewed Feb 24, 2025

View reviewed changes

python/sglang/srt/configs/model_config.py Show resolved Hide resolved

Fridge003 mentioned this pull request Feb 24, 2025

[Feature] mla speed #3806

Closed

2 tasks

Fridge003 added 5 commits February 24, 2025 00:37

Add flashinfer mla backend for deepseek v3

5f51818

fix updater prefill

c73b0bb

little fix

430e619

change fa backend to auto

6a9be2f

fix accuracy issue by handling yarn scaling

877bd66

Fridge003 force-pushed the deepseek branch from 71cb5a7 to 877bd66 Compare February 24, 2025 08:37

Merge branch 'main' into deepseek

dd80320

zhyncs merged commit b110084 into sgl-project:main Feb 24, 2025
5 of 18 checks passed

Fridge003 deleted the deepseek branch February 25, 2025 06:28

Fridge003 mentioned this pull request Feb 25, 2025

[DeepseekR1]How ragged prefill manage kv_cache? #3849

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor flashinfer logic for deepseek v3 and fix accuracy bug #3785

Refactor flashinfer logic for deepseek v3 and fix accuracy bug #3785

Fridge003 commented Feb 22, 2025 •

edited

Loading

Refactor flashinfer logic for deepseek v3 and fix accuracy bug #3785

Refactor flashinfer logic for deepseek v3 and fix accuracy bug #3785

Conversation

Fridge003 commented Feb 22, 2025 • edited Loading

Motivation

Modifications

Accuracy Test

Server

gsm8k

mmlu

Checklist

Fridge003 commented Feb 22, 2025 •

edited

Loading