
Chunked prefill support #392

Closed
Juelianqvq opened this issue Jul 24, 2024 · 7 comments

Comments

@Juelianqvq

Any plan on this?

@yzh119
Collaborator

yzh119 commented Jul 24, 2024

Hi, I don't see why we need special support for chunked prefill; the paper that proposes chunked prefill (Sarathi-Serve) uses FlashInfer, and I suppose we already support this feature?
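
For reference, here is a minimal sketch of how one chunked-prefill step maps onto FlashInfer's batch prefill wrapper. The sizes are illustrative and the `begin_forward`/`forward` names reflect the 0.1.x-era Python API (later versions renamed them), so treat this as a sketch rather than vLLM's actual integration. The key point: the query side covers only the current chunk, while the paged-KV side covers everything cached so far.

```python
# Illustrative sketch (not vLLM's code): prefill one chunk of a long prompt
# against the KV cache accumulated from earlier chunks.
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim, page_size = 32, 8, 128, 16
chunk_len = 512        # query tokens in this chunk
num_pages = 96         # pages holding all KV written so far (earlier chunks + this one)
last_page_len = 8      # valid slots in the last page

workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
wrapper = flashinfer.BatchPrefillWithPagedKVCacheWrapper(workspace, "NHD")

# Single-request "batch": queries span only the chunk, KV spans the whole prefix.
qo_indptr = torch.tensor([0, chunk_len], dtype=torch.int32, device="cuda")
kv_indptr = torch.tensor([0, num_pages], dtype=torch.int32, device="cuda")
kv_indices = torch.arange(num_pages, dtype=torch.int32, device="cuda")
kv_last_page_len = torch.tensor([last_page_len], dtype=torch.int32, device="cuda")

wrapper.begin_forward(qo_indptr, kv_indptr, kv_indices, kv_last_page_len,
                      num_qo_heads, num_kv_heads, head_dim, page_size)

q = torch.randn(chunk_len, num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
kv_cache = torch.randn(num_pages, 2, page_size, num_kv_heads, head_dim,
                       dtype=torch.float16, device="cuda")
# With qo_len < kv_len, causal=True aligns the mask to the bottom-right, so the
# chunk attends to all previously cached tokens plus itself causally.
o = wrapper.forward(q, kv_cache, causal=True)
wrapper.end_forward()
```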

@Juelianqvq
Author

> Hi, I don't see why we need special support for chunked prefill; the paper that proposes chunked prefill (Sarathi-Serve) uses FlashInfer, and I suppose we already support this feature?

Thanks for your kind reply. Sorry, I've seen badly misaligned results when using chunked prefill + FlashInfer in vLLM. Further investigation is needed.

@yzh119
Collaborator

yzh119 commented Jul 24, 2024

Thanks for letting me know; there might be some misconfiguration on the vLLM side for chunked prefill, and I'd love to help fix the issue. Can you point me to the implementation of chunked prefill in vLLM?

cc @LiuXiaoxuanPKU as Lily might provide some useful insights.

@Juelianqvq
Author

Juelianqvq commented Jul 24, 2024

> Thanks for letting me know; there might be some misconfiguration on the vLLM side for chunked prefill, and I'd love to help fix the issue. Can you point me to the implementation of chunked prefill in vLLM?
>
> cc @LiuXiaoxuanPKU as Lily might provide some useful insights.

https://github.com/vllm-project/vllm/blob/main/vllm/attention/backends/flashinfer.py#L197 and https://github.com/vllm-project/vllm/blob/main/vllm/worker/model_runner.py (search for `if self.backend_name == "flashinfer":`)

@jon-chuang

Hello @Juelianqvq, there is an ongoing effort to unify vLLM's use of flash attention, as it currently calls the prefill and decode kernels separately. I suspect a similar situation is happening for FlashInfer; I will investigate.

vllm-project/vllm#6052
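
If it helps the discussion: since decode is just prefill with a query length of 1, a mixed batch can in principle go through the single batch-prefill wrapper by concatenating all requests into one ragged batch, which is roughly the shape of the unification described above. A hedged sketch follows; the request sizes and the `make_indptr` helper are made up, with the same API caveats as the earlier sketch.

```python
# Illustrative sketch: two chunked-prefill requests and two decode requests
# handled by one batch-prefill call (a decode request just contributes qo_len = 1).
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim, page_size = 32, 8, 128, 16
qo_lens = [256, 256, 1, 1]        # query tokens per request
pages_per_req = [20, 36, 4, 9]    # KV pages already allocated per request
last_page_lens = [16, 9, 3, 5]    # valid slots in each request's last page

def make_indptr(lens):
    """Exclusive prefix sum: [0, l0, l0+l1, ...] as an int32 CUDA tensor."""
    out = [0]
    for n in lens:
        out.append(out[-1] + n)
    return torch.tensor(out, dtype=torch.int32, device="cuda")

qo_indptr = make_indptr(qo_lens)
kv_indptr = make_indptr(pages_per_req)
kv_indices = torch.arange(sum(pages_per_req), dtype=torch.int32, device="cuda")
kv_last_page_len = torch.tensor(last_page_lens, dtype=torch.int32, device="cuda")

workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
wrapper = flashinfer.BatchPrefillWithPagedKVCacheWrapper(workspace, "NHD")
wrapper.begin_forward(qo_indptr, kv_indptr, kv_indices, kv_last_page_len,
                      num_qo_heads, num_kv_heads, head_dim, page_size)

q = torch.randn(sum(qo_lens), num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
kv_cache = torch.randn(sum(pages_per_req), 2, page_size, num_kv_heads, head_dim,
                       dtype=torch.float16, device="cuda")
o = wrapper.forward(q, kv_cache, causal=True)  # one launch covers the whole mixed batch
wrapper.end_forward()
```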

@elfiegg

elfiegg commented Oct 23, 2024

What is the status of this?

I have a pending PR for supporting chunked prefill with FlashInfer, but I sometimes hit an Illegal Memory Access issue with FlashInfer's BatchDecodeWithKvCache kernel when running a mixed batch of prefill + decode tokens (the issue is flaky). I'm wondering if anyone has hit the same issue; maybe we can collaborate on debugging.
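
Not tied to the pending PR, but one thing that tends to help with flaky IMA reports like this is validating the paged-KV metadata right before the kernel call; an inconsistent indptr/indices/last_page_len triple produced while splitting a mixed batch is a common way to index past the KV cache inside the kernel. A hedged sketch of such a check (the function and its argument names are illustrative, not part of FlashInfer or vLLM):

```python
import torch

def check_paged_kv_metadata(kv_indptr: torch.Tensor,
                            kv_indices: torch.Tensor,
                            kv_last_page_len: torch.Tensor,
                            num_total_pages: int,
                            page_size: int,
                            batch_size: int) -> None:
    """Cheap consistency checks before launching a paged-KV attention kernel."""
    assert kv_indptr.numel() == batch_size + 1, "indptr must have batch_size + 1 entries"
    assert int(kv_indptr[0]) == 0, "indptr must start at 0"
    assert bool((kv_indptr[1:] >= kv_indptr[:-1]).all()), "indptr must be non-decreasing"
    assert int(kv_indptr[-1]) == kv_indices.numel(), "indptr end must equal len(indices)"
    if kv_indices.numel() > 0:
        assert int(kv_indices.min()) >= 0, "negative page index"
        assert int(kv_indices.max()) < num_total_pages, "page index exceeds allocated KV cache"
    assert kv_last_page_len.numel() == batch_size, "one last_page_len per request"
    assert bool(((kv_last_page_len >= 1) & (kv_last_page_len <= page_size)).all()), \
        "last_page_len must lie in [1, page_size]"
```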

@zhyncs
Member

zhyncs commented Oct 23, 2024

I think it is supported. We use FlashInfer in SGLang, and we support chunked prefill.

@zhyncs zhyncs closed this as completed Oct 23, 2024