Frontend Improvements #47

Closed
zhuohan123 opened this issue Apr 22, 2023 · 3 comments · Fixed by #116 or #135
Comments

@zhuohan123 (Member)

  1. The current FastAPI + asyncio + Ray frontend implementation seems slow.
  2. Merge Hao’s throughput profiling code.
  3. Make the frontend look like OpenAI’s API (see the sketch after this list).
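
As a rough illustration of item 3, here is a minimal, hypothetical sketch of an OpenAI-style `/v1/completions` route built on FastAPI. The request and response fields follow OpenAI's completion schema; the `fake_generate` helper and all wiring are placeholders for illustration, not the actual cacheflow/vLLM implementation.

```python
# Hypothetical sketch of an OpenAI-style completion endpoint on FastAPI.
# The engine call is a placeholder; a real server would dispatch to the
# cacheflow/Ray-backed scheduler instead of fake_generate.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class CompletionRequest(BaseModel):
    model: str
    prompt: str
    max_tokens: int = 16
    temperature: float = 1.0


async def fake_generate(prompt: str, max_tokens: int, temperature: float) -> str:
    # Stand-in for model inference; only here so the sketch runs end to end.
    return prompt + " ..."


@app.post("/v1/completions")
async def create_completion(req: CompletionRequest):
    text = await fake_generate(req.prompt, req.max_tokens, req.temperature)
    # Response shape mirrors OpenAI's text completion object.
    return {
        "object": "text_completion",
        "model": req.model,
        "choices": [{"text": text, "index": 0, "finish_reason": "length"}],
    }
```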
@zhuohan123 zhuohan123 self-assigned this Apr 22, 2023
@WoosukKwon WoosukKwon added the P0 label May 10, 2023
@merrymercy (Contributor) commented May 20, 2023

For the OpenAI API server, you can learn from FastChat’s implementation (https://github.com/lm-sys/FastChat/blob/main/docs/openai_api.md), which supports all major OpenAI features such as completion, chat, and embedding. It can work with existing apps (e.g., LangChain) without code modifications.

Another option is to directly import FastChat and extend the existing cacheflow integration in FastChat.

@WoosukKwon (Collaborator)

@merrymercy Thanks for the suggestion! @zhuohan123 is working on the first direction.

@zhuohan123 (Member, Author)

> For the OpenAI API server, you can learn from FastChat’s implementation (https://github.com/lm-sys/FastChat/blob/main/docs/openai_api.md), which supports all major OpenAI features such as completion, chat, and embedding. It can work with existing apps (e.g., LangChain) without code modifications.
>
> Another option is to directly import FastChat and extend the existing cacheflow integration in FastChat.

Thanks for the suggestion! We implemented an OpenAI API server in #116 following FastChat's implementation. It currently supports the completion API; in the future, we plan to import FastChat to implement the chat completion API.
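
For illustration, a client can exercise such an OpenAI-compatible completion endpoint with a plain HTTP request; the URL and model name below are assumptions for the example, not values taken from this issue or from #116.

```python
# Example client request against an assumed OpenAI-compatible /v1/completions endpoint.
import requests

response = requests.post(
    "http://localhost:8000/v1/completions",  # assumed local server address
    json={
        "model": "facebook/opt-125m",        # assumed model name, for illustration only
        "prompt": "San Francisco is a",
        "max_tokens": 16,
        "temperature": 0.7,
    },
)
print(response.json()["choices"][0]["text"])
```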

yukavio pushed a commit to yukavio/vllm that referenced this issue Jul 3, 2024
dllehr-amd pushed a commit to dllehr-amd/vllm that referenced this issue Jul 22, 2024
* support quark

* using torch/all.h

* loading weight from quark output

* support both ammo and quark

* Update doc

* fix load ammo

* fix linter

* fix isort
JHLEE17 pushed a commit to JHLEE17/vllm that referenced this issue Aug 1, 2024
@alixiaodi alixiaodi mentioned this issue Aug 2, 2024
pi314ever pushed a commit to pi314ever/vllm that referenced this issue Jan 17, 2025
remove expert_max hard code (vllm-project#47)
vLLM-Ext: Full enabling of ALiBi (vllm-project#34)
Add version inference via setuptools-scm (vllm-project#58)
Revert "vLLM-Ext: Full enabling of ALiBi (vllm-project#34)" (vllm-project#59)
Remove punica_hpu.py from vllm_hpu_extension (vllm-project#66)
Removed previous (not-pipelined) pa implementation (vllm-project#72)
Add flag to enable running softmax in fp32 (vllm-project#71)
Update calibration readme link (vllm-project#73)
allow lm_head quantization in calibration process (vllm-project#65)
Pad to bmin if value is less (vllm-project#67)
Update pyproject.toml (HabanaAI#75)

---------

Co-authored-by: Michał Kuligowski <[email protected]>