Frontend Improvements #47

Closed
zhuohan123 opened this issue Apr 22, 2023 · 3 comments · Fixed by #116 or #135
Comments

@zhuohan123 (Member)

  1. The current FastAPI + asyncio + Ray frontend implementation seems slow.
  2. Merge Hao’s throughput profiling code.
  3. Make the frontend look like OpenAI’s API (see the sketch after this list).
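
As a rough illustration of item 3, here is a minimal, hypothetical sketch of an OpenAI-style `/v1/completions` route built on FastAPI. The request and response fields follow OpenAI's completion schema; the `fake_generate` helper and all wiring are placeholders for illustration, not the actual cacheflow/vLLM implementation.

```python
# Hypothetical sketch of an OpenAI-style completion endpoint on FastAPI.
# The engine call is a placeholder; a real server would dispatch to the
# cacheflow/Ray-backed scheduler instead of fake_generate.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class CompletionRequest(BaseModel):
    model: str
    prompt: str
    max_tokens: int = 16
    temperature: float = 1.0


async def fake_generate(prompt: str, max_tokens: int, temperature: float) -> str:
    # Stand-in for model inference; only here so the sketch runs end to end.
    return prompt + " ..."


@app.post("/v1/completions")
async def create_completion(req: CompletionRequest):
    text = await fake_generate(req.prompt, req.max_tokens, req.temperature)
    # Response shape mirrors OpenAI's text completion object.
    return {
        "object": "text_completion",
        "model": req.model,
        "choices": [{"text": text, "index": 0, "finish_reason": "length"}],
    }
```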
@zhuohan123 zhuohan123 self-assigned this Apr 22, 2023
@WoosukKwon WoosukKwon added the P0 label May 10, 2023
@merrymercy (Contributor) commented May 20, 2023

For the OpenAI API server, you can learn from FastChat’s implementation (https://github.com/lm-sys/FastChat/blob/main/docs/openai_api.md), which supports all major OpenAI features such as completion, chat, and embedding. It can work with existing apps (e.g., LangChain) without code modifications.

Another option is to directly import FastChat and extend the existing cacheflow integration in FastChat.

@WoosukKwon (Collaborator)

@merrymercy Thanks for the suggestion! @zhuohan123 is working on the first direction.

@zhuohan123 (Member, Author)

> For the OpenAI API server, you can learn from FastChat’s implementation (https://github.com/lm-sys/FastChat/blob/main/docs/openai_api.md), which supports all major OpenAI features such as completion, chat, and embedding. It can work with existing apps (e.g., LangChain) without code modifications.
>
> Another option is to directly import FastChat and extend the existing cacheflow integration in FastChat.

Thanks for the suggestion! We implemented an OpenAI API server in #116 following FastChat's implementation. It currently supports the completion API; in the future, we plan to import FastChat to implement the chat completion API.
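
For illustration, a client can exercise such an OpenAI-compatible completion endpoint with a plain HTTP request; the URL and model name below are assumptions for the example, not values taken from this issue or from #116.

```python
# Example client request against an assumed OpenAI-compatible /v1/completions endpoint.
import requests

response = requests.post(
    "http://localhost:8000/v1/completions",  # assumed local server address
    json={
        "model": "facebook/opt-125m",        # assumed model name, for illustration only
        "prompt": "San Francisco is a",
        "max_tokens": 16,
        "temperature": 0.7,
    },
)
print(response.json()["choices"][0]["text"])
```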

yukavio pushed a commit to yukavio/vllm that referenced this issue Jul 3, 2024
dllehr-amd pushed a commit to dllehr-amd/vllm that referenced this issue Jul 22, 2024
* support quark

* using torch/all.h

* loading weight from quark output

* support both ammo and quark

* Update doc

* fix load ammo

* fix linter

* fix isort
JHLEE17 pushed a commit to JHLEE17/vllm that referenced this issue Aug 1, 2024
@alixiaodi alixiaodi mentioned this issue Aug 2, 2024
pi314ever pushed a commit to pi314ever/vllm that referenced this issue Jan 17, 2025
remove expert_max hard code (vllm-project#47)
vLLM-Ext: Full enabling of ALiBi (vllm-project#34)
Add version inference via setuptools-scm (vllm-project#58)
Revert "vLLM-Ext: Full enabling of ALiBi (vllm-project#34)" (vllm-project#59)
Remove punica_hpu.py from vllm_hpu_extension (vllm-project#66)
Removed previous (not-pipelined) pa implementation (vllm-project#72)
Add flag to enable running softmax in fp32 (vllm-project#71)
Update calibration readme link (vllm-project#73)
allow lm_head quantization in calibration process (vllm-project#65)
Pad to bmin if value is less (vllm-project#67)
Update pyproject.toml (HabanaAI#75)

---------

Co-authored-by: Michał Kuligowski <[email protected]>