Support generation from input embedding #1265
Closed
Changes from 16 of 33 commits
All commits are by pfldy2850:

- `bed0e15` feat: add prompt_embeds interface
- `3394d25` fix: add get_input_embeddings
- `aa9b215` feat: support all models to generate from embeds
- `ce70fe7` Merge branch 'main' into feature-input-embeds
- `de4199d` fix: bugfix for inputs_embeds and add last line
- `9275b2d` fix: add prompt_embeds to async engine
- `e6963eb` Merge branch 'main' into feature-input-embeds
- `bd5539a` fix: bugfix of get_last_token_id
- `99605bc` fix: apply prompt_embeds to api_server
- `87162d2` refact: refactor test_models
- `a3d9de6` fix: apply style guide
- `44ff4ec` fix: improve comments
- `a37cef0` refact: refactor prepare_inputs and models
- `9633148` fix: apply style guide
- `eec19ed` refact: refactor zero embeds
- `bebc26b` fix: apply style guide
- `a2f2054` Merge branch 'main' into feature-input-embeds
- `58391ac` Merge branch 'main' into feature-input-embeds
- `c28d8bf` fix: update for new prepare_inputs
- `117b47f` fix: rollback commented
- `c0fae79` fix: update style
- `2151bc1` Merge branch 'main' into feature-input-embeds
- `d613790` Merge branch 'main' into feature-input-embeds
- `1956ce4` Merge branch 'main' into feature-input-embeds
- `d26465a` Merge branch 'main' into feature-input-embeds
- `0790351` fix: update model_runner with input_embeds
- `e313eae` fix: fix typo
- `57c1701` fix: bug fix
- `d266c39` fix: change input_embeds argument
- `662a658` refact: refactor replace_prompt_embeds
- `1110834` fix: bugfix
- `ff22471` Merge branch 'main' into feature-input-embeds
- `f2b10c3` Merge branch 'main' into feature-input-embeds
Diff (the file path is not shown in this capture; the imports and the commit "fix: apply prompt_embeds to api_server" suggest vLLM's example API server):

```diff
@@ -5,6 +5,7 @@
 from fastapi import FastAPI, Request
 from fastapi.responses import JSONResponse, Response, StreamingResponse
 import uvicorn
+import torch

 from vllm.engine.arg_utils import AsyncEngineArgs
 from vllm.engine.async_llm_engine import AsyncLLMEngine
@@ -23,16 +24,27 @@ async def generate(request: Request) -> Response:

     The request should be a JSON object with the following fields:
     - prompt: the prompt to use for the generation.
+    - prompt_embeds: the prompt embedding to use for the generation
+      instead of the prompt.
     - stream: whether to stream the results or not.
     - other fields: the sampling parameters (See `SamplingParams` for details).
     """
     request_dict = await request.json()
     prompt = request_dict.pop("prompt")
+    prompt_embeds = request_dict.pop("prompt_embeds", None)
+    if prompt_embeds is not None:
+        prompt_embeds = torch.tensor(prompt_embeds).to("cuda")
+        prompt = None
```

Review comment on the `torch.tensor(prompt_embeds).to("cuda")` line: "This loads stuff in float32. Eats all the GPU."
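The reviewer's point is that `torch.tensor` on a Python list defaults to float32, so the embeddings land on the GPU at full precision regardless of the model's dtype. A minimal sketch of a dtype-aware conversion (`model_dtype` and `raw_embeds` are stand-in names, not from the PR; in the real server the dtype would come from the engine's model config):

```python
import torch

# Hypothetical sketch of the fix the reviewer is asking for: build the
# tensor in the model's dtype (e.g. float16) and directly on the GPU,
# instead of materializing a float32 CPU tensor first and copying it over.
model_dtype = torch.float16  # assumption: would come from the model config

prompt_embeds = torch.tensor(
    raw_embeds,         # nested list of floats parsed from the JSON body
    dtype=model_dtype,  # avoids the float32 default the reviewer flags
    device="cuda",      # allocate on the GPU, no CPU staging tensor
)
```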
```diff
     stream = request_dict.pop("stream", False)
     sampling_params = SamplingParams(**request_dict)
     request_id = random_uuid()

-    results_generator = engine.generate(prompt, sampling_params, request_id)
+    results_generator = engine.generate(
+        prompt,
+        sampling_params,
+        request_id,
+        prompt_embeds=prompt_embeds,
+    )

     # Streaming case
     async def stream_results() -> AsyncGenerator[bytes, None]:
@@ -58,7 +70,12 @@ async def stream_results() -> AsyncGenerator[bytes, None]:

     assert final_output is not None
     prompt = final_output.prompt
-    text_outputs = [prompt + output.text for output in final_output.outputs]
+    if prompt:
+        text_outputs = [
+            prompt + output.text for output in final_output.outputs
+        ]
+    else:
+        text_outputs = [output.text for output in final_output.outputs]
     ret = {"text": text_outputs}
     return JSONResponse(ret)
```
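For context, a sketch of how a client might exercise the new field. Everything server-specific here is an assumption: the host, port, and `/generate` route follow vLLM's example API server defaults, and `facebook/opt-125m` stands in for whatever model actually produced the embeddings:

```python
import requests
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed setup: the server from the diff above is running on
# localhost:8000 and serving the same model used to embed the prompt.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

token_ids = tokenizer("Hello, my name is", return_tensors="pt").input_ids
with torch.no_grad():
    # Use the model's own embedding layer to turn token ids into vectors.
    embeds = model.get_input_embeddings()(token_ids)[0]  # (seq_len, hidden)

# Tensors are not JSON-serializable, so the embeddings go over the wire as
# a nested list; the server rebuilds them with torch.tensor(prompt_embeds).
response = requests.post(
    "http://localhost:8000/generate",
    json={"prompt_embeds": embeds.tolist(), "max_tokens": 32},
)
print(response.json()["text"])
```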
Review comment on the `prompt = request_dict.pop("prompt")` line: "This throws an error when only prompt_embeds are passed."
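Since `pop("prompt")` has no default, an embeds-only request raises `KeyError`. A minimal sketch of the fix this comment points at (the validation style is an assumption; the actual follow-up change is not shown in this capture):

```python
from fastapi import HTTPException

# Sketch: pop "prompt" with a default so embeds-only requests no longer
# raise KeyError, then reject requests that provide neither input.
prompt = request_dict.pop("prompt", None)
prompt_embeds = request_dict.pop("prompt_embeds", None)
if prompt is None and prompt_embeds is None:
    raise HTTPException(
        status_code=400,
        detail="Either 'prompt' or 'prompt_embeds' must be provided.",
    )
```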