[Feature] Enhanced support/structure for Multi-modal models #2439

Closed

tp-nan opened this issue Dec 11, 2024 · 1 comment

tp-nan commented Dec 11, 2024

Motivation

In vLLM, the framework can accept either a raw image or a precomputed image embedding as input. See vllm-project/vllm#3042 ([Feature] Add vision language model support) and the vLLM LLaVA implementation.
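
For context, a minimal sketch of what that vLLM interface looks like (assuming the `multi_modal_data` API as of late 2024; the model name, prompt template, file paths, and embedding tensor shape are illustrative assumptions, not SGLang code):

```python
import torch
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="llava-hf/llava-1.5-7b-hf")
params = SamplingParams(max_tokens=64)
prompt = "USER: <image>\nWhat is in this picture? ASSISTANT:"

# Option A: pass the raw image; the vision tower runs inside vLLM.
image = Image.open("example.jpg")
out_a = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    sampling_params=params,
)

# Option B: pass precomputed image embeddings (e.g. produced by an external
# TensorRT-based vision encoder), skipping vLLM's vision tower. The shape
# (num_image_tokens, hidden_size) is a model-dependent assumption.
image_embeds = torch.load("example_image_embeds.pt")
out_b = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image_embeds}},
    sampling_params=params,
)
```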

Embedding an image requires fixed, predictable compute, and it is easy to batch and run in a separate framework (for instance, a TensorRT-based serving framework). See the discussion in vllm-project/vllm#307 (comment).

Ideally, the framework should maintain the infrastructure to overlap image (GPU) preprocessing + vision-encoder inference with LLM inference within the same process, avoiding the need for NVIDIA MPS.
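
To make the overlap idea concrete, here is a hedged sketch (not existing SGLang infrastructure) of running image preprocessing + vision-encoder inference on a side CUDA stream while the default stream keeps decoding, all in one process; `vision_encoder`, `llm_decode_step`, and the request objects are hypothetical placeholders:

```python
import torch

side_stream = torch.cuda.Stream()

def encode_async(vision_encoder, pixels):
    """Launch vision encoding on the side stream; return (event, embeddings)."""
    done = torch.cuda.Event()
    with torch.cuda.stream(side_stream):
        embeds = vision_encoder(pixels.to("cuda", non_blocking=True))
        done.record(side_stream)
    return done, embeds

def serve_loop(vision_encoder, llm_decode_step, incoming_requests):
    for req in incoming_requests:
        # Kick off this request's image encoding on the side stream ...
        done, embeds = encode_async(vision_encoder, req.pixels)
        # ... while the default stream keeps decoding already-admitted requests.
        llm_decode_step()
        # Before the new request joins the batch, make the LLM's stream wait
        # until its image embeddings are ready (no host-side sync, no MPS).
        torch.cuda.current_stream().wait_event(done)
        req.image_embeds = embeds
```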

Related resources

No response

tp-nan commented Dec 11, 2024

Sorry, this is a duplicate of #745.

tp-nan closed this as completed Dec 11, 2024