Replies: 2 comments
- This discussion on batch processing may provide some guidance.
- What should I do to enable multiple users to ask questions to the language model simultaneously and receive responses? Does llama.cpp support parallel inference for concurrent requests? In other words, how can we ensure that requests to the language model are processed in parallel rather than sequentially, so that multiple users can be served at the same time?
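For reference, here is a minimal sketch of what serving several users at once can look like against llama.cpp's built-in HTTP server. This is an illustration, not the project's documented API: the `--parallel` server flag, the default `localhost:8080` address, the `/completion` endpoint, and the `"content"` response field are assumptions based on recent llama.cpp builds, so check the server README for the version you are running.

```python
# Sketch: send several prompts to a locally running llama.cpp HTTP server
# concurrently and collect the replies.
#
# Assumptions (verify against your llama.cpp version's server README):
#   - the server was started with parallel decoding slots, e.g. roughly
#       ./llama-server -m model.gguf -c 8192 --parallel 4
#     so that several requests are decoded concurrently (continuous
#     batching) instead of being queued one after another;
#   - it listens on http://localhost:8080 and exposes a POST /completion
#     endpoint returning JSON with a "content" field.

import concurrent.futures
import requests

SERVER_URL = "http://localhost:8080/completion"  # assumed default host/port

PROMPTS = [
    "Explain what continuous batching is in one sentence.",
    "List three uses of a local LLM server.",
    "What does the term 'KV cache' mean?",
    "Summarize why parallel decoding helps multi-user serving.",
]

def ask(prompt: str) -> str:
    """Send one completion request and return the generated text."""
    resp = requests.post(
        SERVER_URL,
        # field names assumed from llama.cpp server docs for recent builds
        json={"prompt": prompt, "n_predict": 64},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json().get("content", "")

if __name__ == "__main__":
    # Each client request runs in its own thread; with enough parallel
    # slots on the server side, they are inferred concurrently rather
    # than one after another.
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(PROMPTS)) as pool:
        for prompt, answer in zip(PROMPTS, pool.map(ask, PROMPTS)):
            print(f"Q: {prompt}\nA: {answer.strip()}\n")
```

The client side needs nothing special: parallelism comes from the server batching the active requests together, so plain threads (or any HTTP client issuing overlapping requests) are enough to exercise it.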