-
Notifications
You must be signed in to change notification settings - Fork 120
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
support for concurrency in llm models #519
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
Thanks @dtrawins, I will test this. You can fix the style check by running |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks a lot @dtrawins
logger.info(f"Compiling the encoder to {self._device} ...") | ||
self.request = core.compile_model(self.model, self._device, self.ov_config) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we might want to keep support for at least a couple of minor releases + add a warning to specify that this attribute will be deprecated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the attribute request was not exposed in any methods so I would expect it to be internal class implementation specific. It is now called compiled_model which better describes the stored object. We could keep it as a duplicate but I wonder if that could bring some confusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, compiled_model is a better name, but people are used to .request now, I use it quite a lot and expect I'm not alone (for example to get a compiled model property). This can also be used in integrations that use optimum-intel. So I think moving to compiled_model is good, but ideally we should show a deprecation warning for .request, and if not at least keep it as an alias for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@helena-intel Would it be ok to create a new attribute infer_request to store infer_request object and keep request as an alias for the compiled_model? I wonder how we could add a deprecation notice about the request attribute while it is not used in any methods. A comment in the code? I guess it wasn't documented as an internal implementation detail.
@@ -197,8 +198,14 @@ def forward( | |||
inputs["token_type_ids"] = token_type_ids | |||
|
|||
# Run inference | |||
outputs = self.request(inputs) | |||
logits = torch.from_numpy(outputs["logits"]).to(self.device) if not np_inputs else outputs["logits"] | |||
infer_request = self.compiled_model.create_infer_request() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to create an inference request for each prediction ? (from my understanding it's needed for stateful models at each generation steps, but might not be the case here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually new infer_requests are needed for stateless models. It ensure each prediction from each execution thread is independent. Otherwise only one generate operation would be possible on a model. In stateful LLM models there is different approach - each generate operation has a single infer request so the state is preserved between integrations.
from transformers import AutoConfig, AutoTokenizer, set_seed | ||
|
||
from optimum.intel import OVModelForCausalLM | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we want to add additional tests can it be integrated in test_modeling.py
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this was meant as an example to test it more than an automated test. I would suggest to remove it for now, and then later add an example (in the examples folder. The code in this example is good for testing the potential of multiconcurrency in combination with larger batches, but it is not clear if the performance benefit comes from the larger batch or the multiconcurrency. I made some code to make that clearer, I can clean that up and add an example later, if that's useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pytest tests will be added to validate the multithreading so those scripts are to be dropped. I guess they could be added as examples but probably it would be a part of a separate PR.
This functionality is continued in #573 |
Allows running llm generate operations in multithreaded application
Support in parallel execution for stateful LLM models is implemented using a cloned model object but with shared OV model and compiled model to avoid duplicating memory consumption. Each cloned model object is using a single OpenVINO
inference_request
object which keeps the execution context and state.For stable diffusion and seq2seq pipelines which are without stateful models, no changes are needed for multi threaded execution.
Before submitting