fix issue with impossibility to remove uncompressed model in tmp after compression #925
Conversation
Force-pushed from 4048b60 to ff6bd41
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Force-pushed from f113f31 to 0b2a1ee
Force-pushed from 0b2a1ee to b02753a
Force-pushed from 13c9356 to 06625ca
Force-pushed from 3d521c0 to ef74d0c
Force-pushed from ef74d0c to 83a8800
@@ -444,12 +439,14 @@ class StoreAttr(object):
from optimum.intel.openvino.quantization import _weight_only_quantization
submodel = core.read_model(submodel_path)
If the model is going to be read either way, why introduce return_model_sizes and modify the behavior of all the export functions? Please explain the necessity of making the export functions return model sizes as well, because it doesn't make sense to me.
The purpose of reading the model at this point is to prepare it for quantization; if the model is not going to be compressed, this step is unnecessary.
There are several cases for this:
- nncf is not available (in this case we would like to notify the user that, if the model is a candidate for quantization by model size, they lose the opportunity to optimize it, so calculating the model size is helpful)
- quantization is disabled by the export config (e.g. load_in_8bit=False in from_pretrained or --weight-format fp16/fp32 in optimum-cli)
- the default decision on whether to compress the model is based on model size: by default we apply int8 weight compression only if the model has more than 1B parameters and no additional quantization config is provided (this is also why we need to calculate the model size and avoid an extra model read when we already know at that point that weight compression is not applicable); see the sketch below
In some cases, when the user runs the CLI and does not plan to use the model on the same machine, reading the model when it is not needed is just a waste of time and possibly memory (not to mention the additional issues caused by having to take care of removing it from memory, which is what we are trying to avoid).
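For illustration only, a minimal sketch of the size-based default described above; the helper name and config handling are hypothetical and this is not the actual optimum-intel code:

```python
# Hypothetical sketch of the default weight-compression rule described above;
# not the actual optimum-intel implementation.
def should_apply_int8_by_default(num_parameters: int, quantization_config=None) -> bool:
    if quantization_config is not None:
        # An explicit quantization config always takes precedence,
        # so the size-based default does not apply.
        return False
    # Default rule: apply int8 weight compression only above 1B parameters.
    return num_parameters > 1_000_000_000
```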
I also agree that avoiding an extra model read when the model is not going to be quantized anyway is a good thing.
Introducing return_model_sizes adds some complexity, so I would prefer to avoid it if possible. What do you think about moving this logic (applying quantization to each submodel depending on multiple criteria) from main_export to _save_model (or export_pytorch / export_pytorch_via_onnx / export_tensorflow)? Also, ov_config could be updated to know whether we should check the model's size (in case quantization wasn't specified) or whether the model should or shouldn't be quantized (load_in_8bit set to True or False, for example). wdyt @eaidova ?
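As a rough illustration of what such an ov_config flag could look like (purely hypothetical names, not an existing optimum-intel API):

```python
from enum import Enum

# Purely hypothetical tri-state flag illustrating the suggestion above.
class WeightCompressionMode(Enum):
    FORCE = "force"            # e.g. load_in_8bit=True
    SKIP = "skip"              # e.g. load_in_8bit=False or --weight-format fp16/fp32
    SIZE_BASED = "size_based"  # nothing specified: decide from the parameter count

def resolve_compression_mode(load_in_8bit=None) -> WeightCompressionMode:
    if load_in_8bit is True:
        return WeightCompressionMode.FORCE
    if load_in_8bit is False:
        return WeightCompressionMode.SKIP
    return WeightCompressionMode.SIZE_BASED
```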
No, this is also something we wanted to avoid: when we save the OV model, the PyTorch model is still in memory, and that memory cannot be deallocated until we return to the place where it was loaded (main_export), which leads to extra memory consumption during weight compression. See #878 for details.
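For context, a minimal sketch of the flow being described, with hypothetical helper names (this is not the code in this PR): the PyTorch model has to be released in the scope where it was loaded before the saved IR is read back for compression, otherwise both models sit in memory at the same time.

```python
import gc
import openvino as ov

# Hypothetical sketch of the flow described above; not the code in this PR.
def convert_and_compress(load_model_fn, export_fn, compress_fn, ir_path):
    model = load_model_fn()      # load the PyTorch model
    export_fn(model, ir_path)    # convert and save the OpenVINO IR to disk
    del model                    # drop the only reference to the torch model
    gc.collect()                 # reclaim its memory before compression starts
    ov_model = ov.Core().read_model(ir_path)
    return compress_fn(ov_model)
```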
The PR has been stale for a while. @IlyasMoutawwakil @echarlaix do you have any additional concerns regarding this PR?
@echarlaix @IlyasMoutawwakil could you please merge these changes? The issue with the inability to remove the model from the tmp dir is blocking several customer requests, thanks.
What does this PR do?
Fixes openvinotoolkit/openvino#26844
Reworked the conversion step and the weight compression that follows it to avoid the tmp directory lock on Windows and an unnecessary model read.
Added a handler for the tmp dir that allows ignoring permission errors (equivalent behavior is available natively since Python 3.10; this backport for Python <3.10 can be removed once Python 3.9 support is dropped).
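For illustration, a minimal sketch of such a backport, assuming it mirrors TemporaryDirectory(ignore_cleanup_errors=True) from Python 3.10; this is not the exact helper added in the PR:

```python
import shutil
import sys
import tempfile

# Hypothetical backport sketch: ignore permission errors on tmp-dir cleanup,
# mirroring TemporaryDirectory(ignore_cleanup_errors=True) from Python 3.10.
if sys.version_info >= (3, 10):
    def make_tmp_dir():
        return tempfile.TemporaryDirectory(ignore_cleanup_errors=True)
else:
    class _IgnoreErrorsTemporaryDirectory(tempfile.TemporaryDirectory):
        def cleanup(self):
            try:
                super().cleanup()
            except PermissionError:
                # A file may still be locked on Windows; fall back to
                # best-effort removal instead of raising.
                shutil.rmtree(self.name, ignore_errors=True)

    def make_tmp_dir():
        return _IgnoreErrorsTemporaryDirectory()
```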
Before submitting