
Raise an error when OVQuantizer is invoked on a compressed model #1122

Open
l-bat wants to merge 3 commits into main from lt/handle_quant_after_wc
Conversation

@l-bat (Contributor) commented Jan 20, 2025

What does this PR do?

For models exceeding 1 billion parameters, optimum-intel applies weight compression by default during the from_pretrained method call. This behavior impacts the functionality of subsequent quantization algorithms because patterns cannot be matched properly, leading to incorrect results (ref: 160308).

This PR introduces a mechanism to detect if the model was compressed by checking runtime information. If a user attempts to quantize a compressed model, an error is raised to prevent incorrect behavior.
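
A minimal sketch of the workaround that the new error message points to, i.e. disabling the default compression at load time (the model id below is only a placeholder, not something from this PR):

# Sketch: load without the automatic 8-bit weight compression so that
# OVQuantizer can be applied afterwards.
from optimum.intel import OVModelForCausalLM, OVQuantizer

model = OVModelForCausalLM.from_pretrained(
    "some-org/some-large-model",  # placeholder id for a model with > 1B parameters
    export=True,
    load_in_8bit=False,  # skip the default weight compression
)
quantizer = OVQuantizer.from_pretrained(model)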

Fixes # (issue)
ticket: 160760

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@l-bat (Contributor Author) commented Jan 20, 2025

@AlexKoff88, @nikita-savelyevv please take a look

@@ -315,6 +315,31 @@ def quantize(
        else:
            raise TypeError(f"Unsupported model type: {type(self.model)}")

    def _check_model_state(self, sub_model_names: List[str] = None):
        message_template = (
            "Couldn't apply optimization to the model because it was already compressed with config: {}. "
Collaborator:
Suggested change:
- "Couldn't apply optimization to the model because it was already compressed with config: {}. "
+ "Cannot apply optimization to the model because it was already optimized with the following config: {}. "

Comment on lines 471 to 445
        elif isinstance(self.model, OVModelForVisualCausalLM):
            language_model = self.model.language_model
-           _weight_only_quantization(language_model.model, quantization_config, calibration_dataset, **kwargs)
            sub_model_names = ["vision_embeddings", "text_embeddings"] + self.model.additional_parts
+           self._check_model_state(sub_model_names + ["language_model"])
+           _weight_only_quantization(language_model.model, quantization_config, calibration_dataset, **kwargs)
Collaborator:
It seems like we should now obtain the model type and submodel names once at the beginning, and then reuse these variables later to choose which type of dataset to collect and how to apply optimization. With this, we could also call _check_model_state only once.

I can do this myself in a separate follow-up PR though.
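
A rough sketch of the restructuring suggested above, assuming a standalone helper (names and branches are illustrative, not the final implementation):

# Illustrative only: resolve the submodel names once, up front, so that the
# not-optimized check and the later dataset collection can reuse them.
from typing import List

from optimum.intel import OVModelForSeq2SeqLM, OVModelForVisualCausalLM


def resolve_sub_model_names(model) -> List[str]:
    if isinstance(model, OVModelForVisualCausalLM):
        return ["vision_embeddings", "text_embeddings", "language_model"] + model.additional_parts
    if isinstance(model, OVModelForSeq2SeqLM):
        names = ["encoder", "decoder"]
        if model.decoder_with_past is not None:
            names.append("decoder_with_past")
        return names
    return []  # single-submodel case: check the model itself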

Comment on lines 480 to 488
        elif isinstance(self.model, OVModelForSeq2SeqLM):
            sub_model_names = ["encoder", "decoder"]
            if self.model.decoder_with_past is not None:
                sub_model_names.append("decoder_with_past")
            self._check_model_state(sub_model_names)
            sub_models = [getattr(self.model, name) for name in sub_model_names]
            for sub_model in sub_models:
                _weight_only_quantization(sub_model, quantization_config, **kwargs)
            self.model.clear_requests()
Collaborator:
Thank you for the effort, but I believe compression support for seq2seq models should be added separately, because we also need to write dataset collection logic for this type of model and add some tests.

Contributor Author:

ticket: 160958

@@ -315,6 +315,31 @@ def quantize(
        else:
            raise TypeError(f"Unsupported model type: {type(self.model)}")

    def _check_model_state(self, sub_model_names: List[str] = None):
Collaborator:
To me, the name seems too general. How about this?

Suggested change:
- def _check_model_state(self, sub_model_names: List[str] = None):
+ def _verify_not_optimized(self, sub_model_names: List[str] = None):

Could you please also move it lower, to the end of the class method declarations?

Comment on lines 715 to 716
            with pytest.raises(RuntimeError):
                quantizer.quantize(ov_config=OVConfig(quantization_config=quantization_config))
Collaborator:
Could you please also specify the message with which the error should be raised?
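
For example, something along these lines, assuming the final wording of the message keeps the "already optimized" phrase (the regex fragment is an assumption):

# Sketch: assert both the exception type and (part of) its message,
# reusing the quantizer and config objects from the surrounding test.
with pytest.raises(RuntimeError, match="it was already optimized"):
    quantizer.quantize(ov_config=OVConfig(quantization_config=quantization_config))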

Comment on lines 711 to 713
            if isinstance(model, OVModelOpenCLIPForZeroShotImageClassification):
                with pytest.raises(TypeError):
                    quantizer.quantize(ov_config=OVConfig(quantization_config=quantization_config))
Collaborator:
Why is it different for OVModelOpenCLIPForZeroShotImageClassification?

Contributor Author:

Quantization is not supported for OVModelOpenCLIPForZeroShotImageClassification. I'll skip it in the test.
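
A possible shape for the skip, assuming the unittest-style test class used elsewhere in this file (sketch only):

# Sketch: skip the unsupported architecture instead of expecting a TypeError.
if isinstance(model, OVModelOpenCLIPForZeroShotImageClassification):
    self.skipTest("Quantization is not supported for OVModelOpenCLIPForZeroShotImageClassification")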

Contributor Author:

ticket: 161043

@@ -698,6 +698,23 @@ def test_ovmodel_load_with_compressed_weights(self, model_cls, model_type, trust
        _, num_weight_nodes = get_num_quantized_nodes(model)
        self.assertEqual(expected_ov_int8[i], num_weight_nodes["int8"])

    @parameterized.expand(SUPPORTED_ARCHITECTURES_WITH_AUTO_COMPRESSION)
    def test_raise_error_WC_over_WC(self, model_cls, model_type, trust_remote_code):
Collaborator:
The list of models inside SUPPORTED_ARCHITECTURES_WITH_AUTO_COMPRESSION is quite big. My concern is that test time will increase significantly with the addition of this test.

We could limit the model scope. Or, even better, add this logic to the end of the existing test test_ovmodel_load_with_compressed_weights, which already does int8 compression. Running a quantization that will instantly fail on top of this should not increase testing time much.

Contributor Author:

The test time for all cases is 0:03:15 on my machine. I can incorporate this into test_ovmodel_load_with_compressed_weights to minimize the impact on testing time.

    def _check_model_state(self, sub_model_names: List[str] = None):
        message_template = (
            "Couldn't apply optimization to the model because it was already compressed with config: {}. "
            "To avoid this issue, set load_in_8bit=False in the from_pretrained method when using the optimum-intel API, "
Collaborator:
Suggested change:
- "To avoid this issue, set load_in_8bit=False in the from_pretrained method when using the optimum-intel API, "
+ "To avoid this issue, check that you set load_in_8bit=False or not using quantization_config at export in the .from_pretrained(),"

        message_template = (
            "Couldn't apply optimization to the model because it was already compressed with config: {}. "
            "To avoid this issue, set load_in_8bit=False in the from_pretrained method when using the optimum-intel API, "
            "or explicitly specify the desired weight format using --weight_format fp16/fp32 for CLI."
Collaborator:
Suggested change:
- "or explicitly specify the desired weight format using --weight_format fp16/fp32 for CLI."
+ "or explicitly specify weight format with --weight_format fp16/fp32 when using CLI."

@@ -698,6 +698,23 @@ def test_ovmodel_load_with_compressed_weights(self, model_cls, model_type, trust
        _, num_weight_nodes = get_num_quantized_nodes(model)
        self.assertEqual(expected_ov_int8[i], num_weight_nodes["int8"])

    @parameterized.expand(SUPPORTED_ARCHITECTURES_WITH_AUTO_COMPRESSION)
Collaborator:
We need to check that the exception is thrown both when we compress a quantized or compressed model and when we quantize a quantized or compressed model.

@l-bat force-pushed the lt/handle_quant_after_wc branch from 0420051 to 92facae on January 23, 2025 16:58
@@ -1010,6 +1010,7 @@ def _weight_only_quantization(
    calibration_dataset: Optional[Union[nncf.Dataset, Iterable]] = None,
    **kwargs,
) -> openvino.runtime.Model:
    _verify_not_optimized(model)
Collaborator:
👍
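
For readers following the thread, a minimal sketch of what a module-level _verify_not_optimized could look like. It assumes NNCF records the applied optimization parameters in the OpenVINO model's rt_info under an "nncf" key (the "runtime information" the PR description refers to); the exact rt_info paths and message are assumptions, not the code from this PR:

# Sketch only: detect prior NNCF optimization via the model's rt_info and refuse to proceed.
import openvino


def _verify_not_optimized(ov_model: openvino.Model) -> None:
    # Assumed rt_info layout: ["nncf", "weight_compression"] for weight compression,
    # ["nncf", "quantization"] for full quantization.
    for key in ("weight_compression", "quantization"):
        if ov_model.has_rt_info(["nncf", key]):
            config = ov_model.get_rt_info(["nncf", key])
            raise RuntimeError(
                "Cannot apply optimization to the model because it was already optimized "
                f"with the following config: {config}. To avoid this issue, set load_in_8bit=False "
                "in the from_pretrained method when using the optimum-intel API, or explicitly "
                "specify the desired weight format using --weight_format fp16/fp32 for CLI."
            )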

@@ -214,6 +214,9 @@ def preprocess_function(examples, tokenizer):
        # Verify that the configuration is correctly saved and loaded
        loaded_config = OVConfig.from_pretrained(tmp_dir)
        self.assertEqual(ov_config.quantization_config.to_dict(), loaded_config.quantization_config.to_dict())
        check_optimization_not_applicable_to_optimized_model(
Collaborator:
I see that you've added checks for quantized models when compression is applied to them. Could you please also add this check to some tests which apply compression? For example test_ovmodel_load_with_compressed_weights and test_ovmodel_4bit_auto_compression_with_config.

Collaborator:
I now see that the check for test_ovmodel_load_with_compressed_weights has already been added.
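
A possible shape for the check_optimization_not_applicable_to_optimized_model helper mentioned above (its actual signature and the message fragment are assumptions made for illustration):

# Sketch: attempt a second optimization over an already optimized model and
# assert that it fails with the expected RuntimeError.
import pytest

from optimum.intel import OVConfig, OVQuantizer


def check_optimization_not_applicable_to_optimized_model(model, quantization_config):
    quantizer = OVQuantizer(model)
    with pytest.raises(RuntimeError, match="it was already optimized"):
        quantizer.quantize(ov_config=OVConfig(quantization_config=quantization_config))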
