[Release_v2160] Update Release notes #3380
base: release_v2160
Conversation
@alexsu52, @ljaljushkin, @l-bat, @nikita-savelyevv, @andreyanufr, @andrey-churkin, @daniil-lyakhov, @kshpv, @AlexanderDokuchaev, @anzr299 please fill the document with your changes for the upcoming release.
- Features:
  - ...
- Fixes:
  - Fixed occasional failures of the weight compression algorithm on ARM CPUs.
- Fixes:
  - Fixed occasional failures of the weight compression algorithm on ARM CPUs.
- Improvements:
  - Reduced the run time and peak memory of the mixed precision assignment procedure during weight compression in the OpenVINO backend. Overall compression time reduction in the mixed precision case is about 20-40%; peak memory reduction is about 20%.
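For context on the improvement above, here is a minimal usage sketch (not part of this PR) of weight compression on an OpenVINO model with `ratio` below 1.0, which is what triggers the mixed precision assignment procedure; the model path and parameter values are placeholders.

```python
import nncf
import openvino as ov

# Load an OpenVINO IR model (the path is a placeholder for illustration).
model = ov.Core().read_model("model.xml")

# With ratio < 1.0 the mixed precision assignment procedure decides which
# weights are compressed to int4 and which stay in the backup precision
# (int8 by default).
compressed_model = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_SYM,
    ratio=0.8,       # ~80% of weights in int4, the rest in int8
    group_size=128,
)
ov.save_model(compressed_model, "model_int4.xml")
```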
ReleaseNotes.md (Outdated)
- General:
  - ...
- Features:
  - (Torch) Introduced a novel weight compression method for Large Language Models (LLMs) that significantly improves accuracy with int4 weights. Leveraging Quantization-Aware Training (QAT) and absorbable LoRA adapters, this approach can achieve a 2x reduction in accuracy loss during compression compared to the best post-training weight compression technique in NNCF (Scale Estimation + AWQ + GPTQ). The `nncf.compress_weights` API now includes a new `compression_format` option, `CompressionFormat.FQ_LORA`, for this QAT method, and a sample compression pipeline with preview support is available [here](examples/llm_compression/torch/qat_with_lora).
- Fixes:
  - Fixed occasional failures of the weight compression algorithm on ARM CPUs.
  - (Torch) Fixed weight compression for float16/bfloat16 models.
reworked FQ + Lora
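As a reading aid for the FQ_LORA item, here is a minimal sketch of how the new `compression_format` option could be used on a Hugging Face Torch model. The model id, calibration texts, and the omitted fine-tuning loop are illustrative placeholders, not part of this PR; the full preview pipeline is in examples/llm_compression/torch/qat_with_lora.

```python
import nncf
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-125m"  # illustrative small model, not from the PR
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# A handful of tokenized samples to initialize quantization parameters
# (the real pipeline uses a proper calibration set and preprocessing).
texts = ["NNCF compresses LLM weights to int4.", "QAT with LoRA recovers accuracy."]
calibration_dataset = nncf.Dataset([tokenizer(t, return_tensors="pt") for t in texts])

# Insert FakeQuantize operations plus absorbable LoRA adapters for int4 QAT.
model = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_ASYM,
    group_size=64,
    dataset=calibration_dataset,
    compression_format=nncf.CompressionFormat.FQ_LORA,
)

# ... a regular fine-tuning loop over `model` (the QAT stage) would go here ...
```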
Requirements:
- Updated PyTorch (2.6.0) and Torchvision (0.21.0) versions.
  - (Torch) Fixed weight compression for float16/bfloat16 models.
- Improvements:
  - Reduced the run time and peak memory of the mixed precision assignment procedure during weight compression in the OpenVINO backend. Overall compression time reduction in the mixed precision case is about 20-40%; peak memory reduction is about 20%.
  - (TorchFX, Experimental) Added quantization support for [TorchFX](https://pytorch.org/docs/stable/fx.html) models exported with dynamic shapes.
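A hypothetical sketch of the TorchFX flow mentioned above: a toy model is exported with a dynamic batch dimension via torch.export and the resulting graph is quantized with `nncf.quantize`. The model, calibration data, and dimension name are made up for illustration, and the exact export path supported by NNCF may differ.

```python
import torch
import nncf

class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(8, 4)

    def forward(self, x):
        return torch.relu(self.fc(x))

model = TinyModel().eval()
example_input = torch.randn(2, 8)

# Export to a torch.fx.GraphModule with a dynamic batch dimension.
batch = torch.export.Dim("batch")
exported_program = torch.export.export(
    model, (example_input,), dynamic_shapes={"x": {0: batch}}
)
fx_model = exported_program.module()

# Post-training quantization of the exported graph with a small calibration set.
calibration_dataset = nncf.Dataset([torch.randn(4, 8) for _ in range(10)])
quantized_fx_model = nncf.quantize(fx_model, calibration_dataset)
```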
  - ...
- Features:
  - (Torch) Introduced a novel weight compression method to significantly improve the accuracy of Large Language Models (LLMs) with int4 weights. Leveraging Quantization-Aware Training (QAT) and absorbable LoRA adapters, this approach can achieve a 2x reduction in accuracy loss during compression compared to the best post-training weight compression technique in NNCF (Scale Estimation + AWQ + GPTQ). The `nncf.compress_weights` API now includes a new `compression_format` option, `CompressionFormat.FQ_LORA`, for this QAT method, and a sample compression pipeline with preview support is available [here](examples/llm_compression/torch/qat_with_lora).
  - (Torch) Added support for 4-bit weight compression, along with the AWQ and Scale Estimation data-aware methods, to reduce quality loss after compression.
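A minimal sketch, assuming the existing data-aware options of `nncf.compress_weights`, of the Torch 4-bit compression with AWQ and Scale Estimation enabled; the model id and calibration texts are illustrative placeholders, not from this PR.

```python
import nncf
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-125m"  # illustrative, not from the PR
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

texts = ["Calibration sample one.", "Calibration sample two."]
dataset = nncf.Dataset([tokenizer(t, return_tensors="pt") for t in texts])

compressed_model = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_SYM,
    group_size=64,
    dataset=dataset,        # the data-aware methods below need calibration data
    awq=True,               # Activation-aware Weight Quantization
    scale_estimation=True,  # Scale Estimation to reduce quantization error
)
```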
Changes
Reason for changes
Related tickets
For the contributors:
- Please add your changes (as a commit to the branch) to the list according to the template and previous notes.
- Do not add tests-related notes.
- Provide the list of the PRs (for all your notes) in the comment for the discussion.