
[Release_v2160] Update Release notes #3380

Open · wants to merge 9 commits into base: release_v2160

Conversation

nikita-malininn (Collaborator)

Changes

  • Added the v2.16.0 release notes template

Reason for changes

  • Upcoming release

Related tickets

  • 164968

For the contributors:

  • Please add your changes (as a commit to this branch) to the list, following the template and the previous release notes.
  • Do not add test-related notes.
  • Provide the list of PRs (for all your notes) in a comment for discussion.

@nikita-malininn requested a review from a team as a code owner on March 26, 2025 at 10:11
@nikita-malininn (Collaborator, Author)

@alexsu52, @ljaljushkin, @l-bat, @nikita-savelyevv, @andreyanufr, @andrey-churkin, @daniil-lyakhov, @kshpv, @AlexanderDokuchaev, @anzr299 fill the document with your changes for the upcoming release, please.

The github-actions bot added the documentation and release target labels on Mar 26, 2025.

- Features:
  - ...
- Fixes:
  - Fixed occasional failures of the weight compression algorithm on ARM CPUs.
- Improvements:
  - Reduced the run time and peak memory of the mixed precision assignment procedure during weight compression in the OpenVINO backend. In the mixed precision case, overall compression time drops by about 20-40% and peak memory by about 20%.

ReleaseNotes.md Outdated

- General:
  - ...
- Features:
  - (Torch) Introduced a novel weight compression method for Large Language Models (LLMs) that significantly improves accuracy with int4 weights. Leveraging Quantization-Aware Training (QAT) and absorbable LoRA adapters, this approach can achieve a 2x reduction in accuracy loss during compression compared to the best post-training weight compression technique in NNCF (Scale Estimation + AWQ + GPTQ). The `nncf.compress_weights` API now includes a new `compression_format` option, `CompressionFormat.FQ_LORA`, for this QAT method, and a sample compression pipeline with preview support is available [here](examples/llm_compression/torch/qat_with_lora) (see the sketch below).
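
A minimal sketch of what an FQ_LORA compression call might look like, assuming `compress_weights` and `CompressionFormat` are exposed at the top-level `nncf` namespace and using a placeholder Hugging Face model; the int4 mode and calibration wiring below are illustrative assumptions, not something this PR specifies:

```python
import nncf
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "HuggingFaceTB/SmolLM-135M"  # placeholder; any causal LM works
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Toy calibration sample; a real QAT pipeline would use a training corpus.
sample = tokenizer("NNCF compresses LLM weights to int4.", return_tensors="pt")
calibration_dataset = nncf.Dataset([dict(sample)])

# Request int4 weights in the FQ_LORA format: layers receive fake-quantizers
# plus absorbable LoRA adapters that can subsequently be fine-tuned (QAT).
model = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_ASYM,  # assumed 4-bit mode
    dataset=calibration_dataset,
    compression_format=nncf.CompressionFormat.FQ_LORA,
)
```

The linked example covers the full tune-and-export pipeline; this sketch only shows the initial compression call.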

- Fixes:
  - (Torch) Fixed weight compression for float16/bfloat16 models.
Contributor: reworked FQ + Lora

- Requirements:
  - Updated PyTorch (2.6.0) and Torchvision (0.21.0) versions.

- Improvements:
  - (TorchFX, Experimental) Added quantization support for [TorchFX](https://pytorch.org/docs/stable/fx.html) models exported with dynamic shapes (see the sketch below).
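
A sketch of the export-then-quantize flow this item refers to, using the standard `torch.export` dynamic-shapes API; the toy model and the exact `nncf.quantize` call on the resulting FX graph are illustrative assumptions:

```python
import torch
import nncf

class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(16, 4)

    def forward(self, x):
        return self.linear(x)

model = TinyModel().eval()
example_input = torch.randn(2, 16)

# Export to a torch.fx GraphModule with a dynamic batch dimension.
batch = torch.export.Dim("batch")
exported = torch.export.export(model, (example_input,), dynamic_shapes=({0: batch},))
fx_model = exported.module()

# Quantize the dynamically shaped FX model (experimental in this release).
calibration_dataset = nncf.Dataset([example_input])
quantized_fx_model = nncf.quantize(fx_model, calibration_dataset)
```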

- Features:
  - (Torch) Added support for 4-bit weight compression, along with the AWQ and Scale Estimation data-aware methods, to reduce quality loss after compression (see the sketch below).
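
A sketch of how 4-bit compression is typically combined with the AWQ and Scale Estimation options; the `mode`, `ratio`, `awq`, and `scale_estimation` parameters mirror the existing `nncf.compress_weights` API, and the tiny stand-in model below is only for illustration:

```python
import torch
import nncf

# Stand-in for an LLM: any torch model with large Linear layers qualifies.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 128),
)
example_input = torch.randn(1, 128)

compressed_model = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_SYM,  # 4-bit weight quantization
    ratio=0.8,                               # share of weights taken down to int4
    dataset=nncf.Dataset([example_input]),   # data for the data-aware methods
    awq=True,                                # activation-aware weight rescaling
    scale_estimation=True,                   # data-aware refinement of quantization scales
)
```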
