Apply weight compression after model save to reduce peak RAM during export #878

nikita-savelyevv · 2024-08-22T16:06:55Z

What does this PR do?

Currently, during OV model export with compression, compression is applied right after the model is converted. Because of this, the memory required for the conversion can't be deallocated before compression starts.

This PR introduces an intermediate step that saves the model in full precision. Compression is applied after this step, which lowers peak memory usage. For INT8 compression the improvement is estimated to be at least 20%. Please see the figures below for "before" and "after" for INT8 compression of the llama-2-7b-hf model.
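The effect of the intermediate save can be illustrated with a toy sketch. It is not the code from this PR: `fake_convert` and `fake_compress` are hypothetical stand-ins built on plain byte buffers, and `tracemalloc` only tracks Python-level allocations, so the numbers say nothing about real OpenVINO exports. The sketch only shows why freeing conversion-time memory before compression lowers the peak.

```python
import gc
import os
import tempfile
import tracemalloc

SIZE = 4 * 1024 * 1024  # 4 MiB stand-in for the full-precision weights


def fake_convert(size):
    """Hypothetical stand-in for conversion: returns the weights plus
    scratch memory that only the conversion step needs."""
    weights = bytes(size)
    scratch = bytearray(size)
    return weights, scratch


def fake_compress(weights):
    """Hypothetical stand-in for compression: halves the payload."""
    return weights[::2]


def compress_right_after_conversion(size):
    weights, scratch = fake_convert(size)
    # The conversion scratch memory is still referenced while compressing.
    compressed = fake_compress(weights)
    return len(compressed)


def compress_after_saving(size, tmp_dir):
    weights, scratch = fake_convert(size)
    path = os.path.join(tmp_dir, "model_full_precision.bin")
    with open(path, "wb") as f:
        f.write(weights)
    # Drop everything conversion-related before compression starts.
    del weights, scratch
    gc.collect()
    with open(path, "rb") as f:
        reloaded = f.read()
    compressed = fake_compress(reloaded)
    return len(compressed)


def peak_bytes(fn, *args):
    """Run fn and return (result, peak traced memory in bytes)."""
    tracemalloc.start()
    try:
        result = fn(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


with tempfile.TemporaryDirectory() as tmp_dir:
    size_before, peak_before = peak_bytes(compress_right_after_conversion, SIZE)
    size_after, peak_after = peak_bytes(compress_after_saving, SIZE, tmp_dir)

print(f"peak before: {peak_before / 2**20:.1f} MiB")
print(f"peak after:  {peak_after / 2**20:.1f} MiB")
```

Both variants produce a compressed payload of the same size. The second one pays for an extra write and read of the full-precision data, which mirrors the longer conversion time reported in this PR.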

Memory figures were obtained with https://github.com/openvinotoolkit/nncf/blob/develop/tools/memory_monitor.py

The memory improvement applies only to data-free compression via optimum-cli. For export via the Python API, i.e. from_pretrained(..., export=True), memory is already handled efficiently thanks to intermediate saving. The same holds for data-aware compression via optimum-cli.

[Figures: memory usage before and after]

| Model | Peak Mem. Before (MiB) | Peak Mem. After (MiB) | Time Before (sec.) | Time After (sec.) |
|---|---|---|---|---|
| opt-350m, INT8 | 1789 | 1461 (-19%) | 20 | 20 (+0%) |
| opt-350m, INT4 | 1734 | 1466 (-15%) | 28 | 29 (+3%) |
| tiny-llama-1.1b, INT8 | 4370 | 3888 (-11%) | 33 | 32 (-3%) |
| tiny-llama-1.1b, INT4 | 4005 | 3798 (-5%) | 63 | 62 (-2%) |
| open-llama-3b, INT8 | 11203 | 7822 (-30%) | 48 | 55 (+15%) |
| open-llama-3b, INT4 | 9750 | 7822 (-20%) | 150 | 158 (+5%) |
| llama2-7b-hf, INT8 | 21108 | 14551 (-31%) | 80 | 94 (+18%) |
| llama2-7b-hf, INT4 | 18039 | 14531 (-19%) | 276 | 292 (+6%) |

The results for tiny-llama-1.1b are unexpected, and there is no explanation for them yet. For the other models there is a noticeable memory saving at the cost of a longer conversion time due to the intermediate model saving.
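As a reading aid for the table, here is a small sketch of how the parenthesized percentages can be interpreted. It assumes they are relative changes, (after - before) / before, rounded to whole percent. That formula is an assumption, not something stated in the PR, and `relative_change_percent` is a hypothetical helper. The llama2-7b-hf rows are used as input.

```python
def relative_change_percent(before, after):
    """Return the change from `before` to `after` as a rounded percentage."""
    return round(100 * (after - before) / before)


# (peak MiB before, peak MiB after, seconds before, seconds after),
# copied from the llama2-7b-hf rows of the table.
llama2_7b = {
    "INT8": (21108, 14551, 80, 94),
    "INT4": (18039, 14531, 276, 292),
}

for mode, (mem_before, mem_after, t_before, t_after) in llama2_7b.items():
    mem_pct = relative_change_percent(mem_before, mem_after)
    time_pct = relative_change_percent(t_before, t_after)
    print(f"llama2-7b-hf, {mode}: memory {mem_pct:+d}%, time {time_pct:+d}%")
```

Under that assumption the computed values match the llama2-7b-hf entries in the table: roughly a fifth to a third less peak memory for 6% to 18% more conversion time.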

Related tickets
147935

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?

nikita-savelyevv · 2024-08-26T12:48:23Z

@eaidova @AlexKoff88 please take a look

HuggingFaceDocBuilderDev · 2024-08-27T09:35:30Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

optimum/exporters/openvino/__main__.py

eaidova · 2024-08-30T03:52:29Z

@IlyasMoutawwakil @echarlaix could you please merge?

nikita-savelyevv force-pushed the compress-after-model-saving branch from aab15e1 to 3ffacd3 August 22, 2024 16:07

nikita-savelyevv added 4 commits August 22, 2024 18:08

Initial commit

3ffacd3

Style

7789c55

Adopt tests

194a997

Add no-nncf warning

e744188

nikita-savelyevv marked this pull request as ready for review August 26, 2024 12:46

nikita-savelyevv changed the title ~~Apply weight compression after model is saved to save RAM~~ Apply weight compression after model save to reduce peak RAM during export Aug 26, 2024

eaidova approved these changes Aug 27, 2024

eaidova requested review from AlexKoff88, IlyasMoutawwakil and echarlaix August 27, 2024 08:13

IlyasMoutawwakil reviewed Aug 27, 2024

optimum/exporters/openvino/__main__.py (outdated)

Apply suggested changes

e7c94b0

echarlaix approved these changes Aug 28, 2024

optimum/exporters/openvino/__main__.py (outdated)

Do not save in fp16 in case of weight compression

c7a7f68

IlyasMoutawwakil approved these changes Aug 29, 2024

eaidova mentioned this pull request Aug 30, 2024

fix openvino nightly install in tests #885

Merged

3 tasks

nikita-savelyevv added 2 commits August 30, 2024 13:31

Replace model files right away

a168332

Merge branch 'main' into compress-after-model-saving

77c8f35

echarlaix merged commit b5998f2 into huggingface:main Aug 30, 2024
16 of 17 checks passed

eaidova mentioned this pull request Oct 17, 2024

fix issue with imposibility remove uncompressed model in tmp after compression #925

Closed

3 tasks
