
Run model as compressed/uncompressed mode #34719

Merged

Conversation

@horheynm (Contributor) commented Nov 13, 2024

What does this PR do?

Loading a quantized model with compressed-tensors is currently hardcoded to run in run_compressed mode. This PR lets the model be loaded either compressed (the default) or decompressed, by passing a CompressedTensorsConfig:

from transformers import AutoModelForCausalLM
from transformers.utils.quantization_config import CompressedTensorsConfig

pretrained_model_name_or_path = "neuralmagic/Llama-3.2-11B-Vision-Instruct-FP8-dynamic"  # static config file

# run_compressed=False decompresses the quantized weights at load time
quantization_config = CompressedTensorsConfig(run_compressed=False)
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path,
    quantization_config=quantization_config,
)
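
For contrast, a minimal sketch of the current default path (this snippet is not part of the PR): without overriding run_compressed, the same checkpoint is loaded and executed in compressed mode.

# Default behavior: no config override, so the quantized checkpoint stays
# compressed (run_compressed=True) at inference time.
model_compressed = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path
)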

@Rocketknight1 (Member) commented

cc @SunMarc @MekkCyber for quantization

@SunMarc (Member) left a comment

I see that the goal is to overwrite the run_compressed attribute in the quantization config. To do that, we already have the merge_quantization_configs function, and you mostly just need to create the get_loading_attributes function. I think this will also make the user experience better.

In the end, the user will only need to do:

quantization_config = CompressedTensorsConfig(run_compressed=False)
model = AutoModelForCausalLM.from_pretrained(..., quantization_config=quantization_config)

to load the uncompressed model.
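
To illustrate the suggested mechanism, here is a minimal sketch (not the code that was merged, and the attribute list is an assumption): get_loading_attributes returns the load-time options that merge_quantization_configs copies from the user-supplied config onto the quantization config found in the checkpoint.

import copy

from transformers.utils.quantization_config import CompressedTensorsConfig


def get_loading_attributes(self):
    # Only run_compressed controls how the checkpoint is loaded, so it is the
    # only attribute allowed to override the config serialized with the model.
    attributes_dict = copy.deepcopy(self.__dict__)
    loading_attributes = ["run_compressed"]
    return {k: v for k, v in attributes_dict.items() if k in loading_attributes}


# Attached here only for illustration; in the PR this would be a method on the config class.
CompressedTensorsConfig.get_loading_attributes = get_loading_attributes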

@horheynm changed the title from "draft, run model as compreszed/uncompressed mode" to "Run model as compressed/uncompressed mode" on Nov 22, 2024
@horheynm (Contributor, Author) commented

The PR is in a decent state to review. Will add tests so it can be finalized.

@horheynm marked this pull request as ready for review on November 25, 2024, 18:40
@horheynm (Contributor, Author) commented

@SunMarc
Hey Marc, this PR is ready

@SunMarc (Member) left a comment

Thanks for the integration! Left a few comments.

@dsikka (Contributor) left a comment

Re: offline discussion: check whether the warnings we're seeing on this branch are specific to an uncompressed model vs. a compressed model.

@horheynm (Contributor, Author) commented

@ArthurZucker
Could I get a review please!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker (Collaborator) left a comment

Nice nice! Good addition, thanks 🤗

@ArthurZucker ArthurZucker merged commit e4e404f into huggingface:main Dec 13, 2024
22 checks passed
dsikka pushed a commit to vllm-project/llm-compressor that referenced this pull request Jan 10, 2025
Contingent on merge of huggingface/transformers#34719 (merged and now released).

SUMMARY:
Update the test to use the AutoModelForCausalLM decompression path instead of manually instantiating the compressor and decompressing. When AutoModelForCausalLM recognizes the quantization_config, it runs the same decompression.

TEST PLAN:
Ran the test using transformers main
Must pass:
tests/llmcompressor/transformers/sparsification/test_compress_tensor_utils.py
mgoin pushed a commit to vllm-project/llm-compressor that referenced this pull request Jan 14, 2025
Contingent on merge of huggingface/transformers#34719 (merged, not yet released at the time).


SUMMARY:
Update the run_compressed tests from decompression tests to run_compressed tests: check that run_compressed=True and run_compressed=False models generate the same output (see the sketch after this commit message).

Add decompress tests that copy attrs from the source dir path's model to the target model.

TEST PLAN:
ran the test using transformers main
must pass
tests/llmcompressor/transformers/compression/test_decompress.py
and tests/llmcompressor/transformers/compression/test_run_compressed.py
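
A minimal sketch of the parity check described above, assuming a hypothetical compressed checkpoint path; the idea is that the same prompt yields the same greedy generation whether the model runs compressed or is decompressed on load:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.utils.quantization_config import CompressedTensorsConfig

ckpt = "path/to/compressed-checkpoint"  # hypothetical path
tokenizer = AutoTokenizer.from_pretrained(ckpt)
inputs = tokenizer("Hello, my name is", return_tensors="pt")

# Same checkpoint loaded twice: once compressed (default), once decompressed.
model_compressed = AutoModelForCausalLM.from_pretrained(ckpt)
model_decompressed = AutoModelForCausalLM.from_pretrained(
    ckpt, quantization_config=CompressedTensorsConfig(run_compressed=False)
)

with torch.no_grad():
    out_compressed = model_compressed.generate(**inputs, max_new_tokens=20, do_sample=False)
    out_decompressed = model_decompressed.generate(**inputs, max_new_tokens=20, do_sample=False)

assert tokenizer.decode(out_compressed[0]) == tokenizer.decode(out_decompressed[0])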
dsikka pushed a commit to vllm-project/llm-compressor that referenced this pull request Jan 15, 2025
…d" (#1072)

SUMMARY:
Removed breakpoints and addressed comments for
#970

TEST PLAN:
Ran pytest for the two test files


#970
ORIGINAL PR DESCRIPTION:
Contingent on merge of huggingface/transformers#34719 (merged, not yet released at the time).


SUMMARY:
Update the run_compressed tests from decompression tests to run_compressed tests: check that run_compressed=True and run_compressed=False models generate the same output.

Add decompress tests that copy attrs from the source dir path's model to the target model.

TEST PLAN:
ran the test using transformers main
must pass
tests/llmcompressor/transformers/compression/test_decompress.py
and tests/llmcompressor/transformers/compression/test_run_compressed.py
kylesayrs pushed a commit to vllm-project/llm-compressor that referenced this pull request Jan 15, 2025
Contingent on merge of huggingface/transformers#34719 (merged and now released).

SUMMARY:
Update the test to use the AutoModelForCausalLM decompression path instead of manually instantiating the compressor and decompressing. When AutoModelForCausalLM recognizes the quantization_config, it runs the same decompression.

TEST PLAN:
Ran the test using transformers main
Must pass:
tests/llmcompressor/transformers/sparsification/test_compress_tensor_utils.py

Signed-off-by: Kyle Sayers <[email protected]>
kylesayrs pushed a commit to vllm-project/llm-compressor that referenced this pull request Jan 15, 2025
Contingent on merge of huggingface/transformers#34719 (merged, not yet released at the time).

SUMMARY:
Update the run_compressed tests from decompression tests to run_compressed tests: check that run_compressed=True and run_compressed=False models generate the same output.

Add decompress tests that copy attrs from the source dir path's model to the target model.

TEST PLAN:
ran the test using transformers main
must pass
tests/llmcompressor/transformers/compression/test_decompress.py
and tests/llmcompressor/transformers/compression/test_run_compressed.py

Signed-off-by: Kyle Sayers <[email protected]>
dsikka added a commit to vllm-project/llm-compressor that referenced this pull request Jan 23, 2025
Contingent on merge of huggingface/transformers#34719 (merged, not yet released at the time).

SUMMARY:
Add a test to:
* Given a model, oneshot quantize, then run PTQ training. The model must be loaded with run_compressed=False to run.

Note:
* When running finetune on an already optimized (one-shotted) model, the model needs to be decompressed explicitly using `CompressedTensorsConfig`; see the sketch after this commit message and
https://github.com/vllm-project/llm-compressor/pull/964/files#diff-e480ed475c0a5b2beb4052c1dd2aca671999634ace41a5ea017fdff1ce68be0bR130-R135
* Tests using 2x H100s passed

Also fix a bug where log_sparsification fails because the layer name is not recognized. Nothing is being sparsified here, so the number of parameters is set to zero.

TEST PLAN:
ran the test using transformers main
must pass
tests/llmcompressor/transformers/finetune/test_oneshot_then_finetune.py

---------

Co-authored-by: Dipika Sikka <[email protected]>
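
A hedged sketch of the note above ("output_of_oneshot" is a hypothetical path): the already one-shotted checkpoint is decompressed explicitly at load time before being handed to the finetuning step.

from transformers import AutoModelForCausalLM
from transformers.utils.quantization_config import CompressedTensorsConfig

# Decompress the one-shotted checkpoint so it can be trained further.
model = AutoModelForCausalLM.from_pretrained(
    "output_of_oneshot",  # hypothetical path to the oneshot output
    quantization_config=CompressedTensorsConfig(run_compressed=False),
)
# `model` would then be passed to the finetuning entrypoint.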
dsikka added a commit to vllm-project/llm-compressor that referenced this pull request Jan 23, 2025
Contingent on merge of huggingface/transformers#34719 (merged and now released).


Blocked on 
neuralmagic/compressed-tensors#237

SUMMARY:
* In multiple-optimization tests, automatically decompress the model if an already optimized model is provided
* Fix recipe stage length
* Revive old code
* When running multiple optimizations (e.g. oneshot then finetune, or oneshot then oneshot), the recipes need to be added to the session using `initialize_recipe`. Example here:
https://github.com/vllm-project/llm-compressor/pull/971/files#diff-c9ae8b3ad24d13abeea5b649a5fd6d0b0925f5c9cc40220cbfbe21ae81242f8dR63-R65


TEST PLAN:
ran the test using transformers main
Must pass tests/llmcompressor/transformers/obcq/test_consecutive_runs.py

---------

Co-authored-by: Dipika Sikka <[email protected]>
Co-authored-by: Rahul Tuli <[email protected]>
rahul-tuli added a commit to vllm-project/llm-compressor that referenced this pull request Jan 28, 2025
Contingent on merge of huggingface/transformers#34719 (merged and now released).

Blocked on
neuralmagic/compressed-tensors#237

SUMMARY:
* In multiple-optimization tests, automatically decompress the model if an already optimized model is provided
* Fix recipe stage length
* Revive old code
* When running multiple optimizations (e.g. oneshot then finetune, or oneshot then oneshot), the recipes need to be added to the session using `initialize_recipe`. Example here:
https://github.com/vllm-project/llm-compressor/pull/971/files#diff-c9ae8b3ad24d13abeea5b649a5fd6d0b0925f5c9cc40220cbfbe21ae81242f8dR63-R65

TEST PLAN:
ran the test using transformers main
Must pass tests/llmcompressor/transformers/obcq/test_consecutive_runs.py

---------

Co-authored-by: Dipika Sikka <[email protected]>
Co-authored-by: Rahul Tuli <[email protected]>
Signed-off-by: Rahul Tuli <[email protected]>