Run model as compressed/uncompressed mode #34719
Conversation
cc @SunMarc @MekkCyber for quantization
…magic/upstream-transformers into compressed-tensors/run_compressed
I see that the goal is to overwrite the run_compressed attribute in the quantization config. To do so, we have the merge_quantization_configs function, and you mostly just need to create the get_loading_attributes function. I think this will also make the user experience better.
In the end, the user will only need to do:
quantization_config = CompressedTensorsConfig(run_compressed=False)
model = AutoModelForCausalLM.from_pretrained(..., quantization_config=quantization_config)
to load the uncompressed model.
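For reference, a minimal sketch of what such a get_loading_attributes helper could look like, modeled on how other quantization configs in transformers expose their load-time attributes. The attribute name run_compressed comes from this thread; the exact implementation in the merged PR may differ.

```python
import copy

def get_loading_attributes(self):
    # Expose only the attributes that affect *how* the checkpoint is loaded,
    # so merge_quantization_configs can carry a user-supplied run_compressed
    # flag over to the quantization config stored in the checkpoint.
    attributes = copy.deepcopy(self.__dict__)
    loading_attributes = ["run_compressed"]
    return {k: v for k, v in attributes.items() if k in loading_attributes}
```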
PR is in a decent state to review. Will add tests for it to be finalized.
…magic/upstream-transformers into compressed-tensors/run_compressed
@SunMarc
Thanks for the integration! Left a few comments.
tests/quantization/compressed_tensor/test_run_compressed_model.py
re: offline discussion
check whether the warnings we're seeing on this branch are specific to an uncompressed model vs. a compressed model
…magic/upstream-transformers into compressed-tensors/run_compressed
@ArthurZucker
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Nice nice! Good addition, thanks 🤗
~~Contingent on merge of huggingface/transformers#34719~~ ~~^ has been merged, not yet released~~ ^ has been released
SUMMARY: Update the test to use the AutoModelForCausalLM decompressor instead of manually instantiating the compressor and decompressing. When AutoModelForCausalLM recognizes the quantization_config, it runs the same decompression.
TEST PLAN: Ran the test using transformers main. Must pass: tests/llmcompressor/transformers/sparsification/test_compress_tensor_utils.py
~~Contingent on merge of huggingface/transformers#34719~~ ^ has been merged, not yet released
SUMMARY: Update the run_compressed tests from decompression tests to run_compressed tests -> test whether run_compressed=True and run_compressed=False models generate the same output. Add decompress tests that copy attrs from the source dir path's model to the target model.
TEST PLAN: Ran the tests using transformers main. Must pass: tests/llmcompressor/transformers/compression/test_decompress.py and tests/llmcompressor/transformers/compression/test_run_compressed.py
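A hypothetical sketch of the equivalence check described in the entry above, assuming the CompressedTensorsConfig(run_compressed=...) interface added by this PR; the checkpoint id, prompt, and generation settings are placeholders, not the actual test code:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, CompressedTensorsConfig

# Placeholder checkpoint id; the real tests use quantized checkpoints from the test suite.
model_id = "org/quantized-compressed-tensors-checkpoint"

tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Hello my name is", return_tensors="pt")

# Default path: weights stay in their compressed representation (run_compressed=True).
compressed_model = AutoModelForCausalLM.from_pretrained(model_id)

# Path added by this PR: decompress at load time and run the dense model.
uncompressed_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=CompressedTensorsConfig(run_compressed=False),
)

# Greedy decoding keeps the comparison deterministic.
out_compressed = compressed_model.generate(**inputs, max_new_tokens=20, do_sample=False)
out_uncompressed = uncompressed_model.generate(**inputs, max_new_tokens=20, do_sample=False)
assert tokenizer.decode(out_compressed[0]) == tokenizer.decode(out_uncompressed[0])
```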
…d" (#1072) SUMMARY: Removed breakpoints and addressed comments for #970 TEST PLAN: Ran pytest for the two test files #970 ORIGINAL PR DESCRIPTION: ~~Contingent on merge of huggingface/transformers#34719 ^ has been merged not yet released SUMMARY: Update run_compressed tests from decompression tests to run_comrpressed tests -> test if run_compressed True/False models generate the same output Add decompress tests that copies attrs from the source dir path's model to the target model. TEST PLAN: ran the test using transformers main must pass tests/llmcompressor/transformers/compression/test_decompress.py and tests/llmcompressor/transformers/compression/test_run_compressed.py
~~Contingent on merge of huggingface/transformers#34719~~ ^ has been merged, not yet released
SUMMARY: Add a test that, given a model, oneshot quantizes it (PTQ) and then runs training. The model must be loaded with run_compressed=False to run.
Notes:
* When running finetune on an already optimized (one-shotted) model, the model needs to be decompressed explicitly using `CompressedTensorsConfig`. See https://github.com/vllm-project/llm-compressor/pull/964/files#diff-e480ed475c0a5b2beb4052c1dd2aca671999634ace41a5ea017fdff1ce68be0bR130-R135
* Tests using 2x H100s passed
Also fix a bug in log_sparsification where the layer name is not recognized and it fails; nothing is being sparsified here, so the number of params is set to zero.
TEST PLAN: Ran the test using transformers main. Must pass: tests/llmcompressor/transformers/finetune/test_oneshot_then_finetune.py
---------
Co-authored-by: Dipika Sikka <[email protected]>
~~Contingent on merge of huggingface/transformers#34719~~ ~~^ has been merged, not yet released~~ ^ has been released
Blocked on neuralmagic/compressed-tensors#237
SUMMARY:
* In multiple-optimization tests, automatically decompress the model if an already optimized model is provided
* Fix recipe stage length
* Revive old code
* When running multiple optimizations (e.g. oneshot then finetune, or oneshot then oneshot), the recipes need to be added to the session using `initialize_recipe`. Example: https://github.com/vllm-project/llm-compressor/pull/971/files#diff-c9ae8b3ad24d13abeea5b649a5fd6d0b0925f5c9cc40220cbfbe21ae81242f8dR63-R65
TEST PLAN: Ran the test using transformers main. Must pass: tests/llmcompressor/transformers/obcq/test_consecutive_runs.py
---------
Co-authored-by: Dipika Sikka <[email protected]>
Co-authored-by: Rahul Tuli <[email protected]>
What does this PR do?
Loading a quantized model with compressed-tensors is currently hardcoded to run in run_compressed mode. This PR allows the model to be loaded either compressed (the default) or uncompressed.
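As a rough sketch of the two loading modes this PR enables (the checkpoint name is a placeholder; the behavior comments reflect the description in this thread):

```python
from transformers import AutoModelForCausalLM, CompressedTensorsConfig

model_id = "org/quantized-compressed-tensors-checkpoint"  # placeholder

# Previous/default behavior: the checkpoint is loaded and executed in compressed form.
model_compressed = AutoModelForCausalLM.from_pretrained(model_id)

# With this PR: pass run_compressed=False to decompress the checkpoint at load
# time and run it as an ordinary dense model.
model_uncompressed = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=CompressedTensorsConfig(run_compressed=False),
)
```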