[WIP] Quark Quantizer Support #1207

shobrienDMA · 2025-01-29T15:56:25Z

This PR allows Quark-quantized models to be processed by ONNX Runtime GenAI.

Quark models must be exported in hf_format.

An example quantize_quark.py command:

python quantize_quark.py --model_dir /[Model_Path] \
--output_dir /[Output_Model_Path] \
--quant_scheme w_uint4_per_group_asym \
--num_calib_data 128 \
--quant_algo awq \
--dataset pileval_for_awq_benchmark \
--seq_len 512 \
--model_export hf_format \
--data_type float32

It also allows different group sizes for different layers, depending on what is present in the config.json that Quark produces. A Quark config can look like:

...
  "quantization_config": {
    "algo_config": {
      "model_decoder_layers": "model.layers",
      "name": "awq",
      "num_attention_heads": -1,
      "num_key_value_heads": -1,
      "scaling_layers": [
        {
          "inp": "self_attn.q_proj",
          "layers": [
            "self_attn.q_proj",
            "self_attn.k_proj",
            "self_attn.v_proj"
          ],
          "module2inspect": "self_attn",
          "prev_op": "input_layernorm"
        },
        {
          "inp": "self_attn.o_proj",
          "layers": [
            "self_attn.o_proj"
          ],
          "prev_op": "self_attn.v_proj"
        },
        {
          "inp": "mlp.gate_proj",
          "layers": [
            "mlp.gate_proj",
            "mlp.up_proj"
          ],
          "module2inspect": "mlp",
          "prev_op": "post_attention_layernorm"
        },
        {
          "inp": "mlp.down_proj",
          "layers": [
            "mlp.down_proj"
          ],
          "prev_op": "mlp.up_proj"
        }
      ]
    },
    "exclude": [],
    "export": {
      "kv_cache_group": [],
      "pack_method": "reorder",
      "weight_format": "real_quantized",
      "weight_merge_groups": null
    },
    "global_quant_config": {
      "bias": null,
      "input_tensors": null,
      "output_tensors": null,
      "target_device": null,
      "weight": {
        "ch_axis": 1,
        "dtype": "uint4",
        "group_size": 128,
        "is_dynamic": false,
        "observer_cls": "PerGroupMinMaxObserver",
        "qscheme": "per_group",
        "round_method": "half_even",
        "scale_type": "float",
        "symmetric": false
      }
    },
    "layer_quant_config": {
      "lm_head": {
        "bias": null,
        "input_tensors": null,
        "output_tensors": null,
        "target_device": null,
        "weight": {
          "ch_axis": 1,
          "dtype": "uint4",
          "group_size": 32,
          "is_dynamic": false,
          "observer_cls": "PerGroupMinMaxObserver",
          "qscheme": "per_group",
          "round_method": "half_even",
          "scale_type": "float",
          "symmetric": false
        }
      }
    },
    "layer_type_quant_config": {},
    "quant_method": "quark",
    "quant_mode": "eager_mode"
  },
...

As you can see, the lm_head entry in layer_quant_config has a group size of 32, which differs from the group size of 128 in global_quant_config.
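The per-layer override can be sketched as a simple lookup: use the layer's entry in layer_quant_config when one exists, otherwise fall back to global_quant_config. This is a minimal illustration and not the code in this PR. The helper name resolve_weight_group_size is hypothetical, the dict is a trimmed copy of the config above, and matching layer names by exact key is an assumption.

```python
# Trimmed copy of the "quantization_config" block from the Quark config.json above.
quant_config = {
    "global_quant_config": {
        "weight": {"dtype": "uint4", "group_size": 128, "symmetric": False},
    },
    "layer_quant_config": {
        "lm_head": {
            "weight": {"dtype": "uint4", "group_size": 32, "symmetric": False},
        },
    },
    "quant_method": "quark",
}


def resolve_weight_group_size(quant_config: dict, layer_name: str) -> int:
    """Hypothetical helper: return the weight group size for a layer.

    Assumes layer_quant_config keys are exact layer names. A layer without
    an override (or with a null weight entry) uses global_quant_config.
    """
    override = quant_config.get("layer_quant_config", {}).get(layer_name)
    weight_cfg = (override or {}).get("weight")
    if weight_cfg is None:
        weight_cfg = quant_config["global_quant_config"]["weight"]
    return weight_cfg["group_size"]


lm_head_group_size = resolve_weight_group_size(quant_config, "lm_head")
q_proj_group_size = resolve_weight_group_size(
    quant_config, "model.layers.0.self_attn.q_proj"
)
print(lm_head_group_size, q_proj_group_size)
```

With the config above, lm_head resolves to its own group size, while a decoder projection with no entry in layer_quant_config takes the global value.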


shobrienDMA added 14 commits December 8, 2024 23:04

get a quark quantized model through OGA without config modifications

16e23b7

remove potentially errant lm_head.g_idx = 0

4a458cf

fix lm_head layers not getting the right group size

9cd1b85

refactor quantized_model.py

c1d6341

Merge branch 'amd/main' into amd_shobrien/per_layer_support

ae05635

# Conflicts: # src/python/py/models/builder.py

remove GPTQ extra argument

c661471

use the quark model class

2e8e13e

tidy formatting, remove unnecessary print and comments

67b5c09

tidy up comments and unused variables

a5891e7

Update the quark model class

38ddbf2

remove the make quantized lm_head method since the existing method works for this

229b30f

Merge branch 'main' of https://github.com/microsoft/onnxruntime-genai into amd_shobrien/per_layer_support

06d1f27

update the ONNX Runtime Extensions dependency version

3c39016

Merge branch 'oga/main' into amd/shobrien/per_layer_support

0e4b0d0
