
Add quantization support for Gemma, Gemma2 and PaliGemma #1670

Merged
merged 7 commits into keras-team:master on Jul 3, 2024

Conversation

@james77777778 (Collaborator) commented on Jun 22, 2024

We will need a new release of Keras for this. Currently, I have built the PR based on the master branch of Keras.

The implementation is simple and clean after introducing DTypePolicyMap and some other fixes.
Thanks to @fchollet and @mattdangerw for their help.

It is worth noting that float8 training & inference are also supported in this PR. You can check test_quantize for this.
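For illustration, a minimal float8 sketch (not taken from this PR's tests; it just reuses the preset and prompt from the script below):

import keras

import keras_nlp

keras.config.set_dtype_policy("bfloat16")
model = keras_nlp.models.GemmaCausalLM.from_preset("gemma_1.1_instruct_2b_en")
# `quantize` accepts "int8" or "float8"; float8 keeps training supported,
# so the model can still be fine-tuned afterwards.
model.quantize("float8")
print(model.generate("What is Keras3?", max_length=128))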

Some numbers:

| Model | Memory Usage (bfloat16) | Memory Usage (int8) | Weights (kagglehub) | Weights (int8) | Note |
|---|---|---|---|---|---|
| "gemma_1.1_instruct_2b_en" | 5.69GB | 2.82GB | 4.7GB | 2.4GB | |
| "gemma2_instruct_9b_en" | 20.93GB | 10.14GB | 18GB | 8.7GB | Measured on CPU |
| "pali_gemma_3b_mix_224" | 6.52GB | 3.22GB | 5.5GB | 2.8GB | |

Script:

int8_gemma.py
import argparse
import os
import pathlib
import time
import typing

import keras
import psutil
import tensorflow as tf

import keras_nlp

# Setup kaggle information
os.environ["KAGGLE_USERNAME"] = "xxx"
os.environ["KAGGLE_KEY"] = "xxx"


def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model",
        default="pali_gemma_3b_mix_224",
        choices=[
            "gemma_1.1_instruct_2b_en",
            "pali_gemma_3b_mix_224",
            "gemma2_instruct_9b_en",
        ],
        help="Which model to demonstrate",
    )
    parser.add_argument(
        "--path",
        default=".",
        help="Path to save and load the model",
    )
    parser.add_argument(
        "--save",
        action="store_true",
        help="Quantize and save the model",
    )
    args = parser.parse_args()
    return args


def get_memory_usage():
    # From CPU or GPU:0
    try:
        memory_stats = tf.config.experimental.get_memory_info("GPU:0")
        peak_usage = memory_stats["peak"] / (2**30)
    except Exception:
        memory_usage = psutil.Process().memory_info().rss
        peak_usage = memory_usage / (2**30)
    return peak_usage


def benchmark_pali_gemma(
    model: keras_nlp.models.PaliGemmaCausalLM, image, prompt: str
):
    # Warmup
    model.generate({"images": image, "prompts": prompt}, max_length=128)

    # Benchmark
    st = time.time()
    result = model.generate(
        {"images": image, "prompts": prompt}, max_length=128
    )
    ed = time.time()
    return result, ed - st


def benchmark_gemma(model: keras_nlp.models.GemmaCausalLM, prompt: str):
    # Warmup
    model.generate(prompt, max_length=128)

    # Benchmark
    st = time.time()
    result = model.generate(prompt, max_length=128)
    ed = time.time()
    return result, ed - st


def save_int8_model(
    preset: str,
    model: typing.Union[
        keras_nlp.models.GemmaCausalLM,
        keras_nlp.models.PaliGemmaCausalLM,
    ],
):
    model.quantize("int8")
    model.summary()
    model.save(f"{preset}_int8.keras")


def load(model_path: pathlib.Path):
    model = keras.saving.load_model(model_path)
    return model


if __name__ == "__main__":
    keras.config.set_dtype_policy("bfloat16")
    x = keras.ops.ones([1]) * keras.ops.ones([1])  # Trigger TF dummy logs

    args = get_args()
    path = pathlib.Path(args.path)
    is_pali_gemma = "pali_gemma" in str(args.model)
    print(f"Peak memory usage (init): {get_memory_usage():.3f} GB")

    # Save
    if args.save:
        if is_pali_gemma:
            model = keras_nlp.models.PaliGemmaCausalLM.from_preset(args.model)
        else:
            model = keras_nlp.models.GemmaCausalLM.from_preset(args.model)
        model.summary()
        print(
            "Peak memory usage (loaded float model): "
            f"{get_memory_usage():.3f} GB"
        )
        save_int8_model(args.model, model)
    # Load
    else:
        model_path = path / f"{args.model}_int8.keras"
        model = load(model_path)
        print(
            "Peak memory usage (loaded int8 model): "
            f"{get_memory_usage():.3f} GB"
        )

        if is_pali_gemma:
            image_path = keras.utils.get_file(
                "cow_beach_1.png",
                "https://storage.googleapis.com/keras-cv/models/paligemma/cow_beach_1.png",
            )
            image = keras.utils.load_img(image_path)
            image = keras.utils.img_to_array(image, "channels_last")
            prompt = "describe en\n"
            result, elapsed_time = benchmark_pali_gemma(model, image, prompt)
        else:
            prompt = "What is Keras3?"
            result, elapsed_time = benchmark_gemma(model, prompt)
        print(result)
        print(
            f"The elapsed time for model inference: {elapsed_time:.3f} seconds"
        )

Usage:

# Get quantized model
python int8_gemma.py --model "gemma_1.1_instruct_2b_en" --save
python int8_gemma.py --model "gemma2_instruct_9b_en" --save
python int8_gemma.py --model "pali_gemma_3b_mix_224" --save
# Run
python int8_gemma.py --model "gemma_1.1_instruct_2b_en"
python int8_gemma.py --model "gemma2_instruct_9b_en"
python int8_gemma.py --model "pali_gemma_3b_mix_224"

Outputs:

# Gemma
What is Keras3?

Keras3 is a high-level neural network library built on top of Keras 2. It provides a simplified and more efficient way to build and train deep learning models.

**Key features of Keras3:**

- Simplified API with Keras 2 compatibility
- High-level abstractions for common tasks
- Improved performance and efficiency
- Support for modern neural network architectures


**Benefits of using Keras3:**

- Easier to learn and use
- Faster and more accurate models
- Reduced development time
- Improved portability across different hardware platforms


**How to use Keras3:**

- Import

# PaliGemma
describe en
In this image I can see a cow which affor is in brown color and white color. I can see the sand. In the background I can see the water and the sky.

@github-actions bot added the Gemma (Gemma model specific issues) label on Jun 22, 2024
@james77777778 james77777778 changed the title [WIP] Add quantization support for Gemma Add quantization support for Gemma and PaliGemma Jun 25, 2024
@james77777778 james77777778 marked this pull request as ready for review June 25, 2024 06:50
@james77777778 (Collaborator, Author)

This PR should be ready for review.
Both Gemma and PaliGemma now support quantization (int8 and float8).

@james77777778 james77777778 changed the title Add quantization support for Gemma and PaliGemma Add quantization support for Gemma, Gemma2 and PaliGemma Jun 28, 2024
@james77777778 (Collaborator, Author)

Hi @fchollet @mattdangerw
I have added quantization support for Gemma2 (actually, adding tests is sufficient :) )
Please let me know if any updates are needed.

@mattdangerw (Member)

@james77777778 thanks so much! Sorry for the delay, I was out last week, but just got back in town. Will take a look tomorrow!

@james77777778 (Collaborator, Author)

No hurry. Please take your time.

@mattdangerw mattdangerw self-requested a review July 1, 2024 22:24
@mattdangerw (Member) left a comment:

Looks good!

General comments.

  • Let's try to make the contract between the ReversibleEmbedding layer and the Embedding layer as minimal as possible. Any private functionality might change in core Keras, and we are using a lot of it here (which is fine, let's just reduce it if we can).
  • Let's test this on all models if we can.

Review thread on keras_nlp/src/layers/modeling/reversible_embedding.py (outdated, resolved):

return super()._int8_call(inputs)

def quantize(self, mode):
@mattdangerw (Member):

Could we chain to super here to keep most of the logic, and just handle the if mode == "int8" and not self.tie_weights case below? It would be great to keep as much logic on the super class as we can.

@james77777778 (Collaborator, Author) commented on Jul 3, 2024:

I'm afraid not.
The raising of NotImplementedError in keras.layers.Embedding is intentional and unavoidable. The idea is to prevent undefined behavior when users call Model.quantize.

I can introduce an argument like type_check=True in keras.layers.Embedding to support calling super() in the future.
However, for now, we can only implement quantize from scratch.

EDITED:
keras-team/keras#19949
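A rough sketch of that type_check idea (hypothetical code, not the actual ReversibleEmbedding implementation; it assumes an Embedding.quantize that accepts a type_check argument, which is what the linked Keras PR proposes):

import keras

class TiedEmbeddingSketch(keras.layers.Embedding):
    """Hypothetical subclass showing how quantize() could chain to super."""

    def __init__(self, input_dim, output_dim, tie_weights=True, **kwargs):
        super().__init__(input_dim, output_dim, **kwargs)
        self.tie_weights = tie_weights

    def quantize(self, mode, type_check=False):
        if mode == "int8" and not self.tie_weights:
            # Only the untied "reverse" kernel would need custom handling
            # here; everything else can live on the parent class.
            pass
        # With type_check disabled, keras.layers.Embedding would no longer
        # reject subclasses with NotImplementedError, so the shared int8
        # logic could be reused instead of re-implemented.
        super().quantize(mode, type_check=type_check)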

@mattdangerw (Member):

I see, thanks for the explainer.

Not to solve in this PR, but I wonder if we can make the contract between Keras and downstream here more public and minimal. I see _int8_call(), _int8_build(), _quantization_mode_error(), _tracker, and _untrack_variable() all used here. That's a pretty significant level of private usage, which could easily break.

Separate question: will this work with older versions of Keras 3? Or are there small changes we could make so we don't break older versions?

@james77777778 (Collaborator, Author) commented on Jul 4, 2024:

I agree that these methods are too verbose for downstream projects. I will try to simplify the contract in the future, but I don't currently have a good idea for it.

will this work with older versions of Keras 3?

I haven't checked the compatibility. My rough guess is that users will need keras>=3.4.0 due to the introduction of DTypePolicyMap.
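A small guard that downstream code could use (illustrative only; the keras>=3.4.0 floor is just the guess above, not a verified minimum):

import keras

# Illustrative compatibility check; assumes DTypePolicyMap is the only
# new dependency and that it ships with keras>=3.4.0 (a guess).
if not hasattr(keras.dtype_policies, "DTypePolicyMap"):
    raise ImportError(
        "Quantized presets require a Keras release that provides "
        "keras.dtype_policies.DTypePolicyMap (expected in keras>=3.4.0)."
    )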

"name": self.name,
"trainable": self.trainable,
}

# Add quantization support by utilizing `DTypePolicyMap`
@mattdangerw (Member):

This is great! This should buy us support for all models, right? If possible, we should consider extending our common backbone tests for this...

https://github.com/keras-team/keras-nlp/blob/e4f09b24c699857edae27c8054aab44078e9cbd5/keras_nlp/src/tests/test_case.py#L359-L367

https://github.com/keras-team/keras-nlp/blob/e4f09b24c699857edae27c8054aab44078e9cbd5/keras_nlp/src/models/gemma/gemma_backbone_test.py#L39-L45

Doing so would test quantization for the whole library. It seems like it should be doable: call quantize, assert on the output. WDYT?

If we run into failures for certain models, we could add an option to run_backbone_test, called run_quantization_check=True, and set the option to false if the model fails, with a TODO to investigate.
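For reference, a minimal sketch of what such a check could look like (the name and assertion here are illustrative, not the actual test harness code; see the reply below for what was actually added):

import keras

def run_quantization_check(backbone_cls, init_kwargs, input_data):
    # Illustrative only: build the backbone, quantize it in place,
    # and make sure a forward pass still produces finite outputs.
    model = backbone_cls(**init_kwargs)
    model.quantize("int8")
    output = model(input_data)
    assert bool(keras.ops.all(keras.ops.isfinite(output)))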

@james77777778 (Collaborator, Author):

Yeah, it is doable.
I have added run_quantization_test to run_backbone_test. Only Bloom and OPT failed the test.
However, there is a significant speed regression after adding this test. The CI time increased from ~19mins to ~27mins. Is this acceptable?

@mattdangerw (Member):

I think having the coverage is important. Let's pull this in, and see if we can improve the runtime efficiency as a follow up.

Saving is slow, so maybe we can do something like:

  • Basic quantization tests do not hit saving: just test get_config() and from_config(), maybe assigning the weights over (see the sketch below).
  • Separate quantization testing in our saving test harness, which is marked as large and only run on larger/faster hardware.
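A rough sketch of the first bullet (illustrative; it assumes model is an already-quantized backbone and that from_config() rebuilds the same variable structure):

# Round-trip through config instead of saving to disk, then copy the
# weights over; no .keras file is written.
config = model.get_config()
revived = model.__class__.from_config(config)
revived.set_weights(model.get_weights())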

@james77777778 (Collaborator, Author):

Will try this in another PR.

Review thread on keras_nlp/src/layers/modeling/reversible_embedding.py (outdated, resolved).
@james77777778 force-pushed the quantization-support branch from 6f276c7 to a71f83b on July 3, 2024 06:58
@mattdangerw added the kokoro:force-run (Runs Tests on GPU) label on Jul 3, 2024
@kokoro-team removed the kokoro:force-run (Runs Tests on GPU) label on Jul 3, 2024
@mattdangerw (Member)

@james77777778 thanks for the changes!

As soon as testing is all green I will pull this in, especially since the US is about to go on holiday until next Monday.

I think the coverage is worth it, but let's keep looking for ways to speed up this testing while keeping decent coverage as a follow-up.

@mattdangerw mattdangerw merged commit bb423c8 into keras-team:master Jul 3, 2024
8 checks passed
@james77777778 james77777778 deleted the quantization-support branch July 4, 2024 01:54