
Feat: Implementation of the DeepSeek blockwise quantization for fp8 tensors #1763


Open · Degnel wants to merge 9 commits into main from feat/blockwise_fp8_quant_triton_gemm_ker

Conversation


@Degnel Degnel commented Feb 22, 2025

This PR is the first step towards addressing issue #1594. It includes the following implementations:

  • fp8 triton gemm for blockwise quantization
  • quant, dequant and linear utilities
  • time & precision benchmarks
  • basic tests

Once the code is validated, it would be great to benchmark it on an H100.
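
Roughly, the intended usage looks like the sketch below (a minimal sketch, not the exact API of this PR: fp8_blockwise_weight_quant and blockwise_fp8_gemm appear in the diff excerpts quoted later in this thread, while the fp8_blockwise_act_quant helper, the torchao.prototype.blockwise_fp8 module path, the block size and the fp8 dtype are assumptions):

```python
import torch

# Module path and the activation-side helper are assumptions; the weight-quant
# and gemm names come from the diff quoted further down in this thread.
from torchao.prototype.blockwise_fp8 import (
    blockwise_fp8_gemm,
    fp8_blockwise_act_quant,
    fp8_blockwise_weight_quant,
)

block_size = 128  # DeepSeek-style 128x128 weight blocks, 1x128 activation groups
A = torch.randn(512, 1024, device="cuda", dtype=torch.bfloat16)   # activations
W = torch.randn(2048, 1024, device="cuda", dtype=torch.bfloat16)  # weight

# Quantize to fp8 with one scale per activation group / weight block.
A_q, A_s = fp8_blockwise_act_quant(A, block_size, torch.float8_e4m3fn)
W_q, W_s = fp8_blockwise_weight_quant(W, block_size, torch.float8_e4m3fn)

# Triton GEMM that rescales each output tile with the per-block scales.
output_blockwise = blockwise_fp8_gemm(A_q, A_s, W_q, W_s)
```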


pytorch-bot bot commented Feb 22, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1763

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 26bb079 with merge base f343336:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Feb 22, 2025
@danielvegamyhre
Contributor

Thanks for your work on this! I'll take a closer look next week.

cc @vkuzo @drisspg

@Degnel
Author

Degnel commented Feb 25, 2025

Thanks for running the tests. I have two questions regarding the errors:

  • Where should I add Triton to allow the tests to run successfully without introducing unnecessary dependencies in dev-requirements.txt?
  • Does torchao provide any utility to check the available FP8 types for each gpu architecture?

@danielvegamyhre
Contributor

danielvegamyhre commented Feb 27, 2025

Thanks for running the tests. I have two questions regarding the errors:

  • Where should I add Triton to allow the tests to run successfully without introducing unnecessary dependencies in dev-requirements.txt?

Can you clarify what you mean? Are tests failing in CI due to a missing triton installation? That shouldn't be happening, please share the link/logs if so.

  • Does torchao provide any utility to check the available FP8 types for each gpu architecture?

We just use helpers which skip tests if GPU architecture is not at least SM 89:

def is_sm_at_least_89():

You can find examples in the float8 tests (example).
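
The pattern is roughly the following (a minimal sketch; it assumes the helper lives in torchao.utils, and the test name is generic):

```python
import pytest
import torch

from torchao.utils import is_sm_at_least_89  # assumed location of the helper


@pytest.mark.skipif(not torch.cuda.is_available(), reason="CUDA not available")
@pytest.mark.skipif(
    not is_sm_at_least_89(), reason="fp8 matmul requires SM 8.9+ (Ada/Hopper)"
)
def test_blockwise_fp8_gemm():
    ...  # quantize, run the triton gemm, compare against a bf16 reference
```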

@danielvegamyhre danielvegamyhre self-assigned this Feb 27, 2025
@danielvegamyhre danielvegamyhre self-requested a review February 27, 2025 17:45
@Degnel
Author

Degnel commented Feb 28, 2025

Can you clarify what you mean? Are tests failing in CI due to a missing triton installation? That shouldn't be happening, please share the link/logs if so.

Indeed, they are. It looks like only the CPU runs are failing. I presume that bitsandbytes might not install triton when no GPU is available (I might be missing something there). Here is an instance of a failing log:

https://github.com/pytorch/ao/actions/runs/13484452669/job/37730985419?pr=1763#step:14:1276

We just use helpers which skip tests if GPU architecture is not at least SM 89:

def is_sm_at_least_89():

You can find examples in the float8 tests (example).

Thank you for the hint, I've locally updated the code accordingly 👍

W_q, W_s = fp8_blockwise_weight_quant(W, block_size, dtype)
output_blockwise = blockwise_fp8_gemm(A_q, A_s, W_q, W_s)

quantize_(lin, int8_dynamic_activation_int4_weight())
Contributor
why is int8_dynamic_activation_int4_weight being used here?

Author

Thanks for noticing it. I was aiming for static W4A8 quantization and overlooked that it was dynamic. I will try to address this within the week.

@danielvegamyhre
Contributor

Can you clarify what you mean? Are tests failing in CI due to a missing triton installation? That shouldn't be happening, please share the link/logs if so.

Also @Degnel you should skip tests requiring triton if CUDA is not available.
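
One way to do that (sketched here with placeholder names, not this PR's actual module layout) is to gate the triton import on CUDA availability, so CPU-only runners can still import the module and collect the tests:

```python
import torch

_CUDA_AVAILABLE = torch.cuda.is_available()

if _CUDA_AVAILABLE:
    # Only pull in triton (and define the @triton.jit kernels) on CUDA machines.
    import triton  # noqa: F401


def blockwise_fp8_gemm(A_q, A_s, W_q, W_s):
    if not _CUDA_AVAILABLE:
        raise RuntimeError("blockwise_fp8_gemm requires triton and a CUDA device")
    ...  # dispatch to the triton kernel defined behind the guard
```

In the tests themselves, pytest.importorskip("triton") or the skipif pattern shown above achieves the same effect.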

@danielvegamyhre
Contributor

@Degnel thanks for your work on this. I ran the tests and it looks like your blockwise fp8 gemm test is failing due to quantization error.

@Degnel
Author

Degnel commented Mar 7, 2025

@Degnel thanks for your work on this. I ran the tests and it looks like your blockwise fp8 gemm test is failing due to quantization error.

Thanks for pointing that out! I had also noticed the issue, and I think I was just a bit too harsh with the threshold. I'll increase it to make it more reasonable. That said, I'll still double-check the calculations manually to ensure everything is mathematically correct.
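
Concretely, the check will look something like this (a rough sketch only: the module path, the activation-quant helper name and the threshold value are illustrative, not final):

```python
import torch

from torchao.prototype.blockwise_fp8 import (  # assumed module path
    blockwise_fp8_gemm,
    fp8_blockwise_act_quant,   # hypothetical activation-side helper
    fp8_blockwise_weight_quant,
)


def test_blockwise_fp8_gemm_precision():
    torch.manual_seed(0)
    A = torch.randn(256, 512, device="cuda", dtype=torch.bfloat16)
    W = torch.randn(1024, 512, device="cuda", dtype=torch.bfloat16)

    A_q, A_s = fp8_blockwise_act_quant(A, 128, torch.float8_e4m3fn)
    W_q, W_s = fp8_blockwise_weight_quant(W, 128, torch.float8_e4m3fn)

    out = blockwise_fp8_gemm(A_q, A_s, W_q, W_s).float()
    ref = (A @ W.t()).float()  # layout assumed: W stored as (out_features, in_features)

    # Relative Frobenius-norm error is scale-free, so one loose bound can
    # absorb fp8 rounding noise without hiding real indexing or scale bugs.
    rel_err = (out - ref).norm() / ref.norm()
    assert rel_err < 5e-2  # illustrative threshold
```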

@Degnel
Author

Degnel commented Mar 13, 2025

@danielvegamyhre I believe everything should be all right except for the PR Label Check (I'm not sure I have the required rights to edit that). The test-mps-ops (macos-m1-stable) job failed, but I think the merge will fix it, as it seems to be a newly introduced test.

@danielvegamyhre danielvegamyhre added the topic: new feature Use this tag if this PR adds a new feature label Mar 13, 2025
@Degnel
Author

Degnel commented Mar 14, 2025

The test-mps-ops (macos-m1-stable) job failed once again. I've seen other recent PRs both pass and fail this test (due to the same missing package, 'importlib_metadata'). I don't think this is related to the code I wrote, but I might be missing something. Please let me know if you have any insights.

@drisspg
Contributor

drisspg commented Mar 14, 2025

The mps test is unrelated, re-running tests.

@Degnel
Author

Degnel commented Apr 21, 2025

It seems like the new PRs are not failing anymore due to the macOS tests. Maybe we should try to rerun it here :) @danielvegamyhre @drisspg

@drisspg
Contributor

drisspg commented Apr 24, 2025

Sorry, could you do one more rebase to kick CI back off?

Degnel added 7 commits April 25, 2025 14:08
- fp8 triton gemm
- quant, dequant and linear utils
- time & precision benchmarks
- basic tests
- removing triton dependency
- cleaning adaptive dtype
- fixing W4A8 quantization for cutlass kernel in precision benchmark
- importing triton only if cuda available
- setting a less harsh threshold for quant-dequant and for gemm kernel mm precision
- condition triton import in gemm
- linting
@Degnel Degnel force-pushed the feat/blockwise_fp8_quant_triton_gemm_ker branch from e8edea9 to e41457c Compare April 25, 2025 12:11
@Degnel
Author

Degnel commented Apr 25, 2025

Sorry, could you do one more rebase to kick CI back off?

No problem, it should be ok

@Degnel
Author

Degnel commented Apr 26, 2025

Thank you @drisspg, I've done the linting.
