IQ1_M: 1.75 bpw quantization #6302

Merged — 24 commits, merged Mar 26, 2024

Commits (changes shown below are from 1 commit):
2a2d66d  iq1_m: basics (Kawrakow, Mar 22, 2024)
ac8b3dd  iq1_m: basics-2 (Kawrakow, Mar 22, 2024)
1df37b6  iq1_m: CUDA dequantize works (Kawrakow, Mar 22, 2024)
282f278  iq1_m: separate shifts for each group of 8 in a block (Kawrakow, Mar 23, 2024)
308c50d  iq1_m: go to 3-bit scales (Kawrakow, Mar 23, 2024)
64b9dfd  iq1_m: scalar dot product (Kawrakow, Mar 23, 2024)
a139de5  iq1_m: AVX2 dot product (Kawrakow, Mar 23, 2024)
379fdb6  iq1_m: very slightly faster AVX2 dot product (Kawrakow, Mar 24, 2024)
8009b6d  iq1_m: ARM_NEON dot product (Kawrakow, Mar 24, 2024)
0e36afa  iq1_m: Metal - dequantize works, dot product does not (Kawrakow, Mar 25, 2024)
19fb974  iq1_m: Metal now works (Kawrakow, Mar 25, 2024)
abc1d4f  iq1_m: minor (Kawrakow, Mar 25, 2024)
dff85a8  iq1_m: checking pure iq1_m quantization (Kawrakow, Mar 25, 2024)
f664692  iiq1_m: slightly faster ARM_NEON dot product (Kawrakow, Mar 25, 2024)
b1d1c26  iq1_m: faster ARM_NEON dot product (Kawrakow, Mar 25, 2024)
78ce561  iq1_m: another minor ARM_NEON dot product improvement (Kawrakow, Mar 25, 2024)
3d9c21f  iq1_m: small PPL improvement via super-block scale adjustment (Kawrakow, Mar 25, 2024)
480d6d6  iq1_m: adapt to CUDA refactoring (Kawrakow, Mar 25, 2024)
62dd11f  iq1_m: remove unused variable (Kawrakow, Mar 25, 2024)
22fa121  iq1_m: add to backend-ops tests (Kawrakow, Mar 25, 2024)
b68f32b  iq1_m: fix Windows ARM (Kawrakow, Mar 26, 2024)
9a5786e  iq1_m: use common definition of iq1m_scale_t (Kawrakow, Mar 26, 2024)
cdb2d65  cuda: assert -> NO_DEVICE_CODE (Kawrakow, Mar 26, 2024)
6e4cef5  iq1_M: PR comments (Kawrakow, Mar 26, 2024)
iq1_m: another minor ARM_NEON dot product improvement
14.9 -> 15.0 t/s
Kawrakow committed Mar 25, 2024
commit 78ce561a3128f85bcd643913933f1b5620c5ac0f
13 changes: 7 additions & 6 deletions ggml-quants.c
@@ -9779,6 +9779,9 @@ void ggml_vec_dot_iq1_m_q8_K (int n, float * restrict s, size_t bs, const void
 
     iq1m_scale_t scale;
 
+    uint32_t aux32;
+    const uint8_t * aux8 = (const uint8_t *)&aux32;
+
     float sumf = 0;
     for (int i = 0; i < nb; ++i) {
 
@@ -9809,13 +9812,11 @@ void ggml_vec_dot_iq1_m_q8_K (int n, float * restrict s, size_t bs, const void
         const int32x4_t p2 = vpaddq_s32(ggml_vdotq_s32(mzero, q1b.val[2], q8b.val[2]), ggml_vdotq_s32(mzero, q1b.val[3], q8b.val[3]));
         const int32x4_t p12 = vpaddq_s32(p1, p2);
 
-        delta.val[0] = deltas.val[((qh[0] & 0x08) >> 3) | ((qh[0] & 0x80) >> 6)];
-        delta.val[1] = deltas.val[((qh[1] & 0x08) >> 3) | ((qh[1] & 0x80) >> 6)];
-        delta.val[2] = deltas.val[((qh[2] & 0x08) >> 3) | ((qh[2] & 0x80) >> 6)];
-        delta.val[3] = deltas.val[((qh[3] & 0x08) >> 3) | ((qh[3] & 0x80) >> 6)];
+        const uint32_t * qh32 = (const uint32_t *)qh; // we are 4-byte aligned, so we can do that
+        aux32 = ((qh32[0] >> 3) & 0x01010101) | ((qh32[0] >> 6) & 0x02020202);
 
-        const int32x4_t p3 = vpaddq_s32(ggml_vdotq_s32(mzero, delta.val[0], q8b.val[0]), ggml_vdotq_s32(mzero, delta.val[1], q8b.val[1]));
-        const int32x4_t p4 = vpaddq_s32(ggml_vdotq_s32(mzero, delta.val[2], q8b.val[2]), ggml_vdotq_s32(mzero, delta.val[3], q8b.val[3]));
+        const int32x4_t p3 = vpaddq_s32(ggml_vdotq_s32(mzero, deltas.val[aux8[0]], q8b.val[0]), ggml_vdotq_s32(mzero, deltas.val[aux8[1]], q8b.val[1]));
+        const int32x4_t p4 = vpaddq_s32(ggml_vdotq_s32(mzero, deltas.val[aux8[2]], q8b.val[2]), ggml_vdotq_s32(mzero, deltas.val[aux8[3]], q8b.val[3]));
         const int32x4_t p34 = vpaddq_s32(p3, p4);
 
         int32x4_t scales_4 = {sc[ib/2] >> 0, sc[ib/2] >> 3, sc[ib/2] >> 6, sc[ib/2] >> 9};