ggml-quants : ternary packing for TriLMs and BitNet b1.58 #8151

Merged 33 commits on Sep 6, 2024
bd80749
ggml-quants : 1.625 bpw ternary packing for BitNet 1.58b
compilade Jun 19, 2024
7ef4254
ggml-quants : faster 1.625 bpw AVX2 vec_dot
compilade Jun 19, 2024
48b73b8
ggml-quants : substract 1 when back in epi8
compilade Jun 19, 2024
ef1e345
ggml-quants : Q2_2 now faster than Q4_K on with AVX2
compilade Jun 20, 2024
638ad52
ggml-quants : cleanup Q1_3 code formatting
compilade Jun 23, 2024
9465ec6
ggml-quants : ARM NEON vec_dot for q2_2 and q1_3
compilade Jun 25, 2024
89dc3b2
ggml-quants : use ceiling division when quantizing q1_3
compilade Jun 26, 2024
961e293
convert-hf : simplify BitNet pre-quantization
compilade Jun 26, 2024
0996149
convert-hf : allow converting the weird BitNet 1.3B
compilade Jun 27, 2024
bfd2f21
bitnet : replace 1.58b with b1.58, as in the paper
compilade Jun 29, 2024
ec50944
ggml-quants : fix build failure on Windows
compilade Jun 29, 2024
8fbd593
ggml-quants : attempt to fix Arm 32-bit support
compilade Jun 29, 2024
dd3e62a
ggml : add some informative comments in q1_3 vec_dot
compilade Jul 29, 2024
79a278e
Merge branch 'master' into compilade/bitnet-ternary
compilade Jul 29, 2024
77b8f84
ggml : add TQ1_0 and TQ2_0 ternary quantization types
compilade Jul 30, 2024
560873f
ggml : even faster TQ2_0
compilade Jul 31, 2024
e971957
ggml : also faster TQ1_0
compilade Jul 31, 2024
a6dd699
ggml : fix build issues in certain environments
compilade Aug 1, 2024
5417089
ggml : add NEON vec_dot implementation for TQ1_0 and TQ2_0
compilade Aug 1, 2024
45719a2
ggml : avoid directly using vmlal_high_s8, for 32-bit ARM compat
compilade Aug 1, 2024
04eec58
ggml : remove q1_3 and q2_2
compilade Aug 2, 2024
f034aa1
ggml-quants : rename fields of TQ1_0 and TQ2_0 structs for consistency
compilade Aug 3, 2024
96b3d41
ggml-quants : allow using vdotq_s32 in TQ2_0 vec_dot
compilade Aug 7, 2024
d911cd1
Merge branch 'master' into compilade/bitnet-ternary
compilade Aug 11, 2024
3a0bf17
gguf-py : Numpy (de)quantization for TQ1_0 and TQ2_0
compilade Aug 12, 2024
895004f
convert : allow direct conversion to TQ1_0 and TQ2_0
compilade Aug 13, 2024
69f7726
ggml-quants : allow using ARM dot product instructions for TQ1_0
compilade Aug 13, 2024
82b2404
Merge branch 'master' into compilade/bitnet-ternary
compilade Aug 13, 2024
35cc556
ggml-quants : deduplicate TQ1_0 and TQ2_0 __ARM_FEATURE_DOTPROD support
compilade Aug 13, 2024
cb6d996
Merge branch 'master' into compilade/bitnet-ternary
compilade Aug 22, 2024
7f3a619
Merge branch 'master' into compilade/bitnet-ternary
compilade Sep 4, 2024
8d61607
ggml ; remove unused ggml_mul special case
compilade Sep 4, 2024
75b3a09
test-backend-ops : add TQ1_0 and TQ2_0 comments for later
compilade Sep 4, 2024
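The first commit title cites 1.625 bpw for ternary packing. The block layout is not shown on this page, so the following is only arithmetic under a stated assumption: because 3^5 = 243 fits in one byte, five ternary digits can share a byte (1.6 bits per weight), and a hypothetical block of 13 bytes for 64 weights would come to 104 / 64 = 1.625 bits per weight. The sketch below shows plain base-3 packing with hypothetical helper names (`pack5_trits`, `unpack5_trits`); it is not the packing code from this PR.

```c
#include <assert.h>
#include <stdint.h>

// Pack 5 ternary values (each -1, 0 or +1) into one byte as a base-3 number.
// t[0] becomes the least significant base-3 digit.
static uint8_t pack5_trits(const int8_t t[5]) {
    uint8_t b = 0;
    for (int i = 4; i >= 0; --i) {
        assert(t[i] >= -1 && t[i] <= 1);
        b = (uint8_t) (b * 3 + (t[i] + 1)); // map {-1, 0, 1} to digits {0, 1, 2}
    }
    return b; // at most 3^5 - 1 = 242, so it always fits in 8 bits
}

// Inverse of pack5_trits: peel off base-3 digits, least significant first.
static void unpack5_trits(uint8_t b, int8_t t[5]) {
    for (int i = 0; i < 5; ++i) {
        t[i] = (int8_t) ((b % 3) - 1);
        b /= 3;
    }
}
```

Division and modulo keep the sketch easy to follow; a vectorized kernel would normally avoid them.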
ggml-quants : attempt to fix Arm 32-bit support
compilade committed Jun 29, 2024
commit 8fbd59308b54729a191dcf3aee3388abfa7dd6e3
11 changes: 4 additions & 7 deletions ggml/src/ggml-impl.h
@@ -177,7 +177,7 @@ typedef __fp16 ggml_fp16_internal_t;

 // 32-bit ARM compatibility

-// vaddvq_s16
+// vaddlvq_s16
 // vpaddq_s16
 // vpaddq_s32
 // vaddvq_s32
@@ -187,12 +187,9 @@ typedef __fp16 ggml_fp16_internal_t;
 // vzip1_u8
 // vzip2_u8

-inline static int32_t vaddvq_s16(int16x8_t v) {
-    return
-        (int32_t)vgetq_lane_s16(v, 0) + (int32_t)vgetq_lane_s16(v, 1) +
-        (int32_t)vgetq_lane_s16(v, 2) + (int32_t)vgetq_lane_s16(v, 3) +
-        (int32_t)vgetq_lane_s16(v, 4) + (int32_t)vgetq_lane_s16(v, 5) +
-        (int32_t)vgetq_lane_s16(v, 6) + (int32_t)vgetq_lane_s16(v, 7);
+inline static int32_t vaddlvq_s16(int16x8_t v) {
+    int32x4_t v0 = vreinterpretq_s32_s64(vpaddlq_s32(vpaddlq_s16(v)));
+    return vgetq_lane_s32(v0, 0) + vgetq_lane_s32(v0, 2);
 }

 inline static int16x8_t vpaddq_s16(int16x8_t a, int16x8_t b) {
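The replacement fallback above sums eight signed 16-bit lanes through two pairwise widening adds (16-bit to 32-bit, then 32-bit to 64-bit) and finally adds the low 32 bits of the two 64-bit lanes. A minimal scalar sketch of that data flow, using hypothetical helper names and plain arrays in place of NEON vector types, might look like this. It only models the arithmetic; it is not the code from `ggml-impl.h`.

```c
#include <assert.h>
#include <stdint.h>

// Reference: widen each of the 8 signed 16-bit lanes to 32 bits, then sum.
static int32_t hsum_widen_s16(const int16_t v[8]) {
    int32_t sum = 0;
    for (int i = 0; i < 8; ++i) {
        sum += (int32_t) v[i];
    }
    return sum;
}

// Pairwise route: 8 x s16 -> 4 x s32 -> 2 x s64, then add the two lanes.
static int32_t hsum_pairwise_s16(const int16_t v[8]) {
    int32_t s32[4];
    int64_t s64[2];
    for (int i = 0; i < 4; ++i) {
        s32[i] = (int32_t) v[2*i] + (int32_t) v[2*i + 1]; // models vpaddlq_s16
    }
    for (int i = 0; i < 2; ++i) {
        s64[i] = (int64_t) s32[2*i] + (int64_t) s32[2*i + 1]; // models vpaddlq_s32
    }
    // Each partial sum is at most 4 * 32768 in magnitude, so it fits in 32 bits
    // and reading only the low half of each 64-bit lane loses nothing.
    assert(s64[0] >= INT32_MIN && s64[0] <= INT32_MAX);
    assert(s64[1] >= INT32_MIN && s64[1] <= INT32_MAX);
    return (int32_t) s64[0] + (int32_t) s64[1];
}
```

Both functions return the same value for any input, since the largest possible magnitude is 8 * 32768 = 262144, well inside the `int32_t` range.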
8 changes: 4 additions & 4 deletions ggml/src/ggml-quants.c
@@ -11483,10 +11483,10 @@ void ggml_vec_dot_q1_3_q8_0(int n, float * restrict s, size_t bs, const void * r
 // WARNING: reading 3 bytes further than necessary
 const uint8x16_t x13b = vld1q_u8((const uint8_t *) x);

-uint8x16_t x0 = vqtbl1q_u8(x13b, mask0);
-uint8x16_t x1 = vqtbl1q_u8(x13b, mask1);
-uint8x16_t x2 = vqtbl1q_u8(x13b, mask2);
-uint8x16_t x3 = vqtbl1q_u8(x13b, mask3);
+uint8x16_t x0 = ggml_vqtbl1q_u8(x13b, mask0);
+uint8x16_t x1 = ggml_vqtbl1q_u8(x13b, mask1);
+uint8x16_t x2 = ggml_vqtbl1q_u8(x13b, mask2);
+uint8x16_t x3 = ggml_vqtbl1q_u8(x13b, mask3);

 x0 = vmulq_u8(x0, shift0);
 x1 = vmulq_u8(x1, shift0);
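The hunk above swaps `vqtbl1q_u8`, which is an AArch64-only intrinsic, for the `ggml_vqtbl1q_u8` wrapper. The wrapper's definition is not part of this diff, so the sketch below does not reproduce it. It only models, in scalar C with a hypothetical function name, what a single-register 16-byte table lookup computes: each output byte is the table entry selected by the matching index byte, and an index of 16 or more yields 0.

```c
#include <assert.h>
#include <stdint.h>

// Scalar model of a 16-byte table lookup with one table register:
// out[i] = table[idx[i]] when idx[i] < 16, otherwise 0.
static void tbl16_lookup_u8(const uint8_t table[16], const uint8_t idx[16], uint8_t out[16]) {
    assert(table && idx && out);
    for (int i = 0; i < 16; ++i) {
        out[i] = idx[i] < 16 ? table[idx[i]] : 0;
    }
}
```

In the kernel above, lookups of this kind use `mask0` through `mask3` as the index vectors to pull bytes of the loaded block `x13b` into the lanes where the following multiplies expect them.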