# ggml: aarch64: implement SVE kernels for q8_0_q8_0, q4_0_q8_0 vector dot #7433
Could you demonstrate that short perplexity runs produce reasonable values compared to no-SVE?
Thanks for the comment! I ran perplexity with SVE and no-SVE. The following are the commands and partial logs.

### Q8_0 / no-SVE
```
$ ./build-neon/bin/perplexity -s 0 -np 1 -t 32 -m llama-2-7b-chat.Q8_0.gguf -f wikitext-2-raw/wiki.test.raw -c 128 -b 128 --chunks 4
---
(...snip)
system_info: n_threads = 32 / 64 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 906.69 ms
perplexity: calculating perplexity over 4 chunks, n_ctx=128, batch_size=128, n_seq=1
perplexity: 2.47 seconds per pass - ETA 0.15 minutes
[1]5.2130,[2]7.4447,[3]7.4725,[4]8.4178,
Final estimate: PPL = 8.4178 +/- 1.61226
llama_print_timings: load time = 314.22 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 9876.98 ms / 512 tokens ( 19.29 ms per token, 51.84 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 10796.42 ms / 513 tokens
```
### Q8_0 / SVE
```
$ ./build-sve/bin/perplexity -s 0 -np 1 -t 32 -m llama-2-7b-chat.Q8_0.gguf -f wikitext-2-raw/wiki.test.raw -c 128 -b 128 --chunks 4
---
(...snip)
system_info: n_threads = 32 / 64 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 1 | LLAMAFILE = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 915.193 ms
perplexity: calculating perplexity over 4 chunks, n_ctx=128, batch_size=128, n_seq=1
perplexity: 0.99 seconds per pass - ETA 0.05 minutes
[1]5.2291,[2]7.4493,[3]7.4706,[4]8.4219,
Final estimate: PPL = 8.4219 +/- 1.61261
llama_print_timings: load time = 304.68 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 3940.02 ms / 512 tokens ( 7.70 ms per token, 129.95 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 4868.40 ms / 513 tokens
```
### Q4_0 / no-SVE
```
$ ./build-neon/bin/perplexity -s 0 -np 1 -t 32 -m llama-2-7b-chat.Q4_0.gguf -f wikitext-2-raw/wiki.test.raw -c 128 -b 128 --chunks 4
---
(...snip)
system_info: n_threads = 32 / 64 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 898.157 ms
perplexity: calculating perplexity over 4 chunks, n_ctx=128, batch_size=128, n_seq=1
perplexity: 2.53 seconds per pass - ETA 0.17 minutes
[1]5.4426,[2]7.4845,[3]7.9395,[4]9.0525,
Final estimate: PPL = 9.0525 +/- 1.80378
llama_print_timings: load time = 13751.66 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 10110.36 ms / 512 tokens ( 19.75 ms per token, 50.64 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 11021.03 ms / 513 tokens
```
### Q4_0 / SVE
```
$ ./build-sve/bin/perplexity -s 0 -np 1 -t 32 -m llama-2-7b-chat.Q4_0.gguf -f wikitext-2-raw/wiki.test.raw -c 128 -b 128 --chunks 4
---
(...snip)
system_info: n_threads = 32 / 64 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 1 | LLAMAFILE = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 901.443 ms
perplexity: calculating perplexity over 4 chunks, n_ctx=128, batch_size=128, n_seq=1
perplexity: 1.09 seconds per pass - ETA 0.07 minutes
[1]5.4306,[2]7.4762,[3]7.9293,[4]9.0456,
Final estimate: PPL = 9.0456 +/- 1.80407
llama_print_timings: load time = 184.21 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 4340.33 ms / 512 tokens ( 8.48 ms per token, 117.96 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 5254.53 ms / 513 tokens
```

And below is a summary.

| Quantization | PPL (no-SVE) | PPL (SVE) |
| --- | --- | --- |
| Q8_0 | 8.4178 +/- 1.61226 | 8.4219 +/- 1.61261 |
| Q4_0 | 9.0525 +/- 1.80378 | 9.0456 +/- 1.80407 |

This change does not appear to have any impact on accuracy.
Thanks. I checked Azure Cloud to see if I can rent a node that supports Arm SVE, and it seems VMs will be available soon: https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/general-purpose/dpsv6-series?tabs=sizebasic
I don't understand why, but after this PR I was having build issues on one of my machines when using
@ggerganov That's great. Thank you for sharing the information. If there is anything I can do to help with CI/CD for the SVE implementation, I would like to contribute!
This PR introduces support for SVE (Scalable Vector Extension) kernels for the q8_0_q8_0 and q4_0_q8_0 vector dot products on the Arm architecture. A similar proposal for SVE support was made in PR #5780, but that one also includes changes to the block layout.
This PR implements the SVE vector dot products with minimal changes as a first step toward SVE support. The performance gain is smaller than that of PR #5780, but it is ~1.1x to ~1.5x faster than the original implementation.
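To illustrate the idea, here is a simplified sketch (not the PR's actual code) of a q8_0 x q8_0 SVE dot product using ACLE intrinsics, assuming a 256-bit vector length; `block_q8_0` and the function name are stand-ins for the ggml definitions:

```c
// Compile with something like: -march=armv8-a+sve
#include <arm_sve.h>

// Simplified stand-in for ggml's block_q8_0 (QK8_0 = 32).
typedef struct {
    __fp16 d;        // block scale (ggml stores this as ggml_fp16_t)
    int8_t qs[32];   // 32 quantized int8 values
} block_q8_0;

float vec_dot_q8_0_q8_0_sve(int n, const block_q8_0 *x, const block_q8_0 *y) {
    const int nb = n / 32;                  // number of 32-element blocks
    svfloat32_t sumv = svdup_n_f32(0.0f);
    const svbool_t pg8  = svptrue_b8();     // with VL=256, exactly 32 int8 lanes
    const svbool_t pg32 = svptrue_b32();

    for (int i = 0; i < nb; ++i) {
        // one load covers a whole 32-byte quant block (assumes VL=256)
        const svint8_t qx = svld1_s8(pg8, x[i].qs);
        const svint8_t qy = svld1_s8(pg8, y[i].qs);
        // widening dot product: groups of 4 int8 products accumulate per int32 lane
        const svint32_t dot = svdot_s32(svdup_n_s32(0), qx, qy);
        // combined fp scale of the two blocks
        const float d = (float) x[i].d * (float) y[i].d;
        sumv = svmla_n_f32_x(pg32, sumv, svcvt_f32_s32_x(pg32, dot), d);
    }
    return svaddv_f32(pg32, sumv);          // horizontal reduction
}
```

Note that with a 256-bit vector length a single `svld1_s8` load covers exactly one 32-byte quant block, which is presumably why the PR is limited to VL=256 (see the limitation below).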
SVE is enabled when LLAMA_SVE=ON is set in CMake. Here is an example of the compilation commands:
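A minimal sketch, assuming the standard llama.cpp CMake workflow (only LLAMA_SVE=ON comes from this PR; the other flags are common defaults and may differ from the author's setup):

```
# Sketch: typical llama.cpp CMake build with the PR's SVE option enabled.
cmake -B build-sve -DLLAMA_SVE=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build-sve -j
```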
Here is the performance measured on AWS Graviton3E (hpc7g).

### Q4_0_Q8_0

Decoding throughput [tokens/sec]

### Q8_0_Q8_0

Decoding throughput [tokens/sec]
Limitation: this pull request only supports a 256-bit SVE vector length.
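For reference, a hedged sketch of how a caller could verify that assumption at runtime (the helper name is hypothetical; the PR itself gates SVE at compile time via LLAMA_SVE):

```c
#include <arm_sve.h>

// Hypothetical helper: the kernels above assume a 256-bit SVE vector
// length, i.e. 32 bytes per SVE register as reported by svcntb().
static int sve_vl_is_256(void) {
    return svcntb() == 32;
}
```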