
ggml: aarch64: implement SVE kernels for q8_0_q8_0, q4_0_q8_0 vector dot #7433

Merged: 2 commits merged into ggml-org:master from feat-sve-q4_0_q8_0-q8_0_q8_0 on May 25, 2024

Conversation

msy-kato
Contributor

@msy-kato msy-kato commented May 21, 2024

This PR introduces support for SVE (Scalable Vector Extension) kernels for the q8_0_q8_0 and q4_0_q8_0 vector dot products on the Arm architecture. A similar proposal for SVE support was made in PR #5780, but that one also includes changes to the block layout.

This PR implements the SVE vector dot with minimal changes, as an initial step toward SVE support. The performance gain is smaller than that of PR #5780, but it is roughly 1.1x to 1.5x faster than the original NEON implementation.

SVE is enabled by setting LLAMA_SVE=ON in CMake. Here is an example of the compilation commands:

$ cmake -DLLAMA_SVE=ON -B build -S .
$ cmake --build build -j$(($(nproc)/2))
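
For context, the core of the change is an SVE implementation of the vec_dot kernels. Below is a minimal, illustrative sketch of the q8_0 x q8_0 case, assuming a 256-bit SVE vector length (32 int8 lanes, matching the QK8_0 = 32 block size) and a simplified block struct; the names here are illustrative and it is not the exact code in this PR.

```c
#include <arm_sve.h>   // compile with e.g. -march=armv8-a+sve
#include <stdint.h>

#define QK8_0 32

// Simplified q8_0 block: one fp16 scale plus 32 quantized int8 values.
typedef struct {
    __fp16 d;
    int8_t qs[QK8_0];
} block_q8_0;

static float vec_dot_q8_0_q8_0_sve(int n, const block_q8_0 *x, const block_q8_0 *y) {
    const int nb = n / QK8_0;
    const svbool_t pg = svptrue_b8();   // with a 256-bit VL this covers exactly one block
    float sumf = 0.0f;

    for (int i = 0; i < nb; ++i) {
        // load 32 int8 quants from each operand
        const svint8_t qx = svld1_s8(pg, x[i].qs);
        const svint8_t qy = svld1_s8(pg, y[i].qs);

        // widening dot product: groups of 4 int8 pairs accumulate into int32 lanes
        const svint32_t dot = svdot_s32(svdup_n_s32(0), qx, qy);

        // horizontal sum, then apply the two block scales
        sumf += (float)x[i].d * (float)y[i].d * (float)svaddv_s32(svptrue_b32(), dot);
    }
    return sumf;
}
```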

Here is the performance measured on AWS Graviton3E (hpc7g) instances.

### Q4_0_Q8_0
$  ./build/bin/main --model models/llama-2-7b-chat.Q4_0.gguf --temp 0.1 --threads 2 --prompt 'AI is going to' --n-predict 512 --seed 0 --prompt-cache llama-2-7b-chat.Q4_0.gguf-prompt.bin

### Q8_0_Q8_0
$  ./build/bin/main --model models/llama-2-7b-chat.Q8_0.gguf --temp 0.1 --threads 2 --prompt 'AI is going to' --n-predict 512 --seed 0 --prompt-cache llama-2-7b-chat.Q8_0.gguf-prompt.bin

Q4_0_Q8_0

Decoding throughput [token/sec]

| Threads | Original (NEON) | This PR (SVE) | Ratio |
| ---: | ---: | ---: | ---: |
| 2 | 3.16 | 4.05 | 1.28 |
| 4 | 6.21 | 7.88 | 1.27 |
| 8 | 11.92 | 14.81 | 1.24 |
| 16 | 21.54 | 25.77 | 1.20 |
| 32 | 32.38 | 36.21 | 1.12 |

Q8_0_Q8_0

Decoding throughput [token/sec]

| Threads | Original (NEON) | This PR (SVE) | Ratio |
| ---: | ---: | ---: | ---: |
| 2 | 3.14 | 4.60 | 1.46 |
| 4 | 6.10 | 8.97 | 1.47 |
| 8 | 11.46 | 16.29 | 1.42 |
| 16 | 20.20 | 23.77 | 1.18 |
| 32 | 24.72 | 26.01 | 1.05 |

Limitation: this pull request only supports a 256-bit SVE vector length.
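
The 256-bit restriction exists because the kernel assumes one SVE vector holds exactly one 32-byte quant block. A hypothetical guard (the helper name is illustrative, not from this PR) that takes the SVE path only when the runtime vector length matches might look like this:

```c
#include <arm_sve.h>

// Illustrative only: check the runtime SVE vector length before choosing the
// SVE code path; other vector lengths would fall back to the NEON kernels.
static inline int sve_vl_is_256bit(void) {
    return svcntb() == 32;   // svcntb() returns the vector length in bytes; 32 bytes = 256 bits
}
```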

@github-actions github-actions bot added the build (Compilation issues) and ggml (changes relating to the ggml tensor library for machine learning) labels May 21, 2024
Contributor

github-actions bot commented May 21, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 527 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8868.41ms p(95)=22845.51ms fails=, finish reason: stop=470 truncated=57
  • Prompt processing (pp): avg=103.39tk/s p(95)=462.02tk/s
  • Token generation (tg): avg=47.79tk/s p(95)=48.22tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=feat-sve-q4_0_q8_0-q8_0_q8_0 commit=d28bfd5ef7492548d6e000b6ad2cb6042161ec95

[Benchmark time-series charts omitted: prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, requests_processing]

@mofosyne mofosyne added the Review Complexity : High (generally requires in-depth knowledge of LLMs or GPUs) label May 21, 2024
@msy-kato msy-kato force-pushed the feat-sve-q4_0_q8_0-q8_0_q8_0 branch from d671a17 to d28bfd5 Compare May 23, 2024 11:01
Member

@ggerganov ggerganov left a comment

Could you demonstrate that short perplexity runs produce reasonable values compared to no-SVE?

@msy-kato
Contributor Author

msy-kato commented May 24, 2024

Thanks for the comment! I ran perplexity with SVE and no-SVE. The following are the commands and partial logs.

### Q8_0 / no-SVE
$ ./build-neon/bin/perplexity -s 0 -np 1 -t 32 -m llama-2-7b-chat.Q8_0.gguf -f wikitext-2-raw/wiki.test.raw -c 128 -b 128 --chunks 4
---
(...snip)
system_info: n_threads = 32 / 64 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 906.69 ms
perplexity: calculating perplexity over 4 chunks, n_ctx=128, batch_size=128, n_seq=1
perplexity: 2.47 seconds per pass - ETA 0.15 minutes
[1]5.2130,[2]7.4447,[3]7.4725,[4]8.4178,
Final estimate: PPL = 8.4178 +/- 1.61226

llama_print_timings:        load time =     314.22 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =    9876.98 ms /   512 tokens (   19.29 ms per token,    51.84 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =   10796.42 ms /   513 tokens

### Q8_0 / SVE
$ ./build-sve/bin/perplexity -s 0 -np 1 -t 32 -m llama-2-7b-chat.Q8_0.gguf -f wikitext-2-raw/wiki.test.raw -c 128 -b 128 --chunks 4
---
(...snip)
system_info: n_threads = 32 / 64 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 1 | LLAMAFILE = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 915.193 ms
perplexity: calculating perplexity over 4 chunks, n_ctx=128, batch_size=128, n_seq=1
perplexity: 0.99 seconds per pass - ETA 0.05 minutes
[1]5.2291,[2]7.4493,[3]7.4706,[4]8.4219,
Final estimate: PPL = 8.4219 +/- 1.61261

llama_print_timings:        load time =     304.68 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =    3940.02 ms /   512 tokens (    7.70 ms per token,   129.95 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =    4868.40 ms /   513 tokens

### Q4_0 / no-SVE
$ ./build-neon/bin/perplexity -s 0 -np 1 -t 32 -m llama-2-7b-chat.Q4_0.gguf -f wikitext-2-raw/wiki.test.raw -c 128 -b 128 --chunks 4
---
(...snip)
system_info: n_threads = 32 / 64 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 898.157 ms
perplexity: calculating perplexity over 4 chunks, n_ctx=128, batch_size=128, n_seq=1
perplexity: 2.53 seconds per pass - ETA 0.17 minutes
[1]5.4426,[2]7.4845,[3]7.9395,[4]9.0525,
Final estimate: PPL = 9.0525 +/- 1.80378

llama_print_timings:        load time =   13751.66 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =   10110.36 ms /   512 tokens (   19.75 ms per token,    50.64 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =   11021.03 ms /   513 tokens

### Q4_0 / SVE
$ ./build-sve/bin/perplexity -s 0 -np 1 -t 32 -m llama-2-7b-chat.Q4_0.gguf -f wikitext-2-raw/wiki.test.raw -c 128 -b 128 --chunks 4
---
(...snip)
system_info: n_threads = 32 / 64 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 1 | LLAMAFILE = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 901.443 ms
perplexity: calculating perplexity over 4 chunks, n_ctx=128, batch_size=128, n_seq=1
perplexity: 1.09 seconds per pass - ETA 0.07 minutes
[1]5.4306,[2]7.4762,[3]7.9293,[4]9.0456,
Final estimate: PPL = 9.0456 +/- 1.80407

llama_print_timings:        load time =     184.21 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =    4340.33 ms /   512 tokens (    8.48 ms per token,   117.96 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =    5254.53 ms /   513 tokens

And below is a summary.

| SIMD | Type | PPL | Total time [ms] |
| --- | --- | --- | ---: |
| NEON | Q8_0 | 8.4178 +/- 1.61226 | 10796.42 |
| SVE | Q8_0 | 8.4219 +/- 1.61261 | 4868.4 |
| NEON | Q4_0 | 9.0525 +/- 1.80378 | 11021.03 |
| SVE | Q4_0 | 9.0456 +/- 1.80407 | 5254.53 |

This change does not appear to have any impact on accuracy.

@ggerganov ggerganov merged commit faa0e69 into ggml-org:master May 25, 2024
62 of 73 checks passed
@ggerganov
Member

Thanks. I checked Azure Cloud to see if I can rent a node that supports Arm SVE, and it seems VMs will soon be available: https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/general-purpose/dpsv6-series?tabs=sizebasic
These VMs are currently in preview; when they become generally available, we can add ggml-ci for that instruction set.

@JohannesGaessler
Collaborator

I don't understand why, but after this PR I was having build issues on one of my machines when using make: the GPU could not be detected to determine the correct CUDA arch for -arch=native, even though there was no change to the Makefile. However, this seems to have been related to ccache, since the compilation worked with LLAMA_NO_CCACHE; deleting ~/.cache/ccache has permanently fixed the issue for me.

@msy-kato
Contributor Author

msy-kato commented May 27, 2024

@ggerganov That's great. Thank you for sharing the information. If there is anything I can do to help with CI/CD for the SVE implementation, I would be glad to contribute!
