add loongarch lsx and lasx optimize code #6454

Merged: 8 commits merged into ggerganov:master on May 20, 2024

Conversation

@junchao-loongson (Contributor) commented on Apr 3, 2024:

Description

Hello, we (@lixing-star, @MQ-mengqing) are developers on the Loongson team.

We have added 128-bit (LSX) and 256-bit (LASX) vector optimization code for the LoongArch architecture.
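
To illustrate the kind of code path this adds, here is a minimal sketch (not code from this PR) of a 128-bit LSX dot-product loop. It assumes GCC's lsxintrin.h builtins (__lsx_vld, __lsx_vst, __lsx_vfmadd_s, __lsx_vreplgr2vr_w), compilation with -mlsx on loongarch64, and that __lsx_vfmadd_s computes a*b + c; the helper name dot_f32_lsx is hypothetical.

#include <lsxintrin.h>

/* hypothetical example: dot product of two float arrays using LSX */
static float dot_f32_lsx(float * a, float * b, int n) {
    /* zero the 4-lane accumulator by broadcasting 0 */
    __m128 acc = (__m128)__lsx_vreplgr2vr_w(0);
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = (__m128)__lsx_vld(a + i, 0);  /* load 4 floats from a */
        __m128 vb = (__m128)__lsx_vld(b + i, 0);  /* load 4 floats from b */
        acc = __lsx_vfmadd_s(va, vb, acc);        /* acc += va * vb (fused multiply-add) */
    }
    /* horizontal sum of the 4 accumulator lanes */
    float tmp[4];
    __lsx_vst((__m128i)acc, tmp, 0);
    float sum = tmp[0] + tmp[1] + tmp[2] + tmp[3];
    for (; i < n; ++i) {                          /* scalar tail for leftover elements */
        sum += a[i] * b[i];
    }
    return sum;
}

A LASX (256-bit) variant would follow the same pattern with the __lasx_xv* intrinsics, processing 8 floats per iteration.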

test-quantize-fns

./bin/test-quantize-fns
Testing f32
Testing f16
Testing q4_0
Testing q4_1
Testing q5_0
Testing q5_1
Testing q8_0
Testing q8_1
Testing q2_K
Testing q3_K
Testing q4_K
Testing q5_K
Testing q6_K
Testing q8_K
Testing iq2_xxs
Testing iq2_xs
Testing iq3_xxs
Testing iq1_s
Testing iq4_nl
Testing iq3_s
Testing iq2_s
Testing iq4_xs
Testing i8
Testing i16
Testing i32
Testing i64
Testing f64
Testing iq1_m

benchmark

  • 3A5000
CPU: 
    Loongson-3A5000-HV
uname -a:  
    Linux 5a2k 4.19.0-19-loongson-3 #1 SMP 4.19.190.8.14 Thu Aug 24 08:54:20 UTC 2023 loongarch64 loongarch64 loongarch64 GNU/Linux

./build/bin/benchmark 
main: build = 2606 (e70d50e8)
main: built with cc (Loongnix 8.3.0-6.lnd.vec.37) 8.3.0 for loongarch64-linux-gnu
Starting Test
Allocating Memory of size 800194560 bytes, 763 MB
Creating new tensors

------ Test 1 - Matrix Mult via F32 code
n_threads=1
            m11: type = 0 (  f32) ne = 11008 x  4096 x     1, nb = (    4, 44032, 180355072) - Sum of tensor m11 is 45088768.00
             m2: type = 0 (  f32) ne = 11008 x   128 x     1, nb = (    4, 44032, 5636096) - Sum of tensor m2 is 2818048.00
   gf->nodes[0]: type = 0 (  f32) ne =  4096 x   128 x     1, nb = (    4, 16384, 2097152) - Sum of tensor gf->nodes[0] is 11542724608.00

------ Test 2 - Matrix Mult via q4_1 code
n_threads=1
Matrix Multiplication of (11008,4096,1) x (11008,128,1) - about  11.54 gFLOPS

Iteration;NThreads; SizeX; SizeY; SizeZ; Required_FLOPS; Elapsed_u_Seconds; gigaFLOPS
=====================================================================================
        0;       1; 11008;  4096;   128;    11542724608;            760593;     15.18
        1;       1; 11008;  4096;   128;    11542724608;            758773;     15.21
        2;       1; 11008;  4096;   128;    11542724608;            758563;     15.22
        3;       1; 11008;  4096;   128;    11542724608;            759198;     15.20
        4;       1; 11008;  4096;   128;    11542724608;            758189;     15.22
        5;       1; 11008;  4096;   128;    11542724608;            759360;     15.20
        6;       1; 11008;  4096;   128;    11542724608;            760177;     15.18
        7;       1; 11008;  4096;   128;    11542724608;            757374;     15.24
        8;       1; 11008;  4096;   128;    11542724608;            757833;     15.23
        9;       1; 11008;  4096;   128;    11542724608;            757848;     15.23

Average                                                                         15.21
=====================================================================================


  • 3A6000
CPU: 
    Loongson-3A6000
uname -a:  
    Linux arch6k 6.7.0-rc2-2 #1 SMP PREEMPT Mon, 27 Nov 2023 08:42:49 +0000 loongarch64 GNU/Linux

./bin/benchmark
main: build = 2590 (849cb13)
main: built with cc (GCC) 13.2.1 20230906 for loongarch64-unknown-linux-gnu
Starting Test
Allocating Memory of size 800194560 bytes, 763 MB
Creating new tensors

------ Test 1 - Matrix Mult via F32 code
n_threads=1
            m11: type = 0 (  f32) ne = 11008 x  4096 x     1, nb = (    4, 44032, 180355072) - Sum of tensor m11 is 45088768.00
             m2: type = 0 (  f32) ne = 11008 x   128 x     1, nb = (    4, 44032, 5636096) - Sum of tensor m2 is 2818048.00
   gf->nodes[0]: type = 0 (  f32) ne =  4096 x   128 x     1, nb = (    4, 16384, 2097152) - Sum of tensor gf->nodes[0] is 11542724608.00

------ Test 2 - Matrix Mult via q4_1 code
n_threads=1
Matrix Multiplication of (11008,4096,1) x (11008,128,1) - about  11.54 gFLOPS

Iteration;NThreads; SizeX; SizeY; SizeZ; Required_FLOPS; Elapsed_u_Seconds; gigaFLOPS
=====================================================================================
        0;       1; 11008;  4096;   128;    11542724608;            502525;     22.97
        1;       1; 11008;  4096;   128;    11542724608;            502258;     22.98
        2;       1; 11008;  4096;   128;    11542724608;            502188;     22.98
        3;       1; 11008;  4096;   128;    11542724608;            502212;     22.98
        4;       1; 11008;  4096;   128;    11542724608;            502231;     22.98
        5;       1; 11008;  4096;   128;    11542724608;            502297;     22.98
        6;       1; 11008;  4096;   128;    11542724608;            502201;     22.98
        7;       1; 11008;  4096;   128;    11542724608;            502202;     22.98
        8;       1; 11008;  4096;   128;    11542724608;            502271;     22.98
        9;       1; 11008;  4096;   128;    11542724608;            502237;     22.98

Average                                                                         22.98
=====================================================================================


LoongArch Documents

@ggerganov (Owner) commented:

@junchao-loongson Thanks for this PR. Just a heads-up: I will only be able to get to reviewing this after #6412 and #6414, so it may take me some time; sorry about that. In the meantime, feel free to continue the review with other devs.

@mofosyne added the performance (Speed related topics) and Review Complexity : High (Generally requires in-depth knowledge of LLMs or GPUs) labels on May 10, 2024
@cebtenzzre removed their request for review on May 10, 2024 15:44
@ggerganov (Owner) commented:

Let's resolve the conflicts from the recent __POWER9_VECTOR__ changes and look to merge.

@junchao-loongson (Contributor, Author) commented:

Okay, I rebased the code.

@junchao-loongson (Contributor, Author) commented:

Test OK.

github-actions bot commented on May 18, 2024:

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 531 iterations 🚀

Details (for performance-related PRs only):
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8790.8ms p(95)=22532.69ms fails=, finish reason: stop=477 truncated=54
  • Prompt processing (pp): avg=113.24tk/s p(95)=523.32tk/s
  • Token generation (tg): avg=32.7tk/s p(95)=50.33tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=master commit=ee26b8ff10565458599dabdfaf41f65c2c313060

[Benchmark charts: llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 531 iterations; time series for llamacpp:prompt_tokens_seconds, llamacpp:predicted_tokens_seconds, llamacpp:kv_cache_usage_ratio, and llamacpp:requests_processing]

@ggerganov (Owner) left a comment:

I don't suppose GitHub Actions supports this architecture, but if it does, it would be nice to add a CI workflow.

Have you done some inference/perplexity runs to make sure the generation looks fine?

@github-actions bot added the build (Compilation issues) and ggml (changes relating to the ggml tensor library for machine learning) labels on May 18, 2024
ggml.c (outdated diff), comment on lines 1532 to 1548:
typedef union {
    int32_t i;
    float   f;
} FloatInt;

/* float type data load instructions */

/* Broadcast a float into all four lanes of a 128-bit LSX vector by
   reinterpreting its bits as an int32 and using the integer replicate. */
static __m128 __lsx_vreplfr2vr_s(float val)
{
    FloatInt fi_tmpval = {.f = val};
    return (__m128)__lsx_vreplgr2vr_w(fi_tmpval.i);
}

/* Broadcast a float into all eight lanes of a 256-bit LASX vector. */
static __m256 __lasx_xvreplfr2vr_s(float val)
{
    FloatInt fi_tmpval = {.f = val};
    return (__m256)__lasx_xvreplgr2vr_w(fi_tmpval.i);
}
@ggerganov (Owner) commented:

Deduplicate this code by moving it into ggml-impl.h and reusing it in ggml.c and ggml-quants.c.

@ggerganov (Owner) commented:

I was thinking of just deduplicating the __lsx_vreplfr2vr_s and __lasx_xvreplfr2vr_s code. The rest of the LSX/LASX code that is used only inside ggml-quants.c should remain in ggml-quants.c.
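
A minimal sketch (not the merged patch) of what that deduplication could look like in ggml-impl.h, keeping the FloatInt union from the diff above; __loongarch_sx and __loongarch_asx are assumed to be the compiler's LSX/LASX feature macros:

#include <stdint.h>

typedef union {
    int32_t i;
    float   f;
} FloatInt;

#if defined(__loongarch_sx)
#include <lsxintrin.h>

/* broadcast a float into all 4 lanes of a 128-bit LSX vector */
static inline __m128 __lsx_vreplfr2vr_s(float val) {
    FloatInt fi = {.f = val};
    return (__m128)__lsx_vreplgr2vr_w(fi.i);
}
#endif

#if defined(__loongarch_asx)
#include <lasxintrin.h>

/* broadcast a float into all 8 lanes of a 256-bit LASX vector */
static inline __m256 __lasx_xvreplfr2vr_s(float val) {
    FloatInt fi = {.f = val};
    return (__m256)__lasx_xvreplgr2vr_w(fi.i);
}
#endif

With the helpers defined once here, ggml.c and ggml-quants.c would both include ggml-impl.h instead of carrying their own copies.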

@ggerganov (Owner) commented:

Btw, for long-term support it would be very useful to add CI for this arch. If there is someone who can donate a machine, we can deploy ggml-ci on it and have it run tests on each commit. Without CI, the code can quickly become outdated and break.

@junchao-loongson (Contributor, Author) commented:

We have LoongArch machines available for remote connection; can we use them for CI?

@ggerganov merged commit 65c5820 into ggerganov:master on May 20, 2024 (72 of 74 checks passed)
@ggerganov (Owner) commented:

"We have LoongArch machines available for remote connection; can we use them for CI?"

Great! If you can spare a machine, we can add it as a node to the ggml-ci fleet. The easiest way would be for you to give me SSH access so I can log in and configure it. If that is possible, send me an email and we can set it up.

@junchao-loongson (Contributor, Author) commented:

I apologize for the late reply. We are in the process of checking in with our colleagues who are responsible for this matter and should have it ready within the next week.
