add loongarch lsx and lasx optimize code #6454

Merged: 8 commits merged into ggerganov:master on May 20, 2024

Conversation

@junchao-loongson (Contributor) commented on Apr 3, 2024:

Description

Hello, we (@lixing-star, @MQ-mengqing) are developers on the Loongson team.

We have added 128-bit (LSX) and 256-bit (LASX) vector optimization code for the LoongArch architecture.
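
To illustrate the kind of code path this adds, here is a minimal sketch (not code from this PR) of a 128-bit LSX dot-product loop. It assumes GCC's lsxintrin.h builtins (__lsx_vld, __lsx_vst, __lsx_vfmadd_s, __lsx_vreplgr2vr_w), compilation with -mlsx on loongarch64, and that __lsx_vfmadd_s computes a*b + c; the helper name dot_f32_lsx is hypothetical.

#include <lsxintrin.h>

/* hypothetical example: dot product of two float arrays using LSX */
static float dot_f32_lsx(float * a, float * b, int n) {
    /* zero the 4-lane accumulator by broadcasting 0 */
    __m128 acc = (__m128)__lsx_vreplgr2vr_w(0);
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = (__m128)__lsx_vld(a + i, 0);  /* load 4 floats from a */
        __m128 vb = (__m128)__lsx_vld(b + i, 0);  /* load 4 floats from b */
        acc = __lsx_vfmadd_s(va, vb, acc);        /* acc += va * vb (fused multiply-add) */
    }
    /* horizontal sum of the 4 accumulator lanes */
    float tmp[4];
    __lsx_vst((__m128i)acc, tmp, 0);
    float sum = tmp[0] + tmp[1] + tmp[2] + tmp[3];
    for (; i < n; ++i) {                          /* scalar tail for leftover elements */
        sum += a[i] * b[i];
    }
    return sum;
}

A LASX (256-bit) variant would follow the same pattern with the __lasx_xv* intrinsics, processing 8 floats per iteration.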

test-quantize-fns

./bin/test-quantize-fns
Testing f32
Testing f16
Testing q4_0
Testing q4_1
Testing q5_0
Testing q5_1
Testing q8_0
Testing q8_1
Testing q2_K
Testing q3_K
Testing q4_K
Testing q5_K
Testing q6_K
Testing q8_K
Testing iq2_xxs
Testing iq2_xs
Testing iq3_xxs
Testing iq1_s
Testing iq4_nl
Testing iq3_s
Testing iq2_s
Testing iq4_xs
Testing i8
Testing i16
Testing i32
Testing i64
Testing f64
Testing iq1_m

benchmark

  • 3A5000
CPU: 
    Loongson-3A5000-HV
uname -a:  
    Linux 5a2k 4.19.0-19-loongson-3 #1 SMP 4.19.190.8.14 Thu Aug 24 08:54:20 UTC 2023 loongarch64 loongarch64 loongarch64 GNU/Linux

./build/bin/benchmark 
main: build = 2606 (e70d50e8)
main: built with cc (Loongnix 8.3.0-6.lnd.vec.37) 8.3.0 for loongarch64-linux-gnu
Starting Test
Allocating Memory of size 800194560 bytes, 763 MB
Creating new tensors

------ Test 1 - Matrix Mult via F32 code
n_threads=1
            m11: type = 0 (  f32) ne = 11008 x  4096 x     1, nb = (    4, 44032, 180355072) - Sum of tensor m11 is 45088768.00
             m2: type = 0 (  f32) ne = 11008 x   128 x     1, nb = (    4, 44032, 5636096) - Sum of tensor m2 is 2818048.00
   gf->nodes[0]: type = 0 (  f32) ne =  4096 x   128 x     1, nb = (    4, 16384, 2097152) - Sum of tensor gf->nodes[0] is 11542724608.00

------ Test 2 - Matrix Mult via q4_1 code
n_threads=1
Matrix Multiplication of (11008,4096,1) x (11008,128,1) - about  11.54 gFLOPS

Iteration;NThreads; SizeX; SizeY; SizeZ; Required_FLOPS; Elapsed_u_Seconds; gigaFLOPS
=====================================================================================
        0;       1; 11008;  4096;   128;    11542724608;            760593;     15.18
        1;       1; 11008;  4096;   128;    11542724608;            758773;     15.21
        2;       1; 11008;  4096;   128;    11542724608;            758563;     15.22
        3;       1; 11008;  4096;   128;    11542724608;            759198;     15.20
        4;       1; 11008;  4096;   128;    11542724608;            758189;     15.22
        5;       1; 11008;  4096;   128;    11542724608;            759360;     15.20
        6;       1; 11008;  4096;   128;    11542724608;            760177;     15.18
        7;       1; 11008;  4096;   128;    11542724608;            757374;     15.24
        8;       1; 11008;  4096;   128;    11542724608;            757833;     15.23
        9;       1; 11008;  4096;   128;    11542724608;            757848;     15.23

Average                                                                         15.21
=====================================================================================


  • 3A6000
CPU: 
    Loongson-3A6000
uname -a:  
    Linux arch6k 6.7.0-rc2-2 #1 SMP PREEMPT Mon, 27 Nov 2023 08:42:49 +0000 loongarch64 GNU/Linux

./bin/benchmark
main: build = 2590 (849cb13)
main: built with cc (GCC) 13.2.1 20230906 for loongarch64-unknown-linux-gnu
Starting Test
Allocating Memory of size 800194560 bytes, 763 MB
Creating new tensors

------ Test 1 - Matrix Mult via F32 code
n_threads=1
            m11: type = 0 (  f32) ne = 11008 x  4096 x     1, nb = (    4, 44032, 180355072) - Sum of tensor m11 is 45088768.00
             m2: type = 0 (  f32) ne = 11008 x   128 x     1, nb = (    4, 44032, 5636096) - Sum of tensor m2 is 2818048.00
   gf->nodes[0]: type = 0 (  f32) ne =  4096 x   128 x     1, nb = (    4, 16384, 2097152) - Sum of tensor gf->nodes[0] is 11542724608.00

------ Test 2 - Matrix Mult via q4_1 code
n_threads=1
Matrix Multiplication of (11008,4096,1) x (11008,128,1) - about  11.54 gFLOPS

Iteration;NThreads; SizeX; SizeY; SizeZ; Required_FLOPS; Elapsed_u_Seconds; gigaFLOPS
=====================================================================================
        0;       1; 11008;  4096;   128;    11542724608;            502525;     22.97
        1;       1; 11008;  4096;   128;    11542724608;            502258;     22.98
        2;       1; 11008;  4096;   128;    11542724608;            502188;     22.98
        3;       1; 11008;  4096;   128;    11542724608;            502212;     22.98
        4;       1; 11008;  4096;   128;    11542724608;            502231;     22.98
        5;       1; 11008;  4096;   128;    11542724608;            502297;     22.98
        6;       1; 11008;  4096;   128;    11542724608;            502201;     22.98
        7;       1; 11008;  4096;   128;    11542724608;            502202;     22.98
        8;       1; 11008;  4096;   128;    11542724608;            502271;     22.98
        9;       1; 11008;  4096;   128;    11542724608;            502237;     22.98

Average                                                                         22.98
=====================================================================================


LoongArch Documents

@ggerganov (Owner) commented:

@junchao-loongson Thanks for this PR. Just a heads-up: I will only be able to get to reviewing this after #6412 and #6414, so it may take me some time; sorry about that. In the meantime, feel free to continue the review with other devs.

@mofosyne added the performance (Speed related topics) and Review Complexity : High (Generally requires in-depth knowledge of LLMs or GPUs) labels on May 10, 2024
@cebtenzzre removed their request for review on May 10, 2024 15:44
@ggerganov (Owner) commented:

Let's resolve the conflicts from the recent __POWER9_VECTOR__ changes and look to merge.

@junchao-loongson (Contributor, Author) commented:

Okay, I rebased the code.

@junchao-loongson (Contributor, Author) commented:

Test OK.

github-actions bot commented on May 18, 2024:

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 531 iterations 🚀

Details (for performance-related PRs only):
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8790.8ms p(95)=22532.69ms fails=, finish reason: stop=477 truncated=54
  • Prompt processing (pp): avg=113.24tk/s p(95)=523.32tk/s
  • Token generation (tg): avg=32.7tk/s p(95)=50.33tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=master commit=ee26b8ff10565458599dabdfaf41f65c2c313060

[Benchmark charts: llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 531 iterations; time series for llamacpp:prompt_tokens_seconds, llamacpp:predicted_tokens_seconds, llamacpp:kv_cache_usage_ratio, and llamacpp:requests_processing]

@ggerganov (Owner) left a comment:

I don't suppose GitHub Actions supports this architecture, but if it does, it would be nice to add a CI workflow.

Have you done some inference/perplexity runs to make sure the generation looks fine?

@github-actions bot added the build (Compilation issues) and ggml (changes relating to the ggml tensor library for machine learning) labels on May 18, 2024
ggml.c (outdated diff), comment on lines 1532 to 1548:
typedef union {
    int32_t i;
    float   f;
} FloatInt;

/* float type data load instructions */

/* Broadcast a float into all four lanes of a 128-bit LSX vector by
   reinterpreting its bits as an int32 and using the integer replicate. */
static __m128 __lsx_vreplfr2vr_s(float val)
{
    FloatInt fi_tmpval = {.f = val};
    return (__m128)__lsx_vreplgr2vr_w(fi_tmpval.i);
}

/* Broadcast a float into all eight lanes of a 256-bit LASX vector. */
static __m256 __lasx_xvreplfr2vr_s(float val)
{
    FloatInt fi_tmpval = {.f = val};
    return (__m256)__lasx_xvreplgr2vr_w(fi_tmpval.i);
}
@ggerganov (Owner) commented:

Deduplicate this code by moving it into ggml-impl.h and reusing it in ggml.c and ggml-quants.c.

@ggerganov (Owner) commented:

I was thinking of just deduplicating the __lsx_vreplfr2vr_s and __lasx_xvreplfr2vr_s code. The rest of the LSX/LASX code that is used only inside ggml-quants.c should remain in ggml-quants.c.
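
A minimal sketch (not the merged patch) of what that deduplication could look like in ggml-impl.h, keeping the FloatInt union from the diff above; __loongarch_sx and __loongarch_asx are assumed to be the compiler's LSX/LASX feature macros:

#include <stdint.h>

typedef union {
    int32_t i;
    float   f;
} FloatInt;

#if defined(__loongarch_sx)
#include <lsxintrin.h>

/* broadcast a float into all 4 lanes of a 128-bit LSX vector */
static inline __m128 __lsx_vreplfr2vr_s(float val) {
    FloatInt fi = {.f = val};
    return (__m128)__lsx_vreplgr2vr_w(fi.i);
}
#endif

#if defined(__loongarch_asx)
#include <lasxintrin.h>

/* broadcast a float into all 8 lanes of a 256-bit LASX vector */
static inline __m256 __lasx_xvreplfr2vr_s(float val) {
    FloatInt fi = {.f = val};
    return (__m256)__lasx_xvreplgr2vr_w(fi.i);
}
#endif

With the helpers defined once here, ggml.c and ggml-quants.c would both include ggml-impl.h instead of carrying their own copies.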

@ggerganov (Owner) commented:

Btw, for long-term support it would be very useful to add CI for this arch. If there is someone who can donate a machine, we can deploy ggml-ci on it and have it run tests on each commit. Without CI, the code can quickly become outdated and break.

@junchao-loongson (Contributor, Author) commented:

We have LoongArch machines available for remote connection; can we use them for CI?

@ggerganov merged commit 65c5820 into ggerganov:master on May 20, 2024 (72 of 74 checks passed)
@ggerganov (Owner) commented:

"We have LoongArch machines available for remote connection; can we use them for CI?"

Great! If you can spare a machine, we can add it as a node to the ggml-ci fleet. The easiest way would be for you to give me SSH access so I can log in and configure it. If that is possible, send me an email and we can set it up.

@junchao-loongson (Contributor, Author) commented:

I apologize for the late reply. We are in the process of checking in with our colleagues who are responsible for this matter and should have it ready within the next week.
