ggml: aarch64: implement mmla kernels for q8_0_q8_0, q4_0_q8_0 and q4_1_q8_1 quantized gemm #4966
Conversation
I have tested this PR with a few Llama2 models using different prompt sizes, compared the gemm output from the mmla kernel and the default dot kernels, and confirmed they matched. Please let me know if there are any unit tests or perplexity tests I need to run for this PR. Thank you!
Interesting work - I was not familiar with these instructions. Gaining performance for CPU-based prompt processing is of significant interest. Looks like you want to process 2 rows at a time with a single kernel call. However, I feel this is implemented in a very convoluted way, with quite a lot of duplication in the matrix multiplication logic. I could be missing something though. Can you try to fit this into the existing dot kernels? Also run a perplexity test:

# get wikitext test data
./scripts/get-wikitext-2.sh
unzip wikitext-2-raw-v1.zip

# run test (takes a while)
./perplexity -m some-model-q4_0.gguf -f ./wikitext-2-raw/wiki.test.raw
@ggerganov, thanks for the feedback. Yes, your understanding is correct; I'm processing two rows and two columns at a time (the SMMLA instruction operates as 2x8 * 8x2 --> 2x2). I came up with this logic while trying to understand the current algorithm and keep the changes as isolated as possible :) Sure, I will check how best I can merge this logic with the dot kernel matmul loop.
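For readers unfamiliar with the instruction, the standalone sketch below (not the PR's kernel, just a minimal illustration) shows what a single SMMLA does via the vmmlaq_s32 intrinsic: it multiplies a 2x8 int8 matrix by the transpose of a second 2x8 int8 matrix and accumulates the resulting 2x2 tile of int32 values. Building it requires an aarch64 toolchain and target with the int8 matmul extension (something like -march=armv8.2-a+i8mm; the exact flag is an assumption).

#include <stdio.h>
#if defined(__ARM_FEATURE_MATMUL_INT8)
#include <arm_neon.h>
#endif

int main(void) {
#if defined(__ARM_FEATURE_MATMUL_INT8)
    // a: 2x8 row-major int8 matrix; b: 2x8 row-major int8 matrix, used as b^T.
    const int8_t a_data[16] = { 1, 2, 3, 4, 5, 6, 7, 8,   1, 1, 1, 1, 1, 1, 1, 1 };
    const int8_t b_data[16] = { 1, 0, 0, 0, 0, 0, 0, 0,   0, 1, 0, 0, 0, 0, 0, 0 };

    const int8x16_t a = vld1q_s8(a_data);
    const int8x16_t b = vld1q_s8(b_data);

    int32x4_t acc = vdupq_n_s32(0);   // 2x2 accumulator laid out as {c00, c01, c10, c11}
    acc = vmmlaq_s32(acc, a, b);      // acc += a(2x8) * b^T(8x2)

    int32_t c[4];
    vst1q_s32(c, acc);
    printf("c00=%d c01=%d c10=%d c11=%d\n", c[0], c[1], c[2], c[3]);   // prints 1 2 1 1
#else
    printf("__ARM_FEATURE_MATMUL_INT8 is not available on this target\n");
#endif
    return 0;
}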
Hi @ggerganov, I have extended the existing dot kernel interface as discussed above.
Some of the unit tests are being invoked only from this CI run: https://github.com/ggerganov/llama.cpp/actions/runs/7618043882/job/20748830174?pr=4966
I updated the PR for the Windows failures but haven't tested it on a Windows machine yet. It would be great if it could be tested in the CI runs; otherwise I will pick up the local Windows testing effort.
Thanks, looks like there are more places to take care of on Windows. I will check.
Looks like MSVC doesn't support VLAs either. I have fixed all the Windows failures and tested on both Windows and Linux. Next I'm looking at the error below from the swift build.
The swift CI tests use a pinned version of ggml, so they fail every time there are changes to the ggml interface.
Thank you! I see those builds are using the llama.cpp package, not building from source.
Updated the PR to fix the Windows builds and ran the unit tests on Ubuntu and Windows. It is ready for final review and CI.
Perplexity test results:
Does this perplexity value match well with what you get on master? For how many threads do you observe optimal performance with these kernels?
Yes, the perplexity matches the value on master without these changes (logs below).
I haven't collected data for different thread configs yet, but in general I see these gemm kernels scale with the number of threads, though not linearly.
The expectation is that for prompt processing the speed should always increase with the number of threads, while for text generation there should be an optimal number of threads after which performance starts to degrade (see lines 222 to 223 in 6fea843). These can be configured via the corresponding thread-count parameters. I just now realized that the new kernels are used only for prompt processing, so that's fine.
ggml-quants.h
void ggml_vec_dot_q4_1_q8_1(int n, float * restrict s, const void ** restrict vx, const void ** restrict vy, const int nrc);
void ggml_vec_dot_q5_0_q8_0(int n, float * restrict s, const void ** restrict vx, const void ** restrict vy, const int nrc);
void ggml_vec_dot_q5_1_q8_1(int n, float * restrict s, const void ** restrict vx, const void ** restrict vy, const int nrc);
void ggml_vec_dot_q8_0_q8_0(int n, float * restrict s, const void ** restrict vx, const void ** restrict vy, const int nrc);
I'm not convinced this API is desirable - it requires preparing arrays of pointers, which seems quite cumbersome.
Normally, linear algebra libraries utilize an API of a pointer, number of elements and stride (in bytes or in elements). So I'm thinking that we should probably switch to something like:
void ggml_vec_dot_q4_0_q8_0(int n, float * restrict s, const void * restrict vx, size_t bx, const void * restrict vy, size_t by, int nrc);
Note that I'm mostly thinking out loud - not yet sure what is the best way.
It's a big change, so we have to consider the options to make this less intrusive.
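To make the pointer + stride convention concrete, here is a small self-contained toy (plain floats and invented names, not ggml's actual kernel or types). It only illustrates the calling convention - a buffer pointer, an element count, a byte stride, and an nrc count - not the 2x2 tile semantics of the real mmla kernels.

#include <stdio.h>
#include <stddef.h>

// Computes nrc dot products: row i of x (stride bx bytes) with row i of y (stride by bytes).
static void vec_dot_f32(int n, float * s, const void * vx, size_t bx,
                        const void * vy, size_t by, int nrc) {
    for (int i = 0; i < nrc; ++i) {
        const float * x = (const float *)((const char *) vx + i*bx);
        const float * y = (const float *)((const char *) vy + i*by);
        float sum = 0.0f;
        for (int k = 0; k < n; ++k) {
            sum += x[k]*y[k];
        }
        s[i] = sum;
    }
}

int main(void) {
    // Two contiguous rows of 4 elements each: the row stride is simply 4*sizeof(float).
    const float a[2][4] = { {1, 2, 3, 4}, {5, 6, 7, 8} };
    const float b[2][4] = { {1, 1, 1, 1}, {2, 2, 2, 2} };
    float s[2];

    vec_dot_f32(4, s, a, sizeof(a[0]), b, sizeof(b[0]), /*nrc=*/2);
    printf("s[0] = %f, s[1] = %f\n", s[0], s[1]);   // prints 10 and 52
    return 0;
}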
I agree. As I mentioned earlier, I tried to fit it into the existing interface itself, but changed to arrays mainly to account for the stride. If it's better to add a few more args than arrays, how about we define a tensor attribute structure and pass it across instead of adding one argument for each attribute? That way we can extend it in the future for any new functionality. For now the structure could just have the number of elements, the stride, and the format type.
Hm, adding a new tensor attribute structure would again introduce a lot of boilerplate around calling the dot functions. Adding extra arguments is better in this regard, because we already have the strides from the struct ggml_tensor.
Currently, ggml stores the strides in number of bytes. So the numbers in ggml_tensor->nb are strides in bytes. The dot functions should also accept the row strides in bytes, for consistency.
In the future, we will transition to storing the strides in number of elements: ggerganov/ggml#623. But this is not important for now.
Hi @ggerganov, I have updated the PR. Please review and let me know if it can be improved further, especially around the stride calculations. I was able to use the ggml_tensor strides (nb) for the src0 and dst tensors, but I had to derive the src1_col stride following the logic used for the offset calculations.
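For readers following along, here is a small standalone sketch of the kind of byte-stride arithmetic involved. The block layout below only mimics q8_0 (32 int8 quants plus one fp16 scale, represented here by a uint16_t stand-in); it is not ggml's actual struct, and the numbers are illustrative.

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

#define QK8_0 32

typedef struct {
    uint16_t d;          // scale, stored as fp16 in ggml; uint16_t here as a stand-in
    int8_t   qs[QK8_0];  // 32 quantized values
} toy_block_q8_0;

int main(void) {
    const int ne = 4096;  // elements per row (e.g. the hidden dimension)

    // A row of ne elements is ne/QK8_0 blocks; the byte stride between two
    // consecutive rows is therefore the number of blocks times the block size.
    const size_t row_stride_bytes = (size_t)(ne/QK8_0)*sizeof(toy_block_q8_0);

    printf("row stride: %zu bytes (%d blocks of %zu bytes)\n",
           row_stride_bytes, ne/QK8_0, sizeof(toy_block_q8_0));
    return 0;
}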
ggml-quants.c
vst1_f32(s, vget_low_f32(sumv2));
vst1_f32(s + 16, vget_high_f32(sumv2));
I'm wondering if we should add a stride argument for s too. This 16 offset is very obscure, but on the other hand the function signature would become a bit overloaded.
It's probably better to add it.
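A possible shape of that change, sketched below as a standalone toy (the parameter name bs is an assumption, counted in float elements like the current +16 offset): with an explicit destination stride, the two rows of the 2x2 result tile are stored bs elements apart and the magic 16 disappears.

#include <stdio.h>
#include <arm_neon.h>

int main(void) {
    // Stand-in for the kernel's 2x2 result tile after scaling: {c00, c01, c10, c11}.
    const float vals[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    const float32x4_t sumv2 = vld1q_f32(vals);

    float out[2][16] = { { 0 } };   // pretend the destination has 16 floats per row
    float *s = &out[0][0];
    const size_t bs = 16;           // destination row stride in float elements (assumed)

    vst1_f32(s,      vget_low_f32 (sumv2));   // row 0 of the tile -> out[0][0..1]
    vst1_f32(s + bs, vget_high_f32(sumv2));   // row 1 of the tile -> out[1][0..1]

    printf("%g %g / %g %g\n", out[0][0], out[0][1], out[1][0], out[1][1]);   // 1 2 / 3 4
    return 0;
}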
Hi @ggerganov, I have addressed all the comments.
I think this should be good to merge. I want to take some time to do some AWS Graviton tests first and confirm the results. If anyone else gives this a try, please post some feedback as well.
@ggerganov or anyone trying this PR, please make sure you use instances from the AWS Graviton3 family, c7g/m7g/r7g (Graviton2 doesn't support the MMLA instructions).
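On Linux/aarch64, one quick way to confirm that the CPU actually exposes the int8 matrix-multiply extension is to query the hardware capabilities, as in the sketch below. This is a convenience check, not part of the PR, and it assumes the kernel headers define HWCAP2_I8MM.

#include <stdio.h>
#if defined(__aarch64__) && defined(__linux__)
#include <sys/auxv.h>
#include <asm/hwcap.h>
#endif

int main(void) {
#if defined(__aarch64__) && defined(__linux__) && defined(HWCAP2_I8MM)
    // Graviton3 reports the i8mm capability; Graviton2 does not.
    const unsigned long hwcap2 = getauxval(AT_HWCAP2);
    printf("i8mm (int8 MMLA) supported: %s\n", (hwcap2 & HWCAP2_I8MM) ? "yes" : "no");
#else
    printf("this check is for Linux/aarch64 only (or HWCAP2_I8MM is not defined)\n");
#endif
    return 0;
}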
@ggerganov I tried this PR on an AWS Graviton3 instance. I can confirm that I observed a speedup similar to the one mentioned by the author of this patch. Please find the tokens/s numbers below.
armv8.2-a and above support MMLA instructions that have higher throughput than DOT. This commit adds an mmla kernel for q8_0_q8_0 gemm. The feature is enabled if the platform supports "__ARM_FEATURE_MATMUL_INT8". On AWS Graviton3 processors this kernel resulted in up to a 1.5x improvement in prompt evaluation throughput compared to the default sdot kernel.
armv8.2-a and above support MMLA instructions that have higher throughput than DOT. This commit adds an mmla kernel for q4_0_q8_0 gemm. The feature is enabled if the platform supports "__ARM_FEATURE_MATMUL_INT8". On AWS Graviton3 processors this kernel resulted in up to a 1.5x improvement in prompt evaluation throughput compared to the default sdot kernel.
armv8.2-a and above support MMLA instructions that have higher throughput than DOT. This commit adds an mmla kernel for q4_1_q8_1 gemm. The feature is enabled if the platform supports "__ARM_FEATURE_MATMUL_INT8". On AWS Graviton3 processors this kernel resulted in up to a 1.5x improvement in prompt evaluation throughput compared to the default sdot kernel.
* ggml: aarch64: implement smmla kernel for q8_0_q8_0 quantized gemm (commit message as above)
* ggml: aarch64: implement smmla kernel for q4_0_q8_0 quantized gemm (commit message as above)
* ggml: aarch64: implement smmla kernel for q4_1_q8_1 quantized gemm (commit message as above)
* ggml: update unit tests for the new vec_dot interface
* llama.cpp: add MATMUL_INT8 capability to system_info
armv8.2-a and above support MMLA instructions that have better throughput than DOT. This PR adds support for mmla kernels for the q8_0_q8_0, q4_0_q8_0, and q4_1_q8_1 quantized gemm routines.
The feature is enabled if the platform supports __ARM_FEATURE_MATMUL_INT8.
On AWS Graviton3 processors these kernels resulted in up to a 1.5x improvement in prompt evaluation throughput compared to the default sdot kernel.
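As a quick way to verify that a given build actually picked up these kernels, the PR adds a MATMUL_INT8 entry to the system_info string. The sketch below is an assumed usage example, not part of the PR: it assumes llama.h is on the include path and the program is linked against llama.cpp, and the exact formatting of the returned string may differ.

#include <stdio.h>
#include "llama.h"

int main(void) {
    // Prints the capability string, which should contain "MATMUL_INT8 = 1"
    // when the build enables the new kernels (exact formatting may vary).
    printf("%s\n", llama_print_system_info());
    return 0;
}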