
ggml: aarch64: implement mmla kernels for q8_0_q8_0, q4_0_q8_0 and q4_1_q8_1 quantized gemm #4966

Merged: 5 commits merged into ggerganov:master on Feb 11, 2024

Conversation

@snadampal (Contributor) commented Jan 16, 2024

Armv8.2-A and above supports MMLA instructions, which have better throughput than DOT. This PR adds MMLA kernels for the following quantized GEMM routines:
q8_0_q8_0
q4_0_q8_0
q4_1_q8_1
The feature is enabled if the platform supports __ARM_FEATURE_MATMUL_INT8.

On AWS Graviton3 processors these kernels resulted in up to a 1.5x improvement in prompt evaluation throughput compared to the default sdot kernels.
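The kernels in this PR are built around the SMMLA instruction. Below is a minimal, self-contained sketch (not the PR's code; the build flag and data values are illustrative assumptions) of the underlying vmmlaq_s32 intrinsic, which multiplies a 2x8 int8 tile by an 8x2 int8 tile and accumulates into a 2x2 int32 tile, guarded by the same feature macro:

// Sketch only: demonstrates the vmmlaq_s32 (SMMLA) intrinsic behind
// __ARM_FEATURE_MATMUL_INT8. Build on aarch64 with something like:
//   gcc -O2 -march=armv8.2-a+i8mm smmla_demo.c -o smmla_demo
#include <stdio.h>
#include <stdint.h>

#if defined(__ARM_FEATURE_MATMUL_INT8)
#include <arm_neon.h>

int main(void) {
    // a: two rows of 8 int8 values (a 2x8 tile, row-major).
    // b: two columns of 8 int8 values (the 8x2 right-hand tile, column-major).
    int8_t a[16], b[16];
    for (int i = 0; i < 16; ++i) { a[i] = (int8_t)(i + 1); b[i] = 1; }

    int32x4_t acc = vdupq_n_s32(0);
    // acc += A(2x8) * B(8x2); the 2x2 result lands in one int32x4_t register.
    acc = vmmlaq_s32(acc, vld1q_s8(a), vld1q_s8(b));

    int32_t c[4];
    vst1q_s32(c, acc); // c = { C00, C01, C10, C11 }
    printf("C = [[%d %d] [%d %d]]\n", (int)c[0], (int)c[1], (int)c[2], (int)c[3]);
    return 0;
}
#else
int main(void) {
    puts("__ARM_FEATURE_MATMUL_INT8 is not available on this target");
    return 0;
}
#endif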

@snadampal (Contributor Author):

I have tested this PR with a few Llama 2 models using different prompt sizes, compared the GEMM output from the MMLA kernels against the default dot kernels, and confirmed they match. Please let me know if there are any unit tests or perplexity tests I need to run for this PR. Thank you!

@snadampal changed the title from "ggml: aarch64: implement mmla kernel for q8_0_q8_0 quantized gemm" to "ggml: aarch64: implement mmla kernels for q8_0_q8_0 and q4_0_q8_0 quantized gemm" on Jan 16, 2024
Makefile (outdated review comments, resolved)
CMakeLists.txt (outdated review comments, resolved)
@snadampal changed the title from "ggml: aarch64: implement mmla kernels for q8_0_q8_0 and q4_0_q8_0 quantized gemm" to "ggml: aarch64: implement mmla kernels for q8_0_q8_0, q4_0_q8_0 and q4_1_q8_1 quantized gemm" on Jan 16, 2024
Makefile (outdated review comments, resolved)
@ggerganov added the "performance" (Speed related topics) label on Jan 17, 2024
@ggerganov (Owner):

Interesting work - I was not familiar with these instructions. Gaining performance for CPU-based prompt processing is of significant interest.

Looks like you want to process 2 rows at a time with a single kernel call. However, I feel this is implemented in a rather convoluted way, with quite a lot of duplication in the matrix multiplication logic. I could be missing something, though.

Can you try to fit this into the existing ggml_vec_dot_q4_0_q8_0 and ggml_vec_dot_q4_1_q8_1 kernels with an extra #ifdef similar to how we differentiate between AVX, AVX2, ARM NEON, WASM SIMD, etc?

Also run a perplexity test:

# get wikitext test data
./scripts/get-wikitext-2.sh
unzip wikitext-2-raw-v1.zip

# run test (takes a while)
./perplexity -m some-model-q4_0.gguf -f ./wikitext-2-raw/wiki.test.raw

@snadampal (Contributor Author) commented Jan 17, 2024

@ggerganov, thanks for the feedback. Yes, your understanding is correct; I'm processing two rows and two columns at a time (the SMMLA instruction computes a 2x8 * 8x2 --> 2x2 product). I came up with this logic while trying to understand the current algorithm and keep the changes as isolated as possible :) Sure, I will check how best to merge this logic into the dot-kernel matmul loop.
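For reference, the per-call tile update described above is

$$C_{2\times 2} = C_{2\times 2} + A_{2\times 8}\, B_{8\times 2}, \qquad c_{ij} = c_{ij} + \sum_{k=0}^{7} a_{ik}\, b_{kj}, \quad i,j \in \{0,1\},$$

so each SMMLA consumes two quantized rows of the left operand and two columns of the right operand and produces four int32 accumulators.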

@snadampal (Contributor Author):

Hi @ggerganov, I have extended the ggml_vec_dot interface to add the MMLA kernels. I tried to fit the additional row/column pointers into the existing void* x/y arguments, but didn't find a better way than changing the interface to carry arrays of pointers. Since vec_dot is the main API, the changes touch many places. Can you please trigger CI to make sure there are no breaks? I'm happy to rework the PR if there is any feedback.

@snadampal (Contributor Author) commented Jan 23, 2024

Some of the unit tests are invoked only from cmake, not from make, so I missed them in the previous version. I'm now testing both the make and cmake builds. I have pushed fixes for all CI failures except the Windows build errors; I will check whether I can get access to a Windows machine.

https://github.com/ggerganov/llama.cpp/actions/runs/7618043882/job/20748830174?pr=4966
https://github.com/ggerganov/llama.cpp/actions/runs/7618043882/job/20748830837?pr=4966

ggml-quants.h (outdated review comments, resolved)
@snadampal (Contributor Author):

I updated the PR for the Windows failures but haven't tested it on a Windows machine yet. It would be great if it can be tested in CI runs; otherwise I will pick up local Windows testing.

@snadampal (Contributor Author):

Thanks, it looks like there are more places to take care of on Windows. I will check.

@snadampal (Contributor Author):

It looks like MSVC doesn't support VLAs either. I have fixed all the Windows failures and tested on Windows and Linux platforms. Next I'm looking at the error below from the macos-latest-swift and ios-xcode builds: the linker is not finding the symbol, which looks strange to me. I have ggml_cpu_has_matmul_int8() declared in ggml.h and implemented in ggml.c along the same lines as the ggml_cpu_has_neon() function.
@cebtenzzre, do you have any pointers on what is different about the macos-latest-swift and ios-xcode builds that makes them unable to link correctly?

Undefined symbols for architecture x86_64:
  "_ggml_cpu_has_matmul_int8", referenced from:
      _llama_print_system_info in llama.o
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
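For context, a minimal sketch of what such a capability function looks like, mirroring the ggml_cpu_has_neon() pattern described above (an assumption for illustration; the actual ggml.c implementation may differ in detail). The answer is decided at compile time from the feature macro, which is why a build that does not compile this branch's ggml.c cannot resolve the symbol:

#include <stdio.h>

// Sketch, not the actual ggml.c code: report MMLA int8 support based on the
// compile-time feature macro, in the same style as ggml_cpu_has_neon().
int ggml_cpu_has_matmul_int8(void) {
#if defined(__ARM_FEATURE_MATMUL_INT8)
    return 1;
#else
    return 0;
#endif
}

int main(void) {
    printf("MATMUL_INT8 = %d\n", ggml_cpu_has_matmul_int8());
    return 0;
}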

@slaren (Collaborator) commented Jan 26, 2024

The Swift CI tests use a pinned version of ggml, so they fail every time there are changes to ggml.c. You can ignore these errors.

@snadampal (Contributor Author):

Thank you! I see those builds use the llama.cpp package rather than building from source.

Fetching from https://github.com/ggerganov/ggml.git

Cloning local copy of package ‘ggml’

Checking out release of package ‘ggml’

@snadampal (Contributor Author) commented Jan 26, 2024

Updated the PR to fix the Windows builds and ran the unit tests on Ubuntu and Windows. It is ready for the final review and CI.

@ggerganov added the "high priority" (Very important issue) label on Jan 26, 2024
@ggerganov self-requested a review on January 26, 2024 09:08
@snadampal (Contributor Author):

perplexity test results:

# get wikitext test data
./scripts/get-wikitext-2.sh
unzip wikitext-2-raw-v1.zip

# run test (takes a while)
./perplexity -m some-model-q4_0.gguf -f ./wikitext-2-raw/wiki.test.raw
main: build = 1975 (9eaba38c)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for aarch64-linux-gnu
main: seed  = 1706240339
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /home/ubuntu/Llama_setup2/llama.cpp/models/open_llama_7b_v2/ggml-model-q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = models
llama_model_loader: - kv   2:                       llama.context_length u32              = 2048
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 3.56 GiB (4.54 BPW) 
llm_load_print_meta: general.name     = models
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.11 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors:        CPU buffer size =  3647.87 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:        CPU input buffer size   =     9.01 MiB
llama_new_context_with_model:        CPU compute buffer size =    70.50 MiB
llama_new_context_with_model: graph splits (measure): 1

system_info: n_threads = 32 / 64 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 1 | 
perplexity: tokenizing the input ..
perplexity: tokenization took 896.291 ms
perplexity: calculating perplexity over 631 chunks, batch_size=512
perplexity: 5.11 seconds per pass - ETA 53.75 minutes
[1]4.1006,[2]5.4902,[3]5.6471,[4]6.6776,[5]7.0017,[6]7.7207,[7]7.9831,[8]8.2151,[9]8.5785,[10]9.1584,[11]9.3710,[12]9.3985,[13]9.4508,[14]9.7168,[15]9.1693,[16]8.8449,[17]8.7281,[18]8.2459,[19]8.2586,[20]8.2317,[21]7.9581,[22]7.8929,[23]7.7834,[24]7.7557,[25]7.5015,[26]7.2262,[27]7.0801,[28]6.9451,[29]6.7618,[30]6.6763,[31]6.7396,[32]6.7262,[33]6.7796,[34]6.7551,[35]6.7997,[36]6.8193,[37]6.8824,[38]6.9279,[39]6.9951,[40]7.0751,[41]7.0246,[42]7.0092,[43]7.0083,[44]6.9956,[45]6.9532,[46]6.9736,[47]6.9956,[48]6.9595,[49]6.9366,[50]6.9036,[51]6.9706,[52]6.9577,[53]6.9250,[54]6.9449,[55]6.9254,[56]6.9532,[57]6.9838,[58]7.0189,[59]7.0262,[60]7.0679,[61]7.0457,[62]7.0596,[63]7.1159,[64]7.1264,[65]7.1528,[66]7.1579,[67]7.1928,[68]7.2179,[69]7.2760,[70]7.3236,[71]7.3528,[72]7.4042,[73]7.3822,[74]7.3917,[75]7.3975,[76]7.4047,[77]7.4486,[78]7.4440,[79]7.4246,[80]7.3857,[81]7.3640,[82]7.3416,[83]7.3493,[84]7.3392,[85]7.2976,[86]7.2989,[87]7.2713,[88]7.2904,[89]7.2948,[90]7.3053,[91]7.2985,[92]7.3308,[93]7.3308,[94]7.3340,[95]7.3401,[96]7.3365,[97]7.3325,[98]7.3656,[99]7.3575,[100]7.3905,[101]7.3930,[102]7.3802,[103]7.4032,[104]7.4080,[105]7.4545,[106]7.4529,[107]7.4421,[108]7.4609,[109]7.4855,[110]7.4811,[111]7.4874,[112]7.4721,[113]7.4581,[114]7.4495,[115]7.4684,[116]7.4948,[117]7.5385,[118]7.5649,[119]7.6030,[120]7.6158,[121]7.6040,[122]7.6388,[123]7.6836,[124]7.7097,[125]7.6881,[126]7.6880,[127]7.6818,[128]7.6525,[129]7.6434,[130]7.6581,[131]7.6588,[132]7.6291,[133]7.6082,[134]7.5936,[135]7.5888,[136]7.5787,[137]7.5414,[138]7.5352,[139]7.4971,[140]7.4565,[141]7.4384,[142]7.4232,[143]7.4358,[144]7.4349,[145]7.4322,[146]7.4285,[147]7.4184,[148]7.3997,[149]7.3721,[150]7.3758,[151]7.3866,[152]7.3837,[153]7.4064,[154]7.3984,[155]7.3875,[156]7.4033,[157]7.3717,[158]7.3458,[159]7.3215,[160]7.2826,[161]7.2528,[162]7.2096,[163]7.1839,[164]7.1677,[165]7.1470,[166]7.1168,[167]7.0952,[168]7.0761,[169]7.0388,[170]7.0143,[171]6.9884,[172]6.9571,[173]6.9364,[174]6.9197,[175]6.8968,[176]6.8683,[177]6.8566,[178]6.8284,[179]6.8183,[180]6.8087,[181]6.8142,[182]6.8016,[183]6.8300,[184]6.8358,[185]6.8718,[186]6.9028,[187]6.9188,[188]6.9579,[189]6.9821,[190]7.0071,[191]7.0433,[192]7.0800,[193]7.0899,[194]7.0870,[195]7.1038,[196]7.1217,[197]7.1300,[198]7.1481,[199]7.1558,[200]7.1536,[201]7.1699,[202]7.1730,[203]7.1762,[204]7.1914,[205]7.2026,[206]7.2155,[207]7.2253,[208]7.2157,[209]7.2321,[210]7.2450,[211]7.2660,[212]7.2665,[213]7.2718,[214]7.2703,[215]7.2666,[216]7.2484,[217]7.2452,[218]7.2664,[219]7.2736,[220]7.2818,[221]7.2845,[222]7.2783,[223]7.2902,[224]7.2724,[225]7.2611,[226]7.2417,[227]7.2262,[228]7.2207,[229]7.2081,[230]7.2025,[231]7.1856,[232]7.1827,[233]7.1668,[234]7.1583,[235]7.1444,[236]7.1267,[237]7.1063,[238]7.0989,[239]7.0856,[240]7.0737,[241]7.0701,[242]7.0583,[243]7.0508,[244]7.0378,[245]7.0276,[246]7.0110,[247]6.9928,[248]6.9793,[249]6.9637,[250]6.9493,[251]6.9432,[252]6.9405,[253]6.9369,[254]6.9271,[255]6.9286,[256]6.9276,[257]6.9193,[258]6.9186,[259]6.9283,[260]6.9302,[261]6.9415,[262]6.9451,[263]6.9441,[264]6.9466,[265]6.9531,[266]6.9571,[267]6.9733,[268]6.9848,[269]6.9903,[270]6.9970,[271]7.0107,[272]7.0172,[273]7.0342,[274]7.0422,[275]7.0498,[276]7.0685,[277]7.0753,[278]7.0843,[279]7.0702,[280]7.0508,[281]7.0327,[282]7.0119,[283]6.9986,[284]6.9938,[285]6.9890,[286]6.9889,[287]6.9841,[288]6.9802,[289]6.9702,[290]6.9594,[291]6.9507,[292]6.9460,[293]6.9369,[294]6.9295,[295]6.9240,[296]6.9090,[297]6.9097,[298]6.9003,[299]6.8978,[300]6.8933,[301]6.8901,[302]6.8876,[303]6.8823,[304]6.8689,[305]6.8611,[30
6]6.8402,[307]6.8135,[308]6.8269,[309]6.8375,[310]6.8403,[311]6.8326,[312]6.8264,[313]6.8271,[314]6.8379,[315]6.8417,[316]6.8425,[317]6.8444,[318]6.8473,[319]6.8594,[320]6.8651,[321]6.8788,[322]6.8754,[323]6.8634,[324]6.8591,[325]6.8548,[326]6.8514,[327]6.8468,[328]6.8458,[329]6.8582,[330]6.8600,[331]6.8610,[332]6.8642,[333]6.8645,[334]6.8612,[335]6.8624,[336]6.8664,[337]6.8712,[338]6.8698,[339]6.8713,[340]6.8709,[341]6.8608,[342]6.8593,[343]6.8700,[344]6.8722,[345]6.8644,[346]6.8683,[347]6.8646,[348]6.8757,[349]6.8724,[350]6.8786,[351]6.8814,[352]6.8943,[353]6.8964,[354]6.8956,[355]6.8998,[356]6.8925,[357]6.8885,[358]6.8903,[359]6.8891,[360]6.8972,[361]6.9002,[362]6.8924,[363]6.8953,[364]6.8882,[365]6.8849,[366]6.8883,[367]6.8794,[368]6.8731,[369]6.8642,[370]6.8568,[371]6.8607,[372]6.8594,[373]6.8563,[374]6.8535,[375]6.8466,[376]6.8399,[377]6.8310,[378]6.8236,[379]6.8163,[380]6.8114,[381]6.8115,[382]6.8091,[383]6.8113,[384]6.8191,[385]6.8269,[386]6.8271,[387]6.8201,[388]6.8246,[389]6.8246,[390]6.8294,[391]6.8238,[392]6.8238,[393]6.8279,[394]6.8298,[395]6.8448,[396]6.8565,[397]6.8740,[398]6.8887,[399]6.8975,[400]6.9076,[401]6.9207,[402]6.9356,[403]6.9375,[404]6.9430,[405]6.9564,[406]6.9640,[407]6.9628,[408]6.9719,[409]6.9852,[410]6.9973,[411]7.0062,[412]7.0118,[413]7.0220,[414]7.0293,[415]7.0389,[416]7.0521,[417]7.0619,[418]7.0603,[419]7.0588,[420]7.0612,[421]7.0770,[422]7.0888,[423]7.0910,[424]7.0987,[425]7.0940,[426]7.0936,[427]7.0973,[428]7.1008,[429]7.1017,[430]7.1045,[431]7.1086,[432]7.1171,[433]7.1216,[434]7.1177,[435]7.1108,[436]7.1071,[437]7.1027,[438]7.0981,[439]7.1009,[440]7.1008,[441]7.1000,[442]7.1019,[443]7.1069,[444]7.1157,[445]7.1177,[446]7.1224,[447]7.1223,[448]7.1206,[449]7.1121,[450]7.1181,[451]7.1185,[452]7.1229,[453]7.1233,[454]7.1213,[455]7.1288,[456]7.1283,[457]7.1302,[458]7.1344,[459]7.1385,[460]7.1328,[461]7.1324,[462]7.1464,[463]7.1472,[464]7.1559,[465]7.1547,[466]7.1533,[467]7.1562,[468]7.1518,[469]7.1489,[470]7.1507,[471]7.1403,[472]7.1376,[473]7.1432,[474]7.1391,[475]7.1331,[476]7.1333,[477]7.1347,[478]7.1301,[479]7.1257,[480]7.1247,[481]7.1192,[482]7.1148,[483]7.1105,[484]7.1072,[485]7.1070,[486]7.1006,[487]7.0992,[488]7.0994,[489]7.0992,[490]7.0915,[491]7.0894,[492]7.0884,[493]7.0865,[494]7.0889,[495]7.0978,[496]7.0998,[497]7.0984,[498]7.0970,[499]7.0988,[500]7.1051,[501]7.1064,[502]7.1095,[503]7.1131,[504]7.1159,[505]7.1220,[506]7.1268,[507]7.1245,[508]7.1280,[509]7.1224,[510]7.1241,[511]7.1168,[512]7.1143,[513]7.1157,[514]7.1118,[515]7.1058,[516]7.1022,[517]7.0962,[518]7.0976,[519]7.1093,[520]7.1137,[521]7.1106,[522]7.1128,[523]7.1187,[524]7.1215,[525]7.1191,[526]7.1218,[527]7.1162,[528]7.1085,[529]7.1067,[530]7.1029,[531]7.0993,[532]7.0969,[533]7.0925,[534]7.0884,[535]7.0853,[536]7.0848,[537]7.0859,[538]7.0885,[539]7.0853,[540]7.0828,[541]7.0809,[542]7.0751,[543]7.0796,[544]7.0830,[545]7.0811,[546]7.0814,[547]7.0799,[548]7.0783,[549]7.0798,[550]7.0755,[551]7.0772,[552]7.0761,[553]7.0709,[554]7.0692,[555]7.0679,[556]7.0630,[557]7.0611,[558]7.0593,[559]7.0521,[560]7.0482,[561]7.0493,[562]7.0482,[563]7.0459,[564]7.0379,[565]7.0366,[566]7.0355,[567]7.0332,[568]7.0402,[569]7.0351,[570]7.0347,[571]7.0321,[572]7.0324,[573]7.0301,[574]7.0338,[575]7.0306,[576]7.0294,[577]7.0335,[578]7.0334,[579]7.0313,[580]7.0407,[581]7.0465,[582]7.0459,[583]7.0507,[584]7.0572,[585]7.0488,[586]7.0424,[587]7.0461,[588]7.0463,[589]7.0481,[590]7.0477,[591]7.0449,[592]7.0347,[593]7.0356,[594]7.0330,[595]7.0246,[596]7.0178,[597]7.0084,[598]6.9969,[599]6.9934,[600]6.9969,[601]6.9999,[602]7
.0005,[603]6.9991,[604]7.0054,[605]7.0069,[606]7.0101,[607]7.0132,[608]7.0218,[609]7.0293,[610]7.0278,[611]7.0300,[612]7.0294,[613]7.0285,[614]7.0262,[615]7.0302,[616]7.0259,[617]7.0274,[618]7.0296,[619]7.0373,[620]7.0372,[621]7.0392,[622]7.0407,[623]7.0433,[624]7.0440,[625]7.0474,[626]7.0462,[627]7.0505,[628]7.0553,[629]7.0636,[630]7.0597,[631]7.0620,
Final estimate: PPL = 7.0620 +/- 0.04195

llama_print_timings:        load time =     828.57 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time = 3211039.76 ms / 323072 tokens (    9.94 ms per token,   100.61 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time = 3213181.34 ms / 323073 tokens

@ggerganov (Owner):

Does this perplexity value match well with what you get on master without these changes?

For how many threads do you observe optimal performance with these kernels?

@snadampal (Contributor Author):

Yes, the perplexity matches master without these changes (logs below).

main: build = 1970 (fe54033b)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for aarch64-linux-gnu
main: seed  = 1706278254
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /home/ubuntu/Llama_setup2/llama.cpp/models/open_llama_7b_v2/ggml-model-q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = models
llama_model_loader: - kv   2:                       llama.context_length u32              = 2048
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 3.56 GiB (4.54 BPW) 
llm_load_print_meta: general.name     = models
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.11 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors:        CPU buffer size =  3647.87 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:        CPU input buffer size   =     9.01 MiB
llama_new_context_with_model:        CPU compute buffer size =    70.50 MiB
llama_new_context_with_model: graph splits (measure): 1

system_info: n_threads = 32 / 64 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | 
perplexity: tokenizing the input ..
perplexity: tokenization took 900.955 ms
perplexity: calculating perplexity over 631 chunks, batch_size=512
perplexity: 7.33 seconds per pass - ETA 1 hours 17.13 minutes
[1]4.1032,[2]5.4874,[3]5.6442,[4]6.6715,[5]6.9956,[6]7.7168,[7]7.9770,[8]8.2112,[9]8.5768,[10]9.1575,[11]9.3705,[12]9.3990,[13]9.4507,[14]9.7147,[15]9.1657,[16]8.8407,[17]8.7235,[18]8.2421,[19]8.2558,[20]8.2294,[21]7.9563,[22]7.8904,[23]7.7807,[24]7.7538,[25]7.4997,[26]7.2244,[27]7.0789,[28]6.9440,[29]6.7604,[30]6.6747,[31]6.7384,[32]6.7248,[33]6.7785,[34]6.7540,[35]6.7986,[36]6.8183,[37]6.8815,[38]6.9273,[39]6.9946,[40]7.0747,[41]7.0236,[42]7.0082,[43]7.0074,[44]6.9949,[45]6.9526,[46]6.9727,[47]6.9945,[48]6.9584,[49]6.9354,[50]6.9027,[51]6.9698,[52]6.9567,[53]6.9238,[54]6.9435,[55]6.9240,[56]6.9518,[57]6.9825,[58]7.0175,[59]7.0246,[60]7.0663,[61]7.0442,[62]7.0578,[63]7.1145,[64]7.1249,[65]7.1514,[66]7.1562,[67]7.1910,[68]7.2162,[69]7.2741,[70]7.3215,[71]7.3510,[72]7.4025,[73]7.3806,[74]7.3902,[75]7.3960,[76]7.4032,[77]7.4474,[78]7.4432,[79]7.4237,[80]7.3848,[81]7.3635,[82]7.3412,[83]7.3489,[84]7.3387,[85]7.2972,[86]7.2985,[87]7.2710,[88]7.2900,[89]7.2943,[90]7.3048,[91]7.2981,[92]7.3305,[93]7.3305,[94]7.3335,[95]7.3396,[96]7.3359,[97]7.3320,[98]7.3648,[99]7.3568,[100]7.3901,[101]7.3927,[102]7.3801,[103]7.4031,[104]7.4078,[105]7.4543,[106]7.4526,[107]7.4419,[108]7.4607,[109]7.4853,[110]7.4808,[111]7.4870,[112]7.4718,[113]7.4579,[114]7.4493,[115]7.4683,[116]7.4946,[117]7.5383,[118]7.5648,[119]7.6029,[120]7.6159,[121]7.6040,[122]7.6389,[123]7.6839,[124]7.7099,[125]7.6883,[126]7.6881,[127]7.6820,[128]7.6528,[129]7.6438,[130]7.6585,[131]7.6591,[132]7.6294,[133]7.6085,[134]7.5938,[135]7.5891,[136]7.5789,[137]7.5416,[138]7.5354,[139]7.4974,[140]7.4568,[141]7.4387,[142]7.4235,[143]7.4360,[144]7.4351,[145]7.4325,[146]7.4289,[147]7.4188,[148]7.4001,[149]7.3725,[150]7.3761,[151]7.3870,[152]7.3840,[153]7.4066,[154]7.3987,[155]7.3879,[156]7.4036,[157]7.3720,[158]7.3461,[159]7.3217,[160]7.2828,[161]7.2529,[162]7.2098,[163]7.1840,[164]7.1678,[165]7.1470,[166]7.1169,[167]7.0952,[168]7.0761,[169]7.0389,[170]7.0143,[171]6.9884,[172]6.9570,[173]6.9364,[174]6.9197,[175]6.8969,[176]6.8683,[177]6.8567,[178]6.8285,[179]6.8183,[180]6.8086,[181]6.8142,[182]6.8016,[183]6.8299,[184]6.8357,[185]6.8718,[186]6.9027,[187]6.9188,[188]6.9579,[189]6.9821,[190]7.0071,[191]7.0434,[192]7.0803,[193]7.0901,[194]7.0872,[195]7.1040,[196]7.1219,[197]7.1303,[198]7.1482,[199]7.1560,[200]7.1537,[201]7.1699,[202]7.1730,[203]7.1764,[204]7.1915,[205]7.2027,[206]7.2156,[207]7.2254,[208]7.2157,[209]7.2321,[210]7.2450,[211]7.2660,[212]7.2665,[213]7.2718,[214]7.2704,[215]7.2665,[216]7.2483,[217]7.2451,[218]7.2663,[219]7.2736,[220]7.2816,[221]7.2844,[222]7.2784,[223]7.2903,[224]7.2724,[225]7.2611,[226]7.2416,[227]7.2260,[228]7.2205,[229]7.2080,[230]7.2024,[231]7.1855,[232]7.1826,[233]7.1667,[234]7.1582,[235]7.1444,[236]7.1266,[237]7.1062,[238]7.0988,[239]7.0855,[240]7.0735,[241]7.0700,[242]7.0581,[243]7.0506,[244]7.0376,[245]7.0274,[246]7.0108,[247]6.9926,[248]6.9791,[249]6.9636,[250]6.9492,[251]6.9430,[252]6.9403,[253]6.9366,[254]6.9269,[255]6.9284,[256]6.9274,[257]6.9192,[258]6.9185,[259]6.9282,[260]6.9301,[261]6.9414,[262]6.9450,[263]6.9440,[264]6.9464,[265]6.9530,[266]6.9570,[267]6.9733,[268]6.9849,[269]6.9904,[270]6.9970,[271]7.0107,[272]7.0173,[273]7.0342,[274]7.0422,[275]7.0498,[276]7.0686,[277]7.0753,[278]7.0843,[279]7.0702,[280]7.0508,[281]7.0329,[282]7.0120,[283]6.9987,[284]6.9938,[285]6.9890,[286]6.9889,[287]6.9841,[288]6.9802,[289]6.9701,[290]6.9593,[291]6.9506,[292]6.9460,[293]6.9368,[294]6.9294,[295]6.9238,[296]6.9088,[297]6.9094,[298]6.9000,[299]6.8975,[300]6.8930,[301]6.8898,[302]6.8873,[303]6.8820,[304]6.8686,[305]6.8609,[30
6]6.8399,[307]6.8132,[308]6.8267,[309]6.8373,[310]6.8400,[311]6.8323,[312]6.8261,[313]6.8267,[314]6.8375,[315]6.8413,[316]6.8422,[317]6.8441,[318]6.8470,[319]6.8590,[320]6.8647,[321]6.8784,[322]6.8750,[323]6.8631,[324]6.8587,[325]6.8544,[326]6.8510,[327]6.8463,[328]6.8454,[329]6.8578,[330]6.8596,[331]6.8607,[332]6.8638,[333]6.8642,[334]6.8609,[335]6.8621,[336]6.8661,[337]6.8709,[338]6.8695,[339]6.8710,[340]6.8706,[341]6.8605,[342]6.8591,[343]6.8697,[344]6.8720,[345]6.8641,[346]6.8680,[347]6.8643,[348]6.8755,[349]6.8722,[350]6.8784,[351]6.8812,[352]6.8941,[353]6.8962,[354]6.8954,[355]6.8997,[356]6.8923,[357]6.8883,[358]6.8902,[359]6.8890,[360]6.8971,[361]6.9001,[362]6.8923,[363]6.8951,[364]6.8880,[365]6.8847,[366]6.8882,[367]6.8792,[368]6.8729,[369]6.8640,[370]6.8567,[371]6.8605,[372]6.8593,[373]6.8562,[374]6.8534,[375]6.8466,[376]6.8398,[377]6.8310,[378]6.8236,[379]6.8163,[380]6.8114,[381]6.8115,[382]6.8091,[383]6.8114,[384]6.8191,[385]6.8270,[386]6.8272,[387]6.8201,[388]6.8247,[389]6.8246,[390]6.8295,[391]6.8239,[392]6.8239,[393]6.8280,[394]6.8300,[395]6.8449,[396]6.8566,[397]6.8741,[398]6.8888,[399]6.8976,[400]6.9077,[401]6.9207,[402]6.9355,[403]6.9375,[404]6.9430,[405]6.9565,[406]6.9640,[407]6.9628,[408]6.9719,[409]6.9852,[410]6.9972,[411]7.0061,[412]7.0117,[413]7.0219,[414]7.0292,[415]7.0388,[416]7.0520,[417]7.0619,[418]7.0602,[419]7.0587,[420]7.0611,[421]7.0769,[422]7.0887,[423]7.0909,[424]7.0987,[425]7.0939,[426]7.0935,[427]7.0973,[428]7.1007,[429]7.1017,[430]7.1045,[431]7.1085,[432]7.1170,[433]7.1216,[434]7.1176,[435]7.1107,[436]7.1070,[437]7.1026,[438]7.0980,[439]7.1008,[440]7.1007,[441]7.0999,[442]7.1018,[443]7.1068,[444]7.1156,[445]7.1176,[446]7.1223,[447]7.1222,[448]7.1205,[449]7.1120,[450]7.1180,[451]7.1184,[452]7.1229,[453]7.1232,[454]7.1212,[455]7.1287,[456]7.1281,[457]7.1300,[458]7.1342,[459]7.1383,[460]7.1326,[461]7.1322,[462]7.1462,[463]7.1470,[464]7.1558,[465]7.1546,[466]7.1532,[467]7.1560,[468]7.1517,[469]7.1488,[470]7.1505,[471]7.1402,[472]7.1374,[473]7.1430,[474]7.1390,[475]7.1329,[476]7.1331,[477]7.1345,[478]7.1299,[479]7.1256,[480]7.1245,[481]7.1191,[482]7.1147,[483]7.1104,[484]7.1071,[485]7.1068,[486]7.1005,[487]7.0990,[488]7.0992,[489]7.0991,[490]7.0914,[491]7.0893,[492]7.0883,[493]7.0864,[494]7.0888,[495]7.0977,[496]7.0997,[497]7.0982,[498]7.0969,[499]7.0988,[500]7.1051,[501]7.1064,[502]7.1094,[503]7.1130,[504]7.1159,[505]7.1220,[506]7.1268,[507]7.1245,[508]7.1280,[509]7.1224,[510]7.1241,[511]7.1168,[512]7.1143,[513]7.1157,[514]7.1119,[515]7.1059,[516]7.1022,[517]7.0962,[518]7.0976,[519]7.1093,[520]7.1136,[521]7.1106,[522]7.1127,[523]7.1186,[524]7.1214,[525]7.1190,[526]7.1218,[527]7.1162,[528]7.1084,[529]7.1066,[530]7.1029,[531]7.0993,[532]7.0969,[533]7.0925,[534]7.0884,[535]7.0853,[536]7.0848,[537]7.0858,[538]7.0885,[539]7.0852,[540]7.0828,[541]7.0809,[542]7.0751,[543]7.0796,[544]7.0830,[545]7.0811,[546]7.0813,[547]7.0799,[548]7.0782,[549]7.0798,[550]7.0755,[551]7.0772,[552]7.0761,[553]7.0710,[554]7.0692,[555]7.0679,[556]7.0631,[557]7.0611,[558]7.0594,[559]7.0522,[560]7.0483,[561]7.0493,[562]7.0483,[563]7.0460,[564]7.0379,[565]7.0367,[566]7.0356,[567]7.0332,[568]7.0402,[569]7.0352,[570]7.0347,[571]7.0321,[572]7.0325,[573]7.0302,[574]7.0339,[575]7.0306,[576]7.0294,[577]7.0335,[578]7.0335,[579]7.0314,[580]7.0408,[581]7.0466,[582]7.0460,[583]7.0508,[584]7.0573,[585]7.0489,[586]7.0425,[587]7.0461,[588]7.0464,[589]7.0481,[590]7.0478,[591]7.0449,[592]7.0347,[593]7.0356,[594]7.0330,[595]7.0246,[596]7.0178,[597]7.0084,[598]6.9970,[599]6.9934,[600]6.9969,[601]6.9999,[602]7
.0005,[603]6.9991,[604]7.0054,[605]7.0070,[606]7.0101,[607]7.0132,[608]7.0218,[609]7.0292,[610]7.0278,[611]7.0300,[612]7.0294,[613]7.0284,[614]7.0261,[615]7.0301,[616]7.0258,[617]7.0273,[618]7.0296,[619]7.0372,[620]7.0371,[621]7.0392,[622]7.0406,[623]7.0433,[624]7.0439,[625]7.0474,[626]7.0462,[627]7.0505,[628]7.0553,[629]7.0635,[630]7.0596,[631]7.0620,
Final estimate: PPL = 7.0620 +/- 0.04195

llama_print_timings:        load time =     831.57 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time = 4613485.32 ms / 323072 tokens (   14.28 ms per token,    70.03 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time = 4615632.34 ms / 323073 tokens

@snadampal (Contributor Author) commented Jan 26, 2024

For how many threads do you observe optimal performance with these kernels?

I haven't collected data for different thread configurations yet, but in general I see these GEMM kernels scale with the number of threads, though not linearly.
I'm curious what the context for your question is: is there some heuristic we need to populate with the optimal thread count for each kernel?

@ggerganov (Owner):

The expectation is that for prompt processing the speed should always increase as the number of threads increases, while for text generation there should be an optimal number of threads after which performance starts degrading. llama.cpp allows 2 different thread counts to be passed - one for batch size 1 and one for bs > 1:

llama.cpp/llama.h

Lines 222 to 223 in 6fea843

uint32_t n_threads; // number of threads to use for generation
uint32_t n_threads_batch; // number of threads to use for batch processing

These can be configured with the -t and -tb command-line arguments respectively.

I just now realized that the new kernels are used only for prompt processing, so that's fine.

ggml-quants.h Outdated
void ggml_vec_dot_q4_1_q8_1(int n, float * restrict s, const void ** restrict vx, const void ** restrict vy, const int nrc);
void ggml_vec_dot_q5_0_q8_0(int n, float * restrict s, const void ** restrict vx, const void ** restrict vy, const int nrc);
void ggml_vec_dot_q5_1_q8_1(int n, float * restrict s, const void ** restrict vx, const void ** restrict vy, const int nrc);
void ggml_vec_dot_q8_0_q8_0(int n, float * restrict s, const void ** restrict vx, const void ** restrict vy, const int nrc);
@ggerganov (Owner):

I'm not convinced this API is desirable - it requires preparing arrays of pointers, which seems quite cumbersome.

Normally, linear algebra libraries use an API that takes a pointer, a number of elements, and a stride (in bytes or in elements). So I'm thinking that we should probably switch to something like:

void ggml_vec_dot_q4_0_q8_0(int n, float * restrict s, const void * restrict vx, size_t bx, const void * restrict vy, size_t by, int nrc);

Note that I'm mostly thinking out loud - I'm not yet sure what the best way is.
It's a big change, so we have to consider the options to make this less intrusive.

@snadampal (Contributor Author):

I agree. As I mentioned earlier, I tried to fit it into the existing interface itself, but changed to arrays mainly to carry the stride. If it's better to add a few more args than arrays, how about we define a tensor attribute structure and pass it across instead of adding one argument per attribute? That way we can extend it in the future for any new functionality. For now the tensor object could just have the number of elements, the stride, and the format type.

@ggerganov (Owner):

Hm, adding a new tensor attribute structure would again introduce a lot of boilerplate around calling the dot functions. Adding extra arguments is better in this regard, because we already have the strides from struct ggml_tensor.

@ggerganov (Owner):

Currently, ggml stores the strides in number of bytes. So the numbers in ggml_tensor->nb are strides in bytes. The dot functions should also accept the row strides in bytes for consistency.

In the future, we will transition to storing the strides in number of elements: ggerganov/ggml#623. But this is not important for now.
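To illustrate the convention being discussed, here is a generic sketch (not ggml code; names are illustrative): a row stride expressed in bytes, like the values in ggml_tensor->nb, is applied through a char* offset from the base pointer:

#include <stdio.h>
#include <stddef.h>

// Generic illustration of byte strides: reach row `row` of a matrix by adding
// row * row_stride_bytes to a char* view of the base pointer, as ggml does with nb.
static float row_sum(const void *base, size_t row_stride_bytes, int row, int n) {
    const float *r = (const float *)((const char *)base + (size_t)row * row_stride_bytes);
    float s = 0.0f;
    for (int i = 0; i < n; ++i) s += r[i];
    return s;
}

int main(void) {
    float m[3][4] = {{1,2,3,4},{5,6,7,8},{9,10,11,12}};
    printf("%g\n", row_sum(m, sizeof m[0], 2, 4)); // row stride in bytes, like nb[1]; prints 42
    return 0;
}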

@snadampal (Contributor Author):

Hi @ggerganov, I have updated the PR. Please review and let me know if it can be improved further, especially around the stride calculations. I was able to use the ggml_tensor strides (nb) for the src0 and dst tensors, but I had to derive the src1_col stride following the logic used for the offset calculations.

ggml.h (outdated review comments, resolved)
ggml-quants.c (outdated review comments, resolved)
ggml-quants.c Outdated
Comment on lines 4123 to 4124
vst1_f32(s, vget_low_f32(sumv2));
vst1_f32(s + 16, vget_high_f32(sumv2));
@ggerganov (Owner):

I'm wondering if we should add a stride argument for s too. This 16 offset is very obscure, but on the other hand the function signature would become a bit overloaded.

It's probably better to add it.
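A small illustration of the trade-off (a generic sketch with hypothetical names, not the PR's kernels): with an explicit destination stride, the hard-coded "+ 16" offset moves out of the kernel and into the caller.

#include <stdio.h>
#include <stddef.h>

// Hypothetical sketch: store a 2x2 result tile into dst. Hard-coding "s + 16"
// bakes the destination row length into the kernel; passing an explicit row
// stride (here in elements) keeps the kernel independent of the dst layout.
static void store_tile_2x2(float *s, size_t bs, const float c[4]) {
    s[0]      = c[0];  s[1]      = c[1];   // first output row
    s[bs]     = c[2];  s[bs + 1] = c[3];   // second output row, one dst row below
}

int main(void) {
    float dst[2 * 16] = {0};
    const float c[4] = {1, 2, 3, 4};
    store_tile_2x2(dst, 16, c);             // 16 = dst row length in this example
    printf("%g %g / %g %g\n", dst[0], dst[1], dst[16], dst[17]);
    return 0;
}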

@snadampal (Contributor Author):

Hi @ggerganov, I have addressed all the comments.

@snadampal force-pushed the smmla_aarch64 branch 2 times, most recently from ff67775 to 4c840fd on February 2, 2024 21:38
@ggerganov (Owner):

I think this should be good to merge. I want to take some time to do some AWS Graviton tests first and confirm the results. If anyone else gives this a try, please post some feedback as well.

@ggerganov added the "need feedback" (Testing and feedback with results are needed) label on Feb 5, 2024
@snadampal (Contributor Author):

I think this should be good to merge. I want to take some time to do some AWS Graviton tests first and confirm the results. If anyone else gives this a try, please post some feedback as well.

@ggerganov, or anyone trying this PR: please make sure you use instances from the AWS Graviton3 family (c7g/m7g/r7g); Graviton2 doesn't support MMLA instructions.

@Dibakar (Contributor) commented Feb 5, 2024

@ggerganov I tried this PR on an AWS Graviton3 instance. I can confirm that I observed a similar speedup to the one mentioned by the author of this patch. Please find below the tokens/s numbers.
[tokens/s results table: "llama cpp with AWS mmla patches" - image not captured in this text export]

ggml.c (outdated review comments, resolved)
armv8.2-a and above supports MMLA instructions that have higher
throughput than DOT. this commit adds mmla kernel for
q8_0_q8_0 gemm. The feature is enabled if the platform supports
"__ARM_FEATURE_MATMUL_INT8"

On AWS Graviton3 processors this kernel resulted up to 1.5x
improvement for prompt evaluation throughput compared to the
default sdot kernel.
armv8.2-a and above supports MMLA instructions that have higher
throughput than DOT. this commit adds mmla kernel for
q4_0_q8_0 gemm. The feature is enabled if the platform supports
"__ARM_FEATURE_MATMUL_INT8"

On AWS Graviton3 processors this kernel resulted up to 1.5x
improvement for prompt evaluation throughput compared to the
default sdot kernel.
armv8.2-a and above supports MMLA instructions that have higher
throughput than DOT. this commit adds mmla kernel for
q4_1_q8_1 gemm. The feature is enabled if the platform supports
"__ARM_FEATURE_MATMUL_INT8"

On AWS Graviton3 processors this kernel resulted up to 1.5x
improvement for prompt evaluation throughput compared to the
default sdot kernel.
@ggerganov merged commit a07d0fe into ggerganov:master on Feb 11, 2024
49 of 53 checks passed
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Mar 13, 2024
* ggml: aarch64: implement smmla kernel for q8_0_q8_0 quantized gemm

armv8.2-a and above supports MMLA instructions that have higher
throughput than DOT. this commit adds mmla kernel for
q8_0_q8_0 gemm. The feature is enabled if the platform supports
"__ARM_FEATURE_MATMUL_INT8"

On AWS Graviton3 processors this kernel resulted up to 1.5x
improvement for prompt evaluation throughput compared to the
default sdot kernel.

* ggml: aarch64: implement smmla kernel for q4_0_q8_0 quantized gemm

armv8.2-a and above supports MMLA instructions that have higher
throughput than DOT. this commit adds mmla kernel for
q4_0_q8_0 gemm. The feature is enabled if the platform supports
"__ARM_FEATURE_MATMUL_INT8"

On AWS Graviton3 processors this kernel resulted up to 1.5x
improvement for prompt evaluation throughput compared to the
default sdot kernel.

* ggml: aarch64: implement smmla kernel for q4_1_q8_1 quantized gemm

armv8.2-a and above supports MMLA instructions that have higher
throughput than DOT. this commit adds mmla kernel for
q4_1_q8_1 gemm. The feature is enabled if the platform supports
"__ARM_FEATURE_MATMUL_INT8"

On AWS Graviton3 processors this kernel resulted up to 1.5x
improvement for prompt evaluation throughput compared to the
default sdot kernel.

* ggml: update unit tests for the new vec_dot interface

* llama.cpp: add MATMUL_INT8 capability to system_info
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Mar 13, 2024
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
Labels: high priority (Very important issue), need feedback (Testing and feedback with results are needed), performance (Speed related topics)
7 participants