
ggml : fix unnecessary f32 -> f16 -> f32 casts (mmla) #5951

Merged 1 commit into master on Mar 9, 2024

Conversation

ggerganov (Member)

ref #4966

The struct block_q8_1 on the CPU uses float instead of ggml_fp16_t:

#define QK8_1 32
typedef struct {
    float d;               // delta
    float s;               // d * sum(qs[i])
    int8_t  qs[QK8_1];     // quants
} block_q8_1;
static_assert(sizeof(block_q8_1) == 2*sizeof(float) + QK8_1, "wrong q8_1 block size/padding");
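For context, here is a minimal standalone sketch (not this PR's actual diff) of the kind of f32 -> f16 -> f32 round trip the title refers to: since block_q8_1.d is already a float on the CPU, pushing it through an f16 cast only costs precision and work. The _Float16 type and the variable names below are illustrative assumptions (requires GCC or Clang with _Float16 support, e.g. on AArch64):

#include <stdio.h>

int main(void) {
    // Scale as stored in block_q8_1 on the CPU: already f32.
    float d = 0.123456789f;

    // Redundant round trip: f32 -> f16 -> f32 (the pattern being removed).
    _Float16 h = (_Float16) d;
    float round_tripped = (float) h;

    // Direct use of the f32 value, no conversion needed.
    float direct = d;

    printf("direct        = %.9f\n", direct);
    printf("round-tripped = %.9f\n", round_tripped);   // precision lost in the f16 hop
    return 0;
}

Compiled on an AArch64 host, the round-tripped value prints with only about three correct decimal digits, which is the loss avoided by keeping the value in f32 throughout.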

ggerganov (Member, Author)

@snadampal I haven't tested this change - please give it a try just in case

snadampal (Contributor)

Hi @ggerganov, LGTM. I have tested it on AWS Graviton3-based c7g instances.

ggerganov merged commit 8380ecf into master on Mar 9, 2024
62 checks passed
ggerganov deleted the gg/fix-mmla-q4_1-q8_1 branch on March 9, 2024 at 15:36
hazelnutcloud pushed a commit to hazelnutcloud/llama.cpp that referenced this pull request Mar 10, 2024
NeoZhangJianyu pushed a commit to NeoZhangJianyu/llama.cpp that referenced this pull request Mar 12, 2024
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Mar 13, 2024
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024