Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[soft max] capping the num tasks to 4 is limiting the prompt eval perf #5103

Closed
snadampal opened this issue Jan 24, 2024 · 5 comments
Closed

Comments

@snadampal
Copy link
Contributor

snadampal commented Jan 24, 2024

Please include information about your system, the steps to reproduce the bug, and the version of llama.cpp that you are using. If possible, please provide a minimal code example that reproduces the bug.

System: AWS Graviton3, c7g.16xl instance with Ubuntu 22.04
llama.cpp version: latest, commit: 6f9939d

the following commit is capping the num of tasks to 4. I would like to understand why 4?

commit adf3de4f69ff7e44131222f05f9c7447ac0be3cb (HEAD, tag: b1605)
Author: Georgi Gerganov <[email protected]>
Date:   Sun Dec 3 15:56:22 2023 +0200

    ggml : fix soft max out-of-bounds access (#4307)

    ggml-ci

Without 4 and just using the src num rows or n_threads for n_tasks, the prompt eval performance is improved by 4% for DOT kernels and 9% for MMLA kernels (PR)
n_tasks = MIN(n_threads, ggml_nrows(node->src[0]));

Reproducer:
./main -m /llama.cpp/models/open_llama_13b/ggml-model-q8_0.gguf -c 1015 -n 256 -t 64 --file <input_file.txt>

@snadampal snadampal changed the title [soft max] capping the num tasks to 4 is regressing the prompt eval perf by 9% [soft max] capping the num tasks to 4 is limiting the prompt eval perf Jan 24, 2024
@ggerganov
Copy link
Owner

Not really sure where the limit of 4 came from - probably I did some measurements and decided that there is no reason to use larger number of tasks. But it's likely something to be revisited and updated

@snadampal
Copy link
Contributor Author

thanks. since it was done to take care of out of bounds issue, could you please share some details about the original issue? so that we can try to arrive at some formula to take care of systems with different number of threads.

@ggerganov
Copy link
Owner

Back then, we had 2 different functions that specified the number of tasks for each op and they had to be kept in sync. If they were not in sync (i.e. return 2 different values), it could cause out-of-bounds access due to small wdata array

Now, we have only one function: ggml_get_n_tasks(), so the original issue is no longer relevant and in theory it should work with any n_task smaller than the number of rows

@snadampal
Copy link
Contributor Author

great! I have raised this PR to update it.

@snadampal
Copy link
Contributor Author

the PR is merged, closing the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants