[soft max] capping the num tasks to 4 is limiting the prompt eval perf #5103
Comments
Not really sure where the limit of 4 came from. I probably did some measurements and decided there was no reason to use a larger number of tasks. But it's likely something to be revisited and updated.
Thanks. Since it was done to take care of an out-of-bounds issue, could you please share some details about the original issue, so that we can try to arrive at a formula that handles systems with different numbers of threads?
Back then, we had 2 different functions that specified the number of tasks for each op and they had to be kept in sync. If they were not in sync (i.e. returned 2 different values), it could cause out-of-bounds access due to small Now, we have only one function:
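To make the failure mode concrete, here is a rough sketch of how two out-of-sync task counts can lead to an out-of-bounds access. It assumes the problem was a per-task scratch buffer sized with one task count and indexed with another; the thread does not spell this out, and `scratch_size` and `slice_in_bounds` are hypothetical helpers, not ggml functions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical: bytes of scratch space reserved at planning time,
 * using the task count reported by the "planning" function. */
static size_t scratch_size(int n_tasks_plan, size_t bytes_per_task) {
    return (size_t) n_tasks_plan * bytes_per_task;
}

/* Hypothetical: does task `ith` (0-based) stay inside the buffer when it
 * writes its slice [ith * bytes_per_task, (ith + 1) * bytes_per_task)? */
static bool slice_in_bounds(int ith, size_t bytes_per_task, size_t total) {
    return ((size_t) ith + 1) * bytes_per_task <= total;
}
```

If planning reserves space for 4 tasks but execution launches more, any task with index 4 or higher falls outside the buffer. With a single function supplying the count to both sides, that mismatch cannot occur.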
Great! I have raised this PR to update it.
The PR is merged, closing the issue.
Please include information about your system, the steps to reproduce the bug, and the version of llama.cpp that you are using. If possible, please provide a minimal code example that reproduces the bug.
System: AWS Graviton3, c7g.16xl instance with Ubuntu 22.04
llama.cpp version: latest, commit: 6f9939d
The following commit caps the number of tasks at 4. I would like to understand why 4?
Without the cap of 4, using just the number of src rows or n_threads for n_tasks improves prompt eval performance by 4% for DOT kernels and 9% for MMLA kernels (PR):
n_tasks = MIN(n_threads, ggml_nrows(node->src[0]));
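For comparison, a minimal sketch of the two variants side by side. It assumes the capped form is the minimum of 4, `n_threads`, and the row count; `n_tasks_capped` and `n_tasks_uncapped` are hypothetical names for illustration, and the row count is passed in directly instead of calling `ggml_nrows`.

```c
#include <assert.h>
#include <stdint.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Hypothetical: task count with the cap of 4 (assumed form of the old code). */
static int n_tasks_capped(int n_threads, int64_t nrows) {
    assert(n_threads > 0 && nrows > 0);
    return (int) MIN(MIN((int64_t) 4, (int64_t) n_threads), nrows);
}

/* Hypothetical: task count without the cap, as proposed in this issue. */
static int n_tasks_uncapped(int n_threads, int64_t nrows) {
    assert(n_threads > 0 && nrows > 0);
    return (int) MIN((int64_t) n_threads, nrows);
}
```

With `-t 64` and a tensor that has at least 64 rows, the capped variant gives 4 tasks and the uncapped one gives 64, so on a 64-core machine most threads have no work for this op under the cap.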
Reproducer:
./main -m /llama.cpp/models/open_llama_13b/ggml-model-q8_0.gguf -c 1015 -n 256 -t 64 --file <input_file.txt>