ggml-backend: refine backend subsystem for CPU&GPU / CPU&NPU mixed inference more easily for a specified GGML backend #7641
Conversation
This is not correct.
It works fine with whisper.cpp and llama.cpp using the QNN backend and various test cases in my local dev environments. Could you help point out the reason? Thanks.
There are too many things wrong here to list. At the most basic level, this approach will not work because backends typically have memory that is not accessible from other backends, and when switching to a different backend it is necessary to ensure that all the tensors required to evaluate the graph are available in that backend's memory. This is the main job of the backend scheduler. Please wait until #6210 is complete.
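For context, a rough sketch of how a compute graph is typically run through the existing backend scheduler. This is only an illustration: exact signatures may differ between ggml versions, and `example_qnn_backend_init` is a hypothetical placeholder for whatever specialized (GPU/NPU) backend is in use.

```c
// Rough sketch, not authoritative: assumes the ggml-backend.h API of this era.
// example_qnn_backend_init() is a hypothetical placeholder.
#include "ggml.h"
#include "ggml-backend.h"

extern ggml_backend_t example_qnn_backend_init(void); // hypothetical

void example_compute_with_sched(struct ggml_cgraph * graph) {
    // the CPU backend is conventionally placed last and serves as the
    // fallback for ops the other backends do not support
    ggml_backend_t backends[2] = {
        example_qnn_backend_init(), // hypothetical GPU/NPU backend
        ggml_backend_cpu_init(),    // CPU fallback
    };

    // NULL buffer types -> use each backend's default buffer type
    ggml_backend_sched_t sched =
        ggml_backend_sched_new(backends, NULL, 2, GGML_DEFAULT_GRAPH_SIZE, false);

    // the scheduler splits the graph per backend, allocates buffers, and copies
    // tensors between backend memories as needed, then runs each split
    ggml_backend_sched_graph_compute(sched, graph);

    ggml_backend_sched_free(sched);
}
```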
This PR has no side effects on the existing code and works very well with whisper.cpp and llama.cpp using the QNN backend (I expect other new backends would also work fine with whisper.cpp and llama.cpp if they follow the style in this PR). I have considered your concern carefully: the other/existing backends still keep their original behavior.
Could you help reopen this PR so other developers can participate in the debate? Let community developers decide whether this PR should be accepted. Thanks so much.
This PR was NOT closed by me. There is a clearer PR (with more code comments explaining how to do mixed inference between Qualcomm's CPU & GPU / CPU & NPU): I submitted that new PR because I can't update this one (submit a new commit to it) in this loop (I don't know why).
Purpose
This PR intends to refine the ggml backend subsystem to enable mixed inference between CPU & GPU / CPU & NPU more easily.
There is already a "Backend Scheduler" feature in the ggml backend subsystem, but the "Backend Scheduler" is too complex, not a straightforward approach, and some of the backend APIs do not make sense:
For example, ggml_backend_supports_op is only called/used in https://github.com/ggerganov/llama.cpp/blob/master/tests/test-backend-ops.cpp#L406;
For example, ggml_backend_offload_op is not reasonable.
All in all, a specialized backend doesn't need to implement every GGML OP, and many of them can fall back to the default GGML backend (this is a long-standing problem in the ggml backend subsystem; see the sketch after this list):
The overall framework of the existing ggml backend subsystem is really excellent, but part of the subsystem seems too strict for a specialized backend;
GPU/NPU computing might be slower than CPU computing in some scenarios once we take into account data copy/preparation between CPU and GPU or CPU and NPU, memory size, and KV cache size.
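To make the intended fallback concrete, a minimal sketch of the per-op dispatch idea described above. This is NOT the actual code of this PR; all `example_*` names are hypothetical placeholders, and a real backend would perform the actual computation where the comments indicate.

```c
// Minimal sketch of per-op fallback (not this PR's code); example_* names are placeholders.
#include "ggml.h"
#include <stdbool.h>

// capability check: can the specialized backend (e.g. QNN) handle this op?
static bool example_backend_can_handle_op(const struct ggml_tensor * node) {
    switch (node->op) {
        case GGML_OP_ADD:
        case GGML_OP_MUL_MAT:
            return true;   // ops the specialized backend implements
        default:
            return false;  // everything else falls back to the default CPU path
    }
}

// dispatch loop: run supported nodes on the specialized backend and let the
// default ggml CPU kernels handle the rest
static void example_graph_compute_with_fallback(struct ggml_cgraph * cgraph) {
    for (int i = 0; i < cgraph->n_nodes; i++) {
        struct ggml_tensor * node = cgraph->nodes[i];
        if (example_backend_can_handle_op(node)) {
            // example_backend_compute_op(node); // offload to GPU/NPU
        } else {
            // example_cpu_compute_op(node);     // fall back to default CPU kernels
        }
    }
}
```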
Pros
This PR is less than one hundred LoC on top of the existing ggml backend subsystem and has NO side effects on existing code.
This PR follows the existing OO principles in ggml.c & ggml-backend.c.
This PR works well with whisper.cpp and llama.cpp using the QNN backend, as expected, on the local dev side.
The GGML QNN backend and many other GGML backends might benefit greatly from this PR.
It's simple, straightforward, and easy to understand.
Cons
A static function in ggml.c is changed to a global function and referenced in this PR. This is not ideal, but the cost might be acceptable. A workaround for this problem is to merge the entire ggml-backend.c into ggml.c and ggml-backend.h into ggml.h accordingly.
Todo
A more sophisticated algorithm for mixed inference between CPU/GPU or CPU/NPU; this PR is a simple, concise, and straightforward implementation to address a long-standing problem in the ggml backend subsystem.