ggml: add new member in GGML's internal data structure #6815
Purpose
This PR serves multiple purposes:
(1) Borrow an idea from Qualcomm's QNN (Qualcomm Neural Network, aka Qualcomm AI Engine Direct) SDK, a well-designed and concise SDK whose API maps closely to GGML.
This PR borrows the "rank" member from the QNN SDK, re-uses the existing code as much as possible, and does not add side effects or complexity to the existing code.
(2) Borrow an idea from PyTorch: the user can specify whether a GGML op (such as mulmat) is accelerated by a specific backend.
This PR borrows the "use_hwaccel" member from PyTorch, re-uses the existing code as much as possible, and does not add side effects or complexity to the existing code (a sketch of both proposed additions follows this list).
(3) Cover more scenarios from upper-layer code (see the "Explanation" section). Using multiple backends simultaneously is not included, because that is a different topic/scenario which ggml-backend-sched should/could cover.
(4) (The main purpose) Prepare for submitting Qualcomm's QNN (Qualcomm Neural Network, aka Qualcomm AI Engine Direct) backend to the upstream GGML community: PoC: Add Qualcomm mobile SoC native backend for GGML.
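The following is a minimal sketch of the two proposed additions based on the description above; the struct names carry a `_sketch` suffix and the field types are illustrative assumptions, not the final diff:

```c
// Sketch of the two proposed members; existing members are elided for brevity
// and the field types are assumptions for illustration only.
#include <stdbool.h>
#include <stdint.h>

struct ggml_context_sketch {      // stands in for struct ggml_context
    /* ... existing members (mem_size, mem_buffer, no_alloc, ...) ... */
    bool use_hwaccel;             // proposed: request hardware acceleration
                                  // for tensors/ops created in this context
                                  // (PyTorch-style opt-in)
};

struct ggml_tensor_sketch {       // stands in for struct ggml_tensor
    /* ... existing members (type, backend, ne[], nb[], op, ...) ... */
    int32_t rank;                 // proposed: borrowed from the QNN SDK,
                                  // where a tensor's rank is its number of
                                  // dimensions
};
```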
Status
This PR has been verified/validated with whisper.cpp + the QNN backend on different Android phones (a Qualcomm SoC based low-end phone and a Qualcomm SoC based high-end phone).
Explanation
This PR is useful for several scenarios.
In fact, the "gpu_device" member in struct whisper_context_params is semantically similar to use_hwaccel. There are 2 * n combinations here: 2 (use_gpu: true/false) * n (backend device: 1..n). Note that a DSP backend is also treated as a "gpu_device", because we should re-use the existing code as much as possible and not add side effects or complexity to it.
So a special value of "gpu_device" can be interpreted as "no hardware acceleration", i.e. falling back to the original default GGML CPU backend; a sketch of this mapping is shown below.
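The following is an illustrative sketch of that mapping. The helper name, the enum, and the reserved device value are hypothetical and not part of the existing whisper.cpp API:

```c
// Hypothetical helper, for illustration only: maps whisper_context_params'
// (use_gpu, gpu_device) pair onto a backend choice, assuming one reserved
// "special" gpu_device value means "no hardware acceleration / CPU backend".
#include <stdbool.h>

enum example_backend {
    EXAMPLE_BACKEND_CPU   = 0,  // original default GGML CPU backend
    EXAMPLE_BACKEND_ACCEL = 1,  // GPU/DSP/NPU backend selected by gpu_device
};

#define EXAMPLE_GPU_DEVICE_CPU_FALLBACK (-1)  // assumed "special" value

static enum example_backend example_select_backend(bool use_gpu, int gpu_device) {
    if (!use_gpu || gpu_device == EXAMPLE_GPU_DEVICE_CPU_FALLBACK) {
        return EXAMPLE_BACKEND_CPU;   // no hardware acceleration requested
    }
    return EXAMPLE_BACKEND_ACCEL;     // accelerate on device gpu_device
}

int main(void) {
    // example: use_gpu=true but the reserved device id -> CPU fallback
    enum example_backend b =
        example_select_backend(true, EXAMPLE_GPU_DEVICE_CPU_FALLBACK);
    return b == EXAMPLE_BACKEND_CPU ? 0 : 1;
}
```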
Accordingly, we can re-use the existing "backend" member in the core data structure together with the new "use_hwaccel" member in struct ggml_context. By the way, I personally think we should not remove the existing "backend" member from struct ggml_tensor, although there is a plan to remove it:
The reason is that there are scenarios (not including using multiple backends simultaneously, which is a different topic/scenario) that "use_gpu & gpu_device" alone cannot cover.
I personally think these new members ("use_hwaccel" in struct ggml_context and "rank" in struct ggml_tensor) are not redundant, and they will NOT bring side effects to the existing code. I understand we should not add too much extra code to GGML's internals and should keep GGML as compact/clean as possible, so this PR re-uses existing code as much as possible (see the link below and the sketch after it):
https://github.com/zhouwg/whisper.cpp/blob/add_hwaccel_in_data_structure/ggml.c#L2995
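As an illustration of how the existing "backend" member can be re-used, here is a hedged, self-contained sketch (simplified stand-in types, not the actual diff at the link above) of tagging a newly created tensor when the context requests hardware acceleration; the stand-in names and enum values are assumptions:

```c
// Standalone illustration: how the proposed ctx->use_hwaccel flag could drive
// the existing per-tensor "backend" member at tensor-creation time.
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

enum example_backend_type {            // stands in for ggml's backend enum
    EXAMPLE_BACKEND_TYPE_CPU = 0,
    EXAMPLE_BACKEND_TYPE_GPU = 10,     // GPU/DSP/NPU accelerated backend
};

struct example_context {               // stands in for struct ggml_context
    bool use_hwaccel;                  // proposed new member
};

struct example_tensor {                // stands in for struct ggml_tensor
    enum example_backend_type backend; // existing member being re-used
    int32_t rank;                      // proposed new member
};

// stands in for the common tensor-creation path (e.g. ggml_new_tensor_impl)
static void example_new_tensor(const struct example_context * ctx,
                               struct example_tensor * t, int32_t rank) {
    t->rank    = rank;
    t->backend = ctx->use_hwaccel ? EXAMPLE_BACKEND_TYPE_GPU
                                  : EXAMPLE_BACKEND_TYPE_CPU;
}

int main(void) {
    struct example_context ctx = { .use_hwaccel = true };
    struct example_tensor  t;
    example_new_tensor(&ctx, &t, /*rank=*/2);
    printf("backend=%d rank=%d\n", (int) t.backend, (int) t.rank);
    return 0;
}
```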
PR approval request
This PR is the same as ggerganov/whisper.cpp#2073; it is submitted here as well because llama.cpp is the main playground of ggml.