
ggml: add new member in GGML's internal data structure #6815

Closed

Conversation

zhouwg (Contributor) commented Apr 21, 2024

Purpose

This PR serves multiple purposes:

(1) Borrow some ideas from Qualcomm's QNN (Qualcomm Neural Network, aka Qualcomm AI Engine Direct) SDK. It is a well-designed and concise SDK, and its API maps closely onto GGML.

This PR borrows the "rank" member from the QNN SDK, re-uses the existing code as much as possible, and does not bring side effects or complexity to the existing code (a short sketch of what "rank" means follows after this list).

(2) Borrow an idea from PyTorch: the user can specify whether a GGML op (such as mulmat) is accelerated by a specific backend.

This PR borrows the "use_hwaccel" member from PyTorch, re-uses the existing code as much as possible, and does not bring side effects or complexity to the existing code.

(3) Cover more scenarios in the upper-layer code (see the "Explanation" section). Using multiple backends simultaneously is not included, because that is a separate topic which ggml-backend-sched should/could cover.

(4) (The main purpose) Prepare for submitting Qualcomm's QNN (Qualcomm Neural Network, aka Qualcomm AI Engine Direct) backend upstream to the GGML community: PoC: Add Qualcomm mobile SoC native backend for GGML
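
To make the borrowed terminology concrete, here is a small illustrative sketch (not taken from this PR): in QNN terms, the "rank" of a tensor is simply its number of significant dimensions, which ggml can already derive from ne[] in the same way ggml_n_dims() does, so a "rank" member is cheap to keep consistent.

    // Illustration only (not part of this PR): derive the QNN-style "rank"
    // (number of significant dimensions) of a ggml tensor from ne[],
    // mirroring the logic of ggml_n_dims().
    static int32_t tensor_rank(const struct ggml_tensor * t) {
        for (int i = GGML_MAX_DIMS - 1; i >= 1; --i) {
            if (t->ne[i] > 1) {
                return i + 1;
            }
        }
        return 1;
    }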

Status

This PR has been verified/validated with whisper.cpp + the QNN backend on different Android phones (a Qualcomm SoC based low-end phone and a Qualcomm SoC based high-end phone).

Explanation

This PR is useful for several scenarios.

In fact, the "gpu_device" member in struct whisper_context_params is semantically similar to use_hwaccel. There are 2 * n combinations here: 2 (use_gpu: true/false) * n (gpu_device: 1..n). Note that a DSP backend is also treated as a "gpu_device" here, because we should re-use the existing code as much as possible and not bring side effects or complexity to the existing code:

    struct whisper_context_params {
        bool  use_gpu;
        int   gpu_device;  // CUDA device

        // [EXPERIMENTAL] Token-level timestamps with DTW
        bool dtw_token_timestamps;
        enum whisper_alignment_heads_preset dtw_aheads_preset;

        int dtw_n_top;
        struct whisper_aheads dtw_aheads;

        size_t dtw_mem_size; // TODO: remove
    };

So a special value of "gpu_device" could be interpreted as "no hardware acceleration", i.e. falling back to the original default GGML CPU backend (a hypothetical sketch of this mapping is shown below).
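
As an illustration of that idea (hypothetical code, not part of whisper.cpp; the sentinel constant is invented here), the mapping from a special gpu_device value to "no hardware acceleration" could look roughly like this:

    // Hypothetical sketch only: treat a sentinel gpu_device value as
    // "no hardware acceleration", i.e. fall back to the default GGML CPU backend.
    // WHISPER_DEVICE_CPU_ONLY is an invented constant for illustration.
    #define WHISPER_DEVICE_CPU_ONLY (-1)

    static bool params_use_hwaccel(const struct whisper_context_params * params) {
        return params->use_gpu && params->gpu_device != WHISPER_DEVICE_CPU_ONLY;
    }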

Accordingly, we can re-use the existing "backend" member in the core data structure together with a new "use_hwaccel" member in struct ggml_context. By the way, I personally think we should not remove the existing "backend" member from struct ggml_tensor, although there is a plan to remove it:

    // n-dimensional tensor
    struct ggml_tensor {
        enum ggml_type         type;
        enum ggml_backend_type backend;

        struct ggml_backend_buffer * buffer;

        int64_t ne[GGML_MAX_DIMS]; // number of elements
        size_t  nb[GGML_MAX_DIMS]; // stride in bytes:
                                   // nb[0] = ggml_type_size(type)
                                   // nb[1] = nb[0]   * (ne[0] / ggml_blck_size(type)) + padding
                                   // nb[i] = nb[i-1] * ne[i-1]

        // compute data
        enum ggml_op op;

        // op params - allocated as int32_t for alignment
        int32_t op_params[GGML_MAX_OP_PARAMS / sizeof(int32_t)];

        int32_t flags;

        struct ggml_tensor * grad;
        struct ggml_tensor * src[GGML_MAX_SRC];

        // performance
        int     perf_runs;
        int64_t perf_cycles;
        int64_t perf_time_us;

        struct ggml_tensor * view_src;
        size_t               view_offs;

        void * data;

        char name[GGML_MAX_NAME];

        void * extra; // extra things e.g. for ggml-cuda.cu

        int32_t rank;

        char padding[20];
    };

The reason is that there are scenarios (not including using multiple backends simultaneously, which is another topic) that "use_gpu & gpu_device" cannot cover.

I personally think these new members ("use_hwaccel" in struct ggml_context and "rank" in struct ggml_tensor) are not redundant, and they will NOT bring side effects to the existing code. I understand that we should not bring too much extra code into GGML's internal implementation and that we should keep GGML as compact/clean as possible, so this PR re-uses the existing code as much as possible:

https://github.com/zhouwg/whisper.cpp/blob/add_hwaccel_in_data_structure/ggml.c#L2995
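
Conceptually (a rough sketch of the idea, not the exact code behind the link above), tensor creation can tag new tensors based on the context flag, re-using the existing backend member so that no new code path is needed downstream:

    // Rough sketch of the idea (not verbatim from the linked ggml.c change):
    // when the context was created with use_hwaccel enabled, newly created
    // tensors are tagged with the existing GGML_BACKEND_TYPE_GPU value
    // instead of the default CPU backend.
    static void set_tensor_backend(struct ggml_context * ctx, struct ggml_tensor * tensor) {
        if (ctx->use_hwaccel) {
            tensor->backend = GGML_BACKEND_TYPE_GPU;
        } else {
            tensor->backend = GGML_BACKEND_TYPE_CPU;
        }
    }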

PR approval request

This PR is the same as ggerganov/whisper.cpp#2073; it is also opened here because llama.cpp is the main playground of ggml.

slaren (Collaborator) commented Apr 21, 2024

Hi @zhouwg. This is the scenario that ggml-backend was designed to handle. We want to decouple the implementation of the different backends, and we want to avoid adding hooks into the CPU backend to support other hardware accelerated backends, because when going that route it quickly turns into an unmaintainable mess. That's what we used to do before ggml-backend was implemented, and ggml_tensor::backend is a remnant from those times that will be removed eventually.

ggml_backend_sched allows specifying a backend for each operation, and this is how we are able to use the CPU and GPU simultaneously in llama.cpp, as well as use multiple GPUs. In principle, any combination of backends is possible. I think this is what you want to do here. Please take the time to familiarize yourself with ggml-backend and consider implementing support for QNN using the ggml-backend interface.
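
For reference, a minimal sketch of the flow described above, assuming a hypothetical hardware-accelerated backend handle; exact function names and signatures vary slightly across ggml revisions:

    // Minimal sketch of using ggml_backend_sched with a hardware backend plus
    // the CPU backend as fallback. "accel" is a hypothetical accelerated
    // backend (e.g. a future QNN backend); "graph" is an already-built graph.
    #include "ggml.h"
    #include "ggml-backend.h"

    static void compute_with_sched(ggml_backend_t accel, struct ggml_cgraph * graph) {
        ggml_backend_t backends[2] = { accel, ggml_backend_cpu_init() };
        ggml_backend_buffer_type_t bufts[2] = {
            ggml_backend_get_default_buffer_type(backends[0]),
            ggml_backend_get_default_buffer_type(backends[1]),
        };

        // recent revisions take a trailing bool for parallel compute
        ggml_backend_sched_t sched = ggml_backend_sched_new(backends, bufts, 2, GGML_DEFAULT_GRAPH_SIZE, false);

        // a specific op can also be pinned to a specific backend, e.g. via
        // ggml_backend_sched_set_tensor_backend(sched, node, backends[0]);
        // (named ggml_backend_sched_set_node_backend in older revisions)

        ggml_backend_sched_graph_compute(sched, graph); // unsupported ops fall back to the CPU backend

        ggml_backend_sched_free(sched);
        ggml_backend_free(backends[1]);
    }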

zhouwg force-pushed the add_new_member_in_data_structure branch 2 times, most recently from 6496143 to 9cba545 on April 22, 2024 00:05
zhouwg (Contributor, Author) commented Apr 22, 2024

Hi @zhouwg. This is the scenario that ggml-backend was designed to handle. We want to decouple the implementation of the different backends, and we want to avoid adding hooks into the CPU backend to support other hardware accelerated backends, because when going that route it quickly turns into an unmaintainable mess. That's what we used to do before ggml-backend was implemented, and ggml_tensor::backend is a remnant from those times that will be removed eventually.

ggml_backend_sched allows specifying a backend for each operation, and this is how we are able to use the CPU and GPU simultaneously in llama.cpp, as well as use multiple GPUs. In principle, any combination of backends is possible. I think this is what you want to do here. Please take the time to familiarize yourself with ggml-backend and consider implementing support for QNN using the ggml-backend interface.

Thanks for your comment. I'll try to familiarize myself with ggml-backend-sched as soon as possible (I tried ggml-backend-sched yesterday, and I think it brings more complexity into ggml's internals, although it is a good idea). The QNN backend for GGML (started on 03-29-2024) is already ready (validated with whisper.cpp on a Qualcomm low-end phone and a Qualcomm high-end phone) and is prepared to be submitted upstream.

This PR does not intend to cover the scenario that ggml-backend-sched was designed to handle (for example, using multiple backends simultaneously). ggml-backend-sched cannot cover some special scenarios mentioned in this PR, and ggml_tensor::backend should not be removed in the future.

For example: the GGML API user (upper-layer code) just wants to test a specific op using a specific backend (this is very easy to do with PyTorch) and does not want to call the more involved ggml-backend-sched API. This is the source code for this scenario: https://github.com/zhouwg/kantv/blob/kantv-poc-with-qnn/core/ggml/jni/ggml-jni-impl-external.cpp#L8250 (a rough sketch of this kind of single-op test is shown below).
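
The sketch below is hypothetical and not taken from the linked kantv code: it shows how a single mulmat could be run on one chosen backend with the plain ggml-backend API, without involving ggml-backend-sched. The tensor sizes and the input-filling step are placeholders.

    // Hypothetical sketch: compute one mulmat on a single chosen backend.
    #include "ggml.h"
    #include "ggml-alloc.h"
    #include "ggml-backend.h"

    static void test_mulmat_on_backend(ggml_backend_t backend) {
        struct ggml_init_params params = {
            /*.mem_size   =*/ ggml_tensor_overhead() * 8 + ggml_graph_overhead(),
            /*.mem_buffer =*/ NULL,
            /*.no_alloc   =*/ true, // tensor data will live in the backend buffer
        };
        struct ggml_context * ctx = ggml_init(params);

        struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 64);
        struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 64);
        struct ggml_tensor * c = ggml_mul_mat(ctx, a, b);

        struct ggml_cgraph * gf = ggml_new_graph(ctx);
        ggml_build_forward_expand(gf, c);

        // allocate all tensors in the chosen backend and compute there
        ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors(ctx, backend);
        // ... fill a and b via ggml_backend_tensor_set() ...
        ggml_backend_graph_compute(backend, gf);

        ggml_backend_buffer_free(buf);
        ggml_free(ctx);
    }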

@slaren I'm sorry to interrupt you, but could you take a moment and look at this again? Thanks so much. I have spent a lot of time on GGML's internals (including ggml-backend), and I also think ggml_backend_sched is an excellent idea with an excellent implementation, although it cannot cover the scenario above. The scenario/code above has been validated on a Qualcomm low-end phone and a Qualcomm high-end phone and works well, so I submitted this PR accordingly. By the way, I recommend you spend a little time building the kantv apk and running it on a Qualcomm SoC based Android phone; then you might change your mind. If you have any questions about this PR, please feel free to contact me. Thanks.

zhouwg force-pushed the add_new_member_in_data_structure branch 10 times, most recently from b0c3a3c to 9cba545 on April 22, 2024 12:13
zhouwg closed this on Apr 24, 2024