Use thread pool #400
Conversation
Tagging @JohannesGaessler since I think he was also planning on doing something similar.
I'm now going to look into @ggerganov's PR. Back again: I need guidance on how to merge the different approaches.
- I did not yet test the code.
- If code is taken verbatim from an external source, I think we should document this.
- The code in question is licensed under MIT, so there should be no legal issues.
params.type = GGML_TASK_COMPUTE;
ggml_compute_forward(&params, node);
If I understand the code correctly, the main thread managing the other threads still does part of the computation itself. A design where the main thread queues work for all worker threads and then waits for them to be done might be better, since it would allow the computations to run in parallel with the administrative overhead (and maybe with computations from multiple tensors, but I think the overhaul by @slaren also does something like that). But this can be done in another PR.
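To make that concrete, here is a rough sketch of such a queue-ahead design, assuming pthreads, ignoring node dependencies, and using hypothetical names (`task_queue_t` etc.); it is not code from this PR:

```c
#include <pthread.h>

#define MAX_TASKS 1024  // assumes the whole graph fits; a real queue would grow or block

typedef struct {
    void (*fn)(void *);  // work for one node
    void * arg;
} task_t;

typedef struct {
    task_t          tasks[MAX_TASKS];
    int             head, tail;   // ring-buffer indices
    int             pending;      // queued but not yet finished
    pthread_mutex_t mutex;
    pthread_cond_t  work_avail;   // signalled when a task is pushed
    pthread_cond_t  all_done;     // signalled when pending drops to 0
} task_queue_t;

// main thread: queue one task per node, without waiting in between
static void queue_push(task_queue_t * q, void (*fn)(void *), void * arg) {
    pthread_mutex_lock(&q->mutex);
    q->tasks[q->tail++ % MAX_TASKS] = (task_t){ fn, arg };
    q->pending++;
    pthread_cond_signal(&q->work_avail);
    pthread_mutex_unlock(&q->mutex);
}

// main thread: block once, after the whole graph has been queued
static void queue_wait_all(task_queue_t * q) {
    pthread_mutex_lock(&q->mutex);
    while (q->pending > 0) {
        pthread_cond_wait(&q->all_done, &q->mutex);
    }
    pthread_mutex_unlock(&q->mutex);
}

// worker threads: pop tasks and run them outside the lock
static void * worker(void * data) {
    task_queue_t * q = data;
    for (;;) {
        pthread_mutex_lock(&q->mutex);
        while (q->head == q->tail) {
            pthread_cond_wait(&q->work_avail, &q->mutex);
        }
        task_t t = q->tasks[q->head++ % MAX_TASKS];
        pthread_mutex_unlock(&q->mutex);

        t.fn(t.arg);  // run the computation without holding the lock

        pthread_mutex_lock(&q->mutex);
        if (--q->pending == 0) {
            pthread_cond_signal(&q->all_done);
        }
        pthread_mutex_unlock(&q->mutex);
    }
    return NULL;
}
```

The point of the design is only that the main thread pays the queuing cost while workers are already computing; whether that wins in practice would still need benchmarking.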
The work scheduling is intentionally kept identical to current master to ease comparison (modulo errors in my understanding of it).
As far as I can tell, this PR does not change the granularity with which work is distributed: each thread still receives one even share of the work. With reduced threading overhead it may be a good idea to split the work into smaller chunks, since on some Intel architectures the cores run at uneven speeds and smaller chunks would allow the faster cores to work on more chunks than the slower cores. But this can be done in another PR.
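As an illustration only (hypothetical names, not code from this PR or master), finer-grained splitting could hand out small chunks from a shared atomic counter, so that faster cores naturally claim more chunks:

```c
#include <stdatomic.h>

#define CHUNK_SIZE 64   // rows per chunk; would likely need tuning per op

typedef struct {
    atomic_int next_chunk;  // index of the next unclaimed chunk
    int        n_rows;      // total rows of work for this node
} chunked_work_t;

// executed by every thread (including the main thread) for one node
static void compute_node_chunked(chunked_work_t * w,
                                 void (*compute_rows)(int i0, int i1)) {
    for (;;) {
        const int chunk = atomic_fetch_add(&w->next_chunk, 1);
        const int i0    = chunk * CHUNK_SIZE;
        if (i0 >= w->n_rows) {
            break;  // no chunks left: this thread is done with the node
        }
        const int i1 = (i0 + CHUNK_SIZE < w->n_rows) ? i0 + CHUNK_SIZE : w->n_rows;
        compute_rows(i0, i1);
    }
}
```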
The work chunks are already short enough as-is; depending on the OS, it is up to the scheduler which thread-rotation algorithm is used on a given device to even out execution time across threads on heterogeneous CPU cores running at different speeds.
If you do wish to implement a smaller work quantum, I highly recommend making it a parameterizable option, as the benefits and drawbacks will depend heavily on the OS/CPU target.
Absolutely. Assorted associations:
Maybe we should try to contact the author and ask what to do? Thanks for looking into this!
There are a couple more questions I'd like to discuss:
Please document this as well.
On my Linux system with a Ryzen 3700X, 3200 MHz dual-channel memory, and an RTX 3090, this PR has worse performance than master:
Edit: the numbers are also for
Thanks for testing! I changed the order of unlocking/signaling and retested with a bigger model for n = 4/6 and with Clang and ICX/MKL. Results for the compilers are similar, so I'm showing the Clang results.

Master, 4 threads / 6 threads:

PR, 4 threads / 6 threads:

I'm not sure how to proceed.
Have you tested with llama? If I understand correctly, this will cause the threads to wait on a mutex for each node in the graph. Under normal conditions, a spin lock should have lower acquire overhead than a mutex (excluding edge cases such as using more threads than are physically available). Since that overhead is paid for every node, it will scale with the number of nodes in the graph.
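For reference, a minimal sketch of the two synchronization styles being compared; this illustrates the general trade-off rather than the actual code in master or in this PR:

```c
#include <stdatomic.h>
#include <pthread.h>

static atomic_flag     spin = ATOMIC_FLAG_INIT;
static pthread_mutex_t mtx  = PTHREAD_MUTEX_INITIALIZER;

// spin lock: busy-waits in user space, so acquiring an uncontended or
// briefly-held lock is cheap, but waiting threads burn CPU cycles
static void spin_lock(void) {
    while (atomic_flag_test_and_set_explicit(&spin, memory_order_acquire)) {
        /* busy-wait */
    }
}
static void spin_unlock(void) {
    atomic_flag_clear_explicit(&spin, memory_order_release);
}

// the cost of whichever primitive is chosen is paid once per graph node
static void per_node_sync_spin(void) {
    spin_lock();
    /* ... hand out / pick up work for this node ... */
    spin_unlock();
}
static void per_node_sync_mutex(void) {
    pthread_mutex_lock(&mtx);   // may sleep in the kernel under contention
    /* ... hand out / pick up work for this node ... */
    pthread_mutex_unlock(&mtx);
}
```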
If the performance is not universally better, I think this should be added as a compile option so that both implementations remain available. To get a better grasp of which should be the default, I think we would need more data.
@ggerganov is the authority to ask regarding ggml architecture.
To get back to what I said earlier: with the current design the threads need to wait and be notified for each tensor. With a design that queues all work ahead of time, you could instead have the threads wait on a barrier (though I'm not sure whether that would be faster). Maybe we should wait for slaren's backend refactor? From my perspective, the biggest issue with threading is that you have to manually set the number of threads to 1 to get optimal performance with llama.cpp CUDA, which is an inconvenience for the user and suboptimal for performance if you offload only part of the network.
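To make the barrier idea concrete, here is a rough, hypothetical sketch assuming POSIX `pthread_barrier` (which is optional in POSIX and notably missing on macOS); it is not code from this PR or the refactor:

```c
#include <pthread.h>

static pthread_barrier_t barrier;  // initialized once with the thread count

static void * worker_with_barrier(void * arg) {
    (void) arg;
    for (;;) {
        // ... process this thread's pre-assigned slice of the queued work ...

        // synchronize once per batch: the last thread to arrive releases all
        // of them, instead of each tensor triggering a wake/sleep cycle
        pthread_barrier_wait(&barrier);
    }
    return NULL;
}

// during setup, e.g. in the thread pool constructor:
//   pthread_barrier_init(&barrier, NULL, n_threads);
```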
Keep in mind that the backend refactor is still going to take a while (several weeks), and I wouldn't want it to delay any possible improvements that could be made in the meantime. I understand that it is a bit of an uncomfortable situation, because the changes needed to fix that issue for CUDA may be irrelevant after the backends refactor, so ultimately it is up to you.
With the introduction of
Inspired by this discussion I tried to understand the work scheduling. The patch uses a plain `pthreads` emulation for Windows and a thread pool implementation from the same site, with what I hope are corrections. The patch has only been tested on Windows.
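For readers unfamiliar with the approach: a `pthreads` emulation for Windows typically maps the small subset of the pthreads API that a thread pool needs onto Win32 primitives. The following is only an illustrative sketch of that idea, not the emulation layer actually used by the patch:

```c
#ifdef _WIN32
#include <windows.h>

// minimal type mapping onto Win32 synchronization objects
typedef HANDLE             pthread_t;
typedef CRITICAL_SECTION   pthread_mutex_t;
typedef CONDITION_VARIABLE pthread_cond_t;

static int pthread_mutex_init(pthread_mutex_t * m, void * attr) {
    (void) attr;
    InitializeCriticalSection(m);
    return 0;
}
static int pthread_mutex_lock(pthread_mutex_t * m)   { EnterCriticalSection(m); return 0; }
static int pthread_mutex_unlock(pthread_mutex_t * m) { LeaveCriticalSection(m); return 0; }

static int pthread_cond_init(pthread_cond_t * c, void * attr) {
    (void) attr;
    InitializeConditionVariable(c);
    return 0;
}
static int pthread_cond_wait(pthread_cond_t * c, pthread_mutex_t * m) {
    SleepConditionVariableCS(c, m, INFINITE);
    return 0;
}
static int pthread_cond_signal(pthread_cond_t * c) {
    WakeConditionVariable(c);
    return 0;
}
// pthread_create/pthread_join would be wrapped around CreateThread/WaitForSingleObject
#endif // _WIN32
```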
When testing I noticed that `test_opt` seems to need quite a bit more run time than earlier. Is this a regression or due to a bug fix? (This was due to different test sizes.)
Here are the results of a quick check with
Before:
After:
Here are the numbers for
Before:
After:
Run on a Core i7 with 6 cores, built with clang-cl and without BLAS support.