Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ggml-cpu: fix ggml_graph_compute_thread did not terminate on abort. #1065

Merged
merged 1 commit into from
Jan 17, 2025

Conversation

issixx
Copy link
Contributor

@issixx issixx commented Jan 10, 2025

some threads kept looping and failed to terminate properly after an abort during CPU execution.

The thread behavior causing the issue is as follows:

  1. All threads synchronize at ggml_barrier.
  2. Thread 0 enters the next loop, sets the abort flag, and waits at ggml_barrier.
  3. Other threads, starting later, attempt to enter the next loop but exit due to the abort flag.
  4. Only thread 0 remains waiting indefinitely at ggml_barrier.

This change resolves the issue by ensuring the loop counters for termination are the same.

some threads kept looping and failed to terminate properly after an abort during CPU execution.
@issixx issixx changed the title ggml-cpu: fixe ggml_graph_compute_thread did not terminate on abort. ggml-cpu: fix ggml_graph_compute_thread did not terminate on abort. Jan 10, 2025
@slaren
Copy link
Collaborator

slaren commented Jan 17, 2025

I am not sure if I understand correctly the problem. The ggml_barrier happens after the thread 0 sets the abort flag, and this ggml_barrier should also force a memory barrier, so all the threads should have the correct value of abort by the time the next iteration starts.

Is it possible that this happened at the start of the next graph compute instead?

@issixx
Copy link
Contributor Author

issixx commented Jan 17, 2025


|-------------------------------------------|
|             No problem case               |
|-------------------------------------------|
|      Thread 0       |      Thread 1,2...  |
|---------------------|---------------------|
| LoopCnt 0 EnterLoop | LoopCnt 0 EnterLoop |
| LoopCnt 0 Barrier   | LoopCnt 0 Barrier   |
| LoopCnt 0 NextLoop  | LoopCnt 0 NextLoop  | Thread 0 and 1,2... is wakeup.
| LoopCnt 1 EnterLoop | LoopCnt 1 EnterLoop |
| LoopCnt 1 Abort     |                     |
| LoopCnt 1 Barrier   | LoopCnt 1 Barrier   |
| LoopCnt 1 NextLoop  | LoopCnt 1 NextLoop  |
| LoopCnt 2 BreakLoop | LoopCnt 2 BreakLoop | Thread 0 and 1,2... is abort.
|-------------------------------------------|


|-------------------------------------------|
|              Problem case                 |
|-------------------------------------------|
|      Thread 0       |      Thread 1,2...  |
|---------------------|---------------------|
| LoopCnt 0 EnterLoop | LoopCnt 0 EnterLoop |
| LoopCnt 0 Barrier   | LoopCnt 0 Barrier   |
| LoopCnt 0 NextLoop  | (Still sleeping)    | Only thread 0 is wakeup
| LoopCnt 1 EnterLoop | (Still sleeping)    | And next loop enter
| LoopCnt 1 Abort     | (Still sleeping)    | And abort.
| LoopCnt 1 Barrier   | (Still sleeping)    |
| (Wait forever)      | LoopCnt 0 NextLoop  | Thread 1,2... finally wakeup.
|                     | LoopCnt 1 BreakLoop | Abort flag is already set, so LoopCnt 1 is not entered.
|-------------------------------------------|

I can't confirm if the thread is really running in this execution order, but the issue that almost certainly occurred in my environment was resolved with this commit.
With this fix, Thread1, 2... also enter Loop Cnt 1 and execute the final Barrier similarly Thread 0.

@slaren
Copy link
Collaborator

slaren commented Jan 17, 2025

I see the problem now, thanks for the explanation (and the fix).

@slaren slaren merged commit f196832 into ggerganov:master Jan 17, 2025
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants