[Issue] Are threadfences required for intra-wavefront shared memory communication? #3759
Comments
Hi @vbharadwaj-bk. An internal ticket has been created to assist with your issue. Thanks!
An update: I find that certain pieces of code are working correctly when I add threadfences, but not otherwise. It would be great if I could get confirmation that two threads can only communicate via shared memory if a threadfence separates reads and writes; I suspect the documentation on the warp reduction should be updated as well.
Hi @vbharadwaj-bk, sorry for the delay on this. I think your suspicion is correct and you should need synchronization (via threadfences or otherwise) to guarantee proper memory operation ordering, but I'll double check with internal teams to see if we're doing something here under the hood that would somehow remove the need for a sync.
I've been told that we actually don't need synchronization in the reduction tutorial. Still gathering details and will update when I have them. Thanks for pointing this out!
NVIDIA requires syncwarp here because of independent thread scheduling; we don't support independent thread scheduling and therefore don't have or require it. There are a few more details in the docs: https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/hip_runtime_api/call_stack.html#call-stack. This is a little different from a regular threadfence, so the presence of a syncwarp does not by itself indicate whether synchronization is needed in general.
If you have specific pieces of code you have questions about, you can post them and we can take a look.
We'll supplement the HIP programming guide with some more info about this.
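To make the distinction concrete, the warp-reduction tail under discussion looks roughly like this (an illustrative sketch in the spirit of the tutorial, not its exact code; the function name and loop shape are made up for this example):

```cpp
#include <hip/hip_runtime.h>

// Final reduction steps within a single wavefront.
// sdata points into a __shared__ array; tid is threadIdx.x.
__device__ void warp_reduce(volatile float* sdata, unsigned tid)
{
    for (unsigned s = warpSize / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        // On NVIDIA GPUs with independent thread scheduling, a
        // __syncwarp() would be needed here so the writes from the
        // upper lanes are visible before the lower lanes read them.
        // On AMD GPUs the wavefront executes in lockstep, which is
        // why the tutorial omits any synchronization at this point.
    }
}
```

Note the `volatile` qualifier on `sdata`: even with lockstep execution, it is needed to keep the compiler from caching shared-memory values in registers across iterations.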
Thanks for the response! Hmm - yes, I still have a piece of code that fails but succeeds when a threadfence is added in; let me distill it into a small snippet that reproduces the error.
Update: I appreciate the response; I haven't been able to distill my code yet, so let's close this issue for now. |
Hi - I have a piece of code where thread 0 reads a value written to shared memory by thread 1 of a warp in a prior instruction. Because HIP does not provide any memory consistency guarantees, I take it that I must use `__threadfence_block()` between the write and the read. On the other hand, I see in the reduction example that within a warp, no memory fence is used between reads and writes (there is a note that syncwarp is required on NVIDIA architectures, but not on AMD architectures due to the lockstep behavior of each wavefront): https://github.com/ROCm/HIP/blob/490f02a66eb62031007d93ee3c14d84e4d9f2dab/docs/tutorial/reduction.rst?plain=1#L388
Is a memory fence required for correct intra-warp communication? If not, does this effectively remove the need for any __syncwarp() primitive, since wavefronts execute in lockstep and provide a consistency guarantee? There are some questions about why syncwarp is not supported yet, but I don't see a need for it if its two functions are subsumed by the hardware guarantees. If a fence is required, perhaps the documentation should be updated.
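For concreteness, the pattern I'm asking about can be sketched like this (kernel name and the written value are made up; this is not my actual failing code):

```cpp
#include <hip/hip_runtime.h>

// Thread 1 of the wavefront writes to shared memory; thread 0 of the
// same wavefront reads it in a later instruction.
__global__ void intra_wave_exchange(int* out)
{
    __shared__ int slot;

    if (threadIdx.x == 1)
        slot = 42;

    // The question: is a fence required here for thread 0 to reliably
    // observe the write, e.g.
    //     __threadfence_block();
    // or does lockstep wavefront execution already guarantee it?

    if (threadIdx.x == 0)
        *out = slot;
}
```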