Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Issue] Are threadfences required for intra-wavefront shared memory communication? #3759

Closed
vbharadwaj-bk opened this issue Mar 11, 2025 · 7 comments

Comments

@vbharadwaj-bk
Copy link

Hi - I have a piece of code where thread 0 reads a value written to shared memory by thread 1 of a warp in a prior instruction. Because HIP does not provide any memory consistency guarantees, I take it that I must use __threadfence_block() between the read and write. On the other hand, I see in the reduction example that within a warp, no memory fence is used between reads and writes (there is a note that syncwarp is required on NVIDIA architectures, but not on AMD architectures due to the lockstep behavior of each wavefront.

https://github.com/ROCm/HIP/blob/490f02a66eb62031007d93ee3c14d84e4d9f2dab/docs/tutorial/reduction.rst?plain=1#L388

Is a memory fence required for correct intra-warp commuication? If not, does this effectively remove the need for any __syncwarp() primitive, since wavefronts execute in lockstep and provide a consistency guarantee? There are some questions about why syncwarp is not supported yet, but I don't see a need for that if it's two functions are subsumed by the hardware guarantees. If a fence is required, perhaps the documentation should be updated.

@ppanchad-amd
Copy link

Hi @vbharadwaj-bk. Internal ticket has been created to assis with your issue. Thanks!

@vbharadwaj-bk
Copy link
Author

An update: I find that certain pieces of code are working correctly when I add threadfences, but not otherwise. It would be great if I could get confirmation that two threads can only communicate via shared memory if a threadfence separates reads and writes; I suspect the documentation on the warp reduction should be updated as well.

@schung-amd
Copy link

Hi @vbharadwaj-bk, sorry for the delay on this. I think your suspicion is correct and you should need synchronization (via threadfences or otherwise) to guarantee proper memory operation ordering, but I'll double check with internal teams to see if we're doing something here under the hood that would somehow remove the need for a sync.

@schung-amd
Copy link

I've been told that we actually don't need synchronization in the reduction tutorial, still gathering details and will update when I have them. Thanks for pointing this out!

@schung-amd
Copy link

NVIDIA requires syncwarp here because of independent thread scheduling, while we don't support independent thread scheduling and therefore don't have/require this. We have a few more details in the docs: https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/hip_runtime_api/call_stack.html#call-stack. This is a little different from a regular threadfence, so seeing a syncwarp does not imply whether we do or do not need synchronization in general.

I find that certain pieces of code are working correctly when I add threadfences, but not otherwise.

If you have specific pieces of code you have questions about, you can post them and we can take a look.

I suspect the documentation on the warp reduction should be updated as well.

We'll supplement the HIP programming guide with some more info about this.

@vbharadwaj-bk
Copy link
Author

Thanks for the response! Hmm - yes I still have a piece of code that is failing, but succeeds when a threadfence is added in - let me distill it into a small snippet that reproduces the error.

@vbharadwaj-bk
Copy link
Author

Update: I appreciate the response; I haven't been able to distill my code yet, so let's close this issue for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants