[SYCL] Implement max_num_work_groups from the launch queries extension #14333
Conversation
This is required for CUTLASS. Update: Ready for review.
Force-pushed from 0a362ab to cbb6fea.
The currently proposed and implemented query is `max_num_work_group_occupancy_per_cu`, which retrieves the maximum number of actively executing work-groups at compute-unit occupancy granularity. This commit also fixes an issue in the `max_num_work_group_sync` query that could previously have led to an out-of-launch-resources error. Additionally, it overloads the `max_num_num_work_group_sync` query to take extra parameters for the local work-group size and the dynamic local memory size (in bytes), in order to allow users to pass those important resource-usage factors to the query so they are taken into account in the final group-count suggestion. This overload is currently only usable when targeting CUDA.
Force-pushed from cbb6fea to aead3e3.
Can you update the wording in the PR description? It talks about max_num_num_work_group_sync, which I guess is an old name for the query?
This overload is currently only usable when targeting CUDA.
What is preventing us from implementing the API on Level Zero? If we cannot implement it on Level Zero, we should add a section "Backend support status" to the spec indicating what is supported. However, it would be better to implement it universally from the start.
Additional comments below.
possible occupancy in a portable way.

List of currently planned queries.
* max_num_work_group_occupancy_per_cu
This doesn't render well in HTML. Asciidoc requires a blank line before the bullet.
Are you planning to add more queries to this extension soon? This list of currently planned queries seems odd. I'd suggest removing it unless you have some specific plan to add more things.
I was planning on at least one more, namely recommended_work_group_size. This would be useful in combination with the currently added one, to let the runtime assist in selecting a configuration for maximum HW occupancy. It is not super useful in most kernel launch configurations, but for small ones that are not on the hot path, where sycl::nd_range is specified explicitly and manual fine-tuning is not required, it is a useful feature. However, I am not adding this yet.
Also, this is not much of a list with a single entry at this point, so I am removing it.
Thank you for questioning this!
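A purely hypothetical sketch of the combination described in the comment above, assuming a future recommended_work_group_size descriptor that is explicitly not part of this PR; the namespace, signature, and return type are guesses:

```c++
#include <sycl/sycl.hpp>
namespace syclex = sycl::ext::oneapi::experimental;

// Hypothetical: ask the runtime for a recommended work-group size, then build
// an nd_range that covers at least `total_items` work-items with it.
sycl::nd_range<1> pick_launch_config(const sycl::kernel &k, sycl::queue &q,
                                     size_t total_items) {
  // NOTE: recommended_work_group_size is not a real descriptor; it is only the
  // follow-up query mentioned above, with an assumed size_t return type.
  size_t wg = k.ext_oneapi_get_info<
      syclex::info::kernel_queue_specific::recommended_work_group_size>(q);
  size_t global = ((total_items + wg - 1) / wg) * wg; // round up to a multiple
  return sycl::nd_range<1>{sycl::range<1>{global}, sycl::range<1>{wg}};
}
```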
|Returns the maximum number of actively executing work-groups per compute unit
granularity, when the kernel is submitted to the specified queue with the
specified work-group size and the specified amount of dynamic work-group local
memory (in bytes). The actively executing work-groups are those that occupy
It might be good to be a little more detailed about what counts as dynamic work-group local memory. I assume this is the sum of the sizes of all local accessors, right?
Dynamically allocated (SYCL) local memory, for which the size is known only at runtime and can change between kernel submissions, so yes, you are right. I will elaborate in a little more detail to make it clear.
I can see how it is not clear the way I phrased it.
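As a concrete illustration of the above (not code from this PR), a minimal sketch of dynamically allocated SYCL local memory: a local_accessor whose size is only known at run time and can differ between submissions. With several local_accessors, their sizes would be summed.

```c++
#include <sycl/sycl.hpp>

// Submits a kernel whose work-group local memory requirement
// (elems * sizeof(float) bytes) is decided at run time.
void submit_with_dynamic_local_mem(sycl::queue &q, size_t elems) {
  q.submit([&](sycl::handler &cgh) {
    // Runtime-sized local allocation: this is the "dynamic work-group local
    // memory (in bytes)" the query takes as a parameter.
    sycl::local_accessor<float, 1> scratch(sycl::range<1>(elems), cgh);
    cgh.parallel_for(
        sycl::nd_range<1>{sycl::range<1>{256}, sycl::range<1>{64}},
        [=](sycl::nd_item<1> it) {
          size_t lid = it.get_local_id(0);
          if (lid < elems)
            scratch[lid] = static_cast<float>(lid);
        });
  });
}
```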
specified work-group size and the specified amount of dynamic work-group local
memory (in bytes). The actively executing work-groups are those that occupy
the fundamental hardware unit responsible for the execution of work-groups in
parallel.
Is the idea that max_num_work_group_occupancy_per_cu returns a recommended work-group size? If that is the case, can we rename the query to something like recommended_num_work_groups?
It is a recommendation for the maximum number of work-groups (or CUDA blocks) of the specified block size, etc., that will theoretically execute concurrently on the compute unit (CUDA SM) to achieve maximum occupancy.
I think I like the naming you suggest, and it does sound to me like a recommendation. What do you think would be a good name based on my description? (I am sold, given that the original name I came up with sounds a little weird.)
Thank you, @gmlueck!
I renamed it to recommended_num_work_groups. Initially, I wanted to indicate that these are not per-device semantics but per compute unit (or whatever this maps to in the HW, i.e. SM for CUDA, EU for Intel Level Zero, or CU for AMD HIP), hence the _per_cu in the name. However, the extension docs describe the semantics, so I think that's okay now.
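For illustration only, a sketch of the per-compute-unit semantics described here, using the query name as it stood at this point in the review (the merged extension ended up with different naming; see the PR description at the bottom). The exact signature and return type are assumptions.

```c++
#include <sycl/sycl.hpp>
namespace syclex = sycl::ext::oneapi::experimental;

// Rough device-wide concurrency estimate: the per-compute-unit recommendation
// (SM on CUDA, CU on AMD HIP, etc.) scaled by the device's compute-unit count.
size_t estimate_device_wide_groups(const sycl::kernel &k, sycl::queue &q) {
  auto per_cu = k.ext_oneapi_get_info<
      syclex::info::kernel_queue_specific::recommended_num_work_groups>(q);
  auto num_cus =
      q.get_device().get_info<sycl::info::device::max_compute_units>();
  return static_cast<size_t>(per_cu) * num_cus;
}
```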
@@ -0,0 +1,4 @@
// TODO: Revisit 'max_num_work_group_sync' and align it with the
#7598 was merged a while ago, so this comment should be removed. I also suggest that we deprecate the max_num_work_group_sync info trait right away in favor of max_num_work_groups. The former is not documented, so we should aim to remove it as soon as possible to avoid wide adoption of it.
I am aware and was thinking the same. However, I saw it as doing two separate things in one PR that ultimately lands as a single squashed commit, so I opted to follow up with a separate PR solely deprecating max_num_work_group_sync. Of course, I am not arguing against your suggestion and preferences. Having said that, let me know how you'd prefer it done; I am fine either way. Thanks!
I'm fine with doing so in a separate PR to have it as a separate commit in our history.
@@ -159,9 +159,29 @@ class __SYCL_EXPORT kernel : public detail::OwnerLessBase<kernel> {
   get_info(const device &Device, const range<3> &WGSize) const;

   // TODO: Revisit and align with sycl_ext_oneapi_forward_progress extension
-  // once #7598 is merged.
+  // once #7598 is merged. (regarding the 'max_num_work_group_sync' query)
Same here: #7598 has already been merged. Any APIs which do not correspond to the root-group or launch-queries extensions should be marked as deprecated together with the introduction of a proper API that is documented.
Same as my other response on this. I see you are suggesting that introducing the replacement and deprecating the replaced query should go hand in hand. So would you prefer that to happen in a separate PR, so the changes land as separate squashed commits, or do you want them in one PR with an expanded description?
@@ -3912,7 +3912,9 @@ _ZNK4sycl3_V16kernel16get_backend_infoINS0_4info6device15backend_versionEEENS0_6
_ZNK4sycl3_V16kernel16get_backend_infoINS0_4info6device7versionEEENS0_6detail20is_backend_info_descIT_E11return_typeEv
_ZNK4sycl3_V16kernel16get_backend_infoINS0_4info8platform7versionEEENS0_6detail20is_backend_info_descIT_E11return_typeEv
_ZNK4sycl3_V16kernel17get_kernel_bundleEv
_ZNK4sycl3_V16kernel19ext_oneapi_get_infoINS0_3ext6oneapi12experimental4info21kernel_queue_specific23max_num_work_group_syncEEENT_11return_typeERKNS0_5queueE
Note for reviewers: this is an ABI break. However, that symbol is from an experimental extension, and as I understand it we can make such a change even outside of an ABI-breaking window. Still, it would be good to hear feedback on that from someone else.
@GeorgeWeb The UR PR has been merged; please update the UR tag and fix the conflict to merge this. Thanks!
Unrelated passing XFAIL: https://github.com/intel/llvm/actions/runs/10809701571/job/29985720662?pr=14333#step:22:2286
Wrt the Doxygen docs build failure:
@intel/llvm-gatekeepers Please merge, the failures there are unrelated.
Note for other gatekeepers who may come here from a failed post-commit run: those failures were noticed, we have already synced with @GeorgeWeb about them, and a follow-up PR with slight tweaks to the tests is expected to fix that failure.
As @AlexeySachkov already noted (thanks), the follow-up PR that should fix the failure is #15359.
This PR implements the max_num_work_groups query from the sycl_ext_oneapi_launch_queries extension.

Additionally, this PR introduces changes that overload ext_oneapi_get_info for another kernel-queue-specific query, max_num_work_group_sync, to take extra parameters for the local work-group size and dynamic local memory size (in bytes) in order to allow users to pass those runtime resource-limiting factors to the query, so they are taken into account in the final group-count suggestion.
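For reference, a minimal usage sketch based only on the description above: the descriptor name max_num_work_groups matches the extension named here, but the exact parameter and return types are assumptions rather than something copied from the merged headers.

```c++
#include <sycl/sycl.hpp>
namespace syclex = sycl::ext::oneapi::experimental;

// Ask how many work-groups of a given shape, using a given amount of dynamic
// local memory, the kernel could be launched with on the queue's device.
void query_max_groups(sycl::queue &q, const sycl::kernel &k) {
  sycl::range<3> wg_size{1, 1, 64};      // intended local (work-group) size
  size_t dynamic_local_mem_bytes = 2048; // runtime-sized local memory usage

  auto max_groups = k.ext_oneapi_get_info<
      syclex::info::kernel_queue_specific::max_num_work_groups>(
      q, wg_size, dynamic_local_mem_bytes);

  // A cooperative launch should not request more work-groups than this.
  (void)max_groups;
}
```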