[SYCL] Implement max_num_work_groups from the launch queries extension #14333
Conversation
This is required for CUTLASS. Update: Ready for review.
Force-pushed from 0a362ab to cbb6fea.
The currently proposed and implemented query is `max_num_work_group_occupancy_per_cu`, which retrieves the maximum number of actively executing work-groups at compute-unit occupancy granularity. This commit also fixes an issue in the `max_num_work_group_sync` query that could previously have led to an out-of-launch-resources error. Additionally, it overloads the `max_num_num_work_group_sync` query to take extra parameters for the local work-group size and the dynamic local memory size (in bytes), in order to allow users to pass those important resource-usage factors to the query so they are taken into account in the final group-count suggestion. This overload is currently only usable when targeting CUDA.
Force-pushed from cbb6fea to aead3e3.
Can you update the wording in the PR description? It talks about max_num_num_work_group_sync, which I guess is an old name for the query?
This overload is currently only usable when targeting CUDA.
What is preventing us from implementing the API on Level Zero? If we cannot implement it on Level Zero, we should add a section "Backend support status" to the spec indicating what is supported. However, it would be better to implement it universally from the start.
Additional comments below.
possible occupancy in a portable way.

List of currently planned queries.
* max_num_work_group_occupancy_per_cu
This doesn't render well in HTML. Asciidoc requires a blank line before the bullet.
Are you planning to add more queries to this extension soon? This list of currently planned queries seems odd. I'd suggest removing it unless you have some specific plan to add more things.
I was planning on at least one more, namely recommended_work_group_size. This would be useful in combination with the currently added one, to let the runtime assist in selecting a configuration for maximum HW occupancy. It is not super useful in most kernel launch configurations, but for small ones that are not on the hot path, where sycl::nd_range is specified explicitly and manual fine-tuning is not required, it is a useful feature. However, I am not adding this yet.
Also, this is not much of a list with a single entry at this point, so I am removing it.
Thank you for questioning this!
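A purely hypothetical sketch of the combination described in the comment above, assuming a future recommended_work_group_size descriptor that is explicitly not part of this PR; the namespace, signature, and return type are guesses:

```c++
#include <sycl/sycl.hpp>
namespace syclex = sycl::ext::oneapi::experimental;

// Hypothetical: ask the runtime for a recommended work-group size, then build
// an nd_range that covers at least `total_items` work-items with it.
sycl::nd_range<1> pick_launch_config(const sycl::kernel &k, sycl::queue &q,
                                     size_t total_items) {
  // NOTE: recommended_work_group_size is not a real descriptor; it is only the
  // follow-up query mentioned above, with an assumed size_t return type.
  size_t wg = k.ext_oneapi_get_info<
      syclex::info::kernel_queue_specific::recommended_work_group_size>(q);
  size_t global = ((total_items + wg - 1) / wg) * wg; // round up to a multiple
  return sycl::nd_range<1>{sycl::range<1>{global}, sycl::range<1>{wg}};
}
```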
|Returns the maximum number of actively executing work-groups per compute unit
granularity, when the kernel is submitted to the specified queue with the
specified work-group size and the specified amount of dynamic work-group local
memory (in bytes). The actively executing work-groups are those that occupy
It might be good to be a little more detailed about what counts as dynamic work-group local memory. I assume this is the sum of the sizes of all local accessors, right?
Dynamically allocated (SYCL) local memory, for which the size is known only at runtime and can change between kernel submissions, so yes, you are right. I will elaborate in a little more detail to make it clear.
I can see how it is not clear the way I phrased it.
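As a concrete illustration of the above (not code from this PR), a minimal sketch of dynamically allocated SYCL local memory: a local_accessor whose size is only known at run time and can differ between submissions. With several local_accessors, their sizes would be summed.

```c++
#include <sycl/sycl.hpp>

// Submits a kernel whose work-group local memory requirement
// (elems * sizeof(float) bytes) is decided at run time.
void submit_with_dynamic_local_mem(sycl::queue &q, size_t elems) {
  q.submit([&](sycl::handler &cgh) {
    // Runtime-sized local allocation: this is the "dynamic work-group local
    // memory (in bytes)" the query takes as a parameter.
    sycl::local_accessor<float, 1> scratch(sycl::range<1>(elems), cgh);
    cgh.parallel_for(
        sycl::nd_range<1>{sycl::range<1>{256}, sycl::range<1>{64}},
        [=](sycl::nd_item<1> it) {
          size_t lid = it.get_local_id(0);
          if (lid < elems)
            scratch[lid] = static_cast<float>(lid);
        });
  });
}
```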
specified work-group size and the specified amount of dynamic work-group local
memory (in bytes). The actively executing work-groups are those that occupy
the fundamental hardware unit responsible for the execution of work-groups in
parallel.
Is the idea that max_num_work_group_occupancy_per_cu returns a recommended work-group size? If that is the case, can we rename the query to something like recommended_num_work_groups?
It is a recommendation for the maximum number of work-groups (or CUDA blocks) of the specified block size, etc., that will theoretically execute concurrently on the compute unit (CUDA SM) to achieve maximum occupancy.
I think I like the naming you suggest, and it does sound to me like a recommendation. What do you think would be a good name based on my description? (I am sold, given that the original name I came up with sounds a little weird.)
Thank you, @gmlueck!
I renamed it to recommended_num_work_groups. Initially, I wanted to indicate that these are not per-device semantics but per compute unit (or whatever this maps to in the HW, i.e. SM for CUDA, EU for Intel Level Zero, or CU for AMD HIP), hence the _per_cu in the name. However, the extension docs describe the semantics, so I think that's okay now.
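For illustration only, a sketch of the per-compute-unit semantics described here, using the query name as it stood at this point in the review (the merged extension ended up with different naming; see the PR description at the bottom). The exact signature and return type are assumptions.

```c++
#include <sycl/sycl.hpp>
namespace syclex = sycl::ext::oneapi::experimental;

// Rough device-wide concurrency estimate: the per-compute-unit recommendation
// (SM on CUDA, CU on AMD HIP, etc.) scaled by the device's compute-unit count.
size_t estimate_device_wide_groups(const sycl::kernel &k, sycl::queue &q) {
  auto per_cu = k.ext_oneapi_get_info<
      syclex::info::kernel_queue_specific::recommended_num_work_groups>(q);
  auto num_cus =
      q.get_device().get_info<sycl::info::device::max_compute_units>();
  return static_cast<size_t>(per_cu) * num_cus;
}
```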
@@ -0,0 +1,4 @@
// TODO: Revisit 'max_num_work_group_sync' and align it with the
#7598 was merged a while ago, so this comment should be removed. I also suggest that we deprecate the max_num_work_group_sync info trait right away in favor of max_num_work_groups. The former is not documented, so we should aim to remove it as soon as possible to avoid wide adoption of it.
I am aware and was thinking the same. However, I saw it as doing two separate things in one PR that ultimately lands as a single squashed commit, so I opted to follow up with a separate PR solely deprecating max_num_work_group_sync. Of course, I am not arguing against your suggestion and preferences. Having said that, let me know how you'd prefer it done; I am fine either way. Thanks!
I'm fine with doing so in a separate PR to have it as a separate commit in our history.
@@ -159,9 +159,29 @@ class __SYCL_EXPORT kernel : public detail::OwnerLessBase<kernel> {
   get_info(const device &Device, const range<3> &WGSize) const;

   // TODO: Revisit and align with sycl_ext_oneapi_forward_progress extension
-  // once #7598 is merged.
+  // once #7598 is merged. (regarding the 'max_num_work_group_sync' query)
Same here: #7598 has already been merged. Any APIs which do not correspond to the root-group or launch-queries extensions should be marked as deprecated together with the introduction of a proper API that is documented.
Same as my other response on this. I see you are suggesting that introducing the replacement and deprecating the replaced query should go hand in hand. So would you prefer that to happen in a separate PR, so the changes land as separate squashed commits, or do you want them in one PR with an expanded description?
@@ -3912,7 +3912,9 @@ _ZNK4sycl3_V16kernel16get_backend_infoINS0_4info6device15backend_versionEEENS0_6
_ZNK4sycl3_V16kernel16get_backend_infoINS0_4info6device7versionEEENS0_6detail20is_backend_info_descIT_E11return_typeEv
_ZNK4sycl3_V16kernel16get_backend_infoINS0_4info8platform7versionEEENS0_6detail20is_backend_info_descIT_E11return_typeEv
_ZNK4sycl3_V16kernel17get_kernel_bundleEv
_ZNK4sycl3_V16kernel19ext_oneapi_get_infoINS0_3ext6oneapi12experimental4info21kernel_queue_specific23max_num_work_group_syncEEENT_11return_typeERKNS0_5queueE
Note for reviewers: this is an ABI break. However, that symbol is from an experimental extension, and as I understand it we can make such a change even outside of an ABI-breaking window. Still, it would be good to hear feedback on that from someone else.
@GeorgeWeb The UR PR has been merged; please update the UR tag and fix the conflict to merge this. Thanks!
Unrelated passing XFAIL: https://github.com/intel/llvm/actions/runs/10809701571/job/29985720662?pr=14333#step:22:2286
Wrt the Doxygen docs build failure:
@intel/llvm-gatekeepers Please merge, the failures there are unrelated.
Note for other gatekeepers who may come here from a failed post-commit run: those failures were noticed, we have already synced with @GeorgeWeb about them, and a follow-up PR with slight tweaks to the tests is expected to fix that failure.
As @AlexeySachkov already noted (thanks), the follow-up PR that should fix the failure is #15359.
This PR implements the max_num_work_groups query from the sycl_ext_oneapi_launch_queries extension.

Additionally, this PR introduces changes that overload ext_oneapi_get_info for another kernel-queue-specific query, max_num_work_group_sync, to take extra parameters for the local work-group size and dynamic local memory size (in bytes) in order to allow users to pass those runtime resource-limiting factors to the query, so they are taken into account in the final group-count suggestion.
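For reference, a minimal usage sketch based only on the description above: the descriptor name max_num_work_groups matches the extension named here, but the exact parameter and return types are assumptions rather than something copied from the merged headers.

```c++
#include <sycl/sycl.hpp>
namespace syclex = sycl::ext::oneapi::experimental;

// Ask how many work-groups of a given shape, using a given amount of dynamic
// local memory, the kernel could be launched with on the queue's device.
void query_max_groups(sycl::queue &q, const sycl::kernel &k) {
  sycl::range<3> wg_size{1, 1, 64};      // intended local (work-group) size
  size_t dynamic_local_mem_bytes = 2048; // runtime-sized local memory usage

  auto max_groups = k.ext_oneapi_get_info<
      syclex::info::kernel_queue_specific::max_num_work_groups>(
      q, wg_size, dynamic_local_mem_bytes);

  // A cooperative launch should not request more work-groups than this.
  (void)max_groups;
}
```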