Revise design for optional features and AOT #12252

gmlueck · 2023-12-27T18:44:43Z

Revise the design for handling optional kernel features for AOT compilations. There is no need to create a separate tool "aspect-filter" because the filtering can be done in the "sycl-post-link" tool instead. This is better aligned with our long term direction because "sycl-post-link" will need to do some AOT-specific IR transformations in the future as part of the design presented in #9127, and those transformation will also make use of the device configuration file.

This commit also addresses some issues that were missed before related to embedding several AOT-compiled offload bundles into the host executable.

Revise the design for handling optional kernel features for AOT compilations. There is no need to create a separate tool "aspect-filter" because the filtering can be done in the "sycl-post-link" tool instead. This is better aligned with our long term direction because "sycl-post-link" will need to do some AOT-specific IR transformations in the future as part of the design presented in intel#9127, and those transformation will also make use of the device configuration file. This commit also addresses some issues that were missed before related to embedding several AOT-compiled offload bundles into the host executable.

gmlueck · 2023-12-27T18:51:12Z

sycl/doc/design/OptionalDeviceFeatures.md

+`opencl-aot`.  This would allow us to run the device compiler just once even if
+the user AOT compiles for both Intel GPU and CPU.  However, this means that the
+SPIR-V is generated using a different triple than we use today.  It's unclear
+what ramification this has.


Tagging some front-end people here for their perspective: @tahonermann @elizabethandrews @premanandrao

What are the front-end ramifications if we compile to SPIR-V using the triple spir64-unknown-unknown vs spir64_gen-unknown-unknown vs. spir64_x86_64-unknown-unknown?

@gmlueck, I can't think of any FE impact in particular. Did you have anything specific in mind? If there are any changes in size/layout of types, this would be more of a downstream component impact than in FE itself. Maybe I am missing the point of your question.

I might be wrong, but I had the impression that the triple had some effect on the front-end, like enabling different builtins. If that's not the case, it would be good news.

I can't comment specifically on spir64 vs spir64_gen vs spir64_x86_64, but in general, such differences certainly can effect the FE in terms of predefined macros, enabled features, builtins, etc... Some investigation would be needed to discover what specific differences would be enabled in these cases.

I think it would be worthwhile to do this investigation. If there is a difference, I don't think we are documenting it well to our users. Most people assume that the only difference between JIT compilation (spir64) vs. AOT compilation (spir64_gen or spir64_x86_64) is a tradeoff between portability and runtime performance -- JIT mode is more portable to different h/w but has a higher runtime cost due to the JIT compilation step.

However, if the FE behaves differently in these three modes, then that means there is also a functional difference between JIT compilation and AOT compilation.

The idea of having a single device triple for all targets would be ideal, and also simplifies how things are represented in general, allowing for specific device targets to be specified via an --offload-arch (community reference here) value without depending on the triple arch_subarch.

sycl/doc/design/DeviceConfigFile.md

mdtoguchi · 2024-01-04T16:24:42Z

sycl/doc/design/OptionalDeviceFeatures.md

+`sycl-post-link` generates two file tables: one for PVC and one for DG1.  The
+driver invokes `ocloc -device pvc` once for each IR module in the PVC file
+table, and it invokes `ocloc -device dg1` once for each IR module in the DG1
+file table.


In the current toolchain setup, each target device that is provided (i.e. -fsycl-targets=intel_gpu_pvc,intel_gpu_dg1) will create an individual unique toolchain for each device from creating the IR -> llvm-link -> sycl-post-link -> clang-offload-wrapper.

Given this, would there still be a need to have multiple -o options for sycl-post-link?

Does this mean that the driver will spawn a separate device compiler "clang" for each of those targets? (I.e. clang gets invoked once for "intel_gpu_pvc" and again for "intel_gpu_dg1"?) That would be a compile-time performance regression compared to the current flow. Currently, if you compile for multiple Intel GPU targets via -fsycl-targets=spir64_gen -Xs "-device pvc,dg1", then clang gets invoked only once. I thought it would be important to preserve this behavior, so that people don't see a regression in compile-time performance.

Okay - this makes sense. Some additional work would be needed in the driver to accommodate the combining of the targets given multiple intel_gpu* values. The expected behavior of the sycl-post-link call will need to be modified to allow for multiple outputs with the consumption of those outputs to correlate with the new -target values being passed to the individual wrapper calls.

The expected behavior of the sycl-post-link call will need to be modified to allow for multiple outputs with the consumption of those outputs to correlate with the new -target values being passed to the individual wrapper calls.

Yes, I agree. I think this is captured in the sections "Expanded command line options to the post link tool" and "New filtering pass in the post link tool" above. Or did you mean something more?

Co-authored-by: Michael Toguchi <[email protected]>

AlexeySachkov · 2024-02-01T13:46:42Z

sycl/doc/design/OptionalDeviceFeatures.md

+that are supported by a target device.  If any IR module uses an incompatible
+feature, the corresponding row in the table is changed to use an empty IR
+module instead.  To illustrate, let's assume that


If any IR module uses an incompatible feature, the corresponding row in the table is changed to use an empty IR
module instead.

I don't understand why should we preserve those dummy empty modules. I think that we can just avoid adding them t the table in the first place: less temporary files, less I/O, less work for linker, less work at runtime to skip through those empty device images, etc.

Of course, there is a corner case with a completely empty table, but conceptually this seems fine, we just need to make sure that no other tools freak out and handle this properly.

AlexeySachkov · 2024-02-01T13:52:11Z

sycl/doc/design/OptionalDeviceFeatures.md

+"mod2.sym") are unchanged.  It's important to preserve the original properties,
+so that the runtime can diagnose an error if the application mistakenly tries
+to submit a kernel from this IR module to a device that does not support the
+features it uses.


I'm still not entirely convinced that we need those empty device images and their properties. If runtime can't find an image which is compatible with a device then corresponding error will be thrown and it should be enough indication for a user that this kernel can't be used on that device and we can adjust our exception messages for that.

Moreover, if there is no PVC-specific version of a kernel, but there is a generic SPIR-V version of a kernel, then we can silently fallback to JIT and everything will work just fine. As a user, that is exactly what I would expect from the implementation.

For some advanced users who would like to be more aware of what SYCL implementation does under the hood we can provide env variables which would emit extra diagnostics, or prevent AOT -> JIT fallback. Alternatively, such users could use kernel_bundle API to explicitly control device images and their lifecycle - in that case it would be even more obvious for a user that certain kernel is not AOT compiled for a certain device and there is again no need for having properties and dummy device image to detect and report that.

If runtime can't find an image which is compatible with a device then corresponding error will be thrown and it should be enough indication for a user that this kernel can't be used on that device and we can adjust our exception messages for that.

The runtime uses the "SYCL/device requirements" property set to construct a meaningful error message like:

Kernel uses optional feature corresponding to 'aspect::fp16' but device does not support this aspect.

If we delete the row with the empty device image, we also delete the property set, so it becomes impossible to raise a helpful error message like this.

Moreover, if there is no PVC-specific version of a kernel, but there is a generic SPIR-V version of a kernel, then we can silently fallback to JIT and everything will work just fine. As a user, that is exactly what I would expect from the implementation.

This won't work because the SPIR-V module uses some feature that is not supported on PVC in this case. (That's the only reason we substitute the empty device image.) Therefore, we must ensure that the runtime does not try to JIT compile the SPIR-V in this case. If it did invoke the JIT compiler, it would fail with an exception.

less temporary files, less I/O, less work for linker, less work at runtime to skip through those empty device images, etc

This addresses only part of your concern, but sycl-post-link would only need to generate a single "empty.bc" file, and that file can be shared by all rows in all tables that need an empty entry. Also, the runtime does not need to skip over the empty entries. In fact, we want the runtime to find the empty entry, so it can find the corresponding property set.

AlexeySachkov · 2024-02-01T14:11:39Z

sycl/doc/design/OptionalDeviceFeatures.md

+In the new design, `ocloc` does not embed SPIR-V in its output because we
+invoke it with `-exclude_ir`.  Instead, DPC++ is responsible for embedding the
+SPIR-V in the host executable, but should that SPIR-V be generated with the
+triple `spir64-unknown-unknown` or the triple `spir64_gen-unknown-unknown`?
+We could generate both, but that seems wasteful.  If we generate the SPIR-V
+with `spir64-unknown-unknown`, does that mean we need to run the device
+compiler twice, once with `spir64_gen-unknown-unknown` (to generate SPIR-V that
+is fed to `ocloc`) and once with `spir64-unknown-unknown` (to generate SPIR-V
+that is embedded in the host executable)?  This also seems wasteful because
+it requires running the device compiler twice whenever the user AOT compiles
+for an Intel GPU.


I expect that we will be encouraging our users to move away from -fsycl-targets=spir64_gen -Xs "-device pvc" towards -fsycl-targets=spir64_gen_pvc. In that case, situation you describing here would more rather an exception.

We also call the device compiler twice right now for -fsycl-targets=spir64,spir64_gen if I'm not mistaken.

Considering those two facts, I wouldn't worry too much about this case. I think that we can leave it as-is, slowly deprecate and then completely (or almost completely) drop

I expect that we will be encouraging our users to move away from -fsycl-targets=spir64_gen -Xs "-device pvc" towards -fsycl-targets=spir64_gen_pvc. In that case, situation you describing here would more rather an exception.

The situation I describe occurs with the new command line syntax. Consider a simple case where the user passes -fsycl-targets=intel_gpu_pvc. This will cause the compiler to embed both a PVC and a SPIR-V image in the ELF file. What triple will we use to generate that SPIR-V, spir64-unknown-unknown or spir64_gen-unknown-unknown?

AlexeySachkov · 2024-02-01T14:24:20Z

sycl/doc/design/OptionalDeviceFeatures.md

-raise a synchronous `errc::kernel_not_supported` exception.
+#### Identifying a compatible offload bundle
+
+Since the user may AOT compile for several different offload targets, the


Ideally, this should be moved down to PI/UR layer, because it is too backend-specific.

There are different approaches possible:

an API which checks if a device image is compatible with particular device/arch

an API which accepts a set of device images and returns an index of a "best" for a specified device/target/arch

We have though about adding the second API into our SW stack to implement AOT -> JIT fallback a while ago. However, if we also factor in some extra features that may affect selection of a device image, like presence or manipulations with specialization constants, then the first API may be a better fit for our needs.

Yes, I agree that we need some new PI/UR API here. I guess I was trying to avoid the details of where in the runtime we implement the check.

KornevNikita · 2024-02-20T10:37:58Z

sycl/doc/design/OptionalDeviceFeatures.md

+[9]: <https://spec.oneapi.io/level-zero/latest/core/api.html#ze-device-ip-version-ext-t>
+[10]: <https://registry.khronos.org/OpenCL/extensions/intel/cl_intel_device_attribute_query.html>
+
+#### Raising an exception when features are incompatible


@gmlueck could we also extend SYCL/device requirements in scope of this PR with the additional property set reflecting the value(-s) of -fsycl-targets flag?

What I mean - previously we had an update to clarify is_compatible()

to tell if a kernel was compiled in a way that restricts the devices on which it can run

At the moment in scope of is_compatible we only check that the device has the same backend as the target backend of the binary image. We can't yet handle the case when -fsycl-targets=nvidia_gpu_sm_86 and the device is sm_90. To check this the image should contain the info from -fsycl-targets, so we probably need to put it into device requirements.

Tagging @AerialMantis here because it would be good to get his perspective on how this should work with CUDA.

I was thinking that the solution I outline above for Intel GPU targets would work also for CUDA targets. These are the relevant paragraphs from earlier in the document:

The driver also invokes clang-offload-wrapper to wrap the native offload modules for inclusion in the host executable. Currently, the driver passes -target=spir64 when not doing AOT compilation (i.e. when generating SPIR-V for the target code), and it passes -target=spir64_gen when doing AOT compilation. The target name spir64_gen indicates that the offload bundle is native code that can run on any Intel GPU.

With the new design, the driver still uses -target=spir64 to identify a SPIR-V offload bundle, however it must use a more specific device name for native offload bundles. We use the same target names accepted by -fsycl-targets. For example, if DPC++ is invoked with -fsycl-targets=intel_gpu_pvc,intel_gpu_dg1, then the driver invokes clang-offload-wrapper twice, once with -target=intel_gpu_pvc and once with -target=intel_gpu_dg1.

If we follow this for CUDA, it would work like this. If the user compiles with -fsycl-targets=nvidia_gpu_sm_86, the compiler driver would end up bundling the offload image via clang-offload-wrapper -target=nvidia_gpu_sm_86 .... When the SYCL runtime looks for a compatible device image, it would then decide if the image's target type "nvidia_gpu_sm_86" is compatible with the queue's device.

To clarify - do I understand it correctly, that the idea is to use not just __SYCL_PI_DEVICE_BINARY_TARGET_SPIRV64_X86_64 as target but something like __SYCL_PI_DEVICE_BINARY_TARGET_SPIRV64_X86_64_##DEVICE_NAME?

Sorry for the delay. I wanted to talk with @mdtoguchi first to make sure I was on the right track.

The design in this document is written using the old clang offloading design (i.e. the one using clang-offload-wrapper), which is how DPC++ is currently implemented. In that design, the clang-offload-wrapper tool accepts a command line option -target=, which specifies a string name for the offload target device. We currently pass this option as "-target=spir64_gen" when compiling for an Intel GPU device. I think this string is stored in the host ELF file in the pi_device_binary_struct::DeviceTargetSpec field, but I'm not certain of this. (If you know this code better, maybe you can answer.)

My proposal is that we pass a more specific target string here, for example -target=intel_gpu_pvc (or maybe just -target=pvc). This would allow the SYCL runtime to identify a PVC device image by looking at the pi_device_binary_struct::DeviceTargetSpec field.

The exact spelling of these target strings can be decided later. For example, I think we will eventually migrate away from -fsycl-targets= to the standard clang option --offload-arch=. When we do this, we will need to decide on the names of the Intel GPU devices that we will use with --offload-arch=, and we may as well use the same strings for the pi_device_binary_struct::DeviceTargetSpec field.

BTW, I know that we are migrating away from the old clang offloading design towards the new design that uses clang-linker-wrapper. I believe this new offloading design still identifies each device image with some sort of target name, so I think the strategy I'm proposing here still works even with that new design.

(If you know this code better, maybe you can answer.)

Yes, that's exactly what I was talking about. is_compatible(Dev) checks the value of pi_device_binary_struct::DeviceTargetSpec and decides if it matches the backend of the device. I didn't know about that idea:

My proposal is that we pass a more specific target string here

That should work fine for us, thanks for the clarification.

…12727) This PR adds the required changes to `sycl-post-link` support optional kernel features in AOT mode as described by the design doc in #12252. Additionally, it also updates the driver to invoke `sycl-post-link` with a device architecture when Intel GPU targets are passed in `-fsycl-targets`.

github-actions · 2024-10-09T01:58:56Z

This pull request is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be automatically closed in 30 days.

gmlueck requested a review from mdtoguchi December 27, 2023 18:44

gmlueck requested a review from a team as a code owner December 27, 2023 18:44

gmlueck requested a review from bso-intel December 27, 2023 18:44

gmlueck commented Dec 27, 2023

View reviewed changes

mdtoguchi reviewed Jan 4, 2024

View reviewed changes

Update sycl/doc/design/DeviceConfigFile.md

5117ea7

Co-authored-by: Michael Toguchi <[email protected]>

bso-intel approved these changes Jan 12, 2024

View reviewed changes

AlexeySachkov reviewed Feb 1, 2024

View reviewed changes

jzc mentioned this pull request Feb 15, 2024

[SYCL] Add support for multiple filtered outputs in sycl-post-link #12727

Merged

KornevNikita reviewed Feb 20, 2024

View reviewed changes

gmlueck added 2 commits February 28, 2024 16:45

Merge branch 'sycl' into gmlueck/opt-features-aot

d5176c2

Fix typo

d4c783c

github-actions bot added the Stale label Oct 9, 2024

gmlueck removed the Stale label Oct 9, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Revise design for optional features and AOT #12252

Revise design for optional features and AOT #12252

gmlueck commented Dec 27, 2023

gmlueck Dec 27, 2023

premanandrao Jan 5, 2024

gmlueck Jan 5, 2024

tahonermann Jan 7, 2024

gmlueck Jan 8, 2024

mdtoguchi Apr 11, 2024

mdtoguchi Jan 4, 2024

gmlueck Jan 4, 2024

mdtoguchi Jan 4, 2024

gmlueck Jan 5, 2024

AlexeySachkov Feb 1, 2024

AlexeySachkov Feb 1, 2024

gmlueck Feb 28, 2024

AlexeySachkov Feb 1, 2024

gmlueck Feb 28, 2024

AlexeySachkov Feb 1, 2024

gmlueck Feb 28, 2024

KornevNikita Feb 20, 2024 •

edited

Loading

gmlueck Feb 29, 2024

KornevNikita Mar 1, 2024

gmlueck Mar 5, 2024

KornevNikita Mar 6, 2024

github-actions bot commented Oct 9, 2024

Revise design for optional features and AOT #12252

Are you sure you want to change the base?

Revise design for optional features and AOT #12252

Conversation

gmlueck commented Dec 27, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

KornevNikita Feb 20, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

github-actions bot commented Oct 9, 2024

KornevNikita Feb 20, 2024 •

edited

Loading