Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revise design for optional features and AOT #12252

Open
wants to merge 4 commits into
base: sycl
Choose a base branch
from

Conversation

gmlueck
Copy link
Contributor

@gmlueck gmlueck commented Dec 27, 2023

Revise the design for handling optional kernel features for AOT compilations. There is no need to create a separate tool "aspect-filter" because the filtering can be done in the "sycl-post-link" tool instead. This is better aligned with our long term direction because "sycl-post-link" will need to do some AOT-specific IR transformations in the future as part of the design presented in #9127, and those transformation will also make use of the device configuration file.

This commit also addresses some issues that were missed before related to embedding several AOT-compiled offload bundles into the host executable.

Revise the design for handling optional kernel features for AOT
compilations.  There is no need to create a separate tool
"aspect-filter" because the filtering can be done in the
"sycl-post-link" tool instead.  This is better aligned with our
long term direction because "sycl-post-link" will need to do some
AOT-specific IR transformations in the future as part of the design
presented in intel#9127, and those transformation will also make use of
the device configuration file.

This commit also addresses some issues that were missed before related
to embedding several AOT-compiled offload bundles into the host
executable.
@gmlueck gmlueck requested a review from mdtoguchi December 27, 2023 18:44
@gmlueck gmlueck requested a review from a team as a code owner December 27, 2023 18:44
@gmlueck gmlueck requested a review from bso-intel December 27, 2023 18:44
`opencl-aot`. This would allow us to run the device compiler just once even if
the user AOT compiles for both Intel GPU and CPU. However, this means that the
SPIR-V is generated using a different triple than we use today. It's unclear
what ramification this has.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tagging some front-end people here for their perspective: @tahonermann @elizabethandrews @premanandrao

What are the front-end ramifications if we compile to SPIR-V using the triple spir64-unknown-unknown vs spir64_gen-unknown-unknown vs. spir64_x86_64-unknown-unknown?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gmlueck, I can't think of any FE impact in particular. Did you have anything specific in mind? If there are any changes in size/layout of types, this would be more of a downstream component impact than in FE itself. Maybe I am missing the point of your question.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be wrong, but I had the impression that the triple had some effect on the front-end, like enabling different builtins. If that's not the case, it would be good news.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't comment specifically on spir64 vs spir64_gen vs spir64_x86_64, but in general, such differences certainly can effect the FE in terms of predefined macros, enabled features, builtins, etc... Some investigation would be needed to discover what specific differences would be enabled in these cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be worthwhile to do this investigation. If there is a difference, I don't think we are documenting it well to our users. Most people assume that the only difference between JIT compilation (spir64) vs. AOT compilation (spir64_gen or spir64_x86_64) is a tradeoff between portability and runtime performance -- JIT mode is more portable to different h/w but has a higher runtime cost due to the JIT compilation step.

However, if the FE behaves differently in these three modes, then that means there is also a functional difference between JIT compilation and AOT compilation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea of having a single device triple for all targets would be ideal, and also simplifies how things are represented in general, allowing for specific device targets to be specified via an --offload-arch (community reference here) value without depending on the triple arch_subarch.

sycl/doc/design/DeviceConfigFile.md Outdated Show resolved Hide resolved
`sycl-post-link` generates two file tables: one for PVC and one for DG1. The
driver invokes `ocloc -device pvc` once for each IR module in the PVC file
table, and it invokes `ocloc -device dg1` once for each IR module in the DG1
file table.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the current toolchain setup, each target device that is provided (i.e. -fsycl-targets=intel_gpu_pvc,intel_gpu_dg1) will create an individual unique toolchain for each device from creating the IR -> llvm-link -> sycl-post-link -> clang-offload-wrapper.

Given this, would there still be a need to have multiple -o options for sycl-post-link?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that the driver will spawn a separate device compiler "clang" for each of those targets? (I.e. clang gets invoked once for "intel_gpu_pvc" and again for "intel_gpu_dg1"?) That would be a compile-time performance regression compared to the current flow. Currently, if you compile for multiple Intel GPU targets via -fsycl-targets=spir64_gen -Xs "-device pvc,dg1", then clang gets invoked only once. I thought it would be important to preserve this behavior, so that people don't see a regression in compile-time performance.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay - this makes sense. Some additional work would be needed in the driver to accommodate the combining of the targets given multiple intel_gpu* values. The expected behavior of the sycl-post-link call will need to be modified to allow for multiple outputs with the consumption of those outputs to correlate with the new -target values being passed to the individual wrapper calls.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The expected behavior of the sycl-post-link call will need to be modified to allow for multiple outputs with the consumption of those outputs to correlate with the new -target values being passed to the individual wrapper calls.

Yes, I agree. I think this is captured in the sections "Expanded command line options to the post link tool" and "New filtering pass in the post link tool" above. Or did you mean something more?

Comment on lines +1021 to +1023
that are supported by a target device. If any IR module uses an incompatible
feature, the corresponding row in the table is changed to use an empty IR
module instead. To illustrate, let's assume that
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If any IR module uses an incompatible feature, the corresponding row in the table is changed to use an empty IR
module instead.

I don't understand why should we preserve those dummy empty modules. I think that we can just avoid adding them t the table in the first place: less temporary files, less I/O, less work for linker, less work at runtime to skip through those empty device images, etc.

Of course, there is a corner case with a completely empty table, but conceptually this seems fine, we just need to make sure that no other tools freak out and handle this properly.

Comment on lines +1037 to +1040
"mod2.sym") are unchanged. It's important to preserve the original properties,
so that the runtime can diagnose an error if the application mistakenly tries
to submit a kernel from this IR module to a device that does not support the
features it uses.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still not entirely convinced that we need those empty device images and their properties. If runtime can't find an image which is compatible with a device then corresponding error will be thrown and it should be enough indication for a user that this kernel can't be used on that device and we can adjust our exception messages for that.

Moreover, if there is no PVC-specific version of a kernel, but there is a generic SPIR-V version of a kernel, then we can silently fallback to JIT and everything will work just fine. As a user, that is exactly what I would expect from the implementation.

For some advanced users who would like to be more aware of what SYCL implementation does under the hood we can provide env variables which would emit extra diagnostics, or prevent AOT -> JIT fallback. Alternatively, such users could use kernel_bundle API to explicitly control device images and their lifecycle - in that case it would be even more obvious for a user that certain kernel is not AOT compiled for a certain device and there is again no need for having properties and dummy device image to detect and report that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If runtime can't find an image which is compatible with a device then corresponding error will be thrown and it should be enough indication for a user that this kernel can't be used on that device and we can adjust our exception messages for that.

The runtime uses the "SYCL/device requirements" property set to construct a meaningful error message like:

Kernel uses optional feature corresponding to 'aspect::fp16' but device does
not support this aspect.

If we delete the row with the empty device image, we also delete the property set, so it becomes impossible to raise a helpful error message like this.

Moreover, if there is no PVC-specific version of a kernel, but there is a generic SPIR-V version of a kernel, then we can silently fallback to JIT and everything will work just fine. As a user, that is exactly what I would expect from the implementation.

This won't work because the SPIR-V module uses some feature that is not supported on PVC in this case. (That's the only reason we substitute the empty device image.) Therefore, we must ensure that the runtime does not try to JIT compile the SPIR-V in this case. If it did invoke the JIT compiler, it would fail with an exception.

less temporary files, less I/O, less work for linker, less work at runtime to skip through those empty device images, etc

This addresses only part of your concern, but sycl-post-link would only need to generate a single "empty.bc" file, and that file can be shared by all rows in all tables that need an empty entry. Also, the runtime does not need to skip over the empty entries. In fact, we want the runtime to find the empty entry, so it can find the corresponding property set.

Comment on lines +1111 to +1121
In the new design, `ocloc` does not embed SPIR-V in its output because we
invoke it with `-exclude_ir`. Instead, DPC++ is responsible for embedding the
SPIR-V in the host executable, but should that SPIR-V be generated with the
triple `spir64-unknown-unknown` or the triple `spir64_gen-unknown-unknown`?
We could generate both, but that seems wasteful. If we generate the SPIR-V
with `spir64-unknown-unknown`, does that mean we need to run the device
compiler twice, once with `spir64_gen-unknown-unknown` (to generate SPIR-V that
is fed to `ocloc`) and once with `spir64-unknown-unknown` (to generate SPIR-V
that is embedded in the host executable)? This also seems wasteful because
it requires running the device compiler twice whenever the user AOT compiles
for an Intel GPU.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I expect that we will be encouraging our users to move away from -fsycl-targets=spir64_gen -Xs "-device pvc" towards -fsycl-targets=spir64_gen_pvc. In that case, situation you describing here would more rather an exception.

We also call the device compiler twice right now for -fsycl-targets=spir64,spir64_gen if I'm not mistaken.

Considering those two facts, I wouldn't worry too much about this case. I think that we can leave it as-is, slowly deprecate and then completely (or almost completely) drop

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I expect that we will be encouraging our users to move away from -fsycl-targets=spir64_gen -Xs "-device pvc" towards -fsycl-targets=spir64_gen_pvc. In that case, situation you describing here would more rather an exception.

The situation I describe occurs with the new command line syntax. Consider a simple case where the user passes -fsycl-targets=intel_gpu_pvc. This will cause the compiler to embed both a PVC and a SPIR-V image in the ELF file. What triple will we use to generate that SPIR-V, spir64-unknown-unknown or spir64_gen-unknown-unknown?

raise a synchronous `errc::kernel_not_supported` exception.
#### Identifying a compatible offload bundle

Since the user may AOT compile for several different offload targets, the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, this should be moved down to PI/UR layer, because it is too backend-specific.

There are different approaches possible:

  • an API which checks if a device image is compatible with particular device/arch
  • an API which accepts a set of device images and returns an index of a "best" for a specified device/target/arch

We have though about adding the second API into our SW stack to implement AOT -> JIT fallback a while ago. However, if we also factor in some extra features that may affect selection of a device image, like presence or manipulations with specialization constants, then the first API may be a better fit for our needs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree that we need some new PI/UR API here. I guess I was trying to avoid the details of where in the runtime we implement the check.

[9]: <https://spec.oneapi.io/level-zero/latest/core/api.html#ze-device-ip-version-ext-t>
[10]: <https://registry.khronos.org/OpenCL/extensions/intel/cl_intel_device_attribute_query.html>

#### Raising an exception when features are incompatible
Copy link
Contributor

@KornevNikita KornevNikita Feb 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gmlueck could we also extend SYCL/device requirements in scope of this PR with the additional property set reflecting the value(-s) of -fsycl-targets flag?

What I mean - previously we had an update to clarify is_compatible()

to tell if a kernel was compiled in a way that restricts the devices on which it can run

At the moment in scope of is_compatible we only check that the device has the same backend as the target backend of the binary image. We can't yet handle the case when -fsycl-targets=nvidia_gpu_sm_86 and the device is sm_90. To check this the image should contain the info from -fsycl-targets, so we probably need to put it into device requirements.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tagging @AerialMantis here because it would be good to get his perspective on how this should work with CUDA.

I was thinking that the solution I outline above for Intel GPU targets would work also for CUDA targets. These are the relevant paragraphs from earlier in the document:

The driver also invokes clang-offload-wrapper to wrap the native offload modules for inclusion in the host executable. Currently, the driver passes -target=spir64 when not doing AOT compilation (i.e. when generating SPIR-V for the target code), and it passes -target=spir64_gen when doing AOT compilation. The target name spir64_gen indicates that the offload bundle is native code that can run on any Intel GPU.

With the new design, the driver still uses -target=spir64 to identify a SPIR-V offload bundle, however it must use a more specific device name for native offload bundles. We use the same target names accepted by -fsycl-targets. For example, if DPC++ is invoked with -fsycl-targets=intel_gpu_pvc,intel_gpu_dg1, then the driver invokes clang-offload-wrapper twice, once with -target=intel_gpu_pvc and once with -target=intel_gpu_dg1.

If we follow this for CUDA, it would work like this. If the user compiles with -fsycl-targets=nvidia_gpu_sm_86, the compiler driver would end up bundling the offload image via clang-offload-wrapper -target=nvidia_gpu_sm_86 .... When the SYCL runtime looks for a compatible device image, it would then decide if the image's target type "nvidia_gpu_sm_86" is compatible with the queue's device.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify - do I understand it correctly, that the idea is to use not just __SYCL_PI_DEVICE_BINARY_TARGET_SPIRV64_X86_64 as target but something like __SYCL_PI_DEVICE_BINARY_TARGET_SPIRV64_X86_64_##DEVICE_NAME?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay. I wanted to talk with @mdtoguchi first to make sure I was on the right track.

The design in this document is written using the old clang offloading design (i.e. the one using clang-offload-wrapper), which is how DPC++ is currently implemented. In that design, the clang-offload-wrapper tool accepts a command line option -target=, which specifies a string name for the offload target device. We currently pass this option as "-target=spir64_gen" when compiling for an Intel GPU device. I think this string is stored in the host ELF file in the pi_device_binary_struct::DeviceTargetSpec field, but I'm not certain of this. (If you know this code better, maybe you can answer.)

My proposal is that we pass a more specific target string here, for example -target=intel_gpu_pvc (or maybe just -target=pvc). This would allow the SYCL runtime to identify a PVC device image by looking at the pi_device_binary_struct::DeviceTargetSpec field.

The exact spelling of these target strings can be decided later. For example, I think we will eventually migrate away from -fsycl-targets= to the standard clang option --offload-arch=. When we do this, we will need to decide on the names of the Intel GPU devices that we will use with --offload-arch=, and we may as well use the same strings for the pi_device_binary_struct::DeviceTargetSpec field.

BTW, I know that we are migrating away from the old clang offloading design towards the new design that uses clang-linker-wrapper. I believe this new offloading design still identifies each device image with some sort of target name, so I think the strategy I'm proposing here still works even with that new design.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(If you know this code better, maybe you can answer.)

Yes, that's exactly what I was talking about. is_compatible(Dev) checks the value of pi_device_binary_struct::DeviceTargetSpec and decides if it matches the backend of the device. I didn't know about that idea:

My proposal is that we pass a more specific target string here

That should work fine for us, thanks for the clarification.

AlexeySachkov pushed a commit that referenced this pull request Apr 8, 2024
…12727)

This PR adds the required changes to `sycl-post-link` support optional
kernel features in AOT mode as described by the design doc in
#12252. Additionally, it also updates
the driver to invoke `sycl-post-link` with a device architecture when
Intel GPU targets are passed in `-fsycl-targets`.
Copy link
Contributor

github-actions bot commented Oct 9, 2024

This pull request is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be automatically closed in 30 days.

@github-actions github-actions bot added the Stale label Oct 9, 2024
@gmlueck gmlueck removed the Stale label Oct 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants