-
Notifications
You must be signed in to change notification settings - Fork 752
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Revise design for optional features and AOT #12252
base: sycl
Are you sure you want to change the base?
Conversation
Revise the design for handling optional kernel features for AOT compilations. There is no need to create a separate tool "aspect-filter" because the filtering can be done in the "sycl-post-link" tool instead. This is better aligned with our long term direction because "sycl-post-link" will need to do some AOT-specific IR transformations in the future as part of the design presented in intel#9127, and those transformation will also make use of the device configuration file. This commit also addresses some issues that were missed before related to embedding several AOT-compiled offload bundles into the host executable.
`opencl-aot`. This would allow us to run the device compiler just once even if | ||
the user AOT compiles for both Intel GPU and CPU. However, this means that the | ||
SPIR-V is generated using a different triple than we use today. It's unclear | ||
what ramification this has. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tagging some front-end people here for their perspective: @tahonermann @elizabethandrews @premanandrao
What are the front-end ramifications if we compile to SPIR-V using the triple spir64-unknown-unknown
vs spir64_gen-unknown-unknown
vs. spir64_x86_64-unknown-unknown
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gmlueck, I can't think of any FE impact in particular. Did you have anything specific in mind? If there are any changes in size/layout of types, this would be more of a downstream component impact than in FE itself. Maybe I am missing the point of your question.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might be wrong, but I had the impression that the triple had some effect on the front-end, like enabling different builtins. If that's not the case, it would be good news.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't comment specifically on spir64
vs spir64_gen
vs spir64_x86_64
, but in general, such differences certainly can effect the FE in terms of predefined macros, enabled features, builtins, etc... Some investigation would be needed to discover what specific differences would be enabled in these cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be worthwhile to do this investigation. If there is a difference, I don't think we are documenting it well to our users. Most people assume that the only difference between JIT compilation (spir64
) vs. AOT compilation (spir64_gen
or spir64_x86_64
) is a tradeoff between portability and runtime performance -- JIT mode is more portable to different h/w but has a higher runtime cost due to the JIT compilation step.
However, if the FE behaves differently in these three modes, then that means there is also a functional difference between JIT compilation and AOT compilation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea of having a single device triple for all targets would be ideal, and also simplifies how things are represented in general, allowing for specific device targets to be specified via an --offload-arch
(community reference here) value without depending on the triple arch_subarch.
`sycl-post-link` generates two file tables: one for PVC and one for DG1. The | ||
driver invokes `ocloc -device pvc` once for each IR module in the PVC file | ||
table, and it invokes `ocloc -device dg1` once for each IR module in the DG1 | ||
file table. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the current toolchain setup, each target device that is provided (i.e. -fsycl-targets=intel_gpu_pvc,intel_gpu_dg1
) will create an individual unique toolchain for each device from creating the IR -> llvm-link -> sycl-post-link -> clang-offload-wrapper.
Given this, would there still be a need to have multiple -o
options for sycl-post-link
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean that the driver will spawn a separate device compiler "clang" for each of those targets? (I.e. clang gets invoked once for "intel_gpu_pvc" and again for "intel_gpu_dg1"?) That would be a compile-time performance regression compared to the current flow. Currently, if you compile for multiple Intel GPU targets via -fsycl-targets=spir64_gen -Xs "-device pvc,dg1"
, then clang gets invoked only once. I thought it would be important to preserve this behavior, so that people don't see a regression in compile-time performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay - this makes sense. Some additional work would be needed in the driver to accommodate the combining of the targets given multiple intel_gpu* values. The expected behavior of the sycl-post-link
call will need to be modified to allow for multiple outputs with the consumption of those outputs to correlate with the new -target values being passed to the individual wrapper calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The expected behavior of the
sycl-post-link
call will need to be modified to allow for multiple outputs with the consumption of those outputs to correlate with the new -target values being passed to the individual wrapper calls.
Yes, I agree. I think this is captured in the sections "Expanded command line options to the post link tool" and "New filtering pass in the post link tool" above. Or did you mean something more?
Co-authored-by: Michael Toguchi <[email protected]>
that are supported by a target device. If any IR module uses an incompatible | ||
feature, the corresponding row in the table is changed to use an empty IR | ||
module instead. To illustrate, let's assume that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If any IR module uses an incompatible feature, the corresponding row in the table is changed to use an empty IR
module instead.
I don't understand why should we preserve those dummy empty modules. I think that we can just avoid adding them t the table in the first place: less temporary files, less I/O, less work for linker, less work at runtime to skip through those empty device images, etc.
Of course, there is a corner case with a completely empty table, but conceptually this seems fine, we just need to make sure that no other tools freak out and handle this properly.
"mod2.sym") are unchanged. It's important to preserve the original properties, | ||
so that the runtime can diagnose an error if the application mistakenly tries | ||
to submit a kernel from this IR module to a device that does not support the | ||
features it uses. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still not entirely convinced that we need those empty device images and their properties. If runtime can't find an image which is compatible with a device then corresponding error will be thrown and it should be enough indication for a user that this kernel can't be used on that device and we can adjust our exception messages for that.
Moreover, if there is no PVC-specific version of a kernel, but there is a generic SPIR-V version of a kernel, then we can silently fallback to JIT and everything will work just fine. As a user, that is exactly what I would expect from the implementation.
For some advanced users who would like to be more aware of what SYCL implementation does under the hood we can provide env variables which would emit extra diagnostics, or prevent AOT -> JIT fallback. Alternatively, such users could use kernel_bundle
API to explicitly control device images and their lifecycle - in that case it would be even more obvious for a user that certain kernel is not AOT compiled for a certain device and there is again no need for having properties and dummy device image to detect and report that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If runtime can't find an image which is compatible with a device then corresponding error will be thrown and it should be enough indication for a user that this kernel can't be used on that device and we can adjust our exception messages for that.
The runtime uses the "SYCL/device requirements" property set to construct a meaningful error message like:
Kernel uses optional feature corresponding to 'aspect::fp16' but device does
not support this aspect.
If we delete the row with the empty device image, we also delete the property set, so it becomes impossible to raise a helpful error message like this.
Moreover, if there is no PVC-specific version of a kernel, but there is a generic SPIR-V version of a kernel, then we can silently fallback to JIT and everything will work just fine. As a user, that is exactly what I would expect from the implementation.
This won't work because the SPIR-V module uses some feature that is not supported on PVC in this case. (That's the only reason we substitute the empty device image.) Therefore, we must ensure that the runtime does not try to JIT compile the SPIR-V in this case. If it did invoke the JIT compiler, it would fail with an exception.
less temporary files, less I/O, less work for linker, less work at runtime to skip through those empty device images, etc
This addresses only part of your concern, but sycl-post-link
would only need to generate a single "empty.bc" file, and that file can be shared by all rows in all tables that need an empty entry. Also, the runtime does not need to skip over the empty entries. In fact, we want the runtime to find the empty entry, so it can find the corresponding property set.
In the new design, `ocloc` does not embed SPIR-V in its output because we | ||
invoke it with `-exclude_ir`. Instead, DPC++ is responsible for embedding the | ||
SPIR-V in the host executable, but should that SPIR-V be generated with the | ||
triple `spir64-unknown-unknown` or the triple `spir64_gen-unknown-unknown`? | ||
We could generate both, but that seems wasteful. If we generate the SPIR-V | ||
with `spir64-unknown-unknown`, does that mean we need to run the device | ||
compiler twice, once with `spir64_gen-unknown-unknown` (to generate SPIR-V that | ||
is fed to `ocloc`) and once with `spir64-unknown-unknown` (to generate SPIR-V | ||
that is embedded in the host executable)? This also seems wasteful because | ||
it requires running the device compiler twice whenever the user AOT compiles | ||
for an Intel GPU. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I expect that we will be encouraging our users to move away from -fsycl-targets=spir64_gen -Xs "-device pvc"
towards -fsycl-targets=spir64_gen_pvc
. In that case, situation you describing here would more rather an exception.
We also call the device compiler twice right now for -fsycl-targets=spir64,spir64_gen
if I'm not mistaken.
Considering those two facts, I wouldn't worry too much about this case. I think that we can leave it as-is, slowly deprecate and then completely (or almost completely) drop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I expect that we will be encouraging our users to move away from
-fsycl-targets=spir64_gen -Xs "-device pvc"
towards-fsycl-targets=spir64_gen_pvc
. In that case, situation you describing here would more rather an exception.
The situation I describe occurs with the new command line syntax. Consider a simple case where the user passes -fsycl-targets=intel_gpu_pvc
. This will cause the compiler to embed both a PVC and a SPIR-V image in the ELF file. What triple will we use to generate that SPIR-V, spir64-unknown-unknown
or spir64_gen-unknown-unknown
?
raise a synchronous `errc::kernel_not_supported` exception. | ||
#### Identifying a compatible offload bundle | ||
|
||
Since the user may AOT compile for several different offload targets, the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally, this should be moved down to PI/UR layer, because it is too backend-specific.
There are different approaches possible:
- an API which checks if a device image is compatible with particular device/arch
- an API which accepts a set of device images and returns an index of a "best" for a specified device/target/arch
We have though about adding the second API into our SW stack to implement AOT -> JIT fallback a while ago. However, if we also factor in some extra features that may affect selection of a device image, like presence or manipulations with specialization constants, then the first API may be a better fit for our needs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree that we need some new PI/UR API here. I guess I was trying to avoid the details of where in the runtime we implement the check.
[9]: <https://spec.oneapi.io/level-zero/latest/core/api.html#ze-device-ip-version-ext-t> | ||
[10]: <https://registry.khronos.org/OpenCL/extensions/intel/cl_intel_device_attribute_query.html> | ||
|
||
#### Raising an exception when features are incompatible |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gmlueck could we also extend SYCL/device requirements
in scope of this PR with the additional property set reflecting the value(-s) of -fsycl-targets
flag?
What I mean - previously we had an update to clarify is_compatible()
to tell if a kernel was compiled in a way that restricts the devices on which it can run
At the moment in scope of is_compatible
we only check that the device has the same backend as the target backend of the binary image. We can't yet handle the case when -fsycl-targets=nvidia_gpu_sm_86
and the device is sm_90
. To check this the image should contain the info from -fsycl-targets
, so we probably need to put it into device requirements
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tagging @AerialMantis here because it would be good to get his perspective on how this should work with CUDA.
I was thinking that the solution I outline above for Intel GPU targets would work also for CUDA targets. These are the relevant paragraphs from earlier in the document:
The driver also invokes
clang-offload-wrapper
to wrap the native offload modules for inclusion in the host executable. Currently, the driver passes-target=spir64
when not doing AOT compilation (i.e. when generating SPIR-V for the target code), and it passes-target=spir64_gen
when doing AOT compilation. The target namespir64_gen
indicates that the offload bundle is native code that can run on any Intel GPU.With the new design, the driver still uses
-target=spir64
to identify a SPIR-V offload bundle, however it must use a more specific device name for native offload bundles. We use the same target names accepted by-fsycl-targets
. For example, if DPC++ is invoked with-fsycl-targets=intel_gpu_pvc,intel_gpu_dg1
, then the driver invokesclang-offload-wrapper
twice, once with-target=intel_gpu_pvc
and once with-target=intel_gpu_dg1
.
If we follow this for CUDA, it would work like this. If the user compiles with -fsycl-targets=nvidia_gpu_sm_86
, the compiler driver would end up bundling the offload image via clang-offload-wrapper -target=nvidia_gpu_sm_86 ...
. When the SYCL runtime looks for a compatible device image, it would then decide if the image's target type "nvidia_gpu_sm_86" is compatible with the queue's device.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify - do I understand it correctly, that the idea is to use not just __SYCL_PI_DEVICE_BINARY_TARGET_SPIRV64_X86_64
as target but something like __SYCL_PI_DEVICE_BINARY_TARGET_SPIRV64_X86_64_##DEVICE_NAME
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay. I wanted to talk with @mdtoguchi first to make sure I was on the right track.
The design in this document is written using the old clang offloading design (i.e. the one using clang-offload-wrapper
), which is how DPC++ is currently implemented. In that design, the clang-offload-wrapper
tool accepts a command line option -target=
, which specifies a string name for the offload target device. We currently pass this option as "-target=spir64_gen" when compiling for an Intel GPU device. I think this string is stored in the host ELF file in the pi_device_binary_struct::DeviceTargetSpec
field, but I'm not certain of this. (If you know this code better, maybe you can answer.)
My proposal is that we pass a more specific target string here, for example -target=intel_gpu_pvc
(or maybe just -target=pvc
). This would allow the SYCL runtime to identify a PVC device image by looking at the pi_device_binary_struct::DeviceTargetSpec
field.
The exact spelling of these target strings can be decided later. For example, I think we will eventually migrate away from -fsycl-targets=
to the standard clang option --offload-arch=
. When we do this, we will need to decide on the names of the Intel GPU devices that we will use with --offload-arch=
, and we may as well use the same strings for the pi_device_binary_struct::DeviceTargetSpec
field.
BTW, I know that we are migrating away from the old clang offloading design towards the new design that uses clang-linker-wrapper
. I believe this new offloading design still identifies each device image with some sort of target name, so I think the strategy I'm proposing here still works even with that new design.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(If you know this code better, maybe you can answer.)
Yes, that's exactly what I was talking about. is_compatible(Dev)
checks the value of pi_device_binary_struct::DeviceTargetSpec
and decides if it matches the backend of the device. I didn't know about that idea:
My proposal is that we pass a more specific target string here
That should work fine for us, thanks for the clarification.
This pull request is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be automatically closed in 30 days. |
Revise the design for handling optional kernel features for AOT compilations. There is no need to create a separate tool "aspect-filter" because the filtering can be done in the "sycl-post-link" tool instead. This is better aligned with our long term direction because "sycl-post-link" will need to do some AOT-specific IR transformations in the future as part of the design presented in #9127, and those transformation will also make use of the device configuration file.
This commit also addresses some issues that were missed before related to embedding several AOT-compiled offload bundles into the host executable.