-
Notifications
You must be signed in to change notification settings - Fork 754
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RFC] thinLTO for SYCL #15083
base: sycl
Are you sure you want to change the base?
[RFC] thinLTO for SYCL #15083
Changes from 7 commits
817a0c9
70ee0e0
00a28c1
d3ccfd1
8c4edb3
3322684
43df8c9
9922d30
1deaea2
c68e797
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,5 +1,6 @@ | ||
set(LLVM_LINK_COMPONENTS | ||
${LLVM_TARGETS_TO_BUILD} | ||
BitReader | ||
BitWriter | ||
Core | ||
BinaryFormat | ||
|
Large diffs are not rendered by default.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -114,9 +114,4 @@ DEVICE_EXTERN_C void __devicelib_assert_fail(const char *expr, const char *file, | |
__assertfail(expr, file, line, func, 1); | ||
} | ||
|
||
DEVICE_EXTERN_C void _wassert(const char *_Message, const char *_File, | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I feel like this is a change that can be merged and submitted separately. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't even know if it's correct, I just hit a build error on windows about |
||
unsigned _Line) { | ||
__assertfail(_Message, _File, _Line, 0, 1); | ||
} | ||
|
||
#endif |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
//===-- SYCLLinkedModuleProcessor.h - finalize a fully linked module ---===// | ||
// | ||
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. | ||
// See https://llvm.org/LICENSE.txt for license information. | ||
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception | ||
// | ||
//===----------------------------------------------------------------------===// | ||
// | ||
// The file contains a number of functions to create a pass that can be called | ||
// by the LTO backend that will finalize a fully-linked module. | ||
//===----------------------------------------------------------------------===// | ||
#pragma once | ||
#include "SpecConstants.h" | ||
namespace llvm { | ||
|
||
class PassRegistry; | ||
class ModulePass; | ||
ModulePass * | ||
createSYCLLinkedModuleProcessorPass(llvm::SpecConstantsPass::HandlingMode); | ||
void initializeSYCLLinkedModuleProcessorPass(PassRegistry &); | ||
|
||
} // namespace llvm |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,45 @@ | ||
//===-- SYCLLinkedModuleProcessor.cpp - finalize a fully linked module ---===// | ||
// | ||
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. | ||
// See https://llvm.org/LICENSE.txt for license information. | ||
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception | ||
// | ||
//===----------------------------------------------------------------------===// | ||
// See comments in the header. | ||
//===----------------------------------------------------------------------===// | ||
|
||
#include "llvm/SYCLLowerIR/SYCLLinkedModuleProcessor.h" | ||
|
||
#include "llvm/Pass.h" | ||
|
||
#define DEBUG_TYPE "sycl-linked-module-processor" | ||
using namespace llvm; | ||
|
||
namespace { | ||
class SYCLLinkedModuleProcessor : public ModulePass { | ||
public: | ||
static char ID; | ||
SYCLLinkedModuleProcessor(SpecConstantsPass::HandlingMode Mode) | ||
: ModulePass(ID), Mode(Mode) { | ||
initializeSYCLLinkedModuleProcessorPass(*PassRegistry::getPassRegistry()); | ||
} | ||
|
||
bool runOnModule(Module &M) override { | ||
// TODO: determine if we need to run other passes | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If I understand correctly, that's an equivalent of what's being run by
If we also taking about what happens after
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. So when we do early splitting in There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I believe that most of E2E are single-file tests with no There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I suggest adding tests similar to |
||
ModuleAnalysisManager MAM; | ||
SpecConstantsPass SCP(Mode); | ||
auto PA = SCP.run(M, MAM); | ||
return !PA.areAllPreserved(); | ||
} | ||
|
||
private: | ||
SpecConstantsPass::HandlingMode Mode; | ||
}; | ||
} // namespace | ||
char SYCLLinkedModuleProcessor::ID = 0; | ||
INITIALIZE_PASS(SYCLLinkedModuleProcessor, "SYCLLinkedModuleProcessor", | ||
"Finalize a fully linked SYCL module", false, false) | ||
ModulePass *llvm::createSYCLLinkedModuleProcessorPass( | ||
SpecConstantsPass::HandlingMode Mode) { | ||
return new SYCLLinkedModuleProcessor(Mode); | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,147 @@ | ||
# ThinLTO for SYCL | ||
|
||
This document describes the purpose and design of ThinLTO for SYCL. | ||
|
||
**NOTE**: This is not the final version. The document is still in progress. | ||
|
||
## Background | ||
|
||
With traditional SYCL device code linking, all user code is linked together | ||
along with device libraries into a single huge module and then split and | ||
processed by `sycl-post-link`. This requires sequential processing, has a large | ||
memory footprint, and differs from the linking flow for AMD and NVIDIA devices. | ||
|
||
## Summary | ||
SYCL ThinLTO will hook into the existing community mechanism to run LTO as part | ||
of device linking inside `clang-linker-wrapper`. We split the device images | ||
early at compilation time, and at link time we use ThinLTO's function importing | ||
feature | ||
to bring in the defintions for referenced functions. Only the new offload model | ||
is supported. | ||
|
||
## Device code compilation time changes | ||
Most of the changes for ThinLTO occur during device link time, however there is | ||
one major change during compilation (-c) time: we now run device code split | ||
during compilaton instead of linking. | ||
The main reason for doing this is increased parallelization. Many compilation | ||
jobs can be run at the same time, but linking happens once total for the | ||
application. Device code split is currently a common source of performance | ||
issues. | ||
|
||
Splitting early means that the resulting IR after splitting is not complete, it | ||
still may contain calls to functions (user code and/or the SYCL device | ||
libraries) from other object files. | ||
|
||
We rely on the assumption that all function defintions matching a declaration | ||
will be the same and we can let ThinLTO pull in any one. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. C++ one definition rule guarantees this property of the code, doesn't it? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think it depends what the original IR linkage is. if the original IR is |
||
|
||
For example, let's start with user device code that defines a `SYCL_EXTERNAL` | ||
function `foo` in translation unit `tu_foo`. There is also another translation | ||
unit `tu_bar` that references `foo`. | ||
During the early device code splitting run of `tu_foo`, we may find that more | ||
than one of the resultant device images contain a defintion for `foo`. | ||
|
||
We assert that any function defintion for `foo` that is deemed a match by the | ||
ThinLTO infrastruction during the processing of `tu_bar` is valid. | ||
|
||
As a result of running early device code split, the fat object file generated | ||
as part of device compilation may contain multiple device code images. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What would be the linkage type of Can this process duplicate SYCL kernel function definitions? If so, is SYCL runtime can handle this duplication? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. After split, the linkage should be the same as it was before split. After ThinLTO runs, it could have the same linkage as after splitting or it could be internalized (not yet implemented) I don't think there is any way to get multiple kernel definitions in a way that isn't already possible with splitting. Maybe @AlexeySachkov has an idea. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I'm not 100% sure, but this might cause "multiple definition" problem. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'll have to try some examples and tests and see if we hit a problem like this, in my testing I've never seen a duplicate symbol problem, only undefined symbol when importing fails for some reason, but of course we may just not have a test for the failing case. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
We shouldn't ever duplicate kernels in our compiler stack, that won't be properly handled at RT for multiple reasons. |
||
|
||
# Device code link time changes | ||
|
||
Before we go into the link time changes for SYCL, let's understand the device | ||
linking flow for community devices (AMD/NVIDIA): | ||
|
||
 | ||
|
||
SYCL has two differenting requirements: | ||
1) The SPIR-V backend is not production ready and the SPIR-V translator is used. | ||
2) The SYCL runtime requires metadata (module properties and module symbol | ||
table) computed from device images that will be stored along the device images | ||
in the fat executable. | ||
|
||
The effect of requirement 1) is that instead of letting ThinLTO call the SPIR-V | ||
backend, we add a callback that runs right before codegen would run. | ||
In that callback, we call the SPIR-V translator and store the resultant file | ||
path for use later, and we instruct the ThinLTO framework to not | ||
perform codegen. | ||
|
||
An interesting additional fact about requirement 2) is that we actually need to | ||
process fully linked module to accurate compute the module properties. One | ||
example where we need the full module is to [compute the required devicelib mask](https://github.com/intel/llvm/blob/sycl/llvm/lib/SYCLLowerIR/SYCLDeviceLibReqMask.cpp). | ||
If we only process the device code that was included in the | ||
original fat object input to `clang-linker-wrapper`, we will miss devicelib | ||
calls in referenced `SYCL_EXTERNAL` functions. | ||
|
||
The effect of requirement 2) is that we store the fully linked device image for | ||
metadata computation in the SYCL-specific handing code after the ThinLTO | ||
framework has completed. Another option would be to try to compute the metadata | ||
inside the ThinLTO framework callbacks, but this would require SYCL-specific | ||
arguments to many caller functions in the stack and pollute community code. | ||
|
||
Here is the current ThinLTO flow for SYCL: | ||
|
||
 | ||
|
||
We add a `PreCodeGenModuleHook` function to the `LTOConfig` object so that we | ||
can process the fully linked module without running the backend. | ||
|
||
However, the flow is not ideal for many reasons: | ||
1) We are relying on the external `llvm-spirv` tool instead of the SPIR-V | ||
backend. We could slightly improve this issue by using a library call to the | ||
SPIR-V translator instead of the tool, however the library API requires setting | ||
up an object to represent the arguments while we only have strings, and it's | ||
non-trivial to parse the strings to figure out how to create the argument | ||
object. Since we plan to use the SPIR-V backend in the long term, this does not | ||
seem to be worth the effort. | ||
|
||
2) We manually run passes inside `PreCodeGenModuleHook`. This is because we | ||
don't run codegen, so we can't take advantage of the `PreCodeGenPassesHook` | ||
field of `LTOConfig` to run some custom passes, as those passes are only run | ||
when we actually are going to run codegen. | ||
|
||
3) We have to store the fully linked module. This is needed because we need a | ||
fully linked module to accurately compute metadata, see the above explanation | ||
of SYCL requirement 2). We could get around storing the module by computing the | ||
metadata inside the LTO framework and storing it for late use by the SYCL | ||
bundling code, but doing this would require even more SYCL-only customizations including | ||
even more new function arguments and modifications of the `OffloadFile` class. | ||
There are also compliations because the LTO framework is multithreaded, and not all | ||
LLVM data structures are thread safe. | ||
|
||
The proposed long-term SYCL ThinLTO flow is as follows: | ||
|
||
 | ||
|
||
The biggest difference here is that we are running codegen using the SPIR-V | ||
backend. | ||
|
||
Also, instead of using a lambda function in the `PreCodeGenModuleHook` | ||
callback to run SYCL finalization passes, we can take advantage of the `PreCodeGenPassesHook` field to add | ||
passes to the pass manager that the LTO framework will run. | ||
|
||
It is possible that the number of device images in the fat executable | ||
and which device image contains which kernel is different with ThinLTO | ||
enabled, but we do expect this to have any impact on correctness or | ||
performance, nor we do expect users to care. | ||
|
||
|
||
# Current limitations | ||
|
||
`-O0`: Compiling with `-O0` prevent clang from generating ThinLTO metadata | ||
during the compilation phase. In the current implementation, this is an error. | ||
In the final version, we could either silently fall back to full LTO or | ||
generate ThinLTO metadata even for `-O0`. | ||
|
||
SYCL libdevice: Current all `libdevice` functions are explicitly marked to be | ||
weak symbols. The ThinLTO framework does not consider a defintion of function | ||
with weak linkage as it cannot be sure that this definiton is the correct one. | ||
Ideally we could remove the weak symbol annotation. | ||
|
||
No binary linkage: The SPIR-V target does not currently have a production | ||
quality binary linker. This means that we must generate a fully linked image as | ||
part of device linkage. At least for AMD devices, this is not a requirement as | ||
`lld` is used for the final link which can resolve any unresolved symbols. | ||
`-fno-gpu-rdc` is default for AMD, so in that case it can call `lld` during | ||
compile, but if `-fno-gpu-rdc` is passed, the lld call happens as part of | ||
sarnex marked this conversation as resolved.
Show resolved
Hide resolved
|
||
`clang-linker-wrapper` to resolve any symbols not resolved by ThinLTO. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is required to get libdevice functions linked in by the thinLTO function importing infrastructure, see here. I'm looking for a better solution for this, I just kept this here in case anybody plans on trying the prototype.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose importing devicelib symbols at compile step can be a solution (see #15114).
On the other hand, I recall discussing the possibility of linking device libraries with upstream maintainers, who expressed a preference for shifting device library linking from the "compile" to the "link" step. It would be ideal if we could discover a solution that aligns with the long-term strategy of upstream and enables us to utilize the thinLTO framework for offload code linking, thereby avoiding the use of weak symbols.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
attention to @mdtoguchi who has been looking at importing devicelib at compile step from the SYCL perspective.
Point to note: During one of the LLVM community presentation, it was mentioned that they are trying to move importing devicelib to link time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we already perform device library linking at link time we can consider abandoning the efforts to pull them into the compilation step. My main concern with performing at the link step is the communication required from the driver to the
clang-linker-wrapper
informing which device libraries should be linked. The less tie-in we have between the driver and theclang-linker-wrapper
at link time, the better. IMO, at the very least the linker wrapper should know a minimum default device libraries to link and any communication from the driver is manipulating that list.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @mdtoguchi. Unless user wants to change the names/location or disable linking of device libraries, driver should not have any logic to handle device code linking other than invoking
clang-linker-wrapper
. It makes sense to have driver options for additional configuration of device libraries, but driver's implementation should be just passing corresponding values toclang-linker-wrapper
where these options should be processed.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is very interesting.
While working on #15114 I've been wondering whether there is a particular reason why we link against CUDA libdevice and libclc in the compile step, but also again in the link step.
Could I get some clarification on that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Naghasan, @npmiller, are you able to help here?