Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] thinLTO for SYCL #15083

Open
wants to merge 10 commits into
base: sycl
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 7 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions clang/tools/clang-linker-wrapper/CMakeLists.txt
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
set(LLVM_LINK_COMPONENTS
${LLVM_TARGETS_TO_BUILD}
BitReader
BitWriter
Core
BinaryFormat
Expand Down
535 changes: 426 additions & 109 deletions clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions libdevice/device.h
Original file line number Diff line number Diff line change
Expand Up @@ -17,9 +17,9 @@

#if defined(__SPIR__) || defined(__SPIRV__) || defined(__NVPTX__)
#ifdef __SYCL_DEVICE_ONLY__
#define DEVICE_EXTERNAL SYCL_EXTERNAL __attribute__((weak))
#define DEVICE_EXTERNAL SYCL_EXTERNAL
#else // __SYCL_DEVICE_ONLY__
#define DEVICE_EXTERNAL __attribute__((weak))
#define DEVICE_EXTERNAL
#endif // __SYCL_DEVICE_ONLY__
Comment on lines +20 to 23
Copy link
Contributor Author

@sarnex sarnex Aug 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is required to get libdevice functions linked in by the thinLTO function importing infrastructure, see here. I'm looking for a better solution for this, I just kept this here in case anybody plans on trying the prototype.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose importing devicelib symbols at compile step can be a solution (see #15114).

On the other hand, I recall discussing the possibility of linking device libraries with upstream maintainers, who expressed a preference for shifting device library linking from the "compile" to the "link" step. It would be ideal if we could discover a solution that aligns with the long-term strategy of upstream and enables us to utilize the thinLTO framework for offload code linking, thereby avoiding the use of weak symbols.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

attention to @mdtoguchi who has been looking at importing devicelib at compile step from the SYCL perspective.
Point to note: During one of the LLVM community presentation, it was mentioned that they are trying to move importing devicelib to link time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we already perform device library linking at link time we can consider abandoning the efforts to pull them into the compilation step. My main concern with performing at the link step is the communication required from the driver to the clang-linker-wrapper informing which device libraries should be linked. The less tie-in we have between the driver and the clang-linker-wrapper at link time, the better. IMO, at the very least the linker wrapper should know a minimum default device libraries to link and any communication from the driver is manipulating that list.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @mdtoguchi. Unless user wants to change the names/location or disable linking of device libraries, driver should not have any logic to handle device code linking other than invoking clang-linker-wrapper. It makes sense to have driver options for additional configuration of device libraries, but driver's implementation should be just passing corresponding values to clang-linker-wrapper where these options should be processed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is very interesting.
While working on #15114 I've been wondering whether there is a particular reason why we link against CUDA libdevice and libclc in the compile step, but also again in the link step.
Could I get some clarification on that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is very interesting. While working on #15114 I've been wondering whether there is a particular reason why we link against CUDA libdevice and libclc in the compile step, but also again in the link step. Could I get some clarification on that?

@Naghasan, @npmiller, are you able to help here?


#define DEVICE_EXTERN_C DEVICE_EXTERNAL EXTERN_C
Expand Down
5 changes: 0 additions & 5 deletions libdevice/fallback-cassert.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -114,9 +114,4 @@ DEVICE_EXTERN_C void __devicelib_assert_fail(const char *expr, const char *file,
__assertfail(expr, file, line, func, 1);
}

DEVICE_EXTERN_C void _wassert(const char *_Message, const char *_File,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this is a change that can be merged and submitted separately. _wassert is a wrapper for MSVC's assert implementation to redirect it to ours, so it really shouldn't be implemented in fallback library

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't even know if it's correct, I just hit a build error on windows about _wassert defined twice, probably it works today because they're all weak symbols but I remove that as part of this PR.

unsigned _Line) {
__assertfail(_Message, _File, _Line, 0, 1);
}

#endif
1 change: 1 addition & 0 deletions llvm/include/llvm/SYCLLowerIR/ModuleSplitter.h
Original file line number Diff line number Diff line change
Expand Up @@ -321,6 +321,7 @@ splitSYCLModule(std::unique_ptr<Module> M, ModuleSplitterSettings Settings);

bool isESIMDFunction(const Function &F);
bool canBeImportedFunction(const Function &F);
bool isEntryPoint(const Function &F, bool EmitOnlyKernelsAsEntryPoints);

} // namespace module_split

Expand Down
22 changes: 22 additions & 0 deletions llvm/include/llvm/SYCLLowerIR/SYCLLinkedModuleProcessor.h
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
//===-- SYCLLinkedModuleProcessor.h - finalize a fully linked module ---===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
//
// The file contains a number of functions to create a pass that can be called
// by the LTO backend that will finalize a fully-linked module.
//===----------------------------------------------------------------------===//
#pragma once
#include "SpecConstants.h"
namespace llvm {

class PassRegistry;
class ModulePass;
ModulePass *
createSYCLLinkedModuleProcessorPass(llvm::SpecConstantsPass::HandlingMode);
void initializeSYCLLinkedModuleProcessorPass(PassRegistry &);

} // namespace llvm
4 changes: 2 additions & 2 deletions llvm/lib/LTO/LTO.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1077,8 +1077,8 @@ Error LTO::addThinLTO(BitcodeModule BM, ArrayRef<InputFile::Symbol> Syms,
for (const std::string &Name : Conf.ThinLTOModulesToCompile) {
if (BM.getModuleIdentifier().contains(Name)) {
ThinLTO.ModulesToCompile->insert({BM.getModuleIdentifier(), BM});
llvm::errs() << "[ThinLTO] Selecting " << BM.getModuleIdentifier()
<< " to compile\n";
LLVM_DEBUG(dbgs() << "[ThinLTO] Selecting " << BM.getModuleIdentifier()
<< " to compile\n");
}
}
}
Expand Down
1 change: 1 addition & 0 deletions llvm/lib/SYCLLowerIR/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -65,6 +65,7 @@ add_llvm_component_library(LLVMSYCLLowerIR
SYCLDeviceRequirements.cpp
SYCLKernelParamOptInfo.cpp
SYCLJointMatrixTransform.cpp
SYCLLinkedModuleProcessor.cpp
SYCLPropagateAspectsUsage.cpp
SYCLPropagateJointMatrixUsage.cpp
SYCLVirtualFunctionsAnalysis.cpp
Expand Down
52 changes: 26 additions & 26 deletions llvm/lib/SYCLLowerIR/ModuleSplitter.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -117,32 +117,6 @@ bool isKernel(const Function &F) {
F.getCallingConv() == CallingConv::AMDGPU_KERNEL;
}

bool isEntryPoint(const Function &F, bool EmitOnlyKernelsAsEntryPoints) {
// Skip declarations, if any: they should not be included into a vector of
// entry points groups or otherwise we will end up with incorrectly generated
// list of symbols.
if (F.isDeclaration())
return false;

// Kernels are always considered to be entry points
if (isKernel(F))
return true;

if (!EmitOnlyKernelsAsEntryPoints) {
// If not disabled, SYCL_EXTERNAL functions with sycl-module-id attribute
// are also considered as entry points (except __spirv_* and __sycl_*
// functions)
return llvm::sycl::utils::isSYCLExternalFunction(&F) &&
!isSpirvSyclBuiltin(F.getName()) && !isESIMDBuiltin(F.getName()) &&
!isGenericBuiltin(F.getName());
}

// Even if we are emitting only kernels as entry points, virtual functions
// should still be treated as entry points, because they are going to be
// outlined into separate device images and linked in later.
return F.hasFnAttribute("indirectly-callable");
}

// Represents "dependency" or "use" graph of global objects (functions and
// global variables) in a module. It is used during device code split to
// understand which global variables and functions (other than entry points)
Expand Down Expand Up @@ -445,6 +419,32 @@ class ModuleSplitter : public ModuleSplitterBase {
namespace llvm {
namespace module_split {

bool isEntryPoint(const Function &F, bool EmitOnlyKernelsAsEntryPoints) {
// Skip declarations, if any: they should not be included into a vector of
// entry points groups or otherwise we will end up with incorrectly generated
// list of symbols.
if (F.isDeclaration())
return false;

// Kernels are always considered to be entry points
if (isKernel(F))
return true;

if (!EmitOnlyKernelsAsEntryPoints) {
// If not disabled, SYCL_EXTERNAL functions with sycl-module-id attribute
// are also considered as entry points (except __spirv_* and __sycl_*
// functions)
return llvm::sycl::utils::isSYCLExternalFunction(&F) &&
!isSpirvSyclBuiltin(F.getName()) && !isESIMDBuiltin(F.getName()) &&
!isGenericBuiltin(F.getName());
}

// Even if we are emitting only kernels as entry points, virtual functions
// should still be treated as entry points, because they are going to be
// outlined into separate device images and linked in later.
return F.hasFnAttribute("indirectly-callable");
}

std::optional<IRSplitMode> convertStringToSplitMode(StringRef S) {
static const StringMap<IRSplitMode> Values = {{"kernel", SPLIT_PER_KERNEL},
{"source", SPLIT_PER_TU},
Expand Down
45 changes: 45 additions & 0 deletions llvm/lib/SYCLLowerIR/SYCLLinkedModuleProcessor.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,45 @@
//===-- SYCLLinkedModuleProcessor.cpp - finalize a fully linked module ---===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
// See comments in the header.
//===----------------------------------------------------------------------===//

#include "llvm/SYCLLowerIR/SYCLLinkedModuleProcessor.h"

#include "llvm/Pass.h"

#define DEBUG_TYPE "sycl-linked-module-processor"
using namespace llvm;

namespace {
class SYCLLinkedModuleProcessor : public ModulePass {
public:
static char ID;
SYCLLinkedModuleProcessor(SpecConstantsPass::HandlingMode Mode)
: ModulePass(ID), Mode(Mode) {
initializeSYCLLinkedModuleProcessorPass(*PassRegistry::getPassRegistry());
}

bool runOnModule(Module &M) override {
// TODO: determine if we need to run other passes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, that's an equivalent of what's being run by sycl-post-link after device code split is performed. If so, then we have the following other transformations applied at this stage:

  • ESIMD handling, which includes some special module fixup for invoke_simd, as well as potential additional split by ESIMD followed up by optional linking that back
  • Generation of a separate device image with default values of spec constants

If we also taking about what happens after llvm-link but before device code split, then it is also:

  • Something about invoke_simd
  • Sanitizer-related passes
  • Joint matrix passes

Copy link
Contributor Author

@sarnex sarnex Sep 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So when we do early splitting in -c we actually run sycl-post-link in full, including all those passes. So in that case, we only need to run passes here that need the fully linked module. If we decide to change the design such that we do only split in -c but no passes, then we would need every pass that sycl-post-link runs. In the current implementation ~2100/2200 E2E tests are passing, so it seems most passes don't need the full module and running it early does the right thing, at least for the test cases we have.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the current implementation ~2100/2000 E2E tests are passing, so it seems most passes don't need the full module, at least for the test cases we have.

I believe that most of E2E are single-file tests with no SYCL_EXTERNAL dependencies. Even SYCL-CTS won't help you here. I suppose that we need more or less real-life applications here to be sure and gather more data if we need it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest adding tests similar to sycl/test-e2e/Basic/multisource.cpp.

ModuleAnalysisManager MAM;
SpecConstantsPass SCP(Mode);
auto PA = SCP.run(M, MAM);
return !PA.areAllPreserved();
}

private:
SpecConstantsPass::HandlingMode Mode;
};
} // namespace
char SYCLLinkedModuleProcessor::ID = 0;
INITIALIZE_PASS(SYCLLinkedModuleProcessor, "SYCLLinkedModuleProcessor",
"Finalize a fully linked SYCL module", false, false)
ModulePass *llvm::createSYCLLinkedModuleProcessorPass(
SpecConstantsPass::HandlingMode Mode) {
return new SYCLLinkedModuleProcessor(Mode);
}
3 changes: 3 additions & 0 deletions sycl/doc/design/CompilerAndRuntimeDesign.md
Original file line number Diff line number Diff line change
Expand Up @@ -550,6 +550,9 @@ unit)
- `off` - disables device code split. If `-fno-sycl-rdc` is specified, the behavior is
the same as `per_source`

If ThinLTO is enabled, device code splitting is run during the compilation stage.
See [here](ThinLTO.md) for more information.

##### Symbol table generation

TBD
Expand Down
147 changes: 147 additions & 0 deletions sycl/doc/design/ThinLTO.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,147 @@
# ThinLTO for SYCL

This document describes the purpose and design of ThinLTO for SYCL.

**NOTE**: This is not the final version. The document is still in progress.

## Background

With traditional SYCL device code linking, all user code is linked together
along with device libraries into a single huge module and then split and
processed by `sycl-post-link`. This requires sequential processing, has a large
memory footprint, and differs from the linking flow for AMD and NVIDIA devices.

## Summary
SYCL ThinLTO will hook into the existing community mechanism to run LTO as part
of device linking inside `clang-linker-wrapper`. We split the device images
early at compilation time, and at link time we use ThinLTO's function importing
feature
to bring in the defintions for referenced functions. Only the new offload model
is supported.

## Device code compilation time changes
Most of the changes for ThinLTO occur during device link time, however there is
one major change during compilation (-c) time: we now run device code split
during compilaton instead of linking.
The main reason for doing this is increased parallelization. Many compilation
jobs can be run at the same time, but linking happens once total for the
application. Device code split is currently a common source of performance
issues.

Splitting early means that the resulting IR after splitting is not complete, it
still may contain calls to functions (user code and/or the SYCL device
libraries) from other object files.

We rely on the assumption that all function defintions matching a declaration
will be the same and we can let ThinLTO pull in any one.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C++ one definition rule guarantees this property of the code, doesn't it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it depends what the original IR linkage is. if the original IR is linkonce_odr or something similar I think yes, but I don't know if we can guarantee every SYCL function will have that linkage (at least for libdevice it not this way in syclos HEAD)


For example, let's start with user device code that defines a `SYCL_EXTERNAL`
function `foo` in translation unit `tu_foo`. There is also another translation
unit `tu_bar` that references `foo`.
During the early device code splitting run of `tu_foo`, we may find that more
than one of the resultant device images contain a defintion for `foo`.

We assert that any function defintion for `foo` that is deemed a match by the
ThinLTO infrastruction during the processing of `tu_bar` is valid.

As a result of running early device code split, the fat object file generated
as part of device compilation may contain multiple device code images.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would be the linkage type of foo definitions? We need to make sure that device images are linkable i.e. foo definitions will not conflict at link time.

Can this process duplicate SYCL kernel function definitions? If so, is SYCL runtime can handle this duplication?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After split, the linkage should be the same as it was before split.

After ThinLTO runs, it could have the same linkage as after splitting or it could be internalized (not yet implemented)

I don't think there is any way to get multiple kernel definitions in a way that isn't already possible with splitting. Maybe @AlexeySachkov has an idea.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After split, the linkage should be the same as it was before split.

I'm not 100% sure, but this might cause "multiple definition" problem. tu_foo has only one foo definition so using external is fine (assuming that all other modules referencing foo use different linkage types), but after split we will have foo defined in multiple modules. I'm not sure if LLVM allows linking modules where foo will have external linkage type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll have to try some examples and tests and see if we hit a problem like this, in my testing I've never seen a duplicate symbol problem, only undefined symbol when importing fails for some reason, but of course we may just not have a test for the failing case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there is any way to get multiple kernel definitions in a way that isn't already possible with splitting. Maybe @AlexeySachkov has an idea.

We shouldn't ever duplicate kernels in our compiler stack, that won't be properly handled at RT for multiple reasons.


# Device code link time changes

Before we go into the link time changes for SYCL, let's understand the device
linking flow for community devices (AMD/NVIDIA):

![Community linking flow](images/ThinLTOCommunityFlow.svg)

SYCL has two differenting requirements:
1) The SPIR-V backend is not production ready and the SPIR-V translator is used.
2) The SYCL runtime requires metadata (module properties and module symbol
table) computed from device images that will be stored along the device images
in the fat executable.

The effect of requirement 1) is that instead of letting ThinLTO call the SPIR-V
backend, we add a callback that runs right before codegen would run.
In that callback, we call the SPIR-V translator and store the resultant file
path for use later, and we instruct the ThinLTO framework to not
perform codegen.

An interesting additional fact about requirement 2) is that we actually need to
process fully linked module to accurate compute the module properties. One
example where we need the full module is to [compute the required devicelib mask](https://github.com/intel/llvm/blob/sycl/llvm/lib/SYCLLowerIR/SYCLDeviceLibReqMask.cpp).
If we only process the device code that was included in the
original fat object input to `clang-linker-wrapper`, we will miss devicelib
calls in referenced `SYCL_EXTERNAL` functions.

The effect of requirement 2) is that we store the fully linked device image for
metadata computation in the SYCL-specific handing code after the ThinLTO
framework has completed. Another option would be to try to compute the metadata
inside the ThinLTO framework callbacks, but this would require SYCL-specific
arguments to many caller functions in the stack and pollute community code.

Here is the current ThinLTO flow for SYCL:

![SYCL linking flow](images/ThinLTOSYCLFlow.svg)

We add a `PreCodeGenModuleHook` function to the `LTOConfig` object so that we
can process the fully linked module without running the backend.

However, the flow is not ideal for many reasons:
1) We are relying on the external `llvm-spirv` tool instead of the SPIR-V
backend. We could slightly improve this issue by using a library call to the
SPIR-V translator instead of the tool, however the library API requires setting
up an object to represent the arguments while we only have strings, and it's
non-trivial to parse the strings to figure out how to create the argument
object. Since we plan to use the SPIR-V backend in the long term, this does not
seem to be worth the effort.

2) We manually run passes inside `PreCodeGenModuleHook`. This is because we
don't run codegen, so we can't take advantage of the `PreCodeGenPassesHook`
field of `LTOConfig` to run some custom passes, as those passes are only run
when we actually are going to run codegen.

3) We have to store the fully linked module. This is needed because we need a
fully linked module to accurately compute metadata, see the above explanation
of SYCL requirement 2). We could get around storing the module by computing the
metadata inside the LTO framework and storing it for late use by the SYCL
bundling code, but doing this would require even more SYCL-only customizations including
even more new function arguments and modifications of the `OffloadFile` class.
There are also compliations because the LTO framework is multithreaded, and not all
LLVM data structures are thread safe.

The proposed long-term SYCL ThinLTO flow is as follows:

![SYCL SPIR-V backend linking flow](images/ThinLTOSYCLSPIRVBackendFlow.svg)

The biggest difference here is that we are running codegen using the SPIR-V
backend.

Also, instead of using a lambda function in the `PreCodeGenModuleHook`
callback to run SYCL finalization passes, we can take advantage of the `PreCodeGenPassesHook` field to add
passes to the pass manager that the LTO framework will run.

It is possible that the number of device images in the fat executable
and which device image contains which kernel is different with ThinLTO
enabled, but we do expect this to have any impact on correctness or
performance, nor we do expect users to care.


# Current limitations

`-O0`: Compiling with `-O0` prevent clang from generating ThinLTO metadata
during the compilation phase. In the current implementation, this is an error.
In the final version, we could either silently fall back to full LTO or
generate ThinLTO metadata even for `-O0`.

SYCL libdevice: Current all `libdevice` functions are explicitly marked to be
weak symbols. The ThinLTO framework does not consider a defintion of function
with weak linkage as it cannot be sure that this definiton is the correct one.
Ideally we could remove the weak symbol annotation.

No binary linkage: The SPIR-V target does not currently have a production
quality binary linker. This means that we must generate a fully linked image as
part of device linkage. At least for AMD devices, this is not a requirement as
`lld` is used for the final link which can resolve any unresolved symbols.
`-fno-gpu-rdc` is default for AMD, so in that case it can call `lld` during
compile, but if `-fno-gpu-rdc` is passed, the lld call happens as part of
`clang-linker-wrapper` to resolve any symbols not resolved by ThinLTO.
Loading
Loading