[RFC] Intel GPU integration #99
# Integrate Intel GPU to OpenXLA

| Status | Proposed |
| ------ | -------- |
| RFC# | [99](https://github.com/openxla/community/pull/99) |
| Author(s) | Teng, Lu ([email protected]), Yang, Sheng ([email protected]), Zhoulong, Jiang ([email protected]), Jianhui, Li ([email protected]) |
| Updated | 2023-11-02 |

## Objective
`XLA GPU` is a mechanism to extend OpenXLA with new GPUs as in-tree build devices. This RFC introduces the changes needed to integrate Intel GPU into `XLA GPU`.

## Motivation
Intel has released the experimental [Intel® Extension for OpenXLA*](https://github.com/intel/intel-extension-for-openxla), based on the `PJRT C API`, to support running applications on Intel GPU with OpenXLA.
However, an in-tree build is a better way to maximize the capabilities of OpenXLA and improve the user experience,
so Intel would like to upstream the related changes inside **Intel® Extension for OpenXLA*** to OpenXLA to make Intel GPU an in-tree build device.

## User benefit
This RFC allows users to run their applications on Intel GPU directly with OpenXLA, without installing any extra extensions or modifying any code.

## Design proposal
### Overview
The marked components of OpenXLA below will be modified to support Intel GPU:

![](20231030-intel-gpu-integration/overview.png)

Here we would like to distinguish these components by two different priorities:
* **P1**: Related to basic functionality and covered by this RFC
  - `LLVM IR`: Basic code generator for Intel GPU
  - `Lib Call`: Advanced `oneDNN` library calls that replace `LLVM IR` for core ops, to improve performance on Intel GPU
  - `XLA GPU Runtime` (based on Stream Executor): Basic runtime for Intel GPU
* **P2**: Related to performance or user experience and not covered by this RFC. We will propose new RFCs to track these features
  - `Global Cost Model`
  - `Tool` (including Profiler, Debug API, etc.)

In the future, we will follow the community to enable a more **advanced code generator** than `LLVM IR` for Intel GPU.

### Code integration & binary release
We would like to introduce a new macro `INTEL_GPU` (tentative) for code integration:
```c++
#ifndef INTEL_GPU
// Original functions
#else
// Intel GPU functions
#endif
```
It will be enabled only with `config=xpu` (tentative) in OpenXLA, to differentiate Intel GPU from other devices.
In this way we separate the binary release for Intel GPU from the original OpenXLA release and minimize the impact on other in-tree devices.

> **Review comment:** This seems contradictory with the stated goal of having another "in-tree build device". The typical way that XLA differentiates between multiple devices is such that they represent distinct source code that can co-exist in the same binary. I understand the desire for carrying such ifdefs, but it does mean that this is not a truly independent device in the same way (for example) that the XLA CPU device is different from the XLA GPU and TPU devices (i.e. all three of which can be enabled simultaneously in the same build invocation). I think this reinforces that the integration point here has to be at the PJRT plugin layer, and this plugin just happens to use common code from XLA.
>
> **Author reply:** This is because it's hard to integrate a new device into OpenXLA GPU and make it co-exist with the current in-tree GPU at once without any bugs. We just follow the experience of what we have done in TF before: oneDNN took a long time to become the default build of TF CPU, and it also started from a macro build like this. Thanks for your question; I will update the RFC to call this out.
>
> **Review comment:** The current situation is one where you feel like you would have to be in-tree and have to use `#ifdef`s. I would really rather want to look at how to make it possible for platforms to be developed independently (by improving the pluggability/modularity in XLA) rather than gluing all of it monolithically with `#ifdef`s. Google already develops the entire support for the TPU platform out-of-tree and seems to be successful at doing so, so it should be in reach. I'm curious what's missing for the Intel platform, or what makes it so different from TPU that it couldn't be out-of-tree?

### LLVM IR
Most of the `LLVM IR` work in the community can be reused directly; only a few modifications are needed for Intel GPU, as below:
* Integrate the [SPIRV translator](https://github.com/KhronosGroup/SPIRV-LLVM-Translator). Intel GPU can't consume `LLVM IR` directly, so the IR first needs to be converted to `SPIRV IR` by this component (see the sketch after this list)
* Add target-specific intrinsics. Here's an example of what the OpenXLA function [`TargetIntrinsicID()`](https://github.com/openxla/xla/blob/fb9e7064dade52134a0858a865f4be97e894bb81/xla/service/gpu/target_util.cc#L52) looks like for Intel GPU:
```c++
// Gets the llvm intrinsic ids on different platforms (NVPTX, AMDGPU)
// corresponding to the given TargetIntrinsicID.
struct TargetIntrinsics GetIntrinsic(TargetIntrinsicID intrin) {
  switch (intrin) {
    case TargetIntrinsicID::kThreadIdx: {
      return {
          llvm::Intrinsic::nvvm_read_ptx_sreg_tid_x,
          llvm::Intrinsic::amdgcn_workitem_id_x,
          [](llvm::IRBuilder<>* b_) -> llvm::CallInst* {
            return EmitDeviceFunctionCall("__builtin_IB_get_local_id_x", {}, {},
                                          U32, {b_->getContext()}, b_);
          },
      };
    }
    ...
```
* Change the index of the address space. Intel GPU has no extra pass in OpenXLA to handle its address space,
so it needs to use index `1` in the OpenXLA function [`BuildKernelPrototype()`](https://github.com/openxla/xla/blob/main/xla/service/gpu/fusions/fusion_emitter.cc#L83C1-L116), which differs from other in-tree devices:
```c++
IrEmitterUnnested::KernelAndIrArrays IrEmitterUnnested::BuildKernelPrototype(
    absl::string_view suggested_name,
    absl::Span<const KernelArgument> arguments,
    const LaunchDimensions& launch_dimensions) {
  ...
  // Create the kernel and add it to the module.
  llvm::LLVMContext& context = module_->getContext();
  llvm::FunctionType* kernel_type = llvm::FunctionType::get(
      /*Result=*/llvm::Type::getVoidTy(context),
      // SYCL: Hardcode to the global device address space (index 1).
      std::vector<llvm::Type*>(
          kNumLlvmArgs, llvm::Type::getInt8PtrTy(context, /*AddressSpace=*/1)),
      /*isVarArg=*/false);
  ...
```
* Turn off advanced LLVM optimization passes to avoid unsupported LLVM features on Intel GPU
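
As referenced in the first bullet above, here is a minimal sketch of what the SPIR-V conversion step could look like. It assumes the translator's public `llvm::writeSpirv()` entry point from `LLVMSPIRVLib.h`; the helper function and its name are illustrative, not the actual OpenXLA integration code:

```c++
#include <sstream>
#include <string>

#include "LLVMSPIRVLib.h"  // from KhronosGroup/SPIRV-LLVM-Translator
#include "llvm/IR/Module.h"
#include "llvm/Support/Error.h"

// Hypothetical helper: translate a finished LLVM module into a SPIR-V
// binary blob that the SYCL runtime can later build into a kernel.
llvm::Expected<std::string> TranslateToSpirv(llvm::Module* module) {
  std::ostringstream spirv_stream;
  std::string error;
  // writeSpirv() serializes the module as SPIR-V into the given stream.
  if (!llvm::writeSpirv(module, spirv_stream, error)) {
    return llvm::createStringError(llvm::inconvertibleErrorCode(), error);
  }
  return spirv_stream.str();
}
```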

**~250 LoC** are estimated for all of the `LLVM IR` changes.

### Lib Call
Some core ops (Conv/MatMul) will be lowered to [`oneDNN`](https://github.com/oneapi-src/oneDNN) library calls instead of `LLVM IR` for better performance, so `oneDNN` will be integrated as a third-party dependency.
Currently the lib call list is hard-coded for specific core ops; in the future it will be combined with the `Global Cost Model` for dynamic dispatching.
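
To make the lib call concrete, below is a minimal sketch of the kind of `oneDNN` MatMul call a core op could be lowered to, using the oneDNN v3.x C++ API. The function, the shapes, and the raw pointers (assumed to be USM device allocations) are placeholders for what the emitter would extract from the HLO; this is not the actual dispatch code:

```c++
#include <cstdint>

#include "oneapi/dnnl/dnnl.hpp"

// Hypothetical lowering target for an f32 [m, k] x [k, n] MatMul.
void RunOneDnnMatMul(float* a, float* b, float* c,
                     int64_t m, int64_t k, int64_t n) {
  dnnl::engine engine(dnnl::engine::kind::gpu, /*index=*/0);
  dnnl::stream stream(engine);

  // Describe the row-major operands.
  dnnl::memory::desc a_md({m, k}, dnnl::memory::data_type::f32,
                          dnnl::memory::format_tag::ab);
  dnnl::memory::desc b_md({k, n}, dnnl::memory::data_type::f32,
                          dnnl::memory::format_tag::ab);
  dnnl::memory::desc c_md({m, n}, dnnl::memory::data_type::f32,
                          dnnl::memory::format_tag::ab);

  // Wrap the existing device buffers without copying.
  dnnl::memory a_mem(a_md, engine, a);
  dnnl::memory b_mem(b_md, engine, b);
  dnnl::memory c_mem(c_md, engine, c);

  // Create and execute the matmul primitive (oneDNN v3.x style).
  dnnl::matmul::primitive_desc matmul_pd(engine, a_md, b_md, c_md);
  dnnl::matmul(matmul_pd).execute(stream, {{DNNL_ARG_SRC, a_mem},
                                           {DNNL_ARG_WEIGHTS, b_mem},
                                           {DNNL_ARG_DST, c_mem}});
  stream.wait();
}
```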
### XLA GPU Runtime
Intel GPU is based on the `SYCL` runtime from the [Intel® oneAPI DPC++/C++ Compiler](https://www.intel.com/content/www/us/en/developer/tools/oneapi/dpc-compiler.html),
so the `Intel® oneAPI DPC++/C++ Compiler` will be required as the runtime environment for users to execute applications on Intel GPU.
Based on the current `XLA GPU` runtime implementation, we chose `Stream Executor` as the runtime infrastructure and reimplemented it with the `SYCL` runtime, including `Allocator`, `Event`, `Executor`, `Kernel`, `Platform`, `Stream`, etc.
The initial implementation can be found in [Intel® Extension for OpenXLA*](https://github.com/intel/intel-extension-for-openxla/tree/main/xla/stream_executor/sycl). It will be revised to align with the OpenXLA code style before upstreaming.
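
As a rough illustration of what the `SYCL` reimplementation involves, the sketch below maps two `Stream Executor` concepts onto standard SYCL 2020 objects. The class and method names are hypothetical; the actual implementation in Intel® Extension for OpenXLA* is substantially more complete:

```c++
#include <cstddef>

#include <sycl/sycl.hpp>

// Hypothetical sketch: a Stream Executor-style stream backed by a SYCL queue.
class SyclStream {
 public:
  explicit SyclStream(const sycl::device& device)
      : queue_(device, sycl::property::queue::in_order{}) {}

  // Counterpart of Stream::ThenMemcpy: enqueue an asynchronous copy.
  void ThenMemcpy(void* dst, const void* src, size_t size) {
    queue_.memcpy(dst, src, size);
  }

  // Counterpart of Stream::BlockHostUntilDone: drain the queue.
  void BlockHostUntilDone() { queue_.wait(); }

 private:
  sycl::queue queue_;  // plays the role of a CUDA stream
};

// What an Allocator implementation might do for device memory (USM).
void* AllocateDeviceMemory(sycl::queue& q, size_t bytes) {
  return sycl::malloc_device(bytes, q);
}
```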
**~3000 LoC** are estimated for all of the `XLA GPU Runtime` changes.

### Performance Implications
We don't expect a performance impact from this RFC. The functions described by this RFC are realized at the initialization stage.

### Dependencies
* Build dependencies:
  - [oneDNN](https://github.com/oneapi-src/oneDNN)
  - [oneMKL](https://github.com/oneapi-src/oneMKL)
  - [SPIRV-LLVM-Translator](https://github.com/KhronosGroup/SPIRV-LLVM-Translator)
* Execution (runtime) dependency: [Intel® oneAPI DPC++/C++ Compiler](https://www.intel.com/content/www/us/en/developer/tools/oneapi/dpc-compiler.html)

This RFC also relies on some upcoming `XLA GPU` RFCs from the OpenXLA team, so some details will change as those RFCs progress. E.g.:
- Command Buffer: A new proposal from the OpenXLA community that we haven't implemented in **Intel® Extension for OpenXLA***.
At this early initialization stage, it should not block the current work based on `LLVM IR` + `Thunk` + `Stream Executor`.

### Engineering Impact
The impact on binary size / startup time / build time is minimal, but test time will increase due to the newly added device.

The whole OpenXLA community (Intel is a contributor, as part of the community) will maintain this code. Intel will help to set up CI as below to ensure project quality:
* Enable CI on Intel Dev Cloud with Intel® Data Center GPU MAX Series

### Platforms and Environments
Intel GPU hardware (with the correct driver) and the `Intel® oneAPI DPC++/C++ Compiler` runtime environment are required. Other dependencies are the same as for the original OpenXLA.

### Compatibility
This RFC follows the `XLA GPU` [roadmap](https://docs.google.com/presentation/d/1FPVjZUkTApV80TKJ-WbPvLynjIxb3sdFGwn6Qs9UCrw/edit#slide=id.g224a3cf318c_0_1047) (WW33'2023) to integrate a new GPU into OpenXLA.
We don't expect this proposal to impact other parts of the OpenXLA ecosystem. At this moment it only supports the basic functionality of OpenXLA; some advanced features, including `Profiler` and `Debug API`, are not supported yet and will be covered in a follow-up RFC.

---

> **Review comment:** I'm curious about the tradeoffs you see here: why in-tree?
> I would think that having a monolithic repository for XLA isn't scalable, or desirable. TensorFlow paid the price of this and tried for a long time to split its components into multiple repos.
> Have you considered having this component inside OpenXLA as a separate repo? It would be great if we could use this work to make XLA more modular and keep the reusable "core" of XLA independent of the various platform-specific instantiations of XLA.

> **Review comment:** +1 - probably important to coordinate with the JAX team on this. AFAIK, they've been skating in the other direction, even looking to have the NVIDIA XLA backend be built and installable separately.
> Maybe there are two axes being conflated here: having the code be in-tree is a policy decision that the XLA authors can make, with the usual implications of ifdef-heavy code with disjoint dependencies that never builds all at once. I wouldn't make that decision personally, but doing so is a matter of source code/project organization that could be decided either way.
> But in terms of deployment, it seems that this will always be a separately built and deployed binary with its own distinct dependencies, pinned to the PJRT API version. If this were some other way, I could see the attraction of moving in-tree, since that would provide a privileged position for features that couldn't be accessed elsewhere. Since that is not the way things are going, though, everything I know of the XLA dev process, CI, etc. tells me that this would be better off as its own repo with its own CI and deployment lifecycle. The development experience will be significantly degraded by moving in-tree, in my experience, so I would treat it as something to be done only if there is no other way.
> Personally, I would do what @hawkinsp recommends.

> **Author reply:** Basically, we do this job to follow the new XLA GPU roadmap. This simple reason may not answer your question completely; I will give more details later, thanks.

> **Review comment:** That roadmap, which is very non-technical, does seem to call out the desire to support these things as out-of-tree builds.
> I would treat that roadmap more as marketing or product guidance than as a basis for technical requirements. It was not authored with an abundance of technical collaboration from most of the people who are the primary stakeholders for the things it claims to cover.

> **Review comment:** +1, I don't quite see how the link provides anything that supports the current proposal in this RFC; can you clarify what you are referring to?
> We should decouple the interests of the project (OpenXLA), which definitely would want to grow support for more platforms, from the scalability of development in the repo, which is what I'm concerned about right now.

> **Author reply:** From the PluggableDevice experience, I actually think that we need the modular maintainer's guidance on the technical direction. It is not practical to propose any structural changes if the modular maintainer is planning something else. As OpenXLA has experienced multiple changes recently, it is even more important to get the maintainer's opinion.
> For the low-level design of identifying a good abstraction/interface (zero ifdefs), I agree that a new device backend has better visibility into what changes are needed. If the changes align with the modular maintainer's vision, Intel could step up, co-own these modules, and drive the changes.
> Seeking a good abstraction doesn't mean the Intel GPU backend wants to live out-of-tree. Out-of-tree integration like PJRT is valuable for a plugin to mature and experiment with new features. For a backend designed specifically for OpenXLA, pushing mature features upstream with a good interface is a better destiny. It makes Intel GPU more accessible to users and helps OpenXLA cover more device targets.

> **Review comment:** I don't quite understand this part: why would living in the same repo change anything?
> In particular, you wrote "OpenXLA to cover more device targets": it's not clear to me why being "in-tree" would relate in any way to what OpenXLA supports! As long as we organize our GitHub collection of repos together, I don't quite see the difference, really: we would still be able to say "OpenXLA supports Intel GPU" (as much as any other platform), and we should still be able to have JAX (for example) released with support for Intel GPU.
> In-tree vs out-of-tree should really be orthogonal to that.

> **Author reply:** From my observation, a backend living "out-of-tree" has a higher risk of developing features that are driven by short-term business needs but known to be hard to upstream. The fact of living "out-of-tree" certainly impacts the mindshare of the upstream module maintainers and how tightly the backend tracks upstream changes. So from Intel's side, there is strong motivation to get mature and compatible features upstreamed.
> Although the RFC is nominally for Intel GPU, the proposed changes potentially benefit other devices. The RFC adds a SPIR-V target and a SYCL runtime to OpenXLA, and these are industry standards used by other devices. From that perspective, the effort helps OpenXLA cover more device targets.

> **Review comment:** Thanks for elaborating, much appreciated!
> Considering individual "features" as separate reusable components in-tree (like the "SPIR-V target" or the "SYCL runtime") makes a lot of sense to me; I didn't grasp this from the RFC.
> I assume contributions would be separate for each new component, right?
> Each would end up an independent library/component in-tree, with its own tests and documentation, and that decouples the question of the full backend support, which assembles these components into a coherent PJRT-registered platform.

> **Author reply:** Thanks for the clarification from @Jianhui-Li. I have added the purpose of supporting the `SPIRV target` and `SYCL runtime` to the RFC. Yes, in the original plan we already decided to upstream them separately if approved. The `SPIRV target` (meaning the common changes mentioned above) will be upstreamed first, after the `#ifdef` issue is addressed.