[RFC] Intel GPU integration #99
Conversation
## Motivation
Intel has experimentally released [Intel® Extension for OpenXLA*](https://github.com/intel/intel-extension-for-openxla), based on the `PJRT C API`, to support running applications on Intel GPUs with OpenXLA. However, an in-tree build is a better way to maximize the capabilities of OpenXLA and improve the user experience, so Intel would like to upstream the related changes inside **Intel® Extension for OpenXLA*** to OpenXLA to make Intel GPU an in-tree build device.
I'm curious about the tradeoffs you see here: why in-tree?
I would think that having a monolithic repository for XLA isn't scalable, or desirable. TensorFlow paid the price of this and tried for a long time to split up its components into multiple repos.
Have you considered having this component inside OpenXLA as a separate repo? It would be great if we could use this work to make XLA more modular and keep the reusable "core" of XLA independent of the various platform-specific instantiation of XLA.
+1 - probably important to coordinate with the Jax team on this. Afaik, they've been skating the other direction, even looking to have the NVIDIA XLA backend be built and installable separately.
Maybe there are two axes being conflated here: having the code be in-tree is a policy decision that the XLA authors can make, with the usual implications of ifdef'ey stuff with disjoint dependencies that never build all at once. I wouldn't make that decision personally, but doing so is a matter of source code/project organization that could be decided either way.
But in terms of deployment, it seems that this will always be a separately built and deployed binary with its own distinct dependencies and pinned to the PJRT API version. If this were some other way, I could see the attraction of moving in-tree since that would provide some privileged position for features that couldn't be accessed elsewhere. Since that is not the way things are going, though, everything I know of the xla dev process, CI, etc tells me that this would be better off as its own repo with its own CI and deployment lifecycle. The development experience will be significantly degraded by moving in-tree in my experience, so I would treat it as something to be done if there is no other way.
Personally, I would do what @hawkinsp recommends.
Basically, we are doing this work to follow the new XLA GPU roadmap. This simple reason may not answer your question completely; I will give more details later, thanks.
That roadmap, which is very non-technical, does seem to call out the desire to support these things as out-of-tree builds.
I would treat that roadmap more as marketing or product guidance vs a basis for technical requirements. It was not authored with an abundance of technical collaboration from most of the people who are the primary stakeholders on the things it claims to cover.
+1, I don't quite see how the link provides anything that supports the current proposal in this RFC; can you clarify what you are referring to?
We should decouple the interests of the project (OpenXLA), which definitely would want to grow support for more platforms, and the scalability of the development in the repo, which I'm concerned about right now.
From the PluggableDevice experience, I actually think that we need the module maintainer's guidance on the technical direction. It is not practical to propose any structural changes if the module maintainer is planning something else. As OpenXLA has experienced multiple changes recently, it is even more important to get the maintainer's opinion.
For the low-level design of identifying good abstractions/interfaces (zero #ifdefs), I agree that a new device backend has better visibility into what changes are needed. If the changes align with the module maintainer's vision, Intel could step up, co-own these modules, and drive the changes.
Seeking a good abstraction doesn't mean the Intel GPU backend wants to live out-of-tree. Out-of-tree integration like PJRT is valuable for a plugin to mature and to experiment with new features. For a backend designed specifically for OpenXLA, pushing mature features upstream behind a good interface is a better destination. It makes Intel GPU more accessible to users and helps OpenXLA cover more device targets.
Out-of-tree integration like PJRT is valuable for a plugin to mature and to experiment with new features. For a backend designed specifically for OpenXLA, pushing mature features upstream behind a good interface is a better destination. It makes Intel GPU more accessible to users and helps OpenXLA cover more device targets.
I don't quite understand this part: why would living in the same repo change anything?
In particular you wrote "OpenXLA to cover more device targets": it's not clear to me why being "in-tree" would relate in any way to what OpenXLA would support! As long as we organize our GitHub collection of repos together, I don't quite see the difference really: we would still be able to say "OpenXLA supports Intel GPU" (as much as any other platform), and we should still be able to have JAX (for example) released with support for Intel GPU.
In-tree vs. out-of-tree should really be orthogonal to that.
From my observation, a backend living "out-of-tree" has a higher risk of developing features that are driven by short-term business needs but known to be hard to upstream. The fact of living "out-of-tree" certainly impacts the mindshare of the upstream module maintainers and how tightly the backend tracks upstream changes. So from the Intel side, there is strong motivation to get mature and compatible features upstreamed.
Although the RFC is nominally for Intel GPU, the proposed change potentially benefits other devices. The RFC adds a SPIR-V target and a SYCL runtime to OpenXLA, and these are industry standards used by other devices. From that perspective, the effort helps OpenXLA cover more device targets.
Thanks for elaborating, much appreciated!
Considering individual "features" as separate reusable components in-tree (like the "SPIR-V target" or "SYCL runtime") makes a lot of sense to me; I didn't grasp this from the RFC.
I assume contributions would be separate for each new component right?
Each would end up independent library/component in-tree, with its own tests and documentation, and that decouples the question of the full backend support which assembles these components in a coherent PJRT registered platform.
Thanks for the clarification from @Jianhui-Li. I have added the purpose of supporting the SPIR-V target and SYCL runtime to the RFC.
I assume contributions would be separate for each new component right?
Each would end up independent library/component in-tree, with its own tests and documentation, and that decouples the question of the full backend support which assembles these components in a coherent PJRT registered platform.
Yes, in the original plan we had already decided to upstream them separately if approved. The SPIR-V target (meaning the common changes mentioned above) will be upstreamed first, once the #ifdef concern is addressed.
### Code integration & binary release
We would like to introduce a new macro `INTEL_GPU` (tentative) for code integration:
```c++
#ifndef INTEL_GPU
```
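To make the implication of this guard concrete, here is a minimal sketch (hypothetical names, not the actual XLA sources) of what a compile-time `INTEL_GPU` macro means: only one of the two branches can exist in a given binary.

```cpp
#include <string>

// Hypothetical sketch, not actual XLA code: with a compile-time macro,
// only one GPU backend is present in a given binary. Building with
// -DINTEL_GPU swaps the platform instead of letting both platforms
// co-exist at run time.
std::string ActiveGpuPlatformName() {
#ifdef INTEL_GPU
  return "SYCL";  // Intel GPU build flavor
#else
  return "CUDA";  // default in-tree GPU build
#endif
}
```

This mutual exclusivity is exactly the property discussed in the review comments: the guarded backends can never be enabled simultaneously in one build invocation.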
This seems contradictory with the stated goal of having another "in-tree build device". The typical way that XLA differentiates between multiple devices is such that they represent distinct source code that can co-exist in the same binary. I understand the desire for carrying such ifdefs, but it does mean that this is not a truly independent device in the same way (for example) that the XLA CPU device is different from the XLA GPU and TPU devices (i.e. all three of which can be enabled simultaneously in the same build invocation).
I think this reinforces that the integration point here has to be at the PJRT plugin layer, and this plugin just happens to use common code from XLA.
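The usual alternative, which lets devices co-exist in one binary, is runtime registration: each backend registers itself in a shared registry and is selected by name. A minimal sketch of that pattern, with all names hypothetical (not XLA's actual platform-manager API):

```cpp
#include <functional>
#include <map>
#include <string>

// Hypothetical sketch of runtime platform registration: several GPU
// backends can live in the same binary and be chosen by name, which is
// how distinct in-tree devices avoid compile-time #ifdef guards.
class PlatformRegistry {
 public:
  using Factory = std::function<std::string()>;

  static PlatformRegistry& Get() {
    static PlatformRegistry instance;
    return instance;
  }

  void Register(const std::string& name, Factory factory) {
    factories_[name] = std::move(factory);
  }

  bool Has(const std::string& name) const {
    return factories_.count(name) != 0;
  }

 private:
  std::map<std::string, Factory> factories_;
};

// Each backend would register itself, e.g. from a static initializer.
inline void RegisterDemoPlatforms() {
  PlatformRegistry::Get().Register(
      "cuda", [] { return std::string("CUDA platform"); });
  PlatformRegistry::Get().Register(
      "sycl", [] { return std::string("SYCL platform"); });
}
```

Under this pattern, building both backends into one binary is the default, and disabling one is a link-time choice rather than a source-level `#ifdef`.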
This is because it's hard to integrate a new device into OpenXLA GPU and make it co-exist with the current in-tree GPU all at once without any bugs... We are just following the experience of what we did in TF before: oneDNN took a long time to become the default build of TF CPU, and it also started from a macro-guarded build like this.
Thanks for your question; I will update the RFC to call this out.
The current situation, where you feel you would have to be in-tree and have to use `#ifdef`, looks to me more like a symptom of a problem to address rather than something that is intended, or that we should consider untouchable.
I would really rather look at how to make it possible for platforms to be developed independently (by improving the pluggability/modularity in XLA) than glue all of it together monolithically with `#ifdef` everywhere.
Google already develops the entire support for the TPU platform out-of-tree and seems to be successful at doing so, so it should be in reach. I'm curious what's missing for the Intel platform, or what makes it so different from TPU that it couldn't be out-of-tree?
Hi all, thanks for all of your discussion; I'm glad to see so many suggestions! Here are some updates/plans for future work after a brief discussion with Google:
Hi all, sorry for the delayed response! I somehow missed that the RFC had been opened until now. Below is what I understood from the discussion. Intel's motivations for upstreaming some parts of the Intel GPU PJRT plug-in:
So far, we seem to all agree that:
There are also different levels of in-tree-ness/modularity that should be distinguished:
If I understand correctly, Intel proposes upstreaming 3 separate components T2-style, and will have fast-moving, Intel-specific features (not mentioned in this RFC) remain in the plug-in (T4). @joker-eph initially proposed the 3 components should be done T3-style, but later on seemed to be okay with T2 as well, did I get this right? Let's decide on each component separately?
Thanks for the great summary @penpornk!
I think there is a nuance :) On the other hand, for the "non-shared components", it's not clear to me why T2 is best; it seems worth evaluating more closely whether every platform could have its own repo for the non-shared code (without duplicating code: that would defeat the point, of course). That is T4, I think? (Which is how the TPU compiler is set up, somehow.) Now, it is possible that 90% of what Intel has here is in the "T2" category and the T4 part is very small, or it is possible that it's the opposite, or anywhere in the middle; I haven't tried to make an assessment of this. What seemed valuable was to start upstreaming the things that are clearly "T2 components" cleanly, and then see where we are?
I think that for non-NVIDIA GPU targets, the desire to be in-tree is very non-abstract (and primarily non-technically driven): in my experience, core developers of this kind of system -- which has grown in a non-modular fashion against one target -- benefit from a tangible itch to scratch. Putting the burden on those who come second or third to take the step of living outside the peripheral vision of the primary authors is unfair and will not produce a good result with the style of development that the core authors practice on this component. Case in point: it was three weeks before someone affiliated with primary stewardship of the project commented (and thank you for doing so, Penporn -- much appreciated). With that kind of dynamic, code locality is the only lever available to bias towards equal access to the platform, and the option to develop in-tree should be retained. I think that the GPU vendors should be colocated at this point: if one is in-tree, then they all should be, as a matter of policy. That can be reevaluated if the primary authors publish a more detailed plan and a track record of support for this kind of situation. It would be different if this were the fourth or fifth implementation and/or there were a strong track record of multi-targeting for the GPU component specifically. But in effect, this is the second. The only way to evolve the situation from this point is colocation, imo.
Thanks for the summary @penpornk! We will have some internal discussions first and give feedback later.
Thank you all for the quick replies! :)
I also got feedback from JAX/PJRT. The code structure underneath doesn't matter to them; they only require that the device support for JAX still be built and packaged as a separate PJRT plug-in (even if the code lives in the same XLA tree). Our team will look more into the Intel GPU plug-in code before making per-component suggestions too. The goal is to keep XLA in good health with reasonable work scopes that fit each party's high-level goals and timelines. Re: @joker-eph:
Great question. @Zantares, do you have a quick answer here? @joker-eph Do you consider the SYCL runtime support "clearly T2"? Re: @stellaraccident:
I agree. Device vendors may help with core code however they can out of goodwill / necessity to make things happen, but it's not their responsibility to do so. Core maintainers need to be actively involved. Also +1 to @Jianhui-Li's point:
While Intel contributed most of the PluggableDevice design and implementation, it was done in close collaboration with TF core maintainers (Google). Some examples:
We apologize for the delay in responding to this RFC and aim to have better response times from now on. Re: @stellaraccident:
Great point here! I'll bring this up to the team.
Regarding Intel GPU support in StreamExecutor, I'd love for it to be in-tree. Also, StreamExecutor today is really a hardware abstraction layer for XLA, so we are considering renaming the namespace.
Seems like something that other SYCL vendors would want, rather than it being a specific component of the Intel GPU backend, right? If so, I would say yes here.
Absolutely, and actually I alluded to this work (without the links you have!) before, in one of my comments, as an example of a "success story". It has to be collaborative, but it still implies "work" from the "new platforms" to help build their extension points.
@penpornk @joker-eph Quick answer for your question:
It belongs to the 1st situation for
In addition, I think there is a chance to parameterize the rest of the changes:
We can discuss it after preparing the code example, thanks!
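"Parameterized" here could mean, for instance, turning the target into an ordinary function argument instead of a build-time macro, so one binary can address several GPU code-generation targets. A rough sketch under that assumption (the enum and the exact triples are illustrative; OpenXLA's actual target handling differs):

```cpp
#include <string>

// Illustrative only: a platform parameter replaces an #ifdef, letting
// the caller pick the code-generation target at run time.
enum class GpuTarget { kCuda, kRocm, kSpirv };

// LLVM-style target triples; the SPIR-V triple is an assumption here.
std::string TargetTriple(GpuTarget target) {
  switch (target) {
    case GpuTarget::kCuda:
      return "nvptx64-nvidia-cuda";
    case GpuTarget::kRocm:
      return "amdgcn-amd-amdhsa";
    case GpuTarget::kSpirv:
      return "spir64-unknown-unknown";
  }
  return "";
}
```

The design point is that adding a new target then means adding an enum case and a code path, not a new build configuration.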
@Zantares There are a number of backend-specific contributions Intel could provide; is it possible to elaborate on the planned changes? The following areas might require changes:
Is the set of planned changes mainly based around (1)? Is it possible to see a preview of the features to land?
To @penpornk: sorry, I missed this point before. Please check the link below to see how we use oneDNN library calls in the plug-in. We have also started to check the CPU RFC to see how to reuse the 3rd-party passes to complete this work.
To @cheshire: I assume Q1 is already answered above. For Q2 & Q3, we have already made some experiments to support Triton in our plug-in; it still needs a lot of work. For Q4, right now in the plug-in we reuse most of the public code and use macros to add specific fusions; in the future we want to combine this work with the newly proposed Global Cost feature in OpenXLA.
Hi @ezhulenev, does this mean that the OpenXLA repo will send the PR patch to AMD CI and get its feedback? Is this detail available publicly? We also want to provide a similar check here.
No, currently it works as post-submit: CI jobs run periodically against the latest commit, and then we accept patches. @ddunl might know if we have any plans for pre-submit CI jobs for AMD, and whether we have plans for supporting other CI jobs. I'd love to have an option to include backend-specific CI jobs optionally for a PR. I'm not sure about enabling it by default, as even with the current set of jobs it takes a long time to get results back.
Thanks for the feedback! Running backend-specific CI as post-submit to avoid possible blocking situations is a good idea; we'll start from this. We also agree on not enabling backend-specific CI jobs by default. Maybe a better way is to trigger them on demand via a GitHub label or a specific comment.
I think for now we don't have specific plans for supporting external CI in general, just that we know we want to do it. Backend-specific CI with the ability to trigger on demand is definitely great though, and would be very helpful.
Hi all, the 1st PR of the SPIR-V target is submitted: openxla/xla#7940.
Stamping and submitting; this is already in progress, so this addresses the missed step.
The SYCL runtime PR is submitted: openxla/xla#9042. Note that this big PR is for preview and CI testing; we will continue to split it into smaller PRs and welcome your suggestions on it. Please refer to comment openxla/xla#9042 (comment) for more details.
This RFC is from Intel, following the XLA GPU roadmap, to integrate Intel GPU as an in-tree device in OpenXLA. We hope to get your suggestions on this proposal, thanks!