Develop upstream sync 20250204 #2837
Conversation
PiperOrigin-RevId: 720935918
PiperOrigin-RevId: 720936848
Updates LLVM usage to match [a06c89387621](llvm/llvm-project@a06c89387621) PiperOrigin-RevId: 720937292
Imported from GitHub PR openxla/xla#21800 This PR adds a transformation pass that supports custom calls to block quantize/dequantize/dot ops. Such calls are replaced by an equivalent sequence of HLO operations. This pass is intended to support MX scaling formats, such as MXFP8, but is not limited to those and can be used with any data types and block sizes. The quantization op sequence matches the one described in section 6.3 of the MX spec: https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf Once cuDNN frontend 1.10 is released, a lowering to a cuDNN graph will be enabled for hardware that supports block-scaled dot natively (i.e. Blackwell). This pass will stay disabled until then. I also plan on introducing a new HLO op, "block-scaled-dot", which will be more generic than a custom call - for example, it will have configurable dimension numbers akin to the general dot op. This will follow in a separate PR; once that is approved, I'll replace the custom call "__op$block_scaled_dot" with it. Copybara import of the project: -- 5dcc610e804e7aaad9b79369f714a63f9f096ad8 by Sergey Kozub <[email protected]>: Add block scaling rewriter pass Merging this change closes tensorflow#21800 PiperOrigin-RevId: 720940919
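The block quantization sequence the commit message refers to (OCP MX spec, section 6.3) can be sketched roughly as follows. This is a hypothetical NumPy illustration, not the pass's actual HLO expansion: each block of values shares one power-of-two scale derived from the block's maximum magnitude, and the scaled values would then be cast to a narrow element format such as FP8 E4M3 (the cast is omitted here, so the round trip is exact).

```python
import numpy as np

def block_quantize(x, block_size=32):
    # One shared power-of-two scale per block of `block_size` values,
    # chosen so the block's largest magnitude fits the element format.
    x = x.reshape(-1, block_size)
    amax = np.max(np.abs(x), axis=1, keepdims=True)
    emax_elem = 8  # largest exponent representable in FP8 E4M3 (assumption)
    scale_exp = np.floor(np.log2(np.maximum(amax, 2.0**-127))) - emax_elem
    scale = 2.0**scale_exp
    q = x / scale  # a real implementation would cast q to FP8 here
    return q, scale

def block_dequantize(q, scale):
    # Dequantization is a per-block multiply by the shared scale.
    return q * scale

x = np.linspace(-4.0, 4.0, 64).astype(np.float32)
q, s = block_quantize(x, block_size=32)
y = block_dequantize(q, s).reshape(-1)
```

Because the scale is a power of two and the FP8 cast is skipped, `y` reconstructs `x` exactly in this sketch; with a real element cast, the round trip would carry the usual quantization error.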
PiperOrigin-RevId: 720952363
This both simplifies the giant EmitMatmul function and makes it more generic, simplifying the TMA change (see CL chain). PiperOrigin-RevId: 720962746
…k to the default layout in HLO parser just for entry_computation_layout. PiperOrigin-RevId: 720970103
PiperOrigin-RevId: 720982236
PiperOrigin-RevId: 721002434
PiperOrigin-RevId: 721003875
Updating: - `env.h` - `env_time.h` - `errors.h` - `file_statistics.h` - `file_system.h` - `file_system_helper.h` - `logging.h` - `macros.h` - `status.h` - `status_matchers.h` - `status_to_from_proto.h` - `statusor.h` - `test.h` - `test_benchmark.h` - `threadpool.h` - `threadpool_async_executor.h` - `threadpool_interface.h` - `threadpool_options.h` - `types.h` and associated targets. PiperOrigin-RevId: 721004530
PiperOrigin-RevId: 721004911
`CudaExecutor::Allocate` used to always return a nullptr when the user requested an allocation in the collective memory space. This was caused by a mistake in one of my refactorings a while ago. PiperOrigin-RevId: 721009378
PiperOrigin-RevId: 721022206
Add ToProto support for remaining Thunk types Reverts 444d561 PiperOrigin-RevId: 721025054
PiperOrigin-RevId: 721025822
PiperOrigin-RevId: 721026017
Imported from GitHub PR openxla/xla#21901 Add rocm 6.1.0 dependency for ubuntu 20.04 Copybara import of the project: -- 0acf028eeca5923c7f2aa5762297686836eda310 by Alexandros Theodoridis <[email protected]>: Add rocm6.1 deps for ubuntu 20.04 -- fc88c83061d6efff2482599489d622ab3114b9a7 by Alexandros Theodoridis <[email protected]>: Fix hermetic build for 6.0 -- 73ace5591f4731e1b95b6d3e6a349b528977c580 by Alexandros Theodoridis <[email protected]>: Add ci config for hermetic build -- bbc048bcffd9d35bfad76ff816ed22f3e3f761f8 by Alexandros Theodoridis <[email protected]>: Introduce rocm 6.1.0 dependency for 22.04 -- 9776f398c2711ba37333d29b934d6ba67c55dbef by Alexandros Theodoridis <[email protected]>: Add missing 24.04 redist -- acf275d57cc185b9c2122d5930d8cf54e473ad95 by Alexandros Theodoridis <[email protected]>: Fix test -- 3e49285b0f55597ab5f44c1d0a422bf931d72cda by Alexandros Theodoridis <[email protected]>: Add comment explaining the reason for a new target -- 35838bf8d6e678717e9b1c551f840918b00a91f8 by Alexandros Theodoridis <[email protected]>: Revert force verbose in the compiler wrapper -- 2952e115b044e1a8ac8aadc7eac7802e8d79cf91 by Alexandros Theodoridis <[email protected]>: Add explanation comment for the new target Merging this change closes tensorflow#21901 PiperOrigin-RevId: 721043735
Updating: - `env.h` - `env_time.h` - `errors.h` - `file_statistics.h` - `file_system.h` - `file_system_helper.h` - `logging.h` - `macros.h` - `status.h` - `status_matchers.h` - `status_to_from_proto.h` - `statusor.h` - `test.h` - `test_benchmark.h` - `threadpool.h` - `threadpool_async_executor.h` - `threadpool_interface.h` - `threadpool_options.h` - `types.h` and associated targets. PiperOrigin-RevId: 721044569
Update the rule to include all LiteRtXXXX symbols. PiperOrigin-RevId: 721045739
PiperOrigin-RevId: 721056767
PiperOrigin-RevId: 721066280
Imported from GitHub PR openxla/xla#21948 Copybara import of the project: -- affa734c3c6e2af934dd12eafe7e8771ab0ee8db by Ilia Sergachev <[email protected]>: [GPU] Upgrade cuDNN frontend to 1.10.0. Merging this change closes tensorflow#21948 PiperOrigin-RevId: 721075669
…itectures (Blackwell) Imported from GitHub PR openxla/xla#22029 In addition to SM120a, also add SM101a mentioned in the PTX 8.7 spec (https://docs.nvidia.com/cuda/parallel-thread-execution/#release-notes), which is a slight variation of SM100a. Bumping the max supported PTX version to 8.7, as the LLVM PR (llvm/llvm-project#124155) adding the support is now integrated into OpenXLA. Copybara import of the project: -- be59b7a51721637d880207e7adb69a18c3a92bea by Sergey Kozub <[email protected]>: [XLA:GPU] Add support for SM101a and SM120a architectures (Blackwell) Merging this change closes tensorflow#22029 PiperOrigin-RevId: 721088886
PiperOrigin-RevId: 721089414
…e beginning of loop bodies PiperOrigin-RevId: 721089737
The first bug is that all-to-all ops with multiple replica groups did not work, because the thunk stored a map from local_id to some temporary memory used by the a2a implementation, where local_id was relative to the start of the replica_group. This means devices in different groups would use the same temporary memory, overwriting each other's results. The fix is to change the map's key from local_id to StreamExecutor*. The second bug is that the temporary memory mentioned above is registered as host memory but never deregistered. It would be deregistered in NcclAllToAllStartThunk::Cleanup(), but Cleanup() is never called. Calling Cleanup() would fix the bug, but would cause the memory to be registered and deregistered on every run of the executable, which is unacceptably slow. The fix is to deregister the memory in the thunk destructor instead, which is done implicitly by storing a se::MemoryAllocation instead of an int64_t* in the map. Since the two fixes affect the exact same code, I'm putting them in a single change instead of two separate changes. PiperOrigin-RevId: 722909883
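The essence of the first fix can be illustrated abstractly: a scratch buffer keyed by a group-local rank collides across replica groups, while keying by the globally unique device (StreamExecutor in the real code) keeps buffers separate. This is a hypothetical Python sketch of the keying principle, not the actual thunk code; `Device`, `get_scratch`, and the device names are made up for illustration.

```python
class Device:
    """Stand-in for a per-device handle (StreamExecutor* in XLA)."""
    def __init__(self, name):
        self.name = name

scratch = {}  # the buggy version keyed this dict by group-local rank

def get_scratch(device):
    # One buffer per physical device, regardless of which replica
    # group the device belongs to or its rank within that group.
    return scratch.setdefault(device, bytearray(64))

# Two devices that are each local rank 0 of their own replica group:
# keyed by local_id they would have shared (and clobbered) one buffer.
d0, d1 = Device("gpu:0"), Device("gpu:2")
```

With the device object as the key, `get_scratch(d0)` and `get_scratch(d1)` return distinct buffers, which is exactly the property the fix restores.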
The Create() method creates the EnvironmentSingleton with options. It will fail if there is a pre-created instance. PiperOrigin-RevId: 722913256
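The described contract (create-with-options, failing on a second creation) can be sketched as follows. This is a generic singleton illustration under that stated contract; the class name is borrowed from the commit message, but the method shape and option handling are assumptions, not the real API.

```python
class EnvironmentSingleton:
    """Minimal sketch: Create() with options, at most one instance."""
    _instance = None

    @classmethod
    def create(cls, **options):
        # Creating the singleton twice is an error per the contract above.
        if cls._instance is not None:
            raise RuntimeError("EnvironmentSingleton already created")
        cls._instance = cls()
        cls._instance.options = options
        return cls._instance

env = EnvironmentSingleton.create(verbose=True)
try:
    EnvironmentSingleton.create()
    second_create_failed = False
except RuntimeError:
    second_create_failed = True
```

A second `create()` call raises rather than silently replacing the configured instance, matching the "fail if pre-created" behavior.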
PiperOrigin-RevId: 722913817
PiperOrigin-RevId: 722914748
…ncat operations in SPMD partitioner. PiperOrigin-RevId: 722915685
PiperOrigin-RevId: 722921509
PiperOrigin-RevId: 722921904
PiperOrigin-RevId: 722956003
PiperOrigin-RevId: 722965304
PiperOrigin-RevId: 722974000
PiperOrigin-RevId: 722974169
PiperOrigin-RevId: 722974173
PiperOrigin-RevId: 722977003
PiperOrigin-RevId: 722985539
The dependencies may have been needed before, but some passes have been moved. PiperOrigin-RevId: 722987106
PiperOrigin-RevId: 723005231
!gen-cache
The disk cache generation for the cpu-pycpp tests: successfully finished. The disk cache generation for the XLA tests: successfully finished.
Is this skipped test related to openxla/xla#22383? If not, could you create a task on our board to track this test?