
Develop upstream sync 20250204 #2837

Merged
merged 373 commits into develop-upstream from develop-upstream-sync-20250204 on Feb 10, 2025
Conversation


@hsharsha commented Feb 4, 2025

No description provided.

tensorflower-gardener and others added 30 commits January 29, 2025 05:36
PiperOrigin-RevId: 720935918
PiperOrigin-RevId: 720936848
Updates LLVM usage to match
[a06c89387621](llvm/llvm-project@a06c89387621)

PiperOrigin-RevId: 720937292
PiperOrigin-RevId: 720939787
Imported from GitHub PR openxla/xla#21800

This PR adds a transformation pass that supports custom calls to block quantize/dequantize/dot ops.
Such calls are replaced by an equivalent sequence of HLO operations.

This pass is supposed to support MX scaling formats, such as MXFP8, but is not limited to those and can be used with any data types and block sizes.
The quantization op sequence matches the one described in section 6.3 of the MX spec:
https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf

Once cuDNN frontend 1.10 is released, a lowering to a cuDNN graph will be enabled for the hardware that supports block scaled dot natively (i.e. Blackwell). This pass will stay disabled until then.

I also plan on introducing a new HLO op, "block-scaled-dot", which will be more generic than a custom call; for example, it will have configurable dimension numbers akin to the general dot op. This will follow in a separate PR; once that is approved, I'll replace the custom call "__op$block_scaled_dot" with it.
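As a rough illustration only (not the pass implementation), the C++ sketch below shows the dequantize-then-dot semantics that a block-scaled dot expands to: each block of elements shares one scale, dequantization multiplies each element by its block's scale (per section 6.3 of the MX spec), and the result feeds an ordinary dot. The function names, flat 1-D operand layout, and use of float throughout are illustrative assumptions.

```cpp
// Hypothetical sketch of the dequantize-then-dot semantics; not XLA code.
#include <cstddef>
#include <vector>

// Each block of `block_size` elements shares one scale; the dequantized
// value is scale * element.
std::vector<float> BlockDequantize(const std::vector<float>& elements,
                                   const std::vector<float>& scales,
                                   std::size_t block_size) {
  std::vector<float> out(elements.size());
  for (std::size_t i = 0; i < elements.size(); ++i) {
    out[i] = scales[i / block_size] * elements[i];
  }
  return out;
}

// A block-scaled dot over 1-D operands is then an ordinary dot of the
// dequantized values.
float BlockScaledDot(const std::vector<float>& lhs,
                     const std::vector<float>& lhs_scales,
                     const std::vector<float>& rhs,
                     const std::vector<float>& rhs_scales,
                     std::size_t block_size) {
  const std::vector<float> a = BlockDequantize(lhs, lhs_scales, block_size);
  const std::vector<float> b = BlockDequantize(rhs, rhs_scales, block_size);
  float acc = 0.0f;
  for (std::size_t i = 0; i < a.size(); ++i) acc += a[i] * b[i];
  return acc;
}
```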

Copybara import of the project:

--
5dcc610e804e7aaad9b79369f714a63f9f096ad8 by Sergey Kozub <[email protected]>:

Add block scaling rewriter pass

Merging this change closes tensorflow#21800

PiperOrigin-RevId: 720940919
PiperOrigin-RevId: 720951690
This both simplifies the giant EmitMatmul function & makes it more generic, simplifying the TMA change (see CL chain).

PiperOrigin-RevId: 720962746
…k to the default layout in HLO parser just for entry_computation_layout.

PiperOrigin-RevId: 720970103
PiperOrigin-RevId: 720982236
Updating:
 - `env.h`
 - `env_time.h`
 - `errors.h`
 - `file_statistics.h`
 - `file_system.h`
 - `file_system_helper.h`
 - `logging.h`
 - `macros.h`
 - `status.h`
 - `status_matchers.h`
 - `status_to_from_proto.h`
 - `statusor.h`
 - `test.h`
 - `test_benchmark.h`
 - `threadpool.h`
 - `threadpool_async_executor.h`
 - `threadpool_interface.h`
 - `threadpool_options.h`
 - `types.h`

and associated targets.

PiperOrigin-RevId: 721004530
`CudaExecutor::Allocate` used to always return a nullptr when the user requested an allocation in the collective memory space. This was caused by a mistake in one of my refactorings a while ago.
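A minimal, hypothetical sketch of the kind of dispatch described above; the enum values, helper names, and malloc stand-ins are assumptions for illustration, not the actual StreamExecutor API.

```cpp
#include <cstdint>
#include <cstdlib>

enum class MemorySpace { kDefault, kCollective };

// Stand-ins for the real device / collective allocators.
void* DeviceAllocate(uint64_t size) { return std::malloc(size); }
void* CollectiveAllocate(uint64_t size) { return std::malloc(size); }

void* Allocate(uint64_t size, MemorySpace space) {
  // The bug amounted to the collective branch being lost in a refactoring,
  // so collective requests fell through to a nullptr return.
  switch (space) {
    case MemorySpace::kCollective:
      return CollectiveAllocate(size);
    case MemorySpace::kDefault:
      return DeviceAllocate(size);
  }
  return nullptr;
}
```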

PiperOrigin-RevId: 721009378
Add ToProto support for remaining Thunk types

Reverts 444d561

PiperOrigin-RevId: 721025054
Imported from GitHub PR openxla/xla#21901

Add rocm 6.1.0 dependency for ubuntu 20.04
Copybara import of the project:

--
0acf028eeca5923c7f2aa5762297686836eda310 by Alexandros Theodoridis <[email protected]>:

Add rocm6.1 deps for ubuntu 20.04

--
fc88c83061d6efff2482599489d622ab3114b9a7 by Alexandros Theodoridis <[email protected]>:

Fix hermetic build for 6.0

--
73ace5591f4731e1b95b6d3e6a349b528977c580 by Alexandros Theodoridis <[email protected]>:

Add ci config for hermetic build

--
bbc048bcffd9d35bfad76ff816ed22f3e3f761f8 by Alexandros Theodoridis <[email protected]>:

Introduce rocm 6.1.0 dependency for 22.04

--
9776f398c2711ba37333d29b934d6ba67c55dbef by Alexandros Theodoridis <[email protected]>:

Add missing 24.04 redist

--
acf275d57cc185b9c2122d5930d8cf54e473ad95 by Alexandros Theodoridis <[email protected]>:

Fix test

--
3e49285b0f55597ab5f44c1d0a422bf931d72cda by Alexandros Theodoridis <[email protected]>:

Add comment explaining the reason for a new target

--
35838bf8d6e678717e9b1c551f840918b00a91f8 by Alexandros Theodoridis <[email protected]>:

Revert force verbose in the compiler wrapper

--
2952e115b044e1a8ac8aadc7eac7802e8d79cf91 by Alexandros Theodoridis <[email protected]>:

Add explanation comment for the new target

Merging this change closes tensorflow#21901

PiperOrigin-RevId: 721043735
Updating:
 - `env.h`
 - `env_time.h`
 - `errors.h`
 - `file_statistics.h`
 - `file_system.h`
 - `file_system_helper.h`
 - `logging.h`
 - `macros.h`
 - `status.h`
 - `status_matchers.h`
 - `status_to_from_proto.h`
 - `statusor.h`
 - `test.h`
 - `test_benchmark.h`
 - `threadpool.h`
 - `threadpool_async_executor.h`
 - `threadpool_interface.h`
 - `threadpool_options.h`
 - `types.h`

and associated targets.

PiperOrigin-RevId: 721044569
Update the rule to include all LiteRtXXXX symbols.

PiperOrigin-RevId: 721045739
Imported from GitHub PR openxla/xla#21948

Copybara import of the project:

--
affa734c3c6e2af934dd12eafe7e8771ab0ee8db by Ilia Sergachev <[email protected]>:

[GPU] Upgrade cuDNN frontend to 1.10.0.

Merging this change closes tensorflow#21948

PiperOrigin-RevId: 721075669
PiperOrigin-RevId: 721086005
…itectures (Blackwell)

Imported from GitHub PR openxla/xla#22029

In addition to SM120a, also add SM101a mentioned in the PTX 8.7 spec (https://docs.nvidia.com/cuda/parallel-thread-execution/#release-notes), which is a slight variation of SM100a.

Bumping the max supported PTX version to 8.7, as the LLVM PR (llvm/llvm-project#124155) adding the support is now integrated into OpenXLA.
Copybara import of the project:

--
be59b7a51721637d880207e7adb69a18c3a92bea by Sergey Kozub <[email protected]>:

[XLA:GPU] Add support for SM101a and SM120a architectures (Blackwell)

Merging this change closes tensorflow#22029

PiperOrigin-RevId: 721088886
PiperOrigin-RevId: 721089414
…e beginning of loop bodies

PiperOrigin-RevId: 721089737
reedwm and others added 17 commits February 3, 2025 20:44
The first bug is that all-to-all ops with multiple replica groups did not work, because the thunk stored a map from local_id to some temporary memory used by the a2a implementation, where local_id was relative to the start of the replica_group. This meant devices of different groups would use the same temporary memory, overwriting each other's results. The fix is to change the map's key from local_id to StreamExecutor*.

The second bug is that the temporary memory mentioned above is registered as host memory but never deregistered. It would be deregistered in NcclAllToAllStartThunk::Cleanup(), but Cleanup() is never called. If Cleanup() were called, it would fix the bug, but it would cause the memory to be registered and deregistered on every run of the executable, which is unacceptably slow. The fix is to deregister the memory in the thunk destructor instead, which is done implicitly by storing an se::MemoryAllocation instead of an int64_t* in the map.

Since the two fixes affect the exact same code, I'm putting them in a single change instead of two separate changes.
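A self-contained sketch of the data-structure change described above; the classes here are placeholders standing in for the real StreamExecutor types, and the alias names are illustrative.

```cpp
#include <cstdint>
#include <memory>
#include <unordered_map>

namespace se {
class StreamExecutor {};    // stand-in for stream_executor::StreamExecutor
class MemoryAllocation {};  // stand-in for an RAII host-memory registration
}  // namespace se

// Before: keyed by the group-local rank, holding a raw pointer that was
// registered as host memory but never deregistered.
using BuggyMap = std::unordered_map<int64_t, int64_t*>;

// After: keyed by the device's StreamExecutor*, so devices in different
// replica groups never collide, and holding an RAII MemoryAllocation that
// deregisters the host memory when the thunk (and thus the map) is destroyed.
using FixedMap = std::unordered_map<se::StreamExecutor*,
                                    std::unique_ptr<se::MemoryAllocation>>;
```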

PiperOrigin-RevId: 722909883
The Create() method creates the EnvironmentSingleton with options. It will fail if there is a pre-created instance.
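A minimal sketch of the create-once-with-options pattern described here; the class, option, and method names are placeholders rather than the actual API, and failure is modeled as returning nullptr.

```cpp
#include <memory>
#include <string>

struct Options {
  std::string config;
};

class EnvironmentSingleton {
 public:
  // Creates the singleton with the given options; fails (returns nullptr)
  // if an instance was already created.
  static EnvironmentSingleton* Create(const Options& options) {
    if (instance_ != nullptr) return nullptr;
    instance_.reset(new EnvironmentSingleton(options));
    return instance_.get();
  }

  static EnvironmentSingleton* Get() { return instance_.get(); }

 private:
  explicit EnvironmentSingleton(const Options& options) : options_(options) {}
  Options options_;
  static std::unique_ptr<EnvironmentSingleton> instance_;
};

std::unique_ptr<EnvironmentSingleton> EnvironmentSingleton::instance_;
```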

PiperOrigin-RevId: 722913256
PiperOrigin-RevId: 722913817
PiperOrigin-RevId: 722914748
…ncat operations in SPMD partitioner.

PiperOrigin-RevId: 722915685
PiperOrigin-RevId: 722921509
PiperOrigin-RevId: 722921904
PiperOrigin-RevId: 722956003
PiperOrigin-RevId: 722965304
PiperOrigin-RevId: 722974173
PiperOrigin-RevId: 722977003
The dependencies may have been needed before, but some passes have been moved.

PiperOrigin-RevId: 722987106
PiperOrigin-RevId: 723005231
@hsharsha force-pushed the develop-upstream-sync-20250204 branch from c6b6681 to f59a9fb on February 5, 2025 13:57
@hsharsha force-pushed the develop-upstream-sync-20250204 branch from f59a9fb to b372f3b on February 6, 2025 10:26
@alekstheod

!gen-cache

@okakarpa (Collaborator) commented Feb 6, 2025

The disk cache generation for the cpu-pycpp tests status: successfully finished
The disk cache generation for the gpu-pycpp tests status: successfully finished
The disk cache generation for the gpu-nonpip-multi tests status: successfully finished

The disk cache generation for the XLA tests status: successfully finished

@i-chaochen left a comment

Is this skipped test related to openxla/xla#22383? If not, could you create a task on our board to track this test?

@hsharsha force-pushed the develop-upstream-sync-20250204 branch from b372f3b to 9482582 on February 10, 2025 17:57
@hsharsha merged commit e5b0d2f into develop-upstream on Feb 10, 2025
4 of 5 checks passed