
Develop upstream sync 20250204 #2837

Merged
merged 373 commits into develop-upstream from develop-upstream-sync-20250204 on Feb 10, 2025
Conversation


@hsharsha commented Feb 4, 2025

No description provided.

tensorflower-gardener and others added 30 commits January 29, 2025 05:36
PiperOrigin-RevId: 720935918
PiperOrigin-RevId: 720936848
Updates LLVM usage to match
[a06c89387621](llvm/llvm-project@a06c89387621)

PiperOrigin-RevId: 720937292
PiperOrigin-RevId: 720939787
Imported from GitHub PR openxla/xla#21800

This PR adds a transformation pass that supports custom calls to block quantize/dequantize/dot ops.
Such calls are replaced by an equivalent sequence of HLO operations.

This pass is supposed to support MX scaling formats, such as MXFP8, but is not limited to those and can be used with any data types and block sizes.
The quantization op sequence matches the one described in section 6.3 of the MX spec:
https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf

Once cuDNN frontend 1.10 is released, a lowering to a cuDNN graph will be enabled for the hardware that supports block scaled dot natively (i.e. Blackwell). This pass will stay disabled until then.

I also plan on introducing a new HLO op, "block-scaled-dot", which will be more generic than a custom call; for example, it will have configurable dimension numbers akin to the general dot op. This will follow in a separate PR; once that is approved, I'll replace the custom call "__op$block_scaled_dot" with it.
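As a rough illustration only (not the pass implementation), the C++ sketch below shows the dequantize-then-dot semantics that a block-scaled dot expands to: each block of elements shares one scale, dequantization multiplies each element by its block's scale (per section 6.3 of the MX spec), and the result feeds an ordinary dot. The function names, flat 1-D operand layout, and use of float throughout are illustrative assumptions.

```cpp
// Hypothetical sketch of the dequantize-then-dot semantics; not XLA code.
#include <cstddef>
#include <vector>

// Each block of `block_size` elements shares one scale; the dequantized
// value is scale * element.
std::vector<float> BlockDequantize(const std::vector<float>& elements,
                                   const std::vector<float>& scales,
                                   std::size_t block_size) {
  std::vector<float> out(elements.size());
  for (std::size_t i = 0; i < elements.size(); ++i) {
    out[i] = scales[i / block_size] * elements[i];
  }
  return out;
}

// A block-scaled dot over 1-D operands is then an ordinary dot of the
// dequantized values.
float BlockScaledDot(const std::vector<float>& lhs,
                     const std::vector<float>& lhs_scales,
                     const std::vector<float>& rhs,
                     const std::vector<float>& rhs_scales,
                     std::size_t block_size) {
  const std::vector<float> a = BlockDequantize(lhs, lhs_scales, block_size);
  const std::vector<float> b = BlockDequantize(rhs, rhs_scales, block_size);
  float acc = 0.0f;
  for (std::size_t i = 0; i < a.size(); ++i) acc += a[i] * b[i];
  return acc;
}
```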

Copybara import of the project:

--
5dcc610e804e7aaad9b79369f714a63f9f096ad8 by Sergey Kozub <[email protected]>:

Add block scaling rewriter pass

Merging this change closes tensorflow#21800

PiperOrigin-RevId: 720940919
PiperOrigin-RevId: 720951690
This both simplifies the giant EmitMatmul function & makes it more generic, simplifying the TMA change (see CL chain).

PiperOrigin-RevId: 720962746
…k to the default layout in HLO parser just for entry_computation_layout.

PiperOrigin-RevId: 720970103
PiperOrigin-RevId: 720982236
Updating:
 - `env.h`
 - `env_time.h`
 - `errors.h`
 - `file_statistics.h`
 - `file_system.h`
 - `file_system_helper.h`
 - `logging.h`
 - `macros.h`
 - `status.h`
 - `status_matchers.h`
 - `status_to_from_proto.h`
 - `statusor.h`
 - `test.h`
 - `test_benchmark.h`
 - `threadpool.h`
 - `threadpool_async_executor.h`
 - `threadpool_interface.h`
 - `threadpool_options.h`
 - `types.h`

and associated targets.

PiperOrigin-RevId: 721004530
`CudaExecutor::Allocate` used to always return a nullptr when the user requested an allocation in the collective memory space. This was caused by a mistake in one of my refactorings a while ago.
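A minimal, hypothetical sketch of the kind of dispatch described above; the enum values, helper names, and malloc stand-ins are assumptions for illustration, not the actual StreamExecutor API.

```cpp
#include <cstdint>
#include <cstdlib>

enum class MemorySpace { kDefault, kCollective };

// Stand-ins for the real device / collective allocators.
void* DeviceAllocate(uint64_t size) { return std::malloc(size); }
void* CollectiveAllocate(uint64_t size) { return std::malloc(size); }

void* Allocate(uint64_t size, MemorySpace space) {
  // The bug amounted to the collective branch being lost in a refactoring,
  // so collective requests fell through to a nullptr return.
  switch (space) {
    case MemorySpace::kCollective:
      return CollectiveAllocate(size);
    case MemorySpace::kDefault:
      return DeviceAllocate(size);
  }
  return nullptr;
}
```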

PiperOrigin-RevId: 721009378
Add ToProto support for remaining Thunk types

Reverts 444d561

PiperOrigin-RevId: 721025054
Imported from GitHub PR openxla/xla#21901

Add rocm 6.1.0 dependency for ubuntu 20.04
Copybara import of the project:

--
0acf028eeca5923c7f2aa5762297686836eda310 by Alexandros Theodoridis <[email protected]>:

Add rocm6.1 deps for ubuntu 20.04

--
fc88c83061d6efff2482599489d622ab3114b9a7 by Alexandros Theodoridis <[email protected]>:

Fix hermetic build for 6.0

--
73ace5591f4731e1b95b6d3e6a349b528977c580 by Alexandros Theodoridis <[email protected]>:

Add ci config for hermetic build

--
bbc048bcffd9d35bfad76ff816ed22f3e3f761f8 by Alexandros Theodoridis <[email protected]>:

Introduce rocm 6.1.0 dependency for 22.04

--
9776f398c2711ba37333d29b934d6ba67c55dbef by Alexandros Theodoridis <[email protected]>:

Add missing 24.04 redist

--
acf275d57cc185b9c2122d5930d8cf54e473ad95 by Alexandros Theodoridis <[email protected]>:

Fix test

--
3e49285b0f55597ab5f44c1d0a422bf931d72cda by Alexandros Theodoridis <[email protected]>:

Add comment explaining the reason for a new target

--
35838bf8d6e678717e9b1c551f840918b00a91f8 by Alexandros Theodoridis <[email protected]>:

Revert force verbose in the compiler wrapper

--
2952e115b044e1a8ac8aadc7eac7802e8d79cf91 by Alexandros Theodoridis <[email protected]>:

Add explanation comment for the new target

Merging this change closes tensorflow#21901

PiperOrigin-RevId: 721043735
Updating:
 - `env.h`
 - `env_time.h`
 - `errors.h`
 - `file_statistics.h`
 - `file_system.h`
 - `file_system_helper.h`
 - `logging.h`
 - `macros.h`
 - `status.h`
 - `status_matchers.h`
 - `status_to_from_proto.h`
 - `statusor.h`
 - `test.h`
 - `test_benchmark.h`
 - `threadpool.h`
 - `threadpool_async_executor.h`
 - `threadpool_interface.h`
 - `threadpool_options.h`
 - `types.h`

and associated targets.

PiperOrigin-RevId: 721044569
Update the rule to include all LiteRtXXXX symbols.

PiperOrigin-RevId: 721045739
Imported from GitHub PR openxla/xla#21948

Copybara import of the project:

--
affa734c3c6e2af934dd12eafe7e8771ab0ee8db by Ilia Sergachev <[email protected]>:

[GPU] Upgrade cuDNN frontend to 1.10.0.

Merging this change closes tensorflow#21948

PiperOrigin-RevId: 721075669
PiperOrigin-RevId: 721086005
…itectures (Blackwell)

Imported from GitHub PR openxla/xla#22029

In addition to SM120a, also add SM101a mentioned in the PTX 8.7 spec (https://docs.nvidia.com/cuda/parallel-thread-execution/#release-notes), which is a slight variation of SM100a.

Bumping the max supported PTX version to 8.7, as the LLVM PR (llvm/llvm-project#124155) adding the support is now integrated into OpenXLA.
Copybara import of the project:

--
be59b7a51721637d880207e7adb69a18c3a92bea by Sergey Kozub <[email protected]>:

[XLA:GPU] Add support for SM101a and SM120a architectures (Blackwell)

Merging this change closes tensorflow#22029

PiperOrigin-RevId: 721088886
PiperOrigin-RevId: 721089414
…e beginning of loop bodies

PiperOrigin-RevId: 721089737
reedwm and others added 17 commits February 3, 2025 20:44
The first bug is that all-to-all ops with multiple replica groups did not work, because the thunk stored a map from local_id to some temporary memory used by the a2a implementation, where local_id was relative to the start of the replica_group. This meant devices of different groups would use the same temporary memory, overwriting each other's results. The fix is to change the map's key from local_id to StreamExecutor*.

The second bug is that the temporary memory mentioned above is registered as host memory but never deregistered. It would be deregistered in NcclAllToAllStartThunk::Cleanup(), but Cleanup() is never called. If Cleanup() were called, it would fix the bug, but it would cause the memory to be registered and deregistered on every run of the executable, which is unacceptably slow. The fix is to deregister the memory in the thunk destructor instead, which is done implicitly by storing an se::MemoryAllocation instead of an int64_t* in the map.

Since the two fixes affect the exact same code, I'm putting them in a single change instead of two separate changes.
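A self-contained sketch of the data-structure change described above; the classes here are placeholders standing in for the real StreamExecutor types, and the alias names are illustrative.

```cpp
#include <cstdint>
#include <memory>
#include <unordered_map>

namespace se {
class StreamExecutor {};    // stand-in for stream_executor::StreamExecutor
class MemoryAllocation {};  // stand-in for an RAII host-memory registration
}  // namespace se

// Before: keyed by the group-local rank, holding a raw pointer that was
// registered as host memory but never deregistered.
using BuggyMap = std::unordered_map<int64_t, int64_t*>;

// After: keyed by the device's StreamExecutor*, so devices in different
// replica groups never collide, and holding an RAII MemoryAllocation that
// deregisters the host memory when the thunk (and thus the map) is destroyed.
using FixedMap = std::unordered_map<se::StreamExecutor*,
                                    std::unique_ptr<se::MemoryAllocation>>;
```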

PiperOrigin-RevId: 722909883
The Create() method creates the EnvironmentSingleton with options. It will fail if there is a pre-created instance.
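A minimal sketch of the create-once-with-options pattern described here; the class, option, and method names are placeholders rather than the actual API, and failure is modeled as returning nullptr.

```cpp
#include <memory>
#include <string>

struct Options {
  std::string config;
};

class EnvironmentSingleton {
 public:
  // Creates the singleton with the given options; fails (returns nullptr)
  // if an instance was already created.
  static EnvironmentSingleton* Create(const Options& options) {
    if (instance_ != nullptr) return nullptr;
    instance_.reset(new EnvironmentSingleton(options));
    return instance_.get();
  }

  static EnvironmentSingleton* Get() { return instance_.get(); }

 private:
  explicit EnvironmentSingleton(const Options& options) : options_(options) {}
  Options options_;
  static std::unique_ptr<EnvironmentSingleton> instance_;
};

std::unique_ptr<EnvironmentSingleton> EnvironmentSingleton::instance_;
```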

PiperOrigin-RevId: 722913256
PiperOrigin-RevId: 722913817
PiperOrigin-RevId: 722914748
…ncat operations in SPMD partitioner.

PiperOrigin-RevId: 722915685
PiperOrigin-RevId: 722921509
PiperOrigin-RevId: 722921904
PiperOrigin-RevId: 722956003
PiperOrigin-RevId: 722965304
PiperOrigin-RevId: 722974173
PiperOrigin-RevId: 722977003
The dependencies may have been needed before, but some passes have been moved.

PiperOrigin-RevId: 722987106
PiperOrigin-RevId: 723005231
@hsharsha force-pushed the develop-upstream-sync-20250204 branch from c6b6681 to f59a9fb on February 5, 2025 13:57
@hsharsha force-pushed the develop-upstream-sync-20250204 branch from f59a9fb to b372f3b on February 6, 2025 10:26
@alekstheod

!gen-cache

@okakarpa (Collaborator) commented Feb 6, 2025

The disk cache generation for the cpu-pycpp tests status: successfully finished
The disk cache generation for the gpu-pycpp tests status: successfully finished
The disk cache generation for the gpu-nonpip-multi tests status: successfully finished

The disk cache generation for the XLA tests status: successfully finished

@i-chaochen left a comment

Is this skipped test related to openxla/xla#22383? If not, could you create a task on our board to track this test?

@hsharsha force-pushed the develop-upstream-sync-20250204 branch from b372f3b to 9482582 on February 10, 2025 17:57
@hsharsha merged commit e5b0d2f into develop-upstream on Feb 10, 2025
4 of 5 checks passed