
Build failure: magma #220357

Closed
bcdarwin opened this issue Mar 9, 2023 · 13 comments · Fixed by #220402
Assignees
Labels
`0.kind: build failure` (A package fails to build) · `6.topic: cuda` (Parallel computing platform and API)

Comments

@bcdarwin
Member

bcdarwin commented Mar 9, 2023

Steps To Reproduce


  1. build magma

Build log

From the full log:

[2813/3430] Linking CXX shared library lib/libmagma.so
FAILED: lib/libmagma.so
: && /nix/store/ds6ivg31k3l0pjhhf3s769bkpmafa54g-gcc-wrapper-11.3.0/bin/c++ -fPIC -std=c++11 -fopenmp -Wall -Wno-unused-function -O3 -DNDEBUG   -shared -Wl,-soname,libmagma
/nix/store/76l4v99sk83ylfwkz8wmwrm4s8h73rhd-glibc-2.35-224/lib/crti.o: in function `_init':
(.init+0xb): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against undefined symbol `__gmon_start__'
CMakeFiles/magma.dir/control/get_batched_crossover.cpp.o: in function `magma_use_zgeqrf_batched_fused_update':
get_batched_crossover.cpp:(.text+0x23a): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against symbol `zgeqrf_panel_decision_a100' defined in .bss section in CMakeFil
CMakeFiles/magma.dir/control/get_batched_crossover.cpp.o: in function `magma_use_cgeqrf_batched_fused_update':
get_batched_crossover.cpp:(.text+0x31a): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against symbol `cgeqrf_panel_decision_a100' defined in .bss section in CMakeFil
CMakeFiles/magma.dir/control/get_batched_crossover.cpp.o: in function `magma_use_dgeqrf_batched_fused_update':
get_batched_crossover.cpp:(.text+0x3fa): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against symbol `dgeqrf_panel_decision_a100' defined in .bss section in CMakeFil
CMakeFiles/magma.dir/control/get_batched_crossover.cpp.o: in function `magma_use_sgeqrf_batched_fused_update':
get_batched_crossover.cpp:(.text+0x4da): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against symbol `sgeqrf_panel_decision_a100' defined in .bss section in CMakeFil
CMakeFiles/magma.dir/control/get_batched_crossover.cpp.o: in function `__static_initialization_and_destruction_0(int, int) [clone .constprop.0]':
get_batched_crossover.cpp:(.text.startup+0xce9): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against symbol `sgeqrf_panel_decision_mi100' defined in .bss section in
get_batched_crossover.cpp:(.text.startup+0xd35): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against symbol `sgeqrf_panel_decision_mi100' defined in .bss section in
get_batched_crossover.cpp:(.text.startup+0xd3c): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against symbol `std::vector<std::vector<int, std::allocator<int> >, std
get_batched_crossover.cpp:(.text.startup+0xd43): relocation truncated to fit: R_X86_64_PC32 against symbol `__dso_handle' defined in .data.rel.local section in /nix/store/v
get_batched_crossover.cpp:(.text.startup+0x16ea): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against symbol `dgeqrf_panel_decision_mi100' defined in .bss section i
get_batched_crossover.cpp:(.text.startup+0x1739): additional relocation overflows omitted from the output
lib/libmagma.so: PC-relative offset overflow in PLT entry for `magma_cgerc'
collect2: error: ld returned 1 exit status
ninja: build stopped: subcommand failed.

Notify maintainers

@tbenst

Also @ConnorBaker @samuela may be interested.

Metadata

 - system: `"x86_64-linux"`
 - host os: `Linux 4.15.0-169-generic, Ubuntu, 18.04.6 LTS (Bionic Beaver), nobuild`
 - multi-user?: `no`
 - sandbox: `no`
 - version: `nix-env (Nix) 2.14.0pre20230222_4a921ba`
 - channels(ben): `"home-manager, nixpkgs"`
 - nixpkgs: `/home/ben/.nix-defexpr/channels/nixpkgs`
@bcdarwin bcdarwin added the `0.kind: build failure` (A package fails to build) label Mar 9, 2023
@ConnorBaker ConnorBaker added the `6.topic: cuda` (Parallel computing platform and API) label Mar 9, 2023
@github-project-automation github-project-automation bot moved this to 🆕 New in CUDA Team Mar 9, 2023
@samuela
Member

samuela commented Mar 9, 2023

@ConnorBaker Is this related to the race condition error you saw before?

@ConnorBaker
Contributor

ConnorBaker commented Mar 9, 2023

@bcdarwin can you tell me more about what hardware you're using, and what your config.nix looks like?

For reference, nix build --impure -L nixpkgs/master#magma runs without issue for me with

# ~/.config/nixpkgs/config.nix 
{
  allowUnfree = true;
  cudaSupport = true;
  cudaCapabilities = [ "8.6" ];
  cudaForwardCompat = false;
}

using an RTX 4090 and an i9-13900K:

$ nix run nixpkgs#nix-info -- -m
 - system: `"x86_64-linux"`
 - host os: `Linux 6.1.14-200.fc37.x86_64, Fedora Linux, 37 (Workstation Edition), nobuild`
 - multi-user?: `no`
 - sandbox: `no`
 - version: `nix-env (Nix) 2.14.0pre20230208_ec78896`
 - channels(connorbaker): `"nixpkgs"`
 - nixpkgs: `/home/connorbaker/.nix-defexpr/channels/nixpkgs`

EDIT:

Would you also try building with #220366? I made some changes there that moved the CUDA runtime stub and NVCC into nativeBuildInputs. I'm curious if that helps at all.

@ConnorBaker ConnorBaker self-assigned this Mar 9, 2023
@ConnorBaker ConnorBaker moved this from 🆕 New to 🏗 In progress in CUDA Team Mar 9, 2023
@ConnorBaker
Contributor

@samuela I don't think so -- the closest thing I can remember to this was the linking error AMD HIP had with 2.7.x (which is why it's stuck on magma 2.6.x), but that was a different inscrutable error from ld.

@bcdarwin
Member Author

bcdarwin commented Mar 9, 2023

CPU is a Xeon Gold 5218 with Quadro RTX 8000 GPUs. I haven't set any configuration other than cudaSupport and allowUnfree.

@bcdarwin
Member Author

bcdarwin commented Mar 9, 2023

I don't know why this issue isn't affecting your builds, but most likely the fix is to set `-mcmodel`: https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html
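
If `-mcmodel` is indeed the right knob, one way to experiment would be an overlay along these lines (a hypothetical, untested sketch; the choice of the `medium` code model and the use of `NIX_CFLAGS_COMPILE` are assumptions, not a confirmed fix):

```nix
# overlay.nix -- hypothetical sketch, not a verified fix
final: prev: {
  magma = prev.magma.overrideAttrs (old: {
    # -mcmodel=medium lets large .bss/.data objects live outside the
    # 2 GiB small-code-model window, which is what the "relocation
    # truncated to fit" errors above are running into.
    NIX_CFLAGS_COMPILE = toString (old.NIX_CFLAGS_COMPILE or "") + " -mcmodel=medium";
  });
}
```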

@bcdarwin
Member Author

bcdarwin commented Mar 9, 2023

Haven't tried #220366 yet either, sorry.

@ConnorBaker
Contributor

Any idea when this failure started occurring?

@bcdarwin
Member Author

bcdarwin commented Mar 9, 2023

magma 2.6.2 I believe works but 2.7.1 fails. I can try to bisect at some point if needed.

@ConnorBaker
Contributor

What channel were you building from by the way? I'm gonna try to find a way to match your setup but I need to know exactly which version of nixpkgs you're using to build.

@ConnorBaker
Contributor

Okay, I was able to reproduce it. It happened while running a nixpkgs-review of the magma PR I linked earlier.

I suspect it has something to do with either my disabling cudaForwardCompat or restricting cudaCapabilities. Time to investigate 🕵️‍♂️

@ConnorBaker
Contributor

ConnorBaker commented Mar 9, 2023

Something for me to look at later tonight: if the increased number of CUDA capabilities being targeted is causing the binary to bloat, check that we're using `-Xfatbin -compress-all` like PyTorch does: https://github.com/pytorch/pytorch/blob/fe05266fda4f908130dea7cbac37e9264c0429a2/CMakeLists.txt#L548. IIRC Magma doesn't set that flag. Also, I remember being unable to find `-Xfatbin` specifically in the NVCC docs.

Since PyTorch also ships binaries targeting a bunch of different CUDA capabilities, their configs are a goldmine of flags we might need to look into.

EDIT: Haven't built with it yet, but it seems like it should fix it. Apache MXNet had the same issue: apache/mxnet#19123.

EDIT2: Reminder to self: if that is the fix, add -Xfatbin=-compress-all here:

export NVCC_PREPEND_FLAGS+=' --compiler-bindir=${cc}/bin'
and here:
export NVCC_PREPEND_FLAGS+=' --compiler-bindir=${backendStdenv.cc}/bin'
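
If `-Xfatbin=-compress-all` is the fix, those lines would presumably end up looking something like this (untested sketch; exact flag ordering and quoting still to be confirmed):

```sh
export NVCC_PREPEND_FLAGS+=' -Xfatbin=-compress-all --compiler-bindir=${cc}/bin'
```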

@ConnorBaker
Contributor

I know it's still a draft @bcdarwin but can you try building again with #220402? I think it should fix your issue.

Unrelated, but if you're building anything from source and you want faster builds I highly recommend specifying the single compute capability you need to build for (like I do here #220357 (comment)) because it results in massively faster builds. HEAD right now builds for 14 different capabilities and that PR gets it down to 8, but it's much faster to build for just one.
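
As a concrete example (a hypothetical sketch following the config.nix shown earlier; the Quadro RTX 8000 mentioned above is a Turing card, compute capability 7.5):

```nix
# ~/.config/nixpkgs/config.nix -- build for a single capability
{
  allowUnfree = true;
  cudaSupport = true;
  cudaCapabilities = [ "7.5" ];  # Quadro RTX 8000 (Turing)
}
```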

@ConnorBaker ConnorBaker moved this from 🏗 In progress to 👀 In review in CUDA Team Mar 10, 2023
@bcdarwin
Member Author

> I know it's still a draft @bcdarwin but can you try building again with #220402? I think it should fix your issue.
>
> Unrelated, but if you're building anything from source and you want faster builds I highly recommend specifying the single compute capability you need to build for (like I do here #220357 (comment)) because it results in massively faster builds. HEAD right now builds for 14 different capabilities and that PR gets it down to 8, but it's much faster to build for just one.

magma is building for me from that branch (without any configuring compute capabilities yet), thanks.

@github-project-automation github-project-automation bot moved this from 👀 In review to ✅ Done in CUDA Team Mar 13, 2023
samuela added a commit that referenced this issue Mar 13, 2023

cudaPackages: fix #220357; use -Xfatbin=-compress-all; prune default cudaCapabilities