
Build failure: magma #220357

Closed
bcdarwin opened this issue Mar 9, 2023 · 13 comments · Fixed by #220402
Assignees
Labels
`0.kind: build failure` (A package fails to build) · `6.topic: cuda` (Parallel computing platform and API)

Comments

@bcdarwin
Member

bcdarwin commented Mar 9, 2023

Steps To Reproduce


  1. build magma

Build log

From the full log:

[2813/3430] Linking CXX shared library lib/libmagma.so
FAILED: lib/libmagma.so
: && /nix/store/ds6ivg31k3l0pjhhf3s769bkpmafa54g-gcc-wrapper-11.3.0/bin/c++ -fPIC -std=c++11 -fopenmp -Wall -Wno-unused-function -O3 -DNDEBUG   -shared -Wl,-soname,libmagma
/nix/store/76l4v99sk83ylfwkz8wmwrm4s8h73rhd-glibc-2.35-224/lib/crti.o: in function `_init':
(.init+0xb): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against undefined symbol `__gmon_start__'
CMakeFiles/magma.dir/control/get_batched_crossover.cpp.o: in function `magma_use_zgeqrf_batched_fused_update':
get_batched_crossover.cpp:(.text+0x23a): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against symbol `zgeqrf_panel_decision_a100' defined in .bss section in CMakeFil
CMakeFiles/magma.dir/control/get_batched_crossover.cpp.o: in function `magma_use_cgeqrf_batched_fused_update':
get_batched_crossover.cpp:(.text+0x31a): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against symbol `cgeqrf_panel_decision_a100' defined in .bss section in CMakeFil
CMakeFiles/magma.dir/control/get_batched_crossover.cpp.o: in function `magma_use_dgeqrf_batched_fused_update':
get_batched_crossover.cpp:(.text+0x3fa): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against symbol `dgeqrf_panel_decision_a100' defined in .bss section in CMakeFil
CMakeFiles/magma.dir/control/get_batched_crossover.cpp.o: in function `magma_use_sgeqrf_batched_fused_update':
get_batched_crossover.cpp:(.text+0x4da): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against symbol `sgeqrf_panel_decision_a100' defined in .bss section in CMakeFil
CMakeFiles/magma.dir/control/get_batched_crossover.cpp.o: in function `__static_initialization_and_destruction_0(int, int) [clone .constprop.0]':
get_batched_crossover.cpp:(.text.startup+0xce9): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against symbol `sgeqrf_panel_decision_mi100' defined in .bss section in
get_batched_crossover.cpp:(.text.startup+0xd35): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against symbol `sgeqrf_panel_decision_mi100' defined in .bss section in
get_batched_crossover.cpp:(.text.startup+0xd3c): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against symbol `std::vector<std::vector<int, std::allocator<int> >, std
get_batched_crossover.cpp:(.text.startup+0xd43): relocation truncated to fit: R_X86_64_PC32 against symbol `__dso_handle' defined in .data.rel.local section in /nix/store/v
get_batched_crossover.cpp:(.text.startup+0x16ea): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against symbol `dgeqrf_panel_decision_mi100' defined in .bss section i
get_batched_crossover.cpp:(.text.startup+0x1739): additional relocation overflows omitted from the output
lib/libmagma.so: PC-relative offset overflow in PLT entry for `magma_cgerc'
collect2: error: ld returned 1 exit status
ninja: build stopped: subcommand failed.

Notify maintainers

@tbenst

Also @ConnorBaker @samuela may be interested.

Metadata

 - system: `"x86_64-linux"`
 - host os: `Linux 4.15.0-169-generic, Ubuntu, 18.04.6 LTS (Bionic Beaver), nobuild`
 - multi-user?: `no`
 - sandbox: `no`
 - version: `nix-env (Nix) 2.14.0pre20230222_4a921ba`
 - channels(ben): `"home-manager, nixpkgs"`
 - nixpkgs: `/home/ben/.nix-defexpr/channels/nixpkgs`
@bcdarwin bcdarwin added the `0.kind: build failure` (A package fails to build) label Mar 9, 2023
@ConnorBaker ConnorBaker added the `6.topic: cuda` (Parallel computing platform and API) label Mar 9, 2023
@github-project-automation github-project-automation bot moved this to 🆕 New in CUDA Team Mar 9, 2023
@samuela
Member

samuela commented Mar 9, 2023

@ConnorBaker Is this related to the race condition error you saw before?

@ConnorBaker
Contributor

ConnorBaker commented Mar 9, 2023

@bcdarwin can you tell me more about what hardware you're using, and what your config.nix looks like?

For reference, nix build --impure -L nixpkgs/master#magma runs without issue for me with

# ~/.config/nixpkgs/config.nix 
{
  allowUnfree = true;
  cudaSupport = true;
  cudaCapabilities = [ "8.6" ];
  cudaForwardCompat = false;
}

using an RTX 4090 and an i9-13900K:

$ nix run nixpkgs#nix-info -- -m
 - system: `"x86_64-linux"`
 - host os: `Linux 6.1.14-200.fc37.x86_64, Fedora Linux, 37 (Workstation Edition), nobuild`
 - multi-user?: `no`
 - sandbox: `no`
 - version: `nix-env (Nix) 2.14.0pre20230208_ec78896`
 - channels(connorbaker): `"nixpkgs"`
 - nixpkgs: `/home/connorbaker/.nix-defexpr/channels/nixpkgs`

EDIT:

Would you also try building with #220366? I made some changes there that moved the CUDA runtime stub and NVCC into nativeBuildInputs. I'm curious if that helps at all.

@ConnorBaker ConnorBaker self-assigned this Mar 9, 2023
@ConnorBaker ConnorBaker moved this from 🆕 New to 🏗 In progress in CUDA Team Mar 9, 2023
@ConnorBaker
Contributor

@samuela I don't think so -- the closest thing I can remember to this was the linking error AMD HIP had with 2.7.x (which is why it's stuck on magma 2.6.x), but that was a different inscrutable error from ld.

@bcdarwin
Member Author

bcdarwin commented Mar 9, 2023

CPU is a Xeon Gold 5218 with Quadro RTX 8000 GPUs. I haven't set any configuration other than cudaSupport and allowUnfree.

@bcdarwin
Member Author

bcdarwin commented Mar 9, 2023

I don't know why this issue isn't affecting your builds, but most likely the fix is to set `-mcmodel`: https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html
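
If `-mcmodel` is indeed the right knob, one way to experiment would be an overlay along these lines (a hypothetical, untested sketch; the choice of the `medium` code model and the use of `NIX_CFLAGS_COMPILE` are assumptions, not a confirmed fix):

```nix
# overlay.nix -- hypothetical sketch, not a verified fix
final: prev: {
  magma = prev.magma.overrideAttrs (old: {
    # -mcmodel=medium lets large .bss/.data objects live outside the
    # 2 GiB small-code-model window, which is what the "relocation
    # truncated to fit" errors above are running into.
    NIX_CFLAGS_COMPILE = toString (old.NIX_CFLAGS_COMPILE or "") + " -mcmodel=medium";
  });
}
```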

@bcdarwin
Member Author

bcdarwin commented Mar 9, 2023

Haven't tried #220366 yet either, sorry.

@ConnorBaker
Contributor

Any idea when this failure started occurring?

@bcdarwin
Member Author

bcdarwin commented Mar 9, 2023

magma 2.6.2 I believe works but 2.7.1 fails. I can try to bisect at some point if needed.

@ConnorBaker
Contributor

What channel were you building from by the way? I'm gonna try to find a way to match your setup but I need to know exactly which version of nixpkgs you're using to build.

@ConnorBaker
Contributor

Okay, I was able to reproduce it. It happened while running a nixpkgs-review of the magma PR I linked earlier.

I suspect it has something to do with either my disabling cudaForwardCompat or restricting cudaCapabilities. Time to investigate 🕵️‍♂️

@ConnorBaker
Contributor

ConnorBaker commented Mar 9, 2023

Something for me to look at later tonight: if the increased number of CUDA capabilities being targeted is causing the binary to bloat, check that we're using `-Xfatbin -compress-all` like PyTorch does: https://github.com/pytorch/pytorch/blob/fe05266fda4f908130dea7cbac37e9264c0429a2/CMakeLists.txt#L548. IIRC Magma doesn't set that flag. Also, I remember being unable to find `-Xfatbin` specifically in the NVCC docs.

Since PyTorch also ships binaries targeting a bunch of different CUDA capabilities, their configs are a goldmine of flags we might need to look into.

EDIT: Haven't built with it yet, but it seems like it should fix it. Apache MXNet had the same issue: apache/mxnet#19123.

EDIT2: Reminder to self: if that is the fix, add -Xfatbin=-compress-all here:

export NVCC_PREPEND_FLAGS+=' --compiler-bindir=${cc}/bin'
and here:
export NVCC_PREPEND_FLAGS+=' --compiler-bindir=${backendStdenv.cc}/bin'
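
If `-Xfatbin=-compress-all` is the fix, those lines would presumably end up looking something like this (untested sketch; exact flag ordering and quoting still to be confirmed):

```sh
export NVCC_PREPEND_FLAGS+=' -Xfatbin=-compress-all --compiler-bindir=${cc}/bin'
```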

@ConnorBaker
Contributor

I know it's still a draft @bcdarwin but can you try building again with #220402? I think it should fix your issue.

Unrelated, but if you're building anything from source and you want faster builds I highly recommend specifying the single compute capability you need to build for (like I do here #220357 (comment)) because it results in massively faster builds. HEAD right now builds for 14 different capabilities and that PR gets it down to 8, but it's much faster to build for just one.
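
As a concrete example (a hypothetical sketch following the config.nix shown earlier; the Quadro RTX 8000 mentioned above is a Turing card, compute capability 7.5):

```nix
# ~/.config/nixpkgs/config.nix -- build for a single capability
{
  allowUnfree = true;
  cudaSupport = true;
  cudaCapabilities = [ "7.5" ];  # Quadro RTX 8000 (Turing)
}
```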

@ConnorBaker ConnorBaker moved this from 🏗 In progress to 👀 In review in CUDA Team Mar 10, 2023
@bcdarwin
Member Author

> I know it's still a draft @bcdarwin but can you try building again with #220402? I think it should fix your issue.
>
> Unrelated, but if you're building anything from source and you want faster builds I highly recommend specifying the single compute capability you need to build for (like I do here #220357 (comment)) because it results in massively faster builds. HEAD right now builds for 14 different capabilities and that PR gets it down to 8, but it's much faster to build for just one.

magma is building for me from that branch (without any configuring compute capabilities yet), thanks.

@github-project-automation github-project-automation bot moved this from 👀 In review to ✅ Done in CUDA Team Mar 13, 2023
samuela added a commit that referenced this issue Mar 13, 2023

cudaPackages: fix #220357; use -Xfatbin=-compress-all; prune default cudaCapabilities