CABLE linked against incorrect libraries when run from the hh5 conda environment #279

Closed
SeanBryan51 opened this issue Apr 11, 2024 · 4 comments · Fixed by #282
Labels: bug, priority:high

Comments

@SeanBryan51
Collaborator

SeanBryan51 commented Apr 11, 2024

CABLE is linked against incorrect libraries for netcdf and MPI when running benchcab (v4.0.2) from the hh5 conda environment.

The following behaviour only occurs when running from the hh5 conda environment and not when running from the benchcab-dev environment.

Steps to reproduce:

module use /g/data/hh5/public/modules
module load conda/analysis3-unstable
git clone https://github.com/CABLE-LSM/bench_example.git
cd bench_example
cat > config.yaml << EOL
project: $PROJECT
realisations:
  - repo:
      git:
        branch: main
        commit: 46830b3773f1932680af158cab27ae223fd8685a
fluxsite:
  experiment: AU-Tum
modules: [
  intel-compiler/2021.1.1,
  netcdf/4.7.4,
  openmpi/4.1.0
]
EOL
benchcab checkout -v && benchcab build --mpi -v

Running the above causes the build to fail when compiling the MPI executable:

[ 96%] Building Fortran object CMakeFiles/cable-mpi.dir/src/science/pop/pop_mpi.F90.o
/scratch/tm70/sb8430/bench_example/src/main/src/science/pop/pop_mpi.F90(25): error #7013: This module file was not generated by any release of this compiler.   [MPI]
    USE MPI
--------^
...
compilation aborted for /scratch/tm70/sb8430/bench_example/src/main/src/science/pop/pop_mpi.F90 (code 1)

Output from CMake shows that we are linking against an MPI library found in the conda environment:

-- Found MPI_Fortran: /g/data/hh5/public/apps/miniconda3/envs/analysis3-24.01/lib/libmpi_usempif08.so (found version "3.1")

when the MPI_Fortran path should instead be pointing to /apps/openmpi/4.1.0/lib/....

The serial executable compiles successfully but is linked against the netcdf-fortran library found in the conda environment:

$ ldd src/main/bin/cable | grep netcdff
	libnetcdff.so.7 => /g/data/hh5/public/apps/miniconda3/envs/analysis3-24.01/lib/libnetcdff.so.7 (0x00007fe324c17000)

when this should point to /apps/netcdf/4.7.4/lib/....
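A check like the `ldd` inspection above could be automated. The following is a minimal sketch, not part of benchcab: the function name `libraries_under_prefix` is hypothetical, and it assumes the usual `name => path (0xaddress)` layout of `ldd` output shown above.

```python
import re

# Matches ldd lines such as "libfoo.so.1 => /path/to/libfoo.so.1 (0x...)".
# Lines without a resolved path (e.g. "not found") are deliberately skipped.
_LDD_LINE = re.compile(r"^\s*(\S+)\s+=>\s+(\S+)\s+\(0x[0-9a-fA-F]+\)\s*$")


def libraries_under_prefix(ldd_output, prefix):
    """Return {soname: resolved_path} for libraries resolved under prefix.

    Hypothetical helper: ldd_output is the text printed by `ldd <binary>`.
    """
    found = {}
    for line in ldd_output.splitlines():
        match = _LDD_LINE.match(line)
        if match and match.group(2).startswith(prefix):
            found[match.group(1)] = match.group(2)
    return found
```

The output of `subprocess.run(["ldd", "src/main/bin/cable"], capture_output=True, text=True, check=True).stdout` could be passed in with the prefix `/g/data/hh5/`; a non-empty result would indicate that the executable picked up libraries from the conda environment.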

Running the serial executable crashes due to undefined symbols from the netcdf library:

module use /g/data/hh5/public/modules
module load conda/analysis3-unstable
git clone https://github.com/CABLE-LSM/bench_example.git
cd bench_example
cat > config.yaml << EOL
project: $PROJECT
realisations:
  - repo:
      git:
        branch: main
        commit: 46830b3773f1932680af158cab27ae223fd8685a
fluxsite:
  experiment: AU-Tum
modules: [
  intel-compiler/2021.1.1,
  netcdf/4.7.4,
  openmpi/4.1.0
]
EOL
benchcab fluxsite -v

The PBS job script outputs:

2024-04-11 12:02:06,251 - DEBUG - fluxsite.fluxsite.py:242 - Error: CABLE returned an error for task AU-Tum_2002-2017_OzFlux_Met_R0_S0

Inspecting the standard output from CABLE:

$ cat runs/fluxsite/tasks/AU-Tum_2002-2017_OzFlux_Met_R0_S0/out.txt
./cable: symbol lookup error: ./cable: undefined symbol: netcdf_mp_nf90_inquire_variable_
SeanBryan51 added the bug label Apr 11, 2024
@ccarouge
Member

Sounds like when we think of a release of CABLE, we may want to release benchcab independently of hh5 environments... Still need to think on this one but this is annoying.

ccarouge added the priority:high label Apr 16, 2024
SeanBryan51 self-assigned this Apr 16, 2024
@SeanBryan51
Collaborator Author

The issue is due to environment variables being set which affect the behaviour of the build, notably LDFLAGS and CMAKE_PREFIX_PATH:

$ module load conda/analysis3-unstable
$ echo $LDFLAGS
-Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags -Wl,--gc-sections -Wl,--allow-shlib-undefined -Wl,-rpath,/g/data/hh5/public/apps/miniconda3/envs/analysis3-24.01/lib -Wl,-rpath-link,/g/data/hh5/public/apps/miniconda3/envs/analysis3-24.01/lib -L/g/data/hh5/public/apps/miniconda3/envs/analysis3-24.01/lib
$ echo $CMAKE_PREFIX_PATH
/g/data/hh5/public/apps/miniconda3/envs/analysis3-24.01:/g/data/hh5/public/apps/miniconda3/envs/analysis3-24.01/x86_64-conda-linux-gnu/sysroot/usr

A quick fix would be to unset these variables before invoking CMake.
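As a rough illustration of that quick fix, the sketch below removes the two variables from a copy of the environment and hands the copy to `subprocess.run`. The names `CONDA_BUILD_VARS`, `env_without` and `run_cmake` are hypothetical and are not taken from the benchcab source; the actual change may be structured differently.

```python
import os
import subprocess

# Variables named in this issue as steering CMake towards the conda environment.
CONDA_BUILD_VARS = ("LDFLAGS", "CMAKE_PREFIX_PATH")


def env_without(names, base=None):
    """Return a copy of base (default: os.environ) with the given names removed."""
    env = dict(os.environ if base is None else base)
    for name in names:
        env.pop(name, None)  # ignore variables that are not set
    return env


def run_cmake(args, cwd=None):
    """Hypothetical wrapper: invoke cmake with the problem variables unset."""
    return subprocess.run(
        ["cmake", *args],
        cwd=cwd,
        env=env_without(CONDA_BUILD_VARS),
        check=True,
    )
```

Working on a copy leaves the parent process environment untouched, so only the CMake invocation is affected.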

SeanBryan51 added a commit that referenced this issue Apr 17, 2024
CABLE is linked against incorrect libraries for netcdf and MPI when
running benchcab (v4.0.2) from the hh5 conda environment. The issue is
due to environment variables being set which affect the behaviour of the
build, notably LDFLAGS and CMAKE_PREFIX_PATH, which point CMake to find
the netcdf and MPI libraries installed in the current conda environment.
This change unsets these variables so that CMake finds the appropriate
libraries which get loaded in as modules.

Fixes #279
SeanBryan51 linked a pull request Apr 17, 2024 that will close this issue
@SeanBryan51
Collaborator Author

@dsroberts I noticed there are other environment variables being set when loading conda/analysis3-unstable which may impact build systems (e.g. CC, CFLAGS, CPPFLAGS, ...). It seems strange that these variables are being exported to the user environment. Do you know where these variables are coming from?

@dsroberts

dsroberts commented Apr 17, 2024

Hi @SeanBryan51 Yep. These come from the environment activation script for gcc_linux-64 found here: /g/data/hh5/public/apps/miniconda3/envs/analysis3-24.01/etc/conda/activate.d/activate-gcc_linux-64.sh. gcc_linux-64 is brought in by parcels, which is a dependency of some COSIMA recipes, so it can't just be removed. The conda module works by running conda activate in a 'blank' environment and parsing the output of the env command. There is some level of filtering, but I'm not sure we can assume that no one ever wants to build against the analysis3 environments. What you've done with passing env to subprocess.run is probably the most sensible solution, though I think rather than removing the LDFLAGS and CMAKE_PREFIX_PATH environment variables entirely, you could create copies of them with references to /g/data/hh5/... removed, then pass those to subprocess.run.
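One way the suggestion of filtered copies might look is sketched below. The names `HH5_ROOT`, `drop_entries` and `scrubbed_env` are hypothetical. The sketch assumes LDFLAGS is a plain space-separated list of flags with no quoting and that CMAKE_PREFIX_PATH is colon-separated, as in the values printed earlier in this thread.

```python
import os

# Prefix whose references should be filtered out (from this thread).
HH5_ROOT = "/g/data/hh5"


def drop_entries(value, sep, needle=HH5_ROOT):
    """Remove sep-separated entries of value that mention needle."""
    return sep.join(
        item for item in value.split(sep) if item and needle not in item
    )


def scrubbed_env(base=None):
    """Return a copy of base (default: os.environ) with hh5 references removed
    from LDFLAGS and CMAKE_PREFIX_PATH; other variables are left as they are."""
    env = dict(os.environ if base is None else base)
    if "LDFLAGS" in env:
        env["LDFLAGS"] = drop_entries(env["LDFLAGS"], " ")  # naive split, no quoting
    if "CMAKE_PREFIX_PATH" in env:
        env["CMAKE_PREFIX_PATH"] = drop_entries(env["CMAKE_PREFIX_PATH"], ":")
    return env
```

The result can be passed as `env=scrubbed_env()` to `subprocess.run`. Unlike unsetting the variables outright, this keeps any flags or prefixes that do not point into the conda environment.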
