Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

experiment: Static musl build #719

Merged
merged 72 commits into from
Feb 27, 2025
Merged

experiment: Static musl build #719

merged 72 commits into from
Feb 27, 2025

Conversation

markphelps
Copy link
Contributor

@markphelps markphelps commented Feb 12, 2025

try building a true static lib for musl as previously we were still dynamically linking

For musl targets, we need use +crt-static instead of -crt-static to enable static linking

-crt-static means disabling static linking which Claude says is still required for Android

The idea is that if we use a static lib built with musl we can still use it in hosts that use glibc.. which would mean we could finally get away from the two different versions of libs for Go/Java/etc (glibc vs musl)

This also switches to use the rusttls crate and feature for reqwest which is a TLS lib implemented in Rust instead of depending on the system openssl which we cant link against for musl if we want static links

TODO:

  • Get rid of the glibc builds all together
  • Update the packaging code to switch to .a instead of .so for linux
  • Test!

@markphelps markphelps requested a review from a team as a code owner February 12, 2025 23:56
@codecov-commenter
Copy link

codecov-commenter commented Feb 13, 2025

Codecov Report

Attention: Patch coverage is 2.34742% with 416 lines in your changes missing coverage. Please review.

Project coverage is 80.35%. Comparing base (4e57bc1) to head (79c7661).

Files with missing lines Patch % Lines
flipt-engine-wasm/src/lib.rs 0.00% 191 Missing ⚠️
flipt-engine-wasm-js/src/lib.rs 0.00% 115 Missing ⚠️
flipt-engine-ffi/src/lib.rs 0.00% 108 Missing ⚠️
flipt-engine-ffi/src/http.rs 83.33% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #719      +/-   ##
==========================================
- Coverage   85.35%   80.35%   -5.01%     
==========================================
  Files           7        8       +1     
  Lines        3927     4164     +237     
==========================================
- Hits         3352     3346       -6     
- Misses        575      818     +243     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Collaborator

@erka erka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks great and would be awesome to finish this up

@markphelps
Copy link
Contributor Author

markphelps commented Feb 16, 2025

So I updated the tests to build the static lib and put them in the right place like we were doing for the dynamic libs in the various SDKS (in /test/main.go).

Now we'll need to determine which languages can call into a static lib via FFI or some other mechanism. Go seems to work after I added one -l flag (-lm to link the math library). But the others are currently failing, although I havent made any changes to them so probably expected.

Current Status

✅ Go (required adding -lm flag)

❌ Ruby (it seems the FFI gem only works for dynamic libs ffi/ffi#225)

❌ Java (java.lang.UnsatisfiedLinkError at TestFliptEvaluationClient.java:25)

❌ Dart (trying to load dynamic lib.. Invalid argument(s): Failed to load dynamic library '/src/lib/src/ffi/linux_x86_64/libfliptengine.a': /src/lib/src/ffi/linux_x86_64/libfliptengine.a: invalid ELF header)

❌ Python (OSError: /src/flipt_client/../ext/linux_x86_64/libfliptengine.a: invalid ELF header)

❌ CSharp (/src/tests/FliptClient.Tests/bin/Debug/net8.0/runtimes/linux-x64/native/libfliptengine.a: invalid ELF header)

@markphelps
Copy link
Contributor Author

New Approach

Go back to building as a .so for Linux but on a musl (alpine) host... which should also work on glibc hosts 🤞🏻

@markphelps
Copy link
Contributor Author

markphelps commented Feb 16, 2025

Ok.. so now with that approach here are the results:

 === Test Results Summary ===
Successful tests:
✅ react
✅ ruby
✅ python
✅ dart
✅ go
✅ csharp
✅ browser

Failed tests:
❌ java: input: container.from.withWorkdir.withDirectory.withFile.withServiceBinding.withEnvVariable.withEnvVariable.withExec.sync resolve: process "gradle test" did not complete successfully: exit code: 1

In the tests we're compiling the .so on a linux-musl (alpine) host and then copying it to the test containers for each language which AFAIK are all debian based.. so this should prove that musl .sos can be linked and run properly on glibc hosts.

Now we just gotta fix the java one.. Looks like I forgot to rename .a to .so for the Java tests.. so hopefully this next run should fix it

@erka
Copy link
Collaborator

erka commented Feb 17, 2025

just wow

@markphelps
Copy link
Contributor Author

Ok.. so back to the drawing board it seems. In the previous commit(s) I was using rust-slim which is actually debian based to build the .so, then copying it to other debian based images for the tests.. so it was still linking against glibc :(

Now when I fixed this and run the Go tests on different platforms like: bookworm, bullseye, and alpine

I get this on the debian based hosts (which is to be expected I guess because we are still building a .so because of the problems with FFI not generally working for static libs as mentioned above):

/usr/local/go/pkg/tool/linux_arm64/link: running gcc failed: exit status 1
/usr/bin/gcc -s -o $WORK/b001/flipt-client.test -Wl,--export-dynamic-symbol=_cgo_panic -Wl,--export-dynamic-symbol=_cgo_topofstack -Wl,--export-dynamic-symbol=crosscall2 -Wl,--compress-debug-sections=zlib /tmp/go-link-613118912/go.o /tmp/go-link-613118912/000000.o /tmp/go-link-613118912/000001.o /tmp/go-link-613118912/000002.o /tmp/go-link-613118912/000003.o /tmp/go-link-613118912/000004.o /tmp/go-link-613118912/000005.o /tmp/go-link-613118912/000006.o /tmp/go-link-613118912/000007.o /tmp/go-link-613118912/000008.o /tmp/go-link-613118912/000009.o /tmp/go-link-613118912/000010.o /tmp/go-link-613118912/000011.o /tmp/go-link-613118912/000012.o /tmp/go-link-613118912/000013.o /tmp/go-link-613118912/000014.o /tmp/go-link-613118912/000015.o /tmp/go-link-613118912/000016.o /tmp/go-link-613118912/000017.o /tmp/go-link-613118912/000018.o /tmp/go-link-613118912/000019.o /tmp/go-link-613118912/000020.o /tmp/go-link-613118912/000021.o -O2 -g -L/src/ext/linux_arm64 -lfliptengine -lm -Wl,-rpath,/src/ext/linux_arm64 -O2 -g -lpthread -O2 -g -lresolv -no-pie
/usr/bin/ld: warning: libc.musl-aarch64.so.1, needed by /src/ext/linux_arm64/libfliptengine.so, not found (try using -rpath or -rpath-link)
/usr/bin/ld: $WORK/b001/flipt-client.test: hidden symbol `stat' in /usr/lib/aarch64-linux-gnu/libc_nonshared.a(stat.oS) is referenced by DSO
/usr/bin/ld: final link failed: bad value
collect2: error: ld returned 1 exit status
exit status 1

@markphelps
Copy link
Contributor Author

Phew I went down some rabbit holes..

Here's a summary of the key changes made to fix the musl/glibc compatibility issues:

In test/main.go:

  • Modified the Rust build container to use Alpine Linux
  • Added flags to build a dynamic library with musl libc: -C target-feature=-crt-static
  • Added symbolic binding optimizations: -Bsymbolic and -Bsymbolic-functions
  • Copied required musl runtime libraries (ld-musl-.so.1 and libc.musl-.so.1) alongside our libfliptengine.so
  • Used architecture-specific paths for ARM64/x86_64 support

In flipt-client-go/evaluation.go:

  • Simplified the cgo LDFLAGS to properly handle library and runtime paths

The key insight was that building with musl on Alpine allows the library to work on both musl and glibc systems, but we need to bundle the musl runtime libraries with our .so file. This approach avoids requiring users to install musl packages on their systems while maintaining compatibility across different libc implementations.

So.. now we need to test with the other languages.. but Go works on the various distros:

=== Test Results Summary ===

Successful tests:
✅ go 1.23-bullseye
✅ go 1.23-bookworm
✅ go 1.23-alpine

I suppose its not a big deal to copy the required musl libs ld-musl-*.so.1 and libc.musl-*.so.1 into our distributed clients alongside our own libfliptengine.so if it means now we only need to build one version of the lib for each OS/Arch instead of both a glibc & musl version? @erka, @GeorgeMac wdyt??

@erka
Copy link
Collaborator

erka commented Feb 17, 2025

I don't know about other distros but on Debian 13 /usr/lib/ld-musl-x86_64.so.1 is a symlink to /usr/lib/x86_64-linux-musl/libc.so.

@markphelps
Copy link
Contributor Author

So I spent pretty much the whole day yesterday trying to get this to work and Im thinking its not possible to create a musl .so and run it on both musl and glibc hosts reliably. I think the ABIs are too different and we'll run into problems of it working on one distro but not another because of various differences in versions of glibc..etc.

I think at this point we have three options:

  1. Try again to build a static lib (.a) on musl which should contain everything it needs to run on any host.. However FFI is incompatible with static libs so we can try to wrap the .a with a .so that simply delegates all calls to the .a. Maybe this will allow us to have the best of both worlds.. currently trying this.
  2. Abandon all this and just live with the fact that we'll need 2 versions for each SDK (musl and non-musl)
  3. Try to move all the SDKs to wasm.. which means we'll need to do the HTTP work in the 'host' language itself and only do the evaluation in wasm like we do with the JS ones

@markphelps
Copy link
Contributor Author

markphelps commented Feb 18, 2025

Update!

So it looks like option 1 does work at least on x86 & ARM Linux (via docker) in my initial testing. I likely broke Darwin and Windows support though because now we are creating only a static lib from Rust and not a dylib, and Darwin and Windows can work with dylibs just fine..

But if this passes CI for Go I'll try to get it working for the other languages 🤞🏻 on Linux and test on Debian and Alpine hosts for both x86_64 and arm64.

Then if we can get the other languages to work again we can go back and fix the differences between Linux and Windows/Darwin

@erka
Copy link
Collaborator

erka commented Feb 18, 2025

The staticlib is used in Android. I have "borrowed" part of your PR (it will be a wild merge) and finally get Android CI running.

@erka
Copy link
Collaborator

erka commented Feb 19, 2025

@markphelps could we keep rlib? Without it it's impossible to use the crate outside. The examples are failing because of this too.

@markphelps
Copy link
Contributor Author

markphelps commented Feb 19, 2025

@markphelps could we keep rlib? Without it it's impossible to use the crate outside. The examples are failing because of this too.

Ah yes. I can add that back. I think we'll also need dylib back as well for Darwin and Windows

* chore: split up package-ffi workflows by OS

Signed-off-by: Mark Phelps <[email protected]>

* chore: add back notify android test job

Signed-off-by: Mark Phelps <[email protected]>

* chore: revert back to dynamic lib for now

Signed-off-by: Mark Phelps <[email protected]>

---------

Signed-off-by: Mark Phelps <[email protected]>
Signed-off-by: Mark Phelps <[email protected]>
Signed-off-by: Mark Phelps <[email protected]>
Signed-off-by: Mark Phelps <[email protected]>
Signed-off-by: Mark Phelps <[email protected]>
…-client-sdks into static-musl-build

* 'static-musl-build' of https://github.com/flipt-io/flipt-client-sdks:
  build(deps): bump androidx.test.ext:junit (#710)
  build(deps): bump com.android.tools.build:gradle (#736)
  build(deps-dev): bump @types/node in /flipt-client-node (#735)
  build(deps-dev): bump rollup in /flipt-client-browser (#733)
  build(deps-dev): bump prettier from 3.4.2 to 3.5.1 in /flipt-client-node (#734)
  build(deps-dev): bump rollup in /flipt-client-react (#732)
  build(deps-dev): bump globals in /flipt-client-react (#731)
  build(deps-dev): bump @typescript-eslint/parser in /flipt-client-react (#728)
  build(deps-dev): bump @typescript-eslint/eslint-plugin (#730)
  build(deps-dev): bump @types/react in /flipt-client-react (#729)
  build(deps): bump flipt-io/setup-action from 0.2.0 to 0.3.1 (#727)
  chore: split up package-ffi workflows by OS (#724)
@markphelps
Copy link
Contributor Author

markphelps commented Feb 19, 2025

Ok so I'm consistently getting segfaults when trying to run the Go tests.

All the other SDKs are passing except Go: https://github.com/flipt-io/flipt-client-sdks/actions/runs/13422162445

Im assuming because they are using FFI libs and Go is using Cgo which tries to do the linking itself (i think?)

I spun up a Linux (Ubuntu) VM and went through the steps to build the .so and try to run the tests using gdb to debug the segfault:

# this builds the static lib (.a)
cargo build -p flipt-engine-ffi --release
# next we copy it to the flipt-engine-ffi dir where the build.sh script is which creates our .so using musl-gcc
cp target/release/libfliptengine.a flipt-engine-ffi
cd flipt-engine-ffi
./build.sh
# then we move the .so and .h files to the places that the Go SDK expects them (see the existing CGO directives)
cd ..
mkdir -p flipt-client-go/ext/linux_aarch64
cp flipt-engine-ffi/libfliptengine.so flipt-client-go/ext/linux_aarch64/
cp flipt-engine-ffi/include/flipt_engine.h flipt-client-go/ext/
# finally we build our test with debugging symbols added
go test -c -gcflags="all=-N -l"
# and run the compiled test with gdb
gdb ./flipt-client.test

within gdb:

(gdb) run -test.v
Starting program: /home/mark/workspace/flipt-client-sdks/flipt-client-go/flipt-client.test -test.v
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
0x0000fffff7d9b900 in getauxval () from /home/mark/workspace/flipt-client-sdks/flipt-client-go/ext/linux_aarch64/libfliptengine.so
(gdb) bt
#0  0x0000fffff7d9b900 in getauxval () from /home/mark/workspace/flipt-client-sdks/flipt-client-go/ext/linux_aarch64/libfliptengine.so
#1  0x0000fffff79efdec in init_have_lse_atomics () at ./lib/builtins/cpu_model/aarch64/lse_atomics/sysauxv.inc:4
#2  0x0000fffff7fc7624 in call_init (env=0xfffffffff080, argv=0xfffffffff068, argc=2, l=<optimized out>) at ./elf/dl-init.c:70
#3  call_init (l=<optimized out>, argc=2, argv=0xfffffffff068, env=0xfffffffff080) at ./elf/dl-init.c:26
#4  0x0000fffff7fc772c in _dl_init (main_map=0xfffff7fff370, argc=2, argv=0xfffffffff068, env=0xfffffffff080) at ./elf/dl-init.c:117
#5  0x0000fffff7fd9cc8 in _dl_start_user () from /lib/ld-linux-aarch64.so.1
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

so we can see the crash is coming from getauxval() call

I found this Go issue from 2017: golang/go#13492

Currently reading through it to see if there's any solution.. but I wanted to post my steps to reproduce here incase anyone else had any idea / could reproduce

cc @erka

@markphelps
Copy link
Contributor Author

After doing some more digging I don't think it's gonna work for Go.. seems that segfault is a known issue after reading the above linked GitHub issue and won't be fixed any time soon.

Also seems that cgo doesn't officially support any c std lib other than glibc..

Anyways I think we'll keep this as is for the other languages that use FFI as it seems to work quite well.

I think next steps for go cross compatibility is to compile the evaluator to wasm and use something like wazero to call it from Go

…-client-sdks into static-musl-build

* 'static-musl-build' of https://github.com/flipt-io/flipt-client-sdks:
  feat(java): added error strategy option (#739)
  feat(ruby): added error strategy option (#740)
* main:
  Release flipt-client-dart-v0.5.0
  feat(dart): added error strategy option (#742)
  Release flipt-client-java-musl-v0.6.0
  Release flipt-client-java-v0.14.0
  Release flipt-client-python-v0.16.0
  Release flipt-client-ruby-v0.15.0
  feat(python): added error strategy option (#741)
@markphelps markphelps mentioned this pull request Feb 24, 2025
8 tasks
@markphelps markphelps enabled auto-merge (squash) February 27, 2025 15:01
@markphelps markphelps merged commit 0beeab1 into main Feb 27, 2025
25 checks passed
@markphelps markphelps deleted the static-musl-build branch February 27, 2025 16:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants